### Step 1: Download data from Azure Blob Storage

This code downloads the Microsoft earnings call transcripts from Azure Blob Storage. If you do not have access to the blob storage, download the Microsoft earnings call transcripts for all four quarters of fiscal year 2023 manually and put them in the "DATA" folder. Rename each file to follow the pattern "MSFTTranscriptFY23Q4.docx".

Microsoft earnings call transcript for FY2023 Q4:
https://www.fool.com/earnings/call-transcripts/2023/07/25/microsoft-msft-q4-2023-earnings-call-transcript/
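If the transcripts are placed in the folder by hand, a quick check that all four files are present and named as expected can save a confusing failure later. The following is a minimal sketch, assuming the four files follow the `MSFTTranscriptFY23Q<n>.docx` pattern described above; the helper name `missing_transcripts` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

# Expected file names, assuming the naming pattern described above (Q1 to Q4).
EXPECTED_FILES = [f"MSFTTranscriptFY23Q{q}.docx" for q in range(1, 5)]


def missing_transcripts(data_dir):
    """Return the expected transcript names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (data_dir / name).is_file()]
```

Calling `missing_transcripts("DATA")` returns an empty list when all four transcripts are in place, and otherwise lists the file names that still need to be downloaded or renamed.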



In [4]:
from azure.storage.blob import BlobServiceClient
import os
from pathlib import Path

# Name of the container in the Blob Storage
container_name = "public"

# Local directory path to save the downloaded files
local_directory = Path("DATA/")

def download_files_from_blob_storage(container_name, local_directory):
    # Make sure the local target directory exists before writing files into it
    Path(local_directory).mkdir(parents=True, exist_ok=True)

    # Create a BlobServiceClient without credentials (anonymous access to a public container)
    blob_service_client = BlobServiceClient.from_connection_string("DefaultEndpointsProtocol=https;AccountName=appliedaipublicdata;EndpointSuffix=core.windows.net")

    # Get a reference to the container
    container_client = blob_service_client.get_container_client(container_name)

    # List all blobs in the container
    blob_list = container_client.list_blobs()

    for blob in blob_list:
        blob_name = blob.name
        print(blob_name)
        
        # Check if the blob has a .docx extension (Word document)
        if blob_name.lower().endswith(".docx"):
            blob_client = container_client.get_blob_client(blob_name)
            
            # Construct the local file path to save the blob
            local_file_path = os.path.join(local_directory, blob_name.split("/")[-1])  # Use only the last part of the blob path
            
            # Download the blob to the local directory
            with open(local_file_path, "wb") as local_file:
                blob_data = blob_client.download_blob()
                local_file.write(blob_data.readall())
            
            print(f"Downloaded: {blob_name}")


download_files_from_blob_storage(container_name, local_directory)


MicrosoftEarningReports/MSFTTranscriptFY23Q1.docx
Downloaded: MicrosoftEarningReports/MSFTTranscriptFY23Q1.docx
MicrosoftEarningReports/MSFTTranscriptFY23Q1.pdf
MicrosoftEarningReports/MSFTTranscriptFY23Q2.docx
Downloaded: MicrosoftEarningReports/MSFTTranscriptFY23Q2.docx
MicrosoftEarningReports/MSFTTranscriptFY23Q2.pdf
MicrosoftEarningReports/MSFTTranscriptFY23Q3.docx
Downloaded: MicrosoftEarningReports/MSFTTranscriptFY23Q3.docx
MicrosoftEarningReports/MSFTTranscriptFY23Q3.pdf
MicrosoftEarningReports/MSFTTranscriptFY23Q4.docx
Downloaded: MicrosoftEarningReports/MSFTTranscriptFY23Q4.docx
MicrosoftEarningReports/MSFTTranscriptFY23Q4.pdf
MicrosoftEarningReports/README.md
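The output above shows that the container also holds `.pdf` files and a `README.md`, and that only the `.docx` blobs are downloaded, each saved under its base name. That filtering and path-mapping step can be isolated into a small pure function, which makes it easy to check without touching the network. This is a sketch only; `local_path_for_blob` is a hypothetical helper, not something the original code defines.

```python
from pathlib import Path, PurePosixPath


def local_path_for_blob(blob_name, local_directory, suffix=".docx"):
    """Return the local path for a blob, or None if it lacks the wanted suffix."""
    # Blob names use "/" as a separator regardless of the local OS,
    # so PurePosixPath extracts the final component reliably.
    name = PurePosixPath(blob_name).name
    if not name.lower().endswith(suffix):
        return None
    return Path(local_directory) / name
```

For example, `local_path_for_blob("MicrosoftEarningReports/MSFTTranscriptFY23Q1.docx", "DATA")` yields `DATA/MSFTTranscriptFY23Q1.docx`, while the `.pdf` and `README.md` entries map to `None` and would be skipped.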


### Step 2: Convert .docx to .pdf format

In [5]:
from docx2pdf import convert
import os
from pathlib import Path

directory = Path('DATA')
docx_files = [filename for filename in os.listdir(directory) if filename.endswith('.docx')]
name_len_docx = []
name_len_pdf = []
print(len(docx_files))

4




In [6]:
for filename in docx_files:
    
    docx_path = os.path.join(directory, filename)
    # if len(filename)>35:
    #     filename = filename[:35]
    pdf_path = os.path.join(directory, f"{os.path.splitext(filename)[0]}.pdf")

    # Check if PDF already exists
    if os.path.exists(pdf_path):
        print(f"Skipping conversion for {filename}. PDF already exists.")
        continue

    name_len_docx.append(len(docx_path))
    print(filename, name_len_docx)
    name_len_pdf.append(len(pdf_path))
    print(name_len_pdf)
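    try:
        convert(docx_path, pdf_path)
    except Exception:
        print('Error in converting file, retrying...')
        try:
            convert(docx_path, pdf_path)
        except Exception as err:
            # Raise the failure instead of silently discarding it
            raise RuntimeError(f"Error in converting file {filename}") from err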


MSFTTranscriptFY23Q1.docx [30]
[29]


100%|██████████| 1/1 [00:09<00:00,  9.19s/it]


MSFTTranscriptFY23Q2.docx [30, 30]
[29, 29]


100%|██████████| 1/1 [00:04<00:00,  4.07s/it]


MSFTTranscriptFY23Q3.docx [30, 30, 30]
[29, 29, 29]


  0%|          | 0/1 [00:00<?, ?it/s]

Error in converting file, retrying...


100%|██████████| 1/1 [00:03<00:00,  3.64s/it]
  0%|          | 0/1 [00:06<?, ?it/s]


MSFTTranscriptFY23Q4.docx [30, 30, 30, 30]
[29, 29, 29, 29]


  0%|          | 0/1 [00:00<?, ?it/s]

Error in converting file, retrying...


100%|██████████| 1/1 [00:04<00:00,  4.08s/it]
  0%|          | 0/1 [00:06<?, ?it/s]
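In the run above, the first conversion attempt failed for the Q3 and Q4 files and the retry succeeded. The nested `try`/`except` can be generalized into a small retry helper. The sketch below is one possible shape, not the notebook's own method: `convert_with_retry`, `attempts`, and `delay_seconds` are hypothetical names, and the converter is passed in as a callable so the helper does not depend on any particular library.

```python
import time


def convert_with_retry(convert_fn, src, dst, attempts=2, delay_seconds=1.0):
    """Call convert_fn(src, dst), retrying on failure up to `attempts` times.

    Returns the number of attempts that were needed.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            convert_fn(src, dst)
            return attempt
        except Exception as err:
            last_error = err
            print(f"Attempt {attempt} failed for {src}: {err}")
            if attempt < attempts:
                # Brief pause before the next attempt
                time.sleep(delay_seconds)
    raise RuntimeError(f"Could not convert {src} after {attempts} attempts") from last_error
```

Inside the loop over `docx_files`, the call would then read `convert_with_retry(convert, docx_path, pdf_path)`, with `convert` being the function imported from `docx2pdf`.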
