# Notebook for experimenting with chunk sizes
This notebook is structured so that you can adapt the code to your own scenario and tune it for different combinations of chunking parameters.
The values we focus on here are the chunk size (in tokens) and the overlap between consecutive chunks (also in tokens). Because these values are used throughout the code, we recommend setting them at the beginning of the notebook, as part of Step 0.
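
To get a feel for how these two parameters interact, the sketch below estimates how many chunks a document produces under an idealized sliding window. The function name `estimate_chunk_count` and the token counts are illustrative assumptions and not part of the notebook. The notebook's own chunker works line by line and approximates tokens by characters, so its counts will differ somewhat.

```python
import math


def estimate_chunk_count(total_tokens: int, chunk_size: int, chunk_overlap: int) -> int:
    """Estimate the number of chunks for an idealized sliding window (hypothetical helper)."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if total_tokens <= chunk_size:
        return 1
    # Each new chunk advances by the chunk size minus the overlap
    stride = chunk_size - chunk_overlap
    return 1 + math.ceil((total_tokens - chunk_size) / stride)


# A 10,000-token document with chunk_size=2048 and chunk_overlap=10
print(estimate_chunk_count(10_000, 2048, 10))
```

A larger overlap shrinks the stride, so the same document yields more chunks and therefore more embedding calls and index entries.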

Run the notebook starting from Step 0. If you have already completed any of Steps 1 to 4 in a previous run, you can skip them. This saves time when your data volume takes a long time to process.

### Step 0. Initialization of the workspace

This is the part where we read configuration and import modules that will be needed throughout the notebook. 

In [225]:
from dotenv import dotenv_values
import json
from pathlib import Path
import os
import pandas as pd
import re
import numpy as np
import time 
from azure.core.credentials import AzureKeyCredential

# Path to the .env file that holds keys and endpoints
env_name = "../../llm.env" # change to your own .env file
config = dotenv_values(env_name)

Next, we configure the experiment via four variables. For guidance on picking reasonable values for your scenario, see aka.ms/chunk_size.

In [226]:
dataset_name = "mft"
chunk_size=2048
chunk_overlap=10
top_n=2

Below we print a "nickname" for the experiment, derived from the parameters configured above. It is used in the rest of the notebook as part of filenames, index names, and so on, so you can tell at a glance which parameters a run used (here: the "Microsoft financial transcripts" dataset, the top 2 retrieved chunks, a chunk size of 2048 tokens, and an overlap of 10 tokens between consecutive chunks). Keeping this name is useful for tracking and benchmarking.

In [227]:
experiment_name = f"{dataset_name}_top{top_n}n_s{chunk_size}t_o{chunk_overlap}t"
print(experiment_name)

mft_top2n_s2048t_o10t
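
When comparing several runs, it can help to recover the parameters from a nickname, for example from a folder or index name. The sketch below is one way to invert the f-string above; `parse_experiment_name` is a hypothetical helper, not part of the notebook.

```python
import re

# Mirrors the f-string "{dataset_name}_top{top_n}n_s{chunk_size}t_o{chunk_overlap}t"
_EXPERIMENT_PATTERN = re.compile(
    r"^(?P<dataset_name>.+)_top(?P<top_n>\d+)n_s(?P<chunk_size>\d+)t_o(?P<chunk_overlap>\d+)t$"
)


def parse_experiment_name(name: str) -> dict:
    """Split an experiment nickname back into its parameters (hypothetical helper)."""
    match = _EXPERIMENT_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Not a recognized experiment name: {name!r}")
    return {
        "dataset_name": match["dataset_name"],
        "top_n": int(match["top_n"]),
        "chunk_size": int(match["chunk_size"]),
        "chunk_overlap": int(match["chunk_overlap"]),
    }


print(parse_experiment_name("mft_top2n_s2048t_o10t"))
```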


### Step 1. Download and transform data

The purpose of this step is to collect the data from wherever it lives into one place, then transform it into text augmented with positional information from the source. The positional information is optional and can be removed if citation capability is not important for your scenario.

Here we download the Microsoft (MSFT) financial transcripts from Azure Blob Storage and save them locally. If you do not have access to the blob storage, download the "Microsoft Earnings Call Transcripts" for the four quarters of fiscal year 2023 and put them in the "DATA" folder. Name the files following the pattern "MSFTTranscriptFY23Q4.docx".

MSFT earnings call transcript for FY23 Q4:
https://www.fool.com/earnings/call-transcripts/2023/07/25/microsoft-msft-q4-2023-earnings-call-transcript/
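
If you place the files manually, a quick check of the names can catch a typo before the later steps silently skip a quarter. This is a minimal sketch; `missing_transcripts` is a hypothetical helper that assumes the naming pattern described above.

```python
def missing_transcripts(filenames, ticker="MSFT", fiscal_year="23"):
    """Return the expected DOCX names that are absent from `filenames` (hypothetical helper)."""
    expected = [f"{ticker}TranscriptFY{fiscal_year}Q{quarter}.docx" for quarter in range(1, 5)]
    present = set(filenames)
    return [name for name in expected if name not in present]


# Example with two of the four quarters in place
print(missing_transcripts(["MSFTTranscriptFY23Q1.docx", "MSFTTranscriptFY23Q4.docx"]))
```

In the notebook you would pass the folder listing instead, for example `missing_transcripts(os.listdir("DATA"))`.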


In [228]:
from azure.storage.blob import BlobServiceClient
import os
from pathlib import Path

# Name of the container in the Blob Storage
container_name = "public"

# Local directory path to save the downloaded files
local_directory = Path("DATA/")

def download_files_from_blob_storage(container_name, local_directory):
    # Make sure the target folder exists before writing into it
    Path(local_directory).mkdir(parents=True, exist_ok=True)

    # Create a BlobServiceClient for the public storage account (anonymous access, no key)
    blob_service_client = BlobServiceClient.from_connection_string("DefaultEndpointsProtocol=https;AccountName=appliedaipublicdata;EndpointSuffix=core.windows.net")

    # Get a reference to the container
    container_client = blob_service_client.get_container_client(container_name)

    # List all blobs in the container
    blob_list = container_client.list_blobs()

    for blob in blob_list:
        blob_name = blob.name
        print(blob_name)
        
        # Check if the blob has a .docx extension (Word document)
        if blob_name.lower().endswith(".docx"):
            blob_client = container_client.get_blob_client(blob_name)
            
            # Construct the local file path to save the blob
            local_file_path = os.path.join(local_directory, blob_name.split("/")[-1])  # Use only the last part of the blob path
            
            # Download the blob to the local directory
            with open(local_file_path, "wb") as local_file:
                blob_data = blob_client.download_blob()
                local_file.write(blob_data.readall())
            
            print(f"Downloaded: {blob_name}")


download_files_from_blob_storage(container_name, local_directory)

MicrosoftEarningReports/MSFTTranscriptFY23Q1.docx
Downloaded: MicrosoftEarningReports/MSFTTranscriptFY23Q1.docx
MicrosoftEarningReports/MSFTTranscriptFY23Q1.pdf
MicrosoftEarningReports/MSFTTranscriptFY23Q2.docx
Downloaded: MicrosoftEarningReports/MSFTTranscriptFY23Q2.docx
MicrosoftEarningReports/MSFTTranscriptFY23Q2.pdf
MicrosoftEarningReports/MSFTTranscriptFY23Q3.docx
Downloaded: MicrosoftEarningReports/MSFTTranscriptFY23Q3.docx
MicrosoftEarningReports/MSFTTranscriptFY23Q3.pdf
MicrosoftEarningReports/MSFTTranscriptFY23Q4.docx
Downloaded: MicrosoftEarningReports/MSFTTranscriptFY23Q4.docx
MicrosoftEarningReports/MSFTTranscriptFY23Q4.pdf
MicrosoftEarningReports/README.md


The downloaded data is in DOCX format. Below we convert it to PDF as an intermediate step before extracting text. Note that docx2pdf relies on Microsoft Word being installed, so this code runs on Windows or macOS; on Linux you need another way to convert DOCX to PDF.

In [229]:
from docx2pdf import convert
import os

directory = Path('DATA')
docx_files = [filename for filename in os.listdir(directory) if filename.endswith('.docx')]
name_len_docx = []
name_len_pdf = []
print(f"File count: {len(docx_files)}")

for filename in docx_files:
    
    docx_path = os.path.join(directory, filename)
    # if len(filename)>35:
    #     filename = filename[:35]
    pdf_path = os.path.join(directory, f"{os.path.splitext(filename)[0]}.pdf")

    # Check if PDF already exists
    if os.path.exists(pdf_path):
        print(f"Skipping conversion for {filename}. PDF already exists.")
        continue

    name_len_docx.append(len(docx_path))
    print(filename, name_len_docx)
    name_len_pdf.append(len(pdf_path))
    print(name_len_pdf)
    try: 
        convert(docx_path, pdf_path)
    except  Exception as e: print(e)

File count: 7
~$FTTranscriptFY23Q1.docx [30]
[29]
docx2pdf is not implemented for linux as it requires Microsoft Word to be installed
Skipping conversion for MSFTTranscriptFY23Q1.docx. PDF already exists.
~$FTTranscriptFY23Q4.docx [30, 30]
[29, 29]
docx2pdf is not implemented for linux as it requires Microsoft Word to be installed
~$FTTranscriptFY23Q2.docx [30, 30, 30]
[29, 29, 29]
docx2pdf is not implemented for linux as it requires Microsoft Word to be installed
Skipping conversion for MSFTTranscriptFY23Q3.docx. PDF already exists.
Skipping conversion for MSFTTranscriptFY23Q2.docx. PDF already exists.
Skipping conversion for MSFTTranscriptFY23Q4.docx. PDF already exists.
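
The log above shows two things worth handling: docx2pdf refuses to run on Linux, and the folder listing picks up `~$...` files, which are temporary owner files that Word creates while a document is open. The sketch below is one possible workaround, assuming LibreOffice is installed and its `soffice` binary is on the PATH. The helper names are hypothetical and not part of the notebook.

```python
import subprocess
from pathlib import Path


def is_word_lock_file(filename: str) -> bool:
    """Word creates '~$...' owner files next to open documents; they are not real documents."""
    return Path(filename).name.startswith("~$")


def libreoffice_command(docx_path, out_dir):
    """Build the headless LibreOffice command that converts one DOCX file to PDF."""
    return ["soffice", "--headless", "--convert-to", "pdf", "--outdir", str(out_dir), str(docx_path)]


def convert_with_libreoffice(docx_path, out_dir):
    # Assumption: LibreOffice is installed and `soffice` is available on PATH
    subprocess.run(libreoffice_command(docx_path, out_dir), check=True)


print(libreoffice_command("DATA/MSFTTranscriptFY23Q1.docx", "DATA"))
```

Filtering the listing with `is_word_lock_file` before converting avoids attempting a conversion on files that are not transcripts.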


Finally, convert the PDF files into text with positional metadata and save the intermediate results locally as JSON files. This way you don't have to rerun the notebook from scratch every time; you can resume from the step where you left off.

This code sample shows Prebuilt Document operations with the Azure Form Recognizer client library. 
The async versions of the samples require Python 3.6 or later.

To learn more, please visit the documentation - Quickstart: Form Recognizer Python client library SDKs
https://docs.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/quickstarts/try-v3-python-sdk

In [230]:
# Extract line number and page number
def create_tuples_with_position_metadata(result, file_name):
    '''
    Input: result of form recognizer analyze_pdf function
    Output: Create list of tuples of the form (line, position) 
    This will keep reference of the line number and page number of each line in the document.
    '''
    line_page_tuples = []

    total_pages = len(result.pages)
    for page_num in range(total_pages):
        lines = result.pages[page_num].lines
        total_lines = len(lines)

        for line_num in range(total_lines):
            line = lines[line_num].content
            line_page_tuples.append((line, f"{file_name}, P.{page_num + 1}, line {line_num + 1}"))

    return line_page_tuples
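
The position string built above is what later makes citations possible. If you need the page and line as numbers again (for example, to sort or link results), a small parser can split it back apart. This is a sketch; `parse_position` is a hypothetical helper that assumes the exact `"{file_name}, P.{page}, line {line}"` format used in the function above.

```python
import re

# Matches strings such as "MSFTTranscriptFY23Q2.pdf, P.6, line 5"
_POSITION_PATTERN = re.compile(r"^(?P<file>.+), P\.(?P<page>\d+), line (?P<line>\d+)$")


def parse_position(position: str):
    """Split a position string into (file_name, page_number, line_number)."""
    match = _POSITION_PATTERN.match(position)
    if match is None:
        raise ValueError(f"Unrecognized position string: {position!r}")
    return match["file"], int(match["page"]), int(match["line"])


print(parse_position("MSFTTranscriptFY23Q2.pdf, P.6, line 5"))
```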

In [231]:
from azure.ai.formrecognizer import DocumentAnalysisClient

endpoint = config["AZURE_FORM_RECOGNIZER_ENDPOINT"]
key = config["AZURE_FORM_RECOGNIZER_KEY"]

# Read pdf files
def analyze_pdf(doc_path):  
    with open(doc_path, "rb") as f:
        poller = document_analysis_client.begin_analyze_document(
            "prebuilt-document", document=f
        )
    result = poller.result()
                
    return result

# Define document analysis client
document_analysis_client = DocumentAnalysisClient(
        endpoint=endpoint, credential=AzureKeyCredential(key)
    )

doc_dir = Path("DATA/")
pdf_files = [filename for filename in os.listdir(doc_dir) if filename.endswith('.pdf')]

for file_name in pdf_files:
     
    file_path_in = os.path.join(doc_dir, f"{os.path.splitext(file_name)[0]}.pdf")
    
    # analyze the pdf using form recognizer
    result = analyze_pdf(file_path_in)
    
    line_position_tuples = create_tuples_with_position_metadata(result, file_name)

    # Save results as JSON files, one per document
    os.makedirs(f"AnalyzedPDF/{experiment_name}", exist_ok=True)

    print('writing the results of: \n' + file_name)  
    file_path_out=f"AnalyzedPDF/{experiment_name}/{file_name[0:-4]}.json"
    if not os.path.exists(file_path_out):
         # Convert and write JSON object to file
        with open(file_path_out, "w") as outfile: 
            json.dump(line_position_tuples, outfile)
    else:
        print(f'{file_path_out} already exists, skipping...')

writing the results of: 
MSFTTranscriptFY23Q3.pdf
writing the results of: 
MSFTTranscriptFY23Q2.pdf
writing the results of: 
MSFTTranscriptFY23Q4.pdf
writing the results of: 
MSFTTranscriptFY23Q1.pdf


### Step 2. Preprocessing: Chunking, Embedding, and Filtering

In this stage we take the text segments with positional references that were generated in Step 1 and saved as JSON, and chunk them according to the chunk size and overlap set at the beginning of this notebook. We also attach metadata (ticker, year, quarter) that is used for filtering at a later stage. If filtering is not needed, this code can be removed or simplified (for example, by storing a dummy value instead of actual filter values).

In [232]:

# Extract stock symbol, year, and quarter from filename
def extract_info_from_filename(filename):
    '''
    Input: filename ("MSFTTranscriptFY23Q4")
    Output: Extract stock symbol, year and quarter from filename
    '''
    pattern = r'([A-Z]+)TranscriptFY(\d{2})Q(\d)'
    match = re.search(pattern, filename)
    
    if match:
        symbol = match.group(1)
        fiscal_year = match.group(2)
        fiscal_quarter = match.group(3)
        return symbol, fiscal_year, fiscal_quarter
    else:
        return None
    
# Chunk by token count, approximating one token as 4 characters
def chunk_in_tokens(lines_with_position_metadata, chunk_length, chunk_overlap_length):
    return chunk_in_chars(lines_with_position_metadata, chunk_length*4, chunk_overlap_length*4)

def chunk_in_chars(lines_with_position_metadata, chunk_length, chunk_overlap_length):
    '''
    Input: list of (line, position) tuples, chunk length and overlap in characters
    Output: list of (chunk, position) tuples; position is that of the last line added to the chunk
    '''
    chunks = []
    current_chunk = ""
    for line, position in lines_with_position_metadata:
        # A single line longer than the chunk length is split on its own
        while len(line) > chunk_length:
            chunks.append((line[:chunk_length], position))
            line = line[chunk_length - chunk_overlap_length:]
        current_chunk = current_chunk + line + " "
        if len(current_chunk) >= chunk_length:
            chunks.append((current_chunk[:chunk_length], position))
            # Keep the tail of the buffer so that consecutive chunks overlap
            current_chunk = current_chunk[chunk_length - chunk_overlap_length:]
        else:
            current_chunk += " "
    # Flush whatever is left as the final, shorter chunk
    if current_chunk:
        chunks.append((current_chunk, position))
    return chunks
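
To see the overlap mechanism in isolation, here is a simplified sliding window over a plain string. It is a sketch for illustration only: `sliding_window_chunks` is a hypothetical function that ignores lines and position metadata, so it is not a drop-in replacement for `chunk_in_chars`.

```python
def sliding_window_chunks(text: str, chunk_length: int, overlap: int) -> list:
    """Cut `text` into chunks of `chunk_length` characters that share `overlap` characters."""
    if not 0 <= overlap < chunk_length:
        raise ValueError("overlap must be >= 0 and smaller than chunk_length")
    stride = chunk_length - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_length])
        # Stop once the current chunk reaches the end of the text
        if start + chunk_length >= len(text):
            break
        start += stride
    return chunks


chunks = sliding_window_chunks("abcdefghij", chunk_length=4, overlap=1)
print(chunks)
```

With a chunk length of 4 and an overlap of 1, each chunk starts with the last character of the previous one, which is the property the overlap parameter is meant to provide: text cut at a chunk boundary still appears intact in one of the neighbors.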


In [233]:
doc_dir = Path(f"AnalyzedPDF/{experiment_name}") 
files = [filename for filename in os.listdir(doc_dir) if filename.endswith('.json')]

dfs = []

for file_name in files:
    
    file_path = os.path.join(doc_dir, f"{os.path.splitext(file_name)[0]}.json")
    # Load the list of (line, position) pairs saved in Step 1
    with open(file_path) as f:
        data = json.load(f)

    # get the chunks as tuples of the form (chunk, position)
    chunks = chunk_in_tokens(data, chunk_size, chunk_overlap)   

    # Write results to dataframe 
    df_chunks = pd.DataFrame(chunks, columns = ['Chunk', 'Position'])  

    # todo: one field called 'Filter' instead of individual fields
    df_chunks["Ticker"], df_chunks["Year"], df_chunks["Quarter"]  = "NULL", "NULL", "NULL"

    values = extract_info_from_filename(file_name) # symbol, fiscal_year, fiscal_quarter
   
    if values:
        symbol, fiscal_year, fiscal_quarter = values
        df_chunks["Ticker"], df_chunks["Year"], df_chunks["Quarter"]  = symbol, fiscal_year, fiscal_quarter
        
    # Reorder dataframe column name
    new_column_order = ['Ticker', 'Year', 'Quarter', 'Chunk', 'Position']
    df_chunks = df_chunks[new_column_order]
        
    # Add each dataframe to the list
    dfs.append(df_chunks)
 

    print('writing the results of: \n' + file_name)  
    file_path=f"AnalyzedPDF/{experiment_name}/Chunks_{experiment_name}_{os.path.splitext(file_name)[0]}.csv"
    if not os.path.exists(file_path):
        df_chunks.to_csv(file_path, index=False)
    else:
        print(f'{file_path} already exists, skipping...')
        
## Combine all the files
df = pd.concat(dfs, ignore_index=True)
df = df.reset_index(drop=True)
df.insert(0, 'Id', [i for i in range(1, df.shape[0]+1)]) # Add 'Id' column

## Save to csv
df.to_csv(f"AnalyzedPDF/{experiment_name}/Chunks_{experiment_name}.csv", index=False)

df.head()


writing the results of: 
MSFTTranscriptFY23Q2.json
writing the results of: 
MSFTTranscriptFY23Q3.json
writing the results of: 
MSFTTranscriptFY23Q1.json
writing the results of: 
MSFTTranscriptFY23Q4.json


Unnamed: 0,Id,Ticker,Year,Quarter,Chunk,Position
0,1,MSFT,23,2,Microsoft FY23 Second Quarter Earnings Confere...,"MSFTTranscriptFY23Q2.pdf, P.6, line 5"
1,2,MSFT,23,2,mizing its operations. Investec is closing dea...,"MSFTTranscriptFY23Q2.pdf, P.11, line 14"
2,3,MSFT,23,2,Xbox Game Studios. And we'll be sharing detai...,"MSFTTranscriptFY23Q2.pdf, P.16, line 4"
3,4,MSFT,23,2,ross margin percentage declined roughly 3 poi...,"MSFTTranscriptFY23Q2.pdf, P.20, line 17"
4,5,MSFT,23,2,"roducts and cloud services, on a strong prior ...","MSFTTranscriptFY23Q2.pdf, P.25, line 8"


#### Load the chunks and create embedding
In this section, we load the chunks into a pandas dataframe, select the column to embed, and create vector embeddings using Azure OpenAI.

In [234]:
from openai import AzureOpenAI
from tenacity import retry, wait_random_exponential, stop_after_attempt  

# Azure OpenAI client (openai>=1.0 style)
client = AzureOpenAI(
  api_key =  config["OPENAI_API_KEY"],  
  api_version = "2023-05-15",
  azure_endpoint = config["OPENAI_API_BASE"]
)
 
# With Azure OpenAI, `model` is the name of your embedding deployment
model: str = "text-embedding-ada-002" 

# Function to generate embeddings for chunk text, also used for query embeddings.
# Retries with random exponential backoff (up to 6 attempts) to ride out transient errors.
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
def createEmbeddings(text, model=model):
    return client.embeddings.create(input = [text], model=model).data[0].embedding


df = pd.read_csv(f"AnalyzedPDF/{experiment_name}/Chunks_{experiment_name}.csv")


# Create a new column called 'embedding' in the DataFrame
df['Embedding'] = np.empty((len(df),), dtype=object)

# Iterate over each row in the DataFrame and assign the concatenation and embeddings
for index, row in df.iterrows():
    text = row['Chunk']
    
    # Concatenate the desired columns
    concat_text = f"{text}"
    
    # Create embeddings using the provided function
    embeddings = createEmbeddings(concat_text)
    #print(embeddings)
    
    # Assign the embeddings to the 'embedding' column
    df.at[index, 'Embedding'] = embeddings
    time.sleep(0.1)


In [235]:
df.head(10)

Unnamed: 0,Id,Ticker,Year,Quarter,Chunk,Position,Embedding
0,1,MSFT,23,2,Microsoft FY23 Second Quarter Earnings Confere...,"MSFTTranscriptFY23Q2.pdf, P.6, line 5","[-0.02180245891213417, -0.013332000002264977, ..."
1,2,MSFT,23,2,mizing its operations. Investec is closing dea...,"MSFTTranscriptFY23Q2.pdf, P.11, line 14","[-0.009237000718712807, -0.017168618738651276,..."
2,3,MSFT,23,2,Xbox Game Studios. And we'll be sharing detai...,"MSFTTranscriptFY23Q2.pdf, P.16, line 4","[0.004660172387957573, -0.0066885207779705524,..."
3,4,MSFT,23,2,ross margin percentage declined roughly 3 poi...,"MSFTTranscriptFY23Q2.pdf, P.20, line 17","[0.004314188379794359, -0.014822079800069332, ..."
4,5,MSFT,23,2,"roducts and cloud services, on a strong prior ...","MSFTTranscriptFY23Q2.pdf, P.25, line 8","[-0.0022679090034216642, -0.015557303093373775..."
5,6,MSFT,23,2,"g it, and then also being a bit more cautious...","MSFTTranscriptFY23Q2.pdf, P.29, line 17","[0.005065751262009144, -0.02279762178659439, 0..."
6,7,MSFT,23,2,"d about the place that puts us in, as we exit...","MSFTTranscriptFY23Q2.pdf, P.32, line 19","[-0.0041900137439370155, -0.02702300250530243,..."
7,8,MSFT,23,3,Microsoft FY23 Third Quarter Earnings Conferen...,"MSFTTranscriptFY23Q3.pdf, P.8, line 7","[-0.020944802090525627, -0.015942450612783432,..."
8,9,MSFT,23,3,providers. We showcased the first fully AI-a...,"MSFTTranscriptFY23Q3.pdf, P.15, line 8","[-0.008117384277284145, -0.008914312347769737,..."
9,10,MSFT,23,3,our first party titles. And I've never been ...,"MSFTTranscriptFY23Q3.pdf, P.21, line 17","[-0.0044199684634804726, -0.018141914159059525..."


Use the following code to save the embeddings and processed data for future use. On later runs you can skip the preceding cells and load the processed data directly before uploading it to the index.

In [236]:
# save CSV for future use. 
df.to_csv(f"AnalyzedPDF/{experiment_name}/ChunksEmbedding_{experiment_name}.csv", index=False)
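
One consequence of saving to CSV is that each embedding, a Python list in memory, is written as its string form and comes back as a string when the file is read. That is why the upload step further down applies `literal_eval` to the `Embedding` column. The sketch below reproduces the round trip with the standard-library `csv` module and made-up values; pandas behaves the same way for list-valued cells.

```python
import csv
import io
from ast import literal_eval

# Write one row whose "Embedding" cell is a list (values are made up)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Id", "Embedding"])
writer.writerow([1, [0.25, -0.5, 1.0]])

# Read it back: the cell is now a string, not a list
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["Embedding"]
print(type(raw).__name__, raw)

# literal_eval safely parses the string back into a list of floats
vector = literal_eval(raw)
print(type(vector).__name__, vector)
```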

#### Store chunks in a vector database using Azure Cognitive Search (ACS)

In [237]:
from azure.core.credentials import AzureKeyCredential  
from azure.search.documents import SearchClient  
from azure.search.documents.indexes import SearchIndexClient  
from azure.search.documents.models import Vector  
from azure.search.documents.indexes.models import (  
    SearchIndex,  
    SearchField,  
    SearchFieldDataType,  
    SimpleField,  
    SearchableField,  
    SemanticConfiguration,  
    PrioritizedFields,  
    SemanticField,  
    SemanticSettings,  
    VectorSearch,  
    HnswVectorSearchAlgorithmConfiguration
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from ast import literal_eval

Load environment variables and keys:

In [238]:
## Cog Search
cogsearch_name = config["COGSEARCH_NAME"]
index_name = config["COGSEARCH_INDEX_NAME"]
key = config["COGSEARCH_API_KEY"]
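# COGSEARCH_NAME must hold the full endpoint URL here, e.g. https://<name>.search.windows.net
service_endpoint = config["COGSEARCH_NAME"]
# If it holds only the service name, build the URL instead:
#service_endpoint = "https://"+config["COGSEARCH_NAME"] + ".search.windows.net"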

credential = AzureKeyCredential(key)

Load the data prepared in the previous steps:

In [239]:
df_chunks_embedding = pd.read_csv(f"AnalyzedPDF/{experiment_name}/ChunksEmbedding_{experiment_name}.csv")
df_chunks_embedding.head(3)

Unnamed: 0,Id,Ticker,Year,Quarter,Chunk,Position,Embedding
0,1,MSFT,23,2,Microsoft FY23 Second Quarter Earnings Confere...,"MSFTTranscriptFY23Q2.pdf, P.6, line 5","[-0.02180245891213417, -0.013332000002264977, ..."
1,2,MSFT,23,2,mizing its operations. Investec is closing dea...,"MSFTTranscriptFY23Q2.pdf, P.11, line 14","[-0.009237000718712807, -0.017168618738651276,..."
2,3,MSFT,23,2,Xbox Game Studios. And we'll be sharing detai...,"MSFTTranscriptFY23Q2.pdf, P.16, line 4","[0.004660172387957573, -0.0066885207779705524,..."


Create the index configuration:

In [240]:
index_name=f"index__{experiment_name}"
# Create a search index
#experiment_name = "scifacts_s128t_o10t"
vs_config_name=f"{experiment_name}-vector-config"


index_client = SearchIndexClient(
    endpoint=service_endpoint, credential=credential)
fields = [
    SimpleField(name="Id", type=SearchFieldDataType.String, key=True, sortable=True, filterable=True, facetable=True),
    SearchableField(name="Filter", type=SearchFieldDataType.String, filterable=True),
    SearchableField(name="Chunk", type=SearchFieldDataType.String, searchable=True),
    SearchableField(name="Position", type=SearchFieldDataType.String, filterable=True),
    SearchField(name="Embedding", type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                searchable=True, vector_search_dimensions=1536, vector_search_configuration=vs_config_name),
]


vector_search = VectorSearch(
    algorithm_configurations=[
        HnswVectorSearchAlgorithmConfiguration(
            name=vs_config_name,
           # name="my-vector-config",
            kind="hnsw",
            parameters={
                "m": 4,
                "efConstruction": 400,
                "efSearch": 500,
                "metric": "cosine"
            }
        )
    ]
)

ss_config_name=f"{experiment_name}-semantic-config"
semantic_config = SemanticConfiguration(
    name=ss_config_name,
    #name="my-semantic-config",
    prioritized_fields=PrioritizedFields(
        title_field=SemanticField(field_name="Filter"),
        prioritized_content_fields=[SemanticField(field_name="Chunk")]
    )
)

# Create the semantic settings with the configuration
semantic_settings = SemanticSettings(configurations=[semantic_config])

# Create the search index with the semantic settings
index = SearchIndex(name=index_name, fields=fields,
                    vector_search=vector_search, semantic_settings=semantic_settings)
result = index_client.create_or_update_index(index)
print(f' {result.name} created')

 index__mft_top2n_s2048t_o10t created
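
The index above is configured with the `cosine` metric, so vector search compares the query embedding with chunk embeddings by the angle between them rather than by their length. As a reminder of what that measures, here is a minimal pure-Python sketch with toy vectors; `cosine_similarity` is an illustrative helper, and the search service computes its own scores internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
```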


Finally, upload prepared data to index:

In [241]:

## Upload data to Index
def batch_append_payload(df, search_client):
    """Upload rows to Cognitive Search in batches (note: max 1000 rows per upload call).
    Returns the last batch that was uploaded."""
    value_list = []
    for index, row in df.iterrows():
        value_list.append(
            {
                "Id": str(index),
                "Filter": str(row["Ticker"])+"FY"+str(row["Year"])+"Q"+str(row["Quarter"]),
                "Chunk": row["Chunk"],
                "Position": str(row["Position"]),
                # Embeddings were stored as strings in the CSV, so parse them back into lists
                "Embedding": literal_eval(row['Embedding']),
            }
        )

        # Flush a full batch
        if len(value_list) >= 1000:
            search_client.upload_documents(value_list)
            print(f"Uploaded {len(value_list)} payload")
            value_list = []

    # Upload the remaining rows, if any (skips an empty final batch)
    if value_list:
        search_client.upload_documents(value_list)
        print(f"Uploaded {len(value_list)} payload")

    return value_list


search_client = SearchClient(endpoint=service_endpoint, index_name=index_name, credential=credential)
payload = batch_append_payload(df_chunks_embedding, search_client)
 
# print(f"Uploaded {len(payload)} payload") 


Uploaded 29 payload
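
The batching logic in the upload function can also be factored into a small generator, which keeps the size limit in one place and never yields an empty batch. This is a sketch; `batched` here is a hypothetical helper written out by hand, and the 1000-row limit is the one stated in the docstring above.

```python
def batched(items, batch_size):
    """Yield consecutive lists of at most `batch_size` items from `items`."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Yield the final partial batch only if it has content
    if batch:
        yield batch


# 2500 documents split into upload-sized batches
print([len(batch) for batch in batched(range(2500), 1000)])
```

Each yielded list could then be passed to `search_client.upload_documents` in a simple loop.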


Ok, done with setting things up - now let's have fun and evaluate how well we're doing with this set of experimental parameters! :) 

### Step 3. Evaluation

#### Search Type 1: Pure Vector Search

In [242]:

def extract_time_context(line: str) -> str:
    match = re.search(r"\bFY\d{2}Q\d\b", line)
    if match:
        return match.group()
    else:
        return "No match found."
    
# Example usage
sentence1 = "How many monthly active users does Power BI have in FY23Q1?"
sentence2 = "What are some examples of companies using Azure OpenAI Service in FY23Q2?"
print(extract_time_context(sentence1))  # Output: FY23Q1
print(extract_time_context(sentence2))  # Output: FY23Q2

FY23Q1
FY23Q2
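
Note that when a question contains no fiscal period, `extract_time_context` returns the text "No match found.", which would end up inside the filter value if concatenated as-is. One way to guard against that is to build the filter expression in a helper that returns `None` when nothing was found, so the caller can decide to search without a filter. This is a sketch; `build_filter` and its `ticker` default are assumptions for illustration.

```python
import re
from typing import Optional


def build_filter(query: str, ticker: str = "MSFT") -> Optional[str]:
    """Return an OData filter on the `Filter` field, or None if the query names no fiscal period."""
    match = re.search(r"\bFY\d{2}Q\d\b", query)
    if match is None:
        return None
    return f"Filter eq '{ticker}{match.group()}'"


print(build_filter("How many monthly active users does Power BI have in FY23Q1?"))
print(build_filter("How many monthly active users does Power BI have?"))
```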


In [243]:
query = "How many monthly active users does Power BI have in FY23Q1?"
filter_value='MSFT'+extract_time_context(query)
    
search_client = SearchClient(service_endpoint, index_name, credential=credential)  
vector = Vector(value=createEmbeddings(query), k=5, fields="Embedding")  

results = search_client.search(  
    search_text=None,  
    vectors=[vector],
    filter=f"(Filter eq '{filter_value}')" 
)  

output=[] 
for result in results: 
  print(result)

{'Embedding': [-0.018450594, -0.017268559, -0.010298638, -0.013491487, 0.0009578548, 0.003766883, -0.013355621, -0.031031778, 0.0006169162, -0.05013453, 0.0505693, 0.02149399, -0.0062634195, -0.00048147488, -0.013573007, -0.005488984, 0.0056961793, -0.0019343912, 0.007173721, -0.029211175, -0.016643576, -0.011833923, -0.01735008, 0.01411647, -0.028559018, -0.004734928, 0.005475397, -0.006963129, 0.01889895, -0.015841968, -0.0036174303, -0.017268559, -0.0042356206, 0.010305431, -0.012445319, -0.011283666, -0.0024438885, 0.020828247, 0.024333589, 0.008715799, 0.011378772, -0.004313743, -0.011120627, 0.005699576, -0.019958705, 0.022499397, 0.006637051, 0.0062871957, 0.007486213, 0.019605454, 0.015488716, 0.026290057, -0.033069767, -0.028586192, -0.0013510168, -0.001576894, 0.009992939, 0.018966883, -0.002211218, 0.0074386597, -0.0054991734, -0.00020687711, -0.021847242, 0.009014704, -0.022757543, -0.014673521, -0.021833654, -0.00036874862, -2.8314247e-05, 0.0055501233, 0.04442816, 0.02763

In [244]:
def fetch_relevant_chunks(query, n=1):
    filter_value='MSFT'+extract_time_context(query)
    
    search_client = SearchClient(service_endpoint, index_name, credential=credential)  
    vector = Vector(value=createEmbeddings(query), k=5, fields="Embedding")  

    results = search_client.search(  
        search_text=None,  
        vectors=[vector],
        filter=f"(Filter eq '{filter_value}')" 
    )  

    # Collect the top n results as (position, chunk, filter, score, query) tuples
    output=[] 
    count=0
    for result in results: 
        output.append((result['Position'], result['Chunk'], result['Filter'], result['@search.score'], query))              
        
        count+=1
        if count>= n: 
            break
  
    return output    

def fetch_relevant_chunk(query):
    output=fetch_relevant_chunks(query, 1)
    return output[0]

fetch_relevant_chunks("How many monthly active users does Power BI have in FY23Q1?", 3)

[('MSFTTranscriptFY23Q1.pdf, P.9, line 6',
  'Microsoft FY23 First Quarter Earnings Conference Call  Brett Iversen, Satya Nadella, Amy Hood  Tuesday, October 25, 2022  BRETT IVERSEN:  Good afternoon and thank you for joining us today. On the call with me are  Satya Nadella, chairman and chief executive officer, Amy Hood, chief  financial officer, Alice Jolla, chief accounting officer, and Keith Dolliver,  deputy general counsel.  On the Microsoft Investor Relations website, you can find our earnings  press release and financial summary slide deck, which is intended to  supplement our prepared remarks during today\'s call and provides the  reconciliation of differences between GAAP and non-GAAP financial  measures.  On this call we will discuss certain non-GAAP items. The non-GAAP financial  measures provided should not be considered as a substitute for or superior  to the measures of financial performance prepared in accordance with  GAAP. They are included as additional clarifying ite

In [245]:
def fetch_relevant_chunks_as_df(query, n=1):
    filter_value='MSFT'+extract_time_context(query)
    
    search_client = SearchClient(service_endpoint, index_name, credential=credential)  
    vector = Vector(value=createEmbeddings(query), k=5, fields="Embedding")  

    results = search_client.search(  
        search_text=None,  
        vectors=[vector],
        filter=f"(Filter eq '{filter_value}')" 
    )  

    df_output = pd.DataFrame(columns=['position', 'chunk', 'filter', 'score', 'question'])

    count=0
    for result in results: 
        row_dict = {
            'position': result['Position'],
            'chunk': result['Chunk'],
            'filter': result['Filter'],
            'score': result['@search.score'],
            'question': query
        }

        df_output = pd.concat([df_output, pd.DataFrame(row_dict, index=[0])], ignore_index=True)

        count+=1
        if count>= n: 
            break
  
    return df_output    

def fetch_relevant_chunks_for_multiple_questions(queries, n=1):

    df_output = pd.DataFrame(columns=['position', 'chunk', 'filter', 'score', 'question'])

    print(n)
    for query in queries: 
        relevant_chunks = fetch_relevant_chunks_as_df(query, n)
        df_output = pd.concat([df_output, pd.DataFrame(relevant_chunks)], ignore_index=True)

    return df_output  

fetch_relevant_chunks_for_multiple_questions(["How many monthly active users does Power BI have in FY23Q1?", "Which companies run SAP on Azure in FY23Q1?"], 2)

2


Unnamed: 0,position,chunk,filter,score,question
0,"MSFTTranscriptFY23Q1.pdf, P.9, line 6",Microsoft FY23 First Quarter Earnings Conferen...,MSFTFY23Q1,0.829985,How many monthly active users does Power BI ha...
1,"MSFTTranscriptFY23Q1.pdf, P.18, line 11",cial Services to deliver new digital experien...,MSFTFY23Q1,0.828992,How many monthly active users does Power BI ha...
2,"MSFTTranscriptFY23Q1.pdf, P.9, line 6",Microsoft FY23 First Quarter Earnings Conferen...,MSFTFY23Q1,0.843204,Which companies run SAP on Azure in FY23Q1?
3,"MSFTTranscriptFY23Q1.pdf, P.18, line 11",cial Services to deliver new digital experien...,MSFTFY23Q1,0.841985,Which companies run SAP on Azure in FY23Q1?


In [246]:
benchmark_file="evalset.csv"

In [247]:
doc_dir = Path("DATA/") 
file_path = os.path.join(doc_dir, benchmark_file)
benchmark=pd.read_csv(file_path) 
benchmark

Unnamed: 0,chat_history,question,answer
0,[],What is the quarterly revenue of Microsoft Clo...,"$25 billion in quarterly revenue, up 24 percen..."
1,[],How many Azure Arc customers does Microsoft cu...,"More than 8,500 Arc customers, more than doubl..."
2,[],Which companies run SAP on Azure in FY23Q1?,"Companies like Chobani, Munich RE, Sodexo, Vol..."
3,[],What is the annual recurring revenue of GitHub...,GitHub is now at $1 billion in annual recurrin...
4,[],How many monthly active users does Power BI ha...,Power BI is the market leader in business inte...
5,[],What is the quarterly revenue of Microsoft Clo...,"$27 billion in quarterly revenue, up 22 percen..."
6,[],How many Azure Arc customers does Microsoft cu...,"More than 12,000 Arc customers, double the num..."
7,[],What are some examples of companies using Azur...,"Over 200 customers, including KPMG and Al Jaze..."
8,[],How many developers are using GitHub Copilot a...,More than one million people have used GitHub ...
9,[],What is the growth rate of Azure Machine Learn...,Azure ML revenue alone has increased more than...


In [248]:
eval1= benchmark.apply(lambda x : fetch_relevant_chunks(x['question'], 2), axis=1)
#answers= qa_with_context.apply(lambda x: get_answer_from_chunk(x['question'], x['context']), axis=1, result_type='expand')
#df.explode('column1').reset_index(drop=True)
eval1[1]

[('MSFTTranscriptFY23Q1.pdf, P.18, line 11',
  "cial Services to deliver new  digital experiences for its customers. Our Cloud for Sustainability is off to a fast start, as organizations like Telstra use the solution to track  their environmental footprint.  New updates provide insights on hard to measure Scope 3 carbon  emissions.  And we're seeing record growth in healthcare, driven in part by our Nuance  DAX ambient intelligence solution, which automatically documents patient  encounters at the point of care.  Physicians tell us DAX dramatically improves their productivity, and it's  quickly becoming an on-ramp to our broader healthcare offerings.  Now, on to new systems of work.  Microsoft 365, Teams, and Viva uniquely enable employees to thrive in  today's digitally connected, distributed world of work.  Microsoft 365 is the cloud-first platform that supports all the ways people  work and every type of worker, reducing cost and complexity for IT.  The new Microsoft 365 app brings 

In [249]:
questions_list = benchmark['question'].tolist()
top_n=2
relevant_chunks=fetch_relevant_chunks_for_multiple_questions(questions_list, top_n)
merged_df = pd.merge(relevant_chunks, benchmark, on='question')
merged_df

2


Unnamed: 0,position,chunk,filter,score,question,chat_history,answer
0,"MSFTTranscriptFY23Q1.pdf, P.25, line 16",d afternoon everyone. Our first quarter revenu...,MSFTFY23Q1,0.887558,What is the quarterly revenue of Microsoft Clo...,[],"$25 billion in quarterly revenue, up 24 percen..."
1,"MSFTTranscriptFY23Q1.pdf, P.33, line 12","arge Hololens deal, partially offset by low d...",MSFTFY23Q1,0.875804,What is the quarterly revenue of Microsoft Clo...,[],"$25 billion in quarterly revenue, up 24 percen..."
2,"MSFTTranscriptFY23Q1.pdf, P.18, line 11",cial Services to deliver new digital experien...,MSFTFY23Q1,0.850723,How many Azure Arc customers does Microsoft cu...,[],"More than 8,500 Arc customers, more than doubl..."
3,"MSFTTranscriptFY23Q1.pdf, P.9, line 6",Microsoft FY23 First Quarter Earnings Conferen...,MSFTFY23Q1,0.847148,How many Azure Arc customers does Microsoft cu...,[],"More than 8,500 Arc customers, more than doubl..."
4,"MSFTTranscriptFY23Q1.pdf, P.9, line 6",Microsoft FY23 First Quarter Earnings Conferen...,MSFTFY23Q1,0.843204,Which companies run SAP on Azure in FY23Q1?,[],"Companies like Chobani, Munich RE, Sodexo, Vol..."
5,"MSFTTranscriptFY23Q1.pdf, P.18, line 11",cial Services to deliver new digital experien...,MSFTFY23Q1,0.841985,Which companies run SAP on Azure in FY23Q1?,[],"Companies like Chobani, Munich RE, Sodexo, Vol..."
6,"MSFTTranscriptFY23Q1.pdf, P.25, line 16",d afternoon everyone. Our first quarter revenu...,MSFTFY23Q1,0.843039,What is the annual recurring revenue of GitHub...,[],GitHub is now at $1 billion in annual recurrin...
7,"MSFTTranscriptFY23Q1.pdf, P.40, line 3",n Xbox Game Pass subscribers. We expect Xbox ...,MSFTFY23Q1,0.837179,What is the annual recurring revenue of GitHub...,[],GitHub is now at $1 billion in annual recurrin...
8,"MSFTTranscriptFY23Q1.pdf, P.9, line 6",Microsoft FY23 First Quarter Earnings Conferen...,MSFTFY23Q1,0.829985,How many monthly active users does Power BI ha...,[],Power BI is the market leader in business inte...
9,"MSFTTranscriptFY23Q1.pdf, P.18, line 11",cial Services to deliver new digital experien...,MSFTFY23Q1,0.828992,How many monthly active users does Power BI ha...,[],Power BI is the market leader in business inte...


In [255]:
#eval= benchmark['question'].apply(fetch_relevant_chunks).apply(pd.Series)
#eval.columns = ['position','context', 'filter', 'rel_score']
#qa_with_context = pd.concat([benchmark, eval], axis=1)
#qa_with_context


In [None]:
template_1shot = """
        You are given a question and a relevant piece of context that has already been pre-filtered for the given financial year (FY) and quarter (Q).
        Provide the best factually grounded answer you can. If the answer is not available in the text, just write N/A.
               
        An example output answer for question "How many developers are using GitHub Copilot as of FY23Q2?": 
        "More than one million people have used GitHub Copilot to date."

        Go!
        question:
        {question}
        context: 
        {chunk_text1} 
        """

In [None]:
chunk1="sales motions and commitments to our platform should drive solid growth on a moderately growing expiry base, against a strong prior year comparable which included a significant volume of large, long-term Azure contracts. As a reminder, the "
question1="What is the growth rate of Azure Machine Learning revenue in FY23Q1?"

In [None]:
def get_answer_from_chunk(question, chunk):
    """Ask the chat model to answer `question` using only the text in `chunk`."""
    # `client` is assumed to be the OpenAI client created earlier in the notebook.
    response = client.chat.completions.create(
        model="gpt-35-turbo",  # deployment name; switch to e.g. "gpt-4" if deployed
        messages=[
            {"role": "system", "content": "You are an expert in finance and need to provide accurate answers for the given text."},
            {"role": "user", "content": template_1shot.format(chunk_text1=chunk, question=question)},
        ],
    )
    return response.choices[0].message.content

get_answer_from_chunk("What is the quarterly revenue of Microsoft Cloud in FY23Q1?", "e. With that, I'll hand it over to Amy. AMY HOOD: Thank you, Satya, and good afternoon everyone. Our first quarter revenue was $50.1 billion, up 11 percent and 16 percent in constant currency. Earnings per share was $2.35 - and increased 4 pe...")

'The text does not provide specific information on the quarterly revenue of Microsoft Cloud in FY23Q1.'

In [None]:
# The question comes first, then the chunk, matching the function signature.
get_answer_from_chunk(question1, chunk1)

'N/A'

In [251]:
# Concatenate all chunks retrieved for a question into one context string,
# prefixing each chunk with its source position.
grouped_df = merged_df.groupby('question').apply(lambda x: '\n*****\n'.join("From " + x['position'] + ":\n" + x['chunk'])).reset_index()
grouped_df

Unnamed: 0,question,0
0,How many Azure Arc customers does Microsoft cu...,"From MSFTTranscriptFY23Q1.pdf, P.18, line 11:\..."
1,How many Azure Arc customers does Microsoft cu...,"From MSFTTranscriptFY23Q2.pdf, P.6, line 5:\nM..."
2,How many developers are using GitHub Copilot a...,"From MSFTTranscriptFY23Q2.pdf, P.6, line 5:\nM..."
3,How many monthly active users does Power BI ha...,"From MSFTTranscriptFY23Q1.pdf, P.9, line 6:\nM..."
4,What are some examples of companies using Azur...,"From MSFTTranscriptFY23Q2.pdf, P.11, line 14:\..."
5,What is the annual recurring revenue of GitHub...,"From MSFTTranscriptFY23Q1.pdf, P.25, line 16:\..."
6,What is the growth rate of Azure Machine Learn...,"From MSFTTranscriptFY23Q1.pdf, P.40, line 3:\n..."
7,What is the quarterly revenue of Microsoft Clo...,"From MSFTTranscriptFY23Q1.pdf, P.25, line 16:\..."
8,What is the quarterly revenue of Microsoft Clo...,"From MSFTTranscriptFY23Q2.pdf, P.20, line 17:\..."
9,Which companies run SAP on Azure in FY23Q1?,"From MSFTTranscriptFY23Q1.pdf, P.9, line 6:\nM..."


In [252]:
grouped_df = grouped_df.rename(columns={0: 'context'})
print(grouped_df.iloc[0]['context'])

From MSFTTranscriptFY23Q1.pdf, P.18, line 11:
cial Services to deliver new  digital experiences for its customers. Our Cloud for Sustainability is off to a fast start, as organizations like Telstra use the solution to track  their environmental footprint.  New updates provide insights on hard to measure Scope 3 carbon  emissions.  And we're seeing record growth in healthcare, driven in part by our Nuance  DAX ambient intelligence solution, which automatically documents patient  encounters at the point of care.  Physicians tell us DAX dramatically improves their productivity, and it's  quickly becoming an on-ramp to our broader healthcare offerings.  Now, on to new systems of work.  Microsoft 365, Teams, and Viva uniquely enable employees to thrive in  today's digitally connected, distributed world of work.  Microsoft 365 is the cloud-first platform that supports all the ways people  work and every type of worker, reducing cost and complexity for IT.  The new Microsoft 365 app brings to
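
The context printed above follows a simple format: each chunk is prefixed with `From <position>:` and consecutive chunks are separated by a line of five asterisks. The sketch below reproduces that joining logic in plain Python, which can be handy for checking the format without pandas. `build_context` is a hypothetical helper and is not part of the notebook.

```python
def build_context(chunks, separator="\n*****\n"):
    """Join (position, text) pairs into one context string.

    Hypothetical helper that mirrors the groupby/apply cell above.
    """
    return separator.join(f"From {position}:\n{text}" for position, text in chunks)


example = build_context([
    ("doc.pdf, P.1, line 1", "first chunk"),
    ("doc.pdf, P.2, line 5", "second chunk"),
])
print(example)
```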

In [253]:
qa_with_context = pd.merge(grouped_df, benchmark, on='question')
qa_with_context

Unnamed: 0,question,context,chat_history,answer
0,How many Azure Arc customers does Microsoft cu...,"From MSFTTranscriptFY23Q1.pdf, P.18, line 11:\...",[],"More than 8,500 Arc customers, more than doubl..."
1,How many Azure Arc customers does Microsoft cu...,"From MSFTTranscriptFY23Q2.pdf, P.6, line 5:\nM...",[],"More than 12,000 Arc customers, double the num..."
2,How many developers are using GitHub Copilot a...,"From MSFTTranscriptFY23Q2.pdf, P.6, line 5:\nM...",[],More than one million people have used GitHub ...
3,How many monthly active users does Power BI ha...,"From MSFTTranscriptFY23Q1.pdf, P.9, line 6:\nM...",[],Power BI is the market leader in business inte...
4,What are some examples of companies using Azur...,"From MSFTTranscriptFY23Q2.pdf, P.11, line 14:\...",[],"Over 200 customers, including KPMG and Al Jaze..."
5,What is the annual recurring revenue of GitHub...,"From MSFTTranscriptFY23Q1.pdf, P.25, line 16:\...",[],GitHub is now at $1 billion in annual recurrin...
6,What is the growth rate of Azure Machine Learn...,"From MSFTTranscriptFY23Q1.pdf, P.40, line 3:\n...",[],Azure ML revenue alone has increased more than...
7,What is the quarterly revenue of Microsoft Clo...,"From MSFTTranscriptFY23Q1.pdf, P.25, line 16:\...",[],"$25 billion in quarterly revenue, up 24 percen..."
8,What is the quarterly revenue of Microsoft Clo...,"From MSFTTranscriptFY23Q2.pdf, P.20, line 17:\...",[],"$27 billion in quarterly revenue, up 22 percen..."
9,Which companies run SAP on Azure in FY23Q1?,"From MSFTTranscriptFY23Q1.pdf, P.9, line 6:\nM...",[],"Companies like Chobani, Munich RE, Sodexo, Vol..."


In [254]:
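# Generate an answer for every question from its retrieved context (one model call per row).
answers = qa_with_context.apply(lambda x: get_answer_from_chunk(x['question'], x['context']), axis=1)
# `answers` is an unnamed Series, so it becomes column 0 after the concat.
output = pd.concat([answers, qa_with_context], axis=1)
output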

Unnamed: 0,0,question,context,chat_history,answer
0,"Microsoft currently has more than 8,500 Azure ...",How many Azure Arc customers does Microsoft cu...,"From MSFTTranscriptFY23Q1.pdf, P.18, line 11:\...",[],"More than 8,500 Arc customers, more than doubl..."
1,"Microsoft currently has more than 12,000 Azure...",How many Azure Arc customers does Microsoft cu...,"From MSFTTranscriptFY23Q2.pdf, P.6, line 5:\nM...",[],"More than 12,000 Arc customers, double the num..."
2,"""More than one million people have used GitHub...",How many developers are using GitHub Copilot a...,"From MSFTTranscriptFY23Q2.pdf, P.6, line 5:\nM...",[],More than one million people have used GitHub ...
3,Power BI has nearly 15 million monthly active ...,How many monthly active users does Power BI ha...,"From MSFTTranscriptFY23Q1.pdf, P.9, line 6:\nM...",[],Power BI is the market leader in business inte...
4,Some examples of companies using Azure OpenAI ...,What are some examples of companies using Azur...,"From MSFTTranscriptFY23Q2.pdf, P.11, line 14:\...",[],"Over 200 customers, including KPMG and Al Jaze..."
5,,What is the annual recurring revenue of GitHub...,"From MSFTTranscriptFY23Q1.pdf, P.25, line 16:\...",[],GitHub is now at $1 billion in annual recurrin...
6,,What is the growth rate of Azure Machine Learn...,"From MSFTTranscriptFY23Q1.pdf, P.40, line 3:\n...",[],Azure ML revenue alone has increased more than...
7,The quarterly revenue of Microsoft Cloud in FY...,What is the quarterly revenue of Microsoft Clo...,"From MSFTTranscriptFY23Q1.pdf, P.25, line 16:\...",[],"$25 billion in quarterly revenue, up 24 percen..."
8,The quarterly revenue of Microsoft Cloud in FY...,What is the quarterly revenue of Microsoft Clo...,"From MSFTTranscriptFY23Q2.pdf, P.20, line 17:\...",[],"$27 billion in quarterly revenue, up 22 percen..."
9,"Companies like Chobani, Munich RE, Sodexo, and...",Which companies run SAP on Azure in FY23Q1?,"From MSFTTranscriptFY23Q1.pdf, P.9, line 6:\nM...",[],"Companies like Chobani, Munich RE, Sodexo, Vol..."


In [256]:
# Keep the benchmark answer as `ground_truth` and name the model's answer `answer`.
output = output.rename(columns={'answer': 'ground_truth', 0: 'answer'})
output

Unnamed: 0,answer,question,context,chat_history,ground_truth
0,"Microsoft currently has more than 8,500 Azure ...",How many Azure Arc customers does Microsoft cu...,"From MSFTTranscriptFY23Q1.pdf, P.18, line 11:\...",[],"More than 8,500 Arc customers, more than doubl..."
1,"Microsoft currently has more than 12,000 Azure...",How many Azure Arc customers does Microsoft cu...,"From MSFTTranscriptFY23Q2.pdf, P.6, line 5:\nM...",[],"More than 12,000 Arc customers, double the num..."
2,"""More than one million people have used GitHub...",How many developers are using GitHub Copilot a...,"From MSFTTranscriptFY23Q2.pdf, P.6, line 5:\nM...",[],More than one million people have used GitHub ...
3,Power BI has nearly 15 million monthly active ...,How many monthly active users does Power BI ha...,"From MSFTTranscriptFY23Q1.pdf, P.9, line 6:\nM...",[],Power BI is the market leader in business inte...
4,Some examples of companies using Azure OpenAI ...,What are some examples of companies using Azur...,"From MSFTTranscriptFY23Q2.pdf, P.11, line 14:\...",[],"Over 200 customers, including KPMG and Al Jaze..."
5,,What is the annual recurring revenue of GitHub...,"From MSFTTranscriptFY23Q1.pdf, P.25, line 16:\...",[],GitHub is now at $1 billion in annual recurrin...
6,,What is the growth rate of Azure Machine Learn...,"From MSFTTranscriptFY23Q1.pdf, P.40, line 3:\n...",[],Azure ML revenue alone has increased more than...
7,The quarterly revenue of Microsoft Cloud in FY...,What is the quarterly revenue of Microsoft Clo...,"From MSFTTranscriptFY23Q1.pdf, P.25, line 16:\...",[],"$25 billion in quarterly revenue, up 24 percen..."
8,The quarterly revenue of Microsoft Cloud in FY...,What is the quarterly revenue of Microsoft Clo...,"From MSFTTranscriptFY23Q2.pdf, P.20, line 17:\...",[],"$27 billion in quarterly revenue, up 22 percen..."
9,"Companies like Chobani, Munich RE, Sodexo, and...",Which companies run SAP on Azure in FY23Q1?,"From MSFTTranscriptFY23Q1.pdf, P.9, line 6:\nM...",[],"Companies like Chobani, Munich RE, Sodexo, Vol..."
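
Rows 5 and 6 in the table above show no answer. A minimal sketch for flagging such rows follows, assuming an answer counts as missing when it is `None`, NaN, empty, or the literal `N/A` (optionally wrapped in quotes). `is_missing_answer` is a hypothetical helper and is not part of the notebook.

```python
import math


def is_missing_answer(answer):
    """Return True when no usable answer was produced (hypothetical rule)."""
    if answer is None:
        return True
    if isinstance(answer, float) and math.isnan(answer):
        return True
    # Ignore surrounding whitespace and quotes before comparing.
    return str(answer).strip().strip('"\'').upper() in {"", "N/A"}


sample_answers = ["N/A", "", None, float("nan"), "Some answer text"]
flags = [is_missing_answer(a) for a in sample_answers]
print(flags)
```

On the notebook's DataFrame this could be applied as `output['answer'].apply(is_missing_answer).sum()` to count unanswered questions.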


In [257]:
# Save the results to a CSV file for future use.
output_dir = f"AnalyzedPDF/{experiment_name}"
os.makedirs(output_dir, exist_ok=True)
output.to_csv(f"{output_dir}/eval_benchmark_{experiment_name}.csv", index=False)