## Benchmark Question-Answer Generation

This notebook demonstrates how to generate a set of questions and answers from document chunks. The documents (a sample of MSFT transcripts) were chunked before insertion into a database, and the chunks were also saved to a `.csv` file. Generating questions and answers at this stage is convenient because searching the database adds cost. Randomly picking chunks for the specified filter parameters helps diversify the questions and answers.

### SET UP AND CONFIGURATION

Load Environment File

In [1]:
from dotenv import dotenv_values
from azure.keyvault.secrets import SecretClient
from azure.identity import DefaultAzureCredential
from azure.ai.resources.client import AIClient
from azure.ai.generative.evaluate import evaluate
import openai
from prompt_templates import gen_prompt, eval_prompt, eval_prompt2

# Path to the .env file
env_name = "../../.env" # change to your own .env file name
config = dotenv_values(env_name)

In [2]:
"""
Remember to remove the key from your code when you're done, and never post it publicly. For production, use
secure methods to store and access your credentials. For more information, see 
https://docs.microsoft.com/en-us/azure/cognitive-services/cognitive-services-security?tabs=command-line%2Ccsharp#environment-variables-and-application-configuration
"""

if config['KEYS_FROM'] == "KEYVAULT":
    print('keyvault was selected.')
    keyVaultName = config["KEY_VAULT_NAME"]
    KVUri = f"https://{keyVaultName}.vault.azure.net"

    credential = DefaultAzureCredential()
    client = SecretClient(vault_url=KVUri, credential=credential)
    openai.api_type = client.get_secret("OPENAI-API-TYPE").value
    openai.api_key = client.get_secret("OPENAI-API-KEY").value
    openai.api_base = client.get_secret("OPENAI-API-BASE").value
    openai.api_version = client.get_secret("OPENAI-API-VERSION").value
    
else:
    print('.env was selected.')
    openai.api_type = config["OPENAI_API_TYPE"] 
    openai.api_key = config["OPENAI_API_KEY"]
    openai.api_base = config["OPENAI_API_BASE"] 
    openai.api_version = config["OPENAI_API_VERSION"] 

.env was selected.


Read chunks from CSV (see the step2 notebook in the preprocessing subdirectory)

In [3]:
import numpy as np
import pandas as pd
df = pd.read_csv('AnalyzedPDF/ChunksEmbedding.csv')

In [4]:
df[df['Quarter']==2]

| Unnamed: 0 | Id | Ticker | Year | Quarter | Chunk | PageNumber | LineNumber | Embedding |
|---|---|---|---|---|---|---|---|---|
| 113 | 114 | MSFT | 23 | 2 | Microsoft FY23 Second Quarter Earnings Confere... | 1 | 1 | [-0.022043932, -0.023832329, -0.015447599, -0.... |
| 114 | 115 | MSFT | 23 | 2 | On the Microsoft Investor Relations website, y... | 1 | 9 | [-0.023697682, -0.005627374, -0.0051322975, -0... |
| 115 | 116 | MSFT | 23 | 2 | GAAP. They are included as additional clarifyi... | 1 | 17 | [-0.012550131, -0.0020706053, 0.007283737, -0.... |
| 116 | 117 | MSFT | 23 | 2 | same in constant currency, we will refer to th... | 1 | 25 | [-0.01768585, -0.02943631, -0.00054391, -0.015... |
| 117 | 118 | MSFT | 23 | 2 | predictions, projections, or other statements ... | 2 | 6 | [-0.009156934, -0.019673413, -0.0082705645, -0... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 216 | 217 | MSFT | 23 | 2 | BRETT IVERSEN: Thanks, Brad. Joe, we have time... | 31 | 13 | [0.0011864604, -0.04014092, 0.009777045, -0.01... |
| 217 | 218 | MSFT | 23 | 2 | the coming quarters? Thank you. SATYA NADELLA:... | 31 | 21 | [0.009347212, -0.01008223, 0.015255094, -0.009... |
| 218 | 219 | MSFT | 23 | 2 | going to be an Al app. That's, I think, the be... | 32 | 1 | [0.005495851, -0.003575635, 0.013053961, 0.000... |
| 219 | 220 | MSFT | 23 | 2 | Sometimes, you will have ISVs who are differen... | 32 | 9 | [-0.004339527, -0.028531296, 0.017532898, -0.0... |


In [5]:
df['Chunk'].iloc[11]

"Cosmos DB now supports PostgreSQL, making Azure the first cloud provider to offer a database service that supports both relational and NoSQL workloads. And, in Al, we are turning the world's most advanced models into platforms for customers. Earlier this month, we brought the power of Dall-E to Azure OpenAI service, helping customers like Mattel apply the breakthrough image generation model to commercial use cases for the first time. And Azure Machine Learning provides industry leading MLOps, helping organizations like 3M deploy, manage, and govern models. "

#### Prompt Template
##### Write a prompt template. It should include every filter key as a placeholder so that the values can be filled in when the prompt is formatted.

In [6]:
# template = """
#         You are given two chunks of text, a ticker e.g. MSFT, Quarter, Year, as input. You will generate 10 relevant questions and answers pairs based on the input.
#         The question should be formed based on information in both the chunks of text.
#         The answers should be available in the two chunks of text. Do not generate answers on your own.  If answer is not available in the text, just write N/A.
               
#         An example output for this example is: 

#         Question: For {ticker} FY{year} Q{quarter}, what is the <question goes here>?
#         Answer: example answer paraphrased from the relevant information in the given text goes here 

#         Based on ticker, quarter, year, the question can be phrased in different ways e.g. MSFT FY23 Q1, MSFT FY2023 1st quarter, e.t.c.
#         In case the text question is not relevant, please skip the question and answer pair.
#         input_text1: 
#         {chunk_text1}
#         input_text2:
#         {chunk_text2}
#         ticker: {ticker}
#         quarter: {quarter}
#         year: {year}
#         """
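
A quick way to confirm that a template exposes every filter key is to list its placeholders before formatting it. The sketch below is only an illustration: `example_template` is a shortened, hypothetical stand-in for the commented template above (not the actual `gen_prompt`), and `template_fields` is a helper name introduced here.

```python
from string import Formatter


def template_fields(template):
    """Return the set of named placeholders (e.g. {ticker}) in a format string."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


# Hypothetical, shortened template used only for illustration.
example_template = (
    "Question: For {ticker} FY{year} Q{quarter}, what is the <question goes here>?\n"
    "input_text1: {chunk_text1}\n"
    "input_text2: {chunk_text2}\n"
)

# Keys passed to .format() when the prompt is built.
required = {"ticker", "year", "quarter", "chunk_text1", "chunk_text2"}
missing = required - template_fields(example_template)
print("missing placeholders:", sorted(missing))
```

If `missing` is not empty, the corresponding filter value would be silently ignored by `.format()`, so the check is worth running whenever the template is edited.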

#### Randomly pick filter parameters:

Filter parameters are picked at random and used to extract the context (chunks). This helps diversify the generated questions and answers.
In the MSFT financial transcripts use case, the only ticker is MSFT, but you can easily add other ticker labels for a larger financial dataset. `Year`, `Quarter`, and `Id` are the key parameters in this use case. In this notebook, we pick two random pages for the selected filter keys (ticker, year, quarter) and join the chunks on each page into one context.
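
The random selection described above can be made repeatable, and forced to return two different pages, with a seeded generator. This is a minimal sketch under assumptions: `pick_two_pages` is a hypothetical helper, and `toy` is a tiny made-up frame that only mimics the column names of the chunk table.

```python
import numpy as np
import pandas as pd


def pick_two_pages(frame, ticker, year, quarter, rng):
    """Sample two distinct PageNumber values for one (ticker, year, quarter) filter."""
    mask = (
        (frame["Ticker"] == ticker)
        & (frame["Year"] == year)
        & (frame["Quarter"] == quarter)
    )
    pages = frame.loc[mask, "PageNumber"].unique()
    if len(pages) < 2:
        raise ValueError("need at least two pages for this filter")
    # replace=False guarantees the two pages differ
    first, second = rng.choice(pages, size=2, replace=False)
    return int(first), int(second)


# Tiny stand-in for the chunk table (illustrative values only).
toy = pd.DataFrame({
    "Ticker": ["MSFT"] * 4,
    "Year": [23] * 4,
    "Quarter": [2, 2, 2, 3],
    "PageNumber": [1, 2, 3, 1],
    "Chunk": ["a", "b", "c", "d"],
})

rng = np.random.default_rng(seed=0)  # fixed seed so the draw is repeatable
page_a, page_b = pick_two_pages(toy, "MSFT", 23, 2, rng)
print(page_a, page_b)
```

Sampling without replacement avoids the case where both contexts come from the same page, which two independent `np.random.choice` calls can produce.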

###### TODO: Add Tools (an updated version of function calls) once it is available for working with newer models. Function Calls are currently deprecated for gpt models with versions beyond 07-01-2023.

In [7]:
Ticker = np.random.choice(df['Ticker'].unique())
Year = np.random.choice(df[df['Ticker']==Ticker]['Year'].unique())
Quarter = np.random.choice(df[(df['Ticker']==Ticker) & (df['Year']==Year)]['Quarter'].unique())
# Id = np.random.choice(df[(df['Quarter']==Quarter) & (df['Year']==Year)]['Id'].unique())
# Id2 = np.random.choice(df[(df['Quarter']==Quarter) & (df['Year']==Year) & (df['Id']!=Id)]['Id'].unique())

In [8]:
pagenum1 = np.random.choice(df[(df['Ticker']==Ticker) & (df['Year']==Year) & (df['Quarter']==Quarter)]['PageNumber'].unique())
pagenum2 = np.random.choice(df[(df['Ticker']==Ticker) & (df['Year']==Year) & (df['Quarter']==Quarter)]['PageNumber'].unique())

In [9]:
chunk1 = df[(df['Ticker']==Ticker) & (df['Year']==Year) & (df['Quarter']==Quarter) & (df['PageNumber']==pagenum1)]['Chunk'].str.cat(sep=' ')
chunk2 = df[(df['Ticker']==Ticker) & (df['Year']==Year) & (df['Quarter']==Quarter) & (df['PageNumber']==pagenum2)]['Chunk'].str.cat(sep=' ')

In [10]:
print(Ticker, Year, Quarter, pagenum1, pagenum2)

MSFT 23 3 18 31


#### Generate Questions (using Azure OpenAI only)

In [11]:
import os
from openai import AzureOpenAI

client = AzureOpenAI(
  api_key = openai.api_key,  
  api_version = openai.api_version,
  azure_endpoint = openai.api_base
)

response = client.chat.completions.create(
    model="gpt-35-turbo", # model = "deployment_name".
    messages=[
        {"role": "system", "content":"You are a generator of questions and answers for the given text." },
        {"role": "user", "content": gen_prompt.format(chunk_text1=chunk1, chunk_text2=chunk2, 
                                                    ticker=Ticker, year=str(Year), quarter=str(Quarter))}
    ]
)

#print(response)
# print(response.model_dump_json(indent=2))
print(response.choices[0].message.content)

# TODO: Add cells showing how to add tools to the chat completion. Tools are an update to the function calling feature, which is not available in the current version of the API (OpenAI version > 1.0.0).

Question 1: What was the percentage increase in operating income at the end of March compared to a year ago for MSFT FY23 Q3?
Answer: Operating income increased 10% and 15% in constant currency, including 4 points due to the change in accounting estimate.

Question 2: How did the headcount at the end of March in MSFT FY23 Q3 compare to a year ago?
Answer: Headcount at the end of March was 9% higher than a year ago.

Question 3: What was the revenue from Productivity and Business Processes in MSFT FY23 Q3?
Answer: Revenue from Productivity and Business Processes was $17.5 billion and grew 11% and 15% in constant currency.

Question 4: What was the growth rate of Office commercial revenue in MSFT FY23 Q3?
Answer: Office commercial revenue grew 13% and 17% in constant currency.

Question 5: How much did the Office 365 commercial revenue increase in MSFT FY23 Q3?
Answer: Office 365 commercial revenue increased 14% and 18% in constant currency.

Question 6: What was the growth rate of paid 

In [13]:
import re
_pattern = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s')
retrieved_sentences=_pattern.split(chunk1)
retrieved_sentences

['roughly 2 points from the Nuance and Xandr acquisitions, as well as investments in cloud engineering and LinkedIn. At a total company level, headcount at the end of March was 9% higher than a year ago.',
 'Operating income increased 10% and 15% in constant currency, including 4 points due to the change in accounting estimate.',
 'Operating margins increased roughly 1 point year-over-year to 42%.',
 'Excluding the impact of the change in accounting estimate, operating margins decreased slightly and increased slightly in constant currency.',
 'Now to our segment results.',
 'Revenue from Productivity and Business Processes was $17.5 billion and  Now to our segment results.',
 'Revenue from Productivity and Business Processes was $17.5 billion and grew 11% and 15% in constant currency, ahead of expectations primarily driven by better-than-expected results in Office commercial.',
 'Office commercial revenue grew 13% and 17% in constant currency.',
 'Office 365 commercial revenue increase

In [14]:
retrieved_sentences2=_pattern.split(chunk2)
retrieved_sentences2

['comments.',
 "And it's also important, I think, to distinguish between what I would say is macro or absolute performance and relative performance, because I think that's perhaps a good way to think about how we manage our business.",
 'First is optimizations do continue.',
 'In fact, we are focused on it.',
 "We incent our people to help our customers with optimization, because we believe, in the long run, that's the best way to secure the loyalty and long-term contracts with customers, when they know that they can count on a cloud provider like us to help them continuously optimize their workflow.",
 "That's sort of the fundamental benefit of public cloud, and we're taking every  provider like us to help them continuously optimize their workflow.",
 "That's sort of the fundamental benefit of public cloud, and we're taking every opportunity to prove that out with customers in real time.",
 "The second thing I'd say is we do have new workloads started, because if you think about it, d

### Save Eval Set to a csv


**Note on text structuring and format:** We could chain another LLM call to convert the output into a format suitable for saving to CSV, or modify the prompt so that the model returns the answers in that format directly. Updating the prompt for a particular format is left as an exercise for the user. Here we split and reorganize the model's output in Python, for the promptflow sample.

**Note on csv dataset format choice:** CSV is a customer requirement. A more robust approach would be to log the question-answer pairs to a database and populate the evaluation set from there.

In [15]:
qa_string = response.choices[0].message.content
# Parse the string into rows
rows = [row.strip() for row in qa_string.split('\n') if row.strip()]

# Separate questions and answers
# (assumes the rows strictly alternate between "Question N: ..." and "Answer: ..." lines)
questions = [row.split(": ", 1)[1] for i, row in enumerate(rows) if i % 2 == 0]
answers = [row.split(": ", 1)[1] for i, row in enumerate(rows) if i % 2 == 1]

# Combine into a list of dictionaries
qa_data = [{"chat_history": "[]", "question": q, "answer": a} for q, a in zip(questions, answers)]

In [16]:
qa_data

[{'chat_history': '[]',
  'question': 'What was the percentage increase in operating income at the end of March compared to a year ago for MSFT FY23 Q3?',
  'answer': 'Operating income increased 10% and 15% in constant currency, including 4 points due to the change in accounting estimate.'},
 {'chat_history': '[]',
  'question': 'How did the headcount at the end of March in MSFT FY23 Q3 compare to a year ago?',
  'answer': 'Headcount at the end of March was 9% higher than a year ago.'},
 {'chat_history': '[]',
  'question': 'What was the revenue from Productivity and Business Processes in MSFT FY23 Q3?',
  'answer': 'Revenue from Productivity and Business Processes was $17.5 billion and grew 11% and 15% in constant currency.'},
 {'chat_history': '[]',
  'question': 'What was the growth rate of Office commercial revenue in MSFT FY23 Q3?',
  'answer': 'Office commercial revenue grew 13% and 17% in constant currency.'},
 {'chat_history': '[]',
  'question': 'How much did the Office 365 co
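
The index-based split above relies on the model returning strictly alternating question and answer lines. A regular expression is one way to make the parsing tolerate a cut-off or missing line. The sketch below is an assumption-laden alternative, not the notebook's method: `QA_PATTERN` and `parse_qa` are names introduced here, and `sample` is a shortened illustration of the model output format.

```python
import re

# Matches "Question 3: ..." followed by "Answer: ..." on a later line.
QA_PATTERN = re.compile(
    r"Question\s*\d*:\s*(?P<question>.+?)\s*\n\s*Answer:\s*(?P<answer>.+)"
)


def parse_qa(text):
    """Return chat-style rows for every complete question/answer pair in text."""
    return [
        {
            "chat_history": "[]",
            "question": m["question"].strip(),
            "answer": m["answer"].strip(),
        }
        for m in QA_PATTERN.finditer(text)
    ]


# Shortened illustration; the last question is cut off and has no answer.
sample = (
    "Question 1: How did headcount compare to a year ago?\n"
    "Answer: Headcount at the end of March was 9% higher than a year ago.\n"
    "\n"
    "Question 2: What was the growth rate of paid "
)
rows = parse_qa(sample)
print(len(rows), rows[0]["question"])
```

Incomplete pairs are skipped rather than shifting every later answer onto the wrong question.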

In [17]:
import csv

csv_file_path = "../datasets/generated_eval_set.csv"
# Writing to CSV file
with open(csv_file_path, mode='w', newline='') as file:
    writer = csv.DictWriter(file, fieldnames=["chat_history", "question", "answer"])
    
    # Write headers
    writer.writeheader()
    
    # Write data
    writer.writerows(qa_data)

print(f"Data has been saved to {csv_file_path}.")

Data has been saved to ../datasets/generated_eval_set.csv.


AI as a Judge: A stronger AI model is added as a judge to evaluate the generated questions and answers.

In [21]:
def get_ai_eval_score(question_answer_set, chunk1 = chunk1, chunk2 = chunk2):
    
    
    # eval_prompt= f""" For the given set of questions and answers, determine if the answer is relevant to either of the contexts.
    # Set of questions and answers: {question_answer_set}
    # Context1: {chunk1}
    # Context2: {chunk2}
    # Evaluate the accuracy of the question answer extraction model.
    # Given a set of 10 questions and answers, the context sources Context1 and Context2,  determine if the model's question and answer is relevant to one of the contexts.
    # Return only a single score of 0 or 1 indicating whether the model's question and answer is grounded in either of the contexts. Rewrite each of the questions and answers, and add a line for score you give. For example:
    # Question: <question goes here>
    # Answer: <answer goes here>
    # Score: <score goes here>
    # """ 
    response_eval = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": "You are an evaluator of questions and answers generated from the given text."},
            {"role": "user", "content": eval_prompt2.format(question_answer_set=question_answer_set, chunk1=chunk1, chunk2=chunk2)}
        ]
    )

    return response_eval.choices[0].message.content

In [22]:
response_eval_score = get_ai_eval_score(response.choices[0].message.content, chunk1, chunk2)


In [23]:
print(response_eval_score)

Question 1: What was the percentage increase in operating income at the end of March compared to a year ago for MSFT FY23 Q3?
Answer: Operating income increased 10% and 15% in constant currency, including 4 points due to the change in accounting estimate.
Score: 4
Description: This question is directly answered by Context1, which contains the specific percentages and reasons for the increase in operating income.

Question 2: How did the headcount at the end of March in MSFT FY23 Q3 compare to a year ago?
Answer: Headcount at the end of March was 9% higher than a year ago.
Score: 4
Description: The question is answered by Context1 indicating the specific percentage by which the headcount increased.

Question 3: What was the revenue from Productivity and Business Processes in MSFT FY23 Q3?
Answer: Revenue from Productivity and Business Processes was $17.5 billion and grew 11% and 15% in constant currency.
Score: 4
Description: Context1 provides direct information about revenue from Produ
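
To use the judge's scores programmatically, the text reply can be parsed into records and filtered. The following is a rough sketch that assumes the reply keeps the `Question` / `Answer` / `Score` layout shown above. `JUDGE_PATTERN`, `parse_judge_output`, and the `MIN_SCORE` cut-off are hypothetical, and the score scale depends on `eval_prompt2`, which is not shown here.

```python
import re

# One record per "Question / Answer / Score" group in the judge's reply.
JUDGE_PATTERN = re.compile(
    r"Question\s*\d*:\s*(?P<question>.+)\n"
    r"Answer:\s*(?P<answer>.+)\n"
    r"Score:\s*(?P<score>\d+)"
)


def parse_judge_output(text):
    """Extract question, answer and integer score from the judge's text reply."""
    return [
        {
            "question": m["question"].strip(),
            "answer": m["answer"].strip(),
            "score": int(m["score"]),
        }
        for m in JUDGE_PATTERN.finditer(text)
    ]


# Shortened copy of one scored item in the format printed above.
sample = (
    "Question 2: How did the headcount at the end of March in MSFT FY23 Q3 compare to a year ago?\n"
    "Answer: Headcount at the end of March was 9% higher than a year ago.\n"
    "Score: 4\n"
    "Description: The question is answered by Context1.\n"
)
scored = parse_judge_output(sample)
MIN_SCORE = 3  # assumed cut-off; pick one that matches the scale in eval_prompt2
kept = [row for row in scored if row["score"] >= MIN_SCORE]
print(kept)
```

Only the pairs in `kept` would then be written to the evaluation CSV.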