## Benchmark Question-Answer Generation

This notebook demonstrates how to generate a set of questions and answers from chunks of the source documents. For the MSFT transcripts sample, documents are chunked before insertion into a database, and the chunks are also saved to a `.csv` file. Generating questions and answers at this stage is beneficial because it avoids the added cost of searching the database. Randomly picking chunks for the specified filter parameters helps keep the questions and answers diverse.

### SET UP AND CONFIGURATION

Load Environment File

In [1]:
from dotenv import dotenv_values
from azure.keyvault.secrets import SecretClient
from azure.identity import DefaultAzureCredential
import openai

# Path to the .env file
env_name = "../../.env" # change to your own .env file name
config = dotenv_values(env_name)

In [2]:
"""
Remember to remove the key from your code when you're done, and never post it publicly. For production, use
secure methods to store and access your credentials. For more information, see 
https://docs.microsoft.com/en-us/azure/cognitive-services/cognitive-services-security?tabs=command-line%2Ccsharp#environment-variables-and-application-configuration
"""

if config['KEYS_FROM'] == "KEYVAULT":
    print('keyvault was selected.')
    keyVaultName = config["KEY_VAULT_NAME"]
    KVUri = f"https://{keyVaultName}.vault.azure.net"

    credential = DefaultAzureCredential()
    client = SecretClient(vault_url=KVUri, credential=credential)
    openai.api_type = client.get_secret("OPENAI-API-TYPE").value
    openai.api_key = client.get_secret("OPENAI-API-KEY").value
    openai.api_base = client.get_secret("OPENAI-API-BASE").value
    openai.api_version = client.get_secret("OPENAI-API-VERSION").value
    
else:
    print('.env was selected.')
    openai.api_type = config["OPENAI_API_TYPE"] 
    openai.api_key = config["OPENAI_API_KEY"]
    openai.api_base = config["OPENAI_API_BASE"] 
    openai.api_version = config["OPENAI_API_VERSION"] 

keyvault was selected.
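
The two branches above differ only in where the four values come from. A minimal sketch of that selection as a single function follows. The names `load_openai_settings` and `get_secret` are hypothetical, and `get_secret` stands in for `client.get_secret(name).value` so that the logic can be exercised without Azure access:

```python
from typing import Callable, Dict, Optional

# Setting names used by the cell above; Key Vault secret names use hyphens,
# .env keys use underscores.
SETTING_NAMES = [
    "OPENAI_API_TYPE",
    "OPENAI_API_KEY",
    "OPENAI_API_BASE",
    "OPENAI_API_VERSION",
]


def load_openai_settings(
    config: Dict[str, str],
    get_secret: Optional[Callable[[str], str]] = None,
) -> Dict[str, str]:
    """Return the four OpenAI settings from Key Vault or from the .env config.

    Hypothetical helper for this sketch, not part of the original notebook.
    """
    if config.get("KEYS_FROM") == "KEYVAULT":
        if get_secret is None:
            raise ValueError("KEYS_FROM is KEYVAULT but no secret getter was given")
        return {name: get_secret(name.replace("_", "-")) for name in SETTING_NAMES}
    missing = [name for name in SETTING_NAMES if name not in config]
    if missing:
        raise KeyError(f"Missing settings in .env: {missing}")
    return {name: config[name] for name in SETTING_NAMES}


# Usage with a fake vault (placeholder values), so no network call is made.
fake_vault = {
    "OPENAI-API-TYPE": "azure",
    "OPENAI-API-KEY": "not-a-real-key",
    "OPENAI-API-BASE": "https://example.invalid",
    "OPENAI-API-VERSION": "placeholder-version",
}
settings = load_openai_settings({"KEYS_FROM": "KEYVAULT"}, get_secret=fake_vault.__getitem__)
print(sorted(settings))
```

Returning a dictionary instead of setting module attributes makes a missing key fail early with a clear message.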


Read chunks from the CSV file (see the step2 notebook in the preprocessing subdirectory)

In [3]:
import numpy as np
import pandas as pd
df = pd.read_csv('AnalyzedPDF/ChunksEmbedding.csv')

In [4]:
df[df['Quarter']==2]

| Unnamed: 0 | Id | Ticker | Year | Quarter | Chunk | PageNumber | LineNumber | Embedding |
|---|---|---|---|---|---|---|---|---|
| 113 | 114 | MSFT | 23 | 2 | Microsoft FY23 Second Quarter Earnings Confere... | 1 | 1 | [-0.022043932, -0.023832329, -0.015447599, -0.... |
| 114 | 115 | MSFT | 23 | 2 | On the Microsoft Investor Relations website, y... | 1 | 9 | [-0.023697682, -0.005627374, -0.0051322975, -0... |
| 115 | 116 | MSFT | 23 | 2 | GAAP. They are included as additional clarifyi... | 1 | 17 | [-0.012550131, -0.0020706053, 0.007283737, -0.... |
| 116 | 117 | MSFT | 23 | 2 | same in constant currency, we will refer to th... | 1 | 25 | [-0.01768585, -0.02943631, -0.00054391, -0.015... |
| 117 | 118 | MSFT | 23 | 2 | predictions, projections, or other statements ... | 2 | 6 | [-0.009156934, -0.019673413, -0.0082705645, -0... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 216 | 217 | MSFT | 23 | 2 | BRETT IVERSEN: Thanks, Brad. Joe, we have time... | 31 | 13 | [0.0011864604, -0.04014092, 0.009777045, -0.01... |
| 217 | 218 | MSFT | 23 | 2 | the coming quarters? Thank you. SATYA NADELLA:... | 31 | 21 | [0.009347212, -0.01008223, 0.015255094, -0.009... |
| 218 | 219 | MSFT | 23 | 2 | going to be an Al app. That's, I think, the be... | 32 | 1 | [0.005495851, -0.003575635, 0.013053961, 0.000... |
| 219 | 220 | MSFT | 23 | 2 | Sometimes, you will have ISVs who are differen... | 32 | 9 | [-0.004339527, -0.028531296, 0.017532898, -0.0... |


In [5]:
df['Chunk'].iloc[11]

"Cosmos DB now supports PostgreSQL, making Azure the first cloud provider to offer a database service that supports both relational and NoSQL workloads. And, in Al, we are turning the world's most advanced models into platforms for customers. Earlier this month, we brought the power of Dall-E to Azure OpenAI service, helping customers like Mattel apply the breakthrough image generation model to commercial use cases for the first time. And Azure Machine Learning provides industry leading MLOps, helping organizations like 3M deploy, manage, and govern models. "

#### Prompt Template
##### Write a prompt template. It should include a placeholder for every filter key, so that the values can be filled in when the prompt is formatted.
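
A minimal sketch of checking that requirement before calling the model. It assumes a shortened stand-in template that uses the notebook's placeholder names (`chunk_text1`, `chunk_text2`, `ticker`, `quarter`, `year`). The helper names `placeholder_names` and `missing_placeholders` are hypothetical:

```python
from string import Formatter

# Shortened stand-in for the notebook's template, with the same placeholders.
demo_template = """
input_text1:
{chunk_text1}
input_text2:
{chunk_text2}
ticker: {ticker}
quarter: {quarter}
year: {year}
"""


def placeholder_names(template: str) -> set:
    """Collect the named fields that str.format will expect."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


def missing_placeholders(template: str, values: dict) -> set:
    """Return placeholder names in the template that `values` does not supply."""
    return placeholder_names(template) - set(values)


# "year" is deliberately left out to show what the check reports.
values = {
    "chunk_text1": "first chunk",
    "chunk_text2": "second chunk",
    "ticker": "MSFT",
    "quarter": "2",
}
print(missing_placeholders(demo_template, values))  # {'year'}
```

Running this check first turns a `KeyError` raised inside `str.format` into an explicit list of the filter keys that still need values.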

In [6]:
template = """
        You are given two chunks of text, a ticker e.g. MSFT, Quarter, Year, as input. You will generate 10 relevant question and answer pairs based on the input.
        The question should be formed based on information in both the chunks of text.
        The answers should be available in the two chunks of text. Do not generate answers on your own.  If answer is not available in the text, just write N/A.
               
        An example output for this example is: 

        Question: For {ticker} FY{year} Q{quarter}, what is the <question goes here>?
        Answer: example answer paraphrased from the relevant information in the given text goes here 

        Based on ticker, quarter, year, the question can be phrased in different ways e.g. MSFT FY23 Q1, MSFT FY2023 1st quarter, etc.
        In case the text question is not relevant, please skip the question and answer pair.
        input_text1: 
        {chunk_text1}
        input_text2:
        {chunk_text2}
        ticker: {ticker}
        quarter: {quarter}
        year: {year}
        """

#### Randomly pick filter parameters:

Filter parameters are picked at random and then used to extract the context (chunks). This helps diversify the generated questions and answers.
In the MSFT financial transcripts use case the only ticker is MSFT, but other ticker labels can easily be added for a larger financial dataset. `Year`, `Quarter`, and `Id` are the key parameters in this use case. In this notebook, we use two random chunks (chunk ids) drawn for the selected filter keys (year, quarter).
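
The selection described above can be sketched on plain Python records, so that it runs without the CSV file. The toy rows, the helper name `sample_filter_and_chunks`, and the fixed seed are assumptions for illustration; the same steps map onto the DataFrame filters used in the cells that follow:

```python
import random

# Toy records standing in for rows of the chunks CSV (values are made up).
rows = [
    {"Ticker": "MSFT", "Year": 23, "Quarter": 1, "Id": 1},
    {"Ticker": "MSFT", "Year": 23, "Quarter": 1, "Id": 2},
    {"Ticker": "MSFT", "Year": 23, "Quarter": 2, "Id": 3},
    {"Ticker": "MSFT", "Year": 23, "Quarter": 2, "Id": 4},
    {"Ticker": "MSFT", "Year": 22, "Quarter": 4, "Id": 5},
    {"Ticker": "MSFT", "Year": 22, "Quarter": 4, "Id": 6},
]


def sample_filter_and_chunks(rows, rng):
    """Pick Ticker, then Year, then Quarter, then two chunk Ids under that filter."""
    ticker = rng.choice(sorted({r["Ticker"] for r in rows}))
    in_ticker = [r for r in rows if r["Ticker"] == ticker]
    year = rng.choice(sorted({r["Year"] for r in in_ticker}))
    in_year = [r for r in in_ticker if r["Year"] == year]
    quarter = rng.choice(sorted({r["Quarter"] for r in in_year}))
    ids = sorted(r["Id"] for r in in_year if r["Quarter"] == quarter)
    if len(ids) >= 2:
        # sample() draws without replacement, so the two Ids never coincide.
        id1, id2 = rng.sample(ids, k=2)
    else:
        id1 = id2 = ids[0]
    return ticker, year, quarter, id1, id2


rng = random.Random(0)  # fixed seed so the same draw can be reproduced
print(sample_filter_and_chunks(rows, rng))
```

Narrowing the candidates at each level means that only combinations that exist in the data can be drawn, and seeding the generator makes a given eval set reproducible.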

###### TODO: Add tools (the updated version of function calls) once they are available for working with newer models. Function calls are currently deprecated for GPT models with versions beyond 07-01-2023.
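
As a starting point for that TODO, here is a sketch of what a tool definition for this task might look like, plus a parser for the arguments a tool call would return. The tool name `record_qa_pairs`, its schema, and the helper `parse_tool_arguments` are assumptions for illustration. No API call is made here:

```python
import json

# Hypothetical tool definition: the name and fields are assumptions for this
# sketch, not part of the original notebook.
qa_tool = {
    "type": "function",
    "function": {
        "name": "record_qa_pairs",
        "description": "Record question-answer pairs generated from the input text.",
        "parameters": {
            "type": "object",
            "properties": {
                "pairs": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "question": {"type": "string"},
                            "answer": {"type": "string"},
                        },
                        "required": ["question", "answer"],
                    },
                }
            },
            "required": ["pairs"],
        },
    },
}


def parse_tool_arguments(arguments_json: str) -> list:
    """Turn the JSON string of tool-call arguments into eval-set rows."""
    pairs = json.loads(arguments_json)["pairs"]
    return [
        {"chat_history": "[]", "question": p["question"], "answer": p["answer"]}
        for p in pairs
    ]


# Stand-in for `response.choices[0].message.tool_calls[0].function.arguments`.
example_arguments = json.dumps(
    {"pairs": [{"question": "Example question?", "answer": "Example answer."}]}
)
print(parse_tool_arguments(example_arguments))
```

With the `openai` 1.x client, such a definition would be passed as `tools=[qa_tool]`, optionally with `tool_choice={"type": "function", "function": {"name": "record_qa_pairs"}}`. Whether a given Azure deployment and API version accepts these parameters needs to be checked.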

In [7]:
Ticker = np.random.choice(df['Ticker'].unique())
Year = np.random.choice(df[df['Ticker']==Ticker]['Year'].unique())
Quarter = np.random.choice(df[(df['Ticker']==Ticker) & (df['Year']==Year)]['Quarter'].unique())
# Id = np.random.choice(df[(df['Quarter']==Quarter) & (df['Year']==Year)]['Id'].unique())
# Id2 = np.random.choice(df[(df['Quarter']==Quarter) & (df['Year']==Year) & (df['Id']!=Id)]['Id'].unique())

In [8]:
# Pages available for the sampled Ticker/Year/Quarter
pages = df[(df['Ticker']==Ticker) & (df['Year']==Year) & (df['Quarter']==Quarter)]['PageNumber'].unique()
# Draw two distinct pages when possible, so that the two chunks differ
pagenum1, pagenum2 = np.random.choice(pages, size=2, replace=len(pages) < 2)

In [9]:
chunk1 = df[(df['Ticker']==Ticker) & (df['Year']==Year) & (df['Quarter']==Quarter) & (df['PageNumber']==pagenum1)]['Chunk'].str.cat(sep=' ')
chunk2 = df[(df['Ticker']==Ticker) & (df['Year']==Year) & (df['Quarter']==Quarter) & (df['PageNumber']==pagenum2)]['Chunk'].str.cat(sep=' ')

In [10]:
len(chunk1)

1929

In [11]:
chunk1

"one question. Joe, can you please repeat your instructions? (Operator Direction.) KEITH WEISS, Morgan Stanley: Excellent. Thank you guys for taking the question and very nice end to a great fiscal year. Satya, you started out your comments talking about how every customer conversation has the customer asking you about utilizing generative AI technology and how fast they could utilize that generative AI technology. What's the answer? What do you tell them in terms of the pace with which that could get into the marketplace and your customers could start using it?  that could get into the marketplace and your customers could start using it? And then for Amy, how should investors think about just the fundamental gross margins behind these generative AI technologies? We understand there's going to be a lot of CapEx to ramp up underneath these, but what should we expect in terms of what the ultimate gross margin looks like underneath all these new generative AI solutions? Thank you. SATYA N

In [12]:
print(Ticker, Year, Quarter, pagenum1, pagenum2)

MSFT 23 4 29 12


#### Generate Questions (using Azure OpenAI only)

In [13]:
from openai import AzureOpenAI

# Note: this rebinds `client`, which held the Key Vault SecretClient in cell 2.
client = AzureOpenAI(
  api_key = openai.api_key,  
  api_version = openai.api_version,
  azure_endpoint = openai.api_base
)

response = client.chat.completions.create(
    model="gpt-35-turbo", # model = "deployment_name".
    messages=[
        {"role": "system", "content":"You are a generator of questions and answers for the given text." },
        {"role": "user", "content": template.format(chunk_text1=chunk1, chunk_text2=chunk2, 
                                                    ticker=Ticker, year=str(Year), quarter=str(Quarter))}
    ]
)

#print(response)
# print(response.model_dump_json(indent=2))
print(response.choices[0].message.content)

#TODO: Add cells that pass tools to the chat completion. Tools are an update to the function calling feature, which is deprecated in favor of tools in openai versions above 1.0.0.

Question 1: What are some AI-powered solutions that Microsoft is currently offering to customers?
Answer 1: Microsoft is offering GitHub Copilot, M365 Copilot, and Sales Copilot to customers.

Question 2: Are customers interested in using generative AI technology and how fast do they want to utilize it?
Answer 2: Yes, customers are interested in utilizing generative AI technology. However, no specific timeframe was mentioned in the given text.

Question 3: What AI-powered collaborative articles are driving the most traffic to LinkedIn?
Answer 3: LinkedIn's AI-powered collaborative articles are the fastest-growing traffic driver to the platform, but no specific type or topic was mentioned.

Question 4: How is LinkedIn using AI to ensure trust and authenticity amongst its members?
Answer 4: LinkedIn is helping its members verify their identities and places of work through integration with Microsoft Entra, as well as CLEAR and Hyperverge. 

Question 5: What are some new AI-powered feature

In [16]:
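import re
# Split on whitespace that follows "." or "?", except after abbreviation-like
# tokens such as "e.g." (first lookbehind) or "Mr." (second lookbehind).
_pattern = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s')
retrieved_sentences=_pattern.split(chunk1)
retrieved_sentences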

['one question.',
 'Joe, can you please repeat your instructions?',
 '(Operator Direction.) KEITH WEISS, Morgan Stanley: Excellent.',
 'Thank you guys for taking the question and very nice end to a great fiscal year.',
 'Satya, you started out your comments talking about how every customer conversation has the customer asking you about utilizing generative AI technology and how fast they could utilize that generative AI technology.',
 "What's the answer?",
 'What do you tell them in terms of the pace with which that could get into the marketplace and your customers could start using it?',
 ' that could get into the marketplace and your customers could start using it?',
 'And then for Amy, how should investors think about just the fundamental gross margins behind these generative AI technologies?',
 "We understand there's going to be a lot of CapEx to ramp up underneath these, but what should we expect in terms of what the ultimate gross margin looks like underneath all these new genera

In [17]:
retrieved_sentences2=_pattern.split(chunk2)
retrieved_sentences2

['fourth consecutive quarter.',
 'We continue to use AI to help our members and customers connect to opportunities and tap into experiences of experts on the platform.',
 "Our AI- powered collaborative articles are now the fastest-growing traffic driver to LinkedIn. And finally, we're helping LinkedIn stay trusted and authentic.",
 'More than 7 million members have verified who they are or where they work, many using new integrations with Microsoft Entra, as well as CLEAR and Hyperverge.',
 'Now, on to search, advertising, and news.',
 ' Hyperverge.',
 'Now, on to search, advertising, and news.',
 "While it's early in our journey, we are reshaping daily search and web habits with our Copilot for the web.",
 'This quarter, we introduced new AI-powered features, including multimodal capabilities with visual search in Bing Chat.',
 "We're expanding to businesses, with Bing Chat Enterprise, which offers commercial data protection, providing an easy on-ramp for any organization looking to g

### Save Eval Set to a csv


**Note on text structuring and format:** We could chain another LLM call to convert the output into a format suitable for saving to CSV, or modify the prompt so that the model answers in that form directly. Updating the prompt for a particular format is left as an exercise for the user. Here, for the promptflow sample, we split and reorganize the model output using Python.

**Note on csv dataset format choice:** CSV is a customer requirement. A more robust alternative would be to log the question-answer pairs to a database and populate the eval set from there.
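
As an illustration of the database alternative, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `eval_set`, the in-memory database, and the placeholder rows are assumptions; the notebook itself only writes a CSV file:

```python
import sqlite3

# Example rows in the same shape as the eval set (contents are placeholders).
qa_rows = [
    {"chat_history": "[]", "question": "Example question 1?", "answer": "Example answer 1."},
    {"chat_history": "[]", "question": "Example question 2?", "answer": "Example answer 2."},
]

# An in-memory database keeps the sketch self-contained; a file path or a
# server-backed database would be used in practice.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE eval_set ("
    "id INTEGER PRIMARY KEY, chat_history TEXT, question TEXT, answer TEXT)"
)
# Named placeholders are filled from the keys of each dictionary.
conn.executemany(
    "INSERT INTO eval_set (chat_history, question, answer) "
    "VALUES (:chat_history, :question, :answer)",
    qa_rows,
)
conn.commit()

stored = conn.execute("SELECT question, answer FROM eval_set ORDER BY id").fetchall()
print(stored)
conn.close()
```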

In [19]:
qa_string = response.choices[0].message.content
# Parse the string into rows
rows = [row.strip() for row in qa_string.split('\n') if row.strip()]

# Separate questions and answers (assumes strictly alternating Question/Answer lines)
questions = [row.split(": ", 1)[1] for i, row in enumerate(rows) if i % 2 == 0]
answers = [row.split(": ", 1)[1] for i, row in enumerate(rows) if i % 2 == 1]

# Combine into a list of dictionaries
qa_data = [{"chat_history": "[]", "question": q, "answer": a} for q, a in zip(questions, answers)]
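
The index-based split above relies on the model returning strictly alternating question and answer lines. A sketch of a label-based alternative follows, which pairs lines by their `Question`/`Answer` prefix and skips anything else. The helper name `parse_qa` and the sample text are assumptions for illustration:

```python
import re

# Matches "Question 3: text" or "Answer: text"; the number is optional.
_LINE = re.compile(r'^(Question|Answer)\s*\d*\s*:\s*(.*)$')


def parse_qa(text: str) -> list:
    """Pair each Question line with the Answer line that follows it."""
    pairs = []
    pending_question = None
    for raw in text.splitlines():
        match = _LINE.match(raw.strip())
        if not match:
            continue  # skip blank lines and any commentary from the model
        label, body = match.groups()
        if label == "Question":
            pending_question = body
        elif pending_question is not None:
            pairs.append(
                {"chat_history": "[]", "question": pending_question, "answer": body}
            )
            pending_question = None
    return pairs


sample_output = """Here are the pairs:

Question 1: What was announced?
Answer 1: A new feature was announced.

Question 2: What is unfinished"""
print(parse_qa(sample_output))
```

A question without an answer, such as the last line of the sample, is dropped instead of shifting every later pair. Answers that span several lines are not handled by this sketch.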

In [20]:
qa_data

[{'chat_history': '[]',
  'question': 'What are some AI-powered solutions that Microsoft is currently offering to customers?',
  'answer': 'Microsoft is offering GitHub Copilot, M365 Copilot, and Sales Copilot to customers.'},
 {'chat_history': '[]',
  'question': 'Are customers interested in using generative AI technology and how fast do they want to utilize it?',
  'answer': 'Yes, customers are interested in utilizing generative AI technology. However, no specific timeframe was mentioned in the given text.'},
 {'chat_history': '[]',
  'question': 'What AI-powered collaborative articles are driving the most traffic to LinkedIn?',
  'answer': "LinkedIn's AI-powered collaborative articles are the fastest-growing traffic driver to the platform, but no specific type or topic was mentioned."},
 {'chat_history': '[]',
  'question': 'How is LinkedIn using AI to ensure trust and authenticity amongst its members?',
  'answer': 'LinkedIn is helping its members verify their identities and places

In [21]:
import csv

csv_file_path = "../datasets/generated_eval_set.csv"
# Writing to CSV file
with open(csv_file_path, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=["chat_history", "question", "answer"])
    
    # Write headers
    writer.writeheader()
    
    # Write data
    writer.writerows(qa_data)

print(f"Data has been saved to {csv_file_path}.")

Data has been saved to ../datasets/generated_eval_set.csv.
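
Because answers often contain commas and quotation marks, it can be worth reading the file back to confirm that the rows survive the round trip. A minimal sketch follows, assuming a temporary file and a made-up row in place of the notebook's dataset path and generated pairs:

```python
import csv
import os
import tempfile

# Made-up row containing commas and quotes, the characters CSV must escape.
rows = [
    {
        "chat_history": "[]",
        "question": 'Why "quoted", with commas?',
        "answer": "Because answers, too, contain commas.",
    },
]
fieldnames = ["chat_history", "question", "answer"]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write to a temporary file rather than the notebook's dataset path.
    path = os.path.join(tmp_dir, "eval_set_check.csv")
    with open(path, mode="w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    # Read the file back and compare it with what was written.
    with open(path, mode="r", newline="", encoding="utf-8") as f:
        loaded = list(csv.DictReader(f))

print(loaded == rows)
```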
