## Benchmark Question-Answer Generation

This notebook demonstrates how to generate a set of questions and answers from chunks stored in a database. Data sampled from the WikiHow dataset is used as an example.

### SET UP AND CONFIGURATION

Load Environment File

In [6]:
from dotenv import dotenv_values
from azure.keyvault.secrets import SecretClient
from azure.identity import DefaultAzureCredential
import openai

# specify the path to the .env file
env_name = "../../../.env" # change to your own .env file name
config = dotenv_values(env_name)

In [7]:
"""
Remember to remove the key from your code when you're done, and never post it publicly. For production, use
secure methods to store and access your credentials. For more information, see 
https://docs.microsoft.com/en-us/azure/cognitive-services/cognitive-services-security?tabs=command-line%2Ccsharp#environment-variables-and-application-configuration
"""

if config['KEYS_FROM'] == "KEYVAULT":
    print('keyvault was selected.')
    keyVaultName = config["KEY_VAULT_NAME"]
    KVUri = f"https://{keyVaultName}.vault.azure.net"

    credential = DefaultAzureCredential()
    client = SecretClient(vault_url=KVUri, credential=credential)
    openai.api_type = client.get_secret("OPENAI-API-TYPE").value
    openai.api_key = client.get_secret("OPENAI-API-KEY").value
    openai.api_base = client.get_secret("OPENAI-API-BASE").value
    openai.api_version = client.get_secret("OPENAI-API-VERSION").value
    
else:
    print('.env was selected.')
    openai.api_type = config["OPENAI_API_TYPE"] 
    openai.api_key = config["OPENAI_API_KEY"]
    openai.api_base = config["OPENAI_API_BASE"] 
    openai.api_version = config["OPENAI_API_VERSION"] 

.env was selected.
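Before calling the API, it can help to confirm that the selected configuration actually contains every setting the cell above reads. The sketch below is only an illustration and is not part of the original notebook: `find_missing_keys` and `REQUIRED_KEYS` are hypothetical names, and the key list simply mirrors the `.env` branch above.

```python
# Hypothetical helper (an assumption, not from the original notebook):
# reports which required settings are absent or empty in a config mapping.
REQUIRED_KEYS = [
    "OPENAI_API_TYPE",
    "OPENAI_API_KEY",
    "OPENAI_API_BASE",
    "OPENAI_API_VERSION",
]

def find_missing_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are missing or have an empty value."""
    return [key for key in required if not config.get(key)]

# Made-up demo values; a real run would pass the dict returned by dotenv_values.
demo_config = {
    "OPENAI_API_TYPE": "azure",
    "OPENAI_API_KEY": "placeholder",
    "OPENAI_API_BASE": "",
}
missing_keys = find_missing_keys(demo_config)
print(missing_keys)
```

Failing early on a non-empty result is usually easier to debug than a `KeyError` or an authentication error later in the notebook.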


In [8]:
openai.api_base

'https://appliedaistudio-aiservices1658415937.openai.azure.com'

Read chunks from CSV (see the step2 notebook in the preprocessing subdirectory)

In [14]:
import os, ast
import pandas as pd 

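# Build the path with os.path.join so it works on any OS and avoids the invalid '\.' escape sequence
df = pd.read_csv(os.path.join(os.getcwd(), '..', 'data', 'wikihow1000.csv'))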

df.head()

Unnamed: 0,headline,title,text
0,\nPlace skis with bases up on a solid surface....,How to Repair Gouges in a Ski's Base,"If you have access to a vice, secure skis in ..."
1,"\nUse “raise” as instructed.,\nUse ""rise” as i...",How to Use Commonly Misused Words12,"\n\n\nRaise means ""lift"" and is a transitive v..."
2,\nWalk to a place with obstacles or enemies th...,How to Die in Mario Games5,",,"
3,\nSlide your feet into the back pockets of the...,How to Make Jean Slippers,The pockets should fit your feet comfortably....
4,"\nUtilize the Firefox add-on/Extensions page.,...",How to Install Gmail Notifier Add on for Firefox,"Click on the link with add-ons\n\n, You get a..."


In [15]:
df.shape

(1000, 3)

#### Prompt Template
##### Write a prompt template. The template should include every placeholder key (here `{n}` and `{chunk_text1}`), so that values can be supplied when the prompt is formatted.

In [16]:
template = """
        You are given a chunk of text, which represents a part of a WikiHow knowledge base article, as input. 
        You will generate {n} relevant question and answer pairs based on the input. Questions may be high-level or specific.
        The questions should be formed based on information in the chunk of text. But don't use exact phrasing from the given text; assume that the person who asks the question has never seen the article and is wondering about some topic that's covered in it.
        The answers should be available in the text. Do not generate answers on your own.  If an answer is not available in the text, just write N/A.
               
        An example is:   "How do I plant a cherry blossom tree in my garden?"
       
        You should analyze the content of the article and think of possible questions that might lead a user to it. Questions may be of different types: problem-solving, open, reasoning, etc.
        input_text: {chunk_text1}
        """

#### Randomly pick subset of data 

In [34]:
df_elements = df.sample(n=50)
df_elements.head()

Unnamed: 0,headline,title,text,qa
239,"\nQuit the program.,\nDrag the icon off the Do...",How to Add and Remove a Program Icon From the ...,All programs will appear on the Dock while th...,"[{'chat_history': '[]', 'question': 'How can I..."
703,"\nHold your breath.When you hold your breath, ...",How to Get Rid of Hiccups When You Are Drunk1,Since hiccups seem to be associated with a re...,
169,"\nPrep the chocolate.,\nBoil the cream.,\nBlen...",How to Make Chocolate Whipped Cream3,Chop 4 ounces dark chocolate into small piece...,
887,"\nMake a home for it!,\nGet it some clothes!,\...",How to Care for a Webkinz Dreamy Sheep,First get a box decorate with stickers and co...,
676,"\nStart your Xbox 360.\n\n,\nWhen it is starti...",How to Delete the Cache on Your Xbox 3602,",, The Xbox should run faster and have improve...",


In [19]:
chunk1 = df_elements['text'].iloc[0]
chunk1

' To create perfect candy cane lips, you want the white to be very pigmented and opaque. In order to achieve this, you want to conceal the natural redness of your lips that could show through the white lipstick. Use a makeup sponge to dab concealer over your lips, making sure to cover the entire surface.\n\n\n\n\n\n\n\n, After applying you concealer, it’s important to set it with a powder. Otherwise, you’ll be applying lip product over a creamy base, and you’ll end up with a smudged mess. Use your makeup sponge to press the powder into your lips, covering all of the concealer.\n\n\n\n\n\n\n\n\nBrush off any excess powder with a fluffy brush.\n\n, You can purchase a white lip liner at any beauty supply store. First, outline your lips entirely. This is a helpful first step, because it will define the border that has been concealed. Instead of just lining your lips, however, you’re going to use the lip liner to fill in the entirety of your lips.\n\n\n\n\n\n\n\n, The white lip liner create

In [20]:
df_elements.shape

(50, 3)

#### Generate Questions (using Azure OpenAI only)

In [21]:
import os
from openai import AzureOpenAI

client = AzureOpenAI(
  api_key = openai.api_key,  
  api_version = openai.api_version,
  azure_endpoint = openai.api_base
)

response = client.chat.completions.create(
    model="gpt-4-32k", # model = "deployment_name".
    messages=[
        {"role": "system", "content":"You are a creative generator of questions and answers for the given text." },
        {"role": "user", "content": template.format(chunk_text1=chunk1, n = 1)}
    ]
)

#print(response)
# print(response.model_dump_json(indent=2))
print(response.choices[0].message.content)

# TODO: Add cells showing how to add tools to the chat completion. Tools are an update to the function calling feature, which is not available in the current version of the API (OpenAI version > 1.0.0).

Question: How can I create a candy cane lip makeup look?

Answer: To create a candy cane lip makeup look, begin by concealing the natural redness of your lips with a concealer, making it very pigmented and opaque. Set the concealer with a powder to avoid a smudged mess. Next, purchase a white lip liner and use it to outline and fill your lips completely. Then, apply a white lipstick over the lip liner. If the lipstick is creamy, use translucent powder, but this is unnecessary for matte lipstick. The final step involves applying red liquid lipstick in diagonal stripes alternated between thick and thin to mimic the pattern of candy canes.


In [22]:
def generate_qa(chunk1, n1=1, source=''):
    """Generate n1 question-answer pairs for a text chunk and return them as a list of dicts."""
    response = client.chat.completions.create(
        model="gpt-4-32k", # model = "deployment_name".
        messages=[
            {"role": "system", "content":"You are a generator of questions and answers for the given text." },
            {"role": "user", "content": template.format(chunk_text1=chunk1, n = n1)}
        ]
    )

    qa_string = response.choices[0].message.content
    # Parse the string into rows
    rows = [row.strip() for row in qa_string.split('\n') if row.strip()]

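    # Separate questions and answers (assumes alternating "Question: ..." / "Answer: ..." rows;
    # a row without ": " raises IndexError, which the retry loop in the next cell catches)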
    questions = [row.split(": ", 1)[1] for i, row in enumerate(rows) if i % 2 == 0]
    answers = [row.split(": ", 1)[1] for i, row in enumerate(rows) if (i - 1) % 2 == 0]

    # Combine into a list of dictionaries
    qa_data = [{"chat_history": "[]", "question": q, "answer": a, "source": source} for q, a in zip(questions, answers)]

    return qa_data
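The row-based parsing above depends on the model returning strictly alternating single-line `Question: ...` and `Answer: ...` rows. As a hedged alternative, the sketch below pairs the labels with a regular expression so that multi-line answers and numbered labels are tolerated. `parse_qa_pairs` and `QA_PATTERN` are hypothetical names, the sample response is made up, and the approach still assumes the model emits those two labels.

```python
import re

# Hypothetical pattern (an assumption, not the notebook's method): a "Question:" label,
# its text, an "Answer:" label, and the answer text up to the next question or the end.
QA_PATTERN = re.compile(
    r"Question(?:\s*\d+)?\s*:\s*(?P<question>.+?)\s*"
    r"Answer(?:\s*\d+)?\s*:\s*(?P<answer>.+?)\s*"
    r"(?=Question(?:\s*\d+)?\s*:|\Z)",
    re.DOTALL | re.IGNORECASE,
)

def parse_qa_pairs(qa_string, source=""):
    """Return one dict per Question/Answer pair found in the model response."""
    return [
        {
            "chat_history": "[]",
            "question": match.group("question"),
            "answer": match.group("answer"),
            "source": source,
        }
        for match in QA_PATTERN.finditer(qa_string)
    ]

# Made-up response with a multi-line answer and numbered labels.
sample_response = (
    "Question: How can I do X?\n\n"
    "Answer: Do A.\nThen do B.\n\n"
    "Question 2: Why?\n"
    "Answer 2: Because."
)
pairs = parse_qa_pairs(sample_response, source="Demo article")
print(len(pairs))
```

A response without any labels yields an empty list instead of raising, so the caller can decide whether to retry. One limitation of this sketch: an answer that itself contains the text "Question:" would be split at that point.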

In [36]:
import time

max_retries=3

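df_elements = df_elements.reset_index(drop=True)
# Make sure 'qa' exists as an object column so each cell can hold a list of dicts
if 'qa' not in df_elements.columns:
    df_elements['qa'] = None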

for index, row in df_elements.iterrows():
        print(f"start working on {index}")
        retries = 0
        while retries < max_retries:
            try:
                df_elements.at[index, 'qa'] = generate_qa(row['text'],2,row['title'])
                print(f"successfully completed processing {index}")       
                break  # Successfully applied function, exit loop
            except Exception as e:
                print(f"Error applying function to row {index}: {e}")
                print("context: ", row['text'])
                retries += 1
                time.sleep(1)  # Optional: Add a delay between retries

start working on 0
successfully completed processing 0
start working on 1
successfully completed processing 1
start working on 2
successfully completed processing 2
start working on 3
successfully completed processing 3
start working on 4
successfully completed processing 4
start working on 5
successfully completed processing 5
start working on 6
successfully completed processing 6
start working on 7
successfully completed processing 7
start working on 8
successfully completed processing 8
start working on 9
successfully completed processing 9
start working on 10
successfully completed processing 10
start working on 11
successfully completed processing 11
start working on 12
successfully completed processing 12
start working on 13
successfully completed processing 13
start working on 14
successfully completed processing 14
start working on 15
successfully completed processing 15
start working on 16
successfully completed processing 16
start working on 17
successfully completed processi
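The retry loop above waits a fixed second between attempts. A common variation, sketched here under stated assumptions, wraps the call in a small helper with exponential backoff. `with_retries`, `base_delay`, and the `flaky` demo function are hypothetical names that do not appear in the original notebook.

```python
import time

def with_retries(func, *args, max_retries=3, base_delay=1.0, **kwargs):
    """Call func until it succeeds or max_retries attempts have failed.

    Returns the function result, or None if every attempt raised.
    The wait doubles after each failure: base_delay, 2*base_delay, ...
    """
    for attempt in range(max_retries):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            print(f"attempt {attempt + 1} failed: {exc}")
            if attempt < max_retries - 1:
                time.sleep(base_delay * 2 ** attempt)
    return None

# Demo with a made-up function that fails twice before succeeding.
call_count = {"n": 0}

def flaky(value):
    call_count["n"] += 1
    if call_count["n"] < 3:
        raise ValueError("temporary failure")
    return value * 2

result = with_retries(flaky, 21, base_delay=0)
print(result)
```

In the loop above, the call would become something like `with_retries(generate_qa, row['text'], 2, row['title'])`; a `None` result then corresponds to a row that is dropped during cleaning.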

Print output

In [37]:
df_elements

Unnamed: 0,headline,title,text,qa
0,"\nQuit the program.,\nDrag the icon off the Do...",How to Add and Remove a Program Icon From the ...,All programs will appear on the Dock while th...,"[{'chat_history': '[]', 'question': 'How can I..."
1,"\nHold your breath.When you hold your breath, ...",How to Get Rid of Hiccups When You Are Drunk1,Since hiccups seem to be associated with a re...,"[{'chat_history': '[]', 'question': 'What are ..."
2,"\nPrep the chocolate.,\nBoil the cream.,\nBlen...",How to Make Chocolate Whipped Cream3,Chop 4 ounces dark chocolate into small piece...,"[{'chat_history': '[]', 'question': 'What is t..."
3,"\nMake a home for it!,\nGet it some clothes!,\...",How to Care for a Webkinz Dreamy Sheep,First get a box decorate with stickers and co...,"[{'chat_history': '[]', 'question': 'How can I..."
4,"\nStart your Xbox 360.\n\n,\nWhen it is starti...",How to Delete the Cache on Your Xbox 3602,",, The Xbox should run faster and have improve...","[{'chat_history': '[]', 'question': 'What are ..."
5,\nTake a deep breath when you are angry: Thoug...,How to Have Anger and Be Christian,";\n, If angry at a person, it generally helps ...","[{'chat_history': '[]', 'question': 'What can ..."
6,"\nGet a small cooler.,\nPlace your ice tray, m...",How to Make Clear Ice2,"Just a regular cooler is fine, like the one y...","[{'chat_history': '[]', 'question': 'What is t..."
7,"\nPosition the back light.,\nAdjust the intens...",How to Use Three Point Lighting3,The back light's role is to provide a glowing...,"[{'chat_history': '[]', 'question': 'What is t..."
8,"\nRemove your old countertop.,\nMeasure the co...",How to Make a DIY Countertop,Work slowly and carefully to ensure that you ...,"[{'chat_history': '[]', 'question': 'What are ..."
9,\nSauté sliced onion in olive oil until starti...,How to Cook Zucchini and Tomato Tian,Add red pepper slices and cook until tender. ...,"[{'chat_history': '[]', 'question': 'Why is eg..."


We have to clean the data, because at least one row has no valid `qa` value (generation or parsing failed on every retry, so the cell was left empty).

In [39]:
qa1 = df_elements['qa'].iloc[0]
qa1

[{'chat_history': '[]',
  'question': 'How can I identify if a program is open on the Dock?',
  'answer': 'A program is open if it has a small dot next to the Dock icon, even if no windows are open.',
  'source': 'How to Add and Remove a Program Icon From the Dock of a Mac Computer2'},
 {'chat_history': '[]',
  'question': 'What steps should I follow to remove a program from the Dock?',
  'answer': 'You can remove a program from the Dock by right-clicking the icon (or holding Control and clicking) and selecting "Quit" or "Force Quit" to close the program. Then, click and hold the program icon and drag it at least a third of the way across the screen, away from the Dock. Make sure not to release the mouse button too soon as the program will just jump back to the Dock. Wait until the program icon turns transparent. Then you can release it and see an animation resembling a poof of smoke, indicating the program icon has been removed from the Dock. Or use a drop down menu by right-clicking 

In [41]:
df_clean = df_elements[df_elements['qa'].notna()]
df_clean.shape

(49, 4)

In [44]:
# Explode the 'qa' column into separate rows
df_expanded = df_clean.explode('qa')

# Extract 'chat_history', 'question', and 'answer' from the exploded 'qa' column
df_expanded['chat_history'] = df_expanded['qa'].apply(lambda x: x['chat_history'])
df_expanded['question'] = df_expanded['qa'].apply(lambda x: x['question'])
df_expanded['answer'] = df_expanded['qa'].apply(lambda x: x['answer'])

# Drop the original 'qa' column
df_expanded.drop(columns=['qa'], inplace=True)
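To make the explode step easier to follow, here is a small self-contained sketch on made-up data. It uses `pd.json_normalize` instead of one `apply` per key, which is an alternative to the cell above rather than the notebook's own method; the `toy` frame and its values are invented for illustration.

```python
import pandas as pd

# Toy frame: each 'qa' cell holds a list of dicts, as in df_clean.
toy = pd.DataFrame({
    "title": ["How to A", "How to B"],
    "qa": [
        [{"question": "q1", "answer": "a1"}, {"question": "q2", "answer": "a2"}],
        [{"question": "q3", "answer": "a3"}],
    ],
})

# explode() emits one row per list element and repeats the other columns.
exploded = toy.explode("qa").reset_index(drop=True)

# json_normalize() turns the list of dicts into one column per key.
fields = pd.json_normalize(exploded["qa"].tolist())

# Join the new columns back onto the exploded rows (indexes match after reset_index).
flat = pd.concat([exploded.drop(columns=["qa"]), fields], axis=1)
print(flat)
```

Resetting the index after `explode` matters here: without it the exploded rows keep duplicate index labels (as in the `df_expanded` output, where `0` appears twice) and the column-wise `concat` would not align.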


In [47]:
df_expanded = df_expanded.rename(columns={'text': 'context', 'title': 'source'})
df_expanded.head()

Unnamed: 0,headline,source,context,chat_history,question,answer
0,"\nQuit the program.,\nDrag the icon off the Do...",How to Add and Remove a Program Icon From the ...,All programs will appear on the Dock while th...,[],How can I identify if a program is open on the...,A program is open if it has a small dot next t...
0,"\nQuit the program.,\nDrag the icon off the Do...",How to Add and Remove a Program Icon From the ...,All programs will appear on the Dock while th...,[],What steps should I follow to remove a program...,You can remove a program from the Dock by righ...
1,"\nHold your breath.When you hold your breath, ...",How to Get Rid of Hiccups When You Are Drunk1,Since hiccups seem to be associated with a re...,[],What are some recommended techniques to stop h...,The text suggests several techniques such as h...
1,"\nHold your breath.When you hold your breath, ...",How to Get Rid of Hiccups When You Are Drunk1,Since hiccups seem to be associated with a re...,[],What should one do if the hiccups persist for ...,The text suggests that if hiccups last longer ...
2,"\nPrep the chocolate.,\nBoil the cream.,\nBlen...",How to Make Chocolate Whipped Cream3,Chop 4 ounces dark chocolate into small piece...,[],What is the process of making chocolate ganach...,"To make chocolate ganache with dark chocolate,..."


In [48]:
df_out = df_expanded[["chat_history", "question", "answer", "source", "context"]]
df_out.head(3)

Unnamed: 0,chat_history,question,answer,source,context
0,[],How can I identify if a program is open on the...,A program is open if it has a small dot next t...,How to Add and Remove a Program Icon From the ...,All programs will appear on the Dock while th...
0,[],What steps should I follow to remove a program...,You can remove a program from the Dock by righ...,How to Add and Remove a Program Icon From the ...,All programs will appear on the Dock while th...
1,[],What are some recommended techniques to stop h...,The text suggests several techniques such as h...,How to Get Rid of Hiccups When You Are Drunk1,Since hiccups seem to be associated with a re...


**Note on text structuring and format:** We could chain another LLM call to convert the output to a format suitable for saving to CSV. Alternatively, the prompt can be modified so that the model returns the answer in that form directly; updating the prompt for a particular target format is left as an exercise for the user. Here we split and reorganize the model output using Python, for the promptflow sample.
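As one possible direction for that exercise, the sketch below assumes the prompt is extended to request a JSON array and then parses the response with the standard `json` module. `JSON_FORMAT_INSTRUCTION`, `parse_json_qa`, and the sample response are hypothetical; whether a given model follows the instruction reliably would need to be checked.

```python
import json

# Hypothetical instruction that could be appended to the prompt template.
JSON_FORMAT_INSTRUCTION = (
    "Return only a JSON array of objects, each with the keys "
    '"question" and "answer".'
)

def parse_json_qa(raw_response, source=""):
    """Parse a JSON array of question/answer objects into the notebook's row format."""
    items = json.loads(raw_response)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of question/answer objects")
    return [
        {
            "chat_history": "[]",
            "question": item["question"],
            "answer": item["answer"],
            "source": source,
        }
        for item in items
    ]

# Made-up model response used only to exercise the parser.
sample_json = '[{"question": "How do I start?", "answer": "Follow step one."}]'
json_rows = parse_json_qa(sample_json, source="Demo article")
print(json_rows)
```

Malformed JSON raises `json.JSONDecodeError` and a missing key raises `KeyError`, so a retry loop can treat both like any other failed attempt.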

**Note on CSV dataset format choice:** CSV is a customer requirement. A more robust approach would be to log the question-answer pairs to a database and populate the dataset from there.

In [49]:
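# Build the path with os.path.join so it works on any OS and avoids the invalid '\.' escape sequence
df_out.to_csv(os.path.join(os.getcwd(), '..', 'data', 'wikihow_eval_qa.csv'), index=False)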
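The database alternative mentioned in the note on the CSV format could look roughly like the sketch below, which uses the standard-library `sqlite3` module with an in-memory database. The table name `eval_qa`, the helper `save_qa_rows`, and the demo row are assumptions for illustration; a real deployment would choose its own database and schema.

```python
import sqlite3

def save_qa_rows(conn, rows):
    """Create the (hypothetical) eval_qa table if needed and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS eval_qa ("
        "chat_history TEXT, question TEXT, answer TEXT, source TEXT, context TEXT)"
    )
    # Named placeholders are filled from the keys of each row dict.
    conn.executemany(
        "INSERT INTO eval_qa VALUES "
        "(:chat_history, :question, :answer, :source, :context)",
        rows,
    )
    conn.commit()

# In-memory database and a made-up row, so the sketch has no side effects.
conn = sqlite3.connect(":memory:")
demo_rows = [
    {
        "chat_history": "[]",
        "question": "How do I start?",
        "answer": "Follow step one.",
        "source": "Demo article",
        "context": "Step one: ...",
    }
]
save_qa_rows(conn, demo_rows)
stored = conn.execute("SELECT question, answer FROM eval_qa").fetchall()
print(stored)
```

Rows in this shape can be obtained from the final frame with `df_out.to_dict("records")`, since its columns match the five keys used here.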