# Demo: Question Answering on your own data using the OpenAI SDK with Azure OpenAI embeddings

## Summary

In this demo, we use the [OpenAI SDK](https://github.com/openai/openai-python) with the [Azure OpenAI service](https://learn.microsoft.com/azure/cognitive-services/openai/overview) to answer questions about the [meals](https://us.pycon.org/2023/onsite/meal-ingredients/) planned at PyCon. The model itself is not retrained: we embed the meal data, retrieve the entries most similar to each question, and pass them to the model as context.

> Note: access to the Azure OpenAI service is by approval only. Please see [How do I get access to Azure OpenAI?](https://learn.microsoft.com/azure/cognitive-services/openai/overview#how-do-i-get-access-to-azure-openai) for more information.

## Prerequisites and install instructions

- Python 3.7+
- An Azure OpenAI resource (or alternatively an OpenAI account)

To install the necessary requirements to run the demo, run:

`pip install -r requirements.txt`

## Authentication

In this demo, we will configure the library to use the Azure OpenAI service and authenticate using Azure Active Directory (AAD). Alternatively, you can use an API key (see [here](https://github.com/openai/openai-python#microsoft-azure-endpoints)).

In [1]:
import os
import openai
import azure.identity
import dotenv

env_file = '/.env'
dotenv.load_dotenv(env_file, override=True)

openai.api_type = "azure_ad"  # using azure endpoints with AAD auth
openai.api_base = os.environ["OPENAI_API_BASE"]  # the endpoint value
openai.api_version = "2023-03-15-preview"  # API version, subject to change

credential = azure.identity.DefaultAzureCredential()
token = credential.get_token("https://cognitiveservices.azure.com/.default")

openai.api_key = token.token  # set the key to the value of the AccessToken
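
The cell above fetches the token once, but AAD access tokens expire, so a long-running session may need to fetch a new one. The following is a minimal sketch of one way to handle that, not part of the original demo. `RefreshingToken` and its parameters are hypothetical names, and `AccessToken` here is a local stand-in that only mirrors the `token` and `expires_on` fields of the object returned by `credential.get_token(...)`.

```python
import time
from typing import Callable, NamedTuple, Optional


class AccessToken(NamedTuple):
    """Local stand-in with the two fields used here: token and expires_on."""
    token: str
    expires_on: int  # expiry as a Unix timestamp in seconds


class RefreshingToken:
    """Caches a token and fetches a new one shortly before it expires (hypothetical helper)."""

    def __init__(
        self,
        fetch: Callable[[], AccessToken],
        skew_seconds: int = 300,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch          # callable that returns a fresh token
        self._skew = skew_seconds    # refresh this many seconds before expiry
        self._clock = clock          # injectable clock, useful for testing
        self._cached: Optional[AccessToken] = None

    def get(self) -> str:
        """Return a token string, refreshing it if it is missing or about to expire."""
        expired = (
            self._cached is None
            or self._clock() >= self._cached.expires_on - self._skew
        )
        if expired:
            self._cached = self._fetch()
        return self._cached.token
```

In the notebook, `fetch` could be `lambda: credential.get_token("https://cognitiveservices.azure.com/.default")`, with `openai.api_key = refreshing_token.get()` assigned before each batch of API calls.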

## Code

In [9]:
"""We start by scraping the meal table from the PyCon webpage and converting it into a DataFrame."""

import pandas as pd
import numpy as np

# read the table from the PyCon meal webpage
html = pd.read_html('https://us.pycon.org/2023/onsite/meal-ingredients/')
df = html[0]

# remove all the empty rows at the bottom
df = df.dropna(how='all')
df = df.reset_index()

# need to fix ingredients which drop to second row
# df["INGREDIENTS_SHIFTED"] = df["INGREDIENTS"].shift(-1)
# df["INGREDIENTS_SHIFTED"] = df["INGREDIENTS_SHIFTED"].fillna('')

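# forward-fill the OPTION label; assigning the result avoids the chained
# in-place call, which newer pandas versions warn about and may not apply
df['OPTION'] = df['OPTION'].ffill()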
df = df.dropna(how='any')

# group the row text together under a new column
df["text"] = df.apply(lambda x: f"{x['OPTION']}, {x['MEAL']}, {x['MENU ITEM']}, {x['INGREDIENTS']}", axis=1)
df.to_csv('data.csv') 
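
To make the cleaning steps concrete, here is a dependency-free sketch of what forward-filling `OPTION`, dropping incomplete rows, and joining the columns into `text` amounts to. It is an illustration only: the rows are made up, the helper names are hypothetical, and it assumes a layout in which the `OPTION` label can be missing on a row and should be inherited from the row above.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]
COLUMNS = ("OPTION", "MEAL", "MENU ITEM", "INGREDIENTS")


def forward_fill(rows: List[Row], column: str) -> List[Row]:
    """Copy the last seen non-missing value of `column` into rows where it is missing."""
    filled, last = [], None
    for row in rows:
        if row[column] is not None:
            last = row[column]
        filled.append({**row, column: last})
    return filled


def drop_incomplete(rows: List[Row]) -> List[Row]:
    """Keep only rows with no missing values (like dropna(how='any'))."""
    return [row for row in rows if all(value is not None for value in row.values())]


def to_text(row: Row) -> str:
    """Join the four columns into one comma-separated text chunk."""
    return ", ".join(row[name] for name in COLUMNS)


# Made-up rows for illustration; None marks a missing cell.
rows = [
    {"OPTION": "MEAT", "MEAL": None, "MENU ITEM": None, "INGREDIENTS": None},
    {"OPTION": None, "MEAL": "Lunch", "MENU ITEM": "Soup", "INGREDIENTS": "Water, Salt"},
    {"OPTION": "VEGAN", "MEAL": "Lunch", "MENU ITEM": "Salad", "INGREDIENTS": "Lettuce"},
]
texts = [to_text(row) for row in drop_incomplete(forward_fill(rows, "OPTION"))]
```

The first row carries only a label, so it is dropped after its `OPTION` value has been propagated to the row below it.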

In [10]:
"""The text is tokenized using `tiktoken` and the number of tokens per text chunk is saved to the DataFrame."""

import tiktoken

# Load the cl100k_base encoding, which is the one used by the text-embedding-ada-002 model
tokenizer = tiktoken.get_encoding("cl100k_base")

# Tokenize the text and save the number of tokens to a new column
df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))

In [15]:
"""An embedding is generated for each text chunk. API calls are subject to rate limits, so we use the `backoff` library (as one example) to retry with exponential backoff."""

import openai
import backoff

# `engine` is the name of the Azure OpenAI embedding deployment; replace it with your own
@backoff.on_exception(backoff.expo, openai.error.RateLimitError)
def get_embeddings(x):
    return openai.Embedding.create(input=x, engine='text-embedding-ada-002-2')['data'][0]['embedding']

df['embeddings'] = df.text.apply(get_embeddings)
df.to_csv('embeddings.csv')
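
For readers unfamiliar with exponential backoff, the sketch below shows the idea with the standard library only. It is not the `backoff` library's implementation: `retry_with_backoff`, its parameters, and the `RateLimited` exception (a stand-in for `openai.error.RateLimitError`) are hypothetical names chosen for illustration.

```python
import random
import time
from functools import wraps


class RateLimited(Exception):
    """Hypothetical stand-in for openai.error.RateLimitError."""


def retry_with_backoff(exceptions, max_tries=5, base=1.0, factor=2.0, sleep=time.sleep):
    """Retry the wrapped function, waiting longer after each failed attempt."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_tries):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    # Give up and re-raise once the last allowed attempt has failed
                    if attempt == max_tries - 1:
                        raise
                    # Wait a random time up to base * factor**attempt (1s, 2s, 4s, ...)
                    sleep(random.uniform(0, base * factor ** attempt))
        return wrapper
    return decorator
```

The random wait (jitter) keeps many clients from retrying at the same instant. With the `backoff` library itself, retries can be bounded by passing `max_tries` or `max_time` to `backoff.on_exception`.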

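In [17]:
"""The embeddings are loaded from the CSV file, converted into numpy arrays, and the first rows are displayed."""

import ast

import pandas as pd
import numpy as np

df = pd.read_csv('embeddings.csv', index_col=0)
# The CSV stores each embedding as a string; ast.literal_eval parses it as a
# Python literal without executing arbitrary code, unlike eval
df['embeddings'] = df['embeddings'].apply(ast.literal_eval).apply(np.array)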

df.head()

| Unnamed: 0 | index | OPTION | MEAL | MENU ITEM | INGREDIENTS | text | n_tokens | embeddings | distances |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | MEAT | Wednesday Tutorial Lunch | Greek Chicken Power Bowl | Grilled Chicken, Mixed Greens, Cucumber, Tomat... | MEAT, Wednesday Tutorial Lunch, Greek Chicken ... | 27 | [-0.013481014408171177, -0.02020479552447796, ... | 1.032127 |
| 3 | 5 | MEAT | Thursday Tutorial Lunch | Beef Stir Fry | Beef, Stir Fried Vegetables, Rice w/Peas & Car... | MEAT, Thursday Tutorial Lunch, Beef Stir Fry, ... | 27 | [-0.008427430875599384, -0.019243421033024788,... | 1.022146 |
| 5 | 9 | MEAT | Friday Conference Breakfast | Turkey Sausage & Egg Burrito | Turkey Sausage, Egg, Cheese, Flour Tortilla Wrap | MEAT, Friday Conference Breakfast, Turkey Saus... | 29 | [-0.015813421458005905, -0.015404226258397102,... | 1.062437 |
| 6 | 12 | MEAT | Friday Conference Lunch | Grilled Chicken Pasta Bowl | Grilled Chicken, Pasta, Broccoli, Alfredo Sauce | MEAT, Friday Conference Lunch, Grilled Chicken... | 25 | [-0.00466617988422513, -0.018455589190125465, ... | 1.04607 |
| 7 | 15 | MEAT | Saturday Conference Breakfast | Bacon & Egg English Muffin | Bacon, Egg, Cheese, English Muffin | MEAT, Saturday Conference Breakfast, Bacon & E... | 25 | [-0.006820477079600096, -0.014287382364273071,... | 1.065281 |
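
Saving a list of floats to CSV turns it into text, which is why the `embeddings` column has to be parsed after loading. The sketch below reproduces that round trip with the standard library and made-up values, and parses the string with `ast.literal_eval`, which accepts only Python literals. It illustrates the parsing step and is not the notebook's own code.

```python
import ast
import csv
import io

# One made-up row; the embedding is a short list instead of a real vector
rows = [{"text": "MEAT, Lunch", "embeddings": [0.1, -0.2, 0.3]}]

# Write the row to an in-memory CSV file; the list is stored as its string form
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "embeddings"])
writer.writeheader()
writer.writerows(rows)

# Read it back: every field is now a plain string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[0]["embeddings"]

# Parse the string back into a list of floats without evaluating arbitrary code
parsed = ast.literal_eval(raw)
```

Unlike `eval`, `ast.literal_eval` raises an error on input such as a function call, so a tampered CSV file cannot run code during loading.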


In [18]:
"""The user asks a question, and an embedding is created for it. The question embedding is compared with the embedding of each text chunk (our meal data) using cosine distance, and the closest chunks are collected, nearest first, until the token budget `max_len` is reached.
The generative model is then prompted with the question and those chunks as context. It is instructed to answer from the context and to say "I don't know" if the context does not contain the answer."""

from openai.embeddings_utils import distances_from_embeddings

def create_context(
    question, df, max_len=1800, size="ada"
):
    """
    Create a context for a question by finding the most similar context from the dataframe
    """

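    # Get the embedding for the question. `engine` is the Azure deployment name;
    # it must serve the same embedding model that produced df['embeddings'].
    q_embeddings = openai.Embedding.create(input=question, engine='text-embedding-ada-002')['data'][0]['embedding']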

    # Get the distances from the embeddings
    df['distances'] = distances_from_embeddings(q_embeddings, df['embeddings'].values, distance_metric='cosine')


    returns = []
    cur_len = 0

    # Sort by distance and add the text to the context until the context is too long
    for i, row in df.sort_values('distances', ascending=True).iterrows():
        
        # Add the length of the text to the current length
        cur_len += row['n_tokens'] + 4
        
        # If the context is too long, break
        if cur_len > max_len:
            break
        
        # Else add it to the text that is being returned
        returns.append(row["text"])

    # Return the context
    return "\n\n###\n\n".join(returns)

def answer_question(
    df,
    model="gpt-35-turbo",
    question="what is for lunch on wednesday?",
    max_len=1800,
    size="ada",
    debug=False,
):
    """
    Answer a question based on the most similar context from the dataframe texts
    """
    context = create_context(
        question,
        df,
        max_len=max_len,
        size=size,
    )

    if debug:
        print("Context:\n" + context)
        print("\n\n")

    messages = [{"role": "system", "content": "Answer the question in your own words based on the context below, and if the question can't be answered based on the context, say \"I don't know\"\n\n"}]
    messages.append({"role": "user", "content": f"Context: {context}\n\n---\n\nQuestion: {question}\nAnswer:"})
    response = openai.ChatCompletion.create(
        messages=messages,
        engine=model,
    )
    answer = response['choices'][0]['message']['content'].strip()
    return answer
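
The retrieval logic in `create_context` can be sketched without pandas or the API: rank chunks by cosine distance to the question, then add them nearest first until the token budget is exceeded. The version below is a simplified illustration with made-up two-dimensional vectors. `cosine_distance` and `select_context` are hypothetical names, and cosine distance is assumed to mean 1 minus cosine similarity.

```python
import math
from typing import List, Sequence, Tuple

# A chunk is (text, token count, embedding vector)
Chunk = Tuple[str, int, Sequence[float]]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def select_context(
    question_vec: Sequence[float],
    chunks: List[Chunk],
    max_len: int,
    overhead: int = 4,
) -> List[str]:
    """Pick the nearest chunks, in order, until the token budget would be exceeded."""
    ranked = sorted(chunks, key=lambda chunk: cosine_distance(question_vec, chunk[2]))
    selected, used = [], 0
    for text, n_tokens, _ in ranked:
        # Count the chunk plus a few tokens for the separator, as in the notebook
        used += n_tokens + overhead
        if used > max_len:
            break
        selected.append(text)
    return selected


# Made-up data: "A" matches the question exactly, "C" partially, "B" not at all
chunks = [("A", 10, [1.0, 0.0]), ("B", 10, [0.0, 1.0]), ("C", 10, [1.0, 1.0])]
context = select_context([1.0, 0.0], chunks, max_len=30)
```

With a budget of 30 tokens and 14 tokens counted per chunk, only the two nearest chunks fit, so the unrelated chunk is left out of the prompt.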


In [19]:
answer_question(df, question="I hate butternut squash. What meals should I avoid?")

'You should avoid the Vegan Power Bowl served on Wednesday Tutorial Lunch as it contains butternut squash.'