<a href="https://colab.research.google.com/github/kayblevision/Multilingual-ITS-using-GPT/blob/main/GPT_answer_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# starting with accessing the API

Begin with the necessary import statements. You need your own OpenAI API key to access the API; keep it private and never share it or commit it to a public notebook.
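
One way to avoid pasting the key into a cell is to read it from the environment and fall back to a hidden prompt. The sketch below is only an illustration: `load_api_key` is a hypothetical helper, not part of the `openai` library, and it assumes the key lives in the `OPENAI_API_KEY` environment variable.

```python
import os
from getpass import getpass


def load_api_key(env=os.environ, prompt=getpass):
    """Return the API key from the environment, prompting only if it is missing.

    Hypothetical helper for illustration; not part of the openai package.
    """
    key = env.get("OPENAI_API_KEY")
    if not key:
        # getpass hides what is typed, so the key does not appear in the notebook output
        key = prompt("OpenAI API key: ")
    if not key:
        raise ValueError("No API key provided")
    return key
```

In the notebook this could be used as `os.environ["OPENAI_API_KEY"] = load_api_key()` before creating the client.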

In [None]:
!pip install --upgrade openai

In [None]:
import os
import openai
from openai import OpenAI

In [None]:
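# Replace the placeholder with your own key; remove it again before sharing the notebook
os.environ["OPENAI_API_KEY"] = 'your api key goes here'
openai.api_key = os.getenv("OPENAI_API_KEY")  # optional: OpenAI() below also reads OPENAI_API_KEY from the environment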

client = OpenAI()

In [None]:
# testing access
response = client.embeddings.create(
  model="text-embedding-ada-002",
  input="The food was delicious and the waiter..."
)

print(response)

# embeddings

Retrieve embeddings for a file you are working with. I'm using a .csv file for this project.

In [None]:
!pip install openai[embeddings]
!pip install tiktoken
!pip install utils

In [None]:
import pandas as pd
import numpy as np
import tiktoken
import matplotlib.pyplot as plt

Specify which embedding model you want to use. OpenAI has a few to choose from, listed [here](https://platform.openai.com/docs/guides/embeddings).
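
Embedding models also cap the input length in tokens, so long rows may need to be truncated before they are embedded. The sketch below shows one way to do that. `truncate_to_tokens` and `WhitespaceEncoding` are hypothetical names used for illustration; the stand-in tokenizer simply splits on whitespace so the example runs offline. In the notebook, an encoding object from `tiktoken.get_encoding("cl100k_base")` could be passed instead, since it offers the same `encode` and `decode` methods.

```python
def truncate_to_tokens(text, encoding, max_tokens=8000):
    """Cut text to at most max_tokens tokens using the given encoding object."""
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return encoding.decode(tokens[:max_tokens])


class WhitespaceEncoding:
    """Stand-in tokenizer for illustration only: one token per whitespace-separated word."""

    def encode(self, text):
        return text.split()

    def decode(self, tokens):
        return " ".join(tokens)


short = truncate_to_tokens("one two three four five", WhitespaceEncoding(), max_tokens=3)
print(short)  # one two three
```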

In [None]:
def get_embedding(text, model="text-embedding-3-large"):
   text = text.replace("\n", " ")
   return client.embeddings.create(input = [text], model=model).data[0].embedding

In [None]:
embedding_model = "text-embedding-3-large"
embedding_encoding = "cl100k_base"
max_tokens = 8000  # the maximum input for text-embedding-3-large is 8191 tokens; choose a limit that suits your data

Loading the data (if using Google Colab)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
data = '/path/to/data/goes/here/data.csv'
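# header=None labels the columns 0, 1, 2, ...; drop it if your CSV has a header row with column names
df = pd.read_csv(data, header=None)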

Obtain embeddings for your text data and save them in a new column in your dataframe.
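
Saving to CSV stores each embedding list as text, which is why a later cell applies `ast.literal_eval` after reloading the file. The sketch below reproduces that round trip with the standard library's `csv` module and made-up numbers, so it runs without pandas or an API call; pandas behaves the same way for list-valued cells.

```python
import ast
import csv
import io

embedding = [0.12, -0.34, 0.56]  # made-up values standing in for a real embedding

# Write one row to an in-memory CSV; the list has to be stored as text
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "embedding"])
writer.writerow(["example sentence", str(embedding)])

# Read it back: the embedding column is now a string, not a list
buffer.seek(0)
rows = list(csv.DictReader(buffer))
stored = rows[0]["embedding"]
print(type(stored).__name__)  # str

# ast.literal_eval safely parses the string back into a Python list
restored = ast.literal_eval(stored)
print(restored == embedding)  # True
```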

In [None]:
import ast  # for converting embeddings saved as strings back to arrays
from scipy import spatial  # for calculating vector similarities for search

In [None]:
df['embedding'] = df['column_to_receive_embeddings'].apply(lambda x: get_embedding(x, model=embedding_model))
df.to_csv('/file/save/path/name.csv', index=False) # save the file so you can access/load it later

In [None]:
embed = pd.read_csv('/file/save/path/name.csv')
embed.head()

In [None]:
embed['embedding'] = embed['embedding'].apply(ast.literal_eval)

# GPT modeling

Define functions that introduce new data to the GPT model for question answering. These include a search function that ranks texts by the distance between their embeddings and the query embedding, and a query message function that takes a user query, retrieves relevant texts, and builds a message for GPT.
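
To make the ranking step concrete, here is a minimal sketch of cosine-similarity ranking on toy two-dimensional vectors. The helper names and the vectors are invented for illustration; real embeddings have thousands of dimensions, and the notebook's own search function uses `scipy.spatial.distance.cosine` rather than this hand-written version.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_relatedness(query_vec, texts_and_vecs, top_n=2):
    """Return the top_n (text, score) pairs, most related first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in texts_and_vecs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors, not real embeddings
toy = [("cats", [1.0, 0.0]), ("dogs", [0.8, 0.6]), ("tax law", [0.0, 1.0])]
ranked = rank_by_relatedness([1.0, 0.0], toy)
print([text for text, _ in ranked])  # ['cats', 'dogs']
```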

Reference code and helpful article found [here](https://cookbook.openai.com/examples/question_answering_using_embeddings).
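
The referenced article builds its message by appending retrieved texts until a token budget is reached. The sketch below illustrates that idea only; it is not the article's code. `build_message` is a hypothetical name, and the default `count_tokens` counts whitespace-separated words as a rough stand-in for a real tokenizer such as `tiktoken`.

```python
def build_message(query, ranked_texts, token_budget,
                  count_tokens=lambda s: len(s.split())):
    """Append ranked texts to the prompt until the next one would exceed the budget."""
    message = "Use the texts below to answer the question."
    question = f"\n\nQuestion: {query}"
    for text in ranked_texts:
        candidate = f"\n\nText:\n{text}"
        # Stop as soon as adding this text would push the whole message over budget
        if count_tokens(message + candidate + question) > token_budget:
            break
        message += candidate
    return message + question


msg = build_message("what?", ["alpha beta", "gamma delta epsilon"], token_budget=14)
print(msg)
```

With a budget of 14 words, only the first text fits, so the second is left out of the message.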

In [None]:
GPT_MODEL='gpt-3.5-turbo' # you can use this one or one of the newer gpt models

In [None]:
# search function for the dataset
def strings_ranked_by_relatedness(query: str, df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100
) -> tuple[list[str], list[float]]:
    """Returns a list of strings and relatednesses, sorted from most related to least."""
    query_embedding_response = client.embeddings.create(
        model=embedding_model,
        input=query,
    )
    query_embedding = query_embedding_response.data[0].embedding
    strings_and_relatednesses = [
        (row["column_with_original_text"], relatedness_fn(query_embedding, row["embedding"]))
        for i, row in df.iterrows()  # iterate over the dataframe passed in, not the global `embed`
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    return strings[:top_n], relatednesses[:top_n]

def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def query_message(
    query: str,
    df: pd.DataFrame = embed,
    model: str = GPT_MODEL,
    token_budget: int = 8000,
) -> str:
    """Return a message for GPT."""
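    # Rank the rows of df against the query. The ranked strings are retrieved here
    # but are not added to the message built below.
    strings, relatednesses = strings_ranked_by_relatedness(query, df)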
    introduction = '''Use the information from the (text containing information) column to answer the subsequent question as succinctly
    as possible. If the answer cannot be found in the data, write "I could not find an answer."'''
    question = f"\n\nQuestion: {query}"
    message = introduction
    return message + question

Now an ask function is defined to return GPT's answer. This code assumes input as a dataframe with relevant text, questions, and embeddings in the same row.

In [None]:
def ask(
    df: pd.DataFrame = df,
    model: str = GPT_MODEL,
    token_budget: int = 8000, # specify your own token budget
    print_message: bool = False,
) -> pd.DataFrame:
    """Answers questions based on (informational text) provided in the same row."""
    responses = []
    for index, row in df.iterrows():
        context = row['column_with_informational_text'] # replace with column names from your dataframe
        question = row['question_column'] # replace with column names from your dataframe

        context_message = query_message(context, model=model, token_budget=token_budget)
        question_message = query_message(question, model=model, token_budget=token_budget)

        if print_message:
            print("Context message:", context_message)
            print("Question message:", question_message)

        messages = [
            {"role": "system", "content": "You answer questions about (relevant topic) based on the (relevant text)."},
            {"role": "user", "content": context_message},
            {"role": "user", "content": question_message}
        ]
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0
        )
        response_message = response.choices[0].message.content  # one completion is requested, so the answer is the first choice
        responses.append(response_message)

    df['response'] = responses
    return df

For a dataframe containing questions to be answered, this code returns the same dataframe with a new 'response' column holding GPT's answers based on your provided data. The function can also be adjusted to answer a single question at a time, as in the reference code linked above.
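
As a rough illustration of a single-answer variant, the sketch below takes one question and one context string and returns one answer. `ask_single` is a hypothetical helper rather than the reference code's function, and the prompt wording is an assumption. The stub client exists only so the example runs without an API key; in the notebook, the real `client` created earlier would be passed instead.

```python
from types import SimpleNamespace


def ask_single(question, context, client, model="gpt-3.5-turbo"):
    """Send one context and one question to the chat API and return the answer text."""
    messages = [
        {"role": "system", "content": "You answer questions based on the provided text."},
        {"role": "user", "content": f"Text:\n{context}\n\nQuestion: {question}"},
    ]
    response = client.chat.completions.create(
        model=model, messages=messages, temperature=0
    )
    return response.choices[0].message.content


# Stub that mimics the shape of the client's response, for offline illustration only
def fake_create(model, messages, temperature):
    reply = SimpleNamespace(message=SimpleNamespace(content="stub answer"))
    return SimpleNamespace(choices=[reply])


fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=fake_create))
)
print(ask_single("What is X?", "X is a letter.", fake_client))  # stub answer
```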

Access the dataframe with answers by calling the ask() function.

In [None]:
answer = ask()
answer.head(10)

In [None]:
answer.to_csv('/file/path/here/filename.csv', index=False) # save the new dataframe to access it later