# Custom Chatbot Project

To build a chatbot on top of a foundation model such as `gpt-3.5-turbo-instruct`, we need a dataset that lets us implement retrieval-augmented generation (RAG).
The "Character Description" dataset was selected for the following reasons:
* Clear separation of data categories
* Clean format
* Contextually appropriate for building the context of a prompt

## Imports and Config

In [8]:
import openai
import pandas as pd
import tiktoken
import numpy as np
from openai.embeddings_utils import get_embedding, distances_from_embeddings

EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

## Data Wrangling

In [9]:
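# Set the OpenAI API base URL and key.
# The key is read from the environment so that it is not stored in the notebook;
# the variable name OPENAI_API_KEY is an assumption, adjust it to your setup.
import os

openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = os.environ["OPENAI_API_KEY"]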

In [17]:
# Read the csv file from local path
df = pd.read_csv("./data/character_descriptions.csv")

In [18]:
# Create a new 'text' column by combining all relevant columns
df['text'] = df.apply(lambda row: f"{row['Name']}: {row['Description']} This is a {row['Medium']} set in {row['Setting']}.", axis=1)

# Keep only the 'text' column
df_transformed = df[['text']]

In [19]:
# Save the transformed data to a new csv file
df_transformed.to_csv("./data/character_transformed.csv", index=False)

## Custom Query Completion

### Creating an Embeddings Index with `openai.Embedding`

In [33]:
def get_costum_embedding(text):
    """Return the embedding vector for a single piece of text."""
    response = openai.Embedding.create(
        model=EMBEDDING_MODEL_NAME,
        input=text
    )
    return response["data"][0]["embedding"]

In [34]:
# Generate embeddings for each character description.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
# that is raised when a column is set on a slice of another DataFrame.
df_transformed = df_transformed.assign(
    embeddings=df_transformed["text"].apply(get_costum_embedding)
)
# Save embeddings for future use
df_transformed.to_csv("./data/character_embeddings.csv", index=False)


### Finding Relevant Data with Cosine Similarity

In [6]:
def get_rows_sorted_by_relevance(prompt, df):
    # Get embeddings for the prompt text
    prompt_embeddings = get_embedding(prompt, engine=EMBEDDING_MODEL_NAME)

    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        prompt_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )

    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
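
To make the ranking step concrete, here is a minimal sketch of cosine-distance sorting in plain Python. It is not the `openai.embeddings_utils` implementation: `cosine_distance`, `rank_by_distance`, and the toy 3-dimensional vectors are hypothetical stand-ins for the real embeddings, which have far more dimensions.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query_vec, rows):
    """Sort (text, vector) pairs by ascending cosine distance to query_vec."""
    scored = [(cosine_distance(query_vec, vec), text) for text, vec in rows]
    return [text for _, text in sorted(scored)]


# Toy vectors (assumed values, not real embeddings)
toy_rows = [
    ("Donna: a seasoned performer on stage.", [0.1, 0.9, 0.2]),
    ("Jack: a businessman in his 40s.", [0.9, 0.1, 0.0]),
]
toy_query = [1.0, 0.0, 0.0]

ranked = rank_by_distance(toy_query, toy_rows)
print(ranked[0])
```

The row whose vector points in the direction closest to the query comes first, which mirrors the ascending sort on the `distances` column above.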

In [3]:
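# Load the embeddings from the csv file
import ast

df_embedding = pd.read_csv("./data/character_embeddings.csv")
# The csv stores each embedding as a string, so parse it back into a numpy array.
# ast.literal_eval only accepts Python literals, which is safer than eval.
df_embedding["embeddings"] = df_embedding["embeddings"].apply(ast.literal_eval).apply(np.array)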
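
The parsing step is needed because a CSV file stores each embedding list as text. The sketch below reproduces that round trip with the standard library only; the in-memory buffer and the toy values are assumptions for illustration, not the project's data.

```python
import ast
import csv
import io

# Write one row the way a list-valued column ends up in a csv: as its string form
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "embeddings"])
writer.writerow(["toy row", str([0.25, -0.5, 1.0])])

# Read it back: the embedding column is now a plain string
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["embeddings"]

# ast.literal_eval parses Python literals only, so it cannot run arbitrary code
parsed = ast.literal_eval(raw)
print(type(raw).__name__, type(parsed).__name__)
```

After parsing, the value is a list of floats again and can be converted to a numpy array for the distance computation.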

### Tokenizing with `tiktoken` and Composing a Prompt

In [2]:
def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")

    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""

    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))

    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:

        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count

        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
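
The context-selection loop in `create_prompt` is a greedy fill against a token budget. The sketch below isolates that logic; `count_tokens` is a hypothetical stand-in that counts whitespace-separated words instead of calling `tiktoken`, and `pack_context` is an illustrative name, not part of the notebook.

```python
def count_tokens(text):
    """Hypothetical stand-in for len(tokenizer.encode(text)): counts words."""
    return len(text.split())


def pack_context(texts, base_token_count, max_token_count):
    """Keep texts in order until adding the next one would exceed the budget."""
    selected = []
    used = base_token_count  # tokens already spent on the template and question
    for text in texts:
        used += count_tokens(text)
        if used > max_token_count:
            break
        selected.append(text)
    return selected


# Rows are assumed to be sorted from most to least relevant
toy_texts = ["one two three", "four five", "six seven eight nine"]
packed = pack_context(toy_texts, base_token_count=2, max_token_count=8)
print(packed)
```

Because the rows arrive sorted by relevance, stopping at the first row that does not fit keeps the most relevant text. Note that, as in `create_prompt`, the separators joined between rows are not counted, so the final prompt can run slightly over the budget.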

## Custom Performance Demonstration

In [4]:
def answer_question(
    question, df, max_prompt_tokens=2000, max_answer_tokens=500
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model

    If the model produces an error, return an empty string
    """

    prompt = create_prompt(question, df, max_prompt_tokens)

    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

### Question 1

In [10]:
custom_answer = answer_question("who old is jack from england?", df_embedding)

In [11]:
custom_answer

'40s'

### Question 2

In [55]:
custom_answer = answer_question("Give me an acter from USA with Muscial medium?", df_embedding)

In [56]:
custom_answer

'Donna, Johnny, Dolly, Crystal, Karma, Sable, Olivia'

In [None]:
while True:
    user_input = input("You: ")
    if user_input.lower() in ["exit", "quit", "bye"]:
        print("Chatbot: Goodbye!")
        break
    response = answer_question(user_input, df_embedding)
    print("Chatbot:", response)


Chatbot: Jack is described as a middle-aged man in his 40s, a successful businessman, and Sarah's boss. He has a no-nonsense attitude and is fiercely loyal to his friends and family. He is married to Alice and the play is set in England.
Chatbot: Donna could be considered an actor from USA with Musical medium as she is a seasoned performer on stage in a Musical set in USA.
