# Custom Chatbot Project

In this project, I use a dataset of fictional character profiles, generated by OpenAI and provided by Udacity. I picked this dataset because I want to build a tool that acts as an expert on these imaginary characters and can answer questions about them.

To improve the answers, I used Retrieval Augmented Generation (RAG), implemented directly against the OpenAI API rather than through LangChain. RAG retrieves the dataset entries most relevant to a question and adds them to the prompt as context, which helps the model give more accurate and relevant answers.

The dataset suits this goal because the characters are not real, so the model cannot have learned about them during training. That makes it a good test of whether retrieved context lets the model answer questions that its training data alone cannot answer well.

## Data Wrangling


In [1]:
import pandas as pd

data = pd.read_csv('character_descriptions.csv')

In [2]:
data.head()

Unnamed: 0,Name,Description,Medium,Setting
0,Emily,"A young woman in her early 20s, Emily is an as...",Play,England
1,Jack,"A middle-aged man in his 40s, Jack is a succes...",Play,England
2,Alice,"A woman in her late 30s, Alice is a warm and n...",Play,England
3,Tom,"A man in his 50s, Tom is a retired soldier and...",Play,England
4,Sarah,"A woman in her mid-20s, Sarah is a free-spirit...",Play,England


In [3]:
unique_values = data['Setting'].unique()

for value in unique_values:
    print(value)

England
Texas
Australia
USA
Italy
Ancient Greece


In [4]:
unique_values = data['Medium'].unique()

for value in unique_values:
    print(value)

Play
Movie
Limited Series
Musical
Reality Show
Opera
Sitcom


In [5]:
def concatenate_with_column_names(row):
    # Combine the row's fields into one labelled text block used for embedding and retrieval.
    return (
        f"Actor: {row['Name']} \n Details of this actor: {row['Description']} \n "
        f"Medium or Industry of the actor: {row['Medium']} \n Country of the actor: {row['Setting']}"
    )

In [6]:
data['text'] = data.apply(concatenate_with_column_names, axis=1)

In [7]:
data.head(5)

Unnamed: 0,Name,Description,Medium,Setting,text
0,Emily,"A young woman in her early 20s, Emily is an as...",Play,England,Actor: Emily \n Details of this actor: A young...
1,Jack,"A middle-aged man in his 40s, Jack is a succes...",Play,England,Actor: Jack \n Details of this actor: A middle...
2,Alice,"A woman in her late 30s, Alice is a warm and n...",Play,England,Actor: Alice \n Details of this actor: A woman...
3,Tom,"A man in his 50s, Tom is a retired soldier and...",Play,England,Actor: Tom \n Details of this actor: A man in ...
4,Sarah,"A woman in her mid-20s, Sarah is a free-spirit...",Play,England,Actor: Sarah \n Details of this actor: A woman...


## Model

In [8]:
import openai
from typing import List, Union, Dict
from scipy.spatial.distance import cosine

In [9]:
OPENAI_API_KEY = "YOUR API KEY"
EMBEDDING_MODEL = 'text-embedding-3-small'
COMPLETION_MODEL = 'gpt-3.5-turbo'
BATCH_SIZE = 2

openai_client = openai.OpenAI(api_key=OPENAI_API_KEY)

In [10]:
def get_embeddings(prompt: Union[str, List[str]], embedding_model: str) -> List[List[float]]:
    # The API accepts a list of inputs, so a single string is wrapped in a list.
    response = openai_client.embeddings.create(
        input=prompt if isinstance(prompt, list) else [prompt],
        model=embedding_model
    )
    # One embedding vector is returned per input text.
    return [row.embedding for row in response.data]

def create_embeddings(df: pd.DataFrame, embedding_model_name: str = EMBEDDING_MODEL, batch_size: int = BATCH_SIZE) -> List[List[float]]:
    # Embed the 'text' column in batches; the output order matches the row order of df.
    embeddings_output = []
    for idx in range(0, len(df), batch_size):
        batch = df.iloc[idx:idx+batch_size]['text'].tolist()
        embeddings = get_embeddings(batch, embedding_model_name)
        embeddings_output.extend(embeddings)
    return embeddings_output
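
The batching loop in `create_embeddings` can be tried without an API key. The sketch below is only an illustration: `fake_embed` and `embed_in_batches` are hypothetical names that do not appear in the notebook, and `fake_embed` returns made-up two-number vectors instead of calling OpenAI. It shows that slicing the texts into batches still yields one vector per text, in the original order.

```python
from typing import Callable, List


def fake_embed(texts: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for get_embeddings: one tiny vector per text, no API call."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


def embed_in_batches(
    texts: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 2,
) -> List[List[float]]:
    """Embed texts batch by batch, keeping the output aligned with the input order."""
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_fn(batch))
    return vectors


texts = ["Actor: Emily", "Actor: Jack", "Actor: Alice"]
vectors = embed_in_batches(texts, fake_embed, batch_size=2)
print(len(vectors))  # one vector per input text
```

The notebook does not show the cell that stores the embeddings, but `build_custom_context` reads them from an `embedding` column. That column was presumably filled with something like `data['embedding'] = create_embeddings(data)` before the questions were asked; this is an assumption, not a cell from the original run.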


## Custom Query Completion



In [11]:
def build_simple_prompt(question):
    return [
        {
            'role': 'user',
            'content': question
        }
    ]

In [12]:
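def build_custom_context(question, database_df, n=5):
    # Embed the question with the same model used for the dataset rows.
    question_embedding = get_embeddings(question, EMBEDDING_MODEL)[0]

    # Expects database_df to have an 'embedding' column holding one vector per row.
    df = database_df.copy()
    # scipy's cosine() is a distance (1 - cosine similarity), so smaller means more similar.
    df["distances"] = df['embedding'].apply(lambda embedding: cosine(embedding, question_embedding))

    # Return the text of the n closest rows.
    df.sort_values("distances", ascending=True, inplace=True)
    return df.iloc[:n]['text'].tolist()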
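
To make the ranking step concrete, here is a small worked example with toy two-dimensional vectors in place of real embeddings. It is a sketch, not the notebook's code: `cosine_distance` is a hypothetical pure-Python helper that follows the same definition as `scipy.spatial.distance.cosine` (1 minus the cosine similarity), and the document names and vectors are invented for illustration.

```python
import math
from typing import List


def cosine_distance(a: List[float], b: List[float]) -> float:
    """1 - cosine similarity, the same convention as scipy.spatial.distance.cosine."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy 2-D "embeddings"; real embedding vectors have far more dimensions.
documents = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.0, 1.0],
    "doc_c": [1.0, 1.0],
}
query = [1.0, 0.1]

# Sort ascending by distance, then keep the closest n (here n = 2).
ranked = sorted(documents, key=lambda name: cosine_distance(documents[name], query))
top_2 = ranked[:2]
print(top_2)  # ['doc_a', 'doc_c']
```

The query points almost in the same direction as `doc_a`, so that document has the smallest distance, while `doc_b` is nearly orthogonal to the query and is dropped from the top two.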

In [13]:
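def handle_question(prompt, client, model_name=COMPLETION_MODEL):
    # max_tokens caps the length of the reply, so long answers may be cut off.
    response = client.chat.completions.create(
        model=model_name,
        messages=prompt,
        max_tokens=100
    )
    # Return only the text of the first (and only) choice.
    return response.choices[0].message.content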
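
Because `handle_question` sets `max_tokens=100`, a long answer can stop mid-sentence. In the Chat Completions API, each choice carries a `finish_reason`, which is `"length"` when generation stopped at the token limit. The sketch below shows one way to surface that. `answer_with_truncation_flag` is a hypothetical helper, and `fake_response` is a stand-in object that only mimics the shape of a real response so the example runs without an API call.

```python
from types import SimpleNamespace


def answer_with_truncation_flag(response):
    """Return (text, was_truncated) for a chat-completion-shaped response object."""
    choice = response.choices[0]
    return choice.message.content, choice.finish_reason == "length"


# Hypothetical stand-in for a real API response; not produced by OpenAI.
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            message=SimpleNamespace(content="He is an active member of the local veteran community, volunteering his time to"),
            finish_reason="length",
        )
    ]
)

text, was_truncated = answer_with_truncation_flag(fake_response)
print(was_truncated)  # True
```

If the flag is set, raising `max_tokens` or asking for a shorter answer in the prompt are two possible remedies.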

In [31]:
def build_custom_prompt(question, database_df):
    # Retrieve the context once and reuse it for both the debug print and the prompt.
    context = build_custom_context(question, database_df)
    print("\n \n Found context :: {} ".format(context))
    return [
        {
            'role': 'system',
            'content': """
                Provide an answer based on the context provided below.
                If the question cannot be answered using the provided context, kindly respond with "I don't know the answer."
                The information pertains to fictional character profiles.
                Each profile lists the character's name, description, medium, and setting, and profiles are separated by blank lines.
            Context: 
                {}
            """.format('\n\n'.join(context))
        },
        {
            'role': 'user',
            'content': question
        }
    ]


In [32]:
import ast

def convert_embedding(embedding_string):
    # Parse an embedding stored as a string (e.g. "[0.1, 0.2]") back into a list of floats.
    return ast.literal_eval(embedding_string)
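
A helper like `convert_embedding` is typically needed when embeddings have been saved to a CSV file and loaded again, because each vector comes back as a string rather than a list. The notebook does not show where the helper is applied, so the following is a sketch under that assumption. It uses the standard-library `csv` module and an in-memory buffer as a stand-in for a real file, with a made-up three-number vector.

```python
import ast
import csv
import io

embedding = [0.12, -0.5, 0.33]  # made-up vector for illustration

# Write one row to an in-memory CSV; the list is stored as its string form.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Name", "embedding"])
writer.writerow(["Emily", embedding])

# Read it back: every field, including the embedding, is now a plain string.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
stored = rows[0]["embedding"]

# ast.literal_eval safely parses the string back into a Python list of floats.
restored = ast.literal_eval(stored)
print(type(stored).__name__, type(restored).__name__)  # str list
```

With a DataFrame loaded from such a file, the equivalent step would presumably be `data['embedding'] = data['embedding'].apply(convert_embedding)`, run before calling `build_custom_context`.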

## Custom Performance Demonstration



### Question 1

In [33]:
question_1 = 'Tell me about Emily and which country she belongs to?'

print('Answer without Context: \n', handle_question(build_simple_prompt(question_1), openai_client))

print('\nAnswer with Context: \n', handle_question(build_custom_prompt(question_1, data), openai_client))

Answer without Context: 
 I'm sorry, but I don't have any specific information about Emily or which country she belongs to as it would require personal details that I don't have access to. Can you provide more context or details about Emily?

 
 Found context :: ["Actor: Emily \n Details of this actor: A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George. \n Medium or Industry of the actor: Play \n Country of the actor: England", "Actor: Alice \n Details of this actor: A woman in her late 30s, Alice is a warm and nurturing mother of two, including Emily. She's kind-hearted and empathetic, but can be overly protective of her children and prone to worrying. She's married to Jack. \n Medium or Industry of the actor: Play \n Country of the actor: England", "Actor: George \n Details of this actor: A man in his early 30s, George is

### Question 2

In [36]:
question_2 = 'Tell me about a retired soldier who is in his 50s and which country he belongs to?'

print('Answer without Context: \n', handle_question(build_simple_prompt(question_2), openai_client))

print('\nAnswer with Context: \n', handle_question(build_custom_prompt(question_2, data), openai_client))

Answer without Context: 
 John Richards is a retired soldier in his 50s who served in the United States Army for over 25 years. He joined the army straight out of high school and rose through the ranks to become a highly decorated officer. After multiple tours of duty in Iraq and Afghanistan, John decided to retire and settle down in his hometown in Texas.

As a retired soldier, John spends his days enjoying his well-earned peaceful retirement. He is an active member of the local veteran community, volunteering his time to

 
 Found context :: ["Actor: Tom \n Details of this actor: A man in his 50s, Tom is a retired soldier and John's son. He has a no-nonsense approach to life, but is haunted by his experiences in combat and struggles with PTSD. He's also in a relationship with Rachel. \n Medium or Industry of the actor: Play \n Country of the actor: England", "Actor: John \n Details of this actor: A man in his 60s, John is a retired professor and Tom's father. He has a dry wit and a l

### Question 3

In [38]:
question_3 = 'Tell me about some actors from Opera industry  and which country they belong to?'

print('Answer without Context: \n', handle_question(build_simple_prompt(question_3), openai_client))

print('\nAnswer with Context: \n', handle_question(build_custom_prompt(question_3, data), openai_client))

Answer without Context: 
 1. Plácido Domingo - Spain: Domingo is a world-renowned Spanish opera singer known for his versatile voice and magnetic stage presence. He has performed in leading roles at major opera houses around the world and has won numerous awards for his performances.

2. Anna Netrebko - Russia: Netrebko is a highly acclaimed Russian operatic soprano known for her powerful and expressive voice. She has performed at leading opera houses in Europe and the United States and has received numerous awards for

 
 Found context :: ['Actor: Baron Gustavo \n Details of this actor: A wealthy and arrogant nobleman who loves to flaunt his wealth and status. Baron Gustavo is competitive and ruthless, and his singing voice is powerful and commanding. He is not above using his influence and resources to get what he wants, regardless of who he hurts in the process. \n Medium or Industry of the actor: Opera \n Country of the actor: Italy', 'Actor: Signora Rosa \n Details of this actor: 

### Explanation

In the first question, the model behaves as expected: without context, it says it has no information about Emily instead of inventing an answer. In the second question, the model without context hallucinates a "John Richards" from Texas, but with the retrieved context it provides the correct answer. Similarly, in the third question the model without context lists real opera singers, while with context its answer aligns with the characters in the dataset.