<a href="https://colab.research.google.com/github/heber-augusto/udacity-generative-ai-nanodegree/blob/main/custom_chatbot/custom_chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Custom Chatbot Project

The dataset used for this task is the "Character Descriptions" CSV. It is synthetic content generated by an OpenAI model, so it acts as a private dataset that was not available to OpenAI for training its models. I will therefore use it to retrieve additional context and pass that context, together with the user's question, to the LLM so it can answer the question.

## Library installation

In [43]:
!pip install openai tiktoken

Collecting tiktoken
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 7.8 MB/s eta 0:00:00
Installing collected packages: tiktoken
Successfully installed tiktoken-0.7.0


## Read OpenAI API Key from secret

In [2]:
from google.colab import userdata
OPENAI_API_KEY = userdata.get('openai_api_key')

In [7]:
from openai import OpenAI
openai_client = OpenAI(
    api_key = OPENAI_API_KEY)
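
Outside Colab, `google.colab` is not importable, so the cell above fails. The following is a minimal sketch of a fallback, assuming the key is exported as an environment variable; the helper name `load_api_key` and the variable name `OPENAI_API_KEY` are illustrative and not part of the original notebook.

```python
import os


def load_api_key(secret_name="openai_api_key", env_name="OPENAI_API_KEY"):
    """Return the key from Colab secrets if available, else from the environment."""
    try:
        # Only importable when running inside Google Colab
        from google.colab import userdata
        return userdata.get(secret_name)
    except ImportError:
        # Assumption: outside Colab the key lives in an environment variable
        key = os.environ.get(env_name)
        if not key:
            raise RuntimeError(f"Set the {env_name} environment variable")
        return key
```

The returned value can then be passed as `api_key` when constructing the client.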

## Data Wrangling

Load the dataset into a `pandas` DataFrame with a single column named `"text"`.

In [49]:
import pandas as pd

# Load the character descriptions from the repository

url = 'https://raw.githubusercontent.com/heber-augusto/udacity-generative-ai-nanodegree/main/custom-chatbot/dataset/character_descriptions.csv'
df = pd.read_csv(url)

In [50]:
# Initial Columns
initial_columns = df.columns

# Create one column named text and remove all other columns
df['text'] = 'Name of the Character: ' + df['Name'] + '\nDescription: ' + df['Description'] + '\nMedium: ' + df['Medium'] + '\nSetting: ' + df['Setting']
df.drop(columns=initial_columns, inplace=True)

df.head()

|   | text |
|---|------|
| 0 | Name of the Character: Emily\nDescription: A y... |
| 1 | Name of the Character: Jack\nDescription: A mi... |
| 2 | Name of the Character: Alice\nDescription: A w... |
| 3 | Name of the Character: Tom\nDescription: A man... |
| 4 | Name of the Character: Sarah\nDescription: A w... |


In [51]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"

def create_file_with_embeddings(df, embedding_model_name):
    batch_size = 100
    embeddings = []
    for i in range(0, len(df), batch_size):
        # Send text data to OpenAI model to get embeddings
        response = openai_client.embeddings.create(
            input=df.iloc[i:i+batch_size]["text"].tolist(),
            model=embedding_model_name,
            encoding_format="float"
        )

        # Add embeddings to list
        embeddings.extend([data.embedding for data in response.data])

    # Add embeddings list to dataframe
    df["embeddings"] = embeddings

    df.to_csv(
        'character_with_embeddings.csv',
        index=False)

# Use the following line to create file with embeddings
# create_file_with_embeddings(df , EMBEDDING_MODEL_NAME)
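
Writing a list-valued column to CSV stores each list as a string, which is why the embeddings need to be parsed when the file is read back. The sketch below illustrates that round trip on a toy frame with an in-memory buffer. The data and variable names are made up for illustration, and `ast.literal_eval` is used here as one safe way to parse the strings.

```python
import ast
import io

import numpy as np
import pandas as pd

# Toy frame with a list-valued column, standing in for real embeddings
toy = pd.DataFrame({"text": ["a", "b"], "embeddings": [[0.1, 0.2], [0.3, 0.4]]})

# Write to an in-memory CSV instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
# After the round trip each cell is a string such as "[0.1, 0.2]"
cell_type_after_read = type(restored.loc[0, "embeddings"])

# Parse each string back into a list, then convert it to a NumPy array
restored["embeddings"] = restored["embeddings"].apply(ast.literal_eval).apply(np.array)
```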

In [56]:
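import ast
import numpy as np
url = 'https://raw.githubusercontent.com/heber-augusto/udacity-generative-ai-nanodegree/main/custom-chatbot/dataset/character_with_embeddings.csv'
df_with_embeddings = pd.read_csv(url)
# Embeddings are stored as strings in the CSV; parse them with
# ast.literal_eval (safer than eval) and convert each one to a NumPy array
df_with_embeddings["embeddings"] = df_with_embeddings["embeddings"].apply(ast.literal_eval).apply(np.array)
df_with_embeddings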

|   | text | embeddings |
|---|------|------------|
| 0 | Name of the Character: Emily\nDescription: A y... | [-0.011612175, -0.008270356, -0.012926883, -0.... |
| 1 | Name of the Character: Jack\nDescription: A mi... | [0.0085274605, -0.018019298, -0.0009495139, -0... |
| 2 | Name of the Character: Alice\nDescription: A w... | [0.009447108, -0.0021701923, -0.016225243, -0.... |
| 3 | Name of the Character: Tom\nDescription: A man... | [0.017111635, -0.014908363, -0.0020845323, -0.... |
| 4 | Name of the Character: Sarah\nDescription: A w... | [-0.0147309955, -0.020379819, -0.011906583, -0... |
| 5 | Name of the Character: George\nDescription: A ... | [-0.017609598, -0.0067947214, -0.0035818922, -... |
| 6 | Name of the Character: Rachel\nDescription: A ... | [-0.0016035989, -0.005048875, 0.0041514407, -0... |
| 7 | Name of the Character: John\nDescription: A ma... | [0.022090495, -0.0101275025, -0.02634314, -0.0... |
| 8 | Name of the Character: Maria\nDescription: A m... | [-0.0069917967, -0.007726369, -0.010384187, -0... |
| 9 | Name of the Character: Caleb\nDescription: A y... | [0.0035903126, -0.018488646, 0.0075842505, -0.... |


## Custom Query Completion

The cells below define functions that retrieve the rows of the embeddings DataFrame most relevant to the query.

In [34]:
import numpy as np
from scipy import spatial
from typing import List, Optional

def get_embedding(text, model="text-embedding-ada-002"):
    response = openai_client.embeddings.create(
        input=text,
        model=model,
        encoding_format="float"
    )
    return response.data[0].embedding


def distances_from_embeddings(
    query_embedding: List[float],
    embeddings: List[List[float]],
    distance_metric="cosine",
) -> List[float]:
    """Return the distances between a query embedding and a list of embeddings."""
    distance_metrics = {
        "cosine": spatial.distance.cosine,
        "L1": spatial.distance.cityblock,
        "L2": spatial.distance.euclidean,
        "Linf": spatial.distance.chebyshev,
    }
    distances = [
        distance_metrics[distance_metric](query_embedding, embedding)
        for embedding in embeddings
    ]
    return distances
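
To make the choice of metric concrete, here is a small sketch comparing cosine and Euclidean distance on hand-picked 2-D vectors. The vectors are illustrative only and have nothing to do with real embeddings. Cosine distance depends only on direction, so a vector pointing the same way as the query has distance 0 regardless of its length.

```python
from scipy import spatial

query = [1.0, 0.0]
candidates = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Cosine distance is 1 - cosine similarity, so it ranges from 0 to 2
cosine = {name: spatial.distance.cosine(query, vec) for name, vec in candidates.items()}

# Euclidean (L2) distance also reacts to vector length
euclidean = {name: spatial.distance.euclidean(query, vec) for name, vec in candidates.items()}
```

Here `cosine["same direction"]` is 0 while `euclidean["same direction"]` is 1, because the candidate is twice as long as the query.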

In [41]:
def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """

    # Get embeddings for the question text
    question_embeddings = get_embedding(
        question,
        model=EMBEDDING_MODEL_NAME)

    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )

    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
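
The ranking step can be tried without calling the API. The sketch below repeats the copy, distance, and sort pattern on a toy DataFrame. The 2-D vectors and `query_embedding` are hypothetical stand-ins for what `get_embedding` would return.

```python
import numpy as np
import pandas as pd
from scipy import spatial

toy_df = pd.DataFrame({
    "text": ["cars", "cats", "dogs"],
    "embeddings": [np.array([0.0, 1.0]), np.array([1.0, 0.0]), np.array([0.8, 0.6])],
})

# Hypothetical stand-in for get_embedding(question)
query_embedding = np.array([1.0, 0.1])

ranked = toy_df.copy()
ranked["distances"] = [
    spatial.distance.cosine(query_embedding, embedding)
    for embedding in ranked["embeddings"]
]

# Smallest distance first, so the most relevant row ends up on top
ranked = ranked.sort_values("distances", ascending=True)
```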

In [47]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")

    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context:

{}

---

Question: {}
Answer:"""

    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))

    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:

        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count

        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n#########\n\n".join(context), question)
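
The context-building loop is a greedy fill under a token budget. The sketch below isolates that logic so it can be checked by hand. It assumes whitespace splitting as a stand-in for the real tokenizer, and the helper name `pack_context` is illustrative.

```python
def pack_context(rows, fixed_token_count, max_token_count, count_tokens):
    """Greedily keep rows (already sorted by relevance) while the budget allows."""
    total = fixed_token_count
    kept = []
    for row in rows:
        total += count_tokens(row)
        if total > max_token_count:
            # Stop at the first row that does not fit, as create_prompt does
            break
        kept.append(row)
    return kept


def count_words(text):
    # Assumption: word count stands in for a real tokenizer's token count
    return len(text.split())


rows = ["one two three", "four five", "six seven eight nine"]
kept = pack_context(rows, fixed_token_count=2, max_token_count=8, count_tokens=count_words)
```

With a budget of 8 and 2 tokens already spent on the template and question, the first two rows fit (running total 7) and the third would bring the total to 11, so it is dropped. Because the loop stops at the first row that does not fit, a shorter row further down the ranking is never considered.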

In [71]:
COMPLETION_MODEL_NAME = "gpt-4o"
def get_completion_answer(prompt, model=COMPLETION_MODEL_NAME, max_tokens=None):
    # Use the requested model rather than a hard-coded name;
    # max_tokens=None leaves the response length uncapped
    completion = openai_client.chat.completions.create(
      model=model,
      messages=[
        {"role": "user", "content": prompt}
      ],
      max_tokens=max_tokens
    )
    return completion.choices[0].message.content

In [72]:
def answer_question(
    question,
    df,
    max_prompt_tokens=1800,
    max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI chat completion model

    If the model produces an error, return an empty string
    """

    prompt = create_prompt(question, df, max_prompt_tokens)
    #print(prompt)

    try:
        # Cap the response length with max_answer_tokens
        return get_completion_answer(prompt, max_tokens=max_answer_tokens)
    except Exception as e:
        print(e)
        return ""

## Testing functions

In [73]:
print(answer_question(
    "Can you describe Alice character?",
    df_with_embeddings,
    max_prompt_tokens=200)
    )

Alice is a woman in her late 30s who is a warm and nurturing mother of two, including her daughter, Emily. She is kind-hearted and empathetic but tends to be overly protective of her children and is prone to worrying. She is married to Jack. The story is set in England and takes place within the context of a play.


## Custom Performance Demonstration

The cells below demonstrate the performance of the custom query using two questions. For each question, the answer from a basic `Completion` model query (the question alone, with no retrieved context) is shown alongside the answer from the custom query.

### Question 1

In [75]:
question1 = "Can you describe Alice character?"

In [76]:
print(get_completion_answer(question1))

Alice is a well-known and beloved character from Lewis Carroll's classic novels "Alice's Adventures in Wonderland" (published in 1865) and its sequel "Through the Looking-Glass, and What Alice Found There" (published in 1871). She is a curious, imaginative, and adventurous young girl who finds herself in a series of surreal and absurd situations.

1. **Curiosity and Imagination**: Alice is deeply curious, which often leads her into new and peculiar situations. Her sense of wonder and inquisitiveness drive the narrative forward, as she explores the bizarre and fantastical worlds she encounters.

2. **Practical and Sensible Nature**: Despite the surrealism around her, Alice often maintains a practical mindset. She frequently questions the logic and rules of Wonderland and tries to apply her own sense of order to chaotic situations. This sometimes makes her appear a bit stern or critical, especially in the whimsical and nonsensical world she navigates.

3. **Bravery and Resourcefulness**:

In [77]:
print(answer_question(
    question1,
    df_with_embeddings,
    max_prompt_tokens=200)
    )

Alice is a woman in her late 30s who is a warm and nurturing mother of two, including Emily. She's kind-hearted and empathetic, but can be overly protective of her children and prone to worrying. She's married to Jack.


### Question 2

In [78]:
question2 = "Can you tell what were the female characters that participated on the Reality Show with Setting USA?"

In [79]:
print(get_completion_answer(question2))

Certainly! However, you haven't specified which reality show you're referring to. There are numerous reality shows set in the USA, each with its own set of participants. Here are a few popular ones, and some key female participants from each:

### 1. **The Real World**
   - **Season 1 (New York)**: Julie Gentry
   - **Season 2 (Los Angeles)**: Tami Akbar (later Roman)
   - **Season 3 (San Francisco)**: Pam Ling 

### 2. **Survivor**
   - **Season 1 (Borneo)**: Susan Hawk, Kelly Wiglesworth
   - **Season 2 (Australian Outback)**: Tina Wesson (winner), Jerri Manthey

### 3. **The Bachelor/The Bachelorette**
   - **First Bachelorette (2003)**: Trista Rehn
   - **Bachelor Season 1**: Amanda Marsh (winner)

### 4. **Big Brother USA**
   - **Season 1**: Jamie Kern
   - **Season 2**: Nicole Nilson

### 5. **Keeping Up With The Kardashians**
   - **Main Cast**: Kim Kardashian, Kourtney Kardashian, Khloé Kardashian, Kris Jenner

### 6. **American Idol**
   - **Season 1**: Kelly Clarkson (winner

In [80]:
print(answer_question(
    question2,
    df_with_embeddings,
    max_prompt_tokens=500)
    )

Yes, the female characters that participated in the Reality Show with Setting USA are Chloe, Sophia, Maya, and Olivia.
