# Custom Chatbot Project

## Table of Contents
1. [Project Overview and Rationale](#rationale)
2. [Data Wrangling](#preparation)
3. [Custom Query Completion](#custom_queries)
4. [Custom Performance Demonstration](#demo)

## Project Overview and Rationale
<a id="rationale"></a>
This project uses the provided "Fashion Trends" dataset to improve ChatGPT's answers about fashion.

The fictional scenario is as follows: The year is 2023, and we are operating an e-commerce web shop specializing in fashion. We want to offer our customers a text-based assistant that helps them choose their next outfit.

The reasons for using the fashion trends dataset are as follows:
- The data is real-life data. This is an advantage over AI-generated data when comparing ChatGPT's responses with and without custom data.
- The task fits well with my current job (building web shop and order management back ends).

## Data Wrangling

<a id="preparation"></a>

In [1]:
# imports
import pandas as pd

In [2]:
# read the data
df_full = pd.read_csv("data/2023_fashion_trends.csv")
print(df_full.head)
# => three columns (URL, Trends, source); the set has 82 rows, not counting the header

<bound method NDFrame.head of                                                   URL  \
0   https://www.refinery29.com/en-us/fashion-trend...   
1   https://www.refinery29.com/en-us/fashion-trend...   
2   https://www.refinery29.com/en-us/fashion-trend...   
3   https://www.refinery29.com/en-us/fashion-trend...   
4   https://www.refinery29.com/en-us/fashion-trend...   
..                                                ...   
77  https://www.whowhatwear.com/spring-summer-2023...   
78  https://www.whowhatwear.com/spring-summer-2023...   
79  https://www.whowhatwear.com/spring-summer-2023...   
80  https://www.whowhatwear.com/spring-summer-2023...   
81  https://www.whowhatwear.com/spring-summer-2023...   

                                               Trends  \
0   2023 Fashion Trend: Red. Glossy red hues took ...   
1   2023 Fashion Trend: Cargo Pants. Utilitarian w...   
2   2023 Fashion Trend: Sheer Clothing. "Bare it a...   
3   2023 Fashion Trend: Denim Reimagined. From dou...   


In [3]:
# convert into single column set with label "text"
df_text = pd.DataFrame(df_full["Trends"]).rename(columns={'Trends': 'text'}) # source for rename: Udacity GPT chatbot

print(type(df_text))
print(df_text.head)

<class 'pandas.core.frame.DataFrame'>
<bound method NDFrame.head of                                                  text
0   2023 Fashion Trend: Red. Glossy red hues took ...
1   2023 Fashion Trend: Cargo Pants. Utilitarian w...
2   2023 Fashion Trend: Sheer Clothing. "Bare it a...
3   2023 Fashion Trend: Denim Reimagined. From dou...
4   2023 Fashion Trend: Shine For The Daytime. The...
..                                                ...
77  If lime green isn't your vibe, rest assured th...
78  "As someone who can clearly (not fondly) remem...
79  "Combine this design shift with the fact that ...
80  Thought party season ended at the stroke of mi...
81  "This season, we saw the revival of the bubble...

[82 rows x 1 columns]>
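The `<bound method NDFrame.head of ...>` prefix in the outputs above appears because `head` is referenced without parentheses, so `print` shows the method object (whose representation happens to include the whole frame) instead of the first rows. A minimal sketch of the difference, using a made-up three-row frame; the sample rows are hypothetical, only the column names `URL` and `Trends` come from the data above:

```python
import pandas as pd

# Hypothetical stand-in for the fashion trends CSV (only the column names match the notebook)
df_demo = pd.DataFrame({
    "URL": ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    "Trends": ["Trend A", "Trend B", "Trend C"],
})

# head without parentheses is only a reference to the method; calling it returns the first rows
first_rows = df_demo.head(2)

# Double brackets select a single-column DataFrame directly, which can then be renamed
df_demo_text = df_demo[["Trends"]].rename(columns={"Trends": "text"})

print(callable(df_demo.head))      # the attribute itself is a method
print(first_rows.shape)            # two rows, two columns
print(list(df_demo_text.columns))  # the renamed single column
```

Selecting with `df[["Trends"]]` is an alternative to wrapping the column in `pd.DataFrame(...)`; both give a one-column frame.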


## Custom Query Completion

<a id="custom_queries"></a>
[OpenAI API reference](https://platform.openai.com/docs/api-reference/chat)

In [4]:
# key to access OpenAI
OPENAI_API_KEY = "YOUR_API_KEY"

# select some configuration settings
DEFAULT_TEMPERATURE = 0.1

# Pick one of the available models. Only combine models that share the same token encoding (cl100k_base in this case)
# THE_COMPLETION_MODEL = "text-davinci-003" # uses p50k_base (source: https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken)
THE_COMPLETION_MODEL = "gpt-3.5-turbo-instruct"
# THE_EMBEDDING_MODEL = "text-embedding-ada-002"
THE_EMBEDDING_MODEL = "text-embedding-3-small"  # according to the OpenAI docs, this embedding model is cheaper and more recent
THE_TOKEN_ENCODING = "cl100k_base"

In [5]:
# Create embeddings for this session. The dataset is small enough to send in a single request, so no batching is needed

import openai

# Send text data to OpenAI model to get embeddings
openai.api_key = OPENAI_API_KEY
response_for_embeddings = openai.Embedding.create(
    input  = df_text["text"].to_list(),
    engine = THE_EMBEDDING_MODEL
)

embeddings = response_for_embeddings["data"]

In [7]:
# persist the embeddings so we don't pay for them again (in a real-world application this would be a vector database)
import numpy as np

# we want the embeddings 1) in the DataFrame (to save) and 2) as a list of numpy arrays (for the nearest neighbor search)
# the type checks below show what needs converting; there is probably a simpler way to do this
print(type(embeddings))
print(type(embeddings[0]))
# print(embeddings[0])
print(type(embeddings[0].get("embedding")))

# calculate embeddings as list of ndarray
embeddings_np = [ np.array(e.get('embedding')) for e in embeddings ]

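# add it to the existing DataFrame
df_text["embeddings"] = embeddings_np
# save a copy with the vectors as plain lists: the string form of a large ndarray is abbreviated
# with "..." and has no commas, so it could not be parsed back when reading the CSV
df_text.assign(embeddings=[e.tolist() for e in embeddings_np]).to_csv("embeddings.csv")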

<class 'list'>
<class 'openai.openai_object.OpenAIObject'>
<class 'list'>
<class 'list'>
<class 'numpy.ndarray'>


In [8]:
# read (restart point)
#df_embeddings = pd.read_csv("embeddings.csv", index_col=0)
#df_embeddings["embeddings"] = df_embeddings["embeddings"].apply(eval).apply(np.array)
#df_embeddings
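The restart point above only works if every vector survives the trip through the CSV file completely. A minimal sketch of such a round trip, assuming the vectors are written as JSON lists; the two 3-dimensional vectors and the in-memory buffer are hypothetical stand-ins for the real embeddings and for `embeddings.csv`:

```python
import io
import json

import numpy as np
import pandas as pd

# Hypothetical miniature of the notebook data: two texts with 3-dimensional "embeddings"
texts = ["Trend A", "Trend B"]
vectors = [np.array([0.1, 0.2, 0.3]), np.array([-0.4, 0.5, 0.6])]

# Serialize each vector as a JSON list so that every value is written out in full
df_out = pd.DataFrame({
    "text": texts,
    "embeddings": [json.dumps(v.tolist()) for v in vectors],
})
buffer = io.StringIO()  # stands in for a file such as embeddings.csv
df_out.to_csv(buffer)

# Read the data back and turn the JSON strings into numpy arrays again
buffer.seek(0)
df_restored = pd.read_csv(buffer, index_col=0)
df_restored["embeddings"] = df_restored["embeddings"].apply(json.loads).apply(np.array)

print(type(df_restored["embeddings"].iloc[0]))
print(df_restored["embeddings"].iloc[1])
```

Parsing with `json.loads` (or `ast.literal_eval`) instead of `eval` avoids executing arbitrary code from the file.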

In [30]:
# create a search for the most relevant statements
# (again, in a real-world application this would be a vector database query; I still have to try that)
# this implementation is taken from prior course material

from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=THE_EMBEDDING_MODEL)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df["embeddings"].values,  # use the embeddings of the dataframe passed in, not the global list
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
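
The ranking relies on cosine distance: one minus the cosine of the angle between the question vector and each text vector, so a smaller value means a closer match. A minimal numpy sketch of that idea; `cosine_distances` is a hypothetical helper written for illustration (not the library function), and the 2-dimensional vectors are made up:

```python
import numpy as np

def cosine_distances(query, vectors):
    """Return 1 - cosine similarity between the query and each row of vectors."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(vectors, dtype=float)
    similarities = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    return 1.0 - similarities

# Hypothetical 2-dimensional "embeddings" for three texts
texts = ["jackets", "bags", "shoes"]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.1]

distances = cosine_distances(query, vectors)

# Ascending order of distance puts the most relevant text first
ranking = [texts[i] for i in np.argsort(distances)]
print(ranking)  # ['jackets', 'shoes', 'bags']
```

Sorting in ascending order of distance is what puts the best match in the first row of the returned dataframe.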

In [40]:
import tiktoken

# computes the embedding for a single string, returns a list
# should match the openai.embeddings_utils function get_embedding
def get_embedding_for_question(question):
    response_for_embeddings = openai.Embedding.create(
        input  = question,
        engine = THE_EMBEDDING_MODEL
    )
    return response_for_embeddings.data[0]["embedding"]

test_question = 'Which kind of jacket should I wear?'
rs = get_embedding_for_question(test_question)
print(type(rs))
# print(rs)
# rs2 = get_embedding(test_question, engine=THE_EMBEDDING_MODEL)
# print(type(rs2))
# print(rs2)


# embeddings = response_for_embeddings["data"]
def create_prompt_with_context(question, df):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding(THE_TOKEN_ENCODING)

    # Count the number of tokens in the prompt template and question
    prompt_template = """
You are a fashion advisor. It is the year 2023.
Answer the question based on the context of the current year's trends below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    current_token_count = len(tokenizer.encode(prompt_template)) + len(tokenizer.encode(question))
    max_token_count = 1024
    max_relevant_facts = 8

    context = []
    # add facts to the context as long as we don't exceed the token count and we don't add more than max_relevant_facts
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:

        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count

        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count and len(context) < max_relevant_facts:
            context.append(text)
        else:
            break

    print(f'Used {len(context)} trend information records')
    return prompt_template.format("\n\n###\n\n".join(context), question)

def create_standard_prompt(question, df):
    prompt_template = """
You are a fashion advisor. Answer the following question:
Question: {}
Answer:"""
    return prompt_template.format(question)


<class 'list'>
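
The context selection in `create_prompt_with_context` is a greedy fill: texts are taken in order of relevance until the next one would exceed the token budget or the maximum number of facts is reached. A minimal sketch of that loop in isolation; `count_tokens` and `select_context` are hypothetical helpers, and counting whitespace-separated words is only an assumption standing in for the tiktoken tokenizer:

```python
def count_tokens(text):
    # Assumption: whitespace-separated words stand in for real tokenizer tokens
    return len(text.split())

def select_context(ranked_texts, base_token_count, max_token_count, max_facts):
    """Greedily take the most relevant texts until either limit would be exceeded."""
    selected = []
    used = base_token_count
    for text in ranked_texts:
        cost = count_tokens(text)
        # Stop at the first text that does not fit, as the notebook loop does
        if used + cost > max_token_count or len(selected) >= max_facts:
            break
        selected.append(text)
        used += cost
    return selected, used

# Hypothetical texts, already sorted from most to least relevant
ranked = ["red is back", "cargo pants are everywhere this season", "sheer layers", "denim"]
selected, used = select_context(ranked, base_token_count=5, max_token_count=15, max_facts=3)
print(selected, used)  # the first two texts fit, using 14 of the 15 tokens
```

Stopping with `break` keeps only an unbroken run of the top-ranked texts. Using `continue` instead would let shorter, less relevant texts fill the remaining budget, which may or may not be desirable.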


In [44]:
# Define the functions that answer a customer's question.
# The caller provides the finished prompt and can optionally set a temperature. In fashion there is usually no right
# or wrong answer because much depends on personal preference, so a bit of variety is welcome. A higher temperature
# also lets the customer repeat the question and get a different opinion.
#
# The functions use the following globally defined data:
# - DEFAULT_TEMPERATURE: the default value for the creativity of the bot
# - df_text and its embeddings: the texts which hold the additional knowledge past 2021

def chat_completion(prompt, temperature=DEFAULT_TEMPERATURE):
    # print(prompt)
    response = openai.Completion.create(
        model       = THE_COMPLETION_MODEL,
        prompt      = prompt,
        temperature = temperature,
        max_tokens  = 256,
        top_p       = 1.0 if temperature == 0.0 else 0.9
    )
    return response.choices[0].text.strip().strip("\n")

def compare_results(question):
    print(f'Question: {question}')
    print(f'Answer of base model: {chat_completion(create_standard_prompt(question, df_text), 0.0)}')
    print(f'Answer of custom model: {chat_completion(create_prompt_with_context(question, df_text), 0.0)}')

## Custom Performance Demonstration

<a id="demo"></a>

### Question 1

In [45]:
compare_results('What kind of bag should I use?')

Question: What kind of bag should I use?
Answer of base model: It depends on the occasion and your personal style. For a casual day out, a crossbody or tote bag would be a practical and stylish choice. For a more formal event, a clutch or structured handbag would be more appropriate. If you prefer a more edgy look, a backpack or fanny pack could be a fun option. Ultimately, choose a bag that complements your outfit and makes you feel confident.
Used 8 trend information records
Answer of custom model: Oversized bags are currently trending in 2023, so a big tote or shoulder bag would be a great choice.


Result: Both models give a meaningful answer, but the one without the custom facts is generic, while the one with the extra context is specific and reflects the 2023 trends.

### Question 2

In [46]:
compare_results('I need some more shoes, can you recommend some slippers?')

Question: I need some more shoes, can you recommend some slippers?
Answer of base model: Sure, there are many great slipper options available. For a casual and comfortable option, I would recommend a pair of fuzzy or plush slippers. If you're looking for something more stylish, you could try a pair of mules or slides with a fun print or embellishment. For a more formal option, you could go for a pair of velvet or satin slippers. Ultimately, it depends on your personal style and the occasion you will be wearing them for.
Used 7 trend information records
Answer of custom model: For a comfortable yet stylish option, I would recommend the Khaite Marcy shearling flat. For a more glamorous option, the Alaïa Coeur mule would be a great choice. And for a fun and youthful option, the platform slip-ons mentioned by personal stylist Andie Sobrato would be a great addition to your shoe collection.


Result: As with question 1, the custom bot follows the 2023 fashion trends much more closely.