# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

I selected `data/2023_fashion_trends.csv`, one of the three datasets provided with this project in the `data` directory.

The "2023 Fashion Trends" dataset is appropriate for this task because it contains information that did not exist in 2021, when the LLM `gpt-3.5-turbo-instruct` was trained. As a result, the LLM cannot answer questions about events that occurred in 2023, as these two examples show:

> QUESTION: What did Naina Singla say about maxi skirts in 2023?
> 
> ANSWER: As a language model AI, I cannot predict what a human might say about a topic in a specific year, especially since it is currently only 2021.
> However, based on current fashion trends, I can provide some possible statements about maxi skirts that may align with Naina Singla's fashion style and preferences.
> 
> QUESTION: What did Andie Sobrato say about platform slip-ons in 2023?
> 
> ANSWER: It is not possible to accurately answer this question as it refers to a specific statement made by Andie Sobrato about platform slip-ons in a hypothetical future year (2023). Since we do not have access to future events and statements made by individuals, it is impossible to provide a detailed explanation of what Andie Sobrato might say about platform slip-ons in 2023. It is important to note that any information or predictions about the future should be taken with caution, as it is subject to change and uncertainty.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [1]:
import pandas as pd
import openai
import os
import numpy as np
import json
import tiktoken

In [2]:
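def read_openai_key(key_file):
    """Read the OpenAI API key from a local text file."""
    with open(key_file, "r") as f:
        return f.read().strip()

openai.api_key = read_openai_key("openai_key.txt")
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
ENCODING = "cl100k_base"  # tiktoken encoding used to count tokens
GPT_MODEL = "gpt-3.5-turbo-instruct"
MAX_TOKENS = 300  # tokens reserved for the completion
MAX_TOKEN_COUNT = 4096  # overall token budget for prompt plus completion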

In [3]:
def create_vector_db(dataset_file, vector_db_file):
    """Embed the dataset's "Trends" column and cache the result as a CSV file."""
    # Reuse the cached file so the embeddings API is not called again.
    if os.path.isfile(vector_db_file):
        print(f'Processed file "{vector_db_file}" already exists. Creating the vector database was skipped.')
        return
    fashion_df = pd.read_csv(dataset_file)
    df = pd.DataFrame()
    df["text"] = fashion_df["Trends"]
    texts = df["text"].tolist()
    # Request embeddings for all texts in a single API call.
    response = openai.Embedding.create(input = texts, model = EMBEDDING_MODEL_NAME)
    embeddings = [data["embedding"] for data in response["data"]]
    df["embeddings"] = embeddings
    df.to_csv(vector_db_file, index = False)

In [4]:
dataset_file = "data/2023_fashion_trends.csv"
vector_db_file = "fashion_vector_db.csv"
create_vector_db(dataset_file, vector_db_file)

Processed file "fashion_vector_db.csv" already exists. Creating the vector database was skipped.


In [5]:
vector_db = pd.read_csv(vector_db_file)
vector_db["embeddings"] = [json.loads(embeddings_str) for embeddings_str in vector_db["embeddings"].values]
vector_db

| Unnamed: 0 | text | embeddings |
|---|---|---|
| 0 | 2023 Fashion Trend: Red. Glossy red hues took ... | [-0.020833317190408707, -0.022008126601576805,... |
| 1 | 2023 Fashion Trend: Cargo Pants. Utilitarian w... | [-0.001784870750270784, -0.02892744168639183, ... |
| 2 | 2023 Fashion Trend: Sheer Clothing. "Bare it a... | [-0.01045029703527689, -0.01917460933327675, 0... |
| 3 | 2023 Fashion Trend: Denim Reimagined. From dou... | [-0.01555274985730648, -0.005349182989448309, ... |
| 4 | 2023 Fashion Trend: Shine For The Daytime. The... | [-0.004937837366014719, 0.0018045977922156453,... |
| ... | ... | ... |
| 77 | If lime green isn't your vibe, rest assured th... | [-0.002775615779682994, -0.018237359821796417,... |
| 78 | "As someone who can clearly (not fondly) remem... | [-0.01475916150957346, -0.0064850859344005585,... |
| 79 | "Combine this design shift with the fact that ... | [-0.02079174853861332, -0.025052573531866074, ... |
| 80 | Thought party season ended at the stroke of mi... | [-0.01981187053024769, -0.022380324080586433, ... |
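
The cell that loads the vector database parses each embedding with `json.loads` because a CSV file stores every value as text, so a list of floats comes back as a string such as `"[0.1, -0.2, 0.3]"`. The following is a minimal sketch of that round trip. It uses only the standard library `csv` module as a stand-in for pandas, and the toy rows are hypothetical, not taken from the dataset.

```python
import csv
import io
import json

# Hypothetical toy rows standing in for the real texts and embeddings.
rows = [
    {"text": "trend A", "embeddings": [0.1, -0.2, 0.3]},
    {"text": "trend B", "embeddings": [0.0, 0.5, -0.5]},
]

# Write: each list is stored as its string form, e.g. "[0.1, -0.2, 0.3]".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "embeddings"])
writer.writeheader()
for row in rows:
    writer.writerow({"text": row["text"], "embeddings": str(row["embeddings"])})

# Read: the column comes back as plain strings, so each one is parsed
# with json.loads to recover a list of floats.
buffer.seek(0)
restored = [
    {"text": r["text"], "embeddings": json.loads(r["embeddings"])}
    for r in csv.DictReader(buffer)
]

print(type(restored[0]["embeddings"]).__name__)
```

This works because the string form of a Python list of floats is also valid JSON. Without the parsing step, any similarity computation would receive strings instead of numeric vectors.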


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [6]:
def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def get_embeddings(text):
    """Return the embedding vector of a single text."""
    return openai.Embedding.create(input = [text], model = EMBEDDING_MODEL_NAME).data[0].embedding

def find_most_relevant_texts(vector_db, question):
    """Return a copy of vector_db sorted by similarity to the question, most similar first."""
    df = vector_db.copy()
    question_embeddings = get_embeddings(question)
    df["similarities"] = df["embeddings"].apply(lambda data: cosine_similarity(data, question_embeddings))
    df.sort_values("similarities", ascending = False, inplace = True)
    return df
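
To illustrate how the ranking behaves, here is a small sketch that applies cosine similarity to hypothetical 2-dimensional vectors instead of real embeddings. It is written in pure Python (no NumPy, no API call) so it runs anywhere. The document names and vectors are invented for the example.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; real ones have many more dimensions.
documents = {
    "maxi skirts": [1.0, 0.0],
    "cargo pants": [0.0, 1.0],
    "denim skirts": [1.0, 1.0],
}
question_embedding = [2.0, 0.0]

# Rank document names from most to least similar to the question.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(documents[name], question_embedding),
    reverse=True,
)
print(ranked)  # ['maxi skirts', 'denim skirts', 'cargo pants']
```

The question vector is twice as long as the "maxi skirts" vector, yet the two score a similarity of 1 because cosine similarity depends only on direction, not on magnitude.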

In [7]:
rag_prompt_template = \
"""Answer the question with a detailed explanation based on the context below, and if the question can't be answered based on the context, say "I don't know".

Context: 

{}

---

Question: {}
Answer:"""

def create_rag_prompt(question, vector_db, max_token_count = MAX_TOKEN_COUNT):
    tokenizer = tiktoken.get_encoding(ENCODING)
    context_separator = "\n\n###\n\n"
    # Start from the tokens already committed: the template, the question, and the space reserved for the answer.
    current_token_count = len(tokenizer.encode(rag_prompt_template)) + len(tokenizer.encode(question)) + MAX_TOKENS
    context = []
    relevant_texts = find_most_relevant_texts(vector_db, question)["text"].values
    # Add texts in order of relevance until the token budget is exhausted.
    for text in relevant_texts:
        text_token_count = len(tokenizer.encode(text)) + len(tokenizer.encode(context_separator))
        current_token_count += text_token_count
        if current_token_count < max_token_count:
            context.append(text)
        else:
            break
    return rag_prompt_template.format(context_separator.join(context), question)
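
The context-building loop above is a greedy packing under a token budget. The sketch below isolates that idea. To stay runnable without `tiktoken`, it assumes a stand-in tokenizer that counts one token per whitespace-separated word. The names `count_tokens` and `pack_context`, and the tiny budget values, are hypothetical and chosen only for illustration.

```python
def count_tokens(text):
    # Stand-in tokenizer (assumption): one token per whitespace-separated word.
    return len(text.split())

def pack_context(ranked_texts, question, budget, reserved_for_answer, separator_tokens=1):
    # Tokens already committed before any context is added.
    used = count_tokens(question) + reserved_for_answer
    selected = []
    for text in ranked_texts:
        used += count_tokens(text) + separator_tokens
        if used >= budget:
            # Stop at the first text that does not fit, as the notebook loop does.
            break
        selected.append(text)
    return selected

# Texts are assumed to be sorted from most to least relevant.
ranked_texts = ["a b c", "d e f g", "h i"]
selected = pack_context(ranked_texts, "what is x", budget=12, reserved_for_answer=2)
print(selected)  # ['a b c']
```

Because the loop stops at the first text that overflows the budget, shorter but less relevant texts further down the ranking are not considered. This keeps the context limited to the most relevant entries.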

In [8]:
prompt_template = \
"""Provide detailed explanation for the following question.
Question: {}
Answer:"""

def create_prompt(question):
    return prompt_template.format(question)

In [9]:
def execute_prompt(prompt):
    """Print the prompt, send it to the completion model, and print the answer."""
    print(f"===== PROMPT: =====\n\n{prompt}\n\n")
    text_completion = openai.Completion.create(model = GPT_MODEL, prompt = prompt, max_tokens = MAX_TOKENS)
    answer = text_completion["choices"][0]["text"]
    print(f"===== ANSWER: =====\n\n{answer}\n\n")

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [10]:
question1 = "What did Naina Singla say about maxi skirts in 2023?"

In [11]:
execute_prompt(create_rag_prompt(question1, vector_db))

===== PROMPT: =====

Answer the question with a detailed explanation based on the context below, and if the question can't be answered based on the context, say "I don't know".

Context: 

More Maxi Skirt Moments. Fashion expert and stylist Naina Singla tells InStyle that if you’re going to invest in something new this spring, let it be a statement maxi skirt. “They are a versatile wardrobe staple worth investing in and are perfect for every season,” Singla gushes via email. Plus, with the surge of Y2K influence in the fashion sphere, Singla says long maxi styles with skirts fit the bill.

“A maxi skirt is a piece that easily transitions from day to night and comes in a range of styles,” she explains. “Right now, we are seeing denim, satin slip skirts, and knitted skirts in maxi length.”

###

2023 Fashion Trend: Maxi Skirts. In response to the ultra unpractical mini skirts of 2022, maxi skirts are here to dominate the year. In line with the aforementioned cargo and denim trends, expec

In [12]:
execute_prompt(create_prompt(question1))

===== PROMPT: =====

Provide detailed explanation for the following question.
Question: What did Naina Singla say about maxi skirts in 2023?
Answer:


===== ANSWER: =====

 It is impossible to accurately answer this question as it is referring to a future date (2023) and the statement or viewpoint of Naina Singla can change over time. Without any specific context or evidence, it is purely speculative to make a statement about what Naina Singla may say about maxi skirts in 2023. Furthermore, fashion trends and opinions are constantly evolving and can vary greatly from person to person. As such, it is best to seek out current and relevant information about fashion trends and opinions instead of making assumptions about future statements.




**With RAG, the prompt includes relevant context retrieved from the vector database, which makes it possible to answer this question. Plain prompting without context cannot answer it properly.**

### Question 2

In [13]:
question2 = "What did Andie Sobrato say about platform slip-ons in 2023?"

In [14]:
execute_prompt(create_rag_prompt(question2, vector_db))

===== PROMPT: =====

Answer the question with a detailed explanation based on the context below, and if the question can't be answered based on the context, say "I don't know".

Context: 

Platform Slip-Ons. Though the platform slip-ons you know and love from the 2000s have been in style for a while now (thanks, in part, to the almost-revival of Lizzie McGuire), personal stylist and image consultant Andie Sobrato tells InStyle the nostalgic footwear will be especially popular for spring 2023, with plenty of bold, printed pairs to choose from.

“I am excited to throw on a daisy, polka dot, or bright platform slip-on with a cropped jean and button-up shirt,” Sobrato says via email.” These relaxed and fun shoes will be the standout of any outfit and bring a youthful energy that we all need.”

###

There are plenty of '90s things to take note of in the fashion space for spring, and ballet shoes just happen to be one of them. I'm always surprised at just how divisive these shoes prove to be

In [15]:
execute_prompt(create_prompt(question2))

===== PROMPT: =====

Provide detailed explanation for the following question.
Question: What did Andie Sobrato say about platform slip-ons in 2023?
Answer:


===== ANSWER: =====

 It is not possible to provide a specific explanation for this question as it is referencing a specific event in the future. Without more context or information, it is not possible to accurately answer this question.




**As with Question 1, RAG supplies relevant context retrieved from the vector database, which makes it possible to answer this question. Plain prompting without context cannot answer it properly.**