# Custom Chatbot Project

### Dataset
- 2023_fashion_trends.csv (Provided with the Project)
## Comprehensive Explanation
This source contains data about events from 2023, so the gpt-3.5-turbo-instruct model would not know about it because it was never trained on it. It also draws on several data sources, with descriptive content and the names of the articles, which makes it easier to write specific prompts that return different answers. Finally, this dataset is unlikely to change, so my chatbot will not break the way it could with Wikipedia articles, which are edited over time.

Other reasons include:

- This dataset does not include numerical calculations or statistics, which the model is not well suited for.

- The text column combines the source, the content, and the URL. This allows the model to cite sources and to identify which rows belong to the same source, such as the same URL and/or the same web article.

- The original model was not trained on the fashion trends for 2023 because its training data ends in 2021.

- The content allows the model to identify fashion trends by comparing articles from different sources and to interpret the trends of 2023 by observing patterns across all articles.

- The results from the original model are very different from the results obtained with the custom data, since on its own the model does not have the new data that we supply in the prompt.
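As a rough illustration of the combined text column described above, the sketch below builds it for a two-row toy table. The column names `Source`, `Trends`, and `URL` are assumed to match the CSV headers, and the rows are made-up placeholders, not entries from the dataset.

```python
import pandas as pd

# Hypothetical rows standing in for the real CSV; every value is a placeholder.
toy = pd.DataFrame(
    {
        "URL": ["https://example.com/a", "https://example.com/b"],
        "Trends": ["Placeholder trend one", "Placeholder trend two"],
        "Source": ["Example Article A", "Example Article B"],
    }
)

# Join the source title, the trend text, and the URL into one string per row,
# so that every row carries its own citation.
toy["text"] = toy["Source"] + ": " + toy["Trends"] + " | " + toy["URL"]

print(toy["text"].iloc[0])
```

Because the source title and URL travel with each row, any row that ends up in the prompt context can be traced back to its article.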

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [1]:
# Imports section
import openai
from openai.embeddings_utils import get_embedding, distances_from_embeddings
import tiktoken
import numpy as np
import pandas as pd

# Define Constants
OPEN_AI_KEY = "YOUR API KEY"
MAX_TOKENS = 150
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = OPEN_AI_KEY

In [2]:
def load_and_wrangle(file_path):
    # Read the CSV file into a dataframe
    df = pd.read_csv(file_path, header=0)

    # Combine the data into a single column named "text"
    df['text'] = df['Source'] + ': ' + df['Trends'] + ' | ' + df['URL']
    # Remove the old columns
    df.drop(['URL', 'Source', 'Trends'], axis=1, inplace=True)
    return df

In [3]:
# Generate Embeddings
def generate_embeddings(df: pd.DataFrame, output_csv_file: str, embedding_model_name: str):
    """Generating Embeddings
    We'll use the `Embedding`
    tooling from OpenAI [documentation here](https://platform.openai.com/docs/guides/embeddings/embeddings)
    to create vectors representing each row of our custom dataset."""

    batch_size = 100
    embeddings = []
    for i in range(0, len(df), batch_size):
        # Send text data to OpenAI model to get embeddings
        response = openai.Embedding.create(
            input=df.iloc[i:i + batch_size]["text"].tolist(),
            engine=embedding_model_name
        )

        # Add embeddings to list
        embeddings.extend([data["embedding"] for data in response["data"]])

    # Add embeddings list to dataframe
    df["embeddings"] = embeddings

    # In order to avoid having to run that code again in the future, we'll save the generated embeddings as a CSV file.
    df.to_csv(output_csv_file)
    return df
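Saving the embeddings to CSV as above stores each vector as text, so it has to be parsed back into a list when the file is reloaded. The sketch below shows that round trip with `ast.literal_eval`, which only accepts Python literals and is therefore a stricter alternative to `eval`. It uses the standard-library `csv` module as a stand-in for the pandas calls, and the vectors are short placeholders rather than real embeddings.

```python
import ast
import csv
import io

# Placeholder vectors; real embeddings are much longer.
embeddings = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.5]]

# Write each vector as text, the way a CSV cell holds a Python list.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["embeddings"])
for vector in embeddings:
    writer.writerow([str(vector)])

# Read the data back: each cell is now a string such as "[0.1, -0.2, 0.3]".
buffer.seek(0)
reader = csv.DictReader(buffer)
restored = [ast.literal_eval(row["embeddings"]) for row in reader]

print(restored == embeddings)
```

With a dataframe, the same parsing step could be written as `.apply(ast.literal_eval)` on the reloaded column.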


In [4]:
EMBEDDINGS_FILE = './data/embeddings.csv'
DATASET_SOURCE_FILE = './data/2023_fashion_trends.csv'
try:
    bot_data_frame = pd.read_csv(EMBEDDINGS_FILE, index_col=0)
    bot_data_frame["embeddings"] = bot_data_frame["embeddings"].apply(eval).apply(np.array)
except FileNotFoundError:
    print("Creating Embedding and saving to CSV")
    # Load the fashion trends CSV and combine its columns into a single "text" column
    wrangled_dataset = load_and_wrangle(DATASET_SOURCE_FILE)
    # Generating Embeddings
    bot_data_frame = generate_embeddings(wrangled_dataset, EMBEDDINGS_FILE, EMBEDDING_MODEL_NAME)
else:
    print('"Embedding is loaded to CSV"')

"Embedding is loaded to CSV"


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [5]:
def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """

    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)

    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )

    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
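To make the ranking step above concrete, here is a minimal sketch of cosine distance on toy two-dimensional vectors, assuming the usual definition of one minus the cosine similarity. The helper `cosine_distance` and the vectors are illustrative only; the notebook itself relies on `distances_from_embeddings`.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy "question" vector and three toy "row" vectors.
question = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Sort ascending: the smallest distance is the most relevant row.
ranked = sorted(rows, key=lambda name: cosine_distance(question, rows[name]))
print(ranked)
```

Vector length does not matter here, only direction, which is why the row pointing the same way as the question ranks first even though it is twice as long.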

In [6]:
def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")

    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""

    current_token_count = len(tokenizer.encode(prompt_template)) + len(tokenizer.encode(question))

    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:

        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count

        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
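The context-building loop above is a greedy fill against a token budget. The sketch below isolates that logic so it can run without `tiktoken` or an API key. `pack_context` and `count_words` are hypothetical names, and counting whitespace-separated words is only a stand-in for real token counting.

```python
from typing import Callable, List


def pack_context(
    texts: List[str],
    budget: int,
    fixed_cost: int,
    count_tokens: Callable[[str], int],
) -> List[str]:
    """Take texts in ranked order until adding the next one would exceed the budget."""
    used = fixed_cost  # tokens already spent on the template and the question
    selected: List[str] = []
    for text in texts:
        used += count_tokens(text)
        if used > budget:
            break  # stop at the first row that does not fit
        selected.append(text)
    return selected


def count_words(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words, not real tokens."""
    return len(text.split())


ranked_rows = ["one two three", "four five", "six seven eight nine", "ten"]
packed = pack_context(ranked_rows, budget=8, fixed_cost=2, count_tokens=count_words)
print(packed)
```

In this toy run the last row would still fit in the budget, but the loop stops at the first row that overflows, so less relevant rows never displace more relevant ones.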

In [7]:
def answer_question(
        question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model

    If the model produces an error, return an empty string
    """

    prompt = create_prompt(question, df, max_prompt_tokens)

    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

In [8]:
# Function that answers questions directly from the model without any custom data
def get_model_answer(prompt: str, max_tokens: int):
    answer = openai.Completion.create(
        model=COMPLETION_MODEL_NAME,
        prompt=prompt,
        max_tokens=max_tokens
    )["choices"][0]["text"].strip()
    return answer

In [9]:
QUESTION_1="""
Question: "What jean styles are trending this spring according to fashion experts?"
Answer:
"""

In [10]:
QUESTION_2="""
Question: "What styles are in vogue this summer?"
Answer:
"""

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [11]:
# Answer from the basic Completion model to Question 1
print(get_model_answer(QUESTION_1, MAX_TOKENS))

According to fashion experts, the following jean styles are trending this spring:

1. Wide Leg Jeans: Wide leg jeans are making a comeback this spring, giving a nod to the 70s fashion trend. They offer a relaxed and comfortable fit that is perfect for everyday wear.

2. Straight Leg Jeans: Straight leg jeans continue to be a popular style this spring. They are versatile and can be dressed up or down, making them a wardrobe staple.

3. High-Waisted Jeans: High-waisted jeans are here to stay, as they offer a flattering and slimming effect. They can be found in a variety of cuts and styles, such as skinny, straight leg, and flared.

4. Cropped Jeans: Cropped jeans


In [12]:
# Answer from custom query to Question 1
print(answer_question(QUESTION_1, bot_data_frame))


Relaxed and wide-leg silhouettes.


### Question 2

In [13]:
# Answer from the basic Completion model to Question 2
print(get_model_answer(QUESTION_2, MAX_TOKENS))

1. Bright Colors: Bold and vibrant color palettes are in style this summer. Think sunny yellows, bright oranges, hot pinks, and electric blues.

2. Floral Prints: Floral patterns are always a classic for summer fashion, and this year is no exception. Look for dainty, romantic prints or bold and tropical motifs.

3. Retro Vibes: Nostalgic designs from the 60s, 70s, and 80s are making a comeback this summer. Think mini dresses, bell-bottom jeans, and oversized sunglasses.

4. Flowy Silhouettes: Loose, billowy pieces like maxi dresses, flowy skirts, and wide-legged pants are perfect for summer heat. They also allow for easy movement


In [14]:
# Answer from custom query to Question 2
print(answer_question(QUESTION_2, bot_data_frame))

From this context, we can gather that formfitting trompe l'oeil and cyber prints, painterly ombrés, PVC ruffles, white cotton, perfectly cut trousers, simplicity and everyday dressing, a trend for the inner maximalist, a return to the aesthetics of the '80s and '90s, double floral embellishments, and elevated basics, are all in vogue for summer fashion in 2023.


# Have a continuous conversation with the chatbot

Note: This will probably work better outside of Jupyter notebooks

In [1]:
def start_chatting_with_bot():
    keep_chatting = True

    print('Hello, I am here to answer all your questions. Ask me anything!\n')
    while keep_chatting:
        new_question = input('What is your question?\n')
        print(answer_question(new_question, bot_data_frame))
        user_answer = input('Do you want to continue chatting y/n?\n')

        # Stop only on an explicit "n"; any other reply keeps the chat going
        if user_answer.strip().lower() == 'n':
            keep_chatting = False

    print('It was very nice talking to you, good bye!')

In [None]:
# Start chatting with chatbot!
start_chatting_with_bot()
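For running the loop outside a notebook, or trying it without API calls, one option is to pass the answer function and the input/output functions in as arguments. This is only a sketch: `chat_loop` and `fake_answer` are hypothetical names, and the scripted inputs stand in for a real user.

```python
from typing import Callable, List


def chat_loop(
    answer_fn: Callable[[str], str],
    read: Callable[[str], str] = input,
    write: Callable[[str], None] = print,
) -> int:
    """Answer questions until the user replies 'n'; return how many were answered."""
    answered = 0
    while True:
        question = read("What is your question?\n")
        write(answer_fn(question))
        answered += 1
        if read("Do you want to continue chatting y/n?\n").strip().lower() == "n":
            break
    return answered


# Scripted stand-ins for the user and the model, so no API key is needed.
scripted_inputs = iter(["first question", "y", "second question", "n"])
outputs: List[str] = []


def fake_answer(question: str) -> str:
    """Placeholder for a real answer function; simply echoes the question."""
    return f"(answer to: {question})"


count = chat_loop(
    fake_answer,
    read=lambda prompt: next(scripted_inputs),
    write=outputs.append,
)
print(count, outputs)
```

In real use, `answer_fn` could be a small wrapper around `answer_question` that supplies the dataframe, with `read` and `write` left at their defaults.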

# Additional thoughts
Supplying the custom data as context seems to give more concise answers to our questions than the original model gives on its own. Furthermore, the fashion trend data is specific to a season and year, so the original model has no way to predict what would be in vogue if it was never trained on it; the custom data provides exactly what it needs to give informed answers.