# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

For this exercise, I chose the Fashion Trends for 2023 dataset. Each row holds a trend description along with its source and URL, which makes the dataset a good fit for retrieval: the chatbot can ground its answers in specific passages about 2023 trends instead of relying only on what the completion model already knows. Combining retrieval with a generative model lets the chatbot give answers that are both specific to the dataset and fluently written, which is useful to businesses and shoppers in the fashion domain.

Below are some potential applications:

1. Personalized Styling Recommendations
   - Virtual Stylist: Suggests outfits based on user preferences, body type, occasion, or current fashion trends.
   - Wardrobe Planning: Helps users mix and match clothing they own with trending styles.
   - Event-Specific Advice: Provides recommendations for events like weddings, parties, or job interviews.

2. E-Commerce Support
   - Product Discovery: Assists customers in finding products that match trending styles (e.g., "Show me dresses in 2023's trending colors").
   - Upselling & Cross-Selling: Recommends complementary items to enhance an outfit (e.g., "Pair this jacket with these accessories").
   - Trend Updates: Educates users about new arrivals that align with 2023 trends.

3. Fashion Education and Awareness
   - Trend Insights: Explains why specific styles, colors, or patterns are trending in 2023.
   - Sustainability Advice: Guides users on eco-friendly fashion options based on the dataset.
   - Cultural Fashion Trends: Highlights how global cultural influences shape trends.

4. Fashion Blogging and Influencer Support
   - Content Creation: Provides inspiration for blog posts or social media captions related to fashion trends.
   - Hashtag Suggestions: Recommends popular hashtags or buzzwords associated with 2023 fashion trends for social media posts.
   - Trend Forecasting: Offers insights into future trend predictions based on current data.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [None]:
import pandas as pd
import openai  # the calls below use the legacy (pre-1.0) openai interface

In [None]:
# load dataframe
df = pd.read_csv("data/2023_fashion_trends.csv")

# check first rows
df.head()

# get rid of url and source
df.drop(["URL", "Source"], axis=1, inplace=True)

# replace trends with text
df = df.rename(columns={"Trends": "text"})

# check first rows
df.head()

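The task asks for a `"text"` column with at least 20 rows, so it can help to check that explicitly after wrangling. The following is a minimal sketch, not part of the original notebook: `check_text_column` is a hypothetical helper, and the toy dataframe stands in for the real CSV.

```python
import pandas as pd


def check_text_column(frame: pd.DataFrame, min_rows: int = 20) -> pd.DataFrame:
    """Return a cleaned copy with a usable "text" column, or raise ValueError.

    Hypothetical helper for illustration only.
    """
    if "text" not in frame.columns:
        raise ValueError('expected a column named "text"')

    # Drop missing values, trim whitespace, and discard empty strings
    cleaned = frame.dropna(subset=["text"]).copy()
    cleaned["text"] = cleaned["text"].astype(str).str.strip()
    cleaned = cleaned[cleaned["text"] != ""].reset_index(drop=True)

    if len(cleaned) < min_rows:
        raise ValueError(
            f"need at least {min_rows} rows of text, found {len(cleaned)}"
        )
    return cleaned


# Toy data standing in for the real CSV: 22 usable rows plus two unusable ones
toy = pd.DataFrame({"text": [f"trend {i}" for i in range(22)] + [None, "  "]})
checked = check_text_column(toy)
print(len(checked))
```

In the notebook, the same check could be run as `df = check_text_column(df)` right after the rename step.
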
## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [None]:
# set up variables for model and max tokens
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"
MAX_TOKENS = 2400

In [None]:
# prepare embeddings
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 25
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])



# Add embeddings list to dataframe
df["embeddings"] = embeddings
# check content and embeddings
df.head()

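The batching loop above can be tried offline by swapping the API call for a stand-in. This is a rough sketch under that assumption: `embed_in_batches` and `fake_embed_batch` are hypothetical names, and the fake embedder returns made-up two-dimensional vectors purely so the slicing logic can be checked without network access.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 25,
) -> List[List[float]]:
    """Call embed_batch on consecutive slices and collect results in input order."""
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        result = embed_batch(batch)
        # Guard against a response that drops or duplicates items
        if len(result) != len(batch):
            raise ValueError("embedding count does not match batch size")
        vectors.extend(result)
    return vectors


def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    # Hypothetical stand-in for the embedding API: [length, vowel count]
    return [[float(len(t)), float(sum(c in "aeiou" for c in t))] for t in batch]


texts = [f"trend number {i}" for i in range(60)]
vectors = embed_in_batches(texts, fake_embed_batch, batch_size=25)
print(len(vectors), vectors[0])
```

With 60 texts and a batch size of 25, the stand-in is called three times (25, 25, and 10 items), and the output keeps one vector per input row, which is what the `df["embeddings"]` assignment relies on.
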
In [None]:
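# create simple prompt
def create_simple_prompt(question):
    """Return a prompt containing only the question, with no retrieved context."""
    prompt_template = """
Question: {}
Answer:"""

    # The template has a single placeholder, so pass only the question
    return prompt_template.format(question)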
    
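# define how to answer a prompt
def answer_prompt(
    prompt, df, max_answer_tokens=150
):
    """
    Send the prompt to the Completion model and return the answer text.
    The df argument is not used here; it is kept so existing calls still work.
    """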
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

In [None]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy

import tiktoken

def create_RAG_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
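
To make the ranking step in `get_rows_sorted_by_relevance` concrete, here is a small offline sketch of sorting rows by cosine distance. It assumes cosine distance is defined as one minus cosine similarity; the helper names, the toy vectors, and the row labels are all hypothetical and do not come from the dataset.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity. Assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(
    query: Sequence[float],
    rows: Sequence[Tuple[str, Sequence[float]]],
) -> List[Tuple[str, float]]:
    """Return (text, distance) pairs sorted so the closest row comes first."""
    scored = [(text, cosine_distance(query, vec)) for text, vec in rows]
    return sorted(scored, key=lambda pair: pair[1])


# Toy two-dimensional "embeddings"; real ones have many more dimensions
rows = [
    ("row A", [1.0, 0.0]),
    ("row B", [0.0, 1.0]),
    ("row C", [0.8, 0.6]),
]
ranked = rank_by_distance([1.0, 0.1], rows)
for text, distance in ranked:
    print(text, round(distance, 3))
```

The query vector points almost along the first axis, so row A ranks first, row C second, and row B last. Ascending order puts the most relevant rows at the top, which is why the notebook sorts with `ascending=True`.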

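The context loop in `create_RAG_prompt` is a greedy fill: rows are added in relevance order until the next one would exceed the token budget. A minimal sketch of that idea follows, with a whitespace word counter standing in for the `tiktoken` tokenizer; `pack_context` and `count_words` are hypothetical names used only for illustration.

```python
from typing import Callable, Iterable, List


def pack_context(
    rows: Iterable[str],
    budget: int,
    count_tokens: Callable[[str], int],
    used: int = 0,
) -> List[str]:
    """Take rows in order until adding the next one would exceed the budget.

    `used` is the number of tokens already spent (template plus question).
    """
    chosen: List[str] = []
    for text in rows:
        used += count_tokens(text)
        if used > budget:
            # Stop at the first row that does not fit, as the notebook does
            break
        chosen.append(text)
    return chosen


def count_words(text: str) -> int:
    # Hypothetical stand-in for a tokenizer: one "token" per whitespace-split word
    return len(text.split())


rows = ["one two three", "four five", "six seven eight nine", "ten"]
picked = pack_context(rows, budget=8, count_tokens=count_words, used=2)
print(picked)
```

Because the loop breaks at the first row that overflows, a shorter row further down the ranking (here `"ten"`) is not considered even though it would fit. That keeps the context limited to the most relevant rows, at the cost of sometimes leaving part of the budget unused.
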
## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [None]:
q1 = "What are the top 5 fashion trends of 2023?"


In [None]:

# Print answers without and with RAG
print('\nQ1: Answer without RAG: \n', answer_prompt(create_simple_prompt(q1), df))
print('\nQ1: Answer with RAG: \n', answer_prompt(create_RAG_prompt(q1, df, 2000), df))



### Question 2

In [None]:
q2 = "What colors are in style this spring 2023?"

print('\nQ2: Answer without RAG: \n', answer_prompt(create_simple_prompt(q2), df))
print('\nQ2: Answer with RAG: \n', answer_prompt(create_RAG_prompt(q2, df, 2000), df))
