# Custom Chatbot Project

**Purpose**: GPT models learn from text collected up to a training cutoff, so they cannot answer questions about more recent information. Fashion trends, for example, change every year, so a GPT model either cannot answer questions about current trends or gives vague answers. In this project, I use **RAG** (Retrieval-Augmented Generation) to bring fashion trend data into the chatbot's responses.

## Load libraries and default values

In [77]:
import glob
import pandas as pd
import tiktoken
import openai
# Note: openai.embeddings_utils ships only with the legacy SDK (openai<1.0)
from openai.embeddings_utils import get_embedding, distances_from_embeddings

In [None]:
openai.api_key = "YOUR API KEY"
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [22]:
path = "data/*"
# glob does not guarantee any ordering, so sort before taking the last match
data_path = sorted(glob.glob(path))[-1]

df = pd.read_csv(data_path)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82 entries, 0 to 81
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   URL     82 non-null     object
 1   Trends  82 non-null     object
 2   Source  82 non-null     object
dtypes: object(3)
memory usage: 2.0+ KB


**Observation**: The column listing shows that the dataframe has no `text` column. <br>
**Todo**: Inspect the data to decide which column to rename.

In [29]:
df.head()

| | URL | text | Source |
|---|---|---|---|
| 0 | https://www.refinery29.com/en-us/fashion-trend... | 2023 Fashion Trend: Red. Glossy red hues took ... | 7 Fashion Trends That Will Take Over 2023 — Sh... |
| 1 | https://www.refinery29.com/en-us/fashion-trend... | 2023 Fashion Trend: Cargo Pants. Utilitarian w... | 7 Fashion Trends That Will Take Over 2023 — Sh... |
| 2 | https://www.refinery29.com/en-us/fashion-trend... | 2023 Fashion Trend: Sheer Clothing. "Bare it a... | 7 Fashion Trends That Will Take Over 2023 — Sh... |
| 3 | https://www.refinery29.com/en-us/fashion-trend... | 2023 Fashion Trend: Denim Reimagined. From dou... | 7 Fashion Trends That Will Take Over 2023 — Sh... |
| 4 | https://www.refinery29.com/en-us/fashion-trend... | 2023 Fashion Trend: Shine For The Daytime. The... | 7 Fashion Trends That Will Take Over 2023 — Sh... |


**Observation**: The `Trends` column holds the fashion trend text. (The preview above already labels it `text` because that cell was re-run after the rename below.)<br>
**Todo**: Rename the `Trends` column to `text`.

In [28]:
if "Trends" in df.columns:
    df = df.rename(columns = {"Trends":"text"})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82 entries, 0 to 81
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   URL     82 non-null     object
 1   text    82 non-null     object
 2   Source  82 non-null     object
dtypes: object(3)
memory usage: 2.0+ KB


**Observation**: The dataframe loaded from the CSV has 82 rows, more than the required 20, and now includes a `text` column.

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

### Generating Embeddings

In [72]:
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
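
The batching loop above can be looked at in isolation. The following is a minimal sketch, not part of the original notebook: `iter_batches` is a hypothetical helper name, and the placeholder strings stand in for the 82 rows of `df["text"]`. It shows that with `batch_size = 100` the whole dataset goes out in a single request, while a smaller batch size would split it into several.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 82 placeholder strings stand in for the 82 rows of df["text"] (assumption)
texts = [f"row {i}" for i in range(82)]

# Number of rows that would be sent in each embedding request
batch_sizes_100 = [len(batch) for batch in iter_batches(texts, 100)]
batch_sizes_32 = [len(batch) for batch in iter_batches(texts, 32)]

print(batch_sizes_100)  # [82]
print(batch_sizes_32)   # [32, 32, 18]
```

Because the slices are taken in order and every row lands in exactly one batch, extending the `embeddings` list batch by batch keeps it aligned with the dataframe rows.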

### Create a Function that Finds Related Pieces of Text for a Given Question

In [74]:
def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
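
To make the ranking step concrete, here is a small sketch of cosine distance in plain Python. It is an illustration only: `cosine_distance` is a hypothetical helper, the two-dimensional vectors are made up, and it assumes that `distances_from_embeddings` with `distance_metric="cosine"` computes one minus the cosine similarity, which is my understanding of the legacy helper.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy 2-D "embeddings" (assumption: real embeddings have far more dimensions)
question_vec = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],
    "45 degrees": [1.0, 1.0],
    "orthogonal": [0.0, 3.0],
}

# Smaller distance means more relevant, so an ascending sort puts the best match first
ranked = sorted(rows, key=lambda name: cosine_distance(question_vec, rows[name]))
print(ranked)  # ['same direction', '45 degrees', 'orthogonal']
```

Cosine distance ignores vector length, which is why `[2.0, 0.0]` is treated as a perfect match for `[1.0, 0.0]`.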

In [76]:
def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
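
The context-selection loop in `create_prompt` can be sketched on its own. This is a simplified illustration under stated assumptions: `select_context` and `count_words` are hypothetical names, and whitespace word counting stands in for the `cl100k_base` tokenizer so the example runs without `tiktoken`.

```python
def select_context(rows, fixed_token_count, max_token_count, count_tokens):
    """Take rows in relevance order until adding one would exceed the budget."""
    selected = []
    used = fixed_token_count  # tokens already spent on the template and question
    for text in rows:
        used += count_tokens(text)
        if used > max_token_count:
            break  # stop at the first row that does not fit
        selected.append(text)
    return selected


def count_words(text):
    # Stand-in tokenizer (assumption): one token per whitespace-separated word
    return len(text.split())


# Rows are assumed to be sorted from most to least relevant already
rows = ["cargo pants are back", "sheer clothing everywhere this season", "red"]
picked = select_context(rows, fixed_token_count=5, max_token_count=12,
                        count_tokens=count_words)
print(picked)  # ['cargo pants are back']
```

Note that the loop stops at the first row that overflows the budget, so the short third row is never considered even though it would fit. That mirrors the `break` in `create_prompt` and keeps the context strictly in relevance order.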

In [78]:
def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

In [99]:
def without_RAG_response(question):
    response = openai.Completion.create(
        model=COMPLETION_MODEL_NAME,
        prompt=question,
        max_tokens=150)
    response = response["choices"][0]["text"].replace("\n","")
    return response

### Question 1

In [100]:
question = "What is type of pants 2023 Fashion Trend "

In [102]:
without_RAG_response(question)

'There is no specific type of pants that is predicted to be a fashion trend in 2023. Fashion trends are constantly changing and evolving, and it is impossible to accurately predict what will be popular five years in advance. However, some general trends in pants that may continue in 2023 include high-waisted styles, wide-leg or flared trousers, and statement prints or patterns.'

In [86]:
answer_question(question, df)

'The cargo pants are a type of pants that are part of the 2023 fashion trend.'

### Question 2

In [103]:
question = "What is type of shirt 2023 Fashion Trend "

In [104]:
without_RAG_response(question)

'I am an AI and do not have the capability to predict future fashion trends for 2023. It would depend on the current fashion trends and consumer preferences at that time. The type of shirt that will be in trend could range from basic t-shirts, to button-down shirts, to more unique and modern styles.'

In [105]:
answer_question(question, df)

'Pinstripe tailoring.'

## Conclusion

The base model gave non-committal answers, whereas the RAG-backed model named specific items from the dataset: cargo pants and pinstripe tailoring.