# Custom Chatbot Project

### Dataset
- 2023_fashion_trends.csv (Provided with the Project)
#### Reason: 
This source contains data about events from 2023, so the gpt-3.5-turbo-instruct model would not know about it because it was never trained on it. Additionally, it contains references from different data sources, with descriptive content and article names, which makes it easier to create specific prompts that return different answers. Lastly, this dataset is not likely to change, so my chatbot will not break as it could with Wikipedia articles.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [83]:
# Imports section
import openai
from openai.embeddings_utils import get_embedding, distances_from_embeddings
import tiktoken
import numpy as np
import pandas as pd

# Define Constants
OPEN_AI_KEY = ''
MAX_TOKENS = 150
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = OPEN_AI_KEY

In [84]:
def load_and_wrangle(file_path):
    # Read File and save data to variable
    df = pd.read_csv(file_path , header=0)

    # Combine the data into a single column named "text"
    df['text'] = df['Source'] + ': ' + df['Trends'] + ' | ' + df['URL']
    # Remove old columns
    df.drop(['URL','Source', 'Trends'], axis=1, inplace=True)
    return df
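
As a quick illustration of what the wrangling step produces, the sketch below applies the same column concatenation to a made-up one-row frame. The helper name `combine_columns` and the sample values are hypothetical; only the `Source`, `Trends`, and `URL` column names come from the cell above.

```python
import pandas as pd


def combine_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with Source, Trends and URL merged into a single "text" column."""
    out = df.copy()
    out["text"] = out["Source"] + ": " + out["Trends"] + " | " + out["URL"]
    return out.drop(columns=["URL", "Source", "Trends"])


# Made-up sample row, not taken from the real dataset
sample = pd.DataFrame({
    "URL": ["https://example.com/article"],
    "Trends": ["Red is everywhere this season"],
    "Source": ["Example Magazine"],
})
combined = combine_columns(sample)
print(combined["text"].iloc[0])
```

Unlike `load_and_wrangle`, this version leaves the input frame untouched. Note that string concatenation in pandas yields a missing value for any row where one of the three columns is missing, so it may be worth checking `df['text'].isna().sum()` after wrangling.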

In [85]:
# Generate Embeddings
def generate_embeddings(df: pd.DataFrame, output_csv_file: str, embedding_model_name: str):
    """Generate embeddings for each row of the dataset.

    Uses the `Embedding` tooling from OpenAI
    [documentation here](https://platform.openai.com/docs/guides/embeddings/embeddings)
    to create vectors representing each row of our custom dataset.
    """

    batch_size = 100
    embeddings = []
    for i in range(0, len(df), batch_size):
        # Send text data to OpenAI model to get embeddings
        response = openai.Embedding.create(
            input=df.iloc[i:i + batch_size]["text"].tolist(),
            engine=embedding_model_name
        )

        # Add embeddings to list
        embeddings.extend([data["embedding"] for data in response["data"]])

    # Add embeddings list to dataframe
    df["embeddings"] = embeddings

    # In order to avoid having to run that code again in the future, we'll save the generated embeddings as a CSV file.
    df.to_csv(output_csv_file)
    return df
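
Saving to CSV stores each embedding as its string representation, so the list has to be parsed back when the file is reloaded. The following is a minimal sketch of that round trip on made-up two-dimensional vectors, using an in-memory buffer instead of a file on disk. It swaps in `ast.literal_eval` as a stricter alternative to `eval`; that substitution is an assumption of this sketch, not what the notebook itself uses.

```python
import ast
import io

import numpy as np
import pandas as pd

# Made-up rows with tiny embeddings stored as Python lists
original = pd.DataFrame({
    "text": ["first row", "second row"],
    "embeddings": [[0.1, 0.2], [0.3, 0.4]],
})

# Write to an in-memory buffer in place of a CSV file on disk
buffer = io.StringIO()
original.to_csv(buffer)
buffer.seek(0)

# After reading, each embedding is a string such as "[0.1, 0.2]"
restored = pd.read_csv(buffer, index_col=0)
restored["embeddings"] = restored["embeddings"].apply(ast.literal_eval).apply(np.array)
```

This only works when the embeddings were written as Python lists, as they are when taken directly from the API response; NumPy arrays are written without commas and would not parse this way.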


In [86]:
EMBEDDINGS_FILE = './data/embeddings.csv'
DATASET_SOURCE_FILE = './data/2023_fashion_trends.csv'
try:
    # Reuse previously saved embeddings when the CSV already exists
    data_f = pd.read_csv(EMBEDDINGS_FILE, index_col=0)
    data_f["embeddings"] = data_f["embeddings"].apply(eval).apply(np.array)
except FileNotFoundError:
    print("Creating Embedding and saving to CSV")
    # Load the 2023 fashion trends dataset and combine it into a "text" column
    wrangled_dataset = load_and_wrangle(DATASET_SOURCE_FILE)
    # Generating Embeddings
    dataset_w_embeddings = generate_embeddings(wrangled_dataset, EMBEDDINGS_FILE, EMBEDDING_MODEL_NAME)
    # Expose the result under the same name as the cached branch
    data_f = dataset_w_embeddings
else:
    print("Embeddings loaded from CSV")

Creating Embedding and saving to CSV


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.
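
One possible shape for the custom query is sketched below: rank the rows by cosine distance to the question embedding, then add rows to the context until a token budget is reached. The names `cosine_distances`, `build_prompt`, and `count_tokens`, the prompt wording, and the toy two-dimensional embeddings are all assumptions made for illustration. In the notebook, the question embedding would presumably come from `get_embedding` and the token counter from a `tiktoken` encoding.

```python
import numpy as np
import pandas as pd


def cosine_distances(query_embedding, embeddings):
    """Return the cosine distance from the query vector to each row vector."""
    query = np.asarray(query_embedding, dtype=float)
    matrix = np.vstack([np.asarray(e, dtype=float) for e in embeddings])
    similarity = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    return 1.0 - similarity


def build_prompt(question, df, query_embedding, max_prompt_tokens, count_tokens):
    """Assemble a prompt from the most relevant rows that fit the token budget."""
    template = (
        "Answer the question based on the context below, and if the question "
        "can't be answered based on the context, say \"I don't know\"\n\n"
        "Context:\n\n{}\n\n---\n\nQuestion: {}\nAnswer:"
    )
    # Most relevant rows (smallest distance) first
    ranked = df.assign(distance=cosine_distances(query_embedding, df["embeddings"]))
    ranked = ranked.sort_values("distance")

    used = count_tokens(template) + count_tokens(question)
    context = []
    for text in ranked["text"]:
        used += count_tokens(text)
        if used > max_prompt_tokens:
            break
        context.append(text)
    return template.format("\n\n###\n\n".join(context), question)


# Toy data: made-up texts and 2-D embeddings, with a whitespace word count
# standing in for a real tokenizer
toy = pd.DataFrame({
    "text": ["red coats are trending", "wide-leg denim returns", "sheer fabrics on runways"],
    "embeddings": [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.7, 0.7])],
})
prompt = build_prompt("What coats are trending?", toy, [1.0, 0.0], 1000, lambda s: len(s.split()))
print(prompt)
```

Passing the token counter in as a function keeps the sketch runnable without network access or an API key; the budget check stops at the first row that would exceed the limit rather than searching for shorter rows that might still fit.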

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.
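
A side-by-side comparison could be organised along the following lines. This is only a sketch: `compare_answers` and `fake_complete` are hypothetical names, and `fake_complete` is an offline stand-in that returns canned strings. In the notebook, the `complete` argument would instead wrap a call to the OpenAI `Completion` model and return the generated text.

```python
def compare_answers(question, custom_prompt, complete):
    """Return the basic and the context-augmented answer for one question."""
    # The basic query sends the question alone, with no retrieved context
    basic_prompt = f"Question: {question}\nAnswer:"
    return {
        "question": question,
        "basic": complete(basic_prompt).strip(),
        "custom": complete(custom_prompt).strip(),
    }


def fake_complete(prompt):
    """Offline stand-in for a completion call; returns canned text."""
    return " grounded answer " if "Context:" in prompt else " generic answer "


result = compare_answers(
    "What is trending?",
    "Context:\n\nsample context row\n\n---\n\nQuestion: What is trending?\nAnswer:",
    fake_complete,
)
for key, value in result.items():
    print(f"{key}: {value}")
```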

### Question 1

### Question 2