# Custom Chatbot Project

This notebook builds a simple custom chatbot powered by OpenAI.  
It uses a CSV dataset of **2023 fashion trends** and implements a lightweight RAG (retrieval-augmented generation) flow without any additional external frameworks.

**What you will find**
- Load and prepare a dataset (into a `text` column)
- Build basic retrieval with TF-IDF + cosine similarity
- Compare answers **with** vs. **without** custom context
- Run a small interactive loop


For this project, I use the 2023 Fashion Trends dataset (`2023_fashion_trends.csv`).
It contains short text snippets from online articles that describe fashion trends and key style directions in 2023.
The dataset is a good fit because the texts are concise, descriptive, and focused on a single topic: fashion.
It lets the chatbot give more specific answers about 2023 trends, such as colors, materials, or design influences.
A general OpenAI model can talk about fashion in broad terms, but with this dataset as context the chatbot can give more focused and accurate answers about the actual trends of 2023.

## Setup and Imports

In [9]:
# Basic
import os
import openai
import pandas as pd
import numpy as np

# Similarity search
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Vocareum API base
openai.api_base = "https://openai.vocareum.com/v1"
# Read the key from an environment variable instead of hard-coding it in the notebook.
# The variable name OPENAI_API_KEY is an assumption; use whatever name you export.
openai.api_key = os.environ["OPENAI_API_KEY"]

## Data Wrangling

In [12]:
# Load CSV file
df = pd.read_csv(r"C:\Users\P319970\git_delivery\git__project3\data\raw\2023_fashion_trends.csv")

# Check existing columns
print("Columns:", df.columns.tolist())
print("Number of rows:", len(df))

Columns: ['URL', 'Trends', 'Source']
Number of rows: 82


In [27]:
# data quality check on df

# Basic info
print("Shape:", df.shape)

# Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())

# Check for duplicates
duplicates = df.duplicated(subset=["Trends"]).sum()
print(f"\nNumber of duplicate text rows: {duplicates}")

# Check average and min length of the trend texts
df["text_length"] = df["Trends"].astype(str).apply(len)
print("\nAverage text length:", round(df["text_length"].mean(), 1))
print("Minimum text length:", df["text_length"].min())

# Show very short or empty entries
print("\nshort entries (< 200 characters):")
print(df[df["text_length"] < 200]["Trends"].head())

# Optional: remove bad rows
df = df.dropna(subset=["Trends"])
df = df.drop_duplicates(subset=["Trends"])
df = df[df["text_length"] > 30].copy()

# Check new size
print("\nRemaining rows after cleaning:", len(df))


Shape: (82, 5)

Missing values per column:
URL            0
Trends         0
Source         0
text           0
text_length    0
dtype: int64

Number of duplicate text rows: 0

Average text length: 434.1
Minimum text length: 150

short entries (< 200 characters):
69    "Leather jackets are leading the nouveau grung...
Name: Trends, dtype: object

Remaining rows after cleaning: 82


In [28]:
# Create a 'text' column from the 'Trends' column, which holds the main text content
df["text"] = df["Trends"].astype(str).str.strip()

# Keep only 'text' column for chatbot context
fashion_df = df[["text"]].copy()

# Preview first 5 rows
fashion_df.head()

|   | text |
|---|------|
| 0 | 2023 Fashion Trend: Red. Glossy red hues took ... |
| 1 | 2023 Fashion Trend: Cargo Pants. Utilitarian w... |
| 2 | 2023 Fashion Trend: Sheer Clothing. "Bare it a... |
| 3 | 2023 Fashion Trend: Denim Reimagined. From dou... |
| 4 | 2023 Fashion Trend: Shine For The Daytime. The... |


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.
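
One possible way to approach this step is sketched below, under several assumptions. The helper names `build_index`, `retrieve`, and `build_prompt`, the prompt wording, and the `top_k` default are all illustrative and do not come from the course materials. The three toy snippets are made up so the example runs on its own; in the notebook they would be replaced by `fashion_df["text"].tolist()`. The resulting prompt string is what would then be sent to the `Completion` model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def build_index(texts):
    """Fit a TF-IDF vectorizer on the snippets; return it with the document matrix."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(texts)
    return vectorizer, matrix


def retrieve(question, texts, vectorizer, matrix, top_k=3):
    """Return the top_k snippets ranked by cosine similarity to the question."""
    query_vec = vectorizer.transform([question])
    scores = cosine_similarity(query_vec, matrix).ravel()
    best = scores.argsort()[::-1][:top_k]  # indices of the highest scores first
    return [texts[i] for i in best]


def build_prompt(question, snippets):
    """Combine the retrieved snippets and the question into one prompt string."""
    context = "\n\n".join(snippets)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up toy snippets (NOT taken from the dataset), used only to make this runnable.
texts = [
    "Red dresses and glossy red coats.",
    "Cargo pants with large utility pockets.",
    "Sheer fabrics layered over simple basics.",
]

vectorizer, matrix = build_index(texts)
question = "Which pants have pockets?"
top = retrieve(question, texts, vectorizer, matrix, top_k=1)
prompt = build_prompt(question, top)
print(prompt)
```

Keeping retrieval and prompt construction in separate functions makes it easy to inspect which snippets were selected before any API call is made.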

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.
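
A small harness can keep the two answers side by side for each question. This is a sketch under assumptions: `compare_answers` is a hypothetical helper, and `complete` stands for any function that takes a prompt string and returns the model's text, for example a thin wrapper around the OpenAI `Completion` call (not shown here). The stub `fake_complete` only reports whether context was supplied, so the example runs without an API key.

```python
def compare_answers(question, context_snippets, complete):
    """Return the answers to one question without and with custom context.

    `complete` is a hypothetical callable: prompt string in, answer string out.
    """
    basic_prompt = f"Question: {question}\nAnswer:"
    custom_prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n\n".join(context_snippets) + "\n\n"
        f"Question: {question}\nAnswer:"
    )
    return {
        "question": question,
        "basic": complete(basic_prompt),
        "custom": complete(custom_prompt),
    }


def fake_complete(prompt):
    # Stub standing in for the real model call; it does not contact any API.
    return "with context" if "Context:" in prompt else "no context"


result = compare_answers(
    "Which colors were popular?",
    ["Toy snippet: red was popular."],  # made-up text, not from the dataset
    fake_complete,
)
for key, value in result.items():
    print(f"{key}: {value}")
```

Swapping `fake_complete` for a real model wrapper would leave the rest of the comparison unchanged, which keeps the basic and custom queries on equal footing.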

### Question 1

### Question 2