# Custom Chatbot Project

This notebook builds a simple custom chatbot powered by OpenAI.  
It uses a CSV dataset of **2023 fashion trends** and implements a lightweight RAG (retrieval-augmented generation) flow without any additional frameworks.

**What you will find**
- Load and prepare the dataset (into a `text` column)
- Build basic retrieval with TF-IDF + cosine similarity
- Compare answers **with** vs. **without** custom context
- Run a small interactive loop


For this project, I use the 2023 Fashion Trends dataset (`2023_fashion_trends.csv`).
It contains short text snippets from online articles that describe fashion trends and key style directions in 2023.
This dataset is a good fit because the texts are concise, descriptive, and focused on a single topic: fashion.
It lets the chatbot give more specific answers about 2023 trends, such as colors, materials, or design influences.
A general OpenAI model can talk about fashion in general, but with this dataset added to the prompt, the chatbot should be more focused and accurate when answering questions about actual trends in 2023.

## Setup and Imports

In [None]:
# Basic (the code below uses the legacy openai<1.0 interface: openai.api_base, openai.Completion)
import openai
import pandas as pd
import numpy as np

# Similarity search
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Vocareum API base
openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = "YOUR API KEY"  # Replace with your own API key

## Data Wrangling

In [2]:
# Load CSV file
df = pd.read_csv(r"C:\Users\P319970\git_delivery\git__project3\data\raw\2023_fashion_trends.csv")

# Check existing columns
print("Columns:", df.columns.tolist())
print("Number of rows:", len(df))

Columns: ['URL', 'Trends', 'Source']
Number of rows: 82


In [3]:
# data quality check on df

# Basic info
print("Shape:", df.shape)

# Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())

# Check for duplicates
duplicates = df.duplicated(subset=["Trends"]).sum()
print(f"\nNumber of duplicate text rows: {duplicates}")

# Check average and min length of the trend texts
df["text_length"] = df["Trends"].astype(str).apply(len)
print("\nAverage text length:", round(df["text_length"].mean(), 1))
print("Minimum text length:", df["text_length"].min())

# Show very short or empty entries
print("\nshort entries (< 200 characters):")
print(df[df["text_length"] < 200]["Trends"].head())

# Optional: remove rows with missing, duplicate, or very short (<= 30 characters) text
df = df.dropna(subset=["Trends"])
df = df.drop_duplicates(subset=["Trends"])
df = df[df["text_length"] > 30].copy()

# Check new size
print("\nRemaining rows after cleaning:", len(df))


Shape: (82, 3)

Missing values per column:
URL       0
Trends    0
Source    0
dtype: int64

Number of duplicate text rows: 0

Average text length: 434.1
Minimum text length: 150

short entries (< 200 characters):
69    "Leather jackets are leading the nouveau grung...
Name: Trends, dtype: object

Remaining rows after cleaning: 82


In [4]:
# Create 'text' column out of the column 'Trends', which contains main text content
df["text"] = df["Trends"].astype(str).str.strip()

# Keep only 'text' column for chatbot context
fashion_df = df[["text"]].copy()

# Preview first 5 rows
fashion_df.head()

Unnamed: 0,text
0,2023 Fashion Trend: Red. Glossy red hues took ...
1,2023 Fashion Trend: Cargo Pants. Utilitarian w...
2,"2023 Fashion Trend: Sheer Clothing. ""Bare it a..."
3,2023 Fashion Trend: Denim Reimagined. From dou...
4,2023 Fashion Trend: Shine For The Daytime. The...


## Custom Query Completion

### RAG pipeline:
- Use TF-IDF to retrieve the most relevant trend snippets
- Build a custom prompt that injects the retrieved context
- Call the OpenAI Completion model, keeping a basic (no-context) function for comparison
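
Before applying this to the full dataset, the retrieval idea can be checked on a toy example. The sketch below is only an illustration: the three documents, the query, and all `toy_*` names are made up for this purpose and are not part of the dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up mini corpus (assumption, not taken from the CSV)
toy_docs = [
    "Cargo pants bring utilitarian style back.",
    "Glossy red hues dominated the runway.",
    "Maxi skirts replace mini skirts.",
]

# Fit TF-IDF on the documents, then project the query into the same space
toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_docs)
toy_query = toy_vectorizer.transform(["Which red hues were popular?"])

# One cosine similarity score per document; higher means more word overlap
toy_scores = cosine_similarity(toy_query, toy_matrix).ravel()
best = int(np.argmax(toy_scores))
print(best, toy_docs[best])
```

Only the second document shares terms ("red", "hues") with the query, so it gets the only non-zero score. Words that do not occur in the fitted vocabulary, such as "popular", are simply ignored by `transform`.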

In [5]:
# Build vector index for retrieval
assert "fashion_df" in globals()
assert "text" in fashion_df.columns

# TF-IDF vectorizer (unigrams + bigrams for better recall when text is short)
vectorizer = TfidfVectorizer(ngram_range=(1,2), min_df=1, stop_words="english")
tfidf_matrix = vectorizer.fit_transform(fashion_df["text"].tolist())

def retrieve(query: str, top_k: int = 5):
    """Returns indices and scores of top_k most relevant rows"""
    q_vec = vectorizer.transform([query])
    sims = cosine_similarity(q_vec, tfidf_matrix).ravel()
    top_idx = np.argsort(-sims)[:top_k]
    return list(zip(top_idx, sims[top_idx]))
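
The ranking step inside `retrieve` relies on `np.argsort(-sims)`, which sorts indices by descending similarity. A minimal sketch with made-up scores (the values below are an assumption for illustration only):

```python
import numpy as np

# Hypothetical similarity scores for four rows
sims = np.array([0.10, 0.75, 0.00, 0.40])
top_k = 2

# Negating the scores turns the ascending argsort into a descending ranking
top_idx = np.argsort(-sims)[:top_k]

# Pair each selected row position with its score
pairs = list(zip(top_idx.tolist(), sims[top_idx].tolist()))
print(pairs)
```

Here the rows at positions 1 and 3 are selected, in that order. Note that the positions refer to rows of the TF-IDF matrix, which is why the notebook looks them up with `iloc` rather than by index label.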

In [6]:
# Prompt builder for basic
def build_basic_prompt(question: str) -> str:
    return (
        "You are a kind and very helpful assistant. Answer in clear, simple English.\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Prompt builder for custom
def build_custom_prompt(question: str, context_chunks: list) -> str:
    context_text = "\n\n".join(f"- {c}" for c in context_chunks)
    return (
        "You are a professional Fashion Trend Assistant. Use only the provided context to answer.\n"
        "Be very concise. If the answer is not in the context, say you don't know the answer to the question.\n\n"
        "Context:\n"
        f"{context_text}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
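
To see what the model actually receives, the context part of the custom prompt can be rendered for a couple of chunks. This is a sketch with shortened, illustrative chunks and a made-up question; `sample_chunks`, `sample_question`, and `prompt_tail` are hypothetical names used only here.

```python
# Illustrative chunks; in the notebook these come from the retrieval step
sample_chunks = [
    "2023 Fashion Trend: Red.",
    "2023 Fashion Trend: Cargo Pants.",
]
sample_question = "Which colors were popular?"

# Same joining rule as in build_custom_prompt: one "- " bullet per chunk,
# separated by a blank line
context_text = "\n\n".join(f"- {c}" for c in sample_chunks)

# Context, question, and answer cue in the same order as the custom prompt
prompt_tail = (
    "Context:\n"
    f"{context_text}\n\n"
    f"Question: {sample_question}\n"
    "Answer:"
)
print(prompt_tail)
```

Ending the prompt with `Answer:` matters for a completion-style model, because the model continues the text from that point.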

In [7]:
# Set OpenAI model
MODEL = "gpt-3.5-turbo-instruct"

def ask_basic(question: str, max_tokens: int = 300, temperature: float = 0.2):
    prompt = build_basic_prompt(question)
    resp = openai.Completion.create(
        model=MODEL,
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=temperature,
    )
    return resp.choices[0].text.strip()

def ask_with_context(question: str, top_k: int = 5, max_tokens: int = 300, temperature: float = 0.2):
    hits = retrieve(question, top_k=top_k)
    context = [fashion_df.iloc[i]["text"] for i, _ in hits]
    prompt = build_custom_prompt(question, context)
    resp = openai.Completion.create(
        model=MODEL,
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=temperature,
    )
    return resp.choices[0].text.strip(), context, hits

In [8]:
# Smoke test without API call
q_test = "What were the key colors and materials in 2023 fashion?"
hits = retrieve(q_test, top_k=3)
for i, score in hits:
    print(f"[{score:.3f}] {fashion_df.iloc[i]['text'][:120]}...")

[0.153] 2023 Fashion Trend: Cargo Pants. Utilitarian wear is in for 2023, which sets the stage for the return of the cargo pant....
[0.126] 2023 Fashion Trend: Maxi Skirts. In response to the ultra unpractical mini skirts of 2022, maxi skirts are here to domin...
[0.080] 2023 Fashion Trend: Sheer Clothing. "Bare it all" has been the motto since the end of the lockdown. In 2023,  naked dres...


In [9]:
# Interactive loop with API calls; type 'exit' or 'quit' to stop
while True:
    q = input("\nAsk about 2023 fashion trends (type 'exit' to quit): ")
    if q.strip().lower() in {"exit", "quit"}:
        break
    try:
        ans_custom, ctx, _ = ask_with_context(q, top_k=5)
        ans_basic = ask_basic(q)
        print("\n--- Custom (with context) ---")
        print(ans_custom)
        print("\n--- Basic (no context) ---")
        print(ans_basic)
    except Exception as e:
        print("Error:", e)

## Custom Performance Demonstration

Here we can compare the model’s answers with and without custom context.
For each question, the output shows:
- Basic: model answer without any dataset context
- Custom: model answer using retrieved snippets from the 2023 Fashion Trends dataset
- Retrieved context: the snippets used for the custom prompt

In [12]:
# Function to compare basic vs. custom answers for a given question

def show_comparison(question: str, top_k: int = 5, ctx_preview_chars: int = 180):
    print(f"Q: {question}\n")
    
    # Custom (with context)
    custom_answer, context, hits = ask_with_context(question, top_k=top_k)
    print("=== Custom (with context) ===")
    print(custom_answer.strip(), "\n")
    
    # Basic (without context)
    basic_answer = ask_basic(question)
    print("=== Basic (no context) ===")
    print(basic_answer.strip(), "\n")
    
    # Show retrieved context used
    print("=== Retrieved context (top_k) ===")
    for (i, score), chunk in zip(hits, context):
        preview = chunk[:ctx_preview_chars].replace("\n", " ")
        print(f"[score={score:.3f}] {preview}...")

### Question 1
What does 'quiet luxury' mean in 2023 fashion, and how was it expressed?

In [None]:
q1 = "What does 'quiet luxury' mean in 2023 fashion, and how was it expressed?"
show_comparison(q1, top_k=5, ctx_preview_chars=200)

### Question 2
Which sustainability themes were highlighted in 2023 fashion trends?

In [None]:
q2 = "Which sustainability themes were highlighted in 2023 fashion trends?"
show_comparison(q2, top_k=5, ctx_preview_chars=200)