# 📘 Day 1: Text Preprocessing and Embeddings

**🎯 Goal:** Master text preprocessing and understand how words become numbers (embeddings)

**⏱️ Time:** 75-90 minutes

**🌟 Why This Matters for AI:**
- Tokenization, the first preprocessing step covered here, is also the first thing ChatGPT, Claude, and other LLMs do with your text
- Word embeddings power RAG systems, semantic search, and document retrieval
- Understanding embeddings helps you build better chatbots, search engines, and AI assistants
- GPT-4, BERT, and other modern NLP models use embeddings as their foundation
- RAG systems rely on embedding similarity to find relevant documents!

---

## 🧠 What is Natural Language Processing (NLP)?

**Natural Language Processing** is teaching computers to understand and generate human language!

### 🎯 Real-World NLP Applications (2024-2025):

#### 🤖 **Large Language Models (LLMs)**
- **ChatGPT, Claude, GPT-4**: Generate human-like text
- **Gemini, Llama 3**: Multimodal (Gemini) and open-weight (Llama 3) models
- **Applications**: Writing assistants, code generation, tutoring

#### 🔍 **RAG (Retrieval-Augmented Generation)**
- **Process**: Embed documents → Find similar → Generate answer
- **Applications**: Customer support bots, knowledge bases, Q&A systems
- **Why**: Reduces hallucinations, provides sources, keeps data current

#### 💬 **Chatbots & Virtual Assistants**
- **Intent Classification**: Understanding what users want
- **Entity Recognition**: Extracting names, dates, locations
- **Applications**: Customer service, booking systems, AI agents

#### 🌐 **Semantic Search**
- **Traditional Search**: Keyword matching ("apple" only finds "apple")
- **Semantic Search**: Meaning matching ("fruit" finds "apple", "orange")
- **Applications**: Google, document search, recommendation systems

#### 🎨 **Multimodal AI**
- **GPT-4V**: Processes images + text
- **CLIP**: Connects images with text descriptions
- **Applications**: Image captioning, visual question answering

### 🔑 The Key Challenge:

**Computers only understand numbers, not words!**

```
Human:    "I love AI" ❌ Computer can't process this
Computer: [0.2, 0.8, 0.5] ✅ Computer understands vectors!
```

**Our Mission Today:**
1. **Clean text**: Remove noise, normalize words
2. **Convert to numbers**: Turn words into vectors (embeddings)
3. **Capture meaning**: Similar words → Similar vectors

## 🛠️ Text Preprocessing Pipeline

Before feeding text to AI models, we need to CLEAN and PREPARE it!

### 📋 Standard NLP Pipeline:

```
Raw Text
   ↓
1️⃣ Tokenization     (Split into words/sentences)
   ↓
2️⃣ Lowercasing      ("Hello" → "hello")
   ↓
3️⃣ Remove Stopwords (Remove "the", "is", "a")
   ↓
4️⃣ Stemming/Lemmatization ("running" → "run")
   ↓
Clean Tokens
   ↓
5️⃣ Vectorization    (Convert to numbers)
   ↓
Ready for ML/DL!
```

### 🎯 Why Each Step Matters:

**1️⃣ Tokenization**
- Splits text into units (words, subwords, characters)
- **Example**: "I love NLP!" → ["I", "love", "NLP", "!"]
- **Used in**: All NLP models (BERT, GPT, etc.)

**2️⃣ Lowercasing**
- Treats "Hello" and "hello" as the same word
- **Example**: "Python" → "python"
- **Trade-off**: Loses information ("Apple" company vs "apple" fruit)

**3️⃣ Stopword Removal**
- Removes common words with little meaning
- **Example**: "the", "is", "a", "an", "in"
- **Why**: Reduces noise, speeds up processing
- **Caution**: Modern LLMs often keep stopwords!
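
To see why the caution above matters, here is a minimal sketch of stopword removal erasing a negation. The `STOPWORDS` set and the `remove_stopwords` helper are made up for this illustration; many standard lists, including NLTK's English list, also contain "not".

```python
# Hypothetical mini stopword list for illustration only; real lists are much longer.
STOPWORDS = {"the", "is", "a", "an", "in", "this", "not", "was"}

def remove_stopwords(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]

positive = remove_stopwords("This movie is good")
negative = remove_stopwords("This movie is not good")

print(positive)              # ['movie', 'good']
print(negative)              # ['movie', 'good']
print(positive == negative)  # True: the two opposite reviews now look identical
```

For tasks like sentiment analysis, a custom stopword list that keeps negations is one common workaround.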

**4️⃣ Stemming vs Lemmatization**
- **Stemming**: Crude suffix chopping ("running" → "run", "studies" → "studi", "better" → "better")
- **Lemmatization**: Dictionary-based reduction ("running" → "run"; "better" → "good" when tagged as an adjective)
- **Example**: A lemmatizer maps "studies", "studying", "studied" → "study" when they are tagged as verbs

**5️⃣ Vectorization**
- Converts words to numbers
- **Methods**: Bag of Words, TF-IDF, Word2Vec, BERT embeddings
- **This is where the magic happens!**

In [None]:
# Install required libraries
import sys
!{sys.executable} -m pip install nltk scikit-learn gensim matplotlib numpy pandas seaborn --quiet

print("✅ Libraries installed!")

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore')

# NLTK (Natural Language Toolkit)
import nltk

# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # newer NLTK releases load the tokenizer from this resource
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

print("📚 Libraries loaded!")
print("✅ NLTK data downloaded!")

## 🔤 Step 1: Tokenization

**Tokenization** = Breaking text into smaller units (tokens)

### 📊 Types of Tokenization:

1. **Word Tokenization**: Split by words
2. **Sentence Tokenization**: Split by sentences
3. **Subword Tokenization**: Split into parts (used by BERT, GPT)

### 🎯 Why Tokenization Matters:
- **ChatGPT**: Uses BPE (Byte-Pair Encoding) tokenization
- **BERT**: Uses WordPiece tokenization
- **Token Limits**: Every model has a context window measured in tokens (the original GPT-4 shipped with 8K and 32K variants; newer models go much higher)
- **Pricing**: Many APIs charge per token!
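
To make subword tokenization less abstract, here is a toy sketch of the BPE idea: repeatedly merge the most frequent pair of adjacent symbols. The corpus and its word frequencies are made up, and production tokenizers (for example those used by GPT models) work on bytes and add many details this sketch leaves out.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word -> frequency (made-up numbers)
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(word): freq for word, freq in corpus.items()}  # start from characters

merges = []
for _ in range(4):  # learn 4 merge rules
    pairs = get_pair_counts(words)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    words = merge_pair(words, best)

print("Learned merges:", merges)
print("Segmented words:", list(words))
```

After a few merges, frequent chunks such as "est" and "low" become single tokens, while rarer words stay split into smaller pieces. That is why an unfamiliar word can cost several tokens.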

In [None]:
# Example text about AI and RAG
text = """
Natural Language Processing is revolutionizing AI in 2024-2025! 
Large Language Models like ChatGPT and Claude can understand context. 
RAG systems combine retrieval with generation for better accuracy.
Word embeddings help computers understand semantic similarity.
"""

print("üìù Original Text:")
print(text)
print("\n" + "="*70 + "\n")

# Sentence Tokenization
sentences = sent_tokenize(text)
print("üìÑ Sentence Tokenization:")
for i, sent in enumerate(sentences, 1):
    print(f"{i}. {sent.strip()}")

print("\n" + "="*70 + "\n")

# Word Tokenization
words = word_tokenize(text)
print("üî§ Word Tokenization:")
print(words[:20])  # First 20 tokens
print(f"\nTotal tokens: {len(words)}")

## 🧹 Step 2: Text Cleaning

In [None]:
import string

def preprocess_text(text, remove_stopwords=True, lowercase=True):
    """
    Clean and preprocess text
    
    Args:
        text: Input text string
        remove_stopwords: Whether to remove stopwords
        lowercase: Whether to convert to lowercase
    
    Returns:
        List of cleaned tokens
    """
    # Tokenize
    tokens = word_tokenize(text)
    
    # Lowercase
    if lowercase:
        tokens = [word.lower() for word in tokens]
    
    # Remove punctuation-only tokens (also catches multi-character ones such as "..." or "``")
    tokens = [word for word in tokens if any(ch.isalnum() for ch in word)]
    
    # Remove stopwords
    if remove_stopwords:
        stop_words = set(stopwords.words('english'))
        tokens = [word for word in tokens if word not in stop_words]
    
    return tokens

# Test the preprocessing
sample_text = "I love learning about AI and Machine Learning in 2024-2025!"

print("üìù Original:")
print(sample_text)
print("\n" + "="*70 + "\n")

print("üî§ After tokenization:")
print(word_tokenize(sample_text))
print("\n" + "="*70 + "\n")

print("üßπ After full preprocessing:")
cleaned = preprocess_text(sample_text)
print(cleaned)
print(f"\nOriginal tokens: {len(word_tokenize(sample_text))}")
print(f"Cleaned tokens: {len(cleaned)}")

## ✂️ Step 3: Stemming vs Lemmatization

### 🔪 Stemming (Fast but Crude)
- Chops word endings using fixed rules
- **Pro**: Very fast
- **Con**: Sometimes creates non-words
- **Example**: "running" → "run", "studies" → "studi"

### 🎯 Lemmatization (Slow but Accurate)
- Uses a dictionary (WordNet) and the word's part of speech
- **Pro**: Returns real words
- **Con**: Slower, and needs the right part-of-speech tag
- **Example**: "running" → "run" (as a verb), "better" → "good" (as an adjective)

In [None]:
# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Test words paired with a WordNet part-of-speech tag ('v' = verb, 'a' = adjective, 'n' = noun)
test_words = [
    ('running', 'v'), ('runs', 'v'), ('ran', 'v'),
    ('better', 'a'), ('good', 'a'), ('best', 'a'),
    ('studies', 'v'), ('studying', 'v'), ('studied', 'v'),
    ('computing', 'v'), ('computers', 'n'), ('computed', 'v')
]

print("🔬 Stemming vs Lemmatization Comparison:\n")
print(f"{'Original':<15} {'Stemmed':<15} {'Lemmatized':<15}")
print("="*50)

for word, pos in test_words:
    stemmed = stemmer.stem(word)
    lemmatized = lemmatizer.lemmatize(word, pos=pos)  # the result depends on the POS tag
    print(f"{word:<15} {stemmed:<15} {lemmatized:<15}")

print("\n💡 Notice:")
print("   - Stemming sometimes creates non-words ('studi')")
print("   - Lemmatization returns real words, but only with the right POS tag")
print("   - Both reduce vocabulary size!")

## 🎒 Step 4: Bag of Words (BoW)

**Bag of Words** = Count how many times each word appears

### 📊 How it Works:

```
Documents:
D1: "I love AI"
D2: "I love machine learning"
D3: "AI and machine learning"

Vocabulary: [I, love, AI, machine, learning, and]

Vectors:
D1: [1, 1, 1, 0, 0, 0]  → counts for each word
D2: [1, 1, 0, 1, 1, 0]
D3: [0, 0, 1, 1, 1, 1]
```

### ⚠️ Limitations:
- **No word order**: "AI loves me" = "me loves AI"
- **No semantics**: "good" ≠ "great" (different vectors)
- **High dimensionality**: One column per unique word

### ✅ Strengths:
- **Simple and fast**
- **Works well for classification**
- **Baseline for NLP tasks**
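
The three-document example and the word-order limitation above can be reproduced in a few lines of plain Python. This is a minimal sketch: `bag_of_words` is a helper written for this illustration, and it splits on whitespace instead of using a real tokenizer.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in `text` (lowercased, whitespace split)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

docs = ["I love AI", "I love machine learning", "AI and machine learning"]

# Build the vocabulary in first-seen order so the columns match the example above
vocabulary = []
for doc in docs:
    for word in doc.lower().split():
        if word not in vocabulary:
            vocabulary.append(word)

vectors = [bag_of_words(doc, vocabulary) for doc in docs]
print(vocabulary)  # ['i', 'love', 'ai', 'machine', 'learning', 'and']
for vec in vectors:
    print(vec)

# Word order is lost: both sentences map to the same vector
a = bag_of_words("AI loves me", ["ai", "loves", "me"])
b = bag_of_words("me loves AI", ["ai", "loves", "me"])
print(a == b)  # True
```

scikit-learn's `CountVectorizer` differs in two visible ways: it sorts the vocabulary alphabetically, and its default token pattern drops one-character tokens such as "I".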

In [None]:
# Sample documents about AI
documents = [
    "I love natural language processing",
    "Machine learning is amazing",
    "I love machine learning",
    "Natural language processing uses machine learning",
    "Deep learning powers ChatGPT and Claude"
]

print("üìö Documents:")
for i, doc in enumerate(documents, 1):
    print(f"{i}. {doc}")

print("\n" + "="*70 + "\n")

# Create Bag of Words
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

# Get feature names (vocabulary)
feature_names = vectorizer.get_feature_names_out()

print("üéí Bag of Words Representation:")
print(f"\nVocabulary: {list(feature_names)}")
print(f"Vocabulary size: {len(feature_names)}")

# Display as DataFrame
bow_df = pd.DataFrame(
    bow_matrix.toarray(),
    columns=feature_names,
    index=[f"Doc {i+1}" for i in range(len(documents))]
)

print("\nüìä Word Counts per Document:")
print(bow_df)

In [None]:
# Visualize Bag of Words
plt.figure(figsize=(12, 6))
sns.heatmap(bow_df, annot=True, fmt='d', cmap='YlOrRd', cbar_kws={'label': 'Count'})
plt.title('Bag of Words Heatmap', fontsize=16, fontweight='bold')
plt.xlabel('Words', fontsize=12)
plt.ylabel('Documents', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

print("üé® Each cell shows how many times a word appears in a document")
print("üìå Notice: Most cells are 0 (sparse matrix!)")

## 📊 Step 5: TF-IDF (Term Frequency - Inverse Document Frequency)

**TF-IDF** = Smart weighting that highlights important words

### 🎯 The Problem with Bag of Words:
- Common words like "the", "is" get high counts
- But they don't carry much meaning!
- **Solution**: TF-IDF downweights words that appear in many documents

### 📐 How TF-IDF Works:

**TF (Term Frequency):**
```
TF(word) = (Count of word in document) / (Total words in document)
```

**IDF (Inverse Document Frequency):**
```
IDF(word) = log(Total documents / Documents containing word)
```

**TF-IDF Score:**
```
TF-IDF = TF × IDF
```

**Note:** scikit-learn's `TfidfVectorizer` uses raw counts for TF, a smoothed IDF, and L2-normalizes each document vector, so its numbers differ from these textbook formulas.

### 💡 Intuition:
- **High TF-IDF**: Word is frequent in THIS document, rare in others → Important!
- **Low TF-IDF**: Word is common everywhere → Not distinctive

### 🎯 Used In:
- **Search Engines**: Ranking relevant documents
- **Document Similarity**: Finding similar texts
- **Keyword Extraction**: Identifying important terms
- **RAG Systems**: Document retrieval before generation
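
Here is a small worked example of the textbook TF, IDF, and TF-IDF formulas from this section, in plain Python with the natural logarithm. The three sentences are made up for the illustration, and the helper names `tf`, `idf`, and `tfidf` are not from any library.

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [doc.split() for doc in docs]

def tf(term, tokens):
    """Term frequency: share of the document's tokens that are `term`."""
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    """Inverse document frequency; assumes `term` occurs in at least one document."""
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df)

def tfidf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

# Score three words from the first document
for term in ["the", "cat", "mat"]:
    print(term, round(tfidf(term, tokenized[0], tokenized), 3))
# the 0.0    (appears in every document, so IDF = log(3/3) = 0)
# cat 0.068  (appears in 2 of 3 documents)
# mat 0.183  (appears only in this document)
```

"the" is the most frequent word in the first document, yet it scores zero because it appears everywhere. "mat" scores highest because no other document contains it.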

In [None]:
# Create TF-IDF vectors
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Get feature names
tfidf_features = tfidf_vectorizer.get_feature_names_out()

# Display as DataFrame
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=tfidf_features,
    index=[f"Doc {i+1}" for i in range(len(documents))]
)

print("üìä TF-IDF Representation:")
print(tfidf_df.round(3))

print("\nüí° Interpretation:")
print("   - Higher values = More important to that document")
print("   - Values range from 0 to 1")
print("   - Common words get lower scores")

In [None]:
# Visualize TF-IDF
plt.figure(figsize=(12, 6))
sns.heatmap(tfidf_df, annot=True, fmt='.2f', cmap='Blues', cbar_kws={'label': 'TF-IDF Score'})
plt.title('TF-IDF Heatmap', fontsize=16, fontweight='bold')
plt.xlabel('Words', fontsize=12)
plt.ylabel('Documents', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

print("üìä Notice how TF-IDF highlights distinctive words!")

## 🚀 Step 6: Word Embeddings (Word2Vec)

**Word Embeddings** = Dense vectors that capture word meaning!

### 🎯 The Revolution:

**Old Way (BoW/TF-IDF):**
- Sparse vectors (mostly zeros)
- No semantic meaning
- "king" and "queen" are completely different

**New Way (Embeddings):**
- Dense vectors (all values meaningful)
- **Captures semantics**: Similar words → Similar vectors
- **Math with meaning**: king - man + woman ≈ queen

### 📐 Word2Vec Properties:

```
"king"   → [0.2, 0.8, 0.5, ...] (300 dimensions)
"queen"  → [0.1, 0.7, 0.6, ...]
"man"    → [0.3, 0.2, 0.1, ...]
"woman"  → [0.2, 0.1, 0.2, ...]

Magic:
vector("king") - vector("man") + vector("woman") ≈ vector("queen")
```
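
The analogy arithmetic can be checked by hand. This sketch uses only the first three numbers of the illustrative vectors above; they are toy values, not trained embeddings, and `cosine` is a helper written for this example.

```python
import math

# First three dimensions of the illustrative vectors above (toy values, not trained)
vectors = {
    "king":  [0.2, 0.8, 0.5],
    "queen": [0.1, 0.7, 0.6],
    "man":   [0.3, 0.2, 0.1],
    "woman": [0.2, 0.1, 0.2],
}

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# king - man + woman, computed element by element
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# Rank every word by cosine similarity to the target vector
ranked = sorted(vectors, key=lambda w: cosine(target, vectors[w]), reverse=True)
best = ranked[0]

for word in ranked:
    print(f"{word:<6} {cosine(target, vectors[word]):.3f}")
print("Closest word:", best)
```

With a model trained on a large corpus, gensim exposes the same operation as `model.wv.most_similar(positive=['king', 'woman'], negative=['man'])`. The tiny corpus used in this notebook does not contain those words, so that call would fail here.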

### 🎯 Why This Matters for AI:

**🔍 RAG Systems:**
- Embed documents and queries
- Find similar documents using cosine similarity
- Retrieve relevant context for LLM

**💬 Chatbots:**
- Understand user intent through embeddings
- Match to similar training examples
- Generate contextual responses

**🌐 Semantic Search:**
- Search by meaning, not keywords
- "How to fix a bug" finds "debugging tutorial"

**🤖 Modern LLMs:**
- BERT, GPT, Claude use advanced embeddings
- Contextual embeddings (same word, different meanings)
- Foundation of all transformer models!

In [None]:
# Import Gensim for Word2Vec (installed in the first cell)
from gensim.models import Word2Vec

# Sample corpus about AI and ML
sentences = [
    ['machine', 'learning', 'is', 'subset', 'of', 'artificial', 'intelligence'],
    ['deep', 'learning', 'uses', 'neural', 'networks'],
    ['natural', 'language', 'processing', 'helps', 'computers', 'understand', 'text'],
    ['chatgpt', 'and', 'claude', 'are', 'large', 'language', 'models'],
    ['rag', 'systems', 'combine', 'retrieval', 'and', 'generation'],
    ['word', 'embeddings', 'capture', 'semantic', 'meaning'],
    ['transformers', 'revolutionized', 'natural', 'language', 'processing'],
    ['bert', 'and', 'gpt', 'use', 'attention', 'mechanisms'],
    ['vector', 'databases', 'store', 'embeddings', 'for', 'similarity', 'search'],
    ['semantic', 'search', 'finds', 'documents', 'by', 'meaning'],
]

print("üî§ Training Word2Vec on AI/ML corpus...\n")

# Train Word2Vec model
# vector_size: dimension of embeddings
# window: context window size
# min_count: ignore words appearing less than this
# sg: 1 for skip-gram, 0 for CBOW
model = Word2Vec(
    sentences=sentences,
    vector_size=50,  # smaller for demo
    window=5,
    min_count=1,
    sg=1,  # Skip-gram
    epochs=100
)

print("‚úÖ Word2Vec model trained!")
print(f"\nVocabulary size: {len(model.wv)}")
print(f"Embedding dimension: {model.wv.vector_size}")

# Get embedding for a word
word = 'learning'
embedding = model.wv[word]

print(f"\nüî¢ Embedding for '{word}':")
print(f"Shape: {embedding.shape}")
print(f"First 10 dimensions: {embedding[:10].round(3)}")

In [None]:
# Find similar words
test_words = ['learning', 'language', 'embeddings', 'chatgpt']

print("üîç Finding Similar Words:\n")

for word in test_words:
    if word in model.wv:
        similar = model.wv.most_similar(word, topn=3)
        print(f"üìå Words similar to '{word}':")
        for similar_word, score in similar:
            print(f"   - {similar_word}: {score:.3f}")
        print()

print("üí° Higher score = More similar!")
print("üìä Similarity measured using cosine similarity")

## 🎨 Step 7: Visualizing Word Embeddings

Word embeddings are high-dimensional (typically 50-300 dimensions). Let's use **t-SNE** to visualize them in 2D!

**t-SNE** = Dimensionality reduction that tries to keep nearby points nearby (distances between far-apart groups are not reliable)

In [None]:
from sklearn.manifold import TSNE

# Get all words and their embeddings
words = list(model.wv.index_to_key)
word_vectors = np.array([model.wv[word] for word in words])

# Reduce to 2D using t-SNE
print("üé® Reducing embeddings to 2D using t-SNE...")
tsne = TSNE(n_components=2, random_state=42, perplexity=5)
embeddings_2d = tsne.fit_transform(word_vectors)

print("‚úÖ Dimensionality reduction complete!\n")

# Plot
plt.figure(figsize=(14, 10))
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], s=100, alpha=0.6, c='steelblue')

# Add word labels
for i, word in enumerate(words):
    plt.annotate(word, 
                xy=(embeddings_2d[i, 0], embeddings_2d[i, 1]),
                xytext=(5, 5),
                textcoords='offset points',
                fontsize=11,
                fontweight='bold')

plt.title('Word Embeddings Visualization (t-SNE)', fontsize=16, fontweight='bold')
plt.xlabel('Dimension 1', fontsize=12)
plt.ylabel('Dimension 2', fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("üìä Notice how similar words cluster together!")
print("üí° 'learning', 'language', 'processing' should be close")
print("üéØ This is how RAG systems find similar documents!")

## 🏗️ Real AI Application: Building a Document Similarity System for RAG

**Scenario**: You're building a RAG-powered chatbot for a tech company!

**Task**: Given a user query, find the most relevant documents from your knowledge base.

**Steps**:
1. Convert documents to TF-IDF vectors (or embeddings)
2. Convert user query to same vector space
3. Calculate cosine similarity
4. Return top-K most similar documents
5. (In real RAG) Feed these docs to LLM for answer generation

**This is the core of RAG systems!**

In [None]:
# Knowledge base documents (simulating company documentation)
knowledge_base = [
    "Machine learning is a subset of artificial intelligence that learns from data",
    "Deep learning uses neural networks with multiple layers for complex pattern recognition",
    "Natural language processing helps computers understand and generate human language",
    "RAG systems combine retrieval and generation for more accurate AI responses",
    "ChatGPT and Claude are large language models trained on massive text datasets",
    "Word embeddings represent words as dense vectors capturing semantic meaning",
    "Transformers use attention mechanisms to process sequential data like text",
    "Vector databases store embeddings for efficient similarity search",
    "Semantic search finds documents by meaning rather than exact keyword matches",
    "Fine-tuning adapts pre-trained models to specific tasks or domains"
]

print("üìö Knowledge Base:")
for i, doc in enumerate(knowledge_base, 1):
    print(f"{i}. {doc}")

print("\n" + "="*70 + "\n")

# Create TF-IDF vectors for all documents
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(knowledge_base)

print(f"✅ Vectorized {len(knowledge_base)} documents")
print(f"📊 Vocabulary size: {len(vectorizer.get_feature_names_out())}")

In [None]:
def search_documents(query, top_k=3):
    """
    Search for most relevant documents given a query
    
    This is the retrieval step in RAG!
    
    Args:
        query: User's search query
        top_k: Number of top results to return
    
    Returns:
        List of (document, similarity_score) tuples
    """
    # Convert query to TF-IDF vector
    query_vector = vectorizer.transform([query])
    
    # Calculate cosine similarity with all documents
    similarities = cosine_similarity(query_vector, doc_vectors)[0]
    
    # Get top-k most similar documents
    top_indices = similarities.argsort()[-top_k:][::-1]
    
    results = []
    for idx in top_indices:
        results.append({
            'document': knowledge_base[idx],
            'score': similarities[idx],
            'rank': len(results) + 1
        })
    
    return results

# Test queries
test_queries = [
    "How do neural networks work?",
    "What is RAG and how does it improve AI?",
    "Explain word embeddings",
]

print("üîç Testing Document Retrieval System:\n")
print("="*80)

for query in test_queries:
    print(f"\n‚ùì Query: '{query}'\n")
    
    results = search_documents(query, top_k=3)
    
    print("üìÑ Top 3 Most Relevant Documents:\n")
    for result in results:
        print(f"#{result['rank']} (Similarity: {result['score']:.3f})")
        print(f"   {result['document']}\n")
    
    print("="*80)

print("\nüéØ This is exactly how RAG systems work!")
print("   1. User asks a question")
print("   2. System finds relevant documents (what we just did)")
print("   3. LLM generates answer using retrieved documents")
print("   4. Result: Accurate, grounded, source-backed answers!")

## 🎯 Why This Matters for Modern AI

### 🔍 **RAG Systems (2024-2025 Trend)**

What you just built is the **core of RAG**!

**Full RAG Pipeline:**
```
User Query: "What is machine learning?"
     ↓
1. EMBED query → vector
     ↓
2. RETRIEVE similar docs (what we built!)
     ↓
3. GENERATE answer using LLM + retrieved docs
     ↓
Answer: "Based on our documentation, machine learning is..."
```
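
Step 3 of the pipeline is mostly prompt assembly. Here is a minimal sketch of that step: the function name `build_rag_prompt` and the prompt wording are hypothetical choices for this illustration, and actually sending the prompt to an LLM is left out.

```python
def build_rag_prompt(query, retrieved_docs):
    """Assemble a prompt that asks the model to answer only from the retrieved context."""
    # Number each document so the answer can point back to its sources
    context = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(retrieved_docs, 1))
    return (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# Documents as they might come back from the retrieval step
retrieved = [
    "Machine learning is a subset of artificial intelligence that learns from data",
    "Deep learning uses neural networks with multiple layers for complex pattern recognition",
]

prompt = build_rag_prompt("What is machine learning?", retrieved)
print(prompt)
```

In the notebook, `retrieved` could be filled with the `'document'` fields returned by `search_documents`.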

### 🎯 **Real-World Applications:**

**1. Customer Support Bots**
- Embed all support docs
- User asks question → retrieve relevant docs
- LLM generates answer with sources

**2. Internal Knowledge Systems**
- Companies embed all internal docs
- Employees search by meaning, not keywords
- Find relevant info across 1000s of documents

**3. Code Search**
- Embed code snippets and descriptions
- Search for "how to read JSON file"
- Find relevant code examples

**4. Semantic Search Engines**
- Google, Bing use embeddings
- Understand query intent
- Find pages by meaning

### 🤖 **Modern Embedding Models:**

In production, you'd use advanced embeddings:
- **OpenAI Embeddings**: `text-embedding-3-large`
- **Sentence Transformers**: `all-MiniLM-L6-v2`
- **Cohere Embeddings**: Multilingual support
- **Voyage AI**: Optimized for RAG

These models embed whole sentences and passages in context, so they usually retrieve far better than Word2Vec word vectors for RAG!

## 🎯 YOUR TURN: Interactive Exercise

**Challenge**: Build a product recommendation system using TF-IDF vectors!

**Scenario**: You work for an e-commerce company. Build a system that recommends similar products based on descriptions.

**Tasks**:
1. Create product descriptions (at least 5)
2. Vectorize them using TF-IDF
3. Implement a `find_similar_products()` function
4. Test with sample queries

**Bonus**: Visualize the product embeddings!

In [None]:
# YOUR CODE HERE!

# TODO 1: Create product descriptions
products = [
    # Add at least 5 product descriptions
    # Example: "Wireless bluetooth headphones with noise cancellation"
]

# TODO 2: Vectorize products using TF-IDF
# Hint: Use TfidfVectorizer

# TODO 3: Implement find_similar_products(product_query, top_k=3)
def find_similar_products(product_query, top_k=3):
    """
    Find similar products based on description
    """
    # YOUR CODE HERE
    pass

# TODO 4: Test with queries
# Example: find_similar_products("looking for wireless headphones")

print("Complete the TODOs above!")

### ✅ Solution (Try on your own first!)

In [None]:
# SOLUTION

# Product catalog
products = [
    "Wireless bluetooth headphones with active noise cancellation and 30-hour battery",
    "USB-C charging cable with fast charging support for smartphones and tablets",
    "Mechanical gaming keyboard with RGB lighting and customizable keys",
    "4K webcam with autofocus and built-in microphone for video conferencing",
    "Portable bluetooth speaker with waterproof design and 360-degree sound",
    "Wireless gaming mouse with programmable buttons and adjustable DPI",
    "Laptop stand with adjustable height and cooling fan for better ergonomics",
    "Noise-cancelling earbuds with wireless charging case and touch controls",
]

product_names = [
    "Premium Headphones",
    "Fast Charging Cable",
    "RGB Gaming Keyboard",
    "4K Webcam",
    "Portable Speaker",
    "Gaming Mouse",
    "Laptop Stand",
    "Wireless Earbuds"
]

print("üõçÔ∏è Product Catalog:")
for name, desc in zip(product_names, products):
    print(f"- {name}: {desc}")

print("\n" + "="*80 + "\n")

# Vectorize products
product_vectorizer = TfidfVectorizer()
product_vectors = product_vectorizer.fit_transform(products)

def find_similar_products(query, top_k=3):
    """
    Find similar products based on query
    """
    # Convert query to vector
    query_vec = product_vectorizer.transform([query])
    
    # Calculate similarities
    similarities = cosine_similarity(query_vec, product_vectors)[0]
    
    # Get top-k
    top_indices = similarities.argsort()[-top_k:][::-1]
    
    results = []
    for idx in top_indices:
        results.append({
            'name': product_names[idx],
            'description': products[idx],
            'score': similarities[idx]
        })
    
    return results

# Test queries
queries = [
    "I need wireless audio device for music",
    "Looking for gaming peripherals with lights",
    "Want something for video calls and meetings"
]

for query in queries:
    print(f"üîç Query: '{query}'\n")
    recommendations = find_similar_products(query, top_k=3)
    
    print("üí° Recommended Products:\n")
    for i, rec in enumerate(recommendations, 1):
        print(f"{i}. {rec['name']} (Score: {rec['score']:.3f})")
        print(f"   {rec['description']}\n")
    
    print("="*80 + "\n")

print("‚úÖ Product recommendation system built!")
print("üéØ This is used in e-commerce, content recommendations, and more!")

## 🎉 Congratulations!

**You just learned:**
- ✅ Text preprocessing pipeline (tokenization, cleaning, stemming, lemmatization)
- ✅ Bag of Words and TF-IDF vectorization
- ✅ Word2Vec embeddings and semantic similarity
- ✅ Visualizing word embeddings with t-SNE
- ✅ Built a document similarity system for RAG
- ✅ Understood how embeddings power modern AI!

### 🎯 Key Takeaways:

1. **Text must be converted to numbers for AI**
   - Preprocessing cleans and normalizes text
   - Vectorization creates numeric representations

2. **Embeddings capture semantic meaning**
   - Similar words have similar vectors
   - Enable semantic search and similarity

3. **TF-IDF highlights important words**
   - Downweights common words
   - Used for document ranking and search

4. **This powers modern AI systems**
   - RAG uses embeddings for retrieval
   - LLMs use advanced contextual embeddings
   - Vector databases store embeddings at scale

---

**🎯 Practice Exercise (Before Day 2):**

Build a simple chatbot FAQ system:
1. Create 10+ FAQ pairs (question, answer)
2. Vectorize all questions
3. When user asks question, find most similar FAQ
4. Return the corresponding answer

This is a mini-RAG system!

---

**📚 Next Lesson:** Day 2 - Text Classification with Deep Learning
- Embedding layers in neural networks
- 1D CNNs for text
- LSTM text classifiers
- Attention mechanisms
- Build spam detector and intent classifier!

---

**💬 Remember:**

*"You just built the foundation of RAG systems! Every time you use ChatGPT with file uploads, or ask a question to a company's AI chatbot, embeddings and similarity search are working behind the scenes. You now understand the core technology powering the AI revolution!"* 🚀

---

**🔗 Connections to Modern AI:**
- **RAG**: Embeddings + similarity search + LLM generation
- **Semantic Search**: TF-IDF/embeddings for meaningful search
- **Chatbots**: Intent matching using embedding similarity
- **LLMs**: Use advanced contextual embeddings (BERT, GPT)
- **Vector DBs**: Pinecone, Weaviate store billions of embeddings