# 📘 Day 1: NLP Fundamentals

**🎯 Goal:** Master text preprocessing and word embeddings, then build a spam detector

**⏱️ Time:** 90-120 minutes

**🌟 Why This Matters for AI:**
- Tokenization, a core preprocessing step, is the first thing that happens to text sent to ChatGPT, Claude, and other LLMs
- Word embeddings power RAG systems, semantic search, and document retrieval
- Understanding embeddings helps you build better chatbots, search engines, and AI assistants
- GPT-4, BERT, and other modern NLP models represent their input tokens as embeddings
- RAG systems rely on embedding similarity to find relevant documents!

---

## 🧠 What is Natural Language Processing (NLP)?

**Natural Language Processing** is teaching computers to understand and generate human language!

### 🎯 Real-World NLP Applications (2024-2025):

#### 🤖 **Large Language Models (LLMs)**
- **ChatGPT, Claude, GPT-4**: Generate human-like text
- **Gemini, Llama 3**: A proprietary multimodal model (Gemini) and an open-weight model family (Llama 3)
- **Applications**: Writing assistants, code generation, tutoring

#### 🔍 **RAG (Retrieval-Augmented Generation)**
- **Process**: Embed documents → Find similar → Generate answer
- **Applications**: Customer support bots, knowledge bases, Q&A systems
- **Why**: Reduces hallucinations, provides sources, keeps data current

#### 💬 **Chatbots & Virtual Assistants**
- **Intent Classification**: Understanding what users want
- **Entity Recognition**: Extracting names, dates, locations
- **Applications**: Customer service, booking systems, AI agents

#### 🌐 **Semantic Search**
- **Traditional Search**: Keyword matching ("apple" only finds "apple")
- **Semantic Search**: Meaning matching ("fruit" finds "apple", "orange")
- **Applications**: Google, document search, recommendation systems

### 🔑 The Key Challenge:

**Computers only understand numbers, not words!**

```
Human:    "I love AI" ❌ Computer can't process this
Computer: [0.2, 0.8, 0.5] ✅ Computer understands vectors!
```

**Our Mission Today:**
1. **Clean text**: Remove noise, normalize words
2. **Convert to numbers**: Turn words into vectors (embeddings)
3. **Capture meaning**: Similar words → Similar vectors
4. **Build real AI**: Spam detector using TF-IDF vectors
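
To make step 2 concrete, here is a minimal sketch of the simplest possible word-to-number mapping: every distinct word gets an integer id. The names `vocab` and `token_ids` are illustrative. These ids are arbitrary labels, not embeddings, so they carry no meaning yet.

```python
# Toy illustration: map each distinct word to an integer id.
sentence = "I love AI"
tokens = sentence.lower().split()

# Assign ids in order of first appearance.
vocab = {}
for tok in tokens:
    if tok not in vocab:
        vocab[tok] = len(vocab)

token_ids = [vocab[tok] for tok in tokens]

print(vocab)      # {'i': 0, 'love': 1, 'ai': 2}
print(token_ids)  # [0, 1, 2]
```

Embeddings replace each of these ids with a learned vector, which is what lets similar words end up with similar numbers.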

In [None]:
# Install required libraries
import sys
!{sys.executable} -m pip install nltk scikit-learn gensim matplotlib numpy pandas seaborn --quiet

print("✅ Libraries installed!")

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

# NLTK (Natural Language Toolkit)
import nltk

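# Download required NLTK data
# Note: recent NLTK releases load the tokenizer from 'punkt_tab';
# older releases use 'punkt'. Requesting both keeps the cell portable.
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)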

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

print("📚 Libraries loaded!")
print("✅ NLTK data downloaded!")

## 🛠️ Part 1: Text Preprocessing Pipeline

Before feeding text to AI models, we need to CLEAN and PREPARE it!

### 📋 Standard NLP Pipeline:

```
Raw Text
   ↓
1️⃣ Tokenization     (Split into words/sentences)
   ↓
2️⃣ Lowercasing      ("Hello" → "hello")
   ↓
3️⃣ Remove Stopwords (Remove "the", "is", "a")
   ↓
4️⃣ Stemming/Lemmatization ("running" → "run")
   ↓
Clean Tokens
   ↓
5️⃣ Vectorization    (Convert to numbers)
   ↓
Ready for ML/DL!
```

### 🎯 Why Each Step Matters:

**1️⃣ Tokenization**
- Splits text into units (words, subwords, characters)
- **Example**: "I love NLP!" → ["I", "love", "NLP", "!"]
- **Used in**: All NLP models (BERT, GPT, etc.)

**2️⃣ Lowercasing**
- Treats "Hello" and "hello" as the same word
- **Example**: "Python" → "python"
- **Trade-off**: Loses information ("Apple" company vs "apple" fruit)

**3️⃣ Stopword Removal**
- Removes common words with little meaning
- **Example**: "the", "is", "a", "an", "in"
- **Why**: Reduces noise, speeds up processing
- **Caution**: Modern LLMs keep stopwords, because they carry grammatical context!

**4️⃣ Stemming vs Lemmatization**
- **Stemming**: Crude chopping ("running" → "run", "studies" → "studi")
- **Lemmatization**: Dictionary-based reduction ("running" → "run"; "better" → "good" when tagged as an adjective)
- **Example**: Lemmatized as verbs, "studies", "studying", "studied" → "study"

**5️⃣ Vectorization**
- Converts words to numbers
- **Methods**: Bag of Words, TF-IDF, Word2Vec, GloVe, BERT embeddings
- **This is where the magic happens!**
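
The steps above can be sketched end to end with nothing but the standard library. This is a rough illustration, not the NLTK pipeline used later: the regex tokenizer and the tiny `STOPWORDS` set are assumptions made for the example, and step 4 (stemming/lemmatization) is left out to keep it short.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stopword list, for illustration only.
STOPWORDS = {"the", "is", "a", "an", "in", "and", "of", "to"}

def simple_pipeline(text):
    """Tokenize, lowercase, drop stopwords, then count what is left."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)            # 1. tokenization
    tokens = [t.lower() for t in tokens]                   # 2. lowercasing
    tokens = [t for t in tokens if t not in STOPWORDS]     # 3. stopword removal
    return Counter(tokens)                                 # 5. vectorization (raw counts)

counts = simple_pipeline("The model is a model of the language in the text.")
print(counts)  # Counter({'model': 2, 'language': 1, 'text': 1})
```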

## 🔤 Step 1: Tokenization

**Tokenization** = Breaking text into smaller units (tokens)

### 📊 Types of Tokenization:

1. **Word Tokenization**: Split by words
2. **Sentence Tokenization**: Split by sentences
3. **Subword Tokenization**: Split into parts (used by BERT, GPT)

### 🎯 Why Tokenization Matters:
- **ChatGPT**: Uses BPE (Byte-Pair Encoding) tokenization
- **BERT**: Uses WordPiece tokenization
- **Token Limits**: Context windows are measured in tokens (GPT-4 variants offer 8K, 32K, or 128K)
- **Pricing**: Many APIs charge per token!
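
To see what "subword" means in practice, here is a toy sketch of how BPE learns merges: it repeatedly finds the most frequent pair of adjacent symbols and glues it into one new symbol. The three-word corpus and its frequencies are made up, and the helper names are illustrative. Production tokenizers work on bytes and learn tens of thousands of merges.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Assumption: a made-up corpus of word -> frequency.
corpus = {"low": 5, "lower": 2, "lowest": 3}
words = {tuple(word): freq for word, freq in corpus.items()}  # start from characters

for _ in range(2):  # learn two merges
    words = merge_pair(words, most_frequent_pair(words))

print(words)  # after two merges, "low" has become a single symbol
```

Frequent character sequences become whole tokens, while rare words stay split into smaller pieces, which is why token counts and word counts differ.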

In [None]:
# Example text about AI and RAG
text = """
Natural Language Processing is revolutionizing AI in 2024-2025! 
Large Language Models like ChatGPT and Claude can understand context. 
RAG systems combine retrieval with generation for better accuracy.
Word embeddings help computers understand semantic similarity.
"""

print("📝 Original Text:")
print(text)
print("\n" + "="*70 + "\n")

# Sentence Tokenization
sentences = sent_tokenize(text)
print("📄 Sentence Tokenization:")
for i, sent in enumerate(sentences, 1):
    print(f"{i}. {sent.strip()}")

print("\n" + "="*70 + "\n")

# Word Tokenization
words = word_tokenize(text)
print("🔤 Word Tokenization:")
print(words[:20])  # First 20 tokens
print(f"\nTotal tokens: {len(words)}")

## 🧹 Step 2: Text Cleaning and Preprocessing

In [None]:
import string

def preprocess_text(text, remove_stopwords=True, lowercase=True, remove_punct=True):
    """
    Clean and preprocess text
    
    Args:
        text: Input text string
        remove_stopwords: Whether to remove stopwords
        lowercase: Whether to convert to lowercase
        remove_punct: Whether to remove punctuation
    
    Returns:
        List of cleaned tokens
    """
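    # Tokenize
    tokens = word_tokenize(text)

    # Lowercase
    if lowercase:
        tokens = [word.lower() for word in tokens]

    # Remove punctuation: drop tokens made up entirely of punctuation characters.
    # (`word not in string.punctuation` is a substring test, so it would miss
    # multi-character tokens such as "..." or "``".)
    if remove_punct:
        tokens = [word for word in tokens
                  if not all(ch in string.punctuation for ch in word)]

    # Remove stopwords (compare in lowercase so this also works when lowercase=False)
    if remove_stopwords:
        stop_words = set(stopwords.words('english'))
        tokens = [word for word in tokens if word.lower() not in stop_words]

    return tokens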

# Test the preprocessing
sample_text = "I love learning about AI and Machine Learning in 2024-2025!"

print("📝 Original:")
print(sample_text)
print("\n" + "="*70 + "\n")

print("🔤 After tokenization:")
print(word_tokenize(sample_text))
print("\n" + "="*70 + "\n")

print("🧹 After full preprocessing:")
cleaned = preprocess_text(sample_text)
print(cleaned)
print(f"\nOriginal tokens: {len(word_tokenize(sample_text))}")
print(f"Cleaned tokens: {len(cleaned)}")

## ✂️ Step 3: Stemming vs Lemmatization

### 🔪 Stemming (Fast but Crude)
- Chops word endings using fixed rules
- **Pro**: Very fast
- **Con**: Sometimes creates non-words
- **Example**: "running" → "run", "studies" → "studi"

### 🎯 Lemmatization (Slow but Accurate)
- Uses a dictionary and the word's part of speech
- **Pro**: Returns real words
- **Con**: Slower, and needs the correct part-of-speech tag
- **Example**: "running" → "run" (verb), "better" → "good" (adjective)
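
The difference is easy to see with a toy version of each. This is only a sketch: the `SUFFIXES` list and the hand-written `LEMMA_TABLE` are assumptions for illustration and are far simpler than the Porter stemmer or WordNet.

```python
# Assumption: a toy suffix list and a tiny hand-written lemma table.
SUFFIXES = ("ing", "ed", "es", "s")
LEMMA_TABLE = {"studies": "study", "studying": "study", "better": "good", "ran": "run"}

def crude_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a dictionary; unknown words are returned unchanged."""
    return LEMMA_TABLE.get(word, word)

for w in ["studies", "studying", "better", "ran"]:
    print(f"{w:<10} stem={crude_stem(w):<8} lemma={toy_lemmatize(w)}")
```

The rule-based stemmer turns "studies" into the non-word "studi" and cannot touch irregular forms like "ran", while the dictionary lookup handles both but only for words it knows.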

In [None]:
# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Test words
test_words = [
    'running', 'runs', 'ran',
    'better', 'good', 'best',
    'studies', 'studying', 'studied',
    'computing', 'computers', 'computed'
]

print("🔬 Stemming vs Lemmatization Comparison:\n")
print(f"{'Original':<15} {'Stemmed':<15} {'Lemma (verb)':<15} {'Lemma (adj)':<15}")
print("="*62)

for word in test_words:
    stemmed = stemmer.stem(word)
    # The lemmatizer needs a part-of-speech hint: 'v' = verb, 'a' = adjective
    lemma_verb = lemmatizer.lemmatize(word, pos='v')
    lemma_adj = lemmatizer.lemmatize(word, pos='a')
    print(f"{word:<15} {stemmed:<15} {lemma_verb:<15} {lemma_adj:<15}")

print("\n💡 Notice:")
print("   - Stemming sometimes creates non-words ('studi')")
print("   - Lemmatization returns real words, but depends on the POS tag")
print("     ('better' becomes 'good' only when treated as an adjective)")
print("   - Both reduce vocabulary size!")

## 🎒 Part 2: Bag of Words (BoW)

**Bag of Words** = Count how many times each word appears

### 📊 How it Works:

```
Documents:
D1: "I love AI"
D2: "I love machine learning"
D3: "AI and machine learning"

Vocabulary: [I, love, AI, machine, learning, and]

Vectors:
D1: [1, 1, 1, 0, 0, 0]  → counts for each word
D2: [1, 1, 0, 1, 1, 0]
D3: [0, 0, 1, 1, 1, 1]
```
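
The worked example above can be reproduced in a few lines of plain Python. This is a minimal sketch that splits on whitespace and keeps the original casing so the vocabulary matches the table; scikit-learn's `CountVectorizer` tokenizes differently.

```python
docs = ["I love AI", "I love machine learning", "AI and machine learning"]

# Build the vocabulary in order of first appearance.
vocab = []
for doc in docs:
    for word in doc.split():
        if word not in vocab:
            vocab.append(word)

# One count per vocabulary word, per document.
vectors = [[doc.split().count(word) for word in vocab] for doc in docs]

print(vocab)  # ['I', 'love', 'AI', 'machine', 'learning', 'and']
for name, vec in zip(["D1", "D2", "D3"], vectors):
    print(name, vec)
```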

### ⚠️ Limitations:
- **No word order**: "AI loves me" = "me loves AI"
- **No semantics**: "good" ≠ "great" (unrelated vectors, even though the meanings are close)
- **High dimensionality**: One column per unique word

### ✅ Strengths:
- **Simple and fast**
- **Works well for classification**
- **Baseline for NLP tasks**

In [None]:
# Sample documents about AI
documents = [
    "I love natural language processing",
    "Machine learning is amazing",
    "I love machine learning",
    "Natural language processing uses machine learning",
    "Deep learning powers ChatGPT and Claude"
]

print("📚 Documents:")
for i, doc in enumerate(documents, 1):
    print(f"{i}. {doc}")

print("\n" + "="*70 + "\n")

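# Create Bag of Words
# Note: by default CountVectorizer lowercases text and ignores
# single-character tokens, so "I" does not appear in the vocabulary
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

# Get feature names (vocabulary)
feature_names = vectorizer.get_feature_names_out()

print("🎒 Bag of Words Representation:")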
print(f"\nVocabulary: {list(feature_names)}")
print(f"Vocabulary size: {len(feature_names)}")

# Display as DataFrame
bow_df = pd.DataFrame(
    bow_matrix.toarray(),
    columns=feature_names,
    index=[f"Doc {i+1}" for i in range(len(documents))]
)

print("\n📊 Word Counts per Document:")
print(bow_df)

In [None]:
# Visualize Bag of Words
plt.figure(figsize=(12, 6))
sns.heatmap(bow_df, annot=True, fmt='d', cmap='YlOrRd', cbar_kws={'label': 'Count'})
plt.title('Bag of Words Heatmap', fontsize=16, fontweight='bold')
plt.xlabel('Words', fontsize=12)
plt.ylabel('Documents', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

print("🎨 Each cell shows how many times a word appears in a document")
print("📌 Notice: Most cells are 0 (sparse matrix!)")

## 📊 Part 3: TF-IDF (Term Frequency - Inverse Document Frequency)

**TF-IDF** = Smart weighting that highlights important words

### 🎯 The Problem with Bag of Words:
- Common words like "the", "is" get high counts
- But they don't carry much meaning!
- **Solution**: TF-IDF downweights words that appear in many documents

### 📐 How TF-IDF Works:

**TF (Term Frequency):**
```
TF(word) = (Count of word in document) / (Total words in document)
```

**IDF (Inverse Document Frequency):**
```
IDF(word) = log(Total documents / Documents containing word)
```

**TF-IDF Score:**
```
TF-IDF = TF × IDF
```

### 💡 Intuition:
- **High TF-IDF**: Word is frequent in THIS document, rare in others → Important!
- **Low TF-IDF**: Word is common everywhere → Not distinctive
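
Here is the textbook formula worked by hand on a three-document toy corpus, as a minimal sketch. The function names are illustrative, and `idf` assumes the word occurs in at least one document. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from this hand calculation.

```python
import math

# Toy corpus, already tokenized and lowercased.
docs = [
    ["i", "love", "ai"],
    ["i", "love", "machine", "learning"],
    ["ai", "and", "machine", "learning"],
]

def tf(word, doc):
    """Share of the document's tokens that are `word`."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """log(total documents / documents containing `word`)."""
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# "and" appears in only one document, "ai" appears in two.
print(round(tf_idf("and", docs[2], docs), 3))  # 0.275
print(round(tf_idf("ai", docs[2], docs), 3))   # 0.101
```

Both words occur once in the third document, yet "and" scores higher here simply because it is rarer across this tiny corpus, which is also why stopword removal is usually combined with TF-IDF.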

### 🎯 Used In:
- **Search Engines**: Ranking relevant documents
- **Document Similarity**: Finding similar texts
- **Keyword Extraction**: Identifying important terms
- **RAG Systems**: Document retrieval before generation

In [None]:
# Create TF-IDF vectors
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Get feature names
tfidf_features = tfidf_vectorizer.get_feature_names_out()

# Display as DataFrame
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=tfidf_features,
    index=[f"Doc {i+1}" for i in range(len(documents))]
)

print("📊 TF-IDF Representation:")
print(tfidf_df.round(3))

print("\n💡 Interpretation:")
print("   - Higher values = More important to that document")
print("   - Values range from 0 to 1 (each row is L2-normalized by default)")
print("   - Words that appear in many documents get lower scores")

In [None]:
# Visualize TF-IDF
plt.figure(figsize=(12, 6))
sns.heatmap(tfidf_df, annot=True, fmt='.2f', cmap='Blues', cbar_kws={'label': 'TF-IDF Score'})
plt.title('TF-IDF Heatmap', fontsize=16, fontweight='bold')
plt.xlabel('Words', fontsize=12)
plt.ylabel('Documents', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

print("📊 Notice how TF-IDF highlights distinctive words!")

## 🚀 Part 4: Word Embeddings - Word2Vec

**Word Embeddings** = Dense vectors that capture word meaning!

### 🎯 The Revolution:

**Old Way (BoW/TF-IDF):**
- Sparse vectors (mostly zeros)
- No semantic meaning
- "king" and "queen" are unrelated columns with no notion of similarity

**New Way (Embeddings):**
- Dense vectors (a few hundred learned values, almost none zero)
- **Captures semantics**: Similar words → Similar vectors
- **Math with meaning**: king - man + woman ≈ queen

### 📐 Word2Vec Properties:

```
"king"   → [0.2, 0.8, 0.5, ...] (300 dimensions)
"queen"  → [0.1, 0.7, 0.6, ...]
"man"    → [0.3, 0.2, 0.1, ...]
"woman"  → [0.2, 0.1, 0.2, ...]

Magic:
vector("king") - vector("man") + vector("woman") ≈ vector("queen")
```
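
The vector arithmetic can be checked on a toy example. The 2-D vectors below are hand-made for illustration (think of the axes as "royalty" and "maleness") and are not real Word2Vec output; the point is only to show the mechanics of the analogy lookup.

```python
import math

# Assumption: hand-made 2-D vectors [royalty, maleness], not trained embeddings.
vectors = {
    "king":   [0.9, 0.9],
    "queen":  [0.9, 0.1],
    "man":    [0.1, 0.9],
    "woman":  [0.1, 0.1],
    "prince": [0.8, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# king - man + woman, computed component by component.
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Search every word except the three used in the query.
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))

print(best)  # queen
```

Libraries such as Gensim do the same thing at scale through `most_similar(positive=[...], negative=[...])`.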

### 🎯 Why This Matters for AI:

**🔍 RAG Systems:**
- Embed documents and queries
- Find similar documents using cosine similarity
- Retrieve relevant context for LLM

**💬 Chatbots:**
- Understand user intent through embeddings
- Match to similar training examples
- Generate contextual responses

**🌐 Semantic Search:**
- Search by meaning, not keywords
- "How to fix a bug" finds "debugging tutorial"

**🤖 Modern LLMs:**
- BERT, GPT, Claude use advanced embeddings
- Contextual embeddings (the same word gets a different vector depending on its sentence)
- Every transformer model starts by turning tokens into embeddings!

In [None]:
# Import Gensim's Word2Vec (gensim was installed in the first cell)
from gensim.models import Word2Vec

# Sample corpus about AI and ML
sentences = [
    ['machine', 'learning', 'is', 'subset', 'of', 'artificial', 'intelligence'],
    ['deep', 'learning', 'uses', 'neural', 'networks'],
    ['natural', 'language', 'processing', 'helps', 'computers', 'understand', 'text'],
    ['chatgpt', 'and', 'claude', 'are', 'large', 'language', 'models'],
    ['rag', 'systems', 'combine', 'retrieval', 'and', 'generation'],
    ['word', 'embeddings', 'capture', 'semantic', 'meaning'],
    ['transformers', 'revolutionized', 'natural', 'language', 'processing'],
    ['bert', 'and', 'gpt', 'use', 'attention', 'mechanisms'],
    ['vector', 'databases', 'store', 'embeddings', 'for', 'similarity', 'search'],
    ['semantic', 'search', 'finds', 'documents', 'by', 'meaning'],
]

print("🔤 Training Word2Vec on AI/ML corpus...\n")

# Train Word2Vec model
# vector_size: dimension of embeddings
# window: context window size
# min_count: ignore words appearing less than this
# sg: 1 for skip-gram, 0 for CBOW
# seed + workers=1: make repeated runs more reproducible
model_w2v = Word2Vec(
    sentences=sentences,
    vector_size=50,  # smaller for demo
    window=5,
    min_count=1,
    sg=1,  # Skip-gram
    epochs=100,
    seed=42,
    workers=1
)

print("✅ Word2Vec model trained!")
print(f"\nVocabulary size: {len(model_w2v.wv)}")
print(f"Embedding dimension: {model_w2v.wv.vector_size}")

# Get embedding for a word
word = 'learning'
embedding = model_w2v.wv[word]

print(f"\n🔢 Embedding for '{word}':")
print(f"Shape: {embedding.shape}")
print(f"First 10 dimensions: {embedding[:10].round(3)}")

In [None]:
# Find similar words
test_words = ['learning', 'language', 'embeddings', 'chatgpt']

print("🔍 Finding Similar Words:\n")

for word in test_words:
    if word in model_w2v.wv:
        similar = model_w2v.wv.most_similar(word, topn=3)
        print(f"📌 Words similar to '{word}':")
        for similar_word, score in similar:
            print(f"   - {similar_word}: {score:.3f}")
        print()

print("💡 Higher score = More similar!")
print("📊 Similarity measured using cosine similarity")
print("⚠️ With only 10 training sentences, these neighbors are noisy; real models train on billions of words")

## 🌍 Part 5: GloVe Embeddings (Pre-trained)

**GloVe** (Global Vectors for Word Representation) is another popular embedding method!

### 🎯 GloVe vs Word2Vec:

**Word2Vec:**
- Predicts context words from a target word (skip-gram) or the target from its context (CBOW)
- Learns from one local context window at a time
- Trains incrementally, streaming over the corpus

**GloVe:**
- Uses global word co-occurrence statistics
- Matrix factorization approach
- Designed to capture corpus-wide co-occurrence patterns directly
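
"Global co-occurrence statistics" just means counting, over the whole corpus, how often each pair of words appears near each other. Here is a minimal sketch of that counting step, assuming a made-up two-sentence corpus and a window of 1. GloVe then fits word vectors whose dot products approximate the log of these counts; that fitting step is not shown, and real GloVe uses larger windows with distance weighting.

```python
from collections import Counter

# Assumption: a tiny made-up corpus, already tokenized.
corpus = [
    ["deep", "learning", "uses", "neural", "networks"],
    ["machine", "learning", "uses", "data"],
]
window = 1  # how many neighbors on each side count as context

cooc = Counter()
for sentence in corpus:
    for i, word in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[(word, sentence[j])] += 1

print(cooc[("learning", "uses")])  # 2 (seen together in both sentences)
print(cooc[("deep", "learning")])  # 1
print(cooc[("deep", "uses")])      # 0 (never within the window)
```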

### 📦 Pre-trained Embeddings:

Instead of training from scratch, use pre-trained embeddings trained on massive corpora!

**Popular Pre-trained Embeddings:**
- **GloVe**: Trained on Wikipedia + Gigaword (6B tokens)
- **Word2Vec**: Trained on Google News (100B words)
- **FastText**: Facebook's embeddings with subword information

**In production, you would typically:**
- Download pre-trained GloVe vectors
- Load them into your application
- Use immediately without training!
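
Pre-trained GloVe vectors ship as plain text: one word per line, followed by its vector components separated by spaces. A loader can be sketched as below. The in-memory `fake_glove_file` and its three 4-dimensional rows are made up so the example runs without a download, and `load_glove` is an illustrative helper name.

```python
import io

# Stand-in for a downloaded file such as glove.6B.50d.txt.
# Assumption: the three 4-dimensional rows below are invented for the demo.
fake_glove_file = io.StringIO(
    "king 0.5 0.7 -0.1 0.2\n"
    "queen 0.4 0.8 -0.2 0.3\n"
    "apple -0.3 0.1 0.9 0.0\n"
)

def load_glove(handle):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    embeddings = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings

embeddings = load_glove(fake_glove_file)
print(len(embeddings), "words loaded")
print(embeddings["king"])
```

With a real file you would pass `open(path, encoding="utf-8")` instead of the `StringIO` object.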

In [None]:
# Simulating pre-trained embeddings (in production, you'd download GloVe)
# For demonstration, we'll use our Word2Vec model

def get_word_vector(word, model):
    """
    Get word vector from model
    """
    if word in model.wv:
        return model.wv[word]
    else:
        # Return zero vector for unknown words
        return np.zeros(model.wv.vector_size)

def document_vector(doc, model):
    """
    Create document vector by averaging word vectors
    This is used in many NLP applications!
    """
    # Tokenize and preprocess
    tokens = preprocess_text(doc)

    # Get word vectors, skipping out-of-vocabulary words: averaging in
    # zero vectors for unknown words would shrink the result toward zero
    word_vectors = [get_word_vector(word, model) for word in tokens if word in model.wv]

    # Average word vectors
    if word_vectors:
        return np.mean(word_vectors, axis=0)
    else:
        return np.zeros(model.wv.vector_size)

# Test document vectorization
test_doc = "Machine learning and deep learning are powerful AI techniques"

doc_vec = document_vector(test_doc, model_w2v)

print("📄 Document Vectorization:")
print(f"Document: '{test_doc}'")
print(f"\nDocument vector shape: {doc_vec.shape}")
print(f"First 10 values: {doc_vec[:10].round(3)}")
print("\n💡 This vector represents the entire document!")
print("   Averaged word vectors are a simple baseline for document retrieval")

## 🎨 Visualizing Word Embeddings

Word embeddings are high-dimensional (typically 50-300 dimensions). Let's use **t-SNE** to visualize them in 2D!

**t-SNE** = Dimensionality reduction that tries to keep points that are close in the original space close in the 2D plot

In [None]:
from sklearn.manifold import TSNE

# Get all words and their embeddings
words = list(model_w2v.wv.index_to_key)
word_vectors = np.array([model_w2v.wv[word] for word in words])

# Reduce to 2D using t-SNE
print("🎨 Reducing embeddings to 2D using t-SNE...")
tsne = TSNE(n_components=2, random_state=42, perplexity=5)
embeddings_2d = tsne.fit_transform(word_vectors)

print("✅ Dimensionality reduction complete!\n")

# Plot
plt.figure(figsize=(14, 10))
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], s=100, alpha=0.6, c='steelblue')

# Add word labels
for i, word in enumerate(words):
    plt.annotate(word, 
                xy=(embeddings_2d[i, 0], embeddings_2d[i, 1]),
                xytext=(5, 5),
                textcoords='offset points',
                fontsize=11,
                fontweight='bold')

plt.title('Word Embeddings Visualization (t-SNE)', fontsize=16, fontweight='bold')
plt.xlabel('Dimension 1', fontsize=12)
plt.ylabel('Dimension 2', fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

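print("📊 Look for related words that land near each other")
print("💡 With a corpus this small, clusters are rough; 'learning', 'language', 'processing' may or may not end up close")
print("🎯 RAG systems use the same idea of vector closeness, measured in the full embedding space rather than in a 2D plot")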

## 🏗️ Real AI Application: Building a Spam Detector

**Scenario**: Build an email spam detector using TF-IDF text vectors!

**Task**: Classify emails as SPAM or HAM (not spam)

**Architecture**:
1. Preprocess emails (clean text)
2. Convert to TF-IDF vectors
3. Train classifier (Naive Bayes)
4. Evaluate on test set
5. Test with real examples!

**This is used in:**
- Email providers (Gmail, Outlook)
- Spam detection systems
- Content moderation
- Phishing detection
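
Before handing the work to scikit-learn, it helps to see what a Naive Bayes classifier actually computes. The sketch below is a from-scratch multinomial Naive Bayes with add-one (Laplace) smoothing on four made-up messages. It uses raw word counts rather than TF-IDF weights, and the `train` data and `predict` helper are illustrative only.

```python
import math
from collections import Counter, defaultdict

# Assumption: four invented training messages.
train = [
    ("win free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting tomorrow at noon", "ham"),
    ("project report attached", "ham"),
]

# Count messages per class and words per class.
class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for text, label in train:
    word_counts[label].update(text.split())
vocab = {word for counts in word_counts.values() for word in counts}

def predict(text):
    """Return the class with the highest log-probability score."""
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / len(train))  # log prior
        total = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (word, class) pairs from scoring zero.
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("free prize inside"))                 # spam
print(predict("meeting about the project report"))  # ham
```

Each word nudges the score toward the class where it was seen more often, and the class with the larger total wins.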

In [None]:
# Create synthetic spam/ham dataset
emails = [
    # SPAM emails
    "Congratulations! You won a million dollars! Click here to claim your prize now!",
    "Urgent! Your account will be closed. Verify your password immediately.",
    "Get rich quick! Make money from home. Limited time offer!",
    "Free iPhone! Click now to get your free phone today!",
    "Weight loss miracle! Lose 20 pounds in 2 weeks guaranteed!",
    "You have inherited 5 million dollars from a distant relative in Nigeria",
    "Cheap medications! No prescription needed! Order now!",
    "Work from home and earn $5000 per week! No experience required!",
    "Congratulations! You are the lucky winner of our lottery!",
    "Click here for free credit card with unlimited limit!",
    
    # HAM emails (legitimate)
    "Hi, let's schedule a meeting tomorrow to discuss the project progress.",
    "Your Amazon order has been shipped and will arrive on Friday.",
    "Reminder: Team standup meeting at 10 AM tomorrow.",
    "Thank you for your purchase. Your receipt is attached.",
    "Hi Mom, I'll be visiting next weekend. Looking forward to seeing you!",
    "Your flight booking confirmation for New York on Dec 25th.",
    "The quarterly report is ready for review. Please let me know your feedback.",
    "Happy birthday! Hope you have a wonderful day!",
    "Reminder: Your dentist appointment is scheduled for next Tuesday at 2 PM.",
    "The new software update includes several bug fixes and improvements.",
]

# Labels (0 = HAM, 1 = SPAM)
labels = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  # First 10 are SPAM
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # Last 10 are HAM

print("📧 Email Dataset:")
print(f"Total emails: {len(emails)}")
print(f"SPAM emails: {sum(labels)}")
print(f"HAM emails: {len(labels) - sum(labels)}")

# Create DataFrame
df = pd.DataFrame({
    'email': emails,
    'label': labels,
    'category': ['SPAM' if l == 1 else 'HAM' for l in labels]
})

print("\n📊 Sample emails:")
print(df.head(10))

In [None]:
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df['email'], df['label'], test_size=0.3, random_state=42, stratify=df['label']
)

print("🔀 Data Split:")
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"\nTraining set class distribution:")
print(f"  SPAM: {sum(y_train)}")
print(f"  HAM: {len(y_train) - sum(y_train)}")

In [None]:
# Vectorize emails using TF-IDF
tfidf_vec = TfidfVectorizer(max_features=100, stop_words='english')

# Fit on training data
X_train_tfidf = tfidf_vec.fit_transform(X_train)
X_test_tfidf = tfidf_vec.transform(X_test)

print("📊 TF-IDF Vectorization:")
print(f"Training matrix shape: {X_train_tfidf.shape}")
print(f"Test matrix shape: {X_test_tfidf.shape}")
print(f"Vocabulary size: {len(tfidf_vec.get_feature_names_out())}")
print("\nFirst 20 features (alphabetical order):")
print(list(tfidf_vec.get_feature_names_out()[:20]))

In [None]:
# Train Naive Bayes classifier
print("🤖 Training Spam Classifier...\n")
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)
print("✅ Model trained!\n")

# Make predictions
y_pred = clf.predict(X_test_tfidf)

# Evaluate (only 6 test emails, so treat these numbers as a smoke test)
accuracy = accuracy_score(y_test, y_pred)

print("📊 Model Performance:")
print(f"Accuracy: {accuracy:.2%}\n")

print("📋 Classification Report:")
print(classification_report(y_test, y_pred, target_names=['HAM', 'SPAM']))

In [None]:
# Visualize confusion matrix
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['HAM', 'SPAM'], 
            yticklabels=['HAM', 'SPAM'])
plt.title('Confusion Matrix - Spam Detector', fontsize=16, fontweight='bold')
plt.ylabel('True Label', fontsize=12)
plt.xlabel('Predicted Label', fontsize=12)
plt.tight_layout()
plt.show()

print("📊 Confusion Matrix Interpretation:")
print("   - Diagonal cells = Correct predictions")
print("   - Off-diagonal = Misclassifications")

In [None]:
# Test with new emails
new_emails = [
    "Congratulations! You won a free vacation! Click here now!",
    "Hi, are we still meeting for lunch tomorrow?",
    "Urgent! Your package could not be delivered. Verify your address.",
    "The Python programming course starts next Monday. See you there!",
]

print("🔮 Testing Spam Detector on New Emails:\n")
print("="*80)

# Vectorize and predict
new_emails_tfidf = tfidf_vec.transform(new_emails)
predictions = clf.predict(new_emails_tfidf)
probabilities = clf.predict_proba(new_emails_tfidf)

for i, email in enumerate(new_emails):
    prediction = "SPAM" if predictions[i] == 1 else "HAM"
    spam_prob = probabilities[i][1]
    ham_prob = probabilities[i][0]
    
    print(f"\n📧 Email: \"{email}\"")
    print(f"🎯 Prediction: {prediction}")
    print(f"📊 Confidence: HAM {ham_prob:.2%} | SPAM {spam_prob:.2%}")
    print("="*80)

print("\n✅ Spam detector working!")
print("💡 Production filters such as Gmail's and Outlook's build on the same idea, with far more data and features")

## 🎯 Why This Matters for Modern AI

### 🔍 **RAG Systems (2024-2025 Trend)**

What you learned today is the **foundation of RAG**!

**Full RAG Pipeline:**
```
User Query: "What is machine learning?"
     ↓
1. EMBED query → vector (using Word2Vec, BERT, etc.)
     ↓
2. RETRIEVE similar docs (cosine similarity with TF-IDF/embeddings)
     ↓
3. GENERATE answer using LLM + retrieved docs
     ↓
Answer: "Based on our documentation, machine learning is..."
```
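
Steps 1 and 2 of the pipeline above can be sketched with the tools from today. As a stand-in for a real embedding model, `embed` below just returns bag-of-words counts, and the three documents are made up. Step 3 is reduced to assembling a prompt, since calling an LLM is outside the scope of this sketch.

```python
import math
from collections import Counter

# Assumption: a tiny invented knowledge base.
docs = {
    "ml_intro": "machine learning lets computers learn patterns from data",
    "cooking": "bake the bread in a hot oven for thirty minutes",
    "rag_notes": "rag systems retrieve documents before generating an answer",
}

def embed(text):
    """Stand-in embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, k=1):
    """Rank documents by similarity to the query and return the top k names."""
    q = embed(query)
    ranked = sorted(docs, key=lambda name: cosine(q, embed(docs[name])), reverse=True)
    return ranked[:k]

query = "what is machine learning"
top = retrieve(query)                      # steps 1-2: embed and retrieve
prompt = f"Answer using this context:\n{docs[top[0]]}\n\nQuestion: {query}"
print(top)  # ['ml_intro']
print(prompt)                              # step 3 would send this to an LLM
```

Swapping `embed` for a sentence-embedding model changes the quality of the ranking, not the shape of the pipeline.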

### 🎯 **Real-World Applications:**

**1. Customer Support Bots**
- Embed all support docs
- User asks question → retrieve relevant docs
- LLM generates answer with sources

**2. Spam & Content Moderation**
- Email providers (Gmail, Outlook)
- Social media platforms (Twitter, Facebook)
- Comment filtering systems

**3. Semantic Search**
- Google, Bing use embeddings
- Understand query intent
- Find pages by meaning

**4. Recommendation Systems**
- Netflix, YouTube recommendations
- E-commerce product suggestions
- Content discovery

### 🤖 **Modern Embedding Models:**

In production, you'd use advanced embeddings:
- **OpenAI Embeddings**: `text-embedding-3-large`
- **Sentence Transformers**: `all-MiniLM-L6-v2`
- **Cohere Embeddings**: Multilingual support
- **Voyage AI**: Optimized for RAG

These models embed whole sentences in context, so they usually retrieve far better than averaged Word2Vec vectors in RAG!

## 🎯 Interactive Exercise

**Challenge**: Build a product review sentiment analyzer!

**Scenario**: Classify product reviews as POSITIVE or NEGATIVE

**Tasks**:
1. Create a dataset of product reviews (at least 10 positive, 10 negative)
2. Preprocess the reviews
3. Vectorize using TF-IDF
4. Train a classifier
5. Test on new reviews

**Bonus**: Visualize the most important words for each class!

In [None]:
# YOUR CODE HERE!

# TODO 1: Create product review dataset
reviews = [
    # Add positive reviews
    # Add negative reviews
]

# TODO 2: Create labels (1 = positive, 0 = negative)
review_labels = []

# TODO 3: Split, vectorize, train, evaluate

# TODO 4: Test on new reviews

print("Complete the TODOs above!")

### ✅ Solution (Try on your own first!)

In [None]:
# SOLUTION

# Product reviews dataset
reviews = [
    # POSITIVE reviews
    "This product is amazing! Best purchase ever!",
    "Excellent quality, highly recommend to everyone!",
    "Love it! Works perfectly and arrived quickly.",
    "Outstanding product, exceeded my expectations!",
    "Great value for money, very satisfied with purchase.",
    "Fantastic! Exactly what I needed.",
    "Best product in its category. 5 stars!",
    "Impressed with the quality and fast shipping.",
    "Wonderful experience, will buy again!",
    "Absolutely love this product, works great!",
    
    # NEGATIVE reviews
    "Terrible product, waste of money!",
    "Poor quality, broke after one week.",
    "Disappointed, does not work as advertised.",
    "Awful experience, would not recommend.",
    "Complete waste of time and money.",
    "Low quality, returned immediately.",
    "Horrible product, do not buy!",
    "Very disappointed with this purchase.",
    "Worst product ever, complete failure.",
    "Useless, does not work at all.",
]

# Labels (1 = positive, 0 = negative)
review_labels = [1]*10 + [0]*10

# Split data
X_train_rev, X_test_rev, y_train_rev, y_test_rev = train_test_split(
    reviews, review_labels, test_size=0.3, random_state=42, stratify=review_labels
)

# Vectorize
tfidf_rev = TfidfVectorizer(max_features=50, stop_words='english')
X_train_rev_tfidf = tfidf_rev.fit_transform(X_train_rev)
X_test_rev_tfidf = tfidf_rev.transform(X_test_rev)

# Train classifier
clf_rev = MultinomialNB()
clf_rev.fit(X_train_rev_tfidf, y_train_rev)

# Evaluate
y_pred_rev = clf_rev.predict(X_test_rev_tfidf)
accuracy_rev = accuracy_score(y_test_rev, y_pred_rev)

print("🎯 Product Review Sentiment Analyzer\n")
print(f"Accuracy: {accuracy_rev:.2%}\n")
print("Classification Report:")
print(classification_report(y_test_rev, y_pred_rev, target_names=['Negative', 'Positive']))

# Test on new reviews
new_reviews = [
    "This is the best product I have ever bought!",
    "Complete disaster, do not waste your money.",
    "Okay product, nothing special.",
]

print("\n🔮 Testing on new reviews:\n")
new_reviews_tfidf = tfidf_rev.transform(new_reviews)
predictions_rev = clf_rev.predict(new_reviews_tfidf)

for review, pred in zip(new_reviews, predictions_rev):
    sentiment = "POSITIVE" if pred == 1 else "NEGATIVE"
    print(f"Review: \"{review}\"")
    print(f"Sentiment: {sentiment}\n")

print("✅ Sentiment analyzer built!")
print("💡 The same idea, at much larger scale, underlies sentiment analysis of product reviews!")

## 🎉 Congratulations!

**You just learned:**
- ✅ Text preprocessing pipeline (tokenization, stemming, lemmatization)
- ✅ Bag of Words and TF-IDF vectorization
- ✅ Word embeddings (Word2Vec) and semantic similarity
- ✅ How GloVe embeddings differ from Word2Vec
- ✅ How to build a spam detector using text classification
- ✅ How embeddings power modern AI (RAG, search, chatbots)

### 🎯 Key Takeaways:

1. **Text preprocessing is essential**
   - Cleaning and normalization improve model performance
   - Tokenization is the first step for all NLP tasks
   - Modern LLMs use sophisticated tokenizers (BPE, WordPiece)

2. **Embeddings capture semantic meaning**
   - Similar words → Similar vectors
   - Enable semantic search and similarity
   - Foundation of modern NLP

3. **TF-IDF highlights important words**
   - Simple yet powerful for classification
   - Used in search engines and document ranking
   - Still relevant in production systems

4. **Text classification has countless applications**
   - Spam detection, sentiment analysis, intent classification
   - Powers email filters, review systems, chatbots
   - Critical for content moderation

---

**🎯 Practice Exercise (Before Day 2):**

Build a news article categorizer:
1. Create articles from 3-4 categories (tech, sports, politics, entertainment)
2. Preprocess and vectorize using TF-IDF
3. Train a multi-class classifier
4. Evaluate and test on new articles

---

**📚 Next Lesson:** Day 2 - Advanced NLP Techniques
- Named Entity Recognition (NER)
- Part-of-Speech (POS) tagging
- Text classification with deep learning
- News article categorization
- Customer review analysis

---

**💬 Remember:**

*"Every AI system that processes text - from ChatGPT to Gmail's spam filter to Google Search - starts with the fundamentals you learned today. Text preprocessing, embeddings, and classification are the building blocks of modern NLP. You now understand the core concepts that power billions of AI-driven interactions every day!"* 🚀

---

**🔗 Connections to Modern AI:**
- **RAG**: Embeddings + TF-IDF for document retrieval
- **LLMs**: Advanced tokenization and embeddings
- **Chatbots**: Text classification for intent recognition
- **Search**: TF-IDF and semantic similarity
- **Content Moderation**: Text classification at scale