# Day 8: Full Review + Transformers & Modern NLP
**The AI Engineer Course 2026 - Section 28 + Advanced Topics**

**Student:** Natruja

**Date:** Thursday, February 19, 2026

---

## Learning Objectives
1. Review all NLP concepts from Days 1-7
2. Practice exercises covering all topics
3. Understand modern approaches (Transformers)
4. Learn about Large Language Models (LLMs)
5. Explore the future of NLP

## Setup: Install and Import Libraries

In [None]:
import subprocess
import sys

# Install libraries
subprocess.check_call([sys.executable, "-m", "pip", "install", "nltk", "scikit-learn", "spacy", "textblob", "-q"])

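# Download data
import nltk
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)   # tokenizer data required by newer NLTK releases
nltk.download('wordnet', quiet=True)     # required by WordNetLemmatizer
nltk.download('vader_lexicon', quiet=True)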

# Download spacy model
subprocess.check_call([sys.executable, "-m", "spacy", "download", "en_core_web_sm", "-q"])

print("✓ All libraries installed!")

In [None]:
# Import all libraries
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline
import re
import numpy as np
import pandas as pd

# Load spacy model
nlp = spacy.load('en_core_web_sm')
analyzer = SentimentIntensityAnalyzer()
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

print("✓ All imports successful!")

# PART 1: COMPREHENSIVE NLP REVIEW

## Complete NLP Pipeline Overview

### Day 1-2: Text Preprocessing
- **Tokenization**: Breaking text into words/sentences
- **Stop Words**: Removing common, low-value words
- **Stemming**: Reducing words to root form (fast, approximate)
- **Lemmatization**: Converting to dictionary form (accurate)
- **Regex**: Pattern matching and text cleaning

### Day 3: Information Extraction
- **POS Tagging**: Identifying word roles (NOUN, VERB, etc.)
- **NER**: Extracting named entities (people, places, organizations)
- **Noun Chunks**: Identifying noun phrases

### Day 4: Sentiment Analysis
- **TextBlob**: Simple sentiment scoring
- **VADER**: Social media sentiment analysis
- **Classification**: Positive, negative, neutral

### Day 5: Text Representation
- **Bag of Words**: Word frequency vectors
- **TF-IDF**: Weighted importance scores
- **LDA**: Topic modeling (probabilistic)
- **NMF**: Topic modeling (matrix decomposition)
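
To make the TF-IDF idea concrete, here is a minimal pure-Python sketch. It assumes the textbook formula `tf * log(N / df)` and simple whitespace tokenization. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ. The `tfidf` helper is a name made up for this illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

docs = [
    "machine learning is powerful",
    "deep learning uses neural networks",
    "natural language processing is important",
]
scores = tfidf(docs)
for term, score in sorted(scores[0].items(), key=lambda item: -item[1]):
    print(f"{term:10} {score:.3f}")
```

Words that appear in only one document ("machine", "powerful") score higher than words shared across documents ("learning", "is"), which is the "weighted importance" the bullet refers to.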

### Day 6-7: Text Classification
- **Naive Bayes**: Fast, effective for text
- **Logistic Regression**: Linear classification
- **Pipelines**: Streamlined preprocessing + classification
- **Evaluation**: Metrics and confusion matrices
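
As a quick refresher on the evaluation step, the sketch below computes the four confusion-matrix cells and the usual metrics by hand for a binary task. The labels are made-up illustration data, and the helper names are hypothetical. In practice `classification_report` from scikit-learn reports the same quantities.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts

def precision_recall_f1(counts):
    """Derive precision, recall and F1 from the confusion counts."""
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(counts)
accuracy = (counts["tp"] + counts["tn"]) / len(y_true)
print(dict(counts))
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```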

## Example: Complete Preprocessing Pipeline

In [None]:
def complete_preprocessing(text):
    """Complete text preprocessing pipeline."""
    
    # Step 1: Lowercase and basic cleaning
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    
    # Step 2: Tokenization
    tokens = word_tokenize(text)
    
    # Step 3: Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]
    
    # Step 4: Lemmatization
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    
    return tokens

# Test
sample = "The quick brown foxes are running through the forest!"
processed = complete_preprocessing(sample)
print(f"Original: {sample}")
print(f"Processed: {processed}")
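
One detail of Step 1 is worth a closer look. Deleting punctuation outright glues the surrounding pieces together, which matters for hyphenated words, emails and URLs. The sketch below contrasts that with replacing punctuation by a space. The sample string is made up for illustration, and which behaviour you want depends on the task.

```python
import re

text = "state-of-the-art results; email: test@example.com"

# Deleting punctuation (as in Step 1 above) fuses neighbouring pieces together
deleted = re.sub(r"[^a-z0-9\s]", "", text.lower())

# Replacing punctuation with a space keeps the pieces apart;
# the second substitution collapses the resulting runs of whitespace
spaced = re.sub(r"[^a-z0-9\s]", " ", text.lower())
spaced = re.sub(r"\s+", " ", spaced).strip()

print(deleted.split())
print(spaced.split())
```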

# PART 2: MODERN NLP & THE FUTURE

## Beyond Traditional NLP: Transformers

### The Evolution of NLP

**Traditional Approaches (Days 1-7):**
- Rule-based and statistical methods
- Manual feature engineering
- Bag of Words, TF-IDF
- Naive Bayes, SVM
- Limited context understanding

**Deep Learning Era (2012+):**
- Neural networks learn features automatically
- RNNs, LSTMs, GRUs
- Word embeddings (Word2Vec, GloVe, FastText)
- Sequence-to-sequence models

**Transformer Era (2017+):**
- Attention mechanism enables parallelization
- BERT, GPT, T5
- Pre-trained models on massive data
- Transfer learning for downstream tasks
- Context-aware representations

## Understanding Transformers

### Key Innovations:

**1. Attention Mechanism**
- Words "attend to" other relevant words
- Weights show which words are important
- Example: In "The bank robber was caught", "bank" attends to "robber" (context matters!)

**2. Self-Attention**
- Each word looks at all other words in sequence
- Learns dependencies regardless of distance
- Parallelizable (unlike RNNs)

**3. Pre-training + Fine-tuning**
- Pre-train on massive text (billions of words)
- Fine-tune on specific task with small data
- Transfer learning in NLP

### Advantages:
- Captures long-range dependencies
- Parallelize training (faster)
- State-of-the-art performance
- Transfer learning works well
- Understands context much better
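
The self-attention idea above can be sketched in plain Python. This is a simplified illustration, not a full transformer layer: it assumes the queries, keys and values are all the raw input vectors (real models first multiply by learned weight matrices), and the 2-dimensional "embeddings" are hand-picked so that "bank" and "robber" point in a similar direction.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = vectors."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this token to every token, scaled by sqrt(dim)
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors
        output = [
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical embeddings, chosen by hand for this illustration
tokens = ["the", "bank", "robber"]
embeddings = [[0.1, 0.0], [1.0, 1.0], [0.9, 1.1]]

outputs, weights = self_attention(embeddings)
for token, row in zip(tokens, weights):
    print(f"{token:7}", [round(w, 2) for w in row])
```

Each row of weights sums to 1, and the row for "bank" puts far more weight on "robber" than on "the". Because every row is computed independently of the others, all of them can be computed at once, which is where the parallelization advantage over RNNs comes from.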

## Popular Transformer Models

### BERT (Bidirectional Encoder Representations from Transformers)
- Developed by Google (2018)
- Reads text in both directions
- Great for understanding tasks:
  - Text classification
  - Named entity recognition
  - Question answering
  - Semantic similarity

### GPT (Generative Pre-trained Transformer)
- Developed by OpenAI (first version released in 2018)
- Reads left-to-right (autoregressive)
- Great for generation tasks:
  - Text generation
  - Creative writing
  - Code generation
  - Dialogue

### T5 (Text-to-Text Transfer Transformer)
- Developed by Google (2019)
- All tasks as text-to-text
- Can do:
  - Translation
  - Summarization
  - Classification
  - Question answering

### Other Notable Models:
- **RoBERTa**: Improved BERT
- **ALBERT**: Efficient BERT
- **DistilBERT**: Smaller, faster BERT
- **ELECTRA**: Better pre-training method
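
The "both directions" versus "left-to-right" difference between BERT and GPT comes down to which positions each token is allowed to attend to. Here is a small sketch of the two attention masks. The `attention_mask` helper is a hypothetical name for this illustration; real libraries build these masks internally.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len grid: 1 = may attend, 0 = blocked."""
    return [
        [1 if (not causal or key <= query) else 0 for key in range(seq_len)]
        for query in range(seq_len)
    ]

tokens = ["the", "bank", "robber", "fled"]

print("BERT-style (bidirectional):")
for token, row in zip(tokens, attention_mask(len(tokens), causal=False)):
    print(f"  {token:7} {row}")

print("GPT-style (causal, left-to-right):")
for token, row in zip(tokens, attention_mask(len(tokens), causal=True)):
    print(f"  {token:7} {row}")
```

In the causal mask, row *i* only has ones up to column *i*, so a token never sees the words that come after it. That restriction is what makes GPT-style models natural text generators.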

## Large Language Models (LLMs)

### What are LLMs?
- Massive neural networks trained on billions of words
- Billions to hundreds of billions of parameters
- Predict next word probability
- Generate coherent, contextual text
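
"Predict next word probability" can be illustrated with a toy bigram model. This is only a stand-in: it conditions on a single previous word and uses raw counts, whereas an LLM conditions on a long context through a transformer network. The corpus is made up for illustration.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the cat ate the fish ."
tokens = corpus.split()

# Count how often each word follows each other word
following = defaultdict(Counter)
for current, nxt in zip(tokens, tokens[1:]):
    following[current][nxt] += 1

def next_word_probs(word):
    """Return P(next word | word) estimated from the bigram counts."""
    counts = following[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

probs = next_word_probs("the")
print(probs)

# Greedy generation: repeatedly pick the most probable next word
word, generated = "the", ["the"]
for _ in range(3):
    word = max(following[word], key=following[word].get)
    generated.append(word)
print(" ".join(generated))
```

Generation is just this prediction step applied in a loop, feeding each chosen word back in as the new context.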

### Examples:
- **GPT-4** (OpenAI): Parameter count not officially disclosed
- **Claude** (Anthropic): Focused on safety and reasoning
- **Gemini** (Google): Multimodal (text + images)
- **LLaMA** (Meta): Open-weight foundation model family
- **Mistral** (Mistral AI): Efficient open-weight models

### Capabilities:
- Text generation and completion
- Question answering
- Code generation
- Translation
- Summarization
- Reasoning and problem-solving
- Creative writing
- Multi-turn conversations

## EXAMPLE: Understanding Attention (Conceptual)

In [None]:
# Conceptual attention visualization
sentence = "The bank robber was caught by police"
words = sentence.split()

# Simulate attention weights (how much 'bank' "looks at" each word in the sentence)
# Real transformers learn these weights and pass them through a softmax so they sum to 1;
# the hand-picked, unnormalized values below are just for illustration
print(f"Sentence: {sentence}")
print("\nAttention Visualization:")
print("(How much 'bank' attends to each word)\n")

# Attention weights for understanding 'bank'
attention_for_bank = {
    'the': 0.1,      # article, low relevance
    'bank': 1.0,     # itself
    'robber': 0.8,   # high relevance (who is robbing the bank?)
    'was': 0.2,      # verb, some relevance
    'caught': 0.3,   # action, moderate relevance
    'by': 0.1,       # preposition
    'police': 0.5    # relevant (who caught them?)
}

for word, weight in attention_for_bank.items():
    bar = '█' * int(weight * 20)
    print(f"  {word:10} {bar} {weight:.1f}")

print("\nKey Insight: 'bank' pays high attention to 'robber'")
print("This helps the model understand the context!")

## Transformer vs Traditional Methods Comparison

In [None]:
comparison = {
    'Aspect': ['Feature Engineering', 'Context Understanding', 'Training Time', 'Inference Speed', 'Data Requirements', 'Transfer Learning', 'Performance on Complex Tasks'],
    'Traditional (Naive Bayes, SVM)': ['Manual', 'Limited', 'Fast', 'Very Fast', 'Small data ok', 'Difficult', 'Lower'],
    'Deep Learning (RNN, LSTM)': ['Automatic', 'Moderate', 'Slow', 'Slow', 'Need more data', 'Possible', 'Good'],
    'Transformers (BERT, GPT)': ['Automatic', 'Excellent', 'Very Slow', 'Fast', 'Massive data (pre-trained)', 'Excellent', 'State-of-the-art']
}
df = pd.DataFrame(comparison)
print("\nTRADITIONAL vs DEEP LEARNING vs TRANSFORMERS")
print("="*80)
print(df.to_string(index=False))

## The Future of NLP

### Current Trends (2024-2025):

**1. Efficiency & Compression**
- Smaller models for edge devices
- Quantization and distillation
- LoRA and parameter-efficient fine-tuning

**2. Multimodal AI**
- Text + Images (GPT-4V, Gemini)
- Text + Audio (Whisper)
- Unified representations

**3. Open-Source Development**
- LLaMA, Mistral becoming accessible
- Democratizing AI development
- Fine-tuning on consumer hardware

**4. Specialized Models**
- Domain-specific (medical, legal, financial)
- Task-specific (code, multilingual, reasoning)
- Mixture of Experts

**5. Better Evaluation**
- Beyond accuracy metrics
- Bias and fairness testing
- Robustness to adversarial examples
- Human alignment

**6. Reasoning & Planning**
- Chain-of-thought prompting
- Few-shot learning
- In-context learning
- Multi-step reasoning
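
Of the trends above, quantization is the easiest to show in a few lines. The sketch below assumes simple symmetric int8 quantization with one scale per list of weights. Production schemes are more elaborate (per-channel scales, calibration data), and the weight values are made up for illustration.

```python
def quantize_int8(weights):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]

# Made-up weights for illustration only
weights = [0.42, -1.27, 0.05, 0.88, -0.31]

quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(f"int8 values: {quantized}")
print(f"scale: {scale:.4f}, max round-trip error: {max_error:.6f}")
```

Each weight now fits in one byte instead of four (for float32), at the cost of a rounding error of at most half the scale.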

---

# PART 3: COMPREHENSIVE EXERCISES BY DIFFICULTY

**Note:** For exercises using the `transformers` library, you'll need to install it first:
```bash
pip install transformers torch
```

This may take a few minutes on first install. Torch is a large dependency.

## EASY (★) - Exercises 1-5

These exercises review fundamental concepts from Days 1-7.

## ★ EASY: Exercise 1 - Tokenize and Remove Stop Words

**Objective:** Review basic text preprocessing: tokenization and stop word removal.

**Instructions:**
- Tokenize the given text using NLTK
- Remove stop words
- Print the original and processed tokens

In [None]:
# ★ EASY: Exercise 1 - Tokenize and Remove Stop Words

def exercise1_tokenize_remove_stopwords(text):
    """
    TODO: Complete this function
    1. Tokenize the text using word_tokenize()
    2. Get English stop words from stopwords.words('english')
    3. Filter out stop words from the tokens
    4. Return both original and filtered tokens
    """
    # Step 1: Tokenize
    tokens = ___
    
    # Step 2: Get stop words
    stop_words = ___
    
    # Step 3: Remove stop words
    filtered_tokens = ___
    
    return tokens, filtered_tokens

# Test your solution
test_text = "Machine learning and artificial intelligence are transforming the world."
# original, filtered = exercise1_tokenize_remove_stopwords(test_text)
# print(f"Original tokens: {original}")
# print(f"After stop word removal: {filtered}")

## ★ EASY: Exercise 2 - Stem vs Lemmatize a Word

**Objective:** Understand the difference between stemming and lemmatization.

**Instructions:**
- Apply stemming using PorterStemmer
- Apply lemmatization using WordNetLemmatizer
- Compare the results for different words
- Note: Lemmatization is more accurate but slower; `WordNetLemmatizer` treats every word as a noun unless you pass `pos` (e.g. `pos='v'` for verbs)

In [None]:
# ★ EASY: Exercise 2 - Stem vs Lemmatize a Word

def exercise2_stem_vs_lemmatize(words):
    """
    TODO: Complete this function
    For each word, apply:
    1. stemmer.stem(word) - using the global stemmer object
    2. lemmatizer.lemmatize(word) - using the global lemmatizer object
    Return a dictionary with original, stem, and lemma for each word
    """
    results = {}
    
    for word in words:
        stemmed = ___
        lemmatized = ___
        
        results[word] = {
            'stem': stemmed,
            'lemma': lemmatized
        }
    
    return results

# Test your solution
test_words = ['running', 'runs', 'ran', 'studies', 'studying', 'better']
# results = exercise2_stem_vs_lemmatize(test_words)
# for word, forms in results.items():
#     print(f"{word:12} -> stem: {forms['stem']:12} lemma: {forms['lemma']}")

## ★ EASY: Exercise 3 - Match Concepts to Tools

**Objective:** Review what each NLP tool does.

**Instructions:**
- Create a dictionary mapping NLP tasks/concepts to the tools that perform them
- Include: tokenization, POS tagging, NER, sentiment analysis, vectorization, classification
- For each tool, provide a brief description of what it does

In [None]:
# ★ EASY: Exercise 3 - Match Concepts to Tools

def exercise3_concept_to_tool():
    """
    TODO: Create a dictionary that maps NLP concepts to the tools/libraries that perform them
    Structure: {
        'concept_name': {
            'tool': 'library/function name',
            'description': 'what it does'
        },
        ...
    }
    
    Include these concepts:
    1. 'Tokenization' - Breaking text into words/sentences
    2. 'POS Tagging' - Identifying parts of speech
    3. 'Named Entity Recognition' - Extracting entities
    4. 'Sentiment Analysis' - Analyzing emotions/opinions
    5. 'Text Vectorization' - Converting text to numbers
    6. 'Text Classification' - Categorizing text
    """
    
    concept_map = {
        'Tokenization': {
            'tool': ___,
            'description': ___
        },
        'POS Tagging': {
            'tool': ___,
            'description': ___
        },
        # TODO: Add remaining 4 concepts
    }
    
    return concept_map

# Test your solution
# concept_map = exercise3_concept_to_tool()
# for concept, info in concept_map.items():
#     print(f"\n{concept}:")
#     print(f"  Tool: {info['tool']}")
#     print(f"  Description: {info['description']}")

## ★ EASY: Exercise 4 - Use Transformers Pipeline for Sentiment Analysis

**Objective:** Learn to use the transformers library for sentiment analysis (Modern NLP approach).

**Note:** This requires `pip install transformers torch` (if not already installed)

**Instructions:**
- Import the pipeline function from transformers
- Create a sentiment-analysis pipeline
- Test it on sample sentences
- Compare with traditional VADER sentiment analysis

In [None]:
# ★ EASY: Exercise 4 - Use Transformers Pipeline for Sentiment Analysis

def exercise4_transformers_sentiment(texts):
    """
    TODO: Complete this function using the transformers library
    1. Import pipeline from transformers
    2. Create a 'sentiment-analysis' pipeline
    3. Use the pipeline on each text
    4. Return results with text, label, and score
    
    If you haven't installed transformers yet:
    pip install transformers torch
    """
    try:
        from transformers import pipeline
    except ImportError:
        print("Please install transformers: pip install transformers torch")
        return None
    
    # TODO: Create sentiment analysis pipeline
    sentiment_pipeline = ___
    
    results = []
    for text in texts:
        # TODO: Use pipeline to analyze text
        output = ___
        
        results.append({
            'text': text,
            'label': ___ ,  # Extract label from output
            'score': ___    # Extract score from output
        })
    
    return results

# Test your solution
test_texts = [
    "I absolutely love this! It's amazing!",
    "This is terrible and disappointing.",
    "The product is okay, nothing special."
]
# results = exercise4_transformers_sentiment(test_texts)
# if results:
#     for r in results:
#         print(f"Text: {r['text']}")
#         print(f"  Prediction: {r['label']} (confidence: {r['score']:.2f})\n")

## ★ EASY: Exercise 5 - BERT vs GPT Comparison (Fill-in-the-Blank)

**Objective:** Understand the key differences between BERT and GPT architectures.

**Instructions:**
- Complete the comparison table by filling in the blanks
- This is a conceptual exercise (no code required)
- Consider: bidirectional vs unidirectional, encoder vs decoder, tasks, etc.

In [None]:
# ★ EASY: Exercise 5 - BERT vs GPT (Conceptual Comparison)

def exercise5_bert_vs_gpt():
    """
    TODO: Fill in the blanks for the BERT vs GPT comparison
    
    For each property, provide the correct answer:
    
    BERT:
    - Direction: Bidirectional (reads left AND right)
    - Architecture: ___ (Encoder only / Decoder only / Encoder-Decoder)
    - Pre-training objective: ___ (Masked Language Modeling / Causal Language Modeling)
    - Best for: ___ (Generation / Understanding)
    - Example tasks: ___ (list 2-3 tasks)
    
    GPT:
    - Direction: ___ (reads left to right only)
    - Architecture: ___ (Encoder only / Decoder only / Encoder-Decoder)
    - Pre-training objective: ___ (Masked Language Modeling / Causal Language Modeling)
    - Best for: ___ (Generation / Understanding)
    - Example tasks: ___ (list 2-3 tasks)
    """
    
    comparison = {
        'BERT': {
            'direction': 'Bidirectional',
            'architecture': ___,
            'pre_training': ___,
            'best_for': ___,
            'example_tasks': ___
        },
        'GPT': {
            'direction': ___,
            'architecture': ___,
            'pre_training': ___,
            'best_for': ___,
            'example_tasks': ___
        }
    }
    
    return comparison

# Test your understanding
# comparison = exercise5_bert_vs_gpt()
# for model, properties in comparison.items():
#     print(f"\n{model}:")
#     for prop, value in properties.items():
#         print(f"  {prop}: {value}")

## MEDIUM (★★) - Exercises 6-10

These exercises combine multiple concepts and require more complex implementation.

## ★★ MEDIUM: Exercise 6 - Build Preprocessing Pipeline from Scratch

**Objective:** Create a reusable text preprocessing function that combines multiple steps.

**Instructions:**
- Combine: lowercase, regex cleaning, tokenization, stop word removal, lemmatization
- Make it flexible with optional parameters
- Test on multiple types of text

In [None]:
# ★★ MEDIUM: Exercise 6 - Build Preprocessing Pipeline from Scratch

def exercise6_preprocessing_pipeline(text, remove_stopwords=True, use_lemmatization=True):
    """
    TODO: Build a complete preprocessing pipeline
    
    Steps:
    1. Convert to lowercase
    2. Remove non-alphanumeric characters (keep spaces) using regex
    3. Tokenize using word_tokenize()
    4. (Optional) Remove stop words if remove_stopwords=True
    5. (Optional) Lemmatize if use_lemmatization=True, else stem
    6. Remove any empty tokens
    
    Return: list of processed tokens
    """
    
    # Step 1: Lowercase
    text = ___
    
    # Step 2: Remove special characters
    text = ___
    
    # Step 3: Tokenize
    tokens = ___
    
    # Step 4: Remove stop words (if enabled)
    if remove_stopwords:
        stop_words = set(___)
        tokens = ___
    
    # Step 5: Lemmatization or stemming
    if use_lemmatization:
        tokens = ___
    else:
        tokens = ___
    
    # Step 6: Remove empty tokens
    tokens = ___
    
    return tokens

# Test your solution
test_text = "Hello, World! Machine Learning is awesome!!! Email: test@example.com"
# result1 = exercise6_preprocessing_pipeline(test_text)
# result2 = exercise6_preprocessing_pipeline(test_text, remove_stopwords=False, use_lemmatization=False)
# print(f"With all processing: {result1}")
# print(f"Stop words kept, stemming instead of lemmatization: {result2}")

## ★★ MEDIUM: Exercise 7 - Use spaCy for POS + NER

**Objective:** Extract linguistic and named entity information from text using spaCy.

**Instructions:**
- Process text with spaCy
- Extract POS tags for each token
- Extract named entities with their labels
- Return structured results

In [None]:
# ★★ MEDIUM: Exercise 7 - Use spaCy for POS + NER

def exercise7_spacy_pos_ner(text):
    """
    TODO: Use spaCy to extract POS tags and named entities
    
    Steps:
    1. Process text with nlp(text) - the spaCy model
    2. Extract POS tags: for each token, get token.pos_ and token.text
    3. Extract entities: for each ent in doc.ents, get ent.text and ent.label_
    4. Return dictionary with 'pos_tags' and 'entities'
    """
    
    # Step 1: Process with spaCy
    doc = ___
    
    # Step 2: Extract POS tags
    pos_tags = ___
    
    # Step 3: Extract named entities
    entities = ___
    
    return {
        'pos_tags': pos_tags,
        'entities': entities,
        'text': text
    }

# Test your solution
test_text = "Apple Inc. CEO Tim Cook announced the iPhone 15 in Cupertino, California."
# result = exercise7_spacy_pos_ner(test_text)
# print(f"Text: {result['text']}")
# print(f"\nPOS Tags: {result['pos_tags']}")
# print(f"\nEntities: {result['entities']}")

## ★★ MEDIUM: Exercise 8 - Zero-Shot Classification Pipeline

**Objective:** Use transformers for zero-shot text classification without training data.

**Note:** Requires `pip install transformers torch`

**Instructions:**
- Create a zero-shot classification pipeline
- Classify text into custom candidate labels
- Compare predictions across different label sets

In [None]:
# ★★ MEDIUM: Exercise 8 - Zero-Shot Classification Pipeline

def exercise8_zero_shot_classification(text, candidate_labels):
    """
    TODO: Implement zero-shot classification
    
    Steps:
    1. Import pipeline from transformers
    2. Create a 'zero-shot-classification' pipeline
    3. Use the pipeline with text and candidate_labels
    4. Return the results (sequence, labels, scores)
    
    Zero-shot means: classify without training on these specific labels
    The model generalizes from pre-training knowledge
    """
    try:
        from transformers import pipeline
    except ImportError:
        print("Please install transformers: pip install transformers torch")
        return None
    
    # TODO: Create zero-shot-classification pipeline
    classifier = ___
    
    # TODO: Use pipeline to classify the text
    results = ___
    
    return results

# Test your solution
test_text = "The stock market went up 5% today due to positive earnings reports."
labels = ["finance", "sports", "entertainment", "technology"]

# result = exercise8_zero_shot_classification(test_text, labels)
# if result:
#     print(f"Text: {test_text}")
#     print(f"\nPredictions:")
#     for label, score in zip(result['labels'], result['scores']):
#         print(f"  {label}: {score:.3f}")

## ★★ MEDIUM: Exercise 9 - Compare Traditional vs Transformer Sentiment

**Objective:** Compare VADER (traditional) with transformer-based sentiment analysis.

**Instructions:**
- Analyze the same texts with both methods
- Compare results and discuss differences
- Note when each method works better

In [None]:
# ★★ MEDIUM: Exercise 9 - Compare Traditional vs Transformer Sentiment

def exercise9_compare_sentiment_methods(texts):
    """
    TODO: Compare VADER and transformer sentiment analysis
    
    Steps:
    1. For each text, get VADER sentiment using analyzer.polarity_scores()
    2. For each text, get transformer sentiment (use exercise4 or pipeline)
    3. Create a comparison showing both results
    4. Return formatted results for analysis
    """
    from transformers import pipeline
    
    # Initialize transformer pipeline
    transformer_pipeline = ___
    
    results = []
    
    for text in texts:
        # Get VADER sentiment
        vader_scores = ___
        vader_label = "Positive" if vader_scores['compound'] > 0.1 else ("Negative" if vader_scores['compound'] < -0.1 else "Neutral")
        
        # Get transformer sentiment
        transformer_output = ___
        # pipeline(text) returns a list with one dict per input: [{'label': ..., 'score': ...}]
        transformer_label = transformer_output[0]['label']
        transformer_score = transformer_output[0]['score']
        
        results.append({
            'text': text,
            'vader': {
                'compound': vader_scores['compound'],
                'label': vader_label
            },
            'transformer': {
                'label': transformer_label,
                'score': transformer_score
            }
        })
    
    return results

# Test your solution
test_texts = [
    "This product is absolutely amazing! I love it so much!",
    "Terrible quality. Very disappointed with this purchase.",
    "It's okay, nothing special but gets the job done."
]

# results = exercise9_compare_sentiment_methods(test_texts)
# for r in results:
#     print(f"\nText: {r['text']}")
#     print(f"  VADER: {r['vader']['label']} (compound: {r['vader']['compound']:.2f})")
#     print(f"  Transformer: {r['transformer']['label']} (score: {r['transformer']['score']:.2f})")

## ★★ MEDIUM: Exercise 10 - Create TF-IDF Matrix

**Objective:** Vectorize text using TF-IDF and understand matrix properties.

**Instructions:**
- Create a TF-IDF vectorizer
- Transform document collection
- Print matrix shape and examine values
- Identify important words

In [None]:
# ★★ MEDIUM: Exercise 10 - Create TF-IDF Matrix

def exercise10_tfidf_matrix(documents):
    """
    TODO: Create a TF-IDF matrix from documents
    
    Steps:
    1. Create a TfidfVectorizer with English stop words
    2. Fit and transform the documents
    3. Get feature names (vocabulary)
    4. Return the matrix, feature names, and matrix info
    """
    
    # Step 1: Create vectorizer
    vectorizer = ___
    
    # Step 2: Fit and transform
    tfidf_matrix = ___
    
    # Step 3: Get feature names
    feature_names = ___
    
    return {
        'matrix': tfidf_matrix,
        'feature_names': feature_names,
        'shape': tfidf_matrix.shape,  # (num_documents, num_features)
        'vectorizer': vectorizer
    }

# Test your solution
test_docs = [
    "machine learning is powerful",
    "deep learning uses neural networks",
    "natural language processing is important",
    "machine learning and deep learning are AI technologies"
]

# result = exercise10_tfidf_matrix(test_docs)
# print(f"Matrix shape: {result['shape']}")
# print(f"Number of documents: {result['shape'][0]}")
# print(f"Number of unique words: {result['shape'][1]}")
# print(f"\nVocabulary (first 20 words): {result['feature_names'][:20]}")

## HARD (★★★) - Exercises 11-15

These exercises require combining multiple techniques and creative problem-solving.

## ★★★ HARD: Exercise 11 - Complete NLP Pipeline (Preprocessing to Classification)

**Objective:** Build an end-to-end NLP pipeline from raw text to predictions.

**Instructions:**
- Step 1: Preprocess documents
- Step 2: Vectorize with TF-IDF
- Step 3: Train a classifier
- Step 4: Make predictions and evaluate

In [None]:
# ★★★ HARD: Exercise 11 - Complete NLP Pipeline

def exercise11_complete_nlp_pipeline(documents, labels):
    """
    TODO: Build a complete NLP pipeline
    
    Steps:
    1. Split data using train_test_split (test_size=0.3, random_state=42)
    2. Create a Pipeline combining:
       - TfidfVectorizer (with stop_words='english')
       - LogisticRegression classifier (max_iter=1000)
    3. Train the pipeline on training data
    4. Evaluate on test data (get accuracy)
    5. Get predictions on test set
    6. Return predictions and accuracy
    """
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    
    # Step 1: Split data
    X_train, X_test, y_train, y_test = ___
    
    # Step 2: Create pipeline
    pipeline = Pipeline([
        ('tfidf', ___),
        ('classifier', ___)
    ])
    
    # Step 3: Train
    ___
    
    # Step 4 & 5: Evaluate and predict
    accuracy = ___
    predictions = ___
    
    return {
        'pipeline': pipeline,
        'accuracy': accuracy,
        'predictions': predictions,
        'test_data': X_test,
        'test_labels': y_test
    }

# Test your solution
test_docs = [
    "machine learning is great", "neural networks are powerful",
    "python programming is fun", "data science is important",
    "artificial intelligence transforms industry", "deep learning models",
    "web development with javascript", "software engineering practices"
]
test_labels = [0, 0, 1, 0, 0, 0, 1, 1]  # 0=AI/ML, 1=Other

# result = exercise11_complete_nlp_pipeline(test_docs, test_labels)
# print(f"Accuracy: {result['accuracy']:.2f}")
# print(f"Predictions: {result['predictions']}")

## ★★★ HARD: Exercise 12 - Multi-Technique Analysis (Entities + Sentiment + Keywords)

**Objective:** Combine multiple NLP techniques on a single text.

**Instructions:**
- Extract named entities
- Analyze sentiment
- Extract important keywords (TF-IDF)
- Return comprehensive analysis

In [None]:
# ★★★ HARD: Exercise 12 - Multi-Technique Analysis

def exercise12_multi_technique_analysis(text):
    """
    TODO: Analyze text using multiple NLP techniques
    
    Perform:
    1. Named Entity Recognition - extract entities and labels
    2. Sentiment Analysis - use VADER for sentiment
    3. Keyword Extraction - identify important words
    4. Return comprehensive analysis object
    
    For keyword extraction, you can:
    - Use lemmatization + filter by POS (nouns, verbs)
    - Remove stop words
    - Return top keywords
    """
    
    # 1. Named Entity Recognition with spaCy
    doc = ___
    entities = ___
    
    # 2. Sentiment Analysis with VADER
    sentiment_scores = ___
    sentiment_label = ___  # Positive/Negative/Neutral based on compound score
    
    # 3. Keyword Extraction
    # Process tokens: remove stop words, lemmatize, filter by POS
    stop_words = set(___)
    keywords = [
        token.text for token in doc 
        if token.pos_ in ['NOUN', 'VERB', 'ADJ'] and token.text.lower() not in stop_words
    ]
    # Remove duplicates and lemmatize
    keywords = list(set(___))
    
    return {
        'text': text,
        'entities': entities,
        'sentiment': {
            'label': sentiment_label,
            'compound_score': sentiment_scores['compound']
        },
        'keywords': keywords
    }

# Test your solution
test_text = "Apple Inc. released the iPhone 15! Customers love it in the USA. The product is amazing!"
# result = exercise12_multi_technique_analysis(test_text)
# print(f"Text: {result['text']}")
# print(f"\nEntities: {result['entities']}")
# print(f"Sentiment: {result['sentiment']}")
# print(f"Keywords: {result['keywords']}")

## ★★★ HARD: Exercise 13 - Transformer Text Classification with Custom Labels

**Objective:** Use transformers for text classification on custom domains.

**Note:** Requires `pip install transformers torch`

**Instructions:**
- Implement zero-shot classification on multiple texts
- Use domain-specific labels
- Build a classifier function that handles multiple texts

In [None]:
# ★★★ HARD: Exercise 13 - Transformer Text Classification with Custom Labels

def exercise13_transformer_custom_classification(texts, candidate_labels):
    """
    TODO: Build a transformer-based classifier with custom labels
    
    Steps:
    1. Import pipeline from transformers
    2. Create zero-shot-classification pipeline
    3. For each text, get predictions across all labels
    4. Find the top prediction
    5. Return comprehensive results with confidence scores
    """
    try:
        from transformers import pipeline
    except ImportError:
        print("Please install transformers: pip install transformers torch")
        return None
    
    # Step 1-2: Create classifier
    classifier = ___
    
    results = []
    for text in texts:
        # Step 3: Get predictions
        output = ___
        
        # Step 4-5: Format results
        result_dict = {
            'text': text,
            'predictions': {},
            'top_prediction': ___ ,
            'top_score': ___
        }
        
        # Store all label scores
        for label, score in zip(output['labels'], output['scores']):
            result_dict['predictions'][label] = score
        
        results.append(result_dict)
    
    return results

# Test your solution
test_texts = [
    "I just finished my yoga session and feel so energized!",
    "The stock market crashed due to economic concerns.",
    "Our team won the championship game yesterday!"
]
labels = ["fitness", "finance", "sports"]

# results = exercise13_transformer_custom_classification(test_texts, labels)
# if results:
#     for r in results:
#         print(f"\nText: {r['text']}")
#         print(f"Top prediction: {r['top_prediction']} ({r['top_score']:.3f})")
#         print(f"All scores: {r['predictions']}")

## ★★★ HARD: Exercise 14 - Build Mini NLP Toolkit Class

**Objective:** Create a reusable NLP toolkit class with all learned methods.

**Instructions:**
- Create a class with methods for:
  - Text preprocessing
  - Named entity extraction
  - Sentiment analysis
  - Text vectorization
  - Classification
- Make it easy to use with a consistent interface
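
To make the preprocessing step concrete, the sketch below shows one possible lowercase, clean, tokenize, and filter sequence using only the standard library. It is a simplified stand-in, not the toolkit solution: `simple_preprocess` and `DEMO_STOP_WORDS` are hypothetical names, the stop-word set is a tiny illustrative sample rather than NLTK's English list, and lemmatization is left out.

```python
import re

# Tiny illustrative stop-word set; the exercise itself would use NLTK's English list.
DEMO_STOP_WORDS = {"is", "the", "a", "an", "and", "of", "to", "in"}

def simple_preprocess(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, strip non-letters, split on whitespace, drop stop words."""
    text = text.lower()
    # Replace anything that is not a lowercase letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    return [t for t in tokens if t not in stop_words]

tokens = simple_preprocess("Hello, World! Machine learning is amazing!")
print(tokens)
```

Returning a plain list of tokens from every call is one way to keep the interface consistent across the toolkit's methods.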

In [None]:
# ★★★ HARD: Exercise 14 - Build Mini NLP Toolkit Class

class NLPToolkit:
    """
    TODO: Complete this NLP toolkit class
    
    Methods to implement:
    1. preprocess(text) - clean and tokenize text
    2. extract_entities(text) - get named entities
    3. analyze_sentiment(text) - get sentiment
    4. extract_keywords(text, top_n=5) - get important words
    5. vectorize(documents) - create TF-IDF matrix
    6. classify(documents, labels) - train and classify
    """
    
    def __init__(self):
        """Initialize toolkit with necessary components."""
        self.vectorizer = None
        self.classifier = None
    
    def preprocess(self, text):
        """TODO: Implement preprocessing"""
        # Lowercase, clean, tokenize, remove stop words, lemmatize
        text = ___
        text = ___
        tokens = ___
        stop_words = set(___)
        tokens = ___
        tokens = ___
        return tokens
    
    def extract_entities(self, text):
        """TODO: Extract named entities"""
        doc = ___
        return ___
    
    def analyze_sentiment(self, text):
        """TODO: Analyze sentiment"""
        scores = ___
        label = ___ if scores['compound'] > 0.1 else (___)
        return {'label': label, 'score': scores['compound']}
    
    def extract_keywords(self, text, top_n=5):
        """TODO: Extract top keywords"""
        # Parse the text with spaCy, then keep tokens by part of speech
        doc = ___
        keywords = [
            ___.text for token in doc
            if token.pos_ in ['NOUN', 'VERB', 'ADJ']
        ]
        # Deduplicate while preserving first-seen order (a set would give arbitrary order)
        return list(dict.fromkeys(keywords))[:top_n]
    
    def vectorize(self, documents):
        """TODO: Vectorize documents with TF-IDF"""
        self.vectorizer = ___
        return self.vectorizer.fit_transform(documents)
    
    def classify(self, documents, labels):
        """TODO: Train classifier"""
        self.classifier = Pipeline([
            ('tfidf', ___),
            ('clf', ___)
        ])
        self.classifier.fit(documents, labels)
        # Note: this is training accuracy, scored on the same documents used to fit
        return self.classifier.score(documents, labels)
    
    def predict(self, text):
        """Predict class of new text."""
        if self.classifier is None:
            raise ValueError("Classifier not trained. Call classify() first.")
        return self.classifier.predict([text])[0]

# Test your solution
# toolkit = NLPToolkit()
# 
# # Test preprocessing
# processed = toolkit.preprocess("Hello, World! Machine learning is amazing!")
# print(f"Processed: {processed}")
# 
# # Test entities
# entities = toolkit.extract_entities("Apple Inc. was founded by Steve Jobs in California.")
# print(f"Entities: {entities}")
# 
# # Test sentiment
# sentiment = toolkit.analyze_sentiment("I love this product!")
# print(f"Sentiment: {sentiment}")
# 
# # Test keywords
# keywords = toolkit.extract_keywords("Machine learning and artificial intelligence transform technology")
# print(f"Keywords: {keywords}")

## ★★★ HARD: Exercise 15 - Final Project: Analyze Text Collection

**Objective:** Apply all learned techniques to analyze a collection of texts.

**Instructions:**
- Analyze multiple texts for:
  - Common entities across collection
  - Overall sentiment distribution
  - Important keywords and topics
  - Text classification/clustering
- Generate a comprehensive report
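
The sentiment distribution step can be sketched independently of any sentiment library. The example below buckets compound scores into positive, negative, and neutral using the same ±0.1 thresholds as the exercise code. The helper names `label_from_compound` and `sentiment_distribution` are hypothetical, and `demo_scores` holds made-up values, not real VADER output.

```python
from collections import Counter

def label_from_compound(compound, threshold=0.1):
    """Map a compound score in [-1, 1] to a sentiment label."""
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    return "neutral"

def sentiment_distribution(compound_scores, threshold=0.1):
    """Count how many scores fall into each sentiment bucket."""
    counts = Counter(label_from_compound(c, threshold) for c in compound_scores)
    # Always report all three buckets, even when a count is zero.
    return {label: counts.get(label, 0) for label in ("positive", "negative", "neutral")}

demo_scores = [0.64, -0.42, 0.05, 0.81]  # made-up compound scores
distribution = sentiment_distribution(demo_scores)
print(distribution)
```

Scores exactly at the threshold fall into the neutral bucket, matching the strict comparisons used in the exercise.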

In [None]:
# ★★★ HARD: Exercise 15 - Final Project: Analyze Text Collection

def exercise15_text_collection_analysis(texts, labels=None):
    """
    TODO: Perform comprehensive analysis on a collection of texts
    
    Analysis components:
    1. Entity frequency - count all entities across texts
    2. Sentiment distribution - overall sentiment trend
    3. Keyword frequency - important words across collection
    4. Text classification (if labels provided)
    5. Generate summary report
    
    Return analysis results with insights
    """
    from collections import Counter
    
    analysis_results = {
        'total_texts': len(texts),
        'entity_frequency': Counter(),
        'sentiment_distribution': {'positive': 0, 'negative': 0, 'neutral': 0},
        'keywords': Counter(),
        'classification_accuracy': None
    }
    
    # 1. Extract entities and count frequencies
    for text in texts:
        doc = ___
        for ent in ___:
            analysis_results['entity_frequency'][___] += 1
    
    # 2. Analyze sentiment distribution
    for text in texts:
        scores = ___
        if scores['compound'] > 0.1:
            analysis_results['sentiment_distribution']['positive'] += 1
        elif scores['compound'] < -0.1:
            analysis_results['sentiment_distribution']['negative'] += 1
        else:
            analysis_results['sentiment_distribution']['neutral'] += 1
    
    # 3. Extract keywords
    stop_words = set(___)
    for text in texts:
        tokens = ___
        keywords = [
            lemmatizer.lemmatize(t) for t in tokens 
            if t.isalpha() and t not in stop_words and len(t) > 3
        ]
        for kw in keywords:
            analysis_results['keywords'][___] += 1
    
    # 4. Classification (if labels provided)
    if labels is not None:
        from sklearn.pipeline import Pipeline
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        
        pipeline = Pipeline([
            ('tfidf', ___),
            ('clf', ___)
        ])
        
        pipeline.fit(texts, labels)
        # Note: this is training accuracy, scored on the same texts used to fit
        accuracy = pipeline.score(texts, labels)
        analysis_results['classification_accuracy'] = accuracy
    
    # Get top entities and keywords
    analysis_results['top_entities'] = analysis_results['entity_frequency'].most_common(5)
    analysis_results['top_keywords'] = analysis_results['keywords'].most_common(10)
    
    return analysis_results

# Test your solution
test_texts = [
    "Apple Inc. released the iPhone 15 in California. Customers love it!",
    "Microsoft announced Windows 12 in Seattle. The product is amazing!",
    "Google launched Gemini AI in San Francisco. Innovation at its best!",
    "Amazon expanded AI services in Ohio. Growth continues!"
]
test_labels = [0, 0, 1, 1]  # Tech companies vs Services

# results = exercise15_text_collection_analysis(test_texts, test_labels)
# print(f"\nCollection Analysis Report")
# print(f"Total texts analyzed: {results['total_texts']}")
# print(f"\nTop Entities: {results['top_entities']}")
# print(f"Top Keywords: {results['top_keywords']}")
# print(f"Sentiment Distribution: {results['sentiment_distribution']}")
# if results['classification_accuracy'] is not None:
#     print(f"Classification Accuracy: {results['classification_accuracy']:.2f}")

---

## Summary of All 15 Exercises

### Easy (★) - Foundations Review
1. Tokenize and remove stop words
2. Stem vs lemmatize comparison
3. Match NLP concepts to tools
4. Transformers sentiment analysis pipeline
5. BERT vs GPT comparison

### Medium (★★) - Combined Techniques
6. Build preprocessing pipeline from scratch
7. Use spaCy for POS + NER extraction
8. Zero-shot classification pipeline
9. Compare traditional vs transformer sentiment
10. Create TF-IDF matrix and analyze

### Hard (★★★) - Advanced Projects
11. Complete NLP pipeline (preprocess → vectorize → classify)
12. Multi-technique analysis (entities + sentiment + keywords)
13. Transformer text classification with custom labels
14. Build mini NLP toolkit class
15. Final project: comprehensive text collection analysis

---

## Congratulations!

You've completed the comprehensive NLP course covering:
- Days 1-7: Traditional NLP techniques
- Day 8: Modern transformers and LLMs

### Next Steps:
1. Complete all 15 exercises
2. Explore real datasets on Kaggle
3. Build your own NLP project
4. Learn fine-tuning with Hugging Face
5. Contribute to open-source NLP projects

### Key Resources:
- [Hugging Face](https://huggingface.co/)
- [spaCy Documentation](https://spacy.io/)
- [NLTK Documentation](https://www.nltk.org/)
- [Papers with Code](https://paperswithcode.com/)

Good luck on your NLP journey!