# Week 6: NLP & Transformers

## Overview
Welcome to Week 6 of the AI Engineering curriculum. This week focuses on **Natural Language Processing (NLP)** and the **Transformer architecture**, the attention-based design behind BERT, GPT, and most other modern language models.

### Learning Objectives
By the end of this week, you will be able to:
- Understand text preprocessing and tokenization strategies
- Work with word and sentence embeddings
- Comprehend the Transformer architecture and attention mechanism
- Build NLP pipelines using modern pre-trained models
- Apply transformers to text classification and analysis tasks
- Evaluate NLP systems properly

### Real-World Outcome
Build a **Document Intelligence Engine** that can classify, extract insights, and analyze text documents automatically.

### Prerequisites
- Python fundamentals (Week 1)
- Machine Learning basics (Week 4)
- Deep Learning fundamentals (Week 5)

---

## Setup and Imports

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple, Optional
from pathlib import Path
import logging
import re
import string

# NLP libraries
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Deep Learning
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# Transformers
from transformers import (
    AutoTokenizer, AutoModel, AutoModelForSequenceClassification,
    Trainer, TrainingArguments, pipeline
)

# Scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer

# Download NLTK resources
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # used by word_tokenize in newer NLTK releases
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Set random seeds
np.random.seed(42)
torch.manual_seed(42)

# Check device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

---

## Part 1: Text Preprocessing

### 1.1 Understanding Text Preprocessing

Text preprocessing converts raw text into a clean, structured format suitable for machine learning.

**Common Steps:**
- **Lowercasing**: Normalize case
- **Tokenization**: Split text into words/tokens
- **Punctuation Removal**: Remove special characters
- **Stopword Removal**: Drop very common words that carry little meaning (e.g., *the*, *is*, *at*)
- **Stemming/Lemmatization**: Reduce words to a stem (stemming) or to their dictionary form (lemmatization)
- **Cleaning**: Remove URLs, emails, numbers, etc.
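
The steps above can be sketched with the standard library alone. This is a minimal illustration, not the solution to the exercise below: `simple_preprocess` and `TINY_STOPWORDS` are hypothetical names, the stopword list is deliberately tiny, tokenization is a plain whitespace split rather than NLTK's `word_tokenize`, and stemming/lemmatization is left out.

```python
import re
import string
from typing import List

# Hypothetical, deliberately tiny stopword list for illustration only;
# the exercise below uses NLTK's English stopword list instead.
TINY_STOPWORDS = {"the", "is", "at", "a", "an", "for", "me", "this", "out"}


def simple_preprocess(text: str) -> List[str]:
    """Lowercase, clean, split on whitespace, then drop punctuation and stopwords."""
    text = text.lower()
    text = re.sub(r"http\S+", " ", text)   # remove URLs
    text = re.sub(r"\S+@\S+", " ", text)   # remove email addresses
    text = re.sub(r"\d+", " ", text)       # remove numbers
    # Strip punctuation from the edges of each token; pure-punctuation tokens become empty
    tokens = [tok.strip(string.punctuation) for tok in text.split()]
    return [tok for tok in tokens if tok and tok not in TINY_STOPWORDS]


print(simple_preprocess("Check out https://example.com for more info! This is GREAT!!!"))
# ['check', 'more', 'info', 'great']
```

Note that the order of steps matters: URLs and emails are removed before punctuation is stripped, otherwise their fragments would survive as ordinary tokens.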

### TODO 1.1: Implement Text Preprocessor

In [None]:
class TextPreprocessor:
    """
    Comprehensive text preprocessing pipeline.
    """
    
    def __init__(self, 
                 lowercase: bool = True,
                 remove_punctuation: bool = True,
                 remove_stopwords: bool = True,
                 lemmatize: bool = True):
        """
        Initialize preprocessor with options.
        
        Args:
            lowercase: Convert text to lowercase
            remove_punctuation: Remove punctuation marks
            remove_stopwords: Remove common stopwords
            lemmatize: If True, lemmatize tokens; if False, stem them instead
        """
        self.lowercase = lowercase
        self.remove_punctuation = remove_punctuation
        self.remove_stopwords = remove_stopwords
        self.lemmatize = lemmatize
        
        # TODO: Initialize stopwords set
        self.stopwords = None  # Replace with: set(stopwords.words('english'))
        
        # TODO: Initialize lemmatizer or stemmer
        if lemmatize:
            self.lemmatizer = None  # Replace with: WordNetLemmatizer()
        else:
            self.stemmer = None  # Replace with: PorterStemmer()
    
    def clean_text(self, text: str) -> str:
        """
        Clean text by removing URLs, emails, numbers, and extra whitespace.
        
        Args:
            text: Input text
        
        Returns:
            Cleaned text
        """
        # TODO: Remove URLs (http://... or https://...)
        # Hint: Use re.sub(r'http\S+', '', text)
        
        # TODO: Remove email addresses
        # Hint: Use re.sub(r'\S+@\S+', '', text)
        
        # TODO: Remove numbers
        # Hint: Use re.sub(r'\d+', '', text)
        
        # TODO: Remove extra whitespace
        # Hint: Use re.sub(r'\s+', ' ', text).strip()
        
        pass
    
    def tokenize(self, text: str) -> List[str]:
        """
        Tokenize text into words.
        
        Returns:
            List of tokens
        """
        # TODO: Use NLTK's word_tokenize
        pass
    
    def preprocess(self, text: str) -> List[str]:
        """
        Complete preprocessing pipeline.
        
        Returns:
            List of processed tokens
        """
        # TODO: Clean text
        text = self.clean_text(text)
        
        # TODO: Convert to lowercase if enabled
        if self.lowercase:
            pass
        
        # TODO: Tokenize
        tokens = self.tokenize(text)
        
        # TODO: Remove punctuation if enabled
        if self.remove_punctuation:
            # Filter out tokens that are pure punctuation
            pass
        
        # TODO: Remove stopwords if enabled
        if self.remove_stopwords:
            pass
        
        # TODO: Lemmatize if enabled, otherwise stem
        if self.lemmatize:
            # tokens = [self.lemmatizer.lemmatize(token) for token in tokens]
            pass
        else:
            # tokens = [self.stemmer.stem(token) for token in tokens]
            pass
        
        return tokens
    
    def preprocess_documents(self, documents: List[str]) -> List[List[str]]:
        """
        Preprocess multiple documents.
        
        Returns:
            List of token lists
        """
        # TODO: Process each document
        pass

# Test the preprocessor
# TODO: Uncomment and test
# preprocessor = TextPreprocessor()
# test_text = "Check out https://example.com for more info! Email me at test@email.com. This is GREAT!!!"
# tokens = preprocessor.preprocess(test_text)
# print(f"Original: {test_text}")
# print(f"Processed: {tokens}")

---

## Part 2: Text Representation & Embeddings

### 2.1 Understanding Text Representations

**Traditional Methods:**
- **Bag of Words (BoW)**: Count word occurrences
- **TF-IDF**: Weight by term frequency and inverse document frequency

**Modern Methods:**
- **Word Embeddings**: Dense vectors capturing semantic meaning (Word2Vec, GloVe)
- **Contextual Embeddings**: Context-aware representations (BERT, GPT)
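
To make the TF-IDF idea concrete, here is a hand-rolled sketch using the plain textbook weighting `tf * log(N / df)`. The function name `tfidf` and the toy documents are made up for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Compute textbook TF-IDF weights for pre-tokenized documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # term frequency (relative count) times inverse document frequency
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus (illustrative data only)
docs = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "chased", "dog"]]
weights = tfidf(docs)

# "mat" occurs in one document, "sat" in two, so "mat" is weighted higher
print(weights[0]["mat"] > weights[0]["sat"])
```

A term that appears in every document gets an IDF of `log(1) = 0` under this formula, which is why such terms contribute nothing to the representation.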

### TODO 2.1: Implement TF-IDF Vectorizer

In [None]:
class TextVectorizer:
    """
    Text vectorization using TF-IDF.
    """
    
    def __init__(self, max_features: int = 5000):
        """
        Initialize vectorizer.
        
        Args:
            max_features: Maximum number of features to keep
        """
        # TODO: Initialize TfidfVectorizer
        # Set max_features, ngram_range=(1, 2) for unigrams and bigrams
        self.vectorizer = None
    
    def fit_transform(self, documents: List[str]) -> np.ndarray:
        """
        Fit vectorizer and transform documents.
        
        Returns:
            TF-IDF matrix (scikit-learn returns a SciPy sparse matrix;
            call .toarray() on it to get a dense np.ndarray)
        """
        # TODO: Fit and transform using self.vectorizer
        pass
    
    def transform(self, documents: List[str]) -> np.ndarray:
        """
        Transform documents using fitted vectorizer.
        """
        # TODO: Transform using self.vectorizer
        pass
    
    def get_feature_names(self) -> List[str]:
        """
        Get feature names (vocabulary).
        """
        # TODO: Return feature names
        # Hint: self.vectorizer.get_feature_names_out() (scikit-learn >= 1.0)
        pass
    
    def get_top_features(self, document_vector: np.ndarray, top_n: int = 10) -> List[Tuple[str, float]]:
        """
        Get top N features for a document.
        
        Returns:
            List of (feature, score) tuples
        """
        # TODO: Get indices of top N values
        # Get corresponding feature names and scores
        pass

# Test the vectorizer
# TODO: Test with sample documents

### 2.2 Word Embeddings

Word embeddings map words to dense vectors where semantic similarity is captured by vector proximity.
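
"Vector proximity" is usually measured with cosine similarity, the cosine of the angle between two vectors. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings have hundreds of dimensions and come from a trained model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors; ranges from -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (not taken from any real model)
king = [0.90, 0.80, 0.10]
queen = [0.85, 0.82, 0.15]
banana = [0.10, 0.05, 0.95]

print(cosine_similarity(king, queen) > cosine_similarity(king, banana))  # True
```

Because the score depends only on direction, not length, two vectors pointing the same way score 1 and two pointing in opposite directions score -1.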

### TODO 2.2: Work with Pre-trained Embeddings

In [None]:
class EmbeddingHandler:
    """
    Handle word embeddings using pre-trained models.
    """
    
    def __init__(self, model_name: str = 'bert-base-uncased'):
        """
        Initialize with pre-trained model.
        
        Args:
            model_name: Name of Hugging Face model
        """
        # TODO: Load tokenizer and model
        # self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # self.model = AutoModel.from_pretrained(model_name)
        self.tokenizer = None
        self.model = None
    
    def get_sentence_embedding(self, text: str) -> np.ndarray:
        """
        Get embedding for a sentence.
        
        Returns:
            Embedding vector
        """
        # TODO: Tokenize text
        # inputs = self.tokenizer(text, return_tensors='pt', padding=True, truncation=True)
        
        # TODO: Get model output
        # with torch.no_grad():
        #     outputs = self.model(**inputs)
        
        # TODO: Use [CLS] token embedding or mean pooling, then convert to NumPy
        # embedding = outputs.last_hidden_state[:, 0, :].squeeze(0).numpy()
        
        pass
    
    def compute_similarity(self, text1: str, text2: str) -> float:
        """
        Compute cosine similarity between two texts.
        
        Returns:
            Similarity score in [-1, 1] (1 = same direction)
        """
        # TODO: Get embeddings for both texts
        # TODO: Compute cosine similarity
        # cosine_sim = (emb1 @ emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
        pass

# Test embedding handler
# TODO: Test with sample texts

---

## Part 3: Transformer Architecture

### 3.1 Understanding Transformers

**Key Components:**
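- **Self-Attention**: Each token attends to every token in the sequence, including itself
- **Multi-Head Attention**: Several attention heads run in parallel, each with its own learned projections
- **Positional Encoding**: Injects sequence-order information, since attention itself is order-agnostic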
- **Feed-Forward Networks**: Process attended representations
- **Layer Normalization**: Stabilize training
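
As a rough illustration of the self-attention component, the sketch below computes scaled dot-product attention in plain Python. It is simplified on purpose: there are no learned projections, so queries, keys, and values are all the input itself, and the helper names (`softmax`, `self_attention`) and the toy input are assumptions made for this example.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: Matrix) -> Tuple[Matrix, Matrix]:
    """Scaled dot-product attention with Q = K = V = x (no learned weights)."""
    d_k = len(x[0])
    # scores[i][j] = (x_i . x_j) / sqrt(d_k)
    scores = [
        [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in x]
        for query in x
    ]
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted average of all value rows
    output = [
        [sum(w * value[j] for w, value in zip(w_row, x)) for j in range(d_k)]
        for w_row in weights
    ]
    return output, weights


# Three toy token vectors of dimension 2 (illustrative values)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])  # each row sums to 1
```

Every row of `weights` is a probability distribution over the sequence, which is what lets each output vector mix information from all tokens at once.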

**Why Transformers?**
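- Parallel processing: all positions are computed at once (unlike RNNs, which step through the sequence)
- Long-range dependencies: any two tokens are connected by a single attention step
- Transfer learning: pre-trained models can be fine-tuned for downstream tasks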
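
Because all positions are processed in parallel, word order has to be supplied explicitly. One common option is the fixed sinusoidal encoding from the original Transformer paper, sketched below in plain Python. The function name `sinusoidal_positions` is made up for this example, and many models (BERT among them) learn their position embeddings instead of using this formula.

```python
import math
from typing import List


def sinusoidal_positions(seq_len: int, d_model: int) -> List[List[float]]:
    """Build a (seq_len x d_model) table of sinusoidal positional encodings.

    Even dimensions use sin, odd dimensions use cos, and each sin/cos pair
    shares one frequency: pos / 10000^(2i / d_model).
    """
    table = []
    for pos in range(seq_len):
        row = []
        for dim in range(d_model):
            pair_index = dim // 2  # dimensions 2i and 2i+1 share a frequency
            angle = pos / (10000 ** (2 * pair_index / d_model))
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = sinusoidal_positions(seq_len=4, d_model=4)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0]
```

These vectors are added to the token embeddings before the first attention layer, so identical words at different positions end up with different inputs.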

### TODO 3.1: Implement Simple Attention Mechanism

In [None]:
class SimpleAttention(nn.Module):
    """
    Simple scaled dot-product attention.
    """
    
    def __init__(self, hidden_size: int):
        super().__init__()
        self.hidden_size = hidden_size
        
        # Learned linear projections for queries, keys, and values
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)
    
    def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        """
        Compute attention.
        
        Args:
            x: Input tensor of shape (batch, seq_len, hidden_size)
            mask: Optional attention mask
        
        Returns:
            Attended output
        """
        # TODO: Compute Q, K, V
        # Q = self.query(x)
        # K = self.key(x)
        # V = self.value(x)
        
        # TODO: Compute attention scores: Q @ K^T / sqrt(d_k)
        # scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.hidden_size ** 0.5)
        
        # TODO: Apply mask if provided
        # if mask is not None:
        #     scores = scores.masked_fill(mask == 0, -1e9)
        
        # TODO: Apply softmax
        # attention_weights = torch.softmax(scores, dim=-1)
        
        # TODO: Apply attention to values and return the result
        # output = torch.matmul(attention_weights, V)
        # return output
        
        pass

# Test attention
# TODO: Test with sample input

### 3.2 Using Pre-trained Transformers

Pre-trained transformer models like BERT, RoBERTa, and DistilBERT can be fine-tuned for specific tasks.

### TODO 3.2: Build Text Classifier with Transformers

In [None]:
class TransformerClassifier:
    """
    Text classification using pre-trained transformers.
    """
    
    def __init__(self, model_name: str = 'distilbert-base-uncased', num_labels: int = 2):
        """
        Initialize classifier.
        
        Args:
            model_name: Pre-trained model name
            num_labels: Number of classification labels
        """
        # TODO: Load tokenizer
        self.tokenizer = None  # AutoTokenizer.from_pretrained(model_name)
        
        # TODO: Load model for sequence classification
        self.model = None  # AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
    
    def prepare_dataset(self, texts: List[str], labels: List[int]) -> Dataset:
        """
        Prepare dataset for training.
        
        Returns:
            PyTorch Dataset
        """
        # TODO: Tokenize texts
        # encodings = self.tokenizer(texts, truncation=True, padding=True, max_length=512)
        
        # TODO: Create dataset
        # class TextDataset(Dataset):
        #     def __init__(self, encodings, labels):
        #         self.encodings = encodings
        #         self.labels = labels
        #     
        #     def __len__(self):
        #         return len(self.labels)
        #     
        #     def __getitem__(self, idx):
        #         item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        #         item['labels'] = torch.tensor(self.labels[idx])
        #         return item
        #
        # return TextDataset(encodings, labels)
        
        pass
    
    def train(self, train_dataset: Dataset, val_dataset: Dataset, 
              output_dir: str = './results', num_epochs: int = 3):
        """
        Train the model.
        """
        # TODO: Define training arguments
        # training_args = TrainingArguments(
        #     output_dir=output_dir,
        #     num_train_epochs=num_epochs,
        #     per_device_train_batch_size=16,
        #     per_device_eval_batch_size=16,
        #     warmup_steps=500,
        #     weight_decay=0.01,
        #     logging_dir='./logs',
        #     evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
        #     save_strategy="epoch",
        #     load_best_model_at_end=True
        # )
        
        # TODO: Create trainer
        # trainer = Trainer(
        #     model=self.model,
        #     args=training_args,
        #     train_dataset=train_dataset,
        #     eval_dataset=val_dataset
        # )
        
        # TODO: Train
        # trainer.train()
        
        pass
    
    def predict(self, texts: List[str]) -> List[Dict]:
        """
        Predict labels for texts.
        
        Returns:
            List of predictions with labels and scores
        """
        # TODO: Create pipeline for inference
        # classifier = pipeline('text-classification', model=self.model, tokenizer=self.tokenizer)
        
        # TODO: Predict and return the results
        # predictions = classifier(texts)
        # return predictions
        
        pass

# Test transformer classifier
# TODO: Test with sample data

---

## Part 4: NLP Evaluation

### 4.1 Evaluation Metrics for NLP

**Classification Metrics:**
- Accuracy, Precision, Recall, F1-score
- Per-class metrics
- Confusion matrix

**Text Generation Metrics:**
- BLEU, ROUGE, METEOR
- Perplexity
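
To see why accuracy alone can mislead, the sketch below computes precision, recall, and F1 by hand for a small, imbalanced binary example. The function name `binary_prf` and the labels are invented for illustration; in the exercise you would normally rely on scikit-learn's `classification_report`.

```python
from typing import List, Tuple


def binary_prf(y_true: List[int], y_pred: List[int], positive: int = 1) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels: 3 positives, 5 negatives
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = binary_prf(y_true, y_pred)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.625 precision=0.500 recall=0.333 f1=0.400
```

Here the classifier looks passable on accuracy yet finds only one of the three positive examples, which is exactly what recall and F1 expose.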

### TODO 4.1: Implement NLP Evaluator

In [None]:
class NLPEvaluator:
    """
    Evaluation utilities for NLP tasks.
    """
    
    @staticmethod
    def evaluate_classification(y_true: List[int], y_pred: List[int], 
                                class_names: Optional[List[str]] = None) -> Dict:
        """
        Evaluate classification performance.
        
        Returns:
            Dictionary of metrics
        """
        # TODO: Compute metrics
        metrics = {
            'accuracy': None,  # accuracy_score(y_true, y_pred)
            'report': None  # classification_report(y_true, y_pred, target_names=class_names)
        }
        
        # TODO: Print report
        # print(metrics['report'])
        
        return metrics
    
    @staticmethod
    def plot_confusion_matrix(y_true: List[int], y_pred: List[int], class_names: List[str]):
        """
        Plot confusion matrix.
        """
        # TODO: Compute confusion matrix
        # cm = confusion_matrix(y_true, y_pred)
        
        # TODO: Plot heatmap
        pass
    
    @staticmethod
    def analyze_errors(texts: List[str], y_true: List[int], 
                       y_pred: List[int], class_names: List[str], num_samples: int = 5):
        """
        Analyze and display misclassified samples.
        """
        # TODO: Find misclassified indices
        # misclassified = np.where(np.array(y_true) != np.array(y_pred))[0]
        
        # TODO: Sample and display
        pass

# Test evaluator
# TODO: Test with sample predictions

---

## Part 5: Document Intelligence Engine (Project)

### 5.1 Problem Statement

Build a system that can:
- Classify documents by category
- Extract key information
- Analyze sentiment
- Summarize content

### TODO 5.1: Build Document Intelligence System

In [None]:
class DocumentIntelligenceEngine:
    """
    Complete document intelligence system.
    """
    
    def __init__(self, model_name: str = 'distilbert-base-uncased'):
        """
        Initialize the engine.
        """
        # TODO: Initialize components
        self.preprocessor = TextPreprocessor()
        self.classifier = None  # TransformerClassifier(model_name)
        self.embedding_handler = None  # EmbeddingHandler(model_name)
        
        # TODO: Initialize additional pipelines
        # self.sentiment_analyzer = pipeline('sentiment-analysis')
        # self.summarizer = pipeline('summarization')
    
    def classify_document(self, text: str) -> Dict:
        """
        Classify document category.
        
        Returns:
            Dictionary with label and confidence
        """
        # TODO: Preprocess and classify
        pass
    
    def analyze_sentiment(self, text: str) -> Dict:
        """
        Analyze document sentiment.
        
        Returns:
            Sentiment label and score
        """
        # TODO: Use sentiment analyzer
        pass
    
    def extract_keywords(self, text: str, top_n: int = 10) -> List[Tuple[str, float]]:
        """
        Extract key terms from document.
        
        Returns:
            List of (keyword, importance) tuples
        """
        # TODO: Use TF-IDF or other method
        pass
    
    def summarize(self, text: str, max_length: int = 150) -> str:
        """
        Generate document summary.
        
        Returns:
            Summary text
        """
        # TODO: Use summarization pipeline
        pass
    
    def find_similar_documents(self, query: str, documents: List[str], top_k: int = 5) -> List[int]:
        """
        Find most similar documents to query.
        
        Returns:
            Indices of top-k similar documents
        """
        # TODO: Compute embeddings and similarity
        pass
    
    def analyze_document(self, text: str) -> Dict:
        """
        Complete document analysis.
        
        Returns:
            Dictionary with all analysis results
        """
        results = {
            'classification': self.classify_document(text),
            'sentiment': self.analyze_sentiment(text),
            'keywords': self.extract_keywords(text),
            'summary': self.summarize(text)
        }
        return results

# TODO: Build and test the complete system

---

## Part 6: Advanced NLP Techniques

### 6.1 Named Entity Recognition (NER)

### TODO 6.1: Implement NER System

In [None]:
class NERSystem:
    """
    Named Entity Recognition system.
    """
    
    def __init__(self):
        # TODO: Load NER pipeline
        # self.ner = pipeline('ner', aggregation_strategy='simple')
        self.ner = None
    
    def extract_entities(self, text: str) -> List[Dict]:
        """
        Extract named entities from text.
        
        Returns:
            List of entities with type and score
        """
        # TODO: Use NER pipeline
        pass
    
    def visualize_entities(self, text: str):
        """
        Visualize entities in text.
        """
        # TODO: Extract and highlight entities
        pass

# Test NER system
# TODO: Test with sample text

---

## Summary & Key Takeaways

### What You Learned This Week

1. **Text Preprocessing**
   - Tokenization, cleaning, normalization
   - Stopword removal and lemmatization
   - Building preprocessing pipelines

2. **Text Representations**
   - TF-IDF vectorization
   - Word embeddings
   - Contextual embeddings

3. **Transformer Architecture**
   - Attention mechanisms
   - Pre-trained models (BERT, DistilBERT)
   - Fine-tuning for downstream tasks

4. **NLP Applications**
   - Text classification
   - Sentiment analysis
   - Document summarization
   - Named entity recognition

5. **Document Intelligence System**
   - End-to-end NLP pipeline
   - Multiple analysis capabilities
   - Production-ready design

### Engineering Best Practices

- ✅ Preprocess text consistently
- ✅ Use pre-trained models when possible
- ✅ Fine-tune on domain-specific data
- ✅ Evaluate on held-out test sets
- ✅ Handle out-of-vocabulary words
- ✅ Consider inference latency and cost
- ✅ Monitor model performance over time

### Next Week Preview

**Week 7: Large Language Model Systems**
- LLM architecture and capabilities
- Prompt engineering patterns
- LLM APIs and usage
- Output evaluation
- Building AI writing assistants

---

## Additional Resources

- **Hugging Face Documentation**: https://huggingface.co/docs
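- **NLTK Book**: *Natural Language Processing with Python* (Bird, Klein & Loper)
- **Attention Is All You Need** (Vaswani et al., 2017): the original Transformer paper
- **BERT Paper** (Devlin et al., 2019): *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*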

---

**Remember**: Modern NLP is about leveraging pre-trained models and adapting them to your specific use case. Focus on data quality, evaluation, and production readiness.