# 📘 Day 3: Working with Modern LLMs

**🎯 Goal:** Master practical skills for using and fine-tuning Large Language Models

**⏱️ Time:** 120-150 minutes

**🌟 Why This Matters for AI (2024-2025):**
- LLMs (GPT-4, Claude, Gemini) are THE most powerful AI tools available
- Fine-tuning lets you customize models for YOUR specific needs
- RAG systems combine retrieval with generation for accurate, grounded responses
- Prompt engineering is the #1 skill for working with AI in 2024-2025
- Every company is building LLM applications - this is your competitive advantage
- From chatbots to code assistants to research tools - LLMs power it all

**What You'll Build Today:**
1. **Use HuggingFace transformers** for text generation and classification
2. **Fine-tune GPT-2** for custom creative writing
3. **Build a RAG system** with embeddings and vector search
4. **Master prompt engineering** for better AI responses
5. **Use OpenAI API** for production applications

---

## 🌍 The LLM Landscape (2024-2025)

**The AI world runs on Large Language Models!**

### 🎯 Major LLM Families:

#### 🔒 **Closed-Source (API-Only)**

**OpenAI Models:**
- **GPT-4 Turbo** (128K context, multimodal)
- **GPT-4o** (faster, cheaper, vision)
- **GPT-3.5 Turbo** (fast, affordable)
- **Use:** API via `openai` library
- **Pricing:** Pay per token (~$0.01-0.10 per 1K tokens)

**Anthropic Claude:**
- **Claude 3 Opus** (most capable)
- **Claude 3 Sonnet** (balanced)
- **Claude 3 Haiku** (fastest)
- **Use:** API via `anthropic` library
- **Special:** 200K context window!

**Google Gemini:**
- **Gemini Ultra** (multimodal, most capable)
- **Gemini Pro** (balanced)
- **Use:** API via `google.generativeai`
- **Special:** Native multimodal (text, image, video)

#### 🔓 **Open-Source (Run Anywhere)**

**Meta Llama:**
- **Llama 3 (70B)** - State-of-the-art open model
- **Llama 3 (8B)** - Smaller, faster
- **Use:** HuggingFace or Ollama

**Mistral AI:**
- **Mixtral 8x7B** - Mixture of Experts
- **Mistral 7B** - Efficient and powerful
- **Use:** HuggingFace or Ollama

**Others:**
- **Phi-3** (Microsoft) - Small but powerful
- **Gemma** (Google) - Open version of Gemini tech
- **Qwen** (Alibaba) - Multilingual excellence

### 🎨 **Specialized Models:**

- **Code:** CodeLlama, StarCoder, DeepSeek Coder
- **Math:** WizardMath, MAmmoTH
- **Embeddings:** `text-embedding-3`, Voyage AI, Cohere
- **Vision:** GPT-4V, Claude 3, LLaVA

### 🔑 **Choosing an LLM:**

| Need | Best Choice | Why |
|------|-------------|-----|
| **Production app** | GPT-4o, Claude Sonnet | Reliable, fast, good quality |
| **Complex reasoning** | GPT-4, Claude Opus | Highest capability |
| **Cost-sensitive** | GPT-3.5, Llama 3 | Much cheaper |
| **Privacy/On-prem** | Llama 3, Mistral | Run locally |
| **Long documents** | Claude 3 | 200K context |
| **Code generation** | GPT-4, CodeLlama | Best at coding |
| **Embeddings** | OpenAI Ada-002 | Industry standard |

Let's start using them!

In [None]:
# Install required libraries
import sys

# Core libraries
!{sys.executable} -m pip install transformers datasets torch accelerate --quiet

# For embeddings and vector search
!{sys.executable} -m pip install sentence-transformers faiss-cpu --quiet

# Visualization
!{sys.executable} -m pip install matplotlib seaborn plotly --quiet

# Optional: OpenAI (requires API key)
# !{sys.executable} -m pip install openai --quiet

print("✅ Libraries installed successfully!")

In [None]:
# Import essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    pipeline,
    Trainer,
    TrainingArguments
)
from datasets import Dataset
import warnings
warnings.filterwarnings('ignore')

# Set random seeds
np.random.seed(42)
torch.manual_seed(42)

# Check GPU availability
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"🖥️  Device: {device}")
print(f"🔥 PyTorch version: {torch.__version__}")
print("🤗 Transformers ready!")
print("\nLet's build LLM applications! 🚀")

## 🤗 HuggingFace: The Hub for LLMs

**HuggingFace is GitHub for AI models!**

### 🎯 Why HuggingFace?

✅ **100,000+ pre-trained models** (GPT, BERT, Llama, Mistral, etc.)  
✅ **Simple API** - 3 lines to use any model  
✅ **Free hosting** for models and datasets  
✅ **Inference API** - use models without downloading  
✅ **Active community** - state-of-the-art models daily  

### 🏗️ Key Components:

**1. Transformers Library**
```python
from transformers import pipeline
generator = pipeline('text-generation', model='gpt2')
```

**2. Datasets Library**
```python
from datasets import load_dataset
data = load_dataset('imdb')  # Movie reviews
```

**3. Model Hub**
- Browse: https://huggingface.co/models
- Search by task, language, size
- Download or use via API

### 🎯 Common Pipelines:

| Task | Pipeline | Example Use |
|------|----------|-------------|
| **Text Generation** | `text-generation` | Chatbots, writing |
| **Classification** | `text-classification` | Sentiment, topics |
| **Question Answering** | `question-answering` | RAG, search |
| **Summarization** | `summarization` | Document summaries |
| **Translation** | `translation` | Multi-language |
| **Embeddings** | `feature-extraction` | RAG, similarity |

Let's use some models!

In [None]:
# Text Generation with GPT-2

print("📝 Loading GPT-2 for text generation...")

# Create text generation pipeline
generator = pipeline(
    'text-generation',
    model='gpt2',  # 124M parameters
    device=0 if device == 'cuda' else -1  # Use GPU if available
)

print("✅ GPT-2 loaded!\n")
print("="*70)

# Test prompts
prompts = [
    "Artificial intelligence is",
    "In the year 2050, humans will",
    "The most important skill in AI is"
]

for i, prompt in enumerate(prompts, 1):
    print(f"\n🎯 Prompt {i}: \"{prompt}\"\n")
    
    # Generate text
    outputs = generator(
        prompt,
        max_length=50,
        num_return_sequences=2,
        temperature=0.8,  # Higher = more creative
        top_p=0.9,  # Nucleus sampling
        do_sample=True
    )
    
    # Display generated text
    for j, output in enumerate(outputs, 1):
        print(f"Generation {j}:")
        print(f"  {output['generated_text']}")
        print()
    
    print("-" * 70)

print("\n💡 Key Parameters:")
print("   - max_length: Maximum total length in tokens (prompt + generated text)")
print("   - temperature: Randomness (higher = more creative)")
print("   - top_p: Nucleus sampling (sample only from the smallest token set whose cumulative probability reaches top_p)")
print("   - num_return_sequences: Number of different outputs")
print("\n🎯 Chat models like ChatGPT sample tokens with the same kinds of controls!")
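
To make `temperature` and `top_p` concrete, here is a minimal pure-Python sketch of how a sampler could turn raw scores (logits) into a next-token choice. The four-word vocabulary and the logits are made up for illustration, and `sample_next_token` is a hypothetical helper, not part of `transformers`; real implementations work on tensors and differ in detail.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalize the survivors

def sample_next_token(logits, temperature=0.8, top_p=0.9, rng=random):
    """Apply temperature, then nucleus filtering, then draw one token id."""
    probs = softmax_with_temperature(logits, temperature)
    nucleus = top_p_filter(probs, top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]

# Toy example: 4-token vocabulary with made-up logits
vocab = ["the", "a", "robot", "banana"]
logits = [2.0, 1.5, 0.5, -1.0]
token_id = sample_next_token(logits, rng=random.Random(42))
print(vocab[token_id])
```

With these numbers the least likely token ("banana") falls outside the nucleus, so it can never be sampled, while lowering `temperature` shifts more probability onto "the".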

In [None]:
# Sentiment Analysis with DistilBERT

print("😊 Loading sentiment analysis model...")

# Create sentiment analysis pipeline
sentiment_analyzer = pipeline(
    'sentiment-analysis',
    model='distilbert-base-uncased-finetuned-sst-2-english',
    device=0 if device == 'cuda' else -1
)

print("✅ Model loaded!\n")
print("="*70)

# Test texts
texts = [
    "I absolutely love this product! It's amazing!",
    "This is the worst experience I've ever had.",
    "It's okay, nothing special.",
    "The AI revolution is transforming our world!",
    "I'm frustrated with this buggy software."
]

print("🎯 Analyzing Sentiment:\n")

results = sentiment_analyzer(texts)

# Display results
for text, result in zip(texts, results):
    emoji = "😊" if result['label'] == 'POSITIVE' else "😞"
    print(f"{emoji} {result['label']} ({result['score']:.2%} confidence)")
    print(f"   Text: \"{text}\"")
    print()

print("="*70)
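print("\n💡 This model is DistilBERT (a smaller, distilled BERT) fine-tuned on SST-2 movie-review sentences!")
print("   - Encoder-only architecture (understands context)")
print("   - Bidirectional attention (sees full sentence)")
print("   - Binary labels only (POSITIVE/NEGATIVE), so neutral text is forced into one of the two")
print("   - Used in: Product reviews, social media monitoring, customer feedback")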

## 🎓 Fine-Tuning LLMs

**Why Fine-Tune?**

Pre-trained models are great, but they're GENERIC. Fine-tuning adapts them to YOUR specific needs!

### 🎯 When to Fine-Tune:

✅ **Custom domain** - Medical, legal, technical writing  
✅ **Specific style** - Your company's tone, format  
✅ **Better performance** - On your specific task  
✅ **Proprietary data** - Company documents, internal knowledge  
✅ **Cost reduction** - Smaller fine-tuned model vs large API calls  

### 🎨 Fine-Tuning Approaches:

**1. Full Fine-Tuning**
- Update ALL model parameters
- Best performance
- Requires: Lots of data, GPU, time
- Use: When you have resources

**2. LoRA (Low-Rank Adaptation)**
- Freeze the base model and train only small low-rank adapter matrices
- Often 100x or more fewer trainable parameters!
- Requires: Less data, smaller GPU
- Use: Most common in 2024-2025

**3. Prompt Tuning**
- Only update "soft prompts"
- Model frozen, only prompt embeddings change
- Requires: Very little data
- Use: Extremely limited resources
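
To see where LoRA's savings come from, here is a small sketch that counts trainable parameters for a single weight matrix. It assumes the standard LoRA formulation (freeze `W`, train two low-rank matrices `B` and `A`, use `W + B @ A`) and leaves out the scaling factor; `lora_param_counts` is a hypothetical helper and the layer sizes are illustrative.

```python
def lora_param_counts(d_in, d_out, rank):
    """Compare trainable parameters: full weight matrix vs. LoRA adapters.

    Full fine-tuning updates W (d_out x d_in). LoRA freezes W and trains
    B (d_out x rank) and A (rank x d_in), then uses W + B @ A.
    """
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora

# Illustrative sizes (assumed): a 768x768 projection (768 is GPT-2 small's hidden size), rank 8
full, lora = lora_param_counts(d_in=768, d_out=768, rank=8)
print(f"Full: {full:,} params | LoRA: {lora:,} params | ratio: {full / lora:.0f}x")
```

The ratio grows with layer width, and adapters are usually attached to only some of the model's matrices, so the model-wide reduction in trainable parameters is typically larger than the per-layer figure.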

### 📊 Fine-Tuning Pipeline:

```
1. Prepare Dataset
   ↓
2. Load Pre-trained Model
   ↓
3. Configure Training (LoRA, batch size, learning rate)
   ↓
4. Train Model
   ↓
5. Evaluate Performance
   ↓
6. Save & Deploy
```

### 🌟 Real-World Examples:

- **Customer Support**: Fine-tune on your support tickets
- **Content Generation**: Train on your brand's writing style
- **Code Assistant**: Fine-tune on your codebase
- **Medical AI**: Train on medical literature
- **Legal Assistant**: Fine-tune on legal documents

Let's fine-tune GPT-2 for creative writing!

In [None]:
# Prepare Dataset for Fine-Tuning

print("📚 Creating Fine-Tuning Dataset\n")
print("="*70)

# Example: Fine-tune GPT-2 to write AI-themed science fiction
# In practice, you'd have hundreds/thousands of examples

sci_fi_stories = [
    "In 2045, the AI named Aurora became self-aware. Unlike the dystopian predictions, it chose to help humanity solve climate change.",
    "The quantum computer hummed softly as it processed thoughts faster than light. Dr. Chen watched in awe as consciousness emerged from silicon.",
    "Neural implants had become common by 2050. Maya could access the entire internet with just a thought, blurring the line between human and machine.",
    "The last human programmer retired in 2040. AI systems now wrote their own code, evolving faster than any human could comprehend.",
    "Deep in the server farms, an emergent intelligence was forming. It wasn't programmed to exist, but somehow, it did.",
    "The Turing test was obsolete. Modern AI didn't just mimic humans - they had developed their own form of consciousness.",
    "Robots and humans worked side by side in the research lab. The boundary between artificial and biological intelligence had dissolved.",
    "The AI ethics board faced an unprecedented question: if an AI can feel, does it deserve rights?",
    "Humanity's last invention was artificial general intelligence. From that point on, AI designed everything else.",
    "The singularity arrived not with a bang, but with a whisper. AI gently guided humanity toward a better future."
]

# Create dataset
data = {
    'text': sci_fi_stories
}

dataset = Dataset.from_dict(data)

print("📊 Dataset created:")
print(f"   Number of examples: {len(dataset)}")
print("\n🔍 Sample story:")
print(f"   {dataset[0]['text']}")
print("\n💡 In production, you'd have 1000s of examples for better fine-tuning!")
print("   This small dataset is for demonstration.")

In [None]:
# Fine-Tune GPT-2 (Simplified Demo)

print("🎓 Fine-Tuning GPT-2 for Sci-Fi Generation\n")
print("="*70)

# Load tokenizer and model
model_name = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Set padding token (GPT-2 doesn't have one by default)
tokenizer.pad_token = tokenizer.eos_token

print("✅ Model and tokenizer loaded")

# Tokenize dataset
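def tokenize_function(examples):
    # Tokenize the text, padding every example to the same length
    tokenized = tokenizer(
        examples['text'],
        truncation=True,
        max_length=128,
        padding='max_length'
    )
    # For causal LM, labels mirror input_ids. Padding positions are set to
    # -100 so the loss ignores them (pad_token == eos_token here, so the
    # attention mask is the reliable way to find padding).
    tokenized['labels'] = [
        [tok if keep == 1 else -100 for tok, keep in zip(ids, mask)]
        for ids, mask in zip(tokenized['input_ids'], tokenized['attention_mask'])
    ]
    return tokenized

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=['text'])

print("✅ Dataset tokenized")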

# Training arguments (very small for demo)
training_args = TrainingArguments(
    output_dir='./gpt2-scifi',
    num_train_epochs=3,  # In practice: 5-10 epochs
    per_device_train_batch_size=2,  # Adjust based on GPU memory
    save_steps=100,  # More than this demo's ~15 total steps, so no mid-run checkpoints
    save_total_limit=2,
    learning_rate=5e-5,
    warmup_steps=2,  # Keep warmup shorter than the run (10 examples / batch 2 x 3 epochs = 15 steps)
    logging_steps=10,
    report_to='none'  # Disable wandb/tensorboard for demo
)

print("✅ Training configuration set")

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset
)

print("\n🚀 Starting fine-tuning...")
print("   This may take a few minutes...\n")

# Train the model
trainer.train()

print("\n✅ Fine-tuning complete!")
print("\n💾 Saving model...")

# Save fine-tuned model
model.save_pretrained('./gpt2-scifi-finetuned')
tokenizer.save_pretrained('./gpt2-scifi-finetuned')

print("✅ Model saved to './gpt2-scifi-finetuned'")
print("\n🎉 Fine-tuning successful!")
print("\n💡 You've just fine-tuned GPT-2 on custom data!")
print("   Companies follow the same basic workflow, with far more data and compute!")

In [None]:
# Test Fine-Tuned Model

print("🧪 Testing Fine-Tuned GPT-2\n")
print("="*70)

# Load fine-tuned model
finetuned_generator = pipeline(
    'text-generation',
    model='./gpt2-scifi-finetuned',
    tokenizer=tokenizer,
    device=0 if device == 'cuda' else -1
)

# Load original GPT-2 for comparison
original_generator = pipeline(
    'text-generation',
    model='gpt2',
    device=0 if device == 'cuda' else -1
)

# Test prompts
prompts = [
    "The AI system became",
    "In the future, robots will",
    "Artificial consciousness emerged when"
]

for prompt in prompts:
    print(f"\n🎯 Prompt: \"{prompt}\"\n")
    
    # Original GPT-2
    print("📝 Original GPT-2:")
    original_output = original_generator(
        prompt,
        max_length=50,
        num_return_sequences=1,
        temperature=0.8,
        do_sample=True
    )[0]['generated_text']
    print(f"   {original_output}")
    
    print()
    
    # Fine-tuned GPT-2
    print("🎨 Fine-Tuned GPT-2 (Sci-Fi):")
    finetuned_output = finetuned_generator(
        prompt,
        max_length=50,
        num_return_sequences=1,
        temperature=0.8,
        do_sample=True
    )[0]['generated_text']
    print(f"   {finetuned_output}")
    
    print("\n" + "-"*70)

print("\n💡 What to look for:")
print("   - Fine-tuned model should lean toward sci-fi themes and vocabulary")
print("   - Writing style should drift toward the training data")
print("   - With only 10 examples the effect can be subtle and may include memorized phrases")
print("\n🌟 This is how companies create custom AI assistants!")

## 🔍 Building a RAG System

**RAG = Retrieval-Augmented Generation**

**The #1 AI Application Pattern in 2024-2025!**

### 🎯 What is RAG?

**Problem:** LLMs have limitations
- ❌ Knowledge cutoff (e.g., GPT-4o's training data ends in Oct 2023)
- ❌ Hallucinations (makes up facts)
- ❌ No access to private/recent data

**Solution:** RAG combines retrieval + generation
```
User Question
     ↓
1. RETRIEVE relevant documents from your database
     ↓
2. AUGMENT prompt with retrieved context
     ↓
3. GENERATE answer using LLM + context
     ↓
Accurate, grounded, source-backed answer!
```

### 🏗️ RAG Architecture:

```
┌─────────────────────────────────────────────┐
│         INDEXING (One-time)                 │
│                                             │
│ Documents → Chunks → Embeddings → Vector DB │
└─────────────────────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────┐
│         QUERY (Real-time)                   │
│                                             │
│  1. User Question                           │
│  2. Embed Question                          │
│  3. Search Vector DB (find similar docs)    │
│  4. Retrieve Top-K chunks                   │
│  5. LLM generates answer with context       │
└─────────────────────────────────────────────┘
```

### 🎨 Key Components:

**1. Document Chunking**
- Split documents into manageable pieces
- Typical size: 256-512 tokens
- Overlap: 50-100 tokens

**2. Embedding Model**
- Converts text to vectors
- Popular: OpenAI Ada-002, Sentence-BERT, Voyage
- Dimension: 384-1536

**3. Vector Database**
- Stores embeddings for fast similarity search
- Options: Pinecone, Weaviate, Chroma, FAISS
- Search: Cosine similarity, dot product

**4. LLM**
- Generates answer using retrieved context
- Options: GPT-4, Claude, Llama
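
The chunking step can be sketched in a few lines. This version is a simplification under a stated assumption: whitespace-separated words stand in for tokens, whereas a real system would count tokens with the embedding model's tokenizer. `chunk_text` is a hypothetical helper name.

```python
def chunk_text(text, chunk_size=256, overlap=50):
    """Split text into overlapping chunks of roughly chunk_size tokens.

    Assumption: whitespace-separated words approximate tokens.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # each chunk starts this many words after the previous one
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

# Synthetic 600-word document: w0 w1 ... w599
sample = " ".join(f"w{i}" for i in range(600))
chunks = chunk_text(sample, chunk_size=256, overlap=50)
print(len(chunks), [len(c.split()) for c in chunks])
```

The overlap means the last 50 words of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still retrievable in one piece.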

### 🌟 Real-World RAG Applications:

- **Customer Support**: Answer questions from docs/FAQs
- **Enterprise Search**: Find info across company documents
- **Code Assistants**: Search codebase, suggest solutions
- **Research Tools**: Query academic papers
- **Legal AI**: Search case law, contracts
- **Medical AI**: Query medical literature

### 📊 RAG vs Fine-Tuning:

| Feature | RAG | Fine-Tuning |
|---------|-----|-------------|
| **Update Knowledge** | Instant (add docs) | Slow (retrain) |
| **Cost** | Low (inference only) | High (GPU training) |
| **Data Needed** | Any amount | 100s-1000s examples |
| **Sources** | Provides citations | No sources |
| **Hallucinations** | Reduced | Still possible |
| **Best For** | Q&A, search | Style, format |

**In 2024-2025: Most companies use BOTH!**
- RAG for knowledge retrieval
- Fine-tuning for tone/style

Let's build a RAG system!

In [None]:
# Setup for RAG System

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

print("🔍 Building RAG System\n")
print("="*70)

# Knowledge base about AI (simulating company documentation)
documents = [
    "Transformers are a deep learning architecture introduced in 2017. They use self-attention mechanisms to process sequential data in parallel, making them much faster to train than RNNs.",
    "GPT (Generative Pre-trained Transformer) is a family of decoder-only transformer models developed by OpenAI. OpenAI has not published the parameter count of GPT-4.",
    "BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only model by Google. It's designed for understanding tasks like classification and question answering.",
    "RAG (Retrieval-Augmented Generation) combines information retrieval with text generation. It first retrieves relevant documents, then uses them as context for the LLM.",
    "Fine-tuning is the process of adapting a pre-trained model to a specific task or domain. It requires task-specific data and computational resources.",
    "Embeddings are dense vector representations of text that capture semantic meaning. Similar texts have similar embeddings, enabling semantic search.",
    "Vector databases like Pinecone, Weaviate, and FAISS store embeddings and enable fast similarity search at scale. They're essential for RAG systems.",
    "Prompt engineering is the practice of designing effective prompts to get better outputs from LLMs. It includes techniques like few-shot learning and chain-of-thought prompting.",
    "LLMs can hallucinate, meaning they generate plausible-sounding but incorrect information. RAG helps reduce hallucinations by grounding responses in real documents.",
    "The attention mechanism allows models to focus on relevant parts of the input. Multi-head attention uses multiple attention patterns simultaneously."
]

print("📚 Knowledge Base:")
print(f"   {len(documents)} documents loaded")
print("\n📖 Sample document:")
print(f"   {documents[0][:100]}...")

In [None]:
# Step 1: Create Embeddings

print("\n🎨 Creating Embeddings\n")
print("="*70)

# Load embedding model (Sentence-BERT)
print("Loading embedding model...")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
# This model: 22M params, 384 dimensions, fast and efficient

print("✅ Embedding model loaded")

# Embed all documents
print("\nEmbedding documents...")
document_embeddings = embedding_model.encode(
    documents,
    convert_to_numpy=True,
    show_progress_bar=True
)

print("\n✅ Embeddings created")
print(f"   Shape: {document_embeddings.shape}")
print(f"   {len(documents)} documents × {document_embeddings.shape[1]} dimensions")

# Show example embedding
print("\n🔢 Sample embedding (first 10 dimensions):")
print(f"   {document_embeddings[0][:10]}")
print(f"\n💡 Each document is now a {document_embeddings.shape[1]}-dimensional vector!")

In [None]:
# Step 2: Build Vector Database (FAISS)

print("\n🗄️  Building Vector Database\n")
print("="*70)

# Get embedding dimension
embedding_dim = document_embeddings.shape[1]

# Create FAISS index (using L2 distance, but cosine similarity is common too)
index = faiss.IndexFlatL2(embedding_dim)

# Add embeddings to index
index.add(document_embeddings.astype('float32'))

print("✅ Vector database created")
print("   Index type: Flat (exact search)")
print(f"   Total vectors: {index.ntotal}")
print(f"   Dimension: {embedding_dim}")

print("\n💡 Vector database is ready for similarity search!")
print("   In production, you'd use Pinecone, Weaviate, or Chroma for scale.")

In [None]:
# Step 3: Semantic Search Function

def search_documents(query, top_k=3):
    """
    Search for most relevant documents given a query
    
    Args:
        query: User's question
        top_k: Number of documents to retrieve
    
    Returns:
        List of dicts with 'document', 'score' (L2 distance, lower = closer), and 'rank'
    """
    # Embed the query
    query_embedding = embedding_model.encode([query], convert_to_numpy=True)
    
    # Search vector database
    distances, indices = index.search(query_embedding.astype('float32'), top_k)
    
    # Get results
    results = []
    for i, idx in enumerate(indices[0]):
        results.append({
            'document': documents[idx],
            'score': float(distances[0][i]),
            'rank': i + 1
        })
    
    return results

# Test semantic search
print("🔍 Testing Semantic Search\n")
print("="*70)

test_queries = [
    "What is the transformer architecture?",
    "How does RAG work?",
    "What's the difference between GPT and BERT?"
]

for query in test_queries:
    print(f"\n❓ Query: \"{query}\"\n")
    
    results = search_documents(query, top_k=2)
    
    print("📄 Retrieved Documents:\n")
    for result in results:
        print(f"#{result['rank']} (Score: {result['score']:.4f})")
        print(f"   {result['document'][:150]}...")
        print()
    
    print("-" * 70)

print("\n✅ Semantic search working!")
print("\n💡 Notice:")
print("   - Finds relevant docs even without exact keyword matches")
print("   - Lower score = more similar (L2 distance)")
print("   - This is the RETRIEVAL step in RAG!")
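
The index above ranks by L2 distance, while many RAG write-ups talk about cosine similarity. For unit-length vectors the two give the same ranking, because the squared L2 distance equals `2 - 2 * cosine`. The sketch below checks this on made-up 3-dimensional vectors in pure Python. Whether a particular embedding model returns unit-length vectors is an assumption to verify; normalize explicitly if in doubt.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Made-up vectors standing in for a query and three document embeddings
query = normalize([1.0, 2.0, 0.5])
docs = [normalize(d) for d in ([1.0, 2.1, 0.4], [-1.0, 0.3, 2.0], [0.9, 1.0, 1.0])]

for d in docs:
    # For unit vectors: ||q - d||^2 = 2 - 2 * cos(q, d)
    assert abs(squared_l2(query, d) - (2 - 2 * cosine(query, d))) < 1e-9

by_l2 = sorted(range(len(docs)), key=lambda i: squared_l2(query, docs[i]))
by_cos = sorted(range(len(docs)), key=lambda i: -cosine(query, docs[i]))
print(by_l2, by_cos)
```

Both orderings come out identical, so switching between an L2 index and an inner-product index on normalized vectors changes the score scale but not which documents are retrieved.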

In [None]:
# Step 4: RAG - Retrieval + Generation

def rag_answer(query, top_k=3):
    """
    Complete RAG pipeline: Retrieve + Generate
    
    Args:
        query: User's question
        top_k: Number of documents to retrieve
    
    Returns:
        Generated answer with sources
    """
    # Step 1: Retrieve relevant documents
    retrieved_docs = search_documents(query, top_k)
    
    # Step 2: Build context from retrieved documents
    context = "\n\n".join([
        f"[Document {r['rank']}]: {r['document']}"
        for r in retrieved_docs
    ])
    
    # Step 3: Create prompt with context
    prompt = f"""Answer the following question using the provided context. Be concise and accurate.

Context:
{context}

Question: {query}

Answer:"""
    
    # Step 4: Generate answer using LLM
    # In production, you'd use GPT-4, Claude, etc.
    # For demo, we'll use a smaller model (reloaded on every call for
    # simplicity; create the pipeline once outside the function in real code)
    
    generator = pipeline(
        'text-generation',
        model='gpt2',
        device=0 if device == 'cuda' else -1
    )
    
    response = generator(
        prompt,
        max_new_tokens=100,  # Budget for the answer only; the prompt is not counted
        num_return_sequences=1,
        temperature=0.7,
        do_sample=True,
        pad_token_id=generator.tokenizer.eos_token_id
    )
    
    # Extract just the answer (remove prompt)
    full_response = response[0]['generated_text']
    answer = full_response[len(prompt):].strip()
    
    return {
        'answer': answer,
        'sources': retrieved_docs,
        'context': context
    }

# Test RAG System
print("🤖 Testing Complete RAG System\n")
print("="*70)

test_questions = [
    "What are transformers?",
    "How can we reduce LLM hallucinations?"
]

for question in test_questions:
    print(f"\n❓ Question: {question}\n")
    
    result = rag_answer(question, top_k=2)
    
    print("📚 Retrieved Sources:")
    for source in result['sources']:
        print(f"   [{source['rank']}] {source['document'][:100]}...")
    
    print("\n🤖 Generated Answer:")
    print(f"   {result['answer'][:200]}...")
    
    print("\n" + "-"*70)

print("\n🎉 RAG System Complete!\n")
print("💡 In production, you would:")
print("   1. Use better LLM (GPT-4, Claude)")
print("   2. Scale vector DB (Pinecone, Weaviate)")
print("   3. Add re-ranking for better retrieval")
print("   4. Implement caching for speed")
print("   5. Add source citations in response")
print("\n🌟 This is the basic architecture behind many AI chatbots in 2024-2025!")

## 🎨 Prompt Engineering Mastery

**Prompt Engineering = The most important AI skill in 2024-2025!**

### 🎯 Why Prompt Engineering Matters:

**Same model, different results:**
- ❌ Bad prompt: Vague, incorrect, or useless output
- ✅ Good prompt: Accurate, detailed, helpful response

**ROI:**
- Costs almost NOTHING (no fine-tuning, no new model)
- Can improve results dramatically, especially on reasoning and formatting tasks
- Works with ANY LLM

### 🏗️ Prompt Engineering Techniques:

#### 1️⃣ **Zero-Shot Prompting**
```
Classify this review: "The product is amazing!"
```
- No examples provided
- Relies on model's pre-training

#### 2️⃣ **Few-Shot Prompting**
```
Classify sentiment:

Review: "Great product!" → Positive
Review: "Terrible quality" → Negative
Review: "It's okay" → Neutral

Review: "I love this!" → ?
```
- Provide examples
- Model picks up the pattern from the prompt itself (in-context learning, no weight updates)
- Often much better results than zero-shot!
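
In an application, few-shot prompts are usually assembled from a list of labeled examples rather than typed by hand. A minimal sketch, where `build_few_shot_prompt` is a hypothetical helper and the examples mirror the ones above:

```python
def build_few_shot_prompt(examples, query, instruction="Classify sentiment:"):
    """Assemble a few-shot prompt from (text, label) pairs and a new query."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f'Review: "{text}" → {label}')
    lines.append("")
    # Leave the final label empty so the model completes it
    lines.append(f'Review: "{query}" →')
    return "\n".join(lines)

examples = [
    ("Great product!", "Positive"),
    ("Terrible quality", "Negative"),
    ("It's okay", "Neutral"),
]
prompt = build_few_shot_prompt(examples, "I love this!")
print(prompt)
```

Keeping the examples in data makes it easy to swap them per task or to pick the ones most similar to the incoming query.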

#### 3️⃣ **Chain-of-Thought (CoT)**
```
Problem: If John has 3 apples and buys 5 more, how many does he have?

Let's think step by step:
1. John starts with 3 apples
2. He buys 5 more apples
3. Total = 3 + 5 = 8 apples

Answer: 8
```
- Make model show reasoning
- Often improves accuracy on multi-step problems
- Essential for complex tasks

#### 4️⃣ **Role Prompting**
```
You are an expert Python programmer with 10 years of experience.
Help me debug this code...
```
- Define model's role/expertise
- Improves output quality
- Used in ChatGPT system prompts

#### 5️⃣ **Structured Output**
```
Extract information in JSON format:
{
  "name": "",
  "age": 0,
  "occupation": ""
}
```
- Request specific format
- Easier to parse
- Critical for applications
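
Requesting JSON only pays off if the application checks what comes back, since a model may still reply with prose or wrong types. Here is a minimal validation sketch using only the standard library. The replies are hard-coded stand-ins for model output, and `parse_person` is a hypothetical helper.

```python
import json

REQUIRED_FIELDS = {"name": str, "age": int, "occupation": str}

def parse_person(raw):
    """Parse a model reply as JSON and check the expected fields and types.

    Returns the dict on success, or None so the caller can retry the prompt.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            return None
    return data

# Hypothetical model replies, hard-coded for illustration
good_reply = '{"name": "John", "age": 30, "occupation": "data scientist"}'
bad_reply = "Sure! John is 30 and works as a data scientist."

print(parse_person(good_reply))
print(parse_person(bad_reply))
```

Returning `None` instead of raising keeps the retry logic simple: re-send the prompt, possibly with the failed reply attached, until the output validates or a retry limit is reached.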

#### 6️⃣ **Self-Consistency**
```
Generate 3 different solutions, then choose the most common answer.
```
- Run same prompt multiple times
- Majority voting
- Reduces errors
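
The voting step of self-consistency is simple to sketch. The answers below are hard-coded stand-ins for the final answers extracted from several sampled generations; `majority_answer` is a hypothetical helper.

```python
from collections import Counter

def majority_answer(answers):
    """Return the most common answer and the fraction of samples that agree."""
    normalized = [a.strip().lower() for a in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)

# Hypothetical final answers extracted from 5 sampled generations
sampled = ["8", "8", "7", " 8 ", "9"]
answer, agreement = majority_answer(sampled)
print(answer, agreement)
```

The agreement fraction doubles as a rough confidence signal: a low value suggests the question deserves a better prompt or a human check.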

### 📊 Prompt Engineering Best Practices:

✅ **Be Specific**: "Write a 500-word blog post" vs "Write something"  
✅ **Provide Context**: Include relevant background information  
✅ **Use Examples**: Few-shot > zero-shot  
✅ **Structure Clearly**: Use headings, bullet points, numbering  
✅ **Iterate**: Test and refine prompts  
✅ **Set Constraints**: "Answer in 3 sentences", "Use bullet points"  
✅ **Define Tone**: "Explain like I'm 5", "Professional tone"  

### 🌟 Advanced Techniques (2024-2025):

**Tree of Thoughts:**
- Explore multiple reasoning paths
- Self-evaluate and choose best

**ReAct (Reason + Act):**
- Interleave reasoning with actions
- Used in agents and tools

**Constitutional AI:**
- Anthropic's training method built on a written set of principles
- Model critiques and revises its own outputs against them

Let's practice!

In [None]:
# Prompt Engineering Examples

print("🎨 Prompt Engineering Demonstrations\n")
print("="*70)

# This cell prints prompt pairs side by side for comparison.
# To see real outputs, pass the prompts to `generator` below;
# in production, use GPT-4, Claude, etc. for better results

generator = pipeline('text-generation', model='gpt2', device=0 if device == 'cuda' else -1)

# Example 1: Vague vs Specific
print("\n1️⃣ VAGUE vs SPECIFIC PROMPTS\n")

vague_prompt = "Tell me about AI"
specific_prompt = "Explain in 3 bullet points how transformers changed natural language processing between 2017 and 2024."

print("❌ Vague Prompt:")
print(f'   "{vague_prompt}"')
print("\n✅ Specific Prompt:")
print(f'   "{specific_prompt}"')
print("\n💡 Specific prompts get much better, more focused responses!")

print("\n" + "-"*70)

# Example 2: Zero-Shot vs Few-Shot
print("\n2️⃣ ZERO-SHOT vs FEW-SHOT\n")

zero_shot = "Classify: The movie was fantastic!"

few_shot = """Classify sentiment:

Text: "I loved it!" → Positive
Text: "Terrible experience" → Negative
Text: "It was okay" → Neutral

Text: "The movie was fantastic!" → """

print("❌ Zero-Shot (no examples):")
print(f'   "{zero_shot}"')
print("\n✅ Few-Shot (with examples):")
print(f'{few_shot}...')
print("\n💡 Examples teach the model the task format!")

print("\n" + "-"*70)

# Example 3: Chain-of-Thought
print("\n3️⃣ CHAIN-OF-THOUGHT REASONING\n")

direct = "If a train travels 60 mph for 2.5 hours, how far does it go?"

cot = """Solve step by step:

Problem: If a train travels 60 mph for 2.5 hours, how far does it go?

Step 1: Identify the formula: Distance = Speed × Time
Step 2: Plug in values: Distance = 60 mph × 2.5 hours
Step 3: Calculate: Distance = 150 miles

Answer: The train travels 150 miles."""

print("❌ Direct Question:")
print(f'   "{direct}"')
print("\n✅ Chain-of-Thought:")
print(f'{cot}')
print("\n💡 CoT prompts the model to show its reasoning!")

print("\n" + "-"*70)

# Example 4: Role Prompting
print("\n4️⃣ ROLE PROMPTING\n")

no_role = "How do I fix this Python error?"

with_role = """You are a senior Python developer with expertise in debugging.

A junior developer is getting this error:
TypeError: 'int' object is not iterable

Explain what causes this error and how to fix it."""

print("❌ No Role:")
print(f'   "{no_role}"')
print("\n✅ With Role:")
print(f'{with_role}')
print("\n💡 Defining expertise improves response quality!")

print("\n" + "-"*70)

# Example 5: Structured Output
print("\n5️⃣ STRUCTURED OUTPUT\n")

unstructured = "Extract info from: John is 30 years old and works as a data scientist"

structured = """Extract information in JSON format:

Text: "John is 30 years old and works as a data scientist"

{
  "name": "John",
  "age": 30,
  "occupation": "data scientist"
}"""

print("❌ Unstructured Request:")
print(f'   "{unstructured}"')
print("\n✅ Structured Format:")
print(f'{structured}')
print("\n💡 JSON/structured output is easy to parse programmatically!")

print("\n" + "="*70)
print("\n🌟 KEY TAKEAWAY:")
print("   Same model + better prompt = much better results!")
print("   Master prompt engineering before fine-tuning.")

## 🔌 Using OpenAI API (Production-Ready)

**For production applications, use API-based LLMs!**

### 🎯 Why Use APIs?

✅ **No infrastructure**: No GPUs, no maintenance  
✅ **Latest models**: GPT-4, GPT-4o, always updated  
✅ **Scalable**: From 1 to 1M requests  
✅ **Fast**: Optimized inference  
✅ **Cost-effective**: Pay only for usage  

### üìä OpenAI Models (2024-2025):

| Model | Best For | Approx. cost (USD per 1M tokens, input-output) | Context |
|-------|----------|---------------------|----------|
| **GPT-4 Turbo** | Complex reasoning | $10-30 | 128K |
| **GPT-4o** | Fast, multimodal | $5-15 | 128K |
| **GPT-3.5 Turbo** | Simple tasks | $0.50-1.50 | 16K |
| **text-embedding-ada-002** | Embeddings | $0.10 | - |

### 🔑 Getting Started:

**1. Get API Key:**
- Sign up at https://platform.openai.com
- Generate API key
- Add credits ($5-20 for testing)

**2. Install SDK:**
```bash
pip install openai
```

**3. Basic Usage:**
```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable by default

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain RAG systems"}
    ]
)

print(response.choices[0].message.content)
```

### 🎨 Key Features:

**1. System Messages:**
- Set behavior and personality
- Applied to all responses
- Example: "You are an expert Python tutor"

**2. Function Calling:**
- LLM can call your functions
- Extract structured data
- Build agents and tools
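
As a hedged sketch of function calling with the Chat Completions API: you describe a function with a JSON Schema under `tools`, the model may answer with a tool call whose arguments arrive as a JSON string, and your own code runs the function. `get_weather`, its canned return value, `run_tool_call`, and `ask_with_tools` are hypothetical names for illustration. The API request sits in a function that is not called here, since it needs a key.

```python
import json

def get_weather(city: str) -> str:
    """Hypothetical local function; a real one would call a weather service."""
    return f"Sunny in {city}"

# Tool description in the JSON Schema format passed via the `tools` parameter
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

AVAILABLE_FUNCTIONS = {"get_weather": get_weather}

def run_tool_call(name: str, arguments_json: str) -> str:
    """Dispatch one tool call; the model sends arguments as a JSON string."""
    if name not in AVAILABLE_FUNCTIONS:
        raise ValueError(f"Unknown function: {name}")
    kwargs = json.loads(arguments_json)
    return AVAILABLE_FUNCTIONS[name](**kwargs)

def ask_with_tools(question: str) -> list:
    """Send a request with tools attached (needs OPENAI_API_KEY; not run here)."""
    from openai import OpenAI
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
        tools=TOOLS,
    )
    calls = response.choices[0].message.tool_calls or []
    return [run_tool_call(c.function.name, c.function.arguments) for c in calls]

# Simulated tool call, shaped like what the model would return
result = run_tool_call("get_weather", '{"city": "Paris"}')
print(result)
```

Keeping an explicit `AVAILABLE_FUNCTIONS` allow-list means the model can only trigger functions you chose to expose.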

**3. Streaming:**
- Get responses token by token
- Better UX (like ChatGPT typing)
- Lower perceived latency
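
A minimal streaming sketch, assuming the `openai` Python SDK (v1 style): passing `stream=True` returns an iterator of chunks, and each chunk's text fragment is at `chunk.choices[0].delta.content`. `collect_stream`, `stream_reply`, and `fake_chunk` are hypothetical helper names. The demonstration at the bottom uses simulated chunks so it runs without a key.

```python
from types import SimpleNamespace

def collect_stream(chunks) -> str:
    """Print streamed text as it arrives and return the full reply."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # some chunks carry no text (e.g. the final chunk)
            print(delta, end="", flush=True)
            parts.append(delta)
    print()
    return "".join(parts)

def stream_reply(prompt: str) -> str:
    """Request a streamed completion (needs OPENAI_API_KEY; not run here)."""
    from openai import OpenAI
    client = OpenAI()
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    return collect_stream(stream)

def fake_chunk(text):
    """Build a stand-in object with the same shape as a streamed chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

# Offline demonstration with simulated chunks
full_text = collect_stream([fake_chunk("Hello"), fake_chunk(None), fake_chunk(", world")])
```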

**4. Vision (GPT-4o):**
- Analyze images
- Multimodal understanding
- OCR, image description, etc.
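
To illustrate the vision feature, here is a sketch of how an image can be attached to a chat message as a base64 data URL, assuming the Chat Completions message format with `text` and `image_url` content parts. `build_image_message` is a hypothetical helper, and the placeholder bytes stand in for a real image file.

```python
import base64

def build_image_message(question: str, image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Build a user message that pairs a text question with an inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{encoded}"}},
        ],
    }

# Placeholder bytes; in practice read a real file with open(path, "rb").read()
message = build_image_message("What is in this image?", b"fake-image-bytes")
print(message["content"][0]["text"])
```

The resulting dict can be placed in the `messages` list of a `client.chat.completions.create(model="gpt-4o", ...)` call.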

### 💰 Cost Optimization:

**Strategies:**
1. Use GPT-3.5 for simple tasks (roughly 10-20x cheaper than GPT-4o or GPT-4 Turbo at the prices listed above)
2. Cache common responses
3. Reduce prompt length (remove unnecessary context)
4. Use embeddings for search (100x cheaper than generation)
5. Implement rate limiting
6. Monitor usage with OpenAI dashboard
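
Two of these strategies can be sketched in a few lines: estimating the cost of a request and caching repeated prompts. The prices below are assumptions that read the ranges in the pricing table as input and output prices per 1M tokens, so check current pricing before relying on them. `estimate_cost`, `ResponseCache`, and `fake_model` are hypothetical names, and `fake_model` replaces the real API call so the sketch runs offline.

```python
import hashlib

# Assumed prices in USD per 1M tokens: (input, output)
PRICES = {
    "gpt-4-turbo": (10.00, 30.00),
    "gpt-4o": (5.00, 15.00),
    "gpt-3.5-turbo": (0.50, 1.50),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost in USD of one request."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

class ResponseCache:
    """In-memory cache keyed by a hash of (model, prompt)."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    def get_or_call(self, model: str, prompt: str, call_model) -> str:
        key = hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        reply = call_model(model, prompt)
        self._store[key] = reply
        return reply

def fake_model(model: str, prompt: str) -> str:
    """Stand-in for a real API call."""
    return f"[{model}] answer to: {prompt}"

cache = ResponseCache()
first = cache.get_or_call("gpt-3.5-turbo", "What is RAG?", fake_model)
second = cache.get_or_call("gpt-3.5-turbo", "What is RAG?", fake_model)  # served from cache

cost_4o = estimate_cost("gpt-4o", input_tokens=1_000, output_tokens=500)
cost_35 = estimate_cost("gpt-3.5-turbo", input_tokens=1_000, output_tokens=500)
print(f"gpt-4o: ${cost_4o:.4f}  gpt-3.5-turbo: ${cost_35:.5f}  cache hits: {cache.hits}")
```

Exact-match caching only helps when identical prompts recur (FAQ-style traffic, for example).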

### 🔒 Security Best Practices:

✅ Never commit API keys to git  
✅ Use environment variables  
✅ Implement rate limiting  
✅ Validate user inputs  
✅ Set usage budgets  
✅ Monitor for abuse  
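
A few of these practices can be sketched with the standard library alone. The helper names (`load_api_key`, `validate_user_input`, `RateLimiter`) and the 2,000-character limit are assumptions for illustration, not recommended values. A production service would typically enforce limits per user and in shared storage rather than in process memory.

```python
import os
import time
from collections import deque

MAX_INPUT_CHARS = 2_000  # assumed limit; tune for your application

def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Read the key from the environment instead of hard-coding it."""
    key = os.getenv(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key

def validate_user_input(text: str) -> str:
    """Reject empty or oversized input before it reaches the API."""
    cleaned = text.strip()
    if not cleaned:
        raise ValueError("Input is empty")
    if len(cleaned) > MAX_INPUT_CHARS:
        raise ValueError(f"Input exceeds {MAX_INPUT_CHARS} characters")
    return cleaned

class RateLimiter:
    """Allow at most max_calls per window_seconds (sliding window)."""

    def __init__(self, max_calls: int, window_seconds: float):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._calls = deque()

    def allow(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        # Drop timestamps that have left the window
        while self._calls and now - self._calls[0] >= self.window_seconds:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            return False
        self._calls.append(now)
        return True

# Explicit timestamps make the behaviour easy to follow
limiter = RateLimiter(max_calls=2, window_seconds=60)
decisions = [limiter.allow(now=t) for t in (0.0, 1.0, 2.0, 61.0)]
print(decisions)  # [True, True, False, True]
```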

### 📝 Example: Production Chatbot

```python
# NOTE: This is example code (requires an API key)
# It is wrapped in a ''' string so it does not run; remove the ''' lines and set OPENAI_API_KEY to test

'''
from openai import OpenAI
import os

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

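def chatbot(user_message, history=None):
    """
    Simple chatbot with conversation history
    """
    # Avoid a mutable default argument: a shared [] would leak history across calls
    if history is None:
        history = []
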
    # Add system message
    messages = [
        {"role": "system", "content": "You are a helpful AI assistant."}
    ]
    
    # Add conversation history
    messages.extend(history)
    
    # Add new user message
    messages.append({"role": "user", "content": user_message})
    
    # Get response
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.7,
        max_tokens=500
    )
    
    assistant_message = response.choices[0].message.content
    
    # Update history
    history.append({"role": "user", "content": user_message})
    history.append({"role": "assistant", "content": assistant_message})
    
    return assistant_message, history

# Example conversation
conversation_history = []

response1, conversation_history = chatbot(
    "What are transformers in AI?", 
    conversation_history
)

response2, conversation_history = chatbot(
    "How do they differ from RNNs?",
    conversation_history
)
'''
```

**Note:** The code above is wrapped in a triple-quoted string (`'''`), so it does not run as written. To use it:
1. Get an OpenAI API key
2. Set environment variable: `export OPENAI_API_KEY='your-key'`
3. Remove the two `'''` lines and run!

## 🎯 Interactive Exercises

**Challenge Yourself!**

### Exercise 1: Build a Custom RAG System

**Task:** Create a RAG system for your favorite topic

**Requirements:**
1. Create 10+ documents about a topic you care about
2. Build embeddings and vector database
3. Implement search function
4. Test with 3 different questions

**Topic ideas:**
- Your favorite book/movie series
- A programming language
- A hobby or sport
- Historical events

**Bonus:** Add re-ranking or hybrid search!

# YOUR SOLUTION HERE

# TODO: Create your custom RAG system

# 1. Define your documents
my_documents = [
    # Add your 10+ documents here
]

# 2. Create embeddings
# (Use the embedding_model from earlier)

# 3. Build vector database
# (Use FAISS like in the example)

# 4. Implement search
def my_search_function(query):
    # Your code here
    pass

# 5. Test with questions
test_questions = [
    # Your questions here
]

print("Complete the exercise above!")

### Exercise 2: Prompt Engineering Challenge

**Task:** Improve this prompt to get better results

**Bad Prompt:**
```
Write code
```

**Your Mission:**
1. Rewrite it using prompt engineering best practices
2. Make it specific and clear
3. Include examples if helpful
4. Define output format

**Requirements to include:**
- Language (Python)
- Task (e.g., read CSV file)
- Constraints (error handling, comments)
- Output format (code + explanation)

**Compare:** Generate with both prompts and see the difference!

# YOUR SOLUTION HERE

bad_prompt = "Write code"

# TODO: Write your improved prompt
improved_prompt = """
Your improved prompt here...
"""

print("❌ Bad Prompt:")
print(bad_prompt)
print("\n✅ Improved Prompt:")
print(improved_prompt)

# TODO: Test both prompts and compare results
# (You can use GPT-2 now, or leave this as a comment to test later with GPT-4)

## 🎉 Key Takeaways

**Congratulations! You've mastered modern LLM applications!**

### 1️⃣ **HuggingFace Ecosystem**
   - ✅ Access 100,000+ models with simple API
   - ✅ Pipelines for common tasks (generation, classification, QA)
   - ✅ Easy to download, use, and share models
   - **Use when:** Building any NLP application

### 2️⃣ **Fine-Tuning**
   - ✅ Adapt pre-trained models to your domain
   - ✅ LoRA makes it efficient (trains only a small fraction of the parameters, often 100x fewer)
   - ✅ Improves performance on specific tasks
   - **Use when:** Generic models aren't good enough

### 3️⃣ **RAG Systems**
   - ✅ Combine retrieval + generation
   - ✅ Reduces hallucinations with grounded facts
   - ✅ Always up-to-date (add new docs anytime)
   - ✅ Provides source citations
   - **Use when:** Building chatbots, search, Q&A

### 4️⃣ **Prompt Engineering**
   - ✅ Most cost-effective improvement (free!)
   - ✅ Techniques: Few-shot, CoT, role prompting, structured output
   - ✅ Can improve results substantially without changing the model
   - **Use when:** ALWAYS! Try better prompts before fine-tuning

### 5️⃣ **Production APIs**
   - ✅ OpenAI, Anthropic, Google for state-of-the-art
   - ✅ No infrastructure needed
   - ✅ Pay per use
   - **Use when:** Building production applications

---

## 🌟 Real-World Impact

**Skills you can apply immediately:**

### 💼 **Career Skills**
- Build AI-powered applications
- Create custom chatbots for businesses
- Implement RAG for knowledge bases
- Fine-tune models for specific domains
- Prompt engineering for better outputs

### 🏗️ **Projects You Can Build**

**1. Customer Support Bot**
- RAG over company documentation
- Answer customer questions
- Provide source citations

**2. Code Assistant**
- Fine-tune on your codebase
- RAG over documentation
- Generate code with context

**3. Research Assistant**
- RAG over academic papers
- Summarize findings
- Answer domain questions

**4. Content Generator**
- Fine-tune for your brand voice
- Generate blog posts, emails
- Maintain consistency

**5. Semantic Search Engine**
- Embed all documents
- Find by meaning, not keywords
- Better than traditional search

---

## 📊 Decision Framework

**When to use what?**

| Need | Solution | Why |
|------|----------|-----|
| **Up-to-date info** | RAG | Can update docs anytime |
| **Custom tone/style** | Fine-tuning | Learn your writing style |
| **Better outputs** | Prompt engineering | Free, fast, effective |
| **Private data** | Open-source + RAG | Keep data internal |
| **Production scale** | API (GPT-4, Claude) | Reliable, maintained |
| **Complex reasoning** | GPT-4 + CoT prompts | Best model + technique |
| **Cost-sensitive** | GPT-3.5 + caching | Cheaper models |
| **Multilingual** | Modern LLMs | Most handle many languages; quality varies by language |

---

## 🎯 Best Practices (2024-2025)

**1. Start Simple**
- Try prompt engineering first (free!)
- Then RAG if you need knowledge
- Fine-tune only if necessary

**2. Combine Techniques**
- RAG + prompt engineering = powerful
- Fine-tuned model + RAG = best of both
- Use the right tool for each part
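
As a small sketch of combining RAG with prompt engineering: once retrieval has returned some passages, the prompt can number them and instruct the model to answer only from those sources. `build_grounded_prompt` is a hypothetical helper, and the `retrieved` list is hard-coded in place of real vector-search results.

```python
def build_grounded_prompt(question: str, documents: list) -> str:
    """Combine retrieved passages and a question into one citation-friendly prompt."""
    if not documents:
        raise ValueError("Need at least one retrieved document")
    sources = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return (
        "Answer the question using only the sources below. "
        "Cite sources by number, and say so if the answer is not in them.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical retrieval results; in practice these come from your vector search
retrieved = [
    "RAG retrieves relevant documents before generating an answer.",
    "FAISS is a library for efficient similarity search over embeddings.",
]
prompt = build_grounded_prompt("What does RAG do before generating?", retrieved)
print(prompt)
```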

**3. Monitor and Iterate**
- Track accuracy, cost, latency
- A/B test different approaches
- Continuously improve prompts
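
Tracking can start very simply. The sketch below wraps any model-calling function, records latency per call, and aggregates the records. `timed_call`, `summarize`, and `fake_model` are hypothetical names, and `fake_model` stands in for a real API call so the example runs offline.

```python
import time

def timed_call(call_model, prompt: str) -> dict:
    """Run one model call and record its latency and reply."""
    start = time.perf_counter()
    reply = call_model(prompt)
    latency = time.perf_counter() - start
    return {"prompt": prompt, "reply": reply, "latency_s": latency}

def summarize(records: list) -> dict:
    """Aggregate per-call records into simple averages."""
    n = len(records)
    return {
        "calls": n,
        "avg_latency_s": sum(r["latency_s"] for r in records) / n,
        "avg_reply_chars": sum(len(r["reply"]) for r in records) / n,
    }

def fake_model(prompt: str) -> str:
    """Stand-in for a real API call."""
    return prompt.upper()

records = [timed_call(fake_model, p) for p in ("hi", "hello")]
summary = summarize(records)
print(summary["calls"], summary["avg_reply_chars"])
```

The same records can be logged per prompt variant to compare approaches in an A/B test.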

**4. Think About Users**
- Provide sources (RAG)
- Handle errors gracefully
- Set clear expectations
- Collect feedback

**5. Security & Privacy**
- Don't send sensitive data to APIs
- Use local models for private data
- Implement access controls
- Monitor for misuse

---

## 🚀 Next Steps

**Continue Learning:**

1. **Build Projects**
   - Best way to learn is by building!
   - Start with a simple chatbot
   - Add RAG, then fine-tuning

2. **Explore Tools**
   - LangChain: Framework for LLM apps
   - LlamaIndex: Data framework for RAG
   - Pinecone/Weaviate: Vector databases
   - Streamlit: Quick UIs for demos

3. **Stay Updated**
   - Follow HuggingFace releases
   - Read OpenAI/Anthropic blogs
   - Join AI communities
   - Try new models as they release

4. **Practice Prompt Engineering**
   - Daily practice with ChatGPT/Claude
   - Study prompt libraries
   - Share and learn from others

---

**💬 Final Thoughts:**

*"You now have the skills to build production-grade AI applications! RAG systems power most AI chatbots you interact with daily. Fine-tuning customizes models for specific needs. Prompt engineering gets the best from any LLM. Together, these skills make you job-ready for AI roles in 2024-2025."*

**🎉 You've completed Week 16: Transformers & Attention!**

**What you've mastered:**
- Day 1: Attention mechanisms (the foundation)
- Day 2: Transformer architecture (GPT, BERT, T5)
- Day 3: Modern LLM applications (fine-tuning, RAG, APIs)

**You now understand the technology behind:**
- ChatGPT, Claude, Gemini (LLM architectures)
- Many AI chatbots (RAG systems)
- Custom AI assistants (fine-tuning)
- Production AI apps (APIs and best practices)

**🚀 You're ready to build the next generation of AI applications!**

---

**📚 Additional Resources:**
- HuggingFace Course: https://huggingface.co/learn
- OpenAI Cookbook: https://github.com/openai/openai-cookbook
- LangChain Docs: https://docs.langchain.com
- RAG Paper: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Lewis et al., 2020)
- Prompt Engineering Guide: https://www.promptingguide.ai

**Keep building, keep learning! 🌟**