<a href="https://colab.research.google.com/github/oviya-raja/ist-402-assignments/blob/main/assignments/W3/exercises/W3__Prompt_Engineering_Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Prompt Engineering Basics

**IST402 - AI Agents & RAG Systems**

---

<div style="background-color:#e8f5e9;padding:12px;border-left:4px solid #4CAF50;border-radius:5px;margin-bottom:20px;">
<strong>🧹 Quick Actions:</strong> Run the <strong>Clear All Outputs</strong> cell (the first code cell below) for instructions on clearing all outputs if needed!
</div>

---

## 📋 What You Need

- **HuggingFace Token**: Get from [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
- **Google Colab** (recommended) or local Python environment

---

## 🚀 Quick Start

### Step 1: Open in Colab
[Click here to open in Colab](https://colab.research.google.com/github/oviya-raja/ist-402-assignments/blob/main/assignments/W3/exercises/W3__Prompt_Engineering_Basics.ipynb)

**Or manually:**
1. Go to [Google Colab](https://colab.research.google.com/)
2. **File** → **Open notebook** → **GitHub** tab
3. Enter: `oviya-raja/ist-402-assignments`
4. Navigate to: `assignments/W3/exercises/W3__Prompt_Engineering_Basics.ipynb`

### Step 2: Enable GPU (Recommended)
1. **Runtime** → **Change runtime type** → Select **GPU** → **Save**
2. **Runtime** → **Restart runtime**

### Step 3: Set Up Token
**In Colab:**
1. Run **Cell 3** (token setup cell)
2. Store the token as a Colab secret named `HUGGINGFACE_HUB_TOKEN` (read with `userdata.get('HUGGINGFACE_HUB_TOKEN')`), or set it as an environment variable

**Locally:**
1. Create `.env` file: `HUGGINGFACE_HUB_TOKEN=your_token_here`
2. Run the token setup cell
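
As a rough illustration of the local lookup described above, here is a minimal standard-library sketch that checks the environment variable first and then falls back to a `.env` file. The helper names `parse_env_text` and `read_token` are hypothetical and not part of the notebook; the `python-dotenv` package installed later handles `.env` files more robustly.

```python
import os
from pathlib import Path

TOKEN_NAME = "HUGGINGFACE_HUB_TOKEN"

def parse_env_text(text, name=TOKEN_NAME):
    """Return the value assigned to `name` in .env-style text, or None."""
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        if key.strip() == name:
            # Drop surrounding whitespace and optional quotes
            return raw.strip().strip("'\"") or None
    return None

def read_token(env_path=".env", name=TOKEN_NAME):
    """Hypothetical helper: environment variable first, then the .env file."""
    value = os.environ.get(name)
    if value:
        return value
    path = Path(env_path)
    if path.is_file():
        return parse_env_text(path.read_text(), name)
    return None
```

Calling `read_token()` returns `None` when the token is set in neither place, which makes a missing token easy to detect before any model download starts.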

---

## ▶️ Getting Started

1. Run **Cell 1**: Check environment
2. Run **Cell 2**: Install packages
3. Run **Cell 3**: Set up token
4. Continue with remaining cells in order

---

## 📖 What You'll Learn

- **Prompt Engineering**: Creating effective system prompts and user messages
- **Pipeline vs Direct Model**: Two ways to interact with AI models
- **Device Optimization**: Automatic CPU/GPU configuration
- **Class Exercises**: Build business-specific AI assistants

---

**Ready? Start with Cell 1! 🎉**


In [None]:
# 🧹 Clear All Outputs
# Run this cell to see how to clean up the notebook outputs

# Check environment
try:
    import google.colab
    IN_COLAB = True
    print("📱 Running in Google Colab")
    print("\n💡 To clear all outputs in Colab:")
    print("   Go to: Edit → Clear all outputs")
except ImportError:
    IN_COLAB = False
    print("💻 Running in Jupyter Notebook")
    try:
        from IPython.display import clear_output
        # Clear current cell output
        clear_output(wait=True)
        print("✅ Output cleared!")
        print("\n💡 To clear ALL outputs:")
        print("   Go to: Kernel → Restart & Clear Output")
    except ImportError:
        print("💡 To clear outputs:")
        print("   Go to: Kernel → Restart & Clear Output")


In [None]:
# Google Colab Setup Verification
# Run this cell FIRST to check if everything is set up correctly

import sys
print("🔍 Checking Google Colab environment...")
print(f"   Python version: {sys.version.split()[0]}")

# Check if running in Colab
try:
    import google.colab
    IN_COLAB = True
    print("   ✅ Running in Google Colab")
except ImportError:
    IN_COLAB = False
    print("   ⚠️  Not running in Google Colab (local environment)")

# Check GPU availability
try:
    import torch
    if torch.cuda.is_available():
        print(f"   ✅ GPU Available: {torch.cuda.get_device_name(0)}")
        print(f"   ✅ CUDA Version: {torch.version.cuda}")
    else:
        print("   ⚠️  GPU NOT detected")
        if IN_COLAB:
            print("   💡 TIP: Go to Runtime → Change runtime type → Select GPU → Save")
            print("   💡 Then: Runtime → Restart runtime")
except ImportError:
    print("   ⚠️  PyTorch not installed yet (will be installed in next cell)")

print("\n📋 Next Steps:")
print("   1. If GPU not detected in Colab: Enable GPU runtime and restart")
print("   2. Run Cell 2: Install packages")
print("   3. Run Cell 3: Set up Hugging Face token")
print("   4. Continue with remaining cells")


In [None]:
# Install required packages - run this cell after the environment check above

# Core packages (always needed)
%pip install transformers torch sentence-transformers datasets python-dotenv

# FAISS: the CPU build is installed unconditionally; it also works on GPU runtimes
%pip install faiss-cpu

In [None]:
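# Load the Hugging Face token in Colab (Secrets) or locally (.env file)
import os
try:
    from google.colab import userdata
    hf_token = userdata.get('HUGGINGFACE_HUB_TOKEN')
except ImportError:
    from dotenv import load_dotenv   # provided by python-dotenv, installed above
    load_dotenv()
    hf_token = os.getenv('HUGGINGFACE_HUB_TOKEN')

if not hf_token:
    raise ValueError("HUGGINGFACE_HUB_TOKEN is not set")
print("✅ Hugging Face token loaded successfully!")
print(f"   Token preview: {hf_token[:10]}...{hf_token[-4:] if len(hf_token) > 14 else '****'}")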


In [None]:
# Import libraries we need
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer
import torch
import json
import numpy as np
import faiss
import time

print("All libraries imported successfully!")

---

## 📦 METHOD 1: Pipeline Approach (The Easy Way)

**Think of it like: Using a vending machine**
- You put in your request (message)
- The machine does everything automatically
- You get your result (response)

**✅ Pros:** Simple, fast to code, less error-prone  
**❌ Cons:** Less control, can't customize much  
**🎯 Best for:** Learning, quick tests, simple projects

**💡 Key Point:** Pipeline = Easy but less control (like using a library function)

---

### 📝 Prompts We're Using:
- **System Prompt:** "You are Tom and I am Jerry" (sets the AI's role)
- **User Prompt:** "Who are you?" (the question we're asking)

**The pipeline cell later in this notebook shows how simple it is! 👇**
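
The two prompts listed above map onto a list of role-tagged messages, which is the input format the pipeline cell uses. A minimal sketch of building that list follows; the helper `build_messages` is a hypothetical name introduced here for illustration, not something the notebook defines.

```python
def build_messages(system_prompt, user_prompt):
    """Return a chat-style message list; the system entry is skipped when empty."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages

# Same prompts as listed above
with_system = build_messages("You are Tom and I am Jerry", "Who are you?")
# Leaving out the system prompt lets you compare how the role changes the answer
without_system = build_messages(None, "Who are you?")

for message in with_system:
    print(f"{message['role']}: {message['content']}")
```

Keeping the system prompt as a separate argument makes it easy to swap roles during the class exercises while holding the user question fixed.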


---

## 🔧 METHOD 2: Direct Model Approach (The Detailed Way)

**Think of it like: Cooking from scratch**
- You prepare ingredients (tokenize text)
- You cook step by step (run model)
- You plate the food (decode response)

**✅ Pros:** Full control, can customize everything  
**❌ Cons:** More code, more things that can go wrong  
**🎯 Best for:** Advanced projects, research, custom needs

**💡 Key Point:** Direct = More work but full control (like writing your own function)

---

### 📝 Prompt We're Using:
- **User Prompt:** "What's the weather like in Paris?" (no system prompt in this example)

**The direct-model cells later in this notebook walk through each step manually! 👇**
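
To make the three steps above concrete before touching a real model, here is a toy sketch of tokenize, generate, and decode. Everything in it is an assumption made for illustration: the six-word vocabulary is invented, and the "model" simply appends the next vocabulary ID until it reaches `<eos>`. It is not how Mistral's tokenizer or model work internally; it only mirrors the shape of the workflow.

```python
# Toy vocabulary standing in for a real tokenizer's vocabulary (illustrative only)
VOCAB = ["<eos>", "hello", "world", "paris", "is", "sunny"]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}
EOS_ID = TOKEN_TO_ID["<eos>"]

def tokenize(text):
    """Step 1: text -> token IDs."""
    return [TOKEN_TO_ID[word] for word in text.lower().split()]

def generate(input_ids, max_new_tokens=5):
    """Step 2: a fake 'model' that appends the next vocabulary ID until <eos>."""
    ids = list(input_ids)
    for _ in range(max_new_tokens):
        next_id = (ids[-1] + 1) % len(VOCAB)
        ids.append(next_id)
        if next_id == EOS_ID:
            break
    return ids

def decode(ids, skip_special_tokens=True):
    """Step 3: token IDs -> text, optionally dropping the <eos> marker."""
    tokens = [VOCAB[i] for i in ids if not (skip_special_tokens and i == EOS_ID)]
    return " ".join(tokens)

input_ids = tokenize("Paris is")
output_ids = generate(input_ids)
print(input_ids)           # [3, 4]
print(output_ids)          # [3, 4, 5, 0]
print(decode(output_ids))  # paris is sunny
```

Note that the output IDs begin with the input IDs, just as a real causal model's output sequence begins with the prompt tokens.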


In [None]:
# Automatically detect and configure device (CPU or GPU)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

---

## ✅ Summary: What You Learned

Congratulations! You've now seen both approaches to using AI models:

### 📦 Pipeline Approach (Method 1)
- **What it does:** Everything automatically in one function call
- **When to use:** Quick prototyping, learning, simple projects
- **Key takeaway:** Simple but less control

### 🔧 Direct Model Approach (Method 2)
- **What it does:** Manual control over each step (tokenize → generate → decode)
- **When to use:** Advanced projects, custom needs, research
- **Key takeaway:** More work but full control

### 💡 Remember:
- Both approaches use the **same model** - just different ways to interact with it
- Pipeline = Easy but less control (like using a library function)
- Direct = More work but full control (like writing your own function)

**Now you're ready to try the class exercises below!** 🎉


In [None]:
# Specify which Mistral model to use from Hugging Face
model_id = "mistralai/Mistral-7B-Instruct-v0.3"

# ⚠️ PERFORMANCE INFO:
# Mistral-7B is a LARGE model (7 billion parameters, ~14GB of weights in 16-bit precision)
# Settings are automatically optimized based on the device (CPU/GPU) detected above

print("\n⏳ Loading Mistral-7B model...")
if device == "cpu":
    print("   ⏱️  Expected load time: 5-15 minutes")
    print("   ⏱️  Expected generation: 30-60 seconds per response")

    device_info = "CPU"
    torch_dtype = torch.float32     # safest for CPUs

    max_new_tokens = 256            # shorter responses keep CPU generation time manageable

else:  # GPU
    print("   ⏱️  Expected load time: 1-2 minutes")
    print("   ⏱️  Expected generation: 2-5 seconds per response")

    device_info = torch.cuda.get_device_name(0)
    torch_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

    max_new_tokens = 512

print(f"   Device: {device} ({device_info})")
print(f"   Torch dtype: {torch_dtype}")
print("   📦 Model size: ~14GB (will download on first run)")

# Create a conversation with system prompt and user message
# System prompt defines the AI's role/personality
# User message is what the person is asking
messages = [
    {"role": "system", "content": "You are Tom and I am Jerry"},
    {"role": "user", "content": "Who are you?"},
]


# Set up the text generation pipeline with device-optimized parameters
# Settings automatically adapt based on the device (CPU/GPU) detected earlier
chatbot = pipeline(
    "text-generation",                              # Task type: generating text
    model=model_id,                                 # Which model to use
    token=hf_token,                                 # Authentication token for Hugging Face
    dtype=torch_dtype,                              # Automatically set: bfloat16 (GPU) or float32 (CPU)
    device_map="auto",                              # Automatically use GPU if available
    max_new_tokens=max_new_tokens,                  # Automatically set: 512 (GPU) or 256 (CPU)
    do_sample=True,                                 # Use random sampling for more creative responses
    top_k=10,                                       # Consider top 10 most likely next words
    num_return_sequences=1,                         # Generate only 1 response
    eos_token_id=2,                                 # Token ID that signals end of response
)


print("\n✅ Model loaded! Generating response...")
if device == "cpu":
    print("   ⏱️  This may take 30-60 seconds on CPU...")
else:
    print("   ⏱️  This should take 2-5 seconds on GPU...")

# Generate response using the pipeline and print the result
# (time was already imported in the imports cell)
start_time = time.time()
result = chatbot(messages)
generation_time = time.time() - start_time

print(f"\n✅ Response generated in {generation_time:.2f} seconds")
print("\n" + "="*60)
print(result)
print("="*60)
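
---

The `~14GB` figure quoted in the cell above follows from simple arithmetic: parameter count times bytes per parameter. The sketch below works that out, assuming a round 7 billion parameters (the real count differs slightly) and counting only the weights, not activations or other runtime overhead. The function name `weight_memory_gb` is hypothetical.

```python
# Bytes needed to store one parameter in each precision
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

def weight_memory_gb(num_params, dtype):
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

NUM_PARAMS = 7_000_000_000  # assumption: a round 7B, not the exact parameter count

for dtype in ("bfloat16", "float16", "float32"):
    print(f"{dtype}: ~{weight_memory_gb(NUM_PARAMS, dtype):.0f} GB")
```

Under these assumptions the 16-bit formats come to about 14 GB and `float32` to about 28 GB, which is one reason the CPU path, which uses `float32`, needs considerably more memory than the GPU path.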

In [None]:
# Generate the response and store the full result
result = chatbot(messages)

# Extract just the assistant's response from the complex output structure
# result[0] gets the first (and only) generated sequence
# ["generated_text"] gets the conversation history with the new response
# [-1] gets the last message in the conversation (the assistant's reply)
# ["content"] gets just the text content without the role information
assistant_reply = result[0]["generated_text"][-1]["content"]

# Print only the clean assistant response (without all the extra structure)
print(assistant_reply)
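
---

The indexing chain in the cell above is easier to follow against a concrete structure. The sketch below uses a hand-written mock in the shape described by that cell's comments; the assistant text is made up for illustration and is not real model output. The helper `last_assistant_reply` is a hypothetical name.

```python
# Mock of the nested structure described above (the reply text is invented)
mock_result = [
    {
        "generated_text": [
            {"role": "system", "content": "You are Tom and I am Jerry"},
            {"role": "user", "content": "Who are you?"},
            {"role": "assistant", "content": "I am Tom."},
        ]
    }
]

def last_assistant_reply(result):
    """Return the content of the final message, checking that it is the assistant's."""
    conversation = result[0]["generated_text"]  # first (and only) generated sequence
    last_message = conversation[-1]             # newest message in the conversation
    if last_message["role"] != "assistant":
        raise ValueError("The last message is not an assistant reply")
    return last_message["content"]

print(last_assistant_reply(mock_result))
```

The role check is a small safeguard: if the structure ever differs from what is expected, the helper fails loudly instead of returning the user's own message.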

In [None]:
# Load the tokenizer (converts text to numbers that the model understands)
tokenizer = AutoTokenizer.from_pretrained(model_id, token=hf_token)

# Load the actual model with device-optimized settings
# torch_dtype was set earlier: bfloat16/float16 (GPU) or float32 (CPU)
# Note: this loads a second copy of the weights alongside the pipeline's copy
model = AutoModelForCausalLM.from_pretrained(
    model_id,                    # Which model to load
    token=hf_token,             # Authentication token
    dtype=torch_dtype,          # Precision chosen earlier for this device
    device_map="auto"           # Automatically use GPU if available
)

# Create a simple conversation (just user input, no system prompt this time)
conversation = [{"role": "user", "content": "What's the weather like in Paris?"}]

# Convert the conversation into the format the model expects
# This applies the model's chat template and converts to tensors
inputs = tokenizer.apply_chat_template(
    conversation,                # The conversation to format
    add_generation_prompt=True,  # Add prompt to signal the model should respond
    return_dict=True,           # Return as dictionary
    return_tensors="pt",        # Return as PyTorch tensors
).to(model.device)             # Move to same device as model (GPU/CPU)

# Generate the response using the model directly
outputs = model.generate(
    **inputs,                           # Pass all the formatted inputs
    max_new_tokens=1000,               # Maximum length of response
    pad_token_id=tokenizer.eos_token_id # Token to use for padding
)

In [None]:
# Print the raw model output tensor (this shows token IDs/numbers, not readable text yet)
print(outputs)

In [None]:
# Convert the token IDs back to readable text and print the result
# outputs[0] gets the first generated sequence, skip_special_tokens removes formatting tokens
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
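
---

The decoded text above contains the prompt as well as the reply, because the generated sequence starts with the input tokens. A common way to show only the reply is to slice off the prompt before decoding. The sketch below shows the slicing on plain Python lists with made-up token IDs; `strip_prompt` is a hypothetical helper name.

```python
def strip_prompt(sequence, prompt_length):
    """Return only the tokens generated after the prompt."""
    return sequence[prompt_length:]

prompt_ids = [11, 22, 33]                 # made-up IDs standing in for the prompt
full_output = prompt_ids + [44, 55, 2]    # generated sequence = prompt + new tokens
new_ids = strip_prompt(full_output, len(prompt_ids))
print(new_ids)  # [44, 55, 2]
```

With the notebook's variables, the analogous slice would be `outputs[0][inputs["input_ids"].shape[-1]:]`, passed to `tokenizer.decode(..., skip_special_tokens=True)`.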