# Prompt Engineering Activity

---

## 📋 Prerequisites

Before starting, make sure you have:
- **Hugging Face Token**: Get one from [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
- **Google Colab Account** (Free tier works!) - *Optional, can run locally too*

---

## 🔗 Quick Links

**🔗 Open in Colab**: [Click here](https://colab.research.google.com/github/oviya-raja/ist-402-assignments/blob/main/IST402/assignments/W3/reference/W3__Prompt_Engineering%20w_QA%20Applications-2.ipynb)

**📂 View on GitHub**: [Click here](https://github.com/oviya-raja/ist-402-assignments/blob/main/IST402/assignments/W3/reference/W3__Prompt_Engineering%20w_QA%20Applications-2.ipynb)

> **⚠️ Note**: Colab link requires the repository to be public on GitHub. If you get a 404 error, see troubleshooting below.
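
Comparing the two links above, the Colab URL looks like the GitHub URL with `https://github.com/` replaced by `https://colab.research.google.com/github/`. A minimal sketch of that mapping, assuming the usual `owner/repo/blob/branch/path` layout. The helper name `github_to_colab` is hypothetical, and the optional `branch` argument mirrors the `main` to `master` swap suggested under Troubleshooting:

```python
from typing import Optional

GITHUB_PREFIX = "https://github.com/"
COLAB_PREFIX = "https://colab.research.google.com/github/"


def github_to_colab(github_url: str, branch: Optional[str] = None) -> str:
    """Turn a GitHub notebook URL into an "Open in Colab" URL.

    Hypothetical helper: it only rewrites the prefix and, optionally, the
    branch segment after /blob/. It does not check that the repository is
    public or that the notebook exists.
    """
    if not github_url.startswith(GITHUB_PREFIX):
        raise ValueError("expected a https://github.com/ URL")
    parts = github_url[len(GITHUB_PREFIX):].split("/")
    # Expected layout: owner / repo / "blob" / branch / path...
    if len(parts) < 5 or parts[2] != "blob":
        raise ValueError("expected owner/repo/blob/branch/path")
    if branch is not None:
        parts[3] = branch  # e.g. swap "main" for "master"
    return COLAB_PREFIX + "/".join(parts)


notebook_url = (
    "https://github.com/oviya-raja/ist-402-assignments/blob/main/"
    "IST402/assignments/W3/reference/"
    "W3__Prompt_Engineering%20w_QA%20Applications-2.ipynb"
)
print(github_to_colab(notebook_url))
print(github_to_colab(notebook_url, branch="master"))
```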

---

## 🚀 Setup Instructions

### Option 1: Google Colab (Recommended for GPU)

#### Step 1: Open Notebook
- **Method A**: Click the "Open in Colab" link above
- **Method B**:
  1. Go to [Google Colab](https://colab.research.google.com/)
  2. Click **File** → **Open notebook** → **GitHub** tab
  3. Enter: `oviya-raja/ist-402-assignments`
  4. Navigate to: `assignments/W3/exercises/W3__Prompt_Engineering w_QA Applications-2.ipynb`

#### Step 2: Enable GPU (Recommended)
1. Go to **Runtime** → **Change runtime type**
2. Select **GPU** → **Save**
3. **Runtime** → **Restart runtime**

#### Step 3: Set Up Token (See Token Setup section below)

---

## 🔐 Token Setup

### For Google Colab Users

**Recommended Method: Using a `.env` file**
1. Run **Cell 4** → It will automatically create a `.env` file
2. Click the **folder icon (📁)** in the left sidebar
3. Find and click the `.env` file
4. Replace `your_token_here` with your actual token
5. Save (Ctrl+S or Cmd+S)
6. Re-run **Cell 4** → Token loaded! ✅

**Quick Method: Direct environment variable**
Run this in a new cell before Cell 4:
```python
import os
os.environ["HUGGINGFACE_HUB_TOKEN"] = "your_actual_token_here"
```
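
The two methods can also be combined in one place: read `.env` if it exists, but leave any variable that is already set untouched. A minimal sketch, assuming `.env` holds plain `KEY=value` lines. `load_env_file` is a hypothetical helper written for illustration, not the code in Cell 4. The `python-dotenv` package offers `load_dotenv` for the same job; the standard-library version here only makes the parsing visible:

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse KEY=value lines and copy them into os.environ.

    Hypothetical stand-in for a token-loading cell. Variables that are
    already set in the environment are not overwritten.
    """
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything without "="
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip().strip("\"'")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded


load_env_file()
token = os.environ.get("HUGGINGFACE_HUB_TOKEN")
if not token or token == "your_token_here":
    print("Token missing: edit .env (or set the variable) and re-run")
else:
    print("Token loaded")
```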

**Get your token**: [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)

> **📖 Learn more**: See `ENV_IN_COLAB.md` for a detailed explanation of how `.env` files work in Colab

---

## 🛠️ Troubleshooting

### 404 Error When Opening from GitHub

**Possible causes:**
- Repository doesn't exist yet → Upload the notebook to Colab directly (see the solution below)
- Repository is private → Make it public or upload the notebook directly
- Wrong branch → Try changing `main` to `master` in the link

**Solution**: Upload the notebook directly to Colab:
1. Download this notebook
2. Go to [Google Colab](https://colab.research.google.com/)
3. Click **File** → **Upload notebook**
4. Select the downloaded file

### GPU Not Detected in Colab

1. Go to **Runtime** → **Change runtime type**
2. Select **GPU** → **Save**
3. **Runtime** → **Restart runtime**
4. Re-run Cell 1 to verify

### Token Not Loading

- Check that the `.env` file exists and uses the correct format: `HUGGINGFACE_HUB_TOKEN=token` (no spaces around `=`)
- Make sure you re-ran Cell 4 after creating/editing `.env`
- Verify token is valid at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)

---



In [None]:
# Google Colab Setup Verification
# Run this cell FIRST to check if everything is set up correctly

import sys
print("🔍 Checking Google Colab environment...")
print(f"   Python version: {sys.version.split()[0]}")

# Check if running in Colab
try:
    import google.colab
    IN_COLAB = True
    print("   ✅ Running in Google Colab")
except ImportError:
    IN_COLAB = False
    print("   ⚠️  Not running in Google Colab (local environment)")

# Check GPU availability
try:
    import torch
    if torch.cuda.is_available():
        print(f"   ✅ GPU Available: {torch.cuda.get_device_name(0)}")
        print(f"   ✅ CUDA Version: {torch.version.cuda}")
    else:
        print("   ⚠️  GPU NOT detected")
        if IN_COLAB:
            print("   💡 TIP: Go to Runtime → Change runtime type → Select GPU → Save")
            print("   💡 Then: Runtime → Restart runtime")
except ImportError:
    print("   ⚠️  PyTorch not installed yet (will be installed in next cell)")

print("\n📋 Next Steps:")
print("   1. If GPU not detected in Colab: Enable GPU runtime and restart")
print("   2. Run Cell 2: Install packages")
print("   3. Run Cell 3: Set up Hugging Face token")
print("   4. Continue with remaining cells")


In [None]:
# Install required packages - run this cell before importing any libraries
# Note: this installs the CPU build of FAISS (faiss-cpu), which also runs on GPU runtimes

# Core packages (always needed)
%pip install transformers torch sentence-transformers datasets python-dotenv faiss-cpu


In [None]:
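# Read the token from Colab Secrets (key icon in the left sidebar).
# Assigning it to a variable keeps the token out of the cell output.
from google.colab import userdata
hf_token = userdata.get('HUGGINGFACE_HUB_TOKEN')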
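
The cell above only works inside Colab, because `google.colab` cannot be imported elsewhere. A hedged sketch of a lookup that also covers local runs: try Colab Secrets first, then fall back to the `HUGGINGFACE_HUB_TOKEN` environment variable described in Token Setup. `get_hf_token` is a hypothetical helper, not part of the notebook:

```python
import os


def get_hf_token(name: str = "HUGGINGFACE_HUB_TOKEN"):
    """Return the token from Colab Secrets if possible, else from os.environ.

    Hypothetical helper; returns None when no token is found anywhere.
    """
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            secret = userdata.get(name)
            if secret:
                return secret
        except Exception:
            # Secret missing or notebook access not granted: fall through
            pass
    return os.environ.get(name)


hf_token = get_hf_token()
print("Token found" if hf_token else "No token found")
```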


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Import libraries we need
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer
import torch
import json
import numpy as np
import faiss
import time

print("All libraries imported successfully!")

In [None]:
# Automatically detect and configure device (CPU or GPU)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

In [None]:
# Specify which Mistral model to use from Hugging Face
model_id = "mistralai/Mistral-7B-Instruct-v0.3"

# Define device info
if device == "cuda":
    device_info = torch.cuda.get_device_name(0)
else:
    device_info = "CPU"

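# Automatically choose dtype
if device == "cuda":
    # bfloat16 is fast on newer GPUs such as the A100; use float16 if the GPU reports no bfloat16 support
    torch_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
    max_new_tokens = 512
else:
    torch_dtype = torch.float32
    max_new_tokens = 256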

print(f"\n⏳ Loading Mistral-7B model...")
print(f"   Device: {device} ({device_info})")
if device == "cpu":
    print("   ⏱️  Expected load time: 5-15 minutes")
    print("   ⏱️  Expected generation: 30-60 seconds per response")
else:
    print("   ⏱️  Expected load time: 1-2 minutes")
    print("   ⏱️  Expected generation: 2-5 seconds per response")
print("   📦 Model size: ~14GB (will download on first run)")

# Hugging Face Token
hf_token = userdata.get("HUGGINGFACE_HUB_TOKEN")

# Conversation
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Load the model pipeline
chatbot = pipeline(
    "text-generation",
    model=model_id,
    token=hf_token,
    dtype=torch_dtype,
    device_map="auto",
    max_new_tokens=max_new_tokens,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=2,
)

print("\n✅ Model loaded! Generating response...")
if device == "cpu":
    print("   ⏱️  This may take 30-60 seconds on CPU...")
else:
    print("   ⏱️  This should take 2-5 seconds on GPU...")

# Generate response and time it (time is already imported with the other libraries)
start_time = time.time()
result = chatbot(messages)
generation_time = time.time() - start_time

print(f"\n✅ Response generated in {generation_time:.2f} seconds")
print("\n" + "="*60)
print(result)
print("="*60)
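
The ~14GB figure printed in the cell above is consistent with weights-only arithmetic: number of parameters times bytes per parameter. A rough sketch, assuming about 7.25 billion parameters for Mistral-7B (a rounded assumption, not a number taken from this notebook) and ignoring activations and the generation cache:

```python
# Bytes needed to store one parameter in each dtype
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_size_gb(num_params: float, dtype: str) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NUM_PARAMS = 7.25e9  # assumed parameter count, rounded

for dtype in ("bfloat16", "float32"):
    print(f"{dtype}: ~{weight_size_gb(NUM_PARAMS, dtype):.1f} GB")
# bfloat16: ~14.5 GB
# float32: ~29.0 GB
```

By the same arithmetic, the float32 path used on CPU needs roughly twice the memory of the 16-bit GPU path.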


In [None]:
# Generate a fresh response and store the full result
# (sampling is enabled, so the text may differ from the previous run)
result = chatbot(messages)

# Extract just the assistant's response from the complex output structure
# result[0] gets the first (and only) generated sequence
# ["generated_text"] gets the conversation history with the new response
# [-1] gets the last message in the conversation (the assistant's reply)
# ["content"] gets just the text content without the role information
assistant_reply = result[0]["generated_text"][-1]["content"]

# Print only the clean assistant response (without all the extra structure)
print(assistant_reply)
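
The indexing chain can be tried without loading the model. A sketch using a hand-written stand-in for `result`: the assistant text is invented for illustration, and the shape assumes the chat-style output described in the comments above:

```python
# Hand-written stand-in for the pipeline output (the reply text is invented)
mock_result = [
    {
        "generated_text": [
            {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
            {"role": "user", "content": "Who are you?"},
            {"role": "assistant", "content": "Arr, I be a pirate chatbot!"},
        ]
    }
]

first_sequence = mock_result[0]                   # first (and only) generated sequence
conversation = first_sequence["generated_text"]   # full message list, prompt included
last_message = conversation[-1]                   # the newly generated message

print(last_message["role"])     # assistant
print(last_message["content"])  # Arr, I be a pirate chatbot!
```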

In [None]:
# Specify the Mistral model we want to use
model_id = "mistralai/Mistral-7B-Instruct-v0.3"

# Load the tokenizer (converts text to numbers that the model understands)
tokenizer = AutoTokenizer.from_pretrained(model_id, token=hf_token)

# Load the actual model with device-optimized settings
# torch_dtype was chosen earlier for this device (16-bit on GPU, float32 on CPU)
# Note: this loads a second copy of the weights next to the pipeline created above
model = AutoModelForCausalLM.from_pretrained(
    model_id,                    # Which model to load
    token=hf_token,             # Authentication token
    dtype=torch_dtype,          # Reuse the dtype chosen for this device
    device_map="auto"           # Automatically use GPU if available
)

# Create a simple conversation (just user input, no system prompt this time)
conversation = [{"role": "user", "content": "What's the weather like in Paris?"}]

# Convert the conversation into the format the model expects
# This applies the model's chat template and converts to tensors
inputs = tokenizer.apply_chat_template(
    conversation,                # The conversation to format
    add_generation_prompt=True,  # Add prompt to signal the model should respond
    return_dict=True,           # Return as dictionary
    return_tensors="pt",        # Return as PyTorch tensors
).to(model.device)             # Move to same device as model (GPU/CPU)

# Generate the response using the model directly
outputs = model.generate(
    **inputs,                           # Pass all the formatted inputs
    max_new_tokens=1000,               # Maximum length of response
    pad_token_id=tokenizer.eos_token_id # Token to use for padding
)
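
To build intuition for what a chat template does, here is a simplified approximation of a Mistral-style instruction format, written as plain string handling. This is an illustration only: `render_inst_prompt` is a hypothetical function, the exact spacing and special tokens vary between model versions, and the template stored with the tokenizer is the authoritative one.

```python
def render_inst_prompt(conversation):
    """Render user/assistant turns in a simplified [INST] ... [/INST] style.

    Approximation for illustration: the real template ships with the
    tokenizer and also handles special tokens and system messages.
    """
    pieces = []
    for message in conversation:
        if message["role"] == "user":
            pieces.append(f"[INST] {message['content']} [/INST]")
        elif message["role"] == "assistant":
            pieces.append(f" {message['content']}")
        else:
            raise ValueError(f"unsupported role: {message['role']}")
    return "".join(pieces)


demo_conversation = [{"role": "user", "content": "What's the weather like in Paris?"}]
print(render_inst_prompt(demo_conversation))
# [INST] What's the weather like in Paris? [/INST]
```

To inspect the real formatted string instead, `tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)` returns text rather than token IDs.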

In [None]:
# Print the raw model output tensor (this shows token IDs/numbers, not readable text yet)
print(outputs)

In [None]:
# Convert the token IDs back to readable text and print the result
# outputs[0] gets the first generated sequence, skip_special_tokens removes formatting tokens
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
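
For decoder-only models, `generate` returns the prompt tokens followed by the new tokens, so the decoded text above contains the question as well as the answer. A sketch of slicing off the prompt, shown with plain lists and invented token IDs so it runs without the model. `strip_prompt` is a hypothetical helper:

```python
def strip_prompt(output_ids, prompt_length):
    """Return only the tokens generated after the prompt."""
    return output_ids[prompt_length:]


# Invented token IDs, for illustration only
prompt_ids = [1, 3, 1841, 29510, 4]
full_output = prompt_ids + [1083, 1309, 29510, 2]

new_tokens = strip_prompt(full_output, len(prompt_ids))
print(new_tokens)  # [1083, 1309, 29510, 2]

# With the real tensors from the cells above, the equivalent would be:
# prompt_length = inputs["input_ids"].shape[-1]
# print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))
```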