# MSR Internship - Baseline Evaluation
## Llama-3.2-1B on lambada_openai Task

This notebook runs the baseline evaluation for the Microsoft Research Internship exercise.

**Prerequisites:**
- ✅ Google Colab with GPU runtime (T4/A100/H100)
- ✅ Hugging Face account with Llama-3.2-1B access
- ✅ HF access token

**Expected Baseline Result:** ~62.24% accuracy

## Step 1: Verify GPU Runtime

⚠️ **IMPORTANT:** Before running any cells, make sure you're using a GPU runtime:
1. Click the dropdown in the top-right corner (near RAM/Disk)
2. Select "Change runtime type"
3. Choose **T4 GPU** (or A100/H100 if available)

In [None]:
# Check GPU availability
!nvidia-smi
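
The same check can be done programmatically instead of reading the `nvidia-smi` table by eye. The following is a minimal sketch using only the standard library; the helper name `gpu_visible` is hypothetical, and it assumes that a successful `nvidia-smi -L` listing at least one `GPU` line means a GPU is attached.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi is on PATH and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No NVIDIA driver tools installed: almost certainly a CPU runtime.
        return False
    try:
        # "-L" prints one line per GPU, e.g. "GPU 0: <name> (UUID: ...)".
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


if gpu_visible():
    print("GPU runtime detected")
else:
    print("No GPU detected: change the runtime type before continuing")
```

Running this before the setup step avoids discovering a CPU-only runtime several minutes later.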

## Step 2: Clone Project Repository

In [None]:
# Clone the project repository
!git clone https://github.com/pavannn16/msr-intern-project.git
%cd msr-intern-project

## Step 3: Run Automated Setup

This will:
- Clone transformers (v4.57.6) and microxcaling repositories
- Install lm-eval and dependencies
- Set up environment variables

⏱️ **Estimated time:** 3-5 minutes

In [None]:
# Run the automated setup script
!bash scripts/setup_colab.sh
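
Once the setup script finishes, a quick import check can confirm that the installs succeeded. This is a sketch, not part of the project's scripts: the helper `missing_modules` is hypothetical, and the module names `lm_eval`, `transformers`, and `torch` are assumed import names for what the script installs.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names for the packages the setup script provides.
required = ["lm_eval", "transformers", "torch"]
missing = missing_modules(required)

if missing:
    print(f"Missing modules: {missing} (try re-running the setup script)")
else:
    print("Setup looks complete")
```

`find_spec` only locates the modules without importing them, so the check is fast and does not load any model code.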

## Step 4: Set Hugging Face Token

⚠️ **REPLACE `your_token_here` WITH YOUR ACTUAL HF TOKEN**

Get your token from: https://huggingface.co/settings/tokens

In [None]:
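# Set your Hugging Face token (never commit a real token to the repository)
import os
os.environ['HF_TOKEN'] = 'your_token_here'

# Verify the placeholder has been replaced
assert os.environ['HF_TOKEN'] != 'your_token_here', "Replace your_token_here with your actual HF token"
print("✓ HF_TOKEN set successfully")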
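
A slightly stricter sanity check on the token can catch the most common mistakes before the model download fails. The sketch below is a format check only and does not contact the Hub; the helper `token_problem` is hypothetical, and it assumes that Hugging Face user access tokens start with `hf_`.

```python
import os


def token_problem(token):
    """Describe what looks wrong with the token, or return None if it looks plausible."""
    if not token:
        return "HF_TOKEN is empty or unset"
    if token == "your_token_here":
        return "HF_TOKEN still holds the placeholder value"
    if not token.startswith("hf_"):
        # Assumption: user access tokens carry an "hf_" prefix.
        return "HF_TOKEN does not start with 'hf_'"
    return None


problem = token_problem(os.environ.get("HF_TOKEN"))
print(problem or "HF_TOKEN looks plausible (format check only)")
```

To keep the token out of the notebook entirely, `getpass.getpass()` from the standard library can prompt for it at run time instead of hardcoding it in a cell.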

## Step 5: Quick Test (10% of Dataset)

Test the setup with a small subset to verify everything works.

⏱️ **Estimated time:** 1-2 minutes

In [None]:
# Quick test with 10% of the dataset
!lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3.2-1B \
  --tasks lambada_openai \
  --device cuda \
  --batch_size 32 \
  --limit 0.1
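
To make the effect of `--limit 0.1` concrete, the sketch below estimates how many examples the quick test evaluates. It assumes that lm-eval treats a value below 1 as a fraction of the documents (rounded up) and a value of 1 or more as an absolute count, and that the `lambada_openai` test split holds 5153 examples; both are assumptions to check against your installed version, and `sample_size` is a hypothetical helper.

```python
import math


def sample_size(total_docs, limit=None):
    """Estimate how many documents are evaluated for a given --limit value."""
    if limit is None:
        return total_docs  # no limit: the full dataset
    if limit < 1.0:
        return math.ceil(total_docs * limit)  # fraction of the dataset
    return int(limit)  # absolute number of documents


# 5153 is the assumed size of the lambada_openai test split.
print(sample_size(5153, 0.1))  # 516
```

Under these assumptions the quick test covers roughly 500 examples, so its accuracy is a noisy estimate and should not be compared directly with the expected baseline.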

## Step 6: Full Baseline Evaluation

Run the complete evaluation on 100% of the lambada_openai dataset.

⏱️ **Estimated time:** 10-15 minutes  
🎯 **Expected accuracy:** ~62.24%

In [None]:
# Full baseline evaluation
!lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3.2-1B \
  --tasks lambada_openai \
  --device cuda \
  --batch_size 32
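
The quick test and the full run differ only in the `--limit` flag, so both commands can be built from one function. This is a sketch: `build_lm_eval_cmd` is a hypothetical helper and `results/baseline` is an assumed output location. The `--output_path` flag asks lm-eval to write its results as JSON, which avoids copying numbers by hand later.

```python
import shlex


def build_lm_eval_cmd(limit=None, batch_size=32, output_path=None):
    """Build the lm_eval argument list used in this notebook."""
    cmd = [
        "lm_eval",
        "--model", "hf",
        "--model_args", "pretrained=meta-llama/Llama-3.2-1B",
        "--tasks", "lambada_openai",
        "--device", "cuda",
        "--batch_size", str(batch_size),
    ]
    if limit is not None:
        cmd += ["--limit", str(limit)]  # quick test on a subset
    if output_path is not None:
        cmd += ["--output_path", output_path]  # write results JSON here
    return cmd


# Print the full-run command; pass the list to subprocess.run to execute it.
print(shlex.join(build_lm_eval_cmd(output_path="results/baseline")))
```

Keeping the arguments in one place makes it harder for the quick test and the full run to drift apart, for example in batch size.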

## Step 7: Save Results

Save the baseline results for comparison with quantized models.

In [None]:
# Save a baseline results template to file
import datetime
import os

timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")

# Create the output directory in case it does not exist yet
os.makedirs('results', exist_ok=True)

with open('results/baseline_results.txt', 'w') as f:
    f.write("Baseline Evaluation Results\n")
    f.write("=" * 50 + "\n")
    f.write(f"Timestamp: {timestamp}\n")
    f.write("Model: meta-llama/Llama-3.2-1B\n")
    f.write("Task: lambada_openai\n")
    f.write("Device: CUDA\n")
    f.write("Batch Size: 32\n")
    f.write("\n")
    f.write("Results:\n")
    f.write("Accuracy: [TO BE FILLED FROM ABOVE OUTPUT]\n")
    f.write("Runtime: [TO BE FILLED FROM ABOVE OUTPUT]\n")
    f.write("\n")
    f.write("NOTE: Manually update this file with the actual results from the evaluation output above.\n")

print("✓ Results template saved to results/baseline_results.txt")
print("\n⚠️ Remember to:")
print("  1. Copy the actual accuracy and runtime from the output above")
print("  2. Update the results file")
print("  3. Commit and push results to GitHub")
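
If the evaluation was run with lm-eval's `--output_path` flag, the accuracy can be read from the resulting JSON instead of being copied by hand. The sketch below assumes the file has a top-level `results` mapping keyed by task name and that accuracy is stored under `acc,none` (or `acc` in older releases); inspect your own output file to confirm. The helpers are hypothetical and the example dictionary is synthetic, not a real measurement.

```python
import json
from pathlib import Path


def extract_accuracy(results, task="lambada_openai"):
    """Return the accuracy for a task from an lm-eval style results dict, or None."""
    metrics = results.get("results", {}).get(task, {})
    # Assumed key names: "acc,none" in newer releases, "acc" in older ones.
    for key in ("acc,none", "acc"):
        if key in metrics:
            return float(metrics[key])
    return None


def load_accuracy(json_path, task="lambada_openai"):
    """Read a results JSON file and return the task accuracy, or None."""
    return extract_accuracy(json.loads(Path(json_path).read_text()), task)


# Synthetic example, not a real measurement.
example = {"results": {"lambada_openai": {"acc,none": 0.5}}}
print(f"Accuracy: {extract_accuracy(example):.2%}")
```

The returned value could then replace the `[TO BE FILLED FROM ABOVE OUTPUT]` placeholder in the results file.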

## ✅ Baseline Evaluation Complete!

### Next Steps:
1. **Record your results** - Note the accuracy and runtime from the evaluation above
2. **Compare with expected** - Should be around 62.24% accuracy
3. **Save to GitHub** - If running in Colab, download the results file, then commit and push it to your repository
4. **Move to Exercise 1** - Start implementing MX quantization for linear layers
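
The "around 62.24%" comparison in step 2 can be made explicit with a tolerance. This is a sketch: `matches_baseline` is a hypothetical helper, and the tolerance of half a percentage point is an arbitrary assumption, not a threshold defined by the exercise.

```python
EXPECTED_ACC = 0.6224  # expected baseline accuracy stated in this notebook


def matches_baseline(measured, expected=EXPECTED_ACC, tolerance=0.005):
    """Return True if measured accuracy is within +/- tolerance (absolute) of expected."""
    return abs(measured - expected) <= tolerance


# Accuracies are fractions here (0.6224 means 62.24%).
print(matches_baseline(0.6224))  # True
```

A result outside the tolerance is worth investigating before moving on, since later quantization results are compared against this baseline.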

### Troubleshooting:
- **HF Token Error**: Verify token has Read permission and Llama-3.2-1B access approved
- **CUDA Error**: Check GPU runtime is enabled (T4/A100/H100)
- **Out of Memory**: Reduce batch_size to 16 or 8
- **Import Errors**: Re-run the setup script
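
For the out-of-memory case above, the batch sizes to try can be generated by repeated halving. This is a sketch; `batch_size_fallbacks` is a hypothetical helper and the floor of 8 simply mirrors the suggestion in the list.

```python
def batch_size_fallbacks(start=32, minimum=8):
    """Yield batch sizes to try in order, halving each time."""
    size = start
    while size >= minimum:
        yield size
        size //= 2


print(list(batch_size_fallbacks()))  # [32, 16, 8]
```

lm-eval also accepts `--batch_size auto`, which asks the harness to pick a batch size that fits in memory; that may be simpler than retrying by hand.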

### Important Files:
- `results/baseline_results.txt` - Your baseline metrics
- `modified_files/` - Where Exercise 1 modifications will go
- `scripts/` - Helper scripts for setup and evaluation