In [None]:
# 1) (Optional) Mount Google Drive to persist checkpoints
try:
    from google.colab import drive
    drive.mount('/content/drive')
except Exception as e:
    print('Not running in Colab or drive mount failed:', e)

In [None]:
# 2) Clone the repository (or pull if already present)
import os, subprocess, sys
if not os.path.exists('270FT'):
    subprocess.check_call(['git', 'clone', 'https://github.com/oleeveeuh/270FT.git'])
else:
    print('Repository already exists; fetching latest changes')
    try:
        subprocess.check_call(['git', '-C', '270FT', 'pull'])
    except Exception as e:
        print('git pull failed:', e)
# Change working directory to repo root
os.chdir('270FT')
print('CWD:', os.getcwd())

In [None]:
# 3) Install required Python packages (may take a few minutes).
# Install core packages first, then attempt bitsandbytes which may need a matching CUDA runtime.
!pip install -q transformers datasets peft evaluate pyyaml huggingface_hub wandb
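# Try to install bitsandbytes; if it fails, you can retry with a wheel matching the runtime's CUDA.
# subprocess.check_call raises on a non-zero exit code, so a failed install is actually caught here.
import subprocess, sys
try:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', 'bitsandbytes'])
    print('bitsandbytes installed')
except subprocess.CalledProcessError as e:
    print('bitsandbytes install failed (you may still proceed if using a different runtime):', e)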
# Show GPU info if available
!nvidia-smi || true

If `bitsandbytes` or other installs fail due to CUDA mismatches, try switching the Colab runtime GPU type (or use a notebook with a supported CUDA version). You can also skip installing `bitsandbytes` if you only want CPU runs, but the full QLoRA workflow requires a CUDA GPU.
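
Before training, it can help to confirm which of the packages installed above are visible to the runtime. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is illustrative and not part of the repository.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level package names that cannot be located by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Package names taken from the install cell above
required = ["transformers", "datasets", "peft", "bitsandbytes"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found")
```

Note that a package being found does not guarantee it imports cleanly: `bitsandbytes` can be installed and still fail at import time if its binaries do not match the runtime's CUDA version.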

In [None]:
# 4) (Optional) Preprocess raw data into processed JSONL files.
# This will create/update data/processed/train.jsonl, validation.jsonl, and test.jsonl
!python preprocess/load_and_prepare.py --raw_dir data/raw --processed_dir data/processed --validation_split 0.15 --test_split 0.15

In [None]:
# 5) (Optional) Inspect processed file sizes and line counts
!ls -lh data/processed || true
!wc -l data/processed/*.jsonl || true

In [None]:
# 6) Provide Hugging Face token (if required).
# Use this cell to securely input a token if you need to access gated models.
import os
from getpass import getpass
token = getpass('Hugging Face token (leave blank if not needed): ')
if token:
    os.environ['HF_TOKEN'] = token
    # Also log in via huggingface_hub so the token is cached for later cells.
    # Unlike a shell call, this raises a Python exception that the except block can catch.
    try:
        from huggingface_hub import login
        login(token=token)
    except Exception as e:
        print('Automatic Hugging Face login failed; token is still stored in HF_TOKEN:', e)
    print('HF_TOKEN set in environment')
else:
    print('No token provided; proceeding without HF token')

In [None]:
# 6b) Disable W&B interactive logging to avoid login prompts during automated runs
import os
os.environ['WANDB_MODE'] = 'offline'
print('W&B offline mode set')

Now run the full training script below. The script `training/train_dual_lora.py` will look for `data/processed/train.jsonl`, `validation.jsonl`, and `test.jsonl` in `data/processed` and will use the models listed in `configs/training_config.yaml`.
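
Since the training script expects all three split files, a quick pre-flight check can save a failed run. This is a minimal sketch, assuming each non-empty line of the processed files is a single JSON object; `check_jsonl` is an illustrative helper, not something the repository provides.

```python
import json
from pathlib import Path

def check_jsonl(path):
    """Count non-empty records in a JSONL file; raise ValueError on the first bad line."""
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}:{line_no} is not valid JSON: {err}") from err
            count += 1
    return count

processed = Path("data/processed")
for name in ("train.jsonl", "validation.jsonl", "test.jsonl"):
    path = processed / name
    if path.exists():
        print(f"{name}: {check_jsonl(path)} records")
    else:
        print(f"{name}: MISSING (re-run the preprocessing cell)")
```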

**Note about test data**: The test set is reserved for human-in-the-loop evaluation after training. During training, only the validation set is used for evaluation metrics. Training logs:
- **Training loss** every 10 steps
- **Validation loss** at the end of each epoch
- Metrics to W&B (if enabled) or console output

In [None]:
# 7) Run the training script (this will start QLoRA fine-tuning).
# Note: training will log to W&B if enabled in the config; we set W&B to offline above to avoid interactive prompts.
# Use unbuffered output so logs stream in Colab
!python -u training/train_dual_lora.py

**Training Monitoring:**
The training script logs losses and metrics in multiple ways:
1. **Console output**: Training loss logged every 10 steps, validation metrics at each epoch
2. **W&B (if enabled)**: Full training/validation curves, model checkpoints
3. **Checkpoints**: Saved every 500 steps to the output directory (keeping last 3)
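
The "keeping last 3" behaviour amounts to rotating checkpoint directories. The sketch below shows the idea in plain Python, assuming checkpoints are saved as `checkpoint-<step>` directories (the Hugging Face `Trainer` naming convention, where `save_total_limit` performs this rotation). Whether `train_dual_lora.py` uses that mechanism is an assumption, and `prune_checkpoints` is an illustrative helper.

```python
import re
import shutil
from pathlib import Path

def prune_checkpoints(output_dir, keep=3):
    """Delete all but the `keep` highest-step checkpoint-<step> directories.

    Returns the names of the checkpoints that were kept, oldest first.
    """
    pattern = re.compile(r"^checkpoint-(\d+)$")
    found = []
    for child in Path(output_dir).iterdir():
        match = pattern.match(child.name)
        if child.is_dir() and match:
            found.append((int(match.group(1)), child))
    found.sort()  # ascending by step number
    stale = found[:-keep] if keep > 0 else found
    for _, path in stale:
        shutil.rmtree(path)
    return [path.name for _, path in found[len(stale):]]
```

This can be useful when checkpoints are written to Google Drive and space is limited; other files and directories in the output folder are left untouched.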

**Troubleshooting tips:**
- If you see `FileNotFoundError` complaining about missing `test.jsonl`, re-run the preprocessing cell or ensure `data/processed/test.jsonl` exists.
- If you get `bitsandbytes` import errors, try installing a different `bitsandbytes` wheel that matches the Colab CUDA runtime or switch runtime.
- If training runs out of VRAM, reduce `batch_size` in `configs/training_config.yaml` or switch to a larger GPU.
- If you prefer to persist checkpoints to Google Drive, create a folder in Drive and update `configs/training_config.yaml` output paths to point inside `/content/drive/MyDrive/...`.

**Next steps after training:**
- Run inference on test questions using the trained adapter
- Use human-in-the-loop review to evaluate the quality of generated solutions
- The test set at `data/processed/test.jsonl` contains questions without solutions (intentional for unbiased evaluation)

## Understanding Your Results

### Automated Metrics (for questions with reference solutions)
- **Exact Match Rate**: Percentage of solutions that exactly match your reference (strict comparison)
- **BLEU Score**: n-gram overlap between the generated solution and your reference (0-1 scale, higher is closer)
  - 0.0-0.1: Poor match
  - 0.1-0.3: Fair match  
  - 0.3-0.5: Good match
  - 0.5+: Excellent match
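
The bands above can be expressed as a small lookup. This is a sketch only: the listed ranges overlap at 0.1 and 0.3, so it assumes a boundary value belongs to the higher band, and it assumes "exact match" compares strings after trimming surrounding whitespace. The evaluation script may treat both cases differently.

```python
def bleu_band(score):
    """Map a BLEU score in [0, 1] to the qualitative band used in this guide."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"BLEU score must be between 0 and 1, got {score}")
    if score < 0.1:
        return "Poor"
    if score < 0.3:
        return "Fair"
    if score < 0.5:
        return "Good"
    return "Excellent"

def exact_match_rate(predictions, references):
    """Fraction of predictions equal to their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(predictions)

print(bleu_band(0.27))  # prints "Fair"
```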

### Quality Checks (for all questions)
All generated solutions are automatically checked for:
- ✓/✗ Has algorithm/pseudocode section
- ✓/✗ Has runtime analysis (Big-O notation)
- ✓/✗ Has proof keywords (proof, correctness, invariant, etc.)
- ✓/✗ Has code structure (for/while/if statements)
- Detected complexity notations
- Length validation
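
Checks of this kind are typically simple keyword and pattern heuristics. The sketch below illustrates one possible shape. The regular expressions, the dictionary keys, and the length thresholds (`min_chars`, `max_chars`) are all assumptions made for illustration and are not taken from the evaluation script.

```python
import re

# Matches Big-O expressions such as O(n), O(n log n), or O(log(n))
COMPLEXITY_RE = re.compile(r"\bO\((?:[^()]|\([^()]*\))*\)")

def quality_checks(text, min_chars=50, max_chars=10000):
    """Run heuristic structure checks on a generated solution."""
    lowered = text.lower()
    complexities = COMPLEXITY_RE.findall(text)
    return {
        "has_algorithm": bool(re.search(r"\b(?:algorithm|pseudocode)\b", lowered)),
        "has_runtime_analysis": bool(complexities),
        "has_proof": bool(re.search(r"\b(?:proof|correctness|invariant)\b", lowered)),
        "has_code_structure": bool(re.search(r"\b(?:for|while|if)\b", lowered)),
        "complexities": complexities,
        "length_ok": min_chars <= len(text.strip()) <= max_chars,
    }

sample = (
    "Algorithm: for each element, update the max. "
    "Runtime: O(n log n). "
    "Proof of correctness: by the loop invariant."
)
print(quality_checks(sample))
```

Keyword heuristics like these only confirm that a section appears to be present; they say nothing about whether the algorithm or proof is correct, which is why human review remains part of the workflow.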

### Human Review Workflow
For questions without reference solutions:
1. Download the `human_review_*.csv` file (cell 11)
2. Open in Excel or Google Sheets
3. Review each generated solution
4. Fill in the "Rating (1-5)" column:
   - 1 = Incorrect/useless
   - 2 = Major issues
   - 3 = Acceptable but flawed
   - 4 = Good with minor issues
   - 5 = Excellent
5. Add comments explaining your rating
6. Calculate average rating and % of items rated 4-5
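
The final step can be automated once the sheet is exported back to CSV. This is a minimal sketch, assuming the rating column is named exactly "Rating (1-5)" as in the template and that unrated rows are left blank; `summarize_ratings` is an illustrative helper.

```python
import csv

def summarize_ratings(csv_path, column="Rating (1-5)"):
    """Return the count, average, and share of 4-5 ratings from a completed review CSV."""
    ratings = []
    # utf-8-sig tolerates the byte-order mark that Excel may add on export
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        for row in csv.DictReader(f):
            value = (row.get(column) or "").strip()
            if not value:
                continue  # row has not been rated yet
            rating = int(value)
            if not 1 <= rating <= 5:
                raise ValueError(f"Rating out of range: {rating}")
            ratings.append(rating)
    if not ratings:
        return {"rated": 0, "average": None, "pct_4_or_5": None}
    good = sum(1 for r in ratings if r >= 4)
    return {
        "rated": len(ratings),
        "average": sum(ratings) / len(ratings),
        "pct_4_or_5": 100.0 * good / len(ratings),
    }

# Example (hypothetical filename):
# print(summarize_ratings("results/human_review_completed.csv"))
```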

For complete documentation, see [EVALUATION_GUIDE.md](../EVALUATION_GUIDE.md) and [EVALUATION_QUICKSTART.md](../EVALUATION_QUICKSTART.md)

In [None]:
# 11) Download results for human review
# This cell creates downloadable files:
# - evaluation_*.json: Full automated metrics and quality checks
# - human_review_*.csv: Template for manual review of items without solutions

from google.colab import files
from pathlib import Path
import os

results_dir = Path('results')
if results_dir.exists():
    # Download JSON results
    json_files = list(results_dir.glob('evaluation_*.json'))
    if json_files:
        latest_json = max(json_files, key=os.path.getmtime)
        print(f"Downloading: {latest_json.name}")
        files.download(str(latest_json))
    
    # Download CSV for human review
    csv_files = list(results_dir.glob('human_review_*.csv'))
    if csv_files:
        latest_csv = max(csv_files, key=os.path.getmtime)
        print(f"Downloading: {latest_csv.name}")
        files.download(str(latest_csv))
        print("\n✓ Open the CSV in Excel or Google Sheets to complete human review")
        print("  Fill in 'Rating (1-5)' and 'Comments' columns")
    else:
        print("No CSV for human review (all items have reference solutions)")
else:
    print("Results directory not found. Run evaluation first.")

In [None]:
# 10) View evaluation results summary
import json
from pathlib import Path
import os

results_dir = Path('results')
if results_dir.exists():
    # Find the most recent evaluation JSON
    json_files = list(results_dir.glob('evaluation_*.json'))
    if json_files:
        latest_json = max(json_files, key=os.path.getmtime)
        
        with open(latest_json, 'r') as f:
            results = json.load(f)
        
        print(f"{'='*60}")
        print(f"EVALUATION RESULTS: {results['model_name']}")
        print(f"{'='*60}\n")

        print(f"Total Test Items: {results['total_items']}")
        print(f"  Items with reference solutions: {results['items_with_solutions']}")
        print(f"  Items needing human review: {results['items_without_solutions']}\n")

        if results['items_with_solutions'] > 0:
            print(f"AUTOMATED METRICS (on {results['items_with_solutions']} items with solutions):")
            print(f"  Exact Match Rate: {results['automated_metrics']['exact_match_rate']:.4f} ({results['automated_metrics']['exact_matches']}/{results['items_with_solutions']} matched)")
            print(f"  Average BLEU Score: {results['automated_metrics']['avg_bleu_score']:.4f}")
            print("    (0.0-0.1: Poor, 0.1-0.3: Fair, 0.3-0.5: Good, 0.5+: Excellent)\n")

        if results['items_without_solutions'] > 0:
            print(f"HUMAN REVIEW NEEDED ({results['items_without_solutions']} items):")
            print("  CSV template exported for manual review")
            print("  Quality pre-checks completed\n")

        print(f"\nDetailed results saved to: {latest_json}")
    else:
        print("No evaluation results found. Run evaluation first.")
else:
    print("Results directory not found. Run evaluation first.")

In [None]:
# 9) Run hybrid evaluation (automated metrics + quality checks)
# This will:
# - Generate solutions for all test questions using your fine-tuned model
# - Run automated metrics (BLEU, exact match) for questions with reference solutions
# - Run quality checks (structure, completeness) for all questions
# - Flag questions without solutions for human review
# - Export results to JSON and CSV

!python -u evaluation/evaluate_with_solutions.py

In [None]:
# 8b) Run the reference solutions script
# Only run this after editing REFERENCE_SOLUTIONS in cell 8 and re-running that cell,
# so that temp_add_solutions.py is rewritten with your changes
!python temp_add_solutions.py

In [None]:
# 8) Add reference solutions for test questions (where you have them)
# Edit the REFERENCE_SOLUTIONS dictionary below to add your solutions

reference_solutions_script = """
import json
from pathlib import Path

# ADD YOUR REFERENCE SOLUTIONS HERE
# Format: question ID (0-indexed) -> reference solution
# Leave empty string "" for questions without solutions (will need human review)
REFERENCE_SOLUTIONS = {
    0: "",  # Replace "" with \"\"\"...\"\"\" containing your algorithm, runtime analysis, and proof;
            # leave it empty if you don't have a solution (the question is then flagged for human review)

    # Add more as needed...
    # 1: "",  # No solution - will be flagged for human review
    # 2: \"\"\"Your reference solution for question 2\"\"\",
}

# Load existing test data
project_root = Path.cwd()
test_input_path = project_root / "data" / "processed" / "test.jsonl"
test_output_path = project_root / "data" / "processed" / "test_with_solutions.jsonl"

if not test_input_path.exists():
    print(f"Error: {test_input_path} not found")
    raise SystemExit(1)

# Load questions
questions = []
with open(test_input_path, 'r', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if line:
            questions.append(json.loads(line))

print(f"Loaded {len(questions)} test questions")

# Add solutions where available
items_with_solutions = 0
items_without_solutions = 0

output_items = []
for idx, item in enumerate(questions):
    solution = REFERENCE_SOLUTIONS.get(idx, "").strip()

    if solution:
        item["solution"] = solution
        items_with_solutions += 1
    else:
        # Don't add solution field - will trigger human review
        items_without_solutions += 1

    output_items.append(item)

# Save updated test data
with open(test_output_path, 'w', encoding='utf-8') as f:
    for item in output_items:
        f.write(json.dumps(item, ensure_ascii=False) + '\\n')

print(f"\\nSaved to: {test_output_path}")
print(f"  {items_with_solutions} items with reference solutions")
print(f"  {items_without_solutions} items without solutions (will need human review)")
"""

# Write the script temporarily
with open('temp_add_solutions.py', 'w') as f:
    f.write(reference_solutions_script)

print("✓ Script written to temp_add_solutions.py. Edit REFERENCE_SOLUTIONS in this cell, re-run it, then run:")
print("  !python temp_add_solutions.py")
print("\nOr skip this step if you don't have reference solutions (all items will need human review)")

## Hybrid Evaluation: Automated Metrics + Human Review

After training, evaluate your model using a hybrid approach:
- **Automated metrics** (BLEU, exact match) for questions where you have reference solutions
- **Quality checks** for all generated solutions
- **Human review** for questions without reference solutions

This section will:
1. Add your reference solutions to the test data
2. Run evaluation (automated + quality checks)
3. Generate results in JSON and CSV formats for review