# Data Visualization Critic - Phase 2: Training on Combined Dataset

**COMS 4995 Final Project**

---

## Training Configuration

- **Dataset:** Combined V1 + V2 (738 examples)
- **Model:** Llama-3-8B-Instruct
- **Method:** LoRA fine-tuning (4-bit quantization)
- **Output:** `lora_model_combined/`

**Estimated Time:** 3-4 hours

In [None]:
# Install packages
!pip install -q transformers datasets peft accelerate bitsandbytes trl

import json
import os
import torch
import pandas as pd
from datetime import datetime
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments
)
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from trl import SFTTrainer
from datasets import Dataset
from google.colab import userdata, drive
import gc

print("✅ Packages installed")
print(f"   PyTorch: {torch.__version__}")
print(f"   CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"   GPU: {torch.cuda.get_device_name(0)}")

# CRITICAL: Mount Drive FIRST
print("\n📁 Mounting Google Drive...")
drive.mount('/content/drive')

# Paths
PROJECT_FOLDER = '/content/drive/MyDrive/DataVizCritic'
DATA_PATH = f'{PROJECT_FOLDER}/training_data_combined.jsonl'
OUTPUT_DIR = f'{PROJECT_FOLDER}/lora_model_combined'

print(f"✅ Drive mounted")
print(f"   Data: {DATA_PATH}")
print(f"   Output: {OUTPUT_DIR}")

# Verify data exists
if not os.path.exists(DATA_PATH):
    raise FileNotFoundError(f"❌ Combined dataset not found at {DATA_PATH}")

# Count non-empty lines; each one is a JSON record
with open(DATA_PATH, 'r', encoding='utf-8') as f:
    data_count = sum(1 for line in f if line.strip())
print(f"✅ Found {data_count} training examples")

✅ Packages installed
   PyTorch: 2.9.0+cu126
   CUDA available: True
   GPU: Tesla T4

📁 Mounting Google Drive...
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
✅ Drive mounted
   Data: /content/drive/MyDrive/DataVizCritic/training_data_combined.jsonl
   Output: /content/drive/MyDrive/DataVizCritic/lora_model_combined
✅ Found 738 training examples


In [None]:
print("📊 Loading and formatting training data...")

# Load data (one JSON object per line; blank lines are skipped)
with open(DATA_PATH, 'r', encoding='utf-8') as f:
    raw_data = [json.loads(line) for line in f if line.strip()]

print(f"   Loaded {len(raw_data)} examples")

# Format for training
def format_example(example):
    """Convert to instruction-following format."""

    flawed_code = example.get('flawed_code', '')
    critique = example.get('critique', {})
    detailed_explanation = critique.get('detailed_explanation', '') if isinstance(critique, dict) else ''
    corrected_code = example.get('corrected_code', '')

    # User message (fence markers match the inference prompt used later)
    user_msg = (
        "Review this Python code for statistical and visualization errors:\n"
        f"```python\n{flawed_code}\n```\n"
        "\n"
        "Identify any issues and provide:\n"
        "1. Summary of main errors\n"
        "2. Why they're problematic\n"
        "3. How to fix them"
    )

    # Assistant message: explanation followed by the corrected code
    assistant_msg = (
        f"{detailed_explanation}\n"
        "\n"
        "Corrected code:\n"
        f"```python\n{corrected_code}\n```"
    )

    return {
        'user': user_msg,
        'assistant': assistant_msg
    }

# Format all examples
formatted_data = []
for ex in raw_data:
    try:
        formatted = format_example(ex)
        if len(formatted['user']) > 100 and len(formatted['assistant']) > 100:
            formatted_data.append(formatted)
    except Exception:
        # Skip malformed records; they are counted as dropped below
        continue

print(f"   Formatted {len(formatted_data)} examples")
print(f"   Dropped {len(raw_data) - len(formatted_data)} invalid examples")

# Create HuggingFace Dataset
dataset = Dataset.from_list(formatted_data)

print(f"\n✅ Dataset ready")
print(f"   Total examples: {len(dataset)}")
print(f"   Train split: {len(dataset)} (100%)")

📊 Loading and formatting training data...
   Loaded 738 examples
   Formatted 738 examples
   Dropped 0 invalid examples

✅ Dataset ready
   Total examples: 738
   Train split: 738 (100%)
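
Because `format_example` falls back to empty strings through `.get`, and the user prompt template alone is longer than 100 characters, a record with an empty `flawed_code` can still pass the length filter. The sketch below shows one way to check the fields explicitly before formatting. The helper name `find_incomplete_records` and the sample records are hypothetical; the field names are the ones `format_example` reads.

```python
REQUIRED_FIELDS = ("flawed_code", "corrected_code")

def find_incomplete_records(records):
    """Return (index, missing_fields) for records that lack non-empty required text."""
    problems = []
    for i, rec in enumerate(records):
        # Top-level text fields must be present and non-blank
        missing = [k for k in REQUIRED_FIELDS if not str(rec.get(k) or "").strip()]

        # The explanation lives inside the nested critique dict
        critique = rec.get("critique")
        explanation = critique.get("detailed_explanation") if isinstance(critique, dict) else ""
        if not str(explanation or "").strip():
            missing.append("critique.detailed_explanation")

        if missing:
            problems.append((i, missing))
    return problems

# Hypothetical records for illustration only
sample = [
    {
        "flawed_code": "plt.bar(x, y)",
        "critique": {"detailed_explanation": "The y-axis is truncated."},
        "corrected_code": "plt.bar(x, y); plt.ylim(0, None)",
    },
    {"flawed_code": "", "critique": "not a dict", "corrected_code": "plt.plot(x, y)"},
]
print(find_incomplete_records(sample))
```

Running the same check on `raw_data` before the formatting loop would report any incomplete records by index instead of letting them through silently.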


In [None]:
print("📥 Loading Llama-3-8B-Instruct...")

# Clear any existing models
gc.collect()
torch.cuda.empty_cache()

# Get HF token
try:
    hf_token = userdata.get('HF_TOKEN')
    print("✅ HF token loaded")
except Exception:
    hf_token = None
    print("⚠️ No HF token - may fail for gated models")

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float32,
    bnb_4bit_use_double_quant=True,
)

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Load model
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    token=hf_token,
    trust_remote_code=True
)

print(f"\n✅ Model loaded")
print(f"   Memory: {base_model.get_memory_footprint() / 1e9:.2f} GB")
print(f"   Device: {base_model.device}")

📥 Loading Llama-3-8B-Instruct...
✅ HF token loaded


✅ Model loaded
   Memory: 5.59 GB
   Device: cuda:0
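
The reported footprint can be sanity-checked with rough arithmetic. The sketch below assumes the published Llama-3-8B dimensions, that only the linear layers inside the decoder blocks are stored at 4 bits, and that the embeddings, `lm_head`, and norm weights stay in 16-bit. Quantization constants are not counted. These are assumptions, not values read from the loaded model.

```python
# Published Llama-3-8B dimensions (assumed from the model's config)
HIDDEN, INTERMEDIATE, LAYERS = 4096, 14336, 32
KV_DIM = 1024      # 8 key/value heads x 128 head dim
VOCAB = 128256

# Linear weights inside each decoder layer (assumed to be the 4-bit part)
per_layer_linear = (
    2 * HIDDEN * HIDDEN            # q_proj, o_proj
    + 2 * HIDDEN * KV_DIM          # k_proj, v_proj
    + 3 * HIDDEN * INTERMEDIATE    # gate_proj, up_proj, down_proj
)
quantized_weights = LAYERS * per_layer_linear

# Assumed to stay in 16-bit: token embeddings, lm_head, RMSNorm weights
unquantized_weights = 2 * VOCAB * HIDDEN + (2 * LAYERS + 1) * HIDDEN

# 4-bit weights take half a byte each; 16-bit weights take two bytes each
bytes_total = quantized_weights // 2 + unquantized_weights * 2
print(f"{bytes_total / 1e9:.2f} GB")
```

Under these assumptions the estimate lands on the same figure the cell reports, which suggests most of the remaining memory is the 16-bit embedding and output matrices rather than the quantized layers.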


In [None]:
print("🔧 Configuring LoRA...")

# Prepare model for k-bit training
base_model = prepare_model_for_kbit_training(base_model)

# LoRA config
lora_config = LoraConfig(
    r=16,                                    # Rank
    lora_alpha=32,                           # Alpha (scaling)
    target_modules=[                         # Which layers to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_dropout=0.05,                       # Dropout
    bias="none",                             # No bias training
    task_type="CAUSAL_LM"                    # Task type
)

# Apply LoRA
model = get_peft_model(base_model, lora_config)

# Print trainable parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())

print(f"\n✅ LoRA applied")
print(f"   Trainable params: {trainable_params:,} ({trainable_params/total_params*100:.2f}%)")
print(f"   Total params: {total_params:,}")
# Note: this figure is the share of frozen parameters, not a measured memory reduction
print(f"   Memory savings: ~{(1 - trainable_params/total_params)*100:.1f}%")

🔧 Configuring LoRA...

✅ LoRA applied
   Trainable params: 41,943,040 (0.92%)
   Total params: 4,582,543,360
   Memory savings: ~99.1%
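
The trainable parameter count follows directly from the rank and the shapes of the adapted projections. The sketch below reproduces it, assuming the published Llama-3-8B layer dimensions. It also shows that the 0.92% share is measured against a total in which 4-bit weights appear to be counted in packed form. The logical base size of 8,030,261,248 parameters is an assumption taken from the commonly cited model size.

```python
R = 16  # LoRA rank used above
HIDDEN, INTERMEDIATE, LAYERS, KV_DIM = 4096, 14336, 32, 1024

# (in_features, out_features) of each adapted projection
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# Each adapter adds A (r x in) and B (out x r): r * (in + out) parameters
per_layer = sum(R * (fan_in + fan_out) for fan_in, fan_out in target_shapes.values())
trainable = LAYERS * per_layer
print(f"{trainable:,}")  # 41,943,040

# Share against the total printed by the cell (packed 4-bit count)
reported_total = 4_582_543_360
print(f"{trainable / reported_total * 100:.2f}%")  # 0.92%

# Share against the assumed logical parameter count of the base model
logical_total = 8_030_261_248 + trainable
print(f"{trainable / logical_total * 100:.2f}%")  # 0.52%
```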


In [None]:
print("⚙️ Configuring training...")

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,

    # Training schedule
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,          # Effective batch size: 4

    # Optimizer
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    optim="paged_adamw_8bit",
    max_grad_norm=0.3,

    # Precision: mixed precision is left off to match the float32 compute dtype above
    bf16=False,                              # T4 has no native bf16 support
    fp16=False,                              # Left off here; not a 4-bit requirement
    gradient_checkpointing=True,             # Memory savings

    # Logging & saving
    logging_steps=10,
    save_strategy="epoch",
    save_total_limit=2,                      # Keep only 2 checkpoints

    # Other
    report_to="none",                        # No wandb/tensorboard
    remove_unused_columns=False,
)

print(f"✅ Training config ready")
print(f"   Epochs: {training_args.num_train_epochs}")
print(f"   Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"   Learning rate: {training_args.learning_rate}")
print(f"   Total steps: ~{len(dataset) // (training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps) * training_args.num_train_epochs}")

⚙️ Configuring training...
✅ Training config ready
   Epochs: 3
   Effective batch size: 4
   Learning rate: 0.0002
   Total steps: ~552
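
The step estimate above uses floor division, so the two examples left over in each epoch are not counted. The sketch below works out both bounds. Whether the trainer turns the leftover micro-batches into an extra optimizer step depends on the library version, which is not verified here, so the actual count should fall at one of these two values.

```python
import math

examples, per_device_batch, grad_accum, epochs = 738, 1, 4, 3

# Micro-batches the dataloader yields per epoch
micro_batches_per_epoch = math.ceil(examples / per_device_batch)

# Lower bound: a partial accumulation window is dropped
steps_floor = (micro_batches_per_epoch // grad_accum) * epochs

# Upper bound: a partial accumulation window still triggers an optimizer step
steps_ceil = math.ceil(micro_batches_per_epoch / grad_accum) * epochs

print(steps_floor, steps_ceil)  # 552 555
```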


In [None]:
print("👨‍🏫 Creating trainer...")

def formatting_func(example):
    """Format examples for SFTTrainer."""
    messages = [
        {"role": "user", "content": example['user']},
        {"role": "assistant", "content": example['assistant']}
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    return text

# Create trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    formatting_func=formatting_func,
)

print(f"✅ Trainer ready")
print(f"   Dataset size: {len(dataset)}")
print(f"   Training will take ~3-4 hours")

👨‍🏫 Creating trainer...


✅ Trainer ready
   Dataset size: 738
   Training will take ~3-4 hours
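
To make concrete what `formatting_func` hands to the trainer, the sketch below is a hand-written approximation of the Llama 3 Instruct chat layout. The function `llama3_chat_text` is hypothetical and exists only for illustration; the tokenizer's own chat template is authoritative and may differ in details.

```python
def llama3_chat_text(messages, add_generation_prompt=False):
    """Approximate the Llama 3 Instruct layout for a list of role/content dicts."""
    text = "<|begin_of_text|>"
    for msg in messages:
        # Each turn is a role header, a blank line, the content, then an end-of-turn token
        text += f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
        text += f"{msg['content'].strip()}<|eot_id|>"
    if add_generation_prompt:
        # At inference time the prompt ends with an open assistant header
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

# Hypothetical short messages for illustration only
example = [
    {"role": "user", "content": "Review this code."},
    {"role": "assistant", "content": "The y-axis is truncated."},
]
print(llama3_chat_text(example))
```

During training both turns are present and the text ends with the assistant's end-of-turn token. At inference only the user turn is supplied and `add_generation_prompt=True` leaves the assistant header open for the model to complete.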


In [None]:
print("="*80)
print("🚀 STARTING TRAINING ON COMBINED DATASET (738 examples)")
print("="*80)
print(f"   Output: {OUTPUT_DIR}")
print(f"   Estimated time: 3-4 hours")
print(f"   Keep this tab open!")
print("="*80)

# Record start time
import time
start_time = time.time()

# Train!
trainer.train()

# Record end time
end_time = time.time()
duration = end_time - start_time
hours = int(duration // 3600)
minutes = int((duration % 3600) // 60)
seconds = int(duration % 60)

print(f"\n{'='*80}")
print(f"✅ TRAINING COMPLETE!")
print(f"{'='*80}")
print(f"   Duration: {hours}h {minutes}m {seconds}s")
print(f"   Model saved to: {OUTPUT_DIR}")
print(f"{'='*80}")

# Save final model
final_output = f"{OUTPUT_DIR}/final_model"
trainer.model.save_pretrained(final_output)
tokenizer.save_pretrained(final_output)

print(f"\n💾 Final model saved to: {final_output}")

In [None]:
print("📊 Training Results Summary")
print("="*80)

# Get training logs
logs = trainer.state.log_history

# Extract (step, loss) pairs; each logged entry records its own global step
loss_logs = [(log['step'], log['loss']) for log in logs if 'loss' in log]
train_losses = [loss for _, loss in loss_logs]

if train_losses:
    print(f"\nTraining Loss Progression:")
    print(f"   Initial loss: {train_losses[0]:.4f}")
    print(f"   Final loss: {train_losses[-1]:.4f}")
    print(f"   Reduction: {(1 - train_losses[-1]/train_losses[0])*100:.1f}%")

    # Show key milestones (every 50 steps, plus the last logged step)
    print(f"\nLoss at key steps:")
    for i, (step, loss) in enumerate(loss_logs):
        if step % 50 == 0 or i == len(loss_logs) - 1:
            print(f"   Step {step:3d}: {loss:.4f}")
else:
    print("\n⚠️ No loss entries found; run the training cell in this session first")

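# Recompute the save path so this cell does not raise a NameError when the
# training cell has not been run in the current session
final_output = f"{OUTPUT_DIR}/final_model"

# Confirm that a saved model exists before reporting success
if os.path.isdir(final_output):
    print(f"\n✅ Training completed successfully!")
else:
    print(f"\n⚠️ No saved model found at {final_output}")
print(f"   Combined dataset: V1 (300) + V2 (438) = 738 examples")
print(f"   Model location: {final_output}")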

print(f"\n{'='*80}")
print("Next steps:")
print("  1. Test inference (Cell 9)")
print("  2. Compare with V1 model performance")
print("  3. Use best model for demo")
print("="*80)

📊 Training Results Summary

✅ Training completed successfully!
   Combined dataset: V1 (300) + V2 (438) = 738 examples


NameError: name 'final_output' is not defined

In [None]:
print("🧪 Testing Combined Model Inference")
print("="*80)

# CRITICAL: Disable gradient checkpointing for inference
model.config.use_cache = True
if hasattr(model, 'gradient_checkpointing_disable'):
    model.gradient_checkpointing_disable()

# Put model in eval mode
model.eval()

# Test prompt
test_code = """import matplotlib.pyplot as plt

quarters = ['Q1', 'Q2', 'Q3', 'Q4']
profits = [98, 99, 100, 101]

plt.bar(quarters, profits)
plt.ylim(97, 102)
plt.title('MASSIVE Profit Growth!')
plt.show()"""

prompt = f"""Review this Python code for statistical and visualization errors:
```python
{test_code}
```

Identify any issues and provide:
1. Summary of main errors
2. Why they're problematic
3. How to fix them"""

# Format with chat template
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)

print("Generating response (with gradient checkpointing disabled)...")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=400,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        use_cache=True,  # Enable KV cache
        pad_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)

print("\n" + "="*80)
print("MODEL RESPONSE:")
print("="*80)
print(response)
print("="*80)

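# Rough keyword check only: "axis" also matches unrelated remarks (e.g. a missing
# x-axis label), so a pass here does not confirm the truncated y-axis was identified
if "truncat" in response.lower() or "axis" in response.lower() or "ylim" in response.lower():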
    print("\n✅ SUCCESS! Model works correctly!")
else:
    print("\n⚠️ Check response quality")

🧪 Testing Combined Model Inference
Generating response (with gradient checkpointing disabled)...

MODEL RESPONSE:
**Summary of main errors:**

1. The code is missing a label for the x-axis.
2. The y-axis limits are not suitable for the data.
3. The title is not descriptive and may be misleading.

**Why they're problematic:**

1. The x-axis label is missing, making it difficult to understand what the bars represent.
2. The y-axis limits are too narrow and do not provide enough context for the data. The maximum value is 101, but the limits are set to 97-102, which may not accurately represent the data.
3. The title "MASSIVE Profit Growth!" is misleading and may not accurately represent the data. The title should be descriptive and provide context about the plot.

**How to fix them:**

1. Add a label for the x-axis using the `plt.xlabel()` function:
```python
plt.xlabel('Quarter')
```
2. Adjust the y-axis limits to accurately represent the data. For example, you can set the limits to the 