# Data Visualization Critic - Phase 1: Synthetic Data Generation

**COMS 4995 Final Project**

**Team Members:** Dian Jiang, Charles Weber, John Won, and Amir Yaghoobi

---

## Project Overview

**Data Visualization Critic** is an LLM-powered system that automatically reviews data analysis code (Python/R notebooks) and provides expert-level statistical and visualization critiques.

### Phase 1 Goal
Generate 300 high-quality training examples of:
- Flawed data analysis code
- Expert critiques explaining the issues
- Corrected code with best practices

### Key Features
- ✅ **Completely Free** (uses Colab GPU, no API costs)
- ✅ **15 critical error types** (statistical + visualization)
- ✅ **4 realistic domains** (healthcare, business, education, social science)
- ✅ **Uses Llama-3-8B** (open-source, state-of-the-art)

---

## Setup Instructions

### Before Running:
1. **Enable GPU**: Runtime → Change runtime type → GPU (T4)
2. **Accept Llama-3 License**:
   - Go to: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
   - Click "Access gated model" and accept terms
   - Get HuggingFace token: https://huggingface.co/settings/tokens
   - Add to Colab secrets (🔑 icon): name = `HF_TOKEN`, enable notebook access
3. **Keep tab active**: Colab may disconnect if idle

### Estimated Time & Cost
- **Runtime**: ~4-5 hours for 300 examples (the run recorded in this notebook took about 5 hours on a T4)
- **Cost**: $0.00 (uses free Colab GPU)
- **GPU Memory**: ~6-8 GB (fits in free tier)

---

In [None]:
# Install required packages
!pip install -q transformers accelerate bitsandbytes torch datasets

import json
import os
import random
import re
import pandas as pd
import torch
from datetime import datetime
from typing import Dict, List
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from google.colab import userdata, files

# Set random seeds for reproducibility
random.seed(42)
torch.manual_seed(42)

print("✅ Packages installed successfully")
print(f"   PyTorch version: {torch.__version__}")
print(f"   CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"   GPU: {torch.cuda.get_device_name(0)}")

✅ Packages installed successfully
   PyTorch version: 2.9.0+cu126
   CUDA available: True
   GPU: Tesla T4


## Section 2: Error Taxonomy Definition

We define 15 critical error types covering both statistical methodology and visualization best practices.

In [None]:
# Define comprehensive error taxonomy
ALL_ERRORS = {
    # Statistical Errors
    "correlation_causation": {
        "name": "Correlation vs Causation Confusion",
        "description": "Implying causal relationship from correlational data",
        "severity": "critical",
        "principle": "Correlation does not imply causation without experimental design or causal inference methods"
    },
    "simpsons_paradox": {
        "name": "Simpson's Paradox",
        "description": "Aggregate trends that reverse when data is disaggregated",
        "severity": "critical",
        "principle": "Aggregated data can show opposite trends from stratified data"
    },
    "survivorship_bias": {
        "name": "Survivorship Bias",
        "description": "Analyzing only surviving/successful cases, ignoring failures",
        "severity": "critical",
        "principle": "Selection bias from only observing survivors distorts conclusions"
    },
    "confounding_omission": {
        "name": "Omitted Confounding Variables",
        "description": "Failing to control for confounders in observational data",
        "severity": "critical",
        "principle": "Omitted variable bias invalidates causal interpretation"
    },
    "multiple_testing": {
        "name": "Multiple Testing without Correction",
        "description": "Running many statistical tests without adjusting significance levels",
        "severity": "critical",
        "principle": "Family-wise error rate increases with multiple comparisons"
    },
    "p_hacking": {
        "name": "P-hacking / Data Dredging",
        "description": "Selectively reporting significant results or manipulating analysis",
        "severity": "critical",
        "principle": "Selection bias in reporting inflates Type I error rate"
    },
    "regression_to_mean": {
        "name": "Regression to the Mean Misinterpretation",
        "description": "Attributing regression to mean as treatment effect",
        "severity": "warning",
        "principle": "Extreme values naturally regress toward average on retest"
    },
    "base_rate_neglect": {
        "name": "Base Rate Neglect",
        "description": "Ignoring prior probabilities when interpreting results",
        "severity": "warning",
        "principle": "Posterior probability depends on both likelihood and base rate"
    },
    "extrapolation": {
        "name": "Extrapolation Beyond Data Range",
        "description": "Making predictions outside observed data range",
        "severity": "warning",
        "principle": "Model validity is uncertain beyond training data range"
    },
    "assumption_violation": {
        "name": "Statistical Assumption Violation",
        "description": "Using methods when assumptions are violated (normality, independence)",
        "severity": "warning",
        "principle": "Violations of assumptions can invalidate statistical inference"
    },

    # Visualization Errors
    "truncated_axis": {
        "name": "Truncated Y-Axis Manipulation",
        "description": "Starting y-axis at non-zero to exaggerate differences",
        "severity": "critical",
        "principle": "Truncated axes distort visual perception of magnitude"
    },
    "dual_axis_misleading": {
        "name": "Misleading Dual Axes",
        "description": "Using two y-axes with different scales to force correlation",
        "severity": "critical",
        "principle": "Arbitrary axis scaling can create spurious visual relationships"
    },
    "wrong_chart_type": {
        "name": "Inappropriate Chart Type",
        "description": "Using wrong visualization for data type or relationship",
        "severity": "warning",
        "principle": "Chart type should match data structure and analytical goal"
    },
    "overplotting": {
        "name": "Overplotting Without Transparency",
        "description": "Dense scatterplots hiding data density patterns",
        "severity": "warning",
        "principle": "Overlapping points obscure data distribution"
    },
    "missing_uncertainty": {
        "name": "Missing Uncertainty Visualization",
        "description": "Showing point estimates without error bars or confidence intervals",
        "severity": "warning",
        "principle": "Point estimates without uncertainty measures overstate confidence"
    },
}

print(f"✅ Defined {len(ALL_ERRORS)} error types")
print(f"   - Critical errors: {sum(1 for e in ALL_ERRORS.values() if e['severity'] == 'critical')}")
print(f"   - Warning errors: {sum(1 for e in ALL_ERRORS.values() if e['severity'] == 'warning')}")

✅ Defined 15 error types
   - Critical errors: 8
   - Warning errors: 7


## Section 3: Domain Contexts

Define realistic scenarios across different domains to make generated examples diverse and applicable.

In [None]:
DOMAIN_CONTEXTS = [
    {
        "domain": "healthcare",
        "scenarios": [
            "clinical trial comparing drug efficacy",
            "observational study of patient outcomes",
            "disease prevalence analysis across demographics",
            "treatment effectiveness in hospital system"
        ]
    },
    {
        "domain": "business",
        "scenarios": [
            "customer churn prediction analysis",
            "marketing campaign effectiveness study",
            "sales performance across regions",
            "pricing strategy impact analysis"
        ]
    },
    {
        "domain": "education",
        "scenarios": [
            "teaching method effectiveness comparison",
            "student performance prediction",
            "graduation rate analysis by demographics",
            "online vs in-person learning outcomes"
        ]
    },
    {
        "domain": "social_science",
        "scenarios": [
            "social media usage and mental health",
            "income inequality trends",
            "voting behavior analysis",
            "crime rate factors"
        ]
    }
]

print(f"✅ Defined {len(DOMAIN_CONTEXTS)} domain contexts")
for domain in DOMAIN_CONTEXTS:
    print(f"   - {domain['domain']}: {len(domain['scenarios'])} scenarios")

✅ Defined 4 domain contexts
   - healthcare: 4 scenarios
   - business: 4 scenarios
   - education: 4 scenarios
   - social_science: 4 scenarios
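
Section 7 draws a domain for each example with `random.choice`, which leaves the per-domain counts uneven. One alternative, sketched here as an assumption rather than as part of the pipeline, is to walk the domains in shuffled round-robin order so that each error type is spread evenly. The helper name `balanced_assignments` and its `seed` parameter are hypothetical.

```python
import random
from collections import Counter
from typing import List, Tuple


def balanced_assignments(error_types: List[str], domains: List[str],
                         per_error: int, seed: int = 42) -> List[Tuple[str, str]]:
    """Pair each error type with domains in shuffled round-robin order.

    Hypothetical helper: not part of the notebook's generation loop.
    """
    rng = random.Random(seed)  # local RNG so the global seed is untouched
    pairs = []
    for error_type in error_types:
        order = list(domains)
        rng.shuffle(order)  # vary which domain comes first per error type
        for k in range(per_error):
            pairs.append((error_type, order[k % len(order)]))
    return pairs


demo_domains = ["healthcare", "business", "education", "social_science"]
demo_pairs = balanced_assignments(["p_hacking", "overplotting"], demo_domains, per_error=20)
domain_counts = Counter(domain for _, domain in demo_pairs)
print(domain_counts)  # every domain appears the same number of times
```

With 20 examples per error type and 4 domains, each domain receives exactly 5 examples per error type; when `per_error` is not a multiple of the number of domains, the counts differ by at most one.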


## Section 4: Load Llama-3-8B Model

Load the model in 4-bit quantization to fit in Colab's GPU (15GB VRAM).

**Note**: This cell may take 2-3 minutes on first run.
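
As a rough sanity check on why 4-bit loading fits on a T4, the weight memory can be estimated by hand. The sketch below assumes the publicly listed Llama-3-8B figures (about 8.03 billion parameters, a 128,256-token vocabulary, hidden size 4,096) and assumes that only the linear layers inside the transformer blocks are quantized while the input embeddings and output head stay in float16. It ignores activations, the KV cache, and quantization metadata, so treat the result as an approximation.

```python
# Rough weight-memory estimate for a 4-bit quantized Llama-3-8B.
# All three figures are assumptions taken from the public model config,
# not values measured in this notebook.
TOTAL_PARAMS = 8_030_261_248
VOCAB_SIZE = 128_256
HIDDEN_SIZE = 4_096

# Assumption: input embeddings and the output head remain in float16.
fp16_params = 2 * VOCAB_SIZE * HIDDEN_SIZE
quantized_params = TOTAL_PARAMS - fp16_params

fp16_bytes = fp16_params * 2            # 2 bytes per float16 value
quantized_bytes = quantized_params / 2  # 4 bits = 0.5 bytes per value

estimated_gb = (fp16_bytes + quantized_bytes) / 1e9
print(f"Estimated weight memory: {estimated_gb:.2f} GB")
```

The estimate lands between 5 and 6 GB, which leaves headroom on a 15 GB GPU for activations and the generation cache.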

In [None]:
print("📥 Loading Llama-3-8B-Instruct (4-bit quantized)...")
print("   This may take 2-3 minutes...\n")

# 4-bit quantization configuration (reduces memory usage)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"

try:
    # Try to get HF token from Colab secrets
    hf_token = userdata.get('HF_TOKEN')
    print("✅ HuggingFace token loaded from secrets")
except Exception:
    # Secret missing or notebook access not granted; continue without a token
    print("⚠️  No HF_TOKEN found in secrets")
    print("   If model loading fails, add your token:")
    print("   1. Get token: https://huggingface.co/settings/tokens")
    print("   2. Add to Colab secrets (🔑 icon): HF_TOKEN\n")
    hf_token = None

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    token=hf_token
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    token=hf_token,
    trust_remote_code=True
)

print("\n✅ Model loaded successfully!")
print(f"   Model memory: {model.get_memory_footprint() / 1e9:.2f} GB")
print(f"   Device: {model.device}")

📥 Loading Llama-3-8B-Instruct (4-bit quantized)...
   This may take 2-3 minutes...

✅ HuggingFace token loaded from secrets

✅ Model loaded successfully!
   Model memory: 5.59 GB
   Device: cuda:0


## Section 5: Prompt Template

Define the prompt template that instructs the LLM to generate high-quality training examples.

In [None]:
def create_generation_prompt(error_type: str, error_info: Dict, domain_context: Dict,
                             language: str = "python", complexity: str = "intermediate") -> str:
    """
    Create a detailed prompt for the LLM to generate a training example.
    """
    scenario = random.choice(domain_context['scenarios'])

    prompt = f"""You are an expert data scientist creating training examples for an automated code review system that detects statistical and visualization errors.

**Task**: Generate a realistic data analysis code snippet that demonstrates the following error:

**Error Type**: {error_info['name']}
**Description**: {error_info['description']}
**Severity**: {error_info['severity']}
**Statistical Principle**: {error_info['principle']}

**Context**:
- Domain: {domain_context['domain']}
- Scenario: {scenario}
- Programming Language: {language}
- Complexity Level: {complexity}

**Requirements**:

1. **Generate Flawed Code** (15-25 lines):
   - Create realistic synthetic data generation code
   - Include exploratory analysis that commits the specified error
   - Make the error subtle but consequential
   - Use realistic variable names from the domain
   - Include comments showing the analyst's incorrect reasoning
   - Add a conclusion comment stating the flawed interpretation

2. **Expert Critique** (2-3 paragraphs):
   - Identify the specific error with line numbers
   - Explain WHY this is problematic
   - Describe potential consequences
   - Use clear, pedagogical language

3. **Corrected Code** (15-25 lines):
   - Fix the error appropriately
   - Add comments explaining the correct approach
   - Include proper statistical checks
   - Show best practices

4. **Metadata**:
   - List 2-3 learning resources

**Output Format** (JSON):
```json
{{
  "error_type": "{error_type}",
  "severity": "{error_info['severity']}",
  "domain": "{domain_context['domain']}",
  "scenario": "{scenario}",
  "language": "{language}",
  "complexity": "{complexity}",
  "flawed_code": "# Complete code here",
  "critique": {{
    "summary": "Brief 1-sentence summary",
    "detailed_explanation": "2-3 paragraph detailed critique",
    "line_numbers": [10, 15],
    "consequences": "What could go wrong"
  }},
  "corrected_code": "# Complete corrected code here",
  "learning_resources": ["Resource 1", "Resource 2", "Resource 3"],
  "principle": "{error_info['principle']}"
}}
```

Generate the complete JSON now:"""

    return prompt

print("✅ Prompt template created")

✅ Prompt template created


## Section 6: Generation Functions

Functions to generate synthetic training examples using the loaded model.

In [None]:
def generate_with_llm_plain(prompt: str, max_tokens: int = 2000) -> str:
    """
    Generate plain text (not JSON) using Llama-3.
    """
    messages = [
        {"role": "system", "content": "You are an expert statistician and data visualization specialist."},
        {"role": "user", "content": prompt}
    ]

    formatted_prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )

    generated_text = tokenizer.decode(
        outputs[0][inputs['input_ids'].shape[1]:],
        skip_special_tokens=True
    )

    return generated_text


def create_simple_prompt(error_type: str, error_info: Dict, domain_context: Dict) -> str:
    """
    Simpler prompt - just ask for the code, we'll structure it ourselves.
    """
    scenario = random.choice(domain_context['scenarios'])

    prompt = f"""Create a Python code example that demonstrates this statistical error: {error_info['name']}

Context: {domain_context['domain']} domain, specifically about {scenario}

Generate:
1. FLAWED CODE (15-25 lines):
   - Generate synthetic data
   - Perform analysis that commits this error: {error_info['description']}
   - Include comments showing incorrect reasoning
   - End with a flawed conclusion

2. EXPLANATION (2-3 paragraphs):
   - Explain why this code is problematic
   - Reference this principle: {error_info['principle']}
   - Describe potential consequences

3. CORRECTED CODE (15-25 lines):
   - Fix the error appropriately
   - Add comments explaining the correct approach
   - Show best practices

Format your response with clear sections."""

    return prompt


def parse_free_form_response(response: str, error_type: str, error_info: Dict,
                             domain: str, scenario: str) -> Dict:
    """
    Parse free-form response into structured format.
    """
    # Extract Python code blocks
    code_blocks = re.findall(r'```python\n(.*?)```', response, re.DOTALL)

    if not code_blocks:
        # Try without language specifier
        code_blocks = re.findall(r'```\n(.*?)```', response, re.DOTALL)

    # Assign code blocks
    if len(code_blocks) >= 2:
        flawed_code = code_blocks[0].strip()
        corrected_code = code_blocks[-1].strip()  # Last code block
    elif len(code_blocks) == 1:
        # Only one code block - use it as flawed, generate simple corrected
        flawed_code = code_blocks[0].strip()
        corrected_code = "# Corrected version\n" + flawed_code.replace("# Flawed", "# Fixed")
    else:
        # No code blocks found - create basic template
        flawed_code = f"# Example code with {error_info['name']}\nimport pandas as pd\nimport numpy as np\n\n# TODO: Generated code here"
        corrected_code = "# Corrected version\n" + flawed_code

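    # Extract explanation (text between code blocks or after first)
    if len(code_blocks) >= 2:
        start_idx = response.find(code_blocks[0]) + len(code_blocks[0])
        end_idx = response.rfind(code_blocks[-1])  # last occurrence of the last block
        explanation = response[start_idx:end_idx]
    elif code_blocks:
        # Take text after the only code block
        idx = response.find(code_blocks[0]) + len(code_blocks[0])
        explanation = response[idx:]
    else:
        explanation = response

    # Clean explanation. The slice starts at the first block's closing fence and,
    # with 2+ blocks, ends at the last block's opening fence. Trim those dangling
    # fences first; otherwise the paired-fence regex below would match from one
    # to the other and delete the prose in between.
    explanation = re.sub(r'^\s*```', '', explanation)
    explanation = re.sub(r'```\w*\s*$', '', explanation)
    explanation = re.sub(r'```.*?```', '', explanation, flags=re.DOTALL).strip()
    explanation = explanation[:1000]  # Limit length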

    if len(explanation) < 100:
        explanation = f"This code demonstrates {error_info['name']}. {error_info['principle']} In the context of {scenario}, this error could lead to incorrect conclusions and flawed decision-making."

    return {
        "error_type": error_type,
        "severity": error_info['severity'],
        "domain": domain,
        "scenario": scenario,
        "language": "python",
        "complexity": "intermediate",
        "flawed_code": flawed_code,
        "critique": {
            "summary": f"Code demonstrates {error_info['name']} in {domain} context",
            "detailed_explanation": explanation,
            "line_numbers": [10, 15],
            "consequences": f"Could lead to incorrect conclusions in {scenario}"
        },
        "corrected_code": corrected_code,
        "learning_resources": [
            "Statistical Inference by Casella & Berger",
            "Causal Inference: The Mixtape by Scott Cunningham",
            error_info['principle']
        ],
        "principle": error_info['principle'],
        "generated_at": datetime.now().isoformat(),
        "model": "llama-3-8b-instruct"
    }


def generate_single_example(error_type: str, domain: str, language: str = "python",
                           complexity: str = "intermediate") -> Dict:
    """
    Generate example using simpler approach - no JSON required from model.
    """
    error_info = ALL_ERRORS[error_type]
    domain_context = next(d for d in DOMAIN_CONTEXTS if d['domain'] == domain)
    scenario = random.choice(domain_context['scenarios'])

    try:
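        # Create simple prompt. Pin the context to the scenario chosen above so the
        # prompt and the saved metadata agree (create_simple_prompt would otherwise
        # draw its own scenario at random).
        pinned_context = {**domain_context, 'scenarios': [scenario]}
        prompt = create_simple_prompt(error_type, error_info, pinned_context)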

        # Generate free-form text
        response = generate_with_llm_plain(prompt, max_tokens=1500)

        # Parse into structured format ourselves
        result = parse_free_form_response(response, error_type, error_info, domain, scenario)

        # Basic validation
        if len(result['flawed_code']) < 50:
            print("⚠️ Short code, ", end="")
            return None

        return result

    except Exception as e:
        print(f"\n❌ Error: {str(e)[:100]}")
        import traceback
        traceback.print_exc()
        return None


print("✅ Simplified generation functions ready")
print("   - No JSON required from model")
print("   - Extracts code blocks with regex")
print("   - Much more reliable!")

✅ Simplified generation functions ready
   - No JSON required from model
   - Extracts code blocks with regex
   - Much more reliable!
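
The parser above falls back to placeholder code when no code block is found and reuses the flawed code as the "corrected" version when only one block is found, and both cases pass the 50-character length check. A stricter filter could look like the following sketch. The function name `find_quality_issues` and the specific checks are assumptions for illustration, not part of the notebook; the identity check would also miss a copy that differs only by the `# Flawed` to `# Fixed` comment swap.

```python
import ast
from typing import Dict, List

# Marker text used by the parser's fallback template.
PLACEHOLDER_MARKER = "TODO: Generated code here"


def find_quality_issues(example: Dict) -> List[str]:
    """Return reasons an example looks unusable (an empty list means OK).

    Hypothetical helper: the checks are illustrative, not exhaustive.
    """
    issues = []
    flawed = example.get("flawed_code", "")
    corrected = example.get("corrected_code", "")

    if PLACEHOLDER_MARKER in flawed or PLACEHOLDER_MARKER in corrected:
        issues.append("contains fallback placeholder code")

    # The single-block fallback prepends a header to a copy of the flawed code.
    if corrected.replace("# Corrected version\n", "", 1).strip() == flawed.strip():
        issues.append("corrected code is identical to flawed code")

    # Syntax check only: the code is parsed, never executed.
    for label, code in (("flawed", flawed), ("corrected", corrected)):
        try:
            ast.parse(code)
        except SyntaxError:
            issues.append(f"{label} code is not valid Python syntax")
    return issues


good = {"flawed_code": "x = 1\nprint(x)", "corrected_code": "x = 2\nprint(x)"}
bad = {"flawed_code": "x = (1", "corrected_code": "# Corrected version\nx = (1"}
print(find_quality_issues(good))
print(find_quality_issues(bad))
```

Examples that return a non-empty list could be dropped or regenerated before they reach the training set.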


In [None]:
# Mount Google Drive for persistent storage
from google.colab import drive
drive.mount('/content/drive')

# Create project folder in Drive
import os
project_folder = '/content/drive/MyDrive/DataVizCritic'
os.makedirs(project_folder, exist_ok=True)

print("✅ Google Drive mounted")
print(f"📁 Files will be saved to: {project_folder}")
print("   This folder persists even after disconnect!")

# Update save path
SAVE_PATH = f"{project_folder}/training_data.jsonl"
CSV_PATH = f"{project_folder}/training_data.csv"

Mounted at /content/drive
✅ Google Drive mounted
📁 Files will be saved to: /content/drive/MyDrive/DataVizCritic
   This folder persists even after disconnect!


## Section 7: Generate Dataset

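Generate 300 training examples (20 per error type).

**Note**: Progress is checkpointed to disk every 10 examples, so partial results survive a disconnect. As written, `generate_dataset` starts from an empty list and overwrites the checkpoint file, so rerunning it restarts generation rather than resuming.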
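
A resume step would need to read the existing checkpoint and work out how many examples each error type still needs. The sketch below shows one way to do that with the JSONL format used in this notebook. The helper names `load_checkpoint` and `remaining_quota` are hypothetical, and wiring them into `generate_dataset` is left as an assumption.

```python
import json
import os
import tempfile
from collections import Counter
from typing import Dict, List, Tuple


def load_checkpoint(path: str) -> Tuple[List[Dict], Counter]:
    """Read previously saved examples and count them per error type."""
    examples: List[Dict] = []
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    examples.append(json.loads(line))
    done = Counter(ex["error_type"] for ex in examples)
    return examples, done


def remaining_quota(done: Counter, error_type: str, per_error: int) -> int:
    """Number of examples of this error type that are still needed."""
    return max(per_error - done[error_type], 0)


# Demo with a temporary checkpoint holding two saved examples.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "training_data.jsonl")
    with open(demo_path, "w") as f:
        f.write(json.dumps({"error_type": "p_hacking"}) + "\n")
        f.write(json.dumps({"error_type": "p_hacking"}) + "\n")
    saved, done = load_checkpoint(demo_path)

print(len(saved), remaining_quota(done, "p_hacking", per_error=20))
```

A generation loop that seeds its list with `saved` and iterates `remaining_quota(...)` times per error type would pick up where the last checkpoint stopped.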

In [None]:
# ============================================================================
# TEST MODE: Generate 3 examples to verify everything works
# ============================================================================

print("🧪 TEST MODE: Generating 3 test examples...")
print("   This will take ~2-3 minutes\n")

test_examples = []
test_configs = [
    ("correlation_causation", "healthcare"),
    ("simpsons_paradox", "business"),
    ("truncated_axis", "education")
]

for i, (error_type, domain) in enumerate(test_configs):
    print(f"[{i+1}/3] Testing {error_type} in {domain}...", end=" ", flush=True)

    example = generate_single_example(
        error_type=error_type,
        domain=domain,
        language="python",
        complexity="intermediate"
    )

    if example:
        test_examples.append(example)
        print("✅")
    else:
        print("❌")

print(f"\n{'='*80}")
print(f"TEST RESULTS:")
print(f"{'='*80}")
print(f"  Successful: {len(test_examples)}/3")
print(f"  Failed: {3 - len(test_examples)}/3")

if len(test_examples) >= 2:
    print(f"\n✅ TEST PASSED!")
    print(f"   Success rate: {len(test_examples)/3*100:.0f}%")
    print(f"   Ready to generate full dataset")

    # Show one example
    if test_examples:
        print(f"\n📋 Sample Generated Example:")
        print(f"   Error: {test_examples[0]['error_type']}")
        print(f"   Domain: {test_examples[0]['domain']}")
        print(f"   Flawed code length: {len(test_examples[0]['flawed_code'])} chars")
        print(f"   Corrected code length: {len(test_examples[0]['corrected_code'])} chars")
        print(f"   Has critique: {'critique' in test_examples[0]}")
else:
    print(f"\n❌ TEST FAILED!")
    print(f"   Success rate too low: {len(test_examples)/3*100:.0f}%")
    print(f"\n🔍 Troubleshooting:")
    print(f"   1. Check GPU is enabled (Runtime → Change runtime type)")
    print(f"   2. Check model loaded correctly (Section 4)")
    print(f"   3. Read the error messages and tracebacks printed above")
    print(f"\n⚠️  Do NOT proceed to full generation until tests pass!")

print(f"{'='*80}\n")

# Save test examples
if test_examples:
    with open("test_examples.jsonl", "w") as f:
        for ex in test_examples:
            f.write(json.dumps(ex) + '\n')
    print("💾 Test examples saved to test_examples.jsonl")

🧪 TEST MODE: Generating 3 test examples...
   This will take ~2-3 minutes

[1/3] Testing correlation_causation in healthcare... ✅
[2/3] Testing simpsons_paradox in business... ✅
[3/3] Testing truncated_axis in education... ✅

TEST RESULTS:
  Successful: 3/3
  Failed: 0/3

✅ TEST PASSED!
   Success rate: 100%
   Ready to generate full dataset

📋 Sample Generated Example:
   Error: correlation_causation
   Domain: healthcare
   Flawed code length: 888 chars
   Corrected code length: 1164 chars
   Has critique: True

💾 Test examples saved to test_examples.jsonl


In [None]:
def generate_dataset(n_examples: int = 300,
                    languages: List[str] = ["python"],
                    complexities: List[str] = ["intermediate"],
                    save_path: str = None) -> pd.DataFrame:
    """
    Generate a complete training dataset.
    """
    # Use Drive path if not specified
    if save_path is None:
        save_path = SAVE_PATH

    examples = []
    error_types = list(ALL_ERRORS.keys())
    domains = [d['domain'] for d in DOMAIN_CONTEXTS]

    examples_per_error = n_examples // len(error_types)

    print(f"🚀 Generating {n_examples} training examples")
    print(f"   - {len(error_types)} error types × ~{examples_per_error} each")
    print(f"   - Estimated time: ~4 hours")
    print(f"   - Cost: $0.00 (FREE!)\n")

    successful = 0
    failed = 0

    for i, error_type in enumerate(error_types):
        error_name = ALL_ERRORS[error_type]['name']
        print(f"\n📊 [{i+1}/{len(error_types)}] {error_name}")

        for j in range(examples_per_error):
            language = random.choice(languages)
            complexity = random.choice(complexities)
            domain = random.choice(domains)

            print(f"  [{j+1}/{examples_per_error}] {domain:15s} ", end="", flush=True)

            example = generate_single_example(
                error_type=error_type,
                domain=domain,
                language=language,
                complexity=complexity
            )

            if example:
                examples.append(example)
                successful += 1
                print("✅")
            else:
                failed += 1
                print("❌")

            # Save checkpoint every 10 examples
            if len(examples) % 10 == 0 and len(examples) > 0:
                with open(save_path, 'w') as f:
                    for ex in examples:
                        f.write(json.dumps(ex) + '\n')
                print(f"\n  💾 Checkpoint saved: {len(examples)} examples")

    # Final save
    with open(save_path, 'w') as f:
        for ex in examples:
            f.write(json.dumps(ex) + '\n')

    df = pd.DataFrame(examples)

    print(f"\n" + "="*80)
    print(f"✅ GENERATION COMPLETE!")
    print(f"="*80)
    print(f"  Total examples: {len(examples)}")
    print(f"  Successful: {successful}")
    print(f"  Failed: {failed}")
    # max(..., 1) avoids division by zero when no examples were attempted
    print(f"  Success rate: {successful / max(successful + failed, 1) * 100:.1f}%")
    print(f"  Saved to: {save_path}")
    print(f"="*80)

    return df


print("✅ Dataset generation function ready")
print("\n⚠️  Next cell will start generation (takes 1-2 hours)")
print("   Keep this Colab tab open to prevent disconnection")

✅ Dataset generation function ready

⚠️  Next cell will start generation (takes 1-2 hours)
   Keep this Colab tab open to prevent disconnection


## Section 8: Run Generation

**Keep the tab active!**

Progress is checkpointed every 10 examples, so partial results survive a disconnect. Rerunning the cell restarts generation from scratch.

In [None]:
# Generate the dataset - saves to Google Drive
training_df = generate_dataset(
    n_examples=300,
    languages=["python"],
    complexities=["intermediate"],
    save_path=SAVE_PATH  # Save to Drive
)

# Save CSV to Drive too
training_df.to_csv(CSV_PATH, index=False)

print(f"\n✅ Files saved to Google Drive:")
print(f"   - {SAVE_PATH}")
print(f"   - {CSV_PATH}")
print(f"\n💡 These files persist even if Colab disconnects!")

🚀 Generating 300 training examples
   - 15 error types × ~20 each
   - Estimated time: 1-2 hours
   - Cost: $0.00 (FREE!)


📊 [1/15] Correlation vs Causation Confusion
  [1/20] education       ✅
  [2/20] healthcare      ✅
  [3/20] business        ✅
  [4/20] business        ✅
  [5/20] social_science  ✅
  [6/20] education       ✅
  [7/20] education       ✅
  [8/20] healthcare      ✅
  [9/20] business        ✅
  [10/20] healthcare      ✅

  💾 Checkpoint saved: 10 examples
  [11/20] social_science  ✅
  [12/20] business        ✅
  [13/20] business        ✅
  [14/20] education       ✅
  [15/20] social_science  ✅
  [16/20] business        ✅
  [17/20] education       ✅
  [18/20] social_science  ✅
  [19/20] healthcare      ✅
  [20/20] social_science  ✅

  💾 Checkpoint saved: 20 examples

📊 [2/15] Simpson's Paradox
  [1/20] education       ✅
  [2/20] healthcare      ✅
  [3/20] healthcare      ✅
  [4/20] business        ✅
  [5/20] ed

## Section 9: Quality Inspection

Inspect generated examples to verify quality.

In [None]:
def inspect_example(df: pd.DataFrame, idx: int = 0):
    """
    Pretty-print a training example for manual inspection.
    """
    example = df.iloc[idx]

    print("="*80)
    print(f"EXAMPLE {idx + 1}")
    print("="*80)
    print(f"Error Type: {example['error_type']}")
    print(f"Severity: {example['severity']}")
    print(f"Domain: {example['domain']}")
    print(f"Language: {example['language']}")
    print(f"\nScenario: {example['scenario']}")
    print("\n" + "-"*80)
    print("FLAWED CODE:")
    print("-"*80)
    print(example['flawed_code'])
    print("\n" + "-"*80)
    print("CRITIQUE:")
    print("-"*80)
    print(f"Summary: {example['critique']['summary']}")
    print(f"\n{example['critique']['detailed_explanation']}")
    print(f"\nAffected lines: {example['critique']['line_numbers']}")
    print(f"\nConsequences: {example['critique']['consequences']}")
    print("\n" + "-"*80)
    print("CORRECTED CODE:")
    print("-"*80)
    print(example['corrected_code'])
    print("\n" + "-"*80)
    print("LEARNING RESOURCES:")
    print("-"*80)
    for resource in example['learning_resources']:
        print(f"  • {resource}")
    print("="*80)


# Inspect first 3 examples
print("Inspecting first 3 examples...\n")
for i in range(min(3, len(training_df))):
    inspect_example(training_df, i)
    print("\n\n")

Inspecting first 3 examples...

EXAMPLE 1
Error Type: correlation_causation
Severity: critical
Domain: education
Language: python

Scenario: student performance prediction

--------------------------------------------------------------------------------
FLAWED CODE:
--------------------------------------------------------------------------------
import pandas as pd
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(0)
students = pd.DataFrame({'math_score': np.random.normal(70, 10, 100), 
                        'reading_score': np.random.normal(80, 12, 100), 
                         'gpa': np.random.normal(3.5, 0.5, 100)})

# Correlation analysis
corr_matrix = students.corr()
print(corr_matrix)

# Plot the correlation matrix
plt.figure(figsize=(8, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', square=True)
plt.show()

# Implying causal relationship
print("Based on the correlation analysis, it appears that math score has a strong positive correlation w

## Section 10: Dataset Statistics

In [None]:
def analyze_dataset(df: pd.DataFrame):
    """
    Generate comprehensive statistics about the dataset.
    """
    print("="*80)
    print("DATASET ANALYSIS")
    print("="*80)

    print(f"\n📊 OVERALL STATISTICS:")
    print(f"  Total examples: {len(df)}")
    print(f"  Unique error types: {df['error_type'].nunique()}")
    print(f"  Date range: {df['generated_at'].min()} to {df['generated_at'].max()}")

    print(f"\n📈 DISTRIBUTION BY SEVERITY:")
    print(df['severity'].value_counts().to_string())

    print(f"\n💻 DISTRIBUTION BY LANGUAGE:")
    print(df['language'].value_counts().to_string())

    print(f"\n🌍 DISTRIBUTION BY DOMAIN:")
    print(df['domain'].value_counts().to_string())

    print(f"\n🎯 ERROR TYPES (by frequency):")
    print(df['error_type'].value_counts().to_string())

    # Code length statistics
    df['flawed_code_lines'] = df['flawed_code'].str.count('\n')
    df['corrected_code_lines'] = df['corrected_code'].str.count('\n')

    print(f"\n📏 CODE LENGTH STATISTICS:")
    print(f"  Flawed code (avg lines): {df['flawed_code_lines'].mean():.1f}")
    print(f"  Flawed code (range): {df['flawed_code_lines'].min()}-{df['flawed_code_lines'].max()}")
    print(f"  Corrected code (avg lines): {df['corrected_code_lines'].mean():.1f}")
    print(f"  Corrected code (range): {df['corrected_code_lines'].min()}-{df['corrected_code_lines'].max()}")

    print("="*80)


analyze_dataset(training_df)

DATASET ANALYSIS

📊 OVERALL STATISTICS:
  Total examples: 300
  Unique error types: 15
  Date range: 2025-12-14T19:12:40.368069 to 2025-12-15T00:12:55.750133

📈 DISTRIBUTION BY SEVERITY:
severity
critical    160

💻 DISTRIBUTION BY LANGUAGE:
language
python    300

🌍 DISTRIBUTION BY DOMAIN:
domain
social_science    90
healthcare        75
education         72
business          63

🎯 ERROR TYPES (by frequency):
error_type
correlation_causation    20
simpsons_paradox         20
survivorship_bias        20
confounding_omission     20
multiple_testing         20
p_hacking                20
regression_to_mean       20
base_rate_neglect        20
extrapolation            20
assumption_violation     20
truncated_axis           20
dual_axis_misleading     20
wrong_chart_type         20
overplotting             20
missing_uncertainty      20

📏 CODE LENGTH STATISTICS:
  Flawed code (avg lines): 21.2
  Flawed code (range): 4-37
  Corrected code (avg lines): 23.1
  Corrected code

## Phase 1 Complete! 🎉

- Generated 300 high-quality training examples
- Covered 15 critical error types
- Used open-source Llama-3-8B
- Created diverse examples across 4 domains

### Dataset Contents:
Each example includes:
1. **Flawed Code**: Realistic Python code with statistical/visualization error
2. **Expert Critique**: Detailed explanation of the issue
3. **Corrected Code**: Fixed version with best practices
4. **Learning Resources**: References for further study

### Next Steps:
1. **Review examples**: Verify quality with team
2. **Phase 2**: Fine-tune Llama-3-8B with LoRA on this dataset
3. **Phase 3**: Build evaluation pipeline and demo

---