# Charge Classifier: Data Analysis, Baseline Evaluation, and Fine-Tuning

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jli-together/colab-test/blob/main/Charge_Classifier_Analysis_and_Finetuning.ipynb)

This notebook walks through five steps:
1. **Data Analysis**: Explore the charge classifier dataset
2. **Train/Val Split**: Create stratified splits for training and evaluation
3. **Baseline Evaluation**: Test baseline models on the validation set
4. **Fine-Tuning**: Train a specialized classifier model
5. **Comparison**: Compare fine-tuned model with baseline performance

**Dataset**: `CHARGE_CLASSIFIER_JUDGE_251111_NO_SYNTH_COUNTY_CRIM_MINIMAL_PROMPT_train.jsonl`

**Key Features**:
- Testing mode (10 examples) to verify workflow before full run
- Full mode (8k+ examples) for complete analysis
- Baseline vs fine-tuned model comparison

## 📦 Setup and Installation

In [None]:
%pip install -qU together datasets matplotlib seaborn pandas tqdm scikit-learn

In [None]:
import together
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm
import json
import numpy as np
from collections import defaultdict
from sklearn.model_selection import train_test_split
import os

# Initialize Together client
client = together.Client()

# Set style for visualizations
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Configuration
TESTING_MODE = True  # Set to False for full dataset (8k+ samples)
TESTING_SAMPLE_SIZE = 10  # Number of examples to use in testing mode

# Option 1: Use uploaded file from Together AI (for Colab/cloud environments)
USE_UPLOADED_FILE = True  # Set to True if file is already uploaded to Together AI
UPLOADED_FILE_ID = "file-67114292-64db-484b-ad28-53b764c1566d"  # Your uploaded file ID

# Option 2: Use local file path (for local environments)
DATASET_PATH = "CHARGE_CLASSIFIER_JUDGE_251111_NO_SYNTH_COUNTY_CRIM_MINIMAL_PROMPT_train.jsonl"

print("🔧 Configuration:")
print(f"   Testing Mode: {TESTING_MODE}")
if TESTING_MODE:
    print(f"   Sample Size: {TESTING_SAMPLE_SIZE} examples")
else:
    print(f"   Full Dataset Mode")
if USE_UPLOADED_FILE:
    print(f"   Using uploaded file from Together AI: {UPLOADED_FILE_ID}")
else:
    print(f"   Using local file: {DATASET_PATH}")

## 📊 Load and Explore Dataset

In [None]:
# Load the dataset
if USE_UPLOADED_FILE:
    # Download file from Together AI
    print(f"Downloading dataset from Together AI (file ID: {UPLOADED_FILE_ID})...")
    local_file_path = "downloaded_dataset.jsonl"
    try:
        client.files.retrieve_content(UPLOADED_FILE_ID, output=local_file_path)
        print(f"✓ Downloaded file to {local_file_path}")
        DATASET_PATH = local_file_path
    except Exception as e:
        print(f"✗ Error downloading file: {str(e)}")
        print("   Falling back to local file path")
        USE_UPLOADED_FILE = False

if not USE_UPLOADED_FILE:
    print(f"Loading dataset from {DATASET_PATH}...")

data = []
with open(DATASET_PATH, 'r', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if line:  # Skip blank lines (e.g. a trailing newline at the end of the file)
            data.append(json.loads(line))

print(f"✓ Loaded {len(data)} examples")

# Apply testing mode if enabled
if TESTING_MODE:
    print(f"\n🧪 TESTING MODE: Using first {TESTING_SAMPLE_SIZE} examples")
    data = data[:TESTING_SAMPLE_SIZE]
    print(f"✓ Reduced to {len(data)} examples for testing")

# Convert to DataFrame for easier analysis
df = pd.DataFrame(data)

print(f"\nDataset structure:")
print(f"  Columns: {list(df.columns)}")
print(f"  Total examples: {len(df)}")
print(f"\nFirst example:")
print(json.dumps(data[0], indent=2))
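
Testing mode keeps only the first `TESTING_SAMPLE_SIZE` rows, so a small sample can end up with a single label, which would break the stratified split later on. The sketch below shows one way to draw a small sample that keeps every label represented. `label_balanced_sample` is a hypothetical helper (not part of this notebook), and it assumes each record is a dict with its label under `completion`.

```python
import random
from collections import defaultdict


def label_balanced_sample(records, n, label_key="completion", seed=42):
    """Pick up to n records, cycling through labels so each one is represented.

    Hypothetical helper: assumes every record is a dict holding its label
    under `label_key`.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[label_key]].append(rec)

    # Shuffle within each label, then take records round-robin across labels.
    for group in by_label.values():
        rng.shuffle(group)
    groups = list(by_label.values())

    sample = []
    i = 0
    while len(sample) < n and any(groups):
        group = groups[i % len(groups)]
        if group:
            sample.append(group.pop())
        i += 1
    return sample


# Toy data: 8 "Valid" rows followed by 2 "Error" rows.
toy = [{"prompt": f"p{i}", "completion": "Valid"} for i in range(8)]
toy += [{"prompt": f"q{i}", "completion": "Error"} for i in range(2)]

picked = label_balanced_sample(toy, 4)
print(sorted(r["completion"] for r in picked))
```

Note that this balances labels rather than preserving their original proportions, which is usually acceptable for a quick workflow check but not for reporting metrics.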

## 🔍 Data Analysis

In [None]:
# Analyze completion distribution
completion_counts = df['completion'].value_counts()
completion_pct = df['completion'].value_counts(normalize=True) * 100

print("="*80)
print("📊 COMPLETION DISTRIBUTION")
print("="*80)
print(f"\nCounts:")
for label, count in completion_counts.items():
    print(f"  {label}: {count} ({completion_pct[label]:.2f}%)")

# Visualize distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart
completion_counts.plot(kind='bar', ax=ax1, color=['#2ecc71', '#e74c3c'])
ax1.set_title('Completion Distribution (Counts)', fontsize=14, fontweight='bold')
ax1.set_xlabel('Completion Label', fontsize=12)
ax1.set_ylabel('Count', fontsize=12)
ax1.tick_params(axis='x', rotation=0)

# Pie chart
completion_counts.plot(kind='pie', ax=ax2, autopct='%1.1f%%',
                       colors=['#2ecc71', '#e74c3c'], startangle=90)
ax2.set_title('Completion Distribution (Percentages)', fontsize=14, fontweight='bold')
ax2.set_ylabel('')

plt.tight_layout()
plt.show()

print(f"\n✓ Dataset is {'balanced' if abs(completion_pct.iloc[0] - 50) < 10 else 'imbalanced'}")

In [None]:
# Analyze prompt characteristics
df['prompt_length'] = df['prompt'].str.len()
df['prompt_word_count'] = df['prompt'].str.split().str.len()

print("="*80)
print("📊 PROMPT STATISTICS")
print("="*80)
print(f"\nPrompt Length (characters):")
print(f"  Mean: {df['prompt_length'].mean():.1f}")
print(f"  Median: {df['prompt_length'].median():.1f}")
print(f"  Min: {df['prompt_length'].min()}")
print(f"  Max: {df['prompt_length'].max()}")

print(f"\nPrompt Word Count:")
print(f"  Mean: {df['prompt_word_count'].mean():.1f}")
print(f"  Median: {df['prompt_word_count'].median():.1f}")
print(f"  Min: {df['prompt_word_count'].min()}")
print(f"  Max: {df['prompt_word_count'].max()}")

# Visualize prompt length distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

df['prompt_length'].hist(bins=30, ax=ax1, color='skyblue', edgecolor='black')
ax1.set_title('Prompt Length Distribution (Characters)', fontsize=14, fontweight='bold')
ax1.set_xlabel('Character Count', fontsize=12)
ax1.set_ylabel('Frequency', fontsize=12)

df['prompt_word_count'].hist(bins=30, ax=ax2, color='lightcoral', edgecolor='black')
ax2.set_title('Prompt Word Count Distribution', fontsize=14, fontweight='bold')
ax2.set_xlabel('Word Count', fontsize=12)
ax2.set_ylabel('Frequency', fontsize=12)

plt.tight_layout()
plt.show()

In [None]:
# Analyze by completion label
print("="*80)
print("📊 STATISTICS BY COMPLETION LABEL")
print("="*80)

for label in df['completion'].unique():
    label_df = df[df['completion'] == label]
    print(f"\n{label} ({len(label_df)} examples):")
    print(f"  Avg prompt length: {label_df['prompt_length'].mean():.1f} chars")
    print(f"  Avg word count: {label_df['prompt_word_count'].mean():.1f} words")

# Box plot comparison
fig, ax = plt.subplots(figsize=(10, 6))
df.boxplot(column='prompt_length', by='completion', ax=ax)
ax.set_title('Prompt Length by Completion Label', fontsize=14, fontweight='bold')
ax.set_xlabel('Completion Label', fontsize=12)
ax.set_ylabel('Prompt Length (characters)', fontsize=12)
plt.suptitle('')  # Remove default title
plt.tight_layout()
plt.show()

## 🎲 Create Train/Validation Split

In [None]:
# Create stratified train/validation split
# Use 80/20 split for train/val
TEST_SIZE = 0.2
RANDOM_STATE = 42

print("Creating stratified train/validation split...")
print("="*80)

# Stratified split to maintain label distribution
train_df, val_df = train_test_split(
    df,
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE,
    stratify=df['completion']  # Maintain label distribution
)

print(f"✓ TRAIN dataset: {len(train_df)} examples ({len(train_df)/len(df)*100:.1f}%)")
print(f"✓ VALIDATION dataset: {len(val_df)} examples ({len(val_df)/len(df)*100:.1f}%)")

# Show distribution in each split
print(f"\n📊 Train set distribution:")
train_dist = train_df['completion'].value_counts()
for label, count in train_dist.items():
    pct = count / len(train_df) * 100
    print(f"  {label}: {count} ({pct:.1f}%)")

print(f"\n📊 Validation set distribution:")
val_dist = val_df['completion'].value_counts()
for label, count in val_dist.items():
    pct = count / len(val_df) * 100
    print(f"  {label}: {count} ({pct:.1f}%)")

# Verify no overlap
train_ids = set(train_df.index)
val_ids = set(val_df.index)
overlap = train_ids & val_ids
print(f"\n✓ Verification: {len(overlap)} overlapping samples (should be 0)")
print("="*80)
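
A stratified split can fail on very small samples, for example in testing mode with 10 rows. The sketch below is a pre-flight check using only the standard library. `stratification_problems` is a hypothetical helper, and its conditions (at least 2 examples per label, and at least as many rows as labels in each split) are assumptions about when scikit-learn is expected to raise, not a copy of its internal logic.

```python
import math
from collections import Counter


def stratification_problems(labels, test_size=0.2):
    """Return reasons a stratified split would likely fail (empty list if none).

    Hypothetical helper; the conditions are assumptions, not scikit-learn's code.
    """
    counts = Counter(labels)
    n_val = math.ceil(len(labels) * test_size)
    n_train = len(labels) - n_val

    problems = []
    for label, count in counts.items():
        if count < 2:
            problems.append(f"label {label!r} has only {count} example(s)")
    if n_val < len(counts):
        problems.append(f"validation size {n_val} is below the number of labels {len(counts)}")
    if n_train < len(counts):
        problems.append(f"train size {n_train} is below the number of labels {len(counts)}")
    return problems


# Nine "Valid" rows and a single "Error" row: stratification is not possible.
print(stratification_problems(["Valid"] * 9 + ["Error"]))
# A balanced toy sample passes the check.
print(stratification_problems(["Valid"] * 5 + ["Error"] * 5))
```

In the notebook this could be called as `stratification_problems(df['completion'].tolist(), TEST_SIZE)` before `train_test_split`, falling back to a larger sample when the list is not empty.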

## 🤖 Baseline Evaluation

We'll evaluate one or more baseline models on the validation set to establish reference performance before fine-tuning.

In [None]:
# Define baseline models to evaluate
BASELINE_MODELS = [
    "Qwen/Qwen2.5-7B-Instruct",
]

# Judge system prompt for classification
JUDGE_SYSTEM_PROMPT = """You are a legal charge classifier. Your task is to evaluate whether a charge classification is Valid or Error.

Given:
- A charge description
- A state
- A classification output

Determine if the classification is Valid (correct) or Error (incorrect).

Respond with only "Valid" or "Error"."""

print("Selected Baseline Models:")
for i, model in enumerate(BASELINE_MODELS, 1):
    print(f"  {i}. {model}")

In [None]:
# Prepare validation data for evaluation
def prepare_for_evaluation(df_split):
    """Convert DataFrame to list of dicts for evaluation."""
    eval_data = []
    for idx, row in df_split.iterrows():
        eval_data.append({
            'id': str(idx),
            'prompt': row['prompt'],
            'ground_truth': row['completion']
        })
    return eval_data

val_eval_data = prepare_for_evaluation(val_df)

print(f"✓ Prepared {len(val_eval_data)} validation examples for baseline evaluation")
print(f"\nSample evaluation data:")
print(json.dumps(val_eval_data[0], indent=2))

In [None]:
# Run baseline evaluation
print("Starting baseline evaluation on validation set...")
print("="*80)

baseline_results = {}

for model in BASELINE_MODELS:
    print(f"\n🔄 Evaluating {model}...")
    model_results = []
    correct = 0
    total = 0

    for example in tqdm(val_eval_data, desc=f"  {model.split('/')[-1]}"):
        try:
            # Call the model
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
                    {"role": "user", "content": example['prompt']}
                ],
                temperature=0.0,
                max_tokens=10
            )

            predicted = response.choices[0].message.content.strip()
            ground_truth = example['ground_truth']

            # Normalize predictions (handle variations).
            # Check 'INVALID' before 'VALID': 'INVALID' contains 'VALID' as a substring.
            predicted_upper = predicted.upper()
            if 'ERROR' in predicted_upper or 'INVALID' in predicted_upper:
                predicted_normalized = 'Error'
            elif 'VALID' in predicted_upper:
                predicted_normalized = 'Valid'
            else:
                predicted_normalized = predicted  # Keep original if unclear

            is_correct = (predicted_normalized == ground_truth)
            if is_correct:
                correct += 1
            total += 1

            model_results.append({
                'id': example['id'],
                'prompt': example['prompt'],
                'ground_truth': ground_truth,
                'predicted': predicted,
                'predicted_normalized': predicted_normalized,
                'correct': is_correct
            })

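        except Exception as e:
            # Record API failures under a distinct marker so they are not confused
            # with a genuine 'Error' prediction; they still count as incorrect.
            print(f"    ⚠️ Error on example {example['id']}: {str(e)}")
            model_results.append({
                'id': example['id'],
                'prompt': example['prompt'],
                'ground_truth': example['ground_truth'],
                'predicted': 'API_ERROR',
                'predicted_normalized': 'API_ERROR',
                'correct': False
            })
            total += 1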

    accuracy = correct / total if total > 0 else 0
    baseline_results[model] = {
        'results': model_results,
        'accuracy': accuracy,
        'correct': correct,
        'total': total
    }

    print(f"  ✓ Accuracy: {accuracy*100:.2f}% ({correct}/{total})")

print(f"\n{'='*80}")
print("✓ Baseline evaluation complete")
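
Substring matching on the raw output is fragile: a reply such as "Invalid" contains "VALID", and a reply that mentions both labels is ambiguous. The sketch below is a stricter alternative that matches whole words and returns `None` when the reply is unclear. `normalize_label` and `LABEL_ALIASES` are hypothetical names, and treating "invalid" as `Error` is an assumption about how the labels should be read.

```python
import re

# Assumption: "invalid" is treated as a synonym for the "Error" label.
LABEL_ALIASES = {"valid": "Valid", "error": "Error", "invalid": "Error"}


def normalize_label(text):
    """Map raw model output to 'Valid' or 'Error'; return None if unclear."""
    words = re.findall(r"[a-z]+", text.lower())
    found = {LABEL_ALIASES[w] for w in words if w in LABEL_ALIASES}
    # Accept the reply only when exactly one distinct label is mentioned.
    return found.pop() if len(found) == 1 else None


for raw in ["Valid.", " error\n", "Invalid", "Valid or Error", "maybe"]:
    print(repr(raw), "->", normalize_label(raw))
```

Keeping unparseable replies as `None` makes it possible to count them separately instead of silently scoring them as wrong predictions.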

In [None]:
# Visualize baseline results
print("="*80)
print("📊 BASELINE EVALUATION RESULTS")
print("="*80)

# Create results DataFrame
results_data = []
for model, metrics in baseline_results.items():
    model_name = model.split('/')[-1]
    results_data.append({
        'Model': model_name,
        'Accuracy': metrics['accuracy'] * 100,
        'Correct': metrics['correct'],
        'Total': metrics['total']
    })

results_df = pd.DataFrame(results_data).sort_values('Accuracy', ascending=False)

print("\nResults Summary:")
print(results_df.to_string(index=False))

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.barh(results_df['Model'], results_df['Accuracy'], color='steelblue')
ax.set_xlabel('Accuracy (%)', fontsize=12)
ax.set_title('Baseline Model Performance', fontsize=14, fontweight='bold')
ax.set_xlim(0, 100)

# Add value labels on bars
for i, (bar, acc) in enumerate(zip(bars, results_df['Accuracy'])):
    ax.text(acc + 1, i, f'{acc:.1f}%', va='center', fontsize=10)

plt.tight_layout()
plt.show()

# Find best baseline
best_baseline = results_df.iloc[0]
print(f"\n🏆 Best Baseline Model: {best_baseline['Model']}")
print(f"   Accuracy: {best_baseline['Accuracy']:.2f}%")
print(f"   Correct: {int(best_baseline['Correct'])}/{int(best_baseline['Total'])}")
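
Accuracy alone can hide weak performance on the minority label when the dataset is imbalanced. The sketch below computes per-label precision and recall from result dicts shaped like the entries in `baseline_results[model]['results']` (keys `ground_truth` and `predicted_normalized`). `per_label_metrics` is a hypothetical helper and the toy results are made up for illustration.

```python
from collections import Counter


def per_label_metrics(results, labels=("Valid", "Error")):
    """Compute precision, recall, and support for each label."""
    pairs = Counter((r["ground_truth"], r["predicted_normalized"]) for r in results)
    metrics = {}
    for label in labels:
        tp = pairs[(label, label)]
        fn = sum(c for (gt, pred), c in pairs.items() if gt == label and pred != label)
        fp = sum(c for (gt, pred), c in pairs.items() if gt != label and pred == label)
        metrics[label] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
            "support": tp + fn,
        }
    return metrics


# Illustrative results only; not output from a real model.
toy_results = [
    {"ground_truth": "Valid", "predicted_normalized": "Valid"},
    {"ground_truth": "Valid", "predicted_normalized": "Valid"},
    {"ground_truth": "Valid", "predicted_normalized": "Error"},
    {"ground_truth": "Error", "predicted_normalized": "Error"},
    {"ground_truth": "Error", "predicted_normalized": "Valid"},
]

toy_metrics = per_label_metrics(toy_results)
for label, m in toy_metrics.items():
    print(f"{label}: precision={m['precision']:.2f} recall={m['recall']:.2f} support={m['support']}")
```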

## 🎓 Fine-Tuning

Now we'll fine-tune a model on the training data to improve classification performance.

**Note**: If you have already uploaded a file to Together AI (file ID: `file-67114292-64db-484b-ad28-53b764c1566d`), you can:
1. Use it directly if it's already in fine-tuning format (with "messages" field)
2. Or convert the data below and upload new files in the correct format

The notebook will automatically download the uploaded file for analysis and can use it for fine-tuning.

In [None]:
# Prepare training data for fine-tuning
# Format: chat completion format for Together AI
# Note: If your uploaded file is already in fine-tuning format (with "messages" field),
# you can skip this conversion and use the file ID directly in the fine-tuning job.

def prepare_finetuning_data(df_split):
    """Convert DataFrame to fine-tuning format."""
    finetune_data = []
    for idx, row in df_split.iterrows():
        finetune_data.append({
            "messages": [
                {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
                {"role": "user", "content": row['prompt']},
                {"role": "assistant", "content": row['completion']}
            ]
        })
    return finetune_data

train_finetune_data = prepare_finetuning_data(train_df)
val_finetune_data = prepare_finetuning_data(val_df)

print(f"✓ Prepared fine-tuning data:")
print(f"  Training examples: {len(train_finetune_data)}")
print(f"  Validation examples: {len(val_finetune_data)}")

# Show sample
print(f"\n📋 Sample fine-tuning example:")
print(json.dumps(train_finetune_data[0], indent=2))

In [None]:
# Save fine-tuning data to JSONL files
os.makedirs('finetune_data', exist_ok=True)

train_file = 'finetune_data/train.jsonl'
val_file = 'finetune_data/val.jsonl'

with open(train_file, 'w') as f:
    for example in train_finetune_data:
        f.write(json.dumps(example) + '\n')

with open(val_file, 'w') as f:
    for example in val_finetune_data:
        f.write(json.dumps(example) + '\n')

print(f"✓ Saved fine-tuning data:")
print(f"  Training: {train_file} ({len(train_finetune_data)} examples)")
print(f"  Validation: {val_file} ({len(val_finetune_data)} examples)")

# Get file sizes
train_size = os.path.getsize(train_file) / (1024 * 1024)  # MB
val_size = os.path.getsize(val_file) / (1024 * 1024)  # MB
print(f"\n📂 File sizes:")
print(f"  Training: {train_size:.2f} MB")
print(f"  Validation: {val_size:.2f} MB")
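
Before uploading, it can help to sanity-check that every record follows the chat format produced by `prepare_finetuning_data`. The sketch below is a local check only. `check_chat_example` is a hypothetical helper, and its rules (known roles, non-empty string content, final message from the assistant) are assumptions about a well-formed example, not Together AI's own validation.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def check_chat_example(example):
    """Return a list of problems found in one chat-format record (empty if none)."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty or non-string content")

    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message should come from the assistant")
    return problems


good_line = json.dumps({"messages": [
    {"role": "system", "content": "You are a legal charge classifier."},
    {"role": "user", "content": "Charge: ... State: ..."},
    {"role": "assistant", "content": "Valid"},
]})
bad_line = json.dumps({"prompt": "Charge: ...", "completion": "Valid"})

print(check_chat_example(json.loads(good_line)))
print(check_chat_example(json.loads(bad_line)))
```

The same function could be run over each line of `train_file` and `val_file` after they are written.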

In [None]:
# Upload files to Together AI (or use existing uploaded file)
print("Preparing fine-tuning files...")
print("="*80)
print("Note: The uploaded file should be in fine-tuning format (with 'messages' field).")
print("If your uploaded file is in the original format (prompt/completion),")
print("the converted files below will be uploaded instead.")
print("="*80)

# Option 1: Use the already uploaded file if it's already in fine-tuning format
USE_UPLOADED_FILE_FOR_FINETUNING = False  # Set to True if uploaded file is in fine-tuning format

if USE_UPLOADED_FILE_FOR_FINETUNING and USE_UPLOADED_FILE and UPLOADED_FILE_ID:
    print(f"Using already uploaded file: {UPLOADED_FILE_ID}")
    print("Note: This assumes the uploaded file is the training file.")
    print("If you need separate train/val files, upload them below.")

    # Use the uploaded file ID directly
    class FileObj:
        def __init__(self, file_id):
            self.id = file_id
    train_file_obj = FileObj(UPLOADED_FILE_ID)
    val_file_obj = None  # Validation can be done separately

    print(f"✓ Using uploaded training file: {UPLOADED_FILE_ID}")
    print("⚠️ Note: No separate validation file is set. For proper validation, upload a separate validation file.")
else:
    # Option 2: Upload converted files (original format -> fine-tuning format)
    print("Converting data to fine-tuning format and uploading...")
    print("Uploading fine-tuning data to Together AI...")
    try:
        train_file_obj = client.files.upload(train_file, check=True)
        print(f"✓ Uploaded training file: {train_file_obj.id}")

        # Note: the fine-tuning job below only takes the training file.
        # The validation file is uploaded for optional use; validation is run separately after training.
        val_file_obj = client.files.upload(val_file, check=True)
        print(f"✓ Uploaded validation file: {val_file_obj.id}")

        print(f"\n{'='*80}")
        print("✓ Files uploaded successfully")

    except Exception as e:
        print(f"✗ Error uploading files: {str(e)}")
        print("Note: You may need to upload files manually or check API credentials")
        train_file_obj = None
        val_file_obj = None

In [None]:
# Create LoRA fine-tuning job
# Select base model for fine-tuning
FINETUNE_BASE_MODEL = "Qwen/Qwen2.5-7B-Instruct"  # Can be changed
WANDB_API_KEY = os.getenv("WANDB_API_KEY", None)  # Optional: set your wandb key

print(f"Creating LoRA fine-tuning job...")
print(f"Base model: {FINETUNE_BASE_MODEL}")
print("="*80)

if train_file_obj:
    try:
        # Fine-tuning parameters
        # LoRA fine-tuning parameters (matching LoRA_Finetuning&Inference.ipynb)
        finetune_job = client.fine_tuning.create(
            training_file=train_file_obj.id,
            model=FINETUNE_BASE_MODEL,
            train_on_inputs="auto",
            n_epochs=3,
            n_checkpoints=1,
            wandb_api_key=WANDB_API_KEY,
            lora=True,  # Enable LoRA fine-tuning
            warmup_ratio=0,
            learning_rate=1e-5,
            suffix="charge-classifier",  # Custom suffix for model name
        )

        print(f"✓ Fine-tuning job created!")
        print(f"  Job ID: {finetune_job.id}")
        print(f"  Output model name: {finetune_job.output_name}")
        print(f"\n⏳ Fine-tuning in progress...")
        print(f"   You can check status at: https://api.together.ai/jobs")
        print(f"   Or use: client.fine_tuning.retrieve('{finetune_job.id}')")

        finetune_job_id = finetune_job.id
        finetuned_model_name = finetune_job.output_name

    except Exception as e:
        print(f"✗ Error creating fine-tuning job: {str(e)}")
        print("\nNote: Fine-tuning may require specific API permissions or account setup")
        finetune_job_id = None
        finetuned_model_name = None
else:
    print("⚠️ Skipping fine-tuning job creation (training file not uploaded)")
    finetune_job_id = None
    finetuned_model_name = None

In [None]:
# Wait for fine-tuning to complete (if job was created)
if finetune_job_id:
    print("Waiting for fine-tuning to complete...")
    print("="*80)
    print("💡 Tip: You can also check status at https://api.together.ai/jobs")
    print("="*80)

    import time

    while True:
        try:
            status = client.fine_tuning.retrieve(finetune_job_id)
            print(f"Status: {status.status}")

            if status.status == "completed":
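                print(f"\n✓ Fine-tuning completed!")
                print(f"  Fine-tuned model: {status.output_name}")
                print(f"  For LoRA inference, use: {status.output_name}-adapter")
                finetuned_model = status.output_name
                break
            elif status.status in ("failed", "error", "cancelled"):
                # Stop on any terminal non-success state so the loop cannot spin forever.
                # Names other than "failed" are assumptions; check the SDK's job status values.
                print(f"\n✗ Fine-tuning failed")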
                print(f"  Error: {getattr(status, 'error', 'Unknown')}")
                finetuned_model = None
                break
            else:
                print(f"  Progress: {getattr(status, 'progress', 'N/A')}")
                time.sleep(30)  # Check every 30 seconds

        except Exception as e:
            print(f"Error checking status: {str(e)}")
            time.sleep(30)
else:
    print("⚠️ No fine-tuning job to wait for")
    print("💡 For testing purposes, you can manually set finetuned_model below")
    finetuned_model = None
    # Uncomment and set if you have a fine-tuned model:
    # finetuned_model = "your-finetuned-model-name"
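
The polling loop above runs until the job reports a final status, so a stuck job or a persistent API error would keep it waiting indefinitely. The sketch below adds a deadline and a cap on consecutive errors. `wait_for_job`, `fetch_status`, and the entries in `TERMINAL_STATES` are hypothetical names and assumed status strings, so adjust them to what the SDK actually returns.

```python
import time

# Assumed terminal status strings; verify against the SDK's job status values.
TERMINAL_STATES = ("completed", "failed", "error", "cancelled")


def wait_for_job(fetch_status, poll_seconds=30, timeout_seconds=3600,
                 max_errors=5, sleep=time.sleep):
    """Poll fetch_status() until a terminal state, a timeout, or repeated errors."""
    deadline = time.monotonic() + timeout_seconds
    consecutive_errors = 0
    while time.monotonic() < deadline:
        try:
            state = fetch_status()
            consecutive_errors = 0
        except Exception:
            consecutive_errors += 1
            if consecutive_errors >= max_errors:
                raise  # Give up after too many failed status checks in a row.
            state = None
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError(f"job did not finish within {timeout_seconds} seconds")


# Demo with a fake status source, so no API call or real waiting is involved.
fake_states = iter(["pending", "running", "completed"])
final_state = wait_for_job(lambda: next(fake_states), poll_seconds=0, sleep=lambda _: None)
print(final_state)
```

With the real client, `fetch_status` could be `lambda: client.fine_tuning.retrieve(finetune_job_id).status`, assuming the returned status compares equal to these strings.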

## 📊 Compare Fine-Tuned Model with Baseline

Evaluate the fine-tuned model on the validation set and compare with baseline performance.

In [None]:
# Evaluate fine-tuned model (using LoRA adapter)
if finetuned_model:
    # For LoRA fine-tuning, use model_name + '-adapter'
    lora_model_name = finetuned_model + "-adapter"
    print(f"Evaluating fine-tuned LoRA model: {lora_model_name}")
    print("="*80)

    finetuned_results = []
    correct = 0
    total = 0

    for example in tqdm(val_eval_data, desc="Fine-tuned model"):
        try:
            response = client.chat.completions.create(
                model=lora_model_name,  # Use LoRA adapter for inference
                messages=[
                    {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
                    {"role": "user", "content": example['prompt']}
                ],
                temperature=0.0,
                max_tokens=10
            )

            predicted = response.choices[0].message.content.strip()
            ground_truth = example['ground_truth']

            # Normalize predictions.
            # Check 'INVALID' before 'VALID': 'INVALID' contains 'VALID' as a substring.
            predicted_upper = predicted.upper()
            if 'ERROR' in predicted_upper or 'INVALID' in predicted_upper:
                predicted_normalized = 'Error'
            elif 'VALID' in predicted_upper:
                predicted_normalized = 'Valid'
            else:
                predicted_normalized = predicted

            is_correct = (predicted_normalized == ground_truth)
            if is_correct:
                correct += 1
            total += 1

            finetuned_results.append({
                'id': example['id'],
                'prompt': example['prompt'],
                'ground_truth': ground_truth,
                'predicted': predicted,
                'predicted_normalized': predicted_normalized,
                'correct': is_correct
            })

        except Exception as e:
            # Record API failures under a distinct marker so they are not confused
            # with a genuine 'Error' prediction; they still count as incorrect.
            print(f"    ⚠️ Error on example {example['id']}: {str(e)}")
            finetuned_results.append({
                'id': example['id'],
                'prompt': example['prompt'],
                'ground_truth': example['ground_truth'],
                'predicted': 'API_ERROR',
                'predicted_normalized': 'API_ERROR',
                'correct': False
            })
            total += 1

    finetuned_accuracy = correct / total if total > 0 else 0

    print(f"\n✓ Fine-tuned model evaluation complete")
    print(f"  Accuracy: {finetuned_accuracy*100:.2f}% ({correct}/{total})")
else:
    print("⚠️ No fine-tuned model available for evaluation")
    print("   Set finetuned_model variable or wait for fine-tuning to complete")
    finetuned_accuracy = None
    finetuned_results = []

In [None]:
# Compare baseline vs fine-tuned model
print("="*80)
print("📊 BASELINE vs FINE-TUNED COMPARISON")
print("="*80)

# Prepare comparison data
comparison_data = []
for model, metrics in baseline_results.items():
    model_name = model.split('/')[-1]
    comparison_data.append({
        'Model': model_name,
        'Type': 'Baseline',
        'Accuracy': metrics['accuracy'] * 100
    })

if finetuned_accuracy is not None:
    # Extract model name for display
    model_display_name = finetuned_model.split('/')[-1] if finetuned_model else 'Fine-tuned'
    comparison_data.append({
        'Model': f"{model_display_name} (LoRA)",
        'Type': 'Fine-tuned',
        'Accuracy': finetuned_accuracy * 100
    })

comparison_df = pd.DataFrame(comparison_data)

# Separate baseline and fine-tuned
baseline_df = comparison_df[comparison_df['Type'] == 'Baseline']
finetuned_df = comparison_df[comparison_df['Type'] == 'Fine-tuned']

print("\nBaseline Models:")
print(baseline_df[['Model', 'Accuracy']].to_string(index=False))

if not finetuned_df.empty:
    print("\nFine-tuned Model:")
    print(finetuned_df[['Model', 'Accuracy']].to_string(index=False))

    # Calculate improvement
    best_baseline_acc = baseline_df['Accuracy'].max()
    finetuned_acc = finetuned_df['Accuracy'].iloc[0]
    improvement = finetuned_acc - best_baseline_acc

    print(f"\n📈 Performance Comparison:")
    print(f"  Best Baseline: {best_baseline_acc:.2f}%")
    print(f"  Fine-tuned: {finetuned_acc:.2f}%")
    print(f"  Improvement: {improvement:+.2f} percentage points")

    if improvement > 0:
        print(f"  ✅ Fine-tuning improved accuracy by {improvement:.2f} percentage points")
    elif improvement < 0:
        print(f"  ⚠️ Fine-tuning decreased accuracy by {abs(improvement):.2f} percentage points")
    else:
        print(f"  ➡️ Fine-tuning maintained baseline performance")

# Visualize comparison
fig, ax = plt.subplots(figsize=(12, 6))

# Plot baseline models
baseline_colors = ['steelblue'] * len(baseline_df)
if not finetuned_df.empty:
    finetuned_color = ['green'] if finetuned_acc >= best_baseline_acc else ['red']
    colors = baseline_colors + finetuned_color
else:
    colors = baseline_colors

bars = ax.barh(comparison_df['Model'], comparison_df['Accuracy'], color=colors)
ax.set_xlabel('Accuracy (%)', fontsize=12)
ax.set_title('Baseline vs Fine-Tuned Model Performance', fontsize=14, fontweight='bold')
ax.set_xlim(0, 100)

# Add value labels
for i, (bar, acc) in enumerate(zip(bars, comparison_df['Accuracy'])):
    ax.text(acc + 1, i, f'{acc:.1f}%', va='center', fontsize=10)

# Add legend
from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor='steelblue', label='Baseline'),
]
if not finetuned_df.empty:
    legend_elements.append(Patch(facecolor='green' if finetuned_acc >= best_baseline_acc else 'red',
                                 label='Fine-tuned'))
ax.legend(handles=legend_elements, loc='lower right')

plt.tight_layout()
plt.show()

In [None]:
# Detailed error analysis
if finetuned_accuracy is not None and finetuned_results:
    print("="*80)
    print("🔍 ERROR ANALYSIS")
    print("="*80)

    # Compare errors between best baseline and fine-tuned
    best_baseline_model = baseline_df.loc[baseline_df['Accuracy'].idxmax(), 'Model']
    # Match on the exact short name; a substring test could select the wrong model
    best_baseline_full_name = next(m for m in baseline_results if m.split('/')[-1] == best_baseline_model)
    best_baseline_results = baseline_results[best_baseline_full_name]['results']

    # Find examples where they differ
    baseline_correct = {r['id']: r['correct'] for r in best_baseline_results}
    finetuned_correct = {r['id']: r['correct'] for r in finetuned_results}

    # Cases where fine-tuned is better
    improved = [id for id in baseline_correct.keys()
                if not baseline_correct[id] and finetuned_correct.get(id, False)]

    # Cases where fine-tuned is worse
    regressed = [id for id in baseline_correct.keys()
                 if baseline_correct[id] and not finetuned_correct.get(id, False)]

    print(f"\n📊 Comparison with best baseline ({best_baseline_model}):")
    print(f"  Examples improved: {len(improved)}")
    print(f"  Examples regressed: {len(regressed)}")

    if improved:
        print(f"\n✅ Examples where fine-tuned model improved:")
        for i, example_id in enumerate(improved[:3], 1):  # Show first 3
            example = [r for r in val_eval_data if r['id'] == example_id][0]
            print(f"\n  {i}. ID: {example_id}")
            print(f"     Prompt: {example['prompt'][:150]}...")
            print(f"     Ground truth: {example['ground_truth']}")

    if regressed:
        print(f"\n⚠️ Examples where fine-tuned model regressed:")
        for i, example_id in enumerate(regressed[:3], 1):  # Show first 3
            example = [r for r in val_eval_data if r['id'] == example_id][0]
            print(f"\n  {i}. ID: {example_id}")
            print(f"     Prompt: {example['prompt'][:150]}...")
            print(f"     Ground truth: {example['ground_truth']}")
else:
    print("⚠️ Fine-tuned model results not available for error analysis")
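
The improved and regressed counts can also indicate whether the accuracy difference is likely to be more than noise, which matters on a small validation set. The sketch below applies an exact McNemar test, a two-sided binomial test on the examples where the two models disagree. This test is not part of the original notebook, `mcnemar_exact_p` is a hypothetical helper, and the counts in the demo are made up.

```python
from math import comb


def mcnemar_exact_p(n_improved, n_regressed):
    """Two-sided exact McNemar p-value from the two disagreement counts."""
    n = n_improved + n_regressed
    if n == 0:
        return 1.0  # The models never disagree, so there is nothing to test.
    k = min(n_improved, n_regressed)
    # Probability of k or fewer "successes" in n fair coin flips, doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Illustrative counts only: 5 examples improved, 0 regressed.
p_value = mcnemar_exact_p(5, 0)
print(f"p-value: {p_value:.4f}")
```

In the notebook this could be called as `mcnemar_exact_p(len(improved), len(regressed))`. A large p-value suggests the validation set is too small to distinguish the two models.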

## 🎉 Summary

### What We Accomplished

1. ✅ **Data Analysis**: Explored the charge classifier dataset
   - Analyzed completion distribution (Valid vs Error)
   - Examined prompt characteristics
   - Identified dataset statistics

2. ✅ **Train/Val Split**: Created stratified 80/20 split
   - Maintained label distribution across splits
   - Ensured no overlap between train and validation sets

3. ✅ **Baseline Evaluation**: Tested the baseline model (`Qwen/Qwen2.5-7B-Instruct`)
   - Evaluated on validation set
   - Established performance benchmarks

4. ✅ **Fine-Tuning**: Trained a specialized classifier
   - Prepared data in proper format
   - Created fine-tuning job
   - Monitored training progress

5. ✅ **Comparison**: Compared fine-tuned vs baseline
   - Measured performance improvement
   - Analyzed error patterns

### Next Steps

1. **Full Dataset Run**: Set `TESTING_MODE = False` to run on full 8k+ dataset
2. **Hyperparameter Tuning**: Experiment with different learning rates, epochs, batch sizes
3. **Model Selection**: Try different base models for fine-tuning
4. **Production Deployment**: Deploy the best-performing model for production use
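
For the hyperparameter tuning step, one simple starting point is a small grid of candidate settings, each of which would become its own fine-tuning job. The sketch below only builds the grid. The candidate values are hypothetical, apart from `1e-5` and `3`, which are the learning rate and epoch count used in this notebook.

```python
from itertools import product

# Hypothetical candidates; 1e-5 and 3 are the values used in this notebook.
learning_rates = [1e-5, 5e-5, 1e-4]
epoch_counts = [1, 3, 5]

# One dict per combination, ready to be passed as keyword arguments.
grid = [
    {"learning_rate": lr, "n_epochs": epochs}
    for lr, epochs in product(learning_rates, epoch_counts)
]

print(f"{len(grid)} configurations")
for config in grid[:3]:
    print(config)
```

Each entry could be merged into the existing `client.fine_tuning.create(...)` call, for example with `**config`. Since every configuration is a separate training job, it is worth starting with a handful of combinations.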