# GPQA-D Performance Regression Analysis

This notebook performs reverse regression analysis to understand how GPQA-D performance improves over time.

**Key Questions:**
1. How much does performance improve per year **holding price constant** (by price bin)?
2. How much does performance improve per year **without controlling for price** (overall trend)?

**Methodology:**
- Use logit-transformed GPQA-D scores: logit(p) = log(p / (1-p))
- Linear regression of logit(GPQA-D) vs time
- Focus on the Pareto frontier: models that set a new best GPQA-D score within their price bin at the time of release
- Report annual improvement rates in both logit units and approximate percentage points
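
To make the logit scale concrete, the sketch below applies the same one-logit gain at three baseline accuracies. The baselines and the gain are illustrative values, not results from the dataset, and the helper functions are simplified stand-ins for the ones defined later in this notebook (no clipping).

```python
import numpy as np

def logit(p):
    """Map a probability in (0, 1) to the logit (log-odds) scale."""
    return np.log(p / (1 - p))

def inverse_logit(x):
    """Map a logit value back to a probability."""
    return 1 / (1 + np.exp(-x))

# Illustrative baselines (hypothetical accuracies, not taken from the data)
baselines = np.array([0.30, 0.50, 0.80])
gain_logits = 1.0  # hypothetical improvement of one logit

after = inverse_logit(logit(baselines) + gain_logits)
gain_pct_points = (after - baselines) * 100

for p0, p1, g in zip(baselines, after, gain_pct_points):
    print(f"{p0:.0%} -> {p1:.1%} (+{g:.1f} percentage points)")
```

The same logit gain is worth far fewer percentage points at an 80% baseline than at 50%, which is why the analysis reports slopes in logits and treats the percentage-point figure as an approximation tied to a specific starting level.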

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
from matplotlib.dates import DateFormatter
from sklearn.linear_model import LinearRegression
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load the data
df = pd.read_csv('data/price_reduction_models.csv')

# Convert Release Date to datetime
df['Release Date'] = pd.to_datetime(df['Release Date'])

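# Clean GPQA-D column (epoch_gpqa): strip '%' and convert to float;
# values that cannot be parsed become NaN and are dropped below
df['GPQA_D'] = pd.to_numeric(
    df['epoch_gpqa'].astype(str).str.replace('%', '', regex=False).str.strip(),
    errors='coerce'
)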

# Clean Benchmark Cost USD - convert string with $ and commas to float
df['Price'] = df['Benchmark Cost USD'].astype(str).str.replace('[$,]', '', regex=True)
df['Price'] = pd.to_numeric(df['Price'], errors='coerce')

# Filter out rows with missing data
df_clean = df[['Model', 'Release Date', 'GPQA_D', 'Price']].dropna()
df_clean = df_clean[df_clean['Price'] > 0]

print(f"Total models with complete data: {len(df_clean)}")
print(f"Date range: {df_clean['Release Date'].min()} to {df_clean['Release Date'].max()}")
print(f"GPQA-D range: {df_clean['GPQA_D'].min():.1f}% to {df_clean['GPQA_D'].max():.1f}%")
print(f"Price range: ${df_clean['Price'].min():.2f} to ${df_clean['Price'].max():.2f}")

In [None]:
# Define logit transformation functions
def logit(p):
    """Convert probability to logit scale"""
    # Clip to [0.001, 0.999] so scores of exactly 0% or 100% do not produce infinite logits
    p_clipped = np.clip(p, 0.001, 0.999)
    return np.log(p_clipped / (1 - p_clipped))

def inverse_logit(logit_val):
    """Convert logit back to probability"""
    return 1 / (1 + np.exp(-logit_val))

# Add logit column (convert percentage to decimal first)
df_clean['GPQA_D_logit'] = logit(df_clean['GPQA_D'] / 100)

# Add ordinal date for regression
df_clean['Date_Ordinal'] = df_clean['Release Date'].map(datetime.toordinal)

print("Logit transformation complete.")
print(f"Logit range: {df_clean['GPQA_D_logit'].min():.2f} to {df_clean['GPQA_D_logit'].max():.2f}")

In [None]:
# Define price bins - same as in previous analysis
price_percentiles = [0, 33, 67, 100]
price_bins = np.percentile(df_clean['Price'], price_percentiles)

# Create bin labels
bin_labels = [
    f'Low Price (${price_bins[0]:.2f}-${price_bins[1]:.2f})',
    f'Mid Price (${price_bins[1]:.2f}-${price_bins[2]:.2f})',
    f'High Price (${price_bins[2]:.2f}-${price_bins[3]:.2f})'
]

# Assign bins
df_clean['Price_Bin'] = pd.cut(df_clean['Price'], bins=price_bins, labels=bin_labels, include_lowest=True)

print("\nPrice bins defined:")
print(df_clean['Price_Bin'].value_counts().sort_index())

## Regression Analysis: Performance Improvement Over Time

We perform linear regression of logit(GPQA-D) vs time for:
1. Each price bin (controlling for price)
2. All models combined (not controlling for price)
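
As a rough sketch of what each regression produces, the example below fits a line to a small made-up series and derives the annualized slope with a 95% confidence interval using `scipy.stats.linregress`. The dates and logit values are hypothetical. The 365-day scaling follows the convention used in this notebook.

```python
import numpy as np
from datetime import date
from scipy import stats

# Hypothetical release dates and logit scores, for illustration only
dates = [date(2024, 4, 1), date(2024, 7, 1), date(2024, 10, 1),
         date(2025, 1, 1), date(2025, 4, 1)]
y = np.array([-0.40, -0.10, 0.05, 0.45, 0.60])

# Ordinal days as the predictor, matching the Date_Ordinal column
x = np.array([d.toordinal() for d in dates], dtype=float)

fit = stats.linregress(x, y)

# Two-sided 95% interval for the slope uses n - 2 degrees of freedom
t_crit = stats.t.ppf(0.975, len(x) - 2)
slope_per_year = fit.slope * 365
ci_low = (fit.slope - t_crit * fit.stderr) * 365
ci_high = (fit.slope + t_crit * fit.stderr) * 365

print(f"slope: {slope_per_year:.3f} logits/yr")
print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
print(f"R^2: {fit.rvalue ** 2:.3f}")
```

With only a handful of points the interval is wide, so the per-bin slopes are best read together with their confidence intervals rather than as point estimates.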

We can choose to use:
- **Pareto frontier only**: Best models in each bin over time
- **All models**: Every model in the dataset
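
The frontier selection can be sketched on a toy table as follows. The model names, dates, and scores are hypothetical. The running-maximum comparison is the same one used in the analysis function below.

```python
import pandas as pd

# Hypothetical models within one price bin, for illustration only
toy = pd.DataFrame({
    "Model": ["A", "B", "C", "D", "E"],
    "Release Date": pd.to_datetime(
        ["2024-04-10", "2024-06-01", "2024-08-15", "2024-11-20", "2025-02-05"]
    ),
    "GPQA_D": [35.0, 42.0, 40.0, 51.0, 47.0],
}).sort_values("Release Date")

# A model is on the frontier if it matches the running maximum at its release
toy["Is_Best"] = toy["GPQA_D"].cummax() == toy["GPQA_D"]
frontier = toy[toy["Is_Best"]]

print(frontier[["Model", "GPQA_D"]].to_string(index=False))
```

Models C and E are dropped because an earlier model already scored higher. A model that ties the current best is kept, since the comparison uses equality with the running maximum.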

In [None]:
def perform_regression_analysis(df, use_pareto_only=True, min_date=None):
    """
    Perform regression analysis of logit(GPQA-D) vs time.
    
    Parameters:
    -----------
    df : DataFrame
        Input data with columns: Release Date, GPQA_D, GPQA_D_logit, Price, Price_Bin, Date_Ordinal
    use_pareto_only : bool
        If True, use only Pareto frontier (best models over time in each bin)
        If False, use all models
    min_date : datetime or None
        Filter to models released on or after this date
    
    Returns:
    --------
    results_df : DataFrame
        Summary table with regression results for each price bin and overall
    """
    
    df_work = df.copy()
    
    # Apply date filter if specified
    if min_date is not None:
        if isinstance(min_date, str):
            min_date = pd.to_datetime(min_date)
        df_work = df_work[df_work['Release Date'] >= min_date]
    
    # Sort by date
    df_work = df_work.sort_values('Release Date')
    
    results = []
    
    # Analyze each price bin
    for bin_label in sorted(df_work['Price_Bin'].dropna().unique()):
        df_bin = df_work[df_work['Price_Bin'] == bin_label].copy()
        
        if len(df_bin) == 0:
            continue
        
        # Get Pareto frontier if requested
        if use_pareto_only:
            df_bin['Is_Best'] = df_bin['GPQA_D'].cummax() == df_bin['GPQA_D']
            df_analysis = df_bin[df_bin['Is_Best']].copy()
        else:
            df_analysis = df_bin.copy()
        
        if len(df_analysis) < 2:
            continue
        
        # Perform regression
        X = df_analysis['Date_Ordinal'].values.reshape(-1, 1)
        y = df_analysis['GPQA_D_logit'].values
        
        model = LinearRegression().fit(X, y)
        r_squared = model.score(X, y)
        
        # Calculate annual improvement rate in logit units
        annual_improvement_logit = model.coef_[0] * 365
        
        # Convert to approximate percentage points per year at mean performance
        mean_gpqa = df_analysis['GPQA_D'].mean()
        mean_logit = logit(mean_gpqa / 100)
        future_logit = mean_logit + annual_improvement_logit
        future_prob = inverse_logit(future_logit) * 100
        annual_improvement_pct = future_prob - mean_gpqa
        
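        # Calculate 95% confidence interval for slope.
        # The residual variance uses n - 2 degrees of freedom, so at least
        # 3 points are needed; with fewer, the interval is undefined.
        n = len(X)
        if n > 2:
            x_vals = X.ravel()
            residuals = y - model.predict(X)
            mse = np.sum(residuals**2) / (n - 2)
            se = np.sqrt(mse / np.sum((x_vals - x_vals.mean())**2))
            t_val = stats.t.ppf(0.975, n - 2)
            ci_lower_logit = (model.coef_[0] - t_val * se) * 365
            ci_upper_logit = (model.coef_[0] + t_val * se) * 365
        else:
            ci_lower_logit = ci_upper_logit = np.nan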
        
        # Store results
        results.append({
            'Category': str(bin_label),
            'N_Models': len(df_analysis),
            'Date_Range': f"{df_analysis['Release Date'].min().strftime('%Y-%m-%d')} to {df_analysis['Release Date'].max().strftime('%Y-%m-%d')}",
            'GPQA-D_Range': f"{df_analysis['GPQA_D'].min():.1f}% to {df_analysis['GPQA_D'].max():.1f}%",
            'Mean_GPQA-D': f"{mean_gpqa:.1f}%",
            'Annual_Improvement_Logits': annual_improvement_logit,
            'Annual_Improvement_PctPts': annual_improvement_pct,
            'CI_Lower_Logits': ci_lower_logit,
            'CI_Upper_Logits': ci_upper_logit,
            'R_Squared': r_squared,
            'Price_Range': f"${df_analysis['Price'].min():.2f} to ${df_analysis['Price'].max():.2f}"
        })
    
    # Analyze overall (not controlling for price)
    if use_pareto_only:
        df_work['Is_Overall_Best'] = df_work['GPQA_D'].cummax() == df_work['GPQA_D']
        df_overall = df_work[df_work['Is_Overall_Best']].copy()
    else:
        df_overall = df_work.copy()
    
    if len(df_overall) >= 2:
        X = df_overall['Date_Ordinal'].values.reshape(-1, 1)
        y = df_overall['GPQA_D_logit'].values
        
        model = LinearRegression().fit(X, y)
        r_squared = model.score(X, y)
        
        annual_improvement_logit = model.coef_[0] * 365
        
        mean_gpqa = df_overall['GPQA_D'].mean()
        mean_logit = logit(mean_gpqa / 100)
        future_logit = mean_logit + annual_improvement_logit
        future_prob = inverse_logit(future_logit) * 100
        annual_improvement_pct = future_prob - mean_gpqa
        
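        # Calculate 95% confidence interval for slope.
        # The residual variance uses n - 2 degrees of freedom, so at least
        # 3 points are needed; with fewer, the interval is undefined.
        n = len(X)
        if n > 2:
            x_vals = X.ravel()
            residuals = y - model.predict(X)
            mse = np.sum(residuals**2) / (n - 2)
            se = np.sqrt(mse / np.sum((x_vals - x_vals.mean())**2))
            t_val = stats.t.ppf(0.975, n - 2)
            ci_lower_logit = (model.coef_[0] - t_val * se) * 365
            ci_upper_logit = (model.coef_[0] + t_val * se) * 365
        else:
            ci_lower_logit = ci_upper_logit = np.nan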
        
        results.append({
            'Category': 'Overall (No Price Control)',
            'N_Models': len(df_overall),
            'Date_Range': f"{df_overall['Release Date'].min().strftime('%Y-%m-%d')} to {df_overall['Release Date'].max().strftime('%Y-%m-%d')}",
            'GPQA-D_Range': f"{df_overall['GPQA_D'].min():.1f}% to {df_overall['GPQA_D'].max():.1f}%",
            'Mean_GPQA-D': f"{mean_gpqa:.1f}%",
            'Annual_Improvement_Logits': annual_improvement_logit,
            'Annual_Improvement_PctPts': annual_improvement_pct,
            'CI_Lower_Logits': ci_lower_logit,
            'CI_Upper_Logits': ci_upper_logit,
            'R_Squared': r_squared,
            'Price_Range': f"${df_overall['Price'].min():.2f} to ${df_overall['Price'].max():.2f}"
        })
    
    # Create DataFrame
    results_df = pd.DataFrame(results)
    
    return results_df

## Results: Pareto Frontier Only

This analysis uses only the best models in each price bin over time (Pareto frontier), restricted to models released on or after 2024-04-01.

In [None]:
# Perform regression analysis using Pareto frontier only
results_pareto = perform_regression_analysis(
    df_clean,
    use_pareto_only=True,
    min_date=datetime(2024, 4, 1)
)

print("\n" + "="*100)
print("REGRESSION ANALYSIS: PARETO FRONTIER ONLY")
print("Performance Improvement Over Time (Logit Scale)")
print("="*100 + "\n")

# Display with nice formatting
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

display(results_pareto)

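# Save to CSV (create the output directory first if it does not exist)
import os
os.makedirs('results', exist_ok=True)
results_pareto.to_csv('results/gpqa_regression_pareto.csv', index=False)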
print("\nResults saved to: results/gpqa_regression_pareto.csv")

## Results: All Models

This analysis uses all models released on or after 2024-04-01, not just the Pareto frontier.

In [None]:
# Perform regression analysis using all models
results_all = perform_regression_analysis(
    df_clean,
    use_pareto_only=False,
    min_date=datetime(2024, 4, 1)
)

print("\n" + "="*100)
print("REGRESSION ANALYSIS: ALL MODELS")
print("Performance Improvement Over Time (Logit Scale)")
print("="*100 + "\n")

display(results_all)

# Save to CSV
results_all.to_csv('results/gpqa_regression_all_models.csv', index=False)
print("\nResults saved to: results/gpqa_regression_all_models.csv")

## Formatted Summary Table

A clean, formatted version of the Pareto frontier results for presentation.

In [None]:
# Create formatted summary table
summary_table = results_pareto.copy()

# Round numerical columns
summary_table['Annual_Improvement_Logits'] = summary_table['Annual_Improvement_Logits'].round(3)
summary_table['Annual_Improvement_PctPts'] = summary_table['Annual_Improvement_PctPts'].round(2)
summary_table['R_Squared'] = summary_table['R_Squared'].round(3)

# Create confidence interval column
summary_table['95% CI (Logits/yr)'] = summary_table.apply(
    lambda row: f"[{row['CI_Lower_Logits']:.3f}, {row['CI_Upper_Logits']:.3f}]",
    axis=1
)

# Select and rename columns for presentation
presentation_table = summary_table[[
    'Category',
    'N_Models',
    'Mean_GPQA-D',
    'Annual_Improvement_Logits',
    'Annual_Improvement_PctPts',
    '95% CI (Logits/yr)',
    'R_Squared',
    'GPQA-D_Range'
]].copy()

presentation_table.columns = [
    'Category',
    'N',
    'Mean GPQA-D',
    'Annual Improvement (logits/yr)',
    'Annual Improvement (% pts/yr)',
    '95% CI (logits/yr)',
    'R²',
    'GPQA-D Range'
]

print("\n" + "="*120)
print("SUMMARY: GPQA-D Performance Improvement Rates (Pareto Frontier)")
print("="*120 + "\n")

display(presentation_table)

# Save formatted table
presentation_table.to_csv('results/gpqa_regression_summary.csv', index=False)
print("\nFormatted summary saved to: results/gpqa_regression_summary.csv")

## Key Insights

Extract and display key insights from the regression analysis.

In [None]:
print("\n" + "="*100)
print("KEY INSIGHTS")
print("="*100 + "\n")

# Compare price-controlled vs uncontrolled improvement rates
overall_row = results_pareto[results_pareto['Category'] == 'Overall (No Price Control)'].iloc[0]
price_bin_rows = results_pareto[results_pareto['Category'] != 'Overall (No Price Control)']

print("1. OVERALL IMPROVEMENT (Not Controlling for Price):")
print(f"   - Annual improvement: {overall_row['Annual_Improvement_Logits']:.3f} logits/yr")
print(f"   - Approximate: {overall_row['Annual_Improvement_PctPts']:.2f} percentage points/yr")
print(f"   - R² = {overall_row['R_Squared']:.3f}")
print(f"   - Based on {overall_row['N_Models']} record-breaking models\n")

print("2. IMPROVEMENT BY PRICE BIN (Controlling for Price):")
for idx, row in price_bin_rows.iterrows():
    print(f"\n   {row['Category']}:")
    print(f"   - Annual improvement: {row['Annual_Improvement_Logits']:.3f} logits/yr")
    print(f"   - Approximate: {row['Annual_Improvement_PctPts']:.2f} percentage points/yr")
    print(f"   - R² = {row['R_Squared']:.3f}")
    print(f"   - Based on {row['N_Models']} models")

print("\n" + "-"*100)

# Compare improvement rates across price bins
print("\n3. COMPARISON ACROSS PRICE BINS:")
sorted_bins = price_bin_rows.sort_values('Annual_Improvement_Logits', ascending=False)
print(f"\n   Fastest improving: {sorted_bins.iloc[0]['Category']}")
print(f"   - {sorted_bins.iloc[0]['Annual_Improvement_Logits']:.3f} logits/yr")
print(f"   - {sorted_bins.iloc[0]['Annual_Improvement_PctPts']:.2f} percentage points/yr")

print(f"\n   Slowest improving: {sorted_bins.iloc[-1]['Category']}")
print(f"   - {sorted_bins.iloc[-1]['Annual_Improvement_Logits']:.3f} logits/yr")
print(f"   - {sorted_bins.iloc[-1]['Annual_Improvement_PctPts']:.2f} percentage points/yr")

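# Calculate ratio (only meaningful when the slowest slope is positive)
fastest = sorted_bins.iloc[0]['Annual_Improvement_Logits']
slowest = sorted_bins.iloc[-1]['Annual_Improvement_Logits']
if slowest > 0:
    print(f"\n   Ratio: {fastest / slowest:.2f}x faster improvement\n")
else:
    print("\n   Ratio not reported: the slowest bin has a non-positive slope\n")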

print("="*100)

## Interpretation Guide

**Annual Improvement (logits/yr)**: The slope of the regression line in logit space. Higher values = faster improvement.

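**Annual Improvement (% pts/yr)**: Approximate improvement in percentage points per year, calculated at the mean performance level for that category: the mean score is converted to logits, one year of improvement is added, and the result is converted back to a percentage. This is easier to interpret than logit units, but it depends on the starting level, so the same logit slope gives fewer percentage points when the mean score is close to 0% or 100%.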
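
The conversion can be sketched as a small function. The function name `pct_points_per_year` and the slope of 0.5 logits/yr are hypothetical, chosen only to show how the result changes with the mean score.

```python
import numpy as np

def pct_points_per_year(mean_pct, slope_logits_per_year):
    """Approximate yearly gain in percentage points starting from mean_pct."""
    p = mean_pct / 100
    start_logit = np.log(p / (1 - p))
    end_prob = 1 / (1 + np.exp(-(start_logit + slope_logits_per_year)))
    return end_prob * 100 - mean_pct

# Hypothetical slope of 0.5 logits/yr evaluated at three mean scores
for mean_pct in (30.0, 50.0, 80.0):
    gain = pct_points_per_year(mean_pct, 0.5)
    print(f"mean {mean_pct:.0f}%: +{gain:.1f} percentage points/yr")
```

Because the result varies with the mean score, percentage-point rates from categories with different mean scores are not directly comparable. The logit slopes are the safer basis for comparison.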

**R²**: Coefficient of determination. Higher values (closer to 1.0) indicate the linear trend explains more variance in the data.

**Controlling for Price**: When we analyze by price bin, we're holding price roughly constant and measuring how performance improves over time at that price level.

**Not Controlling for Price**: The overall trend includes all models regardless of price, so it captures both:
- Performance improvements at all price levels
- The effect of higher-priced models potentially having better performance
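
The difference between the two views can be illustrated with synthetic numbers. In the sketch below, both price tiers improve at exactly 0.5 logits/yr, but the expensive tier scores higher and its models appear later. All values are invented for illustration, and the fit uses all points rather than a frontier.

```python
import numpy as np

# Synthetic data: years since start and logit scores for two price tiers.
# Both tiers improve at exactly 0.5 logits/yr; the expensive tier starts
# higher and its models appear later.
t_cheap = np.array([0.0, 1.0, 2.0])
y_cheap = np.array([-1.0, -0.5, 0.0])
t_expensive = np.array([1.0, 2.0, 3.0])
y_expensive = np.array([0.5, 1.0, 1.5])

# Within-tier slopes (price held roughly constant)
slope_cheap = np.polyfit(t_cheap, y_cheap, 1)[0]
slope_expensive = np.polyfit(t_expensive, y_expensive, 1)[0]

# Pooled fit ignores the price tier entirely
t_all = np.concatenate([t_cheap, t_expensive])
y_all = np.concatenate([y_cheap, y_expensive])
slope_pooled = np.polyfit(t_all, y_all, 1)[0]

print(f"cheap tier:     {slope_cheap:.3f} logits/yr")
print(f"expensive tier: {slope_expensive:.3f} logits/yr")
print(f"pooled:         {slope_pooled:.3f} logits/yr")
```

In this toy case the pooled slope exceeds both within-tier slopes, because part of the apparent trend comes from the later arrival of higher-priced, higher-scoring models rather than from improvement at a fixed price.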