<a href="https://colab.research.google.com/github/leippold/HAI-Frontier/blob/main/master_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HAI-Frontier: Human-AI Collaboration Frontier Analysis

This notebook runs the complete analysis for the paper "The Human-AI Collaboration Frontier and the Quality of Science".

**Instructions:**
1. Click the "Open in Colab" badge, or open this notebook in Google Colab directly
2. Set your data folder path (can be Google Drive) in Section 1
3. Optionally add a `GITHUB_TOKEN` to Colab Secrets to push results
4. Run all cells (Runtime → Run all)


## 1. Configuration

In [5]:
from google.colab import userdata

# Get token from Colab Secrets (set once in sidebar, persists across sessions)
try:
    GITHUB_TOKEN = userdata.get('GITHUB_TOKEN')
    print("GitHub token loaded from Colab Secrets")
except Exception:  # e.g. secret missing or notebook access not granted
    GITHUB_TOKEN = None
    print("No GitHub token found - will skip git push operations")

GitHub token loaded from Colab Secrets


In [6]:
#@title Data & GitHub Configuration { display-mode: "form" }

#@markdown ### Data Location
#@markdown Set the path to your data folder (supports Google Drive):
DATA_PATH = "/content/drive/MyDrive/HAI_Data"  #@param {type:"string"}


#@markdown Expected files in DATA_PATH:
#@markdown - `retraction_watch.csv`
#@markdown - `all_problematic_papers.csv`
#@markdown - `iclr_pangram_submissions.csv` (optional)
#@markdown - `iclr_pangram_reviews.csv` (optional)

#@markdown ---
#@markdown ### GitHub Configuration (optional, for pushing results)
# GITHUB_TOKEN = ""  #@param {type:"string"}
GITHUB_USER = "leippold"  #@param {type:"string"}
REPO_NAME = "HAI-Frontier"  #@param {type:"string"}

# # Validate
# print(f"📂 Data folder: {DATA_PATH}")
# if not GITHUB_TOKEN:
#     print("⚠️  No GitHub token set. You can still run analyses but cannot push changes.")
# else:
#     print("✓ GitHub token configured")

## 2. Setup Environment

In [7]:
#@title Mount Google Drive (if using Drive for data)

# Only run this if your DATA_PATH is in Google Drive
if DATA_PATH.startswith("/content/drive"):
    from google.colab import drive
    drive.mount('/content/drive')
    print("✓ Google Drive mounted")
else:
    print("ℹ️  Skipping Drive mount (DATA_PATH is not in Drive)")

Mounted at /content/drive
✓ Google Drive mounted


In [8]:
#@title Install Dependencies
!pip install lifelines scipy statsmodels seaborn --quiet
print("✓ Dependencies installed")

✓ Dependencies installed


In [9]:
#@title Clone Repository from GitHub
import os

# Ensure we're in a valid directory
%cd /content

# Clean up any existing clone
!rm -rf /content/{REPO_NAME}

# Clone the repo
if GITHUB_TOKEN:
    !git clone https://{GITHUB_TOKEN}@github.com/{GITHUB_USER}/{REPO_NAME}.git
else:
    !git clone https://github.com/{GITHUB_USER}/{REPO_NAME}.git

# Change to repo directory
%cd /content/{REPO_NAME}

# Show structure
print("\n📁 Repository structure:")
!find . -type f \( -name "*.py" -o -name "*.md" \) | head -20

/content
Cloning into 'HAI-Frontier'...
remote: Enumerating objects: 200, done.
remote: Counting objects: 100% (41/41), done.
remote: Compressing objects: 100% (22/22), done.
remote: Total 200 (delta 25), reused 24 (delta 19), pack-reused 159 (from 1)
Receiving objects: 100% (200/200), 205.53 KiB | 2.78 MiB/s, done.
Resolving deltas: 100% (99/99), done.
/content/HAI-Frontier

📁 Repository structure:
./README.md
./inline_display.py
./iclr_analysis/generate_tables.py
./iclr_analysis/analysis/collaboration_hypothesis.py
./iclr_analysis/analysis/run_all.py
./iclr_analysis/analysis/effort_proxies.py
./iclr_analysis/analysis/within_paper.py
./iclr_analysis/analysis/heterogeneity.py
./iclr_analysis/analysis/echo_chamber.py
./iclr_analysis/analysis/confidence.py
./iclr_analysis/analysis/__init__.py
./iclr_analysis/README.md
./iclr_analysis/src/stats_utils.py
./iclr_analysis/src/data_loading.py
./iclr_analysis/src/__init__.py
./iclr_analysis/src/plotting_enhanced.py
./iclr_analys

In [None]:
#@title Setup Python Path and Imports
import sys
import os
import importlib.util

# Repository path
REPO_PATH = f"/content/{REPO_NAME}"

# Output directory - use last_results/master_analysis for organized results
OUTPUT_DIR = f"{REPO_PATH}/last_results/master_analysis"

# Create output directories
os.makedirs(f"{OUTPUT_DIR}/figures", exist_ok=True)
os.makedirs(f"{OUTPUT_DIR}/tables", exist_ok=True)

# Helper function to load modules by explicit path (avoids name conflicts)
def load_module_from_path(module_name, file_path):
    """Load a Python module from an explicit file path."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module

# Add paths for dependencies (but NOT for run_all - we'll load those explicitly)
sys.path.insert(0, f"{REPO_PATH}/retraction_analysis")
sys.path.insert(0, f"{REPO_PATH}/iclr_analysis")

# Common imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML, Markdown, Image
import warnings
warnings.filterwarnings('ignore')

# Configure matplotlib for inline display
%matplotlib inline
plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['figure.dpi'] = 100

print("✓ Environment configured")
print(f"  Repository: {REPO_PATH}")
print(f"  Data folder: {DATA_PATH}")
print(f"  Output folder: {OUTPUT_DIR}")
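
The `load_module_from_path` helper above exists because the repository contains two files named `run_all.py`. A minimal sketch of why loading by explicit path under distinct names avoids the clash is shown here; the `pkg_a`/`pkg_b` directories and the `SOURCE` variable are hypothetical stand-ins created in a temporary folder, not part of the repository.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path


def load_module_from_path(module_name, file_path):
    """Load a Python module from an explicit file path under a chosen name."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module


# Two hypothetical packages that both ship a run_all.py.
with tempfile.TemporaryDirectory() as tmp:
    for pkg in ("pkg_a", "pkg_b"):
        pkg_dir = Path(tmp) / pkg
        pkg_dir.mkdir()
        (pkg_dir / "run_all.py").write_text(f"SOURCE = {pkg!r}\n")

    # Distinct names in sys.modules keep the two files from shadowing each other.
    mod_a = load_module_from_path("a_run_all", str(Path(tmp) / "pkg_a" / "run_all.py"))
    mod_b = load_module_from_path("b_run_all", str(Path(tmp) / "pkg_b" / "run_all.py"))

print(mod_a.SOURCE, mod_b.SOURCE)
```

With a plain `import run_all`, whichever directory comes first on `sys.path` would win; registering each file under its own module name sidesteps that ordering dependence.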

In [11]:
#@title Verify Data Files
import os

print(f"📊 Checking data files in: {DATA_PATH}\n")

required_files = [
    ("retraction_watch.csv", "Retraction Watch data", True),
    ("all_problematic_papers.csv", "Problematic Papers data", True),
    ("iclr_pangram_submissions.csv", "ICLR Submissions", False),
    ("iclr_pangram_reviews.csv", "ICLR Reviews", False),
]

all_required_found = True
for filename, description, required in required_files:
    filepath = os.path.join(DATA_PATH, filename)
    if os.path.exists(filepath):
        size = os.path.getsize(filepath) / 1024 / 1024  # MB
        print(f"  ✓ {filename} ({size:.1f} MB)")
    else:
        if required:
            print(f"  ❌ {filename} - REQUIRED but not found!")
            all_required_found = False
        else:
            print(f"  ⚠️ {filename} - optional, not found")

if all_required_found:
    print("\n✓ All required files found!")
else:
    print("\n❌ Some required files missing. Please check DATA_PATH.")

📊 Checking data files in: /content/drive/MyDrive/HAI_Data

  ✓ retraction_watch.csv (59.3 MB)
  ✓ all_problematic_papers.csv (244.7 MB)
  ✓ iclr_pangram_submissions.csv (7.6 MB)
  ✓ iclr_pangram_reviews.csv (228.0 MB)

✓ All required files found!


## 3. Load Data

In [None]:
#@title Load Datasets

# Helper function to clean ai_percentage column
def clean_ai_percentage(df, col='ai_percentage'):
    """Convert ai_percentage from string ('100%') to numeric (100.0)."""
    if col in df.columns and df[col].dtype == 'object':
        df[col] = df[col].astype(str).str.replace('%', '', regex=False)
        df[col] = pd.to_numeric(df[col], errors='coerce')
    return df

# Load retraction data
try:
    retraction_df = pd.read_csv(f"{DATA_PATH}/retraction_watch.csv")
    print(f"✓ Retraction data loaded: {len(retraction_df):,} records")
    display(retraction_df.head(3))
except FileNotFoundError as e:
    print(f"⚠️  Retraction data not found: {e}")
    retraction_df = None

# Load problematic papers data
try:
    problematic_df = pd.read_csv(f"{DATA_PATH}/all_problematic_papers.csv")
    print(f"\n✓ Problematic papers data loaded: {len(problematic_df):,} records")
except FileNotFoundError:
    print("⚠️  Problematic papers data not found")
    problematic_df = None

# Load ICLR data if available
try:
    iclr_submissions = pd.read_csv(f"{DATA_PATH}/iclr_pangram_submissions.csv")
    iclr_reviews = pd.read_csv(f"{DATA_PATH}/iclr_pangram_reviews.csv")

    # Clean ai_percentage column (handles '100%' -> 100.0)
    iclr_submissions = clean_ai_percentage(iclr_submissions)
    iclr_reviews = clean_ai_percentage(iclr_reviews)

    # Clean other numeric columns
    numeric_cols = ['avg_rating', 'soundness', 'presentation', 'contribution', 'rating', 'confidence']
    for col in numeric_cols:
        if col in iclr_submissions.columns:
            iclr_submissions[col] = pd.to_numeric(iclr_submissions[col], errors='coerce')
        if col in iclr_reviews.columns:
            iclr_reviews[col] = pd.to_numeric(iclr_reviews[col], errors='coerce')

    print(f"\n✓ ICLR submissions loaded: {len(iclr_submissions):,} records")
    print(f"✓ ICLR reviews loaded: {len(iclr_reviews):,} records")
    print(f"  ai_percentage dtype: {iclr_submissions['ai_percentage'].dtype}")
except FileNotFoundError:
    print("\n⚠️  ICLR data not found (optional)")
    iclr_submissions = None
    iclr_reviews = None

✓ Retraction data loaded: 67,989 records


Unnamed: 0,Record ID,Title,Subject,Institution,Journal,Publisher,Country,Author,URLS,ArticleType,RetractionDate,RetractionDOI,RetractionPubMedID,OriginalPaperDate,OriginalPaperDOI,OriginalPaperPubMedID,RetractionNature,Reason,Paywalled,Notes
0,68582,Study of the ground state properties of 8Li an...,(PHY) Physics;,"Department of Physics, College of Science, Uni...",AIP Conference Proceedings,AIP Publishing,Iraq,Hawraa K Mahdi;Ahmed N Abdullah,,Conference Abstract/Paper;,10/24/2025 0:00,10.1063/5.0181894,0.0,12/22/2023 0:00,10.1063/5.0181894,0.0,Retraction,Concerns/Issues about Peer Review;Concerns/Iss...,No,Original articles updated to include retractio...
1,68581,Study of some biochemical indicators levels in...,(BLS) Parasitology;(HSC) Medicine - Infectious...,"Faculty of Applied Science, Samarra University...",AIP Conference Proceedings,AIP Publishing,Iraq,Wasan Abdulmunem Taha;Ohood Mozahim Shakir;Mar...,,Conference Abstract/Paper;,10/24/2025 0:00,10.1063/5.0182763,0.0,12/22/2023 0:00,10.1063/5.0182763,0.0,Retraction,Concerns/Issues about Peer Review;Concerns/Iss...,No,Original articles updated to include retractio...
2,68580,Study of histological structure of albino rat ...,(HSC) Medicine - Pharmacology;(HSC) Medicine -...,College Medical and health technologies Univer...,AIP Conference Proceedings,AIP Publishing,Iraq,Noor M Hasnawi;Jabbar Abadi Alaridhi;Douaa M M...,,Conference Abstract/Paper;,10/24/2025 0:00,10.1063/5.0182531,0.0,12/22/2023 0:00,10.1063/5.0182531,0.0,Retraction,Concerns/Issues about Peer Review;Concerns/Iss...,No,Original articles updated to include retractio...
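
To make the behaviour of `clean_ai_percentage` concrete, here is a minimal sketch that applies the same helper to a toy column. The values are invented for illustration and do not come from the ICLR files; the column is created with `dtype="object"` on the assumption that this is how the percentage strings arrive from `read_csv`, so that the helper's dtype check applies.

```python
import pandas as pd


def clean_ai_percentage(df, col="ai_percentage"):
    """Convert percentage strings such as '100%' to floats; unparseable values become NaN."""
    if col in df.columns and df[col].dtype == "object":
        df[col] = df[col].astype(str).str.replace("%", "", regex=False)
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df


# Toy values (illustrative only): two valid percentages, one junk string, one missing.
toy = pd.DataFrame(
    {"ai_percentage": pd.Series(["100%", "12.5%", "n/a", None], dtype="object")}
)
cleaned = clean_ai_percentage(toy)
print(cleaned["ai_percentage"].tolist())
```

Because `errors="coerce"` silently turns unparseable entries into NaN, it can be worth counting missing values after cleaning before running any analysis on the column.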


---
# Part A: Retraction Analysis
---

In [None]:
#@title Run Retraction Analysis

# Load retraction_analysis/run_all.py explicitly by path (avoids conflict with iclr_analysis/analysis/run_all.py)
retraction_runner = load_module_from_path(
    "retraction_run_all",
    f"{REPO_PATH}/retraction_analysis/run_all.py"
)

# Run all retraction analyses
if retraction_df is not None:
    retraction_results = retraction_runner.run_all_analyses(
        retraction_path=f"{DATA_PATH}/retraction_watch.csv",
        problematic_path=f"{DATA_PATH}/all_problematic_papers.csv",
        output_dir=OUTPUT_DIR
    )
    print("\n✓ Retraction analysis complete!")
else:
    print("⚠️  Skipping retraction analysis - data not loaded")
    retraction_results = None

In [None]:
#@title Display Generated Figures (Retraction Analysis)
import glob
from IPython.display import Image, display

print("📈 Generated Figures:\n")

# Find all generated figures
figure_files = sorted(glob.glob(f"{OUTPUT_DIR}/figures/*.png"))

if figure_files:
    for fig_path in figure_files:
        fig_name = os.path.basename(fig_path)
        print(f"\n{'='*60}")
        print(f"📊 {fig_name}")
        print(f"{'='*60}")
        display(Image(filename=fig_path, width=800))
else:
    print("No figures generated yet. Run the analysis cells above first.")

In [None]:
#@title Display Generated LaTeX Tables

print("📋 Generated Tables:\n")

# Find all generated tables
table_files = sorted(glob.glob(f"{OUTPUT_DIR}/tables/*.tex"))

if table_files:
    for table_path in table_files:
        table_name = os.path.basename(table_path)
        print(f"\n{'='*60}")
        print(f"📄 {table_name}")
        print(f"{'='*60}")
        with open(table_path, 'r') as f:
            content = f.read()
            print(content[:2000])
            if len(content) > 2000:
                print("... [truncated]")
else:
    print("No tables generated yet. Run the analysis cells above first.")

In [None]:
#@title Generate Enhanced Retraction Figures (Publication Quality)
#@markdown Creates professional, publication-quality figures for the retraction analysis.
#@markdown Matches the style of the ICLR figures with KDE, gradient fills, and statistics boxes.

if retraction_results is not None:
    print("="*70)
    print("GENERATING ENHANCED RETRACTION FIGURES")
    print("="*70)

    # Load enhanced plotting module for retraction analysis
    from retraction_analysis_modules.plotting_enhanced import generate_all_retraction_figures

    figures_dir = f"{OUTPUT_DIR}/figures"

    # Get the processed dataframe from retraction_results
    if 'data' in retraction_results:
        df_for_figures = retraction_results['data']
    else:
        # Reload and process if not available
        from retraction_src.data_loading import load_data, define_ai_cohorts
        rw_df, prob_df = load_data(
            f"{DATA_PATH}/retraction_watch.csv",
            f"{DATA_PATH}/all_problematic_papers.csv",
            start_year=2005
        )
        df_for_figures = define_ai_cohorts(rw_df, prob_df)

    print(f"\nData for figures: {len(df_for_figures):,} records")

    # Generate ALL enhanced figures
    print("\nGenerating enhanced figures...")
    enhanced_figures = generate_all_retraction_figures(
        df=df_for_figures,
        output_dir=figures_dir,
        verbose=True
    )

    # Display all generated enhanced figures
    print("\n" + "="*70)
    print("DISPLAYING ENHANCED RETRACTION FIGURES")
    print("="*70)

    for name, path in enhanced_figures.items():
        print(f"\n{'─'*60}")
        print(f"📊 {name}: {os.path.basename(path)}")
        print(f"{'─'*60}")
        display(Image(filename=path, width=800))

    print("\n" + "="*70)
    print(f"✓ Generated {len(enhanced_figures)} enhanced retraction figures!")
    print(f"  Location: {figures_dir}/")
    print("="*70)

else:
    print("⚠️  Skipping enhanced figures - retraction analysis not run")

---
## Sample Construction Audit

This section documents each step of sample construction, with exclusion counts, so that the cohort definitions can be checked for consistency.

---
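
As a rough illustration of what a step-by-step exclusion trail looks like, the sketch below applies filters in order and records how many records each one removes. It is not the repository's `SampleConstructionAudit`; the function `build_flow_table`, the record fields, and the filters are hypothetical, with only the 2005 start year borrowed from the data-loading call earlier in this notebook.

```python
def build_flow_table(records, steps):
    """Apply exclusion filters in order and record how many rows each step removes."""
    rows = [{"step": "Initial records", "excluded": 0, "remaining": len(records)}]
    current = list(records)
    for label, keep in steps:
        kept = [r for r in current if keep(r)]
        rows.append(
            {"step": label, "excluded": len(current) - len(kept), "remaining": len(kept)}
        )
        current = kept
    return rows, current


# Toy records with hypothetical fields; the real audit defines its own filters.
records = [
    {"doi": "10.1/a", "year": 2010},
    {"doi": None, "year": 2012},
    {"doi": "10.1/b", "year": 1999},
    {"doi": "10.1/c", "year": 2020},
]
steps = [
    ("Drop records without a DOI", lambda r: r["doi"] is not None),
    ("Keep publication year >= 2005", lambda r: r["year"] >= 2005),
]

flow, final = build_flow_table(records, steps)
for row in flow:
    print(f"{row['step']:<32} excluded={row['excluded']:>2} remaining={row['remaining']:>2}")
```

Recording both the excluded and remaining counts at every step lets a reader confirm that the exclusions sum to the difference between the raw and final sample sizes.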

In [None]:
#@title Run Sample Construction Audit
#@markdown This generates a complete audit trail of sample construction with step-by-step exclusion counts.

from retraction_analysis_modules.sample_construction import SampleConstructionAudit

if retraction_df is not None:
    print("="*70)
    print("SAMPLE CONSTRUCTION AUDIT")
    print("="*70)

    # Initialize audit
    audit = SampleConstructionAudit(
        retraction_path=f"{DATA_PATH}/retraction_watch.csv",
        problematic_path=f"{DATA_PATH}/all_problematic_papers.csv"
    )

    # Run the complete audit
    final_df = audit.load_and_process()

    # Display sample flow table
    print("\n" + "="*70)
    print("SAMPLE FLOW TABLE")
    print("="*70)
    flow_df = audit.get_flow_table()
    display(flow_df)

    # Display final cohort breakdown
    print("\n" + "="*70)
    print("FINAL COHORT BREAKDOWN")
    print("="*70)
    if final_df is not None and len(final_df) > 0:
        if 'is_ai' in final_df.columns:
            ai_papers = final_df['is_ai'].sum()
            human_papers = len(final_df) - ai_papers
            print(f"  Total analytic sample: N = {len(final_df):,}")
            print(f"  AI-assisted papers:    n = {ai_papers:,} ({100*ai_papers/len(final_df):.1f}%)")
            print(f"  Human-only papers:     n = {human_papers:,} ({100*human_papers/len(final_df):.1f}%)")

    # Generate outputs
    print("\n" + "="*70)
    print("GENERATING AUDIT OUTPUTS")
    print("="*70)
    audit.generate_full_report(output_dir=OUTPUT_DIR)

    # Display LaTeX table for paper
    print("\n" + "="*70)
    print("LATEX SAMPLE FLOW TABLE (for paper appendix)")
    print("="*70)
    latex_table = audit.to_latex_flow_table()
    print(latex_table)

    # Verify consistency with expected values
    print("\n" + "="*70)
    print("CONSISTENCY CHECK")
    print("="*70)
    # Check against the values mentioned in the paper
    verification = audit.verify_consistency(
        expected_total=58454,
        expected_ai=18159,
        expected_human=40295
    )

    print(f"  Actual total:  {verification['actual_total']:,}")
    print(f"  Actual AI:     {verification['actual_ai']:,}")
    print(f"  Actual Human:  {verification['actual_human']:,}")
    print()

    if verification['checks_passed']:
        print("  ✓ All consistency checks PASSED")
    else:
        print("  ✗ Consistency checks found discrepancies:")
        for disc in verification['discrepancies']:
            print(f"    - {disc}")
        print()
        print("  NOTE: Discrepancies may be due to:")
        print("    - Database updates since paper submission")
        print("    - Different extraction dates")
        print("    - Variations in filtering criteria")

    print("\n✓ Sample construction audit complete!")
else:
    print("⚠️  Skipping sample audit - retraction data not loaded")
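
The consistency check in the cell above compares realised cohort sizes with the expected ones. A minimal sketch of that kind of comparison follows; `verify_counts` is a hypothetical stand-in rather than the repository's `verify_consistency`, and the "later extraction" counts are made up purely to show what a discrepancy report could look like.

```python
def verify_counts(actual, expected):
    """Compare actual cohort sizes with expected ones and report mismatches."""
    discrepancies = [
        f"{key}: expected {expected[key]:,}, got {actual[key]:,}"
        for key in expected
        if actual[key] != expected[key]
    ]
    # The AI and human cohorts should partition the analytic sample.
    if actual["ai"] + actual["human"] != actual["total"]:
        discrepancies.append("ai + human does not equal total")
    return {"checks_passed": not discrepancies, "discrepancies": discrepancies}


# Expected values are the ones passed to the consistency check in the audit cell.
expected = {"total": 58454, "ai": 18159, "human": 40295}

# Case 1: counts match the expected values exactly.
same = verify_counts(dict(expected), expected)

# Case 2: made-up counts imitating a later database extraction (illustrative only).
later = {"total": 58500, "ai": 18159, "human": 40341}
changed = verify_counts(later, expected)

print(same["checks_passed"], changed["checks_passed"])
for item in changed["discrepancies"]:
    print(" -", item)
```

Checking that the two cohorts sum to the total is a cheap extra guard: it catches records that were counted twice or dropped from both cohorts even when the headline total happens to match.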

---
# Part B: ICLR Analysis
---

In [None]:
#@title Run ICLR Analysis

if iclr_submissions is not None and iclr_reviews is not None:
    # Load iclr_analysis/analysis/run_all.py explicitly by path
    iclr_runner = load_module_from_path(
        "iclr_run_all",
        f"{REPO_PATH}/iclr_analysis/analysis/run_all.py"
    )

    iclr_results = iclr_runner.run_all(
        submissions_path=f"{DATA_PATH}/iclr_pangram_submissions.csv",
        reviews_path=f"{DATA_PATH}/iclr_pangram_reviews.csv",
        output_dir=OUTPUT_DIR
    )
    print("\n✓ ICLR analysis complete!")
else:
    print("⚠️  Skipping ICLR analysis - data not loaded")
    iclr_results = None

---
## Enhanced Visualizations (Publication Quality)

Generate professional, single-panel figures for the collaboration analysis.

---

In [None]:
#@title Generate ALL Publication-Quality Figures (Single Panels)
#@markdown Creates individual, professional figures for ALL analysis panels.
#@markdown Includes KDE-based within-paper analysis and handles integer clustering.

if iclr_submissions is not None and iclr_reviews is not None:
    print("="*70)
    print("GENERATING ALL PUBLICATION-QUALITY FIGURES")
    print("="*70)

    # Load enhanced plotting module
    from src.plotting_enhanced import generate_all_iclr_figures
    from analysis.within_paper import prepare_within_paper_data

    figures_dir = f"{OUTPUT_DIR}/figures"

    # Prepare within-paper data if possible
    try:
        paper_ratings = prepare_within_paper_data(iclr_reviews, iclr_submissions)
        print(f"\nWithin-paper data: {len(paper_ratings)} papers with both reviewer types")
    except Exception as e:
        print(f"\nWithin-paper data not available: {e}")
        paper_ratings = None

    # Generate ALL figures as individual files
    print("\nGenerating individual figures...")
    figures = generate_all_iclr_figures(
        submissions_df=iclr_submissions,
        reviews_df=iclr_reviews,
        output_dir=figures_dir,
        paper_ratings=paper_ratings,
        verbose=True
    )

    # Display all generated figures
    print("\n" + "="*70)
    print("DISPLAYING GENERATED FIGURES")
    print("="*70)

    for name, path in figures.items():
        print(f"\n{'─'*60}")
        print(f"📊 {name}: {os.path.basename(path)}")
        print(f"{'─'*60}")
        display(Image(filename=path, width=800))

    print("\n" + "="*70)
    print(f"✓ Generated {len(figures)} publication-quality figures!")
    print(f"  Location: {figures_dir}/")
    print("="*70)

else:
    print("⚠️  Skipping enhanced figures - ICLR data not loaded")

---
# Part C: Summary Statistics
---

In [None]:
#@title Generate Summary Statistics

print("="*60)
print("ANALYSIS SUMMARY")
print("="*60)

if retraction_df is not None:
    print(f"\n📊 Retraction Dataset:")
    print(f"   Total papers: {len(retraction_df):,}")
    if 'is_ai' in retraction_df.columns:
        ai_count = retraction_df['is_ai'].sum()
        print(f"   AI-flagged papers: {ai_count:,} ({100*ai_count/len(retraction_df):.1f}%)")
    if 'pub_year' in retraction_df.columns:
        print(f"   Year range: {retraction_df['pub_year'].min()} - {retraction_df['pub_year'].max()}")

if iclr_submissions is not None:
    print(f"\n📊 ICLR Dataset:")
    print(f"   Total submissions: {len(iclr_submissions):,}")
    print(f"   Total reviews: {len(iclr_reviews):,}")

# Count outputs
import glob
n_figures = len(glob.glob(f"{OUTPUT_DIR}/figures/*.png"))
n_tables = len(glob.glob(f"{OUTPUT_DIR}/tables/*.tex"))
print(f"\n📁 Generated Outputs:")
print(f"   {n_figures} figures")
print(f"   {n_tables} tables")

---
# Part D: Push Results to GitHub
---

In [None]:
#@title Push RESULTS to GitHub (Tables & Figures Only)
#@markdown Automatically pushes only the analysis outputs to GitHub.
#@markdown This ensures your paper always references the latest empirical results.

def push_results_to_github(
    repo_path="/content/HAI-Frontier",
    branch="main",
    commit_message=None,
    github_token=None
):
    """
    Push only RESULTS (figures, tables) to GitHub - not notebooks or code.
    
    This solves the 'fat finger problem' by ensuring reproducibility:
    - Run notebook → Results auto-pushed → Paper references match outputs
    
    Results are organized in: last_results/master_analysis/
    """
    import os
    import subprocess
    from datetime import datetime
    
    # Get token
    token = github_token
    if not token:
        try:
            from google.colab import userdata
            token = userdata.get('GITHUB_TOKEN')
        except Exception:
            pass
    
    if not token:
        print("❌ No GitHub token! Add GITHUB_TOKEN to Colab secrets.")
        return False
    
    os.chdir(repo_path)
    
    # Configure git
    subprocess.run(['git', 'config', 'user.email', 'colab@notebook.local'], capture_output=True)
    subprocess.run(['git', 'config', 'user.name', 'Colab Analysis Runner'], capture_output=True)
    
    # Define ONLY result files to push (organized in last_results/master_analysis/)
    result_patterns = [
        'last_results/master_analysis/figures/*.png',
        'last_results/master_analysis/figures/*.pdf', 
        'last_results/master_analysis/tables/*.tex',
        'last_results/master_analysis/tables/*.csv',
        'last_results/master_analysis/*.csv',
        'last_results/master_analysis/*.tex',
    ]
    
    # Add result files
    print("📊 Adding result files from last_results/master_analysis/...")
    files_added = []
    for pattern in result_patterns:
        result = subprocess.run(f'git add {pattern} 2>/dev/null', shell=True, capture_output=True)
        # Check what was actually added
        import glob
        matched = glob.glob(os.path.join(repo_path, pattern))
        files_added.extend(matched)
    
    if not files_added:
        print("ℹ️  No result files to push.")
        return True
    
    print(f"   Found {len(files_added)} result files")
    
    # Check whether anything is actually staged (ignores unrelated untracked files)
    status = subprocess.run(['git', 'diff', '--cached', '--name-status'], capture_output=True, text=True)
    if not status.stdout.strip():
        print("ℹ️  No changes to commit (results unchanged)")
        return True
    
    # Show what's being committed
    print("\n📝 Files to commit:")
    staged_lines = status.stdout.strip().split('\n')
    for line in staged_lines[:15]:
        print(f"   {line}")
    if len(staged_lines) > 15:
        print(f"   ... and {len(staged_lines) - 15} more")
    
    # Create commit message
    if not commit_message:
        timestamp = datetime.now().strftime("%Y-%m-%d %H:%M")
        commit_message = f"Update analysis results from master_analysis.ipynb ({timestamp})"
    
    # Commit
    result = subprocess.run(
        ['git', 'commit', '-m', commit_message],
        capture_output=True, text=True
    )
    if result.returncode != 0:
        print(f"❌ Commit failed: {result.stderr}")
        return False
    
    print(f"\n✓ Committed: {commit_message}")
    
    # Push
    print("\n📤 Pushing to GitHub...")
    result = subprocess.run(
        ['git', 'push', 'origin', branch],
        capture_output=True, text=True
    )
    if result.returncode != 0:
        print(f"❌ Push failed: {result.stderr}")
        return False
    
    print(f"✓ Successfully pushed to {branch}")
    print(f"\n🔗 View at: https://github.com/{GITHUB_USER}/{REPO_NAME}/tree/{branch}/last_results/master_analysis")
    return True

# ============================================================
# AUTO-PUSH RESULTS
# ============================================================
print("="*60)
print("PUSHING ANALYSIS RESULTS TO GITHUB")
print("="*60)

success = push_results_to_github(
    repo_path=f"/content/{REPO_NAME}",
    branch="main",
    github_token=GITHUB_TOKEN
)

if success:
    print("\n" + "="*60)
    print("✓ Results synchronized with GitHub!")
    print("  Your paper now references the latest empirical outputs.")
    print("  Results location: last_results/master_analysis/")
    print("="*60)
else:
    print("\n⚠️  Push failed - check token and permissions")
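
The push cell restricts commits to result files by matching glob patterns under `last_results/master_analysis/`. The sketch below shows that selection step in isolation, assuming a throwaway directory with invented file names; `collect_result_files` is a hypothetical helper, and only two of the cell's patterns are reused here.

```python
import glob
import os
import tempfile

RESULT_PATTERNS = [
    "last_results/master_analysis/figures/*.png",
    "last_results/master_analysis/tables/*.tex",
]


def collect_result_files(repo_path, patterns):
    """Return repo-relative paths matching any result pattern, sorted and de-duplicated."""
    matched = set()
    for pattern in patterns:
        for path in glob.glob(os.path.join(repo_path, pattern)):
            matched.add(os.path.relpath(path, repo_path))
    return sorted(matched)


with tempfile.TemporaryDirectory() as repo:
    # Hypothetical file names used only for this demonstration.
    for rel in [
        "last_results/master_analysis/figures/fig1.png",
        "last_results/master_analysis/tables/tab1.tex",
        "master_analysis.ipynb",
        "iclr_analysis/analysis/run_all.py",
    ]:
        full = os.path.join(repo, rel)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        open(full, "w").close()

    selected = collect_result_files(repo, RESULT_PATTERNS)

print(selected)
```

A list built this way could be passed to `git add --` as explicit arguments, which would avoid `shell=True` and make the reported file count reflect exactly what was staged.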

---
# Appendix: Download Outputs
---

In [None]:
#@title Download All Outputs as ZIP
from google.colab import files
import shutil

# Create zip of outputs
output_zip = "/content/analysis_outputs.zip"
shutil.make_archive("/content/analysis_outputs", 'zip', OUTPUT_DIR)

print(f"📦 Created: {output_zip}")
print("\nClick below to download:")
files.download(output_zip)