This repository comes from an in-class lab focused on the reproducibility of GitHub repositories.
Reproducibility Lab Submission for Maia-2 (NeurIPS 2024)
This repository contains my reproducibility work for the NeurIPS 2024 paper:
Maia-2: A Unified Model for Human-AI Alignment in Chess
Zhenwei Tang, Difan Jiao, Reid McIlroy-Young, Jon Kleinberg, Siddhartha Sen, Ashton Anderson
Repository Contents
reproducibility_lab.ipynb — Executed Colab notebook with all baseline + variation experiments, plus reflections.
runs/ — Experiment outputs: JSON files for baseline/variation, both batch and position-wise inference.
logs/ — Environment metadata (env_meta.json) and package lockfile (requirements.lock.txt).
reproduce/ —
* Reproducibility.md — Full reproducibility report (environment, commands, results, reflection).
* run_all.sh — Script to automatically reproduce baseline + variation runs and save outputs/logs.
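As an illustration of what an environment snapshot such as logs/env_meta.json might record, here is a minimal sketch using only the Python standard library. The field names (`python`, `platform`, `packages`) and the helper names `collect_env_meta` and `write_env_meta` are assumptions made for this example. They are not necessarily the layout of the file in this repo.

```python
import json
import platform
import sys
from importlib import metadata
from pathlib import Path


def collect_env_meta(packages):
    """Return a dict describing the interpreter, OS, and selected package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


def write_env_meta(path, packages):
    """Write the metadata as JSON, creating the parent directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    meta = collect_env_meta(packages)
    path.write_text(json.dumps(meta, indent=2))
    return meta
```

For example, calling `write_env_meta("logs/env_meta.json", ["torch", "numpy", "chess"])` would write one such snapshot. Packages that are not installed are recorded as `null` instead of raising an error.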
Environment Setup
Clone the upstream Maia-2 repo (not included here):
git clone https://github.com/CSSLab/maia2.git
cd maia2
Create a fresh environment (Python 3.10+ recommended):
python -m venv .venv
source .venv/bin/activate
pip install -U pip
Install dependencies:
pip install chess==1.10.0 einops==0.8.0 gdown==5.2.0 numpy==2.1.3 pandas==2.2.3 \
    pyzstd==0.15.9 requests==2.32.3 torch==2.4.0 tqdm==4.65.0
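After installing, it can help to confirm that the active environment matches the pinned versions above. The following is a minimal sketch that uses `importlib.metadata` from the standard library. The helper name `find_mismatches` is hypothetical and is not part of this repo or of Maia-2.

```python
from importlib import metadata

# Pins copied from the install command above.
PINS = {
    "chess": "1.10.0",
    "einops": "0.8.0",
    "gdown": "5.2.0",
    "numpy": "2.1.3",
    "pandas": "2.2.3",
    "pyzstd": "0.15.9",
    "requests": "2.32.3",
    "torch": "2.4.0",
    "tqdm": "4.65.0",
}


def find_mismatches(pins):
    """Return {package: (expected, installed)} for every pin that is not satisfied."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Local version suffixes such as "+cu121" are ignored for the comparison.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches
```

An empty result from `find_mismatches(PINS)` suggests the environment matches the pins. Any entry lists the expected version next to what is actually installed (or `None` if the package is missing).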
How to Reproduce Results
From the root of this repo, run the shell script:
bash reproduce/run_all.sh
This will:
- Run batch inference (baseline + variation).
- Run position-wise inference (baseline + variation).
- Save all outputs under runs/.
- Save logs and environment snapshots under logs/.
Alternatively, open reproducibility_lab.ipynb and re-execute cells manually in Colab or Jupyter.
Expected Results
Batch Inference: Accuracy numbers comparable between baseline and variation (with minor changes depending on batch size/model type).
Position-wise Inference: Predicted moves and win probabilities logged, showing changes when Elo parameters are varied.
Exact numbers may differ slightly depending on hardware (CPU vs GPU) and random seeds. Our experiments used a random seed of 42 and an A100 GPU on Google Colab.
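Because reruns may not match the stored numbers exactly, a tolerance-based comparison is one way to judge whether a rerun counts as "close enough". The sketch below assumes each result has already been loaded (for instance with `json.load`) into a flat dict of metric names to values. That schema, the helper name `compare_metrics`, and the default tolerance are assumptions for illustration, not the actual structure of the files under runs/.

```python
import math


def compare_metrics(reference, rerun, rel_tol=1e-3):
    """Compare numeric values shared by two flat metric dicts.

    Returns {key: (reference_value, rerun_value, within_tolerance)}.
    """
    report = {}
    for key in sorted(reference.keys() & rerun.keys()):
        a, b = reference[key], rerun[key]
        # Skip booleans explicitly, since bool is a subclass of int in Python.
        if isinstance(a, bool) or isinstance(b, bool):
            continue
        # Only numeric pairs are compared; strings and nested values are ignored.
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            report[key] = (a, b, math.isclose(a, b, rel_tol=rel_tol))
    return report
```

A relative tolerance is used here so that the same threshold works for metrics on different scales. A stricter or looser `rel_tol` may be more appropriate depending on the metric.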
Notes
The upstream repo (CSSLab/maia2) is not included here — only reproducibility artifacts.
Logs and outputs in this repo were generated using Google Colab (GPU runtime).
Non-determinism may arise from GPU differences and API randomness.
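One common way to reduce (though not eliminate) this kind of run-to-run variation is to seed every random number generator in use before running inference. The sketch below is an assumption about how that could be done. It is not necessarily how the notebook or the upstream Maia-2 code sets its seed, and the helper name `set_seed` is hypothetical. NumPy and PyTorch are treated as optional so the function still works where they are not installed.

```python
import random


def set_seed(seed=42):
    """Seed the RNGs that are available; NumPy and PyTorch are optional here."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)           # seeds the default PyTorch generators
        torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    except ImportError:
        pass
    return seed
```

Seeding alone does not guarantee identical GPU results, since some CUDA kernels are non-deterministic. PyTorch also provides `torch.use_deterministic_algorithms(True)`, which can tighten this further but may raise errors for operations that have no deterministic implementation.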