ReproScore

Separating Readiness from Outcome in Research Software Reproducibility Assessment

ReproScore is a two-tier scoring framework for assessing the reproducibility of research software repositories. It separates reproducibility readiness (what a repository contains) from reproducibility outcome (whether the software actually runs), making the distinction explicit and measurable.

Author: Sheeba Samuel · sheeba.samuel@informatik.tu-chemnitz.de · Chemnitz University of Technology

Overview

Score	What it measures	When available
RRS	Static readiness — 26 sub-metrics across 5 categories	Always (no execution needed)
ROS	Execution outcome — up to 6 sandboxed probes	When execution infrastructure is available
RCS	Coverage-adaptive composite of RRS + ROS	When any ROS component is available

The core insight is that a repository can score well on static readiness yet fail to execute (e.g. pinned dependencies with a version conflict), and conversely execute successfully despite minimal static signals. ReproScore makes this readiness–outcome gap visible rather than conflating the two quantities.

Repository structure

reproscore/
├── ablation_analysis.py        ← Reproduce all evaluation statistics from scores.csv
├── config/
│   └── default_rubric.yaml     ← Category weights (w_i), gate parameters (τ, k), penalties
├── src/
│   ├── scoring/
│   │   ├── rrs.py              ← 26 sub-metric detectors, gate function, category aggregation
│   │   ├── ros.py              ← Execution outcome scoring (6 components)
│   │   ├── rcs.py              ← Composite score with coverage weight α
│   │   └── rubric.py           ← YAML rubric loader and validator
│   └── utils/
│       └── notebook_paths.py   ← Notebook discovery and exclusion filters
├── tests/
│   └── test_scoring.py         ← Unit tests for RRS (no execution required)
└── data/
    └── ablation/
        └── 20260511_101920/    ← Evaluation run (423 repositories)
            ├── scores.csv      ← Per-repository RRS scores + 26 sub-metrics + ground truth
            ├── analysis_results.json
            ├── analysis.log
            ├── provenance.json
            ├── logs/           ← Clone and score logs
            └── repos/          ← Per-repository score provenance (JSON, one file per repo)

Quick start

Install

git clone https://github.com/myVSR/reproscore.git
cd reproscore
pip install -r requirements.txt

Score a repository

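from src.scoring.rrs import RRSScorer

# Static analysis only: nothing in the repository is executed.
result = RRSScorer().score("/path/to/repo")
print(f"RRS: {result.rrs:.1f}")
# Per-category raw scores, keyed by category symbol (E, A, D, C, S).
for sym, cat in result.category_scores.items():
    print(f"  {sym}: {cat.raw_score:.1f}")
# Print the fix suggestion for every piece of evidence that carries one.
for ev in result.evidence:
    if ev.fix_suggestion:
        print(f"  [{ev.metric_id}] {ev.fix_suggestion}")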

Reproduce evaluation statistics

python ablation_analysis.py
# or point at the bundled run explicitly:
python ablation_analysis.py --run-dir data/ablation/20260511_101920

Run tests

pytest tests/test_scoring.py -v

Scoring model

RRS — 5 categories, 26 sub-metrics

Symbol	Category	Weight	τ	k	Sub-metrics
E	Environment specification	0.30	40	1.5	dep_pinning, container_spec, env_bootstrap, python_version_declared
A	Data accessibility	0.25	30	1.5	data_description, data_pointer, workflow_orchestration, data_acquisition_script
D	Documentation	0.20	20	1.2	doc_structure, install_instructions, usage_examples, inline_explanation_density, execution_entry_point, docstring_coverage, reuse_metadata
C	Code portability	0.15	25	1.2	no_absolute_paths, import_resolvability, no_hardcoded_credentials, silent_failure_masking
S	Reproducibility signals	0.10	30	1.2	seed_management, notebook_exec_order, test_file_presence, expected_outputs, ci_presence, config_externalised, hardware_requirements

Within-category weights reflect an analysis of execution failure patterns; see the docstrings of the _aggregate_* functions in src/scoring/rrs.py for the rationale.

Gate function

g(x, τ, k) = x / 100                    if x ≥ τ
           = (x / τ)^k · (τ / 100)      if x < τ

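The gate passes category scores at or above the threshold τ through linearly and penalises scores below τ non-linearly; the two branches meet at x = τ, so the function is continuous. Core categories (E, A) use k = 1.5; quality categories (D, C, S) use k = 1.2.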
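The gate formula can be sketched directly in Python. This is a minimal illustration of the piecewise definition, not the code in src/scoring/rrs.py; the function name gate and the input range check are assumptions made for the example.

```python
def gate(x: float, tau: float, k: float) -> float:
    """Map a category score x in [0, 100] to a gated value in [0, 1]."""
    if not 0.0 <= x <= 100.0:
        raise ValueError("x must lie in [0, 100]")
    if tau <= 0.0:
        raise ValueError("tau must be positive")
    if x >= tau:
        # At or above the threshold the score passes through linearly.
        return x / 100.0
    # Below the threshold the score is shrunk by the exponent k.
    return (x / tau) ** k * (tau / 100.0)


# Category E uses tau = 40 and k = 1.5.
print(round(gate(80, 40, 1.5), 4))  # above threshold: 0.8
print(round(gate(40, 40, 1.5), 4))  # at threshold: 0.4
print(round(gate(20, 40, 1.5), 4))  # below threshold: 0.1414
```

For comparison, a purely linear mapping would give 0.20 for a score of 20, so with k = 1.5 the sub-threshold score loses roughly 30% of its linear value.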

Hard penalties

Condition	Penalty
E < 10 (no environment specification)	−20 pts
A < 10 (no data artefacts)	−15 pts
seed score < 50 (stochastic ops, no seeds)	−10 pts

Penalty magnitudes are calibrated to approximately the maximum weight contribution of the penalised category.
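The penalty table can be read as a small rule set. The sketch below encodes the three thresholds from the table; that penalties stack additively and that the penalised score is clamped to the 0 to 100 range are assumptions of this example, as are the function names hard_penalty and apply_penalty.

```python
def hard_penalty(e_score: float, a_score: float, seed_score: float) -> float:
    """Return the total points to subtract, using the thresholds in the table."""
    penalty = 0.0
    if e_score < 10:     # no environment specification
        penalty += 20.0
    if a_score < 10:     # no data artefacts
        penalty += 15.0
    if seed_score < 50:  # stochastic operations without seeds
        penalty += 10.0
    return penalty


def apply_penalty(score: float, penalty: float) -> float:
    """Subtract the penalty and clamp to [0, 100] (clamping is an assumption)."""
    return max(0.0, min(100.0, score - penalty))


# A repository with no environment specification but adequate data and seeds.
print(apply_penalty(55.0, hard_penalty(e_score=5, a_score=60, seed_score=80)))  # 35.0
```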

Community rubric

Override any weight or gate parameter via a YAML profile:

name: bioinformatics-v1
version: "1.0"
categories:
  E: {weight: 0.35, tau: 40, k: 1.5}
  A: {weight: 0.40, tau: 30, k: 1.5}
  D: {weight: 0.10, tau: 20, k: 1.2}
  C: {weight: 0.05, tau: 25, k: 1.2}
  S: {weight: 0.10, tau: 30, k: 1.2}

from src.scoring.rrs import RRSScorer
from src.scoring.rubric import load_rubric

rubric = load_rubric("my_rubric.yaml")
result = RRSScorer(rubric=rubric).score("/path/to/repo")
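Before loading a custom profile, it can help to sanity-check it. The sketch below works on the profile as a plain Python dict and is not the validator in src/scoring/rubric.py; the function name check_rubric and the specific rules (all five categories present, weights summing to 1, τ in (0, 100], k ≥ 1) are assumptions chosen to match the default values shown in this README.

```python
import math

REQUIRED_CATEGORIES = ("E", "A", "D", "C", "S")


def check_rubric(rubric: dict) -> list[str]:
    """Return a list of problems; an empty list means the profile looks consistent."""
    problems = []
    cats = rubric.get("categories", {})
    for sym in REQUIRED_CATEGORIES:
        if sym not in cats:
            problems.append(f"missing category {sym}")
    # Assumed rule: category weights form a convex combination.
    total = sum(c.get("weight", 0.0) for c in cats.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        problems.append(f"weights sum to {total:.2f}, expected 1.00")
    for sym, c in cats.items():
        if not 0 < c.get("tau", 0) <= 100:
            problems.append(f"{sym}: tau must be in (0, 100]")
        # With k < 1 the gate would boost sub-threshold scores instead of penalising them.
        if c.get("k", 0) < 1:
            problems.append(f"{sym}: k must be at least 1")
    return problems


bioinformatics = {
    "name": "bioinformatics-v1",
    "version": "1.0",
    "categories": {
        "E": {"weight": 0.35, "tau": 40, "k": 1.5},
        "A": {"weight": 0.40, "tau": 30, "k": 1.5},
        "D": {"weight": 0.10, "tau": 20, "k": 1.2},
        "C": {"weight": 0.05, "tau": 25, "k": 1.2},
        "S": {"weight": 0.10, "tau": 30, "k": 1.2},
    },
}
print(check_rubric(bioinformatics))  # []
```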

Evaluation dataset

The data/ablation/20260511_101920/ directory contains results for 423 Python/Jupyter GitHub repositories drawn from biomedical publications indexed in PubMed Central. The repositories come from the "Dataset of a Study of Computational reproducibility of Jupyter notebooks from biomedical publications" (Samuel and Mietchen, 2023) and are stratified across five execution failure modes:

Failure mode	n	Description
success	84–85	All notebooks completed without error
install_dep	84–85	Install-time dependency conflict
missing_module	84–85	ModuleNotFoundError / ImportError at runtime
missing_data	84–85	FileNotFoundError / missing input data
code_error	84–85	TypeError / NameError / SyntaxError

scores.csv contains one row per repository with RRS, all 26 sub-metric scores, category scores, and the ground-truth failure mode label.
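A file with this layout can be summarised with the standard library alone. The sketch below groups RRS by failure mode; the column names rrs and failure_mode are assumptions (check the header of scores.csv for the actual names), and the three sample rows are made-up placeholders for illustration, not values from the dataset.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Placeholder rows for illustration only; these are NOT real results.
SAMPLE = """repo,rrs,failure_mode
repo_a,62.5,success
repo_b,41.0,missing_data
repo_c,58.5,success
"""


def mean_rrs_by_mode(handle) -> dict[str, float]:
    """Average the assumed 'rrs' column within each assumed 'failure_mode' label."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        groups[row["failure_mode"]].append(float(row["rrs"]))
    return {mode: mean(values) for mode, values in groups.items()}


print(mean_rrs_by_mode(io.StringIO(SAMPLE)))
```

To run the same summary on the bundled file, pass an open file handle for data/ablation/20260511_101920/scores.csv instead of the in-memory sample, after adjusting the column names if they differ.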

repos/ contains per-repository provenance JSON (sub-metric evidence, file-level detections) for all 423 repositories.

License

GNU General Public License v3.0 — see LICENSE.

References:

Sheeba Samuel and Daniel Mietchen (2024). Computational reproducibility of Jupyter notebooks from biomedical publications. GigaScience, 13:giad113.
Sheeba Samuel and Daniel Mietchen (2024). FAIR Jupyter: A Knowledge Graph Approach to Semantic Sharing and Granular Exploration of a Computational Notebook Reproducibility Dataset. In Special Issue on Resources for Graph Data and Knowledge. Transactions on Graph Data and Knowledge (TGDK), Volume 2, Issue 2, pp. 4:1-4:24. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. https://doi.org/10.4230/TGDK.2.2.4
Sheeba Samuel and Daniel Mietchen (2023). Dataset of a Study of Computational reproducibility of Jupyter notebooks from biomedical publications [Dataset]. Zenodo. https://doi.org/10.5281/zenodo.8226725
