Infrastructure for normalizing multi-source code-edit datasets, plus scaffolding for the sandbox/state/actor/teacher components referenced in `PLAN_STAGE_1.md`.
- `dataset/<name>/download.py` – dataset-specific download entrypoints
- `configs/` – dataset catalog, sandbox/actors/teacher configs, local dataset storage mappings
- `configs/datasets.yaml` – curated metadata for every upstream corpus we plan to ingest
- `src/data` – shared schemas and download utilities
- `scripts/` – orchestration / reporting helpers
- `reports/` – generated dataset notes and checkpoints
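The shared schemas under `src/data` define the normalization target for every corpus. As a purely illustrative sketch (the field names here are hypothetical, not the actual schema), a normalized code-edit record might carry fields like these:

```python
# Hypothetical sketch of a normalized code-edit record; the real schemas
# live in src/data and may use different names and fields.
from dataclasses import dataclass, field


@dataclass
class EditRecord:
    dataset: str       # upstream corpus, e.g. "commitpackft"
    source_id: str     # identifier within that corpus
    language: str      # language of the edited file
    before: str        # file content before the edit
    after: str         # file content after the edit
    instruction: str   # natural-language description of the change
    extra: dict = field(default_factory=dict)  # provenance odds and ends
```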
This repo assumes Python 3.10+ and uses a `src/` layout. The easiest way to set up a local environment is with `uv`:
```bash
cd /path/to/ast-edit

# 1) Create a virtual environment
uv venv

# 2) Activate it (bash/zsh)
source .venv/bin/activate

# 3) Install runtime and dev dependencies
uv pip install -e ".[dev]"

# 4) Run tests to sanity-check the setup
pytest -q
```

After this, commands like `pytest` and the `python -m ...` entrypoints below should work as-is.
All dataset helpers live inside Python packages under `dataset/`. Execute them as modules so imports resolve through the package layout instead of `sys.path` hacks:
```bash
python -m dataset.commitpackft.download --help
python -m dataset.agentpack.download
```

Each downloader stages files under `dataset/<name>/content/` and writes `_meta.json` describing provenance and checksums.
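The staging contract is small enough to sketch. The helper below is a hypothetical illustration (not the actual downloader code) of how a file might be staged with its provenance and SHA-256 checksum recorded in `_meta.json`:

```python
# Illustrative sketch of the staging contract described above; the real
# downloaders in dataset/<name>/download.py are more involved.
import hashlib
import json
from pathlib import Path


def stage_file(content_dir: Path, name: str, data: bytes, source_url: str) -> None:
    """Write one raw file and record provenance + checksum in _meta.json."""
    content_dir.mkdir(parents=True, exist_ok=True)
    (content_dir / name).write_bytes(data)

    meta_path = content_dir / "_meta.json"
    meta = json.loads(meta_path.read_text()) if meta_path.exists() else {"files": {}}
    meta["files"][name] = {
        "source_url": source_url,
        "sha256": hashlib.sha256(data).hexdigest(),
    }
    meta_path.write_text(json.dumps(meta, indent=2))
```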
To keep large corpora off the main disk, you can manage dataset storage locations via a local YAML config and symlinks:
- Create `configs/dataset_storage.local.yaml` (this file is gitignored), for example:

  ```yaml
  base_dir: /mnt/18tb/ast-edit-datasets
  datasets:
    commitpackft: commitpackft
    agentpack: agentpack
  ```

- Generate/update the `dataset/<name>/content` symlinks:

  ```bash
  python -m scripts.manage_dataset_storage
  ```
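The symlink step is mechanical: resolve the mapping from the YAML file and point each `dataset/<name>/content` into `base_dir`. A rough sketch of that logic, assuming the config layout shown above (the real `scripts/manage_dataset_storage.py` may handle more cases):

```python
# Sketch of the symlink sync; assumes the YAML layout shown above and
# may differ from the real scripts/manage_dataset_storage.py.
from pathlib import Path

import yaml  # PyYAML


def sync_symlinks(config_path: str = "configs/dataset_storage.local.yaml") -> None:
    cfg = yaml.safe_load(Path(config_path).read_text())
    base_dir = Path(cfg["base_dir"])
    for name, subdir in cfg["datasets"].items():
        target = base_dir / subdir
        target.mkdir(parents=True, exist_ok=True)
        link = Path("dataset") / name / "content"
        if link.is_symlink():
            link.unlink()  # refresh a stale link
        elif link.exists():
            raise RuntimeError(f"{link} is a real directory; move it aside first")
        link.symlink_to(target, target_is_directory=True)


if __name__ == "__main__":
    sync_symlinks()
```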
To finalize Stage 1 on a new machine, run the following (from the repo root):
- Create and activate the venv:

  ```bash
  uv venv
  source .venv/bin/activate
  uv pip install -e ".[dev]"
  pytest -q
  ```

- Configure Hugging Face auth (for gated/hosted datasets):

  ```bash
  cat > .env.local << 'EOF'
  HF_TOKEN=hf_xxx_your_token_here
  EOF
  set -a
  . ./.env.local
  set +a
  ```

- Point datasets at the large HDD:

  ```bash
  # Ensure configs/dataset_storage.local.yaml points at your HDD,
  # e.g. base_dir: /mnt/seagate_18tb/ast-edit-dataset
  python -m scripts.manage_dataset_storage
  ```

- Dry-run dataset downloads (a sketch of the orchestrator follows this list):

  ```bash
  python -m src.dataset.download_all --metadata-only
  ```

- Run full downloads (as needed):

  ```bash
  # All datasets sequentially
  python -m src.dataset.download_all

  # Or one-by-one
  python -m dataset.commitpackft.download
  python -m dataset.editpackft.download
  python -m dataset.canitedit.download
  python -m dataset.agentpack.download
  python -m dataset.smellycode.download
  ```
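For orientation, the orchestrator only needs to loop over the per-dataset entrypoints. A simplified, hypothetical sketch follows; it assumes the per-dataset scripts also accept `--metadata-only`, which the real `src/dataset/download_all.py` may handle differently:

```python
# Hypothetical sketch of an orchestrator like src/dataset/download_all.py;
# only the module names and the --metadata-only flag come from this README.
import argparse
import subprocess
import sys

DATASETS = ["commitpackft", "editpackft", "canitedit", "agentpack", "smellycode"]


def main() -> None:
    parser = argparse.ArgumentParser(description="Run every dataset downloader in sequence.")
    parser.add_argument(
        "--metadata-only",
        action="store_true",
        help="Refresh provenance metadata without fetching full corpora.",
    )
    args = parser.parse_args()

    for name in DATASETS:
        cmd = [sys.executable, "-m", f"dataset.{name}.download"]
        if args.metadata_only:
            cmd.append("--metadata-only")
        print(f"==> {' '.join(cmd)}")
        subprocess.run(cmd, check=True)  # fail fast on the first broken dataset


if __name__ == "__main__":
    main()
```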
Some download scripts require authenticated access (e.g., private Hugging Face mirrors). Provide credentials via environment variables or an untracked `.env.local` file at the repo root:
```bash
HF_TOKEN=hf_xxx
CUSTOM_S3_ENDPOINT=https://...
```

Load them before running scripts:
```bash
set -a
. ./.env.local
set +a
```

Never commit real credentials. `.env*` patterns are already ignored in `.gitignore`, and long-lived tokens should be stored in your password manager. For CI usage, rely on your provider's secret store rather than plaintext files.
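On the Python side, scripts can then pick the token up from the environment instead of hard-coding it. A minimal example (the `require_env` helper is illustrative, not part of this repo):

```python
# Illustrative helper for consuming the variables loaded from .env.local;
# not part of this repo's actual code.
import os


def require_env(name: str) -> str:
    """Return a required credential, failing loudly before any download starts."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; load .env.local first (see above)")
    return value


hf_token = require_env("HF_TOKEN")  # pass to APIs that accept an auth token
```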