Infrastructure for normalizing multi-source code-edit datasets, plus scaffolding for the sandbox/state/actor/teacher components referenced in `PLAN_STAGE_1.md`.
- `dataset/<name>/download.py` – dataset-specific download entrypoints
- `configs/` – dataset catalog, sandbox/actors/teacher configs, local dataset storage mappings
- `configs/datasets.yaml` – curated metadata for every upstream corpus we plan to ingest
- `src/data` – shared schemas and download utilities
- `scripts/` – orchestration / reporting helpers
- `reports/` – generated dataset notes and checkpoints
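The shared schemas under `src/data` define the normalization target for every corpus. As a purely illustrative sketch (the field names here are hypothetical, not the actual schema), a normalized code-edit record might carry fields like these:

```python
# Hypothetical sketch of a normalized code-edit record; the real schemas
# live in src/data and may use different names and fields.
from dataclasses import dataclass, field


@dataclass
class EditRecord:
    dataset: str       # upstream corpus, e.g. "commitpackft"
    source_id: str     # identifier within that corpus
    language: str      # language of the edited file
    before: str        # file content before the edit
    after: str         # file content after the edit
    instruction: str   # natural-language description of the change
    extra: dict = field(default_factory=dict)  # provenance odds and ends
```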
This repo assumes Python 3.10+ and uses a `src/` layout. The easiest way to set up a local environment is with `uv`:
```bash
cd /path/to/ast-edit

# 1) Create a virtual environment
uv venv

# 2) Activate it (bash/zsh)
source .venv/bin/activate

# 3) Install runtime and dev dependencies
uv pip install -e ".[dev]"

# 4) Run tests to sanity-check the setup
pytest -q
```

After this, commands like `pytest` and the `python -m ...` entrypoints below should work as-is.
All dataset helpers live inside Python packages under `dataset/`. Execute them as modules so imports resolve through the package layout instead of `sys.path` hacks:
```bash
python -m dataset.commitpackft.download --help
python -m dataset.agentpack.download
```

Each downloader stages files under `dataset/<name>/content/` and writes `_meta.json` describing provenance and checksums.
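The staging contract is small enough to sketch. The helper below is a hypothetical illustration (not the actual downloader code) of how a file might be staged with its provenance and SHA-256 checksum recorded in `_meta.json`:

```python
# Illustrative sketch of the staging contract described above; the real
# downloaders in dataset/<name>/download.py are more involved.
import hashlib
import json
from pathlib import Path


def stage_file(content_dir: Path, name: str, data: bytes, source_url: str) -> None:
    """Write one raw file and record provenance + checksum in _meta.json."""
    content_dir.mkdir(parents=True, exist_ok=True)
    (content_dir / name).write_bytes(data)

    meta_path = content_dir / "_meta.json"
    meta = json.loads(meta_path.read_text()) if meta_path.exists() else {"files": {}}
    meta["files"][name] = {
        "source_url": source_url,
        "sha256": hashlib.sha256(data).hexdigest(),
    }
    meta_path.write_text(json.dumps(meta, indent=2))
```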
To keep large corpora off the main disk, you can manage dataset storage locations via a local YAML config and symlinks:
- Create `configs/dataset_storage.local.yaml` (this file is gitignored), for example:

  ```yaml
  base_dir: /mnt/18tb/ast-edit-datasets
  datasets:
    commitpackft: commitpackft
    agentpack: agentpack
  ```

- Generate/update the `dataset/<name>/content` symlinks:

  ```bash
  python -m scripts.manage_dataset_storage
  ```
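The symlink step is mechanical: resolve the mapping from the YAML file and point each `dataset/<name>/content` into `base_dir`. A rough sketch of that logic, assuming the config layout shown above (the real `scripts/manage_dataset_storage.py` may handle more cases):

```python
# Sketch of the symlink sync; assumes the YAML layout shown above and
# may differ from the real scripts/manage_dataset_storage.py.
from pathlib import Path

import yaml  # PyYAML


def sync_symlinks(config_path: str = "configs/dataset_storage.local.yaml") -> None:
    cfg = yaml.safe_load(Path(config_path).read_text())
    base_dir = Path(cfg["base_dir"])
    for name, subdir in cfg["datasets"].items():
        target = base_dir / subdir
        target.mkdir(parents=True, exist_ok=True)
        link = Path("dataset") / name / "content"
        if link.is_symlink():
            link.unlink()  # refresh a stale link
        elif link.exists():
            raise RuntimeError(f"{link} is a real directory; move it aside first")
        link.symlink_to(target, target_is_directory=True)


if __name__ == "__main__":
    sync_symlinks()
```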
To finalize Stage 1 on a new machine, run the following (from the repo root):
- Create and activate the venv:

  ```bash
  uv venv
  source .venv/bin/activate
  uv pip install -e ".[dev]"
  pytest -q
  ```

- Configure Hugging Face auth (for gated/hosted datasets):

  ```bash
  cat > .env.local << 'EOF'
  HF_TOKEN=hf_xxx_your_token_here
  EOF
  set -a
  . ./.env.local
  set +a
  ```

- Point datasets at the large HDD:

  ```bash
  # Ensure configs/dataset_storage.local.yaml points at your HDD,
  # e.g. base_dir: /mnt/seagate_18tb/ast-edit-dataset
  python -m scripts.manage_dataset_storage
  ```

- Dry-run dataset downloads (a sketch of the orchestrator follows this list):

  ```bash
  python -m src.dataset.download_all --metadata-only
  ```

- Run full downloads (as needed):

  ```bash
  # All datasets sequentially
  python -m src.dataset.download_all

  # Or one-by-one
  python -m dataset.commitpackft.download
  python -m dataset.editpackft.download
  python -m dataset.canitedit.download
  python -m dataset.agentpack.download
  python -m dataset.smellycode.download
  ```
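For orientation, the orchestrator only needs to loop over the per-dataset entrypoints. A simplified, hypothetical sketch follows; it assumes the per-dataset scripts also accept `--metadata-only`, which the real `src/dataset/download_all.py` may handle differently:

```python
# Hypothetical sketch of an orchestrator like src/dataset/download_all.py;
# only the module names and the --metadata-only flag come from this README.
import argparse
import subprocess
import sys

DATASETS = ["commitpackft", "editpackft", "canitedit", "agentpack", "smellycode"]


def main() -> None:
    parser = argparse.ArgumentParser(description="Run every dataset downloader in sequence.")
    parser.add_argument(
        "--metadata-only",
        action="store_true",
        help="Refresh provenance metadata without fetching full corpora.",
    )
    args = parser.parse_args()

    for name in DATASETS:
        cmd = [sys.executable, "-m", f"dataset.{name}.download"]
        if args.metadata_only:
            cmd.append("--metadata-only")
        print(f"==> {' '.join(cmd)}")
        subprocess.run(cmd, check=True)  # fail fast on the first broken dataset


if __name__ == "__main__":
    main()
```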
Some download scripts require authenticated access (e.g., private Hugging Face mirrors). Provide credentials via environment variables or an untracked `.env.local` file at the repo root:
```bash
HF_TOKEN=hf_xxx
CUSTOM_S3_ENDPOINT=https://...
```

Load them before running scripts:
```bash
set -a
. ./.env.local
set +a
```

Never commit real credentials. `.env*` patterns are already ignored in `.gitignore`, and long-lived tokens should be stored in your password manager. For CI usage, rely on your provider's secret store rather than plaintext files.
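On the Python side, scripts can then pick the token up from the environment instead of hard-coding it. A minimal example (the `require_env` helper is illustrative, not part of this repo):

```python
# Illustrative helper for consuming the variables loaded from .env.local;
# not part of this repo's actual code.
import os


def require_env(name: str) -> str:
    """Return a required credential, failing loudly before any download starts."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; load .env.local first (see above)")
    return value


hf_token = require_env("HF_TOKEN")  # pass to APIs that accept an auth token
```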