A reproducible research pipeline to analyze which parts of the NumPy codebase evolve the most over time (Research Question RQ2). The project mines the git history using PyDriller, stores structured data in SQLite, and generates analysis notebooks and visualizations.
- End-to-end data extraction from a local numpy git clone
- Rename-aware file path normalization for longitudinal tracking
- Bot and merge commit filtering
- SQLite database with analysis-friendly views
- Parallel sharding support for faster extraction
- Comprehensive analysis notebook with plots and exports
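As a rough illustration of what rename-aware normalization can look like, the sketch below folds a chronological list of rename events into a lookup table, so that every historical path resolves to the most recent path of the same file. The function names and the example paths are hypothetical and are not taken from data_extraction_pipeline.py; the real pipeline may resolve renames differently.

```python
from typing import Dict, Iterable, Tuple


def build_canonical_paths(renames: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each historical path to the latest known path of the same file.

    `renames` holds (old_path, new_path) pairs in chronological order.
    """
    canonical: Dict[str, str] = {}
    for old_path, new_path in renames:
        # Re-point every earlier name that currently resolves to old_path.
        for path, target in list(canonical.items()):
            if target == old_path:
                canonical[path] = new_path
        canonical[old_path] = new_path
        canonical.setdefault(new_path, new_path)
    return canonical


def normalize_path(path: str, canonical: Dict[str, str]) -> str:
    """Return the canonical path, or the path itself if it was never renamed."""
    return canonical.get(path, path)


# Illustrative rename chain (hypothetical events).
table = build_canonical_paths([
    ("numpy/core/a.py", "numpy/core/b.py"),
    ("numpy/core/b.py", "numpy/_core/b.py"),
])
print(normalize_path("numpy/core/a.py", table))
```

This simple version assumes a path is never reused by an unrelated file after a rename; handling that case would require keying the table on commit order as well as path.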
- data_extraction_pipeline.py: PyDriller-based miner and SQLite writer
- config.yaml: Configuration for repository path, output, and filters
- run.sh: Orchestrates full and parallel-sharded extractions with logging
- merge_dbs.py: Merges shard databases into a single consolidated DB
- rq2_analysis.ipynb: Notebook for RQ2 analysis, visualizations, and exports
- validate_data.py: Quick checks for DB integrity and basic stats
- requirements.txt: Python dependencies
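For orientation, a PyDriller-based miner with merge and bot filtering might look roughly like the sketch below. It is not the code in data_extraction_pipeline.py: the bot name patterns, the function names, and the record keys are assumptions made for illustration. The PyDriller calls used (`Repository`, `only_no_merge`, `traverse_commits`, `modified_files`) are part of PyDriller 2.x.

```python
import re
from typing import Dict, Iterator

# Hypothetical bot patterns; the project's actual bot list is not shown here.
BOT_PATTERN = re.compile(r"\[bot\]$|^dependabot|-bot$", re.IGNORECASE)


def is_bot_author(name: str) -> bool:
    """Heuristically flag automation accounts by author name."""
    return bool(BOT_PATTERN.search(name or ""))


def iter_file_changes(
    repo_path: str, skip_merges: bool = True, skip_bots: bool = True
) -> Iterator[Dict[str, object]]:
    """Yield one record per modified file, per non-filtered commit."""
    # Imported lazily so is_bot_author works without PyDriller installed.
    from pydriller import Repository

    repo = Repository(repo_path, only_no_merge=skip_merges)
    for commit in repo.traverse_commits():
        if skip_bots and is_bot_author(commit.author.name):
            continue
        for mf in commit.modified_files:
            yield {
                "commit_hash": commit.hash,
                "author_date": commit.author_date.isoformat(),
                "old_path": mf.old_path,  # None for newly added files
                "new_path": mf.new_path,  # None for deleted files
                "change_type": mf.change_type.name,
                "added_lines": mf.added_lines,
                "deleted_lines": mf.deleted_lines,
            }
```

Calling `iter_file_changes("./numpy")` would stream records that a writer could insert into SQLite in batches; the old_path and new_path pair is what makes rename tracking possible downstream.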
- Linux/macOS recommended
- Python 3.10+ (tested on 3.12)
- A local clone of NumPy at ./numpy (or update config.yaml)
- Create and activate a virtual environment:
    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
- Ensure config.yaml points to your local NumPy clone:
    repository:
      path: ./numpy
    output:
      data_dir: ./numpy_analysis_data
      db_name: numpy_analysis.db
      export_csv: true
    extraction:
      since_date: null  # set YYYY-MM-DD for windowed run if needed
      to_date: null  # set YYYY-MM-DD for windowed run if needed
      batch_size: 2000
      filter_merge_commits: true
      filter_bot_commits: true
      only_main_branch: false
- Run a full extraction:
    ./run.sh full
Logs are written to logs/. The SQLite DB is created in numpy_analysis_data/.
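Once the run finishes, a quick sanity check of the resulting database can be done with the standard library alone. The sketch below runs SQLite's built-in integrity check and lists the tables and views it finds; it makes no assumption about the schema. The function name `summarize_db` is hypothetical and unrelated to validate_data.py.

```python
import sqlite3
from typing import Dict


def summarize_db(con: sqlite3.Connection) -> Dict[str, object]:
    """Run an integrity check and list user-defined tables and views."""
    integrity = con.execute("PRAGMA integrity_check").fetchone()[0]
    rows = con.execute(
        "SELECT type, name FROM sqlite_master "
        "WHERE type IN ('table', 'view') AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    return {
        "integrity": integrity,
        "tables": [name for kind, name in rows if kind == "table"],
        "views": [name for kind, name in rows if kind == "view"],
    }


# Usage against the default output location from config.yaml:
# con = sqlite3.connect("numpy_analysis_data/numpy_analysis.db")
# print(summarize_db(con))
# con.close()
```

Note that `sqlite3.connect` creates an empty file if the path does not exist, so an empty table list usually means the path is wrong rather than that the extraction produced nothing.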
Use 10 shard workers, each with a per-shard repo clone to avoid git lock contention:
    ./run.sh parallel-years
This creates repos/ and configs/ per shard, runs in parallel, and logs to logs/extraction_*.log.
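How run.sh divides the history among shards is not shown here. One plausible scheme, sketched below under that assumption, splits a range of years into contiguous windows of near-equal length and builds one extraction command per window. The helper names are hypothetical; the command-line flags and the database naming pattern follow the date-window example in this README.

```python
from datetime import date
from typing import List, Tuple


def year_shards(first_year: int, last_year: int, workers: int) -> List[Tuple[str, str]]:
    """Split [first_year, last_year] into at most `workers` contiguous windows."""
    if workers < 1 or last_year < first_year:
        raise ValueError("need workers >= 1 and first_year <= last_year")
    years = last_year - first_year + 1
    workers = min(workers, years)  # never create empty shards
    base, extra = divmod(years, workers)
    windows = []
    start = first_year
    for i in range(workers):
        span = base + (1 if i < extra else 0)  # spread the remainder over early shards
        end = start + span - 1
        windows.append((date(start, 1, 1).isoformat(), date(end, 12, 31).isoformat()))
        start = end + 1
    return windows


def shard_command(since: str, to: str) -> List[str]:
    """Build the extraction command for one date window."""
    db_name = f"numpy_analysis_{since.replace('-', '')}_{to.replace('-', '')}.db"
    return ["python", "data_extraction_pipeline.py",
            "--since", since, "--to", to,
            "--db-name", db_name, "--extract-only"]


for since, to in year_shards(2001, 2020, 10):
    print(" ".join(shard_command(since, to)))
```

Whether the `--to` date is inclusive depends on the pipeline, so adjacent windows should be checked for missing or doubly counted commits at the boundaries before merging.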
After shards complete, merge them:
    python merge_dbs.py
To speed up initial mining and skip CSV and summary generation:
    python data_extraction_pipeline.py --extract-only
    # or with a date window
    python data_extraction_pipeline.py --since 2018-01-01 --to 2019-12-31 --db-name numpy_analysis_20180101_20191231.db --extract-only
Open rq2_analysis.ipynb and run the cells. The notebook:
- Loads commit and file change data from the DB
- Produces time series, distribution plots, and a treemap
- Exports summary CSVs (top files, modules, categories, hotspots, yearly summary)
Outputs are saved under plots/.
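To give a sense of the kind of aggregation behind RQ2, the sketch below sums line churn (added plus deleted lines) per module and year. It assumes a hypothetical table `file_changes(path, author_date, added_lines, deleted_lines)` with ISO-formatted dates; the actual table and view names in the database may differ, so adjust the query to the real schema.

```python
import sqlite3
from collections import Counter


def module_of(path: str, depth: int = 2) -> str:
    """Reduce a file path to its leading directories, e.g. numpy/core."""
    directories = path.split("/")[:-1]
    if not directories:
        return "<root>"  # files at the repository top level
    return "/".join(directories[:depth])


def churn_by_module_and_year(con: sqlite3.Connection, depth: int = 2) -> Counter:
    """Sum added + deleted lines per (module, year)."""
    churn: Counter = Counter()
    query = (
        "SELECT path, substr(author_date, 1, 4), "
        "COALESCE(added_lines, 0) + COALESCE(deleted_lines, 0) "
        "FROM file_changes"  # hypothetical table name
    )
    for path, year, lines in con.execute(query):
        churn[(module_of(path, depth), int(year))] += lines
    return churn


# Usage, assuming the schema above:
# con = sqlite3.connect("numpy_analysis_data/numpy_analysis.db")
# for (module, year), lines in churn_by_module_and_year(con).most_common(10):
#     print(module, year, lines)
```

Grouping at a fixed directory depth is a simplification; a curated mapping from paths to modules or categories would give more meaningful buckets for deeply nested code.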
- Git config lock during parallel runs: remove the stale lock with rm -f numpy/.git/config.lock. Note that run.sh parallel-years already uses per-shard clones to avoid contention.
- Large runtime: mining git history is CPU/disk-bound; increase batch_size, use parallel shards, and ensure fast local storage.
Issues and PRs are welcome. Keep changes focused and documented. Please run the extraction and notebook locally to validate before submitting.
This repository is for academic research purposes. Please respect NumPy’s copyright in the cloned repository and avoid redistributing third-party code.