istiyakamin/numpy-codebase-evolution

NumPy Codebase Evolution Analysis (RQ2)

A reproducible research pipeline to analyze which parts of the NumPy codebase evolve the most over time (Research Question RQ2). The project mines the git history using PyDriller, stores structured data in SQLite, and generates analysis notebooks and visualizations.

Features

  • End-to-end data extraction from a local numpy git clone
  • Rename-aware file path normalization for longitudinal tracking
  • Bot and merge commit filtering
  • SQLite database with analysis-friendly views
  • Parallel sharding support for faster extraction
  • Comprehensive analysis notebook with plots and exports
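
As an illustration of what rename-aware normalization can look like, the sketch below collapses a chronological list of renames into a map from every historical path to its most recent name. It is not taken from data_extraction_pipeline.py; the function names `build_canonical_map` and `normalize` are hypothetical, and the sketch assumes rename pairs are available oldest first (PyDriller exposes `old_path` and `new_path` on each modified file, which is one way to collect them).

```python
def build_canonical_map(renames):
    """Map every historical path to its most recent name.

    `renames` is an iterable of (old_path, new_path) pairs in
    chronological order, oldest first. Hypothetical helper, not the
    pipeline's actual implementation.
    """
    canonical = {}
    for old_path, new_path in renames:
        # Any path that used to resolve to old_path now resolves to new_path.
        for path in list(canonical):
            if canonical[path] == old_path:
                canonical[path] = new_path
        canonical[old_path] = new_path
    return canonical


def normalize(path, canonical):
    """Return the latest known name for `path`, or `path` itself if never renamed."""
    return canonical.get(path, path)
```

With such a map, changes recorded under an old name can be attributed to the file's current path before aggregating. The sketch does not handle a path that is reused by an unrelated file after a rename; a real pipeline would need to decide how to treat that case.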

Repository Layout

  • data_extraction_pipeline.py — PyDriller-based miner and SQLite writer
  • config.yaml — Configuration for repository path, output, and filters
  • run.sh — Orchestrates full and parallel-sharded extractions with logging
  • merge_dbs.py — Merges shard databases into a single consolidated DB
  • rq2_analysis.ipynb — Notebook for RQ2 analysis, visualizations, and exports
  • validate_data.py — Quick checks for DB integrity and basic stats
  • requirements.txt — Python dependencies

Prerequisites

  • Linux/macOS recommended
  • Python 3.10+ (tested on 3.12)
  • A local clone of NumPy at ./numpy (or update config.yaml)

Quickstart

  1. Create and activate a virtual environment:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
  2. Ensure config.yaml points to your local NumPy clone:
repository:
  path: ./numpy
output:
  data_dir: ./numpy_analysis_data
  db_name: numpy_analysis.db
  export_csv: true
extraction:
  since_date: null      # set YYYY-MM-DD for windowed run if needed
  to_date: null         # set YYYY-MM-DD for windowed run if needed
  batch_size: 2000
  filter_merge_commits: true
  filter_bot_commits: true
  only_main_branch: false
  3. Run a full extraction:
./run.sh full

Logs are written to logs/. The SQLite DB is created in numpy_analysis_data/.
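
To check what a run produced without assuming anything about the schema, the tables and views in the database can be listed from SQLite's built-in `sqlite_master` catalog. This is a minimal sketch using only the standard library; the helper name `list_db_objects` is hypothetical and is not part of validate_data.py.

```python
import sqlite3


def list_db_objects(db_path):
    """Return (type, name) pairs for all tables and views in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT type, name FROM sqlite_master "
            "WHERE type IN ('table', 'view') "
            "ORDER BY type, name"
        ).fetchall()
    finally:
        conn.close()
    return rows
```

With the default config.yaml shown above, the path to pass would be numpy_analysis_data/numpy_analysis.db. Note that `sqlite3.connect` creates an empty file if the path does not exist, so an empty result may simply mean the path is wrong.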

Faster Parallel Extraction

Use 10 shard workers, each with a per-shard repo clone to avoid git lock contention:

./run.sh parallel-years

This creates repos/ and configs/ per shard, runs in parallel, and logs to logs/extraction_*.log.
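
How run.sh assigns history to shards is not spelled out here; the mode name suggests a split by year. Under that assumption, the sketch below divides a range of years into contiguous windows of near-equal size, one per shard. The function `split_years` is hypothetical, and whether the pipeline treats the window bounds as inclusive is also an assumption.

```python
def split_years(first_year, last_year, n_shards):
    """Split [first_year, last_year] into at most n_shards contiguous windows.

    Returns a list of (since, to) strings in YYYY-MM-DD form. Earlier
    windows receive the extra year when the range does not divide evenly.
    """
    if last_year < first_year:
        raise ValueError("last_year must not be before first_year")
    years = list(range(first_year, last_year + 1))
    n_shards = max(1, min(n_shards, len(years)))
    base, extra = divmod(len(years), n_shards)

    windows = []
    start = 0
    for i in range(n_shards):
        size = base + (1 if i < extra else 0)
        chunk = years[start:start + size]
        windows.append((f"{chunk[0]}-01-01", f"{chunk[-1]}-12-31"))
        start += size
    return windows
```

Each window could then be written into a per-shard config as since_date and to_date.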

After shards complete, merge them:

python merge_dbs.py
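
For readers curious how shard databases can be consolidated, one common SQLite approach is to attach each shard to the target and copy rows across. The sketch below is not the code in merge_dbs.py. It assumes the named tables already exist in the target with the same schema as in the shards, and that each has a PRIMARY KEY or UNIQUE constraint so that `INSERT OR IGNORE` skips duplicates. The function name and the table names passed in are hypothetical.

```python
import sqlite3


def merge_shards(target_path, shard_paths, tables):
    """Copy rows of `tables` from every shard DB into the target DB.

    Assumes identical schemas and a uniqueness constraint on each table;
    rows that would violate it are silently skipped.
    """
    conn = sqlite3.connect(target_path)
    try:
        for shard in shard_paths:
            conn.execute("ATTACH DATABASE ? AS shard", (shard,))
            for table in tables:
                # Table names cannot be bound as parameters, so they are
                # quoted as identifiers; pass only trusted names.
                conn.execute(
                    f'INSERT OR IGNORE INTO main."{table}" '
                    f'SELECT * FROM shard."{table}"'
                )
            # Commit before detaching: SQLite refuses to detach a
            # database while a transaction is open.
            conn.commit()
            conn.execute("DETACH DATABASE shard")
    finally:
        conn.close()
```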

Extract-Only Mode

To speed up initial mining by skipping CSV export and summary generation:

python data_extraction_pipeline.py --extract-only
# or with a date window
python data_extraction_pipeline.py --since 2018-01-01 --to 2019-12-31 --db-name numpy_analysis_20180101_20191231.db --extract-only
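
When scripting several windowed runs, the command line can be assembled programmatically. The sketch below uses only the flags shown above (--since, --to, --db-name, --extract-only) and mirrors the database naming pattern from the example; the helper `extraction_command` itself is hypothetical.

```python
from datetime import date


def extraction_command(since, to, extract_only=True):
    """Build the argument list for one windowed extraction run."""
    if to < since:
        raise ValueError("'to' must not be earlier than 'since'")
    # Mirrors the example name numpy_analysis_20180101_20191231.db.
    db_name = f"numpy_analysis_{since:%Y%m%d}_{to:%Y%m%d}.db"
    cmd = [
        "python", "data_extraction_pipeline.py",
        "--since", since.isoformat(),
        "--to", to.isoformat(),
        "--db-name", db_name,
    ]
    if extract_only:
        cmd.append("--extract-only")
    return cmd
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`.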

Analysis Notebook

Open rq2_analysis.ipynb and run all cells. The notebook:

  • Loads commit and file change data from the DB
  • Produces time series, distribution plots, and a treemap
  • Exports summary CSVs (top files, modules, categories, hotspots, yearly summary)

Outputs are saved under plots/.
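
The core of a "which parts evolve the most" summary is an aggregation of line churn by directory. The sketch below shows that idea on plain tuples. It is not code from the notebook; the input shape (path, lines added, lines deleted) and the function `churn_by_module` are assumptions made for illustration.

```python
from collections import Counter


def churn_by_module(changes, depth=2):
    """Sum added + deleted lines per module, highest churn first.

    `changes` is an iterable of (path, added, deleted) tuples. A module
    is the first `depth` components of the path, so with depth=2 the
    file numpy/linalg/linalg.py counts toward numpy/linalg.
    """
    totals = Counter()
    for path, added, deleted in changes:
        module = "/".join(path.split("/")[:depth])
        totals[module] += added + deleted
    return totals.most_common()
```

Files with fewer path components than `depth`, such as a top-level setup.py, are counted under their own path.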

Common Issues

  • Git config lock during parallel runs: remove the stale lock with rm -f numpy/.git/config.lock. run.sh parallel-years already uses per-shard clones to avoid contention.
  • Long runtime: mining git history is CPU- and disk-bound; increase batch_size, use parallel shards, and use fast local storage.

Contributing

Issues and PRs are welcome. Keep changes focused and documented. Please run the extraction and notebook locally to validate before submitting.

License

This repository is for academic research purposes. Please respect NumPy’s copyright in the cloned repository and avoid redistributing third-party code.
