NumPy Codebase Evolution Analysis (RQ2)

A reproducible research pipeline for analyzing which parts of the NumPy codebase evolve the most over time (Research Question RQ2). The project mines the git history using PyDriller, stores structured data in SQLite, and provides an analysis notebook with visualizations.

Features

End-to-end data extraction from a local NumPy git clone
Rename-aware file path normalization for longitudinal tracking
Bot and merge commit filtering
SQLite database with analysis-friendly views
Parallel sharding support for faster extraction
Comprehensive analysis notebook with plots and exports
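
To make the rename-aware normalization listed above more concrete, here is a minimal sketch of the idea. It is not the code in data_extraction_pipeline.py: the function name `canonical_paths` and the `(old_path, new_path)` event format are hypothetical, chosen to mirror the old/new path pair that PyDriller exposes for each modified file.

```python
def canonical_paths(renames):
    """Map every historical path to its most recent name.

    `renames` is a chronological list of (old_path, new_path) pairs
    (hypothetical input format, oldest rename first).
    """
    latest = {}  # historical path -> current canonical path
    for old, new in renames:
        # Any path that used to resolve to `old` now resolves to `new`.
        for path in list(latest):
            if latest[path] == old:
                latest[path] = new
        latest[old] = new
        latest.setdefault(new, new)
    return latest


if __name__ == "__main__":
    history = [
        ("numpy/core/a.py", "numpy/_core/a.py"),
        ("numpy/_core/a.py", "numpy/_core/b.py"),
    ]
    for path, current in canonical_paths(history).items():
        print(path, "->", current)
```

With a mapping like this, changes recorded under an old file name can be attributed to the file's current name, which is what makes longitudinal per-file counts comparable across renames.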

Repository Layout

data_extraction_pipeline.py — PyDriller-based miner and SQLite writer
config.yaml — Configuration for repository path, output, and filters
run.sh — Orchestrates full and parallel-sharded extractions with logging
merge_dbs.py — Merges shard databases into a single consolidated DB
rq2_analysis.ipynb — Notebook for RQ2 analysis, visualizations, and exports
validate_data.py — Quick checks for DB integrity and basic stats
requirements.txt — Python dependencies

Prerequisites

Linux/macOS recommended
Python 3.10+ (tested on 3.12)
A local clone of NumPy at ./numpy (or update config.yaml)

Quickstart

Create and activate a virtual environment:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Ensure config.yaml points to your local NumPy clone:

repository:
  path: ./numpy
output:
  data_dir: ./numpy_analysis_data
  db_name: numpy_analysis.db
  export_csv: true
extraction:
  since_date: null      # set YYYY-MM-DD for windowed run if needed
  to_date: null         # set YYYY-MM-DD for windowed run if needed
  batch_size: 2000
  filter_merge_commits: true
  filter_bot_commits: true
  only_main_branch: false
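
As an illustration of how the since_date and to_date settings above could be interpreted, the sketch below converts them into datetimes and rejects an inverted window. The helper `parse_window` is hypothetical and not part of the pipeline; it assumes the `extraction` section has already been loaded into a dict. Both strings and date objects are accepted because YAML loaders such as PyYAML commonly turn an unquoted YYYY-MM-DD value into a date object.

```python
from datetime import date, datetime, time


def _to_datetime(value):
    """Accept None, a 'YYYY-MM-DD' string, or a date/datetime object."""
    if value is None:
        return None
    if isinstance(value, datetime):
        return value
    if isinstance(value, date):
        return datetime.combine(value, time.min)
    return datetime.strptime(str(value), "%Y-%m-%d")


def parse_window(extraction):
    """Return (since, to) for the `extraction` config section (a dict)."""
    since = _to_datetime(extraction.get("since_date"))
    to = _to_datetime(extraction.get("to_date"))
    if since is not None and to is not None and since > to:
        raise ValueError("since_date must not be later than to_date")
    return since, to


if __name__ == "__main__":
    print(parse_window({"since_date": None, "to_date": None}))
    print(parse_window({"since_date": "2018-01-01", "to_date": "2019-12-31"}))
```

Leaving both values as null corresponds to mining the full history.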

Run a full extraction:

./run.sh full

Logs are written to logs/. The SQLite DB is created in numpy_analysis_data/.
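
Once a run finishes, a quick sanity check of the resulting database might look like the sketch below. It is not validate_data.py; the function name `quick_check` is hypothetical. It makes no assumption about the schema: it runs SQLite's built-in integrity check and counts rows in whatever tables exist.

```python
import sqlite3


def quick_check(db_path):
    """Return (integrity_status, {table_name: row_count}) for a SQLite file."""
    con = sqlite3.connect(db_path)
    try:
        # "ok" means SQLite found no structural problems.
        status = con.execute("PRAGMA integrity_check").fetchone()[0]
        tables = [
            row[0]
            for row in con.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        counts = {
            name: con.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in tables
        }
    finally:
        con.close()
    return status, counts


if __name__ == "__main__":
    # Path assembled from the data_dir and db_name values in config.yaml.
    print(quick_check("numpy_analysis_data/numpy_analysis.db"))
```

An empty table dictionary or a status other than "ok" suggests the extraction did not complete.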

Faster Parallel Extraction

Use 10 shard workers, each with a per-shard repo clone to avoid git lock contention:

./run.sh parallel-years

This creates a repository clone under repos/ and a config file under configs/ for each shard, runs the shards in parallel, and writes logs to logs/extraction_*.log.
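
One way to picture year-based sharding is to split a range of years into contiguous date windows, one per worker. The sketch below is an assumption about the general approach, not a description of what run.sh does: `year_shards` and the example year bounds are hypothetical, and the database naming simply follows the pattern of the date-window example in this README.

```python
from datetime import date


def year_shards(first_year, last_year, n_shards):
    """Split [first_year, last_year] into at most n_shards date windows.

    Returns a list of (since, to, db_name) tuples with ISO date strings.
    """
    years = list(range(first_year, last_year + 1))
    n = min(n_shards, len(years))
    base, extra = divmod(len(years), n)
    shards, start = [], 0
    for i in range(n):
        # The first `extra` shards take one additional year each.
        size = base + (1 if i < extra else 0)
        chunk = years[start:start + size]
        start += size
        since, to = date(chunk[0], 1, 1), date(chunk[-1], 12, 31)
        db_name = f"numpy_analysis_{since:%Y%m%d}_{to:%Y%m%d}.db"
        shards.append((since.isoformat(), to.isoformat(), db_name))
    return shards


if __name__ == "__main__":
    # Hypothetical bounds; pick the years that match the history you mine.
    for since, to, db_name in year_shards(2005, 2024, 10):
        print(since, to, db_name)
```

Because the windows do not overlap, each commit falls into exactly one shard, which keeps the later merge step simple.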

After shards complete, merge them:

python merge_dbs.py
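
For readers curious about what merging shard databases involves, here is a minimal sketch using SQLite's ATTACH DATABASE. It is not the contents of merge_dbs.py: the function name `merge_shards` and the table names `commits` and `file_changes` are hypothetical, and the sketch assumes shards cover disjoint date windows so that rows can simply be appended.

```python
import sqlite3


def merge_shards(target_path, shard_paths, tables=("commits", "file_changes")):
    """Append the rows of each shard DB into target_path (hypothetical schema)."""
    con = sqlite3.connect(target_path)
    try:
        for shard in shard_paths:
            con.execute("ATTACH DATABASE ? AS shard", (shard,))
            for table in tables:
                # Copy the column layout on first sight. Note that
                # CREATE TABLE ... AS SELECT does not carry over
                # primary keys, indexes, or other constraints.
                con.execute(
                    f"CREATE TABLE IF NOT EXISTS main.{table} AS "
                    f"SELECT * FROM shard.{table} WHERE 0"
                )
                con.execute(
                    f"INSERT INTO main.{table} SELECT * FROM shard.{table}"
                )
            # Commit before detaching: SQLite refuses to detach a
            # database while a transaction is open.
            con.commit()
            con.execute("DETACH DATABASE shard")
    finally:
        con.close()
```

If shard windows could overlap, the append would need a deduplication step, for example keyed on the commit hash.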

Extract-Only Mode

To speed up initial mining, skip CSV export and summary generation:

python data_extraction_pipeline.py --extract-only
# or with a date window
python data_extraction_pipeline.py --since 2018-01-01 --to 2019-12-31 --db-name numpy_analysis_20180101_20191231.db --extract-only

Analysis Notebook

Open rq2_analysis.ipynb and run the cells. The notebook:

Loads commit and file change data from the DB
Produces time series, distribution plots, and a treemap
Exports summary CSVs (top files, modules, categories, hotspots, yearly summary)

Outputs are saved under plots/.
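
The "top files" summary mentioned above amounts to aggregating change counts per file and writing the result to CSV. The sketch below shows one possible shape of that step. It is not taken from the notebook: `export_top_files` is a hypothetical helper, and the `file_changes` table with `hash`, `path`, `added`, and `deleted` columns is an assumed schema, so adapt the names to the actual database.

```python
import csv
import sqlite3


def export_top_files(con, csv_path, limit=20):
    """Write the most-changed files (by added + deleted lines) to a CSV."""
    rows = con.execute(
        """
        SELECT path,
               COUNT(DISTINCT hash) AS commits,
               SUM(added + deleted) AS churn
        FROM file_changes            -- assumed table and column names
        GROUP BY path
        ORDER BY churn DESC, path
        LIMIT ?
        """,
        (limit,),
    ).fetchall()
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["path", "commits", "churn"])
        writer.writerows(rows)
    return rows
```

Ranking by churn (lines added plus lines deleted) and ranking by commit count can give different orderings, so it is worth exporting both columns.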

Common Issues

Git config lock during parallel runs: remove the stale lock with rm -f numpy/.git/config.lock. The parallel-years mode of run.sh already uses per-shard clones to avoid this contention.
Long runtime: mining git history is CPU- and disk-bound. Increase batch_size, use parallel shards, and keep the repository and database on fast local storage.

Contributing

Issues and PRs are welcome. Keep changes focused and documented. Please run the extraction and notebook locally to validate before submitting.

License

This repository is for academic research purposes. Please respect NumPy’s copyright in the cloned repository and avoid redistributing third-party code.
