# 02 - Batch Data Collection

Efficiently collect data for multiple Stadium project candidates.

**Goals:**
1. Load Stadium candidates from `data/stadium_candidates.md`
2. Quick verification of maintainer counts
3. Batch collection with rate limit management
4. Progress tracking and error handling

## Setup

In [3]:
import os
import sys
import json
import time
from pathlib import Path
from datetime import datetime

import pandas as pd
from dotenv import load_dotenv

# Add src to path
sys.path.insert(0, '../src')
from collection.github_collector import GitHubCollector

# Load environment from .env file
env_path = Path("../.env")
if env_path.exists():
    load_dotenv(env_path)
    print(f"‚úÖ Loaded .env from {env_path.resolve()}")
else:
    load_dotenv()
    print("‚ö†Ô∏è  No .env file found, trying default locations")

GITHUB_TOKEN = os.getenv('GITHUB_TOKEN')

if not GITHUB_TOKEN:
    raise ValueError(
        "GITHUB_TOKEN not found!\n"
        "1. Copy .env.example to .env: cp ../.env.example ../.env\n"
        "2. Edit .env and add your GitHub token\n"
        "3. Get a token at: https://github.com/settings/tokens"
    )

if GITHUB_TOKEN == "your_github_token_here":
    raise ValueError(
        "GITHUB_TOKEN is still the placeholder!\n"
        "Edit ../.env and replace with your actual token"
    )

# Initialize collector
collector = GitHubCollector(token=GITHUB_TOKEN)

print("‚úÖ Setup complete!")
rate = collector.get_rate_limit()
print(f"   Rate limit: {rate['core']['remaining']}/{rate['core']['limit']}")



‚úÖ Loaded .env from /Users/ibrahimcesar/Dev/categories-of-the-commons/.env
‚úÖ Setup complete!
   Rate limit: 5000/5000


## 1. Define Stadium Candidates

Based on research criteria:
- High usage/downloads
- Few maintainers (‚â§3 ideal, or high dominance)
- Critical infrastructure packages

In [5]:
# Import from centralized candidate lists
import sys
sys.path.insert(0, '../data')
from candidates import (
    STADIUM_ALL, STADIUM_COLLECTED, STADIUM_HIGH_PRIORITY,
    FEDERATION_CANDIDATES, CLUB_CANDIDATES, TOY_CANDIDATES,
    get_uncollected, print_status
)

# Show collection status across all categories
print_status()

# Use the centralized Stadium candidates
all_candidates = [{"repo": repo, "ecosystem": "mixed"} for repo in STADIUM_ALL]

print(f"\nTotal Stadium candidates: {len(all_candidates)}")
print(f"Already collected: {len(STADIUM_COLLECTED)}")
print(f"Remaining: {len(get_uncollected('stadium'))}")

CANDIDATE COLLECTION STATUS
STADIUM       12/ 37 (32%)
FEDERATION     0/ 19 (0%)
CLUB           0/ 19 (0%)
TOY            1/  1 (100%)

Total Stadium candidates: 37
Already collected: 12
Remaining: 25


## 2. Quick Verification - Check Maintainer Counts

Before full collection, quickly verify candidates meet Stadium criteria.

In [6]:
def quick_verify(repo_name: str) -> dict:
    """Quick verification of Stadium criteria."""
    try:
        # Get basic metrics
        metrics = collector.collect_repository_metrics(repo_name)
        maintainers = collector.collect_maintainer_data(repo_name)
        
        # Get top contributor dominance
        contributors = collector.collect_contributor_data(repo_name, max_contributors=10)
        
        dominance = 0
        if contributors:
            total = sum(c['contributions'] for c in contributors)
            if total > 0:
                dominance = contributors[0]['contributions'] / total * 100
        
        return {
            "repo": repo_name,
            "stars": metrics.get('stargazers_count', 0),
            "language": metrics.get('language', 'Unknown'),
            "active_maintainers": maintainers['statistics'].get('active_maintainers_6mo', 0),
            "top_contributor": contributors[0]['login'] if contributors else 'N/A',
            "top_contributor_pct": round(dominance, 1),
            "stadium_likely": maintainers['statistics'].get('active_maintainers_6mo', 0) <= 3 or dominance > 40,
            "error": None
        }
    except Exception as e:
        return {
            "repo": repo_name,
            "error": str(e)
        }

In [7]:
# Quick verify all candidates (uses ~50-100 API calls per repo)
print("Quick verification of Stadium candidates...")
print("="*70)

verification_results = []

for i, candidate in enumerate(all_candidates):
    repo = candidate['repo']
    print(f"[{i+1}/{len(all_candidates)}] Checking {repo}...", end=" ")
    
    result = quick_verify(repo)
    result['ecosystem'] = candidate['ecosystem']
    verification_results.append(result)
    
    if result.get('error'):
        print(f"‚ùå Error: {result['error'][:50]}")
    elif result.get('stadium_likely'):
        print(f"‚úÖ Stadium likely ({result['active_maintainers']} maintainers, {result['top_contributor_pct']}% dominance)")
    else:
        print(f"‚ö†Ô∏è  Maybe not Stadium ({result['active_maintainers']} maintainers, {result['top_contributor_pct']}% dominance)")
    
    # Rate limit check
    if (i + 1) % 5 == 0:
        rate = collector.get_rate_limit()
        print(f"    [Rate limit: {rate['core']['remaining']}/{rate['core']['limit']}]")
        if rate['core']['remaining'] < 500:
            print("‚ö†Ô∏è  Rate limit low, pausing...")
            time.sleep(60)

print("\n" + "="*70)
print("Verification complete!")

Quick verification of Stadium candidates...
[1/37] Checking curl/curl... 

Request GET /repos/curl/curl/collaborators failed with 403: Forbidden


Could not fetch collaborators (requires admin access): Must have push access to view repository collaborators.: 403 {"message": "Must have push access to view repository collaborators.", "documentation_url": "https://docs.github.com/rest/collaborators/collaborators#list-repository-collaborators", "status": "403"}
Will use MAINTAINERS.md/CONTRIBUTORS.md files instead...


KeyboardInterrupt: 

In [8]:
# Display verification results
df_verify = pd.DataFrame(verification_results)

# Filter successful verifications
df_success = df_verify[df_verify['error'].isna()].copy()

print(f"\nSuccessfully verified: {len(df_success)}/{len(df_verify)}")
print(f"Stadium likely: {df_success['stadium_likely'].sum()}")
print(f"Uncertain: {(~df_success['stadium_likely']).sum()}")

# Display table
display_cols = ['repo', 'ecosystem', 'stars', 'active_maintainers', 'top_contributor', 'top_contributor_pct', 'stadium_likely']
df_success[display_cols].sort_values('stadium_likely', ascending=False)

KeyError: 'error'

## 3. Select Confirmed Stadium Projects

In [9]:
# Use centralized uncollected list
# This uses the COLLECTED list from stadium_candidates.py
uncollected = get_uncollected("stadium")

# Also allow collecting high priority first
confirmed_stadium = STADIUM_HIGH_PRIORITY + [r for r in uncollected if r not in STADIUM_HIGH_PRIORITY]

print(f"Confirmed Stadium projects to collect ({len(confirmed_stadium)}):")
for repo in confirmed_stadium[:10]:  # Show first 10
    priority = "HIGH" if repo in STADIUM_HIGH_PRIORITY else ""
    print(f"  - {repo} {priority}")
if len(confirmed_stadium) > 10:
    print(f"  ... and {len(confirmed_stadium) - 10} more")

Confirmed Stadium projects to collect (25):
  - uuidjs/uuid HIGH
  - debug-js/debug HIGH
  - npm/node-semver HIGH
  - vercel/ms HIGH
  - node-fetch/node-fetch HIGH
  - yargs/yargs HIGH
  - urllib3/urllib3 HIGH
  - dateutil/dateutil HIGH
  - certifi/python-certifi HIGH
  - pallets/click HIGH
  ... and 15 more


## 4. Batch Collection - Full Dataset

Collect complete data for confirmed Stadium projects.

In [10]:
def collect_with_retry(repo_name: str, since_days: int = 365, max_retries: int = 3) -> dict:
    """Collect data with retry logic."""
    for attempt in range(max_retries):
        try:
            data = collector.collect_complete_dataset(repo_name, since_days=since_days)
            return {"success": True, "data": data, "error": None}
        except Exception as e:
            if "rate limit" in str(e).lower():
                print(f"      Rate limit hit, waiting 60s...")
                time.sleep(60)
            elif attempt < max_retries - 1:
                print(f"      Retry {attempt + 1}/{max_retries}...")
                time.sleep(5)
            else:
                return {"success": False, "data": None, "error": str(e)}
    return {"success": False, "data": None, "error": "Max retries exceeded"}

In [11]:
# Check which projects already have data
data_dir = Path("../data/raw")
existing_files = {f.stem.replace('_data', '').replace('_', '/'): f for f in data_dir.glob("*_data.json")}

print(f"Existing data files: {len(existing_files)}")
for repo in existing_files:
    print(f"  ‚úì {repo}")

# Filter to only collect missing ones
to_collect = [repo for repo in confirmed_stadium if repo not in existing_files]
print(f"\nNeed to collect: {len(to_collect)}")

Existing data files: 23
  ‚úì benjaminp/six
  ‚úì glennrp/libpng
  ‚úì curl/curl
  ‚úì psf/requests
  ‚úì yaml/pyyaml
  ‚úì sindresorhus/got
  ‚úì debug-js/debug
  ‚úì uuidjs/uuid
  ‚úì vercel/ms
  ‚úì tj/commander.js
  ‚úì dateutil/dateutil
  ‚úì yargs/yargs
  ‚úì pallets/click
  ‚úì ibrahimcesar/react-lite-youtube-embed
  ‚úì zloirock/core-js
  ‚úì chalk/chalk
  ‚úì certifi/python-certifi
  ‚úì node-fetch/node-fetch
  ‚úì serde-rs/serde
  ‚úì axios/axios
  ‚úì npm/node-semver
  ‚úì urllib3/urllib3
  ‚úì madler/zlib

Need to collect: 15


In [12]:
# Batch collection
collection_results = []
start_time = datetime.now()

print(f"Starting batch collection for {len(to_collect)} projects...")
print("="*70)

for i, repo in enumerate(to_collect):
    print(f"\n[{i+1}/{len(to_collect)}] Collecting {repo}...")
    
    # Check rate limit before starting
    rate = collector.get_rate_limit()
    if rate['core']['remaining'] < 500:
        wait_time = 60
        print(f"    ‚è≥ Rate limit low ({rate['core']['remaining']}), waiting {wait_time}s...")
        time.sleep(wait_time)
    
    # Collect data
    result = collect_with_retry(repo, since_days=365)
    
    if result['success']:
        # Save to file
        output_path = data_dir / f"{repo.replace('/', '_')}_data.json"
        collector.save_data(result['data'], output_path)
        
        stars = result['data']['repository'].get('stargazers_count', 0)
        contributors = len(result['data']['contributors'])
        print(f"    ‚úÖ Success! ({stars:,} stars, {contributors} contributors)")
        
        collection_results.append({
            "repo": repo,
            "success": True,
            "stars": stars,
            "contributors": contributors,
            "file": str(output_path)
        })
    else:
        print(f"    ‚ùå Failed: {result['error'][:50]}")
        collection_results.append({
            "repo": repo,
            "success": False,
            "error": result['error']
        })

elapsed = datetime.now() - start_time
print("\n" + "="*70)
print(f"Batch collection complete! Time: {elapsed}")

Starting batch collection for 15 projects...

[1/15] Collecting rust-lang/regex...

Collecting complete dataset for: rust-lang/regex

1/7 Collecting repository metrics...
2/7 Identifying maintainers...


Request GET /repos/rust-lang/regex/collaborators failed with 403: Forbidden


Could not fetch collaborators (requires admin access): Must have push access to view repository collaborators.: 403 {"message": "Must have push access to view repository collaborators.", "documentation_url": "https://docs.github.com/rest/collaborators/collaborators#list-repository-collaborators", "status": "403"}
Will use MAINTAINERS.md/CONTRIBUTORS.md files instead...
3/7 Collecting contributor data...


Contributors: 100it [00:02, 42.16it/s]


4/7 Collecting commit history...


Commits: 61it [00:41,  1.48it/s]


5/7 Collecting pull request data...


Collecting PRs:  24%|‚ñà‚ñà‚ñç       | 48/200 [00:38<02:01,  1.26it/s]


6/7 Collecting issue data...


Collecting Issues: 76it [00:20,  3.77it/s]


7/7 Checking governance files...

Collection complete for: rust-lang/regex

Data saved to ../data/raw/rust-lang_regex_data.json
    ‚úÖ Success! (3,844 stars, 100 contributors)

[2/15] Collecting serde-rs/json...

Collecting complete dataset for: serde-rs/json

1/7 Collecting repository metrics...
2/7 Identifying maintainers...


Request GET /repos/serde-rs/json/collaborators failed with 403: Forbidden


Could not fetch collaborators (requires admin access): Must have push access to view repository collaborators.: 403 {"message": "Must have push access to view repository collaborators.", "documentation_url": "https://docs.github.com/rest/collaborators/collaborators#list-repository-collaborators", "status": "403"}
Will use MAINTAINERS.md/CONTRIBUTORS.md files instead...
3/7 Collecting contributor data...


Contributors: 100it [00:02, 44.85it/s]


4/7 Collecting commit history...


Commits: 76it [00:50,  1.50it/s]


5/7 Collecting pull request data...


Collecting PRs:  14%|‚ñà‚ñé        | 27/200 [00:22<02:21,  1.22it/s]


6/7 Collecting issue data...


Collecting Issues: 53it [00:17,  2.95it/s]


7/7 Checking governance files...

Collection complete for: serde-rs/json

Data saved to ../data/raw/serde-rs_json_data.json
    ‚úÖ Success! (5,381 stars, 100 contributors)

[3/15] Collecting clap-rs/clap...

Collecting complete dataset for: clap-rs/clap

1/7 Collecting repository metrics...
2/7 Identifying maintainers...


Request GET /repos/clap-rs/clap/collaborators failed with 403: Forbidden


Could not fetch collaborators (requires admin access): Must have push access to view repository collaborators.: 403 {"message": "Must have push access to view repository collaborators.", "documentation_url": "https://docs.github.com/rest/collaborators/collaborators#list-repository-collaborators", "status": "403"}
Will use MAINTAINERS.md/CONTRIBUTORS.md files instead...
3/7 Collecting contributor data...


Contributors: 100it [00:02, 42.14it/s]


4/7 Collecting commit history...


Commits: 455it [04:57,  1.53it/s]


5/7 Collecting pull request data...


Collecting PRs:  78%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñä  | 155/200 [02:02<00:35,  1.27it/s]


6/7 Collecting issue data...


Collecting Issues: 200it [00:40,  4.91it/s]


7/7 Checking governance files...

Collection complete for: clap-rs/clap

Data saved to ../data/raw/clap-rs_clap_data.json
    ‚úÖ Success! (15,761 stars, 100 contributors)

[4/15] Collecting sqlite/sqlite...

Collecting complete dataset for: sqlite/sqlite

1/7 Collecting repository metrics...
2/7 Identifying maintainers...


Request GET /repos/sqlite/sqlite/collaborators failed with 403: Forbidden


Could not fetch collaborators (requires admin access): Must have push access to view repository collaborators.: 403 {"message": "Must have push access to view repository collaborators.", "documentation_url": "https://docs.github.com/rest/collaborators/collaborators#list-repository-collaborators", "status": "403"}
Will use MAINTAINERS.md/CONTRIBUTORS.md files instead...
Could not analyze committers: Cannot complete object as it contains no URL: 400
3/7 Collecting contributor data...


Contributors: 0it [00:00, ?it/s]


Error collecting contributors for sqlite/sqlite: list index out of range
4/7 Collecting commit history...


Commits: 5it [00:04,  1.20it/s]


Error collecting commits for sqlite/sqlite: Cannot complete object as it contains no URL: 400
5/7 Collecting pull request data...


Collecting PRs:  29%|‚ñà‚ñà‚ñä       | 4/14 [00:04<00:10,  1.08s/it]


6/7 Collecting issue data...


Collecting Issues: 4it [00:00,  6.65it/s]


7/7 Checking governance files...

Collection complete for: sqlite/sqlite

Data saved to ../data/raw/sqlite_sqlite_data.json
    ‚úÖ Success! (8,687 stars, 0 contributors)

[5/15] Collecting babel/babel...

Collecting complete dataset for: babel/babel

1/7 Collecting repository metrics...
2/7 Identifying maintainers...


Request GET /repos/babel/babel/collaborators failed with 403: Forbidden


Could not fetch collaborators (requires admin access): Must have push access to view repository collaborators.: 403 {"message": "Must have push access to view repository collaborators.", "documentation_url": "https://docs.github.com/rest/collaborators/collaborators#list-repository-collaborators", "status": "403"}
Will use MAINTAINERS.md/CONTRIBUTORS.md files instead...
3/7 Collecting contributor data...


Contributors: 100it [00:02, 34.87it/s]


4/7 Collecting commit history...


Commits: 464it [05:06,  1.51it/s]


5/7 Collecting pull request data...


Collecting PRs: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 200/200 [02:38<00:00,  1.26it/s]


6/7 Collecting issue data...


Collecting Issues: 200it [00:28,  6.99it/s]


7/7 Checking governance files...

Collection complete for: babel/babel

Data saved to ../data/raw/babel_babel_data.json
    ‚úÖ Success! (43,813 stars, 100 contributors)

[6/15] Collecting lodash/lodash...

Collecting complete dataset for: lodash/lodash

1/7 Collecting repository metrics...
2/7 Identifying maintainers...


Request GET /repos/lodash/lodash/collaborators failed with 403: Forbidden


Could not fetch collaborators (requires admin access): Must have push access to view repository collaborators.: 403 {"message": "Must have push access to view repository collaborators.", "documentation_url": "https://docs.github.com/rest/collaborators/collaborators#list-repository-collaborators", "status": "403"}
Will use MAINTAINERS.md/CONTRIBUTORS.md files instead...
3/7 Collecting contributor data...


Contributors: 100it [00:02, 42.78it/s]


4/7 Collecting commit history...


Commits: 16it [00:10,  1.49it/s]


5/7 Collecting pull request data...


Collecting PRs:  22%|‚ñà‚ñà‚ñè       | 43/200 [00:35<02:10,  1.21it/s]


6/7 Collecting issue data...


Collecting Issues: 121it [00:50,  2.38it/s]


7/7 Checking governance files...

Collection complete for: lodash/lodash

Data saved to ../data/raw/lodash_lodash_data.json
    ‚úÖ Success! (61,431 stars, 100 contributors)

[7/15] Collecting expressjs/express...

Collecting complete dataset for: expressjs/express

1/7 Collecting repository metrics...
2/7 Identifying maintainers...


Request GET /repos/expressjs/express/collaborators failed with 403: Forbidden


Could not fetch collaborators (requires admin access): Must have push access to view repository collaborators.: 403 {"message": "Must have push access to view repository collaborators.", "documentation_url": "https://docs.github.com/rest/collaborators/collaborators#list-repository-collaborators", "status": "403"}
Will use MAINTAINERS.md/CONTRIBUTORS.md files instead...
3/7 Collecting contributor data...


Contributors: 100it [00:02, 40.86it/s]


4/7 Collecting commit history...


Commits: 94it [01:00,  1.57it/s]


5/7 Collecting pull request data...


Collecting PRs: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 200/200 [02:48<00:00,  1.19it/s]


6/7 Collecting issue data...


Collecting Issues: 200it [00:14, 13.35it/s]


7/7 Checking governance files...

Collection complete for: expressjs/express

Data saved to ../data/raw/expressjs_express_data.json
    ‚úÖ Success! (68,271 stars, 100 contributors)

[8/15] Collecting python-attrs/attrs...

Collecting complete dataset for: python-attrs/attrs

1/7 Collecting repository metrics...
2/7 Identifying maintainers...


Request GET /repos/python-attrs/attrs/collaborators failed with 403: Forbidden


Could not fetch collaborators (requires admin access): Must have push access to view repository collaborators.: 403 {"message": "Must have push access to view repository collaborators.", "documentation_url": "https://docs.github.com/rest/collaborators/collaborators#list-repository-collaborators", "status": "403"}
Will use MAINTAINERS.md/CONTRIBUTORS.md files instead...
3/7 Collecting contributor data...


Contributors: 100it [00:02, 44.48it/s]


4/7 Collecting commit history...


Commits: 129it [01:27,  1.48it/s]


5/7 Collecting pull request data...


Collecting PRs:  38%|‚ñà‚ñà‚ñà‚ñä      | 77/200 [00:59<01:35,  1.29it/s]


6/7 Collecting issue data...


Collecting Issues: 125it [00:33,  3.73it/s]


7/7 Checking governance files...

Collection complete for: python-attrs/attrs

Data saved to ../data/raw/python-attrs_attrs_data.json
    ‚úÖ Success! (5,678 stars, 100 contributors)

[9/15] Collecting pypa/pip...

Collecting complete dataset for: pypa/pip

1/7 Collecting repository metrics...
2/7 Identifying maintainers...


Request GET /repos/pypa/pip/collaborators failed with 403: Forbidden


Could not fetch collaborators (requires admin access): Must have push access to view repository collaborators.: 403 {"message": "Must have push access to view repository collaborators.", "documentation_url": "https://docs.github.com/rest/collaborators/collaborators#list-repository-collaborators", "status": "403"}
Will use MAINTAINERS.md/CONTRIBUTORS.md files instead...
Found 1 maintainers in AUTHORS.txt
Found 1 maintainers from files: ['Switch01']
3/7 Collecting contributor data...


Contributors: 100it [00:02, 36.51it/s]


4/7 Collecting commit history...


Commits: 754it [08:25,  1.49it/s]


5/7 Collecting pull request data...


Collecting PRs: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 200/200 [02:46<00:00,  1.20it/s]


6/7 Collecting issue data...


Collecting Issues: 200it [00:53,  3.77it/s]


7/7 Checking governance files...

Collection complete for: pypa/pip

Data saved to ../data/raw/pypa_pip_data.json
    ‚úÖ Success! (10,008 stars, 100 contributors)

[10/15] Collecting tokio-rs/tokio...
    ‚è≥ Rate limit low (419), waiting 60s...

Collecting complete dataset for: tokio-rs/tokio

1/7 Collecting repository metrics...
2/7 Identifying maintainers...


Request GET /repos/tokio-rs/tokio/collaborators failed with 403: Forbidden


Could not fetch collaborators (requires admin access): Must have push access to view repository collaborators.: 403 {"message": "Must have push access to view repository collaborators.", "documentation_url": "https://docs.github.com/rest/collaborators/collaborators#list-repository-collaborators", "status": "403"}
Will use MAINTAINERS.md/CONTRIBUTORS.md files instead...
3/7 Collecting contributor data...


Contributors: 100it [00:02, 38.28it/s]


4/7 Collecting commit history...


Commits: 299it [03:14,  1.67it/s]

Rate limit low. Waiting 229 seconds...


Commits: 354it [07:39,  1.30s/it]


5/7 Collecting pull request data...


Collecting PRs: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 200/200 [02:36<00:00,  1.28it/s]


6/7 Collecting issue data...


Collecting Issues: 200it [00:34,  5.81it/s]


7/7 Checking governance files...

Collection complete for: tokio-rs/tokio

Data saved to ../data/raw/tokio-rs_tokio_data.json
    ‚úÖ Success! (30,373 stars, 100 contributors)

[11/15] Collecting rust-random/rand...

Collecting complete dataset for: rust-random/rand

1/7 Collecting repository metrics...
2/7 Identifying maintainers...


Request GET /repos/rust-random/rand/collaborators failed with 403: Forbidden


Could not fetch collaborators (requires admin access): Must have push access to view repository collaborators.: 403 {"message": "Must have push access to view repository collaborators.", "documentation_url": "https://docs.github.com/rest/collaborators/collaborators#list-repository-collaborators", "status": "403"}
Will use MAINTAINERS.md/CONTRIBUTORS.md files instead...
3/7 Collecting contributor data...


Contributors: 100it [00:02, 44.03it/s]


4/7 Collecting commit history...


Commits: 149it [01:33,  1.59it/s]


5/7 Collecting pull request data...


Collecting PRs:  49%|‚ñà‚ñà‚ñà‚ñà‚ñâ     | 98/200 [01:18<01:21,  1.25it/s]


6/7 Collecting issue data...


Collecting Issues: 155it [00:37,  4.14it/s]


7/7 Checking governance files...

Collection complete for: rust-random/rand

Data saved to ../data/raw/rust-random_rand_data.json
    ‚úÖ Success! (1,940 stars, 100 contributors)

[12/15] Collecting spf13/cobra...

Collecting complete dataset for: spf13/cobra

1/7 Collecting repository metrics...
2/7 Identifying maintainers...


Request GET /repos/spf13/cobra/collaborators failed with 403: Forbidden


Could not fetch collaborators (requires admin access): Must have push access to view repository collaborators.: 403 {"message": "Must have push access to view repository collaborators.", "documentation_url": "https://docs.github.com/rest/collaborators/collaborators#list-repository-collaborators", "status": "403"}
Will use MAINTAINERS.md/CONTRIBUTORS.md files instead...
Found 0 maintainers in MAINTAINERS
3/7 Collecting contributor data...


Contributors: 100it [00:02, 45.69it/s]


4/7 Collecting commit history...


Commits: 42it [00:26,  1.61it/s]


5/7 Collecting pull request data...


Collecting PRs:  30%|‚ñà‚ñà‚ñâ       | 59/200 [00:46<01:50,  1.28it/s]


6/7 Collecting issue data...


Collecting Issues: 101it [00:27,  3.67it/s]


7/7 Checking governance files...

Collection complete for: spf13/cobra

Data saved to ../data/raw/spf13_cobra_data.json
    ‚úÖ Success! (42,503 stars, 100 contributors)

[13/15] Collecting gorilla/mux...

Collecting complete dataset for: gorilla/mux

1/7 Collecting repository metrics...
2/7 Identifying maintainers...


Request GET /repos/gorilla/mux/collaborators failed with 403: Forbidden


Could not fetch collaborators (requires admin access): Must have push access to view repository collaborators.: 403 {"message": "Must have push access to view repository collaborators.", "documentation_url": "https://docs.github.com/rest/collaborators/collaborators#list-repository-collaborators", "status": "403"}
Will use MAINTAINERS.md/CONTRIBUTORS.md files instead...
3/7 Collecting contributor data...


Contributors: 100it [00:02, 47.16it/s]


4/7 Collecting commit history...


Commits: 0it [00:00, ?it/s]


5/7 Collecting pull request data...


Collecting PRs:   1%|          | 2/200 [00:03<05:51,  1.78s/it]


6/7 Collecting issue data...


Collecting Issues: 9it [00:04,  2.01it/s]


7/7 Checking governance files...

Collection complete for: gorilla/mux

Data saved to ../data/raw/gorilla_mux_data.json
    ‚úÖ Success! (21,717 stars, 100 contributors)

[14/15] Collecting rack/rack...

Collecting complete dataset for: rack/rack

1/7 Collecting repository metrics...
2/7 Identifying maintainers...


Request GET /repos/rack/rack/collaborators failed with 403: Forbidden


Could not fetch collaborators (requires admin access): Must have push access to view repository collaborators.: 403 {"message": "Must have push access to view repository collaborators.", "documentation_url": "https://docs.github.com/rest/collaborators/collaborators#list-repository-collaborators", "status": "403"}
Will use MAINTAINERS.md/CONTRIBUTORS.md files instead...
3/7 Collecting contributor data...


Contributors: 100it [00:02, 43.80it/s]


4/7 Collecting commit history...


Commits: 91it [00:58,  1.57it/s]


5/7 Collecting pull request data...


Collecting PRs:  44%|‚ñà‚ñà‚ñà‚ñà‚ñç     | 88/200 [01:07<01:26,  1.30it/s]


6/7 Collecting issue data...


Collecting Issues: 141it [00:35,  4.00it/s]


7/7 Checking governance files...

Collection complete for: rack/rack

Data saved to ../data/raw/rack_rack_data.json
    ‚úÖ Success! (5,073 stars, 100 contributors)

[15/15] Collecting sparklemotion/nokogiri...

Collecting complete dataset for: sparklemotion/nokogiri

1/7 Collecting repository metrics...
2/7 Identifying maintainers...


Request GET /repos/sparklemotion/nokogiri/collaborators failed with 403: Forbidden


Could not fetch collaborators (requires admin access): Must have push access to view repository collaborators.: 403 {"message": "Must have push access to view repository collaborators.", "documentation_url": "https://docs.github.com/rest/collaborators/collaborators#list-repository-collaborators", "status": "403"}
Will use MAINTAINERS.md/CONTRIBUTORS.md files instead...
3/7 Collecting contributor data...


Contributors: 100it [00:02, 41.60it/s]


4/7 Collecting commit history...


Commits: 307it [03:22,  1.52it/s]


5/7 Collecting pull request data...


Collecting PRs:  82%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñè | 164/200 [02:12<00:28,  1.24it/s]


6/7 Collecting issue data...


Collecting Issues: 200it [00:35,  5.58it/s]


7/7 Checking governance files...

Collection complete for: sparklemotion/nokogiri

Data saved to ../data/raw/sparklemotion_nokogiri_data.json
    ‚úÖ Success! (6,214 stars, 100 contributors)

Batch collection complete! Time: 1:22:36.816233


In [13]:
# Summary
df_results = pd.DataFrame(collection_results)

print("\n" + "="*60)
print("COLLECTION SUMMARY")
print("="*60)

if len(df_results) > 0:
    success_count = df_results['success'].sum()
    print(f"Successful: {success_count}/{len(df_results)}")
    print(f"Failed: {len(df_results) - success_count}")
    
    if 'stars' in df_results.columns:
        df_success = df_results[df_results['success']]
        print(f"\nTotal stars collected: {df_success['stars'].sum():,}")
        print(f"Total contributors: {df_success['contributors'].sum():,}")

# Show all collected data
all_data_files = list(data_dir.glob("*_data.json"))
print(f"\nTotal data files: {len(all_data_files)}")
total_size = sum(f.stat().st_size for f in all_data_files) / 1024
print(f"Total size: {total_size:.1f} KB")


COLLECTION SUMMARY
Successful: 15/15
Failed: 0

Total stars collected: 330,694
Total contributors: 1,400

Total data files: 38
Total size: 7628.5 KB


## 5. Final Rate Limit Check

In [14]:
rate = collector.get_rate_limit()
print(f"\nüìä Final Rate Limit Status:")
print(f"   Core API: {rate['core']['remaining']}/{rate['core']['limit']} remaining")
print(f"   Resets at: {rate['core']['reset']}")


üìä Final Rate Limit Status:
   Core API: 2899/5000 remaining
   Resets at: 2025-11-29 03:35:07+00:00


## Next Steps

1. Run `01_data_exploration.ipynb` to analyze collected data
2. Run `03_statistical_analysis.ipynb` for hypothesis testing
3. Add Federation/Club control projects for comparison