# Setup & Test GitHub Data Collection

This notebook will help you:
1. Set up your GitHub token
2. Test the data collector
3. Collect data for your first Stadium project

---

## ⚠️ Important: Use Virtual Environment

**Before running this notebook:**

```bash
# 1. Create and activate virtual environment (if not already done)
python3 -m venv venv
source venv/bin/activate

# 2. Install packages
pip install -r requirements.txt
pip install -e .

# 3. Install Jupyter kernel for this venv
python -m ipykernel install --user --name=categories-venv --display-name="Python (Categories)"

# 4. Start Jupyter from the venv
jupyter lab
```

Then select the **"Python (Categories)"** kernel for this notebook.
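
To confirm that the notebook kernel really runs inside the virtual environment, a minimal check can compare the interpreter prefixes. This is a sketch that assumes the environment was created with `python3 -m venv` as shown above; the helper name `in_virtualenv` is hypothetical and not part of this project.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print(f"Running inside a virtual environment: {in_virtualenv()}")
```

If this prints `False`, the notebook is probably using the wrong kernel.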

## Step 1: Verify Setup

Check that all required packages are installed:

In [3]:
# Verify packages
import sys
print(f"Python: {sys.version}")
print(f"Executable: {sys.executable}")

try:
    import github
    import tqdm
    import dotenv
    from importlib.metadata import version
    print("\n✅ All required packages installed!")
    print(f"   - PyGithub: {version('PyGithub')}")
    print(f"   - tqdm: {version('tqdm')}")
    print(f"   - python-dotenv: {version('python-dotenv')}")
except ImportError as e:
    print(f"\n❌ Missing package: {e}")
    print("\nPlease run from terminal:")
    print("  pip install PyGithub tqdm python-dotenv")

Python: 3.9.6 (default, Nov 11 2024, 03:15:38) 
[Clang 16.0.0 (clang-1600.0.26.6)]
Executable: /Users/ibrahimcesar/Dev/categories-of-the-commons/venv/bin/python

✅ All required packages installed!
   - PyGithub: 2.8.1
   - tqdm: 4.67.1
   - python-dotenv: 1.2.1


## Step 2: Load GitHub Token

**Get your token:**
1. Go to: https://github.com/settings/tokens
2. Click **"Generate new token (classic)"**
3. Name: `categories-of-the-commons`
4. Select scopes:
   - ✓ `public_repo`
   - ✓ `read:org`
   - ✓ `read:user`
5. Click **"Generate token"**
6. Copy the token (starts with `ghp_`)
7. Add to `.env` file: `GITHUB_TOKEN=ghp_your_token_here`
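
Before saving the token, a quick prefix check can catch copy-paste mistakes. The sketch below is a heuristic only: it relies on the `ghp_` prefix mentioned above for classic tokens and on the `github_pat_` prefix that GitHub uses for fine-grained tokens. The function name `token_kind` is hypothetical, and a matching prefix does not prove the token is valid.

```python
def token_kind(token: str) -> str:
    """Guess the token type from its prefix (a heuristic, not validation)."""
    if token.startswith("ghp_"):
        return "classic"
    if token.startswith("github_pat_"):
        return "fine-grained"
    return "unknown"


# Placeholder value from the step above, not a real token
print(token_kind("ghp_your_token_here"))
```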

In [None]:
import os
from pathlib import Path
from dotenv import load_dotenv

# Load from .env file (in project root)
env_path = Path("../.env")
if env_path.exists():
    load_dotenv(env_path)
    print(f"✅ Loaded .env from {env_path.resolve()}")
else:
    load_dotenv()  # Try default locations
    print("⚠️  No .env file found in project root")
    print("   Please copy .env.example to .env and add your token:")
    print("   cp ../.env.example ../.env")

GITHUB_TOKEN = os.getenv('GITHUB_TOKEN')

if not GITHUB_TOKEN:
    print("\n❌ GITHUB_TOKEN not found!")
    print("   1. Copy .env.example to .env: cp ../.env.example ../.env")
    print("   2. Edit .env and add your GitHub token")
    print("   3. Get a token at: https://github.com/settings/tokens")
elif GITHUB_TOKEN == "your_github_token_here":
    print("\n⚠️  GITHUB_TOKEN is still the placeholder!")
    print("   Edit ../.env and replace with your actual token")
else:
    print(f"✅ Token loaded (length: {len(GITHUB_TOKEN)})")
    print(f"   Token preview: {GITHUB_TOKEN[:7]}...{GITHUB_TOKEN[-4:]}")

## Step 3: Initialize Collector & Check Rate Limits

In [5]:
import sys
sys.path.insert(0, '../src')

from collection.github_collector import GitHubCollector

# Initialize collector
collector = GitHubCollector(token=GITHUB_TOKEN)

# Check rate limits
rate_limit = collector.get_rate_limit()

print("GitHub API Rate Limits:")
print("━" * 50)
print(f"Core API:   {rate_limit['core']['remaining']:5d} / {rate_limit['core']['limit']:5d} remaining")
print(f"Search API: {rate_limit['search']['remaining']:5d} / {rate_limit['search']['limit']:5d} remaining")
print(f"Resets at:  {rate_limit['core']['reset']}")
print("━" * 50)

if rate_limit['core']['remaining'] > 4000:
    print("✅ Plenty of API calls available!")
elif rate_limit['core']['remaining'] > 1000:
    print("⚠️  Moderate API calls remaining")
else:
    print("🔴 Low API calls - may need to wait")

GitHub API Rate Limits:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Core API:    4998 /  5000 remaining
Search API:  4998 /  5000 remaining
Resets at:  2025-11-25 14:54:59+00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✅ Plenty of API calls available!
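
When the remaining count is low, the reset timestamp tells you how long to wait. The sketch below assumes that `rate_limit['core']['reset']` is a timezone-aware `datetime`, which the sample output suggests but the collector code does not confirm here. The helper `seconds_until_reset` and the `now` value are hypothetical, chosen only to make the example reproducible.

```python
from datetime import datetime, timezone
from typing import Optional


def seconds_until_reset(reset_at: datetime, now: Optional[datetime] = None) -> float:
    """Seconds to wait until the rate-limit window resets (never negative)."""
    if now is None:
        now = datetime.now(timezone.utc)
    return max(0.0, (reset_at - now).total_seconds())


# Reset time taken from the sample output; "now" is an illustrative value
reset_at = datetime(2025, 11, 25, 14, 54, 59, tzinfo=timezone.utc)
now = datetime(2025, 11, 25, 14, 44, 59, tzinfo=timezone.utc)
print(f"Wait {seconds_until_reset(reset_at, now):.0f} seconds")
```

A batch script could pass this value to `time.sleep()` before continuing.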


## Step 4: Quick Test - Check Maintainer Count

Let's verify a Stadium candidate by checking its maintainer count:

In [None]:
# Test on curl - confirmed Stadium project
test_repo = "curl/curl"

print(f"Testing: {test_repo}")
print("━" * 50)

# Get maintainer data
maintainers = collector.collect_maintainer_data(test_repo)

print(f"\n📊 Maintainer Analysis:")
print(f"   Total collaborators:     {maintainers['statistics']['total_collaborators']}")
print(f"   From files (parsed):     {maintainers['statistics']['maintainers_from_files_count']}")
print(f"   Active (6 months):       {maintainers['statistics']['active_maintainers_6mo']}")

if maintainers['maintainers_from_files']:
    print(f"\n👥 Maintainers from files:")
    print(f"   {', '.join(maintainers['maintainers_from_files'])}")

print(f"\n👥 Top Committers (last 6 months):")
for committer in maintainers['top_committers'][:5]:
    print(f"   - {committer['login']:20s} {committer['commits_6mo']:4d} commits")

# Get contributor data for entropy classification
print("\n" + "━" * 50)
print("📈 Entropy-Based Classification:")
contributors = collector.collect_contributor_data(test_repo, max_contributors=100)

from analysis.entropy_calculation import EntropyCalculator
entropy_calc = EntropyCalculator()
classification = entropy_calc.classify_project(contributors)

print(f"\n   Classification: {classification['classification']}")
print(f"   Confidence: {classification['confidence']:.0%}")
print(f"   Stadium Score: {classification.get('stadium_score', 'N/A')}/3 criteria met")
print(f"\n   Metrics:")
for key, value in classification['metrics'].items():
    if isinstance(value, float):
        print(f"      {key}: {value:.3f}")
    else:
        print(f"      {key}: {value}")

print(f"\n   Criteria Met: {', '.join(classification['criteria_met']) or 'None'}")
print("━" * 50)
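
For intuition about what an entropy-based metric measures, here is a minimal sketch of normalized Shannon entropy over contribution counts. It is not necessarily what `EntropyCalculator.classify_project` computes; the function name `normalized_entropy` and the sample counts are hypothetical. Values near 0 mean one contributor dominates, and values near 1 mean contributions are evenly spread.

```python
import math


def normalized_entropy(counts):
    """Shannon entropy of contribution shares, scaled to the range [0, 1]."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        return 0.0
    shares = [c / total for c in counts]
    entropy = -sum(p * math.log2(p) for p in shares)
    # Divide by the maximum possible entropy for this number of contributors
    return entropy / math.log2(len(counts))


# Illustrative counts, not real project data
print(f"{normalized_entropy([900, 50, 30, 20]):.3f}")  # one dominant contributor
print(f"{normalized_entropy([25, 25, 25, 25]):.3f}")   # evenly spread
```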

## Step 5: Collect Complete Dataset

Now let's collect a full dataset for curl:

In [None]:
# Collect complete dataset
repo = "curl/curl"

print(f"\n🚀 Collecting complete dataset for: {repo}")
print("This will take a few minutes...\n")

data = collector.collect_complete_dataset(repo, since_days=365)

## Step 6: Examine Collected Data

In [None]:
# Display summary
print("\n" + "=" * 60)
print("COLLECTION SUMMARY")
print("=" * 60)
print(f"Repository: {repo}")
# Numeric defaults keep the thousands-separator format valid if a field is missing
print(f"Stars: {data['repository'].get('stargazers_count', 0):,}")
print(f"Forks: {data['repository'].get('forks_count', 0):,}")
print(f"Language: {data['repository'].get('language', 'N/A')}")

print(f"\n👥 Maintainers:")
print(f"   Collaborators:  {data['maintainers']['statistics'].get('total_collaborators', 0)}")
print(f"   From files:     {data['maintainers']['statistics'].get('maintainers_from_files_count', 0)}")
print(f"   Active (6mo):   {data['maintainers']['statistics'].get('active_maintainers_6mo', 0)}")

print(f"\n📈 Activity:")
print(f"   Contributors: {len(data['contributors'])}")
print(f"   Commits:      {len(data['recent_commits'])}")

print(f"\n🔀 Pull Requests:")
pr_stats = data['pull_requests']['statistics']
print(f"   Total:          {pr_stats.get('total_prs', 0)}")
print(f"   Merged:         {pr_stats.get('merged_count', 0)}")
if pr_stats.get('total_prs', 0) > 0:
    print(f"   Merge rate:     {pr_stats.get('merged_count', 0) / pr_stats['total_prs'] * 100:.1f}%")
    print(f"   Avg time:       {pr_stats.get('avg_time_to_merge', 0):.1f} hours")
    print(f"   Conflict rate:  {pr_stats.get('conflict_rate', 0) * 100:.1f}%")

print(f"\n🐛 Issues:")
issue_stats = data['issues']['statistics']
print(f"   Total:         {issue_stats.get('total_issues', 0)}")
print(f"   Closed:        {issue_stats.get('closed_count', 0)}")
if issue_stats.get('total_issues', 0) > 0:
    print(f"   Avg time:      {issue_stats.get('avg_time_to_close', 0):.1f} hours")
    print(f"   Bugs:          {issue_stats.get('bug_count', 0)}")
    print(f"   Enhancements:  {issue_stats.get('enhancement_count', 0)}")

print(f"\n📋 Governance Files:")
gov_files = [k for k, v in data['governance_files'].items() if v]
if gov_files:
    for f in gov_files:
        print(f"   ✓ {f}")
else:
    print("   - None found")

print("=" * 60)

## Step 7: Save Data

In [None]:
from pathlib import Path

# Save to data/raw/
output_path = Path("../data/raw") / f"{repo.replace('/', '_')}_data.json"
collector.save_data(data, output_path)

print(f"\n💾 Data saved to: {output_path}")
print(f"   File size: {output_path.stat().st_size / 1024:.1f} KB")

## Step 8: Quick Validation - Is this a Stadium Project?

In [None]:
def validate_stadium_project(data):
    """Check if project meets Stadium criteria."""
    
    checks = []
    
    # 1. Maintainer count
    active_maintainers = data['maintainers']['statistics']['active_maintainers_6mo']
    if active_maintainers <= 3:
        checks.append((True, f"✅ Maintainers: {active_maintainers} ≤ 3"))
    else:
        checks.append((False, f"❌ Maintainers: {active_maintainers} > 3"))
    
    # 2. Activity
    commits = len(data['recent_commits'])
    if commits >= 50:  # At least 50 commits in last year
        checks.append((True, f"✅ Activity: {commits} commits/year"))
    else:
        checks.append((False, f"⚠️  Activity: {commits} commits/year (low)"))
    
    # 3. Impact (stars as proxy)
    stars = data['repository']['stargazers_count']
    if stars >= 1000:
        checks.append((True, f"✅ Impact: {stars:,} stars"))
    else:
        checks.append((False, f"⚠️  Impact: {stars:,} stars (low)"))
    
    # 4. Contributors (should have some, but not too many)
    contributors = len(data['contributors'])
    if 10 <= contributors <= 200:
        checks.append((True, f"✅ Contributors: {contributors} (balanced)"))
    elif contributors > 200:
        checks.append((False, f"⚠️  Contributors: {contributors} (high, may be Federation)"))
    else:
        checks.append((False, f"⚠️  Contributors: {contributors} (low)"))
    
    print("\n🎯 Stadium Validation:")
    print("━" * 50)
    for passed, message in checks:
        print(f"   {message}")
    print("━" * 50)
    
    all_passed = all(check[0] for check in checks[:2])  # Must pass first 2
    if all_passed:
        print("\n✅ STADIUM PROJECT CONFIRMED!")
    else:
        print("\n⚠️  Does not meet Stadium criteria")
        print("\n💡 Tip: Check dominance patterns in top_committers.")
        print("   Even with >3 active maintainers, strong dominance by 1-2")
        print("   contributors may indicate Stadium-like characteristics.")
    
    return all_passed

# Run validation
is_stadium = validate_stadium_project(data)

## Step 9: Check Dominance Patterns

For projects with >3 active maintainers, check if 1-2 are dominant:

In [None]:
# Calculate contributor dominance
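contributors = data['contributors']
total_contributions = sum(c['contributions'] for c in contributors)
if total_contributions > 0:  # Also guards against division by zero
    # Select the largest contributor explicitly instead of relying on list order
    top_contributor = max(contributors, key=lambda c: c['contributions'])
    dominance_ratio = top_contributor['contributions'] / total_contributions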
    
    print(f"\n📊 Contributor Dominance Analysis:")
    print(f"   Top contributor: {top_contributor['login']}")
    print(f"   Contributions:   {top_contributor['contributions']:,} / {total_contributions:,}")
    print(f"   Dominance ratio: {dominance_ratio * 100:.1f}%")
    
    if dominance_ratio > 0.4:  # >40% of contributions
        print(f"\n✅ STRONG DOMINANCE - Stadium-like characteristics!")
        print(f"   Even with {data['maintainers']['statistics']['active_maintainers_6mo']} active maintainers,")
        print(f"   {top_contributor['login']} controls {dominance_ratio * 100:.1f}% of contributions.")
    elif dominance_ratio > 0.25:  # >25% of contributions
        print(f"\n⚠️  MODERATE DOMINANCE")
    else:
        print(f"\n📊 DISTRIBUTED CONTRIBUTIONS")
        print(f"   May be Federation-type project.")
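
The cell above looks only at the single top contributor. To check whether the top one or two contributors together dominate, the same ratio can be generalized to the top `k`. This is a sketch: the helper `top_k_share` is hypothetical, and the counts are illustrative rather than real project data.

```python
def top_k_share(contributions, k=2):
    """Fraction of all contributions made by the k largest contributors."""
    total = sum(contributions)
    if total == 0:
        return 0.0
    top = sorted(contributions, reverse=True)[:k]
    return sum(top) / total


# Illustrative contribution counts, not real project data
counts = [600, 250, 100, 50]
print(f"Top-1 share: {top_k_share(counts, k=1):.0%}")
print(f"Top-2 share: {top_k_share(counts, k=2):.0%}")
```

With the collected dataset, the input would be `[c['contributions'] for c in data['contributors']]`.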

## Next Steps

Now that you've successfully collected data for one project, you can:

1. **Verify more Stadium candidates:**
   - See `data/stadium_candidates.md` for the list
   - Run `collect_maintainer_data()` to quickly check maintainer counts

2. **Calculate entropy:**
   - Use `src/analysis/entropy_calculation.py`
   - See notebook `01_data_exploration.ipynb`

3. **Batch collection:**
   - Collect data for multiple projects
   - Create a script to iterate through candidates
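
A batch script for the last item could look like the sketch below. It takes the collection and save steps as callables so it stays independent of the collector class; `collect_batch` is a hypothetical helper, not part of this project. Failures are recorded per repository so that one bad candidate does not stop the run.

```python
from pathlib import Path


def collect_batch(repos, collect_fn, save_fn, out_dir):
    """Collect each repo in turn; record failures instead of stopping."""
    out_dir = Path(out_dir)
    saved, failed = [], {}
    for repo in repos:
        try:
            data = collect_fn(repo)
            # Same file naming as Step 7: owner/name becomes owner_name_data.json
            path = out_dir / f"{repo.replace('/', '_')}_data.json"
            save_fn(data, path)
            saved.append(path)
        except Exception as exc:  # keep going; inspect failures afterwards
            failed[repo] = str(exc)
    return saved, failed
```

In this notebook, `collect_fn` could be `lambda r: collector.collect_complete_dataset(r, since_days=365)`, `save_fn` could be `collector.save_data`, and `out_dir` could be `"../data/raw"`. Check the rate limit between repositories, since each full collection uses many API calls.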

---

**Rate limit check:**

In [None]:
# Check remaining rate limit
rate_limit = collector.get_rate_limit()
print(f"\n📊 Rate Limit After Collection:")
print(f"   Core API: {rate_limit['core']['remaining']}/{rate_limit['core']['limit']} remaining")
print(f"   Used: {rate_limit['core']['limit'] - rate_limit['core']['remaining']} calls")

## Step 10: Using Candidate Lists

We have pre-defined candidate lists for each governance category:

In [None]:
# Import candidate lists
sys.path.insert(0, '../data')
from candidates import (
    STADIUM_COLLECTED, STADIUM_HIGH_PRIORITY, STADIUM_ALL,
    FEDERATION_CANDIDATES, FEDERATION_HIGH_PRIORITY,
    CLUB_CANDIDATES, CLUB_HIGH_PRIORITY,
    TOY_CANDIDATES, TOY_HIGH_PRIORITY,
    get_uncollected, print_status
)

# Show collection status across all categories
print_status()

# Example: Get uncollected Stadium projects
uncollected_stadium = get_uncollected("stadium")
print(f"\nUncollected Stadium projects: {len(uncollected_stadium)}")
if uncollected_stadium[:5]:
    print("Next to collect:")
    for repo in uncollected_stadium[:5]:
        print(f"  - {repo}")