A comprehensive personal baseball attendance tracker that processes game data from Baseball Reference and generates detailed statistics, visualizations, and interactive web reports.
This project processes Baseball Reference HTML files for games you've attended and generates:
- Detailed Excel reports with attendance statistics and milestones
- Interactive website with visualizations and analytics
- Player milestone tracking (achievements witnessed at games)
- Stadium records and attendance patterns
- Comprehensive game logs and statistics
- Game Log Processing: Extracts and analyzes game data from Baseball Reference HTML files
- Player Statistics: Tracks individual player performance across attended games
- Milestone Detection: Identifies significant player achievements (career milestones, records, etc.)
- Signature Home Runs: Special tracking for memorable home runs
- Stadium Records: Attendance patterns and statistics by venue
- Comprehensive attendance summaries
- Player statistics breakdowns
- Milestone achievements
- Game-by-game logs with formatting
- Stadium visit tracking
- Interactive calendar showing all months (March-October)
- Team matchup matrices
- Division checklist (track teams/stadiums by division)
- Badges and achievements system
- Companions tracking (games attended with others)
- Team-specific dashboards (e.g., Orioles with streaks, opponent breakdowns)
- Smart insights and analytics
- Statistical leaders and trends
- Responsive design for mobile and desktop
mlb_baseball_processor/
├── engines/ # Core processing engines
│ ├── game_log_processor.py
│ ├── milestone_engine.py
│ └── special_events_engine.py
├── excel/ # Excel generation and formatting
│ ├── generators.py
│ ├── formatters.py
│ └── workbook_generator.py
├── parsers/ # HTML and data parsers
│ ├── html_parser.py
│ ├── play_by_play_parser.py
│ └── stats_parser.py
├── processors/ # Specialized stat processors
│ ├── game_log_processor.py
│ ├── milestones_processor.py
│ ├── player_stats_processor.py
│ ├── signature_home_runs_processor.py
│ ├── stadium_records_processor.py
│ └── summary_stats_processor.py
├── scrapers/ # Web scrapers for external data
│ ├── career_firsts_scraper.py # Career firsts & milestones
│ └── debut_scraper.py # MLB debut data
├── utils/ # Utility functions and helpers
│ ├── constants.py
│ ├── globals.py
│ ├── helpers.py
│ ├── log.py
│ └── stat_utils.py
├── website/ # Website generation
│ ├── generator.py
│ ├── serializers.py
│ └── templates.py
└── main.py # Main entry point
- Python 3.8+
- Required packages:
  - BeautifulSoup4 (HTML parsing)
  - openpyxl (Excel generation)
  - Additional dependencies (see requirements.txt)
- Clone the repository:

      git clone https://github.com/jnpushkin/mlb_baseball_processor.git
      cd mlb_baseball_processor

- Install dependencies:

      pip install -r requirements.txt

- Place Baseball Reference HTML files for games you've attended in the input directory (e.g., Current Season Games/)
- Run the processor:

      python3 -m baseball_processor "Current Season Games"

- Generated outputs:
  - Excel workbook: MLB Game Passport - BREF.xlsx
  - Interactive website: MLB Game Passport - BREF.html
| Option | Description |
|---|---|
| `--output-excel FILE` | Custom Excel filename |
| `--save-json` | Save intermediate JSON data |
| `--from-cache-only` | Load from cache (skip HTML parsing) |
| `--excel-only` | Generate only Excel |
| `--website-only` | Generate only website |
| `--verbose` | Enable debug output |
| `--parallel` | Use parallel processing for faster parsing |
| `--export-csv DIR` | Export all data to CSV files |
| `--deploy` | Deploy website to Surge after generation |
| `--scrape-career-firsts` | Scrape career firsts for players in processed games |
Create a companions.csv file to track who you attended games with. Each row pairs a game ID with one or more names; separate multiple names with a pipe (|):

    GameID,Companions
    BAL202505130,Dad
    SFN202507110,Dad|Mom
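
As a rough illustration of how a file in this format could be read, here is a minimal sketch using only the standard library. The function names `load_companions` and `parse_game_id` are hypothetical and are not taken from this project. The reading of the game ID (three-letter home team code, eight-digit date, trailing game number) is an assumption inferred from the sample IDs above.

```python
import csv
import io
from datetime import datetime


def parse_game_id(game_id):
    """Split an ID like 'BAL202505130' into (team code, date, game number).

    Assumed layout: 3-letter code + YYYYMMDD + trailing game number.
    """
    team = game_id[:3]
    game_date = datetime.strptime(game_id[3:11], "%Y%m%d").date()
    game_number = int(game_id[11:])
    return team, game_date, game_number


def load_companions(csv_file):
    """Map each GameID to a list of companion names (pipe-separated in the file)."""
    companions = {}
    for row in csv.DictReader(csv_file):
        names = [n.strip() for n in row["Companions"].split("|") if n.strip()]
        companions[row["GameID"].strip()] = names
    return companions


# In-memory copy of the sample rows shown above.
sample = io.StringIO("GameID,Companions\nBAL202505130,Dad\nSFN202507110,Dad|Mom\n")
by_game = load_companions(sample)

for game_id, names in by_game.items():
    team, game_date, _ = parse_game_id(game_id)
    print(team, game_date.isoformat(), ", ".join(names))
```

To read a real file, pass an open file handle instead of the `io.StringIO` sample, for example `open("companions.csv", newline="")`.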
Scrape Baseball Reference to find career firsts (first hit, first HR, etc.) and career milestones (100th HR, 3000th hit, etc.) for players in your attended games.
Standalone Usage:

    # Scrape career firsts for all players in your games
    python -m baseball_processor.scrapers.career_firsts_scraper

    # Scrape a specific player
    python -m baseball_processor.scrapers.career_firsts_scraper --player troutmi01

    # Scrape players from a specific game file
    python -m baseball_processor.scrapers.career_firsts_scraper --game "Game_Box_Score.json"

    # Refresh only players from 2025 games
    python -m baseball_processor.scrapers.career_firsts_scraper --refresh-year 2025

    # Re-scrape only batting data (preserves pitching data)
    python -m baseball_processor.scrapers.career_firsts_scraper --refresh-batting

    # Re-scrape only pitching data (preserves batting data)
    python -m baseball_processor.scrapers.career_firsts_scraper --refresh-pitching

    # Check for career firsts you witnessed
    python -m baseball_processor.scrapers.career_firsts_scraper --check-witnessed

Career Firsts Scraper Options:
| Option | Description |
|---|---|
| `--player ID` | Scrape a specific player by Baseball Reference ID |
| `--game FILE` | Scrape players from a specific cached game JSON file |
| `--refresh` | Force refresh cached data |
| `--refresh-year YEAR` | Refresh only players from games in a specific year |
| `--refresh-batting` | Re-scrape only batting data for all cached players |
| `--refresh-pitching` | Re-scrape only pitching data for all cached players |
| `--check-witnessed` | Check for career firsts witnessed at attended games |
| `--delay SECONDS` | Delay between requests (default: 3.05s) |
| `--quiet` | Suppress progress messages |
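
To give a feel for what a per-request delay like `--delay` does, the sketch below spaces out calls by a fixed interval. It is not the scraper's actual implementation: `RequestThrottle` and `FakeClock` are hypothetical names, and only the 3.05-second default comes from the table above. The fake clock is there so the example runs instantly instead of really sleeping.

```python
import time


class RequestThrottle:
    """Keep successive calls to wait() at least `delay` seconds apart."""

    def __init__(self, delay=3.05, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep only as long as needed; return the number of seconds slept."""
        slept = 0.0
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
                now = self._clock()
        self._last = now
        return slept


class FakeClock:
    """Stand-in clock for the demo: sleeping just advances the time."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


fake = FakeClock()
throttle = RequestThrottle(delay=3.05, clock=fake.time, sleep=fake.sleep)

# Three back-to-back "requests": the first goes immediately, the rest wait.
waits = [throttle.wait() for _ in range(3)]
print(waits)
```

In real use the defaults (`time.monotonic` and `time.sleep`) would apply, with `throttle.wait()` called once before each request.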
Tracked Batting Milestones:
- Firsts: Hit, Home Run, RBI, Double, Triple, Walk, Stolen Base, Run
- Thresholds: Hits (100-4000), HRs (10-800), RBIs (100-2000), Doubles (50-700), Triples (25-200), SBs (50-800), Walks (100-2000), Runs (100-2000)
Tracked Pitching Milestones:
- Firsts: Win, Save, Strikeout, Inning Pitched, Start, Complete Game, Shutout
- Thresholds: Wins (25-350), Saves (25-600), Strikeouts (100-3500), Innings Pitched (100-3500), Starts (50-500), Complete Games (10-100), Shutouts (10-100)
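
One simple way to decide whether a threshold milestone happened at a particular game is to compare a player's career total before the game with the total after it. The sketch below is illustrative only: `crossed_thresholds` is a hypothetical helper, and the 100-hit step between thresholds is an assumption, since the list above gives only the range (100-4000) and not the spacing.

```python
def crossed_thresholds(total_before, gained_in_game, thresholds):
    """Return every threshold reached during the game.

    A threshold counts as reached when the career total was below it
    before the game and at or above it afterwards.
    """
    total_after = total_before + gained_in_game
    return [t for t in thresholds if total_before < t <= total_after]


# Assumption: hit milestones every 100 hits across the documented 100-4000 range.
HIT_THRESHOLDS = range(100, 4001, 100)

# A player who enters the game with 2,998 career hits and collects 3 more.
print(crossed_thresholds(2998, 3, HIT_THRESHOLDS))  # [3000]
```

The same comparison works for any of the counting stats listed above by swapping in a different threshold sequence.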
- Python: Core processing and data analysis
- BeautifulSoup4: HTML parsing from Baseball Reference
- openpyxl: Excel report generation
- React: Interactive website frontend (for visualization components)
Game data is sourced from Baseball Reference, which provides comprehensive statistics and play-by-play information for MLB games.
- Enhanced milestone detection algorithms
- Additional visualization types
- Historical trend analysis
- Advanced statistical insights
Personal project by Jeremy Pushkin
This project is for personal use. Baseball data is property of MLB and Baseball Reference.