# Scraping Supplementary Pitching Data from Baseball Reference

The provided dataset, `data/k.csv`, contains a limited set of pitching statistics in eight columns:  

1. **`MLBAMID`**: Player's MLB ID  
2. **`PlayerId`**: Player's FanGraphs ID  
3. **`Name`**: Player's name  
4. **`Team`**: Player's team name (*Note*: `" - - -"` indicates the player played for multiple teams in a season)  
5. **`Age`**: Player's age during the 2024 season  
6. **`Season`**: Year of the season  
7. **`TBF`**: Total batters faced for the player-season  
8. **`K%`**: Strikeout percentage for the player-season  
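
Because the multi-team placeholder in `Team` is not a real team name, it can help to flag those player-seasons before matching against other sources. The snippet below is a minimal sketch: the helper name `is_multi_team` and the sample rows are hypothetical, and it assumes the marker may carry surrounding whitespace.

```python
MULTI_TEAM_MARKER = "- - -"  # placeholder described in the column list above


def is_multi_team(team: str) -> bool:
    """Return True when the Team value is the multi-team placeholder."""
    # Strip whitespace so both "- - -" and " - - -" are recognized.
    return team.strip() == MULTI_TEAM_MARKER


# Hypothetical rows for illustration only (not taken from k.csv).
rows = [
    {"Name": "Pitcher A", "Team": "NYY", "Season": 2023},
    {"Name": "Pitcher B", "Team": " - - -", "Season": 2023},
]

multi_team_names = [r["Name"] for r in rows if is_multi_team(r["Team"])]
print(multi_team_names)
```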

To make accurate predictions of a pitcher's strikeout percentage (`K%`) for the 2024 season, additional contextual data will likely be required. Fortunately, Baseball Reference offers a comprehensive dataset of MLB pitching statistics: [Baseball Reference Pitching Data](https://www.baseball-reference.com/leagues/majors/2021-pitches-pitching.shtml).  

## Scraping Utility
To facilitate data collection, a scraping utility has been implemented:  
- **`bullpen.data_utils.Scraper()`**: A core scraping tool for Baseball Reference data.  
- **`bullpen.data_utils.batch_scrape()`**: A convenience function to scrape data across multiple seasons.  

Since the dataset in `k.csv` covers the seasons from 2021 to 2024, we will limit our scraping to this same range.
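
The per-season pages follow a predictable URL pattern, so the season range maps directly to a list of pages. The sketch below only illustrates that mapping; the names `URL_TEMPLATE` and `season_urls` are hypothetical and it does not claim to reflect how `batch_scrape()` is implemented.

```python
# URL pattern taken from the Baseball Reference link in this document.
URL_TEMPLATE = (
    "https://www.baseball-reference.com/leagues/majors/"
    "{season}-pitches-pitching.shtml"
)


def season_urls(first: int, last: int) -> list[str]:
    """Build one pitching-page URL per season, inclusive of both ends."""
    return [URL_TEMPLATE.format(season=s) for s in range(first, last + 1)]


# Same range as k.csv: 2021 through 2024.
urls = season_urls(2021, 2024)
for url in urls:
    print(url)
```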

---

## Supplemental Data Attributes

The Baseball Reference data contains the following 35 columns, most of which go beyond what `k.csv` provides:  

1. **`Rk`**: Arbitrary rank based on sorting  
2. **`Name`**: Player's name  
3. **`Age`**: Age as of June 30th of the season year  
4. **`Tm`**: Abbreviated team name  
5. **`IP`**: Innings pitched  
6. **`PA`**: Number of plate appearances (including inning-ending baserunning outs)  
7. **`Pit`**: Total pitches in plate appearances  
8. **`Pit/PA`**: Pitches per plate appearance  
9. **`Str`**: Total strikes (pitches in the zone plus pitches swung at out of the zone)  
10. **`Str%`**: Strike percentage (`Str / Pit`)  
11. **`L/Str`**: Looking strike percentage (`Looking strikes / Str`)  
12. **`S/Str`**: Swinging strike percentage (`Swinging strikes / Str`)  
13. **`F/Str`**: Foul strike percentage (`Fouls / Str`)  
14. **`I/Str`**: Balls in play percentage (`Balls in play / Str`)  
15. **`AS/Str`**: Percentage of strikes swung at (`(In-play + Fouls + Swinging strikes) / Str`)  
16. **`I/Bll`**: Intentional ball percentage (`Intentional balls / Total balls`)  
17. **`AS/Pit`**: Swing percentage (`Swings / (Pit - Intentional balls)`)  
18. **`Con`**: Contact percentage (`(Fouls + In-play) / Swings`)  
19. **`1st%`**: First pitch strike percentage (`First-pitch strikes / PA`)  
20. **`30%`**: Percentage of 3-0 counts seen (`3-0 counts / PA`)  
21. **`30c`**: Total 3-0 counts  
22. **`30s`**: Strikes in 3-0 counts  
23. **`02%`**: Percentage of 0-2 counts seen (`0-2 counts / PA`)  
24. **`02c`**: Total 0-2 counts  
25. **`02s`**: Strikes in 0-2 counts  
26. **`02h`**: Hits allowed in 0-2 counts  
27. **`L/SO`**: Strikeouts looking  
28. **`S/SO`**: Strikeouts swinging  
29. **`L/SO%`**: Looking strikeout percentage (`Looking SO / Total SO`)  
30. **`3pK`**: Three-pitch strikeouts  
31. **`4pW`**: Four-pitch walks  
32. **`PAu`**: Plate appearances with unknown outcomes  
33. **`Pitu`**: Pitches with unknown ball-strike results  
34. **`Stru`**: Strikes with unknown details  
35. **`Season`**: Year of the season  
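
Several of these columns are simple ratios of the count columns, which makes a quick sanity check possible after scraping. The sketch below recomputes three of them from the counts in the first sample row of this document (Fernando Abad, 2021). It assumes total strikeouts equal `L/SO + S/SO` and that the published values are rounded to two or three decimals.

```python
# Counts copied from the Fernando Abad (2021) sample row in this document.
row = {"PA": 82, "Pit": 299, "Str": 183, "L/SO": 5, "S/SO": 5}

# Pit/PA: pitches per plate appearance.
pit_per_pa = round(row["Pit"] / row["PA"], 2)

# Str%: strikes as a share of all pitches.
str_pct = round(row["Str"] / row["Pit"], 3)

# L/SO%: looking strikeouts as a share of all strikeouts
# (assumption: total SO = looking SO + swinging SO).
looking_so_pct = round(row["L/SO"] / (row["L/SO"] + row["S/SO"]), 3)

print(pit_per_pa, str_pct, looking_so_pct)  # 3.65 0.612 0.5
```

These match the `Pit/PA`, `Str%`, and `L/SO%` values reported for that row.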

---

## Data Organization

All scraped data will be saved in the `data/` directory. A module-level attribute provides convenient access to this directory:  

```python
from bullpen.data_utils import DATA_DIR

print(DATA_DIR)
```

```
/Users/logan/Desktop/repos/mlb-pitcher-xK/data
```

```python
from bullpen import data_utils

supplemental_data = data_utils.batch_scrape([2021, 2022, 2023, 2024])
supplemental_data.head()
```

```
scraping https://www.baseball-reference.com/leagues/majors/2021-pitches-pitching.shtml...
scraping https://www.baseball-reference.com/leagues/majors/2022-pitches-pitching.shtml...
scraping https://www.baseball-reference.com/leagues/majors/2023-pitches-pitching.shtml...
scraping https://www.baseball-reference.com/leagues/majors/2024-pitches-pitching.shtml...
```

| | Rk | Name | Age | Tm | IP | PA | Pit | Pit/PA | Str | Str% | ... | 02h | L/SO | S/SO | L/SO% | 3pK | 4pW | PAu | Pitu | Stru | Season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Fernando Abad | 35 | BAL | 17.2 | 82 | 299 | 3.65 | 183 | 0.612 | ... | 1 | 5 | 5 | 0.5 | 1 | 0 | 0 | 0 | 0 | 2021 |
| 1 | 2 | Cory Abbott | 25 | CHC | 17.1 | 82 | 352 | 4.29 | 203 | 0.577 | ... | 0 | 4 | 8 | 0.333 | 1 | 2 | 0 | 0 | 0 | 2021 |
| 2 | 3 | Albert Abreu | 25 | NYY | 36.2 | 156 | 642 | 4.12 | 392 | 0.611 | ... | 1 | 9 | 26 | 0.257 | 7 | 2 | 0 | 0 | 0 | 2021 |
| 3 | 4 | Bryan Abreu | 24 | HOU | 36.0 | 161 | 689 | 4.28 | 407 | 0.591 | ... | 3 | 6 | 30 | 0.167 | 6 | 4 | 0 | 0 | 0 | 2021 |
| 4 | 5 | Domingo Acevedo | 27 | OAK | 11.0 | 44 | 174 | 3.95 | 114 | 0.655 | ... | 1 | 0 | 9 | 0.0 | 4 | 1 | 0 | 0 | 0 | 2021 |

```python
# supplemental_data.to_csv('../data/supplemental-stats.csv', index=False)
```