**Predicting the MLB's Most Valuable Player!**

In [1]:
import time
import pybaseball as pyb
import pandas as pd

In [2]:
# Fetch batting stats for the 2000-2024 seasons.
# This window covers the period in which sabermetric stats gained wider recognition among MVP voters.
start_season = 2000
end_season = 2024 

In [3]:
batting_stats_list = []
print(f"Fetching batting stats from FanGraphs for seasons {start_season}-{end_season}...")

for year in range(start_season, end_season + 1):
    print(f"  Fetching batting stats for {year}...")
    # Fetch data for one year at a time
    yearly_batting_stats = pyb.batting_stats(year, year, qual=0)
    batting_stats_list.append(yearly_batting_stats)
    time.sleep(1) # Be polite to the server by waiting a second between requests

# Combine the list of DataFrames into a single DataFrame
batting_stats = pd.concat(batting_stats_list, ignore_index=True)
print("Batting stats fetched successfully!")


Fetching batting stats from FanGraphs for seasons 2000-2024...
  Fetching batting stats for 2000...
  Fetching batting stats for 2001...
  Fetching batting stats for 2002...
  Fetching batting stats for 2003...
  Fetching batting stats for 2004...
  Fetching batting stats for 2005...
  Fetching batting stats for 2006...
  Fetching batting stats for 2007...
  Fetching batting stats for 2008...
  Fetching batting stats for 2009...
  Fetching batting stats for 2010...
  Fetching batting stats for 2011...
  Fetching batting stats for 2012...
  Fetching batting stats for 2013...
  Fetching batting stats for 2014...
  Fetching batting stats for 2015...
  Fetching batting stats for 2016...
  Fetching batting stats for 2017...
  Fetching batting stats for 2018...
  Fetching batting stats for 2019...
  Fetching batting stats for 2020...
  Fetching batting stats for 2021...
  Fetching batting stats for 2022...
  Fetching batting stats for 2023...
  Fetching batting stats for 2024...
Batting stats fetched successfully!

**Let's take a look at some sample batting statistics.**

In [4]:
print("\nSample Batting Stats:")
print(batting_stats.head())
print(f"\nShape of batting_stats: {batting_stats.shape}")


Sample Batting Stats:
   IDfg  Season            Name Team  Age    G   AB   PA    H   1B  ...  \
0  1274    2000  Alex Rodriguez  SEA   24  148  554  672  175   98  ...   
1    11    2000    Darin Erstad  ANA   26  157  676  747  240  170  ...   
2   432    2000     Todd Helton  COL   26  160  580  697  216  113  ...   
3    15    2000      Troy Glaus  ANA   23  159  563  678  160   75  ...   
4   818    2000    Jason Giambi  OAK   29  152  510  664  170   97  ...   

   maxEV  HardHit  HardHit%  Events  CStr%  CSW%  xBA  xSLG  xwOBA  L-WAR  
0    NaN      NaN       NaN       0    NaN   NaN  NaN   NaN    NaN    9.5  
1    NaN      NaN       NaN       0    NaN   NaN  NaN   NaN    NaN    8.7  
2    NaN      NaN       NaN       0    NaN   NaN  NaN   NaN    NaN    8.3  
3    NaN      NaN       NaN       0    NaN   NaN  NaN   NaN    NaN    8.2  
4    NaN      NaN       NaN       0    NaN   NaN  NaN   NaN    NaN    7.7  

[5 rows x 320 columns]

Shape of batting_stats: (32465, 320)
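
The sample above shows that the Statcast-era columns (maxEV, xBA, xwOBA and so on) are NaN for the 2000 rows. Before modeling, it may help to see how much of each column is missing. The helper below is a minimal sketch; `missing_share` is a name I made up for illustration, not part of pybaseball or pandas.

```python
import pandas as pd

def missing_share(df):
    """Return the fraction of missing values in each column, largest first.

    Hypothetical helper: isna() marks missing cells, mean() turns the
    True/False marks into a per-column fraction between 0 and 1.
    """
    return df.isna().mean().sort_values(ascending=False)
```

Calling `missing_share(batting_stats).head(20)` would list the twenty sparsest columns, which are candidates to drop or to restrict to later seasons.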


**We can't forget the pitchers!**

In [5]:
pitching_stats_list = []
print(f"\nFetching pitching stats from FanGraphs for seasons {start_season}-{end_season}...")

for year in range(start_season, end_season + 1):
    print(f"  Fetching pitching stats for {year}...")
    # Fetch data for one year at a time
    yearly_pitching_stats = pyb.pitching_stats(year, year, qual=0)
    pitching_stats_list.append(yearly_pitching_stats)
    time.sleep(1) # Be polite to the server

# Combine the list of DataFrames into a single DataFrame
pitching_stats = pd.concat(pitching_stats_list, ignore_index=True)
print("Pitching stats fetched successfully!")


Fetching pitching stats from FanGraphs for seasons 2000-2024...
  Fetching pitching stats for 2000...
  Fetching pitching stats for 2001...
  Fetching pitching stats for 2002...
  Fetching pitching stats for 2003...
  Fetching pitching stats for 2004...
  Fetching pitching stats for 2005...
  Fetching pitching stats for 2006...
  Fetching pitching stats for 2007...
  Fetching pitching stats for 2008...
  Fetching pitching stats for 2009...
  Fetching pitching stats for 2010...
  Fetching pitching stats for 2011...
  Fetching pitching stats for 2012...
  Fetching pitching stats for 2013...
  Fetching pitching stats for 2014...
  Fetching pitching stats for 2015...
  Fetching pitching stats for 2016...
  Fetching pitching stats for 2017...
  Fetching pitching stats for 2018...
  Fetching pitching stats for 2019...
  Fetching pitching stats for 2020...
  Fetching pitching stats for 2021...
  Fetching pitching stats for 2022...
  Fetching pitching stats for 2023...
  Fetching pitching stats for 2024...
Pitching stats fetched successfully!

**Let's take a look at some sample pitching statistics.**

In [6]:
print("\nSample Pitching Stats:")
print(pitching_stats.head())
print(f"\nShape of pitching_stats: {pitching_stats.shape}")



Sample Pitching Stats:
   IDfg  Season            Name Team  Age   W   L  WAR   ERA   G  ...  \
0    60    2000   Randy Johnson  ARI   36  19   7  9.6  2.64  35  ...   
1   200    2000  Pedro Martinez  BOS   28  18   6  9.4  1.74  29  ...   
2   104    2000     Greg Maddux  ATL   34  19   9  7.2  3.00  35  ...   
3   642    2000     Kevin Brown  LAD   35  13   6  6.8  2.58  33  ...   
4   837    2000    Mike Mussina  BAL   31  11  15  6.4  3.79  34  ...   

   Pit+ FC  Stf+ FS  Loc+ FS  Pit+ FS  Stuff+  Location+  Pitching+  Stf+ FO  \
0      NaN      NaN      NaN      NaN     NaN        NaN        NaN      NaN   
1      NaN      NaN      NaN      NaN     NaN        NaN        NaN      NaN   
2      NaN      NaN      NaN      NaN     NaN        NaN        NaN      NaN   
3      NaN      NaN      NaN      NaN     NaN        NaN        NaN      NaN   
4      NaN      NaN      NaN      NaN     NaN        NaN        NaN      NaN   

   Loc+ FO  Pit+ FO  
0      NaN      NaN  
1      NaN  

**Now I can save both the batting and pitching stats as CSVs, so I do not have to scrape them every time.**

In [7]:
batting_stats.to_csv('fangraphs_batting_stats.csv', index=False)
pitching_stats.to_csv('fangraphs_pitching_stats.csv', index=False)
print("\nData saved to CSV files.")


Data saved to CSV files.
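
To actually skip the scrape on later runs, the notebook needs to check for the saved file before fetching. The sketch below shows one way to do that. `load_or_fetch` and its `fetch_fn` argument are hypothetical names for illustration; `fetch_fn` stands for any zero-argument function that returns a DataFrame, such as a wrapper around the yearly loop above.

```python
from pathlib import Path

import pandas as pd

def load_or_fetch(csv_path, fetch_fn):
    """Load a DataFrame from csv_path if it exists, otherwise fetch and cache it.

    csv_path: where the cached CSV lives.
    fetch_fn: hypothetical zero-argument callable that returns a DataFrame
              (for example, a function wrapping the scraping loop).
    """
    path = Path(csv_path)
    if path.exists():
        # Cached copy found: read it instead of hitting the server again.
        return pd.read_csv(path)
    df = fetch_fn()
    df.to_csv(path, index=False)
    return df
```

With the loop wrapped in a function (say `fetch_batting`, also a made-up name), the call would look like `batting_stats = load_or_fetch('fangraphs_batting_stats.csv', fetch_batting)`. Deleting the CSV forces a fresh scrape.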


**Now I will enter every MVP winner along with their season, league, and team. Let's start with the AL MVPs!**

In [8]:
mvp_data = []

# American League MVP Winners
al_mvps = [
    (2000, 'Jason Giambi', 'AL', 'OAK'),
    (2001, 'Ichiro Suzuki', 'AL', 'SEA'),
    (2002, 'Miguel Tejada', 'AL', 'OAK'),
    (2003, 'Alex Rodriguez', 'AL', 'TEX'),
    (2004, 'Vladimir Guerrero', 'AL', 'LAA'),  # played as the Anaheim Angels in 2004; the 2000 FanGraphs rows above use ANA
    (2005, 'Alex Rodriguez', 'AL', 'NYY'),
    (2006, 'Justin Morneau', 'AL', 'MIN'),
    (2007, 'Alex Rodriguez', 'AL', 'NYY'),
    (2008, 'Dustin Pedroia', 'AL', 'BOS'),
    (2009, 'Joe Mauer', 'AL', 'MIN'),
    (2010, 'Josh Hamilton', 'AL', 'TEX'),
    (2011, 'Justin Verlander', 'AL', 'DET'),
    (2012, 'Miguel Cabrera', 'AL', 'DET'),
    (2013, 'Miguel Cabrera', 'AL', 'DET'),
    (2014, 'Mike Trout', 'AL', 'LAA'),
    (2015, 'Josh Donaldson', 'AL', 'TOR'),
    (2016, 'Mike Trout', 'AL', 'LAA'),
    (2017, 'Jose Altuve', 'AL', 'HOU'),
    (2018, 'Mookie Betts', 'AL', 'BOS'),
    (2019, 'Mike Trout', 'AL', 'LAA'),
    (2020, 'José Abreu', 'AL', 'CHW'),
    (2021, 'Shohei Ohtani', 'AL', 'LAA'),
    (2022, 'Aaron Judge', 'AL', 'NYY'),
    (2023, 'Shohei Ohtani', 'AL', 'LAA'),
    (2024, 'Aaron Judge', 'AL', 'NYY')
]
mvp_data.extend(al_mvps)

**Now for the NL MVPs!**

In [9]:
# National League MVP Winners
nl_mvps = [
    (2000, 'Jeff Kent', 'NL', 'SF'),
    (2001, 'Barry Bonds', 'NL', 'SF'),
    (2002, 'Barry Bonds', 'NL', 'SF'),
    (2003, 'Barry Bonds', 'NL', 'SF'),
    (2004, 'Barry Bonds', 'NL', 'SF'),
    (2005, 'Albert Pujols', 'NL', 'STL'),
    (2006, 'Ryan Howard', 'NL', 'PHI'),
    (2007, 'Jimmy Rollins', 'NL', 'PHI'),
    (2008, 'Albert Pujols', 'NL', 'STL'),
    (2009, 'Albert Pujols', 'NL', 'STL'),
    (2010, 'Joey Votto', 'NL', 'CIN'),
    (2011, 'Ryan Braun', 'NL', 'MIL'),
    (2012, 'Buster Posey', 'NL', 'SF'),
    (2013, 'Andrew McCutchen', 'NL', 'PIT'),
    (2014, 'Clayton Kershaw', 'NL', 'LAD'),
    (2015, 'Bryce Harper', 'NL', 'WSH'),
    (2016, 'Kris Bryant', 'NL', 'CHC'),
    (2017, 'Giancarlo Stanton', 'NL', 'MIA'),
    (2018, 'Christian Yelich', 'NL', 'MIL'),
    (2019, 'Cody Bellinger', 'NL', 'LAD'),
    (2020, 'Freddie Freeman', 'NL', 'ATL'),
    (2021, 'Bryce Harper', 'NL', 'PHI'),
    (2022, 'Paul Goldschmidt', 'NL', 'STL'),
    (2023, 'Ronald Acuña Jr.', 'NL', 'ATL'),
    (2024, 'Shohei Ohtani', 'NL', 'LAD')
]
mvp_data.extend(nl_mvps)
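
One thing to watch: the winner lists mix accented and unaccented spellings ('Jose Altuve' next to 'José Abreu' and 'Ronald Acuña Jr.'). If these names are later compared with the FanGraphs `Name` column by exact string match, a spelling difference could silently drop a winner. The sketch below strips accents and case before comparing. `normalize_name` is a hypothetical helper, and I have not checked which spelling FanGraphs uses for each player.

```python
import unicodedata

def normalize_name(name):
    """Return an accent-free, lowercase version of a player name for matching.

    Hypothetical helper: NFKD splits accented letters into a base letter plus
    combining marks, and the marks are then dropped.
    """
    decomposed = unicodedata.normalize('NFKD', name)
    stripped = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold().strip()
```

Applying it to both tables (for example, `mvp_df['Name'].map(normalize_name)`) gives a comparison key while leaving the display names untouched.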


**Let's bring them together into one dataframe.**

In [10]:
mvp_df = pd.DataFrame(mvp_data, columns=['Season', 'Name', 'League', 'Team'])

print(mvp_df.head())
print(mvp_df.tail())
print(f"\nTotal MVP winners recorded: {len(mvp_df)}")


   Season               Name League Team
0    2000       Jason Giambi     AL  OAK
1    2001      Ichiro Suzuki     AL  SEA
2    2002      Miguel Tejada     AL  OAK
3    2003     Alex Rodriguez     AL  TEX
4    2004  Vladimir Guerrero     AL  LAA
    Season              Name League Team
45    2020   Freddie Freeman     NL  ATL
46    2021      Bryce Harper     NL  PHI
47    2022  Paul Goldschmidt     NL  STL
48    2023  Ronald Acuña Jr.     NL  ATL
49    2024     Shohei Ohtani     NL  LAD

Total MVP winners recorded: 50
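
A total of 50 is what 25 seasons times two leagues should give, but the total alone would not catch a duplicated or mistyped season. Since the list was typed by hand, a per-season check is a cheap safeguard. This is a minimal sketch; `find_mvp_gaps` is a made-up name, and it assumes the `Season` and `League` columns defined above.

```python
import pandas as pd

def find_mvp_gaps(mvp_df, start_season, end_season):
    """Return the (season, league) pairs that do not appear exactly once.

    Hypothetical helper: counts rows per (Season, League), fills in zero for
    any expected pair that is absent, and reports every count other than 1.
    Rows outside the start_season..end_season range are ignored.
    """
    counts = mvp_df.groupby(['Season', 'League']).size()
    expected = pd.MultiIndex.from_product(
        [range(start_season, end_season + 1), ['AL', 'NL']],
        names=['Season', 'League'],
    )
    counts = counts.reindex(expected, fill_value=0)
    return counts[counts != 1].index.tolist()
```

An empty list from `find_mvp_gaps(mvp_df, start_season, end_season)` means every season has exactly one AL and one NL winner.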


**Once again, I will save this as a CSV.**

In [11]:
mvp_df.to_csv('mvp_winners_2000_2024.csv', index=False)
print("\nMVP winners data saved to 'mvp_winners_2000_2024.csv'")


MVP winners data saved to 'mvp_winners_2000_2024.csv'
