The purpose of this notebook is to create a function that takes the imported player list and:
- cleans it down to only the needed information
- separates it into pitchers and batters
- merges each group on player name to get projections from the model
- merges the pitcher and batter data back together
- outputs a DataFrame for now
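
The steps listed above could eventually be wrapped up roughly as follows. This is only a sketch: the function name `prepare_player_list`, the `batter_proj` argument, and the assumption that both projection tables carry `Name` and `Proj_FPPG` columns are illustrative and not taken from the finished notebook.

```python
import pandas as pd

def prepare_player_list(fd, pitcher_proj, batter_proj):
    """Hypothetical end-to-end version of the steps in this notebook."""
    # keep only the needed columns and rename Nickname to Name
    keep = ['Id', 'Position', 'Nickname', 'Salary', 'Game', 'Team',
            'Opponent', 'Injury Indicator', 'Probable Pitcher']
    df = fd[keep].rename(columns={'Nickname': 'Name'})

    # fill nulls, then keep only healthy players
    df = df.fillna({'Injury Indicator': 'Healthy', 'Probable Pitcher': 'No'})
    df = df.loc[df['Injury Indicator'] == 'Healthy']

    # split into probable starting pitchers and batters
    is_pitcher = df['Position'] == 'P'
    pitchers = df.loc[is_pitcher & (df['Probable Pitcher'] == 'Yes')]
    batters = df.loc[~is_pitcher]

    # merge projections onto each group by player name (assumed column names)
    pitchers = pitchers.merge(pitcher_proj[['Name', 'Proj_FPPG']], how='left', on='Name')
    batters = batters.merge(batter_proj[['Name', 'Proj_FPPG']], how='left', on='Name')

    # stack the two groups back into one DataFrame
    return pd.concat([pitchers, batters], ignore_index=True)
```

Passing the projection tables in as arguments, rather than reading files inside the function, keeps the sketch easy to test on small hand-built DataFrames.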

Next steps:
- build a lineup based on salary

In [1]:
# imports
import pandas as pd

In [2]:
# load the FanDuel player list export
fd = pd.read_csv('../CapStone_Data/FanDuel-MLB-2021 ET-05 ET-04 ET-58318-players-list.csv')

In [3]:
# print first 5 rows of fd
fd.head()

Unnamed: 0,Id,Position,First Name,Nickname,Last Name,FPPG,Played,Salary,Game,Team,Opponent,Injury Indicator,Injury Details,Tier,Probable Pitcher,Batting Order,Roster Position
0,58318-52859,P,Jacob,Jacob deGrom,deGrom,61.6,5.0,12500,NYM@STL,NYM,STL,,,,Yes,,P
1,58318-16956,P,Gerrit,Gerrit Cole,Cole,54.166667,6.0,12200,HOU@NYY,NYY,HOU,,,,,0.0,P
2,58318-82554,P,Shane,Shane Bieber,Bieber,55.666667,6.0,12000,CLE@KC,CLE,KC,,,,,,P
3,58318-5481,P,Max,Max Scherzer,Scherzer,42.166667,6.0,12000,ATL@WSH,WSH,ATL,,,,,0.0,P
4,58318-82604,P,Corbin,Corbin Burnes,Burnes,49.6,5.0,11100,MIL@PHI,MIL,PHI,IL,Undisclosed,,,0.0,P


Only these columns are needed:
- Id - will need this later for the template
- Position
- Nickname - renamed to Name
- Salary
- Game
- Team
- Opponent
- Injury Indicator
- Probable Pitcher
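
As an alternative to dropping the unwanted columns after loading, the list above can be passed to `pd.read_csv` through its `usecols` parameter. The snippet below is a minimal sketch on a one-row stand-in for the FanDuel file; the row values and the names `sample_csv`, `needed`, and `fd_small` are made up for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for the FanDuel export: same header layout, one made-up row.
sample_csv = io.StringIO(
    "Id,Position,First Name,Nickname,Last Name,FPPG,Played,Salary,Game,Team,"
    "Opponent,Injury Indicator,Injury Details,Tier,Probable Pitcher,"
    "Batting Order,Roster Position\n"
    "1-1,P,Test,Test Pitcher,Pitcher,10.0,1,5000,AAA@BBB,AAA,BBB,,,,Yes,,P\n"
)

# the columns listed above, using the file's own spelling
needed = ['Id', 'Position', 'Nickname', 'Salary', 'Game', 'Team',
          'Opponent', 'Injury Indicator', 'Probable Pitcher']

# read only those columns; everything else is skipped at load time
fd_small = pd.read_csv(sample_csv, usecols=needed)
```

A keep-list like this fails loudly if FanDuel renames a needed column, whereas a drop-list would silently let new, unexpected columns through.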

In [4]:
fd.columns

Index(['Id', 'Position', 'First Name', 'Nickname', 'Last Name', 'FPPG',
       'Played', 'Salary', 'Game', 'Team', 'Opponent', 'Injury Indicator',
       'Injury Details', 'Tier', 'Probable Pitcher', 'Batting Order',
       'Roster Position'],
      dtype='object')

In [6]:
# clean up fd to match column list above
fd.drop(columns=['First Name', 'Last Name', 'FPPG', 'Played',
                 'Injury Details', 'Tier', 'Batting Order', 'Roster Position'], inplace=True)

Next step is to fill nulls in Probable Pitcher and Injury Indicator.

In [9]:
# fill nulls for probable pitcher (assign back; inplace fillna on a selected column is deprecated in newer pandas)
fd['Probable Pitcher'] = fd['Probable Pitcher'].fillna('No')

In [11]:
# fill nulls for injury indicator (a null here means no injury flag, so treat as healthy)
fd['Injury Indicator'] = fd['Injury Indicator'].fillna('Healthy')
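
The two fills above can also be written as a single call by passing `DataFrame.fillna` a dictionary of column-to-value pairs. A small sketch on a toy frame (the `toy` data is made up; only the column names come from the FanDuel file):

```python
import numpy as np
import pandas as pd

# toy frame with the same two columns and some missing values
toy = pd.DataFrame({
    'Injury Indicator': [np.nan, 'IL', np.nan],
    'Probable Pitcher': ['Yes', np.nan, np.nan],
})

# one fill value per column; columns not named in the dict are left untouched
toy = toy.fillna({'Injury Indicator': 'Healthy', 'Probable Pitcher': 'No'})

# quick check that no nulls remain
remaining_nulls = int(toy.isna().sum().sum())
```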

In [13]:
# review new cleaned df
fd.head()

Unnamed: 0,Id,Position,Nickname,Salary,Game,Team,Opponent,Injury Indicator,Probable Pitcher
0,58318-52859,P,Jacob deGrom,12500,NYM@STL,NYM,STL,Healthy,Yes
1,58318-16956,P,Gerrit Cole,12200,HOU@NYY,NYY,HOU,Healthy,No
2,58318-82554,P,Shane Bieber,12000,CLE@KC,CLE,KC,Healthy,No
3,58318-5481,P,Max Scherzer,12000,ATL@WSH,WSH,ATL,Healthy,No
4,58318-82604,P,Corbin Burnes,11100,MIL@PHI,MIL,PHI,IL,No


In [14]:
fd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 831 entries, 0 to 830
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Id                831 non-null    object
 1   Position          831 non-null    object
 2   Nickname          831 non-null    object
 3   Salary            831 non-null    int64 
 4   Game              831 non-null    object
 5   Team              831 non-null    object
 6   Opponent          831 non-null    object
 7   Injury Indicator  831 non-null    object
 8   Probable Pitcher  831 non-null    object
dtypes: int64(1), object(8)
memory usage: 58.6+ KB


Next step: rename Nickname to Name.

In [16]:
# renaming nickname column
fd.rename(columns={'Nickname': 'Name'}, inplace=True)

Next step: filter to only healthy players.

In [19]:
# filter to only healthy players
fd = fd.loc[fd['Injury Indicator'] == 'Healthy']

In [20]:
# review dataframe
fd.head()

Unnamed: 0,Id,Position,Name,Salary,Game,Team,Opponent,Injury Indicator,Probable Pitcher
0,58318-52859,P,Jacob deGrom,12500,NYM@STL,NYM,STL,Healthy,Yes
1,58318-16956,P,Gerrit Cole,12200,HOU@NYY,NYY,HOU,Healthy,No
2,58318-82554,P,Shane Bieber,12000,CLE@KC,CLE,KC,Healthy,No
3,58318-5481,P,Max Scherzer,12000,ATL@WSH,WSH,ATL,Healthy,No
5,58318-16931,P,Yu Darvish,11000,PIT@SD,SD,PIT,Healthy,No


In [21]:
fd.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 681 entries, 0 to 830
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Id                681 non-null    object
 1   Position          681 non-null    object
 2   Name              681 non-null    object
 3   Salary            681 non-null    int64 
 4   Game              681 non-null    object
 5   Team              681 non-null    object
 6   Opponent          681 non-null    object
 7   Injury Indicator  681 non-null    object
 8   Probable Pitcher  681 non-null    object
dtypes: int64(1), object(8)
memory usage: 53.2+ KB


Next step is to split into pitchers and batters.

In [23]:
# split using .loc by position and make new dataframe for pitchers
pitchers = fd.loc[fd['Position']=='P']

In [25]:
# split using .loc by position and make new dataframe for batters
batters = fd.loc[fd['Position']!='P']
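
The same split can be expressed with one boolean mask and its negation, which makes it easy to confirm that every row ends up in exactly one group. A minimal sketch on made-up rows (the `toy` frame and its names are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({
    'Name': ['Pitcher A', 'Catcher B', 'Outfielder C', 'Pitcher D'],
    'Position': ['P', 'C', 'OF', 'P'],
})

# True for pitchers, False for everyone else
is_pitcher = toy['Position'].eq('P')

# .copy() gives independent frames, so later column edits do not warn
toy_pitchers = toy.loc[is_pitcher].copy()
toy_batters = toy.loc[~is_pitcher].copy()

# the two groups should partition the original rows
split_is_complete = len(toy_pitchers) + len(toy_batters) == len(toy)
```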

Now we have cleaned DataFrames for each group. Let's focus on pitchers first and add projections to their DataFrame.

The first step for pitchers is to keep only the probable pitchers, since we only care about the ones that will start.

In [27]:
# keep only the probable starting pitchers
pitchers = pitchers.loc[pitchers['Probable Pitcher']=='Yes']

In [37]:
# review changes
pitchers.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21 entries, 0 to 398
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Id                21 non-null     object
 1   Position          21 non-null     object
 2   Name              21 non-null     object
 3   Salary            21 non-null     int64 
 4   Game              21 non-null     object
 5   Team              21 non-null     object
 6   Opponent          21 non-null     object
 7   Injury Indicator  21 non-null     object
 8   Probable Pitcher  21 non-null     object
dtypes: int64(1), object(8)
memory usage: 1.6+ KB


Next step is to combine projections with 2021 stats.
- import testing data with projections from model
- merge the two data frames

In [31]:
# read in pitcher projections
pitcher_proj = pd.read_csv('../Projections/pitcher_projections_2021.csv')

In [32]:
pitcher_proj

Unnamed: 0,Name,Team,Pos,W,L,GMS,GS,SV,IP,H,R,ER,HR,BB,SO,PTS,ERA,WHIP,FPPG,Proj_FPPG
0,Tyler Glasnow,TB,SP,4,1,7,7,0,43.2,23,10,10,4,15,64,337,2.06,0.87,48.142857,30.179118
1,Shane Bieber,CLE,SP,3,2,6,6,0,42.1,28,13,13,5,14,68,334,2.76,0.99,55.666667,28.664806
2,Gerrit Cole,NYY,SP,4,1,6,6,0,37.2,24,7,6,1,3,62,325,1.43,0.72,54.166667,19.099635
3,Jacob deGrom,NYM,SP,2,2,5,5,0,35.0,16,5,2,2,4,59,308,0.51,0.57,61.600000,20.120108
4,Trevor Bauer,LAD,SP,3,1,6,6,0,40.0,19,12,11,7,8,51,278,2.48,0.68,46.333333,22.412773
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
509,Keegan Thompson,CHC,SP,0,0,1,0,0,1.0,2,0,0,0,1,0,3,0.00,3.00,3.000000,9.842265
510,Brooks Kriske,NYY,RP,0,0,1,0,0,1.0,1,1,1,1,2,1,3,9.00,3.00,3.000000,5.662728
511,Alex Vesia,LAD,RP,0,1,1,0,0,1.0,0,4,2,0,4,2,3,18.00,4.00,3.000000,1.440864
512,Joakim Soria,ARI,RP,0,0,1,0,0,0.2,0,0,0,0,2,0,2,0.00,3.00,2.000000,10.769238


Next step: merge the probable pitchers with their projections.

In [33]:
# left-merge projections onto the probable pitchers by Name
pitcher_projections = pitchers.merge(pitcher_proj, how='left', on='Name')
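
Because both frames have a `Team` column, merging on `Name` alone produces `Team_x` and `Team_y` in the result. One way to avoid that, sketched below on toy data, is to select only the projection columns that are actually needed before merging. The `validate='many_to_one'` argument additionally makes pandas raise an error if a name appears more than once in the projections. The toy frames and the choice to keep only `Proj_FPPG` are assumptions for illustration.

```python
import pandas as pd

toy_pitchers = pd.DataFrame({
    'Name': ['A', 'B'],
    'Team': ['X', 'Y'],
    'Salary': [9000, 8000],
})
toy_proj = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Team': ['X', 'Y', 'Z'],
    'Proj_FPPG': [20.0, 18.0, 5.0],
})

merged = toy_pitchers.merge(
    toy_proj[['Name', 'Proj_FPPG']],  # drop the duplicate Team column up front
    how='left',
    on='Name',
    validate='many_to_one',           # error out if projections repeat a Name
)
```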

In [35]:
pitcher_projections

Unnamed: 0,Id,Position,Name,Salary,Game,Team_x,Opponent,Injury Indicator,Probable Pitcher,Team_y,...,R,ER,HR,BB,SO,PTS,ERA,WHIP,FPPG,Proj_FPPG
0,58318-52859,P,Jacob deGrom,12500,NYM@STL,NYM,STL,Healthy,Yes,NYM,...,5.0,2.0,2.0,4.0,59.0,308.0,0.51,0.57,61.6,20.120108
1,58318-60647,P,Aaron Nola,9200,MIL@PHI,PHI,MIL,Healthy,Yes,PHI,...,13.0,13.0,3.0,5.0,39.0,215.0,3.11,0.96,35.833333,17.907559
2,58318-12936,P,Kyle Gibson,8600,TEX@MIN,TEX,MIN,Healthy,Yes,TEX,...,9.0,8.0,0.0,11.0,27.0,195.0,2.16,1.14,32.5,23.140287
3,58318-5522,P,Zack Greinke,8400,HOU@NYY,HOU,NYY,Healthy,Yes,HOU,...,14.0,14.0,5.0,6.0,27.0,177.0,3.44,1.15,29.5,17.537554
4,58318-79951,P,Nick Pivetta,8300,DET@BOS,BOS,DET,Healthy,Yes,BOS,...,8.0,8.0,1.0,17.0,25.0,150.0,2.81,1.25,30.0,27.943867
5,58318-13123,P,Alex Cobb,8000,TB@LAA,LAA,TB,Healthy,Yes,LAA,...,14.0,13.0,1.0,5.0,23.0,89.0,7.16,1.78,22.25,12.761222
6,58318-101804,P,Huascar Ynoa,8000,ATL@WSH,ATL,WSH,Healthy,Yes,ATL,...,9.0,9.0,5.0,6.0,34.0,177.0,2.96,0.92,29.5,18.15508
7,58318-5767,P,J.A. Happ,7900,TEX@MIN,MIN,TEX,Healthy,Yes,MIN,...,5.0,5.0,2.0,7.0,13.0,113.0,1.96,0.83,28.25,18.51327
8,58318-85269,P,Cole Irvin,7700,TOR@OAK,OAK,TOR,Healthy,Yes,OAK,...,11.0,11.0,3.0,4.0,25.0,143.0,3.67,1.3,28.6,14.976088
9,58318-65987,P,Domingo German,7200,HOU@NYY,NYY,HOU,Healthy,Yes,,...,,,,,,,,,,
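
Some rows come back with no projection values after the left merge, which means their `Name` had no exact match in the projections file. The sketch below shows one way to list such rows with `indicator=True`, and one possible fix if the mismatch is caused by accents or letter case. That cause is an assumption, not something this notebook has confirmed; the helper `normalize_name`, the `name_key` column, and the toy names are all hypothetical.

```python
import unicodedata
import pandas as pd

def normalize_name(name):
    """Strip accents, surrounding whitespace, and case from a player name."""
    decomposed = unicodedata.normalize('NFKD', name)
    no_accents = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_accents.strip().lower()

# toy data: the second name differs only by accents between the two sources
left = pd.DataFrame({'Name': ['Player One', 'Jose Pena'], 'Salary': [9000, 7000]})
right = pd.DataFrame({'Name': ['Player One', 'José Peña'], 'Proj_FPPG': [20.0, 15.0]})

# indicator=True adds a _merge column marking rows found only on the left
raw = left.merge(right, how='left', on='Name', indicator=True)
unmatched = raw.loc[raw['_merge'] == 'left_only', 'Name'].tolist()

# merge on a normalized key instead of the raw name
left['name_key'] = left['Name'].map(normalize_name)
right['name_key'] = right['Name'].map(normalize_name)
fixed = left.merge(right.drop(columns='Name'), how='left', on='name_key')
```

If normalization does not resolve a name, the remaining entries in `unmatched` would need a manual look, for example a player missing from the projections file altogether.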


In [39]:
pitcher_projections.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21 entries, 0 to 20
Data columns (total 28 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Id                21 non-null     object 
 1   Position          21 non-null     object 
 2   Name              21 non-null     object 
 3   Salary            21 non-null     int64  
 4   Game              21 non-null     object 
 5   Team_x            21 non-null     object 
 6   Opponent          21 non-null     object 
 7   Injury Indicator  21 non-null     object 
 8   Probable Pitcher  21 non-null     object 
 9   Team_y            19 non-null     object 
 10  Pos               19 non-null     object 
 11  W                 19 non-null     float64
 12  L                 19 non-null     float64
 13  GMS               19 non-null     float64
 14  GS                19 non-null     float64
 15  SV                19 non-null     float64
 16  IP                19 non-null     float64
 17 