# Problem

When I updated the data at the end of February, some players weren't getting picked up (e.g. Yu Darvish).

My goal is to:

1. Identify where Darvish is getting lost
2. Fix the problem

I'll load the necessary things:

In [2]:
from dataScraping import * 
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
from sqlalchemy_utils import database_exists, create_database
import psycopg2
import math
from scipy.stats import zscore
from buildFeatureMatrix import *

I'm going to replicate the necessary steps for building out the features and see where Darvish disappears

First, I'll grab him using free agent data:

In [3]:
# Grab all free agents since 2006
all_years = list(range(2006,2018))
all_fa_data = getAllFAData(all_years)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  all_fa_contracts_real['Length'] = pd.to_numeric(all_fa_contracts_real['Length'])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  all_fa_contracts_real['Dollars'] = pd.to_numeric(all_fa_contracts_real['Dollars'].str.strip('$').str.replace(',',''))


Now I'll check to make sure he's there for the year 2017

In [4]:
all_fa_data[all_fa_data.nameLast == 'Darvish']

Unnamed: 0,Age,Full_Name,WAR_3,nameFirst,nameLast,Year,Dollars,Length,Name,Position
2070,31,Yu Darvish,6.5,Yu,Darvish,2017,126000000.0,6,Yu Darvish,P


Looks good. Now I'll fast-forward to loading from the database, as that's the next place I'd expect him to disappear:

In [5]:
engine = db_connect()
pitching_df = createPitchingTable(engine)
pitching_fa = addFilterFreeAgents(pitching_df, engine)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[k1] = value[k2]


In [6]:
pitching_fa[pitching_fa.nameLast == 'Darvish']

Unnamed: 0,Age,WAR_3,nameFirst,nameLast,Year,Dollars,Length,Position,playerID,yearID,ERA,WHIP,K_9,HR_9,IPouts,W,SV
994,31,6.5,Yu,Darvish,2017,126000000.0,6,P,darviyu01,2017,-0.244162,-0.45139,0.656312,-0.109456,2.42751,1.714552,-0.271589


He's still there, so I haven't lost him yet. I'll do the next couple steps, though they shouldn't affect much:

In [7]:
pitching_war = allPositionWAR(pitching_fa, engine)
pitching_adjusted = addInflation(pitching_war, engine)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  inflation.loc[2017] = inflation.loc[2016] * (1 + per_year)


In [8]:
pitching_adjusted[pitching_adjusted.nameLast == 'Darvish']

Unnamed: 0,Age,WAR_3,nameFirst,nameLast,Year,Dollars,Length,Position,playerID,yearID,...,K_9,HR_9,IPouts,W,SV,Med_WAR,Min_WAR,Inflation_Factor,Total,Dollars_2006


He's gone: the result is empty after these two steps. I'll check whether he was already lost in the pitching_war step, before addInflation ran:

In [9]:
pitching_war[pitching_war.nameLast == 'Darvish']

Unnamed: 0,Age,WAR_3,nameFirst,nameLast,Year,Dollars,Length,Position,playerID,yearID,ERA,WHIP,K_9,HR_9,IPouts,W,SV,Med_WAR,Min_WAR
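
As an aside, the stage-by-stage checks above could be wrapped in a small helper. This is only a minimal sketch, assuming every stage keeps a `nameLast` column; `trace_player` and the toy frames are hypothetical and not part of `dataScraping` or `buildFeatureMatrix`:

```python
import pandas as pd


def trace_player(stages, last_name):
    """Count the rows matching last_name in each (label, DataFrame) stage."""
    counts = {}
    for label, df in stages:
        counts[label] = int((df["nameLast"] == last_name).sum())
    return counts


# Toy stand-ins for the real pipeline outputs (hypothetical data)
fa = pd.DataFrame({"nameLast": ["Darvish", "Lackey"], "Position": ["P", "SP"]})
war = fa[fa["Position"] == "SP"]  # mimics a step that loses the "P" rows

print(trace_player([("pitching_fa", fa), ("pitching_war", war)], "Darvish"))
```

The first stage whose count drops to 0 is the one to dig into.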


Okay, so allPositionWAR is the problem step: Darvish is already missing from pitching_war. Somehow he's getting dropped there, so I'll now step through that method:

In [10]:
# Pull the data but drop the index
position_only_war = pullFullTable('position_team_war', engine).drop(['index'], axis = 1)
pitching_war = pullFullTable('pitcher_team_war', engine).drop(['index'], axis = 1)
    
# Put them together
position_war = pd.concat([position_only_war, pitching_war])
    
# Change the Year to "yearID"
position_war['yearID'] = position_war.Year
position_war = position_war.drop(['Year'], axis = 1)
    
# Create a dictionary for converting these to abbreviations
team_dict = {'Angels' : 'LAA', 'Astros' : 'HOU', 'Athletics' : 'OAK', 'Blue Jays' : 'TOR', 
                 'Braves' : 'ATL', 'Brewers': 'MIL', 'Cardinals' : 'STL', 'Cubs' : 'CHN',
                 'Diamondbacks' : 'ARI', 'Dodgers' : 'LAN', 'Giants' : 'SFN', 'Indians' : 'CLE',
                 'Mariners' : 'SEA', 'Marlins' : 'MIA', 'Mets' : 'NYN', 'Nationals' : 'WAS',
                 'Orioles' : 'BAL', 'Padres' : 'SDN', 'Phillies' : 'PHI', 'Pirates' : 'PIT', 
                 'Rangers' : 'TEX', 'Rays' : 'TBR', 'Red Sox' : 'BOS', 'Reds' : 'CIN', 
                 'Rockies' : 'COL', 'Royals' : 'KCR', 'Tigers' : 'DET', 'Twins' : 'MIN', 
                 'White Sox' : 'CHA', 'Yankees' : 'NYA'}
    
# Alter it to include WAR and Change the actual data frame
team_dict = {key : value + "_WAR" for key, value in team_dict.items()}
position_war = position_war.rename(columns = team_dict)
    
# Create stats for non-position/Year categories
position_war['Med_WAR'] = position_war.drop(['yearID', 'Position'], axis = 1).median(axis = 1)
position_war['Min_WAR'] = position_war.drop(['yearID', 'Position'], axis = 1).min(axis = 1)

# Shrink to only the Year/Position/Median/Min WAR stats
position_war_small = position_war[['yearID', 'Position', 'Med_WAR', 'Min_WAR']]
    

This will get merged with the pitching_fa data; the merge is on Position and yearID.

--> The problem here is the position: Darvish is a "P", not an "SP", so he has no matching row in the WAR table. This is really annoying.
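
To see why a label mismatch makes a row vanish without any error, here is a minimal sketch. It assumes the merge inside allPositionWAR is a default (inner) `merge`, which I haven't confirmed; the WAR numbers are made up:

```python
import pandas as pd

fa = pd.DataFrame({
    "nameLast": ["Darvish", "Lackey"],
    "Position": ["P", "SP"],
    "yearID": [2017, 2009],
})
war_small = pd.DataFrame({
    "Position": ["SP", "SP"],
    "yearID": [2017, 2009],
    "Med_WAR": [1.0, 2.0],  # made-up values
    "Min_WAR": [0.0, 0.5],  # made-up values
})

# A default merge is an inner join: rows without a key match are dropped
inner = fa.merge(war_small, on=["Position", "yearID"])

# A left merge with indicator=True keeps them and flags them instead
checked = fa.merge(war_small, on=["Position", "yearID"], how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only",
                        ["nameLast", "Position", "yearID"]]

print(len(inner))
print(unmatched)
```

Checking `_merge` for `left_only` rows right after the merge would have surfaced the dropped players immediately.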

I'd wager this is also the case for Arrieta:

In [11]:
pitching_fa[pitching_fa.nameLast == 'Arrieta']

Unnamed: 0,Age,WAR_3,nameFirst,nameLast,Year,Dollars,Length,Position,playerID,yearID,ERA,WHIP,K_9,HR_9,IPouts,W,SV
1052,32,15.1,Jake,Arrieta,2017,,0,P,arrieja01,2017,-0.288174,-0.389471,0.224369,-0.154607,2.083104,2.727146,-0.271589


Alright, so when did THIS crap happen? Apparently ESPN now has "P" as the position for some pitchers.

Just convert all "P" to "SP" for now? I think that's a quick fix. Unfortunate, but that's what I'll have to do. First, let's see who's labeled "P":

In [12]:
pitching_fa[pitching_fa.Position == 'P']

Unnamed: 0,Age,WAR_3,nameFirst,nameLast,Year,Dollars,Length,Position,playerID,yearID,ERA,WHIP,K_9,HR_9,IPouts,W,SV
67,29,1.9,Adam,Eaton,2006,24500000.0,3,P,eatonad01,2006,-0.081389,-0.007054,-0.231113,0.132766,-0.050145,0.708248,-0.276611
189,29,0.5,Matt,Belisle,2008,,0,P,belisma01,2008,0.586447,0.429357,-1.081583,0.114101,-0.599995,-0.617416,-0.271105
231,27,8.5,Francisco,Rodriguez,2008,37000000.0,3,P,rodrifr03,2008,-0.827441,-0.470944,1.425325,-0.679057,0.028131,-0.391227,8.970740
308,33,0.7,Fernando,Rodney,2009,11000000.0,2,P,rodnefe01,2009,-0.178338,-0.173942,0.167957,-0.231125,0.175033,-0.393941,5.144664
316,31,11.4,John,Lackey,2009,82500000.0,5,P,lackejo01,2009,-0.259877,-0.415463,0.104166,-0.294688,1.853448,1.742349,-0.264653
414,28,1.0,Matt,Albers,2010,875000.0,1,P,alberma01,2010,-0.136942,-0.091468,-0.486709,-0.331441,0.115946,0.256097,-0.274906
416,26,-3.0,Andrew,Miller,2010,,0,P,millean01,2010,0.708888,1.217855,0.189097,0.342212,-0.551544,-0.617039,-0.274906
432,33,2.1,Joaquin,Benoit,2010,16500000.0,3,P,benoijo01,2010,-0.805295,-1.286803,1.433709,-0.201379,-0.122074,-0.617039,-0.129918
455,29,1.5,Zach,Duke,2011,,0,P,dukeza01,2011,0.111551,0.252157,-1.320083,-0.334917,0.168917,-0.151274,-0.123258
463,28,9.4,Edwin,Jackson,2011,,1,P,jacksed01,2011,-0.260760,-0.024155,-0.176149,-0.318207,2.072354,1.883239,-0.263701


In [13]:
pitching_fa.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1080 entries, 0 to 1079
Data columns (total 17 columns):
Age          1080 non-null int64
WAR_3        1080 non-null float64
nameFirst    1080 non-null object
nameLast     1080 non-null object
Year         1080 non-null int64
Dollars      565 non-null float64
Length       1080 non-null int64
Position     1080 non-null object
playerID     1080 non-null object
yearID       1080 non-null int64
ERA          1080 non-null float64
WHIP         1080 non-null float64
K_9          1080 non-null float64
HR_9         1080 non-null float64
IPouts       1080 non-null float64
W            1080 non-null float64
SV           1080 non-null float64
dtypes: float64(9), int64(4), object(4)
memory usage: 151.9+ KB


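Rather than calling everyone "SP", I can split them by workload. IPouts (which looks standardized here, not a raw count) of about 0.5 seems like a good enough cutoff between starters and relievers...I'll use that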

In [14]:
pitch_subset = pitching_fa[pitching_fa.Position == 'P']

In [15]:
pitch_subset.loc[pitch_subset.IPouts >= 0.5, 'Position'] = 'SP'
pitch_subset.loc[pitch_subset.IPouts < 0.5, 'Position'] = 'RP'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


In [16]:
pitch_subset

Unnamed: 0,Age,WAR_3,nameFirst,nameLast,Year,Dollars,Length,Position,playerID,yearID,ERA,WHIP,K_9,HR_9,IPouts,W,SV
67,29,1.9,Adam,Eaton,2006,24500000.0,3,RP,eatonad01,2006,-0.081389,-0.007054,-0.231113,0.132766,-0.050145,0.708248,-0.276611
189,29,0.5,Matt,Belisle,2008,,0,RP,belisma01,2008,0.586447,0.429357,-1.081583,0.114101,-0.599995,-0.617416,-0.271105
231,27,8.5,Francisco,Rodriguez,2008,37000000.0,3,RP,rodrifr03,2008,-0.827441,-0.470944,1.425325,-0.679057,0.028131,-0.391227,8.970740
308,33,0.7,Fernando,Rodney,2009,11000000.0,2,RP,rodnefe01,2009,-0.178338,-0.173942,0.167957,-0.231125,0.175033,-0.393941,5.144664
316,31,11.4,John,Lackey,2009,82500000.0,5,SP,lackejo01,2009,-0.259877,-0.415463,0.104166,-0.294688,1.853448,1.742349,-0.264653
414,28,1.0,Matt,Albers,2010,875000.0,1,RP,alberma01,2010,-0.136942,-0.091468,-0.486709,-0.331441,0.115946,0.256097,-0.274906
416,26,-3.0,Andrew,Miller,2010,,0,RP,millean01,2010,0.708888,1.217855,0.189097,0.342212,-0.551544,-0.617039,-0.274906
432,33,2.1,Joaquin,Benoit,2010,16500000.0,3,RP,benoijo01,2010,-0.805295,-1.286803,1.433709,-0.201379,-0.122074,-0.617039,-0.129918
455,29,1.5,Zach,Duke,2011,,0,RP,dukeza01,2011,0.111551,0.252157,-1.320083,-0.334917,0.168917,-0.151274,-0.123258
463,28,9.4,Edwin,Jackson,2011,,1,SP,jacksed01,2011,-0.260760,-0.024155,-0.176149,-0.318207,2.072354,1.883239,-0.263701


That'll work...I'll add that to the code and call it good
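
For the version that goes into the code, something like the following should avoid the SettingWithCopyWarning by working on an explicit copy of the full frame instead of a filtered slice. This is a sketch only; `split_generic_pitchers` is a hypothetical name, and the 0.5 cutoff is the rough one chosen above:

```python
import numpy as np
import pandas as pd


def split_generic_pitchers(df, cutoff=0.5):
    """Relabel generic 'P' rows as 'SP' or 'RP' based on standardized IPouts."""
    out = df.copy()  # explicit copy, so the caller's frame is left untouched
    is_p = out["Position"] == "P"
    out.loc[is_p, "Position"] = np.where(
        out.loc[is_p, "IPouts"] >= cutoff, "SP", "RP"
    )
    return out


# Three of the "P" rows from above, reduced to the columns that matter
sample = pd.DataFrame({
    "nameLast": ["Darvish", "Rodney", "Lackey"],
    "Position": ["P", "P", "P"],
    "IPouts": [2.42751, 0.175033, 1.853448],
})

print(split_generic_pitchers(sample))
```

Rows that already carry "SP" or "RP" pass through unchanged, so this can run on the whole pitching_fa frame before the WAR merge.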

# Problem 2: Where is Lorenzo Cain?

In adding Darvish and friends, I lost Lorenzo Cain...but why? Let's look!

In [17]:
all_fa_data[all_fa_data.nameLast == 'Cain']

Unnamed: 0,Age,Full_Name,WAR_3,nameFirst,nameLast,Year,Dollars,Length,Name,Position
2090,32,Lorenzo Cain,15.4,Lorenzo,Cain,2017,80000000.0,5,Lorenzo Cain,OF


So it looks like Cain was there to start with, which is good. Where did he go? Let's do the steps, as above:

In [19]:
position_df = createBattingTable(engine)
position_fa = addFilterFreeAgents(position_df, engine)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[k1] = value[k2]


In [20]:
position_fa[position_fa.nameLast == 'Cain']

Unnamed: 0,Age,WAR_3,nameFirst,nameLast,Year,Dollars,Length,Position,playerID,yearID,G,OBP,SLG,HR,RBI,SB
2009,32,15.4,Lorenzo,Cain,2017,80000000.0,5,OF,cainlo01,2017,2.223589,1.071544,1.004702,1.213271,1.225883,4.682077


Still good...I'm starting to sense that Cain might not match because his position was simply "OF", not "CF". So let's see how many of those there are:

In [21]:
pos_subset = position_fa[position_fa.Position == 'OF']

In [22]:
pos_subset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 34 entries, 18 to 2111
Data columns (total 16 columns):
Age          34 non-null int64
WAR_3        34 non-null float64
nameFirst    34 non-null object
nameLast     34 non-null object
Year         34 non-null int64
Dollars      20 non-null float64
Length       34 non-null int64
Position     34 non-null object
playerID     34 non-null object
yearID       34 non-null int64
G            34 non-null float64
OBP          34 non-null float64
SLG          34 non-null float64
HR           34 non-null float64
RBI          34 non-null float64
SB           34 non-null float64
dtypes: float64(8), int64(4), object(4)
memory usage: 4.5+ KB


This is pretty manageable, let's see who's in it

In [23]:
pos_subset

Unnamed: 0,Age,WAR_3,nameFirst,nameLast,Year,Dollars,Length,Position,playerID,yearID,G,OBP,SLG,HR,RBI,SB
18,28,-0.4,Alexis,Gomez,2006,,0,OF,gomezal01,2006,0.127744,0.690602,0.692438,-0.392814,-0.405293,0.272648
182,25,0.0,Bronson,Sardinha,2007,,0,OF,sardibr01,2007,-0.935782,1.24277,0.432492,-0.505253,-0.531259,-0.34424
235,29,0.0,Nick,Gorneault,2007,,0,OF,gorneni01,2007,-1.102569,0.038034,-0.948662,-0.505253,-0.600185,-0.34424
259,38,1.7,So,Taguchi,2007,1050000.0,1,OF,tagucso01,2007,1.566031,0.872907,0.576455,-0.114464,0.433696,0.71113
827,30,3.1,Fred,Lewis,2010,900000.0,1,OF,lewisfr02,2010,1.17358,0.810454,0.830702,0.590092,0.74457,2.290385
902,32,12.7,Jayson,Werth,2010,126000000.0,7,OF,werthja01,2010,2.15836,1.130596,1.339614,3.193451,2.591241,1.664208
1247,28,7.3,Melvin,Upton,2012,75250000.0,5,OF,uptonbj01,2012,1.986875,0.750337,1.229162,3.213217,2.411134,4.469374
1322,33,4.1,Nelson,Cruz,2013,8000000.0,1,OF,cruzne02,2013,1.204383,0.966944,1.590395,3.414838,2.480587,0.508707
1406,33,1.8,Rajai,Davis,2013,10000000.0,2,OF,davisra01,2013,1.18283,0.874384,0.913678,0.353591,0.373643,7.442113
1592,34,7.4,Nelson,Cruz,2014,57000000.0,4,OF,cruzne02,2014,2.289193,0.998011,1.7055,5.802666,3.931762,0.325597


Alright, this isn't a perfect solution, but let's just assume all of these players are left fielders. It's not the best fit for Cain, but everyone else should fall reasonably into line

In [25]:
pos_subset.loc[pos_subset.Position == 'OF','Position'] = 'LF'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


Actually, we can instead grab the "OF" WAR for teams and match on that...let's do that instead. Solved!
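
Both problems came down to position labels in the free agent data that had no counterpart in the team WAR table. A small check like this, run before the merge, could flag that early. It's a rough sketch with toy frames; `unmatched_positions` is a hypothetical helper, and the position lists are illustrative rather than the actual table contents:

```python
import pandas as pd


def unmatched_positions(fa_df, war_df):
    """Return Position labels present in fa_df but absent from war_df."""
    return sorted(set(fa_df["Position"]) - set(war_df["Position"]))


# Toy frames (illustrative labels only)
fa = pd.DataFrame({"Position": ["SP", "P", "OF", "CF"]})
war = pd.DataFrame({"Position": ["SP", "RP", "LF", "CF", "RF"]})

print(unmatched_positions(fa, war))  # ['OF', 'P']
```

An empty list means every free agent position has a row to match against.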