# Assemble the Features

We want to assemble our data into a data frame of features. For now I'm going to try to build something that includes:

* Position player performance data (~3 numbers)
* Position player position
* Team salary data
* Team performance for position (previous year)
* Team value lost for position (from previous year, using FAs)

We'll try doing it in stages.

## Task 1: Grab Batting/Pitching data and filter it by only free agents

We'll do it in 5 stages:

1. Pull batting/pitching data and shorten its columns to just the ones I want
2. Pull the "people" data to get the first/last names for batting data
3. Join batting and people to get all the data JUST for our desired years
4. Pull the "free_agents" data
5. Join "batting" and new free_agents/people to filter batting by only free agents
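
The five stages above can be sketched with toy frames. This is only an illustration: the real logic lives in `buildFeatureMatrix`, whose internals are not shown here, so the column choices, the join keys (first name, last name, and year), and all values below are assumptions.

```python
import pandas as pd

# Toy stand-ins for the real tables; every value here is made up for illustration.
batting = pd.DataFrame({
    "playerID": ["aaa01", "aaa01", "bbb01", "ccc01"],
    "yearID": [2005, 2006, 2006, 2006],
    "OBP": [0.310, 0.345, 0.360, 0.290],
})
people = pd.DataFrame({
    "playerID": ["aaa01", "bbb01", "ccc01"],
    "nameFirst": ["Al", "Bo", "Cy"],
    "nameLast": ["Able", "Baker", "Clark"],
})
free_agents = pd.DataFrame({
    "nameFirst": ["Al", "Cy"],
    "nameLast": ["Able", "Clark"],
    "Year": [2006, 2006],
})

# Stages 1-3: trim batting to the desired years, then attach names from people.
batting_recent = batting[batting["yearID"] >= 2006]
batting_named = batting_recent.merge(people, on="playerID", how="inner")

# Stages 4-5: an inner join against free agents keeps only free-agent seasons.
fa_batting = free_agents.merge(
    batting_named,
    left_on=["nameFirst", "nameLast", "Year"],
    right_on=["nameFirst", "nameLast", "yearID"],
    how="inner",
)
print(fa_batting[["playerID", "yearID", "OBP"]])
```

Joining on names is fragile when two players share a name, so a real pipeline may need an extra key or a manual check.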

In [1]:
import pandas as pd
import buildFeatureMatrix as bfm
engine = bfm.db_connect()

## Functions 1 + 2: create batting and pitching tables

In [2]:
# Demonstration of batting data creation
batting_df = bfm.createBattingTable(engine)
print(batting_df.shape)
print(batting_df[batting_df.playerID == 'pujolal01'])

(18072, 8)
        playerID  yearID         G       OBP       SLG        HR       RBI  \
13139  pujolal01    2004  2.069087  1.336778  1.943523  4.956622  3.615371   
13140  pujolal01    2005  2.189724  1.362913  1.529887  4.658596  3.554547   
13141  pujolal01    2006  1.802831  1.322061  1.937764  5.258251  3.979439   
13142  pujolal01    2007  2.149787  1.308964  1.405410  3.663164  2.949472   
13143  pujolal01    2008  2.002254  1.479137  1.758056  4.364865  3.570410   
13144  pujolal01    2009  2.216923  1.464446  2.100880  5.475587  4.209536   
13145  pujolal01    2010  2.222585  1.280402  1.612431  5.248734  3.834918   
13146  pujolal01    2011  2.043980  1.008255  1.555847  4.733167  3.286667   
13147  pujolal01    2012  2.159915  1.032390  1.540237  3.479242  3.453992   
13148  pujolal01    2013  0.988854  0.984588  1.236602  1.957101  1.994369   
13149  pujolal01    2014  2.289193  0.939020  1.400173  3.911971  3.806011   
13150  pujolal01    2015  2.242632  0.761801  1.19515

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[k1] = value[k2]


In [3]:
# Test it out
pitching_df = bfm.createPitchingTable(engine)
print(pitching_df.info())
print(pitching_df[pitching_df.playerID == 'kershcl01'])

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9408 entries, 0 to 9415
Data columns (total 9 columns):
playerID    9408 non-null object
yearID      9408 non-null int64
ERA         9408 non-null float64
WHIP        9408 non-null float64
K_9         9408 non-null float64
HR_9        9408 non-null float64
IPouts      9408 non-null float64
W           9408 non-null float64
SV          9408 non-null float64
dtypes: float64(7), int64(1), object(1)
memory usage: 735.0+ KB
None
       playerID  yearID       ERA      WHIP       K_9      HR_9    IPouts  \
4353  kershcl01    2008 -0.259815 -0.096262  0.667277 -0.225475  0.667086   
4354  kershcl01    2009 -0.407683 -0.467358  1.151170 -0.673071  1.764525   
4355  kershcl01    2010 -0.476203 -0.540464  0.770777 -0.432598  2.113243   
4356  kershcl01    2011 -0.752325 -1.019083  0.959936 -0.459621  2.593349   
4357  kershcl01    2012 -0.441450 -0.617387  0.565174 -0.446909  2.648908   
4358  kershcl01    2013 -0.529750 -0.779394  0.457158 -0.415

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[k1] = value[k2]


### Function 3: Pull the People and Free Agents and use it to join to Pitching/Batting

Note: I tried to do the join directly with SQL and it got mad, so I'm going to do it here instead

In [4]:
all_batting = bfm.addFilterFreeAgents(batting_df, engine)
only_pos = all_batting[~all_batting.Position.isin(['SP', 'RP'])]
only_pos.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 873 entries, 1 to 1715
Data columns (total 17 columns):
Age            873 non-null int64
Destination    871 non-null object
WAR_3          873 non-null float64
nameFirst      873 non-null object
nameLast       873 non-null object
Year           873 non-null int64
Dollars        566 non-null float64
Length         873 non-null int64
Position       873 non-null object
playerID       873 non-null object
yearID         873 non-null int64
G              873 non-null float64
OBP            873 non-null float64
SLG            873 non-null float64
HR             873 non-null float64
RBI            873 non-null float64
SB             873 non-null float64
dtypes: float64(8), int64(4), object(5)
memory usage: 122.8+ KB


In [5]:
pitching_fa = bfm.addFilterFreeAgents(pitching_df, engine)
pitching_fa.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 866 entries, 0 to 865
Data columns (total 18 columns):
Age            866 non-null int64
Destination    865 non-null object
WAR_3          866 non-null float64
nameFirst      866 non-null object
nameLast       866 non-null object
Year           866 non-null int64
Dollars        546 non-null float64
Length         866 non-null int64
Position       866 non-null object
playerID       866 non-null object
yearID         866 non-null int64
ERA            866 non-null float64
WHIP           866 non-null float64
K_9            866 non-null float64
HR_9           866 non-null float64
IPouts         866 non-null float64
W              866 non-null float64
SV             866 non-null float64
dtypes: float64(9), int64(4), object(5)
memory usage: 128.5+ KB


## Task 3: Add Team WAR for position

Basically I see this as:

1. Load the Team WAR data
2. Change column names to be more concise
3. Compute the Median and Min team WAR for that position + yearID 
4. Join it to the existing data frame using yearID + position
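
A minimal sketch of steps 3 and 4, assuming the team WAR data is in long format with one row per team, position, and year. The frame layout and all values are hypothetical; `bfm.allPositionWAR` may work differently.

```python
import pandas as pd

# Hypothetical long-format team WAR table: one row per team, position, and year.
team_war = pd.DataFrame({
    "teamID": ["AAA", "BBB", "CCC", "AAA", "BBB", "CCC"],
    "yearID": [2006] * 6,
    "Position": ["1B", "1B", "1B", "SP", "SP", "SP"],
    "WAR": [2.0, -1.0, 3.5, 11.0, 0.0, 14.0],
})

# Step 3: median and minimum team WAR for each position and year.
war_summary = (
    team_war.groupby(["yearID", "Position"])["WAR"]
    .agg(Med_WAR="median", Min_WAR="min")
    .reset_index()
)

# Step 4: attach the summary to each free agent by yearID + Position.
free_agents = pd.DataFrame({
    "playerID": ["xxx01", "yyy01"],
    "yearID": [2006, 2006],
    "Position": ["SP", "1B"],
})
fa_with_war = free_agents.merge(war_summary, on=["yearID", "Position"], how="left")
print(fa_with_war)
```

A left join keeps every free agent even when no team WAR exists for that position and year; those rows simply get missing values.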

In [6]:
pitching_war = bfm.allPositionWAR(pitching_fa, engine)
pitching_war.head()

Unnamed: 0,Age,Destination,WAR_3,nameFirst,nameLast,Year,Dollars,Length,Position,playerID,yearID,ERA,WHIP,K_9,HR_9,IPouts,W,SV,Med_WAR,Min_WAR
0,30,Texas Rangers,3.3,Bruce,Chen,2006,,0,SP,chenbr01,2006,0.408029,0.319668,-0.05311,0.811405,0.490456,-0.85334,-0.276611,11.0,0.0
1,36,Baltimore Orioles,4.0,Steve,Trachsel,2006,3100000.0,1,SP,trachst01,2006,-0.121799,0.045397,-0.906338,-0.042322,1.550248,2.49292,-0.276611,11.0,0.0
2,33,Atlanta Braves,2.2,Mark,Redman,2006,,0,SP,redmama01,2006,0.078066,0.025983,-0.997965,-0.195781,1.587715,1.600584,-0.276611,11.0,0.0
3,44,San Diego Padres,6.8,David,Wells,2006,3000000.0,1,SP,wellsda01,2006,-0.271471,-0.236734,-0.81471,-0.004753,0.115782,-0.184088,-0.276611,11.0,0.0
4,29,Pittsburgh Pirates,0.8,Tony,Armas,2006,3500000.0,1,SP,armasto02,2006,-0.107655,-0.13704,-0.348744,-0.138884,1.378968,1.154416,-0.276611,11.0,0.0


In [7]:
position_war = bfm.allPositionWAR(only_pos, engine)
position_war.head()

Unnamed: 0,Age,Destination,WAR_3,nameFirst,nameLast,Year,Dollars,Length,Position,playerID,yearID,G,OBP,SLG,HR,RBI,SB,Med_WAR,Min_WAR
0,34,Oakland Athletics,3.1,Todd,Walker,2006,,0,1B,walketo04,2006,1.699431,0.902271,0.73579,0.54903,1.167855,-0.035056,2.2,-0.9
1,32,New York Mets,0.1,Fernando,Tatis,2006,,0,1B,tatisfe01,2006,-0.57538,0.658687,1.184309,-0.275084,-0.338351,-0.34276,2.2,-0.9
2,33,New York Yankees,1.6,Miguel,Cairo,2006,750000.0,1,1B,cairomi01,2006,0.520666,0.478012,0.390534,-0.510545,0.398017,1.657315,2.2,-0.9
3,30,Atlanta Braves,1.0,Craig,Wilson,2006,2000000.0,1,1B,wilsocr03,2006,1.43059,0.666686,0.945016,1.490874,1.03397,-0.188908,2.2,-0.9
4,37,Pittsburgh Pirates,2.4,Jose,Hernandez,2006,,0,1B,hernajo01,2006,0.603386,0.684375,0.575661,-0.157353,0.029833,-0.34276,2.2,-0.9


## Task ???: Pull Team data

Pull the teams table to help translate team names to teamIDs.

In [9]:
# Change column names to team abbreviations using Team Data
teams = bfm.pullFullTable('teams', engine)
# Pull just a handful of these columns (W, G, teamID, name, yearID)
teams_short = teams[['yearID', 'teamID', 'name', 'W', 'G']]
teams_short['teamID'].value_counts()

BAL    14
HOU    14
BOS    14
TEX    14
SLN    14
SDN    14
PIT    14
NYN    14
ARI    14
SFN    14
CIN    14
PHI    14
MIL    14
COL    14
CHN    14
TBA    14
LAN    14
OAK    14
MIN    14
NYA    14
ATL    14
TOR    14
KCA    14
CLE    14
CHA    14
SEA    14
DET    14
WAS    13
LAA    13
FLO     8
MIA     6
ANA     1
MON     1
Name: teamID, dtype: int64

In [10]:
# Convert altered names/teamID 
name_change = {'Anaheim Angels': 'Los Angeles Angels', 
               'Los Angeles Angels of Anaheim' : 'Los Angeles Angels',
                   'Tampa Bay Devil Rays' : 'Tampa Bay Rays',
                   'Montreal Expos' : 'Washington Nationals', 
                   'Florida Marlins' : 'Miami Marlins'
                  }
    
origin_change = {'ANA': 'LAA', 'TBD':'TBR', 'MON':'WAS', 'FLO':'MIA'}

teams_short['name'] = teams_short['name'].replace(name_change)
teams_short['teamID'] = teams_short['teamID'].replace(origin_change)

# Change W/G to W_Pct
    
teams_short['W_Pct'] = teams_short['W'].divide(teams_short.G)

teams_short = teams_short.drop(['W','G'], axis = 1)

teams_short.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  # This is added back by InteractiveShellApp.init_path()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if sys.path[0] == '':
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


Unnamed: 0,yearID,teamID,name,W_Pct
0,2004,LAA,Los Angeles Angels,0.567901
1,2004,ARI,Arizona Diamondbacks,0.314815
2,2004,ATL,Atlanta Braves,0.592593
3,2004,BAL,Baltimore Orioles,0.481481
4,2004,BOS,Boston Red Sox,0.604938
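
The repeated "copy of a slice" warnings above come from assigning into a frame that was created by selecting columns from another frame. One common way to avoid them, sketched here on toy data, is to take an explicit `.copy()` before modifying. The `park` column and the win totals are placeholders chosen for illustration.

```python
import pandas as pd

# Toy teams table; values are placeholders for illustration.
teams = pd.DataFrame({
    "yearID": [2004, 2004],
    "teamID": ["ANA", "BOS"],
    "name": ["Anaheim Angels", "Boston Red Sox"],
    "W": [92, 98],
    "G": [162, 162],
    "park": ["x", "y"],
})

# .copy() makes teams_short an independent frame, so later assignments are unambiguous.
teams_short = teams[["yearID", "teamID", "name", "W", "G"]].copy()
teams_short["name"] = teams_short["name"].replace({"Anaheim Angels": "Los Angeles Angels"})
teams_short["teamID"] = teams_short["teamID"].replace({"ANA": "LAA"})

# Convert wins and games into a winning percentage, then drop the raw columns.
teams_short["W_Pct"] = teams_short["W"] / teams_short["G"]
teams_short = teams_short.drop(columns=["W", "G"])
print(teams_short)
```

The original `teams` frame is left untouched, which also makes it safe to re-run the cell.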


In [11]:
teams_short.name.value_counts()

Philadelphia Phillies    14
Oakland Athletics        14
Chicago White Sox        14
Los Angeles Angels       14
Washington Nationals     14
Toronto Blue Jays        14
Los Angeles Dodgers      14
Boston Red Sox           14
San Francisco Giants     14
Texas Rangers            14
New York Yankees         14
New York Mets            14
Minnesota Twins          14
St. Louis Cardinals      14
Baltimore Orioles        14
Seattle Mariners         14
Atlanta Braves           14
Detroit Tigers           14
Miami Marlins            14
Cleveland Indians        14
Chicago Cubs             14
Milwaukee Brewers        14
Pittsburgh Pirates       14
San Diego Padres         14
Colorado Rockies         14
Cincinnati Reds          14
Arizona Diamondbacks     14
Tampa Bay Rays           14
Houston Astros           14
Kansas City Royals       14
Name: name, dtype: int64

### Next task: Use these data to change the free_agents data a bit

We need to convert the Destination values from free_agents into teamIDs, and this is the place to do it!

*This will also remove the FAs without teams*

In [12]:
# Check destination data
fa_bat_pos.Destination.value_counts()

NameError: name 'fa_bat_pos' is not defined

In [51]:
# Fix the inconsistent Angels naming
fa_bat_pos['Destination'] = fa_bat_pos['Destination'].replace({'Los Angeles Angels of Anaheim' :
                                                               'Los Angeles Angels'})
fa_bat_pos.Destination.value_counts()

Los Angeles Dodgers              91
New York Mets                    80
New York Yankees                 77
Washington Nationals             76
Chicago Cubs                     73
Baltimore Orioles                69
Boston Red Sox                   67
Kansas City Royals               66
Toronto Blue Jays                64
Colorado Rockies                 63
Houston Astros                   60
Texas Rangers                    59
San Diego Padres                 58
Seattle Mariners                 56
Cleveland Indians                54
Philadelphia Phillies            53
Cincinnati Reds                  53
San Francisco Giants             53
Milwaukee Brewers                52
Chicago White Sox                51
Pittsburgh Pirates               50
Atlanta Braves                   50
St. Louis Cardinals              47
Tampa Bay Rays                   45
Arizona Diamondbacks             45
Detroit Tigers                   42
Minnesota Twins                  41
Miami Marlins               

In [52]:
# Do a join to the sub-team DF
team_translate = teams_short[['teamID', 'name']].drop_duplicates()


fa_bat_team_war_teamID = pd.merge(fa_bat_pos, team_translate, how = 'left',
                          left_on = ['Destination'], right_on = ['name'])

# Substitute the Destination Column with the info from teamID and drop teamID
fa_bat_team_war_teamID['Destination'] = fa_bat_team_war_teamID['teamID']
fa_bat_team_war_teamID = fa_bat_team_war_teamID.drop(['teamID'], axis = 1)
fa_bat_team_war_teamID.drop_duplicates().head(10)

Unnamed: 0.1,Unnamed: 0,Age,Destination,Origin,WAR_3,nameFirst,nameLast,Dollars,Length,Name,Position_x,playerID,yearID,G,OBP,SLG,HR,RBI,Position_y,name
0,0,30,TEX,BAL,3.3,Bruce,Chen,,0,Bruce Chen,SP,chenbr01,2006,-0.327218,4.52043,3.387034,-0.510545,-0.60612,P,Texas Rangers
1,1,34,OAK,SDP,3.1,Todd,Walker,,0,Todd Walker,1B,walketo04,2006,1.699431,0.902271,0.73579,0.54903,1.167855,2B,Oakland Athletics
2,2,33,NYN,NYM,1.8,Ricky,Ledee,,0,Ricky Ledee,LF,ledeeri01,2006,0.293184,0.261325,0.432791,-0.275084,-0.304879,OF,New York Mets
3,3,33,LAN,TBR,-1.1,Tomas,Perez,,0,Tomas Perez,2B,perezto03,2006,0.892907,0.161575,0.242895,-0.275084,-0.070581,3B,Los Angeles Dodgers
4,4,32,WAS,STL,9.6,Ronnie,Belliard,,0,Ronnie Belliard,2B,belliro01,2006,1.885552,0.711915,0.755101,1.019952,1.636452,2B,Washington Nationals
5,5,32,NYN,BAL,0.1,Fernando,Tatis,,0,Fernando Tatis,1B,tatisfe01,2006,-0.57538,0.658687,1.184309,-0.275084,-0.338351,DH,New York Mets
6,6,36,BAL,NYM,4.0,Steve,Trachsel,3100000.0,1,Steve Trachsel,SP,trachst01,2006,-0.534019,0.128894,-0.049217,-0.392814,-0.539178,P,Baltimore Orioles
7,7,33,ATL,KCR,2.2,Mark,Redman,,0,Mark Redman,SP,redmama01,2006,-0.554699,1.71189,-1.018416,-0.510545,-0.60612,P,Atlanta Braves
8,8,41,NYN,CHW,-0.4,Sandy,Alomar,,0,Sandy Alomar Jr.,C,alomasa02,2006,-0.203138,0.543736,0.654023,-0.392814,-0.037109,C,New York Mets
9,9,34,CIN,CHW,3.3,Dustin,Hermanson,,0,Dustin Hermanson,RP,hermadu01,2006,-1.030342,-1.096651,-1.018416,-0.510545,-0.60612,P,Cincinnati Reds
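
The name-to-teamID translation can also be sketched with a lookup Series instead of a merge. This is an alternative under assumptions, not the notebook's method: the frames are toy data, and `nobody01` is a hypothetical unsigned player. Dropping rows with no match is what removes free agents without teams.

```python
import pandas as pd

# Toy translation table and free-agent frame.
team_translate = pd.DataFrame({
    "teamID": ["TEX", "OAK", "NYN"],
    "name": ["Texas Rangers", "Oakland Athletics", "New York Mets"],
})
free_agents = pd.DataFrame({
    "playerID": ["chenbr01", "walketo04", "nobody01"],  # nobody01 is hypothetical
    "Destination": ["Texas Rangers", "Oakland Athletics", None],
})

# Build a name -> teamID lookup and apply it to the Destination column.
name_to_id = team_translate.set_index("name")["teamID"]
free_agents["Destination"] = free_agents["Destination"].map(name_to_id)

# Unmatched or missing destinations become NaN; drop them to keep signed players only.
signed = free_agents.dropna(subset=["Destination"])
print(signed)
```

Note that a left merge on its own keeps the unmatched rows, so an explicit `dropna` (or an inner join) is needed if they should be removed.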


# Final Task: Save the data

For now, save a test set

# Experiments with Standardizing



## Task 5: Use payroll data to cluster teams

Now I need to use payroll data to create clusters of teams. So I'll:

1. Load the payroll data
2. Standardize it for each year
3. Run it through clustering
4. Use cluster labels to create a translation

In [78]:
# Load payroll data
payrolls = bfm.pullFullTable('payrolls', engine)
payrolls_2006 = payrolls[payrolls.Year >= 2006]
payrolls_2006.set_index('Year', inplace=True)
payrolls_2006.drop(['index'], axis = 1, inplace=True)
payrolls_2006

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """


Year,Arizona Diamondbacks,Atlanta Braves,Baltimore Orioles,Boston Red Sox,Chicago Cubs,Chicago White Sox,Cincinnati Reds,Cleveland Indians,Colorado Rockies,Detroit Tigers,...,Philadelphia Phillies,Pittsburgh Pirates,San Diego Padres,San Francisco Giants,Seattle Mariners,St. Louis Cardinals,Tampa Bay Rays,Texas Rangers,Toronto Blue Jays,Washington Nationals
2006,59.68,90.16,72.59,120.1,94.42,102.75,60.91,56.03,41.23,82.61,...,88.27,46.72,69.9,90.06,87.96,88.89,35.42,68.23,71.92,63.14
2007,52.07,87.29,93.17,143.03,99.67,108.67,68.52,61.67,54.04,94.8,...,89.43,38.54,58.11,90.22,106.46,90.29,24.12,68.32,81.94,36.95
2008,66.2,102.37,67.2,133.39,118.35,121.19,74.12,78.97,68.66,137.69,...,98.27,48.69,73.68,76.59,117.67,99.62,43.82,67.71,97.79,54.96
2009,73.52,96.73,67.1,121.75,134.81,96.07,73.56,81.58,75.2,115.09,...,113.0,48.69,43.73,82.62,98.9,77.61,63.31,68.18,80.54,60.33
2010,60.72,84.42,81.61,162.75,146.86,108.27,72.39,61.2,84.23,122.86,...,141.93,34.94,37.8,97.83,98.38,93.54,71.92,55.25,62.69,61.43
2011,53.64,87.0,85.3,161.76,125.05,127.79,75.95,49.19,88.15,105.7,...,172.98,45.05,45.87,118.2,86.52,105.43,41.05,92.3,62.57,63.86
2012,74.28,83.31,81.43,173.19,88.2,96.92,82.2,78.43,78.07,132.3,...,174.54,63.43,55.24,81.98,117.62,110.3,64.17,120.51,75.49,81.34
2013,90.16,89.29,91.79,158.97,104.15,124.07,110.57,82.52,75.45,149.05,...,159.58,66.29,71.69,142.18,84.3,116.7,57.03,127.2,118.24,112.43
2014,112.69,110.9,107.41,162.82,89.01,91.16,112.39,82.53,95.83,162.23,...,180.05,78.11,90.09,154.19,92.08,111.02,77.06,136.04,132.63,134.7
2015,65.77,89.62,118.86,168.69,117.16,110.71,117.73,87.75,98.26,172.79,...,133.05,85.89,126.62,166.5,123.23,120.3,74.85,144.82,116.42,174.51


In [79]:
# Get the totals by year
payrolls_2006['Total'] = payrolls_2006.sum(axis = 1)
payrolls_2006['Inflation_Factor'] = payrolls_2006['Total'].divide(payrolls_2006['Total'].min())
inflation = payrolls_2006[['Inflation_Factor', 'Total']]
payrolls_2006

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Year,Arizona Diamondbacks,Atlanta Braves,Baltimore Orioles,Boston Red Sox,Chicago Cubs,Chicago White Sox,Cincinnati Reds,Cleveland Indians,Colorado Rockies,Detroit Tigers,...,San Diego Padres,San Francisco Giants,Seattle Mariners,St. Louis Cardinals,Tampa Bay Rays,Texas Rangers,Toronto Blue Jays,Washington Nationals,Total,Inflation_Factor
2006,59.68,90.16,72.59,120.1,94.42,102.75,60.91,56.03,41.23,82.61,...,69.9,90.06,87.96,88.89,35.42,68.23,71.92,63.14,2326.7,1.0
2007,52.07,87.29,93.17,143.03,99.67,108.67,68.52,61.67,54.04,94.8,...,58.11,90.22,106.46,90.29,24.12,68.32,81.94,36.95,2476.69,1.064465
2008,66.2,102.37,67.2,133.39,118.35,121.19,74.12,78.97,68.66,137.69,...,73.68,76.59,117.67,99.62,43.82,67.71,97.79,54.96,2686.45,1.154618
2009,73.52,96.73,67.1,121.75,134.81,96.07,73.56,81.58,75.2,115.09,...,43.73,82.62,98.9,77.61,63.31,68.18,80.54,60.33,2655.4,1.141273
2010,60.72,84.42,81.61,162.75,146.86,108.27,72.39,61.2,84.23,122.86,...,37.8,97.83,98.38,93.54,71.92,55.25,62.69,61.43,2730.6,1.173594
2011,53.64,87.0,85.3,161.76,125.05,127.79,75.95,49.19,88.15,105.7,...,45.87,118.2,86.52,105.43,41.05,92.3,62.57,63.86,2786.17,1.197477
2012,74.28,83.31,81.43,173.19,88.2,96.92,82.2,78.43,78.07,132.3,...,55.24,81.98,117.62,110.3,64.17,120.51,75.49,81.34,2940.65,1.263872
2013,90.16,89.29,91.79,158.97,104.15,124.07,110.57,82.52,75.45,149.05,...,71.69,142.18,84.3,116.7,57.03,127.2,118.24,112.43,3187.59,1.370005
2014,112.69,110.9,107.41,162.82,89.01,91.16,112.39,82.53,95.83,162.23,...,90.09,154.19,92.08,111.02,77.06,136.04,132.63,134.7,3453.95,1.484484
2015,65.77,89.62,118.86,168.69,117.16,110.71,117.73,87.75,98.26,172.79,...,126.62,166.5,123.23,120.3,74.85,144.82,116.42,174.51,3658.27,1.5723


In [80]:
# Estimate the mean year-over-year growth of the inflation factor and use it to extrapolate 2017 from 2016
per_year = inflation.loc[2006:2016, 'Inflation_Factor'].pct_change().mean()
inflation.loc[2017] = inflation.loc[2016] * (1 + per_year)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [86]:
# The 2017 figure is going to be wrong; can we forecast using a single number as "percent increase"??
inflation = inflation.reset_index()
inflation

Unnamed: 0,Year,Inflation_Factor,Total
0,2006,1.0,2326.7
1,2007,1.064465,2476.69
2,2008,1.154618,2686.45
3,2009,1.141273,2655.4
4,2010,1.173594,2730.6
5,2011,1.197477,2786.17
6,2012,1.263872,2940.65
7,2013,1.370005,3187.59
8,2014,1.484484,3453.95
9,2015,1.5723,3658.27


In [82]:
# Test: Try and get payroll sums adjusted by inflation...should be the same!
inflation['Total'].divide(inflation['Inflation_Factor'], axis=0)

Year
2006    2326.7
2007    2326.7
2008    2326.7
2009    2326.7
2010    2326.7
2011    2326.7
2012    2326.7
2013    2326.7
2014    2326.7
2015    2326.7
2016    2326.7
2017    2326.7
dtype: float64
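
The inflation arithmetic can be reproduced from the yearly totals alone. The sketch below uses only the 2006 to 2015 totals shown in the table above; the one-year extrapolation mirrors the mean-growth idea and is an illustration, not a validated forecast.

```python
import pandas as pd

# League-wide payroll totals for 2006-2015, copied from the table above.
totals = pd.Series(
    [2326.7, 2476.69, 2686.45, 2655.4, 2730.6,
     2786.17, 2940.65, 3187.59, 3453.95, 3658.27],
    index=range(2006, 2016),
    name="Total",
)

# Inflation factor: each year's total relative to the smallest total (2006 here).
inflation_factor = totals / totals.min()

# Sanity check: deflating each total should collapse it back to the 2006 total.
deflated = totals / inflation_factor

# Mean year-over-year growth, used to extrapolate one year past the data.
per_year = inflation_factor.pct_change().mean()
next_factor = inflation_factor.iloc[-1] * (1 + per_year)

print(inflation_factor.round(6))
print(deflated.round(1).unique())
```

Dividing by the minimum total only equals "dividing by 2006" because 2006 happens to be the lowest year; dividing by `totals.loc[2006]` would state that intent directly.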

In [88]:
import pandas as pd
test = pd.merge(pitch_year, inflation, on = 'Year')
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 864 entries, 0 to 863
Data columns (total 21 columns):
Age                 864 non-null int64
Destination         863 non-null object
WAR_3               864 non-null float64
nameFirst           864 non-null object
nameLast            864 non-null object
Year                864 non-null int64
Dollars             544 non-null float64
Length              864 non-null int64
Position            864 non-null object
playerID            864 non-null object
ERA                 864 non-null float64
WHIP                864 non-null float64
K_9                 864 non-null float64
HR_9                864 non-null float64
IPouts              864 non-null float64
W                   864 non-null float64
SV                  864 non-null float64
Med_WAR             864 non-null float64
Min_WAR             864 non-null float64
Inflation_Factor    864 non-null float64
Total               864 non-null float64
dtypes: float64(13), int64(3), object(5)
memo

In [90]:
test['Dollars_2006'] = test['Dollars'].divide(test['Inflation_Factor'])
test.head()

Unnamed: 0,Age,Destination,WAR_3,nameFirst,nameLast,Year,Dollars,Length,Position,playerID,...,K_9,HR_9,IPouts,W,SV,Med_WAR,Min_WAR,Inflation_Factor,Total,Dollars_2006
0,30,Texas Rangers,3.3,Bruce,Chen,2006,,0,SP,chenbr01,...,-0.05311,0.811405,0.490456,-0.85334,-0.276611,11.0,0.0,1.0,2326.7,
1,36,Baltimore Orioles,4.0,Steve,Trachsel,2006,3100000.0,1,SP,trachst01,...,-0.906338,-0.042322,1.550248,2.49292,-0.276611,11.0,0.0,1.0,2326.7,3100000.0
2,33,Atlanta Braves,2.2,Mark,Redman,2006,,0,SP,redmama01,...,-0.997965,-0.195781,1.587715,1.600584,-0.276611,11.0,0.0,1.0,2326.7,
3,44,San Diego Padres,6.8,David,Wells,2006,3000000.0,1,SP,wellsda01,...,-0.81471,-0.004753,0.115782,-0.184088,-0.276611,11.0,0.0,1.0,2326.7,3000000.0
4,29,Pittsburgh Pirates,0.8,Tony,Armas,2006,3500000.0,1,SP,armasto02,...,-0.348744,-0.138884,1.378968,1.154416,-0.276611,11.0,0.0,1.0,2326.7,3500000.0
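
The per-player adjustment is a join on `Year` followed by a division. A small sketch with toy contracts: the `hypoth01` and `hypoth02` rows are hypothetical, and the factors are taken from the inflation table above.

```python
import pandas as pd

# Toy contract data; the hypoth* players and their salaries are made up.
contracts = pd.DataFrame({
    "playerID": ["trachst01", "hypoth01", "hypoth02"],
    "Year": [2006, 2007, 2007],
    "Dollars": [3100000.0, 2000000.0, None],
})
inflation = pd.DataFrame({
    "Year": [2006, 2007],
    "Inflation_Factor": [1.0, 1.064465],
})

# Attach each year's factor, then express every contract in 2006 dollars.
merged = contracts.merge(inflation, on="Year", how="left")
merged["Dollars_2006"] = merged["Dollars"] / merged["Inflation_Factor"]
print(merged)
```

Rows with missing `Dollars` stay missing after the division, which matches the empty `Dollars_2006` cells in the output above.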


In [4]:
from scipy.stats import zscore

# Transpose the data (teams as rows, years as columns) and standardize each year
payrolls_transposed = payrolls.transpose()
payrolls_transposed_standard = payrolls_transposed.transform(zscore)
payrolls_transposed_standard

Year,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
index,-2.401578,-2.040984,-2.266215,-2.287018,-2.325516,-2.160705,-1.81343,-1.820954,-2.006667,-2.017893,-1.959364,-2.111651,-1.933544,-1.843838,-2.106917,-1.871592,-2.10554,-2.339795,-2.182169,-2.345991
Arizona Diamondbacks,-0.608206,1.060551,1.011678,0.826592,1.414762,0.40088,0.082364,-0.245014,-0.466033,-0.796956,-0.528953,-0.351682,-0.701674,-0.874399,-0.544957,-0.278682,0.016595,-1.216173,-0.576313,-0.76915
Atlanta Braves,1.260197,1.270688,1.22119,1.079845,1.06087,1.267836,0.688733,0.44227,0.442607,0.201452,0.391651,0.301692,-0.102427,-0.078619,-0.310974,-0.29712,-0.022691,-0.666688,-0.851853,0.255478
Baltimore Orioles,1.928714,1.080671,1.238902,0.411318,-0.187403,0.171949,-0.457423,0.084814,-0.081173,0.368137,-0.503501,-0.532408,-0.173477,-0.119172,-0.359688,-0.244136,-0.099289,0.006979,0.4048,0.790544
Boston Red Sox,0.774953,1.12091,1.155529,1.7514,1.624826,1.054822,1.792086,1.497548,1.33515,1.781555,1.181177,1.006017,1.878122,1.704729,2.017973,1.179649,1.11684,1.155022,1.438473,1.06775
Chicago Cubs,0.635346,0.3899,0.331304,0.041849,0.387908,0.374803,0.700028,0.458505,0.569602,0.552397,0.798376,1.373663,1.476348,0.829037,-0.184266,0.017816,-0.503129,-0.032188,0.793025,0.975421
Chicago White Sox,-0.135878,-0.988063,-1.00655,0.083868,-0.317605,-0.602554,-0.053474,0.120987,0.817928,0.807527,0.870661,0.283112,0.500612,0.894398,0.041684,0.439993,-0.455941,-0.180791,-0.231092,-0.35455
Cincinnati Reds,-1.048554,-0.201613,-0.442379,-0.553996,-0.771798,-0.319777,-0.606043,-0.257547,-0.429365,-0.330635,-0.327372,-0.350556,-0.406602,-0.342209,-0.339736,0.153879,0.010011,-0.019056,-0.685613,-0.952708
Cleveland Indians,1.228831,1.216142,0.952496,1.107101,0.509783,-0.786444,-0.971647,-0.838306,-0.574843,-0.524817,-0.203929,-0.124788,-0.689538,-0.980551,-0.437424,-0.440601,-0.645351,-0.709771,-0.589254,-0.213859
Colorado Rockies,0.515419,0.346085,0.4177,0.307594,-0.325175,-0.054949,-0.046341,-0.648612,-1.016047,-0.741111,-0.466341,-0.304389,-0.107232,-0.051187,-0.446752,-0.59044,-0.353445,-0.467629,-0.28604,-0.532173
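
To make the standardization concrete: `scipy.stats.zscore` subtracts each column's mean and divides by its population standard deviation (`ddof=0`). The sketch below checks that against a manual pandas computation on a toy payroll table with made-up team names and values.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Toy payroll table: years as rows, teams as columns (values are made up).
payrolls = pd.DataFrame(
    {"Team A": [60.0, 52.0], "Team B": [90.0, 87.0], "Team C": [120.0, 143.0]},
    index=pd.Index([2006, 2007], name="Year"),
)

# Transpose so each column is one year, then standardize within each year.
by_team = payrolls.transpose()
standardized = pd.DataFrame(
    zscore(by_team.to_numpy(), axis=0),
    index=by_team.index,
    columns=by_team.columns,
)

# Manual equivalent: subtract the yearly mean, divide by the population std.
manual = (by_team - by_team.mean()) / by_team.std(ddof=0)
print(standardized.round(3))
```

Every row present at this point feeds into the yearly mean and standard deviation, so a leftover non-team row such as the `index` row visible in the output above would probably be worth dropping first.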


In [None]:
# Cluster them via hierarchical clustering

# Perform the necessary imports
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Calculate the linkage: mergings
mergings = linkage(payrolls_transposed_standard.values, method = 'complete')

# Plot the dendrogram, using team names as labels
dendrogram(mergings,
           labels=list(payrolls_transposed_standard.index),
           leaf_rotation=90,
           leaf_font_size=6,
)
plt.savefig('./Payroll_Clustering')
plt.show()

In [None]:
# Select clusters using maximum height of 6
from scipy.cluster.hierarchy import fcluster

# Use fcluster to extract labels: labels
labels = fcluster(mergings, 6, criterion = 'distance')


label_data = pd.DataFrame({'label' : labels, 'Team' : list(payrolls_transposed_standard.index)})

label_data.sort_values('label')

# Join this to the existing data
label_data['Team'] = label_data['Team'].replace({'Los Angeles Angels':
                                                 'Los Angeles Angels of Anaheim'})

labels_as_teamID = pd.merge(label_data, team_translate, 
                            left_on = ['Team'], right_on = ['name'])[['teamID', 'label']]
fa_with_clusters = pd.merge(fa_bat_team_war_teamID, labels_as_teamID,
                            left_on = ['Destination'], right_on = ['teamID'])
fa_with_clusters.head()
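
A self-contained sketch of the cut-by-height step on toy data, to show what `fcluster` returns. The team names and profiles are invented, and the cut height of 1.0 is chosen for this toy data only; it is not a recommendation for the payroll dendrogram.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

# Four hypothetical teams with standardized payroll profiles over three years.
teams = ["Low A", "Low B", "High A", "High B"]
profiles = np.array([
    [-1.0, -1.1, -0.9],
    [-0.9, -1.0, -1.0],
    [1.0, 1.1, 0.9],
    [0.9, 1.0, 1.0],
])

# Complete linkage merges clusters by their farthest pair of members.
mergings = linkage(profiles, method="complete")

# Cut the tree at height 1.0: clusters that merge above that height stay separate.
labels = fcluster(mergings, t=1.0, criterion="distance")

label_data = pd.DataFrame({"Team": teams, "label": labels})
print(label_data.sort_values("label"))
```

The label numbers themselves are arbitrary; only which teams share a label matters when joining the clusters back onto the free-agent data.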

## DEPRECATED!!! Task 2: Add positions

This will require data from our new "free_agents_batting" and "appearances". Basically:

* Pull appearances data
* Collapse "appearances" data into positions
* Join it with free_agents_batting data

In [None]:
# Bring in Appearances data to add positions
appearances = bfm.pullFullTable('appearances', engine)
    
print(appearances.head())

# Subset to only positional data and group by playerID/yearID
appearances_compact = appearances.drop(['index', 'teamID','lgID', 'G_batting', 
                                        'G_defense','G_all','GS', 'G_ph', 'G_pr'], 
                                       axis = 1).groupby(['playerID','yearID'], 
                                                         as_index = False).sum()

# Check data
print(appearances_compact.head())

# Figure out primary position by melting, then grouping and finding the max value
appearances_melt = pd.melt(appearances_compact, id_vars= ['playerID', 'yearID'],
                           value_name = 'Games', var_name = 'Position')
print(appearances_melt[appearances_melt['playerID'] == 'clontbr01'])

# Grab the index for the maximum games
primary_idx = appearances_melt.groupby(['playerID','yearID'])['Games'].idxmax()

# Use it to screen out the proper rows
primary_position = appearances_melt.loc[primary_idx]

# Turn the "Position" Column into the right contents by pulling just the position and capitalizing
primary_position['Position'] = primary_position.Position.str.split("_").str.get(1).str.upper()
print(primary_position[primary_position['playerID'] == 'clontbr01'])

# Do the join on the 6202 x 13 free_agents_batting
# Join based on playerID/yearID
fa_bat_pos = pd.merge(free_agents_batting, primary_position, 
                      on = ['playerID', 'yearID']).drop(['Games'], axis = 1)
print(fa_bat_pos.head(10))
print(fa_bat_pos.shape)