# Problem

When updating the data at the end of February, some players aren't getting picked up (e.g. Yu Darvish).

My goal is to:

1. Identify where Darvish is getting lost
2. Fix the problem

I'll load the necessary things:

In [1]:
from dataScraping import * 
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
from sqlalchemy_utils import database_exists, create_database
import psycopg2
import math
from scipy.stats import zscore
from buildFeatureMatrix import *

I'm going to replicate the necessary steps for building out the features and see where Darvish disappears
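The same "is he still here?" check gets repeated after every stage, so here is a minimal sketch of a helper that does it in one pass. `firstMissingStage` is a hypothetical name (it is not part of dataScraping or buildFeatureMatrix), and the toy frames and stage names are made up; they only stand in for the real pipeline outputs, assuming each one has a `nameLast` column:

```python
import pandas as pd

def firstMissingStage(stages, last_name):
    """Return the name of the first stage whose frame has no row for last_name.

    stages: list of (stage_name, DataFrame) pairs, in pipeline order.
    Returns None if the player shows up in every stage.
    """
    for stage_name, df in stages:
        if not (df['nameLast'] == last_name).any():
            return stage_name
    return None

# Toy stand-ins for pipeline outputs (made-up rows, hypothetical stage names)
stage_a = pd.DataFrame({'nameLast': ['Darvish', 'Smith']})
stage_b = pd.DataFrame({'nameLast': ['Smith']})

stages = [('stage_a', stage_a), ('stage_b', stage_b)]
print(firstMissingStage(stages, 'Darvish'))
```

With the real frames, the list would hold each intermediate result in the order it is built.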

First, I'll grab him using free agent data:

In [None]:
# Grab all free agents since 2006
all_years = list(range(2006,2018))
all_fa_data = getAllFAData(all_years)

Now I'll check to make sure he's there for the year 2017

In [None]:
all_fa_data[all_fa_data.nameLast == 'Darvish']

Looks good. Now I'm going to fast forward to loading from the database, since that's the next place I'd expect him to disappear

In [None]:
engine = db_connect()
pitching_df = createPitchingTable(engine)
pitching_fa = addFilterFreeAgents(pitching_df, engine)

In [None]:
pitching_fa[pitching_fa.nameLast == 'Darvish']

He's still there, so I haven't lost him yet. I'll do the next couple steps, though they shouldn't affect much:

In [None]:
pitching_war = allPositionWAR(pitching_fa, engine)
pitching_adjusted = addInflation(pitching_war, engine)

In [None]:
pitching_adjusted[pitching_adjusted.nameLast == 'Darvish']

He's gone from pitching_adjusted, so one of these two steps dropped him. I'll check whether he's already missing from pitching_war:

In [None]:
pitching_war[pitching_war.nameLast == 'Darvish']

Okay, so this is the problem step. Somehow Darvish is getting dropped, so I'll now step through that method:

In [None]:
# Pull the data but drop the index
position_only_war = pullFullTable('position_team_war', engine).drop(['index'], axis = 1)
pitching_war = pullFullTable('pitcher_team_war', engine).drop(['index'], axis = 1)
    
# Put them together
position_war = pd.concat([position_only_war, pitching_war])
    
# Change the Year to "yearID"
position_war['yearID'] = position_war.Year
position_war = position_war.drop(['Year'], axis = 1)
    
# Create a dictionary for converting these to abbreviations
team_dict = {'Angels' : 'LAA', 'Astros' : 'HOU', 'Athletics' : 'OAK', 'Blue Jays' : 'TOR',
             'Braves' : 'ATL', 'Brewers': 'MIL', 'Cardinals' : 'STL', 'Cubs' : 'CHN',
             'Diamondbacks' : 'ARI', 'Dodgers' : 'LAN', 'Giants' : 'SFN', 'Indians' : 'CLE',
             'Mariners' : 'SEA', 'Marlins' : 'MIA', 'Mets' : 'NYN', 'Nationals' : 'WAS',
             'Orioles' : 'BAL', 'Padres' : 'SDN', 'Phillies' : 'PHI', 'Pirates' : 'PIT',
             'Rangers' : 'TEX', 'Rays' : 'TBR', 'Red Sox' : 'BOS', 'Reds' : 'CIN',
             'Rockies' : 'COL', 'Royals' : 'KCR', 'Tigers' : 'DET', 'Twins' : 'MIN',
             'White Sox' : 'CHA', 'Yankees' : 'NYA'}
    
# Alter it to include WAR and Change the actual data frame
team_dict = {key : value + "_WAR" for key, value in team_dict.items()}
position_war = position_war.rename(columns = team_dict)
    
# Create stats across the team columns only (everything except yearID/Position)
team_war = position_war.drop(['yearID', 'Position'], axis = 1)
position_war['Med_WAR'] = team_war.median(axis = 1)
position_war['Min_WAR'] = team_war.min(axis = 1)

# Shrink to only the Year/Position/Median/Min WAR stats
position_war_small = position_war[['yearID', 'Position', 'Med_WAR', 'Min_WAR']]
    

This will get merged with the pitching_fa data, and the merge is done on Position and yearID

--> The problem here is the position: Darvish is listed as "P", not "SP", so he has nothing to match on. This is really annoying
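A minimal sketch of why a position mismatch makes a row vanish, assuming the merge inside the method is an inner join (the pandas default for `merge`), which isn't confirmed here. The frames and WAR values are made up for illustration:

```python
import pandas as pd

# Toy free agents: one pitcher labelled 'P', one labelled 'SP' (made-up rows)
fa = pd.DataFrame({'nameLast': ['Darvish', 'Smith'],
                   'yearID': [2017, 2017],
                   'Position': ['P', 'SP']})

# Toy WAR table that only has 'SP' and 'RP' rows (made-up values)
war = pd.DataFrame({'yearID': [2017, 2017],
                    'Position': ['SP', 'RP'],
                    'Med_WAR': [1.0, 0.5]})

# An inner merge silently drops the 'P' row because no key matches
inner = fa.merge(war, on=['yearID', 'Position'])

# A left merge with indicator=True keeps the row and flags it as 'left_only'
checked = fa.merge(war, on=['yearID', 'Position'], how='left', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched[['nameLast', 'yearID', 'Position']])
```

Running the left merge with `indicator=True` on the real frames would list every free agent whose Position/yearID pair has no WAR row, not just the ones already known to be missing.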

I'd wager this is also the case for Arrieta:

In [None]:
pitching_fa[pitching_fa.nameLast == 'Arrieta']

Alright, so when did THIS crap happen? Apparently ESPN now lists "P" as the position for some pitchers.

I think I'll just convert all "P" to "SP" for now. It's a quick fix. Unfortunate, but that's what I'll have to do

In [None]:
pitching_fa[pitching_fa.Position == 'P']

In [None]:
pitching_fa.info()

Looks like IPouts ~ 0.5 is a good enough cutoff for splitting these into starters and relievers, so I'll use that

In [None]:
# Take an explicit copy so the .loc assignments below don't trigger SettingWithCopyWarning
pitch_subset = pitching_fa[pitching_fa.Position == 'P'].copy()

In [None]:
pitch_subset.loc[pitch_subset.IPouts >= 0.5, 'Position'] = 'SP'
pitch_subset.loc[pitch_subset.IPouts < 0.5, 'Position'] = 'RP'

In [None]:
pitch_subset

That'll work...I'll add that to the code and call it good
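One way the fix could be folded into the pipeline is sketched below. `relabelGenericPitchers` is a hypothetical helper name, and the toy IPouts values are made up (on whatever scale IPouts has in pitching_fa); the 0.5 cutoff is the one chosen above:

```python
import pandas as pd

def relabelGenericPitchers(df, cutoff=0.5):
    """Replace Position 'P' with 'SP' or 'RP' based on IPouts.

    Returns a copy; rows with any other Position are left untouched.
    """
    out = df.copy()
    is_p = out['Position'] == 'P'
    out.loc[is_p & (out['IPouts'] >= cutoff), 'Position'] = 'SP'
    out.loc[is_p & (out['IPouts'] < cutoff), 'Position'] = 'RP'
    return out

# Toy rows: two generic 'P' pitchers and one already labelled 'RP'
toy = pd.DataFrame({'nameLast': ['A', 'B', 'C'],
                    'Position': ['P', 'P', 'RP'],
                    'IPouts': [1.2, -0.3, 0.9]})
fixed = relabelGenericPitchers(toy)
print(fixed)
```

Working on the full frame with a mask, rather than on a filtered subset, means the relabelled rows don't have to be stitched back in afterwards. A "P" row with a missing IPouts value would stay "P" under this sketch, so it would still fall out of the merge.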