# Fangraphs to CBS Players

For fantasy baseball, I like to lean on the numbers. That means merging data from my league with advanced projections and other data. Here, we need to combine CBS Fantasy Baseball data with data from Fangraphs.com, my preferred source for advanced baseball projections.

However, CBS doesn't provide a unique key for players. (Thankfully, Fangraphs does.) The closest thing we have is players' names. It's a start, but:

- The CBS column that contains the name also contains a position and team, which must be stripped.
- There are multiple players who share names.
- The data sets use nicknames differently.
- There are some capitalization and hyphenation differences.
- The data sets use suffixes ('Jr.') differently.
- The data sets don't overlap completely.
- (At least everything is ASCII.)

Let's try to build a mapping from CBS players to Fangraphs players so we can assign each CBS player a Fangraphs id and merge the data sets easily.
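To make a few of the problems above concrete, here is a minimal sketch of a loose name key. `normalize_name` and `SUFFIXES` are hypothetical and are not used in the rest of this notebook, which relies on explicit dictionaries instead. The sketch handles case, periods, hyphens, and trailing suffixes, but it cannot resolve nicknames.

```python
# Hypothetical helper, not part of the notebook's pipeline.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}  # assumed list, not exhaustive

def normalize_name(name):
    """Reduce a player name to a loose comparison key."""
    # Lowercase, drop periods, and treat hyphens as spaces.
    cleaned = name.lower().replace(".", "").replace("-", " ")
    tokens = cleaned.split()
    # Drop a trailing suffix such as 'jr' or 'ii', but never shrink below two tokens.
    if len(tokens) > 2 and tokens[-1] in SUFFIXES:
        tokens = tokens[:-1]
    return " ".join(tokens)

# Hyphenation, capitalization, and punctuation differences collapse to one key.
print(normalize_name("Kwang Hyun Kim") == normalize_name("Kwang-hyun Kim"))
print(normalize_name("DJ Stewart") == normalize_name("D.J. Stewart"))
# Nicknames still differ, which is why a hand-built dictionary is needed.
print(normalize_name("Nick Castellanos") == normalize_name("Nicholas Castellanos"))
```

A key like this would only reduce the number of manual fixes; shared names and nicknames still need the hand-built mappings.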

In [204]:
import pandas as pd
import numpy as np
import re
import random

In [151]:
cbs_hit = pd.read_csv('cbs_hitter_2020_projections.csv',skiprows=1, skipfooter=1,engine='python')
#cbs_hit = cbs_hit[cbs_hit['AB'] > 0]
cbs_pit = pd.read_csv('cbs_pitcher_2020_projections.csv',skiprows=1, skipfooter=1,engine='python')
fg_hit = pd.read_csv('steamer_hitters_2020.csv')
fg_pit = pd.read_csv('steamer_pitchers_2020.csv')

z_hit = pd.read_csv('zips_hitters_2020.csv')
z_pit = pd.read_csv('zips_pitchers_2020.csv')

In [152]:

### Previously, we merged several Fangraphs data sets to get a larger one.
### Steamer is very large, so I hoped it would suffice on its own.
### It doesn't: Rafael Dolis is the first omission, so we merge in ZiPS as well.

z_hit = pd.read_csv('zips_hitters_2020.csv')
z_pit = pd.read_csv('zips_pitchers_2020.csv')

#fg_hit['playerid'] = fg_hit['playerid'].apply(lambda x: str(x)) #data may be int or str...
#fg_pit['playerid'] = fg_pit['playerid'].apply(lambda x: str(x))

fg_hit = pd.concat([fg_hit, z_hit], join='inner', ignore_index=True).drop_duplicates(subset=['playerid'])
fg_pit = pd.concat([fg_pit, z_pit], join='inner', ignore_index=True).drop_duplicates(subset=['playerid'])

In [153]:
#clean fg names
#fg_hit['Name'] = fg_hit['Name'].apply(lambda x: re.sub(r'Jr.\w*','',x).strip())
#fg_pit['Name'] = fg_pit['Name'].apply(lambda x: x.strip(' Jr.'))
fg_hit[(fg_hit['ADP'] < 999) & (fg_hit['ADP'] >= 601) ]
cbs_pit.head()

Unnamed: 0,Avail,Player,INNs,APP,GS,QS,CG,W,L,S,BS,K,BB,H,ERA,WHIP,Rank,Unnamed: 17
0,Walk-In Closets,Gerrit Cole SP | NYY,80,13,13,11,0,7,3,0,0,99,27,66,3.25,1.16,7,
1,Union State Connectors,Jacob deGrom SP | NYM,73,12,12,10,0,5,2,0,0,82,18,60,2.59,1.07,13,
2,Screaming Prairie Camels,Max Scherzer SP | WAS,74,12,12,9,0,6,3,0,0,89,21,64,3.27,1.14,14,
3,Johnny and the Rockers,Justin Verlander SP | HOU,78,13,13,10,1,7,3,0,0,98,21,63,3.32,1.07,17,
4,Screaming Prairie Camels,Shane Bieber SP | CLE,66,11,11,8,1,5,3,0,0,76,14,62,3.53,1.15,19,


In [154]:
### I'm not doing this right now either

#cbs_hit = cbs_hit[['Player','Avail']].copy()
#cbs_pit = cbs_pit[['Player','Avail']].copy()
cbs_hit.iloc[1000:1010]

Unnamed: 0,Avail,Player,AB,R,H,1B,2B,3B,HR,RBI,BB,K,SB,CS,AVG,OBP,SLG,Rank,Unnamed: 18
1000,FA,Bo Naylor C | CLE,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,9999,
1001,FA,Kevin Franklin 3B | CIN,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,9999,
1002,FA,Ronnie Dawson CF | HOU,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,9999,
1003,FA,Aldemar Burgos CF | SD,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,9999,
1004,FA,Tyler Payne C | CHC,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,9999,
1005,FA,Gerardo Parra RF | WAS,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,9999,
1006,FA,Paul Hoenecke 3B | LAD,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,9999,
1007,FA,Corban Joseph 2B | CHC,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,9999,
1008,FA,Cole Sturgeon CF | BOS,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,9999,
1009,FA,Logan Taylor 3B | SEA,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,9999,


# The Steps

1. Extract names from CBS data.
1. Create a dictionary that maps Fangraphs names to playerids.
1. Look up each CBS name in that dictionary and record the names that raise a KeyError.
1. Create a dictionary that maps the failing CBS names to their Fangraphs names.
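The steps above can be sketched on toy data with plain dictionaries. The three `Player` strings and the two ids come from the outputs in this notebook, but the toy `fg_name_to_playerid` dictionary is deliberately tiny and says nothing about who is actually in the Fangraphs data. `extract_name` is a hypothetical stand-in for the lambda used in step 1.

```python
# Toy stand-ins for the real data sets (assumption: only two Fangraphs entries).
cbs_players = ["Mike Trout CF | LAA", "Ronald Acuna CF | ATL", "Bo Naylor C | CLE"]
fg_name_to_playerid = {"Mike Trout": "10155", "Ronald Acuna Jr.": "18401"}

def extract_name(player):
    # Step 1: drop the trailing position, '|', and team tokens.
    return " ".join(player.strip().split()[:-3])

names = [extract_name(p) for p in cbs_players]

# Steps 2 and 3: look each name up and collect the ones that are missing.
misses = [n for n in names if n not in fg_name_to_playerid]

# Step 4: a hand-built map from failing CBS names to Fangraphs names.
cbs_to_fg = {"Ronald Acuna": "Ronald Acuna Jr."}

# Names with no fix and no match resolve to None.
resolved = {n: fg_name_to_playerid.get(cbs_to_fg.get(n, n)) for n in names}
print(misses)
print(resolved)
```

Names that still resolve to `None` after step 4 are the ones to inspect by hand.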

In [155]:
#step 1: create name and team columns
func = (lambda x: ' '.join(x['Player'].strip().split()[:-3]))
cbs_hit['Name'] = cbs_hit.apply(func, axis = 1)
cbs_pit['Name'] = cbs_pit.apply(func, axis = 1)

func1 = (lambda x: x['Player'].strip().split()[-1])
cbs_hit['Team'] = cbs_hit.apply(func1, axis = 1)
cbs_pit['Team'] = cbs_pit.apply(func1, axis = 1)

cbs_hit.head()

Unnamed: 0,Avail,Player,AB,R,H,1B,2B,3B,HR,RBI,...,K,SB,CS,AVG,OBP,SLG,Rank,Unnamed: 18,Name,Team
0,Springer International,Christian Yelich RF | MIL,211,46,72,41,13,3,15,44,...,46,11,1,0.341,0.431,0.644,1,,Christian Yelich,MIL
1,Springer International,Ronald Acuna CF | ATL,220,45,64,37,10,2,15,37,...,67,14,4,0.291,0.369,0.559,2,,Ronald Acuna,ATL
2,Cackleberry Czars,Cody Bellinger RF | LAD,189,42,58,30,12,2,14,39,...,38,6,1,0.307,0.395,0.614,3,,Cody Bellinger,LAD
3,The Midnight Sillies,Mookie Betts RF | LAD,209,54,64,34,14,2,14,34,...,38,8,1,0.306,0.41,0.593,4,,Mookie Betts,LAD
4,Droitwich Murdercocks,Mike Trout CF | LAA,191,43,57,28,11,1,17,42,...,48,4,1,0.298,0.419,0.633,5,,Mike Trout,LAA


In [156]:
#let's fix some known issues in CBS names.

cbs_hitter_names_dict = {i:i for i in cbs_hit['Name']}
cbs_to_fg_names = { #identified problem cases, among hitters
    'Nick Castellanos':'Nicholas Castellanos',
    'Gio Urshela':'Giovanny Urshela',
    'DJ Stewart': 'D.J. Stewart',
    'Abraham Toro-Hernandez':'Abraham Toro',
    'Michael Taylor': 'Michael A. Taylor',
    'JT Riddle': 'J.T. Riddle',
    #'Bobby Witt': 'Robert Witt', #can't find it, probably not in fangraphs....
    #'Mark Payton',
    #'Taylor Trammell',
    'Nate Lowe':'Nathaniel Lowe',
    'Yu Chang':'Yu-Cheng Chang',
    #'Andrew Vaughn',
    #'Elehuris Montero',
    #'Jose Garcia',
    'Cedric Mullins': 'Cedric Mullins II',
    #'Jordan Weems',
    #"Brian O'Keefe",
    #'Cal Raleigh',
    'Stevie Wilkerson': 'Steve Wilkerson'
}

#juniors are a problem....
juniors = list(fg_hit[fg_hit['Name'].str.endswith('Jr.')]['Name'])
cbs_to_fg_names.update({x.replace(' Jr.','') : x for x in juniors})
cbs_hitter_names_dict.update(cbs_to_fg_names)

In [162]:
# and now for problem names in pitchers...
# by the way, we keep them separate to minimize any errors from duplicate names. 
cbs_pitcher_names_dict = {i:i for i in cbs_pit['Name']}
juniors = list(fg_pit[fg_pit['Name'].str.endswith('Jr.')]['Name'])
cbs_pit_to_fg_names = {
    #'Lance McCullers', # a junior, fixed by the suffix handling below
    'Kwang Hyun Kim': 'Kwang-hyun Kim',
    'Jake Junis': 'Jakob Junis',
    #'Carl Edwards', # jr.
    'J.T. Brubaker':'Johnathan Brubaker',
    #'Nick Lodolo', # not forecast in either set!
    #'Brooks Raley',#not found
    'Cam Hill':'Cameron Hill',
    #'Stephen Woods':'', #jr.
    #'Duane Underwood',#jr.
    'Mike Shawaryn':'Michael Shawaryn'
}
cbs_pit_to_fg_names.update({x.replace(' Jr.','') : x for x in juniors})
cbs_pitcher_names_dict.update(cbs_pit_to_fg_names)

#### Some data inspection

Below, we discover that very few duplicated names also share a team, and that those who do share a "team" are actually unsigned free agents. That is, they are not likely to be relevant in a fantasy context.

Among the duplicated names, very few seem to matter. Among hitters, Jose Martinez (TB, RF) is projected for 160 AB, and Wander Franco (TB, SS) is in the minors but is the top prospect in baseball. Among pitchers, the following players are likely to be relevant:
- Austin Adams RP | SEA 
- David Peterson SP | NYM 
- Tyler Alexander SP | DET 
- Javy Guerra RP | WAS
- Cody Reed RP | CIN 
- Javy Guerra RP | SD
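A minimal sketch of the two checks described above, on toy data: names that are duplicated at all, and names that are duplicated within the same team. The two Javy Guerra rows reflect the list above; the 'Example Player' rows are placeholders invented to illustrate a pair of free agents who share a name.

```python
import pandas as pd

# Toy data: 'Example Player' rows are placeholders, not real players.
toy = pd.DataFrame({
    "Name": ["Javy Guerra", "Javy Guerra", "Example Player", "Example Player", "Mike Trout"],
    "Team": ["WAS", "SD", "FA", "FA", "LAA"],
})

# keep=False marks every member of a duplicated group, not just the repeats.
same_name = toy[toy.duplicated(subset=["Name"], keep=False)]
same_name_and_team = toy[toy.duplicated(subset=["Name", "Team"], keep=False)]

print(same_name)
print(same_name_and_team)
```

Rows in `same_name` but not in `same_name_and_team` could in principle be separated by a (name, team) key; the rest need a manual id.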


In [173]:
#step 2: identify duplicate name entries.
data_sets = [fg_hit, fg_pit, cbs_hit, cbs_pit]
cbs_hit.duplicated(subset='Name').sum() #there are twelve
cbs_pit.duplicated(subset='Name').sum() #there are fifteen
fg_hit.duplicated(subset='Name').sum() #there are 37 (uhg)
fg_pit.duplicated(subset='Name').sum() # 59 of them.

for n, df in enumerate(data_sets):
    df[df.duplicated(subset=['Name'], keep=False)].to_csv('repeated_names_'+str(n)+'.csv')

df = fg_pit
#df[df.duplicated(subset=['Name','Team'], keep=False)]


In [194]:
cbs_pit['Player'] = cbs_pit['Player'].apply(lambda x: x.strip())
cbs_hit['Player'] = cbs_hit['Player'].apply(lambda x: x.strip())

cbs_hit_to_fg_id= {x : np.nan for x in cbs_hit['Player']}
cbs_pit_to_fg_id = {x : np.nan for x in cbs_pit['Player']}
player_to_fg_id = { # ids looked up manually for the duplicated names that matter
    'Austin Adams RP | SEA': '13801',
    'David Peterson SP | NYM' : '20302',
    'Tyler Alexander SP | DET': '17735',
    'Javy Guerra RP | WAS': '7407',
    'Cody Reed RP | CIN'  : '15232',
    'Javy Guerra RP | SD' : '17292',
    'Wander Franco SS | TB' : 'sa3007033',
    'Jose Martinez RF | TB' : '7996'
}

In [171]:
## the team names use different schemes...
cbs_hit['Team'].unique()
#print(fg_hit['Team'].unique())

team_name_to_city_abbreviation = {
    'Angels': 'LAA','Astros':'HOU', 'Dodgers':'LAD',
    'Indians': 'CLE', 'Brewers': 'MIL', 'Athletics':'OAK',
    'Twins': 'MIN', 'White Sox' : 'CHW', 'Nationals': 'WAS',
    'Red Sox': 'BOS', 'Padres':'SD', 'Cubs' : 'CHC',
    'Rockies': 'COL', 'Yankees': 'NYY', 'Braves': 'ATL',
    'Phillies': 'PHI', 'Diamondbacks': 'ARI', 'Cardinals': 'STL',
    'Blue Jays':'TOR', 'Mets': 'NYM', 'Reds': 'CIN',
    'Rays': 'TB', 'Marlins': 'MIA', 'Royals':'KC',
    'Rangers': 'TEX', 'Pirates': 'PIT', 'Mariners': 'SEA',
    'Giants' : 'SF',  'Tigers': 'DET', 'Orioles': 'BAL', np.nan: 'FA'
}
try:
    fg_hit['Team'] = fg_hit.apply(lambda x: team_name_to_city_abbreviation[x['Team']], axis=1)
    fg_pit['Team'] = fg_pit.apply(lambda x: team_name_to_city_abbreviation[x['Team']], axis=1)
except KeyError:
    ##you probably already ran this.
    pass

['LAA' 'HOU' 'LAD' 'CLE' 'MIL' 'OAK' 'MIN' 'CHW' 'WAS' 'BOS' 'SD' 'CHC'
 'COL' 'NYY' 'ATL' 'PHI' 'ARI' 'STL' 'TOR' 'NYM' 'CIN' 'TB' 'MIA' 'KC'
 'TEX' 'PIT' 'SEA' 'SF' 'DET' 'BAL' 'FA']


In [160]:
## variations on this cell help us find problem cases and see how many
## names are not matching in the two sets.

#temp = list(zip(fg_hit['Name'],fg_hit['Team'])) #using name-team pairs as keys sounds good, but CBS doesn't list Free Agents as such
#temp = list(zip(fg_pit['Name'],fg_pit['Team']))

h_names_to_player_id = dict(zip(fg_hit['Name'],fg_hit['playerid']))
p_names_to_player_id = dict(zip(fg_pit['Name'],fg_pit['playerid']))

fails = []
# CBS labels innings pitched 'INNs'. (Who does that? It's IP!)
# Filtering on it limits the 'problem cases' to pitchers who are expected to play.
df = cbs_pit[cbs_pit['INNs'] > 0]
for _, row in df.iterrows():
    if row['Name'] not in p_names_to_player_id:
        fails.append(row['Name'])
print(df.shape)
fails[:20]
#fg_pit[fg_pit['Name'].str.endswith('Jr.')]

(550, 20)


['Lance McCullers',
 'Kwang Hyun Kim',
 'Jake Junis',
 'Carl Edwards',
 'J.T. Brubaker',
 'Nick Lodolo',
 'Cam Hill',
 'Stephen Woods',
 'Duane Underwood',
 'Mike Shawaryn']

## Time to roll...
We can convert a bunch of names to their Fangraphs counterparts using the dicts above. Then we create a playerid column in the CBS DataFrames with default NaN values and build a dictionary from the two Fangraphs DataFrames (Name: playerid) to fill in most ids. Finally, the repeated-names dictionary sets ids directly for the players whose names are ambiguous. It is applied last because each lookup overwrites whatever id is already there. The dictionaries involved are:

```
cbs_hitter_names_dict
cbs_pitcher_names_dict
player_to_fg_id
```
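As a rough illustration of that order of operations, here is a sketch on three rows using `Series.replace` and `Series.map` rather than the row-wise function defined below. The `Player` strings and ids are taken from elsewhere in this notebook; the toy dictionaries (`name_fixes`, `fg_name_to_id`, `manual_ids`) are cut-down stand-ins for the real ones.

```python
import pandas as pd

cbs = pd.DataFrame({"Player": ["Ronald Acuna CF | ATL",
                               "Mike Trout CF | LAA",
                               "Javy Guerra RP | SD"]})
# Extract the name by dropping the last three tokens (position, '|', team).
cbs["Name"] = cbs["Player"].str.split().str[:-3].str.join(" ")

# Cut-down stand-ins for the dictionaries built earlier (assumption: toy sizes).
name_fixes = {"Ronald Acuna": "Ronald Acuna Jr."}
fg_name_to_id = {"Ronald Acuna Jr.": "18401", "Mike Trout": "10155"}
manual_ids = {"Javy Guerra RP | SD": "17292"}

# 1. Rewrite CBS names to Fangraphs spellings; other names are left alone.
cbs["Name"] = cbs["Name"].replace(name_fixes)
# 2. Look up ids by name; unmatched names become NaN.
cbs["playerid"] = cbs["Name"].map(fg_name_to_id)
# 3. Manual ids keyed on the full Player string take precedence over name lookups.
manual = cbs["Player"].map(manual_ids)
cbs["playerid"] = manual.fillna(cbs["playerid"])

print(cbs)
```

Applying the manual ids last matters for shared names: a plain name lookup can only return one id per name, so it would assign both Javy Guerras the same id.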


In [190]:
cbs_hit['playerid'] = np.nan
cbs_pit['playerid'] = np.nan

def apply_dict(row,dictionary,key,value):
    '''
    Use dictionary to map row[key] to row[value], ignoring KeyError exceptions.
    '''
    row[key]   # raise an exception if the row doesn't have the key column
    row[value] # or the value column
    try:
        row[value] = dictionary[row[key]]
    except KeyError: # the dictionary doesn't have a key for row[key]
        pass
    return row

In [195]:
cbs_hit = cbs_hit.apply(apply_dict,args=(cbs_hitter_names_dict,'Name','Name'), axis=1)
fg_name_to_playerid = dict(zip(fg_hit['Name'],fg_hit['playerid']))
cbs_hit = cbs_hit.apply(apply_dict, args=(fg_name_to_playerid,'Name','playerid'), axis=1)
cbs_hit = cbs_hit.apply(apply_dict,args=(player_to_fg_id,'Player','playerid'), axis=1)

cbs_pit = cbs_pit.apply(apply_dict,args=(cbs_pitcher_names_dict,'Name','Name'), axis=1)
fg_name_to_playerid = dict(zip(fg_pit['Name'],fg_pit['playerid']))
cbs_pit = cbs_pit.apply(apply_dict, args=(fg_name_to_playerid,'Name','playerid'), axis=1)
cbs_pit = cbs_pit.apply(apply_dict,args=(player_to_fg_id,'Player','playerid'), axis=1)


In [197]:
df = cbs_hit
df[df['Name'].str.endswith('Jr.')]

Unnamed: 0,Avail,Player,AB,R,H,1B,2B,3B,HR,RBI,...,SB,CS,AVG,OBP,SLG,Rank,Unnamed: 18,Name,Team,playerid
1,Springer International,Ronald Acuna CF | ATL,220,45,64,37,10,2,15,37,...,14,4,0.291,0.369,0.559,2,,Ronald Acuna Jr.,ATL,18401
14,SweepTheLegJohnny,Fernando Tatis SS | SD,211,37,66,40,10,5,11,33,...,9,3,0.313,0.369,0.564,20,,Fernando Tatis Jr.,SD,19709
58,Omak Goat Rodeo,Vladimir Guerrero 3B | TOR,212,28,61,39,13,1,8,35,...,0,0,0.288,0.356,0.472,84,,Vladimir Guerrero Jr.,TOR,19611
100,Cackleberry Czars,Lourdes Gurriel LF | TOR,211,32,57,34,11,1,11,33,...,3,2,0.27,0.325,0.488,166,,Lourdes Gurriel Jr.,TOR,19238
230,FA,Jackie Bradley CF | BOS,197,27,45,24,12,1,8,25,...,4,2,0.228,0.308,0.421,339,,Jackie Bradley Jr.,BOS,12984
248,FA,Steven Souza RF | CHC,138,20,37,21,9,1,6,22,...,4,1,0.268,0.354,0.478,359,,Steven Souza Jr.,CHC,5667
283,FA,Albert Almora CF | CHC,164,23,44,30,8,1,5,18,...,1,1,0.268,0.315,0.421,399,,Albert Almora Jr.,CHC,14109
352,FA,Dwight Smith LF | BAL,123,14,30,18,6,1,5,16,...,1,0,0.244,0.298,0.431,491,,Dwight Smith Jr.,BAL,13473
370,FA,LaMonte Wade CF | MIN,70,12,20,11,4,1,4,11,...,1,0,0.286,0.375,0.543,525,,LaMonte Wade Jr.,MIN,18126
623,FA,Troy Stokes LF | DET,0,0,0,0,0,0,0,0,...,0,0,0.0,0.0,0.0,9999,,Troy Stokes Jr.,DET,sa828871


### Let's turn all this into a class

This will allow some flexibility to use it in other ways. It would be nice to keep working this into something relatively automated.

In [373]:
class NameToFangraphsID:
    """ 
    A tool for mapping name data with Fangraphs playerids. 
    
    The expected use includes having a collection of data without unique baseball playerids
    and adding ids from another collection of data. This can then be used to merge data sets.
    Presumably, a player's name provides a partial column on which to join. This class
    provides some functionality to handle duplication of names and naming variations. 
    
    Attributes
    ----------
    
    fg_data: DataFrame
        A DataFrame that includes unique player ids.
    
    name_data: DataFrame
        A DataFrame that includes data without unique player ids or with
        ids that don't match those in fg_data.
        
    fg_name: string
        The name of the column containing data, such as a player's name, that can be matched
        against similar data in name_data. 
        
    fg_pid: string
        The name of the column containing unique ids.
        
    name_player: string
        The name of a column from which a match to name data can be found. Used for 
        extracting name data.
        
    empty_value: any
        The default value written when no data is provided.
        
    extract_func: function
        A function used to extract name data from other data.
        
    Methods
    -------
    extract_name(name_data)
        Extracts data from a string; used to find string matches with player names, for example.
    
    fg_name_id_map()
        Returns a dictionary that maps names in fg_data to playerids.
    
    transform_suffix(suffix, add=True):
        Adds or removes a suffix from names in name_data.
        
    transform_names(dictionary, name_data_key = None):
        Transforms names in name_data by mapping with a dictionary. Names that don't match a dictionary key
        are ignored.
        
    duplicated_names(in_names = True, as_series = True):
        Returns a Series or list of the names which are duplicated in the data.
                
    duplicated_name_entries(in_names =True):
        Returns a DataFrame of all entries with a duplicate name value.
    
    add_ids_from_dict(dictionary, name_data_key = None):
        Add playerids to name_data from a dictionary that has playerids as values.
        
    add_ids_from_fg_data(fg_on= True, name_data_on= True):
        Adds playerids to name_data using a column from fg_data and name_data as a dictionary key.
        
    get_name_data_dict(name_data_key):
        Returns a dictionary mapping a column from name_data to playerids.

    """
    
    def __init__(self, fg_data, name_data, fg_data_name= 'Name', fg_data_pid= 'playerid', 
                 name_data_col= 'Player', empty_value= np.nan, extract_name_func = 'default'):
        """
        Parameters
        ----------
        fg_data: DataFrame or str
            The data including unique player ids. If string, string name of a csv file. If a DataFrame,
            the data is copied and not a view.
        
        name_data: DataFrame or str
            The data without unique player ids. If string, should name a csv file. If a DataFrame, the data
            is copied and not a view.
            
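        fg_data_name: str
            The name of the column in fg_data that contains names. The extracted names are written
            to a column of the same name in name_data. Default = 'Name'
            
        fg_data_pid: str
            The name of the column in which playerid data is written and found. Default = 'playerid'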
            
        name_data_col: str
            The name of a column in name_data from which a name can be extracted. Default = 'Player'
        
        empty_value: any
            The value to write when no data exists. Default is numpy.nan.
        
        extract_name_func: function or 'default'
            The function used to extract name data from name_data_col. If 'default', uses the 
            default lambda function. (See method docs for extract_name)
        """
        ## FIX ME
        ## Currently, if the name_data includes a column with a name == fg_data_name, this will
        ## overwrite the name_data[fg_data_name].
        
        # pd.read_csv raises a ValueError when handed a DataFrame instead of a path,
        # so fall back to copying the DataFrame.
        try:
            self.fg_data = pd.read_csv(fg_data)
        except ValueError:
            self.fg_data = fg_data.copy()
        try:
            self.name_data = pd.read_csv(name_data)
        except ValueError:
            self.name_data = name_data.copy()
        self.fg_name = fg_data_name
        self.fg_pid = fg_data_pid
        self.name_player = name_data_col
        self.empty_value = empty_value
        self.name_data[self.fg_pid] = self.empty_value
        self.name_data[self.fg_name] = self.empty_value
        if extract_name_func == 'default':
            self.extract_func = (lambda x: ' '.join(x.strip().split()[:-3]))
        else:
            self.extract_func = extract_name_func
        self.name_data[self.fg_name] = self.name_data[self.name_player].apply(self.extract_name)
        
    def extract_name(self, name_data):
        """
        Extracts data from a string; used to find string matches with player names, for example.
        
        The assumed input value is from a CBS Fantasy Baseball File. These have the form of:
            Ichiro Suzuki CF | SEA
        and this method returns everything before the ' CF'.
        
        The end user can assign a function to self.extract_func to replace the default behavior with something
        that suits the data they have. Currently, support is limited to single-argument functions.
        
        Parameters
        ----------
        
        name_data: str
            A string from which a name can be extracted.
        
        Examples
        --------
        
        Example of assigning another function: for a trivial case where 
        name_data is just a name but might include white space, given an instance x:
            >>> x.extract_func = (lambda s: s.strip())
        More complex functions can be defined, but can only accept one argument.
        """
        
        return self.extract_func(name_data)
    
    def fg_name_id_map(self):
        '''
        Returns a dictionary that maps fangraphs names to playerids.
        '''
        return dict(zip(self.fg_data[self.fg_name],self.fg_data[self.fg_pid]))
    
    def transform_suffix(self, suffix, add=True):
        '''
        Adds or removes a suffix from names in name_data. Default is add.
        
        Addition adds the suffix to a name in name_data if it matches an fg_data name
        without the suffix. Removal removes the suffix and any whitespace preceding it.
        
        This transformation applies to duplicated names. 
        
        Parameters
        ----------
        
        suffix: str
            The string to add or remove.
        
        add: bool
            Whether to add the string to the name data or remove it. 
        '''
        if add:
            series = self.fg_data[self.fg_name]
            suffixed = series.where(series.str.endswith(suffix)).dropna()
            suffix_map = { name.replace(suffix,'').strip() : name for name in suffixed }
            ## Possible improvement? It feels a little silly and is probably slow.
            ## Using apply on a whole data frame when a single column is being transformed.
            ## Better would be Series.map with a defaultdict.
            ## Later...
            self.name_data = self.name_data.apply(self._apply_dict,
                                                  args=(suffix_map,
                                                       self.fg_name,
                                                       self.fg_name), axis=1)
        else:
            self.name_data[self.fg_name] = self.name_data[self.fg_name].apply(
                lambda x: x.replace(suffix,'').strip())
            
    def transform_names(self, dictionary, name_data_key = None):
        '''
        Transforms names in name_data with a dictionary. Names that don't match a dictionary key
        are ignored.
        
        Parameters
        ----------
        
        dictionary: dict
            The dictionary that will be used to transform names
            
        name_data_key: immutable
            The name of the column in name_data whose values are looked up in the dictionary. By default
            this is fg_name, the column that holds the extracted names.
        '''
        if not name_data_key:
            name_data_key = self.fg_name
            
        self.name_data = self.name_data.apply(self._apply_dict, args=(dictionary,
                                                                      name_data_key,
                                                                      self.fg_name), axis=1)
                                              
    def duplicated_names(self, in_names = True, as_series = True):
        """
        Returns a series or list of the names which are duplicated in the data.
        
        Parameters
        ----------
        
        in_names: bool
            If true, finds duplicated names in name_data. If false, finds duplicated
            names in fg_data.
            
        as_series: bool
            Determines the type of the return value. A pandas series if True; a list if False.
            
        Returns
        -------
            A series or list containing the names that are repeated. A series will share indices
            with the DataFrame from which it is derived.
        """
        
        if in_names:
            out = self.name_data[self.fg_name]
        else:
            out = self.fg_data[self.fg_name]
        out = out[out.duplicated(keep=False)]
        if as_series:
            return out
        else:
            return list(out)
        
    def duplicated_name_entries(self, in_names =True): 
        """
        Returns a DataFrame of the entries that are duplicated names.
        
        This function can be useful for identifying which names need to be handled
        and for developing functions to handle them.
        
        Parameters
        ----------
        in_names: bool
            The duplicated entries in name_data, if True. The duplicated entries in fg_data
            if False.
             
        Returns
        -------
            A view of the DataFrame entries that have duplicated names.
        """
        if in_names:
            mask = self.name_data.duplicated(subset=self.fg_name, keep = False)
            return self.name_data[mask]
        else:
            mask =  self.fg_data.duplicated(subset=self.fg_name, keep = False)
            return self.fg_data[mask]
        
    
    def add_ids_from_dict(self, dictionary, name_data_key = None):
        '''
        Add playerids to name_data from a dictionary that has playerids as values.
        
        Parameters
        ----------
        dictionary: dict
            A dictionary that maps a column in name_data to playerids
            
        name_data_key: immutable
            The name of a column in name_data; the data elements are passed as keys to dictionary.
        '''
        if not name_data_key:
            name_data_key = self.fg_name
        self.name_data = self.name_data.apply(self._apply_dict,
                                              args = (dictionary,
                                                     name_data_key,
                                                     self.fg_pid), axis=1)
        
    def add_ids_from_fg_data(self, fg_on= True, name_data_on= True):
        '''
        Adds playerids to name_data using a column from fg_data and name_data as a dictionary key.
        
        Constructs a dictionary of {key: playerid} from the fg_data using column named with fg_on. Then
        applies that dictionary by finding keys in name_data from name_data_on to add player ids to name_data. 
        
        Parameters
        ----------
        fg_on: bool or str
            Column in fg_data containing data used as dictionary keys for a mapping. Uses fg_name if True.
            This data should match some data in the column of name_data_on.
        
        name_data_on: bool or str
            Column in name_data containing data used as dictionary keys for a mapping. Uses fg_name if True.
            This data should match some data in the column of fg_on.
            
        
        '''
        if fg_on == True:
            fg_on = self.fg_name
        if name_data_on == True:
            name_data_on = self.fg_name
        dictionary = dict(zip(self.fg_data[fg_on],self.fg_data[self.fg_pid]))
        self.name_data = self.name_data.apply(self._apply_dict, args = (dictionary,
                                                                       name_data_on,
                                                                       self.fg_pid), axis=1)
        
    def get_name_data_dict(self, name_data_key):
        """
        Returns a dictionary mapping a column from name_data to playerids.
        
        Parameters
        ----------
        
        name_data_key: str
            The column name used as keys in name_data.
            
        Returns
        -------
        
        dict: 
            A dictionary containing keys from a column in name_data and values that are playerids. 
        """
        return dict(zip(self.name_data[name_data_key],self.name_data[self.fg_pid]))
                                              
    
    def _apply_dict(self,row,dictionary,key,value):
        '''
        Use dictionary to map row[key] to row[value], ignoring KeyError exceptions.
        '''
        row[key]   # raise an exception if the row doesn't have the key column
        row[value] # or if the row doesn't have the value column
        try:
            row[value] = dictionary[row[key]]
        except KeyError: #the dictionary doesn't have an entry key == row[key]
            pass
        return row
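The comment inside `transform_suffix` suggests replacing the row-wise `apply` with `Series.map`. A minimal sketch of that idea follows; `map_with_fallback` is a hypothetical helper, not part of the class above. It uses `dict.get` with the original value as the default, so unmatched names pass through unchanged, which is the same behavior `_apply_dict` gets by swallowing the KeyError.

```python
import pandas as pd

def map_with_fallback(series, mapping):
    """Hypothetical helper: map values through `mapping`, keeping unmatched values."""
    return series.map(lambda value: mapping.get(value, value))

names = pd.Series(["Ronald Acuna", "Mike Trout"])
suffix_map = {"Ronald Acuna": "Ronald Acuna Jr."}

result = map_with_fallback(names, suffix_map)
print(result.tolist())
```

This touches only the one column being transformed, so it avoids rebuilding every row of the DataFrame. It only covers the case where the key and value columns are the same; writing ids into a separate column would still need the existing value as a fallback.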

In [374]:
cbs_hit = pd.read_csv('cbs_hitter_2020_projections.csv',skiprows=1, skipfooter=1,engine='python')
#cbs_pit = pd.read_csv('cbs_pitcher_2020_projections.csv',skiprows=1, skipfooter=1,engine='python')
fg_hit = pd.read_csv('steamer_hitters_2020.csv')
x = NameToFangraphsID(fg_hit,cbs_hit)

In [375]:
x.duplicated_names()
x.duplicated_name_entries()
x.transform_suffix('Jr.')
x.transform_names(cbs_to_fg_names)
x.add_ids_from_fg_data()
# apply the manual ids last so they override the ambiguous name lookups
x.add_ids_from_dict(player_to_fg_id, name_data_key='Player')
x.get_name_data_dict('Player')

{'Christian Yelich RF | MIL ': '11477',
 'Ronald Acuna CF | ATL': '18401',
 'Cody Bellinger RF | LAD': '15998',
 'Mookie Betts RF | LAD ': '13611',
 'Mike Trout CF | LAA ': '10155',
 'Trea Turner SS | WAS ': '16252',
 'Alex Bregman 3B | HOU ': '17678',
 'Trevor Story SS | COL ': '12564',
 'Francisco Lindor SS | CLE': '12916',
 'Nolan Arenado 3B | COL ': '9777',
 'Freddie Freeman 1B | ATL': '5361',
 'Jose Ramirez 3B | CLE ': '13510',
 'Juan Soto LF | WAS ': '20123',
 'Anthony Rendon 3B | LAA ': '12861',
 'Fernando Tatis SS | SD ': '19709',
 'J.D. Martinez DH | BOS ': '6184',
 'Rafael Devers 3B | BOS ': '17350',
 'Xander Bogaerts SS | BOS ': '12161',
 'Ketel Marte CF | ARI ': '13613',
 'Starling Marte CF | ARI ': '9241',
 'George Springer CF | HOU ': '12856',
 'Bryce Harper RF | PHI ': '11579',
 'Jose Altuve 2B | HOU ': '5417',
 'Charlie Blackmon RF | COL': '7859',
 'Gleyber Torres SS | NYY ': '16997',
 'Pete Alonso 1B | NYM ': '19251',
 'Yordan Alvarez DH | HOU ': '19556',
 'Ozzie Albie

In [345]:
fg_hit.groupby('playerid').agg({'PA': max,'HR': sum})
fg_hit.columns

Index(['Name', 'Team', 'G', 'PA', 'AB', 'H', '2B', '3B', 'HR', 'R', 'RBI',
       'BB', 'SO', 'HBP', 'SB', 'CS', '-1', 'AVG', 'OBP', 'SLG', 'OPS', 'wOBA',
       '-1.1', 'wRC+', 'BsR', 'Fld', '-1.2', 'Off', 'Def', 'WAR', '-1.3',
       'ADP', 'playerid'],
      dtype='object')

In [346]:
temp = fg_hit.copy()
temp['unique'] = fg_hit.index

In [376]:
pd.read_csv(cbs_hit)

ValueError: Invalid file path or buffer object type: <class 'pandas.core.frame.DataFrame'>