# COGS 108 - Data Checkpoint

# Names

- Jake Heinlein
- Nathan Tripp
- Naomi Chin
- Leo Friedman
- Dante Tanjuatco

<a id='research_question'></a>
# Research Question

*Is the combination of an MLB free agent's age and batting performance, measured by batting average (AVG) and on-base plus slugging (OPS), indicative of their yearly salary, and if so, can an algorithm predict a player's future contract based on these factors?*

# Dataset(s)

Dataset 1: 
- Dataset name: MLB Free Agent Contracts 1991-2023
- Link: https://docs.google.com/spreadsheets/d/1bXUPBabVf82y0m2KaZ0F9Fno9xwZ2pmepbFvMBX_TEM/edit#gid=1265430999
- Number of observations: 4995
- Description: Dataset of all MLB free agent contracts signed from 1991 to 2023. Each observation contains relevant variables such as salary, year signed, player name, and player age.

Dataset 2: 
- Dataset name: MLB Batting Stats 1871-2022
- Link: https://github.com/chadwickbureau/baseballdatabank/blob/master/core/Batting.csv
- Number of observations: 110458
- Description: Dataset of all MLB player batting statistics from 1871 to 2022. Each observation contains variables (hits, walks, doubles, triples, home runs, etc.) that can be used to calculate batting average and on-base plus slugging (on-base percentage plus slugging percentage).

Dataset 3: 
- Dataset name: MLB Batting Player Information 1871-2022
- Link: https://github.com/chadwickbureau/baseballdatabank/blob/master/core/People.csv
- Number of observations: 20369
- Description: Dataset of general information about MLB players from 1871-2022. Each observation contains variables relating to the player such as age, name, and birthday. 

Combining datasets: 
- We plan to combine datasets 2 and 3 by adding the player names from dataset 3 as a column in dataset 2. This is possible because both datasets share the `playerID` variable. We then plan to match observations in dataset 1 to observations in dataset 2 by player name, which requires standardizing each player's name so that both sources use the same format.
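
The combining plan above can be sketched on toy data. This is a minimal illustration rather than the final pipeline: the `standardize_name` helper, its normalization rules (lowercasing, removing periods and accents), and every row in the toy frames are assumptions made for the example. Only the `playerID`, `nameFirst`, `nameLast`, and `yearID` column names come from the Chadwick files.

```python
import unicodedata
import pandas as pd

def standardize_name(name):
    """Hypothetical helper: lowercase, strip accents and periods, collapse spaces."""
    # decompose accented characters, then drop the non-ASCII combining marks
    ascii_name = unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('ascii')
    return ' '.join(ascii_name.replace('.', '').lower().split())

# toy stand-ins for datasets 3 and 2 (all values are made up)
people_toy = pd.DataFrame({
    'playerID': ['doejo01', 'roeja01'],
    'nameFirst': ['John', 'J.A.'],
    'nameLast': ['Doe', 'Roé'],
})
batting_toy = pd.DataFrame({
    'playerID': ['doejo01', 'roeja01', 'doejo01'],
    'yearID': [2020, 2020, 2021],
    'H': [100, 80, 120],
})

# build one standardized name per player, then attach it to every batting row
full_name = people_toy['nameFirst'] + ' ' + people_toy['nameLast']
people_toy['playerName'] = full_name.apply(standardize_name)
merged_toy = batting_toy.merge(people_toy[['playerID', 'playerName']], how='left', on='playerID')
print(merged_toy)
```

Applying the same helper to the names in dataset 1 would put both sources in one format before they are compared.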

# Setup

In [None]:
# import packages
import pandas as pd
import os

In [None]:
# read every contracts file in the directory and stack them into one dataframe
directory = 'data/contracts/'
filepaths = [directory + filename for filename in os.listdir(directory)]
contracts = pd.concat([pd.read_csv(filepath) for filepath in filepaths], ignore_index=True)
print('contracts shape: ', contracts.shape)

people = pd.read_csv('data/batting/People.csv')
batting = pd.read_csv('data/batting/Batting.csv')
print('people shape: ', people.shape)
print('batting shape: ', batting.shape)

# Data Cleaning

##### Part 1: Cleaning contracts dataframe

In [None]:
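# keep only the columns needed for the analysis
contracts = contracts[['Player', "Pos'n", 'Age', 'Term', 'AAV']]
# drop pitchers (position codes containing "hp") and rows with a missing position
contracts = contracts[~contracts["Pos'n"].str.contains('hp', na=True)]
# drop rows with any remaining missing values
contracts = contracts.dropna(axis=0)
# position is no longer needed once pitchers are removed
contracts = contracts.drop("Pos'n", axis=1)
contracts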
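
The pitcher filter can be checked on a toy frame. This is only a sketch: the position codes below (`rhp`, `lhp`, `of`, `1b`) and the salary strings are assumed examples, not values taken from the contracts data.

```python
import pandas as pd

# toy contracts frame; the position codes and salaries are made-up examples
toy = pd.DataFrame({
    "Pos'n": ['rhp', 'lhp', 'of', '1b', None],
    'AAV': ['$1,000,000', '$2,000,000', '$3,000,000', '$4,000,000', '$5,000,000'],
})

# na=True treats a missing position as a match, so those rows are dropped as well
is_pitcher = toy["Pos'n"].str.contains('hp', na=True)
hitters = toy[~is_pitcher]
print(hitters)
```

Only the `of` and `1b` rows remain, since any code containing `hp` is treated as a pitcher.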

In [None]:
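def salary_to_int(str_in):
    # strip the dollar sign and thousands separators, then convert to an integer
    return int(str_in.replace('$','').replace(',',''))

def term_to_year(str_in):
    # keep only the starting year of the contract term
    return int(str(str_in).split('-')[0])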
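
A quick check of the two helpers on sample inputs. The sample strings are assumptions about how the spreadsheet formats salaries and terms, so the real values may look different.

```python
def salary_to_int(str_in):
    return int(str_in.replace('$', '').replace(',', ''))

def term_to_year(str_in):
    return int(str(str_in).split('-')[0])

# assumed sample formats, not values copied from the dataset
sample_salary = salary_to_int('$1,250,000')
sample_multi_year = term_to_year('2019-21')
sample_single_year = term_to_year(2019)

print(sample_salary)       # 1250000
print(sample_multi_year)   # 2019
print(sample_single_year)  # 2019
```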

In [None]:
contracts['AAV'] = contracts['AAV'].apply(salary_to_int)
contracts['Term'] = contracts['Term'].apply(term_to_year)
contracts

In [None]:
contracts.columns = ['playerName','playerAge','year','yearSalary']
contracts.head()

##### Part 2: Cleaning batting dataframe

In [None]:
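def calc_avg(h, ab):
    # batting average: hits per at-bat
    return h / ab

def calc_obp(h, bb, hbp, ab, sf):
    # on-base percentage: hits, walks, and hit-by-pitches over the OBP denominator
    return (h + bb + hbp) / (ab + bb + sf + hbp)

def calc_tb(h, two_b, three_b, hr):
    # total bases: singles count once, doubles twice, triples three times, home runs four times
    singles = h - two_b - three_b - hr
    return singles + two_b * 2 + three_b * 3 + hr * 4

def calc_slg(tb, ab):
    # slugging percentage: total bases per at-bat
    return tb / ab

def calc_obs(obp, slg):
    # on-base plus slugging (commonly abbreviated OPS)
    return obp + slg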
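
As a sanity check, the formulas can be worked through by hand for a single stat line. The numbers below are invented for illustration and do not describe a real player.

```python
def calc_avg(h, ab):
    return h / ab

def calc_obp(h, bb, hbp, ab, sf):
    return (h + bb + hbp) / (ab + bb + sf + hbp)

def calc_tb(h, two_b, three_b, hr):
    singles = h - two_b - three_b - hr
    return singles + two_b * 2 + three_b * 3 + hr * 4

def calc_slg(tb, ab):
    return tb / ab

def calc_obs(obp, slg):
    return obp + slg

# hypothetical season: 150 hits (30 doubles, 5 triples, 20 home runs) in 500 at-bats,
# plus 50 walks, 5 hit-by-pitches, and 5 sacrifice flies
ex_avg = calc_avg(h=150, ab=500)                       # 150 / 500 = 0.300
ex_obp = calc_obp(h=150, bb=50, hbp=5, ab=500, sf=5)   # 205 / 560
ex_tb = calc_tb(h=150, two_b=30, three_b=5, hr=20)     # 95 + 60 + 15 + 80 = 250
ex_slg = calc_slg(ex_tb, 500)                          # 250 / 500 = 0.500
ex_ops = calc_obs(ex_obp, ex_slg)

print(round(ex_avg, 3), round(ex_obp, 3), round(ex_slg, 3), round(ex_ops, 3))
```

The same functions work element-wise on pandas columns, which is how they are applied to the batting dataframe.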

In [None]:
# batting average, rounded to three decimals
avg = calc_avg(h=batting['H'], ab=batting['AB'])
batting['AVG'] = avg.round(3)

# on-base plus slugging, built from on-base percentage and slugging percentage
obp = calc_obp(h=batting['H'], bb=batting['BB'], hbp=batting['HBP'], ab=batting['AB'], sf=batting['SF'])
tb = calc_tb(h=batting['H'], two_b=batting['2B'], three_b=batting['3B'], hr=batting['HR'])
slg = calc_slg(tb, batting['AB'])
obs = calc_obs(obp, slg)
batting['OBS'] = obs.round(3)
batting.head()

In [None]:
batting = batting[['playerID', 'yearID', 'AVG', 'OBS']]
batting.columns = ['playerID', 'year', 'AVG', 'OBS']
batting = batting.dropna(axis=0)
batting.head()

In [None]:
people['name'] = people['nameFirst'] + ' ' + people['nameLast']
people = people[['playerID','name']]
batting = batting.merge(people, how='left', on='playerID')
batting = batting.rename({'name':'playerName'}, axis=1)
batting = batting[['playerID','playerName','year','AVG','OBS']]
batting.head()
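
One possible way to line up the two cleaned dataframes is sketched below on toy data. It rests on two assumptions that are not settled in this checkpoint: that names already match exactly in both frames, and that a contract should be paired with the batting season before the signing year. The `statYear` column and all rows are made up for illustration.

```python
import pandas as pd

# toy frames using the same column names as the cleaned dataframes (values are made up)
contracts_toy = pd.DataFrame({
    'playerName': ['John Doe', 'Jane Roe'],
    'playerAge': [30, 27],
    'year': [2021, 2021],
    'yearSalary': [5_000_000, 12_000_000],
})
batting_toy = pd.DataFrame({
    'playerID': ['doejo01', 'doejo01', 'roeja01'],
    'playerName': ['John Doe', 'John Doe', 'Jane Roe'],
    'year': [2019, 2020, 2020],
    'AVG': [0.250, 0.275, 0.310],
    'OBS': [0.700, 0.780, 0.910],
})

# assumption: pair each contract with the season before it was signed
contracts_toy['statYear'] = contracts_toy['year'] - 1

combined_toy = contracts_toy.merge(
    batting_toy.rename(columns={'year': 'statYear'}),
    how='inner',
    on=['playerName', 'statYear'],
)
print(combined_toy[['playerName', 'year', 'statYear', 'AVG', 'OBS', 'yearSalary']])
```

An inner join silently drops contracts with no matching batting row, so comparing row counts before and after the merge would show how many names failed to match.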

##### Old Code: Will Delete Eventually

In [None]:
# # create functions

# # remove dollar sign and commas from salary string, convert to integer
# def salary_to_int(str_in):
#     str_in = str_in.replace('$','')
#     str_in = str_in.replace(',','')
#     output = int(str_in)
#     return output

# # turn 'LastName, FirstName' into 'FirstName LastName'
# def standardize_name(str_in):
#     # str_in = str_in.replace(',', '')
#     # str_in = remove_periods(str_in)
#     name_list = str_in.split(', ')
#     output = name_list[1] + ' ' + name_list[0]
#     return output

# # change the term length to just the starting year of the term
# def term_to_year(str_in):
#     term_list = str(str_in).split('-')
#     output = term_list[0]
#     return output

# # remove the periods from names
# def remove_periods(str_in):
#     output = str_in.replace('.','')
#     return output

In [None]:
# # clean MLB contract (salary) data

# # access the directory with MLB contracts
# directory = 'data/contracts'
# # initialize list of contract filenames
# filenames = []
# # concatenate contracts from 2011-2022
# for filename in os.listdir(directory):
#     filenames.append(str(directory) + "/" + filename)
# contracts = pd.concat([pd.read_csv(f) for f in filenames])
# # keep player, position, age, term, and AAV columns
# contracts = contracts[["Player","Pos'n","Age","Term","AAV"]]
# # remove pitcher information
# contracts = contracts[contracts["Pos'n"].str.contains("hp") == False]
# # remove any rows with empty data
# contracts = contracts[contracts["AAV"].isna() == False]
# # remove $1 contracts (0 year contracts)
# contracts = contracts[contracts["AAV"] != "$1"]
# # change AAV to an integer
# contracts["AAV"] = contracts["AAV"].apply(salary_to_int)
# # reset the indices
# contracts = contracts.reset_index(drop = True)
# # rename AAV (Average Annual Value) to Yearly Salary
# contracts = contracts.rename({"AAV":"Yearly Salary"}, axis=1)
# # turn 'LastName, FirstName' into 'FirstName LastName'
# contracts["Player"] = contracts["Player"].apply(standardize_name)
# # remove all columns except for selected ones
# contracts = contracts[["Player","Term","Yearly Salary"]]
# # change the term length to just the starting year of the term
# contracts["Term"] = contracts["Term"].apply(term_to_year)
# # rename 'term_to_year' to 'Year'
# contracts = contracts.rename({"Term":"Year"}, axis = 1)
# # sort the dataframe by year
# contracts = contracts.sort_values(by="Year")
# # cleaned contract dataframe outputs
# print("The first year in the dataset is " + contracts.iloc[1,1])
# print("There are " + str(len(contracts["Player"].unique())) + " unique players in the contract data")
# display(contracts.head())

In [None]:
# # clean MLB batting data

# # access the directory with MLB batting data
# directory = 'data/batting'
# filenames = []
# dataframes = []
# # add the year of corresponding data to each dataset
# for filename in os.listdir(directory):
#     year = filename[:4]
#     filepath = str(directory) + "/" + filename
#     filenames.append(filepath)
#     df = pd.read_csv(filepath)
#     df['Year'] = year
#     dataframes.append(df)
# # concatenate 2010-2022 datasets
# batting = pd.concat(dataframes)
# # sort dataframe by year
# batting = batting.sort_values(by = "Year")
# # remove all columns except for selected ones
# batting = batting[["Player", "Pos", "Age", "AVG", "OPS", "Year"]]
# # only keep the players that also appear in the contracts dataframe
# batting = batting[batting["Player"].isin(contracts["Player"])]
# # batting["Player"] = batting["Player"].apply(remove_periods)
# # print(len(batting["Player"].unique()))
# # print((contracts[contracts["Player"].isin(batting["Player"]) == False ])["Player"])
# # for player in contracts["Player"].unique() :
# #     if((batting["Player"].unique()).contains(player) == False):
# #         print(player)
# # print(contracts["Player"].unique().isin((batting["Player"].unique())))
# # batting = batting[batting.isin(contracts["Player"])]
# batting.head()
# # batting["Player"].value_counts()