# COGS 108 - Data Checkpoint

# Names

- Jake Heinlein
- Nathan Tripp
- Naomi Chin
- Leo Friedman
- Dante Tanjuatco

<a id='research_question'></a>
# Research Question

*Is the combination of an MLB free agent's age and batting performance, measured by batting average (AVG) and on-base plus slugging (OPS), indicative of their yearly salary, and if so, can an algorithm predict a player's future contract based on these factors?*

# Dataset(s)

Dataset 1: 
- Dataset name: MLB Free Agent Contracts 1991-2023
- Link: https://docs.google.com/spreadsheets/d/1bXUPBabVf82y0m2KaZ0F9Fno9xwZ2pmepbFvMBX_TEM/edit#gid=1265430999
- Number of observations: 4995
- Description: Dataset of all MLB free agent contracts signed from 1991 to 2023. Each observation contains relevant variables such as salary, year signed, player name, and player age.

Dataset 2: 
- Dataset name: MLB Batting Stats 1871-2022
- Link to dataset: https://github.com/chadwickbureau/baseballdatabank/blob/master/core/Batting.csv
- Number of observations: 110458
- Description: Dataset of all MLB player batting statistics from 1871 to 2022. Each observation contains variables (hits, walks, doubles, triples, home runs, etc.) that can be used to calculate batting average and on-base plus slugging (OPS).

Dataset 3: 
- Dataset name: MLB Batting Player Information 1871-2022
- Link to the dataset: https://github.com/chadwickbureau/baseballdatabank/blob/master/core/People.csv
- Number of observations: 20369
- Description: Dataset of general information about MLB players from 1871-2022. Each observation contains variables relating to the player such as age, name, and birthday. 

Combining datasets: 
- We plan to combine datasets 2 and 3 by adding the player names from dataset 3 as a column in dataset 2. This is possible because both datasets share the variable `playerID`. We then plan to compare values in dataset 1 with values in dataset 2 by matching on player names, which requires standardizing each player's name so that both sources use the same format.
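
The name standardization described above could look something like the following. This is a minimal sketch rather than the project's final method: the helper `standardize_name`, the column `nameKey`, and the sample rows are all hypothetical, and the rules (strip accents, lowercase, drop punctuation and generational suffixes) are assumptions about the kinds of mismatches that occur between the two sources.

```python
import re
import unicodedata

import pandas as pd


def standardize_name(name):
    """Return a lowercase, accent-free, punctuation-free version of a name."""
    # decompose accented characters, then drop the non-ASCII combining marks
    decomposed = unicodedata.normalize('NFKD', str(name))
    ascii_only = decomposed.encode('ascii', 'ignore').decode('ascii')
    # lowercase and remove everything except letters and spaces
    letters = re.sub(r'[^a-z ]', '', ascii_only.lower())
    # drop a trailing generational suffix (assumed convention)
    letters = re.sub(r'\s+(jr|sr|ii|iii)$', '', letters.strip())
    # collapse repeated whitespace
    return ' '.join(letters.split())


# made-up rows standing in for the contracts and batting dataframes
contracts_demo = pd.DataFrame({
    'playerName': ['José Ejemplo', 'Sam Sample Jr.'],
    'year': [2022, 2000],
    'yearSalary': [1000000, 2000000],
})
batting_demo = pd.DataFrame({
    'playerName': ['Jose Ejemplo', 'Sam Sample'],
    'year': [2022, 2000],
    'AVG': [0.280, 0.271],
})

# build the shared key in both frames
for df in (contracts_demo, batting_demo):
    df['nameKey'] = df['playerName'].apply(standardize_name)

# join on the standardized name and the year
merged = contracts_demo.merge(
    batting_demo.drop(columns='playerName'),
    on=['nameKey', 'year'],
    how='inner',
)
print(merged)
```

Two different players can share a name, so a name-based join like this may need an extra check (for example on age) before the result is trusted.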

# Setup

In [None]:
# import packages
import pandas as pd
import os

In [None]:
# merge all contract CSV files in the contracts directory into one dataframe
directory = 'data/contracts/'
filepaths = [
    os.path.join(directory, filename)
    for filename in sorted(os.listdir(directory))
    if filename.endswith('.csv')  # skip hidden or non-CSV files
]
contracts = pd.concat([pd.read_csv(filepath) for filepath in filepaths], ignore_index=True)
print('contracts shape: ', contracts.shape)

# read people and batting files as dataframes
people = pd.read_csv('data/batting/People.csv')
batting = pd.read_csv('data/batting/Batting.csv')
print('people shape: ', people.shape)
print('batting shape: ', batting.shape)

# Data Cleaning

##### Part 1: Clean contracts dataframe

In [None]:
# drop unnecessary columns
contracts = contracts[['Player','Pos\'n', 'Age', 'Term', 'AAV']]

# drop players that are pitchers or have NaN values
contracts = contracts[~contracts["Pos'n"].str.contains("hp", na=False)]
contracts = contracts.dropna(axis=0)

# drop position column (no longer needed)
contracts = contracts.drop('Pos\'n',axis=1)
contracts

In [None]:
# functions to standardize variables
def salary_to_int(str_in):
    return int(str_in.replace('$','').replace(',',''))

def term_to_year(str_in):
    return int(str(str_in).split('-')[0])
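
The same conversions can also be done with vectorized pandas string methods instead of `apply`. The sketch below is an alternative under stated assumptions: the sample strings are made up, and the formats `'$1,500,000'` and `'2019-21'` are guesses at how the spreadsheet stores AAV and Term, not confirmed from the data.

```python
import pandas as pd

# made-up values in the assumed spreadsheet formats
aav = pd.Series(['$1,500,000', '$750,000'])
term = pd.Series(['2019-21', '2020'])

# strip dollar signs and commas, then convert to integers
aav_int = aav.str.replace(r'[$,]', '', regex=True).astype(int)

# keep only the first year of the term
term_year = term.astype(str).str.split('-').str[0].astype(int)

print(aav_int.tolist())    # [1500000, 750000]
print(term_year.tolist())  # [2019, 2020]
```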

In [None]:
# standardize salary and term variables
contracts['AAV'] = contracts['AAV'].apply(salary_to_int)
contracts['Term'] = contracts['Term'].apply(term_to_year)
contracts

In [None]:
# rename columns for consistency
contracts = contracts.rename(columns={'Player': 'playerName', 'Age': 'playerAge', 'Term': 'year', 'AAV': 'yearSalary'})
contracts.head()

##### Part 2: Clean batting dataframe 

In [None]:
# functions used for calculating batting average and OPS (on-base plus slugging, named OBS in this notebook)
def calc_avg(h, ab):
    return h / ab
    
def calc_obp(h, bb, hbp, ab, sf):
    return (h + bb + hbp) / (ab + bb + sf + hbp)
    
def calc_tb(h, two_b, three_b, hr):
    singles = h - two_b - three_b - hr
    return singles + two_b * 2 + three_b * 3 + hr * 4

def calc_slg(tb, ab):
    return tb / ab

def calc_obs(obp, slg):
    return obp + slg
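
As a sanity check on the formulas above, here is a small worked example using a made-up stat line (the numbers are hypothetical, not taken from the dataset). It recomputes AVG, OBP, SLG, and OPS with plain Python arithmetic, following the same definitions as the helper functions.

```python
# hypothetical season stat line (illustrative values only)
h, ab, bb, hbp, sf = 150, 500, 50, 5, 5
two_b, three_b, hr = 30, 5, 20

avg = h / ab                                      # batting average
obp = (h + bb + hbp) / (ab + bb + sf + hbp)       # on-base percentage
singles = h - two_b - three_b - hr                # hits that were not extra-base hits
tb = singles + 2 * two_b + 3 * three_b + 4 * hr   # total bases
slg = tb / ab                                     # slugging percentage
ops = obp + slg                                   # on-base plus slugging

print(round(avg, 3), round(obp, 3), round(slg, 3), round(ops, 3))
# prints: 0.3 0.366 0.5 0.866
```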

In [None]:
# create batting average column in batting dataframe
avg = calc_avg(h=batting['H'], ab=batting['AB'])
batting['AVG'] = round(avg, 3)

# create OPS (on-base plus slugging) column, stored as 'OBS', in batting dataframe
obp = calc_obp(h=batting['H'], bb=batting['BB'], hbp=batting['HBP'], ab=batting['AB'], sf=batting['SF'])
tb = calc_tb(h=batting['H'], two_b=batting['2B'], three_b=batting['3B'], hr=batting['HR'])
slg = calc_slg(tb, batting['AB'])
obs = calc_obs(obp, slg)
batting['OBS'] = round(obs, 3)
batting.head()

In [None]:
# drop irrelevant columns and rename
batting = batting[['playerID', 'yearID', 'teamID', 'AVG', 'OBS']]
batting.columns = ['playerID', 'year', 'team', 'AVG', 'OBS']

# drop observations with NaN values
batting = batting.dropna(axis=0)
batting.head()

In [None]:
# add name column to people that combines each player's first and last name
people['name'] = people['nameFirst'] + ' ' + people['nameLast']

# drop irrelevant columns
people = people[['playerID','name']]

# merge batting and people to add name column to batting dataframe
batting = batting.merge(people, how='left', on='playerID')

# rename and reorder batting columns for consistency
batting = batting.rename({'name':'playerName'}, axis=1)
batting = batting[['playerID','playerName','team','year','AVG','OBS']]
batting.head()