# COGS 108 - Data Checkpoint

# Names

- Jake Heinlein
- Nathan Tripp
- Naomi Chin
- Leo Friedman
- Dante Tanjuatco

<a id='research_question'></a>
# Research Question

*Is the combination of an MLB free agent's age and batting performance, measured by batting average (AVG) and on-base plus slugging (OPS), indicative of their yearly salary, and if so, can an algorithm predict a player's future contract based on these factors?*

# Dataset(s)

The first dataset contains MLB player stats by year (https://www.rotowire.com/baseball/stats.php). We use each player's age, batting average, and on-base plus slugging percentage from the 2010-2022 seasons. The yearly datasets for 2010 through 2022 contain 671, 681, 687, 685, 698, 693, 677, 675, 689, 694, 612, 734, and 772 observations, respectively (8,968 in total). We call the concatenation of these datasets 'batting'.

The second dataset contains free agent contracts (**insert link**), which we use to find each player's yearly salary. The datasets contain... observations. We call the concatenation of these datasets 'contracts'.

We look at players who appear in both datasets and compare their batting performance and age with the yearly salary of the new contract they signed as free agents.
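As a rough illustration of how the two datasets could be combined, the sketch below joins toy versions of 'batting' and 'contracts' on player name and year. The rows are made up, and pairing a contract with the batting line from the same year is an assumption; the write-up does not say whether the same season or the preceding one should be used.

```python
import pandas as pd

# Toy stand-ins for the cleaned 'batting' and 'contracts' dataframes.
# All names and values here are made up for illustration.
batting = pd.DataFrame({
    "Player": ["Player A", "Player A", "Player B"],
    "Age": [30, 31, 28],
    "AVG": [0.280, 0.265, 0.301],
    "OPS": [0.810, 0.770, 0.905],
    "Year": ["2015", "2016", "2016"],
})
contracts = pd.DataFrame({
    "Player": ["Player A", "Player C"],
    "Year": ["2016", "2016"],
    "Yearly Salary": [5000000, 1200000],
})

# Inner join: keep only player-seasons that also have a contract
# starting in that year (assumption: same-year pairing).
merged = pd.merge(batting, contracts, on=["Player", "Year"], how="inner")
print(merged)
```

An inner join drops players who appear in only one of the two datasets, which matches the stated goal of keeping players included in both.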

# Setup

In [1]:
# import packages and setup visuals

import pandas as pd
pd.set_option('display.max_rows', None)
import os

In [2]:
# create functions

# remove dollar sign and commas from salary string, convert to integer
def salary_to_int(str_in):
    str_in = str_in.replace('$','')
    str_in = str_in.replace(',','')
    output = int(str_in)
    return output

# turn 'LastName, FirstName' into 'FirstName LastName'
def standardize_name(str_in):
    # str_in = str_in.replace(',', '')
    # str_in = remove_periods(str_in)
    name_list = str_in.split(', ')
    output = name_list[1] + ' ' + name_list[0]
    return output

# change the term length to just the starting year of the term
def term_to_year(str_in):
    term_list = str(str_in).split('-')
    output = term_list[0]
    return output

# remove the periods from names
def remove_periods(str_in):
    output = str_in.replace('.','')
    return output
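The helpers above are applied row by row with `.apply`. The same transformations can also be written with pandas string methods on whole columns, as in the sketch below. The sample frame is hypothetical: the names and salaries follow the formats the helpers expect, and the term values are illustrative only.

```python
import pandas as pd

# Hypothetical sample rows in the formats the helper functions expect.
sample = pd.DataFrame({
    "Player": ["Crawford, Carl", "Blum, Geoff"],
    "Term": ["2011-2017", "2011-2012"],  # illustrative terms, not real data
    "AAV": ["$20,285,714", "$1,350,000"],
})

# Strip '$' and ',' (treated literally, not as regex), then convert to integers.
aav = (sample["AAV"]
       .str.replace("$", "", regex=False)
       .str.replace(",", "", regex=False)
       .astype(int))

# Split 'LastName, FirstName' once and swap the two parts.
parts = sample["Player"].str.split(", ", n=1, expand=True)
names = parts[1] + " " + parts[0]

# Keep only the starting year of each term.
start_year = sample["Term"].astype(str).str.split("-").str[0]

print(aav.tolist(), names.tolist(), start_year.tolist())
```

Either style gives the same result on clean input; the column-wise version avoids a Python-level function call per row.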

# Data Cleaning

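For the contract data, we concatenate the yearly contract files, drop pitchers (rows whose position contains 'hp'), drop rows with a missing AAV and $1 contracts, convert AAV to an integer yearly salary, rewrite names from 'LastName, FirstName' to 'FirstName LastName', and reduce each contract term to its starting year. For the batting data, we tag each yearly file with its year (taken from the filename), concatenate the files, keep the Player, Pos, Age, AVG, OPS, and Year columns, and keep only the players who also appear in the contract data.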

In [13]:
# clean MLB contract (salary) data

# access the directory with MLB contracts
directory = 'data/contracts'
# initialize list of contract filenames
filenames = []
# concatenate contracts from 2011-2022
for filename in os.listdir(directory):
    filenames.append(str(directory) + "/" + filename)
contracts = pd.concat([pd.read_csv(f) for f in filenames])
# keep player, position, age, term, and AAV columns
contracts = contracts[["Player","Pos'n","Age","Term","AAV"]]
# remove pitcher information
contracts = contracts[contracts["Pos'n"].str.contains("hp") == False]
# remove any rows with empty data
contracts = contracts[contracts["AAV"].notna()]
# remove $1 contracts (0 year contracts)
contracts = contracts[contracts["AAV"] != "$1"]
# change AAV to an integer
contracts["AAV"] = contracts["AAV"].apply(salary_to_int)
# reset the indices
contracts = contracts.reset_index(drop = True)
# rename AAV (Average Annual Value) to Yearly Salary
contracts = contracts.rename({"AAV":"Yearly Salary"}, axis=1)
# turn 'LastName, FirstName' into 'FirstName LastName'
contracts["Player"] = contracts["Player"].apply(standardize_name)
# remove all columns except for selected ones
contracts = contracts[["Player","Term","Yearly Salary"]]
# change the term length to just the starting year of the term
contracts["Term"] = contracts["Term"].apply(term_to_year)
# rename 'Term' to 'Year'
contracts = contracts.rename({"Term":"Year"}, axis = 1)
# sort the dataframe by year
contracts = contracts.sort_values(by="Year")
# cleaned contract dataframe outputs
print("The first year in the dataset is " + contracts.iloc[0, 1])
print("There are " + str(len(contracts["Player"].unique())) + " unique players in the contract data")
display(contracts.head())

The first year in the dataset is 2011
There are 339 unique players in the contract data


| | Player | Year | Yearly Salary |
|---|---|---|---|
| 0 | Carl Crawford | 2011 | 20285714 |
| 31 | Geoff Blum | 2011 | 1350000 |
| 32 | Edgar Renteria | 2011 | 2100000 |
| 33 | Manny Ramirez | 2011 | 2000020 |
| 34 | Miguel Cairo | 2011 | 2000000 |


In [4]:
# clean MLB batting data

# access the directory with MLB batting data
directory = 'data/batting'
filenames = []
dataframes = []
# add the year of corresponding data to each dataset
for filename in os.listdir(directory):
    year = filename[:4]
    filepath = str(directory) + "/" + filename
    filenames.append(filepath)
    df = pd.read_csv(filepath)
    df['Year'] = year
    dataframes.append(df)
# concatenate 2010-2022 datasets
batting = pd.concat(dataframes)
# sort dataframe by year
batting = batting.sort_values(by = "Year")
# remove all columns except for selected ones
batting = batting[["Player", "Pos", "Age", "AVG", "OPS", "Year"]]
# only keep the players that also appear in the contracts dataframe
batting = batting[batting["Player"].isin(contracts["Player"])]
# batting["Player"] = batting["Player"].apply(remove_periods)
# print(len(batting["Player"].unique()))
# print((contracts[contracts["Player"].isin(batting["Player"]) == False ])["Player"])
# for player in contracts["Player"].unique() :
#     if((batting["Player"].unique()).contains(player) == False):
#         print(player)
# print(contracts["Player"].unique().isin((batting["Player"].unique())))
# batting = batting[batting.isin(contracts["Player"])]
batting.head()
# batting["Player"].value_counts()

| | Player | Pos | Age | AVG | OPS | Year |
|---|---|---|---|---|---|---|
| 0 | Ichiro Suzuki | OF | 48 | 0.315 | 0.753 | 2010 |
| 448 | Jorge Cantu | 3B | 40 | 0.235 | 0.606 | 2010 |
| 451 | Mark DeRosa | OF | 47 | 0.194 | 0.537 | 2010 |
| 457 | Pat Burrell | OF | 46 | 0.202 | 0.625 | 2010 |
| 458 | Lucas Duda | 1B | 36 | 0.202 | 0.678 | 2010 |
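The commented-out lines in the batting cell appear to be checking which contract players have no match in the batting data. A minimal sketch of that check follows, using toy dataframes with made-up names. The `normalize_name` helper is hypothetical; it assumes that differences in periods (for example in initials) are one source of mismatches, which the commented-out `remove_periods` call suggests but the notebook does not confirm.

```python
import pandas as pd

# Toy stand-ins for the 'contracts' and 'batting' dataframes (names are made up).
contracts = pd.DataFrame({"Player": ["A.J. Example", "Sam Sample", "Pat Placeholder"]})
batting = pd.DataFrame({"Player": ["AJ Example", "Sam Sample"]})

def normalize_name(name):
    # Hypothetical normalization: drop periods and surrounding whitespace.
    return name.replace(".", "").strip()

contract_names = contracts["Player"].map(normalize_name)
batting_names = batting["Player"].map(normalize_name)

# Contract players that have no batting rows even after normalization.
unmatched = sorted(set(contract_names[~contract_names.isin(batting_names)]))
print(unmatched)
```

Listing the unmatched names before filtering makes it easier to tell spelling mismatches apart from players who genuinely have no batting record.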
