# Data Memo

1. [Overview of Dataset](#overview-of-dataset)
1. [Overview of Research Question](#overview-of-research-question)
1. [Proposed Timeline](#proposed-timeline)
1. [Questions or Concerns](#questions-or-concerns)

### Overview of Dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Before I answer any of the required questions, I will first import the data I have decided on so far. Everything comes from Kaggle at the following link: https://www.kaggle.com/datasets/open-source-sports/baseball-databank?resource=download. So far I have decided to use the Master.csv, Salaries.csv, and Batting.csv files. The dataset spans 1871 through 2015, but I will only be working with data from 2000 to 2015. The Master.csv file contains personal information on every player in the dataset. I will use it only to look up player names, since my other datasets currently identify players by player ID alone; having names attached should make it easier to bring in any future data. The Salaries.csv file contains team and league information, as well as each player's salary for a given year. Finally, the Batting.csv file contains various batting statistics for players.

In [44]:
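# Load player records and build a single full-name column.
master = pd.read_csv('Master.csv')
master['nameFull'] = master['nameFirst'] + ' ' + master['nameLast']
names = master[['playerID', 'nameFull']]
print(names.shape)

# Load salaries and keep only the 2000 season onward.
salaries = pd.read_csv('Salaries.csv')
salaries_post_2000 = salaries[salaries['yearID'] >= 2000]
print(salaries_post_2000.shape)

# Inner join: keeps only salary rows whose playerID appears in names.
merged_data = pd.merge(salaries_post_2000, names, on='playerID', how='inner')
print(merged_data.shape)

# Left join: keeps every salary row, with NaN for names that have no match.
merged_data_left = pd.merge(salaries_post_2000, names, on='playerID', how='left')
print(merged_data_left.shape)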

(18846, 2)
(13312, 5)
(13304, 6)
(13312, 6)


After importing my Master.csv file, I first made a full name column because each player's first and last names were stored separately. I then dropped every column except playerID and nameFull, since none of the others are needed and this makes the merge with salaries simpler. As the output above shows, this leaves 18846 rows and 2 columns, with each row representing one player. There are so many rows because the file still covers every player from 1871 through 2015; players with no salary record from 2000 onward will drop out once we merge with the salaries dataset.
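
One thing to keep in mind with the full name column: in pandas, adding strings where one side is missing gives a missing result. The sketch below shows this on invented rows (the `master_demo` frame and its values are hypothetical, not taken from Master.csv), along with one possible way to keep the last name when the first name is absent. Whether Master.csv actually has such rows is something I would still need to check.

```python
import pandas as pd

# Toy stand-in for Master.csv; these rows are invented for illustration.
master_demo = pd.DataFrame({
    'playerID': ['aaa01', 'bbb01', 'ccc01'],
    'nameFirst': ['Ann', None, 'Cal'],
    'nameLast': ['Able', 'Baker', 'Cole'],
})

# Plain concatenation: a missing first name makes the whole full name missing.
plain = master_demo['nameFirst'] + ' ' + master_demo['nameLast']

# Filling missing parts with '' and stripping keeps whatever name is available.
safe = (master_demo['nameFirst'].fillna('') + ' '
        + master_demo['nameLast'].fillna('')).str.strip()

print(plain.isna().tolist())
print(safe.tolist())
```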

I then imported my Salaries.csv file and filtered out all data prior to 2000 to simplify things, leaving 13312 rows and 5 columns of basic data: the player, the team they started the season with and its league, the year, and their salary. This file does not record whether a player was traded during the year, because their salary stays the same. This could become an issue later when I bring in the batting information, because the batting data might have multiple rows for the same player in a given year.
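
To illustrate the multiple-rows concern, here is a minimal sketch of one way it could be handled: summing a player's batting rows within a season so there is one row per player and year. The `batting_demo` rows are invented, and collapsing stints this way is only an option I am considering, not a decision.

```python
import pandas as pd

# Toy batting rows; player 'aaa01' has two stints in 2005 (invented values).
batting_demo = pd.DataFrame({
    'playerID': ['aaa01', 'aaa01', 'bbb01'],
    'yearID': [2005, 2005, 2005],
    'stint': [1, 2, 1],
    'teamID': ['NYA', 'BOS', 'CHN'],
    'AB': [200, 150, 400],
    'H': [60, 45, 100],
})

# Collapse to one row per player-season: count the stints and sum the stats.
season_totals = (
    batting_demo
    .groupby(['playerID', 'yearID'], as_index=False)
    .agg(stints=('stint', 'count'), AB=('AB', 'sum'), H=('H', 'sum'))
)

print(season_totals)
```

Summing works for counting stats such as AB and H; the team column is dropped here because a traded player has more than one team in the season.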

I was then finally able to merge the two datasets, and I did so with both a left and an inner join to see whether any players' names were missing. As the output above shows, the inner join has 8 fewer rows than the left join (13304 versus 13312), so 8 salary rows have no matching entry in the names data frame. Next I will collect the playerID of each of these rows so I can try to find more information about them in the Master.csv file.
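
As an alternative to comparing the shapes of two joins, pandas can label where each row came from through the `indicator=True` argument of `pd.merge`. The sketch below uses invented rows (`salaries_demo` and `names_demo` are hypothetical stand-ins for my real data frames) to show how unmatched player IDs could be pulled out of a single left join.

```python
import pandas as pd

# Toy stand-ins for the salary and name tables; values are invented.
salaries_demo = pd.DataFrame({
    'yearID': [2000, 2000, 2001],
    'playerID': ['aaa01', 'zzz99', 'bbb01'],
    'salary': [500000, 750000, 600000],
})
names_demo = pd.DataFrame({
    'playerID': ['aaa01', 'bbb01'],
    'nameFull': ['Ann Able', 'Bo Baker'],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left join.
checked = pd.merge(salaries_demo, names_demo, on='playerID',
                   how='left', indicator=True)

# Salary rows with no matching name are the ones marked 'left_only'.
unmatched = checked.loc[checked['_merge'] == 'left_only', 'playerID'].unique().tolist()

print(unmatched)
```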

In [None]:

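# Use the left join: the inner join has already dropped rows with no matching playerID.
nan_data = merged_data_left['nameFull'].isnull()

# Collect each affected playerID once, in order of first appearance.
missing_names = merged_data_left.loc[nan_data, 'playerID'].unique().tolist()

print(missing_names)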

In [29]:
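# Load batting statistics and keep only the 2000 season onward.
batting_stats = pd.read_csv('Batting.csv')
batting_stats = batting_stats[batting_stats['yearID'] >= 2000]
print(batting_stats.shape)

# Attach salary and name to each batting row by player and season.
new_merged = pd.merge(batting_stats, merged_data, on=['playerID', 'yearID'], how='inner')

# Columns with the '_x' suffix come from the batting table (the left side of the merge).
final_data = new_merged[['playerID', 'nameFull', 'yearID', 'salary', 'teamID_x', 'stint',
                        'lgID_x', 'G', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'SB', 'CS',
                        'BB', 'SO', 'IBB', 'HBP', 'SH', 'SF', 'GIDP']]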

(22084, 22)
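
One consequence of merging on playerID and yearID is that a player with more than one batting row in a season gets the same salary attached to each row. The sketch below shows this on invented rows (`batting_demo` and `salary_demo` are hypothetical). It also passes `validate='many_to_one'`, a `pd.merge` argument that raises an error if the right-hand table has duplicate keys; that the real salary data has exactly one row per player and year is an assumption I have not verified.

```python
import pandas as pd

# Toy batting rows; player 'aaa01' has two stints in 2005 (invented values).
batting_demo = pd.DataFrame({
    'playerID': ['aaa01', 'aaa01', 'bbb01'],
    'yearID': [2005, 2005, 2005],
    'stint': [1, 2, 1],
    'H': [60, 45, 100],
})

# Toy salary rows, assumed here to have one row per player and year.
salary_demo = pd.DataFrame({
    'playerID': ['aaa01', 'bbb01'],
    'yearID': [2005, 2005],
    'salary': [1000000, 400000],
})

# validate='many_to_one' raises pandas.errors.MergeError if salary keys repeat.
merged_demo = pd.merge(batting_demo, salary_demo, on=['playerID', 'yearID'],
                       how='inner', validate='many_to_one')

# The salary for 'aaa01' now appears once per stint.
print(merged_demo[['playerID', 'stint', 'salary']])
```

Any later total or average of salary would count such a player more than once unless the stints are combined first.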


***

### Overview of Research Question

***

### Proposed Timeline

***

### Questions or Concerns