# Kaggle Dataset Work

The purpose of this notebook is to begin working with the data in Paul Schale's [MLB Pitch Data Kaggle Dataset](https://www.kaggle.com/pschale/mlb-pitch-data-20152018?select=games.csv). I will do this by creating a PostgreSQL database from the available CSV files, with the exception of the ejections.csv file, since I do not need that information for this project.

Specifically, in this notebook I will be combining the original .csv files into new files, which I will then build my SQL database from.

To start, I downloaded the .csv files from the site, then removed the ejections file as discussed above. Since the 2019 files and the 2015-2018 atbats, games, and pitches files use similar layouts, I am going to combine them.

In [1]:
import pandas as pd

In [2]:
pwd

'/Users/patrickbovard/Documents/GitHub/metis_final_project'

### First, loading in the at-bats csv files:

In [5]:
atbats_2019 = pd.read_csv('./Data/Kaggle_Files/archive/2019_atbats.csv')

In [6]:
atbats_2019.head()

Unnamed: 0,inning,top,ab_id,g_id,p_score,batter_id,pitcher_id,stand,p_throws,event,o
0,1.0,1.0,2019000000.0,201900001.0,0.0,594777,571666,L,R,Flyout,1
1,1.0,1.0,2019000000.0,201900001.0,0.0,545361,571666,R,R,Flyout,2
2,1.0,1.0,2019000000.0,201900001.0,0.0,571506,571666,L,R,Groundout,3
3,1.0,0.0,2019000000.0,201900001.0,0.0,543257,502239,L,R,Single,0
4,1.0,0.0,2019000000.0,201900001.0,0.0,656305,502239,R,R,Flyout,1


In [10]:
atbats_2019.shape

(185245, 11)

In [7]:
at_bats_2018_before = pd.read_csv('./Data/Kaggle_Files/archive/atbats.csv')

In [8]:
at_bats_2018_before.head()

Unnamed: 0,ab_id,batter_id,event,g_id,inning,o,p_score,p_throws,pitcher_id,stand,top
0,2015000001,572761,Groundout,201500001,1,1,0,L,452657,L,True
1,2015000002,518792,Double,201500001,1,1,0,L,452657,L,True
2,2015000003,407812,Single,201500001,1,1,0,L,452657,R,True
3,2015000004,425509,Strikeout,201500001,1,2,0,L,452657,R,True
4,2015000005,571431,Strikeout,201500001,1,3,0,L,452657,L,True


In [9]:
at_bats_2018_before.shape

(740389, 11)

Combining (the two files list their columns in a different order, but `pd.concat` aligns them by name):

In [13]:
combined_atbats = pd.concat([atbats_2019, at_bats_2018_before], axis=0)

In [14]:
combined_atbats.head()

Unnamed: 0,inning,top,ab_id,g_id,p_score,batter_id,pitcher_id,stand,p_throws,event,o
0,1.0,1.0,2019000000.0,201900001.0,0.0,594777,571666,L,R,Flyout,1
1,1.0,1.0,2019000000.0,201900001.0,0.0,545361,571666,R,R,Flyout,2
2,1.0,1.0,2019000000.0,201900001.0,0.0,571506,571666,L,R,Groundout,3
3,1.0,0.0,2019000000.0,201900001.0,0.0,543257,502239,L,R,Single,0
4,1.0,0.0,2019000000.0,201900001.0,0.0,656305,502239,R,R,Flyout,1


In [15]:
combined_atbats.shape

(925634, 11)
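
As a hedged aside, the checks done by eye above can also be written as code. The sketch below uses two tiny made-up frames (hypothetical values, not rows from the Kaggle files) to show three things about `pd.concat`: the row count of the result is the sum of the inputs, columns are matched by name rather than by position, and without `ignore_index=True` the original row labels are kept and therefore repeat.

```python
import pandas as pd

# Toy stand-ins for the two at-bats files (hypothetical values, not real rows).
older = pd.DataFrame({"ab_id": [2015000001, 2015000002],
                      "event": ["Groundout", "Double"]})
# Same columns as `older`, but listed in a different order.
newer = pd.DataFrame({"event": ["Flyout"], "ab_id": [2019000001]})

combined = pd.concat([newer, older], axis=0)

# The row count of the result should equal the sum of the inputs.
rows_match = len(combined) == len(newer) + len(older)

# Without ignore_index=True the original row labels are kept, so label 0 repeats.
has_duplicate_labels = combined.index.duplicated().any()

# Columns are aligned by name, so the differing column order does not matter.
combined_events = combined["event"].tolist()
```

Passing `ignore_index=True` to `pd.concat` would give the combined frame a fresh 0-based index instead of repeated labels.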

The row count matches the sum of the two files (185,245 + 740,389 = 925,634), so saving the combined file as a new CSV:

In [16]:
combined_atbats.to_csv('./Data/Kaggle_Files/combined_atbats.csv')

### Next, for games:

In [17]:
games_2019 = pd.read_csv('./Data/Kaggle_Files/archive/2019_games.csv')
games_2019.shape

(2408, 16)

In [27]:
games_2019.tail()

Unnamed: 0,g_id,home_team,away_team,home_final_score,away_final_score,date,umpire_HP,umpire_1B,umpire_2B,umpire_3B,start_time,venue_name,weather,wind,elapsed_time,attendance
2403,201902404.0,kca,min,3.0,4.0,2019-09-28,,,,,,Kauffman Stadium,,,,
2404,201902405.0,tex,nya,9.0,4.0,2019-09-28,,,,,,Globe Life Park in Arlington,,,,
2405,201902406.0,sea,oak,0.0,1.0,2019-09-28,,,,,,T-Mobile Park,,,,
2406,201902407.0,ari,sdn,6.0,5.0,2019-09-28,,,,,,Chase Field,,,,
2407,201902408.0,tor,tba,4.0,1.0,2019-09-28,,,,,,Rogers Centre,,,,


In [18]:
games_2018_before = pd.read_csv('./Data/Kaggle_Files/archive/games.csv')
games_2018_before.shape

(9718, 17)

In [19]:
games_2018_before.head()

Unnamed: 0,attendance,away_final_score,away_team,date,elapsed_time,g_id,home_final_score,home_team,start_time,umpire_1B,umpire_2B,umpire_3B,umpire_HP,venue_name,weather,wind,delay
0,35055,3,sln,2015-04-05,184,201500001,0,chn,7:17 PM,Mark Wegner,Marty Foster,Mike Muchlinski,Mike Winters,Wrigley Field,"44 degrees, clear","7 mph, In from CF",0
1,45909,1,ana,2015-04-06,153,201500002,4,sea,1:12 PM,Ron Kulpa,Brian Knight,Vic Carapazza,Larry Vanover,Safeco Field,"54 degrees, cloudy","1 mph, Varies",0
2,36969,2,atl,2015-04-06,156,201500003,1,mia,4:22 PM,Laz Diaz,Chris Guccione,Cory Blaser,Jeff Nelson,Marlins Park,"80 degrees, partly cloudy","16 mph, In from CF",16
3,31042,6,bal,2015-04-06,181,201500004,2,tba,3:12 PM,Ed Hickox,Paul Nauert,Mike Estabrook,Dana DeMuth,Tropicana Field,"72 degrees, dome","0 mph, None",0
4,45549,8,bos,2015-04-06,181,201500005,0,phi,3:08 PM,Phil Cuzzi,Tony Randazzo,Will Little,Gerry Davis,Citizens Bank Park,"71 degrees, partly cloudy","11 mph, Out to RF",0


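Now, to combine the games. The two files don't match up in column dimensions: the 2019 file has 16 columns and lacks the `delay` column found in the 2015-2018 file. `pd.concat` aligns columns by name, so the 2019 rows simply get NaN for `delay`; `ignore_index=True` only renumbers the rows of the result from 0. The remaining column names are the same, and the main information I want from this table is the date and teams to link with atbats.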

In [21]:
combined_games = pd.concat([games_2018_before, games_2019], ignore_index=True)

In [22]:
combined_games.shape

(12126, 17)

In [26]:
combined_games.tail()

Unnamed: 0,attendance,away_final_score,away_team,date,elapsed_time,g_id,home_final_score,home_team,start_time,umpire_1B,umpire_2B,umpire_3B,umpire_HP,venue_name,weather,wind,delay
12121,,4.0,min,2019-09-28,,201902404.0,3.0,kca,,,,,,Kauffman Stadium,,,
12122,,4.0,nya,2019-09-28,,201902405.0,9.0,tex,,,,,,Globe Life Park in Arlington,,,
12123,,1.0,oak,2019-09-28,,201902406.0,0.0,sea,,,,,,T-Mobile Park,,,
12124,,5.0,sdn,2019-09-28,,201902407.0,6.0,ari,,,,,,Chase Field,,,
12125,,1.0,tba,2019-09-28,,201902408.0,4.0,tor,,,,,,Rogers Centre,,,
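
To illustrate how `pd.concat` treats frames whose columns differ, here is a minimal sketch on made-up data (the values are hypothetical and only mimic the shape of the games files). It assumes the default `join='outer'` behavior: the result has the union of the column names, missing cells become NaN, and `ignore_index=True` affects only the row labels.

```python
import pandas as pd

# Toy frames: `older` has a `delay` column, `newer` does not (hypothetical values).
older = pd.DataFrame({"g_id": [201500001], "home_team": ["chn"], "delay": [0]})
newer = pd.DataFrame({"g_id": [201900001], "home_team": ["sea"]})

combined = pd.concat([older, newer], ignore_index=True)

# The result carries the union of the column names.
combined_columns = list(combined.columns)

# Rows from the frame without `delay` are filled with NaN in that column.
delay_missing = combined["delay"].isna().tolist()

# ignore_index=True renumbers the rows 0..n-1; it plays no part in column alignment.
row_labels = list(combined.index)
```

This is why the 2019 rows in the tail above show empty values in the last column: `delay` does not exist in the 2019 file.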


The row count matches the sum of the two files (9,718 + 2,408 = 12,126), so saving the combined file as a new CSV:

In [28]:
combined_games.to_csv('./Data/Kaggle_Files/combined_games.csv')

### Pitch Data

In [30]:
pitches_2019 = pd.read_csv('./Data/Kaggle_Files/archive/2019_pitches.csv')
pitches_2019.shape

(728790, 40)

In [31]:
pitches_2018_before = pd.read_csv('./Data/Kaggle_Files/archive/pitches.csv')
pitches_2018_before.shape

(2867154, 40)

Checking out the columns to make sure they match up:

In [32]:
pitches_2019.columns

Index(['px', 'pz', 'start_speed', 'end_speed', 'spin_rate', 'spin_dir',
       'break_angle', 'break_length', 'break_y', 'ax', 'ay', 'az', 'sz_bot',
       'sz_top', 'type_confidence', 'vx0', 'vy0', 'vz0', 'x', 'x0', 'y', 'y0',
       'z0', 'pfx_x', 'pfx_z', 'nasty', 'zone', 'code', 'type', 'pitch_type',
       'event_num', 'b_score', 'ab_id', 'b_count', 's_count', 'outs',
       'pitch_num', 'on_1b', 'on_2b', 'on_3b'],
      dtype='object')

In [33]:
pitches_2018_before.columns

Index(['px', 'pz', 'start_speed', 'end_speed', 'spin_rate', 'spin_dir',
       'break_angle', 'break_length', 'break_y', 'ax', 'ay', 'az', 'sz_bot',
       'sz_top', 'type_confidence', 'vx0', 'vy0', 'vz0', 'x', 'x0', 'y', 'y0',
       'z0', 'pfx_x', 'pfx_z', 'nasty', 'zone', 'code', 'type', 'pitch_type',
       'event_num', 'b_score', 'ab_id', 'b_count', 's_count', 'outs',
       'pitch_num', 'on_1b', 'on_2b', 'on_3b'],
      dtype='object')
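
Comparing two 40-name listings by eye is error-prone, so the same check could be done in code. This is a small sketch on placeholder frames (`a`, `b`, and `c` are hypothetical, with only a few of the pitch column names): `Index.equals` tests for identical names in identical order, and set differences list whatever is missing on either side.

```python
import pandas as pd

# Placeholder frames with a handful of column names (hypothetical subsets).
a = pd.DataFrame(columns=["px", "pz", "start_speed"])
b = pd.DataFrame(columns=["px", "pz", "start_speed"])
c = pd.DataFrame(columns=["px", "start_speed", "spin_rate"])

# True only when both frames have the same names in the same order.
same_names_and_order = a.columns.equals(b.columns)

# Names present in one frame but not in the other.
only_in_a = sorted(set(a.columns) - set(c.columns))
only_in_c = sorted(set(c.columns) - set(a.columns))
```

Applied to the real frames, the equivalent call would be `pitches_2019.columns.equals(pitches_2018_before.columns)`.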

They do, so combining:

In [35]:
combined_pitches = pd.concat([pitches_2018_before, pitches_2019], axis=0)

In [36]:
combined_pitches.head()

Unnamed: 0,px,pz,start_speed,end_speed,spin_rate,spin_dir,break_angle,break_length,break_y,ax,...,event_num,b_score,ab_id,b_count,s_count,outs,pitch_num,on_1b,on_2b,on_3b
0,0.416,2.963,92.9,84.1,2305.05,159.235,-25.0,3.2,23.7,7.665,...,3,0.0,2015000000.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,-0.191,2.347,92.8,84.1,2689.93,151.402,-40.7,3.4,23.7,12.043,...,4,0.0,2015000000.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0
2,-0.518,3.284,94.1,85.2,2647.97,145.125,-43.7,3.7,23.7,14.368,...,5,0.0,2015000000.0,0.0,2.0,0.0,3.0,0.0,0.0,0.0
3,-0.641,1.221,91.0,84.0,1289.59,169.751,-1.3,5.0,23.8,2.104,...,6,0.0,2015000000.0,0.0,2.0,0.0,4.0,0.0,0.0,0.0
4,-1.821,2.083,75.4,69.6,1374.57,280.671,18.4,12.0,23.8,-10.28,...,7,0.0,2015000000.0,1.0,2.0,0.0,5.0,0.0,0.0,0.0


In [37]:
combined_pitches.shape

(3595944, 40)

The row count matches the sum of the two files (2,867,154 + 728,790 = 3,595,944), so outputting to a new CSV file:

In [39]:
combined_pitches.to_csv('./Data/Kaggle_Files/combined_pitches.csv')
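
One thing to be aware of with `to_csv`: by default it writes the row index as an extra, unnamed first column, and reading the file back with `pd.read_csv` then yields a column named `Unnamed: 0`. The sketch below shows the difference on a tiny made-up frame written to in-memory buffers. Whether dropping the index is right here depends on how the files are loaded later, which this notebook does not show, so treat `index=False` as an option rather than a required fix.

```python
import io

import pandas as pd

# Tiny made-up frame standing in for one of the combined tables.
frame = pd.DataFrame({"ab_id": [1, 2], "event": ["Single", "Flyout"]})

# Default behavior: the row index is written as an unnamed leading column.
with_index = io.StringIO()
frame.to_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)

header_with_index = with_index.getvalue().splitlines()[0]
header_without_index = without_index.getvalue().splitlines()[0]
```

The first header starts with an empty field for the index, while the second contains only the data column names.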

As a note, with that done I'll be deleting the original CSV files from my local computer to save space, since all of their data is now in the combined files. The one exception is the player_names.csv file, which will be used to link players to events.

NOTE: all of this data is saved locally in the Data/Kaggle_Files directory. I am adding all .csv files to my .gitignore to save space, so they will not appear on the GitHub page.

### NEXT: kaggle_dataset_sql_construction.ipynb