# Homework 2: Pandas, MatPlotLib, and SciKit-Learn
**Created by:**&emsp;**Jacob Norman**  
**Date:**&emsp;&emsp;&emsp;&nbsp;&nbsp;&nbsp;&nbsp;**3/8/2019**
## Overview
This homework assignment covers the remaining content from Jake VanderPlas's text *Python Data Science Handbook*: ``pandas``, ``matplotlib`` (and ``seaborn``), and ``sklearn``. I decided to group these three chapters together because they are so tightly linked. Group-by analysis with ``pandas`` is often supplemented with plotting in ``matplotlib`` or ``seaborn``; together this is commonly known as *exploratory data analysis*, or EDA. The first part of this assignment will involve some EDA on a baseball dataset. Afterward, I will use this analysis as a basis for a machine learning model with ``sklearn``.
## About the Data
The data used for this assignment is pitch-by-pitch data for Justin Verlander for the entire 2017 MLB season, including the postseason. Perhaps 2011, when he won the AL MVP, AL Cy Young, and AL pitching Triple Crown (W, ERA, Strikeouts), would have been a better year to analyze; however, the data was not as complete for that year. You may remember 2017 as the year the Detroit Tigers traded JV to the Houston Astros. In return the Tigers got some solid prospects: Jake Rogers (C), Daz Cameron (OF), and Franklin Perez (RHP). JV's hefty contract was a principal factor in the deal, which is why he was traded after the non-waiver trade deadline. To seal the deal, the Tigers had to pay \$8 million of JV's contract in 2018 and 2019. However, having that contract off the Tigers' payroll has freed them up and allowed them to accelerate their rebuild. The deal worked out for JV, as he finally became a World Series Champion. He was a key component of the Astros' pitching rotation and arguably the reason they won the World Series.

PITCHf/x is a relatively new tool used for Sabermetrics. Starting in 2006, PITCHf/x has recorded pitch-level data for individual players. The earlier years of the application are not as complete, with many missing values or even entire columns. ``pybaseball`` is a way to extract PITCHf/x and other baseball data directly into Python. You can read about the functions of the package [here](https://www.pydoc.io/pypi/pybaseball-1.0.1/). There are many R packages, such as ``baseballr``, ``Lahman``, and ``pitchRx``, that do a better job, but, as this is a Python class, I opted for the slightly less robust ``pybaseball``.

Here is a way to install the package, if you wish:

In [None]:
# !pip install pybaseball

In [40]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import pybaseball

In [39]:
# Look up Verlander's MLBAM ID, then pull every pitch he threw in 2017.
player_id = pybaseball.playerid_lookup('Verlander', first='Justin')
jv_id = player_id['key_mlbam'][0]
jv = pybaseball.statcast_pitcher(start_dt='2017-04-02', end_dt='2017-11-01',
                                 player_id=jv_id)
jv.head(10)

Gathering player lookup table. This may take a moment.
Gathering Player Data


Unnamed: 0,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,events,description,...,home_score,away_score,bat_score,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,if_fielding_alignment,of_fielding_alignment
0,FF,2017-10-31,97.4,-1.9446,6.7237,Justin Verlander,641355,434378,strikeout,swinging_strike,...,2,1,2,1,1,2,2,1,Standard,Standard
1,FF,2017-10-31,97.1,-1.8633,6.7674,Justin Verlander,641355,434378,,foul,...,2,1,2,1,1,2,2,1,Standard,Standard
2,FF,2017-10-31,96.1,-1.8595,6.8649,Justin Verlander,641355,434378,,swinging_strike,...,2,1,2,1,1,2,2,1,Standard,Standard
3,FF,2017-10-31,96.6,-1.845,6.8482,Justin Verlander,641355,434378,,ball,...,2,1,2,1,1,2,2,1,Standard,Standard
4,FC,2017-10-31,91.5,-2.2148,6.5903,Justin Verlander,457759,434378,field_out,hit_into_play,...,2,1,2,1,1,2,2,1,Strategic,Standard
5,FC,2017-10-31,91.9,-2.1884,6.565,Justin Verlander,457759,434378,,foul,...,2,1,2,1,1,2,2,1,Strategic,Standard
6,FC,2017-10-31,91.8,-2.3221,6.5049,Justin Verlander,457759,434378,,foul_tip,...,2,1,2,1,1,2,2,1,Strategic,Standard
7,FC,2017-10-31,90.7,-2.2522,6.7167,Justin Verlander,457759,434378,,foul,...,2,1,2,1,1,2,2,1,Strategic,Standard
8,SL,2017-10-31,87.5,-2.3266,6.4741,Justin Verlander,608369,434378,sac_fly,hit_into_play_score,...,1,1,1,1,1,1,1,1,Strategic,Standard
9,FF,2017-10-31,97.1,-1.9415,6.6737,Justin Verlander,608369,434378,,foul,...,1,1,1,1,1,1,1,1,Strategic,Standard


Here is a sample of the data. With 89 variables, there is clearly a lot to analyze, and the preview above hides most of the columns. It is better to view the overall structure of the dataframe.

In [43]:
jv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4065 entries, 0 to 4064
Data columns (total 89 columns):
pitch_type                         4065 non-null object
game_date                          4065 non-null object
release_speed                      4065 non-null float64
release_pos_x                      4065 non-null float64
release_pos_z                      4065 non-null float64
player_name                        4065 non-null object
batter                             4065 non-null int64
pitcher                            4065 non-null int64
events                             983 non-null object
description                        4065 non-null object
spin_dir                           0 non-null float64
spin_rate_deprecated               0 non-null float64
break_angle_deprecated             0 non-null float64
break_length_deprecated            0 non-null float64
zone                               4065 non-null int64
des                                983 non-null object
game_ty

Clearly the previous view of the dataframe did not do it justice. These are all of the variables available for all 4065 pitches JV threw in 2017. Extracting this data directly from [Baseball Savant](https://baseballsavant.mlb.com/statcast_search) or [Brooks Baseball](http://www.brooksbaseball.net/pfxVB/pfx.php?) would be a very tedious task, but ``pybaseball`` makes the process quick and painless. Some columns contain only NULL values; judging by their names, this appears to be because they were deprecated and their information is now captured in a different variable.
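
The all-NULL columns can be located and dropped programmatically rather than read off the ``info()`` listing. The following is a minimal sketch on a small made-up dataframe (``toy`` and its values are invented for illustration and only borrow a few column names from the real data); the same two steps should carry over to ``jv``.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the pitch data; the values are made up for illustration.
toy = pd.DataFrame({
    "pitch_type": ["FF", "FC", "SL", "FF"],
    "release_speed": [97.4, 91.5, 87.5, 97.1],
    "events": ["strikeout", None, "sac_fly", None],  # partly missing
    "spin_dir": [np.nan] * 4,                         # entirely missing
    "spin_rate_deprecated": [np.nan] * 4,             # entirely missing
})

# Names of the columns in which every value is null.
all_null = [col for col in toy.columns if toy[col].isna().all()]
print(all_null)

# Drop only the columns that are entirely null; partly filled ones stay.
trimmed = toy.dropna(axis="columns", how="all")
print(trimmed.columns.tolist())
```

Using ``how="all"`` matters here: a column such as ``events`` is only partly empty and should be kept.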
## EDA with ``pandas``, ``matplotlib``, and ``seaborn``  
One of the principal strengths of ``pandas`` is group-by analysis, similar to ``dplyr`` in R. This allows for interesting views of the data that can lead to conclusions that cannot be drawn from simply looking at the individual observations. A possible grouping is by ``game_date``, which essentially gives the stats for each of JV's starts or pitching appearances. For this assignment I will be plotting using the ``seaborn`` API. This is mainly because we did not cover this topic in PCDA and I prefer the cleaner, more modern plots created with ``seaborn``.  

Let's do a group by ``game_date``:

In [48]:
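# numeric_only=True skips text columns, which recent pandas versions refuse to average.
jv.groupby('game_date').mean(numeric_only=True)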

game_date,release_speed,release_pos_x,release_pos_z,batter,pitcher,spin_dir,spin_rate_deprecated,break_angle_deprecated,break_length_deprecated,zone,...,at_bat_number,pitch_number,home_score,away_score,bat_score,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score
2017-04-04,90.93301,-2.520919,6.28109,551432.524272,434378.0,,,,,8.543689,...,32.242718,2.912621,1.368932,4.825243,1.368932,4.825243,4.825243,1.368932,1.368932,4.825243
2017-04-10,89.208929,-2.489267,6.363852,524848.892857,434378.0,,,,,9.1875,...,23.732143,3.151786,0.1875,0.758929,0.758929,0.1875,0.758929,0.1875,0.758929,0.1875
2017-04-15,91.604706,-2.410621,6.344404,526465.352941,434378.0,,,,,8.352941,...,23.011765,2.858824,4.364706,1.152941,4.364706,1.152941,1.152941,4.364706,4.364706,1.152941
2017-04-21,91.063551,-2.452093,6.519348,553271.308411,434378.0,,,,,9.429907,...,26.962617,3.317757,0.0,1.476636,0.0,1.476636,1.476636,0.0,0.0,1.476636
2017-04-27,91.637815,-2.489193,6.286086,512402.848739,434378.0,,,,,8.319328,...,25.033613,3.361345,0.117647,0.142857,0.142857,0.117647,0.142857,0.117647,0.142857,0.117647
2017-05-02,91.275424,-2.431496,6.285583,527200.90678,434378.0,,,,,8.728814,...,29.627119,2.90678,3.09322,0.991525,0.991525,3.09322,0.991525,3.09322,0.991525,3.09322
2017-05-09,91.434259,-2.337607,6.309935,542967.935185,434378.0,,,,,9.407407,...,27.685185,2.87037,1.324074,2.564815,1.324074,2.564815,2.564815,1.324074,1.324074,2.564815
2017-05-14,91.477358,-2.422964,6.36622,513454.5,434378.0,,,,,9.141509,...,27.537736,2.764151,1.301887,1.0,1.301887,1.0,1.0,1.301887,1.301887,1.0
2017-05-20,91.731193,-2.407803,6.318866,535007.293578,434378.0,,,,,8.981651,...,31.550459,2.889908,6.220183,1.376147,1.376147,6.220183,1.376147,6.220183,1.376147,6.220183
2017-05-25,91.249038,-2.33934,6.346934,548221.403846,434378.0,,,,,9.326923,...,34.442308,2.855769,2.836538,2.625,2.836538,2.625,2.625,2.836538,2.836538,2.625
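
Averaging every numeric column also averages identifiers such as ``batter`` and ``pitcher``, which is not meaningful. A more targeted group by names the columns and the statistic wanted for each. The sketch below uses a small made-up pitch log (``toy`` and its values are invented for illustration) and assumes pandas 0.25 or later, where named aggregation is available; it also converts ``game_date`` from strings to dates first, since ``info()`` reports that column as ``object``.

```python
import pandas as pd

# Toy pitch log; values are invented for illustration only.
toy = pd.DataFrame({
    "game_date": ["2017-04-04", "2017-04-04", "2017-04-04",
                  "2017-04-10", "2017-04-10"],
    "pitch_type": ["FF", "SL", "FF", "FF", "CU"],
    "release_speed": [95.0, 87.0, 96.0, 94.0, 80.0],
})

# game_date arrives as strings (object dtype); convert so it behaves as a date.
toy["game_date"] = pd.to_datetime(toy["game_date"])

# One row per game: number of pitches, plus mean and max release speed.
per_game = toy.groupby("game_date").agg(
    pitches=("pitch_type", "count"),      # counts non-null pitch types
    avg_speed=("release_speed", "mean"),
    max_speed=("release_speed", "max"),
)
print(per_game)
```

Each keyword in ``agg`` becomes an output column, so the result carries readable names and leaves out columns that should not be averaged.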
