# Let's Stay Up-to-Date!

I completed the original project at the end of 2020, while COVID was running rampant, altering the timeline of the season and turning MLB upside down. Now I feel it is time to check whether the umpires have gotten better since I last analyzed their performance.

In [1]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import pybaseball as pb
from pybaseball import cache

In [2]:
pd.set_option('display.max_rows', None, 'display.max_columns', None)

In [3]:
cache.enable()

## Let's Retrieve New Data (and try to use the old one we already have)

In [4]:
# The last data retrieved covered the 2020 MLB season, which the LA Dodgers won; the new champions are the Atlanta Braves.
# The previous data has already been saved in a different folder.

# mlb2021 = pb.statcast(start_dt='2021-04-01', end_dt='2021-11-02')
# print('Retrieval Complete')

This is a large query, it may take a moment to complete
Completed sub-query from 2021-04-01 to 2021-04-06
Completed sub-query from 2021-04-07 to 2021-04-12
Completed sub-query from 2021-04-13 to 2021-04-18
Completed sub-query from 2021-04-19 to 2021-04-24
Completed sub-query from 2021-04-25 to 2021-04-30
Completed sub-query from 2021-05-01 to 2021-05-06
Completed sub-query from 2021-05-07 to 2021-05-12
Completed sub-query from 2021-05-13 to 2021-05-18
Completed sub-query from 2021-05-19 to 2021-05-24
Completed sub-query from 2021-05-25 to 2021-05-30
Completed sub-query from 2021-05-31 to 2021-06-05
Completed sub-query from 2021-06-06 to 2021-06-11
Completed sub-query from 2021-06-12 to 2021-06-17
Completed sub-query from 2021-06-18 to 2021-06-23
Completed sub-query from 2021-06-24 to 2021-06-29
Completed sub-query from 2021-06-30 to 2021-07-05
Completed sub-query from 2021-07-06 to 2021-07-11
Completed sub-query from 2021-07-12 to 2021-07-17
Completed sub-query from 2021-07-18 to 2021-

In [5]:
# Save this to a csv file in case I need to recall it.
# mlb2021.to_csv('2021mlb.csv', index=False)

In [4]:
# Skip retrieval since we already retrieved
# Read the CSV

mlb2021 = pd.read_csv('2021mlb.csv')
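The comment-out-and-reload routine above can also be written as a small helper, so the slow query only runs when the CSV is missing. This is a minimal sketch, not what the notebook actually ran: `load_or_fetch` and `fake_fetch` are hypothetical names, and the stand-in fetcher replaces the real `pb.statcast(...)` call so the example works offline.

```python
from pathlib import Path

import pandas as pd


def load_or_fetch(csv_path, fetch):
    """Return the cached CSV if it exists; otherwise call fetch() and cache the result."""
    path = Path(csv_path)
    if path.exists():
        return pd.read_csv(path)
    # fetch is any zero-argument callable that returns a DataFrame,
    # e.g. lambda: pb.statcast(start_dt='2021-04-01', end_dt='2021-11-02')
    df = fetch()
    df.to_csv(path, index=False)
    return df


def fake_fetch():
    """Stand-in for the real Statcast query (hypothetical, two made-up rows)."""
    return pd.DataFrame({"pitch_type": ["FF", "SL"], "release_speed": [93.7, 85.2]})
```

Calling `load_or_fetch('2021mlb.csv', fetch)` a second time reads the file instead of querying again, which is the same effect as commenting out the retrieval cell by hand.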

Let's look at the data to make sure it has all the correct information.

To get an understanding of what these columns stand for, [click here](https://baseballsavant.mlb.com/csv-docs).

In [5]:
mlb2021.head(5) # We have the last pitches of the World Series

Unnamed: 0,index,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,events,description,spin_dir,spin_rate_deprecated,break_angle_deprecated,break_length_deprecated,zone,des,game_type,stand,p_throws,home_team,away_team,type,hit_location,bb_type,balls,strikes,game_year,pfx_x,pfx_z,plate_x,plate_z,on_3b,on_2b,on_1b,outs_when_up,inning,inning_topbot,hc_x,hc_y,tfs_deprecated,tfs_zulu_deprecated,fielder_2,umpire,sv_id,vx0,vy0,vz0,ax,ay,az,sz_top,sz_bot,hit_distance_sc,launch_speed,launch_angle,effective_speed,release_spin_rate,release_extension,game_pk,pitcher.1,fielder_2.1,fielder_3,fielder_4,fielder_5,fielder_6,fielder_7,fielder_8,fielder_9,release_pos_y,estimated_ba_using_speedangle,estimated_woba_using_speedangle,woba_value,woba_denom,babip_value,iso_value,launch_speed_angle,at_bat_number,pitch_number,pitch_name,home_score,away_score,bat_score,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,if_fielding_alignment,of_fielding_alignment,spin_axis,delta_home_win_exp,delta_run_exp
0,114,FF,2021-11-02,93.7,1.39,6.72,"Smith, Will",493329.0,519293.0,field_out,hit_into_play,,,,,5.0,"Yuli Gurriel grounds out, shortstop Dansby Swa...",W,R,L,HOU,ATL,X,6.0,ground_ball,0.0,2.0,2021.0,0.57,1.21,0.19,2.68,,,488726.0,2.0,9.0,Bot,98.12,136.43,,,518595.0,,,-4.348927,-136.263269,-7.335202,8.101619,29.384843,-15.743652,3.37,1.53,5.0,93.8,-28.0,93.4,2112.0,6.1,660906.0,519293.0,518595.0,518692.0,645277.0,663586.0,621020.0,592696.0,628338.0,594807.0,54.39,0.071,0.064,0.0,1.0,0.0,0.0,2.0,69.0,3.0,4-Seam Fastball,0.0,7.0,0.0,7.0,7.0,0.0,0.0,7.0,Standard,Standard,146.0,0.0,-0.138
1,117,FF,2021-11-02,92.9,1.38,6.72,"Smith, Will",493329.0,519293.0,,foul,,,,,5.0,"Yuli Gurriel grounds out, shortstop Dansby Swa...",W,R,L,HOU,ATL,S,,,0.0,1.0,2021.0,0.9,1.34,0.01,2.12,,,488726.0,2.0,9.0,Bot,,,,,518595.0,,,-5.455433,-134.989926,-8.881689,12.188852,30.69001,-14.009571,3.37,1.53,156.0,75.0,15.0,92.6,2206.0,6.3,660906.0,519293.0,518595.0,518692.0,645277.0,663586.0,621020.0,592696.0,628338.0,594807.0,54.22,,,,,,,,69.0,2.0,4-Seam Fastball,0.0,7.0,0.0,7.0,7.0,0.0,0.0,7.0,Standard,Standard,145.0,0.0,-0.047
2,128,FF,2021-11-02,93.1,1.35,6.73,"Smith, Will",493329.0,519293.0,,called_strike,,,,,6.0,"Yuli Gurriel grounds out, shortstop Dansby Swa...",W,R,L,HOU,ATL,S,,,0.0,0.0,2021.0,0.81,1.52,0.78,2.13,,,488726.0,2.0,9.0,Bot,,,,,518595.0,,,-3.230974,-135.201801,-9.255781,10.67848,31.699974,-11.7058,3.4,1.53,,,,92.5,2216.0,6.2,660906.0,519293.0,518595.0,518692.0,645277.0,663586.0,621020.0,592696.0,628338.0,594807.0,54.3,,,,,,,,69.0,1.0,4-Seam Fastball,0.0,7.0,0.0,7.0,7.0,0.0,0.0,7.0,Standard,Standard,143.0,0.0,-0.042
3,130,FF,2021-11-02,94.6,1.31,6.73,"Smith, Will",670541.0,519293.0,field_out,hit_into_play,,,,,5.0,Yordan Alvarez flies out to left fielder Eddie...,W,L,L,HOU,ATL,X,7.0,fly_ball,3.0,2.0,2021.0,0.85,1.27,-0.23,2.66,,,488726.0,1.0,9.0,Bot,68.5,86.79,,,518595.0,,,-5.901934,-137.422092,-7.652311,12.118242,35.102245,-14.577857,3.58,1.68,312.0,92.7,39.0,93.8,2263.0,6.3,660906.0,519293.0,518595.0,518692.0,645277.0,663586.0,621020.0,592696.0,628338.0,594807.0,54.25,0.026,0.046,0.0,1.0,0.0,0.0,3.0,68.0,6.0,4-Seam Fastball,0.0,7.0,0.0,7.0,7.0,0.0,0.0,7.0,Infield shift,Standard,140.0,-0.001,-0.386
4,140,FF,2021-11-02,93.6,1.31,6.8,"Smith, Will",670541.0,519293.0,,ball,,,,,13.0,Yordan Alvarez flies out to left fielder Eddie...,W,L,L,HOU,ATL,B,,,2.0,2.0,2021.0,0.9,1.43,-1.15,1.51,,,488726.0,1.0,9.0,Bot,,,,,518595.0,,,-8.265487,-135.578023,-10.936295,12.932808,31.370147,-12.168466,3.46,1.68,,,,92.8,2239.0,6.2,660906.0,519293.0,518595.0,518692.0,645277.0,663586.0,621020.0,592696.0,628338.0,594807.0,54.28,,,,,,,,68.0,5.0,4-Seam Fastball,0.0,7.0,0.0,7.0,7.0,0.0,0.0,7.0,Infield shift,Standard,152.0,0.0,0.12


In [6]:
mlb2021.tail(3) # The last rows match the first games of the year when cross-referenced with baseball-reference.com

Unnamed: 0,index,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,events,description,spin_dir,spin_rate_deprecated,break_angle_deprecated,break_length_deprecated,zone,des,game_type,stand,p_throws,home_team,away_team,type,hit_location,bb_type,balls,strikes,game_year,pfx_x,pfx_z,plate_x,plate_z,on_3b,on_2b,on_1b,outs_when_up,inning,inning_topbot,hc_x,hc_y,tfs_deprecated,tfs_zulu_deprecated,fielder_2,umpire,sv_id,vx0,vy0,vz0,ax,ay,az,sz_top,sz_bot,hit_distance_sc,launch_speed,launch_angle,effective_speed,release_spin_rate,release_extension,game_pk,pitcher.1,fielder_2.1,fielder_3,fielder_4,fielder_5,fielder_6,fielder_7,fielder_8,fielder_9,release_pos_y,estimated_ba_using_speedangle,estimated_woba_using_speedangle,woba_value,woba_denom,babip_value,iso_value,launch_speed_angle,at_bat_number,pitch_number,pitch_name,home_score,away_score,bat_score,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,if_fielding_alignment,of_fielding_alignment,spin_axis,delta_home_win_exp,delta_run_exp
720634,22008,FF,2021-04-01,96.1,-1.76,6.27,"Márquez, Germán",605141.0,608566.0,,ball,,,,,11.0,Mookie Betts singles on a fly ball to left fie...,R,R,R,COL,LAD,B,,,1.0,1.0,2021.0,-0.42,0.86,-0.61,3.65,,,,0.0,1.0,Top,,,,,553869.0,,,3.953292,-140.09228,-3.371124,-6.331036,25.767684,-20.177597,3.34,1.64,,,,95.6,2063.0,5.5,634615.0,608566.0,553869.0,543068.0,572008.0,658069.0,596115.0,606132.0,641658.0,453568.0,55.01,,,,,,,,1.0,3.0,4-Seam Fastball,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Strategic,Standard,208.0,0.0,0.052
720635,22073,FF,2021-04-01,95.6,-2.04,6.03,"Márquez, Germán",605141.0,608566.0,,foul,,,,,5.0,Mookie Betts singles on a fly ball to left fie...,R,R,R,COL,LAD,S,,,1.0,0.0,2021.0,-0.25,0.88,0.19,2.6,,,,0.0,1.0,Top,,,,,553869.0,,,6.352185,-139.293012,-5.428763,-4.405496,25.441128,-19.766641,3.29,1.49,265.0,77.7,39.0,95.1,2988.0,5.5,634615.0,608566.0,553869.0,543068.0,572008.0,658069.0,596115.0,606132.0,641658.0,453568.0,54.98,,,,,,,,1.0,2.0,4-Seam Fastball,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Standard,Standard,195.0,0.0,-0.049
720636,22343,FF,2021-04-01,95.1,-1.95,6.11,"Márquez, Germán",605141.0,608566.0,,ball,,,,,12.0,Mookie Betts singles on a fly ball to left fie...,R,R,R,COL,LAD,B,,,0.0,0.0,2021.0,-0.3,1.01,0.07,4.16,,,,0.0,1.0,Top,,,,,553869.0,,,5.887082,-138.617211,-1.813448,-4.967683,25.52782,-18.81509,3.46,1.76,,,,94.5,,5.5,634615.0,608566.0,553869.0,543068.0,572008.0,658069.0,596115.0,606132.0,641658.0,453568.0,55.02,,,,,,,,1.0,1.0,4-Seam Fastball,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Standard,Standard,,0.0,0.038


# Next Steps

Before I can work with the data, it needs to go through some rigorous cleaning. Many of the columns are not helpful for modeling, but running a model on this data should still be possible since we are only focused on a few things.

## Understanding the data

In [7]:
mlb2021.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 720637 entries, 0 to 720636
Data columns (total 93 columns):
 #   Column                           Non-Null Count   Dtype  
---  ------                           --------------   -----  
 0   index                            720637 non-null  int64  
 1   pitch_type                       720272 non-null  object 
 2   game_date                        720637 non-null  object 
 3   release_speed                    720236 non-null  float64
 4   release_pos_x                    720059 non-null  float64
 5   release_pos_z                    720059 non-null  float64
 6   player_name                      720637 non-null  object 
 7   batter                           720637 non-null  float64
 8   pitcher                          720637 non-null  float64
 9   events                           184107 non-null  object 
 10  description                      720637 non-null  object 
 11  spin_dir                         0 non-null       float64
 12  sp

In [8]:
mlb2021.isnull().sum()

index                                   0
pitch_type                            365
game_date                               0
release_speed                         401
release_pos_x                         578
release_pos_z                         578
player_name                             0
batter                                  0
pitcher                                 0
events                             536530
description                             0
spin_dir                           720637
spin_rate_deprecated               720637
break_angle_deprecated             720637
break_length_deprecated            720637
zone                                  368
des                                     0
game_type                               0
stand                                   0
p_throws                                0
home_team                               0
away_team                               0
type                                    0
hit_location                      

In [9]:
mlb2021.describe()

Unnamed: 0,index,release_speed,release_pos_x,release_pos_z,batter,pitcher,spin_dir,spin_rate_deprecated,break_angle_deprecated,break_length_deprecated,zone,hit_location,balls,strikes,game_year,pfx_x,pfx_z,plate_x,plate_z,on_3b,on_2b,on_1b,outs_when_up,inning,hc_x,hc_y,tfs_deprecated,tfs_zulu_deprecated,fielder_2,umpire,sv_id,vx0,vy0,vz0,ax,ay,az,sz_top,sz_bot,hit_distance_sc,launch_speed,launch_angle,effective_speed,release_spin_rate,release_extension,game_pk,pitcher.1,fielder_2.1,fielder_3,fielder_4,fielder_5,fielder_6,fielder_7,fielder_8,fielder_9,release_pos_y,estimated_ba_using_speedangle,estimated_woba_using_speedangle,woba_value,woba_denom,babip_value,iso_value,launch_speed_angle,at_bat_number,pitch_number,home_score,away_score,bat_score,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,spin_axis,delta_home_win_exp,delta_run_exp
count,720637.0,720236.0,720059.0,720059.0,720637.0,720637.0,0.0,0.0,0.0,0.0,720269.0,160123.0,720637.0,720637.0,720637.0,720268.0,720269.0,720269.0,720269.0,67695.0,136361.0,222451.0,720637.0,720637.0,123459.0,123459.0,0.0,0.0,720592.0,0.0,0.0,720269.0,720269.0,720269.0,720269.0,720269.0,720269.0,720269.0,720269.0,237696.0,235863.0,235863.0,720028.0,717604.0,717681.0,720637.0,720637.0,720592.0,720592.0,720592.0,720592.0,720592.0,720592.0,720592.0,720592.0,720059.0,123146.0,123146.0,184107.0,183712.0,184107.0,184107.0,123146.0,720637.0,720637.0,720637.0,720637.0,720637.0,720637.0,720637.0,720637.0,720637.0,720637.0,717639.0,720635.0,720594.0
mean,11465.853512,88.863055,-0.702066,5.855274,599201.254543,599475.369626,,,,,9.091049,5.0489,0.88333,0.895841,2021.0,-0.106998,0.655249,0.044113,2.274805,600644.939168,600228.028249,598740.188797,0.977346,4.922555,125.457268,122.151382,,,581088.280786,,,2.109954,-129.241973,-4.109277,-2.01654,26.652685,-23.29502,3.392335,1.574079,154.530838,82.246271,16.841421,88.616839,2235.90589,6.304699,633825.644903,599475.369626,581088.280786,583863.174794,607382.227255,598524.691129,609891.798084,604635.276058,615342.234302,607113.54986,54.195154,0.325257,0.369779,0.325012,0.994535,0.184996,0.148957,3.18673,38.401019,2.913606,2.238417,2.316299,2.261981,2.292735,2.331498,2.254133,2.292896,2.292735,175.76287,0.000135,3.2e-05
std,6869.474401,6.053327,1.889028,0.529734,62794.921518,63890.169024,,,,,4.221432,2.63121,0.968423,0.828442,0.0,0.866944,0.7499,0.846251,0.985328,61606.733375,62041.165331,63112.942277,0.818433,2.587854,40.450682,41.119049,,,67295.612602,,,5.906189,8.762635,3.134675,10.562398,3.967949,9.056762,0.164627,0.089825,120.941466,15.474921,33.014148,8.000694,337.779728,0.447419,3414.592555,63890.169024,67295.612602,70232.844802,59638.30823,58669.537058,53030.004475,63191.442809,55059.001186,51359.572186,0.445868,0.289353,0.393597,0.518349,0.073724,0.388295,0.575212,1.296861,22.487272,1.734509,2.608727,2.630162,2.572363,2.666217,2.636198,2.615664,2.585636,2.666217,70.923962,0.02845,0.24613
min,0.0,30.1,-5.34,0.94,405395.0,424144.0,,,,,1.0,1.0,0.0,0.0,2021.0,-2.56,-2.13,-6.1,-5.07,405395.0,405395.0,405395.0,0.0,1.0,2.01,2.6,,,425772.0,,,-18.902697,-150.384713,-20.138144,-35.092977,-2.089829,-51.301206,2.5,0.77,0.0,2.8,-89.0,0.0,43.0,2.8,632169.0,424144.0,425772.0,405395.0,444876.0,405395.0,444876.0,444482.0,456715.0,451594.0,51.51,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.736,-1.538
25%,5463.0,84.6,-2.06,5.6,570256.0,571578.0,,,,,5.0,2.0,0.0,0.0,2021.0,-0.84,0.19,-0.53,1.64,570731.0,570731.0,553902.0,0.0,3.0,99.28,90.52,,,543510.0,,,-2.927206,-136.272362,-6.206549,-10.909218,23.675024,-29.741887,3.3,1.51,18.0,72.5,-5.0,84.6,2074.0,6.0,632800.0,571578.0,543510.0,527038.0,544725.0,570482.0,592743.0,592178.0,592178.0,573262.0,53.9,0.077,0.08,0.0,1.0,0.0,0.0,2.0,19.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,134.0,0.0,-0.068
50%,11361.0,89.9,-1.42,5.89,608070.0,608344.0,,,,,11.0,5.0,1.0,1.0,2021.0,-0.18,0.76,0.04,2.28,608336.0,608336.0,608070.0,1.0,5.0,124.62,127.17,,,595978.0,,,3.805308,-130.701257,-4.2395,-2.080913,26.605351,-22.824502,3.41,1.56,164.0,81.8,20.0,89.9,2257.0,6.3,633416.0,608344.0,595978.0,593934.0,624428.0,608070.0,621028.0,621438.0,628338.0,607680.0,54.2,0.234,0.231,0.0,1.0,0.0,0.0,3.0,38.0,3.0,1.0,2.0,2.0,1.0,2.0,1.0,2.0,1.0,198.0,0.0,-0.018
75%,17366.0,93.7,1.25,6.18,656305.0,656302.0,,,,,13.0,7.0,2.0,2.0,2021.0,0.59,1.28,0.62,2.92,656537.0,656555.0,656305.0,2.0,7.0,152.77,154.96,,,624513.0,,,6.49008,-123.167338,-2.17443,6.189999,29.615474,-15.412595,3.49,1.62,242.0,94.3,41.0,93.8,2433.0,6.6,634075.0,656302.0,624513.0,650490.0,663697.0,649966.0,643396.0,656582.0,663837.0,656976.0,54.5,0.531,0.548,0.7,1.0,0.0,0.0,4.0,57.0,4.0,3.0,4.0,3.0,4.0,4.0,3.0,4.0,4.0,221.0,0.0,0.038
max,25195.0,103.4,4.61,7.93,685503.0,685503.0,,,,,14.0,9.0,4.0,2.0,2021.0,2.84,3.04,9.11,9.39,683734.0,683734.0,683734.0,2.0,16.0,246.71,232.16,,,680777.0,,,20.914518,-40.422498,15.486221,29.761284,47.935965,16.787233,4.47,2.26,497.0,122.2,90.0,104.8,3722.0,9.0,660938.0,685503.0,680777.0,683734.0,683734.0,683734.0,680911.0,683734.0,680776.0,683734.0,57.66,1.0,2.013,2.0,1.0,1.0,3.0,6.0,125.0,17.0,22.0,24.0,24.0,24.0,24.0,22.0,24.0,24.0,360.0,0.904,3.654


We have a lot of null values, plus columns that don't really help in determining whether a pitch is a strike or a ball, so we need to drop the unnecessary columns and assess what is important.

---

### Assessing Nulls/NaNs

With a total of 720,637 pitches thrown in-game during the 2021 MLB season, we are going to drop columns that are:

- Above 50% nulls
- Deprecated
- Not related to a pitch's mechanics, physics, etc. This includes:
    - A batter's influence on a pitch
    - Logistics of the game
- Related to expected run scores, hit expectancy, and other values that don't determine pitch physics

Columns that still have null values will be assessed individually to determine how important they are for determining strike vs. ball calls.
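The first two criteria can be checked mechanically instead of by eye. The sketch below is only an illustration on a toy frame: `columns_to_drop` is a hypothetical helper, and the assumption that deprecated columns can be spotted by a `_deprecated` suffix comes from the column names shown above, not from any official rule. The remaining criteria (batter influence, game logistics, expected-value stats) still need a manual decision.

```python
import numpy as np
import pandas as pd


def columns_to_drop(df, null_threshold=0.5):
    """List columns that are mostly null or flagged as deprecated by name."""
    null_share = df.isnull().mean()  # fraction of nulls per column
    mostly_null = null_share[null_share > null_threshold].index
    deprecated = [c for c in df.columns if c.endswith("_deprecated")]
    flagged = set(mostly_null) | set(deprecated)
    # Keep the original column order in the result.
    return [c for c in df.columns if c in flagged]


# Toy data with made-up values, shaped like a few Statcast columns.
toy = pd.DataFrame({
    "release_speed": [93.7, 92.9, 93.1, 94.6],
    "events": ["field_out", None, None, None],   # 75% null
    "spin_rate_deprecated": [np.nan] * 4,        # deprecated and empty
    "zone": [5.0, 5.0, 6.0, np.nan],             # 25% null, stays
})
print(columns_to_drop(toy))
```

Note that `bb_type` is also above 50% nulls in the real data, yet the notebook keeps it for one more step because it is used to filter rows, so a helper like this is a starting list rather than a final answer.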

In [10]:
mlb2021.columns

Index(['index', 'pitch_type', 'game_date', 'release_speed', 'release_pos_x',
       'release_pos_z', 'player_name', 'batter', 'pitcher', 'events',
       'description', 'spin_dir', 'spin_rate_deprecated',
       'break_angle_deprecated', 'break_length_deprecated', 'zone', 'des',
       'game_type', 'stand', 'p_throws', 'home_team', 'away_team', 'type',
       'hit_location', 'bb_type', 'balls', 'strikes', 'game_year', 'pfx_x',
       'pfx_z', 'plate_x', 'plate_z', 'on_3b', 'on_2b', 'on_1b',
       'outs_when_up', 'inning', 'inning_topbot', 'hc_x', 'hc_y',
       'tfs_deprecated', 'tfs_zulu_deprecated', 'fielder_2', 'umpire', 'sv_id',
       'vx0', 'vy0', 'vz0', 'ax', 'ay', 'az', 'sz_top', 'sz_bot',
       'hit_distance_sc', 'launch_speed', 'launch_angle', 'effective_speed',
       'release_spin_rate', 'release_extension', 'game_pk', 'pitcher.1',
       'fielder_2.1', 'fielder_3', 'fielder_4', 'fielder_5', 'fielder_6',
       'fielder_7', 'fielder_8', 'fielder_9', 'release_pos_y',
     

In [11]:
# Dropping Columns

mlb2021.drop(columns=['events','spin_dir','spin_rate_deprecated','break_angle_deprecated',
                     'break_length_deprecated','game_type','home_team','away_team',
                     'hit_location','game_year','on_3b','on_2b','on_1b',
                     'outs_when_up','inning','inning_topbot','hc_x','hc_y',
                     'tfs_deprecated','tfs_zulu_deprecated','umpire','sv_id',
                     'hit_distance_sc','launch_speed','launch_angle',
                     'game_pk', 'pitcher.1','fielder_2.1', 'fielder_3', 'fielder_4', 'fielder_5', 
                     'fielder_6', 'fielder_7', 'fielder_8', 'fielder_9',
                     'estimated_ba_using_speedangle', 'estimated_woba_using_speedangle',
                     'woba_value', 'woba_denom', 'babip_value', 'iso_value',
                     'launch_speed_angle', 'at_bat_number', 'pitch_number',
                     'home_score', 'away_score', 'bat_score', 'fld_score', 'post_away_score',
                     'post_home_score', 'post_bat_score', 'post_fld_score',
                     'if_fielding_alignment', 'of_fielding_alignment', 'spin_axis',
                     'delta_home_win_exp', 'delta_run_exp'], inplace=True)

In [12]:
# Assess what to do with the columns that still need cleanup
mlb2021.isnull().sum()

index                     0
pitch_type              365
game_date                 0
release_speed           401
release_pos_x           578
release_pos_z           578
player_name               0
batter                    0
pitcher                   0
description               0
zone                    368
des                       0
stand                     0
p_throws                  0
type                      0
bb_type              597099
balls                     0
strikes                   0
pfx_x                   369
pfx_z                   368
plate_x                 368
plate_z                 368
fielder_2                45
vx0                     368
vy0                     368
vz0                     368
ax                      368
ay                      368
az                      368
sz_top                  368
sz_bot                  368
effective_speed         609
release_spin_rate      3033
release_extension      2956
release_pos_y           578
pitch_name          

## Selecting what we need...

There is a column called `bb_type`, which means:

    Batted ball type, ground_ball, line_drive, fly_ball, popup.
    
Since this refers to balls that were hit, meaning no strike or ball call was made on them, we can omit the rows where it is filled. That removes 123,538 rows (720,637 - 597,099), or about 17% of the data, and leaves the 597,099 rows where `bb_type` is null. We want the null rows because those are the pitches that were not put in play. Fouls and swings are still included at this point, but there is another column we can use to filter those out.
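To double-check the proportion, the arithmetic can be worked from the null counts reported by `isnull().sum()` earlier (597,099 null `bb_type` values out of 720,637 rows). The sketch below also shows the same filter on a toy frame with made-up rows; the `.copy()` is my addition so the filtered result is an independent frame rather than a slice of the original.

```python
import numpy as np
import pandas as pd

# Counts taken from the isnull().sum() output earlier in the notebook.
total_pitches = 720_637
bb_type_null = 597_099

batted = total_pitches - bb_type_null             # rows with a batted ball type
pct_removed = round(batted / total_pitches * 100, 1)
print(batted, pct_removed)

# The same filter on a toy frame (values are illustrative only).
toy = pd.DataFrame({
    "description": ["ball", "hit_into_play", "called_strike", "foul"],
    "bb_type": [np.nan, "ground_ball", np.nan, np.nan],
})
not_batted = toy[toy["bb_type"].isnull()].copy()
print(not_batted["description"].tolist())
```

In the toy frame the foul survives the filter, which is why a second filter on `description` is still needed.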

In [17]:
mlb2021 = mlb2021[mlb2021['bb_type'].isnull()]

Not every outcome depends on the umpire's decision (a foul or a swinging strike, for example), so we are only going to focus on the `description` values that are based solely on the umpire's call. That means balls and called strikes only, unless one of the other descriptors is equivalent to a "called ball" or "called strike".

For clarification:

- [swinging strikes and foul tips are considered the same](https://www.mlb.com/glossary/rules/foul-tip)
- [a swinging strike blocked is a swinging strike](https://www.mlb.com/braves/video/bruce-zimmermann-swinging-strike-blocked-to-j-d-martinez)
- a hit by pitch could be either a ball or a strike, but since it hit the batter it will not be counted
- all bunts are considered to be strikes or fouls: the batter attempted to make contact, resulting in a strike, not like a foul tip
- hit into play basically means the batter made contact and the ball was thrown to get the batter out
- pitchouts are intentional balls thrown to try to catch base runners stealing a base; since they are intentional, we shall exclude them from being a "called ball"

In [20]:
mlb2021['description'].value_counts()

ball                       240415
foul                       127976
called_strike              118325
swinging_strike             76984
blocked_ball                17583
foul_tip                     6842
swinging_strike_blocked      4815
hit_by_pitch                 2137
foul_bunt                    1578
missed_bunt                   364
bunt_foul_tip                  41
pitchout                       34
hit_into_play                   4
foul_pitchout                   1
Name: description, dtype: int64

In [24]:
# We are selecting only "balls" and "called strikes"

mlb2021 = mlb2021[mlb2021['description'].isin(['ball', 'called_strike'])]
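Once only umpire-called pitches remain, the two `description` values map naturally onto a binary target. This is a sketch under my own assumptions, not a step the notebook performs: the `is_strike` column name and the `UMPIRE_CALLS` constant are hypothetical, and the toy rows are made up.

```python
import pandas as pd

UMPIRE_CALLS = ["ball", "called_strike"]  # the two descriptions kept above

toy = pd.DataFrame({
    "description": ["ball", "foul", "called_strike", "swinging_strike", "ball"],
    "plate_x": [-1.15, 0.01, 0.78, 0.30, 1.40],
})

# Keep only pitches decided by the umpire; .copy() gives an independent frame.
called = toy[toy["description"].isin(UMPIRE_CALLS)].copy()

# Hypothetical label: 1 for a called strike, 0 for a ball.
called["is_strike"] = (called["description"] == "called_strike").astype(int)
print(called)
```

A 0/1 column like this is one common way to feed strike-vs-ball calls into a classifier, though the notebook itself keeps working with `description` directly.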

In [25]:
mlb2021.isnull().sum()

index                     0
pitch_type              188
game_date                 0
release_speed           197
release_pos_x           293
release_pos_z           293
player_name               0
batter                    0
pitcher                   0
description               0
zone                    189
des                       0
stand                     0
p_throws                  0
type                      0
bb_type              358740
balls                     0
strikes                   0
pfx_x                   190
pfx_z                   189
plate_x                 189
plate_z                 189
fielder_2                23
vx0                     189
vy0                     189
vz0                     189
ax                      189
ay                      189
az                      189
sz_top                  189
sz_bot                  189
effective_speed         299
release_spin_rate      1567
release_extension      1537
release_pos_y           293
pitch_name          

After selecting the rows that we deemed necessary for pitch classification, we are left with fewer than two thousand rows containing nulls. Since that is well under 1% of what remains after the cleaning so far (358,740 rows), we are just going to drop the `bb_type` column and any rows containing nulls in any other column.
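Per-column null counts can overlap on the same rows, so they cannot simply be added up to see how many rows a `dropna` will remove. A minimal sketch of a direct count, using a toy frame with made-up values (the variable names here are my own):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "plate_x": [0.19, np.nan, 0.78, -0.23],
    "plate_z": [2.68, 2.12, 2.13, np.nan],
    "bb_type": [np.nan] * 4,  # entirely null after the earlier filter
})

# Drop the all-null column first, otherwise every row would count as incomplete.
features = toy.drop(columns=["bb_type"])

rows_with_nulls = int(features.isnull().any(axis=1).sum())  # rows dropna() would remove
share = rows_with_nulls / len(features)
cleaned = features.dropna()
print(rows_with_nulls, share, len(cleaned))
```

Dropping `bb_type` before calling `dropna` matters: with that column still in place, every remaining row contains a null and the whole frame would be emptied.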

In [26]:
mlb2021.describe()

Unnamed: 0,index,release_speed,release_pos_x,release_pos_z,batter,pitcher,zone,balls,strikes,pfx_x,pfx_z,plate_x,plate_z,fielder_2,vx0,vy0,vz0,ax,ay,az,sz_top,sz_bot,effective_speed,release_spin_rate,release_extension,release_pos_y
count,358740.0,358543.0,358447.0,358447.0,358740.0,358740.0,358551.0,358740.0,358740.0,358550.0,358551.0,358551.0,358551.0,358717.0,358551.0,358551.0,358551.0,358551.0,358551.0,358551.0,358551.0,358551.0,358441.0,357173.0,357203.0,358447.0
mean,11540.841696,88.854171,-0.708201,5.85389,599201.257652,599537.159062,10.405362,0.734947,0.698676,-0.105212,0.662264,0.052523,2.302656,580528.072177,2.142379,-129.210981,-4.05902,-2.017115,26.661348,-23.195176,3.371688,1.584925,88.567036,2240.011098,6.298616,54.201223
std,6873.348822,6.145198,1.891408,0.532926,63007.315559,63954.779138,3.798412,0.9204,0.79726,0.872371,0.760882,1.000161,1.109283,67542.889077,6.100224,8.889585,3.478183,10.660861,3.979958,9.163859,0.20609,0.11286,8.16796,330.073129,0.449637,0.44805
min,0.0,30.4,-4.64,1.02,405395.0,424144.0,1.0,0.0,0.0,-2.56,-2.13,-6.1,-4.98,425772.0,-18.902697,-149.318469,-20.138144,-35.092977,-2.089829,-50.549765,2.5,0.77,0.0,95.0,2.8,51.51
25%,5545.0,84.6,-2.06,5.6,570256.0,571578.0,8.0,0.0,0.0,-0.85,0.2,-0.71,1.51,543510.0,-2.622294,-136.236486,-6.523393,-10.972486,23.679222,-29.731756,3.24,1.51,84.5,2079.0,6.0,53.9
50%,11433.0,90.1,-1.43,5.89,608324.0,608344.0,12.0,0.0,0.0,-0.18,0.79,0.07,2.24,595978.0,3.615375,-131.072517,-4.18191,-2.16557,26.683038,-22.368308,3.37,1.59,90.1,2256.0,6.3,54.2
75%,17441.0,93.7,1.25,6.18,656371.0,656322.0,13.0,1.0,1.0,0.6,1.29,0.81,3.07,624513.0,6.703079,-123.130251,-1.773278,6.357195,29.638007,-15.301845,3.5,1.65,93.8,2431.0,6.6,54.51
max,25195.0,102.8,4.61,7.93,685503.0,685503.0,14.0,4.0,2.0,2.84,3.04,9.11,9.39,680777.0,20.914518,-40.960937,15.486221,27.668111,44.536799,16.787233,4.47,2.22,103.9,3598.0,9.0,57.66


In [27]:
# mlb2021 is a filtered slice of the original frame, so in-place edits on it
# trigger pandas' SettingWithCopyWarning. Reassigning the result avoids that.
mlb2021 = mlb2021.drop(columns=['bb_type'])

mlb2021 = mlb2021.dropna()


In [28]:
mlb2021.describe()

Unnamed: 0,index,release_speed,release_pos_x,release_pos_z,batter,pitcher,zone,balls,strikes,pfx_x,pfx_z,plate_x,plate_z,fielder_2,vx0,vy0,vz0,ax,ay,az,sz_top,sz_bot,effective_speed,release_spin_rate,release_extension,release_pos_y
count,357046.0,357046.0,357046.0,357046.0,357046.0,357046.0,357046.0,357046.0,357046.0,357046.0,357046.0,357046.0,357046.0,357046.0,357046.0,357046.0,357046.0,357046.0,357046.0,357046.0,357046.0,357046.0,357046.0,357046.0,357046.0,357046.0
mean,11545.508867,88.8564,-0.708696,5.854498,599222.045148,599562.152697,10.403388,0.734838,0.698428,-0.105523,0.662639,0.052597,2.30367,580544.272427,2.144993,-129.214298,-4.060907,-2.021197,26.662392,-23.191499,3.371667,1.584948,88.874061,2240.012813,6.298698,54.20136
std,6873.550643,6.14149,1.890495,0.5321,63005.146854,63930.026661,3.798998,0.920262,0.797202,0.871983,0.760989,1.000125,1.108578,67543.254693,6.096522,8.884014,3.476971,10.654251,3.97755,9.1647,0.206164,0.112884,6.297649,330.07253,0.449412,0.448513
min,0.0,31.1,-4.64,1.02,405395.0,424144.0,1.0,0.0,0.0,-2.56,-2.13,-6.1,-4.98,425772.0,-18.902697,-149.318469,-20.138144,-35.092977,2.47331,-50.549765,2.5,0.77,27.9,95.0,2.8,51.51
25%,5552.0,84.6,-2.06,5.6,570256.0,571578.0,8.0,0.0,0.0,-0.85,0.2,-0.71,1.51,543510.0,-2.613566,-136.239174,-6.524782,-10.969227,23.679844,-29.729233,3.24,1.51,84.6,2079.0,6.0,53.9
50%,11441.0,90.2,-1.43,5.89,608324.0,608344.0,12.0,0.0,0.0,-0.18,0.79,0.07,2.24,595978.0,3.61594,-131.075534,-4.184034,-2.171779,26.683502,-22.361046,3.37,1.59,90.1,2256.0,6.3,54.2
75%,17445.0,93.7,1.24,6.18,656371.0,656322.0,13.0,1.0,1.0,0.6,1.29,0.81,3.08,624513.0,6.702944,-123.132172,-1.775055,6.347693,29.638464,-15.297374,3.5,1.65,93.8,2431.0,6.6,54.51
max,25195.0,102.8,4.61,7.93,685503.0,685503.0,14.0,4.0,2.0,2.84,2.58,9.11,9.39,680777.0,20.914518,-42.52358,15.486221,27.668111,44.536799,-1.322756,4.47,2.22,103.9,3598.0,9.0,57.66


In [29]:
# The counts have changed, so we can reasonably assume
# that all nulls have been dropped, but just to be sure
# we will run another null check

mlb2021.isnull().sum()

index                0
pitch_type           0
game_date            0
release_speed        0
release_pos_x        0
release_pos_z        0
player_name          0
batter               0
pitcher              0
description          0
zone                 0
des                  0
stand                0
p_throws             0
type                 0
balls                0
strikes              0
pfx_x                0
pfx_z                0
plate_x              0
plate_z              0
fielder_2            0
vx0                  0
vy0                  0
vz0                  0
ax                   0
ay                   0
az                   0
sz_top               0
sz_bot               0
effective_speed      0
release_spin_rate    0
release_extension    0
release_pos_y        0
pitch_name           0
dtype: int64

In [30]:
# Commented out to avoid saving over the existing file
# mlb2021.to_csv('2021cleaned.csv', index=False)