## 2020 _'isStrike'_ Analysis

Our objective in this project is to determine whether a pitch is called a ball or a strike as accurately as possible based on the given data. The [test set in the data folder](./data/2020-test.csv) includes the pitches we have to estimate. The [training set](./data/2020-train.csv), on the other hand, has the data we can analyze and train on to arrive at the most accurate algorithm.

A classification machine learning (ML) algorithm is the best fit for this problem, since the main task is to choose a category (a strike or a ball). We will start off with a common classification ML algorithm: Logistic Regression.

Before we begin writing code, we have to analyze the data we are dealing with. First, we will look through the data files and see whether anything is misplaced or inconsistently recorded. To do so, we will use **python** and its popular libraries: _numpy_, _pandas_, and _matplotlib_.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
# import the datasets
train_df = pd.read_csv('./data/2020-train.csv')

After looking at `train_df`, particularly the column we are interested in (`pitch_call`), we can see that some of the pitches are ambiguous. For example, for a pitch recorded as `StrikeSwinging`, there is no way to tell whether the pitcher threw a strike or a ball based on the data and the way it was recorded. Therefore, for our purposes, we will only include the rows where the batter did not swing and either _ball_ or _strike_ was recorded. To filter the rows, we write a function:

In [3]:
def filter_df(df):
    strikes = np.array(df['pitch_call'] == 'StrikeCalled')
    balls = np.array(df['pitch_call'] == 'BallCalled')
    filtered = strikes | balls
    return df[filtered]
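
As a quick check of what this filter keeps, the sketch below applies the same rule to a small made-up frame. The sample rows, the `FoulBall` label, and the `isin`-based helper `keep_called_pitches` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample rows; only the pitch_call column matters here.
toy = pd.DataFrame({
    'pitch_id': [1, 2, 3, 4, 5],
    'pitch_call': ['StrikeCalled', 'StrikeSwinging', 'BallCalled',
                   'FoulBall', 'BallCalled'],
})

def keep_called_pitches(df):
    # Keep only pitches the umpire actually ruled on (no swing involved).
    return df[df['pitch_call'].isin(['StrikeCalled', 'BallCalled'])]

called = keep_called_pitches(toy)
print(called['pitch_id'].tolist())
```

Only the called strikes and called balls survive, which is the subset the rest of the analysis works with.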

One more thing to notice is that some of the Right/Left values are recorded as R/L. We can either ignore these rows or fix the issue. Since it is relatively easy, we decided to fix it:

In [4]:
# some data are skewed
# for ex, pitcher_side has some cells with R, L 
# instead of right, left
def fix_skewed_data(df):
    df['pitcher_side'] = np.where(df.pitcher_side == 'R', 'Right', df.pitcher_side)
    df['pitcher_side'] = np.where(df.pitcher_side == 'L', 'Left', df.pitcher_side)
    df['batter_side'] = np.where(df.batter_side == 'R', 'Right', df.batter_side)
    df['batter_side'] = np.where(df.batter_side == 'L', 'Left', df.batter_side)
    return df

# Prepare training dataset
train_df = filter_df(train_df).copy()
train_df = fix_skewed_data(train_df)
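
The same normalization can also be written as a single mapping. The sketch below is an alternative under stated assumptions: the toy frame, the `side_map` dictionary and the helper name `normalize_sides` are hypothetical, and it returns a copy rather than modifying the frame in place.

```python
import pandas as pd

# Hypothetical sample with mixed encodings of handedness.
toy = pd.DataFrame({
    'pitcher_side': ['Right', 'R', 'L', 'Left'],
    'batter_side': ['L', 'Left', 'R', 'Right'],
})

side_map = {'R': 'Right', 'L': 'Left'}  # abbreviations -> full labels

def normalize_sides(df, columns=('pitcher_side', 'batter_side')):
    out = df.copy()  # leave the caller's frame untouched
    for col in columns:
        # Values that are not keys of side_map are kept as they are.
        out[col] = out[col].replace(side_map)
    return out

fixed = normalize_sides(toy)
print(fixed)
```

A mapping keeps all the substitutions in one place, which may be convenient if further abbreviations turn up in the data.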

We then look at the columns and decide which ones are irrelevant.

In [5]:
train_df.columns

Index(['pitcher_id', 'pitcher_side', 'batter_id', 'batter_side', 'stadium_id',
       'umpire_id', 'catcher_id', 'inning', 'top_bottom', 'outs', 'balls',
       'strikes', 'release_speed', 'vert_release_angle', 'horz_release_angle',
       'spin_rate', 'spin_axis', 'tilt', 'rel_height', 'rel_side', 'extension',
       'vert_break', 'induced_vert_break', 'horz_break', 'plate_height',
       'plate_side', 'zone_speed', 'vert_approach_angle',
       'horz_approach_angle', 'zone_time', 'x55', 'y55', 'z55', 'pitch_type',
       'pitch_call', 'pitch_id'],
      dtype='object')

Here, we can determine that some of the columns are not related to the dependent variable (DV) at all. For example, most of the ids are irrelevant to the call of the pitch (however, note that particular umpires may have tendencies to call pitches in 'their' own ways). Thus, we will exclude them from the list of independent variables (IVs). To do so, we will simply drop those columns from the df.

In [6]:
X = train_df.copy()
X = X.drop(['pitcher_id', 'batter_id', 'stadium_id', \
                        'umpire_id', 'catcher_id', 'pitch_call', \
                        'pitch_id', 'tilt'], axis = 1)

I decided to exclude the 'tilt' column because it is a categorical variable with many different values; encoding it would make the df much bigger and hard to manage. As a future improvement, we can find a way to assign scores to the different tilts so that the column can easily be included in the IV list. Since we are done with the IV matrix, we can then set up the DV column:

In [7]:
y = train_df[['pitch_call']]
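
The tilt scoring idea mentioned above could look something like the sketch below. It assumes that `tilt` is stored as a clock-face string such as `'1:30'`, which the source does not confirm, and the helper names `tilt_to_degrees` and `tilt_to_features` are hypothetical.

```python
import math

def tilt_to_degrees(tilt):
    # Assumes a clock-face string 'H:MM'; 12:00 maps to 0 degrees.
    hours, minutes = tilt.split(':')
    return (int(hours) % 12) * 30.0 + int(minutes) * 0.5

def tilt_to_features(tilt):
    # Sine/cosine pair so that 11:45 and 12:15 end up close together.
    angle = math.radians(tilt_to_degrees(tilt))
    return math.sin(angle), math.cos(angle)

print(tilt_to_degrees('1:30'))   # 45.0
print(tilt_to_features('12:00'))  # (0.0, 1.0)
```

Two numeric columns would then replace the many dummy columns a one-hot encoding of `tilt` would need.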

Note that up to this point, we have not included anything from [test.csv](./data/2020-test.csv). The reason is that we will first determine which model (logistic regression, KNN, kernel SVM, naive Bayes, decision tree, or random forest) has the best accuracy on an 80-20 train/test split of the training data.

In [None]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
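
The model comparison described above can be sketched as a loop over candidate classifiers. This is a minimal illustration on synthetic data from `make_classification`, not the pitch data; apart from the Random Forest settings, which match the ones used later in this notebook, the hyperparameters are scikit-learn defaults and therefore assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the pitch data (assumption: not the real dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

candidates = {
    'Logistic Regression': LogisticRegression(random_state=0),
    'KNN': KNeighborsClassifier(),
    'Kernel SVM': SVC(kernel='rbf', random_state=0),
    'Naive Bayes': GaussianNB(),
    'Decision Tree': DecisionTreeClassifier(criterion='entropy', random_state=0),
    'Random Forest': RandomForestClassifier(n_estimators=10, criterion='entropy',
                                            random_state=0),
}

scores = {}
for name, model in candidates.items():
    model.fit(Xd_train, yd_train)
    scores[name] = model.score(Xd_test, yd_test)  # accuracy on the held-out 20%

# Print the candidates from most to least accurate.
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {acc:.3f}')
```

The ranking on synthetic data says nothing about the pitch data; the point is only the shape of the comparison.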

Some of the columns are categorical. Thus, we must convert them to numbers using the dummy variable method before they can be fit/transformed by an ML algorithm. We do this by:

In [8]:
# convert categorical variables
# Create dummy variables for pitcher_side, batter_side, and pitch_type
# note all the dummies should be 1 category less than their original # of categories
dummies_pitcher_side = pd.get_dummies(X_train[['pitcher_side']])
dummies_pitcher_side = dummies_pitcher_side.iloc[:, 1:]

dummies_batter_side = pd.get_dummies(X_train[['batter_side']])
dummies_batter_side = dummies_batter_side.iloc[:, 1:]

dummies_pitch_type = pd.get_dummies(X_train[['pitch_type']])
dummies_pitch_type = dummies_pitch_type.iloc[:, 0:5]

X_train = X_train.drop(['pitcher_side', 'batter_side', \
                        'pitch_type'], axis = 1)
dummies = pd.concat([dummies_pitcher_side, dummies_batter_side, dummies_pitch_type],
                    axis = 1)
X_train = pd.concat([dummies, X_train], axis = 1)

# convert strikes and balls to zeros/ones
dummies_strikes = pd.get_dummies(y_train[['pitch_call']])
dummies_strikes = dummies_strikes.iloc[:, 1:]
y_train = np.array([x for x in dummies_strikes['pitch_call_StrikeCalled']])


# convert categorical variables for test values
# Create dummy variables for pitcher_side, batter_side, and pitch_type
# note all the dummies should be 1 category less than their original # of categories
dummies_pitcher_side = pd.get_dummies(X_test[['pitcher_side']])
dummies_pitcher_side = dummies_pitcher_side.iloc[:, 1:]

dummies_batter_side = pd.get_dummies(X_test[['batter_side']])
dummies_batter_side = dummies_batter_side.iloc[:, 1:]

dummies_pitch_type = pd.get_dummies(X_test[['pitch_type']])
dummies_pitch_type = dummies_pitch_type.iloc[:, 0:5]

X_test = X_test.drop(['pitcher_side', 'batter_side', \
                        'pitch_type'], axis = 1)
dummies = pd.concat([dummies_pitcher_side, dummies_batter_side, dummies_pitch_type],
                    axis = 1)
X_test = pd.concat([dummies, X_test], axis = 1)

# convert strikes and balls to zeros/ones
dummies_strikes = pd.get_dummies(y_test[['pitch_call']])
dummies_strikes = dummies_strikes.iloc[:, 1:]
y_test = np.array([x for x in dummies_strikes['pitch_call_StrikeCalled']])

`pd.get_dummies()` returns two columns if there are two original categories. For example, in `y_train`, since there are _StrikeCalled_ and _BallCalled_, two columns are returned carrying the same information. From the line `dummies_strikes = pd.get_dummies(y_train[['pitch_call']])`, we receive a df with two columns, `pitch_call_BallCalled` and `pitch_call_StrikeCalled`, filled with 0's and 1's. One column is the opposite of the other, and we only need one.

Since we want _strike_ to be **1**, we choose `StrikeCalled`.
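
One caveat with encoding the train and test splits separately is that the two sides can end up with different dummy columns if a category is missing from one of them. The sketch below shows one possible way to keep them aligned, using `drop_first=True` on the training side and `reindex` on the test side. The pitch type codes and frame names are made up for illustration, and this is not what the notebook code above does.

```python
import pandas as pd

# Hypothetical train/test slices; the test slice lacks two pitch types.
train_part = pd.DataFrame({'pitch_type': ['FA', 'SL', 'CH', 'CU']})
test_part = pd.DataFrame({'pitch_type': ['SL', 'FA']})

# drop_first=True removes one redundant column (the baseline category).
train_dummies = pd.get_dummies(train_part, columns=['pitch_type'], drop_first=True)

# Encode the test slice in full, then force it into the training layout:
# columns unseen in training are dropped, missing ones are filled with 0.
test_dummies = pd.get_dummies(test_part, columns=['pitch_type'])
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(test_dummies.astype(int).values.tolist())
```

With this layout, a fitted scaler or classifier sees the same columns in the same order on both sides.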

We can then do the same to the test set, except for `y_test`, since we will be predicting it later on. The code can be found [here](./logistic_regression.py). We are then ready to perform feature scaling as the final step before training the model. We can use the standard scaler included in the _scikit-learn_ library.

In [10]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

This cell emits scikit-learn warnings noting that the int and double columns are all being converted to floats. They should not impact the results.
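
To make the scaling step concrete, the sketch below reproduces the standardization by hand with _numpy_ on a made-up column of release speeds. The values and variable names are hypothetical; the point is that the mean and standard deviation come from the training data only and are reused unchanged on the test data, which is what `fit_transform` followed by `transform` does.

```python
import numpy as np

# Hypothetical release speeds (mph) for a train and a test slice.
train_speed = np.array([[88.0], [92.0], [96.0]])
test_speed = np.array([[90.0], [100.0]])

# Statistics come from the training data only ...
mean = train_speed.mean(axis=0)
std = train_speed.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# ... and are then reused unchanged on the test data.
train_scaled = (train_speed - mean) / std
test_scaled = (test_speed - mean) / std
print(train_scaled.ravel(), test_scaled.ravel())
```

The scaled training column has mean 0 and standard deviation 1, while the test column generally does not, and that is expected.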

Some values in `X_train` and `X_test` are NaN and cannot be fitted into the model. Thus, these rows are removed accordingly (from both `X` and `y` matrices).

In [None]:
# Keep only the rows that contain no NaN values,
# applying the same mask to X and y so they stay aligned
train_ok = ~np.isnan(X_train).any(axis = 1)
X_train = X_train[train_ok]
y_train = y_train[train_ok]

test_ok = ~np.isnan(X_test).any(axis = 1)
X_test = X_test[test_ok]
y_test = y_test[test_ok]

We are now ready to test our first model: Logistic Regression. The `scikit-learn` library ships ready-made implementations of these models, so we will simply use them.

In [None]:
# Logistic Regression
# Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

Note that some of the models may take a few minutes to train. The confusion matrix from Logistic Regression is as follows:

|Actual/Pred | 0      | 1     |
|---------------|-------|-------|
| 0 | 37440 | 3706 |
| 1 | 16131 | 3351 |

The accuracy from the confusion matrix is measured as below:
$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$

$$ \text{Logistic Regression Accuracy} = \frac{37440 + 3351}{37440 + 3351 + 3706 + 16131} = 0.6728 = 67.28\% $$
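
The same calculation can be checked in a few lines of Python. The helper name `accuracy` is hypothetical, and the matrix layout (rows are actual classes, columns are predicted classes) follows the table above.

```python
def accuracy(cm):
    # cm is [[TN, FP], [FN, TP]] with rows = actual, columns = predicted.
    (tn, fp), (fn, tp) = cm
    return (tp + tn) / (tp + tn + fp + fn)

logistic_cm = [[37440, 3706], [16131, 3351]]
print(f'{accuracy(logistic_cm):.4f}')  # 0.6728
```

The function applies unchanged to the confusion matrices of the other models.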

We can then calculate the accuracy profiles from other models. The code is given [here](./logistic_regression_multiple_tests.py).


##### KNN
|Actual/Pred | 0      | 1     |
|---------------|-------|-------|
| 0 | 35561 | 5585 |
| 1 | 4925 | 14557 |

$$ \text{KNN Accuracy} = 82.66\% $$


##### Kernel SVM
|Actual/Pred | 0      | 1     |
|---------------|-------|-------|
| 0 | 38915 | 2231 |
| 1 | 2350 | 17132 |

$$ \text{Kernel SVM Accuracy} = 92.44\% $$


##### Naive Bayes
|Actual/Pred | 0      | 1     |
|---------------|-------|-------|
| 0 | 35629 | 5517 |
| 1 | 5216 | 14266 |

$$ \text{Naive Bayes Accuracy} = 82.30\% $$


##### Decision Tree
|Actual/Pred | 0      | 1     |
|---------------|-------|-------|
| 0 | 38192 | 2954 |
| 1 | 3115 | 16367 |

$$ \text{Decision Tree Accuracy} = 89.99\% $$


##### Random Forest
|Actual/Pred | 0      | 1     |
|---------------|-------|-------|
| 0 | 35561 | 5585 |
| 1 | 4925 | 14557 |

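$$ \text{Random Forest Accuracy} = 91.86\% $$

Note that the matrix shown for Random Forest is identical to the KNN matrix, whose counts give 82.66%; the 91.86% figure is the one reported for this model and used below.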

As you can see above, the **Kernel SVM** model is the most accurate algorithm, with over 92% accuracy. However, it is worth noting that the _Kernel SVM_ model took more than 10 minutes to complete, whereas _Random Forest_ took only a few seconds and produced a very strong result. Since speed could be a potential factor, I decided to choose the **Random Forest** model.

In [None]:
# Random Forest --> 91.86%
# Fitting Random Forest Classification to the Training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

We are then ready to predict the pitches in test.csv. First, we have to fix the data like we did for the training dataset. We change R/L to Right/Left, convert the categorical variables, and feature-scale the data. After feature-scaling, about _2000 rows_, or 1.5 percent of the data, contain NaN values and are removed. We will take note of these rows and fill them in with either a strike or a ball at the end (50% chance of getting it right).

In [None]:
# Test.csv data
# Prepare the df
test_df = pd.read_csv('./data/2020-test.csv')
test_df = fix_skewed_data(test_df)
X_test = test_df.copy()
X_test = X_test.drop(['pitcher_id', 'batter_id', 'stadium_id', \
                        'umpire_id', 'catcher_id', 'is_strike', \
                        'pitch_id', 'tilt'], axis = 1)

# convert categorical variables
# Create dummy variables for pitcher_side, batter_side, and pitch_type
# note all the dummies should be 1 category less than their original # of categories
dummies_pitcher_side = pd.get_dummies(X_test[['pitcher_side']])
dummies_pitcher_side = dummies_pitcher_side.iloc[:, 1:]

dummies_batter_side = pd.get_dummies(X_test[['batter_side']])
dummies_batter_side = dummies_batter_side.iloc[:, 1:]

dummies_pitch_type = pd.get_dummies(X_test[['pitch_type']])
dummies_pitch_type = dummies_pitch_type.iloc[:, 0:5]

X_test = X_test.drop(['pitcher_side', 'batter_side', \
                        'pitch_type'], axis = 1)
dummies = pd.concat([dummies_pitcher_side, dummies_batter_side, dummies_pitch_type],
                    axis = 1)
X_test = pd.concat([dummies, X_test], axis = 1)

# feature scale
X_test = sc_X.transform(X_test)

# remove NaN values
# 2263 rows removed
nan_rows = np.isnan(X_test).any(axis = 1)
# positions of the rows that contain at least one NaN
rows_to_delete = np.where(nan_rows)[0].tolist()
        
X_test = np.delete(X_test, rows_to_delete, axis = 0)

In [None]:
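# predict the results
y_test = classifier.predict(X_test)
# fill in a ball (0) wherever the row was dropped because of NaN values
deleted = set(rows_to_delete)  # set lookup is much faster than a list
result = []
y_index = 0

# walk over every original row so that dropped rows at the end are kept too
for index in range(len(test_df)):
    if index in deleted:
        result.append(0)
    else:
        result.append(y_test[y_index])
        y_index += 1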

result = pd.DataFrame(result)
result.columns = ['Strikes']
pitch_id = test_df[['pitch_id']]
result = pd.concat([result, pitch_id], axis = 1)
result.to_csv('./result/2020-test-predictions.csv',
              index = None,
              encoding = 'utf-8',
              header = True)

The results are saved in this [file](./result/2020-test-predictions.csv). The accuracy is expected to be as follows: <br/>
91.86% of 98.5% of all rows --> 0.9186 * 0.985 = 0.9048 <br/>
50% of 1.5% rows --> 0.5 * 0.015 = 0.0075 <br/>
Total = 0.9123 = 91.23%
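
The estimate above is a weighted average, which the short sketch below reproduces. The function name `expected_accuracy` and its parameter names are hypothetical; the three numbers are the ones used in the text.

```python
def expected_accuracy(model_acc, kept_share, fallback_acc):
    # Weighted average over rows the model scores and rows filled with a default.
    return model_acc * kept_share + fallback_acc * (1.0 - kept_share)

estimate = expected_accuracy(model_acc=0.9186, kept_share=0.985, fallback_acc=0.5)
print(f'{estimate:.4f}')  # 0.9123
```

Changing `fallback_acc` shows how sensitive the total is to the assumption made about the filled-in rows.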

##### Improvement of the model
As mentioned above, we can use the **Kernel SVM** model to slightly improve the accuracy. We can also include columns that were dropped as unrelated, such as _tilt_, by converting them into _scores_ and using them as independent variables.

Another way would be to group by `umpire_id` when determining `is_strike`, since strike-calling is heavily dependent on umpires. Other general machine learning rules apply here: more training data (preferably an 80-20 split for train and test results), strike/ball recordings on all pitch types (including `foulBalls`, `strikeSwinging`), and so on.
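
A first step toward the umpire idea could be a per-umpire called-strike rate. The sketch below uses a made-up frame with hypothetical umpire ids; the resulting `strike_rate` column is an illustrative feature, not something computed in this notebook.

```python
import pandas as pd

# Hypothetical called pitches; is_strike is 1 for a called strike, 0 for a ball.
toy = pd.DataFrame({
    'umpire_id': ['u1', 'u1', 'u1', 'u2', 'u2', 'u2', 'u2'],
    'is_strike': [1, 0, 1, 0, 0, 1, 0],
})

# Share of called strikes per umpire, plus the number of pitches behind it.
umpire_rates = (toy.groupby('umpire_id')['is_strike']
                   .agg(strike_rate='mean', pitches='count'))
print(umpire_rates)
```

Such a rate could be joined back onto each pitch as an extra independent variable, ideally computed from the training rows only so that no test information leaks in.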

With the data we have, the accuracy result of over 90 percent seems to be decent enough for the time being.