# Introduction

This project examines 2019 regular-season MLB pitching and hitting statistics for individual players, which serve as our training data. We then bring in the current 2020 regular-season pitching and hitting statistics as our testing data.
 
- All datasets were pulled from:  https://www.rotowire.com/baseball/stats.php

***

# MLB Pitching Analysis and Model Building

We will build a model that predicts a pitcher's W (wins earned) from his other pitching statistics, as a measure of how good the pitcher is. To accomplish this, we will compare 3 models.

## Model building:

- Linear Regression
- Decision Tree
- Random Forest

Each model will be evaluated based on its R2 score.


## Why use R2 score?

R-squared (R2) is a statistical measure of fit that indicates how much of the variation in a dependent variable is explained by the independent variable(s) in a regression model.

On the data a model was fitted to, R-squared typically ranges from 0 to 1 and is commonly stated as a percentage from 0% to 100%, where 100% means that all variation in the dependent variable is explained by the independent variable(s). On held-out data, scikit-learn's `r2_score` can also be negative, which means the model predicts worse than always guessing the mean.
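To make the definition concrete, here is a minimal sketch that computes R2 by hand as 1 - SS_res / SS_tot. The win totals and predictions are made-up numbers for illustration only, and `r_squared` is a hypothetical helper, not part of this notebook's pipeline.

```python
# R^2 from its definition: 1 - (residual sum of squares / total sum of squares)
def r_squared(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up win totals, for illustration only
actual = [10, 12, 8, 15, 5]
good_preds = [11, 12, 7, 14, 6]   # close to the actual values
bad_preds = [5, 15, 14, 6, 12]    # far from the actual values

r2_good = r_squared(actual, good_preds)
r2_bad = r_squared(actual, bad_preds)
print(f"good predictions: {r2_good:.3f}")
print(f"bad predictions:  {r2_bad:.3f}")
```

The second score comes out below zero, which is why a negative R2 on test data should be read as "worse than predicting the mean" rather than as an error in the metric.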


## To Do List:
- Imports
- Clean the data if need be
- Visualize the data and relationships
- Start building models
- Fine tune the models
- Find the best models for the datasets
- Compare the test data and the predictions

In [1]:
# Required Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

import sklearn
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score

# Models
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression 
from sklearn.tree import DecisionTreeRegressor

In [2]:
# Read-in 2019 Pitching Data
df_pitch = pd.read_csv('data/pitching_data/mlb-pitching-2019.csv', index_col='Player')
df_pitch.head()

FileNotFoundError: [Errno 2] File b'data/pitching_data/mlb-pitching-2019.csv' does not exist: b'data/pitching_data/mlb-pitching-2019.csv'

In [None]:
# Describe Data
df_pitch.describe()

In [None]:
#Get info on the Data
df_pitch.info()

In [None]:
#Drop Team Column
df_pitch = df_pitch.drop(['Team'], axis=1)

In [None]:
#Get info on the Data
df_pitch.info()

## Correlation Matrix

The correlation matrix allows us to compare the various data points to identify how correlated each is to the others.  From this information, we can determine which features to use for building our models.

In [None]:
# Correlation Matrix of quantitative features
c_pitch = df_pitch.corr()

In [None]:
#visualizing the correlation matrix
plt.figure(figsize=(30,15))
sns.heatmap(c_pitch,cmap="BrBG", annot=True)

In [None]:
# Getting features that have a correlation of at least 0.5 with W for model building

selected_features = []
for i in range(len(c_pitch['W'])):
  if c_pitch['W'].values[i] >= 0.5 and c_pitch['W'].values[i] != 1.0:
    selected_features.append(c_pitch['W'].index[i])
  
selected_features
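The same threshold filter can be sketched without an explicit loop. The toy DataFrame below uses made-up numbers and hypothetical names (`toy`, `toy_selected`); note that it excludes the target by name rather than by testing for a correlation of exactly 1.0, which is a small change from the loop above.

```python
import pandas as pd

# Toy data (made-up numbers) standing in for the pitching table
toy = pd.DataFrame({
    "W":  [2, 4, 6, 8, 10],
    "IP": [20, 45, 55, 85, 100],
    "BB": [9, 3, 7, 2, 8],
})

# Correlation of every column with W, without W's correlation with itself
corr_with_w = toy.corr()["W"].drop("W")

# Keep the columns whose correlation with W is at least 0.5
toy_selected = corr_with_w[corr_with_w >= 0.5].index.tolist()
print(toy_selected)
```

Here only `IP` passes the threshold, because `BB` is negatively correlated with `W` in the toy data.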

# Splitting the Data to train and test the various models.

In [None]:
X_ = df_pitch[selected_features]
y_ = df_pitch.W

In [None]:
X_train_, X_test_, y_train_, y_test_ = train_test_split(X_, y_,test_size = .20)

## Linear Regression

In [None]:
# Linear Regression Pipeline
lr = Pipeline(steps=[('LinReg', LinearRegression())])
lr.fit(X_train_, y_train_)

lr_preds_ = lr.predict(X_test_)
print(f'Linear Regression model R2 score:  {r2_score(y_test_, lr_preds_)}')
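A single train/test split gives one R2 value that depends on which rows happen to land in the test set. As a hedged sketch, `cross_val_score` (imported above but not otherwise used) can average the score over several folds. The data here is synthetic and the names `X_demo`, `y_demo`, and `cv_scores` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data: three features with known coefficients plus small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Five-fold cross-validation, scoring each fold with R^2
cv_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring="r2")
print(f"mean R2: {cv_scores.mean():.3f}, std: {cv_scores.std():.3f}")
```

On the real data, the same call could be made with `X_` and `y_` in place of the synthetic arrays.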

## Decision Trees

In [None]:
#Hyperparameter optimization for DecisionTreeRegressor
parameters_ = {
    'max_depth':[15,20,30],
}
dtc_ = Pipeline(steps=[('CV',GridSearchCV(DecisionTreeRegressor(), parameters_, cv = 5))])
dtc_.fit(X_train_, y_train_)
dtc_.named_steps['CV'].best_params_

In [None]:
# Decision tree pipeline
dt_ = Pipeline(steps=[('DecisionTree', DecisionTreeRegressor(max_depth=20))])
dt_.fit(X_train_,y_train_)
dt_preds_ = dt_.predict(X_test_)
print(f'Decision Trees model R2 score:  {r2_score(y_test_, dt_preds_)}')
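The pipeline above hard-codes `max_depth=20` after inspecting the grid search result by eye. A possible alternative, sketched here on synthetic data with illustrative names (`search`, `best_tree`), is to reuse the estimator that `GridSearchCV` has already refitted with the best parameters.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends linearly on the first feature, plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3 * X_demo[:, 0] + rng.normal(scale=1.0, size=200)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"max_depth": [2, 4, 8]},
    cv=5,
)
search.fit(X_demo, y_demo)

# With the default refit=True, best_estimator_ is already trained on all of X_demo
best_tree = search.best_estimator_
print(search.best_params_, best_tree.get_depth())
```

This keeps the tuned value and the final model in sync if the grid or the data changes.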

## Random Forest

In [None]:
# Hyperparameter optimization of RandomForestRegressor
parameters = {
    'max_depth':[6,12,15,20],
    'n_estimators':[20,30]
}
rfc = Pipeline([('CV',GridSearchCV(RandomForestRegressor(), parameters, cv = 5))])
rfc.fit(X_train_, y_train_)
rfc.named_steps['CV'].best_params_

In [None]:
# Random forest pipeline
rf = Pipeline(steps=[('RandomForest', RandomForestRegressor(max_depth=6,n_estimators=30))])
rf.fit(X_train_,y_train_)
rf_preds = rf.predict(X_test_)
print(f'Random Forest model R2 score:  {r2_score(y_test_,rf_preds)}')

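### Compare predictions on the 2019 data.

We use our best model to show the predicted wins next to the real W (wins earned) for every player in the 2019 dataset. Note that `X_` contains both the training and the test rows, so this comparison is not a held-out evaluation.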

In [None]:
y_pred = lr.predict(X_)

In [None]:
predictions_df = pd.DataFrame({'Real W':df_pitch.W, 'Predicted W':y_pred})

In [None]:
predictions_df.head()

In [None]:
# Exporting Model for later use
import pickle

filename_lr = 'lr_pitching_model.sav'
# Use a context manager so the file is closed after writing
with open(filename_lr, 'wb') as f:
    pickle.dump(lr, f)

---

### 2020 Pitching Predictions

Now we test the model again on independent data: we feed it the 2020 statistics and compare its predictions with the actual 2020 wins.

In [None]:
#Read in new Data
df = pd.read_csv('data/pitching_data/mlb-pitching-2020.csv', index_col='Player')
df = df.drop(['Team'], axis=1)
df.head()

In [None]:
# Features used to test the model (hard-coded; the correlation-based selection below is shown for reference only)
features = ['GS', 'IP', 'H', 'ER', 'K', 'BB', 'HR', 'L']

c = df.corr()
selected_features = []
for i in range(len(c['W'])):
  if c['W'].values[i] >= 0.5 and c['W'].values[i] != 1.0:
    selected_features.append(c['W'].index[i])
  
selected_features

In [None]:
# Creating X and y
X = df[features]
y = df.W

In [None]:
# This is a single test row: Justin Verlander's 2019 stat line
fake_test = pd.DataFrame({
    'GS':34,
    'IP':223.0,
    'H':137,
    'ER':64,
    'K':300,
    'BB':42,
    'HR':36,
    'L':6
}, index=[0])
fake_test.head()

In [None]:
# Load in Model
with open('lr_pitching_model.sav', 'rb') as f:
    loaded_model = pickle.load(f)

In [None]:
#Making predictions
y_pred = loaded_model.predict(X)

# Quick prediction to sanity-check the model, because the 2020 data is very sparse at the moment
fake_pred = loaded_model.predict(fake_test)

# He was credited with 21 wins for the 2019 season
print(f"Actual wins: 21, Predicted wins: {fake_pred[0]}")

In [None]:
# Getting r2 score
r2_score(y, y_pred)

In [None]:
# Making new DataFrame with real vs predicted
df_tested = pd.DataFrame({'Real W':y, 'Predicted W':y_pred})

In [None]:
df_tested.head()

In [None]:
# Plot out test data vs predicted
plt.figure(figsize=(16,8))
plt.title('Real vs Predicted Wins')
plt.scatter(df_tested.index[:20], df_tested['Real W'][:20], color='r')
plt.scatter(df_tested.index[:20], df_tested['Predicted W'][:20], color='blue')
plt.ylabel('W')
plt.xlabel('Player')
plt.xticks(rotation=90)
plt.legend(['Real W', 'Predicted W'])

## Findings

The 2020 MLB season and the players' stats will be heavily affected by the reduced number of games that will be played (60 total games).

***

# 2019 MLB Hitting Analysis and Model Building

### About the data
- All datasets were pulled from:  https://www.rotowire.com/baseball/stats.php


## Project Overview

In this notebook, we will build a model that predicts a hitter's AVG (batting average) from his other hitting statistics, as a measure of how good the hitter is.

To accomplish this, we will examine 5 models.

#### Model building:

- Linear Regression
- RandomForest
- Lasso
- Ridge
- Elastic Net

Each model will be evaluated based on model score (R2, as returned by the pipeline's `score` method), MAE, and MSE.

In [None]:
#Model Building Imports
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

#Metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
# Read in dataset
df = pd.read_csv('data/hitting_data/2019-batting-players.csv', index_col='Player')

In [None]:
# looking at the first 5 rows of data
df.head(5)

In [None]:
df = df.drop('Team', axis=1)

In [None]:
# Describing the dataset
print(round(df.describe()))

In [None]:
# Getting count of missing values
missing_value_count = df.isnull().sum()

# Getting missing values for all columns
missing_value_count[:len(df.columns)]

### There are no missing values in this dataset

Below we have defined several functions that will help to visualize the relationships between data points.

In [None]:
# Visualization Functions
def create_distplot(df, column):
  plt.figure(figsize=(10,5))
  plt.title(f"Distplot: {column}")
  # histplot with kde=True replaces the deprecated sns.distplot
  sns.histplot(df[column], kde=True, color="g")

def create_scatter(df, x, y):
  plt.figure(figsize=(10,5))
  plt.title(f"Scatter: {x} vs {y}")
  # Recent seaborn versions require x and y to be passed as keywords
  sns.scatterplot(data=df, x=x, y=y)

def create_heatmap(df):
  corr = df.corr()
  plt.figure(figsize=(16,8))
  # Generate a custom diverging colormap
  cmap = sns.diverging_palette(220, 10, as_cmap=True)

  # Draw the heatmap with the correct aspect ratio
  sns.heatmap(corr, cmap=cmap, vmax=.3, center=0,
              square=True, linewidths=.5, cbar_kws={"shrink": .5})
  
def create_lineplot(df, x, y, hue=None, style=None):
  plt.figure(figsize=(16,8))
  plt.title(f"Lineplot: {x} vs {y}")
  sns.lineplot(x=x, y=y,
               hue = hue,
               style=style,
               data=df)

In [None]:
# Scatter for Hits vs Batting Average
create_scatter(df, 'H', 'AVG')

In [None]:
# Scatter for Homeruns vs Batting Average
create_scatter(df, 'HR', 'AVG')

In [None]:
#Scatter for Runs Batted In vs Batting Average
create_scatter(df, 'RBI', 'AVG')

In [None]:
#Heatmap for data
create_heatmap(df)  # create_heatmap computes the correlation matrix itself

In [None]:
# Getting all continuous features in one list
cont_features = ['Age', 'G', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'SB',
                 'CS', 'BB', 'SO', 'SH', 'SF', 'HBP', 'AVG', 'OBP', 'SLG', 'OPS']

In [None]:
# making dist plots for all cont features
for feature in cont_features:
  create_distplot(df, feature)

## Correlation Matrix

The correlation matrix allows us to compare the various data points to identify how correlated each is to the others.  From this information, we can determine which features to use for building our models.

In [None]:
df_corr = df.corr()
c = df_corr

In [None]:
#visualizing the correlation matrix
plt.figure(figsize=(30,15))
sns.heatmap(c,cmap="BrBG", annot=True)

In [None]:
# Getting features that have a correlation of at least 0.5 with AVG for model building

selected_features = []
for i in range(len(df_corr['AVG'])):
  if df_corr['AVG'].values[i] >= 0.5 and df_corr['AVG'].values[i] != 1.0:
    selected_features.append(df_corr['AVG'].index[i])

In [None]:
selected_features

# Splitting the Data to train and test the various models.

In [None]:
# Splitting Data into train and test
X = df[selected_features]
y = df['AVG']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

### Evaluating the Models

The mean absolute error (MAE) is the average of the absolute differences between predictions and actual values, so every individual error is weighted equally.

The mean squared error (MSE) is the average of the squared differences, so large errors are penalized more heavily than small ones.
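The difference between the two metrics shows up when errors are unevenly distributed. In this small sketch (made-up batting averages, with hypothetical helper names `mae` and `mse`), both sets of predictions have the same MAE, but the set with one large miss has a higher MSE.

```python
def mae(y_true, y_pred):
    # Average absolute difference
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    # Average squared difference
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up batting averages, for illustration only
actual = [0.250, 0.300, 0.275, 0.225]
even_miss = [0.270, 0.280, 0.295, 0.205]     # every prediction is off by 0.020
one_big_miss = [0.250, 0.300, 0.275, 0.305]  # one prediction is off by 0.080

print("MAE:", mae(actual, even_miss), mae(actual, one_big_miss))
print("MSE:", mse(actual, even_miss), mse(actual, one_big_miss))
```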

In [None]:
#Function to test different Models
def test_model(model_to_test, X_test, X_train, y_test, y_train):
    f_steps = [
                ('model', model_to_test)
      ]

    f_pipe = Pipeline(steps=f_steps)
    f_pipe.fit(X_train, y_train)

    f_preds = f_pipe.predict(X_test)
    print("-" * 10 + " Model Stats " + "-" * 10)
    print('\n')
    print(f"Model Score: {f_pipe.score(X_test, y_test)}")
    print(f"MAE: {mean_absolute_error(y_test, f_preds)}")
    print(f"MSE: {mean_squared_error(y_test, f_preds)}")
    print('\n')
    print('-' * 80)
    print('\n')
    plt.figure(figsize=(10,5))
    plt.title("Model Predictions")
    plt.scatter(y_test.index[:20], y_test.values[:20])
    plt.scatter(y_test.index[:20], f_preds[:20])
    plt.xticks(rotation=90)
    plt.legend(['Test Data', 'Predictions'])

    return f_pipe

## Linear Regression

In [None]:
#LinearRegression
lr_model = test_model(LinearRegression(), X_test, X_train, y_test, y_train)

## Random Forest

In [None]:
rf_model = test_model(RandomForestRegressor(), X_test, X_train, y_test, y_train)

## Lasso

In [None]:
#Lasso Model
lasso_model = test_model(Lasso(alpha=0.3), X_test, X_train, y_test, y_train)

## Ridge

In [None]:
#Ridge Model
ridge_model = test_model(Ridge(alpha=0.7), X_test, X_train, y_test, y_train)

## Elastic Net

In [None]:
#ElasticNet Model
elastic_model = test_model(ElasticNet(alpha=0.5), X_test, X_train, y_test, y_train)

***

### 2020 Hitting Predictions

Now we test the models again on independent data: we feed them the 2020 hitting statistics and compare the predictions with the actual 2020 batting averages.

In [None]:
# Bring in 2020 Data
df_2020 = pd.read_csv('data/hitting_data/2020-batting-players.csv', index_col='Player')

In [None]:
#Split out data into X and y
X_2020 = df_2020[selected_features]
y_2020 = df_2020['AVG']

In [None]:
# Make predictions
y_preds = lr_model.predict(X_2020)

In [None]:
# Score the Model
lr_model.score(X_2020, y_2020)

In [None]:
def score_new_dataset(model, X, y):
  # Make predictions
  f_preds = model.predict(X)

  # Score the Model
  f_score = model.score(X, y)

  #Print out findings
  print("Model Metrics")
  print("-" * 50)
  print('\n')
  print(f"Model Score: {f_score}")
  print(f"MAE: {mean_absolute_error(y, f_preds)}")
  print(f"MSE: {mean_squared_error(y, f_preds)}")

  # Graph Predictions
  print('\n')
  print('-' * 80)
  print('\n')
  plt.figure(figsize=(10,5))
  plt.title("First 10 Model Predictions")
  plt.scatter(y.index[:10], y.values[:10])
  plt.scatter(y.index[:10], f_preds[:10])
  plt.xticks(rotation=45)
  plt.legend(['Test Data', 'Predictions'])

  #Return predictions
  return f_preds



### Linear Regression 2020

In [None]:
#LinearRegression Scoring on 2020 Data
lr_preds = score_new_dataset(lr_model, X_2020, y_2020)

### Random Forest 2020

In [None]:
#RandomForest Scoring on 2020 Data
rf_preds = score_new_dataset(rf_model, X_2020, y_2020)

### Ridge 2020

In [None]:
#RidgeRegression Scoring on 2020 Data
ridge_preds = score_new_dataset(ridge_model, X_2020, y_2020)

### Lasso 2020

In [None]:
#LassoRegression Scoring on 2020 Data
lasso_preds = score_new_dataset(lasso_model, X_2020, y_2020)

### Elastic Net 2020

In [None]:
#ElasticNetRegression Scoring on 2020 Data
elastic_preds = score_new_dataset(elastic_model, X_2020, y_2020)

---

# Comparing the Top Models

- Linear Regression
- Random Forest
- Ridge

Plotting the various models against the actual results from 2020.

In [None]:
# Comparing the first 20 predictions of the top models
plt.figure(figsize=(16,8))
plt.title("First 20 Model Predictions")
plt.scatter(y_2020.index[:20], y_2020.values[:20])
plt.scatter(y_2020.index[:20], lr_preds[:20])
plt.scatter(y_2020.index[:20], ridge_preds[:20])
plt.scatter(y_2020.index[:20], rf_preds[:20])
plt.xticks(rotation=90)
plt.legend(['Test Data', 'LinearRegression', 'Ridge', 'Random Forest'], loc='upper left')

In [None]:
# LinearRegression Predictions
lr_predictions_df = pd.DataFrame({'Test Data': y_2020, 'LR Predictions':lr_preds})

In [None]:
# Let's see the lr_predictions_df
lr_predictions_df.head()

In [None]:
# RidgeRegression Predictions
ridge_predictions_df = pd.DataFrame({'Test Data': y_2020, 'Ridge Predictions':ridge_preds})

In [None]:
# Let's see the ridge_predictions_df
ridge_predictions_df.head()

In [None]:
# Exporting Model for later use
import pickle

# Use context managers so the files are closed after writing
filename_ridge = 'ridge_hitting_model.sav'
with open(filename_ridge, 'wb') as f:
    pickle.dump(ridge_model, f)

filename_lr = 'lr_hitting_model.sav'
with open(filename_lr, 'wb') as f:
    pickle.dump(lr_model, f)

---

# Final End of Season Statistical Predictions

In [None]:
# Bringing in the predictions
pitching_df = pd.read_csv('data/output_data/pitching_predictions.csv', index_col='Player')

lr_hitting_df = pd.read_csv('data/output_data/lr_hitting_predictions.csv', index_col='Player')
ridge_hitting_df = pd.read_csv('data/output_data/ridge_hitting_predictions.csv', index_col='Player')

## Pitching Predictions 2020 End of Season

In [None]:
# Pitching Preds
pitching_df.head()

In [None]:
# Plotting out accuracy of pitching predictions
plt.figure(figsize=(10,5))
plt.title("First 10 Real Wins vs Predicted Wins")
plt.plot(pitching_df.index[:10], pitching_df['Real W'][:10], color='b')
plt.plot(pitching_df.index[:10], pitching_df['Predicted W'][:10], color='g')
plt.xticks(rotation=45)
plt.legend(['Real Wins', 'Predicted Wins'])
plt.show()

In [None]:
# Function to calculate Expected Pitcher Wins (xW) from current wins, scaling 18 games played up to the 60-game season
def calc_xW(w): 
  if w != 0:
    return ((w / 18) * 60) * 2.25
  else:
    return 0
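The constants in `calc_xW` can be made explicit as named parameters. The sketch below is one possible rewrite with a hypothetical name, `prorate_wins`; it assumes that 18 is the number of games played so far and 60 is the season length. The source does not say what the 2.25 factor represents, so it is kept as an opaque `scale` parameter.

```python
def calc_xW(w):
    # Copied from the cell above for comparison
    if w != 0:
        return ((w / 18) * 60) * 2.25
    else:
        return 0

def prorate_wins(wins, games_played=18, season_games=60, scale=2.25):
    """Scale a win total from games_played up to season_games, then apply scale.

    games_played and season_games are assumptions read off the original formula;
    the meaning of scale is not stated in the source.
    """
    return wins / games_played * season_games * scale

# The two functions agree for every non-negative win total tried here
for w in range(10):
    print(w, calc_xW(w), prorate_wins(w))
```

Because zero wins already maps to zero under the formula, the explicit `if w != 0` branch is not needed in the parameterized version.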

In [None]:
# Apply function from above
pitching_df['xW'] = pitching_df['Real W'].apply(calc_xW)

In [None]:
# Rename the columns
pitching_df = pitching_df.rename(columns={'Real W':'Current Wins', 'Predicted W':'Predicted Wins',
                                          'xW':'Current Predicted Wins (EOS)'})
pitching_df.head()

***

## Hitting Predictions 2020 End of Season

In [None]:
# LR Hitting
lr_hitting_df.head()

In [None]:
# Plotting out Accuracy of LinearRegressor Predictions for hitting
plt.figure(figsize=(10,5))
plt.title("First 10 Test Data vs Predicted Batting Average")
plt.plot(lr_hitting_df.index[:10], lr_hitting_df['Test Data'][:10], color='b')
plt.plot(lr_hitting_df.index[:10], lr_hitting_df['LR Predictions'][:10], color='g')
plt.xticks(rotation=45)
plt.legend(['Test Data', 'LR Predictions'])
plt.show()

In [None]:
# Ridge Hitting
ridge_hitting_df.head()

In [None]:
# Plotting out Accuracy of RidgeRegressor Predictions for hitting
plt.figure(figsize=(10,5))
plt.title("First 10 Test Data vs Predicted Batting Average")
plt.plot(ridge_hitting_df.index[:10], ridge_hitting_df['Test Data'][:10], color='b')
plt.plot(ridge_hitting_df.index[:10], ridge_hitting_df['Ridge Predictions'][:10], color='g')
plt.xticks(rotation=45)
plt.legend(['Test Data', 'Ridge Predictions'])
plt.show()

In [None]:
# Calculate xBA based on total 2020 games and number of games played 
def calc_xba(ba):
  if ba < 1.0: 
    return (ba / 18) * 60 - ba * 2
  else:
    return ba
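It may help to see what `calc_xba` does algebraically: for `ba < 1.0` the expression `(ba / 18) * 60 - ba * 2` equals `ba * (60/18 - 2)`, which is `ba * 4/3`. The following sketch checks that identity with exact fractions; the sample averages are made-up values.

```python
from fractions import Fraction

def calc_xba(ba):
    # Copied from the cell above
    if ba < 1.0:
        return (ba / 18) * 60 - ba * 2
    else:
        return ba

# Exact arithmetic: 60/18 - 2 reduces to 4/3
factor = Fraction(60, 18) - 2
print(factor)

# Made-up batting averages, for illustration only
for ba in (0.200, 0.250, 0.300):
    print(ba, calc_xba(ba), ba * float(factor))
```

In other words, the function multiplies every predicted average below 1.0 by a constant factor of 4/3.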

In [None]:
# Apply function from above
ridge_hitting_df['xBA'] = ridge_hitting_df['Ridge Predictions'].apply(calc_xba)

In [None]:
# Rename some columns
ridge_hitting_df = ridge_hitting_df.rename(columns={'Test Data':'Current BA', 'Ridge Predictions':'BA Prediction',
                                                    'xBA':'Expected BA (EOS)'})
ridge_hitting_df.head()

---

# Findings

The 2020 MLB season and the players' stats will be heavily affected by the reduced number of games that will be played (60 total games).

With this unexpected hitch in the 2020 season, it is difficult to make time series predictions for the entire season.

For this reason I had to use some simple arithmetic to predict the xBA (Expected Batting Average) and xW (Expected Pitching Wins).
Because these estimates extrapolate from a small number of games, there is an upper limit on their accuracy.