# **HOF Predictions of Current MLB Players**

This notebook generates Hall of Fame (HOF) predictions for current MLB players. I do so in the following steps:

* Loading the data (cleaned and prepared in R)
* Partitioning the data into training and validation sets
  * Oversampling the minority class of the response variable `inducted` (binary Y/N)
* Testing different binary classification models
* Evaluating the models and selecting the best one

Questions:
* Select a single best model, or summarize the predictions of all models tested?
  * e.g. Mike Trout was predicted to make the HOF in 7/8 models
  * A: Select the single best model and use it to make predictions
* Where in the process should SMOTE oversampling be applied?
  * A: Right before fitting the model
* Use GitHub as the medium for the blog/results?
  * A: Perfect!

# **1) Load the data**

In [1]:
import pandas as pd

# Read in the training/validation data
df = pd.read_csv("train.csv")

In [2]:
# View the first 5 observations of the data frame
df.head()

,Unnamed: 0,playerID,LOS,recent_year,G,AB,R,H,X2B,X3B,...,SB,CS,SO,BB,IBB,HBP,SH,SF,GIDP,inducted
0,1,aaronha01,23,1976,143.391304,537.565217,94.521739,163.956522,27.130435,4.26087,...,10.434783,3.173913,60.130435,60.956522,12.782609,1.391304,0.913043,5.26087,14.26087,Y
1,2,aaronto01,7,1971,62.428571,134.857143,14.571429,30.857143,6.0,0.857143,...,1.285714,1.142857,20.714286,12.285714,0.428571,0.0,1.285714,0.857143,5.142857,N
2,3,abbated01,8,1910,103.375,367.75,43.25,93.5,11.875,5.375,...,17.25,1.0,33.25,35.125,1.0,4.0,11.375,1.0,3.0,N
3,4,abbotfr01,3,1905,53.333333,171.0,16.0,35.666667,7.0,2.0,...,4.666667,1.0,25.0,6.333333,1.0,2.666667,6.666667,1.0,3.0,N
4,5,abbotje01,5,2001,46.6,119.2,16.4,31.4,6.6,0.4,...,1.2,1.0,18.2,7.6,0.4,0.6,1.0,1.4,2.4,N


In [3]:
# View the dimensions of the data frame
df.shape

(6098, 22)
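
The `Unnamed: 0` column in the preview looks like a row index that was written out with the CSV (R's `write.csv` includes row names unless `row.names = FALSE` is set). A minimal sketch, using a small inline CSV as a stand-in for `train.csv`, of how `index_col=0` would keep that column out of the data at load time:

```python
import io
import pandas as pd

# Stand-in for train.csv: a few columns from the preview above, with the
# unnamed leading column that a saved row index produces.
csv_text = """,playerID,LOS,recent_year,inducted
1,aaronha01,23,1976,Y
2,aaronto01,7,1971,N
3,abbated01,8,1910,N
"""

# Default read: the saved row index comes back as a column named 'Unnamed: 0'.
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default.columns.tolist())

# index_col=0 treats that first column as the index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_indexed.columns.tolist())
```

With the index handled at load time, `'Unnamed: 0'` would no longer need to appear in the list of dropped columns.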

# **2) Partition the data into training and validation subsets**

In [4]:
# Create the feature matrix X and the label vector y
import numpy as np

X = df.drop(['inducted', 'Unnamed: 0', 'playerID', 'recent_year', 'LOS'], axis = 1)
X = np.array(X)

y = df['inducted'].map(dict(Y = 1, N = 0))

In [5]:
# SMOTE Oversampling
from imblearn.over_sampling import SMOTE

oversample = SMOTE(random_state = 630)
X, y = oversample.fit_resample(X, y)

In [6]:
# Partition the data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 630)

In [7]:
X_train.shape

(9515, 17)

In [8]:
X_test.shape

(2379, 17)

In [9]:
y_train.shape

(9515,)

In [10]:
y_test.shape

(2379,)
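
These shapes can be traced back to the original 6,098 rows. A short arithmetic sketch, assuming SMOTE's default behaviour of growing the minority class until it matches the majority class, that inducted players are the minority class, and that scikit-learn rounds the test-set size up:

```python
import math

n_original = 6098              # rows in train.csv (df.shape)
n_train, n_test = 9515, 2379   # rows after SMOTE and the 80/20 split

n_resampled = n_train + n_test          # 11894 rows after SMOTE
n_majority = n_resampled // 2           # classes are balanced after SMOTE
n_minority = n_original - n_majority    # inducted players before SMOTE
n_synthetic = n_resampled - n_original  # rows generated by SMOTE

print(n_majority, n_minority, n_synthetic)      # 5947 151 5796
print(math.ceil(0.20 * n_resampled))            # 2379
print(round(100 * n_minority / n_original, 1))  # 2.5
```

Under those assumptions, roughly 2.5% of the original players are inducted, and nearly half of the resampled rows are synthetic.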

# **3) Build classifier algorithms on training and validation data**

The algorithms we will test in this project are:
* Logistic Regression
* Decision Tree
* Random Forest
* AdaBoost
* XGBoost
* Multilayer-Perceptron Neural Network
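
Each model section repeats the same fit, predict and score steps. A minimal sketch of factoring them into one helper; `evaluate` is a hypothetical name and the data is synthetic, so this illustrates the pattern rather than reproducing the results reported here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit `model` and return the four metrics reported in this notebook."""
    model.fit(X_train, y_train)
    yhat_train = model.predict(X_train)
    yhat_test = model.predict(X_test)
    return {
        "AUC Train": roc_auc_score(y_train, yhat_train),
        "Accuracy Train": accuracy_score(y_train, yhat_train),
        "AUC Test": roc_auc_score(y_test, yhat_test),
        "Accuracy Test": accuracy_score(y_test, yhat_test),
    }


# Synthetic stand-in data (assumption, not the player statistics).
X_demo, y_demo = make_classification(n_samples=500, n_features=17, random_state=630)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=630
)

models = {
    "Logistic Regression": LogisticRegression(solver="liblinear"),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=630),
}
results = {name: evaluate(m, X_tr, y_tr, X_te, y_te) for name, m in models.items()}
for name, metrics in results.items():
    print(name, metrics)
```

The resulting dictionary can be passed to `pd.DataFrame(results).T` to build a comparison table with one row per model.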

### **Logistic Regression**

In [27]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver = 'liblinear')

In [28]:
## Fit the model on the training data
lr.fit(X_train, y_train)

## Predict on both the training and test data to compare performance
yhat_lr_train = lr.predict(X_train)
yhat_lr_test = lr.predict(X_test)

## Performance Metrics
from sklearn.metrics import accuracy_score, roc_auc_score

auc_train_logistic = roc_auc_score(y_train, yhat_lr_train)
auc_test_logistic = roc_auc_score(y_test, yhat_lr_test)
acc_train_logistic = accuracy_score(y_train, yhat_lr_train)
acc_test_logistic = accuracy_score(y_test, yhat_lr_test)

print(f'              Metrics Log. Reg                ')
print(f'_________________________________________________\n')
print(f'Training AUC: {auc_train_logistic}')
print(f'Test AUC: {auc_test_logistic}')
print(f'Training Accuracy: {acc_train_logistic}')
print(f'Test Accuracy: {acc_test_logistic}')
print(f'_________________________________________________\n')

              Metrics Log. Reg                
_________________________________________________

Training AUC: 0.8903415548001619
Test AUC: 0.8850098743407093
Training Accuracy: 0.8903836048344719
Test Accuracy: 0.8848255569567045
_________________________________________________
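
Because `roc_auc_score` is given the hard 0/1 output of `.predict`, the ROC curve has a single interior point and the score equals balanced accuracy, which is why AUC and accuracy are nearly identical on this balanced data. A minimal sketch on synthetic data (an assumption, not the player data) of the more common usage, passing the positive-class probability from `predict_proba`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=1000, n_features=17, random_state=630)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=630
)

clf = LogisticRegression(solver="liblinear").fit(X_tr, y_tr)

# AUC from hard labels: identical to balanced accuracy.
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))
bal_acc = balanced_accuracy_score(y_te, clf.predict(X_te))

# AUC from scores: uses the full ranking of the predicted probabilities.
auc_from_proba = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f"AUC from labels:        {auc_from_labels:.4f}")
print(f"Balanced accuracy:      {bal_acc:.4f}")
print(f"AUC from probabilities: {auc_from_proba:.4f}")
```

The probability-based score measures how well the model ranks players, independent of the 0.5 decision threshold.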



### **Decision Tree**

In [16]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(criterion = 'entropy')

In [17]:
## Fit the model on the training data
tree.fit(X_train, y_train)

## Predict on both the training and test data to compare performance
yhat_tree_train = tree.predict(X_train)
yhat_tree_test = tree.predict(X_test)

## Performance Metrics
from sklearn.metrics import accuracy_score, roc_auc_score

auc_train_tree = roc_auc_score(y_train, yhat_tree_train)
auc_test_tree = roc_auc_score(y_test, yhat_tree_test)
acc_train_tree = accuracy_score(y_train, yhat_tree_train)
acc_test_tree = accuracy_score(y_test, yhat_tree_test)

print(f'              Metrics Tree                ')
print(f'_________________________________________________\n')
print(f'Training AUC: {auc_train_tree}')
print(f'Test AUC: {auc_test_tree}')
print(f'Training Accuracy: {acc_train_tree}')
print(f'Test Accuracy: {acc_test_tree}')
print(f'_________________________________________________\n')

              Metrics Tree                
_________________________________________________

Training AUC: 1.0
Test AUC: 0.9654391006928296
Training Accuracy: 1.0
Test Accuracy: 0.9655317360235393
_________________________________________________
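
A training AUC and accuracy of exactly 1.0 means the unconstrained tree classifies every training row correctly, so only the test figures are informative. A hedged sketch, on synthetic data, of using 5-fold cross-validation to compare the unconstrained tree with a depth-limited one (`max_depth=5` is an arbitrary illustrative value, not a tuned setting):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption, not the player statistics).
X_demo, y_demo = make_classification(n_samples=1000, n_features=17, random_state=630)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=630)

cv_scores = {}
for label, depth in [("unconstrained", None), ("max_depth=5", 5)]:
    model = DecisionTreeClassifier(
        criterion="entropy", max_depth=depth, random_state=630
    )
    # scoring="roc_auc" ranks by predicted probability rather than hard labels.
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")
    cv_scores[label] = scores
    print(f"{label}: mean AUC {scores.mean():.3f} (+/- {scores.std():.3f})")
```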



### **Random Forest**

In [18]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200)

In [19]:
## Fit the model on the training data
rf.fit(X_train, y_train)

## Predict on both the training and test data to compare performance
yhat_rf_train = rf.predict(X_train)
yhat_rf_test = rf.predict(X_test)

## Performance Metrics
from sklearn.metrics import accuracy_score, roc_auc_score

auc_train_rf = roc_auc_score(y_train, yhat_rf_train)
auc_test_rf = roc_auc_score(y_test, yhat_rf_test)
acc_train_rf = accuracy_score(y_train, yhat_rf_train)
acc_test_rf = accuracy_score(y_test, yhat_rf_test)

print(f'              Metrics Rforest                ')
print(f'_________________________________________________\n')
print(f'Training AUC: {auc_train_rf}')
print(f'Test AUC: {auc_test_rf}')
print(f'Training Accuracy: {acc_train_rf}')
print(f'Test Accuracy: {acc_test_rf}')
print(f'_________________________________________________\n')

              Metrics Rforest                
_________________________________________________

Training AUC: 1.0
Test AUC: 0.9838872247094597
Training Accuracy: 1.0
Test Accuracy: 0.9840269020596889
_________________________________________________



### **AdaBoost**

In [20]:
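from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`; use `base_estimator` on older versions
ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=2), n_estimators=200)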

In [21]:
## Fit the model on the training data
ada.fit(X_train, y_train)

## Predict on both the training and test data to compare performance
yhat_ada_train = ada.predict(X_train)
yhat_ada_test = ada.predict(X_test)

## Performance Metrics
from sklearn.metrics import accuracy_score, roc_auc_score

auc_train_ada = roc_auc_score(y_train, yhat_ada_train)
auc_test_ada = roc_auc_score(y_test, yhat_ada_test)
acc_train_ada = accuracy_score(y_train, yhat_ada_train)
acc_test_ada = accuracy_score(y_test, yhat_ada_test)

print(f'              Metrics Ada Boost             ')
print(f'_________________________________________________\n')
print(f'Training AUC: {auc_train_ada}')
print(f'Test AUC: {auc_test_ada}')
print(f'Training Accuracy: {acc_train_ada}')
print(f'Test Accuracy: {acc_test_ada}')
print(f'_________________________________________________\n')

              Metrics Ada Boost             
_________________________________________________

Training AUC: 1.0
Test AUC: 0.9826382655087935
Training Accuracy: 1.0
Test Accuracy: 0.9827658680117697
_________________________________________________



### **XGBoost**

In [22]:
from xgboost import XGBClassifier
xgb = XGBClassifier()

In [23]:
## Fit the model on the training data
xgb.fit(X_train, y_train)

## Predict on both the training and test data to compare performance
yhat_xgb_train = xgb.predict(X_train)
yhat_xgb_test = xgb.predict(X_test)

## Performance Metrics
from sklearn.metrics import accuracy_score, roc_auc_score

auc_train_xgb = roc_auc_score(y_train, yhat_xgb_train)
auc_test_xgb = roc_auc_score(y_test, yhat_xgb_test)
acc_train_xgb = accuracy_score(y_train, yhat_xgb_train)
acc_test_xgb = accuracy_score(y_test, yhat_xgb_test)

print(f'              Metrics XGBoost             ')
print(f'_________________________________________________\n')
print(f'Training AUC: {auc_train_xgb}')
print(f'Test AUC: {auc_test_xgb}')
print(f'Training Accuracy: {acc_train_xgb}')
print(f'Test Accuracy: {acc_test_xgb}')
print(f'_________________________________________________\n')

              Metrics XGBoost             
_________________________________________________

Training AUC: 0.9740868186048806
Test AUC: 0.961464625545492
Training Accuracy: 0.9740409879138203
Test Accuracy: 0.9617486338797814
_________________________________________________



### **Multilayer-Perceptron Neural Network**

In [24]:
from sklearn.neural_network import MLPClassifier
nn = MLPClassifier(activation = 'logistic')

In [25]:
## Fit the model on the training data
nn.fit(X_train, y_train)

## Predict on both the training and test data to compare performance
yhat_nn_train = nn.predict(X_train)
yhat_nn_test = nn.predict(X_test)

## Performance Metrics
from sklearn.metrics import accuracy_score, roc_auc_score

auc_train_nn = roc_auc_score(y_train, yhat_nn_train)
auc_test_nn = roc_auc_score(y_test, yhat_nn_test)
acc_train_nn = accuracy_score(y_train, yhat_nn_train)
acc_test_nn = accuracy_score(y_test, yhat_nn_test)

print(f'              Metrics Neural Network             ')
print(f'_________________________________________________\n')
print(f'Training AUC: {auc_train_nn}')
print(f'Test AUC: {auc_test_nn}')
print(f'Training Accuracy: {acc_train_nn}')
print(f'Test Accuracy: {acc_test_nn}')
print(f'_________________________________________________\n')

              Metrics Neural Network             
_________________________________________________

Training AUC: 0.9806039001887188
Test AUC: 0.9758064516129032
Training Accuracy: 0.9805570152390962
Test Accuracy: 0.9760403530895334
_________________________________________________
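
The per-season statistics span very different ranges (for example `AB` in the hundreds, `X3B` in single digits), and neural networks are often sensitive to feature scale. A sketch, assuming synthetic data with deliberately mismatched column scales, of wrapping the network in a pipeline with `StandardScaler`; `max_iter=500` is an illustrative choice:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, not the player statistics).
X_demo, y_demo = make_classification(n_samples=600, n_features=17, random_state=630)
# Stretch each column by a different factor to mimic mixed-scale statistics.
X_demo = X_demo * np.logspace(0, 3, num=17)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=630
)

# The scaler is fit on the training data only, then applied to the test data.
scaled_nn = make_pipeline(
    StandardScaler(),
    MLPClassifier(activation="logistic", max_iter=500, random_state=630),
)
scaled_nn.fit(X_tr, y_tr)
test_accuracy = scaled_nn.score(X_te, y_te)
print(f"Test accuracy with scaling: {test_accuracy:.3f}")
```

Keeping the scaler inside the pipeline means the same transformation is applied automatically whenever the fitted model is used to predict.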





### **Visualize Model Comparison**

In [29]:
pd.DataFrame([[auc_train_logistic, acc_train_logistic, auc_test_logistic, acc_test_logistic],
  [auc_train_tree, acc_train_tree, auc_test_tree, acc_test_tree],
  [auc_train_rf, acc_train_rf, auc_test_rf, acc_test_rf],
  [auc_train_ada, acc_train_ada, auc_test_ada, acc_test_ada],
  [auc_train_xgb, acc_train_xgb, auc_test_xgb, acc_test_xgb],
  [auc_train_nn, acc_train_nn, auc_test_nn, acc_test_nn]], 
  index = ['Logistic Regression', 'Decision Tree', 'Random Forest', 'AdaBoost', 'XGBoost', 'MLP Neural Network'], 
  columns = ['AUC Train', 'Accuracy Train', 'AUC Test', 'Accuracy Test'])

,AUC Train,Accuracy Train,AUC Test,Accuracy Test
Logistic Regression,0.890342,0.890384,0.88501,0.884826
Decision Tree,1.0,1.0,0.965439,0.965532
Random Forest,1.0,1.0,0.983887,0.984027
AdaBoost,1.0,1.0,0.982638,0.982766
XGBoost,0.974087,0.974041,0.961465,0.961749
MLP Neural Network,0.980604,0.980557,0.975806,0.97604
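
The table can also be ranked programmatically rather than read by eye. A small sketch that rebuilds it from the values shown above and adds a train-minus-test AUC gap as a rough overfitting indicator (the `AUC Gap` column is an illustrative addition, not part of the original table):

```python
import pandas as pd

# Values copied from the comparison table above.
results = pd.DataFrame(
    {
        "AUC Train": [0.890342, 1.0, 1.0, 1.0, 0.974087, 0.980604],
        "Accuracy Train": [0.890384, 1.0, 1.0, 1.0, 0.974041, 0.980557],
        "AUC Test": [0.88501, 0.965439, 0.983887, 0.982638, 0.961465, 0.975806],
        "Accuracy Test": [0.884826, 0.965532, 0.984027, 0.982766, 0.961749, 0.97604],
    },
    index=["Logistic Regression", "Decision Tree", "Random Forest",
           "AdaBoost", "XGBoost", "MLP Neural Network"],
)

# Gap between training and test AUC, a rough indicator of overfitting.
results["AUC Gap"] = results["AUC Train"] - results["AUC Test"]

best_model = results["AUC Test"].idxmax()
print(results.sort_values("AUC Test", ascending=False))
print("Best by test AUC:", best_model)  # Random Forest
```

Random forest leads on test AUC, with AdaBoost less than 0.002 behind; the decision tree shows the largest train-test gap.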


# **4) Apply best model(s) to the test data set**

In [15]:
# Load the test data set
test = pd.read_csv("test.csv")
test.head()

,Unnamed: 0,playerID,LOS,recent_year,G,AB,R,H,X2B,X3B,...,RBI,SB,CS,SO,BB,IBB,HBP,SH,SF,GIDP
0,13,abreujo02,7,2020,137.285714,541.0,75.142857,159.142857,33.285714,2.0,...,95.857143,1.428571,0.714286,119.285714,37.571429,7.285714,11.857143,0.0,5.0,17.142857
1,18,acunaro01,3,2020,104.333333,406.333333,83.666667,114.0,19.666667,2.0,...,64.666667,20.333333,5.0,123.666667,53.0,2.666667,6.333333,0.0,1.333333,5.0
2,21,adamewi01,3,2020,97.0,334.666667,47.0,87.666667,15.666667,0.666667,...,36.333333,4.0,2.666667,107.333333,32.333333,1.333333,1.333333,1.333333,1.0,6.333333
3,29,adamsma01,11,2020,75.818182,216.818182,26.727273,56.181818,11.727273,0.545455,...,36.090909,0.363636,0.363636,57.636364,14.636364,1.454545,1.090909,0.0,1.454545,4.363636
4,39,adelljo01,1,2020,38.0,124.0,9.0,20.0,4.0,0.0,...,7.0,0.0,1.0,55.0,7.0,0.0,1.0,0.0,0.0,3.0


In [30]:
# Create the feature matrix for the test data (same columns as the training features)
X2 = test.drop(['Unnamed: 0', 'playerID', 'recent_year', 'LOS'], axis = 1)
X2 = np.array(X2)
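
Once both tables are converted to NumPy arrays, column order is the only thing linking a test feature to the training feature it corresponds to. A small sketch, with hypothetical miniature frames (`train_demo`, `test_demo`), of selecting the test columns by the training feature names before conversion so that any mismatch fails loudly:

```python
import pandas as pd

id_cols = ["Unnamed: 0", "playerID", "recent_year", "LOS"]

# Miniature stand-ins for train.csv and test.csv (values from the previews above).
train_demo = pd.DataFrame({
    "Unnamed: 0": [1, 2], "playerID": ["aaronha01", "aaronto01"],
    "LOS": [23, 7], "recent_year": [1976, 1971],
    "G": [143.391304, 62.428571], "AB": [537.565217, 134.857143],
    "inducted": ["Y", "N"],
})
# Same features, deliberately listed in a different order.
test_demo = pd.DataFrame({
    "Unnamed: 0": [13, 18], "playerID": ["abreujo02", "acunaro01"],
    "LOS": [7, 3], "recent_year": [2020, 2020],
    "AB": [541.0, 406.333333], "G": [137.285714, 104.333333],
})

feature_cols = train_demo.drop(columns=id_cols + ["inducted"]).columns.tolist()

# Raises KeyError if a training feature is missing from the test file.
X2_demo = test_demo[feature_cols].to_numpy()
print(feature_cols)
print(X2_demo)
```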

In [31]:
# Generate predictions on the test data set
# Using the random forest because it had the highest test AUC (0.984) and test accuracy (0.984) of the six models
preds = rf.predict(X2)

In [32]:
# Display the predictions
preds

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [33]:
# Pair each prediction with its playerID
submission = pd.DataFrame({'Player' : test['playerID'],
                          'HOF' : preds})

In [35]:
# Display the results
submission

,Player,HOF
0,abreujo02,0
1,acunaro01,0
2,adamewi01,0
3,adamsma01,0
4,adelljo01,0
...,...,...
478,wongko01,0
479,yastrmi01,0
480,yelicch01,1
481,zimmebr01,0
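
With 482 rows, the predicted inductees are easier to read when filtered out of the full table. A minimal sketch using only the rows visible in the truncated display above (`submission_demo` is a stand-in for the full `submission` frame):

```python
import pandas as pd

# Only the rows shown in the truncated output above; the full frame has 482 rows.
submission_demo = pd.DataFrame({
    "Player": ["abreujo02", "acunaro01", "adamewi01", "adamsma01", "adelljo01",
               "wongko01", "yastrmi01", "yelicch01", "zimmebr01"],
    "HOF": [0, 0, 0, 0, 0, 0, 0, 1, 0],
})

# Keep just the players the model labels as future inductees.
predicted_hof = submission_demo.loc[submission_demo["HOF"] == 1, "Player"].tolist()
print(predicted_hof)  # ['yelicch01']
```

Applied to the full `submission` frame, the same filter would list every current player the random forest predicts to be inducted.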
