# **HOF Predictions of Current MLB Players**

The purpose of this file is to generate Hall of Fame (HOF) predictions for current MLB players. I will do so by completing the following tasks:

* Loading the data (cleaned and prepared in R)
* Partitioning the data into training and validation
  * Oversampling the minority class of the response variable `inducted` (binary Y/N) with SMOTE
* Testing different binary classification models
* Evaluating models and selecting the best one
* Applying the selected model to current players

Questions:
* Select single best model or summarize predictions of all models tested?
  * e.g. Mike Trout was predicted to make the HOF in 7/8 models
  * A: Single best model and make predictions
* Where in the process to implement SMOTE oversampling?
  * A: Right before fitting the model
* GitHub as a medium for blog/results?
  * A: Perfect!

# **1) Load the data**

In [None]:
import pandas as pd

# Read in the training/validation data
df = pd.read_csv('/content/drive/MyDrive/Academics/MSBA/ISA 630/Final Project/train.csv')

In [None]:
# View the first 5 observations of the data frame
df.head()

| | Unnamed: 0 | playerID | LOS | recent_year | G | AB | R | H | X2B | X3B | ... | SB | CS | SO | BB | IBB | HBP | SH | SF | GIDP | inducted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | aaronha01 | 23 | 1976 | 143.391304 | 537.565217 | 94.521739 | 163.956522 | 27.130435 | 4.26087 | ... | 10.434783 | 3.173913 | 60.130435 | 60.956522 | 12.782609 | 1.391304 | 0.913043 | 5.26087 | 14.26087 | Y |
| 1 | 2 | aaronto01 | 7 | 1971 | 62.428571 | 134.857143 | 14.571429 | 30.857143 | 6.0 | 0.857143 | ... | 1.285714 | 1.142857 | 20.714286 | 12.285714 | 0.428571 | 0.0 | 1.285714 | 0.857143 | 5.142857 | N |
| 2 | 3 | abbated01 | 8 | 1910 | 103.375 | 367.75 | 43.25 | 93.5 | 11.875 | 5.375 | ... | 17.25 | 1.0 | 33.25 | 35.125 | 1.0 | 4.0 | 11.375 | 1.0 | 3.0 | N |
| 3 | 4 | abbotfr01 | 3 | 1905 | 53.333333 | 171.0 | 16.0 | 35.666667 | 7.0 | 2.0 | ... | 4.666667 | 1.0 | 25.0 | 6.333333 | 1.0 | 2.666667 | 6.666667 | 1.0 | 3.0 | N |
| 4 | 5 | abbotje01 | 5 | 2001 | 46.6 | 119.2 | 16.4 | 31.4 | 6.6 | 0.4 | ... | 1.2 | 1.0 | 18.2 | 7.6 | 0.4 | 0.6 | 1.0 | 1.4 | 2.4 | N |


In [None]:
# View the dimensions of the data frame
df.shape

(5813, 22)

# **2) Partition the data into training and validation subsets**

In [None]:
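# Create the feature matrix X and the response vector y
import numpy as np

# Drop the response and the non-feature columns (row counter, player ID, recent_year, LOS)
X = df.drop(['inducted', 'Unnamed: 0', 'playerID', 'recent_year', 'LOS'], axis = 1)
X = np.array(X)

# Encode the response as 1 (inducted) / 0 (not inducted)
y = df['inducted'].map(dict(Y = 1, N = 0))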
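
Before resampling, it can help to confirm how imbalanced `inducted` actually is. The following is a minimal sketch on a small made-up frame; `toy` and its values are illustrative assumptions, not rows from `train.csv`.

```python
import pandas as pd

# Hypothetical toy frame standing in for df (values are made up)
toy = pd.DataFrame({'inducted': ['Y', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N']})

# Same Y/N -> 1/0 encoding as in the cell above
y_toy = toy['inducted'].map({'Y': 1, 'N': 0})

# Share of each class; a small share for class 1 signals an imbalanced response
class_share = y_toy.value_counts(normalize=True)
print(class_share)
```

Running the same `value_counts(normalize=True)` call on the real `y` would show the class shares that SMOTE is meant to even out.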

In [None]:
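# SMOTE oversampling of the minority class
# Note: this resamples before the train/validation split below, so synthetic
# rows interpolated from validation players can end up in the training set
# and the validation metrics are likely optimistic.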
from imblearn.over_sampling import SMOTE

oversample = SMOTE(random_state = 630)
X, y = oversample.fit_resample(X, y)

In [None]:
# Partition the data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 630)
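
An alternative ordering, closer to the "right before fitting the model" answer in the Questions section, is to split first and oversample only the training part. The sketch below shows that ordering on synthetic data. Everything in it is an assumption for illustration: the data comes from `make_classification`, and random duplication via `sklearn.utils.resample` stands in for SMOTE so the example depends only on scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data standing in for the player features (assumption)
X_demo, y_demo = make_classification(n_samples=1000, n_features=17,
                                     weights=[0.95, 0.05], random_state=630)

# 1) Split first, stratifying so both parts keep the original class ratio
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=630)

# 2) Oversample the minority class in the training part only.
#    Random duplication is a stand-in here; in this notebook the step would be
#    X_tr_bal, y_tr_bal = SMOTE(random_state=630).fit_resample(X_tr, y_tr)
minority = X_tr[y_tr == 1]
n_needed = int((y_tr == 0).sum() - (y_tr == 1).sum())
extra = resample(minority, replace=True, n_samples=n_needed, random_state=630)
X_tr_bal = np.vstack([X_tr, extra])
y_tr_bal = np.concatenate([y_tr, np.ones(n_needed, dtype=y_tr.dtype)])

# The validation part is untouched, so it still reflects the real imbalance
print(np.bincount(y_tr_bal), np.bincount(y_val))
```

With this ordering the validation rows are never used to create training rows, so validation scores describe performance on data the model has not influenced.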

In [None]:
X_train.shape

(9059, 17)

In [None]:
X_test.shape

(2265, 17)

In [None]:
y_train.shape

(9059,)

In [None]:
y_test.shape

(2265,)

# **3) Build classifier algorithms on training and validation data**

The algorithms we will test in this project are:
* Logistic Regression
* Decision Tree
* Random Forest
* AdaBoost
* XGBoost
* Multilayer-Perceptron Neural Network
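
Each model below goes through the same fit, predict, and score steps. As a sketch of how that repetition could be factored out, the helper here wraps those steps in one function. The name `evaluate` and the synthetic demo data are assumptions for illustration and are not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit the model, then return AUC and accuracy on both partitions.

    AUC is computed from hard class predictions, matching the cells below.
    """
    model.fit(X_train, y_train)
    yhat_train = model.predict(X_train)
    yhat_test = model.predict(X_test)
    return {
        'AUC Train': roc_auc_score(y_train, yhat_train),
        'Accuracy Train': accuracy_score(y_train, yhat_train),
        'AUC Test': roc_auc_score(y_test, yhat_test),
        'Accuracy Test': accuracy_score(y_test, yhat_test),
    }


# Synthetic stand-in data (assumption), same 80/20 split as the notebook
X_d, y_d = make_classification(n_samples=500, n_features=17, random_state=630)
Xa, Xb, ya, yb = train_test_split(X_d, y_d, test_size=0.20, random_state=630)

results = {
    'Logistic Regression': evaluate(LogisticRegression(solver='liblinear'), Xa, ya, Xb, yb),
    'Decision Tree': evaluate(DecisionTreeClassifier(criterion='entropy', random_state=630),
                              Xa, ya, Xb, yb),
}
for name, metrics in results.items():
    print(name, metrics)
```

A dictionary like `results` can be passed to `pd.DataFrame.from_dict(results, orient='index')` to build a comparison table in one step.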

### **Logistic Regression**

In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver = 'liblinear')

In [None]:
## .fit method here on the training data
lr.fit(X_train, y_train)

## .predict to check how we do on the training and testing data
yhat_lr_train = lr.predict(X_train)
yhat_lr_test = lr.predict(X_test)

## Performance Metrics
from sklearn.metrics import accuracy_score, roc_auc_score

auc_train_logistic = roc_auc_score(y_train, yhat_lr_train)
auc_test_logistic = roc_auc_score(y_test, yhat_lr_test)
acc_train_logistic = accuracy_score(y_train, yhat_lr_train)
acc_test_logistic = accuracy_score(y_test, yhat_lr_test)

print(f'              Metrics Log. Reg                ')
print(f'_________________________________________________\n')
print(f'Training AUC: {auc_train_logistic}')
print(f'Test AUC: {auc_test_logistic}')
print(f'Training Accuracy: {acc_train_logistic}')
print(f'Test Accuracy: {acc_test_logistic}')
print(f'_________________________________________________\n')

              Metrics Log. Reg                
_________________________________________________

Training AUC: 0.8901800212298888
Test AUC: 0.8851622244538979
Training Accuracy: 0.890164477315377
Test Accuracy: 0.8852097130242825
_________________________________________________
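
The AUC and accuracy values above are nearly identical, which is expected here: `roc_auc_score` is being given hard 0/1 predictions, and for binary labels that reduces to balanced accuracy (the mean of sensitivity and specificity). A threshold-free AUC needs scores such as `predict_proba`. The sketch below illustrates the difference on synthetic data; the data and variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption)
X_d, y_d = make_classification(n_samples=1000, n_features=17, random_state=630)
Xa, Xb, ya, yb = train_test_split(X_d, y_d, test_size=0.20, random_state=630)

clf = LogisticRegression(solver='liblinear').fit(Xa, ya)

# AUC from hard labels: a single-threshold ROC curve, equal to balanced accuracy
auc_from_labels = roc_auc_score(yb, clf.predict(Xb))
bal_acc = balanced_accuracy_score(yb, clf.predict(Xb))

# AUC from predicted probabilities of the positive class: uses every threshold
auc_from_scores = roc_auc_score(yb, clf.predict_proba(Xb)[:, 1])

print(f'AUC from labels: {auc_from_labels}')
print(f'Balanced accuracy: {bal_acc}')
print(f'AUC from probabilities: {auc_from_scores}')
```

The same substitution, `model.predict_proba(X_test)[:, 1]` in place of `model.predict(X_test)`, applies to every classifier tested in this section.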



### **Decision Tree**

In [None]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(criterion = 'entropy')

In [None]:
## .fit method here on the training data
tree.fit(X_train, y_train)

## .predict to check how we do on the training and testing data
yhat_tree_train = tree.predict(X_train)
yhat_tree_test = tree.predict(X_test)

## Performance Metrics
from sklearn.metrics import accuracy_score, roc_auc_score

auc_train_tree = roc_auc_score(y_train, yhat_tree_train)
auc_test_tree = roc_auc_score(y_test, yhat_tree_test)
acc_train_tree = accuracy_score(y_train, yhat_tree_train)
acc_test_tree = accuracy_score(y_test, yhat_tree_test)

print(f'              Metrics Tree                ')
print(f'_________________________________________________\n')
print(f'Training AUC: {auc_train_tree}')
print(f'Test AUC: {auc_test_tree}')
print(f'Training Accuracy: {acc_train_tree}')
print(f'Test Accuracy: {acc_test_tree}')
print(f'_________________________________________________\n')

              Metrics Tree                
_________________________________________________

Training AUC: 1.0
Test AUC: 0.9704505264536734
Training Accuracy: 1.0
Test Accuracy: 0.9704194260485651
_________________________________________________



### **Random Forest**

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200)

In [None]:
## .fit method here on the training data
rf.fit(X_train, y_train)

## .predict to check how we do on the training and testing data
yhat_rf_train = rf.predict(X_train)
yhat_rf_test = rf.predict(X_test)

## Performance Metrics
from sklearn.metrics import accuracy_score, roc_auc_score

auc_train_rf = roc_auc_score(y_train, yhat_rf_train)
auc_test_rf = roc_auc_score(y_test, yhat_rf_test)
acc_train_rf = accuracy_score(y_train, yhat_rf_train)
acc_test_rf = accuracy_score(y_test, yhat_rf_test)

print(f'              Metrics Rforest                ')
print(f'_________________________________________________\n')
print(f'Training AUC: {auc_train_rf}')
print(f'Test AUC: {auc_test_rf}')
print(f'Training Accuracy: {acc_train_rf}')
print(f'Test Accuracy: {acc_test_rf}')
print(f'_________________________________________________\n')

              Metrics Rforest                
_________________________________________________

Training AUC: 1.0
Test AUC: 0.9850242954627678
Training Accuracy: 1.0
Test Accuracy: 0.9849889624724062
_________________________________________________



### **AdaBoost**

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`; the old name was removed in 1.4
ada = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2), n_estimators=200)
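
The keyword for the weak learner depends on the installed scikit-learn version (`base_estimator` in older releases, `estimator` from 1.2 onward). One way to keep the cell working across versions is to inspect the constructor signature, as sketched below. The names `kw` and `ada_demo` are assumptions used only for this illustration.

```python
import inspect

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Pick whichever keyword the installed scikit-learn accepts, preferring the new name
params = inspect.signature(AdaBoostClassifier.__init__).parameters
kw = 'estimator' if 'estimator' in params else 'base_estimator'

# Same configuration as the cell above: depth-2 trees, 200 boosting rounds
ada_demo = AdaBoostClassifier(**{kw: DecisionTreeClassifier(max_depth=2)},
                              n_estimators=200)
print(kw)
```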

In [None]:
## .fit method here on the training data
ada.fit(X_train, y_train)

## .predict to check how we do on the training and testing data
yhat_ada_train = ada.predict(X_train)
yhat_ada_test = ada.predict(X_test)

## Performance Metrics
from sklearn.metrics import accuracy_score, roc_auc_score

auc_train_ada = roc_auc_score(y_train, yhat_ada_train)
auc_test_ada = roc_auc_score(y_test, yhat_ada_test)
acc_train_ada = accuracy_score(y_train, yhat_ada_train)
acc_test_ada = accuracy_score(y_test, yhat_ada_test)

print(f'              Metrics Ada Boost             ')
print(f'_________________________________________________\n')
print(f'Training AUC: {auc_train_ada}')
print(f'Test AUC: {auc_test_ada}')
print(f'Training Accuracy: {acc_train_ada}')
print(f'Test Accuracy: {acc_test_ada}')
print(f'_________________________________________________\n')

              Metrics Ada Boost             
_________________________________________________

Training AUC: 1.0
Test AUC: 0.9819323937424369
Training Accuracy: 1.0
Test Accuracy: 0.9818984547461369
_________________________________________________



### **XGBoost**

In [None]:
from xgboost import XGBClassifier
xgb = XGBClassifier()

In [None]:
## .fit method here on the training data
xgb.fit(X_train, y_train)

## .predict to check how we do on the training and testing data
yhat_xgb_train = xgb.predict(X_train)
yhat_xgb_test = xgb.predict(X_test)

## Performance Metrics
from sklearn.metrics import accuracy_score, roc_auc_score

auc_train_xgb = roc_auc_score(y_train, yhat_xgb_train)
auc_test_xgb = roc_auc_score(y_test, yhat_xgb_test)
acc_train_xgb = accuracy_score(y_train, yhat_xgb_train)
acc_test_xgb = accuracy_score(y_test, yhat_xgb_test)

print(f'              Metrics XGBoost             ')
print(f'_________________________________________________\n')
print(f'Training AUC: {auc_train_xgb}')
print(f'Test AUC: {auc_test_xgb}')
print(f'Training Accuracy: {acc_train_xgb}')
print(f'Test Accuracy: {acc_test_xgb}')
print(f'_________________________________________________\n')

              Metrics XGBoost             
_________________________________________________

Training AUC: 0.9732703533443898
Test AUC: 0.9620987662021732
Training Accuracy: 0.97328623468374
Test Accuracy: 0.9620309050772627
_________________________________________________



### **Multilayer-Perceptron Neural Network**

In [None]:
from sklearn.neural_network import MLPClassifier
nn = MLPClassifier(activation = 'logistic')

In [None]:
## .fit method here on the training data
nn.fit(X_train, y_train)

## .predict to check how we do on the training and testing data
yhat_nn_train = nn.predict(X_train)
yhat_nn_test = nn.predict(X_test)

## Performance Metrics
from sklearn.metrics import accuracy_score, roc_auc_score

auc_train_nn = roc_auc_score(y_train, yhat_nn_train)
auc_test_nn = roc_auc_score(y_test, yhat_nn_test)
acc_train_nn = accuracy_score(y_train, yhat_nn_train)
acc_test_nn = accuracy_score(y_test, yhat_nn_test)

print(f'              Metrics Neural Network             ')
print(f'_________________________________________________\n')
print(f'Training AUC: {auc_train_nn}')
print(f'Test AUC: {auc_test_nn}')
print(f'Training Accuracy: {acc_train_nn}')
print(f'Test Accuracy: {acc_test_nn}')
print(f'_________________________________________________\n')

              Metrics Neural Network             
_________________________________________________

Training AUC: 0.9856412624501875
Test AUC: 0.9792997355256428
Training Accuracy: 0.985649630202009
Test Accuracy: 0.9792494481236204
_________________________________________________
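
`MLPClassifier` is sensitive to feature scale, and the inputs here are raw statistics on different scales (for example `AB` in the hundreds and `X3B` in single digits). One option, not used in this notebook, is to standardize the features inside a pipeline so the scaler is fit on the training data only. The sketch below uses synthetic data, and `max_iter=500` is an assumed setting rather than a tuned value.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption)
X_d, y_d = make_classification(n_samples=600, n_features=17, random_state=630)
Xa, Xb, ya, yb = train_test_split(X_d, y_d, test_size=0.20, random_state=630)

# The scaler learns means and variances from the training rows only,
# then the same transform is applied automatically at prediction time
nn_scaled = make_pipeline(
    StandardScaler(),
    MLPClassifier(activation='logistic', max_iter=500, random_state=630),
)
nn_scaled.fit(Xa, ya)

val_acc = nn_scaled.score(Xb, yb)
print(f'Validation accuracy: {val_acc}')
```

The same wrapper would also suit `LogisticRegression`, while the tree-based models do not need scaled inputs.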





### **Visualize Model Comparison**

In [None]:
pd.DataFrame([[auc_train_logistic, acc_train_logistic, auc_test_logistic, acc_test_logistic],
  [auc_train_tree, acc_train_tree, auc_test_tree, acc_test_tree],
  [auc_train_rf, acc_train_rf, auc_test_rf, acc_test_rf],
  [auc_train_ada, acc_train_ada, auc_test_ada, acc_test_ada],
  [auc_train_xgb, acc_train_xgb, auc_test_xgb, acc_test_xgb],
  [auc_train_nn, acc_train_nn, auc_test_nn, acc_test_nn]], 
  index = ['Logistic Regression', 'Decision Tree', 'Random Forest', 'AdaBoost', 'XGBoost', 'MLP Neural Network'], 
  columns = ['AUC Train', 'Accuracy Train', 'AUC Test', 'Accuracy Test'])

| | AUC Train | Accuracy Train | AUC Test | Accuracy Test |
|---|---|---|---|---|
| Logistic Regression | 0.89018 | 0.890164 | 0.885162 | 0.88521 |
| Decision Tree | 1.0 | 1.0 | 0.970451 | 0.970419 |
| Random Forest | 1.0 | 1.0 | 0.985024 | 0.984989 |
| AdaBoost | 1.0 | 1.0 | 0.981932 | 0.981898 |
| XGBoost | 0.97327 | 0.973286 | 0.962099 | 0.962031 |
| MLP Neural Network | 0.985641 | 0.98565 | 0.9793 | 0.979249 |
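
The comparison above rests on a single 80/20 split. As a hedged sketch of a more stable way to pick the best model, the code below averages AUC over stratified folds. The data is synthetic, only two of the six models are shown, and names such as `candidates` and `cv_auc` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption)
X_d, y_d = make_classification(n_samples=500, n_features=17, random_state=630)

# Five stratified folds so every fold keeps the overall class ratio
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=630)

candidates = {
    'Logistic Regression': LogisticRegression(solver='liblinear'),
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=630),
}

# 'roc_auc' scoring uses predicted probabilities, not hard labels
cv_auc = {
    name: cross_val_score(model, X_d, y_d, cv=cv, scoring='roc_auc').mean()
    for name, model in candidates.items()
}

best_name = max(cv_auc, key=cv_auc.get)
print(cv_auc)
print(f'Best by mean CV AUC: {best_name}')
```

If oversampling is combined with cross-validation, the resampling step would need to run inside each training fold, for example through the pipeline class that imbalanced-learn provides.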


# **4) Apply best model(s) to the test data set**

In [None]:
# Load the test data set
test = pd.read_csv('/content/drive/MyDrive/Academics/MSBA/ISA 630/Final Project/test.csv')
test.head()

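| | Unnamed: 0 | playerID | LOS | recent_year | G | AB | R | H | X2B | X3B | ... | RBI | SB | CS | SO | BB | IBB | HBP | SH | SF | GIDP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | abreujo02 | 8 | 2021 | 139.125 | 544.125 | 76.5 | 157.75 | 32.875 | 2.0 | ... | 98.5 | 1.375 | 0.625 | 122.25 | 40.5 | 6.75 | 13.125 | 0.0 | 5.625 | 18.5 |
| 1 | 18 | acunaro01 | 4 | 2021 | 98.75 | 379.0 | 80.75 | 106.5 | 19.5 | 1.75 | ... | 61.5 | 19.5 | 5.25 | 114.0 | 52.0 | 2.5 | 7.0 | 0.0 | 2.25 | 3.75 |
| 2 | 20 | adamecr01 | 5 | 2019 | 35.2 | 65.6 | 6.4 | 14.0 | 1.8 | 0.8 | ... | 4.4 | 0.4 | 0.8 | 15.4 | 6.0 | 0.2 | 1.0 | 0.8 | 0.0 | 1.2 |
| 3 | 21 | adamewi01 | 5 | 2021 | 86.2 | 300.2 | 43.6 | 78.6 | 15.8 | 0.6 | ... | 36.4 | 3.4 | 2.4 | 95.6 | 30.8 | 1.0 | 0.8 | 0.8 | 0.8 | 5.6 |
| 4 | 28 | adamsla01 | 3 | 2018 | 39.0 | 45.666667 | 10.0 | 12.0 | 1.666667 | 0.333333 | ... | 8.666667 | 3.666667 | 0.0 | 15.666667 | 4.666667 | 0.0 | 0.333333 | 0.333333 | 0.333333 | 1.333333 |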


In [None]:
# Create the X vector
X2 = test.drop(['Unnamed: 0', 'playerID', 'recent_year', 'LOS'], axis = 1)
X2 = np.array(X2)
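
Because `X2` is converted to a bare NumPy array, the model relies on the test columns being in exactly the same order as the training columns. A small sketch of guarding against a mismatch is shown below. The toy frames and the name `feature_cols` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frames; the test frame lists its columns in a different order
train_toy = pd.DataFrame({'playerID': ['a', 'b'], 'G': [100.0, 50.0],
                          'AB': [400.0, 120.0], 'inducted': ['Y', 'N']})
test_toy = pd.DataFrame({'playerID': ['c'], 'AB': [300.0], 'G': [90.0]})

# Take the feature list from the training frame once...
feature_cols = [c for c in train_toy.columns if c not in ('playerID', 'inducted')]

# ...and select the test features by name, which also restores the training order
X2_toy = test_toy[feature_cols].to_numpy()
print(feature_cols, X2_toy)
```

Selecting by name raises a `KeyError` if a training feature is missing from the test file, which surfaces the problem before any predictions are made.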

In [None]:
# Generate predictions on the test data set
# Using Random Forest (highest test AUC and accuracy in the comparison above)
preds = rf.predict(X2)

In [None]:
len(preds)

864

In [None]:
# Display the predictions
preds

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,

In [None]:
# Append the predictions to the test playerIDs
submission = pd.DataFrame({'Player' : test['playerID'],
                          'HOF' : preds})
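
Hard 0/1 predictions hide how close each player is to the decision threshold. As a sketch of a complementary view, the code below ranks players by the predicted probability of induction. The model is trained on synthetic data and the player IDs are made up, so every name and value here is an assumption.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 250 rows to fit, 50 rows to score
X_d, y_d = make_classification(n_samples=300, n_features=17, random_state=630)
rf_demo = RandomForestClassifier(n_estimators=50, random_state=630)
rf_demo.fit(X_d[:250], y_d[:250])

# Hypothetical player IDs for the 50 scored rows
ids = [f'player{i:02d}' for i in range(50)]

# Probability of the positive class (column 1) instead of a hard label
proba = rf_demo.predict_proba(X_d[250:])[:, 1]

ranked = (pd.DataFrame({'Player': ids, 'HOF_prob': proba})
          .sort_values('HOF_prob', ascending=False)
          .reset_index(drop=True))
print(ranked.head())
```

In the notebook, the equivalent call would be `rf.predict_proba(X2)[:, 1]` paired with `test['playerID']`.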

In [None]:
# Display the results
submission

| | Player | HOF |
|---|---|---|
| 0 | abreujo02 | 0 |
| 1 | acunaro01 | 0 |
| 2 | adamecr01 | 0 |
| 3 | adamewi01 | 0 |
| 4 | adamsla01 | 0 |
| ... | ... | ... |
| 859 | zavalse01 | 0 |
| 860 | zimmebr01 | 0 |
| 861 | zimmery01 | 0 |
| 862 | zobribe01 | 0 |
