# **HOF Predictions of Current MLB Players**

The purpose of this file is to generate Hall of Fame (HOF) predictions for current MLB players. I will do so by completing the following tasks:

* Loading the data (cleaned and prepared in R)
* Partitioning the data into training and validation subsets
* Oversampling the minority class of the response variable `inducted` (binary Y/N)
* Testing different binary classification models
* Evaluating the models and selecting the best one

Questions:
* Should I select the single best model, or summarize the predictions of all models tested?
  * e.g. Mike Trout was predicted to make the HOF in 7/8 models
* Where in the process should SMOTE oversampling be implemented?
* Is GitHub a suitable medium for the blog/results?
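
One way to explore the first question is to tally, per player, how many models predict induction. The following is a minimal sketch, assuming each model's predictions are available as an array of `Y`/`N` labels in the same player order. The player IDs, model names, and predictions are made up for illustration and are not results.

```python
import numpy as np

# Hypothetical player IDs and per-model predictions (illustrative only, not real results)
players = np.array(["player_a", "player_b", "player_c"])
predictions = {
    "logistic_regression": np.array(["Y", "N", "N"]),
    "decision_tree":       np.array(["Y", "Y", "N"]),
    "random_forest":       np.array(["Y", "N", "N"]),
}

# Stack into a (n_models, n_players) boolean matrix and count "Y" votes per player
votes = np.vstack([pred == "Y" for pred in predictions.values()])
yes_counts = votes.sum(axis=0)

for player, count in zip(players, yes_counts):
    print(f"{player}: predicted HOF in {count}/{len(predictions)} models")
```

A tally like this can be reported alongside the single best model rather than instead of it.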

# **1) Load the data**

In [6]:
import pandas as pd

# Read in the training/validation data
df = pd.read_csv("train.csv")

In [7]:
# View the first 5 observations of the data frame
df.head()

| | Unnamed: 0 | playerID | LOS | recent_year | G | AB | R | H | X2B | X3B | ... | SB | CS | SO | BB | IBB | HBP | SH | SF | GIDP | inducted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | aaronha01 | 23 | 1976 | 143.391304 | 537.565217 | 94.521739 | 163.956522 | 27.130435 | 4.26087 | ... | 10.434783 | 3.173913 | 60.130435 | 60.956522 | 12.782609 | 1.391304 | 0.913043 | 5.26087 | 14.26087 | Y |
| 1 | 2 | aaronto01 | 7 | 1971 | 62.428571 | 134.857143 | 14.571429 | 30.857143 | 6.0 | 0.857143 | ... | 1.285714 | 1.142857 | 20.714286 | 12.285714 | 0.428571 | 0.0 | 1.285714 | 0.857143 | 5.142857 | N |
| 2 | 3 | abbated01 | 8 | 1910 | 103.375 | 367.75 | 43.25 | 93.5 | 11.875 | 5.375 | ... | 17.25 | 1.0 | 33.25 | 35.125 | 1.0 | 4.0 | 11.375 | 1.0 | 3.0 | N |
| 3 | 4 | abbotfr01 | 3 | 1905 | 53.333333 | 171.0 | 16.0 | 35.666667 | 7.0 | 2.0 | ... | 4.666667 | 1.0 | 25.0 | 6.333333 | 1.0 | 2.666667 | 6.666667 | 1.0 | 3.0 | N |
| 4 | 5 | abbotje01 | 5 | 2001 | 46.6 | 119.2 | 16.4 | 31.4 | 6.6 | 0.4 | ... | 1.2 | 1.0 | 18.2 | 7.6 | 0.4 | 0.6 | 1.0 | 1.4 | 2.4 | N |


In [8]:
# View the dimensions of the data frame
df.shape

(6098, 22)
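
The `Unnamed: 0` column in the output above is the name pandas gives to a CSV column whose header is empty. A likely cause, though this is an assumption since the R export step is not shown, is row names written out by R's `write.csv`. The sketch below uses a small in-memory CSV (two rows copied from the output above, truncated to a few columns) to show how `index_col=0` reads that column as the index instead.

```python
import io
import pandas as pd

# A tiny in-memory CSV with an empty first header, mimicking exported row names
csv_text = ",playerID,LOS,inducted\n1,aaronha01,23,Y\n2,aaronto01,7,N\n"

default_df = pd.read_csv(io.StringIO(csv_text))               # empty header becomes a column
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)  # empty header becomes the index

print(default_df.columns.tolist())
print(indexed_df.columns.tolist())
```

If the data were loaded this way, `Unnamed: 0` would no longer be a column, so it would also have to be removed from any later `drop` call.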

# **2) Partition the data into training and validation subsets**

In [15]:
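# Create the feature matrix X and the response vector y
import numpy as np

# Drop the response, the leftover row-number column, and the identifier/year columns
X = df.drop(['inducted', 'Unnamed: 0', 'playerID', 'recent_year'], axis=1)
X = np.array(X)

y = df['inducted'].values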

In [16]:
# Partition the data: 80% training, 20% validation (held in X_test / y_test)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=630)

In [17]:
X_train.shape

(4878, 18)

In [18]:
X_test.shape

(1220, 18)

In [19]:
y_train.shape

(4878,)

In [20]:
y_test.shape

(1220,)
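
Because inductees are presumably a small minority of players (the exact class balance is not shown here), a plain random split can leave the two subsets with somewhat different `Y` rates. The sketch below shows the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both subsets. It runs on synthetic stand-in data with a made-up 5% positive rate, not on `train.csv`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 5% positive class (made-up proportions)
rng = np.random.default_rng(630)
X_demo = rng.normal(size=(1000, 18))
y_demo = np.array(["Y"] * 50 + ["N"] * 950)

# stratify=y_demo keeps the Y/N ratio the same in both subsets
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=630, stratify=y_demo
)

print((y_tr == "Y").mean(), (y_val == "Y").mean())
```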

# **3) SMOTE Oversampling**
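
On the question of where to apply SMOTE: a common practice is to oversample only the training subset, after the split, so that the validation rows remain real observations. The imbalanced-learn package provides a `SMOTE` class with a `fit_resample` method for this. To stay self-contained, the sketch below instead hand-rolls the core idea in NumPy: each synthetic row is a random point on the segment between a minority-class row and one of its nearest minority-class neighbours. The function name `smote_sketch`, its parameters, and the synthetic data are all assumptions for illustration, and this simplified version is not a substitute for the library implementation.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority rows
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest minority neighbours per row

    base = rng.integers(0, len(X_min), size=n_new)             # random base rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one random neighbour each
    gap = rng.random((n_new, 1))                               # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Synthetic stand-in for the training subset only (made-up sizes)
rng = np.random.default_rng(630)
X_train_demo = rng.normal(size=(200, 18))
y_train_demo = np.array(["Y"] * 20 + ["N"] * 180)

X_minority = X_train_demo[y_train_demo == "Y"]
n_needed = (y_train_demo == "N").sum() - len(X_minority)  # rows needed to balance classes

X_synth = smote_sketch(X_minority, n_needed)
X_balanced = np.vstack([X_train_demo, X_synth])
y_balanced = np.concatenate([y_train_demo, np.full(n_needed, "Y")])

print(X_balanced.shape, (y_balanced == "Y").sum(), (y_balanced == "N").sum())
```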

# **4) Build classifier algorithms on training and validation data**

The algorithms I will test in this project are:
* Logistic Regression
* Decision Tree
* Random Forest
* AdaBoost
* XGBoost
* Multilayer Perceptron (MLP) Neural Network
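
The comparison of the algorithms listed above could be organized as a loop over a dictionary of models. This is a rough sketch under several assumptions: it uses synthetic imbalanced data rather than the player data, default hyperparameters, and validation F1 as the metric (the metric is my choice for illustration; none has been fixed yet). XGBoost is omitted only because it lives in a separate package from scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the player features (not the real data)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=18, weights=[0.9, 0.1], random_state=630
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=630, stratify=y_demo
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=630),
    "Random Forest": RandomForestClassifier(random_state=630),
    "AdaBoost": AdaBoostClassifier(random_state=630),
    "MLP Neural Network": MLPClassifier(max_iter=500, random_state=630),
}

# Fit each model on the training subset and score it on the validation subset
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = f1_score(y_val, model.predict(X_val))

# Report models from best to worst validation F1
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: validation F1 = {score:.3f}")
```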

# **5) Apply best model(s) to the test data set**
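
Applying a chosen model to new players might look like the sketch below. It assumes the current-player data has been prepared with the same feature columns, in the same order, as the training data. The random forest, the synthetic rows, and the rule used to create labels are placeholders, not the selected model or real data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(630)

# Synthetic stand-ins: labelled historical rows and unlabelled "current player" rows
X_labelled = rng.normal(size=(300, 18))
y_labelled = np.where(X_labelled[:, 0] > 1.2, "Y", "N")  # made-up rule to create labels
X_current = rng.normal(size=(5, 18))

model = RandomForestClassifier(random_state=630).fit(X_labelled, y_labelled)

# Look up which predict_proba column corresponds to the "Y" class
y_col = list(model.classes_).index("Y")
hof_prob = model.predict_proba(X_current)[:, y_col]
hof_pred = model.predict(X_current)

for prob, pred in zip(hof_prob, hof_pred):
    print(f"P(inducted = Y) = {prob:.2f}, predicted class = {pred}")
```

Reporting the predicted probability next to the Y/N label keeps borderline cases visible.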