# Challenge: Make a Neural Network

For this challenge you have two options for how to use neural networks. Choose one of the following:

* Use an RBM to perform feature extraction on an image-based dataset that you find or create. If you go this route, present the features you extract and explain why this is a useful feature extraction method in the context you’re operating in. DO NOT USE either the MNIST digit recognition database or the iris data set. Both have been worked on publicly many times, and the code is easily available. (However, that code could be a useful resource to refer to.)

* **Create a multi-layer perceptron neural network model to predict on a labeled dataset of your choosing. Compare this model to either a boosted tree or a random forest model and describe the relative tradeoffs between complexity and accuracy. Be sure to vary the hyperparameters of your MLP!**

Once you've chosen which option you prefer, get to modeling and submit your work below.

## Background

### Goal
I chose the 'NBA Team Game Stats from 2014 to 2018' data set from Kaggle (https://www.kaggle.com/ionaskel/nba-games-stats-from-2014-to-2018#nba.games.stats.csv).

I will use this data set to train a multi-layer perceptron neural network model and a random forest model that classify each Boston Celtics game as a win or a loss from that game's statistics.

### Data

This dataset suits a classification model: each row holds one team's statistics for a single game along with the result, so a model can relate each statistical category to wins and losses.

**Columns:**
* Team
* Game
* Date
* Home
* Opponent
* WINorLOSS
* TeamPoints
* OpponentPoints
* FieldGoals
* FieldGoalsAttempted
* FieldGoals.
* X3PointShots
* X3PointShotsAttempted
* X3PointShots.
* FreeThrows
* FreeThrowsAttempted
* FreeThrows.
* OffRebounds
* TotalRebounds
* Assists
* Steals
* Blocks
* Turnovers
* TotalFouls
* Opp.FieldGoals
* Opp.FieldGoalsAttempted
* Opp.FieldGoals.
* Opp.3PointShots
* Opp.3PointShotsAttempted
* Opp.3PointShots.
* Opp.FreeThrows
* Opp.FreeThrowsAttempted
* Opp.FreeThrows.
* Opp.OffRebounds
* Opp.TotalRebounds
* Opp.Assists
* Opp.Steals
* Opp.Blocks
* Opp.Turnovers
* Opp.TotalFouls

In [15]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import classification_report
%matplotlib inline

## Load Data

In [2]:
nba_df = pd.read_csv('nba.games.stats.csv')

In [3]:
# Select only the data pertaining to the Boston Celtics
celtics = nba_df[nba_df['Team'] == 'BOS']

# Preview the data
celtics

Unnamed: 0.1,Unnamed: 0,Team,Game,Date,Home,Opponent,WINorLOSS,TeamPoints,OpponentPoints,FieldGoals,...,Opp.FreeThrows,Opp.FreeThrowsAttempted,Opp.FreeThrows.,Opp.OffRebounds,Opp.TotalRebounds,Opp.Assists,Opp.Steals,Opp.Blocks,Opp.Turnovers,Opp.TotalFouls
82,110,BOS,1,2014-10-29,Home,BRK,W,121,105,49,...,20,26,0.769,9,39,20,6,7,20,21
83,210,BOS,2,2014-11-01,Away,HOU,L,90,104,37,...,28,40,0.700,10,49,20,9,8,17,24
84,310,BOS,3,2014-11-03,Away,DAL,L,113,118,45,...,16,20,0.800,7,36,24,5,9,17,17
85,410,BOS,4,2014-11-05,Home,TOR,L,107,110,40,...,19,23,0.826,6,24,18,11,4,7,18
86,510,BOS,5,2014-11-07,Home,IND,W,101,98,39,...,13,13,1.000,14,42,18,2,6,13,17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7539,7891,BOS,78,2018-04-04,Away,TOR,L,78,96,25,...,6,8,0.750,12,48,23,10,6,10,25
7540,7991,BOS,79,2018-04-06,Home,CHI,W,111,104,45,...,10,17,0.588,9,43,25,15,0,17,22
7541,8091,BOS,80,2018-04-08,Home,ATL,L,106,112,42,...,11,14,0.786,5,37,23,7,4,13,19
7542,8193,BOS,81,2018-04-10,Away,WAS,L,101,113,33,...,16,20,0.800,10,51,32,8,3,20,24


## Clean Data

In [4]:
# Get data types
celtics.dtypes

Unnamed: 0                    int64
Team                         object
Game                          int64
Date                         object
Home                         object
Opponent                     object
WINorLOSS                    object
TeamPoints                    int64
OpponentPoints                int64
FieldGoals                    int64
FieldGoalsAttempted           int64
FieldGoals.                 float64
X3PointShots                  int64
X3PointShotsAttempted         int64
X3PointShots.               float64
FreeThrows                    int64
FreeThrowsAttempted           int64
FreeThrows.                 float64
OffRebounds                   int64
TotalRebounds                 int64
Assists                       int64
Steals                        int64
Blocks                        int64
Turnovers                     int64
TotalFouls                    int64
Opp.FieldGoals                int64
Opp.FieldGoalsAttempted       int64
Opp.FieldGoals.             

In [5]:
# Drop Unnamed: 0, Team, Date, and Opponent columns.
# Reassign to an explicit copy rather than dropping in place on a filtered
# slice, which is what triggers pandas' SettingWithCopyWarning.
columns_to_drop = ['Unnamed: 0', 'Team', 'Date', 'Opponent']
celtics = celtics.drop(columns=columns_to_drop).copy()

# Preview resulting data frame
celtics


Unnamed: 0,Game,Home,WINorLOSS,TeamPoints,OpponentPoints,FieldGoals,FieldGoalsAttempted,FieldGoals.,X3PointShots,X3PointShotsAttempted,...,Opp.FreeThrows,Opp.FreeThrowsAttempted,Opp.FreeThrows.,Opp.OffRebounds,Opp.TotalRebounds,Opp.Assists,Opp.Steals,Opp.Blocks,Opp.Turnovers,Opp.TotalFouls
82,1,Home,W,121,105,49,88,0.557,8,22,...,20,26,0.769,9,39,20,6,7,20,21
83,2,Away,L,90,104,37,98,0.378,1,25,...,28,40,0.700,10,49,20,9,8,17,24
84,3,Away,L,113,118,45,102,0.441,11,31,...,16,20,0.800,7,36,24,5,9,17,17
85,4,Home,L,107,110,40,78,0.513,10,26,...,19,23,0.826,6,24,18,11,4,7,18
86,5,Home,W,101,98,39,85,0.459,9,27,...,13,13,1.000,14,42,18,2,6,13,17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7539,78,Away,L,78,96,25,75,0.333,3,22,...,6,8,0.750,12,48,23,10,6,10,25
7540,79,Home,W,111,104,45,81,0.556,14,34,...,10,17,0.588,9,43,25,15,0,17,22
7541,80,Home,L,106,112,42,82,0.512,9,23,...,11,14,0.786,5,37,23,7,4,13,19
7542,81,Away,L,101,113,33,87,0.379,13,33,...,16,20,0.800,10,51,32,8,3,20,24


In [6]:
# Replace strings in Home and WINorLOSS columns with 0s and 1s
celtics['Home'] = celtics['Home'].replace({'Home': 0, 'Away': 1})
celtics['WINorLOSS'] = celtics['WINorLOSS'].replace({'L': 0, 'W': 1})

# Preview resulting data frame
celtics



Unnamed: 0,Game,Home,WINorLOSS,TeamPoints,OpponentPoints,FieldGoals,FieldGoalsAttempted,FieldGoals.,X3PointShots,X3PointShotsAttempted,...,Opp.FreeThrows,Opp.FreeThrowsAttempted,Opp.FreeThrows.,Opp.OffRebounds,Opp.TotalRebounds,Opp.Assists,Opp.Steals,Opp.Blocks,Opp.Turnovers,Opp.TotalFouls
82,1,0,1,121,105,49,88,0.557,8,22,...,20,26,0.769,9,39,20,6,7,20,21
83,2,1,0,90,104,37,98,0.378,1,25,...,28,40,0.700,10,49,20,9,8,17,24
84,3,1,0,113,118,45,102,0.441,11,31,...,16,20,0.800,7,36,24,5,9,17,17
85,4,0,0,107,110,40,78,0.513,10,26,...,19,23,0.826,6,24,18,11,4,7,18
86,5,0,1,101,98,39,85,0.459,9,27,...,13,13,1.000,14,42,18,2,6,13,17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7539,78,1,0,78,96,25,75,0.333,3,22,...,6,8,0.750,12,48,23,10,6,10,25
7540,79,0,1,111,104,45,81,0.556,14,34,...,10,17,0.588,9,43,25,15,0,17,22
7541,80,0,0,106,112,42,82,0.512,9,23,...,11,14,0.786,5,37,23,7,4,13,19
7542,81,1,0,101,113,33,87,0.379,13,33,...,16,20,0.800,10,51,32,8,3,20,24


In [43]:
# Look at class imbalance
wins = celtics['WINorLOSS'].sum()
losses = len(celtics['WINorLOSS']) - wins

print('Wins: {}\nLosses: {}'.format(wins, losses))

Wins: 196
Losses: 132


About 60/40 in favor of wins.
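
The split can be checked directly from the printed counts. The same arithmetic gives the accuracy a model would reach by always predicting a win, which is a useful floor for judging the classifiers. This is a minimal sketch; the variable names are illustrative and not part of the notebook.

```python
# Counts copied from the printed output above
wins, losses = 196, 132
total = wins + losses

win_share = wins / total
loss_share = losses / total

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = max(win_share, loss_share)

print(f"Win share: {win_share:.1%}, loss share: {loss_share:.1%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.1%}")
```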

In [7]:
# Set input (X) and output (y)
X = celtics.drop('WINorLOSS', axis=1)
y = celtics['WINorLOSS']

In [8]:
# Scale input (X) from 0 to 1 and reassign to data frame
min_max_scaler = MinMaxScaler()
X_scaled = min_max_scaler.fit_transform(X)
X = pd.DataFrame(X_scaled, index=X.index, columns=X.columns)

# Check results
X.describe()

Unnamed: 0,Game,Home,TeamPoints,OpponentPoints,FieldGoals,FieldGoalsAttempted,FieldGoals.,X3PointShots,X3PointShotsAttempted,X3PointShots.,...,Opp.FreeThrows,Opp.FreeThrowsAttempted,Opp.FreeThrows.,Opp.OffRebounds,Opp.TotalRebounds,Opp.Assists,Opp.Steals,Opp.Blocks,Opp.Turnovers,Opp.TotalFouls
count,328.0,328.0,328.0,328.0,328.0,328.0,328.0,328.0,328.0,328.0,...,328.0,328.0,328.0,328.0,328.0,328.0,328.0,328.0,328.0,328.0
mean,0.5,0.5,0.465272,0.481657,0.474979,0.396341,0.420324,0.503557,0.444033,0.53974,...,0.405589,0.378267,0.5665,0.393986,0.482842,0.426829,0.42561,0.363676,0.466768,0.450457
std,0.292664,0.500764,0.161792,0.181732,0.158712,0.151882,0.173341,0.185397,0.155594,0.145789,...,0.207027,0.186925,0.193994,0.173182,0.143436,0.189103,0.182541,0.17697,0.188571,0.174727
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.246914,0.0,0.359375,0.344262,0.37931,0.311111,0.303512,0.388889,0.333333,0.445217,...,0.266667,0.238095,0.432781,0.272727,0.372093,0.28,0.266667,0.214286,0.35,0.333333
50%,0.5,0.5,0.46875,0.47541,0.482759,0.388889,0.40301,0.5,0.440476,0.532174,...,0.366667,0.357143,0.581952,0.363636,0.488372,0.44,0.4,0.357143,0.45,0.458333
75%,0.753086,1.0,0.578125,0.590164,0.586207,0.488889,0.535117,0.611111,0.547619,0.628261,...,0.533333,0.5,0.695672,0.5,0.581395,0.52,0.533333,0.5,0.6,0.541667
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [9]:
# Preview resulting data frame
X

Unnamed: 0,Game,Home,TeamPoints,OpponentPoints,FieldGoals,FieldGoalsAttempted,FieldGoals.,X3PointShots,X3PointShotsAttempted,X3PointShots.,...,Opp.FreeThrows,Opp.FreeThrowsAttempted,Opp.FreeThrows.,Opp.OffRebounds,Opp.TotalRebounds,Opp.Assists,Opp.Steals,Opp.Blocks,Opp.Turnovers,Opp.TotalFouls
82,0.000000,0.0,0.718750,0.524590,0.827586,0.422222,0.785953,0.388889,0.285714,0.563478,...,0.466667,0.428571,0.574586,0.318182,0.348837,0.36,0.333333,0.500000,0.75,0.500000
83,0.012346,1.0,0.234375,0.508197,0.413793,0.644444,0.187291,0.000000,0.357143,0.000000,...,0.733333,0.761905,0.447514,0.363636,0.581395,0.36,0.533333,0.571429,0.60,0.625000
84,0.024691,1.0,0.593750,0.737705,0.689655,0.733333,0.397993,0.555556,0.500000,0.547826,...,0.333333,0.285714,0.631676,0.227273,0.279070,0.52,0.266667,0.642857,0.60,0.333333
85,0.037037,0.0,0.500000,0.606557,0.517241,0.200000,0.638796,0.500000,0.380952,0.600000,...,0.433333,0.357143,0.679558,0.181818,0.000000,0.28,0.666667,0.285714,0.10,0.375000
86,0.049383,0.0,0.406250,0.409836,0.482759,0.355556,0.458194,0.444444,0.404762,0.509565,...,0.233333,0.119048,1.000000,0.545455,0.418605,0.28,0.066667,0.428571,0.40,0.333333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7539,0.950617,1.0,0.046875,0.377049,0.000000,0.133333,0.036789,0.111111,0.285714,0.166957,...,0.000000,0.000000,0.539595,0.454545,0.558140,0.48,0.600000,0.428571,0.25,0.666667
7540,0.962963,0.0,0.562500,0.508197,0.689655,0.266667,0.782609,0.722222,0.571429,0.646957,...,0.133333,0.214286,0.241252,0.318182,0.441860,0.56,0.933333,0.000000,0.60,0.541667
7541,0.975309,0.0,0.484375,0.639344,0.586207,0.288889,0.635452,0.444444,0.309524,0.610435,...,0.166667,0.142857,0.605893,0.136364,0.302326,0.48,0.400000,0.285714,0.40,0.416667
7542,0.987654,1.0,0.406250,0.655738,0.275862,0.400000,0.190635,0.666667,0.547619,0.615652,...,0.333333,0.285714,0.631676,0.363636,0.627907,0.84,0.466667,0.214286,0.75,0.625000
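
Min-max scaling rescales each column independently as `(x - min) / (max - min)`. The sketch below applies that formula in plain Python to the `Game` column. It assumes game numbers run from 1 to 82 within a season, which is consistent with the scaled values in the preview above. The helper `min_max_scale` is hypothetical and is not the scikit-learn implementation.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: this sketch maps every value to 0.0
        # to avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Assumption: Game numbers run from 1 to 82 within a season
game_numbers = list(range(1, 83))
scaled = min_max_scale(game_numbers)

print(round(scaled[0], 6))   # game 1
print(round(scaled[1], 6))   # game 2
print(round(scaled[80], 6))  # game 81
```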


In [10]:
# Split input and output into random train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
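
The notebook fits the scaler on all rows before splitting, so the test rows influence the scaling range. One possible alternative, sketched below on synthetic data (not the NBA set), is to put the scaler inside a scikit-learn pipeline so it is fit only on the training rows of each split. The data, layer size, and the names `X_demo`, `y_demo`, and `model` are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the game statistics (hypothetical data)
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(200, 6))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo
)

# The scaler lives inside the pipeline, so every cross-validation fold
# is scaled using only that fold's training rows
model = make_pipeline(
    MinMaxScaler(),
    MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=123),
)

scores = cross_val_score(model, X_tr, y_tr, cv=5)
model.fit(X_tr, y_tr)
test_score = model.score(X_te, y_te)

print(f"CV mean accuracy: {scores.mean():.3f}, test accuracy: {test_score:.3f}")
```

A pipeline like this can also be passed to `GridSearchCV` in place of the bare estimator, with parameter names prefixed by the step name (for example `mlpclassifier__alpha`).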

## Build Models

### Random Forest Classifier

In [16]:
# Optimize model
parameters = {
    'n_estimators': [100, 500, 1000],
    'criterion': ['entropy', 'gini'],
    'bootstrap': [True, False],
    'max_depth': [5, 50, 100],
    'max_features': [2, 10, 20]
}

# Run grid search
clf = GridSearchCV(RandomForestClassifier(), parameters, cv=10, scoring='f1')
clf.fit(X_train, y_train)

print(clf.best_params_)

{'bootstrap': False, 'criterion': 'entropy', 'max_depth': 100, 'max_features': 10, 'n_estimators': 500}
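
To get a sense of how much work this search does, the number of model fits can be counted from the grid itself. The sketch below repeats the grid from the cell above and multiplies the number of combinations by the number of folds. `GridSearchCV` also refits the best combination once on the full training set, because `refit` defaults to `True`.

```python
from itertools import product

# Same grid as the random forest search above
rf_grid = {
    'n_estimators': [100, 500, 1000],
    'criterion': ['entropy', 'gini'],
    'bootstrap': [True, False],
    'max_depth': [5, 50, 100],
    'max_features': [2, 10, 20],
}
cv_folds = 10

# Every combination of one value per hyperparameter
combinations = list(product(*rf_grid.values()))
total_fits = len(combinations) * cv_folds

print(f"{len(combinations)} combinations x {cv_folds} folds = {total_fits} fits")
```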


In [19]:
# Initialize and fit model
rfc = RandomForestClassifier(**clf.best_params_, random_state=123)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred, target_names=['Loss', 'Win']))

              precision    recall  f1-score   support

        Loss       0.87      0.81      0.84        32
         Win       0.83      0.88      0.86        34

    accuracy                           0.85        66
   macro avg       0.85      0.85      0.85        66
weighted avg       0.85      0.85      0.85        66
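
The rounded figures in this report are consistent with the following counts on the 66 test games: 26 losses and 30 wins classified correctly, 6 losses predicted as wins, and 4 wins predicted as losses. These counts are inferred from the report and are not printed by the notebook. Under that assumption, the sketch below recomputes precision, recall, and F1 for each class; the helper name is illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Return precision, recall, and F1 for one class from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Assumed counts, inferred from the rounded report above (not notebook output)
correct_losses, correct_wins = 26, 30
losses_called_wins, wins_called_losses = 6, 4

# For the Loss class, a false positive is a win predicted as a loss
loss_metrics = precision_recall_f1(correct_losses, wins_called_losses, losses_called_wins)
# For the Win class, a false positive is a loss predicted as a win
win_metrics = precision_recall_f1(correct_wins, losses_called_wins, wins_called_losses)

total_games = correct_losses + correct_wins + losses_called_wins + wins_called_losses
accuracy = (correct_losses + correct_wins) / total_games

print("Loss:", [round(v, 2) for v in loss_metrics])
print("Win: ", [round(v, 2) for v in win_metrics])
print("Accuracy:", round(accuracy, 2))
```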



In [20]:
# Cross validation to look at overfitting
cross_val_score(rfc, X_train, y_train, cv=5)

array([0.98113208, 0.81132075, 0.90384615, 0.98076923, 0.94230769])

### Multi-Layer Perceptron Neural Network Model

In [36]:
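# Optimize model
# Each hidden_layer_sizes tuple gives the number of units in each hidden layer,
# e.g. (18, 5) means two hidden layers with 18 and 5 units
parameters = {
    'hidden_layer_sizes': [(18, 1), (18, 2), (18, 5),
                           (36, 1), (36, 2), (36, 5)],
    'alpha': (0.0001, 0.001, 0.01, 0.1, 1),
    'activation': ('identity', 'logistic', 'tanh', 'relu')
}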

# Run grid search
clf = GridSearchCV(MLPClassifier(), parameters, cv=10, scoring='f1')
clf.fit(X_train, y_train)

print(clf.best_params_)

{'activation': 'tanh', 'alpha': 0.01, 'hidden_layer_sizes': (18, 5)}
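
In scikit-learn, `hidden_layer_sizes=(18, 5)` describes two hidden layers, with 18 units in the first and 5 in the second. The sketch below counts the weights and biases of such a network. It assumes 36 input features (the columns listed in the Data section minus the dropped columns and the target) and a single output unit for the binary target. The helper `count_parameters` is hypothetical.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total

n_features = 36               # assumption: columns left in X after the drops
hidden_layer_sizes = (18, 5)  # best setting reported by the grid search
n_outputs = 1                 # binary target: a single output unit

layer_sizes = [n_features, *hidden_layer_sizes, n_outputs]
n_parameters = count_parameters(layer_sizes)

print(layer_sizes, n_parameters)
```

Under the same assumptions, the smallest setting in the grid, `(18, 1)`, passes everything through a single unit in its second hidden layer.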




In [37]:
# Initialize and fit model
mlp = MLPClassifier(**clf.best_params_, random_state=123)
mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred, target_names=['Loss', 'Win']))

              precision    recall  f1-score   support

        Loss       1.00      0.97      0.98        32
         Win       0.97      1.00      0.99        34

    accuracy                           0.98        66
   macro avg       0.99      0.98      0.98        66
weighted avg       0.99      0.98      0.98        66





In [38]:
# Cross validation to look at overfitting
cross_val_score(mlp, X_train, y_train, cv=5)



array([1.        , 0.88679245, 0.96153846, 0.96153846, 0.94230769])
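
The fold scores are easier to compare once they are summarized. The sketch below copies the two `cross_val_score` outputs and the test accuracies from the classification reports, then prints each model's mean, standard deviation, and the gap between cross-validation mean and test accuracy. The variable names are illustrative.

```python
from statistics import mean, stdev

# Fold scores copied from the two cross_val_score outputs above
rf_cv = [0.98113208, 0.81132075, 0.90384615, 0.98076923, 0.94230769]
mlp_cv = [1.0, 0.88679245, 0.96153846, 0.96153846, 0.94230769]

# Held-out test accuracy from the two classification reports
test_accuracy = {'random forest': 0.85, 'MLP': 0.98}

for name, scores in [('random forest', rf_cv), ('MLP', mlp_cv)]:
    gap = mean(scores) - test_accuracy[name]
    print(f"{name}: CV mean {mean(scores):.3f} (std {stdev(scores):.3f}), "
          f"test {test_accuracy[name]:.2f}, gap {gap:+.3f}")
```

A cross-validation mean well above the test accuracy hints at overfitting, while a mean close to or below it does not. With only five folds and 66 test games, both comparisons are noisy.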

## Conclusion
* The random forest classifier (grid search and model fitting) took significantly longer to run than the MLP neural network.
* The MLP neural network had a much higher test accuracy (98%) than the random forest classifier (85%).
* The MLP neural network also had a much higher macro-averaged F1 score (0.98) than the random forest classifier (0.85).
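* Five-fold cross-validation scores on the training set varied between folds (0.81 to 0.98 for the random forest, 0.89 to 1.00 for the MLP). The random forest's mean cross-validation accuracy (about 0.92) is above its test accuracy (0.85), which suggests slight overfitting. The MLP's mean (about 0.95) is close to its test accuracy (0.98), so the evidence of overfitting there is weaker.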