# Statistical Modeling

In [5]:
import pandas as pd

## Machine Learning

Here is the dataframe we created at the beginning, grouping all the necessary information together for our data analysis. It contains the desired statistics for all 30 teams in the last 16 seasons.

In [9]:
grouped = pd.read_pickle('/home/jnogle/finalproject/dataframes/grouped')
grouped.head()

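|   | year | team_id | div_id | rank | g | w | ws_win | name | e | ba | era |
|---|------|---------|--------|------|---|---|--------|------|---|----|-----|
| 0 | 2000 | ANA | W | 3 | 162 | 82 | N | Anaheim Angels | 2.310345 | 0.271819 | 5.3844 |
| 1 | 2000 | ARI | W | 3 | 162 | 85 | N | Arizona Diamondbacks | 1.945455 | 0.247616 | 5.31 |
| 2 | 2000 | ATL | E | 1 | 162 | 95 | N | Atlanta Braves | 2.388889 | 0.184948 | 5.575909 |
| 3 | 2000 | BAL | E | 4 | 162 | 74 | N | Baltimore Orioles | 1.966102 | 0.202421 | 6.768636 |
| 4 | 2000 | BOS | E | 2 | 162 | 85 | N | Boston Red Sox | 1.651515 | 0.207478 | 5.046667 |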


Let's drop the identifier and bookkeeping columns (`year`, `team_id`, `div_id`, `g`, `name`) before modeling.

In [11]:
grouped = grouped.drop(['year','team_id','div_id','g','name'],axis=1)
grouped.head()

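|   | rank | w | ws_win | e | ba | era |
|---|------|---|--------|---|----|-----|
| 0 | 3 | 82 | N | 2.310345 | 0.271819 | 5.3844 |
| 1 | 3 | 85 | N | 1.945455 | 0.247616 | 5.31 |
| 2 | 1 | 95 | N | 2.388889 | 0.184948 | 5.575909 |
| 3 | 4 | 74 | N | 1.966102 | 0.202421 | 6.768636 |
| 4 | 2 | 85 | N | 1.651515 | 0.207478 | 5.046667 |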


The column we would like to predict (our "target array") is `ws_win`, which records whether or not the team won the World Series that season. The "feature matrix" is the rest of the dataframe: every column except `ws_win`.

In [14]:
X = grouped.drop('ws_win',axis=1)
X.shape

(480, 5)

In [15]:
y = grouped['ws_win']
y.shape

(480,)

In [16]:
from sklearn.model_selection import train_test_split

In [17]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y,
                                                random_state=0,train_size=0.7)

## Gaussian Naive-Bayes classifier

In [18]:
from sklearn.naive_bayes import GaussianNB
Gmodel = GaussianNB()
Gmodel.fit(Xtrain, ytrain);

In [21]:
from sklearn.metrics import accuracy_score

In [23]:
Gtraining_Acc = accuracy_score(ytrain,Gmodel.predict(Xtrain))
print("Training Set Prediction Accuracy =", Gtraining_Acc)

Gtest_Acc = accuracy_score(ytest,Gmodel.predict(Xtest))
print("Test Set Prediction Accuracy =", Gtest_Acc)

Training Set Prediction Accuracy = 0.907738095238
Test Set Prediction Accuracy = 0.840277777778


The Gaussian Naive-Bayes classifier scores about 0.91 on the training set and 0.84 on the test set, which looks like a reasonably effective prediction of World Series wins. However, we will double-check the accuracy with cross-validation.
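To make these scores concrete: accuracy is the fraction of rows whose predicted label matches the true label. The sketch below is plain Python and not part of the original notebook. It assumes that `train_size=0.7` split the 480 rows into 336 training and 144 test rows, and recovers the number of correct predictions implied by the printed scores.

```python
# Accuracy = (number of correct predictions) / (number of rows scored).
# Assumption: 480 rows split 70/30 gives 336 training and 144 test rows.
n_rows = 480
n_train = n_rows * 7 // 10      # integer arithmetic avoids float rounding
n_test = n_rows - n_train

def implied_correct(accuracy, n):
    """Return the whole number of correct predictions implied by an accuracy."""
    return round(accuracy * n)

# Scores copied from the Naive-Bayes output above.
train_correct = implied_correct(0.907738095238, n_train)
test_correct = implied_correct(0.840277777778, n_test)

print(train_correct, "of", n_train, "training rows correct")
print(test_correct, "of", n_test, "test rows correct")
```

Under that assumption, the scores correspond to 305 of 336 training rows and 121 of 144 test rows.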

### Cross-Validation

First we will split the data into five groups and use each group in turn to evaluate a model fit on the other four-fifths of the data. The resulting array contains the accuracy score for each of the five groups.

In [27]:
from sklearn.model_selection import cross_val_score
cross_val_score(Gmodel, X, y, cv=5)

array([ 0.92783505,  0.92708333,  0.88541667,  0.85416667,  0.90526316])
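The five-fold split used above can be sketched in plain Python. This is only an illustration of the idea: the helper `kfold_indices` is hypothetical, and scikit-learn assigns rows differently (for a classifier with an integer `cv` it uses stratified folds that preserve the class proportions), so the fold sizes here need not match the notebook's.

```python
def kfold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous, near-equal folds."""
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = kfold_indices(480, 5)
for i, test_idx in enumerate(folds):
    held_out = set(test_idx)
    # Train on every row that is not in the held-out fold.
    train_idx = [j for j in range(480) if j not in held_out]
    print(f"fold {i}: train on {len(train_idx)} rows, evaluate on {len(test_idx)}")
```

Every row is held out exactly once, so each of the five scores is computed on data the model did not see during fitting.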

Next, we will use "leave-one-out" cross-validation, a scheme in which each trial trains on all points but one and tests on the single point left out.

In [31]:
from sklearn.model_selection import LeaveOneOut, cross_val_score
Gscores = cross_val_score(Gmodel, X, y, cv=LeaveOneOut())
Gscores

array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  0.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,
        1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,
        1.,  1.,  1.,  0.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1

We have 480 records, so there are 480 values in this array. Each 1 marks a correct prediction and each 0 an incorrect one. The mean of these values gives yet another estimate of the accuracy score.

In [32]:
Gscores.mean()

0.88749999999999996
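The mechanics of leave-one-out can be sketched without scikit-learn. The example below is a toy: the data are invented, and `nearest_mean_predict` is a hypothetical stand-in classifier rather than Gaussian Naive Bayes. It only shows where the array of 0s and 1s and its mean come from.

```python
from statistics import mean

def nearest_mean_predict(train, x):
    """Predict the label whose training mean is closest to x (toy classifier)."""
    by_label = {}
    for value, label in train:
        by_label.setdefault(label, []).append(value)
    return min(by_label, key=lambda lab: abs(mean(by_label[lab]) - x))

# Hypothetical one-feature data: (value, label) pairs, invented for illustration.
data = [(1.0, "N"), (1.2, "N"), (0.9, "N"), (1.1, "N"),
        (3.0, "Y"), (3.2, "Y"), (1.9, "Y")]

scores = []
for i, (x, true_label) in enumerate(data):
    train = data[:i] + data[i + 1:]          # every row except row i
    predicted = nearest_mean_predict(train, x)
    scores.append(1.0 if predicted == true_label else 0.0)

print(scores)        # one 0/1 score per row
print(mean(scores))  # fraction of rows predicted correctly
```

Only the last row, whose value sits closer to the "N" group, is misclassified when it is held out, so six of the seven scores are 1.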

Overall, at roughly 0.89 I would consider the GNB classifier to be fairly accurate, but let's see if we can do any better using a different model.

## Random Forest Classifier

In [39]:
from sklearn.ensemble import RandomForestClassifier
RFmodel = RandomForestClassifier(n_estimators=100, random_state=0)
RFmodel.fit(Xtrain,ytrain);

In [40]:
RFtraining_Acc = accuracy_score(ytrain,RFmodel.predict(Xtrain))
print("Training Set Prediction Accuracy =", RFtraining_Acc)

RFtest_Acc = accuracy_score(ytest,RFmodel.predict(Xtest))
print("Test Set Prediction Accuracy =", RFtest_Acc)

Training Set Prediction Accuracy = 1.0
Test Set Prediction Accuracy = 0.958333333333


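The test accuracy of about 0.96 is well above the Naive-Bayes result, although a training accuracy of 1.0 suggests the forest may simply have memorized the training set. Let's check again with cross-validation.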

### Cross-Validation

In [41]:
from sklearn.model_selection import cross_val_score
cross_val_score(RFmodel, X, y, cv=5)

array([ 0.95876289,  0.95833333,  0.96875   ,  0.96875   ,  0.96842105])

Nice, every fold scores higher than in the array we got using the Naive-Bayes model. Let's also try the leave-one-out method again.

In [42]:
from sklearn.model_selection import LeaveOneOut, cross_val_score
RFscores = cross_val_score(RFmodel, X, y, cv=LeaveOneOut())
RFscores

array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1

In [43]:
RFscores.mean()

0.96458333333333335

Overall, the Random Forest Classifier gives a much better accuracy score than Naive Bayes (about 0.96 versus 0.89). However, let's try one more model I've never experimented with before, just for fun. After all, Data Science is REALLY FUN.

## Linear SVC

Linear SVC is a method of classification under the broad scope of SVMs (Support Vector Machines). 
The advantages of SVMs are:
- Effective in high dimensional spaces.
- Still effective in cases where number of dimensions is greater than the number of samples.
- Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
- Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

The disadvantages of SVMs include:
- If the number of features is much greater than the number of samples, the method is likely to give poor performance.
- SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.
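For intuition, a fitted linear classifier such as Linear SVC predicts by thresholding a weighted sum of the features, w · x + b, at zero. The sketch below shows only that prediction step. The weights and bias are hypothetical numbers chosen for illustration and are NOT fitted to the baseball data, and the helper names are ours.

```python
def linear_decision(weights, bias, x):
    """Return w . x + b, the signed score a linear classifier thresholds at 0."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def predict(weights, bias, x, positive="Y", negative="N"):
    """Map the sign of the decision score to a class label."""
    return positive if linear_decision(weights, bias, x) > 0 else negative

# Hypothetical weights for two features; not learned from any data.
weights, bias = [0.5, -1.0], -0.25

print(predict(weights, bias, [2.0, 0.5]))   # score = 1.0 - 0.5 - 0.25 = 0.25
print(predict(weights, bias, [1.0, 1.0]))   # score = 0.5 - 1.0 - 0.25 = -0.75
```

Training is the part that chooses the weights and bias; once they are fixed, prediction is just this sign check.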

Let's try it out.

In [48]:
from sklearn.svm import LinearSVC
LSVCmodel = LinearSVC(dual=False)
LSVCmodel.fit(Xtrain,ytrain);

In [49]:
LSVCtraining_Acc = accuracy_score(ytrain,LSVCmodel.predict(Xtrain))
print("Training Set Prediction Accuracy =", LSVCtraining_Acc)

LSVCtest_Acc = accuracy_score(ytest,LSVCmodel.predict(Xtest))
print("Test Set Prediction Accuracy =", LSVCtest_Acc)

Training Set Prediction Accuracy = 0.970238095238
Test Set Prediction Accuracy = 0.958333333333


Looks like this is also a good model for predicting our World Series winners. Let's check it with cross-validation like the rest.

### Cross-Validation

In [50]:
from sklearn.model_selection import cross_val_score
cross_val_score(LSVCmodel, X, y, cv=5)

array([ 0.95876289,  0.96875   ,  0.96875   ,  0.96875   ,  0.96842105])

This array is slightly better than the one from the Random Forest Classification: only the second fold differs (0.96875 versus 0.95833). Nice! Let's try the leave-one-out method one last time.

In [51]:
from sklearn.model_selection import LeaveOneOut, cross_val_score
LSVCscores = cross_val_score(LSVCmodel, X, y, cv=LeaveOneOut())
LSVCscores

array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1

In [52]:
LSVCscores.mean()

0.96666666666666667

This is slightly higher than the leave-one-out average from the Random Forest Classification (0.9667 versus 0.9646), so by accuracy Linear SVC comes out on top of the three models we tried.
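Before settling on a model, it helps to compare these accuracies with a majority-class baseline, that is, a "model" that predicts the most common label for every team. The sketch below is not from the original notebook. It assumes that each of the 16 seasons contributes exactly one `Y` row, which the output shown here does not confirm.

```python
from collections import Counter

# Assumption (not shown in the notebook output): one 'Y' row per season,
# i.e. 16 winners among the 480 team-seasons.
labels = ["Y"] * 16 + ["N"] * 464

majority_label, majority_count = Counter(labels).most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

print("Majority class:", majority_label)
print("Always-predict-majority accuracy =", round(baseline_accuracy, 4))
```

Under that assumption the baseline is 464/480, or about 0.9667, which is the same value as the Linear SVC leave-one-out mean above. It would be worth inspecting per-class results (for example a confusion matrix) to see whether any of the models actually picks out winners.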

## Feature Importances

Lastly, let's check which features in our feature matrix were most important in predicting a World Series win. This will be based on our Random Forest Classification model.

In [55]:
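# Pair each feature name with its importance from the fitted forest
fi = pd.DataFrame({'features': X.columns,
                   'importance': RFmodel.feature_importances_})
# DataFrame.sort was removed from pandas; sort_values is the replacement
fi.sort_values('importance', ascending=False)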



|   | features | importance |
|---|----------|------------|
| 3 | ba | 0.271557 |
| 1 | w | 0.227763 |
| 2 | e | 0.214586 |
| 4 | era | 0.2136 |
| 0 | rank | 0.072493 |


Have we found our answer? Batting average was the most informative feature for predicting a World Series win in this model, with an importance of about 0.27. However, number of wins, average fielding errors, and ERA were close behind (about 0.21 to 0.23 each), while `rank` contributed far less (about 0.07).
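As a quick sanity check on the table, scikit-learn normalizes a random forest's `feature_importances_` so that they sum to 1, which lets us read each value as a share of the total. The sketch below simply re-ranks the numbers copied from the table and is not part of the original notebook.

```python
# Importances copied from the table above.
importances = {"ba": 0.271557, "w": 0.227763, "e": 0.214586,
               "era": 0.2136, "rank": 0.072493}

# Sort from most to least important and compute the total.
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
total = sum(importances.values())

for name, value in ranked:
    print(f"{name:>5}: {value:.3f} ({value / total:.1%} of total)")
print("sum =", round(total, 4))
```

The top four features sit within about six percentage points of each other, so the ordering among them should not be over-interpreted.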