# Naive Bayes
Classifing student success data by means of the [MultinomialNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB) from the sklearn module.

## Import Data
Import the data and create the response vector (r *x* 1) and design matrix (r *x* c). Create a normalized design matrix for comparison of accuracy to the non-normalized design matrix.

In [1]:
import time
import random
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.utils.extmath import cartesian
from sklearn import metrics
from sklearn import preprocessing


df = pd.read_csv('student-por2.csv')
df = pd.get_dummies(df)#, drop_first=True)

def response_conv(arr):
    new = []
    for i in arr:
        if (i > 0 and i < 10):           # condition where student failed
            new.append(0)                 
                                          
        elif (i >= 10):                   # condition where student passed
            new.append(1)                 
    
        else:                             # condition where student received an incomplete
            new.append(2)
    return(new)                           # 1-dimensional response varibale returned

X = df.drop('G3',1)                       # This is the design matrix
y = list(df.G3)                           # This is the discrete response vector
y_new = response_conv(y)                  # This is the multinomial response vector
X_norm = preprocessing.normalize(X)

## Splitting Data and Optimal Alpha 
We split both design matracies in training and test sets. By means of 10-Fold cross validation we return optimal values of alpha for each design matrix. Optimal is decided by selecting alphas that minimize mean squared error (mse) of the model.

In [2]:
random.seed(42)
X1_train, X1_test, y1_train, y1_test = train_test_split(X, y_new, test_size=0.33, random_state=42)
X3_train, X3_test, y3_train, y3_test = train_test_split(X_norm, y_new, test_size=0.33, random_state=42)

def opt(X,y):
    mse = []
    alphas = 10.0**-np.arange(1,5)
    for a in alphas:
        nb = MultinomialNB(alpha=a)
        scores = cross_val_score(nb, X, y, cv=10, scoring='neg_mean_squared_error')
        mse.append(scores.mean())
    

    opt_ = alphas[mse.index(min(mse))]
    return(opt_)

a1 = opt(X1_train,y1_train)
a3 = opt(X3_train,y3_train)

print("The optimal alpha value is %r for Non-standardized design matrix." % a1)
print("The optimal alpha value is %r for Normalized design matrix." % a3)

The optimal alpha value is 0.001 for Non-standardized design matrix.
The optimal alpha value is 0.10000000000000001 for Normalized design matrix.


## Fit and Predict
Fit a model for both the non-normalized and normalized design matrix using optimal alphas. Predict using the respective testing set and compare predictions into a data frame. 

In [3]:
nb1 = MultinomialNB(alpha=a1).fit(X1_train,y1_train)
nb3 = MultinomialNB(alpha=a3).fit(X3_train,y3_train)

nb_pred1 = nb1.predict(X1_test)
nb_pred3 = nb3.predict(X3_test)

pred = pd.DataFrame(list(zip(y1_test, nb_pred1, nb_pred3)), columns=['y_act','y_nb','y_nb_norm'])
pred.index.name = 'Obs'

pred

Unnamed: 0_level_0,y_act,y_nb,y_nb_norm
Obs,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,1,1,1
1,1,1,1
2,1,1,1
3,1,1,1
4,1,1,1
5,1,1,1
6,1,1,1
7,0,0,1
8,1,1,1
9,1,1,1


## Results
Accuracy, confusion matrix, and classification reports are returned for each design matirx.

In [4]:
cm_nb1 = pd.DataFrame(metrics.confusion_matrix(y1_test, nb_pred1), index = ['Fail(0)','Pass(1)','Inc(2)'],columns=['Fail(0)','Pass(1)','Inc(2)'])
cm_nb3 = pd.DataFrame(metrics.confusion_matrix(y3_test, nb_pred3), index = ['Fail(0)','Pass(1)','Inc(2)'],columns=['Fail(0)','Pass(1)','Inc(2)'])

print ("The accuracy of the Non-standardized Random Forest model is: ", nb1.score(X1_test,y1_test))
print ("\n")
print ("The accuracy of the Normalized Random Forest model is: ", nb3.score(X3_test,y3_test))
print ("\n")

print("Non-standardized Random Forest Confusion Matrix: \n", cm_nb1)
print ("\n")
print("Normalized Random Forest Confusion Matrix: \n", cm_nb3)
print ("\n")

print("Classification report for Non-standardized design matrix:\n", metrics.classification_report(y1_test,nb_pred1))
print("\n")
print("Classification report for Normalized design matrix:\n", metrics.classification_report(y3_test,nb_pred3))

The accuracy of the Non-standardized Random Forest model is:  0.855813953488


The accuracy of the Normalized Random Forest model is:  0.86976744186


Non-standardized Random Forest Confusion Matrix: 
          Fail(0)  Pass(1)  Inc(2)
Fail(0)       12        9       2
Pass(1)       18      169       0
Inc(2)         1        1       3


Normalized Random Forest Confusion Matrix: 
          Fail(0)  Pass(1)  Inc(2)
Fail(0)        0       23       0
Pass(1)        0      187       0
Inc(2)         0        5       0


Classification report for Non-standardized design matrix:
              precision    recall  f1-score   support

          0       0.39      0.52      0.44        23
          1       0.94      0.90      0.92       187
          2       0.60      0.60      0.60         5

avg / total       0.88      0.86      0.86       215



Classification report for Normalized design matrix:
              precision    recall  f1-score   support

          0       0.00      0.00      0.0

  'precision', 'predicted', average, warn_for)
