# Random Forest Classifier
Classifying student success data with the [Random Forest Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) from the sklearn module. The data set comes from UCI's machine learning repository and can be downloaded [here](https://archive.ics.uci.edu/ml/machine-learning-databases/00320/). A description of the data can be found [here](https://archive.ics.uci.edu/ml/datasets/student+performance).

## Import Data
Import the data into a pandas dataframe. Get dummy variables for each categorical predictor in the data set and return the design matrix. Create a normalized and a standardized design matrix as well to compare model performance. Also note that we convert the response variable to three classes: *0 (fail)*, *1 (pass)*, and *2 (incomplete)*. A student is put into the passing class if they scored 10 or higher, the failing class if they scored below 10 but above 0, and the incomplete class if they received a 0.
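
Standardizing and normalizing are different operations in sklearn, which matters when comparing the design matrices. The sketch below uses a made-up toy matrix (`X_toy` is an assumption for illustration, not the student data) to show that `preprocessing.scale` works column by column while `preprocessing.normalize` works row by row.

```python
import numpy as np
from sklearn import preprocessing

# Toy design matrix (made-up values): 3 observations, 2 features
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

# scale: standardizes each COLUMN to zero mean and unit variance
X_toy_scale = preprocessing.scale(X_toy)

# normalize: rescales each ROW to unit L2 norm (the default norm)
X_toy_norm = preprocessing.normalize(X_toy)

print(X_toy_scale.mean(axis=0))            # column means, approximately 0
print(X_toy_scale.std(axis=0))             # column standard deviations, approximately 1
print(np.linalg.norm(X_toy_norm, axis=1))  # row norms, approximately 1
```

Because tree splits depend only on the ordering of values within a feature, column-wise standardization usually leaves a random forest's predictions unchanged, whereas row-wise normalization changes the features themselves.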

In [1]:
import time
import random
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.extmath import cartesian
from sklearn import metrics
from sklearn import preprocessing

df = pd.read_csv('student-por2.csv')
df = pd.get_dummies(df)#, drop_first=True)

def response_conv(arr):
    new = []
    for i in arr:
        if (i > 0 and i < 10):           # condition where student failed
            new.append(0)                 
                                          
        elif (i >= 10):                   # condition where student passed
            new.append(1)                 
    
        else:                             # condition where student received an incomplete
            new.append(2)
    return(new)                           # 1-dimensional response variable returned

X = df.drop(columns='G3')                 # This is the design matrix (positional axis is not accepted by recent pandas)
y = list(df.G3)                           # This is the discrete response vector
y_new = response_conv(y)                  # This is the multinomial response vector

clf = RandomForestClassifier()
clf.fit(X,y)

model = SelectFromModel(clf,prefit=True)  
newX = model.transform(X)                # design matrix with most influential predictors only

X_scale = preprocessing.scale(newX)
X_norm = preprocessing.normalize(newX)

random.seed(42)
X1_train, X1_test, y1_train, y1_test = train_test_split(newX, y_new, test_size=0.33, random_state=42)
X2_train, X2_test, y2_train, y2_test = train_test_split(X_scale, y_new, test_size=0.33, random_state=42)
X3_train, X3_test, y3_train, y3_test = train_test_split(X_norm, y_new, test_size=0.33, random_state=42)
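
The class rule implemented by `response_conv` in the cell above can also be written without an explicit loop. This is a minimal sketch using `np.select`; the name `response_conv_vec` and the sample scores are hypothetical and not part of the original notebook.

```python
import numpy as np

def response_conv_vec(scores):
    """Hypothetical vectorized variant of response_conv."""
    scores = np.asarray(scores)
    conditions = [
        (scores > 0) & (scores < 10),   # student failed
        scores >= 10,                   # student passed
    ]
    choices = [0, 1]
    # anything matching neither condition (a score of 0) is an incomplete
    return np.select(conditions, choices, default=2)

example = response_conv_vec([0, 5, 9, 10, 19])
print(example)   # [2 0 0 1 1]
```

The result is a NumPy array rather than a list, which `train_test_split` accepts just as well.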



## Naive Accuracy
Before we start training and selecting parameters for our model, we must find the distribution of the classes in the response variable. Whichever class is dominant, our model should perform better than simply guessing that class for every observation. For example, if the dominant class is 1 and 1's make up 83% of the response data, then our model should have higher than 83% accuracy.

In [None]:
zero = 0
one = 0
two = 0
for i in y1_train:
    if i == 0:
        zero += 1
    elif i == 1:
        one += 1
    else:
        two += 1
num1 = round(zero/len(y1_train),2)
num2 = round(one/len(y1_train),2)
num3 = round(two/len(y1_train),2)
print("The response vector has the following distribution: \nzeros: %r \nones: %r \ntwos: %r" % (num1,num2,num3))
print("\n")
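
The same naive baseline can be obtained more compactly. The sketch below is one possible alternative, not the notebook's own method: it uses `value_counts(normalize=True)` for the class proportions and sklearn's `DummyClassifier` for the always-guess-the-majority model. The labels in `y_demo` are made up for illustration; in the notebook one would pass `y1_train` instead.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Made-up training labels for illustration (not the student data)
y_demo = [1, 1, 1, 1, 0, 1, 2, 1, 0, 1]

# Class proportions in a single call
dist = pd.Series(y_demo).value_counts(normalize=True).sort_index()
print(dist)

# Baseline that always predicts the most frequent class
X_demo = np.zeros((len(y_demo), 1))   # features are ignored by this strategy
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_acc = baseline.score(X_demo, y_demo)
print(baseline_acc)                   # share of the dominant class
```

Scoring the dummy model on the test split gives the accuracy that any trained model has to beat.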

## Optimal Number of Trees and Features
The function *opt* finds the optimal values for the *number of trees* and the *number of features* considered at each split. The optimal pair is the one whose model returns the highest mean accuracy under 10-fold cross-validation.

In [2]:
start_time = time.time()
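# Note: 'auto' (equivalent to 'sqrt' for classifiers) was removed in scikit-learn 1.3; use 'sqrt' there
combos = cartesian([['auto','log2',None],np.arange(10,101,10)])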

def opt(X,y):
    acc = []

    for m,t in combos:
        rf = RandomForestClassifier(n_estimators=t,max_features=m,random_state=42)
        scores = cross_val_score(rf, X, y, cv=10, scoring='accuracy')
        acc.append(scores.mean())
    
    opt_k = combos[acc.index(max(acc))]
    return(opt_k)

m1,t1 = opt(X1_train,y1_train)
m2,t2 = opt(X2_train,y2_train)
m3,t3 = opt(X3_train,y3_train)

print ("The optimal number of trees and number of features to consider is %r and %r respectively for Non-standardized design matrix." % (int(t1),str(m1)))
print ("The optimal number of trees and number of features to consider is %r and %r respectively for Standardized design matrix." % (int(t2),str(m2)))
print ("The optimal number of trees and number of features to consider is %r and %r respectively for Normalized design matrix." % (int(t3),str(m3)))
print("Run time: %r minutes" % (int(time.time() - start_time)/60))

The optimal number of trees and number of features to consider is 80 and 'None' respectively for Non-standardized design matrix.
The optimal number of trees and number of features to consider is 80 and 'None' respectively for Standardized design matrix.
The optimal number of trees and number of features to consider is 50 and 'auto' respectively for Normalized design matrix.
Run time: 1.4666666666666666 minutes
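
The manual search over trees and features can also be expressed with sklearn's `GridSearchCV`, which cross-validates every combination and keeps the best one. This is a sketch of an alternative, not the notebook's method: the synthetic data from `make_classification` stands in for the training split, the grid is shortened to keep it fast, and `'sqrt'` is used in place of `'auto'` (the two are equivalent for classifiers in the sklearn versions that accept `'auto'`).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; substitute X1_train, y1_train)
X_demo, y_demo = make_classification(
    n_samples=120, n_features=8, n_informative=4, random_state=42
)

param_grid = {
    "max_features": ["sqrt", "log2", None],
    "n_estimators": [10, 20, 30],   # shortened grid for illustration
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                  # the notebook uses cv=10
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)            # best combination found
print(round(search.best_score_, 3))   # its mean cross-validated accuracy
```

After fitting, `search.best_estimator_` is a forest refit on all the supplied data with the winning parameters, so it can be used directly for prediction.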


## Fit and Predict
After tuning the model parameters, we fit a model with the optimal parameters to each design matrix. Predictions are made and returned in a data frame for comparison.

In [3]:
rf1 = RandomForestClassifier(n_estimators=t1,max_features=m1,random_state=42).fit(X1_train,y1_train)
rf2 = RandomForestClassifier(n_estimators=t2,max_features=m2,random_state=42).fit(X2_train,y2_train)
rf3 = RandomForestClassifier(n_estimators=t3,max_features=m3,random_state=42).fit(X3_train,y3_train)

rf_pred1 = rf1.predict(X1_test)
rf_pred2 = rf2.predict(X2_test)
rf_pred3 = rf3.predict(X3_test)

pred = pd.DataFrame(list(zip(y1_test, rf_pred1, rf_pred2, rf_pred3)), columns=['y_act','y_rf','y_rf_stan','y_rf_norm'])
pred.index.name = 'Obs'

# remove comment below to save the predictions in a csv file and view the full data frame in excel
#pred.to_csv("preds.csv")
pred

     y_act  y_rf  y_rf_stan  y_rf_norm
Obs
0        1     1          1          1
1        1     1          1          1
2        1     1          1          1
3        1     1          1          1
4        1     1          1          1
5        1     1          1          1
6        1     1          1          1
7        0     1          1          1
8        1     1          1          1
9        1     1          1          1


## Results
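Accuracy, a confusion matrix, and a classification report are returned for each design matrix. **Note** Results may vary between runs because randomness is involved (for instance, the forest used for feature selection is fit without a fixed `random_state`), so different accuracy may be returned. In this case all three design matrices return better accuracy than the naive accuracy. I personally would select either the non-standardized or the standardized design matrix because they make fewer type I errors: in the results below, each labels 7 failing students as passing, compared with 12 for the normalized design matrix.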
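
To make the link between a confusion matrix and the summary metrics concrete, the sketch below recomputes accuracy, recall, and precision by hand. The matrix is the non-standardized one reported in this notebook's output (rows are actual classes, columns are predicted classes); the variable names are illustrative.

```python
import numpy as np

# Non-standardized confusion matrix as reported in this notebook's output
# (rows = actual class, columns = predicted class; order Fail, Pass, Inc)
cm = np.array([[16,   7, 0],
               [ 9, 178, 0],
               [ 0,   1, 4]])

accuracy = np.trace(cm) / cm.sum()         # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)      # per actual class (row totals)
precision = np.diag(cm) / cm.sum(axis=0)   # per predicted class (column totals)

print(round(accuracy, 4))                  # 0.9209
print(np.round(recall, 3))
print(np.round(precision, 3))
```

The accuracy agrees with the value printed by `rf1.score`, and the per-class recall shows that most errors come from the failing class.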

In [4]:
cm_rf1 = pd.DataFrame(metrics.confusion_matrix(y1_test, rf_pred1), index = ['Fail(0)','Pass(1)','Inc(2)'],columns=['Fail(0)','Pass(1)','Inc(2)'])
cm_rf2 = pd.DataFrame(metrics.confusion_matrix(y2_test, rf_pred2), index = ['Fail(0)','Pass(1)','Inc(2)'],columns=['Fail(0)','Pass(1)','Inc(2)'])
cm_rf3 = pd.DataFrame(metrics.confusion_matrix(y3_test, rf_pred3), index = ['Fail(0)','Pass(1)','Inc(2)'],columns=['Fail(0)','Pass(1)','Inc(2)'])

print ("The accuracy of the Non-standardized Random Forest model is: ", rf1.score(X1_test,y1_test))
print ("\n")
print ("The accuracy of the Standardized Random Forest model is: ", rf2.score(X2_test,y2_test))
print ("\n")
print ("The accuracy of the Normalized Random Forest model is: ", rf3.score(X3_test,y3_test))
print ("\n")

print("Non-standardized Random Forest Confusion Matrix: \n", cm_rf1)
print ("\n")
print("Standardized Random Forest Confusion Matrix: \n", cm_rf2)
print ("\n")
print("Normalized Random Forest Confusion Matrix: \n", cm_rf3)
print ("\n")

print("Classification report for Non-standardized design matrix:\n", metrics.classification_report(y1_test,rf_pred1))
print("\n")
print("Classification report for standardized design matrix:\n", metrics.classification_report(y2_test,rf_pred2))
print("\n")
print("Classification report for Normalized design matrix:\n", metrics.classification_report(y3_test,rf_pred3))

The response vector has the following distribution: 
zeros: 0.14 
ones: 0.83 
twos: 0.02


The accuracy of the Non-standardized Random Forest model is:  0.920930232558


The accuracy of the Standardized Random Forest model is:  0.920930232558


The accuracy of the Normalized Random Forest model is:  0.902325581395


Non-standardized Random Forest Confusion Matrix: 
          Fail(0)  Pass(1)  Inc(2)
Fail(0)       16        7       0
Pass(1)        9      178       0
Inc(2)         0        1       4


Standardized Random Forest Confusion Matrix: 
          Fail(0)  Pass(1)  Inc(2)
Fail(0)       16        7       0
Pass(1)        9      178       0
Inc(2)         0        1       4


Normalized Random Forest Confusion Matrix: 
          Fail(0)  Pass(1)  Inc(2)
Fail(0)       10       12       1
Pass(1)        6      181       0
Inc(2)         1        1       3


Classification report for Non-standardized design matrix:
              precision    recall  f1-score   support

          0 