## Data Modeling

The purpose of the modeling part is predictive: the goal is to fit a model that, given the student information collected in the data set, predicts the student's academic performance, namely G3 (the final grade).


### Outcome Variable Selection and Imputation

In the original paper, G1 and G2, which are highly correlated with G3, are also included in the training data. Because these two variables are essentially the same kind of measurement as the outcome G3 (earlier period grades in the same course), using them to predict G3 is not informative. To allow a more meaningful investigation of the data set, G1 and G2 are dropped.

In the original paper, three approaches are used to measure the G3 grade:  
1. Binary classification – pass if G3≥10, else fail;
2. 5-Level classification – based on the Erasmus grade conversion system (Table 2);
3. Regression – the G3 value (numeric output between 0 and 20).

In preliminary trials, the accuracy of the models based on the latter two approaches was substantially lower than the benchmark given by the distribution of G3 itself; for example, the accuracy of predicting that a student gets an A was lower than that of blindly guessing that the student gets an A. The drop in accuracy relative to the original paper is attributed to the removal of G1 and G2.
Therefore, G3 is converted into a binary categorical variable ("Pass" or "Fail").
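
As a minimal sketch of the binary recoding, assuming the pass mark of 10 from the original paper, the rule can be applied to a grade column in one vectorized step. The grade values and the name `G3_binary` below are hypothetical and only illustrate the boundary cases.

```python
import numpy as np
import pandas as pd

# Hypothetical final grades on the 0-20 scale (illustrative, not taken from the data set).
grades = pd.Series([0, 8, 9, 10, 11, 15, 20], name="G3")

# Pass if G3 >= 10, otherwise Fail.
outcome = pd.Series(np.where(grades >= 10, "Pass", "Fail"), name="G3_binary")

print(outcome.tolist())
print(outcome.value_counts().to_dict())
```

A grade of exactly 10 lands in "Pass", which is the boundary case worth checking when the threshold is written as a strict inequality on the "Fail" side.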


### Data Selection

Although 382 students appear in both data sets, students' performance varies between the two classes, as shown in the data exploration section. Therefore, the two data sets are used separately for training and prediction. For each classification method, the model is separately tuned, fitted, and tested for accuracy on each of the two data sets.


### Model Selection and Fitting Procedure

Three classification methods are tried in the modeling part:
1. Support Vector Machine
2. Tree model
3. Random Forest

The procedure for fitting a model on a data set is as follows:
1. The data set is split into 70% for training and 30% for testing.
2. Cross-validation is conducted over the training data to tune the model, using a grid search for the SVM and a randomized search over the grid for the tree and random forest models.
3. The tuned model is evaluated on the testing data, and metrics such as accuracy and the classification report are used to show its predictive performance.
4. After a total of 6 model fittings (3 methods over 2 data sets), the models are evaluated by 5-fold cross-validation over the whole data set to compare performance across models and data sets.
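
The four steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the report's exact run: the data are synthetic (`make_classification` stands in for one student data set with 395 rows), and the small grid is chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for one data set (assumption: 395 rows, about 2/3 in the majority class).
X, y = make_classification(n_samples=395, n_features=20,
                           weights=[0.33, 0.67], random_state=0)

# Step 1: 70/30 split into training and testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=22)

# Step 2: grid search with cross-validation on the training data only.
pipeline = Pipeline([("scaler", StandardScaler()), ("SVM", SVC())])
grid = {"SVM__C": [1, 10, 100], "SVM__gamma": [0.1, 0.01, 0.001]}
search = GridSearchCV(pipeline, grid, cv=5)
search.fit(X_train, y_train)

# Step 3: evaluate the tuned model on the held-out 30%.
test_accuracy = search.score(X_test, y_test)
print("Test accuracy: %0.3f" % test_accuracy)
print("Tuned parameters:", search.best_params_)

# Step 4: 5-fold cross-validation of the tuned configuration over the whole data set.
cv_scores = cross_val_score(search.best_estimator_, X, y, cv=5)
print("5-fold accuracy: %0.3f (+/- %0.3f)" % (cv_scores.mean(), cv_scores.std() * 2))
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no information from validation folds leaks into the scaling.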

Note that the tree and random forest models return different best tuning parameters for different random states. For these two models, "further tuning" therefore meant applying a more comprehensive grid, chosen according to the results of 10 runs over the original grid. Because training the models is very time-consuming, only the final grid parameters are shown and used in this report.
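
One way to examine this sensitivity is to repeat the search under several random states and tally which parameter values win. The sketch below is hypothetical: it uses synthetic data, a reduced parameter range, and only `max_depth` is tallied, so it illustrates the idea rather than reproducing the 10 runs mentioned above.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption) standing in for one of the student data sets.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Reduced search space to keep the example fast.
param_dist = {"max_depth": np.arange(1, 11),
              "min_samples_leaf": np.arange(1, 11)}

best_depths = Counter()
for seed in range(10):
    search = RandomizedSearchCV(
        DecisionTreeClassifier(random_state=seed),
        param_dist,
        n_iter=5,           # number of sampled parameter combinations per run
        cv=3,
        random_state=seed,  # controls which combinations are sampled
    )
    search.fit(X, y)
    best_depths[int(search.best_params_["max_depth"])] += 1

# Values that win repeatedly are candidates for a narrower follow-up grid.
print(best_depths.most_common())
```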

In [56]:
'''
Main Code Chunk of the paper
1. Modules and functions import
2. Functions written for models
'''

#modules and functions needed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler 
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     cross_val_predict, cross_val_score,
                                     train_test_split)
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier


def find_accuracy(model, fold, X, y):
    '''
    Finds the accuracy of a model over the whole data set.
    '''
    scores = cross_val_score(model, X, y, cv=fold)
    print("Accuracy based on %i-fold cross validation is: %0.3f (+/- %0.3f)"
          % (int(fold), scores.mean(), scores.std() * 2))
    result = "%0.3f (+/- %0.3f)"% (scores.mean(), scores.std() * 2)
    return result


def SVC_Grid_Search(X, y, c_vals, gamma_vals, rs=22, testSize = 0.3):
    '''
    SVC model, parameters to tune are C and gamma
    '''
    #1. Split data into 70 30
    X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=testSize, random_state=rs)
    
    
    #2. CV for tuning
    steps = [('scaler', StandardScaler()),
             ('SVM', SVC())]
    pipeline = Pipeline(steps)
    parameters = {'SVM__C':c_vals,
                  'SVM__gamma':gamma_vals}
    cv = GridSearchCV(pipeline, parameters, cv=10)
    cv.fit(X_train, y_train)
    y_pred = cv.predict(X_test)

    #3. Compute and print metrics
    print("Accuracy: {}".format(cv.score(X_test, y_test)))
    print(classification_report(y_test, y_pred))
    print("Tuned Model Parameters: {}".format(cv.best_params_))
    best = cv.best_estimator_
    return best


def Tree_Grid_Search(X, y, depth, feature, minleaf, rs=22, testSize = 0.3):
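    '''
    Tree model, parameters to tune are max_depth,
    max_features and min_samples_leaf.
    Tuning uses RandomizedSearchCV, which samples 10 parameter
    combinations by default, so the result depends on rs.
    '''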
    #1. Split data into 70 30
    X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=testSize, random_state=rs)

    #2. CV for tuning
    param = {"max_depth": depth,
             "max_features": feature,
             "min_samples_leaf": minleaf}
    tree = DecisionTreeClassifier(random_state=rs)
    tree_cv = RandomizedSearchCV(tree, param, cv=5, random_state=rs)
    tree_cv.fit(X_train,y_train)
    tree_pred = tree_cv.predict(X_test)


    #3. Compute and print metrics
    print("Accuracy: {}".format(tree_cv.score(X_test, y_test)))
    print(classification_report(y_test, tree_pred))
    print("Tuned Model Parameters: {}".format(tree_cv.best_params_))
    best = tree_cv.best_estimator_
    return best


def RF_Grid_Search(X, y, ne, depth, feature, ms, rs=22, testSize = 0.3):
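    '''
    Random forest model, parameters to tune are
    n_estimators, max_depth, max_features,
    min_samples_split, and bootstrap.
    Tuning uses RandomizedSearchCV, which samples 10 parameter
    combinations by default, so the result depends on rs.
    '''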
    #1. Split data into 70 30
    X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=testSize, random_state=rs)
    
    #2. CV for tuning
    param_dist = {"n_estimators": ne,
                  "max_depth": depth,
                  "max_features": feature,
                  "min_samples_split": ms,
                  "bootstrap": [True, False]}

    rf = RandomForestClassifier(random_state=rs)
    rf_cv = RandomizedSearchCV(rf, param_dist, cv=5, random_state=rs)
    rf_cv.fit(X_train,y_train)
    rf_pred = rf_cv.predict(X_test)
    
    #3. Compute and print metrics
    print("Accuracy: {}".format(rf_cv.score(X_test, y_test)))
    print(classification_report(y_test, rf_pred))
    print("Tuned Model Parameters: {}".format(rf_cv.best_params_))
    best = rf_cv.best_estimator_
    return best

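def find_accuracy_and_cr(model, fold, X, y):
    '''
    Finds the accuracy of a model over the whole data set.
    Also outputs the classification report, computed from
    the out-of-fold predictions.
    '''
    scores = cross_val_score(model, X, y, cv=fold)
    print("Accuracy based on %i-fold cross validation is: %0.3f (+/- %0.3f)"
          % (int(fold), scores.mean(), scores.std() * 2))
    # Each sample is predicted by a model that did not see it during fitting
    y_pred = cross_val_predict(model, X, y, cv=fold)
    print(classification_report(y, y_pred))
    result = "%0.3f (+/- %0.3f)"% (scores.mean(), scores.std() * 2)
    return result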

In [3]:
mat = pd.read_csv('student-mat.csv')
por = pd.read_csv('student-por.csv')
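def data_reform(df):
    # One-hot encode the categorical columns; numeric columns (including G3) are kept
    df = pd.get_dummies(df, drop_first=True)
    # Binarize the final grade by column name rather than position: Pass if G3 >= 10
    df['G3'] = np.where(df['G3'] < 10, "Fail", "Pass")
    return df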

mat = data_reform(mat)
por = data_reform(por)

Xmat = mat.drop(columns=['G1', 'G2', 'G3'])
ymat = mat.G3
Xpor = por.drop(columns=['G1', 'G2', 'G3'])
ypor = por.G3


In [4]:
print(sum(ymat=="Pass"), sum(ymat=="Fail"))
print(sum(ymat=="Pass")/len(ymat))
print(sum(ymat=="Fail")/len(ymat))

265 130
0.6708860759493671
0.3291139240506329


In [63]:
print(sum(ypor=="Pass"), sum(ypor=="Fail"))
print(sum(ypor=="Pass")/len(ypor))
print(sum(ypor=="Fail")/len(ypor))

549 100
0.8459167950693375
0.15408320493066255
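
The class counts printed above also determine the accuracy of a blind classifier that always predicts the majority class, which serves as the baseline for the models that follow. A minimal sketch, assuming scikit-learn's `DummyClassifier` and a placeholder feature matrix (the dummy classifier ignores the features):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# (Pass, Fail) counts taken from the outputs above.
counts = {"math": (265, 130), "portuguese": (549, 100)}

baselines = {}
for name, (n_pass, n_fail) in counts.items():
    y = np.array(["Pass"] * n_pass + ["Fail"] * n_fail)
    X = np.zeros((len(y), 1))  # placeholder features, ignored by the dummy classifier
    blind = DummyClassifier(strategy="most_frequent").fit(X, y)
    baselines[name] = blind.score(X, y)
    print("%s: blind accuracy = %0.3f" % (name, baselines[name]))
```

The resulting accuracies equal the "Pass" proportions printed above, so a model is only informative if it does better than these values.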


## Models over Math

### Support Vector Machine

In [5]:
#preliminary C and gamma
cm1 = [1, 10, 100]
gm1 = [1, 0.1, 0.01, 0.001]
svm1 = SVC_Grid_Search(Xmat, ymat, cm1, gm1)

Accuracy: 0.7310924369747899
             precision    recall  f1-score   support

       Fail       0.60      0.18      0.27        34
       Pass       0.74      0.95      0.84        85

avg / total       0.70      0.73      0.67       119

Tuned Model Parameters: {'SVM__C': 10, 'SVM__gamma': 0.001}


In [20]:
#Closer look at C and gamma
cm2 = [7, 8, 9, 10,15,20]
gm2 = [0.0001,0.0008,0.0009,0.001, 0.0012, 0.0015]
svm2 = SVC_Grid_Search(Xmat, ymat, cm2, gm2)

Accuracy: 0.7310924369747899
             precision    recall  f1-score   support

       Fail       0.60      0.18      0.27        34
       Pass       0.74      0.95      0.84        85

avg / total       0.70      0.73      0.67       119

Tuned Model Parameters: {'SVM__C': 7, 'SVM__gamma': 0.0015}


The second tuned model does not improve the accuracy. To determine whether the tuning tends to overfit, a 5-fold cross-validation is conducted to estimate the accuracy.

In [41]:
print("For the first tuned SVM model over math data: ")
sm1 = find_accuracy(svm1, 5, Xmat, ymat)
print("For the second tuned SVM model over math data: ")
sm2 = find_accuracy(svm2, 5, Xmat, ymat)

For the first tuned SVM model over math data: 
Accuracy based on 5-fold cross validation is: 0.706 (+/- 0.063)
For the second tuned SVM model over math data: 
Accuracy based on 5-fold cross validation is: 0.706 (+/- 0.063)


The accuracies are the same, indicating that the original model is sufficient.

### Tree Model

In [64]:
dt1 = np.arange(1,42)
ft1 = np.arange(1,21)
mt1 = np.arange(1,21)
tree1 = Tree_Grid_Search(Xmat, ymat, dt1, ft1, mt1, rs=22)
tm1 = find_accuracy(tree1, 5, Xmat, ymat)

Accuracy: 0.7226890756302521
             precision    recall  f1-score   support

       Fail       0.57      0.12      0.20        34
       Pass       0.73      0.96      0.83        85

avg / total       0.69      0.72      0.65       119

Tuned Model Parameters: {'min_samples_leaf': 9, 'max_features': 1, 'max_depth': 38}
Accuracy based on 5-fold cross validation is: 0.646 (+/- 0.053)


### Random Forest Model

In [65]:
nr1 = [10, 100, 200, 500, 1000]
dr1 = np.arange(1,42)
fr1 = np.arange(1,21)
mr1 = np.arange(2,21)
rf1 = RF_Grid_Search(Xmat, ymat, nr1, dr1, fr1, mr1, rs=28)
rm1 = find_accuracy(rf1, 5, Xmat, ymat)

Accuracy: 0.6722689075630253
             precision    recall  f1-score   support

       Fail       0.53      0.20      0.29        40
       Pass       0.69      0.91      0.79        79

avg / total       0.64      0.67      0.62       119

Tuned Model Parameters: {'n_estimators': 500, 'min_samples_split': 16, 'max_features': 5, 'max_depth': 8, 'bootstrap': False}
Accuracy based on 5-fold cross validation is: 0.699 (+/- 0.076)


## Models over Portuguese

### Support Vector Machine

In [70]:
svm3 = SVC_Grid_Search(Xpor, ypor, cm1, gm1)
sp3 = find_accuracy(svm3, 5, Xpor, ypor)

Accuracy: 0.8358974358974359
             precision    recall  f1-score   support

       Fail       0.39      0.25      0.30        28
       Pass       0.88      0.93      0.91       167

avg / total       0.81      0.84      0.82       195

Tuned Model Parameters: {'SVM__C': 10, 'SVM__gamma': 0.01}
Accuracy based on 5-fold cross validation is: 0.809 (+/- 0.166)


In [71]:
cm3 = [4, 5,10,15,20]
gm3 = [0.007,0.008,0.009,0.01, 0.012, 0.015]
svm4 = SVC_Grid_Search(Xpor, ypor, cm3, gm3)
sp4 = find_accuracy(svm4, 5, Xpor, ypor)

Accuracy: 0.841025641025641
             precision    recall  f1-score   support

       Fail       0.41      0.25      0.31        28
       Pass       0.88      0.94      0.91       167

avg / total       0.81      0.84      0.82       195

Tuned Model Parameters: {'SVM__C': 5, 'SVM__gamma': 0.01}
Accuracy based on 5-fold cross validation is: 0.809 (+/- 0.155)


The further-tuned model seems to perform slightly better, with higher accuracy on the testing data and smaller fluctuation in cross-validation. To decide which model to adopt, a 10-fold cross-validation is applied to both.

In [72]:
print("For the first tuned SVM model over Portuguese data: ")
sp31 = find_accuracy(svm3, 10, Xpor, ypor)
print("For the second tuned SVM model over Portuguese data: ")
sp41 = find_accuracy(svm4, 10, Xpor, ypor)

For the first tuned SVM model over Portuguese data: 
Accuracy based on 10-fold cross validation is: 0.820 (+/- 0.132)
For the second tuned SVM model over Portuguese data: 
Accuracy based on 10-fold cross validation is: 0.835 (+/- 0.095)


Given the 10-fold cross-validation results, the second model is adopted.
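
When two configurations are this close, it can help to score both on identical, stratified folds so that differences are not caused by the fold assignment. The following is a sketch under assumptions: the data are synthetic (649 rows with roughly 85% in the majority class), and the explicit `StratifiedKFold` splitter is an addition that the cells above do not use.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the Portuguese data (assumption).
X, y = make_classification(n_samples=649, n_features=20,
                           weights=[0.15, 0.85], random_state=0)

# A fixed splitter gives both candidates exactly the same 10 folds.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=22)

# The two tuned parameter settings reported above.
candidates = {
    "C=10, gamma=0.01": make_pipeline(StandardScaler(), SVC(C=10, gamma=0.01)),
    "C=5, gamma=0.01": make_pipeline(StandardScaler(), SVC(C=5, gamma=0.01)),
}

results = {}
for label, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=folds)
    results[label] = (scores.mean(), scores.std())
    print("%s: %0.3f (+/- %0.3f)" % (label, scores.mean(), scores.std() * 2))
```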

### Tree Model

In [73]:
tree2 = Tree_Grid_Search(Xpor, ypor, dt1, ft1, mt1, rs=28)
tp2 = find_accuracy(tree2, 5, Xpor, ypor)

Accuracy: 0.8153846153846154
             precision    recall  f1-score   support

       Fail       0.45      0.26      0.33        34
       Pass       0.86      0.93      0.89       161

avg / total       0.79      0.82      0.80       195

Tuned Model Parameters: {'min_samples_leaf': 8, 'max_features': 12, 'max_depth': 19}
Accuracy based on 5-fold cross validation is: 0.830 (+/- 0.044)


### Random Forest Model

In [74]:
rf2 = RF_Grid_Search(Xpor, ypor, nr1, dr1, fr1, mr1, rs=21)
rp2 = find_accuracy(rf2, 5, Xpor, ypor)

Accuracy: 0.8666666666666667
             precision    recall  f1-score   support

       Fail       0.50      0.12      0.19        26
       Pass       0.88      0.98      0.93       169

avg / total       0.83      0.87      0.83       195

Tuned Model Parameters: {'n_estimators': 1000, 'min_samples_split': 5, 'max_features': 4, 'max_depth': 23, 'bootstrap': True}
Accuracy based on 5-fold cross validation is: 0.840 (+/- 0.035)


## Summary of Results

The previous sections have shown that all 6 final models achieved decent accuracy on the testing data after being trained on 70% of the data. Cross-validation was performed to check that the models are not overfitting. To summarize the predictive performance of the models, two metrics are used:

1. The per-category precisions from the classification reports are compared with the proportion of students in each category. This is necessary because both data sets are imbalanced, and the precision for each category is crucial in showing that the classification model at least outperforms blind classification.

2. The accuracy based on 5-fold cross-validation is summarized in a table for comparison.

### Precision Report

#### Model precision for each level compared with actual proportion

<tr>
<td> <img src="BMath p.png" alt="Drawing" style="width: 250px;"/> </td>
<td> <img src="BPortuguese p.png" alt="Drawing" style="width: 250px;"/> </td>
</tr>

The tables summarize the precision of the six final models for each level of G3 (Pass and Fail). All precisions are above the baseline given by the class proportions, indicating that every classifier performs better than a blind classifier. For example, blindly classifying every student as "Pass" in the math course yields a precision of 0.67, and the worst-performing model, the random forest in this case, still outperforms it with a precision of 0.69.

These two tables are crucial because both data sets are imbalanced, so accuracy alone is not sufficient to show performance.
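
To make the comparison concrete, the sketch below recomputes accuracy, precision, and recall from a confusion matrix and sets the precisions against the class proportions. The counts are an assumption: they are reconstructed from the rounded classification report of the first math SVM (supports of 34 and 85) and are consistent with its reported figures, but they were not printed by the notebook.

```python
import numpy as np

# Rows are true classes, columns are predicted classes, both in the order (Fail, Pass).
# Counts reconstructed (assumption) from the rounded report of the first math SVM.
confusion = np.array([[6, 28],
                      [4, 81]])

accuracy = np.trace(confusion) / confusion.sum()
precision = np.diag(confusion) / confusion.sum(axis=0)  # correct / predicted, per class
recall = np.diag(confusion) / confusion.sum(axis=1)     # correct / actual, per class

# Whole-data class proportions for math (130 Fail, 265 Pass).
baseline = np.array([130, 265]) / 395

for i, label in enumerate(["Fail", "Pass"]):
    print("%s: precision %0.2f, recall %0.2f, baseline %0.2f"
          % (label, precision[i], recall[i], baseline[i])
          )
print("Accuracy: %0.3f" % accuracy)
```

The low recall on "Fail" next to a high overall accuracy shows why the per-class figures are reported alongside accuracy.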

#### Model accuracy acquired via 5-fold cross validation over whole data

<tr>
<td> <img src="Accuracy.png" alt="Drawing" style="width: 500px;"/> </td>
</tr>  

The table summarizes the accuracy estimates of each model for each dataset given by a 5-fold cross validation over the whole dataset. 

Reading row-wise, SVM performs best on the Math data, and Random Forest performs best on the Portuguese data.

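All three methods perform better on the Portuguese data. This is plausible, since the larger training set allows the models to learn better, although the higher share of passing students (about 85% versus 67%) also raises the accuracy that any classifier can reach by favoring the majority class.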

