# Portuguese Scores

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
os.chdir('C:\\Users\\mhous\\Test Scores')

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import r2_score, mean_squared_error, confusion_matrix, classification_report
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.svm import SVR

train_por = pd.read_csv('train_por.csv')
test_por = pd.read_csv('test_por.csv')

# Regression

We begin with regression on the Portuguese G3 scores, which range from 0 to 20. Our baseline model predicts the mean G3 score of the training data for every student, which gives us an RMSE to compare the random forest models against. We then train a random forest with default hyperparameters, tune the hyperparameters with RandomizedSearchCV, and finally evaluate the best estimator found by the search and analyze its feature importances.

In [2]:
baseline = train_por['G3'].mean()
baseline_array = np.full(len(test_por), baseline)
rmse = np.sqrt(mean_squared_error(test_por['G3'], baseline_array))
print("RMSE of baseline model: %.3f" %rmse)

RMSE of baseline model: 2.681
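
The same mean baseline can also be written with scikit-learn's `DummyRegressor`, which puts the baseline on the same `fit`/`predict` interface as the later models. This is only a sketch: the `*_demo` arrays are made-up values standing in for the `G3` columns, not data from this notebook.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical toy targets standing in for train_por['G3'] and test_por['G3'].
y_train_demo = np.array([10, 12, 14, 8, 16])
y_test_demo = np.array([11, 13, 9])

# DummyRegressor ignores the features, so a placeholder column is enough.
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

# strategy="mean" always predicts the mean of the training targets.
dummy = DummyRegressor(strategy="mean").fit(X_train_demo, y_train_demo)
dummy_rmse = np.sqrt(mean_squared_error(y_test_demo, dummy.predict(X_test_demo)))

# Manual version, as in the cell above: predict the training mean everywhere.
manual_rmse = np.sqrt(np.mean((y_test_demo - y_train_demo.mean()) ** 2))

print("RMSE of dummy baseline: %.3f" % dummy_rmse)
print("RMSE of manual baseline: %.3f" % manual_rmse)
```

Both values should agree, since the two approaches make the same constant prediction.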


In [3]:
X_train = train_por.drop(['G1', 'G2', 'G3', 'G3_five_levels', 'G3_pass_fail'], axis=1)
y_train = train_por['G3']

X_test = test_por.drop(['G1', 'G2', 'G3', 'G3_five_levels', 'G3_pass_fail'], axis=1)
y_test = test_por['G3']

In [4]:
def evaluate_model(model, X_test, y_test):
    predictions = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, predictions))
    r2 = r2_score(y_test, predictions)
    print("RMSE: % .3f" %rmse)
    print("R-Squared: % .3f"%r2)

In [5]:
rf = RandomForestRegressor(random_state=1)
rf.fit(X_train, y_train)

RandomForestRegressor(random_state=1)

In [6]:
evaluate_model(rf, X_test, y_test)

RMSE:  2.406
R-Squared:  0.185


In [7]:
n_estimators = [int(x) for x in np.linspace(100,  2000, 20)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
min_samples_split = [2,5,10]
min_samples_leaf = [1,2,4]
bootstrap = [True, False]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
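
One caveat for readers on a recent scikit-learn: the `'auto'` option for `max_features` was deprecated in version 1.1 and removed in 1.3. For `RandomForestRegressor` it meant "use all features" (now written `1.0`), and for `RandomForestClassifier` it meant `'sqrt'`. A sketch of an equivalent grid under that assumption follows; the names `max_depth_new` and `random_grid_new` are ours and do not appear in the notebook.

```python
import numpy as np

# Same depth candidates as above: 10, 20, ..., 110, plus None for unlimited depth.
max_depth_new = [int(x) for x in np.linspace(10, 110, num=11)] + [None]

random_grid_new = {
    "n_estimators": [int(x) for x in np.linspace(100, 2000, 20)],
    # 'auto' meant all features for the regressor; 1.0 is the explicit form.
    "max_features": [1.0, "sqrt"],
    "max_depth": max_depth_new,
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

print(random_grid_new["n_estimators"][:3], "...", random_grid_new["n_estimators"][-1])
```

For the classifiers later in the notebook, `'auto'` and `'sqrt'` were the same setting, so this grid would search all features versus the square-root rule rather than two copies of the square-root rule.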

In [8]:
rf_random = RandomizedSearchCV(RandomForestRegressor(random_state=1), param_distributions = random_grid, n_iter = 50, cv = 5, verbose=1, random_state=1, n_jobs=-1)
rf_random.fit(X_train, y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  5.7min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  7.5min finished


RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(random_state=1),
                   n_iter=50, n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [10, 20, 30, 40, 50, 60,
                                                      70, 80, 90, 100, 110,
                                                      None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [100, 200, 300, 400,
                                                         500, 600, 700, 800,
                                                         900, 1000, 1100, 1200,
                                                         1300, 1400, 1500, 1600,
                                                         1700, 1800, 

In [9]:
print(rf_random.best_params_)
rf_best = rf_random.best_estimator_

{'n_estimators': 1900, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 90, 'bootstrap': True}


In [10]:
evaluate_model(rf_best, X_test, y_test)

RMSE:  2.315
R-Squared:  0.246


The best estimator from the RandomizedSearchCV reduces the RMSE by about 0.09 relative to our initial random forest model (2.406 to 2.315) and by about 0.37 relative to the baseline model (2.681). The R-squared rises by 0.061 from the first random forest to the best estimator (0.185 to 0.246), a modest gain in explanatory power.
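
For reference, the comparison can be reproduced from the three RMSE values printed above. The variable names here are illustrative only.

```python
# RMSE values copied from the cell outputs above.
baseline_rmse = 2.681
rf_rmse = 2.406
rf_best_rmse = 2.315

# Absolute improvements of the tuned model.
gain_vs_rf = rf_rmse - rf_best_rmse
gain_vs_baseline = baseline_rmse - rf_best_rmse

# Relative improvement over the mean baseline, in percent.
pct_vs_baseline = 100 * gain_vs_baseline / baseline_rmse

print("Gain over default random forest: %.3f" % gain_vs_rf)
print("Gain over baseline: %.3f (%.1f%%)" % (gain_vs_baseline, pct_vs_baseline))
```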

In [11]:
def variable_importances(model):
    # Print the ten largest impurity-based importances, highest first.
    # Column names come from the global X_train used to fit the model.
    importances = model.feature_importances_
    indices = np.argsort(importances)[::-1]
    print("Feature ranking:")
    for f in range(10):
        print("%d. Feature: %s (%f)" % (f + 1, X_train.columns[indices[f]], importances[indices[f]]))

In [12]:
variable_importances(rf_best)

Feature ranking:
1. Feature: failures (0.123063)
2. Feature: higher (0.058370)
3. Feature: absences (0.048330)
4. Feature: Medu (0.046176)
5. Feature: age (0.043466)
6. Feature: goout (0.043195)
7. Feature: Dalc (0.042807)
8. Feature: Walc (0.041092)
9. Feature: freetime (0.038160)
10. Feature: school (0.038014)


We find that number of failures and the desire to attend higher education are the most important predictors, followed by other logical variables such as number of absences and mother's education. 
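
The ranking above uses impurity-based importances, which are computed from the training data and can favor features with many distinct values. As a possible cross-check (not something this notebook does), permutation importance on held-out data is one option. The sketch below uses synthetic data from `make_regression`; every name in it is illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 5 features, of which 2 carry signal.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, n_informative=2, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

model_demo = RandomForestRegressor(n_estimators=50, random_state=1).fit(X_tr, y_tr)

# Shuffle each feature in turn on the held-out split and record the score drop.
result = permutation_importance(model_demo, X_te, y_te, n_repeats=5, random_state=1)

# Features ordered from most to least important.
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    print("feature %d: %.3f +/- %.3f"
          % (idx, result.importances_mean[idx], result.importances_std[idx]))
```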

## 5 Levels

Now we move on to a five-level classification task. The five levels are defined as follows:

| Level      | G3 Score |
| :----------- | -----------|
| 1      |    16-20   |
| 2   |    14-15     |
| 3   |   12-13     |
| 4   |   10-11      |
| 5   | 0-9 |
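
The table can be read as a simple lookup. The notebook loads `G3_five_levels` precomputed from the CSV files, so the helper below is a hypothetical illustration of the mapping rather than the code that produced that column.

```python
def g3_to_level(score):
    """Map a G3 score (0-20) to the five-level scale in the table above."""
    if not 0 <= score <= 20:
        raise ValueError("G3 scores are expected to lie between 0 and 20")
    if score >= 16:
        return 1
    if score >= 14:
        return 2
    if score >= 12:
        return 3
    if score >= 10:
        return 4
    return 5


# One score from each band, highest to lowest.
for demo_score in (18, 14, 13, 10, 9):
    print(demo_score, "->", g3_to_level(demo_score))
```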

In this case, our baseline model is the most common class in the training data. 

We again begin by fitting a base model, then use RandomizedSearchCV to tune the hyperparameters, and finally evaluate the model with the classification report and confusion matrix.



In [13]:
most_common_class = train_por['G3_five_levels'].value_counts().sort_values(ascending=False).index[0]
classified_correctly = 0
for i in range(len(train_por)):
    if train_por['G3_five_levels'][i] == most_common_class:
        classified_correctly +=1 
print("Mean accuracy of baseline model: " + str(round(classified_correctly/len(train_por), 3)))

Mean accuracy of baseline model: 0.314
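
Note that the loop above measures the baseline on the training rows. To put it on the same footing as the test-set accuracies reported for the random forests, the majority class can be learned from the training labels and scored on the test labels. A sketch with `DummyClassifier` on made-up labels (the `*_demo` arrays are hypothetical, not taken from the data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical label arrays standing in for the G3_five_levels columns.
y_train_demo = np.array([3, 3, 4, 3, 2, 5, 1, 3])
y_test_demo = np.array([3, 4, 4, 2, 3])

# The dummy model ignores features, so placeholder columns are enough.
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

# Learn the majority class on the training labels only...
majority = DummyClassifier(strategy="most_frequent").fit(X_train_demo, y_train_demo)

# ...and measure how often that constant guess is right on the test labels.
test_accuracy = majority.score(X_test_demo, y_test_demo)
print("Baseline accuracy on test labels: %.3f" % test_accuracy)
```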


In [14]:
X_train = train_por.drop(['G1', 'G2', 'G3', 'G3_five_levels', 'G3_pass_fail'], axis=1)
y_train = train_por['G3_five_levels']

X_test = test_por.drop(['G1', 'G2', 'G3', 'G3_five_levels', 'G3_pass_fail'], axis=1)
y_test = test_por['G3_five_levels']

In [15]:
rf = RandomForestClassifier(random_state=1)
rf.fit(X_train, y_train)
rf_base_score = round(rf.score(X_test, y_test), 3)
print("Mean accuracy: " + str(rf_base_score))

Mean accuracy: 0.346


In [16]:
rf_random = RandomizedSearchCV(RandomForestClassifier(random_state=1), param_distributions = random_grid, n_iter = 50, cv = 5, verbose=1, random_state=1, n_jobs=-1)
rf_random.fit(X_train, y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   58.4s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  3.3min finished


RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(random_state=1),
                   n_iter=50, n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [10, 20, 30, 40, 50, 60,
                                                      70, 80, 90, 100, 110,
                                                      None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [100, 200, 300, 400,
                                                         500, 600, 700, 800,
                                                         900, 1000, 1100, 1200,
                                                         1300, 1400, 1500, 1600,
                                                         1700, 1800,

In [17]:
print(rf_random.best_params_)
rf_best = rf_random.best_estimator_

{'n_estimators': 1500, 'min_samples_split': 10, 'min_samples_leaf': 2, 'max_features': 'auto', 'max_depth': None, 'bootstrap': True}


In [18]:
print("Mean accuracy: " + str(round(rf_best.score(X_test, y_test), 4)))

Mean accuracy: 0.3231


Our RandomizedSearchCV best estimator scores lower on the test set than our base random forest (0.323 vs. 0.346), so we use the base model to analyze our predictions.

In [19]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           1       0.30      0.23      0.26        13
           2       0.11      0.08      0.10        24
           3       0.44      0.40      0.42        40
           4       0.37      0.55      0.44        38
           5       0.33      0.20      0.25        15

    accuracy                           0.35       130
   macro avg       0.31      0.29      0.29       130
weighted avg       0.33      0.35      0.33       130



In [20]:
confusion_matrix(y_test, predictions)

array([[ 3,  3,  4,  3,  0],
       [ 4,  2,  8,  9,  1],
       [ 1,  8, 16, 15,  0],
       [ 1,  4,  7, 21,  5],
       [ 1,  1,  1,  9,  3]], dtype=int64)

We find our model does a decent job of predicting students in levels 3 and 4, but performs poorly everywhere else.
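
The per-class figures in the classification report can be recovered directly from the confusion matrix above, where rows are true levels and columns are predicted levels. The sketch below also computes the share of predictions that land within one level of the truth, a summary we add here for illustration only.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = np.array([[ 3,  3,  4,  3,  0],
               [ 4,  2,  8,  9,  1],
               [ 1,  8, 16, 15,  0],
               [ 1,  4,  7, 21,  5],
               [ 1,  1,  1,  9,  3]])

# Recall divides each diagonal entry by its row total (true count per level).
recall = cm.diagonal() / cm.sum(axis=1)
# Precision divides each diagonal entry by its column total (predicted count per level).
precision = cm.diagonal() / cm.sum(axis=0)
accuracy = cm.trace() / cm.sum()

# Share of predictions that are at most one level away from the true level.
n = cm.shape[0]
within_one = sum(cm[i, j] for i in range(n) for j in range(n) if abs(i - j) <= 1) / cm.sum()

for level in range(n):
    print("Level %d: precision %.2f, recall %.2f" % (level + 1, precision[level], recall[level]))
print("Accuracy: %.3f" % accuracy)
print("Within one level: %.3f" % within_one)
```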

In [21]:
variable_importances(rf_best)

Feature ranking:
1. Feature: failures (0.066995)
2. Feature: absences (0.059327)
3. Feature: Medu (0.050092)
4. Feature: Walc (0.046229)
5. Feature: health (0.045712)
6. Feature: age (0.041973)
7. Feature: freetime (0.040540)
8. Feature: Fedu (0.040072)
9. Feature: goout (0.038092)
10. Feature: higher (0.036883)


## Pass/Fail

Finally, we move on to our pass/fail classification task. Passing means scoring 10 or higher on G3, and failing means scoring 0-9. Our baseline is again the most common class in the training data.

In [22]:
X_train = train_por.drop(['G1', 'G2', 'G3', 'G3_five_levels', 'G3_pass_fail'], axis=1)
y_train = train_por['G3_pass_fail']

X_test = test_por.drop(['G1', 'G2', 'G3', 'G3_five_levels', 'G3_pass_fail'], axis=1)
y_test = test_por['G3_pass_fail']

In [23]:
most_common_class = train_por['G3_pass_fail'].value_counts().sort_values(ascending=False).index[0]
classified_correctly = 0
for i in range(len(train_por)):
    if train_por['G3_pass_fail'][i] == most_common_class:
        classified_correctly +=1 
print("Mean accuracy of baseline model: " + str(round(classified_correctly/len(train_por), 3)))

Mean accuracy of baseline model: 0.836


In [24]:
rf = RandomForestClassifier(random_state=0)
rf.fit(X_train, y_train)
rf_base_score = round(rf.score(X_test, y_test), 3)
print("Mean accuracy: " + str(rf_base_score))

Mean accuracy: 0.854


In [25]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        15
           1       0.88      0.97      0.92       115

    accuracy                           0.85       130
   macro avg       0.44      0.48      0.46       130
weighted avg       0.78      0.85      0.81       130



In [None]:
rf_random = RandomizedSearchCV(RandomForestClassifier(random_state=1), param_distributions = random_grid, n_iter = 50, cv = 5, verbose=1, random_state=1, n_jobs=-1)
rf_random.fit(X_train, y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   26.3s


In [None]:
print(rf_random.best_params_)
rf_best = rf_random.best_estimator_

In [None]:
print("Mean accuracy: " + str(round(rf_best.score(X_test, y_test), 4)))

In [None]:
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

In [None]:
confusion_matrix(y_test, predictions)

Our RandomizedSearchCV best estimator is less accurate than the initial random forest model, so we use the initial model to analyze our predictions. The model does very well at identifying who will pass but poorly at identifying who will fail. It correctly predicted 111 of the 115 students who passed and predicted the remaining 4 to fail. It also predicted that all 15 students who actually failed would pass.
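
The counts in this paragraph are enough to rebuild the metrics in the classification report and to see why accuracy alone is misleading here. The variable names below are ours, and balanced accuracy is added for illustration.

```python
# Counts taken from the paragraph above (1 = pass, 0 = fail).
passed_predicted_pass = 111
passed_predicted_fail = 4
failed_predicted_pass = 15
failed_predicted_fail = 0

total = (passed_predicted_pass + passed_predicted_fail
         + failed_predicted_pass + failed_predicted_fail)

pass_recall = passed_predicted_pass / (passed_predicted_pass + passed_predicted_fail)
pass_precision = passed_predicted_pass / (passed_predicted_pass + failed_predicted_pass)
fail_recall = failed_predicted_fail / (failed_predicted_fail + failed_predicted_pass)
accuracy = (passed_predicted_pass + failed_predicted_fail) / total

# Balanced accuracy averages recall over both classes, so the missed
# failing students pull it down even though plain accuracy looks high.
balanced_accuracy = (pass_recall + fail_recall) / 2

print("Pass precision: %.2f, pass recall: %.2f" % (pass_precision, pass_recall))
print("Fail recall: %.2f" % fail_recall)
print("Accuracy: %.3f, balanced accuracy: %.3f" % (accuracy, balanced_accuracy))
```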

In [None]:
variable_importances(rf_best)