## Machine Learning: Students' Performance
In this notebook, supervised machine learning techniques are used to predict a student's performance in Mathematics from a set of 33 attributes (or features). The data set is publicly available at the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Student+Performance), contributed by:

> P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.  
[[Web Link]](http://www3.dsi.uminho.pt/pcortez/student.pdf)

### Data set Information (UCI Machine Learning Repository):
This data set consists of student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features, and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). 

In this notebook, only the 'mat' file shall be used. 

### Objective
To predict a student's performance (G3) on a mathematical subject based on a set of attributes, including demographic, social and school related features. 

For more information on the attributes, please visit UCI Machine Learning Repository.

### Approaches 
In their paper, P. Cortez and A. Silva suggested three approaches:

 1. **Binary Classification** : If $G3 \geq 10$, the student passes (1). If not, the student fails (0). 
 2. **Multi-class Classification** : The student receives a grade (Band) depending on their G3 score
	- Band 1: $16 \leq G3 \leq 20$
	- Band 2: $14 \leq G3 \leq 15$
	- Band 3: $12 \leq G3 \leq 13$
	- Band 4: $10 \leq G3 \leq 11$
	- Band 5: $0 \leq G3 \leq 9$
 3. **Regression** : Using the set of attributes, predict the actual $G3$ score of a student. 
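
The first two target definitions above can be sketched in plain Python. The function names `pass_fail` and `band` are illustrative assumptions; they appear neither in the paper nor in the code cells of this notebook, which build the same targets directly on the DataFrame.

```python
def pass_fail(g3):
    """Approach 1: return 1 if the student passes (G3 >= 10), else 0."""
    return 1 if g3 >= 10 else 0


def band(g3):
    """Approach 2: map a G3 score (0 to 20) to a band from 1 (best) to 5."""
    if not 0 <= g3 <= 20:
        raise ValueError("G3 is expected to lie between 0 and 20")
    if g3 >= 16:
        return 1
    if g3 >= 14:
        return 2
    if g3 >= 12:
        return 3
    if g3 >= 10:
        return 4
    return 5


scores = [0, 9, 10, 13, 15, 20]
print([pass_fail(s) for s in scores])  # [0, 0, 1, 1, 1, 1]
print([band(s) for s in scores])       # [5, 5, 4, 3, 2, 1]
```

Note that Band 5 coincides with a fail in Approach 1, so the binary target is a coarser version of the band target.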

#### Scenarios
For each approach, P. Cortez and A. Silva also suggested testing the Machine Learning model using three different scenarios: 

 1. Prediction uses ALL features except $G3$ (the target).
 2. Prediction uses ALL features except $G1$ and $G3$. 
 3. Prediction uses ALL features except $G2$ and $G3$. 
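
As a minimal sketch, the three scenarios can be expressed as a selection of column names. The helper `scenario_features` and the shortened column list are assumptions made for illustration; the code cells in this notebook select the columns by position instead.

```python
def scenario_features(columns, scenario):
    """Return the feature names for scenario 1, 2 or 3; G3 is always dropped."""
    dropped = {
        1: {"G3"},        # Scenario 1: all features except the target
        2: {"G1", "G3"},  # Scenario 2: additionally without G1
        3: {"G2", "G3"},  # Scenario 3: additionally without G2
    }
    if scenario not in dropped:
        raise ValueError("scenario must be 1, 2 or 3")
    return [c for c in columns if c not in dropped[scenario]]


# Shortened, illustrative column list (the real file has 33 columns)
columns = ["school", "sex", "age", "absences", "G1", "G2", "G3"]
for s in (1, 2, 3):
    print(s, scenario_features(columns, s))
```

With a DataFrame, the returned list could be used as `maths[scenario_features(maths.columns, 2)]`. Any derived target columns (such as a pass/fail column) would also have to be excluded before training.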

### Models
In this exercise, the following models are used: **Decision Tree** (DT), **Random Forest** (RF), **Single Layer Perceptron Neural Network** (NN), and **Support Vector Machine** (SVC / SVR). 

As far as possible, default parameters are used (e.g. C, max_iter, alpha). The exceptions are *random_state = 0*, and *max_depth* for DT and RF, which is set to reduce computation time. 

#### Measures 
At the end of each scenario for each approach, the following scores will be calculated for each model: 

 - Precision Score
 - Recall Score
 - AUROC 
 - Accuracy of the classifier on the training set and on the test set
 
For regressors, we will examine the models using the Root Mean Squared Error (RMSE).
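
To make the measures concrete, here is a small plain-Python sketch of precision, recall and RMSE on made-up toy data. The function names are illustrative, and returning 0.0 for an empty denominator is a choice made for this sketch; the notebook itself relies on the scikit-learn metric functions.

```python
import math


def precision_recall(y_true, y_pred):
    """Binary precision and recall, with 1 as the positive (pass) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # Type I errors
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # Type II errors
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared difference."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Toy labels: tp = 2, fp = 1, fn = 1, so precision and recall are both 2/3
print(precision_recall([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
# Toy grades: errors of 1, 0 and 2 points give sqrt(5/3), roughly 1.291
print(rmse([10, 12, 15], [11, 12, 13]))
```

Precision falls with Type I errors (false positives) and recall falls with Type II errors (false negatives), while RMSE is expressed in the same units as the grade itself.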

## Summary
This section summarises the results from different approaches and scenarios. 
#### Approach 1 : Binary Classification of Student's Performance 
<table border = "1" style="width:95%">
  <tr>
      <th>Measures</th>
      <th colspan="4" style="text-align : center">Scenario 1</th>
      <th colspan="4" style="text-align : center">Scenario 2</th>
      <th colspan="4" style="text-align : center">Scenario 3</th>
  </tr>
  <tr>
      <th style="text-align : center">Models</th>
      <th style="text-align : center">DT</th>
      <th style="text-align : center">RF</th>
      <th style="text-align : center">NN</th>
      <th style="text-align : center">SVC</th>
      <th style="text-align : center">DT</th>
      <th style="text-align : center">RF</th>
      <th style="text-align : center">NN</th>
      <th style="text-align : center">SVC</th>
      <th style="text-align : center">DT</th>
      <th style="text-align : center">RF</th>
      <th style="text-align : center">NN</th>
      <th style="text-align : center">SVC</th>
  </tr>
  <tr>
      <td style="text-align : center">Precision</td>
      <td style="text-align : center">0.826</td>
      <td style="text-align : center">0.843</td>
      <td style="text-align : center">0.921</td>
      <td style="text-align : center">0.831</td>
      <td style="text-align : center">0.824</td>
      <td style="text-align : center">0.833</td>
      <td style="text-align : center">0.951</td>
      <td style="text-align : center">0.795</td>
      <td style="text-align : center">0.743</td>
      <td style="text-align : center">0.726</td>
      <td style="text-align : center">0.818</td>
      <td style="text-align : center">0.831</td>
  </tr>
   <tr>
      <td style="text-align : center">Recall</td>
      <td style="text-align : center">0.919</td>
      <td style="text-align : center">0.952</td>
      <td style="text-align : center">0.935</td>
      <td style="text-align : center">0.952</td>
      <td style="text-align : center">0.903</td>
      <td style="text-align : center">0.968</td>
      <td style="text-align : center">0.935</td>
      <td style="text-align : center">0.935</td>
      <td style="text-align : center">0.935</td>
      <td style="text-align : center">0.984</td>
      <td style="text-align : center">0.871</td>
      <td style="text-align : center">0.952</td>
  </tr>
   <tr>
      <td style="text-align : center">AUROC</td>
      <td style="text-align : center">0.917</td>
      <td style="text-align : center">0.974</td>
      <td style="text-align : center">0.955</td>
      <td style="text-align : center">0.963</td>
      <td style="text-align : center">0.905</td>
      <td style="text-align : center">0.957</td>
      <td style="text-align : center">0.953</td>
      <td style="text-align : center">0.958</td>
      <td style="text-align : center">0.825</td>
      <td style="text-align : center">0.930</td>
      <td style="text-align : center">0.853</td>
      <td style="text-align : center">0.946</td>
  </tr>
   <tr>
      <td style="text-align : center">Acc. (Train)</td>
      <td style="text-align : center">0.953</td>
      <td style="text-align : center">0.959</td>
      <td style="text-align : center">1.000</td>
      <td style="text-align : center">0.929</td>
      <td style="text-align : center">0.953</td>
      <td style="text-align : center">0.953</td>
      <td style="text-align : center">1.000</td>
      <td style="text-align : center">0.919</td>
      <td style="text-align : center">0.885</td>
      <td style="text-align : center">0.899</td>
      <td style="text-align : center">1.000</td>
      <td style="text-align : center">0.875</td>
  </tr>
   <tr>
      <td style="text-align : center">Acc. (Test)</td>
      <td style="text-align : center">0.828</td>
      <td style="text-align : center">0.859</td>
      <td style="text-align : center">0.909</td>
      <td style="text-align : center">0.848</td>
      <td style="text-align : center">0.818</td>
      <td style="text-align : center">0.859</td>
      <td style="text-align : center">0.929</td>
      <td style="text-align : center">0.808</td>
      <td style="text-align : center">0.758</td>
      <td style="text-align : center">0.758</td>
      <td style="text-align : center">0.798</td>
      <td style="text-align : center">0.848</td>
  </tr>
</table>

In this first approach, the Neural Network fits the training data perfectly (training accuracy of 1.000 in every scenario), which suggests overfitting, although its test accuracy remains relatively high. The RF model achieved the highest or joint-highest recall in all three scenarios (the fewest Type II errors), although its precision is below that of the NN. It did not fit the training set perfectly, and it reached a test accuracy of about 85.9% in Scenarios 1 and 2 and 75.8% in Scenario 3. 

#### Approach 2: Multi-class Classification of Student's Performance
<table border = "1" style="width:95%">
  <tr>
      <th>Measures</th>
      <th colspan="4" style="text-align : center">Scenario 1</th>
      <th colspan="4" style="text-align : center">Scenario 2</th>
      <th colspan="4" style="text-align : center">Scenario 3</th>
  </tr>
  <tr>
      <th style="text-align : center">Models</th>
      <th style="text-align : center">DT</th>
      <th style="text-align : center">RF</th>
      <th style="text-align : center">NN</th>
      <th style="text-align : center">SVC</th>
      <th style="text-align : center">DT</th>
      <th style="text-align : center">RF</th>
      <th style="text-align : center">NN</th>
      <th style="text-align : center">SVC</th>
      <th style="text-align : center">DT</th>
      <th style="text-align : center">RF</th>
      <th style="text-align : center">NN</th>
      <th style="text-align : center">SVC</th>
  </tr>
  <tr>
      <td style="text-align : center">Precision</td>
      <td style="text-align : center">0.717</td>
      <td style="text-align : center">0.697</td>
      <td style="text-align : center">0.697</td>
      <td style="text-align : center">0.697</td>
      <td style="text-align : center">0.697</td>
      <td style="text-align : center">0.646</td>
      <td style="text-align : center">0.697</td>
      <td style="text-align : center">0.626</td>
      <td style="text-align : center">0.545</td>
      <td style="text-align : center">0.556</td>
      <td style="text-align : center">0.545</td>
      <td style="text-align : center">0.606</td>
  </tr>
   <tr>
      <td style="text-align : center">Recall</td>
      <td style="text-align : center">0.717</td>
      <td style="text-align : center">0.697</td>
      <td style="text-align : center">0.697</td>
      <td style="text-align : center">0.697</td>
      <td style="text-align : center">0.697</td>
      <td style="text-align : center">0.646</td>
      <td style="text-align : center">0.697</td>
      <td style="text-align : center">0.626</td>
      <td style="text-align : center">0.545</td>
      <td style="text-align : center">0.556</td>
      <td style="text-align : center">0.545</td>
      <td style="text-align : center">0.606</td>
  </tr>
   <tr>
      <td style="text-align : center">AUROC</td>
      <td style="text-align : center">0.925</td>
      <td style="text-align : center">0.914</td>
      <td style="text-align : center">0.900</td>
      <td style="text-align : center">0.928</td>
      <td style="text-align : center">0.911</td>
      <td style="text-align : center">0.885</td>
      <td style="text-align : center">0.886</td>
      <td style="text-align : center">0.933</td>
      <td style="text-align : center">0.811</td>
      <td style="text-align : center">0.813</td>
      <td style="text-align : center">0.788</td>
      <td style="text-align : center">0.859</td>
  </tr>
   <tr>
      <td style="text-align : center">Acc. (Train)</td>
      <td style="text-align : center">0.814</td>
      <td style="text-align : center">0.868</td>
      <td style="text-align : center">0.979</td>
      <td style="text-align : center">0.747</td>
      <td style="text-align : center">0.811</td>
      <td style="text-align : center">0.828</td>
      <td style="text-align : center">0.970</td>
      <td style="text-align : center">0.662</td>
      <td style="text-align : center">0.672</td>
      <td style="text-align : center">0.784</td>
      <td style="text-align : center">1.000</td>
      <td style="text-align : center">0.598</td>
  </tr>
   <tr>
      <td style="text-align : center">Acc. (Test)</td>
      <td style="text-align : center">0.717</td>
      <td style="text-align : center">0.697</td>
      <td style="text-align : center">0.697</td>
      <td style="text-align : center">0.697</td>
      <td style="text-align : center">0.697</td>
      <td style="text-align : center">0.646</td>
      <td style="text-align : center">0.697</td>
      <td style="text-align : center">0.626</td>
      <td style="text-align : center">0.545</td>
      <td style="text-align : center">0.556</td>
      <td style="text-align : center">0.545</td>
      <td style="text-align : center">0.606</td>
  </tr>
</table>

In the second approach, the classifiers separate the bands (classes) well (i.e. high AUROC). However, the accuracy of the classifiers is still limited, with some classifiers overfitting the training data and producing low test set accuracy. (Precision and recall are micro-averaged here, so both equal the test accuracy.) 
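
A small sketch of why micro-averaged precision and recall coincide with accuracy when every sample carries exactly one label: each wrong prediction counts once as a false positive (for the predicted band) and once as a false negative (for the true band). The function name and the toy labels are assumptions made for illustration.

```python
def micro_precision_recall(y_true, y_pred):
    """Pool true positives, false positives and false negatives over all classes."""
    tp = fp = fn = 0
    for label in sorted(set(y_true) | set(y_pred)):
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label:
                fp += 1  # predicted this band, but the true band differs
            elif t == label:
                fn += 1  # true band is this one, but another was predicted
    return tp / (tp + fp), tp / (tp + fn)


y_true = [1, 2, 3, 4, 5, 5, 3]
y_pred = [1, 3, 3, 4, 5, 4, 3]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

precision, recall = micro_precision_recall(y_true, y_pred)
print(precision == recall == accuracy)  # True (all three are 5/7)
```

A macro average (the unweighted mean of the per-class scores) would generally give different values for precision and recall.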

#### Approach 3: Regression
<table border = "1" style="width:95%">
  <tr>
      <th>Measures</th>
      <th colspan="4" style="text-align : center">Scenario 1</th>
      <th colspan="4" style="text-align : center">Scenario 2</th>
      <th colspan="4" style="text-align : center">Scenario 3</th>
  </tr>
  <tr>
      <th style="text-align : center">Models</th>
      <th style="text-align : center">DT</th>
      <th style="text-align : center">RF</th>
      <th style="text-align : center">NN</th>
      <th style="text-align : center">SVR</th>
      <th style="text-align : center">DT</th>
      <th style="text-align : center">RF</th>
      <th style="text-align : center">NN</th>
      <th style="text-align : center">SVR</th>
      <th style="text-align : center">DT</th>
      <th style="text-align : center">RF</th>
      <th style="text-align : center">NN</th>
      <th style="text-align : center">SVR</th>
  </tr>
  <tr>
      <td style="text-align : center">RMSE</td>
      <td style="text-align : center">1.980</td>
      <td style="text-align : center">1.920</td>
      <td style="text-align : center">2.331</td>
      <td style="text-align : center">2.748</td>
      <td style="text-align : center">2.013</td>
      <td style="text-align : center">1.939</td>
      <td style="text-align : center">2.508</td>
      <td style="text-align : center">2.714</td>
      <td style="text-align : center">3.607</td>
      <td style="text-align : center">2.681</td>
      <td style="text-align : center">3.613</td>
      <td style="text-align : center">3.602</td>
  </tr>
</table>

Finally, in Approach 3, the Random Forest Regressor achieves the lowest RMSE in all three scenarios. However, the error is still sizeable, at approximately 2 to 3 points. An error of this size can move a prediction by one or two bands (i.e. Approach 2). 

### In Conclusion...
The Random Forest Classifier and Regressor (RF) seem to be the best of the models selected. Also, as stated on the UCI Machine Learning Repository, G3 is highly correlated with G1 and G2 because they are interdependent (i.e. G3 depends on G1 and G2). Removing either of these features (Scenarios 2 and 3) generally leads to a decrease in accuracy. 

This notebook demonstrates the use of the Random Forest Classifier and Regressor for all nine combinations of approach and scenario.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_auc_score, precision_score, recall_score, mean_squared_error

maths = pd.read_csv('student-mat.csv', sep = ';')

In [2]:
"""
After loading the data set, we can see that some columns hold values that are neither ints nor floats. Many classifiers do not accept strs, so these labels have to be encoded as ints. Using the LabelEncoder, each such column is converted into integer classes (eg. 'GP' = 0, 'MS' = 1). The encoders are stored in a dictionary because we may need to transform the values back for exploratory data analysis (using the .inverse_transform method).
"""
le_dict = {}  # column name -> fitted LabelEncoder
for col in maths.columns:
    # A column is treated as categorical if its first value is a str
    if isinstance(maths[col].iloc[0], str):
        encoder = LabelEncoder()
        maths[col] = encoder.fit_transform(maths[col])
        le_dict[col] = encoder
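
The mapping that label encoding produces can be illustrated without any library. This is only a sketch of the idea (sorted unique labels are numbered from 0), not scikit-learn's implementation; the helper `fit_label_mapping` and the sample values are assumptions.

```python
def fit_label_mapping(values):
    """Return the sorted unique labels and a label -> integer code mapping."""
    classes = sorted(set(values))
    return classes, {label: code for code, label in enumerate(classes)}


schools = ["GP", "MS", "GP", "GP", "MS"]
classes, mapping = fit_label_mapping(schools)

encoded = [mapping[v] for v in schools]        # forward transform
decoded = [classes[code] for code in encoded]  # inverse transform

print(mapping)             # {'GP': 0, 'MS': 1}
print(encoded)             # [0, 1, 0, 0, 1]
print(decoded == schools)  # True
```

Keeping the list of classes is what makes the inverse step possible, which is why the notebook stores each fitted encoder in `le_dict`.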

##### Approach 1

In [3]:
"""
Approach 1, Scenario 1: Binary Classification considering all features
"""
maths['PF'] = np.where(maths['G3'] >= 10, 1, 0)

X = maths[maths.columns[:-2]] # All columns except for G3 and the new column 'PF'
y = maths[maths.columns[-1]] # PF: the output value
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0) 

clf = RandomForestClassifier(random_state = 0, max_depth = 4).fit(X_train, y_train)
grades_predicted = clf.predict(X_test)
probability = clf.predict_proba(X_test)
prec_score = precision_score(y_test, grades_predicted)
rec_score = recall_score(y_test, grades_predicted)
AUROC_score = roc_auc_score(y_test, probability[:, 1])
accuracy_train = clf.score(X_train, y_train)
accuracy_test = clf.score(X_test, y_test)

print('\033[1m'+'Approach 1, Scenario 1', '\033[0m' + '\nPrecision: {} \nRecall: {} \nAUROC: {} \nAcc. (Train): {} \nAcc.(Test): {}'.format(prec_score, rec_score, AUROC_score, accuracy_train, accuracy_test))

Approach 1, Scenario 1
Precision: 0.8428571428571429 
Recall: 0.9516129032258065 
AUROC: 0.9742807323452485 
Acc. (Train): 0.9594594594594594 
Acc.(Test): 0.8585858585858586


In [4]:
"""
Approach 1, Scenario 2: Binary classification without considering G1
"""
X = maths[list(maths.columns[:-4]) + [maths.columns[-3]]] # Excludes G1, G3, and PF
y = maths[maths.columns[-1]]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0) 

clf = RandomForestClassifier(random_state = 0, max_depth = 4).fit(X_train, y_train)
grades_predicted = clf.predict(X_test)
probability = clf.predict_proba(X_test)
prec_score = precision_score(y_test, grades_predicted)
rec_score = recall_score(y_test, grades_predicted)
AUROC_score = roc_auc_score(y_test, probability[:, 1])
accuracy_train = clf.score(X_train, y_train)
accuracy_test = clf.score(X_test, y_test)

print('\033[1m'+'Approach 1, Scenario 2', '\033[0m' + '\nPrecision: {} \nRecall: {} \nAUROC: {} \nAcc. (Train): {} \nAcc.(Test): {}'.format(prec_score, rec_score, AUROC_score, accuracy_train, accuracy_test))

Approach 1, Scenario 2
Precision: 0.8333333333333334 
Recall: 0.967741935483871 
AUROC: 0.9572798605056669 
Acc. (Train): 0.9527027027027027 
Acc.(Test): 0.8585858585858586


In [5]:
"""
Approach 1, Scenario 3: Binary classification without considering G2
"""
X = maths[list(maths.columns[:-4]) + [maths.columns[-4]]] # Excludes G2, G3, and PF
y = maths[maths.columns[-1]]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0) 

clf = RandomForestClassifier(random_state = 0, max_depth = 4).fit(X_train, y_train)
grades_predicted = clf.predict(X_test)
probability = clf.predict_proba(X_test)
prec_score = precision_score(y_test, grades_predicted)
rec_score = recall_score(y_test, grades_predicted)
AUROC_score = roc_auc_score(y_test, probability[:, 1])
accuracy_train = clf.score(X_train, y_train)
accuracy_test = clf.score(X_test, y_test)

print('\033[1m'+'Approach 1, Scenario 3', '\033[0m' + '\nPrecision: {} \nRecall: {} \nAUROC: {} \nAcc. (Train): {} \nAcc.(Test): {}'.format(prec_score, rec_score, AUROC_score, accuracy_train, accuracy_test))

Approach 1, Scenario 3
Precision: 0.7261904761904762 
Recall: 0.9838709677419355 
AUROC: 0.9298169136878813 
Acc. (Train): 0.8986486486486487 
Acc.(Test): 0.7575757575757576


##### Approach 2

In [6]:
"""
Approach 2, Scenario 1: Multi-class Classification considering ALL features
"""
g3 = maths['G3']
# Map G3 to Bands 1 (best) to 4; any score outside these ranges falls into Band 5
maths['Band'] = np.select(
    [g3.between(16, 20), g3.between(14, 15), g3.between(12, 13), g3.between(10, 11)],
    [1, 2, 3, 4],
    default = 5,
)
        
X = maths[maths.columns[:-3]] # All columns except for G3, 'PF', and 'Band'
y = maths[maths.columns[-1]] # Band: the output value
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0) 

clf = RandomForestClassifier(random_state = 0, max_depth = 4).fit(X_train, y_train)
grades_predicted = clf.predict(X_test)
probability = clf.predict_proba(X_test)
prec_score = precision_score(y_test, grades_predicted, average = 'micro')
rec_score = recall_score(y_test, grades_predicted, average = 'micro')
AUROC_score = roc_auc_score(y_test, probability, multi_class = 'ovo')
accuracy_train = clf.score(X_train, y_train)
accuracy_test = clf.score(X_test, y_test)

print('\033[1m'+'Approach 2, Scenario 1', '\033[0m' + '\nPrecision: {} \nRecall: {} \nAUROC: {} \nAcc. (Train): {} \nAcc.(Test): {}'.format(prec_score, rec_score, AUROC_score, accuracy_train, accuracy_test))

Approach 2, Scenario 1
Precision: 0.696969696969697 
Recall: 0.696969696969697 
AUROC: 0.9135694135694135 
Acc. (Train): 0.8682432432432432 
Acc.(Test): 0.696969696969697


In [7]:
"""
Approach 2, Scenario 2: Multi-class Classification without considering G1
"""
X = maths[list(maths.columns[:-5]) + [maths.columns[-4]]] # All columns except for G1, G3, 'PF', and 'Band'
y = maths[maths.columns[-1]] # Band: the output value
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0) 

clf = RandomForestClassifier(random_state = 0, max_depth = 4).fit(X_train, y_train)
grades_predicted = clf.predict(X_test)
probability = clf.predict_proba(X_test)
prec_score = precision_score(y_test, grades_predicted, average = 'micro')
rec_score = recall_score(y_test, grades_predicted, average = 'micro')
AUROC_score = roc_auc_score(y_test, probability, multi_class = 'ovo')
accuracy_train = clf.score(X_train, y_train)
accuracy_test = clf.score(X_test, y_test)

print('\033[1m'+'Approach 2, Scenario 2', '\033[0m' + '\nPrecision: {} \nRecall: {} \nAUROC: {} \nAcc. (Train): {} \nAcc.(Test): {}'.format(prec_score, rec_score, AUROC_score, accuracy_train, accuracy_test))

Approach 2, Scenario 2
Precision: 0.6464646464646465 
Recall: 0.6464646464646465 
AUROC: 0.8854971854971856 
Acc. (Train): 0.8277027027027027 
Acc.(Test): 0.6464646464646465


In [8]:
"""
Approach 2, Scenario 3: Multi-class Classification without considering G2
"""
X = maths[maths.columns[:-4]] # All columns except for G2, G3, 'PF', and 'Band'
y = maths[maths.columns[-1]] # Band: the output value
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0) 

clf = RandomForestClassifier(random_state = 0, max_depth = 4).fit(X_train, y_train)
grades_predicted = clf.predict(X_test)
probability = clf.predict_proba(X_test)
prec_score = precision_score(y_test, grades_predicted, average = 'micro')
rec_score = recall_score(y_test, grades_predicted, average = 'micro')
AUROC_score = roc_auc_score(y_test, probability, multi_class = 'ovo')
accuracy_train = clf.score(X_train, y_train)
accuracy_test = clf.score(X_test, y_test)

print('\033[1m'+'Approach 2, Scenario 3', '\033[0m' + '\nPrecision: {} \nRecall: {} \nAUROC: {} \nAcc. (Train): {} \nAcc.(Test): {}'.format(prec_score, rec_score, AUROC_score, accuracy_train, accuracy_test))

Approach 2, Scenario 3
Precision: 0.5555555555555556 
Recall: 0.5555555555555556 
AUROC: 0.8126594126594127 
Acc. (Train): 0.7837837837837838 
Acc.(Test): 0.5555555555555556


##### Approach 3

In [9]:
"""
Approach 3, Scenario 1: Regression considering all features
"""
X = maths[maths.columns[:-3]]
y = maths[maths.columns[-3]] #G3 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0) 

reg = RandomForestRegressor(random_state = 0).fit(X_train, y_train)
grades_predicted = reg.predict(X_test)
mse = mean_squared_error(y_test, grades_predicted)
rmse = np.sqrt(mse)
print('\033[1m'+'Approach 3, Scenario 1', '\033[0m' + '\nRMSE: {}'.format(rmse))

Approach 3, Scenario 1
RMSE: 1.9197858676888364


In [10]:
"""
Approach 3, Scenario 2: Regression without considering G1
"""
X = maths[list(maths.columns[:-5]) + [maths.columns[-4]]]
y = maths[maths.columns[-3]] #G3 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0) 

reg = RandomForestRegressor(random_state = 0).fit(X_train, y_train)
grades_predicted = reg.predict(X_test)
mse = mean_squared_error(y_test, grades_predicted)
rmse = np.sqrt(mse)
print('\033[1m'+'Approach 3, Scenario 2', '\033[0m' + '\nRMSE: {}'.format(rmse))

Approach 3, Scenario 2
RMSE: 1.9390268828784332


In [11]:
"""
Approach 3, Scenario 3: Regression without considering G2
"""
X = maths[maths.columns[:-4]]
y = maths[maths.columns[-3]]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0) 

reg = RandomForestRegressor(random_state = 0).fit(X_train, y_train)
grades_predicted = reg.predict(X_test)
mse = mean_squared_error(y_test, grades_predicted)
rmse = np.sqrt(mse)
print('\033[1m'+'Approach 3, Scenario 3', '\033[0m' + '\nRMSE: {}'.format(rmse))

Approach 3, Scenario 3
RMSE: 2.6811585389736114
