# Presentation of experiment results

The dataset "Alcohol Effects On Study" contains extensive information about students of two Portuguese schools and their final grades in two subjects: maths and Portuguese. I trained several models to predict the students' final grades. As instructed in the dataset description, I did not use the mid-term grades as features, since this would make the task easy and uninteresting.

I used two models: a Decision Tree Regressor and Linear Regression. At first, I fit them to the whole dataset, and both overfitted the training data. On the test data, the R Squared score of the Tree Regressor was negative, and that of the Linear Regression was positive but close to 0.
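
A negative score is possible because R Squared compares the model's squared error with that of a constant predictor. Under the usual definition (the one scikit-learn's `score` method uses for regressors), with $y_i$ the true grades, $\hat{y}_i$ the predicted grades, and $\bar{y}$ the mean of the true grades in the evaluated set:

```latex
R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}
```

A model that always predicts $\bar{y}$ scores 0, so a negative value means the predictions are worse than that baseline on the evaluated data.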

The reason is that the dataset is small (only 395 entries) while there is a lot of information about each student (30 columns). Some of the information seems not very relevant to the task (e.g. the "going out with friends" score, or the travel time to school). For the tree fitted to the full math dataset, the nine most important features together account for only about 67% of the total feature importance, so the model did not concentrate on a few strong predictors.
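
The share held by the nine most important features can be computed directly from an array of importances. The sketch below is a minimal illustration: the helper name `top_k_importance_share` and the example values are hypothetical and not part of the experiment.

```python
import numpy as np

def top_k_importance_share(importances, k):
    """Return the fraction of total importance held by the k largest entries."""
    values = np.sort(np.asarray(importances, dtype=float))[::-1]  # descending order
    return values[:k].sum() / values.sum()

# Hypothetical importances, not taken from the experiment
example = [0.4, 0.3, 0.2, 0.1]
print(top_k_importance_share(example, 2))
```

With a fitted tree, the same helper could be called as `top_k_importance_share(tree_reg.feature_importances_, 9)`.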

Next, I picked the columns that seem to contain the most relevant information:

*   'studytime' - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
*   'failures' - number of past class failures (numeric: n if 1<=n<3, else 4)
*   'schoolsup' - extra educational support (binary: yes or no)
*   'famsup' - family educational support (binary: yes or no)
*   'paid' - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
*   'higher' - wants to take higher education (binary: yes or no)
*   'Dalc' - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
*   'Walc' - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
*   'absences' - number of school absences (numeric: from 0 to 93)

Using only these columns, I trained a new Decision Tree Regressor and a new Linear Regression model, separately for the math grades and for the Portuguese grades.

This time, I put additional constraints on the decision tree to avoid overfitting. Tuning these constraints (the maximal depth of the tree and the maximal number of features considered at each split) was not easy. In the end, the values I settled on gave a positive test R Squared for both the math and the Portuguese data. However, I needed different values for each dataset (max_depth=4 and max_features=4 for math, max_depth=3 and max_features=3 for Portuguese), which suggests that the models are not robust.
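
One way to make this tuning less manual is a cross-validated grid search. The sketch below is not what was done in this experiment: it uses synthetic stand-in data from `make_regression` and an illustrative parameter grid, both of which are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 200 rows and 9 features, like the selected columns
X, y = make_regression(n_samples=200, n_features=9, noise=10.0, random_state=0)

# Candidate constraints; the ranges are illustrative, not the ones tried in the experiment
param_grid = {'max_depth': [2, 3, 4, 5], 'max_features': [3, 4, 6, 9]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=5,          # 5-fold cross-validation
    scoring='r2',  # same metric as in the report
)
search.fit(X, y)

print('Best parameters:', search.best_params_)
print('Mean cross-validated R squared: %.2f' % search.best_score_)
```

Because every candidate is scored on several folds rather than on one fixed split, the chosen values depend less on a single train/test partition.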

Results of the Linear Regression model were also better on the dataset with selected columns: the test R Squared for math rose from 0.08 to 0.12, and the gap between the training and test scores was small.

Workday alcohol consumption ('Dalc') received nonzero importance in the Tree Regressors for both the math and the Portuguese data, and for the Portuguese data it was the second most important feature. Below, I present the feature importances that were higher than 0%.

Feature importances for Math grades:
*   absences: 35%
*   failures: 29%
*   studytime: 10%
*   schoolsup: 9%
*   Walc: 6%
*   higher: 5%
*   Dalc: 4%


Feature importances for Portuguese grades:
*   failures: 37%
*   Dalc: 22%
*   higher: 20%
*   studytime: 7%
*   absences: 7%
*   schoolsup: 5%
*   famsup: 2%

For the math grades, the most important factors are the number of school absences and the number of past class failures. Workday and weekend alcohol consumption together account for 10% of the importance, similar to weekly study time (10%) or extra educational support (9%).

For the Portuguese grades, the most important factor is the number of past class failures, and the second most important is workday alcohol consumption.

Below, I present the R Squared scores of all the models.

For the full math dataset:
*   Tree Regressor on training data: 1.00
*   Tree Regressor on test data: -0.29
*   Linear Regression on training data: 0.29
*   Linear Regression on test data: 0.08

For the math dataset with selected features:
*   Tree Regressor on training data: 0.30
*   Tree Regressor on test data: 0.18
*   Linear Regression on training data: 0.12
*   Linear Regression on test data: 0.12

For the Portuguese dataset with selected features:
*   Tree Regressor on training data: 0.16
*   Tree Regressor on test data: 0.14
*   Linear Regression on training data: 0.27
*   Linear Regression on test data: 0.21

# Code with comments

Importing libraries and data...


In [1]:
import numpy as np
import pandas as pd
import sklearn.model_selection  # also binds the name sklearn; loads the submodule used by train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
pd.set_option('display.max_columns', None)
maths_dataset = pd.read_csv('Maths.csv')
portuguese_dataset = pd.read_csv('Portuguese.csv')

I preprocess the data in the following way:


*   Drop columns containing mid-term grades
*   Change columns containing strings into binary data if there are only two categories (1 for 'yes' in yes/no columns, otherwise 1 for the first category encountered), or into one-hot encoded data with a new column for each category
*   Scale the values in numeric columns so that the lowest value is 0 and the highest is 1 (the target 'G3' is left unscaled)

I also select the columns that seem to be the most important. The models are trained twice: once on the full dataset, and once on the dataset with only the selected columns.
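
The three transformations can be illustrated on a toy table. This is a minimal sketch with hypothetical values: the column names follow the dataset, but the rows are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values; column names follow the dataset
toy = pd.DataFrame({
    'absences': [0, 4, 10],
    'higher': ['yes', 'no', 'yes'],
    'Mjob': ['teacher', 'other', 'at_home'],
})

# Min-max scaling: (x - min) / (max - min) maps the column onto [0, 1]
col = toy['absences'].astype(np.float64)
toy['absences'] = (col - col.min()) / (col.max() - col.min())

# Two categories: a single 0/1 column
toy['higher'] = np.where(toy['higher'] == 'yes', 1, 0)

# More than two categories: one indicator column per category
toy = pd.get_dummies(toy, columns=['Mjob'], prefix='Mjob', dtype=np.float64)

print(toy)
```

The scaled values depend on the minimum and maximum of the data they are computed from, so here they are derived from the whole toy table, as in the preprocessing described above.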



In [101]:
def preprocess_dataset(old_dataset, drop_g1_g2=True, scale_g3=False):
  dataset = old_dataset.copy()
  if drop_g1_g2:
    # Mid-term grades are excluded from the features
    dataset.drop(['G1', 'G2'], axis=1, inplace=True)
  columns = dataset.columns
  for col in columns:
    if col == 'G3' and not scale_g3:
      # Keep the target (final grade) on its original scale
      continue
    if dataset[col].astype(str).str.isnumeric().all():
      # Numeric column: min-max scale to [0, 1]
      col_max, col_min = dataset[col].max(), dataset[col].min()
      multiplier = 1 / (col_max - col_min)
      dataset[col] = multiplier * (dataset[col].astype(np.float64) - col_min)
    else:
      values = dataset[col].unique()
      if len(values) <= 1:
        # A constant column carries no information
        dataset.drop(col, axis=1, inplace=True)
      elif len(values) == 2:
        # Two categories: encode as 0/1, with 'yes' mapped to 1 when present
        val1 = 'yes' if 'yes' in values else values[0]
        dataset[col] = np.where(dataset[col] == val1, 1, 0)
      else:
        # More than two categories: one-hot encode
        dummies = pd.get_dummies(dataset[[col]], prefix=col)
        dataset.drop(col, axis=1, inplace=True)
        dataset = pd.concat([dataset, dummies], axis=1)
  # Features as float64, target (G3) returned separately
  return dataset.drop('G3', axis=1).astype(np.float64), dataset['G3'].astype(np.float64)

def keep_selected_columns(old_dataset):
  return old_dataset.copy()[['studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'higher', 'Dalc', 'Walc', 'absences']]

math_X, math_y = preprocess_dataset(maths_dataset)
portuguese_X, portuguese_y = preprocess_dataset(portuguese_dataset)
math_small_X = keep_selected_columns(math_X)
portuguese_small_X = keep_selected_columns(portuguese_X)

The function below fits the regressors to the given data. It splits the data into train and test sets (70/30, with a fixed random seed) and returns the R Squared scores on the train and test data for each regressor.

In [102]:
def fit_and_evaluate(X, y, regressors):
  # Hold out 30% of the rows as a test set; the fixed seed makes the split reproducible
  X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.3, random_state=0)
  r2s = []
  for regressor in regressors:
    regressor.fit(X_train, y_train)
    # score() returns R squared for scikit-learn regressors
    r2s.append((regressor.score(X_train, y_train), regressor.score(X_test, y_test)))
  return r2s
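
Since the evaluation relies on one fixed split, a quick check of how much the test score moves between splits can be useful. The following is a sketch under stated assumptions: it uses synthetic stand-in data from `make_regression` rather than the student data, and the number of repetitions is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), with 395 rows and 9 features
X, y = make_regression(n_samples=395, n_features=9, noise=50.0, random_state=0)

test_scores = []
for seed in range(10):
    # Same 70/30 proportion as in fit_and_evaluate, but a different shuffle each time
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    model = LinearRegression().fit(X_train, y_train)
    test_scores.append(model.score(X_test, y_test))

print('Test R squared: mean %.2f, std %.2f' % (np.mean(test_scores), np.std(test_scores)))
```

A large standard deviation relative to the mean would indicate that conclusions drawn from a single split should be treated with caution.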

Fitting regressors to the full dataset of math grades. The tree regressor visibly overfits.

In [104]:
tree_reg = DecisionTreeRegressor(random_state=0)
lin_reg = LinearRegression()
regressors = (tree_reg, lin_reg)
math_r2s = fit_and_evaluate(math_X, math_y, regressors)
print('Results of the tree regressor for math grades:')
print('R squared score on the train set equal: %.2f' % math_r2s[0][0])
print('R squared score on the test set equal: %.2f' % math_r2s[0][1])
print('Importances of nine most important features:')
sorted_importance_feature_pairs = sorted(zip(tree_reg.feature_importances_, math_X.columns), reverse=True)
for importance, feature in sorted_importance_feature_pairs[:9]:
  print('%s: %.0f%%' % (feature, 100*importance))
sum_of_low_importances = sum(importance for importance, _ in sorted_importance_feature_pairs[9:])
print('Sum of importances of features ranked lower than 9th: %.3f\n' % sum_of_low_importances)

print('Results of the linear regressor for math grades:')
print('R squared score on the train set equal: %.2f' % math_r2s[1][0])
print('R squared score on the test set equal: %.2f' % math_r2s[1][1])

Results of the tree regressor for math grades:
R squared score on the train set equal: 1.00
R squared score on the test set equal: -0.29
Importances of nine most important features:
absences: 19%
failures: 9%
Walc: 7%
studytime: 6%
age: 6%
Fedu: 5%
Fjob_other: 5%
Mjob_at_home: 4%
romantic: 4%
Sum of importances of features ranked lower than 9th: 0.332

Results of the linear regressor for math grades:
R squared score on the train set equal: 0.29
R squared score on the test set equal: 0.08
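
The pattern above, a perfect train score next to a much lower test score for the unconstrained tree, can be reproduced on synthetic data. The sketch below is only an illustration: the data are random numbers generated under the stated assumptions, not the student data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic data (assumption): 30 features, only the first one carries signal
X = rng.normal(size=(395, 30))
y = 2.0 * X[:, 0] + rng.normal(scale=2.0, size=395)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

trees = {
    'unconstrained': DecisionTreeRegressor(random_state=0),
    'max_depth=3': DecisionTreeRegressor(random_state=0, max_depth=3),
}
scores = {}
for name, tree in trees.items():
    tree.fit(X_train, y_train)
    # Store (train R squared, test R squared) for each tree
    scores[name] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print('%s: train %.2f, test %.2f' % (name, *scores[name]))
```

An unconstrained tree keeps splitting until it reproduces every training target, which is why its train score says little about how it will perform on unseen rows.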


Fitting regressors to the dataset of math grades with nine selected columns. I limit the tree regressor by defining the maximal depth of the tree, and the maximal number of features used during each split.

In [105]:
tree_reg = DecisionTreeRegressor(random_state=0, max_depth=4, max_features=4)
lin_reg = LinearRegression()
regressors = (tree_reg, lin_reg)
math_r2s = fit_and_evaluate(math_small_X, math_y, regressors)
print('Results of the tree regressor for math grades using only the important columns:')
print('R squared score on the train set equal: %.2f' % math_r2s[0][0])
print('R squared score on the test set equal: %.2f' % math_r2s[0][1])
print('Feature importances:')
sorted_importance_feature_pairs = sorted(zip(tree_reg.feature_importances_, math_small_X.columns), reverse=True)
for importance, feature in sorted_importance_feature_pairs:
  print('%s: %.0f%%' % (feature, 100*importance))

print('Results of the linear regressor for math grades:')
print('R squared score on the train set equal: %.2f' % math_r2s[1][0])
print('R squared score on the test set equal: %.2f' % math_r2s[1][1])

Results of the tree regressor for math grades using only the important columns:
R squared score on the train set equal: 0.30
R squared score on the test set equal: 0.18
Feature importances:
absences: 35%
failures: 29%
studytime: 10%
schoolsup: 9%
Walc: 6%
higher: 5%
Dalc: 4%
paid: 0%
famsup: 0%
Results of the linear regressor for math grades:
R squared score on the train set equal: 0.12
R squared score on the test set equal: 0.12


Fitting regressors to the dataset of Portuguese grades with nine selected columns.

In [106]:
tree_reg = DecisionTreeRegressor(random_state=0, max_depth=3, max_features=3)
lin_reg = LinearRegression()
regressors = (tree_reg, lin_reg)
portuguese_r2s = fit_and_evaluate(portuguese_small_X, portuguese_y, regressors)
print('Results of the tree regressor for portuguese grades using only the important columns:')
print('R squared score on the train set equal: %.2f' % portuguese_r2s[0][0])
print('R squared score on the test set equal: %.2f' % portuguese_r2s[0][1])
print('Feature importances:')
sorted_importance_feature_pairs = sorted(zip(tree_reg.feature_importances_, portuguese_small_X.columns), reverse=True)
for importance, feature in sorted_importance_feature_pairs:
  print('%s: %.0f%%' % (feature, 100*importance))

print('Results of the linear regressor for portuguese grades:')
print('R squared score on the train set equal: %.2f' % portuguese_r2s[1][0])
print('R squared score on the test set equal: %.2f' % portuguese_r2s[1][1])

Results of the tree regressor for portuguese grades using only the important columns:
R squared score on the train set equal: 0.16
R squared score on the test set equal: 0.14
Feature importances:
failures: 37%
Dalc: 22%
higher: 20%
studytime: 7%
absences: 7%
schoolsup: 5%
famsup: 2%
paid: 0%
Walc: 0%
Results of the linear regressor for portuguese grades:
R squared score on the train set equal: 0.27
R squared score on the test set equal: 0.21
