This notebook predicts students' grades from the Alcohol Effects on Study dataset, which contains both numerical and categorical features. Three methods are used: Linear Regression, Stochastic Gradient Descent Regression, and Random Forest Regression.
Although the grades are integers, I would argue that this task can also be treated as a regression problem. Grades are meant to measure students' knowledge, which is continuous; their discretization is a convenience and does not reflect the nature of the quantity they operationalize.

First, to download the Alcohol Effects on Study dataset, we mount Google Drive and upload the Kaggle API token (kaggle.json).

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

from google.colab import files

Mounted at /content/gdrive


In [2]:
files.upload()  # this will prompt you to upload kaggle.json

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"paulinakaczyska","key":"<redacted>"}'}

In [3]:
# make sure the kaggle.json file is present
!ls -lha kaggle.json

# install the Kaggle API client
!pip install -q kaggle

# the Kaggle API client expects the file to be in ~/.kaggle, so copy it there
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
  

-rw-r--r-- 1 root root 71 Oct 13 14:53 kaggle.json


In [4]:
# set permissions so that only the owner can read the token
!chmod 600 /root/.kaggle/kaggle.json

# check the working directory before downloading the dataset
!pwd

/content


In [5]:
#downloading
!kaggle datasets download -d whenamancodes/alcohol-effects-on-study

#unzipping
!unzip alcohol-effects-on-study.zip

Downloading alcohol-effects-on-study.zip to /content
  0% 0.00/18.1k [00:00<?, ?B/s]
100% 18.1k/18.1k [00:00<00:00, 11.5MB/s]
Archive:  alcohol-effects-on-study.zip
  inflating: Maths.csv               
  inflating: Portuguese.csv          


# Data Preprocessing

In [6]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import pandas as pd
import numpy as np

from torch.utils.data import DataLoader, Dataset
from sklearn.model_selection import train_test_split

In [7]:
binary = ['school', 'sex', 'address', 'famsize', 'Pstatus','schoolsup', 'famsup', 'paid', 'activities', 'nursery','higher', 'internet', 'romantic']
categorical_nb = ['Mjob', 'Fjob', 'reason', 'guardian']
numerical = ['Medu', 'Fedu','traveltime', 'studytime','failures','famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']

The binary and categorical features are one-hot encoded. The data are then split into inputs (the features) and outputs (the last three columns, which hold the grades).

In [8]:
math = pd.read_csv("Maths.csv", encoding = "UTF-8")
port = pd.read_csv("Portuguese.csv", encoding = "UTF-8")

math_i_c = pd.get_dummies(math[binary+categorical_nb],drop_first = True)
port_i_c = pd.get_dummies(port[binary+categorical_nb],drop_first = True)

math_o = math.iloc[:,-3:]
port_o = port.iloc[:,-3:]
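The two tables above are encoded separately and later stacked row-wise, which works only if both tables contain the same categories in every column. The following is a minimal sketch, on made-up toy frames (`math_toy` and `port_toy` are hypothetical stand-ins, not the real data), of how separate encoding can diverge and how encoding the stacked table avoids it. Passing `dtype=float` is an assumption added here to keep the encoded columns numeric regardless of the pandas version.

```python
import pandas as pd

# Toy stand-ins for the two tables; the values are made up for illustration.
math_toy = pd.DataFrame({"sex": ["F", "M"], "Mjob": ["teacher", "other"]})
port_toy = pd.DataFrame({"sex": ["M", "F"], "Mjob": ["health", "other"]})

# Encoding each table on its own can yield different columns,
# because each table only knows the categories it contains.
separate_math = pd.get_dummies(math_toy, drop_first=True)
separate_port = pd.get_dummies(port_toy, drop_first=True)
print(list(separate_math.columns))  # ['sex_M', 'Mjob_teacher']
print(list(separate_port.columns))  # ['sex_M', 'Mjob_other']

# Encoding the stacked table guarantees one shared column layout.
stacked = pd.concat([math_toy, port_toy], keys=["math", "port"])
encoded = pd.get_dummies(stacked, drop_first=True, dtype=float)
math_enc = encoded.loc["math"]
port_enc = encoded.loc["port"]
print(list(encoded.columns))  # ['sex_M', 'Mjob_other', 'Mjob_teacher']
```

With `drop_first=True`, the first category of each column (in sorted order) is dropped, so a feature with k categories becomes k-1 indicator columns.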


The Maths and Portuguese records are stacked into a single input array X and a single output array Y.

In [73]:

math_array = pd.concat([math_i_c,math[numerical]],axis=1).to_numpy()
port_array = pd.concat([port_i_c,port[numerical]],axis=1).to_numpy()
X = np.concatenate((math_array,port_array),axis = 0)

math_array_o = math_o.to_numpy()
port_array_o = port_o.to_numpy()
Y = np.concatenate((math_array_o,port_array_o),axis = 0)


The data are split into training and test sets with train_test_split from scikit-learn, holding out 200 samples for testing.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=200, train_size=len(X)-200)
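The split above is random, so the reported errors will vary from run to run. As a hedged sketch on toy arrays (`X_toy` and `Y_toy` are hypothetical stand-ins for X and Y), passing `random_state`, which the original call does not do, makes the split reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for X and Y (10 samples, 2 features, 3 grade columns).
X_toy = np.arange(20).reshape(10, 2)
Y_toy = np.arange(30).reshape(10, 3)

# An integer test_size is an absolute number of held-out samples.
# random_state (an addition, not in the original call) fixes the shuffle.
split_a = train_test_split(X_toy, Y_toy, test_size=3, random_state=0)
split_b = train_test_split(X_toy, Y_toy, test_size=3, random_state=0)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # (7, 2) (3, 2)
```

Two calls with the same seed return identical partitions, which makes the MSE values of the different models comparable across reruns.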

# Linear Regression

In [11]:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

In [74]:
model = LinearRegression()
reg = model.fit(X_train, y_train)

Performance is evaluated with the mean squared error (MSE). The MSE for the first-period, second-period, and whole-period grades is as follows:

In [75]:
np.mean((y_test-reg.predict(X_test))**2, axis = 0)

array([6.20646681, 6.05804657, 7.83107989])
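To make the metric concrete, here is a small worked example on made-up numbers (the arrays are illustrative, not taken from the dataset). Averaging the squared errors along `axis=0` gives one MSE per output column, which should match scikit-learn's `mean_squared_error` with `multioutput="raw_values"`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Two samples, two output columns; values are made up for illustration.
y_true = np.array([[10.0, 12.0], [14.0, 8.0]])
y_pred = np.array([[11.0, 12.0], [12.0, 9.0]])

# Column 0: ((-1)^2 + 2^2) / 2 = 2.5; column 1: (0^2 + (-1)^2) / 2 = 0.5
mse_manual = np.mean((y_true - y_pred) ** 2, axis=0)
mse_sklearn = mean_squared_error(y_true, y_pred, multioutput="raw_values")
print(mse_manual)  # [2.5 0.5]
```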

# Stochastic Gradient Descent Regression

In [52]:
from sklearn.linear_model import SGDRegressor

In [70]:
# SGDRegressor fits a single target, so one model is trained per grade column
sgdr1 = SGDRegressor(loss='squared_error', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=2000).fit(X_train,y_train[:,0])
sgdr2 = SGDRegressor(loss='squared_error', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=2000).fit(X_train,y_train[:,1])
sgdr3 = SGDRegressor(loss='squared_error', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=2000).fit(X_train,y_train[:,2])

In [71]:
print('MSE for the first period',np.mean((y_test[:,0]-sgdr1.predict(X_test))**2, axis = 0))
print('MSE for the second period',np.mean((y_test[:,1]-sgdr2.predict(X_test))**2, axis = 0))
print('MSE for the whole period',np.mean((y_test[:,2]-sgdr3.predict(X_test))**2, axis = 0))


MSE for the first period 7.479915889222295
MSE for the second period 6.925738058989849
MSE for the whole period 8.358964749275772


# Random Forest

The next method is a Random Forest regressor.

In [76]:
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(n_estimators = 300, max_depth=5, random_state=0)
forest = rfr.fit(X_train, y_train)

The MSE for the first-period, second-period, and whole-period grades is as follows:

In [77]:
np.mean((y_test-forest.predict(X_test))**2, axis = 0)

array([6.42762741, 6.6490194 , 7.38385581])

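All three methods achieve similar results. Linear Regression has the lowest error on the first- and second-period grades, while Random Forest is slightly better on the whole-period grade; SGD Regression is the weakest on all three targets. The prediction error for the whole period is larger than for the individual periods. This is interesting since, intuitively, the effects of long-term variables should be more visible over a longer time horizon.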
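The comparison can be checked directly from the MSE values reported in the cells above. The short sketch below copies those numbers into a table, picks the model with the lowest error per target, and converts MSE to RMSE so that the error is expressed in grade points. The variable names are illustrative.

```python
import numpy as np

targets = ["first period", "second period", "whole period"]

# MSE values copied from the outputs reported above.
mse = {
    "Linear Regression": [6.20646681, 6.05804657, 7.83107989],
    "SGD Regression": [7.479915889222295, 6.925738058989849, 8.358964749275772],
    "Random Forest": [6.42762741, 6.6490194, 7.38385581],
}

names = list(mse)
table = np.array([mse[name] for name in names])  # rows: models, columns: targets
rmse = np.sqrt(table)                            # error in grade points

# Model with the lowest MSE for each target.
best = [names[i] for i in table.argmin(axis=0)]
for target, name, err in zip(targets, best, rmse.min(axis=0)):
    print(f"{target}: {name} (RMSE {err:.2f})")
```

Since these numbers come from a single random split, small differences between models should be read with caution.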