# Using linear regression to predict final student grades

In this notebook, I use linear regression to predict a student's final grade based on:

* first term grades
* second term grades
* time spent studying
* relationship with family
* health

In [1]:
import numpy as np
import pandas as pd

In [2]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

## Taking a look at the data

First, I wanted to select a good breadth of attributes to best utilize the data gathered. My selection of attributes suggested the use of linear regression.

I had hesitations since I have typically seen linear regression done with one or two independent variables in my training data. For the sake of this notebook, I wanted to see how a multiple linear regression model performed.

In [3]:
df = pd.read_csv('student-mat.csv', sep=";")
df.head()

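| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |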


In [4]:
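predict = "G3"  # target column: the final grade
data = df[["G1", "G2", "G3", "studytime", "famrel", "health"]]
# Pass the axis by keyword; the positional form drop([predict], 1) is deprecated
X = np.array(data.drop(columns=[predict]))
y = np.array(data[predict])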


## Creating and training the model

In [5]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.10)

In [6]:
model = LinearRegression().fit(X_train, y_train)
# score() returns the coefficient of determination (R^2), not a classification accuracy
print("Training R^2: ", model.score(X_train, y_train))
print("Intercept: ", model.intercept_)
print("Weights: ", model.coef_)

Training R^2:  0.8262491324002031
Intercept:  -3.095263497021433
Weights:  [ 0.15426626  1.00044711 -0.19823018  0.32454595  0.0600401 ]
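The weights are printed in the same order as the columns of `X`, so it helps to label them before reading anything into them. The sketch below is a small illustration, not part of the original run: it copies the weights from the output above by hand and assumes the column order `G1, G2, studytime, famrel, health` (the order left after dropping `G3`). The names `feature_names` and `weight_table` are my own.

```python
import pandas as pd

# Column order of X after dropping the target G3 (assumed from the cell above)
feature_names = ["G1", "G2", "studytime", "famrel", "health"]

# Weights copied by hand from the printed output above
weights = [0.15426626, 1.00044711, -0.19823018, 0.32454595, 0.0600401]

# Label each weight and sort by magnitude, largest first
weight_table = pd.Series(weights, index=feature_names)
weight_table = weight_table.sort_values(key=lambda s: s.abs(), ascending=False)
print(weight_table)
```

In the notebook itself, `model.coef_` could be passed in place of the hand-copied list.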


## Evaluating the model

In [7]:
y_pred = model.predict(X_val)
# score() returns R^2 on the held-out validation split
print("Validation R^2: ", model.score(X_val, y_val))

Validation R^2:  0.8456869984396834
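For a `LinearRegression` model, `score` returns the coefficient of determination, R², rather than an accuracy in the classification sense. As a rough sketch of what that number means, the hypothetical helper below (`r_squared` is my own name, not a scikit-learn function) computes it directly as one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Squared error of the model's predictions
    ss_res = np.sum((y_true - y_pred) ** 2)
    # Squared error of always predicting the mean of y_true
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

Calling `r_squared(y_val, y_pred)` in this notebook should agree with `model.score(X_val, y_val)`. A value of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean grade.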


In [8]:
print("Mean absolute error: ", metrics.mean_absolute_error(y_val, y_pred))
print("Mean squared error: ", metrics.mean_squared_error(y_val, y_pred))
print("Root mean squared error: ", np.sqrt(metrics.mean_squared_error(y_val, y_pred)))

Mean absolute error:  0.9950035204787799
Mean squared error:  2.2140057898866425
Root mean squared error:  1.487953557704891


In [9]:
compare = pd.DataFrame({'Actual': y_val, 'Predicted': y_pred})
print(compare.head(10))

   Actual  Predicted
0       9   9.780110
1      11  12.469984
2      11  11.562559
3      10   8.344815
4      11  10.347868
5      10   8.229829
6       0   7.009981
7      12  12.528820
8       8   7.472780
9      14  15.427361
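One row in the table stands out: an actual grade of 0 with a prediction of about 7. To make errors like this easier to spot, the sketch below adds error columns to the ten rows shown above (copied by hand, so it covers only these rows and not the full validation set). The names `compare_head`, `Error`, and `AbsError` are my own.

```python
import numpy as np
import pandas as pd

# The ten rows printed above, copied by hand
compare_head = pd.DataFrame({
    "Actual": [9, 11, 11, 10, 11, 10, 0, 12, 8, 14],
    "Predicted": [9.780110, 12.469984, 11.562559, 8.344815, 10.347868,
                  8.229829, 7.009981, 12.528820, 7.472780, 15.427361],
})

# Signed and absolute prediction errors for each row
compare_head["Error"] = compare_head["Predicted"] - compare_head["Actual"]
compare_head["AbsError"] = compare_head["Error"].abs()

# MAE and RMSE over these ten rows only
mae = compare_head["AbsError"].mean()
rmse = np.sqrt((compare_head["Error"] ** 2).mean())

# Index of the row with the largest absolute error
worst = compare_head["AbsError"].idxmax()

print(compare_head.sort_values("AbsError", ascending=False))
print("MAE:", mae, "RMSE:", rmse, "worst row:", worst)
```

Because RMSE squares each error before averaging, a single large miss pulls it up more than it pulls up MAE. That is consistent with the full validation results above, where the RMSE is noticeably larger than the MAE.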


## Conclusions

The model's weights suggest that the most recent term's grade (G2) was the strongest predictor of a student's final grade: its weight is roughly 1.0, compared with about 0.15 for the first term's grade (G1). This was expected. As a student progresses through a class, it typically becomes harder for their grade to change, because each new grade tends to carry less and less weight.

The model also produced some interesting results. First, a student's relationship with their family carried more weight than I had expected (about 0.32, the second-largest in magnitude). This can be easily rationalized, but I didn't expect it. Second, studytime was given a negative weight (about -0.20). I recall some negative correlation between grades and study time in my own experience, but I was not convinced this would hold in general.
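One caveat when comparing weights across attributes is that a raw weight depends on the units of its feature, so a feature measured on a narrow scale can receive a large weight without mattering more. A possible follow-up, which was not run for this notebook, is to refit the model on standardized features so the weights are expressed per standard deviation. The helper name `standardized_weights` is my own. The sketch uses scikit-learn's `StandardScaler`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

def standardized_weights(X, y, names):
    """Fit a linear model on z-scored features and return {name: weight}."""
    # Rescale every column to zero mean and unit variance
    X_scaled = StandardScaler().fit_transform(np.asarray(X, dtype=float))
    model = LinearRegression().fit(X_scaled, y)
    # Each weight is now the change in y per one standard deviation of the feature
    return {name: float(w) for name, w in zip(names, model.coef_)}
```

In this notebook it could be called as `standardized_weights(X_train, y_train, ["G1", "G2", "studytime", "famrel", "health"])`. I have not run this, so I make no claim about what the resulting ranking would be.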