# Linear Regression

## Overview

This Jupyter notebook uses linear regression to examine how several student attributes relate to academic performance, specifically the final grade. The kernel uses Python 3.8.5.

## Research Question

What is the relationship between ## TODO

## Dataset

- Dataset Name: Student Performance Data Set
- Link to the dataset: https://archive.ics.uci.edu/ml/datasets/Student+Performance#
- Number of observations: 395
- Description: This dataset describes student performance in mathematics classes at Portuguese secondary schools. The variables include (but are not limited to) study time, free time, number of absences, health status, and final grade.

## Setup

In [81]:
# imports
import pandas as pd
import numpy as np
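import sklearn
import sklearn.model_selection  # imported explicitly so sklearn.model_selection.train_test_split resolves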
from sklearn import linear_model
from sklearn.utils import shuffle
from matplotlib import pyplot as plt
from matplotlib import style
import pickle


In [82]:
# show at most 10 rows/columns when displaying dataframes; set to None to display all
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)

Read in the dataset that will be cleaned and analyzed.

In [83]:
student_data = pd.read_csv("student-mat.csv", sep=";")
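
The `sep=";"` argument matters because the file is semicolon-delimited rather than comma-delimited. As a minimal sketch, using a made-up two-row sample (the `sample` string is hypothetical and not taken from the dataset), the difference looks like this:

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same semicolon-delimited layout.
sample = "school;sex;age\nGP;F;18\nGP;M;17\n"

# With the default comma separator, every line is read as a single column.
wrong = pd.read_csv(io.StringIO(sample))

# With sep=";", the fields are split into separate columns.
right = pd.read_csv(io.StringIO(sample), sep=";")

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 3)
```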


In [84]:
student_data

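| | school | sex | age | address | famsize | ... | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | ... | 3 | 6 | 5 | 6 | 6 |
| 1 | GP | F | 17 | U | GT3 | ... | 3 | 4 | 5 | 5 | 6 |
| 2 | GP | F | 15 | U | LE3 | ... | 3 | 10 | 7 | 8 | 10 |
| 3 | GP | F | 15 | U | GT3 | ... | 5 | 2 | 15 | 14 | 15 |
| 4 | GP | F | 16 | U | GT3 | ... | 5 | 4 | 6 | 10 | 10 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 390 | MS | M | 20 | U | LE3 | ... | 4 | 11 | 9 | 9 | 9 |
| 391 | MS | M | 17 | U | LE3 | ... | 2 | 3 | 14 | 16 | 16 |
| 392 | MS | M | 21 | R | GT3 | ... | 3 | 3 | 10 | 8 | 7 |
| 393 | MS | M | 18 | R | LE3 | ... | 5 | 0 | 11 | 12 | 10 |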


In [85]:
student_data.columns

Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences', 'G1', 'G2', 'G3'],
      dtype='object')

## Data Cleaning

Keep only the columns used in the analysis, and rename them so they are easier to read.

In [86]:
# keep only the columns used in the analysis
student_data = student_data[["studytime", "freetime", "goout", "Dalc", "Walc", "G1", "G2", "G3"]]
# rename the columns (same order as selected above) for readability
student_data.columns = ["study_time", "free_time", "go_out", "weekday_alc", "weekend_alc", "q1_grade", "q2_grade", "final_grade"]
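
Assigning a list to `.columns` depends on the new names being in exactly the same order as the selected columns. One possible alternative, sketched here on a hypothetical one-row frame (`raw` and `column_names` are illustrative names, not part of the notebook), is to drive both the selection and the renaming from a single dict:

```python
import pandas as pd

# Hypothetical one-row frame that uses a few of the original column names.
raw = pd.DataFrame({"studytime": [2], "freetime": [3], "age": [18], "G3": [6]})

# Map each original name to its readable name; the keys double as the selection list.
column_names = {
    "studytime": "study_time",
    "freetime": "free_time",
    "G3": "final_grade",
}

# Select the mapped columns, then rename them by key rather than by position.
subset = raw[list(column_names)].rename(columns=column_names)

print(list(subset.columns))  # ['study_time', 'free_time', 'final_grade']
```

Because each new name is tied to its original, reordering or adding entries cannot silently mislabel a column.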

In [87]:
student_data

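| | study_time | free_time | go_out | weekday_alc | weekend_alc | q1_grade | q2_grade | final_grade |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 3 | 4 | 1 | 1 | 5 | 6 | 6 |
| 1 | 2 | 3 | 3 | 1 | 1 | 5 | 5 | 6 |
| 2 | 2 | 3 | 2 | 2 | 3 | 7 | 8 | 10 |
| 3 | 3 | 2 | 2 | 1 | 1 | 15 | 14 | 15 |
| 4 | 2 | 3 | 2 | 1 | 2 | 6 | 10 | 10 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 390 | 2 | 5 | 4 | 4 | 5 | 9 | 9 | 9 |
| 391 | 1 | 4 | 5 | 3 | 4 | 14 | 16 | 16 |
| 392 | 1 | 5 | 3 | 3 | 3 | 10 | 8 | 7 |
| 393 | 1 | 4 | 1 | 3 | 4 | 11 | 12 | 10 |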


## Data Analysis & Results

Look at some of the dataset statistics to gain a better understanding.

In [88]:
student_data.describe()

| | study_time | free_time | go_out | weekday_alc | weekend_alc | q1_grade | q2_grade | final_grade |
|---|---|---|---|---|---|---|---|---|
| count | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 |
| mean | 2.035443 | 3.235443 | 3.108861 | 1.481013 | 2.291139 | 10.908861 | 10.713924 | 10.41519 |
| std | 0.83924 | 0.998862 | 1.113278 | 0.890741 | 1.287897 | 3.319195 | 3.761505 | 4.581443 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 0.0 | 0.0 |
| 25% | 1.0 | 3.0 | 2.0 | 1.0 | 1.0 | 8.0 | 9.0 | 8.0 |
| 50% | 2.0 | 3.0 | 3.0 | 1.0 | 2.0 | 11.0 | 11.0 | 11.0 |
| 75% | 2.0 | 4.0 | 4.0 | 2.0 | 3.0 | 13.0 | 13.0 | 14.0 |
| max | 4.0 | 5.0 | 5.0 | 5.0 | 5.0 | 19.0 | 19.0 | 20.0 |


Split the data into features (X) and target (y). I want to predict final_grade from the other columns.

In [89]:
predict_label = "final_grade"
X = np.array(student_data.drop(predict_label, axis=1))
y = np.array(student_data[predict_label])

In [90]:
# split into training and test sets (90/10 split)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)
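
Because `train_test_split` shuffles the rows randomly, each run of the cell above produces a different partition. If a repeatable split is wanted, passing `random_state` is one option. The sketch below uses hypothetical stand-in arrays (`X_demo`, `y_demo`) and an arbitrary seed of 0, neither of which comes from the notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 20 samples with 2 features each.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# random_state=0 is an arbitrary seed chosen for this sketch.
split_a = train_test_split(X_demo, y_demo, test_size=0.1, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.1, random_state=0)

# The same seed yields the same four arrays on both calls.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)  # True

# 10% of 20 samples are held out for testing.
print(split_a[1].shape)  # (2, 2)
```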

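Train the linear regression model on many random train/test splits and keep the one with the highest score on its test set. The value the code calls accuracy is what `score` returns for a regressor: the coefficient of determination (R²), not a classification accuracy.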

In [98]:
# comment out this block to avoid retraining the model
highest_accuracy = 0
num_runs = 50
# train the model on num_runs different random splits
for _ in range(num_runs):
    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)
    # fit a linear model
    reg = linear_model.LinearRegression().fit(X_train, y_train)
    # score() returns R^2 on the held-out split
    accuracy = reg.score(X_test, y_test)
    if accuracy > highest_accuracy:
        highest_accuracy = accuracy
        # save the best model so far in an external file
        with open("student_model.pickle", "wb") as f:
            pickle.dump(reg, f)
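
For a regressor, `score` computes R² = 1 - SS_res / SS_tot. As a rough check, the sketch below reproduces that value by hand on hypothetical toy data (the arrays, coefficients, and seed are made up for illustration and are unrelated to the student dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: a noisy linear relation with three features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X_demo, y_demo)
pred = model.predict(X_demo)

# Residual sum of squares and total sum of squares around the mean.
ss_res = np.sum((y_demo - pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The manual value matches what score() reports.
print(np.isclose(r2_manual, model.score(X_demo, y_demo)))  # True
```

R² can be negative when a model predicts worse than the mean of the target, so it should not be read as a percentage of correct predictions.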

In [102]:
# load the model; the with-block closes the file afterwards
with open("student_model.pickle", "rb") as pickle_in:
    reg = pickle.load(pickle_in)

print("The best model had accuracy", highest_accuracy * 100)
#print(reg.coef_)
#print(reg.intercept_)

The best model had accuracy 92.55392394419097
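
The commented-out `coef_` and `intercept_` prints are easier to read when each coefficient is paired with its feature name. A minimal sketch on hypothetical data follows; `feature_names`, `X_demo`, `y_demo`, and `demo_reg` are illustrative and not the fitted student model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Two of the renamed columns, used here only as labels.
feature_names = ["study_time", "q2_grade"]

# Hypothetical data following an exact linear rule: y = 0.5 * x0 + 1.0 * x1.
X_demo = np.array([[1, 5], [2, 9], [3, 12], [4, 16], [2, 10]], dtype=float)
y_demo = 0.5 * X_demo[:, 0] + 1.0 * X_demo[:, 1]

demo_reg = LinearRegression().fit(X_demo, y_demo)

# coef_ follows the column order of X, so zip pairs each name with its weight.
for name, coef in zip(feature_names, demo_reg.coef_):
    print(f"{name}: {coef:.3f}")
print(f"intercept: {demo_reg.intercept_:.3f}")
```

For the notebook's model, the names would presumably be the columns of `student_data` other than `final_grade`, in their original order.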


In [93]:
# predict grades of test inputs
pred_grades = reg.predict(X_test)
# each line shows: input features, predicted final grade, actual final grade
for i in range(len(pred_grades)):
    print(X_test[i], pred_grades[i], y_test[i])

[ 2  4  5  1  1 10 11] 10.505993854696081 9
[1 4 5 1 1 6 0] -0.8635775059200057 0
[ 2  3  3  2  2 10 10] 9.40964134005649 0
[ 2  3  3  2  3 10 11] 10.5110926239 12
[ 2  3  3  1  1 19 18] 18.772950603372806 18
[ 1  2  1  1  1 11 11] 10.578674743958699 12
[4 5 5 2 4 9 8] 7.451214083907152 8
[1 5 5 5 5 7 7] 6.319145594197607 5
[1 5 1 1 1 8 7] 6.453240029097659 6
[ 1  5  5  3  4 11 11] 11.029698579579621 10
[1 4 5 2 4 6 8] 7.25927461214428 8
[2 4 4 1 1 8 7] 6.228502718246997 8
[ 3  3  3  1  3 10  9] 8.516402937980947 10
[ 2  4  4  1  1 12 12] 11.815734917266239 11
[ 1  4  4  1  3 10 13] 12.839008676302141 12
[2 4 3 1 2 5 5] 3.8745527516232148 5
[ 2  4  3  1  1 14 12] 12.139299119509015 12
[ 2  3  4  2  3 10  9] 8.543348649693636 0
[1 4 5 2 4 6 5] 4.300744031162142 0
[ 3  4  3  3  4 10  9] 8.510749841119209 9
[1 5 5 1 3 6 5] 4.401700192577565 5
[ 3  3  2  1  2 11 10] 9.546782602689388 11
[2 3 3 2 3 9 9] 8.374651928899654 8
[ 2  4  3  1  2 16 15] 15.541278072698455 15
[ 3  2  2  1  1 10 12] 