# Linear Regression

## Overview

This Jupyter Notebook examines how several student attributes relate to academic performance, specifically the final grade. The kernel uses Python 3.8.5.

## Research Question

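What is the relationship between a student's study time, free time, time spent going out, and weekday and weekend alcohol consumption, and their final grade?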

## Dataset

- Dataset Name: Student Performance Data Set
- Link to the dataset: https://archive.ics.uci.edu/ml/datasets/Student+Performance#
- Number of observations: 395
- Description: This dataset describes student performance in mathematics classes at Portuguese secondary schools. The variables include (but are not limited to) study time, free time, number of absences, health status, and final grade.

## Setup

In [362]:
# imports
import pandas as pd
import numpy as np
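import sklearn.model_selection  # explicit submodule import, so sklearn.model_selection.train_test_split resolves reliably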
from sklearn import linear_model
from sklearn.utils import shuffle
from matplotlib import pyplot as plt

In [363]:
# show at most 10 rows and 10 columns; set to None to display all rows/cols in dataframes
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)

Read in the dataset that will be cleaned and analyzed.

In [364]:
student_data = pd.read_csv("student-mat.csv", sep=";")
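
The `sep=";"` argument matters because the file is read as semicolon-delimited. A minimal sketch of the difference, using a small in-memory sample rather than the real file (`sample_text` is a name made up for illustration, and its values are copied from the first two rows displayed in the next cell):

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for the first lines of a semicolon-delimited file.
sample_text = "school;sex;age\nGP;F;18\nGP;F;17\n"

# With the matching separator, each field becomes its own column.
parsed = pd.read_csv(io.StringIO(sample_text), sep=";")

# With the default comma separator, every line collapses into a single column.
collapsed = pd.read_csv(io.StringIO(sample_text))

print(parsed.shape)
print(collapsed.shape)
```

Checking `student_data.shape` right after loading is a cheap way to catch a wrong separator early.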


In [365]:
student_data

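| | school | sex | age | address | famsize | ... | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | ... | 3 | 6 | 5 | 6 | 6 |
| 1 | GP | F | 17 | U | GT3 | ... | 3 | 4 | 5 | 5 | 6 |
| 2 | GP | F | 15 | U | LE3 | ... | 3 | 10 | 7 | 8 | 10 |
| 3 | GP | F | 15 | U | GT3 | ... | 5 | 2 | 15 | 14 | 15 |
| 4 | GP | F | 16 | U | GT3 | ... | 5 | 4 | 6 | 10 | 10 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 390 | MS | M | 20 | U | LE3 | ... | 4 | 11 | 9 | 9 | 9 |
| 391 | MS | M | 17 | U | LE3 | ... | 2 | 3 | 14 | 16 | 16 |
| 392 | MS | M | 21 | R | GT3 | ... | 3 | 3 | 10 | 8 | 7 |
| 393 | MS | M | 18 | R | LE3 | ... | 5 | 0 | 11 | 12 | 10 |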


In [366]:
student_data.columns

Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences', 'G1', 'G2', 'G3'],
      dtype='object')

## Data Cleaning

Keep only the columns used in the analysis, and rename them so their meaning is clear.

In [367]:
student_data = student_data[["studytime", "freetime", "goout", "Dalc", "Walc", "G3"]]
student_data.columns = ["study_time", "free_time", "go_out", "weekday_alc", "weekend_alc", "final_grade"]
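
An alternative way to express the same selection and renaming is a single mapping from old to new names, which keeps each pair side by side and does not depend on column order. The sketch below is only an illustration: `raw`, `column_names`, and `cleaned` are made-up names, and the two rows are copied from rows 0 and 3 of the table displayed earlier.

```python
import pandas as pd

# Tiny stand-in frame using the dataset's original column names
# (values taken from rows 0 and 3 of the displayed table; "age" is an unused extra column).
raw = pd.DataFrame({
    "studytime": [2, 3],
    "freetime": [3, 2],
    "goout": [4, 2],
    "Dalc": [1, 1],
    "Walc": [1, 1],
    "G3": [6, 15],
    "age": [18, 15],
})

# Old name -> new name; the keys double as the list of columns to keep.
column_names = {
    "studytime": "study_time",
    "freetime": "free_time",
    "goout": "go_out",
    "Dalc": "weekday_alc",
    "Walc": "weekend_alc",
    "G3": "final_grade",
}

# Select first, then rename; rename() returns a new frame and leaves `raw` untouched.
cleaned = raw[list(column_names)].rename(columns=column_names)
print(cleaned)
```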

In [368]:
student_data

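| | study_time | free_time | go_out | weekday_alc | weekend_alc | final_grade |
|---|---|---|---|---|---|---|
| 0 | 2 | 3 | 4 | 1 | 1 | 6 |
| 1 | 2 | 3 | 3 | 1 | 1 | 6 |
| 2 | 2 | 3 | 2 | 2 | 3 | 10 |
| 3 | 3 | 2 | 2 | 1 | 1 | 15 |
| 4 | 2 | 3 | 2 | 1 | 2 | 10 |
| ... | ... | ... | ... | ... | ... | ... |
| 390 | 2 | 5 | 4 | 4 | 5 | 9 |
| 391 | 1 | 4 | 5 | 3 | 4 | 16 |
| 392 | 1 | 5 | 3 | 3 | 3 | 7 |
| 393 | 1 | 4 | 1 | 3 | 4 | 10 |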


## Data Analysis & Results

Summary statistics for the retained columns give a first sense of each variable's range and spread.

In [369]:
student_data.describe()

| | study_time | free_time | go_out | weekday_alc | weekend_alc | final_grade |
|---|---|---|---|---|---|---|
| count | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 |
| mean | 2.035443 | 3.235443 | 3.108861 | 1.481013 | 2.291139 | 10.41519 |
| std | 0.83924 | 0.998862 | 1.113278 | 0.890741 | 1.287897 | 4.581443 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 25% | 1.0 | 3.0 | 2.0 | 1.0 | 1.0 | 8.0 |
| 50% | 2.0 | 3.0 | 3.0 | 1.0 | 2.0 | 11.0 |
| 75% | 2.0 | 4.0 | 4.0 | 2.0 | 3.0 | 14.0 |
| max | 4.0 | 5.0 | 5.0 | 5.0 | 5.0 | 20.0 |
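
One possible follow-up to these summary statistics is to check how strongly each feature correlates linearly with the target before fitting a model. The sketch below is not part of the original analysis: it uses a tiny made-up frame (`toy`, with values chosen so the correlations are exactly -1 and 1) purely to show the pandas calls.

```python
import pandas as pd

# Made-up toy data, NOT the student dataset: final_grade rises with study_time
# and falls with go_out in a perfectly linear way.
toy = pd.DataFrame({
    "study_time": [1, 2, 3, 4],
    "go_out": [4, 3, 2, 1],
    "final_grade": [5, 10, 15, 20],
})

# Pearson correlation of every feature with the target, weakest to strongest.
corr_with_grade = toy.corr()["final_grade"].drop("final_grade").sort_values()
print(corr_with_grade)
```

Applied to `student_data`, the same expression would rank the five features by their linear association with `final_grade`.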


In [370]:
# target column to predict; every other column is used as a feature
predict_label = "final_grade"
X = np.array(student_data.drop(predict_label, axis=1))
y = np.array(student_data[predict_label])

In [371]:
# split into training and test sets (90/10 split)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)
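
Because `train_test_split` shuffles the rows randomly, each run of the cell above produces a different split, and so different coefficients and scores. A minimal sketch of a reproducible split, assuming a fixed seed is acceptable for the analysis (the toy arrays and the choice `random_state=0` are arbitrary illustrations, not values from the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays standing in for X and y (hypothetical values).
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Passing the same random_state yields the same shuffle, hence the same split.
split_a = train_test_split(X_demo, y_demo, test_size=0.1, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.1, random_state=0)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))
```

With 395 observations and `test_size=0.1`, the test set holds only about 40 rows, so a score computed on it can vary noticeably from one random split to the next.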

In [372]:
X_train

array([[2, 5, 4, 2, 3],
       [2, 5, 2, 1, 1],
       [2, 4, 4, 3, 3],
       ...,
       [2, 3, 2, 2, 3],
       [3, 2, 3, 1, 2],
       [3, 3, 3, 1, 3]])

In [373]:
# fit a linear model
reg = linear_model.LinearRegression().fit(X_train, y_train)
# score() returns the coefficient of determination (R^2) on the test set, not a classification accuracy
r2 = reg.score(X_test, y_test)
print(r2)
print(reg.coef_)
print(reg.intercept_)

-0.05155725409994982
[ 0.6228324   0.41950132 -0.74029917 -0.32737772  0.31446181]
9.807665598878112
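
The first printed value is the coefficient of determination (R²) that `score` returns for a regressor, and the other two define the fitted line. The sketch below shows how all three fit together, under the assumption that a toy example is enough to illustrate the arithmetic: it fits a model on made-up data (not the student dataset), rebuilds the predictions from `intercept_` and `coef_`, and recomputes R² by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up toy data (hypothetical): a linear signal in two features plus noise.
rng = np.random.default_rng(0)
X_toy = rng.integers(1, 6, size=(50, 2)).astype(float)
y_toy = 3.0 + 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + rng.normal(0, 0.5, size=50)

# Fit on the first 40 rows, evaluate on the last 10.
model = LinearRegression().fit(X_toy[:40], y_toy[:40])
X_eval, y_eval = X_toy[40:], y_toy[40:]

# A prediction is the intercept plus the dot product of coefficients and features.
manual_pred = model.intercept_ + X_eval @ model.coef_

# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean).
ss_res = np.sum((y_eval - manual_pred) ** 2)
ss_tot = np.sum((y_eval - y_eval.mean()) ** 2)
manual_r2 = 1.0 - ss_res / ss_tot

# Baseline: always predicting the mean of the evaluation targets gives R^2 = 0.
baseline_pred = np.full_like(y_eval, y_eval.mean())
baseline_r2 = 1.0 - np.sum((y_eval - baseline_pred) ** 2) / ss_tot

print(manual_r2, model.score(X_eval, y_eval), baseline_r2)
```

Read this way, the value of about -0.05 printed above means the fitted model predicted the held-out grades slightly worse than a constant equal to their mean would have.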


In [374]:
# predict grades of test inputs
pred_grades = reg.predict(X_test)
# each printed row shows: input features, predicted grade, actual grade
for features, predicted, actual in zip(X_test, pred_grades, y_test):
    print(features, predicted, actual)

[1 2 2 1 1] 9.775986380765659 13
[2 2 4 1 4] 9.861605865630892 8
[1 4 5 3 4] 8.682721481085752 16
[2 4 3 2 3] 10.79906813438315 12
[1 3 4 4 4] 8.67614160686699 13
[1 5 5 3 4] 9.102222797199161 10
[1 3 3 1 1] 9.455188530518337 14
[2 4 3 1 2] 10.811984053480023 15
[1 3 5 3 5] 8.577681970341551 16
[1 2 2 1 3] 10.404909991504082 8
[1 4 1 3 4] 11.643918146528684 10
[2 2 4 2 4] 9.534228141164808 13
[1 3 3 1 2] 9.76965033588755 0
[2 3 2 1 1] 10.818320098358132 15
[1 3 3 1 2] 9.76965033588755 11
[2 3 5 2 4] 9.213430290917486 10
[2 3 3 2 2] 10.065105012900528 0
[1 4 3 1 1] 9.874689846631748 15
[2 3 1 1 1] 11.558619264718866 11
[3 4 3 1 1] 11.120354649589872 11
[2 4 2 1 1] 11.237821414471544 9
[3 3 3 1 1] 10.700853333476463 12
[2 5 4 4 5] 10.452438445942086 9
[1 2 3 1 2] 9.350149019774138 10
[2 4 3 1 1] 10.49752224811081 15
[1 3 1 1 1] 10.935786863239802 13
[2 1 1 1 1] 10.719616632492043 9
[2 3 3 2 3] 10.37956681826974 10
[1 4 4 1 2] 9.448852485640227 14
[2 5 2 2 2] 11.644406811488082 14
[2 5 4 