# Student Performance Analysis

## Data Set Information:

This data set addresses student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, and were collected using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such a prediction is much more useful (see the source paper for more details).
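The correlation between the period grades and the final grade can be checked directly with pandas. The sketch below is only an illustration: it uses the five (G1, G2, G3) rows printed by `data_math.head()` later in this notebook, so the numbers it produces say nothing reliable about the full data set. On the real data the same call could be applied to `data_math[["G1", "G2", "G3"]]`.

```python
import pandas as pd

# The five (G1, G2, G3) rows printed by data_math.head() in this notebook.
# Five rows are far too few for a meaningful estimate; this only shows the call.
grades = pd.DataFrame({
    "G1": [5, 5, 7, 15, 6],
    "G2": [6, 5, 8, 14, 10],
    "G3": [6, 6, 10, 15, 10],
})

# Pearson correlation of each period grade with the final grade G3.
corr_with_g3 = grades.corr()["G3"].drop("G3")
print(corr_with_g3)
```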

## Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:
1 school - student's school (binary: "GP" - Gabriel Pereira or "MS" - Mousinho da Silveira) <br/>
2 sex - student's sex (binary: "F" - female or "M" - male) <br/>
3 age - student's age (numeric: from 15 to 22) <br/>
4 address - student's home address type (binary: "U" - urban or "R" - rural) <br/>
5 famsize - family size (binary: "LE3" - less or equal to 3 or "GT3" - greater than 3) <br/>
6 Pstatus - parent's cohabitation status (binary: "T" - living together or "A" - apart) <br/>
7 Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education) <br/>
8 Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education) <br/>
9 Mjob - mother's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other") <br/>
10 Fjob - father's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other") <br/>
11 reason - reason to choose this school (nominal: close to "home", school "reputation", "course" preference or "other") <br/>
12 guardian - student's guardian (nominal: "mother", "father" or "other") <br/>
13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour) <br/>
14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours) <br/>
15 failures - number of past class failures (numeric: n if 1<=n<3, else 4) <br/>
16 schoolsup - extra educational support (binary: yes or no) <br/>
17 famsup - family educational support (binary: yes or no) <br/>
18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no) <br/>
19 activities - extra-curricular activities (binary: yes or no) <br/>
20 nursery - attended nursery school (binary: yes or no) <br/>
21 higher - wants to take higher education (binary: yes or no) <br/>
22 internet - Internet access at home (binary: yes or no) <br/>
23 romantic - with a romantic relationship (binary: yes or no) <br/>
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent) <br/>
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high) <br/>
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high) <br/>
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high) <br/>
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high) <br/>
29 health - current health status (numeric: from 1 - very bad to 5 - very good) <br/>
30 absences - number of school absences (numeric: from 0 to 93) <br/>

These grades relate to the course subject, Math or Portuguese: <br/>
31 G1 - first period grade (numeric: from 0 to 20) <br/>
32 G2 - second period grade (numeric: from 0 to 20) <br/>
33 G3 - final grade (numeric: from 0 to 20, output target) <br/>

Additional note: 382 students belong to both datasets.
These students can be identified by searching for identical attributes
that characterize each student, as shown in the annexed R file.
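A rough pandas equivalent of that matching step is an inner join on the identifying attributes. The sketch below is an assumption-laden illustration: the helper name `shared_students` and the default key list are not part of this notebook, and the key list should be checked against the annexed R file before relying on it. The toy frames are made up so the example runs without the CSV files.

```python
import pandas as pd

# ASSUMPTION: illustrative set of identifying columns; the annexed R file
# is the authoritative source for which attributes to match on.
KEY_COLUMNS = ["school", "sex", "age", "address", "famsize", "Pstatus",
               "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet"]


def shared_students(math_df, port_df, keys=KEY_COLUMNS):
    """Inner-join the two course tables on the identifying columns (hypothetical helper)."""
    return math_df.merge(port_df, on=list(keys), suffixes=("_mat", "_por"))


# Made-up toy rows with a reduced key set, only to show the mechanics.
math_toy = pd.DataFrame({"school": ["GP", "GP", "MS"], "sex": ["F", "M", "F"],
                         "age": [15, 16, 17], "G3": [10, 12, 8]})
port_toy = pd.DataFrame({"school": ["GP", "MS", "MS"], "sex": ["F", "F", "M"],
                         "age": [15, 17, 18], "G3": [11, 9, 14]})

both = shared_students(math_toy, port_toy, keys=["school", "sex", "age"])
print(both)
```

Non-key columns that exist in both files, such as the grades, come back with `_mat` and `_por` suffixes so the two courses stay distinguishable.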

In [1]:
import numpy as np
import pandas as pd
# import matplotlib.pyplot as plt
from random import sample
from sklearn import svm
from sklearn.preprocessing import LabelBinarizer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

In [2]:
%matplotlib inline

In [3]:
data_math = pd.read_csv("data/student-mat.csv", delimiter = ";")
data_port = pd.read_csv("data/student-por.csv", delimiter = ";")

In [4]:
# Sample
data_math.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


In [5]:
# Shape of the Math scored data
print("There are {} data points with {} features".format(data_math.shape[0], data_math.shape[1]))
# Shape of the Portuguese language scored data
print("There are {} data points with {} features".format(data_port.shape[0], data_port.shape[1]))

There are 395 data points with 33 features
There are 649 data points with 33 features


In [6]:
num_nan_null = data_math.isnull().sum().sum()
print("There are {} Nan, None, or Null values in data_math".format(num_nan_null))
num_nan_null = data_port.isnull().sum().sum()
print("There are {} Nan, None, or Null values in data_port".format(num_nan_null))

There are 0 Nan, None, or Null values in data_math
There are 0 Nan, None, or Null values in data_port


In [7]:
# Find the categories that should be relabelled to integer values
# Get binary labels
n_x = data_math.shape[1]
binary_labels = []
data_math_proc = data_math.copy()

for i in range(n_x):
    col = data_math.iloc[:, i]
    # Number of distinct values taken by this column
    cat_length = col.nunique()
    # Only print value counts for low-cardinality columns (at most 5 distinct values)
    if cat_length <= 5:
        print(data_math.columns[i])
        print(col.value_counts())
    # Columns with exactly two distinct values are treated as binary
    if cat_length == 2:
        binary_labels.append(data_math.columns[i])

school
GP    349
MS     46
Name: school, dtype: int64
sex
F    208
M    187
Name: sex, dtype: int64
address
U    307
R     88
Name: address, dtype: int64
famsize
GT3    281
LE3    114
Name: famsize, dtype: int64
Pstatus
T    354
A     41
Name: Pstatus, dtype: int64
Medu
4    131
2    103
3     99
1     59
0      3
Name: Medu, dtype: int64
Fedu
2    115
3    100
4     96
1     82
0      2
Name: Fedu, dtype: int64
Mjob
other       141
services    103
at_home      59
teacher      58
health       34
Name: Mjob, dtype: int64
Fjob
other       217
services    111
teacher      29
at_home      20
health       18
Name: Fjob, dtype: int64
reason
course        145
home          109
reputation    105
other          36
Name: reason, dtype: int64
guardian
mother    273
father     90
other      32
Name: guardian, dtype: int64
traveltime
1    257
2    107
3     23
4      8
Name: traveltime, dtype: int64
studytime
2    198
1    105
3     65
4     27
Name: studytime, dtype: int64
failures
0    312
1     

In [8]:
binary_labels

['school',
 'sex',
 'address',
 'famsize',
 'Pstatus',
 'schoolsup',
 'famsup',
 'paid',
 'activities',
 'nursery',
 'higher',
 'internet',
 'romantic']
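The same list of two-level columns can be obtained without an explicit loop. The following is a minimal sketch, assuming a hypothetical helper `columns_with_n_levels` and a made-up toy frame; on the real data it would be called on `data_math`.

```python
import pandas as pd


def columns_with_n_levels(df, n_levels=2):
    """Return the names of columns that take exactly n_levels distinct values (hypothetical helper)."""
    counts = df.nunique()
    return counts[counts == n_levels].index.tolist()


# Made-up toy frame: two binary columns and one numeric column with three values.
toy = pd.DataFrame({"school": ["GP", "MS", "GP", "GP"],
                    "age": [15, 16, 17, 15],
                    "paid": ["no", "yes", "no", "no"]})

binary_cols = columns_with_n_levels(toy)
print(binary_cols)
```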

In [9]:
data_math_proc.columns

Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences', 'G1', 'G2', 'G3'],
      dtype='object')

In [10]:
num_bin_labels = len(binary_labels)
lb = LabelBinarizer(neg_label = 0, pos_label = 1)
# Encode each two-level column as 0/1; ravel() flattens the (n, 1) output to 1-D
for col in binary_labels:
    data_math_proc[col] = lb.fit_transform(data_math[col]).ravel()
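It helps to know which original label ends up as 0 and which as 1. `LabelBinarizer` stores the classes in sorted order, so for a two-level column the first class in that order becomes 0 and the second becomes 1. A small sketch on made-up values:

```python
from sklearn.preprocessing import LabelBinarizer

lb_demo = LabelBinarizer(neg_label=0, pos_label=1)

# Made-up values shaped like the "school" column.
encoded = lb_demo.fit_transform(["GP", "MS", "GP", "GP"])

# classes_ is sorted, so classes_[0] is encoded as 0 and classes_[1] as 1.
mapping = {str(label): code for code, label in enumerate(lb_demo.classes_)}
print(mapping)
print(encoded.ravel())
```

By the same rule, "F" maps to 0 and "M" to 1, and "no" maps to 0 and "yes" to 1, which is consistent with the encoded rows printed by `data_math_proc.head()`.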

In [11]:
# This will allow Mjob, Fjob, reason, and guardian features in the data to be one-hot encoded
one_hot_cat = ['Mjob', 'Fjob', 'reason', 'guardian']

for cat in one_hot_cat:
    data_math_proc = pd.concat([data_math_proc, pd.get_dummies(data_math_proc[cat], prefix = cat)], axis = 1)
    data_math_proc = data_math_proc.drop(cat, axis = 1)
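The concat-and-drop loop can also be written as a single `pd.get_dummies` call with the `columns` argument, which replaces the listed columns in place. This is a sketch on a made-up toy frame, not the notebook's own method. Passing `dtype=int` is a precaution, since the default dtype of the dummy columns depends on the pandas version (recent releases produce booleans).

```python
import pandas as pd

# Made-up toy frame with one nominal column.
toy = pd.DataFrame({"age": [15, 16, 17],
                    "guardian": ["mother", "father", "other"]})

# One-hot encode the listed columns; the originals are dropped automatically
# and the new columns are prefixed with the source column name.
encoded = pd.get_dummies(toy, columns=["guardian"], dtype=int)
print(encoded)
```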

In [12]:
data_math_proc.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,traveltime,studytime,...,Fjob_other,Fjob_services,Fjob_teacher,reason_course,reason_home,reason_other,reason_reputation,guardian_father,guardian_mother,guardian_other
0,0,0,18,1,0,0,4,4,2,2,...,0,0,1,1,0,0,0,0,1,0
1,0,0,17,1,0,1,1,1,1,2,...,1,0,0,1,0,0,0,1,0,0
2,0,0,15,1,1,1,1,1,1,2,...,1,0,0,0,0,1,0,0,1,0
3,0,0,15,1,0,1,4,2,1,3,...,0,1,0,0,1,0,0,0,1,0
4,0,0,16,1,0,1,3,3,1,2,...,1,0,0,0,1,0,0,1,0,0


In [13]:
data_math_proc.columns

Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid',
       'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel',
       'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G1', 'G2',
       'G3', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services',
       'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other',
       'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home',
       'reason_other', 'reason_reputation', 'guardian_father',
       'guardian_mother', 'guardian_other'],
      dtype='object')

In [14]:
X = data_math_proc[["school","sex","age","address","famsize",
                    "Pstatus","Medu","Fedu","traveltime","studytime",
                    "failures","schoolsup","famsup","paid","activities",
                    "nursery","higher","internet","romantic","famrel",
                    "freetime","goout","Dalc","Walc","health","absences",
                    "Mjob_at_home","Mjob_health","Mjob_other","Mjob_services",
                    "Mjob_teacher","Fjob_at_home","Fjob_health","Fjob_other",
                    "Fjob_services","Fjob_teacher","reason_course","reason_home",
                    "reason_other","reason_reputation","guardian_father",
                    "guardian_mother","guardian_other","G1","G2"]]
Y = data_math_proc["G3"]
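The data set description mentions binary and five-level classification tasks, whereas `Y` here is the raw 0 to 20 grade. One way such targets could be derived is sketched below. The pass mark and the band edges are assumptions made for illustration and should be checked against the paper. The grades used are the five G3 values from the `head()` output above.

```python
import pandas as pd

# The five G3 values printed by data_math.head().
g3 = pd.Series([6, 6, 10, 15, 10], name="G3")

# ASSUMPTION: a grade of 10 or more on the 0-20 scale counts as a pass.
passed = (g3 >= 10).astype(int)

# ASSUMPTION: five bands covering 0-9, 10-11, 12-13, 14-15 and 16-20.
# pd.cut uses right-closed intervals, so the first bin (-1, 9] includes 0.
bands = pd.cut(g3, bins=[-1, 9, 11, 13, 15, 20],
               labels=["V", "IV", "III", "II", "I"])

print(passed.tolist())
print(bands.tolist())
```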

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

In [16]:
pipe_lr = Pipeline([('scl', StandardScaler()),
                    ('clf', LogisticRegression(random_state=42))])

pipe_lr_pca = Pipeline([('scl', StandardScaler()),
                        ('pca', PCA(n_components=2)),
                        ('clf', LogisticRegression(random_state=42))])

pipe_rf = Pipeline([('scl', StandardScaler()),
                    ('clf', RandomForestClassifier(random_state=42))])

pipe_rf_pca = Pipeline([('scl', StandardScaler()),
                        ('pca', PCA(n_components=2)),
                        ('clf', RandomForestClassifier(random_state=42))])

pipe_svm = Pipeline([('scl', StandardScaler()),
                     ('clf', svm.SVC(random_state=42))])

pipe_svm_pca = Pipeline([('scl', StandardScaler()),
                         ('pca', PCA(n_components=2)),
                         ('clf', svm.SVC(random_state=42))])

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,traveltime,studytime,...,Fjob_teacher,reason_course,reason_home,reason_other,reason_reputation,guardian_father,guardian_mother,guardian_other,G1,G2
181,0,1,16,1,0,1,3,3,1,2,...,0,0,1,0,0,0,1,0,12,13
194,0,1,16,1,0,1,2,3,2,1,...,0,0,1,0,0,1,0,0,13,14
173,0,0,16,1,0,1,1,3,1,2,...,0,0,1,0,0,0,1,0,8,7
63,0,0,16,1,0,1,4,3,1,3,...,0,0,1,0,0,0,1,0,10,9
253,0,1,16,0,0,1,2,1,2,1,...,0,1,0,0,0,0,1,0,8,9
225,0,0,18,0,0,1,3,1,1,2,...,0,0,0,0,1,0,1,0,9,8
331,0,0,17,0,0,1,2,4,1,3,...,0,1,0,0,0,1,0,0,12,14
383,1,1,19,0,0,1,1,1,2,1,...,0,0,0,1,0,0,1,0,6,5
227,0,1,17,1,1,1,2,3,1,2,...,0,0,0,0,1,1,0,0,12,11
342,0,1,18,1,1,1,3,4,1,2,...,0,0,1,0,0,0,1,0,16,15
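The pipelines above are defined but not fitted in this part of the notebook. A minimal sketch of how such pipelines might be fitted and compared on a held-out split follows. It uses synthetic data from `make_classification` as a stand-in, so the variable names (`X_demo`, `pipelines`, `scores`) and the resulting accuracies are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; replace with the notebook's X_train / X_test split.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

pipelines = {
    "logistic regression": Pipeline([("scl", StandardScaler()),
                                     ("clf", LogisticRegression(random_state=42))]),
    "random forest": Pipeline([("scl", StandardScaler()),
                               ("clf", RandomForestClassifier(random_state=42))]),
    "svm": Pipeline([("scl", StandardScaler()),
                     ("clf", SVC(random_state=42))]),
}

# Fit each pipeline on the training split and score it on the held-out split.
scores = {}
for name, pipe in pipelines.items():
    pipe.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, pipe.predict(X_te))

for name, acc in scores.items():
    print("{}: {:.3f}".format(name, acc))
```

Keeping the scaler inside each pipeline means it is fitted on the training split only, which avoids leaking test-set statistics into the model.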


In [17]:
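# Features: every column except the grades G1, G2, G3 (positions 26-28)
X = pd.concat([data_math_proc.iloc[:, 0:26], data_math_proc.iloc[:, 29:]], axis = 1)
# Target: first period grade
Y = data_math_proc['G1']
sup_vec_reg = svm.SVR()
sup_vec_reg.fit(X, Y)
# R^2 on the same rows used for fitting (not a held-out estimate)
sup_vec_reg.score(X, Y)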

0.29576816084082458
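The score above is an R^2 value measured on the same rows the model was fitted to, and the features are passed to the SVR unscaled. A possible refinement, sketched here on synthetic regression data rather than the student data, is to put a `StandardScaler` in front of the SVR and report cross-validated R^2 instead. The names `svr_pipe` and `cv_r2` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data; replace with the notebook's X and Y.
X_demo, y_demo = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=42)

# Scale features before the SVR; the scaler is refitted inside each CV fold.
svr_pipe = make_pipeline(StandardScaler(), SVR())

# Five-fold cross-validated R^2, one value per fold.
cv_r2 = cross_val_score(svr_pipe, X_demo, y_demo, cv=5, scoring="r2")
print("Mean CV R^2: {:.3f}".format(cv_r2.mean()))
```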

In [18]:
# Draw 6 random row positions from the data the model was fitted on
samp_int = sample(range(0, X.shape[0]), 6)

for i in samp_int:
    print("Sample point {} in our training set".format(i))
    # X.iloc[[i]] keeps a one-row DataFrame, so feature names match those seen in fit
    print("SVM prediction: {}".format(sup_vec_reg.predict(X.iloc[[i]])))
    print("Actual G1 score: {}".format(Y.iloc[i]))

Sample point 66 in our training set
SVM prediction: [ 11.52876491]
Actual G1 score: 13
Sample point 100 in our training set
SVM prediction: [ 9.39237952]
Actual G1 score: 7
Sample point 189 in our training set
SVM prediction: [ 9.63729811]
Actual G1 score: 8
Sample point 53 in our training set
SVM prediction: [ 10.71557839]
Actual G1 score: 8
Sample point 5 in our training set
SVM prediction: [ 11.41948314]
Actual G1 score: 15
Sample point 37 in our training set
SVM prediction: [ 11.74522692]
Actual G1 score: 15


In [19]:
score = []
# Refit the SVR with each feature left out in turn and record the training R^2
for i in range(X.shape[1]):
    X_drop = X.drop(X.columns[i], axis = 1)
    sup_vec_reg.fit(X_drop, Y)
    drop_score = sup_vec_reg.score(X_drop, Y)
    print("Score: {}".format(drop_score))
    score.append(drop_score)
max(score)

Score: 0.29734201276255245
Score: 0.2895743304303948
Score: 0.28969553798320746
Score: 0.2963289210277814
Score: 0.2954489538615266
Score: 0.2977366318098693
Score: 0.2892642129787991
Score: 0.29661056597168
Score: 0.29659408429211975
Score: 0.28502238532772706
Score: 0.25970310223052806
Score: 0.2845628924905298
Score: 0.2868390027573975
Score: 0.295805663662531
Score: 0.2971054755752244
Score: 0.2972652329400297
Score: 0.2975763617109577
Score: 0.297113638149204
Score: 0.29783225795309387
Score: 0.2969108241787226
Score: 0.28820977997734987
Score: 0.2840321162484477
Score: 0.2932238036147601
Score: 0.2948771594725942
Score: 0.2924691035320167
Score: 0.28600985474920027
Score: 0.2975708718356428
Score: 0.2962938610316965
Score: 0.2937381842684411
Score: 0.2973637715394163
Score: 0.2975672582694884
Score: 0.2980793444680534
Score: 0.298019246003418
Score: 0.2938523748755205
Score: 0.29796781230660685
Score: 0.2966432011027409
Score: 0.29394057892512837
Score: 0.2979292888873202
Score: 

0.29826524075416661
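The printed scores are easier to read when each one is tied to the name of the dropped feature. In the output above, the lowest score (about 0.260) is the eleventh one printed, which would correspond to dropping `failures` if the column order matches `data_math_proc.columns`. The sketch below wraps the same leave-one-feature-out idea in a hypothetical helper, `drop_one_scores`, and runs it on synthetic data. Like the loop above, it scores on the training data.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.svm import SVR


def drop_one_scores(X, y, make_model=SVR):
    """Training R^2 after dropping each column in turn, lowest first (hypothetical helper)."""
    scores = {}
    for col in X.columns:
        X_drop = X.drop(columns=col)
        model = make_model().fit(X_drop, y)
        scores[col] = model.score(X_drop, y)
    # The lowest score marks the feature whose removal hurts the fit the most.
    return pd.Series(scores).sort_values()


# Synthetic stand-in data with named columns; replace with the notebook's X and Y.
X_arr, y_demo = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=42)
X_demo = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

result = drop_one_scores(X_demo, y_demo)
print(result)
```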