# Classification
I will be using a random forest for this assignment. A random forest is a method for predicting a class label (classification). In this case I will predict whether a student is in a romantic relationship based on some of their attributes. A random forest is built from decision trees. Each tree starts from the dataset and splits it step by step on attribute values, working (by branching) towards groups that separate students who are in a relationship from those who are not. Because the prediction combines a whole forest of trees rather than one decision tree, it is hard to see which attributes drive a given decision, which makes it a bit of a black-box algorithm. It nevertheless did better than the KNN method, so I used a random forest for this assignment.
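To make the "forest of trees" idea concrete, here is a minimal sketch of majority voting. The three "trees" are hypothetical hand-written rules with made-up thresholds, not trees learned from the student data, and `forest_predict` is an illustrative helper. Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so this only shows the general principle.

```python
from collections import Counter

# Three hypothetical "trees", each reduced to a single hand-written rule.
# Real trees are learned from data; these thresholds are made up for illustration.
def tree_age(student):
    return 1 if student["age"] >= 17 else 0

def tree_absences(student):
    return 1 if student["absences"] > 5 else 0

def tree_failures(student):
    return 1 if student["failures"] >= 1 else 0

forest = [tree_age, tree_absences, tree_failures]

def forest_predict(student):
    """Return the class most trees vote for (1 = in a relationship)."""
    votes = Counter(tree(student) for tree in forest)
    return votes.most_common(1)[0][0]

student = {"age": 18, "absences": 8, "failures": 0}
prediction = forest_predict(student)  # age and absences vote 1, failures votes 0
print(prediction)
```

With an odd number of trees and two classes there is never a tie, which keeps the sketch simple.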

In [80]:
import seaborn as sns #this is the plotting library I'll be using 
import pandas as pd #"as pd" means that we can use the abbreviation in commands
import matplotlib.pyplot as plt #we need Matplotlib for setting the labels in the Seaborn graphs
import math
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
pd.set_option('display.max_columns', 40)
df = pd.read_csv('./student-por.csv')
df.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,course,mother,2,2,0,yes,no,no,no,yes,yes,no,no,4,3,4,1,1,3,4,0,11,11
1,GP,F,17,U,GT3,T,1,1,at_home,other,course,father,1,2,0,no,yes,no,no,no,yes,yes,no,5,3,3,1,1,3,2,9,11,11
2,GP,F,15,U,LE3,T,1,1,at_home,other,other,mother,1,2,0,yes,no,no,no,yes,yes,yes,no,4,3,2,2,3,3,6,12,13,12
3,GP,F,15,U,GT3,T,4,2,health,services,home,mother,1,3,0,no,yes,no,yes,yes,yes,yes,yes,3,2,2,1,1,5,0,14,14,14
4,GP,F,16,U,GT3,T,3,3,other,other,home,father,1,2,0,no,yes,no,no,yes,yes,no,no,4,3,2,1,2,5,0,11,13,13


In [90]:
# Because the value to predict is not numerical, I need to convert it into dummy data before I can use it.
# I would also like to use sex in my prediction model, so I am converting that into dummy data as well.
dummies1 = pd.get_dummies(df['sex'])
dummies2 = pd.get_dummies(df['romantic'] )
dummies2.head(1)


Unnamed: 0,no,yes
0,1,0
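The `no` and `yes` dummy columns mirror each other, so one of them is enough as a target. The sketch below uses a small made-up `toy` frame (not the student data) to show this, and shows a single 0/1 column as a possible alternative. The `dtype=int` argument is there so the dummies print as 0/1 regardless of pandas version.

```python
import pandas as pd

# Made-up example values, only the column name matches the real dataset
toy = pd.DataFrame({"romantic": ["no", "yes", "no", "yes", "yes"]})

# One column per category: 'no' and 'yes'
dummies = pd.get_dummies(toy["romantic"], dtype=int)

# The two columns always sum to 1, so 'no' carries no information beyond 'yes'
redundant = bool((dummies["no"] + dummies["yes"] == 1).all())

# Alternative: build a single 0/1 target column directly
in_relationship = (toy["romantic"] == "yes").astype(int)

print(dummies)
print(in_relationship.tolist())
```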


In [82]:
df = pd.concat([df, dummies1, dummies2], axis=1) #the axis=1 means: add it to the columns (axis=0 is rows)
df.head(1)

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3,F,M,no,yes
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,course,mother,2,2,0,yes,no,no,no,yes,yes,no,no,4,3,4,1,1,3,4,0,11,11,1,0,1,0


In [83]:
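# Because of the number of columns, I show the correlations as a plain table rather than a plotted correlation matrix
# numeric_only=True (pandas 1.5+) skips the text columns; pandas 2.x raises an error on them otherwise
df.corr(numeric_only=True)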

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3,F,M,no,yes
age,1.0,-0.107832,-0.12105,0.03449,-0.008415,0.319968,-0.020559,-0.00491,0.112805,0.134768,0.086357,-0.00875,0.149998,-0.174322,-0.107119,-0.106505,0.043662,-0.043662,-0.17881,0.17881
Medu,-0.107832,1.0,0.647477,-0.265079,0.097006,-0.17221,0.024421,-0.019686,0.009536,-0.007018,-0.019766,0.004614,-0.008577,0.260472,0.264035,0.240151,-0.119127,0.119127,0.030992,-0.030992
Fedu,-0.12105,0.647477,1.0,-0.208288,0.0504,-0.165915,0.020256,0.006841,0.02769,6.1e-05,0.038445,0.04491,0.029859,0.217501,0.225139,0.2118,-0.083913,0.083913,0.067675,-0.067675
traveltime,0.03449,-0.265079,-0.208288,1.0,-0.063154,0.09773,-0.009521,0.000937,0.057454,0.092824,0.057007,-0.048261,-0.008149,-0.15412,-0.154489,-0.127173,-0.04088,0.04088,-0.004751,0.004751
studytime,-0.008415,0.097006,0.0504,-0.063154,1.0,-0.147441,-0.004127,-0.068829,-0.075442,-0.137585,-0.214925,-0.056433,-0.118389,0.260875,0.240498,0.249789,0.206214,-0.206214,-0.033036,0.033036
failures,0.319968,-0.17221,-0.165915,0.09773,-0.147441,1.0,-0.062645,0.108995,0.045078,0.105949,0.082266,0.035588,0.122779,-0.38421,-0.385782,-0.393316,-0.073888,0.073888,-0.069901,0.069901
famrel,-0.020559,0.024421,0.020256,-0.009521,-0.004127,-0.062645,1.0,0.129216,0.089707,-0.075767,-0.093511,0.109559,-0.089534,0.048795,0.089588,0.063361,-0.083473,0.083473,0.04492,-0.04492
freetime,-0.00491,-0.019686,0.006841,0.000937,-0.068829,0.108995,0.129216,1.0,0.346352,0.109904,0.120244,0.084526,-0.018716,-0.094497,-0.106678,-0.122705,-0.146305,0.146305,-0.027112,0.027112
goout,0.112805,0.009536,0.02769,0.057454,-0.075442,0.045078,0.089707,0.346352,1.0,0.245126,0.38868,-0.015741,0.085374,-0.074053,-0.079469,-0.087641,-0.058178,0.058178,0.00052,-0.00052
Dalc,0.134768,-0.007018,6.1e-05,0.092824,-0.137585,0.105949,-0.075767,0.109904,0.245126,1.0,0.616561,0.059067,0.172952,-0.195171,-0.18948,-0.204719,-0.282696,0.282696,-0.062042,0.062042


As can be seen in the correlation table, the variables most strongly correlated with the variable yes (in a romantic relationship) are, in descending order:
- age
- G2
- G3
- absences
- G1
- failures
- Fedu (= father's education)
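A ranking like the one above can be read off programmatically instead of by eye. The sketch below is one possible way, using a small made-up `toy` frame whose values are invented for illustration. It takes the `yes` column of the correlation table, drops the target itself, and sorts by absolute value so that negative correlations (such as the one for Fedu) are ranked by strength as well.

```python
import pandas as pd

# Made-up values for illustration; only the column names come from the dataset
toy = pd.DataFrame({
    "age":      [15, 16, 17, 18, 19, 16, 17, 18],
    "absences": [0, 2, 4, 6, 10, 0, 8, 2],
    "Fedu":     [4, 3, 2, 1, 2, 4, 1, 3],
    "yes":      [0, 0, 1, 1, 1, 0, 1, 0],
})

ranking = (
    toy.corr()["yes"]             # correlation of every column with the target
    .drop("yes")                  # the target correlates perfectly with itself
    .abs()                        # rank by strength, ignoring the sign
    .sort_values(ascending=False)
)
print(ranking)
```

Correlation only captures linear, one-variable-at-a-time relationships, so this is a rough screening step rather than a guarantee of predictive value.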


# Resulting Dataframe
First I will divide the dataset into a training set and a test set. The value I want to predict is the y value.

In [84]:
y = df['yes'] #The 'yes' variable defines if a person is in a relationship
X = df[['age', 'G2', 'G3', 'absences', 'G1', 'failures', 'Fedu']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) #split the data, store it into different variables
X_train.head()

Unnamed: 0,age,G2,G3,absences,G1,failures,Fedu
358,18,12,15,8,12,0,3
74,16,11,11,4,11,0,3
640,18,7,0,0,7,1,2
423,16,11,11,11,10,0,3
61,16,10,16,0,10,0,1
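Because the two classes are not equally common, it can help to keep their proportions the same in both splits. The sketch below shows the `stratify` argument of `train_test_split` on made-up data (`X_toy` and `y_toy` are illustrative names, with 30% positives). It is an optional variation, not what the split above does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 30% of them in class 1
X_toy = pd.DataFrame({"age": list(range(100))})
y_toy = pd.Series([0] * 70 + [1] * 30, name="yes")

# stratify=y_toy keeps the class balance (about) equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)

train_share = y_tr.mean()  # share of class 1 in the training set
test_share = y_te.mean()   # share of class 1 in the test set
print(train_share, test_share)
```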


# The model

In [85]:
from sklearn.ensemble import RandomForestClassifier

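model = RandomForestClassifier(max_depth=None, random_state=1) #max_depth=None: trees keep growing until the leaves are pure or too small to split
# Fit on the training split only; fitting on the full X and y (as in the run that produced the outputs below) also trains on the test rows, so those scores are optimistic.
model = model.fit(X_train, y_train)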
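Which rows the model is fitted on matters a great deal for the test score. The sketch below illustrates this on pure noise (random features and random labels, generated here and unrelated to the student data), where no real pattern exists. A forest fitted on all rows has already seen the test rows and scores far above chance on them, while a forest fitted on the training rows alone should land near 50%. The names `leaky` and `honest` are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Pure noise: the labels have no relation to the features
rng = np.random.default_rng(1)
X_noise = rng.normal(size=(400, 5))
y_noise = rng.integers(0, 2, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_noise, y_noise, test_size=0.3, random_state=1
)

# Fitted on every row, including the rows later used for testing
leaky = RandomForestClassifier(random_state=1).fit(X_noise, y_noise)
# Fitted on the training rows only
honest = RandomForestClassifier(random_state=1).fit(X_tr, y_tr)

leaky_acc = leaky.score(X_te, y_te)
honest_acc = honest.score(X_te, y_te)
print(f"fit on all rows:      {leaky_acc:.2f}")
print(f"fit on training rows: {honest_acc:.2f}")
```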



In [86]:
y_test_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_test_pred) #creates a "confusion matrix"
cm

array([[125,   3],
       [  6,  61]])

In [87]:
y_test.value_counts()

0    128
1     67
Name: yes, dtype: int64

The students in a romantic relationship are the 1's (67 students), and the 0's are the students without one. This comes in handy for labelling the axes of the next matrix.
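For the axes it helps to know that scikit-learn's `confusion_matrix` puts the actual classes on the rows and the predicted classes on the columns, with the classes in sorted order (0 first, then 1). A small sketch with made-up labels (`y_true` and `y_hat` are illustrative):

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration
y_true = [0, 0, 0, 1, 1, 1, 1]
y_hat  = [0, 1, 0, 1, 1, 0, 1]

cm_toy = confusion_matrix(y_true, y_hat)
# Row = actual class, column = predicted class:
# [[true negatives, false positives],
#  [false negatives, true positives]]
tn, fp, fn, tp = cm_toy.ravel()
print(cm_toy)
print(tn, fp, fn, tp)
```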

# Conclusions

In [88]:
# To make it easier to read, let's turn it into a dataframe and add labels (rows = actual, columns = predicted).
conf_matrix = pd.DataFrame(cm, index=['no relationship', 'relationship' ], columns = ['norelationship_p', 'relationship_p']) 
conf_matrix

Unnamed: 0,norelationship_p,relationship_p
no relationship,125,3
relationship,6,61


In [89]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_test_pred))

              precision    recall  f1-score   support

           0       0.95      0.98      0.97       128
           1       0.95      0.91      0.93        67

    accuracy                           0.95       195
   macro avg       0.95      0.94      0.95       195
weighted avg       0.95      0.95      0.95       195



So the accuracy is $0.95$: $95\%$ of the relationship statuses in the test set are predicted correctly.
For the relationship class, the precision is $\frac{61}{61 + 3} = 0.95$ and the recall is $\frac{61}{61 + 6} = 0.91$.

This means that of the people predicted to be in a relationship, 95% of them were actually in a relationship.

Of the people actually in a relationship, 91% were predicted to be in one. 
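The figures in the classification report can be recomputed by hand from the four cells of the confusion matrix. The sketch below does this for the relationship class, using the counts reported above. The variable names are illustrative.

```python
# Counts taken from the confusion matrix above
tn, fp = 125, 3   # actual "no relationship": predicted no / predicted yes
fn, tp = 6, 61    # actual "relationship":    predicted no / predicted yes

accuracy = (tp + tn) / (tn + fp + fn + tp)            # share of all predictions that are right
precision = tp / (tp + fp)                            # of the predicted relationships, how many are real
recall = tp / (tp + fn)                               # of the real relationships, how many are found
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of precision and recall

print(f"accuracy  {accuracy:.2f}")   # 0.95
print(f"precision {precision:.2f}")  # 0.95
print(f"recall    {recall:.2f}")     # 0.91
print(f"f1-score  {f1:.2f}")         # 0.93
```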