## Students in Portugal dataset: Predicting the Drinking Habits of Teenagers

In this project we use a dataset containing information about Portuguese students from two public schools. This is a real-world dataset that was collected to study alcohol consumption among young people and its effects on students' academic performance. The dataset was built from two sources: school reports and questionnaires. In this predictive analysis, we use three methods: predicting the most common category, a Logistic Regression model, and a Random Forest Classifier model.

Attribute contents:

    *1. school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
    *2. sex - student's sex (binary: 'F' - female or 'M' - male)
    *3. age - student's age (numeric: from 15 to 22)
    *4. address - student's home address type (binary: 'U' - urban or 'R' - rural)
    *5. famsize - family size (binary: 'LE3' - less than or equal to 3 or 'GT3' - greater than 3)
    *6. Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
    *7. Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
    *8. Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
    *9. Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
    *10. Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
    *11. reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
    *12. guardian - student's guardian (nominal: 'mother', 'father' or 'other')
    *13. traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
    *14. studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
    *15. failures - number of past class failures (numeric: n if 1<=n<3, else 4)
    *16. schoolsup - extra educational support (binary: yes or no)
    *17. famsup - family educational support (binary: yes or no)
    *18. paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
    *19. activities - extra-curricular activities (binary: yes or no)
    *20. nursery - attended nursery school (binary: yes or no)
    *21. higher - wants to take higher education (binary: yes or no)
    *22. internet - Internet access at home (binary: yes or no)
    *23. romantic - with a romantic relationship (binary: yes or no)
    *24. famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
    *25. freetime - free time after school (numeric: from 1 - very low to 5 - very high)
    *26. goout - going out with friends (numeric: from 1 - very low to 5 - very high)
    *27. Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
    *28. Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
    *29. health - current health status (numeric: from 1 - very bad to 5 - very good)
    *30. absences - number of school absences (numeric: from 0 to 93)

These grades relate to the course subject, Math or Portuguese:

    *31. G1 - first period grade (numeric: from 0 to 20)
    *32. G2 - second period grade (numeric: from 0 to 20)
    *33. G3 - final grade (numeric: from 0 to 20, output target)


In [1]:
import pandas as pd
import numpy as np
%matplotlib inline

In [4]:
student = pd.read_csv('student-por.csv')
student.rename(columns={'sex':'gender'}, inplace=True)
student['alcohol_index'] = (5*student['Dalc'] + 2*student['Walc'])/7
# Alcohol Consumption Level
student['acl'] = student['alcohol_index'] <= 2
student['acl'] = student['acl'].map({True: 'Low', False: 'High'})

In [5]:
student.head()

Unnamed: 0,school,gender,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,goout,Dalc,Walc,health,absences,G1,G2,G3,alcohol_index,acl
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,1,1,3,4,0,11,11,1.0,Low
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,3,1,1,3,2,9,11,11,1.0,Low
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,2,2,3,3,6,12,13,12,2.285714,High
3,GP,F,15,U,GT3,T,4,2,health,services,...,2,1,1,5,0,14,14,14,1.0,Low
4,GP,F,16,U,GT3,T,3,3,other,other,...,2,1,2,5,0,11,13,13,1.285714,Low
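
The `alcohol_index` column is a weighted average of the two consumption scores: five workdays at the `Dalc` level and two weekend days at the `Walc` level, divided by seven. The following is a minimal sketch of that arithmetic for three of the rows shown above. The function names `alcohol_index` and `consumption_level` are illustrative helpers, not part of the original notebook.

```python
def alcohol_index(dalc, walc):
    """Weighted weekly average: 5 workdays at Dalc plus 2 weekend days at Walc."""
    return (5 * dalc + 2 * walc) / 7


def consumption_level(index, threshold=2):
    """Label an index as 'Low' when it is at or below the threshold, else 'High'."""
    return 'Low' if index <= threshold else 'High'


# (Dalc, Walc) pairs taken from rows 0, 2 and 4 of the table above
for dalc, walc in [(1, 1), (2, 3), (1, 2)]:
    idx = alcohol_index(dalc, walc)
    print(dalc, walc, round(idx, 6), consumption_level(idx))
```

For row 2, for example, the index is (5*2 + 2*3)/7 = 16/7, which is above the threshold of 2, so the student is labelled 'High'.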


In [6]:
features = ['gender','famsize','age','studytime','famrel','goout','freetime','G3']
target = ['acl']

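### Important: Scikit-Learn only understands numbers!

This is why categorical features must be encoded as numbers, for example as 'dummy' (one-hot encoded) features. The categorical columns used here are all binary, so each one is mapped directly to 0 or 1.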

In [7]:
# For gender: Female will be 0, Male will be 1
student['gender'] = student['gender'].map({'F':0, 'M':1}).astype(int)
# For famsize: 'LE3' (less than or equal to 3) will be 0, 'GT3' (greater than 3) will be 1
student['famsize'] = student['famsize'].map({'LE3':0, 'GT3':1}).astype(int)
# For acl: 'Low' will be 0, 'High' will be 1
student['acl'] = student['acl'].map({'Low':0, 'High':1}).astype(int)
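
A direct 0/1 mapping works for binary columns. For a column with more than two categories, such as `Mjob`, one-hot encoding creates one 0/1 column per category. The sketch below uses `pd.get_dummies` on a small made-up frame. The `toy` DataFrame and its three rows are assumptions for illustration only and are not taken from the dataset.

```python
import pandas as pd

# Toy data (assumption): three invented rows, not rows from student-por.csv
toy = pd.DataFrame({
    'gender': ['F', 'M', 'F'],
    'Mjob': ['teacher', 'health', 'at_home'],
})

# Binary column: a direct mapping is enough
toy['gender'] = toy['gender'].map({'F': 0, 'M': 1})

# Multi-category column: one 0/1 indicator column per category
encoded = pd.get_dummies(toy, columns=['Mjob'], dtype=int)
print(encoded)
```

Each `Mjob_*` column is 1 only in the rows where the student's mother has that job, so the model never sees an artificial ordering between jobs.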

In [8]:
X = student[features].values
Y = student[target].values.ravel()  # flatten to the 1-D label array scikit-learn expects

### Method 1: Predict the most common category

In [9]:
student['acl'].value_counts(normalize=True)

0    0.744222
1    0.255778
Name: acl, dtype: float64
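
Always predicting the most common category (0, 'Low') is right for about 74.4% of students, so this is the baseline any real model has to beat. As a sketch, the same baseline can be expressed with scikit-learn's `DummyClassifier`. The label vector below is rebuilt from the class counts implied by the proportions above (483 of 649 students in class 0, 166 in class 1), and `X_placeholder` is a made-up stand-in because the baseline ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels rebuilt from the class counts: 483 'Low' (0) and 166 'High' (1)
y = np.array([0] * 483 + [1] * 166)
X_placeholder = np.zeros((len(y), 1))  # assumption: dummy features, never used

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_placeholder, y)

baseline_accuracy = baseline.score(X_placeholder, y)
print('Baseline accuracy: {:.6f}'.format(baseline_accuracy))  # prints Baseline accuracy: 0.744222
```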

### Method 2: Logistic Regression Model

Logistic regression is a model that uses the features to calculate the probability that the target variable belongs to the 'positive class' (a target value equal to 1).

In [10]:
from sklearn.linear_model import LogisticRegression

In [11]:
student_classifier_logreg = LogisticRegression(C=2)

In [12]:
student_classifier_logreg.fit(X,Y)


LogisticRegression(C=2)
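
To make the probability calculation concrete, the sketch below fits the same kind of model on synthetic data and reproduces its predicted probabilities by hand: a weighted sum of the features plus an intercept, passed through the sigmoid function. The arrays `X_demo` and `y_demo` are randomly generated assumptions, not the student data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): 200 rows, 3 numeric features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] - X_demo[:, 1] + rng.normal(size=200) > 0).astype(int)

model = LogisticRegression(C=2).fit(X_demo, y_demo)

# Linear score: z = w . x + b
z = X_demo @ model.coef_.ravel() + model.intercept_[0]
# Sigmoid turns the score into P(target == 1)
manual_proba = 1.0 / (1.0 + np.exp(-z))

sklearn_proba = model.predict_proba(X_demo)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))

# A probability above 0.5 corresponds to predicting class 1
manual_labels = (manual_proba > 0.5).astype(int)
```

The `C` parameter is the inverse of the regularization strength, so `C=2` penalizes large coefficients slightly less than the default `C=1`.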

### Model Evaluation

In [13]:
student['predictions_logreg'] = student_classifier_logreg.predict(X)

In [14]:
confusion_matrix = pd.crosstab(student['predictions_logreg'], student['acl'])
confusion_matrix

predictions_logreg,acl=0,acl=1
0,452,105
1,31,61


### Accuracy of Logistic Regression

In [15]:
ac = (confusion_matrix.iloc[0,0] + confusion_matrix.iloc[1,1])/student.shape[0]
print('Accuracy: {}'.format(ac))

Accuracy: 0.7904468412942989
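
Accuracy alone hides how the errors are split between the two classes. As a sketch, the label vectors can be rebuilt from the four confusion matrix counts above and passed to `sklearn.metrics` to obtain precision and recall for the 'High' class as well. The arrays `y_true` and `y_pred` are reconstructions from those counts, not the original row order.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Counts from the crosstab (rows = predictions, columns = actual acl):
# predicted 0 / actual 0: 452    predicted 0 / actual 1: 105
# predicted 1 / actual 0: 31     predicted 1 / actual 1: 61
y_pred = np.array([0] * 452 + [0] * 105 + [1] * 31 + [1] * 61)
y_true = np.array([0] * 452 + [1] * 105 + [0] * 31 + [1] * 61)

accuracy = accuracy_score(y_true, y_pred)    # (452 + 61) / 649
precision = precision_score(y_true, y_pred)  # 61 / (61 + 31)
recall = recall_score(y_true, y_pred)        # 61 / (61 + 105)

print('Accuracy:  {:.4f}'.format(accuracy))
print('Precision: {:.4f}'.format(precision))
print('Recall:    {:.4f}'.format(recall))
```

The recall shows that the model finds only 61 of the 166 students in the 'High' group, even though its overall accuracy is above the 74.4% baseline.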


### Method 3: Random Forest Classifier Model

In [16]:
from sklearn.ensemble import RandomForestClassifier

In [17]:
student_classifier_rf = RandomForestClassifier()

In [18]:
student_classifier_rf.fit(X,Y)
student['predictions_rf'] = student_classifier_rf.predict(X)


In [19]:
confusion_matrix = pd.crosstab(student['predictions_rf'], student['acl'])
confusion_matrix

predictions_rf,acl=0,acl=1
0,482,3
1,1,163


### Accuracy of Random Forest Classifier Model

In [20]:
ac = (confusion_matrix.iloc[0,0] + confusion_matrix.iloc[1,1])/student.shape[0]
print('Accuracy: {}'.format(ac))

Accuracy: 0.9938366718027735
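
This accuracy is measured on the same rows the model was trained on, and a random forest can largely memorize its training data. A held-out test set gives a more realistic estimate. The sketch below shows the pattern with `train_test_split` on synthetic data. `X_demo` and `y_demo` are randomly generated assumptions, so the printed numbers say nothing about the student dataset itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 600 rows, 8 features, noisy labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(600, 8))
y_demo = (X_demo[:, 0] + rng.normal(scale=2.0, size=600) > 0).astype(int)

# Hold out 25% of the rows; stratify keeps the class balance in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)

rf = RandomForestClassifier(random_state=0)
rf.fit(X_train, y_train)

train_acc = accuracy_score(y_train, rf.predict(X_train))
test_acc = accuracy_score(y_test, rf.predict(X_test))
print('Train accuracy: {:.3f}'.format(train_acc))
print('Test accuracy:  {:.3f}'.format(test_acc))
```

On the notebook's data, the same pattern would be `train_test_split(X, Y, ...)` followed by fitting on the training part and scoring on the test part.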


### Simplified syntax using the Random Forest Classifier Model

In [21]:
# Feature order: ['gender','famsize','age','studytime','famrel','goout','freetime','G3']
# We predict the outcome from the student's attributes and behaviours, based on patterns in past data.
new_student = np.array([[1,1,18,2,1,5,5,13]])
predictions = student_classifier_rf.predict(new_student)
print('The model predicts that the student belongs to the: ')
if predictions[0] == 1:
    print('High alcohol consumption group')
else:
    print('Low alcohol consumption group')

The model predicts that the student belongs to the: 
Low alcohol consumption group
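
Because the model receives a bare array, the values must be in exactly the same order as the `features` list used for training. One way to guard against ordering mistakes is to describe the student with named values and build the row from those. The helper `to_feature_row` below is a hypothetical sketch and is not part of the original notebook.

```python
import numpy as np

FEATURES = ['gender', 'famsize', 'age', 'studytime', 'famrel', 'goout', 'freetime', 'G3']


def to_feature_row(record, feature_names=FEATURES):
    """Hypothetical helper: order a dict of named values into a 1 x n_features array."""
    missing = [name for name in feature_names if name not in record]
    if missing:
        raise KeyError('missing features: {}'.format(missing))
    return np.array([[record[name] for name in feature_names]])


# Same student as above: male, family larger than 3, 18 years old, and so on
new_student = {'gender': 1, 'famsize': 1, 'age': 18, 'studytime': 2,
               'famrel': 1, 'goout': 5, 'freetime': 5, 'G3': 13}
row = to_feature_row(new_student)
print(row)
```

The resulting `row` can be passed to `student_classifier_rf.predict(row)` in place of the hand-written array.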
