# Lab: Logistic Regression

Learning Goals:  
* Explain how logistic regression arises from linear regression  
* Explain the shortcomings of using accuracy as a metric, and use an alternative  
* Iterate on an algorithm to improve its results  

In [1]:
import math
import pandas as pd
import numpy as np

## Import the data and define the problem

#### Data Set Information:  

These data address student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social, and school-related features, and were collected using school reports and questionnaires. Two datasets are provided regarding performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such a prediction is much more useful (see paper source for more details).  
[source, with column descriptions](https://archive.ics.uci.edu/ml/datasets/Student+Performance#)

#### The Problem
Optimize a classification model to predict which students are likely to pass their math classes. This model should not use columns G2 and G3 as features.

In [2]:
math_df = pd.read_csv('student-mat.csv', sep = ';')

math_df.columns

Index([u'school', u'sex', u'age', u'address', u'famsize', u'Pstatus', u'Medu',
       u'Fedu', u'Mjob', u'Fjob', u'reason', u'guardian', u'traveltime',
       u'studytime', u'failures', u'schoolsup', u'famsup', u'paid',
       u'activities', u'nursery', u'higher', u'internet', u'romantic',
       u'famrel', u'freetime', u'goout', u'Dalc', u'Walc', u'health',
       u'absences', u'G1', u'G2', u'G3'],
      dtype='object')

## Make this a classification problem

Since we are interested in classification algorithms, we need to turn our grade variable (G3) into a categorical value. Not knowing more about Portuguese grading scales, we will assume that a score at or above the 70th percentile is required to pass. 

In [3]:
# calculate the 70th percentile of the G3 column
np.percentile(math_df.G3, 70.0)

13.0

In [4]:
# and add a column for pass/fail
def pass_fail(student):
    if student['G3'] >= 13:
        return 1
    else:
        return 0

math_df['passed'] = math_df.apply(pass_fail, axis = 1)

math_df.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3,passed
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,3,4,1,1,3,6,5,6,6,0
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,3,3,1,1,3,4,5,5,6,0
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,3,2,2,3,3,10,7,8,10,0
3,GP,F,15,U,GT3,T,4,2,health,services,...,2,2,1,1,5,2,15,14,15,1
4,GP,F,16,U,GT3,T,3,3,other,other,...,3,2,1,2,5,4,6,10,10,0
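
The percentile and pass/fail steps above can also be written so the cutoff is computed once and reused instead of being typed in by hand. The sketch below is only an illustration: `toy_df` and its grades are made up to stand in for `math_df`, and the vectorized comparison is an alternative to the row-wise `apply`, not the lab's required approach.

```python
import numpy as np
import pandas as pd

# Hypothetical final grades standing in for math_df['G3'].
toy_df = pd.DataFrame({'G3': [0, 6, 8, 10, 10, 11, 12, 13, 15, 18]})

# Compute the cutoff once and keep it in a variable.
cutoff = np.percentile(toy_df['G3'], 70.0)

# Vectorized comparison: True/False per row, converted to 1/0.
toy_df['passed'] = (toy_df['G3'] >= cutoff).astype(int)

print(cutoff)
# Share of students labeled as passing.
print(toy_df['passed'].mean())
```

Because the cutoff is a percentile, students who pass are the minority class by construction, which is worth keeping in mind when reading accuracy numbers later.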


## Put data into arrays, and train/test split



In [49]:
features = np.array(math_df[['Medu', 'Fedu']])
labels = np.array(math_df['passed'])

from sklearn.model_selection import train_test_split

test_size = 0.20

features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=test_size, random_state=42)

In [50]:
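from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler

# encode the yes/no 'higher' column as 1/0 first,
# so the feature array holds numbers rather than strings
def transform_higher(student):
    if student['higher'] == 'yes':
        return 1
    else:
        return 0

math_df['higher'] = math_df.apply(transform_higher, axis = 1)

# alternative feature set; re-run train_test_split before fitting a model on it
features = np.array(math_df[['failures', 'higher']])
labels = np.array(math_df['passed'])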

## Logistic Regression

First we will perform a basic logistic regression. This will provide the intercept and coefficients for later steps, as well as give a baseline for analysis and iteration.

In [51]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

clf = LogisticRegression()
clf = clf.fit(features_train, labels_train) 
pred = clf.predict(features_test)

acc = accuracy_score(labels_test, pred)  # argument order is (y_true, y_pred)
print("accuracy:", acc)

accuracy: 0.607594936709


In [52]:
# get the intercept and coefficients
intercept = clf.intercept_
coefficients = clf.coef_[0]

print(intercept)
print(coefficients) # two coefficients because two features

[-1.61435951]
[ 0.34478007 -0.03812506]


In [53]:
# make a dataframe of testing features and labels to use later for comparison

math_test = pd.DataFrame(features_test, columns = ['Medu', 'Fedu'])
math_test['passed'] = labels_test
math_test.head()

Unnamed: 0,Medu,Fedu,passed
0,2,1,0
1,1,2,0
2,3,3,0
3,2,1,0
4,2,2,0


In [54]:
# add a column to dataframe showing output from linear model (y_star)
math_test['y_star'] = intercept + coefficients[0] * math_test['Medu'] + \
                    coefficients[1] * math_test['Fedu']
    
print(math_test.head())
print(max(math_test.y_star), min(math_test.y_star))

   Medu  Fedu  passed    y_star
0     2     1       0 -0.962924
1     1     2       0 -1.345830
2     3     3       0 -0.694394
3     2     1       0 -0.962924
4     2     2       0 -1.001049
-0.235239229895 -1.69060963264
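
The same linear score can be computed for every row at once as a matrix-vector product. The sketch below copies the intercept, coefficients, and first five test rows from the outputs printed above; the variable `X` is introduced here only for illustration.

```python
import numpy as np

# Intercept and coefficients copied from the fitted model's printed output.
intercept = -1.61435951
coefficients = np.array([0.34478007, -0.03812506])

# First five test rows (Medu, Fedu) as shown in math_test.head().
X = np.array([[2, 1], [1, 2], [3, 3], [2, 1], [2, 2]])

# One dot product gives the linear score y_star for every row.
y_star = X.dot(coefficients) + intercept
print(np.round(y_star, 6))
```

In scikit-learn, `clf.decision_function(features_test)` should return this same quantity for the fitted model, which is a quick way to check the hand calculation.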


In [20]:
# use the logistic (sigmoid) transformation to calculate probability
def get_prob(row):
    y = row['y_star']
    p = math.exp(y) / (math.exp(y) + 1) # e^y / (e^y + 1)
    return p

math_test['prob'] = math_test.apply(get_prob, axis = 1)

math_test.head()
print(max(math_test.prob), min(math_test.prob))

0.44145989909 0.155695684324
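
This is the step where logistic regression departs from linear regression: the linear score `y_star` is treated as log-odds and passed through the sigmoid to get a probability. The sketch below uses only the largest and smallest `y_star` values printed earlier; the helper names `sigmoid` and `logit` are chosen here for illustration. It also compares each probability with the 0.5 cutoff that `predict` applies for a binary model.

```python
import math

def sigmoid(y_star):
    """Map a linear score (log-odds) to a probability between 0 and 1."""
    return math.exp(y_star) / (math.exp(y_star) + 1)

def logit(p):
    """Inverse of sigmoid: recover the log-odds from a probability."""
    return math.log(p / (1 - p))

# Largest and smallest y_star values printed for the test set.
y_max, y_min = -0.235239229895, -1.69060963264

p_max, p_min = sigmoid(y_max), sigmoid(y_min)
print(round(p_max, 6))
print(round(p_min, 6))

# A score of exactly 0 sits on the decision boundary.
print(sigmoid(0.0))

# Would either extreme be labeled as a pass at the 0.5 cutoff?
print(p_max >= 0.5)
print(p_min >= 0.5)
```

Since the sigmoid is monotonic, a probability cutoff of 0.5 is the same as asking whether `y_star` is at least 0.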


In [44]:
# add predictions to dataframe for comparison
math_test['pred'] = pred

#### Discuss: What was our accuracy before? Looking at the dataframe we just created, is that meaningful? Why or why not?



### Precision and Recall

**Precision**  
Precision is the fraction of predicted positives that are actually positive, so false positives lower the score.  
precision = true positives / (true positives + false positives)  

**Recall**  
Recall is the fraction of actual positives that the model detects, so false negatives lower the score; false positives do not affect it.  
recall = true positives / (true positives + false negatives)
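
To make the two formulas concrete, here is a small sketch that counts true positives, false positives, and false negatives by hand. The labels are made up for illustration, and the helper `precision_recall` is hypothetical; it returns `None` when a denominator is zero, which is one way (an assumption of this sketch) to flag an undefined metric.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for 0/1 labels; None where undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # float() keeps the division exact under both Python 2 and Python 3.
    precision = tp / float(tp + fp) if (tp + fp) > 0 else None
    recall = tp / float(tp + fn) if (tp + fn) > 0 else None
    return precision, recall

# Hypothetical true labels and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]
print(precision_recall(y_true, y_pred))

# A model that never predicts the positive class.
print(precision_recall(y_true, [0] * len(y_true)))
```

When a model never predicts a positive, precision has a zero denominator while recall is simply zero; scikit-learn reports 0.0 in that situation and emits a warning.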

In [23]:
from sklearn.metrics import precision_score, recall_score

precision = precision_score(labels_test, pred)
recall = recall_score(labels_test, pred)
print(precision)
print(recall)

0.0
0.0




#### Discuss: Remember that we're trying to predict who will pass a class. In this situation, is it better to have a Type I error (false positive) or Type II error (false negative)? What metric is appropriate to use for this?

## You Do:
Try using different features or changing other parameters to build a better classifier. Use a metric other than accuracy as your guide. Do at least 3 iterations using Logistic Regression before moving on to another algorithm.
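
One possible direction, among others, is to keep the model and change the probability cutoff used to call a student a pass. The sketch below is a minimal illustration under stated assumptions: the data are randomly generated rather than taken from the student dataset, the scores are computed on the training rows for brevity, and the thresholds 0.5, 0.4, and 0.3 are arbitrary choices. It relies on `predict_proba` and on the `zero_division` argument of `precision_score` and `recall_score`, which requires a reasonably recent scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Hypothetical data (not the student dataset): 200 rows, 2 features,
# with positives in the minority.
rng = np.random.RandomState(42)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + rng.normal(scale=1.5, size=200) > 1.0).astype(int)

clf = LogisticRegression().fit(X, y)
prob_pass = clf.predict_proba(X)[:, 1]  # column 1 is P(class == 1)

scores = {}
for threshold in (0.5, 0.4, 0.3):
    # Label a row positive when its probability reaches the threshold.
    pred = (prob_pass >= threshold).astype(int)
    scores[threshold] = (
        precision_score(y, pred, zero_division=0),
        recall_score(y, pred, zero_division=0),
    )
    print("threshold %.1f: precision %.2f, recall %.2f"
          % ((threshold,) + scores[threshold]))
```

Lowering the threshold can only add predicted positives, so recall never decreases, while precision may move in either direction. In the lab itself you would evaluate on `features_test` and `labels_test` rather than the training rows.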
