# The Logistic Regression Model

The code below demonstrates how to fit and interpret a logistic regression model. As before, please refer to the slides for the motivation and derivation behind logistic regression and, importantly, its relationship to the linear model.

In [32]:
from pandas import DataFrame, Series
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

%matplotlib inline

In [33]:
#Read in Titanic Data
titanic = pd.read_csv("../../datasets/titanic/train.csv")

## Dealing with Categorical Data (One-Hot-Encoding)

Categorical data, meaning columns whose string values denote categories rather than numeric quantities, are extremely common in datasets. The catch is that, at least in Python, the vast majority of models cannot handle categorical data directly; they expect numeric data types only. For linear and logistic regression this makes intuitive sense, because fitting the model relies on matrix arithmetic and it doesn't make sense to invert a matrix of strings. What we do instead is called "one-hot encoding".

In [34]:
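#dtype=float keeps the dummy columns numeric (newer pandas versions return booleans by default, which statsmodels may reject)
titanic_only = pd.get_dummies(titanic,columns=['Sex','Pclass','Embarked'],drop_first=True,dtype=float)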
titanic_only.head()

| | PassengerId | Survived | Name | Age | SibSp | Parch | Ticket | Fare | Cabin | Sex_male | Pclass_2 | Pclass_3 | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | Braund, Mr. Owen Harris | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 3 | 1 | Heikkinen, Miss. Laina | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 5 | 0 | Allen, Mr. William Henry | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |


If you look closely, each categorical variable is now represented by one or more new columns! Sex is split into a male-only column (1 if the corresponding Sex element was male) and a female-only column, which is NOT shown because we set drop_first=True. drop_first drops one of the newly generated columns for each variable, which again has to do with multicollinearity. In this dataset, if I know that someone is not male, then I know for sure that they are female. As a result, the male column alone carries all the information our model needs, and we won't have to worry about multicollinearity issues!

This process of converting a categorical column into multiple columns of 0's and 1's is called one-hot encoding, and it is by far the most common way of feeding categorical data into a model. Another name for the process is creating "dummy variables" (hence pd.get_dummies), which simply refers to the columns of 1's and 0's.
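
To make the effect of drop_first concrete, here is a minimal sketch on a made-up three-row frame. The `toy` DataFrame and its values are illustrative assumptions, not rows from the Titanic file.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic file
toy = pd.DataFrame({"Sex": ["male", "female", "male"]})

# Without drop_first: one indicator column per category
full = pd.get_dummies(toy, columns=["Sex"], dtype=float)

# With drop_first: the first category in sorted order ("female") is dropped
reduced = pd.get_dummies(toy, columns=["Sex"], drop_first=True, dtype=float)

print(full.columns.tolist())     # ['Sex_female', 'Sex_male']
print(reduced.columns.tolist())  # ['Sex_male']

# The two full columns always sum to 1, which is the redundancy drop_first removes
print((full["Sex_female"] + full["Sex_male"]).tolist())  # [1.0, 1.0, 1.0]
```

Because the two indicator columns always add up to 1, either one can be recovered from the other, so keeping both adds no information.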

## Validation Method

In [35]:
#Drop columns we don't care about (yet) or that have missing values (models don't like missing values)
titanic_only.drop(['PassengerId','Name','Ticket','Age','Cabin'],axis=1,inplace=True)

In [36]:
#Train Test Splitting
local_train, local_test = train_test_split(titanic_only,test_size=0.2,random_state=123)

In [37]:
local_train.shape

(712, 9)

In [38]:
local_test.shape

(179, 9)

In [39]:
local_train_y = local_train["Survived"]
local_train_x = local_train.drop(["Survived"],axis=1)
local_test_y = local_test["Survived"]
local_test_x = local_test.drop("Survived",axis=1)

In [40]:
#The Model
clf = sm.Logit(local_train_y,local_train_x)
result = clf.fit()
preds = result.predict(local_test_x)

Optimization terminated successfully.
         Current function value: 0.497509
         Iterations 6


In [41]:
#Accuracy of Logistic Model
np.mean((preds > 0.5) == local_test_y)

0.8044692737430168
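
The accuracy line above packs two steps into one expression: thresholding the predicted probabilities at 0.5, then comparing against the true labels. The sketch below separates those steps and adds a cross-tabulation of the errors. The probabilities and labels are made-up values for illustration, not output from the model above.

```python
import numpy as np
import pandas as pd

# Hypothetical predicted probabilities and true labels (illustrative only)
probs = np.array([0.91, 0.20, 0.65, 0.40, 0.08, 0.55])
actual = np.array([1, 0, 1, 1, 0, 0])

# Classify as 1 when the predicted probability exceeds the 0.5 threshold
predicted = (probs > 0.5).astype(int)

# Accuracy is the share of predictions that match the true labels
accuracy = np.mean(predicted == actual)
print(accuracy)

# Cross-tabulate actual vs. predicted to see which kinds of errors were made
confusion = pd.crosstab(
    pd.Series(actual, name="actual"),
    pd.Series(predicted, name="predicted"),
)
print(confusion)
```

Accuracy alone does not distinguish false positives from false negatives, which is why a table like this can be a useful companion to the single number.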

In [42]:
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Survived | No. Observations: | 712 |
| Model: | Logit | Df Residuals: | 704 |
| Method: | MLE | Df Model: | 7 |
| Date: | Wed, 28 Sep 2016 | Pseudo R-squ.: | 0.2556 |
| Time: | 01:38:38 | Log-Likelihood: | -354.23 |
| converged: | True | LL-Null: | -475.84 |
| | | LLR p-value: | 7.628e-49 |

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| SibSp | -0.2733 | 0.103 | -2.662 | 0.008 | -0.475 -0.072 |
| Parch | 0.0138 | 0.133 | 0.104 | 0.917 | -0.246 0.274 |
| Fare | 0.0156 | 0.003 | 5.428 | 0.000 | 0.010 0.021 |
| Sex_male | -2.1264 | 0.192 | -11.061 | 0.000 | -2.503 -1.750 |
| Pclass_2 | 0.5004 | 0.272 | 1.840 | 0.066 | -0.033 1.033 |
| Pclass_3 | -0.2624 | 0.248 | -1.060 | 0.289 | -0.748 0.223 |
| Embarked_Q | 0.6878 | 0.372 | 1.848 | 0.065 | -0.042 1.417 |
| Embarked_S | 0.4359 | 0.239 | 1.820 | 0.069 | -0.034 0.905 |
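
Logit coefficients are on the log-odds scale, so one common way to read them is to exponentiate them into odds ratios. The sketch below does this for three coefficients copied by hand from the summary table above; the `coefs` dictionary is just an illustrative container. With the fitted model available, `np.exp(result.params)` should give the same quantities for every column.

```python
import numpy as np

# Coefficients copied from the summary table above (log-odds scale)
coefs = {"Sex_male": -2.1264, "Fare": 0.0156, "SibSp": -0.2733}

# Exponentiating a logit coefficient gives an odds ratio
odds_ratios = {name: float(np.exp(beta)) for name, beta in coefs.items()}

for name, ratio in odds_ratios.items():
    print("{0}: odds ratio {1:.3f}".format(name, ratio))
```

An odds ratio of roughly 0.12 for Sex_male means that, holding the other columns fixed, the odds of survival for men are about 12% of the odds for women under this model. An odds ratio above 1, as for Fare, means the odds of survival rise as that variable increases.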


## Now let's put some of the Data Cleaning and Feature Engineering from before to work!

In [43]:
#Read in Titanic Data
titanic = pd.read_csv("../../datasets/titanic/train.csv")

In [44]:
titanic_engineered = titanic.copy()

In [45]:
#Imputing Age
titanic_engineered['title'] = 'other'
titanic_engineered.loc[['Master.' in n for n in titanic_engineered['Name']],'title'] = 'Master'
titanic_engineered.loc[['Miss.' in n for n in titanic_engineered['Name']],'title'] = 'Miss'
titanic_engineered.loc[['Mr.' in n for n in titanic_engineered['Name']],'title'] = 'Mr'
titanic_engineered.loc[['Mrs.' in n for n in titanic_engineered['Name']],'title'] = 'Mrs'

#Transform performs the operation per group and returns the values aligned to their original index
#Here: missing ages are replaced by the mean age of passengers with the same title
titanic_engineered['age_filled'] = titanic_engineered.groupby('title')['Age'].transform(lambda x: x.fillna(x.mean()))

titanic_engineered.drop(['Age'],axis=1,inplace=True)
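
To see what the group-wise imputation does, here is a minimal sketch on a made-up five-row frame. The `toy` data is an illustrative assumption; only the groupby/transform pattern mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (illustrative only)
toy = pd.DataFrame({
    "title": ["Mr", "Mr", "Mr", "Miss", "Miss"],
    "Age": [30.0, 40.0, np.nan, 10.0, np.nan],
})

# For each title group, replace missing ages with that group's mean age
toy["age_filled"] = toy.groupby("title")["Age"].transform(lambda s: s.fillna(s.mean()))

print(toy["age_filled"].tolist())  # [30.0, 40.0, 35.0, 10.0, 10.0]
```

The missing "Mr" age becomes 35.0 (the mean of 30 and 40) and the missing "Miss" age becomes 10.0, so each gap is filled from its own group rather than from the overall mean.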

In [46]:
#Cabin Side Feature
titanic_engineered['cabin_side'] = 'Unknown'
titanic_engineered.loc[titanic_engineered['Cabin'].str[-1].isin(["1", "3", "5", "7", "9"]),'cabin_side'] = 'starboard'
titanic_engineered.loc[titanic_engineered['Cabin'].str[-1].isin(["2", "4", "6", "8", "0"]),'cabin_side'] = 'port'

In [48]:
#Deck Feature (including some cleaning)
titanic_engineered['deck'] = 'Unknown'
titanic_engineered.loc[titanic_engineered['Cabin'].notnull(),'deck'] = titanic_engineered['Cabin'].str[0]
titanic_engineered.loc[titanic_engineered['deck'] == 'T','deck'] = "Unknown"

pattern = r"[A-Z]\s[A-Z]" #A capital letter, followed by a whitespace character, followed by another capital letter (raw string so \s is not treated as a string escape)
mask = titanic_engineered['Cabin'].str.contains(pattern,na=False)
titanic_engineered.loc[mask,'deck'] = titanic_engineered.loc[mask,'Cabin'].str[2]
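
The deck logic handles several cases at once, so a small walk-through may help. The sketch below applies the same steps to five made-up cabin strings, chosen as assumptions to cover a plain cabin, a missing cabin, a stray leading letter, the lone "T" value, and multiple cabins.

```python
import numpy as np
import pandas as pd

# Hypothetical cabin strings chosen to cover each case (illustrative only)
toy = pd.DataFrame({"Cabin": ["C85", np.nan, "F G73", "T", "B57 B59"]})

# Default to "Unknown", then take the first character where a cabin is recorded
toy["deck"] = "Unknown"
has_cabin = toy["Cabin"].notnull()
toy.loc[has_cabin, "deck"] = toy.loc[has_cabin, "Cabin"].str[0]
toy.loc[toy["deck"] == "T", "deck"] = "Unknown"

# A lone letter, a space, then another letter: use the second letter as the deck
pattern = r"[A-Z]\s[A-Z]"
mask = toy["Cabin"].str.contains(pattern, na=False)
toy.loc[mask, "deck"] = toy.loc[mask, "Cabin"].str[2]

print(toy["deck"].tolist())  # ['C', 'Unknown', 'G', 'Unknown', 'B']
```

Note that "B57 B59" does not match the pattern, because the character before the space is a digit rather than a letter, so its deck stays "B".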

In [49]:
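#Number of cabins listed per passenger (1 when Cabin is missing, since splitting NaN leaves a float NaN instead of a list)
titanic_engineered['num_in_group'] = titanic_engineered['Cabin'].str.split().apply(lambda x: len(x) if isinstance(x, list) else 1)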
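
As a quick check of the counting logic, the sketch below runs the same split-and-count idea on four made-up cabin strings. The `cabins` Series is an illustrative assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical cabin strings (illustrative only)
cabins = pd.Series(["C85", np.nan, "C23 C25 C27", "B57 B59"])

# Split on whitespace; missing values stay NaN rather than becoming lists
parts = cabins.str.split()

# Count the pieces for real lists and fall back to 1 when the cabin is missing
counts = parts.apply(lambda x: len(x) if isinstance(x, list) else 1)

print(counts.tolist())  # [1, 1, 3, 2]
```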

In [50]:
#Removing columns we don't want (that don't make sense to include anymore)
#Notice the raw Age column was already dropped above; age_filled (with the missing values filled in) now takes its place!
titanic_engineered.drop(['PassengerId','Name','Ticket','Cabin','title'],axis=1,inplace=True)

In [51]:
#Getting Dummy Variables
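#dtype=float keeps the dummy columns numeric (newer pandas versions return booleans by default, which statsmodels may reject)
titanic_engineered = pd.get_dummies(titanic_engineered,columns=['Sex','Pclass','Embarked','cabin_side','deck'],drop_first=True,dtype=float)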

In [52]:
#Train Test Splitting
local_train, local_test = train_test_split(titanic_engineered,test_size=0.2,random_state=123)

local_train_y = local_train["Survived"]
local_train_x = local_train.drop(["Survived"],axis=1)
local_test_y = local_test["Survived"]
local_test_x = local_test.drop("Survived",axis=1)

In [53]:
#The Model
clf = sm.Logit(local_train_y,local_train_x)
result = clf.fit()
preds = result.predict(local_test_x)

Optimization terminated successfully.
         Current function value: 0.434784
         Iterations 6


In [54]:
#Accuracy of Logistic Model
np.mean((preds > 0.5) == local_test_y)

0.82681564245810057

In [55]:
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Survived | No. Observations: | 712 |
| Model: | Logit | Df Residuals: | 693 |
| Method: | MLE | Df Model: | 18 |
| Date: | Wed, 28 Sep 2016 | Pseudo R-squ.: | 0.3494 |
| Time: | 01:38:42 | Log-Likelihood: | -309.57 |
| converged: | True | LL-Null: | -475.84 |
| | | LLR p-value: | 9.315e-60 |

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| SibSp | -0.3718 | 0.123 | -3.016 | 0.003 | -0.613 -0.130 |
| Parch | -0.0317 | 0.141 | -0.224 | 0.822 | -0.309 0.245 |
| Fare | 0.0017 | 0.003 | 0.580 | 0.562 | -0.004 0.008 |
| age_filled | -0.0450 | 0.009 | -4.865 | 0.000 | -0.063 -0.027 |
| num_in_group | -0.2892 | 0.472 | -0.612 | 0.540 | -1.215 0.637 |
| Sex_male | -2.6553 | 0.230 | -11.525 | 0.000 | -3.107 -2.204 |
| Pclass_2 | -0.4183 | 0.482 | -0.867 | 0.386 | -1.364 0.527 |
| Pclass_3 | -1.4352 | 0.481 | -2.986 | 0.003 | -2.377 -0.493 |
| Embarked_Q | -0.2604 | 0.430 | -0.606 | 0.545 | -1.103 0.582 |


## K-Fold Cross Validation (Basic Data Set)

In [56]:
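from sklearn.model_selection import KFold

In [57]:
#Splits data into our train and test indices for each fold
#(stored as a list of (train_index, test_index) pairs so the same folds can be reused in both loops below)
kf = list(KFold(n_splits=10).split(titanic_only))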
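
To show what the fold indices look like, the sketch below builds unshuffled folds by hand with NumPy. It illustrates the idea using np.array_split and is not scikit-learn's implementation; the row count of 891 comes from the train and test shapes shown earlier (712 + 179).

```python
import numpy as np

n_rows = 891   # rows in the Titanic training file (712 + 179 from the split above)
n_folds = 10

# Split the row positions into 10 contiguous, nearly equal chunks
indices = np.arange(n_rows)
chunks = np.array_split(indices, n_folds)

fold_sizes = []
for test_index in chunks:
    # Every row not in the test chunk is used for training
    train_index = np.setdiff1d(indices, test_index)
    fold_sizes.append((len(train_index), len(test_index)))

print(fold_sizes[0])   # (801, 90)
print(fold_sizes[-1])  # (802, 89)
```

Each row lands in exactly one test chunk, so every passenger is used for testing once and for training nine times. Because 891 is not divisible by 10, the first fold holds one extra test row.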

In [58]:
#Saves our accuracy scores for each fold
outcomes = []

#Keeps track of which fold we are currently in
fold = 0

In [59]:
for train_index, test_index in kf:
    fold += 1
    local_train_xy, local_test_xy = titanic_only.iloc[train_index], titanic_only.iloc[test_index]
    local_train_y = local_train_xy['Survived']
    local_train_x = local_train_xy.drop(['Survived'],axis=1)
    local_test_y = local_test_xy['Survived']
    local_test_x = local_test_xy.drop(['Survived'],axis=1)

    clf = sm.Logit(local_train_y,local_train_x)
    result = clf.fit()
    preds = result.predict(local_test_x)
    accuracy = np.mean((preds > 0.5) == local_test_y)

    outcomes.append(accuracy)
    print("Fold {0} accuracy: {1}".format(fold, accuracy)) 

Optimization terminated successfully.
         Current function value: 0.486011
         Iterations 6
Fold 1 accuracy: 0.7555555555555555
Optimization terminated successfully.
         Current function value: 0.491157
         Iterations 6
Fold 2 accuracy: 0.8314606741573034
Optimization terminated successfully.
         Current function value: 0.482838
         Iterations 6
Fold 3 accuracy: 0.7640449438202247
Optimization terminated successfully.
         Current function value: 0.490833
         Iterations 6
Fold 4 accuracy: 0.8426966292134831
Optimization terminated successfully.
         Current function value: 0.482505
         Iterations 6
Fold 5 accuracy: 0.7528089887640449
Optimization terminated successfully.
         Current function value: 0.485408
         Iterations 6
Fold 6 accuracy: 0.7752808988764045
Optimization terminated successfully.
         Current function value: 0.482177
         Iterations 6
Fold 7 accuracy: 0.7640449438202247
Optimization terminated successful

In [60]:
#Final Cross Validated (average) score
mean_outcome = np.mean(outcomes)
mean_outcome

0.78679151061173536

## K-Fold Cross Validation (Feature Engineered Data Set)

In [61]:
#Saves our accuracy scores for each fold
outcomes = []

#Keeps track of which fold we are currently in
fold = 0

In [62]:
for train_index, test_index in kf:
    fold += 1
    local_train_xy, local_test_xy = titanic_engineered.iloc[train_index], titanic_engineered.iloc[test_index]
    local_train_y = local_train_xy['Survived']
    local_train_x = local_train_xy.drop(['Survived'],axis=1)
    local_test_y = local_test_xy['Survived']
    local_test_x = local_test_xy.drop(['Survived'],axis=1)

    clf = sm.Logit(local_train_y,local_train_x)
    result = clf.fit()
    preds = result.predict(local_test_x)
    accuracy = np.mean((preds > 0.5) == local_test_y)

    outcomes.append(accuracy)
    print("Fold {0} accuracy: {1}".format(fold, accuracy)) 

    

Optimization terminated successfully.
         Current function value: 0.426672
         Iterations 6
Fold 1 accuracy: 0.8111111111111111
Optimization terminated successfully.
         Current function value: 0.424065
         Iterations 7
Fold 2 accuracy: 0.797752808988764
Optimization terminated successfully.
         Current function value: 0.422449
         Iterations 6
Fold 3 accuracy: 0.8089887640449438
Optimization terminated successfully.
         Current function value: 0.422501
         Iterations 7
Fold 4 accuracy: 0.8202247191011236
Optimization terminated successfully.
         Current function value: 0.423288
         Iterations 7
Fold 5 accuracy: 0.7752808988764045
Optimization terminated successfully.
         Current function value: 0.423060
         Iterations 7
Fold 6 accuracy: 0.7640449438202247
Optimization terminated successfully.
         Current function value: 0.424933
         Iterations 6
Fold 7 accuracy: 0.797752808988764
Optimization terminated successfully

In [63]:
mean_outcome = np.mean(outcomes)
mean_outcome

0.8092009987515606