# DAT 19: Homework 2 Assignment

## Instructions

For Homework 2, we will build on the work we did with the Titanic dataset in Homework 1. In this assignment, we will build a logistic regression model to predict passenger survival.

Please do all your analysis to answer the questions below in this Jupyter notebook. Show your work.

**Please submit your completed notebook by 6:00PM on Monday, January 11.**

## About the Data

```
VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)
 If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored.  The following are the definitions used
for sibsp and parch.

Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent:   Mother or Father of Passenger Aboard Titanic
Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws.  Some children travelled
only with a nanny, therefore parch=0 for them.  As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.
```

## Homework Assignment

**1) Create a logistic regression model on the Titanic dataset to predict the survival of passengers. Show your model output. Include coefficient values.**

In [127]:
#import packages 

import pandas as pd
import numpy as np

from bokeh.plotting import figure,show,output_notebook
from bokeh.models import Range1d

from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

pd.set_option('display.max_rows',100)
pd.set_option('display.max_columns',60)


In [144]:
#read in data 

data = pd.read_csv('/Users/maxcameron/Desktop/General Assembly/DAT_SF_19/data/titanic.csv')
data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S |


In [141]:
#investigate data
# print(data.head())
# print(data.describe())
print(data.info())

# Munging tasks: drop unnecessary columns, impute gender-specific average ages,
# convert Sex from a string to 0/1, and turn Pclass into dummy variables.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 15 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null int64
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
Pclass_1       891 non-null float64
Pclass_2       891 non-null float64
Pclass_3       891 non-null float64
dtypes: float64(5), int64(6), object(4)
memory usage: 111.4+ KB
None


In [143]:
#drop unnecessary columns

df = data.drop(['Name','Ticket','Fare','Cabin','Embarked'], axis=1)

df.head()

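| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Pclass_1 | Pclass_2 | Pclass_3 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 0 | 22 | 1 | 0 | 0 | 0 | 1 |
| 1 | 2 | 1 | 1 | 1 | 38 | 1 | 0 | 1 | 0 | 0 |
| 2 | 3 | 1 | 3 | 1 | 26 | 0 | 0 | 0 | 0 | 1 |
| 3 | 4 | 1 | 1 | 1 | 35 | 1 | 0 | 1 | 0 | 0 |
| 4 | 5 | 0 | 3 | 0 | 35 | 0 | 0 | 0 | 0 | 1 |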


In [131]:
# Convert sex to a 0/1 indicator (male = 0, female = 1)
data.Sex = data.Sex.replace(['male','female'],[0,1])
print(data[['Name','Sex']].head())

                                                Name  Sex
0                            Braund, Mr. Owen Harris    0
1  Cumings, Mrs. John Bradley (Florence Briggs Th...    1
2                             Heikkinen, Miss. Laina    1
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)    1
4                           Allen, Mr. William Henry    0


In [132]:
#Impute avg age for men and women separately

avg_age_men = data.Age[data.Sex==0].mean()
avg_age_women = data.Age[data.Sex==1].mean()
# Assign through .loc so the fill is applied to data itself rather than to a copy
data.loc[data.Sex==0, 'Age'] = data.loc[data.Sex==0, 'Age'].fillna(avg_age_men)
data.loc[data.Sex==1, 'Age'] = data.loc[data.Sex==1, 'Age'].fillna(avg_age_women)
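
A group-wise fill can also be written without boolean masks. The sketch below is a minimal illustration on a made-up frame (the `toy` name and its values are assumptions, not the Titanic data). `groupby(...).transform('mean')` returns the per-group mean aligned to the original rows, so a single `fillna` covers both sexes.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values: Sex 0 = male, 1 = female.
toy = pd.DataFrame({
    "Sex": [0, 0, 0, 1, 1, 1],
    "Age": [20.0, 40.0, np.nan, 10.0, 30.0, np.nan],
})

# Mean age per sex, broadcast back to the original row order.
group_mean_age = toy.groupby("Sex")["Age"].transform("mean")

# Fill only the missing ages; assigning the whole column avoids chained indexing.
toy["Age"] = toy["Age"].fillna(group_mean_age)

print(toy)
```

Assigning to the full column sidesteps the view-versus-copy ambiguity, and the same two lines would extend to more groups (for example, sex and passenger class together) by passing a list of columns to `groupby`.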


In [133]:
#Check for null values for men and women

print(data[(data.Sex==0)&(data.Age.isnull())])
print(data[(data.Sex==1)&(data.Age.isnull())])

print(data.info())

Empty DataFrame
Columns: [PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked]
Index: []
Empty DataFrame
Columns: [PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked]
Index: []
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null int64
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(6), object(4)
memory usage: 90.5+ KB
None


In [134]:
# Put Pclass into dummy variables

pclass = pd.get_dummies(data.Pclass, prefix = 'Pclass')
print(pclass.head())

   Pclass_1  Pclass_2  Pclass_3
0         0         0         1
1         1         0         0
2         0         0         1
3         1         0         0
4         0         0         1


In [135]:
#Merge pclass back into dataframe

data = pd.merge(data,pclass,left_index=True, right_index=True)
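
Because the dummy columns share the original row index, the same join can be expressed as a column-wise concatenation. This is a minimal sketch on a made-up frame (`toy` and its values are assumptions for illustration only), not a replacement for the cell above.

```python
import pandas as pd

# Toy frame with hypothetical passenger classes.
toy = pd.DataFrame({"Pclass": [3, 1, 3, 1, 2]})

# One indicator column per class; cast to int so the columns are 0/1
# regardless of the dtype get_dummies returns in a given pandas version.
dummies = pd.get_dummies(toy["Pclass"], prefix="Pclass").astype(int)

# axis=1 aligns on the index and appends the indicator columns.
toy = pd.concat([toy, dummies], axis=1)

print(toy.columns.tolist())
```

Note that `Pclass` together with all three indicator columns carries the same information several times over; a model usually needs either the original column or a subset of the indicators, not both.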

In [136]:
print(avg_age_women)
print(avg_age_men)

27.9157088123
30.7266445916


In [137]:
model_lr = LogisticRegression(C=1)

In [138]:
features = data.drop('Survived',axis=1)
target = data.Survived

features.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Pclass_1 | Pclass_2 | Pclass_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | Braund, Mr. Owen Harris | 0 | 22 | 1 | 0 | A/5 21171 | 7.25 | | S | 0 | 0 | 1 |
| 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 | 0 | 0 |
| 2 | 3 | 3 | Heikkinen, Miss. Laina | 1 | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S | 0 | 0 | 1 |
| 3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 35 | 1 | 0 | 113803 | 53.1 | C123 | S | 1 | 0 | 0 |
| 4 | 5 | 3 | Allen, Mr. William Henry | 0 | 35 | 0 | 0 | 373450 | 8.05 | | S | 0 | 0 | 1 |


In [139]:
# .values is an attribute (a NumPy array), not a method, so it takes no parentheses
features.values
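
The notebook stops before the model is fitted. The following is a minimal sketch of the remaining step, under stated assumptions: the data is synthetic (generated below, not the Titanic file), the column names merely mirror the ones used above, and `max_iter=1000` is an illustrative setting. scikit-learn's `LogisticRegression` expects numeric input, so text columns such as `Name` or `Ticket` would have to be excluded from the feature matrix first.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the cleaned frame; every value here is made up.
rng = np.random.RandomState(0)
n = 200
toy = pd.DataFrame({
    "Sex": rng.randint(0, 2, n),        # 0 = male, 1 = female
    "Age": rng.uniform(1, 70, n),
    "Pclass_1": rng.randint(0, 2, n),
})

# Made-up rule: survival probability is 0.2 for men and 0.8 for women.
p_survive = 0.2 + 0.6 * toy["Sex"].values
toy["Survived"] = (rng.uniform(size=n) < p_survive).astype(int)

feature_cols = ["Sex", "Age", "Pclass_1"]   # numeric columns only
X = toy[feature_cols].values                # attribute access, no parentheses
y = toy["Survived"].values

model_lr = LogisticRegression(C=1, max_iter=1000)
model_lr.fit(X, y)

# Pair each coefficient with its feature name for readable output.
coefs = pd.Series(model_lr.coef_[0], index=feature_cols)
print(coefs)
print("intercept:", model_lr.intercept_[0])
```

Each coefficient is a change in log-odds per unit of its feature, so `np.exp(coefs)` would give odds ratios. Coefficients on unscaled features such as age are not directly comparable in size with those on 0/1 indicators.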

**2) Which features are predictive for this logistic regression? Explain your thinking. Do not simply cite model statistics.**

**3) Implement cross-validation for your logistic regression model. Select the number of folds. Explain your choice.**
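
One possible shape for this step is sketched below, assuming a scikit-learn release that provides `sklearn.model_selection`. The data is synthetic, and `cv=5` is an arbitrary illustrative choice rather than a recommendation; with 891 rows, 5 folds would leave roughly 178 passengers in each held-out fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic features and labels (made up); the first column drives the label.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model_lr = LogisticRegression(C=1)

# One accuracy score per fold; the model is refit on each training split.
scores = cross_val_score(model_lr, X, y, cv=5, scoring="accuracy")

print("fold accuracies:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```

Reporting the spread alongside the mean shows how much the estimate depends on which rows land in the held-out fold. More folds give each model more training data but make every fold's score noisier.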

**4) In the hw-assignments directory on the class GitHub repo, there is a file called titanic-test.csv. What does your logistic regression model predict for these previously unseen (i.e. out of sample) passengers?**