# DAT 19: Homework 2 Assignment

## Instructions

For Homework 2, we will build on the work we did with the Titanic dataset in Homework 1. In this assignment, we will build a logistic regression model to predict passenger survival.

Please do all your analysis to answer the questions below in this Jupyter notebook. Show your work.

**Please submit your completed notebook by 6:00PM on Monday, January 11.**

## About the Data

```
VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)
 If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored.  The following are the definitions used
for sibsp and parch.

Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent:   Mother or Father of Passenger Aboard Titanic
Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws.  Some children travelled
only with a nanny, therefore parch=0 for them.  As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.
```

## Homework Assignment

**1) Create a logistic regression model on the Titanic dataset to predict the survival of passengers. Show your model output. Include coefficient values.**

In [1]:
import numpy as np
import pandas as pd

from sklearn import datasets
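# sklearn.cross_validation was deprecated and later removed; cross_val_score now lives in model_selection
from sklearn.model_selection import cross_val_score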
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

from sklearn.preprocessing import Imputer


### Clean Data

In [2]:
train = pd.read_csv("titanic.csv")
train = train[np.isfinite(train['Age'])]
train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S |


### Map data to numerical values

In [3]:
gender_map = {'male': 0, 'female': 1}
train['Sex'] = train['Sex'].map(gender_map)

embarked_map = {'C': 0, 'Q': 1, 'S': 2}
train['Embarked'] = train['Embarked'].map(embarked_map)

### Define target and features

In [4]:
#features = train.drop(['PassengerId', 'Name', 'Survived', 'Ticket', 'Cabin'],axis=1)
features = train.drop(['PassengerId', 'Name', 'Survived', 'Ticket', 'Cabin', 'Age', 'Fare', 'SibSp', 'Parch', 'Embarked'],axis=1)
target = train.Survived
features.head()

| | Pclass | Sex |
|---|---|---|
| 0 | 3 | 0 |
| 1 | 1 | 1 |
| 2 | 3 | 1 |
| 3 | 1 | 1 |
| 4 | 3 | 0 |


### Normalize features

In [5]:
features.describe()

| | Pclass | Sex |
|---|---|---|
| count | 714.0 | 714.0 |
| mean | 2.236695 | 0.365546 |
| std | 0.83825 | 0.481921 |
| min | 1.0 | 0.0 |
| 25% | 1.0 | 0.0 |
| 50% | 2.0 | 0.0 |
| 75% | 3.0 | 1.0 |
| max | 3.0 | 1.0 |


In [6]:
imp=Imputer(missing_values='NaN',strategy='mean',axis=0)
new_features = imp.fit_transform(features)

new_features

array([[ 3.,  0.],
       [ 1.,  1.],
       [ 3.,  1.],
       ..., 
       [ 1.,  1.],
       [ 1.,  0.],
       [ 3.,  0.]])

In [7]:
scaler = StandardScaler()
features_norm = scaler.fit_transform(new_features)
pd.DataFrame(features_norm).describe()

| | 0 | 1 |
|---|---|---|
| count | 714.0 | 714.0 |
| mean | -1.043361e-16 | 2.3635e-17 |
| std | 1.000701 | 1.000701 |
| min | -1.476364 | -0.7590513 |
| 25% | -1.476364 | -0.7590513 |
| 50% | -0.2825656 | -0.7590513 |
| 75% | 0.9112324 | 1.317434 |
| max | 0.9112324 | 1.317434 |
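
As a sanity check on the scaled values, the arithmetic behind the first column (Pclass) can be reproduced by hand. This is a minimal sketch using only numbers copied from the `features.describe()` output; the variable names are my own. It relies on the fact that `StandardScaler` divides by the population standard deviation (ddof=0), while pandas' `describe()` reports the sample standard deviation (ddof=1), which is why the scaled `std` row reads 1.000701 rather than exactly 1.

```python
import math

# Values copied from the features.describe() output for Pclass.
n = 714.0
mean = 2.236695
sample_std = 0.83825  # pandas .describe() uses ddof=1

# StandardScaler divides by the population standard deviation (ddof=0).
population_std = sample_std * math.sqrt((n - 1) / n)

# z-score of a first-class passenger (Pclass == 1), i.e. the "min" row.
z_first_class = (1 - mean) / population_std

# Sample std that pandas reports for a column scaled to population std 1.
scaled_sample_std = math.sqrt(n / (n - 1))

print(round(z_first_class, 4))
print(round(scaled_sample_std, 6))
```

Both results agree with the `min` and `std` rows of the scaled table to the precision shown there.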


### Run model with cross validation

In [11]:
model_lr = LogisticRegression(C=1)
model_lr

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0)

In [12]:
cross_val_score(model_lr,features_norm,target,cv=10).mean()

0.77998826291079815

### Find feature coefficients 

In [13]:
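# Fit on the unscaled features so the coefficients are in the original units
# (Pclass as 1/2/3, Sex as 0/1), in the column order of `features`.
model_lr = LogisticRegression(C=1).fit(new_features, target)
coefficients = model_lr.coef_.ravel()
print(coefficients)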

[-0.94458556  2.4713532 ]
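
One way to read these coefficients is as odds ratios: exponentiating a logistic regression coefficient gives the multiplicative change in the odds of survival for a one-unit increase in that feature, with the other feature held fixed. The sketch below does only that arithmetic on the two values printed above; the dictionary and its keys are my own labels, assuming the column order Pclass, Sex. Because the model uses an L2 penalty, the coefficients are somewhat shrunk, so these are rough figures rather than exact effect sizes.

```python
import math

# Coefficients copied from the model output above, in the order of the
# feature columns: Pclass, Sex (male = 0, female = 1).
coefficients = {"Pclass": -0.94458556, "Sex": 2.4713532}

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-unit increase in the feature, holding the other feature fixed.
odds_ratios = {name: math.exp(value) for name, value in coefficients.items()}

for name, ratio in sorted(odds_ratios.items()):
    print("%s: odds ratio %.2f" % (name, ratio))
```

Under this fitted model, each step down in class (for example 1st to 2nd) multiplies the odds of survival by roughly 0.39, and being female multiplies them by roughly 11.8.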


**2) Which features are predictive for this logistic regression? Explain your thinking. Do not simply cite model statistics.**

I initially included 'Age', 'Fare', 'SibSp', 'Parch', and 'Embarked' in my first model and got a cross-validation score of about 0.77, essentially the same as the model with only 'Pclass' and 'Sex'. I excluded the other features when I saw that their coefficients were close to 0, indicating that they are less predictive than 'Pclass' and 'Sex', whose coefficients were about -1 and 2.5 respectively. The signs also make sense: a higher class number (lower socio-economic status) lowers the odds of survival, while being female (Sex = 1) raises them.

**3) Implement cross-validation for your logistic regression model. Select the number of folds. Explain your choice.**

In [156]:
# 10 folds should be sufficient: each model trains on ~90% of the 714 rows and is validated on the remaining ~71
cross_val_score(model_lr,features_norm,target,cv=10).mean()

0.77998826291079815
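
To make the fold count concrete, here is a minimal sketch of how 714 rows divide into 10 folds. The helper `k_fold_indices` is hypothetical and written only for illustration; it does plain contiguous k-fold splitting. It is not what `cross_val_score` does internally: for a classifier with an integer `cv`, scikit-learn uses stratified folds that preserve the class balance, so the actual splits differ, although the fold sizes are about the same.

```python
def k_fold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous (train, test) index pairs."""
    base, extra = divmod(n_samples, n_folds)
    folds = []
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds take one additional row each.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


folds = k_fold_indices(714, 10)
test_sizes = [len(test) for _, test in folds]
print(test_sizes)
```

Every row is used for validation exactly once, and each of the 10 models is trained on about 90% of the data.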

**4) In the hw-assignments directory on the class GitHub repo, there is a file called titanic-test.csv. What does your logistic regression model predict for these previously unseen (i.e. out of sample) passengers?**

In [163]:
test = pd.read_csv("titanic-test.csv")
gender_map = {'male': 0, 'female': 1}
test['Sex'] = test['Sex'].map(gender_map)

In [165]:
test_features = test.drop(['PassengerId', 'Name', 'Ticket', 'Cabin', 'Age', 'Fare', 'SibSp', 'Parch', 'Embarked'],axis=1)


In [166]:
predicted = model_lr.predict(test_features)
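
For intuition about what `predict` computes for each row, the decision rule of a binary logistic regression can be sketched by hand: a linear score is passed through the sigmoid, and the passenger is labelled a survivor when the probability exceeds 0.5. The coefficients below are the ones printed earlier, but the intercept is a HYPOTHETICAL placeholder, since this notebook never prints `model_lr.intercept_`; the function name is also my own. The probabilities are therefore illustrative only and are not the fitted model's outputs.

```python
import math


def predict_survival(pclass, sex, coefficients, intercept):
    """Return (probability, label) for one passenger under a logistic model."""
    score = intercept + coefficients[0] * pclass + coefficients[1] * sex
    probability = 1.0 / (1.0 + math.exp(-score))
    return probability, int(probability > 0.5)


coefficients = [-0.94458556, 2.4713532]  # Pclass, Sex: from the model output above
intercept = 0.0  # HYPOTHETICAL placeholder, not the fitted model_lr.intercept_

p_female_first, label_female_first = predict_survival(1, 1, coefficients, intercept)
p_male_third, label_male_third = predict_survival(3, 0, coefficients, intercept)

print(p_female_first, label_female_first)
print(p_male_third, label_male_third)
```

Whatever the true intercept is, the ordering holds: a first-class woman always receives a higher predicted probability than a third-class man, because both coefficients push in that direction.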

In [168]:
test['Prediction'] = predicted
test.describe()

| | PassengerId | Pclass | Sex | Age | SibSp | Parch | Fare | Prediction |
|---|---|---|---|---|---|---|---|---|
| count | 418.0 | 418.0 | 418.0 | 332.0 | 418.0 | 418.0 | 417.0 | 418.0 |
| mean | 1100.5 | 2.26555 | 0.363636 | 30.27259 | 0.447368 | 0.392344 | 35.627188 | 0.363636 |
| std | 120.810458 | 0.841838 | 0.481622 | 14.181209 | 0.89676 | 0.981429 | 55.907576 | 0.481622 |
| min | 892.0 | 1.0 | 0.0 | 0.17 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 996.25 | 1.0 | 0.0 | 21.0 | 0.0 | 0.0 | 7.8958 | 0.0 |
| 50% | 1100.5 | 3.0 | 0.0 | 27.0 | 0.0 | 0.0 | 14.4542 | 0.0 |
| 75% | 1204.75 | 3.0 | 1.0 | 39.0 | 1.0 | 0.0 | 31.5 | 1.0 |
| max | 1309.0 | 3.0 | 1.0 | 76.0 | 8.0 | 9.0 | 512.3292 | 1.0 |


In [173]:
survivors = test['Prediction'].sum()
survivors

152

In [174]:
survival_rate = float(survivors) / len(test)
survival_rate

0.36363636363636365

My logistic regression model predicts that **152** of the **418** passengers in the new dataset survive, a survival rate of **36%**.