# Logistic regression exercise with Titanic data

## Introduction

- Data from Kaggle's Titanic competition: [data](../data/titanic.csv), [data dictionary](https://www.kaggle.com/c/titanic/data)
- **Goal**: Predict survival based on passenger characteristics
- `titanic.csv` is already in our repo, so there is no need to download the data from the Kaggle website

## Read the data into a Pandas dataframe

In [12]:
# Read the data into a pandas DataFrame and display the first rows, using PassengerId as the index column
import pandas as pd
import numpy as np


In [13]:
df = pd.read_csv('../data/titanic.csv', index_col='PassengerId')
df.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


## 1a: Create X and y

Define **Pclass** and **Parch** as the features, and **Survived** as the response.

In [15]:
X = df[['Pclass','Parch']]
y = df['Survived']

## 1b: Split the data into training and testing sets

In [16]:
from sklearn.model_selection import train_test_split

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

## 2: Choose and import the estimator

In [18]:
from sklearn.linear_model import LogisticRegression


## 3: Instantiate into a variable

In [19]:
LR = LogisticRegression()

## 4: Fit a logistic regression model and examine the coefficients

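Confirm that the coefficients make intuitive sense. They are reported in the same order as the columns of `X` (**Pclass**, then **Parch**); a negative coefficient lowers the predicted odds of survival and a positive one raises them.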

In [20]:
LR.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [21]:
LR.coef_

array([[-0.8080444 ,  0.22966061]])
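One way to check whether the coefficients make sense is to convert them to odds ratios. The sketch below copies the two values from the `LR.coef_` output above and exponentiates them; the variable names are illustrative and not part of the exercise.

```python
import numpy as np

# Coefficients copied from the LR.coef_ output above, in the column order of X
feature_names = ["Pclass", "Parch"]
coefficients = np.array([-0.8080444, 0.22966061])

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-unit increase in that feature, holding the other feature fixed
odds_ratios = np.exp(coefficients)

for name, coef, ratio in zip(feature_names, coefficients, odds_ratios):
    direction = "lowers" if coef < 0 else "raises"
    print(f"{name}: coef={coef:+.3f}, odds ratio={ratio:.3f} ({direction} the odds of survival)")
```

An odds ratio below 1 for **Pclass** means that moving from first to second class (or second to third) reduces the modelled odds of survival, which matches intuition. The ratio above 1 for **Parch** suggests that travelling with parents or children is associated with somewhat higher odds.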

## 5: Make predictions on the testing set

In [22]:
# class predictions
y_predict = LR.predict(X_test)
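Class predictions come from thresholding predicted probabilities, which `predict_proba` exposes directly. The following is a minimal sketch on synthetic data (not the Titanic set); `X_demo`, `y_demo`, and `model` are hypothetical names introduced only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for two integer-valued features (NOT the real data)
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 4, size=(200, 2))
y_demo = (X_demo[:, 0] + rng.normal(0, 1, size=200) < 1.5).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)

# One row per sample, one column per class; columns follow model.classes_
proba = model.predict_proba(X_demo[:5])
labels = model.predict(X_demo[:5])

# predict() returns the class with the larger predicted probability
manual = model.classes_[np.argmax(proba, axis=1)]
print(proba)
print(labels, manual)
```

Working with probabilities is useful when a threshold other than 0.5 is appropriate, or when the ranking of passengers matters more than a hard label.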

## 6: Calculate the accuracy using the testing set predictions

In [23]:
from sklearn.metrics import accuracy_score

In [24]:
print(accuracy_score(y_test,y_predict))

0.679104477612


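## Compare your testing accuracy to the null accuracy

Null accuracy is the accuracy you would get by always predicting the most frequent class, here **Survived = 0**.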

In [25]:
df['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64

In [26]:
sum(y_test == 0)/len(y_test)

0.57462686567164178

In [27]:
# Null model: predict the majority class (0) for every passenger in the test set
y_null = np.zeros_like(y_test)
print(accuracy_score(y_test,y_null))

0.574626865672
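The same baseline can be obtained with scikit-learn's `DummyClassifier`, which learns the majority class from the training labels instead of having it hard-coded. This is a sketch on toy labels (not the Titanic data); all `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels standing in for y_train / y_test (NOT the Titanic data)
y_train_demo = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_test_demo = np.array([0, 1, 0, 0, 1])

# DummyClassifier ignores the feature values, so placeholders are enough
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

# "most_frequent" always predicts the most common class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)

null_accuracy = accuracy_score(y_test_demo, baseline.predict(X_test_demo))
print(null_accuracy)  # 3 of the 5 toy test labels are 0, so 0.6
```

Fitting the baseline on the training labels keeps the comparison fair, because the majority class is chosen without looking at the test set.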


In [34]:
from sklearn.metrics import confusion_matrix
# Rows are the actual classes (0, 1); columns are the predicted classes (0, 1)
confusion_matrix(y_test, y_predict)

array([[130,  24],
       [ 62,  52]])
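The confusion matrix shows where the model's errors fall. As a worked example, the sketch below copies the matrix printed above and derives accuracy, sensitivity, and specificity from it; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows = actual class, columns = predicted class)
cm = np.array([[130, 24],
               [62, 52]])

# For a binary problem with labels (0, 1), ravel() yields TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # same value as accuracy_score above
sensitivity = tp / (tp + fn)      # share of actual survivors predicted to survive
specificity = tn / (tn + fp)      # share of actual non-survivors predicted not to survive

print(f"accuracy={accuracy:.3f}, sensitivity={sensitivity:.3f}, specificity={specificity:.3f}")
# accuracy=0.679, sensitivity=0.456, specificity=0.844
```

The model identifies non-survivors far more reliably than survivors. That helps explain why its accuracy is only about ten percentage points above the null accuracy.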

## Bonus problem: Build a K Nearest Neighbor model and compare the accuracy

In [28]:
from sklearn.neighbors import KNeighborsClassifier

In [29]:
KNN = KNeighborsClassifier()

In [30]:
KNN.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [31]:
y_predict2 = KNN.predict(X_test)

In [32]:
print(accuracy_score(y_test,y_predict2))

0.686567164179
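The KNN result above uses the default `n_neighbors=5`, and accuracy can change with that setting. The sketch below loops over a few values on synthetic data from `make_classification` (not the Titanic set). The candidate values and all names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-feature binary problem (NOT the Titanic data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0, random_state=101
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)

# Fit one model per candidate k and record its test accuracy
scores = {}
for k in (1, 5, 15, 25):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = accuracy_score(y_te, knn.predict(X_te))

for k, acc in scores.items():
    print(f"n_neighbors={k}: accuracy={acc:.3f}")
```

Choosing `n_neighbors` by test accuracy, as this sketch does, gives an optimistic estimate. Cross-validation on the training set is the more careful way to choose it.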
