# Logistic regression exercise with Titanic data

## Introduction

- Data from Kaggle's Titanic competition: [data](https://github.com/justmarkham/DAT8/blob/master/data/titanic.csv), [data dictionary](https://www.kaggle.com/c/titanic/data)
- **Goal**: Predict survival based on passenger characteristics
- `titanic.csv` is already in our repo, so there is no need to download the data from the Kaggle website

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Import the other modules you might need

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix



## Step 1: Read the data into Pandas

In [None]:
#url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/titanic.csv'

url = '../../data/titanic.csv'
titanic_df = pd.read_csv(url, index_col='PassengerId')
titanic_df.head()

## Step 2: Create X and y

Define **Pclass** and **Parch** as the features, and **Survived** as the response.

In [None]:
# set up data
X = titanic_df[['Pclass', 'Parch']]
y = titanic_df['Survived']

## Step 3: Split the data into training and testing sets

In [None]:
# split into training (75%) and testing (25%) sets
# random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=27)

## Step 4: Fit a logistic regression model and examine the coefficients

Confirm that the coefficients make intuitive sense.

In [None]:
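# build a logistic regression model and fit it to the training data
lr = LogisticRegression()
lr.fit(X_train, y_train)

# examine the coefficients (change in log-odds of survival per unit increase in each feature)
print(dict(zip(X.columns, lr.coef_[0])))
print(f'Intercept: {lr.intercept_[0]}')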
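
One way to check whether the coefficients make intuitive sense is to convert them from log-odds to odds ratios. The sketch below does this in plain Python. The coefficient and intercept values are made-up placeholders, not output from this model (in the notebook they would come from `lr.coef_[0]` and `lr.intercept_[0]`), and the helper names `odds_ratios` and `predicted_probability` are hypothetical.

```python
import math

# Hypothetical values for illustration only; in the notebook, read the real
# ones from lr.coef_[0] and lr.intercept_[0] after fitting.
coefs = {'Pclass': -0.9, 'Parch': 0.3}
intercept = 1.2


def odds_ratios(coefficients):
    """Convert log-odds coefficients to odds ratios (exp of each coefficient)."""
    return {name: math.exp(value) for name, value in coefficients.items()}


def predicted_probability(features, coefficients, intercept):
    """Apply the logistic function to the linear combination of the features."""
    log_odds = intercept + sum(
        coefficients[name] * features[name] for name in coefficients
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


ratios = odds_ratios(coefs)
for name, ratio in ratios.items():
    print(f'{name}: odds ratio {ratio:.2f}')

# Compare a first-class and a third-class passenger travelling without
# parents or children.
p_first = predicted_probability({'Pclass': 1, 'Parch': 0}, coefs, intercept)
p_third = predicted_probability({'Pclass': 3, 'Parch': 0}, coefs, intercept)
print(f'P(survive | 1st class): {p_first:.2f}')
print(f'P(survive | 3rd class): {p_third:.2f}')
```

A negative coefficient gives an odds ratio below 1. With these placeholder values, a larger `Pclass` number (a lower passenger class) lowers the predicted odds of survival.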


## Step 5: Make predictions on the testing set and calculate the accuracy

In [None]:
# class predictions (not predicted probabilities)
y_pred = lr.predict(X_test)

In [None]:
# calculate classification accuracy

acc = lr.score(X_test,y_test)

print(f'Accuracy: {acc}')
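
For a classifier, `lr.score` returns the mean accuracy, which is the fraction of predictions that match the true labels. The plain-Python sketch below computes that fraction by hand. The labels are made up for illustration and `accuracy` is a hypothetical helper.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Made-up labels for illustration only (not Titanic results).
toy_true = [0, 1, 1, 0, 1, 0, 0, 1]
toy_pred = [0, 1, 0, 0, 1, 1, 0, 1]

toy_acc = accuracy(toy_true, toy_pred)
print(f'Accuracy: {toy_acc}')
```

In the notebook, `np.mean(y_pred == y_test)` or `sklearn.metrics.accuracy_score(y_test, y_pred)` should give the same value as `lr.score(X_test, y_test)`.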



## Step 6: Compare your testing accuracy to the null accuracy

In [None]:
# find the most frequent class in the test data
# (value_counts works regardless of the number of classes)
y_test.value_counts()

In [None]:
# this only works for binary classification problems coded as 0/1
# in which 0 is the most frequent class
y_null = np.zeros_like(y_test)

print(f'Null Accuracy: {np.mean(y_test == y_null)}')

In [None]:
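# alternative approach (binary 0/1 labels only): the mean of y_test is the
# fraction of 1s, so null accuracy is the larger of the two class fractions
null_acc = max(y_test.mean(), 1 - y_test.mean())
print(f'Null Accuracy: {null_acc}')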
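
Null accuracy is the accuracy obtained by always predicting the most frequent class. The sketch below is a plain-Python version that works for any number of classes. The labels are made up for illustration and `null_accuracy` is a hypothetical helper.

```python
from collections import Counter


def null_accuracy(y_true):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(y_true)
    label, count = counts.most_common(1)[0]
    return label, count / len(y_true)


# Made-up labels for illustration only (not Titanic results).
toy_labels = [0, 0, 1, 0, 1, 0, 0, 1, 0, 1]

majority_label, null_acc = null_accuracy(toy_labels)
print(f'Most frequent class: {majority_label}')
print(f'Null Accuracy: {null_acc}')
```

A model is only adding value if its testing accuracy is higher than this baseline. In the notebook, `y_test.value_counts(normalize=True).max()` should give the same number.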

# Confusion matrix of Titanic predictions

In [None]:
# print confusion matrix - remember the order of elements in the confusion matrix

def_confusion = np.array(
        [['tn', 'fp'],
         ['fn', 'tp']])

print(f'Confusion Matrix: (defined) \n {def_confusion}')
print()

cm = confusion_matrix(y_test, y_pred)

print(f'Confusion Matrix: \n {cm}')

In [None]:
# unravel the confusion matrix into its four cells (row-major order: tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()


In [None]:
# calculate the sensitivity
sensitivity = tp/(tp + fn)
print(f'Sensitivity: {sensitivity}')

In [None]:
# calculate the specificity
specificity = tn/(tn + fp)
print(f'Specificity: {specificity}')
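
To make the four cells concrete, the sketch below counts them by hand for binary 0/1 labels and derives both metrics from the counts. It is plain Python with made-up labels, and `confusion_counts` is a hypothetical helper that returns the cells in the same order as the unravelled matrix above.

```python
def confusion_counts(y_true, y_pred):
    """Count (tn, fp, fn, tp) for binary labels coded as 0/1."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1  # survived, predicted to survive
        elif t == 1 and p == 0:
            fn += 1  # survived, predicted not to survive
        elif t == 0 and p == 1:
            fp += 1  # did not survive, predicted to survive
        else:
            tn += 1  # did not survive, predicted not to survive
    return tn, fp, fn, tp


# Made-up labels for illustration only (not Titanic results).
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

toy_tn, toy_fp, toy_fn, toy_tp = confusion_counts(toy_true, toy_pred)

# Sensitivity: share of actual positives that were predicted positive.
toy_sensitivity = toy_tp / (toy_tp + toy_fn)
# Specificity: share of actual negatives that were predicted negative.
toy_specificity = toy_tn / (toy_tn + toy_fp)

print(f'tn={toy_tn}, fp={toy_fp}, fn={toy_fn}, tp={toy_tp}')
print(f'Sensitivity: {toy_sensitivity}')
print(f'Specificity: {toy_specificity}')
```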


In [None]:
# store the predicted probabilities - use y_pred_prob as your variable
y_pred_prob = lr.predict_proba(X_test)[:,1]


In [None]:
# histogram of predicted probabilities
%matplotlib inline
import matplotlib.pyplot as plt
plt.hist(y_pred_prob)
plt.xlim(0, 1)
plt.xlabel('Predicted probability of survival')
plt.ylabel('Frequency')

In [None]:
# increase sensitivity by lowering the threshold for predicting survival from 0.5 to 0.3
# (cast to int so the predictions use the same 0/1 coding as y_test)
y_pred_sens = (y_pred_prob > 0.3).astype(int)

In [None]:
# old confusion matrix
cm

In [None]:
# new confusion matrix
cm_new = confusion_matrix(y_test, y_pred_sens)
cm_new

In [None]:
# new sensitivity (higher than before)
tns, fps, fns, tps  = cm_new.ravel()
sens = tps/(tps + fns)

print(f'Original Sensitivity: {sensitivity}')
print(f'     New Sensitivity: {sens}')

In [None]:
# new specificity (lower than before)
specs = tns / (tns + fps)
print(f'Original Specificity: {specificity}')
print(f'     New Specificity: {specs}')

**Question:** What did we prioritize when we lowered the threshold for predicting survival? That is, which confusion matrix element did we optimize?

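**Answer:** We prioritized sensitivity. Lowering the threshold predicts survival for more passengers, which reduces the number of False Negatives (passengers who survived but were predicted not to survive). The cost is more False Positives (passengers predicted to survive who did not), so specificity drops.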
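
To see which confusion matrix cells move when the threshold drops, the sketch below compares thresholds of 0.5 and 0.3 on a handful of made-up probabilities. It is plain Python, the data are illustrative only, and `rates_at_threshold` is a hypothetical helper that assumes both classes appear in the labels.

```python
def rates_at_threshold(y_true, y_prob, threshold):
    """Return (sensitivity, specificity) when predicting 1 for prob > threshold.

    Assumes binary 0/1 labels with at least one example of each class.
    """
    tp = fn = tn = fp = 0
    for t, prob in zip(y_true, y_prob):
        pred = 1 if prob > threshold else 0
        if t == 1 and pred == 1:
            tp += 1
        elif t == 1 and pred == 0:
            fn += 1
        elif t == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Made-up labels and predicted probabilities (not Titanic results).
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_prob = [0.9, 0.6, 0.4, 0.2, 0.7, 0.35, 0.25, 0.1]

sens_default, spec_default = rates_at_threshold(toy_true, toy_prob, 0.5)
sens_low, spec_low = rates_at_threshold(toy_true, toy_prob, 0.3)

print(f'Threshold 0.5: sensitivity={sens_default}, specificity={spec_default}')
print(f'Threshold 0.3: sensitivity={sens_low}, specificity={spec_low}')
```

On this toy data the lower threshold turns one false negative into a true positive and one true negative into a false positive, so sensitivity rises while specificity falls.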