# Modeling Walkthrough

In [None]:
import pandas as pd
import numpy as np

import seaborn as sns

from sklearn.model_selection import train_test_split
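try:
    from sklearn.metrics import plot_confusion_matrix
except ImportError:
    # plot_confusion_matrix was removed in scikit-learn 1.2;
    # fall back to a thin wrapper around ConfusionMatrixDisplay
    from sklearn.metrics import ConfusionMatrixDisplay

    def plot_confusion_matrix(estimator, X, y, **kwargs):
        return ConfusionMatrixDisplay.from_estimator(estimator, X, y, **kwargs)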
from sklearn.metrics import confusion_matrix

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

## Modeling Steps

1. Build a model based on the [Titanic dataset](https://www.kaggle.com/c/titanic/data) that predicts whether a given person survived or not
2. Evaluate the performance of the model
3. Make changes in an attempt to improve the model
4. Demonstrate whether an improvement was made

## The Data

This dataset has the following columns:

| Variable | Definition | Key |
| -------- | ---------- | --- |
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

## Initial Data Understanding and Preparation

Load the file, separate it into `X` features and a `y` target, and split both into train and test sets.

In [None]:
df = pd.read_csv("titanic.csv")

In [None]:
df.head()

In [None]:
df.describe()

Age is missing for a sizable share of rows (177 of 891, about 1 in 5, in the standard Kaggle training file). For now, let's exclude it, along with the non-numeric columns and `PassengerId`, which looks like an artifact of the dataset rather than a real feature.
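
One way to check how much of each column is missing is to take the mean of `isna()` per column. The sketch below uses a tiny made-up frame (`toy_df` is hypothetical, not the Titanic file); on the real data the same call would be made on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real `df`
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05],
})

# Fraction of missing values per column
missing_share = toy_df.isna().mean()
print(missing_share)
```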

In [None]:
columns_to_use = ["Survived", "Pclass", "SibSp", "Parch", "Fare"]

In [None]:
df["Survived"].value_counts()

Not a huge class imbalance, but the classes are not evenly sized either.
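
Raw counts can be turned into proportions with `value_counts(normalize=True)`, which makes the degree of imbalance easier to read. A minimal sketch on made-up labels (`toy_survived` is hypothetical, not the real column):

```python
import pandas as pd

# Hypothetical labels standing in for df["Survived"]
toy_survived = pd.Series([0, 0, 0, 1, 1, 0, 1, 0, 0, 1])

# Share of rows in each class instead of raw counts
class_share = toy_survived.value_counts(normalize=True)
print(class_share)
```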

In [None]:
sns.pairplot(df[columns_to_use])

In [None]:
df_to_use = df[columns_to_use]
X = df_to_use.drop("Survived", axis=1)
y = df_to_use["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2019)
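
The split above relies on the `train_test_split` defaults. As a rough illustration on made-up arrays (`X_toy` and `y_toy` are hypothetical), the sketch below shows that a quarter of the rows are held out when no `test_size` is given, and that reusing the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy arrays standing in for X and y
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100) % 2

# With no test_size given, scikit-learn holds out 25% of the rows
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=2019)
print(X_tr.shape, X_te.shape)

# The same random_state reproduces the same split
X_tr2, X_te2, _, _ = train_test_split(X_toy, y_toy, random_state=2019)
print(np.array_equal(X_te, X_te2))
```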

## Modeling

Let's start with a completely "dummy" model that always predicts the majority class.

In [None]:
dummy_model = DummyClassifier(strategy="most_frequent")

Fit the model on our training data.

In [None]:
dummy_model.fit(X_train, y_train)

We should expect all predictions to be the same.

In [None]:
# just grabbing the first 50 to save space
dummy_model.predict(X_train)[:50]

## Model Evaluation

In [None]:
dummy_model.score(X_train, y_train)

So the mean accuracy on the training data is about 60% if we always guess the majority class.
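
This baseline score is simply the share of the majority class in the labels. A small sketch on made-up data (`X_toy`, `y_toy`, and `toy_dummy` are hypothetical names) makes the connection explicit:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy data: the dummy model ignores the feature values
X_toy = np.zeros((10, 2))
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

toy_dummy = DummyClassifier(strategy="most_frequent")
toy_dummy.fit(X_toy, y_toy)

# Share of the most frequent class in the labels
majority_share = np.bincount(y_toy).max() / len(y_toy)

# Both numbers should agree
toy_score = toy_dummy.score(X_toy, y_toy)
print(toy_score, majority_share)
```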

In [None]:
plot_confusion_matrix(dummy_model, X_train, y_train)

A pretty lopsided confusion matrix!

## Modeling, Again

Let's use a logistic regression and compare its performance.

In [None]:
logreg_model = LogisticRegression(random_state=2019)

In [None]:
logreg_model.fit(X_train, y_train)

Look at the predictions

In [None]:
logreg_model.predict(X_train)[:50]

The predictions are a mixture of 1s and 0s this time.

## Model Evaluation, Again

In [None]:
logreg_model.score(X_train, y_train)

So the mean accuracy on the training data rises to about 70% once the model actually uses information from the features instead of always guessing the majority class.

In [None]:
plot_confusion_matrix(logreg_model, X_train, y_train)

In [None]:
confusion_matrix(y_train, logreg_model.predict(X_train))

So, in general, we rarely label "not survived" passengers as "survived", but we get "survived" passengers right only about half of the time.
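
The "right only about half of the time" reading corresponds to per-class recall, which can be computed from the rows of the confusion matrix. The counts below are invented for illustration and are not the notebook's actual output; only the layout (rows are true classes, columns are predicted classes) follows `confusion_matrix`.

```python
import numpy as np

# Hypothetical counts, laid out like sklearn's confusion_matrix:
# rows are true classes, columns are predicted classes
cm = np.array([
    [360, 50],   # true 0 (not survived): predicted 0, predicted 1
    [130, 128],  # true 1 (survived):     predicted 0, predicted 1
])

# Recall per class: correct predictions divided by the row total
recall_per_class = cm.diagonal() / cm.sum(axis=1)

# Overall accuracy: all correct predictions divided by all predictions
accuracy = cm.diagonal().sum() / cm.sum()
print(recall_per_class, accuracy)
```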

## Data Understanding and Preparation, Again

Maybe there is some useful information in the features we are not using yet.

In [None]:
df.columns

In [None]:
df["Name"]

Maybe we could do some parsing to make more sense of this, but that seems complicated.

In [None]:
df["Sex"]

This one is of type "object" but looks like it could be one-hot encoded.

In [None]:
df["Sex"].value_counts()

In [None]:
sns.catplot(x="Sex", y="Survived", data=df, kind="bar")

Looks like there is a meaningful difference in survival rates, so let's add it to the model

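There are only two categories, so a `LabelEncoder` is enough here: no new columns are needed, we just replace the strings with numbers. (scikit-learn documents `LabelEncoder` as an encoder for target labels; `OrdinalEncoder` is the feature-oriented counterpart, but for a single binary column the resulting codes are the same.)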

In [None]:
label_encoder = LabelEncoder()

In [None]:
sex_labels = label_encoder.fit_transform(df["Sex"])
sex_labels[:50]

In [None]:
label_encoder.classes_

So this tells us that "female" is encoded as 0 and "male" is encoded as 1.
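
The order of `classes_` is what determines the codes: the classes are sorted, and each one's position is its integer label. A minimal sketch on a made-up column (`toy_sex` and `toy_encoder` are hypothetical names):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column standing in for df["Sex"]
toy_sex = pd.Series(["male", "female", "female", "male"])

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_sex)

# classes_ is sorted, and each class's position is its integer code
mapping = {label: code for code, label in enumerate(toy_encoder.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the codes
recovered = toy_encoder.inverse_transform(toy_codes)
print(recovered)
```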

In [None]:
df["sex_encoded"] = sex_labels

In [None]:
columns_to_use.append("sex_encoded")

In [None]:
df_to_use = df[columns_to_use]
X = df_to_use.drop("Survived", axis=1)
y = df_to_use["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2019)

## Modeling, Again

Let's see how the logistic regression does now that this new feature has been added.

In [None]:
second_logreg_model = LogisticRegression(random_state=2019)

In [None]:
second_logreg_model.fit(X_train, y_train)

In [None]:
second_logreg_model.score(X_train, y_train)

In [None]:
plot_confusion_matrix(second_logreg_model, X_train, y_train)

In [None]:
confusion_matrix(y_train, second_logreg_model.predict(X_train))

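So we have improved the mean training accuracy from about 70% to about 79%, and we are doing a better job on both the "survived" and "not survived" classes. Note that all of these scores are computed on the training data; `X_test` and `y_test` have not been used yet.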
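
To check whether a gain like this holds on unseen rows, one option is to score each model on the held-out split as well. The sketch below does this on synthetic data, not the Titanic file: `noise_feature`, `informative_feature`, and the 80% label rule are all assumptions made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data, not the Titanic file
rng = np.random.default_rng(0)
n = 400
noise_feature = rng.normal(size=n)
informative_feature = rng.integers(0, 2, size=n)
# The label follows the informative feature about 80% of the time
flip = rng.random(n) < 0.2
y_toy = np.where(flip, 1 - informative_feature, informative_feature)

X_small = noise_feature.reshape(-1, 1)
X_large = np.column_stack([noise_feature, informative_feature])

scores = {}
for name, X_toy in [("without feature", X_small), ("with feature", X_large)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=2019)
    model = LogisticRegression(random_state=2019).fit(X_tr, y_tr)
    # Score on both splits to see whether the gain holds on unseen rows
    scores[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

print(scores)
```

On the real data, the equivalent step would be calling `.score(X_test, y_test)` on each fitted model, keeping in mind that the two models were fit on different column sets and need matching test features.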