# Purpose

This notebook uses the Kaggle dataset "Titanic - Machine Learning from Disaster"  
to demonstrate logistic regression and a pipeline that handles most of the dataset manipulation.  

The goal isn't necessarily to win the Titanic competition, which isn't possible anyway because there are perfect classifiers out there, but to demonstrate some techniques and beat a naive baseline.

In [1]:
from typing import Tuple

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

Gather the data and do a little preprocessing:  
*  create my own train/test split
*  drop some columns that don't make good 'out of the box' features

In [2]:
train_competition = pd.read_csv('./data/train.csv')
test_competition = pd.read_csv('./data/test.csv')

In [3]:
print(f'Train: {train_competition.shape}')
print(f"Test: {test_competition.shape}")

Train: (891, 12)
Test: (418, 11)


In [4]:
train_competition.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [5]:
# Drop a few columns that wouldn't make good 'out of the box' features
train_x = train_competition.drop(["PassengerId", "Survived", "Name", "Ticket", "Cabin"], axis=1)
train_y = train_competition.loc[:,["Survived"]]

In [6]:
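# Hold out 10% of the labelled data, stratified on the label.
# No random_state is set, so the split (and the scores below) can change from run to run.
train_x, test_x, train_y, test_y = train_test_split(train_x, train_y, test_size=0.1, stratify=train_y)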

In [7]:
print(train_x.shape)
print(train_y.shape)
print(test_x.shape)
print(test_y.shape)

(801, 7)
(801, 1)
(90, 7)
(90, 1)
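
As a rough illustration of what `stratify` does in the split above, the sketch below uses made-up labels (not the Titanic data) and checks that both halves keep the same class ratio. The `random_state` value and the `toy_*` names are assumptions added only to make the example repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 60 negatives (0) and 40 positives (1); not the Titanic data.
toy_y = np.array([0] * 60 + [1] * 40)
toy_x = np.arange(100).reshape(-1, 1)

# random_state is an assumption added here so the example is repeatable.
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, test_size=0.1, stratify=toy_y, random_state=0
)

# Both splits keep the 40% positive rate of the full toy label set.
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, a 10-row holdout could easily end up with a noticeably different share of positives than the training split.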


In [8]:
train_x.dtypes

Pclass        int64
Sex          object
Age         float64
SibSp         int64
Parch         int64
Fare        float64
Embarked     object
dtype: object

# Baseline

One of the most important tasks with ML and supervised learning is to establish a baseline.

A naive baseline doesn't use any input features. For a binary classification problem such as this, it is akin to predicting the majority class in all cases.

Without a naive baseline you cannot be certain that an ML model is actually learning from the input features.

In [9]:
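# The majority class is "did not survive" (Survived == 0),
# so the baseline accuracy is the fraction of non-survivors in the training split.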
baseline = (train_x.shape[0] - train_y.sum())/train_x.shape[0]

In [10]:
print(f'Baseline: {round(baseline.iloc[0], 2)}')

Baseline: 0.62
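
The same kind of majority-class baseline can also be obtained from scikit-learn's `DummyClassifier`. The sketch below is not part of the original notebook; it uses made-up labels (7 negatives, 3 positives) purely to show that the dummy model's accuracy equals the majority-class share.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels, not the Titanic data: 7 of class 0 and 3 of class 1.
toy_y = np.array([0] * 7 + [1] * 3)
# The dummy model ignores feature values, so a constant column is enough.
toy_x = np.zeros((10, 1))

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(toy_x, toy_y)

# Accuracy of always predicting the majority class.
dummy_accuracy = dummy.score(toy_x, toy_y)
print(dummy_accuracy)
```

Fitting the dummy model on `train_x` and `train_y["Survived"]` should, under this reading, reproduce the hand-computed baseline.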


# Pipeline

Take the approach of treating all input features as categorical.

Pclass, Sex, SibSp, Parch, Embarked will be one-hot encoded  
Age and Fare will be bucketed and one-hot encoded
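
To make "bucketed and one-hot encoded" concrete, here is a minimal sketch on a handful of made-up ages. The value `n_bins=4` is an assumption chosen for the toy data; the notebook itself relies on the `KBinsDiscretizer` defaults.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy ages, not taken from the dataset.
toy_ages = np.array([[2], [15], [22], [30], [41], [58], [64], [71]], dtype=float)

# n_bins=4 is an assumption for this toy example.
discretizer = KBinsDiscretizer(n_bins=4, encode="onehot-dense", strategy="quantile")
toy_binned = discretizer.fit_transform(toy_ages)

# One column per bin; each row has a single 1 marking the bin the age fell into.
print(discretizer.bin_edges_[0])
print(toy_binned)
```

Each continuous value is replaced by an indicator for its bin, so the model sees it the same way as any other one-hot encoded category.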

## Preprocessing

Build a series of transformers that preprocess the data and become part of the training pipeline.

In [11]:
train_x.dtypes

Pclass        int64
Sex          object
Age         float64
SibSp         int64
Parch         int64
Fare        float64
Embarked     object
dtype: object

In [12]:
# Define the features
features_one_hot = ["Pclass", "Sex", "SibSp", "Parch", "Embarked"]
features_binned = ["Age", "Fare"]

In [13]:
train_x.shape

(801, 7)

In [14]:
# Define the preprocessing as a single ColumnTransformer that groups the transformers
preprocessor = ColumnTransformer(
    transformers=[
        # One-hot encode the categorical columns
        ('one_hot',
         OneHotEncoder(drop="if_binary",
                       handle_unknown="infrequent_if_exist",
                       sparse_output=False),
         features_one_hot),
        # Impute missing values with the mean, then bin and one-hot encode
        ('binned',
         Pipeline(steps=[
             ('imputer', SimpleImputer(strategy='mean')),
             ('discretizer', KBinsDiscretizer(encode='onehot-dense'))
         ]),
         features_binned)
    ]
)

In [15]:
# Create the training pipeline.
# The model is simple enough in this case that it doesn't need to be its own variable.
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

In [16]:
# fit the model by fitting the whole pipeline
pipeline.fit(train_x, train_y["Survived"])

In [17]:
# Score the fitted model on the training data.
# This beats our baseline established above.
round(pipeline.score(train_x, train_y["Survived"]),2)

0.81

In [18]:
# Score the fitted model on the holdout test set.
round(pipeline.score(test_x, test_y["Survived"]),2)



0.8
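
For a classifier pipeline like this one, `score` reports mean accuracy, which is why it can be compared directly with the baseline. The sketch below checks that on synthetic data; `StandardScaler` is only a stand-in preprocessing step and all `toy_*` names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, linearly separable-ish data; not the Titanic data.
rng = np.random.default_rng(0)
toy_x = rng.normal(size=(50, 2))
toy_y = (toy_x[:, 0] + toy_x[:, 1] > 0).astype(int)

toy_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),  # stand-in for the notebook's preprocessor
    ("classifier", LogisticRegression()),
])
toy_pipeline.fit(toy_x, toy_y)

# Pipeline.score delegates to the final estimator, whose score is mean accuracy.
via_score = toy_pipeline.score(toy_x, toy_y)
via_metric = accuracy_score(toy_y, toy_pipeline.predict(toy_x))
print(via_score, via_metric)
```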

In [19]:
# For the Kaggle submission, predict on the competition test set and create a submission file.
# The warning is OK: some categories exist only in the competition test set.
test_competition_predictions = pipeline.predict(X=test_competition)



In [20]:
test_predictions_df = pd.DataFrame({"PassengerId": test_competition["PassengerId"],
                                    "Survived": test_competition_predictions})

In [21]:
test_predictions_df.to_csv('./test_competition_predictions_jpj.csv', index=False)

The competition test predictions score 77.272%.  
This is good enough for 10,185th place on the leaderboard!