First, let's import a few common modules and make sure Matplotlib plots figures inline. We also check that Python 3.5 or later is installed (Python 2.x may work, but it is deprecated, so we strongly recommend Python 3), as well as Scikit-Learn ≥0.20.

In [1]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

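# Scikit-Learn ≥0.20 is required
import sklearn
# compare the numeric (major, minor) parts: a plain string comparison would rank "0.9" above "0.20"
assert tuple(int(part) for part in sklearn.__version__.split(".")[:2]) >= (0, 20)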

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# ignore convergence warning for now
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning,
                            module="sklearn")


# Tackle the Titanic dataset

The goal is to predict whether or not a passenger survived based on attributes such as their age, sex, passenger class, where they embarked and so on.

First, let's load the data:

In [2]:
import pandas as pd
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

The data is already split into a training set and a test set. However, the test data does *not* contain the labels: your goal is to train the best model you can using the training data, then make your predictions on the test data and upload them to WTClass to see your score.

Let's take a peek at the top few rows of the training set:

In [3]:
train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,1,"Partner, Mr. Austen",male,45.5,0,0,113043,28.5,C124,S
1,2,0,2,"Berriman, Mr. William John",male,23.0,0,0,28425,13.0,,S
2,3,0,3,"Tikkanen, Mr. Juho",male,32.0,0,0,STON/O 2. 3101293,7.925,,S
3,4,0,3,"Hansen, Mr. Henrik Juul",male,26.0,1,0,350025,7.8542,,S
4,5,0,3,"Andersson, Miss. Ebba Iris Alfrida",female,6.0,4,2,347082,31.275,,S


The attributes have the following meaning:
* **Survived**: the target; 0 means the passenger did not survive, while 1 means the passenger survived.
* **Pclass**: passenger class.
* **Name**, **Sex**, **Age**: self-explanatory
* **SibSp**: how many siblings & spouses of the passenger aboard the Titanic.
* **Parch**: how many children & parents of the passenger aboard the Titanic.
* **Ticket**: ticket id
* **Fare**: price paid (in pounds)
* **Cabin**: passenger's cabin number
* **Embarked**: port where the passenger boarded the Titanic

Let's get more info to see how much data is missing:

In [4]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 712 entries, 0 to 711
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  712 non-null    int64  
 1   Survived     712 non-null    int64  
 2   Pclass       712 non-null    int64  
 3   Name         712 non-null    object 
 4   Sex          712 non-null    object 
 5   Age          572 non-null    float64
 6   SibSp        712 non-null    int64  
 7   Parch        712 non-null    int64  
 8   Ticket       712 non-null    object 
 9   Fare         712 non-null    float64
 10  Cabin        159 non-null    object 
 11  Embarked     710 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 66.9+ KB


Okay, the **Age**, **Cabin** and **Embarked** attributes are sometimes null (fewer than 712 non-null values), especially **Cabin** (about 78% are null). We will ignore **Cabin** for now and focus on the rest. The **Age** attribute has about 20% null values, so we will need to decide what to do with them. Replacing null values with the median age seems reasonable.
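
The null percentages quoted above follow directly from the non-null counts in the `info()` output. As a small worked check (the counts are copied by hand from that output; the variable names are only illustrative):

```python
# Counts copied from the train_data.info() output above
n_rows = 712
non_null = {"Age": 572, "Cabin": 159, "Embarked": 710}

# Fraction of missing values per column
null_fraction = {col: 1 - count / n_rows for col, count in non_null.items()}

for col, frac in null_fraction.items():
    print("{}: {:.1%} null".format(col, frac))
# Age: 19.7% null
# Cabin: 77.7% null
# Embarked: 0.3% null
```

On the real DataFrame, `train_data.isnull().mean()` should give the same fractions without copying any numbers by hand.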

The **Name** and **Ticket** attributes may have some value, but they will be a bit tricky to convert into useful numbers that a model can consume. So for now, we will ignore them.

Let's take a look at the numerical attributes:

In [5]:
train_data.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,712.0,712.0,712.0,572.0,712.0,712.0,712.0
mean,356.5,0.376404,2.330056,29.498846,0.553371,0.379213,32.586276
std,205.680983,0.484824,0.824584,14.500059,1.176404,0.791669,51.969529
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,178.75,0.0,2.0,21.0,0.0,0.0,7.925
50%,356.5,0.0,3.0,28.0,0.0,0.0,14.4542
75%,534.25,1.0,3.0,38.0,1.0,0.0,30.5
max,712.0,1.0,3.0,80.0,8.0,6.0,512.3292


* Yikes, only 37.6% **Survived**. :(  That's close enough to 40%, so **accuracy** will be a reasonable metric to evaluate our model.
* The mean **Fare** was £32.59, which does not seem so expensive (but it was probably a lot of money back then).
* The mean **Age** was less than 30 years old.

Let's check that the target is indeed 0 or 1:

In [6]:
train_data["Survived"].value_counts()

0    444
1    268
Name: Survived, dtype: int64
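
These counts also give a useful reference point for accuracy: a trivial model that always predicts the majority class (0, did not survive) is already right for every non-survivor. A quick sketch of that arithmetic, with the counts copied from the output above:

```python
# Class counts copied from the value_counts() output above
n_died, n_survived = 444, 268
total = n_died + n_survived

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = max(n_died, n_survived) / total
survival_rate = n_survived / total

print(round(baseline_accuracy, 4))  # 0.6236
print(round(survival_rate, 4))      # 0.3764
```

Any classifier we train should beat roughly 62% accuracy to be worth keeping.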

Now let's take a quick look at all the categorical attributes:

In [7]:
train_data["Pclass"].value_counts()

3    398
1    163
2    151
Name: Pclass, dtype: int64

In [8]:
train_data["Sex"].value_counts()

male      467
female    245
Name: Sex, dtype: int64

In [9]:
train_data["Embarked"].value_counts()

S    525
C    125
Q     60
Name: Embarked, dtype: int64

The Embarked attribute tells us where the passenger embarked: C=Cherbourg, Q=Queenstown, S=Southampton.

Now let's build our preprocessing pipelines. We will use the `ColumnTransformer` to apply different preprocessing and feature extraction pipelines to different subsets of features. Here the numeric data is median-imputed, while the categorical data is one-hot encoded after imputing missing values with the most frequent value in each column.

In [10]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

In [11]:
numeric_features = ["Age", "SibSp", "Parch", "Fare"]
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median'))])

categorical_features = ["Pclass", "Sex", "Embarked"]
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # note: in Scikit-Learn ≥1.2 the `sparse` argument is named `sparse_output`
    ('onehot', OneHotEncoder(sparse=False))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

Cool! Now we have a nice preprocessing pipeline that takes the raw data and outputs numerical input features that we can feed to any Machine Learning model we want.
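
To see what such a pipeline does to individual values, here is a minimal sketch on a tiny hand-made DataFrame. The `toy` data and the `toy_preprocessor` name are made up for illustration and are not taken from `train.csv`; the sketch also uses `sparse_threshold=0` on the `ColumnTransformer` to get a dense array, which is a different route from the `sparse=False` argument used above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, not taken from train.csv
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 40.0, 30.0],
    "Sex": pd.Series(["male", "female", "male", np.nan], dtype=object),
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        # the missing age becomes the median of 22, 40 and 30, i.e. 30
        ("num", SimpleImputer(strategy="median"), ["Age"]),
        # the missing sex becomes the most frequent value ("male"), then
        # each category gets its own 0/1 column (sorted: female, male)
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder()),
        ]), ["Sex"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

X_toy = toy_preprocessor.fit_transform(toy)
print(X_toy.shape)  # (4, 3): one Age column plus two one-hot columns
print(X_toy)
```

The second row ends up with an age of 30, and the last row is encoded as male, which is exactly the imputation described above.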

In [12]:
X_train = preprocessor.fit_transform(train_data.drop(columns=["Survived"]))
X_train

array([[45.5,  0. ,  0. , ...,  0. ,  0. ,  1. ],
       [23. ,  0. ,  0. , ...,  0. ,  0. ,  1. ],
       [32. ,  0. ,  0. , ...,  0. ,  0. ,  1. ],
       ...,
       [41. ,  2. ,  0. , ...,  0. ,  0. ,  1. ],
       [14. ,  1. ,  2. , ...,  0. ,  0. ,  1. ],
       [21. ,  0. ,  1. , ...,  0. ,  0. ,  1. ]])

Note: We drop the "Survived" column from train_data before fitting the preprocessor, because test_data has no "Survived" column.

Let's not forget to get the labels:

In [13]:
y_train = train_data["Survived"]

We are now ready to train a classifier.

In [14]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

LogisticRegression()

Great, our model is trained. Let's use it to make predictions on the test set:

In [15]:
X_test = preprocessor.transform(test_data)
y_pred = log_reg.predict(X_test)

In [16]:
test_data["Survived"]=y_pred
test_data.to_csv("my_solution.csv",index=False, columns=["PassengerId", "Survived"])

We have now written these predictions to a CSV file, so we could just upload it and hope for the best. But wait! We can do better than hope. Why don't we use cross-validation to get an idea of how good our model is?

In [17]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(log_reg, X_train, y_train, cv=10)
scores.mean()


0.8019561815336462

Okay, over 80% accuracy, clearly better than random chance, but it's not a great score. 
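
A mean score on its own hides how much the accuracy varies from fold to fold, which matters when comparing models that differ by only a few points. The sketch below shows one way to look at both; it runs on synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`, so the numbers it prints say nothing about the Titanic model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (X_train, y_train); not the Titanic data
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     random_state=42)

# One accuracy value per fold (10 folds)
demo_scores = cross_val_score(LogisticRegression(max_iter=1000),
                              X_demo, y_demo, cv=10)

print("mean accuracy:", demo_scores.mean())
print("std across folds:", demo_scores.std())
```

With the real data, the same two lines applied to `scores` would show whether a later improvement is larger than the fold-to-fold spread.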

**Your Goal**: try to build a model that reaches higher accuracy, for example, 85%.

To improve this result further, you could:
* Tune hyperparameters using cross-validation,
* Do more feature engineering, for example:
  * replace **SibSp** and **Parch** with their sum,
  * try to identify parts of names that correlate well with the **Survived** attribute (e.g. if the name contains "Countess", then survival seems more likely),
* try to convert numerical attributes to categorical attributes: for example, different age groups had very different survival rates (see below), so it may help to create an age bucket category and use it instead of the age. Similarly, it may be useful to have a special category for people traveling alone since only 30% of them survived (see below).
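
For the first suggestion, one common option is `GridSearchCV`, which cross-validates every combination in a parameter grid and keeps the best one. The following is only a sketch: it uses synthetic data in place of `X_train` and `y_train`, and the grid of `C` values is an arbitrary choice for illustration rather than a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for (X_train, y_train); not the Titanic data
X_syn, y_syn = make_classification(n_samples=300, n_features=8,
                                   random_state=42)

# Arbitrary illustrative grid over the regularization strength C
param_grid = {"C": [0.01, 0.1, 1, 10]}

grid_search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                           cv=5, scoring="accuracy")
grid_search.fit(X_syn, y_syn)

print(grid_search.best_params_)  # the C value with the best mean CV accuracy
print(grid_search.best_score_)   # that mean CV accuracy
```

After fitting, `grid_search.best_estimator_` is a model refit on all the data passed to `fit`, so it can be used for predictions directly.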

In [18]:
data=[train_data, test_data]

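for dataset in data:
    # floor each age to the start of its 15-year bucket (0, 15, 30, ...)
    dataset["AgeBucket"] = dataset["Age"] // 15 * 15

# mean survival rate per age bucket (training set only, since it has the true labels)
age_bucket_survival = train_data[["AgeBucket", "Survived"]].groupby("AgeBucket").mean()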
    

In [19]:
data=[train_data, test_data]

for dataset in data:
    dataset["RelativesOnboard"]=dataset["SibSp"]+dataset["Parch"]
    
train_data=train_data.drop(['SibSp','Parch'],axis=1)
test_data=test_data.drop(['SibSp','Parch'],axis=1)

In [20]:
data = [train_data, test_data]
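for dataset in data:
    # note: despite its name, 'alone' is 1 when the passenger has relatives aboard
    # and 0 when the passenger is traveling alone
    dataset.loc[dataset['RelativesOnboard'] > 0, 'alone'] = 1
    dataset.loc[dataset['RelativesOnboard'] == 0, 'alone'] = 0
    dataset['alone'] = dataset['alone'].astype(int)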

In [21]:
train_data = train_data.drop(['Ticket'], axis=1)
test_data = test_data.drop(['Ticket'], axis=1)

In [22]:
data = [train_data, test_data]
titles = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Major":5, "Ms":6}

for dataset in data:
    # raw string so that \. is passed to the regex engine unchanged
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
    dataset['Title'] = dataset['Title'].map(titles)
    dataset['Title'] = dataset['Title'].fillna(0)
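
To make the title extraction concrete, here is a small sketch that applies the same regular expression and mapping to a few names. The first three names appear in the tables in this notebook; the last one is a hypothetical name added to show what happens to a title that is missing from the mapping.

```python
import pandas as pd

names = pd.Series([
    "Partner, Mr. Austen",
    "Andersson, Miss. Ebba Iris Alfrida",
    'Moubarek, Master. Halim Gonios ("William George")',
    "Doe, Dr. Jane",  # hypothetical name, not in the dataset
])
title_codes = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Major": 5, "Ms": 6}

# Capture the first word that follows a space and ends with a period
extracted = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

# Titles missing from the mapping become NaN, then 0
codes = extracted.map(title_codes).fillna(0)

print(extracted.tolist())  # ['Mr', 'Miss', 'Master', 'Dr']
print(codes.tolist())      # [1.0, 2.0, 4.0, 0.0]
```

Note that the codes are arbitrary integers, so feeding them to the model as a numeric feature imposes an ordering on the titles that has no real meaning; one-hot encoding the title would avoid that.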

In [23]:
data = [train_data, test_data]
for dataset in data:
    dataset['Fare_Per_Person'] = dataset['Fare']/(dataset['RelativesOnboard']+1)
    dataset['Fare_Per_Person'] = dataset['Fare_Per_Person'].astype(int)

In [24]:
train_data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare,Cabin,Embarked,AgeBucket,RelativesOnboard,alone,Title,Fare_Per_Person
0,1,0,1,"Partner, Mr. Austen",male,45.5,28.5000,C124,S,45.0,0,0,1.0,28
1,2,0,2,"Berriman, Mr. William John",male,23.0,13.0000,,S,15.0,0,0,1.0,13
2,3,0,3,"Tikkanen, Mr. Juho",male,32.0,7.9250,,S,30.0,0,0,1.0,7
3,4,0,3,"Hansen, Mr. Henrik Juul",male,26.0,7.8542,,S,15.0,1,1,1.0,3
4,5,0,3,"Andersson, Miss. Ebba Iris Alfrida",female,6.0,31.2750,,S,0.0,6,1,2.0,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
707,708,1,3,"Salkjelsvik, Miss. Anna Kristine",female,21.0,7.6500,,S,15.0,0,0,2.0,7
708,709,0,1,"Cairns, Mr. Alexander",male,,31.0000,,S,,0,0,1.0,31
709,710,0,3,"Hansen, Mr. Claus Peter",male,41.0,14.1083,,S,30.0,2,1,1.0,4
710,711,1,1,"Carter, Miss. Lucile Polk",female,14.0,120.0000,B96 B98,S,0.0,3,1,2.0,30


In [25]:
numeric_features = ["AgeBucket", "RelativesOnboard", "Fare","Title","Fare_Per_Person", "Pclass","alone"]
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median'))])

categorical_features = ["Sex", "Embarked"]
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(sparse=False))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

In [26]:
X_train2 = preprocessor.fit_transform(train_data.drop(columns=["Survived"]))
X_train2

array([[ 45.    ,   0.    ,  28.5   , ...,   0.    ,   0.    ,   1.    ],
       [ 15.    ,   0.    ,  13.    , ...,   0.    ,   0.    ,   1.    ],
       [ 30.    ,   0.    ,   7.925 , ...,   0.    ,   0.    ,   1.    ],
       ...,
       [ 30.    ,   2.    ,  14.1083, ...,   0.    ,   0.    ,   1.    ],
       [  0.    ,   3.    , 120.    , ...,   0.    ,   0.    ,   1.    ],
       [ 15.    ,   1.    ,  77.2875, ...,   0.    ,   0.    ,   1.    ]])

In [27]:
y_train = train_data["Survived"]

In [28]:
from sklearn.linear_model import LogisticRegression

log_reg2 = LogisticRegression()
log_reg2.fit(X_train2, y_train)

LogisticRegression()

In [29]:
X_test2 = preprocessor.transform(test_data)
y_pred = log_reg2.predict(X_test2)

In [30]:
test_data["Survived"]=y_pred
test_data.to_csv("my_solution2.csv",index=False, columns=["PassengerId", "Survived"])

In [43]:
from sklearn.model_selection import cross_val_score

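# note: cv=21 here, whereas the first model was evaluated with cv=10,
# so the two mean scores are not computed on the same folds
scores = cross_val_score(log_reg2, X_train2, y_train, cv=21)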
scores.mean()

0.8345641286817757

In [32]:
train_data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare,Cabin,Embarked,AgeBucket,RelativesOnboard,alone,Title,Fare_Per_Person
0,1,0,1,"Partner, Mr. Austen",male,45.5,28.5000,C124,S,45.0,0,0,1.0,28
1,2,0,2,"Berriman, Mr. William John",male,23.0,13.0000,,S,15.0,0,0,1.0,13
2,3,0,3,"Tikkanen, Mr. Juho",male,32.0,7.9250,,S,30.0,0,0,1.0,7
3,4,0,3,"Hansen, Mr. Henrik Juul",male,26.0,7.8542,,S,15.0,1,1,1.0,3
4,5,0,3,"Andersson, Miss. Ebba Iris Alfrida",female,6.0,31.2750,,S,0.0,6,1,2.0,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
707,708,1,3,"Salkjelsvik, Miss. Anna Kristine",female,21.0,7.6500,,S,15.0,0,0,2.0,7
708,709,0,1,"Cairns, Mr. Alexander",male,,31.0000,,S,,0,0,1.0,31
709,710,0,3,"Hansen, Mr. Claus Peter",male,41.0,14.1083,,S,30.0,2,1,1.0,4
710,711,1,1,"Carter, Miss. Lucile Polk",female,14.0,120.0000,B96 B98,S,0.0,3,1,2.0,30


In [33]:
test_data

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,Fare,Cabin,Embarked,Survived,AgeBucket,RelativesOnboard,alone,Title,Fare_Per_Person
0,1,3,"Moubarek, Master. Halim Gonios (""William George"")",male,,15.2458,,C,1,,2,1,4.0,5
1,2,2,"Kvillner, Mr. Johan Henrik Johannesson",male,31.0,10.5000,,S,0,30.0,0,0,1.0,10
2,3,3,"Alhomaki, Mr. Ilmari Rudolf",male,20.0,7.9250,,S,0,15.0,0,0,1.0,7
3,4,2,"Harper, Miss. Annie Jessie ""Nina""",female,6.0,33.0000,,S,1,0.0,1,1,2.0,16
4,5,3,"Nicola-Yarred, Miss. Jamila",female,14.0,11.2417,,C,1,0.0,1,1,2.0,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
174,175,3,"Kallio, Mr. Nikolai Erland",male,17.0,7.1250,,S,0,15.0,0,0,1.0,7
175,176,3,"Elias, Mr. Dibo",male,,7.2250,,C,0,,0,0,1.0,7
176,177,3,"Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...",female,38.0,31.3875,,S,0,30.0,6,1,3.0,4
177,178,2,"Ilett, Miss. Bertha",female,17.0,10.5000,,S,1,15.0,0,0,2.0,10


In [34]:
train_data.shape

(712, 14)

In [35]:
test_data.shape

(179, 14)