In [1]:
import pandas as pd
import numpy as np

%matplotlib inline

Before we train the predictors on the Titanic dataset and create the Streamlit app, some preprocessing is necessary.
- We need to address the missing values in the dataset
- Next, we need to understand which columns have a high information potential and which columns can be merged or dropped altogether
- Finally, we need to transform categorical features to numerical ones

We start by importing the dataset and taking an overall look at the data.

In [2]:
titanic = pd.read_csv('titanic_raw.csv')

In [3]:
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


The dataset includes 891 entries and 12 columns. We notice that the 'Age' column is missing 177 values, the 'Cabin' column is missing 687 values and the 'Embarked' column is missing 2 values.
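The missing counts can also be read off directly instead of being subtracted from the `info()` output by hand. A minimal sketch, using a small hypothetical frame (`demo`) in place of the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
})

# Number of missing values per column, largest first
missing = demo.isnull().sum().sort_values(ascending=False)
print(missing)
```

Applied to the real frame, the same call would be `titanic.isnull().sum()`.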

### Resolving missing values

**'Age' column**

We will replace the missing values with the median passenger age per class.

In [5]:
titanic.groupby(['Pclass'])['Age'].median()

Pclass
1    37.0
2    29.0
3    24.0
Name: Age, dtype: float64

In [6]:
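def age_by_class(row):
    # Fill a missing age with the median age of the passenger's class
    # (medians taken from the groupby output above)
    age = row['Age']
    pclass = row['Pclass']

    if pd.isnull(age):
        if pclass == 1:
            return 37
        elif pclass == 2:
            return 29
        else:
            return 24
    return age

titanic['Age'] = titanic[['Age', 'Pclass']].apply(age_by_class, axis=1)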
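The per-class median imputation can also be written without hard-coding the medians. The sketch below is one possible alternative, shown on a hypothetical toy frame (`demo`) rather than the real data; it relies on `groupby(...).transform("median")` returning a value aligned with every row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two classes, one missing age in each
demo = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [30.0, 40.0, np.nan, 20.0, 24.0, np.nan],
})

# Median age of each passenger's class, aligned row by row with the frame
class_median = demo.groupby("Pclass")["Age"].transform("median")

# Only the missing ages are replaced; known ages are left untouched
demo["Age"] = demo["Age"].fillna(class_median)
print(demo)
```

This keeps the imputed values in sync with the data if the dataset changes.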

**'Cabin' column**

While the overwhelming majority of entries are missing in the 'Cabin' column, whether a cabin number was recorded at all seems to be a good predictor of survival, so it is probably a bad idea to dismiss the column altogether. Instead, it can be transformed into a binary indicator.

In [7]:
titanic.groupby(titanic['Cabin'].isnull())['Survived'].count()

Cabin
False    204
True     687
Name: Survived, dtype: int64

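This count only gives the group sizes: 204 passengers have a cabin number and 687 do not. Comparing survival between the two groups requires the mean of 'Survived' rather than the count. The working assumption here is that passengers with a recorded cabin number survived more often, which would suggest that wealthier passengers had better chances of survival.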
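A per-group survival rate can be obtained by taking the mean of 'Survived' within each group. A minimal sketch on a hypothetical toy frame (the values are illustrative and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: three passengers with a cabin number, three without
demo = pd.DataFrame({
    "Cabin": ["C85", np.nan, "C123", np.nan, np.nan, "E46"],
    "Survived": [1, 0, 1, 0, 1, 0],
})

# 1 if a cabin number is recorded, 0 otherwise
demo["Cabin_ind"] = np.where(demo["Cabin"].isnull(), 0, 1)

# The mean of a 0/1 column is the share of survivors in each group
rates = demo.groupby("Cabin_ind")["Survived"].mean()
print(rates)
```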

In [8]:
titanic['Cabin_ind'] = np.where(titanic['Cabin'].isnull(), 0, 1)

In [9]:
titanic.drop(columns =['Cabin'], inplace=True) 

**'Embarked' column**

It looks like the vast majority of passengers embarked in Southampton and relatively few joined at the other two ports. Moreover, when looking at the survival rate by port and by class, the chance of survival seems to be driven mainly by the class, as the survival rates within a class are broadly similar across ports. An exception is the low 3rd-class survival rate for passengers from Southampton, which is attributed here to the sheer number of people who embarked there in 3rd class. Therefore, we will drop the 'Embarked' column and keep the 'Pclass' one.

In [10]:
titanic.groupby(['Embarked', 'Survived'])['PassengerId'].count()

Embarked  Survived
C         0            75
          1            93
Q         0            47
          1            30
S         0           427
          1           217
Name: PassengerId, dtype: int64

In [11]:
titanic.groupby(['Embarked', 'Pclass'])['Survived'].mean()

Embarked  Pclass
C         1         0.694118
          2         0.529412
          3         0.378788
Q         1         0.500000
          2         0.666667
          3         0.375000
S         1         0.582677
          2         0.463415
          3         0.189802
Name: Survived, dtype: float64

In [12]:
titanic.drop(columns =['Embarked'], inplace=True) 

### Feature importance

**'SibSp' and 'Parch' columns**

First, we will merge these columns into a single one called 'Family'.

In [13]:
titanic['Family'] = titanic['SibSp'] + titanic['Parch']

In [14]:
titanic.groupby(titanic['Family']!=0)['Survived'].mean()

Family
False    0.303538
True     0.505650
Name: Survived, dtype: float64

We see that people travelling with family were more likely to survive than people travelling alone. We will therefore binarise this information: '0' for passengers travelling alone and '1' for passengers travelling with family.

In [15]:
# 1 if the passenger has at least one relative on board, 0 otherwise
titanic['Not_alone'] = (titanic['Family'] > 0).astype(int)

In [16]:
titanic.drop(columns =['SibSp', 'Parch', 'Family'], inplace=True) 

**'Ticket', 'Fare', 'Name', 'PassengerId' columns**

We will drop these columns for the following reasons:
- 'Ticket' and 'Name' are free-text strings and weak predictors of survival
- 'PassengerId' is only a row identifier and carries no predictive information
- 'Fare' is strongly correlated with the class, as passengers in upper classes paid more for their tickets

In [17]:
titanic.drop(columns =['Name', 'Ticket', 'Fare', 'PassengerId'], inplace=True) 
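The claim that 'Fare' largely duplicates 'Pclass' could be checked before dropping the column. The sketch below assumes a rank (Spearman) correlation is an adequate measure for an ordinal column and uses hypothetical toy values, not the real data:

```python
import pandas as pd

# Hypothetical toy values: higher classes (lower Pclass number) pay higher fares
demo = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Fare": [71.3, 53.1, 26.0, 13.0, 8.05, 7.25],
})

# Spearman correlation compares ranks, which suits an ordinal column like Pclass
corr = demo.corr(method="spearman").loc["Pclass", "Fare"]
print(round(corr, 3))
```

A value close to -1 would support treating 'Fare' as redundant once 'Pclass' is kept.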

### Categorical features to numerical

**'Sex' column**

We will binarise the 'Sex' column so that a male passenger is represented by '1' and a female passenger by '0'.

In [18]:
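# male is 1; dtype=int keeps the column numeric (newer pandas versions return booleans by default)
titanic['Sex'] = pd.get_dummies(titanic['Sex'], drop_first=True, dtype=int)['male']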

Our preprocessed dataframe will look like:

In [19]:
titanic.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Cabin_ind,Not_alone
0,0,3,1,22.0,0,1
1,1,1,0,38.0,1,1
2,1,3,0,26.0,0,0
3,1,1,0,35.0,1,1
4,0,3,1,35.0,0,0


### Train test split

In [20]:
target = "Survived"

y = titanic[target].copy() # get labels
X = titanic.drop(target, axis=1) # drop labels for the training set

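Optionally save a copy of the processed dataset to be used in the Streamlit application. Note that the commented-out line below writes `titanic`, which still includes the 'Survived' labels; write `X` instead to save the features only.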

In [21]:
# titanic.to_csv('titanic_processed.csv', index=False)

In [22]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

### Survival rate prediction

We will first test a baseline prediction using three estimators.

In [23]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

from sklearn.model_selection import cross_val_score

In [24]:
def test_raw_predictor(predictors):
    # Report 10-fold cross-validated accuracy for each untuned estimator
    for model in predictors:
        name = type(model).__name__
        cv_scores = cross_val_score(model, X_train, y_train, cv=10, n_jobs=-1)
        print(name, "Scores:", cv_scores)
        print(name, "Mean:", cv_scores.mean())
        print(name, "Standard deviation:", cv_scores.std())
        print("\n")

In [25]:
predictor = [LogisticRegression(), RandomForestClassifier(), SVC()]

test_raw_predictor(predictor)

LogisticRegression Scores: [0.89552239 0.70149254 0.80597015 0.85074627 0.80597015 0.70149254
 0.70149254 0.79104478 0.75757576 0.89393939]
LogisticRegression Mean: 0.7905246494798733
LogisticRegression Standard deviation: 0.07119617673807932


RandomForestClassifier Scores: [0.88059701 0.76119403 0.8358209  0.85074627 0.73134328 0.74626866
 0.73134328 0.74626866 0.78787879 0.86363636]
RandomForestClassifier Mean: 0.793509724106739
RandomForestClassifier Standard deviation: 0.055531054502725385


SVC Scores: [0.64179104 0.65671642 0.65671642 0.68656716 0.64179104 0.6119403
 0.67164179 0.62686567 0.60606061 0.65151515]
SVC Mean: 0.6451605608322026
SVC Standard deviation: 0.023901927499956427
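The SVC baseline is noticeably lower than the other two. One possible contributor, which this notebook does not verify, is that 'Age' is on a much larger scale than the binary columns, and RBF-kernel SVMs are sensitive to feature scale. A sketch of how scaling could be added with a pipeline, on synthetic stand-in data (`X_demo` and `y_demo` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(42)
n = 200

# Synthetic stand-in features: one binary column and one on a much larger scale
binary = rng.integers(0, 2, size=n)
large = rng.normal(30, 14, size=n)
X_demo = np.column_stack([binary, large])
y_demo = binary  # label driven by the binary column (illustrative only)

# Same estimator with and without standardisation inside the CV folds
raw_scores = cross_val_score(SVC(), X_demo, y_demo, cv=5)
scaled_scores = cross_val_score(
    make_pipeline(StandardScaler(), SVC()), X_demo, y_demo, cv=5
)
print("raw:", raw_scores.mean(), "scaled:", scaled_scores.mean())
```

Putting the scaler inside the pipeline means it is fitted on the training folds only, which avoids leaking information from the validation fold.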




Next, we will fine-tune the estimators and test their performance in an ensemble.

In [26]:
from sklearn.model_selection import GridSearchCV

def fine_tune_predictor(predictor, grid):
    predictor_cv = GridSearchCV(predictor, grid, cv=10)
    predictor_cv.fit(X_train, y_train)
    print("Tuned hyperparameters :(best parameters) ", predictor_cv.best_params_)
    print("Accuracy :",predictor_cv.best_score_)

In [27]:
parameters_svm = {'C':[1, 10, 100],'kernel':['rbf','linear'], 'gamma':[0.1,'auto', 10], 'degree':[3,4,10]}

parameters_rf = {'n_estimators': [10, 100,50], 'min_samples_leaf': [2,4], 'min_samples_split':[2,6],
                 'bootstrap':[True, False], 'max_depth': [10, 50, 100]}

parameters_lr = {'penalty': ['l1', 'l2'],'C':[0.001,0.01,1,10,100], 'solver':['liblinear', 'saga']}

In [28]:
fine_tune_predictor(LogisticRegression(max_iter=10000), parameters_lr)

Tuned hyperparameters :(best parameters)  {'C': 1, 'penalty': 'l1', 'solver': 'liblinear'}
Accuracy : 0.7935323383084577


In [29]:
fine_tune_predictor(RandomForestClassifier(n_jobs = -1), parameters_rf)

Tuned hyperparameters :(best parameters)  {'bootstrap': True, 'max_depth': 50, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 10}
Accuracy : 0.8174581637268205


In [30]:
fine_tune_predictor(SVC(), parameters_svm)

Tuned hyperparameters :(best parameters)  {'C': 10, 'degree': 3, 'gamma': 0.1, 'kernel': 'rbf'}
Accuracy : 0.7981004070556309


In [31]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression(solver="liblinear", penalty ="l1", C=1, random_state=42)
# Note: these random forest settings differ from the grid search result printed above
rnd_clf = RandomForestClassifier(n_estimators=100, bootstrap=False, max_depth=10, min_samples_leaf=2,
                                 min_samples_split=6, random_state=42)
svm_clf = SVC(gamma=0.1, degree=3, C=10, kernel='rbf', probability=True, random_state=42)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft')
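With `voting='soft'`, the ensemble averages the class probabilities predicted by each estimator and picks the class with the highest average. A minimal NumPy sketch of that rule, using made-up probabilities and assuming equal weights:

```python
import numpy as np

# Hypothetical predicted probabilities [P(died), P(survived)] for 3 passengers
proba_lr = np.array([[0.60, 0.40], [0.30, 0.70], [0.55, 0.45]])
proba_rf = np.array([[0.70, 0.30], [0.40, 0.60], [0.35, 0.65]])
proba_svc = np.array([[0.80, 0.20], [0.20, 0.80], [0.45, 0.55]])

# Average the probabilities across the three estimators
avg_proba = np.mean([proba_lr, proba_rf, proba_svc], axis=0)

# The predicted class is the one with the highest average probability
soft_vote = avg_proba.argmax(axis=1)
print(soft_vote)
```

For the third passenger the logistic regression alone would predict class 0, but the averaged probabilities favour class 1. This is also why `SVC` is created with `probability=True`: soft voting needs `predict_proba` from every estimator.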

In [32]:
from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.8026905829596412
RandomForestClassifier 0.8116591928251121
SVC 0.8026905829596412
VotingClassifier 0.7982062780269058


The predictors are now ready to be deployed in the application.