In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
df_training = pd.read_csv('../../datasets/titanic_training.csv')

In [None]:
df_test = pd.read_csv('../../datasets/titanic_test.csv')

In this notebook we analyse and preprocess the training data to prepare it for machine learning algorithms. We apply exactly the same transformations to the test data. 

# Initial preparation

In [None]:
len(df_training)

In [None]:
len(df_test)

In [None]:
df_training.head()

In [None]:
df_training.columns

- PassengerID: row id
- Survived: target variable (1 = yes, 0 = no)
- Pclass: ticket class
- Name: name
- Sex: sex
- Age: age
- SibSp: number of spouses or siblings aboard
- Parch: number of parents or children aboard
- Ticket: ticket number 
- Fare: ticket fare
- Cabin: assigned cabin number
- Embarked: port from which they embarked (C = Cherbourg, Q = Queenstown, S = Southampton)

The Ticket and Cabin columns cannot be used as they are; we need to transform them to obtain usable features.

The Cabin feature is composed of a letter (which correlates with the class) and a number (the specific cabin). It may be interesting to split it into two features: a categorical one (CabinClass), which may be strongly correlated with Pclass, and a numerical one (CabinNumber), which would give the approximate position from the front to the back of the ship. The problem is that, according to the Titanic deck plans, there is no direct relation between the cabin number and the distance from the front. It would be useful to use the cabin number to split cabins into front, middle and back cabins, and also into left and right. However, it is hard to find a good deck plan that indicates the actual positions of the cabins. I will keep the cabin number anyway because it may indicate proximity.
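As a minimal sketch of the split described above, the letter and the number can be separated with pandas string methods. The cabin values below are made up for illustration, and taking the first number when a row lists several cabins is an assumption, not a decision made in this notebook.

```python
import pandas as pd

# Toy cabin values (hypothetical, not read from the dataset)
cabins = pd.Series(['C85', None, 'B57 B59', 'F G73', 'T'])

# First character of the string; missing cabins stay missing
cabin_class = cabins.str[0]

# First run of digits, if any, converted to a number (NaN when absent)
cabin_number = pd.to_numeric(cabins.str.extract(r'(\d+)', expand=False))

print(pd.DataFrame({'Cabin': cabins,
                    'CabinClass': cabin_class,
                    'CabinNumber': cabin_number}))
```

Cabins without a number (such as 'T') and missing cabins both end up with a missing CabinNumber.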

In [None]:
df_training['Cabin'].unique()

With respect to the ticket number, the optional prefix (TicketPrefix) indicates the issuing office, and the number (TicketNumber) can be compared for equality (sharing a cabin) or for closeness (people with cabins that are close to each other).
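A minimal sketch of this decomposition, on made-up ticket strings: the prefix is the first space-separated token (or the placeholder 'NUMBER' when there is none) and the number is the last token. The `ticket_group_size` column is a hypothetical illustration of the equality comparison; it is not used later in this notebook.

```python
import pandas as pd

# Toy ticket values (hypothetical, not read from the dataset)
tickets = pd.Series(['A/5 21171', 'PC 17599', '113803', '113803', 'STON/O2. 3101282'])

# Split each ticket into space-separated tokens
parts = tickets.str.split(' ')

# Prefix: first token when there is more than one token, otherwise 'NUMBER'
ticket_prefix = parts.apply(lambda p: p[0] if len(p) > 1 else 'NUMBER')

# Number: last token, kept as a string
ticket_number = parts.str[-1]

# Equality comparison: how many passengers share each ticket number
ticket_group_size = ticket_number.map(ticket_number.value_counts())

print(pd.DataFrame({'Ticket': tickets,
                    'TicketPrefix': ticket_prefix,
                    'TicketNumber': ticket_number,
                    'GroupSize': ticket_group_size}))
```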

In [None]:
df_training['Ticket'].values

I don't see how to use TicketNumber and CabinNumber as proximity features, so I will stick to TicketPrefix and CabinClass.

In [None]:
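def process_ticket(df):
    # Keep the first space-separated token as the prefix; tickets that are
    # just a number get the placeholder 'NUMBER'. Missing tickets stay missing.
    has_ticket = df['Ticket'].notnull()
    df['TicketPrefix'] = df['Ticket']
    df.loc[has_ticket, 'TicketPrefix'] = df.loc[has_ticket, 'Ticket'].apply(
        lambda x: x.split(' ')[0] if len(x.split(' ')) > 1 else 'NUMBER')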
    
process_ticket(df_training)
process_ticket(df_test)

In [None]:
df_training[['Ticket', 'TicketPrefix']].head()

In [None]:
# For cabin I keep the first letter. Several rows have more than one assigned cabin. In these
# rows the first letter is the same for all the assigned cabins, except in two rows of the form:
# F GXX
# In that case, for simplicity, I keep the letter F.
def process_cabin(df):
    df['CabinClass'] = df['Cabin']
    df.loc[df['Cabin'].notnull(), 'CabinClass'] = df['Cabin'].apply(lambda x: str(x)[0])
    
process_cabin(df_training)
process_cabin(df_test)

In [None]:
df_training[['Cabin', 'CabinClass']].head()

In [None]:
dependent = 'Survived'
categorical = ['Pclass', 'Sex', 'TicketPrefix', 'CabinClass', 'Embarked']
numerical = ['Age', 'SibSp', 'Parch', 'Fare']

## Initial exploration

We must take into account that there are missing values.

Let's look at the numerical variables first.

In [None]:
kwargs = dict(histtype = 'stepfilled', alpha = 0.3, density = True, ec = 'k')

for n in numerical:
    df = df_training[df_training[n].notnull()]
    x = df[n].values
    y = df[dependent].values
    
    fig, ax = plt.subplots(1, 2)
    (_, bins, _) = ax[0].hist(x, **kwargs)
    ax[0].set_title(n)
    
    x_0 = x[np.where(y == 0)]
    x_1 = x[np.where(y == 1)]
    ax[1].hist(x_0, **kwargs, bins = bins)
    ax[1].hist(x_1, **kwargs, bins = bins)
    ax[1].legend(['no', 'yes'])
    ax[1].set_title(n + ' vs. survived')
    
    fig.set_figwidth(16)

It seems that all the numerical features may provide useful information for predicting the dependent variable:

* Younger passengers are more likely to survive.
* Passengers travelling with some, but not too many, siblings/spouses are more likely to survive.
* Passengers are more likely to survive if they embarked with parents/children.
* Passengers who paid cheaper fares are less likely to survive.
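Impressions read off the histograms can be cross-checked numerically by binning a feature and computing the survival rate per bin. The sketch below uses a small made-up dataframe; the bin edges are arbitrary assumptions chosen for illustration.

```python
import pandas as pd

# Toy data (hypothetical values, not read from the dataset)
toy = pd.DataFrame({'Age':      [4, 8, 25, 30, 41, 63, 70, 35],
                    'Survived': [1, 1,  0,  1,  0,  0,  0,  1]})

# Arbitrary age bins: (0, 18], (18, 40], (40, 80]
age_bins = pd.cut(toy['Age'], bins=[0, 18, 40, 80])

# The mean of a 0/1 column is the proportion of survivors in each bin
survival_by_age = toy.groupby(age_bins, observed=True)['Survived'].mean()
print(survival_by_age)
```

The same pattern applies to SibSp, Parch and Fare by replacing the column name and the bin edges.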

Let's take a look at the categorical features now. 

In [None]:
for c in categorical:
    df = df_training[df_training[c].notnull()]
    
    fig, ax = plt.subplots(1, 2)
    freqs = df[c].value_counts()
    labels = freqs.keys()
    ax[0].bar(range(len(labels)), freqs.values, alpha = 0.3)
    ax[0].set_xticks(range(len(labels)))
    ax[0].set_xticklabels(labels, rotation = 'vertical')
    ax[0].set_title(c)
    
    freqs_01 = df.groupby('Survived')[c].value_counts()
    ax[1].bar(range(len(labels)), freqs_01[0][labels].values, alpha = 0.3)
    ax[1].bar(range(len(labels)), freqs_01[1][labels].values, bottom = freqs_01[0][labels].values, alpha = 0.3)
    ax[1].set_xticks(range(len(labels)))
    ax[1].set_xticklabels(labels, rotation = 'vertical')
    ax[1].legend(['no', 'yes'])
    ax[1].set_title(c + ' vs. survived')
    
    fig.set_figwidth(16)

Most of the categorical features also seem to provide information about the likelihood of survival. For instance, a passenger is more likely to survive if she is a woman, or if the cabin prefix is not T. Many of the passengers with ticket class = 1 did not survive.

## Imputing missing values

Let's take a look at the proportion of missing data. Some of the fare values are zero, but I decided not to treat them as bogus data: I am assuming that these 17 passengers travelled with a zero fare for an explainable reason.

In [None]:
def test_missing():
    for col in numerical + categorical:
        if col in categorical:
            missing = df_training[df_training[col].isna()]
        else:
            missing = df_training[(df_training[col].isna()) |
                                  (df_training[col].apply(lambda x: isinstance(x, str)))]
        proportion = len(missing) / len(df_training) * 100
        print(col + ': ' + str(proportion) + '%')

In [None]:
test_missing()

We have two categorical variables (CabinClass and Embarked) and one numerical variable (Age) with missing values. For the categorical variables, I am going to assign a new value 'None' to the missing entries. For the numerical variable, I am going to keep it simple and use median imputation, with the median computed on the training set.

In [None]:
# Categorical variables
for c in ['CabinClass', 'Embarked']:
    df_training.loc[df_training[c].isna(), c] = 'None'
    df_test.loc[df_test[c].isna(), c] = 'None'

In [None]:
# Numerical variable: the median is computed on the training set only and
# reused for the test set. .loc is needed so that the assignment modifies
# the dataframe itself rather than a temporary copy.
imputed = df_training['Age'].median()
df_training.loc[df_training['Age'].isna(), 'Age'] = imputed
df_test.loc[df_test['Age'].isna(), 'Age'] = imputed

In [None]:
test_missing()
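The imputation strategy of this section can be sketched on toy data as follows. The two small dataframes are made up for illustration; the point is that each dataframe is masked with its own missing-value mask, that `.loc` writes into the dataframe in place, and that the median comes from the training data only.

```python
import numpy as np
import pandas as pd

# Toy training and test sets (hypothetical values)
train = pd.DataFrame({'Age': [22.0, np.nan, 35.0, 54.0],
                      'Embarked': ['S', None, 'C', 'S']})
test = pd.DataFrame({'Age': [np.nan, 40.0],
                     'Embarked': ['Q', None]})

# Median of the observed training ages only
age_median = train['Age'].median()

for frame in (train, test):
    # Each dataframe uses its own mask of missing rows
    frame.loc[frame['Age'].isna(), 'Age'] = age_median
    frame.loc[frame['Embarked'].isna(), 'Embarked'] = 'None'

print(train)
print(test)
```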

## Correlation between variables

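We calculate the Pearson correlation between each pair of features in order to determine whether we should remove any variable. Categorical features are converted to integer codes for this purpose, so their coefficients depend on the arbitrary order of the codes and should be read with caution.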
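For reference, the Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand on made-up arrays and compares the result with `np.corrcoef`; the helper name `pearson` is hypothetical.

```python
import numpy as np

# Toy data (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

def pearson(a, b):
    # Centre both variables, then divide the summed cross-products
    # by the square root of the product of the summed squares.
    a_c = a - a.mean()
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

r_manual = pearson(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```

Written this way, the normalisation constants cancel, so the result does not depend on whether the standard deviation is computed with n or n - 1 in the denominator.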

In [None]:
features = categorical + numerical

fig, ax = plt.subplots(6, 6)

plots = 0
for i in range(len(features)):
    for j in range(i + 1, len(features)):
        row = plots // 6
        col = plots % 6

        def categorical_to_numerical(f):
            # Numerical features are used as they are; categorical features are
            # replaced by integer codes in order of first appearance.
            if features[f] in numerical:
                return df_training[features[f]].to_numpy(dtype = float)
            codes, _ = pd.factorize(df_training[features[f]])
            return codes.astype(float)
        
        values_i = categorical_to_numerical(i)
        values_j = categorical_to_numerical(j)
        
        # Pearson correlation, ignoring rows where either value is missing
        valid = ~(np.isnan(values_i) | np.isnan(values_j))
        cor = np.corrcoef(values_i[valid], values_j[valid])[0, 1]
            
        ax[row][col].scatter(values_i, values_j, alpha = 0.5)
        
        ax[row][col].set_xlabel(features[i])
        ax[row][col].set_ylabel(features[j])
        ax[row][col].set_title('cor = ' + '%.2f' % cor)
        
        if features[i] in categorical:
            values = df_training[features[i]].unique().tolist()
            ax[row][col].set_xticks(range(len(values)))
            ax[row][col].set_xticklabels(values, rotation = 'vertical')
        if features[j] in categorical:
            values = df_training[features[j]].unique().tolist()
            ax[row][col].set_yticks(range(len(values)))
            ax[row][col].set_yticklabels(values)

        plots = plots + 1
        
fig.set_figwidth(16)
fig.set_figheight(16)
plt.tight_layout()

I don't observe any strong correlation, and I don't see any obvious outliers either.

## Dummy variables

We transform the categorical variables into dummy variables: for each categorical variable we create k-1 new binary variables, where k is the number of distinct values of that variable.

In [None]:
new_categorical = []
for c in categorical:
    values = df_training[c].unique()[:-1]
    for v in values:
        name = c + '_' + str(v)
        df_training[name] = (df_training[c] == v).astype(int)
        df_test[name] = (df_test[c] == v).astype(int)
        new_categorical.append(name)
    df_training = df_training.drop(c, axis = 1)
    df_test = df_test.drop(c, axis = 1)
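As a point of comparison, the same k-1 encoding can be sketched with `pd.get_dummies`. This is an alternative, not the method used in this notebook: `drop_first=True` drops the first category in sorted order, whereas the loop above drops the last value returned by `unique()`. The toy dataframes are made up, and the test set is reindexed so that it ends up with exactly the training columns.

```python
import pandas as pd

# Toy training and test sets (hypothetical values)
train = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
test = pd.DataFrame({'Embarked': ['C', 'S']})

# k-1 dummy columns for the training set (the first category, 'C', is dropped)
train_dummies = pd.get_dummies(train, columns=['Embarked'], drop_first=True, dtype=int)

# Encode the test set, then align it with the training columns:
# columns missing from the test set are filled with 0, extra ones are discarded
test_dummies = pd.get_dummies(test, columns=['Embarked'], dtype=int)
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(train_dummies)
print(test_dummies)
```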

In [None]:
print(len(categorical + numerical))

In [None]:
variables = new_categorical + numerical
print(len(variables))

After this step our training dataset contains 60 variables instead of 9.

## Standardise

We want to keep the correlation between variables. Therefore, we use standardisation instead of normalisation. This step is not necessary for some machine learning algorithms, but it can help others to converge much faster, and it prevents features with large ranges from dominating algorithms based on the Euclidean distance.
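The claim that standardisation keeps the correlation between variables can be checked on a small example. The dataframe below contains made-up values; after subtracting the mean and dividing by the standard deviation, each column has mean 0 and standard deviation 1, and the Pearson correlation is unchanged.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({'Age':  [22.0, 38.0, 26.0, 35.0, 54.0],
                    'Fare': [7.25, 71.28, 7.92, 53.10, 51.86]})

# Column-wise standardisation (pandas uses the sample standard deviation)
standardised = (toy - toy.mean()) / toy.std()

corr_before = toy['Age'].corr(toy['Fare'])
corr_after = standardised['Age'].corr(standardised['Fare'])
print(corr_before, corr_after)
```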

In [None]:
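# Keeping these values to transform the test dataset. Only the model variables
# are used, because the dataframe still contains text columns such as Name.
statistics = pd.concat((df_training[variables].mean(), df_training[variables].std()), axis = 1)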
statistics.columns = ['mean', 'std']
statistics.head()

In [None]:
for c in variables:
    mean = statistics.loc[c, 'mean']
    std = statistics.loc[c, 'std']
    df_training[c] = (df_training[c] - mean) /  std
    df_test[c] = (df_test[c] - mean) /  std

In [None]:
df_training[variables].head()

## Class imbalance

Finally we test whether the training set has a class imbalance problem. 

In [None]:
print(str((df_training.Survived == 1).sum()) + ' rows have Survived = 1')
print(str((df_training.Survived == 0).sum()) + ' rows have Survived = 0')

There is some imbalance in the data, but it does not seem too extreme, so I decided not to oversample the minority class.
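The class balance can also be expressed as proportions rather than raw counts, which makes it easier to judge. A minimal sketch on a made-up target column:

```python
import pandas as pd

# Toy target column (hypothetical values, not read from the dataset)
survived = pd.Series([0, 0, 0, 1, 1, 0, 1, 0, 0, 1], name='Survived')

# Fraction of rows in each class
proportions = survived.value_counts(normalize=True)
print(proportions)
```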