### Titanic Dataset Analysis for Kaggle

In this classic Kaggle challenge, we try to predict survival on the Titanic based on features of the passengers. 

First, let's import the libraries we need:

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

## Helper Functions

In [None]:
def plot_subplots(feature, data):
    fx, axes = plt.subplots(2,1,figsize=(15,10))
    axes[0].set_title(f"{feature} vs Frequency")
    axes[1].set_title(f"{feature} vs Survival")
    fig_title1 = sns.countplot(data = data, x=feature, ax=axes[0])
    fig_title2 = sns.countplot(data = data, x=feature, hue='Survived', ax=axes[1])
    
def plot_histograms( df , variables , n_rows , n_cols ):
    fig = plt.figure( figsize = ( 16 , 12 ) )
    for i, var_name in enumerate( variables ):
        ax=fig.add_subplot( n_rows , n_cols , i+1 )
        df[ var_name ].hist( bins=10 , ax=ax )
        ax.set_title( 'Skew: ' + str( round( float( df[ var_name ].skew() ) , ) ) ) # + ' ' + var_name ) #var_name+" Distribution")
        ax.set_xticklabels( [] , visible=False )
        ax.set_yticklabels( [] , visible=False )
    fig.tight_layout()  # Improves appearance a bit.
    plt.show()

def plot_distribution( df , var , target , **kwargs ):
    row = kwargs.get( 'row' , None )
    col = kwargs.get( 'col' , None )
    facet = sns.FacetGrid( df , hue=target , aspect=4 , row = row , col = col )
    facet.map( sns.kdeplot , var , fill=True )  # `fill` replaces the deprecated `shade` argument (seaborn >= 0.11)
    facet.set( xlim=( 0 , df[ var ].max() ) )
    facet.add_legend()

def plot_categories( df , cat , target , **kwargs ):
    row = kwargs.get( 'row' , None )
    col = kwargs.get( 'col' , None )
    facet = sns.FacetGrid( df , row = row , col = col )
    facet.map( sns.barplot , cat , target )
    facet.add_legend()

def plot_correlation_map( df ):
    corr = df.corr()
    _ , ax = plt.subplots( figsize =( 12 , 10 ) )
    cmap = sns.diverging_palette( 220 , 10 , as_cmap = True )
    _ = sns.heatmap(
        corr, 
        cmap = cmap,
        square=True, 
        cbar_kws={ 'shrink' : .9 }, 
        ax=ax, 
        annot = True, 
        annot_kws = { 'fontsize' : 12 }
    )

Load the Titanic dataset and combine the train and test sets so that both can be cleaned together.

In [None]:
train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way.
full = pd.concat([train, test], sort=False)

titanic = full.iloc[0:891,:]
full.shape
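
The combine-then-split pattern can be sketched on two made-up frames (the values below are toy data, not the Kaggle files). Without `ignore_index=True` the combined frame keeps each file's original row labels, so positional slicing with `iloc` is the reliable way to recover the two parts.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (hypothetical values).
toy_train = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1]})
toy_test = pd.DataFrame({"PassengerId": [4, 5]})

# Stack the rows; a column missing from one frame is filled with NaN.
toy_full = pd.concat([toy_train, toy_test], sort=False)

# Row labels repeat (0, 1, 2, 0, 1), so split by position, not by label.
n_train = len(toy_train)
train_part = toy_full.iloc[:n_train]
test_part = toy_full.iloc[n_train:]

print(toy_full.shape)                          # (5, 2)
print(test_part["Survived"].isnull().all())    # True
```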

In [None]:
full.columns

**VARIABLE DESCRIPTIONS:**

We've got a sense of our variables, their class type, and the first few observations of each. We know we're working with 1309 observations of 12 variables. To make things a bit more explicit since a couple of the variable names aren't 100% illuminating, here's what we've got to deal with:


**Variable Description**

 - PassengerId: Unique identifier of the passenger
 - Survived: Survived (1) or died (0)
 - Pclass: Passenger's class (1,2,3)
 - Name: Passenger's name - includes Title of Mr/Miss/Captain etc.
 - Sex: Passenger's sex 
 - Age: Passenger's age
 - SibSp: Number of siblings/spouses aboard
 - Parch: Number of parents/children aboard
 - Ticket: Ticket number
 - Fare: Fare - numerical amount
 - Cabin: Cabin - denoted by letter
 - Embarked: Port of embarkation

[More information on the Kaggle site](https://www.kaggle.com/c/titanic/data)

From the Kaggle site description, we are told that "Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class." Thus, we will make sure to keep an eye on any variables that identify these groups and include them in the final data analysis.

First, let's count the missing data. Age and Cabin are the main issues for this dataset (the missing Survived values are simply the test rows). We'll keep that in mind and look for ways to deal with them as we explore the dataset.

In [None]:
full.isnull().sum()

##### Let's first work on creating a Title category from Name. We can separate the passengers by title by extracting it from the Name column.

In [None]:
full.Name.head()

In [None]:
full['Title'] = full['Name'].apply(lambda x: x.split(',')[1]).apply(lambda x: x.split()[0])

In [None]:
full.Title.sample(10)

In [None]:
full.Title.unique().tolist()

By plotting Title against survival with our helper function, we can see that Title will be a useful predictor of survival: passengers titled Mr. died far more often, while Mrs./Miss./Master. survived at higher rates on average.

In [None]:
plot_subplots('Title', full)

What is the title 'the'? Let's fix that error.

In [None]:
full.loc[full.Title=='the']

In [None]:
# Select the row by its bad title rather than by a hard-coded position.
full.loc[full.Title == 'the', 'Title'] = "Countess"

In [None]:
full.iloc[759,:]
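
To see where the title 'the' comes from, the extraction can be replayed on two names in the format used by the dataset. The regex variant at the end is only a hypothetical alternative, not what this notebook uses; it returns titles without the trailing period, so the keys of any mapping applied afterwards would have to match that form.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Same logic as the notebook: text after the comma, then its first word.
titles = names.apply(lambda x: x.split(',')[1]).apply(lambda x: x.split()[0])
print(titles.tolist())  # ['Mr.', 'the']

# Hypothetical alternative: skip an optional leading "the" and capture
# the word that ends with a period.
clean = names.str.extract(r',\s*(?:the\s+)?([A-Za-z]+)\.', expand=False)
print(clean.tolist())   # ['Mr', 'Countess']
```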

By looking at a boxplot of Title vs Age, we can conclude that Title is a good variable to use to deal with the missing Age data. We will use it to impute Age later on.

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(16, 9)
sns.boxplot(x='Title', y='Age', data=full)

In [None]:
full.groupby("Title").Age.describe()

In [None]:
full.groupby("Title").Survived.describe()

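We can use a dictionary to normalize the common titles. Any title missing from the dictionary is labeled "Rare" afterwards. Note that this mapping groups all the female titles (Miss, Mlle, Ms, Mme) under Mrs.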

In [None]:
full.Title.value_counts()

In [None]:
Title_Dictionary = {
                    "Mme.":        "Mrs",
                    "Mlle.":       "Mrs",
                    "Ms.":         "Mrs",
                    "Mr." :        "Mr",
                    "Mrs." :       "Mrs",
                    "Miss." :      "Mrs",
                    "Master.":     "Master",
                    "Countess":    "Lady",
                    "Dona.":       "Lady",
                    "Lady.":       "Lady"
                    }

In [None]:
Mapped_titles = full.Title.map(Title_Dictionary)

In [None]:
Mapped_titles.fillna("Rare", inplace=True)

In [None]:
full['Titles_mapped'] = Mapped_titles

In [None]:
full.Titles_mapped.value_counts()

In [None]:
full.Titles_mapped.unique()
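
The map-then-fill step can be checked in isolation. This is a minimal sketch on a made-up Series, using only a subset of the dictionary above:

```python
import pandas as pd

toy_titles = pd.Series(["Mr.", "Miss.", "Dr.", "Master.", "Rev."])
toy_map = {"Mr.": "Mr", "Miss.": "Mrs", "Master.": "Master"}  # subset of Title_Dictionary

# Keys missing from the dictionary come back as NaN, which fillna labels "Rare".
mapped = toy_titles.map(toy_map).fillna("Rare")
print(mapped.tolist())  # ['Mr', 'Mrs', 'Rare', 'Master', 'Rare']
```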

In [None]:
plot_subplots('Titles_mapped', full)

In [None]:
target_columns = []
target_columns.append('Titles_mapped')

For the Ticket variable, we can observe an interesting phenomenon - duplicate tickets!

In [None]:
Ticket = pd.DataFrame(full.Ticket)

In [None]:
Ticket.sample(10)

In [None]:
Ticket.Ticket.value_counts()

### Add column to identify multiple ticket holders

In [None]:
Ticket['Count'] = Ticket.groupby('Ticket')['Ticket'].transform('count')

In [None]:
Ticket.sample(10)

In [None]:
full['Ticket_Count'] = Ticket.Count

In [None]:
full.Ticket_Count.head()
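
What `transform('count')` does is easier to see on a few made-up ticket numbers: unlike a plain aggregation, it returns one value per original row, so the result can be assigned straight back as a column.

```python
import pandas as pd

# Hypothetical ticket numbers, for illustration only.
toy = pd.DataFrame({"Ticket": ["A1", "B2", "A1", "C3", "A1"]})

# Each row receives the size of the group its ticket belongs to.
toy["Ticket_Count"] = toy.groupby("Ticket")["Ticket"].transform("count")
print(toy["Ticket_Count"].tolist())  # [3, 1, 3, 1, 3]
```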

In [None]:
plot_subplots('Ticket_Count', full)

### Seems like single-ticket holders are more likely to die alone... Let's keep this variable!

In [None]:
target_columns.append('Ticket_Count')

Let's look at the other missing data - Cabin

In [None]:
full.isnull().sum()

In [None]:
cabin = pd.DataFrame()

In [None]:
cabin['Cabin'] = full.Cabin

In [None]:
cabin.Cabin.value_counts()

We can suspect that the missing data means the passenger did not have a cabin. For now, let's fill in the missing data with 'U'. Later on we could also elect to take this variable out of our analysis if we think it isn't particularly predictive, since so much of it was missing anyway.

In [None]:
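# Assign the result back; an in-place fillna on an attribute-accessed column may not update the frame.
cabin['Cabin'] = cabin.Cabin.fillna("U")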

We can use regex to get the Cabin letter of the other passengers

In [None]:
import re

In [None]:
def findLetter(string, group):
    return re.match(r"([A-Z]{1})(\d*)", str(string)).group(group)

In [None]:
re.match(r"([A-Z]{1})(\d*)", 'U').group(1)

In [None]:
cabin.Cabin.sample(10)

In [None]:
cabin['Cabin_Letter']  = cabin.Cabin.apply(lambda x: findLetter(x,1))   

In [None]:
cabin.sample(10)
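
As a quick check of the regex on a few cabin strings (illustrative values, including entries that hold several cabins), `re.match` anchors at the start of the string, so only the first deck letter is returned. The `.str[0]` line is a hypothetical vectorised shortcut that gives the same result for strings of this shape.

```python
import re
import pandas as pd

toy_cabins = pd.Series(["C85", None, "F G73", "B57 B59 B63"])
filled = toy_cabins.fillna("U")

# Group 1 is the leading capital letter; the digits in group 2 are ignored here.
letters = filled.apply(lambda s: re.match(r"([A-Z])(\d*)", s).group(1))
print(letters.tolist())  # ['C', 'U', 'F', 'B']

# Shortcut: take the first character of every string.
letters_vec = filled.str[0]
```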

In [None]:
cabin['Survived'] = full['Survived']

In [None]:
cabin.head(10)

In [None]:
plot_subplots('Cabin_Letter', cabin)

In [None]:
target_columns.append('Cabin_Letter')

In [None]:
full['Cabin_Letter'] = cabin.Cabin_Letter

In [None]:
full.drop(columns='Cabin', inplace=True)

##### Family Size - this is a variable that we can engineer by combining the Siblings/Spouses and Parents/Children columns (plus one for the passenger). We can hypothesize that large families may be able to help each other and survive better.

In [None]:
family = pd.DataFrame()

In [None]:
family["FamilySize"] = full.SibSp + full.Parch + 1

In [None]:
family.sample(10)

In [None]:
family.describe()

In [None]:
family.FamilySize.value_counts()

In [None]:
family['Survived'] = full.Survived

Our hypothesis was about large families, but the clearest pattern in the plot is that a passenger with NO family aboard was much more likely to die, which also makes sense.

In [None]:
plot_subplots('FamilySize', family)

In [None]:
target_columns.append('FamilySize')

In [None]:
full['FamilySize'] = family.FamilySize

Now let's look at the Fare variable

In [None]:
full[full.Fare.isnull()]

By plotting Title and Embarked vs Fare, we can see that these two variables should be sufficient to impute the missing Fare value.

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(16, 9)
sns.barplot(x='Titles_mapped', y='Fare', data=full, hue="Embarked")

In [None]:
Mr_S_Fare_Mean = full[(full['Titles_mapped']=='Mr') & (full['Embarked']=='S')]['Fare'].mean()

In [None]:
Mr_S_Fare_Mean

In [None]:
full.loc[full.PassengerId==1044,'Fare'] = Mr_S_Fare_Mean
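
The same subgroup-mean imputation, sketched on a made-up frame so the arithmetic is easy to follow. `mean()` skips missing values, so the row being imputed does not affect the estimate.

```python
import numpy as np
import pandas as pd

# Toy values, not taken from the dataset.
toy = pd.DataFrame({
    "Titles_mapped": ["Mr", "Mr", "Mr", "Mrs", "Mr"],
    "Embarked":      ["S",  "S",  "S",  "S",   "C"],
    "Fare":          [8.0,  12.0, np.nan, 30.0, 50.0],
})

# Mean fare of the passengers who share the title and port of the missing row.
mask = (toy["Titles_mapped"] == "Mr") & (toy["Embarked"] == "S")
group_mean = toy.loc[mask, "Fare"].mean()
toy.loc[toy["Fare"].isnull(), "Fare"] = group_mean
print(group_mean)  # 10.0
```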

In [None]:
target_columns.append('Fare')

In [None]:
target_columns

By looking at the data for Fare, we can also observe that there are duplicate Fare values. Maybe this has to do with the Ticket duplicates?

In [None]:
full.Fare.value_counts()

In [None]:
full.sort_values('Ticket').head(10)

As shown above, passengers who share a Ticket also seem to share one grouped amount in the Fare variable. We can adjust it to a fare per person by dividing by the Ticket_Count variable we created earlier.

In [None]:
target_columns.remove('Fare')

In [None]:
full['Fare_adjusted'] = full.Fare / full.Ticket_Count
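
A small worked example of the adjustment, assuming (as the sorted table suggests) that Fare holds the total for the whole ticket. The numbers are made up:

```python
import pandas as pd

toy = pd.DataFrame({
    "Ticket": ["A1", "A1", "B2"],
    "Fare":   [30.0, 30.0, 7.5],   # A1 is one shared ticket costing 30 in total
})
toy["Ticket_Count"] = toy.groupby("Ticket")["Ticket"].transform("count")

# Two people on ticket A1 -> 15 each; the single traveller keeps the full 7.5.
toy["Fare_adjusted"] = toy["Fare"] / toy["Ticket_Count"]
print(toy["Fare_adjusted"].tolist())  # [15.0, 15.0, 7.5]
```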

In [None]:
target_columns.append('Fare_adjusted')

Of course, as the exploration of the Titles variable suggested, females seem to have a much higher chance of survival.

In [None]:
plot_subplots('Sex', full)

In [None]:
target_columns.append('Sex')

In [None]:
full.isnull().sum()

We still have a lot of missing data! Let's work on Age. In particular, we can look at the Titles and Passenger Class for more info

In [None]:
full[full.Age.isnull() == False].groupby(['Titles_mapped', 'Pclass']).describe()

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(16, 9)
sns.boxplot(x='Titles_mapped', y='Age', data=full, hue="Pclass")

Evidently, further separating by passenger class will give better estimates of the missing Age values.

In [None]:
full[full.Age.isnull()].groupby(['Titles_mapped', 'Pclass']).describe()

In [None]:
def get_Age_mean(title, pclass):
    return full.loc[(full.Age.isnull() == False) & (full.Titles_mapped==title) & (full.Pclass == pclass)].Age.mean()

In [None]:
get_Age_mean('Master', 3)

In [None]:
full.loc[(full.Age.isnull()) & (full.Titles_mapped == 'Master'), 'Age'] = get_Age_mean('Master', 3)

In [None]:
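for pclass in [1,2,3]:
    # Restrict each fill to its own passenger class so the class-specific mean is used.
    full.loc[(full.Age.isnull()) & (full.Titles_mapped == 'Mr') & (full.Pclass == pclass), 'Age'] = get_Age_mean('Mr', pclass)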

In [None]:
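for pclass in [1,2,3]:
    # Restrict each fill to its own passenger class so the class-specific mean is used.
    full.loc[(full.Age.isnull()) & (full.Titles_mapped == 'Mrs') & (full.Pclass == pclass), 'Age'] = get_Age_mean('Mrs', pclass)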

In [None]:
full.loc[(full.Age.isnull()) & (full.Titles_mapped == 'Rare'), 'Age'] = get_Age_mean('Rare', 1)
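
A per-group mean imputation like this can also be written without explicit loops. The following is a sketch of that alternative on toy data, not the code used in this notebook. One caveat: a group with no known ages at all would stay NaN and need a fallback.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration.
toy = pd.DataFrame({
    "Titles_mapped": ["Mr", "Mr", "Mr", "Mr", "Mrs", "Mrs"],
    "Pclass":        [1,    1,    3,    3,    1,     1],
    "Age":           [40.0, np.nan, 20.0, np.nan, 35.0, np.nan],
})

# transform("mean") returns, for every row, the mean age of its (title, class) group.
group_mean = toy.groupby(["Titles_mapped", "Pclass"])["Age"].transform("mean")
toy["Age"] = toy["Age"].fillna(group_mean)
print(toy["Age"].tolist())  # [40.0, 40.0, 20.0, 20.0, 35.0, 35.0]
```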

In [None]:
target_columns.append('Age')

In [None]:
full[full.Age.isnull() == False].groupby(['Titles_mapped', 'Pclass']).describe()

For Embarked, we can just use the mode as it is by far the most common

In [None]:
full.Embarked.value_counts()

In [None]:
full[full.Embarked.isnull()]

In [None]:
full.Embarked.mode()[0]

In [None]:
full.loc[full.Embarked.isnull(), 'Embarked'] = full.Embarked.mode()[0]

In [None]:
full.loc[full.Embarked.isnull()]

In [None]:
full.isnull().sum()

In [None]:
target_columns.append('Embarked')

In [None]:
target_columns

In [None]:
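# Take an explicit copy so that adding columns below does not trigger SettingWithCopyWarning.
fullfinal = full[target_columns].copy()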

In [None]:
fullfinal['Pclass'] = full.Pclass

In [None]:
fullfinal.dtypes

##### Extra feature engineering - credit to Konstantin's kernel (https://www.kaggle.com/konstantinmasich/titanic-0-82-0-83), which also has some detailed discussion and tips, so definitely check it out. It is likely that families/groups would survive or die together, and we can find that out based on their last names and fares.

In [None]:
full['Last_Name'] = full['Name'].apply(lambda x :str.split(x,',')[0])

In [None]:
full.Last_Name.sample(10)

In [None]:
DEFAULT_SURVIVAL_VALUE = 0.5

In [None]:
full['Family_Survival'] = DEFAULT_SURVIVAL_VALUE

In [None]:
for grp, grp_df in full[['Survived', 'Name', 'Last_Name', 'Fare', 'Ticket', 'PassengerId', 'Age',]].groupby(['Last_Name', 'Fare']):
    if (len(grp_df) != 1):
        #found Family group
        for ind, row in grp_df.iterrows():
            smax = grp_df.drop(ind)['Survived'].max()
            smin = grp_df.drop(ind)['Survived'].min()
            passID = row['PassengerId']
            if (smax == 1.0):
                full.loc[full['PassengerId'] == passID, 'Family_Survival'] = 1
            elif (smin == 0.0):
                full.loc[full['PassengerId'] == passID, 'Family_Survival'] = 0
print("Number of passengers with family survival information:", full.loc[full['Family_Survival']!=0.5].shape[0])

In [None]:
for _, grp_df in full.groupby('Ticket'):
    if (len(grp_df) != 1):
        for ind, row in grp_df.iterrows():
            if (row['Family_Survival'] == 0) | (row['Family_Survival'] == 0.5):
                smax = grp_df.drop(ind)['Survived'].max()
                smin = grp_df.drop(ind)['Survived'].min()
                passID = row['PassengerId']
                if (smax == 1.0):
                    full.loc[full['PassengerId'] == passID, 'Family_Survival'] = 1
                elif (smin == 0.0):
                    full.loc[full['PassengerId'] == passID, 'Family_Survival'] = 0
print("Number of passenger with family/group survival information: " 
      +str(full[full['Family_Survival']!=0.5].shape[0]))

In [None]:
full.Family_Survival.describe()
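
The rule behind Family_Survival can be condensed as follows: for each passenger, look only at the other members of the group; if any of them is known to have survived the value is 1, otherwise if any is known to have died it is 0, and with no information it stays at 0.5. Below is a minimal sketch on toy data. `group_survival` is a hypothetical helper written for this illustration, and NaN stands in for test-set passengers whose outcome is unknown.

```python
import numpy as np
import pandas as pd

def group_survival(df, keys, default=0.5):
    """Hypothetical helper: score each row from the OTHER members of its group."""
    out = pd.Series(default, index=df.index)
    for _, grp in df.groupby(keys):
        if len(grp) == 1:
            continue  # nobody else in the group, keep the default
        for ind in grp.index:
            others = grp.drop(ind)["Survived"]
            if others.max() == 1.0:      # someone else survived
                out.loc[ind] = 1.0
            elif others.min() == 0.0:    # someone else is known to have died
                out.loc[ind] = 0.0
    return out

toy = pd.DataFrame({
    "Last_Name": ["Smith", "Smith", "Jones", "Jones", "Brown"],
    "Fare":      [20.0,    20.0,    15.0,    15.0,    9.0],
    "Survived":  [1.0,     np.nan,  0.0,     np.nan,  1.0],
})
toy["Family_Survival"] = group_survival(toy, ["Last_Name", "Fare"])
print(toy["Family_Survival"].tolist())  # [0.5, 1.0, 0.5, 0.0, 0.5]
```

The sketch uses a unique row index, so `drop(ind)` removes exactly one row per step.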

In [None]:
fullfinal['Family_Survival'] = full.Family_Survival

Now that we have our variables, we will need to encode the categorical variables. As one-hot encoding may lead to sparsity which virtually ensures that continuous variables are assigned higher feature importance, we will use label-encoding instead. 
Source: https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/

In [None]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
X = fullfinal.copy()
for col in X.columns:
    # Label-encode every string-typed column.
    if X[col].dtype == 'object':
        X[col] = le.fit_transform(X[col])
X.head()
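
The difference between the two encodings is easy to see on a tiny example. Label encoding keeps a single column but assigns integers in sorted order of the labels, which imposes an arbitrary ordering; one-hot encoding adds one column per category.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

ports = pd.Series(["S", "C", "Q", "S"], name="Embarked")

# Classes are sorted alphabetically: C -> 0, Q -> 1, S -> 2.
codes = LabelEncoder().fit_transform(ports)
print(codes.tolist())   # [2, 0, 1, 2]

# One-hot: one indicator column per port.
one_hot = pd.get_dummies(ports)
print(one_hot.shape)    # (4, 3)
```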

In [None]:
X['Survived'] = full.Survived

In [None]:
plot_correlation_map(X)

In [None]:
full_bins = fullfinal.copy()

In [None]:
sns.histplot(full_bins.Age, kde=True)  # distplot is deprecated in recent seaborn releases

In [None]:
sns.histplot(full_bins.Fare_adjusted, kde=True)  # distplot is deprecated in recent seaborn releases

In [None]:
full_bins['AgeBin'] = pd.qcut(full_bins['Age'], 5)

In [None]:
full_bins['FareBin'] = pd.qcut(full_bins['Fare_adjusted'], 6)
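
`pd.qcut` cuts at quantiles, so each bin receives roughly the same number of rows regardless of how skewed the values are. A minimal sketch on the integers 1 to 10; passing `labels=False` is an optional shortcut that returns the integer bin codes directly:

```python
import pandas as pd

values = pd.Series(range(1, 11))

# Five quantile bins -> two values per bin.
bins = pd.qcut(values, 5)
print(bins.value_counts().tolist())  # [2, 2, 2, 2, 2]

# Integer codes instead of Interval labels.
codes = pd.qcut(values, 5, labels=False)
print(codes.tolist())                # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
```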

In [None]:
from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()

In [None]:
full_bins['AgeBin_Code'] = label.fit_transform(full_bins.AgeBin)

In [None]:
full_bins['FareBin_Code'] = label.fit_transform(full_bins.FareBin)

In [None]:
full_bins['CabinBin_Code'] = label.fit_transform(full_bins.Cabin_Letter)

In [None]:
full_bins['EmbarkedBin_Code'] = label.fit_transform(full_bins.Embarked)

In [None]:
full_bins.head()

In [None]:
full_bin_final = full_bins.drop(columns=['Titles_mapped', 'Cabin_Letter','Fare_adjusted', 'Age', 'Embarked', 'AgeBin', 'FareBin'] )

In [None]:
full_bin_final.head()

In [None]:
full_bin_final.Sex = label.fit_transform(full_bin_final.Sex)

In [None]:
full_bin_final.head()

In [None]:
full_train = full_bin_final.copy()

In [None]:
full_train.describe()

In [None]:
full_train.columns

In [None]:
full_train.dtypes

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

In [None]:
def compute_score(clf, X, y, scoring='accuracy'):
    xval = cross_val_score(clf, X, y, cv = 5, scoring=scoring)
    return np.mean(xval)
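
For reference, this is what the helper computes, shown on synthetic data from `make_classification` (an assumption for illustration; the Titanic features are not used here). `cross_val_score` returns one score per fold, and the helper averages them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

clf = RandomForestClassifier(n_estimators=20, random_state=0)
fold_scores = cross_val_score(clf, X_toy, y_toy, cv=5, scoring="accuracy")

mean_score = np.mean(fold_scores)  # the single number compute_score would return
print(len(fold_scores), round(mean_score, 3))
```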

In [None]:
def recover_train_test_target(df):
    # Rows 0-890 are the training set; the remaining rows are the Kaggle test set.
    targets = pd.read_csv('../input/train.csv', usecols=['Survived'])['Survived'].values
    train = df.iloc[:891]
    test = df.iloc[891:]
    
    return train, test, targets

In [None]:
train, test, targets = recover_train_test_target(full_train)

In [None]:
def checkFeatureImportance(dataset):
    train, test, targets = recover_train_test_target(dataset)
    clf = RandomForestClassifier(n_estimators=50, max_features='sqrt')
    clf = clf.fit(train, targets)
    
    features = pd.DataFrame()
    features['feature'] = train.columns
    features['importance'] = clf.feature_importances_
    features.sort_values(by=['importance'], ascending=True, inplace=True)
    features.set_index('feature', inplace=True)
    
    features.plot(kind='barh', figsize=(10, 10))

In [None]:
checkFeatureImportance(full_train)

By looking at the feature importance plot above, we can revise the variables that we will choose for our predictions:

Embarked - as its feature importance is low, and it is unlikely that port of origin should affect survival, we will leave this out.

Cabin - there was a lot of missing data for this variable, so it may be best that we exclude this variable as well

Ticket Count - as we have already included this variable by adjusting Fare with it, we can exclude it

Pclass - we will choose to keep this variable as it should still be an important predictor (high correlation with Survival etc.)

In [None]:
full_train.columns

In [None]:
full_train2 = full_train.drop(columns=['EmbarkedBin_Code', 'CabinBin_Code', 'Ticket_Count'])

In [None]:
checkFeatureImportance(full_train2)

In [None]:
full_train2.head()

In [None]:
full_train2.describe()

Scale features for better prediction:

In [None]:
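from sklearn.preprocessing import StandardScaler
# Rebuild the split from the reduced feature set chosen above.
train, test, targets = recover_train_test_target(full_train2)
std_scaler = StandardScaler()
train = std_scaler.fit_transform(train)
# Reuse the training-set statistics; refitting on test would scale it differently.
test = std_scaler.transform(test)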
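
Fitting the scaler on the training data and reusing it on the test data gives different numbers than fitting a second time on the test data. A small sketch on made-up values shows the size of the difference:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_toy = np.array([[0.0], [10.0]])   # mean 5, std 5
test_toy = np.array([[5.0], [15.0]])    # mean 10, std 5

# Learn mean/std from the training data only, then apply them to the test data.
scaler = StandardScaler().fit(train_toy)
scaled_test = scaler.transform(test_toy)
print(scaled_test.ravel())   # [0. 2.]

# Refitting on the test data maps the same rows to different values.
refit_test = StandardScaler().fit_transform(test_toy)
print(refit_test.ravel())    # [-1.  1.]
```

With the first approach, a given raw value always maps to the same scaled value, which is what a model fitted on the scaled training data expects.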

GridSearchCV to find best parameters:

In [None]:
# turn run_gs to True if you want to run the gridsearch again.
run_gs = True

if run_gs:
    parameter_grid = {
                 'max_depth' : [8,10,12],
                 'n_estimators': [45,47,48,50],
                 'max_features': ['sqrt', 'log2'],  # 'auto' was an alias of 'sqrt' and is rejected by recent scikit-learn
                 'min_samples_split': [2, 3, 10],
                 'min_samples_leaf': [1, 3, 10],
                 'bootstrap': [True, False],
                 }
    forest = RandomForestClassifier()
    cross_validation = StratifiedKFold(n_splits=5)

    grid_search = GridSearchCV(forest,
                               scoring='accuracy',
                               param_grid=parameter_grid,
                               cv=cross_validation,
                               verbose=1
                              )

    grid_search.fit(train, targets)
    model = grid_search
    parameters = grid_search.best_params_

    print('Best score: {}'.format(grid_search.best_score_))
    print('Best parameters: {}'.format(grid_search.best_params_))
    
else: 
    parameters = {'bootstrap': False, 'min_samples_leaf': 3, 'n_estimators': 50, 
                  'min_samples_split': 10, 'max_features': 'sqrt', 'max_depth': 6}
    
    model = RandomForestClassifier(**parameters)
    model.fit(train, targets)
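
A scaled-down sketch of the same search on synthetic data (the grid values and data are assumptions chosen to run quickly). Every combination in the grid is fitted once per fold, so the cost grows with the product of the list lengths. Because `refit=True` is the default, the fitted `GridSearchCV` object can be used directly for prediction, which is why `model = grid_search` works.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X_toy, y_toy = make_classification(n_samples=120, n_features=5, random_state=0)

# 2 x 2 = 4 combinations, each evaluated on 3 folds.
grid = {"max_depth": [2, 4], "n_estimators": [10, 20]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=grid,
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=3),
)
search.fit(X_toy, y_toy)

print(search.best_params_)
# predict() delegates to the best estimator, refitted on all of the data.
predictions = search.predict(X_toy[:5])
```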

We can also use KNN, which yielded better results.

In [None]:
from sklearn.neighbors import KNeighborsClassifier 

# turn run_gs to True if you want to run the gridsearch again.
run_gs = True

if run_gs:
    parameter_grid = {
                 'n_neighbors' : [6,7,8,9,10,11,12,14,16,18,20,22],
                 'algorithm': ['auto'],
                 'weights': ['uniform', 'distance'],
                 'leaf_size': list(range(1,50,5)),
                 }
    KNN = KNeighborsClassifier()
    cross_validation = StratifiedKFold(n_splits=10)

    grid_search = GridSearchCV(KNN,
                               scoring='accuracy',
                               param_grid=parameter_grid,
                               cv=cross_validation,
                               verbose=1
                              )

    grid_search.fit(train, targets)
    model = grid_search
    parameters = grid_search.best_params_

    print('Best score: {}'.format(grid_search.best_score_))
    print('Best parameters: {}'.format(grid_search.best_params_))
    
else: 
    # Fallback without a grid search: KNN with scikit-learn's default
    # hyperparameters (the tuned values are not recorded in this notebook).
    model = KNeighborsClassifier()
    model.fit(train, targets)

In [None]:
model

Export results from best estimator to csv for submitting to Kaggle.

In [None]:
def to_Kaggle_csv(model, filename):
    output = model.predict(test).astype(int)
    df_output = pd.DataFrame()
    aux = pd.read_csv('../input/test.csv')  # same input folder the data was loaded from
    df_output['PassengerId'] = aux['PassengerId']
    df_output['Survived'] = output
    df_output[['PassengerId','Survived']].to_csv(filename, index=False)

In [None]:
to_Kaggle_csv(model, 'Family_Survival_KNN_GridSearch.csv')
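
The submission file needs exactly two columns, `PassengerId` and `Survived`, with no index column. A minimal sketch of that layout, writing to an in-memory buffer instead of disk (the predictions below are placeholders, not model output):

```python
import io
import pandas as pd

passenger_ids = pd.Series([892, 893, 894])
placeholder_preds = [0, 1, 0]   # made-up predictions

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": placeholder_preds})

# index=False keeps the row labels out of the file.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
lines = buffer.getvalue().splitlines()
print(lines[0])  # PassengerId,Survived
```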

## Score - 0.81339