##### Predicting survivors on Titanic data set


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
train = pd.read_csv("../input/train.csv")

### Exploratory data analysis

In [None]:
train.info()

In [None]:
train.describe()

In [None]:
for column in train.select_dtypes(exclude=[np.number]).columns.tolist():
    print("{}: {} unique values".format(column, train[column].nunique()))  

In [None]:
train.head()

The prediction task is a classification problem, in which predictions will be made on whether a passenger survived.
The output variable is 'Survived', where a value of 1 denotes that a passenger survived, while a value of 0 denotes that the passenger perished. The remaining columns are potential input variables.

For the input variables:
- 'Ticket' and 'PassengerId' are identifiers rather than passenger attributes and are not expected to be useful for prediction. They will be excluded from the modelling.
- 'Fare' would not directly affect a passenger's chance of survival at the time of the incident and will be excluded as well. (Fare is probably associated with 'Pclass' and cabin location, which are likely to affect survival, but these columns are already available in the data, so 'Fare' would add little as an input variable.)
- Data types for categorical variables such as 'Survived' and 'Pclass' will be converted from numerical to 'str'.
- There are missing values for 'Age', 'Cabin' and 'Embarked'. These will need to be filled in later.
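
The checks behind these points can be sketched on a toy frame. The rows below are invented for illustration and are not taken from 'train.csv'; the sketch only shows how identifier uniqueness and missing-value counts might be verified rather than assumed.

```python
import numpy as np
import pandas as pd

# Toy rows, invented for illustration only (not the Kaggle data)
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Ticket": ["T-001", "T-002", "T-003", "T-003"],
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
})

# is_unique is True only if every value in the column is distinct
id_is_unique = toy["PassengerId"].is_unique
ticket_is_unique = toy["Ticket"].is_unique

# Number of missing values per column
missing_counts = toy.isna().sum()

print(id_is_unique, ticket_is_unique)
print(missing_counts)
```

Running the same two calls on the real `train` frame would show which columns behave as identifiers and which need imputation.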


In [None]:
for column in ['Survived', 'Pclass']:
    train[column] = train[column].astype('str')

In [None]:
#visualize how categorical input variables affect survival
cols = ['Pclass', 'Sex', 'SibSp','Parch', 'Embarked']
for col in cols:
    plt.figure()
    sns.countplot(x=col, data = train, hue="Survived")
    plt.title(col)

In [None]:
# visualize how numerical input variables affect survival
for status in train["Survived"].unique():
    plt.hist(x='Age',data=train[(train['Survived']==status) & (~train['Age'].isna())], alpha=0.5, label=status, bins=30)
    plt.title("Age distribution")
    plt.legend(title="Survived")

From the visualizations, we can infer that: 
- Female passengers and those in the higher classes (Pclass 1 and 2) have higher proportions of survivors.
- Those travelling without family (SibSp or Parch = 0) and those who embarked at 'S' have lower proportions of survivors.
- Passengers aged 0-5 years have a higher proportion of survivors than other ages.
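
Count plots show absolute numbers, so proportions have to be read off by eye. As a rough sketch, a row-normalised crosstab gives the survivor share per category directly. The rows below are made up for illustration.

```python
import pandas as pd

# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Survived": ["1", "1", "0", "0", "1", "0"],
})

# normalize="index" makes each row sum to 1, so the "1" column is the
# share of survivors within each category
rates = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")
print(rates)
```

The same call with `train["Pclass"]` or `train["Embarked"]` in place of `toy["Sex"]` would quantify the other observations.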

### Data cleaning/wrangling
#### SibSp/Parch
From the preliminary data analysis, passengers travelling without SibSp/Parch tend to have lower survival rates. A new column will be created to indicate whether a passenger has any family (SibSp or Parch) on board.

In [None]:
def transform_family_info(df):
    df["HasSibSp"] = 1
    df["HasParch"] = 1
    df["HasFamily"] = 1
    df.loc[df["SibSp"]==0,"HasSibSp"]=0
    df.loc[df["Parch"]==0,"HasParch"]=0
    df.loc[(df["HasSibSp"]==0)&(df["HasParch"]==0), "HasFamily"]=0
    
    return df


train = transform_family_info(train)

for col in ["HasSibSp","HasParch","HasFamily"]:
    plt.figure()
    sns.countplot(x=col, data = train, hue="Survived")
    plt.title(col)

There is a marked difference in survival rates for passengers with no family on board.
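
For reference, the same three flags can be derived with boolean comparisons instead of `.loc` assignments. This is an alternative sketch on toy values, not the function used in this notebook.

```python
import pandas as pd

# Toy counts, invented for illustration only
toy = pd.DataFrame({"SibSp": [0, 1, 0, 3], "Parch": [0, 0, 2, 1]})

# A count greater than zero becomes 1, otherwise 0
toy["HasSibSp"] = (toy["SibSp"] > 0).astype(int)
toy["HasParch"] = (toy["Parch"] > 0).astype(int)

# A passenger has family on board if either count is non-zero
toy["HasFamily"] = ((toy["SibSp"] > 0) | (toy["Parch"] > 0)).astype(int)
print(toy)
```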

#### Age 
The 'Name' column contains titles such as 'Mr' and 'Mrs'. This information may be useful for a differentiated approach to filling the missing 'Age' values. For example, "Master" was a form of address for younger males in the early 20th century. We would also expect titles such as "Dr", "Col" and "Sir" to refer to more mature passengers.

'Title' was extracted from the "Name" column and missing 'Age' values were filled based on the median age for the corresponding title. This avoids overestimating 'Age' for certain groups such as 'Master'.

In [None]:
train['title'] = train['Name'].apply(lambda x:x.split(',')[1].split('.')[0].strip()) #extract titles from names

In [None]:
sns.boxplot(x="Age", y="title", data=train) #visualize age distributions by title

The above figure confirms that there is variation in age distributions for different titles. Missing 'Age' values will be filled using the median age for the corresponding titles.
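
The idea can be sketched compactly on toy data with a regular expression and a group-wise transform. The names and ages below are invented, and the column name `Age_filled` is hypothetical; the notebook's own implementation follows in the next cell.

```python
import numpy as np
import pandas as pd

# Toy passengers, invented for illustration only
toy = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Roe, Mr. Richard", "Doe, Mrs. Jane",
             "Roe, Master. Tim", "Poe, Mr. Allan", "Doe, Master. Sam",
             "Moe, Rev. Paul"],
    "Age": [30.0, 40.0, 38.0, 2.0, np.nan, np.nan, np.nan],
})

# The title is the text between the comma and the first full stop
toy["title"] = toy["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Median age per title, broadcast back to every row of that title
title_median = toy.groupby("title")["Age"].transform("median")

# Fill from the title median first; titles with no known ages
# fall back to the overall median
toy["Age_filled"] = toy["Age"].fillna(title_median).fillna(toy["Age"].median())
print(toy[["title", "Age", "Age_filled"]])
```

Here the missing 'Mr' age takes the 'Mr' median, the missing 'Master' age takes the 'Master' median, and 'Rev', which has no known age, takes the overall median.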

In [None]:
age_medians  = train.pivot_table(columns="title", aggfunc='median', values="Age") #get median ages
age_medians['overall'] = train['Age'].median() #get overall median age for the entire data set. this acts as a default value in case new titles are present in test data

def fill_age(df): #fill NA values for 'Age' column based on title 
    # look up the median age for each row's title; titles without a median
    # (e.g. new titles in the test data) fall back to the overall median
    medians = df['title'].map(age_medians.loc['Age']).fillna(age_medians.loc['Age','overall'])
    # return a new frame instead of filling in place on a column selection
    return df.assign(Age=df['Age'].fillna(medians))

train = fill_age(train)

Based on the distribution of survivors by age, passengers aged 0-5 tend to have a much higher survival rate. A new feature was created to denote whether a passenger belonged to this age group.

In [None]:
def get_age_group(df):
    
    df['Age5_orLess']=0
    df.loc[df['Age']<=5,'Age5_orLess']=1
    
    return df

train = get_age_group(train)

#### Cabin

According to [this Wikipedia article](https://en.wikipedia.org/wiki/Sinking_of_the_RMS_Titanic#%22Iceberg_right_ahead!%22_(23:39)), the collision happened close to midnight, so most passengers had gone to bed. "Cabin" would be a good indication of where each passenger was at the time of collision, which would affect their chances of survival (e.g. through access to lifeboats).

'Cabin_Deck' and "Cabin_Number" were extracted from "Cabin". Furthermore, according to [Titanic deckplans](https://www.encyclopedia-titanica.org/titanic-deckplans/), even-numbered cabins are located on one side of the ship and odd-numbered ones on the other. A 'Cabin_Number_Loc' column was created to reflect this.



In [None]:
def get_cabin_info(df):
    
    #extract Cabin Deck and Number
    # deck letter is the first character; rows without a Cabin value stay NaN
    df['Cabin_Deck'] = df['Cabin'].str[0]
    # retain only the first cabin number if an entry lists multiple cabins
    df['Cabin_Number'] = pd.to_numeric(df['Cabin'].str[1:].str.split(' ').str[0], errors='coerce')
    # label cabins by parity; missing cabin numbers stay NaN
    df['Cabin_Number_Loc'] = (df['Cabin_Number'] % 2).map({0.0: "even", 1.0: "odd"})
    
    return df

train = get_cabin_info(train)

In [None]:
#visualize number of survivors by cabin deck
sns.countplot(x='Cabin_Deck',data=train[~train['Cabin_Deck'].isna()],hue="Survived")

In [None]:
# 'T' does not match the lettered decks A-G, so that observation will be set to NaN
train['Cabin_Deck'] = train['Cabin_Deck'].replace("T", np.nan)

In [None]:
#visualize number of survivors by cabin number
for status in train["Survived"].unique():
    plt.hist(x='Cabin_Number',data=train[(train['Survived']==status) & (~train['Cabin_Number'].isna())], alpha=0.5, label=status, bins=30)
    plt.title("Cabin Number distribution")
    plt.legend(title="Survived")

In [None]:
#visualize number of survivors by Cabin_Number_Loc
sns.countplot(x='Cabin_Number_Loc',data=train[~train['Cabin_Number_Loc'].isna()],hue="Survived")

In [None]:
# number of missing Cabin values
train[['Cabin','Cabin_Deck','Cabin_Number','Cabin_Number_Loc']].isna().sum()

Based on available data, cabin location appears to have an impact on survival likelihood. 

Decks A and G had lower proportions of survivors compared to other decks. 

Odd numbered cabins had a higher proportion of survivors compared to even numbered cabins. 

However, over 75% of 'Cabin_Deck' and 'Cabin_Number_Loc' values are missing. With that much missing data, these input variables will not be used to train the model.
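
As a small sketch of how such a decision could be made programmatically, the fraction of missing values per column can be compared against a cut-off. The toy values and the 0.5 threshold below are assumptions for illustration, not figures from this analysis.

```python
import numpy as np
import pandas as pd

# Toy columns, invented for illustration only
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Cabin_Deck": [np.nan, "C", np.nan, np.nan],
})

# isna().mean() gives the fraction of missing values in each column
missing_fraction = toy.isna().mean()

# Hypothetical cut-off: flag columns with more than half their values missing
threshold = 0.5
too_sparse = missing_fraction[missing_fraction > threshold].index.tolist()
print(missing_fraction)
print(too_sparse)
```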

#### Embarked

There were 2 missing values for 'Embarked', which will be filled with the mode (the most frequent port).
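
One detail worth keeping in mind when filling with the mode: `Series.mode()` returns a Series (there can be ties), so a single value has to be picked from it before it is passed to `fillna`. A minimal sketch on invented values:

```python
import numpy as np
import pandas as pd

# Toy port codes, invented for illustration only
embarked = pd.Series(["S", "C", "S", np.nan, "Q", np.nan])

# mode() returns a Series, so take its first entry to get a scalar
most_common = embarked.mode().iloc[0]

# Filling with a scalar replaces every missing value
filled = embarked.fillna(most_common)
print(filled.tolist())
```

Passing the Series itself to `fillna` would align on the index and could leave missing rows untouched.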

In [None]:
# mode() returns a Series; take its first entry so every missing row is filled
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])

In [None]:
train.info()

### Model Training



Two algorithms were considered: random forest, which is less prone to overfitting than a single decision tree; and logistic regression, which is a relatively robust classifier with interpretable results.

The labelled 'train.csv' data will be split into a training set and a test set:
- Training set: 5-fold stratified cross-validation with GridSearchCV will be used to train and tune the models. The best model will be selected based on cross-validation scores.
- Test set (labelled): The selected model's predictive performance will be evaluated using the held-out labelled test set. 

Subsequently, the selected model will be used to make predictions on the unlabelled 'test.csv' data.
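
The overall workflow can be sketched end to end on synthetic data. The data from `make_classification` and the two-value grid for `C` are stand-ins chosen for brevity; they are assumptions for illustration and not the settings used below.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for the labelled data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out 30% as a labelled test set that is not touched during tuning
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# shuffle=True is required for random_state to be accepted by StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(LogisticRegression(max_iter=500), {"C": [0.1, 1.0]},
                      cv=cv, scoring="accuracy")
search.fit(X_tr, y_tr)

cv_score = search.best_score_             # mean cross-validated accuracy of the best setting
holdout_score = search.score(X_te, y_te)  # accuracy of the refitted best model on the held-out split
print(search.best_params_, cv_score, holdout_score)
```

Model selection uses only `cv_score`; `holdout_score` is computed once at the end so that it remains an unbiased estimate.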

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score, precision_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

In [None]:
#include only variables of interest
variables = ["Pclass","Sex","Age","SibSp","Parch", "Embarked", "HasSibSp", "HasParch", "Age5_orLess"]
cat_var = ["Sex","Embarked","Pclass"]

X = train[variables]

X = pd.get_dummies(X, columns=cat_var, drop_first=True)
y = train["Survived"]

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=43)


#### Random Forest

In [None]:
def randomforestclassifier(X,y):
    rf = RandomForestClassifier(random_state=77)

    params = {'n_estimators': np.arange(100,500,100),'min_samples_split':np.arange(2,30,4),
              'criterion':["gini","entropy"]}

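    # shuffle=True is needed for random_state to take effect (recent scikit-learn versions raise an error otherwise)
    rf_model_cv = GridSearchCV(rf, params, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=77), scoring='accuracy')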
    rf_model_cv.fit(X,y)
    
    print("Cross validation score:{:.3f}".format(rf_model_cv.best_score_))
    print("Best params:{}".format(rf_model_cv.best_params_))

    return rf_model_cv

rfc = randomforestclassifier(X_train,y_train)

#### Logistic Regression

In [None]:
def logisticregression(X,y):
    log = LogisticRegression(max_iter=500)

    params = {'C':np.linspace(0.1,1,10), 'solver':['liblinear','lbfgs', 'newton-cg']}

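    # shuffle=True is needed for random_state to take effect (recent scikit-learn versions raise an error otherwise)
    log_model_cv = GridSearchCV(log, params, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=77), scoring="accuracy")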
    log_model_cv.fit(X,y)

    print("Cross validation score:{:.3f}".format(log_model_cv.best_score_))
    print("Best params:{}".format(log_model_cv.best_params_))
    
    return log_model_cv
    
log = logisticregression(X_train,y_train)

Cross-validation accuracy scores for both models are comparable at 83-84%.

The logistic regression model will be selected as the final model because its outputs are more interpretable than those of the random forest (the sign of each coefficient shows whether an input variable raises or lowers the predicted chance of survival).
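
To make the interpretability point concrete, here is a sketch on synthetic data in which one feature raises the outcome probability and another lowers it. The feature names `x_up` and `x_down` are hypothetical. Exponentiating a coefficient gives the multiplicative change in the odds for a one-unit increase in that feature.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic features: the outcome is more likely when x_up is high and x_down is low
X_demo = pd.DataFrame({"x_up": rng.normal(size=300), "x_down": rng.normal(size=300)})
signal = 2.0 * X_demo["x_up"] - 2.0 * X_demo["x_down"]
y_demo = (signal + rng.logistic(size=300) > 0).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)

# Sign of each coefficient gives the direction of the effect
coefs = pd.Series(model.coef_[0], index=X_demo.columns)

# exp(coefficient) is the odds ratio for a one-unit increase in the feature
odds_ratios = np.exp(coefs)
print(coefs)
print(odds_ratios)
```

An odds ratio above 1 corresponds to a positive coefficient and one below 1 to a negative coefficient.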


### Model Evaluation

In [None]:
#Visualize coefs of log model
weights_log = pd.Series(log.best_estimator_.coef_.transpose()[:,0], index=X.columns).sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(8,6))
ypos = np.arange(0,len(weights_log))[::-1]

positive_weights = weights_log[weights_log>=0]
ypos_positive = ypos[weights_log>=0]
negative_weights = weights_log[weights_log<0]
ypos_negative = ypos[weights_log<0]  # same strict mask as negative_weights so the lengths match
ax.barh(ypos_positive,positive_weights, color='#77B7D8')
ax.barh(ypos_negative,negative_weights, color='#C16C82')

ax.set_yticks(ypos)
ax.set_yticklabels(weights_log.index)

for i,value in zip(ypos, weights_log.values):
    ax.annotate("{:.2f}".format(value), xy=(value, i))
    
plt.title("Coefficients for Log Regression Model")

In [None]:
## evaluate prediction accuracy on held out labelled data set using log model
y_pred = log.predict(X_test)

print("MODEL PERFORMANCE")
print("Accuracy: {:.2f}%".format(accuracy_score(y_test,y_pred)*100))
print("Confusion matrix:\n{}".format(confusion_matrix(y_test,y_pred)))
print("Recall:\n{:.2f}%".format(100*recall_score(y_test.astype('int'), y_pred.astype('int'))))
print("Precision:\n{:.2f}%".format(100*precision_score(y_test.astype('int'), y_pred.astype('int'))))

In [None]:
#evaluate prediction accuracy using a baseline guess of y_train.mode() for all predictions
y_pred2 = np.repeat(y_train.mode(), len(y_test))

print("BASELINE PERFORMANCE (all outcomes assumed to take the mode of y_train)")
print("Accuracy: {:.2f}%".format(accuracy_score(y_test,y_pred2)*100))
print("Confusion matrix:\n{}".format(confusion_matrix(y_test,y_pred2)))

### Conclusions

The logistic regression model gives an accuracy of 78% on the held-out (labelled) test set, 17 percentage points above the baseline.
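
A majority-class baseline of this kind can also be obtained from scikit-learn's `DummyClassifier`. The sketch below uses invented labels and placeholder features (the dummy model ignores feature values), so the printed accuracy is not a result from this notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels, invented for illustration only
y_train_demo = np.array(["0", "0", "0", "1", "1"])
y_test_demo = np.array(["0", "1", "0", "0"])

# strategy="most_frequent" always predicts the most common training label
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(np.zeros((len(y_train_demo), 1)), y_train_demo)

baseline_pred = baseline.predict(np.zeros((len(y_test_demo), 1)))
baseline_acc = accuracy_score(y_test_demo, baseline_pred)
print(baseline_pred, baseline_acc)
```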

Based on the log model coefficients, being male or in Pclass 3 greatly reduced the likelihood of survival, while passengers aged 5 or less were more likely to survive. 

Interestingly, "SibSp" has a negative coefficient, i.e. a higher number reduced the likelihood of survival. On the other hand, "HasSibSp" has a positive coefficient, i.e. survival likelihood was higher for those with siblings/spouses on board than for those without. A similar but less pronounced trend was seen for Parch.

We could infer that having a large number of family members on board might adversely affect survival (a large family may delay escape, as passengers may try to find all their family members before getting on a lifeboat), while not having any family on board at all also adversely affects survival (there is no one to look out for you and realise you are missing).

#### Predictions on unlabelled test set

In [None]:
test = pd.read_csv("../input/test.csv")
test.isnull().sum()/len(test)

In [None]:
# define a function that consolidates the data cleaning and preprocessing steps to facilitate treatment of unlabelled test set 
def preproc(df):
    
    #fill missing age values
    df['title'] = df['Name'].apply(lambda x:x.split(',')[1].split('.')[0].strip()) #extract titles from names
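    # group_keys=False keeps the original index and row order, so predictions stay aligned with PassengerId
    df = df.groupby('title', group_keys=False).apply(lambda x: fill_age(x))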
    
    #change Pclass to categorical var
    df['Pclass']=df['Pclass'].astype('str')
    
    #fill missing Embarked data
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])  # mode() returns a Series; take its first entry
    
    #create categorical SibSp/Parch/Family/Age columns
    df = transform_family_info(df)
    df = get_age_group(df)
    
    return df


test_cleaned = preproc(test) #preprocess test data in the same way as train
X_test_unlabelled = test_cleaned[variables]
X_test_unlabelled = pd.get_dummies(X_test_unlabelled , columns=cat_var)


def add_missing_dummy_columns(train, test): #ensure test set is not missing columns needed for prediction
    missing_cols = set(train.columns) - set(test.columns) 
    for c in missing_cols:
        test[c] = 0
    test = test[train.columns]
    return test

X_test_unlabelled = add_missing_dummy_columns(X,X_test_unlabelled)
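
The column alignment done by `add_missing_dummy_columns` can also be expressed with `DataFrame.reindex`. This is an alternative sketch on toy data; the column list `train_cols` is a hypothetical stand-in for the training design-matrix columns.

```python
import pandas as pd

# Hypothetical training columns (after one-hot encoding with drop_first=True)
train_cols = ["Age", "Sex_male", "Embarked_Q", "Embarked_S"]

# Toy test rows in which nobody embarked at 'Q'
test_raw = pd.DataFrame({"Age": [30.0, 4.0],
                         "Sex": ["male", "female"],
                         "Embarked": ["S", "S"]})
test_dummies = pd.get_dummies(test_raw, columns=["Sex", "Embarked"])

# reindex keeps the training columns in training order, adds any missing
# ones filled with 0, and drops columns the model never saw
aligned = test_dummies.reindex(columns=train_cols, fill_value=0)
print(aligned)
```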

In [None]:
filename = "submission.csv"

predictions = log.predict(X_test_unlabelled) 
predictions_df = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': predictions})
predictions_df.to_csv(filename, index=False)