**Intro:**
The goal of this competition is to predict whether each Titanic passenger survived, based on a dataset containing their gender, age, port of embarkation, etc.

**Best Score:**
1025/13895, top ~7%

## 1. Importing Necessary Packages:

In [170]:
#importing initial packages
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt #graphing
import matplotlib.patches as mpatches #style
import seaborn as sns #advanced graphing
%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## 2. Importing Data:

In [171]:
#importing the data
training = pd.read_csv("../input/titanic/train.csv")
testing = pd.read_csv("../input/titanic/test.csv")
training['train_test']=1
testing['train_test']=0
all_data=pd.concat([training,testing])

## 3. Discovering the Data:

In [172]:
#ensuring datatypes are correct
training.info()

In [173]:
#exploring descriptive statistics
training.describe()

**Observations:**
1. Not every person has an age.
2. Skewed towards lower ticket classes
3. Wide range in fares
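
The first observation can be checked directly by counting missing entries per column. Below is a minimal sketch on a small made-up frame (the values are hypothetical and only for illustration); in this notebook the same calls would presumably be applied to `training`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the real training data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Pclass": [3, 1, 3, 2],
    "Fare": [7.25, 71.28, 7.92, 13.0],
})

# Count and share of missing values in each column
missing_counts = toy.isna().sum()
missing_share = toy.isna().mean()
print(pd.DataFrame({"missing": missing_counts, "share": missing_share}))
```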

## 4. Feature Engineering

In [174]:
#identifying the cabin type
training['Cabin_Type'] = training['Cabin'].apply(lambda x: str(x)[0])
training['Cabin_Type'].value_counts()

In [175]:
#identifying if a passenger has multiple cabins
training['Cabin'] = training['Cabin'].astype('string')
training.loc[training['Cabin'].str.contains(' ', na=0), 'Multiple_Cabins'] = 1
training.loc[training['Multiple_Cabins'] != 1, 'Multiple_Cabins'] = 0
training['Multiple_Cabins'].value_counts()

In [176]:
#identifying if a passenger has a letter before their ticket
training['Ticket_Letters']=training['Ticket'].apply(lambda x: 0 if x.isnumeric() else 1)
training['Ticket_Letters'].value_counts()

In [177]:
#separating a person's title from their name (strip removes the leading space left by the split)
training['Title'] = training['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
training['Title'] = training['Title'].str.replace('Ms', 'Miss')
training['Title'] = training['Title'].str.replace('Mlle', 'Miss')
training['Title'] = training['Title'].str.replace('Mme', 'Mrs')
#'Dona' is listed before 'Don' so the longer title is replaced whole and no stray 'a' is left behind
for title in ['Lady', 'the Countess','Capt', 'Col', 'Dona', 'Don','Dr','Rev', 'Major', 'Sir', 'Jonkheer']:
    training['Title'] = training['Title'].str.replace(title, 'Misc')
training['Title'].value_counts()
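
Because `str.replace` matches substrings, the order of the replacements above matters (replacing `Don` inside `Dona` would leave `Misca`). One alternative, sketched here on made-up names and not the method used in this notebook, is to extract the title with a regular expression and then map exact values. The sample names and the `common` list are assumptions for illustration.

```python
import pandas as pd

# Made-up names in the "Surname, Title. Given names" layout of the Name column
names = pd.Series([
    "Doe, Mr. John",
    "Roe, Ms. Jane",
    "Smith, the Countess. of (Ann Example)",
    "Garcia, Dona. Maria",
    "Poe, Mlle. Claire",
])

# Capture everything between the first comma and the following period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Exact-value mapping, so 'Don' can never match inside 'Dona'
titles = titles.replace({"Ms": "Miss", "Mlle": "Miss", "Mme": "Mrs"})

# Anything outside the assumed common titles is grouped as 'Misc'
common = ["Mr", "Mrs", "Miss", "Master"]
titles = titles.where(titles.isin(common), "Misc")
print(titles.value_counts())
```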

## 5. Exploratory Analysis:

In [178]:
#separating quantitative and qualitative variables
quant_var_list = ['Age','SibSp','Parch','Fare','Survived','PassengerId']
qual_var_list = ['Pclass','Sex','Embarked','Cabin_Type','Multiple_Cabins','Ticket_Letters','Title','Survived','PassengerId']
quant_var = training[quant_var_list]
qual_var = training[qual_var_list]
quant_var_list.remove('Survived')
quant_var_list.remove('PassengerId')
qual_var_list.remove('Survived')
qual_var_list.remove('PassengerId')

In [179]:
#helper functions
def easy_hist(i):
    plt.hist(quant_var[i])
    plt.title(i)
    plt.show()
    
def qual_stacked_bar(i):
    #survivor counts are reindexed to the order of the totals so both bar sets line up per category
    totals=qual_var[i].value_counts()
    survivors=qual_var[qual_var['Survived']==1][i].value_counts().reindex(totals.index,fill_value=0)
    bar1=sns.barplot(y=totals,x=totals.index,color='lightblue')
    bar2=sns.barplot(y=survivors,x=totals.index,color='darkblue')
    top_bar=mpatches.Patch(color='lightblue',label='Survived = No')
    bottom_bar=mpatches.Patch(color='darkblue',label='Survived = Yes')
    plt.legend(handles=[top_bar,bottom_bar])
    plt.title(i)
    plt.show()
    
def survived_pivot_table(i):
    print(pd.crosstab(training['Survived'],training[i],normalize='columns'))
    print()

In [180]:
#looking at quantitative variable histograms
for i in quant_var_list:
    easy_hist(i)

**Observations:**

1. Not many children. Age count peaks around 20.
2. Mostly 0 siblings or spouses on board, drops off rapidly. A few outliers.
3. Mostly 0 parents or children on board, drops off rapidly. A few outliers.
4. Mostly cheap fares, drops off rapidly. The vast majority are below $100. A few outliers.

In [181]:
#looking for correlations between quantitative variables
quant_corr = quant_var.corr()
print(quant_corr)
sns.heatmap(quant_corr)

**Observations:**

1. The only correlation of any note is the .414838 correlation between number of siblings/spouses and number of parents/children. Someone traveling with a few family members may be more likely to be traveling with others as well.

In [182]:
#looking at quantitative averages for those who survive and those who do not
pd.pivot_table(training, index='Survived',values=quant_var_list)

**Observations:**

1. Those who survived skew slightly younger (by ~2.3 years).
2. On average, those who survived paid more than double the fare of those who did not.
3. Those who survived tend to have more children/parents on board, but the difference is small.
4. Those who survived tend to have more spouses/siblings on board, but the difference is small.

In [183]:
#looking at the survival numbers by qualitative attribute graphically
for i in ['Pclass','Sex','Embarked','Multiple_Cabins','Ticket_Letters']:
    qual_stacked_bar(i)

**Observations:**

1. More people died than survived.
2. Most people were in third class, which held more than double the number in either first or second class (first being slightly larger than second).
3. Almost twice as many males as females were on board.
4. Vast majority of people embarked at Southampton, followed by Cherbourg and Queenstown (respectively).

In [184]:
#looking at survival rate by qualitative attribute
for i in qual_var_list:
    survived_pivot_table(i)

**Observations:**

1. Those in first class were much more likely to survive than second and third class, respectively.
2. Females were much more likely to survive than males.
3. Guests who boarded at Cherbourg had the highest survival rate, followed by Queenstown, then Southampton. Maybe those who boarded at each port were predominantly wealthier or of a certain gender.
4. There are large differences in survival rate by cabin type, but because so few passengers have a recorded cabin, this may not be very significant.
5. Whether a ticket contains letters seems to have no bearing on survival rate, so this feature will be dropped when modeling.
6. Survival rate varies widely by title, but this may be driven by gender.
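
The guess in observation 3, that port of embarkation may stand in for class or gender, can be probed by splitting the rates on a second variable. Below is a minimal sketch on made-up rows (the data is hypothetical); in this notebook the same calls would presumably be run on `training`.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the real dataset
toy = pd.DataFrame({
    "Embarked": ["C", "C", "C", "S", "S", "S", "S", "Q"],
    "Pclass":   [1, 1, 3, 3, 3, 1, 3, 3],
    "Survived": [1, 1, 0, 0, 0, 1, 0, 0],
})

# Share of each ticket class among the passengers of each port
class_mix = pd.crosstab(toy["Embarked"], toy["Pclass"], normalize="index")

# Survival rate for each (port, class) combination
rate = toy.pivot_table(index="Embarked", columns="Pclass",
                       values="Survived", aggfunc="mean")
print(class_mix)
print(rate)
```

If the per-class rates look similar across ports, the port effect is probably explained by the class mix rather than by the port itself.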

## 6. Data Preprocessing

In [185]:
#re-doing feature engineering for all_data
#Cabin_Type
all_data['Cabin_Type'] = all_data['Cabin'].apply(lambda x: str(x)[0])
#Multiple_Cabins
all_data['Cabin'] = all_data['Cabin'].astype('string')
all_data.loc[all_data['Cabin'].str.contains(' ', na=0), 'Multiple_Cabins'] = 1
all_data.loc[all_data['Multiple_Cabins'] != 1, 'Multiple_Cabins'] = 0
#Ticket_Letters
all_data['Ticket_Letters']=all_data['Ticket'].apply(lambda x: 0 if x.isnumeric() else 1)
#Title (strip removes the leading space left by the split)
all_data['Title'] = all_data['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
all_data['Title'] = all_data['Title'].str.replace('Ms', 'Miss')
all_data['Title'] = all_data['Title'].str.replace('Mlle', 'Miss')
all_data['Title'] = all_data['Title'].str.replace('Mme', 'Mrs')
#'Dona' is listed before 'Don' so the longer title is replaced whole and no 'Misca' is produced
for title in ['Lady', 'Rev', 'the Countess','Capt', 'Col', 'Dr', 'Dona', 'Don', 'Major', 'Sir', 'Jonkheer']:
    all_data['Title'] = all_data['Title'].str.replace(title, 'Misc')

In [186]:
#filling na values
all_data['Age'] = all_data['Age'].fillna(training['Age'].median())
all_data['Fare'] = all_data['Fare'].fillna(training['Fare'].median())
all_data['Embarked'] = all_data['Embarked'].fillna(training['Embarked'].mode()[0]) #mode() returns a Series, so take its first value
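
One detail worth checking in the imputation cell: `Series.mode()` returns a Series rather than a scalar, and `fillna` aligns a Series argument on index labels instead of broadcasting it. The sketch below uses a made-up column to show the difference; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical column with two missing entries
s = pd.Series(["S", None, "C", "S", None], name="Embarked")

mode_series = s.mode()  # a Series with the single index label 0

# Passing the Series: only rows whose label appears in mode_series can be filled
filled_by_series = s.fillna(mode_series)

# Passing the scalar: every missing entry is filled
filled_by_scalar = s.fillna(mode_series[0])

print(filled_by_series.isna().sum())
print(filled_by_scalar.isna().sum())
```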

In [187]:
#dropping any rows still missing 'Embarked'
all_data.dropna(subset=['Embarked'],inplace=True)

In [188]:
#no missing values (that matter)
all_data.isna().sum()

In [189]:
print(pd.crosstab(all_data['Survived'],all_data['Cabin_Type'],normalize='columns'))

In [190]:
#separating useful features
useful_features = ['PassengerId','Survived','Pclass','Sex','Age','SibSp','Parch','Fare','Embarked','Cabin_Type',
                  'Multiple_Cabins','Title','train_test']

In [191]:
#getting dummies
useful_data_with_dummies=pd.get_dummies(all_data[useful_features])

In [192]:
#making sure the dummies look good
useful_data_with_dummies.info()

In [193]:
#re-separating into train/test
#train
train_rows = useful_data_with_dummies[useful_data_with_dummies['train_test']==1]
y_train = train_rows['Survived']
#drop returns a new frame, which avoids in-place edits on a slice
X_train = train_rows.drop(columns=['Survived','train_test'])
#test
test_rows = useful_data_with_dummies[useful_data_with_dummies['train_test']==0]
X_test = test_rows.drop(columns=['Survived','train_test'])

## 7. Modeling

In [195]:
#Helper functions
def initial_model(model):
    for i in [X_train,X_train_scaled]:
        mdl = model
        cv = cross_val_score(mdl,i,y_train,cv=5)
        print(cv)
        print(cv.mean())
        
def tuned_model(param_grid,model):
    mdl = GridSearchCV(model,param_grid = param_grid, n_jobs=3, verbose = 1)
    mdl.fit(X_train_scaled,y_train)
    print(mdl.best_score_)
    best=mdl.best_estimator_
    return best

In [196]:
#importing models
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier

In [197]:
#compiling list of models
models = [MLPClassifier(max_iter=60), LogisticRegression(max_iter=60), GaussianNB(), KNeighborsClassifier(),
          RandomForestClassifier(),SVC(),XGBClassifier()]

In [198]:
#scaling
scaler=StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) #reuse the mean/std learned from the training data
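
Scaling statistics are normally learned from the training data only and then reused on the test data, so both sets end up on the same scale. The sketch below uses made-up one-column arrays to show how refitting on the test set changes the result; the array names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
train_X = np.array([[0.0], [10.0], [20.0]])
test_X = np.array([[10.0], [30.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_X)  # learns mean and std from train
test_scaled = scaler.transform(test_X)        # applies the train statistics

# Refitting on the test set uses the test mean and std instead
refit_scaled = StandardScaler().fit_transform(test_X)

print(test_scaled.ravel())
print(refit_scaled.ravel())
```

In the refit case a test value of 10 no longer maps to the same scaled value as a training value of 10, which is what a model trained on `train_scaled` would expect.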

In [199]:
#testing the models
for i in models:
    print(i, ':')
    initial_model(i)

In [200]:
#creating param grids
svc_param_grid = {'C': [0.1, 1, 10, 100, 1000], 
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}
rf_param_grid = {"max_depth": [None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [False],
              "n_estimators" :[100,300],
              "criterion": ["gini"]}
mlp_param_grid = {'hidden_layer_sizes': [(10,30,10),(20,)],
                'activation': ['tanh', 'relu'],
                'solver': ['sgd', 'adam'],
                'alpha': [0.0001, 0.05],
                'learning_rate': ['constant','adaptive'],
}

In [201]:
#svc
#svc=tuned_model(svc_param_grid, SVC())

In [214]:
#rf
rf=tuned_model(rf_param_grid,RandomForestClassifier())

In [None]:
#MLP
#mlp=tuned_model(mlp_param_grid,MLPClassifier(max_iter=60))

## 8. Final Model Preparation

In [220]:
#the tuned rf model had the best cv results, so I am using it to predict the test data
y_test = rf.predict(X_test_scaled)

In [216]:
y_test = y_test.astype(int)

In [217]:
final_data = {'PassengerId': testing.PassengerId, 'Survived': y_test}
submission = pd.DataFrame(data=final_data)

In [222]:
submission.to_csv('submission_rf_tuned_1.csv', index=False)

**References:**

Huge thank you to the creators of these notebooks, which I used for guidance and inspiration:

https://www.kaggle.com/code/kenjee/titanic-project-example

https://www.kaggle.com/code/odaymourad/detailed-and-typical-solution-ensemble-modeling