# Titanic survival prediction

## Description of the notebook 

This notebook is organized in 7 parts.

1. Data acquisition: we will import the dataset and get a first glance at what it contains.

2. Dataset exploratory analysis: we will inspect the features, their types, and whether they contain missing values.

3. Feature exploratory analysis: we will look at how each feature relates to the survival rate and decide which features to keep.

4. Data cleaning and feature selection: we will select the features to keep, handle any missing values, and encode the categorical features.

5. Model preparation: we will prepare the training and test sets and the models used for the classification.

6. Pipeline evaluation and selection: we will run the models and compare their scores, which will allow us to choose the best model.

7. Prediction: in the final stage, we will run the selected model to generate predictions.

# 1. Data acquisition
In this part we import the relevant libraries and load the training and test datasets.
### Import libraries and dataset

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
import xgboost as xgb

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import FunctionTransformer

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold

from sklearn.ensemble import VotingClassifier

sns.set_style('darkgrid')
%matplotlib inline

In [None]:
dataset_train = pd.read_csv("/kaggle/input/titanic/train.csv")
dataset_test = pd.read_csv("/kaggle/input/titanic/test.csv")

# 2. Dataset Exploratory analysis
In this section we will check the number and types of features in the dataset, whether there are missing values, whether some features are correlated, whether we can remove unnecessary features, and whether we can create new features from the ones we already have.

In [None]:
def screen_data(df):
    rows = []
    for col in df.columns:
        rows.append([col, df[col].isnull().sum(), df[col].nunique(), df[col].dtypes])
    print(pd.DataFrame(rows, columns=['Col', 'Missing values', 'Unique values', 'Type']))

In [None]:
screen_data(dataset_train)

In [None]:
screen_data(dataset_test)

### Preliminary information we can get from the data screening
In dataset_train, three features have **missing data** (*Age, Cabin and Embarked*). <br>
In dataset_test, three features also have **missing data** (*Age, Cabin and Fare*).

We can also see that most features are **numerical** (*PassengerId, Survived, Pclass, Age, SibSp, Parch and Fare*), <br>
some are object type with **categories** (*Sex, Embarked*) <br>
and some are free-form **strings** (*Name, Ticket, Cabin*). <br>

Cabin and Age are missing many values in both datasets. Our exploratory analysis will tell us whether we need to estimate the missing values or whether we can drop these features.
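
To decide between estimating and dropping, it can help to look at the share of missing values rather than the raw count. The following is a minimal sketch on a made-up toy frame (the values and the `toy` and `missing_share` names are illustrative assumptions, not the Kaggle data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isna().mean() gives the fraction of missing entries per column.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

A column that is mostly empty is a candidate for dropping or for a coarse derived feature, while a column with a moderate share of gaps is usually worth imputing.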

Now let's print some rows from the dataset

In [None]:
dataset_train.head(10)

In [None]:
dataset_test.head(10)

### Preliminary information we can get from the data
- PassengerId probably carries no information about survival. <br>
- The Name column always contains a title, which might give us further information. <br>
- Some people have a non-integer age; we will see what to do about it. <br>
- The cabin seems to have a letter associated with it; we might extract it to see whether it is linked to survival. <br>

Let's see if we can get any additional information from the numerical data.

In [None]:
dataset_train.describe()

### Preliminary information we can get from the numerical data
- The average survival rate is 38%. <br>
- More than 50% of the passengers are in class 3 (Pclass = 3). <br>
- Most passengers are between 20 and 40 years old, with a minimum of 0.42 and a maximum of 80 years. <br>
- More than 50% of the passengers travel alone (0 in SibSp and Parch). <br>
- The minimum fare is 0, which might be an error, and the maximum is 512, which is also an extreme value considering a mean of 32 and a standard deviation of 50. <br>

Let's see if we can get any additional information from the correlations in the data.

In [None]:
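# Restrict to numeric columns; recent pandas versions raise an error on text columns.
dataset_train.select_dtypes(include='number').corr()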

### Preliminary information we can get from the correlation matrix
- PassengerId is not correlated with survival, which confirms our hypothesis. <br>
- Pclass is negatively correlated with survival, which means that people in class 1 survived more often than people in class 3. <br>
- Fare is positively correlated with survival, which means that people who paid a higher fare had a better chance of surviving. <br>
- Age shows almost no linear correlation with survival, which is surprising: we would have expected younger people to survive more often than older ones. The two may be related, but not in a linear way. <br>
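
The last point can be illustrated with a small sketch: the Pearson coefficient only measures linear association, so a relationship that is not monotonic can score close to zero. The data below is entirely made up (an outcome that is high for the youngest and oldest rows and low in between) and does not claim to reflect the Titanic data:

```python
import pandas as pd

# Made-up data: the outcome is 1 for the youngest and oldest rows, 0 in between.
toy = pd.DataFrame({
    "Age": [2, 5, 8, 30, 35, 40, 62, 65, 68],
    "Outcome": [1, 1, 1, 0, 0, 0, 1, 1, 1],
})

# Pearson correlation between age and outcome (linear association only).
pearson = toy["Age"].corr(toy["Outcome"])

# Mean outcome per age band shows the pattern that Pearson misses.
bands = pd.cut(toy["Age"], bins=[0, 15, 50, 100], labels=["child", "adult", "senior"])
rate_per_band = toy.groupby(bands, observed=True)["Outcome"].mean()

print(round(pearson, 3))
print(rate_per_band)
```

This is why the age feature is examined per age band and per class in the next section instead of being discarded on the basis of the correlation matrix alone.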

Let's see if we can get any additional information from the distribution of the data.

In [None]:
#dataset_train.plot(kind='density', subplots=True, layout=(4,2), sharex=False, figsize=(20,20))
dataset_train.hist(figsize=(20,20));

### Preliminary information we can get from the histogram plots
- Most people traveled alone (SibSp and Parch). <br>
- Most people paid a fare of less than $50. <br>
- There are some very young children among the passengers. <br>
- There is approximately the same number of passengers in classes 1 and 2. <br>

Let's now dive a bit further into the features and how they affect the survival rate.

# 3. Feature exploratory analysis

What we would like to analyse:
- The correlation between age and survival
- The correlation between sex and survival
- The correlation between class and survival
- The correlation between fare and survival
- The correlation between port of embarkation and survival
- Whether the number of siblings/spouses (SibSp) or parents/children (Parch) has an impact on survival
- Who paid 0 and $512 for their ticket?

## Let's look at the three categorical features (sex, embarked and Pclass)

In [None]:
fig, ax = plt.subplots(1,3, figsize=(22,5))
sns.countplot(x="Sex", data=dataset_train, ax=ax[0])
ax[0].set_title('Proportion of men and women');
sns.countplot(x="Pclass", data=dataset_train, ax=ax[1])
ax[1].set_title('Proportion of passengers per class');
sns.countplot(x="Embarked", data=dataset_train, ax=ax[2])
ax[2].set_title('Proportion of passengers per port');

There are more men than women, and most of the passengers embarked at the Southampton port.
Most passengers are in class 3.

Now let's check if these features have an impact on the survival rate.

In [None]:
dataset_train[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
dataset_train[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
dataset_train[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)

### What can we learn from the previous statistics?
- Women were almost 4 times more likely to survive than men.
- Passengers in classes 1 and 2 had a better chance of surviving than passengers in class 3.
- People who embarked at port C had a better chance of surviving.
    - We can check whether people from port C mostly belong to classes 1 and 2, which might explain the higher survival rate.
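
One way to check the port C hypothesis numerically is a row-normalized cross-tabulation of port against class. This is a minimal sketch on made-up rows (the `toy` frame and `class_share` name are assumptions for illustration); on the real data, the same call would be applied to `dataset_train`:

```python
import pandas as pd

# Made-up rows standing in for the Embarked and Pclass columns.
toy = pd.DataFrame({
    "Embarked": ["C", "C", "C", "S", "S", "S", "S", "Q"],
    "Pclass":   [1,   1,   3,   3,   3,   2,   1,   3],
})

# normalize="index" turns counts into the share of each class within a port.
class_share = pd.crosstab(toy["Embarked"], toy["Pclass"], normalize="index")
print(class_share)
```

Each row sums to 1, so a port with a large share in the first column is dominated by first-class passengers.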

In [None]:
fig, ax = plt.subplots(1,4, figsize=(22,5))
sns.barplot(x="Pclass", y="Survived", hue="Sex", data=dataset_train, ax=ax[0])
ax[0].set_title('Survival rate vs Pclass and sex');
sns.countplot(x="Pclass", hue="Sex", data=dataset_train, ax=ax[1])
ax[1].set_title('Proportion of men and women in every class');
sns.countplot(x="Embarked", hue="Pclass", data=dataset_train, ax=ax[2])
ax[2].set_title('Class depending of port of embarkation');
sns.barplot(x="Embarked", y="Survived", hue="Pclass", data=dataset_train, ax=ax[3])
ax[3].set_title('Survival rate vs Pclass and port');

### What can we learn from the previous plots?
- Women in classes 1 and 2 had a very high chance of surviving.
- Class 3 had many more men than women, whereas classes 1 and 2 had almost the same number of men and women.
- Very few people in classes 1 and 2 embarked at port Q, which explains the very wide error bars on their survival rate.
- Interestingly, people from class 3 had a better chance of surviving if they embarked at port C or Q.

## Let's now look at the family size and if it has an impact on survival

In [None]:
fig, ax = plt.subplots(1,4, figsize=(22,5))
sns.countplot(x="Parch", hue="Sex", data=dataset_train, ax=ax[0])
ax[0].set_title('Distribution of family size (Parch)');
sns.barplot(x="Parch", y="Survived", hue="Sex", data=dataset_train, ax=ax[1])
ax[1].set_title('Survival rate vs family size (Parch)');
sns.countplot(x="SibSp", hue="Sex", data=dataset_train, ax=ax[2])
ax[2].set_title('Distribution of family size (SibSp)');
sns.barplot(x="SibSp", y="Survived", hue="Sex", data=dataset_train, ax=ax[3])
ax[3].set_title('Survival rate vs family size (SibSp)');

### What can we learn from the previous plots?
- Most people travelled alone, especially men.
- Men had a better chance of surviving when they travelled with someone.

It seems worthwhile to build a new feature for family size.

## Let's look at how the age and survival rate are related

In [None]:
bin_size = 20
fig, ax = plt.subplots(2,3, figsize=(22,10))
sns.histplot(data=dataset_train, x="Age", hue="Survived", multiple="stack", bins=bin_size, ax=ax[0][0])
ax[0][0].set_title('Distribution of age among the passengers');
sns.histplot(x="Age", hue="Pclass", data=dataset_train, ax=ax[0][1],multiple="stack", bins=bin_size, palette=['blue', 'orange', 'green'])
ax[0][1].set_title('Distribution of age and class');
sns.histplot(x="Age", hue="Survived", multiple="stack", bins=bin_size, data=dataset_train.loc[dataset_train['Pclass'] == 1], ax=ax[1][0])
ax[1][0].set_title('Survival rate of passengers of class 1');
sns.histplot(x="Age", hue="Survived", multiple="stack", bins=bin_size, data=dataset_train.loc[dataset_train['Pclass'] == 2], ax=ax[1][1])
ax[1][1].set_title('Survival rate of passengers of class 2');
sns.histplot(x="Age", hue="Survived", multiple="stack", bins=bin_size, data=dataset_train.loc[dataset_train['Pclass'] == 3], ax=ax[1][2])
ax[1][2].set_title('Survival rate of passengers of class 3');

### What can we learn from the previous plots?
- Children under 15 in classes 1 and 2 had a very high chance of surviving.
- Older people (over 45) had a lower chance of surviving.
- For passengers in class 3, age made little difference to survival.
- Classes 1 and 2 have more older people than class 3.

## Let's look at the fare and maybe the fare per person.

In [None]:
# First create a family size column, then a fare per person column.
df = dataset_train.copy()
df['Family_size'] = df['Parch'] + df['SibSp'] + 1
df['Fare_per_person'] = df['Fare'] / df['Family_size']
# Cap the fare per person at 60 to limit the influence of extreme values.
df['Fare_per_person'] = df['Fare_per_person'].clip(upper=60)
# Only two plots are drawn, so only two axes are created.
fig, ax = plt.subplots(1, 2, figsize=(22,5))
sns.barplot(x="Family_size", y="Survived", hue="Sex", data=df, ax=ax[0])
ax[0].set_title('Survival rate vs family size and sex');
sns.barplot(x="Family_size", y="Survived", data=df, ax=ax[1])
ax[1].set_title('Survival rate vs family size');

In [None]:
fig, ax = plt.subplots(1,2, figsize=(22,5))
sns.histplot(data=df, x='Fare_per_person', hue="Survived", multiple="stack", bins=20, ax=ax[0])
ax[0].set_title('Fare per person vs survival');
sns.scatterplot(x="Age", y="Fare_per_person", hue="Pclass", data=df, ax=ax[1], palette=['blue', 'orange', 'green'])
ax[1].set_title('Price vs age');

### What can we learn from these plots?
- A higher fare meant a higher chance of surviving.
- Some people from class 1 (and 3) paid 0 or close to 0, which might indicate missing values.
- Three people from class 3 paid more than $50 for their ticket, which might also be an error.

In [None]:
df.loc[df['Fare_per_person'] <= 1]

These fares look like missing values, so we need to re-estimate the price of these tickets based on the passengers' class.

In [None]:
df.loc[(df['Fare_per_person'] > 50) & (df['Pclass'] == 3)]

These people have the same ticket number but don't belong to the same family. <br>
We need to check whether the family size equals the number of passengers sharing the same ticket number. <br>
It's possible that some family relationships were not recorded.

# 4. Data cleaning and feature selection

Based on the previous analysis, the first step is to create a new column with the number of passengers sharing the same ticket number. <br>

In [None]:
# Create a column with the number of passengers sharing each ticket number.
df['Nb_identical_tickets'] = df.groupby('Ticket')['Ticket'].transform('count')

In [None]:
# Let's check whether the family size and the number of identical tickets match.
df.loc[(df['Nb_identical_tickets'] - df['Family_size']) >= 2].sort_values('Ticket').head(50)

From this analysis we can see that:
- Fare_per_person should be computed as Fare / Nb_identical_tickets rather than Fare / Family_size.

Further questions:
- Do people who share a ticket but are not from the same family survive better than people travelling alone?
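
One possible way to approach this question is to label each passenger as travelling alone, with family, or on a shared ticket without family, and then compare survival rates per label. The sketch below uses made-up rows; the `Group` column and its three labels are hypothetical names introduced here, not part of the notebook:

```python
import numpy as np
import pandas as pd

# Made-up rows with the two engineered columns and the target.
toy = pd.DataFrame({
    "Family_size":          [1, 1, 1, 1, 2, 2, 1, 1],
    "Nb_identical_tickets": [1, 1, 3, 3, 2, 2, 3, 1],
    "Survived":             [0, 0, 1, 0, 1, 1, 1, 1],
})

# np.select applies the first matching condition, so "family" takes priority.
conditions = [
    toy["Family_size"] > 1,
    toy["Nb_identical_tickets"] > 1,
]
labels = ["family", "shared ticket, no family"]
toy["Group"] = np.select(conditions, labels, default="alone")

# Survival rate per travel group.
rate = toy.groupby("Group")["Survived"].mean()
print(rate)
```

On the real data the groups can be small, so the rates should be read together with the group sizes (for example via `toy["Group"].value_counts()`).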

In [None]:
for pclass in df['Pclass'].unique():
    df.loc[(df['Fare'] == 0) & (df['Pclass'] == pclass), 'Fare'] = df.loc[df['Pclass'] == pclass]['Fare'].median()
df['Fare_per_person'] = df['Fare']/df['Nb_identical_tickets']


fig, ax = plt.subplots(1,4, figsize=(22,6))
sns.countplot(x="Nb_identical_tickets", hue="Pclass", data=df, ax=ax[0])
ax[0].set_title('Number of identical tickets per class');
sns.barplot(x="Nb_identical_tickets", y="Survived", hue="Sex", data=df, ax=ax[1])
ax[1].set_title('Survival rate vs number of identical tickets');
sns.scatterplot(x="Age", y="Fare_per_person", hue="Pclass", data=df, ax=ax[2], palette=['blue', 'orange', 'green'])
ax[2].set_title('Price vs age');
sns.barplot(x="Nb_identical_tickets", y="Survived", data=df, ax=ax[3])
ax[3].set_title('Survival rate vs number of identical tickets');

In [None]:
df.loc[(df['Fare_per_person'] <= 25) & (df['Pclass'] == 1)]

### What can we learn from this?
- The age is not correlated to the fare per person.
- The number of identical tickets is probably an interesting feature.

Let's create a feature with the title contained in the name and a feature with the cabin letter.

In [None]:
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
df['Cabin_level'] = df['Cabin'].astype(str).str[0]

In [None]:
df['Title'].value_counts()

In [None]:
df['Cabin_level'].value_counts()

### From these two new columns
- Most cabins are not recorded in the dataset.
- Many titles correspond to only one or two people. We can merge them into broader groups.
- It would be interesting to check whether there is a link between cabin level and class.
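
To make the title extraction and the merging of rare titles concrete, here is a minimal sketch on a few names written in the dataset's "Surname, Title. Given names" format. The `rare` mapping is a shortened, illustrative version of the replacements used in this notebook:

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" format.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# The title is the first word that follows a space and ends with a period.
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

# Merge rare titles into the common ones (shortened mapping for illustration).
rare = {"Countess": "Mrs", "Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"}
grouped = titles.replace(rare)
print(grouped.tolist())
```

The raw string (`r"..."`) keeps the backslash in `\.` from being treated as a string escape sequence.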

In [None]:
df['Title'] = df['Title'].replace(['Capt', 'Col', 'Dr', 'Major', 'Jonkheer', 'Dona'], 'Other')
df['Title'] = df['Title'].replace('Mlle', 'Miss')
df['Title'] = df['Title'].replace('Ms', 'Miss')
df['Title'] = df['Title'].replace('Mme', 'Mrs')
df['Title'] = df['Title'].replace('Lady', 'Mrs')
df['Title'] = df['Title'].replace('Don', 'Mr')
df['Title'] = df['Title'].replace('Rev', 'Mr')
df['Title'] = df['Title'].replace('Sir', 'Mr')
df['Title'] = df['Title'].replace('Countess', 'Mrs')

In [None]:
fig, ax = plt.subplots(1,4, figsize=(22,6))
sns.countplot(x="Cabin_level", hue="Pclass", data=df, ax=ax[0])
ax[0].set_title('Cabin letter and class');
sns.scatterplot(x="Cabin_level", y="Fare_per_person", hue="Pclass", data=df, ax=ax[1], palette=['blue', 'orange', 'green'])
ax[1].set_title('Price vs cabin');
sns.barplot(x="Cabin_level", y="Survived", data=df, ax=ax[2])
ax[2].set_title('Survival rate per cabin');
sns.boxplot(x="Cabin_level", y="Fare_per_person", data=df, ax=ax[3])
ax[3].set_title('Price vs cabin');

### What can we learn from the cabin letter?
- Most passengers in cabins A, B, C, D and E were from class 1.
- Cabins A, B, C, D and E were the most expensive.
- Cabins F and G were significantly less expensive.
- Cabin T has only one passenger, so we cannot conclude anything from it.
- Women in cabins B, C, D, E and F survived better than other women.

In [None]:
fig, ax = plt.subplots(1,4, figsize=(22,6))
sns.scatterplot(x="Title", y="Age", hue="Sex", data=df, ax=ax[0])
ax[0].set_title('Title vs age');
sns.scatterplot(x="Title", y="Age", hue="Survived", data=df, ax=ax[1])
ax[1].set_title('Title vs age');
sns.boxplot(x="Title", y="Age", data=df, ax=ax[2])
ax[2].set_title('Title vs age');
sns.boxplot(x="Title", y="Age", hue="Pclass", data=df, ax=ax[3])
ax[3].set_title('Title vs age');

Let's also see if mothers and children had a better chance of survival.
- Children are passengers under 18 years old.
- Mothers are female passengers over 18 years old who have the title Mrs and at least one Parch.

In [None]:
df['Civ'] = 'none'
df.loc[df['Age'] < 18, 'Civ'] = 'Child'
df.loc[(df['Age'] > 18) & (df['Sex'] == 'female') & (df['Title'] == 'Mrs') & (df['Parch'] >= 1), 'Civ'] = 'Mother'
# Only one plot is drawn, so a single axis is enough.
fig, ax = plt.subplots(figsize=(11,6))
sns.barplot(x="Civ", y="Survived", hue="Pclass", data=df, ax=ax)
ax.set_title('Survival rate vs Civility');

### What can we learn from this title analysis?
- The title and the class give an approximation of a person's age, which can be used to fill in the missing Age values.
- Children and mothers in classes 1 and 2 survived at a very high rate.
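
The idea of filling a missing age from the title and class can be sketched with a grouped median. This is a vectorized variant on made-up rows, shown only to illustrate the principle; the `Age_filled` column is a hypothetical name, and the notebook's own `impute_age` function below uses explicit loops instead:

```python
import numpy as np
import pandas as pd

# Made-up rows: one missing age in each (Pclass, Title) group.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Title":  ["Mr", "Mr", "Mr", "Miss", "Miss", "Miss"],
    "Age":    [40.0, 50.0, np.nan, 4.0, 20.0, np.nan],
})

# Median age of each (Pclass, Title) group, aligned with the original rows.
group_median = toy.groupby(["Pclass", "Title"])["Age"].transform("median")

# Keep known ages and fill only the gaps with the group median.
toy["Age_filled"] = toy["Age"].fillna(group_median)
print(toy)
```

A group with no known age at all would still yield a missing value, so a fallback (for example the overall median) may be needed on real data.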

In [None]:
df

## 4.2 Preprocessing pipeline

Steps to include in the preprocessing pipeline:
- Feature engineering
    - Create a title column
    - Create a family size (Parch + SibSp + 1)
    - Create a nb of identical tickets column
    - Create a Civ column for children and mother
    - Create a Deck column with the first letter of the cabin
- Impute missing values
    - For Fare : median value of the passengers of the same class
    - For Age : median value of the passenger of the same title and class
- Encode feature
    - Family size will be encoded in three subcategories: single (1), small (2 to 4) and large (5 or more)
    - Deck will be encoded in three subcategories (BDE, CF, AG)
    - Number of identical tickets will become Ident_tickets_234, set to 1 when 2 to 4 passengers share the same ticket, since these passengers have a better chance of surviving
- Binning of features
    - Age: 0-6, 6-12, 12-17, 17-25, 25-40, 40-100 (upper bound included in each bin)
    - Fare per person: 0-5, 5-10, 10-15, 15-25, 25-40, 40 and above
- Scale features (min-max scaler)
- Remove features (PassengerId, Name, SibSp, Parch, Ticket, Fare, Embarked, Nb_identical_tickets, Cabin)
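
The binning step relies on `pd.cut`, whose bins are open on the left and closed on the right by default. The following sketch, on a few made-up ages, shows where boundary values land; the `zero_band` example uses hypothetical bins to show that a value exactly equal to the lowest edge falls outside every bin:

```python
import pandas as pd

# Made-up ages, including values that sit exactly on a bin edge.
ages = pd.Series([0.42, 6, 6.5, 17, 18, 80])

# Bins are (0, 6], (6, 12], (12, 17], (17, 25], (25, 40], (40, 100].
age_band = pd.cut(
    ages, bins=[0, 6, 12, 17, 25, 40, 100], labels=[1, 2, 3, 4, 5, 6]
).astype(float)
print(age_band.tolist())

# A value equal to the lowest edge is not covered and becomes NaN.
zero_band = pd.cut(pd.Series([0.0]), bins=[0, 5, 10], labels=[1, 2])
print(zero_band.isna().all())
```

This is one reason the zero fares need to be imputed before binning: a fare per person of exactly 0 would otherwise turn into a missing value.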

In [None]:
#Preprocessing functions definition
def extract_deck(df):
    df['Cabin'] = df['Cabin'].astype(str).str[0]
    return df

def encode_deck(df):
    df['Deck_BDE'] = 0
    df['Deck_CF'] = 0
    df['Deck_AG'] = 0
    df.loc[(df['Cabin'] == 'B') | (df['Cabin'] == 'D') | (df['Cabin'] == 'E'), 'Deck_BDE'] = 1
    df.loc[(df['Cabin'] == 'C') | (df['Cabin'] == 'F'), 'Deck_CF'] = 1
    df.loc[(df['Cabin'] == 'A') | (df['Cabin'] == 'G'), 'Deck_AG'] = 1
    return df

def encode_family_size(df):
    df.loc[(df['Family_size'] > 1) & (df['Family_size'] < 5), 'Family_size'] = 2
    df.loc[df['Family_size'] > 4, 'Family_size'] = 3
    return df

def encode_identical_tickets(df):
    df['Ident_tickets_234'] = 0
    df.loc[(df['Nb_identical_tickets'] > 1) & (df['Nb_identical_tickets'] < 5), 'Ident_tickets_234'] = 1
    return df

def extract_title(df):
    df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
    df['Title'] = df['Title'].replace(['Capt', 'Col', 'Dr', 'Major', 'Jonkheer', 'Dona'], 'Other')
    df['Title'] = df['Title'].replace(['Mlle', 'Ms'], 'Miss')
    df['Title'] = df['Title'].replace(['Mme', 'Lady', 'Countess'], 'Mrs')
    df['Title'] = df['Title'].replace(['Don', 'Rev', 'Sir'], 'Mr')
    return df

def extract_family_size(df):
    df['Family_size'] = df['Parch'] + df['SibSp'] + 1
    return df

def extract_identical_tickets(df):
    # Number of passengers in this dataset sharing each ticket number.
    df['Nb_identical_tickets'] = df.groupby('Ticket')['Ticket'].transform('count')
    return df

def calculate_fare_per_person(df):
    df['Fare_per_person'] = df['Fare']/df['Nb_identical_tickets']
    return df

def impute_fare(df, df_total):
    for pclass in df['Pclass'].unique():
        df.loc[((df['Fare'].isna()) | (df['Fare'] == 0)) & (df['Pclass'] == pclass), 'Fare'] = df_total.loc[df_total['Pclass'] == pclass]['Fare'].median()
    return df

def impute_age(df, df_total):
    for pclass in df['Pclass'].unique():
        for title in df.loc[df['Pclass'] == pclass]['Title'].unique():
            df.loc[(df['Pclass'] == pclass) & (df['Title'] == title) & (df['Age'].isna()), 'Age'] = df_total.loc[(df_total['Pclass'] == pclass) & (df_total['Title'] == title)]['Age'].median()
            #print(title, pclass, df.loc[(df['Pclass'] == pclass) & (df['Title'] == title)]['Age'].median())
    return df

def bin_column(df, col, bins, labels):
    df[col] = pd.cut(df[col], bins=bins, labels=labels)
    df[col] = df[col].astype(float)
    return df

def extract_civ(df):
    df['Child'] = 0
    df['Mother'] = 0
    df.loc[df['Age'] < 18, 'Child'] = 1
    df.loc[(df['Age'] > 18) & (df['Sex'] == 'female') & (df['Title'] == 'Mrs') & (df['Parch'] >= 1), 'Mother'] = 1
    return df

In [None]:
dataset_train = pd.read_csv("/kaggle/input/titanic/train.csv").drop(columns=['Survived'])
dataset_test = pd.read_csv("/kaggle/input/titanic/test.csv")

combine = [dataset_train, dataset_test]
for df in combine:
    df = extract_title(df)
    df = extract_family_size(df)
    df = extract_identical_tickets(df)
    df = extract_civ(df)
    df = extract_deck(df)
    df = impute_fare(df, dataset_train)
    df = impute_age(df, dataset_train)
    df = calculate_fare_per_person(df)
    df = encode_family_size(df)
    df = encode_deck(df)
    df = encode_identical_tickets(df)
    df = bin_column(df, 'Fare_per_person', bins=[0,5,10,15,25,40,1000], labels=[1,2,3,4,5,6])
    df = bin_column(df, 'Age', bins=[0,6,12,17,25,40,100], labels=[1,2,3,4,5,6])

In [None]:
dataset_test.info()

In [None]:
#Preprocessing pipeline
drop_cols = ['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Embarked', 'Nb_identical_tickets', 'Cabin']
num_cols = ['Pclass', 'Age', 'Family_size', 'Child', 'Mother', 'Fare_per_person', 'Deck_BDE', 'Deck_CF', 'Deck_AG', 'Ident_tickets_234']
preprocessing = make_column_transformer(
    # sparse_output replaces the sparse argument in scikit-learn >= 1.2
    (OneHotEncoder(handle_unknown='ignore', sparse_output=False) , ['Title']),
    (OneHotEncoder(handle_unknown='ignore', sparse_output=False) , ['Sex']),
    ('drop' ,                                               drop_cols),
    remainder = "passthrough"
)
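
To see what such a column transformer produces, here is a minimal sketch on a three-row toy frame. The column values are made up, and `sparse_threshold=0` is used here (an assumption of this sketch, not of the notebook) to force a dense output regardless of the encoder's default:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame: one categorical column, one column to drop, one numeric column.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Name": ["a", "b", "c"],
    "Pclass": [3, 1, 2],
})

ct = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
    ("drop", ["Name"]),
    remainder="passthrough",   # untouched columns are appended at the end
    sparse_threshold=0,        # always return a dense array
)
encoded = ct.fit_transform(toy)
print(encoded)

# A category unseen during fit is encoded as all zeros instead of raising.
unseen = ct.transform(pd.DataFrame({"Sex": ["unknown"], "Name": ["d"], "Pclass": [2]}))
print(unseen)
```

The output columns follow the order of the transformers (here the one-hot columns for `female` and `male`), followed by the passthrough columns.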

# 5. Model preparation

In [None]:
X = dataset_train
# The target was dropped from dataset_train above, so reload it from the CSV.
y = pd.read_csv("/kaggle/input/titanic/train.csv")['Survived'].to_numpy()

X_pred = dataset_test.copy()

folds = KFold(n_splits=10, shuffle=True, random_state=0)

### Let's see what the dataset looks like after the preprocessing phase

In [None]:
pipeline = Pipeline([
    ('preprocessing' , preprocessing),
    ('scaler' ,        MinMaxScaler())
])
pipeline.fit(X)
pd.DataFrame(pipeline.transform(X)).describe()

In [None]:
X_pred_preprocessed = pd.DataFrame(pipeline.transform(X_pred))
X_pred_preprocessed.describe()

# 6. Results of pipeline + model

In [None]:
models_list = { 'LogisticRegression': LogisticRegression(), 
           'SVC': SVC(),
           'LinearSVC': LinearSVC(), 
           'Random Forest': RandomForestClassifier(), 
           'KNN': KNeighborsClassifier(), 
           'Naive Bayes' :GaussianNB(), 
           'Perceptron': Perceptron(), 
           'SGD': SGDClassifier(), 
           'Decision tree': DecisionTreeClassifier(),
           'XGBoost': xgb.XGBClassifier(use_label_encoder=False, verbosity = 0)}

model_perf_matrix = []
for model_name, model in models_list.items():
    pipeline = Pipeline([
        ('preprocessing' , preprocessing),
        ('scaler' ,        MinMaxScaler()),
        ('model' ,           model)
    ])

    cv_score = cross_val_score(pipeline, X, y, cv=folds);
    model_perf_matrix.append([model_name, round(cv_score.mean(),3), round(cv_score.std(),4)])
    
df_model_perf = pd.DataFrame(model_perf_matrix, columns=['Model', 'Mean value', 'Std value'])
df_model_perf
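
The evaluation pattern used above (a pipeline scored with shuffled 10-fold cross-validation) can be reproduced in isolation. This is a minimal sketch on synthetic data from `make_classification`; the `X_toy`, `y_toy`, `pipe` and `folds_toy` names are assumptions for illustration and no Titanic data is involved:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic binary classification problem standing in for the real features.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# The scaler sits inside the pipeline, so it is refit on each training fold.
pipe = Pipeline([("scaler", MinMaxScaler()), ("model", LogisticRegression())])

folds_toy = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X_toy, y_toy, cv=folds_toy)  # accuracy per fold
print(round(scores.mean(), 3), round(scores.std(), 4))
```

Reporting the standard deviation next to the mean helps judge whether the differences between models are larger than the fold-to-fold variation.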

### Let's now tune the hyperparameters of the five best algorithms

In [None]:
pipeline_preprocessing = Pipeline([
    ('preprocessing' , preprocessing),
    ('scaler' ,        MinMaxScaler())
])
pipeline_preprocessing.fit(X)
X_preprocessed = pd.DataFrame(pipeline_preprocessing.transform(X))
y = pd.read_csv("/kaggle/input/titanic/train.csv")['Survived'].to_numpy()

In [None]:
models_list = { 'LogisticRegression': { 'model' : LogisticRegression(),
                                         'param_grid' : { 
                                                            'C'     : [0.001, 0.01, 0.1, 1.],
                                                         }},
                                'SVC': { 'model' : SVC(),
                                         'param_grid' : { 
                                                            "C": [0.001, 0.01, 0.1, 1.],
                                                            "kernel": ["linear", "poly", "rbf", "sigmoid"],
                                                            "gamma": ["scale", "auto"]
                                                         }},
                                'LinearSVC': { 'model' : LinearSVC(),
                                         'param_grid' : { 
                                                            'C' : [0.001, 0.01, 0.1, 1.],
                                                         }},
                                'Random Forest': { 'model' : RandomForestClassifier(),
                                         'param_grid' : { 
                                                            'n_estimators': [100, 200, 300],
                                                            # 'auto' was removed in scikit-learn 1.3 (it meant 'sqrt' for classifiers)
                                                            'max_features': ['sqrt', 'log2'],
                                                            'min_samples_split': [2,4,10],
                                                            'criterion' :['gini', 'entropy']
                                                         }},
                                'XGBoost': { 'model' : xgb.XGBClassifier(use_label_encoder=False, verbosity = 0),
                                         'param_grid' : { 
                                                            'max_depth': [3, 5, 7, 9], 
                                                            'n_estimators': [25, 50, 100, 150, 200, 300],
                                                            'learning_rate': [0.01, 0.05, 0.1]
                                                         }}}

In [None]:
%%time

results = {}

for model_name, model in models_list.items():
    best_model = GridSearchCV(estimator=model['model'], param_grid=model['param_grid'], cv= 10)
    best_model.fit(X_preprocessed, y)
    print(model_name)
    print(best_model.best_params_)
    print('Mean score : ', best_model.best_score_, ' Std : ', best_model.cv_results_['std_test_score'][best_model.best_index_])
    model_results = {'estimator' : best_model.best_estimator_, 'best_params' : best_model.best_params_, 
                     'mean' : best_model.best_score_, 'std' : best_model.cv_results_['std_test_score'][best_model.best_index_]}
    results[model_name] = model_results
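
A possible variation, not what this notebook does, is to run the grid search over the whole pipeline so that the scaler is refit inside each cross-validation fold. In that case the parameter names are prefixed with the step name. The sketch below uses synthetic data and hypothetical step names (`scaler`, `model`):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data standing in for the preprocessed features.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([("scaler", MinMaxScaler()), ("model", LogisticRegression())])

# "model__C" targets the C parameter of the pipeline step named "model".
search = GridSearchCV(pipe, param_grid={"model__C": [0.01, 0.1, 1.0]}, cv=5)
search.fit(X_toy, y_toy)

# Standard deviation of the test score for the best parameter combination.
best_std = search.cv_results_["std_test_score"][search.best_index_]
print(search.best_params_, round(search.best_score_, 3), round(best_std, 4))
```

With a min-max scaler the difference is usually small, but the same pattern matters more for steps that learn more from the data, such as imputers.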

In [None]:
results

### The five models all have very similar scores, so let's combine them in an ensemble classifier

In [None]:
# Collect the tuned models as a list of (name, estimator) pairs
estimators = []
for model_name, model in results.items():
    estimators.append((model_name, model['estimator']))
# Create the voting classifier from these models
ensemble = VotingClassifier(estimators, voting='hard')

In [None]:
# Fit the ensemble on the training data
ensemble.fit(X_preprocessed, y)
# Accuracy on the same training data: an optimistic estimate, not a test score
ensemble.score(X_preprocessed, y)
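
Because the score above is computed on the data the ensemble was fitted on, it tends to overstate performance. A less optimistic estimate can be obtained by holding out part of the labelled data. This is a minimal sketch on synthetic data with three arbitrary base models, which are assumptions of the sketch and not the tuned models from this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data standing in for the preprocessed training set.
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

# Hard voting: each model casts one vote and the majority class wins.
voter = VotingClassifier([
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svc", SVC()),
], voting="hard")
voter.fit(X_tr, y_tr)

train_acc = voter.score(X_tr, y_tr)   # accuracy on data seen during fitting
val_acc = voter.score(X_val, y_val)   # accuracy on held-out data
print(round(train_acc, 3), round(val_acc, 3))
```

Passing the ensemble to `cross_val_score` is another option when the dataset is too small to spare a validation split.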

# 7. Prediction

In [None]:
y_pred = ensemble.predict(X_pred_preprocessed).astype(int)
y_pred

In [None]:
output = pd.DataFrame({'PassengerId': pd.read_csv("/kaggle/input/titanic/test.csv").PassengerId, 'Survived': y_pred})
output.to_csv('my_submission.csv', index=False)
print("Your submission was successfully saved!")