### 1. Introduction
Our objective in this competition is to use machine learning to build a model that predicts which passengers survived the Titanic shipwreck.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

from collections import Counter

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve

sns.set(style='white', context='notebook', palette='deep')

### 2. Load data

In [None]:
# Load the train and test sets

train = pd.read_csv("../input/titanic/train.csv")
test = pd.read_csv("../input/titanic/test.csv")

In [None]:
train.head()

The outlier check below is taken from [this kernel](https://www.kaggle.com/yassineghouzam/titanic-top-4-with-ensemble-modeling?scriptVersionId=1416377&cellId=7); please go and check it out.

In [None]:
# Outlier detection 

def detect_outliers(df,n,features):
    """
    Takes a dataframe df of features and returns a list of the indices
    corresponding to the observations containing more than n outliers according
    to the Tukey method.
    """
    outlier_indices = []
    # iterate over features(columns)
    for col in features:
        # 1st quartile (25%)
        Q1 = np.percentile(df[col], 25)
        # 3rd quartile (75%)
        Q3 = np.percentile(df[col],75)
        # Interquartile range (IQR)
        IQR = Q3 - Q1
        
        # outlier step
        outlier_step = 1.5 * IQR
        
        # Determine a list of indices of outliers for feature col
        outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step )].index
        # append the found outlier indices for col to the list of outlier indices 
        outlier_indices.extend(outlier_list_col)
        
    # select observations containing more than n outliers
    outlier_indices = Counter(outlier_indices)        
    multiple_outliers = list( k for k, v in outlier_indices.items() if v > n )
    
    return multiple_outliers   

# detect outliers from Age, SibSp , Parch and Fare
Outliers_to_drop = detect_outliers(train,2,["Age","SibSp","Parch","Fare"])

I decided to detect outliers in the numerical features (Age, SibSp, Parch and Fare). I then treated as outliers the rows that have more than two outlying numerical values, since `detect_outliers` is called with `n = 2` and keeps rows whose outlier count exceeds `n`.
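
As a minimal sketch of the Tukey rule that `detect_outliers` applies to each column, the toy example below flags values outside the fences `Q1 - 1.5 * IQR` and `Q3 + 1.5 * IQR`. The array `values` and the helper name `tukey_outlier_mask` are made up for illustration and are not part of the notebook.

```python
import numpy as np

def tukey_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside the Tukey fences."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical toy data: eight ordinary values and one extreme value
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])
mask = tukey_outlier_mask(values)
flagged = values[mask]
print(flagged)
```

Here Q1 is 3 and Q3 is 7, so the fences are -3 and 13 and only the value 100 is flagged.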

In [None]:
train.loc[Outliers_to_drop] # Show the outliers rows

We detect 10 outliers. Passengers 28, 89 and 342 have a high ticket fare.

The 7 others have very high values of SibSp.

In [None]:
# Drop outliers
train = train.drop(Outliers_to_drop, axis = 0).reset_index(drop=True)

In [None]:
## Join train and test dfs in order to obtain the same number of features during categorical conversion
df =  pd.concat(objs=[train, test], axis=0).reset_index(drop=True)

Now let's check for missing values.

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

Age and Cabin account for a large share of the missing values.

In [None]:
train.dtypes

In [None]:
train.describe()

### 3. Feature analysis
#### 3.1 For numerical variables

In [None]:
# Correlation matrix between numerical variables
cor_numeric = sns.heatmap(train[["Survived","SibSp","Parch","Age","Fare"]].
                          corr(),annot=True, fmt = ".2f", cmap = "coolwarm")

Only the Fare feature seems to have a significant correlation with the survival probability.

#### Checking the survival probability for each value of the different features

#### SibSp

In [None]:
# Explore SibSp feature vs Survived
# (seaborn 0.9 renamed factorplot to catplot and its size argument to height)
g = sns.catplot(x="SibSp",y="Survived",data=train,kind="bar", height = 5 , 
palette = "muted")
g.despine(left=True)
g = g.set_ylabels("survival probability")

It seems that passengers with many siblings/spouses have a lower chance of survival, while passengers with few siblings/spouses have a higher chance.

#### Parch

In [None]:
# Explore Parch feature vs Survived
# (seaborn 0.9 renamed factorplot to catplot and its size argument to height)
g  = sns.catplot(x="Parch",y="Survived",data=train,kind="bar", height = 5 , 
palette = "muted")
g.despine(left=True)
g = g.set_ylabels("survival probability")

Passengers with small families have a better chance of survival.

#### Age

In [None]:
# Explore Age vs Survived
g = sns.FacetGrid(train, col='Survived')
g = g.map(sns.distplot, "Age")

The age distribution seems to be heavy-tailed.

We notice that the age distributions are not the same in the survived and not-survived subpopulations. Indeed, there is a peak corresponding to passengers between 0 and 5 who survived. We also see that passengers between 60 and 80 survived less often.

It seems that very young passengers have a better chance of survival.

#### Fare

In [None]:
# Explore Fare distribution 
g = sns.distplot(df["Fare"], color="b", label="Skewness : %.2f"%(df["Fare"].skew()))
g = g.legend(loc="best")

As we can see, the Fare distribution is very skewed. This can lead the model to overweight very high values, even if the feature is scaled.

Many financial models that attempt to predict the future performance of an asset assume a normal distribution, in which measures of central tendency are equal. If the data are skewed, this kind of model will always underestimate skewness risk in its predictions. The more skewed the data, the less accurate this financial model will be.

In this case, it is better to transform it with the log function to reduce this skew.

In [None]:
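# Apply log to Fare to reduce the skewness of its distribution.
# Zero fares stay at 0 and missing fares stay NaN so that they can still be imputed later.
df["Fare"] = df["Fare"].map(lambda i: np.log(i) if i > 0 else i)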

In [None]:
g = sns.distplot(df["Fare"], color="r", label="Skewness : %.2f"%(df["Fare"].skew()))
g = g.legend(loc="best")

Skewness is clearly reduced after the log transformation.
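
To illustrate the effect independently of the Titanic data, here is a minimal sketch on synthetic right-skewed values. The log-normal sample, its parameters and the variable names are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "fares" drawn from a log-normal distribution (not the real data)
fares = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))

skew_before = fares.skew()
skew_after = np.log(fares).skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Because the logarithm of a log-normal sample is normally distributed, the skewness after the transformation should be close to zero.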

#### 3.2 Categorical values

#### Sex

In [None]:
g = sns.barplot(x="Sex",y="Survived",data=train)
g = g.set_ylabel("Survival Probability")

Clearly, male passengers have a lower chance of survival than female passengers.
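
By default, the bar heights in such a plot are the group means of `Survived`, so the same comparison can be read as numbers with a `groupby`. A minimal sketch on a made-up toy frame (the values below are not the Titanic data):

```python
import pandas as pd

# Hypothetical toy data: three male and three female passengers
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Survived": [0, 0, 1, 1, 1, 0],
})

# Mean of a 0/1 column is the fraction of survivors in each group
survival_rate = toy.groupby("Sex")["Survived"].mean()
print(survival_rate)
```

On the real data, `train.groupby("Sex")["Survived"].mean()` would give the corresponding survival rates.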

#### Pclass

In [None]:
# Explore Pclass vs Survived
# (seaborn 0.9 renamed factorplot to catplot and its size argument to height)
g = sns.catplot(x="Pclass",y="Survived",data=train,kind="bar", height = 4 , 
palette = "muted")
g.despine(left=True)
g = g.set_ylabels("survival probability")

Passenger survival is not the same in the 3 classes. First class passengers have a better chance of survival than second class and third class passengers.

#### Embarked

In [None]:
# Explore Embarked vs Survived 
# (seaborn 0.9 renamed factorplot to catplot and its size argument to height)
g = sns.catplot(x="Embarked", y="Survived",  data=train,
                height=4, kind="bar", palette="muted")
g.despine(left=True)
g = g.set_ylabels("survival probability")

It seems that passengers who embarked at Cherbourg (C) have a better chance of survival.

### 4. Filling missing Values

In [None]:
df.isnull().sum()

The 418 missing Survived values come from the test data.

Since Embarked has two missing values, I decided to fill them with the most frequent value of "Embarked".

Since Fare has one missing value, I decided to fill it with the median value of "Fare".

In [None]:
#Fill Fare missing values with the median value
df["Fare"] = df["Fare"].fillna(df["Fare"].median())

#Fill Embarked nan values of df with most frequent value
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

In [None]:
df.isnull().sum()

As we can see, the Age column contains 256 missing values in the whole df.

Since there are subpopulations that have a better chance of survival (children, for example), it is preferable to keep the Age feature and impute the missing values.

To address this problem, I looked at the features most correlated with Age (Sex, Parch, Pclass and SibSp).

In [None]:
# convert Sex into categorical value 0 for male and 1 for female
df["Sex"] = df["Sex"].map({"male": 0, "female":1})

In [None]:
g = sns.heatmap(df[["Age","Sex","SibSp","Parch","Pclass"]].corr(),cmap="BrBG",annot=True)

Age is not correlated with Sex, but is negatively correlated with Pclass, Parch and SibSp. So I will use Pclass, Parch and SibSp to impute the missing values of Age.
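
The group-wise median imputation described above can also be sketched in vectorized form with `groupby(...).transform`. This is an illustrative alternative on a made-up toy frame, not the notebook's own implementation; the fallback to the overall median for groups with no known age is an assumption that mirrors the idea of the loop.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with three (SibSp, Parch, Pclass) groups
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 2],
    "SibSp":  [0, 0, 0, 1, 1, 0],
    "Parch":  [0, 0, 0, 2, 2, 0],
    "Age":    [40.0, 50.0, np.nan, 4.0, np.nan, np.nan],
})

# Overall median of the known ages, used when a group has no known age
overall_median = toy["Age"].median()

# Median age of each row's group, aligned with the original index
group_median = toy.groupby(["SibSp", "Parch", "Pclass"])["Age"].transform("median")

toy["Age"] = toy["Age"].fillna(group_median).fillna(overall_median)
print(toy)
```

The third row receives 45 (the median of 40 and 50), the fifth row receives 4, and the last row, whose group has no known age, falls back to the overall median of 40.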

In [None]:
# Filling missing value of Age 

## Fill Age with the median age of similar rows according to Pclass, Parch and SibSp
# Index of NaN age rows
index_NaN_age = list(df["Age"][df["Age"].isnull()].index)

for i in index_NaN_age :
    age_med = df["Age"].median()
    age_pred = df["Age"][((df['SibSp'] == df.iloc[i]["SibSp"]) & (df['Parch'] == df.iloc[i]["Parch"]) & (df['Pclass'] == df.iloc[i]["Pclass"]))].median()
    # df.loc writes into df directly; chained df['Age'].iloc[i] = ... may not
    if not np.isnan(age_pred) :
        df.loc[i, 'Age'] = age_pred
    else :
        df.loc[i, 'Age'] = age_med

Cabin has a lot of missing values, so I decided to drop this column.

In [None]:
# Drop Cabin variable
df.drop(labels = ["Cabin"], axis = 1, inplace = True)

### 5. Feature engineering

In [None]:
df["Name"].head()

The Name feature contains information on the passenger's title.

Since some passengers with a distinguished title may have been given priority during the evacuation, it is interesting to add the title to the model.
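
The title sits between the comma and the first period that follows it. As a minimal sketch, it can be pulled out with a regular expression via `Series.str.extract`, as an alternative to splitting the string. The three sample strings below follow the dataset's name format and are used only for illustration.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" format
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture everything after the comma up to (but not including) the next period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```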

In [None]:
# Get Title from Name
df_title = [i.split(",")[1].split(".")[0].strip() for i in df["Name"]]
df["Title"] = pd.Series(df_title)

In [None]:
g = sns.countplot(x="Title",data=df)
g = plt.setp(g.get_xticklabels(), rotation=45) 

There are 17 titles in the df. Most of them are very rare, and we can group them into 4 categories.

In [None]:
# Convert to categorical values Title 
df["Title"] = df["Title"].replace(['Lady', 'the Countess','Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
df["Title"] = df["Title"].map({"Master":0, "Miss":1, "Ms" : 1 , "Mme":1, "Mlle":1, "Mrs":1, "Mr":2, "Rare":3})
df["Title"] = df["Title"].astype(int)

In [None]:
# Drop Name variable
df.drop(labels = ["Name"], axis = 1, inplace = True)

In [None]:
# convert to indicator values Embarked 
df = pd.get_dummies(df, columns = ["Embarked"], prefix="Em")

In [None]:
df = pd.get_dummies(df, columns = ["Ticket"], prefix="T")

In [None]:
df.head()

### 6. Build Our Model

In [None]:
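## Separate train and test data

train_len = len(train)
# copy() makes train an independent frame, so later column assignments are safe
train = df[:train_len].copy()
# drop returns a new frame instead of modifying a slice of df in place
test = df[train_len:].drop(labels=["Survived"], axis=1)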

In [None]:
## Separate train features and label 

train["Survived"] = train["Survived"].astype(int)

Y_train = train["Survived"]

X_train = train.drop(labels = ["Survived"],axis = 1)

I compared 10 popular classifiers and evaluated the mean accuracy of each of them with a stratified k-fold cross-validation procedure.

* SVC
* Decision Tree
* AdaBoost
* Random Forest
* Extra Trees
* Gradient Boosting
* Multi-layer perceptron (neural network)
* KNN
* Logistic regression
* Linear Discriminant Analysis
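
As a minimal sketch of what stratified k-fold cross-validation computes for a single classifier, the example below scores a logistic regression on synthetic data. The data from `make_classification`, its size, and the `max_iter` setting are assumptions for illustration and do not reproduce the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification problem (not the Titanic data)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Each of the 10 folds keeps roughly the same class proportions as y
kfold = StratifiedKFold(n_splits=10)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring="accuracy", cv=kfold)

print(f"mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

The mean of the 10 fold scores is the figure compared across classifiers, and the standard deviation gives a rough idea of how much it varies between folds.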

In [None]:
# Cross validate model with Kfold stratified cross val
kfold = StratifiedKFold(n_splits=10)

In [None]:
# Modeling step: test different algorithms
random_state = 2
classifiers = []
classifiers.append(SVC(random_state=random_state))
classifiers.append(DecisionTreeClassifier(random_state=random_state))
classifiers.append(AdaBoostClassifier(DecisionTreeClassifier(random_state=random_state),random_state=random_state,learning_rate=0.1))
classifiers.append(RandomForestClassifier(random_state=random_state))
classifiers.append(ExtraTreesClassifier(random_state=random_state))
classifiers.append(GradientBoostingClassifier(random_state=random_state))
classifiers.append(MLPClassifier(random_state=random_state))
classifiers.append(KNeighborsClassifier())
classifiers.append(LogisticRegression(random_state = random_state))
classifiers.append(LinearDiscriminantAnalysis())

cv_results = []
for classifier in classifiers :
    cv_results.append(cross_val_score(classifier, X_train, y = Y_train, scoring = "accuracy", cv = kfold, n_jobs=4))

cv_means = []
cv_std = []
for cv_result in cv_results:
    cv_means.append(cv_result.mean())
    cv_std.append(cv_result.std())
    
cv_res = pd.DataFrame({"CrossValMeans":cv_means,"CrossValerrors": cv_std,"Algorithm":["SVC","DecisionTree","AdaBoost",
"RandomForest","ExtraTrees","GradientBoosting","MultipleLayerPerceptron","KNeighboors","LogisticRegression","LinearDiscriminantAnalysis"]})

# x and y are passed by keyword; recent seaborn versions no longer accept them positionally
g = sns.barplot(x="CrossValMeans",y="Algorithm",data = cv_res, palette="Set3",orient = "h",**{'xerr':cv_std})
g.set_xlabel("Mean Accuracy")
g = g.set_title("Cross validation scores")

I decided to choose RandomForestClassifier for the ensemble modeling.

In [None]:
# Create a RandomForestClassifier
rfc = RandomForestClassifier(random_state=random_state)

# Train the model on the training set
rfc.fit(X_train, Y_train)

# Predict survival for the test set
y_pred = rfc.predict(test)

In [None]:
## Create the submission file and submit
pred = pd.DataFrame(y_pred)
submission = pd.read_csv("../input/titanic/gender_submission.csv")
df = pd.concat([submission["PassengerId"], pred], axis = 1)
df.columns = ["PassengerId", "Survived"]
df.to_csv("gender_submission_rfc.csv", index=False)