# DIVE INTO KAGGLE COMPETITIONS
## Titanic: Machine Learning from Disaster

**Loading the data**

In [1]:
import pandas as pd

train_df = pd.read_csv('datas/train.csv')
test_df = pd.read_csv('datas/test.csv')

Let's take a look at the datasets:

In [2]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


We notice that *Age*, *Cabin* and *Embarked* features contain empty values.<br>
What about the test set?

In [3]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


Here, we have the same thing **plus** a missing value for the *Fare* column.<br>
Before cleaning the datas, we can already decide that *Cabin* attribute will be dropped, as it's poorly filled.

We can now take a look at the format of each feature:

In [4]:
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


*Ticket* feature doesn't seem to be helpful for this problem. 
Consequently, it looks like a good idea to drop it, like the *Cabin* feature.<br>

We can first process *Sex* by replacing *male* by **0** and *female* by **1**.

In [5]:
def processSex(dataset):
    dataset.Sex = dataset.Sex.map({'female': 0, 'male': 1}).astype(int)
    return dataset

train_df = processSex(train_df)
test_df = processSex(test_df)

The *Name* feature can be processed. We can get something interesting by extracting the passenger's title, using **regular expressions**.<br>
Then, we'll see if it helps us getting a better score or not.

We can first take a look at the different titles in the datasets:

In [6]:
train_df['Title'] = train_df.Name.str.extract('([\w]+\.)', expand=False)
train_df.Title.value_counts()

Mr.          517
Miss.        182
Mrs.         125
Master.       40
Dr.            7
Rev.           6
Mlle.          2
Major.         2
Col.           2
Don.           1
Mme.           1
Lady.          1
Jonkheer.      1
Capt.          1
Countess.      1
Ms.            1
Sir.           1
Name: Title, dtype: int64

In [7]:
test_df['Title'] = test_df.Name.str.extract('([\w]+\.)', expand=False)
test_df.Title.value_counts()

Mr.        240
Miss.       78
Mrs.        72
Master.     21
Col.         2
Rev.         2
Dr.          1
Ms.          1
Dona.        1
Name: Title, dtype: int64

*Mr*, *Miss*, *Mrs* and *Master* are the main titles among the passengers.
The ohters are quite rare. We have two choice:
1. Replace all rare titles by an unique label
2. **Replace rare titles by their most related main title** <-- our choice

In [8]:
pd.crosstab(train_df.Sex, train_df.Title)

Title,Capt.,Col.,Countess.,Don.,Dr.,Jonkheer.,Lady.,Major.,Master.,Miss.,Mlle.,Mme.,Mr.,Mrs.,Ms.,Rev.,Sir.
Sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
0,0,0,1,0,1,0,1,0,0,182,2,1,0,125,1,0,0
1,1,2,0,1,6,1,0,2,40,0,0,0,517,0,0,6,1


Thanks to this cross-tabulation and some researchs, we can suppose the following:<br>
*Capt*, *Col*, *Don*, *Jonkheer*, *Major*, *Rev* and *Sir* are related to *Mr*<br>
*Countess*, *Lady* and *Mme* are related to *Mrs*<br>
*Mlle* and *Ms* are related to *Miss*<br>

However, *Dr* title is assigned to both genders. We need process it differently.<br>
Don't forget there's a different title in the test set: *Dona*, which is related to *Mrs*.

In [9]:
def replaceTitle(dataset):
    dataset['Title'] = dataset.Title.replace(['Capt.', 'Col.', 'Don.', 'Jonkheer.',
                                              'Major.', 'Rev.', 'Sir.'], 'Mr.')
    dataset['Title'] = dataset.Title.replace(['Countess.', 'Dona.', 'Lady.', 'Mme.'], 'Mrs.')
    dataset['Title'] = dataset.Title.replace(['Mlle.', 'Ms.'], 'Miss.')
    
    temp = dataset.loc[(dataset.Sex == 1) & (dataset.Title == 'Dr.'), 'Title']
    dataset.Title = dataset.Title.replace(temp, 'Mrs.')
    temp = dataset.loc[(dataset.Sex == 0) & (dataset.Title == 'Dr.'), 'Title']
    dataset.Title = dataset.Title.replace(temp, 'Mr.')

    return dataset

train_df = replaceTitle(train_df)
test_df = replaceTitle(test_df)

train_df.Title.value_counts()

Mr.        532
Miss.      185
Mrs.       134
Master.     40
Name: Title, dtype: int64

In [10]:
test_df.Title.value_counts()

Mr.        244
Miss.       79
Mrs.        74
Master.     21
Name: Title, dtype: int64

Great! Now we can add the *Name* feature to the drop list.

In [11]:
drop_list = ['Cabin', 'Ticket', 'Name']

Now, it could be a wise idea to merge *SibSp* and *Parch* into a single feature.<br>
As *SibSp* is equivalent to the number of **Sib**lings/**Sp**ouses of a given passenger, and *Parch* the number of its **Par**ents/**ch**ildren aboard. We have two solutions:<br>
1. Create a new feature called *Alone* which is **1** if *SibSp* and *Parch* are both equal to 0, and **0** otherwise.
2. Create a new feature called *FamilySize* which is the sum of *SibSp* and *Parch* for each row.

In [12]:
pd.crosstab(train_df.Survived, ((train_df.SibSp == 0) & (train_df.Parch == 0)), normalize=True)

col_0,False,True
Survived,Unnamed: 1_level_1,Unnamed: 2_level_1
0,0.196409,0.419753
1,0.200898,0.182941


In [13]:
pd.crosstab(train_df.Survived, (train_df.SibSp + train_df.Parch), normalize=True)

col_0,0,1,2,3,4,5,6,7,10
Survived,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
0,0.419753,0.080808,0.04826,0.008979,0.013468,0.021324,0.008979,0.006734,0.007856
1,0.182941,0.099888,0.066218,0.023569,0.003367,0.003367,0.004489,0.0,0.0


Your chances of surviving if you're alone are clearly lowest than if you're not. But we can see a slight difference if we pay attention to the family size.<br>
Both solution give almost the same score, but paying attention to the family size seems to perform better at the end, so we're going to use this strategy.

In [14]:
def processFamilySize(dataset):
    dataset['FamilySize'] = dataset.SibSp + dataset.Parch
    return dataset

train_df = processFamilySize(train_df)
test_df = processFamilySize(test_df)

Nice! Let's hope that will help getting a better score.<br>
Don't forget to add *SibSp* and *Parch* to the drop list:

In [15]:
# drop_list.extend(('SibSp', 'Parch))
drop_list = ['Cabin', 'Ticket', 'Name', 'SibSp', 'Parch']

Ok. Now this done, go back on the training set and see where to go next.

In [16]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null int32
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
Title          891 non-null object
FamilySize     891 non-null int64
dtypes: float64(2), int32(1), int64(6), object(5)
memory usage: 94.1+ KB


*Age* and *Fare* need to be processed.<br>
Both need to categorized in order to reduce the number of features when we'll apply a one hot encoding.<br>
Even if it misses only one fare in the test set, we'll use the same strategy for the both features.<br>

We can begin with the *Age* feature.<br>
**First**, we're going to fill NaN values, **then** we'll categorize this feature.<br><br>
There are 177 values missing in the training set, and 86 in the test set.<br>
We could simply replace these values by the global median, but we observe a different median age according to the class:

In [17]:
for n in [1,2,3]:
    medianAge = train_df.Age[train_df.Pclass == n].median(skipna=True)
    print('Median age of Class #{}: {} years'.format(n, medianAge))

Median age of Class #1: 37.0 years
Median age of Class #2: 29.0 years
Median age of Class #3: 24.0 years


Ok! That will be the first step of our function **processAge**.<br><br>
Now we're going to create a new feature called *AgeCat*.<br>
We use quantile method to determine intervals, and divide *Age* into 6 categories. (I've selected 6 after testing several values and observed a better score with this number of categories)

In [18]:
def processAge(dataset):
    for n in [1,2,3]:
        medianAge = dataset.Age[dataset.Pclass == n].median(skipna=True)
        dataset.loc[(pd.isnull(dataset.Age)) & (dataset.Pclass == n), 'Age'] = medianAge
    dataset['AgeCat'] = pd.qcut(dataset.Age, 6, labels=[0, 1, 2, 3, 4, 5])
    return dataset
        
train_df = processAge(train_df)
test_df = processAge(test_df)

train_df[['Age', 'AgeCat']].head(10)

Unnamed: 0,Age,AgeCat
0,22.0,1
1,38.0,4
2,26.0,2
3,35.0,4
4,35.0,4
5,24.0,1
6,54.0,5
7,2.0,0
8,27.0,3
9,14.0,0


Let's do the same thing for the *Fare* feature.<br>
(Here again, I've chosen to divide into 4 categories because it scores better with this number).

In [19]:
def processFare(dataset):
    for c in [1, 2, 3]:
        medianFare = dataset.Fare[dataset.Pclass == c].median(skipna=True)
        dataset.loc[(pd.isnull(dataset.Fare)) & (dataset.Pclass == c), 'Fare'] = medianFare
    dataset['FareCat'] = pd.qcut(dataset.Fare, 4, labels=[0, 1, 2, 3])
    return dataset

train_df = processFare(train_df)
test_df = processFare(test_df)

train_df[['Fare', 'FareCat']].head(10)

Unnamed: 0,Fare,FareCat
0,7.25,0
1,71.2833,3
2,7.925,1
3,53.1,3
4,8.05,1
5,8.4583,1
6,51.8625,3
7,21.075,2
8,11.1333,1
9,30.0708,2


Prefect! Don't forget to add *Age* and *Fare* to the drop list.

In [20]:
# drop_list.extend(('Age', 'Fare'))
drop_list = ['Cabin', 'Ticket', 'Name', 'SibSp', 'Parch', 'Age', 'Fare']

To finish cleaning the data, we need to fill the two missing values in the *Embarked* column.<br>
Passengers mostly embarked in Southampton.

In [21]:
train_df.Embarked.value_counts(normalize=True)

S    0.724409
C    0.188976
Q    0.086614
Name: Embarked, dtype: float64

Plus, it seems there is no link between *Embarked* and the other features.<br>
So the simplest thing to do is fitting the empty values by 'S'.<br>

In [22]:
train_df.Embarked = train_df.Embarked.fillna('S')

We can now **finally** drop all the features we don't need anymore.

In [23]:
train_df = train_df.drop(drop_list, axis=1)
test_df = test_df.drop(drop_list, axis=1)

train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null int32
Embarked       891 non-null object
Title          891 non-null object
FamilySize     891 non-null int64
AgeCat         891 non-null category
FareCat        891 non-null category
dtypes: category(2), int32(1), int64(4), object(2)
memory usage: 47.5+ KB


In [24]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 8 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Sex            418 non-null int32
Embarked       418 non-null object
Title          418 non-null object
FamilySize     418 non-null int64
AgeCat         418 non-null category
FareCat        418 non-null category
dtypes: category(2), int32(1), int64(3), object(2)
memory usage: 19.2+ KB


Nice :-)<br>
Problem with label encoding is that it assumes higher the categorical value, better the category.<br>
The solution consists in **One Hot encoding**.

In [25]:
dummies_cols = ['Pclass', 'Embarked', 'Title', 'FamilySize', 'AgeCat', 'FareCat']
pd.get_dummies(train_df, columns=dummies_cols).head(10)

Unnamed: 0,PassengerId,Survived,Sex,Pclass_1,Pclass_2,Pclass_3,Embarked_C,Embarked_Q,Embarked_S,Title_Master.,...,AgeCat_0,AgeCat_1,AgeCat_2,AgeCat_3,AgeCat_4,AgeCat_5,FareCat_0,FareCat_1,FareCat_2,FareCat_3
0,1,0,1,0,0,1,0,0,1,0,...,0,1,0,0,0,0,1,0,0,0
1,2,1,0,1,0,0,1,0,0,0,...,0,0,0,0,1,0,0,0,0,1
2,3,1,0,0,0,1,0,0,1,0,...,0,0,1,0,0,0,0,1,0,0
3,4,1,0,1,0,0,0,0,1,0,...,0,0,0,0,1,0,0,0,0,1
4,5,0,1,0,0,1,0,0,1,0,...,0,0,0,0,1,0,0,1,0,0
5,6,0,1,0,0,1,0,1,0,0,...,0,1,0,0,0,0,0,1,0,0
6,7,0,1,1,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,1
7,8,0,1,0,0,1,0,0,1,1,...,1,0,0,0,0,0,0,0,1,0
8,9,1,0,0,0,1,0,0,1,0,...,0,0,0,1,0,0,0,1,0,0
9,10,1,0,0,1,0,1,0,0,0,...,1,0,0,0,0,0,0,0,1,0


That's what it looks like. However, we don't need *PassengerId* to train the datas. Also, *Survived* corresponding to the label.<br><br>
So we first define *X_train*, *y_train* and *X_test*

In [26]:
X_train = train_df.copy()
X_train = pd.get_dummies(X_train, columns=dummies_cols)
X_train = X_train.drop(['PassengerId', 'Survived'], axis=1)

y_train = train_df.Survived

X_test = test_df.copy()
X_test = pd.get_dummies(X_test, columns=dummies_cols)
X_test = X_test.drop('PassengerId', axis=1)

Ok! We're now ready to do some Machine Learning.
First, we import some classifiers and metrics.

In [28]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

classifiers = {'KNN': KNeighborsClassifier(3),
               'SVC': SVC(probability=True),
               'Decision Tree': DecisionTreeClassifier(),
               'Random Forest': RandomForestClassifier(),
               'AdaBoost': AdaBoostClassifier(),
               'Gradient Boosting': GradientBoostingClassifier(),
               'Naive Bayes': GaussianNB()}

Which classifier will be the best?

In [30]:
for classifier in classifiers:
    model = classifiers[classifier]
    model.fit(X_train, y_train)
    score = cross_val_score(model, X_train, y_train, cv=10)
    print(classifier, score.mean())

KNN 0.8070865963000793
SVC 0.8136409034161842
Decision Tree 0.8080853478606288
Random Forest 0.8092214277607536
AdaBoost 0.8204202133696515
Gradient Boosting 0.8316947565543071
Naive Bayes 0.42084752014527294


The Gradient Boosting Classifier gives the higher score so we're going to use it to make our predictions on the test set.

In [32]:
model = GradientBoostingClassifier()
model.fit(X_train, y_train)

y_pred = model.predict(X_test).astype(int)

submit = pd.DataFrame({'PassengerId': test_df['PassengerId'], 'Survived': y_pred})
submit.to_csv('datas/submit4.csv', index=False)