## A deep dive with Titanic

A notebook that builds a submission for the Titanic competition on Kaggle. The steps involved are:

#### 1. Data Cleaning
#### 2. Deal with missing data
#### 3. Descriptive Statistics / Data Visualization
#### 4. Prediction - stacking, ensemble learning (I don't know much about this part yet)

### Importing Packages

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import re

### Importing Data

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
full_data = [train, test]

In [3]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Feature Extraction / Engineering

We want to convert the non-numerical data into numbers and encode the categorical data so that a model can use it.

__PassengerId__ and __Survived__ need no processing.

Starting with **Pclass**: it is already a numerical variable, so let's check it for missing values.

In [4]:
def is_null(column_name):
    """ Prints the number of null values in a column for the train and test sets """
    print(f"{train[column_name].isnull().sum()} null values out of {len(train)} in training data.")
    print(f"{test[column_name].isnull().sum()} null values out of {len(test)} in test data.")

In [5]:
is_null('Pclass')

0 null values out of 891 in training data.
0 null values out of 418 in test data.
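
As a complement to `is_null`, pandas can report the null count of every column in a single call. The sketch below runs on a small made-up frame (`toy` is a hypothetical stand-in for `train`, not the competition data):

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame standing in for the training data
toy = pd.DataFrame({
    'Pclass': [3, 1, 3],
    'Age': [22.0, np.nan, 26.0],
    'Cabin': [np.nan, 'C85', np.nan],
})

# isnull() marks each missing cell; sum() counts them per column
null_counts = toy.isnull().sum()
print(null_counts)
```

Running the same call on `train` and `test` would give an overview of which columns need attention before going through them one by one.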


Moving on to __Name__. Let's create two features out of it: (1) the length of the name and (2) the title in the name.

In [6]:
#Looking for null
is_null('Name')

0 null values out of 891 in training data.
0 null values out of 418 in test data.


In [7]:
#Length of the name
for dataset in full_data:
    dataset['NameLength'] = dataset['Name'].apply(len)

__Title__ extraction

In [8]:
def title(Name):
    """ Extracts title from a string """
    # Raw string, so that \. is a regex escape rather than an invalid Python string escape
    title_search = re.search(r' ([A-Za-z]+)\.', Name)
    if title_search:
        return title_search.group(1)
    return ""

In [9]:
for dataset in full_data:
    dataset['Title'] = dataset['Name'].apply(title)
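
The same pattern can probably be applied without a helper function through the vectorised `Series.str.extract`. This is only a sketch of an alternative, not what the notebook runs; note that unmatched names would come back as `NaN` instead of the empty string returned by `title`:

```python
import pandas as pd

# Sample names copied from the first rows of train.head()
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# Same pattern as title(): a space, a run of letters, then a literal dot
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)
print(titles.tolist())
```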

Let us have a look at the extracted titles

In [10]:
train['Title'].value_counts()

Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Col           2
Major         2
Mlle          2
Countess      1
Ms            1
Lady          1
Jonkheer      1
Don           1
Mme           1
Capt          1
Sir           1
Name: Title, dtype: int64

Now we'll merge equivalent titles into a single category (for example 'Mlle' into 'Miss') and put the rare ones into an 'Other' category

In [11]:
for dataset in full_data:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Other')

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

In [12]:
train['Title'].value_counts()

Mr        517
Miss      185
Mrs       126
Master     40
Other      23
Name: Title, dtype: int64
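
The counts add up as expected: Miss is 182 + 2 (Mlle) + 1 (Ms) = 185, Mrs is 125 + 1 (Mme) = 126, and the ten rare titles present in the training data sum to 23. The grouping could also be written as one dictionary passed to `replace`; the sketch below uses a short made-up Series, and the names `RARE_TITLES` and `TITLE_MAP` are hypothetical:

```python
import pandas as pd

# Hypothetical names for the grouping used in the cell above
RARE_TITLES = ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
               'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona']
TITLE_MAP = {rare: 'Other' for rare in RARE_TITLES}
TITLE_MAP.update({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})

# Made-up sample of raw titles
raw = pd.Series(['Mr', 'Mlle', 'Dr', 'Mme', 'Ms', 'Master'])

# Titles missing from the mapping (Mr, Master, ...) are left unchanged
grouped = raw.replace(TITLE_MAP)
print(grouped.tolist())
```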

Moving on to __Sex__

In [13]:
is_null('Sex')

0 null values out of 891 in training data.
0 null values out of 418 in test data.


Moving on to __Age__

In [14]:
is_null('Age')

177 null values out of 891 in training data.
86 null values out of 418 in test data.


Let us use a Random Forest to predict the missing age values. We'll do it after the other features are defined, since they serve as its inputs.

Moving on to __SibSp__  and __Parch__. We are going to define *Family_size* and *Alone* using these.

In [15]:
is_null('SibSp')
is_null('Parch')

0 null values out of 891 in training data.
0 null values out of 418 in test data.
0 null values out of 891 in training data.
0 null values out of 418 in test data.


In [16]:
for dataset in full_data:
    dataset['Family_size'] = dataset['SibSp'] + dataset['Parch'] + 1
    dataset['Alone'] = 0
    dataset.loc[dataset['Family_size'] == 1, 'Alone'] = 1
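
A small worked example of the two features, on made-up rows (`toy` is hypothetical). The `+ 1` counts the passenger themselves, and `Alone` is written here as a boolean comparison cast to int, which should be equivalent to the two-step assignment above:

```python
import pandas as pd

# Made-up passengers: one with a sibling/spouse, one travelling solo, one in a big family
toy = pd.DataFrame({'SibSp': [1, 0, 3], 'Parch': [0, 0, 2]})

# Siblings/spouses + parents/children + the passenger
toy['Family_size'] = toy['SibSp'] + toy['Parch'] + 1

# 1 when the passenger has no relatives aboard, else 0
toy['Alone'] = (toy['Family_size'] == 1).astype(int)
print(toy)
```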

Moving on to __Ticket__. We won't use it for now; it is dropped later along with the other unused columns.

In [17]:
#train.head()

Moving on to __Fare__.

In [18]:
is_null('Fare')

0 null values out of 891 in training data.
1 null values out of 418 in test data.


In [19]:
#Fill missing data with median
test['Fare'] = test['Fare'].fillna(test['Fare'].median())

Moving on to __Cabin__.

In [20]:
is_null('Cabin')

687 null values out of 891 in training data.
327 null values out of 418 in test data.


Let's use the deck (the first letter of the cabin) for each passenger and assign 'U' as the deck wherever the cabin is missing.

In [21]:
for dataset in full_data:
    dataset['Deck'] = dataset['Cabin'].str[0]
    dataset['Deck'] = dataset['Deck'].fillna('U')
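
To illustrate what this cell does, here is the same transformation on a few cabin values taken from the rows shown earlier, plus one missing entry (`cabins` is a made-up Series):

```python
import numpy as np
import pandas as pd

# Cabin values from train.head() and the Embarked rows, with one missing entry
cabins = pd.Series(['C85', np.nan, 'B28', 'C123'])

# .str[0] keeps the first character and propagates NaN, which fillna then replaces
deck = cabins.str[0].fillna('U')
print(deck.tolist())
```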

Moving on to __Embarked__.

In [22]:
is_null("Embarked")

2 null values out of 891 in training data.
0 null values out of 418 in test data.


In [23]:
train.loc[train['Embarked'].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,NameLength,Title,Family_size,Alone,Deck
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,,19,Miss,1,1,B
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,,41,Mrs,1,1,B


Compare with passengers who share similar features and take a guess.

In [24]:
# Single boolean mask instead of chained [] filters (which raise a reindexing warning)
mask = (train['Survived']==1) & (train['Pclass']==1) & (train['Sex']=='female') & (train['Deck']=='B') & (train['Alone']==1)
train.loc[mask, 'Embarked'].value_counts()

S    6
C    4
Name: Embarked, dtype: int64

In [25]:
train['Embarked'] = train['Embarked'].fillna('S')

Let us drop the unnecessary columns

In [26]:
drop_columns = ['PassengerId',
 'Name',
 'SibSp',
 'Parch',
 'Ticket',
 'Cabin']

In [27]:
# for dataset in full_data:
#     dataset = dataset.drop(drop_columns,axis=1)

In [28]:
train = train.drop(drop_columns,axis=1)
test = test.drop(['Name','SibSp','Parch','Ticket','Cabin'],axis=1)
#test.head()

In [29]:
#train.head()

### Conversion of categorical variables into numerical

The categorical variables are Sex, Embarked, Title and Deck. We have to convert them to numerical codes.

In [30]:
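from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

categorical = ['Sex', 'Embarked', 'Title', 'Deck']
for col in categorical:
    # Caveat: fitting on each frame separately can give the same category
    # different codes in train and test when their category sets differ
    test[col] = encoder.fit_transform(test[col])
    train[col] = encoder.fit_transform(train[col])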

train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked,NameLength,Title,Family_size,Alone,Deck
0,0,3,1,22.0,7.25,2,23,2,2,0,8
1,1,1,0,38.0,71.2833,0,51,3,2,0,2
2,1,3,0,26.0,7.925,2,22,1,1,1,8
3,1,1,0,35.0,53.1,2,44,3,2,0,2
4,0,3,1,35.0,8.05,2,24,2,1,1,8


In [31]:
test.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,Fare,Embarked,NameLength,Title,Family_size,Alone,Deck
0,892,3,1,34.5,7.8292,1,16,2,1,1,7
1,893,3,0,47.0,7.0,2,32,3,2,0,7
2,894,2,1,62.0,9.6875,1,25,2,1,1,7
3,895,3,1,27.0,8.6625,2,16,2,1,1,7
4,896,3,0,22.0,12.2875,2,44,3,3,0,7
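
One thing to watch in the two outputs above: passengers without a cabin get Deck 8 in the training data but Deck 7 in the test data, which suggests the two sets do not contain the same deck letters and therefore receive different codes. A possible way around this, sketched here on made-up Series rather than the notebook's frames, is to fit one `LabelEncoder` on the values of both sets and only then transform each of them:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up deck letters; 'T' appears only in the first Series
train_deck = pd.Series(['U', 'C', 'T', 'B'])
test_deck = pd.Series(['U', 'B', 'C'])

# Fit once on the union of both sets so every letter has a single code
shared_encoder = LabelEncoder().fit(pd.concat([train_deck, test_deck]))

train_codes = shared_encoder.transform(train_deck)
test_codes = shared_encoder.transform(test_deck)
print(list(shared_encoder.classes_), list(train_codes), list(test_codes))
```

With a shared encoder, 'U' maps to the same integer in both outputs, so a model trained on one set sees consistent values in the other.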


Now we have to deal with the *missing* __Age__ values. Let's fit a __Random Forest__ regressor on the rows where Age is known and use it to predict the rest.

In [32]:
from sklearn.ensemble import RandomForestRegressor

def fill_missing_age(df):
    """ Fills missing Age values with predictions from a random forest """
    # Age is the target (first column); the remaining columns are the features
    age_df = df[['Age','Pclass','Sex','Fare','Embarked','NameLength','Title','Family_size','Alone','Deck']]

    # Split rows by whether Age is known (local names avoid shadowing the global train/test)
    known = age_df.loc[df.Age.notnull()]
    unknown = age_df.loc[df.Age.isnull()]

    # All known age values are stored in a target array
    y = known.values[:, 0]

    # All the other values are stored in the feature array
    X = known.values[:, 1:]

    # Create and fit a model
    rtr = RandomForestRegressor(n_estimators=2000, n_jobs=-1)
    rtr.fit(X, y)

    # Use the fitted model to predict the missing values
    predicted_ages = rtr.predict(unknown.values[:, 1:])

    # Assign those predictions to the full data set
    df.loc[df.Age.isnull(), 'Age'] = predicted_ages

    return df

In [33]:
train = fill_missing_age(train)
test = fill_missing_age(test)
is_null('Age')

0 null values out of 891 in training data.
0 null values out of 418 in test data.


#### Create family size bins

In [34]:
for dataset in [train,test]:
    dataset.loc[dataset['Family_size'] == 1, 'FamilyBin'] = 'single'
    dataset.loc[(dataset['Family_size'] > 1) & (dataset['Family_size'] < 5), 'FamilyBin'] = 'small'
    dataset.loc[dataset['Family_size'] > 4, 'FamilyBin'] = 'large'
    
# Numerical encoding of the bins
for dataset in [train,test]:
    print(dataset['FamilyBin'].value_counts())
    dataset['FamilyBin'] = encoder.fit_transform(dataset['FamilyBin'])
    print(dataset['FamilyBin'].value_counts())

single    537
small     292
large      62
Name: FamilyBin, dtype: int64
1    537
2    292
0     62
Name: FamilyBin, dtype: int64
single    253
small     145
large      20
Name: FamilyBin, dtype: int64
1    253
2    145
0     20
Name: FamilyBin, dtype: int64
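
The three `.loc` assignments could also be expressed with `pd.cut`. The sketch below is an assumed alternative on made-up family sizes, using right-inclusive edges so that 1 is 'single', 2 to 4 is 'small' and anything above 4 is 'large':

```python
import numpy as np
import pandas as pd

# Made-up family sizes covering each bin and the boundaries
family_size = pd.Series([1, 2, 4, 5, 11])

# Intervals are (0, 1], (1, 4] and (4, inf)
family_bin = pd.cut(family_size, bins=[0, 1, 4, np.inf],
                    labels=['single', 'small', 'large'])
print(family_bin.tolist())
```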


## Now the data is ready for prediction

In [35]:
train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked,NameLength,Title,Family_size,Alone,Deck,FamilyBin
0,0,3,1,22.0,7.25,2,23,2,2,0,8,2
1,1,1,0,38.0,71.2833,0,51,3,2,0,2,2
2,1,3,0,26.0,7.925,2,22,1,1,1,8,1
3,1,1,0,35.0,53.1,2,44,3,2,0,2,2
4,0,3,1,35.0,8.05,2,24,2,1,1,8,1


In [36]:
test.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,Fare,Embarked,NameLength,Title,Family_size,Alone,Deck,FamilyBin
0,892,3,1,34.5,7.8292,1,16,2,1,1,7,1
1,893,3,0,47.0,7.0,2,32,3,2,0,7,2
2,894,2,1,62.0,9.6875,1,25,2,1,1,7,1
3,895,3,1,27.0,8.6625,2,16,2,1,1,7,1
4,896,3,0,22.0,12.2875,2,44,3,3,0,7,2


In [37]:
train.corr()['Survived']

Survived       1.000000
Pclass        -0.338481
Sex           -0.543351
Age           -0.088672
Fare           0.257307
Embarked      -0.167675
NameLength     0.332350
Title         -0.071174
Family_size    0.016639
Alone         -0.203367
Deck          -0.301116
FamilyBin      0.283810
Name: Survived, dtype: float64

### Write the data to CSV files

In [38]:
train.to_csv('./Clean/Titanic_Train_Clean.csv')
test.to_csv('./Clean/Titanic_Test_Clean.csv')
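
By default `to_csv` also writes the row index, which shows up as an `Unnamed: 0` column when the file is read back. If that column is not wanted, passing `index=False` avoids it. The sketch below demonstrates the difference with an in-memory buffer and a made-up frame instead of the real files:

```python
import io
import pandas as pd

# Made-up frame standing in for the cleaned data
df = pd.DataFrame({'Survived': [0, 1], 'Pclass': [3, 1]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_clean = list(pd.read_csv(without_index).columns)

print(cols_default, cols_clean)
```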