On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. That translates to a survival rate of about 32%.
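The survival rate follows directly from the two figures quoted above. A short check of the arithmetic (the variable names are only for illustration):

```python
# Figures quoted in the text above.
total_aboard = 2224
deaths = 1502

survivors = total_aboard - deaths          # 722
survival_rate = survivors / total_aboard   # roughly 0.32

print(f"{survivors} survivors, survival rate {survival_rate:.1%}")
```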


Let's analyse which features correlate with survival and build a model to predict survival for the test set.

In [1]:
# data analysis
import pandas as pd
import numpy as np
import random

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

### Import the data into pandas from CSV files

In [2]:
train_df = pd.read_csv('../input/train.csv')
test_df = pd.read_csv('../input/test.csv')
combine = [train_df, test_df]

Which features does the dataset contain, and what do they mean?

In [3]:
print(train_df.columns.values)

['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
 'Ticket' 'Fare' 'Cabin' 'Embarked']

In [4]:
# summary statistics of the numeric columns
train_df.describe()

    Variable	Definition	Key
    survival	Survival	0 = No, 1 = Yes
    pclass	Ticket class	1 = 1st, 2 = 2nd, 3 = 3rd
    sex	Sex	
    Age	Age in years	
    sibsp	# of siblings / spouses aboard the Titanic	
    parch	# of parents / children aboard the Titanic	
    ticket	Ticket number	
    fare	Passenger fare	
    cabin	Cabin number	
    embarked	Port of Embarkation	C = Cherbourg, Q = Queenstown, S = Southampton

In [5]:
train_df.head()

### Let's make more values numerical by transforming:
    Sex: Male: 0, Female: 1
    Embarked: C: 0, S: 1, Q: 2


In [6]:
for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
    

train_df.head()
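One thing to keep in mind with `Series.map` and a dictionary: any label that is not a key becomes NaN, and the following `.astype(int)` then fails. A minimal sketch on made-up data (the `toy` frame and the `'Male'` spelling variant are assumptions for illustration, not values from the Titanic file):

```python
import pandas as pd

# Toy data, not the Titanic file: includes one unexpected label on purpose.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "Male"]})

sex_codes = {"male": 0, "female": 1}
mapped = toy["Sex"].map(sex_codes)

# Labels missing from the dictionary become NaN, so count them before casting.
n_unmapped = int(mapped.isna().sum())
print("unmapped labels:", n_unmapped)

# Normalise the labels first, then the cast to int is safe.
toy["Sex"] = toy["Sex"].str.lower().map(sex_codes).astype(int)
print(toy["Sex"].tolist())
```

Checking the count of unmapped labels before the cast gives a clearer error than the one raised by `astype(int)`.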

In [7]:
train_df.isna().sum()


### The Embarked column has only two missing values. Let's fill them with the most common port

In [8]:
freq_port = train_df.Embarked.dropna().mode()[0]



In [9]:
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)
    dataset['Embarked'] = dataset['Embarked'].map( {'C': 0, 'S': 1, 'Q': 2} ).astype(int)
    
train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)
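The fill-then-encode step above can be tried in isolation. This is a sketch on two small hypothetical frames (`toy_train` and `toy_test` are stand-ins, not the real data); the most common port is taken from the training frame and reused for both:

```python
import pandas as pd

# Toy frames standing in for train/test (hypothetical values).
toy_train = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q"]})
toy_test = pd.DataFrame({"Embarked": ["C", None, "S"]})

# mode() ignores missing values by default and returns a Series; take the first entry.
most_common_port = toy_train["Embarked"].mode()[0]

port_codes = {"C": 0, "S": 1, "Q": 2}
for frame in (toy_train, toy_test):
    filled = frame["Embarked"].fillna(most_common_port)
    frame["Embarked"] = filled.map(port_codes).astype(int)

print(most_common_port, toy_train["Embarked"].tolist(), toy_test["Embarked"].tolist())
```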



In [10]:
train_df.head()

### Now let's look at survival in terms of age and sex

In [11]:
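# numeric_only=True skips text columns such as Name, which recent pandas versions refuse to average
train_df.groupby(['Sex'], as_index=False).mean(numeric_only=True).sort_values(by='Survived', ascending=False)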

We can see that women have a much higher mean Survived value. The mean age of women and men is quite similar. Let's look at some histograms.

In [12]:
plot = sns.FacetGrid(train_df, col='Survived')
plot.map(plt.hist, 'Age', bins=20)

In [13]:
grid = sns.FacetGrid(train_df, col='Survived', row='Sex', height=2, aspect=3)
grid.map(plt.hist, 'Age', bins=10)
grid.add_legend();

We can see that many men aged 15-30 died.

### Let's also check other relations

In [14]:
grid = sns.FacetGrid(train_df, col='Survived', row='Sex')
grid.map(plt.hist, 'Fare', bins=20)
grid.add_legend();

In [15]:
# Fill NaN values with the median value (assignment avoids the deprecated chained inplace fill)
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].median())

#And create Fare Band
train_df['Band'] = pd.qcut(train_df['Fare'], 4)
train_df[['Band', 'Survived']].groupby(['Band'], as_index=False).mean().sort_values(by='Band', ascending=True) 


    


In [16]:
#Insert values
for dataset in combine:
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)

train_df = train_df.drop(['Band'], axis=1)
combine = [train_df, test_df]
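The thresholds 7.91, 14.454 and 31 above were copied by hand from the `pd.qcut` output. As an alternative sketch (not what the cells above do), `pd.qcut` can return the band codes and the edges directly, and the same edges can then be applied to a second frame with `pd.cut`. The fares below are made up for illustration:

```python
import pandas as pd

# Made-up fares, not the real data.
toy = pd.DataFrame({"Fare": [5.0, 7.25, 8.05, 13.0, 15.5, 26.0, 52.0, 80.0]})

# labels=False returns integer band codes; retbins=True also returns the quartile edges.
toy["FareBand"], edges = pd.qcut(toy["Fare"], 4, labels=False, retbins=True)

# Reuse the same edges on another frame so both are coded consistently.
other = pd.DataFrame({"Fare": [6.0, 14.0, 30.0, 60.0]})
other["FareBand"] = pd.cut(other["Fare"], bins=edges, labels=False, include_lowest=True)

print(toy["FareBand"].tolist())
print(other["FareBand"].tolist())
```

Note that values outside the learned edges would come back as NaN from `pd.cut`, so the hand-written open-ended conditions (`<= 7.91`, `> 31`) are more forgiving in that respect.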

In [17]:
train_df[['Fare', 'Survived']].groupby(['Fare'], as_index=False).mean().sort_values(by='Survived', ascending=False)

### Higher Fare is associated with a higher survival rate

In [18]:
grid = sns.FacetGrid(train_df, col='Survived', row='Pclass')
grid.map(plt.hist, 'Age', bins=20)
grid.add_legend();

In [19]:
train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)

    We can also create bands in terms of Age.
    But first we have to fill in the missing values.
    Let's estimate them using the Sex and Pclass data.

In [21]:
#create an empty array to store values
guess_ages = np.zeros((2,3))



In [22]:

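# Compute the medians from the training set only. Looping over `combine` here
# would overwrite guess_ages on each pass and keep only the test-set medians.
for i in range(0, 2):
    for j in range(0, 3):
        # ages of the passengers with this Sex and Pclass
        guess_df = train_df[(train_df['Sex'] == i) &
                            (train_df['Pclass'] == j+1)]['Age'].dropna()

        # use the median to guess the age
        age_guess = guess_df.median()

        # store it in the array
        guess_ages[i,j] = int(age_guess)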
    


In [23]:
for dataset in combine:
    # Now, fill the empty values with predicted data
    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[ (dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1),\
                        'Age'] = guess_ages[i,j]

    dataset['Age'] = dataset['Age'].astype(int)

train_df.head()
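The nested loops above fill each missing age with the median of its (Sex, Pclass) group. The same idea can be written without explicit loops using `groupby(...).transform('median')`. This is a sketch on a small hypothetical frame, not a drop-in replacement for the cells above (for one thing, it takes the medians from the frame being filled):

```python
import numpy as np
import pandas as pd

# Toy passengers (hypothetical values); NaN marks a missing age.
toy = pd.DataFrame({
    "Sex":    [0, 0, 0, 1, 1, 1],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Age":    [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# Median age of each (Sex, Pclass) group, aligned row by row with the frame.
group_median = toy.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Fill only the missing entries, then cast once no NaN remains.
toy["Age"] = toy["Age"].fillna(group_median).astype(int)
print(toy["Age"].tolist())
```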

In [24]:
#create band
train_df['Band'] = pd.cut(train_df['Age'], 6)
train_df[['Band', 'Survived']].groupby(['Band'], as_index=False).mean().sort_values(by='Band', ascending=True)
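Unlike `pd.qcut`, `pd.cut` with an integer splits the observed range into equal-width bins, which is where band edges such as 13.333 and 26.667 come from when ages run from 0 to 80. A small sketch with hypothetical ages (the lowest edge is nudged slightly below the minimum so that the minimum is included):

```python
import pandas as pd

# Hypothetical ages spanning 0 to 80, like the imputed Age column.
ages = pd.Series([0, 2, 14, 22, 35, 47, 58, 80])

# cut() splits the observed range into 6 equal-width bins;
# labels=False returns the bin index, retbins=True also returns the edges.
codes, edges = pd.cut(ages, 6, labels=False, retbins=True)

print([round(e, 3) for e in edges])
print(codes.tolist())
```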




In [25]:
# replace values in range of band
for dataset in combine:    
    dataset.loc[ dataset['Age'] <= 13.333, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 13.333) & (dataset['Age'] <= 26.667), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 26.667) & (dataset['Age'] <= 40), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 40) & (dataset['Age'] <= 53.333), 'Age'] = 3
    dataset.loc[(dataset['Age'] > 53.333) & (dataset['Age'] <= 66.667), 'Age'] = 4
    dataset.loc[ dataset['Age'] > 66.667, 'Age'] = 5
train_df[['Age', 'Survived']].groupby(['Age'], as_index=False).mean().sort_values(by='Age', ascending=True)





In [26]:
train_df = train_df.drop(['Band'], axis=1)
combine = [train_df, test_df]
train_df.head()

In [27]:
grid = sns.FacetGrid(train_df, row='Embarked')
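# explicit order/hue_order keeps the categories aligned across facets
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', order=[1, 2, 3], hue_order=[0, 1])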
grid.add_legend()

In [28]:

grid = sns.FacetGrid(train_df, row='Embarked', col='Survived')
grid.map(sns.barplot, 'Sex', 'Fare', order=[0, 1])
grid.add_legend()

Passengers with 1st and 2nd class tickets had a higher chance of survival than those in 3rd class.
We also see that the port of embarkation is related to survival.

Let's also make a new feature from existing ones. Let's add the number of siblings / spouses aboard the Titanic
and the number of parents / children aboard the Titanic to check whether there is any correlation between family size and survival.

In [29]:
for dataset in combine:
    dataset['FamilyCount'] = dataset['SibSp'] + dataset['Parch'] + 1

train_df[['FamilyCount', 'Survived']].groupby(['FamilyCount'], as_index=False).mean().sort_values(by='FamilyCount', ascending=True)

    We can save it as another parameter:
    Alone: is alone: 1, is not alone: 0

In [30]:
for dataset in combine:
    dataset['Alone'] = 0
    # Set 0 (not alone) as the default value
    dataset.loc[dataset['FamilyCount'] == 1, 'Alone'] = 1 
    # FamilyCount == 1 counts only the passenger, so he/she travelled alone

train_df[['Alone', 'Survived']].groupby(['Alone'], as_index=False).mean()
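The two derived features can also be built in one step each. A minimal sketch on a hypothetical four-row frame, using a boolean comparison cast to int instead of a default value followed by `.loc`:

```python
import pandas as pd

# Hypothetical passengers, not rows from the Titanic file.
toy = pd.DataFrame({"SibSp": [0, 1, 0, 3], "Parch": [0, 0, 2, 1]})

# The passenger counts as one member of their own family.
toy["FamilyCount"] = toy["SibSp"] + toy["Parch"] + 1

# A boolean comparison cast to int gives the 0/1 flag directly.
toy["Alone"] = (toy["FamilyCount"] == 1).astype(int)
print(toy)
```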

Now that the new features are constructed, we can drop the columns we won't use.

In [31]:
print(train_df.columns.values)

['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
 'Ticket' 'Fare' 'Cabin' 'Embarked' 'FamilyCount' 'Alone']

In [32]:
train_df.describe(include=['O'])

Now we can drop the columns PassengerId, Name, SibSp, Parch, Ticket, Cabin and FamilyCount.
This makes the analysis easier and faster.

1. Name is unique for each passenger.
2. Ticket may not be correlated with survival.
3. Cabin can be dropped because it is highly incomplete and contains many null values.
4. PassengerId does not contribute to survival.
5. SibSp, Parch and FamilyCount are summarized by the Alone feature.
 



In [33]:
train_df = train_df.drop([ 'PassengerId','Name','SibSp','Parch','Ticket','Cabin','FamilyCount',], axis=1)
test_df = test_df.drop([ 'PassengerId','Name','SibSp','Parch','Ticket','Cabin','FamilyCount',], axis=1)
combine = [train_df, test_df]


In [34]:
print(train_df.columns.values)

['Survived' 'Pclass' 'Sex' 'Age' 'Fare' 'Embarked' 'Alone']

## Create model, learn and predict



This is a binary classification problem: I want to identify the relationship between the output (Survived or not) and the other variables, or features. This type of machine learning is called supervised learning because we train the model on a dataset with known outcomes. Two models are tried:

- Logistic Regression
- Random Forest







In [35]:
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test  = test_df.copy()

#X_train.shape, Y_train.shape, X_test.shape

## Logistic Regression 

In [36]:
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)




- Sex has the highest positive coefficient, implying that as the Sex value increases (male: 0 to female: 1), the probability of Survived=1 increases the most.
- Conversely, as Pclass increases, the probability of Survived=1 decreases the most.

In [37]:
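# feature names without the first column (Survived)
coeff_df = pd.DataFrame(train_df.columns.delete(0))
coeff_df.columns = ['Feature']
# these are the fitted logistic regression coefficients, not correlations
coeff_df["Coefficient"] = pd.Series(logreg.coef_[0])

coeff_df.sort_values(by='Coefficient', ascending=False)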
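Logistic regression coefficients are on the log-odds scale, so `exp(coefficient)` gives the factor by which the odds of survival change for a one-unit increase in a feature. The sketch below uses synthetic data generated with a known positive effect for Sex and a negative effect for Pclass. The data, the effect sizes and the `OddsRatio` column are assumptions for illustration, not results from the Titanic set:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration, not the Titanic file):
# survival is made more likely for Sex=1 and less likely for higher Pclass.
rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "Sex": rng.integers(0, 2, n),
    "Pclass": rng.integers(1, 4, n),
})
log_odds = 2.5 * X["Sex"] - 1.0 * (X["Pclass"] - 2)
y = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

model = LogisticRegression().fit(X, y)

# One coefficient per feature; exp(coef) is the multiplicative change in odds
# for a one-unit increase in that feature.
coef_table = pd.DataFrame({
    "Feature": X.columns,
    "Coefficient": model.coef_[0],
    "OddsRatio": np.exp(model.coef_[0]),
}).sort_values("Coefficient", ascending=False)
print(coef_table)
```

An odds ratio above 1 means the feature raises the odds of Survived=1, and a value below 1 means it lowers them.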

A random forest is an ensemble learning method for classification and regression that combines the predictions of many decision trees.

In [38]:
# Random Forest

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
# accuracy on the training data, in percent
round(random_forest.score(X_train, Y_train) * 100, 2)


In [39]:
print(f"Logistic Regression score: {round(logreg.score(X_train, Y_train) * 100, 2)}", f"Random Forest score: {round(random_forest.score(X_train, Y_train) * 100, 2)}")

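The random forest model scores higher than logistic regression. Both scores are accuracy on the training data, so they do not show which model generalizes better.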
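One way to compare the two models on data they have not seen is k-fold cross-validation. A minimal sketch with scikit-learn's `cross_val_score` on synthetic data follows. The data, `max_iter=1000` and `random_state=0` are choices made for this illustration, and the numbers it prints say nothing about the Titanic results:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 300 rows, 6 integer-coded features (hypothetical data).
rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(300, 6))
# Labels depend on the first feature plus noise.
y = ((X[:, 0] + rng.normal(0, 1.5, 300)) > 1.5).astype(int)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    # Accuracy on the same rows the model was fitted on.
    train_acc = model.fit(X, y).score(X, y)
    # Mean accuracy over 5 held-out folds; cross_val_score refits a fresh clone per fold.
    cv_acc = cross_val_score(model, X, y, cv=5).mean()
    results[name] = (train_acc, cv_acc)
    print(f"{name}: train={train_acc:.3f}, 5-fold CV={cv_acc:.3f}")
```

In the notebook, the same loop could be run with `X_train` and `Y_train` in place of the synthetic arrays.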