<header>
    <h1>Understanding https://www.kaggle.com/startupsci/titanic-data-science-solutions</h1>
</header>

<b>PROBLEM:</b>

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg,
killing 1502 out of 2224 passengers and crew.

This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers
and crew. Although there was some element of luck involved in surviving the sinking,
some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive.
In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

<b>GOAL:</b>

It is your job to predict if a passenger survived the sinking of the Titanic or not. 
For each PassengerId in the test set, you must predict a 0 or 1 value for the Survived variable.

Given a training set that records which passengers survived the Titanic disaster and which did not, can our model determine whether the passengers in a test set, which contains no survival information, survived?
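
The expected output can be pictured as a two-column table with one row per test passenger. The sketch below only illustrates that shape: the `PassengerId` values and the predictions are made-up placeholders, not real results, and the name `submission` is an assumption.

```python
import pandas as pd

# Hypothetical passenger ids and placeholder predictions (not real results).
passenger_ids = [1001, 1002, 1003]
predictions = [0, 1, 0]  # 0 = did not survive, 1 = survived

submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": predictions,
})

# Every prediction must be exactly 0 or 1.
all_binary = bool(submission["Survived"].isin([0, 1]).all())
print(submission)
```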

<b>DATA:</b>

https://www.kaggle.com/c/titanic/data

In [8]:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

In [9]:
train_df = pd.read_csv('C:/Users/mia.renauly/Documents/02. Belajar/03 Project at GRAB/Kaggle Titanic/train.csv')
test_df = pd.read_csv('C:/Users/mia.renauly/Documents/02. Belajar/03 Project at GRAB/Kaggle Titanic/test.csv')
combine = [train_df, test_df]

<header>
    <h3>DATA EXPLORATION</h3>
</header>

In [14]:
print(train_df.columns.values)

['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
 'Ticket' 'Fare' 'Cabin' 'Embarked']


<b>FEATURE/COLUMN IDENTIFICATION:</b>

<b>Categorical</b>
1. Nominal: Survived, Sex, and Embarked
2. Ordinal: Pclass

<b>Numerical</b>
1. Continuous: Age, Fare
2. Discrete: SibSp, Parch

<b>Mixed Data Types</b>
1. Alphanumeric: Ticket and Cabin
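
A first pass at this grouping can be automated by splitting columns on their dtype. The sketch below is a minimal illustration on a hand-built toy frame (the name `sample` and its three rows are assumptions made for the example); on the real data the same calls would be applied to `train_df`.

```python
import pandas as pd

# Toy frame with a handful of the Titanic columns.
sample = pd.DataFrame({
    "Pclass": [2, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [27.0, 19.0, None],
    "SibSp": [0, 0, 1],
    "Ticket": ["211536", "112053", "W./C. 6607"],
    "Fare": [13.0, 30.0, 23.45],
})

# Columns stored as numbers versus everything else (text).
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("text:", text_cols)
```

The dtype alone is not enough: `Pclass` is stored as an integer and lands in the numeric group, even though it is an ordinal category. The manual classification above is still needed.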

In [16]:
train_df.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [19]:
train_df.info()
print('_'*40)
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null
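
The `info()` output shows that `Age`, `Cabin`, and `Embarked` have fewer non-null entries than the 891 training rows (714, 204, and 889 respectively). A small helper can make those gaps explicit. The sketch below is a minimal illustration; `missing_summary` and the toy frame are hypothetical names introduced here, not part of the original notebook.

```python
import pandas as pd

def missing_summary(df):
    """Return missing-value counts and percentages for columns with gaps."""
    missing = df.isna().sum()
    summary = pd.DataFrame({
        "missing": missing,
        "percent": (missing / len(df) * 100).round(2),
    })
    # Keep only the columns that actually have missing values.
    return summary[summary["missing"] > 0]

# Toy frame standing in for train_df.
toy = pd.DataFrame({
    "Age": [27.0, 19.0, None, 26.0],
    "Cabin": [None, "B42", None, "C148"],
    "Embarked": ["S", "S", "S", "C"],
})
result = missing_summary(toy)
print(result)
```

Calling `missing_summary(train_df)` would report the same information as the non-null counts above, expressed as missing counts and percentages.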

In [74]:
train_df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Parch_dis,SibSp_dis
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208,0.239057,0.317621
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429,0.426747,0.465813
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104,0.0,0.0
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542,0.0,0.0
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0,0.0,1.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292,1.0,1.0


In [80]:
# Data Distribution
train_df.quantile([.1, .2, .3, .4, .5, .6, .65, .7, .75, .8, .85, .9, .95, .98, .99, 1])

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Parch_dis,SibSp_dis
0.1,90.0,0.0,1.0,14.0,0.0,0.0,7.55,0.0,0.0
0.2,179.0,0.0,1.0,19.0,0.0,0.0,7.8542,0.0,0.0
0.3,268.0,0.0,2.0,22.0,0.0,0.0,8.05,0.0,0.0
0.4,357.0,0.0,2.0,25.0,0.0,0.0,10.5,0.0,0.0
0.5,446.0,0.0,3.0,28.0,0.0,0.0,14.4542,0.0,0.0
0.6,535.0,0.0,3.0,31.8,0.0,0.0,21.6792,0.0,0.0
0.65,579.5,1.0,3.0,34.0,0.0,0.0,26.0,0.0,0.0
0.7,624.0,1.0,3.0,36.0,1.0,0.0,27.0,0.0,1.0
0.75,668.5,1.0,3.0,38.0,1.0,0.0,31.0,0.0,1.0
0.8,713.0,1.0,3.0,41.0,1.0,1.0,39.6875,1.0,1.0


In [86]:
# Explore the categorical columns (to see whether other columns can be derived from them)
train_df[['Name', 'Sex', 'Ticket', 'Cabin','Embarked']].describe(include=['O'])

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Aks, Mrs. Sam (Leah Rosen)",male,CA. 2343,B96 B98,S
freq,1,577,7,4,644
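
`Name` is unique for every row (891 unique values out of 891), so it is not useful as a category on its own. One possible derived column is the honorific embedded in each name. The sketch below assumes a simple regular expression and a hypothetical variable `titles`; neither appears in this notebook, and the pattern may need adjusting for unusual names.

```python
import pandas as pd

# Names taken from the outputs shown in this notebook.
names = pd.Series([
    "Montvila, Rev. Juozas",
    "Graham, Miss. Margaret Edith",
    "Behr, Mr. Karl Howell",
    "Aks, Mrs. Sam (Leah Rosen)",
])

# Capture the first word that follows a space and ends with a period.
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
print(titles.tolist())  # ['Rev', 'Miss', 'Mr', 'Mrs']
```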


In [35]:
# How many people survived the accident (0 = no, 1 = yes)
survived = train_df.groupby(['Survived']).agg({'PassengerId': 'count'})
print (survived)
survived.groupby(level=0).apply(lambda x:(x/train_df['Survived'].count())*100)

          PassengerId
Survived             
0                 549
1                 342


Survived,PassengerId
0,61.616162
1,38.383838
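
The same percentages can be obtained in one step with `value_counts(normalize=True)`. The sketch below is a minimal illustration that rebuilds the `Survived` column from the counts printed above (549 zeros and 342 ones); the names `survived_col` and `share` are assumptions.

```python
import pandas as pd

# Rebuild the Survived column from the printed counts.
survived_col = pd.Series([0] * 549 + [1] * 342, name="Survived")

# Share of each class, in percent.
share = survived_col.value_counts(normalize=True).mul(100).round(2)
print(share)
```

On the real data the equivalent call would be `train_df['Survived'].value_counts(normalize=True)`.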


In [73]:
# How many passengers travelled with parents/children (Parch) or siblings/spouses (SibSp)
# Each flag is 1 when the passenger has at least one such relative aboard, else 0
train_df['Parch_dis'] = (train_df['Parch'] != 0).astype(int)
train_df['SibSp_dis'] = (train_df['SibSp'] != 0).astype(int)

family = train_df.groupby(['Parch_dis', 'SibSp_dis']).agg({'PassengerId': 'count'})
print(family)

a = family.groupby(level=0).apply(lambda x:(x/x.sum())*100)
print(a)
print('_'*40)

family_survived = train_df.groupby(['Survived','Parch_dis', 'SibSp_dis']).agg({'PassengerId': 'count'})
print(family_survived)

b = family_survived.groupby(level=0).apply(lambda x:(x/x.sum())*100)
print(b)

                     PassengerId
Parch_dis SibSp_dis             
0         0                  537
          1                  141
1         0                   71
          1                  142
                     PassengerId
Parch_dis SibSp_dis             
0         0            79.203540
          1            20.796460
1         0            33.333333
          1            66.666667
________________________________________
                              PassengerId
Survived Parch_dis SibSp_dis             
0        0         0                  374
                   1                   71
         1         0                   24
                   1                   80
1        0         0                  163
                   1                   70
         1         0                   47
                   1                   62
                              PassengerId
Survived Parch_dis SibSp_dis             
0        0         0            68.123862
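
The percentages in `b` are normalised within each `Survived` level, so they show how the passengers who died (or survived) are spread across the family groups. A complementary view is the survival rate inside each family group. The sketch below is a minimal illustration built from the counts in the `family_survived` table above; the column names `died`, `survived`, and `survival_rate` are assumptions introduced for the example.

```python
import pandas as pd

# Counts copied from the family_survived table:
# (Parch_dis, SibSp_dis, died, survived)
counts = pd.DataFrame(
    [
        (0, 0, 374, 163),
        (0, 1, 71, 70),
        (1, 0, 24, 47),
        (1, 1, 80, 62),
    ],
    columns=["Parch_dis", "SibSp_dis", "died", "survived"],
).set_index(["Parch_dis", "SibSp_dis"])

# Percentage of each family group that survived.
counts["survival_rate"] = (
    counts["survived"] / (counts["died"] + counts["survived"]) * 100
).round(1)
print(counts)
```

On the real data, `train_df.groupby(['Parch_dis', 'SibSp_dis'])['Survived'].mean()` should give the same rates as fractions.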
                 

