# Predicting Titanic Survivors

## Data Source:
'titanic_data.csv'

## Description:
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although some element of luck was involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

The goal of this analysis is to predict whether an individual survived, based on features in the data such as traveling class, sex, age, and fare price.

## Data Dictionary:

| Field Num | Field Name  | Description |
| --------- |:-----------:| -----------:|
| 1         | PassengerId | Numeric |
| 2         | Survived    | Survival (Numeric: 0 = No; 1 = Yes) |
| 3         | Pclass      | Passenger Class (Numeric: 1 = 1st; 2 = 2nd; 3 = 3rd) |
| 4         | Name        | Last Name, First Name |
| 5         | Sex         | Sex |
| 6         | Age         | Age (Numeric) |
| 7         | SibSp       | Number of Siblings/Spouses Aboard |
| 8         | Parch       | Number of Parents/Children Aboard |
| 9         | Ticket      | Ticket Number |
| 10        | Fare        | Passenger Fare |
| 11        | Cabin       | Cabin |
| 12        | Embarked    | Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) |
                   



## Data Handling

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing

df = pd.read_csv('/home/ppham/workspace/titanic_data.csv')

In [13]:
df.shape

(891, 12)

The dataset has 891 rows and 12 columns.

In [16]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


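The columns 'Age', 'Cabin', and 'Embarked' have missing values (714, 204, and 889 non-null values out of 891, respectively). 'Cabin' is missing for most passengers, so it won't add much value to the result. 'Ticket' has no missing values, but it is only a ticket number, so it is also unlikely to add much.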
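
The missing-value counts can also be read off directly instead of subtracting the `df.info()` figures by hand. The sketch below is a minimal illustration on a tiny made-up frame; the helper name `missing_summary` and the `toy` data are assumptions of this example, not part of the notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(frame):
    """Return per-column missing counts and the fraction of rows missing."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "fraction": counts / len(frame),
    })
    # Most incomplete columns first
    return summary.sort_values("missing", ascending=False)

# Tiny made-up frame, for illustration only (not the Titanic data)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 8.05, 13.0],
})
print(missing_summary(toy))
```

On the notebook's data the equivalent call would be `missing_summary(df)`.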

### Explore the Data:

In [31]:
df["Survived"].value_counts(normalize = True)

0    0.595506
1    0.404494
Name: Survived, dtype: float64

About 40% of the passengers survived and about 60% died.



##### By Gender:
Share of male passengers who survived (1) versus died (0):

In [32]:
df["Survived"][df["Sex"] == 'male'].value_counts(normalize = True)


0    0.794702
1    0.205298
Name: Survived, dtype: float64

Share of female passengers who survived (1) versus died (0):

In [34]:
df["Survived"][df["Sex"] == 'female'].value_counts(normalize = True)

1    0.752896
0    0.247104
Name: Survived, dtype: float64
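
Both breakdowns can be produced in one table with `pd.crosstab`. This is a minimal sketch on made-up rows (the `toy` frame is an assumption of this example), showing row-normalized survival shares per sex.

```python
import pandas as pd

# Made-up rows for illustration only (not the Titanic data)
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "male", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 0],
})

# normalize="index" makes each row (each sex) sum to 1
rates = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")
print(rates)
```

On the notebook's data the equivalent call would be `pd.crosstab(df["Sex"], df["Survived"], normalize="index")`.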

##### By Gender and Class:


In [35]:
df.Survived[df.Sex == 'female'][df.Pclass != 3].value_counts(normalize = True)


1    0.942675
0    0.057325
Name: Survived, dtype: float64

In [36]:
df.Survived[df.Sex == 'female'][df.Pclass == 3].value_counts(normalize = True)


0    0.539216
1    0.460784
Name: Survived, dtype: float64

In [39]:
df.Survived[df.Sex == 'male'][df.Pclass == 3].value_counts(normalize = True)

0    0.849802
1    0.150198
Name: Survived, dtype: float64

In [38]:
df.Survived[df.Sex == 'male'][df.Pclass != 3].value_counts(normalize = True)

0    0.725
1    0.275
Name: Survived, dtype: float64
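
The four cells above can be collapsed into a single `groupby`. The sketch below uses made-up rows; the `toy` frame and the `ClassGroup` label are assumptions of this example. It relies on the fact that the mean of a 0/1 column is the share of 1s.

```python
import pandas as pd

# Made-up rows for illustration only (not the Titanic data)
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "male", "male", "male", "male"],
    "Pclass":   [1, 2, 3, 1, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

# Same split as the cells above: third class versus first or second class
class_group = toy["Pclass"].map(
    lambda c: "3rd" if c == 3 else "1st/2nd"
).rename("ClassGroup")

# Mean of a 0/1 column = survival rate within each (Sex, ClassGroup) cell
rates = toy.groupby(["Sex", class_group])["Survived"].mean()
print(rates)
```

Grouping once also avoids stacking two boolean masks on the same Series, which depends on index alignment to give the intended rows.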

##### By Age:
Flag each passenger as a child or an adult: 'Child' is 1 if the age is less than 18 and 0 if the age is 18 or greater. Passengers with a missing age keep 'NaN'.

In [43]:
# Create the column Child and assign to 'NaN'
df["Child"] = float('NaN')

In [44]:
# Use .loc so the assignment writes to df itself rather than to a temporary copy
df.loc[df["Age"] < 18, "Child"] = 1


In [45]:
# Use .loc so the assignment writes to df itself rather than to a temporary copy
df.loc[df["Age"] >= 18, "Child"] = 0


In [47]:
df["Survived"][df["Child"] == 1].value_counts(normalize = True)

1    0.539823
0    0.460177
Name: Survived, dtype: float64

In [48]:
df["Survived"][df["Child"] == 0].value_counts(normalize = True)

0    0.621035
1    0.378965
Name: Survived, dtype: float64

Passengers under 18 had a better chance of survival than older passengers (about 54% versus 38%). Based on this breakdown, gender, class, and age are among the important features to consider.
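
One way to build the 'Child' flag in a single helper, keeping missing ages as NaN, is sketched below. The function name `add_child_flag`, its `cutoff` parameter, and the `toy` frame are assumptions of this example.

```python
import numpy as np
import pandas as pd

def add_child_flag(frame, cutoff=18):
    """Return a copy with a 'Child' column:
    1 if Age < cutoff, 0 if Age >= cutoff, NaN if Age is missing."""
    out = frame.copy()
    out["Child"] = np.nan
    # .loc with a boolean mask assigns in place on `out`
    out.loc[out["Age"] < cutoff, "Child"] = 1
    out.loc[out["Age"] >= cutoff, "Child"] = 0
    return out

# Made-up ages for illustration only
toy = pd.DataFrame({"Age": [4.0, 17.9, 18.0, 40.0, np.nan]})
flagged = add_child_flag(toy)
print(flagged)
```

Comparisons against NaN are always false, so rows with a missing age match neither mask and stay NaN.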

##### By Embarked:

In [4]:
# Convert the values of 'Embarked' to numbers.
labelEncoder = preprocessing.LabelEncoder()
df['Embarked'] = labelEncoder.fit_transform(df['Embarked'])
df["Survived"][df["Embarked"] == 1].value_counts(normalize = True)


1    0.553571
0    0.446429
Name: Survived, dtype: float64

In [22]:
df["Survived"][df["Embarked"] == 2].value_counts(normalize = True)

0    0.61039
1    0.38961
Name: Survived, dtype: float64

In [23]:
df["Survived"][df["Embarked"] == 3].value_counts(normalize = True)

0    0.663043
1    0.336957
Name: Survived, dtype: float64

###### 'Age', 'Sex', 'Pclass', and 'Embarked' are important features, so they will be used to train my first model.

##### Final Data Preparation:
'Cabin' is missing for most passengers (only 204 of 891 values are present), and 'Ticket' is only a ticket number, so neither will add much to the training. I will drop those two columns first.

In [30]:
# errors='ignore' makes this cell safe to re-run after the columns are already gone
df = df.drop(['Ticket', 'Cabin'], axis=1, errors='ignore')

In [5]:
# Convert the 'Sex' feature to numbers (LabelEncoder assigns codes in sorted order: 'female' = 0, 'male' = 1)
df['Sex'] = labelEncoder.fit_transform(df['Sex'])

In [37]:
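# Fill the missing values of 'Age' with the column mean.
# (SimpleImputer replaces preprocessing.Imputer, which was removed in scikit-learn 0.22.)
from sklearn.impute import SimpleImputer
ageImputer = SimpleImputer(missing_values=np.nan, strategy='mean')
df['Age'] = ageImputer.fit_transform(df[['Age']]).ravel()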

In [38]:
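# Fill the missing values of 'Embarked' with the most frequent value.
# (SimpleImputer replaces preprocessing.Imputer, which was removed in scikit-learn 0.22.)
from sklearn.impute import SimpleImputer
embarkedImputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
df['Embarked'] = embarkedImputer.fit_transform(df[['Embarked']].astype(float)).ravel()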

In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null int64
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Embarked       891 non-null float64
dtypes: float64(3), int64(6), object(1)
memory usage: 69.7+ KB
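
The preparation steps above can be gathered into one function so they always run in the same order. The sketch below is a pandas-only variant, not the notebook's exact method: it swaps scikit-learn's encoder and imputer for `fillna` and `map`, fills gaps before encoding, and uses explicit code mappings that are assumptions of this example (they differ from the integers the encoder produced above). The function name `prepare` and the `toy` frame are also hypothetical.

```python
import numpy as np
import pandas as pd

def prepare(frame):
    """Return a cleaned copy: drop unused columns, fill gaps, encode categories."""
    out = frame.drop(columns=["Ticket", "Cabin"], errors="ignore")
    # Fill gaps before encoding so a missing value never becomes its own category
    out["Age"] = out["Age"].fillna(out["Age"].mean())
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode().iloc[0])
    # Explicit mappings (an assumption of this sketch) keep the codes readable
    out["Sex"] = out["Sex"].map({"female": 0, "male": 1})
    out["Embarked"] = out["Embarked"].map({"C": 0, "Q": 1, "S": 2})
    return out

# Made-up rows for illustration only (not the Titanic data)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Age": [20.0, np.nan, 40.0, 30.0],
    "Ticket": ["A1", "B2", "C3", "D4"],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", None, "S", "C"],
})
clean = prepare(toy)
print(clean)
```

Because `drop` returns a new frame, the input is left unchanged and the function can be called repeatedly without the "not contained in axis" error.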
