# Project Title: Predict Survival of Tragic Voyage - Titanic

## Introduction

More than 100 years ago, a massive ship sank on its maiden voyage after colliding with an iceberg in the North Atlantic Ocean. About 2200 people were aboard, and sadly about 1500 of them perished. The ship was the Royal Mail Ship (RMS) Titanic, a British luxury passenger liner. Constructed in Belfast, Ireland, it was the largest liner of its era and was widely regarded as iconic and unsinkable.

What contributed to the loss of so many lives? Reports cited that the ship was travelling too fast, that an iceberg warning was dismissed, that a fatal wrong turn was made, and that there were not enough lifeboats. Could the tragedy have been prevented? Had the lifeboat drill not been cancelled, more people might have survived.

**Which variables contribute to a higher survival rate?** That question leads us to the focus of this project: predicting the survival of Titanic passengers.

## Project Goal

1. To explore and analyse who the passengers on board the Titanic were.
2. To predict passenger survival using machine learning models.
3. To evaluate which model is best suited for this project.

**ATTENTION**
To mount Google Drive, you first need to add a shortcut to dataset/input to your Drive. Follow these instructions:
1. Go to *'Shared with me'>'Data Analytics'>'The Titanic'>datasets*
2. Right-click and pick 'Add shortcut to Drive'
3. Then put it in 'My Drive'

In [20]:
#import library
import pandas as pd
from google.colab import drive

drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
#import dataset
path = "/content/gdrive/My Drive/input"
train = pd.read_csv(path+"/train.csv")
test = pd.read_csv(path+"/test.csv") 
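
If the shortcut has not been added to Drive, `pd.read_csv` fails with a `FileNotFoundError` that does not hint at the cause. The following is a minimal sketch of a more defensive load. The helper name `load_titanic` is hypothetical (it is not part of the original notebook), and it assumes the same `train.csv` and `test.csv` file names used above.

```python
import os
import pandas as pd


def load_titanic(path):
    """Hypothetical helper: read train.csv and test.csv from `path`."""
    frames = {}
    for name in ("train", "test"):
        csv_path = os.path.join(path, name + ".csv")
        # Fail early with a message that points at the likely cause.
        if not os.path.isfile(csv_path):
            raise FileNotFoundError(
                csv_path + " not found; check that the Drive shortcut was added"
            )
        frames[name] = pd.read_csv(csv_path)
    return frames["train"], frames["test"]


# Usage in Colab, after drive.mount('/content/gdrive'):
# train, test = load_titanic("/content/gdrive/My Drive/input")
```

Using `os.path.join` instead of string concatenation avoids mistakes with missing or doubled slashes in the path.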

## Understanding Data

This section shows the dataset: its structure, variables, size, and so on.

First, a broad overview: what are the types of data, and what are their typical shape and content?

In [26]:
train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


Together with the PassengerId, which is just a running index, and the indication of whether the passenger survived (1) or not (0), we have the following information for each person:

- *Pclass* is the ticket class: first (1), second (2), or third (3). This is an ordinal integer feature.

- *Name* is the name of the passenger. The names also contain titles, and some passengers share a surname, which may indicate family relations. Some titles also indicate an age group: for instance, *Master* is a boy while *Mr* is a man. This feature is a character string of variable length but similar format.

- *Sex* indicates whether the passenger was female or male. This is a categorical text string feature.

- *Age* is the age of the passenger in years. It is stored as a float, because some values are fractional (the minimum is 0.42). There are NaN values in this column.

- *SibSp* is another ordinal integer feature describing the number of siblings or spouses travelling with each passenger.

- *Parch* is another ordinal integer feature that gives the number of parents or children travelling with each passenger.

- *Ticket* is a character string of variable length that gives the ticket number.

- *Fare* is a float feature showing how much each passenger paid for their rather memorable journey.

- *Cabin* gives the cabin number of each passenger. This is another string feature, and there are NaN values in this column.

- *Embarked* shows the port of embarkation as a categorical character value.
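
The note on *Name* suggests that titles and surnames can be pulled out of the string. A minimal sketch of one way to do that is given here, using only the five names shown in the `train.head()` output. It assumes every name follows the pattern "Surname, Title. Given names"; that pattern is an assumption read off these five rows and may not hold for every row of the full dataset.

```python
import pandas as pd

# Names copied from the train.head() output above.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Th...",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Allen, Mr. William Henry",
])

# Surname: everything before the first comma.
surname = names.str.split(",").str[0].str.strip()

# Title: the text between the first comma and the following period.
title = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(pd.DataFrame({"Surname": surname, "Title": title}))
```

On the full data, the same two expressions could be applied to `train["Name"]`, after checking for names that do not match the assumed pattern (those would yield NaN in `title`).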

In summary, we have 2 floating point features (*Fare, Age*), 3 ordinal integer features (*Pclass, SibSp, Parch*), 2 categorical text features (*Sex, Embarked*), and 3 text string features (*Ticket, Cabin, Name*).
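
The feature types can also be checked programmatically rather than read off the table. The sketch below rebuilds the five rows shown by `train.head()` so that it runs on its own; the variable names `sample` and `overview` are illustrative and not part of the original notebook.

```python
import io
import pandas as pd

# The five rows shown by train.head(), re-typed as CSV text.
csv_text = '''PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
'''
sample = pd.read_csv(io.StringIO(csv_text))

# One row per column: inferred dtype, number of NaN values, number of distinct values.
overview = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "missing": sample.isna().sum(),
    "unique": sample.nunique(),
})

print(sample.shape)
print(overview)
```

In the notebook itself, replacing `sample` with `train` would give the same overview for all 891 rows.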

In [27]:
train.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


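The minimum and maximum values show the range of each feature: *Pclass* runs from 1 to 3, *Age* from 0.42 to 80, *SibSp* from 0 to 8, and *Parch* from 0 to 6. *Fare* covers a wide range, from 0 to about 512, while its median is only about 14.45, so a few very expensive tickets pull the mean (about 32.2) well above the median. The *Age* count is 714 out of 891 rows, so 177 ages are missing. The mean of *Survived* (0.38) tells us that about 38% of the passengers in the training set survived.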
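
The ranges and the number of missing values can be derived from the `describe()` output directly. The sketch below copies the count, min and max rows from the table above so that it runs on its own; the names `desc` and `summary` are illustrative only.

```python
import pandas as pd

# count/min/max rows copied from the train.describe() output above.
desc = pd.DataFrame(
    {
        "Pclass": [891.0, 1.0, 3.0],
        "Age": [714.0, 0.42, 80.0],
        "SibSp": [891.0, 0.0, 8.0],
        "Parch": [891.0, 0.0, 6.0],
        "Fare": [891.0, 0.0, 512.3292],
    },
    index=["count", "min", "max"],
)

n_rows = 891  # count of PassengerId, which has no missing values

summary = pd.DataFrame({
    "range": desc.loc["max"] - desc.loc["min"],
    "missing": (n_rows - desc.loc["count"]).astype(int),
})
print(summary)
```

In the notebook, `desc = train.describe()` and `n_rows = len(train)` would replace the hand-copied values.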