# Titanic Machine Learning Project

This project uses machine learning in Python to build a model that predicts which passengers survived the Titanic shipwreck. The model uses data from the Titanic - Machine Learning from Disaster competition on Kaggle, available at https://www.kaggle.com/c/titanic/overview.



## Explore the Data 

The project will begin by looking at the first few lines of the training data set and then the testing data set. 

In [19]:
#load packages
import pandas as pd

#load the training data
train_set = pd.read_csv("titanic_train.csv")

#view the first few lines
train_set.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


We can see that, apart from the leading row index, the data has 12 columns. "PassengerId" holds the passenger identification numbers. "Survived" is 0 if the passenger did not survive and 1 if they did. "Pclass" is the class of travel (1 = 1st, 2 = 2nd, 3 = 3rd). "Name", "Sex" and "Age" give the passenger's name, sex and age. "SibSp" is the number of siblings or spouses aboard and "Parch" is the number of parents or children aboard. "Ticket", "Fare" and "Cabin" show the ticket number, the price of the passenger's fare and the cabin number respectively. Finally, "Embarked" describes the port of embarkation.
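
Beyond reading the first few rows, it can help to check which columns have missing values and which hold text. The following is a minimal sketch that rebuilds the five rows shown above by hand, so it runs without `titanic_train.csv`. The names `sample` and `missing` are illustrative, and "Name" and "Ticket" are left out for brevity.

```python
import pandas as pd

# The five training rows shown above, rebuilt by hand so the sketch runs
# without titanic_train.csv (Name and Ticket are omitted for brevity).
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Cabin": [None, "C85", None, "C123", None],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# Count missing values per column; Cabin is empty for three of the five rows.
missing = sample.isna().sum()
print(missing)

# Column data types, useful for spotting text columns that need encoding.
print(sample.dtypes)
```

On the full data the same two calls can be applied directly to `train_set`.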

In [20]:
#load the test data
test_set = pd.read_csv("titanic_test.csv")

#view the first few lines
test_set.head()


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


We can see that the test set has the same columns as the training set except for "Survived", which is the value the model has to predict.
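
One way to check how the two sets of columns relate is to compare the column names directly. This is a small sketch using the column names copied from the two outputs above. The variable names are illustrative.

```python
# Column names copied from the two outputs above (row index excluded).
train_columns = ["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
                 "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"]
test_columns = ["PassengerId", "Pclass", "Name", "Sex", "Age",
                "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"]

# Columns that appear in one set but not the other.
only_in_train = [c for c in train_columns if c not in test_columns]
only_in_test = [c for c in test_columns if c not in train_columns]

print(only_in_train)  # ['Survived']
print(only_in_test)   # []
```

With the data already loaded, `train_set.columns.difference(test_set.columns)` should give the same answer.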

## Percentage of Female Survivors


The history of the disaster suggests that gender is a good indicator of survival.

This makes it a good starting point, so below we calculate the percentage of female passengers in the training set who survived.


In [21]:
#filter for female and survived from the training dataset
female_survive = train_set.loc[train_set.Sex == 'female']["Survived"]

#calculate a percentage of those females that survived
female_percent = sum(female_survive)/len(female_survive)*100

#print the results
print(female_percent,"% female survivors")

74.20382165605095 % female survivors


We can see above that just over 74% of female passengers in the training set survived.

Below, we use the same method on the training dataset to calculate the percentage of male passengers who survived.


## Percentage of Male Survivors

In [22]:
#filter for male and survived from the training dataset
male_survive = train_set.loc[train_set.Sex == 'male']["Survived"]

#calculate a percentage of those males that survived
male_percent = sum(male_survive)/len(male_survive)*100

#print the results
print(male_percent,"% male survivors")

18.890814558058924 % male survivors


From this you can see that just over 74% of the female passengers in the training dataset survived, whereas only around 19% of the male passengers did.
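
The two calculations can also be done in one step by grouping on "Sex". The sketch below uses a small made-up table rather than the real Titanic records, so its numbers are illustrative only. The names `toy` and `rate_by_sex` are not part of the project.

```python
import pandas as pd

# Hypothetical toy data, not the real Titanic records.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "female",
            "male", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the fraction of ones, so multiplying by 100
# gives the survival percentage for each group.
rate_by_sex = toy.groupby("Sex")["Survived"].mean() * 100
print(rate_by_sex)
```

Replacing `toy` with `train_set` should reproduce the two percentages printed above.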

We also want to consider factors other than gender that could contribute to a passenger's survival.

We will construct a random forest model made up of decision "trees" that each consider a passenger's data and vote on whether that passenger survived.
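
To make the voting idea concrete, the sketch below fits a small forest on made-up data and asks each tree for its own prediction. The data, the number of trees and the variable names are assumptions chosen for illustration. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so counting votes is only an approximation of what `predict` does.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: one feature, and the label is 1 when the feature is 1.
X_toy = np.array([[0], [0], [0], [1], [1], [1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

forest = RandomForestClassifier(n_estimators=25, random_state=1)
forest.fit(X_toy, y_toy)

# One new observation to classify.
passenger = np.array([[1]])

# Each fitted tree is available in forest.estimators_; collect their predictions.
tree_votes = [int(tree.predict(passenger)[0]) for tree in forest.estimators_]
share_for_1 = sum(tree_votes) / len(tree_votes)

print("share of trees predicting 1:", share_for_1)
print("forest prediction:", forest.predict(passenger)[0])
```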

We will focus on four columns of the data: class (Pclass), gender (Sex), the number of siblings or spouses aboard (SibSp), and the number of parents or children aboard (Parch). The model learns patterns from the training set and applies them to predict survival in the test set.
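
Because "Sex" is text, it has to be turned into numbers before a random forest can use it. The sketch below shows how `pd.get_dummies` does this on a few made-up rows, and how `reindex` can keep the test columns in line with the training columns when a category is missing from one set. The toy rows and variable names are assumptions. In the Titanic data both sets contain both sexes, so the `reindex` step is a precaution rather than a requirement.

```python
import pandas as pd

# Hypothetical rows using the same four columns as the model.
toy_train = pd.DataFrame({"Pclass": [3, 1, 2],
                          "Sex": ["male", "female", "female"],
                          "SibSp": [1, 1, 0],
                          "Parch": [0, 0, 2]})
toy_test = pd.DataFrame({"Pclass": [3, 2],
                         "Sex": ["male", "male"],
                         "SibSp": [0, 0],
                         "Parch": [0, 1]})

# Text columns become one indicator column per category; numeric columns pass through.
X_train = pd.get_dummies(toy_train)
X_new = pd.get_dummies(toy_test)

# toy_test has no female rows, so X_new lacks Sex_female.
# reindex adds the missing column filled with zeros and matches the column order.
X_new = X_new.reindex(columns=X_train.columns, fill_value=0)

print(list(X_train.columns))
print(X_new)
```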


## Create a Prediction

In [23]:
#import the RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

#selecting the Survived column of the training set
y = train_set["Survived"]

#selecting the relevant factors for the model 
factors = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_set[factors])
X_test = pd.get_dummies(test_set[factors])

#create the random forest model and fit to the training data
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

#print the results
results = pd.DataFrame({'PassengerId': test_set.PassengerId, 'Survived': predictions})
print(results)


     PassengerId  Survived
0            892         0
1            893         1
2            894         0
3            895         0
4            896         1
..           ...       ...
413         1305         0
414         1306         1
415         1307         0
416         1308         0
417         1309         0

[418 rows x 2 columns]
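
The results table has the two columns, "PassengerId" and "Survived", that the Kaggle competition expects in a submission file. A minimal sketch of writing such a table to CSV follows, using a three-row stand-in for `results`. Passing `index=False` keeps the row index out of the file, which avoids an extra unnamed column when the file is read back.

```python
import pandas as pd

# Hypothetical three-row stand-in for the results DataFrame built above.
results = pd.DataFrame({"PassengerId": [892, 893, 894],
                        "Survived": [0, 1, 0]})

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = results.to_csv(index=False)
print(csv_text)
```

Passing a file name as the first argument, for example a hypothetical `"submission.csv"`, writes the same text to disk.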


We can see above that the model has gone through each passenger ID in the test set and predicted whether that passenger would survive based on their class, gender, the number of siblings or spouses aboard, and the number of parents or children aboard. The "Survived" column lists the prediction, with 1 representing survival and 0 representing death.

The result is a machine learning model that predicts who would survive the Titanic disaster, although its accuracy has not yet been measured in this project.
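
The test set has no "Survived" column, so the predictions cannot be scored locally. One common way to estimate accuracy is cross-validation on the training data. The sketch below is one possible approach, not part of the original project: the helper `estimate_accuracy`, the choice of five folds and the toy data are all assumptions, and the toy score says nothing about the real model.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def estimate_accuracy(train_df, factors, folds=5):
    """Return the mean cross-validated accuracy for the model settings used above."""
    X = pd.get_dummies(train_df[factors])
    y = train_df["Survived"]
    model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
    # Each fold is held out once while the model is trained on the rest.
    scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
    return scores.mean()


# Hypothetical toy data so the sketch runs without titanic_train.csv.
toy = pd.DataFrame({
    "Pclass": [1, 3] * 10,
    "Sex": ["female", "male"] * 10,
    "SibSp": [0, 1] * 10,
    "Parch": [0, 0] * 10,
    "Survived": [1, 0] * 10,
})

accuracy = estimate_accuracy(toy, ["Pclass", "Sex", "SibSp", "Parch"])
print("mean cross-validated accuracy:", accuracy)
```

Calling `estimate_accuracy(train_set, factors)` on the real training data would give an estimate for the model built in this project.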