# Titanic: Machine Learning from Disaster


## Competition Description

[Kaggle Titanic Competition](https://www.kaggle.com/c/titanic)
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge I was asked to complete the analysis of what sorts of people were likely to survive. In particular, I applied the tools of machine learning to predict which passengers survived the tragedy.

# Predicting with Decision Trees

A decision tree automates the data slicing process for you and outputs a classification model or classifier.

Conceptually, the decision tree algorithm starts with all the data at the root node and scans all the variables for the best one to split on. Once a variable is chosen, you do the split and go down one level (or one node) and repeat. The final nodes at the bottom of the decision tree are known as terminal nodes, and the majority vote of the observations in that node determine how to predict for new observations that end up in that terminal node.


## Cleaning and Formatting Data
Before we can begin constructing trees we need to get our hands dirty and clean the data so that we can use all the features available. In the first iteration, we saw that the Age variable had some missing value. Missingness is a whole subject with and in itself, but we will use a simple imputation technique where we substitute each missing value with the median of the all present values.

Another problem is that the Sex and Embarked variables are categorical but in a non-numeric format. Thus, we will need to assign each class a unique integer so that Python can handle the information. Embarked also has some missing values which we should impute with the most common class of embarkation, which is "S".

In [70]:
# Import the Pandas library
import pandas as pd
from pandas import Series, DataFrame

# Import the Numpy library
import numpy as np

# Import 'tree' from scikit-learn library
from sklearn  import tree


# Load the train and test datasets to create two DataFrames
train_data = "train.csv"
train = pd.read_csv(train_data)

test_data = "test.csv"
test = pd.read_csv(test_data)

# Convert the male and female groups to integer form
train["Sex"].loc[train["Sex"] == "male"] = 0
train["Sex"].loc[train["Sex"] == "female"] = 1

test["Sex"].loc[test["Sex"] == "male"] = 0
test["Sex"].loc[test["Sex"] == "female"] = 1

# Impute the Embarked variable
train["Embarked"] = train["Embarked"].fillna("S")
train["Age"] = train["Age"].fillna(train["Age"].median())

#test["Embarked"] = test["Embarked"].fillna("S")
test["Age"] = train["Age"].fillna(test["Age"].median())
test["Fare"] = train["Fare"].fillna(test["Fare"].median())

# Convert the Embarked classes to integer form
train["Embarked"].loc[train["Embarked"] == "S"] = 0
train["Embarked"].loc[train["Embarked"] == "C"] = 1
train["Embarked"].loc[train["Embarked"] == "Q"] = 2

#test["Embarked"].loc[test["Embarked"] == "S"] = 0
#test["Embarked"].loc[test["Embarked"] == "C"] = 1
#test["Embarked"].loc[test["Embarked"] == "Q"] = 2

print(train.isnull().any())
print(test.isnull().any())

PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age            False
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked       False
dtype: bool
PassengerId    False
Pclass         False
Name           False
Sex            False
Age            False
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked       False
dtype: bool


## Creating A Decision Tree

We will use the scikit-learn and numpy libraries to build a decision tree. scikit-learn can be used to create tree objects from the DecisionTreeClassifier class. The methods that we will use take numpy arrays as inputs and therefore we will need to create those from the DataFrame that we already have. We will need the following to build a decision tree

target: A one-dimensional numpy array containing the target/response from the train data. (Survival in our case)
features: A multidimensional numpy array containing the features/predictors from the train data. (ex. Sex, Age)

One way to quickly see the result of our decision tree is to see the importance of the features that are included. This is done by requesting the .feature_importances_ attribute of the tree object. Another quick metric is the mean accuracy that we can compute using the .score() function with features_one and target as arguments.

In [71]:
# Print the train data to see the available features
# print(train)

# Create the target and features numpy arrays: target, features_one
target = train["Survived"].values
features_one = train[["Pclass", "Sex", "Age", "Fare"]].values

# Fit your first decision tree: my_tree_one
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one, target)

# Look at the importance and score of the included features
print(my_tree_one.feature_importances_)
print(my_tree_one.score(features_one, target))

[ 0.12885059  0.31274009  0.24953791  0.30887141]
0.977553310887


### Prediction

To send a submission to Kaggle we need to predict the survival rates for the observations in the test set. In the last exercise we created simple predictions based on a single subset. Luckily, with our decision tree, we can make use of some simple functions to "generate" our answer without having to manually perform subsetting.

First, we make use of the .predict() method. We provide it the model (my_tree_one), the values of features from the dataset for which predictions need to be made (test). To extract the features we will need to create a numpy array in the same way as we did when training the model. However, we need to take care of a small but important problem first. There is a missing value in the Fare feature that needs to be imputed.

Next, we need to make sure our output is in line with the submission requirements of Kaggle: a csv file with exactly 418 entries and two columns: PassengerId and Survived. Then use the code provided to make a new data frame using DataFrame(), and create a csv file using to_csv() method from Pandas.

In [75]:
# Impute the missing value with the median
#test.loc[test.Fare[152]] = 0#test.Fare.median()

# Extract the features from the test set: Pclass, Sex, Age, and Fare.
test_features = test[["Pclass", "Sex", "Age", "Fare"]].values

print test_features

# Make your prediction using the test set and print them.
try:
 my_prediction = my_tree_one.predict(test_features)
except ValueError, e:
 print e

print(my_prediction)

# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
PassengerId = np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, PassengerId, columns = ["Survived"])
print(my_solution)

# Check that your data frame has 418 entries
print(my_solution.shape)

# Write your solution to a csv file with the name my_solution.csv
my_solution.to_csv("my_solution_one.csv", index_label = ["PassengerId"])

[[3 0 22.0 7.25]
 [3 1 38.0 71.2833]
 [2 0 26.0 7.925]
 ..., 
 [3 0 28.0 8.05]
 [3 0 34.0 32.5]
 [3 0 18.0 13.0]]
[0 1 0 1 1 0 0 1 1 0 1 0 1 0 1 1 1 0 0 1 0 0 1 1 1 0 1 0 0 0 0 0 1 1 0 1 1
 0 0 0 0 0 0 1 1 0 0 0 1 0 1 0 1 1 1 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 1 0 0 0
 1 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 1
 0 1 0 1 0 0 0 0 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0
 0 0 1 0 0 0 0 1 1 0 0 0 0 0 1 0 1 1 1 0 1 0 0 0 1 0 0 1 1 0 1 1 0 0 1 0 1
 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 1 1 1 1 1 0 0 0 0 1 1 0 1 0 1 1 1 0
 1 0 1 0 0 1 0 0 0 1 0 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 1 0 1 1 0 0 0 1 0 0 1
 0 0 1 1 0 0 0 1 0 0 0 0 1 1 1 0 1 0 0 1 0 1 0 1 1 0 0 1 0 0 0 1 0 0 0 0 0
 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 1 1 1 0 0
 0 0 0 0 0 1 0 1 0 0 0 0 1 0 1 1 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 1 0 0 1 1 0
 1 1 0 0 1 1 1 0 0 0 0 1 1 1 0 1 0 0 0 0 1 1 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0
 1 1 0 0 1 0 0 1 0 0 0]
      Survived
892          0
893    