In this course, you will learn how to apply machine learning techniques in Python to predict a passenger's chance of surviving the sinking of the Titanic.


#####  -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## Intro to Decision Trees
In the previous chapter, you did all the slicing and dicing yourself to find subsets that have a higher chance of surviving. A decision tree automates this process for you and outputs a classification model or classifier.

Conceptually, the decision tree algorithm starts with all the data at the root node and scans all the variables for the best one to split on. Once a variable is chosen, you do the split, go down one level (to the child nodes), and repeat. The final nodes at the bottom of the decision tree are known as terminal nodes, and the majority vote of the observations in a terminal node determines the prediction for new observations that end up in that node.
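
To make the idea of scanning all the variables for the best one to split on more concrete, here is a minimal sketch of a single split search. It assumes Gini impurity as the quality measure (the default criterion of scikit-learn's DecisionTreeClassifier). The helper names `gini` and `best_split` and the toy arrays are made up for illustration; this is not the library's internal implementation.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels (0.0 means a pure node)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Scan every feature and threshold; keep the split with the lowest weighted impurity."""
    best = (None, None, gini(y))  # (feature index, threshold, weighted impurity)
    n = len(y)
    for j in range(X.shape[1]):
        # Skip the largest value: splitting there would leave the right side empty
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / n
            if score < best[2]:
                best = (j, t, score)
    return best

# Toy data (made up): column 0 = Sex (0 male, 1 female), column 1 = Age
X = np.array([[0, 22], [1, 38], [1, 26], [0, 35], [0, 54], [1, 4]])
y = np.array([0, 1, 1, 0, 0, 1])

feature, threshold, impurity = best_split(X, y)
print(feature, threshold, impurity)
```

On this toy data the Sex column separates the two classes perfectly, so it is chosen as the split. A full tree would repeat the same search inside each child node until a stopping rule applies.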



In [13]:
# Import the Pandas library
import pandas as pd
# Load the train and test datasets to create two DataFrames
# you can import from URL in this way:
#
#

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Import necessary packages
# Import the Numpy library
import numpy as np
# Import 'tree' from scikit-learn library
from sklearn import tree

import warnings
pd.options.mode.chained_assignment = None  
warnings.simplefilter(action='ignore', category=FutureWarning)

test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


#### Cleaning and Formatting your Data
Before you can begin constructing your trees you need to get your hands dirty and clean the data so that you can use all the features available to you. In the first chapter, we saw that the Age variable had some missing values. Missingness is a whole subject in and of itself, but we will use a simple imputation technique where we substitute each missing value with the median of all the present values.

*train["Age"] = train["Age"].fillna(train["Age"].median())*
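
As a small illustration of what median imputation does, the sketch below applies it to a made-up column of ages (the `ages` DataFrame is hypothetical, not the Titanic data). Note that `median()` skips missing values by default.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries
ages = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 26.0, np.nan]})

# The median is computed from the present values only (NaN is skipped)
median_age = ages["Age"].median()

# Replace every missing value with that median
ages["Age"] = ages["Age"].fillna(median_age)
print(ages["Age"].tolist())
```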

Another problem is that the Sex and Embarked variables are categorical but in a non-numeric format. Thus, we will need to assign each class a unique integer so that scikit-learn can handle the information. Embarked also has some missing values, which you should impute with the most common class of embarkation, which is "S".
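
One possible way to do both steps is sketched below on a made-up DataFrame (`toy` is hypothetical). It looks up the most common class with `mode()` instead of hard-coding it, and uses `Series.map` with a dictionary for the integer encoding. This is an alternative to the approach in the next cell, not the method the course itself uses.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing port of embarkation
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Find the most common class and use it to fill the gap
most_common_port = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common_port)

# Assign each class a unique integer
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})
toy["Embarked"] = toy["Embarked"].map({"S": 0, "C": 1, "Q": 2})
print(toy)
```

Unlike element-wise assignment, `map` returns an integer column here, and any class missing from the dictionary would become NaN, which makes unexpected categories easy to spot.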

In [41]:
# Work on a copy so the original train DataFrame stays untouched
train_one = train.copy()

# Convert the male and female groups to integer form
# (.loc avoids the chained assignment of df[col][mask] = value)
train_one.loc[train_one["Sex"] == "male", "Sex"] = 0
train_one.loc[train_one["Sex"] == "female", "Sex"] = 1

# Impute Age variable
train_one["Age"] = train_one["Age"].fillna(train_one["Age"].median())

# Impute the Embarked variable
train_one["Embarked"] = train_one["Embarked"].fillna("S")

# Convert the Embarked classes to integer form
train_one.loc[train_one["Embarked"] == "S", "Embarked"] = 0
train_one.loc[train_one["Embarked"] == "C", "Embarked"] = 1
train_one.loc[train_one["Embarked"] == "Q", "Embarked"] = 2

In [3]:
# Print the Sex column
print(train_one["Sex"])

0      0
1      1
2      1
3      1
4      0
5      0
6      0
7      0
8      1
9      1
10     1
11     1
12     0
13     0
14     1
15     1
16     0
17     0
18     1
19     1
20     0
21     0
22     1
23     0
24     1
25     1
26     0
27     0
28     1
29     0
      ..
861    0
862    1
863    1
864    0
865    1
866    1
867    0
868    0
869    0
870    0
871    1
872    0
873    0
874    1
875    1
876    0
877    0
878    0
879    1
880    1
881    0
882    1
883    0
884    0
885    1
886    0
887    1
888    1
889    0
890    0
Name: Sex, Length: 891, dtype: object


In [5]:
print(train_one["Embarked"])

0      0
1      1
2      0
3      0
4      0
5      2
6      0
7      0
8      0
9      1
10     0
11     0
12     0
13     0
14     0
15     0
16     2
17     0
18     0
19     1
20     0
21     0
22     2
23     0
24     0
25     0
26     1
27     0
28     2
29     0
      ..
861    0
862    0
863    0
864    0
865    0
866    1
867    0
868    0
869    0
870    0
871    0
872    0
873    0
874    1
875    1
876    0
877    0
878    0
879    1
880    0
881    0
882    0
883    0
884    0
885    2
886    0
887    0
888    0
889    1
890    2
Name: Embarked, Length: 891, dtype: object


#### Creating your first decision tree
You will use the scikit-learn and numpy libraries to build your first decision tree. scikit-learn can be used to create tree objects from the DecisionTreeClassifier class. The methods that we will use take numpy arrays as inputs, and therefore we will need to create those from the DataFrame that we already have. We will need the following to build a decision tree:

* target: A one-dimensional numpy array containing the target/response from the train data. (Survival in your case)
* features: A multidimensional numpy array containing the features/predictors from the train data. (ex. Sex, Age)

Take a look at the sample code below to see what this would look like:

###### *target = train["Survived"].values*

###### *features = train[["Sex", "Age"]].values*

###### *my_tree = tree.DecisionTreeClassifier()*

###### *my_tree = my_tree.fit(features, target)*


One way to quickly see the result of your decision tree is to see the importance of the features that are included. This is done by requesting the .feature_importances_ attribute of your tree object. Another quick metric is the mean accuracy, which you can compute using the *.score()* method with features_one and target as arguments. Since these are the same data the tree was fit on, this is the accuracy on the training set.

Ok, time for you to build your first decision tree in Python! 

In [40]:
# Print the train data to see the available features
#print(train)


# Create the target and features numpy arrays: target, features_one
target = train_one["Survived"].values
features_one = train_one[["Pclass", "Sex", "Age", "Fare"]].values

# Fit your first decision tree: my_tree_one
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one, target)

# Look at the importance and score of the included features
print(my_tree_one.feature_importances_)
print(my_tree_one.score(features_one, target))


[0.12482906 0.31274009 0.22797589 0.33445495]
0.9775533108866442
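
The importances are printed in the same order as the feature columns, which is easy to misread. A minimal sketch of pairing them with their names follows. It uses randomly generated data and a made-up target that depends on Sex only, so the numbers say nothing about the Titanic data; `feature_names`, `X`, `y`, and `clf` are names chosen for this example.

```python
import numpy as np
from sklearn import tree

rng = np.random.default_rng(0)
feature_names = ["Pclass", "Sex", "Age", "Fare"]

# Synthetic stand-in for the feature array (not the real data)
X = np.column_stack([
    rng.integers(1, 4, size=200),    # Pclass: 1, 2 or 3
    rng.integers(0, 2, size=200),    # Sex: 0 or 1
    rng.uniform(1, 80, size=200),    # Age
    rng.uniform(5, 100, size=200),   # Fare
])
y = X[:, 1].astype(int)              # made-up target: determined by Sex alone

clf = tree.DecisionTreeClassifier(random_state=1).fit(X, y)

# Pair each importance with its column name and sort from most to least important
importances = dict(zip(feature_names, clf.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.3f}")
```

The importances sum to one, so each value can be read as that feature's share of the total impurity reduction in the fitted tree.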


#### Predict and submit to Kaggle
To send a submission to Kaggle you need to predict survival for the observations in the test set. In the last exercise of the previous chapter, we created simple predictions based on a single subset. Luckily, with our decision tree, we can make use of some simple functions to "generate" our answer without having to manually perform subsetting.

First, you make use of the .predict() method. You call it on the model (my_tree_one) and pass it the feature values from the dataset for which predictions need to be made (test). To extract the features we will need to create a numpy array in the same way as we did when training the model. However, we need to take care of a small but important problem first: there is a missing value in the Fare feature that needs to be imputed.

Next, you need to make sure your output is in line with the submission requirements of Kaggle: a csv file with exactly 418 entries and two columns, PassengerId and Survived. Then use the code provided to make a new data frame using DataFrame(), and create a csv file using the to_csv() method from Pandas.
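
To show the file layout described above, here is a small sketch that builds a two-column frame from hypothetical IDs and predictions and writes it to an in-memory buffer instead of a file. The arrays `passenger_ids` and `predictions` are made up; a real submission would have 418 rows.

```python
import io
import numpy as np
import pandas as pd

passenger_ids = np.array([892, 893, 894])   # hypothetical IDs
predictions = np.array([0, 1, 0])           # hypothetical predictions

# Both fields are ordinary columns here, so the index is not written out
submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})

buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Basic checks before uploading: row count and column names
print(submission.shape, list(submission.columns))
```

The cell below takes a slightly different route: it keeps PassengerId as the index and names it with index_label when writing, which produces the same two-column file.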

In [47]:
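# Work on a copy so the original test DataFrame stays untouched
test_one = test.copy()

# Impute the missing Fare value with the median
test_one["Fare"] = test_one["Fare"].fillna(test_one["Fare"].median())

# Convert the male and female groups to integer form
test_one.loc[test_one["Sex"] == "male", "Sex"] = 0
test_one.loc[test_one["Sex"] == "female", "Sex"] = 1

# Impute Age variable
test_one["Age"] = test_one["Age"].fillna(test_one["Age"].median())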

In [48]:
# Extract the features from the test set: Pclass, Sex, Age, and Fare.
test_features = test_one[["Pclass", "Sex", "Age", "Fare"]].values

# Make your prediction using the test set and print them.
my_prediction = my_tree_one.predict(test_features)
print(my_prediction)

# Create a data frame with PassengerId as the index and one column, Survived, that contains your predictions
PassengerId = np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, PassengerId, columns = ["Survived"])
print(my_solution)

# Check that your data frame has 418 entries
print(my_solution.shape)

# Write your solution to a csv file with the name my_solution_one.csv
my_solution.to_csv("my_solution_one.csv", index_label = ["PassengerId"])



[0 0 1 1 1 0 0 0 1 0 0 0 1 1 1 1 0 1 1 0 0 1 1 1 1 0 1 1 1 0 0 0 1 0 1 0 0
 0 0 1 0 1 0 1 1 0 0 0 1 1 1 0 1 1 1 0 0 0 1 1 0 0 0 1 0 0 1 0 0 1 1 0 0 0
 1 0 0 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 1 1 1 0 1 0 0 0 1 0 0 0 0 0 0
 0 1 1 1 0 1 1 0 1 1 0 1 0 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0
 1 0 1 0 0 1 0 0 1 1 0 1 1 1 1 1 0 1 1 0 0 0 0 1 0 1 0 1 1 0 1 1 0 0 1 0 1
 0 1 0 0 0 0 0 1 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 1 1 0 0 0 0 1 0 1 0 1 0
 1 1 1 0 0 1 0 0 0 1 0 0 1 0 0 1 1 1 1 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 1 1 0 1 0 0 0 1 0 1 0 1 0 0 0
 1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 1
 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 1 1 0 0 0 1 0
 0 1 0 0 1 1 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 1 0 0 0
 0 1 1 1 1 0 0 1 0 0 0]
      Survived
892          0
893          0
894          1
895          1
896          1
897          0
898          0
899          0
900          1
...

#### Overfitting and how to control it
When you created your first decision tree, the default arguments were used: max_depth was None and min_samples_split was 2. This means that no limit on the depth of your tree was set, and any node with at least two observations could be split further. That's a good thing, right? Not so fast. We are likely overfitting. This means that while your model describes the training data extremely well, it doesn't generalize to new data, which is frankly the point of prediction. Just look at the Kaggle submission results for the simple model based on Gender and the complex decision tree. Which one does better?

Maybe we can improve the overfit model by making a less complex model? In DecisionTreeClassifier, the complexity of our model can be limited with two parameters:

* the *max_depth* parameter sets the maximum depth of the tree; splitting stops once that depth is reached.
* the *min_samples_split* parameter sets the minimum number of observations a node must contain before it can be split. If that threshold is not reached (e.g. a minimum of 10 passengers), no further splitting is done on that node.

By limiting the complexity of your decision tree you will increase its generality and thus its usefulness for prediction!
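
One way to see overfitting without submitting to Kaggle is to hold back part of the training data and score the tree on it. The sketch below does this on synthetic, noisy data, so the numbers say nothing about the Titanic set. The hold-out split via `train_test_split` and the parameter values `max_depth=3` and `min_samples_split=10` are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: the label depends on the first feature plus noise
X = rng.normal(size=(600, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=600) > 0).astype(int)

# Hold back 30% of the rows as a validation set the tree never sees while fitting
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=1)

deep = tree.DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
shallow = tree.DecisionTreeClassifier(
    max_depth=3, min_samples_split=10, random_state=1
).fit(X_train, y_train)

# Compare training accuracy with validation accuracy for each tree
for name, model in [("unrestricted", deep), ("restricted", shallow)]:
    print(name, model.get_depth(),
          model.score(X_train, y_train), model.score(X_val, y_val))
```

The unrestricted tree fits its training rows perfectly, which is the warning sign discussed above. A large gap between the training score and the validation score suggests the tree has memorized noise rather than learned a pattern.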

In [50]:
# Create a new array with the added features: features_two
features_two = train_one[["Pclass","Age","Sex","Fare", "SibSp", "Parch", "Embarked"]].values

# Control overfitting by setting "max_depth" to 10 and "min_samples_split" to 5: my_tree_two
max_depth = 10
min_samples_split = 5
my_tree_two = tree.DecisionTreeClassifier(max_depth = max_depth, min_samples_split = min_samples_split, random_state = 1)
my_tree_two = my_tree_two.fit(features_two, target)

# Print the score of the new decision tree
print(my_tree_two.score(features_two, target))

0.9057239057239057


#### Feature-engineering for our Titanic data set
Data Science is an art that benefits from a human element. Enter feature engineering: creatively engineering your own features by combining the different existing variables.

While feature engineering is a discipline in itself, too broad to be covered here in detail, you will have a look at a simple example by creating your own new predictive attribute: *family_size*.

A plausible assumption is that larger families need more time to get together on a sinking ship, and hence have a lower probability of surviving. Family size is determined by the variables *SibSp* and *Parch*, which indicate the number of family members a certain passenger is traveling with. So when doing feature engineering, you add a new variable *family_size*, which is the sum of SibSp and Parch plus one (the passenger themselves), to the test and train sets.
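
As a quick check of the arithmetic, the sketch below computes *family_size* for three made-up passengers (the `toy` DataFrame is hypothetical): one travelling with a spouse, one travelling alone, and one with three siblings and two parents.

```python
import pandas as pd

# Hypothetical passengers: SibSp = siblings/spouses aboard, Parch = parents/children aboard
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Family size counts the relatives plus the passenger themselves
toy["family_size"] = toy["SibSp"] + toy["Parch"] + 1
print(toy["family_size"].tolist())
```

A passenger travelling alone therefore gets a family size of 1, never 0.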

In [51]:
# Create train_two with the newly defined feature
train_two = train_one.copy()
train_two["family_size"] = train_two["SibSp"] + train_two["Parch"] + 1

# Create a new feature set and add the new feature
features_three = train_two[["Pclass", "Sex", "Age", "Fare", "SibSp", "Parch", 'family_size']].values

# Define the tree classifier, then fit the model
my_tree_three = tree.DecisionTreeClassifier()
my_tree_three = my_tree_three.fit(features_three, target)

# Print the score of this decision tree
print(my_tree_three.score(features_three, target))


0.9797979797979798
