Getting Started With Random Forests
------------------------------

The last step in this series of tutorials is a quick guide to multivariate models that truly learn from your data. It offers a taste of how simple it can be to apply a sophisticated algorithm.

Quick Recap:
-----------------

The previous tutorials allowed us to open the data in both Excel and the scripting language Python. We created a simple model in both and wrote a CSV file to use as our submission. We are going to build on these skills to create what is known as a random forest (an ensemble of decision trees).

The beauty of Python
------------------------

This is where the effort you put into getting Python running pays off! As mentioned previously, Python has handy packages to help you manipulate arrays, compute complex math functions, read in complex file formats, and, in this case, create random forests. You will need to install the ```scikit-learn``` package, which provides the handy ```RandomForestClassifier``` class. (If you originally installed the Anaconda distribution of Python, then ```scikit-learn``` is already installed.)

What is a Random Forest?
--------------------------------------------

As with all the important questions in life, this is best deferred to the Wikipedia page. A random forest is an ensemble of decision trees which outputs a prediction value, in this case survival. Each decision tree is constructed using a random subset of the training data. After you have trained your forest, you can pass each test row through it to output a prediction. Simple! Well, not quite! This particular Python class requires floats for the input variables, so all strings need to be converted, and any missing data needs to be filled.
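To make the idea of "an ensemble of trees, each built from a random subset" concrete, here is a toy sketch in plain Python. It is not how scikit-learn implements random forests: each "tree" here is only a one-split decision stump, and the data and the helper names (`train_stump`, `train_forest`, `forest_predict`) are invented for illustration.

```python
import random
from collections import Counter

def train_stump(rows):
    """Fit a one-split 'tree': pick the (feature, threshold) with the fewest errors.

    rows is a list of (features, label) pairs with 0/1 labels.
    """
    best = None
    n_features = len(rows[0][0])
    for f in range(n_features):
        for threshold in sorted({x[f] for x, _ in rows}):
            for label_if_above in (0, 1):
                errors = sum(
                    (label_if_above if x[f] > threshold else 1 - label_if_above) != y
                    for x, y in rows
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, label_if_above)
    _, f, threshold, label_if_above = best
    return lambda x: label_if_above if x[f] > threshold else 1 - label_if_above

def train_forest(rows, n_trees, rng):
    """Train each stump on a bootstrap sample (rows drawn with replacement)."""
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]
        forest.append(train_stump(sample))
    return forest

def forest_predict(forest, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (gender, fare) -> survived
rows = [((0, 80.0), 1), ((0, 60.0), 1), ((0, 15.0), 1), ((0, 30.0), 1),
        ((1, 8.0), 0), ((1, 7.0), 0), ((1, 12.0), 0), ((1, 9.0), 0)]

rng = random.Random(0)
forest = train_forest(rows, n_trees=25, rng=rng)
prediction = forest_predict(forest, (1, 7.5))
print(prediction)
```

A real forest grows much deeper trees and also randomizes which features each split may consider, but the train-on-random-samples-then-vote structure is the same.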

How do I clean and fill?
--------------------------------

In the previous Pandas tutorial you gained the skills to clean and fill in data using the pandas package. Now you can put that into practice.

Not all types of data can be converted into floats. Names, for example, would be very difficult, so let's decide to neglect such columns. Categorical variables, although they are strings, can be converted: male and female become 1 and 0, and the port of embarkation, which has three categories, becomes 0, 1 or 2 (Southampton, Cherbourg and Queenstown). This may seem like a nonsensical way of classifying, since Queenstown is not twice the value of Cherbourg, but random forests are somewhat robust to this when the number of distinct categories is small.
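As a minimal sketch of this conversion, assuming a tiny made-up DataFrame with the same ```Sex``` and ```Embarked``` column names as the Titanic data, pandas' ```Series.map``` with a dictionary does the job:

```python
import pandas as pd

# Hypothetical miniature of the Titanic columns (values invented for illustration)
df = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', 'Q', None],
})

# Binary category -> 0/1
df['Gender'] = df['Sex'].map({'female': 0, 'male': 1})

# Three-way category -> 0/1/2; missing values stay missing (NaN)
df['EmbNo'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

print(df)
```

Note that a missing ```Embarked``` stays missing after mapping, so it still has to be filled or dropped afterwards. If the artificial ordering is a concern, one-hot encoding (for example with ```pd.get_dummies```) is a common alternative.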

Converting from categorical strings to floats is intuitive. However, filling in data can be trickier. Some data (such as Cabin) cannot be trivially filled without complete knowledge of every cabin and ticket price for the entire ship. Nonetheless, a missing fare can be estimated if you know the passenger's class, and a missing age can be estimated using the median age of the people on board. Fortunately for us, the amount of missing data here is not too large, so the method you choose to fill the data shouldn’t have too much of an effect on your predictive result.
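The two fill strategies mentioned above can be sketched as follows. The rows are invented for illustration, and the ```FareFill``` column name is an assumption made for this example only:

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps in Age and Fare
df = pd.DataFrame({
    'Pclass': [1, 1, 3, 3, 3],
    'Age':    [40.0, np.nan, 20.0, 30.0, np.nan],
    'Fare':   [80.0, 60.0, 8.0, np.nan, 10.0],
})

# Age: fill with the median of the known ages
df['AgeFill'] = df['Age'].fillna(df['Age'].median())

# Fare: fill with the median fare of the passenger's own class
class_median_fare = df.groupby('Pclass')['Fare'].transform('median')
df['FareFill'] = df['Fare'].fillna(class_median_fare)

print(df)
```

Writing the filled values to new columns keeps the originals intact, so you can still tell which values were estimated.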

My data is complete and floating nicely, I want to predict
----------------------------------------

Using the predictive capabilities of the ```scikit-learn``` package is very simple. In fact, it can be carried out in three simple steps: initializing the model, fitting it to the training data, and predicting new values.

Note that almost all of the modelling techniques in scikit-learn share a few commonly named methods once they are initialized. You can always find out more about them in the documentation for each model. These are:

```some-model-name.fit( )```

```some-model-name.predict( )```

```some-model-name.score( )```
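As a rough illustration of these three methods, the sketch below runs them on synthetic data rather than the Titanic set. The arrays, the 150/50 split, and the ```random_state``` values are assumptions chosen only to make the example self-contained and repeatable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 200 rows, 3 float features; the label depends on feature 0
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = (X[:, 0] > 0.5).astype(int)

X_train, y_train = X[:150], y[:150]
X_new, y_new = X[150:], y[150:]

model = RandomForestClassifier(n_estimators=100, random_state=0)  # 1. initialize
model.fit(X_train, y_train)                                       # 2. fit
predictions = model.predict(X_new)                                # 3. predict

# score() returns the mean accuracy on the data it is given
accuracy = model.score(X_new, y_new)
print(predictions[:10], accuracy)
```

Scoring on rows that were held out of the fit gives a more honest estimate than scoring on the training rows themselves.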

At this point, it is assumed that you have read the training data into an array ```train_data```, where the first column [0] is the Survived column, that the Name, Cabin, and Ticket columns have been removed, and that the Gender and Embarked columns have been converted to numbers. (If you just completed the previous pandas tutorial, this means you also need to drop the PassengerId column.)
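With that layout, separating the labels from the features is a matter of NumPy slicing. The three-row array below is a hypothetical stand-in for ```train_data```:

```python
import numpy as np

# Hypothetical 3-row array laid out like train_data: Survived first, then features
train_data = np.array([
    [0.0,  7.25, 1.0, 0.0],
    [1.0, 71.28, 0.0, 1.0],
    [1.0,  7.93, 0.0, 0.0],
])

labels = train_data[:, 0]      # column 0: Survived
features = train_data[:, 1:]   # every remaining column

print(labels.shape, features.shape)
```

The slices ```train_data[0::, 0]``` and ```train_data[0::, 1::]``` select exactly the same data; ```0::``` is just a longer way of writing ```:```.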

Preparing Training Data
--------------

In [296]:
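import csv
import numpy as np

# Read the raw CSV into a NumPy array of strings (Python 3 style)
with open('train.csv', newline='') as f:
    csv_file_object = csv.reader(f)
    header = next(csv_file_object)            # skip the header row
    data = [row for row in csv_file_object]   # collect the remaining rows
data = np.array(data)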

In [297]:
import pandas as pd
import numpy as np
df = pd.read_csv('train.csv', header=0)

In [298]:
type(data), type(df)

(numpy.ndarray, pandas.core.frame.DataFrame)

In [299]:
df['Gender'] = df['Sex'].map( {'female': 0, 'male': 1} ).astype(int)

In [300]:
df['EmbNo'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})  # Southampton = 0, Cherbourg = 1, Queenstown = 2

In [301]:
median_ages = np.zeros((2,3))

for i in range(0, 2):
    for j in range(0, 3):
        median_ages[i,j] = df[(df['Gender'] == i) & \
                              (df['Pclass'] == j+1)]['Age'].dropna().median()
 
median_ages


array([[ 35. ,  28. ,  21.5],
       [ 40. ,  30. ,  25. ]])

In [302]:
df['AgeFill'] = df['Age'] #This is where we will enter our best guess on age.

df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Gender | EmbNo | AgeFill |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 1 | 0.0 | 22.0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 | 1.0 | 38.0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 0 | 0.0 | 26.0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | 0 | 0.0 | 35.0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S | 1 | 0.0 | 35.0 |


In [303]:
df[ df['Age'].isnull() ][['Gender','Pclass','Age','AgeFill']].head(10)

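| | Gender | Pclass | Age | AgeFill |
|---|---|---|---|---|
| 5 | 1 | 3 | NaN | NaN |
| 17 | 1 | 2 | NaN | NaN |
| 19 | 0 | 3 | NaN | NaN |
| 26 | 1 | 3 | NaN | NaN |
| 28 | 0 | 3 | NaN | NaN |
| 29 | 1 | 3 | NaN | NaN |
| 31 | 0 | 1 | NaN | NaN |
| 32 | 0 | 3 | NaN | NaN |
| 36 | 1 | 3 | NaN | NaN |
| 42 | 1 | 3 | NaN | NaN |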


In [304]:
for i in range(0, 2):
    for j in range(0, 3):
        df.loc[ (df.Age.isnull()) & (df.Gender == i) & (df.Pclass == j+1),'AgeFill'] = median_ages[i,j]

In [305]:
df[ df['Age'].isnull() ][['Gender','Pclass','Age','AgeFill']].head(10)

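| | Gender | Pclass | Age | AgeFill |
|---|---|---|---|---|
| 5 | 1 | 3 | NaN | 25.0 |
| 17 | 1 | 2 | NaN | 30.0 |
| 19 | 0 | 3 | NaN | 21.5 |
| 26 | 1 | 3 | NaN | 25.0 |
| 28 | 0 | 3 | NaN | 21.5 |
| 29 | 1 | 3 | NaN | 25.0 |
| 31 | 0 | 1 | NaN | 35.0 |
| 32 | 0 | 3 | NaN | 21.5 |
| 36 | 1 | 3 | NaN | 25.0 |
| 42 | 1 | 3 | NaN | 25.0 |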


In [306]:
df['AgeIsNull'] = pd.isnull(df.Age).astype(int)

In [307]:
df['FamilySize'] = df['SibSp'] + df['Parch']
df['Age*Class'] = df.AgeFill * df.Pclass
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
Gender           int32
EmbNo          float64
AgeFill        float64
AgeIsNull        int32
FamilySize       int64
Age*Class      float64
dtype: object

In [308]:
df = df.drop(['Age','PassengerId','Name','Cabin','Ticket','Sex','Embarked','Pclass','SibSp','Parch','AgeIsNull'], axis=1)

In [309]:
df.dtypes

Survived        int64
Fare          float64
Gender          int32
EmbNo         float64
AgeFill       float64
FamilySize      int64
Age*Class     float64
dtype: object

Somewhat heavy-handedly, I now drop any rows which still have missing values. ```.dropna()``` removes an observation from ```df``` even if it has only one ```NaN``` anywhere, in any of its columns. It could delete most of your dataset if you aren't careful with the state of missing values in other columns!

In [310]:
df = df.dropna()
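
To see why this warning matters, here is a small sketch on an invented three-row frame. It compares a blanket ```dropna()``` with one restricted to a single column through the ```subset``` argument; the column values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 lacks EmbNo, row 2 lacks Cabin
df = pd.DataFrame({
    'Fare':  [7.25, 80.0, 8.05],
    'EmbNo': [0.0, np.nan, 0.0],
    'Cabin': ['C85', 'B28', None],
})

print(df.isnull().sum())                      # missing values per column

dropped_any = df.dropna()                     # drops rows 1 and 2
dropped_subset = df.dropna(subset=['EmbNo'])  # drops only row 1

print(len(df), len(dropped_any), len(dropped_subset))
```

Checking ```df.isnull().sum()``` first shows how many rows a blanket drop would cost. In this tutorial the sparse Cabin column has already been dropped, which is why only a couple of rows are lost.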

In [311]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 7 columns):
Survived      889 non-null int64
Fare          889 non-null float64
Gender        889 non-null int32
EmbNo         889 non-null float64
AgeFill       889 non-null float64
FamilySize    889 non-null int64
Age*Class     889 non-null float64
dtypes: float64(4), int32(1), int64(2)
memory usage: 52.1 KB


889 non-null entries in every column, so all gaps are filled and the data is ready to be analysed.

In [312]:
train_data = df.values

Preparing Test Data
-------------

I now need to prepare the test data before applying the forest.

In [313]:
test_df = pd.read_csv('test.csv', header=0)        # Load the test file into a dataframe

In [314]:
# Convert all strings to integer classifiers:
# female = 0, male = 1
test_df['Gender'] = test_df['Sex'].map( {'female': 0, 'male': 1} ).astype(int)

In [315]:
# Embarked takes the values 'C', 'Q', 'S'
# All missing Embarked -> just make them embark from the most common place
# (.loc avoids chained assignment, which may silently fail to update test_df)
if test_df.Embarked.isnull().any():
    test_df.loc[test_df.Embarked.isnull(), 'Embarked'] = test_df.Embarked.dropna().mode()[0]

In [316]:
test_df['EmbNo'] = test_df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

In [317]:
# All the ages with no data -> make the median of all Ages
median_age = test_df['Age'].dropna().median()
if len(test_df.Age[ test_df.Age.isnull() ]) > 0:
    test_df.loc[ (test_df.Age.isnull()), 'Age'] = median_age

In [318]:
# All the missing Fares -> assume median of their respective class
if len(test_df.Fare[ test_df.Fare.isnull() ]) > 0:
    median_fare = np.zeros(3)
    for f in range(0,3):                                              # loop 0 to 2
        median_fare[f] = test_df[ test_df.Pclass == f+1 ]['Fare'].dropna().median()
    for f in range(0,3):                                              # loop 0 to 2
        test_df.loc[ (test_df.Fare.isnull()) & (test_df.Pclass == f+1 ), 'Fare'] = median_fare[f]

In [319]:
test_df.head()

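| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Gender | EmbNo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q | 1 | 2 |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S | 0 | 0 |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q | 1 | 2 |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S | 1 | 0 |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S | 0 | 0 |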


I want to be left with Fare, Gender, EmbNo, AgeFill, FamilySize, and Age*Class.  

In [320]:
test_df['AgeFill']= test_df['Age']

In [321]:
test_df['FamilySize'] = test_df['SibSp'] + test_df['Parch']
test_df['Age*Class'] = test_df.AgeFill * test_df.Pclass

In [322]:
test_df.dtypes

PassengerId      int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
Gender           int32
EmbNo            int64
AgeFill        float64
FamilySize       int64
Age*Class      float64
dtype: object

In [323]:
# Collect the test data's PassengerIds before dropping it so we can match up predictions to passengers.
ids = test_df['PassengerId'].values

In [324]:
test_df = test_df.drop(['PassengerId','Pclass','Name','Sex'], axis=1)

In [325]:
test_df = test_df.drop(['SibSp','Parch','Ticket','Cabin','Embarked'], axis=1)

In [326]:
test_df = test_df.drop(['Age'], axis=1)

In [327]:
test_df.dtypes

Fare          float64
Gender          int32
EmbNo           int64
AgeFill       float64
FamilySize      int64
Age*Class     float64
dtype: object

In [328]:
df.dtypes

Survived        int64
Fare          float64
Gender          int32
EmbNo         float64
AgeFill       float64
FamilySize      int64
Age*Class     float64
dtype: object

In [329]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 6 columns):
Fare          418 non-null float64
Gender        418 non-null int32
EmbNo         418 non-null int64
AgeFill       418 non-null float64
FamilySize    418 non-null int64
Age*Class     418 non-null float64
dtypes: float64(3), int32(1), int64(2)
memory usage: 18.0 KB


No gaps in the data, so we are now ready to convert back into arrays and apply the random forest.

The final ```output``` is an array with a length equal to the number of passengers in the test set and a prediction of whether they survived. You can change the parameters as you see fit, as described on the ```RandomForestClassifier``` [documentation page](http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html).
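
As a hedged illustration of changing the parameters, the sketch below sets a few constructor arguments on synthetic data. The specific values, and the choice of ```max_depth```, ```min_samples_leaf``` and ```random_state```, are assumptions made for this example, not recommendations for the Titanic data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 4 float features; only the first two influence the label
rng = np.random.RandomState(1)
X = rng.rand(300, 4)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

forest = RandomForestClassifier(
    n_estimators=200,     # number of trees
    max_depth=5,          # cap on the depth of each tree
    min_samples_leaf=2,   # minimum number of rows in a leaf
    random_state=42,      # fixed seed so repeated runs build the same forest
)
forest.fit(X, y)

output = forest.predict(X[:5]).astype(int)
importances = forest.feature_importances_   # one value per input column
print(output, importances)
```

Fixing ```random_state``` makes it easier to tell whether a change in score comes from a parameter change or from random variation between runs.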

In [330]:
# Import the random forest package
from sklearn.ensemble import RandomForestClassifier

In [331]:
# Convert back to a numpy array
train_data = df.values
test_data = test_df.values

In [333]:
print('Training...')
# Create the random forest object which will include all the parameters for the fit
forest = RandomForestClassifier(n_estimators=100)
# Fit the training data to the Survived labels and create the decision trees
forest = forest.fit( train_data[0::,1::], train_data[0::,0] )

Training...


In [336]:
print('Predicting...')
output = forest.predict(test_data).astype(int)

Predicting...


In [339]:
# newline="" lets the csv module control line endings (Python 3)
with open("FamSizeAgeClassForest.csv", "w", newline="") as predictions_file:
    open_file_object = csv.writer(predictions_file)
    open_file_object.writerow(["PassengerId", "Survived"])
    open_file_object.writerows(zip(ids, output))
print('Done.')

Done.


"Your submission scored 0.70335, which is not an improvement of your best score. Keep trying!"

This seems strange, as my initial thought was "This is more complicated, therefore it should be better!" But a simple model is not always a bad model. Sometimes concise, simple views of data reveal their true patterns and nature.

Because the data set is very small, the differences in scores can be just one or two flips in decisions between survived or not survived. This means it will be very hard to determine the quality of the model from this data set. The aim of working through this tutorial was to show you an easy way into more difficult problems.