# Model Iteration 2

In this iteration of predicting survival rates for the Kaggle competition [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic), I will try out a few more machine learning models. While doing this, I hope to learn why one algorithm is more effective than another for this problem. I'll also try to explain the tradeoffs of picking one model over another.

Specifically, I will be exploring the following three models:
1. Random Forests
2. Boosting
3. Support Vector Machines

### Read and Clean the Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

predictors = ['Sex', 'Age', 'Pclass', 'SibSp', 'Parch', 'Fare', 'Embarked']

def clean_data(data_frame, predictors):
    # convert Sex to binary (male = 0, female = 1)
    data_frame['Sex'] = data_frame['Sex'].map({'male': 0, 'female': 1})

    # convert Embarked to S = 0, C = 1, Q = 2 (missing values stay NaN for now)
    data_frame['Embarked'] = data_frame['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

    # fill all NaNs with the median value for that feature
    for predictor in predictors:
        data_frame[predictor] = data_frame[predictor].fillna(data_frame[predictor].median())
        
clean_data(train, predictors)
clean_data(test, predictors)

## Random Forests

### What's a Random Forest?

The **Random Forest** model is a learning method that can be used for classification as well as regression. In our case we'll use the classification functionality because we want to predict the survival of passengers aboard the Titanic.

So how does the **Random Forest** model actually classify our data? This model relies on another machine learning model called **Decision Trees** to perform classifications and regressions. Since we're using a bunch of decision *trees*, we end up with a random *forest*. Because **Random Forests** uses another model, we call **Random Forests** an *ensemble* learner. It works by using an ensemble, or collection, of other models.

#### Decision Trees

Let's talk about **Decision Trees**. As the underlying model for **Random Forests**, **Decision Trees** must also be able to perform classifications and regressions. Here's what a decision tree might look like: 

![Titanic Decision Tree Image](img/decision_tree.svg)

At the top of the diagram, we begin with all the passengers. If a passenger's sex is female (Sex=1), they are sorted toward the right side of the tree; males (Sex=0) are sorted toward the left side. Males are then sorted by age: those younger than 6.5 filter into the next left node, and those older than 6.5 filter into the lower node. We continue sorting the passengers based on similar questions. Once passengers make their way to the lowest tier of nodes, their class is decided.

**Decision Trees** work like we just saw above. They ask a series of questions and sort the passengers into categories based on the answers to those questions.
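
To make the idea concrete, here is a minimal sketch of a decision tree written out by hand as nested questions. The split on Sex and the age threshold of 6.5 follow the walkthrough above; the leaf labels and the `classify` function name are assumptions made up for illustration, not values read from the real fitted tree.

```python
def classify(passenger):
    """Walk a tiny hand-written decision tree and return a predicted class.

    The leaf labels below are illustrative assumptions only.
    """
    if passenger["Sex"] == 1:      # female branch
        return "Survived"          # assumed leaf label
    if passenger["Age"] < 6.5:     # young male branch
        return "Survived"          # assumed leaf label
    return "Died"                  # assumed leaf label for older males


print(classify({"Sex": 0, "Age": 30.0}))
print(classify({"Sex": 1, "Age": 30.0}))
print(classify({"Sex": 0, "Age": 4.0}))
```

A fitted tree does the same thing, except that the questions and thresholds are learned from the training data instead of being typed in.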

#### Random Forest is an Ensemble of Decision Trees

A random forest is created by generating a number of decision trees. Each of these decision trees is created using a random subset of the input data. That's where the *random* in random forests comes from. The decision trees then predict a class for a given passenger. Some may predict *Survived* and some might predict *Died*, but what's important is the majority vote. If two trees think Jack survived and three predict Jack died, then we go with the conclusion that Jack sadly died.
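
The voting step can be sketched in a few lines. This is only an illustration of the majority vote, with the five tree predictions typed in by hand (an assumption); a real forest would get them from its fitted trees.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical votes from five trees about Jack: two say Survived, three say Died.
tree_votes = ["Survived", "Survived", "Died", "Died", "Died"]
print(majority_vote(tree_votes))
```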

### When to Use Random Forests

* Classification or Regression problems
* Noisy inputs
* Many features of unknown importance

### Random Forests in Action

With scikit-learn, using a random forest learner is simple. There are some optional parameters that can affect the performance of the classifier, though. The ones I've found to be most important are **max_depth**, the maximum depth of any of the decision trees used, and **n_estimators**, the number of decision trees to use. It is probably worthwhile to adjust these parameters to get the best score you can.
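
One way to adjust these two parameters is a small grid search. The sketch below is not what produced the scores in this notebook: it runs on synthetic data from `make_classification` as a stand-in for the Titanic features, and the candidate values in `param_grid` are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption for illustration, not the Titanic set).
X, y = make_classification(n_samples=200, n_features=7, random_state=0)

# Candidate values to try; these particular numbers are arbitrary.
param_grid = {
    "max_depth": [2, 4, 8],
    "n_estimators": [10, 50],
}

# Try every combination with 3-fold cross validation and keep the best one.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross validation score: %f" % search.best_score_)
```

On the real data, `X` and `y` would be replaced with `train[predictors]` and `train.Survived`.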

In [2]:
# cross_val_score lives in sklearn.model_selection in current scikit-learn releases
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=10, max_depth=4)
scores = cross_val_score(clf, train[predictors], train.Survived, cv=3)
print("Cross Validation Score: %f" % scores.mean())

Cross Validation Score: 0.804714


![Kaggle Score](img/kaggle_random_forest.png)

Kaggle Score: **0.78947**

### Resources

* https://www.kaggle.com/c/titanic/details/getting-started-with-random-forests
* http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
* https://en.wikipedia.org/wiki/Random_forest
* https://en.wikipedia.org/wiki/Decision_tree_learning
* http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
* http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html#sklearn.tree.export_graphviz

## Boosting

### What's boosting?

Like Random Forests, **Boosting** models are ensemble learners. This means that models like AdaBoost rely on other underlying learners. In the case of Boosting, many weak learners are used together to achieve a strong learner. 

Boosting works much like a bunch of struggling students in a math class. Let's say each student can score 60% on their test independently. If each student is better at answering different questions on the test, then maybe as a group all of the students would be able to bring their separate pockets of knowledge together and answer all of the questions correctly. Boosting works the same way. We take a bunch of learners that have different areas of expertise and get them to work together and answer the questions each learner knows best.
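
The classroom analogy can be played out in code. This is only the intuition, using a plain majority vote over made-up answers: real boosting algorithms such as AdaBoost also train their learners one after another, give more weight to the examples earlier learners got wrong, and weight each learner's vote. All names and answers below are hypothetical.

```python
# Six yes/no questions: 1 = "yes", 0 = "no".
answer_key = [1, 0, 1, 1, 0, 1]

# Each hypothetical student gets a different pair of questions wrong.
students = {
    "A": [0, 1, 1, 1, 0, 1],  # wrong on questions 1 and 2
    "B": [1, 0, 0, 0, 0, 1],  # wrong on questions 3 and 4
    "C": [1, 0, 1, 1, 1, 0],  # wrong on questions 5 and 6
}


def accuracy(answers):
    """Fraction of answers that match the answer key."""
    correct = sum(a == k for a, k in zip(answers, answer_key))
    return correct / float(len(answer_key))


# The group answers each question with whatever at least two students chose.
group = [1 if sum(votes) >= 2 else 0 for votes in zip(*students.values())]

for name in sorted(students):
    print("Student %s: %.2f" % (name, accuracy(students[name])))
print("Group: %.2f" % accuracy(group))
```

Each student is right about two thirds of the time, but because their mistakes fall on different questions, the group gets every question right.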

### When to use Boosting

* Well labeled data
* Low noise
* Few outliers present
* As a first pass model
* High dimensional data

### Boosting in Action

In [3]:
from sklearn.ensemble import AdaBoostClassifier

clf = AdaBoostClassifier()
scores = cross_val_score(clf, train[predictors], train.Survived, cv=3)
print("Cross Validation Score: %f" % scores.mean())

Cross Validation Score: 0.797980


![Kaggle Score](img/kaggle_adaboost.png)

Kaggle Score: **0.75120**

### Resources

* https://en.wikipedia.org/wiki/Boosting_%28machine_learning%29
* http://www.cs.princeton.edu/~schapire/talks/picasso-minicourse.pdf
* http://math.mit.edu/~rothvoss/18.304.3PM/Presentations/1-Eric-Boosting304FinalRpdf.pdf
* http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

## Support Vector Machines

### What are Support Vector Machines?

An SVM is a learning model that linearly separates data into classes. The following image shows three possible lines the SVM could draw to separate the black dots from the white dots. H1 doesn't actually separate the data, so it would not be chosen. H2 does separate the data into the correct classes, but it leaves little margin. Finally, H3 both separates the data and leaves ample margin, so H3 would be the most effective separator.

![SVM](https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/Svm_separating_hyperplanes_%28SVG%29.svg/330px-Svm_separating_hyperplanes_%28SVG%29.svg.png)

By User:ZackWeinberg, based on PNG version by User:Cyc [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
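
The comparison between H1, H2, and H3 can be sketched numerically. The points and the three candidate lines below are made up to loosely mirror the picture (they are assumptions, not the coordinates in the image), and the `margin` helper is a hypothetical name. An SVM searches for the separating line with the largest margin; this sketch only measures the margin of three fixed candidates.

```python
import math

# Toy 2-D data: label -1 for "black" dots, +1 for "white" dots (made-up points).
points = [((1, 1), -1), ((2, 1), -1), ((1, 2), -1),
          ((4, 4), +1), ((5, 4), +1), ((4, 5), +1)]

# Candidate lines of the form w[0]*x + w[1]*y + b = 0.
candidates = {
    "H1": ((1.0, 0.0), -1.5),
    "H2": ((1.0, 1.0), -3.5),
    "H3": ((1.0, 1.0), -5.5),
}


def margin(w, b):
    """Distance from the line to the nearest point, or None if any point
    ends up on the wrong side of the line."""
    norm = math.hypot(w[0], w[1])
    distances = []
    for (x, y), label in points:
        signed = label * (w[0] * x + w[1] * y + b) / norm
        if signed <= 0:
            return None
        distances.append(signed)
    return min(distances)


margins = {name: margin(w, b) for name, (w, b) in candidates.items()}
for name in sorted(margins):
    m = margins[name]
    print(name, "does not separate the classes" if m is None else "margin = %.3f" % m)
```

Here H1 fails to separate the classes, H2 separates them with a small margin, and H3 separates them with the largest margin of the three.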

### When to use SVM

* High dimensional data
* Slightly more dimensions than data points
* Class probabilities are not needed

### SVM in Action

In [4]:
from sklearn.svm import SVC

clf = SVC()
scores = cross_val_score(clf, train[predictors], train.Survived, cv=3)
print("Cross Validation Score: %f" % scores.mean())

Cross Validation Score: 0.691358


![Kaggle Score](img/kaggle_svc.png)

Kaggle Score: **0.62679**

### Resources

* http://www.cs.princeton.edu/~schapire/talks/picasso-minicourse.pdf
* https://en.wikipedia.org/wiki/Support_vector_machine
* http://scikit-learn.org/stable/modules/svm.html

## Ensemble

♪ all together now... ♫ 

In [18]:
forest = RandomForestClassifier(max_depth=4)
boost = AdaBoostClassifier()
svc = SVC()

forest.fit(train[predictors], train.Survived)
boost.fit(train[predictors], train.Survived)
svc.fit(train[predictors], train.Survived)

y_forest = forest.predict(test[predictors])
y_boost = boost.predict(test[predictors])
y_svc = svc.predict(test[predictors])

df = pd.DataFrame({
    'forest': y_forest,
    'boost': y_boost,
    'svc': y_svc
})

# mode(axis=1) returns a DataFrame; with three binary votes the majority is its first column
df['mode'] = df.mode(axis=1)[0]

results = pd.DataFrame({
    'PassengerId': test.PassengerId,
    'Survived': df['mode']
})
results.to_csv('ensemble.csv', index=False)
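
The majority vote in the cell above relies on pandas' `DataFrame.mode`. Here is a small sketch of what it does on a toy frame; the three rows of predictions are hypothetical and stand in for the real model outputs.

```python
import pandas as pd

# Hypothetical predictions from three classifiers for three passengers.
votes = pd.DataFrame({
    "forest": [1, 0, 1],
    "boost":  [1, 0, 0],
    "svc":    [0, 0, 1],
})

# mode(axis=1) finds the most common value in each row and returns a DataFrame.
# With three binary votes there is never a tie, so column 0 holds the majority.
majority = votes.mode(axis=1)[0].astype(int)
print(majority.tolist())
```

With an even number of voters a row could tie, and `mode` would then return more than one column, so an odd number of classifiers keeps the vote unambiguous.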

![Kaggle Score](img/kaggle_ensemble.png)

Kaggle Score: **0.74641**