# Problem Set 6: Trees and Forests


## Part 1: Exploring the Titanic

Your mission for this problem set is to use your knowledge of supervised machine learning to try to predict which passengers aboard the Titanic were most likely to survive. The prompts for this part of the problem set are deliberately vague - the goal is to leave it up to you how to structure (most of) your analysis.

To get started, read about the prediction problem on [Kaggle](https://www.kaggle.com/c/titanic). Then download the data [here](https://www.kaggle.com/c/titanic/data); at a minimum, you will need train.csv.

### 1.1 Exploratory data analysis

Create 2-3 figures and tables that help give you a feel for the data. Make sure to at least check the data type of each variable, to understand which variables have missing observations, and to understand the distribution of each variable (and determine whether the variables should be normalized or not). Are any of the potential predictor variables (i.e., anything except for survival) collinear or highly correlated? 
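
As one possible starting point, the checks listed above (data types, missing observations, distributions, correlations) can be sketched with pandas. The small DataFrame here is a made-up stand-in that only borrows a few column names from train.csv; with the real data you would load the file with `pd.read_csv("train.csv")` instead.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for train.csv (assumption: a handful of the real column names).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 3],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 8.46],
})

# Data type of each variable
print(df.dtypes)

# Share of missing observations per column
missing_share = df.isna().mean()
print(missing_share)

# Summary statistics give a first look at each numeric distribution
print(df.describe())

# Pairwise correlations among the numeric predictors (target excluded)
corr = df.drop(columns="Survived").select_dtypes("number").corr()
print(corr)
```

Histograms (`df.hist()`) or a heatmap of `corr` are natural figures to build from these tables.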

In [1]:
# enter your code here

### 1.2 Correlates of survival

Use whatever methods you can think of to figure out which factors seem to determine whether a person survived the sinking of the Titanic. What do you conclude?
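
One simple method, among many, is to compare survival rates across groups. The sketch below uses a made-up toy table (the values are not taken from the real data) to show the mechanics of `groupby` and `pivot_table`.

```python
import pandas as pd

# Toy data for illustration only; the values are invented.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 0, 1, 1],
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 2],
    "Sex": ["male", "female", "female", "female",
            "male", "male", "male", "female"],
})

# Mean of a 0/1 column is the survival rate within each group
rate_by_sex = df.groupby("Sex")["Survived"].mean()
print(rate_by_sex)

# Two-way breakdown: survival rate by sex and passenger class
rate_table = df.pivot_table(
    index="Sex", columns="Pclass", values="Survived", aggfunc="mean"
)
print(rate_table)
```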

In [2]:
# enter your code here

*Enter your observations here*

## Part 2: Decision Trees
### 2.1 Decision Tree
Using the basic [Decision Tree](http://scikit-learn.org/stable/modules/tree.html#tree) module in sklearn, fit a model to predict Titanic survival. Make sure you come up with an appropriate way of handling each of the input variables before feeding them into the decision tree. Use the [DecisionTreeClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) class as your estimator and evaluate it with 10-fold cross-validation.

For this and the following problems, you should set aside some of your training data as held-out test data, prior to cross-validation. Report the average training and testing accuracy across your 10 folds, and show a diagram of the tree (at least the first three levels). Finally, select the best-performing decision tree (i.e., the one that achieved the highest cross-validated performance) and report the performance of the fitted model on the held-out test data. How does it compare to the cross-validated accuracy?
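
The workflow described here (hold out a test set first, then cross-validate on the rest) might look roughly like the following sketch. It uses synthetic data from `make_classification` as a stand-in for your preprocessed Titanic features; the 80/20 split and `max_depth=3` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed features X and labels y (assumption)
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

# Set aside held-out test data before any cross-validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 10-fold cross-validation on the training portion only
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
cv = cross_validate(
    clf, X_train, y_train, cv=10,
    scoring="accuracy", return_train_score=True,
)
mean_train_acc = cv["train_score"].mean()
mean_val_acc = cv["test_score"].mean()
print(f"mean train accuracy:      {mean_train_acc:.3f}")
print(f"mean validation accuracy: {mean_val_acc:.3f}")

# Refit on all training data, then score once on the held-out set
clf.fit(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print(f"held-out test accuracy:   {test_acc:.3f}")
```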


In [3]:
from sklearn import tree
# Enter your code here

*Enter your observations here*

### 2.2 Features

Use all of the data (minus the held-out data) to re-fit a single decision tree with max_depth = 4 (i.e., no cross-validation). Show the tree diagram and also plot the feature importances. What do you observe?
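
A minimal sketch of the two outputs requested here, again on synthetic data with hypothetical feature names: `export_text` prints a text version of the tree (sklearn's `plot_tree` draws a graphical one), and `feature_importances_` holds the importances of the fitted model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the training data; feature names are hypothetical
feature_names = [f"feature_{i}" for i in range(5)]
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0,
    random_state=0,
)

# Single tree, no cross-validation
tree_clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Text rendering of the fitted tree
print(export_text(tree_clf, feature_names=feature_names))

# Importances, sorted so the most influential features come first
importances = pd.Series(
    tree_clf.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

Calling `importances.plot.bar()` would turn the sorted Series into the requested plot.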

In [None]:
# Enter your code here

*Enter your observations here*

### 2.3 Tree Tuning
The built-in algorithm you are using has several parameters which you can tune. Using cross-validation, show how the choice of these parameters affects performance.

First, show how max_depth affects train and test accuracy. On a single axis, plot train and test accuracy as a function of max_depth. Use a red line to show test accuracy and a blue line to show train accuracy. Do not use your held-out test data.

Second, show how test accuracy relates to both max_depth and min_samples_leaf. Specifically, create a 3-D plot where the x-axis is max_depth, the y-axis is min_samples_leaf, and the z-axis shows accuracy. What combination of max_depth and min_samples_leaf achieves the highest accuracy? How sensitive are the results to these two parameters?
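
The numbers behind both plots can be computed as in the sketch below, which assumes synthetic data in place of your training set and an arbitrary grid of candidate values. `validation_curve` gives per-fold train and validation scores for one parameter, and `GridSearchCV` covers the two-parameter grid; the plotting itself is left out.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training portion of the data (assumption)
X, y = make_classification(n_samples=400, n_features=6, random_state=0)

depths = [1, 2, 3, 5, 8]      # candidate max_depth values (arbitrary)
leaves = [1, 5, 10, 20]       # candidate min_samples_leaf values (arbitrary)

# Train and validation accuracy as a function of max_depth
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths,
    cv=10, scoring="accuracy",
)
train_curve = train_scores.mean(axis=1)  # one value per depth
val_curve = val_scores.mean(axis=1)

# Validation accuracy over the (max_depth, min_samples_leaf) grid
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": depths, "min_samples_leaf": leaves},
    cv=10, scoring="accuracy",
)
search.fit(X, y)

acc = np.zeros((len(depths), len(leaves)))
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    i = depths.index(params["max_depth"])
    j = leaves.index(params["min_samples_leaf"])
    acc[i, j] = score

print(search.best_params_)
print(acc.round(3))
```

`train_curve` and `val_curve` feed the line plot, and `acc` (rows are max_depth, columns are min_samples_leaf) supplies the z-values for the 3-D surface.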

In [None]:
# Enter your code here

*Enter your observations here*

### 2.4 Support Vector Machines, for comparison

Now use an [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) to predict survival, using the default value of the penalty constant (C=1). Report your accuracy on the training and test sets.

Use cross-validation to determine a possibly better choice for C. Note that in sklearn the strength of regularization is inversely proportional to C, i.e., the higher the value you choose for C, the less you regularize.

* How does the test performance with SVM for your best choice of C compare to the decision tree performance?
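
A rough sketch of the baseline and the search over C, on synthetic data. Standardizing the features inside a pipeline is an added assumption here (SVMs tend to be sensitive to feature scale), and the candidate values of C are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed Titanic data (assumption)
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Baseline with the default C=1
baseline = make_pipeline(StandardScaler(), SVC(C=1.0)).fit(X_train, y_train)
print("train accuracy:", baseline.score(X_train, y_train))
print("test accuracy: ", baseline.score(X_test, y_test))

# Cross-validated search over C; "svc" is the step name make_pipeline assigns
candidate_C = [0.01, 0.1, 1, 10, 100]
search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),
    param_grid={"svc__C": candidate_C},
    cv=10, scoring="accuracy",
)
search.fit(X_train, y_train)
best_C = search.best_params_["svc__C"]
tuned_test_acc = search.score(X_test, y_test)
print("best C:", best_C, "test accuracy:", tuned_test_acc)
```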


In [None]:
# Enter your code here

*Enter your observations here*

### 2.5 Missing Data
Have you been paying close attention to your features? If not, now is a good time to start. Perform analysis that allows you to answer the following questions:
* Do any of your features have missing data? If so, which ones? What percent of observations have missing data?
* What happens to observations with missing data when you run the decision tree and SVM models above?
* Use one of the methods we discussed in class to impute missing values.
* Rerun your decision tree and SVM on the new dataset with imputed missing values. What do you notice?
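
As an illustration of the first and third bullets, the sketch below measures the percentage of missing values and fills them with the column median using `SimpleImputer`. Median imputation is only an assumed example and may not be the method covered in class; the toy values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric columns with gaps (invented values)
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86],
})

# Percent of observations missing in each column
pct_missing = df.isna().mean() * 100
print(pct_missing)

# Replace each missing value with the median of its column
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```

To avoid leaking information from the held-out data, it is usually safer to fit the imputer on the training data only and then apply it to the test data with `transform`.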

In [None]:
# Enter your code here

*Enter your observations here*

## Part 3: Forests

### 3.1: Random Forest
Use the [random forest classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) to predict survival on the Titanic. Use cross-validation on the training data to choose the best hyperparameters.
* What hyperparameters did you select with cross-validation?
* How does the cross-validated performance (average across validation folds) compare to the test performance (using the top-performing, fitted model selected through cross-validation)?
* How does the RF performance compare to the decision tree and SVM?
* Create a plot that shows how cross-validated performance (y-axis) relates to the number of trees in the forest (x-axis).
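
For the last bullet, the data behind the plot can be collected with a simple loop, sketched here on synthetic data with an arbitrary list of forest sizes and 5 folds to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data (assumption)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

tree_counts = [5, 10, 25, 50, 100]   # candidate n_estimators (arbitrary)
cv_means = []
for n in tree_counts:
    rf = RandomForestClassifier(n_estimators=n, random_state=0)
    scores = cross_val_score(rf, X, y, cv=5, scoring="accuracy")
    cv_means.append(scores.mean())

for n, m in zip(tree_counts, cv_means):
    print(f"{n:>4} trees: mean CV accuracy = {m:.3f}")
```

Plotting `tree_counts` on the x-axis against `cv_means` on the y-axis gives the requested figure.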

In [None]:
# Enter your code here

*Enter your observations here*

### 3.2: Gradient Boosting

Use the [Gradient Boosting classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) to predict survival on the Titanic. Tune your hyperparameters. 
* How does the GBM performance compare to the other models?
* Create a figure showing the feature importances in your final model (with properly tuned hyperparameters)
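
One way to tune and then inspect a gradient boosting model is sketched below. The data are synthetic, the feature names are hypothetical, and the small parameter grid is only an example of what might be searched.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data; feature names are hypothetical
feature_names = [f"feature_{i}" for i in range(5)]
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    random_state=0,
)

# Small illustrative grid; a real search would likely be wider
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid, cv=5, scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))

# Feature importances of the refitted best model
importances = pd.Series(
    search.best_estimator_.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```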


In [None]:
# Enter your code here

*Enter your observations here*

### 3.3 Feature Engineering
Revisit the features in your dataset.
* Is each of the features being appropriately included in the analysis?
* Find a way to engineer meaningful features from the "Name" and/or "Cabin" fields in the data.
* Create a final table that summarizes the performance of your models as follows:

| Model | Cross-validated Performance   | Test Performance | 
|------|------|------|
|   Decision Tree        |  |  |
|   Decision Tree (with imputed missing values and new features)        |  |  |
|   SVM  |  |  |
|   SVM (with imputed missing values and new features)        |  |  |
|   Random Forest        |  |  |
|   Random Forest (with imputed missing values and new features)        |  |  |
|   Gradient Boosting    |  |  | 
|   Gradient Boosting (with imputed missing values and new features)        |  |  |
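
As a hint for the "Name" and "Cabin" bullet, two commonly tried ideas are extracting a title from the name and a deck letter from the cabin. The sketch assumes names follow a "Surname, Title. Given names" pattern and cabins start with a deck letter; the column names `Title` and `Deck` are hypothetical, and the three rows are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative rows in the format of the Name and Cabin columns
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
    ],
    "Cabin": [np.nan, "C85", np.nan],
})

# Title: the text between the comma and the first period
df["Title"] = (
    df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
)

# Deck: first character of the cabin, with a placeholder when missing
df["Deck"] = df["Cabin"].str[0].fillna("Unknown")

print(df[["Title", "Deck"]])
```

Both new columns are categorical, so they would still need encoding (for example with `pd.get_dummies`) before being passed to the models.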


In [None]:
# Enter your code here

*Enter your observations here*

## Part 4 (extra credit): Neural Networks

### Let's get familiar with neural networks!
Now, try to predict survival on the Titanic using feedforward neural networks. This will likely be easiest with [TensorFlow](https://www.tensorflow.org/), as highlighted in the lab section.

For this problem, you are responsible for choosing the number of layers, their sizes, the activation functions, and the gradient descent algorithm (and its parameters, e.g., the learning rate). Pick those parameters by hand; for some of them you can also perform cross-validation if you wish. Your goal is to tune those parameters so that your test accuracy is higher than 78%. Make sure you process your data appropriately before training your networks.

* Report your best accuracy on the test set along with your choice of parameters. More specifically, report the number of layers, their size, the activation functions and your choice of optimization algorithm. 

* Provide a plot of the test accuracy (y-axis) with respect to the number of epochs (x-axis). The number of epochs is the number of times we have iterated through our entire training set.

* It is a good exercise to experiment with different optimizers (gradient descent, stochastic gradient descent, AdaGrad, etc.), learning rates, and batch sizes to get a feeling for how they affect neural network training. No need to report anything here.
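
A minimal Keras sketch of the kind of network and per-epoch accuracy tracking described in the bullets. Everything specific in it is an assumption chosen for illustration: the synthetic data, the two hidden layers of 16 and 8 units, ReLU activations, the Adam optimizer with learning rate 0.01, and 20 epochs.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for the processed Titanic features (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")
X_train, X_test = X[:320], X[320:]
y_train, y_test = y[:320], y[320:]

tf.random.set_seed(0)

# Feedforward network: two hidden layers, sigmoid output for binary labels
model = tf.keras.Sequential([
    tf.keras.Input(shape=(6,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# Passing the test set as validation data records its accuracy every epoch
n_epochs = 20
history = model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=n_epochs, batch_size=32, verbose=0,
)
test_acc_per_epoch = history.history["val_accuracy"]
print(test_acc_per_epoch[-1])
```

Plotting `test_acc_per_epoch` against `range(1, n_epochs + 1)` gives test accuracy as a function of the number of epochs.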


In [None]:
# Enter your code here
import tensorflow as tf