# EXTRA | Lab 12 | Decision Trees and Random Forests

In [23]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# sklearn.cross_validation was removed from scikit-learn; model_selection replaces it
from sklearn import tree, ensemble, metrics, model_selection, externals

pd.set_option('display.max_rows', 10)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

%matplotlib inline
plt.style.use('ggplot')

Today we are revisiting the `boston` dataset from session 8.

## Dataset documentation

The Boston dataset records housing values in the suburbs of Boston.  Its variables are:

- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sqft
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centers
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate (per ten thousands of dollars)
- PTRATIO: pupil-teacher ratio by town
- B (stored as column `BLACK` in this dataset): 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT: % lower status of the population
- MEDV: Median value of owner-occupied homes (in thousands of dollars)

In [24]:
df = pd.read_csv(os.path.join('..', 'datasets', 'boston.csv'))

In [25]:
df.head()

,CRIM,ZN,INDUS,CHAS,NOX,...,TAX,PTRATIO,BLACK,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,...,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,...,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,...,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,...,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,...,222,18.7,396.9,5.33,36.2


In [26]:
df.columns

Index([u'CRIM', u'ZN', u'INDUS', u'CHAS', u'NOX', u'RM', u'AGE', u'DIS',
       u'RAD', u'TAX', u'PTRATIO', u'BLACK', u'LSTAT', u'MEDV'],
      dtype='object')

## Part A - Regression Decision Trees

> ### Question 1.  Create a feature matrix `X` with all features but `MEDV`.  Also create a response vector `y` with `MEDV`

In [27]:
df2 = df.drop('MEDV', axis=1)

In [28]:
df2.columns

Index([u'CRIM', u'ZN', u'INDUS', u'CHAS', u'NOX', u'RM', u'AGE', u'DIS',
       u'RAD', u'TAX', u'PTRATIO', u'BLACK', u'LSTAT'],
      dtype='object')

> ### Question 2.  Split the dataset into training and testing sets (60%, 40%)

In [29]:
from sklearn.model_selection import train_test_split

In [30]:
X = df2
y = df.MEDV

In [31]:
# train_test_split returns the splits in the order X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
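
The unpacking order of `train_test_split` is easy to get wrong, so here is a minimal sketch of it. The arrays `X_demo` and `y_demo` are synthetic stand-ins (an assumption for illustration, not the Boston data), built so that each row of `X_demo` can be matched back to its `y_demo` value.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Boston dataset).
# Row i of X_demo is [2*i, 2*i + 1] and y_demo[i] is i.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# The return order is: X_train, X_test, y_train, y_test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42
)

print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)

# Rows and labels stay aligned after the split
print(np.array_equal(X_tr[:, 0], 2 * y_tr))
```

If the four names are unpacked in a different order, the feature and response arrays end up with mismatched lengths and the model fit fails or trains on the wrong data.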

> ### Question 3.  Create a decision tree (`tree.DecisionTreeRegressor`) and train the model

(Check http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html as needed)

> ### Question 4.  What are the training and testing errors?

(Check http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html as needed)
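
As a hedged starting point for Questions 3 and 4, the sketch below fits a regression tree and computes the training and testing MSE. It runs on synthetic data from `make_regression` (an assumption, so that the cell runs without `boston.csv`); the names `X_syn`, `y_syn`, and `reg` are illustrative only. Swap in your own `X_train`, `X_test`, `y_train`, and `y_test` for the lab.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption; not the Boston dataset)
X_syn, y_syn = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.4, random_state=42
)

# A tree with no depth limit keeps splitting until the leaves are pure
reg = DecisionTreeRegressor(random_state=0)
reg.fit(Xs_train, ys_train)

# mean_squared_error takes (y_true, y_pred)
train_mse = mean_squared_error(ys_train, reg.predict(Xs_train))
test_mse = mean_squared_error(ys_test, reg.predict(Xs_test))
print('train MSE:', train_mse)
print('test MSE:', test_mse)
```

An unrestricted tree typically memorizes the training set, so a large gap between the two errors is a sign of overfitting rather than of a good model.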

In [None]:
# use MSE

Answer:

> ### Question 5.  Retrain your model limiting the depth of the tree to 4

> ### Question 6.  What are the top 5 most important features?
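
One possible approach to Questions 5 and 6 is sketched below: limit the depth with `max_depth=4`, then rank `feature_importances_`. The data is synthetic and the column names `f0` to `f7` are hypothetical (both are assumptions for illustration); with the lab data, the DataFrame columns supply the real feature names.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with hypothetical feature names (assumptions)
X_arr, y_syn = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=0
)
feature_names = ['f{}'.format(i) for i in range(X_arr.shape[1])]
X_syn = pd.DataFrame(X_arr, columns=feature_names)

# Limit the tree to at most 4 levels of splits
shallow = DecisionTreeRegressor(max_depth=4, random_state=0)
shallow.fit(X_syn, y_syn)

# feature_importances_ is normalized to sum to 1
importances = pd.Series(shallow.feature_importances_, index=X_syn.columns)
top5 = importances.sort_values(ascending=False).head(5)
print(top5)
```

Features that the tree never splits on get an importance of exactly 0, so a shallow tree often leaves several features unused.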

Answer:

## Part B - Classification Decision Trees

> ### Question 7.  Let's categorize `MEDV` into 4 groups: the bottom 20% as Level 1, the next 30% as Level 2, the next 30% as Level 3, and the top 20% as Level 4.  Please create a new variable `MEDV_Category` that stores the level number

In [None]:
level_2 = ((df.MEDV > df.MEDV.quantile(.2)) & (df.MEDV <= df.MEDV.quantile(.5)))
level_3 = ((df.MEDV > df.MEDV.quantile(.5)) & (df.MEDV <= df.MEDV.quantile(.8)))
level_4 = (df.MEDV > df.MEDV.quantile(.8))

df['MEDV_Category'] = '1'
df.loc[level_2, 'MEDV_Category'] = '2'
df.loc[level_3, 'MEDV_Category'] = '3'
df.loc[level_4, 'MEDV_Category'] = '4'
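
As a sanity check on the masks above, the same binning can be sketched with `pd.cut` on the quantile edges. The series `medv` here is a synthetic stand-in for `df.MEDV` (an assumption), and `cat_mask` and `cat_cut` are illustrative names.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df.MEDV (an assumption; not the real data)
rng = np.random.RandomState(0)
medv = pd.Series(rng.uniform(5, 50, size=100))

# Boolean-mask approach, mirroring the cell above
cat_mask = pd.Series('1', index=medv.index)
cat_mask.loc[(medv > medv.quantile(.2)) & (medv <= medv.quantile(.5))] = '2'
cat_mask.loc[(medv > medv.quantile(.5)) & (medv <= medv.quantile(.8))] = '3'
cat_mask.loc[medv > medv.quantile(.8)] = '4'

# Equivalent binning: right-closed intervals between the quantile edges
edges = medv.quantile([0, .2, .5, .8, 1]).values
cat_cut = pd.cut(
    medv, bins=edges, labels=['1', '2', '3', '4'], include_lowest=True
).astype(str)

print((cat_mask == cat_cut).all())
print(cat_cut.value_counts().sort_index())
```

Calling `value_counts()` on the new column is a quick way to confirm that the four levels hold roughly 20%, 30%, 30%, and 20% of the rows.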

> ### Question 8.  Create a feature matrix `X` with all features but `MEDV` (and without the new `MEDV_Category` column).  Also create a response vector `y` with `MEDV_Category`

> ### Question 9.  Split your dataset again into training and testing sets (60%, 40%)

> ### Question 10.  Create a decision tree (`tree.DecisionTreeClassifier`) of depth 2 and with leaf nodes having a minimum of 5 observations.  Then train the model

> ### Question 11.  What is your 10-fold cross-validation error on your training set?
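
For Questions 10 and 11, a minimal sketch under stated assumptions: the data comes from `make_classification` with 4 classes (a synthetic stand-in for the `MEDV_Category` problem), and the cross-validation error is taken as one minus the mean accuracy, which is one common reading of "error" for a classifier.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data (an assumption; not the Boston dataset)
Xc, yc = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=0
)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    Xc, yc, test_size=0.4, random_state=42
)

# Depth 2, with at least 5 observations in every leaf
clf = DecisionTreeClassifier(max_depth=2, min_samples_leaf=5, random_state=0)
clf.fit(Xc_train, yc_train)

# cross_val_score refits a clone of clf on each of the 10 folds;
# for classifiers the default score is accuracy
scores = cross_val_score(clf, Xc_train, yc_train, cv=10)
cv_error = 1 - scores.mean()
print('10-fold CV error:', cv_error)
```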

> ### Question 12.  What are the features used by the model?
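
One way to approach Question 12 is to read the split features off the fitted tree. In the sketch below, `tree_.feature` holds the feature index used at each node, with negative values marking leaves. The data is synthetic and the column names `f0` to `f7` are hypothetical (both assumptions for illustration).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data with hypothetical feature names (assumptions)
Xc_arr, yc = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=0
)
Xc = pd.DataFrame(Xc_arr, columns=['f{}'.format(i) for i in range(8)])

clf = DecisionTreeClassifier(max_depth=2, min_samples_leaf=5, random_state=0)
clf.fit(Xc, yc)

# Internal nodes store a feature index >= 0; leaf nodes store a negative value
split_idx = clf.tree_.feature
used = sorted(set(Xc.columns[i] for i in split_idx if i >= 0))
print(used)
```

A tree of depth 2 has at most three internal nodes, so it can use at most three distinct features.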

Answer:

## Part C - Random Forest with Categorical Variables

> ### Question 13.  Create a Random Forest Classifier with `n_estimators = 1000, max_features = 4, min_samples_leaf = 5, oob_score = True`

(Check http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html as needed)

> ### Question 14.  What is your 10-fold cross-validation error on your training set?
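
A hedged sketch for Questions 13 and 14 follows. It uses synthetic 4-class data (an assumption) and only 100 trees instead of the 1000 requested in Question 13, purely to keep the sketch fast; set `n_estimators=1000` for the lab itself. Errors are reported as one minus accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 4-class data (an assumption; not the Boston dataset)
Xc, yc = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=0
)

# n_estimators is reduced from 1000 to 100 here for speed (an assumption)
forest = RandomForestClassifier(
    n_estimators=100, max_features=4, min_samples_leaf=5,
    oob_score=True, random_state=0
)
forest.fit(Xc, yc)

# Out-of-bag accuracy: each sample is scored only by trees that did not see it
oob_error = 1 - forest.oob_score_

# 10-fold cross-validation error on the same data
cv_scores = cross_val_score(forest, Xc, yc, cv=10)
cv_error = 1 - cv_scores.mean()
print('OOB error:', oob_error)
print('10-fold CV error:', cv_error)
```

The out-of-bag estimate comes for free from a single fit, so it is worth comparing against the cross-validation error, which requires ten additional fits.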

Answer: