In [8]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.tree import DecisionTreeRegressor, export_graphviz, plot_tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2, so this import needs an older version

In [9]:
data = load_boston()
print(data['DESCR'])

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

### Prepare the X (features) and y ('target' or 'response' variable) dataframes


In [10]:
X = pd.DataFrame(data['data'], columns = data['feature_names'])
X.head()

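|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |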


In [11]:
y = pd.DataFrame(data['target'], columns=['MEDV'])
y.head()

|   | MEDV |
|---|------|
| 0 | 24.0 |
| 1 | 21.6 |
| 2 | 34.7 |
| 3 | 33.4 |
| 4 | 36.2 |


### We are trying to predict house prices from a range of housing attributes (e.g. number of rooms, crime rate)
#### Question 1: What type of Machine Learning Problem is this?

#### Question 2: What other regressors besides trees could we use for this problem?

### Split the data into training and test sets, with half of the data in the test set

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=2)

#### Question 3: How many houses are in the training and test sets? And how many features are there? Print them to the screen.
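One way to approach this is to inspect `.shape` on the split dataframes. The sketch below uses a synthetic stand-in table with the same dimensions as the Boston data (506 rows, 13 features); the names `X_demo`, `y_demo`, `X_tr` and so on are made up for illustration and are not the notebook's variables.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the Boston data (assumption: 506 x 13).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(506, 13)))
y_demo = pd.DataFrame(rng.normal(size=506), columns=["MEDV"])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=2
)

# .shape is (n_rows, n_columns): rows are houses, columns are features.
print("train:", X_tr.shape, "test:", X_te.shape)
print("number of features:", X_tr.shape[1])
```

The same `.shape` calls should work unchanged on `X_train` and `X_test`.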

#### Question 4: Using DecisionTreeRegressor, fit a decision tree to the training data with the maximum tree depth set to 3. Make sure to set a random seed.
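As a hint, here is a minimal sketch of fitting a depth-limited regression tree. It runs on synthetic data from `make_regression` rather than the housing data, and `tree_demo` is a hypothetical name; `max_depth` caps the depth and `random_state` fixes the seed.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the housing features (assumption).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=13, noise=10.0, random_state=0
)

# max_depth=3 allows at most 3 levels of splits, i.e. at most 2**3 = 8 leaves.
tree_demo = DecisionTreeRegressor(max_depth=3, random_state=0)
tree_demo.fit(X_demo, y_demo)

print("depth:", tree_demo.get_depth(), "leaves:", tree_demo.get_n_leaves())
```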

#### Question 5: Visualize the Tree that has been fit to the training data using the plot_tree function
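A possible starting point, sketched on synthetic data: `plot_tree` draws a fitted tree onto a matplotlib axes. The feature names `f0` to `f12` and the variable names are placeholders, not the notebook's own.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor, plot_tree

# Synthetic data and placeholder feature names (assumptions for illustration).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=13, noise=10.0, random_state=0
)
names = [f"f{i}" for i in range(13)]

tree_demo = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)

# A wide figure keeps the node boxes readable; filled=True shades nodes by value.
fig, ax = plt.subplots(figsize=(16, 6))
annotations = plot_tree(tree_demo, feature_names=names, filled=True, ax=ax)
# In a notebook the figure renders at the end of the cell.
```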

#### Question 6: Use the trained model to predict prices for the houses in the test set. Plot y-predicted (x-axis) vs y-actual on a scatter plot. Add a line to represent where perfect predictions would be.
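The plot layout can be sketched as follows. The values here are made-up numbers, not model output: `y_actual` and `y_pred` are hypothetical arrays, and the dashed diagonal `y = x` marks where a perfect prediction would fall.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up actual values and noisy "predictions" (assumption, for layout only).
rng = np.random.default_rng(0)
y_actual = rng.uniform(5, 50, size=100)
y_pred = y_actual + rng.normal(scale=4.0, size=100)

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(y_pred, y_actual, alpha=0.6)

# Perfect predictions satisfy actual == predicted, so draw the line y = x
# across the full range of both axes.
lo = min(y_pred.min(), y_actual.min())
hi = max(y_pred.max(), y_actual.max())
ax.plot([lo, hi], [lo, hi], "k--", label="perfect prediction (y = x)")

ax.set_xlabel("predicted")
ax.set_ylabel("actual")
ax.legend()
# In a notebook the figure renders at the end of the cell.
```

Points far from the diagonal are the houses the model prices worst.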

#### Question 7: Use Mean Squared Error (MSE) to calculate the test error
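For reference, MSE is the average of the squared differences between actual and predicted values. The small sketch below uses three hand-picked numbers (not model output) and compares a manual calculation with `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hand-picked values for illustration (assumption, not real predictions).
y_true = np.array([24.0, 21.6, 34.7])
y_hat = np.array([25.0, 19.6, 34.7])

# Errors are -1, 2 and 0; squared they are 1, 4 and 0; the mean is 5/3.
mse_manual = np.mean((y_true - y_hat) ** 2)
mse_sklearn = mean_squared_error(y_true, y_hat)

print(mse_manual, mse_sklearn)
```

Because the errors are squared, MSE is in squared target units and penalizes large misses more heavily than small ones.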

# Ensemble Methods: Bagging and Random Forests

### Use RandomForestRegressor to apply bagging. Remember that the only difference between random forests and bagging is 'max_features', the number of features considered at each split. Bagging considers all features (here, all 13).

In [13]:
# Bagging: using all features
regr1 = RandomForestRegressor(max_features=13, random_state=0)
regr1.fit(X_train, y_train.values.ravel())

RandomForestRegressor(max_features=13, random_state=0)
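To illustrate the point about `max_features`, the sketch below (synthetic data, hypothetical variable names) fits one forest with `max_features=13` and one with `max_features=None`, which scikit-learn interprets as "use all features". With the same seed the two should give the same predictions, while a forest restricted to 6 features per split generally will not.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in with 13 features (assumption).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=13, noise=10.0, random_state=0
)

# Bagging written two ways: all 13 features are candidates at every split.
bag_13 = RandomForestRegressor(n_estimators=25, max_features=13, random_state=0)
bag_all = RandomForestRegressor(n_estimators=25, max_features=None, random_state=0)
# Random forest: only 6 randomly chosen features are candidates at each split.
forest_6 = RandomForestRegressor(n_estimators=25, max_features=6, random_state=0)

pred_13 = bag_13.fit(X_demo, y_demo).predict(X_demo)
pred_all = bag_all.fit(X_demo, y_demo).predict(X_demo)
pred_6 = forest_6.fit(X_demo, y_demo).predict(X_demo)

print("13 vs None identical:", np.allclose(pred_13, pred_all))
print("13 vs 6 identical:", np.allclose(pred_13, pred_6))
```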

#### Question 8: Like before, plot y-predicted (x-axis) vs y-actual on a scatter plot. Add a line to represent where perfect predictions would be.

#### Question 9: Use Mean Squared Error (MSE) to calculate the test error

### Use RandomForestRegressor to apply a random forest. A random forest considers only a random subset of the features at each split.

In [14]:
# Random forest: using 6 randomly selected features at each split
regr2 = RandomForestRegressor(max_features=6, random_state=0)
regr2.fit(X_train, y_train.values.ravel())

RandomForestRegressor(max_features=6, random_state=0)

#### Question 10: Use Mean Squared Error (MSE) to calculate the test error

#### Question 11: Plot a feature importance figure to understand which of the 13 features can best explain the response variable. Do the most important features make intuitive sense to you?
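One possible shape for such a figure, sketched on synthetic data with placeholder feature names `f0` to `f12`: a fitted forest exposes `feature_importances_`, which can be wrapped in a pandas Series, sorted, and drawn as a horizontal bar chart.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data where only 4 of the 13 features carry signal (assumption).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=13, n_informative=4, noise=5.0, random_state=0
)
names = [f"f{i}" for i in range(13)]

rf_demo = RandomForestRegressor(n_estimators=50, max_features=6, random_state=0)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances are normalized to sum to 1 across features.
importances = pd.Series(rf_demo.feature_importances_, index=names).sort_values()

ax = importances.plot.barh(figsize=(6, 5))
ax.set_xlabel("importance")
# In a notebook the figure renders at the end of the cell.
```

On the real data, passing `X_train.columns` as the index would label the bars with the actual feature names.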

#### Bonus Question #1: Use GridSearchCV to tune two hyper-parameters to improve the model further
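A minimal sketch of the mechanics, on synthetic data. The two hyper-parameters (`max_features` and `max_depth`) and the grid values are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption).
X_demo, y_demo = make_regression(
    n_samples=150, n_features=13, noise=10.0, random_state=0
)

# Illustrative grid: 3 x 2 = 6 combinations, each scored with 3-fold CV.
param_grid = {"max_features": [4, 6, 13], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes, so MSE is negated
    cv=3,
)
search.fit(X_demo, y_demo)

print("best params:", search.best_params_)
print("best CV MSE:", -search.best_score_)
```

After fitting, `search.best_estimator_` is the model refit on all the data passed to `fit` with the best parameter combination, so it can be used directly to predict on a held-out test set.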

#### Bonus Question #2: If this MSE is not lower than the previous ones, what does this suggest about the previous models?