# Lab assignment: estimating housing values with regression trees

<img src="img/boston.jpg" style="width:640px;">

In this assignment we will learn how to use decision trees for regression problems. In particular, we will use the implementation in scikit-learn to build an interpretable model of house prices in Boston.

## Guidelines

Throughout this notebook you will find empty cells that you will need to fill with your own code. Follow the instructions in the notebook and pay special attention to the following symbols.

<table>
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td style="text-align:left">You will need to solve a question by writing your own code or answer in the cell immediately below, or in a different file as instructed. Both correctness of the solution and code quality will be taken into account for marking.</td></tr>
 <tr><td width="80"><img src="img/exclamation.png" style="width:auto;height:auto"></td><td style="text-align:left">This is a hint or useful observation that can help you solve this assignment. You are not expected to write any solution, but you should pay attention to them to understand the assignment.</td></tr>
 <tr><td width="80"><img src="img/pro.png" style="width:auto;height:auto"></td><td style="text-align:left">This is an advanced and voluntary exercise that can help you gain deeper knowledge of the topic. This exercise won't be taken into account towards marking, but you are encouraged to undertake it. Good luck!</td></tr>
</table>

To avoid missing packages and compatibility issues you should run this notebook under one of the [recommended Ensembles environment files](https://github.com/albarji/teaching-environments-ensembles).

Lastly, if you need help with the usage of a Python function, place the text cursor over its name and press Shift+Tab to open a pop-up with the related documentation. This only works inside code cells.

Let's go!

## Preliminaries

First of all, let's fix a random seed so all results are reproducible in different runs of the notebook.

In [None]:
import numpy as np
np.random.seed(12345)

The following code will embed any plots into the notebook instead of generating a new window:

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

## Data preparation

For this assignment we will use the **Boston Housing** dataset. It concerns estimating the value of housing in different areas in Boston, using some geographical and societal variables as explanatory features.

### Data loading

<table>
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td style="text-align:left">
 The data is contained in the file <b>boston.csv</b>. Take a look at the file format and the <b>README.md</b> description, and write the appropriate code to load all of it into a Pandas DataFrame named <b>data</b>.
 </td></tr>
</table>

<table>
 <tr><td width="80"><img src="img/exclamation.png" style="width:auto;height:auto"></td><td style="text-align:left">
This data file is formatted as fixed-width lines. The pandas <a href=https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_fwf.html>read_fwf</a> function can help you. Note that the data file does not include column names, but they are documented in the <b>README.md</b>; you can provide them to read_fwf through the parameter <i>names</i>.
 </td></tr>
</table>
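As a minimal sketch of how `read_fwf` handles a headerless fixed-width file, the example below parses a small in-memory text instead of the real data file. The text content and the column names `RATE` and `VALUE` are made up for illustration and are not taken from **boston.csv** or its README.

```python
import io

import pandas as pd

# Hypothetical two-column fixed-width text standing in for a real data file
raw = io.StringIO(
    " 0.10  24.0\n"
    " 0.25  21.6\n"
    " 1.50  34.7\n"
)

# header=None: the text has no header row; names= supplies the column labels
toy = pd.read_fwf(raw, header=None, names=["RATE", "VALUE"])
print(toy)
```

Column boundaries are inferred from the whitespace layout of the first rows, which is the default behaviour of `read_fwf`.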

In [None]:
####### INSERT YOUR CODE HERE

If you have loaded the data properly, the following cell should output a table with the first 5 rows of the data:

In [None]:
data.head()

### Extracting the target

Let's now extract the column that represents the quantity we want to predict (*MEDV*).

<table>
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td style="text-align:left">
 Create two DataFrames, <b>X</b> and <b>Y</b>. The X DataFrame should contain only the explanatory variables, while the Y DataFrame must contain only the target variable <b>MEDV</b>.
 </td></tr>
</table>
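One possible way to separate explanatory columns from a target column is sketched below on a made-up DataFrame. The column names `ROOMS`, `AGE` and `PRICE` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical toy data; PRICE plays the role of the target column
toy = pd.DataFrame(
    {"ROOMS": [5, 6, 7], "AGE": [40, 12, 3], "PRICE": [20.1, 25.3, 33.0]}
)

features = toy.drop(columns=["PRICE"])  # every column except the target
target = toy[["PRICE"]]                 # double brackets keep a DataFrame, not a Series

print(features.columns.tolist())
print(type(target).__name__)
```

Selecting with a list of column names (`toy[["PRICE"]]`) returns a one-column DataFrame, whereas `toy["PRICE"]` would return a Series.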

In [None]:
####### INSERT YOUR CODE HERE

Let's check we have done this correctly:

In [None]:
X.head()

In [None]:
Y.head()

### Splitting the data

In what follows we will work with a training/test split of the data. Let's create that split now.

<table>
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td style="text-align:left">
 Create DataFrames <b>Xtrain</b>, <b>Ytrain</b>, <b>Xtest</b> and <b>Ytest</b>, containing approximately two thirds and one third of the available data, respectively. Make sure the corresponding X and Y DataFrames are aligned, that is, that each row of an X and its respective Y correspond to the same original data pattern.
 </td></tr>
</table>

<table>
 <tr><td width="80"><img src="img/exclamation.png" style="width:auto;height:auto"></td><td style="text-align:left">
 You can use scikit-learn's <b>train_test_split</b> function to do this easily. Note that stratified splitting is not meaningful for a continuous regression target, so there is no need to provide the <i>stratify</i> parameter.
 </td></tr>
</table>
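The sketch below shows the general shape of a `train_test_split` call on made-up data. The variable names and the `random_state` value are arbitrary choices for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 9 rows, so one third is exactly 3 rows
features = pd.DataFrame({"a": np.arange(9), "b": np.arange(9) * 10})
target = pd.DataFrame({"y": np.arange(9) * 100})

# Passing both frames in one call shuffles them with the same row permutation
f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=1 / 3, random_state=0
)

print(len(f_train), len(f_test))
print((f_train.index == t_train.index).all())
```

Because both frames are split in a single call, the row indices of each feature subset match those of the corresponding target subset.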

In [None]:
####### INSERT YOUR CODE HERE

Let's check that:

In [None]:
Xtrain.head()

In [None]:
Ytrain.head()

In [None]:
Xtest.head()

In [None]:
Ytest.head()

## How to create a regression tree

In scikit-learn, the machine learning library for Python, a Decision Tree for regression can be created through the DecisionTreeRegressor class:

In [None]:
from sklearn.tree import DecisionTreeRegressor

For now let's create a simple Decision Tree letting scikit-learn choose all the model parameters:

In [None]:
decisiontree = DecisionTreeRegressor()

A regression tree is used in exactly the same way as a classification tree. For instance, the following code trains the tree on some toy data and plots it.

In [None]:
import pandas as pd

Xtoy = pd.DataFrame(data=[[151, 2], [219, 3], [156, 1], [912, 7]], columns=["Gallons of spilled grog", "People using R"])
ytoy = pd.DataFrame(data=[0, 1, 1, 5], columns=["Piracy crimes"])

decisiontree.fit(Xtoy, ytoy)

from sklearn.tree import plot_tree

plt.figure(figsize=(10,5))
# Pass a plain list: some scikit-learn versions reject a pandas Index here
plot_tree(decisiontree, feature_names=list(Xtoy.columns), filled=True)
plt.show()

Note that a regression tree uses the Mean Squared Error (MSE) as its impurity function, instead of the Gini index typically used for classification trees.
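To make the impurity value concrete, the sketch below computes the MSE of a node by hand and compares it with the impurity scikit-learn stores for the root of a fitted tree. It reuses the toy values from the example above as plain NumPy arrays; the variable names are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[151, 2], [219, 3], [156, 1], [912, 7]])
y = np.array([0.0, 1.0, 1.0, 5.0])

# A node predicts the mean target of the samples it contains
node_prediction = y.mean()

# Node impurity: mean squared deviation from that prediction
node_mse = np.mean((y - node_prediction) ** 2)

# The root node (index 0) holds all training samples
tree = DecisionTreeRegressor(random_state=0).fit(X, y)
root_impurity = tree.tree_.impurity[0]

print(node_prediction, node_mse, root_impurity)
```

For these four targets the mean is 1.75 and the MSE is 3.6875, which should match the value shown in the root box of the plotted tree.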

## Analyzing house prices

We will now use a Regression Tree to build an estimator of house prices in Boston areas. We will also learn which factors influence prices the most.

<table>
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td style="text-align:left">
 Create a new Regression Tree and fit it on the training data. Adjust the tree depth parameter to obtain an interpretable visualization of the tree. Which variables are most relevant in predicting the price? What R<sup>2</sup> score can you achieve on the test set? (This is the score returned by the tree's <b>score</b> method.)
 </td></tr>
</table>
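As a rough sketch of the mechanics involved, the example below limits the depth of a tree, scores it on held-out data and inspects feature importances. It runs on synthetic data in which only the column `x0` affects the target; the data, the column names and `max_depth=3` are assumptions for illustration and say nothing about the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustration only): just x0 drives the target
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.uniform(size=(300, 3)), columns=["x0", "x1", "x2"])
target = 10 * features["x0"] + rng.normal(scale=0.5, size=300)

f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=1 / 3, random_state=0
)

# max_depth caps the number of levels, keeping the plotted tree readable
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=0)
shallow_tree.fit(f_train, t_train)

# score() returns the coefficient of determination R^2 on the given data
r2_test = shallow_tree.score(f_test, t_test)

# Importances are normalized to sum to 1; larger means more impurity reduction
importances = pd.Series(shallow_tree.feature_importances_, index=features.columns)
print(r2_test)
print(importances.sort_values(ascending=False))
```

The `feature_importances_` attribute complements the plot: variables used in splits near the root, where most of the impurity reduction happens, tend to receive the largest values.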

In [None]:
####### INSERT YOUR CODE HERE

### Optimizing the model

<table>
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td style="text-align:left">
 Run a GridSearchCV cross-validation to find the best tree parameters over the training data. Consider both pre-pruning and post-pruning techniques, as well as different impurity criteria. What R<sup>2</sup> score can you achieve? (This is the score returned by the tree's <b>score</b> method.) Which variables are most relevant in the resulting tree?
 </td></tr>
</table>

In [None]:
####### INSERT YOUR CODE HERE