# KIOS Graduate Summer School 2018 on Intelligent Systems and Control

# Regression exercise

Welcome to this practical session!

In this exercise you will be asked to predict the energy efficiency of buildings from various building parameters, such as surface area and overall height. You will become familiar with **pandas** (a powerful tool for data analysis) and **scikit-learn** (a popular machine learning library).

Let's get started!

## Problem description

You will be asked to predict the heating load and cooling load of buildings given various building parameters. Specifically, the following information will be given for each building.

    X1 Relative Compactness
    X2 Surface Area
    X3 Wall Area
    X4 Roof Area
    X5 Overall Height
    X6 Orientation
    X7 Glazing Area
    X8 Glazing Area Distribution

    Y1 Heating Load
    Y2 Cooling Load

## Outline

![outline](images/outline.png)

## 1. Import libraries

Run the following to import all the necessary libraries we will be using.

In [None]:
# Hide warnings
import warnings
warnings.filterwarnings("ignore")

# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn import metrics
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# For visualisations
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import pydot

Fix the random seed. Do not change it, so that your results remain reproducible!

In [None]:
seed = 0
np.random.seed(seed)

## 2. Dataset exploration

Below we provide the path to the training set.

In [None]:
dir_training = 'datasets/energy_efficiency_training.csv'

First let's load the dataset into a pandas dataframe. [(Hint)](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)

In [None]:
df_training = #TODO

It is generally good practice to randomly shuffle the dataset so that the training and validation sets are representative of the overall distribution of the data. ([Hint](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html); use the random seed defined earlier).

In [None]:
df_training = #TODO
df_training.reset_index(drop=True, inplace=True) # re-order the indices

Let us now find the dimensionality (shape) of the dataset.

In [None]:
training_shape = #TODO
print('Shape of the training set: ', training_shape)

Print the first three lines of the training set to display a small sample. [(Hint)](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html)

In [None]:
#TODO

Generate descriptive statistics that summarize the central tendency, dispersion and shape of the dataset’s distribution. ([Hint](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html))

In [None]:
#TODO

Let's explore the dataset in more detail. Print a concise summary of the training set that includes, among other things, the type of each column and the number of non-null values. [(Hint)](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html)

In [None]:
#TODO

Obtain the column names of the dataset.

In [None]:
columns = #TODO
print('Columns:\n', columns)

Run the following to obtain the feature names.

In [None]:
targets_all = ['Y1', 'Y2']
features = [f for f in columns if f not in targets_all]
print('Features: ', features)

Let's currently focus on a single target.

In [None]:
target = 'Y1'

Split the feature columns from the target column.

In [None]:
df_training_X = #TODO
df_training_y = #TODO

print('Shape of training set (features): ', df_training_X.shape)
print('Shape of training set (target): ', df_training_y.shape)

## 3. Data pre-processing

Data pre-processing refers to a sequence of transformations applied to data before feeding them to a machine learning algorithm, for example:
* dealing with missing values
* dealing with outliers
* feature scaling
* converting categorical features to dummy variables (one hot encoding)
* transforming skewed data distributions
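
As a brief illustration of one item in the list above, one-hot encoding can be sketched with `pandas.get_dummies`. The toy dataframe and its column names (`height`, `orientation`) are hypothetical and are not taken from the exercise dataset.

```python
import pandas as pd

# Toy data: 'orientation' is a hypothetical categorical column
toy = pd.DataFrame({
    "height": [3.5, 7.0, 3.5, 7.0],
    "orientation": ["north", "east", "south", "north"],
})

# Replace the categorical column with one indicator column per category
toy_encoded = pd.get_dummies(toy, columns=["orientation"])

print(toy_encoded.columns.tolist())
```

Each category becomes its own indicator column, so a model does not mistake arbitrary category codes for an ordered quantity.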

This practical exercise will focus on one such transformation called *feature scaling*, which causes the features to have roughly the same magnitude. Without this step, features with larger numeric ranges may dominate the others and have a disproportionate influence on the model. Feature scaling is particularly useful for methods that consider a distance-related metric (e.g. k-NN) or use gradient descent (e.g. neural networks).

One way to perform feature scaling is *feature standardisation*, where a continuous feature $X$ with mean $\mu$ and standard deviation $\sigma$ is transformed as follows: $X \leftarrow \frac{X - \mu}{\sigma}$
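
To make the formula concrete, here is a minimal sketch on a toy NumPy array (the array `X` and the names `mu`, `sigma` and `X_std` are illustrative only). It uses the population standard deviation (`ddof=0`), whereas pandas' `std` defaults to the sample standard deviation (`ddof=1`), so numbers obtained with pandas will differ slightly.

```python
import numpy as np

# Toy feature matrix: 4 samples, 2 features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

mu = X.mean(axis=0)     # per-feature mean
sigma = X.std(axis=0)   # per-feature standard deviation (population, ddof=0)

# Standardise every feature: subtract its mean, divide by its standard deviation
X_std = (X - mu) / sigma

print(X_std.mean(axis=0))  # approximately 0 for every feature
print(X_std.std(axis=0))   # approximately 1 for every feature
```

After standardisation the two columns are identical, even though their original scales differed by a factor of 100.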

Let's start by calculating the means ([Hint](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html)) and standard deviations ([Hint](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.std.html)) of the *df_training_X* dataset.

In [None]:
means = #TODO
stds = #TODO

Write a function that takes as inputs a dataframe, its feature means and standard deviations, and performs feature scaling.

In [None]:
def standardise_features(df, means, stds):
    df_std = #TODO
    
    return df_std

Use the function you have just written to standardise the training set.

In [None]:
df_training_X_std = #TODO

Run the following to observe a sample of the standardised dataset.

In [None]:
df_training_X_std.head(3)

## 4. Cross-validation

*Cross-validation* helps you identify the best model for the problem. In this exercise we will use *holdout* cross-validation. In another exercise you will learn about *k-fold* cross-validation.
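
The idea of a holdout split can be sketched on a toy array with scikit-learn's `train_test_split`. The array `data` is made up for illustration, and shuffling is disabled so that the held-out samples are simply the last ones.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(8)  # toy dataset with 8 samples

# Hold out the last 25% of the samples; shuffle=False keeps the original order
train, valid = train_test_split(data, test_size=0.25, shuffle=False)

print(train)  # the first 6 samples
print(valid)  # the last 2 samples
```

The model is fitted on `train` only, and `valid` is used to compare candidate models.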

Let's start by creating the validation set (i.e. the holdout set). You are given the ratio of the training set that will form the validation set.

In [None]:
valid_ratio = 0.25

Create the *train* and *validation* sets from the *training* set. Do the same for the *standardised training* set. ([Hint](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html); disable data shuffling)

In [None]:
df_train_X, df_valid_X = #TODO
df_train_X_std, df_valid_X_std = #TODO
df_train_y, df_valid_y = #TODO

Re-order the dataframes' indices.

In [None]:
splits = [df_train_X, df_valid_X, df_train_y, df_valid_y, df_train_X_std, df_valid_X_std]
for d in splits:
    d.reset_index(drop=True, inplace=True)

Run the following to display the dimensionalities of the created dataframes.

In [None]:
print('Shape of:')
print('df_train_X: ', df_train_X.shape)
print('df_train_y: ', df_train_y.shape)
print('df_valid_X: ', df_valid_X.shape)
print('df_valid_y: ', df_valid_y.shape)

## 5. Importance of feature scaling

The following function takes as inputs a machine learning model, training set and test set. Complete the function that trains ('fits') the model on the training set, makes predictions on both the training and test sets, and returns the performance. In this exercise our performance metric is the [mean squared error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html). (Hint: you may find the examples from this [tutorial](http://scikit-learn.org/stable/tutorial/statistical_inference/supervised_learning.html) useful)

In [None]:
def train_model(model, df_train_X, df_train_y, df_test_X, df_test_y):
    # Fit model on training set
    #TODO
    
    # Predictions and performance (MSE) on training set
    train_y_pred = #TODO
    mse_train = #TODO
    
    # Predictions and performance (MSE) on test set
    test_y_pred = #TODO
    mse_test = #TODO
    
    return mse_train, mse_test
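
For reference, the mean squared error of predictions $\hat{y}_i$ against targets $y_i$ is $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. The sketch below checks this on made-up numbers (`y_true` and `y_pred` are illustrative) against `metrics.mean_squared_error`.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])

# Mean of the squared differences: (1 + 0 + 4) / 3
mse_manual = np.mean((y_true - y_pred) ** 2)

# Same quantity computed by scikit-learn
mse_sklearn = metrics.mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)
```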

Define a neural network ([MLPRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor)). Set the maximum number of iterations to 1000 and don't forget to fix the random seed. Leave the rest of the parameters at their default values.

In [None]:
nn = MLPRegressor(max_iter=1000, random_state = seed)

Use the *train_model* function you wrote earlier to train the neural network you have just defined, and obtain the performance on the original validation set and the standardised one.

In [None]:
# without feature scaling
_, nn_mse_valid = #TODO

# with feature scaling
_, nn_mse_valid_std = #TODO

Run the following to see how important feature scaling is!

In [None]:
print('Mean squared error on validation set : ', nn_mse_valid)
print('Mean squared error on standardised validation set: ', nn_mse_valid_std)

## 6. Model selection

We will use the validation set to identify the best machine learning model. You will get the chance to try out both linear and non-linear models: specifically, linear regression, a decision tree and a neural network!

### Linear Regression

Let's start by defining a linear regression model. ([LinearRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html))

In [None]:
lr = #TODO

Obtain its performance on the validation set. Take a moment to think whether you need to use the original or standardised training / validation sets.

In [None]:
_, mse_valid_lr = train_model(lr, df_train_X, df_train_y, df_valid_X, df_valid_y)
print('Mean squared error on validation set using linear regression: ', '%.2f' % mse_valid_lr)

### Decision Tree

Define a decision tree. ([DecisionTreeRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor); don't forget to fix the random seed, and leave the rest of the parameters at their default values)

In [None]:
dt = #TODO

Obtain its performance on the validation set. Take a moment to think whether you need to use the original or standardised training / validation sets.

In [None]:
mse_train_dt, mse_valid_dt = train_model(dt, df_train_X, df_train_y, df_valid_X, df_valid_y)
print('Mean squared error on validation set using decision tree: ', '%.2f' % mse_valid_dt)

This is a considerable improvement over linear regression! A likely explanation is that a decision tree can capture non-linear relationships between the features and the target, which a linear model cannot.
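
The difference between a linear and a non-linear model can be sketched on synthetic data that is unrelated to the energy dataset. Here the target is assumed to be $y = x^2$, and the names `lr_toy` and `dt_toy` are illustrative. Both errors are measured on the training data, so this only illustrates expressiveness, not generalisation.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear relationship: y = x^2 on a symmetric interval
x = np.linspace(-3, 3, 61).reshape(-1, 1)
y = (x ** 2).ravel()

lr_toy = LinearRegression().fit(x, y)
dt_toy = DecisionTreeRegressor(random_state=0).fit(x, y)

# Training error of each model
mse_lr_toy = metrics.mean_squared_error(y, lr_toy.predict(x))
mse_dt_toy = metrics.mean_squared_error(y, dt_toy.predict(x))

print('Linear regression MSE: %.2f' % mse_lr_toy)
print('Decision tree MSE:     %.2f' % mse_dt_toy)
```

A straight line cannot follow the parabola, while the piecewise-constant tree can approximate it closely.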

### Neural Network

You will now examine various neural network models! A neural network has many hyper-parameters and cross-validation will help us tune these and select the best.

A neural network is very sensitive to these hyper-parameters. To demonstrate this, you will try out several neural networks. We have provided the model *nn1* below; define your own models *nn2*, *nn3* and *nn4* using [MLPRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor).

In [None]:
nn1 = MLPRegressor(
        max_iter=1000,
        hidden_layer_sizes = (100,),
        activation = 'logistic',
        solver = 'sgd',
        alpha = 0.0001,
        batch_size = 'auto',
        learning_rate_init = 0.001,
        learning_rate = 'constant',
        random_state = seed
    )

nn2 = MLPRegressor(
        max_iter=#TODO,
        hidden_layer_sizes = #TODO,
        activation = #TODO,
        solver = #TODO,
        alpha = #TODO,
        batch_size = #TODO,
        learning_rate_init = #TODO,
        learning_rate = #TODO,
        random_state = seed
    )

nn3 = MLPRegressor(
        max_iter=#TODO,
        hidden_layer_sizes = #TODO,
        activation = #TODO,
        solver = #TODO,
        alpha = #TODO,
        batch_size = #TODO,
        learning_rate_init = #TODO,
        learning_rate = #TODO,
        random_state = seed
    )

nn4 = MLPRegressor(
        max_iter=#TODO,
        hidden_layer_sizes = #TODO,
        activation = #TODO,
        solver = #TODO,
        alpha = #TODO,
        batch_size = #TODO,
        learning_rate_init = #TODO,
        learning_rate = #TODO,
        random_state = seed
    )

Below we provide a neural network model after we performed hyper-parameter tuning.

In [None]:
nn5 = MLPRegressor(
        max_iter=1000,
        hidden_layer_sizes = (100, 16, 16),
        activation = 'relu',
        solver = 'adam',
        alpha = 0.001,
        batch_size = 16,
        learning_rate_init = 0.001,
        learning_rate = 'constant',
        random_state = seed
    )

Let's obtain the performance of each model on the validation set.

In [None]:
# all neural network models
nn_models = [nn1, nn2, nn3, nn4, nn5]

# Loop over all models
for i in range(len(nn_models)):
    nn = nn_models[i]
    
    _, mse_valid_nn = train_model(nn, df_train_X_std, df_train_y, df_valid_X_std, df_valid_y)
    print('Mean squared error on validation set using neural network nn%d: %.2f' % (i + 1, mse_valid_nn))

To sum up, a neural network has many hyper-parameters and it's very sensitive to them. For instance, *nn1* has a similar performance to linear regression while the well-tuned *nn5* performs slightly better than the decision tree.
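
The general pattern of selecting a hyper-parameter setting by its validation error can be sketched as follows. This is a simplified illustration on synthetic data, using decision trees of different depths rather than neural networks so that it runs quickly; all names (`candidates`, `val_mse`, `best_depth`) are hypothetical.

```python
import numpy as np
from sklearn import metrics
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: noisy sine wave
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# Holdout split: first 150 samples for training, last 50 for validation
X_tr, X_va, y_tr, y_va = X[:150], X[150:], y[:150], y[150:]

# One candidate model per hyper-parameter value
candidates = {depth: DecisionTreeRegressor(max_depth=depth, random_state=0)
              for depth in (1, 3, 5, 10)}

# Fit each candidate on the training part and score it on the validation part
val_mse = {}
for depth, m in candidates.items():
    m.fit(X_tr, y_tr)
    val_mse[depth] = metrics.mean_squared_error(y_va, m.predict(X_va))

# Keep the setting with the lowest validation error
best_depth = min(val_mse, key=val_mse.get)
print('Best max_depth:', best_depth)
```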

## 7. Learning curves

Assume that we wish to find out if getting more training data would be beneficial. The learning curves will come in handy!
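
As a rough sketch of the idea, a learning curve records the training and validation error while the model is fitted on progressively larger portions of the training data. The example below uses synthetic data and a linear model purely for illustration; `toy_sizes`, `mse_tr_list` and `mse_va_list` are hypothetical names.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression

# Synthetic linear data with noise
rng = np.random.RandomState(0)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=300)

X_tr, y_tr = X[:200], y[:200]   # training pool
X_va, y_va = X[200:], y[200:]   # fixed validation set

toy_sizes = [10, 50, 100, 200]
mse_tr_list, mse_va_list = [], []

for s in toy_sizes:
    # Fit on the first s training samples only
    m = LinearRegression().fit(X_tr[:s], y_tr[:s])
    mse_tr_list.append(metrics.mean_squared_error(y_tr[:s], m.predict(X_tr[:s])))
    mse_va_list.append(metrics.mean_squared_error(y_va, m.predict(X_va)))

for s, tr, va in zip(toy_sizes, mse_tr_list, mse_va_list):
    print('size=%3d  train MSE=%.3f  valid MSE=%.3f' % (s, tr, va))
```

If the validation error is still falling at the largest size, more data is likely to help; if both curves have flattened, it probably will not.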

Use the *train_model* function you defined earlier to calculate the mean squared error of the neural network model *nn5* on the training and validation sets for various given training set sizes.

In [None]:
sizes = list(range(10,df_train_X.shape[0],50)) + [df_train_X.shape[0]]

lst_mse_train_nn = []
lst_mse_valid_nn = []

for s in sizes:
    mse_train_nn, mse_valid_nn = train_model(nn5,
                                             #TODO,
                                             #TODO,
                                             df_valid_X_std,
                                             df_valid_y)
    
    lst_mse_train_nn.append(mse_train_nn)
    lst_mse_valid_nn.append(mse_valid_nn)

Run the following to generate the learning curves.

In [None]:
plt.figure(0)
plt.title('Neural Network')
plt.plot(sizes, lst_mse_train_nn, label='mse train')
plt.plot(sizes, lst_mse_valid_nn, label='mse valid')
leg = plt.legend()

What can you conclude about getting more data?

## 8. Final evaluation

The final evaluation will be conducted on an independent test set completely unseen by the training process.

Run the following to load the test set into a pandas dataframe.

In [None]:
# directory of test set
dir_test = 'datasets/energy_efficiency_test.csv'

# load test set
df_test = pd.read_csv(dir_test, index_col=False)

Split the features and the target.

In [None]:
df_test_X = #TODO
df_test_y = #TODO

Since the performance of the decision tree *dt* and the neural network *nn5* is very close, we will focus on the former because, as we will show shortly, it is more 'interpretable'.

Calculate the predictions and performance (MSE) on the test set for *dt*.

In [None]:
test_y_pred_dt = #TODO
mse_test_dt = #TODO

Run the following to display the MSE on all datasets.

In [None]:
print('Mean squared error on training set using decision tree: ', '%.2f' % mse_train_dt)
print('Mean squared error on validation set using decision tree: ', '%.2f' % mse_valid_dt)
print('Mean squared error on test set using decision tree: ', '%.2f' % mse_test_dt)

Lastly, let's print out a sample of the predictions to see how we did!

In [None]:
print('Sample test set')
print(list(df_test_y)[:10])

print('\n')

print('Predictions')
print(list(test_y_pred_dt)[:10])

## 9. Feature selection

It turns out that some of the features have stronger predictive power than others. Feature selection offers many potential benefits: it can boost regression or classification performance and provide insight into the data by returning only the top predictors. It can also facilitate data visualisation and reduce the storage requirements and runtime of learning algorithms.
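
One possible way to get a rough ranking of features, shown here only as an illustration and not necessarily how the list of 'best' features in this exercise was obtained, is the `feature_importances_` attribute of a fitted decision tree. The synthetic data below is constructed so that only the first column influences the target.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 3))

# Only the first column influences the target; the other two are pure noise
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

toy_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Importances are normalised to sum to 1; higher means more useful for splitting
importances = toy_tree.feature_importances_
print(importances)
```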

We will learn about <span style="color:red">feature selection</span> and <span style="color:red">dimensionality reduction</span> (feature projection) in detail in another practical exercise. For now, we provide a list of the 'best' features.

In [None]:
best_features = ['X1', 'X7']

Run the following to define a new decision tree.

In [None]:
model = DecisionTreeRegressor(random_state=seed)

Fit the model on the selected subset of features and calculate the performance on all datasets.

In [None]:
# Fit the model on train set
#TODO

# performance on train set
train_y_pred = #TODO
mse_train = metrics.mean_squared_error(df_train_y, train_y_pred)

# performance on validation set
valid_y_pred = #TODO
mse_valid = metrics.mean_squared_error(df_valid_y, valid_y_pred)

# performance on test set
test_y_pred = #TODO
mse_test = metrics.mean_squared_error(df_test_y, test_y_pred)

Run the following to display the MSE on all datasets using feature selection.

In [None]:
print('Mean squared error on training set using decision tree: ', '%.2f' % mse_train)
print('Mean squared error on validation set using decision tree: ', '%.2f' % mse_valid)
print('Mean squared error on test set using decision tree: ', '%.2f' % mse_test)

## 10. Visualisation

We have mentioned earlier that a decision tree is highly 'interpretable' compared to other models. Let's visualise it to see why! Check the file *tree.png* that will be generated.

In [None]:
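# Write the fitted tree to tree.dot in Graphviz DOT format
# (class_names is omitted because it only applies to classification trees)
tree.export_graphviz(
    model,
    out_file='tree.dot',
    feature_names=best_features,
    filled=True, rounded=True, special_characters=True)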

(graph,) = pydot.graph_from_dot_file('tree.dot')
graph.write_png('tree.png')
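
If Graphviz or pydot is not available, a plain-text rendering may serve as a quick alternative. The sketch below assumes a scikit-learn version that provides `sklearn.tree.export_text` (0.21 or later) and uses a toy tree with hypothetical feature names.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Toy data: the target depends on 'feature_a' only; 'feature_b' is constant
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
y = np.array([0.0, 0.0, 10.0, 10.0])

toy_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# One line per node: split conditions for internal nodes, predicted values for leaves
report = export_text(toy_tree, feature_names=["feature_a", "feature_b"])
print(report)
```

Each root-to-leaf path reads as a sequence of simple threshold rules, which is what makes a decision tree easy to interpret.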

Congratulations on finishing this practical exercise! :-)

There are still plenty of things you could experiment with, such as:
* Change target to 'Y2' to see the new behaviour and performance of the models (Section 2)
* Try out more machine learning models such as SVMs (Section 6)
* Experiment with other feature subsets (Section 9)