# Week 2: Introduction to regression problems 

----------------------------------------------------
Machine Learning     Year 2019/2020

*Vanessa Gómez Verdejo vanessa@tsc.uc3m.es* and *Pablo M. Olmos olmos@tsc.uc3m.es*

----------------------------------------------------

To begin understanding what a machine learning method or algorithm is, let's start working with a real problem: house sale-value prediction. For this purpose, we will start by loading the **Boston housing dataset** and analyzing what it is made of.

In [0]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


### Boston housing database

The goal of this problem or dataset is to predict the value of houses in the suburbs of Boston using different features. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970.

We can load this dataset from the sklearn repository as follows:


In [0]:
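# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires an older scikit-learn version.
from sklearn.datasets import load_boston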
boston = load_boston()


Variable 'boston' is a dictionary-like object from which we can extract the data, the labels or targets, and some information about the meaning of the features.

In [0]:
X = boston.data
Y = boston.target
feature_names = boston.feature_names

Now, for instance, from the input data or observations, we can check their dimensions...

In [0]:
print(X.shape)

Each record (a single datum) is usually placed in a row, whereas columns are used for features or variables. In this dataset, we have 506 records or observations, and each one has 13 features.

Here, each record in the database describes a Boston suburb (or town) and is characterized by a set of features. If we want to know what they represent, we can check the variable 'feature_names' together with the sklearn [User Guide](https://scikit-learn.org/stable/datasets/index.html#boston-dataset).


In [0]:
print(feature_names)

Besides, we can access the target information.

In [0]:
print(Y.shape)
print(Y[:10])

It's a one-dimensional numpy array with as many elements as records in X. Each element of Y is associated with a row of X.

Finally, we can try to analyze the relationship between each input feature and the target variable by plotting Y vs. each feature....

In [0]:
import matplotlib.pyplot as plt
#%matplotlib inline

D = X.shape[1]

plt.figure( figsize=(20,5))
for d in range (D):
  plt.subplot(2,7,d+1)
  plt.plot(X[:,d],Y,'.',)
  plt.title(feature_names[d])


Which variables are most useful to predict the house sale price?



# Machine Learning Pipeline

When we have to design a regressor (or classifier) to solve a data analysis problem, we usually follow these steps:

<img src="http://www.tsc.uc3m.es/~vanessa/Figs_notebooks/ML/Regression/MLPipeline.png" width="100%" > 


In the **preprocessing** step we can find several tasks:
* Imputation of missing values
* Dimensionality reduction: feature extraction or feature selection
* Data normalization

Throughout this course, we will review all these techniques. But for now, let's start with data normalization.

# Preparing the dataset

Following the ML pipeline, before designing any ML regressor to predict the house sale price, we need to:
1. Generate the train and test partitions.  
2. Normalize the data variables (preprocessing).
3. Apply any additional preprocessing (in this case, feature selection).


## Generating training/test partitions
This task consists of dividing the overall dataset into two independent subsets:
* The **training subset**: used to train the ML model. We will later see that this subset can be divided into different (validation) subsets if hyperparameters of the ML model have to be adjusted.
* The **test subset**: exclusively used for performance evaluation.

This division usually depends on the amount of available data, but 60/40\%, 70/30\% or 80/20\% train/test divisions are quite common.
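
As a rough illustration of what such a partition does, the sketch below shuffles the row indices and cuts them at a chosen fraction. The toy arrays `X_toy` and `Y_toy`, the helper name `split_train_test` and the seed are assumptions made for this example only; they are not part of the notebook's data.

```python
import numpy as np

def split_train_test(X, Y, test_fraction=0.2, seed=0):
    """Shuffle the rows and split observations and targets consistently."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[0])               # random ordering of the row indices
    n_test = int(round(test_fraction * X.shape[0]))  # number of records kept for testing
    test_idx, train_idx = perm[:n_test], perm[n_test:]
    return X[train_idx], X[test_idx], Y[train_idx], Y[test_idx]

# Toy data: 10 records with 2 features; row i is [2i, 2i+1] and its target is i
X_toy = np.arange(20, dtype=float).reshape(10, 2)
Y_toy = np.arange(10, dtype=float)

X_tr, X_te, Y_tr, Y_te = split_train_test(X_toy, Y_toy, test_fraction=0.2)
print(X_tr.shape, X_te.shape)   # (8, 2) (2, 2)
```

Because the same index arrays are applied to both `X` and `Y`, each observation stays paired with its own target after the split.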

For this task, we can use some preprocessing tools from sklearn; for instance, we can use the function [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).



In [0]:
from sklearn.model_selection import train_test_split

# Parameter test_size lets you select the fraction of data in the test partition (in this case 40%)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state=0)

# Note that we have to divide both observations and labels!!!

## Normalizing our input data

The next step is data normalization. In general, features are defined in different ranges; for instance, $x_1$ may take values in the $[-1,1]$ range and $x_2$ in the $[-10^6,10^6]$ range. To improve the numerical robustness of ML methods, we usually apply a linear normalization preprocessing stage so that the features fed to the ML model have (sample) mean 0 and (sample) variance equal to 1.
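
To see why the ranges matter for a distance-based method, the following sketch compares Euclidean distances before and after rescaling. The three points and the per-feature `scale` vector are made-up values chosen for illustration.

```python
import numpy as np

# Two features on very different scales: x1 in [-1, 1], x2 in [-1e6, 1e6]
a = np.array([-1.0, 2.0e5])
b = np.array([ 1.0, 2.1e5])   # opposite end of the x1 range, similar x2
c = np.array([-1.0, 5.0e5])   # same x1 as a, clearly different x2

# Raw Euclidean distances are driven almost entirely by x2
d_ab_raw = np.linalg.norm(a - b)   # about 1e4
d_ac_raw = np.linalg.norm(a - c)   # about 3e5
print(d_ab_raw < d_ac_raw)         # True: b looks closer to a than c does

# Dividing each feature by an (assumed) typical scale lets both features count
scale = np.array([1.0, 1.0e6])
d_ab_scaled = np.linalg.norm((a - b) / scale)   # about 2.0
d_ac_scaled = np.linalg.norm((a - c) / scale)   # about 0.3
print(d_ab_scaled < d_ac_scaled)                # False: now c is the closer point
```

Which point counts as the nearest neighbour of `a` changes once both features are expressed on comparable scales.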



Let's check the mean and standard deviation of our training data:

In [0]:
# Analyze mean and std of each feature
print('Mean values')
print(np.mean(X_train, axis=0))

print('Std values')
print(np.std(X_train, axis=0))

**Normalization process**

Given the unnormalized feature matrix $\mathbf{X}_{N\times D}$, we compute the sample mean per feature column, $\mu_j$, and the sample variance per feature column, $\sigma^2_j$, for $j=1,\ldots,D$. Then, each row of the normalized feature matrix $\overline{\mathbf{X}}_{N\times D}$ is obtained as follows:

$$\overline{\mathbf{x}}^{(i)}= \left[\frac{x_1^{(i)}-\mu_1}{\sqrt{\sigma^2_1}}, ~~ \frac{x_2^{(i)}-\mu_2}{\sqrt{\sigma^2_2}}, \ldots, \frac{x_D^{(i)}-\mu_D}{\sqrt{\sigma^2_D}}\right]$$

One last thing: in general we will not normalize the output variable $y$. We can do it, but it typically leads to mistakes in the evaluation of the cost function. 

To carry out this process we can also use another of the sklearn preprocessing tools: the object [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler).


In [0]:
# Data normalization

from sklearn.preprocessing import StandardScaler

# Define the object
transformer = StandardScaler()
transformer.fit(X_train)  # fit does not transform the data; it only learns the mean and std of each feature
X_train_norm = transformer.transform(X_train)
X_test_norm =  transformer.transform(X_test)


One important thing: we normalize the test data using the sample means and variances computed from the **training** set. This is important to **make sure that all data points are normalized in the same way**.
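
A tiny numerical illustration of this point, using made-up arrays (`X_tr_toy` and `X_te_toy` are assumptions for this sketch), could look as follows. The statistics are estimated once on the training rows and then reused, unchanged, for the test row.

```python
import numpy as np

X_tr_toy = np.array([[1.0, 100.0],
                     [2.0, 200.0],
                     [3.0, 300.0]])
X_te_toy = np.array([[2.0, 400.0]])

mu = X_tr_toy.mean(axis=0)      # per-feature mean of the TRAINING data
sigma = X_tr_toy.std(axis=0)    # per-feature std of the TRAINING data

X_tr_toy_norm = (X_tr_toy - mu) / sigma
X_te_toy_norm = (X_te_toy - mu) / sigma   # same mu and sigma, not recomputed on the test data

print(X_tr_toy_norm.mean(axis=0))  # approximately [0, 0]
print(X_tr_toy_norm.std(axis=0))   # approximately [1, 1]
print(X_te_toy_norm)               # first feature is 0; the second lies above the training range
```

Note that the normalized test data do not need to have zero mean or unit variance; only the training data do.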

Check the mean and std of the normalized data:

In [0]:
print('Mean values')
print(np.mean(X_train_norm, axis=0))

print('Std values')
print(np.std(X_train_norm, axis=0))

### Exercise 1

Design your own normalization process.

In [0]:
# <SOL>
# </SOL>

## Other preprocessing tools: Feature selection
Another possible preprocessing step is feature selection. This consists in selecting a subset of the original features which seem to be more useful to solve the prediction task. 

There are many sophisticated methods to carry out this task (we will review some of them throughout this course), but for now, simply from the above display (variables vs. target values), let's select the variables:
* Feature 0: CRIM
* Feature 5: RM
* Feature 12: LSTAT 


In [0]:
sel_feat = [0, 5, 12]
X_train_sel = X_train_norm[:,sel_feat]
X_test_sel = X_test_norm[:,sel_feat]

Note that feature selection does not affect the target vectors.

For the sake of simplicity, from now on, let's consider that train and test data are the preprocessed versions.

In [0]:
X_train = X_train_sel
X_test = X_test_sel

# The ML model

Consider we have a **training** database of $N$ entries of the form $(\mathbf{x}^{(i)},y^{(i)})$, where $\mathbf{x}\in\mathbb{R}^{D}$ is an observation  and $y \in\mathbb{R}$  is its target value. Our goal is to design a  function $f(\mathbf{x}^*)$  able to estimate the associated target value for a new point $\mathbf{x}^*$ as follows

\begin{align}
y^* \approx f(\mathbf{x}^*)
\end{align}

Note that in the above equation we are using the $\approx$ symbol because we consider that the estimation is not going to be perfect, i.e., the designed function will have an estimation error. As we will see later, a good design will try to minimize this error.

## My first estimation model:  K-NN (Nearest Neighbours) 

K-NN is a *non-parametric* method, since there are no parameters to be learned. To estimate the output for a new sample $\mathbf{x}^*$, you only have to:
* Select the value of K.
* Search, among the training data, for the K nearest neighbours of $\mathbf{x}^*$.
* Compute the estimated target for $\mathbf{x}^*$ as the average value of the targets associated with the K nearest neighbours of $\mathbf{x}^*$. 

That is,

\begin{align}
f(\mathbf{x}^*) =\frac{1}{K}\sum_{k\in \mathcal{S}_{K}(\mathbf{x}^*)} y^{(k)},
\end{align}
where $\mathcal{S}_{K}(\mathbf{x}^*)$ is the set of the **$K$ training points that are closest to $\mathbf{x}^*$ according to a given distance metric $d(\mathbf{x}^{(i)},\mathbf{x}^*)$**. For instance, in real spaces the most common choice is the **Euclidean distance**:

\begin{align}
d(\mathbf{x}^{(i)},\mathbf{x}^*) = \left|\left|\mathbf{x}^{(i)}-\mathbf{x}^*\right|\right|_2
\end{align}
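
As a small worked example of the averaging rule above, the sketch below predicts the target of a single new point from five one-dimensional training points. The data and the helper name `knn_predict_one` are invented for illustration.

```python
import numpy as np

def knn_predict_one(X_train, Y_train, x_new, K):
    """Average the targets of the K training points closest to x_new."""
    dist = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance to each training point
    nearest = np.argsort(dist)[:K]                   # indices of the K smallest distances
    return Y_train[nearest].mean()

X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
Y_toy = np.array([0.0, 1.0, 4.0, 9.0, 100.0])

x_new = np.array([1.2])
pred_k1 = knn_predict_one(X_toy, Y_toy, x_new, K=1)  # nearest point is x=1, so the output is 1.0
pred_k3 = knn_predict_one(X_toy, Y_toy, x_new, K=3)  # neighbours x=1, 2, 0, so (1 + 4 + 0) / 3
print(pred_k1, pred_k3)
```

There is no training step in this sketch: the "model" is simply the stored training set, which is why the cost is paid at prediction time.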


**Some comments on K-NN**

1. There are many variants of the above expression. For instance, we can **weight the regression values differently according to the distance**, so that the closest points have a bigger influence on the regression value. One common choice uses weights inversely proportional to the distance:

\begin{align}
f(\mathbf{x}^*) =\sum_{k\in \mathcal{S}_{K}(\mathbf{x}^*)}  \frac{d(\mathbf{x}^{(k)},\mathbf{x}^*)^{-1}}{\sum_{q\in \mathcal{S}_{K}(\mathbf{x}^*)} d(\mathbf{x}^{(q)},\mathbf{x}^*)^{-1} } y^{(k)}
\end{align}


2. The K-NN complexity is determined by the [**search of the set of closest neighbors**](https://en.wikipedia.org/wiki/Nearest_neighbor_search). For a brute-force search, this complexity grows as $\mathcal{O}(DN)$ per query point, where $D$ is the dimension and $N$ is the number of training points. Researchers in computer science have developed many lower-complexity algorithms, some of them approximate, to perform this task. The most common ones are [**k-d trees**](https://en.wikipedia.org/wiki/K-d_tree) and [**Locality-Sensitive Hashing**](https://en.wikipedia.org/wiki/Locality-sensitive_hashing).
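
For the distance-weighted variant mentioned in the first comment, one common way to give closer neighbours more influence is to use inverse-distance weights normalized to sum to one. The sketch below assumes that choice; the helper name, the small constant `eps` (added to avoid dividing by zero) and the toy data are illustrative assumptions.

```python
import numpy as np

def knn_predict_weighted(X_train, Y_train, x_new, K, eps=1e-12):
    """K-NN estimate where each neighbour is weighted by the inverse of its distance."""
    dist = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dist)[:K]
    w = 1.0 / (dist[nearest] + eps)   # closer neighbours get larger weights
    w = w / w.sum()                   # normalize so that the weights sum to one
    return np.sum(w * Y_train[nearest])

X_toy = np.array([[0.0], [1.0], [2.0]])
Y_toy = np.array([0.0, 10.0, 20.0])
x_new = np.array([1.25])

# Neighbours for K=2: x=1 (distance 0.25) and x=2 (distance 0.75)
pred_weighted = knn_predict_weighted(X_toy, Y_toy, x_new, K=2)
print(pred_weighted)   # close to 0.75 * 10 + 0.25 * 20 = 12.5; the plain average would be 15
```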


## Model evaluation

In regression problems, it is quite common to use the Mean Square Error (MSE) as the performance measurement of our model. In fact, we can consider the following two metrics to diagnose how well our model is performing:

- **Training MSE**: Considering $f({\bf x}^*)$ is the regressor output for the data ${\bf x}^*$, the MSE over the training data is defined as:
$$MSE_{train} = \frac{1}{N} \sum_{i=1}^{N} \left(y^{(i)}-f({\bf x}^{(i)})\right)^2$$

- **Test MSE**: evaluating the same measurement over the test data, we have that
\begin{align}
MSE_{test} =  \frac{1}{N_{test}}\sum_{i=1}^{N_{test}} \left(y^{(i)}-f({\bf x}^{(i)})\right)^2
\end{align}

Note that we are interested in evaluating how well our model **generalizes to data we have never seen!** This is the real point of Machine Learning! Learn from data to adapt! To emulate this effect, we have a test database **which is never used for model training!** and is only used to evaluate generalization. So, the test error informs about the generalization capability of our regressor.
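
The two formulas differ only in the data they are evaluated on, as the following sketch shows. The function `f` is a hypothetical stand-in for a trained regressor, and all the numbers are made up for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return np.mean((y_true - y_pred) ** 2)

def f(x):
    # Hypothetical regressor, used only as a stand-in for a trained model
    return 2.0 * x

x_train_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_train_toy = np.array([0.0, 2.0, 4.0, 6.0])   # f fits these points exactly
x_test_toy = np.array([0.5, 1.5, 2.5])
y_test_toy = np.array([1.5, 2.0, 6.0])         # unseen points, not fitted exactly

mse_train = mse(y_train_toy, f(x_train_toy))
mse_test = mse(y_test_toy, f(x_test_toy))
print(mse_train)   # 0.0
print(mse_test)    # (0.5**2 + 1.0**2 + 1.0**2) / 3 = 0.75
```

A perfect fit on the training points says nothing by itself about the error on new points, which is what the test MSE measures.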


### Exercise 2 
Implement your own K-NN regressor (with $K=5$) to predict the house sale price. Obtain the predictions over the test data set and evaluate the regression performance by computing the MSE over the test data.

You can use the function [euclidean_distances](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html) to compute the Euclidean distances of each test point to the training data.

#### SOLUTION

In [0]:
# <SOL>
# </SOL>


### Exercise 3
Use the sklearn [K-NN regression](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html) function to train and obtain the test predictions and check that you obtain the same MSE. Of course, set $K=5$.

#### SOLUTION

In [0]:
# <SOL>
# </SOL>

## Analyzing prediction curve

The K-NN model is learning a function to map from an observation to a predicted value (an estimation of the house sale price). If we are working with unidimensional observations, we can plot this prediction function. 

The next cell analyzes the prediction function ${y^*}=f({\bf x}^*)$ when we select a single input variable (feature 12: LSTAT; after the previous feature selection, this is feature #2). 

In [0]:
# Analyze the regression function

# We select a single feature 

Id_feat = 2 #LSTAT

# We train and evaluate the model with the unidimensional observation
# (neigh is assumed to be the K-NN regressor object created in Exercise 3)
neigh.fit(X_train[:,Id_feat][:,np.newaxis], Y_train) 
f_test=neigh.predict(X_test[:,Id_feat][:,np.newaxis])
MSE_KNN = np.mean((Y_test-f_test)**2)
print('MSE using a single input feature:')
print(MSE_KNN)

# Plot the training data
plt.figure()
plt.plot(X_train[:,Id_feat][:,np.newaxis], Y_train, '.')
# Plot the regression function
## Define a range of equispaced values of x
X_plot = np.arange (-2,4,0.01)
## Obtain the output of the prediction function for these values
f_plot=neigh.predict(X_plot[:,np.newaxis])
## Plot them
plt.plot(X_plot,f_plot, 'r')
plt.show()

# Analyzing the influence of parameter K

So far, we have fixed the value of the K parameter ($K=5$) and analyzed the K-NN for this value. But can another K value provide a different performance? 


### Exercise 4

Analyze the performance of the K-NN regressor for a range of values of K from 1 to 40. Analyze this performance over the train and test MSE.

#### SOLUTION

In [0]:
# <SOL>
# </SOL>

How can I select the optimum K value?
Can I use the training MSE? 


# Parameter selection: Cross Validation

In order to fit $K$, we will split the training set once more, to create the **validation set**:

<img src="http://www.tsc.uc3m.es/~vanessa/Figs_notebooks/ML/Regression/validation_set.png" width="70%" > 


The validation set will be used to choose the best $K$ among the same model with different $K$ values, all trained using the training set. This process is known as **cross validation**.

-----

**Cross Validation Steps**:

1) Select a ML model, for instance a K-NN regressor, and construct the normalized feature matrix for the training, validation and test sets. 

2) Set a grid of $K$ values to test (e.g., $1, 2, 3, \ldots, 20$).

3) Train the ML model for each $K$ value **using only the training set**. For each $K$ value, we get a model $f_{K}()$.

4) Compute the validation MSE (or other performance measurement) for each model:

$$MSE_{val}(K) = \frac{1}{N_{val}} \sum_{i=1}^{N_{val}} \left(y^{(i)}-{f_K}({\bf x}^{(i)})\right)^2$$

Recall, both the validation and test sets are normalized using the train set statistics.

5) Select $K^*$ that minimizes the validation error 

6) Retrain the model for $K^*$ using **both the training and validation sets**. So, you get a new model $f_{K^*}()$

7) Compute the final test MSE:

$$MSE_{test}(K^*) = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \left(y^{(i)}-{f_{K^*}}({\bf x}^{(i)})\right)^2$$

--- 
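
Steps 2 to 5 above can be sketched with a single training/validation split as follows. The synthetic one-dimensional data, the 80/40 split sizes and the helper `knn_predict` are assumptions made for this illustration; normalization is skipped because there is only one feature.

```python
import numpy as np

def knn_predict(X_tr, Y_tr, X_new, K):
    """K-NN regression: average the targets of the K closest training points."""
    # Pairwise Euclidean distances, shape (n_new, n_train)
    dist = np.linalg.norm(X_new[:, None, :] - X_tr[None, :, :], axis=2)
    nearest = np.argsort(dist, axis=1)[:, :K]
    return Y_tr[nearest].mean(axis=1)

# Synthetic 1-D data (an assumption for this sketch): noisy samples of a sine
rng = np.random.default_rng(0)
X_all = rng.uniform(-3, 3, size=(120, 1))
Y_all = np.sin(X_all[:, 0]) + 0.3 * rng.normal(size=120)

# Split the available training data once more into training and validation parts
X_tr, Y_tr = X_all[:80], Y_all[:80]
X_val, Y_val = X_all[80:], Y_all[80:]

# Evaluate a grid of K values on the validation set and keep the best one
K_grid = np.arange(1, 21)
mse_val = np.array([np.mean((Y_val - knn_predict(X_tr, Y_tr, X_val, K)) ** 2)
                    for K in K_grid])
K_best = K_grid[np.argmin(mse_val)]
print('Selected K:', K_best)
```

After this selection, step 6 would refit the model with `K_best` on the union of the training and validation parts before computing the test MSE.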

Unless the dataset (and hence the validation set) is large enough, the robustness of the above procedure is improved by averaging the validation MSE (step 4) over multiple random partitions of the validation/training set (**Repeated random sub-sampling validation**), or by splitting the training set into K folds and averaging the results using one of the folds as validation set at a time (**K-fold cross validation**). Note that the number of folds is unrelated to the $K$ of K-NN. See the [Wikipedia entry](https://en.wikipedia.org/wiki/Cross-validation) about CV for more details. 
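
To make the K-fold idea concrete, the sketch below only generates the index partitions: every sample falls in the validation fold exactly once. The generator name `kfold_indices` and its arguments are hypothetical; the argument is called `n_folds` to avoid confusion with the $K$ of K-NN.

```python
import numpy as np

def kfold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, val_idx) pairs, using each fold once as validation set."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    folds = np.array_split(perm, n_folds)   # nearly equal-sized folds
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, val_idx

splits = list(kfold_indices(n_samples=10, n_folds=5))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))   # 8 2 on every line
```

The validation MSE of each candidate model would then be computed once per split and averaged over the splits.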



### Exercise 5.1

Now, apply a 10-fold cross validation (CV) process to select the optimum value of K from a range of values from 1 to 40. You can use the [GridSearchCV( )](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) function to implement it. 

Note: Define the scoring parameter of GridSearchCV as 'neg_mean_squared_error' (to minimize the MSE during the CV).

#### SOLUTION

In [0]:
# <SOL>
# </SOL>

### Analyzing CV output
After running the CV process, the estimator object contains the information of the CV process (the next cell explores the attribute ".cv\_results\_" to obtain this information).

In [0]:
# Printing results
# (clf and rang_K are assumed to be the GridSearchCV object and the range of K values from Exercise 5.1)
print("Cross validation results:")

paramsFolds = clf.cv_results_['params']
# With scoring='neg_mean_squared_error' the scores are negative MSE values
meanScoreFolds = clf.cv_results_['mean_test_score']
stdScoreFolds = clf.cv_results_['std_test_score']

for i, K in enumerate(rang_K):
    params = paramsFolds[i]
    mean_score = meanScoreFolds[i]
    std_score = stdScoreFolds[i]
    print("For K = %d, validation MSE is %2.2f (+/-%1.3f)" 
          % (params['n_neighbors'], -mean_score, std_score))

# Selecting validation error (mean values); the sign is flipped to recover the MSE
vect_val = -meanScoreFolds

# Plotting results
plt.figure()
plt.plot(rang_K,vect_val,'g', label='Validation MSE')
plt.legend()
plt.xlabel('K value')
plt.ylabel('MSE')
plt.title('Evolution of K-NN validation MSE')
plt.show()

### Exercise 5.2

Examine the fields ".best\_estimator\_" and ".best\_params\_" of the estimator generated by the CV process:
* ".best\_estimator\_" contains the final estimator trained with the selected value.
* ".best\_params\_" is a dictionary with the selected parameters. In our example, "best\_params\_['n\_neighbors']" would provide the selected value of K.

Save the selected value of K in a variable named "K_opt" and compute the test error of the final estimator.

#### SOLUTION

In [0]:
# <SOL>
# </SOL>

Note that you can also compute the KNN output directly over the estimator object returned by the CV process:

In [0]:
f_test = clf.predict(X_test)
MSE_test = np.mean((Y_test-f_test)**2) 
print("The test MSE is %2.2f" %(MSE_test ))

## Exercise 5.3  Advanced work

Complete the following code to implement the CV process without using GridSearchCV().

You can use the sklearn [KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold) class to generate the different validation folds.

### SOLUTION

In [0]:
# <SOL>
# </SOL>

# Performance evaluation

The last step of the ML pipeline is the performance evaluation. So far, we have used the MSE as performance measurement, but we can find other metrics. In fact, sklearn models include a wide variety of [metrics for regression problems](https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics). The most frequent are:
* Mean Square Error (MSE)
$$MSE = \frac{1}{N} \sum_{i=1}^{N} \left(y^{(i)}-f({\bf x}^{(i)})\right)^2$$

* Mean Absolute Error (MAE)
$$MAE = \frac{1}{N} \sum_{i=1}^{N} \left|y^{(i)}-f({\bf x}^{(i)})\right|$$

* R2-score (R2)
$$R2 =  1- \frac{ \sum_{i=1}^{N} \left(y^{(i)}-f({\bf x}^{(i)})\right)^2}{\sum_{i=1}^{N} \left(y^{(i)}-\bar{y}\right)^2}$$

where $\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y^{(i)}$ (mean value of the targets).

The R2 score is the default measurement used by sklearn regression methods. It has the advantage of being more interpretable: the best possible score is 1.0, a constant model that always predicts the mean value of y gets a score of 0.0, and models working even worse than this constant model get negative values. 
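
The three formulas can be checked by hand on a few numbers. In the sketch below the targets and predictions are made up for illustration, and the metrics are computed directly with NumPy rather than with the sklearn functions.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 5.0])    # hypothetical predictions

mse_val = np.mean((y_true - y_pred) ** 2)                 # 1.5 / 4 = 0.375
mae_val = np.mean(np.abs(y_true - y_pred))                # 2.0 / 4 = 0.5
ss_res = np.sum((y_true - y_pred) ** 2)                   # residual sum of squares: 1.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)            # total sum of squares: 5.0
r2_val = 1 - ss_res / ss_tot                              # approximately 0.7
print(mse_val, mae_val, r2_val)

# A constant model that always predicts the mean target gets R2 = 0
y_const = np.full_like(y_true, y_true.mean())
r2_const = 1 - np.sum((y_true - y_const) ** 2) / ss_tot
print(r2_const)
```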


## Exercise 6

Use the [regression metrics](https://scikit-learn.org/stable/modules/classes.html#regression-metrics) of sklearn to obtain the MSE, MAE and R2  scorings of the K-NN regressor. 

Note that you can also use the .score() method of the K-NN model to evaluate the regressor performance, but this method only returns the R2-score. 

In [0]:
# <SOL>
# </SOL>