# A neural network model for regression problems

We will see here how to use Keras to implement a simple neural network model for a **regression problem** (continuous target variable).
For this practical session we are using the [California house pricing dataset](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html), exploring the relationship between median house values (expressed in hundreds of thousands of dollars) and a series of eight numerical properties of California districts.

Fortunately for us, it is a very well-known dataset, handily included in the [sklearn library](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html#sklearn.datasets.fetch_california_housing).

# The usual seed priming

In [None]:
#resetting the seeds
!wget -O support_code.py https://raw.githubusercontent.com/ne1s0n/coding_excercises/master/lab_day1/support_code.py
%run support_code.py
n_seed = 10

reset_random_seeds(n_seed)

# Loading libraries and getting the data

We first load some necessary libraries and get the data from the `sklearn` module.

In [None]:
import numpy as np
import pandas as pd
from sklearn import datasets
import matplotlib.pyplot as plt

#(features, target_variable) = datasets.fetch_california_housing(return_X_y=True)
cal = datasets.fetch_california_housing(return_X_y=False)

#for data description and manipulation a pandas dataframe is handier
#than a numpy matrix. Moreover, you can always go back using .to_numpy()
#method
#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html
#features = pd.DataFrame(features)
features = pd.DataFrame(cal.data, columns=cal.feature_names)

# Data description

In [None]:
#taking a look at the features
features.describe()

In [None]:
#taking a look at the target variable
target_variable = cal.target
plt.hist(target_variable, bins = 15)
plt.show()

# Data preprocessing

## Missing values removal

It's always useful to check if something is missing, since usually regression models don't like missing values.

In [None]:
features.isnull().sum()

No values are missing, but just in case, the code below takes care of holes in the features data frame.

In [None]:
#fast way, but only for one data frame...
#features = features.dropna()

#finding samples (rows) with at least one missing data point
missing = features.isnull().sum(axis = 1) > 0

#you cannot simply use the "not" operator on a series:
#it's ambiguous. We need to go through a proper
#negation function from numpy
not_missing = np.logical_not(missing)

#subsetting
features = features.loc[not_missing, :]
target_variable = target_variable[not_missing]

#just checking
print("Size of target variable: ", target_variable.shape)
print("Size of feature matrix: ", features.shape)

## Discussion: what if we are missing target values?

The code above shows us how to remove samples with missing values in the `features` data frame. What if we have missing values in the target array?

Pay attention: `target_variable` is a NumPy array. What kind of data should we expect?

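As a starting point for this discussion, here is a minimal sketch on toy data. It assumes that missing targets show up as `NaN` in a floating-point NumPy array; the names `features_toy` and `target_toy` are hypothetical and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): four samples, one of them with a missing target
features_toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                             "b": [10.0, 20.0, 30.0, 40.0]})
target_toy = np.array([0.5, np.nan, 1.5, 2.0])

# in a float NumPy array missing values are stored as NaN
target_missing = np.isnan(target_toy)
print("Missing targets:", target_missing.sum())

# features and target must be filtered with the same mask,
# otherwise rows and targets would no longer line up
keep = np.logical_not(target_missing)
features_toy = features_toy.loc[keep, :]
target_toy = target_toy[keep]

print("Size of target variable: ", target_toy.shape)
print("Size of feature matrix: ", features_toy.shape)
```

The same mask is applied to both objects, mirroring what the cell above does for missing features.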
## Discussion: outliers removal?

Should we remove outliers? If yes, do it in the space below. If no, tell us why.

In [None]:
######## YOUR CODE HERE ########
if True:
  #some very refined logic...

  #subsetting


  #just checking

################################

## Discussion/Exercise: data normalization?

Should we normalize the features? If yes, do it in the space below. If no, tell us why.

In [None]:
######## YOUR CODE HERE ########
if True:
  #getting mean and standard deviation for train set


  #normalizing


  #checking if we are doing everything fine

################################

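If you decide to normalize, one common option is standardization (subtract the mean, divide by the standard deviation, column by column). The sketch below shows the idea on a toy matrix; the data and the names `toy`, `mean`, `std` are hypothetical. When a train/validation split exists, the statistics would normally be computed on the training portion only.

```python
import numpy as np

# Toy feature matrix (hypothetical): two columns on very different scales
rng = np.random.default_rng(0)
toy = rng.normal(loc=[5.0, 100.0], scale=[2.0, 30.0], size=(200, 2))

# per-column mean and standard deviation
mean = toy.mean(axis=0)
std = toy.std(axis=0)

# standardizing: every column ends up with mean 0 and standard deviation 1
toy_norm = (toy - mean) / std

# checking that everything went fine
print(toy_norm.mean(axis=0))
print(toy_norm.std(axis=0))
```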
# Training and validation sets

We have a reasonably sized dataset, so we can go with one of the standard split-validation cuts, e.g. keeping 10% of the samples for validation.

> **[PRO TIP]** With very small datasets you may be forced to do "leave one out" cross-validation. For N samples you cycle through the data, each time selecting one sample for prediction (this is your validation set) and keeping the remaining N-1 for training. Unfortunately, in this case you cannot compute correlations (can you guess why?) and must use error-related metrics like Mean Squared Error.

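The leave-one-out cycle described in the tip can be sketched as follows. This is only an illustration on toy data: a straight-line fit with `np.polyfit` stands in for the model, and all names are hypothetical.

```python
import numpy as np

# Toy data (hypothetical): y depends linearly on x, plus a little noise
rng = np.random.default_rng(0)
x = np.arange(8, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

squared_errors = []
for i in range(x.size):
    # sample i is the validation set, the other N-1 samples are used for training
    train = np.arange(x.size) != i
    slope, intercept = np.polyfit(x[train], y[train], deg=1)

    # predicting the single left-out sample and storing its squared error
    prediction = slope * x[i] + intercept
    squared_errors.append((y[i] - prediction) ** 2)

# averaging over the N rounds gives the leave-one-out MSE
loo_mse = float(np.mean(squared_errors))
print("Leave-one-out MSE:", loo_mse)
```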



In [None]:
#we import a function to perform the split
from sklearn.model_selection import train_test_split

## YOUR CODE

#let's print some shapes to get an idea of the resulting data structure
print(features_train.shape)
print(features_val.shape)
print(target_train.shape)
print(target_val.shape)

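For reference, this is how `train_test_split` behaves on toy arrays. It is a sketch only: the arrays `X_toy` and `y_toy`, the 10% validation fraction, and the `random_state` value are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 samples, 2 features
X_toy = np.arange(40, dtype=float).reshape(20, 2)
y_toy = np.arange(20, dtype=float)

# test_size=0.1 keeps 10% of the samples aside for validation;
# random_state makes the shuffling reproducible
X_train, X_val, y_train, y_val = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=10
)

print(X_train.shape, X_val.shape)
print(y_train.shape, y_val.shape)
```

Features and targets are passed together so that they are shuffled and cut in the same way.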
# Building the neural network model

We are now ready to build our neural network model for regression. First, we set some hyperparameters:

- the activation function in the output layer is, in this case, **linear**: the output is directly the value of $\sum wx + b$
- the loss function is MSE (**Mean Squared Error**)

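To make the two choices above concrete, here is a small NumPy sketch of what a linear output unit and the MSE loss compute. All numbers are hypothetical, and this is not how Keras implements them internally.

```python
import numpy as np

# Hypothetical values, for illustration only
x = np.array([0.5, -1.0, 2.0])   # activations arriving from the hidden layer
w = np.array([0.2, 0.4, -0.1])   # weights of the output unit
b = 0.3                          # bias of the output unit

# linear activation: the output is the weighted sum itself, unchanged
y_hat = np.dot(w, x) + b
print("Output of the linear unit:", y_hat)

# Mean Squared Error over a small batch of predictions
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])
mse = np.mean((y_true - y_pred) ** 2)
print("MSE:", mse)
```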
In [None]:
# Configuration options

#Keras wants the data shape as a tuple. In the case of images it is something
#like (image width, image height, number of channels).
#In our case the data is tabular, so each data point is an
#array (a row in the table) and we just need to specify
#a single number, i.e. the number of columns. However, we still need
#to input a tuple, so...
input_shape =

#other standard features describing our network


In [None]:
print(input_shape)

As we learnt to do throughout the course, we now use `keras` to build our sequential neural network model:

In [None]:
reset_random_seeds(n_seed)

#we are building a "sequential" model, meaning that the data will
#flow like INPUT -> ELABORATION -> OUTPUT.
from keras.models import Sequential

#a "dense" layer is a layer where all the data coming in are connected
#to all nodes.
from keras.layers import Dense

#adding the hidden layer with the required number of nodes, plus the final
#layer with a single node (since we want to output a single number)

#the model is declared, but we still need to compile it to actually
#build all the data structures


In [None]:
print(model.summary())

The `model.summary()` tells us that we now have 161 model parameters to learn:

- hidden layer: 8 feature weights x 16 units + 16 bias terms $\rightarrow$ 144 parameters
- output layer: 16 unit weights + 1 bias term $\rightarrow$ 17 parameters  
- 144 + 17 = 161 parameters

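The count above can be double-checked with a few lines of plain Python. The helper `dense_params` is hypothetical (it is not a Keras function) and simply encodes "one weight per input per unit, plus one bias per unit".

```python
def dense_params(n_inputs, n_units):
    # each unit has one weight per input, plus one bias term
    return n_inputs * n_units + n_units

n_features = 8      # columns in the feature matrix
hidden_units = 16   # nodes in the hidden layer

hidden = dense_params(n_features, hidden_units)
output = dense_params(hidden_units, 1)
total = hidden + output

print(hidden, output, total)
```

Changing `hidden_units` shows how quickly the number of parameters grows with the size of the hidden layer.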
# Training the neural network model

In [None]:
## YOUR CODE

In [None]:
def plot_loss_history(h, title):
    plt.plot(h.history['loss'], label = "Train loss")
    plt.plot(h.history['val_loss'], label = "Validation loss")
    plt.xlabel('Epochs')
    plt.title(title)
    plt.legend()
    plt.show()

plot_loss_history(history, 'Regression ({} epochs)'.format(num_epochs))

# Predictions

We can now look at the predictions our neural network model produced for the house prices. Below we also report the known target values from the validation set.

In [None]:
#predicting, and reporting the shape of the obtained object (also known as y hat)


In [None]:
#a couple of prints of the true target values (also known as y)


In regression problems you cannot use the same accuracy metrics as in classification problems (e.g. error rate, confusion matrix, etc.). Instead, other metrics are used, such as:

- **Pearson linear correlation**
- **Spearman rank correlation**
- **RMSE** (root mean squared error)
- **MAE** (mean absolute error)
- etc. (there are many more)

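The metrics listed above can be computed with NumPy alone, as in the sketch below. The toy values are hypothetical, and the rank-based Spearman computation assumes there are no tied values.

```python
import numpy as np

# Toy values (hypothetical), chosen without ties so ranking is unambiguous
y_true_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred_toy = np.array([1.5, 1.8, 3.2, 3.9, 5.6])

def ranks(a):
    # position of each element in sorted order (0 = smallest); no ties assumed
    return np.argsort(np.argsort(a))

# Pearson: linear correlation between true and predicted values
pearson = np.corrcoef(y_true_toy, y_pred_toy)[0, 1]

# Spearman: Pearson correlation computed on the ranks
spearman = np.corrcoef(ranks(y_true_toy), ranks(y_pred_toy))[0, 1]

# error-based metrics, expressed in the same units as the target
rmse = np.sqrt(np.mean((y_true_toy - y_pred_toy) ** 2))
mae = np.mean(np.abs(y_true_toy - y_pred_toy))

print("Pearson: ", pearson)
print("Spearman:", spearman)
print("RMSE:    ", rmse)
print("MAE:     ", mae)
```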
In [None]:
#changing format to pandas Series, for handiness


#computing correlation

print("accuracy (measured as Pearson's correlation) is: ", accuracy)


A scatter plot of predicted vs true values can also be useful to visualise the result of the model:



In [None]:
plt.style.use('ggplot')


#equivalent: plt.plot(y, y_pred, "o")

# draw the main diagonal, for reference


#labelling the axis


In [None]:
#Root Mean Square Error
print('Root Mean Square Error : ' + str(np.sqrt(np.mean((y - y_pred)**2))))
print('Target mean            : ' + str(y.mean()))
print('Predictions mean       : ' + str(y_pred.mean()))

# Exercise: improve the network

The predictions are not terrible, but they are certainly not excellent. How can we improve the performance?