# Lesson 18 & 19 Neural Networks For Regression

Neural networks are a biomimicry concept based on the brain. 
The brain uses neurons, each of which has an "activation" state. 
If a neuron reaches its activation state, it passes information on to the next neuron. 
Neural networks copy this model.  

<img src="neural_node.png" width="500">

Each neuron has a **bias input**, $x_0$, with value 1. 
Neuron **weights** ($w_i$) are the <mark> model parameters </mark>.

The output is obtained by applying an **activation function $g$** to the weighted sum of the inputs:
$$y_{output} = \hat{y} = f(x, w) = g\left( \sum_{i=0}^{I} w_i x_i\right)$$  
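
As a minimal sketch of the formula above, the snippet below evaluates a single neuron with NumPy. The function names, the choice of the logistic function for $g$, and the input and weight values are all illustrative assumptions, not part of the lesson's dataset or code.

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) activation, used here as an example choice of g."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, g=logistic):
    """Return g(sum_i w_i * x_i), with the bias input x_0 = 1 prepended to x."""
    x_with_bias = np.concatenate(([1.0], x))  # x_0 = 1 is the bias input
    return g(np.dot(w, x_with_bias))

x = np.array([0.5, -1.0])      # two illustrative inputs x_1, x_2
w = np.array([0.1, 0.4, 0.2])  # w_0 (bias weight), w_1, w_2
y_hat = neuron_output(x, w)    # weighted sum is 0.1 + 0.2 - 0.2 = 0.1
```

Here the weighted sum is 0.1, so `y_hat` is the logistic function evaluated at 0.1.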

## Feed-Forward Networks  

Feedforward networks are organized as layers where each layer is **fully connected** to the next. They are directed acyclic graphs. 

- **Perceptron:** Single-layer neural network: input layer -> output layer.
- **MLP / Multilayer Neural Network:** Input layer -> hidden layer(s) -> output layer. 

<mark> Some notes: </mark>  
- The input layer has as many units as there are inputs in the problem. 
- The output layer has as many units as are needed for the problem. 
- Each hidden layer has multiple units (possibly a different number per layer). 
- The input to each unit is the set of outputs of all units in the previous layer. 

<img src="multilayer_neural_network.png" width="300"> 
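
The notes above can be sketched as a forward pass through fully connected layers. This is a hand-written illustration, not how `MLPRegressor` is implemented internally. The layer sizes, the random weights, and the choice of ReLU for the hidden layers are assumptions made for the example.

```python
import numpy as np

def relu(z):
    """Rectified linear unit, applied element-wise."""
    return np.maximum(0.0, z)

def forward(X, weights, biases):
    """Propagate X through fully connected layers.

    Hidden layers use ReLU; the output layer is linear (identity),
    which is the usual choice for regression.
    """
    a = X
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(a @ W + b)  # every unit sees all outputs of the previous layer
    return a @ weights[-1] + biases[-1]

rng = np.random.default_rng(0)
layer_sizes = [4, 8, 6, 1]  # 4 inputs, two hidden layers (8 and 6 units), 1 output
weights = [rng.normal(size=(m, n)) for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

X = rng.normal(size=(3, 4))  # 3 samples with 4 inputs each
y_hat = forward(X, weights, biases)
```

Each weight matrix has shape (units in previous layer, units in this layer), which is what "fully connected" means in practice.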

## Data  
We will work with the energy data used previously to predict the power consumption based on several features. This data is a <mark> time-series</mark> dataset. See "Lesson 16 - Support Vector Machines for Regression" for the initial notes on the dataset. We don't need to repeat that here.  

There are 19735 datapoints and 28 features + 1 output = 29 columns. The output is the energy consumption, which is stored in the 'Appliances' column.  

Remember, since this is time-series data, we will not do the traditional random train-test split we do for non-transient data. Instead, we split the data at a specific time point and then use the <mark> time-series window </mark> (sliding window) approach. This cuts the series into overlapping chunks: the last value of each chunk is the output, and the values before it are the inputs. Applied to a column of data, this approach produces a 3D array with axes (window index, variable, position within the window). We only have 1 variable, so the middle axis has length 1 and is redundant.

<img src="sliding_window.png" width="300">  
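
A small sketch of the windowing idea on a toy series, using NumPy's `sliding_window_view`. The 10-point series is made up for illustration, and the window length `w = 4` is an assumption chosen to keep the output easy to read.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

series = np.arange(10, dtype=float).reshape(-1, 1)  # toy series, shape (10, 1)
w = 4                                               # number of past values used as inputs

# Each window holds w inputs plus 1 output value.
windows_3d = sliding_window_view(series, w + 1, axis=0)  # shape (6, 1, 5)
windows = windows_3d.squeeze()                            # drop the length-1 axis -> (6, 5)

X = windows[:, :-1]  # first w values of each window: the inputs
y = windows[:, -1]   # last value of each window: the output
```

With 10 points and a window of 5 values there are 10 - 5 + 1 = 6 windows. The first window pairs the inputs `[0, 1, 2, 3]` with the output `4`.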

<mark> It is important  to normalize our data </mark> in this approach.  
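
As a hedged illustration of normalizing a time-series split, the sketch below applies min-max scaling by hand, which is the same formula `MinMaxScaler` uses with its default (0, 1) range. The numbers are invented. The key point is that the minimum and maximum come from the training portion only, so no information from the test portion leaks into training.

```python
import numpy as np

train = np.array([10.0, 50.0, 60.0, 100.0])  # illustrative training values
test = np.array([30.0, 120.0])               # illustrative later (test) values

# Fit the scaling on the training data only.
lo, hi = train.min(), train.max()

train_scaled = (train - lo) / (hi - lo)  # lies in [0, 1] by construction
test_scaled = (test - lo) / (hi - lo)    # may fall outside [0, 1]
```

A test value larger than anything seen in training (120 here) scales to a number above 1, which is expected and not an error.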



# Multilayer Perceptron Neural Network

- MLP is trained with a supervised learning technique called back-propagation (calculus based: it applies the chain rule to get the gradient of the loss with respect to every weight). 
- Except for the input nodes, each node is a neuron that uses a **nonlinear activation function**.  
- <mark> It can distinguish data that is not linearly separable </mark>  
- <mark> MLP is a deep learning approach </mark>

This model receives multiple hyperparameters:   
(In the code these are stored in `params`. Each list holds the candidate values the search can try.)  
- `hidden_layer_sizes`: Number of units in each hidden layer; the i-th element is the size of the i-th hidden layer. A single integer, as used in `params`, means one hidden layer with that many units.
- `activation`: Activation function for the hidden layers, given as a string that names a predefined function, e.g. "relu", "logistic", etc.
- `alpha`: Strength of the $L2$ regularization (penalty) term added to the loss function. 
- `momentum`: Similar to momentum in physics. Each update keeps part of the previous update, so the update direction resists change. This can carry the optimizer <mark> past shallow local minima</mark>, which are not deep enough to stop it, i.e. <mark> only deep enough minima </mark> will stop the training. Only used when `solver="sgd"`. 
- `learning_rate_init`: Initial learning rate; it controls the step size of the weight updates. 
- `n_iter_no_change`: Number of epochs training may continue without the score improving by at least `tol` before it stops. 
- `learning_rate`: Schedule for updating the learning rate, starting from `learning_rate_init`. Only used when `solver="sgd"`. **constant**: does not change from the initial value. **invscaling**: gradually decreases the learning rate at each time step using an inverse scaling exponent. **adaptive**: keeps the learning rate constant at `learning_rate_init` as long as the training loss keeps decreasing. Each time two consecutive epochs fail to decrease the training loss by at least `tol`, or fail to increase the validation score by at least `tol` if `early_stopping` is enabled, the current learning rate is divided by 5 (reduced to 20% of its value).

\begin{align}
    \text{ReLU (rectified linear unit)} &: \ \ f(x) = \max(0, x) \\[0.5ex]
    \text{Logistic (sigmoid function)} &: \ \ f(x) = \frac{1}{1 + \exp(-x)}
\end{align}  
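
The two activation functions above can be written in a couple of lines of NumPy. This is only a sketch for building intuition (the function names and sample inputs are illustrative); `MLPRegressor` selects its own internal implementations through the `activation` string.

```python
import numpy as np

def relu(x):
    """ReLU: passes positive values through and clips negatives to 0."""
    return np.maximum(0.0, x)

def logistic(x):
    """Logistic (sigmoid): squashes any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
relu_out = relu(x)          # negatives become 0, positives unchanged
logistic_out = logistic(x)  # symmetric around 0.5 at x = 0
```

Both functions are nonlinear, which is what lets a network with hidden layers model data that is not linearly separable.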

**note:** this learning_rate business is similar to an "adaptive timestep" scheme in CFD.  
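
To make the `momentum` hyperparameter more concrete, here is a minimal sketch of classical (heavy-ball) momentum on the toy function $f(w) = w^2$. The function `minimize`, the toy gradient, and the parameter values are assumptions for illustration; scikit-learn's SGD solver is more involved (for example, it can use Nesterov momentum).

```python
import numpy as np

def minimize(grad, w0, lr, momentum, n_steps):
    """Gradient descent with classical momentum; returns the visited points."""
    w, v = w0, 0.0
    path = [w]
    for _ in range(n_steps):
        v = momentum * v - lr * grad(w)  # keep part of the previous update
        w = w + v
        path.append(w)
    return np.array(path)

def grad(w):
    """Gradient of the toy function f(w) = w**2."""
    return 2.0 * w

plain = minimize(grad, w0=5.0, lr=0.1, momentum=0.0, n_steps=300)
heavy = minimize(grad, w0=5.0, lr=0.1, momentum=0.9, n_steps=300)
```

Without momentum the iterate approaches the minimum at 0 from one side and never crosses it. With momentum 0.9 it overshoots, swings past 0, and oscillates before settling, which is the "resists change of direction" behaviour described in the hyperparameter list.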

<mark> To tune these hyperparameters </mark> we will use **Bayesian-search cross-validation** (`BayesSearchCV` from scikit-optimize).

# < START >

# Imports

In [56]:
%cd ..

/Users/jaimemerizalde/Desktop/Library


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pickle

from Library import data

from numpy.lib.stride_tricks import sliding_window_view

from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error
from sklearn.metrics import make_scorer

from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPRegressor

from skopt import BayesSearchCV

# Get Data

In [None]:
filename = "Datasets/Energy.csv"
df = data.get_data(filename, index_col=[0])

Index(['date', 'Appliances', 'lights', 'T1', 'RH_1', 'T2', 'RH_2', 'T3',
       'RH_3', 'T4', 'RH_4', 'T5', 'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'T8',
       'RH_8', 'T9', 'RH_9', 'T_out', 'Press_mm_hg', 'RH_out', 'Windspeed',
       'Visibility', 'Tdewpoint', 'rv1', 'rv2'],
      dtype='object')

## EDA

In [14]:
df.shape
df.columns
df.describe(include="all")

| | date | Appliances | lights | T1 | RH_1 | T2 | RH_2 | T3 | RH_3 | T4 | ... | T9 | RH_9 | T_out | Press_mm_hg | RH_out | Windspeed | Visibility | Tdewpoint | rv1 | rv2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 19735 | 19735.0 | 19735.0 | 19735.0 | 19735.0 | 19735.0 | 19735.0 | 19735.0 | 19735.0 | 19735.0 | ... | 19735.0 | 19735.0 | 19735.0 | 19735.0 | 19735.0 | 19735.0 | 19735.0 | 19735.0 | 19735.0 | 19735.0 |
| unique | 19735 | | | | | | | | | | ... | | | | | | | | | | |
| top | 11-01-2016 17:00 | | | | | | | | | | ... | | | | | | | | | | |
| freq | 1 | | | | | | | | | | ... | | | | | | | | | | |
| mean | | 97.694958 | 3.801875 | 21.686571 | 40.259739 | 20.341219 | 40.42042 | 22.267611 | 39.2425 | 20.855335 | ... | 19.485828 | 41.552401 | 7.41258 | 755.522602 | 79.750418 | 4.039752 | 38.330834 | 3.760995 | 24.988033 | 24.988033 |
| std | | 102.524891 | 7.935988 | 1.606066 | 3.979299 | 2.192974 | 4.069813 | 2.006111 | 3.254576 | 2.042884 | ... | 2.014712 | 4.151497 | 5.318464 | 7.399441 | 14.901088 | 2.451221 | 11.794719 | 4.195248 | 14.496634 | 14.496634 |
| min | | 10.0 | 0.0 | 16.79 | 27.023333 | 16.1 | 20.463333 | 17.2 | 28.766667 | 15.1 | ... | 14.89 | 29.166667 | -5.0 | 729.3 | 24.0 | 0.0 | 1.0 | -6.6 | 0.005322 | 0.005322 |
| 25% | | 50.0 | 0.0 | 20.76 | 37.333333 | 18.79 | 37.9 | 20.79 | 36.9 | 19.53 | ... | 18.0 | 38.5 | 3.67 | 750.933333 | 70.333333 | 2.0 | 29.0 | 0.9 | 12.497889 | 12.497889 |
| 50% | | 60.0 | 0.0 | 21.6 | 39.656667 | 20.0 | 40.5 | 22.1 | 38.53 | 20.666667 | ... | 19.39 | 40.9 | 6.92 | 756.1 | 83.666667 | 3.666667 | 40.0 | 3.43 | 24.897653 | 24.897653 |
| 75% | | 100.0 | 0.0 | 22.6 | 43.066667 | 21.5 | 43.26 | 23.29 | 41.76 | 22.1 | ... | 20.6 | 44.338095 | 10.4 | 760.933333 | 91.666667 | 5.5 | 40.0 | 6.57 | 37.583769 | 37.583769 |


# Partition Data Set / Train-Test-Split DataSet

In [61]:
from numpy.lib.stride_tricks import sliding_window_view
from sklearn.preprocessing import MinMaxScaler

# Output Data
energy = df.loc[:, "Appliances"]
energy

# Split Output Data
e_train, e_test = energy.iloc[:12000], energy.iloc[12000:]

# Data Scale
scaler = MinMaxScaler()
e_train_scaled = scaler.fit_transform(e_train.to_numpy().reshape(-1, 1))
e_test_scaled = scaler.transform(e_test.to_numpy().reshape(-1, 1))

# Sliding Window
## 4 predictors for 1 output in this case. 
## sliding_window_view returns an array of shape 11996 x 1 x 5; the middle axis
## (length 1) is redundant, so we use the squeeze method to drop it.
## The last value in each window is what we will consider the target (output) value,
## and the ones that came before it are our predictor (input) values.
w = 4
## Train
window_train = sliding_window_view(e_train_scaled, w + 1, axis=0)
window_train = window_train.squeeze()
X_train_w = window_train[:, :-1]
y_train_w = window_train[:, -1]

## Test
window_test = sliding_window_view(e_test_scaled, w + 1, axis=0)
window_test = window_test.squeeze()
X_test_w = window_test[:,:-1]
y_test_w = window_test[:,-1]

## The training windows are split further into cross-validation folds later, when BayesSearchCV is applied.


# Multilayer Perceptron Neural Network

In [None]:
from sklearn.neural_network import MLPRegressor
from skopt import BayesSearchCV
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import make_scorer
from sklearn.metrics import mean_squared_error

params = {
    "hidden_layer_sizes": [100, 200, 300],  # units in the single hidden layer
    "activation": ["relu", "logistic"],  # hidden-layer activation function
    "alpha": [0.0001, 0.001, 0.01],  # L2 regularization strength
    "momentum": [0.95, 0.90, 0.85],  # momentum of the weight update (only used when solver="sgd")
    "learning_rate_init": [0.001, 0.01, 0.1],  # step-size control for weight updates
    "n_iter_no_change": [30, 40, 50],  # epochs allowed during training while score is not improving
    "learning_rate": ["constant", "invscaling", "adaptive"],  # learning-rate schedule (only used when solver="sgd")
}

mlp = MLPRegressor(max_iter=100000, early_stopping=True, random_state=0)
mlp_bs = BayesSearchCV(
    mlp, params,
    cv = TimeSeriesSplit(n_splits=5, gap=w + 1), 
    scoring=make_scorer(mean_squared_error, greater_is_better=False),
    n_iter=15, n_jobs=-1, refit=True, random_state=0, 

)
mlp_bs.fit(X_train_w, y_train_w)

with open("multilayer_perceptron_nerual_network_fit.pkl", "wb") as file:
    pickle.dump(mlp_bs, file)


In [None]:
# Visualize  
