# AI application in Structural Engineering
_Larissa Driemeier and Gabriel Lopes Rodrigues_


This introductory notebook replicates the geometry of the structure analysed in the paper
[*Background Information of Deep Learning for Structural Engineering*](https://www.researchgate.net/publication/318190131_Background_Information_of_Deep_Learning_for_Structural_Engineering).

## Structural Model

### Geometry

The figure below shows a beam-like 2D truss with 10 bars. The lengths of the bars are fixed; however, the cross-sectional areas are obtained by uniform random sampling between $0.6$ $cm^2$ and $225.8$ $cm^2$. In total, 500 different structures were generated.

 ![](https://drive.google.com/uc?export=view&id=1xOuJYBWiWGkq5l_Z_hcAjYak_hjG26l5)

Then, the structure is loaded and analysed in the commercial FE software Abaqus. Since the dimensions of the structure are fixed, the input is the set of areas, while all nodal displacements and bar stresses are computed as output.
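The sampling step described above can be sketched with NumPy. This is only a minimal illustration under stated assumptions, not the authors' generation script: the seed and the variable names are arbitrary choices, while the number of structures and the area limits are the values quoted in the text.

```python
import numpy as np

# Values quoted in the text: 10 bars, 500 structures,
# areas drawn uniformly between 0.6 cm^2 and 225.8 cm^2.
N_BARS = 10
N_STRUCTURES = 500
AREA_MIN_CM2 = 0.6
AREA_MAX_CM2 = 225.8

rng = np.random.default_rng(seed=0)  # the seed is an arbitrary choice
areas = rng.uniform(AREA_MIN_CM2, AREA_MAX_CM2, size=(N_STRUCTURES, N_BARS))

print(areas.shape)  # (500, 10): one row per structure, one column per bar
```

Each row of `areas` would then be one input vector for the FE analysis.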

### Linear material model

The material characteristics are generic values for Aluminum alloy 6061, as listed below.

Property | Value | Unit
--- | --- | ---
Mass density $\rho$ | $2.768\times 10^{-9}$ | $ton/mm^3$
Poisson's ratio $\nu$ | $0.35$ | -
Young's modulus $E$ | $68950$ | $MPa$
Yield stress $\sigma_{y0}$ | $200$ | $MPa$

The material undergoes elastic deformation until it reaches the elastic limit defined by the yield stress. Beyond the elastic limit, the material exhibits plastic behavior, that is, it deforms irreversibly and does not return to its original shape and size, even when the load is removed. Initially, only elastic behaviour is considered.
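As a small worked example of the linear elastic regime, the axial stress and elongation of a single bar follow $\sigma = F/A$ and $\delta = FL/(EA)$. The sketch below uses the Young's modulus and yield stress from the table; the force, area and length are hypothetical values chosen only for illustration, and `bar_response` is a hypothetical helper name.

```python
E_MPA = 68950.0      # Young's modulus from the table (MPa = N/mm^2)
YIELD_MPA = 200.0    # yield stress from the table (MPa)

def bar_response(force_n, area_mm2, length_mm):
    """Return (stress in MPa, elongation in mm, True if still elastic)."""
    stress = force_n / area_mm2
    elongation = force_n * length_mm / (E_MPA * area_mm2)
    return stress, elongation, abs(stress) < YIELD_MPA

# Hypothetical bar: 10 kN on a 100 mm^2 section, 1000 mm long
stress, elongation, elastic = bar_response(10_000.0, 100.0, 1000.0)
print(stress, elongation, elastic)
```

For this hypothetical bar the stress is 100 MPa, below the yield stress, so the linear model applies.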

## Machine Learning tool
Virtually no one writes their own code to implement and train an ANN, since there are numerous well-tested and widely used development tools that do most of this work.

The great advantage of using one of these tools is that we only need to define the configuration (architecture) of the ANN, that is, how *forward propagation* is performed. Once forward propagation is defined, backpropagation, which is in fact the most difficult part of coding an ANN, is generated automatically by the framework through automatic differentiation.

The most used deep-learning tools today are the following:
- TensorFlow;
- Keras;
- PyTorch;
- Caffe;
- Theano;
- MXNet;
- CNTK;
- Others.

Almost all of these tools are freely available on the Internet.

The Figure below shows the classification of these tools by users in the deep-learning area, which also provides an indication of the relative percentage of use of these tools.

 ![](https://drive.google.com/uc?export=view&id=1bTRu1eKniwD9nVB-i0F2LG5as--bFe6e)


Google's TensorFlow and Facebook's PyTorch are both widely used machine learning and deep learning frameworks. TensorFlow, a symbolic math library used for machine learning and for training neural networks, was open-sourced in 2015 and is backed by a huge community of machine learning experts.

PyTorch, on the other hand, is a Python package released by Facebook in 2016 for training neural networks. It quickly gained popularity because developers found it easier to use than TensorFlow.

Keras was created by François Chollet, is distributed under the MIT license, and is the most used deep learning framework among top-5 winning teams on Kaggle. It is nowadays TensorFlow's high-level API. Originally, Keras' default backend was Theano. When Google released TensorFlow, Keras started supporting TensorFlow as a backend, and TensorFlow became the default with the release of Keras v1.1.0.

Both TensorFlow and Keras usage grew together and, finally, the `tf.keras` submodule was introduced in TensorFlow v1.10.0, the first step in integrating Keras directly within the TensorFlow package itself.

Keras is essentially a more user-friendly interface to other tools, so to use Keras we must also have TensorFlow, Theano, or CNTK installed on the computer.

Some Keras commands from TensorFlow are slightly different from the original Keras, so a program made for Keras must be modified to run with TensorFlow's Keras, but modifications are few and simple.

Care must be taken with the numerous versions of TensorFlow and Keras, because a program made for an older version may not work with a newer version.


### Keras

Keras is a tool for developing deep-learning ANNs, based on the Python language, which provides a simple and convenient way to build, train and test an ANN.

The fundamental structure of an ANN is the layer, which receives a tensor as input and generates another tensor as output.

There are several types of layers, each type being specific for a given tensor format and for a certain type of processing. So, for example:

- Data in the form of vectors are stored in 2D tensors (1st axis: examples; 2nd axis: features) and typically processed in densely connected layers, called *dense* layers;


- Grayscale image data is stored in 3D tensors (1st axis: examples; 2nd axis: height; 3rd axis: width) and typically processed in *convolutional* layers;


- Sequences of temporal data are stored in 3D tensors (1st axis: examples; 2nd axis: time; 3rd axis: features) and typically processed in *recurrent* layers, for example, LSTM or GRU layers.
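The three tensor layouts listed above can be illustrated with placeholder NumPy arrays. The sizes used here (4 examples, 28 by 28 images, 50 time steps with 3 features) are arbitrary assumptions; only the ordering of the axes matters.

```python
import numpy as np

m = 4  # hypothetical number of examples

vectors = np.zeros((m, 10))        # 2D: (examples, features)
images = np.zeros((m, 28, 28))     # 3D: (examples, height, width)
sequences = np.zeros((m, 50, 3))   # 3D: (examples, time steps, features)

for name, tensor in [("vectors", vectors), ("images", images), ("sequences", sequences)]:
    print(name, tensor.ndim, tensor.shape)
```

In all three cases the first axis indexes the examples.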

The layers can be seen as *blocks* that we use to build an ANN.  Building deep-learning models in Keras is done by clipping together compatible layers to form useful data-transformation pipelines.
The notion of layer compatibility here refers specifically to the fact that every layer will only accept input tensors of a certain format and will return output tensors of a certain format.

In Keras there are two ways to define an ANN:

- with the *Sequential class* - configures models as a single sequence (stack) of layers, which is the most common type of ANN;

- with the *Functional API* - configures models as graphs of layers (for example, with branches or with multiple inputs and outputs), allowing an almost arbitrary ANN configuration.


We will start with the simplest way to create an ANN in Keras, which is the sequential model. Creating, training and testing an ANN with Keras is done in the following steps:
- Definition of training and test data;
- ANN configuration, which consists of defining the layers to map the inputs to the desired outputs;
- Compilation of the ANN, which also includes configuring the training process by choosing the cost function, the optimizer and the metric to evaluate performance;
- ANN training;
- ANN performance evaluation.

The Keras documentation provides details on its use. This documentation can be seen at the [link](https://keras.io/).

The Keras manual for TensorFlow is available at the [link](https://www.tensorflow.org/api_docs/python/tf/keras).

### Libraries
Throughout this notebook, the recently released version 2.12 of TensorFlow, which has built-in Keras support, was used.
To install it, follow the instructions on [the official website](https://www.tensorflow.org/install) to guarantee that the right version is installed.

The rest of the libraries used were simply installed using pip, the default Python tool for installing packages. These include:

- NumPy: library for dealing with large matrices and also providing optimized functions for these data structures

- Pandas: used to visualize the data and to work with the dataset.

- matplotlib: used to generate plots from the models.

- sklearn (also known as scikit-learn): used because of the many useful functions and utilities it provides for machine learning.

In [None]:
import tensorflow as tf
tf.__version__

'2.12.0'

In [None]:
from tensorflow import keras
import numpy as np
import pandas as pd

In [None]:
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

## Data Loading and Manipulation

Upload the following files:
1. the dataset containing the areas, `areas.csv`;
2. the displacements and reaction forces over time, `FinalResult.csv`.

If you prefer to generate new data, we suggest using the student version of the software [Abaqus](https://edu.3ds.com/en/software/abaqus-student-edition). The following files are available at this [link](https://edisciplinas.usp.br/course/view.php?id=82602#section-3):
 1. `AreaGeneration.ipynb`, to generate random areas;
 2. `10_BarStructure.py`, the script to run in Abaqus to generate the data;
 3. `Job-10bar.inp`, the basic geometry called by the script mentioned in item 2;
 4. Copy the file `extracted_data_DATA_HOUR.csv` as `FinalResult.csv` before uploading it.

 **Important**

The script in item 2 automatically generates the bar geometry in Abaqus. If you want to build the geometry yourself in Abaqus at least once, Prof. Marcilio Alves kindly prepared a tutorial that can be accessed through the [link](https://www.youtube.com/channel/UCEDn-UheEHKLfOKJKmSKzJw).


In [None]:
from google.colab import files
uploaded = files.upload()

As shown below, there are 10 different areas, which will be the inputs, and various other measurements, which might be used as outputs of the neural network.

In [None]:
df = pd.read_csv('FinalResult.csv', index_col=0)

In [None]:
df.dtypes

In [None]:
# To show all the columns
pd.set_option('display.max_columns', None)
df.head()

### Splitting dataset

The whole dataset will be split into training and test sets and organized in tensors.
The training set will be used to train the model and the test set to verify its performance.

The way the data is organized depends on the data type. Keras expects the first axis of the data, both in and out of the ANN, to be the number of examples $m$.

For example, in our problem the input data for each example is a vector with $n_x = 10$ areas, the output is a vector with $n_y = 2$ components and there are $m = 520$ examples, so the input and output tensors expected by Keras are as follows:

- Size of the input tensor $(m, n_x)$;
- Size of the output tensor $(m,n_y)$.

When divided into train $(80\%)$ and test $(20\%)$ sets, we expect the dimensions $(416,10)$ and $(104,10)$ for the train and test inputs, and $(416,2)$ and $(104,2)$ for the train and test outputs, respectively.
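The shapes produced by an 80/20 split can be checked on placeholder arrays. The sketch below uses a plain NumPy permutation as a stand-in for a library splitter; the arrays hold random numbers rather than the real dataset, and the seed and variable names are arbitrary assumptions.

```python
import numpy as np

m, n_x, n_y = 520, 10, 2           # sizes quoted in the text
rng = np.random.default_rng(42)    # arbitrary seed
x = rng.random((m, n_x))           # placeholder inputs, not real areas
y = rng.random((m, n_y))           # placeholder outputs

n_test = int(round(0.2 * m))       # 20% of the examples go to the test set
idx = rng.permutation(m)           # shuffled example indices
test_idx, train_idx = idx[:n_test], idx[n_test:]

x_tr, x_te = x[train_idx], x[test_idx]
y_tr, y_te = y[train_idx], y[test_idx]
print(x_tr.shape, x_te.shape, y_tr.shape, y_te.shape)
```

The number of examples is always the first axis; only that axis changes between the train and test tensors.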





In [None]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2, random_state=42)

In [None]:
train.head()

#### Defining the training values and the expected outputs

The input for the NN is a vector with all 10 areas that make up the geometry of the structure. Remember that we keep all other parameters, such as material properties and dimensions, fixed.

Data format expected by Keras: Keras expects the first axis of the data, for both the input and the output of the ANN, to be the number of examples. Dimension of the input tensor: $(m, n_x)$. Dimension of the output tensor: $(m, n_y)$.

As output for the NN, let's generate an array of displacements (\[*d4*\]). These values correspond to the vertical displacement of the rightmost node of the structure.


In [None]:
x_train = train.loc[:,'area1':'area10'].values
y_train = train[['d4']].values
x_val = test.loc[:,'area1':'area10'].values
y_val = test[['d4']].values

In [None]:
print(x_train.shape, y_train.shape)
print(x_val.shape, y_val.shape)

### FEA results

Read the results from the FEA, where `d2` and `d4` are the displacements at the rightmost nodes of the structure, to see the variation that our future NN has to learn.

In [None]:
disp4 = df[['d4']].values
xAxis = [i + 1.0 for i, _ in enumerate(disp4)]
plt.scatter(xAxis,disp4,color='darkslateblue',s=8, label =r'$d_4$')
plt.title('Displacement at the end of the truss structure')
plt.legend()
plt.show()

### Normalizing Dataset

Most of the time, different features in the data have different magnitudes. Consider, for example, a dataset containing two features: displacement ($x_1$), which ranges from 0 to 1, and stress ($x_2$), which is about 100-1000 times greater than the displacement. These two features lie in very different ranges, and the large values dominate the small ones. The reason is that many machine learning algorithms use the Euclidean distance between data points in their computations. In this case, the machine learning model treats the features with small values as if they did not exist.

To ensure that this is not the case, we need to scale our features to a similar range, for example, roughly between $-3$ and $3$ or between $-1/3$ and $1/3$.

The scikit-learn preprocessing module has excellent api and documentation on feature scaling [here](https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling).

We'll normalize our dataset using the scikit-learn object [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html).

Good practice usage with the MinMaxScaler and other scaling techniques is as follows:

* __Fit the scaler using available training data.__ For normalization, this means the training data will be used to estimate the minimum and maximum observable values. This is done by calling the `fit()` function.
* __Apply the scale to training data.__ This means you can use the normalized data to train your model. This is done by calling the `transform()` function.
* __Apply the scale to test data.__ This means you can use the normalized data to test your model. This is done by calling the `transform()` function.
* __Apply the scale to data going forward.__ This means you can prepare new data in the future on which you want to make predictions.
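To make the separation between fitting and transforming concrete, the sketch below re-implements min-max scaling with NumPy on made-up numbers. It only illustrates the idea; the function names `fit_min_max` and `transform_min_max` are hypothetical and are not part of scikit-learn.

```python
import numpy as np

def fit_min_max(data):
    """Estimate the per-column minimum and maximum from the training data."""
    return data.min(axis=0), data.max(axis=0)

def transform_min_max(data, data_min, data_max, feature_range=(0.0, 1.0)):
    """Map data linearly so that [data_min, data_max] becomes feature_range."""
    low, high = feature_range
    return (data - data_min) / (data_max - data_min) * (high - low) + low

# Made-up values, one feature per column
train_data = np.array([[1.0, 100.0], [3.0, 300.0], [5.0, 500.0]])
test_data = np.array([[2.0, 600.0]])

d_min, d_max = fit_min_max(train_data)                      # fit on training data only
train_scaled = transform_min_max(train_data, d_min, d_max)  # columns span 0 to 1
test_scaled = transform_min_max(test_data, d_min, d_max)    # may leave the 0 to 1 range

print(train_scaled)
print(test_scaled)
```

Because the minimum and maximum come from the training data only, a test value larger than the training maximum (600 against 500 here) is mapped above 1.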

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Scaling the input data using the MinMaxScaler from scikit-learn
scaler_x = MinMaxScaler().fit(x_train)
x_train_sca = scaler_x.transform(x_train)
x_val_sca = scaler_x.transform(x_val)

# Scaling the output data to the range [-1, 0] with a second MinMaxScaler
normalizer_y = MinMaxScaler(feature_range=(-1., 0.)).fit(y_train)  # alternatives: StandardScaler, MaxAbsScaler
y_train_sca = normalizer_y.transform(y_train)
y_val_sca = normalizer_y.transform(y_val)


# Min and max of the scaled inputs
min_x_train = np.min(x_train_sca)
min_x_val = np.min(x_val_sca)
max_x_train = np.max(x_train_sca)
max_x_val = np.max(x_val_sca)

# Min and max of the scaled outputs
min_y_train = np.min(y_train_sca)
min_y_val = np.min(y_val_sca)
max_y_train = np.max(y_train_sca)
max_y_val = np.max(y_val_sca)


print(f'For the input training set, the min is {min_x_train} and the max is {max_x_train}')
print(f'For the input validation set, the min is {min_x_val} and the max is {max_x_val}')
print(f'For the output train set, the min is {min_y_train} and the max is {max_y_train}')
print(f'For the output validation set, the min is {min_y_val} and the max is {max_y_val}')

As can be seen, the scaled training inputs always lie between 0 and 1 and the scaled training outputs between -1 and 0. Since the scalers are fitted on the training set only, the limits for the validation data are slightly different.

## First ANN model

This model is a neural network with architecture (10-20-1), sigmoid activation function and Stochastic Gradient Descent (SGD) as the optimizer. It is exactly the first model described in the paper.



### Configuration

With the architecture (10-20-1), our network consists of one input layer with 10 features, followed by a sequence of two Dense layers, which are fully connected neural layers. The first (hidden) layer has 20 neurons, and the second (and last) layer has 1 neuron.

Note that the activation function of the last layer is not defined; the default in Keras is `None`, which means that the activation is linear, $g(z)=z$.


We will create the sequential ANN in two ways:
1. Passing a list of layer instances to the Keras constructor;
2. In steps, adding one layer at a time with `add()`.

In [None]:
from keras import models
from keras.layers import Dense, Activation

##First definition

model = models.Sequential([
    Dense(20, input_shape=(10,)),
    Activation('sigmoid'),
    Dense(1)
])

model.summary()

This command defines an ANN with one hidden (intermediate) layer and an output layer, with the following characteristics:
   * Input data for each training example is a 1-D vector of dimension (10);
   * The number of examples is not included in the `input_shape` argument, because at that moment the number of examples that will be used in training is unknown;
   * Be careful - although the trailing comma in `input_shape=(10,)` may suggest that the second axis is the number of examples, Keras expects the first axis of the input tensor to be the number of examples;
   * The hidden layer is of the dense type (fully connected), has 20 neurons, and its activation function is sigmoid;
   * The output layer is dense (fully connected), has one neuron, and its activation function is linear.
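The forward propagation performed by such a (10-20-1) network can be sketched in NumPy. This is an illustration with random, untrained weights, not the trained Keras model; the weight names, the seed and the batch of 5 examples are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)        # arbitrary seed
# Weights are stored as (inputs, neurons) so that x @ W works on row vectors
W1 = rng.normal(size=(10, 20))        # hidden layer: 10 inputs, 20 neurons
b1 = np.zeros(20)
W2 = rng.normal(size=(20, 1))         # output layer: 20 inputs, 1 neuron
b2 = np.zeros(1)

x_batch = rng.random((5, 10))         # 5 hypothetical examples with 10 areas each
hidden = sigmoid(x_batch @ W1 + b1)   # sigmoid hidden layer, shape (5, 20)
y_hat = hidden @ W2 + b2              # linear output layer, shape (5, 1)
print(hidden.shape, y_hat.shape)
```

The first axis (5 here) is the number of examples and passes unchanged through both layers.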

The name used for this ANN was `model`, but any other name could be given.

The `summary()` method presents a summary of the main characteristics of the network.

As you saw in the output of the previous code, the architecture of the ANN is presented in a table, whose content is as follows:
   * The first column gives the name (which includes a number) and the type of each network layer;
   * The second column shows the output shape of the layer, whose last entry is the number of neurons;
   * The third column shows the number of parameters of the layer.

And where did the value $220$ come from?

The number of parameters of the hidden layer is the sum of the entries of the weight matrix $\mathbf W$, with dimension *number of neurons in the layer* by *number of inputs*, and of the bias vector $\mathbf b$, with one bias per neuron:
$$
20\times 10 + 20 = 220
$$
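The same count can be reproduced for every dense layer of a fully connected network. The helper name `dense_param_counts` is hypothetical; it simply applies the weights-plus-biases rule to consecutive layer sizes.

```python
def dense_param_counts(layer_sizes):
    """Weights plus biases for each dense layer of a fully connected network."""
    counts = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        counts.append(n_out * n_in + n_out)  # weight matrix + bias vector
    return counts

counts = dense_param_counts([10, 20, 1])
print(counts, sum(counts))  # [220, 21] 241
```

For the (10-20-1) architecture this gives 220 parameters in the hidden layer and 21 in the output layer.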

  It is important to visualize the ANN architecture because any error message raised during compilation or training references the layer by its name and number, which can be obtained with the `summary()` method.

Another way to visualize an ANN in Keras is to plot it as a graph using the `plot_model` function.







In [None]:
from keras.utils import plot_model
import pydot
plot_model(model, to_file = '/content/model.png', show_shapes = True)

In [None]:
from keras import models
from keras import layers

##Second definition

model = models.Sequential()
model.add(layers.Dense(20, activation='sigmoid', input_shape=(10,)))
model.add(layers.Dense(1))

model.summary()

In this case, the first command imports all types of models from Keras. The second command imports all types of layers from Keras, instead of only the `Dense` and `Activation` classes, as previously done.

The third command creates the ANN instance using a sequential model.

The fourth command adds the first layer of the ANN, of the dense type, with 20 neurons and sigmoid activation function, whose input is a row vector of dimension (10).

The fifth command adds a second layer of the dense type, with 1 neuron and no activation function.

#### Dimensions of the input data

The ANN needs to know the dimensions of the input data; for this reason, the first layer in a sequential model needs to receive this information.

Only the first layer needs this information, since Keras automatically infers the dimensions of the input data of the other layers from the number of neurons of each layer.

There are several ways to define the size of the ANN input data:

- Passing the `input_shape` argument to the first layer. This argument is a tuple whose entries are integers or `None`, where `None` indicates that any positive integer can be expected;

- In the `input_shape` argument the number of examples is not included, and Keras automatically infers this number from the input data provided;

- Some 2D layers, such as dense layers, also support the specification of the input data via the `input_dim` argument.

As an example, the following commands are equivalent:

```
ann = models.Sequential()
ann.add(Dense(20, input_shape=(10,)))
```
or
```
ann = models.Sequential()
ann.add(Dense(20, input_dim=10))
```

If you need to specify a fixed batch size, you can pass the `batch_size` argument to the input layer. Thus, for example, if `batch_size = 32` and `input_shape = (6, 8)` are used for the first layer, a tensor of dimension (32, 6, 8) will be expected as input data.

The development of an ANN requires many iterations to obtain a desirable result. To avoid executing the same configuration commands over and over again, which can be long depending on the size of the ANN, you can create a function to configure the ANN. For this we have, for example, the following function:

In [None]:
def build_model(data_shape=(10,)):
    model = models.Sequential()
    model.add(layers.Dense(units=20, activation='sigmoid', input_shape=data_shape))
    model.add(layers.Dense(units=1))
    return model

In [None]:
model = build_model()

In this case the argument `data_shape` represents the dimension of the input data without considering the number of examples.

### Compilation

The ANN is generated in the compilation stage, where the loss function, the training method and the metrics for the ANN evaluation are defined and configured:

+ The loss function `mean_squared_error`: how the network measures its performance on the training data, and thus how it steers itself in the right direction.
+ The optimizer `sgd`: the mechanism through which the network updates itself based on the data it sees and its loss function.
+ The metrics to monitor during training and testing: `mean_absolute_error` and `mean_absolute_percentage_error`.

Keras follows the principle of making things simple while still allowing the user to control whatever is needed. If we want, we can configure the optimizer completely. See more details at the [link](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers) or [here](https://ruder.io/optimizing-gradient-descent/).

### Loss Function

The squared error $E$ of example $i$ is the sum of the squared differences between the predicted values $\hat{\mathbf{y}}^{(i)}$ and the target values ${\mathbf{y}}^{(i)}$,
$$
E(\hat{\mathbf{y}}^{(i)},{\mathbf{y}}^{(i)})=\sum\limits_{j=1}^{n_y} \left(\hat{y}^{(i)}_j -{y}^{(i)}_j\right)^2= \left\|\hat{\mathbf{y}}^{(i)} - {\mathbf{y}}^{(i)} \right\|_2^2
$$
for the examples $i=1,\dots,m$.
Then, the loss function $J$, the mean squared error,
$$
J\left(\mathbf{W},\mathbf{b}\right)={1\over m}\sum\limits_{i=1}^{m}E(\hat{\mathbf{y}}^{(i)},{\mathbf{y}}^{(i)}) = {1\over m}\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n_y} \left(\hat{y}^{(i)}_j -{y}^{(i)}_j\right)^2 = {1\over m} \sum\limits_{i=1}^{m}\left\|\hat{\mathbf{y}}^{(i)} - {\mathbf{y}}^{(i)} \right\|_2^2
$$
depends on the weights $\mathbf{W}$ and biases $\mathbf{b}$.
The result is never negative, regardless of the sign of the predicted and actual values, and a perfect value is 0.0.

The squaring means that larger mistakes result in more error than smaller mistakes, that is, the model is punished for making larger mistakes.
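The two formulas can be evaluated directly with NumPy. The sketch below uses made-up predictions and targets with $m = 3$ examples and $n_y = 2$ outputs; the function names are hypothetical.

```python
import numpy as np

def squared_error(y_hat, y):
    """E for each example: sum of squared differences over the n_y outputs."""
    return np.sum((y_hat - y) ** 2, axis=1)

def loss_j(y_hat, y):
    """J: average of E over the m examples."""
    return np.mean(squared_error(y_hat, y))

# Made-up targets and predictions with m = 3 examples and n_y = 2 outputs
y_true = np.array([[1.0, 2.0], [0.0, 1.0], [2.0, 2.0]])
y_pred = np.array([[1.0, 1.0], [1.0, 1.0], [2.0, 4.0]])

print(squared_error(y_pred, y_true))  # [1. 1. 4.]
print(loss_j(y_pred, y_true))         # 2.0
```

The third example, with the largest single deviation, contributes four times as much as each of the others. Note that the built-in Keras mean squared error also averages over the output components, so for $n_y > 1$ it differs from $J$ above by the constant factor $1/n_y$; with a single output the two coincide.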

The mean squared error loss function can be used in Keras by specifying `mse` or `mean_squared_error` as the loss function when compiling the model.

### Optimizer

SGD is the same as gradient descent, except that the parameters are updated using batches of the training data instead of the whole dataset at once. The number of examples in each batch is called the *mini-batch size*.

Faster optimizers are available in the literature to speed up the training step. We will apply SGD with momentum (referred to here simply as SGD), but be aware that there are other popular optimizers, such as Nesterov Accelerated Gradient, AdaGrad, RMSProp, Adam, and Nadam.

According to the literature, Adam is often the best default choice. See more about this optimizer at this [link](https://www.aiplusinfo.com/blog/what-is-the-adam-optimizer-and-how-is-it-used-in-machine-learning/).

The SGD optimizer used here has a learning rate of 0.001 and momentum of 0.9.
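The effect of these two hyperparameters can be sketched on a one-dimensional quadratic. The update rule below (velocity = momentum × velocity − learning rate × gradient, then the velocity is added to the parameter) follows the rule given in the Keras `SGD` documentation; the test function $f(w) = (w-3)^2$, the starting point and the number of steps are arbitrary assumptions.

```python
def sgd_momentum_step(w, velocity, grad, learning_rate=0.001, momentum=0.9):
    """One SGD-with-momentum update; returns the new parameter and velocity."""
    velocity = momentum * velocity - learning_rate * grad
    return w + velocity, velocity

w, velocity = 0.0, 0.0            # arbitrary starting point
for _ in range(5000):
    grad = 2.0 * (w - 3.0)        # derivative of f(w) = (w - 3)^2
    w, velocity = sgd_momentum_step(w, velocity, grad)

print(w)  # close to the minimiser w = 3
```

The velocity accumulates past gradients, so consecutive steps in the same direction grow larger than the plain learning rate alone would allow.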

### Metrics

A metric or Key Performance Indicator (KPI) is a function that is used to judge the performance of your model. The most commonly used are defined below.

**MAE**

The Mean Absolute Error (`mean_absolute_error`, `MAE`, `mae`) computes the mean absolute error between the labels and the predictions,
$$
MAE = \frac{1}{n} \sum_{i=1}^n \left|y^{(i)} - \hat{y}^{(i)}\right|
$$
where $n$ is the length of the validation dataset.

**MAPE**

The Mean Absolute Percentage Error (`mean_absolute_percentage_error`, `MAPE`, `mape`) is defined as
$$
MAPE = \frac{100}{n} \sum_{i=1}^n \left|\frac{y^{(i)} - \hat{y}^{(i)}}{y^{(i)}}\right|
$$

It is similar to MAE, but normalized by the true observation. The downside is that this metric becomes problematic when the true observation value $y^{(i)}$ is zero or close to zero.

**MSE**

Mean squared error (`mean_squared_error`, `MSE` or `mse`) is a quadratic scoring rule that also measures the average magnitude of the error. It’s the average of squared differences between prediction and actual observation,
$$
MSE =\frac{1}{n} \sum_{i=1}^n (y^{(i)} -\hat{y}^{(i)})^2
$$

MSE is like a combination measurement of bias and variance of your prediction, i.e., $MSE = Bias^2 + Var$.

MAE and MSE are two of the most common metrics used to measure accuracy for continuous variables. MAE expresses the average model prediction error in the units of the variable of interest, while MSE expresses it in squared units. Both can range from $0$ to $+\infty$ and are indifferent to the direction of the errors. They are negatively-oriented scores, which means lower values are better.

Taking the average of the squared errors has some interesting implications for MSE. Since the errors are squared, the MSE gives a relatively high weight to large errors. This means the MSE should be more useful when large errors are particularly undesirable.
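The three metrics can be compared on a small set of made-up values. The NumPy functions below are a sketch written from the definitions above (with the absolute value in MAPE), not the Keras implementations.

```python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def mape(y_true, y_pred):
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Made-up targets and predictions
y_true = np.array([2.0, 4.0, 4.0, 8.0])
y_pred = np.array([1.0, 4.0, 5.0, 6.0])

print(mae(y_true, y_pred))   # 1.0
print(mape(y_true, y_pred))  # 25.0
print(mse(y_true, y_pred))   # 1.5
```

The single error of 2 accounts for half of the MAE but for two thirds of the MSE, which illustrates the higher weight that squaring gives to large errors.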


In [None]:
from tensorflow.keras import optimizers

sgd = optimizers.SGD(learning_rate=0.001, momentum=0.9)

model.compile(optimizer=sgd,
              loss='mean_squared_error',
              metrics=['mean_absolute_error', 'mean_absolute_percentage_error'])

The code below defines a callback that creates a loss history for the validation set. It evaluates the model after each epoch and appends the result to an array. It is used to generate a plot of the model loss for both the training and the validation sets.

However, it is also very slow and makes training last for more than half an hour with a large number of epochs, which is why the code that calls it is normally commented out.

In [None]:
class TestLossHistory(keras.callbacks.Callback):
    """Evaluate the model on a held-out set at the end of every epoch."""
    def __init__(self, x_test, y_test):
        super().__init__()
        self.x_test = x_test
        self.y_test = y_test
    def on_train_begin(self, logs=None):
        self.losses = []  # one entry [loss, metric_1, ...] per epoch
    def on_epoch_end(self, epoch, logs=None):
        # verbose=0 avoids printing a progress bar for every evaluation
        self.losses.append(self.model.evaluate(self.x_test, self.y_test, verbose=0))
    def on_train_end(self, logs=None):
        self.losses = np.array(self.losses)

### Training

#### Gradient descent

The Gradient Descent (GD) method is the basic *engine* of artificial neural networks.

As discussed in class, GD is an iterative method with the following steps:

* Initialization of the ANN parameters: assign initial values to the weights ($\mathbf W$) and biases ($\mathbf b$);
* Execution of the ANN for all examples of the training dataset, so that, given the inputs, the outputs predicted by the ANN are calculated;
* Calculation of the loss function for all training examples, through the sum of the error function;
* Calculation of the gradient of the loss function with respect to all parameters of the ANN;
* Update of the ANN parameters in the direction opposite to the gradient, in order to reduce the value of the loss function.

Steps 2 to 5 are repeated iteratively. Therefore, training a deep neural network can be an extremely time-consuming task, especially for complex problems.
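The five steps can be sketched in NumPy for the simplest possible model, a single linear neuron trained on made-up data. This is an illustration of the loop only; the data, the learning rate and the number of iterations are arbitrary assumptions, and the gradients are written by hand rather than generated by a framework.

```python
import numpy as np

rng = np.random.default_rng(0)                     # arbitrary seed
x = rng.random((100, 3))                           # made-up inputs
y = x @ np.array([[1.0], [-2.0], [0.5]]) + 0.3     # made-up linear targets

W = np.zeros((3, 1))                               # step 1: initialise parameters
b = np.zeros(1)
learning_rate = 0.1
losses = []
for _ in range(5000):
    y_hat = x @ W + b                              # step 2: forward pass, all examples
    error = y_hat - y
    losses.append(np.mean(error ** 2))             # step 3: loss (mean squared error)
    grad_W = 2.0 * x.T @ error / len(x)            # step 4: gradient of the loss
    grad_b = 2.0 * error.mean(axis=0)
    W -= learning_rate * grad_W                    # step 5: move against the gradient
    b -= learning_rate * grad_b

print(losses[0], losses[-1])
```

Every pass through the loop uses all 100 examples before a single parameter update is made, which is the cost discussed next.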

In the algorithm presented above, the ANN parameters are updated only after the gradient of the loss function has been calculated for all training examples. This can be a big problem, or even unfeasible, if we have a large number of training examples, for example, something in the order of 100 thousand or 1 million, which is not uncommon.

To avoid computer memory problems, the optimization process can be changed to update the ANN parameters after processing only a few examples (a *batch*) of the training dataset. In the literature, this process is generally called Stochastic Gradient Descent (SGD), because the training data is randomly divided into smaller sets.

The literature is a bit inconsistent about the nomenclature of the other versions of GD. As far as the size of the dataset used to update the parameters is concerned, the options are:
* Batch Gradient Descent (BGD) or simply GD - runs on the full dataset. The gradient is more accurate, but intractable for huge datasets;
* Stochastic Gradient Descent (SGD) - picks a random instance at each step. The gradient can be noisy;
* Mini-batch Gradient Descent (MBGD), also called Stochastic Gradient Descent (SGD) - runs on random subsets of the data of size `batch_size`, looking for a balance between using the full dataset and using a single example before updating the parameters. It is not very noisy and is computationally tractable, that is, the best of both worlds.

As usual, there is no free lunch and all options have pros and cons. Batch Gradient Descent can reach the global minimum at a terribly slow pace, especially in huge problems, since in modern architectures the number of parameters may be in the billions. Mini-batch Gradient Descent usually gets to the global minimum faster than BGD, but it can get stuck in a local minimum more easily, and SGD usually has more difficulty reaching the global minimum than the other two.

We'll train our NN using MBGD and BGD, in order to compare effectiveness.

In Keras, `batch_size` refers to the batch size in MBGD. The default in Keras is MBGD with `batch_size=32`. If you want to run GD, you need to set `batch_size` to the number of training samples.
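The relation between `batch_size` and the number of parameter updates per epoch can be sketched with a small index generator. This is not how Keras implements batching internally; `iterate_minibatches` is a hypothetical helper, and 416 is the number of training examples after the 80/20 split.

```python
import math
import numpy as np

def iterate_minibatches(n_examples, batch_size, rng):
    """Yield index arrays that cover a shuffled dataset once (one epoch)."""
    order = rng.permutation(n_examples)
    for start in range(0, n_examples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)    # arbitrary seed
m = 416                           # training examples after the 80/20 split

for batch_size in (32, m):        # mini-batch versus full batch
    batches = list(iterate_minibatches(m, batch_size, rng))
    assert len(batches) == math.ceil(m / batch_size)
    print(batch_size, len(batches))  # 32 gives 13 updates per epoch, 416 gives 1
```

With `batch_size=32` the parameters are updated 13 times per epoch, whereas with `batch_size=416` there is a single update per epoch, which corresponds to BGD.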

Below, training and testing are performed with mini-batches.

The training of an ANN is carried out with the `fit()` method. For example, to train an ANN for 10 epochs, the command is as follows:


```
model.fit(x_train, y_train, epochs=10, verbose=2)
```
The `fit` method trains the ANN with the training examples, composed of the input data, `x_train`, and the output data, `y_train`.
`epochs = 10` means that 10 training epochs are used and `verbose = 2` means that after each epoch the values of the loss function and the metrics are printed.

There are several other arguments for the `fit` method, which you can explore as your knowledge of the subject and your needs grow.

All options for the `fit` method can be seen in detail in the Keras documentation.

In [None]:
history_with_minibatch = model.fit(x_train_sca, y_train_sca, epochs=500, batch_size=32, verbose=2)

# To use the test loss history, comment out the line above and uncomment the lines below
#test_history_with_minibatch = TestLossHistory(x_val_sca, y_val_sca)
#history_with_minibatch = model.fit(x_train_sca, y_train_sca, epochs=10000, batch_size=32,
#                                   callbacks=[test_history_with_minibatch])



#### Saving the training process

If the training process is saved, it is possible to plot the loss function, allowing a more detailed analysis of the process. For this we use:


```
history_MODEL = model.fit(x_train, y_train, epochs=1000)
```
In this training command, the values of the loss function and the metrics at each epoch are saved in the `history_MODEL` object.

The `history_MODEL` object contains a dictionary with the values of the loss function and metrics for each epoch, which can be accessed using the following commands:
```
history_dict = history_MODEL.history
history_dict.keys()
```
The `history_dict.keys()` command displays the keys of the dictionary saved during training.

In [None]:
history_with_minibatch.history.keys()

In [None]:
plt.plot(history_with_minibatch.history['loss'],linewidth=2.0)
plt.title('Training Mean Squared Error (MSE)\n with architecture (10-20-1) using mini-batch', fontsize=12)
plt.xlabel('epochs')
plt.ylabel('Loss')
#plt.plot(test_history_with_minibatch.losses.T[0])

The same training is now performed with batch gradient descent (BGD), that is, with a batch size equal to the size of the training set.

In [None]:
# Redefining the model
model2 = build_model()

model2.compile(optimizer=sgd,
              loss='mean_squared_error',
              metrics=['mean_absolute_error', 'mean_absolute_percentage_error'])

In [None]:
history_without_minibatch = model2.fit(x_train_sca, y_train_sca, epochs=500, batch_size=x_train.shape[0], verbose = 0)


# To use the test loss history, comment the line above and uncomment the lines below
#test_history_without_minibatch = TestLossHistory(x_val_sca, y_val_sca)
#history_without_minibatch = model2.fit(x_train_sca, y_train_sca, epochs=10000, batch_size=x_train_sca.shape[0],
#                                       callbacks=[test_history_without_minibatch])

### ANN performance

After training the ANN, it is important to evaluate its performance with new data, that is, data not used in training.

The ANN can be evaluated on the test data set using the `evaluate` method.

In [None]:
model2_metric_train = model2.evaluate(x_train_sca, y_train_sca)
print('Training performance')
print(model2_metric_train)
model2_metric_test = model2.evaluate(x_val_sca, y_val_sca)
print('Test performance')
print(model2_metric_test)
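`evaluate` returns the loss followed by the metrics passed to `compile`, here MSE, MAE and MAPE. The NumPy sketch below reproduces the usual definitions of these three quantities on toy values. The helper `regression_metrics` is hypothetical, and the `eps` guard against division by zero is an assumption; the exact internals of Keras may differ.

```python
import numpy as np

def regression_metrics(y_true, y_pred, eps=1e-7):
    """Return (MSE, MAE, MAPE in percent) for arrays of the same shape."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    mae = np.mean(np.abs(err))
    # eps guards against division by zero for targets equal to 0
    mape = 100.0 * np.mean(np.abs(err) / np.maximum(np.abs(y_true), eps))
    return mse, mae, mape

# Toy values, chosen only to make the arithmetic easy to follow
mse, mae, mape = regression_metrics([1.0, 2.0, 4.0], [1.5, 2.0, 3.0])
print(mse, mae, mape)
```

Because MAPE divides by the target, it can become very large when scaled targets are close to zero, so MSE and MAE are usually easier to interpret on scaled data.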

In [None]:
plt.plot(history_with_minibatch.history['loss'], label='With minibatch',linewidth=1.5)
plt.title('Training Mean Squared Error (MSE)\nwith neural network architecture (10-20-1)', fontsize=12)
plt.xlabel('epochs')
plt.ylabel('Loss')
plt.plot(history_without_minibatch.history['loss'], label='Without minibatch',linewidth=1.5)
plt.legend()
plt.xlim([-10, 200])
#plt.plot(test_history_without_minibatch.losses.T[0])

As can be seen above, training converged much faster with mini-batches, because the weights are updated several times per epoch instead of only once.
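The same effect can be reproduced outside Keras. The following is a minimal NumPy sketch on a toy linear regression problem (synthetic, noise-free data; the learning rate, data size and the `train` helper are assumptions for illustration), comparing mini-batch updates with a single full-batch update per epoch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                      # noise-free toy targets

def mse(w):
    return np.mean((X @ w - y) ** 2)

def train(batch_size, epochs=5, lr=0.1):
    """Plain gradient descent on the MSE; returns the loss after each epoch."""
    w = np.zeros(3)
    losses = []
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad          # one update per batch
        losses.append(mse(w))
    return losses

loss_mini = train(batch_size=32)        # 8 updates per epoch
loss_full = train(batch_size=len(X))    # 1 update per epoch (BGD)
print("mini-batch:", loss_mini[-1])
print("full batch:", loss_full[-1])
```

After the same number of epochs, the mini-batch run has performed eight times as many updates, which is why its loss is lower in this toy setting.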

The ANN can also be evaluated by computing the predicted outputs for the test set examples using the `predict` method, as follows:

```
y_prev = model.predict(x_test)
```



First, the predictions are computed and converted back to the original units. Then, a function is defined to plot the targets in blue and the predictions in red.

In [None]:
pred_sca_train = model.predict(x_train_sca)
pred_sca_val = model.predict(x_val_sca)
print(pred_sca_val.shape,pred_sca_train.shape)

In [None]:
y_new_train = normalizer_y.inverse_transform(pred_sca_train)
y_new_val = normalizer_y.inverse_transform(pred_sca_val)
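The call to `inverse_transform` maps the scaled predictions back to physical units. The sketch below illustrates the idea with a hand-written standardization scaler; `SimpleStandardScaler` is a hypothetical stand-in, and the notebook's `normalizer_y` may use a different scaling scheme.

```python
import numpy as np

class SimpleStandardScaler:
    """Minimal stand-in for a scaler with fit/transform/inverse_transform."""
    def fit(self, a):
        self.mean_ = a.mean(axis=0)
        self.std_ = a.std(axis=0)
        return self

    def transform(self, a):
        return (a - self.mean_) / self.std_

    def inverse_transform(self, a):
        return a * self.std_ + self.mean_

# Toy displacement targets in mm (illustrative values only)
y = np.array([[-12.0], [-8.0], [-10.0], [-6.0]])
scaler = SimpleStandardScaler().fit(y)
y_sca = scaler.transform(y)               # what the ANN is trained on
y_back = scaler.inverse_transform(y_sca)  # back to physical units
print(y_sca.ravel())
print(y_back.ravel())
```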

In [None]:
# Plot of target and predicted values (first output column)
def Target_vs_Predic(y_test,y_pred):
  plt.figure(figsize=(16, 6))
  plt.plot(y_test[:,0], 'o', color = 'darkslateblue', label='Target')
  plt.plot(y_pred[:,0], 'o', color = 'crimson', label='ANN prediction')
  plt.title('Target vs prediction of the ANN')
  plt.xlabel('Example')
  plt.ylabel(r'$d_{4} [mm]$')
  plt.legend()
  plt.show()

In [None]:
Target_vs_Predic(y_val,y_new_val)

 ![](https://drive.google.com/uc?export=view&id=1M4Yho73GUFuFCr0CqNOrv5UhvV5r60Jg)


## Changing number of neurons in the hidden layer

Up to this point, the number of neurons in the hidden layer was chosen rather arbitrarily as 20. To experiment with this choice, the architectures (10-10-1), (10-20-1), (10-30-1), (10-40-1), and (10-50-1) will be considered, keeping everything else as before.

In [None]:
# Creates a model with the specific number of neurons num_neurons and specific activation g
def make_model(num_neurons=20, g = 'sigmoid'):
  model = models.Sequential()
  model.add(layers.Dense(units=num_neurons, activation=g, input_shape=(10,)))
  model.add(layers.Dense(1))

  model.compile(optimizer=sgd,
                loss='mean_squared_error',
                metrics=['mean_absolute_error', 'mean_absolute_percentage_error'])
  return model

In [None]:
model_10_neurons = make_model(num_neurons=10)
model_20_neurons = make_model(num_neurons=20)
model_30_neurons = make_model(num_neurons=30)
model_40_neurons = make_model(num_neurons=40)
model_50_neurons = make_model(num_neurons=50)

In [None]:
# Training the models - output will be suppressed
print('10 neurons')
hist_10_neurons = model_10_neurons.fit(x_train_sca, y_train_sca, epochs=500, verbose=0)
print('20 neurons')
hist_20_neurons = model_20_neurons.fit(x_train_sca, y_train_sca, epochs=500, verbose=0)
print('30 neurons')
hist_30_neurons = model_30_neurons.fit(x_train_sca, y_train_sca, epochs=500, verbose=0)
print('40 neurons')
hist_40_neurons = model_40_neurons.fit(x_train_sca, y_train_sca, epochs=500, verbose=0)
print('50 neurons')
hist_50_neurons = model_50_neurons.fit(x_train_sca, y_train_sca, epochs=500, verbose=0)
print('Done!')

In [None]:
plt.title('MSE for the models with\n varying number of neurons in hidden layers', fontsize=12)
plt.xlabel('epochs')
plt.ylabel('Loss')
plt.plot(hist_10_neurons.history['loss'], label='10',linewidth=1.0)
plt.plot(hist_20_neurons.history['loss'], label='20',linewidth=1.0)
plt.plot(hist_30_neurons.history['loss'], label='30',linewidth=1.0)
plt.plot(hist_40_neurons.history['loss'], label='40',linewidth=1.0)
plt.plot(hist_50_neurons.history['loss'], label='50',linewidth=1.0)
plt.xlim([0,20])
#plt.ylim([0,2500])
plt.legend();

From the plot above, it can be seen that learning gets faster as the number of neurons increases. However, the gain from 40 to 50 neurons is less significant than from 10 to 20 neurons.

In [None]:
# Plot of target values and the predictions of two models (first output column)
def Target_vs_Predic2(y_test,y_pred1,label1,y_pred2,label2):
  plt.figure(figsize=(16, 6))
  plt.plot(y_test[:,0], 'o', color = 'darkslateblue', label='Target')
  plt.plot(y_pred1[:,0], 'o', color = 'crimson', label=label1)
  plt.plot(y_pred2[:,0], 'o', color = 'gold', label=label2)
  plt.title('Target vs prediction of the ANN')
  plt.xlabel('Example')
  plt.ylabel(r'$d_{4} [mm]$')
  plt.legend()
  plt.show()

In [None]:
pred_sca_val = model_10_neurons.predict(x_val_sca)
y_new_val1 = normalizer_y.inverse_transform(pred_sca_val)

pred_sca_val = model_50_neurons.predict(x_val_sca)
y_new_val2 = normalizer_y.inverse_transform(pred_sca_val)

Target_vs_Predic2(y_val,y_new_val1,'10 neurons', y_new_val2, '50 neurons')

## Activation functions

Until now, the sigmoid or logistic function,
$$g(z) = \frac{1}{1+e^{-z}}$$
was used to activate the neurons.

The purpose of the activation function is to introduce non-linearity into the output of a neuron. See the figure below for the most commonly used activation functions.

![](https://drive.google.com/uc?export=view&id=1iZPDL1JuFLeH9uOc-SfbTfBWBnXwamRM)

This non-linearity is one of the factors that affect the results and the accuracy of the model. When the NN has several hidden layers, linear activation functions simply produce a composition of affine transformations, which is itself a single affine transformation, so the model is no more expressive than a simple linear model (or standard logistic regression, if the output is a sigmoid). Without non-linearity, no interesting models are computed, no matter how deep the neural network is.
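To see why purely linear activations add no expressive power, the NumPy sketch below builds a two-layer network with identity activations and shows that a single affine map gives the same outputs. The weights are random and the layer sizes (10-20-1) simply mirror the architecture used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 10))            # 5 examples, 10 inputs

# Two layers with linear (identity) activation: 10 -> 20 -> 1
W1, b1 = rng.normal(size=(10, 20)), rng.normal(size=20)
W2, b2 = rng.normal(size=(20, 1)), rng.normal(size=1)
two_layers = (x @ W1 + b1) @ W2 + b2

# Equivalent single affine map: 10 -> 1
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layers, one_layer))
```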

*Regarding the choice of activation function in the last layer*, it imposes restrictions on the outputs of the ANN; therefore, its choice depends on the type of problem we want to solve.
For example, the sigmoid function only generates values between $0$ and $1$, the hyperbolic tangent provides values between $-1$ and $1$, and the ReLU function only generates non-negative values, between $0$ and $+\infty$.

In our case, we have a regression problem with arbitrary output values (function fitting), so the linear activation function (the default in Keras) should be used.
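The output ranges mentioned above can be checked numerically. This short sketch evaluates each candidate output activation on a grid of pre-activation values; the grid limits are arbitrary.

```python
import numpy as np

z = np.linspace(-20.0, 20.0, 2001)    # pre-activation values of the output neuron

outputs = {
    "sigmoid": 1.0 / (1.0 + np.exp(-z)),
    "tanh": np.tanh(z),
    "relu": np.maximum(0.0, z),
    "linear": z,
}

for name, a in outputs.items():
    print(f"{name:8s} min={a.min():8.3f} max={a.max():8.3f}")
```

Only the linear output can reach arbitrary negative and positive values, which is what a displacement prediction needs.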

*Regarding the activation functions of the intermediate layers*, the choice is not so direct. But there are a few tips:
* The sigmoid function can be used if both input and output data are in the range $0$ to $1$;
* As a general rule, use the hyperbolic tangent instead of the sigmoid if the input and output data are in the range $-1$ to $1$;
* The hyperbolic tangent and the sigmoid are very stable, since their outputs are bounded. It is observed, however, that their use in the intermediate layers makes the ANN harder to train, since in general they cause saturation and vanishing-gradient problems (some gradients go to zero, so the corresponding weights stop being updated);
* A better choice for the intermediate layers is ReLU or Leaky ReLU, which do not saturate for positive inputs and therefore do not suffer from small gradients there. However, because their outputs are unbounded, they can show the opposite problem, gradient explosion (some weights grow very large). ReLU and Leaky ReLU are the most widely used activation functions for intermediate layers.
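Saturation can be made visible by looking at the derivatives of the activation functions, which multiply the gradients during backpropagation. The sketch below is only an illustration; the evaluation points are arbitrary.

```python
import numpy as np

def d_sigmoid(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def d_relu(z):
    return (z > 0).astype(float)

z = np.array([0.0, 2.0, 5.0, 10.0])
print("z        ", z)
print("sigmoid' ", d_sigmoid(z))
print("tanh'    ", d_tanh(z))
print("ReLU'    ", d_relu(z))
```

For large inputs, the sigmoid and tanh derivatives become tiny, while the ReLU derivative stays at 1 for any positive input.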

Some of these functions are plotted below.

In [None]:
x = np.linspace(-3, 3, 500)
relu = np.maximum(0, x)
sigmoid = 1/(1+np.exp(-x))
tanh = np.tanh(x)
softplus = np.log(1+np.exp(x))

plt.plot(x, relu, label='ReLU',linewidth=1.5)
plt.plot(x, sigmoid, label='sigmoid',linewidth=1.5)
plt.plot(x, tanh, label='tanh',linewidth=1.5)
plt.plot(x, softplus, label='softplus',linewidth=1.5)
plt.legend(fancybox=True, framealpha=0.5, loc='upper right')
plt.title('Some common activation functions', fontsize=12)

Now, other activation functions will be used in the model and their results will be evaluated. The functions are sigmoid, ReLU, tanh and softplus. They are applied to the hidden layer (20 neurons) only, since the last layer must have a linear activation function because this is a regression problem.

The other hyperparameters are kept constant, with SGD as optimizer, MSE as loss and the (10-20-1) architecture.

In [None]:
sigmoid_model = make_model(g = 'sigmoid')
relu_model = make_model(g = 'relu')
tanh_model = make_model(g = 'tanh')
softplus_model = make_model(g = 'softplus')

In [None]:
# Training the models
print('Sigmoid')
sigmoid_history = sigmoid_model.fit(x_train_sca, y_train_sca, epochs=500, batch_size=32, verbose = 0)
print('ReLU')
relu_history = relu_model.fit(x_train_sca, y_train_sca, epochs=500, batch_size=32, verbose = 0)
print('Tanh')
tanh_history = tanh_model.fit(x_train_sca, y_train_sca, epochs=500, batch_size=32, verbose = 0)
print('Softplus')
softplus_history = softplus_model.fit(x_train_sca, y_train_sca, epochs=500, batch_size=32, verbose = 0)
print('Done!!!')

In [None]:
plt.style.use('fivethirtyeight')
plt.title('Training MSE for the models with different activation functions', fontsize=12)
plt.xlabel('epochs')
plt.ylabel('Loss')
plt.plot(sigmoid_history.history['loss'], label='sigmoid',linewidth=1.5)
plt.plot(relu_history.history['loss'], label='relu',linewidth=1.5)
plt.plot(tanh_history.history['loss'], label='tanh',linewidth=1.5)
plt.plot(softplus_history.history['loss'], label='softplus',linewidth=1.5)
plt.xlim([0,20])
plt.legend(fancybox=True, framealpha=0.5)

In [None]:
pred_sca_val = sigmoid_model.predict(x_val_sca)
y_new_val1 = normalizer_y.inverse_transform(pred_sca_val)

pred_sca_val = softplus_model.predict(x_val_sca)
y_new_val2 = normalizer_y.inverse_transform(pred_sca_val)

Target_vs_Predic2(y_val,y_new_val1,'Sigmoid',y_new_val2, 'Softplus')

## Different Optimizers with different activations
A comparison of different optimizers and activation functions will now be made. Models with the (10-20-1) architecture will be built using these different combinations, and the results will be gathered and shown.

The optimizers that will be used are:
+ SGD
+ AdaGrad
+ Adadelta
+ RMSprop
+ Adam

And the activation functions will be:
+ sigmoid
+ tanh
+ softplus
+ ReLU

In [None]:
# Makes a model with different optimizer and activation

def make_model_2(activation, optimizer):
  model = models.Sequential()
  model.add(layers.Dense(20, activation=activation, input_shape=(10,)))
  model.add(layers.Dense(1))
  model.compile(optimizer=optimizer,
                  loss='mean_squared_error',
                  metrics=['mean_absolute_error', 'mean_absolute_percentage_error'])
  return model

In [None]:
model_s = {'SGD': {}, 'AdaGrad': {}, 'Adadelta': {}, 'RMSprop': {}, 'Adam': {}}
activation_s =  ['sigmoid', 'tanh', 'softplus', 'relu']

In [None]:
i = 0
for optimizer in model_s.keys():
    for activation in activation_s:
        print(f'Combination {i}: {optimizer} with {activation}')
        model = make_model_2(activation, optimizer)
        hist = model.fit(x_train_sca, y_train_sca, epochs=500, verbose=0)
        train_loss = hist.history['loss']
        val_loss = model.evaluate(x_val_sca, y_val_sca, verbose=0)
        model_s[optimizer][activation] = {'model': model, 'train': train_loss,
                                         'val': val_loss, 'hist': hist}
        i += 1

In [None]:
plt.style.use('fivethirtyeight')
plt.title('Train MSE for different optimizers\nusing softplus as activation', fontsize=12)
plt.xlabel('epochs')
plt.ylabel('Loss')
plt.plot(model_s['SGD']['softplus']['hist'].history['loss'], label='SGD',linewidth=1.5)
plt.plot(model_s['AdaGrad']['softplus']['hist'].history['loss'], label='AdaGrad',linewidth=1.5)
plt.plot(model_s['Adadelta']['softplus']['hist'].history['loss'], label='Adadelta',linewidth=1.5)
plt.plot(model_s['RMSprop']['softplus']['hist'].history['loss'], label='RMSprop',linewidth=1.5)
plt.plot(model_s['Adam']['softplus']['hist'].history['loss'], label='Adam',linewidth=1.5)
plt.xlim([0,100])
plt.legend(fancybox=True, framealpha=0.5, fontsize=10)

Using softplus as activation function, it can be seen that training is faster with SGD or RMSprop. Adam also reached a good result, in a slightly larger number of epochs.

To show the resulting test and training data in a dataframe, a pandas MultiIndex will be used. To create it, [the user guide from pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html) was consulted.

In [None]:
arrays = [['SGD', 'AdaGrad', 'Adadelta', 'RMSprop', 'Adam'],
          ['Training', 'Val']]
index = pd.MultiIndex.from_product(arrays, names=['Optimizer', 'Train/Val'])
index

In [None]:
# Rows: activations; columns: (optimizer, Training/Val) pairs in MultiIndex order
data = np.zeros((4,10))
for i, optimizer in enumerate(model_s.keys()):
    for j, activation in enumerate(activation_s):
        # Final training loss (last epoch)
        data[j, 2*i] = model_s[optimizer][activation]['train'][-1]
        # Validation loss (first value returned by evaluate)
        data[j, 2*i + 1] = model_s[optimizer][activation]['val'][0]
data
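The loop above fills a (4, 10) array whose columns must follow the same order as the `MultiIndex` built with `from_product`, that is, the Cartesian product with the first list varying slowest. The pure-Python sketch below makes this column layout explicit; the helper `column_of` is hypothetical.

```python
from itertools import product

optimizers = ['SGD', 'AdaGrad', 'Adadelta', 'RMSprop', 'Adam']
splits = ['Training', 'Val']

# Cartesian product with the first list varying slowest: this is the column
# order assumed for the (4, 10) array
columns = list(product(optimizers, splits))

def column_of(optimizer, split):
    """Column index of a given (optimizer, split) pair in the results array."""
    return 2 * optimizers.index(optimizer) + splits.index(split)

for k, pair in enumerate(columns):
    assert column_of(*pair) == k
print(columns[:4])
print(column_of('RMSprop', 'Val'))
```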

In [None]:
model_s_df = pd.DataFrame(data=data, columns=index, index=activation_s)
model_s_df

As can be seen, the best models were obtained with the SGD optimizer, with ReLU or softplus as activation.

For the next steps, let's start with SGD or Adam as optimizers, and ReLU as activation.

Before adding new layers to our model, let's learn how to save our ANN.

## Save a complete ANN

To save an ANN developed with Keras to a file in HDF5 format, we use the `save(file_path_and_name)` method. The saved file contains the following information:
* ANN architecture;
* ANN parameters;
* optimizer parameters adopted to train the ANN;
* optimizer state, which allows training to continue exactly where it left off.

Then, the function
```
load_model (file_path_and_name)
```
is applied to restore the ANN as it was when it was saved.


In [None]:
# Import library to manipulate files in HDF5 format
import h5py

# Save the network in the format of an HDF5 dictionary
model.save('/content/ANN.h5')

### Delete an ANN

To delete an ANN from memory, use the `del` statement.

To load an ANN saved in a file, use the `load_model()` function.

Before using it, you have to import `load_model` from the Keras `models` module (`keras.models` in the code below, or `tensorflow.keras.models`).

In [None]:
from keras.models import load_model
# Delete the model
del model
# Recover the ANN from ANN.h5 file
model = load_model('/content/ANN.h5')
model.summary()

### Save the parameters of an ANN

If we want to save only the parameters of an ANN, we use the method
`save_weights(file_path_and_name)`

In [None]:
model.save_weights('/content/ANN_parameters.h5', save_format='h5')


The code below loads these parameters into another ANN (`ann2`) with the same architecture as the one used to obtain them:

In [None]:
from keras import models
from keras import layers
# Create ann2 with the same architecture as the original ANN
ann2 = make_model()
ann2.summary()
# Load the parameters of the original ANN into ann2
ann2.load_weights('/content/ANN_parameters.h5')

There are many situations where we want to develop a new ANN (`ann_new`) based on another, already trained ANN (`ann_basic`), reusing only the parameters of some layers of `ann_basic` in `ann_new`. This can be done using the methods `save_weights()` and `load_weights()` with small changes. If necessary, consult [here](https://machinelearningmastery.com/save-load-keras-deep-learning-models/), or [here](https://medium.com/swlh/saving-and-loading-of-keras-sequential-and-functional-models-73ce704561f4).

# <font color=”blue”> Your Homework </font>

<font color=”blue”> **Other architectures - Multilayer Neural Networks** </font>


<font color=”blue”> Up to now, the models had only one hidden layer. Your homework is divided into two parts: theoretical and practical.

<font color=”blue”>__THEORETICAL:__

<font color=”blue”> Explain cross-validation in your own words.

<font color=”blue”> Adam optimization is a stochastic gradient descent method that is based on adaptive estimation of first-order and second-order moments. So, Adam involves a combination of two gradient descent methodologies: Momentum and RMSProp,
\begin{align}
\begin{split}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) G_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) G_t^2
\end{split}
\end{align}

<font color=”blue”>$m_t$ and $v_t$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively, and $\beta_1$ and $\beta_2$ are the exponential decay rates for the first and second moment estimates. Their default values are 0.9 and 0.999, respectively.
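As a starting point for this question, the two moment updates above can be turned into a single scalar update step. The sketch below is a minimal illustration, not the Keras implementation: the bias-correction terms and the final parameter update are the ones commonly stated for Adam and are not shown in the equations above, and the values `lr=0.001` and `eps=1e-7` are assumptions.

```python
import math

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for a scalar parameter w with gradient g at step t (t >= 1)."""
    m = beta1 * m + (1.0 - beta1) * g          # first-moment estimate m_t
    v = beta2 * v + (1.0 - beta2) * g * g      # second-moment estimate v_t
    m_hat = m / (1.0 - beta1 ** t)             # bias correction (assumed form)
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# First step from m_0 = v_0 = 0 for two very different gradient magnitudes
for g in (0.01, 100.0):
    w, m, v = adam_step(w=1.0, g=g, m=0.0, v=0.0, t=1)
    print(f"g={g:>6}: m_1={m:.4g}, v_1={v:.4g}, step={1.0 - w:.6f}")
```

In this sketch the size of the first step is close to `lr` for both gradients, which illustrates how dividing by the square root of the second moment rescales the update.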

<font color=”blue”> What is the theory behind the ADAM optimizer?


<font color=”blue”> __PRACTICAL:__

<font color=”blue”> Experiment with more hidden layers. Try to improve the performance with your proposed architecture.

<font color=”blue”> Use the *cross validation* technique. There is an exercise and explanation about it at the end of this notebook.

<font color=”blue”> Discuss your results.</font>



Some important points:
* There are not many rules for choosing the number of intermediate layers and the number of neurons in each of them in a first model. Experience and intuition are the main factors that can help in this definition.
* In general, the development process starts with a simple ANN with a single intermediate layer. That's what we did.
* One indication for defining the number of layers and neurons in an ANN is the amount of data in the training set. **The number of parameters in an ANN must be less than the total number of data present in the training data.**
* Another important rule for defining the number of neurons in the intermediate layers is not to create *information bottlenecks*, that is, an intermediate layer can *never* have fewer neurons than the output layer. In a stack of layers, each layer has access only to the information in the output of the previous layer. If a layer has too few neurons, it will lose some information, which cannot be recovered by the subsequent layers.
* The number of neurons in the intermediate layers depends a lot on the number of inputs and outputs of the training examples, and in general we can do some quick initial tests to determine the appropriate minimum number.
* Obviously, the number of neurons in the output layer is defined by the number of outputs in the training examples.
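The rule relating the number of parameters to the amount of training data can be checked with a few lines of Python. The helper `count_dense_params` is hypothetical, and the training-set size of 400 is an assumption for illustration; use the actual size of your training set.

```python
def count_dense_params(layer_sizes):
    """Weights plus biases of a fully connected network, e.g. [10, 20, 1]."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

n_train = 400  # assumed number of training examples, for illustration only

for sizes in ([10, 10, 1], [10, 20, 1], [10, 50, 1], [10, 20, 20, 1]):
    n_params = count_dense_params(sizes)
    verdict = "ok" if n_params < n_train else "too many parameters"
    print(sizes, n_params, verdict)
```

Each dense layer contributes one weight per input-output pair plus one bias per output, so adding a second hidden layer increases the parameter count quickly.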

## Cross validation

When the amount of input data is large enough, one can divide the data into *training*, *validation* and *test* sets. The *training set*, as the name suggests, is used to train the model, that is, to adjust the internal parameters of the model such that the inputs match the outputs with the minimum error possible. The *validation set* is used to test the performance of the model before it is subjected to the actual test set, and to tune its hyperparameters. The *test set*, which should be kept separate and untouched in the training process, is used to provide an unbiased evaluation of the performance of the final model.

 ![](https://drive.google.com/uc?export=view&id=19txDsk2QD64Y2zjh5c439e8U5KM7j_UZ)


When there is no validation set, the most widely used approach is cross-validation (CV). In the well-known K-fold CV, the training set is randomly divided into $K$ subsets (also known as folds) - see figure below. The ML model is trained with $K-1$ subsets and evaluated on the subset that was not used for training. This process is repeated $K$ times, with a different subset reserved for evaluation (and excluded from training) each time. The error is computed for each of the $K$ evaluations and then averaged. At the end, the model (or hyperparameter setting) with the smallest average error is selected. This simulates a training/validation split, which is useful for hyperparameter tuning, without touching the test set.

 ![](https://drive.google.com/uc?export=view&id=15k96E0QY-QtKpELudeqMMpsApCxibAHJ)
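The K-fold procedure described above can be sketched in plain NumPy. The example below is only an illustration: the data are synthetic, the model is a least-squares line rather than an ANN, and the helper `k_fold_indices` is hypothetical (in practice `sklearn.model_selection.KFold` plays this role).

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for K-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

# Toy regression problem and a deliberately simple model (least-squares line)
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=20)
y = 3.0 * x + rng.normal(scale=0.1, size=20)

fold_mse = []
for train_idx, val_idx in k_fold_indices(len(x), k=5):
    # A fresh model is fitted for every fold
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
    pred = slope * x[val_idx] + intercept
    fold_mse.append(np.mean((pred - y[val_idx]) ** 2))

print("MSE per fold:", np.round(fold_mse, 4))
print("mean CV MSE :", np.mean(fold_mse))
```

Note that the model is refitted from scratch in each fold, so no fold is evaluated by a model that has already seen it.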



For illustration, let's apply cross-validation to different numbers of neurons in the intermediate layer (our first model analysis).

In [None]:
from sklearn.model_selection import cross_val_score, KFold

In [None]:
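def get_cross_val_score(model, x, y, cv=5, epochs=1000):
    # One row per fold; each row holds the values returned by model.evaluate
    cvs = np.zeros((cv, len(model.metrics)))

    k_folds = KFold(n_splits=cv)
    # Note: the same model instance keeps training from fold to fold (weights are not reset)
    for j, (train_idx, test_idx) in enumerate(k_folds.split(x, y)):
        # Train on K-1 folds, using the x, y and epochs passed to the function
        model.fit(x[train_idx], y[train_idx], epochs=epochs, verbose=0)
        # Evaluate on the held-out fold
        cvs[j,:] = np.array(model.evaluate(x=x[test_idx], y=y[test_idx], verbose=0))
    return cvs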

In [None]:
print('10 neurons')
cv_10_neurons = get_cross_val_score(model_10_neurons, x_train_sca, y_train_sca, cv=5, epochs=500)
print('20 neurons')
cv_20_neurons = get_cross_val_score(model_20_neurons, x_train_sca, y_train_sca, cv=5, epochs=500)
print('30 neurons')
cv_30_neurons = get_cross_val_score(model_30_neurons, x_train_sca, y_train_sca, cv=5, epochs=500)
print('40 neurons')
cv_40_neurons = get_cross_val_score(model_40_neurons, x_train_sca, y_train_sca, cv=5, epochs=500)
print('50 neurons')
cv_50_neurons = get_cross_val_score(model_50_neurons, x_train_sca, y_train_sca, cv=5, epochs=500)

print('Done')

In [None]:
plt.style.use('bmh')
fig, ax = plt.subplots()

xi = np.arange(10,60,10)
means = np.array([cv_10_neurons[:,0].mean(),
                 cv_20_neurons[:,0].mean(),
                 cv_30_neurons[:,0].mean(),
                 cv_40_neurons[:,0].mean(),
                 cv_50_neurons[:,0].mean()])

stds  = np.array([cv_10_neurons[:,0].std(),
                 cv_20_neurons[:,0].std(),
                 cv_30_neurons[:,0].std(),
                 cv_40_neurons[:,0].std(),
                 cv_50_neurons[:,0].std()])

ax.errorbar(xi,
            means,
            yerr=stds);
ax.set(xticks=(xi),
       title='MSE\'s mean plus or minus one standard deviation\nusing 5-fold cross-validation for different numbers of neurons',
       xlabel='Number of neurons',
       ylabel='MSE');

In [None]:
predictions = np.zeros((5,3))
predictions[0,:] = model_10_neurons.evaluate(x_val_sca, y_val_sca)
predictions[1,:] = model_20_neurons.evaluate(x_val_sca, y_val_sca)
predictions[2,:] = model_30_neurons.evaluate(x_val_sca, y_val_sca)
predictions[3,:] = model_40_neurons.evaluate(x_val_sca, y_val_sca)
predictions[4,:] = model_50_neurons.evaluate(x_val_sca, y_val_sca)
predictions

In [None]:
fig, ax = plt.subplots()

xi = np.arange(10,60,10)
ax.plot(xi, predictions[:,0])
ax.set(xticks=(xi),
       title='MSE for the test set\nfor different numbers of neurons',
       xlabel='Number of neurons',
       ylabel='MSE');

In [None]:
# Plot of target values and the predictions of two models (first output column)
def Target_vs_Predic2(y_test,y_pred1,label1,y_pred2,label2):
  plt.figure(figsize=(16, 6))
  plt.plot(y_test[:,0], 'o', color = 'darkslateblue', label='Target')
  plt.plot(y_pred1[:,0], 'o', color = 'crimson', label=label1)
  plt.plot(y_pred2[:,0], 'o', color = 'gold', label=label2)
  plt.title('Target vs prediction of the ANN')
  plt.xlabel('Example')
  plt.ylabel(r'$d_{4} [mm]$')
  plt.legend()
  plt.show()

In [None]:
pred_sca_val = model_10_neurons.predict(x_val_sca)
y_new_val1 = normalizer_y.inverse_transform(pred_sca_val)

pred_sca_val = model_40_neurons.predict(x_val_sca)
y_new_val2 = normalizer_y.inverse_transform(pred_sca_val)

Target_vs_Predic2(y_val,y_new_val1,'10 neurons', y_new_val2, '40 neurons')