# Tensorflow Learning

In [2]:
import tensorflow as tf
from tensorflow import keras  # the snippets below refer to `keras` directly
import numpy as np

## Tensorflow introduction

### Tensors and NumPy
NOTE: TensorFlow uses 32-bit precision by default while NumPy uses 64-bit, so set ```dtype=tf.float32``` when creating tensors from NumPy arrays
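
The precision mismatch can be seen directly. This is a minimal sketch assuming TensorFlow 2.x in eager mode; the variable names are only illustrative.

```python
import numpy as np
import tensorflow as tf

arr = np.array([1.0, 2.0, 3.0])            # NumPy defaults to float64
t64 = tf.constant(arr)                     # the tensor inherits float64
t32 = tf.constant(arr, dtype=tf.float32)   # explicit 32-bit tensor
t_cast = tf.cast(t64, tf.float32)          # or cast an existing tensor

print(t64.dtype, t32.dtype, t_cast.dtype)
```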

## Data loading and preprocessing

### Datasets

#### Creation and manipulation

```tf.data.Dataset.from_tensor_slices()```
* Takes a tensor and creates a Dataset whose elements are the slices of that tensor along its first dimension
<br>
Example transformation that repeats the dataset twice and then groups the items into batches of size 7 and 5:

In [15]:
dataset = tf.data.Dataset.from_tensor_slices(tf.range(5))
dataset_1 = dataset.repeat(2).batch(7)
dataset_2 = dataset.repeat(2).batch(5)
print("Repeat and batches demo")
for i_1, i_2 in zip(dataset_1, dataset_2):
    print(i_1)
    print(i_2)
    print()
print("Take demo")
for i in dataset.take(2):
    print(i)

Repeat and batches demo
tf.Tensor([0 1 2 3 4 0 1], shape=(7,), dtype=int32)
tf.Tensor([0 1 2 3 4], shape=(5,), dtype=int32)

tf.Tensor([2 3 4], shape=(3,), dtype=int32)
tf.Tensor([0 1 2 3 4], shape=(5,), dtype=int32)

Take demo
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)


#### Shuffling

<b>Note</b>: Gradient Descent works best when the training instances are independent and identically distributed, which motivates shuffling the dataset.<br>
The ```shuffle()``` method creates a new dataset that starts by filling up a buffer with the first items of the source dataset. Whenever an item is requested, it picks one at random from the buffer and replaces it with a fresh item from the source dataset. Once the source dataset is exhausted, it keeps picking random items from the buffer until the buffer is empty. For the shuffling to be effective, the ```buffer_size``` parameter therefore needs to be sufficiently large.
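
The buffer mechanism can be sketched in plain Python. This is not TensorFlow's implementation, only an illustration of the idea; the function name `buffered_shuffle` and its `seed` argument are made up for this example.

```python
import random

def buffered_shuffle(source, buffer_size, seed=None):
    """Yield the items of `source` in the order a shuffle buffer would produce."""
    rng = random.Random(seed)
    it = iter(source)
    buffer = []
    # Fill the buffer with the first items of the source.
    for item in it:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            break
    # On each request: emit a random buffer slot, then refill it from the source.
    for item in it:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    # Source exhausted: drain what is left in the buffer in random order.
    rng.shuffle(buffer)
    yield from buffer

# A buffer of size 1 cannot shuffle at all, which shows why buffer_size matters.
print(list(buffered_shuffle(range(10), buffer_size=1)))
print(list(buffered_shuffle(range(10), buffer_size=3, seed=0)))
```

With a small buffer, an item can never move far ahead of its original position: the first emitted item always comes from the first `buffer_size` items.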
___
<b>Interleaving</b><br>
For large datasets, a more thorough shuffling is in order, as the shuffle buffer will be too small compared to the data cardinality. One such method is to split the data across multiple files, pick several of them at random and read them simultaneously, interleaving their records.

#### Prefetching

The ```prefetch(n)``` method is a form of parallelism: the dataset prepares up to <i>n</i> batches ahead while the current batch is being trained on. This may improve performance dramatically.

### Preprocessing

#### Incorporating a preprocessing layer directly into the model

This is generally not a good idea, as it is usually better to preprocess the data beforehand to speed up training.<br>
Example of creating a standardization layer class which can be put into a model:
```Python
class Standardization(tf.keras.layers.Layer):
    def adapt(self, data_sample):
        # Compute per-feature statistics from a representative sample
        self.means_ = np.mean(data_sample, axis=0, keepdims=True)
        self.stds_ = np.std(data_sample, axis=0, keepdims=True)

    def call(self, inputs):
        # epsilon avoids division by zero for constant features
        return (inputs - self.means_) / (
            self.stds_ + tf.keras.backend.epsilon()
        )

std_layer = Standardization()
std_layer.adapt(data_sample)
model = tf.keras.Sequential([
    ...
    std_layer,
    ...
])
```


## Customizing Models and Training Algorithms

### Custom Loss Functions

Example w/ Huber Loss:
```Python
def huber_fn(y_true, y_pred):
    error = y_true - y_pred
    is_small_error = tf.abs(error) < 1
    squared_loss = tf.square(error) / 2
    linear_loss = tf.abs(error) - 0.5
    return tf.where(is_small_error, squared_loss, linear_loss)

model.compile(...,loss=huber_fn, ...)
```
To save a model together with the hyperparameters of its loss function, subclass the loss and implement ```get_config()```.
Example subclassing the <i>Loss</i> class:
```Python
class HuberLoss(keras.losses.Loss):
    def __init__(self, threshold=1.0, **kwargs):
        self.threshold = threshold
        super().__init__(**kwargs)
    def call(self, y_true, y_pred):
        error = y_true - y_pred
        is_small_error = tf.abs(error) < self.threshold
        squared_loss = tf.square(error) / 2
        linear_loss = self.threshold * tf.abs(error) - self.threshold**2 / 2
        return tf.where(is_small_error, squared_loss, linear_loss)
    def get_config(self):
        # Store the threshold so it is restored when the model is loaded
        base_config = super().get_config()
        return {**base_config, "threshold": self.threshold}
```
Loading example:
```Python
model = keras.models.load_model(
    "name.h5", 
    custom_objects={"HuberLoss": HuberLoss}
)
```
___

### Custom Metrics
A note about metrics:<br>
Losses are used to train the model, while metrics are used to <i>evaluate</i> a model.<br>
Simple custom metrics can be implemented in the same way as custom loss functions.
___

### Custom activation function, initializers, regularizers and constraints

<b>Custom activation function</b><br>
This example creates a custom softplus, equivalent to ```tf.nn.softplus(z)```:
```Python
def my_softplus(z): 
    return tf.math.log(tf.exp(z) + 1.0)
```
<b>Custom initialization</b><br>
A custom glorot initializer equivalent to ```keras.initializers.glorot_normal()```
```Python
def my_glorot_initializer(shape, dtype=tf.float32):
    stddev = tf.sqrt(2. / (shape[0] + shape[1]))
    return tf.random.normal(shape, stddev=stddev, dtype=dtype)
```
<b>Custom regularizer</b><br>
Equivalent regularizer to ```keras.regularizers.l1(0.01)```
```Python
def my_l1_regularizer(weights):
    # reduce_sum is equivalent to np.sum()
    return tf.reduce_sum(tf.abs(0.01 * weights))
```
<b>Custom constraints</b><br>
Equivalent to ```keras.constraints.nonneg()``` or ```tf.nn.relu()```
```Python
def my_positive_weights(weights):
    return tf.where(weights < 0., tf.zeros_like(weights), weights)
```
<b>Note:</b> If a function has hyperparams that should be saved, use the subclass approach described for [custom loss functions](#Custom-Loss-Functions)
___

### Custom Layers
Example creating a simplified dense layer:
```Python
class MyDense(keras.layers.Layer):
    def __init__(
        self, units, activation=None, **kwargs
    ):
        super().__init__(**kwargs)
        self.units = units
        self.activation = keras.activations.get(activation)

    def build(self, batch_input_shape):
        # Weights are created lazily, once the input shape is known
        self.kernel = self.add_weight(
            name="kernel",
            shape=[
                batch_input_shape[-1], self.units
            ],
            initializer="glorot_normal"
        )
        self.bias = self.add_weight(
            name="bias",
            shape=[self.units],
            initializer="zeros"
        )
        super().build(batch_input_shape)  # must be at the end

    def call(self, X):
        return self.activation(X @ self.kernel + self.bias)

    def compute_output_shape(
        self, batch_input_shape
    ):
        return tf.TensorShape(
            batch_input_shape.as_list()[:-1]
                + [self.units]
        )

    def get_config(self):
        base_config = super().get_config()
        return {
            **base_config, "units": self.units,
            "activation":
                    keras.activations.serialize(self.activation)
        }
```

## Hyperparameter tuning

### K-fold cross validation

This is an example of how to use K-fold cross validation using keras, tensorflow and scikit-learn. <br><br>
___
Steps:
1. Create a function that builds and compiles a model given a set of hyperparams
    * E.g.: number of hidden layers, number of neurons, learning rate, optimizer etc.
2. Wrap the Keras model in a scikit-learn compatible wrapper (e.g. ```KerasRegressor``` or ```KerasClassifier```) so that scikit-learn can build and train it.
3. (OPTIONAL) Specify the range for each hyperparam, then use scikit-learn's <i>RandomizedSearchCV</i> (or grid search, but randomized search should be preferred when the number of hyperparams is large)
    * If the hyperparameter space is very large, another approach is to first run a quick random search over a wide range, then focus on the best regions and perform a longer search there.
4. <b>Use an optimization library for tuning the model!</b>
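
The two mechanisms these steps rely on, K-fold splitting and random sampling of hyperparameters, can be sketched with NumPy alone. In practice scikit-learn handles both; the helpers `k_fold_indices` and `sample_hyperparams` and the value ranges below are hypothetical and only meant to show what happens under the hood.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross validation (k >= 2)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, k)
    for i in range(k):
        val_idx = folds[i]
        # All other folds together form the training set for this round.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

def sample_hyperparams(rng):
    """Draw one random hyperparameter configuration (ranges are illustrative)."""
    return {
        "n_hidden": int(rng.integers(1, 4)),                 # 1 to 3 hidden layers
        "n_neurons": int(rng.integers(10, 101)),             # 10 to 100 neurons
        "learning_rate": float(10 ** rng.uniform(-4, -2)),   # log-uniform in [1e-4, 1e-2]
    }

rng = np.random.default_rng(42)
for trial in range(3):
    params = sample_hyperparams(rng)
    for train_idx, val_idx in k_fold_indices(n_samples=10, k=5):
        # A real search would build, train and score a model here.
        pass
    print(params)
```

Sampling the learning rate on a log scale is a common choice because useful values span several orders of magnitude.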

### Guidelines for choosing hyperparameters

#### Number of hidden layers

In general: better to start with fewer layers, then incrementally increase until overfitting becomes a problem.<br>
NOTE: <i>Transfer learning</i>, where the lower layers of an existing network are reused in a new model with the same weights, can give the new network a good starting point. This is relevant when the new problem and its representation are similar to those of the existing network.
___

#### Number of Neurons per Hidden Layer

Often helpful to start with too many layers and neurons, then let regularization and early stopping determine the appropriate amount.
___

#### Learning rate, weight initialization and activation functions

<b>Learning rate</b><br>
Changing the learning rate during training can be beneficial for some problems. Such strategies are called <i>learning schedules</i>. <br>
To find a good learning rate in the first place, one approach is to start with a low learning rate and gradually increase it over a short training run. Plotting the loss as a function of the learning rate helps in choosing the best value.
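
A gradually increasing learning rate is usually grown by a constant factor per step, so that it covers several orders of magnitude evenly. A minimal NumPy sketch; the function name and the bounds are assumptions made for illustration.

```python
import numpy as np

def exponential_lr_sweep(min_lr, max_lr, n_steps):
    """Return n_steps learning rates growing geometrically from min_lr to max_lr.

    Requires n_steps >= 2.
    """
    factor = (max_lr / min_lr) ** (1.0 / (n_steps - 1))
    return min_lr * factor ** np.arange(n_steps)

lrs = exponential_lr_sweep(min_lr=1e-5, max_lr=10.0, n_steps=500)
print(lrs[0], lrs[-1])
```

Training for one step at each of these rates and recording the loss gives the loss-versus-learning-rate curve; a log-scaled x-axis makes it easier to read.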

<b>Weight initialization</b><br>
Different weight initializations work better with different activation functions, e.g. <i>He</i> for ReLU and <i>LeCun</i> for SELU. <br>
Examples: <i>Glorot</i> (Keras default), <i>He initialization</i>, <i>LeCun</i><br>
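
The three schemes differ only in how the weight variance is scaled by the layer's fan-in and fan-out. The sketch below computes the target standard deviation of the normal-distribution variants; the helper name `init_stddev` is made up for this example.

```python
import numpy as np

def init_stddev(fan_in, fan_out, scheme):
    """Target standard deviation of the weights for a given initialization scheme."""
    if scheme == "glorot":
        return np.sqrt(2.0 / (fan_in + fan_out))
    if scheme == "he":
        return np.sqrt(2.0 / fan_in)
    if scheme == "lecun":
        return np.sqrt(1.0 / fan_in)
    raise ValueError(f"unknown scheme: {scheme}")

for scheme in ("glorot", "he", "lecun"):
    print(scheme, init_stddev(fan_in=300, fan_out=100, scheme=scheme))
```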

<b>Activation functions</b><br>
Book recommends in general:<br>
<center>SELU > ELU > leaky ReLU > ReLU > tanh > logistic</center><br>
HOWEVER: for <b>recurrent networks</b>, the SELU will not self-normalize, so ELU may be preferred for those networks
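
For reference, the three top-ranked activations can be written in a few lines of NumPy. This is a sketch of their standard definitions; the default `alpha` for the leaky ReLU is an assumption, and the SELU constants are the fixed values from its definition.

```python
import numpy as np

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def leaky_relu(z, alpha=0.3):
    # Small slope alpha for negative inputs instead of a hard zero
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    # np.minimum keeps exp() from overflowing on large positive inputs
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

def selu(z):
    # A scaled ELU with fixed constants
    return SELU_SCALE * elu(z, SELU_ALPHA)

z = np.array([-2.0, 0.0, 2.0])
print(leaky_relu(z))
print(elu(z))
print(selu(z))
```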

## Regularization

### L1 and L2 regularization

Specify the kernel_regularizer using a regularization factor. Examples: 
```Python 
    keras.regularizers.l1(regularization_factor)
    keras.regularizers.l2(regularization_factor)
    keras.regularizers.l1_l2(r1, r2)
```
To avoid for loops or repeating the same arguments when using many layers, use ```functools.partial```:
```Python
from functools import partial
RegularizedDense = partial(
    keras.layers.Dense, 
    activation="elu",                  
    kernel_initializer="he_normal",
    kernel_regularizer=keras.regularizers.l2(0.01)
  )
model = ...RegularizedDense(num_neurons)...
```
___

### Dropout
Add a dropout layer after a regular layer. Example:
```Python
model = ...Dense(...), keras.layers.Dropout(rate=dropout_rate)
```
Dropout is often only applied to the last hidden layer.<br>
Warning: dropout is only active during training, so comparing training and validation loss may be misleading. Evaluating the training loss without dropout gives a better comparison to the validation loss.
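
The difference between training and testing behaviour is easy to see in a NumPy sketch of (inverted) dropout. This is an illustration, not the Keras implementation; the function signature is made up for the example.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Zero out a fraction `rate` of the inputs during training only."""
    if not training or rate == 0.0:
        return x  # at test time the layer is a no-op
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob
    # Scale the survivors so the expected activation stays the same
    return x * mask / keep_prob

rng = np.random.default_rng(0)
x = np.ones((1000, 10))
train_out = dropout(x, rate=0.5, training=True, rng=rng)
test_out = dropout(x, rate=0.5, training=False, rng=rng)
print(train_out.mean(), test_out.mean())
```

Because the training-time output is noisy while the test-time output is not, losses computed in the two modes are not directly comparable.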
___

### Monte Carlo Dropout
This can be achieved by making several predictions while still using dropout, making every prediction have some stochasticity to it. Then, compute the mean of these predictions to achieve a Monte Carlo estimate.<br>
NOTE: If the model contains other layers that behave differently during training and testing (e.g. ```BatchNormalization```), passing ```training=True``` to the whole model should be avoided. Instead, replace the Dropout layers with an MCDropout subclass in which the training argument is forced to be true.
<br>Examples:
```Python
### Monte Carlo Dropout w/ 100 predictions
y_probas = np.stack([model(X_test, training=True) for sample in range(100)])
y_proba = y_probas.mean(axis=0)

### Monte Carlo Dropout class
class MCDropout(keras.layers.Dropout):
    def call(self, inputs):
        return super().call(inputs, training=True)
```
___

## Summarized "default" configurations

A regular use case can often consist of these specifications:<br>
![image.png](attachment:26a9ea43-d618-4400-9822-c53192dae9a8.png)

If the network <b>solely consists of simple dense layers</b>, it can self-normalize through `selu` activation and `lecun_normal` kernel initializer with the following specifications: <b>NOTE! REQUIRES NORMALIZING INPUTS</b><br>
![image.png](attachment:ed537874-3cf4-46b8-8a32-0588317e27c5.png)


## Forecasting a Time Series using RNNs

### Introductory elements

#### Dimensionality
The input for a time series model is usually a 3D array of shape
<center>[<i>batch size, time steps, dimensionality</i>]</center>
where <i>dimensionality</i> is the number of input features per time step (1 for a univariate series).
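
As a small illustration of this layout, a batch of univariate series stored as a 2D array needs an extra trailing axis before it can be fed to a recurrent layer. The sizes below are arbitrary.

```python
import numpy as np

batch_size, n_steps = 4, 50
series = np.random.rand(batch_size, n_steps)   # one value per time step
X = series[..., np.newaxis]                    # add the feature axis

print(X.shape)  # [batch size, time steps, dimensionality]
```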
___

#### Multiple steps parameter
The parameter ```return_sequences``` determines whether the recurrent layer returns only its final output (```False```, the default) or one output per time step (```True```). When stacking recurrent layers, it must be set to ```True``` for every recurrent layer that feeds another recurrent layer, so that each new layer gets all time steps as input!
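
What the flag changes can be shown with a hand-written simple RNN in NumPy. This is a sketch of the recurrence, not the Keras layer; the function and weight names are made up for the example.

```python
import numpy as np

def simple_rnn(X, Wx, Wh, b, return_sequences=False):
    """Run a basic tanh RNN over X of shape [batch, time steps, features]."""
    batch, n_steps, _ = X.shape
    h = np.zeros((batch, Wh.shape[0]))
    outputs = []
    for t in range(n_steps):
        h = np.tanh(X[:, t] @ Wx + h @ Wh + b)
        outputs.append(h)
    if return_sequences:
        return np.stack(outputs, axis=1)   # [batch, time steps, units]
    return h                               # [batch, units], last step only

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 5, 1))
Wx1, Wh1, b1 = rng.normal(size=(1, 4)), rng.normal(size=(4, 4)), np.zeros(4)
Wx2, Wh2, b2 = rng.normal(size=(4, 3)), rng.normal(size=(3, 3)), np.zeros(3)

# The first layer must return the full sequence so the second layer has time steps to iterate over.
seq = simple_rnn(X, Wx1, Wh1, b1, return_sequences=True)
last = simple_rnn(seq, Wx2, Wh2, b2)
print(seq.shape, last.shape)
```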
___

#### Output layer
Generally, it might be better to use a ```Dense``` layer or similar for the output, as this allows the activation function to be customized. This is particularly relevant when only one value is predicted: in that case a recurrent output layer such as ```LSTM(1)``` serves no particular purpose.
___

### Splitting a sequential dataset

#### Stationarity and splitting
If the dataset is split along the time dimension, this implicitly assumes stationarity: patterns the model learns from the past will still be relevant in the future!<br>
One method of checking for stationarity is to plot the model's errors on the validation set over time: if the model performs better on the first part of the validation set, this might indicate non-stationarity.
___

#### Chopping sequential dataset into windows
The ```window()``` method creates nonoverlapping windows by default, but this can be changed through its parameters. Some parameter explanations:
* Using ```shift=n``` makes each new window start <i>n</i> steps after the previous one, e.g. ```shift=1``` gives 0-10, 1-11, 2-12 etc.
* Setting ```drop_remainder=True``` drops the trailing windows that contain fewer elements than the stated window length, e.g., for the sequence 0-10, window size 3 and ```shift=1```, the windows [9,10] and [10] will not be created.<br>

Windows cannot be used as input directly, as each window is itself a nested dataset. They must first be flattened with the ```flat_map()``` method:
```Python
dataset = dataset.flat_map(
    lambda window: window.batch(window_length)
)
```
As always, it is beneficial to shuffle the windows and then separate targets from inputs. Example:
```Python
batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)
dataset = dataset.map(
    lambda windows: (windows[:, :-1], windows[:, 1:])
)
```
Finally, prefetching should be added:
```Python
dataset = dataset.prefetch(1)
```
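
Putting the steps of this section together on a toy sequence gives the pipeline below. It is a sketch assuming TensorFlow 2.x; the sequence, window length and batch size are arbitrary, and shuffling is left out so that the printed result is deterministic.

```python
import tensorflow as tf

window_length = 4  # 3 input steps plus 1 step for the shifted target
dataset = tf.data.Dataset.range(8)
dataset = dataset.window(window_length, shift=1, drop_remainder=True)
# Turn each nested window dataset into a single tensor of length 4
dataset = dataset.flat_map(lambda window: window.batch(window_length))
dataset = dataset.batch(2)
# Inputs are all but the last step, targets are shifted one step ahead
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
dataset = dataset.prefetch(1)

for X_batch, Y_batch in dataset.take(1):
    print(X_batch.numpy())
    print(Y_batch.numpy())
```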
___

### Forecasting multiple steps ahead

#### Approach 1: Use previous predictions as input
Example:
```Python
for step_ahead in range(10):
    next_pred = model.predict(
        X[:, step_ahead:]
    )[:, None, :]
    X = np.concatenate([X, next_pred], axis=1)
y_pred = X[:, n_steps:]  # keep only the 10 forecast steps
```
This uses the last prediction as input to the model, predicts the step after and so forth for the desired number of steps.<br>
<b>NOTE:</b> The errors accumulate with every predicted step, so this approach is generally inappropriate for longer horizons!
___

#### Approach 2: Train RNN to predict all values at once
Example:
```Python
model = keras.models.Sequential([
    keras.layers.SimpleRNN(
        20, 
        return_sequences=True, 
        input_shape=[None, 1]
    ),
    keras.layers.SimpleRNN(20),
    keras.layers.Dense(10)
])
```
The final line makes the output layer have 10 outputs instead of one, corresponding to making ten new predictions.<br>
___

#### Approach 3: Improved approach 2 - training on forecasting next n values at every time step
It is generally better to train the model to forecast the next 10 values at every time step, not only at the last one. This requires the target to be a sequence of the same length as the input sequence.
<br> Example:
```Python
model = keras.models.Sequential([
    keras.layers.SimpleRNN(
        20, 
        return_sequences=True, 
        input_shape=[None, 1]
    ),
    keras.layers.SimpleRNN(
        20, 
        return_sequences=True, 
    ),
    keras.layers.TimeDistributed(
        keras.layers.Dense(10)
    )
])
```
The final layer, ```TimeDistributed```, wraps a layer (here ```Dense(10)```) and applies it at every time step of its input sequence.<br>
For this approach, a custom evaluation metric should be designed to only evaluate output at the last time step:<br>
```Python
def last_time_step_mse(Y_true, Y_pred):
    return keras.metrics.mean_squared_error(
        Y_true[:, -1], Y_pred[:, -1]
    )
```
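
The targets for this approach hold, at every time step, the next 10 values of the series. A NumPy sketch of how such targets could be built; the placeholder `series` array and the sizes are assumptions for illustration.

```python
import numpy as np

n_steps, horizon = 50, 10
# Placeholder data of shape [batch, n_steps + horizon, 1]; the values are arbitrary.
series = np.arange(3 * (n_steps + horizon), dtype=np.float32).reshape(
    3, n_steps + horizon, 1
)

X = series[:, :n_steps]  # inputs: the first n_steps values
Y = np.empty((series.shape[0], n_steps, horizon), dtype=np.float32)
for step_ahead in range(1, horizon + 1):
    # Y[:, t, k] holds the value k + 1 steps after input step t
    Y[:, :, step_ahead - 1] = series[:, step_ahead:step_ahead + n_steps, 0]

print(X.shape, Y.shape)
```

Only the targets at the last time step, ```Y[:, -1]```, correspond to values the model has not yet seen as input, which is why the metric above looks at that step alone.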

### Handling long sequences
Long sequences may cause the unstable gradients problem. Additionally, the network will gradually forget the first inputs in the sequence.

#### Handling unstable gradients: Layer Normalization
Normalization is performed along the feature dimension. Can be done with ```tf.keras.layers.LayerNormalization(...)```
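
What normalizing along the feature dimension means can be sketched in NumPy. This illustrates the computation only and is not the Keras layer itself; `gamma` and `beta` stand for the learned scale and offset, and the `eps` default is an assumption.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-3):
    """Normalize each sample over its last (feature) axis, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = 10 * rng.normal(size=(4, 6))   # 4 samples, 6 features
out = layer_norm(x, gamma=np.ones(6), beta=np.zeros(6))
print(out.mean(axis=-1), out.std(axis=-1))
```

Unlike batch normalization, the statistics are computed per sample, so the layer behaves the same way during training and testing and at every time step.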
___

#### Handling the short-term memory problem: LSTM and GRU

Implementation is straightforward: use ```keras.layers.LSTM``` or ```keras.layers.GRU```.<br>

GRU vs LSTM:
* GRU is a simplified version of the LSTM cell, but often performs just as well.<br>

Both cells still have fairly limited short-term memory, and consequently struggle with learning long-term patterns in sequences of 100 time steps or more. This can be mitigated by shortening the input sequences, e.g. through <i>1D convolutional layers</i>. See <i>WaveNet</i> for an example.
___

## Autoencoders
Autoencoders have several uses:
* They can be used for preprocessing, where they can perform dimensionality reductions.
* They can act as feature detectors, and be used for pretraining deep neural networks
* Some are generative models, capable of generating new data similar to the training data.
___

### PCA with autoencoders
An autoencoder with only linear activations and MSE cost function performs a PCA. Example of PCA from 3D to 2D:
```Python
encoder = keras.models.Sequential([keras.layers.Dense(2, input_shape=[3])])
decoder = keras.models.Sequential([keras.layers.Dense(3, input_shape=[2])])
autoencoder = keras.models.Sequential([encoder, decoder])
```
___

### Unsupervised pretraining using stacked autoencoders
Useful for tasks with little labeled training data.
___

### Recurrent Autoencoders
Example:
```Python
recurrent_encoder = keras.models.Sequential([
    keras.layers.LSTM(
        100, return_sequences=True, input_shape=[None, 28]
    ),
    keras.layers.LSTM(30)
])

recurrent_decoder = keras.models.Sequential([
    keras.layers.RepeatVector(28, input_shape=[30]),
    keras.layers.LSTM(
        100, return_sequences=True
    ),
    keras.layers.TimeDistributed(
        keras.layers.Dense(28, activation="sigmoid")
    )
])

recurrent_ae = keras.models.Sequential([
    recurrent_encoder, recurrent_decoder
])
```
The `RepeatVector()` layer ensures that the input vector is fed to the decoder at each time step.
___

### Variational autoencoders - a way of doing Bayesian inference
Characteristics:
* Probabilistic - output determined partly by chance
* Generative - can generate new instances<br> 

The encoder produces a mean coding and a standard deviation. Then, the actual coding is sampled randomly from a Gaussian distribution with the given mean and standard deviation. Afterwards, the decoder decodes the sampled coding normally. <br>

<b>Cost function</b><br>

Variational autoencoders have a twofold cost function:
* The usual reconstruction loss pushing autoencoder to reproduce its inputs
* Latent loss pushing autoencoders to have codings looking like they were sampled from a simple Gaussian distribution
    * This is the KL divergence between the target distribution - the Gaussian distribution - and the actual distribution of the codings.
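
For Gaussian codings parameterized by a mean and a log-variance, this KL divergence has a simple closed form. The NumPy sketch below shows that it is zero exactly when the codings already follow a standard Gaussian; the function name is chosen for the example.

```python
import numpy as np

def latent_loss(mean, log_var):
    """KL divergence between N(mean, exp(log_var)) and N(0, 1), summed over codings."""
    return -0.5 * np.sum(
        1 + log_var - np.exp(log_var) - np.square(mean), axis=-1
    )

# Codings that already match N(0, 1) incur no penalty
print(latent_loss(np.zeros((1, 2)), np.zeros((1, 2))))
# Shifting the mean away from 0 is penalized
print(latent_loss(np.array([[1.0, 0.0]]), np.zeros((1, 2))))
```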

#### Building a variational autoencoder example:

```Python
class Sampling(keras.layers.Layer):
    def call(self, inputs):
        mean, log_var = inputs
        # mean + sigma * epsilon, with sigma = exp(log_var / 2)
        return (
            keras.backend.random_normal(tf.shape(log_var))
            * keras.backend.exp(log_var / 2)
            + mean
        )
```
This layer samples a random vector from a Normal distribution with mean 0 and stddev 1, multiplies it by exp(log_var / 2) (which equals the standard deviation, since log_var is the log of the variance), and adds the mean to the result. In short, it samples a codings vector from the Normal distribution with the given mean and stddev.

Next, create the encoder:
```Python
codings_size = 10
# Create mean and log-variance
inputs = keras.layers.Input(shape=[28, 28])
z = keras.layers.Flatten()(inputs)
z = keras.layers.Dense(150, activation="selu")(z)
z = keras.layers.Dense(100, activation="selu")(z)
codings_mean = keras.layers.Dense(codings_size)(z)
codings_log_var = keras.layers.Dense(codings_size)(z)
codings = Sampling()([codings_mean, codings_log_var])
variational_encoder = keras.models.Model(
    inputs=[inputs],
    outputs=[codings_mean, codings_log_var, codings]
)
```
The layers that output the mean and the log-variance (from which the stddev is derived) take the same input.

Creating the decoder:
```Python
decoder_inputs = keras.layers.Input(shape=[codings_size])
x = keras.layers.Dense(100, activation="selu")(decoder_inputs)
x = keras.layers.Dense(150, activation="selu")(x)
x = keras.layers.Dense(28*28, activation="sigmoid")(x)
outputs = keras.layers.Reshape([28, 28])(x)
variational_decoder = keras.models.Model(
    inputs=[decoder_inputs],
    outputs=[outputs]
)
```

Building the model:
```Python
_, _, codings = variational_encoder(inputs)
reconstructions = variational_decoder(codings)
variational_ae = keras.models.Model(
    inputs=[inputs], outputs=reconstructions
)
```

Adding the latent loss:
```Python
latent_loss = -0.5 * keras.backend.sum(
    1 + codings_log_var - keras.backend.exp(codings_log_var)
    - keras.backend.square(codings_mean),
    axis=-1
)
variational_ae.add_loss(keras.backend.mean(latent_loss) / 784.)
variational_ae.compile(loss="binary_crossentropy",
                       optimizer="rmsprop")
```
The latent loss is the KL divergence. It is divided by 784 (= 28 * 28, the number of input pixels) to ensure it has an appropriate scale compared to the reconstruction loss.