In [1]:
%run Latex_macros.ipynb

# Advanced Keras: motivation

We have been using the Keras Model API (mostly the `Sequential` model) as a black box.

But it is highly customizable:
- A `Model` is a Python class
- It implements methods such as
    - `compile`
    - `fit`


We can change the behavior of a model in several ways
- Arguments to some methods are objects; we can pass non-default functions/objects
    - e.g., custom loss function
- We can override these (and other) methods to make our models do new things.

`Layer` is also an abstract (Python) class in Keras.

Hence
- We can create new layer types
- We can override the methods of a given layer

In this module we will
- Illustrate techniques that you can use to customize your Layers/Models
- Illustrate the Functional model


# Functional model: the basics, illustrated by the Transformer

The `Sequential` model
- organizes layers as an ordered list
- restricts the input to layer $(\ll+1)$ to be the output of layer $\ll$.

The [`Functional` model](https://keras.io/guides/functional_api/)
- imposes **no** ordering on layers
- imposes **no** restriction on connecting the outputs of one layer to the inputs of another

To illustrate the `Functional` model, let's take a first look
at a model implementing a single `Transformer` block
- we will revisit this code later to illustrate other concepts

Here is the picture of a Transformer block

<table>
    <tr>
        <th><center>Transformer (Encoder/Decoder)</center></th>
    </tr>
    <tr>
        <td><img src="images/Attention_is_all_u_need_Transformer.png" width=50%></td>
    </tr>
</table>

There are actually 3 models in the [cell](https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/nlp/ipynb/neural_machine_translation_with_transformer.ipynb#scrollTo=cadZM4xkYDon) we will visit !

Let's examine each one and try to relate the actual code to our picture.

**Pay close attention to the difference between $\bar\h$ (Encoder states) and $\y$ (Decoder outputs)**

<table>
    <tr>
        <th><center>Transformer Layer (Encoder/Decoder)</center></th>
    </tr>
    <tr>
        <td><img src="images/Transformer_Encoder_Decoder_2.png" width=70%></td>
    </tr>
</table>

In the above picture of the Transformer, the Decoder has two sources of input:
- The output sequence $\bar\h_{(1..\bar{T})}$ (i.e., latent states) of the Encoder
    - Used in Decoder-Encoder attention
    - $|| \bar\h || = \bar{T} = \text{length of Transformer input}$
- The prefix of the Decoder outputs generated up to time $\tt$
    - The Decoder output at time $(\tt-1)$ is appended to the Decoder inputs available at time $\tt$
    - During training, the inputs are the full sequence of Decoder outputs $\y_{(1..T)}$
        - $T$ is the *full* length of the Transformer output
        - Causal (Masked) Attention is used to restrict the Decoder
            - from attending at step $\tt$ to any $\y_{(\tt')}$ where $\tt' > (\tt-1)$
            - Can't look at an output that hasn't been generated yet !
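
The causal restriction can be pictured as a lower-triangular table over Decoder *input* positions. The following is a minimal, framework-free sketch; the function name `causal_mask` is hypothetical and is not the helper used in the Keras notebook.

```python
def causal_mask(T):
    """Return a T x T table: mask[t][s] is True iff input position t may attend to input position s."""
    # A position may look at itself and at earlier positions, never at later ones.
    return [[s <= t for s in range(T)] for t in range(T)]


mask = causal_mask(4)
for row in mask:
    # Print 1 for "may attend", 0 for "masked out"
    print(" ".join("1" if allowed else "0" for allowed in row))
```

Because the Decoder input at position $\tt$ is the output from step $(\tt-1)$, allowing input positions up to $\tt$ is the same as allowing outputs up to $(\tt-1)$.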

## First model: the Encoder side of the Transformer

The Encoder side of the transformer:

    encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="encoder_inputs")
    x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)
    encoder_outputs = TransformerEncoder(embed_dim, latent_dim, num_heads)(x)
    encoder = keras.Model(encoder_inputs, encoder_outputs)

This illustrates the pattern common to `Functional` models
- The output of a layer is assigned to a variable (e.g., `encoder_inputs` has the value of the model's inputs)
- The output of a layer is connected to the input of another layer via "function call" syntax
    - e.g., `encoder_inputs` is applied as the input to the `PositionalEmbedding` layer
    
        x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)

The collection (not necessarily a sequence) of `Layer` calls defines a graph of function calls that maps
`Model` inputs to outputs.

To turn this collection into a `Model`
- We define which function to feed Model inputs to
- We define which function's outputs are the output of the model


For example, we define the Encoder side (a sub-model of the Transformer) via

    encoder = keras.Model(encoder_inputs, encoder_outputs)
    
This defines `encoder` to be a `Model` with
- input: `Layer` `encoder_inputs` (i.e., the `Input` layer)
- output: `Layer` `encoder_outputs` (i.e., the `TransformerEncoder`)

We can see the Encoder side in pictures via

    tf.keras.utils.plot_model(
        encoder,
        show_shapes=True, 
    )

<table>
    <tr><th><strong>Encoder side of Transformer</strong></th></tr>
    <tr><td><img src="images/nmt_encoder_plot_model.png"></td></tr>
</table>

**Note**: the input and output of a `Model` *don't have to be* a single `Layer`; either can be a list !

There is also a model for the Decoder side of the Transformer in the cell we will visit:

    decoder = keras.Model([decoder_inputs, encoded_seq_inputs], decoder_outputs)

- input: An **array**  of 2 `Layer` types -- `[ decoder_inputs, encoded_seq_inputs ]`
- output: `Layer`: `decoder_outputs`

Hence, the Decoder side takes a **pair** of inputs, as per the diagram.

    
    decoder = keras.Model([decoder_inputs, encoded_seq_inputs], decoder_outputs)

Let's see if we can trace which role each element of the pair serves.

Here is a picture of the Decoder side of the Transformer

<table>
    <tr><th><strong>Decoder side of Transformer</strong></th></tr>
    <tr><td><img src="images/nmt_decoder_plot_model.png"></td></tr>
</table>


**Note in the Decoder picture**

`decoder_state_inputs` (on the right hand side) is assigned to variable `encoded_seq_inputs`:

    encoded_seq_inputs = keras.Input(shape=(None, embed_dim), name="decoder_state_inputs")
    
and `encoded_seq_inputs` is the second argument to `TransformerDecoder`

Hence, there should be an arrow from `decoder_state_inputs` to the `transformer_decoder` box.

This confirms that
- The Decoder side takes a pair of inputs
- The second input is directed to the second argument of `TransformerDecoder`


Let's examine the Encoder code more closely

    encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="encoder_inputs")
    x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)
    encoder_outputs = TransformerEncoder(embed_dim, latent_dim, num_heads)(x)
    encoder = keras.Model(encoder_inputs, encoder_outputs)


First, observe that the *Encoder* outputs ($\bar\h_{(1..\bar{T})}$) are `encoder_outputs`, the output of the `encoder` model

    encoder = keras.Model(encoder_inputs, encoder_outputs)

We see that these Encoder outputs become the second part of the pair of the actual arguments that are the inputs to the *Decoder*

    decoder_outputs = decoder([decoder_inputs, encoder_outputs])
    
And the formal argument of `decoder` definition is `encoded_seq_inputs`

    decoder = keras.Model([decoder_inputs, encoded_seq_inputs], decoder_outputs)

So the `encoder_outputs` ($\bar\h_{(1..\bar{T})}$) become bound to `encoded_seq_inputs` within the Decoder.

Hence, the second part of the input pair of `decoder` serves the role of $\bar\h_{(1..\bar{T})}$, the sequence of Encoder latent states.
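
The binding of actual to formal arguments is the same as for ordinary Python functions. Here is a toy sketch in which plain functions stand in for the three models; the function bodies are invented placeholders and do none of the real computation.

```python
def encoder(encoder_inputs):
    # Stand-in for PositionalEmbedding + TransformerEncoder: one "state" per input token
    return [f"h({tok})" for tok in encoder_inputs]


def decoder(decoder_inputs, encoded_seq_inputs):
    # Stand-in for the Decoder: each position sees its own token and all Encoder states
    return [(tok, tuple(encoded_seq_inputs)) for tok in decoder_inputs]


def transformer(encoder_inputs, decoder_inputs):
    encoder_outputs = encoder(encoder_inputs)        # actual argument ...
    return decoder(decoder_inputs, encoder_outputs)  # ... bound to formal `encoded_seq_inputs`


out = transformer(["a", "b"], ["[start]", "x"])
print(out[0])
```

Every Decoder position receives the full tuple of Encoder states, which is the role the second element of the input pair plays in the real model.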


## Second model: The Decoder side of the Transformer

Now, let's look at the Decoder.

    decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="decoder_inputs")
    encoded_seq_inputs = keras.Input(shape=(None, embed_dim), name="decoder_state_inputs")
    x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
    x = TransformerDecoder(embed_dim, latent_dim, num_heads)(x, encoded_seq_inputs)
    x = layers.Dropout(0.5)(x)
    decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x)
    decoder = keras.Model([decoder_inputs, encoded_seq_inputs], decoder_outputs)

By tracing variable `x` backwards from the Decoder output ($\y_{(1..T)}$) `decoder_outputs`

    decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x)

we can see the outputs derive from `Layer` sub-type `TransformerDecoder` 

    x = TransformerDecoder(embed_dim, latent_dim, num_heads)(x, encoded_seq_inputs)

We have already shown that `encoded_seq_inputs` corresponds to $\bar\h_{(1..\bar{T})}$

So we can guess that variable `x` in the call to `TransformerDecoder` corresponds to $\y_{(1..T)}$

Let's confirm that.

Tracing variable `x` backwards from its use as the first argument in the `TransformerDecoder`

    x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
    
we see that it is the positionally-embedded `decoder_inputs`
- The `PositionalEmbedding` adds position information to each token embedding, since Attention by itself is insensitive to order
- Causal masking is applied inside the `TransformerDecoder` layer

Hopefully: `decoder_inputs` are the `decoder_outputs` shifted by one time step
- That is: the training set enforces "Teacher Forcing"

## Third model: the full Transformer -- Encoder + Decoder

Finally, there is the Transformer Model, combining an Encoder and Decoder:

    decoder_outputs = decoder([decoder_inputs, encoder_outputs])
    transformer = keras.Model(
        [encoder_inputs, decoder_inputs], decoder_outputs, name="transformer"
    )

The `transformer` input is a pair

    [encoder_inputs, decoder_inputs]

And its output is

     decoder_outputs

We identify the first part of the input pair (`encoder_inputs`) as the input sequence $\x_{(1\dots \bar{T})}$



Here is the picture of the full Transformer:

<table>
    <tr><th><strong>Full Transformer</strong></th></tr>
    <tr><td><img src="images/nmt_transformer_plot_model.png"></td></tr>
</table>


## Note: Where does "Teacher Forcing" happen ?

We had "hoped" that 
>`decoder_inputs` are the `decoder_outputs` shifted by one time step

- which needs to be true at training time to implement "teacher forcing"
- (but which shouldn't happen at inference time)

Where in the code does this happen ?

In the data preparation,
we can see that each row of the training data is mapped by the function

    def format_dataset(eng, spa):
        eng = eng_vectorization(eng)
        spa = spa_vectorization(spa)
        return ({"encoder_inputs": eng, "decoder_inputs": spa[:, :-1],}, spa[:, 1:])

where `eng` is $\x_{1\ldots\bar{T}}$ and `spa` is $\y_{1\ldots T}$

The function replaces the ("feature", "target") example pairs
- denoted as (`eng`, `spa`) in the code (n.b., the "feature" is a sentence in English; the "target" is a sentence in Spanish)

by
- example features: the dict `{"encoder_inputs": eng, "decoder_inputs": spa[:, :-1]}`
- example target: `spa[:, 1:]`

So `decoder_inputs` would seem to be the un-shifted sequence of words in the Spanish sentence.

But wait !

Careful examination of the Spanish sentences shows that they begin with a `[start]` token
- This, in essence, shifts `decoder_inputs` by one word relative to the target
- So the first output of the Decoder will be the first (non-`[start]`) word of Spanish
- Which will be compared to the first element of the new target `spa[:, 1:]`: the first (non-`[start]`) word of Spanish
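
The two slices can be checked on a single sentence. This is a small sketch using a made-up, unpadded sentence with words in place of integer token ids; the real code slices a padded 2-D batch with `spa[:, :-1]` and `spa[:, 1:]`.

```python
# A hypothetical, already-tokenized Spanish sentence (words instead of integer ids for readability)
spa = ["[start]", "hola", "mundo", "[end]"]

decoder_inputs = spa[:-1]   # everything except the last token
targets = spa[1:]           # everything except the first token

# At each position the Decoder sees the true previous token and must predict the next one
for seen, expected in zip(decoder_inputs, targets):
    print(f"input {seen!r:10} -> target {expected!r}")
```

Each target is the token that follows the corresponding input, which is exactly the one-step shift that Teacher Forcing requires.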


At inference time, there is no Teacher Forcing: the Decoder inputs are built from the model's own previous outputs.

We can see the generated output `decoded_sentence` built up one word at a time
in the code snippet for `decode_sequence`

    def decode_sequence(input_sentence):
        tokenized_input_sentence = eng_vectorization([input_sentence])
        decoded_sentence = "[start]"
        for i in range(max_decoded_sentence_length):
            tokenized_target_sentence = spa_vectorization([decoded_sentence])[:, :-1]
            predictions = transformer([tokenized_input_sentence, tokenized_target_sentence])

              ....
              
            decoded_sentence += " " + sampled_token

            if sampled_token == "[end]":
                break
        return decoded_sentence
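
To make the loop structure concrete without a trained model, here is a toy version. The lookup table `NEXT_TOKEN` and the function `predict_next` are invented stand-ins for the `transformer` call and the token-sampling step elided in the snippet above.

```python
# Hypothetical stand-in for the trained model: maps the last generated token to the next one
NEXT_TOKEN = {"[start]": "hola", "hola": "mundo", "mundo": "[end]"}


def predict_next(input_sentence, decoded_so_far):
    # A real model would condition on input_sentence and on the whole prefix;
    # this stub only looks at the most recent token.
    return NEXT_TOKEN[decoded_so_far[-1]]


def toy_decode_sequence(input_sentence, max_len=10):
    decoded = ["[start]"]
    for _ in range(max_len):
        token = predict_next(input_sentence, decoded)
        decoded.append(token)   # the model's own output becomes part of the next input
        if token == "[end]":
            break
    return " ".join(decoded)


result = toy_decode_sequence("hello world")
print(result)
```

The prefix fed back to the model consists only of tokens the model itself produced, in contrast to training, where the true previous tokens are supplied.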


    

## Summary

A lot going on here !
- A complex connection of `Layer` outputs to inputs
- Custom `Layer` sub-types
    - `PositionalEmbedding`, `TransformerEncoder`, `TransformerDecoder`
    - We will soon see how to define our own `Layer` sub-classes
- Teacher Forcing

That concludes our first look at the Functional model



# Model specialization

## Custom loss (passing in a loss function)

In introducing Deep Learning, we have asserted that
>It's all about the Loss function

That is: the key to solving many Deep Learning problems
- Is not in devising a complex network architecture
- But in writing a Loss function that captures the semantics of the problem

Up until now
- We have been using pre-defined Loss functions (e.g., `binary_crossentropy`)
- Specifying the Loss function in the compile statement
>`model.compile(loss='binary_crossentropy')`


You can [write your own loss functions](https://keras.io/api/losses/)
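
As a rough, framework-free sketch of what such a function looks like: it takes the true and predicted values and returns one loss value per example. Plain Python lists stand in for tensors here, and the asymmetric penalty (including the `under_penalty` parameter) is an arbitrary illustration, not something taken from Keras.

```python
def asymmetric_squared_error(y_true, y_pred, under_penalty=3.0):
    """Per-example squared error that penalizes under-prediction more than over-prediction."""
    losses = []
    for t, p in zip(y_true, y_pred):
        err = t - p
        # err > 0 means the prediction was too low
        weight = under_penalty if err > 0 else 1.0
        losses.append(weight * err ** 2)
    return losses


losses = asymmetric_squared_error([1.0, 1.0, 1.0], [0.5, 1.5, 1.0])
print(losses)
```

With Keras you would express the same arithmetic with tensor operations and pass the function itself (rather than a string) as the `loss` argument of `compile`.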

In Keras, a Loss function has the signature
>`loss_fn(y_true, y_pred, sample_weight=None)`

## Custom train step (override `train_step`)

But what if your Loss function needs access to values that are not part of the signature ?

Or what if you want to change the training loop ?

You could write your own training loop by overriding the `fit` method
- Cycle through epochs
- Within each epoch, cycle through mini-batches of examples
- For each mini-batch of examples: execute the *train step*
    - forward pass: feed input examples to Input layer, obtain output
    - compute the loss
    - Compute the gradient of the loss with respect to the weights
    - Update the weights
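
The loop structure above can be sketched without any framework. This toy fits a single weight `w` in the model $\hat{y} = w x$; the gradient is derived by hand, whereas Keras obtains it automatically. The data, learning rate, and function name are all made up for illustration.

```python
import random


def train(xs, ys, epochs=200, batch_size=2, lr=0.05, seed=0):
    rng = random.Random(seed)
    w = 0.0                                          # the single trainable weight
    data = list(zip(xs, ys))
    for _ in range(epochs):                          # cycle through epochs
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):    # cycle through mini-batches
            batch = data[i:i + batch_size]
            # forward pass
            preds = [w * x for x, _ in batch]
            # loss: mean squared error over the mini-batch
            loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, batch)) / len(batch)
            # gradient of the loss w.r.t. w (derived by hand)
            grad = sum(2 * (p - y) * x for p, (x, y) in zip(preds, batch)) / len(batch)
            # update the weight
            w -= lr * grad
    return w, loss


w, final_loss = train([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(round(w, 3), final_loss)
```

The four commented steps inside the inner loop are the *train step*; everything outside it is the bookkeeping that `fit` normally handles.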


Rather than overriding `fit`, it sometimes suffices to override the train step: `train_step`

Let's start by looking at the "standard" implementation of a basic train step.

We will see how
- Losses are computed
- Gradients are obtained
- Weights are updated

[Basic `train_step`](https://colab.research.google.com/github/tensorflow/docs/blob/snapshot-keras/site/en/guide/keras/customizing_what_happens_in_fit.ipynb#scrollTo=9022333acaa7&line=1&uniqifier=1)



We can modify the basic training step too.

For example: suppose we want to make some training examples "more important" than others
- Rather than computing the Total Loss as an equally-weighted average over all examples
- We pass in per-example weights

This might be useful, for example, when dealing with Imbalanced Data
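
A small numeric sketch of the idea follows. The function name is hypothetical, and normalizing by the sum of the weights is one common convention that I am assuming here; it is not necessarily how Keras reduces weighted losses.

```python
def weighted_mean_loss(per_example_losses, sample_weights):
    """Weighted average of per-example losses (normalized by the total weight)."""
    total = sum(w * l for w, l in zip(sample_weights, per_example_losses))
    return total / sum(sample_weights)


losses = [0.2, 0.2, 2.0]   # suppose the last example comes from the rare class
equal = weighted_mean_loss(losses, [1.0, 1.0, 1.0])
upweighted = weighted_mean_loss(losses, [1.0, 1.0, 4.0])
print(equal, upweighted)
```

Raising the weight of the rare-class example makes its error dominate the total, so the gradient pushes harder to fix it.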

# `Layer` specialization

A `Layer` in Keras is an abstract (Python) class
- instantiating a sub-class returns a callable object that behaves like a function
    - That maps the input of the layer to its output

We have used specific instances of `Layer` objects (e.g., `Dense`) as arguments in the list passed to the `Sequential` model type.

We can also use instances in the Functional Model.

For example
- `Dense(10)`
    - Is the constructor for a fully connected layer instance with 10 units
    - The constructor returns a callable object (it can be used like a function)
    - The function maps the layer inputs to the outputs of the computation defined by the layer

So you will see code fragments like

    x = Input(shape=(784,))
    x = Dense(10, activation="softmax")(x)
    
- Re-using the variable `x` as the output of the current layer


When the function is invoked, the Layer's `call` method is used
- `call` gets invoked implicitly by "parenthesized argument" juxtaposition
    - e.g., `Dense(10)(x)`
    - is similar to `obj = Dense(10); result = obj.call(x)`
- The function maps the inputs to the layer to the output
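
The calling convention can be imitated in a few lines of plain Python. `ToyLayer` and `Scale` below are invented for illustration; the real Keras base class does considerably more bookkeeping before it defers to `call`.

```python
class ToyLayer:
    """Framework-free imitation of the calling convention; not the Keras base class."""

    def __call__(self, inputs):
        # The base class would do its bookkeeping here, then defer to `call`
        return self.call(inputs)

    def call(self, inputs):
        raise NotImplementedError("sub-classes define the computation here")


class Scale(ToyLayer):
    def __init__(self, factor):
        self.factor = factor

    def call(self, inputs):
        return [self.factor * v for v in inputs]


x = [1.0, 2.0, 3.0]
y = Scale(10)(x)   # construct the layer, then apply it with function-call syntax
print(y)
```

A sub-class only has to supply `call`; the function-call syntax comes for free from the base class.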



Overriding `call` allows us to define a new `Layer` sub-class.

For example, [here](https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/nlp/ipynb/neural_machine_translation_with_transformer.ipynb#scrollTo=AupqfNAYaCHn&line=2&uniqifier=1) is the code defining some new `Layer` types that will be used to create a `Transformer` layer type.

The output of `Dense(10)` is a Tensor with final dimension equal to the number of units (e.g., 10)
- The Tensor has leading dimensions too
    - e.g., the implicit "batch index" dimension
    - since the layer takes a mini-batch of examples (rather than a single example) as input
- It may have *additional* dimensions too !
    - Just like `numpy`: threading over additional dimensions
    - e.g., if input is shape $(\text{minibatch_size} \times n_1 \times n_2)$
        - output is shape $(\text{minibatch_size} \times n_1 \times 10)$
        - `Dense` operates over the final dimension

# Studying advanced models

The best way to learn is to study the code of some non-trivial models

## Transformer: Custom layers, Skip connections, Layer Norm

We have already seen part of the Transformer in introducing the basics of the Functional model.

We use the rest of this example to discover other advanced Keras techniques:

[Transformer layer](https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/nlp/ipynb/neural_machine_translation_with_transformer.ipynb#scrollTo=4DaEQr-lMkSs)
- [Custom layers](https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/nlp/ipynb/neural_machine_translation_with_transformer.ipynb#scrollTo=Ywxggf9Anabk&line=8&uniqifier=1)
- Layer Norm
- Skip connections

### Custom layers: subtle point

Let's look at the constructor for the `TransformerEncoder` custom layer, as an example

    class TransformerEncoder(layers.Layer):
        def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
            super(TransformerEncoder, self).__init__(**kwargs)
            self.embed_dim = embed_dim
            self.dense_dim = dense_dim
            self.num_heads = num_heads
            self.attention = layers.MultiHeadAttention(
                num_heads=num_heads, key_dim=embed_dim
            )
            self.dense_proj = keras.Sequential(
                [layers.Dense(dense_dim, activation="relu"), layers.Dense(embed_dim),]
            )
            self.layernorm_1 = layers.LayerNormalization()
            self.layernorm_2 = layers.LayerNormalization()
            self.supports_masking = True


The custom layer consists of a collection of component layers

Why are the component layers (e.g., `Dense`, `MultiHeadAttention`, `LayerNormalization`) instantiated in the class constructor
- As opposed to being instantiated in the `call` method ?

Had we instantiated each component within the `call` method
- There would be a *new instance* of each component *each time the layer was called on an example* in training !
- Each instance would have *its own weights*
- So training would not "learn" between examples
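
The difference can be demonstrated with a toy layer holding a single random weight. All three classes here are invented for illustration and only mimic the relevant behavior.

```python
import random

random.seed(0)


class ToyDense:
    """Toy layer with one randomly initialized weight; stands in for a real Keras layer."""

    def __init__(self):
        self.w = random.random()

    def __call__(self, x):
        return self.w * x


class SharedWeights:
    def __init__(self):
        self.dense = ToyDense()   # created once, reused by every call

    def __call__(self, x):
        return self.dense(x)


class FreshWeightsEachCall:
    def __call__(self, x):
        dense = ToyDense()        # a brand-new weight on every call
        return dense(x)


shared = SharedWeights()
fresh = FreshWeightsEachCall()

shared_outputs = {shared(1.0) for _ in range(5)}
fresh_outputs = {fresh(1.0) for _ in range(5)}
print(len(shared_outputs), len(fresh_outputs))
```

The layer built in the constructor gives the same answer on every call because it keeps one weight; the layer built inside the call forgets its weight as soon as the call returns, so there is nothing for training to improve.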

### Other custom layers of interest

We can dig deeper to examine how the Attention layers are implemented in code:
- [Scaled dot-product attention](https://www.tensorflow.org/text/tutorials/transformer#scaled_dot_product_attention)
- [Multi-head attention](https://www.tensorflow.org/text/tutorials/transformer#multi-head_attention)

## VAE: Custom Model -- Custom Layer, Training Loop, the Gradient Tape

We use this example to show
- The Functional model
- A custom `Layer`: `Sampling`
- A custom training step
- Inverting a Convolution: `Conv2DTranspose`

Recall the architecture of a Variational Autoencoder (VAE)

<table>
    <tr>
        <th><center>Variational Autoencoder (VAE)</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_VAE.png"></td>
    </tr>
</table>

A key step is drawing a random latent vector $\z^\ip$
from a distribution with mean $\mu$ and standard deviation $\sigma$.

This [cell](https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/generative/ipynb/vae.ipynb#scrollTo=Wx_jzzcPtcfz)
- Creates a custom `Layer` type called `Sampling` to perform the random sampling
- By overriding the base `Layer` `call` method
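
The computation inside such a layer is short. Here is a framework-free sketch, assuming (as is common) that the Encoder produces the *log of the variance* rather than $\sigma$ directly; the function name and arguments are hypothetical.

```python
import math
import random


def sample_latent(z_mean, z_log_var, rng):
    """Draw z = mean + sigma * epsilon, with epsilon ~ N(0, 1) drawn independently per coordinate."""
    z = []
    for m, log_var in zip(z_mean, z_log_var):
        sigma = math.exp(0.5 * log_var)   # log-variance -> standard deviation
        epsilon = rng.gauss(0.0, 1.0)
        z.append(m + sigma * epsilon)
    return z


rng = random.Random(0)
# Second coordinate has a tiny variance, so its sample stays very close to its mean
z = sample_latent([0.0, 5.0], [0.0, -50.0], rng)
print(z)
```

Writing the sample as mean plus scaled noise keeps the randomness in `epsilon`, so gradients can still flow to the mean and the variance.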

In [this cell](https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/generative/ipynb/vae.ipynb#scrollTo=4rFiHtbCtcf0) we can see that the Encoder and Decoder are both implemented as Functional models
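- The Encoder produces a *pair of outputs* describing the distribution: $(\mu, \sigma)$
    - in the code these appear as `z_mean` and `z_log_var` (the log of the variance $\sigma^2$)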



**Issues**
- Custom **model** (not layer) class VAE
- The *reconstruction loss* depends on the output of the Decoder part of the VAE
    - No other obvious way to define this loss aside from a **custom training** step
- Because we are computing the Loss in the training step
    - We must compute the gradient of the Loss w.r.t. the weights (using the **gradient tape**)
    - And update the weights ourselves

[Variational Autoencoder (VAE) from github](https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/generative/ipynb/vae.ipynb#scrollTo=DEU05Oe0vJrY)
- [VAE: Custom train step](https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/generative/ipynb/vae.ipynb#scrollTo=0EHkZ1WCHw9E)
    - Complex loss
    

## Visualizing what CNNs learn: Gradient Ascent and the Gradient Tape

[Visualizing what Convnets learn](https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/vision/ipynb/visualizing_what_convnets_learn.ipynb#scrollTo=K8ITQAj7FTZd)
- The Gradient Tape
- Maximize utility (negative loss)
    - mean (across the spatial dimensions) of one feature map in a multi-layer CNN
    - the "weights" being solved for are the pixels of the input image !

We use this example to show how powerful the Gradient Tape is

[Gradient ascent](https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/vision/ipynb/visualizing_what_convnets_learn.ipynb#scrollTo=a9hZnslRFTZZ)
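
The essence of gradient ascent on the *input* fits in a few lines. In this toy sketch a scalar `x` plays the role of the image and a hand-written derivative plays the role of the Gradient Tape; the utility function is made up.

```python
def utility(x):
    # Toy stand-in for "mean activation of one feature map"; maximized at x = 3
    return -(x - 3.0) ** 2


def grad_utility(x):
    # Derivative of utility w.r.t. the *input* x, worked out by hand;
    # in the notebook the Gradient Tape computes this automatically.
    return -2.0 * (x - 3.0)


x = 0.0               # the "image" being optimized
learning_rate = 0.1
for _ in range(100):
    x += learning_rate * grad_utility(x)   # ascent: step *along* the gradient

print(x, utility(x))
```

Note the sign: the update adds the gradient, whereas a weight update in training subtracts it.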


## Factor Models and Autoencoders: Threading

We use this example to show
- A Functional model applied to a common problem in Finance.
- Threading

We will cover the Finance aspects of this in a [separate module](Autoencoder_for_conditional_risk_factors.ipynb)

For now, I want to focus on the idea and the code

Here is the code, excerpted from the [notebook](https://github.com/stefan-jansen/machine-learning-for-trading/blob/main/20_autoencoders_for_conditional_risk_factors/06_conditional_autoencoder_for_asset_pricing_model.ipynb)

    def make_model(hidden_units=8, n_factors=3):
        input_beta = Input((n_tickers, n_characteristics), name='input_beta')
        input_factor = Input((n_tickers,), name='input_factor')

        hidden_layer = Dense(units=hidden_units, activation='relu', name='hidden_layer')(input_beta)
        batch_norm = BatchNormalization(name='batch_norm')(hidden_layer)

        output_beta = Dense(units=n_factors, name='output_beta')(batch_norm)

        output_factor = Dense(units=n_factors, name='output_factor')(input_factor)

        output = Dot(axes=(2,1), name='output_layer')([output_beta, output_factor])

        model = Model(inputs=[input_beta, input_factor], outputs=output)
        model.compile(loss='mse', optimizer='adam')
        return model

Not obvious what is going on here.

A picture will help:


<table>
    <tr>
        <th><center>Autoencoder for Conditional Risk Factors</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_for_conditional_risk_factors.png" width="90%"></td>
    </tr>
</table>


$$
\newcommand{\R}{\mathbf{R}}
\newcommand{\r}{\mathbf{r}}
\newcommand{\F}{\mathbf{F}}
\newcommand{\V}{\mathbf{V}}
\newcommand{\ntickers}{{n_\text{tickers}}}
\newcommand{\ndates}{{n_\text{dates}}}
\newcommand{\nfactors}{{n_\text{factors}}}
\newcommand{\nchars}{{n_\text{chars}}}
\newcommand{\dp}{{(d)}}
\newcommand{\sp}{{(s)}}
\newcommand{\Bbeta}{\mathbf\beta}
$$

### Threading 

Let's focus on the `Dense` layer corresponding to the box labelled "Beta" in the picture

- `Dense` $( \nfactors )$

From the diagram you will notice that 
- the input to this layer is *two dimensional*: $(\ntickers \times \nchars)$
- the output of this layer is *two dimensional*: $(\ntickers \times \nfactors)$

We have not yet seen multi-dimensional input/output in regard to a `Dense` layer

What is going on here ?

The layer is implementing a function with signature
- $\text{Dense}(\nfactors) : (\ntickers \times \nchars) \mapsto (\ntickers \times \nfactors)$

Tensorflow/Keras works on higher dimensional objects just like NumPy: 
- [threading](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) over "extra" dimensions

If the input to layer $\ll$ is shape $(\dim_{\llp,1} \times \dim_{\llp,2} \times \ldots \times \dim_{\llp,N} \times n_\llp )$
- And the layer type operates over a *single* dimension (usually the last dimension)
    - producing output shape $n_{(\ll+1)}$

Then threading treats the input
- as a collection of $(\dim_{\llp,1} \times \dim_{\llp,2} \times \ldots \times \dim_{\llp,N})$ instances, each of shape $n_\llp$

Producing an output of shape
- $(\dim_{\llp,1} \times \dim_{\llp,2} \times \ldots \times \dim_{\llp,N} \times n_{(\ll+1)} )$

In our case
- Input shape is $(\ntickers \times \nchars)$
- The `Dense` layer is defined with $\nfactors$ units ($n_{(\ll+1)} = \nfactors$)
- Hence, the output shape is $(\ntickers \times \nfactors)$

The weight matrix for this layer
- $\W_\beta$ with shape $( \nfactors \times \nchars )$
    - just like any `Dense` layer; number of weights is *independent* of threading
- applies the *same weights* to each of the $\ntickers$ rows of the input
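
Threading can be spelled out explicitly with nested lists. This is a sketch with made-up numbers and no bias term; `dense` and `dense_threaded` are hypothetical helpers, not Keras functions.

```python
def dense(row, W):
    """Apply weights W (n_out x n_in) to a single input vector of length n_in; no bias."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(w_i, row)) for w_i in W]


def dense_threaded(X, W):
    """Apply the same W independently to every row of X (n_rows x n_in)."""
    return [dense(row, W) for row in X]


n_tickers, n_chars, n_factors = 3, 4, 2
X = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 0.0, 1.0],
     [1.0, 0.0, 2.0, 1.0]]        # rows 0 and 2 are identical on purpose
W_beta = [[1.0, 2.0, 0.0, -1.0],
          [0.5, 0.0, 1.0, 0.0]]   # shape (n_factors x n_chars)

out = dense_threaded(X, W_beta)
print(out)
```

The output has one row per ticker and one column per factor, and identical input rows produce identical output rows because a single weight matrix is shared.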


## Neural Style Transfer: Feature extractor, Training Loop

We use this example to illustrate
- [Complex Loss and Training Loop](https://keras.io/examples/generative/neural_style_transfer/#compute-the-style-transfer-loss)
- [Feature extractor](https://keras.io/examples/generative/neural_style_transfer/#compute-the-style-transfer-loss)

[Here](https://www.tensorflow.org/tutorials/generative/style_transfer) is a tutorial view of the notebook.

## Autoencoder: Functional model


<!--- #include (Autoencoder_example.ipynb)) --->
 [Autoencoder example from github](https://colab.research.google.com/github/kenperry-public/ML_Spring_2023/blob/master/Autoencoder_example.ipynb)
- Functional model

**Issues**
- We could use a Sequential model with initial Encoder layers and final Decoder layers
    - But we would not be able to independently access either the Encoder or the Decoder as an isolated model

## GAN

We use this example to show
- A custom training step
- Inverting a Convolution: `Conv2DTranspose`

Recall: the training of a GAN is an iterative process among two "players"
- the Discriminator
- the Generator

### Custom `train_step`

Here is a summary from our introductory module on [GANs](GAN_Overview.ipynb)

**Competitive training**

Iteration $\tt$

- Train $D_{\Theta_{D, (\tt-1)}}$ on samples
    - $\tilde{\x} \in p_\text{data} \cup p_{\text{model}, (\tt-1)}$
        - where $G_{\Theta_{G, (\tt-1)}} ( \z) \in p_{\text{model}, (\tt-1)}$
    - Update $\Theta_{D, (\tt-1)}$ to $\Theta_{D, \tp}$ via gradient $\frac{\partial \loss_D}{\partial \Theta_{D,(\tt-1)}}$
        - $D$ is a maximizer of $\int_{\x \in p_\text{data}} \log D(\x) + \int_{\z \in p_\z} \log ( \, 1 - D(G(\z)) \, )$
- Train $G_{\Theta_{G, (\tt-1)}}$ on random samples $\z$
    - Create samples $\hat{\x}_\tp \in G_{\Theta_{G, (\tt-1)}}(\z)  \in p_\text{model}$
    - Have Discriminator $D_{\Theta_{D, \tp}}$ evaluate $D_{\Theta_{D,\tp}} ( \hat{\x}_\tp )$
    - Update $\Theta_{G, (\tt-1)}$ to $\Theta_{G, \tp}$ via gradient $\frac{\partial \loss_G}{\partial \Theta_{G,(\tt-1)}}$
        - $G$ is a minimizer of $\int_{\z \in p_\z} \log ( \, 1 - D(G(\z)) \, )$
            - i.e., want $D(G(\z))$ to be high
    - May update $G$ multiple times per update of $D$

In Keras, one can override a `Model`'s `train_step` method in order to replace the
treatment of a single mini-batch of examples.

Let's see how this is applied to a [simple GAN](https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/generative/ipynb/dcgan_overriding_train_step.ipynb#scrollTo=OzzFLmKIpv0j)

Key points to observe:
- Discriminator trained first
    - Create examples from real and fake images, to be fed to Discriminator
    
          # Decode them to fake images
          generated_images = self.generator(random_latent_vectors)

          # Combine them with real images
          combined_images = tf.concat([generated_images, real_images], axis=0)

- Train the Discriminator on the combined real/fake images and update its weights

            # Train the discriminator
            with tf.GradientTape() as tape:
                predictions = self.discriminator(combined_images)
                d_loss = self.loss_fn(labels, predictions)
            grads = tape.gradient(d_loss, self.discriminator.trainable_weights)
            self.d_optimizer.apply_gradients(
                zip(grads, self.discriminator.trainable_weights)
            )

- Train the Generator
    - Have it create fake images from random, latent vectors
    - Let the Discriminator evaluate these fakes
    - Update Generator weights to better be able to fool Discriminator
    
    
        # Assemble labels that say "all real images"
        misleading_labels = tf.zeros((batch_size, 1))

        # Train the generator (note that we should *not* update the weights
        # of the discriminator)!
        with tf.GradientTape() as tape:
            predictions = self.discriminator(self.generator(random_latent_vectors))
            g_loss = self.loss_fn(misleading_labels, predictions)
        grads = tape.gradient(g_loss, self.generator.trainable_weights)
        self.g_optimizer.apply_gradients(zip(grads, self.generator.trainable_weights))


### `Conv2DTranspose`: Inverting a Convolution

A CNN layer
- creates new features, for each element of the spatial dimension of the layer input
- May "down-sample" the input (reduce spatial dimension)
    - Using a stride greater than 1
    
We can "invert" the Convolution (its effect on the shape, not the values it computed), and "up-sample" (increase the spatial dimension)
- with the `Conv2DTranspose` layer type
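
The effect on the spatial dimension can be checked with simple arithmetic. This sketch assumes `"same"` padding and no extra output padding, under which (as I understand Keras's shape rules) a strided convolution divides the length by the stride and a strided transposed convolution multiplies it. The helper names are hypothetical.

```python
import math


def conv_same_length(n_in, stride):
    """Spatial length after a strided convolution with "same" padding."""
    return math.ceil(n_in / stride)


def conv_transpose_same_length(n_in, stride):
    """Spatial length after a strided transposed convolution with "same" padding."""
    return n_in * stride


down = conv_same_length(64, stride=2)
up = conv_transpose_same_length(down, stride=2)
print(down, up)
```

Only the shape is recovered by the round trip; the transposed layer has its own learned weights and does not reproduce the original pixel values.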


In this [cell](https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/generative/ipynb/dcgan_overriding_train_step.ipynb#scrollTo=OrfSmOevpv0h)
- We can see the Discriminator using Convolutional layers with down-sampling
- And the Generator using transposed Convolutional layers with up-sampling

[Simple GAN](https://keras.io/examples/generative/dcgan_overriding_train_step)
- [Custom train step: GAN training](https://keras.io/examples/generative/dcgan_overriding_train_step/#override-trainstep)             

## Wasserstein GAN with Gradient Penalty
[Wasserstein GAN with Gradient Penalty](https://keras.io/examples/generative/wgan_gp/#create-the-wgangp-model)
- [Gradient Tape: used for loss term, rather than weight update](https://keras.io/examples/generative/wgan_gp/#create-the-wgangp-model)
- [Override `compile`](https://keras.io/examples/generative/wgan_gp/#create-the-wgangp-model)
- [Custom train step: GAN training](https://keras.io/examples/generative/wgan_gp/#create-the-wgangp-model)

