In [1]:
%run Latex_macros.ipynb
%run beautify_plots.py


In [4]:
%%latex

$$
\newcommand{\statf}{\mathbf{s}} 

\def\timess#1#2{(#1:#2)}
$$


**References**

- [paper](https://proceedings.neurips.cc/paper/2019/file/c9efe5f26cd17ba6216bbe2a7d26d490-Paper.pdf)
- [supplement](https://www.vanderschaar-lab.com/papers/NIPS2019_TGAN_Supplementary.pdf)
- [github](https://github.com/vanderschaarlab/mlforhealthlabpub/tree/main/alg/timegan)

# Introduction


A GAN learns to produce synthetic ("fake") feature vectors $\hat{\x}$ of length $n$ that
are plausible elements $\x \in \pdata$.
- The relationship between two features $\hat{\x}_j, \hat{\x}_{j'}$ of a synthetic $\hat{\x}$ should be consistent with the relationship between $\x_j, \x_{j'}$ of a real $\x$
- **cross sectional**
    - consistency between the pixels of an image

But much data (especially in Finance) *also* has a **time** dimension
- $\x$ is two dimensional
    - $\x_\tp$ is a vector of $n$ features, representing the state of the world at time $\tt$


The goal of a generative model would be to generate examples that are sequences (of length $T$) where each element is a vector of length $n$.
- Each training example is a sequence $\x_{(1)}, \ldots \x_{(T)}$
- Generate synthetic $\hat{\x}_{(1)}, \ldots \hat{\x}_{(T)}$
- with both forms of consistency
    - cross sectional between features at a fixed time $\tt$: $\hat{\x}_{\tp, j}$ and $\hat{\x}_{\tp, j'}$
    - sequential ("time series") between the feature vectors at different time steps: $\hat{\x}_\tp$ and $\hat{\x}_{(\tt+1)}$

Following the notation of the paper
- we use Python-like notation for sequences

$$
\x_{\timess{1}{T}} = \x_{(1)}, \ldots \x_{(T)}
$$

We could try to encode a timeseries relationship as a pseudo cross-sectional relationship
- flatten $(T \times n)$ vector $\x$ into a 1D vector of length $(T * n)$
- the relationship between pairs of vector elements at distance $n$ would be one step of time

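This is unlikely to work well for common time series relationships, such as autoregressive ones, because nothing in the flattened vector tells the model which elements are one time step apart.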
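To make the flattening idea concrete, here is a small pure-Python sketch. The toy values and the helper `flat_index` are illustrative assumptions, not taken from the paper.

```python
# Toy sequence: T = 3 time steps, n = 2 features per step (values are arbitrary).
T, n = 3, 2
x = [[10, 11],   # x_(1)
     [20, 21],   # x_(2)
     [30, 31]]   # x_(3)

# Flatten row by row into a single vector of length T * n.
flat = [value for step in x for value in step]

def flat_index(t, j):
    """0-based position of feature j at time step t in the flattened vector."""
    return t * n + j

# The same feature one time step later sits exactly n positions further along.
pairs = [(flat[flat_index(t, 0)], flat[flat_index(t + 1, 0)]) for t in range(T - 1)]
print(flat)   # the flattened vector
print(pairs)  # (feature 0 at t, feature 0 at t+1) for each consecutive pair
```

The time structure survives only implicitly, as a fixed stride of $n$ between positions; a model working on `flat` would have to discover that stride on its own.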

The *Time GAN (TGAN)* is a GAN (Generator/Discriminator pair) with an *extra* component: the *Supervisor*.

The Supervisor is responsible for constraining the Generator to produce sequences with sequential properties

- learns an autoregressive model of $\x_{\timess{1}{T}}$
- creates a predicted $\hat\x_{\timess{1}{T}}$
    - creates a single element $\hat\x_\tp$ at a time
    - generating $\hat\x_\tp$ conditional on $\hat{\x}_{\timess{1}{\tt-1}}$
- creates a *Supervised Loss* which is added to the standard Generator Loss


The Supervised Loss enforces the sequence dynamic as a constraint on one step ahead elements of the sequence.

Thus, the Generator is encouraged to produce sequences that
- not only fool the Discriminator
- but also exhibit sequential properties

There is one additional component to the TGAN.

Rather than having the Generator/Discriminator pair work on elements of $\x_\tp$
- they both work on  reduced dimensional *encodings* of $\x_\tp$ denoted as $\h_\tp$
- the reduced dimensional encoding  $\h_\tp$ is called an *embedding* of $\x_\tp$
    - same idea as word embeddings

The embedding is created by the *Embedder*
- The Encoder half
- of an Autoencoder (Encoder/Decoder pair)

This makes for a lot of moving parts (and a lot of notation).

Each component contributes at least one Loss to the total *multi-part* Loss.
- Generator Loss
- Discriminator Loss
- Supervisor Loss
- Autoencoder Reconstruction Loss

It also leads to options for how to train the components.

Is the Autoencoder trained
- independently of the GAN + Supervisor components
- or jointly with them ?

**Training scheme** (from paper)
<img src="https://miro.medium.com/max/1400/1*Dh4JyfFTjkz6D2W4eDuMKg.png">

# Details

**Notation**

text            | Description |
:--|:--|
$$\statf$$ | static feature
$$\x_{\timess{1}{T}}$$ | Temporal feature
$$\tilde\x_{\timess{1}{T}}$$ |  reconstructed (by Autoencoder) $\x$
| $\tilde\x_{\timess{1}{T}} =$ `autoencoder`$(\x_{\timess{1}{T}})$
$$\hat{\x}_{\timess{1}{T}}$$ | fake data |
| $\hat{\x}_{\timess{1}{T}} = $ `decoder`$( \hat\h )$ |
$$\statf, \x_{\timess{1}{T}}$$ | real data: $\statf$ is the non-sequence input, $\x_{\timess{1}{T}}$ is the sequence input |
$$\tilde{\statf}, \tilde{\x}_{\timess{1}{T}}$$ | reconstructed real data (autoencoder output) |
$$\z_s, \z_{\timess{1}{T}}$$ | random vector (generator input) |
$$\h_s, \h_{\timess{1}{T}}$$ | real data in latent space (encoder output) |
| $\h_{\timess{1}{T}} =$ `embedder`$(\x_{\timess{1}{T}})$ |
$$\hat\e_s, \hat\e_{\timess{1}{T}}$$ | fake data (generator output) in latent space (encoder output) |
| $\hat\e_{\timess{1}{T}} =$ `generator`$(\z )$|
$$\hat\h_s, \hat\h_{\timess{1}{T}}$$ | fake data (generator + supervisor output) in latent space (encoder output) |
| $\hat\h_{\timess{1}{T}} =$ `supervisor`$(\hat\e )$|
$$\hat{\hat\h}_s, \hat{\hat\h}_{\timess{1}{T}}$$ | fake data (real + supervisor output) in latent space (encoder output) |
| $\hat{\hat\h}_{\timess{1}{T}} =$ `supervisor`$(\h )$|
$$\y$$       | Boolean: Real/Fake Discriminator on real data |
$$\hat{\y}$$ | Boolean: Real/Fake Discriminator on fake data (generator + supervisor output) |
$$\hat{\hat{\y}}$$ | Boolean: Real/Fake Discriminator on fake data (real + supervisor output) |


**Losses**

name         | Definition &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | Description |
:---|:--|:--
embedding_loss | $\text{MSE}(\x_{\timess{1}{T}}, \tilde\x_{\timess{1}{T}})$ | Autoencoder reconstruction error 
e_loss | $\text{embedding_loss}^{0.5}  +  \text{generator_loss}_\text{supervised} $ | Embedder Loss
| | = Autoencoder reconstruction error + one step ahead prediction error


## Examples, real and synthetic

To be fully general, the paper *also* allows examples to have 2 parts
- static (non-sequence)
- sequence

Thus
- a real example is a pair $\statf, \x_{\timess{1}{T}}$

As we are primarily concerned with the sequence properties of $\x_{\timess{1}{T}}$, 
we may omit $\statf$ to simplify the presentation.

We will assume that each element $\x_\tp$ of (real and fake) sequences $\x_{\timess{1}{T}}$ is a vector of length $n$
- for example, the time $\tt$ characteristics of each of $n$ tickers

We don't need to assume that $\statf$ is also a vector of length $n$
- e.g., no need to assume a unique characteristic per ticker

There are *two kinds* of synthetic sequences (both sequences of embeddings rather than raw inputs)
- $\hat{\h}_{\timess{1}{T}}$: produced by the Supervisor, given input from the Generator

$$
\hat{\h}_{\timess{1}{T}} = \text{supervisor}( \text{generator}(\z) )
$$

- $\hat{\hat{\h}}_{\timess{1}{T}}$: produced by the Supervisor, given (the embedding of) **real** data $\x$

$$
\begin{array} \\
\h_{\timess{1}{T}} & = & \text{embedder}(\x_{\timess{1}{T}}) \\
\hat{\hat{\h}}_{\timess{1}{T}} & = & \text{supervisor}( \h_{\timess{1}{T}}) \\
\end{array}
$$

# Loss functions: High level view of objectives

The *Time GAN (TGAN)* tries to achieve both cross sectional and sequence objectives by a multi-part Loss function.

The first objective (enforced by a loss function) is the familiar GAN objective
- $\pmodel \approx \pdata$ for some definition of equality of distributions
- Also called: the *Unsupervised Loss*

The second objective is sequence related
- *Conditional* (one step ahead) distributions of Fake and Real are equal
- $\pmodel( \x_\tp | \x_{\timess{1}{\tt-1}} ) \approx \pdata( \x_\tp | \x_{\timess{1}{\tt-1}} )$
- Also called: the *Supervised Loss*

We will use the KL Divergence as our measure of "dissimilarity" of two distributions
- Just like in the plain GAN: 
    - under the assumption that the optimal Discriminator can be found, the KL Divergence turns into the JSD.

As we shall see
- there are multiple steps to training the TGAN
    - independent training of the Embedder
    - independent training of the Supervisor
    - joint training of all components
    
Thus, there will be a separate Loss function for each training step.

The Losses will be multi-part: consisting of sums of terms for sub-losses.

**Notation**

We use `BCE` as a function that computes *Binary Cross Entropy*
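
A minimal pure-Python sketch of what `BCE` computes. It assumes the second argument holds probabilities in $(0, 1)$ rather than logits (the text above does not say which); the helper name `bce` and the clipping constant `eps` are illustrative.

```python
import math

def bce(targets, predictions, eps=1e-7):
    """Mean binary cross entropy between 0/1 targets and predicted probabilities."""
    total = 0.0
    for y, p in zip(targets, predictions):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(targets)

# Targets of all 1's, as in BCE(1's, y) for real examples.
confident_and_right = bce([1, 1], [0.9, 0.9])
undecided = bce([1, 1], [0.5, 0.5])
print(confident_and_right, undecided)
```

The loss shrinks as the predicted probabilities move toward the targets, which is why the same Discriminator output can be scored against 1's (Generator's view) or 0's (Discriminator's view of fakes).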

# Embedder (Autoencoder)

Rather than working on "raw" examples $\statf, \x_{\timess{1}{T}}$ the model creates an *embedding* of the example
- lower dimensional representation
- that preserves "semantics"
    - same idea as embedding words, that is, changing the representation of a word
        - from a categorical encoded by a *sparse*, very long OHE vector
        - to a shorter, *dense* vector


The embedding will be created by an Autoencoder
- Given input example 
- Pass it through an Encoder: a bottle-neck that reduces the dimensions
- Have the Decoder try to reconstruct the input, given the reduced dimension representation

So the Embedder is implemented by the Encoder half of an Autoencoder.

The Encoder
- takes input $\statf, \x_{\timess{1}{T}}$
- outputs reduced dimension embedding/latent $\h_s, \h_{\timess{1}{T}}$

The Decoder
- takes input $\h_s, \h_{\timess{1}{T}}$
- outputs $\tilde\statf, \tilde\x_{\timess{1}{T}}$

The Encoder/Decoder pair is trained with the objective
$$
\tilde\statf, \tilde\x_{\timess{1}{T}} \approx \statf, \x_{\timess{1}{T}}
$$
    
That is: the Decoder attempts to create the best reconstruction of the Encoder input, given the embedding.
    


Both the Encoder and Decoder 
- must be autoregressive
    - generate embedding of future element, conditional on the past elements
- must obey *causal ordering* of the sequence
    - can't look at any of $\x_{\timess{t+1}{}}$ when embedding $\x_\tp$
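
The causal-ordering requirement can be illustrated with a toy stand-in for the Encoder. The function `causal_embed` below is a hypothetical running mean, not the recurrent network used by TimeGAN, and it does not reduce the dimension; it only shows that the embedding at step $t$ never reads later elements.

```python
def causal_embed(x_seq):
    """Toy 'embedding': h_t is the running mean of x_1..x_t, per feature.

    Only a stand-in for a recurrent Encoder; the point is that h_t
    never reads x_{t+1}, x_{t+2}, ...
    """
    n = len(x_seq[0])
    running_sum = [0.0] * n
    h_seq = []
    for t, x_t in enumerate(x_seq, start=1):
        running_sum = [s + v for s, v in zip(running_sum, x_t)]
        h_seq.append([s / t for s in running_sum])
    return h_seq

x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
x_future_changed = [[1.0, 2.0], [3.0, 4.0], [100.0, -100.0]]

h = causal_embed(x)
h_changed = causal_embed(x_future_changed)

# Changing only the last step leaves the first two embeddings untouched.
print(h[:2] == h_changed[:2])
```

A non-causal encoder (for example, one that averages over the whole sequence) would fail this check, because a change at the last step would leak into the earlier embeddings.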


## Why use embeddings rather than raw data ?

One obvious reason is space considerations
- Having the GAN work with lower dimensional embeddings rather than high dimensional raw input

The less obvious (perhaps) reason comes from our experience with sequence data such as time series of returns
- PCA analysis shows that highly reduced dimensions preserve much of the variation
- Factor models propose that a small number of common factors are responsible for the variation of the cross section of stock returns

## Autoencoder training (independent pass)

There are two obvious choices for training the Autoencoder
- Independent of the GAN Generator/Discriminator + Supervisor
    - That is: just "learn" how to compress sequences without worrying about where the sequences come from
- Jointly with the GAN + Supervisor   

The authors chose
- an initial training of the Autoencoder independently
- followed by a joint training with the GAN + Supervisor

For the independent training of the Autoencoder we use the
standard Reconstruction Loss of an Autoencoder:

$$
\begin{array} \\
\loss_R & = & 
\text{embedding_loss} & = & \text{MSE}(\x_{\timess{1}{T}}, \tilde\x_{\timess{1}{T}}) & \text{Reconstruction Loss} \\
\end{array}
$$

**This is the only Loss Term** that is optimized in the independent training.

We defer the Autoencoder loss term that arises from joint training until after we introduce the Supervisor.
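
As a sketch of the Reconstruction Loss, the helper `mse` below averages squared errors over all time steps and features. The sequences `x` and `x_tilde` are made-up values standing in for a real example and a hypothetical Autoencoder output.

```python
def mse(a_seq, b_seq):
    """Mean squared error over all time steps and features of two sequences."""
    squared = [(a - b) ** 2
               for a_t, b_t in zip(a_seq, b_seq)
               for a, b in zip(a_t, b_t)]
    return sum(squared) / len(squared)

x = [[1.0, 2.0], [3.0, 4.0]]         # real sequence, T = 2, n = 2
x_tilde = [[1.0, 2.5], [2.0, 4.0]]   # hypothetical Autoencoder output

embedding_loss = mse(x, x_tilde)
print(embedding_loss)
```

A perfect reconstruction gives a loss of zero; any loss above zero measures what the bottleneck failed to preserve.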

# Supervisor

The Supervisor is a Recurrent Network that 
helps in learning an autoregressive model of $\x_{\timess{1}{T}}$.

However: it works on the embeddings of data rather than the raw data
- creates a predicted $\h_{\timess{1}{T}}$
    - creates a single element $\h_\tp$ at a time
    - generating $\h_\tp$ conditional on $\h_{\timess{1}{\tt-1}}$
- rather than creating a predicted $\x_{\timess{1}{T}}$
    - if desired, can map the predicted $\h_{\timess{1}{T}}$ to predicted $\x_{\timess{1}{T}}$ using the Decoder half of the Autoencoder
    $$
    \hat{\x}_{\timess{1}{T}} = \text{decoder}( \hat\h )
    $$

There are two sources of embeddings that are fed to the Supervisor.

The Supervisor maps the given sequence of embeddings into an autoregressive sequence.

The two inputs to the Supervisor, and their outputs are
- Embeddings of real examples $\x$
$$
\begin{array} \\
\h_{\timess{1}{T}} & = & \text{embedder}(\x_{\timess{1}{T}}) \\
\hat{\hat\h}_{\timess{1}{T}} & = & \text{supervisor} (\h ) \\
\end{array}
$$

- Embeddings created by the Generator (i.e., fake examples, in latent space)

$$
\begin{array} \\
\hat\e_{\timess{1}{T}} & = & \text{generator} (\z ) \\
\hat\h_{\timess{1}{T}}  & = & \text{supervisor}(\hat\e )
\end{array}
$$


## Supervisor training (independent pass)

There is a *Supervised Loss* associated with the Supervisor.

It is a measure of the quality of the embedding as input to the Supervisor
- Given the embedding, can the Supervisor construct a lossless synthetic sequence ?
    - Note: this synthetic is derived from a **real** example $\x$, **independent** of the Generator
    
$$
\begin{array} \\
\h_{\timess{1}{T}} & = & \text{embedder}(\x_{\timess{1}{T}}) \\
\hat{\hat\h}_{\timess{1}{T}} & = & \text{supervisor}( \h ) \\
\loss_S & = & 
\text{generator_loss}_\text{supervised} & = & \text{MSE}( \h_{\timess{1}{}}, \hat{\hat\h}_{\timess{1}{T}}) \\
\end{array}
$$

As you can see, $\loss_S$ depends on the behavior of the Encoder side (`embedder`) of the Autoencoder.

But just as we had an initial independent training of the Autoencoder
- we have an initial independent training of the Supervisor

During joint training, we will use Loss functions that cause the weights
of the `embedder` and `supervisor` to update together.
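
The Supervised Loss can be sketched in pure Python as follows. The sketch assumes a one-step shift between the two arguments: the Supervisor's output at step $t$ is scored against the real embedding at step $t+1$. The two toy supervisors are hypothetical stand-ins, not trained networks.

```python
def mse(a_seq, b_seq):
    """Mean squared error over all time steps and features of two sequences."""
    squared = [(a - b) ** 2
               for a_t, b_t in zip(a_seq, b_seq)
               for a, b in zip(a_t, b_t)]
    return sum(squared) / len(squared)

def supervised_loss(h, h_hat_hat):
    """Assumed alignment: Supervisor output at step t vs. real embedding at step t+1."""
    targets = h[1:]                # h_(2), ..., h_(T)
    predictions = h_hat_hat[:-1]   # Supervisor outputs at steps 1, ..., T-1
    return mse(targets, predictions)

h = [[0.0], [1.0], [2.0], [3.0]]   # toy embeddings: T = 4, one latent feature

def perfect_supervisor(h_seq):
    # Hypothetical stand-in: the true dynamic of the toy data is "add 1 each step".
    return [[v + 1.0 for v in h_t] for h_t in h_seq]

def identity_supervisor(h_seq):
    # Hypothetical stand-in that just echoes its input.
    return [list(h_t) for h_t in h_seq]

loss_perfect = supervised_loss(h, perfect_supervisor(h))
loss_identity = supervised_loss(h, identity_supervisor(h))
print(loss_perfect, loss_identity)
```

A Supervisor that merely copies its input is penalized, while one that has learned the step-to-step dynamic is not; that is the sense in which this loss enforces sequential structure.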


# Discriminator

There are three sources of examples fed to the Discriminator, each resulting
in a different judgment of Real/Fake.

Recall that the Generator, Discriminator and Supervisor all take embeddings (as opposed to raw examples) as input.

The three inputs to the Discriminator, their outputs (judgments), and associated losses are
- Embeddings of Real examples
$$
\begin{array} \\
\h_{\timess{1}{T}} & = & \text{embedder}(\x_{\timess{1}{T}}) \\
\y & = &  \text{discriminator}(\h) \\
\text{D_loss_real} & = & \text{BCE}(1's,  \y) \\
\end{array}
$$


- Embeddings of Fake examples (created by Generator only)

$$
\begin{array} \\
\hat\e_{\timess{1}{T}} & = & \text{generator} (\z ) \\
\hat{\y} & = & \y_\text{fake_e}  =  \text{discriminator}( \hat\e) \\
\text{D_loss_fake_e} &  = & \text{BCE}(0's, \y_\text{fake_e}) \\
\text{generator_loss}_\text{unsupervised_e} & = &  \text{BCE}( 1's, \y_\text{fake_e} ) \\
\end{array}
$$

    


- Embeddings of Fake examples (created by Generator + Supervisor)

$$
\begin{array} \\
\hat\e_{\timess{1}{T}} & = & \text{generator} (\z ) \\
\hat\h_{\timess{1}{T}}  & = & \text{supervisor}(\hat\e ) \\
\dot\y & = & \y_\text{fake}  =  \text{discriminator}( \hat\h ) \\
\text{D_loss_fake} &  = & \text{BCE}(0's, \y_\text{fake}) \\
\text{generator_loss}_\text{unsupervised} &  = & \text{BCE}(1's, \y_\text{fake})
\end{array}
$$


The Discriminator Loss $\loss_D$ is the (weighted, but we ignore the weights) sum
$$
\loss_D = \text{D_loss_real} + \text{D_loss_fake_e} + \text{D_loss_fake}
$$
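
A sketch of how the three Discriminator terms defined above (`D_loss_real`, `D_loss_fake_e`, `D_loss_fake`) combine. The Discriminator outputs are made-up probabilities of "real" for a batch of two examples, and, as in the text, any weights on the terms are ignored.

```python
import math

def bce(targets, predictions, eps=1e-7):
    """Mean binary cross entropy between 0/1 targets and predicted probabilities."""
    total = 0.0
    for y, p in zip(targets, predictions):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(targets)

# Hypothetical Discriminator outputs for a batch of 2.
y_real = [0.9, 0.8]     # discriminator(h): embeddings of real examples
y_fake_e = [0.3, 0.4]   # discriminator(e_hat): Generator only
y_fake = [0.2, 0.1]     # discriminator(h_hat): Generator + Supervisor

d_loss_real = bce([1, 1], y_real)
d_loss_fake_e = bce([0, 0], y_fake_e)
d_loss_fake = bce([0, 0], y_fake)
d_loss = d_loss_real + d_loss_fake_e + d_loss_fake

# The Generator scores the same judgment against the opposite label.
generator_loss_unsupervised = bce([1, 1], y_fake)
print(d_loss, generator_loss_unsupervised)
```

With these numbers the Discriminator is doing well on the Generator + Supervisor fakes, so `d_loss_fake` is small while `generator_loss_unsupervised` is large: the two players pull the same output in opposite directions.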


# Generator

**Remark and warning**

There are a number of variables in 
[Jansen's implementation of TimeGAN](https://github.com/stefan-jansen/machine-learning-for-trading/blob/main/21_gans_for_synthetic_time_series/02_TimeGAN_TF2.ipynb)
that begin with `generator_loss`.

I will use these variable names in the presentation below (even though I find them somewhat confusing).

- `generator_loss_supervised` refers to the loss of the Supervisor: $\loss_S$
    - I would have chosen: `supervisor_loss` rather than `generator_loss_supervised`
- There are two variable names beginning with `generator_loss_unsupervised`
    - Unfortunately, at least one of them references the output of the Supervisor, which is confusing
        - `generator_loss_unsupervised` is the loss associated with examples created by the Generator **and** the Supervisor
            - I might have chosen `generator_loss_with_sup`
        - `generator_loss_unsupervised_e` is the loss associated with examples created by the Generator alone (without the Supervisor)
            - `generator_loss_without_sup`



The Generator tries to fool the Discriminator in two ways.
- via direct output of the Generator
- by the output of the Supervisor
    - creates an autoregressive model of the direct output of the Generator
    
Recall that the Generator, Discriminator and Supervisor all take embeddings (as opposed to raw examples) as input.

The two outputs of the Generator, their evaluation by the Discriminator, and associated losses are

- Embeddings of Fake examples (created by Generator only)

$$
\begin{array} \\
\hat\e_{\timess{1}{T}} & = & \text{generator} (\z ) \\
\hat{\y} & = & \y_\text{fake_e}  =  \text{discriminator}( \hat\e) \\
\text{generator_loss}_\text{unsupervised_e} & = &  \text{BCE}( 1's, \y_\text{fake_e} ) \\
\end{array}
$$

- Embeddings of Fake examples (created by Generator + Supervisor)

$$
\begin{array} \\
\hat\e_{\timess{1}{T}} & = & \text{generator} (\z ) \\
\hat\h_{\timess{1}{T}}  & = & \text{supervisor}(\hat\e ) \\
\dot\y & = & \y_\text{fake}  =  \text{discriminator}( \hat\h ) \\
\text{generator_loss}_\text{unsupervised} &  = & \text{BCE}(1's, \y_\text{fake})
\end{array}
$$

There is an *additional* loss term associated with the Generator.

It compares the statistical moments (mean, variance) of the
real examples with fake examples
- real examples with first two moments $\mu, \sigma$
- fake examples with first two moments $\hat\mu, \hat\sigma$
    - fake examples obtained from the sequence of embeddings $\hat\h$ produced by (Generator + Supervisor) using the Decoder side of the Autoencoder
    $$
\hat{\x}_{\timess{1}{T}} = \text{decoder}( \hat\h_{\timess{1}{T}} )
$$

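The Generator Moments loss is the sum of the absolute differences of the moments (absolute values keep a positive and a negative difference from cancelling)
$$    
\text{generator_loss}_\text{moment}  =  | \mu - \hat{\mu} | + | \sigma - \hat{\sigma} |
$$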
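
A pure-Python sketch of a moments loss on a single scalar feature. It uses absolute differences so that the two terms cannot cancel, and the population standard deviation for $\sigma$; both choices, and the helper name `moment_loss`, are assumptions made for illustration.

```python
import statistics

def moment_loss(real_values, fake_values):
    """|mean difference| + |standard deviation difference| for one scalar feature."""
    mu, sigma = statistics.mean(real_values), statistics.pstdev(real_values)
    mu_hat, sigma_hat = statistics.mean(fake_values), statistics.pstdev(fake_values)
    return abs(mu - mu_hat) + abs(sigma - sigma_hat)

real = [1.0, 2.0, 3.0, 4.0]
shifted = [v + 1.0 for v in real]          # same spread, mean off by 1
stretched = [2.0 * v - 2.5 for v in real]  # same mean, spread doubled

print(moment_loss(real, shifted))
print(moment_loss(real, stretched))
```

The first fake batch is penalized only through the mean term and the second only through the spread term, so the loss pushes the Generator to match both moments of the real data.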

The total loss $\loss_G$ for the Generator is the sum of the sub-losses

$$
\loss_G = \text{generator_loss}_\text{unsupervised_e} + \text{generator_loss}_\text{unsupervised} + 
\loss_S + \text{generator_loss}_\text{moment}
$$

# Joint training

The Embedder and Supervisor are initially trained (separately) independently of the GAN.

It is during joint training that the GAN (Generator and Discriminator) is trained, in the usual Adversarial Training manner of a GAN.

It may not be obvious but, even though Adversarial Training looks like it is designed to update
the Generator and Discriminator weights
- all weights are potentially updated, including the Embedder and Supervisor

There is a bit of subtlety here.

The most obvious purpose of joint training is to implement the GAN adversarial training
- Generator weights $\Theta_G$ are updated to better produce examples to fool the Discriminator
- Discriminator weights $\Theta_D$ are updated to better distinguish between Real and Fake examples.

How do the weights of the Supervisor and Autoencoder come into play ?

Recall that the Generator creates two kinds of Fake embeddings of examples.

- Embeddings of Fake examples (created by Generator only)

$$
\begin{array} \\
\hat\e_{\timess{1}{T}} & = & \text{generator} (\z ) \\
\hat{\y} & = & \y_\text{fake_e}  =  \text{discriminator}( \hat\e) \\
\text{generator_loss}_\text{unsupervised_e} & = &  \text{BCE}( 1's, \y_\text{fake_e} ) \\
\end{array}
$$

- Embeddings of Fake examples (created by Generator + Supervisor)

$$
\begin{array} \\
\hat\e_{\timess{1}{T}} & = & \text{generator} (\z ) \\
\hat\h_{\timess{1}{T}}  & = & \text{supervisor}(\hat\e ) \\
\dot\y & = & \y_\text{fake}  =  \text{discriminator}( \hat\h ) \\
\text{generator_loss}_\text{unsupervised} &  = & \text{BCE}(1's, \y_\text{fake})
\end{array}
$$

The first 
- $\hat\e_{\timess{1}{T}}  =  \text{generator} (\z )$

is independent of the Supervisor, but the second
- $\hat\h_{\timess{1}{T}}   = \text{supervisor}(\hat\e )$

is affected by the Supervisor (and hence by its weights).

Moreover, the second kind of example is fed into the Discriminator.


Therefore, in order to reduce the associated Generator loss
$$
\text{generator_loss}_\text{unsupervised}   =  \text{BCE}(1's, \y_\text{fake})
$$
or associated Discriminator Loss

$$
\text{D_loss_fake}   =  \text{BCE}(0's, \y_\text{fake}) 
$$

a gradient may arise that affects the weights of the Supervisor.

Note that the Supervisor Loss does not depend *directly* on Fake examples.

It is defined solely on real examples

$$
\begin{array} \\
\h_{\timess{1}{T}} & = & \text{embedder}(\x_{\timess{1}{T}}) \\
\hat{\hat\h}_{\timess{1}{T}} & = & \text{supervisor}( \h ) \\
\loss_S & = & 
\text{generator_loss}_\text{supervised} & = & \text{MSE}( \h_{\timess{1}{}}, \hat{\hat\h}_{\timess{1}{T}}) \\
\end{array}
$$

It is not exposed directly to fake examples, only indirectly
- Through the dependence of part of the Generator and Discriminator losses on the Supervisor


It makes sense that $\loss_S$ depends only on real examples
- they are from $\pdata$, which defines all statistical relationships
- we can't infer any true relationship from Fake data


So Adversarial Training of the GAN can modify $\loss_S$ which depends on the Encoder half of the Autoencoder
- see the equations above
    - $\loss_S$ depends on $\h_{\timess{1}{T}}$
    - $\h_{\timess{1}{T}} = \text{embedder}(\x_{\timess{1}{T}})$
   
Hence Adversarial Training can also affect the weights of the Autoencoder.   
    




Thus, the Autoencoder learns a better encoding by joint training.


We can see that the definition of `e_loss` combines the two "reconstruction" related terms

$$
\begin{array}{llll}
\tilde\x_{\timess{1}{T}} & = & \text{autoencoder}(\x_{\timess{1}{T}}) & \text{Reconstructed } \x\\
\text{embedding_loss} & = & \text{MSE}(\x_{\timess{1}{T}}, \tilde\x_{\timess{1}{T}}) & \text{Reconstruction error} \\
\text{e_loss} & = &  \text{embedding_loss}^{0.5}  + \text{generator_loss}_\text{supervised} & \text{Embedder Loss} \\
\end{array}
$$
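
A short sketch of the Embedder Loss as the square root of the reconstruction error plus the Supervised Loss. The weights `w_recon` and `w_sup` are hypothetical knobs that default to 1, matching the unweighted presentation here; the input values are made up.

```python
import math

def e_loss(embedding_loss, generator_loss_supervised, w_recon=1.0, w_sup=1.0):
    """Embedder loss during joint training; the weights are hypothetical knobs."""
    return w_recon * math.sqrt(embedding_loss) + w_sup * generator_loss_supervised

value = e_loss(embedding_loss=0.25, generator_loss_supervised=0.1)
print(value)
```

Because the second term depends on the Encoder's output, minimizing this loss updates the Encoder weights for both reconstruction quality and one-step-ahead predictability.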


The fact that the Embedder and Supervisor/Generator/Discriminator are tied together in training
is also apparent if we zoom in on the diagram of the Training Scheme

<table>
    
<tr><th>Training Scheme: zoom in on Embedder gradient updates</th>
</tr>

<tr>
    <td><img src="images/TGAN_training_Autoencoder.png"></td>
</tr>
</table>

Notice: the gradient

$$
\frac{\partial \loss_S}{\partial \Theta_e}
$$

of the Supervised Loss with respect to the Encoder weights $\Theta_e$ flows back to the Encoder $e$.

# Generating synthetic examples

Once trained, we can generate new Fake examples

- by creating a random noise vector $\z$
- feeding $\z$ to the `generator` and `supervisor` to get a synthetic embedding (created by Generator + Supervisor)
- decoding the synthetic embedding (by the Decoder part of the Autoencoder) to create a sequence in the original data space

$$
\begin{array} \\
\hat\e_{\timess{1}{T}} & = & \text{generator} (\z ) \\
\hat\h_{\timess{1}{T}}  & = & \text{supervisor}(\hat\e ) \\
\hat{\x}_{\timess{1}{T}} & = & \text{decoder}( \hat\h_{\timess{1}{T}} ) \\
\end{array}
$$
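
The three generation steps above can be sketched as a simple composition. All three functions below are hypothetical toy stand-ins for the trained networks (the real ones are recurrent nets), and the noise is assumed to be a Gaussian sequence $\z_{\timess{1}{T}}$.

```python
import random

def sample_z(T, z_dim, rng):
    """Random noise sequence z_(1:T), one vector of length z_dim per step."""
    return [[rng.gauss(0.0, 1.0) for _ in range(z_dim)] for _ in range(T)]

def generator(z_seq):
    """Stand-in: noise -> latent e_hat (here just a rescaling)."""
    return [[0.5 * v for v in z_t] for z_t in z_seq]

def supervisor(e_hat):
    """Stand-in: smooth each step with the previous one (uses only the past)."""
    h_hat, prev = [], e_hat[0]
    for e_t in e_hat:
        h_hat.append([0.5 * (p + c) for p, c in zip(prev, e_t)])
        prev = e_t
    return h_hat

def decoder(h_hat, n):
    """Stand-in: expand each latent vector to n output features."""
    return [[sum(h_t) + j for j in range(n)] for h_t in h_hat]

def generate(T, z_dim, n, seed=0):
    rng = random.Random(seed)
    z = sample_z(T, z_dim, rng)
    e_hat = generator(z)        # e_hat = generator(z)
    h_hat = supervisor(e_hat)   # h_hat = supervisor(e_hat)
    return decoder(h_hat, n)    # x_hat = decoder(h_hat)

x_hat = generate(T=5, z_dim=3, n=4)
print(len(x_hat), len(x_hat[0]))
```

Note that neither the Embedder nor the Discriminator appears at generation time: only the Generator, Supervisor and Decoder are needed once training is done.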

# Code

There is an implementation of TimeGAN in Stefan Jansen's [ML4T Chapt 21](https://github.com/stefan-jansen/machine-learning-for-trading/blob/main/21_gans_for_synthetic_time_series/02_TimeGAN_TF2.ipynb)
which is derived from the [authors' implementation](https://github.com/vanderschaarlab/mlforhealthlabpub/blob/main/alg/timegan/tgan.py)



In [None]:
print("Done")