In [1]:
%run Latex_macros.ipynb


In [2]:
from IPython.display import Image

# AutoEncoder (AE): High Level

<div class="alert alert-block alert-warning">
    <b>TL;DR</b> 
    <br>
    <ul>
        <li>The Deep Learning analog of Principal Components (PCA)</li>
        <ul>
            <li>Most of the lessons of AE apply equally to PCA</li>
        </ul>
        <li>Unsupervised: no labels (really self-supervised: the input serves as its own target)</li>
        <li>Create "synthetic features" from the original set of features</li>
        <li>May be able to use reduced set of synthetic features (dimensionality reduction)</li>
        <li><b>Generative</b> (vs Discriminative)</li>
    </ul>
</div>

<table>
    <tr>
        <th><center>Autoencoder</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_vanilla.png"></td>
    </tr>
</table>

An Autoencoder network has two parts
- An Encoder, which takes input $\x$ and "encodes" it into $\z$
- A Decoder, which takes the encoding $\z$ and tries to reproduce $\x$

Each part has its own weights, which can be discovered through training, with examples
- $\langle \X, \y \rangle = \langle \X, \X \rangle$

That is: we are asking the output to be identical to the input.

$\z$ is an alternative latent representation of $\x$.
- Encoded by the Encoder
- Inverted by the Decoder

But when the dimension of $\z$ is less than the dimension of $\x$:
- $\z$ is a *bottle-neck*
- the inversion by the Decoder will be imperfect

$\z$ becomes a *reduced-dimensionality* approximation of $\x$.


This is reminiscent of the dimensionality reduction of Principal Components Analysis (PCA).

The *main difference* from PCA
- PCA uses a *linear* transformation
- NN can use *non-linear* transformations too
    - PCA as a special case of AE
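
To make the "PCA as a special case" remark concrete, here is a minimal sketch of a purely *linear* Encoder/Decoder pair with a one-dimensional bottleneck. The data direction `w`, the two test points, and the helper names are assumptions chosen for illustration (and the data is assumed to be centered); nothing here is trained.

```python
import math

# Assumed toy setting: centered 2-D data lying close to the line x2 = 2 * x1.
# The unit vector along that line plays the role of the first principal component.
norm = math.sqrt(1.0 ** 2 + 2.0 ** 2)
w = (1.0 / norm, 2.0 / norm)

def encode(x):
    """Linear Encoder: project the 2-D input onto w, giving a 1-D latent z."""
    return w[0] * x[0] + w[1] * x[1]

def decode(z):
    """Linear Decoder: map the 1-D latent back to 2-D along the same direction."""
    return (w[0] * z, w[1] * z)

def reconstruction_error(x):
    """Squared distance between x and its reconstruction D(E(x))."""
    x_tilde = decode(encode(x))
    return sum((a - b) ** 2 for a, b in zip(x, x_tilde))

on_line = (1.0, 2.0)    # lies exactly on the line: can be reconstructed
off_line = (1.0, 0.0)   # lies off the line: the bottleneck loses information

err_on = reconstruction_error(on_line)
err_off = reconstruction_error(off_line)
print(err_on, err_off)
```

A point on the line passes through the bottleneck essentially unchanged, while a point off the line is reconstructed imperfectly. A non-linear AE replaces `encode` and `decode` with non-linear functions.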
   

# Autoencoders: Uses

## Dimension reduction

After training
- we can discard the Decoder
- use the Encoder output (synthetic features) as reduced dimension inputs to a *new* task
    - Encoder weights are **frozen**: non-learnable when training the new task
    - It may be easier to solve the new task given $\z$ rather than $\x$
        - the Encoder has already discovered "structure" of $\x$
    - *Transfer Learning*
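
The "frozen Encoder plus new head" idea can be sketched in plain Python. Everything below is hypothetical: `ENCODER_WEIGHTS` stands in for an already-trained Encoder, the head is a one-variable linear model, and the toy targets are constructed to depend on $\x$ only through $\z$.

```python
# Hypothetical frozen Encoder: a fixed linear projection standing in for a trained one.
ENCODER_WEIGHTS = (0.6, 0.8)   # frozen: never updated below

def encoder(x):
    """Map a 2-D input to a 1-D synthetic feature z."""
    return ENCODER_WEIGHTS[0] * x[0] + ENCODER_WEIGHTS[1] * x[1]

def train_head(xs, ys, lr=0.1, epochs=2000):
    """Fit a new head y ~ a * z + b by gradient descent; only a and b are learned."""
    a, b = 0.0, 0.0
    zs = [encoder(x) for x in xs]   # the Encoder is only evaluated, never trained
    n = len(zs)
    for _ in range(epochs):
        grad_a = sum(2 * (a * z + b - y) * z for z, y in zip(zs, ys)) / n
        grad_b = sum(2 * (a * z + b - y) for z, y in zip(zs, ys)) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
ys = [3 * encoder(x) + 1 for x in xs]   # toy targets for the new task
a, b = train_head(xs, ys)
print(a, b)
```

The new task is solved in the reduced space of $\z$; the training loop never touches the Encoder weights.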

<table>
    <tr>
        <th><center>Autoencoder: Encoder + New head</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_encoder_new_head.jpg" width=1200></td>
    </tr>
</table>

In PCA, we eliminated the synthetic features (components) that were "less important"
- i.e., explained only a small fraction of the variation in the training set
    - recall how we re-denominated explained variance in terms of "number of features"
    
There is no direct similar concept of feature importance in AE
- other than minimizing a Loss function, which *may* wind up focusing on "important" features

## Layer-wise pre-training with Autoencoders

Autoencoders played a vital role in the development of Deep Learning:
- They made it possible to train otherwise untrainable NN's.
- Other innovations supplanted the need for AE's to assist training
    - better initialization
    - better activation functions
    - normalization
 
Although they are no longer needed for that purpose, it is interesting to see how (and why) they were used.


For a NN with $L$ layers that solves a Supervised Learning Problem
- Training attempts to learn the weights of all layers simultaneously
- *Layer wise pre-training* was an attempt
    - to *initialize* the weights of each layer
    - in succession
    - so that the task of simultaneously solving for optimal weights had a better chance of succeeding

The idea was to learn an initialization of $\W_\llp$, the weights of layer $l$.
- After having learned the weights $\W_{(l')}$ for all layers $l' \lt l$.


To initialize $\W_\llp$:
- Train an AE that takes $\x^\ip$ as input
- Using initialized weights $\W_{(l')}$ for all layers $l' \lt l$
- Produces $\tx^\ip$ at layer $l$'s output $\y_\llp$

So weight initializations were learned layer by layer.

Note that the labels $\y^\ip$ *were not used* !
- wouldn't be useful for the shallow NN

It was thought
- to be easier to learn the structure of the input $\x$ independent of the labels
- to be easier to learn $\W_\llp$ incrementally

Once the weights $\W_\llp$ were initialized via AE's
- training of the Supervised Learning task had a better chance of succeeding
- compared to the initializations available at the time

## Autoencoders and Transfer Learning
Today, autoencoders are useful for another purpose: Transfer Learning.

If we can train an AE network to create features that are useful for reconstruction
- it is possible that these features are also useful for solving more complicated tasks.

This was in essence what
- Our dimension reduction example (replace the head) was doing
- Layerwise Pre-training was attempting.

So it is not uncommon to approach a complicated problem
- by first constructing an autoencoder to come up with an alternate (and smaller) representation of the input space.

Note that Autoencoders are *unsupervised*: they don't take labels.  

So the encodings they produce
stress syntactic similarity, rather than semantic similarity.

Their use in Transfer Learning depends on the hope that inputs that are syntactically similar also
have the same labels.

# Denoising

Very much like dimension reduction but with the assumption that
- "less important" features are just random noise that is added to the true example

<table>
    <tr>
        <th><center>Autoencoder: Denoising</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_denoising.png" width=1200></td>
    </tr>
</table>
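
One common way to set up the denoising task, sketched here under the assumption of additive Gaussian noise: the noisy example is the *input* and the clean example is the *target*. The function name, `noise_std`, and the toy data are illustrative choices, not taken from the course notebook.

```python
import random

def make_denoising_pairs(clean_examples, noise_std=0.1, seed=0):
    """Build (input, target) pairs: input = clean + Gaussian noise, target = clean."""
    rng = random.Random(seed)
    pairs = []
    for x in clean_examples:
        noisy = [v + rng.gauss(0.0, noise_std) for v in x]
        pairs.append((noisy, list(x)))
    return pairs

clean = [[0.0, 0.5, 1.0], [1.0, 0.5, 0.0]]
pairs = make_denoising_pairs(clean)
for noisy, target in pairs:
    print(noisy, "->", target)
```

The network is then trained exactly like a vanilla AE, except that the loss compares the reconstruction to the clean target rather than to the noisy input.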

# Generative Artificial Intelligence

A less obvious use of AE (using the Decoder rather than the Encoder) is to *generate* examples.

Most of the Machine Learning we have studied thus far is *discriminative*
- $\pr{\hat{\y}^\ip | \x^\ip}$
    - e.g., classifier: discriminate among the possible classes $\y^\ip$, given example $\x^\ip$

We can use the Decoder on *arbitrary* $\z$ to *generate* a completely  new $\x$:
- $\pr{ \x^{(i')} | \z^{(i')} }$ for some $i'$ not in training
- *generate* a new example $i'$, in the domain of $\x$, that was not encountered during training

<table>
    <tr>
        <th><center>Generator</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_decoder.jpg" width=800></td>
    </tr>
</table>

# Autoencoder (AE): Details

The *task* that trains an Autoencoder
- Given input $\x^\ip$
- Output of Encoder: $\z^\ip = E(\x^\ip)$
- Output of Decoder: $\tx^\ip = D(\z^\ip)$
- "Target": $\x^\ip$

Both the Encoder and Decoder are parameterized (learnable parameters)
- Goal: find the parameters such that 
$$
\begin{array}[llll] \\
 \tx^\ip = D(E(\x^\ip))  & \approx & \x^\ip \\
\end{array}
$$

$\z^\ip = E(\x^\ip)$ is the latent representation of $\x^\ip$.

## Loss function

The obvious loss functions compare the original $\x^\ip$ and reconstructed $\tilde\x^\ip$ feature by feature:


### Mean Squared Error (MSE)
$$
\loss^\ip = \frac{1}{|\x|} \sum_{j=1}^{|\x|} { (\x^\ip_j - \tx^\ip_j)^2 }
$$

### Binary Cross Entropy

For the special case where *each* original feature is in the range $[0,1]$ (e.g., an image)

$$
\loss^\ip = - \sum_{j=1}^{|\x|} {  \left( \x^\ip_j    \log(\tx^\ip_j) + ( 1 - \x^\ip_j ) \log(1 - \tx^\ip_j) \right) }
$$
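
The two per-example reconstruction losses can be sketched in plain Python. The function names and the clipping constant `eps` are assumptions for illustration; the cross entropy carries the conventional leading minus sign so that a better reconstruction gives a smaller, non-negative loss.

```python
import math

def squared_error(x, x_tilde):
    """Sum over features of the squared reconstruction error."""
    return sum((a - b) ** 2 for a, b in zip(x, x_tilde))

def mean_squared_error(x, x_tilde):
    """Squared error averaged over the number of features."""
    return squared_error(x, x_tilde) / len(x)

def binary_cross_entropy(x, x_tilde, eps=1e-12):
    """Negative log likelihood of features in [0, 1] under the reconstruction."""
    total = 0.0
    for a, b in zip(x, x_tilde):
        b = min(max(b, eps), 1.0 - eps)   # clip to avoid log(0)
        total += a * math.log(b) + (1.0 - a) * math.log(1.0 - b)
    return -total

x = [0.0, 1.0, 1.0, 0.0]
good = [0.1, 0.9, 0.8, 0.2]   # close to x
bad = [0.9, 0.1, 0.2, 0.8]    # far from x

print(mean_squared_error(x, good), mean_squared_error(x, bad))
print(binary_cross_entropy(x, good), binary_cross_entropy(x, bad))
```

Both losses rank the `good` reconstruction ahead of the `bad` one; they differ in how strongly they penalize confident mistakes.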

# Variational Autoencoder (VAE): Generative ML

Observe that the Decoder part of the "vanilla" AE $D( \z^\ip )$
- has been trained to produce "realistic" $\tx^\ip$ *only* for a $\z^\ip = E(\x^\ip)$
    - i.e., "realistic": appears to come from the distribution of training $\X$
- there is no guarantee that $D( \z^{(i')} )$ for some $i'$ not in training is realistic

That is: the AE has not been trained to *extrapolate* beyond the training inputs.

A VAE is able to generate outputs from a latent representation $\z^{(i')}$
- that *could have* come from the training distribution
- but that *did not* come from $\X$.

Our goal is constructing a **Decoder** that can extrapolate.


<table>
    <tr>
        <th><center>Variational Autoencoder (VAE)</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_VAE.png"></td>
    </tr>
</table>

The Decoder will take a *latent vector* $\z$ and produce $D(\z)$, just as in a vanilla AE.


The difference is that $\z$ will be sampled from a *distribution* rather than being a 
unique mapping of a training example.

This will be done by modifying the Encoder
- It will *indirectly* create $\z^\ip$
- It will compute *variables* $\mu^\ip$ and $\sigma^\ip$
    - $\z^\ip$ will be *sampled* from a distribution with mean $\mu^\ip$ and standard deviations $\sigma^\ip$
    
As long as $\z$ is sampled from this distribution, the decoder will produce a "realistic" output.

**Note**

$\mu$ and $\sigma$ are 
- vectors
- computed values (and hence, functions of $\x$) and **not** parameters
- so training learns a *function* from $\x^\ip$ to $\mu^\ip$ and $\sigma^\ip$


To train a VAE:
- pass input $\x^\ip$ through the Encoder, producing $\mu^\ip, \sigma^\ip$
    - use $\mu^\ip, \sigma^\ip$ to sample a latent representation $\z^\ip$ from the distribution
- pass the sampled $\z^\ip$ through the decoder, producing $D(\z^\ip)$
- measure the reconstruction error $\x^\ip - D(\z^\ip)$, just as in a vanilla AE
- back propagate the error, updating all weights (and hence the functions that compute $\mu, \sigma$)

Essentially, each input $\x^\ip$ has *many* latent representations (with different probabilities):
any sample from the distribution.
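
The forward pass of these training steps can be sketched as follows. The `toy_encoder` and `toy_decoder` functions are hypothetical stand-ins for trained networks, and the sampling distribution is assumed to be a normal with diagonal covariance.

```python
import math
import random

def toy_encoder(x):
    """Hypothetical Encoder: returns (mu, sigma) as deterministic functions of x."""
    mu = [0.5 * v for v in x]
    sigma = [math.exp(-abs(v)) for v in x]   # strictly positive
    return mu, sigma

def sample_latent(mu, sigma, rng):
    """Draw z from a normal with mean mu and standard deviation sigma."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def toy_decoder(z):
    """Hypothetical Decoder: maps a latent back to the input space."""
    return [2.0 * v for v in z]

rng = random.Random(0)
x = [1.0, -2.0]
mu, sigma = toy_encoder(x)

# The same input yields a different latent (and reconstruction) on each pass.
z_first = sample_latent(mu, sigma, rng)
z_second = sample_latent(mu, sigma, rng)
errors = [sum((a - b) ** 2 for a, b in zip(x, toy_decoder(z)))
          for z in (z_first, z_second)]
print(z_first, z_second, errors)
```

Note that `mu` and `sigma` are identical on both passes; only the random draw differs.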


**Training**

Encoder produces

$$
\begin{array}[llll] \\
E(\x) & = &  q_\phi(\z|\x) & \approx & p_\theta(\z|\x) \\
\end{array}
$$

We sample from
$$
\begin{array}[llll] \\
\hat{\z} & \sim & q_\phi(\z|\x) \\
\end{array}
$$

Decoder produces

$$
\begin{array}[llll] \\
D(\hat{\z})  & = & p_\theta (\x|\z)\\
\end{array}
$$

Each time (epoch) that we encounter the same training example, we select another random element from the distribution.

So the VAE learns to represent the same example from multiple latents.


**Generative**
- sample $\hat{\z} \sim \hat{p}(\z)$
- use decoder to produce output $p_\theta (\x|\z)$
 
This means we can feed in a $\z$ 
- that doesn't correspond to any training example
- and perhaps get an output that *resembles* something from the training set, rather than noise.

## Which came first: the VAE architecture or the loss function ?

So far, we haven't told you the Loss function that is optimized during training.

It's a bit complicated so we'll save it for the end (for those who are interested).

The really interesting thing:
- The Loss function drove the architecture of the VAE, not vice-versa !
- As we've said many times in this course: 
    - Deep Learning is about creating Loss functions that reflect our objectives

# Conditional VAE

Once a VAE is trained, $D(\z)$ should produce a realistic output, for any $\z$ from the distribution.

However, if the distribution of $\X$ includes examples from many classes 
- Assuming we have labels as auxiliary information (not used in training)
    - e.g., the 10 digits
- The VAE can't control *which class* the output will come from

A *Conditional VAE* allows our generator (Decoder) to control the class $c$ of the output $\tx$.

<table>
    <tr>
        <th><center>Conditional VAE (CVAE)</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_CVAE.jpg" width=800></td>
    </tr>
</table>

The class label $c$
- is given as part of *training*
    - So the Encoder produces a distribution that is conditioned on *both* $\x$ and $c$.
- is an *additional input* to the Decoder
    - So the output class can be controlled
$$
\tx^\ip = D(\z^\ip, c)
$$



So now we 
- create a latent $\z$
- append a class label $c$
- and presumably have the decoder produce an output from the desired class.

- The encoding distribution is now conditional on class label $c$: $q_\phi(\z|\x,c)$ 
- So is the decoding distribution: $p_\theta(\x|\z,c)$ 
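
One simple way to "append a class label" is to concatenate a one-hot encoding of $c$ to the latent vector before it enters the Decoder. This is a sketch under that assumption; the helper names and the choice of 10 classes (the digits) are illustrative.

```python
def one_hot(c, num_classes):
    """One-hot encoding of class index c."""
    return [1.0 if k == c else 0.0 for k in range(num_classes)]

def conditional_decoder_input(z, c, num_classes=10):
    """Concatenate the latent z with the one-hot class label c."""
    return list(z) + one_hot(c, num_classes)

z = [0.3, -1.2]                       # a sampled latent
decoder_input = conditional_decoder_input(z, c=7)
print(decoder_input)
```

At generation time, the same $\z$ can be paired with different values of `c` to request outputs from different classes.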

Again, by restricting the functional form of the prior distribution $\hat{p}$ we can simplify the math.

# Detour: Autoencoder notebook on Colab

Let's examine some Keras code that implements several types of Autoencoders
- Vanilla
- Denoising
- VAE

We will write our AE's using the Keras *Functional* API rather than the *Sequential* model
- We *could* write the complete AE using the Sequential API
- **But**
    - we want to extract the Encoder and Decoder parts as *separate models*
    - we can do this with the Functional API

We will now switch to a notebook running on Google Colab
 [Autoencoder example from github](https://colab.research.google.com/github/kenperry-public/ML_Fall_2019/blob/master/Autoencoder_example.ipynb)

# VAE: Probabilistic formulation

**Note**: Advanced material

Let's pretend: we don't already know that we will represent $\mathbf{z}$ as a function of $\mathbf{\mu}_\phi(\x)$ and $\mathbf{\sigma}_\phi(\x)$
- this derivation will show **why** we made that choice

The mathematical derivation of a VAE is quite detailed
- it is interesting but not absolutely necessary to understand
- this is where we define the Loss function

The interested reader is referred to a highly recommended [VAE tutorial](https://arxiv.org/pdf/1606.05908.pdf).

We will try to give the essence in the following slides.

<div class="alert alert-block alert-warning">
    <b>TL;DR</b> 
    <br>
    <ul>
        <li>The VAE has a very interesting <b>two part</b> Loss Function</li>
        <ul>
            <li>Reconstruction Loss, as in the Vanilla AE</li>
            <li>Divergence Loss</li>
        </ul>
        <li>The Reconstruction Loss is not sufficient</li>
        <ul>
            <li>Issues of intractability arise</li>
            <li>The Divergence Loss skirts intractability</li>
            <ul>
                <li>By constraining the Encoder to produce a tractable distribution</li>
            </ul>
        </ul>
    </ul>
</div>

From the description of the VAE, observe that we are now dealing with distributions rather
than deterministic values for
- the encoding (latent representation) $\z$
- the output

So we will need to describe these distributions

We posit a joint probability distribution $p(\x, \z)$ of examples and latents.

These are *empirical* distributions:
- they are defined by the data
- no closed form

We will approximate $p$, as usual, with a NN that we will parameterize with $\theta$.

That is: we will train a model to learn $\theta$.

So henceforth we write $p_\theta$ to denote the dependence of $p$ on parameters $\theta$.

We motivate these distributions as they relate to the VAE:
- the encoder produces $p_\theta(\z|\x)$
- the decoder produces $p_\theta(\x|\z)$


<table>
    <tr>
        <th><center>VAE derivation: 1</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_VAE_derivation_A.png" width=1000></td>
    </tr>
</table>

Note that there are *many* reconstructions $\tilde\x^\ip$ of $\x^\ip$
- depending on the sampled $\tilde\z^\ip$ drawn from $p_\theta(\z | \x)$

## The loss function: first attempt

Let's try to create an optimization objective function against which we choose our model weights.

We will use Maximum Likelihood as the objective
- Given the weights: how likely is the model to produce the training distribution ?

Since our practice is to minimize Loss (rather than maximize an objective function)
we write our loss as (negative of log) likelihood
$$
\begin{array}[llll] \\
\loss & = & - \log( p_\theta(\x) )
\end{array}
$$

Minimizing $\loss$ is equivalent to maximizing likelihood.

## Intractability

It turns out that things are not so simple: 
- Some of the distributions we need to deal with may not be *tractable*
    - They have no closed form, just empirical distributions
    - Higher dimensional distributions may pose computational issues


The problem is that
$p_\theta( \z | \x)$ is assumed to be intractable

That is: for a given $\x^\ip$, we don't know which samples from $\z$ can reconstruct $\x^\ip$.
    
$p_\theta(\z|\x) = \frac{p_\theta(\x|\z) p(\z)}{p_\theta(\x)}$ (by Bayes rule)
- this represents the distribution for the encoder
- $\z$ is from an unknown distribution


The solution is to *approximate* the intractable $p_\theta(\z|\x)$ with a tractable $q_\phi(\z|\x)$
- where $q_\phi(\z|\x)$ is a function of learnable parameters $\phi$.

We use KL divergence as a measure of the difference between two distributions.

So we want to minimize

$$
\KL( q_\phi(\z|\x) \; ||\; p_\theta(\z | \x))
$$
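
As a small illustration of the KL divergence, here is a sketch for two discrete distributions given as lists of probabilities. The example distributions `q` and `p` are made up for illustration.

```python
import math

def kl_divergence(q, p):
    """KL(q || p) for discrete distributions over the same outcomes."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.5, 0.5]
p = [0.9, 0.1]

kl_qp = kl_divergence(q, p)
kl_pq = kl_divergence(p, q)
kl_qq = kl_divergence(q, q)
print(kl_qp, kl_pq, kl_qq)
```

The divergence is zero when the two distributions coincide, positive otherwise, and not symmetric in its arguments.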



<table>
    <tr>
        <th><center>VAE derivation: 2</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_VAE_derivation_B.png"></td>
    </tr>
</table>

We add the KL divergence to our loss function
$$
\begin{array}[lll]\\
\loss  & = & - \log(p_\theta(\x)) + \KL( q_\phi(\z|\x) \; ||\; p_\theta(\z | \x)) \\
& = & \loss_R + \loss_D
\end{array}
$$

which now has two objectives
- Reconstruction loss $\loss_R$: maximize the likelihood (by minimizing the negative log likelihood)
- Divergence constraint $\loss_D$: $q_\phi(\z|\x)$ must be close to $p_\theta(\z | \x)$

$$
\begin{array}[llll] \\
\loss_R & = & - \log( p_\theta(\x ) ) \\
\loss_D & = & \KL \left(  q_\phi(\z|\x) \; || \; p_\theta(\z | \x) \right) \\
\end{array}
$$

We will show (in the next section: lots of algebra !) that
$$
\loss = - \E_{z \sim q_\phi(\z|\x)}\left( \log( p_\theta(\x | \z) ) \right) + \KL(q_\phi(\z|\x)  \; ||\;  p_\theta(\z) )
$$

This is *almost* identical to the original $\loss$, except
- the reconstruction term $\log(p_\theta(\x))$ is replaced by
$$
\E_{z \sim q_\phi(\z|\x)}\left( \log( p_\theta(\x | \z) ) \right)
$$
- the KL term becomes
$$
 \KL \left(  q_\phi(\z|\x) \; || \; p_\theta(\z) \right)
$$
rather than the original
$$
\KL \left(  q_\phi(\z|\x) \; || \; p_\theta(\z | \x) \right)
$$



That is, our new loss function $\loss$ has two components
- $\loss_R$
    - the reconstruction loss, as before, but with $\z$ sampled from the encoder's $q_\phi(\z|\x)$ instead of $p_\theta(\z|\x)$

- $\loss_D$
    - the "KL divergence" loss, which constrains the approximate $q_\phi(\z|\x)$ to stay close to the prior $p_\theta(\z)$
### Advanced: Obtain $\loss$ by rewriting $\KL( q_\phi(\z|\x) \; ||\; p_\theta(\z | \x))$

Let's derive a simpler expression for $\loss$ by manipulating $\KL( q_\phi(\z|\x) \; ||\; p_\theta(\z | \x))$:

$$
\begin{array}{llll}
\KL( q_\phi(\z|\x) \; ||\; p_\theta(\z | \x)) & = & \sum_z{ q_\phi(\z|\x) \left( \log(q_\phi(\z|\x)) - \log(p_\theta(\z | \x)) \right) } & \text{def. of KL} \\
& = & \E_{z \sim q_\phi(\z|\x) } \left( \log(q_\phi(\z|\x)) - \log(p_\theta(\z | \x)) \right) & \text{def. of }\E \\
& = & \E_{z \sim q_\phi(\z|\x) } \left( \log(q_\phi(\z|\x)) - \left( \log( p_\theta(\x | \z)) + \log(p_\theta(\z)) - \log(p_\theta(\x)) \right) \right) & \text{Bayes theorem on } \log(p_\theta(\z | \x)) \\
\KL( q_\phi(\z|\x) \; ||\; p_\theta(\z | \x)) - \log(p_\theta(\x)) & = & \E_{z \sim q_\phi(\z|\x) } \left( \log(q_\phi(\z|\x)) - \left( \log( p_\theta(\x | \z) ) + \log( p_\theta(\z) ) \right) \right) & \text{move } \log(p_\theta(\x)) \text{ to LHS} \\
& = & \E_{z \sim q_\phi(\z|\x) } \left( - \log( p_\theta(\x | \z) ) + \left( \log(q_\phi(\z|\x)) - \log( p_\theta(\z) ) \right) \right) & \text{re-arrange terms} \\
& = & - \E_{z \sim q_\phi(\z|\x) } \left( \log( p_\theta(\x | \z) ) \right) + \KL(q_\phi(\z|\x) \; ||\; p_\theta(\z) ) & \text{def. of KL} \\
\loss & = & - \E_{z \sim q_\phi(\z|\x) } \left( \log( p_\theta(\x | \z) ) \right) + \KL(q_\phi(\z|\x) \; ||\; p_\theta(\z) ) & \text{since LHS} = \loss \\
\end{array}
$$

The LHS cannot be optimized via SGD (recall the tractability issue with  $p_\theta(\z|\x)$).

**But the RHS can be made tractable**, given a tractable choice of $p_\theta(\z)$.
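
The identity derived above can be checked numerically on a tiny discrete example. This is only a sanity check under made-up numbers: `prior` plays the role of $p_\theta(\z)$ over three latent values, `likelihood` is $p_\theta(\x|\z)$ for one fixed observed $\x$, and `q` is an arbitrary $q_\phi(\z|\x)$.

```python
import math

prior = [0.5, 0.3, 0.2]           # p(z)
likelihood = [0.10, 0.60, 0.30]   # p(x | z) for one fixed observed x
q = [0.2, 0.5, 0.3]               # q(z | x): any distribution over z

# Marginal likelihood and true posterior (available here only because z is tiny).
p_x = sum(l * p for l, p in zip(likelihood, prior))
posterior = [l * p / p_x for l, p in zip(likelihood, prior)]

def kl(a, b):
    """KL(a || b) for discrete distributions."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

# LHS: -log p(x) + KL(q || p(z|x))      (needs the posterior)
lhs = -math.log(p_x) + kl(q, posterior)

# RHS: -E_q[log p(x|z)] + KL(q || p(z)) (needs only likelihood and prior)
rhs = -sum(qi * math.log(li) for qi, li in zip(q, likelihood)) + kl(q, prior)

print(lhs, rhs, -math.log(p_x))
```

The two sides agree, and both are at least as large as $-\log(p_\theta(\x))$ because the KL term on the LHS is non-negative.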



### Choosing $p_\theta(\z)$

So what distribution should we use for the prior $p_\theta(\z)$ ?
- It should be differentiable, since we use Gradient Descent for optimization
- It would be advantageous if it had a tractable closed form (such as a normal)
- If we choose $p_\theta(\z)$ as normal, we can require $q_\phi( \z | \x )$ to be normal too
    - The KL divergence between two normals is an easy to compute function of their means and standard deviations.
    - See [VAE tutorial](https://arxiv.org/pdf/1606.05908.pdf) Section 2.2
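
For reference, here is a sketch of that closed form for the common special case where the prior is a unit normal and $q_\phi(\z|\x)$ is a normal with diagonal covariance. The function name is an assumption; `mu` and `sigma` are lists holding the per-dimension mean and standard deviation.

```python
import math

def kl_normal_vs_standard(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

print(kl_normal_vs_standard([0.0, 0.0], [1.0, 1.0]))   # q equals the prior
print(kl_normal_vs_standard([1.0], [1.0]))             # mean shifted by 1
print(kl_normal_vs_standard([0.0], [0.5]))             # narrower than the prior
```

Because the expression uses only elementary differentiable operations on `mu` and `sigma`, it can serve directly as the $\loss_D$ term during Gradient Descent.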


So it may be fair to say that the idea for the architecture of the VAE was obtained from the Loss function,
rather than vice-versa.


## Re-parameterization trick

There is still one more problem for training:
- sampling $\z  \sim  q_\phi(\z|\x)$, which appears in the reconstruction loss term $\loss_R$ of the loss $\loss$:
$$
\loss_R = - \E_{z \sim q_\phi(\z|\x) } \left( \log( p_\theta(\x | \z) ) \right)
$$

This is not a problem in the forward pass.

But optimization using Gradient Descent requires the ability to differentiate the loss with respect to trainable parameters.
- And we don't know how to back propagate through a stochastic node

In particular
$$
\frac{\partial \loss_R}{\partial \phi}
$$

involves a random sample $z \sim q_\phi(\z|\x)$ from a distribution that depends on $\phi$.

How do we take the derivative of a node involving a random choice ?

The "reparameterization trick"
- redefines $\z$
- as the transformation of a distribution $p(\z)$
    - (Looking ahead: which will wind up as unit normal distribution $N(0,1)$)
- by scaling it by a standard deviation and adding a mean

$$
\begin{array}[llll] \\
\mathbf{z}  & = & \mathbf{\mu}_\phi(\x) + \mathbf{\sigma}_\phi(\x) \mathbf{\epsilon} \\
\mathbf{\epsilon} & \sim & p(\mathbf{z}) \\
\end{array}
$$

<table>
    <tr>
        <th><center>Reparameterization trick</center></th>
    </tr>
    <tr>
        <td><img src="images/Reparameterization_trick.png"></td>
    </tr>
</table>

The key is that the mean $\mathbf{\mu}_\phi(\x)$ and standard deviation $\mathbf{\sigma}_\phi(\x)$
- are *functions* of input $\x^\ip$
    - so the encoding depends on the input
- are *deterministic* functions of the Encoder's trainable parameters $\phi$
    - so $\frac{\partial \loss_R}{\partial \phi}$ can be computed through $\mu$ and $\sigma$ by ordinary back propagation
- all of the randomness is isolated in $\epsilon$, which does not depend on $\phi$

We still can't take derivatives of $\loss_R$ with respect to $\epsilon$, but we don't need to !
- it's not a trainable parameter

We only need to take derivatives of $\loss_R$ with respect to $\phi, \theta, \mu, \sigma$, which we can do.

In evaluating derivatives, the $\epsilon$ that appears in the result (e.g., derivative wrt $\sigma$) can be treated as a constant.

- For a particular example, we can remember the  drawn $\epsilon$ in the forward pass and use it in the backward pass 
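
A one-dimensional sketch of this point: once $\epsilon$ is drawn and held fixed, $\z = \mu + \sigma \epsilon$ is an ordinary differentiable function of $\mu$ and $\sigma$. The squared-error `loss`, its `target`, and the starting values are arbitrary stand-ins, and the analytic gradients are checked against finite differences.

```python
import random

rng = random.Random(0)
eps = rng.gauss(0.0, 1.0)   # drawn once in the forward pass, then held fixed

def z_of(mu, sigma):
    """Reparameterized latent: deterministic given the remembered eps."""
    return mu + sigma * eps

def loss(mu, sigma, target=1.0):
    """Stand-in for a loss that depends on mu, sigma only through z."""
    return (z_of(mu, sigma) - target) ** 2

mu, sigma = 0.5, 2.0

# Analytic gradients by the chain rule: dz/dmu = 1, dz/dsigma = eps.
dloss_dz = 2.0 * (z_of(mu, sigma) - 1.0)
grad_mu = dloss_dz * 1.0
grad_sigma = dloss_dz * eps

# Central finite differences, reusing the same eps.
h = 1e-6
fd_mu = (loss(mu + h, sigma) - loss(mu - h, sigma)) / (2 * h)
fd_sigma = (loss(mu, sigma + h) - loss(mu, sigma - h)) / (2 * h)

print(grad_mu, fd_mu)
print(grad_sigma, fd_sigma)
```

The drawn `eps` shows up in the derivative with respect to `sigma` as a plain constant, which is why it must be remembered from the forward pass.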

This gets us to the (near) final picture of the VAE:

<table>
    <tr>
        <th><center>Variational Autoencoder (VAE)</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_VAE.png"></td>
    </tr>
</table>

<table>
    <tr>
        <th><center>Reparameterization trick</center></th>
    </tr>
    <tr>
        <td><img src="images/Reparameterization_trick.jpg" width=800></td>
    </tr>
</table>

# ELBO: Evidence Lower Bound
To summarize
- the original form of the loss contains an intractable term: $\KL( q_\phi(\z|\x) \; ||\; p_\theta(\z | \x))$
    - this term appears as an additive term
    - by definition of $\KL$, it is non-negative
- so we can't evaluate the log likelihood $\log(p_\theta(\x))$ directly
    - but, since $\log(p_\theta(\x)) = -\loss + \KL( q_\phi(\z|\x) \; ||\; p_\theta(\z | \x))$, dropping the intractable non-negative part leaves $-\loss$ as a *lower bound* on the log likelihood
        - so we optimize the lower bound: minimizing $\loss$ maximizes it
        - called the *ELBO*: Evidence Lower BOund
        

## Loss function: discussion

Let's examine the role of $\loss_R$ and $\loss_D$ in the loss function $\loss$.

- What would happen if we dropped $\loss_D$ ?
    - We would wind up with a deterministic $\z$ and collapse to a vanilla AE
    
- What would happen if we dropped $\loss_R$ ?
    - the encoding approximation $q_\phi(\z|\x)$ would be close to the empirical $p_\theta(\z | \x)$ *in distribution*
    - but two variables with the same distribution are not necessarily the same
        - e.g., get a distribution $p$ by flipping a coin
            - let distribution $q$ be a relabelling of $p$ by changing Heads to Tails and vice-versa
            - $p$ and $q$ are equal in distribution but clearly different !
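
The coin example can be simulated directly. This is a sketch with an arbitrary seed and sample size: the relabelled sequence has (approximately) the same empirical distribution as the original flips, yet the two never agree on any individual flip.

```python
import random

rng = random.Random(0)
flips = [rng.choice("HT") for _ in range(10000)]          # samples behind p
relabelled = ["T" if f == "H" else "H" for f in flips]    # samples behind q

freq_p = flips.count("H") / len(flips)
freq_q = relabelled.count("H") / len(relabelled)
agreements = sum(a == b for a, b in zip(flips, relabelled))

print(freq_p, freq_q, agreements)
```

Matching distributions alone therefore says nothing about whether a particular $\z$ reconstructs its own $\x^\ip$, which is the role played by $\loss_R$.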
    


