# Deep Learning

Autoencoders
------------

Credits:
[Jeremy Jordan's article on autoencoders](https://jeremyjordan.me/autoencoders/)

The main idea behind autoencoders is to learn a representation of the data that is more compact than the original data. This is done by training the network to reconstruct its input from that compact representation. Because the representation has limited capacity, the network is forced to keep only the most important features of the data. The network is trained **to minimize the reconstruction error.**

Autoencoders "encode" the input vectors into a latent vector space, and then "decode" the latent vector back into the original input space. The latent vector space is a continuous space, and the autoencoder learns to map the input space to the latent space and back again.

The **bottleneck** is a key attribute of the network design: without an information bottleneck, the network could simply memorize the input values by passing them along through the network unchanged.
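
As a rough illustration of the encode/decode idea, here is a minimal sketch of a linear autoencoder with a 2-unit bottleneck, trained with plain gradient descent in NumPy. The toy data, the layer sizes, the learning rate, and all names (`W_enc`, `W_dec`, `reconstruct`) are assumptions made for this example, not something taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed): 200 points in 4-D that lie on a 2-D subspace,
# so a 2-unit bottleneck can in principle reconstruct them well.
latent_true = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 4))
X = latent_true @ mixing

# Encoder and decoder weights; the 2-unit code is the bottleneck.
W_enc = rng.normal(scale=0.1, size=(4, 2))
W_dec = rng.normal(scale=0.1, size=(2, 4))

def reconstruct(X, W_enc, W_dec):
    z = X @ W_enc        # encode: input space -> latent space
    return z @ W_dec     # decode: latent space -> input space

def reconstruction_error(X, X_hat):
    return np.mean((X - X_hat) ** 2)

initial_error = reconstruction_error(X, reconstruct(X, W_enc, W_dec))

learning_rate = 0.02
for _ in range(3000):
    z = X @ W_enc
    X_hat = z @ W_dec
    grad_out = 2 * (X_hat - X) / X.size         # d(MSE) / d(X_hat)
    grad_dec = z.T @ grad_out                   # d(MSE) / d(W_dec)
    grad_enc = X.T @ (grad_out @ W_dec.T)       # d(MSE) / d(W_enc)
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc

final_error = reconstruction_error(X, reconstruct(X, W_enc, W_dec))
print(f"reconstruction error: {initial_error:.4f} -> {final_error:.4f}")
```

A real autoencoder would normally use non-linear layers and a deep learning framework; the linear version is used here only because it keeps the gradients short enough to write by hand.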



**Types of autoencoders:**
- Undercomplete autoencoders: The bottleneck layer has fewer units than the input, which limits how much information can pass through the network
- Sparse autoencoders: Penalize the network for having too many active units
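
One way to picture the sparsity penalty is as an extra term added to the reconstruction error. The sketch below assumes an L1 penalty on the hidden activations, which is one common choice; the notes do not say which penalty is meant, and the function name and `sparsity_weight` value are made up for illustration.

```python
import numpy as np

def sparse_autoencoder_loss(x, x_hat, hidden, sparsity_weight=1e-3):
    """Reconstruction error plus an (assumed) L1 penalty on hidden activations."""
    reconstruction = np.mean((x - x_hat) ** 2)
    sparsity = np.mean(np.abs(hidden))   # grows with the number of active units
    return reconstruction + sparsity_weight * sparsity

x = np.array([[1.0, 2.0, 3.0]])
x_hat = np.array([[1.0, 2.0, 3.0]])              # same reconstruction in both cases
dense_code = np.array([[0.5, 0.5, 0.5, 0.5]])    # every hidden unit active
sparse_code = np.array([[1.0, 0.0, 0.0, 0.0]])   # a single active unit

loss_dense = sparse_autoencoder_loss(x, x_hat, dense_code)
loss_sparse = sparse_autoencoder_loss(x, x_hat, sparse_code)
```

With identical reconstructions, the code with fewer active units gets the lower loss, which is the pressure the sparse autoencoder relies on.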
  

**Uses:**
- Dimensionality reduction
- Denoising
- Anomaly detection



Variational Autoencoders
---

Instead of mapping each input to a single latent vector, the encoder outputs the parameters (for example, a mean and a variance) of a probability distribution over each latent dimension. A latent vector is sampled from this distribution and passed to the decoder to generate the output. This allows the decoder to generate new samples that are similar to the training data.

Our loss function for this network consists of two terms: one penalizes reconstruction error (which can be thought of as maximizing the reconstruction likelihood), and a second encourages the learned distribution $q(z|x)$ to be similar to the prior distribution $p(z)$, which we'll assume is a unit Gaussian for each dimension $j$ of the latent space.

$$ L = -\mathbb{E}_{q(z|x)}[\log p(x|z)] + \beta \cdot KL(q(z|x) \,\|\, p(z)) $$

where KL is the Kullback-Leibler divergence, a measure of how one probability distribution differs from a second, reference probability distribution, and $\beta$ weights the KL term against the reconstruction term ($\beta = 1$ gives the standard VAE objective). The KL divergence is non-symmetric, so KL(p||q) is not necessarily equal to KL(q||p).
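
For Gaussians the KL term has a closed form, which makes both the unit-Gaussian case and the asymmetry easy to check numerically. The sketch below covers a single latent dimension; the function names are hypothetical, and the formula is the standard KL divergence between two univariate Gaussians.

```python
import math

def kl_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for one dimension."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
            - 0.5)

def kl_to_unit_gaussian(mu, sigma):
    """Special case used in the VAE loss: the prior p(z) is N(0, 1)."""
    return 0.5 * (mu ** 2 + sigma ** 2 - 1.0 - math.log(sigma ** 2))

# The penalty is zero only when q(z|x) already equals the prior.
print(kl_to_unit_gaussian(0.0, 1.0))       # 0.0

# Non-symmetry: swapping the arguments changes the value.
forward = kl_gaussians(0.0, 1.0, 0.0, 2.0)   # approx. 0.318
reverse = kl_gaussians(0.0, 2.0, 0.0, 1.0)   # approx. 0.807
```

For a latent space with several dimensions and a diagonal Gaussian $q(z|x)$, the per-dimension values are summed.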


The main advantage of VAEs is that **smooth latent state representations** of the input data can be learned. 
Both loss terms are needed for this. If only the KL divergence is used, every input is encoded as roughly the same unit Gaussian, so the latent code carries almost no information about the input and the decoder cannot reconstruct it. If only the reconstruction error is used, the encoder is free to place inputs in tight, widely separated regions of the latent space, so reconstructions are accurate but points sampled between those regions do not decode to anything resembling the training data.





Reparameterization trick:
- The reparameterization trick is a way to backpropagate through a stochastic node in a computational graph. We need it because the latent vector is sampled from the encoder's distribution, and gradients cannot flow through a sampling operation. The trick is to move the randomness off the path the gradients take: sample $\epsilon$ from a unit Gaussian, then scale and shift it by the parameters of the distribution we want to sample from, $z = \mu + \sigma \odot \epsilon$. The sample $z$ is then a deterministic, differentiable function of $\mu$ and $\sigma$.
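
The scale-and-shift step can be sketched in a few lines of NumPy. This example assumes the encoder outputs a mean and a log-variance per latent dimension (a common convention, not something stated in these notes); the name `reparameterize` and the numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var, rng):
    """Draw z ~ N(mu, sigma^2) as a deterministic function of (mu, log_var)."""
    eps = rng.standard_normal(mu.shape)   # randomness that does not depend on the parameters
    sigma = np.exp(0.5 * log_var)         # log-variance -> standard deviation
    return mu + sigma * eps               # shift and scale the unit Gaussian noise

mu = np.array([2.0, -1.0])
log_var = np.log(np.array([0.25, 4.0]))   # variances 0.25 and 4.0, i.e. sigma = 0.5 and 2.0

samples = np.stack([reparameterize(mu, log_var, rng) for _ in range(20000)])
print(samples.mean(axis=0), samples.std(axis=0))
```

The empirical mean and standard deviation of `samples` should land close to `mu` and `sigma`, showing that the rewritten sampling step still draws from the intended distribution. In an autodiff framework, gradients would flow through `mu` and `sigma` while `eps` is treated as a constant.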
  
**Uses:**
- Generative modeling
 

Transformers 
---

A transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data.


- Attention treats each word's representation as a query to access and incorporate information from a set of values. 



*Resources*
- CS224N Lecture 9 - [Link](https://youtu.be/ptuGllU5SQQ)
- CS224N Slide 9 - [Link](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1214/slides/cs224n-2021-lecture09-transformers.pdf)


> **Step by Step - Transformers**
1. Encode position information 
    1. Compute word embeddings
    2. Compute positional embeddings
    3. Combine embeddings 1 and 2 into matrix $X$
2. Feed $X$ into 3 separate linear layers to get the query $Q$, key $K$, and value $V$ matrices.
3. Compute the raw attention scores $QK^\top$
   1. Pairwise dot product between $Q$ and $K$ (how much does each query vector match each key vector?)
4. $A = softmax(QK^\top)$ : Normalize each row of scores into attention weights, so that words that are related to each other are weighted more. 
5. $A \cdot V$ : Multiply the attention weights with the value vectors to get the output of the **self-attention head**.
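
The steps above can be sketched in NumPy as follows. The dimensions and random weights are placeholders, `X` stands in for the combined word and positional embeddings, and the division by $\sqrt{d_k}$ is an addition from the standard scaled dot-product formulation that the step list does not mention.

```python
import numpy as np

def softmax(scores):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention_head(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # step 2: three linear layers
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # step 3: pairwise dot products (scaled, see note above)
    A = softmax(scores)                          # step 4: each row sums to 1
    return A @ V, A                              # step 5: weighted average of value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4               # assumed sizes
X = rng.normal(size=(seq_len, d_model))          # stand-in for step 1 (embeddings)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

output, A = self_attention_head(X, W_q, W_k, W_v)
print(output.shape, A.shape)                     # (5, 4) (5, 5)
```

Row `i` of `A` holds the weights that word `i` places on every word in the sequence, and row `i` of `output` is the corresponding weighted average of the value vectors.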

The self-attention head is the building block of the Transformer. However, there are a few barriers to building a Transformer from self-attention alone.

| Barrier | Solution |
| --- | --- |
| Stacking attention layers only re-averages the value vectors; there is no element-wise non-linearity | Use FFN layers to introduce non-linearity and process the output of the attention layer |
| In Machine Translation, ensure "we don't look at the future" | Use a mask to prevent the attention head from looking at the future by setting the attention scores to $-\infty$ for future positions before the softmax |
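
A minimal sketch of the masking idea, assuming the mask is applied to the raw scores before the softmax. The helper names `causal_mask` and `masked_softmax` are made up for this example.

```python
import numpy as np

def causal_mask(seq_len):
    # True above the diagonal: key position j > i is "the future" for query i.
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def masked_softmax(scores, mask):
    scores = np.where(mask, -np.inf, scores)               # future positions get -inf
    scores = scores - scores.max(axis=-1, keepdims=True)   # diagonal is never masked, so the max is finite
    weights = np.exp(scores)                               # exp(-inf) = 0
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                 # equal raw scores, to make the effect easy to see
A = masked_softmax(scores, causal_mask(4))
print(A)
```

With equal raw scores, the first row of `A` puts all its weight on position 0, and the last row spreads its weight evenly over all four positions: each word attends only to itself and to earlier words.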

---
> A few more tricks for using the Transformer:

- Single vs Multi-Head Attention: 
  - Single head: 
    - Uses one set of query, key, and value projections.
    - Produces a single attention pattern over the sequence.
    - Its output is used directly.
  - Multi-head: 
    - Several attention heads run in parallel on the same input.
    - Each head has its own weights and biases, so each can attend to different relationships.
    - Each head has its own output; the outputs are concatenated and passed through a final linear layer.
- Residual Connection: 
  - Instead of $X_i = Layer(X_{i-1})$, we use $X_i = Layer(X_{i-1}) + X_{i-1}$.
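
Both tricks can be sketched together: several heads run in parallel and are concatenated, and the result is added back to the layer input. The sizes, the random weights, and the names (`multi_head_attention`, `residual_block`, `W_o`) are assumptions for illustration. Biases and layer normalization are left out to keep the example short.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, W_o):
    outputs = []
    for W_q, W_k, W_v in heads:                      # each head has its own weights
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outputs.append(A @ V)                        # each head has its own output
    return np.concatenate(outputs, axis=-1) @ W_o    # concatenate, then mix with a linear layer

def residual_block(X, layer):
    return layer(X) + X                              # X_i = Layer(X_{i-1}) + X_{i-1}

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 8, 2                  # assumed sizes
d_head = d_model // n_heads
X = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
W_o = rng.normal(size=(d_model, d_model))

X_next = residual_block(X, lambda x: multi_head_attention(x, heads, W_o))
print(X_next.shape)                                  # (5, 8)
```

The residual connection only works because the layer output has the same shape as its input, which is why the concatenated heads are projected back to `d_model` columns.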