# T-SNE (2008)

t-distributed stochastic neighbor embedding.

References:
- [paper](http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf)
- [video](https://www.youtube.com/watch?v=RJVL80Gg3lA)
- [widget](https://distill.pub/2016/misread-tsne/)

t-SNE is a relatively new but quite popular approach to visualizing high-dimensional data. It was described in 2008 by van der Maaten and Hinton.

The idea behind this method is pretty standard: we want to find a projection onto a latent space that preserves distances between neighboring points as much as possible.

### Original space

We start by describing local distances in the original space.

For each data point $x_i$ we calculate a Gaussian similarity to its neighbors $x_j$.

$$P(j|i) \propto e^{-\|x_i - x_j\|^2 / 2\sigma_i^2}$$

Using a Gaussian shows that we care only about the nearest neighbors: similarities to more distant points converge to zero quite fast.

For now, $\sigma_i$ (the parameter that defines the width of the Gaussian centered on $x_i$) is treated as a given parameter.

To make it a proper probability distribution, we also normalize the similarities:

$$P(j|i)=\frac{e^{-\|x_i - x_j\|^2 / 2\sigma_i^2}}{\sum_{k \neq i} e^{-\|x_i - x_k\|^2 / 2\sigma_i^2}}$$
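The normalized similarities can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the exponent is divided by $2\sigma_i^2$ as in the original paper, a point is never counted as its own neighbor, and the function name `conditional_similarities`, the sample points and the fixed sigmas are made up for the example.

```python
import math

def conditional_similarities(points, sigmas):
    """Return rows P[i][j] = similarity of point j as seen from point i."""
    n = len(points)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a point is not its own neighbor
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            P[i][j] = math.exp(-sq_dist / (2.0 * sigmas[i] ** 2))
        row_sum = sum(P[i])
        P[i] = [v / row_sum for v in P[i]]  # each row becomes a distribution
    return P

# Hypothetical toy data: two close points and one distant point.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
P = conditional_similarities(points, sigmas=[1.0, 1.0, 1.0])
for row in P:
    print([round(v, 6) for v in row])
```

Each row sums to 1, and nearly all of the probability mass of the first point goes to its close neighbor rather than to the distant one.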

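There is a problem though. The conditional similarity is not symmetric: in general $P(j|i) \neq P(i|j)$, because each point is normalized over its own neighborhood and has its own $\sigma_i$. To make it symmetric, we average the two counterparts:

$$P(i,j) = \frac{P(i|j) + P(j|i)}{2N}$$

Here $N$ is the number of data points. Dividing by $N$ in addition to averaging makes the values sum to 1 over all pairs, so $P$ is still a probability distribution.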
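A small sketch of the symmetrization step, assuming the normalization by $2N$ used in the original paper ($N$ is the number of points) so that the entries sum to 1 over all pairs. The conditional matrix `P_cond` below is invented purely for illustration, with `P_cond[i][j]` standing for $P(j|i)$.

```python
def symmetrize(P_cond):
    """Turn conditional similarities into a symmetric joint distribution."""
    n = len(P_cond)
    return [
        [(P_cond[i][j] + P_cond[j][i]) / (2.0 * n) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical conditional similarities: every row sums to 1, diagonal is 0.
P_cond = [
    [0.0, 0.9, 0.1],
    [0.6, 0.0, 0.4],
    [0.2, 0.8, 0.0],
]
P_joint = symmetrize(P_cond)
total = sum(sum(row) for row in P_joint)
print(P_joint)
print(total)
```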

### Latent space

Next, we construct the same kind of distribution, Q, over the points in the latent space. If the projection is good, P and Q will be close to each other.

The only difference is that we use a Student t-distribution instead of a Gaussian, because it has fatter tails and practice shows that it works better. Hence the **t** in t-SNE.

The point is that when projecting into a lower-dimensional space, distances have to stretch (there can be plenty of situations where a distance of 10 should become 100 in the projection).

The Student density reaches the same value as the Gaussian density much farther from the center, so points that are far apart move even farther apart in the projection, which is what we want.

So the asymmetry of the distance measure lets us focus on preserving local distances specifically (that is, exactly where we observe a high P), without giving much weight to what happens in the tails.
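To make the heavier tails concrete, the sketch below compares the two unnormalized kernels at a few distances and then builds Q from a set of low-dimensional points. It assumes the Student t-distribution with one degree of freedom used in the original paper, so the unnormalized similarity is $1/(1+\|y_i-y_j\|^2)$. The function name `low_dim_similarities` and the sample points are illustrative only.

```python
import math

def low_dim_similarities(Y):
    """Joint similarities Q[i][j] from a Student t kernel (one degree of freedom)."""
    n = len(Y)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sq_dist = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
                W[i][j] = 1.0 / (1.0 + sq_dist)
    total = sum(sum(row) for row in W)  # normalize over all pairs at once
    return [[w / total for w in row] for row in W]

# Compare how fast the two (unnormalized) kernels decay with distance d.
for d in (1.0, 3.0, 5.0):
    gaussian = math.exp(-d ** 2 / 2.0)
    student = 1.0 / (1.0 + d ** 2)
    print(f"d={d}: gaussian={gaussian:.6f}, student={student:.6f}")

# Hypothetical low-dimensional points.
Y = [(0.0, 0.0), (1.0, 0.0), (4.0, 3.0)]
Q = low_dim_similarities(Y)
```

At small distances the two kernels are comparable, while at larger distances the Student kernel stays far above the Gaussian one.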


### Mapping

Now, how do we measure closeness? Which measure compares two distributions? Right, the Kullback-Leibler divergence.

$$cost = D_{KL}(P \| Q) = \sum_i \sum_j P(i,j) \log\frac{P(i,j)}{Q(i,j)}$$

It's a pretty standard optimization problem. By solving it for the low-dimensional points $y$, we get the placement of the points in the low-dimensional space.
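As a rough illustration of the cost, the sketch below evaluates the Kullback-Leibler divergence between two small joint similarity matrices. The matrices are invented for the example, and the `eps` guard against division by zero is an assumption of this sketch rather than part of the method.

```python
import math

def kl_cost(P, Q, eps=1e-12):
    """Sum of P(i,j) * log(P(i,j) / Q(i,j)) over all pairs with P(i,j) > 0."""
    cost = 0.0
    for p_row, q_row in zip(P, Q):
        for p, q in zip(p_row, q_row):
            if p > 0.0:
                cost += p * math.log(p / max(q, eps))
    return cost

# Hypothetical joint distributions; each matrix sums to 1.
P = [
    [0.0, 0.4, 0.1],
    [0.4, 0.0, 0.0],
    [0.1, 0.0, 0.0],
]
Q = [
    [0.0, 0.2, 0.2],
    [0.2, 0.0, 0.1],
    [0.2, 0.1, 0.0],
]
print(kl_cost(P, P))  # identical distributions
print(kl_cost(P, Q))  # cost of representing P by Q
print(kl_cost(Q, P))  # swapping the arguments gives a different value
```

The divergence is not symmetric: pairs with high P and low Q are penalized heavily, while pairs with low P contribute little, whatever Q is there.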

### Parameter estimation

When calculating similarities for each point, the method chooses $\sigma_i$ according to the desired **perplexity**.

Perplexity = $2^H$, where $H$ is the entropy (in bits) of the similarity distribution of a point. Entropy is a measure of randomness: it measures how uniform the distribution is. Distributions that have sharp peaks (high variation in probability) have lower entropy.

- High perplexity = Gaussian tails become longer and we take more neighbors into account
- Low perplexity = Gaussian tails become shorter and we take fewer neighbors into account

By setting a specific perplexity we regulate the locality of the algorithm.
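One common way to pick sigma for a point is a binary search on the perplexity of its similarity distribution, since a wider Gaussian gives a more uniform distribution and therefore a higher perplexity. The sketch below assumes that approach; the function names, the search bounds and the squared distances are all hypothetical.

```python
import math

def perplexity(probs):
    """2 ** H, where H is the entropy of the distribution in bits."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy

def row_probs(sq_dists, sigma):
    """Normalized Gaussian similarities of one point to its neighbors."""
    d_min = min(sq_dists)  # shifting by the minimum avoids underflow
    weights = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]

def find_sigma(sq_dists, target, lo=1e-3, hi=1e3, iters=100):
    """Binary search for the sigma whose perplexity matches the target."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if perplexity(row_probs(sq_dists, mid)) > target:
            hi = mid  # distribution too uniform: narrow the Gaussian
        else:
            lo = mid  # distribution too peaked: widen the Gaussian
    return (lo + hi) / 2.0

# Hypothetical squared distances from one point to its five neighbors.
sq_dists = [1.0, 4.0, 9.0, 16.0, 25.0]
sigma_low = find_sigma(sq_dists, target=2.0)
sigma_high = find_sigma(sq_dists, target=4.0)
print(sigma_low, perplexity(row_probs(sq_dists, sigma_low)))
print(sigma_high, perplexity(row_probs(sq_dists, sigma_high)))
```

A higher target perplexity yields a larger sigma, which matches the intuition that perplexity acts like an effective number of neighbors.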

### Implementations

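One of the implementations uses the Barnes-Hut approximation. The idea is simple: on each step of gradient descent, groups of points that lie close to each other and far enough from the point being updated are averaged and treated as a single point.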
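The core of the approximation can be illustrated without any tree structure. The sketch below compares the exact sum of Student t kernel values between one point and a distant group of points with the value obtained by replacing the group with its center of mass. It is not a Barnes-Hut implementation (a real one organizes the points in a quadtree and uses an accuracy threshold to decide when a group is far enough); the kernel choice and all coordinates are assumptions made for the example.

```python
def student_kernel(a, b):
    """Unnormalized Student t similarity (one degree of freedom)."""
    sq_dist = sum((u - v) ** 2 for u, v in zip(a, b))
    return 1.0 / (1.0 + sq_dist)

# Hypothetical layout: one point and a tight group far away from it.
target = (0.0, 0.0)
cluster = [(10.0, 10.0), (10.2, 9.9), (9.9, 10.1), (10.1, 10.0)]

# Exact: one kernel evaluation per point in the group.
exact = sum(student_kernel(target, p) for p in cluster)

# Approximate: treat the whole group as one point at its center of mass.
n = len(cluster)
center = tuple(sum(coord) / n for coord in zip(*cluster))
approx = n * student_kernel(target, center)

rel_error = abs(exact - approx) / exact
print(exact, approx, rel_error)
```

For a tight, distant group the two values are nearly identical, which is why a single evaluation can stand in for many.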

### Pros & Cons

- On standard datasets such as MNIST, t-SNE separates the classes in a visualization significantly better than PCA
- Requires more computational resources
- Because of its complexity and the perplexity hyperparameter, this method is much harder to tune. Each run can lead to a different result.