t-Distributed Stochastic Neighbor Embedding (t-SNE) 
---

<img src="http://www.trivial.io/content/images/2015/12/tsne-8.png" style="width: 400px;"/>

By The End Of This Session You Should Be Able To:
----

- Explain why t-SNE is the preferred algorithm for high-dimensional visualization
- Explain how t-SNE works
- List the limitations of t-SNE

What is t-SNE?
----

A tool to visualize high-dimensional data. 

It embeds high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot.

Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points.

How does t-SNE do its dark magic?
-----

It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. 

[demo 1](https://github.com/oreillymedia/t-SNE-tutorial)  
[demo 2](http://cs.stanford.edu/people/karpathy/tsnejs/)
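
For reference, the quantities in the sentence above can be written out explicitly. This is a sketch following the standard formulation from the original t-SNE paper (van der Maaten and Hinton, 2008). The symbols are notation introduced here rather than taken from this notebook: $x_i$ are the high-dimensional points, $y_i$ their low-dimensional map points, $\sigma_i$ a per-point bandwidth chosen from the perplexity setting, and $N$ the number of points.

$$
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}
$$

$$
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}
$$

$$
C = D_{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
$$

The heavy-tailed Student-t kernel in $q_{ij}$ is where the "t" in t-SNE comes from. The cost $C$ is minimized by gradient descent over the map points $y_i$.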

t-SNE steps
----

1. t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects have a high probability of being picked, while dissimilar objects have an extremely small probability of being picked.

2. t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence between the two distributions with respect to the locations of the points in the map. 

[Source](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding)
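
The two steps can be sketched in a few lines of plain Python. This is a simplified illustration, not the real algorithm: it assumes a single fixed Gaussian bandwidth `sigma` for every point (actual t-SNE tunes one bandwidth per point from the perplexity and symmetrizes conditional probabilities), and it only evaluates the cost for two hand-made maps instead of optimizing the map. All names and the toy data are hypothetical.

```python
from math import exp, log


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a_k - b_k) ** 2 for a_k, b_k in zip(a, b))


def high_dim_affinities(points, sigma=1.0):
    """Step 1 (simplified): Gaussian-based probabilities p_ij over pairs i != j."""
    n = len(points)
    weights = {
        (i, j): exp(-squared_distance(points[i], points[j]) / (2 * sigma ** 2))
        for i in range(n) for j in range(n) if i != j
    }
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


def low_dim_affinities(points):
    """Step 2: Student-t (one degree of freedom) probabilities q_ij over pairs i != j."""
    n = len(points)
    weights = {
        (i, j): 1.0 / (1.0 + squared_distance(points[i], points[j]))
        for i in range(n) for j in range(n) if i != j
    }
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


def kl_cost(p, q):
    """KL(P || Q) in nats, the quantity t-SNE minimizes over the map points."""
    return sum(p_ij * log(p_ij / q[pair]) for pair, p_ij in p.items() if p_ij > 0)


# Toy data (assumption): two tight clusters of three points each in 3-D.
high_dim = [(0, 0, 0), (0.5, 0, 0), (0, 0.5, 0), (5, 5, 5), (5.5, 5, 5), (5, 5.5, 5)]

# A 2-D map that keeps each cluster together, and one that mixes the clusters.
good_map = [(0, 0), (0.5, 0), (0, 0.5), (5, 5), (5.5, 5), (5, 5.5)]
bad_map = [(0, 0), (5, 5), (0.5, 0), (5.5, 5), (0, 0.5), (5, 5.5)]

P = high_dim_affinities(high_dim)
kl_good = kl_cost(P, low_dim_affinities(good_map))
kl_bad = kl_cost(P, low_dim_affinities(bad_map))

print("cost of neighbor-preserving map:", kl_good)
print("cost of scrambled map:", kl_bad)
```

The map that keeps neighbors together has the lower cost, which is the signal gradient descent follows when t-SNE moves the map points.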

Kullback–Leibler divergence
----

A way to compare two probability distributions:

![](images/kl_form.png)

- Always nonnegative
- Zero if and only if p = q
- Not a true distance between distributions since it is not symmetric and does not satisfy the triangle inequality. 

Nonetheless, it is often useful to think of KL as a “distance” between distributions.
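
For readers who cannot see the image, the standard discrete definition (presumably what the image above shows) is:

$$
D_{KL}(P \,\|\, Q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}
$$

With a base-2 logarithm the result is measured in bits; with the natural logarithm it is in nats. By convention, terms with $p(x) = 0$ contribute zero, and the divergence is infinite if $q(x) = 0$ for some $x$ with $p(x) > 0$.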

In [None]:
P: a:1/2, b:1/4, c:1/4
Q: a:4/12, b:3/12, c:3/12, d:2/12
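
The P and Q above make a handy worked example of the asymmetry. The sketch below assumes Q is read as a: 4/12, b: 3/12, c: 3/12, d: 2/12 (the labels on the last two entries are a guess at what the cell intended) and that P gives outcome d probability zero. The helper `kl_divergence` is a hypothetical name, not a library function.

```python
from math import inf, log2


def kl_divergence(p, q):
    """Return D(p || q) in bits for discrete distributions given as dicts."""
    total = 0.0
    for outcome, p_x in p.items():
        if p_x == 0:
            continue  # by convention, 0 * log(0 / q) contributes nothing
        q_x = q.get(outcome, 0.0)
        if q_x == 0:
            return inf  # p puts mass on an outcome that q rules out
        total += p_x * log2(p_x / q_x)
    return total


# Assumed reading of the two distributions in the cell above.
P = {"a": 1 / 2, "b": 1 / 4, "c": 1 / 4}
Q = {"a": 4 / 12, "b": 3 / 12, "c": 3 / 12, "d": 2 / 12}

kl_pq = kl_divergence(P, Q)
kl_qp = kl_divergence(Q, P)

print("D(P || Q) =", kl_pq)  # 0.5 * log2(1.5), roughly 0.29 bits
print("D(Q || P) =", kl_qp)  # inf, because Q allows outcome d and P does not
```

Swapping the arguments changes the answer from a small finite number to infinity, which is why KL is not a true distance.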

In [13]:
# There are 4 possible outcomes

# The 1st distribution assigns these probabilities to the outcomes
p = [.5, .23, .23, .04]

# The 2nd distribution assigns these probabilities to the same outcomes
# q = p  # The same distribution (KL is exactly 0)
# q = [.4, .4, .1, .1]  # A closer distribution
# A much further distribution; this is the q that gives the output printed below
# (note: as written, these values sum to 0.983 rather than 1)
q = [.001, .001, .001, .98]

In [14]:
from numpy import log2

# Accumulate p(x) * log2(p(x) / q(x)) over all outcomes; the result is in bits
KL = 0

for p_x, q_x in zip(p, q):
    KL += p_x * log2(p_x / q_x)

print(KL)

7.907229172


KL divergence is the cost of compressing with the wrong distribution
------

The Kullback-Leibler divergence is the penalty you'll have to pay if you try to compress data from one distribution using a scheme optimised for another.

More precisely: if your data really comes from probability distribution P, but you use a compression scheme optimised for Q, the divergence D(P||Q) is the average number of extra bits you'll need to store each sample from P.

[Source](https://www.quora.com/What-is-a-good-laymans-explanation-for-the-Kullback-Leibler-Divergence)
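
The "extra bits" reading can be checked numerically. The sketch below assumes an idealized code that spends exactly `-log2(prob)` bits on a symbol with probability `prob`; the two distributions and all names are hypothetical examples, not taken from the source.

```python
from math import log2

# Hypothetical distributions over three symbols.
P = {"a": 1 / 2, "b": 1 / 4, "c": 1 / 4}  # where the data really comes from
Q = {"a": 1 / 4, "b": 1 / 4, "c": 1 / 2}  # what the code was optimised for


def expected_code_length(true_dist, code_dist):
    """Average bits per symbol when symbol x costs -log2(code_dist[x]) bits."""
    return sum(p_x * -log2(code_dist[x]) for x, p_x in true_dist.items())


optimal_bits = expected_code_length(P, P)     # entropy of P
mismatched_bits = expected_code_length(P, Q)  # cross-entropy of P and Q
extra_bits = mismatched_bits - optimal_bits

# KL divergence computed directly from its definition, in bits.
kl_bits = sum(p_x * log2(p_x / Q[x]) for x, p_x in P.items())

print("bits per symbol with the right code:", optimal_bits)
print("bits per symbol with the wrong code:", mismatched_bits)
print("extra bits per symbol:", extra_bits)
print("D(P || Q):", kl_bits)
```

The penalty for using the code built for Q equals D(P||Q): here the right code averages 1.5 bits per symbol, the wrong one 1.75, and the divergence is 0.25 bits.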

KL in continuous space
-----

![](images/kl.png)

![](images/kl_density.png)
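
In the continuous case the sum becomes an integral of p(x) log(p(x) / q(x)) over x. As a rough check, the sketch below approximates that integral for two Gaussian densities with a midpoint rule and compares it with the well-known closed form for Gaussians. The function names, the example parameters, and the integration range of [-12, 12] are assumptions chosen for this illustration.

```python
from math import exp, log, pi


def gaussian_log_pdf(x, mean, std):
    """Log-density of a normal distribution at x."""
    return -0.5 * log(2 * pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def kl_numeric(mean_p, std_p, mean_q, std_q, lower=-12.0, upper=12.0, steps=24000):
    """Midpoint-rule approximation of the integral of p(x) * log(p(x) / q(x))."""
    width = (upper - lower) / steps
    total = 0.0
    for k in range(steps):
        x = lower + (k + 0.5) * width
        log_p = gaussian_log_pdf(x, mean_p, std_p)
        log_q = gaussian_log_pdf(x, mean_q, std_q)
        total += exp(log_p) * (log_p - log_q) * width
    return total


def kl_closed_form(mean_p, std_p, mean_q, std_q):
    """Closed-form KL(N(mean_p, std_p^2) || N(mean_q, std_q^2)) in nats."""
    return (
        log(std_q / std_p)
        + (std_p ** 2 + (mean_p - mean_q) ** 2) / (2 * std_q ** 2)
        - 0.5
    )


# Example: P = N(0, 1) and Q = N(1, 2^2); the range above comfortably covers P.
numeric = kl_numeric(0.0, 1.0, 1.0, 2.0)
exact = kl_closed_form(0.0, 1.0, 1.0, 2.0)

print("numeric integral:", numeric)
print("closed form:", exact)
```

Both values are in nats because the natural logarithm is used. Divide by `log(2)` to convert to bits.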

Really go learn Information Theory
-----

![](http://cosmicfingerprints.com/wp-content/uploads/2009/07/noise_info_entropy.jpg)

[Elements of Information Theory Second Edition](http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471241954.html) is a good starting place. It describes Information Theory in terms of statistics with lots of worked examples and good exercises.

Check for understanding
-----

Can you reverse t-SNE (i.e., go from low dimensions to meaningful high dimensions)? Why or why not?

__No.__

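t-SNE does not learn an explicit mapping between the two spaces; it only optimizes the locations of the points in the map. Many different high-dimensional configurations are consistent with the same low-dimensional map, so you can construct high-dimensional spaces that don't "lose" information but don't have "meaning" either.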

Okay - I want to use t-SNE
----

[t-sne code](https://lvdmaaten.github.io/tsne/) 

[Implementation in python](http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html)

Disclaimer: t-SNE's cost function is not convex, so different initializations can produce different results.
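
A minimal sketch of what that means in practice, using the scikit-learn implementation linked above. The toy data, the helper name `embed`, and the parameter values are assumptions for illustration; `init="random"` is passed explicitly so that the seed controls the starting positions.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Hypothetical toy data: two Gaussian blobs of 30 points each in 10 dimensions.
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(30, 10)),
    rng.normal(loc=6.0, scale=1.0, size=(30, 10)),
])


def embed(seed):
    """Run t-SNE from a random initialization controlled by `seed`."""
    model = TSNE(n_components=2, perplexity=10, init="random", random_state=seed)
    embedding = model.fit_transform(X)
    return embedding, model.kl_divergence_


embedding_a, kl_a = embed(seed=0)
embedding_b, kl_b = embed(seed=1)

print("shapes:", embedding_a.shape, embedding_b.shape)
print("final KL divergence per run:", kl_a, kl_b)
print("identical coordinates?", np.allclose(embedding_a, embedding_b))
```

Fixing `random_state` makes a single run repeatable. Comparing several seeds, and keeping the run with the lowest final KL divergence, is a common way to cope with the non-convex cost.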


----