### t-SNE: a Brief Introduction

* t-Distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality reduction technique that is particularly well suited for the visualization of high-dimensional datasets
  * Ideal for projecting data down to two dimensions for plotting

* Like PCA, t-SNE takes a high-dimensional dataset and reduces it to two dimensions.

  * Unlike PCA, it emphasizes retaining the "neighborhoods."
    * No guarantees about very distant points
  
  


![](https://www.dropbox.com/s/lw7sy9xda3pfoaj/t-SNE_Intro.png?dl=1)


* Observe that there is no single axis onto which we can project the data while preserving its variance

### t-SNE Key Idea

* Prior to t-SNE, there was SNE (Stochastic Neighbor Embedding)

* SNE preserves the distances between the nearest neighbors in high-dimensional space by making them nearest neighbors in low-dimensional space.
 
  * This approach is common to many non-linear dimensionality reduction techniques
    * Unlike PCA, they are not concerned with preserving variance (the distances between faraway points)
    
  * This makes them well suited to highlighting clusters of similar points in lower-dimensional space
  
* SNE suffers from the crowding problem: points tend to cluster in a much smaller area in low-dimensional space
   



![](https://www.dropbox.com/s/u0q26ntkj1gata4/shuffle_data.png?dl=1)

![](https://www.dropbox.com/s/tk1yxknnszl0ewd/t-SNE_intuition.png?dl=1)

### SNE Key Idea - Cont'd

* SNE models similarity between instances using probability
  * Converts the Euclidean distance into a probability
  * You can think of it as the probability that $x_i$ and $x_j$ are actually neighbors
    * Probability is easier to work with here because distance does not necessarily imply proximity in lower-dimensional space.
    
    
* The similarity between points $x_i$ and $x_j$ of the original dataset is the conditional probability of picking $x_j$ as a neighbor, given that $x_i$ was selected:

<img src="https://www.dropbox.com/s/cli98l323uq125b/dist_xi_xj.png?dl=1" alt="drawing" style="width:300px;"/>


* This is somewhat similar to the weights computed in k-Nearest Neighbors.
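The conversion from distance to probability can be sketched in a few lines of plain Python. This is a minimal sketch that assumes the usual Gaussian form for $p_{j|i}$ with a single bandwidth `sigma`; the function names and the toy points are illustrative only.

```python
import math

def squared_distance(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def conditional_p(points, i, sigma=1.0):
    """Return the list of p_{j|i} for all j, with p_{i|i} set to 0.

    Assumption: Gaussian weights exp(-|x_i - x_j|^2 / (2 sigma^2)),
    normalized over all k != i.
    """
    weights = [
        0.0 if j == i
        else math.exp(-squared_distance(points[i], x_j) / (2 * sigma ** 2))
        for j, x_j in enumerate(points)
    ]
    total = sum(weights)
    return [w / total for w in weights]

# Three toy points in 3-D: the first two are close, the third is far away.
points = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (3.0, 3.0, 3.0)]
p_given_0 = conditional_p(points, i=0)
print(p_given_0)
```

Almost all of the probability mass for the first point goes to its close neighbor, and almost none to the distant point.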



### SNE Key Idea - Cont'd

* In the low-dimensional space (e.g., $d=2$), we define the similarity between the points $i$ and $j$ as

$$
q_{j|i} = \frac{e^{-|y_i -y_j|^2}}{\sum_{k \ne i} e^{-|y_i -y_k|^2}}
$$

* This has the same form as the similarity in the high-dimensional space, except that the variance is constant and the same across all the points.

* Ideally, we want to preserve the similarities as much as possible, so that $p_{j|i}$ and $q_{j|i}$ are as close as possible
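As a rough illustration of the low-dimensional formula, the sketch below evaluates $q_{j|i}$ for a made-up 2-D embedding. The helper name `conditional_q` and the coordinates are hypothetical.

```python
import math

def conditional_q(embedding, i):
    """q_{j|i}: Gaussian weights exp(-|y_i - y_j|^2), normalized over k != i."""
    weights = []
    for j, y_j in enumerate(embedding):
        if j == i:
            weights.append(0.0)  # a point is never its own neighbor
        else:
            sq_dist = sum((a - b) ** 2 for a, b in zip(embedding[i], y_j))
            weights.append(math.exp(-sq_dist))
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical 2-D embedding of three points.
embedding = [(0.0, 0.0), (0.3, 0.4), (2.0, 0.0)]
q_given_0 = conditional_q(embedding, i=0)
print(q_given_0)
```

Note that there is no per-point bandwidth here: every point uses the same fixed variance.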



### SNE Key Idea - Cont'd


* We want to reconcile the similarities in the high-dimensional space and the low-dimensional space. 
  * We want to infer new positions in space $y$. 
  * Which positions should we pick so that distances in low-dimensional space make sense?

* We propose a cost function that measures the mismatch between the similarities in the two coordinate systems, and we minimize it.
  * We use the KL divergence for that.
  
  
$$
    KL(p||q) = \sum_{ij} p_{j|i} \log\frac{p_{j|i}}{q_{j|i}}
$$
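To make the cost concrete, here is a minimal sketch that evaluates it for small, made-up probability tables. `P[i][j]` stands for $p_{j|i}$ and `Q[i][j]` for $q_{j|i}$; the numbers are invented for illustration and are not derived from real data.

```python
import math

def kl_cost(P, Q):
    """Sum over i and j != i of p_{j|i} * log(p_{j|i} / q_{j|i}).

    Terms with p = 0 contribute nothing (the usual 0 * log 0 = 0 convention).
    """
    cost = 0.0
    for i, (p_row, q_row) in enumerate(zip(P, Q)):
        for j, (p, q) in enumerate(zip(p_row, q_row)):
            if i != j and p > 0.0:
                cost += p * math.log(p / q)
    return cost

# Made-up conditional distributions for three points (each row sums to 1).
P = [[0.0, 0.9, 0.1],
     [0.8, 0.0, 0.2],
     [0.5, 0.5, 0.0]]
Q_good = [[0.0, 0.85, 0.15],
          [0.75, 0.0, 0.25],
          [0.5, 0.5, 0.0]]
Q_bad = [[0.0, 0.1, 0.9],
         [0.2, 0.0, 0.8],
         [0.5, 0.5, 0.0]]

print(kl_cost(P, P))       # identical distributions
print(kl_cost(P, Q_good))  # close match
print(kl_cost(P, Q_bad))   # neighbors swapped
```

The cost is zero when the two tables agree and grows as the low-dimensional similarities drift away from the high-dimensional ones.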



### SNE Key Idea - Cont'd

* Kullback–Leibler divergence is simply a measure of how one probability distribution is different from a second, reference probability distribution

* What we need to do is find the set of points $y_i$ that minimizes the KL divergence.
  * Those are the coordinates in 2D space.
    
![](https://www.dropbox.com/s/0zvh0avtsasszpr/KL_divergence.png?dl=1)
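The search for the low-dimensional points can be sketched with plain gradient descent. This is a toy illustration under several assumptions that are not stated in these notes: it uses the gradient from the original SNE formulation, $\frac{\partial C}{\partial y_i} = 2\sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_i - y_j)$, a fixed $\sigma = 1$, a fixed learning rate, and hand-picked data and starting positions. Practical implementations add momentum and other refinements.

```python
import math

def conditional_probs(points, scale):
    """Row-normalized Gaussian similarities; M[i][j] plays the role of p_{j|i} or q_{j|i}."""
    n = len(points)
    M = []
    for i in range(n):
        row = [0.0] * n
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                row[j] = math.exp(-d2 / scale)
        total = sum(row)
        M.append([w / total for w in row])
    return M

def kl_cost(P, Q):
    """Sum of p * log(p / q) over all entries with p > 0 (the diagonal is skipped)."""
    return sum(p * math.log(p / q)
               for p_row, q_row in zip(P, Q)
               for p, q in zip(p_row, q_row) if p > 0.0)

def sne_gradient(P, Q, Y):
    """Gradient of the cost with respect to each low-dimensional point y_i."""
    n, dim = len(Y), len(Y[0])
    grad = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                coeff = 2.0 * (P[i][j] - Q[i][j] + P[j][i] - Q[j][i])
                for d in range(dim):
                    grad[i][d] += coeff * (Y[i][d] - Y[j][d])
    return grad

# Toy data: two tight pairs of points, far apart in 3-D.
X = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0), (5.0, 5.0, 5.0), (5.2, 5.0, 5.0)]
P = conditional_probs(X, scale=2.0)  # scale = 2 * sigma^2 with sigma = 1

# Small, arbitrary starting positions in 2-D.
Y = [[0.01, 0.00], [-0.01, 0.02], [0.02, -0.01], [0.00, 0.01]]
initial_cost = kl_cost(P, conditional_probs(Y, scale=1.0))

learning_rate = 0.1
for _ in range(500):
    Q = conditional_probs(Y, scale=1.0)
    grad = sne_gradient(P, Q, Y)
    Y = [[y - learning_rate * g for y, g in zip(y_row, g_row)]
         for y_row, g_row in zip(Y, grad)]

final_cost = kl_cost(P, conditional_probs(Y, scale=1.0))
print(initial_cost, final_cost)
```

Starting from nearly coincident points, the two pairs should drift apart in 2-D while each pair stays together, and the cost should drop accordingly.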

### Why Does this work?

$$
    KL(p||q) = \sum_{ij} p_{j|i} \log\frac{p_{j|i}}{q_{j|i}}
$$

* Recall that: 
  * $p_{j|i}$ is close to 1 if $i$ is very close to $j$ 
  * $p_{j|i}$ is close to 0 if $i$ is very distant from $j$ 
  



### Why Does this work?

$$
    KL(p||q) = \sum_{ij} p_{j|i} \log\frac{p_{j|i}}{q_{j|i}}
$$

* If $p_{j|i}$ is close to 1 and $q_{j|i}$ is close to 1, then  $KL(p||q)$ is close to 0
   * This generalizes to  $p_{j|i} \approx q_{j|i}$  



* If $p_{j|i}$ is large and $q_{j|i}$ small, then  $KL(p||q)$ is high
  * That is not a good solution



* If $p_{j|i}$ is small and $q_{j|i}$ large, then  $KL(p||q)$ is small
  * It's okay to put two points close to each other in 2-D, even though they were distant in the high-dimensional space.
  
* SNE focuses on preserving the local structure of the data  
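The asymmetry between these three cases is easy to check numerically. The sketch below evaluates the contribution $p \log(p/q)$ of a single pair for hand-picked values of $p$ and $q$; the specific numbers are arbitrary.

```python
import math

def kl_term(p, q):
    """Contribution of one pair to the cost: p * log(p / q)."""
    return p * math.log(p / q)

cases = {
    "neighbors kept close (p large, q large)": (0.8, 0.8),
    "neighbors pulled apart (p large, q small)": (0.8, 0.01),
    "strangers placed close (p small, q large)": (0.01, 0.8),
}
for label, (p, q) in cases.items():
    print(f"{label}: {kl_term(p, q):+.3f}")
```

Separating true neighbors costs roughly 3.5 here, while placing two distant points close together changes the sum by only about 0.04 in magnitude. A single term can be negative, but the full sum over a distribution never is.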


### How is t-SNE different from SNE?

1. Turning $p_{j|i}$ into a symmetric probability $p_{ij}$ makes it easier to compute
  * No need to compute both $p_{j|i}$ and $p_{i|j}$
  * This can be easily achieved by assuming that $\sigma$ is constant in the higher-dimensional space and normalizing over all pairs of points.
  
$$  
p_{ij} = \frac{e^{\frac{-|x_i -x_j|^2}{2\sigma^2}}}{\sum_{k \ne l} e^{\frac{-|x_k - x_l|^2}{2\sigma^2}}}
$$
  
2. Change the distribution in the low-dimensional space
  * This is necessary due to the crowding problem (points tend to cluster in a much smaller area in low-dimensional space)

$$
q_{ij} = \frac{ \frac{1}{1+|y_i-y_j|^2}}{\sum_{k\ne l}\frac{1}{1+|y_k-y_l|^2}}
$$
  
  
* t-SNE uses a Student $t$-distribution, rather than a Gaussian, in the low-dimensional space
  * Its heavier tails give more weight to distant points, which lets points spread out over a larger area in 2-D space
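The effect of the heavier tails can be seen by comparing the two kernels directly. The sketch below also computes a symmetric $q_{ij}$ for a made-up embedding, normalizing over all pairs of points so that the table is symmetric (an assumption of this sketch). All names and coordinates are illustrative.

```python
import math

def gaussian_kernel(d):
    """Unnormalized Gaussian similarity used by SNE in low dimensions."""
    return math.exp(-d ** 2)

def student_t_kernel(d):
    """Unnormalized Student t similarity (one degree of freedom) used by t-SNE."""
    return 1.0 / (1.0 + d ** 2)

def joint_q(embedding):
    """Symmetric q_ij: Student t weights normalized over all pairs k != l."""
    n = len(embedding)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(embedding[i], embedding[j]))
                weights[i][j] = 1.0 / (1.0 + d2)
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]

# The t kernel decays far more slowly as the distance grows.
for d in (0.5, 1.0, 2.0, 4.0):
    print(d, gaussian_kernel(d), student_t_kernel(d))

# Hypothetical 2-D embedding of three points.
Q = joint_q([(0.0, 0.0), (1.0, 0.0), (4.0, 3.0)])
print(Q)
```

Because moderately distant pairs still receive noticeable weight under the t kernel, the optimizer can place them farther apart without paying a large penalty.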

### Advantages and Limitations



* Limitations
  * Quadratic run time; not scalable to extremely large datasets
    * There exist variations that can handle large datasets 
    
  * The KL divergence function we are trying to minimize is not convex
    * No guarantees as to the quality of the resulting solution
  * Also stochastic: it can yield different solutions with each run.
  
  * Good visualization may require tweaking the parameters 
    * The perplexity, which is related to the $\sigma$ used when computing the similarities
    * Number of iterations: how long to keep tweaking $q$ to make it similar to $p$
  * Does not learn a transformation of the space: it only generates an embedding of the given points in a new space
  * Concerned with local structure, as opposed to keeping faraway points far apart.
    
* Advantages
  * Works impressively well, yielding stunning representations in lower-dimensional space
  * Yields, in most cases, less crowded representations of the data
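The link between perplexity and $\sigma$ mentioned under the limitations can be illustrated as follows. The sketch assumes the standard definition of perplexity, $2^{H}$ with $H$ the Shannon entropy (in bits) of the conditional distribution around one point, and searches for the $\sigma$ that reaches a target perplexity by bisection. The helper names, search bounds, and distances are hypothetical.

```python
import math

def conditional_p(sq_dists, sigma):
    """p_{j|i} for one point i, given its squared distances to the other points."""
    # Subtracting the smallest distance avoids underflow and leaves the
    # normalized probabilities unchanged.
    d_min = min(sq_dists)
    weights = [math.exp(-(d2 - d_min) / (2 * sigma ** 2)) for d2 in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probs):
    """2 ** H, where H is the Shannon entropy of probs in bits."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2 ** entropy

def sigma_for_perplexity(sq_dists, target, lo=1e-3, hi=1e3, steps=60):
    """Bisection on sigma, relying on perplexity increasing with sigma."""
    for _ in range(steps):
        mid = math.sqrt(lo * hi)  # geometric midpoint of the search interval
        if perplexity(conditional_p(sq_dists, mid)) < target:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

# Made-up squared distances from one point to five others.
sq_dists = [0.1, 0.2, 0.5, 4.0, 9.0]
sigma = sigma_for_perplexity(sq_dists, target=3.0)
print(sigma, perplexity(conditional_p(sq_dists, sigma)))
```

A perplexity of 3 means the point behaves as if it had roughly three effective neighbors, so a larger perplexity leads to a larger $\sigma$ and a wider neighborhood.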

