## Nonlinear Dimensionality Reduction
G. Richards (2016, 2018, 2020) based on materials from Connolly, Miller, Leighly, VanderPlas, and Ivezic Chapter 7.

Today we will talk about the concepts of 
* manifold learning
* nonlinear dimensionality reduction

Specifically using the following algorithms
* locally linear embedding (LLE)
* isometric mapping (IsoMap)
* t-distributed Stochastic Neighbor Embedding (t-SNE)

Let's start by echoing the brief note of caution given in Adam Miller's notebook: "astronomers will often try to derive physical insight from PCA eigenspectra or eigentimeseries, but this is not advisable as there is no physical reason for the data to be linearly and orthogonally separable".  Moreover, physical components are (generally) non-negative, whereas PCA components can take either sign.  So, PCA is great for dimensionality reduction, but for doing physics there are generally better choices.

While NMF "solves" the issue of negative components, it is still a linear process.  For data with non-linear correlations, an entire field, known as [Manifold Learning](http://scikit-learn.org/stable/modules/manifold.html) and [nonlinear dimensionality reduction](https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction), has been developed, with several algorithms available via the [`sklearn.manifold`](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.manifold) module.
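To make the sign issue concrete, here is a minimal sketch that fits PCA and NMF to the digits pixel data and prints the smallest entry of each set of components. The choice of 5 components and the NMF settings (`init="nndsvda"`, `max_iter=500`) are arbitrary assumptions for illustration, not values from the text.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import NMF, PCA

# Pixel intensities are non-negative, much like fluxes in a spectrum
pixel_data = load_digits().data

# Number of components and NMF settings are illustrative assumptions
pca = PCA(n_components=5).fit(pixel_data)
nmf = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0).fit(pixel_data)

# PCA components mix positive and negative values; NMF components do not
print("min PCA component value:", pca.components_.min())
print("min NMF component value:", nmf.components_.min())
```

Both decompositions still rebuild each image as a linear combination of components, which is the limitation the manifold methods below try to get around.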

For example, if your data set looks like this:
![IvezicFigure7p8a.png](attachment:IvezicFigure7p8a.png)

Then PCA is going to give you something like this.  

![IvezicFigure7p8b.png](attachment:IvezicFigure7p8b.png)

Clearly not very helpful!

What you were really hoping for is something more like the results below.  For more examples see
[Vanderplas & Connolly 2009](http://iopscience.iop.org/article/10.1088/0004-6256/138/5/1365/meta;jsessionid=48A569862A424ECCAEECE2A900D9837B.c3.iopscience.cld.iop.org)

![IvezicFigure7p8cd.png](attachment:IvezicFigure7p8cd.png)
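If you want to experiment with this kind of data yourself, the following is a rough sketch using Scikit-Learn's `make_s_curve` toy generator. The sample size, the 10 neighbors, and the use of `trustworthiness` as a score for how well local neighborhoods survive the projection are all assumptions made for illustration; they are not taken from the figures above.

```python
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding, trustworthiness

# Toy data: a 2D sheet bent into an "S" shape in 3D.
# t_true is the position along the curve, i.e. the coordinate we hope to recover.
X_s, t_true = make_s_curve(n_samples=1000, random_state=0)

# Linear projection versus a nonlinear embedding (parameter values are assumptions)
pca_proj = PCA(n_components=2).fit_transform(X_s)
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2, random_state=0)
lle_proj = lle.fit_transform(X_s)

# Trustworthiness lies in [0, 1]; higher means local neighborhoods are better preserved
trust_pca = trustworthiness(X_s, pca_proj, n_neighbors=10)
trust_lle = trustworthiness(X_s, lle_proj, n_neighbors=10)
print("PCA trustworthiness:", trust_pca)
print("LLE trustworthiness:", trust_lle)
```

Scatter-plotting either projection with `c=t_true` should give a picture comparable in spirit to the figures above.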

## Locally Linear Embedding

[Locally Linear Embedding](http://scikit-learn.org/stable/modules/generated/sklearn.manifold.LocallyLinearEmbedding.html#sklearn.manifold.LocallyLinearEmbedding) attempts to embed high-$D$ data in a lower-$D$ space.  Crucially it also seeks to preserve the geometry of the local "neighborhoods" around each point.  In the case of the "S" curve, it seeks to unroll the data.  The steps are

Step 1: define local geometry
- local neighborhoods are determined from the $k$ nearest neighbors of each point.
- for each point, calculate the weights that reconstruct it from its $k$ nearest neighbors by minimizing
$$
\begin{equation}
  \mathcal{E}_1(W) = \left|X - WX\right|^2,
\end{equation}
$$
where $X$ is the $N\times K$ data matrix ($N$ points in $K$ dimensions, not to be confused with the number of neighbors $k$) and $W$ is an $N\times N$ matrix that minimizes the reconstruction error.

Essentially this is finding the hyperplane that describes the local surface at each point within the data set. So, imagine that you have a bunch of square tiles and you are trying to tile the surface with them.


Step 2: embed within a lower dimensional space
- set all $W_{ij}=0$ except when point $j$ is one of the $k$ nearest neighbors of point $i$.  
- $W$ becomes very sparse for $k \ll N$ (only $Nk$ entries in $W$ are non-zero). 
- minimize
$$
\begin{equation}
  \mathcal{E}_2(Y) = \left|Y - W Y\right|^2,
\end{equation}
$$
with $W$ fixed to find $Y$, an $N\times d$ matrix holding the embedded coordinates ($d$ is the new dimensionality).

Step 1 requires a nearest-neighbor search.

Step 2 requires an eigenvalue decomposition of the matrix $C_W \equiv (I-W)^T(I-W)$.
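The two steps can be sketched in plain NumPy as below. This is a toy, dense version for small $N$ and not Scikit-Learn's implementation. The function name `lle_sketch`, the sum-to-one constraint on each row of $W$, and the small regularization term `reg` are assumptions taken from the usual formulation of LLE rather than from the text above; the sketch also assumes there are no duplicate points and that the neighbor graph is connected.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lle_sketch(X, k=10, d=2, reg=1e-3):
    """Toy dense LLE (hypothetical helper); returns the weights W and embedding Y."""
    N = X.shape[0]

    # Step 1a: nearest-neighbor search (column 0 is the point itself, so drop it)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = nn.kneighbors(X, return_distance=False)[:, 1:]

    # Step 1b: weights that best reconstruct each point from its k neighbors
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[idx[i]] - X[i]                    # neighbors relative to point i
        G = Z @ Z.T                             # local k x k Gram matrix
        G += reg * np.trace(G) * np.eye(k)      # assumed regularization for stability
        w = np.linalg.solve(G, np.ones(k))
        W[i, idx[i]] = w / w.sum()              # assumed constraint: each row sums to 1

    # Step 2: eigenvectors of C_W = (I - W)^T (I - W) with the smallest eigenvalues
    I_minus_W = np.eye(N) - W
    C_W = I_minus_W.T @ I_minus_W
    eigvals, eigvecs = np.linalg.eigh(C_W)      # eigenvalues in ascending order
    Y = eigvecs[:, 1:d + 1]                     # skip the constant (eigenvalue ~ 0) vector
    return W, Y

# Usage on a 2D linear manifold embedded in 10D
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2)) @ rng.random((2, 10))
W_demo, Y_demo = lle_sketch(X_demo, k=5, d=2)
print(W_demo.shape, Y_demo.shape, np.count_nonzero(W_demo))
```

A real implementation would store $W$ as a sparse matrix and use a sparse eigensolver, since only $Nk$ of the $N^2$ entries are non-zero.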


LLE has been applied to data as diverse as galaxy spectra, stellar spectra, and photometric light curves.   It was introduced by [Roweis & Saul (2000)](https://www.ncbi.nlm.nih.gov/pubmed/11125150).

Scikit-Learn's call to LLE is as follows (the S-curve figures above show the result of a more detailed example).

In [None]:
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding
X = np.random.normal(size=(1000,2)) # 1000 points in 2D
R = np.random.random((2,10)) # projection matrix
X = np.dot(X,R) # now a 2D linear manifold in 10D space
k = 5 # Number of neighbors to use in fit
n = 2 # Number of dimensions to fit
lle = LocallyLinearEmbedding(n_neighbors=k,n_components=n)
lle.fit(X)
proj = lle.transform(X) # 1000x2 projection of the data

See what LLE does for the digits data, using the 7 nearest neighbors and 2 components.

In [None]:
# Execute this cell to load the digits sample
%matplotlib inline
import numpy as np
from sklearn.datasets import load_digits
from matplotlib import pyplot as plt
digits = load_digits()
grid_data = np.reshape(digits.data[0], (8,8)) #reshape to 8x8
plt.imshow(grid_data, interpolation = "nearest", cmap = "bone_r")
print(grid_data)
X = digits.data
y = digits.target

In [None]:
#LLE
from sklearn.manifold import LocallyLinearEmbedding
k = ____ # Number of neighbors to use in fit
n = ____ # Number of dimensions to fit
lle = LocallyLinearEmbedding(n_neighbors=____,n_components=____)
lle.fit(X)
X_reduced = lle.transform(X)

plt.scatter(X_reduced[:,0], X_reduced[:,1], c=y, cmap="nipy_spectral", edgecolor="None")
plt.colorbar()

## Isometric Mapping

Isometric mapping (IsoMap) is based on the multi-dimensional scaling (MDS) framework.  It was introduced in the same volume of *Science* as the LLE article above.  See [Tenenbaum, de Silva, & Langford (2000)](https://www.ncbi.nlm.nih.gov/pubmed/?term=A+Global+Geometric+Framework+for+Nonlinear+Dimensionality+Reduction).
Geodesic curves, rather than straight-line distances, are used to recover non-linear structure.
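A rough sketch of the idea, under the usual three-step description of IsoMap (build a nearest-neighbor graph, approximate geodesic distances by shortest paths through it, then run classical MDS on those distances). Those steps are an assumption from the standard formulation and are not spelled out in the text above. The helper name `isomap_sketch` and the parameter values are hypothetical, and this dense version is only meant for small data sets.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.datasets import make_s_curve
from sklearn.neighbors import kneighbors_graph

def isomap_sketch(X, k=10, d=2):
    """Toy dense IsoMap (hypothetical helper); returns an N x d embedding."""
    # 1. Neighborhood graph whose edge weights are Euclidean distances
    graph = kneighbors_graph(X, n_neighbors=k, mode="distance")

    # 2. Geodesic distance approximated by the shortest path through the graph
    D = shortest_path(graph, method="D", directed=False)
    if np.isinf(D).any():
        raise ValueError("neighbor graph is disconnected; try a larger k")

    # 3. Classical MDS: double-center the squared distances, keep the top d eigenvectors
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N
    B = -0.5 * J @ (D ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)         # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:d]
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))

# Usage on the S-curve toy data; t_s is the true position along the curve
X_curve, t_s = make_s_curve(n_samples=500, random_state=0)
Y_iso = isomap_sketch(X_curve, k=10, d=2)
print(Y_iso.shape)
print("correlation of first coordinate with t:", np.corrcoef(Y_iso[:, 0], t_s)[0, 1])
```

The shortest-path step is what separates this from ordinary MDS: two points on opposite arms of the "S" are close in straight-line distance but far apart along the surface.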

In Scikit-Learn [IsoMap](http://scikit-learn.org/stable/modules/generated/sklearn.manifold.Isomap.html) is implemented as follows:

In [None]:
# Execute this cell
import numpy as np
from sklearn.manifold import Isomap
XX = np.random.normal(size=(1000,2)) # 1000 points in 2D
R = np.random.random((2,10)) # projection matrix
XX = np.dot(XX,R) # XX is now a 2D linear manifold in 10D space
k = 5 # number of neighbors
n = 2 # number of dimensions
iso = Isomap(n_neighbors=k,n_components=n)
iso.fit(XX)
proj = iso.transform(XX) # 1000x2 projection of the data

Try 7 neighbors and 2 dimensions on the digits data.

In [None]:
# IsoMap
from sklearn.manifold import Isomap
k = 7 # Number of neighbors to use in fit
n = 2 # Number of dimensions to fit
iso = Isomap(n_neighbors=k,n_components=n)
iso.fit(X)
X_reduced = iso.____(____)

plt.scatter(____[:,0], ____[:,1], c=y, cmap="nipy_spectral", edgecolor="None")
plt.colorbar()

## t-SNE

[t-distributed Stochastic Neighbor Embedding (t-SNE)](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) is not discussed in the book, but Scikit-Learn does have a [t-SNE implementation](http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) and this manifold learning algorithm is well worth mentioning too.  SNE itself was developed by [Hinton & Roweis](http://www.cs.toronto.edu/~fritz/absps/sne.pdf), with the "$t$" part being added by [van der Maaten & Hinton](http://jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf).  In Scikit-Learn it is used much like the other manifold learning algorithms, except that `TSNE` has no separate `transform` method, so the embedding comes from `fit_transform`.

 Try it on the digits data.  You'll need to import `TSNE` from `sklearn.manifold`, instantiate it with 2 components, then do a `fit_transform` on the original data.

In [None]:
# t-SNE
from sklearn.manifold import TSNE
tsne = ___(___=___,learning_rate=200)
X_reduced = ___.___(___)

plt.scatter(___, ___, c=y, cmap="nipy_spectral", edgecolor="None")
plt.colorbar()

You'll know if you have done it right if you understand Adam Miller's comment "Holy freakin' smokes.  That is magic.  (It's possible we just solved science)."

Personally, I think that some exclamation points may be needed in there!

What's even more illuminating is to make the plot using the actual digits to plot the points.  Then you can see why certain digits are alike or split into multiple regions.  Can you explain the patterns you see here?

In [None]:
# Execute this cell
from matplotlib import offsetbox

#----------------------------------------------------------------------
# Scale and visualize the embedding vectors
def plot_embedding(X):
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = (X - x_min) / (x_max - x_min)

    plt.figure(figsize=(10, 10))
    ax = plt.subplot(111)
    for i in range(X.shape[0]):
        # label each point with its digit, colored by class (uses digits.target rather than the global y)
        plt.text(X[i, 0], X[i, 1], str(digits.target[i]), color=plt.cm.nipy_spectral(digits.target[i] / 9.))


    shown_images = np.array([[1., 1.]])  # just something big
    for i in range(digits.data.shape[0]):
        dist = np.sum((X[i] - shown_images) ** 2, 1)
        if np.min(dist) < 4e-3:
            # don't show points that are too close
            continue
        shown_images = np.r_[shown_images, [X[i]]]
        imagebox = offsetbox.AnnotationBbox(offsetbox.OffsetImage(digits.images[i], cmap=plt.cm.gray_r), X[i])
        ax.add_artist(imagebox)
    plt.xticks([]), plt.yticks([])
    
plot_embedding(X_reduced)
plt.show()

We often use dimensionality reduction to help with data visualization, so it seems appropriate to talk about some different ways to improve our data visualization using plotting tools. The first example is [seaborn.pairplot](https://seaborn.pydata.org/generated/seaborn.pairplot.html).

In [None]:
import seaborn as sns
penguins = sns.load_dataset("penguins")
sns.pairplot(penguins)

The second example is an illustration of the power of [Bokeh](https://bokeh.org/).

In [None]:
import numpy as np
from bokeh.layouts import gridplot
from bokeh.plotting import figure, output_file, show
from bokeh.models import ColumnDataSource

# prepare some data (named `mass` so the digits labels in `y` are not overwritten)
x0 = penguins['bill_depth_mm']
x1 = penguins['flipper_length_mm']
mass = penguins['body_mass_g']
species = penguins['species']

# output to static HTML file
output_file("linked_brushing.html")

# NEW: create a column data source for the plots to share
source = ColumnDataSource(data=dict(x0=x0, x1=x1, y=mass, species=species))

TOOLS = "pan,wheel_zoom,box_zoom,reset,save,box_select,lasso_select"

TOOLTIPS = [
    ("index", "$index"),
    ("(x,y)", "($x, $y)"),
    ("species", "@species"),
]

# create a new plot and add a renderer
left = figure(tools=TOOLS, tooltips=TOOLTIPS, width=350, height=350, title=None,\
             x_axis_label ="bill_depth_mm",y_axis_label ="body_mass_g")
left.circle('x0', 'y', source=source)

# create another new plot and add a renderer
right = figure(tools=TOOLS, tooltips=TOOLTIPS, width=350, height=350, title=None, \
              x_axis_label ="flipper_length_mm",y_axis_label ="body_mass_g")
right.circle('x1', 'y', source=source)

# put the subplots in a gridplot
p = gridplot([[left, right]])

# show the results
show(p,browser="chrome")

Finally for today -- if irises, penguins, or hand-written digits aren't your thing, then have a look through some of these public data repositories:

- [https://github.com/caesar0301/awesome-public-datasets?utm_content=buffer4245d&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer](https://github.com/caesar0301/awesome-public-datasets?utm_content=buffer4245d&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer)
- [http://www.datasciencecentral.com/m/blogpost?id=6448529%3ABlogPost%3A318739](http://www.datasciencecentral.com/m/blogpost?id=6448529%3ABlogPost%3A318739)
- [http://www.kdnuggets.com/2015/04/awesome-public-datasets-github.html](http://www.kdnuggets.com/2015/04/awesome-public-datasets-github.html)