In [None]:
# Reload modules on change
%load_ext autoreload
%autoreload 2

In [None]:
# Numpy
import numpy as np
import numpy.random as random

# Pandas
import pandas as pd

# Plotly
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
init_notebook_mode()

# Scikit-learn
from sklearn.decomposition import PCA, FastICA
from sklearn.manifold import MDS, LocallyLinearEmbedding

# Custom plotting
from plotly_util import scatter_matrix, scatter_matrix_3d

# Umbrella
from umbrella import Umbrella

# Dataset-specific
from timecourse_util import timecourse_marker

# Dimensionality reduction
![dimensionality reduction](img/Dimensionality-reduction-1.png)

Dimensionality reduction refers to a family of mathematical techniques that represent high-dimensional objects in lower dimensions.

Dimensionality reduction is performed for two main reasons: to explore the relationships within the data and to reduce the number of components for further analysis.

Say we have a dataset with $n=50$ features (columns). If we want to use only $n'=2$ of them to make a plot, we can start taking pairs of columns in our data and plotting each pair in turn. In this case, we would need $\frac{n(n-1)}{2}$ plots to show all the combinations. Each such plot is a projection of the data into 2-dimensional space, or simply its 'shadow'.
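To get a feel for the count, here is a small sketch that enumerates the column pairs for $n=50$. The variable names are illustrative only and are not part of the notebook's data.

```python
from itertools import combinations
from math import comb

n_features = 50  # n from the text above

# Every unordered pair of columns gives one 2-D scatter plot
column_pairs = list(combinations(range(n_features), 2))
n_plots = len(column_pairs)

# All three ways of counting agree
print(n_plots, comb(n_features, 2), n_features * (n_features - 1) // 2)
# prints: 1225 1225 1225
```

Over a thousand separate plots is far too many to inspect by eye.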

Can we do better?

# Stochastic Umbrellas

To start with an example, let us consider a probabilistic model of an umbrella.

![umbrella](img/bright-rainbow-umbrella.jpg)

In [None]:
umbrella = Umbrella(1000)

In [None]:
umbrella.plot()

# PCA - Principal component analysis

_Find axes with maximum variance_


PCA is the simplest technique for dimensionality reduction. One way to think about PCA is as a rotation and scaling of our data.

PCA finds a new set of axes, along which the variance is maximized. It also scales the data along these axes. 

# PCA 

[Explained Visually - Eigenvectors](http://setosa.io/ev/eigenvectors-and-eigenvalues/)

[Explained Visually - PCA](http://setosa.io/ev/principal-component-analysis/)

* Find the covariance matrix of the data
* Find the eigenvectors of the covariance matrix
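The two steps above can be sketched directly in NumPy. This is a minimal illustration on hypothetical toy data (not the umbrella), and it is not how scikit-learn's `PCA` is implemented internally; names such as `X` and `components` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: 500 points stretched differently along 3 axes
X = rng.normal(size=(500, 3)) * np.array([5.0, 2.0, 0.5])

# Center the data so the covariance is computed around the mean
X_centered = X - X.mean(axis=0)

# Step 1: covariance matrix of the features (3 x 3)
cov = np.cov(X_centered, rowvar=False)

# Step 2: eigenvectors of the covariance matrix.
# eigh is used because the covariance matrix is symmetric;
# it returns eigenvalues in ascending order, so we reverse them.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
components = eigenvectors[:, order].T  # one principal component per row

# Project the data onto the new axes
X_projected = X_centered @ components.T

# Fraction of the total variance captured by each component
explained_variance_ratio = eigenvalues / eigenvalues.sum()
print(explained_variance_ratio)
```

The eigenvalues are the variances along the new axes, which is why sorting them in descending order puts the direction of maximum variance first.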

In [None]:
umbrella_pca = PCA(n_components=3).fit_transform(umbrella.matrix)
scatter_matrix(umbrella_pca, marker = umbrella.marker, 
               title="Principal component analysis", 
               x_label="PC1", y_label="PC2")

In [None]:
scatter_matrix(umbrella_pca, dims=[1,2], marker = umbrella.marker, 
               title="Principal component analysis", 
               x_label="PC2", y_label="PC3")

# PCA components

In [None]:
# Let's get all the principal components
umbrella_pca = PCA(n_components=3).fit(umbrella.matrix)

umbrella_pca.components_

Notice above that `PC1` is close to $(1, 0, 0)$. This happens because the handle of the umbrella contains the most variance, so it becomes the first principal component. `PC2` and `PC3` lie along the canopy ($z$ component is close to zero). Notice that the components are orthogonal (perpendicular) vectors. This orthogonality is one of the key properties of PCA: it does not reshape the original object (it only rotates and scales it).

In [None]:
# Check how much variance is captured
umbrella_pca.explained_variance_ratio_

In [None]:
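# Check orthogonality: each row of components_ is one principal component,
# so the dot product of two different rows should be close to zero
print(np.dot(umbrella_pca.components_[0], umbrella_pca.components_[1]))
print(np.dot(umbrella_pca.components_[1], umbrella_pca.components_[2]))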

# PCA discussion

## Pros

* Very fast
* Does not deform the data

## Cons

* Variance captured by the first few components may be small
* Components are constrained to be orthogonal

# MDS - Multidimensional scaling

_Minimize stress of space embedding_

$$Stress = \sqrt{\frac{\sum(f(x)-d)^2}{\sum d^2}}$$

The idea behind MDS is to take the distances between points in the dataset and try to represent them in a smaller number of dimensions. Since it is not possible to perfectly represent an object in lower dimensions, we need a measure that tells us how badly we do.

One such measure is _stress_. It expresses how much the object is deformed by projecting it.

Above, $d$ is some distance between points, and $f(x)$ is the transformation of our data. The formula sums the squared differences between the two over all pairs of points and normalizes the result by the sum of squared distances.
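As a rough illustration, the stress formula can be computed as follows. This sketch assumes that $d$ is the pairwise Euclidean distance in the original space and that $f(x)$ is the corresponding distance in the embedding; other conventions normalize by the embedded distances instead. The helper names `pairwise_distances` and `stress` are hypothetical.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance between every pair of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(original, embedded):
    """Normalized stress, assuming d = original distances
    and f(x) = distances in the embedding."""
    d = pairwise_distances(original)
    f = pairwise_distances(embedded)
    return np.sqrt(((f - d) ** 2).sum() / (d ** 2).sum())

rng = np.random.default_rng(0)
points_3d = rng.normal(size=(100, 3))  # toy data, not the umbrella
shadow_2d = points_3d[:, :2]           # naive embedding: drop the z axis

print(stress(points_3d, points_3d))  # identical configurations give 0.0
print(stress(points_3d, shadow_2d))  # dropping an axis deforms the distances
```

A stress of zero means all pairwise distances are preserved exactly; larger values mean more deformation.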

## Euclidean distances

$$d_{ij}=\sqrt{\sum_q(y_{iq}-y_{jq})^2}$$


$$Stress_D(x_1, x_2, ..., x_N) = \sqrt{ \sum_{i \ne j = 1,...,N}(d_{ij} - \|x_i - x_j\|)^2}$$

In the equations above, $y_{iq}$ is the value of feature $q$ for point $i$ in the original data, so $d_{ij}$ is the distance between points $i$ and $j$ in the high-dimensional space. $x_i$ is the low-dimensional image of point $i$, and $\|x_i-x_j\|$ is the low-dimensional distance.

Our task is to find an arrangement of points in low-dimensional space that minimizes the stress function. In the case of Euclidean distances, we can use linear algebra to find solutions efficiently. More generally, we can use standard optimization algorithms to solve this problem.
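For the Euclidean case, the linear-algebra route is usually called classical MDS: double-center the squared distance matrix and take its top eigenvectors. The sketch below is an illustration under that assumption, with a hypothetical helper `classical_mds`. It is not the algorithm used by scikit-learn's `MDS` class, which relies on an iterative optimizer (SMACOF).

```python
import numpy as np

def classical_mds(D, n_components=2):
    """Classical MDS from a matrix of pairwise Euclidean distances."""
    n = D.shape[0]
    # Double centering turns squared distances into inner products
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    # Keep the eigenvectors with the largest eigenvalues
    eigenvalues, eigenvectors = np.linalg.eigh(B)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigenvalues[order], 0, None))
    return eigenvectors[:, order] * scale

rng = np.random.default_rng(0)
# Toy data: 50 points on a flat sheet, rotated arbitrarily in 3-D
flat = np.column_stack([rng.normal(size=(50, 2)), np.zeros(50)])
rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))
points_3d = flat @ rotation

D = np.linalg.norm(points_3d[:, None, :] - points_3d[None, :, :], axis=-1)
embedding = classical_mds(D, n_components=2)
D_embedded = np.linalg.norm(embedding[:, None, :] - embedding[None, :, :], axis=-1)

# The sheet is intrinsically 2-D, so the distances should be
# recovered up to floating-point error
print(np.abs(D_embedded - D).max())
```

Because the toy points really do lie on a plane, two dimensions are enough; for data like the umbrella, the same procedure gives only the best approximation in the chosen number of dimensions.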

In [None]:
umbrella_mds = MDS(n_components=2).fit_transform(umbrella.matrix)
scatter_matrix(umbrella_mds, umbrella.marker, 
               title="Multidimensional scaling")

# MDS discussion

## Pros

* Fewer assumptions about the data
* Better for visualization

## Cons

* Slower
* Harder to interpret

# ICA - Independent Component Analysis

_Separate non-Gaussian components_

![ICA](img/ICA.png)

# Facial recognition

[Face Recognition by Independent Component Analysis](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2898524/)

Bartlett, Movellan, Sejnowski, 2002

# PCA components

![pca-faces](img/pca-faces.png)

# ICA components

![ica-faces](img/ica-faces.png)

There are multiple implementations of the ICA algorithm. Here, we focus on FastICA, which is the de facto standard.

FastICA maximizes the non-Gaussianity of the components, which serves as a proxy for independence. Intuitively, the joint Gaussian distribution is completely symmetric, so one cannot tease apart the components by any linear transformation.

Essentially, FastICA maximizes an approximation of 'negentropy'. Negentropy is the difference in information content between a variable and its Gaussian counterpart. The Gaussian counterpart is a Gaussian random variable that has the same covariance matrix as the original variable.

Negentropy is hard to calculate directly, so it is estimated with approximations, for example ones based on kurtosis. The excess kurtosis of a normally distributed variable is 0, while most non-Gaussian variables have non-zero excess kurtosis.
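To make the kurtosis argument concrete, the sketch below estimates the excess kurtosis of samples from a few distributions. The helper `excess_kurtosis` and the choice of distributions are assumptions made for illustration; this shows only the non-Gaussianity measure, not the FastICA algorithm itself.

```python
import numpy as np

def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    z = (x - x.mean()) / x.std()
    return (z ** 4).mean() - 3.0

rng = np.random.default_rng(0)
n = 200_000
samples = {
    "gaussian": rng.normal(size=n),           # theoretical value: 0
    "uniform": rng.uniform(-1, 1, size=n),    # flatter than Gaussian: -1.2
    "laplace": rng.laplace(size=n),           # heavier tails: 3
}

kurtosis = {name: excess_kurtosis(x) for name, x in samples.items()}
for name, value in kurtosis.items():
    print(f"{name}: {value:.2f}")
```

The estimates fluctuate around the theoretical values given in the comments. The sign tells us whether a distribution is flatter (negative) or heavier-tailed (positive) than a Gaussian.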

In [None]:
umbrella_ica = FastICA(n_components = 2).fit_transform(umbrella.matrix)
scatter_matrix(umbrella_ica, umbrella.marker, title="Independent component analysis")

# ICA discussion

## Pros

* Detects descriptive features

## Cons

* Assumes additive interactions

## Further reading

* [FastICA Paper](https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf)

# LLE - Locally Linear Embedding

_Minimize stress of embedding neighborhoods_

![LLE](img/LLE.png)

LLE reconstructs the high-dimensional space with patches of low dimension. We can think of this as cutting up small fragments of a sphere and fitting them together on a sheet of paper, while trying to keep the angles unchanged.

As a result, we can "unravel" the different sections of the underlying geometry as separate parts.

1. Start with finding $k$ nearest neighbors

2. Minimize the familiar embedding function:

    $\epsilon(W) = \sum_i \bigg\lvert X_i - \sum_j W_{ij} X_j \bigg\rvert ^2$

    In this case, we add two important constraints:

    * $W$ only has entries for the neighbors
    * Rows of $W$ sum to $1$
    

3. While keeping $W$ fixed, minimize the embedding function in a lower dimension for _all_ the points
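The three steps above can be sketched in plain NumPy. This is a simplified illustration, not scikit-learn's `LocallyLinearEmbedding` implementation: the function name `lle_sketch`, the brute-force neighbor search, the regularization constant `reg`, and the toy data are all assumptions.

```python
import numpy as np

def lle_sketch(X, n_neighbors=10, n_components=2, reg=1e-3):
    n = X.shape[0]

    # Step 1: k nearest neighbors by brute force.
    # Column 0 of the sorted indices is the point itself
    # (assuming no duplicate points), so we skip it.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = np.argsort(dists, axis=1)[:, 1:n_neighbors + 1]

    # Step 2: weights that best reconstruct each point from its neighbors.
    # W is zero outside the neighborhood and each row sums to 1.
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[neighbors[i]] - X[i]                    # neighbors centered on X_i
        C = Z @ Z.T                                   # local Gram matrix (k x k)
        C += reg * np.trace(C) * np.eye(n_neighbors)  # keep C invertible
        w = np.linalg.solve(C, np.ones(n_neighbors))
        W[i, neighbors[i]] = w / w.sum()

    # Step 3: with W fixed, find low-dimensional points Y minimizing
    # sum_i |Y_i - sum_j W_ij Y_j|^2. The solution is given by the
    # eigenvectors of M with the smallest eigenvalues, skipping the
    # first (constant) one.
    I_minus_W = np.eye(n) - W
    M = I_minus_W.T @ I_minus_W
    _, eigenvectors = np.linalg.eigh(M)
    Y = eigenvectors[:, 1:n_components + 1]
    return Y, W

rng = np.random.default_rng(0)
# Toy data: a rolled-up 2-D sheet embedded in 3-D
t = rng.uniform(0, 3 * np.pi, size=200)
h = rng.uniform(0, 5, size=200)
X = np.column_stack([t * np.cos(t), h, t * np.sin(t)])

Y, W = lle_sketch(X, n_neighbors=10, n_components=2)
print(Y.shape)
```

The regularization term is a practical choice for the case where there are more neighbors than input dimensions, which would otherwise make the local Gram matrix singular.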

In [None]:
umbrella_lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10).fit_transform(umbrella.matrix)
scatter_matrix(umbrella_lle, umbrella.marker,
               title="Locally linear embedding")

# LLE discussion

## Pros

* Fast
* Robust

## Cons

* Requires $k$ to be specified
* Can perform poorly with large $k$ and small $D$

# Working with data

![paper](img/paper-front.png)

![time-course-rna-seq](img/trapnell.png)

In [None]:
%%bash
du -h ../data/expression_matrix.csv
head ../data/expression_matrix.csv | cut -d',' -f 1-3

In [None]:
expression = pd.read_csv("../data/expression_matrix.csv", index_col=0)
expression.info()

In [None]:
expression.head()

In [None]:
expression_marker = timecourse_marker(expression)

In [None]:
expression_pca = PCA(n_components=2).fit_transform(expression)
scatter_matrix(expression_pca, expression_marker, title="Cell expression profile PCA")

In [None]:
expression_mds = MDS(n_components=2).fit_transform(expression)
scatter_matrix(expression_mds, expression_marker, title="Cell expression profile MDS")

In [None]:
expression_ica = FastICA(n_components=2).fit_transform(expression)
scatter_matrix(expression_ica, expression_marker, title="Cell expression profile ICA")

In [None]:
expression_mlle = LocallyLinearEmbedding(n_neighbors=10, 
                                         n_components=2).fit_transform(expression)

scatter_matrix(expression_mlle, expression_marker, title="Cell expression profile LLE")