<a href="https://colab.research.google.com/github/NinaMaz/mlss-tutorials/blob/master/solomon-embeddings-tutorial/riemannian_opt_for_ml_task.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is a tutorial notebook on Riemannian optimization for machine learning, prepared for the Machine Learning Summer School 2019 (MLSS-2019, http://mlss2019.skoltech.ru) in Moscow, Russia, Skoltech (http://skoltech.ru).

Copyright 2019 by Alexey Artemov and ADASE 3DDL Team. Special thanks to Alexey Zaytsev for a valuable contribution.

## Riemannian optimization for machine learning

The purpose of this tutorial is to give a gentle introduction to the practice of Riemannian optimization. You will learn to: 

 1. Reformulate familiar optimization problems in terms of Riemannian optimization on manifolds.
 2. Use a Riemannian optimization library `pymanopt`.

## Index

1. [Recap and the introduction: linear regression](#Recap-and-the-introduction:-linear-regression).
2. [Introduction into ManOpt and pymanopt](#Intoduction-into-ManOpt-package-for-Riemannian-optimization).
3. [Learning the shape space of facial landmarks](#Learning-the-shape-space-of-facial-landmarks): 
 - [Problem formulation and general reference](#Problem-formulation-and-general-reference).
 - [Procrustes analysis for the alignment of facial landmarks](#Procrustes-analysis-for-the-alignment-of-facial-landmarks).
 - [PCA for learning the shape space](#PCA-for-learning-the-shape-space).
4. [Analysing the shape space of facial landmarks via MDS](#Analysing-the-shape-space-of-facial-landmarks-via-MDS).
5. [Learning the Gaussian mixture models for word embeddings](#Learning-the-Gaussian-mixture-models-for-word-embeddings).

Install the necessary libraries

In [0]:
!pip install --upgrade git+https://github.com/mlss-skoltech/tutorials.git#subdirectory=geometric_techniques_in_ML

In [0]:
!pip install pymanopt autograd
!pip install scipy==1.2.1 -U

In [0]:
import pkg_resources

DATA_PATH = pkg_resources.resource_filename('riemannianoptimization', 'data/')

## Recap and the introduction: linear regression

_NB: This section of the notebook is for illustrative purposes only, no code input required_

#### Recall the maths behind it:

We're commonly working with a problem of finding the weights $\mathbf{w} \in \mathbb{R}^n$ such that
$$
||\mathbf{y} - \mathbf{X} \mathbf{w}||^2_2 \to \min_{\mathbf{w}},
$$
with $\mathbf{x}_i \in \mathbb{R}^n$, i.e. features are vectors of numbers, and $y_i \in \mathbb{R}$.
$\mathbf{X} \in \mathbb{R}^{\ell \times n}$ is a matrix with $\ell$ objects and $n$ features.

A commonly computed least squares solution is of the form: 
$$
\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}.
$$

We could account for the non-zero mean case ($\mathrm{E} \mathbf{y} \neq 0$) by either adding and subtracting the mean, or by using an additional feature in $\mathbf{X}$ set to all ones.

The solution could simply be computed via:

In [0]:
def compute_weights_multivariate(X, y):
    """
    Given feature array X [n_samples, n_features], target vector y [n_samples],
    compute the optimal least squares solution using the formulae above.
    For brevity, no bias term!
    """
    # Compute the "inverting operator"
    R = np.dot(
        np.linalg.inv(
            np.dot(X.T, X)
        ), X.T
    )
    # Compute the actual solution
    w = np.dot(R, y)
    return w
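
As a quick sanity check of the closed-form formula, the sketch below builds a small synthetic problem and compares the normal-equation solution with NumPy's `np.linalg.lstsq`. The data, the seed, the true weights, and all `*_demo` names are assumptions made purely for illustration; the bias is handled by appending a feature of all ones, as described above.

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic, noise-free data: y = 2 * x1 - 1 * x2 + 3 (illustrative values)
X_raw = rng.randn(50, 2)
y_demo = X_raw @ np.array([2.0, -1.0]) + 3.0

# Account for the non-zero mean by appending a feature of all ones
X_demo = np.hstack([X_raw, np.ones((len(X_raw), 1))])

# Normal equations: w = (X^T X)^{-1} X^T y
w_normal = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# Reference solution from NumPy's least squares routine
w_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(w_normal)
print(w_lstsq)
```

Both vectors should agree, with the last entry playing the role of the bias. For ill-conditioned $\mathbf{X}^T \mathbf{X}$, a solver such as `np.linalg.lstsq` is usually the safer choice than an explicit inverse.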

#### Recall the gradient descent solution:

Let us view
$$
L(\mathbf{y}, \mathbf{X} \mathbf{w}) = \frac{1}{\ell} ||\mathbf{y} - \mathbf{X} \mathbf{w}||^2_2 
     \to \min_{\mathbf{w}},
$$
as a pure unconstrained optimization problem of the type 
$$
f(\mathbf{w}) \to \min\limits_{\mathbf{w} \in \mathbb{R}^n}
$$
with $f(\mathbf{w}) \equiv L(\mathbf{y}, \mathbf{X} \mathbf{w})$.

To use the gradient descent, we must 
* initialize the weights $\mathbf{w}$ somehow,
* find a way of computing the __gradient__ of our quality measure $L(\mathbf{y}, \widehat{\mathbf{y}})$ w.r.t. $\mathbf{w}$,
* starting from the initialization, iteratively update weights using the gradient descent: 
$$
\mathbf{w}^{(i+1)} \leftarrow \mathbf{w}^{(i)} - \gamma \nabla_{\mathbf{w}} L,
$$
where $\gamma$ is the step size.

Since we choose $L(\mathbf{y}, \widehat{\mathbf{y}}) \equiv \frac 1 \ell ||\mathbf{y} - \mathbf{X} \mathbf{w} ||^2$, our gradient is $\nabla_{\mathbf{w}} L = \frac 2 \ell \mathbf{X}^T (\mathbf{X} \mathbf{w} - \mathbf{y})$.

The solution is coded by:

In [0]:
from sklearn.metrics import mean_squared_error

def compute_gradient(X, y, w):
    """
    Computes the gradient of MSE loss 
    for multivariate linear regression of X onto y 
    w.r.t. w, evaluated at the current w.
    """
    prediction = np.dot(X, w)  # [n_objects, n_features] * [n_features] -> [n_objects]
    error = prediction - y  # [n_objects]
    return 2 * np.dot(error, X) / len(error)  # [n_objects] * [n_objects, n_features] -> [n_features]


def gradient_descent(X, y, w_init, iterations=1, gamma=0.01):
    """
    Performs the required number of iterations of gradient descent.
    Parameters:
        X [n_objects, n_features]: matrix of features
        y [n_objects]: response (dependent) variable
        w_init: the value of w used as an initializer
        iterations: number of steps for gradient descent to compute
        gamma: learning rate (gradient multiplier)
    """
    costs, grads, ws = [], [], []
    w = w_init
    for i in range(iterations):
        # Compute our cost in current point (before the gradient step)
        # (mean_squared_error already averages over the objects)
        costs.append(mean_squared_error(y, np.dot(X, w)))
        # Remember our weights w in current point
        ws.append(w)
        # Compute gradient for w
        w_grad = compute_gradient(X, y, w)
        grads.append(w_grad)
        # Update the current weight w using the formula above (see comments)
        w = w - gamma * w_grad
    # record the last weight
    ws.append(w)
    return costs, grads, ws
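
A cheap way to validate an analytic gradient of the MSE loss is to compare it with central finite differences. The sketch below does this on random data; the helper names (`mse_loss`, `mse_gradient`, `numerical_gradient`) and the data are assumptions introduced for this check only and are not part of the tutorial code.

```python
import numpy as np

def mse_loss(X, y, w):
    """Mean squared error (1 / l) * ||y - X w||^2."""
    residual = y - X @ w
    return residual @ residual / len(y)

def mse_gradient(X, y, w):
    """Analytic gradient (2 / l) * X^T (X w - y)."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def numerical_gradient(f, w, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(w)
    for k in range(len(w)):
        step = np.zeros_like(w)
        step[k] = eps
        grad[k] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

rng = np.random.RandomState(0)
X_demo = rng.randn(200, 3)
y_demo = rng.randn(200)
w_demo = rng.randn(3)

g_analytic = mse_gradient(X_demo, y_demo, w_demo)
g_numeric = numerical_gradient(lambda w: mse_loss(X_demo, y_demo, w), w_demo)

# Largest absolute discrepancy between the two gradients
print(np.max(np.abs(g_analytic - g_numeric)))
```

A sign error or a missing factor of 2 in the analytic formula would show up here as a large discrepancy.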

## Intoduction into ManOpt package for Riemannian optimization

#### `ManOpt` and `pymanopt`

The Matlab library `ManOpt` (https://www.manopt.org) and its Python version `pymanopt` (http://pymanopt.github.io) are versatile toolboxes for optimization on manifolds. 

The two libraries are built so that they separate the _manifolds_, the _solvers_ and the _problem descriptions_. For basic use, one only needs to:
 * pick a manifold from the library, 
 * describe the cost function (and possible derivatives) on this manifold, and 
 * pass it on to a solver. 

_NB: The purpose of the following is to get familiar with pymanopt and to serve as a reference point when coding your own optimization problems._

To start working with `pymanopt`, you'll need the following:

 1. Import the necessary backend for automatic differentiation

```python
import autograd.numpy as np
```
Theano and TensorFlow backends are supported, too. 

We will also require importing `pymanopt` itself, along with the necessary submodules:
```python
import pymanopt as opt
import pymanopt.solvers as solvers
import pymanopt.manifolds as manifolds
```

 2. Define (or rather, select) the manifold of interest. `pymanopt` provides a [large number](https://pymanopt.github.io/doc/#manifolds) of predefined manifold classes (though far fewer than the [original ManOpt Matlab library](https://www.manopt.org/tutorial.html#manifolds)). E.g., to instantiate the Stiefel manifold $V_{2}(\mathbb {R}^{5}) = \{X \in \mathbb{R}^{5 \times 2} : X^TX = I_2\}$ of matrices with orthonormal columns, i.e. orthogonal projections from $\mathbb{R}^5$ to $\mathbb{R}^2$, you would write:

```python
manifold = manifolds.Stiefel(5, 2)
```

Available manifolds include [Stiefel](https://pymanopt.github.io/doc/#module-pymanopt.manifolds.stiefel) ([wiki](https://en.wikipedia.org/wiki/Stiefel_manifold)), Rotations or SO(n) ([wiki](https://en.wikipedia.org/wiki/Orthogonal_group)), [Euclidean](https://pymanopt.github.io/doc/#module-pymanopt.manifolds.euclidean), [Positive Definite](https://pymanopt.github.io/doc/#pymanopt.manifolds.psd.PositiveDefinite) ([wiki](https://en.wikipedia.org/wiki/Definiteness_of_a_matrix)), and [Product](https://pymanopt.github.io/doc/#pymanopt.manifolds.product.Product), among many others.

 3. Define the **scalar** cost function (here using `autograd.numpy`) to be minimized by the solver:
```python
def cost(X):  return np.sum(X)
```

Note that the scalar `cost` Python function **will have access to objects defined elsewhere in the code** (which allows accessing $X$ and $y$ for optimization).

 4. Instantiate the `pymanopt` problem
```python
problem = opt.Problem(manifold=manifold, cost=cost, verbosity=2)
```
The keyword `verbosity` controls how much output you get from the system (smaller values mean less output).

 5. Instantiate a `pymanopt` solver, e.g.:
```python
solver = solvers.SteepestDescent()
```
The library has a lot of solvers implemented, including SteepestDescent, TrustRegions, ConjugateGradient, and NelderMead objects.

 6. Perform the optimization in a single blocking function call, obtaining the optimal value of the desired quantity:
```python
Xopt = solver.solve(problem)
```
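
To get a feeling for what a steepest-descent solver does on a manifold, here is a minimal hand-rolled NumPy sketch for the same toy cost `np.sum(X)` on $V_2(\mathbb{R}^5)$. It does not use `pymanopt` and is not its implementation: the fixed step size, the QR-based retraction, and the helper names (`project_to_tangent`, `retract`, `X_st`) are assumptions chosen for illustration.

```python
import numpy as np

def cost(X):
    return np.sum(X)

def euclidean_gradient(X):
    # The derivative of sum(X) w.r.t. every entry of X is 1
    return np.ones_like(X)

def project_to_tangent(X, G):
    """Orthogonal projection of G onto the tangent space of the Stiefel manifold at X."""
    XtG = X.T @ G
    return G - X @ (XtG + XtG.T) / 2

def retract(X, step):
    """QR-based retraction: map X + step back onto the manifold."""
    Q, R = np.linalg.qr(X + step)
    # Fix column signs so that the diagonal of R is positive
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs

rng = np.random.RandomState(0)
X_st, _ = np.linalg.qr(rng.randn(5, 2))  # a random starting point with orthonormal columns

initial_cost = cost(X_st)
gamma = 0.1  # fixed step size (an assumption; real solvers use a line search)
for _ in range(1000):
    riemannian_grad = project_to_tangent(X_st, euclidean_gradient(X_st))
    X_st = retract(X_st, -gamma * riemannian_grad)
final_cost = cost(X_st)

print(initial_cost, final_cost)
print(X_st.T @ X_st)  # should stay close to the 2x2 identity
```

The only differences from ordinary gradient descent are the projection of the gradient onto the tangent space and the retraction that brings each iterate back onto the manifold. For this particular cost, the Cauchy-Schwarz inequality gives the lower bound $-\sqrt{10} \approx -3.162$ over the manifold, which the iterates should approach.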

#### Linear regression using `pymanopt`
_The purpose of this section is to get the first hands-on experience using `pymanopt`. We compare its output with hand-coded gradient descent and the analytic solution._

In [0]:
import pymanopt as opt
import pymanopt.solvers as solvers
import pymanopt.manifolds as manifolds

# Import the differentiable numpy -- this is crucial, 
# as `np` conventionally imported will not provide gradients.
# See more at https://github.com/HIPS/autograd
import autograd.numpy as np

In [0]:
# Generate random data
X = np.random.randn(200, 3)
y = np.random.randint(-5, 5, (200))

**Exercise:** program the linear regression using manifold optimization

**Hint:** create `Euclidean` manifold and the `SteepestDescent` solver. 

**Hint:** write down the formula for the cost. Remember it has the access to `X` and `y` defined above.

In [0]:
import autograd.numpy as np  # import again to avoid errors 

# Cost function is the squared error. Remember, cost is a scalar value!
def cost(w):
     return # <your code here>

# A simplest possible solver (gradient descent)
solver = # <your code here>

# R^3
manifold = # <your code here>

# Solve the problem with pymanopt
problem = opt.Problem(manifold=manifold, cost=cost)
wopt = solver.solve(problem)

print('The following regression weights were found to minimise the '
      'squared error:')
print(wopt)

Compute the linear regression solution via numerical optimization using steepest descent over the Euclidean manifold $\mathbb{R}^3$, _only using our handcrafted gradient descent_.

In [0]:
gd_params = dict(w_init=np.random.rand(X.shape[1]),
                 iterations=20,
                 gamma=0.1)
costs, grads, ws = gradient_descent(X, y, **gd_params)
print(" iter\t\t   cost val\t    grad. norm")
for iteration, (cost_val, grad) in enumerate(zip(costs, grads)):
    gradnorm = np.linalg.norm(grad)
    print("%5d\t%+.16e\t%.8e" % (iteration, cost_val, gradnorm))

print('\nThe following regression weights were found to minimise the '
      'squared error:')
print(ws[-1])  # the weights after the last gradient step

Finally, use the analytic formula.

In [0]:
print('The closed form solution to this regression problem is:')

compute_weights_multivariate(X, y)

Recall that you can always look at what's inside by either reading the [developer docs](https://pymanopt.github.io/doc/) or simply examining the code via typing:
```python
solvers.SteepestDescent??
```

Compare the code there with our hand-crafted gradient descent.

## Learning the shape space of facial landmarks

#### Problem formulation and general reference

In this part, we will create the shape space of facial landmarks. Building such a shape space is of great interest in computer vision, where numerous applications such as face detection, facial pose regression, and emotion recognition depend heavily on such models. Here are the basics of what one needs to know to proceed with this tutorial.

1. [Active Shape Models](https://en.wikipedia.org/wiki/Active_shape_model) are a class of statistical shape models that can iteratively deform to fit an example of the object in an image. They are commonly built by analyzing variations in point distributions and _encode plausible variations, allowing one to discriminate them from unlikely ones_.
2. One great reference for all ASMs is Tim Cootes' paper: _Cootes, T., Baldock, E. R., & Graham, J. (2000)._ [An introduction to active shape models](https://person.hst.aau.dk/lasse/teaching/IACV/doc/asm_overview.pdf). _Image processing and analysis, 223-248._ It includes motivation, math, and algorithms behind the ASM.
3. Nice reference implementations of the Active Shape Model for faces include, e.g., [this Matlab code](https://github.com/johnwmillr/ActiveShapeModels) and [this one, featuring additionally dental image analysis](https://github.com/LennartCockx/Python-Active-shape-model-for-Incisor-Segmentation).
4. Production libraries such as [dlib](http://dlib.net) implement their own ASMs of facial landmarks.

![Example of facial landmarks](https://neerajkumar.org/databases/lfpw/index_files/image002.png) (image taken from [Neeraj Kumar's page on LFPW](https://neerajkumar.org/databases/lfpw/))

We will (1) [look at the data](#Obtain-and-view-the-dataset),
(2) [align shapes](#Procrustes-analysis-for-the-alignment-of-facial-landmarks),
and (3) [compute the shape space](#PCA-for-learning-the-shape-space).

### Obtain and view the dataset
_The goal of this section is to examine the dataset._

In [0]:
from riemannianoptimization.tutorial_helpers import load_data, plot_landmarks
landmarks = load_data(DATA_PATH)

View a random subset of the data. Run the cell below multiple times to view different subsets.

You can set `draw_landmark_id` and `draw_landmarks` to 0 to turn them off.

In [0]:
import matplotlib.pyplot as plt

idx = np.random.choice(len(landmarks), size=6) # sample random faces

fig, axs = plt.subplots(ncols=6, nrows=1, figsize=(18, 3))
for ax, image in zip(axs, landmarks[idx]):
    plot_landmarks(image, ax=ax, draw_landmark_id=1, draw_landmarks=1)

### Procrustes analysis for the alignment of facial landmarks
_The purpose of this section is to learn how to use manifold optimization for shape alignment_.

One thing to note is that the landmarks are annotated in images with different resolutions and are generally **misaligned**. One can easily see this by looking at the landmark scatterplots. Subtracting the mean shape or standardizing the points doesn't help.

In [0]:
fig, (ax1, ax2, ax3) = plt.subplots(figsize=(15, 5), ncols=3)
ax1.scatter(landmarks[:, 0::2], -landmarks[:, 1::2], alpha=.01)

# compute the mean shape 
mean_shape = np.mean(landmarks, axis=0)
landmarks_centered = landmarks - mean_shape
ax2.scatter(landmarks_centered[:, 0::2], -landmarks_centered[:, 1::2], alpha=.01)

# compute additionally the standard deviation in shape
std_shape = np.std(landmarks, axis=0)
landmarks_standardized = landmarks_centered / std_shape
ax3.scatter(landmarks_standardized[:, 0::2], -landmarks_standardized[:, 1::2], alpha=.01);

**Q:** Why such variation? Why don't we see separate clusters of "average keypoints", such as an average eye1, eye2, etc.?

We must _align_ shapes to a _canonical pose_ to proceed with building the ASM.

This will be done in a simple way via [Procrustes analysis](https://en.wikipedia.org/wiki/Procrustes_analysis). In its simplest form, Procrustes analysis aligns each shape so that the sum of squared distances of each shape to the mean, $D = \sum\limits_i ||\mathbf{x}_i - \mathbf{\overline{x}}||^2_2$, is minimised:
1. Translate each example so that its center of gravity is at the origin.
2. Choose one example as an initial estimate of the mean shape and scale.
3. Record the first estimate as $\overline{x}_0$ to define the default orientation.
4. Align all the shapes with the current estimate of the mean shape.
5. Re-estimate the mean from aligned shapes.
6. Apply constraints on scale and orientation to the current estimate of the mean by aligning it with $\overline{x}_0$ and scaling so that $|\overline{x}| = 1$.
7. If not converged, return to 4. (Convergence is declared if the estimate of the mean does not change significantly after an iteration.)

 
![Procrustes](https://upload.wikimedia.org/wikipedia/commons/f/f5/Procrustes_superimposition.png)

In [0]:
# A small helper function we will need 
# to center the shape at the origin and scale it to a unit norm.
def standardize(shape):
    # shape must have the shape [n_landmarks, 2], e.g. [35, 2]
    # work on a centered copy so that the caller's array is left untouched
    shape = shape - np.mean(shape, 0)
    shape_norm = np.linalg.norm(shape)
    return shape / shape_norm

In [0]:
# A large helper function that we will employ to align
# the *entire collection* of shapes -- skip for now.
def align_landmarks(landmarks, mean_shape=None, aligner=None, n_iterations=1):
    """
    Aligns landmarks to an estimated mean shape.
    In this function, `landmarks` are always assumed to be array of shape [n, 35, 2].
    
    aligner: a function getting two arguments (mean_shape and shape), returning
             the transformation from shape to mean_shape
    """

    # Translate each example so that its center of gravity is at the origin.
    landmarks -= np.mean(landmarks, axis=1, keepdims=True)
    
    # Use the mean of the centered shapes as an initial estimate of the mean shape
    # and scale it so that |mean_shape| = sqrt(x_1^2 + y_1^2 + x_2^2 + ...) = 1.
    mean_shape = np.mean(landmarks, axis=0)
    mean_shape = standardize(mean_shape)

    # Record the first estimate as mean_shape_0 to define the default orientation
    # (an explicit copy, since slicing a numpy array only creates a view).
    mean_shape_0 = mean_shape.copy()
    
    def align_to_mean(landmarks, mean_shape, aligner=None):        
        aligned_landmarks = []
        for shape in landmarks:
            shape = standardize(shape)
            shape = aligner(mean_shape, shape)
            aligned_landmarks.append(shape)
        return np.array(aligned_landmarks)

    print(" iter\t     cost val.\t    mean diff.")
    for iteration in range(n_iterations):
        # Align all the shapes with the current estimate of the mean shape.
        aligned_landmarks = align_to_mean(landmarks, mean_shape, aligner=aligner)

        mean_shape_prev = mean_shape
        # Re-estimate the mean from aligned shapes.
        mean_shape = np.mean(aligned_landmarks, axis=0)
    
        # Apply constraints on scale and orientation to the current 
        # estimate of the mean by aligning it with the initial estimate
        # (mean_shape_0) and scaling it to unit norm.
        mean_shape = aligner(mean_shape_0, mean_shape)
        mean_shape /= np.linalg.norm(mean_shape)
        
        cost = np.sum(
            np.linalg.norm(aligned_landmarks - mean_shape, axis=(1, 2))
        )
        mean_shape_diff = np.linalg.norm(mean_shape - mean_shape_prev)
        print("%5d\t%+.8e\t%.8e" % (iteration, cost, mean_shape_diff))

    # If not converged, return to 4. 
    # (Convergence is declared if the estimate of the mean does not change significantly after an iteration)
    return np.array(aligned_landmarks), mean_shape

In [0]:
landmarks = landmarks.reshape(-1, 35, 2)

One may naturally resort to [scipy.spatial.procrustes](https://docs.scipy.org/doc/scipy-1.2.1/reference/generated/scipy.spatial.procrustes.html), which computes an optimal alignment using a scaling factor $s$ and an orthogonal matrix $\mathbf{R}$, solving the [orthogonal Procrustes problem](https://en.wikipedia.org/wiki/Orthogonal_Procrustes_problem).
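
For intuition, the orthogonal Procrustes problem has a well-known closed-form solution via the SVD. The NumPy sketch below illustrates it on synthetic data and is not the `scipy` implementation: it assumes shapes stored as `[n_landmarks, 2]` arrays with the transform applied from the right (`source @ R`), ignores translation and scaling, and the function name `orthogonal_procrustes_svd` and the `*_demo` arrays are hypothetical.

```python
import numpy as np

def orthogonal_procrustes_svd(target_shape, source_shape):
    """Return the orthogonal R minimising ||source_shape @ R - target_shape||_F."""
    # Maximising trace(R^T M) with M = source^T target gives R = U V^T
    U, _, Vt = np.linalg.svd(source_shape.T @ target_shape)
    return U @ Vt

rng = np.random.RandomState(0)
source_demo = rng.randn(35, 2)

# Build a target by rotating the source with a known angle
angle = 0.7
R_true = np.array([[np.cos(angle), -np.sin(angle)],
                   [np.sin(angle),  np.cos(angle)]])
target_demo = source_demo @ R_true

R_est = orthogonal_procrustes_svd(target_demo, source_demo)
aligned_demo = source_demo @ R_est

# Residual after alignment (close to zero for this noise-free example)
print(np.linalg.norm(aligned_demo - target_demo))
```

Note that the minimiser is only constrained to be orthogonal, so for noisy data it may turn out to be a reflection rather than a rotation.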

**Exercise:** Using `scipy.spatial.procrustes`, write a default aligner function for our `align_landmarks`. This function must accept two shapes and return the second one aligned to the first one.

In [0]:
from scipy.spatial import procrustes

def default_procrustes(target_shape, source_shape):
    """Align the source shape to the target shape.
    For standardized shapes, can skip translating/scaling 
    aligned source by target's parameters.
    
    target_shape, source_shape: ndarrays of shape [35, 2]
    
    return ndarray of shape [35, 2]
    """
    # <your code here>
    

In [0]:
# Try aligning a single shape 
mean_shape = np.mean(landmarks, axis=0)
mean_shape = standardize(mean_shape)

shape_std = standardize(landmarks[400])

aligned_shape = default_procrustes(mean_shape, shape_std)

In [0]:
fig, (ax1, ax2, ax3) = plt.subplots(figsize=(15, 5), ncols=3)
plot_landmarks(mean_shape, ax=ax1)
ax1.set_title('Mean shape')

# the mean shape (grey) vs. the shape before alignment
plot_landmarks(mean_shape, ax=ax2, color_landmarks='grey', color_contour='grey', alpha=0.5)
plot_landmarks(shape_std, ax=ax2)
ax2.set_title('Another shape, distance = {0:.3f}'.format(np.linalg.norm(mean_shape - shape_std)))

# the mean shape (grey) vs. the shape after alignment
plot_landmarks(mean_shape, ax=ax3, color_landmarks='grey', color_contour='grey', alpha=0.5)
plot_landmarks(aligned_shape, ax=ax3)
ax3.set_title('Aligned shapes, distance = {0:.3f}'.format(np.linalg.norm(mean_shape - aligned_shape)));

In [0]:
# Align the entire dataset to a mean shape
aligned_landmarks, mean_shape = align_landmarks(landmarks, aligner=default_procrustes, n_iterations=3)

In [0]:
fig, (ax1, ax2) = plt.subplots(figsize=(10, 5), ncols=2)
ax1.scatter(aligned_landmarks[:, :, 0], -aligned_landmarks[:, :, 1], alpha=.01)
ax1.set_title('Aligned landmarks cloud')

# plot the estimated mean shape 
plot_landmarks(mean_shape, ax=ax2)
ax2.set_title('Mean landmarks');

#### But let's do the same using Riemannian optimization!


**Q:** Why do we need to optimize anything by hand if Procrustes analysis is already implemented in scipy?

In [0]:
import pymanopt as opt
import pymanopt.manifolds as manifolds
import pymanopt.solvers as solvers

Recall that the orthogonal Procrustes problem seeks:
$$
R=\arg \min _{\Omega }\|\Omega A-B\|_{F}\quad \mathrm {subject\ to} \quad \Omega ^{T}\Omega =I,
$$
i.e. $R$ belongs to the Stiefel manifold. One can optimize over this manifold directly; however, it might be more reasonable to optimize over rotations combined with scaling.

Here, $A$ and $B$ are our shapes, and $\Omega$ is the transform we seek.
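
To make the manifold view concrete, the following NumPy sketch runs Riemannian gradient descent over $\text{SO}(2)$ by hand, without `pymanopt`. The assumptions are: shapes are stored as `[n_landmarks, 2]` arrays, so the cost is written as $\|A R - B\|^2_F$ with the transform applied from the right; the step size is fixed; and the helper `rotation` and the `*_demo` arrays are hypothetical names used only here.

```python
import numpy as np

def rotation(angle):
    """2x2 rotation matrix for the given angle (in radians)."""
    return np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])

rng = np.random.RandomState(1)
A_demo = rng.randn(35, 2)
A_demo /= np.linalg.norm(A_demo)   # unit Frobenius norm, like a standardized shape
R_true = rotation(0.7)
B_demo = A_demo @ R_true           # target: the source rotated by a known angle

R_est = np.eye(2)                  # start from the identity rotation
gamma = 0.5                        # fixed step size (an assumption)
for _ in range(100):
    # Euclidean gradient of ||A R - B||_F^2 with respect to R
    G = 2 * A_demo.T @ (A_demo @ R_est - B_demo)
    # Riemannian gradient on SO(2) is R @ skew, with skew the
    # antisymmetric part of R^T G
    M = R_est.T @ G
    skew = (M - M.T) / 2
    # Exponential map: exp of the 2x2 skew matrix [[0, -a], [a, 0]] is rotation(a)
    R_est = R_est @ rotation(-gamma * skew[1, 0])

print(R_est)
print(R_true)
```

Because every update multiplies the current iterate by a rotation, the iterates never leave $\text{SO}(2)$, so no reflection can appear in the result.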

**Exercise:** program the Procrustes alignment in the following variants:
 * $R \in \text{Stiefel}(2, 2)$, i.e. we seek a projection matrix using the `Stiefel` object 
 * $R \in \text{SO}(2)$, i.e. we seek a rotation matrix using the `Rotations` object 
 * $R \in \text{SO}(2)$ and $s \in \mathbb{R}^2$, i.e. we seek a rotation + scaling transform using a `Product` of `Rotations` and `Euclidean` manifolds (see the example [here](https://github.com/pymanopt/pymanopt/blob/master/examples/regression_offset_autograd.py))

In [0]:
import autograd.numpy as np  # import here to avoid errors

def riemannian_procrustes_projection(mean_shape, shape):
    """Align the source shape to the target shape using projection.

    mean_shape (target), shape (source): ndarrays of shape [35, 2]
    return ndarray of shape [35, 2]
    """
    def cost(R):
        return # <your code here>
    solver = solvers.SteepestDescent()
    manifold = # <your code here>
    problem = opt.Problem(manifold=manifold, cost=cost, verbosity=0)
    R_opt = solver.solve(problem)
    return # <your code here>


def riemannian_procrustes_rotation(mean_shape, shape):
    """Align the source shape to the target shape using rotation.

    mean_shape (target), shape (source): ndarrays of shape [35, 2]
    return ndarray of shape [35, 2]
    """
    def cost(R):
        return # <your code here>
    solver = solvers.SteepestDescent()
    manifold = # <your code here>
    problem = opt.Problem(manifold=manifold, cost=cost, verbosity=0)
    R_opt = solver.solve(problem)
    return # <your code here>
    
    
def riemannian_procrustes_rotation_scaling(mean_shape, shape):
    """Align the source shape to the target shape using a combination of rotation and scaling.

    mean_shape (target), shape (source): ndarrays of shape [35, 2]
    return ndarray of shape [35, 2]
    """
    def cost(Rs):
        R, s = Rs
        return # <your code here>
    solver = solvers.SteepestDescent()
    manifold = # <your code here>
    problem = opt.Problem(manifold=manifold, cost=cost, verbosity=0)
    Rs_opt = solver.solve(problem)
    R_opt, s_opt = Rs_opt
    return # <your code here>

In [0]:
# Stiefel
aligned_landmarks, mean_shape = align_landmarks(landmarks, aligner=riemannian_procrustes_projection, n_iterations=3)

In [0]:
fig, (ax1, ax2) = plt.subplots(figsize=(10, 5), ncols=2)
ax1.scatter(aligned_landmarks[:, :, 0], -aligned_landmarks[:, :, 1], alpha=.01)
ax1.set_title('Aligned landmarks cloud')

# plot the estimated mean shape 
plot_landmarks(mean_shape, ax=ax2)
ax2.set_title('Mean landmarks');

In [0]:
# Rotations
aligned_landmarks, mean_shape = align_landmarks(landmarks, aligner=riemannian_procrustes_rotation, n_iterations=3)

In [0]:
fig, (ax1, ax2) = plt.subplots(figsize=(10, 5), ncols=2)
ax1.scatter(aligned_landmarks[:, :, 0], -aligned_landmarks[:, :, 1], alpha=.01)
ax1.set_title('Aligned landmarks cloud')

# plot the estimated mean shape 
plot_landmarks(mean_shape, ax=ax2)
ax2.set_title('Mean landmarks');

In [0]:
# Rotations + scale
aligned_landmarks, mean_shape = align_landmarks(landmarks, aligner=riemannian_procrustes_rotation_scaling, n_iterations=3)

In [0]:
fig, (ax1, ax2) = plt.subplots(figsize=(10, 5), ncols=2)
ax1.scatter(aligned_landmarks[:, :, 0], -aligned_landmarks[:, :, 1], alpha=.01)
ax1.set_title('Aligned landmarks cloud')

# plot the estimated mean shape 
plot_landmarks(mean_shape, ax=ax2)
ax2.set_title('Mean landmarks');

### PCA for learning the shape space
_The goal of this section is to learn how to program the simple but powerful PCA linear dimensionality reduction technique using Riemannian optimization._

The typical way of learning the shape space is to find a low-dimensional manifold controlling most of the variability in shapes in a (hopefully) interpretable way. Such a manifold is commonly found using the [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis) method.

We will apply PCA to a matrix $\mathbf{X} \in \mathbb{R}^{n \times 70}$ of aligned shapes.

A common way of learning PCA is using SVD implemented in the [`sklearn.decomposition.PCA` class](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).

In [0]:
aligned_landmarks = aligned_landmarks.reshape(-1, 70)

In [0]:
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
pca.fit(aligned_landmarks)

In [0]:
d0 = pca.inverse_transform(
    pca.transform(aligned_landmarks)
)

In [0]:
data_scaled_vis = d0.reshape((-1, 35, 2))

plt.scatter(data_scaled_vis[:200, :, 0], -data_scaled_vis[:200, :, 1], alpha=.1)

#### Do the same using Riemannian optimization

Recall that PCA finds a low-dimensional linear subspace by searching for a corresponding orthogonal projection. Thus, PCA searches for an orthogonal projection $M$ such that:
$$
M = \arg \min _{\Omega }
    \|X - \Omega \Omega^{\intercal} X\|^2_{F}
    \quad 
    \mathrm {subject\ to} \quad \Omega ^{T}\Omega = I,
$$
i.e. $\Omega$ belongs to the Stiefel manifold $\mathcal{O}^{d \times r}$. 

The value $\|X - M M^{\intercal} X\|^2_{F}$ is the reconstruction error from projecting $X$ to an $r$-dimensional subspace and mapping it back to the original $d$-dimensional space. 
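
As a reference point for the reconstruction error, the NumPy sketch below evaluates it for two points on the Stiefel manifold: the top-$r$ singular vectors of a synthetic data matrix and a random orthonormal basis. It assumes data stored row-wise as an `[n, d]` array (so the projection is written `data @ W @ W.T`) and no centering; the function name `reconstruction_error` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def reconstruction_error(W, data):
    """Squared Frobenius error of projecting rows of `data` onto span(W) and back."""
    reconstructed = data @ W @ W.T
    return np.sum((data - reconstructed) ** 2)

rng = np.random.RandomState(0)
# Synthetic data with very different variances along the coordinate axes
data_demo = rng.randn(100, 6) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])
r = 2

# Top-r right singular vectors of the [n, d] data matrix
_, s, Vt = np.linalg.svd(data_demo, full_matrices=False)
W_svd = Vt[:r].T                      # [d, r], orthonormal columns

# A random point on the Stiefel manifold for comparison
W_random, _ = np.linalg.qr(rng.randn(6, r))

err_svd = reconstruction_error(W_svd, data_demo)
err_random = reconstruction_error(W_random, data_demo)

print(err_svd, err_random)
print(np.sum(s[r:] ** 2))             # sum of the discarded squared singular values
```

The error of the SVD solution equals the sum of the discarded squared singular values, and no other orthonormal basis of the same size can do better. A Riemannian solver minimising the same cost should therefore recover the same subspace, though not necessarily the same basis vectors.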

**Exercise:** program the PCA by finding an orthogonal projection from the 70-dimensional space onto a 2-dimensional subspace, using `pymanopt`.

**Hint:** use `Stiefel(70, 2)` manifold and the reconstruction error cost as described above.

In [0]:
# Cost function is the reconstruction error
def cost(w):
    return # <your code here>

solver = solvers.TrustRegions()
manifold = # <your code here>
problem = opt.Problem(manifold=manifold, cost=cost)
wopt = solver.solve(problem)

print('The following projection matrix was found to minimise '
      'the squared reconstruction error: ')
print(wopt)

Now construct a low-dimensional approximation of $X$, by projecting to $r$-dimensional parameter space and back.

In [0]:
aligned_landmarks_r = np.dot(wopt, np.dot(wopt.T, aligned_landmarks.T)).T
aligned_landmarks_r = aligned_landmarks_r.reshape((-1, 35, 2))

In [0]:
plt.scatter(aligned_landmarks_r[:200, :, 0], -aligned_landmarks_r[:200, :, 1], alpha=.1)

#### Exploring the lower-dimensional linear manifold parameterizing landmarks
_The purpose of this part is to understand how the coordinate values in the lower-dimensional space influence the landmark shape_. 

Coordinates along principal components _parameterize_ the shape, i.e. a smooth walk along these directions should result in interpolation between shapes.

**Exercise:** explore the lower-dimensional linear manifold parameterizing landmarks:
 * Show samples _from the data_ with different coordinates along PC\#1 (hint: use `reconstructions_sorted_along_pc` below)
 * Show _synthetic_ samples obtained by moving in the data manifold along PC\#1 (hint: modify `reconstructions_sorted_along_pc` below into `vary_on_manifold`)

In [0]:
def reconstructions_sorted_along_pc(landmarks, w, pc=1, n_shapes=6):
    # project to r-dimensional manifold
    projected_landmarks = np.dot(w.T, landmarks.T).T
    
    # sort along dimension selected by pc
    pc_idx = np.argsort(projected_landmarks[:, pc])
    
    # reconstruct several shapes with varying degree
    # of expressiveness in parameter pc
    idx = np.linspace(0, len(landmarks), n_shapes).astype(int)
    idx[-1] = idx[-1] - 1
    shapes_to_reconstruct = projected_landmarks[pc_idx[idx]].T
    reconstructions = np.dot(w, shapes_to_reconstruct).T
    reconstructions = reconstructions.reshape((-1, 35, 2))
    
    return reconstructions


def plot_variability_along_pc(landmarks, w, pc=1, n_shapes=6):
    reconstructions = reconstructions_sorted_along_pc(landmarks, w, pc=pc, n_shapes=n_shapes)
    
    fig, axs = plt.subplots(ncols=n_shapes, nrows=1, figsize=(3 * n_shapes, 3))
    for ax, image in zip(axs, reconstructions):
        plot_landmarks(image, ax=ax)


In [0]:
plot_variability_along_pc? # <your code here> 

**Q:** Would this variability necessarily be exactly like that of PCA?

In [0]:
# PC2
def vary_on_manifold(landmarks, id, w, pc=1, n_shapes=6):
    projected_landmarks = np.dot(w.T, landmarks.T).T
    min_pc_value = # <your code here>
    max_pc_value = # <your code here>
    pc_values = # <your code here>
    
    the_one_projection = projected_landmarks[id][None]
    shapes_to_reconstruct = np.tile(the_one_projection, (n_shapes, 1))
    shapes_to_reconstruct[:, pc] = pc_values
    
    reconstructions = np.dot(w, shapes_to_reconstruct.T).T
    reconstructions = reconstructions.reshape((-1, 35, 2))
    
    fig, axs = plt.subplots(ncols=n_shapes, nrows=1, figsize=(3 * n_shapes, 3))
    for ax, image in zip(axs, reconstructions):
        plot_landmarks(image, ax=ax)

        
vary_on_manifold(aligned_landmarks, 0, wopt, pc=1, n_shapes=30)

### Analysing the shape space of facial landmarks via MDS

#### Compute embedding of the shape space into 2D, preserving distances between shapes

Classic multidimensional scaling (MDS) aims to find an orthogonal mapping $M$ such that:
$$
M = \arg \min _{\Omega } 
    \sum_i \sum_j (d_X (\mathbf{x}_i, \mathbf{x}_j) - 
        d_Y (\Omega^{\intercal}\mathbf{x}_i, \Omega^{\intercal}\mathbf{x}_j))^2
    \quad 
    \mathrm {subject\ to} \quad \Omega ^{T}\Omega = I,
$$
i.e. $\Omega$ belongs to the Stiefel manifold $\mathcal{O}^{d \times r}$ where $d$ is the dimensionality of the original space, and $r$ is the dimensionality of the compressed space.

In other words, consider distances $d_X (\mathbf{x}_i, \mathbf{x}_j)$ between each pair $(i, j)$ of objects in the original space $X$. MDS aims at projecting $\mathbf{x}_i$'s to a linear subspace $Y$ such that each distance $d_Y (M^{\intercal}\mathbf{x}_i, M^{\intercal}\mathbf{x}_j)$ approximates $d_X (\mathbf{x}_i, \mathbf{x}_j)$ as closely as possible.
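
The objective above can be evaluated directly in NumPy for any candidate projection. The sketch below assumes points stored row-wise as an `[n, d]` array (so projecting reads `points @ W`) and uses plain Euclidean distances; the helper names `pairwise_distances` and `mds_stress` and the random data are assumptions for illustration.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distances between all rows of an [n, d] array."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

def mds_stress(W, points):
    """Sum of squared differences between original and projected distances."""
    d_original = pairwise_distances(points)
    d_projected = pairwise_distances(points @ W)
    return np.sum((d_original - d_projected) ** 2)

rng = np.random.RandomState(0)
points_demo = rng.randn(30, 5)

W_full = np.eye(5)                           # no compression: distances are preserved
W_low, _ = np.linalg.qr(rng.randn(5, 2))     # a random orthogonal projection to 2D

stress_full = mds_stress(W_full, points_demo)
stress_low = mds_stress(W_low, points_demo)

print(stress_full, stress_low)
```

An orthogonal projection can only shrink distances, so the stress measures how much of each pairwise distance is lost in the compressed space.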

In [0]:
aligned_landmarks = aligned_landmarks.reshape((-1, 70))

In [0]:
# a slightly tricky way of computing pairwise squared distances for [n, d] matrices of objects, 
# see https://stackoverflow.com/questions/28687321/computing-euclidean-distance-for-numpy-in-python

def calculate_pairwise_distances(points):
    # broadcasting [n, d, 1] against [1, d, n] and summing over d
    # gives an [n, n] matrix of *squared* Euclidean distances
    return ((points[..., None] - points[..., None].T) ** 2).sum(1)

In [0]:
euclidean_distances = calculate_pairwise_distances(aligned_landmarks)

**Exercise:** program MDS dimensionality reduction method using `pymanopt`. Project from 70-dimensional to 2-dimensional space.

**Hint:** to compute distances, use `calculate_pairwise_distances` above.

**Hint:** use `Stiefel(70, 2)` manifold

In [0]:
import autograd.numpy as np

def cost(w):
    # <your code here>
    

solver = solvers.TrustRegions()
manifold = # <your code here>
problem = opt.Problem(manifold=manifold, cost=cost)
wopt = solver.solve(problem)

print('The following projection matrix was found to minimise '
      'the distortion of pairwise distances: ')
print(wopt)

In [0]:
projected_shapes = np.dot(wopt.T, aligned_landmarks.T).T

In [0]:
from riemannianoptimization.tutorial_helpers import prepare_html_for_visualization

In [0]:
from IPython.display import HTML

HTML(prepare_html_for_visualization(projected_shapes, aligned_landmarks, scatterplot_size=[700, 700],
                                    annotation_size=[100, 100], floating_annotation=True))

## Learning the Gaussian mixture models for word embeddings

This part of the tutorial is in a separate notebook, `riemannian_opt_gmm_embeddings.ipynb`.

## Bibliography


This tutorial is in part inspired by the work _Cunningham, J. P., & Ghahramani, Z. (2015). [Linear dimensionality reduction: Survey, insights, and generalizations.](http://www.jmlr.org/papers/volume16/cunningham15a/cunningham15a.pdf) The Journal of Machine Learning Research, 16(1), 2859-2900._ Reading this work in full will help you greatly broaden your understanding of linear dimensionality reduction techniques, systematize your knowledge of optimization setups involved therein, and get an overview of this area.

_Townsend, J., Koep, N., & Weichwald, S. (2016). [Pymanopt: A python toolbox for optimization on manifolds using automatic differentiation](http://jmlr.org/papers/volume17/16-177/16-177.pdf). The Journal of Machine Learning Research, 17(1), 4755-4759._

_Boumal, N., Mishra, B., Absil, P. A., & Sepulchre, R. (2014). [Manopt, a Matlab toolbox for optimization on manifolds](http://www.jmlr.org/papers/volume15/boumal14a/boumal14a.pdf). The Journal of Machine Learning Research, 15(1), 1455-1459._

This tutorial uses data and annotations from the two works
 _Belhumeur, P. N., Jacobs, D. W., Kriegman, D. J., & Kumar, N. (2013). [Localizing parts of faces using a consensus of exemplars](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.227.8441&rep=rep1&type=pdf). IEEE transactions on pattern analysis and machine intelligence, 35(12), 2930-2940._
 and 
_Huang, G. B., Mattar, M., Berg, T., & Learned-Miller, E. (2008, October)._ [Labeled faces in the wild: A database for studying face recognition in unconstrained environments](https://hal.inria.fr/docs/00/32/19/23/PDF/Huang_long_eccv2008-lfw.pdf).
