# Geometrical Methods in Machine Learning

## Homework 2

### Task 1: Intrinsic dimension estimation

#### Correlation dimension

Given a set $X_n = \{x_1, \dots, x_n\}$ in a metric space, the _correlation integral_, on which the correlation dimension is based, is defined as

$$C_n(r) = \frac{2}{n(n -1)} \sum_{i=1}^n \sum_{j=i+1}^n \mathbf{1}\{\|x_i - x_j\| < r\}.$$

The correlation dimension is then estimated by plotting $\log C_n(r)$ against $\log r$ and estimating the slope of the linear part of the graph.
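The slope estimate can be sketched as follows. This is a minimal illustration, not a reference solution: the function names, the toy dataset, and the hand-picked range of radii are all assumptions, and in practice the linear part of the log-log plot has to be chosen by inspection.

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_integral(X, radii):
    """C_n(r) for each r: fraction of point pairs (i < j) closer than r."""
    dists = np.sort(pdist(X))  # all n(n-1)/2 pairwise Euclidean distances
    # side="left" counts the distances that are strictly less than r
    counts = np.searchsorted(dists, radii, side="left")
    return counts / dists.size

def correlation_dimension(X, radii):
    """Least-squares slope of log C_n(r) versus log r."""
    C = correlation_integral(X, radii)
    mask = C > 0  # the logarithm is undefined where no pair lies within r
    slope, _ = np.polyfit(np.log(radii[mask]), np.log(C[mask]), deg=1)
    return slope

# Toy data (an assumption): a 2-dimensional unit square embedded in R^5
rng = np.random.default_rng(0)
X = np.zeros((1000, 5))
X[:, :2] = rng.uniform(size=(1000, 2))

radii = np.logspace(-1.5, -0.7, 10)  # hand-picked range of radii (an assumption)
d_hat = correlation_dimension(X, radii)
print(f"estimated correlation dimension: {d_hat:.2f}")
```

The estimate is expected to fall somewhat below 2 here, since boundary effects in the square reduce the pair counts at larger radii.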

You are asked to implement one of the following intrinsic dimension estimation methods:

- correlation dimension, or
- projection angle (Lecture 5 slides, pp. 30-33)

Evaluate the method you implemented and compare its estimates with those of global PCA, local PCA, and maximum likelihood estimation on the airfoils, digits, and faces datasets from [Seminar 3](https://githib.com/oleg-kachan/GMML2020/seminar3). Summarize the results in a table and draw conclusions.

Feel free to reuse the code from Seminars 1 and 3.

In [None]:
# your code here

### Task 2: Manifold learning methods

Obtain the `Extended Yale B` face dataset ([download](http://vision.ucsd.edu/extyaleb/CroppedYaleBZip/CroppedYale.zip)), which comprises 100x100-pixel images of 38 persons, each under 64 illumination conditions. Resize the images to 32x32 pixels. You can do this with `Pillow` ([link](https://pillow.readthedocs.io/), tested) or any other image processing library of your choice.
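A minimal sketch of the resizing step with `Pillow` is shown below. The helper name `load_face` and the scaling of pixel values to $[0, 1]$ are assumptions made for illustration; the demo uses a synthetic in-memory image instead of a dataset file so that it runs without the download.

```python
import io

import numpy as np
from PIL import Image

def load_face(source, size=(32, 32)):
    """Read one image (path or file object), convert it to grayscale,
    resize it, and flatten it to a vector with values in [0, 1]."""
    with Image.open(source) as img:
        img = img.convert("L").resize(size)  # "L" is 8-bit grayscale
        return np.asarray(img, dtype=np.float64).ravel() / 255.0

# Demo on a synthetic 100x100 image held in memory (stands in for a dataset file)
pixels = np.random.default_rng(0).integers(0, 256, size=(100, 100), dtype=np.uint8)
buf = io.BytesIO()
Image.fromarray(pixels).save(buf, format="PNG")
buf.seek(0)

x = load_face(buf)
print(x.shape)  # one flattened 32x32 image
```

Stacking the vectors returned for all files, for example with `np.stack`, gives a data matrix with one image per row, which is the layout scikit-learn estimators expect.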

Estimate the intrinsic dimensionality with a method of your choice, then perform dimensionality reduction to the intrinsic dimension $\hat{d}$ and to dimensions 2 and/or 3 for visualization purposes, using manifold learning methods of your choice.

Compute the NPR (neighborhood preservation ratio, see [Seminar 4](https://githib.com/oleg-kachan/GMML2020/seminar4)) of the algorithms you have used for 2-3 different values of $d$ (2 and/or 3, and $\hat{d}$) and a fixed number of nearest neighbors $k$.
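For orientation, one common way to compute such a ratio is sketched below. The definition used here, the average fraction of each point's $k$ nearest neighbors in the original space that are also among its $k$ nearest neighbors in the embedding, is an assumption; check it against the definition given in Seminar 4 before relying on it. The function names and the toy data are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_indices(X, k):
    """Indices of the k nearest neighbors of every row of X (the point itself excluded)."""
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    # Calling kneighbors() without data queries the training points and skips self-matches
    return nn.kneighbors(return_distance=False)

def npr(X, Z, k=10):
    """Assumed definition: mean fraction of k-NN in X that are also k-NN in Z."""
    idx_X, idx_Z = knn_indices(X, k), knn_indices(Z, k)
    shared = [len(np.intersect1d(a, b)) for a, b in zip(idx_X, idx_Z)]
    return np.mean(shared) / k

# Toy check (an assumption): "embed" 5-dimensional points by keeping 2 coordinates
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Z = X[:, :2]
score = npr(X, Z, k=10)
print(f"NPR of the coordinate projection: {score:.2f}")
```

Under this definition the value lies in $[0, 1]$, and an embedding that keeps every neighborhood intact scores exactly 1.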

Explore the 2- and/or 3-dimensional embedding space for clusters and meaningful interpretations, and comment on the possible meaning of the new coordinates.

Alternatively, you can perform this task on sklearn's `Olivetti faces` dataset.

In [None]:
# your code here

### Extra (+1 point)

Implement the Laplacian eigenmaps algorithm. Test it on the Swiss roll dataset.

### Laplacian eigenmaps

#### Algorithm

Given a dataset $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_n \} \in \mathbb{R}^{D \times n}$ of $n$ points in $D$-dimensional space,

1. Estimate the neighborhood $\mathcal{N}_i$ of each point, using either an $\varepsilon$-ball or the $k$-nearest neighbors graph (scikit-learn's `NearestNeighbors().fit(X).kneighbors_graph(mode="distance")` returns a sparse matrix whose nonzero entries are the distances $\|\mathbf{x}_i - \mathbf{x}_j\|$ from each point to its $k$ nearest neighbors).

2. Build the adjacency graph with adjacency matrix $\mathbf{A}$ s.t.

$$a_{ij} = \begin{cases} \exp(-\lambda \| \mathbf{x}_i - \mathbf{x}_j \|^2), & \textrm{if}~\mathbf{x}_j \in \mathcal{N}_i, \\ 0, & \textrm{otherwise}, \end{cases}$$

where $\lambda > 0$ is a scalar parameter that controls the width of the kernel (a larger $\lambda$ gives a narrower kernel).

3. Compute the graph Laplacian matrix $\mathbf{L} = \mathbf{D} - \mathbf{A} \in \mathbb{R}^{n \times n}$, where $\mathbf{D}$ is the diagonal degree matrix with

$$d_{ii} = \sum_{j=1}^n a_{ij},$$

and solve the generalized eigenvalue problem (use SciPy's `scipy.linalg.eigh`, passing the matrix $\mathbf{L}$ as the parameter `a` and the matrix $\mathbf{D}$ as the parameter `b`):

$$\mathbf{L}\mathbf{v} = \mu \mathbf{D}\mathbf{v}, \quad \mathbf{v} \in \mathbb{R}^n,$$

where the eigenvalue is written $\mu$ to keep it distinct from the kernel parameter $\lambda$.

The first eigenvector $\mathbf{v}_1$, corresponding to the _smallest_ eigenvalue (which equals zero), is, for a connected graph, a constant vector proportional to $\mathbf{1}$; it carries no information and is discarded. Take the $d$ eigenvectors $\mathbf{v}_2, \dots, \mathbf{v}_{d+1}$ corresponding to the next $d$ smallest eigenvalues, stack them as the columns of a matrix in $\mathbb{R}^{n \times d}$, and compute the $d$-dimensional embedding $\mathbf{Z}$:

$$\{\mathbf{z}_1, \dots, \mathbf{z}_n \} := \{ \mathbf{v}_2, \dots, \mathbf{v}_{d+1} \}^T \in \mathbb{R}^{d \times n}$$
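The three steps above can be sketched as follows. This is one possible reading rather than a reference solution: the function name, the default parameter values, and the symmetrization of the $k$-NN graph are assumptions, and the code stores points as rows (shape $n \times D$, as scikit-learn expects) rather than as columns.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import NearestNeighbors

def laplacian_eigenmaps(X, d=2, k=10, lam=1.0):
    """X has shape (n, D), one point per row; returns Z of shape (n, d)."""
    # Step 1: sparse k-NN graph whose nonzero entries are neighbor distances
    G = NearestNeighbors(n_neighbors=k).fit(X).kneighbors_graph(mode="distance")
    # Step 2: heat-kernel weights on the stored edges, zero elsewhere
    G.data = np.exp(-lam * G.data ** 2)
    A = G.toarray()
    # Assumption: symmetrize, since the k-NN relation itself is not symmetric
    A = np.maximum(A, A.T)
    # Step 3: degree matrix, graph Laplacian, generalized eigenproblem with a=L, b=D
    D = np.diag(A.sum(axis=1))
    L = D - A
    _, V = eigh(L, D)  # eigenvalues are returned in ascending order
    # Skip the first (constant) eigenvector and keep the next d as coordinates
    return V[:, 1:d + 1]

X, t = make_swiss_roll(n_samples=500, random_state=0)
Z = laplacian_eigenmaps(X, d=2, k=10, lam=0.1)
print(Z.shape)
```

Computing the full dense eigendecomposition is only practical for small $n$; for larger datasets a sparse solver would be the more natural choice. If the neighborhood graph is disconnected, several eigenvalues equal zero and skipping only the first eigenvector is no longer sufficient.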

In [None]:
# your code here

#### Grading:
8/10 points are awarded for completing all the tasks and giving proper answers to questions.  
2/10 points are awarded for the quality of presentation; be sure to explain and comment your solutions.  
+1 extra point may be awarded for extra work; be creative.