This document contains feedback for dimensionality reduction (Q9, 10), smoothness (11) and graph neural networks (19).

# Q9: Laplacian eigenmaps (4 points)

## What does the algorithm preserve? (1pt)

Almost all teams explained that Laplacian eigenmaps aim at preserving locality. However, it is important to also explain the motivation for this: Laplacian eigenmaps are based on the assumption that the data lies on a possibly non-linear manifold. By preserving only the local geometry of the data, it is possible to capture the intrinsic geometry of this manifold.

## Code (1pt):

In general, Laplacian eigenmaps should skip the eigenvectors corresponding to eigenvalue zero. The best way to do this is to use the function `compute_number_connected_components` implemented earlier: if there are several connected components, there may be more than one eigenvector to skip.
However, since it was stated in the assignment that graphs should be connected, this error was penalized very lightly.
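
The skipping logic described above can be sketched as follows. This is a minimal illustration, assuming a dense symmetric adjacency matrix and the combinatorial Laplacian $L = D - W$. The helper names `count_connected_components` and `laplacian_eigenmaps` are hypothetical and are not the assignment's `compute_number_connected_components`.

```python
import numpy as np


def count_connected_components(eigenvalues, tol=1e-8):
    """Multiplicity of eigenvalue 0 of the Laplacian, i.e. number of components."""
    return int(np.sum(np.abs(eigenvalues) < tol))


def laplacian_eigenmaps(adjacency, dim=2, tol=1e-8):
    """Embed nodes with the first `dim` eigenvectors of non-zero eigenvalue."""
    degrees = adjacency.sum(axis=1)
    laplacian = np.diag(degrees) - adjacency  # combinatorial Laplacian L = D - W
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)  # ascending eigenvalues
    n_skip = count_connected_components(eigenvalues, tol)  # one per component
    return eigenvectors[:, n_skip:n_skip + dim]


# Toy graph (illustrative only): two disjoint triangles, hence two components.
triangle = np.ones((3, 3)) - np.eye(3)
adjacency = np.block([[triangle, np.zeros((3, 3))],
                      [np.zeros((3, 3)), triangle]])
embedding = laplacian_eigenmaps(adjacency, dim=2)
```

On this toy graph two eigenvectors are skipped instead of one. Always skipping exactly one eigenvector would keep an eigenvector of eigenvalue zero in the embedding.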

## Plots (2pt):

It is important to color each data point by its label. Otherwise, results are much harder to interpret.
Laplacian eigenmaps don't work great on MNIST (compared to t-SNE).

The most common mistake was to have plots looking almost like two or three straight lines. This is due to the graph construction. If the graph is not connected and the Laplacian eigenmaps code does not take this into account, each connected component will be constant along one axis. Plots where most points are *almost* constant along one axis may be due to a graph that is very weakly connected.

# Q10: Other dimensionality reduction techniques (2pts):

There were very few mistakes. Most of them came from a very approximate explanation of PCA. PCA is a linear method that finds an orthogonal projection of the data such that the variance of the projected point cloud is maximized. Almost all teams noticed that t-SNE performs much better on MNIST.
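
To make the definition of PCA above concrete, here is a minimal sketch based on the SVD of the centered data. The function name `pca` and the toy data are assumptions made for illustration; they are not part of the assignment.

```python
import numpy as np


def pca(data, dim=2):
    """Project `data` (n_samples x n_features) on its `dim` top principal directions."""
    centered = data - data.mean(axis=0)
    # Rows of `vt` are orthonormal directions sorted by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:dim]
    explained_variance = singular_values[:dim] ** 2 / (len(data) - 1)
    return centered @ components.T, components, explained_variance


rng = np.random.default_rng(0)
# Toy data (illustrative only): an elongated Gaussian cloud in 3D.
data = rng.normal(size=(500, 3)) * np.array([5.0, 1.0, 0.2])
projected, components, explained_variance = pca(data, dim=2)
```

The rows of `components` are orthonormal, so the projection is orthogonal, and the variance of each projected coordinate equals the corresponding entry of `explained_variance`.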

# Q11: Frequencies (2pt):

All teams understood that smoothness is associated with low eigenvalues. However, there were more mistakes in how to measure smoothness:
  * A few answers were based on the mean value of the signal. They are totally incorrect.
  * More groups proposed to measure the variance of the signal on the graph. While this is a better idea, it is not satisfactory. Different signals with the same variance can have different smoothness depending on the graph. For example, on a path graph of size 4, the signal 5 - 5 - 0 - 0 is smoother than the signal 5 - 0 - 5 - 0, even though they have the same variance.
  * Another incorrect answer was to "look at the eigenvalues of the signal". A *graph* has eigenvalues that can be computed using the eigendecomposition of its Laplacian. However, a *signal on a graph* does not have eigenvalues. What is possible instead is to *represent the signal in the graph Fourier domain* by projecting the signal on each eigenvector. This leads us to the correct answer, which is:
  * Compute the quadratic form of the Laplacian $x^T L x$. Note that this value measures "non-smoothness" (a measure of smoothness could be its inverse). This operation has several interpretations:
    * It can be seen as the squared norm of the graph gradient, i.e. a weighted sum of the squared signal differences across the edges. For the combinatorial Laplacian, $x^T L x = \sum_{(i,j) \in E} w_{ij} (x_i - x_j)^2$.
    * It can also be seen as a quantity computed using the projection $p_\lambda(x)$ of the signal on each eigenspace: $x^T L x = \sum_{\lambda \in \mathrm{eig}(G)} \lambda \| p_\lambda(x) \|^2$
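
The path graph example above can be checked numerically. This is a small sketch assuming an unweighted path graph and the combinatorial Laplacian; the function name `dirichlet_energy` is chosen here for illustration.

```python
import numpy as np

# Unweighted path graph on 4 nodes: 0 - 1 - 2 - 3.
adjacency = np.zeros((4, 4))
for i in range(3):
    adjacency[i, i + 1] = adjacency[i + 1, i] = 1.0
laplacian = np.diag(adjacency.sum(axis=1)) - adjacency


def dirichlet_energy(laplacian, signal):
    """Quadratic form x^T L x: larger values mean a less smooth signal."""
    return float(signal @ laplacian @ signal)


smooth = np.array([5.0, 5.0, 0.0, 0.0])
rough = np.array([5.0, 0.0, 5.0, 0.0])

print(np.var(smooth), np.var(rough))        # 6.25 6.25 (same variance)
print(dirichlet_energy(laplacian, smooth))  # 25.0
print(dirichlet_energy(laplacian, rough))   # 75.0

# Spectral interpretation: sum of eigenvalue * squared Fourier coefficient.
eigenvalues, eigenvectors = np.linalg.eigh(laplacian)
rough_hat = eigenvectors.T @ rough  # graph Fourier transform of the signal
spectral_energy = float(np.sum(eigenvalues * rough_hat ** 2))
```

Both signals have the same mean and variance, yet the quadratic form separates them by a factor of three, and `spectral_energy` matches the vertex-domain value for the second signal.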
    
# Q12: Graph Fourier Transform (2 pt)

All teams solved this question correctly.

# Q13: Graph filters

## Ideal Tikhonov spectral response (1 pt)

All teams got this right.

## Ideal graph filter (2 pt)

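Most answers were right. However, some teams performed the filtering directly in the vertex domain by multiplying the signal with the spectral response, which is incorrect: the spectral response must multiply the graph Fourier coefficients of the signal, not its values on the nodes.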
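
The difference between the two operations can be sketched as follows. The ring graph, the function names, and the Tikhonov response $h(\lambda) = 1 / (1 + \alpha \lambda)$ are assumptions made for this illustration and may differ from the notebook's definitions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
# Toy graph (illustrative only): unweighted ring on n nodes.
adjacency = np.zeros((n, n))
for i in range(n):
    adjacency[i, (i + 1) % n] = adjacency[(i + 1) % n, i] = 1.0
laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
eigenvalues, eigenvectors = np.linalg.eigh(laplacian)


def tikhonov_response(lam, alpha=1.0):
    """Assumed low-pass spectral response h(lambda) = 1 / (1 + alpha * lambda)."""
    return 1.0 / (1.0 + alpha * lam)


def ideal_graph_filter(signal, spectral_response, eigenvectors):
    """Filter in the spectral domain: GFT, pointwise product, inverse GFT."""
    signal_hat = eigenvectors.T @ signal           # graph Fourier transform
    filtered_hat = spectral_response * signal_hat  # scale each frequency component
    return eigenvectors @ filtered_hat             # back to the vertex domain


x = rng.normal(size=n)
h = tikhonov_response(eigenvalues)
x_filtered = ideal_graph_filter(x, h, eigenvectors)

# The mistake: entry k of `h` belongs to eigenvalue k, not to node k.
x_wrong = h * x
```

With this response and $\alpha = 1$, `x_filtered` coincides with the solution of $(I + L) y = x$, which gives a quick sanity check of the implementation.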

## Relationship between filtering and spectral decomposition (0.5 pt)

Most teams understood the interpretation of graph filtering as an operation that scales the coordinates of a graph signal in the basis given by the spectral decomposition of the Laplacian. In this sense, a low-pass filter only preserves the components associated with the smallest eigenvalues (and hence it smooths the signal), a high-pass filter preserves the components associated with the largest eigenvalues (and hence it produces signals with rapid spatial variations), and a band-pass filter preserves the components in between (and produces a mildly smooth signal).

Looking at the spectral response of the Tikhonov filter we see that it weights down the components associated with large eigenvalues, and preserves the low frequencies. We thus say that this is a low pass filter.

Most erroneous answers came from teams who were not clear on these concepts, wrote too much, and in doing so revealed conceptual mistakes.

# Q14: Polynomial graph filters

## Fit polynomial (3 pt)

Most people had this right, either by solving the least squares problem numerically using `np.linalg.lstsq`

$$\alpha = \arg\min_\alpha||V\alpha - h||_2^2$$


or using the explicit solution in terms of the Moore-Penrose pseudoinverse (the second equality below holds when $V$ has full column rank)

$$\alpha = V^\dagger h = (V^TV)^{-1}V^Th$$


However, some teams did not fully understand what they were doing and solved a combination of the two

$$\alpha = \arg\min_\alpha||V^TV\alpha - V^Th||_2^2$$

Even if the final solution is the same, the extra multiplication of every term by $V^T$ is unnecessary and shows a lack of understanding of what you are doing. We removed 1.5 pts for this mistake.
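
The two accepted approaches can be sketched as follows. The stand-in spectrum, the target response, and the variable names (`coeffs_lstsq`, `coeffs_pinv`) are assumptions for illustration; in the assignment, the eigenvalues come from the graph Laplacian and $h$ is the ideal response.

```python
import numpy as np

# Stand-in spectrum and target response (illustrative values only).
eigenvalues = np.linspace(0.0, 2.0, 50)
h = 1.0 / (1.0 + eigenvalues)

order = 5
# Vandermonde matrix: V[i, k] = eigenvalues[i] ** k for k = 0..order.
V = np.vander(eigenvalues, N=order + 1, increasing=True)

# Option 1: solve min_alpha ||V alpha - h||_2^2 numerically.
coeffs_lstsq, *_ = np.linalg.lstsq(V, h, rcond=None)

# Option 2: explicit solution with the Moore-Penrose pseudoinverse.
coeffs_pinv = np.linalg.pinv(V) @ h

max_fit_error = np.max(np.abs(V @ coeffs_lstsq - h))
```

Both options work on $V$ directly. Feeding $V^T V$ and $V^T h$ to a least squares solver gives the same minimizer in exact arithmetic but adds an unnecessary step.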

## Polynomial filter spectral response (2 pt)

Almost everybody had this right.

## Polynomial filter (2 pt)

The main recurrent mistake was to perform filtering in the spectral domain instead of explicitly computing the filter as a matrix polynomial. In this regard, please note that the power of a matrix is NOT the power of its elements, i.e.
$$A^k=A\cdot A^{k-1}=A\cdot A\cdots A$$
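
As a minimal sketch of the expected computation, the filter can be built as a matrix polynomial of the Laplacian and applied in the vertex domain. The function name `polynomial_graph_filter`, the toy path graph, and the coefficient values are hypothetical.

```python
import numpy as np


def polynomial_graph_filter(coeffs, laplacian):
    """Return g = sum_k coeffs[k] * L^k, using matrix powers (not element-wise)."""
    n = laplacian.shape[0]
    result = np.zeros((n, n))
    power = np.eye(n)  # L^0
    for c in coeffs:
        result += c * power
        power = power @ laplacian  # next matrix power of L
    return result


# Toy graph (illustrative only): unweighted path on 4 nodes.
adjacency = np.zeros((4, 4))
for i in range(3):
    adjacency[i, i + 1] = adjacency[i + 1, i] = 1.0
laplacian = np.diag(adjacency.sum(axis=1)) - adjacency

matrix_square = np.linalg.matrix_power(laplacian, 2)  # L @ L
elementwise_square = laplacian ** 2                    # squares each entry: NOT L @ L

coeffs = np.array([1.0, -0.5, 0.1])  # arbitrary coefficients, lowest order first
g = polynomial_graph_filter(coeffs, laplacian)
x = np.array([5.0, 0.0, 5.0, 0.0])
x_filtered = g @ x  # filtering in the vertex domain, no eigendecomposition needed
```

In NumPy, `**` acts element-wise on arrays, so `laplacian ** 2` differs from `laplacian @ laplacian` whenever the matrix is not diagonal.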
    
## Order (0.5 pt)

We accepted any answer between 3 and 10. For polynomial filters, the order lets you set the trade-off between computational complexity and accuracy. The lower the order, the faster the filter is to compute (the order is essentially the number of times you need to multiply the Laplacian with itself), and the higher the order, the better the polynomial can fit the ideal response.


# Q15: ARMA Filter

## ARMA filter (1 pt)

Everybody had this right.

## Implement filtering operation (1 pt)

Almost all groups had this right. The only right answer was `x_tk_polynomial = g_tk @ x_noisy`. If the filtering was performed in the spectral domain, we granted 0 pts.

# Q19: Graph neural networks (3pt):

The goal of this part was not to achieve the best performance possible. Instead, it was to show that a neural network can be seen as a filtering process followed by a logistic regression, the difference with standard machine learning being that the filters are trained instead of hand-picked.

It was important to understand that in the two implementations of this section (one with PyTorch and one with scikit-learn), the model is the same: it is a Laplacian polynomial followed by a logistic regression. Even in the "graph neural network" (in PyTorch), there is no non-linearity nor anything "deep".

Any solution with a "correct performance" (> 75% accuracy on the test set) was accepted. The groups which achieved a lower performance had either picked very bad hyperparameters for the logistic regression, or chosen a polynomial order $K$ that was too large.

The last question was the most difficult. Two elements were expected in the answer:
  * The main difference is that the PyTorch model is "end-to-end", that is, the logistic regression and the filters are trained simultaneously. In scikit-learn, it is a two-step process.
  * Then, there are some differences in the way the model is optimized: the regularizer is not the same, and neither is the optimization procedure. You were not expected to spot all the differences; one of them was enough.
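
The shared model structure can be sketched in NumPy as a forward pass only, with no training. Everything here is an assumption made for illustration (the function names, the shapes, the random parameters, and the use of one scalar coefficient per power of the Laplacian); the notebook's PyTorch layer may parameterize the polynomial differently.

```python
import numpy as np


def laplacian_polynomial(laplacian, features, coeffs):
    """Apply sum_k coeffs[k] * L^k to the feature matrix without forming L^k."""
    filtered = np.zeros_like(features)
    power_times_features = features.copy()  # L^0 X
    for c in coeffs:
        filtered += c * power_times_features
        power_times_features = laplacian @ power_times_features  # L^(k+1) X
    return filtered


def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
n_nodes, n_features, n_classes = 6, 4, 3
# Toy graph (illustrative only): unweighted ring.
adjacency = np.zeros((n_nodes, n_nodes))
for i in range(n_nodes):
    adjacency[i, (i + 1) % n_nodes] = adjacency[(i + 1) % n_nodes, i] = 1.0
laplacian = np.diag(adjacency.sum(axis=1)) - adjacency

features = rng.normal(size=(n_nodes, n_features))
coeffs = rng.normal(size=3)                          # filter parameters
weights = rng.normal(size=(n_features, n_classes))   # logistic regression weights
bias = np.zeros(n_classes)

# Step 1: graph filtering. Step 2: (multinomial) logistic regression.
filtered = laplacian_polynomial(laplacian, features, coeffs)
probabilities = softmax(filtered @ weights + bias)
```

In a two-step pipeline, `coeffs` is fixed before `weights` and `bias` are fitted on `filtered`. In an end-to-end model, all three are updated together by the same optimizer.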