# MDS

Multidimensional scaling (MDS) is a technique that appeared in statistics in the 1950s. Its purpose is to <u>preserve pairwise distances</u> between points when mapping them from a high-dimensional to a low-dimensional space.

In statistics, the low-dimensional space is sometimes referred to as [ordination](https://en.wikipedia.org/wiki/Ordination_(statistics)) space. MDS is mostly used for data visualization, so the target dimensionality is usually chosen to be 2 or 3.

### Common problem statement

We don't have the original dataset coordinates; we only have a [dissimilarity](https://en.wikipedia.org/wiki/Distance_matrix) matrix:

#### TBD: change notation

$$D^{old}_{ij} = dis(x_i,x_j)$$

Suppose we applied some mapping and arrived at a configuration X. We then compute the dissimilarities in the new space. These are usually standard (here squared) Euclidean distances; it does not make much sense to use more complex distances:

$$D^{new}_{ij}(X) = {||x_i - x_j ||}^2 $$

The task is to preserve distances, so we need to find a configuration X that minimizes the loss function:

$$X^{opt} = \underset{X}{\operatorname{{arg\,min}}} \sum_{i \ne j} {\left( D^{new}_{ij}(X) - D^{old}_{ij}\right)}^2$$
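As a rough illustration of the loss above, here is a minimal NumPy sketch. It assumes points are stored as the columns of `X` (to match the $X^TX$ notation used later); the helper names `squared_distances` and `mds_loss` and the toy data are made up for this example.

```python
import numpy as np

def squared_distances(X):
    """Pairwise squared Euclidean distances between the columns of X (shape d x n)."""
    diff = X[:, :, None] - X[:, None, :]      # shape (d, n, n)
    return np.sum(diff ** 2, axis=0)          # shape (n, n)

def mds_loss(X, D_old):
    """Sum over i != j of (D_new_ij(X) - D_old_ij)^2."""
    D_new = squared_distances(X)
    off_diagonal = ~np.eye(D_old.shape[0], dtype=bool)
    return np.sum((D_new[off_diagonal] - D_old[off_diagonal]) ** 2)

# Hypothetical toy data: 3 points in 2-D, one point per column.
X_true = np.array([[0.0, 3.0, 0.0],
                   [0.0, 0.0, 4.0]])
D_old = squared_distances(X_true)

loss_exact = mds_loss(X_true, D_old)          # configuration that reproduces D_old
loss_shrunk = mds_loss(0.5 * X_true, D_old)   # distorted configuration
```

A configuration that reproduces the dissimilarities exactly gives zero loss, while any distortion (here, shrinking all coordinates by half) gives a positive one.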

The scheme of the process is described below:

<img src = "img/mds.png" width=500>

Classical MDS (assumption: $D^{old}$ contains Euclidean distances):
1. Compute the double-centered matrix of squared pairwise distances between all points in the dataset
2. Compute the eigendecomposition of that matrix, which gives its eigenvalues and eigenvectors
3. Take the coordinates along the first K eigenvectors (scaled by the square roots of their eigenvalues) as the new configuration
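The three steps can be sketched in NumPy roughly as follows. This is only an illustration under the assumption that the input holds squared Euclidean distances; the function name `classical_mds` and the toy points are hypothetical.

```python
import numpy as np

def classical_mds(D_sq, k):
    """Embed n points in k dimensions from an (n x n) matrix of squared distances.

    Returns Y with shape (k, n): one column per point.
    """
    n = D_sq.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * C @ D_sq @ C                    # step 1: double centering
    eigvals, eigvecs = np.linalg.eigh(B)       # step 2: eigendecomposition (ascending eigenvalues)
    top = np.argsort(eigvals)[::-1][:k]        # indices of the k largest eigenvalues
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))  # guard against tiny negative values
    return (eigvecs[:, top] * scale).T         # step 3: coordinates along the top-k eigenvectors

# Hypothetical toy data: corners of a 3 x 4 rectangle, one point per column.
X = np.array([[0.0, 3.0, 0.0, 3.0],
              [0.0, 0.0, 4.0, 4.0]])
diff = X[:, :, None] - X[:, None, :]
D_sq = np.sum(diff ** 2, axis=0)

Y = classical_mds(D_sq, k=2)
diff_Y = Y[:, :, None] - Y[:, None, :]
D_sq_Y = np.sum(diff_Y ** 2, axis=0)           # distances in the recovered configuration
```

The recovered `Y` generally differs from `X` by a rotation, reflection, and shift, but its pairwise distances match the input ones.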

### Proof

1. The [centering matrix](https://en.wikipedia.org/wiki/Centering_matrix) is a concise way to subtract the mean from a matrix

    If we denote by $I_n$ the identity matrix and by $J_n$ the matrix of all ones, then
$$C_n =  I_n - \tfrac{1}{n}J_n$$

    For example:

$$C_3 = \left[ \begin{array}{rrr}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1 
\end{array} \right] -  \frac{1}{3}\left[ \begin{array}{rrr}
1 & 1 & 1 \\
1 & 1 & 1 \\
1 & 1 & 1 
\end{array} \right]
 = \left[ \begin{array}{rrr}
\frac{2}{3} & -\frac{1}{3} & -\frac{1}{3} \\
-\frac{1}{3} & \frac{2}{3} & -\frac{1}{3} \\
-\frac{1}{3} & -\frac{1}{3} & \frac{2}{3} 
\end{array} \right]$$
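The same construction in NumPy, as a small sketch (the helper name `centering_matrix` is made up here). It also checks two properties that follow directly from the definition: the matrix is symmetric and applying it twice changes nothing.

```python
import numpy as np

def centering_matrix(n):
    """C_n = I_n - (1/n) J_n."""
    return np.eye(n) - np.ones((n, n)) / n

C3 = centering_matrix(3)

is_symmetric = np.allclose(C3, C3.T)
is_idempotent = np.allclose(C3 @ C3, C3)   # centering already centered data is a no-op
```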


Right multiplication <u>subtracts row mean</u> from each row 
$$XC$$

Left multiplication <u>subtracts column mean</u> from each column
$$CX$$

Double centering = applying the centering matrix on both sides
$$CXC$$

In that case each row and column will have an average of zero.
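A quick numerical check of the three cases, using an arbitrary made-up 3 x 3 matrix:

```python
import numpy as np

n = 3
C = np.eye(n) - np.ones((n, n)) / n

# Arbitrary example matrix (values are made up).
X = np.array([[1.0, 2.0, 6.0],
              [4.0, 5.0, 9.0],
              [7.0, 8.0, 0.0]])

row_centered = X @ C         # each row now has mean zero
col_centered = C @ X         # each column now has mean zero
double_centered = C @ X @ C  # both rows and columns have mean zero

row_means = double_centered.mean(axis=1)
col_means = double_centered.mean(axis=0)
```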

-----

2. Let's express the distance matrix D through the (unknown) coordinate matrix X

$$D^X =  Z - 2X^TX + Z^T$$

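Here the columns of $X$ are the points, $X^TX$ is the matrix of their dot products, and $Z_{ij} = ||x_i||^2$ holds the squared norms. Entrywise this reads $D^X_{ij} = ||x_i||^2 - 2x_i^Tx_j + ||x_j||^2 = ||x_i - x_j||^2$.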
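This identity is easy to verify numerically. The sketch below assumes random points stored as columns of `X` and builds $Z$ so that every row is constant:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 5
X = rng.normal(size=(d, n))                 # one point per column

G = X.T @ X                                 # Gram matrix of dot products
sq_norms = np.diag(G)                       # ||x_i||^2
Z = np.tile(sq_norms[:, None], (1, n))      # Z_ij = ||x_i||^2, constant along each row
D = Z - 2 * G + Z.T

# Direct computation of squared distances for comparison.
D_direct = np.array([[np.sum((X[:, i] - X[:, j]) ** 2) for j in range(n)]
                     for i in range(n)])
```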

-----


3. Let's apply double centering to our distance matrix $D^X$

    $Z$ is constant along each row and $Z^T$ is constant along each column, so both vanish: $ZC = 0$ and $CZ^T = 0$

    So we get:

$$CD^XC = -2\hat{X}^T\hat{X}$$

Note that $X^T$ and $X$ were also centered and became $\hat{X}^T$ and $\hat{X} = XC$, the centered coordinate matrices.

Let's denote this double-centered distance matrix by $B^X$:

$$B^X = -\frac{1}{2}CD^XC = \hat{X}^T\hat{X}$$
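A small numerical check of this result, assuming random points stored as columns of `X`: double centering the squared distances gives the same matrix as the Gram matrix of the centered coordinates.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 2, 6
X = rng.normal(size=(d, n))                 # one point per column
C = np.eye(n) - np.ones((n, n)) / n

diff = X[:, :, None] - X[:, None, :]
D = np.sum(diff ** 2, axis=0)               # squared pairwise distances

B = -0.5 * C @ D @ C                        # double-centered distance matrix
X_hat = X @ C                               # subtract each feature's mean from the coordinates
gram_centered = X_hat.T @ X_hat
```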

-----


4. Now we need to find the best rank-k approximation of the double-centered distance matrix $B^X = \hat{X}^T\hat{X}$.

$$Y^{opt} = \underset{Y}{\operatorname{arg\,min}} \, ||B^X - B^Y||^2 = \underset{Y}{\operatorname{arg\,min}} \, ||\hat{X}^T\hat{X} - Y^TY||^2$$

Here $B^Y = Y^TY$ and $Y$ has k rows. Notice that the Y's we are seeking will also represent centered coordinates, not the original ones.

Notice that here we use the Frobenius norm, which is the same as minimizing the mean squared error over the matrix entries.

-----


5. This optimization is solved by the SVD.

    Suppose we deal with a matrix A, then
<img src="img/svd_theorem.png" width=300>


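These Y optimally approximate $B^X$. Here $D$ is the diagonal matrix of the k largest singular values from the SVD (not the distance matrix), and $U$ holds the corresponding singular vectors. Since $B^X$ is symmetric and positive semidefinite, its left and right singular vectors coincide, so $Y = D^{1/2}U^T$:
$$Y^TY = (UD^{1/2})(D^{1/2}U^T)$$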
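A sketch of this step with `np.linalg.svd`, assuming $B^X$ is symmetric positive semidefinite (which holds when it comes from Euclidean distances). The helper name `coordinates_from_gram` and the random data are made up for illustration.

```python
import numpy as np

def coordinates_from_gram(B, k):
    """Return Y = D_k^{1/2} U_k^T (shape k x n) from the top-k singular pairs of B."""
    U, s, Vt = np.linalg.svd(B)                   # singular values in descending order
    return np.sqrt(s[:k])[:, None] * U[:, :k].T

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 7))                       # 7 points in 3-D, one per column
X_hat = X - X.mean(axis=1, keepdims=True)         # centered coordinates
B = X_hat.T @ X_hat

Y_full = coordinates_from_gram(B, k=3)            # keeps all non-zero singular values
Y_2d = coordinates_from_gram(B, k=2)              # rank-2 approximation

s = np.linalg.svd(B, compute_uv=False)
error_2d = np.linalg.norm(B - Y_2d.T @ Y_2d) ** 2     # squared Frobenius error
discarded = np.sum(s[2:] ** 2)                        # sum of squared discarded singular values
```

With `k` equal to the rank of $B^X$ the reconstruction is exact; with a smaller `k` the squared Frobenius error equals the sum of the squared singular values that were dropped.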

### Recap
1. double-center the squared distance matrix
2. decompose it using SVD
3. coordinates in the singular basis = new coordinates

#### Relation to PCA

MDS is similar to PCA in that it also uses an eigendecomposition to get reduced coordinates. Classical MDS is even referred to as PCoA (principal coordinates analysis).

But the formulation is a bit different
- for PCA = to preserve data variation after mapping
- for MDS = to preserve pairwise distances after mapping

Also, in MDS we work with a distance matrix instead of a covariance matrix, which makes the approach a bit more general.
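When the dissimilarities are Euclidean distances computed from known coordinates, the two methods give the same embedding up to the sign of each axis. The sketch below illustrates this on made-up random data; PCA is done by hand through the covariance matrix rather than with a library call.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, k = 4, 10, 2
X = rng.normal(size=(d, n))                        # one point per column
X_hat = X - X.mean(axis=1, keepdims=True)          # centered coordinates

# PCA: project centered data onto the top-k eigenvectors of the (d x d) covariance matrix.
cov = X_hat @ X_hat.T / (n - 1)
cov_vals, cov_vecs = np.linalg.eigh(cov)
W = cov_vecs[:, np.argsort(cov_vals)[::-1][:k]]    # shape (d, k)
Y_pca = W.T @ X_hat                                # shape (k, n)

# Classical MDS: uses only the (n x n) matrix of squared distances.
diff = X[:, :, None] - X[:, None, :]
D = np.sum(diff ** 2, axis=0)
C = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * C @ D @ C
b_vals, b_vecs = np.linalg.eigh(B)
top = np.argsort(b_vals)[::-1][:k]
Y_mds = (b_vecs[:, top] * np.sqrt(b_vals[top])).T  # shape (k, n)

# Eigenvector signs are arbitrary, so compare absolute values.
same_up_to_sign = np.allclose(np.abs(Y_pca), np.abs(Y_mds))
```

The difference is in the input: PCA needs the coordinates themselves, while MDS only needs the distances, so it still applies when no coordinates are available.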

####  Flavours of MDS
- Classical MDS
- Metric MDS
- Non-metric MDS

### References

[SVD](https://www.cs.cmu.edu/~venkatg/teaching/CStheory-infoage/book-chapter-4.pdf)


