# MDS with scikit-learn

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)  
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

In [None]:
import numpy as np
import matplotlib.pyplot as plt

## Code
```python
from sklearn.manifold import MDS
model = MDS(<parameters>)
X_new = model.fit_transform(X)
```

[Official Reference](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.MDS.html)

## Parameters
- `n_components`: target dimension
- `n_init`: the number of times the SMACOF algorithm is run with different initializations;  
the run with the lowest final stress is kept
- `dissimilarity`: `"euclidean"` or `"precomputed"`  
If `"euclidean"`, the Euclidean distance matrix of the input is used as the dissimilarity.  
If `"precomputed"`, you have to pass your precomputed dissimilarity matrix to `model.fit` or `model.fit_transform` .
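
As a rough illustration of how these parameters fit together, the sketch below runs MDS on a small random array. The toy data and the seed are assumptions made for this example; `random_state` is an additional `MDS` parameter, not listed above, that fixes the random initialization.

```python
import numpy as np
from sklearn.manifold import MDS

# toy data (assumption): 20 samples in 5 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))

# reduce to 2 dimensions, keeping the best of 4 SMACOF runs
model = MDS(n_components=2, n_init=4, dissimilarity="euclidean", random_state=0)
X_new = model.fit_transform(X)

print(X_new.shape)  # (20, 2)
```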

## Attributes
- `n_components`: target dimension (the constructor parameter, stored on the model under the same name)
- `embedding_`: the embedding in the target dimension stored in an array of shape `(n_samples, n_components)`
- `dissimilarity_matrix_`: the dissimilarity matrix in use, of shape `(n_samples, n_samples)`
- `stress_`: $\sum_{i<j}(d_{ij}(X_{\rm new}) - \delta_{ij})^2$, where $d_{ij}(X_{\rm new})$ is the distance between the $i$-th row and the $j$-th row of $X_{\rm new}$, and $\delta_{ij}$ is the distance between the $i$-th row and the $j$-th row of $X$ (or the $(i,j)$ entry of the precomputed dissimilarity matrix).
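
To make the stress formula concrete, here is a minimal sketch that evaluates it with NumPy on a hand-made example. The helper names `pairwise_dist` and `raw_stress` and the toy arrays are made up for illustration; they are not part of scikit-learn.

```python
import numpy as np

def pairwise_dist(A):
    """Euclidean distances between all pairs of rows of A."""
    return np.linalg.norm(A[:, None, :] - A[None, :, :], axis=2)

def raw_stress(X_new, delta):
    """Sum over i < j of (d_ij(X_new) - delta_ij)^2."""
    d = pairwise_dist(X_new)
    iu = np.triu_indices(len(X_new), k=1)  # index pairs with i < j
    return ((d[iu] - delta[iu]) ** 2).sum()

# three points on a line, and a target in which every pair is at distance 1
X_new = np.array([[0.0], [1.0], [2.0]])
delta = np.ones((3, 3)) - np.eye(3)

print(raw_stress(X_new, delta))  # (1-1)^2 + (2-1)^2 + (1-1)^2 = 1.0
```

Only the pair of end points contributes here, since their distance is 2 while the target is 1.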

[Wikipedia: Stress majorization](https://en.wikipedia.org/wiki/Stress_majorization)

In [None]:
from sklearn.manifold import MDS

## Sample data

##### Exercise 1
Let  
```python
mu = np.array([3,4])
cov = np.array([[1.1,1],
                [1,1.1]])
X = np.random.multivariate_normal(mu, cov, 100)
```
Let `X_new` be the result of MDS on `X` .

###### 1(a)
Plot points (rows) in `X` .  
Plot points (rows) in `X_new` .  

In [None]:
### your answer here
mu = np.array([3,4])
cov = np.array([[1.1,1],
                [1,1.1]])
X = np.random.multivariate_normal(mu, cov, 100)

model = MDS(2)
X_new = model.fit_transform(X)

plt.axis('equal')
plt.scatter(*X.T)
plt.scatter(*X_new.T)

###### 1(b)
Obtain `X_new` several times and redo 1(a).  
Are the results all similar, or can they be quite different?
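
One way to explore this question is to control the random initialization. The sketch below assumes the `random_state` parameter of `MDS` (not used elsewhere in this notebook) and compares pairwise distances as well as coordinates, since an embedding can be rotated or reflected without changing its stress. The seeds and the helper `pairwise_dist` are choices made for this example.

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.multivariate_normal([3, 4], [[1.1, 1], [1, 1.1]], 100)

def pairwise_dist(A):
    """Euclidean distances between all pairs of rows of A."""
    return np.linalg.norm(A[:, None, :] - A[None, :, :], axis=2)

# the same seed twice, then a different seed
A = MDS(n_components=2, random_state=0).fit_transform(X)
B = MDS(n_components=2, random_state=0).fit_transform(X)
C = MDS(n_components=2, random_state=1).fit_transform(X)

# same seed, so the coordinates should match
print(np.allclose(A, B))
# different seed: largest gap between coordinates
print(np.abs(A - C).max())
# different seed: largest gap between pairwise distances
print(np.abs(pairwise_dist(A) - pairwise_dist(C)).max())
```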

In [None]:
### your answer here
X_new = model.fit_transform(X)

plt.axis('equal')
plt.scatter(*X.T)
plt.scatter(*X_new.T)

##### Exercise 2
Let  
```python
X = np.genfromtxt('hidden_text.csv', delimiter=',')
```
All points in this data set lie in a two-dimensional plane embedded in a much higher-dimensional space.  
Can you find out what this data says?
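
Since `hidden_text.csv` is not reproduced here, the sketch below builds a hypothetical stand-in: points in the plane that are mapped into a 10-dimensional space by a random matrix with orthonormal columns. It only illustrates how one might check numerically that data is essentially two-dimensional; the names `P`, `Q`, and `X_demo` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 points in the plane (stand-in for the hidden picture)
P = rng.uniform(size=(50, 2))

# random 10 x 2 matrix with orthonormal columns, plus a random shift
Q, _ = np.linalg.qr(rng.normal(size=(10, 2)))
X_demo = P @ Q.T + rng.normal(size=10)

# singular values of the centered data: only two are far from zero
centered = X_demo - X_demo.mean(axis=0)
s = np.linalg.svd(centered, compute_uv=False)
print(np.round(s, 6))
print(np.linalg.matrix_rank(centered))  # 2
```

Because a map with orthonormal columns preserves distances, MDS with `n_components=2` should be able to recover such a planar picture up to rotation, reflection, and translation.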

In [None]:
### your answer here
X = np.genfromtxt('hidden_text.csv', delimiter=',')

In [None]:
model = MDS(2)
X_new = model.fit_transform(X)

plt.axis('equal')
plt.scatter(*X_new.T)

##### Exercise 3
Let  
```python
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data
y = digits.target
```

###### 3(a)
Let `X_new` be the result of applying MDS to `X` with `n_components=2` .  
Plot the points (rows) in `X_new` with `c=y` .  
Print `model.stress_` .

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()

In [None]:
### your answer here
X = digits.data
y = digits.target

model = MDS(2)
X_new = model.fit_transform(X)
model.stress_

In [None]:
plt.axis('equal')
plt.scatter(*X_new.T, c=y)

###### 3(b)
Let `X_new` be the result of applying MDS to `X` with `n_components=3` .  
Plot the points (rows) in `X_new` with `c=y` .  
Print `model.stress_` .  
Is it lower than what you got in 3(a)?

In [None]:
### your answer here
model = MDS(3)
X_new = model.fit_transform(X)
model.stress_

In [None]:
%matplotlib notebook
ax = plt.axes(projection='3d')
ax.scatter(*X_new.T, c=y)

##### Exercise 4
For each of the following `precom`, use it as the precomputed dissimilarity matrix and obtain `X_new` .  
Try to guess the answer beforehand.

###### 4(a)
Let  
```python
precom = np.array([[0,1,1],
                   [1,0,1],
                   [1,1,0]])
```
Apply MDS with `n_components=2` .  
Plot the points (rows) in `X_new` .

In [None]:
### your answer here
precom = np.array([[0,1,1],
                   [1,0,1],
                   [1,1,0]])

model = MDS(n_components=2, dissimilarity="precomputed")
X_new = model.fit_transform(precom)

plt.axis('equal')
plt.scatter(*X_new.T)

###### 4(b)
Let  
```python
precom = np.array([[0,1,1,1],
                   [1,0,1,1],
                   [1,1,0,1],
                   [1,1,1,0]])
```
Apply MDS with `n_components=2` .  
Plot the points (rows) in `X_new` .

In [None]:
### your answer here
precom = np.array([[0,1,1,1],
                   [1,0,1,1],
                   [1,1,0,1],
                   [1,1,1,0]])

model = MDS(n_components=2, dissimilarity="precomputed")
X_new = model.fit_transform(precom)

plt.axis('equal')
plt.scatter(*X_new.T)

###### 4(c)
Let  
```python
precom = np.array([[0,1,1,1],
                   [1,0,1,1],
                   [1,1,0,1],
                   [1,1,1,0]])
```
Apply MDS with `n_components=3` .  
Plot the points (rows) in `X_new` .
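
A quick way to compare 4(b) and 4(c) is to look at `stress_` for both target dimensions. The sketch below is one possible check; the tighter `eps`, the larger `max_iter` (both parameters of `MDS`), and the `random_state` seed are choices made for this example, not part of the exercise.

```python
import numpy as np
from sklearn.manifold import MDS

# four objects that are pairwise at distance 1
precom = np.ones((4, 4)) - np.eye(4)

stress = {}
for k in (2, 3):
    model = MDS(n_components=k, dissimilarity="precomputed",
                n_init=4, max_iter=1000, eps=1e-9, random_state=0)
    model.fit(precom)
    stress[k] = model.stress_  # final stress reported for this dimension

print(stress)
```

Four mutually equidistant points form a regular tetrahedron in three dimensions but cannot be placed exactly in the plane, so one would expect the stress to be close to zero only for `n_components=3`.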

In [None]:
### your answer here
%matplotlib notebook

precom = np.array([[0,1,1,1],
                   [1,0,1,1],
                   [1,1,0,1],
                   [1,1,1,0]])

model = MDS(n_components=3, dissimilarity="precomputed")
X_new = model.fit_transform(precom)

ax = plt.axes(projection='3d')
ax.scatter(*X_new.T)

###### 4(d)
Let  
```python
precom = np.array([[0,1,2,1,1,2,3,2],
                   [1,0,1,2,2,1,2,3],
                   [2,1,0,1,3,2,1,2],
                   [1,2,1,0,2,3,2,1],
                   [1,2,3,2,0,1,2,1],
                   [2,1,2,3,1,0,1,2],
                   [3,2,1,2,2,1,0,1],
                   [2,3,2,1,1,2,1,0]])
precom = np.sqrt(precom)
```
Apply MDS with `n_components=2` .  
Plot the points (rows) in `X_new` .  

In [None]:
### your answer here
precom = np.array([[0,1,2,1,1,2,3,2],
                   [1,0,1,2,2,1,2,3],
                   [2,1,0,1,3,2,1,2],
                   [1,2,1,0,2,3,2,1],
                   [1,2,3,2,0,1,2,1],
                   [2,1,2,3,1,0,1,2],
                   [3,2,1,2,2,1,0,1],
                   [2,3,2,1,1,2,1,0]])
precom = np.sqrt(precom)

model = MDS(n_components=2, dissimilarity="precomputed")
X_new = model.fit_transform(precom)

plt.axis('equal')
plt.scatter(*X_new.T)

###### 4(e)
Let  
```python
precom = np.array([[0,1,2,1,1,2,3,2],
                   [1,0,1,2,2,1,2,3],
                   [2,1,0,1,3,2,1,2],
                   [1,2,1,0,2,3,2,1],
                   [1,2,3,2,0,1,2,1],
                   [2,1,2,3,1,0,1,2],
                   [3,2,1,2,2,1,0,1],
                   [2,3,2,1,1,2,1,0]])
precom = np.sqrt(precom)
```
Apply MDS with `n_components=3` .  
Plot the points (rows) in `X_new` .  
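
One possible way to guess the answer is to compare `precom` with the distances between the corners of a unit cube. The vertex ordering in `V` below is an assumption chosen for this check.

```python
import numpy as np

precom = np.array([[0,1,2,1,1,2,3,2],
                   [1,0,1,2,2,1,2,3],
                   [2,1,0,1,3,2,1,2],
                   [1,2,1,0,2,3,2,1],
                   [1,2,3,2,0,1,2,1],
                   [2,1,2,3,1,0,1,2],
                   [3,2,1,2,2,1,0,1],
                   [2,3,2,1,1,2,1,0]])
precom = np.sqrt(precom)

# corners of the unit cube: bottom face first, then top face
V = np.array([[0,0,0], [1,0,0], [1,1,0], [0,1,0],
              [0,0,1], [1,0,1], [1,1,1], [0,1,1]])

# Euclidean distances between all pairs of corners
cube_dist = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2)

print(np.allclose(cube_dist, precom))  # True
```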

In [None]:
### your answer here
%matplotlib notebook

precom = np.array([[0,1,2,1,1,2,3,2],
                   [1,0,1,2,2,1,2,3],
                   [2,1,0,1,3,2,1,2],
                   [1,2,1,0,2,3,2,1],
                   [1,2,3,2,0,1,2,1],
                   [2,1,2,3,1,0,1,2],
                   [3,2,1,2,2,1,0,1],
                   [2,3,2,1,1,2,1,0]])
precom = np.sqrt(precom)

model = MDS(n_components=3, dissimilarity="precomputed")
X_new = model.fit_transform(precom)

ax = plt.axes(projection='3d')
ax.scatter(*X_new.T)

## Experiments

##### Exercise 5
Let  
```python
mu = np.array([3,4])
cov = np.array([[1.1,1],
                [1,1.1]])
X = np.random.multivariate_normal(mu, cov, 100)

model = MDS(2)
X_new = model.fit_transform(X)
```

###### 5(a)
Print `X_new` and `model.embedding_` and check if they are the same.

In [None]:
### your answer here
mu = np.array([3,4])
cov = np.array([[1.1,1],
                [1,1.1]])
X = np.random.multivariate_normal(mu, cov, 100)

model = MDS(2)
X_new = model.fit_transform(X)

np.isclose(X_new, model.embedding_).all(), X_new, model.embedding_

###### 5(b)
Calculate the distance matrix `dist` between the rows of `X` and the rows of `X` .  
Compare `dist` and `model.dissimilarity_matrix_` and check if they are the same.  
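
The pairwise distance matrix can be computed in several ways. The sketch below shows NumPy broadcasting next to `sklearn.metrics.pairwise_distances` on a tiny array that can be checked by hand; the array `pts` is made up for the example.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# corners of a 3-4-5 right triangle
pts = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])

# broadcasting: entry (i, j) is the norm of pts[i] - pts[j]
dist_b = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)

# scikit-learn helper, Euclidean metric by default
dist_s = pairwise_distances(pts)

print(dist_b)
print(np.allclose(dist_b, dist_s))
```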

In [None]:
### your answer here
dist = np.linalg.norm(X[None,:,:] - X[:,None,:], axis=2)
np.isclose(dist, model.dissimilarity_matrix_).all()

###### 5(c)
Calculate the distance matrix `dist_new` between the rows of `X_new` and the rows of `X_new` .

In [None]:
### your answer here
dist_new = np.linalg.norm(X_new[None,:,:] - X_new[:,None,:], axis=2)
dist_new

###### 5(d)
Calculate the stress $\sum_{i<j}(d_{ij}(X_{\rm new}) - \delta_{ij})^2$ and compare it with `model.stress_` .  

In [None]:
### your answer here
iu = np.triu_indices(len(X), k=1)  # index pairs (i, j) with i < j
((dist_new[iu] - dist[iu]) ** 2).sum(), model.stress_

#### Remark
It seems that `model.stress_` is always slightly higher than the stress computed from $X_{\rm new}$.  
You may check the code by running:  
```python
from sklearn.manifold import _mds
_mds._smacof_single??
```
In the `for` loop, the stress is computed for $X=X_k$, and then $X$ is updated to $X_{k+1}$.  
The code returns the stress of $X_k$ together with the embedding $X_{k+1}$, which has lower stress.  
This seems to be a bug.