# KMeans with scikit-learn

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)  
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

In [3]:
import numpy as np
import matplotlib.pyplot as plt

## Code
```python
from sklearn.cluster import KMeans
model = KMeans(<parameters>)
y_new = model.fit_predict(X)
```

[Official Reference](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)

## Parameters
- `n_clusters`: number of clusters
- `init`: `"k-means++"`, `"random"`, or an array of shape `(n_clusters, n_features)`
- `n_init`: the $k$-means algorithm is run `n_init` times with different initial centers, and the run with the lowest inertia is kept
- `algorithm`: `"lloyd"` (named `"full"` in older releases) or `"elkan"`
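
The parameters above can be tried on a small synthetic data set. The following is a minimal sketch; the three Gaussian blobs and the starting centers in `init_centers` are made up for illustration and are not part of the exercises.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (an assumption for illustration): three Gaussian blobs in the plane.
rng = np.random.default_rng(0)
blob_centers = np.array([[0, 0], [5, 5], [0, 5]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])

# k-means++ initialization, restarted 10 times; the run with the lowest inertia is kept.
model_pp = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
y_pp = model_pp.fit_predict(X)

# Explicit initial centers: an array of shape (n_clusters, n_features).
# With fixed starting centers a single run (n_init=1) is enough.
init_centers = np.array([[1.0, 1.0], [4.0, 4.0], [1.0, 4.0]])
model_fixed = KMeans(n_clusters=3, init=init_centers, n_init=1)
y_fixed = model_fixed.fit_predict(X)

# Number of points assigned to each cluster under the two initializations.
print(np.bincount(y_pp), np.bincount(y_fixed))
```

Passing an array to `init` also fixes which label goes with which cluster, since cluster `j` starts from row `j` of the array.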

## Attributes
- `cluster_centers_`: an array of shape `(n_clusters, n_features)` whose rows are the cluster centers
- `labels_`: the cluster label of each point, i.e., the prediction on the original data
- `inertia_`: $\sum_{i,j} \|{\bf x}_i - \mu_j\|^2$ where the summation runs through all pairs $(i,j)$ such that ${\bf x}_i$ is in the $j$-th cluster with center $\mu_j$.
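
As a quick check of these attributes, the sketch below fits a model on made-up data (two Gaussian blobs, an assumption for illustration) and recomputes the inertia directly from the formula above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (an assumption for illustration): two Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=0.0, size=(40, 2)),
               rng.normal(loc=6.0, size=(40, 2))])

model = KMeans(n_clusters=2, n_init=10, random_state=0)
y_new = model.fit_predict(X)

centers = model.cluster_centers_   # shape (n_clusters, n_features)
labels = model.labels_             # one label per row of X

# Inertia from the definition: squared distance of each point
# to the center of the cluster it is assigned to, summed over all points.
inertia_manual = ((X - centers[labels]) ** 2).sum()

print(centers.shape, np.array_equal(labels, y_new))
print(inertia_manual, model.inertia_)
```

Here `centers[labels]` picks, for each point, the row of `cluster_centers_` that belongs to its cluster, so the sum runs exactly over the pairs $(i,j)$ in the formula.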

## Sample data

##### Exercise 1
Let  
```python
mu = np.array([3,4])
cov = np.array([[1.1,1],
                [1,1.1]])
X = np.random.multivariate_normal(mu, cov, 100)
```
Let `X_new` be the result of MDS on `X` .

###### 1(a)
Plot points (rows) in `X` .  
Plot points (rows) in `X_new` .  

In [None]:
### your answer here

###### 1(b)
Obtain `X_new` several times and redo 1(a).  
Are the results all similar, or can they be quite different?

In [None]:
### your answer here

##### Exercise 2
Let  
```python
X = np.genfromtxt('hidden_text.csv', delimiter=',')
```
All points in this data set lie in a two-dimensional plane embedded in a much higher-dimensional space.  
Can you find out what this data says?

In [None]:
### your answer here

##### Exercise 3
Let  
```python
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data
y = digits.target
```

###### 3(a)
Let `X_new` be the result of applying MDS to `X` with `n_components=2` .  
Plot the points (rows) in `X_new` with `c=y` .  
Print `model.stress_` .

In [None]:
### your answer here

###### 3(b)
Let `X_new` be the result of applying MDS to `X` with `n_components=3` .  
Plot the points (rows) in `X_new` with `c=y` .  
Print `model.stress_` .  
Is it lower than what you got in 3(a)?

In [None]:
### your answer here

##### Exercise 4
For each `precom` below, use it as the precomputed dissimilarity matrix and obtain `X_new` .  
Try to guess the answer beforehand.
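
One way to pass a precomputed matrix is sketched below. It assumes a scikit-learn release in which `MDS` accepts `dissimilarity="precomputed"` (this keyword may differ in other releases), and it uses a hypothetical matrix that is not one of the exercise inputs: three points on a line at positions 0, 1, 2.

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical dissimilarities (not one of the exercise matrices):
# pairwise distances of three points on a line at positions 0, 1, 2.
precom = np.array([[0, 1, 2],
                   [1, 0, 1],
                   [2, 1, 0]], dtype=float)

# Assumption: this scikit-learn version takes dissimilarity="precomputed".
# The input to fit_transform is then the square dissimilarity matrix itself.
model = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
X_new = model.fit_transform(precom)

print(X_new.shape)     # one row per point, n_components columns
print(model.stress_)
```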

###### 4(a)
Let  
```python
precom = np.array([[0,1,1],
                   [1,0,1],
                   [1,1,0]])
```
Apply MDS with `n_components=2` .  
Plot the points (rows) in `X_new` .

In [None]:
### your answer here

###### 4(b)
Let  
```python
precom = np.array([[0,1,1,1],
                   [1,0,1,1],
                   [1,1,0,1],
                   [1,1,1,0]])
```
Apply MDS with `n_components=2` .  
Plot the points (rows) in `X_new` .

In [None]:
### your answer here

###### 4(c)
Let  
```python
precom = np.array([[0,1,1,1],
                   [1,0,1,1],
                   [1,1,0,1],
                   [1,1,1,0]])
```
Apply MDS with `n_components=3` .  
Plot the points (rows) in `X_new` .

In [None]:
### your answer here

###### 4(d)
Let  
```python
precom = np.array([[0,1,2,1,1,2,3,2],
                   [1,0,1,2,2,1,2,3],
                   [2,1,0,1,3,2,1,2],
                   [1,2,1,0,2,3,2,1],
                   [1,2,3,2,0,1,2,1],
                   [2,1,2,3,1,0,1,2],
                   [3,2,1,2,2,1,0,1],
                   [2,3,2,1,1,2,1,0]])
precom = np.sqrt(precom)
```
Apply MDS with `n_components=2` .  
Plot the points (rows) in `X_new` .  

In [None]:
### your answer here

###### 4(e)
Let  
```python
precom = np.array([[0,1,2,1,1,2,3,2],
                   [1,0,1,2,2,1,2,3],
                   [2,1,0,1,3,2,1,2],
                   [1,2,1,0,2,3,2,1],
                   [1,2,3,2,0,1,2,1],
                   [2,1,2,3,1,0,1,2],
                   [3,2,1,2,2,1,0,1],
                   [2,3,2,1,1,2,1,0]])
precom = np.sqrt(precom)
```
Apply MDS with `n_components=3` .  
Plot the points (rows) in `X_new` .  

In [None]:
### your answer here

## Experiments

##### Exercise 5
Let  
```python
from sklearn.manifold import MDS

mu = np.array([3,4])
cov = np.array([[1.1,1],
                [1,1.1]])
X = np.random.multivariate_normal(mu, cov, 100)

# embed the rows of X into two dimensions
model = MDS(n_components=2)
X_new = model.fit_transform(X)
```

###### 5(a)
Print `X_new` and `model.embedding_` and check if they are the same.

In [None]:
### your answer here

###### 5(b)
Calculate the matrix `dist` of pairwise distances between the rows of `X` .  
Compare `dist` with `model.dissimilarity_matrix_` and check if they are the same.  

In [None]:
### your answer here

###### 5(c)
Calculate the matrix `dist_new` of pairwise distances between the rows of `X_new` .

In [None]:
### your answer here

###### 5(d)
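Calculate the stress $\sum_{i<j}(d_{ij}(X_{\rm new}) - \delta_{ij})^2$, where $d_{ij}(X_{\rm new})$ is the distance between rows $i$ and $j$ of `X_new` and $\delta_{ij}$ is the corresponding entry of `dist`, and compare it with `model.stress_` .  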

In [None]:
### your answer here

#### Remark
It seems that `model.stress_` is always slightly higher than the stress computed from $X_{\rm new}$.  
You may check the code by running:  
```python
from sklearn.manifold import _mds
_mds._smacof_single??
```
In the `for` loop, the stress is computed for $X=X_k$ and then $X$ is updated to $X_{k+1}$.  
The code returns the stress of $X_k$ together with the embedding $X_{k+1}$, which has lower stress.  
This seems to be a bug.
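
The ordering described in this remark can be imitated with a small NumPy sketch. This is not scikit-learn's code: `pairwise_dist`, `raw_stress`, and `guttman_step` are hypothetical helpers, the data are random, and the stress is taken to be $\sum_{i<j}(d_{ij}(X) - \delta_{ij})^2$. The update is the standard SMACOF (Guttman transform) step with unit weights, which never increases the stress.

```python
import numpy as np

def pairwise_dist(X):
    """Euclidean distances between all pairs of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def raw_stress(X, delta):
    """Sum over i < j of (d_ij(X) - delta_ij)^2."""
    return ((pairwise_dist(X) - delta) ** 2).sum() / 2

def guttman_step(X, delta):
    """One SMACOF update X_k -> X_{k+1} with unit weights."""
    d = pairwise_dist(X)
    d[d == 0] = 1e-5                     # avoid division by zero on the diagonal
    ratio = delta / d
    B = -ratio
    B[np.diag_indices_from(B)] += ratio.sum(axis=1)
    return B @ X / X.shape[0]

rng = np.random.default_rng(0)
delta = pairwise_dist(rng.normal(size=(20, 5)))   # target dissimilarities
X_k = rng.normal(size=(20, 2))                    # current embedding

stress_k = raw_stress(X_k, delta)         # stress measured before the update
X_next = guttman_step(X_k, delta)         # embedding after the update
stress_next = raw_stress(X_next, delta)   # stress of the updated embedding

print(stress_k, stress_next)
```

Reporting `stress_k` alongside `X_next` reproduces the mismatch: the reported value belongs to the previous iterate, so it is at least as large as the stress recomputed from the returned embedding.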