# KMeans from scratch

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)  
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

In [None]:
import numpy as np
import matplotlib.pyplot as plt

## Algorithm
**Input:**  
- `X` (whose `N` rows are samples and `d` columns are features)
- `k`: number of clusters
- `init`: `"random"` or an array of shape `(k,d)`
    - if `"random"`, `k` points are chosen randomly from `X` as the initial cluster centers
    - if an array, the array is used as the initial cluster centers
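
The two cases of `init` could be handled along the following lines. This is only a sketch: the helper name `init_centers` and its `rng` argument are hypothetical and not part of the specification above, and sampling rows without replacement is an assumption (the text only says the points are chosen randomly).

```python
import numpy as np

def init_centers(X, k, init="random", rng=None):
    """Return a (k, d) float array of initial centers (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    if isinstance(init, str):
        if init != "random":
            raise ValueError('init must be "random" or an array of shape (k, d)')
        # rng may be None, an integer seed, or a numpy Generator
        rng = np.random.default_rng(rng)
        # assumption: pick k distinct row indices, so no row is chosen twice
        idx = rng.choice(X.shape[0], size=k, replace=False)
        return X[idx].copy()
    centers = np.asarray(init, dtype=float)
    if centers.shape != (k, X.shape[1]):
        raise ValueError("init array must have shape (k, d)")
    # copy so that later updates do not modify the caller's array
    return centers.copy()
```

Passing an integer as `rng` makes the random choice reproducible, which is convenient when comparing runs.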

**Output:**  
A tuple `(y_new, centers)`.  
- `y_new`: an array of shape `(N,)` that records the label of each sample, an integer in `0, ..., k-1`
- `centers`: an array of shape `(k,d)` that records the cluster centers

**Steps:**
1. Initialize a collection of centers $\mu_0,\ldots,\mu_{k-1}$:  
    - if `init` is an array, the centers are the rows of `init`.
    - if `init=="random"`, the centers are chosen as `k` random rows of `X`.  
2. Label each sample ${\bf x}_i$ by $j$ if $\mu_j$ is the closest center to ${\bf x}_i$.
3. Call the points with label $j$ group $j$.  Update $\mu_j$ to be the mean of the points in group $j$.  
4. Repeat Steps 2 and 3 until `y_new` no longer changes.
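
Steps 2 to 4 can be sketched with NumPy as below. This is one possible arrangement, not a reference solution: the function names, the `max_iter` safeguard, and the choice to leave an empty group's center unchanged are all assumptions that the steps above do not specify. The sketch takes the initial centers from Step 1 as an argument.

```python
import numpy as np

def assign_labels(X, centers):
    """Step 2: label each sample by the index of its closest center."""
    # dist2[i, j] = squared Euclidean distance from sample i to center j
    dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dist2.argmin(axis=1)

def update_centers(X, y, centers):
    """Step 3: move each center to the mean of its group."""
    new_centers = centers.copy()
    for j in range(centers.shape[0]):
        members = X[y == j]
        # assumption: a group with no members keeps its previous center
        if len(members) > 0:
            new_centers[j] = members.mean(axis=0)
    return new_centers

def kmeans_from_centers(X, centers, max_iter=100):
    """Step 4: alternate Steps 2 and 3 until the labels stop changing."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    y_new = assign_labels(X, centers)
    for _ in range(max_iter):  # assumption: cap the number of rounds
        centers = update_centers(X, y_new, centers)
        y_old, y_new = y_new, assign_labels(X, centers)
        if np.array_equal(y_old, y_new):
            break
    return y_new, centers
```

The distance computation uses broadcasting to build an `(N, k)` table in one expression; for large `N` and `k` this intermediate `(N, k, d)` array can be memory hungry, and a loop over centers may be preferable.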

## Pseudocode
Translate the algorithm into pseudocode.  
This helps you identify the parts that you do not yet know how to implement.  

    1. 
    2. 
    3. ...

## Code

In [None]:
### your answer here

## Test
Take some sample data from [MDS-with-scikit-learn](MDS-with-scikit-learn.ipynb) and check whether your code generates outputs similar to those of existing packages.
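
When comparing labels from two implementations, keep in mind that the cluster numbering is arbitrary: one run may call a group `0` while another calls the same group `1`. A small check such as the following can tell whether two label arrays describe the same grouping up to renaming. The helper name `same_partition` is hypothetical, and this is only one possible way to do the comparison.

```python
import numpy as np

def same_partition(y_a, y_b):
    """Return True if y_a and y_b group the samples identically,
    ignoring how the groups are numbered (hypothetical helper)."""
    y_a = np.asarray(y_a).ravel().tolist()
    y_b = np.asarray(y_b).ravel().tolist()
    if len(y_a) != len(y_b):
        return False
    # distinct (label in y_a, label in y_b) pairs that actually occur
    pairs = set(zip(y_a, y_b))
    # the groupings match exactly when the labels pair up one-to-one
    return len(pairs) == len(set(y_a)) == len(set(y_b))
```

For example, `same_partition([0, 0, 1, 1], [1, 1, 0, 0])` is `True`, while `same_partition([0, 0, 1, 1], [0, 1, 1, 1])` is `False`.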

##### Name of the data
Description of the data.

In [None]:
### results with your code

In [None]:
### results with existing packages

## Comparison

##### Exercise 1
Try setting `verbose=2`.  
Check if the stress is decreasing.

In [None]:
### your answer here

##### Exercise 2
Let  
```python
import scipy.linalg as LA
arr = np.random.randn(10,10)
Q,R = LA.qr(arr)
```
Let $X_k$ be the output of applying your MDS function to the `hidden_text.csv` data with `r=2`.  
Plot the points (rows) in $X_k$.  
Plot the points (rows) in $X_kQ$.  
Compute the stress of $X_k$ and the stress of $X_kQ$.  
(Some rotations do not change the stress.)

In [None]:
### your answer here

##### Exercise 3
Apply your MDS function to the `hidden_text.csv` data with `r=2`.  
How low can the stress be?

In [None]:
### your answer here