# KMeans with scikit-learn

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)  
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

In [None]:
import numpy as np
import matplotlib.pyplot as plt

## Code
```python
from sklearn.cluster import KMeans
model = KMeans(<parameters>)
y_new = model.fit_predict(X)
```

[Official Reference](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)

## Parameters
- `n_clusters`: number of clusters
- `init`: `"k-means++"`, `"random"`, or an array of shape `(n_clusters, n_features)`
- `n_init`: number of times the $k$-means algorithm is run with different initial centers; the run with the lowest inertia is kept
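
As a rough illustration of these parameters, the sketch below fits three models that differ only in `init`. The toy data `X_demo`, the starting centers `init_centers`, and the use of `random_state` for reproducibility are assumptions made for this example and are not part of the exercises.

```python
import numpy as np
from sklearn.cluster import KMeans

# hypothetical toy data: two well-separated blobs of 50 points each
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(-5, 1, (50, 2)),
                    rng.normal(5, 1, (50, 2))])

# k-means++ seeding, repeated n_init times
model_pp = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
# purely random initial centers, also repeated n_init times
model_rand = KMeans(n_clusters=2, init="random", n_init=10, random_state=0)
# explicit initial centers of shape (n_clusters, n_features); one run is enough
init_centers = np.array([[-5.0, -5.0], [5.0, 5.0]])
model_arr = KMeans(n_clusters=2, init=init_centers, n_init=1)

for name, model in [("k-means++", model_pp), ("random", model_rand), ("array", model_arr)]:
    labels = model.fit_predict(X_demo)
    # on data this easy, all three should report the same inertia and cluster sizes
    print(name, model.inertia_, np.bincount(labels))
```

On harder data the three choices can end at different local optima, which is why several runs are usually worth the cost.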

## Attributes
- `cluster_centers_`: an array of shape `(n_clusters, n_features)` whose rows are the cluster centers
- `labels_`: the cluster label of each point, i.e., the prediction on the original data
- `inertia_`: $\sum_{i,j} \|{\bf x}_i - \mu_j\|^2$ where the summation runs through all pairs $(i,j)$ such that ${\bf x}_i$ is in the $j$-th cluster with center $\mu_j$.
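
To make the inertia formula concrete, here is a minimal NumPy-only sketch on a hand-made data set. The arrays `points` and `labels` are hypothetical; the centers are taken to be the means of the points with each label.

```python
import numpy as np

# hypothetical toy data: four points, two clusters
points = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
labels = np.array([0, 0, 1, 1])

# each center is the mean of the points carrying that label
centers = np.array([points[labels == j].mean(axis=0) for j in range(2)])

# sum the squared distances cluster by cluster
inertia = sum(np.sum((points[labels == j] - centers[j]) ** 2) for j in range(2))
print(centers)   # rows are the cluster centers (0, 1) and (4, 1)
print(inertia)   # 4.0, since every point is at distance 1 from its center
```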

## Sample data

##### Exercise 1
Let  
```python
mu1 = np.array([2.5,0])
cov1 = np.array([[1.1,-1],
                [-1,1.1]])
mu2 = np.array([-2.5,0])
cov2 = np.array([[1.1,1],
                [1,1.1]])
X = np.vstack([np.random.multivariate_normal(mu1, cov1, 100), 
               np.random.multivariate_normal(mu2, cov2, 100)])
```

###### 1(a)
Apply the $k$-means algorithm to `X` with $k=2$ and get the prediction `y_new` .  
Plot the points (rows) in `X` with `c=y_new` .  
Print `model.inertia_` .

In [None]:
mu1 = np.array([2.5,0])   #random.multivariate_normal(mean,cov)
cov1 = np.array([[1.1,-1],  #mean 1D array, cov 2D array
         [-1,1.1]])
mu2 = np.array([-2.5,0])
cov2 = np.array([[1.1,1],
         [1,1.1]])
X = np.vstack([np.random.multivariate_normal(mu1, cov1, 100), 
               np.random.multivariate_normal(mu2, cov2, 100)])
from sklearn.cluster import KMeans
model = KMeans(2,n_init=10)
y_new = model.fit_predict(X)
plt.axis('equal')
plt.scatter(*X.T, c=y_new)
model.inertia_

###### 1(b)
Apply the $k$-means algorithm to `X` with $k=3$ and get the prediction `y_new` .  
Plot the points (rows) in `X` with `c=y_new` .  
Print `model.inertia_` .

In [None]:
mu1 = np.array([2.5,0])   #random.multivariate_normal(mean,cov)
cov1 = np.array([[1.1,-1],  #mean 1D array, cov 2D array
         [-1,1.1]])
mu2 = np.array([-2.5,0])
cov2 = np.array([[1.1,1],
         [1,1.1]])
X = np.vstack([np.random.multivariate_normal(mu1, cov1, 100), 
               np.random.multivariate_normal(mu2, cov2, 100)])
from sklearn.cluster import KMeans
model = KMeans(3,n_init=10)
y_new = model.fit_predict(X)
plt.axis('equal')
plt.scatter(*X.T, c=y_new)
model.inertia_

###### 1(c)
Run  
```python
ins = [KMeans(k).fit(X).inertia_ for k in range(1,6)]
plt.plot(np.arange(1,6), ins)
```
What does it mean?  
What is a good guess of the number of clusters?

In [None]:
ins = [KMeans(k,n_init=10).fit(X).inertia_ for k in range(1,6)]
plt.plot(np.arange(1,6), ins)
# This is the elbow method for choosing an appropriate k for clustering.
# We pick the k at the elbow of the curve, where the inertia stops dropping sharply.
# Here k = 2 seems to be a good guess.

##### Exercise 2
Let  
```python
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
```

###### 2(a)
Apply the $k$-means algorithm to `X` with $k=2$ and get the prediction `y_new` .  
Plot the points (rows) in `X` with `c=y_new` .  
(Each row is in $\mathbb{R}^4$.  
Just pick arbitrary two coordinates to plot the points.)  
Print `model.inertia_` .

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

from sklearn.cluster import KMeans
model = KMeans(2,n_init=10)
y_new = model.fit_predict(X)
plt.axis('equal')
plt.scatter(X[:,0].T,X[:,3].T,c=y_new)
model.inertia_

###### 2(b)
Apply the $k$-means algorithm to `X` with $k=3$ and get the prediction `y_new` .  
Plot the points (rows) in `X` with `c=y_new` .  
(Each row is in $\mathbb{R}^4$.  
Just pick arbitrary two coordinates to plot the points.)  
Print `model.inertia_` .

In [None]:
from sklearn.cluster import KMeans
model = KMeans(3,n_init=10)
y_new = model.fit_predict(X)
plt.axis('equal')
plt.scatter(X[:,0].T,X[:,3].T,c=y_new)
model.inertia_
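
Since the iris data comes with true species labels, one way to inspect the $k=3$ clustering is to count how the species are spread over the clusters. The sketch below is only an illustration; the names `X_iris`, `y_iris`, and `table`, and the fixed `random_state`, are assumptions of this example. Cluster indices are arbitrary, so cluster `0` need not correspond to species `0`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X_iris, y_iris = iris.data, iris.target

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_iris)

# table[i, j] = number of flowers of species i assigned to cluster j
table = np.zeros((3, 3), dtype=int)
np.add.at(table, (y_iris, labels), 1)
print(table)
```

A row with a single large entry means that species was recovered almost exactly; a row split over two columns means the clustering mixed it with another species.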

###### 2(c)
Run  
```python
ins = [KMeans(k).fit(X).inertia_ for k in range(1,6)]
plt.plot(np.arange(1,6), ins)
```
What does it mean?  
What is a good guess of the number of clusters?

In [None]:
ins = [KMeans(k,n_init=10).fit(X).inertia_ for k in range(1,6)]
plt.plot(np.arange(1,6), ins)
# This is the elbow method for choosing an appropriate k for clustering.
# We pick the k at the elbow of the curve, where the inertia stops dropping sharply.
# Here k = 2 seems to be a good guess.

##### Jephian:
  
Indeed, `k=2` looks good.

However, this data set has three categories, so the elbow method only provides a clue for your choice of `k`.
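
The elbow can also be located numerically instead of by eye. The sketch below uses one crude heuristic (the largest change between consecutive drops in inertia), which is an assumption of this example and not a rule from scikit-learn; the toy data `X_demo` is hypothetical as well.

```python
import numpy as np
from sklearn.cluster import KMeans

# hypothetical toy data: two well-separated blobs
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(-5, 1, (100, 2)),
                    rng.normal(5, 1, (100, 2))])

ks = np.arange(1, 6)
ins = np.array([KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo).inertia_
                for k in ks])

drops = ins[:-1] - ins[1:]             # decrease in inertia from k to k+1
bend = drops[:-1] - drops[1:]          # how much the decrease slows down at k = 2, 3, 4
k_guess = ks[1:-1][np.argmax(bend)]    # the k where the curve bends the most
print(ins)
print(k_guess)
```

Like reading the plot, this only gives a hint: the inertia keeps decreasing as `k` grows, so it cannot select `k` on its own.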

##### Exercise 3
Let  
```python
arr = plt.imread('incrediville-side.jpg')
m,n,c = arr.shape
X = arr.reshape(-1,3)
```

###### 3(a)
Print `X.shape` .  
What are the rows in `X`?

In [None]:
arr = plt.imread('incrediville-side.jpg')
m,n,c = arr.shape
X = arr.reshape(-1,3)
print(X[:,:])
X.shape
# each row of X holds the RGB values of one pixel
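
The effect of `reshape(-1,3)` can be checked on a tiny array. In this sketch `arr_demo` is a hypothetical 2 x 3 RGB image standing in for the JPEG file, assuming the image has exactly three color channels.

```python
import numpy as np

# hypothetical 2 x 3 RGB image standing in for the JPEG file
arr_demo = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)
m, n, c = arr_demo.shape

X_demo = arr_demo.reshape(-1, 3)
print(X_demo.shape)   # (6, 3): one row per pixel, one column per color channel

# row i*n + j of X_demo is the pixel at position (i, j)
i, j = 1, 2
print(np.array_equal(X_demo[i * n + j], arr_demo[i, j]))   # True

# reshaping back recovers the image
print(np.array_equal(X_demo.reshape(m, n, c), arr_demo))   # True
```

The same row ordering is what allows `y_new.reshape(m, n)` to place each label back at its pixel position.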

###### 3(b)
Apply the $k$-means algorithm to `X` and obtain `y_new` .  
Let  
```python
img = (y_new == 0).reshape(m, n)
plt.imshow(img, cmap='Greys')
```
Change `0` to `1` or `2` .  
What do these pictures mean?  
Is the black region always connected?

In [None]:
arr = plt.imread('incrediville-side.jpg')
m,n,c = arr.shape
X = arr.reshape(-1,3)
model = KMeans(3,n_init=10)
y_new = model.fit_predict(X)
img = (y_new == 0).reshape(m, n)
plt.imshow(img, cmap='Greys')

In [None]:
# reuse y_new from the cell above: refitting may permute the cluster labels
img = (y_new == 1).reshape(m, n)
plt.imshow(img, cmap='Greys')

In [None]:
# reuse y_new from the first cell: refitting may permute the cluster labels
img = (y_new == 2).reshape(m, n)
plt.imshow(img, cmap='Greys')

We can observe that the black region is not always connected.

The pixels are separated into 3 clusters, and the `'Greys'` colormap shows the pixels of the selected cluster in black.
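
Connectivity can also be counted rather than judged by eye. The sketch below uses `scipy.ndimage.label` on a hypothetical 3 x 3 label image `labels_img`, which stands in for `y_new.reshape(m, n)`. Because $k$-means only looks at colors and not at pixel positions, nothing forces a cluster to form a single piece.

```python
import numpy as np
from scipy import ndimage

# hypothetical 3 x 3 label image standing in for y_new.reshape(m, n)
labels_img = np.array([[0, 1, 0],
                       [1, 1, 1],
                       [0, 1, 0]])

for j in range(2):
    mask = labels_img == j
    # ndimage.label numbers the connected pieces of the True region
    _, n_pieces = ndimage.label(mask)
    print(j, n_pieces)
# cluster 0 (the four corners) has 4 pieces, cluster 1 (the cross) has 1
```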





## Experiments

##### Exercise 4
Let  
```python
mu1 = np.array([2.5,0])
cov1 = np.array([[1.1,-1],
                [-1,1.1]])
mu2 = np.array([-2.5,0])
cov2 = np.array([[1.1,1],
                [1,1.1]])
X = np.vstack([np.random.multivariate_normal(mu1, cov1, 100), 
               np.random.multivariate_normal(mu2, cov2, 100)])
```

###### 4(a)
Apply the $k$-means algorithm to `X` and obtain the prediction `y_new` .  
Plot the points (rows) in `X` with `c=y_new` .  
Plot the points (rows) in `model.cluster_centers_` with `c='r'` .

In [None]:
mu1 = np.array([2.5,0])
cov1 = np.array([[1.1,-1],
         [-1,1.1]])
mu2 = np.array([-2.5,0])
cov2 = np.array([[1.1,1],
          [1,1.1]])
X = np.vstack([np.random.multivariate_normal(mu1, cov1, 100), 
        np.random.multivariate_normal(mu2, cov2, 100)])
model = KMeans(2,n_init=10)
y_new = model.fit_predict(X)
plt.scatter(*X.T,c=y_new)
plt.scatter(model.cluster_centers_[:,0].T,model.cluster_centers_[:,1].T,c='r')
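
The red points should sit at the averages of their clusters. A minimal check, assuming the algorithm has converged, is sketched below; the toy data `X_demo` and the name `model_demo` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# hypothetical toy data: two well-separated blobs
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(-5, 1, (100, 2)),
                    rng.normal(5, 1, (100, 2))])

model_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)

for j in range(2):
    members = X_demo[model_demo.labels_ == j]
    # the fitted center and the mean of the cluster members should agree
    print(model_demo.cluster_centers_[j], members.mean(axis=0))
```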

###### 4(b)
Check if `y_new` and `model.labels_` are the same.

In [None]:
from sklearn.cluster import KMeans
mu1 = np.array([2.5,0])
cov1 = np.array([[1.1,-1],
         [-1,1.1]])
mu2 = np.array([-2.5,0])
cov2 = np.array([[1.1,1],
          [1,1.1]])
X = np.vstack([np.random.multivariate_normal(mu1, cov1, 100), 
        np.random.multivariate_normal(mu2, cov2, 100)])
model = KMeans(2,n_init=10)
y_new = model.fit_predict(X)
np.all(y_new == model.labels_)

###### 4(c)
Compute the inertia and compare your answer with `model.inertia_` .  
Recall that the inertia of a clustered data is  
$$\sum_{i,j} \|{\bf x}_i - \mu_j\|^2,$$ 
where the summation runs through all pairs $(i,j)$ such that ${\bf x}_i$ is in the $j$-th cluster with center $\mu_j$.

In [None]:
for z in range(1,6):
    model = KMeans(z, n_init=10)
    y_new = model.fit_predict(X)
    # sum of squared distances from each point to its assigned center
    inertia = np.sum((X - model.cluster_centers_[y_new])**2)
    # compare with the same fitted model, up to floating-point error
    print(np.isclose(inertia, model.inertia_))

##### Veronica:

```python
kmean_inertia = np.sum((X - model.cluster_centers_[y_new])**2)
np.isclose(kmean_inertia, model.inertia_)  # exact == can fail from rounding
```  
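
When comparing a hand-computed inertia with `model.inertia_`, a tolerance-based test is safer than `==`, because two floating-point computations of the same quantity can differ in the last digits. The sketch below shows the effect on a standard small example that is unrelated to the data above.

```python
import numpy as np

a = 0.1 + 0.2
b = 0.3

print(a == b)             # False: the two floats differ in the last bits
print(np.isclose(a, b))   # True: equal up to a small tolerance
```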

##### Exercise 5
Let  
```python
X = 5 * np.random.randn(1000,2)
lengths = np.linalg.norm(X, axis=1)
band1 = (lengths > 1) & (lengths <2)  
band2 = (lengths > 3) & (lengths <4)  
X = X[band1 | band2, :]
```
Apply the $k$-means algorithm to `X` with $k=2$  
(or other $k$ if you wish)  
and get the prediction `y_new` .  
Plot the points (rows) in `X` with `c=y_new` .  
Plot the cluster centers in red.  
Is it a good clustering?  
Why is it good?  Or why does it not work well?

In [None]:
X = 5 * np.random.randn(1000,2)
lengths = np.linalg.norm(X, axis=1)
band1 = (lengths > 1) & (lengths <2)  
band2 = (lengths > 3) & (lengths <4)  
X = X[band1 | band2, :]
plt.scatter(*X.T)
model = KMeans(2,n_init=10)
y_new = model.fit_predict(X)
plt.axis('equal')
plt.scatter(*X.T,c=y_new)
plt.scatter(model.cluster_centers_[:,0].T,model.cluster_centers_[:,1].T,c = 'red')
# The data consists of two concentric rings. If we want to separate the two rings,
# k-means is not a good choice: with k=2 the boundary between the clusters is always a straight line.
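
The mismatch can be quantified by comparing the clusters with the ring each point belongs to. The sketch below uses an idealized version of the data, with points placed exactly on circles of radius 1.5 and 3.5; the names `X_rings`, `ring`, and `agreement` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# hypothetical clean version of the two bands: points on circles of radius 1.5 and 3.5
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
X_rings = np.vstack([1.5 * circle, 3.5 * circle])
ring = np.repeat([0, 1], 200)          # which ring each point lies on

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_rings)

# fraction of points whose cluster matches its ring, allowing for swapped label names
match = np.mean(labels == ring)
agreement = max(match, 1 - match)
print(agreement)
```

An agreement near 0.5 means the clustering is no better than a coin flip at telling the rings apart, which is what a straight-line split of two concentric rings would be expected to give.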