<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Issues-with-K-Means" data-toc-modified-id="Issues-with-K-Means-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Issues with K-Means</a></span><ul class="toc-item"><li><span><a href="#Demonstration-of-&quot;Incorrect&quot;-Clusters-are-Found" data-toc-modified-id="Demonstration-of-&quot;Incorrect&quot;-Clusters-are-Found-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Demonstration of "Incorrect" Clusters are Found</a></span><ul class="toc-item"><li><span><a href="#Example-1" data-toc-modified-id="Example-1-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Example 1</a></span><ul class="toc-item"><li><span><a href="#Solution" data-toc-modified-id="Solution-1.1.1.1"><span class="toc-item-num">1.1.1.1&nbsp;&nbsp;</span>Solution</a></span></li></ul></li><li><span><a href="#Example-2" data-toc-modified-id="Example-2-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span>Example 2</a></span><ul class="toc-item"><li><span><a href="#Solution" data-toc-modified-id="Solution-1.1.2.1"><span class="toc-item-num">1.1.2.1&nbsp;&nbsp;</span>Solution</a></span></li></ul></li></ul></li><li><span><a href="#In-Summary" data-toc-modified-id="In-Summary-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>In Summary</a></span></li><li><span><a href="#Datasets" data-toc-modified-id="Datasets-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Datasets</a></span></li></ul></li><li><span><a href="#Other-Clustering-Algorithms" data-toc-modified-id="Other-Clustering-Algorithms-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Other Clustering Algorithms</a></span></li></ul></div>

# Issues with K-Means

- Initial parameters matter: k-means is a hill-climbing method, so the result depends on where the centroids start
- Clusters are not always what we expect, because the algorithm can stop at a local minimum

## Demonstration of "Incorrect" Clusters are Found

> Can I find a local minimum where the "obvious" groups are incorrectly identified?

In [None]:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(27)

### Example 1

In [None]:
sigma_x = sigma_y = 0.1
n = 10
all_pts = []

for c_x,c_y in [[0,4],[0,2],[1,3]]:
    a_x = np.random.normal(c_x, sigma_x, n)
    a_y = np.random.normal(c_y, sigma_y, n)

    pts = np.column_stack((a_x, a_y))  # shape (n, 2): one row per point
    all_pts.append(pts)
    
    plt.scatter(pts[:,0], pts[:,1])
plt.show()

> Using **3 centroids**, find a local minimum where the two groups on the left are merged into one cluster and the group on the right is split between the other two centroids.

#### Solution

In [None]:
for pts in all_pts:
    plt.scatter(pts[:,0], pts[:,1], alpha=0.55)
    
# Centroids (stars): one between the two left groups, two splitting the right group
plt.scatter([0,1,1],[3,3.2,2.8], marker='*', s=200)
plt.show()
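To check that the starred positions really trap k-means, the sketch below regenerates three groups with the same centers and spread as Example 1, runs a plain NumPy version of the k-means update loop from the starred centroids, and compares the result with a run started at the true group centers. The `run_kmeans` helper, the `default_rng` seed, and the inertia comparison are assumptions added for illustration; the points are not the same draw as the cell above.

```python
import numpy as np

def run_kmeans(points, centroids, max_iter=100):
    """Plain k-means loop from the given starting centroids (illustrative helper)."""
    centroids = np.asarray(centroids, dtype=float).copy()
    for _ in range(max_iter):
        # Squared distance from every point to every centroid: shape (n_points, k)
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each centroid to the mean of its points; leave it in place if it has none
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and inertia (sum of squared distances to the nearest centroid)
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.argmin(axis=1), d2.min(axis=1).sum()

rng = np.random.default_rng(0)  # assumed seed, independent of np.random.seed(27)
group_centers = np.array([[0, 4], [0, 2], [1, 3]], dtype=float)
points = np.vstack([rng.normal(c, 0.1, size=(10, 2)) for c in group_centers])

starred = [[0, 3], [1, 3.2], [1, 2.8]]
_, trapped_labels, trapped_inertia = run_kmeans(points, starred)
_, good_labels, good_inertia = run_kmeans(points, group_centers)

print("labels of the two left groups:", trapped_labels[:20])
print(f"inertia from starred start: {trapped_inertia:.2f}")
print(f"inertia from true centers:  {good_inertia:.2f}")
```

The first 20 points (the two left groups) should all carry the same label in the trapped run, and its inertia should be far higher than the run started at the true centers, even though both runs have converged.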

### Example 2

In [None]:
sigma_x = 0.1
sigma_y = 0.5
n = 15

centers = [[0,2],[1,2]]
all_pts = []

for c_x,c_y in centers:
    a_x = np.random.normal(c_x, sigma_x, n)
    a_y = np.random.normal(c_y, sigma_y, n)

    pts = np.column_stack((a_x, a_y))  # shape (n, 2): one row per point
    all_pts.append(pts)
    
    plt.scatter(pts[:,0], pts[:,1])
plt.show()

> Using **2 centroids**, find a local minimum where the points are split into a top group and a bottom group rather than the obvious left and right groups.

#### Solution

In [None]:
for pts in all_pts:
    plt.scatter(pts[:,0], pts[:,1], alpha=0.55)
    
# Centroids (stars): both on the vertical midline between the groups, one above the other
plt.scatter([0.5,0.5],[2,1.5], marker='*', s=200, color='red')
plt.show()
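A minimal sketch of why these two starting centroids are a trap, using an idealized, noise-free stand-in for the data (two vertical columns at x = 0 and x = 1) so that the outcome does not depend on the random draw. The point coordinates, the `run_kmeans` helper, and the inertia comparison are assumptions added for illustration. With the notebook's random data, the same start may or may not stay in the top/bottom split, depending on the sample.

```python
import numpy as np

def run_kmeans(points, centroids, max_iter=100):
    """Plain k-means loop from the given starting centroids (illustrative helper)."""
    centroids = np.asarray(centroids, dtype=float).copy()
    for _ in range(max_iter):
        # Squared distance from every point to every centroid: shape (n_points, k)
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each centroid to the mean of its points; leave it in place if it has none
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and inertia (sum of squared distances to the nearest centroid)
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.argmin(axis=1), d2.min(axis=1).sum()

# Hypothetical idealized data: the same four heights in a left and a right column
ys = np.array([1.5, 1.9, 2.1, 2.5])
points = np.array([[x, y] for x in (0.0, 1.0) for y in ys])

starred = [[0.5, 2.0], [0.5, 1.5]]   # the starred centroids from the plot above
centroids, labels, trapped_inertia = run_kmeans(points, starred)
_, good_labels, good_inertia = run_kmeans(points, [[0, 2], [1, 2]])

print("left column labels: ", labels[:4])
print("right column labels:", labels[4:])
print(f"inertia from starred start: {trapped_inertia:.3f}")
print(f"inertia from column centers: {good_inertia:.3f}")
```

Because both starred centroids share the same x coordinate, every assignment is decided by height alone, so the two columns get identical labels and the centroids never leave the midline. On this idealized set the trapped run converges with an inertia of about 2.37, against 1.04 for the left/right split.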

## In Summary

> K-means relies on distance to the cluster centers (the boundaries between clusters are straight lines) → it looks for "blob"-shaped data
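One way to see the limitation: with two centers, nearest-center assignment cuts the plane along a straight line (the perpendicular bisector between the centers), so it can never put an inner ring in one cluster and the surrounding outer ring in the other. The sketch below tries many random pairs of centers on a hypothetical two-ring dataset and records the best agreement with the ring labels. The ring radii, point counts, seed, and the `assign_to_nearest` helper are assumptions for illustration.

```python
import numpy as np

def assign_to_nearest(points, centers):
    """Label each point with the index of its nearest center (illustrative helper)."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Hypothetical data: an inner ring (radius 1) inside an outer ring (radius 3)
angles = np.linspace(0, 2 * np.pi, 40, endpoint=False)
unit_circle = np.column_stack((np.cos(angles), np.sin(angles)))
points = np.vstack((1.0 * unit_circle, 3.0 * unit_circle))
ring = np.repeat([0, 1], len(angles))  # 0 = inner ring, 1 = outer ring

rng = np.random.default_rng(0)  # assumed seed
best_agreement = 0.0
for _ in range(2000):
    centers = rng.uniform(-4, 4, size=(2, 2))
    labels = assign_to_nearest(points, centers)
    match = np.mean(labels == ring)
    # Cluster ids are arbitrary, so also score the swapped labelling
    best_agreement = max(best_agreement, match, 1 - match)

print(f"best agreement with the true rings: {best_agreement:.2f}")
```

No pair of centers reaches full agreement, because any straight cut that keeps the whole outer ring on one side also keeps the inner ring on that side.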

## Datasets

A clustering algorithm's performance can depend on the structure and distribution of the data:

- Random (no structure)
- Crescents
- Concentric circles
- Clusters
- Clusters with varying density
- Line-shaped densities

![https://scikit-learn.org/stable/_images/sphx_glr_plot_cluster_comparison_0011.png](images/sklearn_clustering_dataset_comparisons.png)
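For experimenting with these shapes, the sketch below generates a few of them with plain NumPy. The function names, sample sizes, noise levels, and seed are assumptions for illustration and are not taken from the figure above.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed

def make_random(n):
    """Uniform points in the unit square, with no cluster structure."""
    return rng.uniform(0, 1, size=(n, 2))

def make_concentric_circles(n, radii=(1.0, 2.0), noise=0.05):
    """One noisy ring per radius around the origin, n points each."""
    rings = []
    for r in radii:
        theta = rng.uniform(0, 2 * np.pi, n)
        ring = r * np.column_stack((np.cos(theta), np.sin(theta)))
        rings.append(ring + rng.normal(0, noise, size=(n, 2)))
    return np.vstack(rings)

def make_crescents(n, noise=0.05):
    """Two interleaved half-circles, n points each."""
    theta = rng.uniform(0, np.pi, n)
    upper = np.column_stack((np.cos(theta), np.sin(theta)))
    theta = rng.uniform(0, np.pi, n)
    lower = np.column_stack((1 - np.cos(theta), 0.5 - np.sin(theta)))
    return np.vstack((upper, lower)) + rng.normal(0, noise, size=(2 * n, 2))

datasets = {
    "random": make_random(100),
    "crescents": make_crescents(50),
    "concentric circles": make_concentric_circles(50),
}
for name, data in datasets.items():
    print(name, data.shape)
```

Each array has one row per point, so it can be drawn with `plt.scatter(data[:, 0], data[:, 1])` in the same way as the examples earlier in this notebook.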

# Other Clustering Algorithms

These issues lead us to other algorithms that address the shortcomings of k-means: [Hierarchical Clustering](heirarchical_clustering.ipynb)