# K-means clustering 
Clustering is a computational technique that divides the points in a data set into groups. A successful clustering produces groups whose points are related to one another. Whether those relationships are meaningful generally has to be verified by a human.

In clustering, the group (a.k.a. cluster) that a data point belongs to is not predetermined, but instead is decided during the run of the clustering algorithm. In fact, the algorithm is not guided to place any particular data point in any particular cluster by presupposed information. For this reason, clustering is considered an unsupervised method within the realm of machine learning. You can think of unsupervised as meaning not guided by foreknowledge.

K-means is a clustering algorithm that attempts to group data points into a certain predefined number of clusters, based on each point’s relative distance to the center of the cluster. In every round of k-means, the distance between every data point and every center of a cluster (a point known as a centroid) is calculated. Points are assigned to the cluster whose centroid they are closest to. Then the algorithm recalculates all of the centroids, finding the mean of each cluster’s assigned points and replacing the old centroid with the new mean. The process of assigning points and recalculating centroids continues until the centroids stop moving or a certain number of iterations occurs. 
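
The two calculations performed in each round can be written out explicitly. The notation here is an assumption introduced for illustration: $p$ is a data point with $n$ dimensions, $c_j$ is the centroid of cluster $j$, and $S_j$ is the set of points currently assigned to cluster $j$. Distance is taken to be Euclidean, which matches the `DataPoint.distance` method in the implementation in this section.

$$ d(p, c_j) = \sqrt{\sum_{i=1}^{n} \left(p_i - c_{j,i}\right)^2} $$

$$ c_j \leftarrow \frac{1}{|S_j|} \sum_{p \in S_j} p $$

The first expression is evaluated for every point and every centroid during assignment. The second replaces each centroid with the per-dimension mean of its assigned points.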

Here is our k-means clustering algorithm:

1. Initialize all of the data points and “k” empty clusters.
2. Normalize all of the data points.
3. Create random centroids associated with each cluster.
4. Assign each data point to the cluster of the centroid it is closest to.
5. Recalculate each centroid so it is the center (mean) of the cluster it is associated with.
6. Repeat steps 4 and 5 until a maximum number of iterations is reached or the centroids stop moving (convergence).


In [1]:
from __future__ import annotations
from typing import Iterator, Tuple, List, Iterable
from math import sqrt

class DataPoint:
    def __init__(self, initial: Iterable[float]) -> None:
        self._originals: Tuple[float, ...] = tuple(initial)
        self.dimensions: Tuple[float, ...] = tuple(initial)

    @property
    def num_dimensions(self) -> int:
        return len(self.dimensions)

    def distance(self, other: DataPoint) -> float:
        combined: Iterator[Tuple[float, float]] = zip(self.dimensions, other.dimensions)
        differences: List[float] = [(x - y) ** 2 for x, y in combined]
        return sqrt(sum(differences))

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, DataPoint):
            return NotImplemented
        return self.dimensions == other.dimensions

    def __repr__(self) -> str:
        return self._originals.__repr__()

zscores() converts a sequence of floats into a list of floats with the original numbers’ respective z-scores relative to all of the numbers in the original sequence.

In [2]:
from typing import TypeVar, Generic, List, Sequence
from copy import deepcopy
from functools import partial
from random import uniform
from statistics import mean, pstdev
from dataclasses import dataclass

def zscores(original: Sequence[float]) -> List[float]:
    avg: float = mean(original)
    std: float = pstdev(original)
    if std == 0: # return all zeros if there is no variation
        return [0.0] * len(original)
    return [(x - avg) / std for x in original]
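
To get a feel for what `zscores()` returns, here is a small self-contained check. It repeats the function so the block runs on its own. The sample values are made up for illustration and were chosen so that the mean is 5 and the population standard deviation is 2.

```python
from statistics import mean, pstdev
from typing import List, Sequence

def zscores(original: Sequence[float]) -> List[float]:
    avg: float = mean(original)
    std: float = pstdev(original)
    if std == 0:  # return all zeros if there is no variation
        return [0.0] * len(original)
    return [(x - avg) / std for x in original]

# Hypothetical sample data: mean is 5.0, population standard deviation is 2.0
sample: List[float] = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(zscores(sample))      # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]

# A sequence with no variation maps to all zeros instead of dividing by zero
print(zscores([3.0, 3.0]))  # [0.0, 0.0]
```

Each z-score says how many standard deviations a value lies from the mean. This puts dimensions measured in different units, such as longitude and age, on a comparable scale before distances are computed.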

In [3]:
# We will implement a class for maintaining state and running the algorithm
Point = TypeVar('Point', bound=DataPoint)

class KMeans(Generic[Point]):
    
    @dataclass
    class Cluster:
        points: List[Point]
        centroid: DataPoint
    
    def __init__(self, k: int, points: List[Point]) -> None:
        if k < 1: # k-means can't do negative or zero clusters
            raise ValueError("k must be >= 1")
        self._points: List[Point] = points
        self._zscore_normalize()
        # initialize empty clusters with random centroids
        self._clusters: List[KMeans.Cluster] = []
        for _ in range(k):
            rand_point: DataPoint = self._random_point()
            cluster: KMeans.Cluster = KMeans.Cluster([], rand_point)
            self._clusters.append(cluster)

    @property
    def _centroids(self) -> List[DataPoint]:
        return [x.centroid for x in self._clusters]
    
    #a convenience method that can be thought of as returning a column of data
    def _dimension_slice(self, dimension: int) -> List[float]:
        return [x.dimensions[dimension] for x in self._points]
    
    #replaces the values in the dimensions tuple of every data point with its z-scored equivalent
    def _zscore_normalize(self) -> None:
        zscored: List[List[float]] = [[] for _ in range(len(self._points))]
        for dimension in range(self._points[0].num_dimensions):
            dimension_slice: List[float] = self._dimension_slice(dimension)
            for index, zscore in enumerate(zscores(dimension_slice)):
                zscored[index].append(zscore)
        for i in range(len(self._points)):
            self._points[i].dimensions = tuple(zscored[i])
            
    #method to create the initial random centroids for each cluster
    def _random_point(self) -> DataPoint:
        rand_dimensions: List[float] = []
        for dimension in range(self._points[0].num_dimensions):
            values: List[float] = self._dimension_slice(dimension)
            rand_value: float = uniform(min(values), max(values))
            rand_dimensions.append(rand_value)
        return DataPoint(rand_dimensions)
    
    # Find the closest cluster centroid to each point and assign the point to that cluster
    def _assign_clusters(self) -> None:
        for point in self._points:
            closest: DataPoint = min(self._centroids, key=partial(DataPoint.distance, point))
            idx: int = self._centroids.index(closest)
            cluster: KMeans.Cluster = self._clusters[idx]
            cluster.points.append(point)
            
    # Find the center of each cluster and move the centroid to there
    def _generate_centroids(self) -> None:
        for cluster in self._clusters:
            if len(cluster.points) == 0: # keep the same centroid if no points
                continue
            means: List[float] = []
            for dimension in range(cluster.points[0].num_dimensions):
                dimension_slice: List[float] = [p.dimensions[dimension] for p in cluster.points]
                means.append(mean(dimension_slice))
            cluster.centroid = DataPoint(means)
            
    def run(self, max_iterations: int = 100) -> List[KMeans.Cluster]:
        for iteration in range(max_iterations):
            for cluster in self._clusters: # clear all clusters
                cluster.points.clear()
            self._assign_clusters() # find cluster each point is closest to
            old_centroids: List[DataPoint] = deepcopy(self._centroids) # record
            self._generate_centroids() # find new centroids
            if old_centroids == self._centroids: # have centroids moved?
                print(f"Converged after {iteration} iterations")
                return self._clusters
        return self._clusters

In [4]:
point1: DataPoint = DataPoint([2.0, 1.0, 1.0])
point2: DataPoint = DataPoint([2.0, 2.0, 5.0])
point3: DataPoint = DataPoint([3.0, 1.5, 2.5])
kmeans_test: KMeans[DataPoint] = KMeans(2, [point1, point2, point3])
test_clusters: List[KMeans.Cluster] = kmeans_test.run()
    
for index, cluster in enumerate(test_clusters):
    print(f"Cluster {index}: {cluster.points}")

Converged after 1 iterations
Cluster 0: [(3.0, 1.5, 2.5)]
Cluster 1: [(2.0, 1.0, 1.0), (2.0, 2.0, 5.0)]


## Clustering governors by age and longitude 
Each data point pairs a state's longitude with the age of its governor. By running k-means on these two dimensions, perhaps we can find clusters of states with similar longitudes and similar-age governors.

In [5]:
from typing import List

class Governor(DataPoint):
    def __init__(self, longitude: float, age: float, state: str) -> None:
        super().__init__([longitude, age])
        self.longitude = longitude
        self.age = age
        self.state = state

    def __repr__(self) -> str:
        return f"{self.state}: (longitude: {self.longitude}, age: {self.age})"

We will run k-means with k set to 2. It cannot be emphasized enough that your results with k-means using random initialization of centroids will vary. Be sure to run k-means multiple times with any data set. 

In [6]:
governors: List[Governor] = [
    Governor(-86.79113, 72, "Alabama"),
    Governor(-152.404419, 66, "Alaska"),
    Governor(-111.431221, 53, "Arizona"),
    Governor(-92.373123, 66, "Arkansas"),
    Governor(-119.681564, 79, "California"),
    Governor(-105.311104, 65, "Colorado"),
    Governor(-72.755371, 61, "Connecticut"),
    Governor(-75.507141, 61, "Delaware"),
    Governor(-81.686783, 64, "Florida"),
    Governor(-83.643074, 74, "Georgia"),
    Governor(-157.498337, 60, "Hawaii"),
    Governor(-114.478828, 75, "Idaho"),
    Governor(-88.986137, 60, "Illinois"),
    Governor(-86.258278, 49, "Indiana"),
    Governor(-93.210526, 57, "Iowa"),
    Governor(-96.726486, 60, "Kansas"),
    Governor(-84.670067, 50, "Kentucky"),
    Governor(-91.867805, 50, "Louisiana"),
    Governor(-69.381927, 68, "Maine"),
    Governor(-76.802101, 61, "Maryland"),
    Governor(-71.530106, 60, "Massachusetts"),
    Governor(-84.536095, 58, "Michigan"),
    Governor(-93.900192, 70, "Minnesota"),
    Governor(-89.678696, 62, "Mississippi"),
    Governor(-92.288368, 43, "Missouri"),
    Governor(-110.454353, 51, "Montana"),
    Governor(-98.268082, 52, "Nebraska"),
    Governor(-117.055374, 53, "Nevada"),
    Governor(-71.563896, 42, "New Hampshire"),
    Governor(-74.521011, 54, "New Jersey"),
    Governor(-106.248482, 57, "New Mexico"),
    Governor(-74.948051, 59, "New York"),
    Governor(-79.806419, 60, "North Carolina"),
    Governor(-99.784012, 60, "North Dakota"),
    Governor(-82.764915, 65, "Ohio"),
    Governor(-96.928917, 62, "Oklahoma"),
    Governor(-122.070938, 56, "Oregon"),
    Governor(-77.209755, 68, "Pennsylvania"),
    Governor(-71.51178, 46, "Rhode Island"),
    Governor(-80.945007, 70, "South Carolina"),
    Governor(-99.438828, 64, "South Dakota"),
    Governor(-86.692345, 58, "Tennessee"),
    Governor(-97.563461, 59, "Texas"),
    Governor(-111.862434, 70, "Utah"),
    Governor(-72.710686, 58, "Vermont"),
    Governor(-78.169968, 60, "Virginia"),
    Governor(-121.490494, 66, "Washington"),
    Governor(-80.954453, 66, "West Virginia"),
    Governor(-89.616508, 49, "Wisconsin"),
    Governor(-107.30249, 55, "Wyoming"),
]
    
kmeans: KMeans[Governor] = KMeans(2, governors)
gov_clusters: List[KMeans.Cluster] = kmeans.run()
for index, cluster in enumerate(gov_clusters):
    print(f"Cluster {index}: {cluster.points}\n")

Converged after 4 iterations
Cluster 0: [Alaska: (longitude: -152.404419, age: 66), Arizona: (longitude: -111.431221, age: 53), California: (longitude: -119.681564, age: 79), Colorado: (longitude: -105.311104, age: 65), Hawaii: (longitude: -157.498337, age: 60), Idaho: (longitude: -114.478828, age: 75), Montana: (longitude: -110.454353, age: 51), Nevada: (longitude: -117.055374, age: 53), New Mexico: (longitude: -106.248482, age: 57), Oregon: (longitude: -122.070938, age: 56), Utah: (longitude: -111.862434, age: 70), Washington: (longitude: -121.490494, age: 66), Wyoming: (longitude: -107.30249, age: 55)]

Cluster 1: [Alabama: (longitude: -86.79113, age: 72), Arkansas: (longitude: -92.373123, age: 66), Connecticut: (longitude: -72.755371, age: 61), Delaware: (longitude: -75.507141, age: 61), Florida: (longitude: -81.686783, age: 64), Georgia: (longitude: -83.643074, age: 74), Illinois: (longitude: -88.986137, age: 60), Indiana: (longitude: -86.258278, age: 49), Iowa: (longitude: -93.21
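
The earlier advice to run k-means several times can be made concrete with random restarts: run the algorithm repeatedly and keep the run with the smallest total squared distance from each point to its centroid. The sketch below is not the `KMeans` class from this section. It is a stripped-down version on plain tuples, and it makes several assumptions: the helper names (`assign`, `kmeans_once`, `within_cluster_ss`, `best_of_runs`) are hypothetical, initial centroids are sampled from the data points rather than drawn uniformly at random, normalization is skipped, and the sample points are made up.

```python
import random
from math import dist
from statistics import mean
from typing import List, Tuple

Vector = Tuple[float, ...]

def assign(points: List[Vector], centroids: List[Vector]) -> List[int]:
    # For each point, the index of the nearest centroid (Euclidean distance)
    return [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points]

def kmeans_once(points: List[Vector], k: int, rng: random.Random,
                max_iterations: int = 100) -> Tuple[List[Vector], List[int]]:
    # Assumption: start from k distinct data points chosen at random
    centroids: List[Vector] = rng.sample(points, k)
    for _ in range(max_iterations):
        labels = assign(points, centroids)
        updated: List[Vector] = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            # Keep the old centroid if a cluster ends up empty
            updated.append(tuple(mean(col) for col in zip(*members))
                           if members else centroids[j])
        if updated == centroids:  # centroids stopped moving
            break
        centroids = updated
    return centroids, assign(points, centroids)

def within_cluster_ss(points: List[Vector], centroids: List[Vector],
                      labels: List[int]) -> float:
    # Total squared distance from each point to its assigned centroid
    return sum(dist(p, centroids[label]) ** 2
               for p, label in zip(points, labels))

def best_of_runs(points: List[Vector], k: int, runs: int = 10,
                 seed: int = 0) -> Tuple[List[Vector], List[int]]:
    rng = random.Random(seed)  # seeded so the comparison is repeatable
    results = [kmeans_once(points, k, rng) for _ in range(runs)]
    return min(results, key=lambda r: within_cluster_ss(points, *r))

# Made-up data with two well-separated groups
demo_points: List[Vector] = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
                             (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
best_centroids, best_labels = best_of_runs(demo_points, k=2)
print(best_labels)
print(within_cluster_ss(demo_points, best_centroids, best_labels))
```

A lower total squared distance means tighter clusters, but it is only a numeric score. Whether the winning clustering is meaningful still needs human judgment.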

## Clustering Michael Jackson albums by length
Michael Jackson released 10 solo studio albums. In the following example, we will cluster those albums by looking at two dimensions: album length (in minutes) and number of tracks. This example is a nice contrast with the preceding governors example because it is easy to see the clusters in the original data set without even running k-means. An example like this can be a good way of debugging an implementation of a clustering algorithm. 

In [7]:
class Album(DataPoint):
    def __init__(self, name: str, year: int, length: float, tracks: float) -> None:
        super().__init__([length, tracks])
        self.name = name
        self.year = year
        self.length = length
        self.tracks = tracks

    def __repr__(self) -> str:
        return f"{self.name}, {self.year}"

In [8]:
albums: List[Album] = [Album("Got to Be There", 1972, 35.45, 10),
    Album("Ben", 1972, 31.31, 10),
    Album("Music & Me", 1973, 32.09, 10),
    Album("Forever, Michael", 1975, 33.36, 10),
    Album("Off the Wall", 1979, 42.28, 10),
    Album("Thriller", 1982, 42.19, 9),
    Album("Bad", 1987, 48.16, 10), Album("Dangerous", 1991, 77.03, 14),
    Album("HIStory: Past, Present and Future, Book I", 1995, 148.58, 30), 
    Album("Invincible", 2001, 77.05, 16)]
    
kmeans: KMeans[Album] = KMeans(2, albums)
clusters: List[KMeans.Cluster] = kmeans.run()
for index, cluster in enumerate(clusters):
    print(f"Cluster {index} Avg Length {cluster.centroid.dimensions[0]} Avg Tracks {cluster.centroid.dimensions[1]}: {cluster.points}\n")

Converged after 1 iterations
Cluster 0 Avg Length -0.29445443944188343 Avg Tracks -0.31282342980986083: [Got to Be There, 1972, Ben, 1972, Music & Me, 1973, Forever, Michael, 1975, Off the Wall, 1979, Thriller, 1982, Bad, 1987, Dangerous, 1991, Invincible, 2001]

Cluster 1 Avg Length 2.650089954976951 Avg Tracks 2.815410868288747: [HIStory: Past, Present and Future, Book I, 1995]
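
The "Avg Length" and "Avg Tracks" values printed above are centroid coordinates in z-score units, because the points were normalized before clustering. That is why one of them is negative. To report averages in minutes and track counts, one option is to average the original values of the albums in each cluster. The sketch below does this with plain tuples copied from the album data and grouped as in the output above. The names `album_clusters` and `original_unit_means` are hypothetical and not part of the implementation in this section.

```python
from statistics import mean
from typing import Dict, List, Tuple

# (length in minutes, number of tracks), grouped as in the printed clusters
album_clusters: Dict[str, List[Tuple[float, float]]] = {
    "Cluster 0": [(35.45, 10.0), (31.31, 10.0), (32.09, 10.0),
                  (33.36, 10.0), (42.28, 10.0), (42.19, 9.0),
                  (48.16, 10.0), (77.03, 14.0), (77.05, 16.0)],
    "Cluster 1": [(148.58, 30.0)],
}

def original_unit_means(members: List[Tuple[float, float]]) -> Tuple[float, float]:
    # Split the pairs into a column of lengths and a column of track counts
    lengths, tracks = zip(*members)
    return mean(lengths), mean(tracks)

for name, members in album_clusters.items():
    avg_length, avg_tracks = original_unit_means(members)
    print(f"{name}: avg length {avg_length:.2f} min, avg tracks {avg_tracks:.1f}")
```

With this grouping, the nine albums in the first cluster average about 46.5 minutes and 11 tracks, while the second cluster contains only the 148.58-minute, 30-track double album.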

