---
title: Clustering and the K-Means Algorithm
subject: Inner Products and Norms
subtitle: Grouping vectors together
short_title: Clustering
authors:
  - name: Nikolai Matni
    affiliations:
      - Dept. of Electrical and Systems Engineering
      - University of Pennsylvania
    email: nmatni@seas.upenn.edu
license: CC-BY-4.0
keywords: Clustering, K-Means
math:
  '\vv': '\mathbf{#1}'
  '\bm': '\begin{bmatrix}'
  '\em': '\end{bmatrix}'
  '\R': '\mathbb{R}'
---

## Reading

Material related to this page, as well as additional exercises, can be found in VMLS Chapter 4.

## Learning Objectives

By the end of this page, you should know:
- What is the clustering problem?
- What is the centroid, or mean, of a group of vectors?
- What is the K-means algorithm?

# The Clustering Problem

## An Informal Example of Clustering (VMLS 4.1)

Suppose we have $N$ vectors $\vv{x_1}, ..., \vv{x_N} \in V$. The goal of clustering is to partition these vectors into $k$ groups, or *clusters*, so that the vectors within each cluster are close to each other, as measured by the [distance](#general_distance_defn) between pairs of them.

Normally, the number of groups $k$ is much smaller than the total number of vectors $N$. Typical values in practice for $k$ range from a handful (2-3) to hundreds, and $N$ ranges from hundreds to billions.

The figure below shows a simple example with $N = 300$ vectors in $\mathbb{R}^2$, shown as small circles. As the right picture shows, it is easy to see that these vectors can be clustered into $k = 3$ groups in a way that "looks right" (we will quantify this idea soon).

:::{figure}../figures/04-3_clusters.png
:label:Grouping vectors into 3 clusters
:alt: A collection of $300$ vectors in $\mathbb{R}^2$ grouped into $3$ clusters by eye
:width: 500px
:align: center
:::

This example is a bit silly: for vectors in $\mathbb{R}^2$, clustering is easy; just make a scatter plot and use your eyes. In almost all applications, vectors live in $\mathbb{R}^n$ with $n$ much bigger than 2. Another silly aspect is how cleanly the points split into clusters; real data is messy, and often many points lie between clusters. Finally, in real examples, it is not always obvious how many clusters $k$ there are.

## Applications of Clustering

Despite these caveats, clustering can still be incredibly useful in practice. Before we dive into more details, let's highlight a few common applications:

* **Topic discovery.** Suppose $\vv{x_i}$ are word histograms associated with $N$ documents (a word histogram $\vv x$ has entries $x_j$ which count the number of times word $j$ appears in a document). Clustering will partition the $N$ documents into $k$ groups, which can be interpreted as groups of documents with the same or similar topics, genre, or author. This is sometimes called *automatic topic discovery*.

* **Customer market segmentation.** Suppose the vector $\vv{x_i} \in \mathbb{R}^n$ gives the dollar values of $n$ items purchased by customer $i$ in the past year. A clustering algorithm groups the customers into $k$ market segments, which are groups of customers with similar purchasing patterns.

Other examples include patient, zip code, student, and survey response clustering, as well as identifying weather zones, daily energy use patterns, and financial sectors. See pp. 70-71 of VMLS for more details.
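
To make the word histograms from the topic discovery example concrete, here is a minimal sketch of how one might be computed. The vocabulary, the two documents, and the helper name `word_histogram` are all hypothetical and chosen only for illustration; real pipelines typically also handle punctuation, stemming, and much larger vocabularies.

```python
from collections import Counter


def word_histogram(document, vocabulary):
    """Return a list whose j-th entry counts how often vocabulary[j] appears in document."""
    # Assumption: words are separated by whitespace and matching is case-insensitive.
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary and documents, for illustration only.
vocabulary = ["vector", "norm", "cluster", "market"]
documents = [
    "a cluster is a group of vector data and each vector has a norm",
    "the market rewards each market segment",
]

# One histogram vector per document; these play the role of the vectors x_i.
histograms = [word_histogram(doc, vocabulary) for doc in documents]
for hist in histograms:
    print(hist)
```

Each histogram is a vector in $\mathbb{R}^n$, where $n$ is the vocabulary size, so documents that use similar words end up close to each other.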

## A Clustering Objective (VMLS 4.2)

Our goal now is to formalize the ideas described above and introduce a quantitative measure of how good a clustering is.

### Specifying cluster assignments

We specify a clustering of vectors by assigning each vector to a group. We label the groups $1, ..., k$ and assign each of the $N$ vectors $\vv{x_1}, ..., \vv{x_N}$ to a group via the vector $\vv c \in \mathbb{R}^N$, with $c_i$ being the group number that $\vv{x_i}$ has been assigned to.

For example, if $N = 5$ and $k = 3$, then

\begin{align*}
    \vv c = \bm 3\\1\\1\\1\\2 \em 
\end{align*}
assigns $\vv{x_1}$ to group 3; $\vv{x_2}, \vv{x_3}, \vv{x_4}$ to group 1; and $\vv{x_5}$ to group 2.

We will also describe clusters by the sets of indices for each group, with $G_j$ being the set of indices associated with group $j$. For our simple example, we have

\begin{align*}
    G_1 = \{ 2, 3, 4 \}, \quad G_2 = \{5\}, \quad G_3 = \{1\}
\end{align*}

In general, we have that $G_j = \{ i \mid c_i = j\}$.
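
The relationship between the assignment vector and the index sets can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `index_sets` is hypothetical, and both the group labels and the vector indices are kept 1-based to match the notation above (Python lists themselves are 0-based).

```python
from collections import defaultdict

# Assignment vector from the example above (N = 5, k = 3), with 1-based group labels.
c = [3, 1, 1, 1, 2]


def index_sets(c):
    """Return a dict mapping each group j to G_j = {i | c_i = j}, with 1-based indices i."""
    groups = defaultdict(set)
    for i, group in enumerate(c, start=1):
        groups[group].add(i)
    return dict(groups)


G = index_sets(c)
print(G[1], G[2], G[3])
```

Running this on the example recovers $G_1$, $G_2$, and $G_3$ as listed above. Either description determines the other, so we can use whichever is more convenient.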

### Group representatives 

Each group is assigned a *group representative*