## Clustering (from a computational point of view)

Input:  
a set of "points" $P$  
$D$: distances between points  
$k \in \mathbb{N}$: the number of clusters we want to obtain

Output:  
a partition of $P$ into $k$ clusters $C_1, \dots, C_k$

The distances $D$ are assumed to satisfy:

- $D[x, y] \ge 0$  
- $D[x, x] = 0$  
- $D[x, y] = D[y, x]$ symmetric  
- $D[x,z] \le D[x,y] + D[y,z]$ triangle inequality  
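
The four properties can be checked mechanically on a small distance matrix. The following is a minimal sketch, assuming `D` is given as a square list of lists indexed by point number; the function name `is_metric` and the tolerance `tol` are illustrative choices, not part of the notes.

```python
from itertools import product

def is_metric(D, tol=1e-12):
    """Return True if the square matrix D satisfies the four properties above."""
    n = len(D)
    for x, y in product(range(n), repeat=2):
        # non-negativity and symmetry
        if D[x][y] < 0 or abs(D[x][y] - D[y][x]) > tol:
            return False
    # zero distance from a point to itself
    if any(D[x][x] != 0 for x in range(n)):
        return False
    # triangle inequality for every ordered triple
    return all(D[x][z] <= D[x][y] + D[y][z] + tol
               for x, y, z in product(range(n), repeat=3))
```

The check takes cubic time in the number of points, which is fine for sanity-checking small inputs.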

Two possible objectives:
1. Max-clustering

Maximize the minimum distance between two points from different clusters:

$\max \min D[x,y]$, where $x \in C_i, y \in C_j, i\ne j$

2. Min-clustering

Minimize the maximum distance between two points from the same cluster:

$\min \max D[x,y]$, where $x, y \in C_i$ for some $i$
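
To make the two objectives concrete, here is a small sketch that evaluates both for a given clustering. It assumes `D` is a square list of lists and `clusters` is a list of lists of point indices with at least two non-empty clusters; the function names are hypothetical.

```python
from itertools import combinations

def min_inter_cluster_distance(D, clusters):
    """Smallest D[x][y] over pairs x, y lying in different clusters."""
    label = {p: i for i, c in enumerate(clusters) for p in c}
    return min(D[x][y] for x, y in combinations(label, 2)
               if label[x] != label[y])

def max_diameter(D, clusters):
    """Largest D[x][y] over pairs x, y lying in the same cluster."""
    return max((D[x][y] for c in clusters for x, y in combinations(c, 2)),
               default=0)

# Example: four points on a line at positions 0, 1, 5, 6.
pos = [0, 1, 5, 6]
D = [[abs(a - b) for b in pos] for a in pos]
clusters = [[0, 1], [2, 3]]
print(min_inter_cluster_distance(D, clusters))  # 4 (between positions 1 and 5)
print(max_diameter(D, clusters))                # 1
```

Max-clustering asks for the clustering that makes the first value as large as possible; Min-clustering asks for the one that makes the second value as small as possible.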

## Maximising inter-cluster distance is easy

__Kruskal's algorithm__

Recall Kruskal's algorithm for building a minimum spanning tree. Treat the points as a complete weighted graph whose edge weights are the distances.

Kruskal's algorithm takes the lightest edge, then the next lightest, then the third lightest, and so on, adding each edge to the spanning forest unless it would close a cycle. Here we stop early: we stop when three connected components remain or, more generally, when $k$ connected components remain, and output these components as the clusters without connecting them to each other. This is Kruskal's algorithm with an early stop, and it runs in polynomial time.
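
The early-stopped Kruskal procedure can be sketched as follows. This is a minimal illustration, assuming `D` is a square list of lists and $1 \le k \le n$; the function name `max_clustering` and the simple union-find without ranks are choices made here for brevity.

```python
from itertools import combinations

def max_clustering(D, k):
    """Kruskal's algorithm, stopped once k connected components remain."""
    n = len(D)
    parent = list(range(n))

    def find(x):
        # find the representative of x's component (with path halving)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = n
    # all edges of the complete graph, lightest first
    edges = sorted(combinations(range(n), 2), key=lambda e: D[e[0]][e[1]])
    for u, v in edges:
        if components == k:
            break  # early stop: k clusters remain
        ru, rv = find(u), find(v)
        if ru != rv:  # skip edges that would close a cycle
            parent[ru] = rv
            components -= 1

    clusters = {}
    for x in range(n):
        clusters.setdefault(find(x), []).append(x)
    return list(clusters.values())
```

Sorting all $n(n-1)/2$ edges dominates the running time, so this sketch runs in $O(n^2 \log n)$.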

__Correctness proof__

Let $C_1, \dots, C_k $ be the clusters obtained.

Let $d^*$ denote the minimum distance between two points from different clusters.

Let $C'_1, \dots, C'_k$ be another clustering, different from $C_1, \dots, C_k$.

Then there are points $x, y \in C_i$ with $x \in C'_j$ and $y \notin C'_j$: two points that share a cluster in $C$ but lie in different clusters in $C'$. Such points exist because $C$ and $C'$ are different partitions into exactly $k$ non-empty clusters, so $C$ cannot merely be a refinement of $C'$.

Since $C$ was built by Kruskal's algorithm and $x, y$ are in the same cluster, the edges chosen by the algorithm contain a path $p$ from $x$ to $y$.

The path $p$ starts in $C'_j$ and ends outside it, so at some point it leaves $C'_j$.

Let $(u, v)$ be the edge of $p$ on which this happens: $u \in C'_j$ and $v \in C'_l$ with $l \ne j$. It is possible that $u = x$ or $v = y$.

__Claim: the edge between $u$ and $v$ has weight at most $d^*$, i.e. $D[u,v]\le d^*$__

Kruskal's algorithm added the edge $(u,v)$ to the forest, but it did not add the edge of weight $d^*$ that joins two different clusters of $C$. Since Kruskal's algorithm processes edges in ascending order of weight, $D[u,v]\le d^*$.

So in the clustering $C'$ the minimum distance between two points from different clusters is at most $D[u,v] \le d^*$. No clustering does better than $C$, so $C$ is an optimal solution.

## Minimising intra-cluster distance: An approximate solution

Define the diameter of a cluster $C_i$ as the maximum distance between any two points in $C_i$:

$D(C_i) = \max_{x, y \in C_i} D[x,y]$

Algorithm:

Pick any node and make it the center of cluster 1, $c_1$. Find the node farthest from $c_1$ and make it the center of cluster 2, $c_2$. Find the node farthest from both $c_1$ and $c_2$, that is, the node $c_3$ maximizing $\min(D[c_3, c_1], D[c_3, c_2])$, and make it the center of cluster 3. Continue in this way until there are $k$ centers $c_1, \dots, c_k$.

Then assign each remaining node to its closest cluster center $c_i$. This gives $k$ clusters.
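
The greedy procedure above (often called farthest-first traversal) can be sketched as follows. This is a minimal illustration, assuming `D` is a square list of lists and $k \le n$; the function name, the `first` parameter for the arbitrary starting node, and the tie-breaking (lowest index wins) are choices made here, not part of the notes.

```python
def greedy_min_clustering(D, k, first=0):
    """Farthest-first choice of k centers, then nearest-center assignment."""
    n = len(D)
    centers = [first]
    # dist[p] = distance from p to its closest center chosen so far
    dist = [D[first][p] for p in range(n)]
    while len(centers) < k:
        # next center: the point maximizing the minimum distance to all centers
        c = max(range(n), key=lambda p: dist[p])
        centers.append(c)
        dist = [min(dist[p], D[c][p]) for p in range(n)]

    # assign every point to its closest center
    clusters = [[] for _ in centers]
    for p in range(n):
        i = min(range(len(centers)), key=lambda i: D[centers[i]][p])
        clusters[i].append(p)
    return centers, clusters
```

Maintaining `dist` incrementally keeps the running time at $O(nk)$ once the distances are available.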

1. Consider a point $p$ whose distance to its closest cluster center is largest. If the algorithm ran for one more iteration, it would choose $p$ as $c_{k+1}$.

Assume $p \in C_i$ and let $r := D[p, c_i]$.

2. Let $u, v \in C_j$. Then $D[u, c_j] \le r$ and $D[v, c_j] \le r$, because $p$ is the point farthest from its center.

Hence $D[u,v] \le D[u, c_j] + D[c_j, v] \le 2r$ by the triangle inequality.

3. For all $j < l$ (and hence, by symmetry, for all $j\ne l$), $D[c_j, c_l] \ge r$: when $c_l$ was chosen, it was at least as far from the earlier centers as $p$ was, and $p$ is at distance at least $r$ from every center.

4. Let $C'_1, \dots, C'_k$ be another clustering.

Consider the $k+1$ points $c_1, \dots, c_k, p$. By the pigeonhole principle, two of them lie in the same cluster $C'_l$ of the clustering $C'$. __Then the diameter of that cluster is at least the distance between these two points.__

If one of these two points is $p$ and the other is a center $c_j$, then $D(C'_l) \ge D[c_j,p] \ge D[c_i,p] = r$, since $c_i$ is the center closest to $p$. If both points are cluster centers $c_j, c_m$, then $D(C'_l) \ge D[c_j,c_m] \ge r$ by step 3. In either case $D(C'_l) \ge r$.

So every clustering contains a cluster of diameter at least $r$, while by step 2 every cluster $C_j$ has $D(C_j) \le 2r$. The algorithm is therefore a 2-approximation for Min-clustering.
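
On tiny instances the factor 2 can be checked against a brute-force optimum. The sketch below is only an illustration under stated assumptions: `optimal_max_diameter` is a hypothetical helper that tries all $k^n$ assignments, and the greedy clustering is written out by hand for four points on a line (start at the point at position 1, ties broken toward the first center).

```python
from itertools import combinations, product

def max_diameter(D, clusters):
    """Largest distance between two points of the same cluster."""
    return max((D[x][y] for c in clusters for x, y in combinations(c, 2)),
               default=0)

def optimal_max_diameter(D, k):
    """Exhaustive search over all k**n label assignments (tiny inputs only)."""
    n = len(D)
    best = float("inf")
    for labels in product(range(k), repeat=n):
        clusters = [[p for p in range(n) if labels[p] == c] for c in range(k)]
        best = min(best, max_diameter(D, clusters))
    return best

# Four points on a line at positions 0, 1, 2, 3 and k = 2.
pos = [0, 1, 2, 3]
D = [[abs(a - b) for b in pos] for a in pos]
greedy = [[0, 1, 2], [3]]  # centers at positions 1 and 3
opt = optimal_max_diameter(D, 2)
print(opt, max_diameter(D, greedy))  # 1 2
```

In this example the greedy clustering has diameter exactly twice the optimum, so the factor 2 in the analysis cannot be improved for this algorithm.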

## Better-quality approximation?

We now show that Min-clustering is NP-hard, so it has no polynomial-time exact algorithm unless P=NP.

The reduction is from k-coloring, which is an NP-complete problem (for $k \ge 3$): k-coloring $\le_P$ Min-clustering.

_Notice that once you have colored the vertices with k colors, you have essentially split them into k clusters. The relation between these two problems is obvious._

Given a graph with edge set $\xi$, let
$D[u,v] = \begin{cases}
0, & u = v \\
1, & u \ne v, (u,v) \notin \xi \;\;\;\;\text{(not connected, it's safe to color them with the same color)} \\
2, & u \ne v, (u,v) \in \xi \;\;\;\;\text{(connected, can't color them with the same color)}
\end{cases}$

These distances satisfy the triangle inequality, because $2 \le 1 + 1$.

Is there a clustering $C_1, \dots, C_k$ such that $D(C_i) \le 1$ for all $i$?

It's easy to see that if the graph can be colored with k colors, then there is a clustering with this property: take the vertices of each color and declare them to be a cluster. Conversely, if such a clustering exists, we can give all vertices of the same cluster the same color. This means that the problem is NP-hard, so it's not possible to solve it in polynomial time unless P equals NP.
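
Both directions of the reduction are simple to write down. The following sketch assumes vertices are numbered `0..n-1`, edges are given as pairs without self-loops, and clusters are lists of vertex indices; the function names are hypothetical.

```python
def coloring_to_distances(n, edges):
    """Distance matrix of the reduction: 0 on the diagonal, 2 on edges, 1 otherwise."""
    D = [[0 if u == v else 1 for v in range(n)] for u in range(n)]
    for u, v in edges:
        D[u][v] = D[v][u] = 2
    return D

def clusters_to_coloring(n, clusters):
    """Use the index of each vertex's cluster as its color."""
    color = [None] * n
    for i, cluster in enumerate(clusters):
        for v in cluster:
            color[v] = i
    return color

# A triangle 0-1-2 with an extra vertex 3 attached to vertex 2.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
D = coloring_to_distances(4, edges)
clusters = [[0, 3], [1], [2]]  # every cluster has diameter at most 1
color = clusters_to_coloring(4, clusters)
print(color)  # [0, 1, 2, 0]
```

A cluster of diameter at most 1 contains no edge of the graph, which is exactly the condition for its vertices to share a color.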

---

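Suppose there were a polynomial-time algorithm that computes a $(2-\varepsilon)$-approximation for some $\varepsilon > 0$. On the instances above,

$OPT_{diameter} \le 1 \iff$ approximate diameter $\le 2-\varepsilon$.

The only distances in these instances are 0, 1 and 2, so an approximate diameter of at most $2-\varepsilon$ is in fact at most 1. In other words, the algorithm would solve these instances optimally and thereby decide k-coloring, which is not possible in polynomial time unless P=NP. Instances with other distances, such as 1.5, do not weaken the claim: an approximation algorithm must achieve its guarantee on every instance, including those produced by the reduction.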
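
The two cases behind this equivalence can be written out explicitly. Here $ALG$ is shorthand, introduced only for this sketch, for the diameter returned by the hypothetical $(2-\varepsilon)$-approximation:

$$OPT_{diameter} \le 1 \;\Rightarrow\; ALG \le (2-\varepsilon)\cdot OPT_{diameter} \le 2-\varepsilon < 2 \;\Rightarrow\; ALG \le 1$$

$$OPT_{diameter} = 2 \;\Rightarrow\; ALG \ge OPT_{diameter} = 2 > 2-\varepsilon$$

The last implication of the first line uses that every distance in the instance is 0, 1 or 2.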