# DSCI 6003 7.1 Lecture


## An Introduction to Clustering

### By the End of This Lecture You Will:
1. Be familiar with what clustering is 
2. Be able to write the k-means clustering algorithm
3. Be able to describe the hierarchical clustering algorithm

### References
https://en.wikipedia.org/wiki/K-means_clustering  
https://en.wikipedia.org/wiki/K-means%2B%2B - kmeans++ for improved initialization  
https://www.youtube.com/watch?v=IuRb3y8qKX4 - video with visualization of training progress   
https://www.youtube.com/watch?v=cWSnFaSjgBU - more on visualization  
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html  

## Clustering: What is it?

We have already discussed clustering in a casual context. Most of the algorithms that you have learned so far are **supervised algorithms** in that all the training data is *labeled*. 

This means that the label is the supervisor! There is a way of dealing with data that is **not** labeled beforehand: clustering. It is one of the most common methods of **unsupervised learning**, and the moniker "clustering" encompasses a vast array of algorithms designed to extract signal from data without knowing what the signal looks like beforehand.

![clustering-example](./images/clustering_example_3.png)


This has already come up: what happens if we need to build a model and there are no labels?

This can happen if:
1. No labels have been provided and you need to set up an initial classifier based on the presumption of finite classes.
2. You are looking to determine if continuous data can be grouped into finite classes.

In both of these cases, you need to cluster the data. 

To be precise: clustering divides data into groups (labels) such that all observations within each group meet a certain standard of similarity.


### QUIZ: What are the two main challenges inherent to clustering?


## K-means clustering

![kmeans-example](./images/kmeans_example_2.png)

**Hypothesis:** We can determine what the clusters are by seeking to minimize within-cluster variation. Each cluster is defined by its centroid (the mean of its points), and each point belongs to the cluster whose centroid is nearest. The user must choose the number of expected clusters, k, beforehand.

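**Cost:** We define the within-cluster variation: $WCV(C_k) = \dfrac{1}{|C_k|}\sum_{i, j \in C_k}d(x_{i}, x_{j})$, where $|C_k|$ is the number of points in cluster $C_k$ and $d(x_{i}, x_{j})$ is a distance metric of your choice (usually Euclidean). The total cost is the sum of $WCV(C_k)$ over all k clusters.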

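**Optimization:** Iterative refinement from a random initialization: alternately recompute each centroid and reassign each point to its nearest centroid until the assignments stop changing.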
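
The within-cluster variation defined above can be computed directly from its formula. The following is a minimal sketch, assuming points are tuples of floats, that the sum runs over all ordered pairs in a cluster, and that the distance is Euclidean; the function names `wcv` and `total_wcv` are hypothetical and chosen only for illustration.

```python
import math
from itertools import product


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def wcv(cluster, dist=euclidean):
    """Within-cluster variation of one cluster.

    Sums dist(x_i, x_j) over all ordered pairs (i, j) in the cluster
    and divides by the cluster size |C_k| (assumption: ordered pairs).
    """
    if not cluster:
        return 0.0
    total = sum(dist(a, b) for a, b in product(cluster, repeat=2))
    return total / len(cluster)


def total_wcv(clusters, dist=euclidean):
    """Total cost: the sum of the WCV of every cluster."""
    return sum(wcv(c, dist) for c in clusters)


tight = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
loose = [(0.0, 0.0), (0.0, 10.0), (10.0, 0.0)]
print(wcv(tight), wcv(loose))  # the tighter cluster scores lower
```

Any other distance function with the same two-argument signature can be passed as `dist`.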



### K-means algorithm

    Assign a characteristic number from 1 to K to each of N data points randomly
    While cluster assignments keep changing:
        For each of K clusters:
            Calculate cluster centroid
            For each of N points:
                determine point distance to centroid
        For each of N points:
            Assign point to centroid it is closest to
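
The pseudocode above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: it assumes Euclidean distance, points given as tuples of floats, and labels numbered from 0 to K-1. Two details are not in the pseudocode and are assumptions here: the `max_iter` safety cap, and re-seeding an empty cluster with a randomly chosen data point.

```python
import math
import random


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def kmeans(points, k, seed=0, max_iter=100):
    """Return (labels, centroids) for k-means from a random assignment."""
    rng = random.Random(seed)
    # Assign a label from 0 to k-1 to each point at random.
    labels = [rng.randrange(k) for _ in points]
    centroids = []
    for _ in range(max_iter):
        # Calculate the centroid of each cluster.
        centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            # Assumption: an empty cluster is re-seeded with a random point.
            centroids.append(centroid(members) if members else rng.choice(points))
        # Assign each point to the centroid it is closest to.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centroids


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

On these two well-separated groups the first three points should end up sharing one label and the last three the other; which group receives which label depends on the random start.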

### QUIZ:
What is guaranteed to happen with this algorithm? What ways are there of overcoming this problem?

![kmeans-initialization](./images/kmeans_initialization.png)


#### ANSWER: 
The algorithm is guaranteed to converge, but only to a local optimum: the end result is determined by the first random initialization. This can be overcome by bagging, or by running multiple initializations and picking the result with the lowest WCV.
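
The multiple-initialization remedy can be sketched as follows. This is an illustrative example under stated assumptions: Euclidean distance, points as tuples, and hypothetical helper names (`kmeans_once`, `total_wcv`, `best_of_n`) that are not part of any library. The `n_init` parameter name is borrowed from scikit-learn's `KMeans` for familiarity, but nothing here calls scikit-learn.

```python
import math
import random


def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run from a random initial assignment; returns labels."""
    labels = [rng.randrange(k) for _ in points]
    for _ in range(max_iter):
        centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                # Assumption: re-seed an empty cluster with a random point.
                centroids.append(rng.choice(points))
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels


def total_wcv(points, labels, k):
    """Sum over clusters of (1/|C_k|) times the pairwise distance sum."""
    total = 0.0
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if members:
            pair_sum = sum(math.dist(a, b) for a in members for b in members)
            total += pair_sum / len(members)
    return total


def best_of_n(points, k, n_init=10, seed=0):
    """Run k-means n_init times and keep the labeling with the lowest WCV."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda labels: total_wcv(points, labels, k))


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0), (10.0, 1.0)]
best = best_of_n(points, k=3, n_init=20)
print(best, total_wcv(points, best, 3))
```

The selected run can never score worse than the first run alone, which is the whole point of restarting.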

Obviously the number K is of immense importance at several levels. Choosing K is one of the greatest challenges in clustering and will be the topic of the next lecture.

## Issues with k-means
1.  Centers may get stuck sharing a cluster. This is more likely to happen with a large number of well-separated sets of points.
2.  Running time per iteration is proportional to (number of clusters) x (number of points). "Rough" clustering (e.g. canopy clustering or locality-sensitive hashing) can reduce this cost.
3.  Choosing k: the best approach is to understand why you're doing the clustering.  eBay problem.

## Hierarchical Clustering

Suppose we have clusters of both varying density and similar means. How can we address this problem?

![hierarchical-clustering](./images/hierarchical_clustering.png)

Hierarchical clustering is intended to overcome this problem by enabling the scope of the clustering to vary continuously. This is done by constructing a hierarchy tree. Hierarchical clustering is more of a concept than a single algorithm; we discuss a basic version of the algorithm here.


### Hierarchical clustering Algorithm

    Initialize all points to be individual clusters
    While n_clusters > 1:
        Compute the distance between every pair of clusters
        Merge the two closest clusters => New cluster
        Count n_clusters
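
A bottom-up (agglomerative) version of this idea can be sketched in plain Python. This is one common variant, shown under assumptions: the two closest clusters are merged per iteration, cluster distance is the smallest point-to-point Euclidean distance (single linkage), and the names `single_linkage` and `agglomerate` are hypothetical. The sketch favors readability over speed.

```python
import math


def single_linkage(a, b):
    """Cluster distance = smallest point-to-point distance."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerate(points, linkage=single_linkage):
    """Return the merge history as (height, merged_cluster) tuples."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        height = linkage(clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]
        # Delete j first (j > i) so that index i stays valid.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)
        merges.append((height, merged))
    return merges


points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
history = agglomerate(points)
for height, cluster in history:
    print(height, cluster)
```

Each recorded height is the distance at which two clusters were joined, which is the information a hierarchy tree displays.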

### Reporting a Hierarchical Clustering

Since any set of relationships captured within the tree is technically valid, deciding which of them to report sets the investigator quite a task. We typically report a tree with a **cut** at a given height. Note that the height gives some sense of the separation between clusters, along with their geometry. Choosing this height, however, is a matter of great interest (just as choosing k is above) and will be discussed in the next lecture.
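
Cutting at a given height can be illustrated without building the full tree: merging simply stops once the closest remaining pair of clusters is farther apart than the cut. The sketch below assumes single linkage with Euclidean distance, for which merge heights never decrease, so stopping early gives the same clusters as cutting the finished tree. The function name `cut_clusters` is hypothetical.

```python
import math


def cut_clusters(points, height):
    """Single-linkage clusters obtained by cutting the tree at `height`."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        pairs = [
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]

        def gap(ij):
            # Smallest point-to-point distance between the two clusters.
            return min(
                math.dist(p, q)
                for p in clusters[ij[0]]
                for q in clusters[ij[1]]
            )

        i, j = min(pairs, key=gap)
        if gap((i, j)) > height:
            break  # every remaining merge would sit above the cut
        merged = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so delete j first
        del clusters[i]
        clusters.append(merged)
    return clusters


points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
for h in (0.5, 2.0, 5.0, 100.0):
    print(h, cut_clusters(points, h))
```

Raising the cut height yields fewer, coarser clusters; lowering it yields more, finer ones.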

## Definitions of Distance

The matter of distance between clusters is also an area of active research, and the best choice varies from dataset to dataset. Distance between clusters is typically called **linkage** in the context of hierarchical clustering. You still need a metric for determining the distance between points; however, we also need a way of determining the distance between clusters.

![cluster-distance](./images/cluster_distance.png)

In practice it is common to attempt hierarchical clusterings with several different distance metrics (implying some a priori understanding of the data).
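
The three linkages compared in this section can be written out in a few lines. This sketch uses the standard definitions (single = smallest pairwise distance, complete = largest, average = mean of all pairwise distances) and assumes Euclidean distance between points; the function names are hypothetical.

```python
import math
from itertools import product


def pairwise(a, b):
    """All point-to-point distances between clusters a and b."""
    return [math.dist(p, q) for p, q in product(a, b)]


def single_linkage(a, b):
    """Distance between the two closest points of the clusters."""
    return min(pairwise(a, b))


def complete_linkage(a, b):
    """Distance between the two farthest points of the clusters."""
    return max(pairwise(a, b))


def average_linkage(a, b):
    """Mean of all point-to-point distances between the clusters."""
    d = pairwise(a, b)
    return sum(d) / len(d)


left = [(0.0, 0.0), (1.0, 0.0)]
right = [(4.0, 0.0), (6.0, 0.0)]
print(single_linkage(left, right))    # 3.0
print(complete_linkage(left, right))  # 6.0
print(average_linkage(left, right))   # 4.5
```

For any pair of clusters the single linkage is never larger than the average, and the average is never larger than the complete linkage.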

![example-trees](./images/example_trees.png)

### Average linkage:

1) insensitive to outliers

2) good compromise between single and complete linkage

### Complete linkage:

1) less sensitive to outliers

2) can sometimes wind up with branches that overlap each other (sensitive to odd distributions)

### Single linkage:

1) More sensitive to outliers

2) Less sensitive to odd distributions