# [CPSC 310](https://github.com/GonzagaCPSC310) Data Mining
[Gonzaga University](https://www.gonzaga.edu/)

[Gina Sprint](http://cs.gonzaga.edu/faculty/sprint/)

# Clustering
What are our learning objectives for this lesson?
* Introduce clustering

Content used in this lesson is based upon information in the following sources:
* Dr. Shawn Bowers' Data Mining notes

## Warm-up Task(s)
Create a new folder called ClusteringFun. In ClusteringFun, create a main.py containing an empty main() function. Then create a file called shirt_sizes.csv and paste in the following data:

```
height(cm), weight(kg), size(t-shirt)
158, 58, M
158, 59, M
158, 63, M
160, 59, M
160, 60, M
163, 60, M
163, 61, M
160, 64, L
163, 64, L
165, 61, L
165, 62, L
165, 65, L
168, 62, L
168, 63, L
168, 66, L
170, 63, L
170, 64, L
170, 68, L
```

## Clustering
Given a collection of "objects" (i.e., instances with attributes), determine similar groups of objects ("clusters")

For example, consider the following objects with two attributes:
<img src="https://raw.githubusercontent.com/GonzagaCPSC310/U7-Unsupervised-Learning/master/figures/cluster_example1.png" width="450"/>


Q: What are the clusters?
* Possibly different ways to cluster, e.g., ...

<img src="https://raw.githubusercontent.com/GonzagaCPSC310/U7-Unsupervised-Learning/master/figures/cluster_example2.png" width="450"/>

Like with $k$-nn, we need a distance metric
* To determine how close instances are to each other so we can form clusters
* We'll use Euclidean distance
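As a reminder of what that looks like in code, here is a minimal sketch of Euclidean distance between two instances. The function name `euclidean_distance` and the two sample points are made up for illustration and are not part of the course code.

```python
import math


def euclidean_distance(p, q):
    """Return the Euclidean distance between two equal-length numeric points."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of attributes")
    # Square the attribute-wise differences, add them up, then take the root
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Two made-up points chosen so the result is easy to check by hand
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```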

## Centroids
A centroid is the point in the center of a cluster. Using Euclidean distances, a cluster's centroid is its "average point" ...
* Specifically: each attribute value of the centroid is the average of the corresponding attribute values of the points in the cluster
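The attribute-wise average can be sketched in a few lines of Python. The helper name `compute_centroid` and the three sample points below are assumptions for illustration (they are deliberately not the lab data).

```python
def compute_centroid(points):
    """Return the attribute-wise mean of a non-empty list of equal-length points."""
    if not points:
        raise ValueError("a cluster needs at least one point")
    n = len(points)
    # zip(*points) groups the values column by column (one group per attribute)
    return [sum(values) / n for values in zip(*points)]


# Made-up cluster with two attributes per point
cluster = [[1, 2], [3, 6], [5, 4]]
print(compute_centroid(cluster))  # [3.0, 4.0]
```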

### Lab Task 1
What is the centroid of the following cluster? Plot it with the points:

|att1 |att2|
|-|-|
|3 |4|
|6 |2|
|2 |1|
|5 |5|

## Cluster Quality
The quality of the cluster is given by an "objective function"
* i.e., a function we want to minimize (in this case)

We'll use the "Total Sum of Squares" (TSS)
* The sum of squared distances from the cluster's instances to its centroid $(\overline{x}, \overline{y})$
$$TSS = ((x_1 - \overline{x})^{2} + (y_1 - \overline{y})^{2}) + \cdots + ((x_n - \overline{x})^{2} + (y_n - \overline{y})^{2})$$
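
The formula translates almost directly into code. The sketch below is one possible way to write it; the name `total_sum_of_squares` and the sample cluster are made up for illustration, and the function works for any number of attributes, not just two.

```python
def total_sum_of_squares(points):
    """Return the TSS of one cluster: squared distances to its own centroid."""
    n = len(points)
    # The centroid is the attribute-wise mean of the cluster's points
    centroid = [sum(values) / n for values in zip(*points)]
    # Add up the squared difference for every attribute of every point
    return sum(
        (value - center) ** 2
        for point in points
        for value, center in zip(point, centroid)
    )


# Made-up cluster whose centroid is (3, 4)
print(total_sum_of_squares([[1, 2], [3, 6], [5, 4]]))  # 16.0
```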

### Lab Task 2
Calculate the objective function (TSS) for the cluster from Lab Task 1, whose centroid is $(4, 3)$:

$$TSS = ((3 - 4)^2 + (4 - 3)^2) + ((6 - 4)^2 + (2 - 3)^2) + ((2 - 4)^2 + (1 - 3)^2) + ((5 - 4)^2 + (5 - 3)^2) = 2 + 5 + 8 + 5 = 20$$


Notes on the TSS:
* It is a natural fit here since we use Euclidean distance
* Especially if we skip the square root when calculating distances: the squared distances we already computed can simply be added up
* Squaring also penalizes larger distances more heavily

## $k$-Means clustering algorithm
1. Pick a value of $k$
1. Select $k$ objects (arbitrarily) to use as initial centroids
1. Assign each instance to the cluster of its nearest centroid
1. Recalculate the centroids for the $k$ clusters
1. Repeat Steps 3-4 until the centroids no longer move (change)

Note that the resulting clusters depend on the initial instances used as centroids
* e.g., starting with different instances can change the outcome
* One approach is to randomly pick $k$ instances as centroids
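
The five steps above can be sketched in plain Python as follows. This is a minimal illustration under a few assumptions rather than a reference implementation: the names `k_means` and `squared_distance`, the `seed` and `max_iterations` parameters, and the choice to leave an empty cluster's centroid where it is are all made up for this sketch.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance (the square root does not change the ordering)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, seed=None, max_iterations=100):
    """Return (centroids, clusters) for a list of equal-length numeric points."""
    rng = random.Random(seed)
    # Step 2: randomly pick k instances to use as the initial centroids
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iterations):
        # Step 3: assign each instance to the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 4: recalculate each centroid as the mean of its cluster
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append([sum(v) / len(cluster) for v in zip(*cluster)])
            else:
                # Assumption: an empty cluster keeps its previous centroid
                new_centroids.append(centroids[i])
        # Step 5: stop once the centroids no longer move
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Made-up data with two visually obvious groups
data = [[1, 1], [1, 2], [2, 1], [10, 10], [10, 11], [11, 10]]
centroids, clusters = k_means(data, k=2, seed=0)
print(centroids)
print(clusters)
```

Running the sketch with different `seed` values is a quick way to see how the starting centroids can affect the result.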

Q: What happens when $k = 1$?
* We end up with only one cluster!

Q: What happens when $k = n$ for $n$ the number of instances?
* We end up with 1 cluster per instance (so, the original dataset)!

We can find good values for $k$ experimentally ...
* In general, we want small values for $k$ (i.e., fewer clusters)
* Start with $k = 2$, then use TSS (summed over all $k$ clusters) to measure quality
* Move to $k = 3$, $k = 4$, and so on, until the TSS stops decreasing noticeably (i.e., begins to converge)
* Select the smallest $k$ close to the convergence point
* See textbook for example
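
The experiment above can be sketched as a loop over candidate values of $k$. This is only an illustration under stated assumptions: the function name `total_tss_for_k` is made up, the first $k$ instances are used as the initial centroids so the run is repeatable (the notes only say the choice is arbitrary), and the four points are the cluster from Lab Task 1 treated as a tiny dataset.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def total_tss_for_k(points, k, max_iterations=100):
    """Run k-means and return the TSS summed over all k clusters."""
    # Assumption: use the first k instances as the initial centroids
    centroids = [list(p) for p in points[:k]]
    for _ in range(max_iterations):
        # Assign each instance to the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Recalculate centroids (an empty cluster keeps its old centroid)
        new_centroids = [
            [sum(v) / len(c) for v in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Sum the squared distances from every instance to its cluster's centroid
    return sum(
        squared_distance(p, centroids[i])
        for i, c in enumerate(clusters)
        for p in c
    )


points = [[3, 4], [6, 2], [2, 1], [5, 5]]
for k in range(1, len(points) + 1):
    print(k, total_tss_for_k(points, k))
```

On a real dataset, the printed values would be inspected (or plotted) to find where the TSS levels off.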

Why clustering?
* For prediction (e.g., determine a new instance's cluster, then predict by voting among that cluster's members)
* For data reduction (reduce dataset to one instance per cluster)
* For basic similarity search (e.g., find similar movies)
* For data exploration

### Lab Task 3
This problem uses the T-shirt dataset from the warm-up (shirt_sizes.csv), which is adapted from [this site](https://www.listendata.com/2017/12/k-nearest-neighbor-step-by-step-tutorial.html).

1. Let's implement $k$-means clustering with $k = 2$ using only the height and weight attributes
1. Is there a relationship between the clusters formed and T-shirt size?
1. Let's say a new customer named Monica has a height of 161 cm and a weight of 61 kg. Which cluster does Monica belong to? Can we conclude anything about her T-shirt size?

Note: because we use the Euclidean distance function, we don't want to forget to normalize the dataset before applying $k$-means clustering!
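
The note does not say which normalization to use. The sketch below assumes min-max scaling to the range [0, 1], which is one common choice; the function name `min_max_normalize` is made up for illustration, and the three rows are taken from shirt_sizes.csv (height and weight only).

```python
def min_max_normalize(table):
    """Scale each column of a list of numeric rows to the range [0, 1]."""
    columns = list(zip(*table))
    mins = [min(column) for column in columns]
    maxs = [max(column) for column in columns]
    return [
        [
            # A constant column has no spread, so map it to 0.0
            (value - lo) / (hi - lo) if hi != lo else 0.0
            for value, lo, hi in zip(row, mins, maxs)
        ]
        for row in table
    ]


# [height(cm), weight(kg)] for three instances from the dataset
rows = [[158, 58], [160, 64], [170, 68]]
for normalized_row in min_max_normalize(rows):
    print(normalized_row)
```

A new instance such as Monica would need to be scaled with the same column minimums and maximums as the dataset before measuring her distance to the centroids.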