# 4. Clustering

This chapter and the next use unsupervised learning algorithms. Unsupervised algorithms don't make use of a target; instead, their purpose is to learn some property of the data and to represent the structure of the features in a certain way. In the context of feature engineering for prediction, you can think of an unsupervised algorithm as a "feature discovery" technique.

Clustering simply means assigning data points to groups based on how similar the points are to each other.

When used for feature engineering, we could attempt to discover groups of customers representing a market segment, for instance, or geographic areas that share similar weather patterns. Adding a feature of cluster labels can help machine learning models untangle complicated relationships of space or proximity.

## 4.1 Cluster Labels as a Feature
Applied to a single real-valued feature, clustering acts like a traditional "binning" or ["discretization"](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization_classification.html) transform. On multiple features, it's like "multi-dimensional binning" (sometimes called vector quantization).

![04.Clustering_binning.png](../img/04.Clustering_binning.png)

**Left**: Clustering a single feature. **Right**: Clustering across two features.
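The binning analogy can be checked on a toy example. The sketch below is a minimal illustration, assuming scikit-learn is available; the feature values are made up for the purpose and do not come from any dataset in this chapter. It clusters a single feature with k-means and prints the interval of values each cluster covers.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 1-D feature with three well-separated groups of values.
x = np.array([1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.7, 10.0, 10.2]).reshape(-1, 1)

# Cluster the single feature into three groups.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(x)

# Each cluster covers a contiguous interval of the feature, much like a bin.
for k in range(3):
    members = x[labels == k, 0]
    print(f"cluster {k}: min={members.min():.1f}, max={members.max():.1f}")
```

The integer assigned to each cluster is arbitrary, so the labels should not be read as an ordering of the bins.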

Added to a dataframe, a feature of cluster labels might look like this:

```text
Longitude	Latitude	Cluster
-93.619	42.054	3
-93.619	42.053	3
-93.638	42.060	1
-93.602	41.988	0
```

It's important to remember that this `Cluster` feature is **categorical**. Here, it's shown with a label encoding (that is, as a sequence of integers), as a typical clustering algorithm would produce; depending on your model, a one-hot encoding may be more appropriate.
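As a rough sketch of the one-hot option, the snippet below rebuilds the small table above as a pandas dataframe and expands the `Cluster` column with `pd.get_dummies`. The dataframe name `df` and the `Cluster_` column prefix are choices made for this illustration only.

```python
import pandas as pd

# The example rows from the table above.
df = pd.DataFrame({
    "Longitude": [-93.619, -93.619, -93.638, -93.602],
    "Latitude": [42.054, 42.053, 42.060, 41.988],
    "Cluster": [3, 3, 1, 0],
})

# Mark the labels as categorical so they are not mistaken for a numeric quantity.
df["Cluster"] = df["Cluster"].astype("category")

# One indicator column per cluster label that appears in the data.
encoded = pd.get_dummies(df, columns=["Cluster"], prefix="Cluster")
print(encoded.columns.tolist())
```

Tree-based models can often work with the label encoding directly, while linear models and neural networks usually do better with the one-hot columns.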

The motivating idea for adding cluster labels is that the clusters will break up complicated relationships across features into simpler chunks. Our model can then just learn the simpler chunks one by one instead of having to learn the complicated whole all at once. It's a "divide and conquer" strategy.


![04.Clustering_built_year.png](../img/04.Clustering_built_year.png)

The figure above shows how clustering the `YearBuilt` feature helps a linear model learn its relationship to `SalePrice`.

The curved relationship between `YearBuilt` (the feature) and `SalePrice` (the label) is too complicated for a simple linear model, so the model underfits. If we instead group the years into clusters (for example, before 1940, between 1940 and 1980, and after 1980), **the relationship within each cluster is almost linear**, and the model can learn it easily.
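The effect can be reproduced on synthetic data. The sketch below does not use the Ames dataset: `year` and `price` are made-up arrays with a curved relationship, and fitting one intercept and one slope per cluster is an assumption about how the cluster labels are fed to the linear model. It compares the training R² of a single straight line with that of the per-cluster fit.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for YearBuilt and SalePrice (price in thousands).
year = rng.uniform(1900, 2010, size=300)
price = 50 + 0.02 * (year - 1900) ** 2 + rng.normal(0, 10, size=300)

X_raw = year.reshape(-1, 1)

# Baseline: one straight line through the curved relationship.
raw_model = LinearRegression().fit(X_raw, price)
r2_raw = raw_model.score(X_raw, price)

# Cluster the year feature into three groups.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_raw)
one_hot = np.eye(3)[labels]

# One intercept and one slope per cluster; the one-hot columns play the role
# of the intercept, so the global intercept is switched off.
X_clustered = np.hstack([one_hot, one_hot * X_raw])
clustered_model = LinearRegression(fit_intercept=False).fit(X_clustered, price)
r2_clustered = clustered_model.score(X_clustered, price)

print(f"R^2, single line:      {r2_raw:.3f}")
print(f"R^2, per-cluster line: {r2_clustered:.3f}")
```

Both scores are measured on the training data, so they show only that the clustered model can follow the curve more closely, not how well it generalizes.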



## 4.2 k-Means Clustering
There are a great many clustering algorithms. They differ primarily in how they measure "similarity" or "proximity" and in what kinds of features they work with. The algorithm we'll use, k-means, is intuitive and easy to apply in a feature engineering context. Depending on your application, another algorithm might be more appropriate.

K-means clustering measures similarity using ordinary straight-line distance (Euclidean distance, in other words). It creates clusters by placing a number of points, called centroids, inside the feature space. Each point in the dataset is assigned to the cluster of whichever centroid it's closest to. The "k" in "k-means" is how many centroids (that is, clusters) it creates. You choose k yourself.
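A minimal sketch of this with scikit-learn's `KMeans` follows. The data are three synthetic 2-D blobs invented for the example, and `n_clusters=3` is the k we choose ourselves.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Three synthetic, well-separated blobs of 50 points each (hypothetical data).
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])

# k is set by us through n_clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# One centroid per cluster, each a point in the 2-D feature space.
print(kmeans.cluster_centers_)

# Number of points assigned to each cluster.
print(np.bincount(labels))
```

Because k-means relies on Euclidean distance, features on very different scales are usually rescaled first so that no single feature dominates the distance.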

You could imagine each centroid capturing points through a sequence of radiating circles. When sets of circles from competing centroids overlap, they form a line. The result is what's called a Voronoi tessellation. The tessellation shows you which clusters future data will be assigned to; the tessellation is essentially what k-means learns from its training data.

The clustering on the Ames dataset above is a k-means clustering. Here is the same figure with the tessellation and centroids shown.
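One way to see that the learned centroids determine where future data lands: the sketch below fits k-means on random synthetic points, then assigns new points by hand to their nearest centroid and compares the result with `KMeans.predict`. The arrays `X_train` and `X_new` are made up for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Synthetic training data spread over a square (hypothetical).
X_train = rng.uniform(0, 10, size=(200, 2))
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_train)

# "Future" points that were not seen during fitting.
X_new = rng.uniform(0, 10, size=(20, 2))

# Euclidean distance from every new point to every centroid, shape (20, 4).
diffs = X_new[:, None, :] - kmeans.cluster_centers_[None, :, :]
dists = np.linalg.norm(diffs, axis=2)

# The Voronoi cell a point falls in is the one whose centroid is nearest.
nearest = dists.argmin(axis=1)

print(np.array_equal(nearest, kmeans.predict(X_new)))
```

The boundaries between cells are the lines where two centroids are equally distant, which is why the tessellation is made of straight edges.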