# Clustering Techniques -- Beyond K-Means

Understanding the similarities and differences in data is a foundational challenge in any business's data strategy. Whether it's identifying customers with shared nuances and idiosyncrasies or determining homogeneous gene sequences for cancer research, several techniques exist for grouping data -- known as clustering in the data science world and segmentation in the business world.

One of the methods most commonly used by analytics practitioners is K-Means Clustering. Its easy-to-understand process and variety of applications make it a popular technique and probably the first (if not only) clustering method taught to statistics students. Yet for all that K-Means is good for, it isn't a panacea. Like all statistical modeling methods, it has its pros, cons, and of course, underlying assumptions. Today we're going to explore alternatives to K-Means for situations where its performance can be suboptimal.

## Assumptions of K-Means

For the sake of brevity, I'll assume the reader has an understanding of how K-Means works (for a refresher, see the [Wikipedia article on K-Means clustering](https://en.wikipedia.org/wiki/K-means_clustering)). Some of the underlying assumptions that make K-Means an efficient clustering technique (especially on the very large data sets you'd expect to find in a productionized big data solution) can also be its greatest weaknesses. While there's no doubt that K-Means will cluster any data you feed into it, the resulting clusters may not realistically capture the underlying relationships an analyst seeks to find.

Let's explore three assumptions of K-Means that, if violated, can result in poor or downright inaccurate clusters:

* The number of clusters is known beforehand.
* Data is roughly spherical and easily separable.
* Clusters are approximately the same size.

It should be noted that even when these assumptions *are* met, the iterative nature of the algorithm and the randomized placement of centroids at initialization can lead K-Means to cluster the same data differently on each run, or to converge to a local optimum rather than the global one. As with any statistical technique, never blindly apply and accept the results!
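To make the local-optimum point concrete, here is a minimal sketch in plain Python. It is not a production implementation: the `lloyd` helper, the four-point toy data set, and the two hand-picked starting positions are all assumptions made up for illustration. The idea is simply to run the standard assign-then-update loop from two different initializations and compare the within-cluster sum of squares (often called inertia).

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd(points, centroids, max_iter=100):
    """Run the K-Means (Lloyd) loop from the given starting centroids.

    Returns (final_centroids, inertia). Hypothetical helper for illustration.
    """
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else old
            for members, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    inertia = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, inertia


# Toy data: the corners of a wide rectangle. The natural grouping is
# left pair vs. right pair.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Start near the natural groups: converges to left/right, inertia 1.0.
_, good_inertia = lloyd(points, [(0, 0.5), (4, 0.5)])

# Start between the groups: converges to bottom/top and stays there,
# inertia 16.0. The algorithm has converged, just to a worse solution.
_, bad_inertia = lloyd(points, [(2, 0), (2, 1)])

# Random restarts: pick two data points as starting centroids per seed
# and keep the run with the lowest inertia.
restart_inertias = []
for seed in range(10):
    rng = random.Random(seed)
    _, inertia = lloyd(points, rng.sample(points, 2))
    restart_inertias.append(inertia)
best_inertia = min(restart_inertias)

print(good_inertia, bad_inertia, best_inertia)
```

Both runs converge, but the second one settles on a solution sixteen times worse by inertia. Running several restarts and keeping the best is a common mitigation, though it only reduces the risk of a poor local optimum rather than removing it.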

## The number of clusters is known beforehand

K-Means requires the number of clusters to be chosen before the algorithm is run. Choosing well requires a level of understanding of the data's structure that an analyst may not have. For example,