# Clustering

Clustering is a branch of unsupervised machine learning in which the goal is to identify groups, or clusters, in a data set without using labels. Clustering is not the same as classification: we are not trying to predict which of a set of known classes an observation belongs to. Instead, we identify sets of similar data points and call each resulting set a cluster.

Let's consider an example of clustering. You may have a data set that characterizes your customers, with fields such as demographic information and personal preferences. A supervised machine learning application would be to predict whether a person will buy a product. An unsupervised application, by contrast, would be to identify several groups or types of customers. With these groups identified, you can analyze them and build a profile describing each one. For example, one group might consist mostly of people aged 20 to 25 who like the outdoors. You can then pass these profiles to the marketing team so that it can create different advertisements to attract each group.

# K-Means algorithm

K-Means is a simple algorithm that can cluster datasets made of compact, blob-like groups quickly and efficiently, often in just a few iterations.
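To make the idea of "iterations" concrete, here is a minimal NumPy sketch of the usual two-step K-Means loop (often called Lloyd's algorithm): assign each instance to its nearest centroid, then move each centroid to the mean of its instances. The function name `kmeans_sketch`, its parameters, and the toy data are assumptions made for illustration; this is not how Scikit-Learn implements `KMeans` internally.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Toy K-Means loop for illustration only (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points picked at random as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: index of the closest centroid for each instance.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned instances
        # (an empty cluster simply keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated toy groups of 50 points each (made-up data).
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])
centroids, labels = kmeans_sketch(X_demo, k=2)
print(centroids)
```

On data this cleanly separated, the loop typically stops after only a handful of passes, which is the behavior the paragraph above describes.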

#### make_blobs 

make_blobs() is a Scikit-Learn function that generates blobs of points drawn from Gaussian distributions. We can control how many blobs and how many samples to generate, as well as several other properties.

Let's generate such a dataset and train a K-Means clusterer on it. The algorithm will try to find each blob's center and assign each instance to the closest one:

In [22]:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples = 500, n_features = 2, centers = 5, random_state = 0)
print(X.shape)
print(y.shape)

(500, 2)
(500,)
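As a side note on the "other properties" of make_blobs(), the sketch below passes explicit blob centers and a different spread for each blob through the `centers` and `cluster_std` parameters. The specific coordinates, spreads, and the `_demo` variable names are arbitrary choices for illustration, kept separate from the `X` and `y` used in the rest of this section.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Hypothetical blob centers and per-blob standard deviations.
demo_centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
X_demo, y_demo = make_blobs(
    n_samples=300,
    centers=demo_centers,
    cluster_std=[0.3, 0.8, 1.5],  # one spread per blob
    random_state=0,
)

print(X_demo.shape)
# Number of points generated for each blob.
print(np.bincount(y_demo))
```

When `n_samples` is an integer, the points are divided evenly among the blobs, so each of the three blobs here receives 100 points.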


In [15]:
from sklearn.cluster import KMeans

k = 5  # the number of clusters to look for
kmeans = KMeans(n_clusters=k)
y_pred = kmeans.fit_predict(X)  # fit the model and return each instance's cluster index

Each instance was assigned to one of the five clusters. In the context of clustering, an instance's label is the index of the cluster that this instance gets assigned to by the algorithm: this is not to be confused with the class labels in classification. 

The KMeans instance preserves a copy of the labels of the instances it was trained on, available via the labels_ instance variable:

In [17]:
y_pred is kmeans.labels_

True
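The `is` check above tests object identity, so its result depends on how fit_predict() happens to return the labels. A value-based comparison is a little more robust. The sketch below is self-contained and rebuilds the same kind of dataset; the `_demo` names, `n_init=10`, and `random_state=0` are assumptions added so the example runs reproducibly on its own.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=500, n_features=2, centers=5, random_state=0)

km_demo = KMeans(n_clusters=5, n_init=10, random_state=0)
y_pred_demo = km_demo.fit_predict(X_demo)

# Compare the label values element by element rather than by object identity.
print(np.array_equal(y_pred_demo, km_demo.labels_))

# Number of instances assigned to each of the five clusters.
print(np.bincount(km_demo.labels_))
```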

We can also take a look at the five centroids that the algorithm found:

In [18]:
kmeans.cluster_centers_

array([[-1.33625465,  7.73822965],
       [ 9.30286933, -2.23802673],
       [ 1.87544954,  0.76337636],
       [-1.78783991,  2.76785611],
       [ 0.87407478,  4.4332834 ]])

We can easily assign new instances to the cluster whose centroid is closest:

In [20]:
import numpy as np
X_new = np.array([[0,2],[3,2],[-3,3],[-3,2.5]])
kmeans.predict(X_new)

array([3, 2, 3, 3])

The KMeans class's transform() method measures the distance from each instance to every centroid:

In [21]:
kmeans.transform(X_new)

array([[ 5.89176171, 10.22273194,  2.24645254,  1.9457581 ,  2.58551249],
       [ 7.19238374,  7.59519798,  1.67148191,  4.84902197,  3.23116483],
       [ 5.02183919, 13.3715189 ,  5.36399977,  1.23418915,  4.13070899],
       [ 5.49609848, 13.18368276,  5.17550673,  1.24140202,  4.32966975]])
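To connect transform() and predict(), the following sketch recomputes the Euclidean distances by hand and checks that the index of the smallest distance in each row matches the predicted cluster. It is self-contained, so it refits its own model; the `_demo` names, `n_init=10`, and `random_state=0` are assumptions for reproducibility, and the cluster indices it produces need not match the numbering shown above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=500, n_features=2, centers=5, random_state=0)
km_demo = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_demo)

X_new_demo = np.array([[0, 2], [3, 2], [-3, 3], [-3, 2.5]])

# Euclidean distance from every new instance to every centroid, shape (4, 5).
manual_dist = np.linalg.norm(
    X_new_demo[:, None, :] - km_demo.cluster_centers_[None, :, :], axis=2
)

# The hand-computed distances should agree with transform() ...
print(np.allclose(manual_dist, km_demo.transform(X_new_demo)))
# ... and the closest centroid in each row should agree with predict().
print(np.array_equal(manual_dist.argmin(axis=1), km_demo.predict(X_new_demo)))
```

Reading the transform() output above the same way, the smallest value in each row sits in columns 3, 2, 3, and 3, which matches the array returned by predict().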