# KMeans

Hello, welcome to this chapter of your book on machine learning with scikit-learn. This chapter covers unsupervised learning, that is, clustering and dimensionality reduction. The first lesson is about the k-means algorithm.

The k-means algorithm is a clustering method that partitions an unlabeled dataset into groups, or clusters, based on the similarity of their features.

It is considered unsupervised learning because it does not need labels to function.
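
To make the idea concrete, here is a minimal NumPy sketch of the classic iteration behind k-means (often called Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points. The function `lloyd_kmeans` and the synthetic data are hypothetical names introduced for illustration only; this is not how scikit-learn implements `KMeans` internally.

```python
import numpy as np


def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Toy k-means for illustration; not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids


# Two synthetic groups of 2-D points (assumed data, for illustration only).
rng = np.random.default_rng(42)
points = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
toy_labels, toy_centroids = lloyd_kmeans(points, k=2)
print(toy_centroids)
```

Notice that no labels are passed in at any point: the groups emerge only from the distances between points.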

In scikit-learn, the implementation of k-means is found in the `KMeans` class.

In [None]:
from sklearn.cluster import KMeans


This class offers several configuration options, such as the number of clusters to find, the initialization of the centroids, and the maximum number of iterations. But before we look at them, I'll teach you its basic usage with the Iris dataset (remember that it contains three types of flowers):

In [None]:
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data


To initialize `KMeans`, you specify the number of clusters you want to obtain. This is perhaps one of the algorithm's weaknesses: you have to decide in advance how many clusters you need. As you already know, scikit-learn provides default values for its arguments; the default for `n_clusters` is 8, but we are going to set it to 3:

In [None]:
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)


```{hint} Remember that k-means relies on distances between points, so features measured on larger scales dominate the result. Scale your data before passing it to the model.

```
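
One possible way to follow that advice is to chain a scaler and the clustering model in a pipeline. The sketch below assumes `StandardScaler` is an appropriate choice for your data (other scalers may fit better), and the names `X_iris`, `scaled_kmeans` and `scaled_labels` are introduced here for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris = load_iris().data

# Standardize every feature (zero mean, unit variance) before clustering.
scaled_kmeans = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=42),
)

# fit_predict runs the scaler and then returns the cluster of each sample.
scaled_labels = scaled_kmeans.fit_predict(X_iris)
print(scaled_labels[:10])
```

Keeping both steps in one pipeline means any new data is scaled in the same way before it reaches the model.
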
## Attributes

Once the model is trained, we can read the cluster assigned to each sample from the `labels_` attribute:

In [None]:
kmeans.labels_
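
A raw array of labels is hard to read, so a quick way to inspect it is to count how many samples fall in each cluster and, because Iris happens to include the true species, to cross-tabulate both. This is only a sketch: the names `model`, `cluster_sizes` and `contingency` are introduced here, and the cluster ids are arbitrary, so cluster 0 need not correspond to species 0.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(iris.data)

# How many samples ended up in each cluster.
cluster_sizes = np.bincount(model.labels_, minlength=3)

# Rows: true species, columns: cluster ids (cluster ids are arbitrary).
contingency = np.zeros((3, 3), dtype=int)
np.add.at(contingency, (iris.target, model.labels_), 1)

print(cluster_sizes)
print(contingency)
```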


We can also access the calculated centroids; remember that there are as many centroids as there are clusters:

In [None]:
kmeans.cluster_centers_


In this case, we have 3 centroids of 4 dimensions each because our input data was 4-dimensional.

However, the centroids are easier to interpret in a plot (this plot uses only one pair of the dataset's features):

In [None]:
from utils import view_centroids_iris

view_centroids_iris(kmeans, X)


Another attribute is inertia. Inertia measures the internal dispersion of the clusters: it is the sum of the squared distances from each point to its nearest centroid. The objective of k-means is to minimize this value. Once the model is trained, we can access it through the `inertia_` attribute:

In [None]:
kmeans.inertia_
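
To see where that number comes from, the following sketch recomputes the inertia by hand as the sum of squared distances between each sample and the centroid of its assigned cluster. The variable names are introduced for illustration, and the two values should agree up to floating-point error.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X_iris = load_iris().data
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_iris)

# Centroid assigned to each sample, then the sum of squared distances.
assigned_centroids = model.cluster_centers_[model.labels_]
manual_inertia = np.sum((X_iris - assigned_centroids) ** 2)

print(manual_inertia, model.inertia_)
```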


## Arguments of `KMeans`

The KMeans algorithm has several important arguments that can be adjusted to obtain the desired results. Below, I present the most important arguments:

 - `n_clusters`: Specifies the number of clusters desired in the solution. This is the most important parameter and must be chosen carefully.
 - `init`: Specifies the initialization method for the cluster centroids. The options are "k-means++", "random", and a custom array. "k-means++" is the default method and is recommended for most cases.
 - `n_init`: Specifies the number of times the algorithm will run with different centroid initializations. The final solution is the best of all runs, that is, the one with the lowest inertia. In older versions of scikit-learn the default value is 10; since version 1.4 the default is `'auto'`, which runs once for "k-means++" or a custom array and 10 times for "random". It can be increased if you want a more reliable solution.
 - `max_iter`: Specifies the maximum number of iterations allowed in a single run before the algorithm stops, even if it has not converged. The default value is 300.
 - `tol`: Specifies the tolerance for convergence. It is a relative tolerance on how much the centroids move between two consecutive iterations; when the change falls below `tol`, the algorithm is considered to have converged. The default value is `1e-4`.
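
As a quick reference, the sketch below passes every argument from the list explicitly. The values are only illustrative, `random_state` is added to make the run reproducible, and `explicit_kmeans` is a name introduced here.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X_iris = load_iris().data

explicit_kmeans = KMeans(
    n_clusters=3,       # number of clusters to find
    init="k-means++",   # centroid initialization strategy
    n_init=10,          # independent runs; the best one is kept
    max_iter=300,       # iteration cap for each run
    tol=1e-4,           # convergence tolerance
    random_state=42,    # makes the initialization reproducible
)
explicit_kmeans.fit(X_iris)

# n_iter_ reports how many iterations the best run actually needed.
print(explicit_kmeans.n_iter_, explicit_kmeans.inertia_)
```

Comparing `n_iter_` with `max_iter` is a simple way to check whether the algorithm stopped because it converged or because it ran out of iterations.
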
## Playing with the arguments

In [None]:
from utils import plot_centroids

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=6, cluster_std=1, random_state=42)

plt.scatter(X[:, 0], X[:, 1], c=y)


### `n_clusters`

Perhaps the most important value to tune is the number of clusters:

In [None]:
# Varying n_clusters
n_clusters_list = [2, 3, 4, 5, 6, 7]

trained_kmeans = []
titles = []
for n_clusters in n_clusters_list:
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init='auto')
    kmeans.fit(X)
    trained_kmeans.append(kmeans)
    titles.append(f"n_clusters = {n_clusters}")
plot_centroids(input_features=X, trained_kmeans=trained_kmeans, titles=titles)


### Elbow method - the elbow rule

Most of the time, it's impossible to visualize the centroids of our data (due to high dimensionality). But you can use the "elbow rule", a heuristic for choosing a reasonable number of clusters. Plot the inertia against the number of clusters and look for the "elbow": the point after which adding more clusters no longer reduces the inertia dramatically.

In [None]:
inertias = [kmeans.inertia_ for kmeans in trained_kmeans]

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(n_clusters_list, inertias , marker='o')
ax.set_xlabel('Number of clusters')
ax.set_ylabel('Inertia')
ax.set_title('Elbow rule')


Additionally, remember that there are other metrics that we previously saw in the clustering metrics chapter.
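
As one example, and assuming the silhouette coefficient is among the metrics you have already seen, the sketch below scores each candidate number of clusters on data generated in the same way as above. The names `X_blobs`, `silhouettes` and `best_k` are introduced for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X_blobs, _ = make_blobs(n_samples=300, centers=6, cluster_std=1, random_state=42)

silhouettes = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_blobs)
    # Silhouette ranges from -1 to 1; higher means tighter, better separated clusters.
    silhouettes[k] = silhouette_score(X_blobs, labels)

# Candidate with the highest score (a suggestion, not a guarantee).
best_k = max(silhouettes, key=silhouettes.get)
print(silhouettes, best_k)
```

Unlike inertia, the silhouette does not necessarily keep improving as clusters are added, so it can complement the elbow rule when the elbow is hard to spot.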

### `init`

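The `init` argument controls how the initial centroids are chosen. The example below compares the two built-in strategies with a custom array in which all six starting centroids are placed at the same point: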

In [None]:
init_methods = ['k-means++', 'random', np.array([[-10, -10] for _ in range(6)])]
init_titles = ['k-means++', 'random', 'custom']

trained_kmeans = []
titles = []
for title, init in zip(init_titles, init_methods):
    kmeans = KMeans(n_clusters=6, init=init, random_state=42, n_init=1)
    kmeans.fit(X)
    trained_kmeans.append(kmeans)
    titles.append(f"init = {title}")
plot_centroids(input_features=X, trained_kmeans=trained_kmeans, titles=titles)
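
Plots show one run per strategy, but a single run can be lucky or unlucky. A rough way to compare initializations numerically is to repeat single runs with different seeds and look at the spread of the resulting inertia. This is a sketch on data generated in the same way as above; `results` and the choice of ten seeds are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_blobs, _ = make_blobs(n_samples=300, centers=6, cluster_std=1, random_state=42)

results = {}
for init in ("k-means++", "random"):
    # One run per seed so that the effect of the initialization is visible.
    inertias = [
        KMeans(n_clusters=6, init=init, n_init=1, random_state=seed)
        .fit(X_blobs)
        .inertia_
        for seed in range(10)
    ]
    results[init] = (min(inertias), float(np.mean(inertias)), max(inertias))

for init, (lowest, mean, highest) in results.items():
    print(f"{init:>10}: min={lowest:.1f} mean={mean:.1f} max={highest:.1f}")
```

A large gap between the minimum and the maximum suggests that the strategy sometimes gets stuck in a poor solution, which is the situation `n_init` is meant to mitigate.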


### `max_iter`

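The `max_iter` argument caps the number of iterations. Starting all the centroids from the same point makes it easy to see how the solution improves as more iterations are allowed: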

In [None]:
# Varying max_iter
max_iter_list = [1, 2, 3, 4, 5, 300]

initial_centroids = np.array([[0, 0] for _ in range(6)])

trained_kmeans = []
titles = []
for max_iter in max_iter_list:
    kmeans = KMeans(n_clusters=6, max_iter=max_iter, init=initial_centroids, n_init=1, random_state=42)
    kmeans.fit(X)
    trained_kmeans.append(kmeans)
    titles.append(f'max_iter: {max_iter}')
    
plot_centroids(input_features=X, trained_kmeans=trained_kmeans, titles=titles)
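
The same effect can be checked numerically. The sketch below fixes the starting centroids (here, as an assumption for illustration, the first six samples) and prints the iterations actually used and the inertia reached for each cap; `history` is a name introduced for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_blobs, _ = make_blobs(n_samples=300, centers=6, cluster_std=1, random_state=42)

# Use the first six samples as a fixed starting point for every run.
start = X_blobs[:6]

history = []
for max_iter in (1, 2, 5, 300):
    km = KMeans(n_clusters=6, init=start, n_init=1, max_iter=max_iter).fit(X_blobs)
    # n_iter_ is the number of iterations actually performed.
    history.append((max_iter, km.n_iter_, km.inertia_))
    print(f"max_iter={max_iter:>3}  n_iter_={km.n_iter_:>3}  inertia={km.inertia_:.1f}")
```

If `n_iter_` is smaller than `max_iter`, the run converged before reaching the cap, so raising `max_iter` further would not change the result.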


## KMeans and large datasets

KMeans is an algorithm that works well for datasets of moderate size. However, it becomes slow on large datasets, whether they are large in the number of rows (observations) or in the number of columns (features).

Within scikit-learn, there is a variant called Mini-Batch k-means, which, instead of operating on the entire dataset at once (as `KMeans` does), updates the centroids using a small subset of samples at a time.

```{hint} As a task, why don't you use it and see its behavior? You can import it from `sklearn.cluster` as `MiniBatchKMeans`.

```
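
If you want a starting point for that exercise, the sketch below fits both estimators on a larger synthetic dataset and compares their inertia. The dataset size, `batch_size=1024` and the variable names are assumptions chosen for illustration, not recommended settings.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X_large, _ = make_blobs(n_samples=20_000, centers=6, cluster_std=1, random_state=42)

full = KMeans(n_clusters=6, n_init=3, random_state=42).fit(X_large)

# batch_size is the number of samples used in each incremental update.
mini = MiniBatchKMeans(
    n_clusters=6, batch_size=1024, n_init=3, random_state=42
).fit(X_large)

print(f"KMeans inertia:          {full.inertia_:.0f}")
print(f"MiniBatchKMeans inertia: {mini.inertia_:.0f}")
```

You can time both `fit` calls on your own data to see whether the speed-up is worth the usually small loss in cluster quality.
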
## In conclusion

KMeans is an algorithm that you can use when:

 1. You need to group unlabeled data based on their similarity, as KMeans seeks to divide the dataset into compact and separate groups (clusters).
 1. You have a dataset of moderate size and not very high dimensionality.
 1. You want an algorithm that is easy to implement and understand.

But you should be careful using it with:

 1. Data with noise, outliers, or groups that overlap.
 1. High-dimensional data, as KMeans can be affected by the "curse of dimensionality".
 1. Extremely large datasets, in which case you might consider using Mini-Batch KMeans or other more scalable clustering algorithms.
 1. Situations where you don't have a rough idea of the number of clusters, as KMeans requires you to specify the number of clusters beforehand.

I'll see you in the next chapter, where we'll discuss another clustering algorithm.