# k-Means

## Setup

In [None]:
import altair as alt
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.datasets import make_moons
from sklearn.cluster import SpectralClustering
from sklearn.cluster import MiniBatchKMeans



<img  src="https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/figures/PDSH-cover-small.png?raw=1">

- This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).
- See [Algobeans (2016)](https://algobeans.com/2015/11/30/k-means-clustering-laymans-tutorial/) for a non-technical explanation of the k-means method.

## Introduction

K-means clustering is an unsupervised learning algorithm: it infers hidden structure from *unlabeled* data. A *label* is the variable we want to predict (e.g. the 'Y' variable in a logistic regression). Being unsupervised means the algorithm uses only input variables, also called features (e.g. the 'X' variables in a logistic regression).

- Cluster analysis use case: "tell me what patterns exist in my data"

The k-means algorithm groups observations (for example customers or products) into distinct clusters, where *k* is the number of clusters. More generally, clustering algorithms seek to learn, from the properties of the data, an optimal division or discrete labeling of groups of points.

In particular, the k-means algorithm searches for a pre-determined number of clusters within an unlabeled multidimensional dataset. It accomplishes this using a simple conception of what the optimal clustering looks like:

- The "cluster center" is the arithmetic mean of all the points belonging to the cluster.
- Each point is closer to its own cluster center than to other cluster centers.

Those two assumptions are the basis of the k-means model. 
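
The two assumptions translate directly into an iterative procedure: assign every point to its nearest center, then move every center to the mean of its points, and repeat until nothing changes. The following is a minimal NumPy sketch of that idea, not scikit-learn's implementation. The names `lloyd_kmeans` and `X_toy` are made up for illustration, and picking random data points as starting centers is a simplifying assumption.

```python
import numpy as np


def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Illustrative k-means loop; not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels


# Hypothetical toy data: two well-separated groups of three points each.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
toy_centers, toy_labels = lloyd_kmeans(X_toy, k=2)
print(toy_centers)
print(toy_labels)
```

On this toy data the first three points end up in one cluster and the last three in the other; which cluster is numbered 0 depends on the random start.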

## Generate Data

First, let's generate a two-dimensional dataset containing four distinct blobs ([see sklearn's documentation](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html)). To emphasize that this is an unsupervised algorithm, we will leave the labels out of the visualization (see [Matplotlib's documentation](https://matplotlib.org/gallery/shapes_and_collections/scatter.html) for more information about the scatter plot). 



In [None]:
# create data
X, y_true = make_blobs(n_samples=300, 
                       centers=4,
                       cluster_std=0.6, 
                       random_state=0)

In [None]:
# save data as pandas DataFrame
df = pd.DataFrame(X, columns=['var1', 'var2'])
df

In [None]:
# create scatterplot

By eye, it is relatively easy to pick out the four clusters.

## Standardize

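Standardization of a dataset is a common requirement for many methods: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance). For k-means in particular, which relies on Euclidean distances, features measured on larger scales would otherwise dominate the clustering.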

In our case the features were generated on comparable scales, so standardization is not strictly necessary. However, we perform it anyway to demonstrate the procedure (see [sklearn's `StandardScaler` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler)).
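
To make the transformation concrete: `StandardScaler` replaces each value with (value - column mean) / column standard deviation. The sketch below checks this on a small made-up array (`toy` is hypothetical and not part of the notebook's data) by comparing a manual z-score with the scaler's output.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales.
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 600.0]])

# Manual z-score: subtract the column mean, divide by the column
# standard deviation (NumPy's default population standard deviation).
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# The same transformation via scikit-learn.
scaled = StandardScaler().fit_transform(toy)

# Both approaches agree.
print(np.allclose(manual, scaled))  # True

# After scaling, each column has mean 0 and standard deviation 1
# (up to floating-point rounding).
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```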

In [None]:
# Initialize StandardScaler() as scaler
# scaler = ___

In [None]:
# use fit_transform to fit the scaler to the data and transform it
# X_std = ___

In [None]:
# create pandas DataFrame
# df_std = pd.DataFrame(___, columns=['var1_std', 'var2_std'])

In [None]:
# create chart

## Algorithm

In [None]:
# use KMeans with n_clusters=4 and n_init=10 and save it as kmeans
# kmeans =

In [None]:
# fit the algorithm to the data


In [None]:
# use predict to assign a cluster number to each observation and save it as y_kmeans


In [None]:
# add y_kmeans to our pandas DataFrame df_std as the column 'cluster'
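
One possible way to complete the four steps above is sketched here as a single self-contained block. It rebuilds the standardized data from the earlier sections. Note that `random_state=0` is an addition for reproducibility and is not required by the exercise, and the column name `cluster` is chosen to match the plotting code further down.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Rebuild the standardized data from the earlier sections.
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)
X_std = StandardScaler().fit_transform(X)
df_std = pd.DataFrame(X_std, columns=['var1_std', 'var2_std'])

# Four clusters, ten random initializations; the run with the lowest
# inertia is kept. random_state is an assumption added for reproducibility.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans.fit(X_std)

# Assign each observation to its nearest cluster center.
y_kmeans = kmeans.predict(X_std)

# Store the labels in the column that the plotting code expects.
df_std['cluster'] = y_kmeans
```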


Let's visualize the results by plotting the data colored by these labels.
We will also plot the cluster centers as determined by the *k*-means estimator:

In [None]:
# Create a DataFrame with the cluster centers data
centers_data = pd.DataFrame(kmeans.cluster_centers_, columns=['var1_center', 'var2_center'])
centers_data

In [None]:
# Create the scatter plot for the data points with cluster colors
scatter = alt.Chart(df_std).mark_circle(size=50).encode(
    x='var1_std:Q',
    y='var2_std:Q',
    color=alt.Color('cluster:N', scale=alt.Scale(scheme='viridis'))
)

# Create the scatter plot for the cluster centers
centers_scatter = alt.Chart(centers_data).mark_circle(size=200, color='black', opacity=0.5).encode(
    x='var1_center',
    y='var2_center'
)

# Combine the two scatter plots
alt.layer(scatter, centers_scatter)

The good news is that the k-means algorithm (at least in this simple case) assigns the points to clusters very similarly to how we might assign them by eye.
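
Because the data were generated with known blob memberships (`y_true`), this visual impression can also be checked with a contingency table. The sketch below is one way to do that, assuming the same data generation, standardization, and `KMeans` settings as above (with `random_state=0` added for reproducibility). The names `table` and `true_blob` are chosen for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Same data and preprocessing as in the sections above.
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)
X_std = StandardScaler().fit_transform(X)

# fit_predict fits the model and returns the cluster label of each observation.
y_kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_std)

# Rows: blob each point was generated from; columns: cluster found by k-means.
table = pd.crosstab(pd.Series(y_true, name='true_blob'),
                    pd.Series(y_kmeans, name='cluster'))
print(table)
```

Cluster numbers are arbitrary, so a good result shows one dominant cell per row, not necessarily on the diagonal.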