<a href="https://colab.research.google.com/github/google/applied-machine-learning-intensive/blob/master/v2/05_clustering/01_k_means/colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Copyright 2020 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# k-means


k-means clustering is an *unsupervised* machine learning algorithm that can be used to group items into clusters.

So far we have only worked with supervised algorithms. Supervised algorithms have training data with labels that identify the numeric value or class for each item. These algorithms use labeled data to build a model that can be used to make predictions.

k-means clustering is different. The training data is not labeled. Unlabeled training data is fed into the model, which attempts to find relationships in the data and create clusters based on those relationships. Once these clusters are formed, predictions can be made about which cluster new data items belong to.

The clusters can't easily be labeled in many cases. The clusters are "emergent clusters" which are created by the algorithm and don't always map to groupings that you might expect.
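
As a minimal sketch of the fit-then-predict workflow described above, the snippet below clusters a handful of made-up 2D points and then assigns two new points to the clusters it found. The toy data and the names `points`, `toy_model`, and `new_points` are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of unlabeled 2D points (made-up toy data).
points = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # group near (1, 1)
    [8.0, 8.0], [8.1, 7.9], [7.8, 8.2],   # group near (8, 8)
])

# n_init and random_state are set only to make the run repeatable.
toy_model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Each training point gets a cluster ID. The IDs themselves are arbitrary.
print(toy_model.labels_)

# New, unseen points are assigned to the nearest cluster center.
new_points = np.array([[0.5, 1.5], [9.0, 7.5]])
print(toy_model.predict(new_points))
```

Note that the cluster IDs carry no meaning of their own: the model only tells us which points belong together, not what the group represents.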

## Example: Customer Segmentation

Let's start by looking at a real world use case: customer segmentation.

Businesses often segment their customers into groups for marketing purposes. Often these segments are based on some characteristic of the customer: age, gender, spending bracket, etc. These segments are created based on assumptions that marketers have about their customers.

In this example we will use k-means clustering to find customer segments instead of relying on traditional segmentation methods.


### Load the data

For this example we'll load the [Kaggle Black Friday](https://www.kaggle.com/dalalmanish/black-friday) data. The dataset contains demographic and purchasing information about shoppers on Black Friday.

Upload your `kaggle.json` file and run the code block below.

In [0]:
! chmod 600 kaggle.json && (ls ~/.kaggle 2>/dev/null || mkdir ~/.kaggle) && mv kaggle.json ~/.kaggle/ && echo 'Done'

And then use the Kaggle API to download the dataset.

In [0]:
! kaggle datasets download dalalmanish/black-friday
! ls

And unzip the file.

In [0]:
! unzip black-friday.zip
! ls 'black friday'

And finally, load the training data into a `DataFrame`.

In [0]:
import pandas as pd

data = pd.read_csv('black friday/train.csv')
data.head()

### Examine and clean the data

There are many useful columns of data in this data file that could be used for segmenting customers. In this case we'll ignore the common segmentation attributes (age, gender, spending bracket etc.) and instead focus solely on the product categories and purchase amounts of the customers. The aim is that we can find clusters of customers based on their purchases so that we can fine-tune our marketing for each cluster.

For this we'll be looking at the 'Product_Category_1', 'Product_Category_2', 'Product_Category_3', and 'Purchase' fields.

Let's peek at the data types.


In [0]:
data.dtypes

'Purchase' is a numeric value that contains the amount spent. The product category fields seem to be encoding a category as a numeric value. 'Product_Category_1' holds integers, which matches that assumption.

In [0]:
data.groupby('Product_Category_1')['Product_Category_1'].count()

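But 'Product_Category_2' and 'Product_Category_3' contain floats instead of integers. This is because these columns contain `NaN` values, which force pandas to store the column as floats. Note that `groupby` excludes `NaN` keys by default, so the missing values do not appear in the counts.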

In [0]:
data.groupby('Product_Category_2')['Product_Category_2'].count()

We can see from above that the valid categories range from 1 through 18. Let's convert all of the categories to integers and fill in the missing categories with zeros.

In [0]:
data['Product_Category_2'] = data['Product_Category_2'].fillna(0).astype(int)
data['Product_Category_3'] = data['Product_Category_3'].fillna(0).astype(int)

data.groupby('Product_Category_2')['Product_Category_2'].count()
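
To see why the missing values matter, here is a small sketch on a made-up column standing in for 'Product_Category_2'. The series `toy` and its values are hypothetical; only the `fillna(0).astype(int)` step mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for 'Product_Category_2'.
toy = pd.Series([2, np.nan, 14, np.nan, 8])

print(toy.dtype)          # float64, because NaN is a float value
print(toy.isna().sum())   # number of missing entries

# Filling the gaps with 0 (a value outside the valid category range)
# makes the column safe to cast to integers.
cleaned = toy.fillna(0).astype(int)
print(cleaned.tolist())
```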

### Create Synthetic Columns

We want to be able to cluster based on the categories that customers purchased in. The current encoding makes that a little difficult. Instead of having three category columns with a number between 0 and 18 in them, let's create 18 columns with a 1 in that column if the customer purchased from the category and a 0 if they did not.

In [0]:
for i in range(1, 19):
  new_column = "PC_{}".format(i)
  data[new_column] = ((data['Product_Category_1'] == i) |
                      (data['Product_Category_2'] == i) |
                      (data['Product_Category_3'] == i)).astype(int)

data.head()
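
The same encoding can be checked by hand on a tiny table. The sketch below uses a made-up three-row DataFrame with only three categories, and swaps the chained `==` comparisons for `isin(...).any(axis=1)`, which is an alternative way to ask "does any of the three columns equal `i`?". The data and the `category_columns` name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy purchases: each row lists up to three category codes,
# with 0 meaning "no category".
toy = pd.DataFrame({
    'Product_Category_1': [1, 3, 2],
    'Product_Category_2': [2, 0, 3],
    'Product_Category_3': [0, 0, 1],
})

category_columns = ['Product_Category_1', 'Product_Category_2',
                    'Product_Category_3']

for i in range(1, 4):  # only three categories in this toy example
  # True when any of the three category columns equals i.
  matches = toy[category_columns].isin([i]).any(axis=1)
  toy['PC_{}'.format(i)] = matches.astype(int)

print(toy[['PC_1', 'PC_2', 'PC_3']])
```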

Since customers show up on more than one row of data, we need to aggregate the data by customer so that we have only one customer per row. To do this we can group by 'User_ID' and sum each category. This will give us a nice idea of how many purchases each customer made in each category.

In [0]:
aggregations = {'Purchase': 'sum'}

for i in range(1, 19):
  col = "PC_{}".format(i)
  aggregations[col] = 'sum'
  
data_by_user = data.groupby('User_ID').agg(aggregations)

data_by_user.head()
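
As a quick sanity check of the aggregation, the sketch below applies the same `groupby(...).agg(...)` pattern to a made-up table with two users. The values and the name `toy_by_user` are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows: user 1 appears twice, user 2 once.
toy = pd.DataFrame({
    'User_ID':  [1, 1, 2],
    'Purchase': [100, 250, 80],
    'PC_1':     [1, 0, 1],
    'PC_2':     [0, 1, 1],
})

# One row per user: total spend plus a purchase count per category.
toy_by_user = toy.groupby('User_ID').agg(
    {'Purchase': 'sum', 'PC_1': 'sum', 'PC_2': 'sum'})

print(toy_by_user)
```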

### Perform Clustering

We now have a nice data format containing purchasing information for each customer. To run k-means clustering on the data we simply load [k-means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) from scikit-learn and ask the model to find a specific number of clusters for us.

Notice that we are scaling the data. Our purchase total and category counts are very different in magnitude. In order not to give the purchase total too much weight we scale the values.

In [0]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

model = KMeans(n_clusters=10)
model.fit(scale(data_by_user))

print(model.inertia_)

We asked scikit-learn to create 10 clusters for us and we then printed out the `inertia_` for the resultant clusters. *Inertia* is the sum of the squared distances of samples to their closest cluster center. For a fixed number of clusters, a smaller inertia means tighter clusters.
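
The definition of inertia can be checked by hand. The sketch below fits k-means to synthetic data from `make_blobs` (an assumption for illustration, not the Black Friday data) and recomputes the sum of squared distances from each sample to its assigned center.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (an assumption for illustration).
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

toy_model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# For every sample, look up the center of the cluster it was assigned to,
# then sum the squared distances to those centers.
assigned_centers = toy_model.cluster_centers_[toy_model.labels_]
manual_inertia = np.sum((X - assigned_centers) ** 2)

print(manual_inertia, toy_model.inertia_)
```

The two printed numbers should agree up to floating point error.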

But why did we choose 10 clusters? And is the inertia that we received reasonable?

### Find the optimal number of clusters

With just one run of the algorithm, it is difficult to tell how many clusters we should have and what an appropriate inertia value is. k-means is trying to discover things about your data that you do not know. Picking a number of clusters at random isn't the best way to use k-means.

Instead, you should experiment with a few different cluster values and measure the inertia of each. As you increase the number of clusters, your inertia should decrease.

In [0]:
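from sklearn.cluster import KMeans
from sklearn.preprocessing import scale
import matplotlib.pyplot as plt

clusters = list(range(5, 50, 5))
inertias = []

# Scale once, outside the loop, since the data doesn't change.
scaled_data = scale(data_by_user)

# Fit one model per candidate cluster count and record its inertia.
for c in clusters:
  model = KMeans(n_clusters=c)
  model = model.fit(scaled_data)
  inertias.append(model.inertia_)

plt.plot(clusters, inertias)
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()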

The resulting graph should start high and to the left and curve down as the number of clusters grows. The initial slope is steep, but begins to level off. Your optimal number of clusters is somewhere in the ["elbow" of the graph](https://en.wikipedia.org/wiki/Elbow_method_(clustering)), as the slope levels.
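
Reading the elbow off a plot is a judgment call. One rough way to support it, sketched below, is to print how much the inertia drops each time a cluster is added. The synthetic data has four deliberately well-separated groups (an assumption made so the elbow is obvious; real data is rarely this clean), and the names `ks`, `toy_inertias`, and `drops` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four well-separated groups.
X, _ = make_blobs(n_samples=400,
                  centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                  cluster_std=1.0, random_state=0)

ks = list(range(1, 9))
toy_inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in ks
]

# How much inertia each additional cluster removes.
drops = -np.diff(toy_inertias)

for k, drop in zip(ks[1:], drops):
  print('k={}: inertia dropped by {:.1f}'.format(k, drop))
```

On data like this the drops are large up to four clusters and small afterwards, which is the leveling-off that the elbow describes.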

Once you have this number, you need to then check to see if the number is reasonable for your use case. Say that the 'optimal' number of clusters for our customer segmentation is 20. Is it reasonable to ask our marketing department to market to 20 distinct segments?

And what makes the segments distinct? We only know that specific customers are clustered together. Was it due to age, or maybe purchase price? Perhaps the groups are formed based on unexpected combinations such as "bought snack food and cosmetics and spent between $100-150". What is a good name for that segment?

Clustering the data is often just the start of your journey. Once you have clusters you'll need to look at each group and try to determine what makes them similar. What patterns did the clustering find, and will that clustering be useful to you?
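
One simple way to start looking at each group is to attach the cluster ID to every customer and compare per-cluster averages. The sketch below does this on a made-up table shaped like `data_by_user`; the values and the names `toy_by_user`, `toy_model`, and `profile` are assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

# Hypothetical per-customer table with made-up values.
toy_by_user = pd.DataFrame({
    'Purchase': [100, 120, 110, 900, 950, 880],
    'PC_1':     [5, 6, 5, 0, 1, 0],
    'PC_2':     [0, 1, 0, 7, 8, 6],
}, index=pd.Index([1, 2, 3, 4, 5, 6], name='User_ID'))

toy_model = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_model.fit(scale(toy_by_user))

# Attach the cluster ID to each customer, then summarize the original
# (unscaled) values per cluster so the numbers stay interpretable.
profile = (toy_by_user
           .assign(cluster=toy_model.labels_)
           .groupby('cluster')
           .mean())

print(profile)
```

Here one cluster turns out to be low spenders who buy mostly from category 1 and the other high spenders who buy mostly from category 2, which is the kind of description a marketing team could act on.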

## Example: Classification of Digits

Clustering for data exploration purposes can lead to interesting insights into your data, but clustering can also be used for classification purposes.

In the example below we'll try to use k-means clustering to predict handwritten digits.

### Load the data

We'll load the digits dataset packaged with scikit-learn.

In [0]:
from sklearn.datasets import load_digits

digits = load_digits()

### Scale the data

It is good practice to scale the data so that features with larger ranges don't dominate the distance calculations used in clustering.

In [0]:
from sklearn.preprocessing import scale

scaled_digits = scale(digits.data)
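
To make the effect of `scale` concrete, the sketch below applies it to a small made-up array whose two columns differ in magnitude by a factor of thousands. The array `raw` is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import scale

# Made-up features on very different scales.
raw = np.array([
    [1000.0, 1.0],
    [5000.0, 3.0],
    [9000.0, 2.0],
])

scaled = scale(raw)

# After scaling, each column has mean 0 and standard deviation 1.
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

Because k-means relies on distances between points, putting every column on the same footing keeps the first column from deciding the clusters on its own.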

### Fit a model

We can then create a k-means model with 10 clusters (we know there are 10 digits from 0 through 9).

In [0]:
from sklearn.cluster import KMeans

model = KMeans(n_clusters=10)
model = model.fit(scaled_digits)

### Make predictions

We can then use the model to predict which category a data point belongs to.

In the case below we'll just use some of the data that we trained with for illustrative purposes. The prediction will provide a numeric value.

In [0]:
cluster = model.predict([scaled_digits[0]])[0]

cluster

What is this value? Is it the predicted digit?

No. This number is the cluster that the model thinks the digit belongs to. To determine the predicted digit, we'll need to see what other digits are in the cluster and choose the most popular one for our classification.


In [0]:
import numpy as np

labels = digits.target

# For each cluster, find the most common true digit among its members.
cluster_to_digit = [
  np.argmax(np.bincount(labels[model.labels_ == cluster]))
  for cluster in range(10)
]

cluster_to_digit

Here we can see the digit that each cluster represents.
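
With a cluster-to-digit mapping in hand, every cluster ID can be translated into a predicted digit and compared with the true label. The sketch below is self-contained and uses its own names (`digits_data`, `demo_model`, `mapping`) so that it doesn't overwrite the variables above; `random_state` and `n_init` are set only to make the run repeatable. Keep in mind that this accuracy is measured on the same data used to build the mapping.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale

digits_data = load_digits()
scaled = scale(digits_data.data)

demo_model = KMeans(n_clusters=10, n_init=10, random_state=0).fit(scaled)

# Most common true digit within each cluster.
mapping = np.array([
    np.bincount(digits_data.target[demo_model.labels_ == c]).argmax()
    for c in range(10)
])

# Translate every sample's cluster ID into a predicted digit.
predicted_digits = mapping[demo_model.labels_]

accuracy = np.mean(predicted_digits == digits_data.target)
print(accuracy)
```

If two clusters map to the same digit, some other digit is never predicted at all, which is one reason clustering usually trails a supervised classifier on labeled data.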

### Measure model quality

If we do have labeled data, as is the case with our digits data, then we can measure the quality of our model using the [homogeneity](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.homogeneity_score.html#sklearn.metrics.homogeneity_score) score and the [completeness](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.completeness_score.html#sklearn.metrics.completeness_score) score.

In [0]:
from sklearn.metrics import homogeneity_score
from sklearn.metrics import completeness_score

homogeneity = homogeneity_score(labels, model.labels_)
completeness = completeness_score(labels, model.labels_)
homogeneity, completeness
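
Both scores range from 0 to 1. Homogeneity is high when each cluster contains members of a single class, and completeness is high when all members of a class land in the same cluster. The sketch below shows the two pulling in opposite directions on a tiny made-up labeling; the lists `true_classes`, `over_split`, and `one_cluster` are hypothetical.

```python
from sklearn.metrics import completeness_score
from sklearn.metrics import homogeneity_score

true_classes = [0, 0, 1, 1]

# Every cluster holds a single class, but each class is split across two
# clusters: fully homogeneous, only partly complete.
over_split = [0, 1, 2, 3]
h_split = homogeneity_score(true_classes, over_split)
c_split = completeness_score(true_classes, over_split)
print(h_split, c_split)

# Everything lands in one cluster: each class is kept together (complete),
# but the cluster mixes both classes (not homogeneous).
one_cluster = [0, 0, 0, 0]
h_one = homogeneity_score(true_classes, one_cluster)
c_one = completeness_score(true_classes, one_cluster)
print(h_one, c_one)
```

A good clustering needs both scores to be high, which is why the two are reported together.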

# Exercises

## Exercise 1

Load the [iris dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html), then create a k-means model with three clusters, and find the homogeneity and completeness scores for the model. 

**Student Solution**

In [0]:
# Your code goes here

---

### Answer Key

In [0]:
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import homogeneity_score
from sklearn.metrics import completeness_score
from sklearn.preprocessing import scale

import numpy as np

iris = load_iris()

scaled_iris = scale(iris.data)

model = KMeans(n_clusters=3)
model = model.fit(scaled_iris)

labels = iris.target

cluster_to_species = [
  np.argmax(
      np.bincount(
        np.array([
                  labels[i]
                  for i in range(len(model.labels_))
                  if model.labels_[i] == cluster
        ])
      )
    ) for cluster in range(3)
]

homogeneity = homogeneity_score(labels, model.labels_)
completeness = completeness_score(labels, model.labels_)
homogeneity, completeness

---

## Exercise 2

Load the [iris dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html), then create a k-means model with three clusters using only two features. (Try to find the best two features for clustering.) Create a plot of the two features.

For each data point in the chart, use a [marker](https://matplotlib.org/api/markers_api.html) to encode the actual/correct species. For instance, use a triangle for Setosa, a square for Versicolour, and a circle for Virginica. Color each marker green if the predicted class matches the actual class. Color each marker red if the classes don't match.

**Student Solution**

In [0]:
# Your code goes here

---

### Answer Key

In [0]:
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import homogeneity_score
from sklearn.metrics import completeness_score
from sklearn.preprocessing import scale

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

iris = load_iris()

scaled_iris = scale(iris.data)
scaled_iris = np.array(scaled_iris)
labels = iris.target

highest_score, best_i, best_j, best_model = 0, 0, 0, None

# Try every pair of the four iris features.
for i in range(4):
  for j in range(i + 1, 4):
    candidate = KMeans(n_clusters=3)
    candidate = candidate.fit(iris.data[:, [i, j]])

    homogeneity = homogeneity_score(labels, candidate.labels_)
    completeness = completeness_score(labels, candidate.labels_)
    combined_score = homogeneity + completeness

    if combined_score > highest_score:
      highest_score = combined_score
      best_i = i
      best_j = j
      best_model = candidate

# Use the best-scoring model, not the last one fit, from here on.
model = best_model

cluster_to_species = [
  np.argmax(
      np.bincount(
        np.array([
                  labels[i]
                  for i in range(len(model.labels_))
                  if model.labels_[i] == cluster
        ])
      )
    ) for cluster in range(3)
]

correct = [cluster_to_species[model.labels_[i]] == labels[i] 
           for i in range(len(model.labels_))]

df = pd.DataFrame()
iris_data = np.array(iris.data)

df[iris.feature_names[best_i]] = iris_data[:, best_i]
df[iris.feature_names[best_j]] = iris_data[:, best_j]
df['species'] = labels
df['predicted_species'] = [cluster_to_species[model.labels_[i]] for i in range(len(model.labels_))]

MARKERS = ('^', 's', 'o')

fig, ax = plt.subplots(figsize=(10, 10))

for i in range(3):
  # Points of species i that were assigned the correct species.
  selector = (df['species'] == i) & (df['predicted_species'] == i)
  ax.scatter(
      df[selector][iris.feature_names[best_i]],
      df[selector][iris.feature_names[best_j]],
      marker = MARKERS[i],
      color = 'g',
      label = iris.target_names[i] + ' correct'
  )
  # Points of species i that were assigned a different species.
  selector = (df['species'] == i) & (df['predicted_species'] != i)
  ax.scatter(
      df[selector][iris.feature_names[best_i]],
      df[selector][iris.feature_names[best_j]],
      marker = MARKERS[i],
      color = 'r',
      label = iris.target_names[i] + ' incorrect'
  )

ax.legend()
_ = plt.show()

---