# K-Means

In this notebook you will use GPU-accelerated K-means to find the best locations for a fixed number of humanitarian supply airdrop depots.

## Objectives

By the time you complete this notebook you will be able to:

- Use GPU-accelerated K-means

## Imports

For the first time we import `cuml`, the RAPIDS GPU-accelerated library containing many common machine learning algorithms. We will be visualizing the results of your work in this notebook, so we also import a couple of `bokeh` modules.

In [None]:
import cudf
import cuml

import cupy as cp

from bokeh import plotting as bplt
from bokeh import models as bmdl

## Load Data

For this notebook we again load the cleaned UK population data.

In [None]:
gdf = cudf.read_csv('../data/data_pop.csv')

In [None]:
# Drop the first column (assumed here to be a leftover index column from the CSV)
gdf.drop(gdf.columns[0], axis=1, inplace=True)

In [None]:
gdf.dtypes

In [None]:
print(gdf.shape)

In [None]:
gdf.head()

## K-Means Clustering

The unsupervised K-means clustering algorithm looks for a fixed number *k* of centroids in the data and assigns each point to the cluster of its closest centroid. K-means can be effective when the number of clusters *k* is known or can be estimated well (such as from a model of the underlying mechanics of a problem).
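To make the idea concrete, here is a minimal CPU sketch of the classic Lloyd's iteration, which is one common way to compute K-means. It is an illustration only and is not cuML's implementation; the function name `kmeans_sketch`, its parameters, and the two synthetic blobs are all assumptions made for this example.

```python
import numpy as np

def nearest_centroid(points, centroids):
    """Return, for each point, the index of its closest centroid."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans_sketch(points, k, n_iter=20, seed=0):
    """Illustrative Lloyd's iteration (hypothetical helper, not cuML's code)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct points chosen at random as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid.
        labels = nearest_centroid(points, centroids)
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    # Recompute labels so they match the final centroids.
    return centroids, nearest_centroid(points, centroids)

# Synthetic data: two well-separated blobs (made up for this example).
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.5, size=(100, 2))
blob_b = rng.normal(loc=(10.0, 10.0), scale=0.5, size=(100, 2))
points = np.vstack([blob_a, blob_b])

centroids, labels = kmeans_sketch(points, k=2)
print(centroids)
```

With data like this, the two centroids should settle near the blob centers. Production implementations typically add smarter initialization and a convergence check instead of a fixed iteration count.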

We already know the distribution of the population. Suppose we would also like to estimate the best locations for a fixed number of humanitarian supply depots, from which we can perform airdrops and reach the population most efficiently. We can use K-means to identify candidate locations by setting *k* to the number of supply depots available and fitting on the locations of the population.

GPU-accelerated K-means is just as easy to use as its CPU-only scikit-learn counterpart. In this series of exercises, you will use it to optimize the locations of 5 supply depots.

## Exercise 4: Make a `KMeans` Instance for 5 Clusters

`cuml.KMeans()` creates a K-means instance. Use it now to create an instance called `km`, passing the keyword argument `n_clusters` set to our desired number, `5`.

## Exercise 5: Fit to Population

Use the `km.fit` method to fit `km` to the population's locations by passing it the coordinate (easting and northing) data of the population. After fitting, you can print `km.cluster_centers_` to see where the algorithm created the 5 centroids.
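Once centroids are known, one way to judge how well they serve the population is to assign each location to its nearest centroid and look at the distances involved. The following is a small NumPy sketch of that check on the CPU. The arrays `locations` and `depots` are made-up stand-ins for the notebook's coordinate data and fitted centroids, not real values.

```python
import numpy as np

# Hypothetical stand-ins: four population points and two depot centroids.
locations = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
depots = np.array([[0.0, 1.0], [10.0, 2.0]])

# Distance from every location to every depot, shape (n_locations, n_depots).
dists = np.linalg.norm(locations[:, None, :] - depots[None, :, :], axis=2)

nearest_depot = dists.argmin(axis=1)  # index of the closest depot per location
nearest_dist = dists.min(axis=1)      # distance to that closest depot

print(nearest_depot)        # [0 0 1 1]
print(nearest_dist.mean())  # 1.5
```

A lower mean distance indicates depots that sit closer to the people they serve, which is the quantity K-means tries to reduce (in its squared form) when it places centroids.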

## Visualize the Clusters

Run the following cells to plot the centroids in red over the population distribution visualization.

In [None]:
# Generate a pandas dataframe for CPU visualization
plot_subset = gdf.take(cp.random.choice(gdf.shape[0], size=100000, replace=False))
df_subset = plot_subset.to_pandas()

In [None]:
# Turn on in-Jupyter viz
bplt.output_notebook()

In [None]:
# Helper function for visuals.
def base_plot(data=None, padding=None, tools='pan,wheel_zoom,reset',
              plot_width=500, plot_height=500,
              x_range=(0, 100), y_range=(0, 100), **plot_args):
    
    # If we send in two columns of data, we can use them to auto-size the scale.
    if data is not None and padding is not None:
        x_range = (data.iloc[:, 0].min() - padding, data.iloc[:, 0].max() + padding)
        y_range = (data.iloc[:, 1].min() - padding, data.iloc[:, 1].max() + padding)
        
    p = bplt.figure(tools=tools, plot_width=plot_width, plot_height=plot_height,
        x_range=x_range, y_range=y_range, outline_line_color=None,
        min_border=0, min_border_left=0, min_border_right=0,
        min_border_top=0, min_border_bottom=0, **plot_args)

    p.axis.visible = True
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None

    p.add_tools(bmdl.BoxZoomTool(match_aspect=True))

    return p

In [None]:
# Plotting the density distribution with red cluster centroids.
dist = dict(line_color=None, 
            fill_color='blue', 
            size=2,
            alpha=.05)

cent = dict(line_color='black',
            line_width=2,
            fill_color='red', 
            size=15,
            alpha=.5)

p = base_plot(data=df_subset[['easting', 'northing']], 
              padding=10000)

p.circle(x=list(df_subset['easting']), y=list(df_subset['northing']), **dist)
p.circle(x=km.cluster_centers_.iloc[:, 0].to_pandas(), y=km.cluster_centers_.iloc[:, 1].to_pandas(), **cent)

In [None]:
bplt.show(p)