
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Applied K-means Clustering

**Objective**: *Demonstrate how to perform K-means clustering using Python and sklearn.*

In this demo, we will complete a series of exercises to practice performing K-means clustering analyses.

In [0]:
%run "../../Includes/Classroom-Setup"

Out[2]: DataFrame[]


## Prepare data

### Aggregate our user-level table

Remember we are interested in a user-level clustering based on our project objective. As a result, we'll recreate our **`adsda.ht_user_metrics`** table from the previous demo.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> We are removing the **`lifestyle`** and **`device_id`** columns from this analysis because K-means clustering requires all feature variables to be numeric.

In [0]:
%sql
CREATE OR REPLACE TABLE adsda.ht_user_metrics
USING DELTA LOCATION "/adsda/ht-user-metrics" AS (
  SELECT avg(resting_heartrate) AS avg_resting_heartrate,
         avg(active_heartrate) AS avg_active_heartrate,
         avg(bmi) AS avg_bmi,
         avg(vo2) AS avg_vo2,
         avg(workout_minutes) AS avg_workout_minutes,
         avg(steps) AS steps
  FROM adsda.ht_daily_metrics
  GROUP BY device_id
)

num_affected_rows,num_inserted_rows


We can then preview the first few rows of the result.

In [0]:
%sql
SELECT * FROM adsda.ht_user_metrics LIMIT 10

avg_resting_heartrate,avg_active_heartrate,avg_bmi,avg_vo2,avg_workout_minutes,steps
73.84670492514974,141.7665518846971,25.972345531696856,30.314501989400497,35.606040834647395,7007.131506849315
66.65136141242354,147.19021975989776,28.65722391606316,26.33148910948305,4.933198514612545,5222.191780821918
61.535264221129054,115.35464938159885,28.069176174299653,30.50585356539649,26.80897921409271,11651.545205479451
60.12761602451012,109.56012470925296,24.272346749436057,33.00945987357762,30.203698416630907,12232.284931506849
57.67928187578235,107.34804493259924,26.13666783325289,33.622191930293454,41.92978275902913,10685.441095890412
61.254321931199726,124.8439166626506,24.67510459315352,32.24746097933756,41.34549212923959,10838.29315068493
59.123912011085615,121.6281244436266,26.215711176682024,33.23270077091007,38.055842131738046,7065.627397260274
59.16707870402215,116.63147065007492,23.407247625285702,32.46844382359828,29.13189427605095,12333.975342465754
51.09170520840786,100.52788927517386,18.986985488002624,36.84238854947035,36.64032984626558,13517.109589041096
54.80101102825,106.12513976241604,12.015741781532324,32.37704612619004,50.81642676699068,15935.090410958905


### Split data

Next, we are going to split our table into a training set and an inference set.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> In unsupervised learning, we do not perform label-based evaluation like we do in supervised learning. We are going to use the inference set as an example for assigning rows that were not a part of the training process to clusters.

In [0]:
from sklearn.model_selection import train_test_split

ht_user_metrics_pd_df = spark.table("adsda.ht_user_metrics").toPandas()

train_df, inference_df = train_test_split(ht_user_metrics_pd_df, train_size=0.9, test_size=0.1, random_state=42)

Note that the resulting DataFrames have the same number of columns, but they have a different number of rows.

In [0]:
train_df.shape

Out[9]: (2700, 6)

In [0]:
inference_df.shape

Out[10]: (300, 6)

## K-means

### Training
Now, we can apply the K-means algorithm to our **`train_df`** DataFrame.

Remember that in order to do this, we need to manually specify *K* ahead of the training process by passing it to the `n_clusters` parameter.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> We need to scale our feature variables because the K-means algorithm is distance-based and treats all features as if they're on the same scale. Without scaling, a feature with a large numeric range, such as **`steps`**, would dominate the distance calculation. We'll go into more detail on this with more advanced tools in the next module.

In [0]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

k_means = KMeans(n_clusters=4, random_state=42)
k_means.fit(scale(train_df))

Out[11]: KMeans(n_clusters=4, random_state=42)
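To make the effect of scaling concrete, here is a minimal sketch on a small made-up array (the values are hypothetical, not taken from the course data). It shows that `scale` standardizes each column to mean 0 and standard deviation 1, so no single feature dominates the distances.

```python
import numpy as np
from sklearn.preprocessing import scale

# Hypothetical toy data: column 0 resembles a heart rate, column 1 a step count.
toy = np.array([
    [55.0, 14000.0],
    [60.0, 12000.0],
    [75.0, 6000.0],
    [80.0, 5000.0],
])

scaled = scale(toy)

# Before scaling, the step column's spread dwarfs the heart rate column's.
print(toy.std(axis=0))

# After scaling, every column has mean 0 and standard deviation 1.
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```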

#### Other Parameters

There are plenty of other parameters to the K-means process, including:

* `init` - how the initial centroids are determined
* `max_iter` - the maximum number of iterations of the algorithm in a single run (i.e. the most times the centroids can be updated before the run stops)
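As a rough illustration of these parameters, the sketch below fits K-means on made-up data (two synthetic blobs, not the course table). The `n_init` argument is not discussed in this demo; it is included here only to pin down the number of restarts, and the specific values are assumptions chosen for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated blobs in two dimensions.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
toy = np.vstack([blob_a, blob_b])

model = KMeans(
    n_clusters=2,
    init="k-means++",   # spread-out initial centroids (sklearn's default)
    n_init=10,          # number of restarts with different initial centroids
    max_iter=300,       # upper bound on iterations per restart
    random_state=42,
)
model.fit(toy)

# n_iter_ reports how many iterations the best run actually needed.
print(model.n_iter_)
# inertia_ is the within-cluster sum of squared distances for the best run.
print(model.inertia_)
```

On easy data like this, the run typically converges well before `max_iter` is reached, so `max_iter` acts as a safety cap rather than a target.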

#### Getting Centroids

Once the model has been fit, the centroid locations can be extracted using the `cluster_centers_` attribute.

In [0]:
k_means.cluster_centers_

Out[12]: array([[-0.81628809, -0.82771227, -0.65713427,  0.82717395,  0.40978552,
         0.86157681],
       [ 1.49007169,  1.07097812, -0.00256275, -1.36179705,  0.35708406,
        -0.94146452],
       [-0.0254251 ,  0.18963867,  0.97138668, -0.10461086,  0.02524379,
        -0.22271996],
       [ 1.34192461,  1.43713671,  0.18521438, -1.22722739, -2.38117604,
        -1.68906728]])

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Note that each row of this array corresponds to one centroid, and the values within a row are that centroid's coordinates for each of the features used, in the same column order as **`train_df`**. Because we fit the model on scaled data, these coordinates are in scaled units rather than the original units.
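One way to make the centroid array easier to read is to wrap it in a DataFrame that reuses the feature names. The sketch below does this on a made-up stand-in for **`train_df`** (the name `toy_df`, its values, and the choice of three clusters are all hypothetical).

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

# Hypothetical stand-in for train_df with two of the notebook's column names.
rng = np.random.default_rng(42)
toy_df = pd.DataFrame({
    "avg_resting_heartrate": rng.normal(65, 10, size=200),
    "steps": rng.normal(10000, 3000, size=200),
})

model = KMeans(n_clusters=3, n_init=10, random_state=42)
model.fit(scale(toy_df))

# One row per centroid, one column per feature, in the order the features were passed in.
centroids_df = pd.DataFrame(model.cluster_centers_, columns=toy_df.columns)
centroids_df.index.name = "cluster"
print(centroids_df)
```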

### Inference

Once we've trained our K-means model, we can use the final cluster centroids to place new, unseen rows into clusters as well.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> We are scaling our inference set, too. Scaling it on its own uses the inference set's mean and standard deviation rather than the training set's, so the two sets are not placed on exactly the same scale. There are more advanced tools that can do this in an unbiased way, and we'll go over them in the next module.
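One common way to reuse the training set's scaling at inference time is sklearn's `StandardScaler`. Whether this is the tool covered in the next module is an assumption. The sketch below uses made-up arrays (`toy_train` and `toy_inference` are hypothetical) rather than the course data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the training and inference sets.
rng = np.random.default_rng(42)
toy_train = rng.normal(loc=[65.0, 10000.0], scale=[10.0, 3000.0], size=(270, 2))
toy_inference = rng.normal(loc=[65.0, 10000.0], scale=[10.0, 3000.0], size=(30, 2))

# Learn the per-feature mean and standard deviation from the training set only.
scaler = StandardScaler()
train_scaled = scaler.fit_transform(toy_train)

model = KMeans(n_clusters=4, n_init=10, random_state=42)
model.fit(train_scaled)

# Reuse the training statistics so new rows land in the same coordinate system.
inference_scaled = scaler.transform(toy_inference)
toy_clusters = model.predict(inference_scaled)
print(toy_clusters[:5])
```

The cell in this demo keeps the simpler `scale` call.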

In [0]:
inference_df_clusters = k_means.predict(scale(inference_df))

Notice that the output is a numpy array of the same length as the **`inference_df`** DataFrame. This is because the model returns one cluster label for each of our rows.

In [0]:
type(inference_df_clusters)

Out[14]: numpy.ndarray

In [0]:
len(inference_df_clusters)

Out[15]: 300

Since we have a numpy array of the same length, we can add it as a new **`cluster`** column on a copy of **`inference_df`**.

In [0]:
clusters_df = inference_df.copy()
clusters_df["cluster"] = inference_df_clusters

Now we can easily view the cluster assigned to each row.

In [0]:
display(clusters_df)

avg_resting_heartrate,avg_active_heartrate,avg_bmi,avg_vo2,avg_workout_minutes,steps,cluster
51.26632775279057,105.93233389251066,16.473336965670203,36.61798456931176,40.7040080285147,14107.542465753424,0
53.36949895580931,105.13342668907612,15.266639373950296,30.50197298862535,61.723381727310944,12528.257534246575,0
86.51162895591307,147.31573126952208,19.14825600046248,19.448406520026342,45.00008651086257,7257.693150684931,1
62.67832822341962,126.55070760439426,28.538456483092705,30.136315631288845,36.31521454706571,10320.065753424658,2
78.3843123177059,144.53044270305998,23.844876989899003,26.163978233651594,5.414068675377703,5161.969863013699,3
88.77611148763101,141.08416692979114,19.766028414343424,17.78960369105766,44.18652798085717,7320.213698630137,1
57.27612483145937,111.60339113286958,23.145380588746512,35.686557410414416,31.522302178442796,12780.506849315068,0
70.70788585271985,126.2168310858353,29.521241718275277,27.893232976180293,34.27955417018405,7003.47397260274,2
60.792963094367806,112.31210378465444,26.230405518542675,33.26614809274604,38.53207307447889,7152.268493150685,2
64.43306557874654,138.60789599559456,32.12179645298689,31.69966046250843,32.93655042544162,6827.219178082192,2


Through the rest of this lesson, we'll look at optimizing the use of the K-means algorithm.

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>