## Learn K-Means Clustering with R and Tidy Data Principles

### [**Pre-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/29/)

In this lesson, you will learn how to create clusters using the Tidymodels package and other packages in the R ecosystem (we'll call them friends 🧑‍🤝‍🧑), together with the Nigerian music dataset you imported earlier. We will cover the basics of K-Means for clustering. Keep in mind that, as you learned in the earlier lesson, there are many ways to work with clusters and the method you use depends on your data. We will try K-Means because it is the most common clustering technique. Let's get started!

Terms you will learn about:

-   Silhouette scoring

-   Elbow method

-   Inertia

-   Variance

### **Introduction**

[K-Means Clustering](https://wikipedia.org/wiki/K-means_clustering) is a method derived from the domain of signal processing. It is used to divide and partition groups of data into `k clusters` based on similarities in their features.

The clusters can be visualized as [Voronoi diagrams](https://wikipedia.org/wiki/Voronoi_diagram), which include a point (or 'seed') and its corresponding region.

<p >
   <img src="../../../../../../translated_images/voronoi.1dc1613fb0439b9564615eca8df47a4bcd1ce06217e7e72325d2406ef2180795.pcm.png"
   width="500"/>
   <figcaption>Infographic by Jen Looper</figcaption>
</p>


The steps of K-Means clustering are as follows:

1.  The data scientist starts by specifying the desired number of clusters to create.

2.  Next, the algorithm randomly selects K observations from the dataset to serve as the initial centers for the clusters (i.e., centroids).

3.  Next, each of the remaining observations is assigned to its closest centroid.

4.  Next, the new mean of each cluster is computed and the centroid is moved to the mean.

5.  Now that the centers have been recalculated, every observation is checked again to see if it might be closer to a different cluster. All the objects are reassigned using the updated cluster means. The cluster assignment and centroid update steps are repeated until the cluster assignments stop changing (i.e., when convergence is achieved). Typically, the algorithm terminates when each new iteration results in negligible movement of the centroids and the clusters become static.
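
The steps above can be sketched in a few lines of base R. This is a minimal illustration, not the implementation behind R's built-in `kmeans()`: the helper name `simple_kmeans`, its `max_iter` argument and the toy data are all made up for this example.

```r
# Minimal K-Means sketch on a numeric matrix.
# `simple_kmeans` is a hypothetical helper for illustration only.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 2: pick k random observations as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))

  for (iter in seq_len(max_iter)) {
    # Step 3: squared Euclidean distance from every point to every centroid
    d <- matrix(
      sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2)),
      ncol = k
    )
    new_assignment <- max.col(-d, ties.method = "first")

    # Step 5: stop once no observation changes cluster
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment

    # Step 4: move each centroid to the mean of its members
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centroids)
}

set.seed(2056)
# Two synthetic blobs (made-up data, not the songs dataset)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
)
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
fit$centers
```

At convergence each row of `fit$centers` is the column-wise mean of the observations assigned to that cluster, which is exactly the fixed point the steps describe.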

<div>

> Note that because the initial k observations used as starting centroids are chosen at random, we can get slightly different results each time we apply the procedure. For this reason, most algorithms use several *random starts* and choose the iteration with the lowest WCSS. It is therefore strongly recommended to always run K-Means with several random starts (a value of *nstart* greater than 1) to avoid an *undesirable local optimum.*

</div>
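
To see why several random starts help, here is a small sketch that compares one start against 25 on synthetic data. The blob layout and seeds are made up for illustration. The only thing being compared is the `tot.withinss` value that `kmeans()` reports.

```r
set.seed(2056)
# Synthetic data with five blobs (made-up, not the songs dataset)
centers_true <- cbind(c(0, 4, 8, 0, 8), c(0, 4, 0, 8, 8))
blobs <- do.call(rbind, lapply(1:5, function(i) {
  cbind(rnorm(50, centers_true[i, 1]), rnorm(50, centers_true[i, 2]))
}))

# Reset the seed before each call so the two runs are comparable
set.seed(1)
single <- kmeans(blobs, centers = 5, nstart = 1)
set.seed(1)
multi <- kmeans(blobs, centers = 5, nstart = 25)

# Total within-cluster sum of squares: lower is better
c(nstart_1 = single$tot.withinss, nstart_25 = multi$tot.withinss)
```

With more starts, `kmeans()` keeps the best solution it finds, so the multi-start result should be at least as tight as the single-start one.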

This short animation, using the [artwork](https://github.com/allisonhorst/stats-illustrations) of Allison Horst, explains the clustering process:

<p >
   <img src="../../images/kmeans.gif"
   width="550"/>
   <figcaption>Artwork by @allison_horst</figcaption>
</p>



A fundamental question that arises in clustering is this: how do you know how many clusters to separate your data into? One drawback of K-Means is that you need to decide on `k`, the number of `centroids`. Fortunately, the `elbow method` helps estimate a good starting value for `k`. You'll try it shortly.

### **Prerequisite**

We'll pick up right where we stopped in the [previous lesson](https://github.com/microsoft/ML-For-Beginners/blob/main/5-Clustering/1-Visualize/solution/R/lesson_14-R.ipynb), where we analysed the dataset, made lots of visualizations and filtered the dataset down to the observations of interest. Be sure to check it out!

We'll need some packages to complete this module. You can install them with: `install.packages(c('tidyverse', 'tidymodels', 'cluster', 'summarytools', 'plotly', 'paletteer', 'factoextra', 'patchwork'))`

Alternatively, the script below checks whether you have the packages required to complete this module and installs any that are missing.


In [None]:
suppressWarnings(if(!require("pacman")) install.packages("pacman"))

pacman::p_load('tidyverse', 'tidymodels', 'cluster', 'summarytools', 'plotly', 'paletteer', 'factoextra', 'patchwork')


Let's hit the ground running!

## 1. A dance with data: Narrow down to the 3 most popular music genres

This is a recap of what we did in the previous lesson. Let's slice and dice some data!


In [None]:
# Load the core tidyverse and make it available in your current R session
library(tidyverse)

# Import the data into a tibble
df <- read_csv(file = "https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/5-Clustering/data/nigerian-songs.csv", show_col_types = FALSE)

# Narrow down to top 3 popular genres
nigerian_songs <- df %>% 
  # Concentrate on top 3 genres
  filter(artist_top_genre %in% c("afro dancehall", "afropop","nigerian pop")) %>% 
  # Remove unclassified observations
  filter(popularity != 0)



# Visualize popular genres using bar plots
theme_set(theme_light())
nigerian_songs %>%
  count(artist_top_genre) %>%
  ggplot(mapping = aes(x = artist_top_genre, y = n,
                       fill = artist_top_genre)) +
  geom_col(alpha = 0.8) +
  paletteer::scale_fill_paletteer_d("ggsci::category10_d3") +
  ggtitle("Top genres") +
  theme(plot.title = element_text(hjust = 0.5))


🤩 That went well!

## 2. More data exploration.

How clean is this data? Let's check for outliers using box plots. We will concentrate on numeric columns with fewer outliers (although you could clean out the outliers). Box plots show the range of the data and will help us choose which columns to use. Note that box plots do not show variance, an important element of good clusterable data. Please see [this discussion](https://stats.stackexchange.com/questions/91536/deduce-variance-from-boxplot) for further reading.

[Boxplots](https://en.wikipedia.org/wiki/Box_plot) are used to graphically depict the distribution of `numeric` data, so let's start by *selecting* all numeric columns alongside the popular music genres.


In [None]:
# Select top genre column and all other numeric columns
df_numeric <- nigerian_songs %>% 
  select(artist_top_genre, where(is.numeric)) 

# Display the data
df_numeric %>% 
  slice_head(n = 5)


See how easy the selection helper `where` makes this 💁? Explore other such functions [here](https://tidyselect.r-lib.org/).

Since we'll be making a boxplot for each numeric feature and we want to avoid using loops, let's reformat our data into a *longer* format that will allow us to take advantage of `facets`: subplots that each display one subset of the data.


In [None]:
# Pivot data from wide to long
df_numeric_long <- df_numeric %>% 
  pivot_longer(!artist_top_genre, names_to = "feature_names", values_to = "values") 

# Print out data
df_numeric_long %>% 
  slice_head(n = 15)


Much longer! Now it's time for some `ggplots`! So which `geom` will we use?


In [None]:
# Make a box plot
df_numeric_long %>% 
  ggplot(mapping = aes(x = feature_names, y = values, fill = feature_names)) +
  geom_boxplot() +
  facet_wrap(~ feature_names, ncol = 4, scales = "free") +
  theme(legend.position = "none")


Easy-gg!

Now we can see that this data is a little noisy: looking at each column as a boxplot, you can see outliers. You could go through the dataset and remove these outliers, but that would leave very little data.
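
If you did want to flag outliers programmatically, one common convention is the 1.5 × IQR rule that box plots typically use for their whiskers. The sketch below applies it to a made-up vector. The values are invented for illustration and are not taken from the songs dataset.

```r
# Made-up loudness-like values with two extreme points
x <- c(-6.1, -5.8, -5.5, -5.2, -5.0, -4.9, -4.7, -4.5, -12.0, 1.5)

# Quartiles and the interquartile range
q <- quantile(x, probs = c(0.25, 0.75))
iqr <- q[2] - q[1]

# Flag points more than 1.5 * IQR beyond the quartiles
is_outlier <- x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
x[is_outlier]
```

Filtering with `x[!is_outlier]` would drop the flagged points, at the cost of shrinking the data.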

For now, let's choose which columns we will use for our clustering exercise. Let's pick the numeric columns with similar ranges. We could encode `artist_top_genre` as numeric, but we'll leave it out for now.


In [None]:
# Select variables with similar ranges
df_numeric_select <- df_numeric %>% 
  select(popularity, danceability, acousticness, loudness, energy) 

# Normalize data
# df_numeric_select <- scale(df_numeric_select)


## 3. Computing k-means clustering in R

We can compute k-means in R with the built-in `kmeans` function; see `help("kmeans")`. The `kmeans()` function accepts a data frame with all numeric columns as its primary argument.

The first step when using k-means clustering is to specify the number of clusters (k) that will be generated in the final solution. We know there are 3 song genres that we carved out of the dataset, so let's try 3:


In [None]:
set.seed(2056)
# Kmeans clustering for 3 clusters
kclust <- kmeans(
  df_numeric_select,
  # Specify the number of clusters
  centers = 3,
  # How many random initial configurations
  nstart = 25
)

# Display clustering object
kclust


The kmeans object contains several pieces of information, which are well explained in `help("kmeans")`. For now, let's focus on a few. We see that the data has been grouped into 3 clusters of sizes 65, 110, 111. The output also contains the cluster centers (means) for the 3 groups across the 5 variables.

The clustering vector is the cluster assignment for each observation. Let's use the `augment` function to add the cluster assignment to the original data set.


In [None]:
# Add predicted cluster assignment to data set
augment(kclust, df_numeric_select) %>% 
  relocate(.cluster) %>% 
  slice_head(n = 10)


Perfect, we have just partitioned our data set into 3 groups. So, how good is our clustering 🤷? Let's take a look at the `Silhouette score`.

### **Silhouette score**

[Silhouette analysis](https://en.wikipedia.org/wiki/Silhouette_(clustering)) can be used to study the separation distance between the resulting clusters. The score ranges from -1 to 1. A score near 1 means the cluster is dense and well separated from other clusters. A value near 0 indicates overlapping clusters, with samples very close to the decision boundary of neighbouring clusters. A negative value suggests that observations may have been assigned to the wrong cluster. [source](https://dzone.com/articles/kmeans-silhouette-score-explained-with-python-exam).

The average silhouette method computes the average silhouette of the observations for different values of *k*. A high average silhouette score indicates a good clustering.

The `silhouette` function in the cluster package computes the silhouette width of each observation, which we can then average.

> The silhouette can be calculated with any [distance](https://en.wikipedia.org/wiki/Distance "Distance") metric, such as the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance "Euclidean distance") or the [Manhattan distance](https://en.wikipedia.org/wiki/Manhattan_distance "Manhattan distance") which we discussed in the [previous lesson](https://github.com/microsoft/ML-For-Beginners/blob/main/5-Clustering/1-Visualize/solution/R/lesson_14-R.ipynb).
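
To make the score concrete, here is a small sketch that computes silhouette widths by hand for six made-up points and compares them with `cluster::silhouette()`. It uses the usual definition: for each observation, `a` is the mean distance to the other members of its own cluster, `b` is the smallest mean distance to the members of any other cluster, and the width is `(b - a) / max(a, b)`. The data and the helper name `sil_one` are invented for illustration.

```r
library(cluster)

# Six made-up 1-D points in two obvious clusters
x <- c(1, 2, 3, 10, 11, 12)
cl <- c(1L, 1L, 1L, 2L, 2L, 2L)
d <- as.matrix(dist(x))

# Silhouette width of observation i: (b - a) / max(a, b)
sil_one <- function(i) {
  same <- setdiff(which(cl == cl[i]), i)
  # a: mean distance to the other members of its own cluster
  a <- mean(d[i, same])
  # b: smallest mean distance to the members of any other cluster
  others <- setdiff(unique(cl), cl[i])
  b <- min(sapply(others, function(k) mean(d[i, cl == k])))
  (b - a) / max(a, b)
}
by_hand <- sapply(seq_along(x), sil_one)

# Compare with the widths reported by cluster::silhouette()
pkg <- silhouette(cl, dist(x))[, "sil_width"]
round(cbind(by_hand, pkg), 3)
```

The two columns should agree, and all values sit close to 1 because the two toy clusters are far apart.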


In [None]:
# Load cluster package
library(cluster)

# Compute average silhouette score
ss <- silhouette(kclust$cluster,
                 # Compute euclidean distance
                 dist = dist(df_numeric_select))
mean(ss[, 3])


Our score is **.549**, so right in the middle. This indicates that our data is not particularly well suited to this type of clustering. Let's see whether we can confirm this hunch visually. The [factoextra package](https://rpkgs.datanovia.com/factoextra/index.html) provides functions (`fviz_cluster()`) to visualize clustering.


In [None]:
library(factoextra)

# Visualize clustering results
fviz_cluster(kclust, df_numeric_select)


The overlap between the clusters indicates that our data is not particularly well suited to this type of clustering, but let's continue.

## 4. Determining optimal clusters

A fundamental question that often arises in K-Means clustering is this: without known class labels, how do you know how many clusters to separate your data into?

One way to find out is to use a data sample to `create a series of clustering models` with an incrementing number of clusters (e.g. from 1 to 10), and evaluate clustering metrics such as the **Silhouette score.**

Let's determine the optimal number of clusters by running the clustering algorithm for different values of *k* and evaluating the **Within Cluster Sum of Squares** (WCSS). The total within-cluster sum of squares (WCSS) measures the compactness of the clustering, and we want it to be as small as possible: lower values mean that the data points are closer together.
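
As a quick sanity check on what WCSS measures, the sketch below recomputes it by hand on made-up data and compares it with the `tot.withinss` value reported by `kmeans()`. The toy data is an assumption for illustration only.

```r
set.seed(2056)
# Made-up two-column data, not the songs dataset
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 4), ncol = 2)
)
fit <- kmeans(toy, centers = 2, nstart = 25)

# WCSS: squared distance of each point to its assigned cluster center, summed
assigned_centers <- fit$centers[fit$cluster, , drop = FALSE]
wcss_by_hand <- sum((toy - assigned_centers)^2)

c(by_hand = wcss_by_hand, kmeans = fit$tot.withinss)
```

Both numbers should match, which confirms that `tot.withinss` is simply the sum of squared distances between each observation and its own cluster center.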

Let's explore the effect of different choices of `k`, from 1 to 10, on this clustering.


In [None]:
# Create a series of clustering models
kclusts <- tibble(k = 1:10) %>% 
  # Perform kmeans clustering for 1,2,3 ... ,10 clusters
  mutate(model = map(k, ~ kmeans(df_numeric_select, centers = .x, nstart = 25)),
  # Farm out clustering metrics eg WCSS
         glanced = map(model, ~ glance(.x))) %>% 
  unnest(cols = glanced)
  

# View clustering results
kclusts


Now that we have the total within-cluster sum-of-squares (tot.withinss) for each clustering model with *k* centers, we use the [elbow method](https://en.wikipedia.org/wiki/Elbow_method_(clustering)) to find the optimal number of clusters. The method consists of plotting the WCSS as a function of the number of clusters, and picking the [elbow of the curve](https://en.wikipedia.org/wiki/Elbow_of_the_curve "Elbow of the curve") as the number of clusters to use.


In [None]:
set.seed(2056)
# Use elbow method to determine optimum number of clusters
kclusts %>% 
  ggplot(mapping = aes(x = k, y = tot.withinss)) +
  # `linewidth` replaces `size` for lines in ggplot2 3.4.0 and later
  geom_line(linewidth = 1.2, alpha = 0.8, color = "#FF7F0EFF") +
  geom_point(size = 2, color = "#FF7F0EFF")


The plot shows a large reduction in WCSS (so greater *tightness*) as the number of clusters increases from one to two, and a further noticeable reduction from two to three clusters. After that, the reduction is less pronounced, resulting in an `elbow` 💪 in the chart at around three clusters. This is a good indication that there are two to three reasonably well separated clusters of data points.
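
The elbow can be cross-checked with the average silhouette width computed over a range of `k`. The sketch below does this on made-up data with three blobs. It illustrates the idea only and makes no claim about what the songs data would show.

```r
library(cluster)

set.seed(2056)
# Three made-up blobs in two dimensions, not the songs dataset
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2),
  cbind(rnorm(30, mean = 0, sd = 0.5), rnorm(30, mean = 4, sd = 0.5))
)
d <- dist(toy)

# Average silhouette width for k = 2..6 (it is not defined for k = 1)
avg_sil <- sapply(2:6, function(k) {
  fit <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(fit$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:6

round(avg_sil, 3)
# The k with the highest average silhouette is a candidate choice
best_k <- as.integer(names(which.max(avg_sil)))
best_k
```

When the elbow and the silhouette peak point to the same `k`, that is a more reassuring signal than either metric alone.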

We can now go ahead and extract the clustering model where `k = 3`:

> `pull()`: used to extract a single column
>
> `pluck()`: used to index data structures such as lists
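
The difference between the two is easiest to see on a tiny list-column. In the sketch below, the `models` tibble and its string "models" are placeholders invented for illustration.

```r
library(tibble)
library(dplyr)
library(purrr)

# A small tibble with a list-column (made-up placeholder values)
models <- tibble(
  k = 1:3,
  model = list("model_1", "model_2", "model_3")
)

# pull() extracts the column, which is still a list (here of length 1)
as_list <- models %>% filter(k == 3) %>% pull(model)

# pluck(1) then takes the first element out of that list
as_element <- as_list %>% pluck(1)

str(as_list)
str(as_element)
```

This is why the extraction of a fitted model from a list-column needs both steps: `pull()` gets the column, `pluck()` unwraps the object inside it.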


In [None]:
# Extract k = 3 clustering
final_kmeans <- kclusts %>% 
  filter(k == 3) %>% 
  pull(model) %>% 
  pluck(1)


final_kmeans


Great! Let's go ahead and visualize the clusters obtained. Care for some interactivity using `plotly`?


In [None]:
# Add predicted cluster assignment to data set
results <-  augment(final_kmeans, df_numeric_select) %>% 
  bind_cols(df_numeric %>% select(artist_top_genre)) 

# Plot cluster assignments
clust_plt <- results %>% 
  ggplot(mapping = aes(x = popularity, y = danceability, color = .cluster, shape = artist_top_genre)) +
  geom_point(size = 2, alpha = 0.8) +
  paletteer::scale_color_paletteer_d("ggthemes::Tableau_10")

ggplotly(clust_plt)


Perhaps we would have expected each cluster (represented by different colors) to have distinct genres (represented by different shapes).

Let's take a look at the model's accuracy.


In [None]:
# Assign genres to predefined integers
label_count <- results %>% 
  group_by(artist_top_genre) %>% 
  mutate(id = cur_group_id()) %>% 
  ungroup() %>% 
  summarise(correct_labels = sum(.cluster == id))


# Print results  
cat("Result:", label_count$correct_labels, "out of", nrow(results), "samples were correctly labeled.")

cat("\nAccuracy score:", label_count$correct_labels/nrow(results))


This model's accuracy is not bad, but not great. It may be that the data does not lend itself well to K-Means Clustering. The data is too imbalanced, too weakly correlated, and there is too much variance between the column values to cluster well. In fact, the clusters that form are probably heavily influenced or skewed by the three genre categories we defined above.
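
Keep in mind that the cluster numbers k-means assigns are arbitrary labels, so a direct comparison of cluster ids with genre ids depends on how the clusters happen to be numbered. A cross-tabulation is one way to compare clusters and genres that does not depend on the numbering. The vectors below are made up for illustration and are not results from the notebook.

```r
# Made-up cluster assignments and genre labels, for illustration only
cluster <- factor(c(1, 1, 2, 2, 2, 3, 3, 3, 3, 1))
genre <- c("afropop", "afropop", "nigerian pop", "nigerian pop", "afropop",
           "afro dancehall", "afro dancehall", "afro dancehall", "afropop",
           "nigerian pop")

# Cross-tabulate clusters against genres
tab <- table(cluster, genre)
tab

# Purity: credit each cluster with its most common genre
purity <- sum(apply(tab, 1, max)) / length(genre)
purity
```

Purity is a simple, optimistic summary. The table itself is usually more informative because it shows which genres each cluster mixes together.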

Nevertheless, that was quite a learning process!

In Scikit-learn's documentation, you can see that a model like this one, with clusters that are not well demarcated, has a 'variance' problem:

<p >
   <img src="../../../../../../translated_images/problems.f7fb539ccd80608e1f35c319cf5e3ad1809faa3c08537aead8018c6b5ba2e33a.pcm.png"
   width="500"/>
   <figcaption>Infographic from Scikit-learn</figcaption>
</p>



## **Variance**

Variance is defined as "the average of the squared differences from the Mean" [source](https://www.mathsisfun.com/data/standard-deviation.html). In the context of this clustering problem, it refers to the fact that the numbers in our dataset tend to diverge a bit too much from the mean.
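
Here is that definition worked through on a small made-up vector. One detail to be aware of: R's `var()` computes the *sample* variance, which divides by `n - 1` rather than `n`, so it differs slightly from the plain average of squared differences.

```r
# Made-up values
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Population variance: the average of squared differences from the mean
pop_var <- mean((x - mean(x))^2)

# R's var() is the sample variance, which divides by n - 1 instead of n
sample_var <- var(x)

c(population = pop_var, sample = sample_var)
```

For large datasets the two values are nearly identical. The distinction matters mostly for small samples.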

✅ This is a great moment to think about all the ways you could correct this issue. Tweak the data a bit more? Use different columns? Use a different algorithm? Hint: Try [scaling your data](https://www.mygreatlearning.com/blog/learning-data-science-with-k-means-clustering/) to normalize it and test other columns.

> Try this '[variance calculator](https://www.calculatorsoup.com/calculators/statistics/variance-calculator.php)' to understand the concept a bit more.

------------------------------------------------------------------------

## **🚀Challenge**

Spend some time with this notebook, tweaking parameters. Can you improve the accuracy of the model by cleaning the data more (removing outliers, for example)? You can use weights to give more weight to certain data samples. What else can you do to create better clusters?

Hint: Try to scale your data. There's commented code in the notebook that adds standard scaling to make the data columns resemble each other more closely in terms of range. You'll find that while the silhouette score goes down, the 'kink' in the elbow graph smooths out. This is because leaving the data unscaled lets columns with a larger numeric range, and therefore larger variance, carry more weight in the distance calculations. Read a bit more on this problem [here](https://stats.stackexchange.com/questions/21222/are-mean-normalization-and-feature-scaling-needed-for-k-means-clustering/21226#21226).
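
As a minimal sketch of what standard scaling does, the example below applies base R's `scale()` to two made-up columns on very different ranges. The column names echo the songs data, but the values are invented for illustration.

```r
# Made-up columns on very different scales
df <- data.frame(
  popularity = c(10, 35, 60, 80),
  danceability = c(0.45, 0.60, 0.75, 0.90)
)

# Unscaled, popularity has a far larger variance than danceability
sapply(df, var)

# scale() centers each column to mean 0 and rescales it to standard deviation 1
df_scaled <- as.data.frame(scale(df))
sapply(df_scaled, var)
```

Note that `scale()` returns a matrix, so it is wrapped in `as.data.frame()` here to keep working with a data frame afterwards.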

## [**Post-lecture quiz**](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/30/)

## **Review & Self Study**

-   Take a look at a K-Means Simulator [such as this one](https://user.ceng.metu.edu.tr/~akifakkus/courses/ceng574/k-means/). You can use this tool to visualize sample data points and determine their centroids. You can edit the data's randomness, the number of clusters and the number of centroids. Does this help you get an idea of how the data can be grouped?

-   Also, take a look at [this handout on K-Means](https://stanford.edu/~cpiech/cs221/handouts/kmeans.html) from Stanford.

Want to try out your newly acquired clustering skills on data sets that lend themselves well to K-Means clustering? Please see:

-   [Train and Evaluate Clustering Models](https://rpubs.com/eR_ic/clustering) using Tidymodels and friends

-   [K-means Cluster Analysis](https://uc-r.github.io/kmeans_clustering), UC Business Analytics R Programming Guide

- [K-means clustering with tidy data principles](https://www.tidymodels.org/learn/statistics/k-means/)

## **Assignment**

[Try different clustering methods](https://github.com/microsoft/ML-For-Beginners/blob/main/5-Clustering/2-K-Means/assignment.md)

## THANK YOU TO:

[Jen Looper](https://www.twitter.com/jenlooper) for creating the original Python version of this module ♥️

[`Allison Horst`](https://twitter.com/allison_horst/) for creating the amazing illustrations that make R more welcoming and engaging. Find more illustrations at her [gallery](https://www.google.com/url?q=https://github.com/allisonhorst/stats-illustrations&sa=D&source=editors&ust=1626380772530000&usg=AOvVaw3zcfyCizFQZpkSLzxiiQEM).

Happy Learning,

[Eric](https://twitter.com/ericntay), Gold Microsoft Learn Student Ambassador.

<p >
   <img src="../../../../../../translated_images/r_learners_sm.e4a71b113ffbedfe727048ec69741a9295954195d8761c35c46f20277de5f684.pcm.jpeg"
   width="500"/>
   <figcaption>Artwork by @allison_horst</figcaption>
</p>


---

<!-- CO-OP TRANSLATOR DISCLAIMER START -->
**Disclaimer**:  
This document was translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.
<!-- CO-OP TRANSLATOR DISCLAIMER END -->
