# Clustering

## General

Clustering is a data analysis technique that divides a dataset into subgroups (clusters) whose members are more similar to each other than to members of other subgroups.

It can be used to separate:
- Online customers into groups with similar purchasing behaviours 
- People with similar genetic properties 
- Documents that correspond to different topics 

It is considered an *unsupervised* task because it tries to understand data without response variables.

The groups found by a clustering algorithm are exploratory findings that help **generate new questions** and **improve predictive analyses**.

## Algorithm

The K-means algorithm groups data into *K* clusters. The quality of a clustering is measured by *WSSD* (within-cluster sum of squared distances); for a fixed *K*, lower is better.

For one cluster, WSSD is the sum of the squared distances between each of its data points and the cluster center (the mean of its points). Total WSSD is the sum of these values over all *K* clusters.
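
As a small check of this definition, the sketch below computes total WSSD by hand in base R. The six points and their labels are made up for illustration, and `cluster_wssd` is a hypothetical helper name, not a library function.

```r
# Toy data: six points in 2-D (made up for illustration)
points <- matrix(c(1, 1,
                   1, 2,
                   2, 1,
                   8, 8,
                   8, 9,
                   9, 8),
                 ncol = 2, byrow = TRUE)
labels <- c(1, 1, 1, 2, 2, 2)  # hypothetical cluster assignment

# WSSD of one cluster: squared distances from its points to its center (mean)
cluster_wssd <- function(cluster_points) {
  center <- colMeans(cluster_points)
  sum(sweep(cluster_points, 2, center)^2)
}

# Total WSSD: add up the WSSD of every cluster
row_groups <- split(seq_len(nrow(points)), labels)
total_wssd <- sum(sapply(row_groups,
                         function(idx) cluster_wssd(points[idx, , drop = FALSE])))
total_wssd  # 4/3 per cluster, so 8/3 in total
```

Each cluster here contributes 4/3, which is the same value `kmeans` reports as `tot.withinss` when it ends up with this assignment.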

This is the algorithm process: 
1. Pick *K*, then assign each data point to one of the *K* clusters uniformly at random.
2. Repeat the following two steps until the cluster assignments stop changing. Neither step can increase total WSSD. <br>
**i) Center Update:** Compute the center (mean) of each cluster <br>
**ii) Label Update:** Reassign each point to the cluster with the nearest center
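
The two steps above can be sketched as a short base R loop. This is an illustration only, not how `stats::kmeans` is implemented; `simple_kmeans`, its arguments, and the toy data are all hypothetical. It assumes there are at least *K* rows, and it keeps a cluster's previous center if that cluster ever becomes empty.

```r
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: random initial labels, dealt out so every cluster starts non-empty
  labels <- sample(rep_len(seq_len(k), nrow(x)))
  centers <- matrix(0, nrow = k, ncol = ncol(x))
  for (iter in seq_len(max_iter)) {
    # Center update: mean of the points currently in each cluster
    for (j in seq_len(k)) {
      members <- labels == j
      if (any(members)) centers[j, ] <- colMeans(x[members, , drop = FALSE])
    }
    # Label update: squared distance from every point to every center
    dists <- sapply(seq_len(k),
                    function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    dists <- matrix(dists, nrow = nrow(x))
    new_labels <- max.col(-dists, ties.method = "first")  # nearest center
    if (all(new_labels == labels)) break  # converged: no label changed
    labels <- new_labels
  }
  list(cluster = labels, centers = centers)
}

# Usage on two well-separated toy groups (made-up data)
set.seed(1)
toy <- matrix(c(1, 1,  1, 2,  2, 1,
                8, 8,  8, 9,  9, 8),
              ncol = 2, byrow = TRUE)
result <- simple_kmeans(toy, k = 2)
result$cluster
```

Which group gets label 1 depends on the random start, but the first three points should share one label and the last three the other.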

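K-means can converge to a poor local solution depending on the random start, so it should be run several times with random restarts, keeping the run with the lowest total WSSD. In R, the `nstart` argument of `kmeans` controls the number of restarts.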
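
To see why restarts matter, the following sketch compares ten separate single-start runs against one call with `nstart = 10`. The data are simulated, and the seed and group means are arbitrary choices made for this example.

```r
set.seed(2024)  # arbitrary seed
# Simulated data: three groups of 30 points in 2-D
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 4), ncol = 2),
           matrix(rnorm(60, mean = 8), ncol = 2))

# Ten independent single-start runs; each may land in a different solution
single_runs <- replicate(10, kmeans(x, centers = 3, nstart = 1)$tot.withinss)

# nstart = 10 tries ten random starts and keeps the best one
best_of_ten <- kmeans(x, centers = 3, nstart = 10)

range(single_runs)        # spread of total WSSD across single starts
best_of_ten$tot.withinss  # total WSSD of the best of ten starts
```

If the two values in `range(single_runs)` differ, at least one single-start run got stuck in a worse solution than another.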

## Data preprocessing 

```
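library(tidyverse) # provides select, mutate, across and the pipeline helpers

standardized_data <- 
    not_standardized_data |>
    select(c(...)) |>                   # keep only the columns to cluster on
    mutate(across(everything(), scale)) ##(1)##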
```

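1. K-means relies on distances between points, so variables measured on larger scales would otherwise dominate. Standardize first: `across(everything(), scale)` applies `scale` to every selected column, giving each a mean of 0 and a standard deviation of 1.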
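
A quick way to confirm what `scale` does is to check the column means and standard deviations afterwards. The sketch below uses base R and a made-up data frame; the column names and values are hypothetical.

```r
# Two made-up variables on very different scales
raw <- data.frame(height_cm = c(150, 160, 170, 180, 190),
                  income = c(20000, 35000, 50000, 65000, 80000))

# scale() computes (x - mean) / sd, column by column
standardized <- as.data.frame(scale(raw))

colMeans(standardized)    # both approximately 0
sapply(standardized, sd)  # both 1
```

After standardizing, a one-unit difference means "one standard deviation" in either column, so neither variable dominates the distance calculation.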

## With K value

```
library(tidyverse) # ggplot2 is needed for the plot below
library(broom) ##(1)##

kmeans_object <- kmeans(standardized_data, centers = 3) ##(2)##

clustered_data <- augment(kmeans_object, standardized_data) ##(3)##

cluster_plot <- ggplot(clustered_data, aes(x = VALUE, y = VALUE, color = .cluster)) + ##(4)##
  geom_point(size = 2) + # size belongs in the geom; ggplot() itself ignores it
  labs(x = "VALUE", y = "VALUE", color = "Cluster") + 
  scale_color_manual(values = c("dodgerblue3", "darkorange3", "goldenrod1")) + 
  theme(text = element_text(size = 12))

glance(kmeans_object) ##(5)##
```

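1. The broom package provides the **augment** and **glance** functions used below.
2. `kmeans` runs the algorithm with *K* = 3 and assigns every data point to a cluster.
3. `augment` takes the model and the original data frame, and returns the data frame with the cluster assignments added as a `.cluster` column.
4. Scatter plot of the data, colored by cluster assignment. Replace `VALUE` with the column names to plot.
5. `glance` returns a one-row summary of `kmeans_object`; total WSSD is in the `tot.withinss` column.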
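
As a rough cross-check without broom, the same information can be read directly off the object that `kmeans` returns. The sketch below uses two columns of the built-in `iris` dataset purely as stand-in data; the choice of columns and seed is arbitrary.

```r
set.seed(1)  # arbitrary seed
# Stand-in data: two standardized columns from the built-in iris dataset
toy <- scale(iris[, c("Sepal.Length", "Petal.Length")])
km <- kmeans(toy, centers = 3, nstart = 10)

head(km$cluster)  # cluster label per row (what augment exposes as .cluster)
km$tot.withinss   # total WSSD (the tot.withinss column from glance)
km$withinss       # WSSD of each cluster; these add up to tot.withinss
```

`augment` and `glance` are convenient mainly because they return tidy data frames that slot into a pipeline.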

## Without K value

```
##(1)##
elbow_stats <-  tibble(k = 1:10) |>
 rowwise() |>
 mutate(clusters = list(kmeans(standardized_data, centers = k, nstart = ...)),
        glanced = list(glance(clusters)))

##(2)##
clustering_statistics <- elbow_stats |>
 select(-clusters) |>
 unnest(glanced)

##(3)##
elbow_plot <- ggplot(clustering_statistics, aes(x = k, y = tot.withinss)) +
 geom_point(size = 2) +
 geom_line() +
 labs(x = "K",
      y = "Total within-cluster sum of squares",
      title = "Elbow Plot") +
 scale_x_continuous(breaks = 1:10) +
 theme(text = element_text(size = 20))

```

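1. Fit a K-means object for each value of k in the chosen range (here 1 to 10). `rowwise()` makes `mutate` run `kmeans` once per row, using that row's k.
2. Drop the model column, then `unnest` the `glanced` column, because each of its entries is a one-row data frame of summary statistics.
3. Plot total WSSD against k. The "elbow", where the curve stops dropping steeply and starts to level off, suggests a good k for the analysis.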
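
One way to read the elbow numerically is to look at how much total WSSD drops with each extra cluster. The sketch below does this in base R on simulated data with three groups; the data, seed, and range of k are assumptions made for this example.

```r
set.seed(123)  # arbitrary seed
# Simulated data: three well-separated groups of 30 points in 2-D
blobs <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
               matrix(rnorm(60, mean = 5), ncol = 2),
               matrix(rnorm(60, mean = 10), ncol = 2))

ks <- 1:6
# Total WSSD for each k, using restarts to avoid poor solutions
wssd <- sapply(ks, function(k) kmeans(blobs, centers = k, nstart = 10)$tot.withinss)

# Drop in total WSSD gained by adding each extra cluster
data.frame(k = ks[-1], drop_in_wssd = -diff(wssd))
```

On data like this, the drops should be large up to k = 3 and small afterwards, which is the same pattern the elbow plot shows visually.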

# Inference I (Statistical Inference)

# Inference II (Bootstrapping and Confidence Intervals)