# DSCI 100 Quiz 3 Review

> ## Author: Owen Kwong

### Loading relevant packages for notebook
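
The package-loading cell itself is not shown in these notes. A minimal sketch, assuming the notebook relies on **tidyverse** (for `read_csv`, `mutate`, `tibble`, and `ggplot`) and **broom** (for `augment` and `glance`), since those are the functions called in the code below:

```{r}
# Assumed package set, inferred from the functions used in these notes
library(tidyverse)
library(broom)
```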

## Chapter 11: Clustering

### 11.1 Overview of Clustering

Clustering is an exploratory data analysis technique used to see whether there are meaningful clusters/subgroups in the data. The learning objectives for this chapter center on the K-means algorithm:
- Describe the kinds of situations where clustering is appropriate and what insight it may extract
- Explain the K-means clustering algorithm
- Interpret the output of a K-means analysis
- Differentiate between clustering and classification
- Identify when scaling is needed and carry it out
- Perform K-means clustering
- Use the elbow plot method to determine the best K value
- Visualize the output of K-means in R using colored scatter plots
- Describe the advantages, limitations, and assumptions of the K-means algorithm

### 11.2 Clustering process and K-means

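Clustering is an unsupervised task: there are no response labels to check against, so judging the quality of a clustering is not straightforward. The course does not go in depth on this because it is complex. K-means clustering works by making an initial clustering and then improving the assignment until it cannot be improved further. In K-means clustering, quality is measured by the *within-cluster sum-of-squared-distances* (WSSD). Computing the WSSD for a cluster takes two steps. The first step is to find the cluster center by averaging each variable over the points in the cluster. For a cluster with four points, the center is: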

$$ μ_{x} = \frac{1}{4}(x_{1}+x_{2}+x_{3}+x_{4}), \qquad μ_{y} = \frac{1}{4}(y_{1}+y_{2}+y_{3}+y_{4}) $$

The second step is to add up the squared straight-line distance between each point in the cluster and the cluster center. For the same four-point cluster, the WSSD is:

$$ S^2 = ((x_{1} - μ_{x})^2 + (y_{1} - μ_{y})^2) + ((x_{2} - μ_{x})^2 + (y_{2} - μ_{y})^2) + ((x_{3} - μ_{x})^2 + (y_{3} - μ_{y})^2) + ((x_{4} - μ_{x})^2 + (y_{4} - μ_{y})^2) $$
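
The two formulas above can be checked by hand in R. This is a small sketch with made-up coordinates; the data frame `cluster_pts` and its values are assumptions for illustration, not data from the course.

```{r}
# Hypothetical four-point cluster (values made up for illustration)
cluster_pts <- data.frame(
  x = c(1, 2, 3, 2),
  y = c(1, 3, 2, 2)
)

# Step 1: the cluster center is the mean of each coordinate
mu_x <- mean(cluster_pts$x)
mu_y <- mean(cluster_pts$y)

# Step 2: add up the squared straight-line distances to the center
wssd <- sum((cluster_pts$x - mu_x)^2 + (cluster_pts$y - mu_y)^2)

c(mu_x = mu_x, mu_y = mu_y, wssd = wssd)
```

Fitting `kmeans(cluster_pts, centers = 1)` and reading its `tot.withinss` element should give the same WSSD, since a single cluster's center is just the mean of all points.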

**The algorithm** 

The algorithm begins by randomly assigning a roughly equal number of observations to each of the K clusters. It then repeats the following two steps until the cluster assignments no longer change:

1. **Center Update:** Compute the center of each cluster.
2. **Label Update:** Reassign each point to the cluster with the nearest center.

Note: 

- This algorithm is guaranteed to stop after a finite number of iterations.
- K-means can settle on a poor clustering depending on the random initialization. This can be addressed by randomly re-initializing a few times and picking the clustering with the lowest total WSSD.
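
The center update and label update steps can be sketched from scratch in base R. This is only an illustration of the idea under simple assumptions: `simple_kmeans()` and `toy_points` are hypothetical names, the data are simulated, and the sketch does not handle a cluster becoming empty. For real analyses, use the built-in `kmeans` function.

```{r}
set.seed(2024)

# Hypothetical toy data: two well-separated groups of 10 points each
toy_points <- rbind(
  cbind(rnorm(10, mean = 0, sd = 0.3), rnorm(10, mean = 0, sd = 0.3)),
  cbind(rnorm(10, mean = 5, sd = 0.3), rnorm(10, mean = 5, sd = 0.3))
)

# simple_kmeans() is an illustrative helper, not a library function.
# It does not handle clusters that become empty.
simple_kmeans <- function(x, k, max_iter = 100) {
  # Initialization: random labels with roughly equal cluster sizes
  labels <- sample(rep(seq_len(k), length.out = nrow(x)))
  for (iter in seq_len(max_iter)) {
    # Center update: mean of the points currently in each cluster
    centers <- do.call(rbind, lapply(seq_len(k), function(j) {
      colMeans(x[labels == j, , drop = FALSE])
    }))
    # Label update: squared distance from every point to every center
    sq_dist <- sapply(seq_len(k), function(j) {
      rowSums(sweep(x, 2, centers[j, ])^2)
    })
    new_labels <- apply(sq_dist, 1, which.min)
    if (all(new_labels == labels)) break  # assignments stopped changing
    labels <- new_labels
  }
  list(cluster = labels, centers = centers)
}

fit <- simple_kmeans(toy_points, k = 2)
table(fit$cluster)
```

Because the starting labels are random, different seeds can lead to different final clusterings, which is why re-initializing several times is recommended.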

**Choosing K**

To choose the number of clusters, compute the total WSSD for a range of K values and plot total WSSD against K. Then select the K at the "elbow" of the plot, the point after which increasing K gives only a small further decrease in total WSSD.

### 11.3 Data Preprocessing 

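K-means uses straight-line distance to decide which points are similar, so a variable measured on a larger scale would have a larger effect on the distance. To give every variable equal influence, we scale (standardize) the data first: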

**Example code:**

```
not_standardized_data <- read_csv("data/penguins_not_standardized.csv") #(1)

standardized_data <- not_standardized_data |> #(2)
  mutate(across(everything(), scale))
```

1. Reads the CSV file and assigns the result to a variable.
2. Mutates across every column of the tibble, standardizing each one so it has mean 0 and standard deviation 1.
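
Since the CSV file above may not be available, here is a small self-contained sketch of what standardizing does. The tibble `toy_measurements` and its values are made up for illustration, and wrapping `scale` in `as.numeric` is a variation on the code above that keeps each column a plain numeric vector.

```{r}
library(tidyverse)

# Hypothetical measurements on very different scales
toy_measurements <- tibble(
  flipper_length_mm = c(180, 190, 200, 210),
  body_mass_g = c(3000, 3500, 4000, 4500)
)

# Standardize every column: subtract the mean, divide by the standard deviation
toy_standardized <- toy_measurements |>
  mutate(across(everything(), ~ as.numeric(scale(.x))))

# Each column should now have mean 0 and standard deviation 1
toy_standardized |>
  summarize(across(everything(), list(mean = mean, sd = sd)))
```

After scaling, body mass no longer dominates the distance calculation just because its raw values are in the thousands.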

### 11.4 K-means with cluster number

The **kmeans** function is called here with two arguments. The first one is the data. The second one is the number of centers (clusters).

```{r}
penguin_clust <- kmeans(standardized_data, centers = 3)
```

Then the **broom** package is needed to use the **augment** function. The augment function takes in the model and the original data frame, and returns a data frame with the data plus a new `.cluster` column holding each observation's cluster assignment.

```{r}
library(broom)

clustered_data <- augment(penguin_clust, standardized_data)
```

Finally, the cluster assignments can be visualized using **ggplot**.

```{r}
# Scatter plot of the standardized data, colored by cluster assignment
cluster_plot <- ggplot(clustered_data,
                       aes(x = flipper_length_mm, y = bill_length_mm, color = .cluster)) +
  geom_point(size = 2) +  # point size belongs in geom_point(), not ggplot()
  labs(x = "Flipper Length (standardized)", y = "Bill Length (standardized)", color = "Cluster") +
  scale_color_manual(values = c("dodgerblue3", "darkorange3", "goldenrod1")) +
  theme(text = element_text(size = 12))

cluster_plot
```

To find the total WSSD for a model, use the **glance** function from **broom**. The total WSSD is in the `tot.withinss` column of its output.

```{r}
glance(penguin_clust)
```

### 11.5 K-means to determine cluster number

To find the total WSSD for a variety of K values, first create a data frame with a column of K values:

```{r}
penguin_clust_ks <- tibble(k = 1:9)
```

Then we use **rowwise** + **mutate** to apply the **kmeans** function within each row, once for each K. However, because the **kmeans** function returns a model object (not a vector), we need to store the results in a list column. This works because both vectors and lists are legitimate data structures for data frame columns. To do this, we wrap the call in the **list** function.

```{r}
penguin_clust_ks <- tibble(k = 1:9) |>
  rowwise() |>
  mutate(penguin_clusts = list(kmeans(standardized_data, k)))
```

To get one of the clusterings out of the list column in the data frame, use **pull**. **pull** returns a data frame column as a simpler data structure; here, that is a list. Then use the **pluck** function to extract a single item from that list. For example, plucking 1 extracts the first clustering.
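
A minimal sketch of pulling out the first clustering. To keep it self-contained, the built-in `iris` measurements stand in for the penguin data; the names `std_iris`, `clust_ks`, and `first_clust` are assumptions for this example.

```{r}
library(tidyverse)

set.seed(1)
# Stand-in data: standardized numeric columns of the built-in iris dataset
std_iris <- as.data.frame(scale(iris[, 1:4]))

clust_ks <- tibble(k = 1:9) |>
  rowwise() |>
  mutate(clusts = list(kmeans(std_iris, k)))

# pull() returns the list column as a list; pluck(1) takes its first element
first_clust <- clust_ks |>
  pull(clusts) |>
  pluck(1)

class(first_clust)
```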

Use **mutate** again to apply **glance** to each of the K-means clustering objects to get the clustering statistics (including WSSD). Since the output of **glance** is a data frame, the **list** function is needed again. The result is a data frame with three columns: one for K, one for the K-means clustering objects, and one for the clustering statistics.

Finally, we extract the total WSSD from the list column of clustering statistics (named `glanced` here, but the name is arbitrary). Each item in that column is a data frame, so we use the **unnest** function to unpack the data frames into simpler column data types.
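
The **glance** and **unnest** steps described above can be sketched together as follows. Again the built-in `iris` measurements stand in for the penguin data, and `std_iris`, `clust_ks`, and `clustering_statistics` are names chosen for this example.

```{r}
library(tidyverse)
library(broom)

set.seed(1)
# Stand-in data: standardized numeric columns of the built-in iris dataset
std_iris <- as.data.frame(scale(iris[, 1:4]))

clust_ks <- tibble(k = 1:9) |>
  rowwise() |>
  mutate(clusts = list(kmeans(std_iris, k)),  # one model object per row
         glanced = list(glance(clusts)))      # one-row data frame of statistics per model

# Unpack the statistics so tot.withinss becomes an ordinary column next to k
clustering_statistics <- clust_ks |>
  unnest(glanced)

clustering_statistics |>
  select(k, tot.withinss)
```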

Now that `tot.withinss` and `k` are columns in the same data frame, we can plot total WSSD against K to choose K.
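
A sketch of the elbow plot. As a shortcut that differs from the glance/unnest route, this example reads `tot.withinss` directly from each fitted model; the `iris` stand-in data and the names `std_iris`, `elbow_stats`, and `elbow_plot` are assumptions for illustration.

```{r}
library(tidyverse)

set.seed(1)
# Stand-in data: standardized numeric columns of the built-in iris dataset
std_iris <- as.data.frame(scale(iris[, 1:4]))

# Total WSSD for each K, taken straight from the kmeans result
elbow_stats <- tibble(k = 1:9) |>
  rowwise() |>
  mutate(tot.withinss = kmeans(std_iris, k)$tot.withinss) |>
  ungroup()

elbow_plot <- ggplot(elbow_stats, aes(x = k, y = tot.withinss)) +
  geom_point() +
  geom_line() +
  labs(x = "K", y = "Total within-cluster sum of squares") +
  scale_x_continuous(breaks = 1:9)

elbow_plot
```

The K to pick is where the line bends: total WSSD drops steeply before that point and only slowly after it.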

To try multiple random initializations, pass the **nstart** argument to the **kmeans** call. A suitable nstart value depends on the size and characteristics of the dataset and on available computing power. A larger nstart value makes it more likely that a good clustering is found, but the trade-off of running so many clusterings is time.
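
A short sketch of using **nstart**, with the built-in `iris` measurements as stand-in data (the names `std_iris`, `single_start`, and `multi_start` are assumptions for this example). With `nstart = 10`, `kmeans` runs 10 random initializations and keeps the one with the lowest total WSSD.

```{r}
set.seed(1)
# Stand-in data: standardized numeric columns of the built-in iris dataset
std_iris <- scale(iris[, 1:4])

# One random initialization versus the best of ten
single_start <- kmeans(std_iris, centers = 3, nstart = 1)
multi_start <- kmeans(std_iris, centers = 3, nstart = 10)

c(single = single_start$tot.withinss, multi = multi_start$tot.withinss)
```

The two runs use different random starts, so the single-start result is not guaranteed to be worse, but the multi-start result is less likely to be a poor clustering.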