# Clustering

## General

Clustering is a data analysis technique that divides a dataset into smaller subgroups (clusters) whose members are related in some way.

It can be used to separate:
- Online customers into groups with similar purchasing behaviours 
- People with similar genetic properties 
- Documents that correspond to different topics 

It is considered an *unsupervised* task because it tries to understand data without response variables.

The clusters found by a clustering algorithm can help **generate new questions** and **improve predictive analyses**.

## Algorithm

The K-means algorithm groups data into *K* clusters. The quality of a clustering is measured by the *WSSD* (within-cluster sum-of-squared-distances).

Total WSSD is computed by taking the squared distance between each data point and the center of its assigned cluster, then summing over all points.
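
Written as a formula, total WSSD looks like the expression below. The notation is an assumption introduced here for illustration and is not from these notes: $C_k$ is the set of points assigned to cluster $k$, and $\mu_k$ is the center (mean) of that cluster.

$$
\text{WSSD}_{\text{total}} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
$$

A smaller value means the points sit closer to their own cluster centers.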

This is the algorithm process: 
1. Begin the K-means algorithm by picking *K* and assigning each data point uniformly at random to one of the *K* clusters.
2. K-means then repeats two steps, which reduce total WSSD, until the cluster assignments stop changing <br>
**i) Center Update:** Compute the center of each cluster <br>
**ii) Label Update:** Reassign each point to the cluster with the nearest center
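
The steps above can be sketched in base R roughly as follows. This is a minimal illustration, not how **kmeans** is implemented internally: the toy data, the variable names, and the cap of 100 iterations are all assumptions, and the sketch does not handle a cluster becoming empty.

```r
# Minimal K-means sketch in base R (illustrative only; use kmeans() in practice).
set.seed(1)

# Hypothetical toy data: two well-separated groups in two dimensions.
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2)
)
K <- 2

# Step 1: random initial labels (balanced so that no cluster starts empty).
labels <- sample(rep(1:K, length.out = nrow(toy)))

for (iteration in 1:100) {
  # Center update: mean of the points currently in each cluster (K x 2 matrix).
  centers <- t(sapply(1:K, function(k) colMeans(toy[labels == k, , drop = FALSE])))

  # Label update: squared distance from every point to every center (n x K matrix),
  # then pick the nearest center for each point.
  sq_dist <- sapply(1:K, function(k) rowSums(sweep(toy, 2, centers[k, ])^2))
  new_labels <- max.col(-sq_dist, ties.method = "first")

  # Stop once no point changes cluster.
  if (all(new_labels == labels)) break
  labels <- new_labels
}

# Total WSSD: squared distance from each point to its own cluster center.
total_wssd <- sum((toy - centers[labels, ])^2)
total_wssd
```

Neither step can increase total WSSD, which is why the loop settles on a stable assignment.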

K-means should be run several times from different random starts (random restarts), keeping the run with the lowest total WSSD, to avoid getting stuck in a bad solution.
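
As a rough sketch of what a random restart involves, the code below runs **kmeans** ten times from single random starts and keeps the best run. The toy data and the choice of ten runs are assumptions made for illustration.

```r
set.seed(10)

# Hypothetical toy data: three groups in two dimensions.
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 4), ncol = 2),
  matrix(rnorm(60, mean = 8), ncol = 2)
)

# Run K-means 10 times, each from a single random start.
runs <- lapply(1:10, function(i) kmeans(toy, centers = 3, nstart = 1))
wssd_per_run <- sapply(runs, function(fit) fit$tot.withinss)

# Keep the run with the lowest total WSSD.
best_run <- runs[[which.min(wssd_per_run)]]
best_run$tot.withinss
```

Passing `nstart = 10` to a single `kmeans` call does this bookkeeping for you.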

## Data preprocessing 

```
library(tidyverse)

standardized_data <- 
    not_standardized_data |>
    select(c(...)) |>
    mutate(across(everything(), scale)) ##(1)##
```

1. K-means relies on distances between points, so the variables should be standardized first. Here, across(everything(), scale) applies **scale** to every selected column.
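
To see what standardizing does, here is a small base R sketch. The data frame and its column names are hypothetical; the point is only that after **scale** each column has mean 0 and standard deviation 1, so no variable dominates the distance calculation because of its units.

```r
# Hypothetical data with two variables on very different scales.
not_standardized <- data.frame(
  income_dollars = c(32000, 54000, 71000, 48000, 95000),
  age_years = c(23, 35, 47, 31, 58)
)

# scale() subtracts each column's mean and divides by its standard deviation.
standardized <- as.data.frame(scale(not_standardized))

colMeans(standardized)    # each column mean is (numerically) 0
sapply(standardized, sd)  # each column standard deviation is 1
```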

## With K value

```
library(broom) ##(1)##

kmeans_object <- kmeans(standardized_data, centers = 3) ##(2)##

clustered_data <- augment(kmeans_object, standardized_data) ##(3)##

cluster_plot <- ggplot(clustered_data, aes(x = VALUE, y = VALUE, color = .cluster)) + ##(4)##
  geom_point(size = 2) +
  labs(x = "VALUE", y = "VALUE", color = "Cluster") + 
  scale_color_manual(values = c("dodgerblue3", "darkorange3", "goldenrod1")) + 
  theme(text = element_text(size = 12))

glance(kmeans_object) ##(5)##
```

1. The broom package provides the **augment** and **glance** functions used below.
2. **kmeans** runs the K-means algorithm with *K* = 3 and assigns each data point to a cluster.
3. **augment** takes the model and the original data frame, and returns the data frame with the cluster assignments added in a .cluster column.
4. Plot the data, colored by cluster assignment.
5. **glance** returns summary statistics for kmeans_object, including the total WSSD (tot.withinss).

## Without K value

```
##(1)##
elbow_stats <-  tibble(k = 1:10) |>
 rowwise() |>
 mutate(clusters = list(kmeans(standardized_data, centers = k, nstart = ...)), 
 glanced = list(glance(clusters)))                                           

##(2)##
clustering_statistics <- elbow_stats |>
 select(-clusters) |>
 unnest(glanced)

##(3)##
elbow_plot <- ggplot(clustering_statistics, aes(x = k, y = tot.withinss)) +
 geom_point(size = 2) +
 geom_line() +
 labs(x = "K",
      y = "Total within-cluster sum of squares",
      title = "Elbow Plot") +
 scale_x_continuous(breaks = 1:10) +
 theme(text = element_text(size = 20))

```

1. Fit a K-means object for each k value in the tibble. rowwise makes the operations run once per row, using that row's k.
2. Unnest the glanced column, because each item in it is a data frame.
3. Plot total WSSD against k so that we can choose a good k value for our analysis: look for the "elbow", where increasing k further stops reducing the total WSSD by much.

# Inference I (Statistical Inference)

## Definitions

| Terms |  Definitions |
|----------------|------------|
| <p align="left">Mean | <p align="left">The sum of all of the data observations divided by the number of observations. |
| <p align="left">Median | <p align="left">The middle observation when a variable’s data are sorted, so that half of the observations lie on either side. |
| <p align="left">Variance | <p align="left">The mean of the squared distances of each observation from the mean value of all observations. |
| <p align="left">Standard deviation | <p align="left">The square root of the variance. |
| <p align="left">Proportion | <p align="left">The number of entities/objects with a specific characteristic divided by the total number of entities/objects. |
| <p align="left">Observation |  <p align="left">A quantity or quality (or a set of these) from a single member of a population. |
| <p align="left">Population | <p align="left">The entire set of entities/objects of interest. |
| <p align="left">Population Parameter | <p align="left">A numerical summary value about the population. <p align="left">_(Directly computing population parameters is often time-consuming and costly, and sometimes impossible)_ |
| <p align="left">Sample | <p align="left">A subset of entities/objects in the population. |
| <p align="left">Point Estimate | <p align="left"> A single value (statistic) calculated from sample data that estimates an unknown population parameter of interest. <p align="left">For example, the sample mean $\bar{x}$ is a point estimate of the population mean $\mu$. Similarly, the sample proportion $p$ is a point estimate of the population proportion $P$. <p align="left">_(High variation in the sampling distribution of the sample mean makes the point estimate unreliable.)_ |
| <p align="left">Statistical Inference | <p align="left">The process of using a sample to draw a conclusion about the broader population from which it was taken. |
| <p align="left">Sampling Variability | <p align="left">The tendency of point estimates to vary from one random sample to another. |
| <p align="left">Sampling Distribution | <p align="left">A distribution of point estimates, where each point estimate was calculated from a different random sample from the same population. |
| <p align="left">Random sampling | <p align="left">Selecting a subset of observations from a population where each observation is equally likely to be selected at any point during the selection process. |
| <p align="left">Representative sampling | <p align="left">Selecting a subset of observations from a population where the sample’s characteristics are a good representation of the population’s characteristics. |
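
The summary statistics defined in the table can be computed in base R as sketched below. The `price` vector is a hypothetical sample invented for this example. Note that R's built-in **var** divides by *n* - 1 rather than *n*, so it differs slightly from the "mean of squared distances" definition above.

```r
# Hypothetical sample of nightly prices, used only for illustration.
price <- c(80, 120, 95, 200, 150, 95, 110)
n <- length(price)

sample_mean <- sum(price) / n    # same as mean(price)
sample_median <- median(price)   # middle value after sorting

# Variance as defined in the table: mean squared distance from the mean.
variance_n <- mean((price - sample_mean)^2)

# R's var() divides by n - 1 instead of n, so it is slightly larger.
variance_n_minus_1 <- var(price)

std_dev <- sqrt(variance_n)

# Proportion of observations with a characteristic (here: price above 100).
prop_over_100 <- mean(price > 100)
```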

## Sample Size 

1) The mean of the sample means (across all possible samples) is equal to the population mean. In other words, the sampling distribution is centred at the population mean.
2) Increasing the size of the sample decreases the spread (i.e., the variability) of the sampling distribution, making it narrower. Therefore, a larger sample size results in a more reliable point estimate of the population parameter.
3) The distribution of the sample mean is roughly bell-shaped once the sample size is large enough.
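
These three points can be checked with a small simulation. The sketch below uses base R and a made-up, right-skewed population (an assumption for illustration, not the airbnb data used later); the helper `sample_means` is hypothetical.

```r
set.seed(123)

# Hypothetical right-skewed population of 10,000 nightly prices.
population <- rexp(10000, rate = 1 / 150)

# Approximate sampling distribution of the mean for a given sample size.
sample_means <- function(n, reps = 2000) {
  replicate(reps, mean(sample(population, size = n)))
}

means_n10 <- sample_means(10)
means_n100 <- sample_means(100)

mean(population)                      # population mean
c(mean(means_n10), mean(means_n100))  # both close to the population mean
c(sd(means_n10), sd(means_n100))      # spread shrinks as the sample size grows
```

Plotting `hist(means_n10)` next to `hist(means_n100)` shows the larger sample size giving a narrower, more bell-shaped distribution.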

## Sampling Distribution (for sample mean)

```
library(tidyverse)
library(infer) # provides rep_sample_n

samples <- rep_sample_n(airbnb, size = 40, reps = 20000) ##(1)##

sample_estimates <- samples |>  ##(2)##
  group_by(replicate) |>
  summarize(sample_mean = mean(price))

sampling_distribution_40 <- ggplot(sample_estimates, aes(x = sample_mean)) +   ##(3)##
  geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
  xlab("Sample mean price per night ($)") +
  theme(text = element_text(size = 20))

```

1. Take 20000 samples of size 40 from the population.
2. Group by sample number (replicate) and compute the mean price of each sample.
3. Plot the sample means on a histogram.

# Inference II (Bootstrapping and Confidence Intervals)

## Bootstrapping Overview 

Bootstrapping means sampling with replacement from our one original sample to build a bootstrap distribution.

The original sample stands in for the population, so bootstrapping gives us an approximation of the sampling distribution.

This is useful when we can only get one sample from the population.

## How to create a bootstrap distribution

For a sample of size *n*: 
1. Randomly select an observation from the original sample
2. Record its value
3. Put it back (replace it) so that it can be selected again
4. Repeat steps 1-3 until you have *n* observations
5. Calculate and record the point estimate you want (e.g., the mean) from these *n* observations
6. Repeat steps 1-5 many times to create an approximate sampling distribution.
7. Calculate a plausible range for the population parameter from this distribution
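
The steps above can be sketched in base R as follows. The sample values, the helper `one_bootstrap_mean`, the 5000 repetitions, and the choice of the middle 95% as the plausible range are all assumptions made for illustration.

```r
set.seed(42)

# Hypothetical original sample of 40 nightly prices.
one_sample_price <- round(rexp(40, rate = 1 / 150))
n <- length(one_sample_price)

# Steps 1-5: draw n observations with replacement and record the mean.
one_bootstrap_mean <- function() {
  resample <- sample(one_sample_price, size = n, replace = TRUE)
  mean(resample)
}

# Step 6: repeat many times to build the bootstrap distribution.
boot_means <- replicate(5000, one_bootstrap_mean())

# Step 7: the middle 95% of the bootstrap distribution as a plausible range.
plausible_range <- quantile(boot_means, c(0.025, 0.975))
plausible_range
```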

```
boot20000 <- one_sample |>
  rep_sample_n(size = 40, replace = TRUE, reps = 20000)

boot20000_means <- boot20000 |>
  group_by(replicate) |>
  summarize(mean = mean(price))

boot_est_dist <- ggplot(boot20000_means, aes(x = mean)) +
  geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
  xlab("Sample mean price per night ($)") +
  ggtitle("Bootstrap Distribution") +
  theme(text = element_text(size = 20))

```

## Comparison to true sampling 

1. The shape and spread of the true sampling distribution and the bootstrap distribution should be similar.
2. The mean of the bootstrap distribution is not the same as the mean of the sampling distribution. The bootstrap distribution is built by resampling the original sample, so it is centred at that sample's mean, whereas the sampling distribution is built from samples of the population and is centred at the population mean.

## Confidence Intervals

- One should think of a confidence interval as a range of plausible values for the population parameter, which may or may not fall within the interval. This is significantly different from a point estimate, which is a single plausible value for the population parameter.
- We can interpret a 90% confidence interval as: we are 90% confident that the true mean is captured by the interval. Or, in other words, across all 90% confidence intervals that could be calculated for the mean of the population of interest, we can expect that 90% of the intervals contain the true mean.

```
bounds <- boot20000_means |>
  select(mean) |>
  pull() |>
  quantile(c(0.025, 0.975)) # 2.5th and 97.5th percentiles: bounds of a 95% interval (use c(0.05, 0.95) for 90%)
```
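
The "across all intervals" interpretation above can be explored with a simulation. The sketch below is only an illustration under stated assumptions: the population is made up (normal, so its true mean is known), the helper `interval_covers_mean` is hypothetical, and each 90% interval is a percentile bootstrap interval from a fresh sample of size 40. With few samples or skewed data, the observed fraction can differ noticeably from 90%.

```r
set.seed(7)

# Hypothetical population with a known mean, so coverage can be checked.
population <- rnorm(10000, mean = 150, sd = 40)
true_mean <- mean(population)

# Build one 90% percentile bootstrap interval from one fresh sample
# and report whether it captures the true mean.
interval_covers_mean <- function(n = 40, boot_reps = 500) {
  one_sample <- sample(population, size = n)
  boot_means <- replicate(boot_reps, mean(sample(one_sample, size = n, replace = TRUE)))
  bounds <- quantile(boot_means, c(0.05, 0.95))
  bounds[[1]] <= true_mean && true_mean <= bounds[[2]]
}

# Fraction of 300 intervals that capture the true mean.
coverage <- mean(replicate(300, interval_covers_mean()))
coverage
```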