# Homework 08
This homework is based on the clustering lectures. Check the lecture notes and TA notes - they should help!

## Question 1
This question will walk you through creating your own `kmeans` function.

#### a) What are the steps of `kmeans`?
**Hint**: There are 4 steps/builder functions that you'll need.

In [None]:
Step 1: Randomly label the data
Step 2: Compute the cluster means
Step 3: Assign each point to the label of its nearest cluster mean
Step 4: Recompute the means and repeat until they stop changing

#### b) Create the builder function for step 1.

In [11]:
label_randomly <- function(n_points, n_clusters){
  # Build balanced labels 1..n_clusters, then shuffle them
  sample(((1:n_points) %% n_clusters) + 1, n_points, replace = FALSE)
}


#### c) Create the builder function for step 2.

In [12]:
get_cluster_means <- function(data, labels){
  data %>%
    mutate(label__ = labels) %>%
    group_by(label__) %>%
    summarize(across(everything(), mean), .groups = "drop") %>%
    arrange(label__)
}


#### d) Create the builder function for step 3.
*Hint*: There are two ways to do this part - one is significantly more efficient than the other. You can do either.  

In [13]:
assign_cluster_fast <- function(data, means){
  data_matrix <- as.matrix(data)
  means_matrix <- as.matrix(means %>% dplyr::select(-label__))
  # Index every (data point, cluster mean) pair: dii repeats each point
  # once per mean, mii cycles through the means for each point
  dii <- sort(rep(1:nrow(data), nrow(means)))
  mii <- rep(1:nrow(means), nrow(data))
  data_repped <- data_matrix[dii, ]
  means_repped <- means_matrix[mii, ]
  # Squared Euclidean distance for all pairs in one vectorized pass
  diff_squared <- (data_repped - means_repped)^2
  all_distances <- rowSums(diff_squared)
  # For each data point, keep the index of the closest mean
  tibble(dii=dii, mii=mii, distance=all_distances) %>%
    group_by(dii) %>%
    arrange(distance) %>%
    filter(row_number()==1) %>%
    ungroup() %>%
    arrange(dii) %>%
    pull(mii)
}


#### e) Create the builder function for step 4.

In [14]:
kmeans_done <- function(old_means, new_means, eps=1e-6){
  om <- as.matrix(old_means)
  nm <- as.matrix(new_means)
  # Mean Euclidean distance that the cluster means moved between iterations
  m <- mean(sqrt(rowSums((om - nm)^2)))
  m < eps
}


#### f) Combine them all into your own `kmeans` function.

In [15]:
mykmeans <- function(data, n_clusters, eps=1e-6, max_it = 1000, verbose = FALSE){
  labels <- label_randomly(nrow(data), n_clusters)
  old_means <- get_cluster_means(data, labels)
  done <- FALSE
  it <- 0
  while(!done && it < max_it){
    labels <- assign_cluster_fast(data, old_means)
    new_means <- get_cluster_means(data, labels)
    if(kmeans_done(old_means, new_means, eps)){
      done <- TRUE
    } else {
      old_means <- new_means
      it <- it + 1
      if(verbose){
        cat(sprintf("%d\n", it))
      }
    }
  }
  list(labels=labels, means=new_means)
}

## Question 2
This is when we'll test your `kmeans` function.
#### a) Read in the `voltages_df.csv` data set. 

In [16]:
library(tidyverse)
voltages <- read_csv("voltages_df.csv")

Rows: 900 Columns: 250
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (250): 0, 1.00401606425703, 2.00803212851406, 3.01204819277108, 4.016064...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


#### b) Call your `kmeans` function with 3 clusters. Print the results with `results$labels` and `results$means`. 

In [17]:
results <- mykmeans(voltages, 3)
print(results$labels)
print(results$means)

  [1] 1 2 2 2 2 2 3 3 2 1 1 1 3 1 2 1 1 1 1 1 2 2 2 2 2 1 1 1 3 3 3 3 2 3 3 2 2
 [38] 1 3 2 1 1 1 3 2 3 3 2 3 3 2 3 1 1 3 1 2 1 3 2 1 1 2 1 3 3 2 3 1 3 1 2 3 3
 [75] 2 2 1 3 3 3 3 3 2 3 3 2 2 2 1 3 3 3 2 3 2 2 3 2 2 3 3 3 2 3 1 1 3 1 1 2 1
[112] 1 2 3 1 3 1 1 2 2 2 2 1 1 3 2 2 3 1 3 2 1 2 3 1 2 2 2 1 3 1 2 2 3 1 1 2 1
[149] 1 3 1 3 2 2 3 1 3 3 2 1 3 3 1 2 1 3 2 1 2 1 1 3 2 1 3 3 2 3 1 2 3 2 3 1 3
[186] 2 2 1 2 1 1 2 1 1 3 3 1 2 1 1 3 1 3 3 2 1 2 3 2 1 2 3 1 3 3 2 3 3 2 2 2 2
[223] 2 2 2 3 3 3 3 1 2 1 3 2 2 3 3 1 1 3 3 1 3 3 2 3 1 3 3 2 3 2 1 3 2 3 3 1 1
[260] 1 2 2 3 1 2 2 2 3 3 3 1 2 2 2 3 1 3 3 2 2 3 1 3 1 1 2 3 2 2 3 3 1 3 3 1 2
[297] 1 3 3 2 2 3 1 1 1 3 1 2 2 2 1 1 1 1 1 1 2 1 2 1 1 3 2 3 1 1 1 1 2 1 2 1 3
[334] 1 2 2 3 3 1 2 3 2 3 3 1 3 2 3 1 1 3 2 2 2 1 3 2 1 3 3 2 2 2 2 1 3 1 2 1 1
[371] 1 1 1 2 2 3 3 1 1 1 1 3 3 1 1 3 3 2 2 3 1 2 2 2 2 3 2 2 3 1 1 1 1 2 2 1 1
[408] 3 3 3 2 3 3 1 2 2 2 3 2 2 2 1 1 2 1 3 2 2 2 2 3 3 2 3 1 2 1 1 1 3 3 3 3 1
[445] 2 1 1 1 3 2 1 1 1 2 2 1 2 2 1 3 2 

#### c) Call R's `kmeans` function with 3 clusters. Print the results with `kresults$cluster` and `kresults$centers`. 
*Hint*: Use the `as.matrix()` function to make the `voltages_df` data frame a matrix before calling `kmeans()`.

In [18]:
kresults <- kmeans(as.matrix(voltages), 3)
print(kresults$cluster)
print(kresults$centers)

  [1] 3 3 3 3 3 3 1 1 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 2 2 1 3 2 2 3 3
 [38] 3 2 3 3 3 3 2 3 1 1 3 2 2 3 1 3 3 2 3 3 3 2 3 3 3 3 3 2 2 3 1 3 2 3 3 1 1
 [75] 3 3 3 1 2 1 2 2 3 1 2 3 3 3 3 1 1 2 3 2 3 3 2 3 3 2 1 2 3 2 3 3 2 3 3 3 3
[112] 3 3 2 3 1 3 3 3 3 3 3 3 3 2 3 3 1 3 2 3 3 3 2 3 3 3 3 3 1 3 3 3 1 3 3 3 3
[149] 3 1 3 1 3 3 1 3 2 1 3 3 1 2 3 3 3 2 3 3 3 3 3 2 3 3 2 2 3 2 3 3 1 3 1 3 2
[186] 3 3 3 3 3 3 3 3 3 1 2 3 3 3 3 1 3 2 2 3 3 3 2 3 3 3 1 3 1 1 3 1 2 3 3 3 3
[223] 3 3 3 1 2 2 2 3 3 3 1 3 3 2 2 3 3 1 1 3 1 1 3 2 3 1 1 3 2 3 3 2 3 1 1 3 3
[260] 3 3 3 1 3 3 3 3 1 1 1 3 3 3 3 2 3 2 2 3 3 1 3 1 3 3 3 1 3 3 2 1 3 1 1 3 3
[297] 3 2 2 3 3 1 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 3 3 3 3 3 3 3 3 1
[334] 3 3 3 2 2 3 3 2 3 1 2 3 1 3 1 3 3 2 3 3 3 3 2 3 3 1 2 3 3 3 3 3 1 3 3 3 3
[371] 3 3 3 3 3 1 2 3 3 3 3 2 2 3 3 2 2 3 3 2 3 3 3 3 3 1 3 3 2 3 3 3 3 3 3 3 3
[408] 1 1 1 3 1 1 3 3 3 3 1 3 3 3 3 3 3 3 2 3 3 3 3 1 2 3 2 3 3 3 3 3 1 2 1 1 3
[445] 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 2 3 

#### d) Are your labels/clusters the same? If not, why? Are your means the same?

In [None]:
The labels are not the same. Our function starts from a random labeling, which differs from the random starting centers that R's `kmeans` picks, and the cluster numbers themselves are arbitrary, so cluster 1 in one run may correspond to cluster 3 in the other. The means should be fairly similar, since the underlying data set is the same: when both runs converge to the same solution, the means agree up to a relabeling and some rounding differences.

## Question 3
#### a) Explain the process of using a for loop to assign clusters for kmeans.

In [None]:
A for loop visits each data point in turn: for each point, it computes the distance to every cluster center, picks the nearest center, and assigns that center's label to the point.
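
As a minimal sketch of the loop described above, the function below walks over every point and every center. The name `assign_cluster_loop` and the toy matrices are illustrative assumptions, not part of the assignment's required code.

```r
# Hypothetical loop-based assignment: one pass per point, one inner pass per center
assign_cluster_loop <- function(data_matrix, means_matrix) {
  n <- nrow(data_matrix)
  k <- nrow(means_matrix)
  labels <- integer(n)
  for (i in seq_len(n)) {
    best_dist <- Inf
    for (j in seq_len(k)) {
      # Squared Euclidean distance from point i to center j
      d <- sum((data_matrix[i, ] - means_matrix[j, ])^2)
      if (d < best_dist) {
        best_dist <- d
        labels[i] <- j
      }
    }
  }
  labels
}

# Toy data (made up for illustration): two points near each center
toy_points  <- rbind(c(0, 0), c(0.1, 0), c(5, 5), c(5.2, 4.9))
toy_centers <- rbind(c(0, 0), c(5, 5))
loop_labels <- assign_cluster_loop(toy_points, toy_centers)
print(loop_labels)
```

The work grows with the number of points times the number of centers, and every one of those steps is executed by the R interpreter.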

#### b) Explain the process of vectorizing the code to assign clusters for kmeans.

In [None]:
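Vectorizing means applying a single operation to many data points at once, so the same action does not have to be repeated in an explicit loop. For cluster assignment, the distances from all points to all centers are computed with whole-matrix operations, and the nearest center for each point is then selected in one step.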
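
One possible vectorized formulation, sketched under the assumption that the data and the centers are plain numeric matrices, expands the squared distance as ||x - m||^2 = ||x||^2 - 2 x·m + ||m||^2 so that all distances come from one matrix product. The name `assign_cluster_vectorized` and the toy data are hypothetical; this is a different route from the replicate-the-rows approach in Question 1.

```r
# Hypothetical vectorized assignment using matrix algebra only
assign_cluster_vectorized <- function(data_matrix, means_matrix) {
  # n x k matrix of dot products between every point and every center
  cross <- data_matrix %*% t(means_matrix)
  # n x k matrix of squared distances: ||x||^2 + ||m||^2 - 2 x.m
  d2 <- outer(rowSums(data_matrix^2), rowSums(means_matrix^2), "+") - 2 * cross
  # Column index of the smallest distance in each row
  max.col(-d2, ties.method = "first")
}

# Toy data (made up for illustration)
toy_points  <- rbind(c(0, 0), c(0.1, 0), c(5, 5), c(5.2, 4.9))
toy_centers <- rbind(c(0, 0), c(5, 5))
vec_labels <- assign_cluster_vectorized(toy_points, toy_centers)
print(vec_labels)
```

No explicit loop appears here: the matrix product, `outer`, and `max.col` each process the whole matrix in a single call.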

#### c) State which (for loops or vectorizing) is more efficient and why.

In [None]:
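Vectorization is more efficient. A vectorized operation runs once over the whole matrix in R's compiled code, whereas a for loop pays the interpreter's overhead on every iteration.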

## Question 4
#### When does `kmeans` fail? What assumption does `kmeans` use that causes it to fail in this situation?

In [None]:
K-means fails when the clusters are non-spherical or unequal in size and density. It assumes that clusters are spherical and of roughly equal size and density.
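
A small illustration of this failure, assuming a synthetic data set of two concentric rings (the radii, noise level, and seed are made up for the example): the rings are clearly two groups, but neither is a spherical blob around its own center.

```r
set.seed(42)
n <- 200
theta <- seq(0, 2 * pi, length.out = n + 1)[-1]

# Points scattered around a circle of the given radius
make_ring <- function(radius) {
  r <- radius + rnorm(n, sd = 0.1)
  cbind(x = r * cos(theta), y = r * sin(theta))
}

rings <- rbind(make_ring(1), make_ring(5))
truth <- rep(1:2, each = n)

fit <- kmeans(rings, centers = 2, nstart = 10)

# Share of points whose cluster matches the true ring, under the better of
# the two possible label matchings
agreement <- max(mean(fit$cluster == truth), mean(fit$cluster == 3 - truth))
print(table(truth, cluster = fit$cluster))
print(agreement)
```

Because each point goes to its nearest mean, two clusters are separated by a straight line, and no straight line can separate an inner ring from the ring that surrounds it.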

## Question 5
#### What assumption do Gaussian mixture models make?

In [None]:
A Gaussian mixture model assumes that the data are drawn from a mixture of Gaussian distributions, whose individual parameters are estimated from the data.
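
To make that assumption concrete, here is a sketch of the generative process for a one-dimensional mixture of two Gaussians. The weights, means, and standard deviations are arbitrary illustrative values; in practice they are the unknowns that fitting the model estimates.

```r
set.seed(1)
n <- 500
weights <- c(0.3, 0.7)   # assumed mixing proportions, must sum to 1
mus     <- c(-2, 3)      # assumed component means
sds     <- c(0.5, 1.5)   # assumed component standard deviations

# Step 1: pick a component for each observation according to the weights
component <- sample(1:2, n, replace = TRUE, prob = weights)
# Step 2: draw each observation from its component's Gaussian
x <- rnorm(n, mean = mus[component], sd = sds[component])

# The implied density is the weighted sum of the component densities
mixture_density <- function(v) {
  weights[1] * dnorm(v, mus[1], sds[1]) + weights[2] * dnorm(v, mus[2], sds[2])
}
total_mass <- integrate(mixture_density, -Inf, Inf)$value
print(total_mass)
```

Unlike `kmeans`, each component here has its own spread and its own share of the data.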

## Question 6
#### What assumption does spectral clustering make? Why does this help us?

In [None]:
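It assumes that clusters are defined by connectivity (points linked to one another through nearby neighbors) rather than by a spherical shape. This helps us because it can recover non-spherical clusters that `kmeans` fails on.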
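
A rough base-R sketch of the idea, assuming a Gaussian similarity kernel and the unnormalized graph Laplacian (one of several common variants), is shown below. The ring data, `sigma`, and the seed are illustrative choices and not taken from the lectures.

```r
set.seed(7)
n <- 100
theta <- seq(0, 2 * pi, length.out = n + 1)[-1]

# Points scattered around a circle of the given radius
make_ring <- function(radius) {
  r <- radius + rnorm(n, sd = 0.05)
  cbind(r * cos(theta), r * sin(theta))
}
rings <- rbind(make_ring(1), make_ring(5))
truth <- rep(1:2, each = n)

# Similarity graph: nearby points get weights near 1, distant points near 0
sigma <- 0.5
W <- exp(-as.matrix(dist(rings))^2 / (2 * sigma^2))
diag(W) <- 0

# Unnormalized graph Laplacian L = D - W
L <- diag(rowSums(W)) - W
eig <- eigen(L, symmetric = TRUE)

# eigen() sorts eigenvalues in decreasing order, so the eigenvectors for the
# k smallest eigenvalues are the last k columns
k <- 2
embedding <- eig$vectors[, (nrow(L) - k + 1):nrow(L)]

# Ordinary k-means, but on the spectral embedding instead of the raw points
fit <- kmeans(embedding, centers = k, nstart = 10)
agreement <- max(mean(fit$cluster == truth), mean(fit$cluster == 3 - truth))
print(agreement)
```

Points on the same ring are linked through chains of close neighbors, so they land close together in the embedding even though the ring itself is not spherical.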

## Question 7
#### Define the gap statistic method. What do we use it for?

In [None]:
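The gap statistic method compares the within-cluster dispersion for each candidate number of clusters with the dispersion expected under a reference distribution that has no cluster structure. It is used to identify the optimal number of clusters in a data set.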
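
As a hedged example, the `cluster` package (a recommended package that ships with most R installations) provides `clusGap()` and `maxSE()` for this calculation. The three synthetic blobs, `K.max`, `B`, and the selection rule chosen here are assumptions made for illustration.

```r
library(cluster)

set.seed(1)
# Three made-up, well-separated blobs of 50 points each in two dimensions
blobs <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2)
)

# Gap statistic for k = 1..6, using B = 50 reference data sets;
# nstart is passed through to kmeans()
gap <- clusGap(blobs, FUNcluster = kmeans, K.max = 6, B = 50, nstart = 10)

# Pick k with the rule from the original gap statistic paper
best_k <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "Tibs2001SEmax")
print(gap$Tab)
print(best_k)
```

The `gap` column of `gap$Tab` holds the difference between the expected and the observed log dispersion for each k, and `SE.sim` holds its simulation standard error.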