**Question 1**

a)
1.  Assign each point to one of the N clusters at random.
2.  Calculate the mean position (centroid) of each cluster from the current assignments.
3.  Loop through the points and reassign each one to the cluster whose centroid is closest.
4.  Repeat steps 2 and 3 until the centroids stop moving.
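
The two alternating steps can be written compactly. The notation is an assumption introduced here for illustration, not taken from the notes: $x_i$ is point $i$, $c_i$ is its cluster label, $\mu_j$ is the centroid of cluster $j$, and $C_j = \{i : c_i = j\}$.

$$c_i \leftarrow \arg\min_{j} \lVert x_i - \mu_j \rVert^2, \qquad \mu_j \leftarrow \frac{1}{|C_j|} \sum_{i \in C_j} x_i$$

Iteration stops once no $\mu_j$ changes between passes (in practice, once the change falls below a small tolerance).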

In [26]:
#b)
# Load dplyr, which the helper functions below rely on
library(dplyr)

#Step 1: Randomly assign cluster labels
label_randomly <- function(n_points, n_clusters) {
  sample(rep(1:n_clusters, length.out = n_points))
}

#c)
#Step 2: Compute cluster means (centroids)
get_cluster_means <- function(data, labels) {
  data %>%
    mutate(cluster = labels) %>%
    group_by(cluster) %>%
    summarise(across(everything(), mean), .groups = "drop") %>%
    arrange(cluster)
}

#d)
#Step 3: Assign each point to nearest centroid
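assign_cluster <- function(data, means) {
  data_mat <- as.matrix(data)
  mean_mat <- as.matrix(select(means, -cluster))
  n <- nrow(data_mat)

  # Distances from every point (rows) to every centroid (columns);
  # drop = FALSE keeps a matrix even when there is a single centroid
  dists <- as.matrix(dist(rbind(data_mat, mean_mat)))[
    seq_len(n), n + seq_len(nrow(mean_mat)), drop = FALSE
  ]

  # Return the centroid's cluster label rather than its row position,
  # so labels stay correct if a cluster ends up empty and is dropped
  unname(means$cluster[apply(dists, 1, which.min)])
}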

#e)
#Step 4: Check for convergence (centroids not moving)
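kmeans_done <- function(old_means, new_means, eps = 1e-6) {
  om <- as.matrix(select(old_means, -cluster))
  nm <- as.matrix(select(new_means, -cluster))
  # If a cluster emptied out and was dropped, the centroids have changed
  if (nrow(om) != nrow(nm)) return(FALSE)
  # Mean Euclidean distance moved by the centroids
  shift <- mean(sqrt(rowSums((om - nm)^2)))
  shift < eps
}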

#f)
#Combine all into one function
mykmeans <- function(data, n_clusters = 3, eps = 1e-6, max_iter = 100) {
  data <- as.data.frame(data)

  labels <- label_randomly(nrow(data), n_clusters)
  old_means <- get_cluster_means(data, labels)
  
  for (i in 1:max_iter) {
    labels <- assign_cluster(data, old_means)
    new_means <- get_cluster_means(data, labels)
    
    if (kmeans_done(old_means, new_means, eps)) {
      message("Converged after ", i, " iterations.")
      break
    }
    old_means <- new_means
  }
  
  list(labels = labels, means = new_means)
}

**Question 2**

In [27]:
#a)
voltages_df <- read.csv("~/Downloads/voltages_df.csv")

head(voltages_df)

row,X0,X1.00401606425703,X2.00803212851406,X3.01204819277108,X4.01606425702811,X5.02008032128514,X6.02409638554217,X7.0281124497992,X8.03212851405623,X9.03614457831325,⋯,X240.963855421687,X241.967871485944,X242.971887550201,X243.975903614458,X244.979919678715,X245.983935742972,X246.987951807229,X247.991967871486,X248.995983935743,X250
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,-1.031463,1.104665,0.8982475,0.4142208,-1.1490888,-1.07851,-1.002401,-0.9182083,-0.8215574,-0.7023741,⋯,-0.7392703,-0.7633694,-0.7792297,-0.784434,-0.777982,-0.7608812,-0.736983,-0.7138199,-0.7014771,-0.7056029
2,-1.031463,1.246157,1.0948587,0.9039343,0.465441,-1.160496,-1.112005,-1.0721319,-1.0385633,-1.0075872,⋯,-0.8859964,-0.8511675,-0.8064307,-0.7534558,-0.6954785,-0.6404759,-0.6105817,-0.6348313,-0.6767121,-0.7140939
3,-1.031463,1.216111,1.0557873,0.8417629,-0.5636836,-1.147653,-1.101783,-1.0645681,-1.0336197,-1.0051885,⋯,-0.9503509,-0.9122991,-0.8625269,-0.8016142,-0.7306757,-0.6527186,-0.5812047,-0.587556,-0.6768023,-0.7206992
4,-1.031463,1.166244,0.9899628,0.7230858,-1.1806746,-1.125106,-1.077167,-1.0370309,-1.0027385,-0.9709488,⋯,-0.9498509,-0.9236047,-0.8896604,-0.850212,-0.8086367,-0.7700917,-0.7418958,-0.7315532,-0.7409824,-0.7644406
5,-1.031463,1.230222,1.07467,0.873388,0.2116394,-1.153728,-1.106832,-1.0691075,-1.038335,-1.0108187,⋯,-0.8710166,-0.8237315,-0.7590005,-0.6698582,-0.5061566,1.0975578,0.9348933,0.6673692,-1.1669718,-1.1047735
6,-1.031463,1.25765,1.1112886,0.9322788,0.604562,-1.166325,-1.113488,-1.0696859,-1.0332517,-1.0009558,⋯,-0.9092342,-0.8715416,-0.8213803,-0.7589597,-0.6842115,-0.5958772,-0.4766488,1.1008087,0.9169321,0.5137345


In [28]:
#b)
results <- mykmeans(voltages_df, n_clusters = 3)

results$labels
results$means

Converged after 3 iterations.



cluster,X0,X1.00401606425703,X2.00803212851406,X3.01204819277108,X4.01606425702811,X5.02008032128514,X6.02409638554217,X7.0281124497992,X8.03212851405623,⋯,X240.963855421687,X241.967871485944,X242.971887550201,X243.975903614458,X244.979919678715,X245.983935742972,X246.987951807229,X247.991967871486,X248.995983935743,X250
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,-1.031463,0.9381238,0.7619864,0.3631543,-1.1179412,-1.051145,-0.9766807,-0.8694758,-0.6892375,⋯,-0.7900387,-0.8070676,-0.8182598,-0.8207339,-0.8132928,-0.7969549,-0.77567272,-0.75689256,-0.7496483,-0.7570393
2,-1.031463,1.2439759,1.0924697,0.900444,0.3011754,-1.159714,-1.1098127,-1.0685484,-1.0338649,⋯,-0.9107472,-0.8732292,-0.8234477,-0.7607812,-0.6682618,-0.3380864,-0.04693168,0.02820486,-0.41135,-0.8115784
3,-1.031463,1.3093239,1.1616772,0.9787498,0.6481497,-1.16861,-1.1196122,-1.0590962,-0.9943176,⋯,0.3364266,0.8337474,0.7125412,-0.2659209,-1.0409179,-1.0587745,-1.01359887,-0.96467777,-0.9151047,-0.8610245


In [20]:
#c)
kmeans_final <- kmeans(as.matrix(voltages_df), centers = 3)

kmeans_final$cluster   
kmeans_final$centers   

cluster,X1.00401606425703,X2.00803212851406,X3.01204819277108,X4.01606425702811,X5.02008032128514,X6.02409638554217,X7.0281124497992,X8.03212851405623,X9.03614457831325,X10.0401606425703,⋯,X240.963855421687,X241.967871485944,X242.971887550201,X243.975903614458,X244.979919678715,X245.983935742972,X246.987951807229,X247.991967871486,X248.995983935743,X250
1,1.2439759,1.0924697,0.900444,0.3011754,-1.159714,-1.1098127,-1.0685484,-1.0338649,-1.0022396,-0.9699741,⋯,-0.9107472,-0.8732292,-0.8234477,-0.7607812,-0.6682618,-0.3380864,-0.04693168,0.02820486,-0.41135,-0.8115784
2,0.9381238,0.7619864,0.3631543,-1.1179412,-1.051145,-0.9766807,-0.8694758,-0.6892375,-0.5661321,-0.2497152,⋯,-0.7900387,-0.8070676,-0.8182598,-0.8207339,-0.8132928,-0.7969549,-0.77567272,-0.75689256,-0.7496483,-0.7570393
3,1.3093239,1.1616772,0.9787498,0.6481497,-1.16861,-1.1196122,-1.0590962,-0.9943176,-0.9237437,-0.8457536,⋯,0.3364266,0.8337474,0.7125412,-0.2659209,-1.0409179,-1.0587745,-1.01359887,-0.96467777,-0.9151047,-0.8610245


In [29]:
#d)
table(results$labels, kmeans_final$cluster)

abs(results$means[, -1] - kmeans_final$centers)

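#The cluster labels differ because k-means cluster numbering is arbitrary: in the table, my cluster 1 is R's
#cluster 2, my cluster 2 is R's cluster 1, and cluster 3 matches. Each group of 300 points is assigned
#identically, so both methods found the same clusters.
#The raw difference above subtracts the centroids row by row without matching the labels first, and the
#printed kmeans centers appear to start at X1.004..., not X0, so its large values do not indicate
#disagreement. Comparing matched rows of the printed centers (my cluster 1 and R's cluster 2, for example)
#shows the same coordinates.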

   
      1   2   3
  1   0 300   0
  2 300   0   0
  3   0   0 300

X0,X1.00401606425703,X2.00803212851406,X3.01204819277108,X4.01606425702811,X5.02008032128514,X6.02409638554217,X7.0281124497992,X8.03212851405623,X9.03614457831325,⋯,X240.963855421687,X241.967871485944,X242.971887550201,X243.975903614458,X244.979919678715,X245.983935742972,X246.987951807229,X247.991967871486,X248.995983935743,X250
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
2.275439,0.1543459,0.1384575,0.06197889,0.04177278,0.05866745,0.09186773,0.16438916,0.31300219,0.403842,⋯,0.08319049,0.01638017,0.057478647,0.15247205,0.47520645,0.7500232,0.80387758,0.3455426,0.06193014,2.001015
1.969587,0.4819894,0.7293154,2.01838517,1.35232063,0.18303334,0.24033693,0.37931093,0.46773283,0.7525245,⋯,0.10367959,0.05496935,0.002713851,0.05251161,0.12869303,0.43758637,0.70996087,0.77785315,0.34568932,1.749702
2.340787,0.1476467,0.1829274,0.33060012,1.81675918,0.04899733,0.06051595,0.06477861,0.07057394,0.0779901,⋯,0.49732076,0.12120623,0.978462037,0.77499703,0.01785663,0.04517566,0.04892109,0.04957305,0.05408025,2.170348
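
Because cluster numbering is arbitrary, a centroid comparison is only meaningful once the rows have been matched. The following is a minimal sketch of one way to do that, assuming two centroid matrices with the same columns. `match_centers` and the toy matrices are hypothetical names made up for this illustration and are not part of the code above.

```r
# Hypothetical helper: pair each row of `a` with its nearest row of `b`
# and report the largest absolute coordinate difference after matching.
match_centers <- function(a, b) {
  a <- as.matrix(a)
  b <- as.matrix(b)
  stopifnot(ncol(a) == ncol(b))  # columns must line up before subtracting
  # Cross-distance block: rows of `a` against rows of `b`
  d <- as.matrix(dist(rbind(a, b)))[
    seq_len(nrow(a)), nrow(a) + seq_len(nrow(b)), drop = FALSE
  ]
  partner <- unname(apply(d, 1, which.min))
  list(
    partner = partner,
    max_abs_diff = max(abs(a - b[partner, , drop = FALSE]))
  )
}

# Toy centroids: `centers_b` holds the same rows as `centers_a`, renumbered
centers_a <- rbind(c(0, 0), c(5, 5), c(10, 0))
centers_b <- centers_a[c(2, 1, 3), ]
matched <- match_centers(centers_a, centers_b)
matched$partner       # row of centers_b matched to each row of centers_a
matched$max_abs_diff  # largest coordinate difference after matching
```

With the objects from this question, one would pass the centroid columns of `results$means` and `kmeans_final$centers`, provided both contain the same columns in the same order.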


**Question 3**

a) With a for loop, clusters are assigned one point at a time: the loop visits every data point, computes its Euclidean distance to each centroid, and assigns the point to the nearest one. For large datasets this is slow.

b) The vectorized version computes the distances between all points and all centroids in a single matrix operation and then assigns each point to its closest cluster.

c) Vectorizing is the more efficient choice because R's matrix and vector operations run in compiled code, which avoids the interpreter overhead that an explicit loop incurs on every iteration.
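
To make the contrast concrete, here is a small sketch of both approaches on made-up data. The function names, the toy points, and the centroids are all assumptions for illustration; this is not the `assign_cluster` function from Question 1.

```r
set.seed(42)
pts <- matrix(rnorm(200), ncol = 2)              # 100 toy points in 2-D
centroids <- rbind(c(-1, -1), c(0, 0), c(1, 1))  # 3 toy centroids

# Loop version: visit every point and every centroid explicitly
assign_loop <- function(x, centers) {
  labels <- integer(nrow(x))
  for (i in seq_len(nrow(x))) {
    best <- Inf
    for (j in seq_len(nrow(centers))) {
      d <- sum((x[i, ] - centers[j, ])^2)        # squared Euclidean distance
      if (d < best) {
        best <- d
        labels[i] <- j
      }
    }
  }
  labels
}

# Vectorized version: all squared distances from one matrix expression,
# using |x - c|^2 = |x|^2 - 2 x.c + |c|^2
assign_vectorized <- function(x, centers) {
  d2 <- outer(rowSums(x^2), rowSums(centers^2), "+") - 2 * x %*% t(centers)
  max.col(-d2, ties.method = "first")            # column of the smallest distance
}

# Check that both versions produce the same assignment
same <- identical(assign_loop(pts, centroids), assign_vectorized(pts, centroids))
same
```

Wrapping each call in `system.time()` on a larger matrix is one way to measure the speed difference; no timings are claimed here.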

**Question 4**

As the table in the notes points out, real clusters are not always spherical or evenly sized, and they can differ in density. k-means also does not choose the number of clusters itself: k must be supplied, and a poorly chosen value gives clusters that do not represent the data. The shape and size failures arise because k-means assumes that clusters are roughly spherical and similar in size.

**Question 5**

GMMs assume that the data is generated from a mixture of multiple Gaussian distributions, each with its own mean and covariance. GMMs do not assume uniformity across clusters: each component can have a different shape, size, and orientation. This makes them more flexible in modeling complex data distributions.
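
In symbols, the mixture assumption can be written as follows. The notation is an assumption introduced here rather than taken from the notes: $K$ components with mixing weights $\pi_k$, means $\mu_k$, and covariance matrices $\Sigma_k$.

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1$$

Because every component has its own $\Sigma_k$, the components are free to differ in shape, size, and orientation. Roughly speaking, k-means resembles the restricted case in which all components share one spherical covariance $\sigma^2 I$ and points are assigned to a single component.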

**Question 6**

Spectral clustering assumes only that points that are close together are more likely to belong to the same cluster. This minimal assumption allows it to detect complex or irregularly shaped clusters that k-means might miss, which makes it more flexible than k-means at revealing natural groupings in complex datasets.
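
As an illustration, the sketch below applies a basic unnormalized form of spectral clustering to two concentric rings, a shape with no spherical clusters. Everything here is an assumption made for the example: the function name `spectral_sketch`, the Gaussian similarity, the bandwidth `sigma`, and the synthetic data. The method covered in the notes may differ in its details.

```r
# Hypothetical sketch of unnormalized spectral clustering in base R
spectral_sketch <- function(x, k, sigma = 0.5) {
  d2 <- as.matrix(dist(x))^2             # squared pairwise distances
  w <- exp(-d2 / (2 * sigma^2))          # Gaussian similarity: close points score high
  diag(w) <- 0
  lap <- diag(rowSums(w)) - w            # graph Laplacian L = D - W
  eig <- eigen(lap, symmetric = TRUE)    # eigenvalues come back in decreasing order
  n <- nrow(lap)
  # Eigenvectors belonging to the k smallest eigenvalues
  embedding <- eig$vectors[, (n - k + 1):n, drop = FALSE]
  kmeans(embedding, centers = k, nstart = 10)$cluster
}

# Synthetic data: two noisy concentric rings of 100 points each
set.seed(1)
theta <- rep(seq(0, 2 * pi, length.out = 101)[-1], times = 2)
radius <- rep(c(1, 5), each = 100) + rnorm(200, sd = 0.05)
rings <- cbind(radius * cos(theta), radius * sin(theta))
true_ring <- rep(1:2, each = 100)

ring_labels <- spectral_sketch(rings, k = 2)
table(true_ring, ring_labels)            # cross-tabulate rings against found clusters
```

The similarity graph links neighbors along each ring but leaves the two rings essentially unconnected, which is why closeness alone is enough to separate them.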

**Question 7**

The gap statistic method is used to determine the optimal number of clusters in a dataset. For each candidate number of clusters, it compares the within-cluster dispersion of the data with the dispersion expected for a reference dataset of randomly placed points. The difference between the two, the "gap", indicates how much better the clustering is than what random structure would produce. The ideal number of clusters is the one where this gap is largest.
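
A simplified base-R sketch of this idea follows. The helper names `log_wk` and `gap_sketch`, the uniform reference data drawn over each feature's observed range, and the toy blobs are assumptions for illustration. The sketch picks the largest gap, as described above; the published procedure additionally applies a standard-error adjustment, and `clusGap()` in the `cluster` package offers a fuller implementation.

```r
# Log of the total within-cluster sum of squares for k clusters
log_wk <- function(x, k) {
  log(kmeans(x, centers = k, nstart = 10)$tot.withinss)
}

# Hypothetical simplified gap statistic: Gap(k) = mean(log W_k of reference) - log W_k of data
gap_sketch <- function(x, k_max = 5, n_ref = 10) {
  ranges <- apply(x, 2, range)           # 2 x p matrix: min and max of each column
  sapply(seq_len(k_max), function(k) {
    ref <- replicate(n_ref, {
      # Reference dataset: uniform draws over each column's observed range
      x_ref <- apply(ranges, 2, function(r) runif(nrow(x), r[1], r[2]))
      log_wk(x_ref, k)
    })
    mean(ref) - log_wk(x, k)
  })
}

# Toy data: three well-separated blobs of 50 points each
set.seed(7)
blobs <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 8, sd = 0.3), ncol = 2)
)
gaps <- gap_sketch(blobs)
gaps             # one gap value per candidate k
which.max(gaps)  # candidate k with the largest gap
```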