# K-means


In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from plotnine import *
import numpy as np


## Together
### Algorithm


1. Randomly (or, in order to make convergence quicker, cleverly) choose K centroids in the feature space
2. Assign each data point to the centroid/cluster closest to it
3. Recalculate each centroid by taking the mean (for each predictor) of all the data points in each cluster.
4. Repeat Steps 2 and 3 until convergence
    - either cluster assignments don't change from step to step OR
    - the centroids don't change much from step to step
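
The four steps above can be sketched on a tiny made-up dataset. This is only an illustration under stated assumptions: the toy points, the choice of starting centroids, and the iteration cap are all invented for the example, and the vectorized layout is just one of several ways to write the loop.

```python
import numpy as np

# Toy data: two obvious groups in 2-D (values made up for illustration)
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
              [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
centroids = X[[0, 3]].copy()  # Step 1: start from two of the data points

for _ in range(100):  # safety cap on iterations (an assumption)
    # Step 2: distance from every point to every centroid, then pick the closest
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 3: each centroid becomes the mean of the points assigned to it
    new_centroids = np.array([X[labels == j].mean(axis=0)
                              for j in range(len(centroids))])
    # Step 4: stop once the centroids no longer move
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels)
print(centroids)
```

On this toy data the first three points end up in one cluster and the last three in the other.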

### Assumptions reminder
One thing to keep in mind with K-Means is that it assumes *spherical* variance within each cluster. That means K-means behaves as if, within each cluster, all predictors have the same variance. Roughly, this means we should be able to draw a sphere (or circle) around each of our clusters, even if the cluster isn't perfectly spherical.

<img src="https://drive.google.com/uc?export=view&id=1RHRfcPIjIZ_-IMOE00gyzVadlaGxXPh8" width=350px />

Unsupervised learning has a slightly easier workflow!

Previous Workflow (Supervised):
1. Separate your data into X (predictors) and y (outcome), and maybe do some model validation set up.
2. Create an Empty Model.
3. Call `.fit()` using your training data.
4. Call `.predict()` on ANY X data (train or test) to get the model prediction for that data.
5. Assess the model

New workflow for unsupervised:
1. Load in our data + z-score
2. Create Empty Model
3. Fit Model
4. Assess your model (using `.predict()`/`labels_` and by looking at plots, or something like a silhouette score)

TODO: Discussion

You may notice that there is no model validation step. Why don't we care about model validation and data leakage in unsupervised learning?

You may also notice that z-scoring still must be done. Why do we still care about z-scoring in K-Means?
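
As a starting point for the z-scoring question, here is a small sketch of how feature scale affects distances. The data are made up, and the two columns are deliberately on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: column 0 is in the thousands, column 1 is between 0 and 1
X_demo = np.array([[1000.0, 0.1], [1010.0, 0.9],
                   [2000.0, 0.2], [2010.0, 0.8]])

# Raw distance between rows 0 and 1 is driven almost entirely by column 0
raw_dist = np.linalg.norm(X_demo[0] - X_demo[1])

# After z-scoring, both columns contribute on a comparable scale
X_demo_scaled = StandardScaler().fit_transform(X_demo)
scaled_dist = np.linalg.norm(X_demo_scaled[0] - X_demo_scaled[1])

print(raw_dist, scaled_dist)
print(X_demo_scaled.mean(axis=0), X_demo_scaled.std(axis=0))
```

After scaling, each column has mean 0 and standard deviation 1, so neither feature dominates the distance calculation simply because of its units.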

In [None]:
# 1. Load the data + standardize
beyonce = pd.read_csv("https://raw.githubusercontent.com/katherinehansen2/CPSC392Hansen/refs/heads/main/data/Beyonce_data.csv")

predictors = ["energy", "danceability", "valence"]

# .copy() so that adding a column later does not trigger a SettingWithCopyWarning
X = beyonce[predictors].copy()
# notice that there is no y

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Create empty model (K = 4, matching the plot titles below)
km = KMeans(n_clusters=4)

# 3. Fit model + predict
labels = km.fit_predict(X_scaled)

# 4. Assess model
silhouette_avg = silhouette_score(X_scaled, labels)
print(silhouette_avg)

# Add cluster labels to the original DataFrame
X["clusters"] = labels

# 5: Plot the results. We are plotting each combo of predictors
print(ggplot(X, aes(x="energy", y="danceability", color="factor(clusters)")) +
      geom_point() + theme_minimal() +
      labs(x="Energy", y="Danceability", title="KMeans Clustering Results for K = 4",
           color="Clusters"))

print(ggplot(X, aes(x="energy", y="valence", color="factor(clusters)")) +
      geom_point() + theme_minimal() +
      labs(x="Energy", y="Valence", title="KMeans Clustering Results for K = 4",
           color="Clusters"))

print(ggplot(X, aes(x="valence", y="danceability", color="factor(clusters)")) +
      geom_point() + theme_minimal() +
      labs(x="Valence", y="Danceability", title="KMeans Clustering Results for K = 4",
           color="Clusters"))


Now let's try a new method: using the *data* to select K. This code aims to cluster wines based on `citric.acid` and `residual.sugar`.

In [None]:
wine = pd.read_csv("https://raw.githubusercontent.com/cmparlettpelleriti/CPSC392ParlettPelleriti/master/Data/wineLARGE.csv")

# drop rows with missing values, then reset the row index
wine.dropna(inplace = True)
wine.reset_index(drop = True, inplace = True)

# grab data we want to cluster
feats = ["citric.acid", "residual.sugar"]

X = wine[feats].copy()

# standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# create dictionary to store the silhouette score for each k
metrics = {"sil": [], "k": []}

for i in range(2,20):
    km = KMeans(n_clusters = i)
    labels = km.fit_predict(X_scaled)
    sil = silhouette_score(X_scaled, labels)

    metrics["sil"].append(sil)
    metrics["k"].append(i)

df = pd.DataFrame(metrics)


In [2]:
print(ggplot(df, aes(x = "k", y = "sil")) +
  geom_line() + theme_minimal() +
    labs(x = "K", y = "Mean Silhouette Score",
         title = "Silhouette Scores for Different Ks"))
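
Besides reading the best K off the plot, the highest-scoring row can be pulled out of the metrics data frame directly. The sketch below uses a small stand-in data frame with hypothetical scores (not results from the wine data); the same two lines would apply to the `df` built in the loop above.

```python
import pandas as pd

# Hypothetical scores standing in for the `df` built in the loop above
df_demo = pd.DataFrame({"k": [2, 3, 4, 5],
                        "sil": [0.41, 0.52, 0.47, 0.39]})

# Row with the highest mean silhouette score
best_row = df_demo.loc[df_demo["sil"].idxmax()]
best_k = int(best_row["k"])
print(best_k)
```

The highest silhouette score is a useful guide rather than a rule, so it is still worth checking the plotted clusters for the chosen K.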

In [3]:
km = KMeans(7)
labels = km.fit_predict(X_scaled)

X["cluster"] = labels
print(ggplot(X, aes(x = "citric.acid", y = "residual.sugar", color = "factor(cluster)")) +
      geom_point() +
      theme_minimal() +
      scale_color_discrete(name = "Cluster") +
      labs(x = "Citric Acid",
           y = "Residual Sugar",
           title = "7 Cluster Solution"))

## ICA

### 1.

TODO: Using the purchases.csv dataset, select 2 features and make clusters using K-Means

In [6]:
data = pd.read_csv("https://raw.githubusercontent.com/katherinehansen2/CPSC392Hansen/refs/heads/main/data/purchases.csv")
data.head()

In [None]:
# TODO: Select 2 features and make K-Means clusters



### 2.

In this ICA, you'll be writing a k-means algorithm from scratch. Some steps are done for you.

Your function, `KM()`, should take in two arguments:

- `df` a dataframe with all of your data.
- `k` the number of clusters to fit.

and apply K-Means to it. Remember that the steps of K means are:

**1**. DONE FOR YOU. Randomly select k centroids.
- I recommend choosing `k` random data points from `df`. You can do this by using `np.random.choice(range(0,df.shape[0]), k, replace=False)` to select the indices for `k` distinct randomly selected rows (without `replace=False`, the same row could be picked twice, which would leave one cluster empty). THEN use those indices to grab the chosen rows from df and store them. Technically, you are also able to start with any random point in your space, even if it is not a member of your df, but we will use this simplified version.

**2**. Assign each data point from `df` to the closest centroid.
- You'll need to calculate the distance between each data point and each centroid. Perhaps try using `np.linalg.norm()` (see Hint 3).
    - I recommend storing cluster/centroid membership by having a dictionary with one key for each cluster/centroid, and the value is a list of row indices pertaining to the data points in each cluster (see HINT 1 for an example of this)

**3**. Re-calculate the cluster mean/centroid
- For each centroid/cluster, find the mean value for each predictor/feature by taking the mean for that feature from all the data points assigned to the centroid/cluster. (see Hint 2)

**4**. DONE FOR YOU. Repeat Steps 2-3 until the changes in centroid positions are all less than 0.0001
- In other words, calculate how far each centroid moved. If all of them moved less than 0.0001 units, then stop.

**5**. DONE FOR YOU. Return the cluster assignments by returning the dictionary of the clusters and their memberships that you create in Step 2.

### HINT 1:

You can store your cluster memberships like this (in this case k = 3, and there are only 21 data points, but your function should take any k, and any number of data points). Each row index appears in exactly one cluster:

```
clust = {0: [0,7,4,5,12,18,20],
         1: [10,8,3,2,14,17,19],
         2: [1,6,9,11,13,15,16]}
```
      

### HINT 2:

If a cluster contained the following data points:

|           | X1 | X2 | X3 |
|-----------|----|----|----|
| Person 1  | 5  | 2  | 9  |
| Person 2  | 2  | 3  | 2  |
| Person 3  | 1  | 6  | 1  |
| Person 4  | 7  | 1  | 4  |
| Person 5  | 3  | 2  | 5  |
| Person 6  | 1  | 1  | 8  |
| Person 7  | 7  | 0  | 6  |
| Person 8  | 0  | 7  | 2  |
| Person 9  | 2  | 3  | 7  |
| Person 10 | 4  | 6  | 1  |


Then the centroid for that cluster would be [a,b,c] where a, b, and c are the means of each column X1, X2, and X3:

a = (5 + 2 + 1 + 7 + 3 + 1 + 7 + 0 + 2 + 4)/10

b = (2 + 3 + 6 + 1 + 2 + 1 + 0 + 7 + 3 + 6)/10

c = (9 + 2 + 1 + 4 + 5 + 8 + 6 + 2 + 7 + 1)/10
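
The same arithmetic can be checked with pandas. This sketch rebuilds the table above as a data frame (the name `cluster_points` is made up for the example) and takes the column means.

```python
import pandas as pd

# The ten data points from the table above
cluster_points = pd.DataFrame({
    "X1": [5, 2, 1, 7, 3, 1, 7, 0, 2, 4],
    "X2": [2, 3, 6, 1, 2, 1, 0, 7, 3, 6],
    "X3": [9, 2, 1, 4, 5, 8, 6, 2, 7, 1],
})

# .mean() on a DataFrame averages each column, giving one value per feature
centroid = cluster_points.mean()
print(centroid.to_list())  # a = 3.2, b = 3.1, c = 4.5
```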


### HINT 3:

To calculate the distance between two vectors, you can use:

```
distance_ab = np.linalg.norm(a-b)

```

where `a` and `b` are the two vectors.
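
As a quick check of that call, here is a small example with made-up vectors whose distance is easy to verify by hand.

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# a - b = [-3, -4], so the Euclidean distance is sqrt(9 + 16) = 5
distance_ab = np.linalg.norm(a - b)
print(distance_ab)  # 5.0
```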

### HINT 4:

The `np.argmin()` function takes in a list (or array) of values, and returns the *index* of the smallest one.

For example:

```
my_list = [1,6,2,5,0]

np.argmin(my_list)
```

This code returns 4, because the smallest value (0) in `my_list` is at index 4.

In [None]:
def choose_centroids(df,k):
    # DONE, DON'T CHANGE ANYTHING

    '''
    use the row nums in c to grab the k data points and store them in centroids.
    centroids should be a list of rows (each row contains the data point you chose)
    to be a cluster center
    '''

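    # replace=False so the same row is never chosen twice (duplicates would leave a cluster empty)
    c = np.random.choice(range(0,df.shape[0]), k, replace=False)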
    centroids = [df.iloc[l] for l in c]
    return(centroids)

def choose_closest_cluster(centroids, df, k):
    '''
    Create a dictionary, clust, that stores the row numbers of all the data points
    in each cluster.

    e.g. {0: [0,7], 1: [1,6,12,13], 2: [4,5], 3: [3,8,9], 4: [2,10,11]}

    Loop through each data point, and calculate the distance between that data point
    and all the cluster centers (centroids), and store them in a list. Remember
    np.linalg.norm()!

    Then, use np.argmin() to figure out which centroid is the closest, and assign
    the data point to that cluster by adding its row number (stored in dataPoint)
    to the clust dictionary.
    '''

    # creating empty clust dictionary
    clust = {}
    for c in range(0,k):
        clust[c] = []

    # loop through each row in df
    for dataPoint in range(0,df.shape[0]):
        pass
        # TODO: calculate the distance between current data point and each centroid


        # TODO: find the centroid that's closest


        # TODO: add dataPoint to that cluster


    return(clust)

def recalculate_cluster_mean(clust, df, k):
    '''
    This function takes in a dictionary of cluster memberships and the data
    and returns the NEW cluster centers, stored in new_centroids. Cluster centers
    are calculated by taking the mean of each feature for all the data points in
    each cluster.

    new_centroids should be a list of arrays that represent the new cluster centers.

    Hint: what happens when you call .mean() on a dataframe?
    '''

    new_centroids = [[] for c in range(0,k)] # create a list of k empty placeholders, one per cluster

    for c in range(0,k):
        # TODO: calculate the center/mean of cluster c
        pass

    # turn our list into an array
    new_centroids = np.array(new_centroids)

    return(new_centroids)


In [None]:
def KM(df,k):

    # 1. randomly select k centroids
    centroids = choose_centroids(df,k)

    converged = False # has the algorithm converged yet?
    while not converged: # until the centroids stop moving

        # 2. assign data points to a cluster
        clust = choose_closest_cluster(centroids,df, k)

        # 3. re-calculate the center/centroid of each cluster
        new_centroids = recalculate_cluster_mean(clust,df, k)

        # 4. check whether you can stop iterating by checking whether the
        # distance between the previous position and current position is
        # less than 0.0001 for all k centroids.

        # calculate the distance between the old centroid values, and new_centroids values
        change = np.array([np.linalg.norm(centroids[i]-new_centroids[i]) for i in range(0,k)])

        # check whether all of them moved less than 0.0001 units.
        converged = np.all(change < 0.0001)

        # set new_centroids to be established centroids
        centroids = new_centroids






    # 5. Return cluster memberships dictionary, the structure
    # should look like this (but can have different k, and different
    # assignments depending on the data and the chosen starting centroids)
    # {0: [55, 72, 76, 85, 89, 93, 100, 104, 105, 107, 110, 119,
#          123, 132, 144, 201, 202, 203, 204, 205, 206, 207, 209,
#          210, 212, 213, 214, 215, 217, 218, 220, 221, 222, 223,
#          225, 226, 227, 228, 229, 231, 232, 233, 234, 237, 238,
#          241, 243, 245, 246, 247, 248, 249],
#     1: [8, 47, 102, 103, 111, 114, 117, 120, 126, 129, 131, 136,
#          141, 142, 143, 145, 146, 148],
#     2: [51, 101, 106, 108, 109, 112, 113, 115, 116, 118, 121,
#           122, 124, 125, 127, 128, 130, 133, 134, 135, 137, 138,
#           139, 140, 147, 149, 200, 219, 224, 230, 235, 239, 240,
#           242, 244],
#     3: [150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160,
#           161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171,
#           172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182,
#           183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193,
#           194, 195, 196, 197, 198, 199, 216, 236],
#     4: [0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16,
#           17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
#           31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
#           45, 46, 48, 49, 50, 52, 53, 54, 56, 57, 58, 59, 60, 61,
#           62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 73, 74, 75, 77,
#           78, 79, 80, 81, 82, 83, 84, 86, 87, 88, 90, 91, 92, 94,
#           95, 96, 97, 98, 99, 208, 211]}
    return(clust)

In [4]:
data = pd.read_csv("https://raw.githubusercontent.com/katherinehansen2/CPSC392Hansen/refs/heads/main/data/programmers3.csv")

data.head()

## Using your K-Means Function

Now that you have done the incredibly impressive work of writing your own K-means function, let's use it and compare the results to what we'd get from `sklearn`!

First, use your OWN function `KM()` to do K-means on `data` with k = 5. Then generate the cluster assignments using the code provided. Then make a ggplot scatterplot of your clusters.

Second, use sklearn's `KMeans()` function to do K-means on `data` with k = 5. Then generate the cluster assignments using `.predict()`. Then make a ggplot scatterplot of your clusters.

In [None]:
# run k-means

# create features list
feats = ["py", "r"]

# z score (use the programmers data, not the wine data stored in X earlier)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(data[feats])

# store z scored data in another data frame
data_z = data.copy()
data_z[feats] = X_scaled

# use your function
clusters = KM(data_z[feats], 5)

# generate assignments
assignments = np.array([999 for row in range(0, data_z.shape[0])])

for cluster in clusters:
    assignments[clusters[cluster]] = cluster

data["assignments_ME"] = assignments


# create ggplot scatter plot of data, using x, y and color = "assignments_ME"
(ggplot(data, aes("py", "r", color = "factor(assignments_ME)")) +
  geom_point() + theme_minimal() +
  scale_color_discrete(name = "Cluster"))

In [None]:
# USING SKLEARN

# z score the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(data[feats])

# create kmeans model and generate assignments
m = KMeans(5)
assignments = m.fit_predict(X_scaled)
# add assignments to data
data["assignments_SK"] = assignments



# create another ggplot scatter plot of data, using x, y and color = "assignments_SK"
(ggplot(data, aes("py", "r", color = "factor(assignments_SK)")) +
  geom_point() + theme_minimal() +
  scale_color_discrete(name = "Cluster"))


TODO: Compare the results of each
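
When comparing, keep in mind that cluster numbers are arbitrary: two runs can find the same groups but number them differently. One possible way to compare is a cross-tabulation plus the adjusted Rand index from `sklearn.metrics`. The sketch below uses small hypothetical assignment arrays, not the real `assignments_ME` and `assignments_SK` columns.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical assignments: the same grouping, but with the labels swapped around
assignments_me = np.array([0, 0, 1, 1, 2, 2])
assignments_sk = np.array([2, 2, 0, 0, 1, 1])

# Cross-tabulation: one nonzero cell per row and column means the groupings match
print(pd.crosstab(assignments_me, assignments_sk,
                  rownames=["ME"], colnames=["SK"]))

# Adjusted Rand index: 1.0 for identical groupings, regardless of label names
ari = adjusted_rand_score(assignments_me, assignments_sk)
print(ari)
```

With the real data, the two columns could be passed in place of the hypothetical arrays. Because both methods start from random centroids, the score may differ from run to run.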