# K-Means Clustering Example

Let's generate some fake data: people whose incomes and ages fall into randomly placed clusters.

## Imports

In [2]:
%matplotlib inline
from numpy import random, array, single, version
# Note: `single` above is NumPy's 32-bit float type (float32)
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale

print(f'numpy version {version.version}')

numpy version 1.26.3


## Create Fake-Data Generator

In [None]:
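# Create fake income/age clusters for N people in k clusters
def createClusteredData(N, k):
    random.seed(10)  # fixed seed so every run produces the same data
    # Integer division: if N is not divisible by k, the remainder is dropped
    pointsPerCluster = N // k
    X = []
    for i in range(k):
        # Pick a random center for this cluster
        incomeCentroid = random.uniform(20000.0, 200000.0)
        ageCentroid = random.uniform(20.0, 70.0)
        for j in range(pointsPerCluster):
            # Scatter points around the center (std dev: 10,000 for income, 2 for age)
            X.append([random.normal(incomeCentroid, 10000.0), random.normal(ageCentroid, 2.0)])
    return array(X)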

## Generate Fake Data

In [None]:
HOW_MANY_POINTS = 100
HOW_MANY_CLUSTERS = 5
data = createClusteredData(HOW_MANY_POINTS, HOW_MANY_CLUSTERS)

## Use K-Means Clustering to discover centroids

In [None]:
model = KMeans(n_clusters=HOW_MANY_CLUSTERS)

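# Standardize each feature to zero mean and unit variance so that income,
# with its much larger numeric range, doesn't dominate the distance calculation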
model = model.fit(scale(data))

# We can look at the clusters each data point was assigned to
print(f'labels: {model.labels_}')
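
Because the model is fitted on standardized data, its `cluster_centers_` are in standardized units rather than income and age. The sketch below shows one way to map centroids back to the original units. It is not part of the notebook above: it uses `StandardScaler` (which applies the same standardization as `scale` but remembers the mean and standard deviation) and a small hypothetical two-cluster dataset; the names `demo_data`, `demo_model`, and `centroids` and the `n_init` and `random_state` settings are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: two income/age blobs (not the notebook's data)
rng = np.random.default_rng(0)
demo_data = np.vstack([
    np.column_stack([rng.normal(40000.0, 5000.0, 50), rng.normal(30.0, 2.0, 50)]),
    np.column_stack([rng.normal(150000.0, 5000.0, 50), rng.normal(60.0, 2.0, 50)]),
])

# StandardScaler keeps the per-feature mean and std, so the transform can be undone
scaler = StandardScaler()
scaled = scaler.fit_transform(demo_data)

# n_init and random_state are set only to make this sketch reproducible
demo_model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# cluster_centers_ are in standardized units; map them back to income and age
centroids = scaler.inverse_transform(demo_model.cluster_centers_)
for income, age in centroids:
    print(f'income ~ {income:,.0f}, age ~ {age:.1f}')
```

The same pattern should carry over to the notebook's `data` and `HOW_MANY_CLUSTERS` if the `scale(data)` call is swapped for a fitted `StandardScaler`.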

## Visualize The Data

In [None]:
# Color each point by its assigned cluster label
modelLabels = model.labels_.astype(single)

plt.figure(figsize=(8, 6))
plt.scatter(data[:,0], data[:,1], c=modelLabels)
plt.xlabel('Income')
plt.ylabel('Age')
plt.show()

## Activity

Things to try: What happens if you don't scale the data? What happens if you choose different values of K? In the real world, you won't know the "right" value of K in advance; you'll need to find it by experimenting.
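
As a starting point for both experiments, here is a minimal sketch. It is not the notebook's own method: both datasets are hypothetical, the helper `cluster_agreement` is invented for illustration, and the silhouette score is just one common heuristic for comparing values of K. The first part compares K-Means with and without scaling on data whose groups differ only in age; the second part fits several values of K and reports inertia and silhouette score for each.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.preprocessing import scale

rng = np.random.default_rng(0)

# Part 1: scaling. Hypothetical data: two groups with the same income
# distribution that differ only in age.
income = rng.normal(50000.0, 10000.0, 200)
age = np.concatenate([rng.normal(25.0, 2.0, 100), rng.normal(65.0, 2.0, 100)])
age_data = np.column_stack([income, age])
true_groups = np.repeat([0, 1], 100)

def cluster_agreement(X):
    """Adjusted Rand index between K-Means labels and the true age groups."""
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    return adjusted_rand_score(true_groups, labels)

unscaled_score = cluster_agreement(age_data)
scaled_score = cluster_agreement(scale(age_data))
print(f'agreement unscaled: {unscaled_score:.2f}, scaled: {scaled_score:.2f}')

# Part 2: choosing K. Hypothetical data: three well-separated blobs.
centers = [(30000.0, 25.0), (90000.0, 45.0), (160000.0, 65.0)]
blobs = np.vstack([
    np.column_stack([rng.normal(ci, 5000.0, 60), rng.normal(ca, 2.0, 60)])
    for ci, ca in centers
])
blobs_scaled = scale(blobs)

silhouettes = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(blobs_scaled)
    silhouettes[k] = silhouette_score(blobs_scaled, km.labels_)
    print(f'k={k}: inertia={km.inertia_:.1f}, silhouette={silhouettes[k]:.3f}')

# Pick the K with the highest silhouette score
best_k = max(silhouettes, key=silhouettes.get)
print(f'best k by silhouette: {best_k}')
```

Without scaling, distances are dominated by income because its numeric range is thousands of times larger than that of age, so the unscaled run tends to split the points by income and ignore the age groups. Inertia generally keeps falling as K grows, which is why it cannot be minimized directly; a score such as the silhouette, or a visible "elbow" in the inertia values, gives a more useful signal.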