# <font color="#418FDE" size="6.5" uppercase>**Clustering Concepts**</font>

>Last update: 20260201.
    
By the end of this lecture, you will be able to:
- Describe clustering as grouping similar examples based on feature values. 
- Use simple distance-based reasoning to decide which examples belong together. 
- Interpret clusters in terms of meaningful patterns in a given context. 


## **1. Clustering Without Labels**

### **1.1. No target column**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_09/Lecture_A/image_01_01.jpg?v=1769971477" width="250">



>* Clustering works without labels or a target column
>* Algorithm groups examples using only feature similarities

>* No labels, so results lack fixed answers
>* Algorithm groups by similarity; humans interpret meanings

>* Analysts make creative choices when forming clusters
>* No labels; clusters reveal useful hidden structure
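
Because there is no target column, the grouping depends on which features the analyst chooses to look at. The minimal sketch below illustrates this with a hypothetical table of four customers (the names and values are assumptions, not real data). As a deliberately simple stand-in for a clustering algorithm, it splits the examples in two at the widest gap in a single feature; the helper `split_at_largest_gap` is illustrative only.

```python
# Hypothetical toy data: every column is a feature; there is no label column.
names = ["Alice", "Bob", "Cara", "Dan"]
ages = [25, 27, 40, 42]
purchases = [5, 6, 1, 6]


def split_at_largest_gap(names, values):
    """Split examples into two groups at the widest gap in one feature."""
    # Sort example positions by feature value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    # Measure the gap between each pair of neighboring values.
    gaps = [values[order[k + 1]] - values[order[k]] for k in range(len(order) - 1)]
    # Cut right after the widest gap.
    cut = gaps.index(max(gaps)) + 1
    low = sorted(names[i] for i in order[:cut])
    high = sorted(names[i] for i in order[cut:])
    return low, high


by_age = split_at_largest_gap(names, ages)
by_purchases = split_at_largest_gap(names, purchases)

print("Grouped by age:      ", by_age)
print("Grouped by purchases:", by_purchases)
```

The two groupings differ, and nothing in the table says which one is correct. A person has to judge which grouping is more useful for the question at hand.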



### **1.2. Similarity Based Groups**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_09/Lecture_A/image_01_02.jpg?v=1769971488" width="250">



>* Clustering groups examples using their feature values
>* Similar points in feature space form groups

>* Fruits are positioned by features in 3D space
>* Algorithm finds dense, similar groups without labels

>* Clustering finds hidden behavior patterns in data
>* Groups form from similar features, not labels
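
The fruit idea can be sketched in a few lines. The example below places five hypothetical fruits in a 3D feature space and groups them with a simple "leader" rule: a fruit joins the first group whose first member lies within a chosen distance, otherwise it starts a new group. The feature values, the 0-10 scale, and the `threshold` are all assumptions made for illustration, and this rule is only one of many possible grouping strategies.

```python
import math

# Hypothetical fruits as points in a 3D feature space:
# (size, sweetness, softness), each on an assumed 0-10 scale.
fruits = {
    "lemon": (4, 2, 3),
    "lime": (3, 2, 3),
    "grape": (1, 8, 7),
    "blueberry": (1, 7, 8),
    "watermelon": (10, 7, 4),
}

# Assumed cutoff: a point within this distance of a group's first member joins it.
threshold = 2.5

groups = []  # each group is a list of fruit names; the first name is its "leader"
for name, point in fruits.items():
    for group in groups:
        leader_point = fruits[group[0]]
        if math.dist(point, leader_point) <= threshold:
            group.append(name)
            break
    else:
        # No existing group is close enough, so start a new one.
        groups.append([name])

print(groups)
```

No fruit was ever told its category; the groups come only from how close the feature values are.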



### **1.3. Clustering for Exploration**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_09/Lecture_A/image_01_03.jpg?v=1769971499" width="250">



>* Clustering explores data by letting groups emerge
>* Reveals hidden patterns that guide deeper analysis

>* Clustering reveals community or visitor groups from features
>* Clusters guide follow-up questions and targeted actions

>* Clustering supports iterative, reflective data exploration cycles
>* Refined clusters reveal patterns that inspire new hypotheses
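
One way to picture the iterative cycle is to rerun a grouping with different settings and watch how the result changes. The sketch below uses hypothetical visit durations and an illustrative helper, `count_groups`, that starts a new group whenever two neighboring sorted values are more than `max_gap` apart. Both the data and the helper are assumptions made for this example.

```python
# Hypothetical visit durations in minutes for seven website visitors.
durations = [5, 6, 7, 30, 32, 35, 90]


def count_groups(values, max_gap):
    """Count groups formed when sorted neighbors more than max_gap apart are split."""
    ordered = sorted(values)
    groups = 1
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > max_gap:
            groups += 1
    return groups


# Try several settings and see how the picture changes.
results = {max_gap: count_groups(durations, max_gap) for max_gap in (2, 10, 30, 60)}
for max_gap, n_groups in results.items():
    print(f"max_gap={max_gap:>2} -> {n_groups} group(s)")
```

Each setting gives a different view of the same data, and each view can prompt a new question, such as why one visitor stayed far longer than the rest.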



## **2. Distance and Similarity**

### **2.1. Point to Point Distance**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_09/Lecture_A/image_02_01.jpg?v=1769971510" width="250">



>* Examples become points; distance measures their closeness
>* Distances help form clusters and spot different groups

>* Song features define distances that reflect similarity
>* Closer songs join the same playlist-style cluster

>* Distance depends on chosen features and scales
>* Comparing distances reveals dense clusters and separations
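
For two examples $a$ and $b$ described by $n$ numeric features, the straight-line (Euclidean) distance, one common choice among several, can be written as:

$$
d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}
$$

As a worked example with the features (age, purchases), take Alice $(25, 5)$, Bob $(27, 6)$, and Cara $(40, 1)$:

$$
\begin{aligned}
d(\text{Alice}, \text{Bob}) &= \sqrt{(25 - 27)^2 + (5 - 6)^2} = \sqrt{5} \approx 2.24 \\
d(\text{Alice}, \text{Cara}) &= \sqrt{(25 - 40)^2 + (5 - 1)^2} = \sqrt{241} \approx 15.52 \\
d(\text{Bob}, \text{Cara}) &= \sqrt{(27 - 40)^2 + (6 - 1)^2} = \sqrt{194} \approx 13.93
\end{aligned}
$$

Alice and Bob are the closest pair, so under these two features they are the most similar.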



In [None]:
#@title Python Code - Point to Point Distance

# This script shows simple point distance calculations.
# We compare customers using tiny numeric feature vectors.
# Focus on how distance reflects similarity between examples.

# import math for square root distance calculations.
import math

# define two simple customers with age and purchases features.
customer_a = {"name": "Alice", "features": (25, 5)}
customer_b = {"name": "Bob", "features": (27, 6)}

# define a third customer intentionally more different.
customer_c = {"name": "Cara", "features": (40, 1)}

# define a function computing Euclidean distance between two points.
def euclidean_distance(point_one, point_two):
    # unpack coordinates from the two feature tuples.
    x1, y1 = point_one
    x2, y2 = point_two

    # compute squared differences along each feature dimension.
    squared_sum = ((x1 - x2) ** 2) + ((y1 - y2) ** 2)

    # return the square root as final distance value.
    return math.sqrt(squared_sum)

# collect all customers into a small list for iteration.
customers = [customer_a, customer_b, customer_c]

# print a short header explaining the feature meaning.
print("Features: age in years, number of products purchased.")

# compute and print pairwise distances between all customers.
for i in range(len(customers)):
    # select the first customer in the pair.
    first = customers[i]

    # compare with later customers to avoid duplicates.
    for j in range(i + 1, len(customers)):
        # select the second customer in the pair.
        second = customers[j]

        # compute distance using the helper function.
        dist = euclidean_distance(first["features"], second["features"])

        # print a readable summary of the distance value.
        print(
            f"Distance between {first['name']} and {second['name']} is {dist:.2f}."
        )

# remind the reader how to interpret the printed distances.
print("Smaller distance means customers are more similar overall.")




### **2.2. Scaling and Distance**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_09/Lecture_A/image_02_02.jpg?v=1769971543" width="250">



>* Feature scale strongly affects distance-based grouping
>* Large-range features dominate, hiding small-scale information

>* Rescale features so typical differences are comparable
>* Scaling balances feature influence, producing fairer clusters

>* Scaling prevents one feature from dominating the others
>* It lets us choose and balance feature importance
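
One common way to rescale, assumed here for illustration, is standardization: subtract each feature's mean $\mu$ and divide by its standard deviation $\sigma$.

$$
z = \frac{x - \mu}{\sigma}
$$

As a worked example, take four customers described by (income in dollars, satisfaction score): $(20000, 2)$, $(50000, 4)$, $(80000, 3)$, and $(120000, 5)$. The raw distance between the first two is driven almost entirely by income:

$$
\sqrt{(20000 - 50000)^2 + (2 - 4)^2} = \sqrt{900000004} \approx 30000.00007
$$

Using the population mean and standard deviation of each column (income: $\mu = 67500$, $\sigma \approx 36996.6$; satisfaction: $\mu = 3.5$, $\sigma \approx 1.118$), the same two customers differ by about $0.81$ in income and $1.79$ in satisfaction:

$$
\sqrt{0.81^2 + 1.79^2} \approx 1.96
$$

After scaling, the satisfaction difference contributes more to the distance than the income difference, whereas before scaling it was effectively invisible.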



In [None]:
#@title Python Code - Scaling and Distance

# This script shows how scaling changes distances.
# We compare raw and scaled customer feature distances.
# Focus on income and satisfaction feature distance contributions.

# import required numerical and plotting libraries.
import numpy as np
import matplotlib.pyplot as plt

# set deterministic random seed for reproducible behavior.
np.random.seed(42)

# create tiny synthetic customer feature data array.
customers = np.array(
    [[20000, 2],
     [50000, 4],
     [80000, 3],
     [120000, 5]],
    dtype=float,
)

# print original customer feature table with labels.
print("Customers as [income_dollars, satisfaction_score]:")
print(customers)

# define simple function computing euclidean distance values.
def euclidean_distance(a, b):
    diff = a - b

    return float(np.sqrt(np.sum(diff ** 2)))

# choose reference customer index for distance comparison.
ref_index = 0
ref_customer = customers[ref_index]

# compute distances from reference using raw features.
raw_distances = []
for i in range(customers.shape[0]):
    dist = euclidean_distance(ref_customer, customers[i])
    raw_distances.append(dist)

# convert raw distances list into numpy array.
raw_distances = np.array(raw_distances, dtype=float)

# compute feature means and standard deviations for scaling.
feature_means = customers.mean(axis=0)
feature_stds = customers.std(axis=0, ddof=0)

# avoid division by zero using safe standard deviation values.
feature_stds_safe = np.where(feature_stds == 0, 1.0, feature_stds)

# scale features to zero mean and unit variance values.
customers_scaled = (customers - feature_means) / feature_stds_safe

# compute distances from reference using scaled features.
scaled_ref_customer = customers_scaled[ref_index]
scaled_distances = []
for i in range(customers_scaled.shape[0]):
    dist = euclidean_distance(scaled_ref_customer, customers_scaled[i])
    scaled_distances.append(dist)

# convert scaled distances list into numpy array.
scaled_distances = np.array(scaled_distances, dtype=float)

# print raw and scaled distances for interpretation.
print("\nRaw distances from customer 0:")
print(raw_distances)
print("\nScaled distances from customer 0:")
print(scaled_distances)

# prepare x positions for bar plot comparison.
indices = np.arange(customers.shape[0])

# create bar width and figure for distance comparison.
bar_width = 0.35
plt.figure(figsize=(6, 4))

# plot raw distances as left bars.
plt.bar(indices - bar_width / 2, raw_distances, width=bar_width, label="Raw")

# plot scaled distances as right bars.
plt.bar(indices + bar_width / 2, scaled_distances, width=bar_width, label="Scaled")

# label axes and title for clarity.
plt.xlabel("Customer index")
plt.ylabel("Distance from customer zero")
plt.title("Effect of feature scaling on distance values")

# add legend and layout adjustment.
plt.legend()
plt.tight_layout()

# display the final comparison plot.
plt.show()




### **2.3. Visual 2D Examples**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_09/Lecture_A/image_02_03.jpg?v=1769971588" width="250">



>* Plot examples as dots, using two features as the axes
>* Nearby dots form clusters; distant dots are different

>* Plot students by two exam scores visually
>* Assign each student to the cluster of their nearest neighbors

>* Imagine restaurant points, connect nearby ones into groups
>* Gaps between point groups mark cluster boundaries visually



In [None]:
#@title Python Code - Visual 2D Examples

# This script visualizes simple two dimensional distance based clustering examples.
# It shows how nearby points can form intuitive visual clusters.
# Use it to connect distance ideas with real looking scatter plots.

# !pip install numpy matplotlib

# Import required numerical and plotting libraries.
import numpy as np
import matplotlib.pyplot as plt

# Set deterministic random seed for reproducible point locations.
np.random.seed(42)

# Create grocery versus entertainment spending cluster centers.
centers_spend = np.array([[20, 40], [60, 20], [55, 70]])

# Generate small clouds of spending points around each center.
points_spend = []
for center in centers_spend:
    cloud = center + np.random.normal(loc=0.0, scale=5.0, size=(15, 2))
    points_spend.append(cloud)

# Stack all spending points into one array for plotting.
points_spend = np.vstack(points_spend)

# Create math versus writing exam score cluster centers.
centers_exam = np.array([[80, 80], [30, 30], [80, 40], [40, 80]])

# Generate small clouds of exam points around each center.
points_exam = []
for center in centers_exam:
    cloud = center + np.random.normal(loc=0.0, scale=4.0, size=(10, 2))
    points_exam.append(cloud)

# Stack all exam points into one array for plotting.
points_exam = np.vstack(points_exam)

# Choose one grocery customer and one student for distance highlighting.
customer_index = 5
student_index = 12

# Extract chosen reference points from arrays safely.
customer_point = points_spend[customer_index]
student_point = points_exam[student_index]

# Compute Euclidean distances from reference customer to all customers.
distances_customer = np.sqrt(np.sum((points_spend - customer_point) ** 2, axis=1))

# Compute Euclidean distances from reference student to all students.
distances_student = np.sqrt(np.sum((points_exam - student_point) ** 2, axis=1))

# Find three nearest neighbors for customer excluding itself.
neighbor_indices_customer = np.argsort(distances_customer)[1:4]

# Find three nearest neighbors for student excluding itself.
neighbor_indices_student = np.argsort(distances_student)[1:4]

# Print short summary explaining selected nearest neighbors.
print("Customer nearest neighbor distances:", np.round(distances_customer[neighbor_indices_customer], 2))
print("Student nearest neighbor distances:", np.round(distances_student[neighbor_indices_student], 2))

# Create a figure with two side by side scatter subplots.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

# Plot grocery versus entertainment spending scatter with neighbors highlighted.
ax0 = axes[0]
ax0.scatter(points_spend[:, 0], points_spend[:, 1], c="lightgray", label="Other customers")

# Highlight reference customer with a distinct color and marker.
ax0.scatter(customer_point[0], customer_point[1], c="red", marker="x", s=80, label="Chosen customer")

# Highlight nearest neighbor customers using another color.
ax0.scatter(points_spend[neighbor_indices_customer, 0], points_spend[neighbor_indices_customer, 1], c="blue", marker="o", s=60, label="Nearest neighbors")

# Label axes and title for spending subplot.
ax0.set_xlabel("Grocery spending per month")
ax0.set_ylabel("Entertainment spending per month")
ax0.set_title("Customers grouped by similar spending patterns")
ax0.legend(loc="best")

# Plot math versus writing exam scatter with neighbors highlighted.
ax1 = axes[1]
ax1.scatter(points_exam[:, 0], points_exam[:, 1], c="lightgray", label="Other students")

# Highlight reference student with a distinct color and marker.
ax1.scatter(student_point[0], student_point[1], c="red", marker="x", s=80, label="Chosen student")

# Highlight nearest neighbor students using another color.
ax1.scatter(points_exam[neighbor_indices_student, 0], points_exam[neighbor_indices_student, 1], c="green", marker="o", s=60, label="Nearest neighbors")

# Label axes and title for exam subplot.
ax1.set_xlabel("Math exam score")
ax1.set_ylabel("Writing exam score")
ax1.set_title("Students grouped by similar exam performance")
ax1.legend(loc="best")

# Adjust layout and display the combined figure clearly.
plt.tight_layout()
plt.show()




## **3. Making Sense of Clusters**

### **3.1. Cluster summaries**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_09/Lecture_A/image_03_01.jpg?v=1769971639" width="250">



>* Summarize each cluster's typical shared characteristics clearly
>* Turn feature statistics into a simple, meaningful story

>* Focus on features where clusters stand out
>* Use contrasts to explain meaningful group differences

>* Cluster summaries must match real-world goals and context
>* Contextual summaries guide targeted actions and decisions
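
A basic cluster summary is the average of each feature within each cluster. The sketch below assumes the cluster ids were already produced by some earlier clustering step that is not shown; the rows and ids are hypothetical.

```python
# Hypothetical rows: (age, purchases, cluster_id). The cluster ids are assumed
# to come from some earlier clustering step that is not shown here.
feature_names = ["age", "purchases"]
rows = [
    (25, 5, 0),
    (27, 6, 0),
    (40, 1, 1),
    (44, 2, 1),
]

# Collect the feature values of each cluster.
members = {}
for *features, cluster_id in rows:
    members.setdefault(cluster_id, []).append(features)

# Average each feature within each cluster.
summaries = {}
for cluster_id, feature_rows in members.items():
    columns = zip(*feature_rows)
    summaries[cluster_id] = {
        name: sum(column) / len(column)
        for name, column in zip(feature_names, columns)
    }

for cluster_id, summary in sorted(summaries.items()):
    print(f"Cluster {cluster_id}: {summary}")
```

Reading the averages side by side shows where the clusters stand apart: here, cluster 0 is younger and buys more, while cluster 1 is older and buys less.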



### **3.2. Labeling Cluster Groups**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_09/Lecture_A/image_03_02.jpg?v=1769971650" width="250">



>* Create human-friendly labels describing each cluster
>* Summarize key feature patterns in short phrases

>* Compare clusters to find what makes them unique
>* Use clear, contrastive labels that aid decisions

>* Use cautious, unbiased labels based on observable data
>* Check labels with experts and note limitations
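
Short, contrastive labels can be drafted directly from the numbers. The sketch below compares hypothetical per-cluster averages with the overall average and produces a neutral phrase for each cluster. The input values and the `describe` helper are assumptions for illustration; the resulting phrases are starting points for review, not final names.

```python
# Hypothetical per-cluster feature means and overall means (assumed inputs).
cluster_means = {
    0: {"age": 26.0, "purchases": 5.5},
    1: {"age": 42.0, "purchases": 1.5},
}
overall_means = {"age": 34.0, "purchases": 3.5}


def describe(cluster_mean, overall_mean):
    """Build a short, neutral phrase comparing a cluster to the overall average."""
    parts = []
    for feature, value in cluster_mean.items():
        if value > overall_mean[feature]:
            parts.append(f"above-average {feature}")
        elif value < overall_mean[feature]:
            parts.append(f"below-average {feature}")
        else:
            parts.append(f"average {feature}")
    return ", ".join(parts)


labels = {cid: describe(means, overall_means) for cid, means in cluster_means.items()}
for cid, label in labels.items():
    print(f"Cluster {cid}: {label}")
```

Sticking to observable comparisons such as "above-average purchases" keeps the labels cautious and avoids claims the data cannot support.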



### **3.3. Applying Clusters to Decisions**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_09/Lecture_A/image_03_03.jpg?v=1769971659" width="250">



>* Use each cluster as a distinct segment
>* Tailor actions to clusters for targeted decisions

>* Design different actions for each customer cluster
>* Balance personalization with ethics, cost, and practicality

>* Treat clusters as tentative patterns, not facts
>* Keep monitoring, updating, and revising cluster-based decisions
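
One way to keep cluster-based decisions tentative is to act only when a new example clearly resembles a known cluster and to defer otherwise. The sketch below assigns a new customer to the nearest cluster center and falls back to a manual review when every center is far away. The centers, actions, and `max_distance` cutoff are hypothetical, and the raw features are left unscaled to keep the example short.

```python
import math

# Hypothetical cluster centers as (age, purchases) and hypothetical actions.
centers = {0: (26.0, 5.5), 1: (42.0, 1.5)}
actions = {0: "send loyalty offer", 1: "send re-engagement email"}

# Assumed cutoff: examples farther than this from every center get a manual review.
max_distance = 10.0


def choose_action(point):
    """Pick the action of the nearest cluster, or defer if no cluster is close."""
    nearest = min(centers, key=lambda cid: math.dist(point, centers[cid]))
    if math.dist(point, centers[nearest]) > max_distance:
        return "review manually"
    return actions[nearest]


print(choose_action((28, 5)))   # close to the center of cluster 0
print(choose_action((70, 20)))  # far from both centers
```

Examples that trigger the fallback are also a useful signal that the clusters may need to be revisited.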



# <font color="#418FDE" size="6.5" uppercase>**Clustering Concepts**</font>


In this lecture, you learned to:
- Describe clustering as grouping similar examples based on feature values. 
- Use simple distance-based reasoning to decide which examples belong together. 
- Interpret clusters in terms of meaningful patterns in a given context. 

In the next lecture (Lecture B), we will go over 'Simplifying Features'.