# Machine Learning Basics: Clustering
This notebook introduces clustering, a core unsupervised machine learning task, using K-Means, DBSCAN, and agglomerative hierarchical clustering on a small randomly generated dataset.

## 1. Introduction to Clustering
Clustering is an unsupervised learning technique that groups similar data points together without using labels, typically by comparing distances between points.

## 2. K-Means Clustering with Randomly Generated Data
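K-Means is a popular clustering algorithm that partitions the dataset into K clusters, where K is chosen in advance, so that each point belongs to the cluster with the nearest centroid (cluster mean).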
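
K-Means is commonly implemented as an alternation between two steps (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. The block below is a minimal NumPy sketch of that loop, not scikit-learn's implementation. The helper name `kmeans_sketch`, its parameters, and the six demo points are assumptions made for illustration.

```python
import numpy as np


def nearest_centroid(points, centroids):
    """Return, for each point, the index of its closest centroid."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


def kmeans_sketch(points, k, n_iters=100, seed=0):
    """Hypothetical helper: a bare-bones version of Lloyd's algorithm."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: label each point with its nearest centroid
        labels = nearest_centroid(points, centroids)
        # Update step: move each centroid to the mean of its points
        # (a cluster that lost all its points keeps its previous centroid)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # Centroids stopped moving, so the assignments are stable
        centroids = new_centroids
    return nearest_centroid(points, centroids), centroids


# Two well-separated groups of three points each (illustrative data)
demo_points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
demo_labels, demo_centroids = kmeans_sketch(demo_points, k=2)
print("Labels:", demo_labels)
print("Centroids:\n", demo_centroids)
```

The result depends on the starting centroids in general, which is why `KMeans` in scikit-learn offers `n_init` to rerun the algorithm from several initializations and keep the best one.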

## 3. Example: Clustering with K-Means

In [None]:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib.patches import Circle

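# Generate a random dataset: 50 points with two features, drawn uniformly from [0, 100)
# Uniform data has no built-in cluster structure, so the groupings below are illustrative only
np.random.seed(42)
X = np.random.rand(50, 2) * 100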

# Applying K-Means Clustering
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X)
y_kmeans = kmeans.labels_

# Check if more than one cluster is formed before computing silhouette score
if len(set(y_kmeans)) > 1:
    silhouette_avg = silhouette_score(X, y_kmeans)
    print(f"K-Means Silhouette Score: {silhouette_avg:.2f}")
else:
    print("K-Means formed only one cluster; silhouette score cannot be computed.")

# Scatter plot of clusters
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y_kmeans, palette='viridis', legend='full')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='red', label='Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('K-Means Clustering with Random Data')
plt.legend()
plt.show()
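
The silhouette score can also be used to compare several candidate values of K rather than to evaluate a single one. The following is a small sketch of that idea, assuming the same dataset as in the cell above (recreated here with the same seed so the block runs on its own). The names `X_demo`, `scores_by_k`, and `best_k`, and the range of K values tried, are choices made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Recreate the notebook's dataset (same seed, same shape)
X_demo = np.random.RandomState(42).rand(50, 2) * 100

# Fit K-Means for several values of K and record the mean silhouette score
scores_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    scores_by_k[k] = silhouette_score(X_demo, labels)
    print(f"K={k}: silhouette score {scores_by_k[k]:.2f}")

# The K with the highest score is one reasonable candidate, not a guaranteed answer
best_k = max(scores_by_k, key=scores_by_k.get)
print(f"Highest silhouette score at K={best_k}")
```

Silhouette scores range from -1 to 1, with higher values indicating tighter, better-separated clusters. On uniformly random data the differences between K values tend to be small, so the "best" K here says little about real structure.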


## 4. DBSCAN Clustering

In [None]:

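# DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points in dense regions and can find clusters with irregular shapes.
# The `eps` parameter is the neighborhood radius; `min_samples` is the number of points (including the point itself) that must lie within `eps` for a point to count as a core point.
# Points that are not within `eps` of any core point are labeled as noise (-1).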
dbscan = DBSCAN(eps=5, min_samples=3)
y_dbscan = dbscan.fit_predict(X)

# Identifying noise points (DBSCAN labels them -1)
noise_points = y_dbscan == -1

# The silhouette score needs at least two clusters, so evaluate only in that case and leave noise out
if len(set(y_dbscan) - {-1}) > 1:
    dbscan_silhouette = silhouette_score(X[~noise_points], y_dbscan[~noise_points])
    print(f"DBSCAN Silhouette Score: {dbscan_silhouette:.2f}")
else:
    print("DBSCAN did not find enough clusters to compute a silhouette score.")

# Scatter plot for DBSCAN with circles around clusters
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y_dbscan, palette='coolwarm', legend='full')
plt.scatter(X[noise_points, 0], X[noise_points, 1], color='black', label='Noise', marker='x')

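# Draw a dashed bounding circle around each cluster, centered on the cluster mean
# (a visual aid only; DBSCAN clusters are not required to be circular)
for cluster in set(y_dbscan) - {-1}:  # Exclude noise points
    cluster_points = X[y_dbscan == cluster]
    center = cluster_points.mean(axis=0)
    radius = np.max(np.linalg.norm(cluster_points - center, axis=1))
    circle = Circle(center, radius, color='gray', fill=False, linestyle='dashed')
    plt.gca().add_patch(circle)
plt.gca().set_aspect('equal')  # Equal axis scaling keeps the circles round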

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('DBSCAN Clustering with Circles')
plt.legend()
plt.show()
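
DBSCAN results are sensitive to `eps`, and a common heuristic for choosing it is to look at each point's distance to its `min_samples`-th nearest neighbor (counting the point itself). The sketch below computes those distances with `NearestNeighbors` and checks how many core points each candidate `eps` would produce. The dataset is recreated with the same seed so the block is self-contained; the names `X_demo`, `k_distances`, and `core_counts`, and the candidate `eps` values, are assumptions for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Recreate the notebook's dataset (same seed, same shape)
X_demo = np.random.RandomState(42).rand(50, 2) * 100

min_samples = 3
# Querying with the training points returns each point as its own nearest
# neighbor (distance 0), matching DBSCAN's rule that a point counts itself.
neighbors = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = neighbors.kneighbors(X_demo)
k_distances = np.sort(distances[:, -1])  # Distance to the min_samples-th neighbor

print(f"Median k-distance: {np.median(k_distances):.2f}")

# A point is a core point when its k-distance is at most eps
core_counts = {}
for eps in (5, 10, 15):
    expected = int((k_distances <= eps).sum())
    found = len(DBSCAN(eps=eps, min_samples=min_samples).fit(X_demo).core_sample_indices_)
    core_counts[eps] = (expected, found)
    print(f"eps={eps}: {expected} core points from k-distances, {found} reported by DBSCAN")
```

Plotting `k_distances` and looking for the point where the curve bends sharply upward gives a starting value for `eps`. It is a rule of thumb rather than a guarantee, especially on data without clear density differences.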


## 5. Agglomerative Hierarchical Clustering

In [None]:

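# Agglomerative clustering builds a hierarchy bottom-up: each point starts as its own cluster, and the two closest clusters are merged repeatedly.
# The linkage method determines how distances between clusters are measured; AgglomerativeClustering uses Ward linkage by default.
hierarchical = AgglomerativeClustering(n_clusters=3)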
y_hierarchical = hierarchical.fit_predict(X)

# Check if more than one cluster is formed before computing silhouette score
if len(set(y_hierarchical)) > 1:
    hierarchical_silhouette = silhouette_score(X, y_hierarchical)
    print(f"Agglomerative Clustering Silhouette Score: {hierarchical_silhouette:.2f}")
else:
    print("Agglomerative clustering formed only one cluster; silhouette score cannot be computed.")

# Plot a dendrogram truncated to the top levels of the hierarchy
linked = linkage(X, method='ward')
plt.figure(figsize=(10, 6))
dendrogram(linked, truncate_mode='level', p=5)
plt.title('Hierarchical Clustering Dendrogram (Truncated)')
plt.xlabel('Sample Index or (Cluster Size)')  # Truncated leaves show their point count in parentheses
plt.ylabel('Distance')
plt.show()
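
A dendrogram can also be cut into flat clusters directly. As a rough sketch, the block below uses SciPy's `fcluster` with `criterion='maxclust'` to ask for at most three clusters from a Ward linkage, then compares the result with `AgglomerativeClustering(n_clusters=3)` using the adjusted Rand index. The dataset is recreated with the same seed so the block runs on its own; `X_demo`, `linked_demo`, `flat_labels`, and `agreement` are names chosen for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Recreate the notebook's dataset (same seed, same shape)
X_demo = np.random.RandomState(42).rand(50, 2) * 100

# Build the Ward hierarchy and cut it so that at most 3 flat clusters remain
linked_demo = linkage(X_demo, method='ward')
flat_labels = fcluster(linked_demo, t=3, criterion='maxclust')  # Labels start at 1

# Cluster the same data with scikit-learn (Ward linkage is its default)
sklearn_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X_demo)

# The adjusted Rand index ignores label numbering; 1.0 means identical partitions
agreement = adjusted_rand_score(sklearn_labels, flat_labels)
print("Cluster sizes from fcluster:", np.bincount(flat_labels)[1:])
print(f"Adjusted Rand index vs. AgglomerativeClustering: {agreement:.2f}")
```

Because both approaches use Ward linkage on the same points, the two partitions are expected to agree. Changing `t` gives a different number of clusters without recomputing the hierarchy.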


## Conclusion
This notebook introduced clustering with three algorithms: K-Means, DBSCAN, and Agglomerative Clustering.
It used randomly generated data to illustrate the concepts, so the clusters found reflect how each algorithm behaves rather than real structure in the data.
Each silhouette score is computed only when at least two clusters are present, because the score is undefined for a single cluster.
The DBSCAN plot marks noise points separately and draws a dashed circle around each detected cluster.
The hierarchical dendrogram is truncated for readability.
Happy Learning! 🎉