<a href="https://colab.research.google.com/github/thepersonuadmire/Clustering/blob/main/Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Theoretical Questions

1. What is unsupervised learning in the context of machine learning?


Unsupervised learning is a type of machine learning where the model is trained on data without labeled responses. The goal is to identify patterns, structures, or relationships within the data. Common tasks include clustering, dimensionality reduction, and anomaly detection.

2. How does K-Means clustering algorithm work?


K-Means clustering works by partitioning the dataset into K distinct clusters. The algorithm follows these steps:

Initialize K centroids randomly.

Assign each data point to the nearest centroid.

Update the centroids by calculating the mean of all points assigned to each cluster.

Repeat the assignment and update steps until convergence (i.e., when assignments no longer change).
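The steps above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation; the helper name `kmeans_sketch`, its parameters, and the toy data are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain Lloyd-style K-Means loop (hypothetical helper for illustration)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Assignment: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Convergence: stop once the assignments no longer change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update: move each centroid to the mean of its assigned points
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Toy data: two well-separated groups of three points each
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels_demo, centroids_demo = kmeans_sketch(X_demo, k=2)
print(labels_demo)
print(centroids_demo)
```

On this toy data the first three points end up in one cluster and the last three in the other; which cluster gets label 0 depends on the random initialization.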

3. Explain the concept of a dendrogram in hierarchical clustering?


A dendrogram is a tree-like diagram that illustrates the arrangement of clusters formed by hierarchical clustering. It shows the relationships between clusters and how they are merged at various levels of similarity. The height of the branches indicates the distance or dissimilarity between clusters.

4. What is the main difference between K-Means and Hierarchical Clustering?


The main difference is that K-Means is a partitional clustering method that requires the number of clusters (K) to be specified in advance, while hierarchical clustering builds a hierarchy of clusters without needing to predefine the number of clusters. Hierarchical clustering can be agglomerative (bottom-up) or divisive (top-down).



5. What are the advantages of DBSCAN over K-Means?


DBSCAN (Density-Based Spatial Clustering of Applications with Noise) has several advantages:

It can find arbitrarily shaped clusters, while K-Means assumes spherical clusters.

It can identify noise points as outliers, which K-Means cannot do.

It does not require the number of clusters to be specified in advance.

6. When would you use Silhouette Score in clustering?


The Silhouette Score is used to evaluate the quality of clustering. It measures how similar an object is to its own cluster compared to other clusters. A higher Silhouette Score indicates better-defined clusters. It is particularly useful for determining the optimal number of clusters.
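As a rough illustration of the calculation, the per-sample silhouette value can be written as s = (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the mean distance to the points of the nearest other cluster. The helper `silhouette_values` and the one-dimensional toy data below are assumptions made for this sketch.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-sample silhouette s = (b - a) / max(a, b) (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all samples
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            continue  # a singleton cluster gets a silhouette of 0
        a = D[i, same].mean()  # cohesion: mean distance within own cluster
        # separation: mean distance to the nearest other cluster
        b = min(D[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

X_demo = np.array([[0.0], [1.0], [10.0], [11.0]])
good = silhouette_values(X_demo, [0, 0, 1, 1])  # sensible grouping
bad = silhouette_values(X_demo, [0, 1, 0, 1])   # deliberately wrong grouping
print(good.mean(), bad.mean())
```

The sensible grouping gives a mean close to 1, while the deliberately wrong grouping gives a negative mean.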

7. What are the limitations of Hierarchical Clustering?


Limitations of hierarchical clustering include:

It can be computationally expensive for large datasets: the naive agglomerative algorithm takes O(n^3) time, and storing the pairwise distance matrix takes O(n^2) memory.

It is sensitive to noise and outliers.

Once a merge or split is made, it cannot be undone, which can lead to suboptimal clustering.

8. Why is feature scaling important in clustering algorithms like K-Means?


Feature scaling is important because K-Means uses distance metrics (like Euclidean distance) to assign points to clusters. If features are on different scales, those with larger ranges can disproportionately influence the clustering results. Scaling ensures that all features contribute equally to the distance calculations.
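A small numeric sketch of this effect is shown below. The data (income in dollars, age in years) is made up for illustration. The scaling step subtracts each column's mean and divides by its standard deviation, which is the same transformation `StandardScaler` applies.

```python
import numpy as np

# Hypothetical data: column 0 is income in dollars, column 1 is age in years
X_demo = np.array([[30_000.0, 25.0],
                   [30_500.0, 60.0],
                   [60_000.0, 26.0]])

def dist(a, b):
    """Euclidean distance between two points."""
    return float(np.linalg.norm(a - b))

# Raw distances: the income column dominates, the age gap barely registers
raw_01 = dist(X_demo[0], X_demo[1])  # similar income, very different age
raw_02 = dist(X_demo[0], X_demo[2])  # very different income, similar age

# Standardize each column to zero mean and unit variance
X_scaled = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
scaled_01 = dist(X_scaled[0], X_scaled[1])
scaled_02 = dist(X_scaled[0], X_scaled[2])

print(raw_01, raw_02)
print(scaled_01, scaled_02)
```

Before scaling, the second distance is dozens of times larger than the first; after scaling, the two distances are of similar size, so both features influence the result.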

9. How does DBSCAN identify noise points?


DBSCAN labels as noise every point that does not belong to any cluster. A core point is a point with at least MinPts neighbors within its ε-neighborhood. A point is noise if it is not a core point itself and does not lie within the ε-neighborhood of any core point. In other words, noise points are neither core points nor directly reachable from a core point.

10. Define inertia in the context of K-Means?


Inertia is a measure of how tightly the clusters are packed. It is defined as the sum of squared distances between each point and its assigned cluster centroid. For a fixed K, lower inertia indicates tighter clusters, since points lie closer to their centroids. Inertia generally keeps decreasing as K grows, so it should not be compared directly across different values of K.
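The definition can be checked by hand on a tiny example. The helper name `inertia` and the toy data are assumptions for this sketch; with a fitted scikit-learn `KMeans` model the same quantity is available as `inertia_`.

```python
import numpy as np

def inertia(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    diffs = X - centroids[labels]
    return float((diffs ** 2).sum())

X_demo = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])

# Two clusters: every point is exactly 1 unit from its centroid
two_clusters = inertia(X_demo,
                       np.array([0, 0, 1, 1]),
                       np.array([[1.0, 0.0], [11.0, 0.0]]))

# One cluster: all points share the overall mean (6, 0) as centroid
one_cluster = inertia(X_demo,
                      np.array([0, 0, 0, 0]),
                      np.array([[6.0, 0.0]]))

print(two_clusters)  # 4.0
print(one_cluster)   # 104.0
```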

11. What is the elbow method in K-Means clustering?


The elbow method is a technique used to determine the optimal number of clusters (K) in K-Means clustering. It involves plotting the inertia against the number of clusters and looking for a "knee" or "elbow" point where the rate of decrease sharply changes. This point suggests a suitable number of clusters.

12. Describe the concept of "density" in DBSCAN.


In DBSCAN, density refers to the number of points within a specified radius (ε) around a point. A point is considered a core point if it has at least a minimum number of neighbors (MinPts) within this radius. Clusters are formed from core points and their reachable points, while low-density areas are classified as noise.
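A sketch of how points could be sorted into core, border, and noise under these definitions follows. The helper `classify_points` and the toy data are assumptions for illustration. The neighbor count includes the point itself, which matches how scikit-learn counts `min_samples`.

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    within_eps = D <= eps  # each point counts as its own neighbor
    # Core: at least min_pts points inside the eps-neighborhood
    is_core = within_eps.sum(axis=1) >= min_pts
    # Border: not core, but within eps of at least one core point
    near_core = (within_eps & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "border", "noise"))

X_demo = np.array([[0.0, 0.0], [0.3, 0.0], [0.0, 0.3], [0.3, 0.3],
                   [0.75, 0.3],  # on the edge of the dense group
                   [5.0, 5.0]])  # isolated point
kinds = classify_points(X_demo, eps=0.5, min_pts=4)
print(kinds)
```

The four tightly packed points come out as core points, the point on the edge as a border point, and the isolated point as noise.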

13. Can hierarchical clustering be used on categorical data?


Yes, hierarchical clustering can be used on categorical data, but it requires appropriate distance metrics (e.g., Hamming distance or Jaccard distance) that are suitable for categorical variables. Standard distance measures like Euclidean distance are not appropriate for categorical data.
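One possible way to do this with SciPy is to compute Hamming distances with `pdist` and pass the condensed distance vector to `linkage`. The records and their integer encoding below are made up for illustration; Hamming distance only checks whether two values are equal, so the particular integer codes do not matter.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical categorical records, integer-encoded per column:
# color (0=red, 1=blue), size (0=small, 1=large), shape (0=round, 1=square)
X_cat = np.array([[0, 0, 0],
                  [0, 0, 1],
                  [1, 1, 0],
                  [1, 1, 1]])

# Hamming distance: fraction of attributes on which two records differ
dists = pdist(X_cat, metric="hamming")

# Average-linkage hierarchy built from the precomputed distances
Z = linkage(dists, method="average")

# Cut the hierarchy into at most two flat clusters
labels_cat = fcluster(Z, t=2, criterion="maxclust")
print(labels_cat)
```

Records 0 and 1 differ in one attribute, as do records 2 and 3, so each pair forms its own cluster.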

14. What does a negative Silhouette Score indicate?


A negative silhouette value means that a data point is, on average, closer to the points of a neighboring cluster than to the points of its own cluster, so it is likely assigned to the wrong cluster. A negative average Silhouette Score indicates poor clustering quality overall.

15. Explain the term "linkage criteria" in hierarchical clustering.


Linkage criteria determine how the distance between clusters is calculated during the merging process in hierarchical clustering. Common linkage methods include:

Single linkage: distance between the closest points of two clusters.

Complete linkage: distance between the farthest points of two clusters.

Average linkage: average distance between all pairs of points in two clusters.
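The three criteria can be compared on two small clusters. The clusters `A` and `B` are toy data chosen so the numbers are easy to verify by hand.

```python
import numpy as np

# Two small clusters on a line (toy data)
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])

# All pairwise distances between points of A (rows) and points of B (columns)
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single = pairwise.min()     # closest pair: (1, 0) and (3, 0)
complete = pairwise.max()   # farthest pair: (0, 0) and (5, 0)
average = pairwise.mean()   # mean over all four pairs

print(single, complete, average)  # 2.0 5.0 3.5
```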

16. Why might K-Means clustering perform poorly on data with varying cluster sizes or densities?


K-Means assumes that clusters are spherical and of similar size. If clusters have varying sizes or densities, K-Means may struggle to accurately assign points to clusters, leading to poor clustering results. It may merge smaller clusters into larger ones or fail to capture the true structure of the data.

17. What are the core parameters in DBSCAN, and how do they influence clustering?


The core parameters in DBSCAN are:

ε (epsilon): the radius within which to search for neighbors. A larger ε can merge clusters, while a smaller ε may result in more noise points.

MinPts: the minimum number of points required to form a dense region. A higher MinPts value can lead to fewer clusters and more noise, while a lower value may result in more clusters.
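The influence of both parameters can be seen on a small one-dimensional dataset. The data and the helper `summarize` are assumptions for this sketch. In scikit-learn, ε is the `eps` argument, MinPts is `min_samples`, and noise points receive the label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two groups of evenly spaced points (spacing 0.1) and one far-away point
X_demo = np.array([0.0, 0.1, 0.2, 0.3, 0.4,
                   1.0, 1.1, 1.2, 1.3, 1.4,
                   5.0]).reshape(-1, 1)

def summarize(eps, min_samples):
    """Return (number of clusters, number of noise points); hypothetical helper."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_demo)
    n_clusters = len(set(labels) - {-1})
    n_noise = int((labels == -1).sum())
    return n_clusters, n_noise

for eps, min_samples in [(0.15, 3), (0.7, 3), (0.05, 3), (0.15, 4)]:
    print(eps, min_samples, summarize(eps, min_samples))
```

With `eps=0.15` and `min_samples=3` the two groups are found and the far point is noise. Raising `eps` to 0.7 bridges the gap and merges the groups into one cluster. Shrinking `eps` to 0.05, or raising `min_samples` to 4, leaves no core points, so every point becomes noise.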


18. How does K-Means++ improve upon standard K-Means initialization?


K-Means++ improves the initialization of centroids by spreading them out across the data space. It chooses the first centroid uniformly at random from the data points, then picks each subsequent centroid with probability proportional to its squared distance from the nearest centroid already chosen. This reduces the chance of a poor starting configuration and usually speeds up convergence.
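A minimal sketch of this seeding step in NumPy is shown below, assuming the usual formulation in which the sampling probability is proportional to the squared distance to the nearest centroid chosen so far. The helper `kmeans_pp_init` and the toy data are hypothetical; scikit-learn's `KMeans(init='k-means++')` has its own implementation.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-Means++ style seeding (hypothetical helper for illustration)."""
    rng = np.random.default_rng(seed)
    # First centroid: a data point chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        # Far-away points are proportionally more likely to be picked;
        # already chosen points have d2 == 0 and cannot be picked again
        next_idx = rng.choice(len(X), p=d2 / d2.sum())
        centroids.append(X[next_idx])
    return np.array(centroids)

X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]])
init = kmeans_pp_init(X_demo, k=2)
print(init)
```

Because points near an existing centroid have a very small selection probability, the second centroid is overwhelmingly likely to come from the other group.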

19. What is agglomerative clustering?


Agglomerative clustering is a type of hierarchical clustering that starts with each data point as its own cluster and iteratively merges the closest pairs of clusters until a single cluster is formed or a specified number of clusters is reached. It builds a hierarchy from the bottom up.

20. What makes Silhouette Score a better metric than just inertia for model evaluation?

The Silhouette Score provides a measure of how well-separated the clusters are, taking into account both the cohesion (how close points in the same cluster are) and separation (how far apart clusters are). In contrast, inertia only measures the compactness of clusters without considering their separation. Therefore, the Silhouette Score can give a more comprehensive view of clustering quality, especially when clusters have different shapes or sizes.
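The difference can be illustrated on synthetic data. The three blob centers below are hand-picked for this sketch so that the true number of clusters is known.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated blobs at hand-picked (hypothetical) centers
X_demo, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [10, 0]],
                       cluster_std=0.6, random_state=0)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias[k] = model.inertia_
    silhouettes[k] = silhouette_score(X_demo, model.labels_)
    print(k, round(inertias[k], 1), round(silhouettes[k], 3))

# The k with the highest mean silhouette
best_k = max(silhouettes, key=silhouettes.get)
print("k with highest silhouette:", best_k)
```

Inertia keeps shrinking as more clusters are added, so by itself it would always favor a larger k. The Silhouette Score peaks at the number of blobs actually generated and drops once a genuine cluster is split.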

# Practical Questions

21. Generate synthetic data with 4 centers using make_blobs and apply K-Means clustering. Visualize using a scatter plot.


In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate synthetic data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=4)
y_kmeans = kmeans.fit_predict(X)

# Visualize the clusters
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75)
plt.title('K-Means Clustering with 4 Centers')
plt.show()

22. Load the Iris dataset and use Agglomerative Clustering to group the data into 3 clusters. Display the first 10 predicted labels.


In [None]:
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering

# Load the Iris dataset
iris = load_iris()
X_iris = iris.data

# Apply Agglomerative Clustering
agg_clustering = AgglomerativeClustering(n_clusters=3)
predicted_labels = agg_clustering.fit_predict(X_iris)

# Display the first 10 predicted labels
print(predicted_labels[:10])

23. Generate synthetic data using make_moons and apply DBSCAN. Highlight outliers in the plot.


In [None]:
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Generate synthetic data
X_moons, _ = make_moons(n_samples=300, noise=0.1)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.2, min_samples=5)
y_dbscan = dbscan.fit_predict(X_moons)

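# Visualize the results, highlighting outliers (DBSCAN label -1) as red crosses
noise_mask = y_dbscan == -1
plt.scatter(X_moons[~noise_mask, 0], X_moons[~noise_mask, 1], c=y_dbscan[~noise_mask], cmap='plasma')
plt.scatter(X_moons[noise_mask, 0], X_moons[noise_mask, 1], c='red', marker='x', label='Outliers')
plt.title('DBSCAN Clustering on Moons Data')
plt.legend()
plt.show()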

24. Load the Wine dataset and apply K-Means clustering after standardizing the features. Print the size of each cluster.


In [None]:
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Load the Wine dataset
wine = load_wine()
X_wine = wine.data

# Standardize the features
scaler = StandardScaler()
X_wine_scaled = scaler.fit_transform(X_wine)

# Apply K-Means clustering
kmeans_wine = KMeans(n_clusters=3)
y_wine_kmeans = kmeans_wine.fit_predict(X_wine_scaled)

# Print the size of each cluster
unique, counts = np.unique(y_wine_kmeans, return_counts=True)
cluster_sizes = dict(zip(unique, counts))
print(cluster_sizes)

25. Use make_circles to generate synthetic data and cluster it using DBSCAN. Plot the result.


In [None]:
from sklearn.datasets import make_circles

# Generate synthetic data
X_circles, _ = make_circles(n_samples=300, noise=0.05)

# Apply DBSCAN
dbscan_circles = DBSCAN(eps=0.1, min_samples=5)
y_circles_dbscan = dbscan_circles.fit_predict(X_circles)

# Visualize the results
plt.scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles_dbscan, cmap='viridis')
plt.title('DBSCAN Clustering on Circular Data')
plt.show()

26. Load the Breast Cancer dataset, apply MinMaxScaler, and use K-Means with 2 clusters. Output the cluster centroids.


In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Load the Breast Cancer dataset
cancer = load_breast_cancer()
X_cancer = cancer.data

# Apply MinMaxScaler
scaler_cancer = MinMaxScaler()
X_cancer_scaled = scaler_cancer.fit_transform(X_cancer)

# Apply K-Means clustering
kmeans_cancer = KMeans(n_clusters=2)
kmeans_cancer.fit(X_cancer_scaled)

# Output the cluster centroids
print(kmeans_cancer.cluster_centers_)

27. Generate synthetic data using make_blobs with varying cluster standard deviations and cluster with DBSCAN.


In [None]:
# Generate synthetic data with varying standard deviations
X_varied, _ = make_blobs(n_samples=300, centers=3, cluster_std=[0.5, 1.0, 1.5], random_state=42)

# Apply DBSCAN
dbscan_varied = DBSCAN(eps=0.5, min_samples=5)
y_varied_dbscan = dbscan_varied.fit_predict(X_varied)

# Visualize the results
plt.scatter(X_varied[:, 0], X_varied[:, 1], c=y_varied_dbscan, cmap='plasma')
plt.title('DBSCAN Clustering on Varied Standard Deviations')
plt.show()

28. Load the Digits dataset, reduce it to 2D using PCA, and visualize clusters from K-Means.


In [None]:
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Load the Digits dataset
digits = load_digits()
X_digits = digits.data

# Reduce dimensions using PCA
pca = PCA(n_components=2)
X_digits_pca = pca.fit_transform(X_digits)

# Apply K-Means clustering
kmeans_digits = KMeans(n_clusters=10)
y_digits_kmeans = kmeans_digits.fit_predict(X_digits_pca)

# Visualize the clusters
plt.scatter(X_digits_pca[:, 0], X_digits_pca[:, 1], c=y_digits_kmeans, cmap='viridis')
plt.title('K-Means Clustering on Digits Dataset (PCA Reduced)')
plt.show()

29. Create synthetic data using make_blobs and evaluate silhouette scores for k = 2 to 5. Display as a bar chart.


In [None]:
from sklearn.metrics import silhouette_score

# Generate synthetic data
X_blob, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Evaluate silhouette scores for k = 2 to 5
silhouette_scores = []
k_values = range(2, 6)

for k in k_values:
    kmeans = KMeans(n_clusters=k)
    y_kmeans = kmeans.fit_predict(X_blob)
    score = silhouette_score(X_blob, y_kmeans)
    silhouette_scores.append(score)

# Plot the silhouette scores
plt.bar(k_values, silhouette_scores)
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Scores for Different k Values')
plt.xticks(k_values)
plt.show()

30. Load the Iris dataset and use hierarchical clustering to group data. Plot a dendrogram with average linkage


In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage

# Load the Iris dataset
iris = load_iris()
X_iris = iris.data

# Perform hierarchical clustering
linked = linkage(X_iris, method='average')

# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linked, orientation='top', labels=iris.target_names[iris.target], distance_sort='descending', show_leaf_counts=True)
plt.title('Dendrogram for Iris Dataset (Average Linkage)')
plt.show()

31. Generate synthetic data with overlapping clusters using make_blobs, then apply K-Means and visualize with decision boundaries.


In [None]:
import numpy as np

# Generate synthetic data with overlapping clusters
X_overlap, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

# Apply K-Means clustering
kmeans_overlap = KMeans(n_clusters=3)
y_overlap_kmeans = kmeans_overlap.fit_predict(X_overlap)

# Create a mesh grid for decision boundaries
x_min, x_max = X_overlap[:, 0].min() - 1, X_overlap[:, 0].max() + 1
y_min, y_max = X_overlap[:, 1].min() - 1, X_overlap[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))

# Predict cluster for each point in the mesh grid
Z = kmeans_overlap.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Visualize the clusters and decision boundaries
plt.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
plt.scatter(X_overlap[:, 0], X_overlap[:, 1], c=y_overlap_kmeans, s=50, cmap='viridis')
plt.title('K-Means Clustering with Overlapping Clusters')
plt.show()

32. Load the Digits dataset and apply DBSCAN after reducing dimensions with t-SNE. Visualize the results.


In [None]:
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
import seaborn as sns

# Load the digits dataset
digits = load_digits()
X = digits.data
y = digits.target

# Reduce dimensions to 2D using t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)

# Apply DBSCAN
dbscan = DBSCAN(eps=5, min_samples=5)
clusters = dbscan.fit_predict(X_tsne)

# Plotting the results
plt.figure(figsize=(10, 7))
sns.scatterplot(x=X_tsne[:, 0], y=X_tsne[:, 1], hue=clusters, palette='tab10', legend='full', s=60)
plt.title("DBSCAN Clusters on Digits Data after t-SNE")
plt.xlabel("t-SNE Component 1")
plt.ylabel("t-SNE Component 2")
plt.legend(title="Cluster")
plt.show()


33. Generate synthetic data using make_blobs and apply Agglomerative Clustering with complete linkage. Plot the result.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)

# Apply Agglomerative Clustering with complete linkage
agg_clustering = AgglomerativeClustering(n_clusters=3, linkage='complete')
y_agg = agg_clustering.fit_predict(X)

# Visualize the results
plt.scatter(X[:, 0], X[:, 1], c=y_agg, cmap='viridis', marker='o', edgecolor='k', s=50)
plt.title('Agglomerative Clustering with Complete Linkage')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

34. Load the Breast Cancer dataset and compare inertia values for K = 2 to 6 using K-Means. Show results in a line plot.


In [None]:
# Load the Breast Cancer dataset
cancer = load_breast_cancer()
X_cancer = cancer.data

# Store inertia values
inertia_values = []
k_values = range(2, 7)

for k in k_values:
    kmeans_cancer = KMeans(n_clusters=k)
    kmeans_cancer.fit(X_cancer)
    inertia_values.append(kmeans_cancer.inertia_)

# Plot the inertia values
plt.plot(k_values, inertia_values, marker='o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Inertia Values for K-Means Clustering on Breast Cancer Dataset')
plt.xticks(k_values)
plt.show()

35. Generate synthetic concentric circles using make_circles and cluster using Agglomerative Clustering with single linkage.


In [None]:
# Generate synthetic concentric circles
X_circles, _ = make_circles(n_samples=300, noise=0.05)

# Apply Agglomerative Clustering with single linkage
agg_clustering_single = AgglomerativeClustering(n_clusters=2, linkage='single')
y_single = agg_clustering_single.fit_predict(X_circles)

# Visualize the results
plt.scatter(X_circles[:, 0], X_circles[:, 1], c=y_single, cmap='viridis')
plt.title('Agglomerative Clustering with Single Linkage on Concentric Circles')
plt.show()

36. Use the Wine dataset, apply DBSCAN after scaling the data, and count the number of clusters (excluding noise).


In [None]:
# Load the Wine dataset
wine = load_wine()
X_wine = wine.data

# Scale the data
scaler_wine = StandardScaler()
X_wine_scaled = scaler_wine.fit_transform(X_wine)

# Apply DBSCAN
dbscan_wine = DBSCAN(eps=0.5, min_samples=5)
y_wine_dbscan = dbscan_wine.fit_predict(X_wine_scaled)

# Count the number of clusters excluding noise
n_clusters = len(set(y_wine_dbscan)) - (1 if -1 in y_wine_dbscan else 0)
print(f'Number of clusters (excluding noise): {n_clusters}')

37. Generate synthetic data with make_blobs and apply KMeans. Then plot the cluster centers on top of the data points.


In [None]:
# Generate synthetic data
X_blob, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)

# Apply KMeans
kmeans_blob = KMeans(n_clusters=3)
y_blob_kmeans = kmeans_blob.fit_predict(X_blob)

# Visualize the clusters and cluster centers
plt.scatter(X_blob[:, 0], X_blob[:, 1], c=y_blob_kmeans, s=50, cmap='viridis')
centers_blob = kmeans_blob.cluster_centers_
plt.scatter(centers_blob[:, 0], centers_blob[:, 1], c='red', s=200, alpha=0.75)
plt.title('KMeans Clustering with Cluster Centers')
plt.show()

38. Load the Iris dataset, cluster with DBSCAN, and print how many samples were identified as noise.


In [None]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import DBSCAN

# Load the Iris dataset
iris = load_iris()
X_iris = iris.data

# Apply DBSCAN
dbscan_iris = DBSCAN(eps=0.5, min_samples=5)
y_iris_dbscan = dbscan_iris.fit_predict(X_iris)

# Count how many samples were identified as noise
n_noise_samples = np.sum(y_iris_dbscan == -1)
print(f'Number of samples identified as noise: {n_noise_samples}')

39. Generate synthetic non-linearly separable data using make_moons, apply K-Means, and visualize the clustering result.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans

# Generate synthetic non-linearly separable data
X_moons, _ = make_moons(n_samples=300, noise=0.1, random_state=42)

# Apply K-Means clustering
kmeans_moons = KMeans(n_clusters=2, random_state=42)
y_moons_kmeans = kmeans_moons.fit_predict(X_moons)

# Visualize the clustering result
plt.scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons_kmeans, cmap='viridis', marker='o', edgecolor='k', s=50)
plt.title('K-Means Clustering on Non-Linearly Separable Data (Moons)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

40. Load the Digits dataset, apply PCA to reduce to 3 components, then use KMeans and visualize with a 3D scatter plot.


In [None]:
from mpl_toolkits.mplot3d import Axes3D

# Load the Digits dataset
digits = load_digits()
X_digits = digits.data

# Reduce dimensions using PCA to 3 components
pca_3d = PCA(n_components=3)
X_digits_pca_3d = pca_3d.fit_transform(X_digits)

# Apply K-Means clustering
kmeans_digits_3d = KMeans(n_clusters=10)
y_digits_kmeans_3d = kmeans_digits_3d.fit_predict(X_digits_pca_3d)

# Visualize the clusters in 3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_digits_pca_3d[:, 0], X_digits_pca_3d[:, 1], X_digits_pca_3d[:, 2], c=y_digits_kmeans_3d, cmap='viridis')
ax.set_title('K-Means Clustering on Digits Dataset (3D PCA Reduced)')
plt.show()

41. Generate synthetic blobs with 5 centers and apply KMeans. Then use silhouette_score to evaluate the clustering.


In [None]:
# Generate synthetic blobs with 5 centers
X_blobs_5, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.60, random_state=0)

# Apply KMeans
kmeans_blobs_5 = KMeans(n_clusters=5)
y_blobs_kmeans_5 = kmeans_blobs_5.fit_predict(X_blobs_5)

# Evaluate silhouette score
silhouette_avg = silhouette_score(X_blobs_5, y_blobs_kmeans_5)
print(f'Silhouette Score for KMeans with 5 centers: {silhouette_avg}')

42. Load the Breast Cancer dataset, reduce dimensionality using PCA, and apply Agglomerative Clustering. Visualize in 2D.


In [None]:
# Load the Breast Cancer dataset
cancer = load_breast_cancer()
X_cancer = cancer.data

# Reduce dimensions using PCA
pca_cancer = PCA(n_components=2)
X_cancer_pca = pca_cancer.fit_transform(X_cancer)

# Apply Agglomerative Clustering
agg_clustering_cancer = AgglomerativeClustering(n_clusters=2)
y_cancer_agg = agg_clustering_cancer.fit_predict(X_cancer_pca)

# Visualize the results
plt.scatter(X_cancer_pca[:, 0], X_cancer_pca[:, 1], c=y_cancer_agg, cmap='viridis')
plt.title('Agglomerative Clustering on Breast Cancer Dataset (PCA Reduced)')
plt.show()

43. Generate noisy circular data using make_circles and visualize clustering results from KMeans and DBSCAN side-by-side.


In [None]:
# Generate noisy circular data
X_circles_noisy, _ = make_circles(n_samples=300, noise=0.1)

# Apply KMeans
kmeans_circles = KMeans(n_clusters=2)
y_circles_kmeans = kmeans_circles.fit_predict(X_circles_noisy)

# Apply DBSCAN
dbscan_circles_noisy = DBSCAN(eps=0.1, min_samples=5)
y_circles_dbscan = dbscan_circles_noisy.fit_predict(X_circles_noisy)

# Plot results side-by-side
fig, axs = plt.subplots(1, 2, figsize=(12, 6))

# KMeans results
axs[0].scatter(X_circles_noisy[:, 0], X_circles_noisy[:, 1], c=y_circles_kmeans, cmap='viridis')
axs[0].set_title('KMeans Clustering on Noisy Circular Data')

# DBSCAN results
axs[1].scatter(X_circles_noisy[:, 0], X_circles_noisy[:, 1], c=y_circles_dbscan, cmap='plasma')
axs[1].set_title('DBSCAN Clustering on Noisy Circular Data')

plt.show()

44. Load the Iris dataset and plot the Silhouette Coefficient for each sample after KMeans clustering.


In [None]:
from sklearn.metrics import silhouette_samples

# Load the Iris dataset
iris = load_iris()
X_iris = iris.data

# Apply KMeans clustering
kmeans_iris = KMeans(n_clusters=3)
y_iris_kmeans = kmeans_iris.fit_predict(X_iris)

# Calculate Silhouette Coefficient
silhouette_vals = silhouette_samples(X_iris, y_iris_kmeans)

# Plot Silhouette Coefficient for each sample
plt.bar(range(len(silhouette_vals)), silhouette_vals)
plt.xlabel('Sample Index')
plt.ylabel('Silhouette Coefficient')
plt.title('Silhouette Coefficient for Each Sample in Iris Dataset')
plt.show()

45. Generate synthetic data using make_blobs and apply Agglomerative Clustering with 'average' linkage. Visualize clusters.


In [None]:
# Generate synthetic data
X_blobs_avg, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)

# Apply Agglomerative Clustering with average linkage
agg_clustering_avg = AgglomerativeClustering(n_clusters=3, linkage='average')
y_avg = agg_clustering_avg.fit_predict(X_blobs_avg)

# Visualize the results
plt.scatter(X_blobs_avg[:, 0], X_blobs_avg[:, 1], c=y_avg, cmap='viridis')
plt.title('Agglomerative Clustering with Average Linkage')
plt.show()

46. Load the Wine dataset, apply KMeans, and visualize the cluster assignments in a seaborn pairplot (first 4 features).


In [None]:
import seaborn as sns
import pandas as pd

# Load the Wine dataset
wine = load_wine()
X_wine = wine.data[:, :4]  # First 4 features
y_wine_kmeans = KMeans(n_clusters=3).fit_predict(X_wine)

# Create a DataFrame for visualization
wine_df = pd.DataFrame(X_wine, columns=wine.feature_names[:4])
wine_df['Cluster'] = y_wine_kmeans

# Visualize with seaborn pairplot; pairplot draws a grid of axes, so the title is set on the figure
grid = sns.pairplot(wine_df, hue='Cluster', palette='viridis')
grid.fig.suptitle('KMeans Clustering on Wine Dataset (First 4 Features)', y=1.02)
plt.show()

47. Generate noisy blobs using make_blobs and use DBSCAN to identify both clusters and noise points. Print the count.


In [None]:
# Generate noisy blobs
X_noisy_blobs, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

# Apply DBSCAN
dbscan_noisy = DBSCAN(eps=0.5, min_samples=5)
y_noisy_dbscan = dbscan_noisy.fit_predict(X_noisy_blobs)

# Count clusters and noise points
n_clusters_noisy = len(set(y_noisy_dbscan)) - (1 if -1 in y_noisy_dbscan else 0)
n_noise_points = np.sum(y_noisy_dbscan == -1)

print(f'Number of clusters (excluding noise): {n_clusters_noisy}')
print(f'Number of noise points: {n_noise_points}')

48. Load the Digits dataset, reduce dimensions using t-SNE, then apply Agglomerative Clustering and plot the clusters.

In [None]:
# Load the Digits dataset
digits = load_digits()
X_digits = digits.data

# Reduce dimensions using t-SNE
tsne_digits = TSNE(n_components=2, random_state=42)
X_digits_tsne = tsne_digits.fit_transform(X_digits)

# Apply Agglomerative Clustering
agg_clustering_digits = AgglomerativeClustering(n_clusters=10)
y_digits_agg = agg_clustering_digits.fit_predict(X_digits_tsne)

# Visualize the results
plt.scatter(X_digits_tsne[:, 0], X_digits_tsne[:, 1], c=y_digits_agg, cmap='viridis')
plt.title('Agglomerative Clustering on Digits Dataset (t-SNE Reduced)')
plt.show()