1️⃣ What is unsupervised learning in the context of machine learning?
Unsupervised learning is a type of machine learning where the model learns patterns and structures in unlabeled data. Unlike supervised learning (with input-output pairs), unsupervised learning finds hidden structures (like clusters, groups, or associations) within data without predefined labels. Examples include clustering and dimensionality reduction.

2️⃣ How does K-Means clustering algorithm work?
K-Means partitions data into K clusters by:

1. Randomly initializing K cluster centroids.

2. Assigning each point to the nearest centroid (based on distance).

3. Updating centroids by calculating the mean of points in each cluster.

4. Repeating steps 2-3 until centroids stabilize or max iterations are reached.
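The steps above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans`, the toy data, and the choice to start from K randomly picked data points are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-Means loop (hypothetical helper); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated toy groups (made-up data for illustration).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans(X, k=2)
print(labels)
print(centroids)
```

On this toy data the first three points end up in one cluster and the last three in the other; which cluster gets label 0 depends on the random start.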

3️⃣ Explain the concept of a dendrogram in hierarchical clustering.
A dendrogram is a tree-like diagram that shows how data points are merged in hierarchical clustering. It starts with each point as its own cluster and merges them step by step, showing cluster hierarchy. Cutting the dendrogram at a chosen height gives the final clusters.
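As a rough sketch of what "cutting the dendrogram" means, the example below uses SciPy's `linkage` and `fcluster` on five made-up 1-D points. The data and the cut height of 2.0 are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five 1-D points (made-up): two tight pairs and one distant point.
X = np.array([[0.0], [0.5], [5.0], [5.5], [20.0]])

# Each row of Z records one merge: the two cluster ids joined, the merge
# distance (the height in the dendrogram), and the size of the new cluster.
Z = linkage(X, method='average')
print(Z[:, 2])  # merge heights, from the bottom of the tree to the top

# Cutting at height 2.0 keeps only the merges made at distance <= 2.0.
labels = fcluster(Z, t=2.0, criterion='distance')
print(labels)
```

The two pairs merge at height 0.5, far below the cut, so the cut leaves three clusters: the two pairs and the isolated point.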

4️⃣ What is the main difference between K-Means and Hierarchical Clustering?
K-Means: Requires the number of clusters (K) in advance; forms non-overlapping clusters.

Hierarchical: Builds a hierarchy (dendrogram); doesn't need K in advance; can give nested clusters at different levels.

5️⃣ What are the advantages of DBSCAN over K-Means?
Can find arbitrarily shaped clusters.

Automatically detects outliers (noise).

No need to specify the number of clusters in advance.

Does not assume spherical, similarly sized clusters. Note that a single ε still makes it hard to separate clusters whose densities differ widely.

6️⃣ When would you use Silhouette Score in clustering?
Use Silhouette Score to evaluate how well points are clustered by measuring cohesion (within-cluster) and separation (between-cluster). It’s useful when comparing different clustering models or when the true labels are unknown.

7️⃣ What are the limitations of Hierarchical Clustering?
Computationally expensive for large datasets: the distance matrix needs O(n²) memory, and a naive agglomerative implementation takes up to O(n³) time.

No way to "undo" a merge step.

Sensitive to noise and outliers.

Poorly suited to streaming data, since adding new points generally means rebuilding the hierarchy.

8️⃣ Why is feature scaling important in clustering algorithms like K-Means?
K-Means relies on distance metrics (e.g., Euclidean). Features with large magnitudes dominate the distance calculation, so scaling (like standardization) ensures all features contribute equally to the cluster formation.
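A small numeric illustration of the point, using made-up income and age values. The standardization is written out by hand (subtract the column mean, divide by the column standard deviation), which is the same transformation `StandardScaler` applies.

```python
import numpy as np

# Made-up data: column 0 is income in dollars, column 1 is age in years.
X = np.array([[30_000.0, 25.0],
              [30_500.0, 60.0],
              [60_000.0, 26.0]])

def pairwise(A):
    """Euclidean distance between every pair of rows."""
    return np.linalg.norm(A[:, None, :] - A[None, :, :], axis=2)

raw = pairwise(X)

# Standardize each column to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = pairwise(X_std)

print("raw:    d(0,1) =", raw[0, 1], " d(0,2) =", raw[0, 2])
print("scaled: d(0,1) =", scaled[0, 1], " d(0,2) =", scaled[0, 2])
```

On the raw data, person 0 looks far closer to person 1 than to person 2 because the income column dominates, even though persons 0 and 1 differ by 35 years in age. After standardization the two distances are of similar size, so both features influence the result.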

9️⃣ How does DBSCAN identify noise points?
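DBSCAN first finds core points, which have at least min_samples points in their neighborhood (within distance ε). A point is labeled as noise if it is not a core point and does not lie within ε of any core point. Noise points don’t belong to any cluster; non-core points that do lie within ε of a core point are border points and join that core point's cluster.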
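The sketch below classifies points as core, border (not core, but within ε of a core point), or noise. It is an illustration rather than scikit-learn's code: the helper name `classify_points` and the toy data are assumptions, and a point is counted as its own neighbor, which matches how scikit-learn's `min_samples` is documented.

```python
import numpy as np

def classify_points(X, eps, min_samples):
    """Label each point 'core', 'border', or 'noise' (hypothetical helper)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = D <= eps                       # each point is its own neighbor
    is_core = neighbors.sum(axis=1) >= min_samples
    # A non-core point is a border point if some core point lies within eps.
    near_core = (neighbors & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "border", "noise"))

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # dense square
              [0.25, 0.1],                                       # edge of the square
              [3.0, 3.0]])                                       # isolated point
kinds = classify_points(X, eps=0.2, min_samples=4)
print(kinds)
```

Here the four square corners are core points, the point at (0.25, 0.1) has only three neighbors but sits next to a core point, so it is a border point, and the isolated point is noise.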

🔟 Define inertia in the context of K-Means.
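Inertia is the sum of squared distances between data points and their assigned cluster centroids. It measures how tightly points are clustered around centroids. For a fixed K, lower inertia means tighter clusters; because inertia tends to keep falling as K grows, it cannot be used on its own to compare different values of K.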
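A worked example on four made-up points shows the calculation. The cluster assignment is supplied by hand here; in scikit-learn the same quantity is exposed as `KMeans.inertia_`.

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])  # hand-picked assignment for illustration

# Centroid of each cluster, then the squared distance of every point to its own centroid.
centroids = np.array([X[labels == j].mean(axis=0) for j in np.unique(labels)])
inertia = ((X - centroids[labels]) ** 2).sum()
print(inertia)  # each point is 1 unit from its centroid: 4 * 1**2 = 4.0

# With a single cluster the centroid is (5, 1) and every point is far from it.
inertia_one_cluster = ((X - X.mean(axis=0)) ** 2).sum()
print(inertia_one_cluster)  # 4 * (5**2 + 1**2) = 104.0
```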

1️⃣1️⃣ What is the elbow method in K-Means clustering?
The elbow method involves plotting inertia vs. the number of clusters (K). The "elbow point" where inertia starts decreasing more slowly suggests the optimal K value.
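A minimal sketch of the elbow method on synthetic blobs, printing the inertia values instead of plotting them. The dataset parameters, the range of K, and `n_init=10` are assumptions chosen for the example.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic data with 4 true centers (illustrative parameters).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

k_values = list(range(1, 8))
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Drop in inertia gained by each additional cluster; the elbow is where
# these drops become small.
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))
```

On data like this the drop from K=3 to K=4 is much larger than the drop from K=4 to K=5, which is the bend that the plot would show at K=4.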

1️⃣2️⃣ Describe the concept of "density" in DBSCAN.
Density refers to the number of points within a certain ε-neighborhood. DBSCAN groups points with enough neighbors (min_samples) into a cluster, forming clusters of high density, while sparse regions are considered noise.

1️⃣3️⃣ Can hierarchical clustering be used on categorical data?
Yes, but with modifications. You need to define a suitable distance metric for categorical data (e.g., Hamming distance or matching coefficient), as standard Euclidean distance isn’t meaningful for categories.
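One possible way to do this, sketched with SciPy: compute Hamming distances with `pdist` and pass the condensed distance matrix to `linkage`. The records and their integer encoding are made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up categorical records, integer-encoded per column
# (the codes are labels only; their numeric order carries no meaning).
# Columns: color (0=red, 1=blue), size (0=S, 1=M, 2=L), shape (0=round, 1=square)
X = np.array([[0, 0, 0],
              [0, 0, 1],
              [1, 2, 1],
              [1, 2, 0]])

# Hamming distance = fraction of columns in which two records differ.
d = pdist(X, metric='hamming')
Z = linkage(d, method='average')

# Ask for at most 2 flat clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(d)
print(labels)
```

Records 0 and 1 differ in one column out of three, as do records 2 and 3, so each pair merges first and the two pairs form the two clusters.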

1️⃣4️⃣ What does a negative Silhouette Score indicate?
A negative score means a point is, on average, closer to the points of a neighboring cluster than to the points of its own cluster, which suggests it has been assigned to the wrong cluster.
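To see where a negative value comes from, the sketch below computes the per-point silhouette by hand as s = (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the smallest mean distance to the points of any other cluster. The helper name `silhouette_values`, the data, and the deliberately poor labeling are assumptions for illustration; scikit-learn provides `silhouette_samples` for real use.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-point silhouette s = (b - a) / max(a, b) (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                    # exclude the point itself
        if not same.any():
            continue                       # singleton cluster: score stays 0
        a = D[i, same].mean()              # cohesion: own cluster
        b = min(D[i, labels == c].mean()   # separation: nearest other cluster
                for c in set(labels.tolist()) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# 1-D points; the point at x=2 is deliberately put in the far cluster.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1, 1])
scores = silhouette_values(X, labels)
print(scores)
```

For the point at x=2, a = 8.5 and b = 1.5, so its score is (1.5 - 8.5) / 8.5, which is about -0.82.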

1️⃣5️⃣ Explain the term "linkage criteria" in hierarchical clustering.
Linkage criteria define how distances between clusters are measured when merging:

Single linkage: Minimum distance.

Complete linkage: Maximum distance.

Average linkage: Average distance.
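The three criteria can be compared on two small made-up clusters by computing every cross-cluster distance and taking the minimum, maximum, and mean:

```python
import numpy as np

# Two made-up clusters on a line.
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])

# All pairwise distances between a point in A and a point in B.
D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single = D.min()     # closest pair: 1 -> 4
complete = D.max()   # farthest pair: 0 -> 6
average = D.mean()   # mean of 4, 6, 3, 5
print(single, complete, average)  # 3.0 6.0 4.5
```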

1️⃣6️⃣ Why might K-Means clustering perform poorly on data with varying cluster sizes or densities?
K-Means assumes clusters are spherical and similar in size. If clusters vary in size/density, K-Means may assign points incorrectly, merge small clusters into large ones, or split dense clusters.

1️⃣7️⃣ What are the core parameters in DBSCAN, and how do they influence clustering?
ε (epsilon): Defines the radius for neighborhood searches.

min_samples: Minimum points to form a dense region.
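A small ε marks many points as noise or fragments the data into many small clusters, while a large ε may merge distinct clusters. A larger min_samples demands denser regions, so more points end up as noise. The right choice of ε and min_samples is critical for effective clustering.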

1️⃣8️⃣ How does K-Means++ improve upon standard K-Means initialization?
K-Means++ selects initial centroids by:

Choosing the first centroid randomly.

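Selecting each subsequent centroid at random from the data points, with probability proportional to its squared distance from the nearest already chosen centroid.
This spreads the initial centroids apart, which avoids poor initializations and typically speeds up convergence.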
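A minimal sketch of the seeding step in NumPy, assuming the usual K-Means++ rule of sampling each new centroid with probability proportional to the squared distance to its nearest chosen centroid. The function name `kmeans_pp_init` and the toy data are made up, and the sketch assumes the data contains at least k distinct points.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centroids with K-Means++ style seeding (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # First centroid: a data point chosen uniformly at random.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid.
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Far-away points are more likely to be picked next.
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# Two groups of repeated points, so the outcome does not depend on the seed.
X = np.array([[0.0, 0.0]] * 3 + [[10.0, 10.0]] * 3)
init = kmeans_pp_init(X, k=2)
print(init)
```

Points that coincide with an already chosen centroid have zero probability, so the second centroid always lands in the other group here. Plain random initialization could instead pick both centroids from the same group.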

1️⃣9️⃣ What is agglomerative clustering?
Agglomerative clustering is a type of hierarchical clustering where:

Each data point starts as its own cluster.

Pairs of clusters are merged step by step based on a linkage criterion until all points form a single cluster.

2️⃣0️⃣ What makes Silhouette Score a better metric than just inertia for model evaluation?
Inertia measures within-cluster compactness but not how clusters are separated.

Silhouette Score combines cohesion (how similar points are to their own cluster) and separation (how different they are from other clusters), giving a more comprehensive evaluation, especially for comparing different cluster counts or models.

2️⃣1️⃣ Generate synthetic data with 4 centers using make_blobs and apply K-Means clustering. Visualize using a scatter plot

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Generate synthetic data
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=4, random_state=0)
y_kmeans = kmeans.fit_predict(X)

# Visualize the clusters
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='red', marker='X')
plt.title("K-Means Clustering with 4 Centers")
plt.show()
2️⃣2️⃣ Load the Iris dataset and use Agglomerative Clustering to group the data into 3 clusters. Display the first 10 predicted labels

from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering

# Load the Iris dataset
iris = load_iris()
X = iris.data

# Apply Agglomerative Clustering
agg_clustering = AgglomerativeClustering(n_clusters=3)
labels = agg_clustering.fit_predict(X)

# Display the first 10 predicted labels
print("First 10 predicted labels:", labels[:10])
2️⃣3️⃣ Generate synthetic data using make_moons and apply DBSCAN. Highlight outliers in the plot

from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
import numpy as np

# Generate synthetic data
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(X)

# Identify core samples and outliers
core_samples_mask = np.zeros_like(labels, dtype=bool)
core_samples_mask[dbscan.core_sample_indices_] = True
outliers = labels == -1

# Plot the results
plt.scatter(X[~outliers, 0], X[~outliers, 1], c=labels[~outliers], cmap='viridis', s=50)
plt.scatter(X[outliers, 0], X[outliers, 1], c='red', s=50, marker='x', label='Outliers')
plt.title("DBSCAN Clustering with Outliers Highlighted")
plt.legend()
plt.show()
2️⃣4️⃣ Load the Wine dataset and apply K-Means clustering after standardizing the features. Print the size of each cluster

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from collections import Counter

# Load the Wine dataset
wine = load_wine()
X = wine.data

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Print the size of each cluster
cluster_sizes = Counter(labels)
print("Cluster sizes:", cluster_sizes)
2️⃣5️⃣ Use make_circles to generate synthetic data and cluster it using DBSCAN. Plot the result

from sklearn.datasets import make_circles

# Generate synthetic data
X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(X)

# Plot the results
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.title("DBSCAN Clustering on Concentric Circles")
plt.show()
2️⃣6️⃣ Load the Breast Cancer dataset, apply MinMaxScaler, and use K-Means with 2 clusters. Output the cluster centroids

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data

# Apply MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(X_scaled)

# Output the cluster centroids
print("Cluster centroids:\n", kmeans.cluster_centers_)
2️⃣7️⃣ Generate synthetic data using make_blobs with varying cluster standard deviations and cluster with DBSCAN

# Generate synthetic data with varying cluster standard deviations
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[0.5, 1.0, 2.5], random_state=0)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.9, min_samples=5)
labels = dbscan.fit_predict(X)

# Plot the results
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.title("DBSCAN Clustering with Varying Cluster Densities")
plt.show()
2️⃣8️⃣ Load the Digits dataset, reduce it to 2D using PCA, and visualize clusters from K-Means

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Load the Digits dataset
digits = load_digits()
X = digits.data

# Reduce to 2D using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=10, random_state=0)
labels = kmeans.fit_predict(X_pca)

# Plot the results
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='tab10', s=50)
plt.title("K-Means Clustering on Digits Dataset (PCA Reduced)")
plt.show()
2️⃣9️⃣ Create synthetic data using make_blobs and evaluate silhouette scores for k = 2 to 5. Display as a bar chart

from sklearn.metrics import silhouette_score

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Evaluate silhouette scores for k = 2 to 5
scores = []
k_values = range(2, 6)
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=0)
    labels = kmeans.fit_predict(X)
    score = silhouette_score(X, labels)
    scores.append(score)

# Display as a bar chart
plt.bar(k_values, scores)
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Scores for Different k Values")
plt.show()
3️⃣0️⃣ Load the Iris dataset and use hierarchical clustering to group data. Plot a dendrogram with average linkage

from scipy.cluster.hierarchy import dendrogram, linkage

# Load the Iris dataset
iris = load_iris()
X = iris.data

# Compute the linkage matrix
linked = linkage(X, method='average')

# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linked, labels=iris.target, leaf_rotation=90)
plt.title("Hierarchical Clustering Dendrogram (Average Linkage)")
plt.xlabel("Sample (true class label)")
plt.ylabel("Distance")
plt.show()

3️⃣6️⃣ Use the Wine dataset, apply DBSCAN after scaling the data, and count the number of clusters (excluding noise)

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
import numpy as np

# Load and scale the Wine dataset
data = load_wine()
X = StandardScaler().fit_transform(data.data)

# Apply DBSCAN
dbscan = DBSCAN(eps=1.5, min_samples=5)
labels = dbscan.fit_predict(X)

# Count clusters (excluding noise)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Number of clusters (excluding noise):", n_clusters)
3️⃣7️⃣ Generate synthetic data with make_blobs and apply KMeans. Then plot the cluster centers on top of the data points
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Apply KMeans
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(X)

# Plot data points and cluster centers
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis', s=50)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='red', marker='X')
plt.title("KMeans Clustering with Cluster Centers")
plt.show()
3️⃣8️⃣ Load the Iris dataset, cluster with DBSCAN, and print how many samples were identified as noise

from sklearn.datasets import load_iris

# Load Iris dataset
X = load_iris().data

# Apply DBSCAN
dbscan = DBSCAN(eps=0.8, min_samples=5)
labels = dbscan.fit_predict(X)

# Count noise points
noise_points = np.sum(labels == -1)
print("Number of noise points:", noise_points)
3️⃣9️⃣ Generate synthetic non-linearly separable data using make_moons, apply K-Means, and visualize the clustering result

from sklearn.datasets import make_moons

# Generate non-linear data
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# Apply KMeans
kmeans = KMeans(n_clusters=2, random_state=42)
labels = kmeans.fit_predict(X)

# Plot clustering result
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.title("KMeans Clustering on Non-Linear Data (make_moons)")
plt.show()
4️⃣0️⃣ Load the Digits dataset, apply PCA to reduce to 3 components, then use KMeans and visualize with a 3D scatter plot

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D

# Load Digits dataset
X = load_digits().data

# Reduce to 3 components using PCA
X_pca = PCA(n_components=3).fit_transform(X)

# Apply KMeans
kmeans = KMeans(n_clusters=10, random_state=42)
labels = kmeans.fit_predict(X_pca)

# 3D Scatter plot
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=labels, cmap='tab10', s=50)
legend = ax.legend(*scatter.legend_elements(), title="Clusters")
ax.add_artist(legend)
plt.title("3D PCA + KMeans on Digits Dataset")
plt.show()

3️⃣1️⃣ Generate synthetic blobs with 5 centers and apply KMeans. Then use silhouette_score to evaluate the clustering

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Generate blobs
X, _ = make_blobs(n_samples=500, centers=5, random_state=42)

# Apply KMeans
kmeans = KMeans(n_clusters=5, random_state=42)
labels = kmeans.fit_predict(X)

# Calculate Silhouette Score
score = silhouette_score(X, labels)
print("Silhouette Score:", score)
3️⃣2️⃣ Load the Breast Cancer dataset, reduce dimensionality using PCA, and apply Agglomerative Clustering. Visualize in 2D

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# Load and preprocess
X = load_breast_cancer().data
X_pca = PCA(n_components=2).fit_transform(X)

# Agglomerative Clustering
agg = AgglomerativeClustering(n_clusters=2)
labels = agg.fit_predict(X_pca)

# Visualize
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis')
plt.title("Agglomerative Clustering on Breast Cancer (PCA Reduced)")
plt.show()
3️⃣3️⃣ Generate noisy circular data using make_circles and visualize clustering results from KMeans and DBSCAN side-by-side

from sklearn.datasets import make_circles
from sklearn.cluster import DBSCAN, KMeans
import matplotlib.pyplot as plt

X, _ = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)

# KMeans
kmeans = KMeans(n_clusters=2, random_state=42).fit_predict(X)

# DBSCAN
dbscan = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Plot
fig, ax = plt.subplots(1, 2, figsize=(12, 5))
ax[0].scatter(X[:, 0], X[:, 1], c=kmeans, cmap='viridis')
ax[0].set_title("KMeans Clustering")
ax[1].scatter(X[:, 0], X[:, 1], c=dbscan, cmap='viridis')
ax[1].set_title("DBSCAN Clustering")
plt.show()
3️⃣4️⃣ Load the Iris dataset and plot the Silhouette Coefficient for each sample after KMeans clustering

from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_samples

X = load_iris().data
kmeans = KMeans(n_clusters=3, random_state=42).fit(X)
labels = kmeans.labels_

# Silhouette Coefficients
sil_scores = silhouette_samples(X, labels)

# Plot
plt.bar(range(len(sil_scores)), sil_scores)
plt.title("Silhouette Coefficients per Sample (Iris Dataset)")
plt.xlabel("Sample Index")
plt.ylabel("Silhouette Coefficient")
plt.show()
3️⃣5️⃣ Generate synthetic data using make_blobs and apply Agglomerative Clustering with 'average' linkage. Visualize clusters

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

agg = AgglomerativeClustering(n_clusters=4, linkage='average')
labels = agg.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title("Agglomerative Clustering ('average' linkage)")
plt.show()
3️⃣6️⃣ Load the Wine dataset, apply KMeans, and visualize the cluster assignments in a seaborn pairplot (first 4 features)

import seaborn as sns
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
df = pd.DataFrame(wine.data[:, :4], columns=wine.feature_names[:4])

# KMeans clustering
df['Cluster'] = KMeans(n_clusters=3, random_state=42).fit_predict(wine.data)

# Pairplot
sns.pairplot(df, hue='Cluster', palette='viridis', diag_kind='kde')
plt.show()
3️⃣7️⃣ Generate noisy blobs using make_blobs and use DBSCAN to identify both clusters and noise points. Print the count

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=42)

dbscan = DBSCAN(eps=1.0, min_samples=5)
labels = dbscan.fit_predict(X)

# Count clusters and noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print("Number of clusters:", n_clusters)
print("Number of noise points:", n_noise)
3️⃣8️⃣ Load the Digits dataset, reduce dimensions using t-SNE, then apply Agglomerative Clustering and plot the clusters

from sklearn.manifold import TSNE

X = load_digits().data
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X)

agg = AgglomerativeClustering(n_clusters=10)
labels = agg.fit_predict(X_tsne)

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, cmap='tab10', s=40)
plt.title("Agglomerative Clustering on Digits (t-SNE Reduced)")
plt.show()
