<a href="https://colab.research.google.com/github/kanika0216/python-Basics/blob/main/Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Theoretical Questions**

Ques 1: What is unsupervised learning in the context of machine learning

Ans
Unsupervised learning is a type of machine learning where the algorithm learns patterns and structure from data without any labeled responses. It is mainly used for clustering and dimensionality reduction.

Ques 2: How does K-Means clustering algorithm work

Ans
K-Means clustering partitions data into K clusters by randomly initializing centroids, assigning data points to the nearest centroid, recalculating centroids based on current clusters, and repeating until centroids stabilize or a maximum number of iterations is reached.
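The loop described above can be sketched in plain NumPy. This is a minimal illustration and not scikit-learn's implementation; the function name `kmeans_lloyd`, the toy data, and the `tol` and `seed` parameters are assumptions made for the example.

```python
import numpy as np


def kmeans_lloyd(X, k, max_iter=100, tol=1e-6, seed=0):
    """Plain K-Means loop (hypothetical helper); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of each cluster (keep the old centroid if a cluster is empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids have effectively stopped moving.
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels


# Two well-separated groups of toy points (made-up data).
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.9, 5.3]])
centroids, labels = kmeans_lloyd(X, k=2)
print(labels)
print(centroids)
```

In practice `sklearn.cluster.KMeans` is the better choice, since it adds smarter initialization and multiple restarts.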

Ques 3: Explain the concept of a dendrogram in hierarchical clustering

Ans
A dendrogram is a tree-like diagram that shows how individual data points are merged into clusters step by step in hierarchical clustering. Each merge is represented by a horizontal line connecting the merged clusters. The height of the line indicates the distance or dissimilarity between the merged clusters. By cutting the dendrogram at a certain height, we can decide the number of clusters.
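As a small sketch of "cutting" a dendrogram, the example below builds the merge history with SciPy and cuts it at a chosen height. The six 1-D points and the cut height of 1.0 are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Six 1-D points forming two obvious groups (made-up data).
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])

# Each row of Z records one merge: the two clusters joined, the merge
# height (distance), and the size of the new cluster.
Z = linkage(X, method="average")
print(Z[:, 2])  # merge heights, i.e. where the joins sit in the dendrogram

# Cutting at height 1.0 keeps only the merges made below that distance.
labels = fcluster(Z, t=1.0, criterion="distance")
print(labels)
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` draws the corresponding tree.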

Ques 4: What is the main difference between K-Means and Hierarchical Clustering

Ans
The main difference is that K-Means requires the number of clusters to be defined in advance and partitions the data directly, whereas Hierarchical Clustering builds a hierarchy of clusters without needing the number of clusters beforehand, using a tree-like structure.

Ques 5: What are the advantages of DBSCAN over K-Means

Ans
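DBSCAN can find clusters of arbitrary shape, does not need the number of clusters specified in advance, and labels low-density points as noise instead of forcing them into a cluster. K-Means, by contrast, needs K up front, favors roughly spherical clusters, and is sensitive to outliers because every point pulls on a centroid.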

Ques 6: When would you use Silhouette Score in clustering

Ans
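Silhouette Score measures clustering quality when no ground-truth labels are available. For each sample it compares the mean distance to points in its own cluster with the mean distance to points in the nearest other cluster, giving a value between -1 and 1. The average over all samples is commonly used to compare different numbers of clusters and pick the one with the highest score.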

Ques 7: What are the limitations of Hierarchical Clustering

Ans
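Hierarchical Clustering is computationally expensive: standard agglomerative algorithms need memory quadratic in the number of samples for the pairwise distances and between quadratic and cubic time, so it scales poorly to large datasets. It is also greedy: once a merge or split is made it cannot be undone, so an early poor decision carries through the rest of the hierarchy.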

Ques 8: Why is feature scaling important in clustering algorithms like K-Means

Ans
Feature scaling ensures that all features contribute equally to distance calculations. Without it, variables with larger ranges dominate the clustering process, leading to biased results.
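A quick way to see the effect is to check how much of a distance each feature accounts for before and after scaling. The sketch below uses made-up age and income values and `StandardScaler`; the numbers are purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up records: [age in years, income in dollars].
X = np.array([[25.0, 50_000.0],
              [60.0, 51_000.0],
              [26.0, 80_000.0],
              [45.0, 62_000.0]])

# Share of the squared Euclidean distance between rows 0 and 1
# that comes from each feature, before scaling.
diff_raw = (X[0] - X[1]) ** 2
share_raw = diff_raw / diff_raw.sum()
print("raw share (age, income):   ", share_raw)

# After standardization each feature has mean 0 and unit variance.
X_scaled = StandardScaler().fit_transform(X)
diff_scaled = (X_scaled[0] - X_scaled[1]) ** 2
share_scaled = diff_scaled / diff_scaled.sum()
print("scaled share (age, income):", share_scaled)
```

On the raw values the 1,000-dollar income gap swamps the 35-year age gap; after scaling, the age gap, which is large relative to the spread of ages, accounts for most of the distance.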

Ques 9: How does DBSCAN identify noise points

Ans
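DBSCAN labels a point as noise when it is neither a core point nor within the radius eps of any core point. In other words, it has fewer than min_samples points in its own eps-neighborhood and is not reachable from any dense region. In scikit-learn these points receive the label -1.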

Ques 10: Define inertia in the context of K-Means

Ans
Inertia is the sum of squared distances between each data point and its assigned cluster centroid. Lower inertia indicates tighter clusters.
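The definition can be checked by recomputing inertia by hand and comparing it with the `inertia_` attribute of a fitted `KMeans`. The dataset and the `random_state` and `n_init` values below are assumptions chosen for a repeatable demo.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up the centroid assigned to each point, then sum the squared distances.
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = np.sum((X - assigned_centroids) ** 2)

print(manual_inertia, kmeans.inertia_)
```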

Ques 11: What is the elbow method in K-Means clustering

Ans
The elbow method involves plotting the inertia against a range of K values and selecting the point where the decrease in inertia slows down significantly, resembling an "elbow" shape.
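A minimal sketch of the procedure, printing the drop in inertia for each added cluster rather than plotting it. The synthetic data with 4 centers and the range of K from 1 to 8 are assumptions for the demo.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 true centers (an assumption for this demo).
X, _ = make_blobs(n_samples=400, centers=4, random_state=42)

ks = list(range(1, 9))
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

# Drop in inertia gained by each additional cluster; the elbow is where
# these gains become small compared with the earlier ones.
gains = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
for k, gain in zip(ks[1:], gains):
    print(f"k={k}: inertia drops by {gain:.1f}")
```

Plotting `ks` against `inertias` with a line plot gives the usual elbow curve.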

Ques 12: Describe the concept of "density" in DBSCAN

Ans
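Density in DBSCAN refers to the number of data points within a given radius (eps) of a point. A point with at least min_samples points in its eps-neighborhood (counting itself, in scikit-learn) is a core point. A point that falls short of that threshold but lies within eps of a core point is a border point and still joins that cluster; all remaining points are noise.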
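The three roles a point can play (core, border, noise) can be seen on a tiny example. The 1-D points and the values `eps=0.45` and `min_samples=4` are made up so that each role appears at least once.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up 1-D points: a dense run, one straggler, and one far-away point.
X = np.array([[0.0], [0.1], [0.2], [0.3], [0.7], [5.0]])

db = DBSCAN(eps=0.45, min_samples=4).fit(X)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True
noise_mask = db.labels_ == -1
# Border points belong to a cluster but are not dense enough to be core.
border_mask = ~core_mask & ~noise_mask

print("labels:", db.labels_)
print("core:  ", X[core_mask].ravel())
print("border:", X[border_mask].ravel())
print("noise: ", X[noise_mask].ravel())
```

Here the point at 0.7 has only one neighbor within eps, so it is not core, but that neighbor (0.3) is a core point, which makes 0.7 a border point of the same cluster.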

Ques 13: Can hierarchical clustering be used on categorical data

Ans
Yes, hierarchical clustering can be applied to categorical data using appropriate distance measures like Hamming distance or by converting categories into binary indicators.

Ques 14: What does a negative Silhouette Score indicate

Ans
A negative Silhouette Score indicates that a sample is likely placed in the wrong cluster, as its average distance to points in its own cluster is higher than to points in a neighboring cluster.
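To illustrate, the sketch below deliberately mislabels one point and inspects the per-sample scores from `silhouette_samples`. The data and labels are made up for the example.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Two obvious groups in 1-D (made-up data).
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])

# Deliberately assign the point at 0.2 to the far cluster.
labels = np.array([0, 0, 1, 1, 1, 1])

scores = silhouette_samples(X, labels)
for x, lab, s in zip(X.ravel(), labels, scores):
    print(f"x={x:.1f} label={lab} silhouette={s:+.3f}")
```

Only the mislabeled point gets a negative score, because it sits much closer to the other cluster than to the one it was assigned to.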

Ques 15: Explain the term "linkage criteria" in hierarchical clustering

Ans
Linkage criteria determine how distances between clusters are calculated when merging. Common types include single linkage (minimum distance), complete linkage (maximum distance), and average linkage (average distance between points in the clusters).
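The three criteria named above differ only in how they summarize the pairwise distances between two clusters. A small sketch with two made-up clusters on a line:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small clusters on a line (made-up points).
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])

# All pairwise distances between a point in A and a point in B.
D = cdist(A, B)

single = D.min()     # closest pair
complete = D.max()   # farthest pair
average = D.mean()   # mean over all pairs

print(single, complete, average)
```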

Ques 16: Why might K-Means clustering perform poorly on data with varying cluster sizes or densities

Ans
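K-Means minimizes the within-cluster sum of squared Euclidean distances, which implicitly favors compact, roughly spherical clusters of similar spread. Because each point simply goes to the nearest centroid, the boundary between two clusters lies midway between their centroids regardless of how large or dense each cluster is. As a result, a large or sparse cluster tends to lose points to a neighboring small, dense one, or gets split.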

Ques 17: What are the core parameters in DBSCAN, and how do they influence clustering

Ans
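The core parameters are eps (the radius of a point's neighborhood) and min_samples (the minimum number of points in that neighborhood for the point to count as a core point). A larger eps merges nearby clusters and leaves fewer noise points; a smaller eps splits clusters and marks more points as noise. A larger min_samples demands denser regions, so fewer core points and more noise result.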

Ques 18: How does K-Means++ improve upon standard K-Means initialization

Ans
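K-Means++ picks the first centroid uniformly at random from the data and each subsequent centroid with probability proportional to its squared distance from the nearest centroid already chosen. This spreads the initial centroids apart, which leads to more consistent results, usually faster convergence, and a lower chance of ending in a poor local minimum.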
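The seeding step can be sketched in a few lines of NumPy. This is an illustration of the idea only, not scikit-learn's code; the helper name `kmeans_pp_init` and the toy data are assumptions.

```python
import numpy as np


def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding sketch (hypothetical helper): returns k rows of X."""
    rng = np.random.default_rng(seed)
    # First centroid: uniform random choice among the data points.
    chosen = [rng.integers(len(X))]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid.
        d2 = np.min(
            ((X[:, None, :] - X[chosen][None, :, :]) ** 2).sum(axis=2), axis=1
        )
        # Sample the next centroid with probability proportional to d2, so
        # far-away points are favored and chosen points (d2 = 0) are excluded.
        chosen.append(rng.choice(len(X), p=d2 / d2.sum()))
    return X[chosen]


rng = np.random.default_rng(1)
# Three made-up groups of 2-D points.
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in (0.0, 5.0, 10.0)])
init_centroids = kmeans_pp_init(X, k=3)
print(init_centroids)
```

In scikit-learn the same idea is selected with `KMeans(init="k-means++")`, which is the default.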

Ques 19: What is agglomerative clustering

Ans
Agglomerative clustering is a bottom-up approach in hierarchical clustering where each data point starts as its own cluster, and pairs of clusters are merged based on similarity until one cluster remains or a stopping criterion is met.

Ques 20: What makes Silhouette Score a better metric than just inertia for model evaluation

Ans
Silhouette Score considers both intra-cluster cohesion and inter-cluster separation, providing a more comprehensive measure of cluster quality compared to inertia, which only evaluates compactness.

**Practical Questions**

Ques 21: Generate synthetic blobs with 5 centers and apply KMeans. Then use silhouette_score to evaluate the clustering

Ans
Use make_blobs(n_samples=500, centers=5) to create synthetic data. Apply KMeans(n_clusters=5) and fit it to the data. Then calculate silhouette_score(X, kmeans.labels_) to evaluate clustering quality. A higher silhouette score indicates better-defined clusters.
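A minimal sketch of these steps; `random_state=42` and `n_init=10` are assumptions added only to make the run repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# random_state is fixed only for repeatability (an assumption, not from the text).
X, _ = make_blobs(n_samples=500, centers=5, random_state=42)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)
score = silhouette_score(X, kmeans.labels_)

print(f"Silhouette score for k=5: {score:.3f}")
```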

Ques 22: Load the Breast Cancer dataset, reduce dimensionality using PCA, and apply Agglomerative Clustering. Visualize in 2D

Ans
Load the Breast Cancer dataset, standardize it, and reduce dimensions to 2 using PCA. Apply AgglomerativeClustering(n_clusters=2) and visualize the clusters in a 2D scatter plot using the two principal components.

Ques 23: Generate noisy circular data using make_circles and visualize clustering results from KMeans and DBSCAN side-by-side

Ans
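Generate circular data using make_circles(n_samples=500, factor=0.5, noise=0.05). The default factor=0.8 leaves a gap between the rings about as wide as eps=0.2, so DBSCAN tends to merge them; a smaller factor gives a clear gap. Fit both KMeans(n_clusters=2) and DBSCAN(eps=0.2, min_samples=5) to the data. Use subplots to show both clustering outputs side-by-side. K-Means splits the plane with a straight boundary, while DBSCAN follows each ring.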
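As a numeric counterpart to the side-by-side plot, the sketch below compares both algorithms against the true ring membership using the adjusted Rand index (ARI). The values `n_samples=500` and `factor=0.5` are assumptions chosen so the rings are dense and clearly separated for `eps=0.2`; plotting is left out to keep the example short.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

# factor=0.5 puts the inner ring at half the outer radius (assumed setting).
X, y_true = make_circles(n_samples=500, factor=0.5, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# ARI is 1 for a perfect match with the true rings and near 0 for a random one.
ari_km = adjusted_rand_score(y_true, km_labels)
ari_db = adjusted_rand_score(y_true, db_labels)

print(f"K-Means ARI: {ari_km:.2f}")
print(f"DBSCAN  ARI: {ari_db:.2f}")
```

For the visual version, pass `km_labels` and `db_labels` as the `c` argument of two `scatter` calls on a `plt.subplots(1, 2)` figure.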

Ques 24: Load the Iris dataset and plot the Silhouette Coefficient for each sample after KMeans clustering

Ans
Load and standardize the Iris dataset. Apply KMeans(n_clusters=3). Use silhouette_samples() to get the silhouette score for each sample and plot them using a bar graph to assess individual sample clustering quality.

Ques 25: Generate synthetic data using make_blobs and apply Agglomerative Clustering with 'average' linkage. Visualize clusters

Ans
Create data using make_blobs. Apply AgglomerativeClustering(n_clusters=3, linkage='average'). Visualize the clusters using a scatter plot, coloring by predicted labels to see how average linkage merges clusters based on average distances.

Ques 26: Load the Wine dataset, apply KMeans, and visualize the cluster assignments in a seaborn pairplot (first 4 features)

Ans
Load and scale the Wine dataset. Apply KMeans(n_clusters=3) and add the labels to the dataframe. Use sns.pairplot() on the first 4 features with the cluster labels as hue to visualize how the model grouped the samples.

Ques 27: Generate noisy blobs using make_blobs and use DBSCAN to identify both clusters and noise points. Print the count

Ans
Use make_blobs and optionally add noise manually, for example by appending uniformly distributed points. Fit DBSCAN with suitable eps and min_samples. Count the clusters as the number of unique labels excluding -1, and count noise points as those labeled -1.
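A sketch of the counting step. The blob centers, `cluster_std=0.4`, the 20 uniform background points, and the DBSCAN settings are all assumptions chosen for the demo.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Three tight blobs at hand-picked centers, plus 20 uniform background points.
X_blobs, _ = make_blobs(n_samples=300, centers=[(-5, -5), (0, 0), (5, 5)],
                        cluster_std=0.4, random_state=0)
rng = np.random.default_rng(0)
X_noise = rng.uniform(-8, 8, size=(20, 2))
X = np.vstack([X_blobs, X_noise])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# -1 is DBSCAN's noise label, so it must not be counted as a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print("clusters found:", n_clusters)
print("noise points:  ", n_noise)
```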

Ques 28: Load the Digits dataset, reduce dimensions using t-SNE, then apply Agglomerative Clustering and plot the clusters

Ans
Load the Digits dataset and reduce dimensions to 2 using TSNE(n_components=2). Fit AgglomerativeClustering(n_clusters=10) and plot the results with each digit cluster shown in a different color.

Ques 29: Generate synthetic data with 4 centers using make_blobs and apply K-Means clustering. Visualize using a scatter plot

Ans
Generate data with make_blobs(n_samples=500, centers=4). Apply KMeans(n_clusters=4) and fit it. Plot the data with colors based on cluster labels and optionally mark the centroids in the scatter plot.

Ques 30: Load the Iris dataset and use Agglomerative Clustering to group the data into 3 clusters. Display the first 10 predicted labels

Ans
Load and standardize the Iris dataset. Apply AgglomerativeClustering(n_clusters=3) and fit it to the data. Print the first 10 values of model.labels_ to show the cluster assignments for the initial records.

Ques 31: Generate synthetic data using make_moons and apply DBSCAN. Highlight outliers in the plot

Ans
Generate moon-shaped data using make_moons(noise=0.1). Fit DBSCAN(eps=0.2, min_samples=5) and identify outliers as points labeled -1. Plot the data using a different color or marker for noise points to highlight them.

Ques 32: Load the Wine dataset and apply K-Means clustering after standardizing the features. Print the size of each cluster

Ans
Load and standardize the Wine dataset. Apply KMeans(n_clusters=3). Count the number of samples in each cluster using np.bincount(kmeans.labels_) to display the size of each group.
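A minimal sketch of these steps; `random_state=42` and `n_init=10` are assumptions for repeatability.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X = load_wine().data
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)

# np.bincount counts how often each label (0, 1, 2) occurs.
sizes = np.bincount(kmeans.labels_, minlength=3)
for cluster_id, size in enumerate(sizes):
    print(f"cluster {cluster_id}: {size} samples")
```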

Ques 33: Use make_circles to generate synthetic data and cluster it using DBSCAN. Plot the result

Ans
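Create circular data using make_circles(n_samples=500, factor=0.5, noise=0.05); with the default factor=0.8 the rings sit about eps apart and are likely to be merged. Apply DBSCAN(eps=0.2, min_samples=5) and plot the clusters using different colors. With a clear gap between the rings, DBSCAN should separate the two circular regions.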

Ques 34: Load the Breast Cancer dataset, apply MinMaxScaler, and use K-Means with 2 clusters. Output the cluster centroids

Ans
Load the Breast Cancer dataset and scale it using MinMaxScaler. Apply KMeans(n_clusters=2) and fit the data. Print kmeans.cluster_centers_ to get the centroids of the two clusters.

Ques 35: Generate synthetic data using make_blobs with varying cluster standard deviations and cluster with DBSCAN

Ans
Use make_blobs with the cluster_std parameter set to different values for each cluster. Fit DBSCAN with a suitable eps and min_samples. Plot the clusters to see how DBSCAN handles the variation in density.

Ques 36: Load the Digits dataset, reduce it to 2D using PCA, and visualize clusters from K-Means

Ans
Load and scale the Digits dataset. Apply PCA(n_components=2) to reduce the dimensions. Use KMeans(n_clusters=10) and plot the results in 2D using cluster labels to color the points.

Ques 37: Create synthetic data using make_blobs and evaluate silhouette scores for k = 2 to 5. Display as a bar chart

Ans
Generate data using make_blobs. For k in range 2 to 5, fit KMeans and compute the silhouette score. Store the scores and plot them as a bar chart to find the optimal number of clusters based on maximum silhouette score.

Ques 38: Load the Iris dataset and use hierarchical clustering to group data. Plot a dendrogram with average linkage

Ans
Load and scale the Iris dataset. Use linkage(method='average') from scipy to compute the linkage matrix. Then use dendrogram() to plot the hierarchical structure, showing how clusters are merged at each step.

Ques 39: Generate synthetic data with overlapping clusters using make_blobs, then apply K-Means and visualize with decision boundaries

Ans
Create overlapping clusters with make_blobs(cluster_std=2.0). Fit KMeans(n_clusters=3). Use meshgrid and decision boundary plotting techniques to visualize how KMeans separates the overlapping areas.
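The meshgrid technique amounts to predicting a cluster for every point of a dense grid. The sketch below computes that grid of predictions; the grid resolution of 200, the margin of 1 unit, and the `random_state` values are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=2.0, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Grid covering the data with a margin of 1 unit on each side.
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                     np.linspace(y_min, y_max, 200))

# Predict the nearest centroid for every grid point.
grid = np.c_[xx.ravel(), yy.ravel()]
Z = kmeans.predict(grid).reshape(xx.shape)

print(Z.shape, np.unique(Z))
```

To draw the regions, pass the result to `plt.contourf(xx, yy, Z, alpha=0.3)` and overlay `plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_)`.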

Ques 40: Load the Digits dataset and apply DBSCAN after reducing dimensions with t-SNE. Visualize the results

Ans
Load and standardize the Digits dataset. Apply TSNE(n_components=2) and fit DBSCAN to the reduced data. Plot the clusters using different colors and highlight noise points separately.

Ques 41: Generate synthetic data using make_blobs and apply Agglomerative Clustering with complete linkage. Plot the result

Ans
Use make_blobs to generate data and apply AgglomerativeClustering(linkage='complete', n_clusters=3). Plot the results using a scatter plot to visualize how complete linkage forms the clusters based on the farthest point distances.

Ques 42: Load the Breast Cancer dataset and compare inertia values for K = 2 to 6 using K-Means. Show results in a line plot

Ans
Load and scale the dataset. For k in range 2 to 6, fit KMeans(n_clusters=k) and store the inertia_ values. Plot k vs inertia to create an elbow plot to identify the optimal number of clusters.

Ques 43: Generate synthetic concentric circles using make_circles and cluster using Agglomerative Clustering with single linkage

Ans
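Create circular data using make_circles(noise=0.05). Apply AgglomerativeClustering(n_clusters=2, linkage='single') and plot the results. Single linkage merges clusters through their closest points, so it can follow ring-shaped structure and is usually the linkage that separates concentric circles well, provided the rings are clearly separated (for example, a smaller factor such as 0.5). With the default factor=0.8 and noise, a few points bridging the gap can chain the rings together, leaving one large cluster and a tiny one.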
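To see how the linkage choice matters on rings, the sketch below compares single and Ward linkage using the adjusted Rand index against the true ring membership. The settings `n_samples=400`, `factor=0.5`, and `noise=0.03` are assumptions picked to give well-separated rings; they differ from the noise level mentioned above.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

# Well-separated rings with light noise (assumed settings for the demo).
X, y_true = make_circles(n_samples=400, factor=0.5, noise=0.03, random_state=0)

ari = {}
for method in ("single", "ward"):
    labels = AgglomerativeClustering(n_clusters=2, linkage=method).fit_predict(X)
    # ARI near 1 means the two rings were recovered; near 0 means they were not.
    ari[method] = adjusted_rand_score(y_true, labels)
    print(f"{method:>6} linkage ARI: {ari[method]:.2f}")
```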

Ques 44: Use the Wine dataset, apply DBSCAN after scaling the data, and count the number of clusters (excluding noise)

Ans
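Load and scale the Wine dataset. Apply DBSCAN and extract labels_. The default eps=0.5 is often too small for 13 standardized features and can label most samples as noise, so tune eps (and min_samples) first. Count the unique labels excluding -1 to determine the number of clusters found by DBSCAN.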

Ques 45: Generate synthetic data with make_blobs and apply KMeans. Then plot the cluster centers on top of the data points

Ans
Create data with make_blobs(n_samples=300, centers=4). Apply KMeans(n_clusters=4) and plot the data using a scatter plot. Overlay the cluster centers using a larger marker or different color.

Ques 46: Load the Iris dataset, cluster with DBSCAN, and print how many samples were identified as noise

Ans
Load and standardize the Iris dataset. Apply DBSCAN and count the number of samples where label is -1 to get the count of noise points.

Ques 47: Generate synthetic non-linearly separable data using make_moons, apply K-Means, and visualize the clustering result

Ans
Use make_moons(noise=0.1) to generate the data. Apply KMeans(n_clusters=2) and plot the clustering result. K-Means usually fails here: it assigns points to the nearest centroid, so it cuts the two interleaved moons with a straight boundary instead of following their curved shapes.

Ques 48: Load the Digits dataset, apply PCA to reduce to 3 components, then use KMeans and visualize with a 3D scatter plot

Ans
Load and scale the Digits dataset. Apply PCA(n_components=3) and fit KMeans(n_clusters=10). Use a 3D scatter plot to visualize the clusters using the three principal components with different colors for each cluster.