# DBSCAN Clustering with PyCaret

## Assignment D
**Goal:** Implement DBSCAN clustering using the PyCaret library.

**Dataset:** [Glass Identification](https://paperswithcode.com/dataset/glass-identification), loaded from PyCaret's built-in dataset repository with `get_data('glass')`.

In [1]:
!pip install -q pycaret




In [2]:
import pandas as pd
from pycaret.clustering import (
    assign_model, create_model, get_config, plot_model, setup,
)
from pycaret.datasets import get_data

# Load the Glass dataset from PyCaret's dataset repository
data = get_data('glass')

# Preview the first rows
print(data.head())



        RI     Na    Mg    Al     Si     K    Ca   Ba   Fe  Type
0  1.52101  13.64  4.49  1.10  71.78  0.06  8.75  0.0  0.0     1
1  1.51761  13.89  3.60  1.36  72.73  0.48  7.83  0.0  0.0     1
2  1.51618  13.53  3.55  1.54  72.99  0.39  7.78  0.0  0.0     1
3  1.51766  13.21  3.69  1.29  72.61  0.57  8.22  0.0  0.0     1
4  1.51742  13.27  3.62  1.24  73.08  0.55  8.07  0.0  0.0     1


## 1. Setup PyCaret Environment
We initialize the clustering environment. The `Type` column holds the known glass class, so we exclude it with `ignore_features` to keep the clustering unsupervised.

In [3]:
s = setup(data, ignore_features=['Type'], session_id=123, verbose=False)
print("Setup Complete")

Setup Complete


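## 2. Create DBSCAN Model
We create the DBSCAN model. Unlike k-means, DBSCAN does not take `num_clusters`; the number of clusters follows from two density parameters. `eps` is the maximum distance at which two points count as neighbors, and `min_samples` is the number of points (including the point itself) needed within `eps` for a point to be a core point. The values used here, `eps=0.5` and `min_samples=5`, are also scikit-learn's defaults, which is why the printed model shows only `n_jobs`.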

In [4]:
dbscan = create_model('dbscan', eps=0.5, min_samples=5)
print(dbscan)

   Silhouette  Calinski-Harabasz  Davies-Bouldin  Homogeneity  Rand Index  Completeness
0      0.2529            27.6361          1.3886            0           0             0


DBSCAN(n_jobs=-1)
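
To make the roles of `eps` and `min_samples` concrete, here is a minimal pure-Python sketch of the DBSCAN idea. It is a toy O(n²) illustration, not the scikit-learn implementation that PyCaret wraps; the names `dbscan_sketch` and `region_query` and the sample points are hypothetical.

```python
import math

NOISE = -1  # same convention as scikit-learn: noise gets label -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan_sketch(points, eps=0.5, min_samples=5):
    """Toy DBSCAN: returns one integer label per point, -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels


# Hypothetical data: two tight groups of four points and one isolated point
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
          (10.0, 0.0)]
labels = dbscan_sketch(points, eps=0.5, min_samples=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Raising `min_samples` to 5 on the same points marks every point as noise, because no neighborhood of radius `eps` contains five points. This is the kind of sensitivity to check when the share of `Cluster -1` rows looks high.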


## 3. Assign Labels
We assign the cluster labels to the original dataset.

In [5]:
results = assign_model(dbscan)
results.head()

        RI     Na    Mg    Al         Si     K    Ca   Ba   Fe     Cluster
0  1.52101  13.64  4.49  1.10  71.779999  0.06  8.75  0.0  0.0  Cluster -1
1  1.51761  13.89  3.60  1.36  72.730003  0.48  7.83  0.0  0.0   Cluster 0
2  1.51618  13.53  3.55  1.54  72.989998  0.39  7.78  0.0  0.0   Cluster 0
3  1.51766  13.21  3.69  1.29  72.610001  0.57  8.22  0.0  0.0   Cluster 0
4  1.51742  13.27  3.62  1.24  73.080002  0.55  8.07  0.0  0.0   Cluster 0
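
Once labels are assigned, it is useful to see how many points fall in each cluster and how many were marked as noise. The sketch below assumes the label format shown above (`'Cluster -1'` for noise); the helper name `summarize_clusters` and the example labels are hypothetical.

```python
from collections import Counter


def summarize_clusters(labels, noise_label="Cluster -1"):
    """Count points per label and report the share of noise points."""
    counts = Counter(labels)
    n_noise = counts.get(noise_label, 0)
    # Noise is not a real cluster, so leave it out of the cluster count
    n_clusters = len(counts) - (1 if noise_label in counts else 0)
    noise_share = n_noise / len(labels) if len(labels) else 0.0
    return {
        "counts": dict(counts),
        "n_clusters": n_clusters,
        "noise_share": noise_share,
    }


# Hypothetical labels in the same format as the 'Cluster' column
example_labels = ["Cluster -1", "Cluster 0", "Cluster 0", "Cluster 1", "Cluster 0"]
summary = summarize_clusters(example_labels)
print(summary["n_clusters"], summary["noise_share"])  # 2 0.2
```

In the notebook, the same helper could be called as `summarize_clusters(results['Cluster'])`, since a pandas Series supports both iteration and `len`.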


## 4. Visual Analysis
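PyCaret's `plot_model` draws standard diagnostic plots for a fitted model. Here `plot='cluster'` shows the points projected onto two PCA components and colored by cluster, and `plot='distribution'` shows how many points fall in each cluster.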

In [6]:
# 2D Plot (PCA)
plot_model(dbscan, plot='cluster')

# Distribution Plot
plot_model(dbscan, plot='distribution')

## 5. Evaluation
PyCaret computes the Silhouette Score automatically during model creation. We can also calculate it manually with scikit-learn to confirm the value.

In [7]:
from sklearn.metrics import silhouette_score
X = get_config('X')
labels = results['Cluster']

# DBSCAN marks noise points with the label 'Cluster -1'; list the labels to confirm
print(f"Unique clusters: {labels.unique()}")

# Noise points are kept here, so 'Cluster -1' counts as one more cluster in the score.
# PyCaret returns string labels such as 'Cluster 0', so encode them as integers
# before passing them to sklearn.
if labels.dtype == 'object':
    from sklearn.preprocessing import LabelEncoder
    le = LabelEncoder()
    labels_encoded = le.fit_transform(labels)
else:
    labels_encoded = labels

score = silhouette_score(X, labels_encoded)
print(f"Silhouette Score: {score:.4f}")

Unique clusters: ['Cluster -1' 'Cluster 0' 'Cluster 1' 'Cluster 3' 'Cluster 2']
Silhouette Score: 0.2529
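
The score above treats `Cluster -1` as one more cluster. A common alternative is to drop the noise points first and score only the clustered points. The sketch below shows that variant with a small pure-Python silhouette. It is a toy O(n²) illustration with hypothetical names and data, not scikit-learn's implementation, and it assumes integer labels with `-1` for noise.

```python
import math


def silhouette_without_noise(points, labels, noise_label=-1):
    """Mean silhouette over non-noise points only."""
    kept = [(p, l) for p, l in zip(points, labels) if l != noise_label]
    clusters = {}
    for p, l in kept:
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two non-noise clusters")

    def mean_dist(p, members):
        return sum(math.dist(p, q) for q in members) / len(members)

    scores = []
    for p, l in kept:
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(mean_dist(p, m) for k, m in clusters.items() if k != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical data: two well-separated pairs and one noise point
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (50.0, 50.0)]
lbls = [0, 0, 1, 1, -1]
score_no_noise = silhouette_without_noise(pts, lbls)
print(f"Silhouette without noise: {score_no_noise:.4f}")
```

In the notebook itself, the same idea could be applied by masking out the rows where `results['Cluster'] == 'Cluster -1'` in both `X` and the encoded labels before calling `silhouette_score`. The result will generally differ from the 0.2529 reported above.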
