# **DBSCAN Clustering**

This notebook applies DBSCAN clustering to the provided dataset to identify distinct groupings based on height and weight. The goal is to uncover natural clusters that might indicate patterns or relationships in the data, which could be useful for applications such as health assessments, demographic studies, or tailored product marketing.

**Load the data**

In [1]:
!pip install pycaret



In [2]:
import pandas as pd
dataset_url = 'https://raw.githubusercontent.com/neeharikasinghsjsu/cmpe255assignments/main/Clustering/dataset/d_dbscan_clustering_height_weight.csv'
dataset = pd.read_csv(dataset_url)
dataset.head()

      Weight      Height
0  67.062924  176.086355
1  68.804094  178.388669
2  60.930863  170.284496
3  59.733843  168.691992
4  65.431230  173.763679


**Data Preprocessing**

In [3]:
# Checking for missing values in the dataset
missing_values = dataset.isnull().sum()

# Summary statistics of the dataset to understand the scale of the data
summary_statistics = dataset.describe()

missing_values, summary_statistics



(Weight    0
 Height    0
 dtype: int64,
            Weight      Height
 count  500.000000  500.000000
 mean    61.270240  169.515781
 std      5.196976    4.805095
 min     50.433644  160.182164
 25%     57.772791  166.607599
 50%     61.961518  169.726252
 75%     65.439332  172.837284
 max     70.700456  178.894770)

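The dataset has no missing values across its 500 rows. Weight ranges from about 50.4 to 70.7 (standard deviation 5.2) and height from about 160.2 to 178.9 (standard deviation 4.8), so the two features vary on a similar scale.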
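DBSCAN compares points by distance, so `eps` is interpreted in the units of the features. The two columns here have similar spreads, but when features sit on very different scales a standardization step is common. The following is a minimal sketch using scikit-learn's `StandardScaler`; the `sample` frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical sample with the same column names as the dataset (illustrative values)
sample = pd.DataFrame({
    "Weight": [67.1, 68.8, 60.9, 59.7, 65.4],
    "Height": [176.1, 178.4, 170.3, 168.7, 173.8],
})

# Rescale each column to zero mean and unit variance
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(sample), columns=sample.columns)

print(scaled.round(3))
```

PyCaret's `setup` also accepts a `normalize=True` argument that should apply a comparable scaling inside its pipeline. If the features are rescaled, `eps` would need to be retuned, since a radius of 0.5 means something different in standardized units.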

**DBSCAN clustering using PyCaret**

In [4]:
from pycaret.clustering import setup, create_model, assign_model

# `dataset` is the DataFrame loaded and checked in the previous cells

# Setting up the PyCaret clustering environment
cluster_setup = setup(data = dataset, session_id=123)

# Creating the DBSCAN model
# eps is the neighborhood radius and min_samples is the minimum number of
# points needed to form a dense region; adjust both to suit the data
dbscan_model = create_model('dbscan', eps=0.5, min_samples=5)

# Assigning the model to the dataset to create clusters
dbscan_results = assign_model(dbscan_model)



                 Description     Value
0                 Session id       123
1        Original data shape  (500, 2)
2     Transformed data shape  (500, 2)
3           Numeric features         2
4                 Preprocess      True
5            Imputation type    simple
6         Numeric imputation      mean
7     Categorical imputation      mode
8                   CPU Jobs        -1
9                    Use GPU     False


   Silhouette  Calinski-Harabasz  Davies-Bouldin  Homogeneity  Rand Index  Completeness
0      0.4785           666.8406           2.062            0           0             0
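Silhouette, Calinski-Harabasz, and Davies-Bouldin are internal metrics computed from the data and the cluster labels alone. Homogeneity, Rand Index, and Completeness need ground-truth labels, which is presumably why they are reported as 0 here. As a rough sketch of how the three internal scores can be computed directly with scikit-learn, the example below runs `DBSCAN` on synthetic blobs (not the height/weight data). It scores all points with the noise label -1 treated as its own group; whether PyCaret handles noise the same way is an assumption not verified here.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic stand-in data: three well-separated 2-D blobs
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=123,
)

# Fit DBSCAN and get one label per point (-1 marks noise)
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

# These scores require at least two distinct labels
sil = silhouette_score(X, labels)         # higher is better, range [-1, 1]
ch = calinski_harabasz_score(X, labels)   # higher is better
db = davies_bouldin_score(X, labels)      # lower is better

print("Silhouette:", round(sil, 4))
print("Calinski-Harabasz:", round(ch, 4))
print("Davies-Bouldin:", round(db, 4))
```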



**Analyze cluster results**

In [5]:

from pycaret.clustering import plot_model

# Plotting the clusters
plot_model(dbscan_model, plot = 'cluster')

# Checking the distribution of data points in each cluster
cluster_distribution = dbscan_results['Cluster'].value_counts()

# Analyzing characteristics of each cluster
cluster_characteristics = dbscan_results.groupby('Cluster').mean()

print(cluster_distribution)
print(cluster_characteristics)


Cluster 3     118
Cluster 2     113
Cluster 1     110
Cluster 0     108
Cluster -1     44
Cluster 4       7
Name: Cluster, dtype: int64
               Weight      Height
Cluster                          
Cluster -1  63.202641  171.287247
Cluster 0   67.513885  175.987701
Cluster 1   60.763290  170.100037
Cluster 2   63.172745  169.166748
Cluster 3   53.646328  162.797043
Cluster 4   58.565067  168.240555


The DBSCAN clustering results can be summarized as follows:

The dataset is divided into five clusters (0 to 4) and a noise group (-1).

Clusters 0 to 3 have fairly balanced sizes (108 to 118 data points each), while Cluster 4 is much smaller, with only 7 data points.

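Cluster -1, representing noise (outliers), consists of 44 data points, or 8.8% of the 500 rows. These are points that are neither dense enough to seed a cluster nor within `eps` of a point that is, so they were not assigned to any cluster.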
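To illustrate how the -1 label arises, here is a small sketch using scikit-learn's `DBSCAN` directly on toy data (the points are invented for the example). Two tight groups each contain enough neighbors within `eps` to form a cluster, while a single distant point has no neighbors and is labeled as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two tight groups of six points each, plus one isolated point
group_a = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [0.1, 0.1], [0.05, 0.05], [0.2, 0.1],
])
group_b = group_a + 5.0
isolated = np.array([[2.5, 10.0]])
X = np.vstack([group_a, group_b, isolated])

# Same parameter values as in the notebook; min_samples counts the point itself
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

print(labels.tolist())
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, -1]
```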

Comparing each cluster's mean Weight and Height with the overall means (61.27 and 169.52):

Cluster 0: Highest average weight (67.51) and height (175.99).

Cluster 1: Near-average weight (60.76) and height (170.10).

Cluster 2: Slightly above-average weight (63.17) and near-average height (169.17).

Cluster 3: Lowest average weight (53.65) and height (162.80).

Cluster 4: Below-average weight (58.57) and slightly below-average height (168.24), based on only 7 points.

Cluster -1 (Noise): Slightly above-average weight (63.20) and height (171.29). Noise points are not a cohesive group, so these means say little on their own; the points possibly show more variation than those in the other clusters.
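Means alone cannot show whether the noise group is more spread out than the clusters. One way to check is to report the standard deviation and count next to the mean for each group. The sketch below uses a small hypothetical frame named `results` with the same column names as `dbscan_results`; the values are made up, and the same `agg` call could be applied to the real results.

```python
import pandas as pd

# Hypothetical stand-in for `dbscan_results` (made-up values, same column names)
results = pd.DataFrame({
    "Weight": [67.0, 68.0, 53.0, 54.0, 60.0, 72.0],
    "Height": [176.0, 177.0, 162.0, 163.0, 165.0, 180.0],
    "Cluster": [
        "Cluster 0", "Cluster 0",
        "Cluster 3", "Cluster 3",
        "Cluster -1", "Cluster -1",
    ],
})

# Mean, standard deviation, and size per cluster for every numeric column
summary = results.groupby("Cluster").agg(["mean", "std", "count"])

print(summary)
```

A noticeably larger `std` for Cluster -1 than for the numbered clusters would support the reading that noise points are scattered rather than concentrated.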