### We use the DBSCAN algorithm

## Why do we use DBSCAN?

The K-means algorithm covered previously has the following disadvantages:
1) We have to define the number of clusters in advance, and the elbow method (which is used to estimate how many clusters there are) does not always work correctly.
2) It is sensitive to outliers: a single extreme point can drag a centroid far away, ruining a cluster.
3) It works poorly with non-spherical clusters.

## K-means is a centroid-based clustering algorithm (clusters form around center points), whereas DBSCAN is a density-based clustering algorithm: it forms clusters where the density of points is high

OPTICS is also a density-based algorithm.

## Let us discuss how DBSCAN checks whether a point lies in a dense region

DBSCAN uses two hyperparameters, epsilon (a radius) and MinPoints (`eps` and `min_samples` in scikit-learn): <br>
1 - Epsilon is the radius of the circle drawn around a point, with that point as the center; we then count the points that lie inside the circle. <br>
2 - If the number of points in the circle >= MinPoints, the point is considered to be in a dense region.
<br>
These two parameters are the only ones we have to tune to get the best result.
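
As a minimal sketch of this density check, the snippet below counts how many points fall inside the epsilon circle of a given center. The helper name `count_in_radius` and the toy points are made up for illustration; they are not part of the notebook's dataset.

```python
import math

def count_in_radius(points, center, eps):
    """Count the points whose distance to `center` is <= eps (the center counts too)."""
    return sum(1 for p in points if math.dist(p, center) <= eps)

# Toy 2-D data, invented for this example
points = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.4), (3.0, 3.0)]
eps, min_points = 0.5, 3

n = count_in_radius(points, (0.0, 0.0), eps)
print(n, n >= min_points)  # the point is in a dense region if the count reaches min_points
```
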


## Types of points in DBSCAN

1 - Core point = a point whose epsilon circle contains >= MinPoints points (the core point itself is included in the count).
<br>
2 - Border point = a point whose epsilon circle contains < MinPoints points, but at least one of those points is a core point.<br>
3 - Noise = a point whose epsilon circle contains < MinPoints points and none of them is a core point.
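
The three definitions above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions (Euclidean distance, a small made-up list of 2-D points); `classify_points` is a hypothetical helper, not a scikit-learn function.

```python
import math

def classify_points(points, eps, min_points):
    """Label each point 'core', 'border' or 'noise'."""
    # Indices of all points inside each point's epsilon circle (itself included)
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(nb) >= min_points for nb in neighbors]

    kinds = []
    for i, nb in enumerate(neighbors):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in nb):
            kinds.append("border")   # too few neighbors, but a core point is in range
        else:
            kinds.append("noise")    # too few neighbors and no core point in range
    return kinds

# Toy data, invented for this example
points = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.3), (0.7, 0.0), (3.0, 3.0)]
print(classify_points(points, eps=0.5, min_points=3))
```
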


## Density Connected Points

Two points A and B are density connected if

1 - they are linked, possibly indirectly, through a chain of core points only (every point between them is a core point, never a noise or border point)<BR>
2 - and the distance d between each pair of adjacent core points in the chain satisfies d <= epsilon.

Note: density-connected points are placed in the same cluster.

### DBSCAN ALGORITHM

Step 1 - Identify every point as a core point, a border point or a noise point<BR>
Step 2 - For each unclustered core point<BR>
    Step 2a - Create a new cluster<BR>
    Step 2b - Add all the points that are unclustered and density connected to the current point to this cluster<BR>
Step 3 - For each unclustered border point, assign it to the cluster of the nearest core point<BR>
Step 4 - Leave all the noise points as they are.
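
The steps above can be sketched from scratch as follows. This is a small, unoptimized illustration that assumes Euclidean distance and uses made-up points; it is not how scikit-learn implements `DBSCAN`, and the function name `dbscan` is chosen here only for the example. As in scikit-learn, noise is labelled -1.

```python
import math
from collections import deque

def dbscan(points, eps, min_points):
    """Return one cluster label per point; -1 marks noise."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # Step 1: find the core points (border/noise are decided in Steps 3 and 4)
    is_core = [len(nb) >= min_points for nb in neighbors]
    labels = [-1] * n  # Step 4: anything never assigned stays noise
    cluster = 0

    # Step 2: grow one new cluster from each unclustered core point
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        labels[i] = cluster
        queue = deque([i])
        while queue:
            cur = queue.popleft()
            for j in neighbors[cur]:
                # follow chains of core points only (density connectivity)
                if is_core[j] and labels[j] == -1:
                    labels[j] = cluster
                    queue.append(j)
        cluster += 1

    # Step 3: attach each border point to the cluster of its nearest core point
    for i in range(n):
        if is_core[i]:
            continue
        core_nb = [j for j in neighbors[i] if is_core[j]]
        if core_nb:
            nearest = min(core_nb, key=lambda j: math.dist(points[i], points[j]))
            labels[i] = labels[nearest]
    return labels

# Toy data, invented for this example: two dense groups and one isolated point
points = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.3), (0.7, 0.0),
          (5.0, 5.0), (5.2, 5.0), (5.0, 5.2), (9.0, 0.0)]
print(dbscan(points, eps=0.5, min_points=3))
```
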

# Advantages
1. Robust to outliers (it can be used for outlier detection)<br>
2. No need to specify the number of clusters<br>
3. Can find arbitrarily shaped clusters<br>
4. Only 2 hyperparameters to tune

<Br>

# Disadvantages
1. Sensitive to its hyperparameters<br>
2. Difficulty with clusters of varying density (the same parameters rarely suit both sparse and dense clusters)<br>
3. Does not predict (when a new point arrives, it cannot assign a cluster to that point)
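
One common workaround for the last point, sketched here as an assumption rather than a built-in feature, is to give a new point the label of its nearest core point when that core point lies within epsilon, and to treat it as noise (-1) otherwise. The helper `assign_new_point` and the toy data are hypothetical; with scikit-learn, the core points and their labels could be taken from a fitted model's `components_`, `core_sample_indices_` and `labels_` attributes.

```python
import math

def assign_new_point(core_points, core_labels, new_point, eps):
    """Return the label of the nearest core point within eps, else -1 (noise)."""
    best_label, best_dist = -1, eps
    for point, label in zip(core_points, core_labels):
        d = math.dist(point, new_point)
        if d <= best_dist:
            best_label, best_dist = label, d
    return best_label

# Toy core points and their cluster labels, invented for this example
core_points = [(0.0, 0.0), (0.3, 0.0), (5.0, 5.0)]
core_labels = [0, 0, 1]

print(assign_new_point(core_points, core_labels, (0.2, 0.1), eps=0.5))  # near cluster 0
print(assign_new_point(core_points, core_labels, (2.5, 2.5), eps=0.5))  # far from every core point
```
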

In [25]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

In [26]:
df = pd.read_csv("delhi_danger_index.csv")

In [27]:
df.head()

|  | DISTRICT | # INJURED | # KILLED | DANGER_SCORE |
|---|---|---|---|---|
| 0 | NORTH DELHI(ROHINI) | 8615 | 2850 | 17165 |
| 1 | NEW DELHI | 9025 | 2400 | 16225 |
| 2 | SOUTH EAST DELHI | 9359 | 2233 | 16058 |
| 3 | CENTRAL DELHI | 7820 | 2175 | 14345 |
| 4 | SOUTH WEST DELHI | 7157 | 1869 | 12764 |


In [28]:
df.columns = df.columns.str.strip()


In [29]:
X = df[['# INJURED', '# KILLED', 'DANGER_SCORE']]

In [30]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


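## <div>StandardScaler rescales the data so that all features are on the same scale.</div>
It prevents one column from dominating the model just because its numbers are larger.<br><br>
It does not pull every value close to the mean. Instead, it centers each column at 0 and gives it unit standard deviation:<br>
1 - Values above the mean become positive<br>
2 - Values below the mean become negative<br>
3 - The distance from the mean is measured in units of the standard deviation<br>
So, for a value $x$ in a column with mean $\mu$ and standard deviation $\sigma$:

$$z = \frac{x - \mu}{\sigma}$$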
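
As a small worked example of this formula, the sketch below standardizes three made-up values by hand. It uses the population standard deviation (dividing by n), which is also what `StandardScaler` uses; the helper name `standardize` is only illustrative.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population std (divides by n, not n - 1)
    return [(x - mu) / sigma for x in values]

values = [2.0, 4.0, 6.0]  # toy column, invented for this example
z = standardize(values)
print([round(v, 3) for v in z])  # values below the mean are negative, above are positive
```
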

​	
 

In [31]:
model = DBSCAN(eps=0.6, min_samples=2)

In [32]:
df['CLUSTER'] = model.fit_predict(X_scaled)

In [33]:
df.to_csv("delhi_hotspot_clusters.csv", index=False)

In [34]:
print("✅ Clustering complete! File: delhi_hotspot_clusters.csv")

✅ Clustering complete! File: delhi_hotspot_clusters.csv


In [35]:
df[['DISTRICT', 'DANGER_SCORE', 'CLUSTER']].head(50)


|  | DISTRICT | DANGER_SCORE | CLUSTER |
|---|---|---|---|
| 0 | NORTH DELHI(ROHINI) | 17165 | 0 |
| 1 | NEW DELHI | 16225 | 0 |
| 2 | SOUTH EAST DELHI | 16058 | 0 |
| 3 | CENTRAL DELHI | 14345 | 0 |
| 4 | SOUTH WEST DELHI | 12764 | 0 |
| 5 | WEST DELHI | 12681 | 0 |
| 6 | NORTH WEST DELHI | 11752 | 0 |
| 7 | EAST DELHI | 9001 | 1 |
| 8 | SHAHDARA | 7731 | 1 |
| 9 | SOUTH DELHI | 6116 | 1 |


In [36]:
pip install folium geopy


Note: you may need to restart the kernel to use updated packages.


In [37]:
import folium
from folium.plugins import HeatMap

# Load your clustered file
df = pd.read_csv("delhi_hotspot_clusters.csv")

# Delhi district center coordinates (approx)
district_coords = {
    "CENTRAL DELHI": (28.6519, 77.2315),
    "EAST DELHI": (28.6415, 77.2946),
    "NEW DELHI": (28.6139, 77.2090),
    "NORTH DELHI": (28.7041, 77.1025),
    "NORTH EAST DELHI": (28.6863, 77.2625),
    "NORTH WEST DELHI": (28.7480, 77.0565),
    "SOUTH DELHI": (28.5244, 77.1855),
    "SOUTH EAST DELHI": (28.5300, 77.2700),
    "SOUTH WEST DELHI": (28.5733, 76.9800),
    "WEST DELHI": (28.6692, 77.0680),
    "SHAHDARA": (28.6760, 77.2890)
}

In [38]:
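# Normalize names: upper-case, drop any parenthetical suffix, and trim,
# so that "NORTH DELHI(ROHINI)" matches the "NORTH DELHI" key in district_coords
# (otherwise that row gets no coordinates and is dropped below)
df['DISTRICT'] = (
    df['DISTRICT'].str.upper()
    .str.replace(r"\s*\(.*\)", "", regex=True)
    .str.strip()
)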

# Map lat-long
df['LAT'] = df['DISTRICT'].map(lambda x: district_coords.get(x, (None, None))[0])
df['LON'] = df['DISTRICT'].map(lambda x: district_coords.get(x, (None, None))[1])

# Drop rows without coords
df = df.dropna(subset=['LAT', 'LON'])


In [39]:
# Create base map (Delhi)
delhi_map = folium.Map(location=[28.61, 77.23], zoom_start=11)

# Heatmap points
heat_data = [[row['LAT'], row['LON'], row['DANGER_SCORE']] 
             for index, row in df.iterrows()]

HeatMap(heat_data, radius=30, blur=20).add_to(delhi_map)

delhi_map
