# Weather Station Clustering using DBSCAN

<b>DBSCAN (Density-Based Spatial Clustering of Applications with Noise)</b> is well suited to tasks such as class identification in a spatial context. Its key strength is that it can find clusters of arbitrary shape without being thrown off by noise. The following example clusters the locations of weather stations in Canada. DBSCAN can be used here, for instance, to find groups of stations that show the same weather conditions. It not only finds clusters of arbitrary shape, but also picks out the denser regions of the data while ignoring less dense areas and noise.
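To make the density idea concrete, here is a minimal pure-Python sketch of the DBSCAN procedure. It is an illustration only, not the implementation used in this notebook: the function names `region_query` and `dbscan`, the toy points, and the `eps` and `min_samples` values are all assumptions chosen for the example, and plain Euclidean distance is used for simplicity.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        # points[i] is a core point: start a new cluster and grow it
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: reachable, not core
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also core, keep expanding
    return labels


# Hypothetical toy data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_samples=3)
print(labels)
```

The two dense groups each receive their own cluster id, while the isolated point is labeled `-1` (noise). For real station coordinates, Euclidean distance on raw latitude and longitude is only an approximation of geographic distance.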

<b> About the DataSet </b>
<h4 align = "center"> Environment Canada
Monthly Values for July - 2015
</h4>

| Name in the table | Meaning |
| --- | --- |
| Prov | Province |
| Tm | Mean Temperature (°C) |
| DwTm | Days without Valid Mean Temperature |
| D | Mean Temperature difference from Normal (1981-2010) (°C) |
| DwTx | Days without Valid Maximum Temperature |
| DwTn | Days without Valid Minimum Temperature |
| S | Snowfall (cm) |
| DwS | Days without Valid Snowfall |
| S%N | Percent of Normal (1981-2010) Snowfall |
| DwP | Days without Valid Precipitation |
| P%N | Percent of Normal (1981-2010) Precipitation |
| S_G | Snow on the ground at the end of the month (cm) |
| Pd | Number of days with Precipitation 1.0 mm or more |
| BS | Bright Sunshine (hours) |
| DwBS | Days without Valid Bright Sunshine |
| BS% | Percent of Normal (1981-2010) Bright Sunshine |
| HDD | Degree Days below 18 °C |
| CDD | Degree Days above 18 °C |
| Stn_No | Climate station identifier (first 3 digits indicate drainage basin, last 4 characters are for sorting alphabetically) |
| NA | Not Available |

In [1]:
import numpy as np
import pandas as pd
csv_file = 'weather-stations20140101-20141231.csv'
df = pd.read_csv(csv_file)
df

Unnamed: 0,Stn_Name,Lat,Long,Prov,Tm,DwTm,D,Tx,DwTx,Tn,...,DwP,P%N,S_G,Pd,BS,DwBS,BS%,HDD,CDD,Stn_No
0,CHEMAINUS,48.935,-123.742,BC,8.2,0.0,,13.5,0.0,1.0,...,0.0,,0.0,12.0,,,,273.3,0.0,1011500
1,COWICHAN LAKE FORESTRY,48.824,-124.133,BC,7.0,0.0,3.0,15.0,0.0,-3.0,...,0.0,104.0,0.0,12.0,,,,307.0,0.0,1012040
2,LAKE COWICHAN,48.829,-124.052,BC,6.8,13.0,2.8,16.0,9.0,-2.5,...,9.0,,,11.0,,,,168.1,0.0,1012055
3,DISCOVERY ISLAND,48.425,-123.226,BC,,,,12.5,0.0,,...,,,,,,,,,,1012475
4,DUNCAN KELVIN CREEK,48.735,-123.728,BC,7.7,2.0,3.4,14.5,2.0,-1.0,...,2.0,,,11.0,,,,267.7,0.0,1012573
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1336,NAIN A,56.550,-61.683,NL,-22.6,0.0,-5.2,-6.8,0.0,-33.5,...,0.0,66.0,74.0,5.0,,,,1136.5,0.0,8502800
1337,NAIN A,56.551,-61.682,NL,-19.2,24.0,,-7.5,17.0,,...,17.0,,,4.0,,,,148.7,0.0,8502801
1338,SAGLEK,58.333,-62.586,NL,-24.4,2.0,,-13.5,1.0,-32.3,...,,,,,,,,1101.2,0.0,8503249
1339,TUKIALIK BAY,54.716,-58.358,NL,-22.8,2.0,,-5.8,1.0,-32.5,...,,,,,,,,1060.0,0.0,8503992


### Cleaning
Now let's remove the rows that have no value in the `Tm` (mean temperature) field.

In [5]:
df = df.dropna(subset=['Tm'],axis=0)
df

Unnamed: 0,Stn_Name,Lat,Long,Prov,Tm,DwTm,D,Tx,DwTx,Tn,...,DwP,P%N,S_G,Pd,BS,DwBS,BS%,HDD,CDD,Stn_No
0,CHEMAINUS,48.935,-123.742,BC,8.2,0.0,,13.5,0.0,1.0,...,0.0,,0.0,12.0,,,,273.3,0.0,1011500
1,COWICHAN LAKE FORESTRY,48.824,-124.133,BC,7.0,0.0,3.0,15.0,0.0,-3.0,...,0.0,104.0,0.0,12.0,,,,307.0,0.0,1012040
2,LAKE COWICHAN,48.829,-124.052,BC,6.8,13.0,2.8,16.0,9.0,-2.5,...,9.0,,,11.0,,,,168.1,0.0,1012055
4,DUNCAN KELVIN CREEK,48.735,-123.728,BC,7.7,2.0,3.4,14.5,2.0,-1.0,...,2.0,,,11.0,,,,267.7,0.0,1012573
5,ESQUIMALT HARBOUR,48.432,-123.439,BC,8.8,0.0,,13.1,0.0,1.9,...,8.0,,,12.0,,,,258.6,0.0,1012710
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1336,NAIN A,56.550,-61.683,NL,-22.6,0.0,-5.2,-6.8,0.0,-33.5,...,0.0,66.0,74.0,5.0,,,,1136.5,0.0,8502800
1337,NAIN A,56.551,-61.682,NL,-19.2,24.0,,-7.5,17.0,,...,17.0,,,4.0,,,,148.7,0.0,8502801
1338,SAGLEK,58.333,-62.586,NL,-24.4,2.0,,-13.5,1.0,-32.3,...,,,,,,,,1101.2,0.0,8503249
1339,TUKIALIK BAY,54.716,-58.358,NL,-22.8,2.0,,-5.8,1.0,-32.5,...,,,,,,,,1060.0,0.0,8503992
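Note that the row index in the output above still has gaps (row 3 is gone) because `dropna` keeps the original labels. The small sketch below shows the same cleaning step on a hypothetical three-station frame, followed by an optional `reset_index` to renumber the rows. The frame `toy` and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: station B has no mean temperature (Tm).
toy = pd.DataFrame({
    "Stn_Name": ["A", "B", "C"],
    "Tm": [8.2, np.nan, 7.7],
    "D": [np.nan, np.nan, 3.4],
})

# Drop only rows where Tm is missing; NaNs in other columns are kept.
cleaned = toy.dropna(subset=["Tm"])

# Optional: renumber the rows 0..n-1 so positional and label indexing agree.
cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```

Station A survives even though its `D` value is missing, because only the `Tm` column is checked.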


### Visualization
Next, plot the stations on a map.