# DBSCAN

In this notebook you will use GPU-accelerated DBSCAN to identify clusters of infected people.

## Objectives

By the time you complete this notebook you will be able to:

- Use GPU-accelerated DBSCAN
- Use cuXfilter to visualize DBSCAN clusters

## Imports

In [1]:
import cudf
import cuml

import cuxfilter as cxf

## Load Data

For this notebook, we again load a subset of our population data with only the columns we need. An `infected` column has been added to the data to indicate whether or not a person is known to be infected with our simulated virus.

In [2]:
gdf = cudf.read_csv('./data/pop_2-04.csv', dtype=['float32', 'float32', 'float32'])
print(gdf.dtypes)
gdf.shape

northing    float32
easting     float32
infected    float32
dtype: object


(1000000, 3)

In [3]:
gdf.head()

Unnamed: 0,northing,easting,infected
0,178547.296875,368012.125,0.0
1,174068.28125,543802.125,0.0
2,358293.6875,435639.875,0.0
3,87240.304688,389607.375,0.0
4,158261.015625,340764.9375,0.0


In [4]:
gdf['infected'].value_counts()

0.0    984331
1.0     15669
Name: infected, dtype: int32

## DBSCAN Clustering

DBSCAN is another unsupervised clustering algorithm. It is particularly effective when the number of clusters is not known up front and the clusters may have concave or other unusual shapes, a situation that often applies in geospatial analytics.

In this series of exercises you will use DBSCAN to identify clusters of infected people by location, which may help us identify groups becoming infected from common patient zeroes and assist in response planning.
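
To make the idea concrete, here is a minimal, unoptimized sketch of the DBSCAN procedure in plain Python. It is not cuML's GPU implementation. The function names `region_query` and `dbscan_sketch`, the sample points, and the parameter values are all assumptions chosen for illustration. Notice that the number of clusters is never passed in: it falls out of the density of the data, and isolated points are labeled `-1` (noise).

```python
import math

NOISE = -1  # label used for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i], including i itself."""
    xi, yi = points[i]
    return [j for j, (xj, yj) in enumerate(points)
            if math.hypot(xi - xj, yi - yj) <= eps]


def dbscan_sketch(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        # points[i] is a core point: start a new cluster and grow it
        cluster_id += 1
        labels[i] = cluster_id
        to_visit = list(neighbors)
        while to_visit:
            j = to_visit.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins, but does not expand
                continue
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                to_visit.extend(j_neighbors)  # j is also core: keep expanding
    return labels


# Two dense groups and one isolated point (made-up coordinates).
points = [(0, 0), (1, 0), (0, 1), (1, 1),
          (10, 10), (11, 10), (10, 11), (11, 11),
          (50, 50)]
labels = dbscan_sketch(points, eps=1.5, min_samples=3)
print(labels)
```

The two dense groups receive their own cluster ids, and the isolated point is labeled as noise.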

### Exercise: Make a DBSCAN Instance

Create a DBSCAN instance with `cuml.DBSCAN`. Pass the named argument `eps` (the maximum distance a point can be from the nearest point in a cluster to be considered possibly in that cluster) with the value `5000`. Because the `northing` and `easting` values we created are measured in meters, this lets us identify clusters of infected people in which individuals may be separated from the rest of the cluster by up to 5 kilometers.

In [24]:
cuDB = cuml.DBSCAN(eps=5000)

#### Solution

In [25]:
# %load solutions/dbscan_instance
dbscan = cuml.DBSCAN(eps=5000)
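
To see what `eps=5000` means for coordinates in meters, the following plain-Python sketch checks whether a candidate location lies within 5 kilometers of the nearest member of an existing group. The helper `within_eps` and all coordinates are invented for illustration. DBSCAN additionally requires that neighbor to be a core point, which this sketch does not check.

```python
import math

EPS_METERS = 5000  # same value passed as eps above


def within_eps(point, cluster_points, eps=EPS_METERS):
    """Return (is_within, distance) for the nearest cluster member.

    Points are (northing, easting) pairs in meters.
    """
    nearest = min(math.hypot(point[0] - n, point[1] - e)
                  for n, e in cluster_points)
    return nearest <= eps, nearest


# Hypothetical group of infected people (northing, easting).
group = [(180000.0, 540000.0), (183000.0, 544000.0)]

near_ok, near_dist = within_eps((185400.0, 547200.0), group)  # 4 km away
far_ok, far_dist = within_eps((189000.0, 552000.0), group)    # 10 km away

print(near_ok, near_dist)
print(far_ok, far_dist)
```

The first candidate is close enough to be considered for the group; the second is not.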


### Exercise: Identify Infected Clusters

Create a new dataframe from the rows of the original dataframe where `infected` is `1` (true), and call it `infected_df`. Be sure to reset the dataframe's index afterward. Use `dbscan.fit_predict` to perform clustering on the `northing` and `easting` columns of `infected_df`, and store the resulting series as a new column in `infected_df` called `cluster`. Finally, compute the number of clusters identified by DBSCAN.

In [37]:
infected_df = gdf.loc[gdf['infected'] == 1.0]
infected_df = infected_df.reset_index(drop=True)  # reset_index returns a new dataframe, so keep the result
infected_df['cluster'] = cuDB.fit_predict(infected_df[['northing', 'easting']])
infected_df['cluster'].nunique()
gdf.head()

Unnamed: 0,northing,easting,infected
0,178547.296875,368012.125,0.0
1,174068.28125,543802.125,0.0
2,358293.6875,435639.875,0.0
3,87240.304688,389607.375,0.0
4,158261.015625,340764.9375,0.0


#### Solution

In [35]:
# %load solutions/identify_infected
infected_df = gdf[gdf['infected'] == 1].reset_index()
infected_df['cluster'] = dbscan.fit_predict(infected_df[['northing', 'easting']])
infected_df['cluster'].nunique()
infected_df.head()  # show the infected rows with their cluster labels

Unnamed: 0,index,northing,easting,infected,cluster
0,41,595509.25,426545.59375,1.0,-1
1,182,185086.609375,547718.0,1.0,0
2,190,183406.546875,528708.0625,1.0,0
3,216,373846.5625,485603.84375,1.0,0
4,266,194947.40625,535378.375,1.0,0
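
The first row of the output above has the cluster label `-1`, which DBSCAN uses for noise points that belong to no cluster. Because `nunique()` counts that label as well, it can overstate the number of actual clusters by one. The plain-Python sketch below shows the difference on made-up labels; `count_clusters` is a hypothetical helper, not part of cuDF or cuML.

```python
def count_clusters(labels, noise_label=-1):
    """Count distinct cluster ids, ignoring the noise label."""
    return len({label for label in labels if label != noise_label})


# Made-up labels: three clusters (0, 1, 2) plus two noise points.
sample_labels = [-1, 0, 0, 0, 0, 1, 1, -1, 2]

n_unique = len(set(sample_labels))          # counts -1 as if it were a cluster
n_clusters = count_clusters(sample_labels)  # excludes noise

print(n_unique, n_clusters)
```

With the dataframe from this exercise, the same filter can likely be written as `infected_df[infected_df['cluster'] != -1]['cluster'].nunique()`.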


## Visualize the Clusters

Because we have the same column names as in the K-means example (`easting`, `northing`, and `cluster`), we can use the same code to visualize the clusters.

### Associate a Data Source with cuXfilter

In [None]:
cxf_data = cxf.DataFrame.from_dataframe(infected_df)

### Define Charts and Widgets

As in the K-means notebook, we have an existing integer column to use with multi-select: `cluster`.

In [None]:
chart_width = 600
scatter_chart = cxf.charts.datashader.scatter(x='easting', y='northing', 
                                              width=chart_width, 
                                              height=int((gdf['northing'].max() - gdf['northing'].min()) / 
                                                         (gdf['easting'].max() - gdf['easting'].min()) *
                                                          chart_width))

cluster_widget = cxf.charts.panel_widgets.multi_select('cluster')
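
The `height` expression above scales the chart so that one meter of northing takes the same screen space as one meter of easting. The sketch below isolates that arithmetic in a hypothetical helper, `chart_height`, using invented extents rather than the real data.

```python
def chart_height(north_min, north_max, east_min, east_max, width):
    """Return a pixel height that preserves the data's aspect ratio."""
    north_span = north_max - north_min
    east_span = east_max - east_min
    return int(north_span / east_span * width)


# Invented extents in meters: 900 km north-south, 600 km east-west.
height = chart_height(0.0, 900_000.0, 100_000.0, 700_000.0, width=600)
print(height)
```

A region that is 1.5 times taller than it is wide gets a chart 1.5 times taller than `chart_width`.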

### Create and Show the Dashboard

In [None]:
dash = cxf_data.dashboard([scatter_chart, cluster_widget], theme=cxf.themes.dark, data_size_widget=True)

In [None]:
scatter_chart.view()

In [None]:
%%js
var host = window.location.host;
element.innerText = "'"+host+"'";

Set `my_url` in the next cell to the value just printed, making sure to include the quotes:

In [None]:
my_url = 'http://ec2-34-231-110-250.compute-1.amazonaws.com/'
dash.show(my_url, port=8789)

Then run the next cell to generate a link to the dashboard:

In [None]:
%%js
var host = window.location.host;
var url = 'http://'+host+'/lab/proxy/8789/';
element.innerHTML = '<a style="color:blue;" target="_blank" href='+url+'>Open Dashboard</a>';
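
For reference, the link built by the JavaScript cell above follows a simple pattern: the host name, then `/lab/proxy/`, then the dashboard port. A Python sketch of the same string construction, with `dashboard_proxy_url` as a hypothetical helper and `example.com` as a placeholder host:

```python
def dashboard_proxy_url(host, port=8789):
    """Build the proxied dashboard URL the same way the JavaScript cell does."""
    return f"http://{host}/lab/proxy/{port}/"


# Placeholder host; in the notebook this comes from window.location.host.
url = dashboard_proxy_url("example.com")
print(url)
```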

In [None]:
dash.stop()

<br>
<div align="center"><h2>Please Restart the Kernel</h2></div>

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

## Next

In the next notebook, you will use GPU-accelerated logistic regression to estimate infection risk based on features of our population members.