<a href="https://www.nvidia.com/dli"><img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/></a>

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
from pathlib import Path

str_path = "/content/drive/MyDrive/NVIDIA/Fundamentals_of_Accelerated_Data_Science/Assessment"
base_path = Path(str_path)


In [5]:
# This gets the RAPIDS-Colab install files and checks your GPU.  Run this and the next cell only.
# Please read the output of this cell.  If your Colab Instance is not RAPIDS compatible, it will warn you and give you remediation steps.
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py

Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 621, done.
remote: Counting objects: 100% (187/187), done.
remote: Compressing objects: 100% (102/102), done.
remote: Total 621 (delta 143), reused 86 (delta 85), pack-reused 434 (from 3)
Receiving objects: 100% (621/621), 205.72 KiB | 22.86 MiB/s, done.
Resolving deltas: 100% (317/317), done.
Installing RAPIDS remaining 25.10 libraries
Using Python 3.12.12 environment at: /usr
Resolved 175 packages in 26.85s
Prepared 18 packages in 25.28s
Uninstalled 11 packages in 182ms
Installed 18 packages in 44ms
 - bokeh==3.7.3
 + bokeh==3.6.3
 + cucim-cu12==25.10.0
 + cugraph-cu12==25.10.1
 + cuxfilter-cu12==25.10.0
 + datashader==0.18.2
 - holoviews==1.22.1
 + holoviews==1.20.2
 + jupyter-server-proxy==4.4.0
 - nvidia-cublas-cu12==12.6.4.1
 + nvidia-cublas-cu12==12.9.1.4
 - nvidia-cuda-nvcc-cu12==12.5.82
 + nvidia-cuda-nvcc-cu12==12.9.86
 - nvidia-cuda-nvrtc-cu12==12.6.77
 + nvidia-cuda-nvrtc-cu12==12.9.86
 - nvidia-c

# Week 1: Find Clusters of Infected People

<span style="color:red">
**URGENT WARNING**

We have been receiving reports from health facilities that a new, fast-spreading virus has been discovered in the population. To prepare our response, we need to understand the geospatial distribution of those who have been infected. Find out whether there are identifiable clusters of infected individuals and where they are.    
</span>

Your goal for this notebook will be to estimate the location of dense geographic clusters of infected people using incoming data from week 1 of the simulated epidemic.

## Imports

In [6]:
%load_ext cudf.pandas
import pandas as pd
import cuml

import cupy as cp

## Load Data

Begin by loading the data you've received about week 1 of the outbreak into a cuDF-accelerated pandas DataFrame. The data is located at `'./data/week1.csv'`. For this notebook you will only need the `'lat'`, `'long'`, and `'infected'` columns. Either drop the columns after loading, or use the `pd.read_csv` named argument `usecols` to provide a list of only the columns you need.
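
The two loading options give the same result. The sketch below shows this on a tiny in-memory CSV with made-up rows standing in for `week1.csv`, using plain pandas on the CPU. The names `csv_text`, `wanted`, `via_usecols`, and `via_select` are illustrative only.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for week1.csv (made-up rows).
csv_text = """age,sex,lat,long,employment,infected
44,f,54.47,-1.65,Q,True
30,m,51.63,-2.93,I,False
"""
wanted = ["lat", "long", "infected"]

# Option 1: parse only the needed columns.
via_usecols = pd.read_csv(io.StringIO(csv_text), usecols=wanted)

# Option 2: parse everything, then keep the needed columns.
via_select = pd.read_csv(io.StringIO(csv_text))[wanted]

print(via_usecols.equals(via_select))
```

With `usecols`, the unused columns are never materialized, which is likely to matter for a file with tens of millions of rows.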

In [7]:
# The course path is "./data/week1.csv"; in this Colab run the data lives under base_path on Google Drive.
df = pd.read_csv(Path(base_path, "data", "week1.csv"), usecols=["lat", "long", "infected"])
df

Unnamed: 0,lat,long,infected
0,54.522510,-1.571896,False
1,54.554030,-1.524968,False
2,54.552486,-1.435203,False
3,54.537189,-1.566215,False
4,54.528212,-1.588462,False
...,...,...,...
58479889,51.634416,-2.925863,False
58479890,51.556972,-3.036290,False
58479891,51.588992,-2.921915,False
58479892,51.590974,-2.954539,False


## Make Data Frame of the Infected

Make a new DataFrame `infected_df` that contains only the infected members of the population.

**Tip**: Reset the index of `infected_df` with `.reset_index(drop=True)`.
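
The tip matters because boolean filtering keeps the original row labels. Here is a minimal sketch on a toy DataFrame (made-up values, plain pandas) showing the index before and after the reset:

```python
import pandas as pd

# Toy data standing in for the real population (hypothetical values).
toy_df = pd.DataFrame({
    "lat": [54.1, 54.2, 54.3, 54.4],
    "long": [-1.1, -1.2, -1.3, -1.4],
    "infected": [False, True, False, True],
})

# Boolean indexing keeps the labels of the rows that survive the filter.
toy_infected = toy_df[toy_df["infected"]]
print(list(toy_infected.index))

# reset_index(drop=True) relabels the rows 0..n-1 and discards the old labels.
toy_infected = toy_infected.reset_index(drop=True)
print(list(toy_infected.index))
```

A contiguous index keeps row labels in step with row positions, which is convenient when array results are later assigned back as new columns.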

### BEGIN: MWE

In [8]:
df[df["infected"]]

Unnamed: 0,lat,long,infected
28928759,54.472766,-1.654932,True
28930512,54.529717,-1.667143,True
28930904,54.512986,-1.589866,True
28932226,54.522322,-1.380694,True
28933748,54.541660,-1.613490,True
...,...,...,...
57404109,52.428347,-3.322932,True
57406802,52.415895,-3.263942,True
57410428,52.539934,-3.617128,True
57411005,52.435490,-3.597263,True


### END: MWE

In [9]:
# `infected` is already boolean, so it can be used directly as the row mask.
infected_df = df[df["infected"]].reset_index(drop=True)

### BEGIN: MWE

In [None]:
infected_df.head()

Unnamed: 0,age,sex,lat,long,employment,infected
0,44,f,54.472766,-1.654932,Q,True
1,47,f,54.529717,-1.667143,Q,True
2,47,f,54.512986,-1.589866,I,True
3,49,f,54.522322,-1.380694,Q,True
4,51,f,54.54166,-1.61349,Q,True


### END: MWE

## Make Grid Coordinates for Infected Locations

The next cell provides the lat/long to OSGB36 grid coordinate converter you used earlier in the workshop. Use it to create grid coordinate values, stored in `northing` and `easting` columns of the `infected_df` you created in the last step.

In [10]:
# https://www.ordnancesurvey.co.uk/docs/support/guide-coordinate-systems-great-britain.pdf

def latlong2osgbgrid_cupy(lat, long, input_degrees=True):
    '''
    Converts latitude and longitude (ellipsoidal) coordinates into northing and easting (grid) coordinates, using a Transverse Mercator projection.

    Inputs:
    lat: latitude coordinate (N)
    long: longitude coordinate (E)
    input_degrees: if True (default), interprets the coordinates as degrees; otherwise, interprets coordinates as radians

    Output:
    (northing, easting)
    '''

    if input_degrees:
        lat = lat * cp.pi/180
        long = long * cp.pi/180

    a = 6377563.396
    b = 6356256.909
    e2 = (a**2 - b**2) / a**2

    N0 = -100000 # northing of true origin
    E0 = 400000 # easting of true origin
    F0 = .9996012717 # scale factor on central meridian
    phi0 = 49 * cp.pi / 180 # latitude of true origin
    lambda0 = -2 * cp.pi / 180 # longitude of true origin and central meridian

    sinlat = cp.sin(lat)
    coslat = cp.cos(lat)
    tanlat = cp.tan(lat)

    latdiff = lat-phi0
    longdiff = long-lambda0

    n = (a-b) / (a+b)
    nu = a * F0 * (1 - e2 * sinlat ** 2) ** -.5
    rho = a * F0 * (1 - e2) * (1 - e2 * sinlat ** 2) ** -1.5
    eta2 = nu / rho - 1
    M = b * F0 * ((1 + n + 5/4 * (n**2 + n**3)) * latdiff -
                  (3*(n+n**2) + 21/8 * n**3) * cp.sin(latdiff) * cp.cos(lat+phi0) +
                  15/8 * (n**2 + n**3) * cp.sin(2*(latdiff)) * cp.cos(2*(lat+phi0)) -
                  35/24 * n**3 * cp.sin(3*(latdiff)) * cp.cos(3*(lat+phi0)))
    I = M + N0
    II = nu/2 * sinlat * coslat
    III = nu/24 * sinlat * coslat ** 3 * (5 - tanlat ** 2 + 9 * eta2)
    IIIA = nu/720 * sinlat * coslat ** 5 * (61-58 * tanlat**2 + tanlat**4)
    IV = nu * coslat
    V = nu / 6 * coslat**3 * (nu/rho - cp.tan(lat)**2)
    VI = nu / 120 * coslat ** 5 * (5 - 18 * tanlat**2 + tanlat**4 + 14 * eta2 - 58 * tanlat**2 * eta2)

    northing = I + II * longdiff**2 + III * longdiff**4 + IIIA * longdiff**6
    easting = E0 + IV * longdiff + V * longdiff**3 + VI * longdiff**5

    return northing, easting

In [11]:
cupy_lat = cp.asarray(infected_df["lat"])
cupy_long = cp.asarray(infected_df["long"])

infected_df['northing'], infected_df['easting'] = latlong2osgbgrid_cupy(cupy_lat, cupy_long)
infected_df

Unnamed: 0,lat,long,infected,northing,easting
0,54.472766,-1.654932,True,508670.060234,422359.759523
1,54.529717,-1.667143,True,515002.666798,421538.547038
2,54.512986,-1.589866,True,513167.535850,426549.874086
3,54.522322,-1.380694,True,514305.280055,440081.234798
4,54.541660,-1.613490,True,516349.132042,425003.005560
...,...,...,...,...,...
18143,52.428347,-3.322932,True,282016.338253,310060.098268
18144,52.415895,-3.263942,True,280559.681381,314046.146547
18145,52.539934,-3.617128,True,294832.815870,290338.202721
18146,52.435490,-3.597263,True,283187.465568,291428.293249


## Find Clusters of Infected People

Use DBSCAN to find clusters of at least 25 infected people where no member is more than 2000m from at least one other cluster member. Create a new column in `infected_df` which contains the cluster to which each infected person belongs.
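
OSGB36 grid coordinates are in metres, so `eps=2000` is a straight-line distance of 2 km on the grid. As a rough sanity check, the sketch below computes the distance between the first two rows of the `infected_df` output shown above. The names `person_a`, `person_b`, and `distance_m` are illustrative.

```python
import math

# (northing, easting) in metres for rows 0 and 1 of infected_df above.
person_a = (508670.060234, 422359.759523)
person_b = (515002.666798, 421538.547038)

# Euclidean distance on the grid; this is the quantity compared against eps.
distance_m = math.hypot(person_a[0] - person_b[0], person_a[1] - person_b[1])

eps = 2000  # metres, matching the DBSCAN requirement
print(f"{distance_m:.0f} m apart; direct neighbours: {distance_m <= eps}")
```

Two people further apart than `eps` can still end up in the same cluster if a chain of closer neighbours connects them.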

In [12]:
dbscan = cuml.DBSCAN(eps=2000, min_samples=25)
infected_df['cluster'] = dbscan.fit_predict(infected_df[["northing", "easting"]])
infected_df

Unnamed: 0,lat,long,infected,northing,easting,cluster
0,54.472766,-1.654932,True,508670.060234,422359.759523,-1
1,54.529717,-1.667143,True,515002.666798,421538.547038,-1
2,54.512986,-1.589866,True,513167.535850,426549.874086,-1
3,54.522322,-1.380694,True,514305.280055,440081.234798,-1
4,54.541660,-1.613490,True,516349.132042,425003.005560,-1
...,...,...,...,...,...,...
18143,52.428347,-3.322932,True,282016.338253,310060.098268,-1
18144,52.415895,-3.263942,True,280559.681381,314046.146547,-1
18145,52.539934,-3.617128,True,294832.815870,290338.202721,-1
18146,52.435490,-3.597263,True,283187.465568,291428.293249,-1


## Find the Centroid of Each Cluster

Use grouping to find the mean `northing` and `easting` values for each cluster identified above.
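
DBSCAN labels points that belong to no cluster as `-1`, so grouping over every label also produces a mean for the noise points, which is not a real centroid. Here is a minimal sketch, on made-up coordinates with plain pandas, of dropping noise before averaging; `toy` and `toy_centroids` are illustrative names.

```python
import pandas as pd

# Made-up grid coordinates with two clusters and one noise point (-1).
toy = pd.DataFrame({
    "northing": [100.0, 110.0, 500.0, 520.0, 900.0],
    "easting": [200.0, 220.0, 600.0, 640.0, 50.0],
    "cluster": [0, 0, 1, 1, -1],
})

# Keep only points assigned to a cluster, then average per cluster label.
clustered = toy[toy["cluster"] != -1]
toy_centroids = clustered.groupby("cluster")[["northing", "easting"]].mean()
print(toy_centroids)
```

Keeping the `-1` row is harmless as long as it is ignored when choosing the largest cluster.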

In [13]:
centroids_df = infected_df[['northing', 'easting', 'cluster']].groupby("cluster").mean()
centroids_df

Unnamed: 0_level_0,northing,easting
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1
-1,378085.504251,401877.070477
0,397661.052147,371410.022807
1,436475.467158,332980.455514
2,347062.237166,389386.821165
3,359668.63842,379638.020073
4,391630.079963,431158.142881
5,386471.292123,426559.09188
6,434970.33495,406985.282976
7,412772.647531,410069.665645
8,415807.314112,414765.634582


Find the number of people in each cluster by counting the number of appearances of each cluster's label in the column produced by DBSCAN.

### BEGIN: MWE

In [None]:
infected_df.head()

Unnamed: 0,age,sex,lat,long,employment,infected,northing,easting,cluster
0,44,f,54.472766,-1.654932,Q,True,508670.060234,422359.759523,-1
1,47,f,54.529717,-1.667143,Q,True,515002.666798,421538.547038,-1
2,47,f,54.512986,-1.589866,I,True,513167.53585,426549.874086,-1
3,49,f,54.522322,-1.380694,Q,True,514305.280055,440081.234798,-1
4,51,f,54.54166,-1.61349,Q,True,516349.132042,425003.00556,-1


### END: MWE

In [14]:
infected_df['cluster'].value_counts()

Unnamed: 0_level_0,count
cluster,Unnamed: 1_level_1
0,8638
-1,8449
2,403
8,94
12,72
13,71
1,68
11,68
4,66
10,64


## Find the Centroid of the Cluster with the Most Members

Use the label of the cluster with the most people to filter `centroids_df`, and write the answer to `my_assessment/question_1.json`.
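
Instead of reading the label off the counts by eye, the largest cluster can be looked up programmatically. The sketch below uses toy labels and centroids (hypothetical values, plain pandas) and assumes the noise label `-1` should not count as a cluster:

```python
import pandas as pd

# Toy cluster labels: three members in 0, four noise points, two members in 1.
labels = pd.Series([0, 0, 0, -1, -1, -1, -1, 1, 1], name="cluster")
toy_centroids = pd.DataFrame(
    {"northing": [105.0, 510.0], "easting": [210.0, 620.0]},
    index=pd.Index([0, 1], name="cluster"),
)

counts = labels.value_counts()
# Ignore noise if present, then take the label with the highest count.
largest = counts.drop(-1, errors="ignore").idxmax()
answer = toy_centroids.loc[largest]
print(largest, answer.to_json())
```

This avoids picking `-1` by mistake when the noise points outnumber every real cluster.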

In [17]:
centroids_df.loc[0]

Unnamed: 0,0
northing,397661.052147
easting,371410.022807


In [19]:
centroids_df.loc[0].to_json(Path( base_path, "my_assessment", "question_1.json" ))



## Check Submission

In [None]:
!cat "{base_path}/my_assessment/question_1.json"

**Tip**: Your submission file should contain one line of text, similar to:

```
{"northing":XXX.XX,"easting":XXX.XX}
```
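
Parsing the file is a stricter check than reading it by eye. The sketch below assumes the submission is a single JSON object with numeric `northing` and `easting` keys. `check_submission` is a hypothetical helper, not part of the course materials, and the demo writes a stand-in file with made-up values to a temporary directory rather than touching the real submission.

```python
import json
import tempfile
from pathlib import Path


def check_submission(path):
    """Parse the file and confirm it has numeric northing/easting keys."""
    answer = json.loads(Path(path).read_text())
    assert set(answer) == {"northing", "easting"}, f"unexpected keys: {set(answer)}"
    assert all(isinstance(v, (int, float)) for v in answer.values())
    return answer


# Stand-in file with made-up values, for demonstration only.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "question_1.json"
    demo_path.write_text('{"northing":123.45,"easting":678.9}')
    checked = check_submission(demo_path)
    print(checked)
```

To check the real file, pass its path (for example the one used in the `to_json` call above) to the helper.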

<div align="center"><h2>Please Restart the Kernel</h2></div>

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)
