<a href="https://www.nvidia.com/dli"><img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
from pathlib import Path

str_path = "/content/drive/MyDrive/NVIDIA/Fundamentals_of_Accelerated_Data_Science/Assessment"
base_path = Path(str_path)

In [3]:
# This gets the RAPIDS-Colab install files and checks your GPU. Run this cell and the next cell only.
# Please read the output of this cell. If your Colab instance is not RAPIDS compatible, it will warn you and give you remediation steps.
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py

Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 621, done.
remote: Counting objects: 100% (187/187), done.
remote: Compressing objects: 100% (102/102), done.
remote: Total 621 (delta 143), reused 86 (delta 85), pack-reused 434 (from 3)
Receiving objects: 100% (621/621), 205.72 KiB | 8.23 MiB/s, done.
Resolving deltas: 100% (317/317), done.
Installing RAPIDS remaining 25.10 libraries
Using Python 3.12.12 environment at: /usr
Resolved 175 packages in 2.00s
Prepared 18 packages in 26.43s
Uninstalled 11 packages in 193ms
Installed 18 packages in 53ms
 - bokeh==3.7.3
 + bokeh==3.6.3
 + cucim-cu12==25.10.0
 + cugraph-cu12==25.10.1
 + cuxfilter-cu12==25.10.0
 + datashader==0.18.2
 - holoviews==1.22.1
 + holoviews==1.20.2
 + jupyter-server-proxy==4.4.0
 - nvidia-cublas-cu12==12.6.4.1
 + nvidia-cublas-cu12==12.9.1.4
 - nvidia-cuda-nvcc-cu12==12.5.82
 + nvidia-cuda-nvcc-cu12==12.9.86
 - nvidia-cuda-nvrtc-cu12==12.6.77
 + nvidia-cuda-nvrtc-cu12==12.9.86
 - nvidia-cuf

# Week 3: Identify Risk Factors for Infection

<span style="color:red">
**UPDATE**

Thank you again for the previous analysis. We will next be publishing a public health advisory that warns of specific infection risk factors of which individuals should be aware. Please advise as to which population characteristics are associated with higher infection rates.
</span>

Your goal for this notebook will be to identify key potential demographic and economic risk factors for infection by comparing the infected and uninfected populations.

## Imports

In [4]:
%load_ext cudf.pandas
import pandas as pd
import cuml

## Load Data

Begin by loading the data you've received about week 3 of the outbreak into a cuDF-accelerated pandas DataFrame. The data is located at `./data/week3.csv`. For this notebook you will need all columns of the data.

In [5]:
#df = pd.read_csv("./data/week3.csv")
df = pd.read_csv(Path(base_path, "data", "week3.csv"))
df

,age,sex,employment,infected
0,0,m,U,0.0
1,0,m,U,0.0
2,0,m,U,0.0
3,0,m,U,0.0
4,0,m,U,0.0
...,...,...,...,...
58479889,90,f,V,0.0
58479890,90,f,V,0.0
58479891,90,f,V,0.0
58479892,90,f,V,0.0


## Calculate Infection Rates by Employment Code

Convert the `infected` column to type `float32`. For people who are not infected, the float32 `infected` value should be `0.0`, and for infected people it should be `1.0`.
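
After a conversion like this, it can help to confirm the result. The following is a minimal sketch on toy data; the boolean starting column is an assumption made purely for illustration, and the real column may already be numeric. It uses plain pandas calls, which should behave the same with `cudf.pandas` loaded.

```python
import pandas as pd

# Toy stand-in for the `infected` column (boolean here by assumption).
toy = pd.DataFrame({"infected": [False, True, False, False]})

# Same conversion as in the notebook.
toy["infected"] = toy["infected"].astype("float32")

# Confirm the dtype and that only 0.0 and 1.0 occur.
dtype_name = str(toy["infected"].dtype)
unique_values = sorted(toy["infected"].unique().tolist())
print(dtype_name, unique_values)
```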

In [6]:
df['infected'] = df['infected'].astype("float32")
df

,age,sex,employment,infected
0,0,m,U,0.0
1,0,m,U,0.0
2,0,m,U,0.0
3,0,m,U,0.0
4,0,m,U,0.0
...,...,...,...,...
58479889,90,f,V,0.0
58479890,90,f,V,0.0
58479891,90,f,V,0.0
58479892,90,f,V,0.0


Now, produce a list of employment types and their associated **rates** of infection, sorted from highest to lowest rate of infection.

**NOTE**: The infection **rate** for each employment type should be the fraction of individuals within that employment type who are infected. Therefore, if employment type "X" has 1,000 people, and 10 of them are infected, the infection **rate** would be 0.01. If employment type "Z" has 10,000 people, and 50 of them are infected, the infection rate would be 0.005, which is **lower** than for type "X", even though more people within that employment type were infected.
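
The arithmetic in this note can be checked on toy data. The sketch below builds hypothetical groups "X" and "Z" with the counts from the note and shows that, for a column holding only `0.0` and `1.0`, the group mean equals infected count divided by group size. The variable names are illustrative only.

```python
import pandas as pd

# Toy data matching the note: type "X" has 1,000 people with 10 infected,
# type "Z" has 10,000 people with 50 infected.
toy = pd.DataFrame({
    "employment": ["X"] * 1000 + ["Z"] * 10_000,
    "infected": [1.0] * 10 + [0.0] * 990 + [1.0] * 50 + [0.0] * 9_950,
})

grouped = toy.groupby("employment")["infected"]

# Rate computed explicitly as infected count / group size ...
explicit_rate = grouped.sum() / grouped.count()
# ... and as the group mean, which is the same quantity for a 0/1 column.
mean_rate = grouped.mean()

print(mean_rate.sort_values(ascending=False))
```

This is why taking `.mean()` of `infected` per group yields the infection rate directly.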

#### BEGIN: MWE

In [None]:
df["employment"].value_counts()

employment,count
U,12459115
V,10098466
Z,7161907
Q,3802602
G,3549465
P,3006149
C,2653753
M,2214336
F,2075628
O,1843446


In [None]:
df_temp = df[["employment", "infected"]]

In [None]:
df_temp[df_temp["employment"] == "A"]["infected"].mean()

np.float32(0.0038527579)

In [None]:
emp_groups = df[["employment","infected"]].groupby("employment")
emp_groups.mean()

employment,infected
A,0.003853
"B, D, E",0.003774
C,0.003882
F,0.003182
G,0.004948
H,0.003388
I,0.010354
J,0.003939
K,0.004772
L,0.00497


#### END: MWE

In [7]:
emp_groups = df[["employment", "infected"]].groupby("employment")
emp_rate_df = emp_groups.mean()
emp_rate_df

employment,infected
A,0.003853
"B, D, E",0.003774
C,0.003882
F,0.003182
G,0.004948
H,0.003388
I,0.010354
J,0.003939
K,0.004772
L,0.00497


Finally, read in the employment codes guide from `./data/code_guide.csv` to interpret which employment types are seeing the highest rates of infection.

In [8]:
#emp_codes = pd.read_csv("./data/code_guide.csv")
emp_codes = pd.read_csv(Path(base_path, "data", "code_guide.csv"))
emp_codes

,Code,Field
0,A,"Agriculture, forestry & fishing"
1,"B, D, E","Mining, energy and water supply"
2,C,Manufacturing
3,F,Construction
4,G,"Wholesale, retail & repair of motor vehicles"
5,H,Transport & storage
6,I,Accommodation & food services
7,J,Information & communication
8,K,Financial & insurance activities
9,L,Real estate activities


### Get Top 2 Employment Types with Highest Rate of Infection ###

Here we ask you to get the top two employment types that have the highest rate of infection. We start by using `.sort_values()` to sort `emp_rate_df` by the rate of infection. We then take the first 2 results.

We will also need to index `emp_codes` to get the respective field names.
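
The sort-then-lookup steps can be sketched on toy data. One detail worth knowing: a boolean mask built with `.isin()` returns rows in the order of the code guide, not in ranked order. Whether the assessment depends on the order is not stated here, so the label-based alternative below is only an option. The rates and field names are illustrative, not taken from the dataset.

```python
import pandas as pd

# Toy rates and code guide; values are illustrative, not from the dataset.
rates = pd.DataFrame(
    {"infected": [0.004, 0.013, 0.010]},
    index=pd.Index(["A", "Q", "I"], name="employment"),
)
codes = pd.DataFrame({
    "Code": ["A", "I", "Q"],
    "Field": ["Agriculture", "Accommodation", "Health"],
})

# Codes of the two highest rates, highest first.
top_codes = rates.sort_values("infected", ascending=False).iloc[:2].index

# Boolean-mask lookup: rows come back in the guide's order.
by_mask = codes.loc[codes["Code"].isin(top_codes), "Field"].tolist()

# Label lookup: rows come back in ranked order.
by_rank = codes.set_index("Code").loc[top_codes, "Field"].tolist()

print(by_mask)
print(by_rank)
```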

#### BEGIN: MWE

In [9]:
emp_rate_df.columns

Index(['infected'], dtype='object')

In [14]:
emp_rate_df.sort_values(["infected"], ascending=False)

employment,infected
Q,0.012756
I,0.010354
V,0.00759
P,0.00619
Z,0.005655
"R, S, T",0.00539
O,0.005284
L,0.00497
G,0.004948
N,0.004784


In [15]:
top_inf_emp = emp_rate_df.sort_values(["infected"], ascending=False).iloc[:2].index

In [16]:
top_inf_emp

Index(['Q', 'I'], dtype='object', name='employment')

In [17]:
top_inf_emp_df = emp_codes.loc[emp_codes['Code'].isin(top_inf_emp), 'Field']

In [18]:
top_inf_emp_df

,Field
6,Accommodation & food services
14,Human health & social work activities


#### END: MWE

In [None]:
top_inf_emp = emp_rate_df.sort_values(["infected"], ascending=False).iloc[:2].index
top_inf_emp_df = emp_codes.loc[emp_codes['Code'].isin(top_inf_emp), 'Field']
top_inf_emp_df

In [19]:
#top_inf_emp_df.to_json('my_assessment/question_3.json', orient='records')
top_inf_emp_df.to_json(Path(base_path, "my_assessment", "question_3.json"), orient='records')



## Calculate Infection Rates by Employment Code and Sex

We want to see if there is an effect of `sex` on infection rate, either in addition to `employment` or confounding it. Group by both `employment` and `sex` simultaneously to get the infection rate for the intersection of those categories.

In [20]:
simul_groups = df.groupby(['employment', 'sex'])
simul_groups.mean().sort_values('infected', ascending=False)

employment,sex,age,infected
I,f,41.377627,0.015064
Q,f,41.3854,0.014947
V,f,76.022214,0.010852
"B, D, E",f,41.425618,0.007973
"R, S, T",f,41.371672,0.007748
O,f,41.396246,0.007719
K,f,41.377495,0.007672
M,f,41.401898,0.007645
J,f,41.385772,0.007645
C,f,41.391365,0.00763
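
To compare the sexes within each employment type, it may help to place the two rates side by side. The sketch below does this on toy data by selecting only `infected` before taking the mean and then unstacking `sex` into columns. The `f_minus_m` column name is a hypothetical choice for illustration.

```python
import pandas as pd

# Toy data: two employment types, two sexes, 0/1 infection flags.
toy = pd.DataFrame({
    "employment": ["I", "I", "I", "I", "Q", "Q", "Q", "Q"],
    "sex":        ["f", "f", "m", "m", "f", "f", "m", "m"],
    "infected":   [1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0],
})

# Infection rate for each (employment, sex) pair.
rate_by_pair = toy.groupby(["employment", "sex"])["infected"].mean()

# Pivot `sex` into columns so each employment type is one row.
side_by_side = rate_by_pair.unstack("sex")
side_by_side["f_minus_m"] = side_by_side["f"] - side_by_side["m"]

print(side_by_side)
```

A consistently positive or negative difference across employment types would suggest an effect of `sex` that is separate from `employment`.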


## Check Submission ##

In [None]:
# The file was written under base_path, so read it from there.
!cat "{base_path}/my_assessment/question_3.json"
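
Beyond printing the file, its structure can be checked programmatically. The sketch below assumes the expected format is a JSON array of exactly two strings. `check_submission` is a hypothetical helper, not part of the assessment tooling, and the demonstration uses a temporary file with placeholder names.

```python
import json
import tempfile
from pathlib import Path

def check_submission(path):
    """Hypothetical helper: return the parsed list if the file holds
    a JSON array of exactly two strings, otherwise raise ValueError."""
    data = json.loads(Path(path).read_text())
    if not isinstance(data, list) or len(data) != 2:
        raise ValueError("expected a JSON array with exactly two entries")
    if not all(isinstance(item, str) for item in data):
        raise ValueError("expected every entry to be a string")
    return data

# Demonstration on a temporary file with placeholder field names.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "question_3.json"
    demo_path.write_text('["Field one","Field two"]')
    parsed = check_submission(demo_path)

print(parsed)
```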

**Tip**: Your submission file should contain one line of text, similar to:

```
["Agriculture, forestry & fishing","Mining, energy and water supply"]
```

<div align="center"><h2>Please Restart the Kernel</h2></div>

If you plan to continue work in other notebooks, please shut down the kernel.

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)
