## Registry Data usage example (ESR6)

This is a simplified example of how I use healthcare registry data. I use individual-level data only as a basis for aggregation: from each patient's residence and date of onset or diagnosis, I obtain incidence counts per *spatial unit* (zip code, province, electoral district) and *time unit* (day, week, month).

To illustrate the linkage process, I will generate a toy environmental dataset and a toy healthcare records dataset, then link them as I usually would:

In [1]:
import numpy as np
import pandas as pd

### Environmental dataset

In general, I fetch different datasets of publicly available or self-generated daily observations of several environmental variables:
+ Weather
+ Pollution
+ Biological air diversity
+ Chemical composition (via LIDAR or in situ sampling).

A toy example would be the following table, spanning only 5 days for two different regions, A and B:

In [2]:
environment_df = pd.DataFrame(dict(
    date=np.repeat(pd.date_range('2021-01-01', '2021-01-05'), 2),
    region=np.tile(['A', 'B'], 5),
    temperature=np.random.normal(20, 5, 10),
    no2=np.random.normal(5, 1, 10),
    fungal_species_1=np.random.normal(1000, 100, 10).astype(int),
    bacterial_species_2=np.random.normal(750, 75, 10).astype(int)))

environment_df.set_index('date')

date,region,temperature,no2,fungal_species_1,bacterial_species_2
2021-01-01,A,21.347166,4.98315,1087,828
2021-01-01,B,17.176902,4.450865,965,874
2021-01-02,A,22.338724,4.416838,992,869
2021-01-02,B,16.863509,6.446523,827,780
2021-01-03,A,16.065507,3.814912,894,850
2021-01-03,B,24.239069,4.750215,966,718
2021-01-04,A,11.192028,5.777668,863,798
2021-01-04,B,23.825289,6.599546,1073,796
2021-01-05,A,15.194472,5.062128,1037,851
2021-01-05,B,18.382841,3.336658,1128,832
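
In practice the environmental variables listed above come from separate sources, so they first have to be brought into one table keyed on `date` and `region`. The following is only a minimal sketch of that step, under assumptions: `weather_df`, `pollution_df` and `combined_env` are hypothetical names, the values are random, and the pollution source is made to miss the last day on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical key grid: 5 days x 2 regions, as in the toy table above
dates = pd.date_range('2021-01-01', '2021-01-05')
keys = pd.DataFrame({
    'date': np.repeat(dates, 2),
    'region': np.tile(['A', 'B'], len(dates)),
})

rng = np.random.default_rng(0)

# Hypothetical weather source covering every day and region
weather_df = keys.assign(temperature=rng.normal(20, 5, len(keys)))

# Hypothetical pollution source that lacks the last day (last two rows)
pollution_df = keys.iloc[:-2].assign(no2=rng.normal(5, 1, len(keys) - 2))

# An outer merge keeps every date/region pair seen in either source;
# validate='one_to_one' raises if a source has duplicated keys
combined_env = weather_df.merge(
    pollution_df, on=['date', 'region'], how='outer', validate='one_to_one')
```

Days missing from one source stay as `NaN` in its columns rather than being silently dropped, which keeps gaps in the environmental data visible before the linkage.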


### Healthcare records dataset

The minimal healthcare records dataset that I use contains, at the individual level, each patient's region of residence and the recorded (vasculitis) onset date.

In [3]:
healthcare_records = pd.DataFrame(dict(
    patient_id=range(1, 16),
    region=np.random.choice(['A', 'B'], 15),
    onset_date=np.random.choice(pd.date_range('2021-01-01', '2021-01-05'), 15))
)

healthcare_records.set_index('patient_id').head(10)  # show the first 10 of the 15 records

patient_id,region,onset_date
1,A,2021-01-01
2,B,2021-01-01
3,A,2021-01-02
4,B,2021-01-02
5,A,2021-01-01
6,A,2021-01-02
7,B,2021-01-04
8,B,2021-01-04
9,B,2021-01-03
10,A,2021-01-03


I then go from individual-level records to population-level records by aggregating by date and region, so that the data table I use looks like the following:

In [4]:
daily_cases = (healthcare_records
             .groupby(['onset_date', 'region'])
             .size()
             .rename('cases')
             .reset_index()
             .rename(columns={'onset_date': 'date'})
)
daily_cases

,date,region,cases
0,2021-01-01,A,2
1,2021-01-01,B,1
2,2021-01-02,A,2
3,2021-01-02,B,1
4,2021-01-03,A,2
5,2021-01-03,B,2
6,2021-01-04,A,1
7,2021-01-04,B,2
8,2021-01-05,A,1
9,2021-01-05,B,1
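
The same aggregation can be done at a coarser time unit. As a sketch only, with a small hypothetical `records` table spanning two weeks, `pd.Grouper` can bin the onset dates into weeks ending on Sunday (`freq='W-SUN'`) before counting; the column name `week_ending` is my own choice for illustration.

```python
import pandas as pd

# Hypothetical individual-level records spanning two calendar weeks
records = pd.DataFrame({
    'patient_id': [1, 2, 3, 4, 5],
    'region': ['A', 'A', 'B', 'A', 'B'],
    'onset_date': pd.to_datetime([
        '2021-01-04', '2021-01-06', '2021-01-08',
        '2021-01-12', '2021-01-13']),
})

# Bin onset dates into weeks ending on Sunday, then count per week and region
weekly_cases = (records
    .groupby([pd.Grouper(key='onset_date', freq='W-SUN'), 'region'])
    .size()
    .rename('cases')
    .reset_index()
    .rename(columns={'onset_date': 'week_ending'})
)
```

Each row is labelled with the Sunday that closes its week. Swapping the frequency alias for a monthly one gives monthly counts in the same way.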


### Linkage

The final linkage produces the table on which most of the analyses are run: the environmental observations and the daily incidence counts are merged into a single table on the `date` and `region` columns:

In [5]:
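# The left merge keeps every environmental observation, even on days without cases
(environment_df
 .merge(daily_cases, on=['date', 'region'], how='left')
 .fillna({'cases': 0})      # zero-fill only the counts, not missing measurements
 .astype({'cases': int})    # counts become floats when NaNs appear; restore integers
 .sort_values(['region', 'date'])
)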

,date,region,temperature,no2,fungal_species_1,bacterial_species_2,cases
0,2021-01-01,A,21.347166,4.98315,1087,828,2
2,2021-01-02,A,22.338724,4.416838,992,869,2
4,2021-01-03,A,16.065507,3.814912,894,850,2
6,2021-01-04,A,11.192028,5.777668,863,798,1
8,2021-01-05,A,15.194472,5.062128,1037,851,1
1,2021-01-01,B,17.176902,4.450865,965,874,1
3,2021-01-02,B,16.863509,6.446523,827,780,1
5,2021-01-03,B,24.239069,4.750215,966,718,2
7,2021-01-04,B,23.825289,6.599546,1073,796,2
9,2021-01-05,B,18.382841,3.336658,1128,832,1
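
A linkage like this can be sanity-checked with two options of `DataFrame.merge`: `indicator=True` flags which rows found a match, and `validate='one_to_one'` raises if either table repeats a `date`/`region` pair. The sketch below uses small hypothetical tables (`env`, `cases`) rather than the ones above, and is meant as an illustration of the checks, not as a fixed part of my pipeline.

```python
import pandas as pd

# Hypothetical environmental table: 2 days x 2 regions
env = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-01', '2021-01-01',
                            '2021-01-02', '2021-01-02']),
    'region': ['A', 'B', 'A', 'B'],
    'temperature': [21.3, 17.2, 22.3, 16.9],
})

# Hypothetical daily counts; region B has no cases on 2021-01-01
cases = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-02']),
    'region': ['A', 'A', 'B'],
    'cases': [2, 1, 1],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left merge
linked = env.merge(cases, on=['date', 'region'], how='left',
                   validate='one_to_one', indicator=True)
unmatched = linked[linked['_merge'] == 'left_only']

# A duplicated date/region pair in the counts makes the validation fail
duplicate_error = None
try:
    env.merge(pd.concat([cases, cases.iloc[[0]]]),
              on=['date', 'region'], how='left', validate='one_to_one')
except pd.errors.MergeError as err:
    duplicate_error = str(err)
```

Here `unmatched` holds the single date/region pair without cases, which is the row that would be zero-filled, and `duplicate_error` carries the message raised for the duplicated key.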
