---
title: "Registry Data usage (ESR6)"
subtitle: "How do we link health registry data to environmental exposures?"
categories: [WP1, python, data, registry, health, environment, exposure]
author:
  - name: "<b>ESR6</b>: Alejandro Fontal"
    orcid: 0000-0003-1689-0557
    email: alejandro.fontal@isglobal.org
format: 
    html: 
        number-sections: false
        warning: false 
        smooth-scroll: true 
        toc: true 
        toc-location: right
        code-tools:
            source: true
            toggle: true
            
comments:
  giscus: 
    repo: helical-itn/helical-itn.github.io
        
---

## Introduction

I will use this blog post to showcase the basic usage of registry data and its linkage to environmental data, as I typically do in my work as a member of HELICAL's Work Package 1, whose main objective is to help understand the relationship between environmental exposures and vasculitis onset.



I will walk through a simplified example of how I use healthcare registry data. I use individual-level data only as a basis to aggregate and obtain incidence counts per *spatial unit* (zip code, province, electoral district) and *time unit* (day, week, month), based on each patient's residence and date of onset or diagnosis.

To illustrate the linkage process, I will generate a toy environmental dataset and a toy healthcare records dataset, and perform the linkage as I usually would:

<details>
<summary>Show Python Imports</summary>

In [2]:
import numpy as np
import pandas as pd

</details>

## Environmental dataset

In general, I fetch different datasets of publicly available or self-generated daily observations of several environmental variables:

+ Weather
+ Pollution
+ Airborne biological diversity
+ Chemical composition (via LIDAR or in situ sampling)

A toy example would be the following table, spanning only 5 days for two different regions, A and B:

In [17]:
#| code-fold: true

environment_df = pd.DataFrame(dict(
    date=np.repeat(pd.date_range('2021-01-01', '2021-01-05'), 2),
    region=np.tile(['A', 'B'], 5),
    temperature=np.random.normal(20, 5, 10).round(2),
    no2=np.random.normal(5, 1, 10).round(2),
    fungal_species_1=np.random.normal(1000, 100, 10).astype(int),
    bacterial_species_2=np.random.normal(750, 75, 10).astype(int)))

environment_df.set_index('date')

| date       | region | temperature | no2  | fungal_species_1 | bacterial_species_2 |
|------------|--------|-------------|------|------------------|---------------------|
| 2021-01-01 | A      | 19.57       | 3.94 | 1010             | 787                 |
| 2021-01-01 | B      | 24.32       | 4.15 | 946              | 746                 |
| 2021-01-02 | A      | 11.28       | 6.17 | 1012             | 787                 |
| 2021-01-02 | B      | 15.43       | 5.19 | 1003             | 852                 |
| 2021-01-03 | A      | 20.92       | 4.01 | 944              | 736                 |
| 2021-01-03 | B      | 30.2        | 5.76 | 999              | 734                 |
| 2021-01-04 | A      | 18.71       | 4.37 | 971              | 737                 |
| 2021-01-04 | B      | 21.85       | 3.68 | 858              | 724                 |
| 2021-01-05 | A      | 17.3        | 4.76 | 1028             | 821                 |
| 2021-01-05 | B      | 15.03       | 5.57 | 1020             | 793                 |


## Healthcare records dataset

A minimal healthcare records dataset of the kind I use contains, at the individual level, each patient's region of residence and the recorded date of (vasculitis) onset.

In [4]:
#| code-fold: true

healthcare_records = pd.DataFrame(dict(
    patient_id=range(1, 16),
    region=np.random.choice(['A', 'B'], 15),
    onset_date=np.random.choice(pd.date_range('2021-01-01', '2021-01-05'), 15))
)

healthcare_records.set_index('patient_id')

| patient_id | region | onset_date |
|------------|--------|------------|
| 1          | A      | 2021-01-04 |
| 2          | A      | 2021-01-04 |
| 3          | B      | 2021-01-02 |
| 4          | B      | 2021-01-04 |
| 5          | A      | 2021-01-02 |
| 6          | B      | 2021-01-03 |
| 7          | A      | 2021-01-03 |
| 8          | B      | 2021-01-05 |
| 9          | B      | 2021-01-02 |
| 10         | B      | 2021-01-01 |


I then go from individual-level records to population-level counts by aggregating by date and region, so that the table I work with looks like the following:

In [18]:
#| code-fold: true

daily_cases = (healthcare_records
             .groupby(['onset_date', 'region'])
             .size()
             .rename('cases')
             .astype(int)
             .reset_index()
             .rename(columns={'onset_date': 'date'})
)
daily_cases.set_index('date')

| date       | region | cases |
|------------|--------|-------|
| 2021-01-01 | A      | 1     |
| 2021-01-01 | B      | 1     |
| 2021-01-02 | A      | 1     |
| 2021-01-02 | B      | 4     |
| 2021-01-03 | A      | 1     |
| 2021-01-03 | B      | 1     |
| 2021-01-04 | A      | 3     |
| 2021-01-04 | B      | 2     |
| 2021-01-05 | B      | 1     |
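
The same aggregation extends to coarser time units. The sketch below is a minimal illustration, assuming a hypothetical `records` table with made-up onset dates spanning two weeks; it counts cases per region and week using `pd.Grouper`. The `W-SUN` frequency (weeks ending on Sunday) and the `week_ending` column name are choices made for this example only.

```python
import pandas as pd

# Hypothetical individual-level records (illustrative values, not real patients).
records = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "region": ["A", "A", "B", "B", "B"],
    "onset_date": pd.to_datetime(
        ["2021-01-04", "2021-01-06", "2021-01-05", "2021-01-12", "2021-01-13"]
    ),
})

# Count cases per region and calendar week (weeks ending on Sunday).
weekly_cases = (
    records
    .groupby(["region", pd.Grouper(key="onset_date", freq="W-SUN")])
    .size()
    .rename("cases")
    .reset_index()
    .rename(columns={"onset_date": "week_ending"})
)

print(weekly_cases)
```

Whatever the time unit, every patient should be counted exactly once, so the weekly counts should add up to the number of individual records.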


## Linkage

The final linkage, which produces the table on which most analyses are run, merges the daily environmental observations and the daily incidence counts into a single table on the `date` and `region` columns. A left merge from the environmental table keeps every region-day, and region-days without recorded cases get a count of 0:

In [25]:
#| code-fold: true

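(environment_df
 # Left merge: keep every region-day of the environmental table
 .merge(daily_cases, on=['date', 'region'], how='left')
 # Region-days without records had no cases; fill only the count column
 # so that missing environmental values are never silently set to 0
 .fillna({'cases': 0})
 .assign(cases=lambda df: df.cases.astype(int))
 .sort_values(['region', 'date'])
 .set_index(['region', 'date'])
)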

| region | date       | temperature | no2  | fungal_species_1 | bacterial_species_2 | cases |
|--------|------------|-------------|------|------------------|---------------------|-------|
| A      | 2021-01-01 | 19.57       | 3.94 | 1010             | 787                 | 1     |
| A      | 2021-01-02 | 11.28       | 6.17 | 1012             | 787                 | 1     |
| A      | 2021-01-03 | 20.92       | 4.01 | 944              | 736                 | 1     |
| A      | 2021-01-04 | 18.71       | 4.37 | 971              | 737                 | 3     |
| A      | 2021-01-05 | 17.3        | 4.76 | 1028             | 821                 | 0     |
| B      | 2021-01-01 | 24.32       | 4.15 | 946              | 746                 | 1     |
| B      | 2021-01-02 | 15.43       | 5.19 | 1003             | 852                 | 4     |
| B      | 2021-01-03 | 30.2        | 5.76 | 999              | 734                 | 1     |
| B      | 2021-01-04 | 21.85       | 3.68 | 858              | 724                 | 2     |
| B      | 2021-01-05 | 15.03       | 5.57 | 1020             | 793                 | 1     |
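
Before analysing a linked table, it can help to check that the merge behaved as expected. The following is a sketch on hypothetical toy tables (`env` and `cases`, with illustrative values only): the `validate` argument of `DataFrame.merge` makes pandas raise an error if a region-day appears more than once in either table, and `indicator=True` adds a `_merge` column flagging which region-days had no matching case record.

```python
import pandas as pd

# Hypothetical toy tables (values are illustrative only).
env = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"]),
    "region": ["A", "B", "A", "B"],
    "temperature": [19.6, 24.3, 11.3, 15.4],
})
cases = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-02"]),
    "region": ["A", "A", "B"],
    "cases": [1, 1, 4],
})

linked = env.merge(
    cases,
    on=["date", "region"],
    how="left",
    validate="one_to_one",  # raises MergeError if keys are duplicated on either side
    indicator=True,         # adds a `_merge` column: 'both' or 'left_only'
)

# Region-days without a match had no recorded cases: fill only the count column.
linked["cases"] = linked["cases"].fillna(0).astype(int)

# Region-days that came only from the environmental table.
unmatched = linked.loc[linked["_merge"] == "left_only", ["date", "region"]]
print(unmatched)
```

Here the only unmatched region-day is region B on 2021-01-01, which ends up with a count of 0 while its temperature value is left untouched.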
