# Working with data: FHIRflat

This Jupyter notebook shows how to load a sample FHIRflat folder and do simple statistics and plots. You can view a live version of this notebook on Google Colab or MyBinder by clicking the 'Launch' button (rocket icon) in the top right corner.

```{note}
On Google Colab, you will need to install the polyflame package first.
You can use `pip` to install the package by typing into an empty code cell:

    !pip install git+https://github.com/globaldothealth/polyflame
```

First we import the necessary functions:

In [None]:
import pandas as pd
import polyflame.samples
from polyflame import load_taxonomy, plot, plot_unpacked, with_readable_terms
from polyflame.fhirflat import (
    use_source,
    list_parts,
    read_part,
    condition_proportion,
    condition_upset,
    age_pyramid
)

## Loading a source

Then we load a source using the <project:#polyflame.fhirflat.use_source> function. A checksum **must** be specified so that the integrity of the FHIRflat data can be verified, which keeps outputs reproducible.

In [None]:
source = use_source(polyflame.samples.fhirflat, checksum=polyflame.samples.checksum_fhirflat)
tx = load_taxonomy("fhirflat-isaric3")
source
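
The idea behind the checksum requirement can be sketched with the standard library. This is an illustration of the general technique, not PolyFLAME's implementation: the helper name `folder_checksum`, the choice of SHA-256, and the rule of hashing files in sorted order are all assumptions made for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def folder_checksum(folder: Path) -> str:
    """Hypothetical helper: SHA-256 over all files in a folder, in sorted order."""
    digest = hashlib.sha256()
    for path in sorted(p for p in folder.rglob("*") if p.is_file()):
        # Include the relative path so that renaming a file changes the checksum
        digest.update(path.relative_to(folder).as_posix().encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


# Demonstrate on a throwaway folder with made-up contents
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "patient.parquet").write_bytes(b"example bytes")
    expected = folder_checksum(folder)  # recorded when the data is published

    unchanged = folder_checksum(folder) == expected
    (folder / "patient.parquet").write_bytes(b"tampered bytes")
    modified = folder_checksum(folder) == expected

print(unchanged, modified)
```

Any change to the files yields a different digest, so an analysis that records the expected checksum fails loudly instead of silently running on altered data.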

A `source` is a Python dictionary with pre-specified keys that tells data processing and visualization functions where to get information from. Some source types, such as FHIRflat, also have *parts*, which can be read in separately. In the case of FHIRflat, parts correspond to [FHIR resources](https://hl7.org/fhir/resourcelist.html), with one Parquet file for each resource. A list of parts for a source can be obtained using the <project:#polyflame.fhirflat.list_parts> function:

In [None]:
list_parts(source)

We can read parts as a DataFrame using the <project:#polyflame.fhirflat.read_part> function:

In [None]:
read_part(source, "patient")

Columns in FHIRflat resource Parquet files are named after the nested FHIR attribute they hold, such as `extension.birthSex.code`. These dotted names can be cumbersome to work with, so `read_part()` accepts a mapping from original column names to shorter ones:

In [None]:
patient = read_part(
    source, "patient",
    {
        "extension.birthSex.code": "gender",
        "extension.age.value": "age",
        "extension.age.code": "age_unit",
        "id": "subject",
    }
)
patient
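
The effect of such a mapping resembles a plain pandas rename. The sketch below uses a made-up two-row table rather than the sample data, and it is only an assumption that `read_part()` behaves like `DataFrame.rename`; for instance, it does not show whether `read_part()` keeps or drops unmapped columns.

```python
import pandas as pd

# Made-up rows using FHIRflat-style dotted column names
raw = pd.DataFrame(
    {
        "id": ["p1", "p2"],
        "extension.age.value": [34, 61],
        "extension.age.code": ["a", "a"],
    }
)

column_map = {
    "extension.age.value": "age",
    "extension.age.code": "age_unit",
    "id": "subject",
}

# rename() returns a new DataFrame and leaves `raw` untouched
renamed = raw.rename(columns=column_map)
print(list(renamed.columns))
```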

This is more readable, but the field values are still numerical codes from standard terminologies such as SNOMED and LOINC. While codes are good for reproducibility and precision, readable names are easier to work with. The helper function <project:#polyflame.fhirflat.with_readable_terms> maps coded clinical terms to readable terms given a taxonomy file. A taxonomy is a TOML file containing these mappings, with one section for each type of variable:

```toml
[outcome]
"https://snomed.info/sct|371827001" = "alive"
"https://snomed.info/sct|32485007" = "censored"    # still hospitalised
"https://snomed.info/sct|306685000" = "censored"   # transferred
"https://snomed.info/sct|419099009" = "death"
"https://snomed.info/sct|306237005" = "censored"   # palliative care
"https://snomed.info/sct|225928004" = "discharged"

[gender]
"http://snomed.info/sct|248153007" = "male"
"http://snomed.info/sct|248152002" = "female"

[presenceAbsence]
"https://snomed.info/sct|373066003" = true
"https://snomed.info/sct|373067005" = false
```

PolyFLAME ships with a small taxonomy file to work with sample data. In actual use cases, you would have to provide this file yourself.

In [None]:
with_readable_terms(patient, tx, [{"term_column": "gender"}])

Most standard analyses, such as those described in the next section, shouldn't require you to perform these transformations yourself, as the FHIRflat adapter handles them. They are useful when you want to develop your own analyses using FHIRflat data.

## Analysis

Once we have a source, we can start looking at standard analyses, such as the proportion of patients having a particular condition:

In [None]:
plot(condition_proportion(source, tx))

Or an [UpSet](https://en.wikipedia.org/wiki/UpSet_plot) plot showing the top conditions and their co-occurrence:

In [None]:
plot(condition_upset(source))

We can also look at the age pyramid, grouped by outcome type:

In [None]:
plot(age_pyramid(source))

While the examples above use the standard FHIRflat analyses, the plotting functions can take any dataframe as input as long as it follows a particular *shape*. Here, we will use the `plot_unpacked()` function, which allows us to pass dataframes directly instead of expecting them as part of a dictionary like `plot()`. For example, to draw an UpSet plot of how often hypothetical movie genres intersect:

In [None]:
df = pd.DataFrame({'crime': [1, 0, 1], 'fantasy': [0, 1, 1], 'drama': [1, 0, 0]})
df

In [None]:
plot_unpacked(df, "upset")
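
To see what this input shape encodes: each row is one item, each column is a set, and a 1 marks membership. The sketch below counts rows per combination of genres, which is the quantity an UpSet plot draws as bars. It uses only pandas and is not how `plot_unpacked()` is implemented; the `" & "` label format is an arbitrary choice for this example.

```python
import pandas as pd

df = pd.DataFrame({"crime": [1, 0, 1], "fantasy": [0, 1, 1], "drama": [1, 0, 0]})

# Label each row with the set of genres it belongs to, then count the labels
memberships = df.astype(bool).apply(
    lambda row: " & ".join(col for col in df.columns if row[col]), axis=1
)
counts = memberships.value_counts().to_dict()
print(counts)
```

With these three rows, every combination occurs exactly once, so the plot shows three bars of equal height.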

Because `plot_unpacked()` is generic, PolyFLAME is easy to extend to other data source types, such as REDCap, or to your own source.

The [API reference](/api/fhirflat) contains the full list of analyses that this adapter supports.