# Using DuckDB to query FHIR Parquet 

One of the benefits of using Parquet format in [fhir-data-pipes](https://github.com/google/fhir-data-pipes)
is that there are many tools that can be used to query those files.
[Apache Spark](https://spark.apache.org/docs/latest/index.html) is one such tool for which
we have good support, including
[documentation](https://github.com/google/fhir-data-pipes/wiki/Analytics-on-a-single-machine-using-Docker),
[integration](https://github.com/google/fhir-data-pipes/blob/498d7c1b336bf77915fc39ca973b4231b594bf28/pipelines/controller/config/application.yaml#L124),
and sample [deployment](https://github.com/google/fhir-data-pipes/blob/master/docker/compose-controller-spark-sql-released.yaml)
[options](https://github.com/google/fhir-data-pipes/blob/master/docker/compose-controller-spark-sql.yaml).

[DuckDB](https://duckdb.org/) is another popular query engine that can be used for Parquet files.
The main appeal of DuckDB is its [simplicity](https://duckdb.org/why_duckdb); in particular, it is
very easy to "deploy". It can even be embedded into the
[pipeline controller](https://github.com/google/fhir-data-pipes/tree/master/pipelines/controller),
removing the need for a separate query engine process.

However, this also means that DuckDB is not horizontally scalable. One of the design principles of
`fhir-data-pipes` is that everything we do should be easy to deploy on a single machine
but also horizontally scalable. Another problem with DuckDB is that, even on a single machine, we have
found similar flattening queries to be significantly faster with Spark, as noted below.

This notebook serves as an example for addressing some of the challenging DuckDB SQL issues
that may arise when working with Parquet files for FHIR resources. This is similar to
[queries_and_views](https://github.com/google/fhir-data-pipes/blob/master/query/queries_and_views.ipynb)
but for DuckDB instead of Spark.

## Setup
As mentioned above, setting up DuckDB is very simple. Since we already have a Python
environment here, we simply add `duckdb` to `requirements.txt` and `import` it:

In [3]:
# Reference doc: https://duckdb.org/2021/06/25/querying-parquet.html

import duckdb
import glob

# some DuckDB setup 
con = duckdb.connect()
# enable automatic query parallelization
con.execute("PRAGMA threads=40")
# enable caching of parquet metadata
con.execute("PRAGMA enable_object_cache")

<duckdb.duckdb.DuckDBPyConnection at 0x7facb97ecab0>
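
The thread count of 40 used above is specific to the machine this notebook was run on. Below is a minimal sketch, assuming you would rather derive the value from the local CPU count; `pick_thread_count` and its `reserve` parameter are hypothetical names introduced here for illustration.

```python
import os


def pick_thread_count(reserve: int = 1) -> int:
    """Return a thread count for DuckDB, leaving `reserve` cores free."""
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, available - reserve)


threads = pick_thread_count()
# With the connection from the cell above (assumed to be named `con`):
# con.execute(f"PRAGMA threads={threads}")
```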

## Defining Parquet-based tables and simple queries
In the following examples, we use a synthetic dataset for ~80K Patients and ~17M Observations.
This dataset is generated using the [synthea-hiv](https://github.com/google/fhir-data-pipes/tree/master/synthea-hiv)
module and then transformed to Parquet using `fhir-data-pipes`. The output directory is then mounted
into our custom Jupyter [docker image](https://github.com/google/fhir-data-pipes/blob/master/query/Dockerfile):
```shell
docker run -p 10002:8888 -v "[PATH]/dwh:/dwh"
```
Inside `/dwh` we have a directory named `OUT_79370_patients_new` which has the usual structure of
our data-warehouse, i.e., one directory per resource type, e.g., `Patient`.
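
Before querying, it can help to check which resource-type directories exist and how many Parquet files each one holds. The sketch below assumes only the directory layout described above; `list_resource_dirs` is a hypothetical helper, not part of `fhir-data-pipes`.

```python
from pathlib import Path
from typing import Dict


def list_resource_dirs(dwh_root: str) -> Dict[str, int]:
    """Map each resource-type directory under `dwh_root` to its Parquet file count."""
    counts = {}
    for child in sorted(Path(dwh_root).iterdir()):
        if child.is_dir():
            counts[child.name] = len(list(child.glob("*.parquet")))
    return counts


# Example (path taken from this notebook's setup):
# list_resource_dirs("/dwh/OUT_79370_patients_new")
```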

In [4]:
con.execute("SELECT * FROM '/dwh/OUT_79370_patients_new/Patient/*.parquet' LIMIT 5").df()

Unnamed: 0,id,meta,implicitRules,language,text,contained,identifier,active,name,telecom,...,deceased,address,maritalStatus,multipleBirth,photo,contact,communication,generalPractitioner,managingOrganization,link
0,82328,"{'id': None, 'versionId': '1', 'lastUpdated': ...",,,"{'id': None, 'status': 'generated', 'div': None}",,"[{'id': None, 'use': None, 'type': None, 'syst...",,"[{'id': None, 'use': 'official', 'text': None,...","[{'id': None, 'system': 'phone', 'value': '555...",...,,"[{'id': None, 'use': None, 'type': None, 'text...","{'id': None, 'coding': [{'id': None, 'system':...","{'boolean': False, 'integer': None}",,,"[{'id': None, 'language': {'id': None, 'coding...",,,
1,135848,"{'id': None, 'versionId': '1', 'lastUpdated': ...",,,"{'id': None, 'status': 'generated', 'div': None}",,"[{'id': None, 'use': None, 'type': None, 'syst...",,"[{'id': None, 'use': 'official', 'text': None,...","[{'id': None, 'system': 'phone', 'value': '555...",...,,"[{'id': None, 'use': None, 'type': None, 'text...","{'id': None, 'coding': [{'id': None, 'system':...","{'boolean': False, 'integer': None}",,,"[{'id': None, 'language': {'id': None, 'coding...",,,
2,192168,"{'id': None, 'versionId': '1', 'lastUpdated': ...",,,"{'id': None, 'status': 'generated', 'div': None}",,"[{'id': None, 'use': None, 'type': None, 'syst...",,"[{'id': None, 'use': 'official', 'text': None,...","[{'id': None, 'system': 'phone', 'value': '555...",...,,"[{'id': None, 'use': None, 'type': None, 'text...","{'id': None, 'coding': [{'id': None, 'system':...","{'boolean': False, 'integer': None}",,,"[{'id': None, 'language': {'id': None, 'coding...",,,
3,197688,"{'id': None, 'versionId': '1', 'lastUpdated': ...",,,"{'id': None, 'status': 'generated', 'div': None}",,"[{'id': None, 'use': None, 'type': None, 'syst...",,"[{'id': None, 'use': 'official', 'text': None,...","[{'id': None, 'system': 'phone', 'value': '555...",...,,"[{'id': None, 'use': None, 'type': None, 'text...","{'id': None, 'coding': [{'id': None, 'system':...","{'boolean': False, 'integer': None}",,,"[{'id': None, 'language': {'id': None, 'coding...",,,
4,349288,"{'id': None, 'versionId': '1', 'lastUpdated': ...",,,"{'id': None, 'status': 'generated', 'div': None}",,"[{'id': None, 'use': None, 'type': None, 'syst...",,"[{'id': None, 'use': 'official', 'text': None,...","[{'id': None, 'system': 'phone', 'value': '555...",...,,"[{'id': None, 'use': None, 'type': None, 'text...","{'id': None, 'coding': [{'id': None, 'system':...","{'boolean': False, 'integer': None}",,,"[{'id': None, 'language': {'id': None, 'coding...",,,


In [5]:
con.execute("SELECT COUNT(*) FROM '/dwh/OUT_79370_patients_new/Patient/*.parquet'").df()

Unnamed: 0,count_star()
0,79370


In [6]:
con.execute("SELECT COUNT(*) FROM '/dwh/OUT_79370_patients_new/Observation/*.parquet'").df()

Unnamed: 0,count_star()
0,16928057


## Creating flat views
As with any other query engine, the main challenge in dealing with the highly nested and repeated
schema of FHIR resources is efficient "flattening". Let's work with
[Patient](https://hl7.org/fhir/patient.html) and its `name` field, which is a list of
[HumanName](https://hl7.org/fhir/datatypes.html#HumanName).

In [10]:
con.sql("""
SELECT P.id, P.name
FROM '/dwh/OUT_79370_patients_new/Patient/*.parquet' AS P
WHERE id = '999806'
; """).fetchall()

[('999806',
  [{'id': None,
    'use': 'official',
    'text': None,
    'family': 'Volkman526',
    'given': ['Ozell178'],
    'prefix': ['Mrs.'],
    'suffix': None,
    'period': None},
   {'id': None,
    'use': 'maiden',
    'text': None,
    'family': 'Nolan344',
    'given': ['Ozell178'],
    'prefix': ['Mrs.'],
    'suffix': None,
    'period': None}])]

In [58]:
# NOTE: This is not the right way for flattening; see next cells.
con.execute("""
SELECT P.id, PN.name.family AS family, PN.name.given
FROM '/dwh/OUT_79370_patients_new/Patient/*.parquet' AS P, unnest(P.name) AS PN
WHERE id = '999806' AND family='Nolan344'
; """).df()

Unnamed: 0,id,family,given
0,999806,Nolan344,[Ozell178]


### Performance issues and query plans

The above query does flatten the `name` field. However, it is not performant
because it does not do the `unnest` within the context of a row and instead does a
full JOIN; this can be seen both in the query plan and in its performance on larger
tables like `Observation`:

In [59]:
res = con.sql("""
EXPLAIN
SELECT P.id, PN.name.family AS family, PN.name.given
FROM '/dwh/OUT_79370_patients_new/Patient/*.parquet' AS P, unnest(P.name) AS PN
WHERE id = '999806' AND family='Nolan344'
; """).fetchall()
res[0][1].split('\n')

['┌───────────────────────────┐                                                          ',
 '│         PROJECTION        │                                                          ',
 '│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                                                          ',
 '│             id            │                                                          ',
 '│           family          │                                                          ',
 "│struct_extract(name, 'given│                                                          ",
 "│             ')            │                                                          ",
 '└─────────────┬─────────────┘                                                                                       ',
 '┌─────────────┴─────────────┐                                                          ',
 '│         PROJECTION        │                                                          ',
 '│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                    

There are multiple options for addressing this issue. Here we show two of them: one uses
DuckDB's [list functions](https://duckdb.org/docs/sql/functions/nested#list-functions)
and the other uses a [`LATERAL` join](https://duckdb.org/docs/sql/query_syntax/from.html#lateral-joins).

In [26]:
con.execute("""
SELECT P.id, (list_filter(P.name, x->x.family='Nolan344')[1]).family AS family
FROM '/dwh/OUT_79370_patients_new/Patient/*.parquet' AS P
WHERE id='999806' AND len(list_filter(P.name, x->x.family='Nolan344')) > 0
; """).df()

Unnamed: 0,id,family
0,999806,Nolan344


In [44]:
res = con.sql("""
EXPLAIN
SELECT P.id, (list_filter(P.name, x->x.family='Nolan344')[1]).family AS family
FROM '/dwh/OUT_79370_patients_new/Patient/*.parquet' AS P
WHERE id='999806' AND len(list_filter(P.name, x->x.family='Nolan344')) > 0
; """).fetchall()
res[0][1].split('\n')

['┌───────────────────────────┐',
 '│         PROJECTION        │',
 '│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │',
 '│             id            │',
 '│           family          │',
 '└─────────────┬─────────────┘                             ',
 '┌─────────────┴─────────────┐',
 '│           FILTER          │',
 '│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │',
 '│(len(list_filter(name)) > 0│',
 '│             )             │',
 '│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │',
 '│          EC: 3186         │',
 '└─────────────┬─────────────┘                             ',
 '┌─────────────┴─────────────┐',
 '│       PARQUET_SCAN        │',
 '│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │',
 '│             id            │',
 '│            name           │',
 '│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │',
 '│ Filters: id=999806 AND id │',
 '│         IS NOT NULL       │',
 '│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │',
 '│         EC: 15930         │',
 '└───────────────────────────┘                             ',
 '']

In [30]:
con.execute("""
SELECT P.id, PN.family
FROM '/dwh/OUT_79370_patients_new/Patient/*.parquet' AS P, LATERAL (SELECT unnest(P.name) AS PN)
WHERE id='999806'
; """).df()

Unnamed: 0,id,family
0,999806,Volkman526
1,999806,Nolan344


In [29]:
con.execute("""
SELECT P.id, PN.family
FROM '/dwh/OUT_79370_patients_new/Patient/*.parquet' AS P, LATERAL (SELECT unnest(P.name) AS PN)
WHERE id='999806' AND PN.family='Nolan344'
; """).df()

Unnamed: 0,id,family
0,999806,Nolan344


In [32]:
res = con.sql("""
EXPLAIN
SELECT P.id, PN.family
FROM '/dwh/OUT_79370_patients_new/Patient/*.parquet' AS P, LATERAL (SELECT unnest(P.name) AS PN)
WHERE id='999806' AND PN.family='Nolan344'
; """).fetchall()
res[0][1].split('\n')

['┌───────────────────────────┐',
 '│         PROJECTION        │',
 '│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │',
 '│             id            │',
 "│struct_extract(PN, 'family'│",
 '│             )             │',
 '└─────────────┬─────────────┘                             ',
 '┌─────────────┴─────────────┐',
 '│           FILTER          │',
 '│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │',
 "│   ((id = '999806') AND    │",
 "│(struct_extract(PN, 'f...  │",
 "│        'Nolan344'))       │",
 '│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │',
 '│         EC: 79650         │',
 '└─────────────┬─────────────┘                             ',
 '┌─────────────┴─────────────┐',
 '│         PROJECTION        │',
 '│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │',
 '│             #1            │',
 '│             PN            │',
 '└─────────────┬─────────────┘                             ',
 '┌─────────────┴─────────────┐',
 '│           UNNEST          │',
 '└─────────────┬─────────────┘                             ',
 '┌─────────────┴─────────────┐',


### Going with LATERAL joins
Given that the `LATERAL` approach is closer to what we use for Spark SQL (and BigQuery), we will
continue with that. Notice that both of the above approaches avoid the expensive
`HASH JOIN` step. This is crucial for creating performant flattening views. For example, the
following query on the Observation table takes about 25 minutes if we don't use `LATERAL` but
finishes in slightly over 3 minutes with `LATERAL`:

In [41]:
con.execute("""
SELECT COUNT(*) FROM(
SELECT
  O.id AS obs_id, OCC.system, OCC.code, O.status AS status,
  O.subject.PatientId AS patient_id, OVCC.code AS val_code
FROM '/dwh/OUT_79370_patients_new/Observation/*.parquet' AS O,
  LATERAL (SELECT unnest(O.code.coding) AS OCC),
  LATERAL (SELECT unnest(O.value.codeableConcept.coding) AS OVCC)
WHERE OCC.code LIKE '1255%'
  AND OVCC.code LIKE '1256%'
); """).df()

Unnamed: 0,count_star()
0,60023
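
One way to measure wall-clock timings like the ones quoted above is a small wrapper around `time.perf_counter`. This is only a sketch; `timed` is a hypothetical helper and is not necessarily how the numbers in this notebook were obtained.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run `fn` once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Usage with the connection and a SQL string from this notebook (names assumed):
# df, seconds = timed(lambda: con.execute(sql).df())
# print(f"query took {seconds / 60:.1f} minutes")
```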


In [43]:
res = con.sql("""
EXPLAIN
SELECT COUNT(*) FROM(
SELECT
  O.id AS obs_id, OCC.system, OCC.code, O.status AS status,
  O.subject.PatientId AS patient_id, OVCC.code AS val_code
FROM '/dwh/OUT_79370_patients_new/Observation/*.parquet' AS O,
  LATERAL (SELECT unnest(O.code.coding) AS OCC),
  LATERAL (SELECT unnest(O.value.codeableConcept.coding) AS OVCC)
WHERE OCC.code LIKE '1255%'
  AND OVCC.code LIKE '1256%'
); """).fetchall()
res[0][1].split('\n')

['┌───────────────────────────┐',
 '│    UNGROUPED_AGGREGATE    │',
 '│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │',
 '│        count_star()       │',
 '└─────────────┬─────────────┘                             ',
 '┌─────────────┴─────────────┐',
 '│         PROJECTION        │',
 '│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │',
 '│             42            │',
 '└─────────────┬─────────────┘                             ',
 '┌─────────────┴─────────────┐',
 '│           FILTER          │',
 '│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │',
 '│(prefix(struct_extract(OCC,│',
 "│    'code'), '1255') AND   │",
 '│ prefix(struct_extract...  │',
 "│     'code'), '1256'))     │",
 '│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │',
 '│        EC: 14550950       │',
 '└─────────────┬─────────────┘                             ',
 '┌─────────────┴─────────────┐',
 '│         PROJECTION        │',
 '│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │',
 '│            OCC            │',
 '│            OVCC           │',
 '└─────────────┬─────────────┘                             ',


### Comparison with Spark
As a final comment about performance, note that the above query is similar to some of the
Spark SQL queries we have experimented with in
[queries_and_views](https://github.com/google/fhir-data-pipes/blob/master/query/queries_and_views.ipynb).
**On Spark a similar query takes about 6 seconds**.
We have not found any way to rewrite the above query so that it runs anywhere close to 6 seconds in
DuckDB, unless we use the [read_parquet](https://duckdb.org/docs/guides/file_formats/query_parquet.html) function,
which basically loads the entire Parquet content into memory (and hence does not scale to large datasets).
This is another reason why we do not recommend DuckDB, even for a single-machine deployment.

Here is another example with a 10x bigger dataset, i.e., ~800K Patients and ~170M Observations. This
query takes over **23 minutes** with DuckDB, but only **20 seconds** with Spark (on the same machine).

In [45]:
con.execute("""
SELECT COUNT(*) FROM(
SELECT
  O.id AS obs_id, OCC.system, OCC.code, O.status AS status,
  O.subject.PatientId AS patient_id, OVCC.code AS val_code
FROM '/dwh/OUT_from-json_791562/Observation/*.parquet' AS O,
  LATERAL (SELECT unnest(O.code.coding) AS OCC),
  LATERAL (SELECT unnest(O.value.codeableConcept.coding) AS OVCC)
WHERE OCC.code LIKE '1255%'
  AND OVCC.code LIKE '1256%'
); """).df()

Unnamed: 0,count_star()
0,597553
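
To put the timings reported in this section side by side, the short calculation below turns them into slowdown ratios. The inputs are the rounded figures quoted in the text (the DuckDB ones are described as "slightly over" and "over"), so the results are rough lower bounds rather than measurements.

```python
# Timings quoted in this notebook, in seconds (rounded figures from the text).
timings = {
    "~80K patients": {"duckdb": 3 * 60, "spark": 6},
    "~800K patients": {"duckdb": 23 * 60, "spark": 20},
}

# Ratio of DuckDB time to Spark time for each dataset size.
ratios = {name: t["duckdb"] / t["spark"] for name, t in timings.items()}
for name, ratio in ratios.items():
    print(f"{name}: DuckDB is roughly {ratio:.0f}x slower than Spark")
```

With these inputs the ratios come out at about 30x for the smaller dataset and about 69x for the larger one.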


### CREATE VIEW
As with other data-warehouse options, our recommended way of writing
queries is to create flat views first. Here is one example, similar
to what we have done for Spark in
[queries_and_views](https://github.com/google/fhir-data-pipes/blob/master/query/queries_and_views.ipynb).
As shown there, it is simple to generate reports like `TX_CURR` of
[PEPFAR](https://www.state.gov/pepfar-fy-2023-mer-indicators/) from these flat views.

In [56]:
con.execute("""
CREATE OR REPLACE VIEW Observation_flat AS
      SELECT O.id AS obs_id, O.subject.patientId AS patient_id,
        O.encounter.encounterId as encounter_id,
        O.status, OCC.code, OCC.system AS code_sys,
        O.value.quantity.value AS val_quantity,
        OVCC.code AS val_code, OVCC.system AS val_sys,
        O.effective.dateTime AS obs_date,
        OCatC.system AS category_sys,
        OCatC.code AS category_code
      FROM '/dwh/OUT_79370_patients_new/Observation/*.parquet' AS O,
        LATERAL (SELECT unnest(O.code.coding) AS OCC),
        LATERAL (SELECT unnest(O.value.codeableConcept.coding) AS OVCC),
        LATERAL (SELECT unnest(O.category) AS OCat),
        LATERAL (SELECT unnest(OCat.coding) AS OCatC)
      ;
      """
)

<duckdb.duckdb.DuckDBPyConnection at 0x7facb97ecab0>

In [57]:
con.execute("""
SELECT * FROM Observation_flat LIMIT 5
;""").df()

Unnamed: 0,obs_id,patient_id,encounter_id,status,code,code_sys,val_quantity,val_code,val_sys,obs_date,category_sys,category_code
0,98181,97385,98121,final,1088AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,http://loinc.org,,70056AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,http://snomed.info/sct,2007-05-12T06:03:09+00:00,http://terminology.hl7.org/CodeSystem/observat...,survey
1,115110,112753,115091,final,1088AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,http://loinc.org,,104567AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,http://snomed.info/sct,2008-01-10T08:17:39+00:00,http://terminology.hl7.org/CodeSystem/observat...,survey
2,335187,333709,335168,final,159800AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,http://loinc.org,,140238AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,http://snomed.info/sct,2005-11-01T20:31:36+00:00,http://terminology.hl7.org/CodeSystem/observat...,survey
3,504477,501810,504408,final,1250AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,http://loinc.org,,630AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,http://snomed.info/sct,2010-10-02T07:01:41+00:00,http://terminology.hl7.org/CodeSystem/observat...,survey
4,419832,415143,419783,final,1111AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,http://loinc.org,,160093AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,http://snomed.info/sct,2004-05-27T17:02:59+00:00,http://terminology.hl7.org/CodeSystem/observat...,survey
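
The view above hard-codes the warehouse path. A possible sketch for reusing the same definition against another output directory follows; `observation_flat_view_sql` is a hypothetical helper, and `view_name` is assumed to be a trusted identifier since it is inserted into the SQL text without validation.

```python
def observation_flat_view_sql(dwh_root: str, view_name: str = "Observation_flat") -> str:
    """Build a CREATE VIEW statement like the one above for a given warehouse root."""
    # Escape single quotes so the path is a valid SQL string literal.
    parquet_glob = f"{dwh_root.rstrip('/')}/Observation/*.parquet".replace("'", "''")
    return f"""
CREATE OR REPLACE VIEW {view_name} AS
  SELECT O.id AS obs_id, O.subject.patientId AS patient_id,
    O.encounter.encounterId AS encounter_id,
    O.status, OCC.code, OCC.system AS code_sys,
    O.value.quantity.value AS val_quantity,
    OVCC.code AS val_code, OVCC.system AS val_sys,
    O.effective.dateTime AS obs_date,
    OCatC.system AS category_sys,
    OCatC.code AS category_code
  FROM '{parquet_glob}' AS O,
    LATERAL (SELECT unnest(O.code.coding) AS OCC),
    LATERAL (SELECT unnest(O.value.codeableConcept.coding) AS OVCC),
    LATERAL (SELECT unnest(O.category) AS OCat),
    LATERAL (SELECT unnest(OCat.coding) AS OCatC)
"""


# Usage with the larger dataset mentioned earlier (connection name assumed):
# con.execute(observation_flat_view_sql("/dwh/OUT_from-json_791562"))
```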
