# Queries and views for FHIR Data Pipes
This notebook includes examples of how to query the data-warehouse files
created by FHIR Data Pipes. The query engine is assumed to be Spark SQL (see
the section below), but the SQL queries should illustrate a generic pattern
that can be used with any query engine that supports SQL on Parquet files.

The main recommended pattern is to create simple, flat views out of the
complex nested/repeated schema that SQL-on-FHIR has. We first show how to do
this with SQL (the Spark flavor). At the end, we also show how to do this with
[FHIR-views](https://github.com/google/fhir-py/tree/main/google-fhir-views).
Note that the FHIR-views runner for Spark is still an experimental feature.

# Prerequisites

**Note**: All commands need to be run from the root directory of the
fhir-data-pipes repo. We are setting up the fhir-data-pipes controller and
a single-node Spark process. The input synthetic data is uploaded to a local
HAPI server.

1. Bring up a Hive ThriftServer and the pipeline controller (note that by
   default, the pipeline only fetches Patient, Encounter, Observation, and
   Condition resources. If you want other resources to be included as well,
   you can edit
   [this line](https://github.com/google/fhir-data-pipes/blob/055ecaa043bfaa9736d857ac20d142f67c67fa61/docker/config/application.yaml#L34)
   before running the controller):
    ```
    docker-compose  -f ./docker/compose-controller-spark-sql-single.yaml up --build  --force-recreate -d
    ```

2. Bring up a HAPI FHIR server:
    ```
    docker-compose  -f ./docker/hapi-compose.yml up --force-recreate -d
    ```

3. Load data into the HAPI FHIR server:
    ```
    python3 ./synthea-hiv/uploader/main.py HAPI http://localhost:8091/fhir --input_dir ./synthea-hiv/sample_data
    ```

4. In the FHIR Pipelines Control Panel page, http://localhost:8090, click on the
   “Run Full” button to convert the FHIR data to Parquet files with SQL-on-FHIR
   schema. These files will be stored under the `docker/dwh` directory.

# Installation

**NOTE: If you are running this notebook inside the corresponding docker
container, you can skip the installation section as the requirements are
already included in that docker image.**

It's recommended to set up a virtual environment before starting the notebook
kernel to avoid any dependency version issues with your native environment.

This can be done with the following commands in a terminal, if you have Conda
installed (you can also use `virtualenv` instead of Conda):

```
    conda create -n python310 python=3.10 -y
    conda activate python310
    conda install ipykernel -y
    ipython kernel install --user --name=python310
    pip install -r requirements.txt
```

# Environment setup
The cell below sets up a Spark client and creates a Spark view "runner",
which is used to apply declarative views of FHIR in Spark. Note that you may
need to adjust the Thrift Server address (`hive://`).
The value below assumes you are running this notebook in a docker container
on the same network as [this config](https://github.com/google/fhir-data-pipes/blob/3694f1394d0b9011ab480ca61f0bd0568bca2f53/docker/compose-controller-spark-sql-single.yaml#L71).
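
As a small illustration of the address adjustment mentioned above, the sketch
below picks between the two URLs that appear in the next cell. The helper name
`thrift_server_url` and its `inside_docker` flag are hypothetical and not part
of fhir-data-pipes; which URL is right depends on your own setup.

```python
# Both URLs are taken from the setup cell of this notebook.
DOCKER_URL = "hive://spark-thriftserver:10000/default"  # same docker network
HOST_URL = "hive://localhost:10001/default"  # commented-out alternative


def thrift_server_url(inside_docker: bool) -> str:
    """Returns the ThriftServer URL to use (hypothetical helper)."""
    return DOCKER_URL if inside_docker else HOST_URL


print(thrift_server_url(inside_docker=True))
```

The returned string can be passed to `engine.create_engine(...)` exactly as in
the cell that follows.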

In [1]:
import pandas

from sqlalchemy import dialects
from sqlalchemy import engine

dialects.registry.register("hive", "pyhive.sqlalchemy_hive", "HiveDialect")

# The endpoint of the Hive ThriftServer to connect to; you may need to
# adjust this if you are not running this through the default docker container.
#query_engine = engine.create_engine("hive://localhost:10001/default")
query_engine = engine.create_engine("hive://spark-thriftserver:10000/default")

The cell below loads the parquet files we've created into our Spark ThriftServer.

**Note 1**: These tables are automatically created if `createHiveResourceTables`
is set to `true` like
[here](https://github.com/google/fhir-data-pipes/blob/3694f1394d0b9011ab480ca61f0bd0568bca2f53/docker/config/application.yaml#L34).

**Note 2**: If you choose to run the following cell and recreate the tables,
you probably need to override `destination_directory_path` to point to the
specific path of the last pipeline run, e.g.,
`/dwh/controller_DWH_TIMESTAMP_2023_09_07T18_11_10_988888411Z`.
The default value below, i.e., `/dwh/controller_*`, assigns _all_ Parquet files
in _all_ of such directories into a _single_ table for each resource.

**Note 3**: The full table names also include a database name, which is
`default` by default; so instead of `default.Patient` we simply use `Patient`.
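
For Note 2, one way to pick the path of the last pipeline run is sketched
below. It assumes that every run directory follows the
`controller_DWH_TIMESTAMP_...` naming shown above and that the timestamp
fields are zero-padded, so that lexical order matches chronological order.
The helper `latest_run_dir` and the second directory name are made up for
illustration.

```python
def latest_run_dir(dir_names):
    """Returns the most recent pipeline run directory name (hypothetical helper)."""
    prefix = "controller_DWH_TIMESTAMP_"
    runs = [name for name in dir_names if name.startswith(prefix)]
    if not runs:
        raise ValueError("no pipeline run directories found")
    # Assumption: zero-padded timestamps, so the largest string is the latest run.
    return max(runs)


names = [
    "controller_DWH_TIMESTAMP_2023_09_07T18_11_10_988888411Z",
    "controller_DWH_TIMESTAMP_2023_08_30T09_05_02_123456789Z",  # made-up example
]
destination_directory_path = "/dwh/" + latest_run_dir(names)
print(destination_directory_path)
```

In practice the list of names would come from listing the DWH root directory,
for example with `os.listdir`.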

In [2]:
# Please read the note above before recreating resource tables!

destination_directory_path = "/dwh/controller_*"
with query_engine.connect() as con:
    # The following lines are commented out to prevent unintentional deletion.
    # con.execute(f"DROP TABLE IF EXISTS Encounter;")
    # con.execute(f"DROP TABLE IF EXISTS Observation;")
    # con.execute(f"DROP TABLE IF EXISTS Patient;")
    con.execute(
        f"CREATE TABLE IF NOT EXISTS Encounter USING"
        f" PARQUET LOCATION '{destination_directory_path}/Encounter/*.parquet';"
    )
    con.execute(
        f"CREATE TABLE IF NOT EXISTS Observation USING PARQUET"
        f" LOCATION '{destination_directory_path}/Observation/*.parquet';"
    )
    con.execute(
        f"CREATE TABLE IF NOT EXISTS Patient USING PARQUET"
        f" LOCATION '{destination_directory_path}/Patient/*.parquet';"
    )

# Explore the data with SQL
At this point you should be able to run any SQL query against the resource
tables. The following subsections show some sample queries. If you want, you can
skip directly to the [FHIR-views section](#FHIR-Views).

In [3]:
pandas.read_sql_query(
    sql="SELECT COUNT(*) FROM Patient",
    con=query_engine,
)

Unnamed: 0,count(1)
0,79


In [4]:
pandas.read_sql_query(
    sql=("SELECT COUNT(*) FROM Observation AS O"
      " WHERE O.effective.DateTime > '2010-01-01'"),
    con=query_engine,
)

Unnamed: 0,count(1)
0,5680


## Exploring Observation codes
The [SQL-on-FHIR schema](https://github.com/FHIR/sql-on-fhir/blob/master/sql-on-fhir.md)
resembles the JSON structure of FHIR resources and hence has many nested and
repeated structures. It usually makes sense to flatten the columns we need to
deal with. This flattening process varies depending on the SQL dialect.
Here is an example for Observation codes.

_Note_: The actual codes and systems should be ignored as this is a
synthetic dataset that has not gone through concept mapping.

In [5]:
pandas.read_sql_query(
    sql="""
      SELECT OCC.`system` AS code_sys, OCC.code, OCC.display,
        COUNT(*) AS num_obs, AVG(O.value.quantity.value) AS avg_val
      FROM Observation AS O LATERAL VIEW explode(code.coding) AS OCC
      GROUP BY OCC.`system`, OCC.code, OCC.display
      ORDER BY num_obs DESC
      LIMIT 20; """,
    con=query_engine,
)

Unnamed: 0,code_sys,code,display,num_obs,avg_val
0,http://loinc.org,1271AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,TESTS ORDERED,4949,
1,http://loinc.org,1088AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,CURRENT ANTIRETROVIRAL DRUGS USED FOR TREATMENT,2387,
2,http://loinc.org,1111AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,PATIENT REPORTED CURRENT TB TREATMENT,2043,
3,http://loinc.org,1250AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,ANTIRETROVIRALS STARTED,1805,
4,http://loinc.org,1270AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,TUBERCULOSIS TREATMENT STARTED,1121,
5,http://loinc.org,159800AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,REVIEW OF TUBERCULOSIS SCREENING QUESTIONS,738,
6,http://loinc.org,5085AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,LOINC Code,530,163.291043
7,http://loinc.org,159911AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,PATIENT REPORTED CURRENT ANTIRETROVIRAL TREATMENT,408,
8,http://loinc.org,1261AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,PCP PROPHYLAXIS PLAN,340,
9,http://loinc.org,1265AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,TUBERCULOSIS PROPHYLAXIS PLAN,340,


## Indicator example
The following query counts the number of patients that have had an observation
with a specific code (HIV viral load), with a value below a certain threshold
(400000), during a specific reporting period (year 2010 in this example).
This is a useful pattern in many cases, e.g., calculating TX_CURR or TX_PVLS
indicators of [PEPFAR](https://www.state.gov/pepfar-fy-2023-mer-indicators/)
(for TX_CURR we need to look at `O.value.codeableConcept.coding`).

In [6]:
pandas.read_sql_query(
    sql="""
      SELECT COUNT(DISTINCT O.subject.PatientId) AS num_patients
      FROM Observation AS O LATERAL VIEW explode(code.coding) AS OCC
      WHERE OCC.code LIKE '856%%'
        AND OCC.`system` = 'http://loinc.org'
        AND O.value.quantity.value < 400000
        AND YEAR(O.effective.dateTime) = 2010; """,
    con=query_engine,
)

Unnamed: 0,num_patients
0,5
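
A note on the doubled percent sign in `LIKE '856%%'` above: our understanding
(an assumption about the driver stack, not something verified here) is that
the query string goes through Python %-style parameter formatting before it
reaches Spark, so a literal `%` has to be written as `%%`. The sketch below
only demonstrates the plain Python formatting behavior this relies on.

```python
# The clause as written in the notebook cell above.
clause = "OCC.code LIKE '856%%'"

# %-style formatting with no arguments collapses each '%%' into a single '%'.
formatted = clause % ()
print(formatted)
```

If a query is sent through a path that does not apply this formatting, a
single `%` would be the right spelling instead.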


## Creating flat views
It is usually helpful to create flat views and build other queries
on top of those views. Here is an example for Observation:

In [7]:
with query_engine.connect() as con:
    con.execute("""
      CREATE OR REPLACE VIEW flat_observation AS
      SELECT O.id AS obs_id, O.subject.PatientId AS patient_id,
        OCC.`system` AS code_sys, OCC.code,
        O.value.quantity.value AS val_quantity,
        OVCC.code AS val_code, OVCC.`system` AS val_sys,
        O.effective.dateTime AS obs_date
      FROM Observation AS O LATERAL VIEW OUTER explode(code.coding) AS OCC
        LATERAL VIEW OUTER explode(O.value.codeableConcept.coding) AS OVCC
      ;
      """
    )

pandas.read_sql_query(
    sql="SELECT * FROM flat_observation LIMIT 5;",
    con=query_engine,
)

Unnamed: 0,obs_id,patient_id,code_sys,code,val_quantity,val_code,val_sys,obs_date
0,10030,9040,http://loinc.org,159800AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,,140238AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,http://snomed.info/sct,2009-12-19T03:22:24+00:00
1,10063,9040,http://loinc.org,5088AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,36.921,,,2011-03-21T03:22:24+00:00
2,10075,9040,http://loinc.org,1271AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,,1107AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,http://snomed.info/sct,2011-03-21T03:22:24+00:00
3,10083,9040,http://loinc.org,1271AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,,305AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,http://snomed.info/sct,2011-03-21T03:22:24+00:00
4,10197,9040,http://loinc.org,159911AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,,1652AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,http://snomed.info/sct,2013-04-30T03:22:24+00:00


Repeating the same viral-load indicator query with the new view is much more readable:

In [8]:
pandas.read_sql_query(
    sql="""SELECT COUNT(DISTINCT patient_id) AS num_patients
      FROM flat_observation
      WHERE code LIKE '856%%'
        AND code_sys = 'http://loinc.org'
        AND val_quantity < 400000
        AND YEAR(obs_date) = 2010
      LIMIT 100; """,
    con=query_engine,
)

Unnamed: 0,num_patients
0,5


# Sample flat views
In this section we create sample flat views for common resources. Flat views
are usually lossy transformations, i.e., we drop some fields of the original
FHIR resource and flatten those we need. So each subsection has a short
description of how each flat view is created. These are provided as examples
to be used as the basis of other purpose-built flat views.
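
Since all the views in this section share the same shape, a purpose-built view
can also be assembled programmatically. The helper below is a hypothetical
sketch (the name `flat_view_sql` and the view name `Observation_codes` are not
part of the repo). It interpolates identifiers without any escaping, so it
should only be used with trusted, hand-written inputs.

```python
def flat_view_sql(view, table, alias, columns, explodes):
    """Builds a CREATE VIEW statement in the style used in this notebook.

    columns: list of (expression, column_name) pairs.
    explodes: list of (array_expression, alias) pairs, applied in order.
    """
    select = ",\n  ".join(f"{expr} AS {name}" for expr, name in columns)
    laterals = "\n".join(
        f"  LATERAL VIEW OUTER explode({expr}) AS {a}" for expr, a in explodes
    )
    return (
        f"CREATE OR REPLACE VIEW {view} AS\n"
        f"SELECT {select}\n"
        f"FROM {table} AS {alias}\n"
        f"{laterals}"
    )


sql = flat_view_sql(
    view="Observation_codes",
    table="Observation",
    alias="O",
    columns=[("O.id", "obs_id"), ("OCC.code", "code"), ("OCC.`system`", "code_sys")],
    explodes=[("O.code.coding", "OCC")],
)
print(sql)
```

The resulting string can be passed to `con.execute(...)` in the same way as
the hand-written statements below.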

## Observation
We already saw an example flat view for the Observation resource. Here we expand
that example by including a few more fields, e.g., `status` and `category`. We
also add `encounter_id` as it is usually useful for joining Observation and
Encounter tables/views.

The main idea of "flattening" is to "explode" each row based on its repeated
fields to eliminate the array. For example, `Observation.code` has a `coding`
array. So if an Observation has 3 `code.coding` elements, conceptually, the
`LATERAL VIEW OUTER explode(code.coding)` expression below creates 3 copies
of that row, each of which has one of those `code.coding` elements. When
`explode` is done on multiple fields, it is conceptually like the Cartesian
product of those arrays. In practical scenarios, there are usually also
constraints in the `WHERE` clause to limit the number of rows in the view to
those relevant to the use-case (for example only pick rows with certain codes).
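
To make the "explode" idea concrete, here is a plain-Python sketch of what
`LATERAL VIEW OUTER explode` does conceptually to a single row. It is not how
Spark implements it, and the field names `code_coding` and `value_coding` are
simplified stand-ins for the nested columns. It also does not model chained
explodes where a later one refers to the alias of an earlier one.

```python
from itertools import product


def explode_outer(row, array_fields):
    """Mimics `LATERAL VIEW OUTER explode` on several array fields of one row."""
    # OUTER keeps the row when an array is empty or missing, filling in None.
    choices = [row.get(field) or [None] for field in array_fields]
    flat_rows = []
    for combo in product(*choices):
        flat = {k: v for k, v in row.items() if k not in array_fields}
        flat.update(zip(array_fields, combo))
        flat_rows.append(flat)
    return flat_rows


obs = {"id": "obs-1", "code_coding": ["c1", "c2", "c3"], "value_coding": []}
rows = explode_outer(obs, ["code_coding", "value_coding"])
for r in rows:
    print(r)
```

The single input row becomes three rows, one per `code_coding` element, and
the empty `value_coding` array shows up as `None` instead of dropping the row.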

In [9]:
with query_engine.connect() as con:
    con.execute("""
      CREATE OR REPLACE VIEW Observation_flat AS
      SELECT O.id AS obs_id, O.subject.patientId AS patient_id,
        O.encounter.encounterId as encounter_id,
        O.status, OCC.code, OCC.`system` AS code_sys,
        O.value.quantity.value AS val_quantity,
        OVCC.code AS val_code, OVCC.`system` AS val_sys,
        O.effective.dateTime AS obs_date,
        OCatC.`system` AS category_sys,
        OCatC.code AS category_code
      FROM Observation AS O LATERAL VIEW OUTER explode(code.coding) AS OCC
        LATERAL VIEW OUTER explode(O.value.codeableConcept.coding) AS OVCC
        LATERAL VIEW OUTER explode(O.category) AS OCat
        LATERAL VIEW OUTER explode(OCat.coding) AS OCatC
      ;
      """
    )

pandas.read_sql_query(
    sql="SELECT * FROM Observation_flat LIMIT 5;",
    con=query_engine,
)

Unnamed: 0,obs_id,patient_id,encounter_id,status,code,code_sys,val_quantity,val_code,val_sys,obs_date,category_sys,category_code
0,10030,9040,10021,final,159800AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,http://loinc.org,,140238AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,http://snomed.info/sct,2009-12-19T03:22:24+00:00,http://terminology.hl7.org/CodeSystem/observat...,survey
1,10063,9040,10062,final,5088AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,http://loinc.org,36.921,,,2011-03-21T03:22:24+00:00,http://terminology.hl7.org/CodeSystem/observat...,survey
2,10075,9040,10062,final,1271AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,http://loinc.org,,1107AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,http://snomed.info/sct,2011-03-21T03:22:24+00:00,http://terminology.hl7.org/CodeSystem/observat...,survey
3,10083,9040,10062,final,1271AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,http://loinc.org,,305AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,http://snomed.info/sct,2011-03-21T03:22:24+00:00,http://terminology.hl7.org/CodeSystem/observat...,survey
4,10197,9040,10166,final,159911AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,http://loinc.org,,1652AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,http://snomed.info/sct,2013-04-30T03:22:24+00:00,http://terminology.hl7.org/CodeSystem/observat...,survey


In [10]:
with pandas.option_context('display.max_colwidth', None):
    print(pandas.read_sql_query(
        sql="SELECT category_sys FROM Observation_flat LIMIT 5;",
        con=query_engine,
    ))

                                                 category_sys
0  http://terminology.hl7.org/CodeSystem/observation-category
1  http://terminology.hl7.org/CodeSystem/observation-category
2  http://terminology.hl7.org/CodeSystem/observation-category
3  http://terminology.hl7.org/CodeSystem/observation-category
4  http://terminology.hl7.org/CodeSystem/observation-category


As mentioned above, because of the explode effect, the number of rows in the
view can be much larger than the original number of resources (for example if
there are multiple codes for an Observation). So care should be taken with
duplicated rows. But that's not the case for our synthetic observations and
the above view:

In [11]:
pandas.read_sql_query(
    sql="SELECT COUNT(*) FROM Observation_flat;",
    con=query_engine,
)

Unnamed: 0,count(1)
0,17279


In [12]:
pandas.read_sql_query(
    sql="SELECT COUNT(DISTINCT obs_id) FROM Observation_flat;",
    con=query_engine,
)

Unnamed: 0,count(DISTINCT obs_id)
0,17279


## Patient
We follow the same pattern for Patient resources. Here we are picking some
of the commonly used fields and some of the reference IDs, e.g.,
`generalPractitioner` and `managingOrganization`. Note that in this example
we also have a computed field `age` which is derived from `birthDate`. This is
only provided as an example as the logic is incomplete without taking into
account the `deceased` field:

In [13]:
with query_engine.connect() as con:
    con.execute("""
      CREATE OR REPLACE VIEW Patient_flat AS
      SELECT P.id AS pat_id, P.active, PN.family, PNG AS given, P.gender,
        P.deceased.Boolean AS deceased,
        YEAR(current_date()) - YEAR(P.birthDate) AS age,
        PA.country, PG.practitionerId AS practitioner_id,
        P.managingOrganization.organizationId AS organization_id
      FROM Patient AS P LATERAL VIEW OUTER explode(name) AS PN
        LATERAL VIEW OUTER explode(PN.given) AS PNG
        LATERAL VIEW OUTER explode(P.address) AS PA
        LATERAL VIEW OUTER explode(P.generalPractitioner) AS PG
      ;
      """
    )

pandas.read_sql_query(
    sql="SELECT * FROM Patient_flat LIMIT 5;",
    con=query_engine,
)

Unnamed: 0,pat_id,active,family,given,gender,deceased,age,country,practitioner_id,organization_id
0,27113,,Rogahn59,Heriberto162,male,,61,US,,
1,38462,,Weissnat378,Lanie389,female,,62,US,,
2,39570,,Gutmann970,Gregg522,male,,108,US,,
3,9040,,Turner526,Isaura563,female,,66,US,,
4,9040,,Vandervort697,Isaura563,female,,66,US,,
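
The `age` column in the view above is only an approximation: it ignores
whether the birthday has already passed this year and it keeps growing for
deceased patients. A minimal sketch of a more careful calculation is shown
below. The function name and the sample dates are made up, and it assumes a
date of death is available, whereas the view above only selects the boolean
variant of `deceased`.

```python
from datetime import date


def age_in_years(birth_date, as_of, deceased_date=None):
    """Age in completed years; stops counting at `deceased_date` if given."""
    end = min(as_of, deceased_date) if deceased_date else as_of
    years = end.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred in the end year.
    if (end.month, end.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


print(age_in_years(date(1960, 6, 15), as_of=date(2023, 3, 1)))
print(
    age_in_years(
        date(1960, 6, 15), as_of=date(2023, 3, 1), deceased_date=date(2000, 7, 1)
    )
)
```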


In [14]:
pandas.read_sql_query(
    sql="SELECT COUNT(*) FROM Patient_flat;",
    con=query_engine,
)

Unnamed: 0,count(1)
0,106


In [15]:
pandas.read_sql_query(
    sql="SELECT COUNT(DISTINCT pat_id) FROM Patient_flat;",
    con=query_engine,
)

Unnamed: 0,count(DISTINCT pat_id)
0,79
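
Unlike the Observation view, the two counts here differ: 106 rows for 79
distinct patients. The sample output above already hints at the reason, since
patient `9040` appears once per family name. The sketch below uses the
`(pat_id, family)` pairs from that sample output to show one simple way to
find which IDs are repeated.

```python
from collections import Counter

# (pat_id, family) pairs copied from the Patient_flat sample output above.
rows = [
    ("27113", "Rogahn59"),
    ("38462", "Weissnat378"),
    ("39570", "Gutmann970"),
    ("9040", "Turner526"),
    ("9040", "Vandervort697"),
]

counts = Counter(pat_id for pat_id, _ in rows)
repeated = {pat_id: n for pat_id, n in counts.items() if n > 1}
print(len(rows), len(counts), repeated)
```

Queries that count patients on top of such a view should therefore use
`COUNT(DISTINCT pat_id)`, as in the cell above.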


## Encounter

In [16]:
with query_engine.connect() as con:
    con.execute("""
      CREATE OR REPLACE VIEW Encounter_flat AS
      SELECT E.id AS enc_id, E.status, ETC.system AS type_sys,
        ETC.code AS type_code, E.subject.PatientId AS patient_id,
        EP.individual.practitionerId AS practitioner_id,
        EL.location.locationId AS location_id,
        E.serviceProvider.organizationId AS service_org_id,
        E.period.start, E.period.end, E.episodeOfCare.EpisodeOfCareId
      FROM Encounter AS E LATERAL VIEW OUTER explode(type) AS ET
        LATERAL VIEW OUTER explode(ET.coding) AS ETC
        LATERAL VIEW OUTER explode(E.participant) AS EP
        LATERAL VIEW OUTER explode(E.location) AS EL
      ;
      """
    )

pandas.read_sql_query(
    sql="SELECT * FROM Encounter_flat LIMIT 5;",
    con=query_engine,
)

Unnamed: 0,enc_id,status,type_sys,type_code,patient_id,practitioner_id,location_id,service_org_id,start,end,EpisodeOfCareId
0,10357,finished,http://snomed.info/sct,162673000,9040,2791,410,409,2019-01-03T03:22:24+00:00,2019-01-03T03:37:24+00:00,
1,10372,finished,http://snomed.info/sct,162673000,9040,2791,410,409,2021-01-14T03:22:24+00:00,2021-01-14T03:37:24+00:00,
2,10523,finished,http://snomed.info/sct,410620009,10392,3445,1064,1063,1963-12-17T03:03:25+00:00,1963-12-17T03:18:25+00:00,
3,10571,finished,http://snomed.info/sct,162673000,10392,3445,1064,1063,1980-01-22T03:03:25+00:00,1980-01-22T03:18:25+00:00,
4,11846,finished,http://snomed.info/sct,162673000,10669,4409,2028,2027,2017-12-29T20:07:11+00:00,2017-12-29T20:22:11+00:00,


## Condition

In [17]:
with query_engine.connect() as con:
    con.execute("""
      CREATE OR REPLACE VIEW Condition_flat AS
      SELECT C.id AS cond_id, C.subject.patientId AS patient_id,
        C.encounter.encounterId AS encounter_id, CCC.system, CCC.code,
        CClC.code AS clinical_status, CVC.code AS verification_status,
        C.onset.DateTime AS onset_datetime
      FROM Condition AS C LATERAL VIEW OUTER explode(C.code.coding) AS CCC
        LATERAL VIEW OUTER explode(C.category) AS CCat
        LATERAL VIEW OUTER explode(CCat.coding) AS CCatC
        LATERAL VIEW OUTER explode(C.clinicalStatus.coding) AS CClC
        LATERAL VIEW OUTER explode(C.verificationStatus.coding) AS CVC
      ;
      """
    )

pandas.read_sql_query(
    sql="SELECT * FROM Condition_flat LIMIT 5;",
    con=query_engine,
)

Unnamed: 0,cond_id,patient_id,encounter_id,system,code,clinical_status,verification_status,onset_datetime
0,17639,16947,17637,http://snomed.info/sct,112141AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,active,confirmed,2012-09-06T04:45:23+00:00
1,20074,19393,20073,http://snomed.info/sct,112141AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,active,confirmed,1992-03-15T02:48:07+00:00
2,23763,23396,23762,http://snomed.info/sct,112141AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,active,confirmed,1996-11-06T16:21:43+00:00
3,28638,26670,28636,http://snomed.info/sct,230690007,active,confirmed,2001-05-16T19:45:21+00:00
4,5485,4766,5480,http://snomed.info/sct,112141AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,active,confirmed,1995-02-03T06:21:13+00:00


## DiagnosticReport

In [18]:
with query_engine.connect() as con:
    con.execute("""
      CREATE OR REPLACE VIEW DiagnosticReport_flat AS
      SELECT D.id AS dr_id, D.subject.patientId AS patient_id,
        D.encounter.EncounterId AS encounter_id,
        DCC.system, DCC.code, DR.observationId AS result_obs_id,
        D.status, DP.PractitionerId AS practitioner_id,
        DCatC.system AS category_sys, DCatC.code AS category_code,
        DConC.system AS conclusion_sys, DConC.code AS conclusion_code,
        D.conclusion
      FROM DiagnosticReport AS D LATERAL VIEW OUTER explode(D.result) AS DR
        LATERAL VIEW OUTER explode(D.code.coding) AS DCC
        LATERAL VIEW OUTER explode(D.performer) AS DP
        LATERAL VIEW OUTER explode(D.category) AS DCat
        LATERAL VIEW OUTER explode(DCat.coding) AS DCatC
        LATERAL VIEW OUTER explode(D.conclusionCode) AS DCon
        LATERAL VIEW OUTER explode(DCon.coding) AS DConC
      ;
      """
    )

pandas.read_sql_query(
    sql="SELECT * FROM DiagnosticReport_flat LIMIT 5;",
    con=query_engine,
)

Unnamed: 0,dr_id,patient_id,encounter_id,system,code,result_obs_id,status,practitioner_id,category_sys,category_code,conclusion_sys,conclusion_code,conclusion
0,11216,11149,11207,http://loinc.org,34117-2,,final,4285,http://loinc.org,34117-2,,,
1,11216,11149,11207,http://loinc.org,34117-2,,final,4285,http://loinc.org,51847-2,,,
2,11216,11149,11207,http://loinc.org,51847-2,,final,4285,http://loinc.org,34117-2,,,
3,11216,11149,11207,http://loinc.org,51847-2,,final,4285,http://loinc.org,51847-2,,,
4,11633,10669,11625,http://loinc.org,34117-2,,final,4409,http://loinc.org,34117-2,,,


## Immunization

In [19]:
with query_engine.connect() as con:
    con.execute("""
      CREATE OR REPLACE VIEW Immunization_flat AS
      SELECT I.id AS imm_id, I.patient.patientId AS patient_id,
        I.encounter.encounterId AS encounter_id, I.status,
        ISC.system AS statusReason_sys, ISC.code AS statusReason_code,
        IVC.system AS vaccine_sys, IVC.code AS vaccine_code,
        I.occurrence.DateTime, I.location.LocationId AS location_id,
        IP.actor.PractitionerId, IP.actor.OrganizationId
      FROM Immunization AS I
        LATERAL VIEW OUTER explode(I.statusReason.coding) AS ISC
        LATERAL VIEW OUTER explode(I.vaccineCode.coding) AS IVC
        LATERAL VIEW OUTER explode(I.performer) AS IP
      ;
      """
    )

pandas.read_sql_query(
    sql="SELECT * FROM Immunization_flat LIMIT 5;",
    con=query_engine,
)

Unnamed: 0,imm_id,patient_id,encounter_id,status,statusReason_sys,statusReason_code,vaccine_sys,vaccine_code,DateTime,location_id,PractitionerId,OrganizationId
0,11638,10669,11637,completed,,,http://hl7.org/fhir/sid/cvx,140,2005-10-21T20:07:11+00:00,2028,,
1,12880,12348,12879,completed,,,http://hl7.org/fhir/sid/cvx,140,1994-05-17T20:51:22+00:00,506,,
2,13136,12988,13135,completed,,,http://hl7.org/fhir/sid/cvx,140,1972-04-04T08:42:20+00:00,736,,
3,13800,13173,13799,completed,,,http://hl7.org/fhir/sid/cvx,140,1997-01-19T06:45:44+00:00,1344,,
4,14158,13173,14157,completed,,,http://hl7.org/fhir/sid/cvx,140,2012-04-15T06:45:44+00:00,1344,,


## Location

In [20]:
with query_engine.connect() as con:
    con.execute("""
      CREATE OR REPLACE VIEW Location_flat AS
      SELECT L.id AS loc_id, L.status, L.name, L.address.city,
        L.address.country, L.managingOrganization.organizationId AS org_id,
        L.position.longitude, L.position.latitude, L.position.altitude
      FROM Location AS L
      ;
      """
    )

pandas.read_sql_query(
    sql="SELECT * FROM Location_flat LIMIT 5;",
    con=query_engine,
)

Unnamed: 0,loc_id,status,name,city,country,org_id,longitude,latitude,altitude
0,1048,active,PCP83180,EVERETT,US,,-71.054649,42.405938,
1,1064,active,PHYSICAL THERAPY AND FITNESS CENTER OF RAYNHAM...,RAYNHAM,US,,-71.046214,41.930477,
2,112,active,PCP77,PITTSFIELD,US,,-73.260685,42.45184,
3,1236,active,PCP129774,REVERE,US,,-70.99036,42.421005,
4,1292,active,PCP144578,BILLERICA,US,,-71.260947,42.559673,


## MedicationRequest

In [21]:
with query_engine.connect() as con:
    con.execute("""
      CREATE OR REPLACE VIEW MedicationRequest_flat AS
      SELECT M.id AS med_req_id, M.subject.patientId AS patient_id, M.status,
        MSC.system AS statusReason_sys, MSC.code AS statusReason_code,
        MMCC.system, MMCC.code, M.intent, M.doNotPerform,
        M.performer.practitionerId AS practitioner_id
      FROM MedicationRequest AS M
        LATERAL VIEW OUTER explode(M.statusReason.coding) AS MSC
        LATERAL VIEW OUTER explode(M.medication.codeableConcept.coding) AS MMCC
      ;
      """
    )

pandas.read_sql_query(
    sql="SELECT * FROM MedicationRequest_flat LIMIT 5;",
    con=query_engine,
)

Unnamed: 0,med_req_id,patient_id,status,statusReason_sys,statusReason_code,system,code,intent,doNotPerform,practitioner_id
0,10592,10392,stopped,,,http://www.nlm.nih.gov/research/umls/rxnorm,834357,order,,
1,10594,10392,stopped,,,http://www.nlm.nih.gov/research/umls/rxnorm,1190795,order,,
2,29022,26670,stopped,,,http://www.nlm.nih.gov/research/umls/rxnorm,1804799,order,,
3,29277,29122,stopped,,,http://www.nlm.nih.gov/research/umls/rxnorm,834357,order,,
4,29279,29122,stopped,,,http://www.nlm.nih.gov/research/umls/rxnorm,1190795,order,,


## Organization

In [22]:
with query_engine.connect() as con:
    con.execute("""
      CREATE OR REPLACE VIEW Organization_flat AS
      SELECT O.id AS org_id, O.active, O.name, OA.city, OA.country,
        OTC.system AS type_sys, OTC.code AS type_code,
        O.partOf.OrganizationId AS partOf_org_id
      FROM Organization AS O LATERAL VIEW OUTER explode(O.address) AS OA
         LATERAL VIEW OUTER explode(O.type) AS OT
         LATERAL VIEW OUTER explode(OT.coding) AS OTC
      ;
      """
    )

pandas.read_sql_query(
    sql="SELECT * FROM Organization_flat LIMIT 5;",
    con=query_engine,
)

Unnamed: 0,org_id,active,name,city,country,type_sys,type_code,partOf_org_id
0,1003,True,ASSOCIATES OF SOUTH SHORE DERMATOLOGY LLC,MILTON,US,http://terminology.hl7.org/CodeSystem/organiza...,prov,
1,1055,True,HELLER EYECARE INC,WILMINGTON,US,http://terminology.hl7.org/CodeSystem/organiza...,prov,
2,1079,True,PCP90763,RAYNHAM,US,http://terminology.hl7.org/CodeSystem/organiza...,prov,
3,1211,True,PCP124160,HYDE PARK,US,http://terminology.hl7.org/CodeSystem/organiza...,prov,
4,1295,True,EYE ASSOCIATES OF SOMERVILLE INC.,SOMERVILLE,US,http://terminology.hl7.org/CodeSystem/organiza...,prov,


## Practitioner

In [23]:
with query_engine.connect() as con:
    con.execute("""
      CREATE OR REPLACE VIEW Practitioner_flat AS
      SELECT P.id AS prac_id, P.active, PA.city, PA.country, P.gender,
        PQCC.system AS qualification_system, PQCC.code AS qualification_code
      FROM Practitioner AS P LATERAL VIEW OUTER explode(P.address) AS PA
        LATERAL VIEW OUTER explode(P.qualification) AS PQ
        LATERAL VIEW OUTER explode(PQ.code.coding) AS PQCC
      ;
      """
    )

pandas.read_sql_query(
    sql="SELECT * FROM Practitioner_flat LIMIT 5;",
    con=query_engine,
)

Unnamed: 0,prac_id,active,city,country,gender,qualification_system,qualification_code
0,2455,True,GARDNER,US,female,,
1,2557,True,MIDDLETON,US,female,,
2,2563,True,BROOKLINE,US,female,,
3,2617,True,GRAFTON,US,male,,
4,2635,True,MARSHFIELD,US,male,,


## PractitionerRole

In [24]:
with query_engine.connect() as con:
    con.execute("""
      CREATE OR REPLACE VIEW PractitionerRole_flat AS
      SELECT P.id AS pr_id, P.practitioner.practitionerId as practitioner_id,
        P.active, P.organization.organizationId AS organization_id,
        PCC.system, PCC.code,
        PSC.system AS specialty_sys, PSC.code AS specialty_code,
        PL.LocationId, PH.HealthcareServiceId
      FROM PractitionerRole AS P
        LATERAL VIEW OUTER explode(P.code) AS PC
        LATERAL VIEW OUTER explode(PC.coding) AS PCC
        LATERAL VIEW OUTER explode(P.specialty) AS PS
        LATERAL VIEW OUTER explode(PS.coding) AS PSC
        LATERAL VIEW OUTER explode(P.location) AS PL
        LATERAL VIEW OUTER explode(P.healthcareService) AS PH
      ;
      """
    )

pandas.read_sql_query(
    sql="SELECT * FROM PractitionerRole_flat LIMIT 5;",
    con=query_engine,
)

Unnamed: 0,pr_id,practitioner_id,active,organization_id,system,code,specialty_sys,specialty_code,LocationId,HealthcareServiceId
0,2588,,,,http://nucc.org/provider-taxonomy,208D00000X,http://nucc.org/provider-taxonomy,208D00000X,,
1,2624,,,,http://nucc.org/provider-taxonomy,208D00000X,http://nucc.org/provider-taxonomy,208D00000X,,
2,2682,,,,http://nucc.org/provider-taxonomy,208D00000X,http://nucc.org/provider-taxonomy,208D00000X,,
3,2688,,,,http://nucc.org/provider-taxonomy,208D00000X,http://nucc.org/provider-taxonomy,208D00000X,,
4,2702,,,,http://nucc.org/provider-taxonomy,208D00000X,http://nucc.org/provider-taxonomy,208D00000X,,


## Procedure

In [25]:
with query_engine.connect() as con:
    con.execute("""
      CREATE OR REPLACE VIEW Procedure_flat AS
      SELECT P.id AS proc_id, P.subject.patientId AS patient_id,
        P.encounter.encounterId AS encounter_id, PCC.system, PCC.code,
        PP.actor.practitionerId AS practitioner_id,
        P.performed.period.start AS period_start,
        P.performed.period.`end` AS period_end,
        P.location.locationId AS location_id, P.status
      FROM Procedure AS P LATERAL VIEW OUTER explode(P.code.coding) AS PCC
        LATERAL VIEW OUTER explode(P.performer) AS PP
      ;
      """
    )

pandas.read_sql_query(
    # TODO: check why including period_end fails, while it can be selected separately!
    sql="""
      SELECT proc_id, patient_id, encounter_id, system, code, practitioner_id,
        period_start, location_id, status FROM Procedure_flat LIMIT 5;
      """,
    con=query_engine,
)

Unnamed: 0,proc_id,patient_id,encounter_id,system,code,practitioner_id,period_start,location_id,status
0,29019,26670,29017,http://snomed.info/sct,433112001,,2019-02-20T19:45:21+00:00,56,completed
1,29272,29122,29268,http://snomed.info/sct,447365002,,1972-10-14T09:12:43+00:00,24,completed
2,28867,26670,28866,http://snomed.info/sct,180325003,,2008-05-14T19:45:21+00:00,1640,completed
3,29085,26670,29084,http://snomed.info/sct,180325003,,2021-07-28T19:45:21+00:00,1640,completed
4,10344,9040,10343,http://snomed.info/sct,18286008,,2016-12-22T03:22:24+00:00,410,completed


# FHIR Views

**Note**: The Spark-runner for FHIR-views is in experimental mode and is not
ready for production use.

[FHIR Views](https://github.com/google/fhir-py/tree/main/google-fhir-views)
is an abstraction layer to simplify the above view creation pattern. It helps
create flat views for FHIR resources using
[FHIRPath](http://hl7.org/fhirpath/N1/) statements. It separates
view definition from view creation, the latter being dependent on the
underlying data representation. For example, there are a BigQuery-runner, a
Spark-runner, etc., each of which transforms the view definition into the
corresponding SQL dialect.

The FHIR-Views pattern is being standardized in
[SQL-on-FHIR v2](https://build.fhir.org/ig/FHIR/sql-on-fhir-v2/)
as a language-independent FHIR spec (i.e., JSON instead of Python).
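
To give a rough idea of what such a JSON view definition looks like, the
sketch below builds one as a Python dictionary and prints it as JSON. The
field names follow the draft SQL-on-FHIR v2 `ViewDefinition` as we understand
it and should be checked against the current version of the spec. The view
name `observation_codes` is made up, and this dictionary is not something the
FHIR-views library in this notebook consumes.

```python
import json

view_definition = {
    "resourceType": "ViewDefinition",
    "name": "observation_codes",  # hypothetical view name
    "resource": "Observation",
    "select": [
        {
            "column": [
                {"name": "obs_id", "path": "id"},
                {"name": "status", "path": "status"},
            ]
        },
        {
            # One output row per coding, similar to explode(code.coding).
            "forEach": "code.coding",
            "column": [
                {"name": "code_sys", "path": "system"},
                {"name": "code", "path": "code"},
            ],
        },
    ],
}

print(json.dumps(view_definition, indent=2))
```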

In [26]:
# In addition to the sqlalchemy imports we did above, we also need the
# FHIR-views library for the rest of this notebook.

from google.fhir.views import r4
from google.fhir.views import spark_runner

# The Spark dataset containing FHIR data. This may be read-only to the user.
fhir_dataset = "default"

# The Spark dataset where we will create views, value sets, and other derived tables
# as needed. This must be writeable by the user.
analysis_dataset = "demo_example"

# Create a runner to execute the views over Spark.
runner = spark_runner.SparkRunner(
    query_engine=query_engine,
    fhir_dataset=fhir_dataset,
    view_dataset=analysis_dataset,
    snake_case_resource_tables=True,
)

## Explore the data with FHIR Views
Similar to the SQL examples above, first we try to get a general feeling of
the data. So let's create some views of FHIR resources and take a look at
common fields.

In [27]:
# Load views based on the base FHIR R4 profile definitions.
views = r4.base_r4()

# Create shorthand names for resources we will work with.
obs = views.view_of("Observation")
pats = views.view_of("Patient")

Let's take a look at some observation resources by creating FHIRPath expressions to select the items we're interested in.

Notice that tab completion works for FHIR fields and FHIRPath functions, so users don't need to switch back and forth to FHIR documentation as much.

In [28]:
runner.to_dataframe(
    obs.select(
        {
            "id": obs.id,
            "category": obs.category.coding.code,
            "code_display": obs.code.coding.display,
            "status": obs.status,
        }
    ),
    limit=10,
)

Unnamed: 0,id,category,code_display,status
0,10929,"[""survey""]","[""CURRENT ANTIRETROVIRAL DRUGS USED FOR TREATM...",final
1,10063,"[""survey""]","[""TEMPERATURE""]",final
2,10798,"[""survey""]","[""TEMPERATURE""]",final
3,10030,"[""survey""]","[""REVIEW OF TUBERCULOSIS SCREENING QUESTIONS""]",final
4,10803,"[""survey""]","[""REVIEW OF TUBERCULOSIS SCREENING QUESTIONS""]",final
5,10880,"[""survey""]","[""PATIENT REPORTED CURRENT TB TREATMENT""]",final
6,10075,"[""survey""]","[""TESTS ORDERED""]",final
7,10083,"[""survey""]","[""TESTS ORDERED""]",final
8,10825,"[""survey""]","[""TESTS ORDERED""]",final
9,10197,"[""survey""]","[""PATIENT REPORTED CURRENT ANTIRETROVIRAL TREA...",final


Of course, a sample of codes alone isn't very useful -- we really want a summary of which codes exist in a field and how many times each one occurs. Fortunately, the Spark runner supports a `summarize_codes` method that accepts a view and a field expression and does exactly that. Note that in the SQL section above we needed a non-trivial query to do something similar.

In [29]:
runner.summarize_codes(obs, obs.code)

Unnamed: 0,system,code,display,count
0,http://loinc.org,1271AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,TESTS ORDERED,4949
1,http://loinc.org,1088AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,CURRENT ANTIRETROVIRAL DRUGS USED FOR TREATMENT,2387
2,http://loinc.org,1111AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,PATIENT REPORTED CURRENT TB TREATMENT,2043
3,http://loinc.org,1250AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,ANTIRETROVIRALS STARTED,1805
4,http://loinc.org,1270AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,TUBERCULOSIS TREATMENT STARTED,1121
5,http://loinc.org,159800AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,REVIEW OF TUBERCULOSIS SCREENING QUESTIONS,738
6,http://loinc.org,5085AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,LOINC Code,530
7,http://loinc.org,159911AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,PATIENT REPORTED CURRENT ANTIRETROVIRAL TREATMENT,408
8,http://loinc.org,1268AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,TUBERCULOSIS TREATMENT PLAN,340
9,http://loinc.org,1265AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,TUBERCULOSIS PROPHYLAXIS PLAN,340

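To make the idea behind such a summary concrete, here is a small pure-Python
sketch that counts `(system, code, display)` triples over a few in-memory
resources. The three observations and the `count_codings` helper are
hypothetical; this only illustrates the counting, it is not how the
FHIR-views library implements `summarize_codes`.

```python
from collections import Counter

# Hypothetical, heavily trimmed Observation resources; only the parts of
# `code.coding` needed for the summary are kept.
observations = [
    {"id": "a", "code": {"coding": [
        {"system": "http://loinc.org", "code": "1271", "display": "TESTS ORDERED"}]}},
    {"id": "b", "code": {"coding": [
        {"system": "http://loinc.org", "code": "5088", "display": "TEMPERATURE"}]}},
    {"id": "c", "code": {"coding": [
        {"system": "http://loinc.org", "code": "1271", "display": "TESTS ORDERED"}]}},
]


def count_codings(resources, field):
    """Counts (system, code, display) triples over all codings of `field`."""
    counts = Counter()
    for resource in resources:
        for coding in resource.get(field, {}).get("coding", []):
            key = (coding.get("system"), coding.get("code"), coding.get("display"))
            counts[key] += 1
    # most_common() returns the triples sorted by descending count.
    return counts.most_common()


summary = count_codings(observations, "code")
for (system, code, display), count in summary:
    print(system, code, display, count)
```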

## Creating Views

As discussed above, to handle the complexity of FHIR data, we can usually create
a flat, tabular form of a resource type that can satisfy a given use case.

For example, imagine we need a simple table of patients with their current
address. This isn't trivial to query since the address is a nested repeated
field. Fortunately we can build a FHIRPath expression to find the current
address and create a flattened view using that.

(This can vary by dataset, but in this case we determine the current address
by selecting the addresses that do not have a period attached to them.)

In [30]:
# For this dataset we interpret the current address as one where period is empty.

current = pats.address.where(pats.address.period.empty())

simple_pats = pats.select(
    {
        "id": pats.id,
        "gender": pats.gender,
        "birthdate": pats.birthDate,
        "street": current.line,
        "city": current.city,
        "state": current.state,
        "zip": current.postalCode,
    }
)

runner.to_dataframe(simple_pats, limit=5)

Unnamed: 0,id,gender,birthdate,street,city,state,zip
0,38462,female,1961-10-01,"[""376 Mohr Annex Suite 88""]","[""Northampton""]","[""MA""]","[""01060""]"
1,9040,female,1957-10-31,"[""751 Kessler Divide Unit 49""]","[""Brookline""]","[""MA""]","[""02445""]"
2,39570,male,1915-04-20,"[""988 Ryan Burg Apt 59""]","[""North Reading""]","[""MA""]",[]
3,27113,male,1962-04-23,"[""609 Steuber Crossroad Unit 49""]","[""Westfield""]","[""MA""]","[""01085""]"
4,30404,male,1941-10-24,"[""617 D'Amore Course""]","[""Brewster""]","[""MA""]","[""02631""]"

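Note that the address columns above are lists: a FHIRPath expression evaluates
to a collection, and `where(...)` keeps every matching address rather than a
single one. The following pure-Python sketch mimics that behavior on a
hypothetical Patient resource; the sample data and the
`addresses_without_period` helper are assumptions made for illustration.

```python
# Hypothetical Patient resource with one past address (it has a period)
# and one address without a period.
patient = {
    "id": "example",
    "address": [
        {
            "line": ["1 Old Road"],
            "city": "Boston",
            "state": "MA",
            "period": {"start": "2001-01-01", "end": "2010-06-30"},
        },
        {
            "line": ["2 New Street"],
            "city": "Brookline",
            "state": "MA",
            "postalCode": "02445",
        },
    ],
}


def addresses_without_period(resource):
    """Mimics address.where(period.empty()): keeps addresses with no period."""
    return [a for a in resource.get("address", []) if not a.get("period")]


current_addresses = addresses_without_period(patient)

# Each column is a collection, which is why the view shows list values.
street = [line for a in current_addresses for line in a.get("line", [])]
city = [a["city"] for a in current_addresses if "city" in a]
zip_codes = [a["postalCode"] for a in current_addresses if "postalCode" in a]

print(street, city, zip_codes)
```

If a patient had two addresses without a period, both would show up in each
list, so a stricter rule may be needed for datasets where that happens.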

That's nice, but suppose we want to create it as an actual Spark view, i.e.,
a virtual table that can be easily used by any application that uses Spark.
This is basically what we did in the previous SQL section using `CREATE VIEW`.
We can simply turn the above definition into a Spark/Hive view as well:

In [31]:
runner.create_database_view(simple_pats, "patient_current_address")

Now we can run simple SQL queries against this view.

_Note_: Since we set `view_dataset=analysis_dataset` (i.e., `demo_example`)
when creating the Spark runner at the start of the FHIR Views section, we need
to qualify the view names with that dataset in the SQL queries we write
against these views.

In [32]:
pandas.read_sql_query(
    sql="SELECT * FROM demo_example.patient_current_address LIMIT 5",
    con=query_engine,
)

Unnamed: 0,id,gender,birthdate,street,city,state,zip
0,38462,female,1961-10-01,"[""376 Mohr Annex Suite 88""]","[""Northampton""]","[""MA""]","[""01060""]"
1,9040,female,1957-10-31,"[""751 Kessler Divide Unit 49""]","[""Brookline""]","[""MA""]","[""02445""]"
2,39570,male,1915-04-20,"[""988 Ryan Burg Apt 59""]","[""North Reading""]","[""MA""]",[]
3,27113,male,1962-04-23,"[""609 Steuber Crossroad Unit 49""]","[""Westfield""]","[""MA""]","[""01085""]"
4,30404,male,1941-10-24,"[""617 D'Amore Course""]","[""Brewster""]","[""MA""]","[""02631""]"


Now we have a nice, flattened patients table that meets the needs of our system and data.

## Observation view for indicators
Here we try to recreate a view for calculating some PEPFAR indicators.
This is similar to what we did with direct SQL above, but now we use
FHIR-Views for the heavy lifting.

In [33]:
# TODO: fix the `where` clause Spark SQL issues.

obs_code_val = obs.select(
    {
        "obs_id": obs.id,
        "patient_id": obs.subject.idFor('Patient'),
        "code_sys": obs.code.coding.system.first(),
        "code": obs.code.coding.code.first(),
        "val_quantity": obs.value.ofType('quantity').value,
    }).where(
        # obs.code.coding.system=='http://loinc.org'
        # obs.code.coding.code=='1088AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'
)

runner.to_dataframe(obs_code_val, limit=20)
#print(runner.to_sql(obs_code_val))

Unnamed: 0,obs_id,patient_id,code_sys,code,val_quantity
0,10929,10392,http://loinc.org,1088AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,
1,11327,10669,http://loinc.org,1088AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,
2,12026,11149,http://loinc.org,5085AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,188.66
3,10063,9040,http://loinc.org,5088AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,36.921
4,10798,10392,http://loinc.org,5088AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,36.833
5,10030,9040,http://loinc.org,159800AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,
6,10803,10392,http://loinc.org,159800AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,
7,11995,11149,http://loinc.org,159800AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,
8,11554,10669,http://loinc.org,1261AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,
9,10880,10392,http://loinc.org,1111AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,

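The `where` clause in the cell above is left empty because of the open TODO.
As a stopgap, the intended filter can be applied on the client side to the
rows that were fetched. Here is a minimal sketch using three rows copied from
the output above; the `filter_by_code` helper is hypothetical, and this
approach only makes sense for result sets small enough to fetch in full.

```python
LOINC = "http://loinc.org"
# Code of "CURRENT ANTIRETROVIRAL DRUGS USED FOR TREATMENT" in this dataset.
ART_DRUGS_CODE = "1088AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"

# Three rows copied from the output above, represented as dictionaries.
rows = [
    {"obs_id": "10929", "patient_id": "10392", "code_sys": LOINC,
     "code": ART_DRUGS_CODE, "val_quantity": None},
    {"obs_id": "12026", "patient_id": "11149", "code_sys": LOINC,
     "code": "5085AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA", "val_quantity": 188.66},
    {"obs_id": "10063", "patient_id": "9040", "code_sys": LOINC,
     "code": "5088AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA", "val_quantity": 36.921},
]


def filter_by_code(rows, system, code):
    """Keeps only the rows whose first coding matches `system` and `code`."""
    return [r for r in rows if r["code_sys"] == system and r["code"] == code]


art_rows = filter_by_code(rows, LOINC, ART_DRUGS_CODE)
print([r["obs_id"] for r in art_rows])
```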

## Checking the underlying SQL statements
It is easy to inspect the underlying SQL query that the FHIR-views library generates. This is done using the `to_sql` method, which is part of the runner API. Below we show the SQL queries generated for two of the examples we used above:

In [34]:
print(runner.to_sql(
    obs.select(
        {
            "id": obs.id,
            "category": obs.category.coding.code,
            "code_display": obs.code.coding.display,
            "status": obs.status,
        }
    ),
    limit=10,
))

SELECT (SELECT id) AS id,(SELECT COLLECT_LIST(code)
FROM (SELECT coding_element_.code
FROM (SELECT category_element_
FROM (SELECT EXPLODE(category_element_) AS category_element_ FROM (SELECT category AS category_element_))) LATERAL VIEW POSEXPLODE(category_element_.coding) AS index_coding_element_, coding_element_)
WHERE code IS NOT NULL) AS category,(SELECT COLLECT_LIST(display)
FROM (SELECT coding_element_.display
FROM (SELECT code) LATERAL VIEW POSEXPLODE(code.coding) AS index_coding_element_, coding_element_)
WHERE display IS NOT NULL) AS code_display,(SELECT status) AS status,(SELECT subject.patientId AS idFor_) AS __patientId__ FROM `default`.observation LIMIT 10


In [35]:
current = pats.address.where(pats.address.period.empty())

simple_pats = pats.select(
    {
        "id": pats.id,
        "gender": pats.gender,
        "birthdate": pats.birthDate,
        "street": current.line,
        "city": current.city,
        "state": current.state,
        "zip": current.postalCode,
    }
)

print(runner.to_sql(simple_pats, limit=5))

SELECT (SELECT id) AS id,(SELECT gender) AS gender,(SELECT CAST(birthDate AS TIMESTAMP) AS birthDate) AS birthdate,(SELECT COLLECT_LIST(line_element_)
FROM (SELECT line_element_
FROM (SELECT address_element_
FROM (SELECT EXPLODE(address_element_) AS address_element_ FROM (SELECT address AS address_element_))
WHERE (SELECT CASE WHEN COUNT(*) = 0 THEN TRUE ELSE FALSE END AS empty_
FROM (SELECT address_element_.period)
WHERE period IS NOT NULL)) LATERAL VIEW POSEXPLODE(address_element_.line) AS index_line_element_, line_element_)
WHERE line_element_ IS NOT NULL) AS street,(SELECT COLLECT_LIST(city)
FROM (SELECT address_element_.city
FROM (SELECT EXPLODE(address_element_) AS address_element_ FROM (SELECT address AS address_element_))
WHERE (SELECT CASE WHEN COUNT(*) = 0 THEN TRUE ELSE FALSE END AS empty_
FROM (SELECT address_element_.period)
WHERE period IS NOT NULL))
WHERE city IS NOT NULL) AS city,(SELECT COLLECT_LIST(state)
FROM (SELECT address_element_.state
FROM (SELECT EXPLODE(addres

# Appendix
Here are some more helper queries.

In [None]:
df = pandas.read_sql_query(
    sql="SHOW TABLES;",
    con=query_engine,
)

# Hide the tables whose names contain '_2023_'.
df[~df.tableName.str.contains('_2023_')]