# Introduction to using FHIR Data Pipes and FHIR Views
This example shows how to use FHIR Views to query data generated by FHIR Data Pipes.

# Prerequisites

**Note**: Run all commands from the root directory of the `fhir-data-pipes` repo.

1. Bring up a Hive ThriftServer:
    ```
    docker-compose  -f ./docker/compose-controller-spark-sql-single.yaml up --build  --force-recreate -d
    ```

2. Bring up a HAPI FHIR server:
    ```
    docker-compose  -f ./docker/hapi-compose.yml up --force-recreate -d
    ```

3. Load data into the HAPI FHIR server:
    ```
    python3 ./synthea-hiv/uploader/main.py HAPI http://localhost:8091/fhir --input_dir ./synthea-hiv/sample_data
    ```

4. On the FHIR Pipelines Control Panel page, http://localhost:8090, click the “Run Full” button to convert the data to Parquet files with the SQL-on-FHIR schema. The exported Parquet files are stored under the `docker/dwh` directory.
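Before moving on, it can help to confirm that the export actually produced Parquet files. The sketch below counts `*.parquet` files per resource type. It assumes a layout of `docker/dwh/controller_*/<ResourceType>/*.parquet`, inferred from the table locations used later in this example; the helper name `count_parquet_files` is ours and is not part of FHIR Data Pipes.

```python
from collections import Counter
from pathlib import Path


def count_parquet_files(dwh_root: str) -> dict[str, int]:
    """Count *.parquet files per resource directory under dwh_root.

    Assumes the layout <dwh_root>/controller_*/<ResourceType>/*.parquet.
    A missing or empty directory simply yields an empty dict.
    """
    counts: Counter[str] = Counter()
    for path in Path(dwh_root).glob("controller_*/*/*.parquet"):
        # The parent directory name is the resource type, e.g. "Patient".
        counts[path.parent.name] += 1
    return dict(counts)


# Run from the repo root, as with the other commands in this section.
print(count_parquet_files("docker/dwh"))
```

An empty result most likely means the "Run Full" pipeline has not finished yet or wrote its output somewhere else.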


# Installation

It's recommended to set up a virtual environment before starting the notebook kernel to avoid any dependency version issues with your native environment.

This can be done with the following commands in a terminal, if you have Conda installed:

```
    conda create -n python310 python=3.10 -y
    conda activate python310
    conda install ipykernel -y
    ipython kernel install --user --name=python310
    pip install "google-fhir-views[r4,spark]"
    pip install cryptography
```
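Once the kernel is running, a quick check like the one below can confirm that the environment has everything the cells in this example import. This is only a sketch: the module list is taken from the imports and the dialect registration used below, and the helper name `missing_modules` is ours.

```python
import importlib.util
import sys

# Modules imported or referenced by the cells in this example.
REQUIRED_MODULES = ["pandas", "sqlalchemy", "pyhive", "google.fhir.views"]


def missing_modules(names):
    """Return the subset of module names that cannot be found."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package of a dotted name is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


print("Python", sys.version.split()[0])
print("Missing modules:", missing_modules(REQUIRED_MODULES) or "none")
```

If anything is reported as missing, re-run the `pip install` commands above inside the activated environment.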

# Environment setup
The cell below sets up a Spark client and creates a Spark view "runner", which is used to apply declarative views of FHIR in Spark:

In [19]:
import pandas

from sqlalchemy import dialects
from sqlalchemy import engine

from google.fhir.views import r4
from google.fhir.views import spark_runner

# The Spark database containing FHIR data. This may be read-only to the user.
fhir_dataset = "default"

# The Spark database where we will create views, value sets, and other derived
# tables as needed. This must be writeable by the user.
analysis_dataset = "demo_example"

dialects.registry.register("hive", "pyhive.sqlalchemy_hive", "HiveDialect")

# The endpoint of the Hive ThriftServer to connect to
query_engine = engine.create_engine("hive://localhost:10001/default")

# Create a runner to execute the views over Spark.
runner = spark_runner.SparkRunner(
    query_engine=query_engine,
    fhir_dataset=fhir_dataset,
    view_dataset=analysis_dataset,
    snake_case_resource_tables=True,
)

The cell below registers the Parquet files we created as tables in the Spark ThriftServer and recreates the analysis database:

In [20]:
destination_directory_path = "/dwh/controller_*"
with query_engine.connect() as curs:
    curs.execute(f"DROP TABLE IF EXISTS {fhir_dataset}.encounter;")
    curs.execute(f"DROP TABLE IF EXISTS {fhir_dataset}.observation;")
    curs.execute(f"DROP TABLE IF EXISTS {fhir_dataset}.patient;")
    curs.execute(
        f"CREATE TABLE IF NOT EXISTS {fhir_dataset}.encounter USING"
        f" PARQUET LOCATION '{destination_directory_path}/Encounter/*.parquet';"
    )
    curs.execute(
        f"CREATE TABLE IF NOT EXISTS {fhir_dataset}.observation USING PARQUET"
        f" LOCATION '{destination_directory_path}/Observation/*.parquet';"
    )
    curs.execute(
        f"CREATE TABLE IF NOT EXISTS {fhir_dataset}.patient USING PARQUET"
        f" LOCATION '{destination_directory_path}/Patient/*.parquet';"
    )

    curs.execute(f"DROP DATABASE IF EXISTS {analysis_dataset} CASCADE;")
    curs.execute(f"CREATE DATABASE IF NOT EXISTS {analysis_dataset};")

**Note**: These tables are created automatically if `createHiveResourceTables` is set to `true`, as in [this configuration file](https://github.com/google/fhir-data-pipes/blob/3694f1394d0b9011ab480ca61f0bd0568bca2f53/docker/config/application.yaml#L34).
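If you need more resource types than the three registered above, the repeated `DROP`/`CREATE` statements can be generated in a loop. The sketch below only builds the SQL strings, so it can be inspected without a running server. The helper names are ours, and the snake_case table naming for multi-word resource types (for example `MedicationRequest`) is an assumption based on the `snake_case_resource_tables=True` setting, not something verified against the pipeline.

```python
import re


def to_snake_case(resource_type):
    """Convert a FHIR resource type such as 'MedicationRequest' to snake_case."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", resource_type).lower()


def build_table_statements(dataset, resource_types, parquet_root):
    """Build DROP/CREATE statements mirroring the ones in the cell above."""
    statements = []
    for resource in resource_types:
        table = f"{dataset}.{to_snake_case(resource)}"
        statements.append(f"DROP TABLE IF EXISTS {table};")
        statements.append(
            f"CREATE TABLE IF NOT EXISTS {table} USING PARQUET"
            f" LOCATION '{parquet_root}/{resource}/*.parquet';"
        )
    return statements


for stmt in build_table_statements(
    "default", ["Encounter", "Observation", "Patient"], "/dwh/controller_*"
):
    print(stmt)
```

Each returned string could then be passed to `curs.execute(...)` inside the same `with query_engine.connect() as curs:` block used above.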

# Explore the data

Getting a general feeling for the data is an important first step in any analysis. So let's create some views of FHIR resources and take a look at common fields.



In [21]:
# Load views based on the base FHIR R4 profile definitions.
views = r4.base_r4()

# Create shorthand names for resources we will work with.
obs = views.view_of("Observation")
pats = views.view_of("Patient")

Let's take a look at some observation resources by creating FHIRPath expressions to select the items we're interested in.

Notice that tab completion works for FHIR fields and FHIRPath functions, so users don't need to switch back and forth to FHIR documentation as much.

In [22]:
runner.to_dataframe(
    obs.select(
        {
            "id": obs.id,
            "category": obs.category.coding.code,
            "code_display": obs.code.coding.display,
            "status": obs.status,
        }
    ),
    limit=10,
)

| | id | category | code_display | status |
|---|---|---|---|---|
| 0 | 6097 | `["survey"]` | `["CURRENT ANTIRETROVIRAL DRUGS USED FOR TREATM...` | final |
| 1 | 6173 | `["survey"]` | `["LOINC Code"]` | final |
| 2 | 6508 | `["survey"]` | `["LOINC Code"]` | final |
| 3 | 6175 | `["survey"]` | `["TESTS ORDERED"]` | final |
| 4 | 6182 | `["survey"]` | `["TESTS ORDERED"]` | final |
| 5 | 6409 | `["survey"]` | `["TESTS ORDERED"]` | final |
| 6 | 6534 | `["survey"]` | `["ANTIRETROVIRALS STARTED"]` | final |
| 7 | 6493 | `["survey"]` | `["TUBERCULOSIS TREATMENT PLAN"]` | final |
| 8 | 6314 | `["survey"]` | `["TUBERCULOSIS PROPHYLAXIS PLAN"]` | final |
| 9 | 6513 | `["survey"]` | `["TUBERCULOSIS TREATMENT STARTED"]` | final |
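The quoting in the output above suggests that list-valued columns such as `category` come back from the Hive connection as JSON-encoded strings rather than Python lists. Assuming that is the case in your setup, a small helper like the following (our own name, not part of the library) decodes them. The sample rows reuse two ids from the output above.

```python
import json


def parse_list_column(rows, column):
    """Return copies of rows with `column` decoded from a JSON string to a list.

    Values that are not strings (e.g. already lists) are passed through as-is.
    """
    parsed = []
    for row in rows:
        value = row[column]
        decoded = json.loads(value) if isinstance(value, str) else value
        parsed.append({**row, column: decoded})
    return parsed


sample = [
    {"id": "6097", "category": '["survey"]', "status": "final"},
    {"id": "6173", "category": '["survey"]', "status": "final"},
]
for row in parse_list_column(sample, "category"):
    print(row["id"], row["category"][0])
```

On a DataFrame the same idea is a one-liner, `df["category"].map(json.loads)`, again assuming the column holds JSON strings.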


Of course, a sample of codes alone isn't very useful. What we really want is a summary of which codes exist in a field and how often each one occurs. Fortunately, the Spark runner supports a `summarize_codes` method that accepts a view and a field and does exactly that.

In [23]:
runner.summarize_codes(obs, obs.category)

| | system | code | display | count |
|---|---|---|---|---|
| 0 | http://terminology.hl7.org/CodeSystem/observat... | survey | survey | 675 |
| 1 | http://terminology.hl7.org/CodeSystem/observat... | exam | exam | 13 |

# Creating Views

FHIR data is complex and deeply nested, but there is usually a flat, tabular form of it that satisfies a given use case and data set.

For example, imagine we need a simple table of patients with their current address. This isn't trivial to query since the address is a nested repeated field. Fortunately we can build a FHIRPath expression to find the current address and create a flattened view using that.

(This can vary by dataset, but in this case we treat every address that has no period attached to it as current. Because more than one address can match, the address columns below are lists.)

In [24]:
# For this dataset we interpret the current address as one where period is empty.

current = pats.address.where(pats.address.period.empty())

simple_pats = pats.select(
    {
        "id": pats.id,
        "gender": pats.gender,
        "birthdate": pats.birthDate,
        "street": current.line,
        "city": current.city,
        "state": current.state,
        "zip": current.postalCode,
    }
)

runner.to_dataframe(simple_pats, limit=5)

| | id | gender | birthdate | street | city | state | zip |
|---|---|---|---|---|---|---|---|
| 0 | 4765 | male | 1984-09-09 | `["754 Feil Tunnel Unit 36"]` | `["Springfield"]` | `["MA"]` | `["01119"]` |
| 1 | 4767 | female | 1999-04-26 | `["766 King Landing Suite 14"]` | `["Spencer"]` | `["MA"]` | `["01562"]` |
| 2 | 4768 | female | 1999-02-12 | `["508 Auer Lodge"]` | `["Wilbraham"]` | `["MA"]` | `[]` |
| 3 | 4770 | male | 1999-07-01 | `["230 Olson Fort Suite 50"]` | `["Dennis"]` | `["MA"]` | `[]` |
| 4 | 4776 | female | 1993-07-31 | `["612 Kuhlman Skyway Apt 45"]` | `["Amherst"]` | `["MA"]` | `[]` |
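For readers less familiar with FHIRPath, the filter `address.where(period.empty())` can be pictured as the following plain-Python list comprehension over a single patient resource. This is only an analogy using a hypothetical patient, not how the library evaluates the expression.

```python
def addresses_without_period(patient):
    """Return every address of the patient that has no (or an empty) period."""
    return [a for a in patient.get("address", []) if not a.get("period")]


# Hypothetical patient with one past address and two addresses without a period.
patient = {
    "id": "example",
    "address": [
        {"city": "Springfield", "period": {"end": "2010-01-01"}},
        {"city": "Amherst"},
        {"city": "Dennis"},
    ],
}

current_addresses = addresses_without_period(patient)
print([a["city"] for a in current_addresses])
```

The filter keeps all matching addresses rather than just one, which is why `street`, `city`, `state`, and `zip` appear as lists in the result above.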


That's nice, but suppose we want to create it as an actual Spark view, that is, a virtual table that any application using Spark can query. We can simply turn that definition into a Spark view:

In [25]:
runner.create_database_view(simple_pats, "patient_current_address")

In [26]:
pandas.read_sql_query(
    sql="SELECT * FROM demo_example.patient_current_address LIMIT 5",
    con=query_engine,
)

| | id | gender | birthdate | street | city | state | zip |
|---|---|---|---|---|---|---|---|
| 0 | 4765 | male | 1984-09-09 | `["754 Feil Tunnel Unit 36"]` | `["Springfield"]` | `["MA"]` | `["01119"]` |
| 1 | 4767 | female | 1999-04-26 | `["766 King Landing Suite 14"]` | `["Spencer"]` | `["MA"]` | `["01562"]` |
| 2 | 4768 | female | 1999-02-12 | `["508 Auer Lodge"]` | `["Wilbraham"]` | `["MA"]` | `[]` |
| 3 | 4770 | male | 1999-07-01 | `["230 Olson Fort Suite 50"]` | `["Dennis"]` | `["MA"]` | `[]` |
| 4 | 4776 | female | 1993-07-31 | `["612 Kuhlman Skyway Apt 45"]` | `["Amherst"]` | `["MA"]` | `[]` |


Now we have a nice, flattened patients table that meets the needs of our system and data.
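If the idea of a view as a "virtual table" is new, the sketch below demonstrates it with Python's built-in SQLite as a stand-in for Spark (so the SQL dialect and engine differ from what the runner uses). The table and view names are made up for the illustration; the rows reuse ids and cities from the output above. The point is that the view stores a query, not data, so it reflects later changes to the underlying table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (id TEXT, gender TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO patient VALUES (?, ?, ?)",
    [("4765", "male", "Springfield"), ("4767", "female", "Spencer")],
)

# The view is a stored query over the base table.
conn.execute("CREATE VIEW patient_city AS SELECT id, city FROM patient")
print(conn.execute("SELECT * FROM patient_city ORDER BY id").fetchall())

# A row added to the base table afterwards is visible through the view.
conn.execute("INSERT INTO patient VALUES ('4768', 'female', 'Wilbraham')")
print(conn.execute("SELECT COUNT(*) FROM patient_city").fetchone()[0])
```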

# Seeing underlying SQL
It is easy to inspect the underlying SQL query that the FHIR Views library generates, using the `to_sql` method of the runner API. Below are the SQL queries generated for the two examples we used above:

In [27]:
print(runner.to_sql(
    obs.select(
        {
            "id": obs.id,
            "category": obs.category.coding.code,
            "code_display": obs.code.coding.display,
            "status": obs.status,
        }
    ),
    limit=10,
))

SELECT (SELECT id) AS id,(SELECT COLLECT_LIST(code)
FROM (SELECT coding_element_.code
FROM (SELECT category_element_
FROM (SELECT EXPLODE(category_element_) AS category_element_ FROM (SELECT category AS category_element_))) LATERAL VIEW POSEXPLODE(category_element_.coding) AS index_coding_element_, coding_element_)
WHERE code IS NOT NULL) AS category,(SELECT COLLECT_LIST(display)
FROM (SELECT coding_element_.display
FROM (SELECT code) LATERAL VIEW POSEXPLODE(code.coding) AS index_coding_element_, coding_element_)
WHERE display IS NOT NULL) AS code_display,(SELECT status) AS status,(SELECT subject.patientId AS idFor_) AS __patientId__ FROM `default`.observation LIMIT 10


In [28]:
current = pats.address.where(pats.address.period.empty())

simple_pats = pats.select(
    {
        "id": pats.id,
        "gender": pats.gender,
        "birthdate": pats.birthDate,
        "street": current.line,
        "city": current.city,
        "state": current.state,
        "zip": current.postalCode,
    }
)

print(runner.to_sql(simple_pats, limit=5))

SELECT (SELECT id) AS id,(SELECT gender) AS gender,(SELECT CAST(birthDate AS TIMESTAMP) AS birthDate) AS birthdate,(SELECT COLLECT_LIST(line_element_)
FROM (SELECT line_element_
FROM (SELECT address_element_
FROM (SELECT EXPLODE(address_element_) AS address_element_ FROM (SELECT address AS address_element_))
WHERE (SELECT CASE WHEN COUNT(*) = 0 THEN TRUE ELSE FALSE END AS empty_
FROM (SELECT address_element_.period)
WHERE period IS NOT NULL)) LATERAL VIEW POSEXPLODE(address_element_.line) AS index_line_element_, line_element_)
WHERE line_element_ IS NOT NULL) AS street,(SELECT COLLECT_LIST(city)
FROM (SELECT address_element_.city
FROM (SELECT EXPLODE(address_element_) AS address_element_ FROM (SELECT address AS address_element_))
WHERE (SELECT CASE WHEN COUNT(*) = 0 THEN TRUE ELSE FALSE END AS empty_
FROM (SELECT address_element_.period)
WHERE period IS NOT NULL))
WHERE city IS NOT NULL) AS city,(SELECT COLLECT_LIST(state)
FROM (SELECT address_element_.state
FROM (SELECT EXPLODE(addres