# Walkthrough: Merging Synthea FHIR and PacBio VCF Data
### Overview
In **Sections 1-5**, this notebook generates synthetic FHIR data using Synthea, uploads the data to a new FHIR server, and downloads the data in Parquet format.  
In **Section 6**, this notebook assumes that PacBio VCFs are available as a second input, and converts them to Parquet using `bcftools` and `pandas`.  
In **Section 7**, Azure Synapse Analytics is then used to perform joint queries on both datasets.

### Contents
[1. Create an AzureML Environment](#1)  
[2. Generate Synthetic FHIR Data with Synthea](#2)  
[3. Configure a FHIR Server](#3)  
[4. Import Synthea Data to the FHIR Server](#4)  
[5. Set up a FHIR->Synapse Sync Agent](#5)  
[6. Convert PacBio VCF data to Parquet](#6)  
[7. Use Azure Synapse for Analytics](#7)  
[References](#references)

## 1. Create an AzureML Environment <a id="1"></a>

**1.0) Create a new "Resource Group"** named `<resource_group>` (optional, can also use an existing group)
- Create Group: "Home" -> "Resource Groups" -> "+ Create" -> select subscription -> name `<resource_group>` -> "Review + Create" -> "Create"

**1.1) Create an "Azure Machine Learning" environment** (default should be Ubuntu 18.04, Python 3.8).
- Create AzureML: "Home" -> "+ Create a Resource" -> "Azure Machine Learning"[<sup>[0]</sup>](#r0) -> "Create" -> select `<resource_group>` -> name `<azure_ml>` -> "Review + Create" -> "Create" -> wait to deploy
- Launch AzureML Studio: "Notifications" -> "Go to resource" -> "Launch Studio"
- Create Compute: `<azure_ml>` left pane "Compute" -> "+ New" -> name -> select VM Size "Standard_DS12_v2" -> "Create" -> wait to deploy

**1.2) Mount a blob storage container** 
- Create Storage Account: "Home" -> "+ Create a Resource" -> "Storage Account" -> select `<resource_group>` -> name `<storage_account>` -> "Review + Create" -> wait to deploy -> "Go to resource"
- Create Container: `<storage_account>` left pane "Containers" -> top bar "+ Container" -> name (suggested: `synthea`) -> "Create"
- Navigate to AzureML Studio: "Home" -> "Azure Machine Learning" instance `<azure_ml>` -> "Overview" -> "Launch Studio"
- Add Datastore: `<azure_ml>` left pane "Datastores" -> "+ New datastore" -> name `synthea` -> "Azure Blob Storage" -> `<storage_account>` -> `synthea` -> `<keys from next step>` -> "Create"
- Get Storage keys: `<storage_account>` left pane "Access keys" -> "Show" -> "Copy to clipboard"
- Mount Datastore: `<azure_ml>` left pane "Compute" -> `<azure_ml>` instance -> top bar "Data (preview)" -> "Mount" -> "Azure Storage" -> select `synthea` -> name `synthea`
    
The container should now be mounted at `/home/azureuser/cloudfiles/data/datastore/synthea`.

**1.3) Clone this notebook** from GitHub
- Launch Terminal: `<azure_ml>` left pane "Compute" -> select instance "Terminal" 
- Clone Notebook: `git clone https://github.com/microsoft/genomicsnotebook`
- Open Notebook: `<azure_ml>` left pane "Compute" -> select instance "JupyterLab" -> left pane "File Browser" -> open `Users/<username>/genomicsnotebook/sample-notebooks/genomics-fhir-vcf-walkthrough.ipynb`

## 2. Generate Synthetic FHIR Data with Synthea<a id="2"></a> 

According to the project's GitHub page,
> Synthea[<sup>[1]</sup>](#r1) is a Synthetic Patient Population Simulator.  
> The goal is to output synthetic, realistic (but not real), patient data and associated health records in a variety of formats.  

The health record output format used in this notebook is "Fast Healthcare Interoperability Resources", or FHIR[<sup>[2]</sup>](#r2).  
FHIR is the leading standard for health care data exchange, published by HL7.  
In order to set up Synthea[<sup>[1]</sup>](#r1), first reinstall JDK (otherwise `javadoc` fails during Gradle build).

In [None]:
!sudo apt-get install -y default-jdk

Download the latest stable version of the Synthea[<sup>[1]</sup>](#r1) git repository.

In [None]:
!git clone https://github.com/synthetichealth/synthea.git -b v3.0.0

Move to `synthea/` prior to building/running Synthea[<sup>[1]</sup>](#r1)

In [None]:
%cd /home/azureuser/cloudfiles/code/synthea

Build Synthea[<sup>[1]</sup>](#r1) using Gradle 7.0.2 and Java 11.0.15. On Azure DS11_V2, this takes 30-50 minutes.

In [None]:
!./gradlew build check test

Verify that the build/installation has succeeded.

In [None]:
!./run_synthea -h

Generate a test dataset of 10 patients. Synthea[<sup>[1]</sup>](#r1) counts only `Alive` patients toward the requested population, so the output will likely contain more than 10 patient records.

In [None]:
import subprocess

# -s: random seed, -cs: clinician seed, -p: population size
subprocess.run(["./run_synthea",
    "-s", "42",
    "-cs", "99",
    "-p", "10",
    "--exporter.baseDirectory=/home/azureuser/cloudfiles/data/datastore/synthea"
]);

Verify that the Synthea[<sup>[1]</sup>](#r1) data generation succeeded:

In [None]:
!ls /home/azureuser/cloudfiles/data/datastore/synthea/fhir | wc -l
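
The file count above includes every exported bundle, not only patient records. As a rough complementary check, the sketch below tallies the resource types inside each bundle. The helper names `count_resource_types` and `summarize_directory` and the hand-written `sample_bundle` are illustrative assumptions, not part of Synthea.

```python
import json
from collections import Counter
from glob import glob

def count_resource_types(bundle):
    """Count the entries of a parsed FHIR Bundle by resourceType."""
    return Counter(
        entry["resource"]["resourceType"]
        for entry in bundle.get("entry", [])
    )

def summarize_directory(fhir_dir):
    """Sum resource-type counts over every *.json bundle in fhir_dir."""
    totals = Counter()
    for filename in glob(f"{fhir_dir}/*.json"):
        with open(filename) as bundle_file:
            totals += count_resource_types(json.load(bundle_file))
    return totals

# Tiny hand-written bundle (illustrative only, not real Synthea output)
sample_bundle = {
    "resourceType": "Bundle",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1"}},
        {"resource": {"resourceType": "Encounter", "id": "e1"}},
        {"resource": {"resourceType": "Encounter", "id": "e2"}},
    ],
}
sample_counts = count_resource_types(sample_bundle)
print(dict(sample_counts))
```

To run it on the generated data, call `summarize_directory("/home/azureuser/cloudfiles/data/datastore/synthea/fhir")`.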

## 3. Configure a FHIR Server <a id="3"></a>

**3.0) Identify Resource Group and Subscription**
- Navigate to your desired Azure "Resource Group", note the name -> "Overview" -> copy "Subscription ID"
- Set `resource_group` and `sub_id` in [Section 4.1](#globals)

**3.1) Create an "Azure API for FHIR"[<sup>[3]</sup>](#r3) instance**, named `<fhir_server>`
- Navigate to `https://<fhir_server>.azurehealthcareapis.com/metadata` and verify a "Capability Statement" is retrieved.  
That means the FHIR server[<sup>[3]</sup>](#r3) is running.
- Set `fhir_server` in [Section 4.1](#globals)
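
The `/metadata` check in step 3.1 can also be done from the notebook. This is a minimal sketch: `is_capability_statement` is a hypothetical helper, and the two sample dictionaries are stand-ins for real server responses.

```python
def is_capability_statement(body):
    """Return True if a parsed /metadata response is a CapabilityStatement."""
    return isinstance(body, dict) and body.get("resourceType") == "CapabilityStatement"

# Against a live server (needs network access and the `requests` package):
# import requests
# body = requests.get(f"https://{fhir_server}.azurehealthcareapis.com/metadata").json()
# print(is_capability_statement(body))

# Stand-in responses for illustration
sample_ok = {"resourceType": "CapabilityStatement", "fhirVersion": "4.0.1"}
sample_error = {"resourceType": "OperationOutcome"}
print(is_capability_statement(sample_ok), is_capability_statement(sample_error))
```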

**3.2) Register an App** with permission to read/write data to the FHIR server[<sup>[3]</sup>](#r3) (this notebook will be that "app" and use those permissions)
- Create App: "Azure Active Directory"[<sup>[4]</sup>](#r4) -> left pane "App Registrations" -> top bar "New Registration" -> name `<fhir_app>` -> "Register"
- Navigate to App: "Azure Active Directory"[<sup>[4]</sup>](#r4) -> left pane "App Registrations" -> select `<fhir_app>` -> left pane "Overview"
- Copy the "Application (client) ID" and "Directory (tenant) ID", then set `client_id` and `tenant_id` in [Section 4.1](#globals)
- *More information on app registration:* [[5]](#r5)

**3.3) Create Client Secret** for this notebook to prove that it is the "app", or client <a id="secret"></a>
- Navigate to App: "Azure Active Directory"[<sup>[4]</sup>](#r4) -> left pane "App Registrations" -> select `<fhir_app>`
- Create Secret: left pane "Certificates & Secrets" -> "+ New Client Secret" -> name `<jnb_secret>` -> Add
- Save Secret: copy `<jnb_secret>`'s `Value`, and store as `client_secret` in [Section 4.1](#globals). If you do not copy the `Value` immediately after creation, you will no longer be able to access it, and will need to create a new secret.
- *More information on app registration:* [[5]](#r5)

**3.4) Add Permissions** for this notebook to POST/GET data from the FHIR server[<sup>[3]</sup>](#r3)
- Navigate to "Azure API for FHIR" server[<sup>[3]</sup>](#r3) named `<fhir_server>`
- Select Role: left pane "Access Control (IAM)" -> top bar "+ Add" -> "Add role assignment" -> Role="FHIR Data Contributor"
- Select Members: middle tab "Members" -> "Assign access to: User..." -> "+ Select members" -> search `<fhir_app>` (created in step 3.2) -> "Select" -> Review & Assign
- *More information on Azure Role-Based Access Control:* [[6]](#r6)

## 4. Import Synthea Data to the FHIR Server <a id="4"></a>
Microsoft does have a `fhir-loader` Github project[<sup>[8]</sup>](#r8) for bulk importing data to a FHIR server[<sup>[3]</sup>](#r3), but this workflow wasn't working for me. According to Microsoft documentation[<sup>[9]</sup>](#r9), there is also a `bulk-import` FHIR data option, but the changelog for Azure Healthcare APIs[<sup>[10]</sup>](#r10) shows that this feature was removed in the 2022-05-15 update.  
Instead, I wrote the script below based on an auto-generated Postman[<sup>[11]</sup>](#r11) template.  
Postman[<sup>[11]</sup>](#r11) is a platform for using REST APIs, and there's a tutorial for using it with FHIR here: [[12]](#r12).

**4.1) Set globals** for querying the FHIR API <a id="globals"></a>

In [None]:
resource_group = "<resource_group>"
sub_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

fhir_server = "<fhir_server>"
fhir_url = f"https://{fhir_server}.azurehealthcareapis.com"

tenant_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
client_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
client_secret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

In [None]:
import requests, json
from glob import glob
from urllib.parse import urlencode

**4.2) Request an access token**, using the [previously-generated client secret](#secret).
- *More information on Azure AD Access Tokens:* [[7]](#r7)

In [None]:
# set request parameters
token_url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
payload = {
    'grant_type': 'client_credentials',
    'client_id': client_id,
    'client_secret': client_secret,
    'resource': fhir_url
}
headers = {
    'Content-Type': 'application/x-www-form-urlencoded'
}

# request token from server
response = requests.request("POST", token_url, headers=headers, data=urlencode(payload))
content = json.loads(response.content)

# save token from response
if response.status_code == 200:
    print(f"{content['token_type']} access token retrieved for {content['resource']} successfully.")
    bearer_token = content["access_token"]
else:
    print(f"ERROR: unexpected status code {response.status_code}.")    
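
If later requests are rejected, it can help to inspect what the token actually claims. The sketch below assumes the access token is a JWT (Azure AD access tokens generally are) and decodes its claims without verifying the signature, so it is suitable for debugging only. `decode_jwt_claims` and the fake token are illustrative, not part of any Azure SDK.

```python
import base64
import json

def decode_jwt_claims(token):
    """Decode the claims segment of a JWT WITHOUT verifying its signature."""
    claims_b64 = token.split(".")[1]
    claims_b64 += "=" * (-len(claims_b64) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(claims_b64))

def _b64url(obj):
    """Encode a dict the way JWT segments are encoded (unpadded base64url)."""
    return base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b"=").decode()

# Fake, unsigned token built only for this demonstration
fake_token = ".".join([
    _b64url({"alg": "none"}),
    _b64url({"aud": "https://example.azurehealthcareapis.com", "exp": 1700000000}),
    "signature-placeholder",
])
claims = decode_jwt_claims(fake_token)
print(claims["aud"], claims["exp"])
```

With a real token, `decode_jwt_claims(bearer_token)["aud"]` would be expected to match `fhir_url`, and `exp` gives the expiry as a Unix timestamp.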

**4.3) Add patient data** generated by Synthea[<sup>[1]</sup>](#r1).

In [None]:
patient_url = f"{fhir_url}/Patient"

# parse all Synthea-generated JSON data
for filename in glob("/home/azureuser/cloudfiles/data/datastore/synthea/fhir/*.json"):
    # close each file promptly instead of leaking the handle
    with open(filename, 'r') as json_file:
        json_obj = json.load(json_file)['entry']
    for resource in json_obj:
        
        # skip all non-patient resources (encounters, providers, etc...)
        if resource["resource"]["resourceType"] == "Patient":
            
            # add patient to FHIR database using REST API
            payload = json.dumps(resource["resource"])
            headers = {
              'Authorization': f'Bearer {bearer_token}',
              'Content-Type': 'application/json'
            }

            # verify success
            response = requests.request("POST", patient_url, headers=headers, data=payload)
            if response.status_code >= 200 and response.status_code < 300:
                print(f"Patient {' '.join(json.loads(response.content)['name'][0]['given'])} {json.loads(response.content)['name'][0]['family']} added successfully.")
            else:
                print(f"ERROR: unexpected status code {response.status_code}.")
                print(response.text)
                break

Retrieve the first 10 `Patient`s from the FHIR database to verify the previous step succeeded.

In [None]:
url = f"{fhir_url}/Patient"
headers = {'Authorization': f'Bearer {bearer_token}'}
response = requests.request("GET", url, headers=headers, data={})
try:
    print(f"{len(json.loads(response.content)['entry'])} patients retrieved (first page only)")
except KeyError:
    print("No patients in FHIR database.")
#json.loads(response.content)
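
Because a search returns only one page of results, counting every patient means following the Bundle's `next` links. The sketch below shows one way to do that. `iter_bundle_entries` is a hypothetical helper, and the fetch function is injected so the example runs without a server; the two fake pages are illustrative.

```python
def iter_bundle_entries(first_url, fetch_json):
    """Yield every entry of a paged FHIR search by following `next` links.

    fetch_json is any callable that maps a URL to a parsed Bundle dict.
    """
    url = first_url
    while url:
        bundle = fetch_json(url)
        yield from bundle.get("entry", [])
        # Bundle.link is a list of {"relation", "url"} pairs;
        # the "next" relation is absent on the last page
        url = next(
            (link["url"] for link in bundle.get("link", [])
             if link.get("relation") == "next"),
            None,
        )

# Two fake pages standing in for server responses
fake_pages = {
    "page1": {
        "entry": [{"resource": {"id": "a"}}, {"resource": {"id": "b"}}],
        "link": [{"relation": "self", "url": "page1"},
                 {"relation": "next", "url": "page2"}],
    },
    "page2": {
        "entry": [{"resource": {"id": "c"}}],
        "link": [{"relation": "self", "url": "page2"}],
    },
}
patient_ids = [entry["resource"]["id"]
               for entry in iter_bundle_entries("page1", fake_pages.__getitem__)]
print(patient_ids)
```

Against the live server, the fetch function could be `lambda u: requests.get(u, headers=headers).json()`, starting from `f"{fhir_url}/Patient"`.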

Delete all `Patient`s (use ONLY for resetting FHIR database)

In [None]:
# # looping is required, since GET will only fetch first 10 patients
# while True:
#     # query for list of all patients
#     url = f"{fhir_url}/Patient"
#     headers = {'Authorization': f'Bearer {bearer_token}'}
#     response = requests.request("GET", url, headers=headers, data={})
#     fhir_data = json.loads(response.content)

#     # delete each patient (in chunk of 10)
#     try:
#         for patient in fhir_data["entry"]:
#             url = f"{fhir_url}/Patient/{patient['resource']['id']}"
#             response = requests.request("DELETE", url, headers=headers, data={})
#             if response.status_code >= 200 and response.status_code < 300:
#                 print(f"Patient {' '.join(patient['resource']['name'][0]['given'])} {patient['resource']['name'][0]['family']} deleted successfully.")
#             else:
#                 print(f"ERROR: unexpected status code {response.status_code}.")
#                 print(response.text)
#                 break
#     except KeyError:
#         print("Done!")
#         break

## 5. Set up a FHIR->Synapse Sync Agent <a id="5"></a>
This notebook section follows the "FHIR to Synapse Sync Agent" tutorial provided by Microsoft's "FHIR Analytics Pipelines" GitHub repository[<sup>[13]</sup>](#r13).

**5.1) Deploy the custom Azure template** provided by the "FHIR to Synapse Sync Agent" tutorial[<sup>[13]</sup>](#r13).
- Navigate to the Github repo by clicking [this link](https://github.com/microsoft/FHIR-Analytics-Pipelines/blob/main/FhirToDataLake/docs/Deployment.md).
- Scroll down to "Deployment", then "1. Deploy the Pipeline", and click the blue button "Deploy to Azure"

**5.2) Add Permissions** for Function App `<sync_agent>` to read FHIR data
- `<sync_agent>` left pane "Identity" -> "Azure role assignments" -> add "FHIR Data Reader" (`<resource_group>`)

**5.3) Verify Parquet Conversion**
- Add Datastore: `<azure_ml>` left pane "Datastores" -> "New datastore" -> `fhir_parquet`
- Mount Datastore: `<azure_ml>` left pane "Compute" -> select instance -> "Data (preview)" -> "Mount" -> "Azure Storage" -> `fhir_parquet`

## 6. Convert PacBio VCF data to Parquet <a id="6"></a>

**6.1) Mount the PacBio container** on the running AzureML[<sup>[0]</sup>](#r0) machine.
- **Add DataStore:** "AzureML"[<sup>[0]</sup>](#r0) -> "Datastores" -> "New datastore" -> `pacbio`
- **Mount:** "AzureML"[<sup>[0]</sup>](#r0) -> "Compute" -> select instance -> "Data (preview)" -> "Mount" -> "Azure Storage" -> `pacbio`

**6.2) Install `bcftools`**
- The `bcftools` documentation[<sup>[14]</sup>](#r14) can be found [here](https://samtools.github.io/bcftools/bcftools.html).
- Try `sudo apt install bcftools`. If that doesn't work, then use the following:

In [None]:
%cd /home/azureuser/cloudfiles/code
!sudo apt-get -y install liblzma-dev libbz2-dev
!wget https://github.com/samtools/bcftools/releases/download/1.15.1/bcftools-1.15.1.tar.bz2
!tar -xf bcftools-1.15.1.tar.bz2
%cd bcftools-1.15.1
!sudo make install
!bcftools

In [None]:
!bcftools view -H /home/azureuser/cloudfiles/data/datastore/pacbio/cmh001440/joint/whatshap/cmh001440.joint.vcf.gz | head
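
For orientation when reading the output above: each VCF data line has eight fixed tab-separated columns, then an optional `FORMAT` column and one column per sample. The sketch below splits one such line into a dictionary. `parse_vcf_record` is an illustrative helper and the sample line is made up, not taken from the PacBio data.

```python
VCF_FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_vcf_record(line, sample_names):
    """Split one VCF data line into its fixed fields and per-sample values."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_FIXED_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # multiple ALT alleles are comma-separated
    if len(fields) > 9:
        format_keys = fields[8].split(":")  # e.g. GT:DP
        record["samples"] = {
            name: dict(zip(format_keys, values.split(":")))
            for name, values in zip(sample_names, fields[9:])
        }
    return record

# Made-up record for illustration
sample_line = "chr1\t10492\t.\tC\tT\t31.2\tPASS\t.\tGT:DP\t0|1:24\n"
record = parse_vcf_record(sample_line, ["sample1"])
print(record["CHROM"], record["POS"], record["samples"]["sample1"]["GT"])
```

This is only meant to explain the layout; the conversion in the next step relies on `bcftools query` rather than hand-written parsing.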

**6.3) Convert all PacBio VCFs to TSV**

- Create a `pacbio_tsv` container in the same storage account as the `pacbio` VCF data. 
- Mount it to the AzureML instance.
- Convert all VCF data to TSV:

In [None]:
from glob import glob
import os, subprocess
import pandas as pd
pb_dir = "/home/azureuser/cloudfiles/data/datastore/pacbio"
max_patients = 10

In [None]:
patients = 0
for vcf_fn in glob(f"{pb_dir}/**/*.vcf.gz", recursive=True): # no gvcfs
#for vcf_fn in glob(f"{pb_dir}/**/*.*vcf.gz", recursive=True): # allow gvcfs
    patient = vcf_fn.split("/")[7]
    print(f"Converting Patient {patient} data to TSV.")
    subprocess.run(["bcftools", "query", 
        "--print-header",
        "-f", "%CHROM\t%POS\t%TYPE\t%REF\t%ALT[\t%GT]\n", vcf_fn,
        "-o", f"{pb_dir}_tsv/{patient}.tsv"
    ])
    patients += 1
    if patients >= max_patients: 
        print(f"Max patient count of {max_patients} reached.")
        break
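
The loop above reads the patient ID from a fixed position in the path (`split("/")[7]`), which only works while the datastore is mounted at exactly this depth. A possible alternative, sketched below, takes the first directory under `pb_dir` instead. `patient_from_path` is a hypothetical helper, and it assumes each patient has its own top-level folder in the container.

```python
from pathlib import PurePosixPath

def patient_from_path(vcf_path, base_dir):
    """Return the first directory component of vcf_path below base_dir."""
    relative = PurePosixPath(vcf_path).relative_to(base_dir)
    return relative.parts[0]

example_path = ("/home/azureuser/cloudfiles/data/datastore/pacbio/"
                "cmh001440/joint/whatshap/cmh001440.joint.vcf.gz")
example_patient = patient_from_path(
    example_path, "/home/azureuser/cloudfiles/data/datastore/pacbio")
print(example_patient)  # cmh001440
```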

In [None]:
!head /home/azureuser/cloudfiles/data/datastore/pacbio_tsv/cmh001440.tsv

**6.4) Convert TSVs to Parquet**  
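TSV data can be converted to Parquet[<sup>[15]</sup>](#r15) format using `pandas`[<sup>[16]</sup>](#r16). The cell below writes to `{pb_dir}_parquet`, so create and mount a `pacbio_parquet` container first, as was done for `pacbio_tsv`.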

In [None]:
for tsv_fn in glob(f"{pb_dir}_tsv/*.tsv"):
    patient = os.path.splitext(tsv_fn.split("/")[-1])[0]
    print(f"converting {patient}...", end="")
    pb_df = pd.read_csv(tsv_fn, delimiter="\t")
    pb_df.columns = [col.split("]")[1] for col in pb_df.columns] # remove prefixes
    pb_df.to_parquet(f"{pb_dir}_parquet/{patient}.parquet")
    print("done!")
pb_df.head()
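
The "remove prefixes" step exists because `bcftools query --print-header` numbers each column in the header line. The sketch below shows the renaming on its own, using only the standard library. The sample header is an assumption about the general shape of that output (the exact leading `#` formatting can differ between bcftools versions), and `clean_bcftools_column` is an illustrative helper.

```python
import csv
import io

def clean_bcftools_column(col):
    """Strip the '[N]' index prefix (and any leading '#') from a header field."""
    if "]" in col:
        return col.split("]", 1)[1]
    return col.lstrip("# ")

# Assumed header shape plus one made-up data row
sample_tsv = (
    "# [1]CHROM\t[2]POS\t[3]TYPE\t[4]REF\t[5]ALT\t[6]sample1:GT\n"
    "chr1\t10492\tSNP\tC\tT\t0|1\n"
)
reader = csv.reader(io.StringIO(sample_tsv), delimiter="\t")
header = [clean_bcftools_column(col) for col in next(reader)]
first_row = dict(zip(header, next(reader)))
print(header)
```

With `pandas`, the same function could be applied as `pb_df.columns = [clean_bcftools_column(c) for c in pb_df.columns]`.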

In [None]:
for parquet_fn in glob(f"{pb_dir}_parquet/*.parquet"):
    patient = os.path.splitext(parquet_fn.split("/")[-1])[0]
    pb_df = pd.read_parquet(parquet_fn)
pb_df.head()

## 7. Use Azure Synapse for Analytics <a id="7"></a>
Azure Synapse Analytics[<sup>[17]</sup>](#r17) is an enterprise-scale data analytics service, perfect for working with large datasets.  
Please transfer over to **this notebook** to continue the walkthrough and load our Parquet data into a Synapse workspace.

## References <a id="references"></a>
[0] <a id="r0"></a> Azure Machine Learning: https://docs.microsoft.com/en-us/azure/machine-learning/  
[1] <a id="r1"></a> Walonoski, Jason, et al. "Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record." Journal of the American Medical Informatics Association 25.3 (2018): 230-238. The MITRE Corporation. https://github.com/synthetichealth/synthea  
[2] <a id="r2"></a> FHIR HL7: http://hl7.org/fhir/index.html  
[3] <a id="r3"></a> Azure API for FHIR: https://docs.microsoft.com/en-us/azure/healthcare-apis/fhir/overview  
[4] <a id="r4"></a> Azure Active Directory: https://docs.microsoft.com/en-us/azure/active-directory/fundamentals/active-directory-whatis  
[5] <a id="r5"></a> Azure App Registration: https://docs.microsoft.com/en-us/azure/healthcare-apis/register-application  
[6] <a id="r6"></a> Azure Role Based Access Control (RBAC): https://docs.microsoft.com/en-us/azure/healthcare-apis/azure-api-for-fhir/configure-azure-rbac  
[7] <a id="r7"></a> Azure AD Access Tokens: https://docs.microsoft.com/en-us/azure/healthcare-apis/azure-api-for-fhir/azure-active-directory-identity-configuration  
[8] <a id="r8"></a> Microsoft `fhir-loader`: https://github.com/microsoft/fhir-loader  
[9] <a id="r9"></a> Azure `bulk-import`: https://docs.microsoft.com/en-us/azure/healthcare-apis/fhir/configure-import-data  
[10] <a id="r10"></a> Azure Healthcare APIs changelog: https://docs.microsoft.com/en-us/azure/templates/microsoft.healthcareapis/change-log/services  
[11] <a id="r11"></a> Postman API Platform: https://www.postman.com/  
[12] <a id="r12"></a> Postman FHIR Tutorial: https://docs.microsoft.com/en-us/azure/healthcare-apis/fhir/use-postman  
[13] <a id="r13"></a> FHIR to Synapse Sync Agent Tutorial: https://github.com/microsoft/FHIR-Analytics-Pipelines/blob/main/FhirToDataLake/docs/Deployment.md  
[14] <a id="r14"></a> `bcftools` Documentation: https://samtools.github.io/bcftools/bcftools.html  
[15] <a id="r15"></a> Parquet File Format: https://spark.apache.org/docs/latest/sql-data-sources-parquet.html  
[16] <a id="r16"></a> Python `pandas` library: https://pandas.pydata.org/  
[17] <a id="r17"></a> Azure Synapse Analytics: https://docs.microsoft.com/en-us/azure/synapse-analytics/