# Convert Synthetic FHIR and PacBio VCF Data to Parquet and Explore with Azure Synapse Analytics
### Overview
In **Sections 1-4**, this notebook generates synthetic FHIR data using Synthea, uploads the data to a new FHIR server, and downloads the data in Parquet format.  
In **Section 5**, this notebook assumes that PacBio VCFs are available as a second input, and converts them to Parquet using `bcftools` and `pandas`.  
In **Section 6**, Azure Synapse Analytics is then used to perform joint queries on both datasets.

#### Please click this [link](https://github.com/microsoft/genomicsnotebook/blob/main/docs/fhir_long_read_1.JPG) to view the high-level implementation design of this notebook.
### Contents
[1. Generate Synthetic FHIR Data with Synthea](#2)  
[2. Configure a FHIR Server](#3)  
[3. Import Synthea Data to the FHIR Server](#4)  
[4. Set up a FHIR->Synapse Sync Agent](#5)  
[5. Convert PacBio VCF data to Parquet](#6)  
[6. Use Azure Synapse for Analytics](#7)  
[References](#references)

## 1. Generate Synthetic FHIR Data with Synthea<a id="2"></a> 

According to the project's GitHub page,
> Synthea[<sup>[1]</sup>](#r1) is a Synthetic Patient Population Simulator.  
> The goal is to output synthetic, realistic (but not real), patient data and associated health records in a variety of formats.  

The health record output format used in this notebook is "Fast Healthcare Interoperability Resources", or FHIR[<sup>[2]</sup>](#r2).  
FHIR is the leading standard for health care data exchange, published by HL7.  
In order to set up Synthea[<sup>[1]</sup>](#r1), first reinstall JDK (otherwise `javadoc` fails during Gradle build).

In [None]:
!sudo apt-get install -y default-jdk

Download the latest stable version of the Synthea[<sup>[1]</sup>](#r1) git repository.

In [None]:
!git clone https://github.com/synthetichealth/synthea.git -b v3.0.0

Move to `synthea/` before building and running Synthea[<sup>[1]</sup>](#r1).

In [None]:
%cd synthea

Build Synthea[<sup>[1]</sup>](#r1) using Gradle 7.0.2 and Java 11.0.15. On Azure `DS11_V2`, this takes anywhere from 30 minutes to 2 hours.

In [None]:
!./gradlew build test check

Verify that the build/installation has succeeded.

In [None]:
!./run_synthea -h

Generate a test dataset of 10 patients (Synthea[<sup>[1]</sup>](#r1) only counts `Alive` patients, so there will likely be more).

In [None]:
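import subprocess

# -s: random seed, -cs: clinician seed, -p: population size
subprocess.run(["./run_synthea",
    "-s", "42",
    "-cs", "99",
    "-p", "10",
    "--exporter.baseDirectory=/mnt/batch/tasks/shared/LS_root/mounts/clusters/<USERNAME>/code"
], check=True)  # raise an error if Synthea exits with a non-zero status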

Verify that the Synthea[<sup>[1]</sup>](#r1) data generation succeeded:

In [None]:
!ls /mnt/batch/tasks/shared/LS_root/mounts/clusters/<USERNAME>/code | wc -l
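
A line count only shows that some files were written. As a slightly more informative check, the sketch below tallies FHIR resource types across the generated bundles. It assumes that Synthea wrote its FHIR bundles as `*.json` files into a `fhir/` subdirectory of the exporter base directory; `count_resource_types` is a hypothetical helper, not part of Synthea.

```python
import json
from collections import Counter
from pathlib import Path


def count_resource_types(fhir_dir):
    """Tally FHIR resource types across every bundle (*.json) in fhir_dir."""
    counts = Counter()
    for path in sorted(Path(fhir_dir).glob("*.json")):
        with open(path, "r", encoding="utf-8") as handle:
            bundle = json.load(handle)
        # each bundle lists its resources under "entry"
        for entry in bundle.get("entry", []):
            counts[entry["resource"]["resourceType"]] += 1
    return counts
```

Calling `count_resource_types(...)` on the `fhir/` output directory and reading the `"Patient"` key should give the number of generated patients, living or deceased.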

## 2. Configure a FHIR Server <a id="3"></a>

**2.0) Identify Resource Group and Subscription**
- Navigate to your desired Azure "Resource Group", note the name -> "Overview" -> copy "Subscription ID"
- Set `resource_group` and `sub_id` in [Section 3.1](#globals)

**2.1) Create an "Azure API for FHIR"[<sup>[3]</sup>](#r3) instance**, named `<fhir_server>`
- Navigate to `https://<fhir_server>.azurehealthcareapis.com/metadata` and verify a "Capability Statement" is retrieved.  
That means the FHIR server[<sup>[3]</sup>](#r3) is running.
- Set `fhir_server` in [Section 3.1](#globals)
- Use RBAC[<sup>[6]</sup>](#r6): `<fhir_server>` left pane "Identity" -> "On" -> "Save"
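
The browser check in step 2.1 can also be scripted. The following is a minimal sketch using only the standard library; `is_capability_statement` and `check_fhir_server` are hypothetical helper names, and the sketch assumes the `/metadata` endpoint can be read without an access token, as the browser check implies.

```python
import json
import urllib.request


def is_capability_statement(document):
    """Return True if the parsed JSON looks like a FHIR CapabilityStatement."""
    return isinstance(document, dict) and document.get("resourceType") == "CapabilityStatement"


def check_fhir_server(fhir_url, timeout=30):
    """Fetch <fhir_url>/metadata and report whether a CapabilityStatement came back."""
    with urllib.request.urlopen(f"{fhir_url}/metadata", timeout=timeout) as response:
        return is_capability_statement(json.load(response))
```

For example, `check_fhir_server("https://<fhir_server>.azurehealthcareapis.com")` should return `True` once the server is running.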

**2.2) Register an App** with permission to read/write data to the FHIR server[<sup>[3]</sup>](#r3) (this notebook will be that "app" and use those permissions)
- Create App: "Azure Active Directory"[<sup>[4]</sup>](#r4) -> left pane "App Registrations" -> top bar "New Registration" -> name `<fhir_app>` -> "Register"
- Navigate to App: "Azure Active Directory"[<sup>[4]</sup>](#r4) -> left pane "App Registrations" -> select `<fhir_app>` -> left pane "Overview"
- Copy the "Application (client) ID" and "Directory (tenant) ID", then set `client_id` and `tenant_id` in [Section 3.1](#globals)
- *More information on app registration:* [[5]](#r5)

**2.3) Create Client Secret** for this notebook to prove that it is the "app", or client <a id="secret"></a>
- Navigate to App: "Azure Active Directory"[<sup>[4]</sup>](#r4) -> left pane "App Registrations" -> select `<fhir_app>`
- Create Secret: left pane "Certificates & Secrets" -> "+ New Client Secret" -> name `<jnb_secret>` -> Add
- Save Secret: copy `<jnb_secret>`'s `Value`, and store as `client_secret` in [Section 3.1](#globals). If you do not copy the `Value` immediately after creation, you will no longer be able to access it, and will need to create a new secret.
- *More information on app registration:* [[5]](#r5)

**2.4) Add Permissions** for this notebook to POST/GET data from the FHIR server[<sup>[3]</sup>](#r3)
- Navigate to "Azure API for FHIR" server[<sup>[3]</sup>](#r3) named `<fhir_server>`
- Select Role: left pane "Access Control (IAM)" -> top bar "+ Add" -> "Add role assignment" -> Role="FHIR Data Contributor"
- Select Members: middle tab "Members" -> "Assign access to: User..." -> "+ Select members" -> search `<fhir_app>` (created in step 2.2) -> "Select" -> Review & Assign
- *More information on Azure Role-Based Access Control:* [[6]](#r6)

## 3. Import Synthea Data to the FHIR Server <a id="4"></a>
We wrote the script below based on an auto-generated Postman[<sup>[11]</sup>](#r11) template.  
Postman[<sup>[11]</sup>](#r11) is a platform for using REST APIs, and there's a tutorial for using it with FHIR here: [[12]](#r12).

**3.1) Set globals** for querying the FHIR API <a id="globals"></a>

In [None]:
resource_group = "<resource_group>"
sub_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

fhir_server = "<fhir_server>"
fhir_url = f"https://{fhir_server}.azurehealthcareapis.com"

tenant_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
client_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
client_secret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
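
Pasting the client secret into a notebook cell makes it easy to leak when the notebook is shared. One alternative, sketched below, is to read the three app registration values from environment variables. The variable names (`FHIR_TENANT_ID`, `FHIR_CLIENT_ID`, `FHIR_CLIENT_SECRET`) and the `load_credentials` helper are illustrative assumptions, not something this notebook or Azure defines.

```python
import os

# hypothetical environment variable names; choose whatever fits your setup
REQUIRED_VARS = ("FHIR_TENANT_ID", "FHIR_CLIENT_ID", "FHIR_CLIENT_SECRET")


def load_credentials(env=None):
    """Read the app registration values from the environment, failing loudly if any is unset."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARS}
```

The returned dictionary can then populate `tenant_id`, `client_id`, and `client_secret` instead of the literal strings above.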

In [None]:
import requests, json
from glob import glob
from urllib.parse import urlencode

**3.2) Request an access token**, using the [previously-generated client secret](#secret).
- *More information on Azure AD Access Tokens:* [[7]](#r7)

In [None]:
# set request parameters
token_url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
payload = {
    'grant_type': 'client_credentials',  # OAuth 2.0 grant type names are lowercase
    'client_id': client_id,
    'client_secret': client_secret,
    'resource': fhir_url
}
headers = {
    'Content-Type': 'application/x-www-form-urlencoded'
}

# request token from server
response = requests.request("POST", token_url, headers=headers, data=urlencode(payload))

# save token from response
if response.status_code == 200:
    content = json.loads(response.content)
    print(f"{content['token_type']} access token retrieved for {content['resource']} successfully.")
    bearer_token = content["access_token"]
else:
    print(f"ERROR: unexpected status code {response.status_code}.")
    print(response.text)
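
Access tokens expire, so a long notebook session may need a fresh one before the later cells run. The sketch below assumes that the token response contains an `expires_on` field holding a Unix timestamp (possibly as a string); `token_is_fresh` and the five-minute safety margin are illustrative choices.

```python
import time


def token_is_fresh(token_response, margin_seconds=300, now=None):
    """Return True if the token will remain valid for at least margin_seconds."""
    now = time.time() if now is None else now
    # "expires_on" is assumed to be seconds since the Unix epoch
    expires_on = float(token_response.get("expires_on", 0))
    return expires_on - now >= margin_seconds
```

If `token_is_fresh(content)` returns `False`, rerun the token request cell before continuing.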

**3.3) Add patient data** generated by Synthea[<sup>[1]</sup>](#r1).

In [None]:
patient_url = f"{fhir_url}/Patient"
headers = {
    'Authorization': f'Bearer {bearer_token}',
    'Content-Type': 'application/json'
}

# parse all Synthea-generated JSON data
for filename in glob("/home/azureuser/cloudfiles/data/datastore/synthea/fhir/*.json"):
    with open(filename, 'r') as json_file:
        json_obj = json.load(json_file)['entry']
    for resource in json_obj:

        # skip all non-patient resources (encounters, providers, etc...)
        if resource["resource"]["resourceType"] == "Patient":

            # add patient to FHIR database using REST API
            payload = json.dumps(resource["resource"])
            response = requests.request("POST", patient_url, headers=headers, data=payload)

            # verify success
            if 200 <= response.status_code < 300:
                name = json.loads(response.content)['name'][0]
                print(f"Patient {' '.join(name['given'])} {name['family']} added successfully.")
            else:
                print(f"ERROR: unexpected status code {response.status_code}.")
                print(response.text)
                break  # stops this file only; the outer loop moves on to the next file

Retrieve the first 10 `Patient`s from the FHIR database to verify the previous step succeeded.

In [None]:
url = f"{fhir_url}/Patient"
headers = {'Authorization': f'Bearer {bearer_token}'}
response = requests.request("GET", url, headers=headers, data={})
try:
    print(f"{len(json.loads(response.content)['entry'])} patients retrieved (first page only)")
except KeyError:
    print("No patients in FHIR database.")
#json.loads(response.content)
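
Because a search returns only one page of results, counting every patient means following the paging links. A FHIR search Bundle advertises its next page through a `link` entry whose `relation` is `"next"`. The sketch below walks those links; `next_page_url` and `iter_entries` are hypothetical helpers, and the HTTP call is passed in as a function so that the paging logic stays independent of `requests`.

```python
def next_page_url(bundle):
    """Return the URL of the next page of a FHIR search Bundle, or None on the last page."""
    for link in bundle.get("link", []):
        if link.get("relation") == "next":
            return link.get("url")
    return None


def iter_entries(first_url, fetch_bundle):
    """Yield every entry across all pages; fetch_bundle(url) must return a parsed Bundle."""
    url = first_url
    while url:
        bundle = fetch_bundle(url)
        yield from bundle.get("entry", [])
        url = next_page_url(bundle)
```

In this notebook, `fetch_bundle` could be `lambda u: requests.get(u, headers=headers).json()`, and `sum(1 for _ in iter_entries(f"{fhir_url}/Patient", fetch_bundle))` would then count all patients rather than just the first page.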

Delete all `Patient`s (use ONLY for resetting the FHIR database).

In [None]:
# # looping is required, since GET will only fetch first 10 patients
# while True:
#     # query for list of all patients
#     url = f"{fhir_url}/Patient"
#     headers = {'Authorization': f'Bearer {bearer_token}'}
#     response = requests.request("GET", url, headers=headers, data={})
#     fhir_data = json.loads(response.content)

#     # delete each patient (in chunk of 10)
#     try:
#         for patient in fhir_data["entry"]:
#             url = f"{fhir_url}/Patient/{patient['resource']['id']}"
#             response = requests.request("DELETE", url, headers=headers, data={})
#             if response.status_code >= 200 and response.status_code < 300:
#                 print(f"Patient {' '.join(patient['resource']['name'][0]['given'])} {patient['resource']['name'][0]['family']} deleted successfully.")
#             else:
#                 print(f"ERROR: unexpected status code {response.status_code}.")
#                 print(response.text)
#                 break
#     except KeyError:
#         print("Done!")
#         break

## 4. Set up the FHIR->Synapse Sync Agent <a id="5"></a>
This section follows the "FHIR to Synapse Sync Agent" tutorial provided by Microsoft's "FHIR Analytics Pipelines" GitHub repository[<sup>[13]</sup>](#r13).

**4.1) Deploy the custom Azure template** provided by the "FHIR to Synapse Sync Agent" tutorial[<sup>[13]</sup>](#r13).
- Navigate to the Github repo by clicking [this link](https://github.com/microsoft/FHIR-Analytics-Pipelines/blob/main/FhirToDataLake/docs/Deploy-FhirToDatalake.md).
- Scroll down to "Deployment", then "1. Deploy the Pipeline", and click the blue button "Deploy to Azure"
- App Name `<sync_agent>` -> set FHIR URL -> Authentication `true` -> Container Name `fhir` -> "Review + Create" -> "Create"
- Note: Do NOT use dashes or underscores in the container name; otherwise the deployment fails silently.
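
Since a bad container name fails silently, it can help to check the name before deploying. Azure Blob Storage container names are limited to 3-63 characters drawn from lowercase letters, digits, and hyphens; the note above additionally rules out hyphens for this template. The sketch below encodes that stricter rule; `is_safe_container_name` is a hypothetical helper.

```python
import re

# lowercase letters and digits only, 3 to 63 characters
_SAFE_CONTAINER_NAME = re.compile(r"[a-z0-9]{3,63}")


def is_safe_container_name(name):
    """Return True if name contains only lowercase letters and digits and has a valid length."""
    return _SAFE_CONTAINER_NAME.fullmatch(name) is not None
```

With this check, `fhir` passes while `fhir-data` and `fhir_data` are rejected.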

**4.2) Add Permissions** for Function App `<sync_agent>` to read FHIR data
- `<sync_agent>` left pane "Identity" -> "Azure role assignments" -> "Resource Group" -> `<resource_group>` -> "FHIR Data Contributor"

**4.3) Mount Parquet Data** on the running AzureML[<sup>[0]</sup>](#r0) machine.
- Add Datastore: `<azure_ml>` left pane "Datastores" -> "+ New datastore" -> name `fhir` -> "Azure Blob Storage" -> `<sync_agent_storage>` -> `fhir` -> `<keys from next step>` -> "Create"
- Get Storage keys: `<sync_agent_storage>` left pane "Access keys" -> "Show" -> "Copy to clipboard"
- Mount Datastore: `<azure_ml>` left pane "Compute" -> `<azure_ml>` instance -> top bar "Data (preview)" -> "Mount" -> "Azure Storage" -> select `fhir` -> name `fhir`

## 5. Convert PacBio VCF data to Parquet <a id="6"></a>

**5.1) Mount the PacBio container** on the running AzureML[<sup>[0]</sup>](#r0) machine.

- Add Datastore: `<azure_ml>` left pane "Datastores" -> "+ New datastore" -> name `pacbio` -> "Azure Blob Storage" -> `<storage_account>` -> `pacbio` -> `<keys from next step>` -> "Create"
- Get Storage keys: `<storage_account>` left pane "Access keys" -> "Show" -> "Copy to clipboard"
- Mount Datastore: `<azure_ml>` left pane "Compute" -> `<azure_ml>` instance -> top bar "Data (preview)" -> "Mount" -> "Azure Storage" -> select `pacbio` -> name `pacbio`

**5.2) Install `bcftools`**  
The `bcftools` documentation[<sup>[14]</sup>](#r14) can be found [here](https://samtools.github.io/bcftools/bcftools.html).

In [None]:
!sudo apt-get install -y bcftools

**5.3) Convert all PacBio VCFs to TSV**

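- Create `pacbio-tsv` and `pacbio-parquet` containers in the same storage account as the `pacbio` VCF data (Azure container names cannot contain underscores).
- Mount them to the AzureML instance under the names `pacbio_tsv` and `pacbio_parquet` (as in Section 5.1), so that they sit next to the `pacbio` mount.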
- Convert all VCF data to TSV:

In [None]:
from glob import glob
import os, subprocess
import pandas as pd
pb_dir = "/***/pacbio"

In [None]:
for vcf_fn in glob(f"{pb_dir}/**/*.vcf.gz", recursive=True): # no gvcfs
#for vcf_fn in glob(f"{pb_dir}/**/*.*vcf.gz", recursive=True): # allow gvcfs
    patient = vcf_fn.split("/")[7]
    print(f"Converting Patient {patient} data to TSV.")
    subprocess.run(["bcftools", "query", 
        "--print-header",
        "-f", "%CHROM\t%POS\t%TYPE\t%REF\t%ALT[\t%GT]\n", vcf_fn,
        "-o", f"{pb_dir}_tsv/{patient}.tsv"
    ])
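
The expression `vcf_fn.split("/")[7]` ties the patient ID to the depth of the absolute mount path, so it breaks if the data is mounted elsewhere. The sketch below derives the ID relative to `pb_dir` instead. It assumes that each patient's VCFs sit under a top-level directory named after the patient, which is a guess about the container layout that this notebook does not state; `patient_from_path` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def patient_from_path(vcf_fn, pb_dir):
    """Return the first path component below pb_dir (assumed to be the patient ID)."""
    relative = PurePosixPath(vcf_fn).relative_to(pb_dir)
    if len(relative.parts) < 2:
        raise ValueError(f"{vcf_fn} is not inside a patient directory under {pb_dir}")
    return relative.parts[0]
```

Inside the loop, `patient = patient_from_path(vcf_fn, pb_dir)` would replace the fixed index if the layout assumption holds.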

**5.4) Convert TSVs to Parquet**  
TSV data can be converted to Parquet[<sup>[15]</sup>](#r15) format using `pandas`[<sup>[16]</sup>](#r16).

In [None]:
all_pb_dfs = []
for tsv_fn in glob(f"{pb_dir}_tsv/*.tsv"):

    # read input .tsv
    patient = os.path.splitext(os.path.basename(tsv_fn))[0]
    print(f"converting {patient}...", end="")
    pb_df = pd.read_csv(tsv_fn, delimiter="\t")

    # new column names: drop the "[N]" index and any "SAMPLE:" prefix
    cols = []
    for col in pb_df.columns:
        new_col = col.split("]")[1]
        if ":" in new_col:
            new_col = new_col.split(":", 1)[1]
        cols.append(new_col)
    pb_df.columns = cols
    pb_df['PATIENT'] = patient

    # add df to list
    all_pb_dfs.append(pb_df)
    print("done!")

# output to .parquet
pb_df = pd.concat(all_pb_dfs)
pb_df.to_parquet(f"{pb_dir}_parquet/10_patients.parquet")
print(pb_df.head())
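
To make the column renaming concrete: the loop above expects header fields of the form `[N]NAME` for fixed columns and `[N]SAMPLE:TAG` for per-sample columns. The sketch below isolates that step in a small function with worked examples. `clean_column` is a hypothetical helper, the sample name `HG002` is illustrative, and unlike the loop above it passes through a field that has no `[N]` index instead of raising an error.

```python
def clean_column(name):
    """Strip the "[N]" index and any "SAMPLE:" prefix from a bcftools query header field."""
    # "# [1]CHROM" -> "CHROM"; "[6]HG002:GT" -> "HG002:GT"
    name = name.split("]", 1)[-1]
    # "HG002:GT" -> "GT"; names without a colon are returned unchanged
    return name.split(":", 1)[-1]
```

Applied to a DataFrame, this becomes `pb_df.columns = [clean_column(c) for c in pb_df.columns]`.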

In [None]:
pb_df = pd.read_parquet(f"{pb_dir}_parquet/10_patients.parquet")
pb_df.head()
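
As a quick sanity check on the combined Parquet file, the sketch below counts variants per patient and variant type. It relies only on the `PATIENT` and `TYPE` columns created above; `variant_counts` is a hypothetical helper.

```python
import pandas as pd


def variant_counts(df):
    """Count variants per patient and variant type as a PATIENT x TYPE table."""
    return (
        df.groupby(["PATIENT", "TYPE"])
        .size()
        .unstack(fill_value=0)  # one column per variant type, 0 where a patient has none
    )
```

Calling `variant_counts(pb_df)` gives one row per patient, which makes a missing or truncated sample easy to spot.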

## 6. Use Azure Synapse for Analytics to Explore Parquet files <a id="7"></a>
Azure Synapse Analytics[<sup>[17]</sup>](#r17) is an enterprise-scale data analytics service, perfect for working with large datasets.  

Switch to Azure Synapse Studio to load the Parquet data into a Synapse workspace. The sample commands below show one way to do this.

For further sample tables and queries, see https://techcommunity.microsoft.com/t5/healthcare-and-life-sciences/combine-and-explore-fhir-server-and-genomics-data-in-azure/ba-p/3298335

Sample command: import the synthetic FHIR data into Azure Synapse Studio:

In [None]:
%%pyspark
df_fhir = spark.read.load('abfss://<PARQUET LOCATION>@<SYNAPSE STORAGE>.dfs.core.windows.net/FILENAME.parquet', format='parquet')
df_fhir.createOrReplaceTempView("fhir_table")
display(df_fhir.limit(15))

Sample command: import the sample PacBio data into Azure Synapse Studio:

In [None]:
%%pyspark
blob_account_name = "<STORAGE ACCOUNT NAME>"
blob_container_name = "pacbio-parquet"
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate()
token_library = sc._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
blob_sas_token = token_library.getConnectionString("<GENOMICS PARQUET FILE LOCATION>")

spark.conf.set(
    'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
    blob_sas_token)
df_pacbio = spark.read.load('wasbs://pacbio-parquet@<STORAGE ACCOUNT NAME>.blob.core.windows.net/FILENAME.parquet', format='parquet')
df_pacbio.createOrReplaceTempView("pacbio_table")
display(df_pacbio.limit(5))
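
With both temporary views registered, a joint query comes down to a Spark SQL join. The sketch below only builds the query string, so it can be inspected before it is run. The join keys are assumptions: `PATIENT` is the column added in Section 5.4, while the matching FHIR column (passed as `fhir_key`) depends on how patient identifiers in the two datasets relate, which this notebook does not establish. `build_join_query` is a hypothetical helper.

```python
def build_join_query(fhir_key, pacbio_key="PATIENT",
                     fhir_view="fhir_table", pacbio_view="pacbio_table"):
    """Build a Spark SQL inner join between the two temporary views registered above."""
    return (
        "SELECT f.*, p.CHROM, p.POS, p.TYPE, p.REF, p.ALT "
        f"FROM {fhir_view} AS f "
        f"JOIN {pacbio_view} AS p "
        f"ON f.{fhir_key} = p.{pacbio_key}"
    )
```

In a Synapse notebook, the result could be run with `display(spark.sql(build_join_query(fhir_key="id")).limit(15))`, where `id` stands in for whichever FHIR column actually holds the shared patient identifier.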

## References <a id="references"></a>
[0]  <a id="r0"></a> Azure Machine Learning: https://docs.microsoft.com/en-us/azure/machine-learning/  
[1]  <a id="r1"></a> Walonoski, Jason, et al. "Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record." Journal of the American Medical Informatics Association 25.3 (2018): 230-238. The MITRE Corporation. https://github.com/synthetichealth/synthea  
[2]  <a id="r2"></a> FHIR HL7: http://hl7.org/fhir/index.html  
[3]  <a id="r3"></a> Azure API for FHIR: https://docs.microsoft.com/en-us/azure/healthcare-apis/fhir/overview  
[4] <a id="r4"></a> Azure Active Directory: https://docs.microsoft.com/en-us/azure/active-directory/fundamentals/active-directory-whatis  
[5] <a id="r5"></a> Azure App Registration: https://docs.microsoft.com/en-us/azure/healthcare-apis/register-application  
[6] <a id="r6"></a> Azure Role Based Access Control (RBAC): https://docs.microsoft.com/en-us/azure/healthcare-apis/azure-api-for-fhir/configure-azure-rbac  
[7] <a id="r7"></a> Azure AD Access Tokens: https://docs.microsoft.com/en-us/azure/healthcare-apis/azure-api-for-fhir/azure-active-directory-identity-configuration  
[8] <a id="r8"></a> Microsoft `fhir-loader`: https://github.com/microsoft/fhir-loader  
[9] <a id="r9"></a> Azure `bulk-import`: https://docs.microsoft.com/en-us/azure/healthcare-apis/fhir/configure-import-data  
[10] <a id="r10"></a> Azure Healthcare APIs changelog: https://docs.microsoft.com/en-us/azure/templates/microsoft.healthcareapis/change-log/services  
[11] <a id="r11"></a> Postman API Platform: https://www.postman.com/  
[12] <a id="r12"></a> Postman FHIR Tutorial: https://docs.microsoft.com/en-us/azure/healthcare-apis/fhir/use-postman  
[13] <a id="r13"></a> FHIR to Synapse Sync Agent Tutorial: https://github.com/microsoft/FHIR-Analytics-Pipelines/blob/main/FhirToDataLake/docs/Deployment.md  
[14] <a id="r14"></a> `bcftools` Documentation: https://samtools.github.io/bcftools/bcftools.html  
[15] <a id="r15"></a> Parquet File Format: https://spark.apache.org/docs/latest/sql-data-sources-parquet.html  
[16] <a id="r16"></a> Python `pandas` library: https://pandas.pydata.org/  
[17] <a id="r17"></a> Azure Synapse Analytics: https://docs.microsoft.com/en-us/azure/synapse-analytics/

# Notices


THIS NOTEBOOK JUST PROVIDES SAMPLE CODE FOR EDUCATIONAL PURPOSES. MICROSOFT DOES NOT CLAIM ANY OWNERSHIP ON THESE CODES AND LIBRARIES. MICROSOFT PROVIDES THIS NOTEBOOK AND SAMPLE USE OF LIBRARIES ON AN “AS IS” BASIS. DATA OR ANY MATERIAL ON THIS NOTEBOOK. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, GUARANTEES OR CONDITIONS WITH RESPECT TO YOUR USE OF THIS NOTEBOOK. TO THE EXTENT PERMITTED UNDER YOUR LOCAL LAW, MICROSOFT DISCLAIMS ALL LIABILITY FOR ANY DAMAGES OR LOSSES, INCLUDING DIRECT, CONSEQUENTIAL, SPECIAL, INDIRECT, INCIDENTAL OR PUNITIVE, RESULTING FROM YOUR USE OF THIS NOTEBOOK.

#### Notebook prepared by [Tim Dunn](https://github.com/TimD1), Research Intern, Microsoft Biomedical Platforms and Genomics