# Example Pipeline: Pharmacogenomics Analysis on Breast Cancer Patients
### Overview
This notebook demonstrates an example end-to-end pipeline which combines (synthetic) FHIR data with PacBio long read sequencing data to perform a basic pharmacogenomic study. A custom Synthea module generates a synthetic cohort of breast cancer patients that have been split into two treatment groups (patients take either Epirubicin or Doxorubicin). This module models the fact that certain patients will respond better to one medication than the other, depending upon the presence/absence of a key variant. After merging FHIR and genomic data through a realistic and scalable pipeline, we perform basic statistical tests to discover the optimal treatment method for a given patient, based on their genetic profile. Please note that this is a contrived example for pipeline demonstration purposes only; a real pharmacogenomics application would require significantly more in-depth analyses.
 
**[Section 0](#0)** covers setting up a JupyterLab instance on an Azure Confidential Compute VM.  
**[Section 1](#1)** covers variant calling PacBio data using Cromwell on Azure.  
**[Section 2](#2)** covers generating FHIR data for a synthetic cohort of breast cancer patients using Synthea.  
**[Section 3](#3)** covers FHIR server deployment and configuration.  
**[Section 4](#4)** covers uploading Synthea data to the FHIR server.  
**[Section 5](#5)** covers deploying a custom "Sync Agent" to download FHIR data in parquet format.  
**[Section 6](#6)** covers parsing and converting parquet FHIR data into a pandas DataFrame.  
**[Section 7](#7)** covers merging individual patient VCFs into a single joint VCF.  
**[Section 8](#8)** covers converting the joint VCF into a pandas DataFrame.  
**[Section 9](#9)** covers merging the PacBio and FHIR data in a realistic manner.  
**[Section 10](#10)** covers basic pharmacogenomics analysis of the merged data.  
**[References](#references)**


***Disclaimer:** We are providing an example architectural design to illustrate how Microsoft tools can be utilized to connect the pieces together (data + interoperability + secure cloud + AI tools), enabling researchers to conduct research analyzing genomics and clinical data. We are not providing or recommending specific instructions for how investigators should conduct their research with this notebook – we will leave that to the professionals!*

## 0. Create a Confidential VM<a id="0"></a> 
First, create a confidential VM following this post:  
https://docs.microsoft.com/en-us/azure/confidential-computing/quick-create-confidential-vm-portal-amd

When you deploy the instance, it should prompt you with an alert to download the private key. Do this! Next, note the "Public IP address" shown under "Essentials" when you select your deployed VM's "Overview" in the left-pane menu.

Using this, you will be able to SSH into the machine from your Terminal. By default, however, the confidential VMs are highly secure and don't accept any connections. We need to first allow connections for our SSH session and JupyterLab server.

**Add SSH in:** Virtual machine -> left pane "Networking" -> "Inbound port rules" -> click "Add inbound port rule" -> Allow SSH on port 22 from your IP  
**Add JupyterLab in:** Virtual machine -> left pane "Networking" -> "Inbound port rules" -> click "Add inbound port rule" -> Allow Any protocol on port 8080 from your IP  
**Add JupyterLab out:** Virtual machine -> left pane "Networking" -> "Outbound port rules" -> click "Add outbound port rule" -> Allow Any protocol on port 8080 to your IP

Restart the VM by clicking left pane "Overview" then top pane "Restart", and you should be able to connect over SSH.
```
ssh -i <private_key.pem> azureuser@<ip_address>
```
Then, run the following commands in your Terminal to install up-to-date versions of Python3, pip, virtualenv, and JupyterLab on the confidential VM.
```
sudo apt update
sudo apt upgrade
sudo apt install python3-pip python3-dev
sudo -H pip3 install --upgrade pip

sudo -H pip3 install virtualenv
mkdir lab && cd lab
virtualenv jupyterenv
source jupyterenv/bin/activate

pip install jupyterlab pandas pyarrow plotly statsmodels
```
At this point, you can start the service with the command shown below. Launching the server also prints the required access token to `stdout`. The `--no-browser` flag tells JupyterLab not to launch a browser on the confidential VM.

**Note:** Don't use `--ip 0.0.0.0`, as some tutorials recommend. This will allow anyone to access the JupyterLab server at its public URL over HTTP, compromising security.
```
jupyter lab --no-browser --port 8080
```
Using a new Terminal, we need to set up SSH tunneling, which will bind port 8080 on the confidential VM to port 8080 on your local machine.
```
ssh -i <private_key.pem> -N -L localhost:8080:localhost:8080 azureuser@<ip_address>
```
Now, you can open JupyterLab in your local browser by navigating to `localhost:8080`.
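
If the page does not load, it can help to confirm that the tunnel is actually listening before debugging JupyterLab itself. The following is a minimal sketch using only the Python standard library; `port_is_open` is a hypothetical helper name, not part of any tool used in this notebook.

```python
import socket

def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run on your local machine, `port_is_open("localhost", 8080)` should return `True` while the SSH tunnel from the previous step is active.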

## 1. Variant Calling with 'Cromwell on Azure'<a id="1"></a>

Please deploy Cromwell on Azure from a non-confidential VM, preferably an Azure ML VM. Deployment from a confidential VM seems to fail during authentication with Microsoft servers.

### 1.1 Deploy Cromwell on Azure
Cromwell is a workflow management system for scientific workflows, particularly genomics analysis. It was originally developed by the Broad Institute, is included in GATK's best practices pipeline, and runs on Microsoft Azure. First, we'll download the latest release of Cromwell on Azure:

In [None]:
!wget https://github.com/microsoft/CromwellOnAzure/releases/download/3.0.0/deploy-cromwell-on-azure-linux

Run the following cell to install Azure CLI `az`. This should already be installed on AzureML VMs.

In [None]:
%%bash
sudo apt-get update
sudo apt-get install -y ca-certificates curl apt-transport-https lsb-release gnupg
curl -sL https://packages.microsoft.com/keys/microsoft.asc | gpg --dearmor | \
        sudo tee /etc/apt/trusted.gpg.d/microsoft.gpg > /dev/null
AZ_REPO=$(lsb_release -cs)
echo "deb [arch=amd64] https://packages.microsoft.com/repos/azure-cli/ $AZ_REPO main" | \
        sudo tee /etc/apt/sources.list.d/azure-cli.list
sudo apt-get update # refresh package lists so the new azure-cli source is picked up
sudo apt-get install -y azure-cli

Now we can log in with the proper permissions to modify our resource group and deploy Cromwell.

In [None]:
!az login

The following command will deploy the necessary storage accounts and compute instances within the specified resource group under the prefix `cvm-coa`.  
You can find more information about deploying Cromwell on Azure [here](https://github.com/microsoft/CromwellOnAzure#Deploy-your-instance-of-Cromwell-on-Azure). First, add permission to run the executable.

In [None]:
!chmod +x ./deploy-cromwell-on-azure-linux

**Note:** The Cromwell deployment appears to loop, printing the same output repeatedly. It will slowly get farther along in the deployment each time, taking several hours in total. Please be patient, and it will eventually succeed.

In [None]:
!./deploy-cromwell-on-azure-linux \
    --SubscriptionId '<########-####-####-####-############>' \
    --ResourceGroupName '<resource_group_name>' \
    --RegionName eastus \
    --MainIdentifierPrefix '<prefix>'

### 1.2 Configure Input Data

We have set up the GitHub repository https://github.com/timd1/pb-human-wgs-workflow-wdl to run an example variant calling workflow using a small subset of a public dataset: **chromosome 22 of HG002**. This is for demonstration purposes only; you may wish to run a similar pipeline with your own input BAM files and reference FASTA. The following tutorials may also be helpful for getting started: [Cromwell on Azure](https://github.com/microsoft/CromwellOnAzure/blob/main/docs/managing-your-workflow.md) and [PacBio human WGS](https://github.com/PacificBiosciences/pb-human-wgs-workflow-wdl/blob/main/Getting%20Started.md).

First, we'll download the reference data:

In [None]:
!wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
!wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai
!gunzip GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
!wget https://raw.githubusercontent.com/TimD1/pb-human-wgs-workflow-wdl/main/chr_lengths.tsv
!wget https://raw.githubusercontent.com/PacificBiosciences/pbsv/master/annotations/human_GRCh38_no_alt_analysis_set.trf.bed

Next, we'll download an HG002 BAM from NCBI. Since this file is 65GB, it may take a while.

In [None]:
!wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/PacBio_SequelII_CCS_11kb/HG002_GRCh38/HG002_GRCh38.haplotag.10x.bam

Then, we'll install `samtools`, which we can use to extract only reads aligning to `chr22`.

In [None]:
!wget https://github.com/samtools/samtools/releases/download/1.12/samtools-1.12.tar.bz2
!tar -xvf samtools-1.12.tar.bz2
%cd samtools-1.12
!sudo apt-get install -y libncurses-dev libbz2-dev liblzma-dev
!./configure
!make
!sudo make install

In [None]:
import subprocess

subprocess.run(["samtools", "view",
    "-h", "-b",
    "-o", "HG002_GRCh38.chr22.bam",
    "HG002_GRCh38.haplotag.10x.bam",
    "chr22"
]);
subprocess.run(["samtools", "index", "HG002_GRCh38.chr22.bam"]);
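
Note that `subprocess.run` does not raise an error by default when a command fails, so a failed `samtools view` would go unnoticed until a later step. A minimal sketch of a stricter wrapper is shown below; `run_checked` is a hypothetical helper name introduced here for illustration.

```python
import subprocess

def run_checked(cmd):
    """Run a command and return its stdout; raise CalledProcessError on a non-zero exit code."""
    # check=True turns a failing exit status into an exception
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

For example, `run_checked(["samtools", "index", "HG002_GRCh38.chr22.bam"])` would stop the notebook cell with an exception if indexing failed.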

### 1.3 Triggering the Cromwell Workflow
All six files (reference and index FNA/FAI, aligned reads and index BAM/BAI, chromosome lengths TSV, and tandem repeat regions BED) from the previous step should be moved to the Cromwell storage account's `inputs` folder, under the subdirectory `HG2chr22`. You can do this by temporarily mounting the container to this AzureML compute instance. For example, the reference will be located at `/coa<uniq_id>/inputs/HG2chr22/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna`. Note that if you use different filenames, you will need to clone and modify the GitHub repository to change filepaths.
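
Before triggering the workflow, a quick check that all six files are in place can save a failed run. The sketch below assumes the filenames produced by the download and `samtools` steps above; `missing_inputs` is a hypothetical helper, and the directory you pass in depends on where you mounted the container.

```python
import os

# filenames produced by the download and samtools steps in Section 1.2
EXPECTED_INPUTS = [
    "GCA_000001405.15_GRCh38_no_alt_analysis_set.fna",
    "GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai",
    "HG002_GRCh38.chr22.bam",
    "HG002_GRCh38.chr22.bam.bai",
    "chr_lengths.tsv",
    "human_GRCh38_no_alt_analysis_set.trf.bed",
]

def missing_inputs(input_dir, expected=EXPECTED_INPUTS):
    """Return the expected filenames that are not present in input_dir."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(input_dir, name))]
```

An empty list from `missing_inputs(<path to inputs/HG2chr22>)` means every expected file was found.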

A trigger file is used to initiate a Cromwell workflow. To start the variant calling pipeline, download the following trigger file from GitHub and place it in the `workflows/new` directory of the Cromwell storage account.

In [None]:
!wget https://raw.githubusercontent.com/TimD1/pb-human-wgs-workflow-wdl/main/examples/AshkenazimTrio/sample/trigger_files/sample.AshkenaziSon.trigger.json

## 2. Generating Synthetic FHIR Data with Synthea<a id="2"></a> 

At this point, we can start working from the confidential VM deployed in Section 0.

Synthea[<sup>[1]</sup>](#r1) is a tool for outputting synthetic, realistic (but not real), patient data and associated health records in a variety of formats. This notebook deals with records in FHIR, or "Fast Healthcare Interoperability Resources", format[<sup>[2]</sup>](#r2).  

### 2.1 Build Synthea
In order to set up Synthea[<sup>[1]</sup>](#r1), first reinstall JDK (otherwise `javadoc` fails during Gradle build).

In [None]:
!sudo apt-get install -y default-jdk

Download the latest stable version of the Synthea[<sup>[1]</sup>](#r1) git repository.

In [None]:
!git clone https://github.com/synthetichealth/synthea.git -b v3.0.0

Move to `synthea/` prior to building/running Synthea[<sup>[1]</sup>](#r1).

In [None]:
%cd synthea

Build Synthea[<sup>[1]</sup>](#r1) using Gradle 7.0.2 and Java 11.0.15. On Azure `DS11_V2`, this takes anywhere from 2 minutes to 2.5 hours.

In [None]:
!./gradlew build test check

Verify that the build/installation has succeeded.

In [None]:
!./run_synthea -h

### 2.2. Generate Breast Cancer Cohort
Add the attached `simple_breast_cancer.json` module to the `synthea/src/main/resources/modules` directory. This is a basic custom module we designed which generates a cohort of breast cancer patients that have been split into two treatment groups (patients take either Epirubicin or Doxorubicin).

Optionally, you can delete all other modules from this directory, which will cause Synthea to model ONLY our simplistic breast cancer module. The generated patients will then be unrealistic (they will not suffer from any other conditions), but this reduces the volume of data that has to be processed by the server or parsed by the downstream pipeline.  
**Note:** you must leave the `lookup_tables` subfolder in the `modules` directory.

Generate a test dataset of 50 patients (Synthea[<sup>[1]</sup>](#r1) only counts `Alive` patients, so there will likely be considerably more).

In [None]:
import subprocess

synthea_dir = "/home/azureuser/notebook/synthea-output"
subprocess.run(["./run_synthea",
    "-g", "F", "-a", "30-90",
    "-p", "50",
    f'--exporter.baseDirectory={synthea_dir}'
]);

In [None]:
# after running Synthea, go back to the home directory
%cd ..

## 3. FHIR Server Deployment and Configuration <a id="3"></a>

### 3.1. Create an "Azure API for FHIR"[<sup>[3]</sup>](#r3) instance

**3.1.1) Identify Resource Group and Subscription**
- Navigate to your desired Azure "Resource Group", note the name -> "Overview" -> copy "Subscription ID"
- Set `resource_group` and `sub_id` in [Section 4.1](#globals)

**3.1.2) Create an "Azure API for FHIR"[<sup>[3]</sup>](#r3) instance**, named `<fhir_server>`
- Navigate to `https://<fhir_server>.azurehealthcareapis.com/metadata` and verify a "Capability Statement" is retrieved.  
That means the FHIR server[<sup>[3]</sup>](#r3) is running.
- Set `fhir_server` in [Section 4.1](#globals)
- Use RBAC[<sup>[6]</sup>](#r6): `<fhir_server>` left pane "Identity" -> "On" -> "Save" -> "Yes"

### 3.2 Give this JupyterNB access
**3.2.1) Register an App** with permission to read/write data to the FHIR server[<sup>[3]</sup>](#r3) (this notebook will be that "app" and use those permissions)
- Create App: "Azure Active Directory"[<sup>[4]</sup>](#r4) -> left pane "App Registrations" -> top bar "New Registration" -> name `<fhir_app>` -> "Register"
- Navigate to App: "Azure Active Directory"[<sup>[4]</sup>](#r4) -> left pane "App Registrations" -> select `<fhir_app>` -> left pane "Overview"
- Copy the "Application (client) ID" and "Directory (tenant) ID", then set `client_id` and `tenant_id` in [Section 4.1](#globals)
- *More information on app registration:* [[5]](#r5)

**3.2.2) Create Client Secret** for this notebook to prove that it is the "app", or client <a id="secret"></a>
- Navigate to App: "Azure Active Directory"[<sup>[4]</sup>](#r4) -> left pane "App Registrations" -> select `<fhir_app>`
- Create Secret: left pane "Certificates & Secrets" -> "+ New Client Secret" -> name `<jnb_secret>` -> Add
- Save Secret: copy `<jnb_secret>`'s `Value`, and store as `client_secret` in [Section 4.1](#globals). If you do not copy the `Value` immediately after creation, you will no longer be able to access it, and will need to create a new secret.
- *More information on app registration:* [[5]](#r5)

**3.2.3) Add Permissions** for this notebook to POST/GET data from the FHIR server[<sup>[3]</sup>](#r3)
- Navigate to "Azure API for FHIR" server[<sup>[3]</sup>](#r3) named `<fhir_server>`
- Select Role: left pane "Access Control (IAM)" -> top bar "+ Add" -> "Add role assignment" -> Role="FHIR Data Contributor"
- Select Members: middle tab "Members" -> "Assign access to: User..." -> "+ Select members" -> search `<fhir_app>` (created in step 3.2.1) -> "Select" -> Review & Assign
- *More information on Azure Role-Based Access Control:* [[6]](#r6)

## 4. Upload Synthea Data to the FHIR Server <a id="4"></a>
The script below is based on an auto-generated Postman[<sup>[11]</sup>](#r11) template.  
Postman[<sup>[11]</sup>](#r11) is a platform for using REST APIs, and there's a tutorial for using it with FHIR here: [[12]](#r12).

### 4.1. Configuration<a id="globals"></a>

First, set the global variables necessary for querying the FHIR API.

In [None]:
resource_group = "<resource_group_name>"
sub_id = "<########-####-####-####-############>"

fhir_server = "<server_name>"
fhir_url = f"https://{fhir_server}.azurehealthcareapis.com"

tenant_id = "<########-####-####-####-############>"
client_id = "<########-####-####-####-############>"
client_secret = "<client_secret>"

In [None]:
import requests, json
from glob import glob
from urllib.parse import urlencode

Request an access token, using the [previously-generated client secret](#secret). You can find more information on Azure AD Access Tokens [here](https://docs.microsoft.com/en-us/azure/healthcare-apis/azure-api-for-fhir/azure-active-directory-identity-configuration).

In [None]:
# set request parameters
token_url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
payload = {
    'grant_type': 'Client_Credentials',
    'client_id': client_id,
    'client_secret': client_secret,
    'resource': fhir_url
}
headers = {
    'Content-Type': 'application/x-www-form-urlencoded'
}

# request token from server
response = requests.request("POST", token_url, headers=headers, data=urlencode(payload))
content = json.loads(response.content)

# save token from response
if response.status_code == 200:
    print(f"{content['token_type']} access token retrieved for {content['resource']} successfully.")
    bearer_token = content["access_token"]
else:
    print(f"ERROR: unexpected status code {response.status_code}.")    
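
Access tokens expire, and a long upload can outlive the token requested above. The sketch below assumes the token response also carries an `expires_on` field holding a Unix timestamp (an assumption about the response shape; check the keys of `content` in your own run). `token_is_fresh` is a hypothetical helper name.

```python
import time

def token_is_fresh(token_response, margin_seconds=300, now=None):
    """Return True if the token remains valid for at least `margin_seconds` more seconds."""
    now = time.time() if now is None else now
    # assumed field: 'expires_on' as seconds since the epoch (possibly a string)
    expires_on = float(token_response["expires_on"])
    return now + margin_seconds < expires_on
```

If `token_is_fresh(content)` returns `False`, re-running the token request cell before continuing avoids authorization errors partway through the upload.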

### 4.2. Upload FHIR Data

Prior to running the following script, I have found it useful to increase the provisioned throughput of the FHIR server from 400 to 4000 RU/s. This prevents any rate limiting which can lead to dropped records. After the data has been transferred, I drop the throughput back to 400 RU/s. The setting is available on the left tab of the FHIR server instance, under "Database".

In [None]:
%%time
# add all hospital/practitioner information first
for filename in glob(f"{synthea_dir}/fhir/*Information*.json"): # (hospital|practitioner)Information<###>
    print(f"Parsing file '{filename}'...")
    with open(filename, 'r') as json_file: # close the file once it has been parsed
        json_obj = json.load(json_file)['entry']
    for resource in json_obj:
    
        # craft request to FHIR database using REST API
        payload = json.dumps(resource["resource"])
        headers = {
          'Authorization': f'Bearer {bearer_token}',
          'Content-Type': 'application/json'
        }

        # send request
        url = f"{fhir_url}/{resource['resource']['resourceType']}"
        response = requests.request("POST", url, headers=headers, data=payload)
        
        # verify success
        if response.status_code >= 200 and response.status_code < 300:
            pass
            #print(f"{resource['resource']['resourceType']} added successfully.")
        else:
            print(f"ERROR: unexpected status code {response.status_code}.")
            print(response.text)
            break

# parse all Synthea-generated JSON data
for filename in glob(f"{synthea_dir}/fhir/*_*_*.json"): # <firstname>_<lastname>_<id>
    print(f"Parsing file '{filename}'...")
    with open(filename, 'r') as json_file: # close the file once it has been parsed
        json_obj = json.load(json_file)['entry']
    for resource in json_obj:
    
        # craft request to FHIR database using REST API
        payload = json.dumps(resource["resource"])
        headers = {
          'Authorization': f'Bearer {bearer_token}',
          'Content-Type': 'application/json'
        }

        # send request
        url = f"{fhir_url}/{resource['resource']['resourceType']}"
        response = requests.request("POST", url, headers=headers, data=payload)
        
        # verify success
        if response.status_code >= 200 and response.status_code < 300:
            pass
            #print(f"{resource['resource']['resourceType']} added successfully.")
        else:
            print(f"ERROR: unexpected status code {response.status_code}.")
            print(response.text)
            break
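
The loops above stop processing a file at the first failed request. If you would rather not raise the provisioned throughput, one alternative is to retry throttled requests with exponential backoff. This is a minimal sketch, assuming throttled requests come back with HTTP status 429 ("Too Many Requests"); `post_with_retry` is a hypothetical helper, and `send` is any zero-argument callable that performs the request.

```python
import time

def post_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-429 response or the attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            # wait 1x, 2x, 4x, ... the base delay before trying again
            sleep(base_delay * 2 ** attempt)
    return response  # still throttled after the final attempt
```

In the upload loops, this could wrap the existing call as `post_with_retry(lambda: requests.request("POST", url, headers=headers, data=payload))`.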

The following script can be used to delete all FHIR data from the server (ONLY for a full database reset).

In [None]:
# resource_list = [ "AllergyIntolerance", "Encounter", "Observation", "CarePlan", "ExplanationOfBenefit", 
#                  "Organization", "CareTeam", "ImagingStudy", "Patient", "Claim", "Immunization",
#                  "Practitioner", "Condition", "Location", "PractitionerRole", "Device", "Medication", 
#                  "Procedure", "DiagnosticReport", "MedicationAdministration", "Provenance",
#                  "DocumentReference", "MedicationRequest", "SupplyDelivery"]

# # delete all resource types
# for resource_type in resource_list:
#     # looping is required, since GET will only fetch first 10 items
#     print(f"Deleting {resource_type}s...", end="")
#     while True:
#         # query for list of all resources
#         url = f"{fhir_url}/{resource_type}"
#         headers = {'Authorization': f'Bearer {bearer_token}'}
#         response = requests.request("GET", url, headers=headers, data={})
#         fhir_data = json.loads(response.content)

#         # delete each resource_type (in chunk of 10)
#         try:
#             for resource in fhir_data["entry"]:
#                 url = f"{fhir_url}/{resource_type}/{resource['resource']['id']}"
#                 response = requests.request("DELETE", url, headers=headers, data={})
#                 if response.status_code >= 200 and response.status_code < 300:
#                     # print(f"{resource_type} {resource['resource']['id']} deleted successfully.")
#                     pass
#                 else:
#                     print(f"ERROR: unexpected status code {response.status_code}.")
#                     print(response.text)
#         except KeyError:
#             print("done!")
#             break
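
The reset script works around paging by re-querying after each batch of deletions. When you only want to read every resource, FHIR search results are returned as a paged `Bundle` whose `link` array contains an entry with `relation` set to `next` while more pages remain. A minimal sketch for finding that link is shown below; `next_page_url` is a hypothetical helper name.

```python
def next_page_url(bundle):
    """Return the URL of the next results page in a FHIR search Bundle, or None on the last page."""
    for link in bundle.get("link", []):
        if link.get("relation") == "next":
            return link.get("url")
    return None
```

A read loop would issue a GET for the first page, process its `entry` list, and keep requesting `next_page_url(...)` until it returns `None`.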

## 5. Deploying the 'FHIR to Synapse Sync Agent' <a id="5"></a>
This notebook section follows the "FHIR to Synapse Sync Agent" tutorial provided by Microsoft's "FHIR Analytics Pipelines" GitHub repository[<sup>[13]</sup>](#r13).

### 5.1. Deploy the "FHIR to Synapse Sync Agent" 
First, deploy the custom Azure template provided by the "FHIR to Synapse Sync Agent" tutorial[<sup>[13]</sup>](#r13).
- Navigate to the Github repo by clicking [this link](https://github.com/microsoft/FHIR-Analytics-Pipelines/blob/main/FhirToDataLake/docs/Deployment.md).
- Scroll down to "Deployment", then "1. Deploy the Pipeline", and click the blue button "Deploy to Azure"
- App Name `<sync_agent>` -> set FHIR URL -> Authentication `true` -> Container Name `fhir` -> "Review + Create" -> "Create"
- Note: DO NOT have any dashes or underscores in the container name, or else it will fail silently

Then, add permissions for Function App `<sync_agent>` to read FHIR data
- `<sync_agent>` left pane "Identity" ->  "Azure role assignments" -> "+Add role assignment" -> "Resource Group" -> `<resource_group>` -> "FHIR Data Contributor"


### 5.2. Download FHIR Data 
First, wait for the sync agent to complete a job, transferring records. Jobs are scheduled to run every five minutes, and this first job may take 10-15 minutes to complete.

Afterwards, mount the parquet data on the running AzureML[<sup>[0]</sup>](#r0) machine.
- Add Datastore: `<azure_ml>` left pane "Datastores" -> "+ New datastore" -> name `fhir` -> "Azure Blob Storage" -> `<sync_agent_storage>` -> `fhir` -> `<keys from next step>` -> "Create"
- Get Storage keys: `<sync_agent_storage>` left pane "Access keys" -> "Show" -> "Copy to clipboard"
- Mount Datastore: `<azure_ml>` left pane "Compute" -> `<azure_ml>` instance -> top bar "Data (preview)" -> "Mount" -> "Azure Storage" -> select `fhir` -> name `fhir`

If using a confidential VM, you should mount the container using [blobfuse](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-how-to-mount-container-linux):

In [None]:
!wget https://packages.microsoft.com/config/ubuntu/20.04/packages-microsoft-prod.deb
!sudo dpkg -i packages-microsoft-prod.deb
!sudo apt-get update
!sudo apt install blobfuse

Mount FHIR data

In [None]:
fhir_storage="<storage_account_name>"
fhir_key = "<storage_account_key>"
with open("fhir_blobfuse.cfg", "w") as fhir_blob_cfg:
    fhir_blob_cfg.write(f"accountName {fhir_storage}\n")
    fhir_blob_cfg.write(f"accountKey {fhir_key}\n")
    fhir_blob_cfg.write(f"containerName fhir")

In [None]:
!sudo mkdir -p /mnt/cvm_fhir
!sudo chown azureuser /mnt/cvm_fhir
!chmod 600 fhir_blobfuse.cfg
!mkdir -p ~/cvm_fhir
!blobfuse ~/cvm_fhir \
    --tmp-path=/mnt/cvm_fhir  \
    --config-file=fhir_blobfuse.cfg \
    -o attr_timeout=240 \
    -o entry_timeout=240 \
    -o negative_timeout=120

Mount PacBio Data

In [None]:
pb_storage="<storage_account_name>"
pb_key = "<storage_account_key>"
with open("pb_blobfuse.cfg", "w") as pb_blob_cfg:
    pb_blob_cfg.write(f"accountName {pb_storage}\n")
    pb_blob_cfg.write(f"accountKey {pb_key}\n")
    pb_blob_cfg.write(f"containerName pacbio")

In [None]:
!sudo mkdir -p /mnt/cvm_pb
!sudo chown azureuser /mnt/cvm_pb
!chmod 600 pb_blobfuse.cfg
!mkdir -p ~/cvm_pb
!blobfuse ~/cvm_pb \
    --tmp-path=/mnt/cvm_pb  \
    --config-file=pb_blobfuse.cfg \
    -o attr_timeout=240 \
    -o entry_timeout=240 \
    -o negative_timeout=120

## 6. Extract FHIR Data <a id="6"></a>
### from `.parquet` to `DataFrame` 
First, we'll import the necessary libraries and set some useful variables.

In [None]:
# manage imports
import pandas as pd
from glob import glob
from datetime import datetime
import os, json, subprocess

# filepaths
fhir_dir = "/home/azureuser/cvm_fhir/result"
date = "2022/08/26"
pb_dir = "/home/azureuser/cvm_pb"

# globals
max_patients = 100
max_variants = 50
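
The `date` variable above is hard-coded to the day the sync agent wrote its output. As a sketch, assuming the exported files sit under `<fhir_dir>/<ResourceType>/<YYYY>/<MM>/<DD>/` (the layout implied by the glob patterns used in this section), the most recent date can be discovered instead; `latest_date` is a hypothetical helper name.

```python
import os
from glob import glob

def latest_date(resource_dir):
    """Return the newest 'YYYY/MM/DD' sub-path under resource_dir, or None if none exists."""
    pattern = os.path.join(resource_dir, "[0-9]" * 4, "[0-9]" * 2, "[0-9]" * 2)
    day_dirs = [d for d in glob(pattern) if os.path.isdir(d)]
    if not day_dirs:
        return None
    # zero-padded year/month/day paths sort chronologically as plain strings
    newest = max(os.path.relpath(d, resource_dir) for d in day_dirs)
    return newest.replace(os.sep, "/")
```

For example, `date = latest_date(f"{fhir_dir}/Patient")` would pick up the latest export without editing the notebook.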

Next, we'll parse all the FHIR `Patient` data:

In [None]:
# parse patient information from each .parquet file
fhir_dfs = []
for patient_file in glob(f"{fhir_dir}/Patient/{date}/**/*.parquet"):
    
    # create dataframe for this .parquet file
    fhir_df = pd.DataFrame()
    orig_fhir_df = pd.read_parquet(patient_file)

    # set basic fields
    fhir_df["name"]    = [f"{' '.join(name[0]['prefix']) if name[0]['prefix'] else ''} {' '.join(name[0]['given'])} {name[0]['family']}" for name in orig_fhir_df["name"]]
    fhir_df["gender"]  = orig_fhir_df["gender"]
    fhir_df["dead"]    = [dead is not None for dead in orig_fhir_df["deceased"]]
    
    # set age in years from birthDate
    ages = []
    for birthdate in orig_fhir_df["birthDate"]:
        year, month, day = map(int, birthdate.split("-"))
        ages.append((datetime.now()-datetime(year, month, day)).total_seconds() / (60*60*24*365.25))
    fhir_df["age"]     = ages

    # set "medical record number", from a host of other identifiers (social security, etc.)
    mrns = []
    for identifier_list in orig_fhir_df["identifier"]:
        for ident in identifier_list:
            if ident['type'] and json.loads(ident['type']['coding'])[0]['code'] == "MR":
                mrns.append(ident['value'])
    fhir_df["mrn"]     = mrns
    
    # collect the dataframe for this .parquet file
    fhir_dfs.append(fhir_df)

# merge information
all_fhir_df = pd.concat(fhir_dfs).reset_index(drop=True)
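
The MRN loop above appends a value only when an `MR` identifier is found, so a patient without one would leave `mrns` shorter than the DataFrame. A minimal sketch of a per-patient lookup that always yields exactly one value is shown below. It assumes each identifier is a plain dict shaped like those in the loop above; `extract_mrn` is a hypothetical helper name.

```python
import json

def extract_mrn(identifier_list):
    """Return the first identifier value typed 'MR' (medical record number), or None."""
    for ident in identifier_list:
        id_type = ident.get("type")
        if not id_type:
            continue
        coding = id_type["coding"]
        if isinstance(coding, str):
            # the parquet export may store the coding list as a JSON string
            coding = json.loads(coding)
        if coding and coding[0].get("code") == "MR":
            return ident["value"]
    return None
```

With this helper, `[extract_mrn(ids) for ids in orig_fhir_df["identifier"]]` has one entry per patient, with `None` marking a missing MRN.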

Create a dictionary that maps each medical record number (MRN) to its patient's row index in `all_fhir_df`.

In [None]:
mrn_dict = {}
for idx, row in all_fhir_df.iterrows():
    mrn_dict[ row['mrn'] ] = idx

Using the MRN to patient mapping, we can scan through all `Condition` and `MedicationRequest` records to determine which patients have breast cancer, and if so, which medication they are taking.

In [None]:
# check if patients have breast cancer
has_breast_cancer = [False]*len(all_fhir_df)
for conds_file in glob(f"{fhir_dir}/Condition/{date}/**/*.parquet"):
    conds_df = pd.read_parquet(conds_file)
    for idx, row in conds_df.iterrows():
        code = "None" if row["code"] is None else row["code"]["coding"][0]["code"]
        mrn = row["subject"]["reference"].split(':')[2] # urn:uuid:mrn
        if code == "254837009":
            has_breast_cancer[ mrn_dict[mrn] ] = True
all_fhir_df['has_breast_cancer'] = has_breast_cancer

# get patient medications
used_epi = [False]*len(all_fhir_df)
used_doxo = [False]*len(all_fhir_df)
has_variant = [False]*len(all_fhir_df)
for meds_file in glob(f"{fhir_dir}/MedicationRequest/{date}/**/*.parquet"):
    meds_df = pd.read_parquet(meds_file)
    for idx, row in meds_df.iterrows():
        code = "None" if row["medication"]["codeableConcept"] is None else row["medication"]["codeableConcept"]["coding"][0]["code"]
        mrn = row["subject"]["reference"].split(':')[2] # urn:uuid:mrn
        if code == "1732186": # epirubicin
            used_epi[ mrn_dict[mrn] ] = True
        elif code == "1790099": # doxorubicin
            used_doxo[ mrn_dict[mrn] ] = True
        elif code in ["ALT", "REF"]:
            has_variant[ mrn_dict[mrn] ] = code
all_fhir_df['epirubicin'] = used_epi
all_fhir_df['doxorubicin'] = used_doxo
all_fhir_df['has_variant'] = has_variant
all_fhir_df

In [None]:
# keep only breast cancer patients whose REF/ALT variant status was recorded
# (has_variant is False when unset, otherwise the string "REF" or "ALT")
all_fhir_df = all_fhir_df.loc[all_fhir_df['has_breast_cancer'] & all_fhir_df['has_variant'].astype(bool)]

# ensure exactly one of (epirubicin, doxorubicin) is set
all_fhir_df = all_fhir_df.loc[all_fhir_df['epirubicin'] ^ all_fhir_df['doxorubicin']].copy()

# convert has_variant to a boolean (True if ALT), then remove unnecessary columns
all_fhir_df['has_variant'] = all_fhir_df['has_variant'] == "ALT"
all_fhir_df.drop(['has_breast_cancer', 'doxorubicin'], axis=1, inplace=True)
all_fhir_df

Sanity check that variant presence is independent of patient treatment.

In [None]:
all_fhir_df[['epirubicin', 'has_variant']].value_counts()
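
The four counts printed above form a 2x2 contingency table (treatment by variant status). To put a number on the sanity check, Pearson's chi-square statistic can be computed directly from those counts. The following is a plain-Python sketch without continuity correction; `chi_square_2x2` is a hypothetical helper name.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column needs at least one observation")
    return n * (a * d - b * c) ** 2 / denom
```

With one degree of freedom, a statistic below roughly 3.84 is consistent with independence at the 5% significance level.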

Lastly, let's limit the dataset size to the specified maximum number of patients and preview the data.

In [None]:
all_fhir_df = all_fhir_df.drop(all_fhir_df.index[max_patients:])
all_fhir_df.head()

## 7. Merge individual VCFs into a joint VCF <a id="7"></a>
First, we need to install `bcftools` in order to process our VCF files. The `bcftools` documentation[<sup>[7]</sup>](#r7) can be found [here](https://samtools.github.io/bcftools/bcftools.html).  

Please note that `sudo apt-get install bcftools` is insufficient because the Ubuntu 18.04 package repository contains `bcftools-1.7`, which failed for us on this processing pipeline due to an old bug. In the following cell, we install a more recent version, `bcftools-1.15.1`, from source.  

In [None]:
!sudo apt-get -y install liblzma-dev libbz2-dev libcurl4-nss-dev
!wget https://github.com/samtools/bcftools/releases/download/1.15.1/bcftools-1.15.1.tar.bz2
!tar -xf bcftools-1.15.1.tar.bz2
%cd bcftools-1.15.1
!sudo make install
!bcftools --version
%cd ..

We must also install `tabix`[<sup>[8]</sup>](#r8), a tool which indexes VCF files for faster processing.

In [None]:
!sudo apt-get install -y tabix

We're first going to pre-process our VCF files by splitting up sites with multiple alleles into separate records, each with a single allele. For example,

CHROM  | POS  | REF  | ALT 
-------|------|------|----
chr20  | 1232 | A    | T,C

will become

CHROM  | POS  | REF  | ALT  
-------|------|------|----
chr20  | 1232 | A    | T  
chr20  | 1232 | A    | C

This ensures that the number of fields is static for each entry, particularly for the `PL` and `AF` columns.
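
The table transformation above can be sketched in a few lines of Python. This only illustrates the idea on the four columns shown; `bcftools norm` also splits the per-allele INFO and FORMAT fields, which the hypothetical `split_multiallelic` helper below ignores.

```python
def split_multiallelic(chrom, pos, ref, alt):
    """Split one record with comma-separated ALT alleles into one record per allele."""
    return [(chrom, pos, ref, allele) for allele in alt.split(",")]
```

Applied to the example row, `split_multiallelic("chr20", 1232, "A", "T,C")` yields the two single-allele rows of the second table.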

In [None]:
%%time
# skip VCFs which have already been split/pruned
orig_vcfs = set(glob(f"{pb_dir}/**/*.vcf.gz", recursive=True)) - \
            set(glob(f"{pb_dir}/**/*_split.vcf.gz", recursive=True)) - \
            set(glob(f"{pb_dir}/**/*_pruned.vcf.gz", recursive=True))

for vcf_fn in orig_vcfs:
    print(f"Splitting multi-alleles in '{vcf_fn}'")
    subprocess.run([
        "bcftools", "norm", 
        "--multiallelics", "-both",
        "--output-type", "z",
        "--output", f"{vcf_fn[:-7]}_split.vcf.gz", # remove .vcf.gz
        vcf_fn
    ])
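
The slice `vcf_fn[:-7]` relies on `.vcf.gz` being exactly seven characters long. A small helper makes that intent explicit and fails loudly if a filename has an unexpected suffix. This is a sketch; `replace_suffix` is a hypothetical name.

```python
def replace_suffix(path, old, new):
    """Return path with the trailing `old` suffix replaced by `new`."""
    if not path.endswith(old):
        raise ValueError(f"'{path}' does not end with '{old}'")
    return path[: len(path) - len(old)] + new
```

For instance, `replace_suffix(vcf_fn, ".vcf.gz", "_split.vcf.gz")` produces the same output name as the slicing above.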

Next, we perform LD (linkage disequilibrium) pruning[<sup>[9]</sup>](#r9) in order to remove variants that are highly correlated with one another. Since we're only going to be looking at a small subset of each patient's variants, we want to make sure that the variants we do investigate are fairly independent.

In [None]:
%%time
for vcf_fn in glob(f"{pb_dir}/**/*_split.vcf.gz", recursive=True):
    print(f"Pruning variants in '{vcf_fn}'")
    subprocess.run([
        "bcftools", "+prune", 
        "-m", "0.2",
        vcf_fn,
        "--output-type", "z",
        "--output", f"{vcf_fn[:-13]}_pruned.vcf.gz", # remove _split.vcf.gz
    ])

Index all individual VCFs.

In [None]:
%%time 
for vcf_fn in glob(f"{pb_dir}/**/*_pruned.vcf.gz", recursive=True): # no gvcfs
    print(f"Indexing '{vcf_fn}'")
    subprocess.run(["tabix", "-f", "-p", "vcf", vcf_fn])

Merge the VCF files into a single joint VCF, and index it.

In [None]:
%%time
subprocess.run(["bcftools", "merge"] + 
        ["-m", "none"] +
        ["--output-type", "z"] + # write bgzip-compressed output so that tabix can index it
        list(glob(f"{pb_dir}/**/*_pruned.vcf.gz", recursive=True))[:max_patients] +
        ["-o", f"{pb_dir}_joint/{max_patients}_patients.vcf.gz"]
)
subprocess.run(["tabix", "-f", "-p", "vcf", f"{pb_dir}_joint/{max_patients}_patients.vcf.gz"])

## 8. Extract joint VCF <a id="8"></a>
### from `.vcf` to `DataFrame`

First, we'll use `bcftools query` to convert our VCF to a more general format that works well with existing data science libraries: TSV.

In [None]:
%%time
# convert joint VCF to TSV
subprocess.run(["bcftools", "query", 
        "--print-header",
        "-f", "%CHROM\t%POS\t%TYPE\t%REF\t%ALT\t%QUAL\t%FILTER\t%INFO/DP\t%INFO/AF\t%INFO/AQ\t%INFO/AN\t%INFO/AC[\t%GT\t%AD\t%DP\t%GQ\t%PL]\n", 
        f"{pb_dir}_joint/{max_patients}_patients.vcf.gz",
        "-o", f"{pb_dir}_tsv/{max_patients}_patients.tsv"
])

Next, let's define a list of all patients, and functions for computing GT and PL.
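
As background on the `PL` field: each value is a phred-scaled genotype likelihood, scaled so that the most likely genotype has `PL = 0`. The sketch below converts a `PL` triple into normalized genotype probabilities. The helper name `pl_to_probs` and the example values are hypothetical, and the normalization assumes equal prior weight for each genotype.

```python
def pl_to_probs(pl):
    """Convert phred-scaled likelihoods to probabilities that sum to 1."""
    likelihoods = [10 ** (-value / 10) for value in pl]  # undo the phred scaling
    total = sum(likelihoods)
    return [lk / total for lk in likelihoods]

# hypothetical PL values for genotypes 0/0, 0/1 and 1/1
probs = pl_to_probs([30, 0, 45])
for genotype, prob in zip(("0/0", "0/1", "1/1"), probs):
    print(f"{genotype}: {prob:.5f}")
```

A larger `PL` therefore means a less likely genotype, which is why `get_pl` below returns `0` for the called genotype and falls back to `GQ` for the others when `PL` is missing.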

In [None]:
# save list of patients
patients = [p.decode() for p in subprocess.run(
    ["bcftools", "query", 
         "--list-samples", f"{pb_dir}_joint/{max_patients}_patients.vcf.gz"], 
    stdout=subprocess.PIPE).stdout.splitlines()]

In [None]:
# replace unknown calls with ref-calls, and remove phasing info (1|0 -> 0/1)
def gt_type(gt_str):
    gt_str = gt_str.replace(".","0")
    if (gt_str[0] == '0' and gt_str[2] == '1') or (gt_str[0] == '1' and gt_str[2] == '0'): return '0/1'
    if gt_str[0] == '0' and gt_str[2] == '0': return '0/0'
    if gt_str[0] == '1' and gt_str[2] == '1': return '1/1'
    else: return '?/?'

In [None]:
# parse PL data if available; otherwise compute from other fields
def get_pl(gt, gq, pl, ac):
    if pl == '.': # missing, compute from GT/GQ
        if (ac == 0 and gt == '0/0') or (ac == 1 and gt == '0/1') or (ac == 2 and gt == '1/1'): # chosen allele
            return 0
        else: # not chosen allele
            return gq 
    else: # parse out of PL data
        return int(pl.split(',')[ac])

Afterwards, we load the TSV into a DataFrame, impute missing values, and perform filtering. Note that requiring the fraction of patients who carry each variant to lie between 0.3 and 0.7 is particularly restrictive; it is used to select variants whose genotypes differ substantially across this cohort.
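
The carrier-fraction filter can be sketched on a toy example as follows. The site names and genotypes are made up; only the 0.3 and 0.7 bounds come from this notebook.

```python
def carrier_fraction(genotypes):
    """Fraction of samples whose genotype is neither missing nor homozygous reference."""
    carriers = [gt for gt in genotypes if gt not in ("./.", "0/0")]
    return len(carriers) / len(genotypes)

# hypothetical genotypes for four patients at three sites
sites = {
    "site_1": ["0/0", "0/1", "1/1", "0/0"],  # half of the patients carry it
    "site_2": ["0/1", "0/1", "1/1", "0/1"],  # everyone carries it
    "site_3": ["0/0", "./.", "0/0", "0/1"],  # only one patient carries it
}

min_freq, max_freq = 0.3, 0.7
kept = [name for name, gts in sites.items()
        if min_freq < carrier_fraction(gts) < max_freq]
print(kept)
```

Sites carried by nearly everyone or nearly no one are dropped, leaving only those that split the cohort.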

In [None]:
%%time
# remove numbered prefix from column names (e.g. "[1]CHROM"->"CHROM")
vcf_df = pd.read_csv(f"{pb_dir}_tsv/{max_patients}_patients.tsv", delimiter="\t", nrows=1000000)
vcf_df.columns = [col.split("]")[1] for col in vcf_df.columns]

# filter out variants which are too homogeneous across patients (for more interesting analysis, don't use for real datasets)
# a real application may instead have a whitelist/mask of genomic regions or variants of interest
min_freq = 0.3
max_freq = 0.7
gt_is_empty = pd.concat([vcf_df[f'{p}:GT'] == './.' for p in patients], axis=1)
gt_is_ref = pd.concat([vcf_df[f'{p}:GT'] == '0/0' for p in patients], axis=1)
n_valid = max_patients - (gt_is_empty | gt_is_ref).sum(axis=1)
vcf_df['AF'] = [x/max_patients for x in n_valid]
in_freq_range = [x/max_patients < max_freq and x/max_patients > min_freq for x in n_valid]
vcf_df = vcf_df[in_freq_range]

# print(f'selected variants: {len(vcf_df)}')

# impute missing values using median for all numeric fields
number_cols = ["QUAL", "DP", "AQ"] + [f"{patient}:GQ" for patient in patients] + [f"{patient}:DP" for patient in patients]
# print(f'Missing values in the following columns will be imputed using per-patient medians.\nThis is required because unlike GVCFs, VCFs do not contain depth and quality information for reference calls.')
for nc in number_cols:
    # print(f'{nc} missing: {len(vcf_df[vcf_df[nc] == "."])}')
    median = pd.to_numeric(vcf_df.loc[vcf_df[nc] != ".", nc]).median()
    if pd.isna(median) and nc[-2:] == 'GQ': median = 15 # no other values to infer from
    vcf_df.loc[vcf_df[nc]=='.', nc] = median
    vcf_df[nc] = pd.to_numeric(vcf_df[nc])

# compute phred likelihoods and allele frequencies for each sample
for p in patients:
    vcf_df[f'{p}:GT'] = vcf_df[f'{p}:GT'].map(gt_type)
    vcf_df[f'{p}:PL_0/0'] = vcf_df.apply(lambda row: get_pl(gt = row[f'{p}:GT'], gq = row[f'{p}:GQ'], pl = row[f'{p}:PL'], ac = 0), axis=1)
    vcf_df[f'{p}:PL_0/1'] = vcf_df.apply(lambda row: get_pl(gt = row[f'{p}:GT'], gq = row[f'{p}:GQ'], pl = row[f'{p}:PL'], ac = 1), axis=1)
    vcf_df[f'{p}:PL_1/1'] = vcf_df.apply(lambda row: get_pl(gt = row[f'{p}:GT'], gq = row[f'{p}:GQ'], pl = row[f'{p}:PL'], ac = 2), axis=1)
    
    vcf_df[f'{p}:AF_0'] = vcf_df.apply(lambda row: 1 if row[f'{p}:AD'] == '.' else int(row[f'{p}:AD'].split(",")[0]) / max(row[f'{p}:DP'], 1) , axis=1)
    vcf_df[f'{p}:AF_1'] = vcf_df.apply(lambda row: 0 if row[f'{p}:AD'] == '.' else int(row[f'{p}:AD'].split(",")[1]) / max(row[f'{p}:DP'], 1) , axis=1)

# filter by depth/quality, and remove complex variants
vcf_df = vcf_df[vcf_df['DP'] >= 15]
vcf_df = vcf_df[vcf_df['QUAL'] >= 20]
vcf_df = vcf_df[vcf_df['TYPE'] != 'OTHER']
print(f'passing variants: {len(vcf_df)}')

# limit to specified number of variants
vcf_df = vcf_df.drop(vcf_df.index[max_variants:])
vcf_df.reset_index(inplace=True)
print(f'selected variants: {len(vcf_df)}')
vcf_df

Lastly, we'll transpose the dataframe so that each row corresponds to a single patient.

In [None]:
%%time

# define the fields we'll be using in our final dataframe
fields = []
sample_fields = ["GT", "AF_0", "AF_1", "PL_0/0", "PL_0/1", "PL_1/1", "DP"]

# create empty dataframe
pb_df = pd.DataFrame(columns=[f"{var_id}:{fld}" for var_id in range(len(vcf_df)) for fld in fields + sample_fields], index=patients)

# fill dataframe
# use a single .loc[row, col] call; chained indexing (.loc[p][col]) may write to a temporary copy
for idx, row in vcf_df.iterrows():
    for f in fields:
        for p in patients:
            pb_df.loc[p, f"{idx}:{f}"] = row[f]
    for f in sample_fields:
        for p in patients:
            pb_df.loc[p, f"{idx}:{f}"] = row[f"{p}:{f}"]

# new index
pb_df.reset_index(inplace=True)
pb_df = pb_df.rename(columns = {'index': 'patient_id'})
pb_df

## 9. Merge PacBio and FHIR Data <a id="9"></a>

Since one row corresponds to a single patient in both our PacBio and FHIR datasets, we can easily append these two datasets side by side.  

Both datasets are first partitioned by variant status, so that every paired PacBio and FHIR record agrees on whether the patient carries the variant at the position of interest.
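
The pairing idea can be sketched on toy data as follows. The helper `pair_by_status`, the `fhir_id` column, and all values are hypothetical; `has_variant`, `patient_id`, and `1:GT` follow the column names used in this notebook. Note that `pd.concat(..., axis=1)` aligns rows on the index, which is why each partition's index is reset before pairing.

```python
import pandas as pd

# hypothetical stand-ins for the FHIR and PacBio tables
fhir = pd.DataFrame({"fhir_id": ["f1", "f2", "f3", "f4"],
                     "has_variant": [True, False, True, False]})
pb = pd.DataFrame({"patient_id": ["s1", "s2", "s3"],
                   "1:GT": ["0/0", "0/1", "1/1"]})

def pair_by_status(fhir_part, pb_part):
    """Pair rows positionally, truncating to the smaller of the two partitions."""
    n = min(len(fhir_part), len(pb_part))
    return pd.concat([fhir_part.iloc[:n].reset_index(drop=True),
                      pb_part.iloc[:n].reset_index(drop=True)], axis=1)

with_var = pair_by_status(fhir[fhir["has_variant"]], pb[pb["1:GT"] != "0/0"])
without_var = pair_by_status(fhir[~fhir["has_variant"]], pb[pb["1:GT"] == "0/0"])
paired = pd.concat([with_var, without_var], ignore_index=True)
print(paired)
```

Without the index reset, rows would be matched by their original index labels rather than by position, leaving unmatched rows filled with `NaN`.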

In [None]:
# merge PacBio+FHIR data for patients with variant
fhir_w_var = all_fhir_df.loc[all_fhir_df['has_variant'] == True].reset_index(drop=True)
pb_w_var = pb_df.loc[pb_df['1:GT'] != '0/0'].reset_index(drop=True)
min_w_var = min(len(fhir_w_var), len(pb_w_var))
w_var = pd.concat([fhir_w_var[:min_w_var], pb_w_var[:min_w_var]], axis=1)

# merge PacBio+FHIR data for patients without variant
fhir_wo_var = all_fhir_df.loc[all_fhir_df['has_variant'] == False].reset_index(drop=True)
pb_wo_var = pb_df.loc[pb_df['1:GT'] == '0/0'].reset_index(drop=True)
min_wo_var = min(len(fhir_wo_var), len(pb_wo_var))
wo_var = pd.concat([fhir_wo_var[:min_wo_var], pb_wo_var[:min_wo_var]], axis=1)

# merge
all_df = pd.concat([w_var, wo_var], axis=0)
all_df.reset_index(inplace=True, drop=True)
all_df

## 10. Basic Pharmacogenomics Analysis <a id="10"></a>

First, we'll extract only the relevant columns from the combined data, and count the number of patients in each category.

In [None]:
counts = pd.DataFrame(all_df[['has_variant', 'epirubicin', 'dead']].value_counts()).reset_index()
counts.columns = ['has_variant', 'epirubicin', 'dead', 'counts']
counts

This Sankey diagram makes the grouping of patients easier to visualize.
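
In a Sankey diagram, each link is described by the integer indices of its source and target nodes in the label list. As a sketch of how those index lists can be derived from readable label pairs rather than written by hand, consider the following; the `edges` list and the `index` lookup are our own illustrative names.

```python
labels = ["Patients", "Variant", "No Variant",
          "Doxorubicin - Variant", "Doxorubicin - No Variant",
          "Epirubicin - Variant", "Epirubicin - No Variant",
          "Dead", "Alive"]
index = {name: i for i, name in enumerate(labels)}  # label -> node index

treatment_groups = labels[3:7]
edges = (
    [("Patients", "Variant"), ("Patients", "No Variant"),
     ("Variant", "Doxorubicin - Variant"), ("Variant", "Epirubicin - Variant"),
     ("No Variant", "Doxorubicin - No Variant"), ("No Variant", "Epirubicin - No Variant")]
    + [(group, "Alive") for group in treatment_groups]
    + [(group, "Dead") for group in treatment_groups]
)

source = [index[src] for src, _ in edges]
target = [index[dst] for _, dst in edges]
print(source)
print(target)
```

Each entry of the `value` list in the cell below must then follow the same link order.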

In [None]:
import plotly.graph_objects as go
fig = go.Figure(data=[go.Sankey(
    node = dict(
      thickness = 5,
      line = dict(color = "green", width = 0.1),
      label = ["Patients", "Variant", "No Variant", "Doxorubicin - Variant", "Doxorubicin - No Variant", 
               "Epirubicin - Variant", "Epirubicin - No Variant", "Dead", "Alive"],
      color = "blue"
    ),
    link = dict(
          
      # indices correspond to labels
      source = [0,   0,  1,  1,  2,  2, 3,  4,  5,  6, 3, 4, 5, 6], 
      target = [1,   2,  3,  5,  4,  6, 8,  8,  8,  8, 7, 7, 7, 7],
      value =  [
                counts[counts['has_variant']].sum()['counts'],
                counts[~counts['has_variant']].sum()['counts'],
                counts[counts['has_variant'] & ~counts['epirubicin']].sum()['counts'],
                counts[counts['has_variant'] & counts['epirubicin']].sum()['counts'],
                counts[~counts['has_variant'] & ~counts['epirubicin']].sum()['counts'],
                counts[~counts['has_variant'] & counts['epirubicin']].sum()['counts'],
                counts[counts['has_variant'] & ~counts['epirubicin'] & ~counts['dead']].sum()['counts'],
                counts[~counts['has_variant'] & ~counts['epirubicin'] & ~counts['dead']].sum()['counts'],
                counts[counts['has_variant'] & counts['epirubicin'] & ~counts['dead']].sum()['counts'],
                counts[~counts['has_variant'] & counts['epirubicin'] & ~counts['dead']].sum()['counts'],
                counts[counts['has_variant'] & ~counts['epirubicin'] & counts['dead']].sum()['counts'],
                counts[~counts['has_variant'] & ~counts['epirubicin'] & counts['dead']].sum()['counts'],
                counts[counts['has_variant'] & counts['epirubicin'] & counts['dead']].sum()['counts'],
                counts[~counts['has_variant'] & counts['epirubicin'] & counts['dead']].sum()['counts'],
               ]
  ))])
 
fig.show()

The baseline survival rate is known to be 50% for patients both with and without the variant. Let's see whether either medication had any effect on survival.

In [None]:
from statsmodels.stats.weightstats import ztest

# 3 "Doxorubicin - Variant"
alive = [False]*counts[counts['has_variant'] & ~counts['epirubicin'] & counts['dead']].sum()['counts'] + \
        [True]*counts[counts['has_variant'] & ~counts['epirubicin'] & ~counts['dead']].sum()['counts']
tstat, pval = ztest(alive, value=0.5, alternative='larger')
print(f"There is a {(1-pval)*100:.5f}% chance Doxorubicin improves survivability for patients WITH Variant.")

# 4 "Doxorubicin - No Variant"
alive = [False]*counts[~counts['has_variant'] & ~counts['epirubicin'] & counts['dead']].sum()['counts'] + \
        [True]*counts[~counts['has_variant'] & ~counts['epirubicin'] & ~counts['dead']].sum()['counts']
tstat, pval = ztest(alive, value=0.5, alternative='larger')
print(f"There is a {(1-pval)*100:.5f}% chance Doxorubicin improves survivability for patients WITHOUT Variant.")

# 5 "Epirubicin - Variant"
alive = [False]*counts[counts['has_variant'] & counts['epirubicin'] & counts['dead']].sum()['counts'] + \
        [True]*counts[counts['has_variant'] & counts['epirubicin'] & ~counts['dead']].sum()['counts']
tstat, pval = ztest(alive, value=0.5, alternative='larger')
print(f"There is a {(1-pval)*100:.5f}% chance Epirubicin improves survivability for patients WITH Variant.")

# 6 "Epirubicin - No Variant"
alive = [False]*counts[~counts['has_variant'] & counts['epirubicin'] & counts['dead']].sum()['counts'] + \
        [True]*counts[~counts['has_variant'] & counts['epirubicin'] & ~counts['dead']].sum()['counts']
tstat, pval = ztest(alive, value=0.5, alternative='larger')
print(f"There is a {(1-pval)*100:.5f}% chance Epirubicin improves survivability for patients WITHOUT Variant.")

If we select a significance level of 0.01, then only tests with p < 0.01 (printed above as values greater than 99%) are significant.

We can conclude from the above results that Doxorubicin should be prescribed for patients without the variant, and Epirubicin should be prescribed for patients with the variant, since both of these treatments result in statistically significant improvements in survival.
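
To make the test concrete, here is a minimal hand-rolled one-sample, one-sided z-test using only the standard library. It is meant to illustrate the idea behind the calls above, not to reproduce the `statsmodels` implementation exactly, and the survival counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def one_sample_ztest_larger(sample, value):
    """Test H0: mean == value against H1: mean > value; returns (z, p)."""
    n = len(sample)
    std_err = stdev(sample) / sqrt(n)       # standard error of the sample mean
    z = (mean(sample) - value) / std_err
    p = 1.0 - NormalDist().cdf(z)           # upper-tail probability under H0
    return z, p

# hypothetical treatment group: 70 of 100 patients survive
alive = [1] * 70 + [0] * 30
z, p = one_sample_ztest_larger(alive, value=0.5)
print(f"z = {z:.2f}, p = {p:.2g}")
print("significant at 0.01" if p < 0.01 else "not significant at 0.01")
```

A small p-value means the observed survival rate would be unlikely if the true rate were the 50% baseline.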

## References <a id="references"></a>
[0]  <a id="r0"></a> Azure Machine Learning: https://docs.microsoft.com/en-us/azure/machine-learning/  
[1]  <a id="r1"></a> Walonoski, Jason, et al. "Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record." Journal of the American Medical Informatics Association 25.3 (2018): 230-238. The MITRE Corporation. https://github.com/synthetichealth/synthea  
[2]  <a id="r2"></a> FHIR HL7: http://hl7.org/fhir/index.html  
[3]  <a id="r3"></a> Azure API for FHIR: https://docs.microsoft.com/en-us/azure/healthcare-apis/fhir/overview  
[4] <a id="r4"></a> Azure Active Directory: https://docs.microsoft.com/en-us/azure/active-directory/fundamentals/active-directory-whatis  
[5] <a id="r5"></a> Azure App Registration: https://docs.microsoft.com/en-us/azure/healthcare-apis/register-application  
[6] <a id="r6"></a> Azure Role Based Access Control (RBAC): https://docs.microsoft.com/en-us/azure/healthcare-apis/azure-api-for-fhir/configure-azure-rbac  
[7] <a id="r7"></a> Azure AD Access Tokens: https://docs.microsoft.com/en-us/azure/healthcare-apis/azure-api-for-fhir/azure-active-directory-identity-configuration  
[8] <a id="r8"></a> Microsoft `fhir-loader`: https://github.com/microsoft/fhir-loader  
[9] <a id="r9"></a> Azure `bulk-import`: https://docs.microsoft.com/en-us/azure/healthcare-apis/fhir/configure-import-data  
[10] <a id="r10"></a> Azure Healthcare APIs changelog: https://docs.microsoft.com/en-us/azure/templates/microsoft.healthcareapis/change-log/services  
[11] <a id="r11"></a> Postman API Platform: https://www.postman.com/  
[12] <a id="r12"></a> Postman FHIR Tutorial: https://docs.microsoft.com/en-us/azure/healthcare-apis/fhir/use-postman  
[13] <a id="r13"></a> FHIR to Synapse Sync Agent Tutorial: https://github.com/microsoft/FHIR-Analytics-Pipelines/blob/main/FhirToDataLake/docs/Deployment.md  
[14] <a id="r14"></a> `bcftools` Documentation: https://samtools.github.io/bcftools/bcftools.html  
[15] <a id="r15"></a> Parquet File Format: https://spark.apache.org/docs/latest/sql-data-sources-parquet.html  
[16] <a id="r16"></a> Python `pandas` library: https://pandas.pydata.org/  
[17] <a id="r17"></a> Azure Synapse Analytics: https://docs.microsoft.com/en-us/azure/synapse-analytics/

# Notices


THIS NOTEBOOK JUST PROVIDES SAMPLE CODE FOR EDUCATIONAL PURPOSES. MICROSOFT DOES NOT CLAIM ANY OWNERSHIP OF THIS CODE OR THESE LIBRARIES. MICROSOFT PROVIDES THIS NOTEBOOK, THE SAMPLE USE OF LIBRARIES, AND ANY DATA OR MATERIAL IN THIS NOTEBOOK ON AN “AS IS” BASIS. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, GUARANTEES OR CONDITIONS WITH RESPECT TO YOUR USE OF THIS NOTEBOOK. TO THE EXTENT PERMITTED UNDER YOUR LOCAL LAW, MICROSOFT DISCLAIMS ALL LIABILITY FOR ANY DAMAGES OR LOSSES, INCLUDING DIRECT, CONSEQUENTIAL, SPECIAL, INDIRECT, INCIDENTAL OR PUNITIVE, RESULTING FROM YOUR USE OF THIS NOTEBOOK.

#### Notebook prepared by [Tim Dunn](https://github.com/TimD1), Research Intern, Microsoft Biomedical Platforms and Genomics