# PIC-SURE API for the Genomic Information Commons

This is a tutorial notebook aimed to get the user quickly up and running with the PIC-SURE API.

### Genomic Information Commons

The [Genomic Information Commons (GIC)](https://www.genomicinformationcommons.org/) is the first queryable, federated, genomic data collaboration between leading hospitals in the nation and the first genomic data commons in the world that offers participating institutions the ability to leverage globally scalable technologies, policies, and procedures for sharing genomic data, phenotypic data, and biospecimen metadata on broadly consented cohorts, across sites of care. 

The GIC leverages a multi-institutional patient population of diverse backgrounds with unparalleled representation across the spectrum of diseases. Researchers and clinicians employed by member institutions have access to the combined patient population, which is continuously updating and available to query in aggregate view using the [GIC Portal](https://pl-gic.childrens.harvard.edu/).

Institutional data is stored locally and does not leave member institutions without approvals from institutional IRBs and the GIC's Federated Data and Sample Access Committee. Additionally, member institutions have full insight into use of their data by other member institutions.

___

### What is PIC-SURE?

The Patient Information Commons Standard Unification of Research Elements (PIC-SURE) platform integrates clinical and genomic data from the PrecisionLink Biobank.

Original data exposed through the PIC-SURE API encompasses a large heterogeneity of data organization underneath. PIC-SURE hides this complexity and exposes the different study datasets in a single tabular format. By simplifying the process of data extraction, it allows investigators to focus on downstream analysis and to facilitate reproducible science.

More about PIC-SURE
The API is available in two different programming languages, python and R, enabling investigators to query the databases the same way using either language.

PIC-SURE is a larger project from which the R/python PIC-SURE API is only a brick. Among other things, PIC-SURE also offers a graphical user interface that allows researchers to explore variables across multiple studies, filter patients that match criteria, and create cohorts from this interactive exploration. The python API is actively developed by the Avillach Lab at Harvard Medical School.

**GitHub repo:**

https://github.com/hms-dbmi/pic-sure-python-adapter-hpds

https://github.com/hms-dbmi/pic-sure-python-client

---

### Retrieve Personal User Security Token
In order to be able to run any of these examples, you'll need to get a **personal user security token.** This is the way the API grants access to individual users to protected-access data.The user token is strictly personal, be careful not to share it with anyone.
<br>

**How to retrieve your personal security token:**
1. In a web browser, navigate to the GIC instance of PIC-SURE
    - Login using your institutional credentials (i.e. Boston Childrens email and password)
2. In the user interface, click the **User Profile** tab
3. A modal will open with a your personal secuirty token.
    - Check the expiration date of your user token. If expired click refresh.
    - Click copy
4. Navigate back to the Jupyter environment.
    - In the folder where your jupyter notebook is located, create a **new** text file: `token.txt` 
    - Paste the personal security token in the text file and click save.
--- 

## Environment Set-up

### Pre-requisites: 
* Python >= 3.7
* pip: Python package manager, already available in most system with a Python interpreter installed 

* [pip installation instructions](https://pip.pypa.io/en/stable/installing/).

### IPython magic command
The two lines of code below load the autoreload IPython extension. Although not necessary to execute the rest of the Notebook, it does enable to reload every dependency each time the python code is executed. 

**This enables the system to take into account changes made in an external file that have been imported into this Notebook (e.g. user defined function stored in separate file), without having to manually reload libraries**. 

This is especially helpful when developing interactively. More information about [IPython Magic commands](https://ipython.readthedocs.io/en/stable/interactive/magics.html) can be found here. 

In [2]:
%load_ext autoreload
%autoreload 2

### Package installation
- Install the packages listed in the `requirements.txt` file using the pip package manager

In [3]:
!cat requirements.txt

numpy==1.22.0
matplotlib>=3.1.1
pandas>=0.25.3
scipy>=1.3.1
tqdm>=4.38.0
statsmodels>=0.10.2


In [4]:
# set up environment
import sys
!{sys.executable} -m pip install -r requirements.txt
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git


Collecting git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
  Cloning https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git to /private/var/folders/4l/6p4fk5f13f572f45_7zs6jfc0000gn/T/pip-req-build-ft6nu2t7
  Running command git clone --quiet https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git /private/var/folders/4l/6p4fk5f13f572f45_7zs6jfc0000gn/T/pip-req-build-ft6nu2t7
  Resolved https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git to commit 7b5c4b3fd544be200adaf50b17e4e7d6af5778fb
  Preparing metadata (setup.py) ... [?25ldone
[?25hCollecting httplib2
  Using cached httplib2-0.21.0-py3-none-any.whl (96 kB)
Collecting pyparsing!=3.0.0,!=3.0.1,!=3.0.2,!=3.0.3,<4,>=2.4.2
  Using cached pyparsing-3.0.9-py3-none-any.whl (98 kB)
Building wheels for collected packages: PicSureHpdsLib
  Building wheel for PicSureHpdsLib (setup.py) ... [?25ldone
[?25h  Created wheel for PicSureHpdsLib: filename=PicSureHpdsLib-0.9.0-py2.py3-none-any.whl size=22057 s

### Import Dependencies
Import all the external dependencies, as well as user-defined functions stored in the _python_lib_ folder

In [5]:
# Useful to estimate execution time of the Notebook
from datetime import datetime
then = datetime.now()

# PIC-SURE API library
import PicSureHpdsLib
import PicSureClient

# Python library for PIC-SURE
from python_lib.utils import get_multiIndex_variablesDict

# Analysis
import pandas as pd
from pprint import pprint

## Connecting to a PIC-SURE network

Connect to the data network using the HPDS adapter
Three pieces of information are needed to get access to data through the PIC-SURE API: 
- Network URL
- Resource id
- User security token -- this is specific to a given URL + resource 

In [6]:
# Personal Security token is the individual user key given to connect to the GIC institute resource
token_file = "token.txt"
with open(token_file, "r") as f:
    my_token = f.read()

In [8]:
# Connection to the PIC-SURE API w/ key

# network information - insert PIC-SURE instance url
PICSURE_network_URL = "https://pl-gic.childrens.harvard.edu/picsure" 
# get connection object
connection = PicSureClient.Client.connect(url = PICSURE_network_URL,
                                 token = my_token)

+--------------------------------------+------------------------------------------------------+
|  Resource UUID                       |  Resource Name                                       |
+--------------------------------------+------------------------------------------------------+
| 7fdb91ab-aceb-472a-b276-490d1729f841 | CHOP                                                 |
| 43ddc535-7740-4c1b-b961-609dd1a0525c | WASHU                                                |
| 04bad269-3b87-4cd4-ac62-9cedfb0096ea | BCH                                                  |
| 6e5f3248-cee5-417c-af40-992cb836c3d3 | CCHMC                                                |
| 31316431-3832-6235-2d33-6332312d3131 | Common-Search                                        |
+--------------------------------------+------------------------------------------------------+


In [10]:
# Get adapter object
adapter = PicSureHpdsLib.Adapter(connection)

In [11]:
# insert your resource id generated above 
resource_id = '' 
# get resource object
resource = adapter.useResource(resource_id)

#### Three objects have been created using the **PICSURE**  and **HPDS** libraries:
1. Connection object
2. Adapter object
3. Resource object

The connection object is useful for getting access to different databases stored in different resources. It is connected to the specific data source ID we specified and enables us to query and retrieve data from this database.

#### Getting help with the Python PIC-SURE API
Each object exposed by the PicSureHpdsLib library got a help() method. Calling it will print out a helper message about it.

In [12]:
# get resource documentation
resource.help()


        [HELP] PicSureHpdsLib.useResource(resource_uuid)
            .dictionary()       Used to access data dictionary of the resource
            .query()            Used to query against data in the resource
            .retrieveQueryResults(query_uuid) returns the results of an asynchronous query that has already been submitted to PICSURE

        [ENVIRONMENT]
              Endpoint URL: https://pl-gic.childrens.harvard.edu/picsure/
             Resource UUID: 04bad269-3b87-4cd4-ac62-9cedfb0096ea


## Using the variables dictionary


Once a connection to the desired resource has been established, we first need to understand which variables are available in the database. To this end, we will use the `dictionary` method of the `resource` object.

A `dictionary` instance enables us to retrieve matching records by searching for a specific term, or to retrieve information about all the available variables, using the `find()` method. For instance, looking for variables containing the term `Calcium` in their names is done this way: 

In [13]:
# Initialize the dictionary
dictionary = resource.dictionary()

#search for all variables that contain calcium in their name
dictionary_search = dictionary.find("Calcium")

Note: Using the `dictionary.find()` function without arguments will return every entry, as shown in the help documentation.
We included the term "Calcium" as we are only interested in entries related to calcium.

Subsequently, objects created by the `dictionary.find` method expose the search results via 4 different methods: `.count()`, `.keys()`, `.entries()`, and `.DataFrame()`.

In [14]:
pprint({"Count": dictionary_search.count(), 
        "Keys": dictionary_search.keys()[0:5],
        "Entries": dictionary_search.entries()[0:5]})

{'Count': 41,
 'Entries': [{'HpdsDataType': 'phenotypes',
              'categorical': True,
              'categoryValues': ['T46.1X4D Poisoning by calcium-channel '
                                 'blockers, undetermined, subsequent '
                                 'encounter'],
              'name': '\\ACT Diagnosis ICD-10\\S00-T88 Injury, poisoning and '
                      'certain other consequences of external causes '
                      '(S00-T88)\\T36-T50 Poisoning by, adverse effects of and '
                      'underdosing of drugs, medicaments and biological '
                      'substances (T36-T50)\\T46 Poisoning by, adverse effect '
                      'of and underdosing of agents primarily affecting the '
                      'cardiovascular system\\T46.1 Poisoning by, adverse '
                      'effect of and underdosing of calcium-channel '
                      'blockers\\T46.1X Poisoning by, adverse effect of and '
                      'under

**The `.DataFrame()` method enables us to get the result of the dictionary search in a pandas DataFrame format. This way, it allows us to:** 


* Use the various information exposed in the dictionary (patient count, variable type ...) as criteria for variable selection.
* Use the row names of the DataFrame to get the actual variable names to be used in the query, as shown below.

Variable names aren't very pratical to use right away for two reasons:
1. Very long
2. Presence of backslashes that prevent from copy-pasting. 

However, retrieving the dictionary search result in the form of a dataframe can help access the variable names.

In [15]:
dictionary_search.DataFrame().head()

Unnamed: 0_level_0,categorical,categoryValues,patientCount,observationCount,HpdsDataType,min,max
KEY,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
"\ACT Diagnosis ICD-10\S00-T88 Injury, poisoning and certain other consequences of external causes (S00-T88)\T36-T50 Poisoning by, adverse effects of and underdosing of drugs, medicaments and biological substances (T36-T50)\T46 Poisoning by, adverse effect of and underdosing of agents primarily affecting the cardiovascular system\T46.1 Poisoning by, adverse effect of and underdosing of calcium-channel blockers\T46.1X Poisoning by, adverse effect of and underdosing of calcium-channel blockers\T46.1X4 Poisoning by calcium-channel blockers, undetermined\",True,[T46.1X4D Poisoning by calcium-channel blocker...,1,2,phenotypes,,
\ACT Medications\C [Preparations]\Calcium Carbonate / Folic Acid / Magnesium Carbonate\,False,,5809,113018,phenotypes,0.25,2500.0
\ACT Medications\C [Preparations]\Calcium Ascorbate / Calcium Threonate / Ferrous Asparto Glycinate / Ferrous Fumarate / Folic Acid / Succinic Acid / Vitamin B 12\,False,,4,43,phenotypes,1.0,1.0
\ACT Medications\C [Preparations]\Calcium Ascorbate / Calcium Threonate / Ferrous Asparto Glycinate / Liver Stomach Concentrate / Succinic Acid / Vitamin B 12\,False,,4,43,phenotypes,1.0,1.0
\ACT Medications\C [Preparations]\Calcium Carbonate\,False,,6098,202833,phenotypes,0.625,7500.0


The dictionary currently returned by the API provides information about the variables, such as:
- observationCount: number of entries with non-null value
- categorical: type of the variables, True if strings, False if numerical
- min/max: only provided for numerical variables
- HpdsDataType: 'phenotypes' or 'genotypes'. Currently, the API only expsoses'phenotypes' variables

## Querying and retrieving data: Autism and *CPT1A*

We can retrieve data from the resource using the `query` object. 

In [16]:
# Initialize the query object 
new_query = resource.query()

The query object has several methods that enable to build a query.

| Method | Arguments / Input | Output|
|--------|-------------------|-------|
| query.select.add() | variable names (string) or list of strings | all variables included in the list (no record subsetting)|
| query.require.add() | variable names (string) or list of strings | all variables; only records that do not contain null values for input variables |
| query.anyof.add() | variable names (string) or list of strings | all variables; only records that contain at least one non-null value for input variables |
| query.filter.add() | variable name and additional filtering values | input variable; only records that match filter criteria |

All 4 methods can be combined when building a query. The record eventually returned by the query have to meet all the different specified filters.

### Example Query
Next we will demonstrate how to build a query to return the number of patients with:
- F84.0 autism disorder

        AND 
- genomic variant: *CPT1A* 

#### Add phenotypic variable: Autistic disorder

In [17]:
# find value F84.0 Autistic Disorder by searching for ICD10 code F84
search_autism = dictionary.find("F84").DataFrame() 

#display the  autism dataframe 
search_autism

Unnamed: 0_level_0,categorical,categoryValues,patientCount,observationCount,HpdsDataType,description
KEY,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"\ACT Diagnosis ICD-10\F01-F99 Mental, Behavioral and Neurodevelopmental disorders (F01-F99)\F80-F89 Pervasive and specific developmental disorders (F80-F89)\F84 Pervasive developmental disorders\",True,"[F84.0 Autistic disorder, F84.2 Retts syndrome...",21128.0,261772.0,phenotypes,
Gene_with_variant,True,"[C3orf84, ZNF84-DT, ZNF84, ZNF846, ZNF845, ZNF...",,,info,"Description=""The official symbol for a gene af..."


In [18]:
# select the variable (key) of interest
autism_key = search_autism.index.values[0]

# check to see if the variable matches what you expect to see 
print(autism_key)

#define the value of interest
autism_value = 'F84.0 Autistic disorder'

\ACT Diagnosis ICD-10\F01-F99 Mental, Behavioral and Neurodevelopmental disorders (F01-F99)\F80-F89 Pervasive and specific developmental disorders (F80-F89)\F84 Pervasive developmental disorders\


In [19]:
# Initialize query object 
new_query = resource.query()

In [20]:
# add filter F84.0 Autistic disorder to your query 
new_query.filter().add(autism_key,autism_value)

# return number of patients who match the filtered criteria 
print('Number of patients who have been diagnosed with autistic disorder:', '\033[1m', new_query.getCount())

Number of patients who have been diagnosed with autistic disorder: [1m 19797


#### Add genomic filter: CPT1A variant

In [21]:
# variable = Gene_with_variant, values = all variants for CPT1 gene 
new_query.filter().add("Gene_with_variant", "CPT1A") 

# return number of patients who match all specified filters
print('Number of patients with autistic disorder and CPT1A variant:', '\033[1m', new_query.getCount())

Number of patients with autistic disorder and CPT1A variant: [1m 263
