# PIC-SURE API use-case: quick analysis on COPDGene data

This tutorial notebook aims to get the user up and running quickly with the python PIC-SURE API. It covers the main functionalities of the API.

## PIC-SURE python API 
### What is PIC-SURE? 


As part of the BioData Catalyst initiative, the Patient Information Commons Standard Unification of Research Elements (PIC-SURE) platform has been integrating clinical and genomic datasets from multiple TOPMed and TOPMed related studies funded by the National Heart Lung and Blood Institute (NHLBI). 

The data exposed through the PIC-SURE API come from studies whose underlying data organization is highly heterogeneous. PIC-SURE hides this complexity and exposes the different study datasets in a single tabular format. By simplifying data extraction, it allows investigators to focus on downstream analysis and facilitates reproducible science.

### More about PIC-SURE
The API is available in two different programming languages, python and R, enabling investigators to query the databases the same way using either language.

The R/python PIC-SURE API is only one component of the larger PIC-SURE project. Among other things, PIC-SURE also offers a graphical user interface that allows researchers to explore variables across multiple studies, filter patients that match criteria, and create cohorts from this interactive exploration.

The python API is actively developed by the Avillach Lab at Harvard Medical School.

PIC-SURE API GitHub repo:
* https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-python-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-python-client



 -------   

# Getting your own user-specific security token

**Before running this notebook, please review the "Get your security token" documentation in the NHLBI_BioData_Catalyst [README.md file](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/NHLBI_BioData_Catalyst#get-your-security-token). It explains how to get a security token, which is mandatory to access the databases.**

# Environment set-up

### Pre-requisites
- python 3.6 or later
- pip python package manager, already available in most systems with a python interpreter installed ([pip installation instructions](https://pip.pypa.io/en/stable/installing/))

### Install Packages

Install the following:
- packages listed in the `requirements.txt` file (listed below, along with version numbers)
- PIC-SURE API components (from Github)
    - PIC-SURE Adapter 
    - PIC-SURE Client

In [None]:
!cat requirements.txt

In [None]:
import sys
!{sys.executable} -m pip install -r requirements.txt

In [None]:
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git

Import all the external dependencies, as well as user-defined functions stored in the `python_lib` folder

In [None]:
import json
from pprint import pprint

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from scipy import stats

import PicSureClient
import PicSureBdcAdapter

from python_lib.utils import get_multiIndex_variablesDict, joining_variablesDict_onCol

##### Set the display parameter for tables and plots

In [None]:
# Pandas DataFrame display options
pd.set_option("display.max_rows", 100)

# Matplotlib display parameters
plt.rcParams["figure.figsize"] = (14,8)
font = {'weight' : 'bold',
        'size'   : 12}
plt.rc('font', **font)

## Connecting to a PIC-SURE resource

The following is required to get access to data through the PIC-SURE API: 
- a network URL
- a resource id, and 
- a user-specific security token.

If you have not already retrieved your user-specific token, please refer to the "Get your security token" section of the [README.md](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/NHLBI_BioData_Catalyst#get-your-security-token) file.

In [None]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
resource_id = "02e23f52-f354-4e8b-992c-d37c8b9ba140"
token_file = "token.txt"

In [None]:
with open(token_file, "r") as f:
    # Strip surrounding whitespace, such as a trailing newline
    my_token = f.read().strip()
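Text files often end with a newline, which would otherwise become part of the token string. The sketch below illustrates this with a temporary file and a placeholder value; the file name and the token content are invented for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical token file, written only for this illustration
    token_path = os.path.join(tmp_dir, "token.txt")
    with open(token_path, "w") as f:
        f.write("not-a-real-token\n")

    with open(token_path, "r") as f:
        raw_token = f.read()

# strip() removes the trailing newline and any surrounding spaces
clean_token = raw_token.strip()
print(repr(raw_token), repr(clean_token))
```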

In [None]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, my_token, True)
adapter = PicSureBdcAdapter.Adapter(connection)
resource = adapter.useResource(resource_id)

Two objects are created here: a `connection` and a `resource` object.

Since we will only be using a single resource, **the `resource` object is actually the only one we will need to proceed with data analysis hereafter**. 

It is connected to the specific data source ID we specified and enables us to query and retrieve data from this database.

## Getting help with the PIC-SURE API

Each object exposed by the PicSureBdcHpds library has a `help()` method. Calling it without parameters prints information about the functionalities of that object.

In [None]:
resource.help()

For instance, this output tells us that this `resource` object has 3 methods, and it gives a quick definition of those methods. 

## Using the *variables dictionary*

Once a connection to the desired resource has been established, we first need to understand which variables are available in the database. To this end, we will use the `dictionary` method of the `resource` object.

A `dictionary` instance enables us to retrieve matching records by searching for a specific term, or to retrieve information about all the available variables, using the `find()` method. For instance, looking for variables containing the term `COPD` in their names is done this way: 

In [None]:
dictionary = resource.dictionary()
dictionary_search = dictionary.find("COPD")

Subsequently, objects created by the `dictionary.find` method expose the search results via 4 different methods: `.count()`, `.keys()`, `.entries()`, and `.DataFrame()`. 

In [None]:
pprint({"Count": dictionary_search.count(), 
        "Keys": dictionary_search.keys()[0:5],
        "Entries": dictionary_search.entries()[0:5]})

In [None]:
dictionary_search.DataFrame().head()

**The `.DataFrame()` method enables us to get the result of the dictionary search in a pandas DataFrame format. This way, it allows us to:** 


* Use the various information exposed in the dictionary (patient count, variable type ...) as criteria for variable selection.
* Use the row names of the DataFrame to get the actual variable names to be used in the query, as shown below.

Variable names aren't very practical to use right away for two reasons:
1. They are very long.
2. They contain backslashes, which get in the way of copy-pasting.

However, retrieving the dictionary search result in the form of a dataframe can help access the variable names. Let's say we want to retrieve every variable from the COPDGene study:

In [None]:
plain_variablesDict = resource.dictionary().find("COPDGene").DataFrame()
plain_variablesDict.shape

In [None]:
plain_variablesDict.iloc[10:20,:]
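The backslash issue can be illustrated with a plain Python string. This is a minimal sketch: the variable name is invented and only mimics the backslash-delimited layout of the names in the dictionary.

```python
# Hypothetical variable name, for illustration only.
# A raw string (r"...") keeps each backslash literal. A raw string cannot end
# with a single backslash, so the trailing one is appended separately.
var_name = r"\Toy study ( phs000000 )\Subject Phenotype\Age at enrollment" + "\\"

# Splitting on the backslash recovers the levels of the path
levels = var_name.strip("\\").split("\\")
print(levels)

# The last level is usually the most readable label
short_name = levels[-1]
print(short_name)
```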

Using the `dictionary.find()` function without arguments will return every entry, as shown in the help documentation.
We included the term "COPDGene" as we are only interested in entries related to COPDGene.

In [None]:
resource.dictionary().help()

The dictionary currently returned by the API provides information about the variables, such as:
- observationCount: number of entries with non-null value
- categorical: type of the variables, True if strings, False if numerical
- min/max: only provided for numerical variables
- HpdsDataType: 'phenotypes' or 'genotypes'. Currently, the API only exposes 'phenotypes' variables
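These fields can serve as selection criteria once the dictionary is a DataFrame. The sketch below uses a toy DataFrame with the same column names; the variable names and values are invented and do not come from the API.

```python
import pandas as pd

# Toy stand-in for a dictionary search result (invented names and values)
toy_dict = pd.DataFrame(
    {
        "observationCount": [5000, 120, 4800],
        "categorical": [True, True, False],
        "min": [None, None, 18.0],
        "max": [None, None, 90.0],
        "HpdsDataType": ["phenotypes"] * 3,
    },
    index=["\\Toy study\\Sex\\", "\\Toy study\\Rare flag\\", "\\Toy study\\Age\\"],
)

# Numerical variables with at least 1000 non-null observations
numeric_well_populated = toy_dict[
    ~toy_dict["categorical"] & (toy_dict["observationCount"] >= 1000)
]
print(numeric_well_populated.index.tolist())
```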

### Export full data dictionary to CSV

To export the data dictionary, we first create a pandas DataFrame called `fullVariableDict`.

In [None]:
fullVariableDict = resource.dictionary().find().DataFrame()

Check that the fullVariableDict dataframe contains some values.

In [None]:
fullVariableDict.iloc[0:3,:]

In [None]:
fullVariableDict.to_csv('data_dictionary.csv')

You should now see ```data_dictionary.csv``` in the JupyterHub file explorer, in the same folder as this notebook. 
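If the exported file is read back later, the variable names need to be restored as the index. A minimal sketch of that round trip, using a toy dictionary and a temporary folder rather than the real export:

```python
import os
import tempfile

import pandas as pd

# Toy dictionary with invented variable names as the index
toy_dict = pd.DataFrame(
    {"observationCount": [5000, 120], "categorical": [True, False]},
    index=pd.Index(["\\Toy study\\Sex\\", "\\Toy study\\Age\\"], name="name"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy_dictionary.csv")
    toy_dict.to_csv(csv_path)
    # index_col=0 turns the first CSV column back into the index
    reloaded = pd.read_csv(csv_path, index_col=0)

print(reloaded.index.tolist() == toy_dict.index.tolist())
```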

### Variable dictionary + pandas multiIndex

We can use a simple user-defined function (`get_multiIndex_variablesDict`) to add a little more information to the variable dictionary and to simplify working with variables names. It takes advantage of pandas MultiIndex functionality [see pandas official documentation on this topic](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html).

Although not an official feature of the API, such functionality illustrates how to quickly select groups of related variables.

Printing the multiIndexed variable dictionary gives a quick view of the tree-like organization of the variable names. Moreover, the original and simplified variable names are now stored in the `name` and `simplified_name` columns respectively (the simplified name is the last component of the variable name, which is usually the most informative about what the variable represents).

In [None]:
variablesDict = get_multiIndex_variablesDict(plain_variablesDict)

In [None]:
variablesDict
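To give an idea of what such a helper does, here is a minimal sketch of the same idea on a toy dictionary. It is not the actual `get_multiIndex_variablesDict` implementation: the variable names are invented, and it assumes every name has the same number of levels.

```python
import pandas as pd

toy_dict = pd.DataFrame(
    {"observationCount": [5000, 4800, 300]},
    index=[
        "\\Toy study\\Demographics\\Sex\\",
        "\\Toy study\\Demographics\\Age\\",
        "\\Toy study\\Smoking\\Ever smoked\\",
    ],
)

# Split each backslash-delimited name into its levels
levels = [name.strip("\\").split("\\") for name in toy_dict.index]

multi = toy_dict.copy()
multi["name"] = toy_dict.index.tolist()                      # original name
multi["simplified_name"] = [parts[-1] for parts in levels]   # last level only
multi.index = pd.MultiIndex.from_tuples([tuple(parts) for parts in levels])

# Select every variable under one branch of the tree
demographics = multi.xs("Demographics", level=1)
print(demographics["simplified_name"].tolist())
```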

In [None]:
# Now that we have seen how our entire dictionary looked, we limit the number of lines to be displayed for the future outputs
pd.set_option("display.max_rows", 50)

Below is a simple example to illustrate the simplicity of using a multiIndex dictionary. Let's say we are interested in every variable whose name contains either "asthma" or "smoking".

In [None]:
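# Labels of the third index level; non-string entries (e.g. NaN) never match
level_values = variablesDict.index.get_level_values(2)
mask_asthma = np.array([isinstance(i, str) and "asthma" in i for i in level_values])
mask_smoking = np.array([isinstance(i, str) and "smoking" in i for i in level_values])

# Element-wise OR: `or` on two lists would simply return the first list
asthma_and_smoking_vars = variablesDict.loc[mask_asthma | mask_smoking, :]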

asthma_and_smoking_vars

Although pretty simple, it can be easily combined with other filters to quickly select one or more desired groups of variables.
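Combining masks element by element requires NumPy arrays or pandas Series together with the `|` and `&` operators, since Python's `or` and `and` do not operate element-wise on lists. A minimal sketch on a toy multiIndexed table, with invented labels and counts:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples([
    ("Toy study", "History", "asthma diagnosis"),
    ("Toy study", "History", "smoking status"),
    ("Toy study", "Demographics", "age"),
])
toy_vars = pd.DataFrame({"observationCount": [900, 5000, 5200]}, index=index)

last_level = toy_vars.index.get_level_values(2)
mask_asthma = np.array(["asthma" in label for label in last_level])
mask_smoking = np.array(["smoking" in label for label in last_level])
mask_count = (toy_vars["observationCount"] > 1000).to_numpy()

# Element-wise OR of the two term masks, then AND with a count criterion
selected = toy_vars.loc[(mask_asthma | mask_smoking) & mask_count, :]
print(selected.index.get_level_values(2).tolist())
```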

## Querying and retrieving data

The second cornerstone of the API is the `query` object. It is how we retrieve data from the resource.

First, we need to create a query object.

In [None]:
my_query = resource.query()

The query object has several methods that enable us to build a query.

| Method | Arguments / Input | Output |
|--------|-------------------|--------|
| query.select().add() | variable names (string) or list of strings | all variables included in the list (no record subsetting) |
| query.require().add() | variable names (string) or list of strings | all variables; only records that do not contain null values for input variables |
| query.anyof().add() | variable names (string) or list of strings | all variables; only records that contain at least one non-null value for input variables |
| query.filter().add() | variable name and additional filtering values | input variable; only records that match filter criteria |

All 4 methods can be combined when building a query. The records eventually returned by the query have to meet all of the specified filters.
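The record-subsetting rules in the table can be mimicked on a local DataFrame to make them concrete. The sketch below is a pandas analogy on invented data, not the API's implementation; in particular, treating the range bounds as inclusive is an assumption.

```python
import numpy as np
import pandas as pd

# Toy patient table (invented column names and values)
records = pd.DataFrame(
    {
        "age_stopped": [25.0, np.nan, 65.0, 80.0],
        "sex": ["F", None, None, "M"],
        "asthma": [None, None, "yes", "no"],
    },
    index=pd.Index([1, 2, 3, 4], name="patient_id"),
)

# require-like: non-null value for every listed variable
required = records.dropna(subset=["age_stopped", "sex"], how="any")

# anyof-like: non-null value for at least one listed variable
any_of = records.dropna(subset=["sex", "asthma"], how="all")

# filter-like: value inside a numerical range (bounds assumed inclusive)
filtered = records[records["age_stopped"].between(20, 70)]

# Combined criteria: a record must satisfy all of them
combined = records.loc[required.index.intersection(filtered.index)]
print(list(required.index), list(any_of.index), list(filtered.index), list(combined.index))
```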

### Building the query
In the following example, we build a query that returns data for patients in the COPDGene study who completely stopped smoking between the ages of 20 and 70. For these records, we pull the age at which they stopped smoking, along with categorical variables that have more than 4000 observations.

First, we create a mask to isolate the variable pertaining to the following text, and store its name in ```yo_stop_smoking_varname```:

    How old were you when you completely stopped smoking? [Years old]

In [None]:
mask = variablesDict["simplified_name"] == "How old were you when you completely stopped smoking? [Years old]"
yo_stop_smoking_varname = variablesDict.loc[mask, "name"] 

Next we create masks to further restrict the query.

```mask_cat``` isolates categorical variables.

```mask_count``` isolates variables with an observationCount value greater than 4000

```varnames``` pulls out the name of variables which satisfy the criteria for both ```mask_cat``` and ```mask_count```.

In [None]:
mask_cat = variablesDict["categorical"] == True
mask_count = variablesDict["observationCount"] > 4000
varnames = variablesDict.loc[mask_cat & mask_count, "name"]

By using the query.filter().add method on ```yo_stop_smoking_varname```, we are able to filter our results to only the variable associated with "How old were you when you completely stopped smoking? [Years old]". 

Additionally, we are able to filter the records by providing min and max arguments to this function. This means that our results will only contain entries that have values between 20 and 70 reported for the variable "How old were you when you completely stopped smoking? [Years old]".

We further build our query with the my_query.select().add method. Here, we add the first 50 variables from varnames.

In [None]:
my_query.filter().add(yo_stop_smoking_varname, min=20, max=70)
my_query.select().add(varnames[:50])
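A slice of the form `[:50]` keeps the leading entries, whereas `[-50:]` would keep the trailing ones. A small sketch of the difference, using invented names and a shorter slice:

```python
import pandas as pd

# Invented variable names, for illustration only
toy_varnames = pd.Series(["var_a", "var_b", "var_c", "var_d", "var_e"])

first_two = toy_varnames.iloc[:2].tolist()   # leading entries
last_two = toy_varnames.iloc[-2:].tolist()   # trailing entries
print(first_two, last_two)
```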

## Selecting consent groups

PIC-SURE will limit results based on which study / patient consent groups the researcher has individually been authorized for.

However, sometimes, you might need to limit your results further to only contain a subset of the groups you have been authorized for.

Use resource.list_consents() to view all consent groups you are authorized for, as well as whether they are part of the HarmonizedVariable dataset or the TopMed Freeze.

In [None]:
resource.list_consents()

If you would like to focus on specific groups within this list, you must clear the values within it and then manually replace them.

In this example, we will focus on the c2 consent group within the COPDGene study, which is reflected by code phs000179.c2.

*Note that trying to manually add a consent group which you are not authorized to access will result in errors downstream.*

In [None]:
# If you get the following error: "ERROR: the specified key does not exist", you can ignore it.
my_query.filter().delete("\\_consents\\")

In [None]:
my_query.filter().add("\\_consents\\", ['phs000179.c2'])

## Retrieving the data

Once our query object is finally built, we use the `getResultsDataFrame` function to retrieve the data corresponding to our query.

In [None]:
query_result = my_query.getResultsDataFrame(low_memory=False)

In [None]:
query_result.shape

In [None]:
query_result.tail()

From this point, we can proceed with any data analysis using other python libraries.

In [None]:
query_result[yo_stop_smoking_varname].plot.hist(legend=None, title= "Age stopped smoking", bins=15)
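As an example of such downstream analysis, the sketch below computes summary statistics and histogram counts for a numerical column. The ages are invented and only stand in for a column of a query result.

```python
import numpy as np
import pandas as pd

# Invented ages, standing in for one numerical column of a query result
ages = pd.Series([22, 35, 41, 41, 58, 63, 70], name="age_stopped_smoking")

# Descriptive statistics (count, mean, std, quartiles, min, max)
summary = ages.describe()

# Counts per 10-year bin between 20 and 70
counts, bin_edges = np.histogram(ages, bins=5, range=(20, 70))

print(summary[["mean", "min", "max"]])
print(counts, bin_edges)
```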

## Retrieving data from query run through PIC-SURE UI

It is possible for you to retrieve the results of a query that you have previously run using the PIC-SURE UI. To do this you must "select data for export", then select the information that you want the query to return and then click "prepare data export". Once the query is finished executing, a group of buttons will be presented.  Click the "copy query ID to clipboard" button to copy your unique query identifier so you can paste it into your notebook.


Paste your query's ID into your notebook and assign it to a variable. You then call the `resource.retrieveQueryResults(yourQueryUUID)` function on an initialized resource object to retrieve the data from your query, as shown below.

Note that query IDs do not last forever and will expire.

The screenshot below shows the button of interest in the PIC-SURE UI. It shows that the previously run query has a Query ID of `bf3ddba5-de5f-460b-bcbc-ff56410d3075`. At this point a copy-paste process is used to provide the Query ID to the API, as shown in the example code below. To run this code you must replace the example query ID with a query ID from a query that you have run in the PIC-SURE UI.

<img src="https://drive.google.com/uc?id=1e38XT07bJ-JiO8oqbM5SydvVEozYavOm">

In [None]:
# To run this using your notebook you must replace it with the ID value of a query that you have run.
DataSetID = '<<replace with your QuerySetID>>'

In [None]:
%%capture
results = resource.retrieveQueryResults(DataSetID)

In [None]:
from io import StringIO
df_UI = pd.read_csv(StringIO(results), low_memory=False)

In [None]:
df_UI.head()
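The parsing step above treats the retrieved results as CSV text. A minimal sketch of that step on an invented CSV string, so it can be run without a query ID:

```python
from io import StringIO

import pandas as pd

# Invented CSV text, standing in for the string of retrieved results
csv_text = (
    "Patient ID,\\Toy study\\Age\\,\\Toy study\\Sex\\\n"
    "1,54,F\n"
    "2,61,M\n"
)

# StringIO wraps the string in a file-like object that read_csv accepts
toy_df = pd.read_csv(StringIO(csv_text), low_memory=False)
print(toy_df.shape)
print(toy_df.columns.tolist())
```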