# PIC-SURE API tutorial using CureSC database

This tutorial notebook is intended to get the user quickly up and running with the python PIC-SURE API. It covers the main functionalities of the API.

## PIC-SURE python API 
### What is PIC-SURE? 

PIC-SURE stands for Patient-centered Information Commons: Standardized Unification of Research Elements. The original data exposed through the PIC-SURE API are organized in many heterogeneous ways. PIC-SURE hides this complexity and exposes the different study datasets in a single tabular format. By simplifying data extraction, it allows investigators to focus on downstream analyses and facilitates reproducible science.

### More about PIC-SURE
The API is available in two programming languages, python and R, allowing investigators to query datasets in the same way using either language. The R/python PIC-SURE API is a small part of the entire PIC-SURE platform.

The API is actively developed by the Avillach Lab at Harvard Medical School.

GitHub repo:
* https://github.com/hms-dbmi/pic-sure-python-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-python-client


 -------   

# Getting your own user-specific security token

**Before running this notebook, please be sure you have [added your security token](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/Cure_Sickle_Cell#get-your-security-token). The linked documentation explains how to get a security token, which is required to access the databases.**

# Environment set-up

### Pre-requisite
- python 3.6 or later (earlier versions of python 3 should work too)
- pip: the python package manager, already available on most systems with a python interpreter installed ([pip installation instructions](https://pip.pypa.io/en/stable/installing/))

### IPython magic command

The following code loads the `autoreload` IPython extension. Although `autoreload` is not necessary to execute the rest of the notebook, it makes the notebook reload every dependency each time python code is executed. Changes in imported external files, such as the user-defined functions stored in separate files, are then taken into account without manually reloading libraries. This is helpful when developing interactively. Learn more about [IPython Magic commands](https://ipython.readthedocs.io/en/stable/interactive/magics.html).

In [None]:
%load_ext autoreload
%autoreload 2

### Installation of required python packages

Using the pip package manager, we install the packages listed in the `requirements.txt` file.

In [None]:
!cat requirements.txt # List contents of the requirements.txt file

In [None]:
import sys
!{sys.executable} -m pip install -r requirements.txt

Import all the external dependencies, as well as user-defined functions stored in the `python_lib` folder

In [None]:
import json
from pprint import pprint

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from scipy import stats

import PicSureHpdsLib
import PicSureClient

from python_lib.utils import get_multiIndex_variablesDict, get_dic_renaming_vars, joining_variablesDict_onCol

In [None]:
print("NB: This Jupyter Notebook has been written using the following PIC-SURE API versions:\n- PicSureClient: 0.1.0\n- PicSureHpdsLib: 1.1.0\n")
print("The PIC-SURE API library versions you have installed are: \n- PicSureClient: {0}\n- PicSureHpdsLib: {1}".format(PicSureClient.__version__, PicSureHpdsLib.__version__))
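
If the installed versions are older than the ones used to write this notebook, some calls may behave differently. The following is a minimal sketch of such a comparison, assuming plain dotted numeric version strings such as `1.1.0`. `parse_version` and `is_at_least` are hypothetical helpers written for this example; they are not part of the PIC-SURE libraries.

```python
def parse_version(version):
    """Turn a dotted numeric version string such as '1.1.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_at_least(installed, expected):
    """Return True if `installed` is the same as or newer than `expected`."""
    return parse_version(installed) >= parse_version(expected)


# Versions this notebook was written with (taken from the cell above)
expected_versions = {"PicSureClient": "0.1.0", "PicSureHpdsLib": "1.1.0"}

# Hard-coded example values; in the notebook one would pass
# PicSureClient.__version__ and PicSureHpdsLib.__version__ instead.
print(is_at_least("1.2.0", expected_versions["PicSureHpdsLib"]))
print(is_at_least("0.0.9", expected_versions["PicSureClient"]))
```

Comparing tuples of integers rather than raw strings avoids the pitfall where `"1.10.0"` sorts before `"1.9.0"` alphabetically.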

##### Set up the options for displaying tables and plots in this notebook

In [None]:
# Pandas DataFrame display options
pd.set_option("display.max_rows", 100)

# Matplotlib parameters options
fig_size = plt.rcParams["figure.figsize"]  # current default figure size

# Set the figure width to 14 and the height to 8 (in inches)
fig_size[0] = 14
fig_size[1] = 8
plt.rcParams["figure.figsize"] = fig_size

font = {'weight' : 'bold',
        'size'   : 14}

plt.rc('font', **font)

### Connecting to a PIC-SURE network

You will need the following information before connecting to the PIC-SURE network:
* resource_id: ID of the resource that you are trying to access. You can leave the default value for this project.
* token_file: A text file called `token.txt` containing the token retrieved from your user profile in the PIC-SURE UI. This file needs to be located in the same folder as this notebook.

In [None]:
resource_id = "57e29a43-38c3-4c4b-84c9-dda8138badbe"
token_file = "token.txt"
PICSURE_network_URL = "https://curesc.hms.harvard.edu/picsure"

In [None]:
with open(token_file, "r") as f:
    my_token = f.read()
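
A token file saved from a text editor often ends with a newline, which would then be read as part of the token. A slightly more defensive way of reading the file is sketched below. `read_token` is a hypothetical helper written for this example (it is not part of the PIC-SURE libraries); it strips surrounding whitespace and fails early when the file is missing or empty.

```python
from pathlib import Path


def read_token(path):
    """Read a security token from a text file, without surrounding whitespace."""
    token_path = Path(path)
    if not token_path.is_file():
        raise FileNotFoundError(f"Token file not found: {token_path}")
    token = token_path.read_text().strip()
    if not token:
        raise ValueError(f"Token file is empty: {token_path}")
    return token
```

In the notebook, `my_token = read_token(token_file)` could then be used in place of the plain `f.read()` call.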

In [None]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, my_token, True)

In [None]:
adapter = PicSureHpdsLib.Adapter(connection)
resource = adapter.useResource(resource_id)

Two objects are created here: a `connection` and a `resource` object, using the `PicSureClient` and `PicSureHpdsLib` libraries, respectively.

Since we will only be using a single resource, **the `resource` object is the only one we will need to proceed with data analysis hereafter.** It should be noted that the `connection` object is useful to get access to different databases stored in different resources.

The `resource` object is connected to the specific data source ID we specified and enables us to query and retrieve data from this source.

### Getting help with the PIC-SURE python API

The `help()` method prints out a help message for any PIC-SURE library function. For example, we can learn more about a resource using the following code:

In [None]:
resource.help()

This output tells us that this `resource` object has 2 methods, and it gives insights about their function. 

## Using the *variables dictionary*

Once a connection to the desired resource has been established, we first need to get a quick idea of which variables are available in the database. We will use the `dictionary` method of the `resource` object to do this.

A `dictionary` instance makes it possible to retrieve the records matching a specific term. Its `find()` method is used to search the available variables. For instance, looking for variables containing the term 'Sex' is done this way:

In [None]:
dictionary = resource.dictionary()
dictionary_search = dictionary.find("Sex")
dictionary_search.DataFrame().head()

Objects created by `dictionary.find` expose the search result through 4 different methods: `.count()`, `.keys()`, `.entries()`, and `.DataFrame()`.

In [None]:
pprint({"Count": dictionary_search.count(), 
        "Keys": dictionary_search.keys()[0:3],
        "Entries": dictionary_search.entries()[0:3]})

**`.DataFrame()` returns the result of the dictionary search as a pandas DataFrame**

The dictionary provides various information about the variables, such as:
- observationCount: number of entries with non-null value
- categorical: type of the variables, True if categorical, False if continuous/numerical
- min/max: only provided for non-categorical variables
- HpdsDataType: 'phenotypes' or 'genotypes'. Currently, the API only exposes 'phenotypes' variables

Hence, it enables us to:
* Use the various variables information as criteria for variable selection.
* Use the row names of the DataFrame to get the actual variables names, to be used in the query, as shown below.
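
For instance, the metadata columns can be combined into boolean masks. The sketch below uses a small hand-made DataFrame that only mimics some of the columns listed above. The variable names and counts are invented for illustration; in the notebook, the same kind of mask would be applied to the DataFrame returned by `.DataFrame()`.

```python
import pandas as pd

# Toy stand-in for a dictionary DataFrame (invented values, for illustration only)
toy_dict = pd.DataFrame(
    {
        "observationCount": [1000, 250, 40],
        "categorical": [True, False, False],
        "min": [None, 0.0, 1.5],
        "max": [None, 90.0, 12.0],
    },
    index=["\\study\\Sex\\", "\\study\\Age\\", "\\study\\Lab value\\"],
)

# Keep continuous variables that have at least 100 non-null observations
mask = (~toy_dict["categorical"]) & (toy_dict["observationCount"] >= 100)

# The row names of the selected rows are the variable names to use in a query
selected_names = toy_dict.loc[mask].index.tolist()
print(selected_names)
```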

Variable names as currently implemented in the API aren't straightforward to use for a few reasons:
1. Very long
2. Presence of backslashes that require modification right after copy-pasting.
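
The backslash issue comes from Python string escaping: in a regular string literal, sequences such as `\t` or `\n` are interpreted as control characters. The short sketch below uses a made-up variable path (the name is hypothetical, chosen only to show the pattern) to illustrate two ways of writing such a name and how to split it into its levels.

```python
# Hypothetical variable path, for illustration only.
# Every backslash is doubled so that Python keeps it as a literal character.
varname = "\\Some study\\table one\\Sex\\"

# A raw string avoids the doubling, but a raw string literal cannot end with
# a single backslash, so the trailing one is appended separately.
same_varname = r"\Some study\table one\Sex" + "\\"
print(varname == same_varname)

# Splitting on the backslash recovers the levels of the path
levels = [level for level in varname.split("\\") if level]
print(levels)
print(levels[-1])  # the last level is the short name of the variable
```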

However, using the dictionary to select variables can help to deal with this. Let's say we want to retrieve every variable from the different substudies available in the resource, such as Cure Sickle Cell related studies. One way to proceed would be to retrieve the whole dictionary for those variables in the form of a DataFrame, as below:

In [None]:
plain_variablesDict = resource.dictionary().find().DataFrame()

Using the `dictionary.find()` function without arguments returns every entry, as shown in the help documentation.

In [None]:
resource.dictionary().help()

In [None]:
plain_variablesDict.iloc[10:20,:]

### Export Full Data Dictionary to CSV

To export the data dictionary, we first create a pandas DataFrame called `fullVariableDict`.

In [None]:
fullVariableDict = resource.dictionary().find().DataFrame()

Check that `fullVariableDict` dataframe contains some values.

In [None]:
fullVariableDict.iloc[0:3,:]

In [None]:
fullVariableDict.to_csv('data_dictionary.csv')

You should now see a data_dictionary.csv in the Jupyter Hub file explorer.

#### Variable dictionary + pandas multiIndex

Though the plain dictionary is helpful, we can use a simple user-defined function (`get_multiIndex_variablesDict`) to add a little more information and make long variable names easier to work with. It takes advantage of the pandas MultiIndex functionality (see the [pandas official documentation on this topic](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)).

Although not an official feature of the API, such functionality illustrates how to quickly scan and select groups of related variables.

Printing the 'multiIndexed' variable dictionary allows us to quickly see the tree-like organization of the variables. Moreover, original and simplified variable names are now stored in the "name" and "simplified_name" columns, respectively.

In [None]:
variablesDict = get_multiIndex_variablesDict(plain_variablesDict)

In [None]:
variablesDict.loc[["CIBMTR - Cure Sickle Cell Disease"],:]

In [None]:
# Limit the number of lines to be displayed for the future outputs
pd.set_option("display.max_rows", 50)

Below is a simple example to illustrate the ease of use of a multiIndex dictionary. Let's say we are interested in every variable pertaining to "5 - CRF data collection track only".

In [None]:
mask_study = variablesDict.index.get_level_values(0) == "CIBMTR - Cure Sickle Cell Disease"
mask_transplant = variablesDict.index.get_level_values(1) == "5 - CRF data collection track only"
transplant_variables = variablesDict.loc[mask_study & mask_transplant,:]
transplant_variables

Although simple, this kind of filter can easily be combined with others to quickly select the necessary variables.
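
To make the idea concrete without depending on `get_multiIndex_variablesDict` (whose implementation is not shown in this notebook), the following sketch builds a small two-level MultiIndex by hand and applies the same mask-based selection. The study and group labels are invented, and this is not necessarily how the helper function is implemented.

```python
import pandas as pd

# Hypothetical backslash-separated variable names (invented for illustration)
names = [
    "\\Study A\\Group 1\\Sex\\",
    "\\Study A\\Group 2\\Age\\",
    "\\Study B\\Group 1\\Sex\\",
]

# Split each name into its levels: (study, group, simplified name)
levels = [tuple(part for part in name.split("\\") if part) for name in names]

# Use the first two levels as a MultiIndex and keep the names as columns
index = pd.MultiIndex.from_tuples(
    [(study, group) for study, group, _ in levels],
    names=["level_0", "level_1"],
)
toy_variablesDict = pd.DataFrame(
    {"simplified_name": [simple for _, _, simple in levels], "name": names},
    index=index,
)

# One boolean mask per index level, combined with the & operator
mask_study = toy_variablesDict.index.get_level_values(0) == "Study A"
mask_group = toy_variablesDict.index.get_level_values(1) == "Group 1"
print(toy_variablesDict.loc[mask_study & mask_group, "name"].tolist())
```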

## Querying and retrieving data

The second cornerstone of the API is the `query` object. It is how we retrieve data from the resource.

In [None]:
my_query = resource.query()

The query object has several methods that enable us to build a query:

| Method | Arguments / Input | Output |
|--------|-------------------|--------|
| query.select().add() | variable name (string) or list of strings | all variables included in the list (no record subsetting) |
| query.require().add() | variable name (string) or list of strings | all variables; only records that do not contain null values for input variables |
| query.anyof().add() | variable name (string) or list of strings | all variables; only records that contain at least one non-null value for input variables |
| query.filter().add() | variable name and additional filtering values | input variable; only records that match filter criteria |

All 4 methods can be combined when building a query. The records eventually returned by the query have to meet all of the specified filters.
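
The difference between these methods is easiest to see on a toy example. The following sketch emulates the record-subsetting rules of the table in plain Python, on a hand-made list of records. It only illustrates the semantics described above and is not the PIC-SURE implementation; the function names and the data are invented for this example.

```python
# Toy patient records; None stands for a missing (null) value
records = [
    {"id": 1, "Sex": "Male", "Age": 12, "Necrosis": "Yes"},
    {"id": 2, "Sex": "Female", "Age": None, "Necrosis": "Yes"},
    {"id": 3, "Sex": "Male", "Age": None, "Necrosis": None},
    {"id": 4, "Sex": None, "Age": 30, "Necrosis": "No"},
]


def require(rows, variables):
    """Keep rows where every listed variable is non-null."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


def anyof(rows, variables):
    """Keep rows where at least one listed variable is non-null."""
    return [r for r in rows if any(r[v] is not None for v in variables)]


def filter_equal(rows, variable, value):
    """Keep rows where the variable equals the given value."""
    return [r for r in rows if r[variable] == value]


def ids(rows):
    """Return the ids of the given rows, to keep the printed output short."""
    return [r["id"] for r in rows]


print(ids(require(records, ["Sex", "Age"])))     # [1]
print(ids(anyof(records, ["Age", "Necrosis"])))  # [1, 2, 4]

# Criteria combine: a record must pass all of them to be returned
males_with_necrosis = filter_equal(filter_equal(records, "Sex", "Male"), "Necrosis", "Yes")
print(ids(males_with_necrosis))                  # [1]
```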

### Building the query

Let's say we want to select a cohort consisting of males with avascular necrosis.

In [None]:
# Selecting all variables from "CIBMTR" study
mask_study = variablesDict.index.get_level_values(0) == "CIBMTR - Cure Sickle Cell Disease"
varnames = variablesDict.loc[mask_study, "name"]

Let's create variables holding the names of the sex and avascular necrosis variables.

In [None]:
sex_var = variablesDict.loc[variablesDict["simplified_name"] == "Sex", "name"].values[0]

avascular_necrosis_varname = variablesDict.loc[variablesDict["simplified_name"] == "Avascular necrosis", "name"].values[0]


In [None]:
variablesDict.loc[variablesDict["simplified_name"] == "Avascular necrosis", "name"]

Now select these variables and filter them on the wanted values.

In [None]:
my_query = resource.query()
my_query.select().add(avascular_necrosis_varname)
my_query.filter().add(avascular_necrosis_varname, "Yes")

In [None]:
my_query.select().add(sex_var)
my_query.filter().add(sex_var, "Male")

## Retrieving the data

Once our query object is built, we use the `getResultsDataFrame()` method to retrieve the data corresponding to our query.

In [None]:
query_df = my_query.getResultsDataFrame().set_index("Patient ID")

In [None]:
query_df