# PIC-SURE API tutorial using UDN database
This tutorial notebook is meant to get you up and running quickly with the Python PIC-SURE API. It covers the main functionalities of the API.

## Python PIC-SURE API
### What is PIC-SURE?
Databases exposed through the PIC-SURE API encompass a wide variety of underlying architectures and data organizations. PIC-SURE hides this complexity and exposes the different databases in the same format, allowing researchers to focus on analysis and medical insights, which makes reproducible science easier.

### More about PIC-SURE
PIC-SURE stands for Patient-centered Information Commons: Standardized Unification of Research Elements. The API is available in two programming languages, Python and R, allowing investigators to query databases in the same way using either language.

PIC-SURE is a large project of which the R/Python PIC-SURE API is only one building block. Among other things, PIC-SURE also offers a graphical user interface that lets research scientists quickly learn which variables and data are available for a specific data source.

The API is actively developed by the Avillach-Lab at Harvard Medical School.

GitHub repos:

* https://github.com/hms-dbmi/pic-sure-python-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-python-client

---

## Getting your own user-specific security token
**Before running this notebook, please be sure to review the get_your_token.ipynb notebook. It explains how to get a security token, which is mandatory to access the databases.**

### Environment set-up

#### Pre-requisites: 
* Python >= 3.7
* pip: Python package manager, already available on most systems with a Python interpreter installed ([pip installation instructions](https://pip.pypa.io/en/stable/installing/)).

#### IPython magic command
The two lines of code below load the autoreload IPython extension. Although not necessary to execute the rest of the notebook, it reloads every dependency each time Python code is executed, so that changes in external files imported into this notebook (e.g. user-defined functions stored in a separate file) are taken into account without having to manually reload libraries. This is very handy when developing interactively. More about [IPython magic commands](https://ipython.readthedocs.io/en/stable/interactive/magics.html).

In [None]:
%load_ext autoreload
%autoreload 2

#### Packages installation
Using the pip package manager, we install the packages listed in the `requirements.txt` file.

In [None]:
!cat requirements.txt

In [None]:
# set up environment
import sys
!{sys.executable} -m pip install -r requirements.txt
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git


#### Imports
Import all the external dependencies, as well as the user-defined functions stored in the _python_lib_ folder.

In [None]:
# Useful to estimate execution time of the Notebook
from datetime import datetime
then = datetime.now()

# pic-sure api lib
import PicSureHpdsLib
import PicSureClient

# python_lib for pic-sure
from python_lib.utils import get_multiIndex_variablesDict

# analysis
import pandas as pd
from pprint import pprint

#### Parameters and metadata

In [None]:
# print metadata
print("The PIC-SURE API library versions you have installed are: \n- PicSureClient: {0}\n- PicSureHpdsLib: {1}".format(PicSureClient.__version__, PicSureHpdsLib.__version__))

In [None]:
print("UDN database time stamp: {}".format(then))

## Connecting to a PIC-SURE network

### 1. Connect to the UDN data network using the HPDS adapter
Several pieces of information are needed to access data through the PIC-SURE API: a network URL, a resource id, and a user security token which is specific to a given URL + resource.

In [None]:
# Connection to the PIC-SURE API w/ key
# network information
PICSURE_network_URL = "https://udn.hms.harvard.edu/picsure"
resource_id = "c23b6814-7e5b-48d2-80d9-65511d7d2051"

In [None]:
# token is the individual user key given to connect to the UDN resource
token_file = "token.txt"
with open(token_file, "r") as f:
    # strip surrounding whitespace, e.g. a trailing newline added by a text editor
    my_token = f.read().strip()
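
A missing or empty token file only shows up later as a failed connection, which can be confusing. The sketch below shows one way to fail early with a clear message. `load_token` is a hypothetical helper written for this illustration, not part of the PIC-SURE libraries, and the demonstration uses a throwaway file with a fake token.

```python
from pathlib import Path
import tempfile


def load_token(path):
    """Read a token file and fail early if it is missing or empty (hypothetical helper)."""
    token_path = Path(path)
    if not token_path.is_file():
        raise FileNotFoundError("Token file not found: {}".format(token_path))
    # Remove surrounding whitespace such as a trailing newline
    token = token_path.read_text().strip()
    if not token:
        raise ValueError("Token file is empty: {}".format(token_path))
    return token


# Demonstration with a temporary file rather than a real token
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "token.txt"
    demo_file.write_text("not-a-real-token\n")
    demo_token = load_token(demo_file)

print(repr(demo_token))
```

In the notebook itself, the call would simply be `my_token = load_token("token.txt")`.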

In [None]:
# get connection object
connection = PicSureClient.Client.connect(url = PICSURE_network_URL,
                                 token = my_token)

In [None]:
# get adapter object
adapter = PicSureHpdsLib.Adapter(connection)

In [None]:
# get resource object
resource = adapter.useResource(resource_id)

Three objects are created here: a connection, an adapter and a resource object. The connection comes from the `PicSureClient` library; the adapter and the resource come from the `PicSureHpdsLib` library.

As we will be using a single resource, **the resource object is actually the only one we will need to proceed with data analysis hereafter** (FYI, the connection object is useful to get access to different databases stored in different resources).

The resource object is connected to the specific data source ID we specified, and makes it possible to query and retrieve data from this source.

#### Getting help with the Python PIC-SURE API

Each object exposed by the PicSureHpdsLib library has a help() method. Calling it prints a help message about the object.

In [None]:
# get resource documentation
resource.help()

For instance, this output tells us that the resource object has 2 methods, and it describes what each of them does.

### 2. Explore the data: data structures description

There are two methods to explore the data, each giving the user a different data structure: a **dictionary object** to explore variables and a **query object** to explore the patient records in UDN.

**Methods**:

* Search variables: find() method
* Retrieve data: query() method

**Data structures**:

* Dictionary object structure
* Query object structure
    

#### Explore variables using the _dictionary_

Once a connection to the desired resource has been established, we first need to get a quick idea of which variables are available in the database. To this end, we will use the `dictionary()` method of the `resource` object.

A dictionary object makes it possible to retrieve information about either the variables matching a specific term or all available variables, using the `find()` method. For instance, looking for variables containing the term 'aplasia' is done this way:

In [None]:
# create a dictionary object and search for a specific term, in this example for "aplasia"
dictionary = resource.dictionary()
lookup = dictionary.find("aplasia")

The `lookup` object now holds only the variables matched by the search term. To retrieve the search results from it, 4 different methods are available: `count()`, `keys()`, `entries()`, and `DataFrame()`.

In [None]:
# description of the dictionary search content
pprint({"Count": lookup.count(), 
        "Keys": lookup.keys()[0:2],
        "Entries": lookup.entries()[0:2]})

**DataFrame()** returns the result of the dictionary search as a pandas DataFrame.

In [None]:
# show table of records from the dictionary object
lookup.DataFrame().tail(2)

We can also retrieve information about **ALL** variables by calling the dictionary search method without a term:

In [None]:
# we search the whole set of variables
plain_variablesDict = resource.dictionary().find().DataFrame()

In [None]:
# description of the whole dictionary of variables
print(plain_variablesDict.shape)
plain_variablesDict.head(2)

The UDN network resource contains 13414 variables described by 10 data fields:
* HpdsDataType
* description
* categorical
* categoryValues
* values
* continuous
* min
* max
* observationCount
* patientCount

The dictionary provides various pieces of information about the variables, such as:

* observationCount: number of entries with a non-null value
* categorical: type of the variable, True if categorical, False if continuous/numerical
* min/max: only provided for non-categorical variables
* HpdsDataType: 'phenotypes' or 'genotypes'. Currently, the API only exposes 'phenotypes' variables

Hence, it makes it possible to:

* Use the information about each variable as criteria for variable selection.
* Use the row names of the DataFrame to get the actual variable names, to be used in the query, as shown below.
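
As an illustration of the first point, here is a minimal sketch of selecting variables from dictionary fields with pandas. It runs on a small hand-made DataFrame that only mimics two of the fields listed above; the counts are invented for the example and do not come from UDN.

```python
import pandas as pd

# Toy stand-in for the dictionary DataFrame: variable names as index,
# two of the dictionary fields as columns (values are made up).
toy_dict = pd.DataFrame(
    {
        "categorical": [True, False, False],
        "observationCount": [900, 120, 640],
    },
    index=[
        "\\00_Demographics\\Gender\\",
        "\\00_Demographics\\Age at symptom onset in years\\",
        "\\00_Demographics\\Current age in years\\",
    ],
)

# Keep continuous variables that have at least 500 non-null observations
mask = (~toy_dict["categorical"]) & (toy_dict["observationCount"] >= 500)

# The row index holds the variable names to pass to a query
selected = toy_dict[mask].index.to_list()
print(selected)
```

The same pattern should apply to `plain_variablesDict`, with thresholds chosen for the analysis at hand.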
 
Variable names (`KEY` or row **indexes** in the dataframe), as currently implemented in the API, aren't straightforward to use because:

1. Very long
2. Presence of backslashes that requires modification right after copy-pasting.

However, using the dictionary to select variables can help to deal with this. 
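
To make the backslash issue concrete, the sketch below shows two equivalent ways of writing a variable name as a Python string literal, and how to split it into its components. The variable name is one of the demographics keys used later in this notebook.

```python
# Option 1: a regular string literal with every backslash doubled
escaped = "\\00_Demographics\\Gender\\"

# Option 2: a raw string literal; a raw string cannot end with a single
# backslash, so the trailing one is appended separately
raw = r"\00_Demographics\Gender" + "\\"

# Both literals produce exactly the same string
same = escaped == raw

# Splitting on the backslash gives the components of the variable path
parts = [part for part in escaped.split("\\") if part]
print(same, parts)
```

Writing a single backslash in a regular string (for instance `"\Gender"`) relies on Python leaving unknown escape sequences untouched, which recent Python versions warn about, so doubling the backslashes is the safer habit.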

##### Parsing variable names
We can use a utility function, `get_multiIndex_variablesDict()`, defined in python_lib/utils.py, to add a little more information and make variable names easier to work with. It takes advantage of the pandas MultiIndex functionality (see the [pandas official documentation on this topic](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)).

Although not an official feature of the API, such functionality illustrates how to quickly scan and select groups of related variables.

Printing the "MultiIndex" dictionary makes it easy to see the tree-like organisation of the variables. Moreover, original and simplified variable names are now stored in the "varName" and "simplified_varName" columns respectively (the simplified variable name is simply the last component of the variable name, which is usually the most informative part).

In [None]:
# Display the variables tree hierarchy from the variables name
variablesDict = get_multiIndex_variablesDict(plain_variablesDict)
variablesDict.loc[["04_Clinical symptoms and physical findings (in HPO, from PhenoTips)"], :]
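
For readers curious about what such a helper might do, here is a simplified sketch of building a MultiIndex from backslash-separated variable names. It is not the implementation found in python_lib/utils.py: `sketch_multiindex_dict` and the `level_N` index names are hypothetical, and the second variable name in the toy data is invented to show a deeper path.

```python
import pandas as pd


def sketch_multiindex_dict(plain_dict):
    """Turn backslash-separated row names into a MultiIndex (illustrative only)."""
    names = plain_dict.index.to_list()
    # Split each name into its components, dropping the empty leading/trailing parts
    split = [[part for part in name.split("\\") if part] for name in names]
    depth = max(len(parts) for parts in split)
    # Pad shorter paths with empty strings so every tuple has the same length
    padded = [tuple(parts + [""] * (depth - len(parts))) for parts in split]

    out = plain_dict.copy()
    out["varName"] = names
    out["simplified_varName"] = [parts[-1] for parts in split]
    out.index = pd.MultiIndex.from_tuples(
        padded, names=["level_{}".format(i) for i in range(depth)]
    )
    return out


toy_plain = pd.DataFrame(
    {"categorical": [True, False]},
    index=[
        "\\00_Demographics\\Gender\\",
        "\\toy_group\\toy_subgroup\\toy_variable\\",  # invented name
    ],
)
toy_multi = sketch_multiindex_dict(toy_plain)
print(toy_multi)
```

Each index level then corresponds to one depth of the variable tree, which is what makes selections by level value possible.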

Below is a simple example to illustrate the ease of use of a MultiIndex dictionary. Let's say we are interested in filtering variables related to "aplasias" in the "nervous system".

In [None]:
mask_system = variablesDict.index.get_level_values(2) == "Abnormality of the nervous system"
mask_abnormality = variablesDict.varName.str.contains('Aplasia')
filtered_variables = variablesDict.loc[mask_system & mask_abnormality,]
print(filtered_variables.shape)
filtered_variables.head(2)

Although simple, this filter can easily be combined with others to quickly select the variables needed.

#### Explore patient records using _query_

Besides the dictionary, the second cornerstone of the API is the set of query methods (`select()`, `require()`, `anyof()`, `filter()`). They are the entry point to **query and retrieve data from the resource**.

First, we need to create a query object.

In [None]:
# create a query object for the resource
my_query = resource.query()

The query is then built by calling the different query methods on this object: <font color='orange'>select().add(), require().add(), anyof().add(), and filter().add()</font>.

* The **select().add()** method accepts a variable name as a string or a list of strings as argument, and makes the query return all variables included in the list, without any record (i.e. subjects/rows) subsetting.
* The **require().add()** method accepts a variable name as a string or a list of strings as argument, and makes the query return all the variables passed, and only the records that do not contain any null value for those variables.
* The **anyof().add()** method accepts a variable name as a string or a list of strings as argument, and makes the query return all variables included in the list, and only the records that contain at least one non-null value for those variables.
* The **filter().add()** method accepts a variable name as a string, plus additional values to filter on for that variable. The query will return this variable and only the records that match the filter.

All 4 methods can be combined when building a query. The records eventually returned by the query have to meet all the specified filters.
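
The difference between these record-subsetting rules can be sketched in plain Python on a handful of toy records. This only illustrates the semantics described above and is not how the library implements them; the functions are hypothetical stand-ins, and filtering on a numeric range is just one kind of filter value assumed for the example.

```python
# Toy patient records; None stands for a missing (null) value.
records = [
    {"id": 1, "age": 12, "gender": "F"},
    {"id": 2, "age": None, "gender": "M"},
    {"id": 3, "age": 40, "gender": None},
    {"id": 4, "age": None, "gender": None},
]


def require(rows, names):
    """Keep rows where every listed variable is non-null."""
    return [r for r in rows if all(r[n] is not None for n in names)]


def anyof(rows, names):
    """Keep rows where at least one listed variable is non-null."""
    return [r for r in rows if any(r[n] is not None for n in names)]


def filter_range(rows, name, low, high):
    """Keep rows whose value for `name` is non-null and lies in [low, high]."""
    return [r for r in rows if r[name] is not None and low <= r[name] <= high]


def ids(rows):
    return [r["id"] for r in rows]


required_ids = ids(require(records, ["age", "gender"]))      # [1]
anyof_ids = ids(anyof(records, ["age", "gender"]))           # [1, 2, 3]
filtered_ids = ids(filter_range(records, "age", 10, 20))     # [1]
# Combining criteria: a record must satisfy all of them
combined_ids = ids(filter_range(anyof(records, ["age", "gender"]), "age", 30, 50))  # [3]
print(required_ids, anyof_ids, filtered_ids, combined_ids)
```

A plain `select()` corresponds to none of these functions being applied: all four records would be returned, including record 4 with only null values.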

##### Building the query
Let's say we want to check some demographics about the data in UDN. We will keep only the variables whose observation count exceeds 50% of their patient count.

In [None]:
# select demographic variable names
demographicsDict = resource.dictionary().find("demographics")
# build the DataFrame once and reuse it
demographics_df = demographicsDict.DataFrame()
mask_obs = demographics_df.observationCount > demographics_df.patientCount * .50
selected_varnames = demographics_df[mask_obs].index.to_list()
print(len(selected_varnames))
selected_varnames

In [None]:
# build and query for demographics patient data
my_query.select().add(selected_varnames)

##### Retrieving the data
Once our query object is finally built, we use the `getResultsDataFrame()` method to retrieve the data corresponding to our query.

In [None]:
# retrieve the query result as a dataframe
demographics_data = my_query.getResultsDataFrame(low_memory=False).set_index("Patient ID")

In [None]:
print(demographics_data.shape)

In [None]:
demographics_data.head()

We have retrieved the patient records in UDN that meet the criteria set in the query.

**NOTE**: The <font color='orange'>Patient ID</font> is the `KEY` or row `INDEX` of the dataframe derived.

From this point, we can proceed with the data management and analysis using any other Python functions or libraries.

##### Visualize the demographics

In [None]:
# rename column names
# every backslash is doubled so that sequences such as \A or \G are not read as escape sequences
demographics_data = demographics_data.rename(columns={"\\00_Demographics\\Age at UDN Evaluation (in years)\\": "age_udn",
                                  "\\00_Demographics\\Age at symptom onset in years\\": "age_symptom",
                                  "\\00_Demographics\\Current age in years\\": "age_current",
                                  "\\00_Demographics\\Ethnicity\\": "ethnicity",
                                  "\\00_Demographics\\Gender\\": "gender",
                                  "\\00_Demographics\\Race\\": "race"
                                 })

In [None]:
# visualize
demographics_data.groupby(['race']).size().plot.pie(figsize=(10, 5), title="Race distribution in UDN")
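
The imports cell stores a start time in `then` to help estimate how long the notebook takes to run. A minimal sketch of how that estimate could be computed in a final cell is given below; `then` is redefined here only so that the snippet runs on its own.

```python
from datetime import datetime, timedelta

# In the notebook, `then` is already set in the imports cell
then = datetime.now()

# ... the rest of the notebook runs here ...

# Elapsed time between the start of the notebook and this cell
elapsed = datetime.now() - then
print("Notebook execution time: {}".format(elapsed))
```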