# PIC-SURE API tutorial using Cure SC database

This tutorial notebook is designed to get you up and running quickly with the Python PIC-SURE API. It covers the main functionalities of the API.

## PIC-SURE python API 
### What is PIC-SURE? 

PIC-SURE stands for Patient-centered Information Commons: Standardized Unification of Research Elements. The original data exposed through the PIC-SURE API are organized in many heterogeneous ways. PIC-SURE hides this complexity and exposes the different study datasets in a single tabular format. By simplifying data extraction, it allows investigators to focus on downstream analyses and facilitates reproducible science.

### More about PIC-SURE
The API is available in two programming languages, Python and R, allowing investigators to query datasets in the same way using either language. The R/Python PIC-SURE API is a small part of the entire PIC-SURE platform.

The API is actively developed by the Avillach Lab at Harvard Medical School.

GitHub repositories:
* https://github.com/hms-dbmi/pic-sure-python-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-python-client


 -------   

# Getting your own user-specific security token

**Before running this notebook, please be sure you have [added your security token](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/Cure_Sickle_Cell#get-your-security-token). This documentation contains an explanation about how to get a security token, which is required to access the databases.**

# Environment set-up

### Pre-requisite
- python 3.6 or later
- pip: Python package manager, already available on most systems with a Python interpreter installed ([pip installation instructions](https://pip.pypa.io/en/stable/installing/))

### IPython Magic command

The following code loads the `autoreload` IPython extension. Although `autoreload` is not necessary to execute the rest of the notebook, it enables the notebook to reload every dependency each time Python code is executed. This allows the notebook to take into account changes in imported external files, such as the user-defined functions stored in separate files, without needing to manually reload libraries. This is helpful when developing interactively. Learn more about [IPython Magic commands](https://ipython.readthedocs.io/en/stable/interactive/magics.html).

In [1]:
%load_ext autoreload
%autoreload 2

### Install python packages

Using the pip package manager, we install the packages listed in the `requirements.txt` file.

In [2]:
!cat requirements.txt # List contents of the requirements.txt file

numpy>=1.17.3
matplotlib>=3.1.1
pandas>=0.25.3
scipy>=1.3.1
tqdm>=4.38.0
statsmodels>=0.10.2
git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
git+https://github.com/hms-dbmi/pic-sure-python-client.git 


In [3]:
import sys
!{sys.executable} -m pip install -r requirements.txt

Collecting git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git (from -r requirements.txt (line 7))
  Cloning https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git to /tmp/pip-req-build-jmdgx5aa
  Running command git clone -q https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git /tmp/pip-req-build-jmdgx5aa
Collecting git+https://github.com/hms-dbmi/pic-sure-python-client.git (from -r requirements.txt (line 8))
  Cloning https://github.com/hms-dbmi/pic-sure-python-client.git to /tmp/pip-req-build-u6k6jo22
  Running command git clone -q https://github.com/hms-dbmi/pic-sure-python-client.git /tmp/pip-req-build-u6k6jo22
Collecting tqdm>=4.38.0
  Downloading tqdm-4.60.0-py2.py3-none-any.whl (75 kB)
Collecting httplib2
  Downloading httplib2-0.19.1-py3-none-any.whl (95 kB)
Building wheels for collected packages: PicSureClient,

Import external dependencies and user-defined functions stored in the `python_lib` folder.

In [4]:
import json
from pprint import pprint

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from scipy import stats

import PicSureHpdsLib
import PicSureClient

from python_lib.utils import get_multiIndex_variablesDict, get_dic_renaming_vars, joining_variablesDict_onCol

In [5]:
print("NB: This Jupyter Notebook has been written using PIC-SURE API following versions:\n- PicSureClient: 0.1.0\n- PicSureHpdsLib: 1.1.0\n")
print("The PIC-SURE API libraries versions you've been downloading are: \n- PicSureClient: {0}\n- PicSureHpdsLib: {1}".format(PicSureClient.__version__, PicSureHpdsLib.__version__))

NB: This Jupyter Notebook has been written using PIC-SURE API following versions:
- PicSureClient: 0.1.0
- PicSureHpdsLib: 1.1.0

The PIC-SURE API libraries versions you've been downloading are: 
- PicSureClient: 1.1.0
- PicSureHpdsLib: 1.1.0
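
The comparison printed above can also be done programmatically. The following is a minimal sketch using only the standard library. The helpers `parse_version` and `check_version` are hypothetical (they are not part of the PIC-SURE libraries), and the sketch assumes plain dotted numeric version strings such as `1.1.0`.

```python
def parse_version(version):
    """Turn a dotted version string such as '1.1.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def check_version(name, installed, written_with):
    """Return a short message comparing the installed and reference versions."""
    if parse_version(installed) == parse_version(written_with):
        return f"{name}: {installed} matches the version used to write this notebook"
    return f"{name}: installed {installed}, notebook written with {written_with}"


# Version strings copied from the output above; in the notebook you would pass
# PicSureClient.__version__ and PicSureHpdsLib.__version__ instead.
messages = [
    check_version("PicSureClient", "1.1.0", "0.1.0"),
    check_version("PicSureHpdsLib", "1.1.0", "1.1.0"),
]
for message in messages:
    print(message)
```

A mismatch does not necessarily mean the notebook will fail, but it is a useful first thing to check if a cell behaves differently from what is described here.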


##### Set up the options for displaying tables and plots in this notebook

In [6]:
# Pandas DataFrame display options
pd.set_option("display.max_rows", 100)

# Matplotlib parameters options
fig_size = plt.rcParams["figure.figsize"]
 
# Set the default figure size to 14 x 8 inches
fig_size[0] = 14
fig_size[1] = 8
plt.rcParams["figure.figsize"] = fig_size

font = {'weight' : 'bold',
        'size'   : 14}

plt.rc('font', **font)

### Connecting to a PIC-SURE network

You will need the following information before connecting to the PIC-SURE network:
* resource ID: ID of the resource that you are trying to access. You can leave the default value for this project.
* user-specific token text file: A text file called `token.txt` should contain the token retrieved from your user profile in the PIC-SURE UI. Because the code below opens it by relative path, this file needs to be located in the notebook's working directory.

In [7]:
resource_id = "57e29a43-38c3-4c4b-84c9-dda8138badbe"
token_file = "token.txt"
PICSURE_network_URL = "https://curesc.hms.harvard.edu/picsure"

In [8]:
with open(token_file, "r") as f:
    # Strip surrounding whitespace such as a trailing newline
    my_token = f.read().strip()
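
If the token file is missing or empty, the connection step fails with a message that may not point at the real cause. The sketch below shows one defensive way to read the file. The helper `read_token` is hypothetical, and the demonstration uses a throwaway file containing a fake token rather than a real one.

```python
import os
import tempfile


def read_token(path):
    """Read a token file and return its content without surrounding whitespace."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Token file not found: {path}")
    with open(path, "r") as f:
        token = f.read().strip()
    if not token:
        raise ValueError(f"Token file is empty: {path}")
    return token


# Demonstration with a temporary file and a fake token value.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "token.txt")
    with open(demo_path, "w") as f:
        f.write("not-a-real-token\n")
    demo_token = read_token(demo_path)

# Print only the length: a real token should never be displayed in a notebook.
print(len(demo_token))
```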

In [9]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, my_token, True)


|        certificates to be acceptable for connections.  This may be useful for           |
|        working in a development environment or on systems that host public              |
|        data.  BEST SECURITY PRACTICES ARE THAT IF YOU ARE WORKING WITH SENSITIVE        |
|        DATA THEN ALL SSL CERTS BY THOSE EVIRONMENTS SHOULD NOT BE SELF-SIGNED.          |
+--------------------------------------+------------------------------------------------------
|  Resource UUID                       |  Resource Name                                  
+--------------------------------------+------------------------------------------------------
| 57e29a43-38c3-4c4b-84c9-dda8138badbe
+--------------------------------------+------------------------------------------------------


In [10]:
adapter = PicSureHpdsLib.Adapter(connection)
resource = adapter.useResource(resource_id)

Two objects were created: a `connection` and a `resource` object, using the `PicSureClient` and `PicSureHpdsLib` libraries, respectively.

Since we will only be using a single resource, **the `resource` object is the only one we will need to proceed with this data analysis.** It should be noted that the `connection` object is useful to access different databases stored in different resources.

The `resource` object is connected to the specific resource ID and enables us to query and retrieve data from this source.

### Getting help with the PIC-SURE python API

The `help()` method prints out the helper message for any PIC-SURE library function. For example, we can learn more about getting a resource using the following code:

In [11]:
resource.help()


        [HELP] PicSureHpdsLib.useResource(resource_uuid)
            .dictionary()       Used to access data dictionary of the resource
            .query()            Used to query against data in the resource
            .retrieveQueryResults(query_uuid) returns the results of an asynchronous query that has already been submitted to PICSURE

        [ENVIRONMENT]
              Endpoint URL: https://curesc.hms.harvard.edu/picsure/
             Resource UUID: 57e29a43-38c3-4c4b-84c9-dda8138badbe


This output tells us about the methods and functions of the `resource` object. 

## Using the *variables dictionary*

Once a connection to the desired resource has been established, we first need to get an understanding of which variables are available in the database. We will use the `dictionary` method of the `resource` object to do this.

A `dictionary` instance retrieves records that match a specific term. The `find()` method can be used to retrieve information about the available variables. For instance, looking for variables containing the term 'Sex' is done this way: 

In [12]:
dictionary = resource.dictionary()
dictionary_search = dictionary.find("Sex")
dictionary_search.DataFrame().head()

| KEY | categorical | categoryValues | observationCount | patientCount | HpdsDataType |
|---|---|---|---|---|---|
| \CIBMTR - Cure Sickle Cell Disease\1 - Patient Related\Sex\ | True | [Female, Male] | 1518 | 1518 | phenotypes |


Objects created by the `dictionary.find` method can expose the search results using 4 different methods: `.count()`, `.keys()`, `.entries()`, and `.DataFrame()`. 

In [13]:
pprint({"Count": dictionary_search.count(), 
        "Keys": dictionary_search.keys()[0:3],
        "Entries": dictionary_search.entries()[0:3]})

{'Count': 1,
 'Entries': [{'HpdsDataType': 'phenotypes',
              'categorical': True,
              'categoryValues': ['Female', 'Male'],
              'name': '\\CIBMTR - Cure Sickle Cell Disease\\1 - Patient '
                      'Related\\Sex\\',
              'observationCount': 1518,
              'patientCount': 1518}],
 'Keys': ['\\CIBMTR - Cure Sickle Cell Disease\\1 - Patient Related\\Sex\\']}


**The `.DataFrame()` method enables us to get the result of the dictionary search in a pandas DataFrame format. This way, it allows us to:** 


* Use the various information exposed in the dictionary (patient count, variable type ...) as criteria for variable selection.
* Use the row names of the DataFrame to get the actual variable names to be used in the query, as shown below.

Variable names aren't very practical to use right away for two reasons:
1. They are very long.
2. They contain backslashes, which makes copy-pasting them into Python code error-prone.

However, retrieving the dictionary search result in the form of a dataframe can help access the variable names.
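
To see why the backslashes get in the way, the short sketch below writes the "Sex" variable name from the search result above by hand. It only uses built-in string handling; the variable names `sex_name`, `sex_name_raw` and `components` are chosen for illustration.

```python
# Writing a PIC-SURE variable name by hand: every backslash must be doubled
# in a regular string literal.
sex_name = "\\CIBMTR - Cure Sickle Cell Disease\\1 - Patient Related\\Sex\\"

# A raw string avoids the doubling, but a raw string literal cannot end with a
# single backslash, so the trailing one is appended separately.
sex_name_raw = r"\CIBMTR - Cure Sickle Cell Disease\1 - Patient Related\Sex" + "\\"

# Splitting on the backslash recovers the path components; the empty strings
# produced by the leading and trailing backslashes are dropped.
components = [part for part in sex_name.split("\\") if part]

print(sex_name == sex_name_raw)
print(components)
```

Reading the names from the dictionary DataFrame avoids this manual escaping altogether.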

Let's say we want to retrieve every variable in the form of a DataFrame. We can do this using the code below:

In [15]:
plain_variablesDict = resource.dictionary().find().DataFrame()

Using the `dictionary.find()` function without arguments returns every entry, as shown in the help documentation.

In [16]:
resource.dictionary().help()


        [HELP] PicSureHpdsLib.Client(connection).useResource(uuid).dictionary()
            .find()                 Lists all data dictionary entries
            .find(search_string)    Lists matching data dictionary entries
        


In [17]:
plain_variablesDict.iloc[10:20,:]

| KEY | min | categorical | observationCount | patientCount | max | HpdsDataType | categoryValues |
|---|---|---|---|---|---|---|---|
| \CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Hemoglobin pre-conditioning unit\ | | True | 732 | 732 | | phenotypes | [Not reported, g/dL] |
| \CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Acute renal failure requiring dialysis\ | | True | 732 | 732 | | phenotypes | [No, Not reported, Yes] |
| \CIBMTR - Cure Sickle Cell Disease\1 - Patient Related\HCT-comorbidity index\ | | True | 1518 | 1518 | | phenotypes | [0-2, 3+, Not collected before 2008] |
| \CIBMTR - Cure Sickle Cell Disease\4 - Outcomes\Death without acute graft versus host disease, grades II-IV\ | | True | 1518 | 1518 | | phenotypes | [No, Not Reported, Yes] |
| \CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Stroke pre-conditioning\ | | True | 732 | 732 | | phenotypes | [No, Not reported, Yes] |
| \CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Time from HCT to intubation, months\ | 0.131579 | False | 34 | 34 | 200.032895 | phenotypes | |
| \CIBMTR - Cure Sickle Cell Disease\3 - Transplant Related\Year of transplant ( Grouped )\ | | True | 1518 | 1518 | | phenotypes | [2008-2012, 2013-2017, 2018-2019, < 2008] |
| \CIBMTR - Cure Sickle Cell Disease\3 - Transplant Related\Donor-recipient HLA matching\ | | True | 1518 | 1518 | | phenotypes | [7/8, 8/8, <=6/8] |
| \CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Therapy given for growth hormone deficiency\ | | True | 732 | 732 | | phenotypes | [N/A, No, Not reported, Yes] |
| \CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Time from HCT to VOD, months\ | 0.197368 | False | 23 | 23 | 8.289474 | phenotypes | |


### Export Full Data Dictionary to CSV

To export the data dictionary, we first create a pandas DataFrame called `fullVariableDict`.

In [18]:
fullVariableDict = resource.dictionary().find().DataFrame()

Let's make sure that the `fullVariableDict` DataFrame contains some values.

In [19]:
fullVariableDict.iloc[0:3,:]

| KEY | min | categorical | observationCount | patientCount | max | HpdsDataType | categoryValues |
|---|---|---|---|---|---|---|---|
| \CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Time from HCT to diabetes, months\ | 0.131579 | False | 48 | 48 | 46.315789 | phenotypes | |
| \CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Serum creatinine pre-conditioning unit\ | | True | 732 | 732 | | phenotypes | [Not reported, mg/dL] |
| \CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Hypertension (HTN) requiring therapy\ | | True | 732 | 732 | | phenotypes | [No, Not reported, Yes] |


In [20]:
fullVariableDict.to_csv('data_dictionary.csv')

You should now see a `data_dictionary.csv` file in the Jupyter Hub file explorer.
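
When the exported file is read back later, list-valued columns such as `categoryValues` come back as plain text. The sketch below illustrates one way to handle this, assuming pandas is installed. It uses a two-row stand-in for the dictionary (values taken from the outputs in this notebook) and an in-memory buffer instead of a real file; `demo_dict` and `reloaded` are illustrative names.

```python
import ast
import io

import pandas as pd

# Small stand-in for the data dictionary, using two entries shown in this notebook.
demo_dict = pd.DataFrame(
    {
        "categorical": [True, False],
        "categoryValues": [["Female", "Male"], None],
        "observationCount": [1518, 1518],
    },
    index=pd.Index(
        [
            "\\CIBMTR - Cure Sickle Cell Disease\\1 - Patient Related\\Sex\\",
            "\\CIBMTR - Cure Sickle Cell Disease\\3 - Transplant Related\\Year of transplant\\",
        ],
        name="KEY",
    ),
)

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
demo_dict.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col="KEY")

# Lists are stored as text in the CSV; convert them back and leave missing values alone.
reloaded["categoryValues"] = reloaded["categoryValues"].apply(
    lambda value: ast.literal_eval(value) if isinstance(value, str) else value
)

print(reloaded.loc[reloaded["categorical"], "categoryValues"].iloc[0])
```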

#### Variable dictionary + pandas multiIndex

We can use a simple user-defined function (`get_multiIndex_variablesDict`) to add a little more information to the variable dictionary and to simplify working with variable names. It takes advantage of pandas MultiIndex functionality ([see the official pandas documentation on this topic](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)).

Although not an official feature of the API, such functionality illustrates how to quickly select groups of related variables.

Printing the multiIndexed variable dictionary allows us to quickly see the tree-like organization of the variable names. Moreover, the original and simplified variable names are now stored in the `name` and `simplified_name` columns, respectively. The simplified variable name is simply the last component of the variable name, which is usually the most informative about what the variable represents.
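
The code of `get_multiIndex_variablesDict` lives in the `python_lib` folder and is not reproduced here. To give an idea of the approach, the sketch below builds a similar MultiIndex from backslash-separated names. `build_multiindex` is a hypothetical, simplified helper and not the actual implementation; it assumes pandas is installed and uses three variable names that appear in this notebook.

```python
import pandas as pd

# Stand-in for `plain_variablesDict`, with three variable names shown in this notebook.
names = [
    "\\CIBMTR - Cure Sickle Cell Disease\\1 - Patient Related\\Sex\\",
    "\\CIBMTR - Cure Sickle Cell Disease\\1 - Patient Related\\Ethnicity\\",
    "\\CIBMTR - Cure Sickle Cell Disease\\3 - Transplant Related\\Year of transplant\\",
]
plain = pd.DataFrame({"categorical": [True, True, False]}, index=pd.Index(names, name="KEY"))


def build_multiindex(plain_dict):
    """Hypothetical helper: index a dictionary DataFrame by its path components."""
    # Split each name on backslashes, dropping the empty leading/trailing pieces.
    paths = [tuple(p for p in name.split("\\") if p) for name in plain_dict.index]
    depth = max(len(path) for path in paths)
    # Pad shorter paths so that every tuple has one entry per level.
    padded = [path + ("",) * (depth - len(path)) for path in paths]
    result = plain_dict.reset_index().rename(columns={"KEY": "name"})
    result["simplified_name"] = [path[-1] for path in paths]
    result.index = pd.MultiIndex.from_tuples(
        padded, names=[f"level_{i}" for i in range(depth)]
    )
    return result.sort_index()


demo = build_multiindex(plain)
mask = demo.index.get_level_values(1) == "1 - Patient Related"
print(demo.loc[mask, "simplified_name"].tolist())
```

Keeping the full path in a `name` column matters because the query methods expect the original variable name, not the simplified one.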

In [21]:
variablesDict = get_multiIndex_variablesDict(plain_variablesDict)

In [22]:
variablesDict.loc[["CIBMTR - Cure Sickle Cell Disease"],:]

| level_0 | level_1 | level_2 | simplified_name | name | observationCount | categorical | categoryValues | nb_modalities | min | max | HpdsDataType |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CIBMTR - Cure Sickle Cell Disease | 1 - Patient Related | Cases from 2016 Blood publication | Cases from 2016 Blood publication | \CIBMTR - Cure Sickle Cell Disease\1 - Patient... | 1517 | True | [No, Yes] | 2.0 | | | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 1 - Patient Related | Cases from 2019 Lancet Heamatology publication | Cases from 2019 Lancet Heamatology publication | \CIBMTR - Cure Sickle Cell Disease\1 - Patient... | 1518 | True | [No, Yes] | 2.0 | | | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 1 - Patient Related | Country of HCT institution | Country of HCT institution | \CIBMTR - Cure Sickle Cell Disease\1 - Patient... | 1518 | True | [USA] | 1.0 | | | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 1 - Patient Related | Ethnicity | Ethnicity | \CIBMTR - Cure Sickle Cell Disease\1 - Patient... | 1518 | True | [Hispanic or Latino, Non-Hispanic or non-Latin... | 4.0 | | | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 1 - Patient Related | HCT-comorbidity index | HCT-comorbidity index | \CIBMTR - Cure Sickle Cell Disease\1 - Patient... | 1518 | True | [0-2, 3+, Not collected before 2008] | 3.0 | | | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| CIBMTR - Cure Sickle Cell Disease | 5 - CRF data collection track only | Type of arrythmia | Type of arrythmia | \CIBMTR - Cure Sickle Cell Disease\5 - CRF dat... | 732 | True | [Atrial fibrillation or flutter, N/A, Sick sin... | 3.0 | | | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 5 - CRF data collection track only | Type of therapy for iron overload | Type of therapy for iron overload | \CIBMTR - Cure Sickle Cell Disease\5 - CRF dat... | 732 | True | [Iron chelation only, N/A, Phlebotomy and iron... | 4.0 | | | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 5 - CRF data collection track only | VOD post-HCT | VOD post-HCT | \CIBMTR - Cure Sickle Cell Disease\5 - CRF dat... | 732 | True | [No, Yes] | 2.0 | | | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 5 - CRF data collection track only | Vaso-occlusive cirsis requiring hospitalization within 2 years pre-HCT | Vaso-occlusive cirsis requiring hospitalizatio... | \CIBMTR - Cure Sickle Cell Disease\5 - CRF dat... | 732 | True | [No, Not reported, Yes] | 3.0 | | | phenotypes |


In [23]:
# Limit the number of lines to be displayed for the future outputs
pd.set_option("display.max_rows", 50)

Below is a simple example to illustrate the ease of use of a multiIndex dictionary. Let's say we are interested in every variable in the "5 - CRF data collection track only" category.

In [24]:
mask_study = variablesDict.index.get_level_values(0) == "CIBMTR - Cure Sickle Cell Disease"
mask_dctrack = variablesDict.index.get_level_values(1) == "5 - CRF data collection track only"
dctrack_variables = variablesDict.loc[mask_study & mask_dctrack,:]
dctrack_variables

| level_0 | level_1 | level_2 | simplified_name | name | observationCount | categorical | categoryValues | nb_modalities | min | max | HpdsDataType |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CIBMTR - Cure Sickle Cell Disease | 5 - CRF data collection track only | Acute chest syndrome (ACS) pre-conditioning | Acute chest syndrome (ACS) pre-conditioning | \CIBMTR - Cure Sickle Cell Disease\5 - CRF dat... | 732 | True | [No, Not reported, Yes] | 3.0 | | | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 5 - CRF data collection track only | Acute chest syndrome post HCT | Acute chest syndrome post HCT | \CIBMTR - Cure Sickle Cell Disease\5 - CRF dat... | 1518 | True | [No, Not Reported, Yes] | 3.0 | | | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 5 - CRF data collection track only | Acute renal failure requiring dialysis | Acute renal failure requiring dialysis | \CIBMTR - Cure Sickle Cell Disease\5 - CRF dat... | 732 | True | [No, Not reported, Yes] | 3.0 | | | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 5 - CRF data collection track only | Anxiety requiring therapy | Anxiety requiring therapy | \CIBMTR - Cure Sickle Cell Disease\5 - CRF dat... | 732 | True | [No, Not reported, Yes] | 3.0 | | | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 5 - CRF data collection track only | Arrhythmia | Arrhythmia | \CIBMTR - Cure Sickle Cell Disease\5 - CRF dat... | 732 | True | [No, Not reported, Yes] | 3.0 | | | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 5 - CRF data collection track only | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| CIBMTR - Cure Sickle Cell Disease | 5 - CRF data collection track only | Type of arrythmia | Type of arrythmia | \CIBMTR - Cure Sickle Cell Disease\5 - CRF dat... | 732 | True | [Atrial fibrillation or flutter, N/A, Sick sin... | 3.0 | | | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 5 - CRF data collection track only | Type of therapy for iron overload | Type of therapy for iron overload | \CIBMTR - Cure Sickle Cell Disease\5 - CRF dat... | 732 | True | [Iron chelation only, N/A, Phlebotomy and iron... | 4.0 | | | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 5 - CRF data collection track only | VOD post-HCT | VOD post-HCT | \CIBMTR - Cure Sickle Cell Disease\5 - CRF dat... | 732 | True | [No, Yes] | 2.0 | | | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 5 - CRF data collection track only | Vaso-occlusive cirsis requiring hospitalization within 2 years pre-HCT | Vaso-occlusive cirsis requiring hospitalizatio... | \CIBMTR - Cure Sickle Cell Disease\5 - CRF dat... | 732 | True | [No, Not reported, Yes] | 3.0 | | | phenotypes |


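This simple filter can be easily combined with other filters to quickly select variables of interest. The dictionary columns can also be used directly, for example to keep only the continuous (non-categorical) variables: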

In [36]:
variablesDict[variablesDict.categorical == False]

| level_0 | level_1 | level_2 | simplified_name | name | observationCount | categorical | categoryValues | nb_modalities | min | max | HpdsDataType |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CIBMTR - Cure Sickle Cell Disease | 1 - Patient Related | Patient age at transplant, years | Patient age at transplant, years | \CIBMTR - Cure Sickle Cell Disease\1 - Patient... | 1518 | False | | | 0.26 | 58.0 | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 3 - Transplant Related | Transplant number | Transplant number | \CIBMTR - Cure Sickle Cell Disease\3 - Transpl... | 1518 | False | | | 1.0 | 1.0 | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 3 - Transplant Related | Year of transplant | Year of transplant | \CIBMTR - Cure Sickle Cell Disease\3 - Transpl... | 1518 | False | | | 1991.0 | 2019.0 | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 4 - Outcomes | Time from HCT to acute graft-vs-host disease, months | Time from HCT to acute graft-vs-host disease, ... | \CIBMTR - Cure Sickle Cell Disease\4 - Outcome... | 1363 | False | | | 0.263158 | 286.019737 | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 4 - Outcomes | Time from HCT to chronic graft-vs-host disease, months | Time from HCT to chronic graft-vs-host disease... | \CIBMTR - Cure Sickle Cell Disease\4 - Outcome... | 1488 | False | | | 0.296053 | 290.131579 | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 4 - Outcomes | Time from HCT to date of last contact or death, months | Time from HCT to date of last contact or death... | \CIBMTR - Cure Sickle Cell Disease\4 - Outcome... | 1518 | False | | | 0.263158 | 290.131579 | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 4 - Outcomes | Time from HCT to graft failure, months | Time from HCT to graft failure, months | \CIBMTR - Cure Sickle Cell Disease\4 - Outcome... | 1490 | False | | | 0.032895 | 290.131579 | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 4 - Outcomes | Time from HCT to neutrophil engraftment, months | Time from HCT to neutrophil engraftment, months | \CIBMTR - Cure Sickle Cell Disease\4 - Outcome... | 1484 | False | | | 0.032895 | 11.743421 | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 4 - Outcomes | Time from HCT to platelet recovery, months | Time from HCT to platelet recovery, months | \CIBMTR - Cure Sickle Cell Disease\4 - Outcome... | 1414 | False | | | 0.032895 | 174.572368 | phenotypes |
| CIBMTR - Cure Sickle Cell Disease | 4 - Outcomes | Time from HCT to second malignancy, months | Time from HCT to second malignancy, months | \CIBMTR - Cure Sickle Cell Disease\4 - Outcome... | 1509 | False | | | 0.263158 | 290.131579 | phenotypes |
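
Several such conditions can be combined with the `&` operator. The sketch below is a minimal illustration on a four-row stand-in for `variablesDict` (rows and counts copied from the outputs in this notebook), assuming pandas is installed. The names `demo_dict`, `mask_continuous`, `mask_complete` and `mask_patient` are chosen for illustration.

```python
import pandas as pd

# Stand-in for `variablesDict`, built from four rows shown in the outputs of this notebook.
index = pd.MultiIndex.from_tuples(
    [
        ("CIBMTR - Cure Sickle Cell Disease", "1 - Patient Related", "Patient age at transplant, years"),
        ("CIBMTR - Cure Sickle Cell Disease", "1 - Patient Related", "Ethnicity"),
        ("CIBMTR - Cure Sickle Cell Disease", "4 - Outcomes", "Time from HCT to graft failure, months"),
        ("CIBMTR - Cure Sickle Cell Disease", "5 - CRF data collection track only", "VOD post-HCT"),
    ],
    names=["level_0", "level_1", "level_2"],
)
demo_dict = pd.DataFrame(
    {"categorical": [False, True, False, True], "observationCount": [1518, 1518, 1490, 732]},
    index=index,
)

# Each condition is a boolean mask; `&` keeps the rows that satisfy all of them.
mask_continuous = ~demo_dict["categorical"]
mask_complete = demo_dict["observationCount"] == demo_dict["observationCount"].max()
mask_patient = demo_dict.index.get_level_values("level_1") == "1 - Patient Related"

selected = demo_dict.loc[mask_continuous & mask_complete & mask_patient]
print(selected.index.get_level_values("level_2").tolist())
```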


## Querying and retrieving data

The second cornerstone of the API is the `query` object, which is how we retrieve data from the resource.

The query object has several methods that enable us to build a query:

| Method | Arguments / Input | Output|
|--------|-------------------|-------|
| query.select().add() | variable name (string) or list of strings | all variables included in the list (no record subsetting)|
| query.require().add() | variable name (string) or list of strings | all variables; only records that do not contain null values for input variables |
| query.anyof().add() | variable name (string) or list of strings | all variables; only records that contain at least one non-null value for input variables |
| query.filter().add() | variable name and additional filtering values | input variable; only records that match filter criteria |

All four methods can be combined when building a query. Each record returned by the query must meet all of the specified criteria.
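
To make these semantics concrete, the sketch below mimics them on a small list of invented records in plain Python. It is only an illustration of how the criteria combine, based on the table above; it is not how PIC-SURE evaluates queries, and `run_toy_query` is a hypothetical function.

```python
# Toy records; None stands for a missing (null) value. The data are invented.
records = [
    {"id": 1, "sex": "Male", "age": 15.0, "year": 2007.0},
    {"id": 2, "sex": "Female", "age": None, "year": 1998.0},
    {"id": 3, "sex": None, "age": None, "year": 2012.0},
    {"id": 4, "sex": "Male", "age": 47.0, "year": 2016.0},
]


def run_toy_query(rows, select=(), require=(), anyof=(), filters=None):
    """Mimic how the four kinds of criteria combine (illustration only)."""
    filters = filters or {}
    # Returned columns: every variable mentioned in any of the criteria.
    columns = ["id", *dict.fromkeys([*select, *require, *anyof, *filters])]
    kept = []
    for row in rows:
        # require: every listed variable must be non-null.
        if any(row[var] is None for var in require):
            continue
        # anyof: at least one listed variable must be non-null.
        if anyof and all(row[var] is None for var in anyof):
            continue
        # filter: every predicate must accept the record's (non-null) value.
        if not all(
            row[var] is not None and predicate(row[var])
            for var, predicate in filters.items()
        ):
            continue
        kept.append({col: row[col] for col in columns})
    return kept


result = run_toy_query(
    records,
    select=["age"],
    anyof=["sex", "age"],
    filters={"year": lambda value: value >= 2000},
)
print(result)
```

Record 2 is dropped by the filter on `year`, and record 3 is dropped because both `anyof` variables are null, so only records 1 and 4 remain.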

### Building the query

Let's say we are interested in the age at which patients from the following cohort received their transplant:
* males
* patients with avascular necrosis
* patients that received their transplant after the year 1999

First we will find variables pertaining to sex and avascular necrosis. We can do this by searching for "Sex" and "Avascular necrosis" in the `simplified_name` column of `variablesDict`.

In [48]:
sex_var = variablesDict.loc[variablesDict["simplified_name"] == "Sex", "name"].values[0]

avascular_necrosis_varname = variablesDict.loc[variablesDict["simplified_name"] == "Avascular necrosis", "name"].values[0]


In [50]:
# Peek at the result for avascular necrosis
variablesDict.loc[variablesDict["simplified_name"] == "Avascular necrosis", "name"] 

level_0                            level_1                             level_2           
CIBMTR - Cure Sickle Cell Disease  5 - CRF data collection track only  Avascular necrosis    \CIBMTR - Cure Sickle Cell Disease\5 - CRF dat...
Name: name, dtype: object

Next, we can find the variable pertaining to "Year of transplant".

In [51]:
yr_transplant_varname = variablesDict.loc[variablesDict["simplified_name"] == "Year of transplant", "name"].values[0]
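
The lookups above take the first match with `.values[0]`, which raises an `IndexError` if nothing matches and silently picks one entry if several variables share the same simplified name. The sketch below shows a stricter alternative on a two-row stand-in for `variablesDict`, assuming pandas is installed. `find_variable_name` is a hypothetical helper, not part of `python_lib` or the PIC-SURE libraries.

```python
import pandas as pd

# Stand-in for `variablesDict` with two entries; the "name" values are full paths.
demo_dict = pd.DataFrame(
    {
        "simplified_name": ["Sex", "Year of transplant"],
        "name": [
            "\\CIBMTR - Cure Sickle Cell Disease\\1 - Patient Related\\Sex\\",
            "\\CIBMTR - Cure Sickle Cell Disease\\3 - Transplant Related\\Year of transplant\\",
        ],
    }
)


def find_variable_name(variables, simplified_name):
    """Hypothetical helper: return the single full name for a simplified name."""
    matches = variables.loc[variables["simplified_name"] == simplified_name, "name"]
    if len(matches) != 1:
        raise LookupError(
            f"Expected exactly one variable named {simplified_name!r}, found {len(matches)}"
        )
    return matches.iloc[0]


print(find_variable_name(demo_dict, "Sex"))
```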

Now we can create a new query and apply our filters to retrieve the cohort of interest.

In [58]:
my_query = resource.query()
# Patients with avascular necrosis
my_query.select().add(avascular_necrosis_varname)
my_query.filter().add(avascular_necrosis_varname, "Yes")

<PicSureHpdsLib.PicSureHpdsAttrListKeyValues.AttrListKeyValues at 0x7f8cf67256d8>

In [59]:
# Males
my_query.select().add(sex_var)
my_query.filter().add(sex_var, "Male")

<PicSureHpdsLib.PicSureHpdsAttrListKeyValues.AttrListKeyValues at 0x7f8cf67256d8>

In [60]:
# Patients receiving transplants after 1999
my_query.select().add(yr_transplant_varname)
my_query.filter().add(yr_transplant_varname, min=2000)

<PicSureHpdsLib.PicSureHpdsAttrListKeyValues.AttrListKeyValues at 0x7f8cf67256d8>

Now that we have narrowed the cohort down to males with avascular necrosis who received their transplant after 1999, we can add the variable of interest: "Patient age at transplant, years".

In [61]:
age_transplant_var = variablesDict.loc[variablesDict["simplified_name"] == "Patient age at transplant, years", "name"].values[0]
my_query.select().add(age_transplant_var)

<PicSureHpdsLib.PicSureHpdsAttrListKeys.AttrListKeys at 0x7f8cf6725588>

## Retrieving the data

Once our query object is built, we use the `getResultsDataFrame()` method to retrieve the data corresponding to our query as a pandas DataFrame.

In [62]:
query_df = my_query.getResultsDataFrame().set_index("Patient ID")

In [63]:
query_df

| Patient ID | \CIBMTR - Cure Sickle Cell Disease\1 - Patient Related\Patient age at transplant, years\ | \CIBMTR - Cure Sickle Cell Disease\1 - Patient Related\Sex\ | \CIBMTR - Cure Sickle Cell Disease\3 - Transplant Related\Year of transplant\ | \CIBMTR - Cure Sickle Cell Disease\5 - CRF data collection track only\Avascular necrosis\ |
|---|---|---|---|---|
| 42 | 15.0 | Male | 2007.0 | Yes |
| 295 | 18.0 | Male | 2008.0 | Yes |
| 336 | 13.0 | Male | 2009.0 | Yes |
| 612 | 16.0 | Male | 2012.0 | Yes |
| 725 | 20.0 | Male | 2014.0 | Yes |
| 752 | 15.0 | Male | 2014.0 | Yes |
| 766 | 13.0 | Male | 2014.0 | Yes |
| 772 | 5.0 | Male | 2014.0 | Yes |
| 895 | 11.0 | Male | 2015.0 | Yes |
| 966 | 47.0 | Male | 2016.0 | Yes |
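
As a quick sanity check, we can summarize the ten rows displayed above and confirm that they respect the filter on the year of transplant. The sketch below uses only the standard library and values copied by hand from that output; in the notebook, the same summary could be computed directly on `query_df`.

```python
import statistics

# Ages and transplant years copied from the ten rows displayed above.
ages = [15.0, 18.0, 13.0, 16.0, 20.0, 15.0, 13.0, 5.0, 11.0, 47.0]
years = [2007, 2008, 2009, 2012, 2014, 2014, 2014, 2014, 2015, 2016]

summary = {
    "n": len(ages),
    "min_age": min(ages),
    "median_age": statistics.median(ages),
    "max_age": max(ages),
    "earliest_year": min(years),
}
print(summary)

# Every displayed record should satisfy the `min=2000` filter on the year of transplant.
print(all(year >= 2000 for year in years))
```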
