# PIC-SURE API use-case: quick analysis on Hematopoietic Cell Transplant for Sickle Cell Disease (HCT for SCD) data

For a more basic introduction to the python PIC-SURE API, see the `1_PICSURE_API_101.ipynb` notebook.
 
**Before running this notebook, please be sure to get a user-specific security token. For more information about how to proceed, see the "Get your security token" instructions in the [README.md](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/NHLBI_BioData_Catalyst#get-your-security-token).**

 -------   

# Environment set-up

### Pre-requisites
* python 3.6 or later
* pip python package manager, already available on most systems with a python interpreter installed ([pip documentation](https://pip.pypa.io/en/stable/))
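
The version requirement above can be checked from inside the notebook before anything is installed. This is a minimal sketch; `meets_minimum` is a hypothetical helper name used only for illustration.

```python
import sys

def meets_minimum(version_info=sys.version_info, minimum=(3, 6)):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)

# Stop early with a clear message if the interpreter is too old
if not meets_minimum():
    raise RuntimeError("This notebook requires python 3.6 or later")

print("python version OK:", sys.version.split()[0])
```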

### Install packages

In [None]:
# Libraries for data handling, statistics, and plotting
import json
from pprint import pprint
from scipy import stats

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


In [None]:

# Install the PIC-SURE client and adapter libraries from GitHub
import sys
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git@new-search

import PicSureBdcAdapter

### Set the display parameters for tables and plots

In [None]:
# Pandas DataFrame display options
pd.set_option("display.max_rows", 100)

# Matplotlib parameters options
# Set the default figure size to 14 x 8 inches (width, height)
plt.rcParams["figure.figsize"] = [14, 8]

font = {'weight' : 'bold',
        'size'   : 14}

plt.rc('font', **font)

## Connecting to the PIC-SURE network

In [None]:
# Uncomment production URL when testing in production
# PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
PICSURE_network_URL = "https://biodatacatalyst.integration.hms.harvard.edu/picsure"
token_file = "token.txt"

with open(token_file, "r") as f:
    my_token = f.read().strip()  # drop any surrounding whitespace or trailing newline
    
bdc = PicSureBdcAdapter.Adapter(PICSURE_network_URL, my_token)
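
Token files often end with a newline or may be missing altogether, and either case can lead to confusing authentication errors. Below is a minimal sketch of a more defensive read, assuming the token is stored as plain text in a single file. `read_token` is a hypothetical helper and is not part of the PIC-SURE client libraries.

```python
from pathlib import Path

def read_token(path):
    """Read a security token from a text file, dropping surrounding whitespace."""
    token_path = Path(path)
    if not token_path.is_file():
        raise FileNotFoundError(f"Token file not found: {token_path}")
    token = token_path.read_text().strip()
    if not token:
        raise ValueError(f"Token file is empty: {token_path}")
    return token
```

With this helper, `my_token = read_token(token_file)` could take the place of the `with open(...)` block in the cell above.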

# Analysis on Hematopoietic Cell Transplant for Sickle Cell Disease (HCT for SCD) data

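<font color='darkgreen'>**Goal: examine the age at transplant for male patients with avascular necrosis who received their transplant after 1999.**</font> 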

## Find all variables in the HCT for SCD data dictionary
We can find all the variables associated with the study of interest, HCT for SCD, by searching the data dictionary for the appropriate PHS number or study ID (phs002385).

We can find the study IDs in the Data Access Dashboard in the user interface.

In [None]:
scd_dictionary = bdc.useDictionary().dictionary().find('phs002385')
scd_dataframe = scd_dictionary.dataframe()
print(scd_dataframe.shape)
scd_dataframe.head()


## Querying and retrieving data

If you are interested in learning more about the available query methods, see the `1_PICSURE_API_101.ipynb` notebook. 

Let's say we are interested in the age at which patients received their transplants, restricted to a cohort that meets all of the following criteria:
* males
* patients with avascular necrosis
* patients that received their transplant after the year 1999

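First we will find variables pertaining to sex and avascular necrosis. We can do this by searching the dictionary for "Sex" and "Avascular necrosis". The first cell below searches the `columnmeta_description` column of `scd_dataframe`; the following cells look up exact matches in the `simplified_name` column of `variablesDict`, which is assumed to be a dataframe of the data dictionary with `simplified_name` and `name` columns that has been defined before these cells are run.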

In [None]:
# na=False treats rows with a missing description as non-matches
scd_dataframe[scd_dataframe.columnmeta_description.str.contains("Sex", na=False)]

In [None]:
sex_var = variablesDict.loc[variablesDict["simplified_name"] == "Sex", "name"].values[0]
avascular_necrosis_varname = variablesDict.loc[variablesDict["simplified_name"] == "Avascular necrosis", "name"].values[0]

In [None]:
# Peek at the result for avascular necrosis
variablesDict.loc[variablesDict["simplified_name"] == "Avascular necrosis", "name"] 
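
The lookup pattern used in the cells above can be tried out on a small stand-in dataframe. The rows and variable paths below are invented for illustration and do not come from the real data dictionary; only the column names `simplified_name` and `name` follow the cells above. `find_variable` is a hypothetical helper.

```python
import pandas as pd

# Toy stand-in for the data dictionary; all values are made up
toy_dict = pd.DataFrame({
    "simplified_name": ["Sex", "Avascular necrosis", "Year of transplant", None],
    "name": ["\\study\\Sex\\", "\\study\\Avascular necrosis\\",
             "\\study\\Year of transplant\\", "\\study\\Unnamed\\"],
})

def find_variable(dictionary, simplified_name):
    """Return the full variable name for an exact simplified_name match."""
    matches = dictionary.loc[dictionary["simplified_name"] == simplified_name, "name"]
    if matches.empty:
        raise KeyError(f"No variable named {simplified_name!r}")
    return matches.iloc[0]

# Substring search; na=False treats missing names as non-matches
contains_transplant = toy_dict[
    toy_dict["simplified_name"].str.contains("transplant", case=False, na=False)
]

print(find_variable(toy_dict, "Sex"))
print(contains_transplant["name"].tolist())
```

Compared with indexing `.values[0]` directly, the helper fails with a descriptive `KeyError` when nothing matches, which makes a misspelled variable name easier to spot.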

Next, we can find the variable pertaining to "Year of transplant".

In [None]:
yr_transplant_varname = variablesDict.loc[variablesDict["simplified_name"] == "Year of transplant", "name"].values[0]
yr_transplant_varname

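Now we can create a new query and apply our filters to retrieve the cohort of interest. The cells below assume that `my_query` is a new, empty query object created from the PIC-SURE resource, as described in the `1_PICSURE_API_101.ipynb` notebook.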

In [None]:
# Patients with avascular necrosis
my_query.select().add(avascular_necrosis_varname)
my_query.filter().add(avascular_necrosis_varname, "Yes")

In [None]:
# Males
my_query.select().add(sex_var)
my_query.filter().add(sex_var, "Male")

In [None]:
# Patients receiving transplants after 1999
my_query.select().add(yr_transplant_varname)
my_query.filter().add(yr_transplant_varname, min=2000)
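
To make the combined effect of the three filters concrete, here is the same selection logic applied to a small, made-up pandas dataframe. It assumes the filters are combined so that a patient must satisfy all of them, and that the `min` bound is inclusive; the column names and values are invented and are not the PIC-SURE variable names.

```python
import pandas as pd

# Invented example patients, indexed by patient ID
toy = pd.DataFrame({
    "Patient ID": [1, 2, 3, 4, 5],
    "sex": ["Male", "Female", "Male", "Male", "Male"],
    "avascular_necrosis": ["Yes", "Yes", "No", "Yes", "Yes"],
    "year_of_transplant": [2005, 2010, 2012, 1998, 2000],
}).set_index("Patient ID")

# A patient is kept only if every condition holds (logical AND)
cohort = toy[
    (toy["sex"] == "Male")
    & (toy["avascular_necrosis"] == "Yes")
    & (toy["year_of_transplant"] >= 2000)  # assumption: min=2000 includes 2000
]

print(cohort.index.tolist())
```

In this toy data, only patients 1 and 5 meet all three criteria.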

Using this cohort, we can add the variable of interest: "Patient age at transplant, years".

In [None]:
age_transplant_var = variablesDict.loc[variablesDict["simplified_name"] == "Patient age at transplant, years", "name"].values[0]
my_query.select().add(age_transplant_var)

## Retrieving the data

Once our query object is built, we use the `getResultsDataFrame()` function to retrieve the data corresponding to our query.

In [None]:
query_df = my_query.getResultsDataFrame().set_index("Patient ID")

In [None]:
query_df.head()

Once the data has been retrieved as a dataframe, you can use python functions to conduct analyses and create visualizations, such as this:

In [None]:
query_df[age_transplant_var].plot.hist(legend=None, 
                                       title= "Age when transplant received in males with avascular necrosis from 2000 to present", 
                                       bins=15)
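
Alongside the histogram, a numeric summary can be useful. The sketch below runs on invented ages, since the real values depend on the query results; with the retrieved data, `query_df[age_transplant_var]` would take the place of `ages`. The bin edges are arbitrary choices for illustration.

```python
import pandas as pd

# Invented ages at transplant, in years
ages = pd.Series([4.5, 9.0, 12.3, 15.8, 21.0, 33.4], name="age_at_transplant_years")

# Count, mean, standard deviation, min, quartiles, and max
summary = ages.describe()

# Number of patients per age group: (0, 10], (10, 20], (20, 40]
age_bins = pd.cut(ages, bins=[0, 10, 20, 40]).value_counts().sort_index()

print(summary)
print(age_bins)
```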