# PIC-SURE API use-case: quick analysis on Hematopoietic Cell Transplant for Sickle Cell Disease (HCT for SCD) data

For a more basic introduction to the Python PIC-SURE API, see the `1_PICSURE_API_101.ipynb` notebook.
 
**Before running this notebook, please be sure to get a user-specific security token. For more information about how to proceed, see the "Get your security token" instructions in the [README.md](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/NHLBI_BioData_Catalyst#get-your-security-token).**

 -------   

## Environment set-up

### Pre-requisites
* Python 3.6 or later
* pip, the Python package manager, already available on most systems with a Python interpreter installed

### Install packages

**Note that if you are using the dedicated PIC-SURE environment within the BioData Catalyst Seven Bridges platform, the necessary packages have already been installed.**

In [None]:
import sys

In [None]:
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git@new-search


In [None]:
import PicSureClient
import PicSureBdcAdapter
import pandas as pd
import matplotlib.pyplot as plt


##### Set the display parameters for tables and plots

In [None]:
# Pandas DataFrame display options
pd.set_option("display.max_rows", 100)

# Matplotlib parameter options
# Set the default figure size to 14 x 8 inches
plt.rcParams["figure.figsize"] = (14, 8)

font = {'weight' : 'bold',
        'size'   : 14}

plt.rc('font', **font)

## Connecting to the PIC-SURE network

In [None]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
token_file = "token.txt"

# Read the token and drop any trailing newline or surrounding whitespace
with open(token_file, "r") as f:
    my_token = f.read().strip()

bdc = PicSureBdcAdapter.Adapter(PICSURE_network_URL, my_token)
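
A token file saved from a text editor often ends with a newline, and an empty file is an easy mistake to make. The sketch below shows one way to read the token defensively. The helper name `read_token` and the demo token value are illustrative assumptions and are not part of the PIC-SURE client libraries.

```python
import tempfile
from pathlib import Path


def read_token(path):
    """Return the token stored at `path`, stripped of surrounding whitespace."""
    token = Path(path).read_text().strip()
    if not token:
        raise ValueError(f"No token found in {str(path)!r}")
    return token


# Demonstration with a throwaway file and a made-up token value.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "token.txt"
    demo_file.write_text("example-token-value\n")
    demo_token = read_token(demo_file)

print(repr(demo_token))
```

In this notebook, the same helper could be called as `read_token(token_file)` before creating the adapter.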

# Analysis on Hematopoietic Cell Transplant for Sickle Cell Disease (HCT for SCD) data

Let's search the data dictionary in PIC-SURE to find all the variables in the `HCT for SCD` study, which is associated with PHS study number `phs002385`.



In [None]:
scd_dictionary = bdc.useDictionary().dictionary().find('phs002385')
scd_dataframe = scd_dictionary.dataframe()
print(scd_dataframe.shape)
scd_dataframe.head()


## Building a query: investigating male patients with avascular necrosis who received their transplant after 1999


If you are interested in learning more about the available query methods, see the `1_PICSURE_API_101.ipynb` notebook. 

Let's say we are interested in the age at which patients from the following cohort received their transplant:

- males
- patients with avascular necrosis
- patients who received their transplant after 1999

We will use regular expressions to search the variable descriptions within the HCT for SCD study to find these variables.

In [None]:
scd_dataframe[scd_dataframe.columnmeta_description.str.contains("Sex|Avascular necrosis|Year of transplant$")]

sex_var = scd_dataframe[scd_dataframe.columnmeta_description.str.contains("Sex")][['HPDS_PATH']].iloc[0,0]
necrosis_var = scd_dataframe[scd_dataframe.columnmeta_description.str.contains("Avascular necrosis")][['HPDS_PATH']].iloc[0,0]
transplant_var = scd_dataframe[scd_dataframe.columnmeta_description.str.contains("Year of transplant$")][['HPDS_PATH']].iloc[0,0]
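
The trailing `$` in `"Year of transplant$"` anchors the pattern to the end of the description, so longer descriptions that merely start with the same words are not matched. The sketch below illustrates this on a small made-up dictionary. The descriptions, the `/toy/...` paths, and the helper `find_path` are hypothetical and only mimic the two columns used above.

```python
import pandas as pd

# Made-up stand-in for the data dictionary; not real HCT for SCD variables.
toy_dictionary = pd.DataFrame({
    "columnmeta_description": [
        "Sex",
        "Avascular necrosis",
        "Year of transplant",
        "Year of transplant, grouped",
    ],
    "HPDS_PATH": ["/toy/sex/", "/toy/necrosis/", "/toy/year/", "/toy/year_grouped/"],
})

descriptions = toy_dictionary["columnmeta_description"]

# Without the anchor, both "Year of transplant..." rows match.
unanchored = descriptions.str.contains("Year of transplant")
# With the anchor, only the description ending in "transplant" matches.
anchored = descriptions.str.contains("Year of transplant$")


def find_path(dictionary, pattern):
    """Return the HPDS_PATH of the first description matching `pattern`."""
    mask = dictionary["columnmeta_description"].str.contains(pattern, na=False)
    matches = dictionary[mask]
    if matches.empty:
        raise LookupError(f"No variable description matches {pattern!r}")
    return matches["HPDS_PATH"].iloc[0]


print(find_path(toy_dictionary, "Year of transplant$"))
```

Raising an error when nothing matches makes a mistyped pattern visible immediately, whereas `.iloc[0, 0]` on an empty selection fails with a less descriptive `IndexError`.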


Now we can create a new query using the `HPDS_PATH` associated with our variables of interest and apply our filters to retrieve the cohort of interest.



In [None]:
# Initialize a query
authPicSure = bdc.useAuthPicSure()
myquery = authPicSure.query()

# Filter to Males
myquery.filter().add(sex_var, 'Male')

# Filter to patients with Avascular Necrosis
myquery.filter().add(necrosis_var, 'Yes')

# Filter to patients with year of transplant after 1999 (2000 or later)
myquery.filter().add(transplant_var, min = 2000)

We also want to retrieve the age at which each patient received their transplant, but without filtering the cohort on that value. The `select()` method of the query adds a variable to the results without applying a filter.

The variable of interest is "Patient age at transplant, years"; we look it up in the dictionary and add it to the query:



In [None]:
age_at_transplant = scd_dataframe[scd_dataframe.columnmeta_description.str.contains("age at transplant, years$")][['HPDS_PATH']].iloc[0,0]
myquery.select().add(age_at_transplant)
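
As a rough analogy only, the difference between filtering on a variable and selecting it can be pictured with a plain pandas DataFrame: filters decide which rows are kept, while a selected variable is simply returned for those rows. The data and column names below are made up, and this is not a description of how PIC-SURE evaluates a query internally.

```python
import pandas as pd

# Made-up patient records; the column names are illustrative only.
toy_patients = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Male", "Male", "Male"],
    "avascular_necrosis": ["Yes", "No", "Yes", "Yes", "Yes", "Yes"],
    "year_of_transplant": [2005, 2010, 2012, 1999, 2015, 2018],
    "age_at_transplant": [12.0, 30.5, 8.0, 21.0, 17.0, None],
})

# "Filter" step: three conditions define the cohort.
in_cohort = (
    (toy_patients["sex"] == "Male")
    & (toy_patients["avascular_necrosis"] == "Yes")
    & (toy_patients["year_of_transplant"] > 1999)
)

# "Select" step: return the age column for the cohort without
# restricting on its values, so a missing age stays in the result.
toy_cohort = toy_patients.loc[in_cohort, ["age_at_transplant"]]

print(toy_cohort)
```

In this toy cohort the patient transplanted in 1999 is excluded by the year condition, while the patient with a missing age is kept because age is only selected, not filtered on.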

## Retrieving the data

Once the query object is built, we use the `getResultsDataFrame()` method to retrieve the corresponding data as a pandas DataFrame.

In [None]:
results = myquery.getResultsDataFrame()

In [None]:
results.head()

Once the data has been retrieved as a dataframe, you can use python functions to conduct analyses and create visualizations, such as this:

In [None]:
results[age_at_transplant].plot.hist(legend=None, 
                                      title= "Age when transplant received in males with avascular necrosis from 2000 to present", 
                                      bins=15)
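
Alongside the histogram, a few summary numbers can help when reading the plot. The sketch below uses a made-up series of ages in place of `results[age_at_transplant]` and computes basic statistics plus the counts that a 15-bin histogram would display. The values are illustrative and are not results from the HCT for SCD study.

```python
import numpy as np
import pandas as pd

# Made-up ages standing in for results[age_at_transplant].
toy_ages = pd.Series([4.0, 9.5, 12.0, 17.0, None, 33.0], name="age_at_transplant")

# Drop missing values before computing statistics or bin counts.
clean_ages = toy_ages.dropna()

summary = {
    "n": int(clean_ages.count()),
    "min": float(clean_ages.min()),
    "median": float(clean_ages.median()),
    "max": float(clean_ages.max()),
}

# Counts and bin edges for 15 equal-width bins over the observed range.
counts, edges = np.histogram(clean_ages.to_numpy(), bins=15)

print(summary)
print(counts)
```

With real data, replacing `toy_ages` by `results[age_at_transplant]` would give the corresponding figures for the retrieved cohort.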