# Introduction to the PIC-SURE API
This tutorial notebook aims to get the user up and running quickly with the PIC-SURE API. 

## PIC-SURE python API
### What is PIC-SURE?
As part of the BioData Catalyst Initiative, the Patient Information Commons: Standard Unification of Research Elements (PIC-SURE) platform has been integrating clinical and genomic datasets from multiple TOPMed and TOPMed-related studies funded by the National Heart, Lung, and Blood Institute (NHLBI). 

The source data exposed through the PIC-SURE API are organized in many different ways. PIC-SURE hides this complexity and exposes the different study datasets in a single tabular format. By simplifying data extraction, it allows investigators to focus on downstream analysis and facilitates reproducible science. 

### More about PIC-SURE
The API is available in two different programming languages, python and R, enabling investigators to query the database the same way using either language.

PIC-SURE is a larger project, of which the R and python PIC-SURE APIs are only a small part. Among other things, PIC-SURE also offers a graphical user interface that allows researchers to explore variables across multiple studies, filter participants that match criteria, and create cohorts from this interactive exploration.

The python API is actively developed by the Avillach Lab at Harvard Medical School. 

PIC-SURE API GitHub repositories:
* https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-python-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-python-client

 ------- 

## Getting your user-specific security token

**Before running this notebook, please review the "Get your security token" documentation in the NHLBI_BioData_Catalyst [README.md file](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/NHLBI_BioData_Catalyst#get-your-security-token). It explains how to get a security token, which is mandatory to access the databases.**

## Environment set-up

### Pre-requisites
**Check with Jason about this**
* python 3.6 or later
* pip python package manager, already available in most systems with a python interpreter installed (link to pip)

### Install packages
The first step to using the PIC-SURE API is to install the packages needed. The following code installs the PIC-SURE API components from GitHub, specifically:
* PIC-SURE Client
* PIC-SURE Adapter

In [None]:
import pandas as pd
import sys
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git@new-search

import PicSureBdcAdapter

## Connecting to a PIC-SURE resource

The following is required to get access to the PIC-SURE API:
* a network URL
* a user-specific security token

The following code specifies the network URL as the BioData Catalyst Powered by PIC-SURE URL and references the user-specific token saved as `token.txt`.

If you have not already retrieved your user-specific token, please refer to the "Get your security token" section of the [README.md](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/NHLBI_BioData_Catalyst#get-your-security-token) file.
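
A token file saved from a text editor often ends with a newline, and an empty file is an easy mistake to miss. The following is a minimal sketch of a more defensive way to load the token; `read_token` is a hypothetical helper written for this notebook and is not part of the PIC-SURE API.

```python
from pathlib import Path


def read_token(path="token.txt"):
    """Return the token stored in `path`, without surrounding whitespace.

    Hypothetical helper for illustration; not part of the PIC-SURE API.
    """
    token = Path(path).read_text().strip()
    if not token:
        # Fail early with a readable message instead of a confusing
        # authentication error later on.
        raise ValueError(f"{path} is empty; paste your security token into it first")
    return token
```

With this helper, the connection cell could call `read_token(token_file)` instead of reading the file inline.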

In [None]:
# Uncomment production URL when testing in production
# PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
PICSURE_network_URL = "https://biodatacatalyst.integration.hms.harvard.edu/picsure"
token_file = "token.txt"

with open(token_file, "r") as f:
    my_token = f.read().strip() # Drop any trailing newline or spaces around the token
    
bdc = PicSureBdcAdapter.Adapter(PICSURE_network_URL, my_token)

### Can we have this not printed out? Seems more confusing than helpful to the user

## Getting help with the PIC-SURE API
Each object in the PIC-SURE API has a `help()` method that you can use to get more information about its functionality.

In [None]:
bdc.help()

For example, the above output lists and briefly defines the four methods that can be used with the `bdc` resource. 

## Using the PIC-SURE variables dictionary
Now that you have set up your connection to the PIC-SURE API, let's determine which study or studies you are authorized to access. The `dictionary` method can be used to search the data dictionary for a specific term or to retrieve information about all the variables you are authorized to access.

In [None]:
dictionary = bdc.useDictionary().dictionary() # Set up the dictionary
all_variables = dictionary.find() # Retrieve all variables you have access to

In [None]:
## Code used to identify a bug

#concepts = all_variables.listPaths()
#for i in concepts:
#    phs = i.split("\\")[1]
#    if 'phs' not in phs:
#        print(i)
## Note: all concept paths start with phs
## There aren't _consents or anything like that
#print(all_variables.count())
#print(len(concepts))

In [None]:
all_variables.listPaths()[0:10] # Show the first ten variables

***Note: if you do not see any variables, you are not authorized to access any data. The rest of the notebook will not work.*** 

The above output lists the first ten variables you are authorized to access in PIC-SURE. These variables are listed as concept paths organized in the "phs", "pht", and "phv" format that is used by the [database of Genotypes and Phenotypes (dbGaP)](https://www.ncbi.nlm.nih.gov/gap/). The concept path is organized in the following format:

**\\\phs\\\pht\\\phv\\\variable name\\\**

* **phs** corresponds to the study accession number 
* **pht** corresponds to the trait table accession number
* **phv** corresponds to the variable accession number
* the **variable name** corresponds to the encoded variable name that was used by the submitters of the data

Generally, a given study will have several tables, and those tables have several variables. You can find more information on the dbGaP data structure [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2031016/).
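
To make the concept path format concrete, here is a small sketch that splits a path into its four parts. The function name `parse_concept_path` and the example path are made up for illustration (the accession numbers are placeholders, not real dbGaP accessions), and the sketch assumes the path follows the phs/pht/phv/variable name layout described above.

```python
def parse_concept_path(path):
    """Split a backslash-delimited concept path into its named parts.

    Hypothetical helper; assumes the phs/pht/phv/variable-name layout.
    """
    # Leading and trailing backslashes produce empty strings; drop them.
    parts = [p for p in path.split("\\") if p]
    if len(parts) < 4:
        raise ValueError(f"Path does not follow the expected format: {path!r}")
    phs, pht, phv = parts[:3]
    # Anything after the third element is treated as the variable name.
    variable_name = "\\".join(parts[3:])
    return {"phs": phs, "pht": pht, "phv": phv, "variable_name": variable_name}


# Placeholder path for illustration only.
example_path = "\\phs000000\\pht000000\\phv00000000\\EXAMPLE_VAR\\"
parsed = parse_concept_path(example_path)
print(parsed)
```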

Let's save the phs accession number from a study that you are authorized to access.
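
If you are authorized to access more than one study, it can help to see every accession before picking one. The sketch below tallies variables per study from a list of concept paths. `count_variables_per_study` and the example paths are hypothetical; in the notebook you would pass the list returned by `all_variables.listPaths()` instead.

```python
from collections import Counter


def count_variables_per_study(paths):
    """Count concept paths per study accession (the first path element)."""
    studies = Counter()
    for path in paths:
        parts = path.split("\\")
        # parts[0] is the empty string before the leading backslash.
        if len(parts) > 1 and parts[1]:
            studies[parts[1]] += 1
    return studies


# Placeholder paths for illustration only.
example_paths = [
    "\\phs000001\\pht000001\\phv00000001\\VAR_A\\",
    "\\phs000001\\pht000001\\phv00000002\\VAR_B\\",
    "\\phs000002\\pht000010\\phv00000100\\VAR_C\\",
]
per_study = count_variables_per_study(example_paths)
for accession, n_variables in per_study.items():
    print(accession, n_variables)
```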

# Is this format for the concept paths in BioLINCC studies?

In [None]:
phs_number = all_variables.listPaths()[0].split('\\')[1]
print("The phs accession being used is", phs_number)

Now let's find all of the variables associated with that study. We can search for these by passing the phs accession number to the `find()` method. Additionally, we can view these variables in a convenient dataframe format using the `dataframe()` method.

In [None]:
my_variables = dictionary.find(phs_number) # Search for the phs accession number
print("There are", my_variables.count(), "variables that were returned for your search.")
my_variables_df = my_variables.dataframe() # Save the results as a dataframe
my_variables_df.head() # View the first 5 rows 

# Is going through each column and describing the data overkill? --> Yes.

The `dataframe()` method displays additional information about the variables in PIC-SURE. Here is a brief explanation of each column:
* **values:** data dictionary information, specifically how to decode the raw data to human-readable data. For example, if the values for the variable "Have you had asthma?" were `{'0': 'Yes', '1': 'No'}`, the 0s in the raw data were decoded to "Yes" and the 1s were decoded to "No".
* **metadata_tags:** tags that are associated with the variable's metadata. These tags can be used to search for or filter to specific variables.
* **value_tags:** this column ???
* **studyId:** the phs accession number associated with the variable.
* **dtId:** the pht accession number associated with the variable.
* **varId:** the phv accession number of the variable.
* **is_categorical:** whether the variable is categorical ("True") or not ("False").
* **is_continuous:** whether the variable is continuous ("True") or not ("False").
* **reported_type:** this column ???
* **description:** the description associated with the variable name.
* **HPDS_PATH:** the PIC-SURE variable name
* **study_id:** the phs accession number associated with the variable. *How is this different than studyId?*
* **var_report_description:** the description associated with the variable name. *How is this different than description?*
* **dataTableName:** the name associated with the data table or pht accession number.
* **variable:** *This column seems empty?*
* **name:** the encoded name of the variable.
* **dataTableId:** the pht accession number associated with the variable. *How is this different than dtId?*
* **dataTableDescription:** the description associated with the data table or pht accession number.
* **id:** the full phv accession number of the variable. *Is this the right way to explain this?*
* **calculated_type:** the PIC-SURE determined type of the data associated with the variable.
* **var_name:** the encoded name of the variable. *How is this different than the name?*
* **units:** units associated with the variable if the variable is continuous or numeric.
* **unit:** units associated with the variable if the variable is continuous or numeric. *How is this different from units?*
* **var_report_comment:** additional comment from the variable report. *Is this correct?*
* **logical_max:** *What is this?*
* **coll_interval:** *what is this?*
* **logical_min:** *what is this?*
* **type:** the PIC-SURE determined type of the data associated with the variable. *How is this different from calculated_type?*
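
To illustrate what the **values** column describes, the sketch below applies the asthma mapping from the example above to a few raw codes using pandas. The raw data here are invented for illustration; this is only meant to show what "decoding" means, not how PIC-SURE performs it internally.

```python
import pandas as pd

# Mapping taken from the "Have you had asthma?" example above.
value_map = {"0": "Yes", "1": "No"}

# Invented raw codes, stored as strings to match the mapping keys.
raw = pd.Series(["0", "1", "1", "0"], name="asthma_raw")

# Series.map replaces each code with its human-readable label.
decoded = raw.map(value_map)
print(decoded.tolist())
```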

We can also view this information for individual variables using the `varInfo()` method. 

In [None]:
first_var = my_variables.listPaths()[0]
my_variables.varInfo(first_var)

# Add a section about searching for a term of interest - do it yourself :D

## Using PIC-SURE to build a query and retrieve data
You can also use the PIC-SURE API to build a query and retrieve data. With this functionality, you can filter based on specific variables, add others, and export the data as a dataframe into this notebook. 

The first step is to set up the `authQuery`.

In [None]:
authPicSure = bdc.useAuthPicSure()
authQuery = authPicSure.query()

# Add in table of functions from old notebook

### Build a query with a categorical variable
Let's practice building a query by filtering on variables. First, let's select a categorical variable to use. We can identify one using the `is_categorical` column of the variable dataframe.

In [None]:
categorical_var_info = my_variables_df[my_variables_df.is_categorical == True].iloc[0] # Filter to the first categorical variable
categorical_var = categorical_var_info.HPDS_PATH
print("We will use the PIC-SURE variable", categorical_var, "which is", categorical_var_info.description)
categories = list(categorical_var_info["values"].values()) # Look up the "values" column by name; on a Series, .values is the row's raw array
print("\nHere are the categories associated with the variable:", categories)
filter_category = categories[0]
print("\nWe will filter to participants with the value", filter_category, "for the variable.")

We now have the PIC-SURE variable and the value to apply to the filter saved. We can use the `filter()` method to add this information to our query. 

In [None]:
authQuery.filter().add(categorical_var, filter_category)

Note that although we are filtering by only one value here, you can filter by multiple values by passing a list of values to `filter().add()`.
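
As a local analogy for filtering on several categories at once, the pandas sketch below keeps rows whose value matches any entry in a list. The dataframe and column names are invented, and the "match any listed value" behaviour is an assumption about how a multi-value filter behaves rather than something confirmed by the API documentation.

```python
import pandas as pd

# Invented participant data for illustration only.
participants = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "asthma": ["Yes", "No", "Unknown", "Yes"],
})

# Keep participants whose value is any of the listed categories.
wanted_categories = ["Yes", "Unknown"]
subset = participants[participants["asthma"].isin(wanted_categories)]
print(subset)
```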

Now we can export our filtered data to a pandas dataframe in this notebook using `getResultsDataFrame()`.

In [None]:
results = authQuery.getResultsDataFrame(low_memory = False)
results.head()

# Mention that the columns added to this dataframe are from the rows in the data dictionary. Link the two.

The above dataframe should contain the variable we selected to add to the query. Additionally, all participants should have the value we filtered by for that variable.
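
One way to confirm this is to check that the filtered column holds a single value. The sketch below uses an invented dataframe in place of `results`; `filter_was_applied` is a hypothetical helper, and the column is assumed to be named after the variable's concept path.

```python
import pandas as pd


def filter_was_applied(df, column, expected_value):
    """Return True when df is non-empty and every row of `column` equals expected_value."""
    return (not df.empty) and bool((df[column] == expected_value).all())


# Placeholder concept path and data standing in for the query results.
example_var = "\\phs000000\\pht000000\\phv00000000\\EXAMPLE_VAR\\"
example_results = pd.DataFrame({
    "Patient ID": [101, 102, 103],
    example_var: ["Yes", "Yes", "Yes"],
})
print(filter_was_applied(example_results, example_var, "Yes"))
```

In the notebook, the equivalent check would be `filter_was_applied(results, categorical_var, filter_category)`.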

# Include a visualization showing that the query worked and the filter was applied

### Build a query with a continuous variable
Similarly, we can create a query using a continuous variable. Instead of using the `is_continuous` column, let's make use of the `logical_max` and `logical_min` columns to determine the range of values that can be selected for our query.

In [None]:
continuous_var_info = my_variables_df[~pd.isna(my_variables_df.logical_max)].iloc[0] # Filter to the first variable with a defined logical_max
continuous_var = continuous_var_info.HPDS_PATH
print("We will use the PIC-SURE variable", continuous_var, "which is", continuous_var_info.description)
value_range = [float(continuous_var_info.logical_min), float(continuous_var_info.logical_max)]
print("\nHere is the range of values that is associated with the variable:", value_range)
max_value = value_range[1]
min_value = value_range[1] - (value_range[1] - value_range[0])/2 # Midpoint of the range, so we keep the upper half
print("\nWe will filter to participants with the values between", min_value, "and", max_value, "for the variable.")

Again, we can use the `filter()` method to add the continuous variable to the query. Once added, we can retrieve our results dataframe.

In [None]:
authQuery = authPicSure.query() # Start a new query
authQuery.filter().add(continuous_var, min_value, max_value)

In [None]:
results = authQuery.getResultsDataFrame(low_memory = False)
results.head()

The above dataframe should contain the variable we selected to add to the query. Additionally, the values listed as part of that variable should be within the range of values we defined.  
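
A quick way to verify this is to test every value against the bounds. The sketch below uses invented data in place of `results`; `values_within_range` is a hypothetical helper, and it treats both bounds as inclusive, which is an assumption about how the range filter behaves.

```python
import pandas as pd


def values_within_range(df, column, low, high):
    """Return True when every non-missing value of `column` lies in [low, high]."""
    # Coerce to numbers in case the column was read as text; drop missing values.
    values = pd.to_numeric(df[column], errors="coerce").dropna()
    return bool(values.between(low, high).all())


# Placeholder concept path and data standing in for the query results.
example_var = "\\phs000000\\pht000000\\phv00000000\\EXAMPLE_AGE\\"
example_results = pd.DataFrame({
    "Patient ID": [101, 102, 103],
    example_var: [55.0, 60.5, 72.0],
})
print(values_within_range(example_results, example_var, 50, 80))
```

In the notebook, the equivalent check would be `values_within_range(results, continuous_var, min_value, max_value)`.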

# Link back to data dictionary

### Build a query with multiple variables
You can also add multiple variables to a single query. Let's build a query with the first five variables for the study of interest.

In [None]:
query_vars = my_variables.listPaths()[0:5]
print("We will add the following variables to the query:", query_vars)

We can use a different method, `anyof()`, to add variables to the query. This will filter to participants that have data for at least one of the variables added.  
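
The "at least one of the variables" rule can be pictured with a small pandas example. The dataframe below is invented for illustration; it only mimics the selection logic described above and does not call the PIC-SURE API.

```python
import pandas as pd

# Invented data: None marks a participant with no data for that variable.
df = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "var_a": [1.0, None, None],
    "var_b": [None, "x", None],
})

# A participant is kept when any of the selected variables is non-missing.
selected_vars = ["var_a", "var_b"]
has_any = df[selected_vars].notna().any(axis=1)
kept = df[has_any]
print(kept)
```

Participant 3 has no data for either variable and is therefore dropped.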

In [None]:
authQuery = authPicSure.query() # Start a new query
authQuery.anyof().add(query_vars)

In [None]:
results = authQuery.getResultsDataFrame(low_memory = False)
results.head()

### Changing study consents in the query
# Need to add this section

# Add the query ID section