# Introduction to the PIC-SURE API
This tutorial notebook is intended to get you up and running quickly with the PIC-SURE API.

## PIC-SURE python API
### What is PIC-SURE?
As part of the BioData Catalyst Initiative, the Patient Information Commons: Standard Unification of Research Elements (PIC-SURE) platform has been integrating clinical and genomic datasets from multiple TOPMed and TOPMed-related studies funded by the National Heart, Lung, and Blood Institute (NHLBI). 

The original data exposed through the PIC-SURE API are organized in many heterogeneous ways. PIC-SURE hides this complexity and exposes the different study datasets in a single tabular format. By simplifying data extraction, it allows investigators to focus on downstream analysis and facilitates reproducible science.

### More about PIC-SURE
The API is available in two different programming languages, python and R, enabling investigators to query the database the same way using either language.

The R and python PIC-SURE APIs are only a small part of the larger PIC-SURE project. Among other things, PIC-SURE also offers a graphical user interface that allows researchers to explore variables across multiple studies, filter participants that match criteria, and create cohorts from this interactive exploration.

The python API is actively developed by the Avillach Lab at Harvard Medical School. 

PIC-SURE API GitHub repositories:
* https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-python-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-python-client

 ------- 

## Getting your user-specific security token

**Before running this notebook, please review the "Get your security token" documentation in the NHLBI_BioData_Catalyst [README.md file](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/NHLBI_BioData_Catalyst#get-your-security-token). It explains how to get a security token, which is required to access the databases.**

## Environment set-up

### Pre-requisites
* python 3.6 or later
* pip python package manager, already available in most systems with a python interpreter installed (link to pip)

### Install packages
The first step to using the PIC-SURE API is to install the packages needed. The following code installs the PIC-SURE API components from GitHub, specifically:
* PIC-SURE Client
* PIC-SURE Adapter
* BioData Catalyst PIC-SURE Adapter

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import sys
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git@new-search

import PicSureBdcAdapter

## Connecting to a PIC-SURE resource

The following is required to get access to the PIC-SURE API:
* a network URL
* a user-specific security token

The following code specifies the network URL as the BioData Catalyst Powered by PIC-SURE URL and references the user-specific token saved as `token.txt`.

If you have not already retrieved your user-specific token, please refer to the "Get your security token" section of the [README.md](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API/tree/master/NHLBI_BioData_Catalyst#get-your-security-token) file.

In [None]:
# Uncomment production URL when testing in production
# PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
PICSURE_network_URL = "https://biodatacatalyst.integration.hms.harvard.edu/picsure"
token_file = "token.txt"

with open(token_file, "r") as f:
    my_token = f.read().strip()  # Remove any trailing newline or whitespace
    
bdc = PicSureBdcAdapter.Adapter(PICSURE_network_URL, my_token)
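
A token file may be empty or may carry stray whitespace, either of which could make the connection fail in a way that is hard to diagnose. The sketch below shows one possible way to load and sanity-check the token before passing it to the adapter. The helper name `read_token` is hypothetical and is not part of the PIC-SURE API.

```python
from pathlib import Path


def read_token(path):
    """Return the token stored at `path` with surrounding whitespace removed.

    `read_token` is a hypothetical helper, not part of the PIC-SURE API.
    """
    token = Path(path).read_text().strip()
    if not token:
        raise ValueError(f"{path} is empty; paste your security token into it.")
    return token
```

In this notebook it could be used as `my_token = read_token("token.txt")`.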

## Getting help with the PIC-SURE API
Each of the objects in the PIC-SURE API has a `help()` method that you can use to get more information about its functionality.

In [None]:
bdc.help()

For example, the above output lists and briefly defines the four methods that can be used with the `bdc` resource. 

## Using the PIC-SURE variables dictionary
Now that you have set up your connection to the PIC-SURE API, let's determine which study or studies you are authorized to access. The `dictionary` method can be used to search the data dictionary for a specific term or to retrieve information about all the variables you are authorized to access.

In [None]:
dictionary = bdc.useDictionary().dictionary() # Set up the dictionary
all_variables = dictionary.find() # Retrieve all variables you have access to

In [None]:
## Code used to identify a bug

#concepts = all_variables.listPaths()
#for i in concepts:
#    phs = i.split("\\")[1]
#    if 'phs' not in phs:
#        print(i)
## Note: all concept paths start with phs
## There aren't _consents or anything like that
#print(all_variables.count())
#print(len(concepts))

In [None]:
list_all_variables = all_variables.listPaths()
studies = set([i.split("\\")[1] for i in list_all_variables])
if len(studies) > 0:
    print("You are authorized to access the following studies:\n", studies)
else:
    print("You are not authorized to access any studies.")

### ***Note: if you do not see any studies listed above, you are not authorized to access any data. The rest of the notebook will not work.*** 
Let's save one of these studies to use as an example for the rest of the notebook.

In [None]:
phs_number = list(studies)[-1]
print("The phs accession being used is", phs_number)

Now let's find all of the variables associated with that study. We can search for these using the `find()` method and searching for the phs accession number. 
Additionally, we can view a list of the variables returned from this search using the `listPaths()` method.

In [None]:
my_variables = dictionary.find(phs_number) # Search for the phs accession number
print("There are", my_variables.count(), "variables that were returned for your search.")
print("Here are some of the variables you have access to:\n", my_variables.listPaths()[0:10])

# Will need to update this accordingly once BioLINCC studies are included

The above output lists some of the variables you are authorized to access in PIC-SURE. These variables are listed as concept paths organized in the "phs", "pht", and "phv" format that is used by the [database of Genotypes and Phenotypes (dbGaP)](https://www.ncbi.nlm.nih.gov/gap/). The concept path is organized in the following format:

**\\\phs\\\pht\\\phv\\\variable name\\\**

* **phs** corresponds to the study accession number 
* **pht** corresponds to the trait table accession number
* **phv** corresponds to the variable accession number
* the **variable name** corresponds to the encoded variable name that was used by the submitters of the data

Generally, a given study will have several tables, and those tables have several variables. You can find more information on the dbGaP data structure [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2031016/).
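
To make the concept path format concrete, here is a minimal sketch that splits a path into the four components described above. The function name `parse_concept_path` and the sample path are made up for illustration; real concept paths returned by `listPaths()` may not always have exactly this depth.

```python
def parse_concept_path(path):
    """Split a dbGaP-style concept path into phs, pht, phv, and variable name.

    Hypothetical helper for illustration; not part of the PIC-SURE API.
    """
    # Concept paths are delimited by backslashes; drop the empty strings
    # produced by the leading and trailing delimiters.
    parts = [p for p in path.split("\\") if p]
    if len(parts) < 4:
        raise ValueError(f"Expected at least 4 components, got {len(parts)}: {path!r}")
    phs, pht, phv, variable_name = parts[:4]
    return {"phs": phs, "pht": pht, "phv": phv, "variable_name": variable_name}


# Made-up accession numbers, used only to show the layout
example_path = "\\phs000001\\pht000001\\phv00000001\\EXAMPLE_VAR\\"
print(parse_concept_path(example_path))
```

The same splitting logic is what the earlier cell relies on when it takes element `[1]` of each split path to collect the study accession numbers.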

We can view these variables in a detailed dataframe format using the `dataframe()` method.

In [None]:
my_variables_df = my_variables.dataframe() # Save the results as a dataframe
my_variables_df.head() # View the first 5 rows 

In [None]:
my_variables_df.columns

We can also view additional information for individual variables using the `varInfo()` method. 

In [None]:
first_var = my_variables.listPaths()[0]
my_variables.varInfo(first_var)

Now you can try to search for a term on your own. Below is sample code that searches for the term `sex`. To practice searching the data dictionary, you can change "sex" to a term you are interested in. If there are results for your term, you will see them displayed in a convenient dataframe format using the `displayResults()` method.

In [None]:
my_search = dictionary.find("sex") # Change sex to be your term of interest
if len(my_search.listPaths()) > 0:
    my_search.displayResults()
else:
    print("Search term returned no results. Please try searching a different term.")

## Using PIC-SURE to build a query and retrieve data
You can also use the PIC-SURE API to build a query and retrieve data. With this functionality, you can filter based on specific variables, add others, and export the data as a dataframe into this notebook. 

The first step to this is setting up the `authQuery`.

In [None]:
authPicSure = bdc.useAuthPicSure()
authQuery = authPicSure.query()

There are several methods that can be used to build a query, which are listed below.

| Method | Arguments / Input | Output |
|--------|-------------------|--------|
| query.select().add() | variable names (string) or list of strings | all variables included in the list (no record subsetting) |
| query.require().add() | variable names (string) or list of strings | all variables; only records that do not contain null values for input variables |
| query.anyof().add() | variable names (string) or list of strings | all variables; only records that contain at least one non-null value for input variables |
| query.filter().add() | variable name and additional filtering values | input variable; only records that match filter criteria |
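
The difference between these four methods is easiest to see on a tiny example. The sketch below is a toy model in plain Python, not the PIC-SURE implementation: the records, variable names, and function names are all made up, and it only mimics the record-subsetting behavior described in the table.

```python
# Made-up participant records; None stands for a missing (null) value.
records = [
    {"id": 1, "sex": "F", "age": 61.0},
    {"id": 2, "sex": "M", "age": None},
    {"id": 3, "sex": None, "age": None},
    {"id": 4, "sex": "F", "age": 45.0},
]


def select(rows, variables):
    """Keep every record; the variables are simply added as columns."""
    return list(rows)


def require(rows, variables):
    """Keep records with a non-null value for every listed variable."""
    return [r for r in rows if all(r.get(v) is not None for v in variables)]


def anyof(rows, variables):
    """Keep records with a non-null value for at least one listed variable."""
    return [r for r in rows if any(r.get(v) is not None for v in variables)]


def filter_values(rows, variable, allowed):
    """Keep records whose value for `variable` is one of `allowed`."""
    return [r for r in rows if r.get(variable) in allowed]


def ids(rows):
    return [r["id"] for r in rows]


print(ids(select(records, ["sex", "age"])))    # [1, 2, 3, 4]
print(ids(require(records, ["sex", "age"])))   # [1, 4]
print(ids(anyof(records, ["sex", "age"])))     # [1, 2, 4]
print(ids(filter_values(records, "sex", {"F"})))  # [1, 4]
```

Participant 3 has no data for either variable, so it survives only `select`; participant 2 has `sex` but not `age`, so it is dropped by `require` but kept by `anyof`.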

### Build a query with a categorical variable
Let's practice building a query by filtering on variables. First, let's select a categorical variable to use. We can identify one using the `is_categorical` column of the variable dataframe.

In [None]:
i = 0
categories = []
while len(categories) == 0:
    categorical_var_info = my_variables_df[my_variables_df.is_categorical == True].iloc[i]
    categorical_var = categorical_var_info.HPDS_PATH
    categories = list(categorical_var_info.values[0].values())
    i += 1
    
filter_category = categories[0]
print("We will use the PIC-SURE variable", categorical_var, "which is", categorical_var_info.description)
print("\nHere are the categories associated with the variable:", categories)
print("\nWe will filter to participants with the value '", filter_category, "' for the variable.")

We now have the PIC-SURE variable and the value to apply to the filter saved. We can use the `filter()` method to add this information to our query. 

In [None]:
authQuery.filter().add(categorical_var, filter_category)

Note that although we are filtering by only one value here, you can filter by multiple values by passing a list of values to `filter().add()`.

Now we can export our filtered data to a pandas dataframe in this notebook using `getResultsDataFrame()`.

In [None]:
results = authQuery.getResultsDataFrame(low_memory = False)
results.head()

In the data dictionary dataframe shown previously, each row represented a single concept path or variable. In the query dataframe, the concept paths are added as columns with each row representing a participant with data that matches your query. 

The dataframe above should contain some automatically exported concept paths, such as `Patient ID`, `Parent Study Accession with Subject ID`, `Topmed Study Accession with Subject ID`, and `consents`, and the concept path we added to our query (`categorical_var`). Additionally, all participants should have the value we used to filter for our added concept path.

We can see how this query filtering worked by comparing the resulting dataframe to the full unfiltered data for this variable. Let's build a query that retrieves the data from all participants that have data for the categorical variable of interest using `require()`.

In [None]:
authQuery = authPicSure.query() # Initialize a new query
authQuery.require().add(categorical_var) # Use require() and the categorical_var
full_results = authQuery.getResultsDataFrame(low_memory = False) # Get results dataframe

In [None]:
# Visualize the results with pie charts
full_stats = full_results[categorical_var].value_counts()
plt.pie(full_stats, labels = full_stats.index)
plt.title("Before filtering variable\n"+categorical_var)
plt.show()
stats = results[categorical_var].value_counts()
plt.pie(stats, labels = stats.index)
plt.title("After filtering variable\n"+categorical_var)
plt.show()

### Build a query with a continuous variable
Similarly, we can create a query using a continuous variable. Instead of using the `is_continuous` column, let's make use of the `logical_max` and `logical_min` columns to determine the range of values that can be selected for our query.

# Data dictionary that Tom is creating will have a reliable min and a reliable max that does indeed match the data

# Code below is currently broken, will need to update to exclude logical_max and logical_min

In [None]:
continuous_var_info = my_variables_df[~pd.isna(my_variables_df.logical_max)].iloc[0] # Filter to the first continuous variable
continuous_var = continuous_var_info.HPDS_PATH
print("We will use the PIC-SURE variable", continuous_var, "which is", continuous_var_info.description)
value_range = [float(continuous_var_info.logical_min), float(continuous_var_info.logical_max)]
print("\nHere is the range of values that is associated with the variable:", value_range)
max_value = value_range[1]
min_value = value_range[1] - (value_range[1] - value_range[0])/2
print("\nWe will filter to participants with the values between", min_value, "and", max_value, "for the variable.")

# Show a varInfo on this variable to provide more information

Again, we can use the `filter()` method to add the continuous variable to the query. Once added, we can retrieve our results dataframe.

In [None]:
authQuery = authPicSure.query() # Start a new query
authQuery.filter().add(continuous_var, min_value, max_value)

In [None]:
results = authQuery.getResultsDataFrame(low_memory = False)
results.head()

The above dataframe should contain the variable we selected to add to the query. Additionally, the values listed as part of that variable should be within the range of values we defined.  
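
As a small worked example of the range logic used above, the sketch below computes the upper half of a range and lists any values that fall outside it. The function names and all numbers are hypothetical, and treating both bounds as inclusive is an assumption of this sketch rather than documented PIC-SURE behavior.

```python
def upper_half(min_value, max_value):
    """Return the (lower, upper) bounds covering the top half of a range."""
    midpoint = min_value + (max_value - min_value) / 2
    return midpoint, max_value


def values_outside(values, lower, upper):
    """List the non-missing values that fall outside [lower, upper]."""
    return [v for v in values if v is not None and not (lower <= v <= upper)]


lower, upper = upper_half(20.0, 80.0)      # hypothetical logical_min / logical_max
sample_values = [50.0, 63.5, 80.0, 49.9]   # made-up values
print(lower, upper)
print(values_outside(sample_values, lower, upper))
```

With the notebook's own objects, a similar check could be run as `values_outside(results[continuous_var].dropna().tolist(), min_value, max_value)`; an empty list would mean every returned value lies within the requested range.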

# Link back to data dictionary
# Filtered range vs. require (no filter) histograms --> is there a way to overlay these histograms?

### Build a query with multiple variables
You can also add multiple variables to a single query. Let's build a query with the first five variables for the study of interest.

In [None]:
query_vars = my_variables.listPaths()[0:5]
print("We will add the following variables to the query:", query_vars)

We can use a different method, `anyof()`, to add variables to the query. This will filter to participants that have data for at least one of the variables added.  

In [None]:
authQuery = authPicSure.query() # Start a new query
authQuery.anyof().add(query_vars)

In [None]:
results = authQuery.getResultsDataFrame(low_memory = False)
results.head()

### Selecting consent groups

PIC-SURE will limit results based on which study and consent groups you have been individually authorized to access. In some cases, such as instances where you can access multiple studies and/or consent groups, you may need to limit your results further to only a subset of the groups you have been authorized to access.

Let's see which studies and consent groups you are authorized to access using the `show()` method of the query.

# Need to add this section
the way to do this is delete the _consents and add back with new consent group

In [None]:
authQuery = authPicSure.query()
authQuery.show()

The `\\_consents\\` section of the output shown above lists all of the phs accession numbers and consent codes that you are authorized to access. 

To query on specific consent groups in this list, you must first clear the list of values within the `\\_consents\\` section and then manually replace them. Let's practice this by copying and pasting a phs accession number and consent code, deleting the `\\_consents\\` field, and adding it back with our selected consent code.

*Note that trying to manually add a consent group that you are not authorized to access will result in errors downstream.*

In [None]:
consent_group_filter = "phs000964.c1"  # "<<<Paste consent group here>>>"
authQuery.filter().delete("\\_consents\\")
authQuery.filter().add("\\_consents\\", consent_group_filter)

Now your query is set to select only variables and participants from the phs accession and consent code you selected. From here, you can build out your query as shown above.
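
Because adding an unauthorized consent group leads to errors later, it can help to check the value before modifying the query. The sketch below is one possible guard, assuming consent groups are written as a phs accession number and a consent code joined by a period, as in `phs000964.c1`. The function names and the `authorized` list are hypothetical; in practice you would copy the authorized values from the `\\_consents\\` section of the `show()` output.

```python
def split_consent_group(consent_group):
    """Split 'phs000964.c1' into ('phs000964', 'c1').

    Hypothetical helper; assumes the '<phs accession>.<consent code>' layout.
    """
    phs, sep, consent_code = consent_group.partition(".")
    if not sep or not phs.startswith("phs") or not consent_code:
        raise ValueError(f"Unexpected consent group format: {consent_group!r}")
    return phs, consent_code


def check_authorized(consent_group, authorized):
    """Return the consent group if it is in `authorized`, else raise."""
    split_consent_group(consent_group)  # Validate the format first
    if consent_group not in authorized:
        raise PermissionError(f"{consent_group} is not in your authorized consent groups.")
    return consent_group


# Made-up list standing in for the values shown under \\_consents\\
authorized = ["phs000964.c1", "phs000964.c2"]
print(split_consent_group("phs000964.c1"))
print(check_authorized("phs000964.c1", authorized))
```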

### Retrieving data from a query built in the PIC-SURE user interface (UI)

You are able to retrieve the results of a query that you have previously built using the [PIC-SURE UI](https://picsure.biodatacatalyst.nhlbi.nih.gov/psamaui/). After you have built your query and filtered to your cohort of interest, click the "Select data for export" button in the **What is this box called?!?!**. 

# Need to finish this section when the UI is complete