# Introduction to the PIC-SURE API
This tutorial notebook is intended to get users up and running quickly with the PIC-SURE API.

## PIC-SURE python API
### What is PIC-SURE?
As part of the *NHLBI BioData Catalyst® (BDC)* ecosystem, the Patient Information Commons: Standard Unification of Research Elements (PIC-SURE) platform has been integrating clinical and genomic datasets from multiple TOPMed and TOPMed-related studies funded by the National Heart, Lung, and Blood Institute (NHLBI). 

The original data exposed through the PIC-SURE API are organized in many heterogeneous ways. PIC-SURE hides this complexity and exposes the different study datasets in a single tabular format. By simplifying data extraction, it lets investigators focus on downstream analysis and facilitates reproducible science.

### More about PIC-SURE
The API is available in two different programming languages, python and R, enabling investigators to query the database the same way using either language.

PIC-SURE is a larger project of which the R and python PIC-SURE APIs are only a small part. Among other things, PIC-SURE also offers a graphical user interface that allows researchers to explore variables across multiple studies, filter participants that match criteria, and create cohorts from this interactive exploration.

The python API is actively developed by the Avillach Lab at Harvard Medical School. 

PIC-SURE API GitHub repositories:
* https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-python-adapter-hpds
* https://github.com/hms-dbmi/pic-sure-python-client

 ------- 

## Getting your user-specific security token

**Before running this notebook, please be sure to review the "Get your security token" documentation, which exists in the [`README.md` file](../README.md). It explains how to get a security token, which is mandatory to use the PIC-SURE API.**

To set up your token file, be sure to run the [`Workspace_setup.ipynb` file](./Workspace_setup.ipynb).

## Environment set-up

### Pre-requisites
* python 3.6 or later
* pip python package manager, already available in most systems with a python interpreter installed
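
The version requirement can be checked from within the notebook before installing anything. The following is a minimal sketch; the helper name `meets_minimum_python` is illustrative and not part of PIC-SURE.

```python
import sys

MINIMUM_PYTHON = (3, 6)  # Minimum version stated in the pre-requisites


def meets_minimum_python(version_info=None, minimum=MINIMUM_PYTHON):
    """Return True if the given version tuple is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only the (major, minor) components
    return tuple(version_info[:2]) >= tuple(minimum)


if not meets_minimum_python():
    print("Python 3.6 or later is required; found", sys.version.split()[0])
```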

### Install packages
The first step to using the PIC-SURE API is to install the packages needed. The following code installs the PIC-SURE API components from GitHub, specifically:
* PIC-SURE Client
* PIC-SURE Adapter
* *BDC-PIC-SURE* Adapter

**Note that if you are using the dedicated PIC-SURE environment within the *BDC Powered by Seven Bridges* platform, the necessary packages have already been installed.**

In [None]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
# BDC Powered by Terra users uncomment the following line to specify package install location
# sys.path.insert(0, r"/home/jupyter/.local/lib/python3.7/site-packages")

In [None]:
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git

In [None]:
import PicSureClient
import PicSureBdcAdapter

## Connecting to a PIC-SURE resource

The following is required to get access to the PIC-SURE API:
* a network URL
* a user-specific security token

The following code specifies the network URL as the *BDC Powered by PIC-SURE* URL and references the user-specific token saved as `token.txt`.

If you have not already retrieved your user-specific token, please refer to the "Get your security token" section of the `README.md` file and the `Workspace_setup.ipynb` file.

In [None]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
token_file = "token.txt"

with open(token_file, "r") as f:
    my_token = f.read()
bdc = PicSureBdcAdapter.Adapter(PICSURE_network_URL, my_token)
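
A token file saved from a text editor often ends with a newline, which `f.read()` keeps as part of the string. Whether the adapter tolerates surrounding whitespace is not stated here, so stripping it is a defensive choice rather than a documented requirement. The sketch below uses a hypothetical helper, `read_token`, and a throwaway file with a placeholder token.

```python
import tempfile
from pathlib import Path


def read_token(token_path):
    """Read a security token from a text file, dropping surrounding whitespace."""
    token = Path(token_path).read_text().strip()
    if not token:
        raise ValueError(f"No token found in {token_path}")
    return token


# Demonstration with a throwaway file containing a placeholder token
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "token.txt"
    demo_file.write_text("  placeholder-token\n")
    demo_token = read_token(demo_file)
```

In this notebook, `my_token = read_token(token_file)` could stand in for the `with open(...)` block.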

## Getting help with the PIC-SURE API
Each object in the PIC-SURE API has a `help()` method that you can use to get more information about its functionality.

In [None]:
bdc.help()

For example, the above output lists and briefly defines the four methods that can be used with the `bdc` resource. 

## Using the PIC-SURE variables dictionary
For the rest of this example notebook, we will use one of the publicly available datasets on PIC-SURE. This dataset is the "Framingham Heart Study: Dataset for Teaching Purposes", which has a study accession of `tutorial-biolincc_framingham` in PIC-SURE. To find the variables related to this dataset, we will use the `dictionary()` method to set up the data dictionary and then search it for a term of interest, in this case `tutorial`.

In [None]:
search_term = "tutorial"

In [None]:
dictionary = bdc.useDictionary().dictionary() # Set up the data dictionary object
my_variables = dictionary.find(search_term) # Search for "tutorial"

In [None]:
my_variables.count() # How many variables did the search return?

We can view these variables in a detailed dataframe format using the `dataframe()` method.

In [None]:
my_variables_df = my_variables.dataframe() # Save the results as a dataframe
my_variables_df.head(5) # View the first 5 rows 

PIC-SURE integrates clinical and genomic datasets across *BDC*, including TOPMed and TOPMed-related studies, COVID-19 studies, and BioLINCC studies. Each variable is organized as a concept path that contains information about the study, variable group, and variable. Though the specifics of the concept paths are dependent on the type of study, the overall information included is the same. 

Data Organization in PIC-SURE
---------------------------------------
| Data organization | TOPMed & TOPMed-related studies | BioLINCC & COVID-19 studies (including public data) |
|-------------------|---------------------------------|-----------------------------|
| General organization | Data organized using the format implemented by the database of Genotypes and Phenotypes (dbGaP). Generally, a given study will have several tables, and those tables have several variables. | Data do not follow dbGaP format; there are no phv or pht accessions. Data are organized in groups of like variables, when available. For example, variables like Age, Gender, and Race could be part of the Demographics variable group. |
| Concept path structure | \phs\pht\phv\variable name\ | \phs\variable name\ |
| Variable ID | phv corresponding to the variable accession number | Equivalent to variable name | 
| Variable name | Encoded variable name that was used by the original submitters of the data | Encoded variable name that was used by the original submitters of the data |
| Variable description | Description of the variable | Description of the variable, as available |
| Dataset ID | pht corresponding to the trait table accession number | Equivalent to Dataset name | 
| Dataset name | Name of the trait table | Name of a group of like variables, as available | 
| Dataset description | Description of the trait table | Description of a group of like variables, as available |
| Study ID | phs corresponding to the study accession number | phs corresponding to the study accession number |
| Study description | Description of the study from dbGaP | Description of the study from dbGaP |

*Note: The concept paths in PIC-SURE are used for querying. A concept path is called `HPDS_PATH` in the data dictionary shown above.*
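
Because a concept path is a backslash-delimited string, its components can be recovered with ordinary string handling. The sketch below is illustrative only: `split_concept_path` is a hypothetical helper, and the dbGaP-style accession numbers are placeholders rather than real identifiers.

```python
def split_concept_path(concept_path):
    """Split a backslash-delimited concept path into its non-empty parts."""
    return [part for part in concept_path.split("\\") if part]


# Path from the Framingham tutorial dataset (BioLINCC-style: study, variable)
tutorial_parts = split_concept_path("\\tutorial-biolincc_framingham\\AGE\\")

# Placeholder dbGaP-style path (study, table, variable accession, variable name)
dbgap_parts = split_concept_path("\\phs000000\\pht000000\\phv00000000\\EXAMPLE_VAR\\")

study_id = tutorial_parts[0]        # "tutorial-biolincc_framingham"
variable_name = tutorial_parts[-1]  # "AGE"
```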

The `listPaths()` function can be helpful to retrieve the concept paths for specific variables, which are used for building cohorts and selecting participant-level data.

In [None]:
my_variables.listPaths()[0:10] # What are the first ten concept paths in my search results?

We can also view additional information for individual variables using the `varInfo()` method. 

In [None]:
first_var = my_variables.listPaths()[0] # Save the concept path for the first variable
my_variables.varInfo(first_var) # Show information for that first concept path

Now try searching for a term on your own. The sample code below searches for the term `sex`; to practice searching the data dictionary, change "sex" to a term you are interested in. Uncommenting the `displayResults()` line shows the results in a convenient dataframe format. Note that the results include variables from all studies you have access to.

In [None]:
my_search = dictionary.find("sex") # Change sex to be your term of interest
#my_search.displayResults() # Show the variables that match my search result

## Using PIC-SURE to build a query and retrieve data
You can also use the PIC-SURE API to build a query and retrieve data. With this functionality, you can filter based on specific variables, add others, and export the data as a dataframe into this notebook. 

The first step is to set up the query object.

In [None]:
authPicSure = bdc.useAuthPicSure()
query_example = authPicSure.query()

There are several methods that can be used to build a query, which are listed below.

| Method | Arguments / Input | Output|
|--------|-------------------|-------|
| query.select().add() | variable names (string) or list of strings | all variables included in the list (no record subsetting)|
| query.require().add() | variable names (string) or list of strings | all variables; only records that do not contain null values for input variables |
| query.anyof().add() | variable names (string) or list of strings | all variables; only records that contain at least one non-null value for input variables |
| query.filter().add() | variable name and additional filtering values | input variable; only records that match filter criteria |

As an example query, let's use the Framingham tutorial dataset to investigate the prevalence of hypertension and distribution of age of current smokers with body mass index greater than 20.

In [None]:
# Ensure that only Framingham tutorial variables are shown in the data dictionary, which can vary based on individual access
phs_number = "tutorial-biolincc_framingham"
tutorial_df = my_variables_df[my_variables_df.studyId == phs_number]

### Build a query with a categorical variable - Current smoker
Let's practice building a query by filtering on variables. Based on the search for the Framingham tutorial dataset variables, we can save the concept path of the "Current cigarette smoking at exam" variable, which is a categorical variable. 

In [None]:
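# Use .item() to extract the concept path as a single string rather than a pandas Series
smoke_variable_path = tutorial_df.loc[tutorial_df.description == "Current cigarette smoking at exam", "HPDS_PATH"].item()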
smoke_variable_path
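
Looking up one concept path by its description relies on exactly one dictionary row matching. The sketch below uses a small mock dictionary (the rows are illustrative stand-ins for real search results; the column names follow those used in this notebook) to show the difference between a boolean-mask selection, which returns a pandas Series, and `.item()`, which returns the single value and fails loudly when the match is not unique.

```python
import pandas as pd

# Mock stand-in for the data dictionary dataframe; rows are illustrative only
mock_df = pd.DataFrame({
    "columnmeta_name": ["AGE", "BMI"],
    "description": ["Age at exam (years)", "Body Mass Index"],
    "HPDS_PATH": [
        "\\tutorial-biolincc_framingham\\AGE\\",
        "\\tutorial-biolincc_framingham\\BMI\\",
    ],
})

# Boolean-mask selection returns a Series, even when only one row matches
as_series = mock_df.loc[mock_df.columnmeta_name == "BMI", "HPDS_PATH"]

# .item() returns the single value and raises ValueError unless exactly one row matches
as_string = as_series.item()

try:
    mock_df.loc[mock_df.columnmeta_name == "MISSING", "HPDS_PATH"].item()
    lookup_failed = False
except ValueError:
    lookup_failed = True
```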

We can take a look at the options of values to filter by using the `values` column of the data dictionary.

In [None]:
tutorial_df["values"][tutorial_df.description == "Current cigarette smoking at exam"]

Let's apply a filter on the "Current cigarette smoking at exam" variable to select only participants with the value "Current smoker". Note that although we are filtering by a single value here, you can filter by multiple values by passing a list to `filter().add()`.

In [None]:
query_example.filter().add(smoke_variable_path, "Current smoker")

### Build a query with a continuous variable - BMI

Let's practice building a query by filtering on a continuous variable, in this case, BMI. We can find the BMI concept path using a similar approach as above.

In [None]:
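# Use .item() to extract the concept path as a single string rather than a pandas Series
bmi_variable_path = tutorial_df.loc[tutorial_df.columnmeta_name == "BMI", "HPDS_PATH"].item()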
bmi_variable_path

We can look at the minimum and maximum values of the variable using the `min` and `max` columns of the data dictionary.

In [None]:
tutorial_df[["min", "max"]][tutorial_df.columnmeta_name == "BMI"]

Let's apply a filter on the "Body Mass Index, weight in kilograms/height meters squared" variable to select only participants with values greater than 20. Note that while in this example only a `min` is specified, a `max` can also be defined for the filter.

In [None]:
query_example.filter().add(bmi_variable_path, min=20)

### Adding variables to include in export - Age and Hypertension
In addition to adding filters, specific variables can be included in the export for analysis. Let's do this for the "Age at exam (years)" and "Hypertensive. Defined as the first exam treated for high blood pressure or second exam in which either Systolic is ≥ 140 mmHg or Diastolic ≥ 90mmHg" variables.

In [None]:
age_variable_path = tutorial_df.loc[tutorial_df.description == "Age at exam (years)", "HPDS_PATH"].item()
hyperten_variable_path = tutorial_df.loc[tutorial_df.columnmeta_name == "HYPERTEN", "HPDS_PATH"].item()

Let's add these variables to our query. To do this, we can use either `select()`, `require()`, or `anyof()`. 

`select()` will add these variables for all participants we have filtered thus far, regardless of whether they have a value for the variables or not.

`anyof()` will add these variables for all participants that have at least one non-null value for the variables added.

`require()` will add these variables for all participants that have only non-null values for all variables added.


For this, let's use `require()` to only select participants that have information for both of these variables. 

In [None]:
query_example.require().add([age_variable_path, hyperten_variable_path])
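
To make the difference between `select()`, `anyof()`, and `require()` concrete, the following sketch models the three behaviours on a handful of made-up records. It is a plain-Python illustration of the descriptions above, not the PIC-SURE implementation; the function names and the data are hypothetical.

```python
# Toy participant records; None marks a missing value (all data are made up)
participants = [
    {"id": 1, "AGE": 52, "HYPERTEN": "Yes"},
    {"id": 2, "AGE": 47, "HYPERTEN": None},
    {"id": 3, "AGE": None, "HYPERTEN": None},
]
variables = ["AGE", "HYPERTEN"]


def select_ids(records, names):
    """select(): keep every record, whether or not the variables have values."""
    return [r["id"] for r in records]


def anyof_ids(records, names):
    """anyof(): keep records with at least one non-null value among the variables."""
    return [r["id"] for r in records if any(r[n] is not None for n in names)]


def require_ids(records, names):
    """require(): keep records with non-null values for all of the variables."""
    return [r["id"] for r in records if all(r[n] is not None for n in names)]


selected = select_ids(participants, variables)   # [1, 2, 3]
any_of = anyof_ids(participants, variables)      # [1, 2]
required = require_ids(participants, variables)  # [1]
```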

### Exporting participant-level data from the query
The query has been constructed and can now be exported for analysis. 

In the data dictionary dataframe shown previously, each row represented a single concept path or variable. In the query dataframe, the concept paths are added as columns with each row representing a participant with data that matches your query. 

The dataframe below should contain some automatically exported concept paths, such as `patient_id`, `Parent Study Accession with Subject ID`, `Topmed Study Accession with Subject ID`, and `consents`, as well as the concept paths we added to our query.

In [None]:
example_results = query_example.getResultsDataFrame(low_memory=False)
example_results.head()

As you can see in the results, there are some instances where study participants may have more than one value for a given variable. For example, this may be the case when a study participant answers questionnaires at multiple visits.

In the PIC-SURE output, this is shown as values being separated by a tab or `\t` value. These multiple values will need to be accounted for depending on the planned analysis.

With this example, averages of the age and BMI values could be calculated and a new variable "ever smoker" could be created based on whether "current smoker" was ever answered for the CURSMOKE variable. The code below shows this example of how to handle these values.

*Note: Approaches to handling multiple values will differ based on the planned analysis.*

In [None]:
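# Select columns of interest and rename them
columns_of_interest = ["\\tutorial-biolincc_framingham\\AGE\\", "\\tutorial-biolincc_framingham\\BMI\\", "\\tutorial-biolincc_framingham\\CURSMOKE\\", "\\tutorial-biolincc_framingham\\HYPERTEN\\"]
# Copy the selection so that later column assignments do not trigger SettingWithCopyWarning
clean_results = example_results[columns_of_interest].copy()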
clean_results.columns = ["AGE", "BMI", "CURSMOKE", "HYPERTEN"]

In [None]:
import statistics

# Function that splits the values and calculates the mean
def mean_multiple_values(df_values):
    sep_values = str(df_values).split("\t")
    mean_val = statistics.mean([float(i) for i in sep_values])
    return mean_val

# Apply the function to calculate means to the AGE and BMI variables
clean_results.loc[:, "mean_age"] = clean_results.loc[:, "AGE"].apply(mean_multiple_values)
clean_results.loc[:, "mean_bmi"] = clean_results.loc[:, "BMI"].apply(mean_multiple_values)
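
The function above assumes every tab-separated piece parses as a number. If a column might contain empty or non-numeric entries (an assumption about your own data, not something observed in the tutorial dataset), a more tolerant variant could skip those pieces and return NaN when nothing usable remains. The name `mean_of_tab_separated` is hypothetical.

```python
import math
import statistics


def mean_of_tab_separated(raw_value):
    """Average a tab-separated string of numbers; return NaN if none can be parsed."""
    numbers = []
    for piece in str(raw_value).split("\t"):
        try:
            number = float(piece)
        except ValueError:
            continue  # Skip empty or non-numeric pieces
        if not math.isnan(number):
            numbers.append(number)
    return statistics.mean(numbers) if numbers else math.nan


single = mean_of_tab_separated("25.5")
repeated = mean_of_tab_separated("39\t52\t58")
missing = mean_of_tab_separated(float("nan"))
```

It can be passed to `.apply()` in the same way as `mean_multiple_values`.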

In [None]:
# Function that flags participants as "Smoker" if any of their answers is "Current smoker"
def ever_smoker(smoke_vals):
    sep_smoke_vals = smoke_vals.split("\t")
    if "Current smoker" in sep_smoke_vals:
        return "Smoker"
    else:
        return "Non-smoker"
    
# Apply the function to identify smokers to the CURSMOKE variable
clean_results.loc[:, "ever_smoker"] = clean_results.loc[:, "CURSMOKE"].apply(ever_smoker)

In [None]:
# Take a look at the new results
clean_results[["mean_age", "mean_bmi", "ever_smoker","HYPERTEN"]]
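
With the cleaned columns in hand, the question posed at the start of this example (prevalence of hypertension and distribution of age) comes down to a few summary statistics. The sketch below works on made-up rows in the same shape as the cleaned results. The `HYPERTEN` labels are hypothetical, and each participant is assumed to carry a single label.

```python
import statistics
from collections import Counter

# Made-up cleaned rows; the HYPERTEN labels are hypothetical
rows = [
    {"mean_age": 48.0, "HYPERTEN": "Hypertensive"},
    {"mean_age": 55.5, "HYPERTEN": "Not hypertensive"},
    {"mean_age": 61.0, "HYPERTEN": "Hypertensive"},
    {"mean_age": 43.5, "HYPERTEN": "Hypertensive"},
]

# Prevalence: share of participants carrying each HYPERTEN label
label_counts = Counter(row["HYPERTEN"] for row in rows)
prevalence = {label: count / len(rows) for label, count in label_counts.items()}

# Age distribution: simple summary statistics of the per-participant mean age
ages = [row["mean_age"] for row in rows]
age_summary = {
    "min": min(ages),
    "median": statistics.median(ages),
    "max": max(ages),
}
```

On the real dataframe, `clean_results["HYPERTEN"].value_counts(normalize=True)` and `clean_results["mean_age"].describe()` give the equivalent pandas summaries.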