# Retrieve Variable-Level Data From PIC-SURE API

BDC Developer login credentials, or access to all data in PIC-SURE Authorized Access, are required to ensure that all variables can be reviewed.

Two components are needed to export data into an analysis workspace:
1. Personalized access token: a user-specific token that tells PIC-SURE which studies a user is authorized to access
2. Connection to the PIC-SURE API

Using these two components, the API can be used to export the selected data into the analysis workspace (in this case, where this Jupyter Notebook is being run). 

**Note: The project on Seven Bridges must be "Controlled" if you wish to copy files into the project.**

## Step 1: Getting your user-specific security token
**Before running this notebook, please be sure to acquire your personal security token, which is mandatory to use the PIC-SURE API.**

![get_token](imgs/get_token.png)

## Step 2: Setting up your connection to the PIC-SURE API

### Pre-requisites for the notebook
* Python 3.6 or later
* pip, the Python package manager, already available on most systems with a Python interpreter installed

### Install packages to connect to the PIC-SURE API
The first step to using the PIC-SURE API is to install the packages needed. The following code installs the PIC-SURE API components from GitHub, specifically:
* PIC-SURE Client
* PIC-SURE Adapter
* *BDC-PIC-SURE* Adapter

**Note that if you are using the dedicated PIC-SURE environment within the *BDC Powered by Seven Bridges* platform, the necessary packages have already been installed.**


Be sure to save your user-specific token as token.txt prior to running the code.

In [1]:
import pandas as pd
import sys
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git@new-search

import PicSureBdcAdapter

Collecting git+https://github.com/hms-dbmi/pic-sure-python-client.git
  Cloning https://github.com/hms-dbmi/pic-sure-python-client.git to /tmp/pip-req-build-7jg1f4qs
  Running command git clone --filter=blob:none --quiet https://github.com/hms-dbmi/pic-sure-python-client.git /tmp/pip-req-build-7jg1f4qs
  Resolved https://github.com/hms-dbmi/pic-sure-python-client.git to commit b1c5419290fd1d7ecc1494698caae5436fb6a2e8
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: PicSureClient
  Building wheel for PicSureClient (setup.py) ... done
  Created wheel for PicSureClient: filename=PicSureClient-0.1.0-py2.py3-none-any.whl size=10646 sha256=2e0db37b163ac7bd749a4369c072d342146b49bd52214d9f73b53ce2e9219f58
  Stored in directory: /tmp/pip-ephem-wheel-cache-10uuwduf/wheels/90/65/c4/e74447484bdae71b64f3f0a500bc7b3d9d6ee7edc62ade6667
Successfully built PicSureClient
Installing collected packages: PicSureClient
  Attempting uninstall: PicSureCl
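
Because the dedicated PIC-SURE environment already ships with these packages, it can be worth checking whether the adapter is importable before forcing a reinstall. The following is a minimal sketch, assuming the importable module name is the one used in the `import PicSureBdcAdapter` statement above; `missing_modules` is a hypothetical helper, not part of any PIC-SURE package.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# PicSureBdcAdapter is the module imported in the cell above.
needed = ["PicSureBdcAdapter"]
to_install = missing_modules(needed)

if to_install:
    print("Run the pip install commands above; missing:", to_install)
else:
    print("PIC-SURE adapter found; the install step can be skipped.")
```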

In [3]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
#PICSURE_network_URL = "https://biodatacatalyst.integration.hms.harvard.edu/picsure"
token_file = "token.txt"

with open(token_file, "r") as f:
    my_token = f.read().strip()  # Drop any trailing newline or surrounding whitespace

bdc = PicSureBdcAdapter.Adapter(PICSURE_network_URL, my_token)

+--------------------------------------+------------------------------------------------------+
|  Resource UUID                       |  Resource Name                                       |
+--------------------------------------+------------------------------------------------------+
| 70c837be-5ffc-11eb-ae93-0242ac130002 | open-hpds                                            |
| ca0ad4a9-130a-3a8a-ae00-e35b07f1108b | visualization                                        |
| 02e23f52-f354-4e8b-992c-d37c8b9ba140 | auth-hpds                                            |
| 36363664-6231-6134-2d38-6538652d3131 | dictionary                                           |
+--------------------------------------+------------------------------------------------------+
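
A token file that is missing, empty, or saved with stray whitespace is an easy mistake to make at this step. The sketch below shows one defensive way to load the token before creating the adapter; `read_token` is a hypothetical helper introduced here for illustration, and the default file name follows the `token.txt` convention used above.

```python
from pathlib import Path


def read_token(path="token.txt"):
    """Read a token file and return its contents without surrounding whitespace."""
    token_path = Path(path)
    if not token_path.is_file():
        raise FileNotFoundError(f"Token file not found: {token_path}")
    token = token_path.read_text().strip()
    if not token:
        raise ValueError(f"Token file is empty: {token_path}")
    return token
```

With this helper, the connection cell could use `my_token = read_token(token_file)` in place of the `with open(...)` block.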


## Retrieve Data

Save all variables of interest in PIC-SURE Authorized Access to a pandas DataFrame.

In [4]:
dictionary = bdc.useDictionary().dictionary() # Set up the dictionary
all_vars = dictionary.find('phs002385') # If restricting to a specific study, fill in with phs number of interest
all_variables = all_vars.dataframe() # Retrieve all variables you have access to

In [5]:
all_variables

Unnamed: 0,values,studyId,dtId,varId,is_categorical,is_continuous,columnmeta_is_stigmatized,columnmeta_name,description,columnmeta_min,...,columnmeta_study_id,is_stigmatized,derived_var_name,derived_study_abv_name,derived_study_description,columnmeta_var_group_id,derived_group_name,columnmeta_HPDS_PATH,min,max
0,"[ATG, Alemtuzumab, None, Not Reported]",phs002385,All Variables,atgf,True,False,false,atgf,ATG/Alemtuzumab given as conditioning regimen/...,,...,phs002385,false,atgf,HCT_for_SCD,Hematopoietic Cell Transplant for Sickle Cell ...,All Variables,,\phs002385\ATGF\,,
1,"[No, Not reported, Yes]",phs002385,All Variables,ACSPR,True,False,false,ACSPR,Acute chest syndrome (ACS) pre-conditioning,,...,phs002385,false,ACSPR,HCT_for_SCD,Hematopoietic Cell Transplant for Sickle Cell ...,All Variables,,\phs002385\ACSPR\,,
2,"[No, Not reported, Yes]",phs002385,All Variables,ACSPSHI,True,False,false,ACSPSHI,Acute chest syndrome post HCT,,...,phs002385,false,ACSPSHI,HCT_for_SCD,Hematopoietic Cell Transplant for Sickle Cell ...,All Variables,,\phs002385\ACSPSHI\,,
3,"[Acute GVHD, grade unknown, No, Not reported, ...",phs002385,All Variables,AGVHD,True,False,false,AGVHD,"Acute graft versus host disease, grades II-IV",,...,phs002385,false,AGVHD,HCT_for_SCD,Hematopoietic Cell Transplant for Sickle Cell ...,All Variables,,\phs002385\AGVHD\,,
4,"[No, Not reported, Yes]",phs002385,All Variables,ADIALYHI,True,False,false,ADIALYHI,Acute renal failure requiring dialysis,,...,phs002385,false,ADIALYHI,HCT_for_SCD,Hematopoietic Cell Transplant for Sickle Cell ...,All Variables,,\phs002385\ADIALYHI\,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
152,"[No, Yes]",phs002385,All Variables,VOD,True,False,false,VOD,VOD post-HCT,,...,phs002385,false,VOD,HCT_for_SCD,Hematopoietic Cell Transplant for Sickle Cell ...,All Variables,,\phs002385\VOD\,,
153,"[No, Not reported, Yes]",phs002385,All Variables,VOC2YPR,True,False,false,VOC2YPR,Vaso-occlusive cirsis requiring hospitalizatio...,,...,phs002385,false,VOC2YPR,HCT_for_SCD,Hematopoietic Cell Transplant for Sickle Cell ...,All Variables,,\phs002385\VOC2YPR\,,
154,"[No, Not reported, Yes]",phs002385,All Variables,VOCPSHI,True,False,false,VOCPSHI,Vaso-occlusive pain post HCT,,...,phs002385,false,VOCPSHI,HCT_for_SCD,Hematopoietic Cell Transplant for Sickle Cell ...,All Variables,,\phs002385\VOCPSHI\,,
155,[],phs002385,All Variables,yeartx,False,True,false,yeartx,Year of transplant,1991.0,...,phs002385,false,yeartx,HCT_for_SCD,Hematopoietic Cell Transplant for Sickle Cell ...,All Variables,,\phs002385\YEARTX\,1991.0,2019.0


In [7]:
all_variables['dtId']

0      All Variables
1      All Variables
2      All Variables
3      All Variables
4      All Variables
           ...      
152    All Variables
153    All Variables
154    All Variables
155    All Variables
156    All Variables
Name: dtId, Length: 157, dtype: object
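
Rather than inspecting the `dtId` column row by row, the variables can be counted per study table. The sketch below uses a small stand-in DataFrame with values copied from the output above, so it runs without a PIC-SURE connection; on real data, `sample` would be replaced by `all_variables`.

```python
import pandas as pd

# Toy stand-in for `all_variables`, with values copied from the output above.
sample = pd.DataFrame(
    {
        "studyId": ["phs002385", "phs002385", "phs002385"],
        "dtId": ["All Variables", "All Variables", "All Variables"],
        "varId": ["atgf", "ACSPR", "yeartx"],
    }
)

# Number of variables in each (study, table) pair.
variables_per_table = sample.groupby(["studyId", "dtId"]).size()
print(variables_per_table)
```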

## Descriptions of Fields

### Types of studies in PIC-SURE:
1. **dbGaP format compliant** - ingested by dbGaP in the dbGaP recommended format (https://www.ncbi.nlm.nih.gov/gap/docs/submissionguide/)
2. **dbGaP ingested, but not format compliant**
3. **Not ingested by dbGaP** - are not format compliant, and do not have a phs number

***NOTE:*** *Some of the included fields may not be relevant. Some are generated during the PIC-SURE data curation process and duplicate other fields listed here; others are stored specifically for internal use. These fields are identified below.*

**values** - An array of all unique values included for the variable

**studyId** - The ID associated with a study. For dbGaP-associated studies this is in the format phsxxxxxx. Non-dbGaP studies can be in other formats. This field is consistent with the DBGAP ACCESSION NUMBER in Gen3.

**dtId** - The ID associated with the table in which the variable is stored within the study. For studies in dbGaP format this is provided as phtxxxxxxx. For non-compliant studies this can instead be the name of the table/form, or "All Variables" if the variables were not grouped in a table/form.

**varId** - The ID associated with the variable. For studies in dbGaP format this is provided as phvxxxxxxxx. Non-compliant studies instead have a short text ID that can be a duplicate of the columnmeta_name field.
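
The three ID formats above can be used to guess whether a dictionary row comes from a dbGaP-format study. The sketch below is an illustration only: the digit counts are read off the `phsxxxxxx`, `phtxxxxxxx`, and `phvxxxxxxxx` placeholders in the text, which is an assumption about the exact format, and `looks_dbgap_formatted` is a hypothetical helper. The second test record uses made-up IDs.

```python
import re

# Digit counts are an assumption taken from the placeholder patterns in the text.
DBGAP_PATTERNS = {
    "studyId": re.compile(r"phs\d{6}"),
    "dtId": re.compile(r"pht\d{7}"),
    "varId": re.compile(r"phv\d{8}"),
}


def looks_dbgap_formatted(record):
    """Return True if studyId, dtId and varId all match the dbGaP-style patterns."""
    return all(
        pattern.fullmatch(str(record[field])) is not None
        for field, pattern in DBGAP_PATTERNS.items()
    )


# Row taken from the output above: the study has a phs number, but the table
# and variable IDs are plain text, so it is not in dbGaP format.
from_output = {"studyId": "phs002385", "dtId": "All Variables", "varId": "atgf"}

# Made-up IDs, used only to show a matching record.
made_up = {"studyId": "phs000001", "dtId": "pht0000001", "varId": "phv00000001"}

print(looks_dbgap_formatted(from_output), looks_dbgap_formatted(made_up))
```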

**is_categorical, is_continuous** - Boolean True/False values that describe whether a variable is filtered in the PIC-SURE UI as a range (continuous) or as individual selections (categorical).
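
These two flags make it straightforward to split the dictionary into categorical and continuous variables. The sketch below uses a small stand-in DataFrame built from two rows of the output above, and it assumes the flags are stored as real booleans, as the `True`/`False` display suggests.

```python
import pandas as pd

# Toy stand-in for `all_variables`, using two rows from the output above.
sample = pd.DataFrame(
    {
        "varId": ["VOD", "yeartx"],
        "values": [["No", "Yes"], []],
        "is_categorical": [True, False],
        "is_continuous": [False, True],
        "min": [None, 1991.0],
        "max": [None, 2019.0],
    }
)

categorical = sample[sample["is_categorical"]]
continuous = sample[sample["is_continuous"]]

# Categorical variables carry their allowed values; continuous ones carry a range.
for _, row in categorical.iterrows():
    print(row["varId"], "values:", row["values"])
for _, row in continuous.iterrows():
    print(row["varId"], "range:", row["min"], "to", row["max"])
```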

**columnmeta_is_stigmatized** - Boolean True/False value that determines whether a variable is shown in PIC-SURE Open Access. A value of True means that the variable is not shown in Open Access. Further description is available here: (https://github.com/hms-dbmi/biodata_catalyst_stigmatizing_variables/tree/main)
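
In the DataFrame output earlier in this notebook, this field is displayed as lowercase `false`, unlike the `True`/`False` shown for `is_categorical`. That suggests it may be stored as text rather than as a boolean, although this is an assumption based only on the display. The sketch below normalizes either representation before filtering; `to_bool` is a hypothetical helper, and the second row is made up to show a stigmatized variable.

```python
import pandas as pd


def to_bool(value):
    """Interpret 'true'/'false' strings or real booleans as a Python bool."""
    if value is None:
        return False
    if isinstance(value, str):
        return value.strip().lower() == "true"
    return bool(value)


# First row follows the output above; the second row is made up for illustration.
sample = pd.DataFrame(
    {
        "varId": ["atgf", "example_var"],
        "columnmeta_is_stigmatized": ["false", "true"],
    }
)

stigmatized = sample["columnmeta_is_stigmatized"].map(to_bool)
shown_in_open_access = sample[~stigmatized]
print(shown_in_open_access["varId"].tolist())
```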

**columnmeta_name** - A short text ID associated with a variable. These are often not human-readable as they are mostly derived from the column names in datasets. For non-compliant studies this can be a duplicate of the varId field.

**description** - A text field with a more human-readable description of the variable. When not provided by the study submitters, this field will be a duplicate of the columnmeta_name field.

**HPDS_PATH** - The concept path used to uniquely identify a variable when exported to users. Further description provided here: (https://docs.google.com/document/d/1tsxQS-c9A2GxqJm6piDfVzSTHPI1B0IFYH23MXwbNe8/edit?usp=sharing)
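
The concept paths shown in the DataFrame output earlier (for example `\phs002385\ATGF\`) are backslash-delimited. The sketch below splits such a path into its segments, assuming that delimiter holds for all paths; `split_concept_path` is a hypothetical helper. In the rows displayed above, the first segment is the study ID and the last is the variable ID in upper case, but other studies may use deeper paths.

```python
def split_concept_path(path):
    """Split a backslash-delimited concept path into its non-empty segments."""
    return [segment for segment in path.split("\\") if segment]


# Path copied from the output above (backslashes escaped for Python).
example_path = "\\phs002385\\ATGF\\"
segments = split_concept_path(example_path)
print(segments)
```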

**derived_group_id** - The table ID and version number, when applicable.

**columnmeta_var_group_description** - If provided by the study submitters, field contains a long text description of variable groupings. Variables are not always grouped together in studies.

**derived_variable_level_data** - An array of additional information that is study and variable specific. An example would be units of measurement. This is currently implemented in only a small number of studies; work is under way to increase that number.

**data_hierarchy** - A text field displaying a more human-readable path that is used in the PIC-SURE UI. This is currently implemented in only a small number of studies; work is under way to increase that number.

**columnmeta_data_type** - Text field containing "categorical" or "continuous", based on the is_categorical and is_continuous fields.

**derived_var_id** - Variable ID with version number, when applicable.

**derived_study_abv_name** - Short text abbreviation used to refer to a study and shown in the UI.

**derived_study_description** - Description of the study, consistent with **Full Name** field in **Gen3**.

**columnmeta_min, columnmeta_max** - Fields generated internally for use in the UI elements for specific studies.

**hashed_var_id** - Hashed variable ID for internal use.

columnmeta_hpds_path - duplicate of HPDS_PATH

columnmeta_var_id - duplicate of varId

derived_var_description - duplicate of description

derived_group_description - duplicate of columnmeta_var_group_description

columnmeta_description - duplicate of description

derived_study_id - duplicate of studyId

columnmeta_study_id - duplicate of studyId

is_stigmatized - duplicate of columnmeta_is_stigmatized

derived_var_name - duplicate of columnmeta_name

columnmeta_var_group_id - duplicate of dtId

columnmeta_HPDS_PATH - duplicate of HPDS_PATH

min, max - duplicates of columnmeta_min, columnmeta_max
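
The duplicate fields listed above can be dropped to make the dictionary DataFrame easier to read. The sketch below encodes that list as a mapping and removes a duplicate only when the field it duplicates is also present, so no information is lost if a column is missing. `drop_documented_duplicates` is a hypothetical helper, the stand-in DataFrame uses values from the output earlier in this notebook, and the spelling `columnmeta_max` is assumed.

```python
import pandas as pd

# Duplicate field -> field it duplicates, as listed above.
DOCUMENTED_DUPLICATES = {
    "columnmeta_hpds_path": "HPDS_PATH",
    "columnmeta_HPDS_PATH": "HPDS_PATH",
    "columnmeta_var_id": "varId",
    "derived_var_description": "description",
    "columnmeta_description": "description",
    "derived_group_description": "columnmeta_var_group_description",
    "derived_study_id": "studyId",
    "columnmeta_study_id": "studyId",
    "is_stigmatized": "columnmeta_is_stigmatized",
    "derived_var_name": "columnmeta_name",
    "columnmeta_var_group_id": "dtId",
    "min": "columnmeta_min",
    "max": "columnmeta_max",
}


def drop_documented_duplicates(df, duplicates=DOCUMENTED_DUPLICATES):
    """Drop duplicate columns whose original column is also present in df."""
    to_drop = [
        dup
        for dup, original in duplicates.items()
        if dup in df.columns and original in df.columns
    ]
    return df.drop(columns=to_drop)


# Toy stand-in for `all_variables`.
sample = pd.DataFrame(
    {
        "studyId": ["phs002385"],
        "columnmeta_study_id": ["phs002385"],
        "derived_study_id": ["phs002385"],
        "varId": ["atgf"],
        "columnmeta_HPDS_PATH": ["\\phs002385\\ATGF\\"],
    }
)

cleaned = drop_documented_duplicates(sample)
print(list(cleaned.columns))
```

Here `columnmeta_HPDS_PATH` is kept because the stand-in DataFrame has no `HPDS_PATH` column. If in doubt, compare a duplicate column against its original before dropping it.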
