# Retrieve Data Dictionary From PIC-SURE API

This notebook demonstrates how to use the PIC-SURE API to export the data dictionaries of studies available in BDC.

**Note: You can only retrieve the data dictionaries from studies you are authorized to access.**

## Getting your user-specific security token

**Before running this notebook, please be sure to review the "Get your security token" documentation, which exists in the [`README.md` file](../README.md). It explains how to get a security token, which is mandatory to use the PIC-SURE API.**

To set up your token file, be sure to run the [`Workspace_setup.ipynb` file](./Workspace_setup.ipynb).

## Environment set-up

### Pre-requisites
* Python 3.6 or later
* pip, the Python package manager, already available on most systems with a Python interpreter installed

### Install packages
The first step to using the PIC-SURE API is to install the packages needed. The following code installs the PIC-SURE API components from GitHub, specifically:
* PIC-SURE Client
* PIC-SURE Adapter
* *BDC-PIC-SURE* Adapter

**Note that if you are using the dedicated PIC-SURE environment within the *BDC Powered by Seven Bridges* platform, the necessary packages have already been installed.**

In [None]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
# BDC Powered by Terra users: uncomment the following line to specify the package install location
# sys.path.insert(0, r"/home/jupyter/.local/lib/python3.7/site-packages")

In [None]:
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git

In [None]:
import PicSureClient
import PicSureBdcAdapter

## Connecting to a PIC-SURE resource

The following is required to get access to the PIC-SURE API:
* a network URL
* a user-specific security token, which tells the PIC-SURE API which studies you are authorized to access

The following code specifies the network URL as the *BDC Powered by PIC-SURE* URL and references the user-specific token saved as `token.txt`.

If you have not already retrieved your user-specific token, please refer to the "Get your security token" section of the `README.md` file and the `Workspace_setup.ipynb` file.

In [None]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
token_file = "token.txt"

with open(token_file, "r") as f:
    my_token = f.read()
    
bdc = PicSureBdcAdapter.Adapter(PICSURE_network_URL, my_token)
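
Trailing whitespace in `token.txt`, or an empty file, may lead to confusing authentication errors. The sketch below shows one way to read the token more defensively before passing it to the adapter. `read_token` is a hypothetical helper written for this notebook; it is not part of the PIC-SURE packages.

```python
from pathlib import Path


def read_token(path):
    """Return the token stored at `path` with surrounding whitespace removed.

    Hypothetical helper for illustration; not part of the PIC-SURE packages.
    """
    token = Path(path).read_text().strip()
    if not token:
        # Fail early with a clear message instead of sending an empty token.
        raise ValueError(f"Token file {path!r} is empty")
    return token
```

With this helper, `my_token = read_token(token_file)` could stand in for the `with open(...)` block above.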

# Getting the Data Dictionary

The `dictionary` resource can be used to retrieve the data dictionary for a single study or for all studies in BDC-PIC-SURE. Note that you can only retrieve the data dictionaries for studies you are authorized to access. The output can then be saved as a pandas dataframe.

## Data dictionary for all studies you are authorized to access
To get the data dictionary for all studies you are authorized to access, perform a blank search, that is, a search with an empty string.

In [None]:
dictionary = bdc.useDictionary().dictionary() # Set up the dictionary
all_vars = dictionary.find("") # Perform a blank search
all_variables = all_vars.dataframe() # Retrieve all variables you have access to as a dataframe

In [None]:
all_variables.head()

## Descriptions of Fields
As you can see, there are several columns returned in the data dictionary.

Note that there are several types of studies in PIC-SURE:
1. **dbGaP format compliant** - ingested by dbGaP in the dbGaP recommended format (https://www.ncbi.nlm.nih.gov/gap/docs/submissionguide/)
2. **dbGaP ingested, but not format compliant**
3. **Not ingested by dbGaP** - are not format-compliant and do not have a study accession number (or phs number)

### PIC-SURE Data Dictionary Fields

* **values**: An array of all unique values included for the variable.
* **studyId**: ID associated with a study. For dbGaP-associated studies this is in the format phsxxxxxx. Non-dbGaP studies can be in other formats. The field is consistent with the DBGAP ACCESSION NUMBER in BDC Powered by Gen3.
* **dtId**: ID associated with the table the variable is stored in within the study. For studies in dbGaP format, this is provided as “phtXXXXXXX”. For non-compliant studies, this can instead be the name of the table or form, or "All Variables" if the variables were not grouped in a table or form.
* **varId**: ID associated with the variable. For studies in dbGaP format, this is provided as “phvXXXXXXXX”. Non-compliant studies instead have a short text ID, which can be a duplicate of the columnmeta_name field.
* **is_categorical**: boolean True/False values that describe whether a variable is filtered in PIC-SURE as a set of discrete values (categorical).
* **is_continuous**: boolean True/False values that describe whether a variable is filtered in PIC-SURE as a numerical range (continuous).
* **columnmeta_is_stigmatized**: boolean True/False value that determines whether a variable is shown in Open PIC-SURE. A value of True means that the variable is not shown in Open PIC-SURE. For further information about stigmatizing variables, please refer to this documentation: https://github.com/hms-dbmi/biodata_catalyst_stigmatizing_variables/tree/main
* **columnmeta_name**: A short text ID associated with a variable. These are often not human-readable as they are mostly derived from the column names in datasets. For non-compliant studies, this can be a duplicate of the varId field.
* **description**: A text field with a human-readable description of the variable. When not provided by the study submitters, this field will be a duplicate of the columnmeta_name field.
* **HPDS_PATH**: The concept path used to uniquely identify a variable when exported to users. For more information about concept paths and data organization, please refer to the Data Organization in BDC-PIC-SURE page. 
* **derived_group_id**: The table ID and version number, when applicable.
* **columnmeta_var_group_description**: If provided by the study submitters, this field contains a long text description of variable groupings. Variables are not always grouped together in studies.
* **derived_variable_level_data**: An array of additional information that is study- and variable-specific. An example would be units of measurement. This is only available for some of the studies.
* **data_hierarchy**: A text field displaying a human-readable path that is used in the PIC-SURE user interface. This is only available for some of the studies.
* **columnmeta_data_type**: Text field containing "categorical" or "continuous", based on the is_categorical and is_continuous fields.
* **derived_var_id**: Variable ID with version number, when applicable.
* **derived_study_abv_name**: Short text abbreviation used to refer to a study and shown in the PIC-SURE user interface.
* **derived_study_description**: Description of the study, consistent with the “Full Name” field in BDC Powered by Gen3.
* **columnmeta_min**: Field generated internally for use in the PIC-SURE user interface elements for specific studies. Describes the minimum associated with continuous variables. 
* **columnmeta_max**: Field generated internally for use in the PIC-SURE user interface elements for specific studies. Describes the maximum associated with continuous variables.
* **hashed_var_id**: Hashed variable ID for internal use.
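
The ID conventions described above can be used to separate dbGaP-style identifiers from free-text ones. The following is a minimal sketch, assuming that dbGaP-style IDs start with `phs`, `pht`, or `phv` followed by digits. It does not check the number of digits or any version suffix, and `looks_like_dbgap_id` is a hypothetical helper name.

```python
import re

# Assumption: dbGaP-style IDs begin with "phs", "pht", or "phv" and then digits.
# Digit count and version suffixes (if any) are deliberately not validated.
DBGAP_ID = re.compile(r"^ph[stv]\d+")


def looks_like_dbgap_id(value):
    """Return True if `value` starts like a dbGaP study, table, or variable ID."""
    return bool(DBGAP_ID.match(str(value)))
```

In this notebook, something like `all_variables[all_variables["studyId"].map(looks_like_dbgap_id)]` would then keep only the rows whose study ID follows the phs pattern.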


The following are fields that are duplicated data:
* **columnmeta_hpds_path**: duplicate of HPDS_PATH
* **columnmeta_var_id**: duplicate of varId
* **derived_var_description**: duplicate of description
* **derived_group_description**: duplicate of columnmeta_var_group_description
* **columnmeta_description**: duplicate of description
* **derived_study_id**: duplicate of studyId
* **columnmeta_study_id**: duplicate of studyId
* **is_stigmatized**: duplicate of columnmeta_is_stigmatized
* **derived_var_name**: duplicate of columnmeta_name
* **columnmeta_var_group_id**: duplicate of dtId
* **columnmeta_HPDS_PATH**: duplicate of HPDS_PATH
* **min, max**: duplicates of columnmeta_min, columnmeta_max
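
Because these fields repeat information found elsewhere, it can be convenient to hide them when browsing the dictionary. The sketch below works on a plain list of column names. `DUPLICATE_FIELDS` simply restates the list above, and `non_duplicate_columns` is a hypothetical helper, not part of the PIC-SURE API.

```python
# Field names listed above as duplicates of other fields.
DUPLICATE_FIELDS = (
    "columnmeta_hpds_path",
    "columnmeta_var_id",
    "derived_var_description",
    "derived_group_description",
    "columnmeta_description",
    "derived_study_id",
    "columnmeta_study_id",
    "is_stigmatized",
    "derived_var_name",
    "columnmeta_var_group_id",
    "columnmeta_HPDS_PATH",
    "min",
    "max",
)


def non_duplicate_columns(columns):
    """Return the names in `columns` that are not listed as duplicates, in order."""
    duplicates = set(DUPLICATE_FIELDS)
    return [name for name in columns if name not in duplicates]
```

Applied to the dataframe from earlier, `all_variables[non_duplicate_columns(all_variables.columns)]` would select only the remaining columns.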

## Data Dictionary for one study
To get the data dictionary for a single study, search for its study accession number (phs number).

In [None]:
# Pick one phs number that you are authorized to access
phs_number = all_variables.studyId[0] # Save the first phs number from the above output
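
The cell above keeps only the first study ID. To see every study present in the dictionary, along with how many variables each contributes, the `studyId` values can be tallied. This is a minimal sketch using the standard library; the study IDs in the toy input are made up, and `variables_per_study` is a hypothetical helper.

```python
from collections import Counter


def variables_per_study(study_ids):
    """Count how many dictionary rows (variables) belong to each study ID."""
    return Counter(study_ids)


# Toy input standing in for all_variables["studyId"]; these IDs are made up.
counts = variables_per_study(["phs000001", "phs000001", "phs000002"])
for study_id, n_variables in counts.most_common():
    print(study_id, n_variables)
```

`Counter` accepts any iterable, so passing `all_variables["studyId"]` should give the per-study counts for the studies you are authorized to access.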

In [None]:
study_vars = dictionary.find(phs_number) # Search for the selected phs number
study_variables = study_vars.dataframe() # Retrieve the matching variables as a dataframe

In [None]:
study_variables.head()

## Data Dictionary for a search term
Data dictionary entries for a specific search term can also be retrieved by searching for a phenotype of interest. 
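
A term search follows the same `find` and `dataframe` pattern used in the cells above. The sketch below wraps that pattern in a small function. `search_dictionary` is a hypothetical helper, and the term `"sex"` in the usage comment is only an illustrative choice.

```python
def search_dictionary(dictionary, term):
    """Search the data dictionary for `term` and return the matches as a dataframe.

    `dictionary` is expected to be the object created earlier in this notebook
    with bdc.useDictionary().dictionary().
    """
    return dictionary.find(term).dataframe()


# In this notebook (the search term is only an example):
# sex_variables = search_dictionary(dictionary, "sex")
# sex_variables.head()
```

As with the other searches, the results only include variables from studies you are authorized to access.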