# PIC-SURE API test notebook

This notebook tests ongoing issues with the PIC-SURE API. It has two parts: (1) environment set-up and (2) ongoing issues.

# Environment set-up

### Installation of external dependencies

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
!pip install -r requirements.txt

Collecting git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git (from -r requirements.txt (line 7))
  Cloning https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git to /private/var/folders/hm/wn0bpy0j7vl2q9gqnhhccpph0000gn/T/pip-req-build-qpjrw9d_
  Running command git clone -q https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git /private/var/folders/hm/wn0bpy0j7vl2q9gqnhhccpph0000gn/T/pip-req-build-qpjrw9d_
Collecting git+https://github.com/hms-dbmi/pic-sure-python-client.git (from -r requirements.txt (line 8))
  Cloning https://github.com/hms-dbmi/pic-sure-python-client.git to /private/var/folders/hm/wn0bpy0j7vl2q9gqnhhccpph0000gn/T/pip-req-build-j8gxfw5d
  Running command git clone -q https://github.com/hms-dbmi/pic-sure-python-client.git /private/var/folders/hm/wn0bpy0j7vl2q9gqnhhccpph0000gn/T/pip-req-build-j8gxfw5d
Building wheels for collected packages: PicSureHpdsLib, PicSureClient
  Building wheel for PicSureHpdsLib (setup.py) ... done
  Cr

In [3]:
!python -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git 
!python -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git

Collecting git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
  Cloning https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git to /private/var/folders/hm/wn0bpy0j7vl2q9gqnhhccpph0000gn/T/pip-req-build-kizwj4hi
  Running command git clone -q https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git /private/var/folders/hm/wn0bpy0j7vl2q9gqnhhccpph0000gn/T/pip-req-build-kizwj4hi
Collecting httplib2
  Using cached https://files.pythonhosted.org/packages/e3/cc/7e82cfdc417f28ae92a67493eb65a2ce8b7ced89c09d21e625556caa0f26/httplib2-0.16.0-py3-none-any.whl
Building wheels for collected packages: PicSureHpdsLib
  Building wheel for PicSureHpdsLib (setup.py) ... done
  Created wheel for PicSureHpdsLib: filename=PicSureHpdsLib-0.9.0-py2.py3-none-any.whl size=17668 sha256=c9dab710d2290c6f9fbc72a877b2ff2a96a9211b694dcf5732243c212b8fb732
  Stored in directory: /private/var/folders/hm/wn0bpy0j7vl2q9gqnhhccpph0000gn/T/pip-ephem-wheel-cache-gdcy20b2/wheels/6c/ac/12/4d1

In [5]:
import json
from pprint import pprint

import pandas as pd
import numpy as np 

import PicSureHpdsLib
import PicSureClient

from python_lib.utils import (
    get_multiIndex_variablesDict,
    get_dic_renaming_vars,
    match_dummies_to_varNames,
    joining_variablesDict_onCol,
)
from python_lib.HPDS_connection_manager import tokenManager

### Connecting to a PIC-SURE network

Testing environment: BioData Catalyst 

In [11]:
PICSURE_network_URL = "https://biodatacatalyst.integration.hms.harvard.edu/picsure"
resource_id = "02e23f52-f354-4e8b-992c-d37c8b9ba140"
token_file = "token.txt"

In [12]:
# Open read-only and strip surrounding whitespace (e.g. a trailing newline)
with open(token_file, "r") as f:
    token = f.read().strip()

In [13]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, token)
adapter = PicSureHpdsLib.Adapter(connection)
resource = adapter.useResource(resource_id)

### Retrieving variables dictionary from HPDS Database

NB: the dictionary methods work fine; the dictionary is retrieved here only because it is useful for getting variable names.

In [14]:
plain_variablesDict = resource.dictionary().find().DataFrame()
variablesDict = get_multiIndex_variablesDict(plain_variablesDict)

In [15]:
plain_variablesDict

KEY,min,categorical,observationCount,patientCount,max,HpdsDataType,categoryValues
"\Framingham Cohort\Tests\Pulmonary Function Tests\Pulmonary Function Diffusion Test, New Offspring Spouse and Omni 2, Exam 1. Spirometry and diffusing capacity were collected in accordance with contemporaneous American Thoracic Society standards for the measurement of lung function.\Study Date\",0.000,False,362,362,12.000,phenotypes,
"\Atherosclerosis Risk in Communities (ARIC) Cohort\Subject Phenotype\Cohort Exam\ECG\ECG Data\Exam 3\Sociodemography and Administration\Major1: [Complete/Intermittent LBBB, Complete/Intermittent RBBB, Complete/Intermittent RBBB w/ Left A, Nonspecific Intraventricular Block] [ECG data, exam 3]\",,True,489,489,,phenotypes,"[Complete/Intermittent LBBB, Complete/Intermit..."
"\Atherosclerosis Risk in Communities (ARIC) Cohort\Subject Phenotype\Cohort Exam\Retinal\Retinal Vessel Measurements\Exam 3\RIP\RIPX92 [Retinal Vessel Measurements, exam 3]\",0.566,False,10351,10351,1.218,phenotypes,
"\Atherosclerosis Risk in Communities (ARIC) Cohort\Subject Phenotype\Cohort Exam\Retinal\Retinal Vessel Measurements\Exam 3\Arteriole\Arteriole 15 [Retinal Vessel Measurements, exam 3]\",27.000,False,211,211,108.000,phenotypes,
\Cardiovascular Health Study (CHS) Cohort\BASE1\Psychological and Psychiatric Observations\Life Events\FELT EVERYTHING DONE WAS EFFORT\,,True,4753,4753,,phenotypes,"[A MODERATE AMOUNT OF TIME(3 TO 4 DAYS), MOST ..."
...,...,...,...,...,...,...,...
\Genome-wide Association Study of Adiposity in Samoans\Lifestyle and Environment\Smoking Status\Do you use smokeless tobacco?\,,True,3502,3502,,phenotypes,"[No, Smokeless_tobacco_use, Yes, missing value]"
"\The Jackson Heart Study (JHS)\Subject Phenotype\Renal Disease\Renal Disease Form, RDF. Version B\Visit 9\Q2. Have You Ever Been Told by a Health Care Provider That You Had\Q2d. Have you ever been told by a health care provider that you had: autoimmune disease, such as lupus? [Visit 9] [Renal Disease Form, RDF]\",,True,1599,1599,,phenotypes,"[Don't Know, No, Yes]"
\Atherosclerosis Risk in Communities (ARIC) Cohort\Subject Phenotype\Cohort Exam\Sleep Heart Health Study Form\EMG\Chin EMG signal quality (hours of signal most free from artifact)\,1.000,False,1820,1820,4.000,phenotypes,
\NHLBI Cleveland Family Study (CFS) Candidate Gene Association Resource (CARe)\CFS - Phenotype Data\FAMILY15_LAD_LONG_Data\Medical History\Medical History - Subjects\Sleep Associated Variables\Bedpartner's Observations\Talks in sleep at night (Bed Partner)\,-2.000,False,22,22,999.000,phenotypes,


In [16]:
bmi_harmonized = "\\DCC Harmonized data set\\03 - Baseline common covariates\\Body mass index calculated at baseline.\\"
random_variables = plain_variablesDict.index[6:10]

In [18]:
query = resource.query()
query.filter().add(bmi_harmonized, 10.5, 11)
query.getQueryCommand()

'{"query": {"fields": [], "crossCountFields": [], "requiredFields": [], "anyRecordOf": [], "numericFilters": {"\\\\DCC Harmonized data set\\\\03 - Baseline common covariates\\\\Body mass index calculated at baseline.\\\\": {}}, "categoryFilters": {}, "variantInfoFilters": [{"categoryVariantInfoFilters": {}}, {"numericVariantInfoFilters": {}}]}, "resourceUUID": "02e23f52-f354-4e8b-992c-d37c8b9ba140"}'

In [19]:
query._lstFilter.data

{'\\DCC Harmonized data set\\03 - Baseline common covariates\\Body mass index calculated at baseline.\\': {'type': 'minmax',
  'HpdsDataType': 'phenotypes'}}

In [33]:
query = resource.query()
query.select().add(bmi_harmonized)
facts = query.getResultsDataFrame()

In [34]:
facts[bmi_harmonized].notnull().value_counts()

True     231159
False     88883
Name: \DCC Harmonized data set\03 - Baseline common covariates\Body mass index calculated at baseline.\, dtype: int64

In [35]:
facts[bmi_harmonized].replace({0: np.nan}).notnull().value_counts()

True     230917
False     89125
Name: \DCC Harmonized data set\03 - Baseline common covariates\Body mass index calculated at baseline.\, dtype: int64

In [36]:
mask_0 = facts[bmi_harmonized] == 0
patient_id = facts.loc[mask_0,"Patient ID"]

In [37]:
patient_id.to_csv("patient_id_zero.csv", header=True, index=False)

# Reproducing errors

## Issue 1: query.anyof().add() → HTTP Error

The `anyof` query method raises an HTTP error, whereas the other query methods (`require`, `select`, `filter`) work fine.

In [27]:
print(random_variables)

Index(['\NHLBI TOPMed: Diabetes Heart Study (DHS) African American Coronary Artery Calcification (AA CAC)\Physical Observations\Body mass index\',
       '\NHLBI TOPMed: Novel Risk Factors for the Development of Atrial Fibrillation in Women\Physical Observations\Body mass index\',
       '\NHLBI TOPMed: Boston Early-Onset COPD Study in the TOPMed Program\Subject Phenotype\Physical Observations\Body Mass Index [BMI ]\',
       '\The Jackson Heart Study (JHS)\Subject Phenotype\Analysis\Analysis1\Visit 1\BMI/Height/Weight/Waist Circumference/Hip Circumference/Body Surface Area\Body Surface Area\Body Mass Index (kg/m^2) [Visit 1]\'],
      dtype='object', name='KEY')


In [24]:
query = resource.query()
query.anyof().add(random_variables)
facts_anyof = query.getResultsDataFrame()

In [25]:
facts_anyof.shape

(2871, 1)

In [26]:
facts_anyof.head()

Unnamed: 0,Patient ID
0,51578
1,51579
2,51580
3,51582
4,51583


In [30]:
query.show()

.__________[ Query.select()  has NO SELECTIONS ]____________________________________________________________________________________________________________
.__________[ Query.crosscounts()  has NO SELECTIONS ]_______________________________________________________________________________________________________
.__________[ Query.require() has NO SELECTIONS ]____________________________________________________________________________________________________________
.__________[ Query.anyof()  Settings ]______________________________________________________________________________________________________________________
| _key__________________________________________________________________________________________________________________________
|  \\NHLBI TOPMed: Diabetes Heart Study (DHS) African American Coronary Artery Calcification (AA CAC)\\Physical Observations\\Body mass index\\ |
|  \\NHLBI TOPMed: Novel Risk Factors for the Development of Atrial Fibrillation in Women\\Physic

##### `require`, `select`, `filter`: work as expected, shown for reference

In [None]:
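query = resource.query()
# `random_variable_name` was not defined above; assume the first of `random_variables`
random_variable_name = random_variables[0]
query.require().add(random_variable_name)
facts_require = query.getResultsDataFrame()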

In [None]:
facts_require.shape

In [None]:
facts_require.head()

In [None]:
query = resource.query()
query.select().add(random_variable_name)
facts_select = query.getResultsDataFrame()

In [None]:
facts_select.shape

In [None]:
facts_select.head()

In [None]:
query = resource.query()
query.filter().add(random_variable_name, 10, 100)
facts_filter = query.getResultsDataFrame()

In [None]:
facts_filter.shape

In [None]:
facts_filter.head()

## Issue 2: connection.list() → TypeError

A less important issue, at least from an end-user's point of view: the `PicSureClient.Client().connect().list()` method is not working.

This issue is specific to the Python API.

In [None]:
connection.list()

### Issue 4: Count of non-null values reported by the dictionary

### Issue 5: The categorical flag of variables in the variable dictionary is sometimes inaccurate

### Issues with Dictionary 

1. Some counts are not accurate:
    - see '\\_Consents\\Short Study Accession with Consent Code\\'
2. Data types are not accurate, e.g. continuous instead of categorical. Maybe reformat to numerical/string
3. Null values are counted as real ones
    - See snippet on 
4. Information on some studies missing
    - 
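
One way to check items 1 and 3 locally is to compare the `observationCount` reported by the dictionary with the number of non-null values actually returned by a query. The sketch below uses small hand-made DataFrames in place of `plain_variablesDict` and `facts`. The helper name `compare_counts` and the option to treat `0` as missing are assumptions made for illustration and are not part of the PIC-SURE API.

```python
import numpy as np
import pandas as pd


def compare_counts(dictionary, facts, zero_is_missing=False):
    """Compare dictionary observation counts with non-null counts in `facts`.

    `dictionary` is indexed by variable name and has an `observationCount`
    column; `facts` has one column per variable (hypothetical toy inputs).
    """
    variables = [v for v in dictionary.index if v in facts.columns]
    values = facts[variables]
    if zero_is_missing:
        # Optionally treat exact zeros as missing values.
        values = values.mask(values == 0)
    report = pd.DataFrame({
        "reported": dictionary.loc[variables, "observationCount"],
        "observed": values.notnull().sum(),
    })
    report["difference"] = report["reported"] - report["observed"]
    return report


# Toy stand-ins for the dictionary and the query results.
toy_dictionary = pd.DataFrame(
    {"observationCount": [4, 3]},
    index=["\\study\\bmi\\", "\\study\\age\\"],
)
toy_facts = pd.DataFrame({
    "Patient ID": [1, 2, 3, 4],
    "\\study\\bmi\\": [22.5, 0.0, np.nan, 31.0],
    "\\study\\age\\": [54, 61, 47, np.nan],
})

report = compare_counts(toy_dictionary, toy_facts, zero_is_missing=True)
print(report)
```

A non-zero `difference` flags a variable whose reported count does not match the values that can actually be retrieved.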

# Allowing the API to query based on the output of the UI

# Enabling dictionary to search for multiple different strings
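
Until such a feature exists, one possible workaround is to retrieve the whole dictionary once and search its index locally with pandas. The sketch below only illustrates the idea: `search_any` is a hypothetical helper, and the toy DataFrame stands in for `plain_variablesDict`.

```python
import re

import pandas as pd


def search_any(dictionary, terms, case=False):
    """Return the rows whose variable name contains at least one of `terms`."""
    # Escape each term so it is matched literally, then join with "|" (OR).
    pattern = "|".join(re.escape(term) for term in terms)
    mask = dictionary.index.str.contains(pattern, case=case, regex=True)
    return dictionary[mask]


# Toy stand-in for the variables dictionary.
toy_dictionary = pd.DataFrame(
    {"categorical": [False, True, False]},
    index=pd.Index(
        [
            "\\Study A\\Physical Observations\\Body mass index\\",
            "\\Study A\\Lifestyle\\Smoking Status\\",
            "\\Study B\\Tests\\Heart rate\\",
        ],
        name="KEY",
    ),
)

matches = search_any(toy_dictionary, ["body mass", "smoking"])
print(matches.index.tolist())
```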

## Adding the possibility to filter on a value using regex
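
A client-side approximation, assuming the possible values of a categorical variable are already known from the `categoryValues` column of the dictionary, is to resolve the regular expression into an explicit list of values before building the query. `match_categories` is a hypothetical helper; how the resulting list is then passed to a filter is left out of this sketch.

```python
import re


def match_categories(category_values, pattern, flags=re.IGNORECASE):
    """Return the category values that match the regular expression `pattern`."""
    regex = re.compile(pattern, flags)
    return [value for value in category_values if regex.search(value)]


# Category values as they appear in the dictionary output above.
categories = ["No", "Smokeless_tobacco_use", "Yes", "missing value"]

selected = match_categories(categories, r"^(yes|smokeless)")
print(selected)  # ['Smokeless_tobacco_use', 'Yes']
```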

# Dictionary should return the same columns, regardless of the type of variables queried
This would make it possible to combine different dictionaries. Currently, different columns are returned depending on whether the dictionary contains only continuous or only categorical variables.
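
As a stopgap on the user side, each dictionary can be reindexed onto a fixed column list before concatenation. This is a minimal sketch with toy data: `harmonize_columns` is a hypothetical helper, and `EXPECTED_COLUMNS` is assumed from the dictionary output displayed earlier in this notebook.

```python
import pandas as pd

# Columns shown in the dictionary output earlier in this notebook.
EXPECTED_COLUMNS = [
    "min", "categorical", "observationCount", "patientCount",
    "max", "HpdsDataType", "categoryValues",
]


def harmonize_columns(dictionary, columns=EXPECTED_COLUMNS):
    """Return a copy of `dictionary` with exactly `columns`, in that order."""
    # reindex adds missing columns (filled with NaN) and drops unexpected ones.
    return dictionary.reindex(columns=list(columns))


# Toy dictionaries that each lack some of the columns.
continuous_only = pd.DataFrame(
    {"min": [0.0], "max": [12.0], "categorical": [False]},
    index=["\\study\\continuous variable\\"],
)
categorical_only = pd.DataFrame(
    {"categorical": [True], "categoryValues": [["No", "Yes"]]},
    index=["\\study\\categorical variable\\"],
)

combined = pd.concat(
    [harmonize_columns(continuous_only), harmonize_columns(categorical_only)]
)
print(combined.columns.tolist())
```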

# Missing studies in the dictionary 
- The variables of two studies are not present in the dictionary, which prevents querying them through the dictionary
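
A quick way to spot such gaps is to list the top-level study names found in the dictionary keys and compare them with the studies one expects to see. The sketch below assumes that the first path component of each key is the study name; `studies_in_dictionary`, the toy dictionary, and the `expected_studies` set are all hypothetical.

```python
import pandas as pd


def studies_in_dictionary(dictionary):
    """Return the set of top-level study names found in the dictionary index."""
    # Keys look like "\Study\Group\...\Variable\"; after stripping the outer
    # backslashes, the first component is taken to be the study name.
    return {str(key).strip("\\").split("\\")[0] for key in dictionary.index}


# Toy stand-in for the variables dictionary.
toy_dictionary = pd.DataFrame(
    {"observationCount": [362, 489, 1599]},
    index=[
        "\\Framingham Cohort\\Tests\\Study Date\\",
        "\\Atherosclerosis Risk in Communities (ARIC) Cohort\\ECG\\Major1\\",
        "\\The Jackson Heart Study (JHS)\\Renal Disease\\Q2d\\",
    ],
)

# Hypothetical list of studies the user expects to have access to.
expected_studies = {
    "Framingham Cohort",
    "The Jackson Heart Study (JHS)",
    "Example Study Not In Dictionary",
}

missing = expected_studies - studies_in_dictionary(toy_dictionary)
print(sorted(missing))
```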

### Testing the connection object once it is created

## Changing the way query results are retrieved, to make it more like the R implementation