# PIC-SURE API test notebook

This notebook reproduces ongoing issues with the PIC-SURE API. It has two parts: (1) environment set-up and (2) ongoing issues.

# Environment set-up

### Installation of external dependencies

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
!pip install -r requirements.txt

In [None]:
!python -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git 
!python -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git

In [None]:
import json
from pprint import pprint

import pandas as pd
import numpy as np 

import PicSureHpdsLib
import PicSureClient

from python_lib.utils import (
    get_multiIndex_variablesDict,
    get_dic_renaming_vars,
    match_dummies_to_varNames,
    joining_variablesDict_onCol,
)

### Connecting to a PIC-SURE network

Testing environment: BioData Catalyst 

In [None]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
resource_id = "02e23f52-f354-4e8b-992c-d37c8b9ba140"
token_file = "token.txt"

In [None]:
# Read-only access is enough; strip the trailing newline so it is not sent with the token
with open(token_file, "r") as f:
    token = f.read().strip()

In [None]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, token)
adapter = PicSureHpdsLib.Adapter(connection)
resource = adapter.useResource(resource_id)

### Retrieving variables dictionary from HPDS Database

NB: the dictionary methods work correctly; this section is only here to retrieve variable names for the queries below.

In [None]:
plain_variablesDict = resource.dictionary().find().DataFrame()

In [None]:
variablesDict = get_multiIndex_variablesDict(plain_variablesDict)

In [None]:
bmi_harmonized = "\\DCC Harmonized data set\\03 - Baseline common covariates\\Body mass index calculated at baseline.\\"

In [None]:
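random_variables = plain_variablesDict.index[6:10]
# Single variable name reused by the require/select/filter cells below
random_variable_name = random_variables[0]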

In [None]:
query = resource.query()
query.select().add(bmi_harmonized)

In [None]:
query._lstFilter.data

In [None]:
query = resource.query()
query.select().add(bmi_harmonized)
facts = query.getResultsDataFrame()

In [None]:
facts[bmi_harmonized].notnull().value_counts()

In [None]:
# np.nan instead of np.NaN: the np.NaN alias was removed in NumPy 2.0
facts[bmi_harmonized].replace({0: np.nan}).notnull().value_counts()

In [None]:
mask_0 = facts[bmi_harmonized] == 0
patient_id = facts.loc[mask_0,"Patient ID"]

In [None]:
patient_id.to_csv("patient_id_zero.csv", header=True, index=False)

# Errors reproduction

## Issue 1: query.anyof().add() → HTTP Error

The query method `anyof` throws an HTTP error, although the other query methods (`select`, `require`, `filter`) work fine.

In [None]:
print(random_variables)

In [None]:
query = resource.query()
query.anyof().add(random_variables)
facts_anyof = query.getResultsDataFrame()

In [None]:
facts_anyof.shape

In [None]:
facts_anyof.head()

In [None]:
query.show()

##### `require`, `select`, `filter`: these work correctly and are included for reference

In [None]:
query = resource.query()
query.require().add(random_variable_name)
facts_require = query.getResultsDataFrame()

In [None]:
facts_require.shape

In [None]:
facts_require.head()

In [None]:
query = resource.query()
query.select().add(random_variable_name)
facts_select = query.getResultsDataFrame()

In [None]:
facts_select.shape

In [None]:
facts_select.head()

In [None]:
query = resource.query()
query.filter().add(random_variable_name, 10, 100)
facts_filter = query.getResultsDataFrame()

In [None]:
facts_filter.shape

In [None]:
facts_filter.head()

## Issue 2: connection.list() → TypeError

On a less important topic (at least from an end user's point of view), the `PicSureClient.Client().connect().list()` method is not working.

This issue is specific to the Python API.

In [None]:
connection.list()

### Issue 4: Count of non-null values by the dictionary 

### Issue 5: Categorical type of variables in variable dictionary is sometimes not accurate

### Issues with Dictionary 

1. Some counts are not accurate:
    - see '\\_Consents\\Short Study Accession with Consent Code\\'
2. Data types are not accurate, e.g. continuous instead of categorical. Maybe reformat to numerical/string
3. Null values counted as real ones
    - See snippet on 
4. Information on some studies missing
    - 
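
The third point can be illustrated with a minimal sketch on synthetic data. It assumes, as the BMI cells in the environment set-up suggest, that some missing values come back as 0 rather than as nulls. The series `bmi` and the helper `count_non_null` are hypothetical names used only for illustration; they are not part of the PIC-SURE API.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one column of query.getResultsDataFrame()
# (assumption: two of the zeros below are really missing values)
bmi = pd.Series([22.5, 0.0, np.nan, 31.2, 0.0], name="bmi")


def count_non_null(series, zero_is_missing=False):
    """Count observed values, optionally treating 0 as missing."""
    if zero_is_missing:
        # mask() replaces the entries where the condition holds with NaN
        series = series.mask(series == 0)
    return int(series.notna().sum())


naive_count = count_non_null(bmi)                         # zeros counted as real values
strict_count = count_non_null(bmi, zero_is_missing=True)  # zeros treated as missing
print(naive_count, strict_count)
```

The gap between the two counts is the kind of discrepancy that would show up between the dictionary's reported count and the number of usable observations.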

# Allowing the API to query based on the output of the UI

# Enabling dictionary to search for multiple different strings

## Adding the possibility to filter on a value using regex
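
Until such a feature exists, a possible client-side workaround is to filter the dictionary DataFrame with pandas. The sketch below is only an illustration under stated assumptions: `toy_dict` is a hypothetical stand-in for `plain_variablesDict` (the column name `observationCount` and the variable paths are invented), and `search_dictionary` is a hypothetical helper, not a PIC-SURE method.

```python
import re

import pandas as pd

# Hypothetical stand-in for plain_variablesDict, indexed by variable path
toy_dict = pd.DataFrame(
    {"observationCount": [100, 80, 60, 40]},
    index=[
        "\\study A\\demographics\\age\\",
        "\\study A\\labs\\body mass index\\",
        "\\study B\\labs\\BMI at baseline\\",
        "\\study B\\demographics\\sex\\",
    ],
)


def search_dictionary(dictionary_df, terms, regex=False):
    """Keep the rows whose index matches any of the terms (case-insensitive)."""
    # Plain strings are escaped so that backslashes in paths are taken literally
    patterns = terms if regex else [re.escape(t) for t in terms]
    combined = "|".join(f"(?:{p})" for p in patterns)
    mask = dictionary_df.index.str.contains(combined, case=False, regex=True)
    return dictionary_df[mask]


# Several plain search strings at once
hits = search_dictionary(toy_dict, ["body mass index", "bmi"])
# A single regular expression
regex_hits = search_dictionary(toy_dict, [r"\\labs\\.*baseline"], regex=True)
print(list(hits.index))
print(list(regex_hits.index))
```

Escaping the plain strings matters here because variable paths contain backslashes, which would otherwise be interpreted as regex escapes.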

# Dictionary should return the same columns, regardless of the type of variables queried
This would make it possible to combine different dictionaries. Currently, different columns are returned depending on whether the dictionary contains only continuous or only categorical variables.
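
As a stopgap, the columns can be aligned on the client side before combining. This is a minimal sketch on invented data: the column names (`min`, `max`, `categoryValues`, `observationCount`), the two toy frames, and the helper `harmonize` are all assumptions for illustration and may not match what the dictionary actually returns.

```python
import pandas as pd

# Hypothetical dictionary results: one with only continuous variables,
# one with only categorical variables, each with its own set of columns
continuous = pd.DataFrame(
    {"min": [0.0], "max": [50.0], "observationCount": [100]},
    index=["\\bmi\\"],
)
categorical = pd.DataFrame(
    {"categoryValues": [["F", "M"]], "observationCount": [90]},
    index=["\\sex\\"],
)

# Assumed full set of columns every dictionary result should expose
EXPECTED_COLUMNS = ["observationCount", "min", "max", "categoryValues"]


def harmonize(df, columns=EXPECTED_COLUMNS):
    """Return df with exactly the expected columns; absent ones are filled with NaN."""
    return df.reindex(columns=columns)


# Concatenate first (outer join on columns), then enforce the column order
combined = harmonize(pd.concat([continuous, categorical], sort=False))
print(combined)
```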

# Missing studies in the dictionary 
- The variables of two studies are not present in the dictionary, which prevents querying the dictionary for them

### Testing the connection object once it is created

## Changing the way to get query results, making it look more like the R implementation