# Environment set-up

### Prerequisites
- Python 3.6 or later
- pip, the Python package manager, which is already available on most systems with a Python interpreter installed ([pip installation instructions](https://pip.pypa.io/en/stable/installing/))
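
A quick way to confirm that the running interpreter meets the version requirement above is to check `sys.version_info` before installing anything. This is a minimal sketch; the helper name `meets_minimum_version` is an illustrative assumption and not part of the original notebook.

```python
import sys

MIN_VERSION = (3, 6)  # minimum version stated in the prerequisites


def meets_minimum_version(version_info=None, minimum=MIN_VERSION):
    """Return True if the given (or running) interpreter is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor), e.g. (3, 8) >= (3, 6)
    return tuple(version_info[:2]) >= tuple(minimum)


if not meets_minimum_version():
    raise RuntimeError("Python 3.6 or later is required")
```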

Demographics_PICSURE_API_validations.ipynb

data_medications_nodes_from_db.csv


### Packages installation

Install the packages listed in the `requirements.txt` file, as well as the two components of the PIC-SURE API from GitHub: the PIC-SURE adapter and the PIC-SURE client.

In [None]:
!cat requirements.txt

In [None]:
import sys
!{sys.executable} -m pip install -r requirements.txt

In [None]:
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git

Import all the external dependencies, as well as the user-defined functions stored in the `python_lib` folder.

In [None]:
import json
from pprint import pprint

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from scipy import stats

import PicSureHpdsLib
import PicSureClient

#from python_lib.utils import get_multiIndex_variablesDict, joining_variablesDict_onCol

##### Setting the display parameters for tables and plots

In [None]:
# Pandas DataFrame display options
# Use the full option name; a partial pattern can match several options
pd.set_option("display.max_rows", 100)

# Matplotlib display parameters
plt.rcParams["figure.figsize"] = (14,8)
font = {'weight' : 'bold',
        'size'   : 12}
plt.rc('font', **font)


## Connecting to a PIC-SURE resource

Several pieces of information are required to access data through the PIC-SURE API: a network URL, a resource ID, and a user-specific security token.

In [None]:
PICSURE_network_URL = "https://precisionlink-biobank4discovery.childrens.harvard.edu/picsure/"
resource_id = "6aa47730-3288-4c45-bfa1-5a8730666016"
token_file = "token.txt"

In [None]:
with open(token_file, "r") as f:
    # Strip surrounding whitespace such as a trailing newline
    my_token = f.read().strip()
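
A token file often ends with a newline, and it may also be empty by mistake; either case can lead to a confusing authentication failure later on. The sketch below wraps the read in a hypothetical helper, `read_token` (not part of the PIC-SURE libraries), that strips surrounding whitespace and fails early when the file holds no token.

```python
from pathlib import Path


def read_token(path):
    """Read a security token from a text file, stripping surrounding whitespace."""
    token = Path(path).read_text().strip()
    if not token:
        # Fail before any connection attempt is made
        raise ValueError(f"Token file {path!r} is empty")
    return token
```

With this helper, the token could be loaded with `my_token = read_token(token_file)`.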

In [None]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, my_token, True)
adapter = PicSureHpdsLib.Adapter(connection)
# Select the resource by the ID defined above
resource = adapter.useResource(resource_id)

Two objects are created here: a `connection` and a `resource` object.

As we will only be using a single resource, **the `resource` object is actually the only one we will need to proceed with data analysis hereafter**.

It is connected to the specific data source ID we specified, and enables us to query and retrieve data from this database.

The file `data_medications_nodes_from_db.csv` is generated from the database and contains the node_name, variable_name and patient_counts for all categorical values under the Medications node. These counts are compared with the counts in HPDS for the same variables.

In [None]:
#"COUNTS","CONCEPT_PATH","TVAL_CHAR","NVAL_NUM"
# Processing file 1
print('********  Start processing \\Medications\\ *********')
dfdb_node = pd.read_csv('Med_Data_Ext_1.csv')
dfdb_node_sort = dfdb_node.sort_values(by=["CONCEPT_PATH", "TVAL_CHAR"])

# Rows that raise an error are appended to the error file and skipped
with open("MedicationsError_1.txt", "a") as f:
    for i in range(len(dfdb_node_sort)):
        # Read the row outside the try block so the names exist in the handler
        counts = dfdb_node_sort['COUNTS'].iloc[i]
        concept_path = dfdb_node_sort['CONCEPT_PATH'].iloc[i]
        tval_char = dfdb_node_sort['TVAL_CHAR'].iloc[i]
        try:
            my_query = resource.query()
            my_query.filter().add(concept_path, tval_char)
            print('Processing ' + concept_path + tval_char)
            my_count = my_query.getCount()
            if my_count != counts:
                print(concept_path + '  ' + tval_char + ' counts do not match - Actual count ' + str(my_count), '<>  DB count ' + str(counts))
            else:
                print(concept_path + '  ' + tval_char + ' - Actual count ' + str(my_count), ' = DB count ' + str(counts))
        except Exception:
            # str() guards against non-string values such as NaN
            f.write("Errored Row concept_path - " + str(concept_path) + " tval_char " + str(tval_char) + "\n")
print('********  End processing \\Medications\\ *********')


In [None]:
#"COUNTS","CONCEPT_PATH","TVAL_CHAR","NVAL_NUM"
# Processing file 2
print('********  Start processing \\Medications\\Any Nodes *********')
dfdb_node = pd.read_csv('Med_Data_Ext_2.csv')
dfdb_node_sort = dfdb_node.sort_values(by=["CONCEPT_PATH", "TVAL_CHAR"])

# Rows that raise an error are appended to the error file and skipped
with open("MedicationsError_2.txt", "a") as f:
    for i in range(len(dfdb_node_sort)):
        # Read the row outside the try block so the names exist in the handler
        counts = dfdb_node_sort['COUNTS'].iloc[i]
        concept_path = dfdb_node_sort['CONCEPT_PATH'].iloc[i]
        tval_char = dfdb_node_sort['TVAL_CHAR'].iloc[i]
        try:
            my_query = resource.query()
            my_query.filter().add(concept_path, tval_char)
            print('Processing ' + concept_path + tval_char)
            my_count = my_query.getCount()
            if my_count != counts:
                print(concept_path + '  ' + tval_char + ' counts do not match - Actual count ' + str(my_count), '<>  DB count ' + str(counts))
            else:
                print(concept_path + '  ' + tval_char + ' - Actual count ' + str(my_count), ' = DB count ' + str(counts))
        except Exception:
            # str() guards against non-string values such as NaN
            f.write("Errored Row concept_path - " + str(concept_path) + " tval_char " + str(tval_char) + "\n")
print('********  End processing \\Medications\\Any Nodes *********')
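
The two validation cells differ only in their input and error file names, so the comparison could be factored into one function that returns a table instead of printing each row. The following is a sketch under stated assumptions: `compare_counts` is a hypothetical helper, the `Fake*` classes are stand-ins that only mimic the `query()`, `filter().add()` and `getCount()` calls used above (they are not the PIC-SURE API), and the sample rows are invented for illustration.

```python
import pandas as pd


def compare_counts(expected, resource):
    """Compare expected patient counts with the counts returned by `resource`.

    `expected` needs the columns COUNTS, CONCEPT_PATH and TVAL_CHAR.
    Returns one row per input row with the HPDS count and a status.
    """
    records = []
    ordered = expected.sort_values(by=["CONCEPT_PATH", "TVAL_CHAR"])
    for row in ordered.itertuples(index=False):
        hpds_count, status = None, "error"
        try:
            query = resource.query()
            query.filter().add(row.CONCEPT_PATH, row.TVAL_CHAR)
            hpds_count = query.getCount()
            status = "match" if hpds_count == row.COUNTS else "mismatch"
        except Exception:
            pass  # status stays "error" so the row can be inspected later
        records.append({
            "CONCEPT_PATH": row.CONCEPT_PATH,
            "TVAL_CHAR": row.TVAL_CHAR,
            "DB_COUNT": row.COUNTS,
            "HPDS_COUNT": hpds_count,
            "STATUS": status,
        })
    columns = ["CONCEPT_PATH", "TVAL_CHAR", "DB_COUNT", "HPDS_COUNT", "STATUS"]
    return pd.DataFrame(records, columns=columns)


# --- Stand-in objects for illustration only (not the PIC-SURE API) ---
class FakeFilter:
    def __init__(self):
        self.added = None

    def add(self, path, value):
        self.added = (path, value)


class FakeQuery:
    def __init__(self, counts):
        self._counts = counts
        self._filter = FakeFilter()

    def filter(self):
        return self._filter

    def getCount(self):
        # Raises KeyError for an unknown (path, value) pair
        return self._counts[self._filter.added]


class FakeResource:
    def __init__(self, counts):
        self._counts = counts

    def query(self):
        return FakeQuery(self._counts)


# Invented sample data: one matching row, one mismatch, one unknown path
expected = pd.DataFrame({
    "COUNTS": [10, 5, 7],
    "CONCEPT_PATH": ["\\Medications\\A\\", "\\Medications\\B\\", "\\Medications\\C\\"],
    "TVAL_CHAR": ["Yes", "Yes", "No"],
})
fake = FakeResource({
    ("\\Medications\\A\\", "Yes"): 10,
    ("\\Medications\\B\\", "Yes"): 4,
})
report = compare_counts(expected, fake)
print(report)
```

With the real `resource` object, each extract could then be checked with a call such as `compare_counts(pd.read_csv('Med_Data_Ext_1.csv'), resource)`, and rows whose `STATUS` is not `match` could be filtered out for review.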