# Environment set-up



# Prerequisites
- Python 3.6 or later
- pip, the Python package manager, which is already available on most systems with a Python interpreter installed ([pip installation instructions](https://pip.pypa.io/en/stable/installing/))
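
The interpreter requirement can be checked from inside the notebook. The following is a minimal sketch using only the standard library; the helper name `meets_minimum_python` is hypothetical and not part of this notebook or of the PIC-SURE packages.

```python
import sys

def meets_minimum_python(version_info=None, minimum=(3, 6)):
    """Return True if the (major, minor) version is at least `minimum`."""
    if version_info is None:
        # Default to the interpreter that is running this code
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)

if not meets_minimum_python():
    print("Python 3.6 or later is required; found", sys.version.split()[0])
```

Comparing only the major and minor components keeps the check independent of patch releases.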

Diagnosis_PICSURE_API_validations.ipynb


### Package installation

This step installs the packages listed in the `requirements.txt` file, as well as the two components of the PIC-SURE API from GitHub: the PIC-SURE adapter and the PIC-SURE client.

In [1]:
!cat requirements.txt

numpy==1.16.4
matplotlib>=3.1.1
pandas>=0.25.3
scipy>=1.3.1
tqdm>=4.38.0
statsmodels>=0.10.2


In [2]:
import sys
!{sys.executable} -m pip install -r requirements.txt



In [3]:
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git

Collecting git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
  Cloning https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git to /tmp/pip-req-build-83440wzl
Collecting httplib2
  Using cached httplib2-0.19.0-py3-none-any.whl (95 kB)
Collecting pyparsing<3,>=2.4.2
  Using cached pyparsing-2.4.7-py2.py3-none-any.whl (67 kB)
Building wheels for collected packages: PicSureHpdsLib
  Building wheel for PicSureHpdsLib (setup.py) ... done
  Created wheel for PicSureHpdsLib: filename=PicSureHpdsLib-0.9.0-py2.py3-none-any.whl size=21890 sha256=3f84c27b769a803a60c1ba10a9382dff6f6fb3a82a0c004e973117a27ed6c7df
  Stored in directory: /tmp/pip-ephem-wheel-cache-s1ufrkhh/wheels/e8/35/43/484d5d574661fc4a2c5b083551bc3c7254695764ed17ce397e
Successfully built PicSureHpdsLib
Installing collected packages: pyparsing, httplib2, PicSureHpdsLib
  Attempting uninstall: pyparsing
    Found existing installation: pyparsing 2.4.7
    Uninstalling pyparsing-2.4.7:
      Successfully
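
After installation, it can be useful to confirm that every module the notebook relies on can actually be found by the current interpreter. The sketch below uses the standard-library `importlib.util.find_spec`; the helper name `missing_modules` is hypothetical, and the module list is taken from this notebook's import cell.

```python
import importlib.util

def missing_modules(module_names):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

# Top-level modules imported by this notebook
required = ["numpy", "pandas", "matplotlib", "scipy",
            "PicSureHpdsLib", "PicSureClient"]

missing = missing_modules(required)
if missing:
    print("Not importable:", ", ".join(missing))
```

This only checks that the modules can be located; it does not verify the installed versions against `requirements.txt`.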

Import all the external dependencies, as well as the user-defined functions stored in the `python_lib` folder.

In [4]:
import json
from pprint import pprint

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from scipy import stats

import PicSureHpdsLib
import PicSureClient

#from python_lib.utils import get_multiIndex_variablesDict, joining_variablesDict_onCol

##### Setting the display parameter for tables and plots

## Connecting to a PIC-SURE resource

Three pieces of information are required to access data through the PIC-SURE API: a network URL, a resource ID, and a user-specific security token.

In [5]:
PICSURE_network_URL = "https://precisionlink-biobank4discovery.childrens.harvard.edu/picsure/"
resource_id = "6aa47730-3288-4c45-bfa1-5a8730666016"
token_file = "token.txt"

In [6]:
with open(token_file, "r") as f:
    # Strip surrounding whitespace, such as a trailing newline added by a text editor
    my_token = f.read().strip()
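
The three connection inputs can be sanity-checked before any network call is made. The sketch below is an illustration only: the helper `check_connection_inputs` is hypothetical, and it assumes that the resource ID is a UUID (as the "Resource UUID" column in the connection output suggests) and that the token must be non-empty with no surrounding whitespace.

```python
import uuid
from urllib.parse import urlparse

def check_connection_inputs(network_url, resource_id, token):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []

    # The network URL should be an absolute http(s) URL
    parsed = urlparse(network_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("network URL is not an absolute http(s) URL")

    # Assumption: resource IDs are UUIDs
    try:
        uuid.UUID(resource_id)
    except (ValueError, AttributeError, TypeError):
        problems.append("resource id is not a valid UUID")

    # A token read from a file often carries a trailing newline
    if not token or not token.strip():
        problems.append("token is empty")
    elif token != token.strip():
        problems.append("token has leading or trailing whitespace")

    return problems

# Placeholder values for illustration, not real credentials
print(check_connection_inputs(
    "https://example.org/picsure/",
    "6aa47730-3288-4c45-bfa1-5a8730666016",
    "placeholder-token\n",
))
```

Returning a list of problems rather than raising on the first one lets all issues be reported in a single pass.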

In [7]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, my_token, True)
adapter = PicSureHpdsLib.Adapter(connection)
resource = adapter.useResource(resource_id)


|        certificates to be acceptable for connections.  This may be useful for           |
|        working in a development environment or on systems that host public              |
|        data.  BEST SECURITY PRACTICES ARE THAT IF YOU ARE WORKING WITH SENSITIVE        |
|        DATA THEN ALL SSL CERTS BY THOSE EVIRONMENTS SHOULD NOT BE SELF-SIGNED.          |
+--------------------------------------+------------------------------------------------------
|  Resource UUID                       |  Resource Name                                  
+--------------------------------------+------------------------------------------------------
| 6aa47730-3288-4c45-bfa1-5a8730666016
+--------------------------------------+------------------------------------------------------


Two objects are created here: a `connection` and a `resource` object.

As we will only be using one single resource, **the `resource` object is actually the only one we will need to proceed with data analysis hereafter**. 

It is connected to the specific data source ID we specified, and it lets us query and retrieve data from that database.


The file `Diag_Data_Ext_1.csv` is generated from the database and contains the node name, variable name, and patient counts for all categorical values under the Diagnosis node. These counts are then compared to the counts returned by HPDS for the same variables.

In [8]:
# Expected CSV columns: "COUNTS","CONCEPT_PATH","TVAL_CHAR","NVAL_NUM"
# Reads Diag_Data_Ext_1.csv and appends the validation report to DiagnosisValidations.txt
# Raw strings keep the backslashes in \Diagnosis\ literal
print(r'********  Start processing \Diagnosis\ *********')
dfdb_node = pd.read_csv('Diag_Data_Ext_1.csv')
dfdb_node_sort = dfdb_node.sort_values(by=["CONCEPT_PATH", "TVAL_CHAR"])

# The context manager closes the report file even if a row fails unexpectedly
with open("DiagnosisValidations.txt", "a") as f:
    for row in dfdb_node_sort.itertuples(index=False):
        # Read the row before the try block so the error message can always refer to it
        counts, concept_path, tval_char = row.COUNTS, row.CONCEPT_PATH, row.TVAL_CHAR
        try:
            my_query = resource.query()
            my_query.filter().add(concept_path, tval_char)
            print('Processing ' + concept_path + tval_char)
            my_count = my_query.getCount()
            if my_count != counts:
                f.write(f"concept_path {concept_path} tval_char {tval_char} counts ****not match**** = api count - {my_count} db count - {counts}\n")
            else:
                f.write(f"concept_path {concept_path} tval_char {tval_char} counts match = {my_count}\n")
        except Exception as err:
            # Report the failing row and continue with the next one
            print(f"Errored Row concept_path - {concept_path} tval_char {tval_char} ({err})\n")
            continue

print(r'********  End processing \Diagnosis\ *********')


********  Start processing \Diagnosis\ *********
Processing \Diagnosis\ICD-10-CM Diagnoses (2015) with ICD9\Certain conditions originating in the perinatal period (P00-P96)\Abnormal findings on neonatal screening (P09)\Abnormal findings on neonatal screening\Admitting
Processing \Diagnosis\ICD-10-CM Diagnoses (2015) with ICD9\Certain conditions originating in the perinatal period (P00-P96)\Abnormal findings on neonatal screening (P09)\Abnormal findings on neonatal screening\Billing Diagnosis
Processing \Diagnosis\ICD-10-CM Diagnoses (2015) with ICD9\Certain conditions originating in the perinatal period (P00-P96)\Abnormal findings on neonatal screening (P09)\Abnormal findings on neonatal screening\Final
********  End processing \Diagnosis\ *********
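
The comparison logic itself can be separated from the PIC-SURE connection, which makes it testable without network access. The sketch below is one possible arrangement, not the notebook's own method: `compare_counts` and `fake_api_count` are hypothetical names, and the rows and counts are toy values invented for illustration.

```python
def compare_counts(rows, get_api_count):
    """Compare database counts with counts returned by `get_api_count`.

    `rows` is an iterable of dicts with COUNTS, CONCEPT_PATH and TVAL_CHAR keys.
    `get_api_count(concept_path, value)` is a caller-supplied callable.
    Returns three lists: (matches, mismatches, errors).
    """
    matches, mismatches, errors = [], [], []
    for row in rows:
        key = (row["CONCEPT_PATH"], row["TVAL_CHAR"])
        try:
            expected = int(row["COUNTS"])
            actual = get_api_count(*key)
        except Exception as exc:
            # Keep going so that one bad row does not hide the others
            errors.append((key, repr(exc)))
            continue
        if actual == expected:
            matches.append((key, actual))
        else:
            mismatches.append((key, expected, actual))
    return matches, mismatches, errors

# Toy rows and a stand-in counter, invented for illustration only
toy_rows = [
    {"COUNTS": 12, "CONCEPT_PATH": "\\Toy\\A\\", "TVAL_CHAR": "Final"},
    {"COUNTS": 7, "CONCEPT_PATH": "\\Toy\\A\\", "TVAL_CHAR": "Admitting"},
    {"COUNTS": 3, "CONCEPT_PATH": "\\Toy\\B\\", "TVAL_CHAR": "Final"},
]
toy_api_counts = {("\\Toy\\A\\", "Final"): 12, ("\\Toy\\A\\", "Admitting"): 9}

def fake_api_count(concept_path, value):
    # Raises KeyError for combinations the stand-in does not know about
    return toy_api_counts[(concept_path, value)]

matches, mismatches, errors = compare_counts(toy_rows, fake_api_count)
print(len(matches), "match,", len(mismatches), "mismatch,", len(errors), "errored")
```

In the notebook, `get_api_count` would wrap the `resource.query()`, `filter().add(...)` and `getCount()` calls from the validation cell, and `rows` could come from `dfdb_node_sort.to_dict("records")`.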
