# BioData Catalyst Data Release QA
Validation tests in this notebook:
1. [**Patient counts of new studies**](https://basicnotebookinstance-rl0ytn08jb87.notebook.us-east-1.sagemaker.aws/notebooks/biodatacatalyst-pic-sure/access-dashboard-metadata/biodatacatalyst_data_release_QA.ipynb#Validation:-New-study-patient-counts): Patient counts for the new studies in the integration environment are compared with the patient counts in `Patient_Count_Per_Consents.csv`.

### Prerequisites
- Developer access to the integration environment (token)
- Consent value(s) of the new study (or studies) to validate (phs number)

### Install packages

In [1]:
import sys
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git
!{sys.executable} -m pip install -r requirements.txt

Collecting git+https://github.com/hms-dbmi/pic-sure-python-client.git
  Cloning https://github.com/hms-dbmi/pic-sure-python-client.git to /tmp/pip-req-build-572j0njv
  Running command git clone -q https://github.com/hms-dbmi/pic-sure-python-client.git /tmp/pip-req-build-572j0njv
Building wheels for collected packages: PicSureClient
  Building wheel for PicSureClient (setup.py) ... done
  Created wheel for PicSureClient: filename=PicSureClient-0.1.0-py2.py3-none-any.whl size=10300 sha256=108b70e75232a9f71ccd9464879ae0329bc61fc76debe1f48228c9b6a62c4668
  Stored in directory: /tmp/pip-ephem-wheel-cache-n_bu92se/wheels/31/ef/21/e362bba8de04e0072fafec9f77bd1abdf7e166213d27e98729
Successfully built PicSureClient
Installing collected packages: PicSureClient
  Attempting uninstall: PicSureClient
    Found existing installation: PicSureClient 0.1.0
    Uninstalling PicSureClient-0.1.0:
      Successfully uninstalled PicSureClient-0.1.0
Successfully installed PicSureClient-0.1.0

In [2]:
import json
from pprint import pprint

import pandas as pd

from shutil import copyfile
import PicSureClient
import PicSureBdcAdapter

### Connect to PIC-SURE
Be sure to use the **developer token** from the **integration environment**. It is necessary to have access to all studies to validate the counts.

In [3]:
PICSURE_network_URL = "https://biodatacatalyst.integration.hms.harvard.edu/picsure"
resource_id = "02e23f52-f354-4e8b-992c-d37c8b9ba140"
token_file = "token.txt"

In [6]:
with open(token_file, "r") as f:
    my_token = f.read().strip()  # Drop any trailing newline or whitespace around the token

In [7]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, my_token, True)
adapter = PicSureBdcAdapter.Adapter(connection)
resource = adapter.useResource(resource_id)


|        certificates to be acceptable for connections.  This may be useful for           |
|        working in a development environment or on systems that host public              |
|        data.  BEST SECURITY PRACTICES ARE THAT IF YOU ARE WORKING WITH SENSITIVE        |
|        DATA THEN ALL SSL CERTS BY THOSE EVIRONMENTS SHOULD NOT BE SELF-SIGNED.          |
+--------------------------------------+------------------------------------------------------+
|  Resource UUID                       |  Resource Name                                       |
+--------------------------------------+------------------------------------------------------+
| 02e23f52-f354-4e8b-992c-d37c8b9ba140 |                                                      |
| 70c837be-5ffc-11eb-ae93-0242ac130002 |                                                      |
+--------------------------------------+------------------------------------------------------+


## Validation: New study patient counts
This section validates the patient counts for newly ingested studies.

### Specify the new study (or studies) to be tested
To validate the newly ingested studies, list their phs numbers in the cell below, without the consent group.
For example, if the study of interest is the AMISH study, list `phs000956` below (*not* `phs000956.c1`).
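
The list can be sanity-checked before the rest of the notebook runs. The following is a minimal sketch, not part of the PIC-SURE libraries: the helper name `normalize_phs` is hypothetical, and it assumes that a study accession is `phs` followed by six digits, optionally followed by dotted suffixes such as `.c1`.

```python
import re

# Assumption: an accession is "phs" + six digits, optionally followed by
# dotted suffixes such as ".c1".
_PHS_PATTERN = re.compile(r"^(phs\d{6})(\.[A-Za-z0-9]+)*$")


def normalize_phs(value):
    """Return the bare phs number, dropping any suffix such as '.c1'."""
    match = _PHS_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"Not a recognizable phs number: {value!r}")
    return match.group(1)


print(normalize_phs("phs000956.c1"))
print(normalize_phs("phs000956"))
```

Applying such a helper to every entry of `to_validate` would catch an accidentally pasted consent group or a mistyped accession before any query is sent.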

In [8]:
# to_validate = ['list', 'phs_numbers', 'here']
to_validate = ['phs000285',  # CARDIA
               'phs000703',  # CATHGEN
               'phs001194',  # PCGC
               'phs000810',  # HCHS-SOL
               'phs001252',  # ECLIPSE
              ]

### Get patient count file from S3 bucket
This notebook uses `Patient_Count_Per_Consents.csv` from the S3 bucket as the reference file. First we need to copy this file over to this directory.

In [9]:
src = '/home/ec2-user/SageMaker/studies/ALL-avillach-73-bdcatalyst-etl/general/completed/Patient_Count_Per_Consents.csv'
dst = '/home/ec2-user/SageMaker/biodatacatalyst-pic-sure/access-dashboard-metadata/Patient_Count_Per_Consents.csv'
copyfile(src, dst)

'/home/ec2-user/SageMaker/biodatacatalyst-pic-sure/access-dashboard-metadata/Patient_Count_Per_Consents.csv'
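
If the source path is wrong or the bucket is not mounted, `copyfile` fails with a bare traceback. A minimal sketch of a more defensive copy step is shown here; the function name `copy_reference_file` is hypothetical, and it only wraps the standard-library `shutil.copyfile` call used above.

```python
from pathlib import Path
from shutil import copyfile


def copy_reference_file(src, dst):
    """Copy src to dst, failing early with a clear message if src is missing."""
    src_path, dst_path = Path(src), Path(dst)
    if not src_path.is_file():
        raise FileNotFoundError(f"Reference file not found: {src_path}")
    # Create the destination directory if it does not exist yet
    dst_path.parent.mkdir(parents=True, exist_ok=True)
    return copyfile(src_path, dst_path)
```

Calling `copy_reference_file(src, dst)` with the two paths defined in the cell above would behave like the original call when the file is present.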

In [10]:
# Load S3 file as a dataframe in the Jupyter Notebook
patient_ref_file = pd.read_csv('Patient_Count_Per_Consents.csv', header=None, names=['consent', 'patient_count'])
patient_ref_file

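|     | consent      | patient_count |
|-----|--------------|---------------|
| 0   | phs001001.c1 | 933           |
| 1   | phs001001.c2 | 92            |
| 2   | phs001345.c1 | 1854          |
| 3   | phs000956.c2 | 1123          |
| 4   | phs000956.c0 | 1069          |
| ... | ...          | ...           |
| 108 | phs000287.c4 | 8             |
| 109 | phs001412.c2 | 3             |
| 110 | phs001024.c1 | 128           |
| 111 | phs001412.c1 | 402           |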


In [11]:
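# Extract the consent groups for the user-specified phs values.
# Consent groups ending in ".c0" are skipped.
full_phs = []
for phs in to_validate:
    for full in patient_ref_file['consent']:
        # Match on the study prefix so one phs number cannot match inside another
        if full.startswith(phs + ".") and not full.endswith(".c0"):
            full_phs.append(full)
full_phs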

['phs000285.c2',
 'phs000285.c1',
 'phs000703.c1',
 'phs001194.c1',
 'phs001194.c2',
 'phs000810.c2',
 'phs000810.c1',
 'phs001252.c1']
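
The same selection can be written with pandas string methods instead of nested loops. This is a sketch under stated assumptions: the small `ref` dataframe is a stand-in built from four rows of the preview above, and the variable names are illustrative.

```python
import pandas as pd

# Stand-in for the reference file, using rows shown in the preview above
ref = pd.DataFrame({
    "consent": ["phs001001.c1", "phs001345.c1", "phs000956.c2", "phs000956.c0"],
    "patient_count": [933, 1854, 1123, 1069],
})
to_validate = ["phs000956"]

# 'phs000956.c2' -> 'phs000956'
study = ref["consent"].str.split(".", n=1).str[0]
# Flag the ".c0" consent groups so they can be excluded
is_c0 = ref["consent"].str.endswith(".c0")

selected = ref.loc[study.isin(to_validate) & ~is_c0, "consent"].tolist()
print(selected)  # ['phs000956.c2']
```

Unlike the loop, this version returns the consent groups in file order rather than grouped by the order of `to_validate`, which does not affect the validation itself.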

### Get patient count for the specified consent groups
Now the consent groups will be used to find the patient counts currently in the integration environment.

In [12]:
# Start a new query and initialize output dictionary
patient_count_query = resource.query()
output = {}

# Get patient counts for each consent group
for consentGroup in full_phs:
    print(consentGroup)
    patient_count_query.filter().delete("\\_consents\\") # Delete all consents
    patient_count_query.filter().add("\\_consents\\", consentGroup) # Add back consent group of interest
    patient_count = patient_count_query.getCount() # Get patient count
    output[consentGroup] = patient_count # Add to output

phs000285.c2
Deleted key: \_consents\
phs000285.c1
Deleted key: \_consents\
phs000703.c1
Deleted key: \_consents\
phs001194.c1
Deleted key: \_consents\
phs001194.c2
Deleted key: \_consents\
phs000810.c2
Deleted key: \_consents\
phs000810.c1
Deleted key: \_consents\
phs001252.c1
Deleted key: \_consents\


### Compare the values to the reference file
Finally, we compare the counts from the reference file with the counts retrieved from the integration environment.

In [13]:
for consent_val, integration_count in output.items():  # Count from integration environment
    # Count from reference file (first matching row, converted to a plain int)
    ref_rows = patient_ref_file.loc[patient_ref_file['consent'] == consent_val, 'patient_count']
    ref_count = int(ref_rows.iloc[0])
    # Display result message
    if ref_count == integration_count:
        print(consent_val, "passes validation")
    else:
        print('***DID NOT PASS VALIDATION:', consent_val)
        print('Expected count from Patient_Count_Per_Consents.csv:\t', ref_count)
        print('Count retrieved from integration environment:\t', integration_count)

phs000285.c2 passes validation
phs000285.c1 passes validation
phs000703.c1 passes validation
phs001194.c1 passes validation
phs001194.c2 passes validation
phs000810.c2 passes validation
phs000810.c1 passes validation
phs001252.c1 passes validation
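
For a larger release, a tabular report can be easier to scan than printed messages. The following sketch builds one with `DataFrame.merge`; the function name `compare_counts` and the column names `integration_count`, `reference_count`, and `passes` are illustrative choices, and the example values (including the consent group `phs009999.c1`) are made up to show a mismatch and a missing reference row.

```python
import pandas as pd


def compare_counts(reference, observed):
    """Join observed counts (dict: consent -> count) against the reference dataframe."""
    obs = pd.DataFrame({
        "consent": list(observed),
        "integration_count": list(observed.values()),
    })
    # Left join keeps every observed consent group, even if the reference lacks it
    report = obs.merge(reference, on="consent", how="left")
    report = report.rename(columns={"patient_count": "reference_count"})
    # A missing reference count (NaN) never equals an observed count, so it fails
    report["passes"] = report["reference_count"] == report["integration_count"]
    return report


reference = pd.DataFrame({
    "consent": ["phs001001.c1", "phs001001.c2"],
    "patient_count": [933, 92],
})
# Hypothetical observed counts: one match, one mismatch, one unknown consent group
observed = {"phs001001.c1": 933, "phs001001.c2": 90, "phs009999.c1": 5}
report = compare_counts(reference, observed)
print(report)
```

In the notebook, `compare_counts(patient_ref_file, output)` would yield one row per consent group, and `report[~report["passes"]]` would list only the failures.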
