# BioData Catalyst Data Release QA
Validation tests in this notebook:
1. [**Patient counts of new studies**](https://basicnotebookinstance-rl0ytn08jb87.notebook.us-east-1.sagemaker.aws/notebooks/biodatacatalyst-pic-sure/access-dashboard-metadata/biodatacatalyst_data_release_QA.ipynb#Validation:-New-study-patient-counts): Patient counts of the new studies from the integration environment are compared to the patient counts in Patient_Count_Per_Consents.csv
2. [**Data dictionary comparison**](https://basicnotebookinstance-rl0ytn08jb87.notebook.us-east-1.sagemaker.aws/notebooks/biodatacatalyst-pic-sure/access-dashboard-metadata/biodatacatalyst_data_release_QA.ipynb#Validation:-Data-dictionary-comparison): Integration and production data dictionaries are compared to confirm that they match, apart from the studies being loaded or updated

### Prerequisites
- Developer access to the integration environment (token)
- Consent value(s) of the new study (or studies) to validate (phs number)
- Knowledge of whether a harmonized study was added

### Install packages

In [None]:
import sys
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git
!{sys.executable} -m pip install -r requirements.txt

In [None]:
import json
from pprint import pprint

import pandas as pd
import math

from shutil import copyfile
import PicSureClient
import PicSureBdcAdapter

from utils import get_full_consent_vals, compare_datadict_indices, compare_datadicts, get_topmed_and_harmonized_consents

### Connect to PIC-SURE
Be sure to use the **developer token** from the **integration environment**. It is necessary to have access to all studies to validate the counts.

In [None]:
integration_PICSURE_network_URL = "https://biodatacatalyst.integration.hms.harvard.edu/picsure"
int_resource_id = "02e23f52-f354-4e8b-992c-d37c8b9ba140"
integration_token_file = "token.txt"

In [None]:
with open(integration_token_file, "r") as f:
    my_int_token = f.read().strip()  # Drop any trailing newline or whitespace

In [None]:
int_client = PicSureClient.Client()
int_connection = int_client.connect(integration_PICSURE_network_URL, my_int_token, True)
int_adapter = PicSureBdcAdapter.Adapter(int_connection)
int_resource = int_adapter.useResource(int_resource_id)

## Validation: New study patient counts
The purpose of this section is to validate the patient counts for newly ingested studies.

### Specify the new study (or studies) to be tested
To validate the newly ingested studies, list their phs numbers in the cell below, without the consent group.
For example, if the study of interest is the AMISH study, list `phs000956` below (*not* `phs000956.c1`).
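
A quick guard against accidentally including a consent suffix is sketched below. The helper name `strip_consent_group` is hypothetical and is not part of `utils`. It assumes that accessions consist of `phs` followed by six digits, optionally followed by dot-separated suffixes such as `.c1`.

```python
import re

# Assumed accession format: "phs" + six digits, optionally followed by
# dot-separated suffixes such as ".v1", ".p1", or ".c1".
_PHS_PATTERN = re.compile(r"^(phs\d{6})(\.[A-Za-z0-9]+)*$")


def strip_consent_group(accession: str) -> str:
    """Return the bare phs number, dropping any suffix (hypothetical helper)."""
    match = _PHS_PATTERN.match(accession.strip())
    if match is None:
        raise ValueError(f"Not a recognizable phs accession: {accession!r}")
    return match.group(1)


# Example: a suffixed value is reduced to the bare study accession
cleaned = [strip_consent_group(a) for a in ["phs000956.c1", "phs000285"]]
print(cleaned)
```

Running the list through such a check before calling `get_topmed_and_harmonized_consents` would surface a malformed entry early, rather than as an empty match later on.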

In [None]:
#to_validate = ['list', 'phs_numbers', 'here']
to_validate = ['phs000285',  # CARDIA
               'phs000703',  # CATHGEN
               'phs001194',  # PCGC
               'phs000810',  # HCHS-SOL
               'phs001252'  # ECLIPSE
              ]

to_validate_topmed, to_validate_harmonized = get_topmed_and_harmonized_consents(to_validate)

### Get patient count file from S3 bucket
This notebook uses `Patient_Count_Per_Consents.csv` from the S3 bucket as the reference file. First we need to copy this file over to this directory.

In [None]:
src = '/home/ec2-user/SageMaker/studies/ALL-avillach-73-bdcatalyst-etl/general/completed/Patient_Count_Per_Consents.csv'
dst = '/home/ec2-user/SageMaker/biodatacatalyst-pic-sure/access-dashboard-metadata/Patient_Count_Per_Consents.csv'
copyfile(src, dst)

In [None]:
# Load S3 file as a dataframe in the Jupyter Notebook
patient_ref_file = pd.read_csv('Patient_Count_Per_Consents.csv', header=None, names=['consent', 'patient_count'])
patient_ref_file

In [None]:
# Extract the consent groups based on the user-identified phs values
full_phs = get_full_consent_vals(to_validate, patient_ref_file)
full_phs

### Get patient count for the specified consent groups
Now the consent groups will be used to find the patient counts currently in the integration environment.
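
For readers without access to `utils`, the expansion from bare phs numbers to consent groups can be pictured as a prefix match against the reference file. The function below is a hypothetical stand-in, not the actual `get_full_consent_vals` implementation, and it assumes consent values look like `phs000285.c1`.

```python
def expand_to_consent_groups(phs_numbers, consents):
    """Return the consent values that belong to the given phs numbers.

    Hypothetical stand-in for utils.get_full_consent_vals; the real helper's
    implementation is not shown in this notebook.
    """
    # Appending "." means "phs000285" matches "phs000285.c1"
    # but a shorter or different accession does not match by accident.
    prefixes = tuple(f"{phs}." for phs in phs_numbers)
    return [str(c) for c in consents if str(c).startswith(prefixes)]


# Toy consent column in the assumed format of the reference file
toy_consents = ["phs000285.c1", "phs000285.c2", "phs000956.c1"]
print(expand_to_consent_groups(["phs000285"], toy_consents))
```

In the notebook, the second argument would be the `consent` column, for example `patient_ref_file['consent']`.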

In [None]:
# Start a new query and initialize output dictionary
patient_count_query = int_resource.query()
output = {}

# Get patient counts for each consent group
for consentGroup in full_phs:
    print(consentGroup)
    patient_count_query.filter().delete("\\_consents\\") # Delete all consents
    patient_count_query.filter().add("\\_consents\\", consentGroup) # Add back consent group of interest
    patient_count = patient_count_query.getCount() # Get patient count
    output[consentGroup] = patient_count # Add to output

### Compare the values to the reference file
Finally we will compare the values from the reference file to the counts in the integration environment.
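
As an alternative to printing one message per consent group, the comparison can be collected into rows, which also flags consent groups that appear on only one side. This is a sketch under stated assumptions: `compare_counts` is a hypothetical helper, and both inputs are plain `{consent: count}` mappings.

```python
def compare_counts(expected, observed):
    """Compare two {consent: count} mappings and return one row per consent."""
    rows = []
    for consent in sorted(set(expected) | set(observed)):
        exp = expected.get(consent)  # None if absent from the reference file
        obs = observed.get(consent)  # None if absent from the environment
        rows.append({
            "consent": consent,
            "expected": exp,
            "observed": obs,
            "passes": exp is not None and exp == obs,
        })
    return rows


# Toy values; in the notebook `observed` would be the `output` dictionary
rows = compare_counts(
    {"phs000285.c1": 5, "phs000285.c2": 7},
    {"phs000285.c1": 5, "phs000285.c2": 6, "phs000703.c1": 1},
)
for row in rows:
    print(row)
```

The expected mapping could be built with `dict(zip(patient_ref_file['consent'], patient_ref_file['patient_count']))`, and `pd.DataFrame(rows)` gives a table that is easier to scan than printed messages.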

In [None]:
for consent_val in output.keys():
    ref_rows = patient_ref_file.loc[patient_ref_file['consent'] == consent_val, 'patient_count']
    ref_count = int(ref_rows.iloc[0]) # Count from reference file (first matching row)
    integration_count = output[consent_val] # Count from integration environment
    # Display result message
    if ref_count == integration_count:
        print(consent_val, "passes validation")
    else:
        print('***DID NOT PASS VALIDATION:', consent_val)
        print('Expected count from Patient_Count_Per_Consents.csv:\t', ref_count)
        print('Count retrieved from integration environment:\t', integration_count)

## Validation: Data dictionary comparison
The purpose of this section is to compare the data dictionaries of the production and integration environments. These data dictionaries should be identical apart from the studies that are being loaded or updated.

### Establish connection to production environment

In [None]:
production_PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
prod_resource_id = "02e23f52-f354-4e8b-992c-d37c8b9ba140"
production_token_file = "prod_token.txt"

In [None]:
with open(production_token_file, "r") as f:
    my_prod_token = f.read().strip()  # Drop any trailing newline or whitespace

In [None]:
prod_client = PicSureClient.Client()
prod_connection = prod_client.connect(production_PICSURE_network_URL, my_prod_token, True)
prod_adapter = PicSureBdcAdapter.Adapter(prod_connection)
prod_resource = prod_adapter.useResource(prod_resource_id)

### Load data dictionaries
Next we will load the data dictionaries from the integration and production environments as dataframes. These will be used to compare the environments.

In [None]:
int_dictionary = int_resource.dictionary()
prod_dictionary = prod_resource.dictionary()

In [None]:
integ = int_dictionary.find().DataFrame()

In [None]:
prod = prod_dictionary.find().DataFrame()

### Compare dictionaries
The following comparisons will be made between the dictionaries:
1. Find concept paths that exist in integration, but not production
2. Find concept paths that exist in production, but not integration
3. Identify differences in the dataframe between integration and production

The first two comparisons use the concept paths, which we will extract and compare now.
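
Conceptually, the concept-path comparisons are set differences. The sketch below is not the `compare_datadict_indices` implementation. It assumes that concept paths are available as strings (for instance from the dataframe index) and that paths for a new study contain its phs number. `paths_only_in` is a hypothetical name.

```python
def paths_only_in(left_paths, right_paths, ignore_substrings=()):
    """Return concept paths present in left_paths but absent from right_paths.

    Paths containing any of ignore_substrings (for example the phs numbers
    of newly loaded studies) are treated as expected and left out.
    """
    missing = set(left_paths) - set(right_paths)
    return sorted(
        p for p in missing
        if not any(s in p for s in ignore_substrings)
    )


# Toy concept paths; real paths come from the two data dictionaries
integ_paths = ["\\phs000001\\age\\", "\\phs000285\\sex\\", "\\phs000002\\bmi\\"]
prod_paths = ["\\phs000001\\age\\"]
unexpected = paths_only_in(integ_paths, prod_paths, ignore_substrings=["phs000285"])
print(unexpected)
```

Swapping the first two arguments gives the reverse direction, where no paths would normally be ignored because nothing should disappear from production.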

In [None]:
first_comparison = compare_datadict_indices(integ, prod, 1, to_validate)

In [None]:
second_comparison = compare_datadict_indices(prod, integ, 2)

The third comparison examines the contents of the data dictionaries. The following function iterates through each row of the data dictionary and compares the integration and production results.
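
To make the row-by-row idea concrete, here is a minimal sketch that works on plain dictionaries rather than dataframes. It is not the `compare_datadicts` implementation, which also takes the study and consent lists into account. `diff_records` is a hypothetical name, and each input is assumed to map a concept path to a `{field: value}` row.

```python
def diff_records(left, right):
    """Return (key, field, left_value, right_value) for each differing field.

    Only keys present in both mappings are compared; keys present on one
    side only are the subject of the concept-path comparisons.
    """
    diffs = []
    for key in sorted(set(left) & set(right)):
        l_row, r_row = left[key], right[key]
        for field in sorted(set(l_row) | set(r_row)):
            l_val, r_val = l_row.get(field), r_row.get(field)
            if l_val != r_val:
                diffs.append((key, field, l_val, r_val))
    return diffs


# Toy rows; the field names here are illustrative only
integ_rows = {"\\a\\": {"observationCount": 3, "categorical": True},
              "\\b\\": {"observationCount": 9}}
prod_rows = {"\\a\\": {"observationCount": 4, "categorical": True}}
differences = diff_records(integ_rows, prod_rows)
print(differences)
```

A dataframe can be converted to this shape with `DataFrame.to_dict(orient="index")`. Note that `NaN` values compare as unequal to each other, so missing values would need to be filled or handled separately before a comparison like this is meaningful.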

In [None]:
harmonized = get_full_consent_vals(to_validate_harmonized, patient_ref_file)
topmed = get_full_consent_vals(to_validate_topmed, patient_ref_file)

In [None]:
to_check = compare_datadicts(integ, prod, to_validate, full_phs, harmonized, topmed, patient_ref_file)

In [None]:
to_check

In [None]:
# For next QA process:
# Run counts for each value under _studies_consents
# Create a table for each environment
# Compare the tables