# BioData Catalyst Powered by PIC-SURE: Validate stigmatizing variables

This notebook validates the removal of stigmatizing variables from [BioData Catalyst Powered by PIC-SURE](https://picsure.biodatacatalyst.nhlbi.nih.gov/). Specifically, it checks that the previously identified stigmatizing variables are no longer present in PIC-SURE Open Access.

For more information about stigmatizing variables, please view the [README.md](https://github.com/hms-dbmi/biodata_catalyst_stigmatizing_variables#biodata_catalyst_stigmatizing_variables).

### Install packages

In [5]:
import sys
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git

Collecting git+https://github.com/hms-dbmi/pic-sure-python-client.git
  Cloning https://github.com/hms-dbmi/pic-sure-python-client.git to /tmp/pip-req-build-mny3i762
  Running command git clone --filter=blob:none -q https://github.com/hms-dbmi/pic-sure-python-client.git /tmp/pip-req-build-mny3i762
  Resolved https://github.com/hms-dbmi/pic-sure-python-client.git to commit aabcc6574eede2dc3de410c6c75f7f77ea18d23c
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: PicSureClient
  Building wheel for PicSureClient (setup.py) ... done
  Created wheel for PicSureClient: filename=PicSureClient-0.1.0-py2.py3-none-any.whl size=10300 sha256=e0b8ade7ac6faeb4d0c18bd82bb1225df7974ffdcc44e65a534ba4580c4ebaf4
  Stored in directory: /tmp/pip-ephem-wheel-cache-v7n9z8eb/wheels/31/ef/21/e362bba8de04e0072fafec9f77bd1abdf7e166213d27e98729
Successfully built PicSureClient
Installing collected packages: PicSureClient
  Attempting uninstall: PicSureClient


In [6]:
import PicSureClient
import PicSureBdcAdapter
from python_lib.utils import get_multiIndex_variablesDict, joining_variablesDict_onCol
from python_lib.stig_utils import validate_stig_vars
import pandas as pd

### Connect to PIC-SURE

In [101]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure" 
resource_id = "70c837be-5ffc-11eb-ae93-0242ac130002" # Be sure to use Open Access resource id
token_file = "token.txt"

In [102]:
with open(token_file, "r") as f:
    my_token = f.read().strip()  # drop any trailing newline or whitespace

In [103]:
client = PicSureClient.Client()
# The third argument (True) allows self-signed SSL certificates; see the warning printed below.
connection = client.connect(PICSURE_network_URL, my_token, True)


|        certificates to be acceptable for connections.  This may be useful for           |
|        working in a development environment or on systems that host public              |
|        data.  BEST SECURITY PRACTICES ARE THAT IF YOU ARE WORKING WITH SENSITIVE        |
|        DATA THEN ALL SSL CERTS BY THOSE EVIRONMENTS SHOULD NOT BE SELF-SIGNED.          |
+--------------------------------------+------------------------------------------------------+
|  Resource UUID                       |  Resource Name                                       |
+--------------------------------------+------------------------------------------------------+
| 70c837be-5ffc-11eb-ae93-0242ac130002 | open-hpds                                            |
| ca0ad4a9-130a-3a8a-ae00-e35b07f1108b | visualization                                        |
| 02e23f52-f354-4e8b-992c-d37c8b9ba140 | auth-hpds                                            |
| 36363664-6231-6134-2d38-653865

### Get concept paths from PIC-SURE Open Access

To ensure that all stigmatizing variables were removed from PIC-SURE Open Access, we will compare the previously identified stigmatizing variables to a list of all variables in Open Access. 

In [104]:
bdc = PicSureBdcAdapter.Adapter(PICSURE_network_URL, my_token)
dictionary = bdc.useDictionary().dictionary()

+--------------------------------------+------------------------------------------------------+
|  Resource UUID                       |  Resource Name                                       |
+--------------------------------------+------------------------------------------------------+
| 70c837be-5ffc-11eb-ae93-0242ac130002 | open-hpds                                            |
| ca0ad4a9-130a-3a8a-ae00-e35b07f1108b | visualization                                        |
| 02e23f52-f354-4e8b-992c-d37c8b9ba140 | auth-hpds                                            |
| 36363664-6231-6134-2d38-6538652d3131 | dictionary                                           |
+--------------------------------------+------------------------------------------------------+


In [112]:
vars = dictionary.find("phs002752") # Replace with the study (or studies) being checked
vars = vars.dataframe()

#vars
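
The comment in the cell above mentions checking more than one study. A minimal sketch of one way to do that, assuming `dictionary.find()` is called once per study accession and each result exposes `.dataframe()` as it does above. The helper name `find_study_variables` is hypothetical and not part of the PIC-SURE libraries.

```python
import pandas as pd

def find_study_variables(dictionary, study_ids):
    """Hypothetical helper: search the dictionary once per study and stack the results."""
    # One dataframe per study, using the same find().dataframe() calls as the cell above.
    frames = [dictionary.find(study_id).dataframe() for study_id in study_ids]
    # pd.concat raises ValueError on an empty list, so pass at least one study ID.
    return pd.concat(frames)
```

With a live connection this could be called as `vars = find_study_variables(dictionary, ["phs002752"])`, adding further accessions to the list as needed.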

### Validation testing

`validate_stig_vars` compares the list of previously identified stigmatizing variables to the variables in PIC-SURE Open Access. If any stigmatizing variables are found in Open Access, the function saves them to the specified output file.

| Function | Arguments / Input | Output |
|----------|-------------------|--------|
| `validate_stig_vars()` | (1) `fullVariableDict` of Open Access variables, (2) tab-delimited list of stigmatizing variables (output of `identify_stigmatizing_variables.ipynb`), (3) output file name | List of stigmatizing variables found in Open Access, if any |
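
The comparison described above can be pictured with a short sketch. This is not the implementation in `python_lib.stig_utils`. The function name `find_remaining_stig_vars` and the `concept_path` column name are assumptions made for illustration, and the real input file and variable dictionary may use different column names.

```python
import pandas as pd

def find_remaining_stig_vars(open_access_df, stig_file, output_file,
                             path_col="concept_path"):
    """Illustrative only: return stigmatizing variables still present in Open Access."""
    # Read the tab-delimited list of previously identified stigmatizing variables.
    stig_df = pd.read_csv(stig_file, sep="\t")
    # Collect every variable path currently available in Open Access.
    open_paths = set(open_access_df[path_col])
    # Keep only the stigmatizing variables whose path still appears there.
    found = stig_df[stig_df[path_col].isin(open_paths)]
    if found.empty:
        print("No stigmatizing variables found in Open Access.")
    else:
        # Write the offending variables out only when there is something to report.
        found.to_csv(output_file, sep="\t", index=False)
        print(f"{len(found)} stigmatizing variable(s) found; saved to {output_file}")
    return found
```

A set-membership check like this depends on the paths in both inputs being formatted identically, so any normalization (case, trailing separators) would need to happen before the comparison.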

In [113]:
input_file = 'stigmatizing_variable_results/REVAMP_stigmatizing_variables.txt'
output_file = 'stigmatizing_variable_results/validation1.txt'

In [114]:
results = validate_stig_vars(vars, input_file, output_file)

No stigmatizing variables found in Open Access. Passed validation test.
