# BioData Catalyst Powered by PIC-SURE: Validate stigmatizing variables

This notebook validates the handling of stigmatizing variables in [BioData Catalyst Powered by PIC-SURE](https://picsure.biodatacatalyst.nhlbi.nih.gov/). Specifically, it verifies that the variables identified as stigmatizing have been removed from PIC-SURE Open Access.

For more information about stigmatizing variables, please view the [README.md](https://github.com/hms-dbmi/biodata_catalyst_stigmatizing_variables#biodata_catalyst_stigmatizing_variables).

### Install packages

In [1]:
import PicSureClient
import PicSureBdcAdapter
from python_lib.utils import get_multiIndex_variablesDict, joining_variablesDict_onCol
from python_lib.stig_utils import validate_stig_vars

In [2]:
import sys
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git

Collecting git+https://github.com/hms-dbmi/pic-sure-python-client.git
  Cloning https://github.com/hms-dbmi/pic-sure-python-client.git to /tmp/pip-req-build-8j2fb1e5
  Running command git clone -q https://github.com/hms-dbmi/pic-sure-python-client.git /tmp/pip-req-build-8j2fb1e5
Building wheels for collected packages: PicSureClient
  Building wheel for PicSureClient (setup.py) ... done
  Created wheel for PicSureClient: filename=PicSureClient-0.1.0-py2.py3-none-any.whl size=10300 sha256=214f698d32a8b122cf80f45d35fb575484b57352160df3084a9e791f94378b5c
  Stored in directory: /tmp/pip-ephem-wheel-cache-vuadbfx3/wheels/31/ef/21/e362bba8de04e0072fafec9f77bd1abdf7e166213d27e98729
Successfully built PicSureClient
Installing collected packages: PicSureClient
  Attempting uninstall: PicSureClient
    Found existing installation: PicSureClient 0.1.0
    Uninstalling PicSureClient-0.1.0:
      Successfully uninstalled PicSureClient-0.1.0
Successfully installed PicSureClient-0.1.0
Coll

In [3]:
import pandas as pd

### Connect to PIC-SURE

In [4]:
# Should integration environment be used?
PICSURE_network_URL = "https://biodatacatalyst.integration.hms.harvard.edu/picsure" 
resource_id = "70c837be-5ffc-11eb-ae93-0242ac130002" # Be sure to use Open Access resource id
token_file = "token.txt"

In [5]:
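# Read the security token and strip surrounding whitespace,
# such as a trailing newline left by a text editor.
with open(token_file, "r") as f:
    my_token = f.read().strip()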

In [6]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, my_token, True)
adapter = PicSureBdcAdapter.Adapter(connection)
resource = adapter.useResource(resource_id)


|        certificates to be acceptable for connections.  This may be useful for           |
|        working in a development environment or on systems that host public              |
|        data.  BEST SECURITY PRACTICES ARE THAT IF YOU ARE WORKING WITH SENSITIVE        |
|        DATA THEN ALL SSL CERTS BY THOSE EVIRONMENTS SHOULD NOT BE SELF-SIGNED.          |
+--------------------------------------+------------------------------------------------------+
|  Resource UUID                       |  Resource Name                                       |
+--------------------------------------+------------------------------------------------------+
| 02e23f52-f354-4e8b-992c-d37c8b9ba140 |                                                      |
| 70c837be-5ffc-11eb-ae93-0242ac130002 |                                                      |
+--------------------------------------+------------------------------------------------------+


### Get concept paths from PIC-SURE Open Access

In [7]:
fullVariableDict = resource.dictionary().find().keys()
fullVariableDict

['\\Multi-Ethnic Study of Atherosclerosis (MESA) SHARe ( phs000209 )\\MESA Lung Ancillary Study Exam 3 Dataset: This dataset provides Lung CT scan data for MESA Classic participants enrolled in the MESA Lung Ancillary Study.\\RIGHT LUNG, LOWER: THE INTERCEPT OF THE LINE AT THE ANKLE\\',
 '\\Framingham Cohort ( phs000007 )\\Tests\\ECG\\TREATMENT FOR VARICOSE VEINS (LEFT)\\',
 '\\Framingham Cohort ( phs000007 )\\Lab Work\\Blood\\Hematologic\\SYSTOLIC MURMUR: BASE GRADE\\',
 '\\Framingham Cohort ( phs000007 )\\Lab Work\\Blood\\Hematologic\\IF YES TO G3A143 OR G3A144: HOW MANY YEARS HAVE YOU BROUGHT PHLEGM UP FROM YOUR CHEST ON MOST DAYS?\\',
 '\\NHLBI Atherosclerosis Risk in Communities (ARIC) Candidate Gene Association Resource (CARe) ( phs000280 )\\Echocardiogram, (Jackson Only), Exam 3\\Pulmonary regurgitation 20 [Cohort (Jackson only), Exam 3]\\',
 "\\Framingham Cohort ( phs000007 )\\Clinic Questionnaire (Interview and Physical Exam)\\Clinic Exam Questionnaire\\MD Interview, Physical 

In [8]:
test = ['\\Multi-Ethnic Study of Atherosclerosis (MESA) SHARe ( phs000209 )\\MESA Lung Ancillary Study Air New Recruit Dataset: This dataset provides Lung CT scan data for MESA Air New Recruit participants enrolled in the MESA Lung Ancillary Study.\\LEFT LUNG: DEFINITION OF EMPHYSEMA CUTOFF VALUE (HU) VALUES LESS THAN THIS ARE CONSIDERED EMPHYSEMA\\',
 '\\Framingham Cohort ( phs000007 )\\Lab Work\\Blood\\Hematologic\\NEUROLOGICAL FINDINGS: LOCALIZED MUSCLE WEAKNESS\\',
 '\\Cardiovascular Health Study (CHS) Cohort: an NHLBI-funded observational study of risk factors for cardiovascular disease in adults 65 years or older ( phs000287 )\\Data contain extensive medical history information of subjects (all > 65 years of age)\\2 HR INSULIN (IU/ml)\\',
 '\\Framingham Cohort ( phs000007 )\\Tests\\X-ray\\ECG: SUPRAVENTRICULAR-TACHYCARDIA\\',
 '\\NHLBI Atherosclerosis Risk in Communities (ARIC) Candidate Gene Association Resource (CARe) ( phs000280 )\\Cohort Event Eligibility Form\\Hospital discharge dx or procedure codes Q10c\\',
 '\\Framingham Cohort ( phs000007 )\\Tests\\Bone Study\\Baseline exam: length of first hand (Left-Right unknown) middle phalanx 5 (pinkie)\\',
 '\\NHLBI Atherosclerosis Risk in Communities (ARIC) Candidate Gene Association Resource (CARe) ( phs000280 )\\Anthropometry Form, Visit 5\\Q2a. Self report. Self reported weight [Anthropometry Form]\\',
 '\\Framingham Cohort ( phs000007 )\\Tests\\Bone Study\\X-RAY: AFTER, GENERALIZED CARDIAC ENLARGEMENT\\',
 '\\NHLBI Atherosclerosis Risk in Communities (ARIC) Candidate Gene Association Resource (CARe) ( phs000280 )\\Cohort Event Eligibility Form\\Hospital discharge dx or procedure codes Q10h\\']

In [None]:
test_results = validate_stig_vars(fullVariableDict, )
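
The implementation of `validate_stig_vars` is not shown in this notebook. As a rough illustration of the kind of check being validated, the sketch below reports which flagged concept paths still appear among the Open Access dictionary keys. The function name `find_remaining_stig_vars` and the toy concept paths are hypothetical and are not part of `python_lib.stig_utils`.

```python
def find_remaining_stig_vars(open_access_paths, stig_paths):
    """Return the flagged concept paths that are still present in Open Access.

    Hypothetical helper for illustration; not the notebook's validate_stig_vars.
    """
    available = set(open_access_paths)  # set gives constant-time membership checks
    # Keep the order of stig_paths so the report is reproducible.
    return [path for path in stig_paths if path in available]


# Toy data (hypothetical): two dictionary keys and two flagged concept paths.
open_access_paths = [
    "\\Study A ( phs000001 )\\Labs\\EXAMPLE LAB VALUE\\",
    "\\Study A ( phs000001 )\\Survey\\EXAMPLE FLAGGED VARIABLE\\",
]
stig_paths = [
    "\\Study A ( phs000001 )\\Survey\\EXAMPLE FLAGGED VARIABLE\\",
    "\\Study B ( phs000002 )\\Survey\\ANOTHER FLAGGED VARIABLE\\",
]

remaining = find_remaining_stig_vars(open_access_paths, stig_paths)
print(f"{len(remaining)} flagged variable(s) still present in Open Access")
```

An empty result would indicate that none of the flagged paths remain in the dictionary. The comparison is an exact string match, so any difference in spacing or escaping between the two lists would cause a path to be missed.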

In [None]:
fullvar = pd.DataFrame(fullVariableDict, columns=['Key'])
fullvar
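
Once the keys are in a DataFrame, it can help to split each concept path into its components, for example to count variables per study. The sketch below assumes that concept paths are backslash-delimited, that the first component is the study, and that the last component is the variable name, which matches the keys printed above. The helper `concept_paths_to_frame` is a hypothetical name, and the sample paths are copied from the dictionary output shown earlier.

```python
import pandas as pd


def concept_paths_to_frame(paths):
    """Build a DataFrame with the full key, the study, and the variable name."""
    rows = []
    for path in paths:
        # Paths start and end with a backslash, so drop the empty pieces.
        parts = [piece for piece in path.split("\\") if piece]
        rows.append({
            "Key": path,
            "Study": parts[0] if parts else None,      # assumed: first component
            "Variable": parts[-1] if parts else None,  # assumed: last component
        })
    return pd.DataFrame(rows, columns=["Key", "Study", "Variable"])


sample_paths = [
    "\\Framingham Cohort ( phs000007 )\\Tests\\ECG\\TREATMENT FOR VARICOSE VEINS (LEFT)\\",
    "\\Framingham Cohort ( phs000007 )\\Lab Work\\Blood\\Hematologic\\SYSTOLIC MURMUR: BASE GRADE\\",
    "\\NHLBI Atherosclerosis Risk in Communities (ARIC) Candidate Gene Association Resource (CARe) ( phs000280 )\\Cohort Event Eligibility Form\\Hospital discharge dx or procedure codes Q10c\\",
]

frame = concept_paths_to_frame(sample_paths)
study_counts = frame["Study"].value_counts()  # number of variables per study
print(study_counts)
```

This split would be wrong for any variable name that itself contains a backslash, so treat the `Study` and `Variable` columns as a convenience view and keep `Key` as the authoritative identifier.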