# BioData Catalyst Powered by PIC-SURE: Validate stigmatizing variables

The purpose of this notebook is to validate stigmatizing variables in [BioData Catalyst Powered by PIC-SURE](https://picsure.biodatacatalyst.nhlbi.nih.gov/). Specifically, it checks that the variables identified as stigmatizing have been removed from PIC-SURE Open Access.

For more information about stigmatizing variables, please view the [README.md](https://github.com/hms-dbmi/biodata_catalyst_stigmatizing_variables#biodata_catalyst_stigmatizing_variables).

### Install packages

In [None]:
# Install the PIC-SURE client and adapters first so that the imports below succeed.
import sys
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git

In [None]:
import PicSureClient
import PicSureBdcAdapter
from python_lib.utils import get_multiIndex_variablesDict, joining_variablesDict_onCol
from python_lib.stig_utils import validate_stig_vars

In [None]:
import pandas as pd

### Connect to PIC-SURE

In [None]:
# TODO: confirm whether the integration environment should be used here.
PICSURE_network_URL = "https://biodatacatalyst.integration.hms.harvard.edu/picsure"
resource_id = "70c837be-5ffc-11eb-ae93-0242ac130002"  # Be sure to use the Open Access resource ID
token_file = "token.txt"

In [None]:
with open(token_file, "r") as f:
    my_token = f.read().strip()  # drop any trailing newline or whitespace from the token

In [None]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, my_token, True)
adapter = PicSureBdcAdapter.Adapter(connection)
resource = adapter.useResource(resource_id)

### Get concept paths from PIC-SURE Open Access and list of stigmatizing variables

In [None]:
fullVariableDict = resource.dictionary().find().keys()
fullVariableDict

In [None]:
# Uncomment and replace the placeholder below with the path to the stigmatizing variables CSV.
#stigvar = pd.read_csv('./NAME OF STIGVARS.CSV HERE', header=None).values.tolist()
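
Note that `.values.tolist()` on a single-column CSV yields a list of one-element lists rather than a flat list of strings. The sketch below shows the difference and one way to get a flat list. It assumes the CSV holds one concept path per row with no header; the in-memory CSV and its paths are hypothetical placeholders, not real PIC-SURE data.

```python
import io
import pandas as pd

# Hypothetical stand-in for the stigmatizing variables CSV
# (assumed layout: one concept path per row, no header).
example_csv = io.StringIO(
    "\\Study A ( phs000000 )\\Form 1\\Variable X\\\n"
    "\\Study B ( phs000001 )\\Form 2\\Variable Y\\\n"
)

# .values.tolist() keeps the row structure: [[path], [path], ...]
nested = pd.read_csv(example_csv, header=None).values.tolist()

# Selecting the first column gives a flat list of path strings instead.
example_csv.seek(0)
flat = pd.read_csv(example_csv, header=None).iloc[:, 0].tolist()

print(nested[0])  # a one-element list
print(flat[0])    # a plain string
```

Whether the flat or nested form is appropriate depends on what `validate_stig_vars` expects, which is not shown in this notebook.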

### Validation testing

In [None]:
test = ['\\Multi-Ethnic Study of Atherosclerosis (MESA) SHARe ( phs000209 )\\MESA Lung Ancillary Study Air New Recruit Dataset: This dataset provides Lung CT scan data for MESA Air New Recruit participants enrolled in the MESA Lung Ancillary Study.\\LEFT LUNG: DEFINITION OF EMPHYSEMA CUTOFF VALUE (HU) VALUES LESS THAN THIS ARE CONSIDERED EMPHYSEMA\\',
 '\\Framingham Cohort ( phs000007 )\\Lab Work\\Blood\\Hematologic\\NEUROLOGICAL FINDINGS: LOCALIZED MUSCLE WEAKNESS\\',
 '\\Cardiovascular Health Study (CHS) Cohort: an NHLBI-funded observational study of risk factors for cardiovascular disease in adults 65 years or older ( phs000287 )\\Data contain extensive medical history information of subjects (all > 65 years of age)\\2 HR INSULIN (IU/ml)\\',
 '\\Framingham Cohort ( phs000007 )\\Tests\\X-ray\\ECG: SUPRAVENTRICULAR-TACHYCARDIA\\',
 '\\NHLBI Atherosclerosis Risk in Communities (ARIC) Candidate Gene Association Resource (CARe) ( phs000280 )\\Cohort Event Eligibility Form\\Hospital discharge dx or procedure codes Q10c\\',
 '\\Framingham Cohort ( phs000007 )\\Tests\\Bone Study\\Baseline exam: length of first hand (Left-Right unknown) middle phalanx 5 (pinkie)\\',
 '\\NHLBI Atherosclerosis Risk in Communities (ARIC) Candidate Gene Association Resource (CARe) ( phs000280 )\\Anthropometry Form, Visit 5\\Q2a. Self report. Self reported weight [Anthropometry Form]\\',
 '\\Framingham Cohort ( phs000007 )\\Tests\\Bone Study\\X-RAY: AFTER, GENERALIZED CARDIAC ENLARGEMENT\\',
 '\\NHLBI Atherosclerosis Risk in Communities (ARIC) Candidate Gene Association Resource (CARe) ( phs000280 )\\Cohort Event Eligibility Form\\Hospital discharge dx or procedure codes Q10h\\']

In [None]:
test1 = ['\\lets see what\\we can find here\\shall we?']

In [None]:
test_results = validate_stig_vars(fullVariableDict, test1)
test_results
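
For intuition, the check can be thought of as a set-membership test between the stigmatizing variables and the Open Access concept paths. The sketch below is not the implementation of `validate_stig_vars`, whose source is not shown here; `find_remaining_stig_vars` and the example paths are hypothetical and used only for illustration.

```python
def find_remaining_stig_vars(concept_paths, stig_vars):
    """Return, sorted, the stigmatizing variables still present among the concept paths.

    Hypothetical helper for illustration; an empty result means every listed
    variable is absent from the dictionary.
    """
    present = set(concept_paths)
    return sorted(path for path in set(stig_vars) if path in present)


# Hypothetical data: dict keys stand in for resource.dictionary().find().keys().
example_paths = {"\\Study A\\Var 1\\": None, "\\Study A\\Var 2\\": None}.keys()
example_stig = ["\\Study A\\Var 2\\", "\\Study B\\Var 9\\"]

remaining = find_remaining_stig_vars(example_paths, example_stig)
print(remaining)
```

This comparison is an exact string match, so differences in case, whitespace, or trailing backslashes would cause a path to be missed.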

In [None]:
# Materialize the dictionary keys as a list before building the DataFrame.
fullvar = pd.DataFrame(list(fullVariableDict), columns=['Key'])
fullvar
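
Once the concept paths are in a DataFrame, the same check can be done in tabular form with `Series.isin`. This is a minimal sketch with hypothetical data: `example_fullvar`, `example_stig`, and the `is_stigmatizing` column are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame of concept paths.
example_fullvar = pd.DataFrame(
    ["\\Study A\\Var 1\\", "\\Study A\\Var 2\\", "\\Study B\\Var 3\\"],
    columns=["Key"],
)
# Hypothetical list of stigmatizing variable paths.
example_stig = ["\\Study A\\Var 2\\"]

# Flag every concept path that exactly matches a stigmatizing variable.
example_fullvar["is_stigmatizing"] = example_fullvar["Key"].isin(example_stig)

# Rows that remain here are stigmatizing variables still present in the dictionary.
flagged = example_fullvar[example_fullvar["is_stigmatizing"]]
print(flagged)
```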