# BioData Catalyst Powered by PIC-SURE: Identify stigmatizing variables

This notebook identifies stigmatizing variables in [BioData Catalyst Powered by PIC-SURE](https://picsure.biodatacatalyst.nhlbi.nih.gov/). Specifically, stigmatizing variables are identified in PIC-SURE Authorized Access so that they can be removed from PIC-SURE Open Access.

For more information about stigmatizing variables, please view the [README.md](https://github.com/hms-dbmi/biodata_catalyst_stigmatizing_variables#biodata_catalyst_stigmatizing_variables).

### Install and import packages

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import re
from collections import Counter
from pprint import pprint
import json

In [None]:
import sys
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git

In [None]:
import PicSureClient
import PicSureBdcAdapter
from python_lib.utils import get_multiIndex_variablesDict, joining_variablesDict_onCol
from python_lib.stig_utils import *

### Connect to PIC-SURE

In [None]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
resource_id = "02e23f52-f354-4e8b-992c-d37c8b9ba140" # Be sure to use Authorized Access resource ID
token_file = "token.txt" # Be sure to use developer token to get all variables

In [None]:
with open(token_file, "r") as f:
    my_token = f.read().strip()  # drop surrounding whitespace such as a trailing newline

In [None]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, my_token, True)
adapter = PicSureBdcAdapter.Adapter(connection)
resource = adapter.useResource(resource_id)

### Save all variables in PIC-SURE Authorized Access to DataFrame

In [None]:
fullVariableDict = resource.dictionary().find().DataFrame()
#fullVariableDict
multiindex = get_multiIndex_variablesDict(fullVariableDict)

In [None]:
multiindex # potentially explore categoryValues

### Stigmatizing variables using `simplified_name`

#### Sex history filtering
The following terms are used to filter out sex history variables:
- sex
- sex history
- sexual
- sexually
- intercourse
- coitus
- copulation
- pareunia
- futunio
- venery

In [None]:
sex_history_terms = ['sex', 'sex history', 'sexual', 'sexually', 'intercourse', 
                     'coitus', 'copulation', 'pareunia', 'futunio', 'venery']
sex_remove = ['sex', 'sex of participant']
terms_to_filter = ['race and sex adjusted']

In [None]:
sex_stig_vars, ex_sex_vars = check_simplified_name(sex_history_terms, multiindex, exclude_vars=sex_remove)

In [None]:
final_sex_vars = regex_filter_out(sex_stig_vars, terms_to_filter)

In [None]:
print(len(sex_stig_vars))
print(len(final_sex_vars))
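
The internals of `check_simplified_name` and `regex_filter_out` live in `python_lib.stig_utils` and are not shown in this notebook. As a rough illustration only, the two-step approach used above (flag names that contain a term, then drop names that match known false-positive phrases) could look like the sketch below. The functions `flag_by_terms` and `drop_false_positives` and the sample names are hypothetical; they are not the library's implementation.

```python
import re

def flag_by_terms(names, terms, exclude=()):
    """Return names containing any term (case-insensitive substring match),
    skipping names that exactly equal an excluded value."""
    excluded = {e.lower() for e in exclude}
    flagged = []
    for name in names:
        lowered = name.lower()
        if lowered in excluded:
            continue
        if any(term in lowered for term in terms):
            flagged.append(name)
    return flagged

def drop_false_positives(names, phrases):
    """Remove names that contain any of the given literal phrases."""
    if not phrases:
        return list(names)
    regex = re.compile("|".join(re.escape(p) for p in phrases), re.IGNORECASE)
    return [n for n in names if not regex.search(n)]

# Hypothetical variable names, for illustration only
sample_names = ["Sex", "Sex of participant", "Age at first sexual intercourse",
                "FEV1 race and sex adjusted", "Systolic blood pressure"]

flagged = flag_by_terms(sample_names, ["sex", "sexual", "intercourse"],
                        exclude=["sex", "sex of participant"])
final = drop_false_positives(flagged, ["race and sex adjusted"])
print(flagged)
print(final)
```

In this sketch the exclusion list removes exact demographic names such as "Sex", while the phrase filter removes longer names that only contain a term incidentally.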

#### Sexually transmitted disease diagnosis/history/treatment filtering
The following terms are used to filter out variables related to sexually transmitted disease:
- chlamydia
- genital
- herpes
- gonorrhea
- HIV
- AIDS
- HPV
- pubic lice
- syphilis
- trichomoniasis
- estrogens
- vagina
- progesterone
- venereal
- penis
- antiviral

Should estrogen and progesterone be on here?

In [None]:
sex_disease_terms = ['chlamydia', 'genital', 'herpes', 'gonorrhea', 'hiv', 
                     'aids', 'hpv', 'pubic lice', 'syphilis', 'trichomoniasis', 
                     'estrogens', 'vagina', 'progesterone', 'venereal', 'penis', 
                     'antiviral']
terms_to_filter = ['hives', 'health aids', 'nsaids', 'herpes zoster', 'chlamydia pneumoniae', 'heart disease']

In [None]:
sex_disease_stig_vars, ex_sex_disease_vars = check_simplified_name(sex_disease_terms, multiindex)

In [None]:
final_sex_disease_vars = regex_filter_out(sex_disease_stig_vars, terms_to_filter)

In [None]:
print(len(sex_disease_stig_vars))
print(len(final_sex_disease_vars))

#### Mental health diagnoses/history/treatment filtering
The following terms are used to filter out variables related to mental health:
- depression
- depressive
- anxiety
- panic
- phobias
- schizophrenia
- mental (and NOT supplemental/instrumental/environmental etc)
- mental health
- psycho
- psychological
- emotional health
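
The note on "mental" above describes a whole-word requirement. One possible way to express that, shown here as a sketch and not as what `check_simplified_name` actually does, is a regular expression with `\b` word boundaries. The variable names below are hypothetical.

```python
import re

# Hypothetical variable names, for illustration only
names = ["Mental health score", "Supplemental oxygen use",
         "Instrumental activities of daily living",
         "Environmental tobacco exposure",
         "Ever treated for mental illness"]

# \b matches at word boundaries, so "mental" inside a longer word is ignored
whole_word = re.compile(r"\bmental\b", re.IGNORECASE)

substring_hits = [n for n in names if "mental" in n.lower()]
whole_word_hits = [n for n in names if whole_word.search(n)]

print(len(substring_hits))
print(whole_word_hits)
```

A plain substring test flags every name in the sample, while the word-boundary pattern keeps only the two names that use "mental" as a standalone word.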

In [None]:
mental_health_terms = ['depression', 'depressive', 'anxiety', 'panic', 'phobias', 'schizophrenia',
                       'mental', 'mental health', 'psycho', 'psychological', 'emotional health']
terms_to_filter = ['hispanic', 'electrocardiograph', 'minn code', 'minnesota code', 
                   'ecg', 'environmental', 'instrumental', 'mini-mental state exam']

In [None]:
mental_health_stig_vars, ex_mental_health_vars = check_simplified_name(mental_health_terms, multiindex)

In [None]:
final_mental_health_vars = regex_filter_out(mental_health_stig_vars, terms_to_filter)

In [None]:
print(len(mental_health_stig_vars))
print(len(final_mental_health_vars))

#### Illicit drug use history filtering
The following terms are used to filter out variables related to illicit drug use:
- illicit
- street drug
- abuse
- illegal
- fentanyl
- cocaine
- ecstasy
- LSD
- methamphetamine
- heroin
- phencyclidine
- angel dust
- rohypnol
- roofies
- ketamine hydrochloride
- psilocybin
- mushroom
- krokodil
- marijuana
- salvia
- bath salts
- flakka
- ayahuasca
- DMT
- central nervous system depressant
- hallucinogen
- inhalant
- khat
- kratom
- mescaline
- loperamide
- dextromethorphan
- opioid
- stimulant
- cannabinoid
- gamma hydroxybutyrate
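- depressants
- alcohol (currently commented out in the code below)
- steroid (currently commented out in the code below)
- tobacco (currently commented out in the code below)
- nicotine (currently commented out in the code below)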

In [None]:
illicit_drug_terms = ['illicit', 'street drug', 'abuse', 'illegal', 'fentanyl', 
                      'cocaine', 'ecstasy', 'lsd', 'methamphetamine', 'heroin', 
                      'phencyclidine', 'angel dust', 'rohypnol', 'roofies', 
                      'ketamine hydrochloride', 'psilocybin', 'mushroom', 'krokodil', 
                      'marijuana', 'salvia','bath salts', 'flakka', 'ayahuasca', 'dmt', 
                      'central nervous system depressant', 'hallucinogen', 'inhalant', 'khat', 
                      'kratom', 'mescaline', 'loperamide', 'dextromethorphan','opioid', 
                      'stimulant', 'cannabinoid', 'gamma hydroxybutyrate', 'depressants']#, 
                      #'alcohol', 'steroid', 'tobacco', 'nicotine']
terms_to_filter = ['coffee or tea']

In [None]:
illicit_drug_stig_vars, ex_illicit_drug_vars = check_simplified_name(illicit_drug_terms, multiindex)

In [None]:
final_illicit_drug_vars = regex_filter_out(illicit_drug_stig_vars, terms_to_filter)

In [None]:
print(len(illicit_drug_stig_vars))
print(len(final_illicit_drug_vars))

#### Intellectual achievement/ability/educational attainment filtering
The following terms are used to filter out variables related to intellectual achievement:
(Note from Rui: no genetics IQ outcomes)
- bachelor
- master
- phd
- quotient
- intellectual
- intelligence
- achievement
- disability
- ability - maybe
- attainment
- education
- genetic iq
- school

In [None]:
intell_ability_terms = ['bachelor', 'master', 'phd', 'quotient', 'intellectual', 'intelligence',
                        'achievement', 'disability', 'ability', 'attainment', 'education', 'genetic iq', 'school']
terms_to_filter = ['change in ability to', 'how ability to', 'ability to', 
                   'variability', 'gradability', 'reliability', 'acceptability', 
                   'irritability', 'leg ability', 'physical ability']

In [None]:
intell_ability_stig_vars, ex_intell_ability_vars = check_simplified_name(intell_ability_terms, multiindex)

In [None]:
final_intell_ability_vars = regex_filter_out(intell_ability_stig_vars, terms_to_filter)

In [None]:
print(len(intell_ability_stig_vars))
print(len(final_intell_ability_vars))

In [None]:
final_intell_ability_vars

#### Direct or surrogate identifiers of legal status filtering
The following terms are used to filter out variables related to legal status:
- villainage
- villeinage
- citizenship
- marital
- married
- unmarried
- single
- divorced
- widowed
- minority
- nonage
- marriage
- matrimony
- spousal
- civil union
- wedlock
- bachelorhood
- spinsterhood
- widowhood
- ethnicity
- nationality
- race
- death
- identifier
- identity
- surrogate
- legal status

In [None]:
legal_status_terms = ['villainage', 'villeinage', 'citizenship', 'marital', 
                      'married', 'unmarried', 'single', 'divorced', 'widowed', 
                      'minority', 'nonage', 'marriage', 'matrimony', 'spousal',  
                      'civil union', 'wedlock', 'bachelorhood', 'spinsterhood',
                      'widowhood', 'ethnicity', 'nationality', 'race', 'death', 
                      'identifier', 'identity', 'surrogate', 'legal status']#, 
                      #'ethnicity', 'race', 'nationality', 'death']
#legal_status_remove = ['subject identifier']
terms_to_filter = ['single tennis', 'single ventricular', 'single nodule', 
                   'urinalysis: albumin', 'brace', 'contraceptive', 
                   'race and sex adjusted', 'single sup']

In [None]:
legal_status_stig_vars, ex_legal_status_vars = check_simplified_name(legal_status_terms, multiindex)

In [None]:
final_legal_status_vars = regex_filter_out(legal_status_stig_vars, terms_to_filter)

In [None]:
print(len(legal_status_stig_vars))
print(len(final_legal_status_vars))

### Manual review of potentially stigmatizing variables

In [None]:
test_list = ["\\Women's Health Initiative Clinical Trial and Observational Study ( phs000200 )\\The subject sample mapping data table includes a mapping of subject IDs to sample IDs. Included are samples from WHI SHARe, GO-ESP, GARNET, CARe, PAGE, Imputation, WHIMS+, and BA23 CHD. Samples are the final preps submitted for genotyping, sequencing, or expression data. For example, if one patient (subject ID) gave one sample, and that sample was processed differently to generate 2 sequencing runs, there would be two rows, both using the same subject ID, but having 2 unique sample IDs. The data table also includes a mapping of sample IDs to other sample ID aliases, the substudy (phs accession) that the sample belongs to, and sample use.\\Sample use. Array_DNA_Methylation: Genome-wide DNA methylation profiling using methylation arrays, quantitative methylation measurements at the single-CpG-site level; Array_SNP: SNP genotypes obtained using standard or custom microarrays; Array_miRNA_Expression: Expression data for microRNA samples (array data); Imputation_SNP: Imputed SNP genotypes; PCR_DNA_SNP: SNP genotypes obtained using PCR amplified DNA; Seq_DNA_SNP: SNP genotypes derived from sequence data; Seq_DNA_WholeExome: Whole exome sequencing; Seq_DNA_WholeGenome: Whole genome sequencing\\",
 "\\Women's Health Initiative Clinical Trial and Observational Study ( phs000200 )\\UNC Heart Failure Details (Main, Ext1, Ext2)\\F136 Thoracentesis\\"]
test_ex = ['\\NHLBI Cleveland Family Study (CFS) Candidate Gene Association Resource (CARe) ( phs000284 )\\CARe_CFS (Cleveland Family Study) - Sleep and Health Phenotype (Adults/Children)\\Cause of death 1\\',
 '\\NHLBI Cleveland Family Study (CFS) Candidate Gene Association Resource (CARe) ( phs000284 )\\CARe_CFS (Cleveland Family Study) - Sleep and Health Phenotype (Adults/Children)\\Cause of death 2\\']

In [None]:
stigs, exs = manual_check(test_list)#, test_ex)

In [None]:
stigs
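
The variable paths in `test_list` and `test_ex` are backslash-delimited, with the study first and the variable description last. As a minimal sketch, assuming the final segment is the part of interest for review (an assumption; the helpers `path_segments` and `last_segment` are hypothetical and not part of `python_lib`), a path can be split like this:

```python
def path_segments(variable_path):
    """Split a backslash-delimited variable path into its non-empty parts."""
    return [part for part in variable_path.split("\\") if part]

def last_segment(variable_path):
    """Return the final non-empty segment of the path, or '' if there is none."""
    parts = path_segments(variable_path)
    return parts[-1] if parts else ""

example = ("\\NHLBI Cleveland Family Study (CFS) Candidate Gene Association "
           "Resource (CARe) ( phs000284 )\\CARe_CFS (Cleveland Family Study) - "
           "Sleep and Health Phenotype (Adults/Children)\\Cause of death 1\\")

parts = path_segments(example)
print(len(parts))
print(last_segment(example))
```

Reading only the last segment can make a long list of candidate variables quicker to scan during manual review.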

### Export potentially stigmatizing variables

In [None]:
def final_export(sex, sex_disease, mental_disease, illicit_drug, intell_ability, legal_status):
    # Map each section label to its list of variables
    labels = {'***SEX STIG VARS***': sex, '***SEX DISEASE STIG VARS***': sex_disease, 
              '***MENTAL STIG VARS***': mental_disease, '***ILLICIT DRUG STIG VARS***': illicit_drug, 
              '***INTELL ABILITY STIG VARS***': intell_ability, '***LEGAL STATUS STIG VARS***': legal_status}
    final = []
    for label, variables in labels.items():
        final.append(label)
        for var in variables:
            # Skip variables already listed under an earlier label
            if var not in final:
                final.append(var)
    pd.DataFrame(final).to_csv("stig_vars.tsv", sep='\t')
    return "Finished."

In [None]:
final_export(sex_stig_vars, sex_disease_stig_vars, mental_health_stig_vars, 
             illicit_drug_stig_vars, intell_ability_stig_vars, legal_status_stig_vars)
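
The export step flattens the labelled groups into a single column and skips variables that already appeared under an earlier label. The sketch below shows the same idea using a `set` for the duplicate check and the standard-library `csv` module writing to an in-memory buffer. The function `build_export_rows` and the sample paths are hypothetical and only illustrate the technique.

```python
import csv
import io

def build_export_rows(groups):
    """Flatten {label: variables} into one list: each label followed by its
    variables, skipping any variable already emitted under an earlier label."""
    seen = set()
    rows = []
    for label, variables in groups.items():
        rows.append(label)
        for var in variables:
            if var not in seen:
                seen.add(var)
                rows.append(var)
    return rows

# Hypothetical groups, for illustration only
groups = {
    "***SEX STIG VARS***": ["\\study A\\sexual history\\"],
    "***LEGAL STATUS STIG VARS***": ["\\study A\\marital status\\",
                                     "\\study A\\sexual history\\"],
}
rows = build_export_rows(groups)

# Write one value per line to an in-memory TSV buffer
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
for row in rows:
    writer.writerow([row])
tsv_text = buffer.getvalue()
print(tsv_text)
```

Tracking seen variables in a `set` keeps each membership check constant-time, which may matter when the lists contain many thousands of variable paths.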

In [None]:
test[67]

In [None]:
sex_stig_vars[67]