# BioData Catalyst Powered by PIC-SURE: Identify stigmatizing variables

The purpose of this notebook is to identify stigmatizing variables in [BioData Catalyst Powered by PIC-SURE](https://picsure.biodatacatalyst.nhlbi.nih.gov/). Specifically, stigmatizing variables are identified in PIC-SURE Authorized Access so that they can be removed from PIC-SURE Open Access.

For more information about stigmatizing variables, please view the [README.md](https://github.com/hms-dbmi/biodata_catalyst_stigmatizing_variables#biodata_catalyst_stigmatizing_variables).

### Install packages

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import re
from collections import Counter
from pprint import pprint
import json

In [None]:
import sys
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git

In [None]:
import PicSureClient
import PicSureBdcAdapter
from python_lib.utils import get_multiIndex_variablesDict, joining_variablesDict_onCol

### Connect to PIC-SURE

In [None]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
resource_id = "02e23f52-f354-4e8b-992c-d37c8b9ba140" # Be sure to use Authorized Access resource ID
token_file = "token.txt" # Be sure to use developer token to get all variables

In [None]:
with open(token_file, "r") as f:
    my_token = f.read().strip()  # drop any trailing newline or whitespace around the token

In [None]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, my_token, True)
adapter = PicSureBdcAdapter.Adapter(connection)
resource = adapter.useResource(resource_id)

### Save all variables in PIC-SURE Authorized Access to DataFrame

In [None]:
fullVariableDict = resource.dictionary().find().DataFrame()
#fullVariableDict
multiindex = get_multiIndex_variablesDict(fullVariableDict)

In [None]:
multiindex # potentially explore categoryValues

### Stigmatizing variables using `simplified_name`

In [None]:
def check_simplified_name(varlist, multiindex_df, exclude_vars=None):
    """Return (stigmatizing, excluded) lists of full variable names.

    A variable is flagged when any term in varlist matches its simplified name
    (case-insensitive regex). It is excluded instead when its simplified name
    exactly equals an entry in exclude_vars.
    """
    exclude = {ex.lower() for ex in (exclude_vars or [])}
    stig_var_list = []
    excluded_var_list = []
    # Iterate over every row, including the first.
    for simple_name, name in zip(multiindex_df['simplified_name'], multiindex_df['name']):
        if not any(re.search(var, simple_name, re.IGNORECASE) for var in varlist):
            continue
        if simple_name.lower() in exclude:
            if name not in excluded_var_list:
                excluded_var_list.append(name)
        elif name not in stig_var_list:
            stig_var_list.append(name)
    return stig_var_list, excluded_var_list

In [None]:
def regex_filter_out(stig_vars, terms_to_filter):
    """Drop variables whose last path segment matches any term in terms_to_filter."""
    kept = []
    for var_path in stig_vars:
        # The simplified name is the last backslash-delimited segment of the path.
        simple_var = var_path.strip('\\').split('\\')[-1]
        if not any(re.search(term, simple_var, re.IGNORECASE) for term in terms_to_filter):
            kept.append(var_path)
    return kept
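
The two steps above (match on search terms, then drop known false positives) can be sketched on toy data as follows. This is a minimal illustration rather than the notebook's implementation: the DataFrame contents, the variable paths, and the helper name `match_then_filter` are invented, and the sketch assumes terms are meant as plain substrings, so it escapes them before building the pattern.

```python
import re
import pandas as pd

# Hypothetical stand-in for the variable dictionary; paths and names are invented.
toy = pd.DataFrame({
    "name": [
        "\\study\\exam\\Sex of participant\\",
        "\\study\\exam\\Age at first sexual intercourse\\",
        "\\study\\labs\\Race and sex adjusted FEV1\\",
        "\\study\\labs\\Height\\",
    ],
})
# Last path segment, mirroring the split used in regex_filter_out.
toy["simplified_name"] = toy["name"].map(lambda p: p.strip("\\").split("\\")[-1])


def match_then_filter(df, include_terms, exact_excludes=(), false_positive_terms=()):
    """Return full names whose simplified name matches a term and survives both filters."""
    simple = df["simplified_name"]
    include = "|".join(re.escape(t) for t in include_terms)
    mask = simple.str.contains(include, case=False, regex=True)
    # Exact-name exclusions (case-insensitive).
    mask &= ~simple.str.lower().isin([e.lower() for e in exact_excludes])
    # Substring-based false-positive removal.
    if false_positive_terms:
        drop = "|".join(re.escape(t) for t in false_positive_terms)
        mask &= ~simple.str.contains(drop, case=False, regex=True)
    return df.loc[mask, "name"].tolist()


flagged = match_then_filter(
    toy,
    ["sex", "intercourse"],
    exact_excludes=["sex of participant"],
    false_positive_terms=["race and sex adjusted"],
)
print(flagged)
```

Only the "Age at first sexual intercourse" path remains: "Sex of participant" is removed by the exact-name exclusion and the adjusted lab value by the false-positive term.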

#### Sex history filtering
The following terms are used to filter out sex history variables:
- sex
- sex history
- sexual
- sexually
- intercourse
- coitus
- copulation
- pareunia
- futunio
- venery

In [None]:
sex_history_terms = ['sex', 'sex history', 'sexual', 'sexually', 'intercourse', 
                     'coitus', 'copulation', 'pareunia', 'futunio', 'venery']
sex_remove = ['sex', 'sex of participant']
terms_to_filter = ['race and sex adjusted']

In [None]:
sex_stig_vars, ex_sex_vars = check_simplified_name(sex_history_terms, multiindex, exclude_vars=sex_remove)

In [None]:
final_sex_vars = regex_filter_out(sex_stig_vars, terms_to_filter)

In [None]:
print(len(sex_stig_vars))
print(len(final_sex_vars))

#### Sexually transmitted disease diagnosis/history/treatment filtering
The following terms are used to filter out variables related to sexually transmitted disease:
- chlamydia
- genital
- herpes
- gonorrhea
- HIV
- AIDS
- HPV
- pubic lice
- syphilis
- trichomoniasis
- estrogens
- vagina
- progesterone
- venereal
- penis
- antiviral

In [None]:
sex_disease_terms = ['chlamydia', 'genital', 'herpes', 'gonorrhea', 'hiv', 
                     'aids', 'hpv', 'pubic lice', 'syphilis', 'trichomoniasis', 
                     'estrogens', 'vagina', 'progesterone', 'venereal', 'penis', 
                     'antiviral']
terms_to_filter = ['hives', 'health aids', 'nsaids', 'herpes zoster', 'chlamydia pneumoniae', 'heart disease']

In [None]:
sex_disease_stig_vars, _ = check_simplified_name(sex_disease_terms, multiindex)

In [None]:
len(sex_disease_stig_vars)

In [None]:
final = regex_filter_out(sex_disease_stig_vars, terms_to_filter)

In [None]:
len(final)
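
To see why terms such as `hives` or `nsaids` have to be filtered back out, it can help to list which search term triggered each match. The sketch below uses a hypothetical helper, `matched_terms`, and invented variable names. It is meant as an auditing aid, not as part of the pipeline above.

```python
import re


def matched_terms(simplified_name, terms):
    """Return the search terms that occur in simplified_name (case-insensitive)."""
    return [t for t in terms
            if re.search(re.escape(t), simplified_name, re.IGNORECASE)]


terms = ["hiv", "aids", "herpes"]
# Invented example names, chosen to show substring collisions.
examples = ["History of hives", "Use of NSAIDs", "HIV test result", "Herpes zoster vaccine"]

report = {name: matched_terms(name, terms) for name in examples}
for name, hits in report.items():
    print(f"{name!r}: {hits}")
```

In this toy run, "History of hives" is flagged only because `hiv` is a substring of "hives", and "Use of NSAIDs" only because `aids` is a substring of "NSAIDs".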

#### Mental health diagnoses/history/treatment filtering
The following terms are used to filter out variables related to mental health:
- depression
- depressive
- anxiety
- panic
- phobia
- schizophrenia
- mental (and NOT supplemental/instrumental/environmental etc)
- mental health
- psycho
- psychological
- emotional health

In [None]:
mental_disease_terms = ['depression', 'depressive', 'anxiety', 'panic', 'phobias', 'schizophrenia',
                       'mental', 'mental health', 'psycho', 'psychological', 'emotional health']
terms_to_filter = ['hispanic', 'electrocardiograph', 'minn code', 'minnesota code', 
                   'ecg', 'environmental', 'instrumental', 'mini-mental state exam']

In [None]:
mental_disease_stig_vars, _ = check_simplified_name(mental_disease_terms, multiindex)

In [None]:
final = regex_filter_out(mental_disease_stig_vars, terms_to_filter)

In [None]:
print(len(mental_disease_stig_vars))
print(len(final))

#### Illicit drug use history filtering
The following terms are used to filter out variables related to illicit drug use:
- illicit
- street drug
- abuse
- illegal
- fentanyl
- cocaine
- ecstasy
- LSD
- methamphetamine
- heroin
- phencyclidine
- angel dust
- rohypnol
- roofies
- ketamine hydrochloride
- psilocybin
- mushroom
- krokodil
- marijuana
- salvia
- bath salts
- flakka
- ayahuasca
- DMT
- central nervous system depressant
- hallucinogen
- inhalant
- khat
- kratom
- mescaline
- loperamide
- dextromethorphan
- opioid
- stimulant
- cannabinoid
- gamma hydroxybutyrate
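- depressants
- alcohol
- steroid
- tobacco
- nicotine

The last four terms (alcohol, steroid, tobacco, nicotine) are commented out in the term list below and are not currently searched.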

In [None]:
illicit_drug_terms = ['illicit', 'street drug', 'abuse', 'illegal', 'fentanyl', 
                      'cocaine', 'ecstasy', 'lsd', 'methamphetamine', 'heroin', 
                      'phencyclidine', 'angel dust', 'rohypnol', 'roofies', 
                      'ketamine hydrochloride', 'psilocybin', 'mushroom', 'krokodil', 
                      'marijuana', 'salvia','bath salts', 'flakka', 'ayahuasca', 'dmt', 
                      'central nervous system depressant', 'hallucinogen', 'inhalant', 'khat', 
                      'kratom', 'mescaline', 'loperamide', 'dextromethorphan','opioid', 
                      'stimulant', 'cannabinoid', 'gamma hydroxybutyrate', 'depressants']
# Not currently searched: 'alcohol', 'steroid', 'tobacco', 'nicotine'

In [None]:
# check_simplified_name returns (flagged, excluded); keep only the flagged list.
illicit_drug_stig_vars, _ = check_simplified_name(illicit_drug_terms, multiindex)

#### Intellectual achievement/ability/educational attainment filtering
The following terms are used to filter out variables related to intellectual achievement:
(Note from Rui: no genetics IQ outcomes)
- bachelor
- master
- phd
- quotient
- intellectual
- intelligence
- achievement
- disability
- ability - maybe
- attainment
- education
- genetic iq
- school

In [None]:
intell_ability_terms = ['bachelor', 'master', 'phd', 'quotient', 'intellectual', 'intelligence',
                        'achievement', 'disability', 'ability', 'attainment', 'education', 'genetic iq', 'school']

In [None]:
# check_simplified_name returns (flagged, excluded); keep only the flagged list.
intell_ability_stig_vars, _ = check_simplified_name(intell_ability_terms, multiindex)

#### Direct or surrogate identifiers of legal status filtering
The following terms are used to filter out variables related to legal status:
- villainage
- villeinage
- citizenship
- marital
- married
- unmarried
- single
- divorced
- widowed
- minority
- nonage
- marriage
- matrimony
- spousal
- civil union
- wedlock
- bachelorhood
- spinsterhood
- widowhood
- ethnicity
- nationality
- race
- death

In [None]:
# Terms are used as regular expressions, so the period in " no." is escaped.
legal_status_terms = ['villainage', 'villeinage', 'citizenship', 'marital', 
                      'married', 'unmarried', 'single', 'divorced', 'widowed', 
                      'minority', 'nonage', 'marriage', 'matrimony', 'spousal',  
                      'civil union', 'wedlock', 'bachelorhood', 'spinsterhood',
                      'widowhood', 'ethnicity', 'nationality', 'race', 'death', 
                      'identifier', 'identity', r' no\.', 'surrogate', 'legal status']

In [None]:
# check_simplified_name returns (flagged, excluded); keep only the flagged list.
legal_status_stig_vars, _ = check_simplified_name(legal_status_terms, multiindex)
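
Because each term is passed to `re.search` as a regular expression, punctuation inside a term is interpreted as regex syntax. The sketch below, using invented variable names, compares an unescaped `.` with the literal match produced by `re.escape`.

```python
import re

# Invented example names.
names = ["Participant ID no. 2", "Reason for no response", "Not married"]

raw_term = " no."  # as a regex, "." matches any single character

loose = [n for n in names if re.search(raw_term, n, re.IGNORECASE)]
strict = [n for n in names if re.search(re.escape(raw_term), n, re.IGNORECASE)]

print("unescaped:", loose)
print("escaped:  ", strict)
```

The unescaped term also catches "Reason for no response", because the `.` matches the space after "no". The escaped term matches only the literal abbreviation.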

### Export potentially stigmatizing variables

In [None]:
def final_export(sex, sex_disease, mental_disease, illicit_drug, intell_ability, legal_status):
    """Write all flagged variables to stig_vars.tsv, each group preceded by its label row."""
    labels = {'***SEX STIG VARS***': sex, '***SEX DISEASE STIG VARS***': sex_disease, 
              '***MENTAL STIG VARS***': mental_disease, '***ILLICIT DRUG STIG VARS***': illicit_drug, 
              '***INTELL ABILITY STIG VARS***': intell_ability, '***LEGAL STATUS STIG VARS***': legal_status}
    final = []
    for label, variables in labels.items():
        final.append(label)
        for var in variables:
            # Skip variables already listed under an earlier category.
            if var not in final:
                final.append(var)
    pd.DataFrame(final).to_csv("stig_vars.tsv", sep='\t')
    return "Finished."

In [None]:
final_export(sex_stig_vars, sex_disease_stig_vars, mental_disease_stig_vars, 
             illicit_drug_stig_vars, intell_ability_stig_vars, legal_status_stig_vars)
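
An alternative export layout writes one row per variable with its category in a separate column, which tends to be easier to filter downstream than label rows mixed in with the data. The sketch below is an assumption-laden illustration: the helper `export_tidy`, the column names, and the example paths are hypothetical, and it writes to an in-memory buffer rather than to `stig_vars.tsv`.

```python
import io
import pandas as pd


def export_tidy(category_to_vars, path_or_buffer):
    """Write (category, variable) rows as TSV and return the table."""
    rows = [
        (category, var)
        for category, variables in category_to_vars.items()
        for var in dict.fromkeys(variables)  # de-duplicate, preserving order
    ]
    table = pd.DataFrame(rows, columns=["category", "variable"])
    table.to_csv(path_or_buffer, sep="\t", index=False)
    return table


# Invented categories and paths.
buffer = io.StringIO()
table = export_tidy(
    {"sex_history": ["\\a\\x\\", "\\a\\x\\"], "mental_health": ["\\a\\y\\"]},
    buffer,
)
print(buffer.getvalue())
```

Unlike `final_export`, this sketch de-duplicates only within a category, so a variable flagged under two categories appears once under each.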

In [None]:
sex_stig_vars[67]