# BioData Catalyst Powered by PIC-SURE: Identify stigmatizing variables

The purpose of this notebook is to identify stigmatizing variables in [BioData Catalyst Powered by PIC-SURE](https://picsure.biodatacatalyst.nhlbi.nih.gov/). Specifically, stigmatizing variables will be identified in PIC-SURE Authorized Access and removed for PIC-SURE Open Access.

For more information about stigmatizing variables, please view the [README.md](https://github.com/hms-dbmi/biodata_catalyst_stigmatizing_variables#biodata_catalyst_stigmatizing_variables).

### Install packages

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import re
from collections import Counter
from pprint import pprint
import json

In [None]:
import sys
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git

In [None]:
import PicSureClient
import PicSureBdcAdapter
from python_lib.utils import get_multiIndex_variablesDict, joining_variablesDict_onCol
from python_lib.stig_utils import check_simplified_name, regex_filter_out, manual_check, go_through_df

### Connect to PIC-SURE

In [None]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
resource_id = "02e23f52-f354-4e8b-992c-d37c8b9ba140" # Be sure to use Authorized Access resource ID
token_file = "token.txt" # Be sure to use developer token to get all variables

In [None]:
with open(token_file, "r") as f:
    my_token = f.read().strip()  # drop any trailing newline or whitespace from the token file

In [None]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, my_token, True)
adapter = PicSureBdcAdapter.Adapter(connection)
resource = adapter.useResource(resource_id)

### Save all variables in PIC-SURE Authorized Access to DataFrame

In [None]:
fullVariableDict = resource.dictionary().find().DataFrame()
#fullVariableDict
multiindex = get_multiIndex_variablesDict(fullVariableDict)

In [None]:
multiindex # potentially explore categoryValues

### Identify stigmatizing variables using `simplified_name`

There are two functions to identify stigmatizing variables: `check_simplified_name` and `regex_filter_out`. 

`check_simplified_name` selects all variables from the `multiindex` dataframe whose `simplified_name` contains any of the terms in the given list. Its optional argument `exclude_vars` removes variables whose `simplified_name` exactly matches one of the listed names.

For example, 

`check_simplified_name(['bio', 'data', 'catalyst'], multiindex, ['biology variable'])`

would find all variables where the `simplified_name` contains 'bio', 'data', and/or 'catalyst' but excludes `simplified_name`s equal to 'biology variable' (ignoring capitalization).
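
The matching rule described above can be sketched on a plain list of names. This is only an illustration, not the code in `python_lib.stig_utils`: the helper `match_simplified_names` and the sample names are hypothetical, and the sketch works on a list of strings rather than on the `multiindex` dataframe.

```python
def match_simplified_names(terms, names, exclude_vars=None):
    """Hypothetical sketch: substring search with exact-match exclusions."""
    terms = [t.lower() for t in terms]
    excluded_names = {e.lower() for e in (exclude_vars or [])}
    kept, excluded = [], []
    for name in names:
        lowered = name.lower()
        if not any(term in lowered for term in terms):
            continue  # none of the search terms occurs in this name
        if lowered in excluded_names:
            excluded.append(name)  # the whole name equals an excluded name
        else:
            kept.append(name)
    return kept, excluded


# Made-up names for illustration only.
names = ["Biology variable", "Biopsy date", "Age at exam", "Data source"]
kept, excluded = match_simplified_names(
    ["bio", "data", "catalyst"], names, ["biology variable"]
)
print(kept)
print(excluded)
```

In this toy run, "Biology variable" is matched by 'bio' but then set aside because its full name equals an excluded name, while "Biopsy date" and "Data source" are kept.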

`regex_filter_out` takes the list of potentially stigmatizing variables and removes any whose `simplified_name` contains one of the given terms. Unlike `check_simplified_name`, where an excluded variable must match the `simplified_name` completely, this function excludes a variable if the term is *contained* in the `simplified_name`.

For example,

`regex_filter_out(['biodata catalyst', 'terra', 'helicobacter pylori'], ['ter'])`

would exclude all variables containing '*ter*'. In this case, '*ter*ra' and 'helicobac*ter* pylori' would be removed.
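
A containment filter of this kind can be sketched with the standard `re` module. The helper name `filter_out_containing` is hypothetical, and the actual `regex_filter_out` in `python_lib.stig_utils` may be implemented differently; the sketch assumes the input is a plain list of names.

```python
import re


def filter_out_containing(names, terms_to_filter):
    """Hypothetical sketch: drop every name that contains any filter term."""
    if not terms_to_filter:
        return list(names)  # nothing to filter, keep everything
    # Escape each term so it is matched literally, then join into one pattern.
    pattern = re.compile(
        "|".join(re.escape(term) for term in terms_to_filter), re.IGNORECASE
    )
    return [name for name in names if not pattern.search(name)]


remaining = filter_out_containing(
    ["biodata catalyst", "terra", "Helicobacter pylori"], ["ter"]
)
print(remaining)
```

Only "biodata catalyst" survives here, because the other two names contain "ter".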

| Function | Arguments / Input | Output|
|--------|-------------------|-------|
| check_simplified_name() | (1) list of search terms, (2) multiindex dataframe, (3) optional: variables to exclude | (1) list of potentially stigmatizing variables, (2) variables excluded using provided criteria|
| regex_filter_out() | (1) list of stigmatizing variables, (2) list of terms to filter | list of stigmatizing variables that do not contain any of the terms to filter |

#### Sex history filtering
The following terms are used to filter out sex history variables:

<table border="0">
    <tr>
        <td>sex</td>
        <td>sex history</td>
    </tr>
    <tr>
        <td>sexual</td>
        <td>sexually</td>
    </tr>
    <tr>
        <td>intercourse</td>
        <td>coitus</td>
    </tr>
    <tr>
        <td>copulation</td>
        <td>pareunia</td>
    </tr>
    <tr>
        <td>futunio</td>
        <td>venery</td>
    </tr>
</table>

The following `simplified_name` variables are excluded:
<table border="0">
    <tr>
        <td>sex</td>
    </tr>
    <tr>
        <td>sex of participant</td>
    </tr>
</table>

`simplified_name` variables containing the following terms are excluded:
<table border="0">
    <tr>
        <td>race and sex adjusted</td>
    </tr>
</table>

In [None]:
sex_history_terms = ['sex', 'sex history', 'sexual', 'sexually', 'intercourse', 
                     'coitus', 'copulation', 'pareunia', 'futunio', 'venery']
sex_remove = ['sex', 'sex of participant']
terms_to_filter = ['race and sex adjusted']

In [None]:
sex_stig_vars, ex_sex_vars = check_simplified_name(sex_history_terms, multiindex, exclude_vars=sex_remove)

In [None]:
final_sex_vars = regex_filter_out(sex_stig_vars, terms_to_filter)

In [None]:
print("Total number of sex vars", len(sex_stig_vars))
print("After filtering", len(final_sex_vars))
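
When tuning a term list, it can help to see how many names each search term picks up. The tally below is a rough sketch on made-up names; it assumes the matched names are available as a plain list of strings, which may not be the shape of `sex_stig_vars`.

```python
from collections import Counter

# Made-up terms and names for illustration; not taken from PIC-SURE.
sample_terms = ["sex", "sexual", "intercourse"]
sample_matched = [
    "sexual activity past year",
    "age at first intercourse",
    "race and sex adjusted score",
]

# Count, for every search term, how many names contain it.
hits_per_term = Counter(
    term
    for name in sample_matched
    for term in sample_terms
    if term in name.lower()
)
print(hits_per_term.most_common())
```

Because matching is by containment, any name found by 'sexual' is also found by 'sex', so the longer term adds no new matches in this toy list.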

#### Sexually transmitted disease diagnosis/history/treatment filtering
The following terms are used to filter out variables related to sexually transmitted disease:

<table border="0">
    <tr>
        <td>chlamydia</td>
        <td>genital</td>
    </tr>
    <tr>
        <td>herpes</td>
        <td>gonorrhea</td>
    </tr>
    <tr>
        <td>HIV</td>
        <td>AIDS</td>
    </tr>
    <tr>
        <td>pubic lice</td>
        <td>syphilis</td>
    </tr>
    <tr>
        <td>trichomoniasis</td>
        <td>vagina</td>
    </tr>
    <tr>
        <td>progesterone</td>
        <td>estrogens</td>
    </tr>
    <tr>
        <td>HPV</td>
        <td>venereal</td>
    </tr>
    <tr>
        <td>penis</td>
        <td>antiviral</td>
    </tr>
</table>

***Should estrogen and progesterone be on here?***

`simplified_name` variables containing the following terms are excluded:
<table border="0">
    <tr>
        <td>hives</td>
        <td>health aids</td>
    </tr>
    <tr>
        <td>nsaids</td>
        <td>herpes zoster</td>
    </tr>
    <tr>
        <td>chlamydia pneumoniae</td>
        <td>heart disease</td>
    </tr>
    <tr>
        <td>walking aid</td>
        <td>archive</td>
    </tr>
    <tr>
        <td>shiver</td>
    </tr>
</table>
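
Several of these exclusions are needed because the search is a substring match: a short term such as 'hiv' or 'aids' also occurs inside unrelated words. The check below illustrates this on made-up names (they are not taken from PIC-SURE) using plain string containment.

```python
search_terms = ["hiv", "aids"]

# Made-up names for illustration only.
sample_names = [
    "history of hives",
    "archive date",
    "shivering at night",
    "use of nsaids",
    "hiv test result",
]

# Show which search term is found inside each name.
for name in sample_names:
    hits = [term for term in search_terms if term in name.lower()]
    print(f"{name!r} is matched by {hits}")

# Removing names that contain a known false-positive term leaves the real hit.
false_positive_terms = ["hives", "archive", "shiver", "nsaids"]
remaining = [
    name
    for name in sample_names
    if not any(term in name.lower() for term in false_positive_terms)
]
print(remaining)
```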

In [None]:
sex_disease_terms = ['chlamydia', 'genital', 'herpes', 'gonorrhea', 'hiv', 
                     'aids', 'hpv', 'pubic lice', 'syphilis', 'trichomoniasis', 
                     'estrogens', 'vagina', 'progesterone', 'venereal', 'penis', 
                     'antiviral']
terms_to_filter = ['hives', 'health aids', 'nsaids', 'herpes zoster', 'chlamydia pneumoniae', 
                   'heart disease', 'walking aid', 'archive', 'shiver']

In [None]:
sex_disease_stig_vars, ex_sex_disease_vars = check_simplified_name(sex_disease_terms, multiindex)

In [None]:
final_sex_disease_vars = regex_filter_out(sex_disease_stig_vars, terms_to_filter)

In [None]:
print("Total number of sex disease vars", len(sex_disease_stig_vars))
print("After filtering", len(final_sex_disease_vars))

#### Mental health diagnoses/history/treatment filtering
The following terms are used to filter out variables related to mental health:

<table border="0">
    <tr>
        <td>depression</td>
        <td>depressive</td>
    </tr>
    <tr>
        <td>anxiety</td>
        <td>panic</td>
    </tr>
    <tr>
        <td>phobia</td>
        <td>schizophrenia</td>
    </tr>
    <tr>
        <td>mental</td>
        <td>mental health</td>
    </tr>
    <tr>
        <td>psycho</td>
        <td>psychological</td>
    </tr>
    <tr>
        <td>emotional health</td>
        <td></td>
    </tr>
</table>

`simplified_name` variables containing the following terms are excluded:
<table border="0">
    <tr>
        <td>hispanic</td>
        <td>electrocardiograph</td>
    </tr>
    <tr>
        <td>minn code</td>
        <td>minnesota code</td>
    </tr>
    <tr>
        <td>ecg</td>
        <td>environmental</td>
    </tr>
    <tr>
        <td>instrumental</td>
        <td>mini-mental state exam</td>
    </tr>
</table>

In [None]:
mental_health_terms = ['depression', 'depressive', 'anxiety', 'panic', 'phobia', 'schizophrenia',
                       'mental', 'mental health', 'psycho', 'psychological', 'emotional health']
terms_to_filter = ['hispanic', 'electrocardiograph', 'minn code', 'minnesota code', 
                   'ecg', 'environmental', 'instrumental', 'mini-mental state exam']

In [None]:
mental_health_stig_vars, ex_mental_health_vars = check_simplified_name(mental_health_terms, multiindex)

In [None]:
final_mental_health_vars = regex_filter_out(mental_health_stig_vars, terms_to_filter)

In [None]:
print("Total number of mental health vars", len(mental_health_stig_vars))
print("After filtering", len(final_mental_health_vars))

#### Illicit drug use history filtering
The following terms are used to filter out variables related to illicit drug use:

<table border="0">
    <tr>
        <td>illicit</td>
        <td>street drug</td>
        <td>rohypnol</td>
    </tr>
    <tr>
        <td>abuse</td>
        <td>illegal</td>
        <td>roofies</td>
    </tr>
    <tr>
        <td>fentanyl</td>
        <td>cocaine</td>
        <td>ketamine hydrochloride</td>
    </tr>
    <tr>
        <td>ecstasy</td>
        <td>LSD</td>
        <td>psilocybin</td>
    </tr>
    <tr>
        <td>methamphetamine</td>
        <td>heroin</td>
        <td>mushroom</td>
    </tr>
    <tr>
        <td>phencyclidine</td>
        <td>angel dust</td>
        <td>krokodil</td>
    </tr>
    <tr>
        <td>marijuana</td>
        <td>salvia</td>
        <td>bath salts</td>
    </tr>
    <tr>
        <td>flakka</td>
        <td>ayahuasca</td>
        <td>DMT</td>
    </tr>
    <tr>
        <td>central nervous system depressant</td>
        <td>hallucinogen</td>
        <td>inhalant</td>
    </tr>
    <tr>
        <td>khat</td>
        <td>kratom</td>
        <td>mescaline</td>
    </tr>
    <tr>
        <td>loperamide</td>
        <td>dextromethorphan</td>
        <td>opioid</td>
    </tr>
    <tr>
        <td>stimulant</td>
        <td>cannabinoid</td>
        <td>gamma hydroxybutyrate</td>
    </tr>
    <tr>
        <td>depressants</td>
        <td></td>
        <td></td>
    </tr>
</table>

`simplified_name` variables containing the following terms are excluded:
<table border="0">
    <tr>
        <td>coffee or tea</td>
    </tr>
</table>

In [None]:
illicit_drug_terms = ['illicit', 'street drug', 'abuse', 'illegal', 'fentanyl', 
                      'cocaine', 'ecstasy', 'lsd', 'methamphetamine', 'heroin', 
                      'phencyclidine', 'angel dust', 'rohypnol', 'roofies', 
                      'ketamine hydrochloride', 'psilocybin', 'mushroom', 'krokodil', 
                      'marijuana', 'salvia','bath salts', 'flakka', 'ayahuasca', 'dmt', 
                      'central nervous system depressant', 'hallucinogen', 'inhalant', 'khat', 
                      'kratom', 'mescaline', 'loperamide', 'dextromethorphan','opioid', 
                      'stimulant', 'cannabinoid', 'gamma hydroxybutyrate', 'depressants']
terms_to_filter = ['coffee or tea']

In [None]:
illicit_drug_stig_vars, ex_illicit_drug_vars = check_simplified_name(illicit_drug_terms, multiindex)

In [None]:
final_illicit_drug_vars = regex_filter_out(illicit_drug_stig_vars, terms_to_filter)

In [None]:
print("Total number of illicit drug vars", len(illicit_drug_stig_vars))
print("After filtering", len(final_illicit_drug_vars))

#### Intellectual achievement/ability/educational attainment filtering
The following terms are used to filter out variables related to intellectual achievement:
(Note from Rui: no genetics IQ outcomes)

<table border="0">
    <tr>
        <td>bachelor</td>
        <td>master</td>
    </tr>
    <tr>
        <td>phd</td>
        <td>quotient</td>
    </tr>
    <tr>
        <td>intellectual</td>
        <td>intelligence</td>
    </tr>
    <tr>
        <td>achievement</td>
        <td>disability</td>
    </tr>
    <tr>
        <td>ability</td>
        <td>attainment</td>
    </tr>
    <tr>
        <td>education</td>
        <td>genetic iq</td>
    </tr>
    <tr>
        <td>school</td>
        <td></td>
    </tr>
</table>

`simplified_name` variables containing the following terms are excluded:
<table border="0">
    <tr>
        <td>change in ability to</td>
        <td>how ability to</td>
    </tr>
    <tr>
        <td>ability to</td>
        <td>variability</td>
    </tr>
    <tr>
        <td>gradability</td>
        <td>reliability</td>
    </tr>
    <tr>
        <td>acceptability</td>
        <td>irritability</td>
    </tr>
    <tr>
        <td>leg ability</td>
        <td>physical ability</td>
    </tr>
</table>

In [None]:
intell_ability_terms = ['bachelor', 'master', 'phd', 'quotient', 'intellectual', 'intelligence',
                        'achievement', 'disability', 'ability', 'attainment', 'education', 'genetic iq', 'school']
terms_to_filter = ['change in ability to', 'how ability to', 'ability to', 
                   'variability', 'gradability', 'reliability', 'acceptability', 
                   'irritability', 'leg ability', 'physical ability']

In [None]:
intell_ability_stig_vars, ex_intell_ability_vars = check_simplified_name(intell_ability_terms, multiindex)

In [None]:
final_intell_ability_vars = regex_filter_out(intell_ability_stig_vars, terms_to_filter)

In [None]:
print("Total number of intellectual vars", len(intell_ability_stig_vars))
print("After filtering", len(final_intell_ability_vars))

In [None]:
final_intell_ability_vars

#### Direct or surrogate identifiers of legal status filtering
The following terms are used to filter out variables related to legal status:

<table border="0">
    <tr>
        <td>villainage</td>
        <td>villeinage</td>
    </tr>
    <tr>
        <td>citizenship</td>
        <td>marital</td>
    </tr>
    <tr>
        <td>married</td>
        <td>unmarried</td>
    </tr>
    <tr>
        <td>single</td>
        <td>divorced</td>
    </tr>
    <tr>
        <td>widowed</td>
        <td>minority</td>
    </tr>
    <tr>
        <td>nonage</td>
        <td>marriage</td>
    </tr>
    <tr>
        <td>matrimony</td>
        <td>spousal</td>
    </tr>
    <tr>
        <td>civil union</td>
        <td>wedlock</td>
    </tr>
    <tr>
        <td>bachelorhood</td>
        <td>spinsterhood</td>
    </tr>
    <tr>
        <td>widowhood</td>
        <td>ethnicity</td>
    </tr>
    <tr>
        <td>nationality</td>
        <td>race</td>
    </tr>
    <tr>
        <td>death</td>
        <td>identifier</td>
    </tr>
    <tr>
        <td>identity</td>
        <td>surrogate</td>
    </tr>
    <tr>
        <td>legal status</td>
        <td></td>
    </tr>
</table>

`simplified_name` variables containing the following terms are excluded:
<table border="0">
    <tr>
        <td>single tennis</td>
        <td>single ventricular</td>
    </tr>
    <tr>
        <td>single nodule</td>
        <td>urinalysis: albumin</td>
    </tr>
    <tr>
        <td>brace</td>
        <td>contraceptive</td>
    </tr>
    <tr>
        <td>race and sex adjusted</td>
        <td>single sup</td>
    </tr>
</table>

In [None]:
legal_status_terms = ['villainage', 'villeinage', 'citizenship', 'marital', 
                      'married', 'unmarried', 'single', 'divorced', 'widowed', 
                      'minority', 'nonage', 'marriage', 'matrimony', 'spousal',  
                      'civil union', 'wedlock', 'bachelorhood', 'spinsterhood',
                      'widowhood', 'ethnicity', 'nationality', 'race', 'death', 
                      'identifier', 'identity', 'surrogate', 'legal status']
#legal_status_remove = ['subject identifier']
terms_to_filter = ['single tennis', 'single ventricular', 'single nodule', 
                   'urinalysis: albumin', 'brace', 'contraceptive', 
                   'race and sex adjusted', 'single sup']

In [None]:
legal_status_stig_vars, ex_legal_status_vars = check_simplified_name(legal_status_terms, multiindex)

In [None]:
final_legal_status_vars = regex_filter_out(legal_status_stig_vars, terms_to_filter)

In [None]:
print("Total number of legal status vars", len(legal_status_stig_vars))
print("After filtering", len(final_legal_status_vars))

### Manual review of potentially stigmatizing variables

`manual_check` provides an interactive way to record whether each filtered variable is in fact stigmatizing. It takes the list of stigmatizing variables and an optional argument `ex_vars`, which adds a manual review of the excluded variables. It returns a dataframe of the stigmatizing variables with the recorded responses and, if applicable, a dataframe of the excluded variables with their recorded responses.

To use this function, simply call it on the list of filtered variables (and excluded variables if needed) and follow the interactive instructions.
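
The general shape of such an interactive review can be sketched as a loop that asks one question per variable. Everything here is hypothetical: `review_variables`, its prompt text, and the record keys are not taken from `python_lib.stig_utils`, and the sketch returns a list of records rather than a dataframe.

```python
def review_variables(names, ask=input):
    """Hypothetical sketch: ask a yes/no question for every variable name."""
    records = []
    for name in names:
        answer = ask(f"Is '{name}' stigmatizing? [y/n] ").strip().lower()
        records.append({"simplified_name": name, "stigmatizing": answer == "y"})
    return records


# Scripted answers stand in for keyboard input so the example runs unattended.
scripted_answers = iter(["y", "n"])
reviewed = review_variables(
    ["example variable a", "example variable b"],
    ask=lambda prompt: next(scripted_answers),
)
print(reviewed)
```

Passing the prompt function as an argument keeps the default interactive behavior (`input`) while allowing the loop to be exercised without a person at the keyboard.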

In [None]:
sex_stigs, ex_sex_stigs = manual_check(final_sex_vars)

In [None]:
sex_disease_stigs, ex_sex_disease_stigs = manual_check(final_sex_disease_vars)

In [None]:
mental_health_stigs, ex_mental_health_stigs = manual_check(final_mental_health_vars)

### Export potentially stigmatizing variables

In [None]:
sex_stigs.to_csv("sexual_health_stig_vars.tsv", sep='\t')

In [None]:
sex_disease_stigs.to_csv("sexual_disease_stig_vars.tsv", sep='\t')
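
To check an exported file, it can be read back with pandas. Because `to_csv` writes the dataframe index as the first column by default, `index_col=0` is needed when reading. The sketch below uses a made-up dataframe and an in-memory buffer instead of a file on disk; the column names are assumptions, not the actual output of `manual_check`.

```python
import io

import pandas as pd

# Made-up review results standing in for a dataframe such as `sex_stigs`.
reviewed = pd.DataFrame(
    {
        "simplified_name": ["example variable a", "example variable b"],
        "stigmatizing": [True, False],
    }
)

buffer = io.StringIO()
reviewed.to_csv(buffer, sep="\t")  # same options as the export cells, written to memory
buffer.seek(0)

# The first column of the TSV holds the saved index.
restored = pd.read_csv(buffer, sep="\t", index_col=0)
print(restored)
```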