# BioData Catalyst Powered by PIC-SURE: Identify stigmatizing variables

The purpose of this notebook is to identify stigmatizing variables in [BioData Catalyst Powered by PIC-SURE](https://picsure.biodatacatalyst.nhlbi.nih.gov/). Specifically, stigmatizing variables will be identified in PIC-SURE Authorized Access and removed for PIC-SURE Open Access.

For more information about stigmatizing variables, please view the [README.md](https://github.com/hms-dbmi/biodata_catalyst_stigmatizing_variables#biodata_catalyst_stigmatizing_variables).

### Prerequisites
This notebook assumes knowledge of the BioData Catalyst Powered by PIC-SURE platform and API. For more information about the API, please visit the [Access to Data using PIC-SURE GitHub repository](https://github.com/hms-dbmi/Access-to-Data-using-PIC-SURE-API).

Developer login credentials or access to all data in PIC-SURE Authorized Access is also required to ensure all variables are reviewed. 

### Install and import packages

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import re
from collections import Counter
from pprint import pprint
import json
from shutil import copyfile

In [None]:
import sys
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-client.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-python-adapter-hpds.git
!{sys.executable} -m pip install --upgrade --force-reinstall git+https://github.com/hms-dbmi/pic-sure-biodatacatalyst-python-adapter-hpds.git

In [None]:
import PicSureClient
import PicSureBdcAdapter
from python_lib.utils import get_multiIndex_variablesDict, joining_variablesDict_onCol
from python_lib.stig_utils import check_simplified_name, regex_filter_out, manual_check

### Connect to PIC-SURE

In [None]:
PICSURE_network_URL = "https://picsure.biodatacatalyst.nhlbi.nih.gov/picsure"
resource_id = "02e23f52-f354-4e8b-992c-d37c8b9ba140" # Be sure to use Authorized Access resource ID
token_file = "token.txt" # Be sure to use developer token to get all variables

In [None]:
with open(token_file, "r") as f:
    my_token = f.read().strip()  # Strip surrounding whitespace, e.g. a trailing newline in the token file

In [None]:
client = PicSureClient.Client()
connection = client.connect(PICSURE_network_URL, my_token, True)
adapter = PicSureBdcAdapter.Adapter(connection)
resource = adapter.useResource(resource_id)

### Save all variables in PIC-SURE Authorized Access to DataFrame

In [None]:
fullVariableDict = resource.dictionary().find().DataFrame()
#fullVariableDict
multiindex = get_multiIndex_variablesDict(fullVariableDict)

In [None]:
fullVariableDict.head()

### Identify stigmatizing variables using `simplified_name`

There are two functions to identify stigmatizing variables: `check_simplified_name` and `regex_filter_out`. 

`check_simplified_name` selects all variables from the `multiindex` dataframe whose `simplified_name` contains any of the terms in the given list. It also takes an optional argument, `exclude_vars`, that removes variables whose `simplified_name` matches one of the specified names.

For example, 

`check_simplified_name(['bio', 'data', 'catalyst'], multiindex, ['biology variable'])`

would find all variables whose `simplified_name` contains 'bio', 'data', and/or 'catalyst', but would exclude any variable whose `simplified_name` equals 'biology variable' (ignoring capitalization).
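
The selection and exclusion behavior described above can be illustrated with a minimal sketch. This is not the implementation in `python_lib.stig_utils`: the function name `check_simplified_name_sketch` is hypothetical, and it operates on a plain list of names rather than the `multiindex` dataframe, assuming case-insensitive substring matching for search terms and case-insensitive exact matching for exclusions.

```python
def check_simplified_name_sketch(terms, simplified_names, exclude_vars=None):
    """Hypothetical stand-in for check_simplified_name, operating on a list of names."""
    # Exclusions must match the whole name (ignoring capitalization)
    excluded_set = {name.lower() for name in (exclude_vars or [])}
    selected, excluded = [], []
    for name in simplified_names:
        lowered = name.lower()
        # Skip names that contain none of the search terms
        if not any(term.lower() in lowered for term in terms):
            continue
        if lowered in excluded_set:
            excluded.append(name)
        else:
            selected.append(name)
    return selected, excluded


# Toy names, made up for illustration only
names = ["Biology variable", "Biodata score", "Catalyst dose", "Age at enrollment"]
selected, excluded = check_simplified_name_sketch(
    ["bio", "data", "catalyst"], names, ["biology variable"]
)
print(selected)  # ['Biodata score', 'Catalyst dose']
print(excluded)  # ['Biology variable']
```

'Age at enrollment' contains none of the search terms, so it appears in neither list.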

`regex_filter_out` takes the list of potentially stigmatizing variables and filters out any variable whose `simplified_name` contains one of the given terms. Unlike `check_simplified_name`, where an excluded variable must match the `simplified_name` completely, this function excludes a variable if the term is *contained* in the `simplified_name`.

For example,

`regex_filter_out(['biodata catalyst', 'terra', 'heliobacter pylori'], ['ter'])`

would exclude all variables containing '*ter*'. In this case, '*ter*ra' and 'heliobac*ter* pylori' would be removed.
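
The filtering behavior in this example can be sketched with Python's standard `re` module. The function `regex_filter_out_sketch` is a hypothetical stand-in, not the code in `python_lib.stig_utils`; case-insensitive matching is an assumption.

```python
import re


def regex_filter_out_sketch(variables, patterns):
    """Hypothetical stand-in: drop every variable matched by any pattern."""
    # Assumption: matching ignores capitalization
    compiled = [re.compile(pattern, flags=re.IGNORECASE) for pattern in patterns]
    # re.search finds the pattern anywhere in the string, i.e. a "contains" test
    return [v for v in variables if not any(c.search(v) for c in compiled)]


candidates = ["biodata catalyst", "terra", "heliobacter pylori"]

contains_ter = regex_filter_out_sketch(candidates, ["ter"])
print(contains_ter)  # ['biodata catalyst']

# A regular expression can narrow the filter, here to names that start with "ter"
starts_with_ter = regex_filter_out_sketch(candidates, [r"^ter"])
print(starts_with_ter)  # ['biodata catalyst', 'heliobacter pylori']
```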


***Note:*** `regex_filter_out` ***accepts regular expressions as filter terms, while the names in*** `exclude_vars` ***for*** `check_simplified_name` ***must match the full*** `simplified_name` ***(ignoring capitalization).***

| Function | Arguments / Input | Output |
|--------|-------------------|-------|
| `check_simplified_name()` | (1) list of search terms, (2) multiindex dataframe, (3) optional: variables to exclude | (1) list of potentially stigmatizing variables, (2) variables excluded using the provided criteria |
| `regex_filter_out()` | (1) list of stigmatizing variables, (2) list of terms or regular expressions to filter | list of stigmatizing variables that do not contain any of the terms to filter |

### Load stigmatizing terms, simplified variables to exclude, and terms to filter out

The following files provide information about terms used to select and filter stigmatizing variables. These files are located in the `stigmatizing_terms` directory.

| File | Information |
|--------|-------------------|
| `stigmatizing_keywords.tsv` | List of search terms used to select potentially stigmatizing variables from PIC-SURE Authorized Access, with the reason each term was selected |
| `simplified_vars_excluded.tsv` | List of `simplified_name` values to remove from the list of potentially stigmatizing variables, with the reason for each exclusion |
| `terms_excluded.tsv` | List of terms used to filter out non-stigmatizing variables, with the reason for each exclusion |

In [None]:
stigmatizing_df = pd.read_csv("stigmatizing_terms/stigmatizing_keywords.tsv", sep="\t")
exclude_vars_df = pd.read_csv("stigmatizing_terms/simplified_vars_excluded.tsv", sep="\t")
terms_excluded_df = pd.read_csv("stigmatizing_terms/terms_excluded.tsv", sep="\t")

In [None]:
stig_terms = list(stigmatizing_df["Search keyword"])
print("Search keywords:\n\n", stig_terms)

In [None]:
exclude_vars = list(exclude_vars_df["Variables to exclude"])
print("Variables to exclude:\n\n", exclude_vars)

In [None]:
terms_excluded = list(terms_excluded_df["Terms to exclude"])
print("Terms to exclude:\n\n", terms_excluded)

### Run functions to find potentially stigmatizing variables

In [None]:
# Takes a while
stig_vars, ex_vars = check_simplified_name(stig_terms, multiindex, exclude_vars)

In [None]:
final_vars = regex_filter_out(stig_vars, terms_excluded)

In [None]:
print("Total number of vars", len(stig_vars))
print("After filtering", len(final_vars))

### Manual review of potentially stigmatizing variables

`manual_check` provides an interactive way to record whether each filtered variable is indeed stigmatizing. It takes the list of stigmatizing variables and an optional argument, `ex_vars`, that adds a manual review of the excluded variables. It returns a dataframe of the stigmatizing variables with the recorded responses and, if applicable, a dataframe of the excluded variables with their recorded responses.

To use this function, simply call it on the list of filtered variables (and excluded variables if needed) and follow the interactive instructions.

Please save results from this function to the `stigmatizing_variable_results` directory.
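
For readers unfamiliar with this kind of review loop, the sketch below shows one way decisions could be recorded interactively. It is an illustration only, not the `manual_check` implementation: the function name `manual_check_sketch`, the prompt wording, and the injectable `ask` parameter are hypothetical, and the `"n"` response is an assumption. The column names `full name` and `stigmatizing` and the `"y"` response mirror what the export step later in this notebook reads.

```python
import csv
import io


def manual_check_sketch(variables, out_stream, ask=input):
    """Hypothetical review loop: ask y/n per variable and write tab-delimited rows."""
    writer = csv.writer(out_stream, delimiter="\t", lineterminator="\n")
    writer.writerow(["full name", "stigmatizing"])
    decisions = []
    for name in variables:
        answer = ""
        # Keep asking until the reviewer gives a valid response
        while answer not in ("y", "n"):
            answer = ask(f"Is '{name}' stigmatizing? [y/n] ").strip().lower()
        writer.writerow([name, answer])
        decisions.append((name, answer))
    return decisions


# Scripted answers stand in for keyboard input so the sketch runs unattended;
# "maybe" is rejected and the question is asked again.
scripted = iter(["y", "maybe", "n"])
buffer = io.StringIO()
decisions = manual_check_sketch(
    ["var A", "var B"], buffer, ask=lambda prompt: next(scripted)
)
print(decisions)  # [('var A', 'y'), ('var B', 'n')]
print(buffer.getvalue())
```

In real use, `ask` would default to `input` and `out_stream` would be an open file in the `stigmatizing_variable_results` directory.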

In [None]:
# Rename output_file to appropriate filename
output_file = "stigmatizing_variable_results/stigmatizing_variable_decisions_8sept2021.txt"
stigmatizing_variables, excluded_stigmatizing_variables = manual_check(final_vars, output_file)

You can review your decisions in the specified `output_file` to double-check the final results.

### Export stigmatizing variables as tab-delimited text file

After confirming that the recorded decisions are correct, run the following code to create a tab-delimited text file of the stigmatizing variables.

In [None]:
stig_vars_for_output = pd.read_csv(output_file, sep='\t')
# Keep only the rows marked as stigmatizing ("y") and retain the full variable name
stig_mask = stig_vars_for_output["stigmatizing"] == "y"
stig_vars_for_output = stig_vars_for_output.loc[stig_mask, "full name"].reset_index(drop=True)

In [None]:
final_output = 'stigmatizing_variable_results/stigmatizing_variables.txt'
stig_vars_for_output.to_csv(final_output, sep='\t', header=False, index=False)

In [None]:
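# Copy the exported list to conceptsToRemove.txt. These absolute paths are specific
# to one SageMaker environment, so adjust them as needed.
dst = '/home/ec2-user/SageMaker/studies/ALL-avillach-73-bdcatalyst-etl/general/data/conceptsToRemove.txt'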
src = '/home/ec2-user/SageMaker/biodata_catalyst_stigmatizing_variables/'+final_output
copyfile(src, dst)