

# 📖 KeyClass Tutorial: Text Classification with Label-Descriptions Only


<hr>


***Author(s):*** Arnab Dey, Chufan Gao, Mononito Goswami, correspondence to &lt;mgoswami@andrew.cmu.edu&gt;

<img align="right" src="../assets/autonlab_logo.png" width="20%"/>

## Contents


### 1. [Problem Background & Motivation](#introduction) 

####    &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;  1.1 [Electronic Health Records (EHR)](#ehr)
####    &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;  1.2 [Generalizable Insights in Healthcare Contexts](#insights)


### 2. [Methodology](#methodology) 

####    &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;  2.1 [Prior Work](#prior)
####    &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;  2.2 [KeyClass](#keyclass)
<!-- ####    &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;  2.3 [Problem Formulation](#math) -->
####    &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;  2.3 [Find Class Descriptions](#classdesc)
####    &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;  2.4 [Find Relevant Keywords](#keywords)
####    &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;  2.5 [Probabilistically Labeling the Data](#label)


### 3. [Experimentation: Training](#exp_training) 

####    &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;  3.1 [Training the Downstream Model](#downstream)
####    &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;  3.2 [Self-Training the Model](#self)

### 4. [Experimentation: Testing](#exp_testing) 

### 5. [References](#references) 

<hr>


<a id='introduction'></a>
## 1. Problem Background & Motivation 

<a id='ehr'></a>
### 1.1 Electronic Health Records (EHR) 

The Electronic Health Record __(EHR)__ system is a digital version of a patient’s paper chart. __EHRs__ are near-real-time, patient-centered records that contain `patient history`, `diagnoses`, `procedures`, `medications`, and more in an easily accessible format. Since the _Health Information Technology for Economic and Clinical Health_ <b>(HITECH)</b> Act was signed into law in 2009, adoption rates of these systems have steadily increased<sup><a href="#references"><b>1</b></a></sup>. Adler-Milstein et al.<sup><a href="#references"><b>2</b></a></sup>, who analyzed survey data collected by the American Hospital Association, found that EHR adoption rates were at <code><b>80%</b></code> in __2017__, twice the rate in __2008__. With higher adoption rates comes the rising challenge of _data processing and analysis of unstructured clinical text._ Due to the unstructured nature of clinical notes, providers often employ trained staff and/or third-party vendors to help assign diagnostic codes using coding systems such as the International Classification of Diseases __(ICD)__<sup><a href="#references"><b>3</b></a></sup>. 

__However, manual assignment of ICD codes is problematic:__
1. It is both time-consuming and error-prone, with only <code><b>60-80%</b></code> of the assigned codes reflecting actual patient diagnoses<sup><a href="#references"><b>4</b></a></sup>
1. A significant portion of code assignment results in misjudged severity of conditions and code omissions<sup><a href="#references"><b>5</b></a></sup>
1. For healthcare providers, billing and coding errors may not only lead to loss of revenue and claim denials, but also federal penalties for erroneous Medicare and Medicaid claims

Thus, there is a clear need for reliable automated classification of unstructured clinical notes.

<a id='insights'></a>
### 1.2 Generalizable Insights in Healthcare Contexts 

Managing the cost and quality of healthcare is a persistent societal challenge with an enormous impact on the daily lives of all people. Our approach proposes a low-cost solution that has the potential to address some of the pressing issues around access to affordable yet accurate automated disease coding tools. Our contribution lies in a novel strategy to efficiently acquire interpretable weak supervision sources from readily available text to learn effective text classifiers without the need for human-labeled data.

__Our work demonstrates:__
1. Pre-trained language models can efficiently and effectively inform weakly supervised models for text classification
1. Self-training improves downstream classifier performance, especially when classifiers are initially trained on a subset of the training data
1. Data programming performs on par with simple majority vote when relying on a large number of automatically generated weak supervision sources of similar quality
1. Keywords are excellent sources of weak supervision

<a id='methodology'></a>
## 2. Methodology 

<!-- <center> -->
<img align="top" src="../assets/KeyClass.png" width="50%"/>
<!-- </center> -->

<b>Figure A:</b> Overview of our methodology. From only class descriptions, KeyClass classifies documents without access to any labeled data. It automatically creates interpretable labeling functions (LFs) by extracting frequent keywords and phrases that are highly indicative of a particular class from the unlabeled text using a pre-trained language model. It then uses these LFs along with Data Programming (DP) to generate probabilistic labels for training data, which are used to train a downstream classifier <sup><a href="#references"><b>13</b></a></sup>.

<a id='prior'></a>
### 2.1 Prior Work

__Assigning ICD codes to Clinical Notes<sup><a href="#references"><b>[6,7,8]</b></a></sup>:__
1. To the best of our knowledge, all prior work on ICD code assignment utilized __fully supervised ML techniques__, most of them relying on vast quantities of labeled training data
1. In this work, we explore the use of our proposed __weakly supervised model _KeyClass___ to assign top-level `ICD-9` codes to long patient discharge summaries
1. Its training signal is retrieved automatically from readily available descriptions of the ICD codes, so it requires no human-produced supervisory feedback to build effective downstream text classifiers

__Text Classification with Sparse Training Labels<sup><a href="#references"><b>[9,10]</b></a></sup>:__
1. Our work differs from prior work because the foundation of our weak supervision methodology, i.e., frequent keywords and phrases as LFs, is highly interpretable
1. Secondly, while previously proposed state-of-the-art models are committed to specific language model architectures for linguistic knowledge and representation learning, KeyClass offers a high degree of modularity, enabling end users to adapt the neural language model (encoder) and downstream classifiers to specific problems, such as clinical text classification
1. Finally, we explore the use of weak supervision for multilabel multiclass classification, a problem which, to the best of our knowledge, has not been tackled by prior work on weak text classification

__Weak Supervision for Clinical Text Classification<sup><a href="#references"><b>[11,12]</b></a></sup>:__
1. Prior work on weakly supervised clinical text classification had an explicit dependence on manually created rule-based labeling functions
1. In this work, however, we demonstrate that we can quickly and automatically create simple keyword-based labeling functions, with minimal to no human involvement

<a id='keyclass'></a>
### 2.2 KeyClass

As a potential remedy, we present KeyClass, a general weakly supervised text classification framework that combines Data Programming<sup><a href="#references"><b>13</b></a></sup> with a novel method of automatically acquiring interpretable weak supervision sources (keywords and phrases) from class-label descriptions only, without needing access to any labeled documents. The successful application of KeyClass to an important clinical text classification problem demonstrates its potential for social impact by allowing quick and affordable development and deployment of effective text classifiers.

<img align="top" src="../assets/flowchart.png" width="50%"/>

<b>Figure B:</b> Data programming, or weak supervision, compared to fully supervised ML. The orange boxes indicate the effort required by expert annotators. Instead of having to label extensive quantities of data by hand, the effort in the data programming framework lies in obtaining labeling functions. In KeyClass, these labeling functions are keyword-matching rules automatically extracted from reference data, which further reduces the required human effort.

<a id='classdesc'></a>
### 2.3 Find Class Descriptions
Unlike traditional supervised learning, where each document needs to be labeled, KeyClass relies only on meaningful and succinct class descriptions, which also removes the need for the expert heuristics used in prior weak supervision work. As a concrete example, let's consider the IMDb movie review sentiment classification problem, where the objective is to classify a movie review as being `positive` or `negative`. In order to initiate the classification process, domain experts provide <code><b>KeyClass</b></code> with common sense descriptions of a __positive__ (`good amazing exciting positive`) and __negative review__ (`terrible bad boring negative`). In most cases, these descriptions can be automatically generated from Wikipedia articles or reference manuals and validated by domain experts, further reducing manual effort. The class descriptions used in this tutorial can be found [here](./config_files/config_imdb.yml).
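
As a rough sketch of how such class descriptions might be held in memory, the snippet below stores one description per class in a plain dictionary and collects them in class order. The key names (`target_0`, `target_1`), the class order (0 = negative, 1 = positive), and the helper `collect_label_names` are assumptions made for illustration; the actual configuration file may be laid out differently.

```python
# Hypothetical in-memory stand-in for the class descriptions in the config file.
# The key names and the class order (0 = negative, 1 = positive) are assumptions.
config = {
    "dataset": "imdb",
    "n_classes": 2,
    "target_0": "terrible bad boring negative",
    "target_1": "good amazing exciting positive",
}


def collect_label_names(cfg):
    """Return the class descriptions ordered by their numeric suffix."""
    target_keys = [k for k in cfg if k.startswith("target_")]
    target_keys.sort(key=lambda k: int(k.split("_")[1]))
    return [cfg[k] for k in target_keys]


label_names = collect_label_names(config)
print(label_names)
```

Sorting by the numeric suffix keeps the position of each description aligned with its class index, which matters once the descriptions are embedded and compared against keywords.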

<a id='keywords'></a>
### 2.4 Find Relevant Keywords / Encoding the Dataset

Once we have the class descriptions, KeyClass automatically discovers highly suggestive keywords and phrases for each class. KeyClass first obtains frequent n-grams from the training corpus to serve as keywords or key-phrases for its automatically composed labeling functions. In order to transform the keywords into labeling functions of the prescribed form, KeyClass leverages the general linguistic knowledge stored within pre-trained neural language models, such as Bidirectional Encoder Representations from Transformers __(BERT)__<sup><a href="#references"><b>14</b></a></sup>, to map each keyword to the most semantically related category description. To create a labeling function, KeyClass simply assigns a keyword to its closest category as measured by the cosine similarity between their embeddings. In order to ensure equal representation of all classes, KeyClass sub-samples the top-k labeling functions per class, ordering them by cosine similarity. While data programming theoretically benefits from as many labeling functions as possible, the sampling is required due to computational and space constraints.
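
The assignment step described above can be sketched with plain NumPy. This is a minimal illustration, not the KeyClass implementation: the function `assign_keywords` is hypothetical, and the 2-D vectors are toy stand-ins for the embeddings a pre-trained language model would produce.

```python
import numpy as np


def assign_keywords(keyword_emb, class_emb, topk):
    """Assign each keyword to its closest class and keep the top-k per class.

    Returns a dict mapping class index -> list of keyword indices, ordered
    from most to least similar.
    """
    # L2-normalize rows so that dot products equal cosine similarities
    kw = keyword_emb / np.linalg.norm(keyword_emb, axis=1, keepdims=True)
    cl = class_emb / np.linalg.norm(class_emb, axis=1, keepdims=True)
    sim = kw @ cl.T                      # shape: (n_keywords, n_classes)

    closest = sim.argmax(axis=1)         # closest class for every keyword
    best_sim = sim.max(axis=1)           # similarity to that closest class

    selected = {}
    for c in range(class_emb.shape[0]):
        members = np.where(closest == c)[0]
        # Sort this class's keywords by decreasing similarity, keep top-k
        order = members[np.argsort(-best_sim[members])]
        selected[c] = order[:topk].tolist()
    return selected


# Toy 2-D "embeddings" standing in for language-model outputs
keywords = ["awful", "dull", "great", "superb", "fine"]
keyword_emb = np.array([[1.0, 0.1], [0.9, 0.3], [0.1, 1.0], [0.2, 0.9], [0.6, 0.7]])
class_emb = np.array([[1.0, 0.0],    # class 0: negative description
                      [0.0, 1.0]])   # class 1: positive description

selected = assign_keywords(keyword_emb, class_emb, topk=2)
for c, idx in selected.items():
    print(c, [keywords[i] for i in idx])
```

With `topk=2`, the weakly related keyword "fine" is dropped from class 1 even though class 1 is its closest class, which is how the per-class cap keeps the classes equally represented.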

In [1]:
# Import statements
import sys
sys.path.append('../keyclass/')
sys.path.append('../scripts/')

import argparse
import label_data, encode_datasets, train_downstream_model
import torch
import pickle
import numpy as np
import os
from os.path import join, exists
from datetime import datetime
import utils
import models
import create_lfs
import train_classifier

[nltk_data] Downloading package stopwords to /Users/lux/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# Input arguments
config_file_path = r'../config_files/config_imdb.yml' # Specify path to the configuration file
random_seed = 0 # Random seed for experiments

In [3]:
args = utils.Parser(config_file_path=config_file_path).parse()

if args['use_custom_encoder']:
    model = models.CustomEncoder(pretrained_model_name_or_path=args['base_encoder'], 
        device='cuda' if torch.cuda.is_available() else 'cpu')
else:
    model = models.Encoder(model_name=args['base_encoder'], 
        device='cuda' if torch.cuda.is_available() else 'cpu')

for split in ['train', 'test']:
    sentences = utils.fetch_data(dataset=args['dataset'], split=split, path=args['data_path'])
    embeddings = model.encode(sentences=sentences, batch_size=args['end_model_batch_size'], 
                                show_progress_bar=args['show_progress_bar'], 
                                normalize_embeddings=args['normalize_embeddings'])
    with open(join(args['data_path'], args['dataset'], f'{split}_embeddings.pkl'), 'wb') as f:
        pickle.dump(embeddings, f)

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: paraphrase-mpnet-base-v2



<a id='label'></a>
### 2.5 Probabilistically Labeling the Data

Next, _KeyClass_ constructs the labeling function vote matrix and generates probabilistic labels for all training documents using a label model. Specifically, we use the open-source label model implementation of the ___Snorkel Python library___<sup><a href="#references"><b>13</b></a></sup>.
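
To make the vote matrix concrete, here is a small self-contained sketch. It deliberately swaps in a simple majority vote for the label model, so it does not reproduce Snorkel's behavior; the helpers `build_vote_matrix` and `majority_vote_proba`, the abstain value `-1`, and the whitespace tokenization are all assumptions chosen for illustration.

```python
import numpy as np

ABSTAIN = -1  # value used when a labeling function does not fire


def build_vote_matrix(documents, keyword_to_class):
    """Rows are documents, columns are keyword LFs; entries are class votes."""
    keywords = list(keyword_to_class)
    votes = np.full((len(documents), len(keywords)), ABSTAIN, dtype=int)
    for i, doc in enumerate(documents):
        tokens = set(doc.lower().split())
        for j, kw in enumerate(keywords):
            if kw in tokens:             # the LF fires only if its keyword occurs
                votes[i, j] = keyword_to_class[kw]
    return votes


def majority_vote_proba(votes, n_classes):
    """Turn votes into per-document class probabilities by normalized counts."""
    counts = np.stack([(votes == c).sum(axis=1) for c in range(n_classes)], axis=1)
    counts = counts.astype(float)
    totals = counts.sum(axis=1, keepdims=True)
    # Documents on which every LF abstains get a uniform distribution
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_classes)


keyword_to_class = {"boring": 0, "terrible": 0, "amazing": 1, "exciting": 1}
documents = [
    "an amazing and exciting film",
    "terrible plot and boring characters",
    "amazing visuals but a boring story",
    "it was released in 1999",
]

votes = build_vote_matrix(documents, keyword_to_class)
proba = majority_vote_proba(votes, n_classes=2)
print(votes)
print(proba)
```

The third document receives conflicting votes and the fourth receives none, so both end up with an uninformative 50/50 label. A learned label model can weight labeling functions differently instead of counting them equally.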

In [4]:
# Load training data
train_text = utils.fetch_data(dataset=args['dataset'], path=args['data_path'], split='train')

training_labels_present = False
if exists(join(args['data_path'], args['dataset'], 'train_labels.txt')):
    with open(join(args['data_path'], args['dataset'], 'train_labels.txt'), 'r') as f:
        y_train = f.readlines()
    y_train = np.array([int(i.replace('\n','')) for i in y_train])
    training_labels_present = True
else:
    y_train = None
    training_labels_present = False
    print('No training labels found!')

with open(join(args['data_path'], args['dataset'], 'train_embeddings.pkl'), 'rb') as f:
    X_train = pickle.load(f)

# Print dataset statistics
print(f"Getting labels for the {args['dataset']} data...")
print(f'Size of the data: {len(train_text)}')
if training_labels_present:
    print('Class distribution', np.unique(y_train, return_counts=True))

# Load label names/descriptions
label_names = []
for a in args:
    if 'target' in a: label_names.append(args[a])

# Creating labeling functions
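# Fall back to the CPU when CUDA is unavailable, as in the encoding cell above
device = torch.device(args['device'] if torch.cuda.is_available() else 'cpu')
labeler = create_lfs.CreateLabellingFunctions(base_encoder=args['base_encoder'], 
                                            device=device,
                                            label_model=args['label_model'])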
proba_preds = labeler.get_labels(text_corpus=train_text, label_names=label_names, min_df=args['min_df'], 
                                ngram_range=args['ngram_range'], topk=args['topk'], y_train=y_train, 
                                label_model_lr=args['label_model_lr'], label_model_n_epochs=args['label_model_n_epochs'], 
                                verbose=True, n_classes=args['n_classes'])

y_train_pred = np.argmax(proba_preds, axis=1)

# Save the predictions
if not os.path.exists(args['preds_path']): os.makedirs(args['preds_path'])
with open(join(args['preds_path'], f"{args['label_model']}_proba_preds.pkl"), 'wb') as f:
    pickle.dump(proba_preds, f)

# Print statistics
print('Label Model Predictions: Unique value and counts', np.unique(y_train_pred, return_counts=True))
if training_labels_present:
    print('Label Model Training Accuracy', np.mean(y_train_pred==y_train))

    # Log the metrics
    training_metrics_with_gt = utils.compute_metrics(y_preds=y_train_pred, y_true=y_train, average=args['average'])
    utils.log(metrics=training_metrics_with_gt, filename='label_model_with_ground_truth', 
        results_dir=args['results_path'], split='train')

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: paraphrase-mpnet-base-v2


Getting labels for the imdb data...
Size of the data: 25000
Class distribution (array([0, 1]), array([12500, 12500]))



<a id='exp_training'></a>
## 3. Experimentation: Training 

<a id='downstream'></a>
### 3.1 Training the Downstream Model

After obtaining a probabilistically labeled training dataset, KeyClass can train any downstream classifier using the rich document feature representations provided by the neural language model. Instead of using all the automatically labeled documents, KeyClass initially trains the downstream classifier using only the top-$k$ documents with the most confident label estimates.
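
The selection of confident documents can be sketched as follows. Whether the top-$k$ cut is applied per class or over the whole dataset is an assumption here; this sketch selects per predicted class, and `top_k_confident_mask` is a hypothetical helper rather than a function from the KeyClass codebase.

```python
import numpy as np


def top_k_confident_mask(proba_preds, k_per_class):
    """Boolean mask selecting, for each predicted class, the k documents
    whose label-model probability for that class is highest."""
    predicted = proba_preds.argmax(axis=1)
    confidence = proba_preds.max(axis=1)
    mask = np.zeros(len(proba_preds), dtype=bool)
    for c in range(proba_preds.shape[1]):
        members = np.where(predicted == c)[0]
        # Indices of this class's documents, most confident first
        keep = members[np.argsort(-confidence[members])[:k_per_class]]
        mask[keep] = True
    return mask


# Toy probabilistic labels for six documents and two classes
proba_preds = np.array([
    [0.95, 0.05],
    [0.60, 0.40],
    [0.80, 0.20],
    [0.10, 0.90],
    [0.45, 0.55],
    [0.30, 0.70],
])
mask = top_k_confident_mask(proba_preds, k_per_class=2)
print(mask)
```

The resulting mask can then index the embeddings and labels together, so that only the confidently labeled rows are passed to the classifier.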

In [None]:
args = utils.Parser(config_file_path=config_file_path).parse()

# Set random seeds
torch.manual_seed(random_seed)
np.random.seed(random_seed)

X_train_embed_masked, y_train_lm_masked, y_train_masked, \
	X_test_embed, y_test, training_labels_present, \
	sample_weights_masked, proba_preds_masked = train_downstream_model.load_data(args)

# Train a downstream classifier

if args['use_custom_encoder']:
	encoder = models.CustomEncoder(pretrained_model_name_or_path=args['base_encoder'], device=args['device'])
else:
	encoder = models.Encoder(model_name=args['base_encoder'], device=args['device'])

classifier = models.FeedForwardFlexible(encoder_model=encoder,
										h_sizes=args['h_sizes'], 
										activation=eval(args['activation']),
										device=torch.device(args['device']))
print('\n===== Training the downstream classifier =====\n')
model = train_classifier.train(model=classifier, 
							device=torch.device(args['device']),
							X_train=X_train_embed_masked, 
							y_train=y_train_lm_masked,
							sample_weights=sample_weights_masked if args['use_noise_aware_loss'] else None, 
							epochs=args['end_model_epochs'], 
							batch_size=args['end_model_batch_size'], 
							criterion=eval(args['criterion']), 
							raw_text=False, 
							lr=eval(args['end_model_lr']), 
							weight_decay=eval(args['end_model_weight_decay']),
							patience=args['end_model_patience'])


end_model_preds_train = model.predict_proba(torch.from_numpy(X_train_embed_masked), batch_size=512, raw_text=False)
end_model_preds_test = model.predict_proba(torch.from_numpy(X_test_embed), batch_size=512, raw_text=False)

<a id='self'></a>
### 3.2 Self-Training the Model
Finally, KeyClass self-trains the downstream model-encoder combination on the entire training dataset to refine the end model classifier.
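
One common way to self-train a classifier on unlabeled text is to periodically recompute sharpened soft targets from the model's own predictions and to stop once predictions and targets agree on nearly all documents. The sketch below illustrates that idea only. The target formula and the helpers `sharpen_targets` and `agreement` are assumptions for illustration, not a description of what `train_classifier.self_train` does internally.

```python
import numpy as np


def sharpen_targets(p):
    """Soft targets q_ij proportional to p_ij^2 / f_j, with f_j = sum_i p_ij.

    Squaring emphasizes confident predictions; dividing by the class
    frequency f_j keeps large classes from dominating.
    """
    weight = p ** 2 / p.sum(axis=0, keepdims=True)
    return weight / weight.sum(axis=1, keepdims=True)


def agreement(p, q):
    """Fraction of documents whose predicted class matches the target class."""
    return float(np.mean(p.argmax(axis=1) == q.argmax(axis=1)))


# Toy model predictions for three documents and two classes
p = np.array([[0.7, 0.3], [0.6, 0.4], [0.2, 0.8]])
q = sharpen_targets(p)
print(q)
print(agreement(p, q))
```

In a training loop of this kind, the model would be fitted to `q` for a fixed number of steps before `q` is recomputed, and training would end once the agreement exceeds a chosen threshold.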

In [None]:
# Fetching the raw text data for self-training
X_train_text = utils.fetch_data(dataset=args['dataset'], path=args['data_path'], split='train')
X_test_text = utils.fetch_data(dataset=args['dataset'], path=args['data_path'], split='test')

model = train_classifier.self_train(model=model, 
									X_train=X_train_text, 
									X_val=X_test_text, 
									y_val=y_test, 
									device=torch.device(args['device']), 
									lr=eval(args['self_train_lr']), 
									weight_decay=eval(args['self_train_weight_decay']),
									patience=args['self_train_patience'], 
									batch_size=args['self_train_batch_size'], 
									q_update_interval=args['q_update_interval'],
									self_train_thresh=eval(args['self_train_thresh']), 
									print_eval=True)


end_model_preds_test = model.predict_proba(X_test_text, batch_size=args['self_train_batch_size'], raw_text=True)


# Print statistics
testing_metrics = utils.compute_metrics_bootstrap(y_preds=np.argmax(end_model_preds_test, axis=1),
													y_true=y_test, 
													average=args['average'], 
													n_bootstrap=args['n_bootstrap'], 
													n_jobs=args['n_jobs'])
print(testing_metrics)

<a id='exp_testing'></a>
## 4. Experimentation: Testing 

In [None]:
end_model_path='../models/imdb/end_model_26-Jul-2022-03_29_41.pth'
end_model_self_trained_path='../models/imdb/end_model_self_trained_26 Jul 2022 03:59:43.pth'

args = utils.Parser(config_file_path=config_file_path).parse()

# Set random seeds
torch.manual_seed(random_seed)
np.random.seed(random_seed)

X_train_embed_masked, y_train_lm_masked, y_train_masked, \
	X_test_embed, y_test, training_labels_present, \
	sample_weights_masked, proba_preds_masked = train_downstream_model.load_data(args)

model = torch.load(end_model_path)

end_model_preds_train = model.predict_proba(torch.from_numpy(X_train_embed_masked), batch_size=512, raw_text=False)
end_model_preds_test = model.predict_proba(torch.from_numpy(X_test_embed), batch_size=512, raw_text=False)

# Print statistics
if training_labels_present:
	training_metrics_with_gt = utils.compute_metrics(y_preds=np.argmax(end_model_preds_train, axis=1), 
														y_true=y_train_masked, 
														average=args['average'])
	print('training_metrics_with_gt', training_metrics_with_gt)

training_metrics_with_lm = utils.compute_metrics(y_preds=np.argmax(end_model_preds_train, axis=1), 
													y_true=y_train_lm_masked, 
													average=args['average'])
print('training_metrics_with_lm', training_metrics_with_lm)

testing_metrics = utils.compute_metrics_bootstrap(y_preds=np.argmax(end_model_preds_test, axis=1), 
													y_true=y_test, 
													average=args['average'], 
													n_bootstrap=args['n_bootstrap'], 
													n_jobs=args['n_jobs'])
print('testing_metrics', testing_metrics)


print('\n===== Self-training the downstream classifier =====\n')

# Fetching the raw text data for self-training
X_train_text = utils.fetch_data(dataset=args['dataset'], path=args['data_path'], split='train')
X_test_text = utils.fetch_data(dataset=args['dataset'], path=args['data_path'], split='test')

model = torch.load(end_model_self_trained_path)

end_model_preds_test = model.predict_proba(X_test_text, batch_size=args['self_train_batch_size'], raw_text=True)


# Print statistics
testing_metrics = utils.compute_metrics_bootstrap(y_preds=np.argmax(end_model_preds_test, axis=1),
													y_true=y_test, 
													average=args['average'], 
													n_bootstrap=args['n_bootstrap'], 
													n_jobs=args['n_jobs'])
print('testing_metrics after self train', testing_metrics)
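
The test metrics above are reported over bootstrap resamples of the test set. As a minimal sketch of that idea, the snippet below estimates the mean and standard deviation of accuracy by resampling with replacement. The function `bootstrap_accuracy` is hypothetical and restricted to accuracy; the metrics and return format of `utils.compute_metrics_bootstrap` may differ.

```python
import numpy as np


def bootstrap_accuracy(y_true, y_pred, n_bootstrap=100, seed=0):
    """Mean and standard deviation of accuracy over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)
    scores = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        idx = rng.integers(0, n, size=n)   # sample n indices with replacement
        scores[b] = np.mean(y_true[idx] == y_pred[idx])
    return scores.mean(), scores.std()


# Toy labels and predictions
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]
mean_acc, std_acc = bootstrap_accuracy(y_true, y_pred)
print(f"accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")
```

Reporting the spread alongside the mean gives a sense of how much the score depends on the particular test documents that happened to be sampled.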


<a id='references'></a>
## 5. References 

[[1](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3270933/)] Nir Menachemi and Taleah H Collum. Benefits and drawbacks of electronic health record systems. Risk management and healthcare policy, 4:47, 2011.

[[2](https://academic.oup.com/jamia/article/24/6/1142/4091350)] Julia Adler-Milstein, A Jay Holmgren, Peter Kralovec, Chantal Worzala, Talisha Searcy, and Vaishali Patel. Electronic health record adoption in us hospitals: the emergence of a digital “advanced use” divide. Journal of the American Medical Informatics Association, 24(6):1142–1148, 2017.

[[3](https://www.tandfonline.com/doi/full/10.1080/2331205X.2021.1893422)] Musaed Ali Alharbi, Godfrey Isouard, and Barry Tolchard. Historical development of the statistical classification of causes of death and diseases. Cogent Medicine, 8(1):1893422, 2021. doi: 10.1080/2331205X.2021.1893422. URL https://doi.org/10.1080/2331205X.2021.1893422.

[[4](https://n.neurology.org/content/49/3/660.short)] Curtis Benesch, DM Witter, AL Wilder, PW Duncan, GP Samsa, and DB Matchar. Inaccuracy of the international classification of diseases (icd-9-cm) in identifying the diagnosis of ischemic cerebrovascular disease. Neurology, 49(3):660–664, 1997.

[[5](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0234647)] Guhan Ram Venkataraman, Arturo Lopez Pineda, Oliver J Bear Don’t Walk IV, Ashley M Zehnder, Sandeep Ayyar, Rodney L Page, Carlos D Bustamante, and Manuel A Rivas. Fastag: Automatic text classification of unstructured medical narratives. PLoS one, 15(6):e0234647, 2020.

[[6](https://www.aaai.org/ocs/index.php/WS/AAAIW18/paper/view/16881/0)] Tal Baumel, Jumana Nassour-Kassis, Raphael Cohen, Michael Elhadad, and Noémie Elhadad. Multi-label classification of patient notes: case study on icd code assignment. In Workshops at the thirty-second AAAI conference on artificial intelligence, 2018.

[[7](https://doi.org/10.1162/neco.1997.9.8.1735)] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735. URL https://doi.org/10.1162/neco.1997.9.8.1735.

[[8](https://link.springer.com/chapter/10.1007/978-3-319-21843-4_12)] Stefano Giovanni Rizzo, Danilo Montesi, Andrea Fabbri, and Giulio Marchesini. Icd code retrieval: Novel approach for assisted disease classification. In International Conference on Data Integration in the Life Sciences, pages 147–161. Springer, 2015.

[[9](https://www.aaai.org/Papers/IJCAI/2007/IJCAI07-259.pdf?ref=https://githubhelp.com)] Evgeniy Gabrilovich, Shaul Markovitch, et al. Computing semantic relatedness using wikipedia-based explicit semantic analysis. In IJCAI, volume 7, pages 1606–1611, 2007.

[[10](https://arxiv.org/abs/2010.07245)] Yu Meng, Yunyi Zhang, Jiaxin Huang, Chenyan Xiong, Heng Ji, Chao Zhang, and Jiawei Han. Text classification using label names only: A language model self-training approach. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020.

[[11](https://link.springer.com/article/10.1186/s12911-018-0723-6)] Yanshan Wang, Sunghwan Sohn, Sijia Liu, Feichen Shen, Liwei Wang, Elizabeth J Atkinson, Shreyasee Amin, and Hongfang Liu. A clinical text classification paradigm using weak supervision and deep representation. BMC medical informatics and decision making, 19(1):1–13, 2019.

[[12](https://www.sciencedirect.com/science/article/pii/S0022395621000637)] Marika Cusick, Prakash Adekkanattu, Thomas R Campion Jr, Evan T Sholle, Annie Myers, Samprit Banerjee, George Alexopoulos, Yanshan Wang, and Jyotishman Pathak. Using weak supervision and deep learning to classify clinical notes for identification of current suicidal ideation. Journal of psychiatric research, 136:95–102, 2021.

[[13](https://proceedings.neurips.cc/paper/2016/hash/6709e8d64a5f47269ed5cea9f625f7ab-Abstract.html)] Alexander J Ratner, Christopher M De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. Data programming: Creating large training sets, quickly. In Advances in neural information processing systems, pages 3567–3575, 2016.

[[14](https://arxiv.org/abs/1810.04805)] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://www.aclweb.org/anthology/N19-1423.