> Contrary to popular opinion, complex medical terminology is actually the easiest part for DeepScribe to pick up. The trickiest part for DeepScribe is to pick up on unique contextual statements a patient may give a physician. The more they stray from a typical conversation, the more we see the AI stumble.

\-- Akilesh Bapu

Context is what makes language, language. Every act of speech is situated in a physical environment, and comes from an embodied agent with motivations, pre-existing knowledge, and four other senses-worth of information. This is what separates the two readings of a sentence like "Can you pass the salt?": a request to hand the salt to the speaker, or a genuine question about whether the listener is physically able to pass it.

While non-verbal context is often critical for us when interpreting speech, language technologies often lack such data. Nonetheless, we can attempt to make intelligent assumptions about certain speech.

Suppose DeepScribe was having difficulty knowing when a conversation strayed from the patient's condition to small talk. It would be helpful to identify these digressions for a number of reasons: 1) NLP technology can prioritize computational resources on medically relevant speech, 2) engineers can spend less time anticipating edge-case conversational speech, and more time improving the product on the speech that makes the most business impact, and 3) the business can save costs by storing non-medically relevant conversation in a low-cost data lake.

# Medetect

**by Alex Liebscher**

In this brief demo, I'll attempt to classify incoming message data as either medically relevant or not. As a proof-of-concept, this largely disregards many fine nuances of actual conversation, but conveys the general idea that the two categories of speech differ enough in vocabulary to distinguish them.

As an example of the benefits listed above, we can do a back-of-the-napkin calculation: suppose we have just under 50 TB of text data from conversations between physicians and their patients. Suppose this is stored in an S3 bucket. At \\$0.023 per GB per month, this costs the business about \\$1,150 per month on storage. 

Suppose that 30% of this data is simple greetings, small talk, and the occasional conversational divergence. This data might not be essential for the success of the NLP models underlying the product, but we still want to keep the data, just somewhere cheaper. When we (infrequently) want the data, we don't want to have too much delay, so we might move this data to an S3 One Zone infrequent-access tier at about \\$0.01 per GB per month. With 30% of the data eligible to move: $0.7 \times 50000 \times 0.023 + 0.3 \times 50000 \times 0.01 = 955$.

Hence, **by identifying medically irrelevant data and moving it to a longer-term storage option, we can cut monthly storage expenses by 17%, saving nearly \\$200 per month.**
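The arithmetic above can be checked with a few lines of Python. This is only a sketch of the back-of-the-napkin estimate: the rates and the 30% share are the figures quoted in the text, and the function and variable names are made up for illustration.

```python
# Back-of-the-napkin storage cost comparison, using the figures quoted above.
TOTAL_GB = 50_000        # roughly 50 TB of text
STANDARD_RATE = 0.023    # USD per GB per month, standard tier (as quoted above)
CHEAPER_RATE = 0.01      # USD per GB per month, cheaper tier (as quoted above)
IRRELEVANT_SHARE = 0.30  # assumed share of non-medical conversation


def monthly_cost(total_gb, irrelevant_share, standard_rate, cheaper_rate):
    """Monthly bill when the irrelevant share sits in the cheaper tier."""
    relevant_gb = total_gb * (1 - irrelevant_share)
    irrelevant_gb = total_gb * irrelevant_share
    return relevant_gb * standard_rate + irrelevant_gb * cheaper_rate


baseline = monthly_cost(TOTAL_GB, 0.0, STANDARD_RATE, CHEAPER_RATE)
tiered = monthly_cost(TOTAL_GB, IRRELEVANT_SHARE, STANDARD_RATE, CHEAPER_RATE)
savings = baseline - tiered

print(f"baseline: {baseline:,.0f} USD/month")
print(f"tiered:   {tiered:,.0f} USD/month")
print(f"savings:  {savings:,.0f} USD/month ({savings / baseline:.0%})")
```

Changing `IRRELEVANT_SHARE` shows how sensitive the savings are to the one number in this estimate that is a pure guess.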

### Overview of the demo

I found two data sources (one of characters talking in movie transcripts, another of medical questions and answers) and built a prototype classification algorithm to identify new text as friendly or medical. Data loading and cleaning involved reading both text and XML files, embedding individual messages into a vector space using TF-IDF, and estimating a fine-tuned, cross-validated model.

## Load Libraries

In [1]:
from os import scandir

import xml.etree.ElementTree as ET

import pandas as pd
import numpy as np

In [2]:
from transformers import BertTokenizer

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.svm import LinearSVC

from sklearn.model_selection import train_test_split, RandomizedSearchCV

I'm making use of the BERT base uncased tokenizer as a baseline. This tokenizer is slightly more sophisticated than a traditional tokenizer, since it breaks text into WordPiece tokens (a subword scheme closely related to Byte-Pair Encoding, or BPE). These tokens are common subword pieces, somewhat analogous to syllables in a word.

In [4]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

Here's a quick example of the subword encoding. Notice how `subwords` was broken into two component pieces; this places less weight in the final model on rare compound words:

In [185]:
print(tokenizer.tokenize("short words don't split, but words with multiple subwords do"))

['short', 'words', 'don', "'", 't', 'split', ',', 'but', 'words', 'with', 'multiple', 'sub', '##words', 'do']
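To make the splitting behavior concrete, here is a minimal sketch of greedy longest-match-first subword splitting, which is the matching rule WordPiece tokenizers apply to each word at tokenization time. The tiny vocabulary is invented for this example (the real `bert-base-uncased` vocabulary has roughly 30k entries), and the function is a simplification, not the library's implementation.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, made up for this example.
toy_vocab = {"short", "words", "sub", "##words", "split"}

print(wordpiece_tokenize("subwords", toy_vocab))  # ['sub', '##words']
print(wordpiece_tokenize("words", toy_vocab))     # ['words']
print(wordpiece_tokenize("zebra", toy_vocab))     # ['[UNK]']
```

Because `subwords` is not in the vocabulary but `sub` and `##words` are, the word is represented by two common pieces rather than one rare token.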


# Load Data

Our objective is to classify incoming messages as either medically relevant or not. The latter category, medically irrelevant, is going to be built using a dataset of lines from movies. In particular, this dataset is the [Cornell Movie-Dialogs Dataset](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html). I felt like this would represent small talk fairly well. On closer inspection, I noticed there's a lot of fantasy talk in the dataset: stuff about crime scenes, or bars, or generally stuff you wouldn't (usually) find in a conversation with a medical professional.

In [5]:
with open('data/cornell-movie-dialogs-corpus/movie_lines.txt', 'rb') as f:
    irl_data = f.read().splitlines()

In [137]:
irl_data_lines = []

# to balance out the dataset a little better, we'll only grab at most 15k lines
for line in irl_data[:15000]:
    
    try:
        # a few lines are not valid UTF-8 and get skipped below; the first 8 tokens are metadata
        text = line.decode('utf-8').split()[8:]
        # e.g. a two-word character name shifts the fields by one token
        if text[0] == '+++$+++':
            text = text[1:]

        # to better match the next dataset, we're keeping only lines with more than 4 words
        if len(text) > 4:
            irl_data_lines.append(' '.join(text[:30]).lower())
    except (UnicodeDecodeError, IndexError):
        # skip lines that fail to decode or that have no text after the metadata
        pass
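As an alternative to counting whitespace-separated tokens, the fields could be split on the corpus's ` +++$+++ ` separator directly. The sketch below assumes each line has five fields (line ID, character ID, movie ID, character name, text) and decodes as ISO-8859-1, which never raises; the helper name and its parameters are hypothetical, and the sample line is only an illustration of the layout. Because it keeps lines that the UTF-8 version drops, the resulting counts would differ from the ones reported in this notebook.

```python
SEPARATOR = " +++$+++ "


def parse_movie_line(raw: bytes, max_words: int = 30, min_words: int = 5):
    """Return the cleaned utterance from one raw corpus line, or None to skip it."""
    # ISO-8859-1 maps every byte to a character, so decoding cannot fail
    fields = raw.decode("iso-8859-1").split(SEPARATOR)
    if len(fields) != 5:
        return None  # unexpected layout (for example, a line with no text)
    words = fields[4].split()
    if len(words) < min_words:
        return None  # too short to be comparable with the medical answers
    return " ".join(words[:max_words]).lower()


# Illustrative line in the corpus layout.
sample = b"L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!"

print(parse_movie_line(sample))               # None, since it has only three words
print(parse_movie_line(sample, min_words=1))  # they do not!
```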

Here's a quick example of a datum:

In [138]:
irl_data_lines[60]

"exactly so, you going to bogey lowenbrau's thing on saturday?"

Next I'm loading up a few sub-datasets of the [MedQuAD dataset](https://github.com/abachaa/MedQuAD), which consists of factual, medical question-answer pairs.

In [139]:
dirs = ['data/MedQuAD-master/2_GARD_QA/', 'data/MedQuAD-master/4_MPlus_Health_Topics_QA', 'data/MedQuAD-master/7_SeniorHealth_QA', ]

files = [f"{dr}/{file.name}" for dr in dirs for file in scandir(dr)]

med_data = []
for file in files:
    with open(file, 'r') as f:
        root = ET.XML(f.read())
    # root[2] is expected to hold the list of QA pairs
    for qa_pair in root[2]:
        try:
            # we're keeping the question and the answer, but only the first 30 words of the answer (which can get too large otherwise)
            med_data.append({
                'q': qa_pair[0].text.lower(),
                'a': ' '.join(qa_pair[1].text.lower().split()[:30])
            })
        except (AttributeError, IndexError):
            # skip pairs with a missing or empty question or answer
            pass

A quick example datum:

In [140]:
med_data[60]

{'q': 'is chronic inflammatory demyelinating polyneuropathy inherited ?',
 'a': 'is chronic inflammatory demyelinating polyneuropathy (cidp) inherited? cidp is not known to be inherited and is considered an acquired disorder. no clear genetic predisposition or other predisposing factors for cidp'}

We have about 7.1k and 10.2k examples in the final medical QA and movie lines datasets, respectively.

In [141]:
(len(med_data),len(irl_data_lines))

(7139, 10170)

# Tokenize data and build vocab

I'm computing the TF-IDF vector for each text sample. The term frequency (TF) portion of the algorithm gives more weight to terms that occur often within a document, and the inverse document frequency (IDF) portion decreases the weight of terms which show up in many documents. Together, higher weight is placed on tokens which are distinctive, and thus add more information to the model. This is the difference between seeing "zebra" and "the". Clearly, seeing "zebra" tells us more about whether we're talking about a zoo than seeing "the" would.
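The "zebra" versus "the" intuition can be reproduced by hand. The sketch below is a plain-Python approximation, assuming whitespace tokenization, the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` that scikit-learn documents as its default, and L2-normalized rows. The toy corpus and the `tfidf` helper are made up for illustration and are not what the notebook runs.

```python
import math
from collections import Counter


def tfidf(corpus):
    """Return one {token: weight} dict per document (whitespace tokens)."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    # document frequency: number of documents that contain each token
    df = Counter(token for doc in docs for token in set(doc))
    # smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: count * idf[t] for t, count in tf.items()}
        # L2-normalize so long and short documents are comparable
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


# Toy corpus, invented for this example.
toy_corpus = [
    "the zebra ate grass",
    "the keeper fed a lion",
    "a lion slept in the shade",
]

first = tfidf(toy_corpus)[0]
print(first["zebra"] > first["the"])  # True
```

"the" appears in every document, so its IDF is the minimum of 1, while "zebra" appears in only one and is weighted more heavily.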

In [142]:
vectorizer = TfidfVectorizer(tokenizer=tokenizer.tokenize)

# we're only going to take the first 7k examples from both datasets, for an artificially balanced dataset
N = 7000 
med_data_lines = [qa['a'] for qa in med_data][:N]
corpus = irl_data_lines[:N]
corpus.extend(med_data_lines)

# compute TF-IDF vectors for each example in the dataset
data = vectorizer.fit_transform(corpus)

Next we create a training-testing split, so that we can estimate the model using the training dataset (and potentially overfit to it), and then evaluate the model on the held-out test set to see how well it generalizes.

In [143]:
X = data.toarray()
y = np.concatenate(([0]*N, [1]*N))

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2)

In [144]:
data.shape, train_X.shape, test_X.shape, train_y.shape, test_y.shape

((14000, 11661), (11200, 11661), (2800, 11661), (11200,), (2800,))

To verify, we have a pretty balanced dataset:

In [145]:
train_y.mean(), test_y.mean()

(0.4986607142857143, 0.5053571428571428)

# Model Building

We'll construct our model beginning with a vanilla logistic regression. Logistic regression offers us simplicity right off the bat, and is highly interpretable (which can help with debugging in the initial stages of model development).

We'll fine-tune the parameter `C`, which controls how much regularization exists in the model. In scikit-learn, `C` is the inverse of the regularization strength, so small values mean strong regularization and large values mean very little. There are a variety of ways to optimize hyperparameters; here we use randomized search over the parameter space (although with only seven candidate values and the default of 10 iterations, the search ends up trying every value). The most common technique is probably grid search, but it tends to be inefficient. Another popular technique is Bayesian hyperparameter optimization. Under this paradigm, priors are set over each parameter space, a surrogate function of the score is estimated from the points sampled so far, and that estimate guides where to sample next. The minimum or maximum (i.e. optimal) parameter combination is then read off the estimated function.

Back to the topic at hand: we'll fit our model using 5-fold cross validation, so that each candidate value of `C` is scored on data it was not fit on, giving a more reliable estimate of how well it generalizes.
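For readers new to cross validation, the splitting itself is simple. Below is a minimal sketch of plain k-fold index generation, with a hypothetical helper name. Note that it is a simplification: as far as I know, scikit-learn uses stratified folds by default for classifiers, which additionally keeps the class balance similar in every fold.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # the first n_samples % k folds get one extra example
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


for train_idx, val_idx in k_fold_indices(10, k=5):
    print("validate on", val_idx, "train on", len(train_idx), "examples")
```

Each example lands in the validation set exactly once, so every candidate hyperparameter value gets five scores, each computed on data the model did not see while fitting.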

In [105]:
from sklearn.linear_model import LogisticRegression

In [146]:
model = LogisticRegression()

params = {
    "C": [0.001, 0.01, 0.1, 0.5, 1.0, 2.5, 10.0]
}

grid_search = RandomizedSearchCV(model, param_distributions=params, scoring=["accuracy", "f1"], refit="f1")

In [147]:
grid_search.fit(train_X, train_y)



RandomizedSearchCV(cv=None, error_score=nan,
                   estimator=LogisticRegression(C=1.0, class_weight=None,
                                                dual=False, fit_intercept=True,
                                                intercept_scaling=1,
                                                l1_ratio=None, max_iter=100,
                                                multi_class='auto', n_jobs=None,
                                                penalty='l2', random_state=None,
                                                solver='lbfgs', tol=0.0001,
                                                verbose=0, warm_start=False),
                   iid='deprecated', n_iter=10, n_jobs=None,
                   param_distributions={'C': [0.001, 0.01, 0.1, 0.5, 1.0, 2.5,
                                              10.0]},
                   pre_dispatch='2*n_jobs', random_state=None, refit='f1',
                   return_train_score=False, scoring=['accuracy', 'f1'

In [148]:
grid_search.cv_results_

{'mean_fit_time': array([2.03435502, 2.35411024, 4.22772932, 6.52013159, 4.96001663,
        4.80387273, 8.22929516]),
 'std_fit_time': array([0.98665002, 0.18216249, 0.95373969, 1.67167898, 0.15167101,
        0.16987305, 2.56395942]),
 'mean_score_time': array([0.07364011, 0.04572358, 0.04655681, 0.09730306, 0.04726491,
        0.04469023, 0.10558405]),
 'std_score_time': array([0.07942197, 0.00865253, 0.00405342, 0.09593816, 0.01044198,
        0.00588322, 0.08816913]),
 'param_C': masked_array(data=[0.001, 0.01, 0.1, 0.5, 1.0, 2.5, 10.0],
              mask=[False, False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'C': 0.001},
  {'C': 0.01},
  {'C': 0.1},
  {'C': 0.5},
  {'C': 1.0},
  {'C': 2.5},
  {'C': 10.0}],
 'split0_test_accuracy': array([0.9625    , 0.971875  , 0.98794643, 0.99151786, 0.99330357,
        0.99375   , 0.99419643]),
 'split1_test_accuracy': array([0.95357143, 0.95714286, 0.97410714, 0.98526786, 0.98973214

Above are the raw results of the cross validation and hyperparameter tuning. They're a little difficult to read at first, but one of the most important fields here is `mean_test_f1`, which gives, for each candidate value of `C`, the F1 score on the held-out validation fold averaged across the folds. It's pretty consistently > 0.95, which is a good sign, but it could also signal that our input data sources are just too different from each other (and therefore too easily distinguished).

Below I talk about some of the limitations of this model and steps to improve the model building procedure.

# Model Testing

To get a quick glance at how well the model generalizes, I'm going to compute a confusion matrix for predictions on the test set. This is relevant because we can identify false positive and false negative rates, which are invaluable for diagnosing a model and making improvements.

Ideally, predicting labels on the test set should be the *very* last step in model building. Realistically, it sometimes gets mixed into model development. If this happens, it's important to recognize that and mentally note the inherent overfitting that will happen on the next development iteration.

In [149]:
from sklearn.metrics import confusion_matrix

In [153]:
pred_y = grid_search.predict(test_X)

We see that there were very few false positives and false negatives (10 and 2, respectively, out of 2,800 test examples), a good sign. An easy next step in this development would be to closely check out those examples and figure out what made them different and difficult to classify.

In [154]:
confusion_matrix(test_y, pred_y)

array([[1375,   10],
       [   2, 1413]])
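The usual summary metrics follow directly from these four counts. The sketch below copies the numbers from the matrix above and assumes scikit-learn's layout (rows are true labels, columns are predicted labels), with label 1 meaning "medical" as in the `y` vector built earlier.

```python
# Counts copied from the confusion matrix above: rows are true labels,
# columns are predicted labels, and label 1 means "medical".
matrix = [[1375, 10],
          [2, 1413]]
(tn, fp), (fn, tp) = matrix

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the messages flagged medical, how many were
recall = tp / (tp + fn)     # of the medical messages, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# accuracy=0.9957 precision=0.9930 recall=0.9986 f1=0.9958
```

For the storage use case, the two false negatives matter most: those are medical messages that would have been moved to the cheaper tier by mistake.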

As a reality check, we can feed in a couple examples of real data. Suppose we saw the following two statements in the DeepScribe datasets. Would the model above be able to identify the medically relevant one so we can cut costs on computation and data storage?

In [186]:
text_medical = "yes at your age melanoma can be a serious concern, but you seem to be taking care of yourself"
text_conversation = "yes at my son's age i know it can be a concern, but he gets along just well with the others"

grid_search.predict(vectorizer.transform([text_medical, text_conversation]))

array([1, 0])

Great, the model was fit well enough to make the distinction.

# Next Steps

This demo would be insufficient for production because of more reasons than we could count here. As a proof-of-concept, I'd say it gets the point across.

Suppose we wanted to move forward with it though, what would some next steps be?

1) The datasets aren't great for a number of reasons, and as we say, Garbage In, Garbage Out. Neither dataset perfectly represented any conversation we might hear in a doctor's office. Building better datasets would ideally be my first step.

2) The data preprocessing is simple and lets a lot of unhelpful data get through. For example, the 4 word minimum I set on the movie lines data was pretty arbitrary. We could build better model inputs by rigorously cleaning this data, identifying patterns of poor/good input, and really understanding the data.

3) The model was perhaps the simplest baseline we could get. While a simple, interpretable model is a great starting point, eventually we would need something more sophisticated. With Huggingface making attention-based models incredibly easy to set up, we could jump almost directly to contextual word embeddings and recurrent neural networks to build a classifier, instead of relying on vector space embeddings and linear models. However, the trade-off is efficiency. Prepping the data and fitting the model may take orders of magnitude longer if we introduce these steps.

4) We limited ourselves to using only the message at hand in classification. In reality, we could use a number of contextual features to improve our results, such as the previous message, the tone of the speech, the pace of the speech, and more.

5) We should be comparing models in terms of accuracy (e.g. f1), efficiency (e.g. training time/example), and cost (e.g. hours needed to train and EC2 costs). Here we only used one model though.

6) Visualization. We didn't visualize the data, the estimation process, the model output, or the predictions at all. This is crucial if we were to be comparing this model to any others.
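As one small illustration of point 4, the previous message could be attached to each message before vectorizing, so that the classifier sees a little conversational context. This is only a sketch of the idea: the helper name, the ` [SEP] ` separator, and the example conversation are all made up, and whether this actually helps would need to be tested.

```python
def with_previous_message(messages, sep=" [SEP] "):
    """Prefix each message with the one before it (the first stays as-is)."""
    paired = []
    previous = ""
    for message in messages:
        paired.append(previous + sep + message if previous else message)
        previous = message
    return paired


# Hypothetical conversation, invented for this example.
conversation = [
    "how was your weekend",
    "good, thanks, we went hiking",
    "any chest pain since we changed the dose",
]

for text in with_previous_message(conversation):
    print(text)
```

The paired strings could then be passed to the same TF-IDF vectorizer in place of the single messages.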

In essence, language is highly context dependent, and conversations in doctors' offices probably rely on context to a large extent. Non-verbal cues may also play a large role in the experience. With just speech and text data, we can nonetheless make an attempt to work efficiently with what we're given. Classifying incoming data as either relevant to the medical aspect of the conversation or not could help improve data storage efficiency and save computational resources by working only with the most relevant data.

\- Alex Liebscher