# This script trains a test model from to resolve acronyms and abbreviations
## It later analyzes the results of the model on the test set
### Let's start by doing standard imports

In [1]:
import pandas as pd
import simpletransformers
from simplet5 import SimpleT5

Global seed set to 42


- Open the dataset, edit the prompt prefix to include a ':' and the column names to work with SimpleT5, drop old prefix

In [11]:
df = pd.read_csv('/home/karl/PycharmProjects/RABAC/finetune_SciFive/reverse_sub_mt_training_data/t5_training_data_20221105-083136.tsv', delimiter='\t')
df['input_text'] = df['prefix'] + ": " + df['input_text']
df.drop('prefix', inplace=True, axis=1)
df = df.rename(columns={"prefix": "prefix", "input_text": "source_text"})

- Now let's shuffle the dataset
- Then we split for test and training

In [12]:
# Shuffle data
df = df.sample(frac=1)
# Split it into training and eval datasets
num_for_training = int(len(df)*0.8)
train_df = df[:num_for_training]
test_df = df[num_for_training:]

## Training time
- Call the simpleT5 module
- Use T5 base model that was finetuned on Pubmed, PMC for natural language inference on biomedical data
- [For details on the base model and citation of the paper click here](https://huggingface.co/razent/SciFive-large-Pubmed_PMC-MedNLI)
- Of note, I had to edit the SimpleT5 module to handle GPU parallelization via Lightning Pytorch, this is beyond scope of this notebook
- Essentially Simple T5 only enables single GPU use, which was giving OOM issues with batch sizes >2, even with this, I could only do batch size of 4

In [5]:
model = SimpleT5()
model.from_pretrained(model_type="t5", model_name="razent/SciFive-large-Pubmed_PMC-MedNLI")
model.train(train_df=train_df,
            eval_df=test_df,
            batch_size=4,
            max_epochs=16,
            use_gpu=True,
            num_gpu=2,
            dataloader_num_workers=24
           )

  rank_zero_deprecation(
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration | 737 M 
-----------------------------------------------------
737 M     Trainable params
0         Non-trainable params
737 M     Total params
2,950.672 Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Global seed set to 42


Training: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

In [None]:
model.load_model("t5","/home/karl/PycharmProjects/RABAC/finetune_SciFive/outputs/simplet5-epoch-15-train-loss-0.018-val-loss-0.0802", use_gpu=True)

## Now that the model is trained, let's do a test prediction on the source text, we will evaluate the model on test data below

In [9]:
# This is an unprocessed source text being passed through the model, so the output should
# have resolved acronyms and abbreviations (acabs)
model.predict(test_df['source_text'].iloc[1])

["her evaluation today reveals restriction in the range of motion of the cervical and lumbar region with tenderness and spasms of the paraspinal musculature. motor strength was 5/5 on the mrc scale. reflexes were 2+ and symmetrical. palpable trigger points were noted bilaterally in the trapezius and lumbar paraspinal musculature bilaterally. palpable trigger points were noted on today's evaluation. she is suffering from ongoing myofascitis. her treatment plan will consist of a series of trigger point injections which were performed today. she tolerated the procedure well. i have asked her to ice the region intermittently for 15 minutes off and on x 3. she will be followed in four weeks' time for repeat trigger point injections if indicated."]

In [10]:
# This output shows what the text looked like prior to being passed to the model, so it should have the original acabs
# It looks like it got every acab except mrc, which is not in our training data - this is a great result!
print(test_df['source_text'].iloc[1])

dejargon: her eval today reveals restriction in the range of motion of the cervical and lumbar region w tenderness and spasms of the paraspinal musculature. motor strength was 5/5 on the mrc scale. reflexes were 2+ and symmetrical. palp trigger points were noted blly in the trapezius and lumbar paraspinal musculature blly. palp trigger points were noted on today's eval. she is suffering from ongoing myofascitis. her tp will consist of a series of trigger point injections which were performed today. she tolerated the procedure well. i have asked her to ice the region intermittently for 15 min off and on x 3. she will be followed in four wks' time for rpt trigger point injections if indicated.


## Okay time to run the model on all of the source_texts for downstream analysis
- this will take some time, so go grab lunch!

In [13]:
test_df['predicted'] = test_df['source_text'].apply(lambda x: model.predict(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df['predicted'] = test_df['source_text'].apply(lambda x: model.predict(x))


## It finished running, let's have a look at some of the output

In [51]:
for i in test_df[['source_text','predicted']].iloc[0:9].iterrows():
    print(i[1][0])
    print('\n')
    print(i[1][1])
    print('\n###########################################')
    print('###########################################\n')

dejargon: preop dx: symptomatic disk herniation c7-t1. final dx: symptomatic disk herniation c7-t1. procedures performed 1. ant cervical discectomy w decompression of sp cd c7-t1. 2. ant cervical fusion c7-t1. 3. ant cervical instrumentation ant c7-t1. 4. insertion of intervertebral device c7-t1. 5. use of operating microscope. anesthesiology: gen et. est bld loss: a 30 ml. procedure in detail: the pt was taken to the or where he was poly intub per the anesthesiology service. he was placed in the supine position on an or table. his arms were carefully taped down. he was sterilely prepped and draped in the usual fashion. a 4-cm incision was made obliquely over the l side of his neck. sq tissue was dissected down to the level of the platysma. the platysma was incised using electrocautery. blunt dissection was done to create a plane btwn the strap muscles and the sternoclavicular mastoid muscle. this allowed us to get r down on to the ant cervical sp. blunt dissection was done to sweep of

In [25]:
test_df['predicted'] = test_df['predicted'].apply(lambda x: x[0])
test_df['predicted'].head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df['predicted'] = test_df['predicted'].apply(lambda x: x[0])


2717    preoperative diagnosis: symptomatic disk herni...
2079    her evaluation today reveals restriction in th...
3389    discharge diagnoses: 1. acute cerebrovascular ...
722     pre and postoperative diagnosis: left cervical...
951     preoperative diagnosis: carcinoma of the prost...
Name: predicted, dtype: object

In [30]:
tuple_for_cf = list(zip(test_df['target_text'], test_df['predicted']))
tuple_for_cf[0:3]

[('preoperative diagnosis: symptomatic disk herniation c7-t1. final diagnosis: symptomatic disk herniation c7-t1. procedures performed 1. anterior cervical discectomy with decompression of spinal cord c7-t1. 2. anterior cervical fusion c7-t1. 3. anterior cervical instrumentation anterior c7-t1. 4. insertion of intervertebral device c7-t1. 5. use of operating microscope. anesthesiology: general endotracheal. estimated blood loss: a 30 ml. procedure in detail: the patient was taken to the operating room where he was orally intubated by the anesthesiology service. he was placed in the supine position on an or table. his arms were carefully taped down. he was sterilely prepped and draped in the usual fashion. a 4-cm incision was made obliquely over the left side of his neck. subcutaneous tissue was dissected down to the level of the platysma. the platysma was incised using electrocautery. blunt dissection was done to create a plane between the strap muscles and the sternoclavicular mastoid

# Let's see how many detected acabs are resolved from our model
## To do this we need to:

### First calculate how many substitutions our model made both in total and at the document level
### Aka the rate of identification and substitution
- First see how many acabs were in our training dataset (resolved in the gold standard target text) from our acab list.
- This is to ensure we are being fair to the model with respect to what it was trained on.
- We then count the number of those per document
- Then see how many remain in the target text
- We should do this at the document level and the total level

### Next for accuracy, etc. take random samples and manually evaluate if the correct long form was completed
- We should calculate how many samples we would need as an average representative sample of the total in the test set

### First let's do the rate of indentification and substitution

In [58]:
# load in the acab list
acab_df = pd.read_csv('/home/karl/PycharmProjects/RABAC/resources/acab.txt', delimiter='|')
acab_df.head()

Unnamed: 0,acab,long_form
0,(r),refused
1,l,liter
2,l,liters
3,10l,ten liters
4,11l,eleven liters


In [73]:
# Get unique short forms of acabs
acab_set = set(acab_df['acab'].to_list())

In [75]:
# Count the total number of acabs in the source text
import re

total_count_source = 0
for text in test_df['source_text'].to_list():
    text = re.sub(r'[^\w\s]', ' ', text)
    words = text.split(' ')
    for word in words:
        if word in acab_set:
            total_count_source+=1
print(total_count_source)

76170


In [76]:
# Count the total number of acabs in predicted output
total_count_pred = 0
for text in test_df['predicted'].to_list():
    text = re.sub(r'[^\w\s]', ' ', text)
    words = text.split(' ')
    for word in words:
        if word in acab_set:
            total_count_source+=1
print(total_count_pred)

0


## Okay wow, so if the model was trained on any acab, it will always sub it out, that's pretty good!
## That means we have a 100% substitution/identification rate, let's see some accuracy/confusion matrix stats
- We must unfortunately do this manually because there is no easy way to automatically tell if the correct acab was substituted in, we could look at target text, and compute from that, but say for instance UTI is solved to urinary tract infection, this will mess up mapping of the calculation. Or if text was cut due to length, or semi-colon, output may not have the same length. Programmatically doing this would probably be slower than just manually calculating it from a well representative sample size.
### We should calculate the needed representative sample size to detect an expected sensitivity/specificity of 90% (okay if lower)
- Its okay if our sensitivity and specificity are lower than 90%, the higher we use it just means we need a larger sample size
### We need to know the prevalanced of acabs within a given text
- For prevalance: we will calculate this as a percentage of total acabs to total words
### We will use a high confidence of 95% or greater

In [79]:
# First let's calculate prevlance to know how much to sample
# We already have the total acabs in the test dataset from above, let's check how many words there are
total_words = 0
for text in test_df['source_text'].to_list():
    # we don't need to sub out punctuation here
    # and in fact when we do that we add spaces which during the split will artificially increase word count
    words = text.split(' ')
    total_words += len(words)
prevalence = total_count_source / total_words
prevalence # round to 19%

0.1872825666795516

## To calculate the sample size needed for prevalance we use the formula from
- [Statistical  Methodology: I.  Incorporating the Prevalence of Disease  into the Sample Size Calculation for Sensitivity and Specificity](https://onlinelibrary.wiley.com/doi/epdf/10.1111/j.1553-2712.1996.tb03538.x)
- This comes out to be 182 for sensitivity, and 43 for specificity, which we can then use to calculate accuracy, recall (same as sensitivity), and preciscion as well.

In [83]:
sample_for_cf = test_df.sample(frac=(182 / len(test_df)))
sample_for_cf

Unnamed: 0,source_text,target_text,predicted
3531,dejargon: preop dx: pos peptic ulcer dis. post...,preoperative diagnosis: positive peptic ulcer ...,preoperative diagnosis: positive peptic ulcer ...
4779,dejargon: reason for referral: elevd bnp. hpi:...,reason for referral: elevated bnp. history of ...,reason for referral: elevated bnp. history of ...
2017,dejargon: preop diagnoses: h/o compartment syn...,preoperative diagnoses: history of compartment...,preoperative diagnoses: history of compartment...
4888,dejargon: preop dx: post infarct angina. type ...,preoperative diagnosis: post infarct angina. t...,preoperative diagnosis: post infarct angina. t...
249,dejargon: preop dx: degenerative arthritis of ...,preoperative diagnosis: degenerative arthritis...,preoperative diagnosis: degenerative arthritis...
...,...,...,...
4332,dejargon: hpi: this is a 70-yr-old f w a pmhx ...,history of present illness: this is a 70-year-...,history of present illness: this is a 70-year-...
2026,dejargon: preop dx: severe scoliosis. anesthes...,preoperative diagnosis: severe scoliosis. anes...,preoperative diagnosis: severe scoliosis. anes...
4499,dejargon: reason for neuroal cons: cervical sp...,reason for neurological consultation: cervical...,reason for neurological consultation: cervical...
1668,dejargon: technique: sequential axial ct image...,technique: sequential axial ct images were obt...,technique: sequential axial ct images were obt...


In [85]:
# Time for some manual review of the 182 samples to compute confusion matrix manually ugh.

In [84]:
test_df.to_csv('./model_results.csv')

In [86]:
sample_for_cf.to_csv('./samples_for_confusion_matrix.csv')

## Initial impression of the model:

- It has an identification and substitution rate of 100%!!!
- The above means that if the model saw an acab in training, it identified it and substituted it 100% of the time, whether or not it was the correct substitution is described below. We will need to do a confusion matrix for this.
- The confusion matrix for (accuracy, recall, preciscion, F1, sensitivity and specifity is going to take me a bit
- The reason for time need to compute confusion matrix is because I cannot automate knowing if the correct acab long form was subbed, or else we wouldn't need this model
- I calculated a representative sample size to extrapolate these terms, it will require manually reviewing 182 samples
- Otherwise the model did really well on basic sanity checks, which means we are taking the right approach (described below)
- Additionally, if we have good confusion matrix scores, this model literally took a day to train, compared to the nature paper which would've taken 4 years or whatever larry calculated.
- Some of the training data did not include resolved abbreviations, for example 'ct' remained ct in the target text, so ct would not be resolved
- I think lower casing the text may make it difficult to differentiate certain tokens for dejargoning, we should avoid this for the full model
- Something to be concerned about with GPT or T5 models in this use case is "hallucinations", aka ramblings to a prompt, or deviation from original text
- I didn't see a lot of hallucinations and overall it had high fidelity (little to no changing of original text unless it was an ACAB), still
- Some input had text chopped off at the end, we need to investigate whether this was from allowable sequence length, or if because the model thinks a semi colon means new input
