# Demo: Fine-Tuning NM Results Management Language Model with a Custom Dataset 

In this notebook, we will be using a sample of 10 radiology reports to show how we can preprocess the data, load the NM Results Management language model checkpoints, and use them for fine-tuning on your in-house data.

## Load the Data

First, the data is loaded. In this example, we will train the model to solve a three-class classification problem: determining whether a report contains lung findings, adrenal findings, or no findings.

In [1]:
import os
import joblib
from IPython.display import display, HTML

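# Define the path to the data
# Note: "__file__" is a string literal here, so base_path is "" and the data path
# resolves relative to the current working directory (notebooks do not define __file__)
base_path = os.path.dirname("__file__")
data_path = os.path.abspath(os.path.join(base_path, "..", "demo_data.gz"))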

# Import data
modeling_df = joblib.load(data_path)

display(HTML(modeling_df.head(3).to_html()))

Unnamed: 0,rpt_num,note,selected_finding,selected_proc,selected_label,new_note
0,1,"PROCEDURE: CT CHEST WO CONTRAST. HISTORY: Wheezing TECHNIQUE: Non-contrast helical thoracic CT was performed. COMPARISON: There is no prior chest CT for comparison. FINDINGS: Support Devices: None. Heart/Pericardium/Great Vessels: Cardiac size is normal. There is no calcific coronary artery atherosclerosis. There is no pericardial effusion. The aorta is normal in diameter. The main pulmonary artery is normal in diameter. Pleural Spaces: Few small pleural calcifications are present in the right pleura for example on 2/62 and 3/76. The pleural spaces are otherwise clear. Mediastinum/Hila: There is no mediastinal or hilar lymph node enlargement. Subcentimeter minimally calcified paratracheal lymph nodes are likely related to prior granulomas infection. Neck Base/Chest Wall/Diaphragm/Upper Abdomen: There is no supraclavicular or axillary lymph node enlargement. Limited, non-contrast imaging through the upper abdomen is within normal limits. Mild degenerative change is present in the spine. Lungs/Central Airways: There is a 15 mm nodular density in the nondependent aspect of the bronchus intermedius on 2/52. The trachea and central airways are otherwise clear. There is mild diffuse bronchial wall thickening. There is a calcified granuloma in the posterior right upper lobe. The lungs are otherwise clear. CONCLUSIONS: 1. There is mild diffuse bronchial wall thickening suggesting small airways disease such as asthma or bronchitis in the appropriate clinical setting. 2. A 3 mm nodular soft tissue attenuation in the nondependent aspect of the right bronchus intermedius is nonspecific, which could be mucus or abnormal soft tissue. A follow-up CT in 6 months might be considered to evaluate the growth. 3. Stigmata of old granulomatous disease is present. 
&#x20; FINAL REPORT Attending Radiologist:",Lung Findings,CT Chest,"A 3 mm nodular soft tissue attenuation in the nondependent aspect of the right bronchus intermedius is nonspecific, which could be mucus or abnormal soft tissue. A follow-up CT in 6 months might be considered to evaluate the growth.","support devices: none. heart/pericardium/great vessels: cardiac size is normal. there is no calcific coronary artery atherosclerosis. there is no pericardial effusion. the aorta is normal in diameter. the main pulmonary artery is normal in diameter. pleural spaces: few small pleural calcifications are present in the right pleura for example on 2/62 and 3/76. the pleural spaces are otherwise clear. mediastinum/hila: there is no mediastinal or hilar lymph node enlargement. subcentimeter minimally calcified paratracheal lymph nodes are likely related to prior granulomas infection. neck base/chest wall/diaphragm/upper abdomen: there is no supraclavicular or axillary lymph node enlargement. limited, non-contrast imaging through the upper abdomen is within normal limits. mild degenerative change is present in the spine. lungs/central airways: there is a 15 mm nodular density in the nondependent aspect of the bronchus intermedius on 2/52. the trachea and central airways are otherwise clear. there is mild diffuse bronchial wall thickening. there is a calcified granuloma in the posterior right upper lobe. the lungs are otherwise clear. conclusions: 1. there is mild diffuse bronchial wall thickening suggesting small airways disease such as asthma or bronchitis in the appropriate clinical setting. 2. a 3 mm nodular soft tissue attenuation in the nondependent aspect of the right bronchus intermedius is nonspecific, which could be mucus or abnormal soft tissue. a follow-up ct in 6 months might be considered to evaluate the growth. 3. stigmata of old granulomatous disease is present."
1,2,"PROCEDURE: CT ABDOMEN PELVIS W CONTRAST COMPARISON: date INDICATIONS: Lower abdominal/flank pain on the right TECHNIQUE: After obtaining the patients consent, CT images were created with intravenous iodinated contrast. FINDINGS: LIVER: The liver is normal in size. No suspicious liver lesion is seen. The portal and hepatic veins are patent. BILIARY: No biliary duct dilation. The biliary system is otherwise unremarkable. PANCREAS: No focal pancreatic lesion. No pancreatic duct dilation. SPLEEN: No suspicious splenic lesion is seen. The spleen is normal in size. KIDNEYS: No suspicious renal lesion is seen. No hydronephrosis. ADRENALS: No adrenal gland nodule or thickening. AORTA/VASCULAR: No aneurysm. RETROPERITONEUM: No lymphadenopathy. BOWEL/MESENTERY: The appendix is normal. No bowel wall thickening or bowel dilation. ABDOMINAL WALL: No hernia. URINARY BLADDER: Incomplete bladder distension limits evaluation, but no focal wall thickening or calculus is seen. PELVIC NODES: No lymphadenopathy. PELVIC ORGANS: Status post hysterectomy. No pelvic mass. BONES: No acute fracture or suspicious osseous lesion. LUNG BASES: No pleural effusion or consolidation. OTHER: Small hiatal hernia. CONCLUSION: 1. No acute process is detected. 2. Small hiatal hernia &#x20; FINAL REPORT Attending Radiologist:",No Findings,,No label,"liver: the liver is normal in size. no suspicious liver lesion is seen. the portal and hepatic veins are patent. biliary: no biliary duct dilation. the biliary system is otherwise unremarkable. pancreas: no focal pancreatic lesion. no pancreatic duct dilation. spleen: no suspicious splenic lesion is seen. the spleen is normal in size. kidneys: no suspicious renal lesion is seen. no hydronephrosis. adrenals: no adrenal gland nodule or thickening. aorta/vascular: no aneurysm. retroperitoneum: no lymphadenopathy. bowel/mesentery: the appendix is normal. no bowel wall thickening or bowel dilation. abdominal wall: no hernia. 
urinary bladder: incomplete bladder distension limits evaluation, but no focal wall thickening or calculus is seen. pelvic nodes: no lymphadenopathy. pelvic organs: status post hysterectomy. no pelvic mass. bones: no acute fracture or suspicious osseous lesion. lung bases: no pleural effusion or consolidation. other: small hiatal hernia. conclusion: 1. no acute process is detected. 2. small hiatal hernia"
2,3,"EXAM: MRI ABDOMEN W WO CONTRAST CLINICAL INDICATION: Cirrhosis of liver without ascites, unspecified hepatic cirrhosis type (CMS-HCC) TECHNIQUE: MRI of the abdomen was performed with and without contrast. Multiplanar imaging was performed. 8.5 cc of Gadavist was administered. COMPARISON: DATE and priors FINDINGS: On limited views of the lung bases, no acute abnormality is noted. There may be mild distal esophageal wall thickening. On the out of phase series, there is suggestion of some signal gain within the hepatic parenchyma. This is stable. A tiny cystic nonenhancing focus is seen anteriorly in the right hepatic lobe (9/10), unchanged. A subtly micronodular hepatic periphery is noted. There are few subtle hypervascular lesions in the right hepatic lobe, without significant washout. The portal vein is patent. Some splenorenal shunting is redemonstrated, similar to the comparison exam. The spleen measures 12.4 cm in length. No focal splenic lesion is appreciated. There are several small renal lesions again seen, many of which again demonstrate T1 shortening. On the postcontrast subtraction series, no obvious enhancement is noted. The adrenal glands and pancreas are intact. There is mild cholelithiasis, without gallbladder wall thickening or pericholecystic fluid. No free abdominal fluid is visualized. IMPRESSION: 1. Stable cirrhotic appearance of the liver. Few subtly hypervascular hepatic lesions do not demonstrate washout, and probably relate to perfusion variants. No particularly suspicious hepatic mass is seen. 2. Mild splenomegaly to 12.4 cm redemonstrated. Splenorenal shunting is again seen. 3. Scattered simple and complex renal cystic lesions, nonenhancing, stable from March 2040. 4. Incidentally, there is evidence of signal gain in the liver on the out of phase series. This occasionally may represent iron overload. &#x20; FINAL REPORT Attending Radiologist:",No Findings,,No label,"on limited views of the lung bases, no acute abnormality is noted. 
there may be mild distal esophageal wall thickening. on the out of phase series, there is suggestion of some signal gain within the hepatic parenchyma. this is stable. a tiny cystic nonenhancing focus is seen anteriorly in the right hepatic lobe (9/10), unchanged. a subtly micronodular hepatic periphery is noted. there are few subtle hypervascular lesions in the right hepatic lobe, without significant washout. the portal vein is patent. some splenorenal shunting is redemonstrated, similar to the comparison exam. the spleen measures 12.4 cm in length. no focal splenic lesion is appreciated. there are several small renal lesions again seen, many of which again demonstrate t1 shortening. on the postcontrast subtraction series, no obvious enhancement is noted. the adrenal glands and pancreas are intact. there is mild cholelithiasis, without gallbladder wall thickening or pericholecystic fluid. no free abdominal fluid is visualized. impression: 1. stable cirrhotic appearance of the liver. few subtly hypervascular hepatic lesions do not demonstrate washout, and probably relate to perfusion variants. no particularly suspicious hepatic mass is seen. 2. mild splenomegaly to 12.4 cm redemonstrated. splenorenal shunting is again seen. 3. scattered simple and complex renal cystic lesions, nonenhancing, stable from march 2040. 4. incidentally, there is evidence of signal gain in the liver on the out of phase series. this occasionally may represent iron overload."


## Preprocess the Data

First, the report is lowercased, the impression (i.e., the findings / conclusions section) is extracted, and any doctor signature is removed. This preprocessing section may need to be modified to accommodate your healthcare system's reports, formatting, etc. The ``preprocess_note`` function is modified from ``nmrezman.utils.preprocess_input``.

In [2]:
def keyword_split(x, keywords, return_idx: int=2):
    """
    Extract portion of string given a list of possible delimiters (keywords) via partition method
    """
    for keyword in keywords:
        if x.partition(keyword)[2] !='':
            return x.partition(keyword)[return_idx]
    return x
    
def preprocess_note(note):
    """
    Get the impression from the note, remove doctor signature, and lowercase
    """
    impression_keywords = [
            "impression:",
            "conclusion(s):",
            "conclusions:",
            "conclusion:",
            "finding:",
            "findings:",
    ]
    signature_keywords = [
        "&#x20",
        "final report attending radiologist:",
    ]
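    # Lowercase the note and keep the text after the first matching impression keyword
    impressions = keyword_split(str(note).lower(), impression_keywords)
    # Strip the signature from the extracted impression (not from the raw note)
    impressions = keyword_split(impressions, signature_keywords, return_idx=0)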
    return impressions

# Preprocess the note
modeling_df["impression"] = modeling_df["note"].apply(preprocess_note)
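modeling_df = modeling_df[modeling_df["impression"].notnull()].copy()  # copy to avoid pandas' SettingWithCopyWarning
# Note: str() of a bytes object yields its repr (e.g., "b'...'"), so each impression is stored in that form
modeling_df["impression"] = modeling_df["impression"].apply(lambda x: str(x.encode("utf-8")) + "\n" + "\n")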
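To make the extraction step concrete, here is a minimal, self-contained sketch of how ``str.partition`` drives it. The toy report string and the shortened keyword lists are made up for illustration, and the helper is re-declared so the cell runs on its own. The sketch assumes the signature split is applied to the already-extracted, lowercased impression.

```python
def keyword_split(text, keywords, return_idx=2):
    """Return one piece of text.partition(keyword) for the first keyword that matches."""
    for keyword in keywords:
        parts = text.partition(keyword)  # (before, keyword, after)
        if parts[2] != "":               # keyword found with text after it
            return parts[return_idx]
    return text                          # no keyword matched: return the input unchanged

# Toy report (illustrative only, not taken from the demo data)
toy_note = (
    "FINDINGS: Lungs are clear. IMPRESSION: No acute process. "
    "&#x20; FINAL REPORT Attending Radiologist:"
)

lowered = toy_note.lower()
# Keep everything after the first keyword (in list order) that matches
impression = keyword_split(lowered, ["impression:", "findings:"])
# Keep everything before the signature marker
cleaned = keyword_split(impression, ["&#x20", "final report attending radiologist:"], return_idx=0)
print(repr(cleaned))
```

Note that keywords are tried in list order, so ``"impression:"`` takes priority over ``"findings:"`` even when the findings section appears earlier in the report.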

Here we encode the findings label into integer labels for the model to interpret.

In [3]:
from sklearn import preprocessing

# Encode the Lung, Adrenal, and No Finding into integer labels
le = preprocessing.LabelEncoder()
le.fit(modeling_df["selected_finding"])
modeling_df["int_labels"] = le.transform(modeling_df["selected_finding"])
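``LabelEncoder`` assigns integers to the sorted unique label values, so the mapping can be reproduced in plain Python. The sketch below assumes the three label strings are exactly ``"Adrenal Findings"``, ``"Lung Findings"``, and ``"No Findings"``; the short label list is made up for illustration.

```python
# Hypothetical label column (illustrative values only)
labels = ["Lung Findings", "No Findings", "Adrenal Findings", "No Findings"]

# LabelEncoder sorts the unique classes and numbers them from 0
classes = sorted(set(labels))
label_to_int = {name: i for i, name in enumerate(classes)}
int_to_label = {i: name for name, i in label_to_int.items()}

int_labels = [label_to_int[name] for name in labels]
print(label_to_int)
print(int_labels)
```

With the fitted encoder from the cell above, ``le.classes_`` gives the same ordering and ``le.inverse_transform`` maps integer predictions back to label names, which helps when reading the classification report later on.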

The data is split into train and test sets (as lists so that it is formatted for the ``Dataset``).

In [4]:
from sklearn.model_selection import train_test_split

# Split the data into train and test
train_df, test_df = train_test_split(modeling_df, test_size=0.3, stratify=modeling_df["selected_finding"], random_state=37)
train_note = list(train_df["impression"])
train_label = list(train_df["int_labels"])
test_note = list(test_df["impression"])
test_label = list(test_df["int_labels"])

## Tokenize and Define the Datasets

First, we define a tokenizer to map words or word fragments to tokens. Here, we are using the checkpoint of [🤗's pretrained RoBERTa base model](https://huggingface.co/roberta-base). Padding is done on the left side since NM radiology reports generally have the findings at the end of the report. Note that you can swap the tokenizer and model to start from a different RoBERTa checkpoint (e.g., ``roberta-large``).

In [5]:
from transformers import AutoTokenizer

# Define the tokenizer (from a pre-trained checkpoint) and tokenize the notes
tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True, padding_side="left")
train_encodings = tokenizer(train_note, truncation=True, padding=True)
val_encodings = tokenizer(test_note, truncation=True, padding=True)   
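As a rough illustration of what ``padding_side="left"`` does to a batch, the sketch below pads two token-id lists by hand. The helper ``pad_batch`` is hypothetical (it is not part of 🤗 Transformers), the ids ``100``, ``200``, and ``300`` are made up, and the special ids (``0`` for start, ``2`` for end, ``1`` for padding) are assumed to follow the ``roberta-base`` vocabulary.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-id lists to a common length and build the matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        pad, pad_mask = [pad_id] * n_pad, [0] * n_pad
        real_mask = [1] * len(seq)
        if side == "left":
            input_ids.append(pad + list(seq))
            attention_mask.append(pad_mask + real_mask)
        else:
            input_ids.append(list(seq) + pad)
            attention_mask.append(real_mask + pad_mask)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two toy sequences of different lengths (ids are illustrative)
batch = pad_batch([[0, 100, 200, 2], [0, 300, 2]], pad_id=1, side="left")
print(batch["input_ids"])
print(batch["attention_mask"])
```

The attention mask marks padded positions with ``0`` so the model ignores them, regardless of which side the padding is on.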

Next, we define a custom PyTorch ``Dataset`` class. This will return the tokenized report text and integer label for a given index. 🤗 can easily use custom PyTorch ``Dataset``s for training data.

In [6]:
import torch

class Reports_Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings: dict, labels: list) -> None:
        self.encodings = encodings
        self.labels = labels
        return

    def __getitem__(self, idx: int) -> dict:
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self) -> int:
        return len(self.labels)
    
# Define the training and validation datasets with tokenized notes and labels
train_dataset = Reports_Dataset(train_encodings, train_label)
val_dataset = Reports_Dataset(val_encodings, test_label)

## Fine-Tune the Model

First, we load the pretrained model. It is similar to the one pretrained via the notebook ``Demo: Phase 02 Pretraining the NM Results Management Language Model with Custom Corpus``, except that it was trained on thousands of reports. This model will be fine-tuned for a specific task, which in this case is a multi-class classification problem that determines whether a report has Lung Findings, Adrenal Findings, or No Findings.

In [7]:
from transformers import AutoModelForSequenceClassification

# TODO: point to the pretrained model trained as part of the pretraining process
# Here, we are using a pretrained checkpoint trained on thousands of reports (vs. the pretrained model weights generated via the notebook ``demo_pretrain``)
# To use the one trained directly by the notebook, use "/path/to/results/phase02/demo/checkpoint-4"
model_pretrained_path = "/path/to/results/phase02/demo/checkpoint-14500"

# Fine-tune the model from the pre-trained checkpoint
model = AutoModelForSequenceClassification.from_pretrained(model_pretrained_path, num_labels=3)

Some weights of the model checkpoint at /path/to/results/phase02/demo/checkpoint-14500 were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at /path/to/results/phase02/demo/checkpoint-14500 and are newly initialized: ['classifier.dense.bias', 'classifier.out_proj.bias', 'c

Here we begin training using the 🤗 ``Trainer``, which trains according to the parameters specified in the 🤗 ``TrainingArguments`` and handles the training loop for us. Because ``load_best_model_at_end=True`` is set, the checkpoint with the lowest validation loss is reloaded when training finishes, and that model is used to classify reports as Lung, Adrenal, or No Findings.

In [8]:
from transformers import Trainer, TrainingArguments

# Define the training parameters and 🤗 Trainer
training_args = TrainingArguments(
                    output_dir="/path/to/results/phase02/demo/findings",    # output directory
                    num_train_epochs=40,                                    # total number of training epochs
                    per_device_train_batch_size=16,                         # batch size per device during training
                    per_device_eval_batch_size=8,                           # batch size per device during evaluation
                    warmup_steps=100,                                       # number of warmup steps for learning rate scheduler
                    weight_decay=0.015,                                     # strength of weight decay
                    fp16=True,                                              # mixed precision training
                    do_predict=True,                                        # run predictions on test set
                    load_best_model_at_end=True,                            # load best model at end so we can run confusion matrix
                    logging_steps=2,                                        # remaining args are related to logging
                    save_total_limit=2,
                    evaluation_strategy="epoch",
                    save_strategy="epoch",              
                    report_to="none",                  
)
trainer = Trainer(
                    model=model,                                            # the instantiated 🤗 Transformers model to be trained
                    args=training_args,                                     # training arguments, defined above
                    train_dataset=train_dataset,                            # training dataset
                    eval_dataset=val_dataset,                               # test (evaluation) dataset: save and eval strategy to match
)

# Train!
trainer.train()

Using amp half precision backend
***** Running training *****
  Num examples = 7
  Num Epochs = 40
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 40


Epoch,Training Loss,Validation Loss
1,No log,1.09082
2,1.093700,1.090332
3,1.093700,1.089355
4,1.086800,1.088135
5,1.086800,1.086182
6,1.101900,1.083984
7,1.101900,1.081299
8,1.077500,1.078125
9,1.077500,1.074707
10,1.096700,1.071045
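One way to read the loss values above: a three-class classifier that assigns equal probability to every class has a cross-entropy loss of ln(3), which is close to the values logged in these first epochs. The short check below is only a sanity-check sketch; the helper ``cross_entropy`` is written here for illustration and is not the loss implementation used by the ``Trainer``.

```python
import math

def cross_entropy(probs, target):
    """Cross-entropy loss for one example: negative log-probability of the true class."""
    return -math.log(probs[target])

num_classes = 3
uniform = [1.0 / num_classes] * num_classes

# Loss when the model is effectively guessing uniformly among the three classes
chance_loss = cross_entropy(uniform, target=0)
print(round(chance_loss, 4))

# A confident, correct prediction gives a much lower loss
confident_loss = cross_entropy([0.1, 0.8, 0.1], target=1)
print(confident_loss < chance_loss)
```

Losses near this chance level suggest that, over the epochs shown, the model has only just started to move away from uniform predictions, which is plausible for such a small demo dataset.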


***** Running Evaluation *****
  Num examples = 4
  Batch size = 8
Saving model checkpoint to /path/to/results/phase02/demo/findings/checkpoint-1
Configuration saved in /path/to/results/phase02/demo/findings/checkpoint-1/config.json
Model weights saved in /path/to/results/phase02/demo/findings/checkpoint-1/pytorch_model.bin
Deleting older checkpoint [/path/to/results/phase02/demo/findings/checkpoint-14] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 4
  Batch size = 8
Saving model checkpoint to /path/to/results/phase02/demo/findings/checkpoint-2
Configuration saved in /path/to/results/phase02/demo/findings/checkpoint-2/config.json
Model weights saved in /path/to/results/phase02/demo/findings/checkpoint-2/pytorch_model.bin
Deleting older checkpoint [/path/to/results/phase02/demo/findings/checkpoint-15] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 4
  Batch size = 8
Saving model checkpoint to /path/to/results/phase02/demo/find

TrainOutput(global_step=40, training_loss=0.8405103325843811, metrics={'train_runtime': 75.1838, 'train_samples_per_second': 3.724, 'train_steps_per_second': 0.532, 'total_flos': 37091533086720.0, 'train_loss': 0.8405103325843811, 'epoch': 40.0})
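The step count in the log follows from simple arithmetic, sketched below with the values reported above (7 training examples, batch size 16, 40 epochs). The learning-rate part assumes the ``TrainingArguments`` defaults that are not set explicitly in this notebook: a linear schedule with warmup and a base learning rate of ``5e-5``.

```python
import math

num_train_examples = 7      # "Num examples" from the training log
batch_size = 16             # per_device_train_batch_size (single device assumed)
grad_accum_steps = 1        # "Gradient Accumulation steps" from the log
num_epochs = 40
warmup_steps = 100
base_lr = 5e-5              # assumption: TrainingArguments default learning rate

# One optimizer step per epoch, since all 7 examples fit in a single batch
batches_per_epoch = math.ceil(num_train_examples / batch_size)
steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
total_steps = steps_per_epoch * num_epochs
print(total_steps)

# During linear warmup the learning rate scales as step / warmup_steps
peak_fraction = min(total_steps / warmup_steps, 1.0)
peak_lr = base_lr * peak_fraction
print(peak_fraction, peak_lr)
```

Under these assumptions, training ends before warmup completes, so the learning rate only reaches roughly 40% of its base value. On a real dataset you would likely choose ``warmup_steps`` to be a small fraction of the total step count.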

## Evaluate the Results

Using ``sklearn``'s ``classification_report`` and ``confusion_matrix``, we can evaluate how well the model performs on the test dataset.

In [9]:
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Predict on the test set, then compute the classification report and confusion matrix
y_pred = trainer.predict(val_dataset)
y_pred = np.argmax(y_pred.predictions, axis=1)
report = classification_report(test_label, y_pred)
matrix = confusion_matrix(test_label, y_pred)
print(report)
print(matrix)

***** Running Prediction *****
  Num examples = 4
  Batch size = 8


              precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
           1       0.50      1.00      0.67         1
           2       1.00      0.50      0.67         2

    accuracy                           0.75         4
   macro avg       0.83      0.83      0.78         4
weighted avg       0.88      0.75      0.75         4

[[1 0 0]
 [0 1 0]
 [0 1 1]]
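The per-class numbers in the report can be recovered directly from the confusion matrix, where rows are true labels and columns are predicted labels. The sketch below recomputes them in plain Python from the matrix printed above; the variable names are illustrative.

```python
# Confusion matrix from the output above (rows: true class, columns: predicted class)
matrix = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 1],
]
n = len(matrix)

precision, recall, f1 = [], [], []
for k in range(n):
    tp = matrix[k][k]
    predicted_k = sum(matrix[r][k] for r in range(n))  # column sum: predicted as class k
    actual_k = sum(matrix[k])                          # row sum: truly class k
    p = tp / predicted_k if predicted_k else 0.0
    r = tp / actual_k if actual_k else 0.0
    precision.append(p)
    recall.append(r)
    f1.append(2 * p * r / (p + r) if (p + r) else 0.0)

# Accuracy is the fraction of examples on the diagonal
accuracy = sum(matrix[k][k] for k in range(n)) / sum(sum(row) for row in matrix)

for k in range(n):
    print(k, round(precision[k], 2), round(recall[k], 2), round(f1[k], 2))
print("accuracy", accuracy)
```

Here the single error is one class-2 report predicted as class 1, which lowers the precision of class 1 and the recall of class 2. With only four test reports, these metrics are illustrative rather than a meaningful estimate of model quality.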
