## Using Clinical BERT

ClinicalBERT is an application of the BERT model [11] to clinical corpora to address the challenges of clinical text. Representations are learned using medical notes and further processed for clinical tasks. That work demonstrates ClinicalBERT on hospital readmission prediction, whereas this notebook fine-tunes the `emilyalsentzer/Bio_ClinicalBERT` checkpoint to locate clinical features in patient notes.

Original Source - 
https://www.kaggle.com/code/kerenhalevy/nmbe-bio-medical-bert/notebook

Start by downloading and unpacking the Kaggle data; the cells after that install and import the data tools.

All of the data comes from the Kaggle competition page: <br/>
https://www.kaggle.com/competitions/nbme-score-clinical-patient-notes/data
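
Before running the loading code, it can help to confirm that the unpacked files are where the notebook expects them. The following is a minimal sketch, assuming the archive was unpacked into a local `data/` directory (the path used by the `pd.read_csv` calls further down); the helper name `missing_files` is hypothetical.

```python
from pathlib import Path

# File names taken from the pd.read_csv calls later in this notebook.
EXPECTED_FILES = ("features.csv", "patient_notes.csv", "train.csv")

def missing_files(data_dir="data"):
    """Return the expected CSV names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]

missing = missing_files()
print("all files found" if not missing else f"missing: {missing}")
```

An empty list means all three files are in place.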

In [1]:
import pandas as pd

In [2]:
! pip install transformers
! pip install torch

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Import Libs

In [3]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from ast import literal_eval
from itertools import chain
from sklearn.metrics import precision_recall_fscore_support
from tqdm.notebook import tqdm, trange
from sklearn.model_selection import StratifiedKFold
import torch
from transformers import AutoModel, AutoTokenizer


## Set up Config

In [18]:
class CFG:
    root = "../input/nbme-score-clinical-patient-notes"
    debug = False
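    # model_loop trains range(n_fold) folds. StratifiedKFold raises for n_splits < 2,
    # so create_train_df (In [5]) has to run with the 5-fold setting before this cell
    # is re-executed (In [18]) with n_fold = 1 to train only fold 0.
    n_fold = 1
    # n_fold = 5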
    #model_path = "emilyalsentzer/Bio_ClinicalBERT"
    #model="../input/bio-clinicalbert"
    model ='emilyalsentzer/Bio_ClinicalBERT'
    max_length = 512
    doc_stride = 128
    device = "cuda" if torch.cuda.is_available() else "cpu"
    lr = 1e-5
    batch_size = 16
    epochs = 1
    # epochs = 3


## Create df

In [5]:
def create_train_df():
    feats = pd.read_csv("data/features.csv")
    notes = pd.read_csv("data/patient_notes.csv")
    train = pd.read_csv("data/train.csv")

    train["annotation_list"] = [literal_eval(x) for x in train["annotation"]]
    train["location_list"] = [literal_eval(x) for x in train["location"]]
    merged = train.merge(notes, how = "left")
    merged = merged.merge(feats, how = "left")
    merged = merged.loc[merged["annotation"] != "[]"].copy().reset_index(drop = True) # comment out if you train all samples


    def process_feature_text(text):
        # Turn "-OR-" into "; " and the remaining hyphens into spaces.
        return text.replace("-OR-", ";-").replace("-", " ")
  
    merged["feature_text"] = [process_feature_text(x) for x in merged["feature_text"]]
    
    merged["feature_text"] = merged["feature_text"].apply(lambda x: x.lower())
    merged["pn_history"] = merged["pn_history"].apply(lambda x: x.lower())
    
    merged['location_prediction'] = -1
    merged['token_proba'] = -1
    merged['token_offsets'] = -1

    if CFG.debug:
        merged = merged.sample(frac = 0.5).reset_index(drop = True)

    skf = StratifiedKFold(CFG.n_fold)
    merged["stratify_on"] = merged["case_num"].astype(str) + merged["feature_num"].astype(str)
    merged["fold"] = -1

    for fold, (_, valid_idx) in enumerate(skf.split(merged["id"], y = merged["stratify_on"])):
        merged.loc[valid_idx, "fold"] = fold
    
    print(merged.shape)
    return merged


df = create_train_df()

(9901, 15)




In [6]:
df.head()

| | id | case_num | pn_num | feature_num | annotation | location | annotation_list | location_list | pn_history | feature_text | location_prediction | token_proba | token_offsets | stratify_on | fold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00016_000 | 0 | 16 | 0 | ['dad with recent heart attcak'] | ['696 724'] | [dad with recent heart attcak] | [696 724] | hpi: 17yo m presents with palpitations. patien... | family history of mi; family history of myocar... | -1 | -1 | -1 | 0 | 0 |
| 1 | 00016_001 | 0 | 16 | 1 | ['mom with "thyroid disease'] | ['668 693'] | [mom with "thyroid disease] | [668 693] | hpi: 17yo m presents with palpitations. patien... | family history of thyroid disorder | -1 | -1 | -1 | 1 | 0 |
| 2 | 00016_002 | 0 | 16 | 2 | ['chest pressure'] | ['203 217'] | [chest pressure] | [203 217] | hpi: 17yo m presents with palpitations. patien... | chest pressure | -1 | -1 | -1 | 2 | 0 |
| 3 | 00016_003 | 0 | 16 | 3 | ['intermittent episodes', 'episode'] | ['70 91', '176 183'] | [intermittent episodes, episode] | [70 91, 176 183] | hpi: 17yo m presents with palpitations. patien... | intermittent symptoms | -1 | -1 | -1 | 3 | 0 |
| 4 | 00016_004 | 0 | 16 | 4 | ['felt as if he were going to pass out'] | ['222 258'] | [felt as if he were going to pass out] | [222 258] | hpi: 17yo m presents with palpitations. patien... | lightheaded | -1 | -1 | -1 | 4 | 0 |


In [7]:
first = df.loc[0]
example = {
    "feature_text": first.feature_text,
    "pn_history": first.pn_history,
    "location_list": first.location_list,
    "annotation_list": first.annotation_list
}
for key in example.keys():
    print(key)
    print(example[key])
    print("=" * 100)

feature_text
family history of mi; family history of myocardial infarction
pn_history
hpi: 17yo m presents with palpitations. patient reports 3-4 months of intermittent episodes of "heart beating/pounding out of my chest." 2 days ago during a soccer game had an episode, but this time had chest pressure and felt as if he were going to pass out (did not lose conciousness). of note patient endorses abusing adderall, primarily to study (1-3 times per week). before recent soccer game, took adderrall night before and morning of game. denies shortness of breath, diaphoresis, fevers, chills, headache, fatigue, changes in sleep, changes in vision/hearing, abdominal paun, changes in bowel or urinary habits. 
pmhx: none
rx: uses friends adderrall
fhx: mom with "thyroid disease," dad with recent heart attcak
all: none
immunizations: up to date
shx: freshmen in college. endorses 3-4 drinks 3 nights / week (on weekends), denies tabacco, endorses trying marijuana. sexually active with girlfrien

In [8]:
def loc_list_to_ints(loc_list):
    to_return = []
    for loc_str in loc_list:
        loc_strs = loc_str.split(";")
        for loc in loc_strs:
            start, end = loc.split()
            to_return.append((int(start), int(end)))
    return to_return

print(example["location_list"])
example_loc_ints = loc_list_to_ints(example["location_list"])[0]
print(example_loc_ints)
print(example["pn_history"][example_loc_ints[0] : example_loc_ints[1]])

['696 724']
(696, 724)
dad with recent heart attcak
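
`loc_list_to_ints` splits each entry on `;` before splitting on whitespace, which suggests that a single location string may hold several disjoint character ranges. The sketch below illustrates that case. The input `'70 91;176 183'` is a made-up example of the semicolon-separated form, not a row copied from the dataset, and the function is repeated here only so the block runs on its own.

```python
def loc_list_to_ints(loc_list):
    """Convert strings such as '70 91' or '70 91;176 183' into (start, end) tuples."""
    spans = []
    for loc_str in loc_list:
        for loc in loc_str.split(";"):
            start, end = loc.split()
            spans.append((int(start), int(end)))
    return spans

# Hypothetical inputs: one entry with a single range, one with two ranges joined by ';'.
single = loc_list_to_ints(["696 724"])
multi = loc_list_to_ints(["70 91;176 183"])
print(single)
print(multi)
```

Each range becomes its own tuple, so downstream code can treat every span uniformly regardless of how it was written in the CSV.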


## Build the Tokenizer

In [9]:
tokenizer = AutoTokenizer.from_pretrained(CFG.model)

In [10]:
# Questions:
# 1. Why not use doc_stride? -> addressed
# 2. Why use double instead of int for the labels? -> addressed
def tokenize_and_add_labels(tokenizer, example):
    tokenized_inputs = tokenizer(
        example["feature_text"],
        example["pn_history"],
        max_length = CFG.max_length,
        stride = CFG.doc_stride,
        padding = "max_length",
        truncation = "only_second",
        return_offsets_mapping = True
    )
    labels = [0.0] * len(tokenized_inputs["input_ids"])
    tokenized_inputs["location_int"] = loc_list_to_ints(example["location_list"])
    tokenized_inputs["sequence_ids"] = tokenized_inputs.sequence_ids()

    for idx, (seq_id, offsets) in enumerate(zip(tokenized_inputs["sequence_ids"], tokenized_inputs["offset_mapping"])):
        if seq_id is None or seq_id == 0:
            labels[idx] = -100
            continue
        token_start, token_end = offsets
        # Label the token 1.0 when it lies entirely inside any annotated character span.
        if any(token_start >= feature_start and token_end <= feature_end
               for feature_start, feature_end in tokenized_inputs["location_int"]):
            labels[idx] = 1.0
    tokenized_inputs["labels"] = labels
    
    return tokenized_inputs

In [11]:
tokenized_inputs = tokenize_and_add_labels(tokenizer, example)
for key in tokenized_inputs.keys():
    print(key)
    print(tokenized_inputs[key])
    print("=" * 100)

input_ids
[101, 1266, 1607, 1104, 1940, 132, 1266, 1607, 1104, 1139, 13335, 2881, 2916, 1107, 14794, 5796, 102, 6857, 1182, 131, 1542, 7490, 182, 8218, 1114, 185, 1348, 18965, 6006, 119, 5351, 3756, 124, 118, 125, 1808, 1104, 27946, 3426, 1104, 107, 1762, 5405, 120, 9683, 1149, 1104, 1139, 2229, 119, 107, 123, 1552, 2403, 1219, 170, 5862, 1342, 1125, 1126, 2004, 117, 1133, 1142, 1159, 1125, 2229, 2997, 1105, 1464, 1112, 1191, 1119, 1127, 1280, 1106, 2789, 1149, 113, 1225, 1136, 3857, 14255, 9589, 1757, 114, 119, 1104, 3805, 5351, 1322, 18649, 1116, 170, 7441, 1158, 5194, 21716, 1233, 117, 3120, 1106, 2025, 113, 122, 118, 124, 1551, 1679, 1989, 114, 119, 1196, 2793, 5862, 1342, 117, 1261, 5194, 1200, 4412, 1233, 1480, 1196, 1105, 2106, 1104, 1342, 119, 26360, 1603, 1757, 1104, 2184, 117, 4267, 25890, 12238, 1548, 117, 10880, 1116, 117, 11824, 1116, 117, 16320, 117, 18418, 117, 2607, 1107, 2946, 117, 2607, 1107, 4152, 120, 4510, 117, 24716, 185, 3984, 1179, 117, 2607, 1107, 7125, 1883, 1

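BERT needs only `input_ids` and `attention_mask` as inputs.

Labels are 1.0 for tokens that fall inside an annotated span, 0.0 for the other tokens of the patient note, and -100 for special tokens, padding, and the feature text, which are excluded from the loss.

This lets us train a binary classifier per token: does this token express the feature? 1 or 0.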
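
The labeling rule can be tried without loading a tokenizer. The following is a minimal sketch on hand-written offsets: the toy note, the offsets, and the helper name `label_tokens` are all assumptions for illustration, not output from `AutoTokenizer`.

```python
def label_tokens(offsets, sequence_ids, spans):
    """Assign -100 to tokens outside the note, 1.0 to tokens inside a span, else 0.0."""
    labels = []
    for (tok_start, tok_end), seq_id in zip(offsets, sequence_ids):
        if seq_id is None or seq_id == 0:
            # Special token, padding, or feature text: ignored in the loss.
            labels.append(-100)
        elif any(tok_start >= s and tok_end <= e for s, e in spans):
            labels.append(1.0)
        else:
            labels.append(0.0)
    return labels

# Toy note "dad had heart attack"; the annotation covers "heart attack" (chars 8 to 20).
# Layout: [special] feature-token [special] dad had heart attack [special]
offsets = [(0, 0), (0, 4), (0, 0), (0, 3), (4, 7), (8, 13), (14, 20), (0, 0)]
sequence_ids = [None, 0, None, 1, 1, 1, 1, None]
labels = label_tokens(offsets, sequence_ids, spans=[(8, 20)])
print(labels)
```

Only the two tokens that sit fully inside the annotated character range receive 1.0; a token that merely overlaps the boundary would stay at 0.0 under this rule.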

## Our Dataset

In [12]:
class NBMEData(torch.utils.data.Dataset):
    def __init__(self, data, tokenizer):
        self.data = data
        self.tokenizer = tokenizer
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        example = self.data.loc[idx]
        tokenized = tokenize_and_add_labels(self.tokenizer, example)

        input_ids = np.array(tokenized["input_ids"]) # for input BERT
        attention_mask = np.array(tokenized["attention_mask"]) # for input BERT
        labels = np.array(tokenized["labels"]) # for calculate loss and cv score

        offset_mapping = np.array(tokenized["offset_mapping"]) # for calculate cv score
        sequence_ids = np.array(tokenized["sequence_ids"]).astype("float16") # for calculate cv score
        
        return input_ids, attention_mask, labels, offset_mapping, sequence_ids

## The Actual Model Implementation

In [13]:
class NBMEModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(CFG.model) # BERT model
        self.dropout = torch.nn.Dropout(p = 0.2)
        self.classifier = torch.nn.Linear(768, 1) # BERT's last_hidden_state has shape (batch_size, sequence_length, 768)
    
    def forward(self, input_ids, attention_mask):
        last_hidden_state = self.backbone(input_ids = input_ids, attention_mask = attention_mask)[0] # idx 0 is last_hidden_state; backbone().last_hidden_state is also good
        logits = self.classifier(self.dropout(last_hidden_state)).squeeze(-1)
        return logits
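
The classifier applies the same linear unit to every token vector, so a batch of hidden states with shape (batch, sequence, hidden) becomes one logit per token. The sketch below shows that shape change in plain Python. The hidden size of 4, the weights, and the helper name `token_logits` are assumptions chosen for readability; the notebook's model uses a hidden size of 768 and learned weights.

```python
import random

HIDDEN_SIZE = 4  # assumption for readability; the notebook's classifier uses 768

def token_logits(hidden_states, weight, bias):
    """Apply one shared linear unit to every token vector: (batch, seq, hidden) -> (batch, seq)."""
    return [
        [sum(h * w for h, w in zip(token_vec, weight)) + bias for token_vec in sequence]
        for sequence in hidden_states
    ]

random.seed(0)
batch_size, seq_len = 2, 3
hidden_states = [
    [[random.uniform(-1, 1) for _ in range(HIDDEN_SIZE)] for _ in range(seq_len)]
    for _ in range(batch_size)
]
weight = [0.5, -0.25, 0.0, 1.0]  # hypothetical weights
bias = 0.1
logits = token_logits(hidden_states, weight, bias)
print(len(logits), len(logits[0]))  # batch size, then logits per sequence
```

Because the output has the same (batch, sequence) layout as the label tensor, the loss can be computed position by position.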

## Actual Model Training

In [14]:
def train_loop(fold):
    model = NBMEModel().to(CFG.device)
    #criterion = torch.nn.BCEWithLogitsLoss()
    optimizer = torch.optim.AdamW(model.parameters(), CFG.lr)

    train = df.loc[df["fold"] != fold].reset_index(drop = True)
    valid = df.loc[df["fold"] == fold].reset_index(drop = True)
    train_ds = NBMEData(train, tokenizer)
    valid_ds = NBMEData(valid, tokenizer)
    train_dl = torch.utils.data.DataLoader(train_ds, batch_size = CFG.batch_size, pin_memory = True, shuffle = True, drop_last = True)
    valid_dl = torch.utils.data.DataLoader(valid_ds, batch_size = CFG.batch_size * 2, pin_memory = True, shuffle = False, drop_last = False)
    
    return train_dl, valid_dl, model, optimizer

In [15]:
class AverageMeter(object):
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n = 1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def get_location_predictions(preds, offset_mapping, sequence_ids, test = False):
    all_predictions = []
    for pred, offsets, seq_ids in zip(preds, offset_mapping, sequence_ids):
        pred = sigmoid(pred)
        start_idx = None
        current_preds = []
        for p, o, s_id in zip(pred, offsets, seq_ids):
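            # sequence_ids arrive as float16 after batching, so None has become NaN.
            if s_id is None or np.isnan(s_id) or s_id == 0:
                continue
            if p > 0.75:
                if start_idx is None:
                    start_idx = o[0]
                end_idx = o[1]
            elif start_idx is not None:
                if test:
                    current_preds.append(f"{start_idx} {end_idx}")
                else:
                    current_preds.append((start_idx, end_idx))
                start_idx = None
        # Close a span that is still open after the last note token.
        if start_idx is not None:
            if test:
                current_preds.append(f"{start_idx} {end_idx}")
            else:
                current_preds.append((start_idx, end_idx))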
        if test:
            all_predictions.append("; ".join(current_preds))
        else:
            all_predictions.append(current_preds)
    return all_predictions

def calculate_char_CV(predictions, offset_mapping, sequence_ids, labels):
    all_labels = []
    all_preds = []
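    # token_labels avoids shadowing the labels argument inside the loop.
    for preds, offsets, seq_ids, token_labels in zip(predictions, offset_mapping, sequence_ids, labels):
        num_chars = max(list(chain(*offsets)))
        char_labels = np.zeros((num_chars))
        for o, s_id, label in zip(offsets, seq_ids, token_labels):
            # sequence_ids arrive as float16 after batching, so None has become NaN.
            if s_id is None or np.isnan(s_id) or s_id == 0:
                continue
            if int(label) == 1:
                char_labels[o[0]:o[1]] = 1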
        char_preds = np.zeros((num_chars))
        for start_idx, end_idx in preds:
            char_preds[start_idx:end_idx] = 1
        all_labels.extend(char_labels)
        all_preds.extend(char_preds)
    results = precision_recall_fscore_support(all_labels, all_preds, average = "binary")
    return {
        "precision": results[0],
        "recall": results[1],
        "f1": results[2]
    }
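
The decoding step in `get_location_predictions` merges consecutive tokens whose probability exceeds 0.75 into one character span. The following is a simplified sketch of that idea on made-up probabilities and offsets. The helper name `decode_spans` is hypothetical, sequence-id filtering is left out, and this version also closes a span that is still open after the last token.

```python
def decode_spans(probs, offsets, threshold=0.75):
    """Merge consecutive tokens with probability above threshold into (start, end) character spans."""
    spans = []
    start = end = None
    for p, (tok_start, tok_end) in zip(probs, offsets):
        if p > threshold:
            if start is None:
                start = tok_start
            end = tok_end
        elif start is not None:
            spans.append((start, end))
            start = None
    if start is not None:  # close a span that runs to the last token
        spans.append((start, end))
    return spans

# Made-up token probabilities and character offsets for five tokens.
probs = [0.10, 0.90, 0.80, 0.20, 0.95]
offsets = [(0, 3), (4, 9), (10, 16), (17, 20), (21, 25)]
print(decode_spans(probs, offsets))
```

Tokens two and three merge into one span, and the final token forms a second span on its own.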

In [16]:
def model_loop():    
    history = {}
    for fold in range(CFG.n_fold):
        print(f"========== fold: {fold} training ==========")
        train_dl, valid_dl, model, optimizer = train_loop(fold)
        history[fold] = {"train": [], "valid": []}
        best_loss = np.inf
        
        for epoch in range(CFG.epochs):
            print(f"========== EPOCH: {epoch} training ==========")
            #training
            model.train()
            train_loss = AverageMeter()
            pbar = tqdm(train_dl)
            for batch in pbar:
                optimizer.zero_grad()
                input_ids = batch[0].to(CFG.device)
                attention_mask = batch[1].to(CFG.device)
                labels = batch[2].to(CFG.device)
                offset_mapping = batch[3]
                sequence_ids = batch[4]
                logits = model(input_ids, attention_mask)
                loss_fct = torch.nn.BCEWithLogitsLoss(reduction = "none")
                loss = loss_fct(logits, labels)
                loss = torch.masked_select(loss, labels > -1).mean() # we should calculate at "pn_history"; labels at "feature_text" are -100 < -1
                loss.backward()
                optimizer.step()
                train_loss.update(val = loss.item(), n = len(input_ids))
                pbar.set_postfix(Loss = train_loss.avg)
            print(epoch, train_loss.avg)
            history[fold]["train"].append(train_loss.avg)

            #evaluation
            model.eval()
            valid_loss = AverageMeter()
            preds = []
            offsets = []
            seq_ids = []
            lbls = []
            with torch.no_grad():
                pbar = tqdm(valid_dl)  # fresh bar so set_postfix below updates the validation bar
                for batch in pbar:
                    input_ids = batch[0].to(CFG.device)
                    attention_mask = batch[1].to(CFG.device)
                    labels = batch[2].to(CFG.device)
                    offset_mapping = batch[3]
                    sequence_ids = batch[4]
                    logits = model(input_ids, attention_mask)
                    loss_fct = torch.nn.BCEWithLogitsLoss(reduction = "none")
                    loss = loss_fct(logits, labels)
                    loss = torch.masked_select(loss, labels > -1).mean()
                    valid_loss.update(val = loss.item(), n = len(input_ids))
                    pbar.set_postfix(Loss = valid_loss.avg)
                    preds.append(logits.cpu().numpy())
                    offsets.append(offset_mapping.numpy())
                    seq_ids.append(sequence_ids.numpy())
                    lbls.append(labels.cpu().numpy())
            print(epoch, valid_loss.avg)
            history[fold]["valid"].append(valid_loss.avg)          
            
            # save model
            if valid_loss.avg < best_loss:
                best_loss = valid_loss.avg
                torch.save(model.state_dict(), f"nbme_{fold}.pth")
                preds = np.concatenate(preds, axis = 0)
                # print(preds.shape)
                # print([preds[i][0] for i in range(preds.shape[0])])
                # print(preds)
                offsets = np.concatenate(offsets, axis = 0)
                seq_ids = np.concatenate(seq_ids, axis = 0)
                lbls = np.concatenate(lbls, axis = 0)
                location_preds= get_location_predictions(preds, offsets, seq_ids)
                # print(offsets.shape)
                # df.loc[df['fold'] == fold, 'location_prediction'] = written_predictions
                index = df[df['fold'] == fold].index
                df.loc[index,'location_prediction'] = pd.Series(location_preds, index=index)
                df.loc[index,'token_proba'] = pd.Series([list(preds[i]) for i in range(preds.shape[0])], index=index)
                df.loc[index,'token_offsets'] = pd.Series([list(offsets[i]) for i in range(offsets.shape[0])], index=index)
                score = calculate_char_CV(location_preds, offsets, seq_ids, lbls)
                # if epoch == 1:
                #     return location_preds
                print(score)
    print(history)
    return()
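
The loss in `model_loop` is computed per token and then averaged only over positions whose label is greater than -1, which drops the tokens labeled -100. The following plain-Python sketch shows the same masking on a few made-up logits. The helper names are hypothetical, and the per-token formula is the standard numerically stable form of binary cross-entropy on logits rather than a copy of PyTorch's implementation.

```python
import math

def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy for a single logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def masked_mean_loss(logits, labels):
    """Average the per-token loss over positions whose label is greater than -1."""
    kept = [bce_with_logits(x, y) for x, y in zip(logits, labels) if y > -1]
    return sum(kept) / len(kept)

# Made-up values: the first and last positions carry the ignore label -100.
logits = [0.0, 2.0, -1.0, 3.0]
labels = [-100, 1.0, 0.0, -100]
loss = masked_mean_loss(logits, labels)
print(loss)
```

Only the two middle positions contribute, so the logits at ignored positions have no effect on the result.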

## Run the Model Training in Full
Now that everything is set up, we can run the whole pipeline with one command.

In [19]:
model_loop()



Some weights of the model checkpoint at emilyalsentzer/Bio_ClinicalBERT were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).




  0%|          | 0/495 [00:00<?, ?it/s]

0 0.06715106574243122


  0%|          | 0/62 [00:00<?, ?it/s]

0 0.028155514836027337
{'precision': 0.8495557350565428, 'recall': 0.49760482583239696, 'f1': 0.6276060120091}
{0: {'train': [0.06715106574243122], 'valid': [0.028155514836027337]}}


()
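
The printed F1 is the harmonic mean of the printed precision and recall, which can be checked directly. The sketch below does that with the values copied from the output above; the helper name `f1_score` is hypothetical and is not the scikit-learn function used in the notebook.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values copied from the printed validation score above.
precision = 0.8495557350565428
recall = 0.49760482583239696
print(f1_score(precision, recall))
```

With precision near 0.85 and recall near 0.50, the predicted characters are mostly correct, but about half of the annotated characters are missed at the 0.75 threshold.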

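## Final Model Outputs
That completes the run, and the model outputs are shown above. Training more folds and epochs can improve these scores, but balance that carefully, because too much training can lead to overfitting. The model weights were already saved during training by `torch.save` as `nbme_{fold}.pth`; the last cell saves the DataFrame, including the predictions written for the validation fold, to a pickle file.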

In [20]:
df.to_pickle("df_pred_medical.pkl")