# Injury narrative classification using BERT, DistilBERT and Roberta with Hugging Face transformer library
This notebook shows how to use the tansformer library to fine tune BERT for Text classification on the Injury narrative dataset
___________________________________________________________________________________________________________________________________

## Overview and Problem statement

The National Institute for Occupational Safety and Health (NIOSH) is responsible for conducting research to reduce worker injuries and illnesses in the United States. Every year millions of Americans are injured on the job. The data from those events are collected in medical records and surveillance system as unstructured text heavy injury narratives. A better understanding of the characteristics of these incidents can help us prevent them in the future. However, having human read and manually code these injury narratives to analyze the data is expensive, time consuming, and error-prone, thus the idea to use NLP to auto-code the narratives.

In 2018 National Institute for Occupational Safety and Health (NIOSH), as part of the Centers for Disease Control and Prevention (CDC), along with NASA, worked with Topcoder to organize an intramural (within CDC ) and extramural (international) Natural language processing competition to classify unstructured free-text "injury narratives" recorded in surveillance system into injury codes from the Occupational Injuries and Illnesses Classification System (OIICS). This is known as a Multi-class text classification problem. For example the text 'DOING UNSPECIFIED LIFTING AT WORK AND DEVELOPED PAIN ACROSS CHEST CHEST PAIN' is coded by 71 which means 'Overexertion involving outside sources'. More details on the categories and event codes can be found [here](https://wwwn.cdc.gov/wisards/oiics/Trees/MultiTree.aspx?TreeType=Event).


***In this notebook we will use the dataset available on Topcoder and [Github](https://github.com/NASA-Tournament-Lab/CDC-NLP-Occ-Injury-Coding) to experiment text classification using BERT, DistilBERT and Roberta***

## Workflow

Our workflow will follow the generic pipeline for modern-day, data-driven NLP system development as describe in the book "Practical Natural Language Processing". The key stages in the pipeline are as follows:

1. Data acquisition
2. Text cleaning
3. Pre-processing
4. Feature engineering
5. Modeling
6. Evaluation
6. Deployment
7. Monitoring and model updating

However, we will only go from step 1 to step 6 evaluating 3 transformer algorithm. 

![NLP Pipeline](./assets/pnlp_0201r.png)

## Step 0 -  Install and Import Libraries

In [1]:
%%capture
!pip install transformers
!pip install nltk
!pip install -U tensorflow
!pip install -U sagemaker

In [2]:
# please ignore warning messages during the installation
#!pip install --disable-pip-version-check -q sagemaker==2.35.0
#!pip install --disable-pip-version-check -q transformers==3.5.1

In [162]:
#!jupyter nbextension enable --py widgetsnbextension

In [4]:
import nltk 

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [177]:
import pandas as pd
import tensorflow as tf
import re
import nltk
import string
from nltk import word_tokenize
from sklearn.model_selection import train_test_split
from tensorflow.keras import activations, optimizers, losses
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
import numpy as np
import transformers

In [178]:
print(tf.__version__)
print(transformers.__version__)

2.6.0
4.10.0


## Step 1 - Data Acquisition

### Dataset

The training dataset used in the competition can be downloaded from here and the test dataset from here. You can also download the complete dataset from Github. The training dataset includes 48 classifiable event codes distributed across 7 categories:
* Violence and other injuries by persons and animals
* Transportation incidents
* Fires and explosions
* Falls, slips, and trips
* Exposure to harmful substances or environments
* Contact with objects and equipment
* Overexertion and bodily reaction

More details on the categories and event codes can be found here.

In [7]:
## Set print limits
pd.options.display.max_colwidth = 3000
## Import Data
df_injury = pd.read_csv('./data/raw/train.csv')
df_injury

Unnamed: 0,text,sex,age,event
0,57YOM WITH CONTUSION TO FACE AFTER STRIKING IT WITH A POST POUNDER WHILE SETTING A FENCE POST,1,57,62
1,A 45YOM FELL ON ARM WHILE WORKING HAD SLIPPED ON WATER FX WRIST,1,45,42
2,58YOM WITH CERVICAL STRAIN BACK PAIN S P RESTRAINED TAXI DRIVER IN LOW SPEED REAR END MVC NO LOC NO AB DEPLOYED,1,58,26
3,33 YOM LAC TO HAND FROM A RAZOR KNIFE,1,33,60
4,53YOM AT WORK IN A WAREHOUSE DOING UNSPECIFIED LIFTING AND STRAINED LO WER BACK,1,53,71
...,...,...,...,...
153951,19YOF DOING UNSPECIFIED LIFTING AT WORK AND DEVELOPED PAIN ACROSS CHES T CHEST PAIN,2,19,71
153952,58 YOM ACCIDENTAL CONTACT WITH AN ELECTRIC SAW AT WORK BLEEDING FROM LEFT HAND WAS A MITER SAW DX FRACTURE HAND OPEN LACERATION HAND,1,58,63
153953,X 18 YOM GOT HIS HAND CAUGHT IN A DOUGH PRESS AT WORK DX FINGER CONTUSION,1,18,64
153954,53YOM WAS DOING SOME KIND OF WORK IN THE DRIVEWAY OF A PARKING LOT WAS HIT BY CAR COMING OUT OF PARKING LOT DX LT FOOT FX,1,53,24


In [8]:

def get_samples(ratio,train):
    great_than2_classes = train['event'].value_counts()[train['event'].value_counts() >2].index
    train = train[train['event'].isin(great_than2_classes.to_list())]
    train_samples, _ = train_test_split(train,train_size=ratio,random_state=42,stratify=train['event'])
    print("nb classes",train_samples['event'].nunique())
    print("nb oservations:",train_samples.shape)
    print(train_samples['event'].value_counts())
    
    return train_samples



In [9]:
def get_data(data_file,is_sample=None,ratio=None,is_test_split=False):
  
    train = pd.read_csv(data_file)
    if is_sample:
        train = get_samples(ratio = ratio, train=train)
    great_than2_classes = train['event'].value_counts()[train['event'].value_counts() >5].index 
    train_filter = train[train['event'].isin(great_than2_classes.to_list())]

    X = train_filter['text']
    y = train_filter['event']

    print(f"X.shape {X.shape} y.shape : {y.shape}")

    X_train_valid,X_test,y_train_valid, y_test = train_test_split(X,y,train_size=0.9,random_state=42,stratify=y)
    
    if is_test_split:
        X_train,X_valid,y_train,y_valid = train_test_split(X_train_valid,y_train_valid,train_size=0.8,random_state=42,stratify=y_train_valid)

    if is_test_split :
        print(f"X_train shape {X_train.shape} y_train shape : {y_train.shape}")
        print(f"X_valid shape {X_valid.shape} y_valid shape : {y_valid.shape}")
        print(f"X_test shape {X_test.shape} y_test shape : {y_test.shape}")
        
        return {
          'train': (X_train,y_train),
          'valid': (X_valid,y_valid),
          'test': (X_test,y_test)
      }

    else:
        print(f"X_train shape {X_train_valid.shape} y_train shape : {y_train_valid.shape}")
        print(f"X_valid shape {X_test.shape} y_valid shape : {y_test.shape}")
        
        return {
          'train': (X_train_valid,y_train_valid),
          'valid': (X_test,y_test),
      }


### Loading and spliting the training data

The dataset includes 153,956 records and 48 classes.For this experiment we did not use any class imbalanced methods , we only kept classes with more than five records and split the train data: 90% for training and 10 % for validation.This reduces the number of classes to 43.

In [30]:
pd.options.display.max_colwidth = 3100
data = get_data('./data/raw/train.csv',is_sample=True,ratio=0.05)
X_train, y_train = data['train']
X_valid,y_valid = data['valid']
#X_test,y_test = data['test']

nb classes 41
nb oservations: (7697, 4)
71    1295
62    1220
42     781
55     584
63     453
60     449
11     447
73     416
43     327
70     266
64     219
53     195
13     163
66     144
26     134
12     112
41      76
99      69
24      51
31      45
78      44
27      42
72      38
51      25
52      24
44      19
32      16
23      14
25       5
69       5
61       3
67       3
21       2
65       2
40       2
22       2
45       1
49       1
79       1
20       1
50       1
Name: event, dtype: int64
X.shape (7668,) y.shape : (7668,)
X_train shape (6901,) y_train shape : (6901,)
X_valid shape (767,) y_valid shape : (767,)


In [31]:
import numpy as np
print('classes in train :',len(np.unique(y_train)))
print('classes in valid :',len(np.unique(y_valid)))

CLASSES = y_train.unique().tolist()
print('CLASSSES : ',CLASSES)

classes in train : 28
classes in valid : 28
CLASSSES :  [62, 71, 63, 11, 43, 55, 42, 52, 60, 73, 13, 66, 12, 53, 64, 27, 24, 99, 26, 72, 70, 51, 44, 41, 31, 78, 32, 23]


##  Step 3 and 4 -  Cleaning and Pre-processing

We clean each narrative in training and validation dataset by removing HTML , tags, punctuation, stop words using NLTK.


In [32]:

def clean_text(text):
    """
    Applies some pre-processing on the given text.

    Steps :
    - Removing HTML tags
    - Removing punctuation
    - Lowering text
    """

    # remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # remove the characters [\], ['] and ["]
    text = re.sub(r"\\", "", text)
    text = re.sub(r"\'", "", text)
    text = re.sub(r"\"", "", text)

    # convert text to lowercase
    text = text.strip().lower()

    # remove all non-ASCII characters:
    text = re.sub(r'[^\x00-\x7f]', r'', text)

    # replace punctuation characters with spaces
    filters = '!"\'#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
    translate_dict = dict((c, " ") for c in filters)
    translate_map = str.maketrans(translate_dict)
    text = text.translate(translate_map)
    text = " ".join(text.split())
    return text

In [33]:
def remove_useless_words(text,useless_words):
    sentence = [word for word in word_tokenize(text)]
    sentence_stop = [word for word in sentence if word not in useless_words]

    text = " ".join(sentence_stop)

    return text

In [34]:
def preprocess_data(X):
    """ Preprocess : cleaning and remove stop words"""

    X = X.apply(lambda x: re.sub(r'\d+', '', x))
    X = X.apply(lambda x: clean_text(x))

    stopwords = nltk.corpus.stopwords.words('english')
    useless_words = stopwords + list(string.punctuation) + ['yom', 'yof', 'yowm', 'yf', 'ym', 'yo']
    # print("useless word : ",useless_words)
    X = X.apply(lambda x: remove_useless_words(x,useless_words))

    return X

In [35]:
X_train_processed = preprocess_data(X_train)
X_valid_processed = preprocess_data(X_valid)

print("after preprocessing...")
print(X_train_processed.head(20))

after preprocessing...
78811                                                                yomcontusion foot metal fell foot work
29443                            c hip pain pulling heavy trash bags work friday dx strain right hip flexor
146033                                                  hurt chest leaning fish aquarium dx contusion chest
92679                                                   assaulted work punched face customer contusion face
124907                    reports sus laceration rt palm lost footing whilecoming ladder dx palm laceration
110259                                                                    stuck self needle dx ppunc finger
62601                                                          unspecified lifting work strained lower back
152848                                h ms worsening leg numbness fell tdy work lle weakness ms excerbation
113306                            corneal keratitis person welding work pt denies fb eyes woke eyes burning
11120

##  Step 5 - Feature Engineering

### Tokenize Train and Validation data for DistilBERT
To get our features we tokenize our train and validation data using the DistilBertTokenizerFast. To use a BERT like model we need to tokenize the data in the format requested by BERT. Fortunately the transformers library provides the function to do so. DistilBertTokenizerFast converts each text into token_ids and attention_mask using the BERT vocabulary.

In [163]:
from transformers import  DistilBertTokenizerFast,BertTokenizerFast,AutoTokenizer

MAX_LEN = 45
MODEL_NAME = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

narratives = [X_train_processed.iloc[2],X_valid_processed.iloc[2]]
inputs = tokenizer(narratives, max_length=MAX_LEN, truncation=True, padding='max_length')

print(f"max_len:{MAX_LEN}")
print(f'narrative: \'{narratives}\'')
print(f'input ids: {inputs["input_ids"]}')
print(f'attention mask: {inputs["attention_mask"]}')

max_len:45
narrative: '['hurt chest leaning fish aquarium dx contusion chest', 'lbp rad post leg p mech fall parking lot work sciatica']'
input ids: [[101, 3480, 3108, 6729, 3869, 18257, 1040, 2595, 9530, 5809, 3258, 3108, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 6053, 2361, 10958, 2094, 2695, 4190, 1052, 2033, 2818, 2991, 5581, 2843, 2147, 16596, 12070, 2050, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
attention mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]


In [37]:
def construct_encodings(x, tkzr, max_len, trucation=True, padding=True):
    return tkzr(x, max_length=max_len, truncation=trucation, padding=padding)

x_train = X_train_processed.to_list()
x_valid = X_valid_processed.to_list()


train_encodings = tokenizer(x_train, max_length=MAX_LEN, truncation=True, padding='max_length',return_tensors='tf')
valid_encodings = tokenizer(x_valid, max_length=MAX_LEN, truncation=True, padding='max_length',return_tensors='tf')


In [38]:
print(MAX_LEN)
print(len(train_encodings['input_ids'][0]))
print(len(valid_encodings['input_ids'][0]))

45
45
45


### Encoding Labels for training and validation data

In [39]:
from sklearn.preprocessing import LabelEncoder

print("Encoding Labels .....")
encoder = LabelEncoder()
encoder.fit(y_train)
y_train_encode = np.asarray(encoder.transform(y_train))
y_valid_encode = np.asarray(encoder.transform(y_valid))

Encoding Labels .....


In [40]:
print('classes in train :',len(np.unique(y_train_encode)))
print('classes in valid :',len(np.unique(y_valid_encode)))

classes in train : 28
classes in valid : 28


### Convert our data into TensorFlow Dataset
To improve training performance we convert the data to Tensorflow dataset.

In [86]:
%%time
def construct_tfdataset(encodings, y=None):
    if y is not None:
        return tf.data.Dataset.from_tensor_slices((dict(encodings),y))
    else:
        # this case is used when making predictions on unseen samples after training
        return tf.data.Dataset.from_tensor_slices(dict(encodings))
    
train_tfdataset = construct_tfdataset(train_encodings, y_train_encode)
valid_tfdataset = construct_tfdataset(valid_encodings, y_valid_encode)

CPU times: user 2.56 ms, sys: 94 µs, total: 2.65 ms
Wall time: 2.08 ms


In [87]:
print(train_tfdataset)
print(valid_tfdataset)


<TensorSliceDataset shapes: ({input_ids: (45,), attention_mask: (45,)}, ()), types: ({input_ids: tf.int32, attention_mask: tf.int32}, tf.int64)>
<TensorSliceDataset shapes: ({input_ids: (45,), attention_mask: (45,)}, ()), types: ({input_ids: tf.int32, attention_mask: tf.int32}, tf.int64)>


In [88]:
N_EPOCHS= 1
BATCH_SIZE = 16 
steps_per_epoch = len(X_train) // BATCH_SIZE

train_tfdataset = train_tfdataset.repeat(N_EPOCHS * steps_per_epoch)
train_tfdataset = train_tfdataset.prefetch(tf.data.AUTOTUNE)
train_tfdataset = train_tfdataset.batch(BATCH_SIZE)

In [89]:
valid_batch_size = 16
validation_steps = len(X_valid) // valid_batch_size

valid_tfdataset = valid_tfdataset.repeat(N_EPOCHS * validation_steps)
valid_tfdataset = valid_tfdataset.prefetch(tf.data.AUTOTUNE)
valid_tfdataset = valid_tfdataset.batch(validation_steps)

In [90]:
print(train_tfdataset)
print(valid_tfdataset)

<BatchDataset shapes: ({input_ids: (None, 45), attention_mask: (None, 45)}, (None,)), types: ({input_ids: tf.int32, attention_mask: tf.int32}, tf.int64)>
<BatchDataset shapes: ({input_ids: (None, 45), attention_mask: (None, 45)}, (None,)), types: ({input_ids: tf.int32, attention_mask: tf.int32}, tf.int64)>


## Step 6 - Modeling and Training 

###  6.1 Fine Tune DistilBERT
Fine-tuning is the process of leveraging the knowledge from pre-trained weights by initializing your model with those weights and training it for your specific downstream task. In our case we used the 67M pre-trained parameters from DistilBERT and added on top of it a classifier using Convolutional 1D Neural Network.

In [154]:
from transformers import  TFDistilBertForSequenceClassification
from tensorflow.keras.optimizers.schedules import PolynomialDecay
from transformers.optimization_tf import WarmUp, AdamWeightDecay

class DistillBERT:
    def __init__(self, params):
        self.num_labels = params['num_labels']
        self.max_len = params['max_len']
        self.learning_rate = params['learning_rate']
        self.num_records = params['num_records']
        self.batch_size = params['batch_size']
        self.epochs = params['epochs']
        
    def build(self):
        
        freeze_bert_layer = False
        
        # Load model and tokenizer
        transformer_model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=self.num_labels)
        input_ids = tf.keras.Input(shape = (self.max_len,),name='input_ids',dtype='int32')
        attention_mask = tf.keras.Input(shape = (self.max_len,),name='attention_mask',dtype='int32')
        
        embbeding_layer = transformer_model.distilbert(input_ids,attention_mask=attention_mask)[0]
        #X = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256,return_sequences=True,dropout=0.1,recurrent_dropout=0.1))(embbeding_layer)
        X = tf.keras.layers.Conv1D(512, 3, activation='relu')(embbeding_layer)
        X = tf.keras.layers.GlobalMaxPool1D()(X)
        X = tf.keras.layers.Dense(256, activation='relu')(X)
        X = tf.keras.layers.Dropout(0.2)(X)
        X = tf.keras.layers.Dense(self.num_labels, activation='softmax')(X)
        
        model = tf.keras.Model(inputs=[input_ids,attention_mask], outputs = X)
        
        for layer in model.layers[:3]:
            layer.trainable = not freeze_bert_layer

        num_train_steps = (self.num_records // self.batch_size) * self.epochs
        decay_schedule_fn = PolynomialDecay(
            initial_learning_rate=self.learning_rate,
            end_learning_rate=0.,
            decay_steps=num_train_steps
            )
        
        warmup_steps  = num_train_steps * 0.1
        warmup_schedule = WarmUp(self.learning_rate,decay_schedule_fn,warmup_steps)

          # fine optimizer and loss
        #optimizer = tf.keras.optimizers.Adam(learning_rate=warmup_schedule,epsilon=1e-06)

        optimizer = AdamWeightDecay(learning_rate=warmup_schedule, weight_decay_rate=0.01, epsilon=1e-6)
        loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits= False)
        metrics = ['acc']
        model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

        return model

#### Model Summary( DistilBERT)

We used the TFDistilBertForSequenceClassification class provided by the transformer library. We also used recommended hyperparameters (after several tuning experiments) suggested by Chi Sun et al 2019 and Mosbash et al 2020. We trained for 5 epochs with a batch size 16. We used slanted triangular learning rates (Howard and Ruder, 2018) with a base learning rate of 5e-5 and a warmup portion 0.1. We used AdamWeightDecay optimizer with a weight_decay_rate = 0.01 and epsilon 1e-6. We trained all layers which includes the complete 67M+ parameters of our model.


***We will use same configuration for BERT and Roberta***

In [210]:
LEARNING_RATE = 5e-5
num_records = len(X_train)
num_valid_records = len(X_valid)
max_len = MAX_LEN
epochs = 1
batch_size = 16
valid_batch_size = 16
steps_per_epoch = num_records // batch_size
validation_steps = num_valid_records // valid_batch_size

params = {
        'num_labels': len(CLASSES),
        'max_len': MAX_LEN,
        'learning_rate': LEARNING_RATE,
        'num_records':len(X_train),
        'batch_size':batch_size,
        'epochs':epochs
    }
  
model = DistillBERT(params).build()   
print(model.summary())


Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_projector', 'activation_13', 'vocab_layer_norm', 'vocab_transform']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier', 'classifier', 'dropout_261']
You should probably TRAIN this model on a down-stream task to be able to use 

Model: "model_9"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 45)]         0                                            
__________________________________________________________________________________________________
attention_mask (InputLayer)     [(None, 45)]         0                                            
__________________________________________________________________________________________________
distilbert (TFDistilBertMainLay TFBaseModelOutput(la 66362880    input_ids[0][0]                  
                                                                 attention_mask[0][0]             
__________________________________________________________________________________________________
conv1d_8 (Conv1D)               (None, 43, 512)      1180160     distilbert[0][0]           

#### Training / Finetune DistilBERT

In [211]:
%%time
model.fit(x=train_tfdataset,
          steps_per_epoch = steps_per_epoch,
          batch_size=batch_size, 
          epochs=epochs,
          validation_steps = validation_steps,
          validation_data=valid_tfdataset)

CPU times: user 1h 1min 28s, sys: 4min 17s, total: 1h 5min 46s
Wall time: 9min 58s


<keras.callbacks.History at 0x7f5d32090790>

#### Evaluate validation DistilBERT

In [212]:
num_valid_records = len(X_valid)
steps_per_epoch = num_records // batch_size
validation_steps = num_valid_records // valid_batch_size
print(num_valid_records)
print(steps_per_epoch)
print(validation_steps)

767
431
47


In [213]:
%%time
distil_bert_valid_score = model.evaluate(x=valid_tfdataset, steps = validation_steps,return_dict=True)
distil_bert_valid_score

CPU times: user 4min 42s, sys: 1 s, total: 4min 43s
Wall time: 40.6 s


{'loss': 0.8851924538612366, 'acc': 0.7315527200698853}

#### Evaluate Training DistilBERT

In [214]:
num_records = len(X_train)
steps_per_epoch = num_records // batch_size
print(num_records)
print(steps_per_epoch)

6901
431


In [215]:
%%time
distil_bert_train_score = model.evaluate(x=train_tfdataset,steps = steps_per_epoch, return_dict=True)
distil_bert_train_score

CPU times: user 15min 6s, sys: 6.89 s, total: 15min 13s
Wall time: 2min 19s


{'loss': 0.6924773454666138, 'acc': 0.7942285537719727}

In [216]:
import os
os.makedirs('./output/',exist_ok = True)
model.save("./output/distilbert.h5")

In [217]:
## Create dict to store all these results:
result_scores = {}

## Score the Model on Training and Testing Set
result_scores['DistilBERT'] =(distil_bert_train_score,distil_bert_valid_score)

In [218]:
result_scores

{'DistilBERT': ({'loss': 0.6924773454666138, 'acc': 0.7942285537719727},
  {'loss': 0.8851924538612366, 'acc': 0.7315527200698853})}

In [219]:
## Create Function to Print Results
def get_results(x1):
    print("\n{0:20}   {1:4}    {2:4} ".format('Model','Train','Validation'))
    print('-------------------------------------------')
    for key in x1.keys():
        print("{0:20}   {1:<6.4}   {2:<6.4}".format(key,x1[key][0]['acc'],x1[key][1]['acc']))

In [220]:
get_results(result_scores)


Model                  Train    Validation 
-------------------------------------------
DistilBERT             0.7942   0.7316


### Step 6.2 - FineTune  BaseBERT

In [229]:
from transformers import TFRobertaForSequenceClassification,TFBertForSequenceClassification
from tensorflow.keras.optimizers.schedules import PolynomialDecay
from transformers.optimization_tf import WarmUp, AdamWeightDecay
class BaseBERT:
    def __init__(self, params):
        self.num_labels = params['num_labels']
        self.max_len = params['max_len']
        self.learning_rate = params['learning_rate']
        self.num_records = params['num_records']
        self.batch_size = params['batch_size']
        self.epochs = params['epochs']
        
    def build(self):
        
        freeze_bert_layer = False
        
        # Load model and tokenizer
        transformer_model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=self.num_labels)
        input_ids = tf.keras.Input(shape = (self.max_len,),name='input_ids',dtype='int32')
        attention_mask = tf.keras.Input(shape = (self.max_len,),name='attention_mask',dtype='int32')
        
        embbeding_layer = transformer_model.bert(input_ids,attention_mask=attention_mask)[0]
        #X = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256,return_sequences=True,dropout=0.1,recurrent_dropout=0.1))(embbeding_layer)
        X = tf.keras.layers.Conv1D(512, 3, activation='relu')(embbeding_layer)
        X = tf.keras.layers.GlobalMaxPool1D()(X)
        X = tf.keras.layers.Dense(256, activation='relu')(X)
        X = tf.keras.layers.Dropout(0.2)(X)
        X = tf.keras.layers.Dense(self.num_labels, activation='softmax')(X)
        
        model = tf.keras.Model(inputs=[input_ids,attention_mask], outputs = X)
        
        for layer in model.layers[:3]:
            layer.trainable = not freeze_bert_layer

        num_train_steps = (self.num_records // self.batch_size) * self.epochs
        decay_schedule_fn = PolynomialDecay(
            initial_learning_rate=self.learning_rate,
            end_learning_rate=0.,
            decay_steps=num_train_steps
            )
        
        warmup_steps  = num_train_steps * 0.1
        warmup_schedule = WarmUp(self.learning_rate,decay_schedule_fn,warmup_steps)

          # fine optimizer and loss
        #optimizer = tf.keras.optimizers.Adam(learning_rate=warmup_schedule,epsilon=1e-06)

        optimizer = AdamWeightDecay(learning_rate=warmup_schedule, weight_decay_rate=0.01, epsilon=1e-6)
        loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits= False)
        metrics = ['acc']
        model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

        return model

####  Model Summary (BaseBERT)

In [230]:
LEARNING_RATE = 5e-5
num_records = len(X_train)
num_valid_records = len(X_valid)
max_len = MAX_LEN
epochs = 1
batch_size = 16
valid_batch_size = 16
steps_per_epoch = num_records // batch_size
validation_steps = num_valid_records // valid_batch_size
params = {
        'num_labels': len(CLASSES),
        'max_len': MAX_LEN,
        'learning_rate': LEARNING_RATE,
        'num_records':len(X_train),
        'batch_size':16,
        'epochs':1
    }
  
model2 = BaseBERT(params).build()   
print(model2.summary())

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model: "model_10"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 45)]         0                                            
__________________________________________________________________________________________________
attention_mask (InputLayer)     [(None, 45)]         0                                            
__________________________________________________________________________________________________
bert (TFBertMainLayer)          TFBaseModelOutputWit 109482240   input_ids[0][0]                  
                                                                 attention_mask[0][0]             
__________________________________________________________________________________________________
conv1d_9 (Conv1D)               (None, 43, 512)      1180160     bert[0][0]                

#### Training BaseBERT

In [231]:
%%time
model2.fit(x=train_tfdataset,
          steps_per_epoch = steps_per_epoch,
          batch_size=batch_size, 
          epochs=epochs,
          validation_steps = validation_steps,
          validation_data=valid_tfdataset)

CPU times: user 1h 57min 13s, sys: 5min 11s, total: 2h 2min 24s
Wall time: 18min 11s


<keras.callbacks.History at 0x7f5e02f33610>

In [None]:
#### Evaluate Validation BaseBERT

In [232]:
%%time
bert_valid_score = model2.evaluate(x=valid_tfdataset, steps = validation_steps,return_dict=True)
bert_valid_score

CPU times: user 9min 18s, sys: 1.3 s, total: 9min 19s
Wall time: 1min 20s


{'loss': 0.8822953701019287, 'acc': 0.735174298286438}

#### Evaluate Training BaseBERT

In [233]:
%%time
bert_train_score = model2.evaluate(x=train_tfdataset,steps = steps_per_epoch, return_dict=True)
bert_train_score

CPU times: user 29min 40s, sys: 12 s, total: 29min 52s
Wall time: 4min 32s


{'loss': 0.6526983380317688, 'acc': 0.799593985080719}

In [234]:
os.makedirs('./output/',exist_ok = True)
model2.save("./output/bert.h5")

In [235]:
## Score the Model on Training and Testing Set
result_scores['BERT'] =(bert_train_score,bert_valid_score)

In [236]:
get_results(result_scores)


Model                  Train    Validation 
-------------------------------------------
DistilBERT             0.7942   0.7316
BERT                   0.7996   0.7352


###  Step 6.3 -  FineTune  Base Roberta

In [241]:
#import logging
#transformers.logging.get_verbosity = lambda: logging.NOTSET

In [242]:
from transformers import  DistilBertTokenizerFast,BertTokenizerFast,AutoTokenizer,RobertaTokenizer

MAX_LEN = 45
MODEL_NAME = 'roberta-base'
#MODEL_NAME = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

narratives = [X_train_processed.iloc[2],X_valid_processed.iloc[2]]
inputs = tokenizer(narratives, max_length=MAX_LEN, truncation=True, padding='max_length')

print(f"max_len:{MAX_LEN}")
print(f'narrative: \'{narratives}\'')
print(f'input ids: {inputs["input_ids"]}')
print(f'attention mask: {inputs["attention_mask"]}')

max_len:45
narrative: '['hurt chest leaning fish aquarium dx contusion chest', 'lbp rad post leg p mech fall parking lot work sciatica']'
input ids: [[0, 298, 7363, 7050, 19146, 3539, 33734, 49386, 8541, 15727, 7050, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 17243, 642, 13206, 618, 2985, 181, 46833, 1136, 2932, 319, 173, 19974, 5183, 102, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
attention mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]


In [243]:
train_tfdataset = construct_tfdataset(train_encodings, y_train_encode)
valid_tfdataset = construct_tfdataset(valid_encodings, y_valid_encode)

In [244]:
N_EPOCHS= 1
BATCH_SIZE = 16 
steps_per_epoch = len(X_train) // BATCH_SIZE

train_tfdataset = train_tfdataset.repeat(N_EPOCHS * steps_per_epoch)
train_tfdataset = train_tfdataset.prefetch(tf.data.AUTOTUNE)
train_tfdataset = train_tfdataset.batch(BATCH_SIZE)

In [245]:
valid_batch_size = 16
validation_steps = len(X_valid) // valid_batch_size

valid_tfdataset = valid_tfdataset.repeat(N_EPOCHS * validation_steps)
valid_tfdataset = valid_tfdataset.prefetch(tf.data.AUTOTUNE)
valid_tfdataset = valid_tfdataset.batch(validation_steps)

In [246]:
from transformers import TFRobertaForSequenceClassification,TFBertForSequenceClassification

class BaseRoberta:
    
    def __init__(self, params):
        self.num_labels = params['num_labels']
        self.max_len = params['max_len']
        self.learning_rate = params['learning_rate']
        self.batch_size = params['batch_size']
        self.epochs = params['epochs']
        self.num_records = params['num_records']
        
    def build(self):      
        
        freeze_bert_layer = False
        
        transformer_model = TFRobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=self.num_labels)
        input_ids = tf.keras.Input(shape = (self.max_len,),name='input_ids',dtype='int32')
        attention_mask = tf.keras.Input(shape = (self.max_len,),name='attention_mask',dtype='int32')
        
        embbeding_layer = transformer_model.roberta(input_ids,attention_mask=attention_mask)[0]
        X = tf.keras.layers.GlobalMaxPool1D()(embbeding_layer)
        X = tf.keras.layers.Dense(self.num_labels, activation='softmax')(X)
        
        model = tf.keras.Model(inputs=[input_ids,attention_mask], outputs = X)
        
        for layer in model.layers[:3]:
            layer.trainable = not freeze_bert_layer
        
        num_train_steps = (self.num_records // self.batch_size) * self.epochs
        decay_schedule_fn = PolynomialDecay(
            initial_learning_rate=self.learning_rate,
            end_learning_rate=0.,
            decay_steps=num_train_steps
            )
        
        warmup_steps = num_train_steps * 0.1
        warmup_schedule  = WarmUp(self.learning_rate,decay_schedule_fn,warmup_steps)
        
        #optimizer = tf.keras.optimizers.Adam(learning_rate=self.learning_rate)
        optimizer = AdamWeightDecay (learning_rate=warmup_schedule, weight_decay_rate=0.01, epsilon=1e-6)
        loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
        
        model.compile(optimizer=optimizer, loss=loss, metrics=['acc'])
        
        return model

#### Model Summary ( Base Roberta)

In [247]:
LEARNING_RATE = 5e-5
num_records = len(X_train)
num_valid_records = len(X_valid)
max_len = MAX_LEN
epochs = 1
batch_size = 16
valid_batch_size = 16
steps_per_epoch = num_records // batch_size
validation_steps = num_valid_records // valid_batch_size
params = {
        'num_labels': len(CLASSES),
        'max_len': MAX_LEN,
        'learning_rate': LEARNING_RATE,
        'num_records':len(X_train),
        'batch_size':batch_size,
        'epochs':epochs
    }
  
model3 = BaseRoberta(params).build()   
print(model3.summary())

All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

Some layers of TFRobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model: "model_11"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 45)]         0                                            
__________________________________________________________________________________________________
attention_mask (InputLayer)     [(None, 45)]         0                                            
__________________________________________________________________________________________________
roberta (TFRobertaMainLayer)    TFBaseModelOutputWit 124055040   input_ids[0][0]                  
                                                                 attention_mask[0][0]             
__________________________________________________________________________________________________
global_max_pooling1d_11 (Global (None, 768)          0           roberta[0][0]             

#### Training BaseRoberta

In [248]:
%%time
model3.fit(x=train_tfdataset,
          steps_per_epoch = steps_per_epoch,
          batch_size=batch_size, 
          epochs=epochs,
          validation_steps = validation_steps,
          validation_data=valid_tfdataset)

CPU times: user 1h 58min 39s, sys: 7min 41s, total: 2h 6min 20s
Wall time: 19min 1s


<keras.callbacks.History at 0x7f5e28ba3290>

#### Evaluate validation Base Roberta

In [249]:
%%time
roberta_valid_score = model3.evaluate(x=valid_tfdataset, steps = validation_steps,return_dict=True)
roberta_valid_score

CPU times: user 9min 17s, sys: 1.98 s, total: 9min 19s
Wall time: 1min 20s


{'loss': 1.5880324840545654, 'acc': 0.5215029716491699}

#### Evaluate Training data Base Roberta

In [250]:
%%time
roberta_train_score = model3.evaluate(x=train_tfdataset,steps = steps_per_epoch, return_dict=True)
roberta_train_score

CPU times: user 28min 52s, sys: 12.6 s, total: 29min 5s
Wall time: 4min 25s


{'loss': 1.441288709640503, 'acc': 0.5292923450469971}

In [251]:
os.makedirs('./output/',exist_ok = True)
model.save('./output/roberta.h5')

In [252]:
## Score the Model on Training and Testing Set
result_scores['Roberta'] =(roberta_train_score,roberta_valid_score)

In [253]:
get_results(result_scores)


Model                  Train    Validation 
-------------------------------------------
DistilBERT             0.7942   0.7316
BERT                   0.7996   0.7352
Roberta                0.5293   0.5215


## Step 7 - Evaluation on Test data

### Get Test data

In [221]:
def get_test_data(test_data_file, is_sample, ratio):
    
    test_data = pd.read_csv(test_data_file)
    test_data = test_data[~test_data['event'].isin([10,29,30,59,74])]
    test_data = test_data[['text','event']]
    
    if is_sample:
        test_data = get_samples(ratio = ratio, train=test_data) 
        great_than5_classes = test_data['event'].value_counts()[test_data['event'].value_counts() >5].index
        test_data = test_data[test_data['event'].isin(great_than5_classes.to_list())]
    
    print("nb classes in final data:",test_data['event'].nunique())
    print(f"test_data_small.shape {test_data.shape}")
    
    return test_data

In [222]:
test_data = get_test_data('./data/raw/test_data.csv',is_sample=True, ratio=0.05)
X_test, y_test = test_data['text'], test_data['event']
X_test_processed = preprocess_data(X_test)
X_test_processed

nb classes 39
nb oservations: (3784, 2)
71    634
62    601
42    385
55    284
63    223
60    221
11    220
73    205
43    161
70    130
64    108
53     96
13     80
66     71
26     66
12     55
41     37
99     34
24     25
31     22
78     22
27     21
72     19
51     12
52     12
44      9
32      8
23      7
69      3
25      3
67      2
40      1
22      1
65      1
50      1
61      1
49      1
45      1
21      1
Name: event, dtype: int64
nb classes in final data: 28
test_data_small.shape (3768, 2)


37427                                                              male hurt bending work dx knee pain b
3526                                    works construction door fell hitting head loc c neck pain chi ms
8292       c l finger pain work l th digit removing door panel crushed dx finger contu subungal hematoma
63604                                                 wks lows heavy lifting h worsening lbp atypical cp
11228    f pt work yesterday slipped fell onto floor hitting head loc altered mental status today dx chi
                                                      ...                                               
36432                               drives subject bus lots lifting pushing people wheelchairs back pain
16967                                          work handling concrete got rash hands contact dermat itis
49205                                                                                sexual assault work
2439                               work hit open freeze

In [223]:
train_classes =y_train.unique().tolist()
test_classes = y_test.unique().tolist()

print(sorted(train_classes))
print(sorted(test_classes))

[11, 12, 13, 23, 24, 26, 27, 31, 32, 41, 42, 43, 44, 51, 52, 53, 55, 60, 62, 63, 64, 66, 70, 71, 72, 73, 78, 99]
[11, 12, 13, 23, 24, 26, 27, 31, 32, 41, 42, 43, 44, 51, 52, 53, 55, 60, 62, 63, 64, 66, 70, 71, 72, 73, 78, 99]


### Load the model

In [259]:
loaded_model = tf.keras.models.load_model('./output/roberta.h5',custom_objects={'AdamWeightDecay':AdamWeightDecay})

### Calculate predictions

In [260]:
%%time
MAX_LEN = 45
model_name = 'bert-base-uncased'

x_test = X_test_processed.reset_index(drop=True).tolist()
y_test = y_test.reset_index(drop=True)



CPU times: user 1.41 ms, sys: 57 µs, total: 1.47 ms
Wall time: 1.16 ms


### Evaluate test

In [261]:
MODEL_NAME = 'roberta-base'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

encodings_x_test =  tokenizer(x_test, max_length=MAX_LEN, truncation=True, padding='max_length',return_tensors='tf')
y_test_encode = np.asarray(encoder.transform(y_test))
tfdataset_test = construct_tfdataset(encodings_x_test,y_test_encode).batch(16)

print("Evaluating Test data...")
test_score = loaded_model.evaluate(tfdataset_test, steps = validation_steps,batch_size=16)
print("Test loss: ", test_score[0])
print("Test accuracy: ", test_score[1])

Evaluating Test data...


InvalidArgumentError:  indices[14,11] = 49386 is not in [0, 30522)
	 [[node model_9/distilbert/embeddings/Gather (defined at /usr/local/lib/python3.7/site-packages/transformers/models/distilbert/modeling_tf_distilbert.py:112) ]] [Op:__inference_test_function_299355]

Errors may have originated from an input operation.
Input Source operations connected to node model_9/distilbert/embeddings/Gather:
 IteratorGetNext (defined at <ipython-input-261-6e8d573a5fa9>:9)

Function call stack:
test_function
