## Label Injury Narratives Using Generative and Deep Learning Models
This project compares the performance of generative and deep learning models on document classification using real-world and synthesized data. The classification task is to label each injury narrative with the type of injury it describes.

### Dataset
The dataset comes from a competition organized by NASA Tournament Lab and the National Institute for Occupational Safety & Health (NIOSH). The goal is to automate data processing in occupational safety and health (OSH) surveillance systems. Specifically, given a free-text injury report, such as "_worker fell from the ladder after reaching out for a box._", the task is to **assign an injury code** from the Occupational Injuries and Illnesses Classification System (OIICS). Details of the task and competition can be found in a [blog post](https://blogs.cdc.gov/niosh-science-blog/2020/02/26/ai-crowdsourcing/) by the CDC (Centers for Disease Control and Prevention). The dataset was downloaded from [Hugging Face](https://huggingface.co/datasets/mayerantoine/injury-narrative-coding). The winning solutions can be found on [NASA Tournament Lab's GitHub page](https://github.com/NASA-Tournament-Lab/CDC-NLP-Occ-Injury-Coding).

In [11]:
# import libraries
#!pip install -r requirements.txt
import torch
import pandas as pd
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
# the optimizer is taken from torch.optim further down, so AdamW is not imported here
from transformers import BertTokenizer, BertForSequenceClassification
from tqdm import tqdm

In [12]:
# load dataset
Dataset = pd.read_csv('./Data/full_dataset.csv')
train_set = Dataset.sample(frac=0.8, random_state=3275)
dev_set = Dataset.drop(train_set.index)
test_set = dev_set.sample(frac=0.5, random_state=3276)
dev_set = dev_set.drop(test_set.index)

train_set = train_set.reset_index(drop=True)
dev_set = dev_set.reset_index(drop=True)
test_set = test_set.reset_index(drop=True)

print(train_set.shape, dev_set.shape, test_set.shape)

(183856, 5) (22982, 5) (22982, 5)
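
The split above is an 80/10/10 partition built by sampling and then dropping the sampled index. As a minimal sketch of why the three parts do not overlap, the same steps can be replayed on a small synthetic frame (the `toy` DataFrame and its contents are made up for illustration and are not part of the real dataset):

```python
import pandas as pd

# hypothetical stand-in for full_dataset.csv: 100 rows with a text and an event column
toy = pd.DataFrame({
    "text": [f"narrative {i}" for i in range(100)],
    "event": [i % 4 for i in range(100)],
})

# same recipe as above: 80% train, then split the remainder in half
train = toy.sample(frac=0.8, random_state=3275)
rest = toy.drop(train.index)
test = rest.sample(frac=0.5, random_state=3276)
dev = rest.drop(test.index)

# the three index sets are pairwise disjoint and together cover every row
disjoint = (
    set(train.index).isdisjoint(dev.index)
    and set(train.index).isdisjoint(test.index)
    and set(dev.index).isdisjoint(test.index)
)
covers_all = len(train) + len(dev) + len(test) == len(toy)
sizes = (len(train), len(dev), len(test))
print(sizes, disjoint, covers_all)
```

Note that this recipe samples uniformly at random, so rare event codes are not guaranteed to appear in every split.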


### Preprocessing

In [13]:
# load pretrained tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
def tokenize(text):
    return tokenizer(
        text,
        padding='max_length',
        truncation=True,
        return_tensors="pt"
    )


# tokenize dataset
TrainTextTokenized = train_set['text'].apply(tokenize)
DevTextTokenized = dev_set['text'].apply(tokenize)
TestTextTokenized = test_set['text'].apply(tokenize)



In [None]:
# extract the event codes used as classification labels
train_labels = train_set['event'].values
dev_labels = dev_set['event'].values
test_labels = test_set['event'].values

# convert tokenized dataset to tensor
train_inputs = torch.cat([example["input_ids"] for example in TrainTextTokenized], dim=0)
dev_inputs = torch.cat([example["input_ids"] for example in DevTextTokenized], dim=0)
test_inputs = torch.cat([example["input_ids"] for example in TestTextTokenized], dim=0)

train_labels = torch.tensor(train_labels)
dev_labels = torch.tensor(dev_labels)
test_labels = torch.tensor(test_labels)
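
`nn.CrossEntropyLoss` expects integer class indices in the range `0 .. num_labels - 1`. If the `event` column holds raw codes that are not already contiguous (an assumption here, since the column's encoding is not shown above), they would need to be remapped first. A minimal sketch with made-up codes; `raw_train`, `raw_dev`, and `code_to_index` are hypothetical names:

```python
import torch
import torch.nn as nn

# hypothetical raw codes, standing in for train_set['event'].values and dev_set['event'].values
raw_train = [71, 62, 11, 71, 42]
raw_dev = [62, 11]

# build the mapping from the training codes only, in sorted order
code_to_index = {code: i for i, code in enumerate(sorted(set(raw_train)))}
index_to_code = {i: code for code, i in code_to_index.items()}

train_targets = torch.tensor([code_to_index[c] for c in raw_train])
dev_targets = torch.tensor([code_to_index[c] for c in raw_dev])

# every target is now a valid class index for a model with num_classes outputs
num_classes = len(code_to_index)
dummy_logits = torch.zeros(len(train_targets), num_classes)
loss = nn.CrossEntropyLoss()(dummy_logits, train_targets)
print(train_targets.tolist(), dev_targets.tolist(), num_classes)
```

A code that appears only in the dev or test split would raise a `KeyError` in this sketch, so in practice the mapping may need to be built from the full dataset. `index_to_code` converts predictions back to the original codes.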

In [None]:
# create tensor dataset

train_dataset = TensorDataset(train_inputs, train_labels)
dev_dataset = TensorDataset(dev_inputs, dev_labels)
test_dataset = TensorDataset(test_inputs, test_labels)

batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
# evaluation does not benefit from shuffling, so keep dev/test order fixed
dev_loader = DataLoader(dev_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
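
Because `tokenize` uses `padding='max_length'`, every narrative is padded to the model's maximum length, so short reports carry many all-padding columns through each forward pass. One possible mitigation, sketched here as an assumption rather than as part of the original pipeline, is a custom `collate_fn` that trims trailing padding per batch. `trim_padding_collate` is a hypothetical helper, and `PAD_ID = 0` assumes the `[PAD]` id of `bert-base-uncased` with right-side padding:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

PAD_ID = 0  # assumption: id of [PAD] in bert-base-uncased, padded on the right


def trim_padding_collate(batch):
    """Stack a batch and drop trailing columns that are padding in every row."""
    input_ids = torch.stack([item[0] for item in batch])
    labels = torch.stack([item[1] for item in batch])
    # number of real tokens per row; with right padding this is the row length
    lengths = (input_ids != PAD_ID).sum(dim=1)
    max_len = max(int(lengths.max()), 1)
    return input_ids[:, :max_len], labels


# toy batch: two sequences padded to length 8 (101 = [CLS], 102 = [SEP])
toy_inputs = torch.tensor([
    [101, 7, 8, 102, 0, 0, 0, 0],
    [101, 9, 102, 0, 0, 0, 0, 0],
])
toy_labels = torch.tensor([1, 0])
toy_loader = DataLoader(
    TensorDataset(toy_inputs, toy_labels),
    batch_size=2,
    collate_fn=trim_padding_collate,
)
batch_ids, batch_labels = next(iter(toy_loader))
print(batch_ids.shape)
```

The shorter row keeps one padding token, so padded positions still need to be masked in the forward pass.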

In [None]:
#check GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cpu


In [None]:
# load pretrained model
# Number of labels
num_labels = len(train_set["event"].unique())
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)
# move the model to the selected device
model = model.to(device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Training

In [None]:
# define optimizer and loss function
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = nn.CrossEntropyLoss()




In [None]:
# training
num_epochs = 100

for epoch in range(num_epochs):
    model.train()
    train_loss = 0.0
    for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}", leave = False):
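        input_ids = batch[0].to(device)
        labels = batch[1].to(device)
        # mask out [PAD] positions so attention ignores the padding
        attention_mask = (input_ids != tokenizer.pad_token_id).long()

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask)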
        logits = outputs.logits
        loss = loss_fn(logits, labels)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
    
    train_loss /= len(train_loader)

    if epoch % 10 == 0:
        print(f"Epoch {epoch + 1} - Training loss: {train_loss:.4f}")
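
The loop above reports only training loss, and `dev_loader` is never used. A minimal sketch of how accuracy on the dev split could be measured follows. The `evaluate` helper, its `pad_id` parameter, and `ToyModel` are hypothetical and only illustrate the pattern; the real call would pass the fine-tuned BERT model and `dev_loader`:

```python
import torch
import torch.nn as nn
from types import SimpleNamespace
from torch.utils.data import DataLoader, TensorDataset


def evaluate(model, loader, device, pad_id=0):
    """Return classification accuracy of `model` over `loader`."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for input_ids, labels in loader:
            input_ids, labels = input_ids.to(device), labels.to(device)
            # assumption: pad_id is the tokenizer's [PAD] id (0 for bert-base-uncased)
            attention_mask = (input_ids != pad_id).long()
            logits = model(input_ids, attention_mask=attention_mask).logits
            preds = logits.argmax(dim=-1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / max(total, 1)


class ToyModel(nn.Module):
    """Hypothetical stand-in that mimics the `.logits` attribute of the BERT output."""

    def forward(self, input_ids, attention_mask=None):
        # "predict" the first token id as the class, encoded as one-hot logits
        logits = nn.functional.one_hot(input_ids[:, 0], num_classes=3).float()
        return SimpleNamespace(logits=logits)


toy_loader = DataLoader(
    TensorDataset(
        torch.tensor([[0, 5], [1, 5], [2, 5], [1, 5]]),
        torch.tensor([0, 1, 2, 0]),
    ),
    batch_size=2,
)
toy_accuracy = evaluate(ToyModel(), toy_loader, torch.device("cpu"))
print(f"toy accuracy: {toy_accuracy:.2f}")
```

With the real objects this would be called as `evaluate(model, dev_loader, device)`, for example once per epoch, so that training can be stopped when dev accuracy stops improving.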