## Label Injury Narratives Using Generative and Deep Learning Models
This project compares the performance of generative and deep learning models on document classification using real-world and synthesized data. The classification task is to label each injury narrative with the type of injury it describes.

### Dataset
The dataset comes from a competition organized by the NASA Tournament Lab and the National Institute for Occupational Safety and Health (NIOSH). The goal is to automate data processing in occupational safety and health (OSH) surveillance systems. Specifically, given a free-text injury report such as "_worker fell from the ladder after reaching out for a box._", the task is to **assign an injury code** from the Occupational Injuries and Illnesses Classification System (OIICS). Details of the task and competition are in a [blog post](https://blogs.cdc.gov/niosh-science-blog/2020/02/26/ai-crowdsourcing/) by the CDC (Centers for Disease Control and Prevention). The dataset was downloaded from [Hugging Face](https://huggingface.co/datasets/mayerantoine/injury-narrative-coding). The winning solutions are on [NASA Tournament Lab's GitHub page](https://github.com/NASA-Tournament-Lab/CDC-NLP-Occ-Injury-Coding).

In [4]:
# import libraries
#!pip install -r requirements.txt
import torch
import pandas as pd
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
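from transformers import BertTokenizer, BertForSequenceClassification
from torch.optim import AdamW  # transformers.AdamW is deprecated; the PyTorch optimizer is the usual replacement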
from tqdm import tqdm

In [24]:
# load dataset
train_set = pd.read_csv('./Data/train.csv')
test_set = pd.read_csv('./Data/test.csv')

train_set.head()

Unnamed: 0,text,sex,age,event
0,57YOM WITH CONTUSION TO FACE AFTER STRIKING IT...,1,57,62
1,A 45YOM FELL ON ARM WHILE WORKING HAD SLIPPED ...,1,45,42
2,58YOM WITH CERVICAL STRAIN BACK PAIN S P REST...,1,58,26
3,33 YOM LAC TO HAND FROM A RAZOR KNIFE,1,33,60
4,53YOM AT WORK IN A WAREHOUSE DOING UNSPECIFIED...,1,53,71
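
Before modelling, it can be worth checking how the `event` codes are distributed, since a skewed label distribution changes how accuracy should be read. The sketch below counts codes with the standard library only. `sample_events` is a hypothetical stand-in for `train_set['event']` that reuses the five codes shown above; on the real DataFrame, `train_set['event'].value_counts()` would do the same job.

```python
from collections import Counter

# Hypothetical stand-in for train_set['event']: the five codes from the head() output.
sample_events = [62, 42, 26, 60, 71]


def label_distribution(events):
    """Return (code, count, share) tuples sorted by descending count, then by code."""
    counts = Counter(events)
    total = len(events)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(code, n, n / total) for code, n in ordered]


distribution = label_distribution(sample_events)
for code, n, share in distribution:
    print(f"event {code}: {n} narratives ({share:.0%})")
```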


### Preprocessing

In [25]:
# load pretrained tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
def tokenize(text):
    return tokenizer(
        text,
        padding='max_length',
        truncation=True,
        return_tensors="pt"
    )


# tokenize dataset
TrainTextTokenized = train_set['text'].apply(tokenize)
TestTextTokenized = test_set['text'].apply(tokenize)
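
With `padding='max_length'` and no explicit `max_length`, each narrative is padded to the tokenizer's maximum (512 tokens for `bert-base-uncased`), and the tokenizer also returns an `attention_mask` marking which positions are real tokens. The pure-Python sketch below illustrates that padding and mask logic on the ids of the first narrative as printed later in this notebook. `pad_with_mask` is a hypothetical helper written for illustration, not a `transformers` function, and its truncation is simplified (the real tokenizer keeps the closing `[SEP]` token).

```python
PAD_ID = 0  # id of [PAD] in the bert-base-uncased vocabulary


def pad_with_mask(token_ids, max_length):
    """Truncate or right-pad token_ids to max_length and build the matching mask.

    Simplified: plain slicing is used for truncation, so a too-long input
    would lose its final [SEP], unlike the real tokenizer.
    """
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)          # 1 = real token
    n_pad = max_length - len(ids)
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad  # 0 = padding


# Ids of the first narrative (101 = [CLS], 102 = [SEP]).
first_ids = [101, 5401, 7677, 2213, 2007, 9530, 5809, 3258, 2000, 2227,
             2044, 8478, 2009, 2007, 1037, 2695, 20091, 2096, 4292, 1037,
             8638, 2695, 102]

padded_ids, attention_mask = pad_with_mask(first_ids, max_length=32)
print(padded_ids)
print(attention_mask)
```

Since the narratives are short, a smaller `max_length` than 512 may be enough and would cut memory use. Passing the `attention_mask` to the model alongside `input_ids` is the usual way to make it ignore the padded positions.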



In [29]:
# Get Event
train_labels = train_set['event'].values
test_labels = test_set['event'].values

# convert tokenized dataset to tensor
train_inputs = torch.cat([example["input_ids"] for example in TrainTextTokenized], dim=0)
test_inputs = torch.cat([example["input_ids"] for example in TestTextTokenized], dim=0)
train_labels = torch.tensor(train_labels)
test_labels = torch.tensor(test_labels)

Unnamed: 0,text,sex,age
0,54 Y O F PUNCTURE WOUND OF FIINGER RE ATTACHIN...,2,54
1,22 YOM CONTUSION TO LT LOWER LEG S P MVC HIT B...,1,22
2,20 YOM PT WORKS IN A QUARRY WAS ATTEMPTING TO...,1,20
3,38 YOF WAS WALKING AT WORK TWISTED HER LT ANKL...,2,38
4,44 YOM C O LOW BACK PAIN AFTER LIFTING A BOX A...,1,44
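
The `event` values are OIICS codes such as 62 or 71 rather than consecutive class indices, while `BertForSequenceClassification` computes its cross-entropy loss over indices in `[0, num_labels)`. The following is a minimal sketch of one way to bridge the two, assuming the mapping is built from the training labels. `build_label_maps` and the sample codes are illustrative and not part of the original notebook.

```python
def build_label_maps(codes):
    """Map each distinct code to a contiguous index (in sorted order) and back."""
    unique_codes = sorted(set(codes))
    code_to_index = {code: i for i, code in enumerate(unique_codes)}
    index_to_code = {i: code for code, i in code_to_index.items()}
    return code_to_index, index_to_code


# Illustrative codes taken from the train_set.head() output.
sample_codes = [62, 42, 26, 60, 71]
code_to_index, index_to_code = build_label_maps(sample_codes)

encoded = [code_to_index[c] for c in sample_codes]   # indices for training
decoded = [index_to_code[i] for i in encoded]        # back to OIICS codes
print(encoded)
print(decoded)
```

In the notebook, the encoded list could then be wrapped with `torch.tensor(...)` in place of the raw codes, with `num_labels=len(code_to_index)` passed when loading the model. Predictions would be translated back through `index_to_code`.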


In [21]:
X[0]

{'input_ids': tensor([[  101,  5401,  7677,  2213,  2007,  9530,  5809,  3258,  2000,  2227,
          2044,  8478,  2009,  2007,  1037,  2695, 20091,  2096,  4292,  1037,
          8638,  2695,   102,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,  