# Introduction

Over the last couple of months I have been training an LLM on health and safety data. After fine-tuning, the model was deployed to production and is exposed to a front-end application through an API. Unfortunately, I neglected to document the research and development process, so this repo serves as a retrospective piece on my work. I will share the code used to train the model and the insights I gained along the way. Since the data and the model are company IP, I cannot share them, only my process for developing them.


# Background and Motivation

On construction sites, we have many hazards that pose health and safety threats to our workers or members of the public. To capture these hazards, our site workers use an app to log any issue they see. These observations are in free-text form, such as: "Cables left laying on walkway", "Oil spill on pathway", "Cement truck parked in wrong place", etc. There are over 1,000,000 of these records in the observations table.

When entering an observation, the user is given the option to enter a category. The Health and Safety Team uses this to triage and action the hazard: to decide which team to send it to, whether to notify someone immediately, etc. For example, "Cables left laying on walkway" would probably be given the category "Slips/Trips".

Most of the time the appropriate category is set in the data, but around 18% of the time the category is empty, and sometimes the category that is set does not match the input text. This is where Machine Learning and LLMs can come in. With over 1 million records in the dataset, we can use some filtering to pull out a subset of labelled records with good observations and fine-tune a model to learn which category is most likely when certain words and combinations of words are present. We can then use the model to predict or suggest a category, improving both the process of submitting an observation and the triaging and actioning that follow.

# Part 1: Data Cleansing

To start, we can load the data into a Dataframe in the notebook. 

In [None]:
import pandas as pd
import numpy as np

# Silence pandas' SettingWithCopyWarning, which the column assignments later on would otherwise trigger.
pd.options.mode.chained_assignment = None

observations = pd.read_csv("observations-data.csv")
display(observations)

There are two key fields to pick out from the data: the one with the free-text input (whatdidyousee) and the one with the category (category2). The plan is to use whatdidyousee to predict category2.

In [None]:
# Use category2, as category1 is always "Hazard" due to the way the app stores the data.
labelled = observations[["whatdidyousee", "category2"]]

This is a supervised learning model so we need to remove the null categories. This will give us a fully labelled Dataframe.

In [None]:
labelled = labelled.loc[labelled["category2"].notnull()]

# Cleaning up labelled Dataset.
labelled["category"] = labelled["category2"].str.strip()
labelled = labelled.reset_index(drop=True)

display(labelled.head())

To train our model we will need to have a sorted list of possible category options that we can later convert to a Tensor so our model can perform computations on it. Let's begin by getting a unique list and sorting it.

In [None]:
category_list = [cat for cat in labelled["category"].unique()]
category_list.sort()
print(category_list)

Let's graph the category distribution.

In [None]:
labelled["category"].value_counts()[category_list].plot(kind="bar")

If we are happy with that let's store the category labels for later use.

In [None]:
import json
with open("observation_categories.json", "w") as f:
    # json.dump needs the open file handle as its second argument.
    json.dump({"categories": category_list}, f)
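
The stored labels have to come back in exactly the same order at prediction time, because each position in the model's output corresponds to one category. Below is a minimal sketch of that round trip. It uses a hypothetical three-item category list and a temporary file rather than the real observation_categories.json.

```python
import json
import os
import tempfile

# Hypothetical category names, for illustration only.
example_categories = sorted(["Slips/Trips", "Housekeeping", "Fire"])

# Write the list under a "categories" key to a temporary file.
path = os.path.join(tempfile.mkdtemp(), "categories_example.json")
with open(path, "w") as f:
    json.dump({"categories": example_categories}, f)

# Read it back and confirm the order is preserved.
with open(path) as f:
    restored = json.load(f)["categories"]

print(restored == example_categories)
```

JSON arrays keep their order, so the restored list can be used directly as the index-to-category mapping.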

Before we begin any training or data manipulation, let's create an untouched cut of the data to test on later. This is important because we will train over multiple epochs and keep the checkpoint with the lowest loss on the validation set, so results on the validation set end up somewhat biased in the model's favour.

Therefore we need a sample of the data that the model has never seen before so we can properly evaluate its accuracy.

In [None]:
untouched = labelled.sample(10000)
untouched.to_csv("observations-finaltest.csv")

# After saving the test data, remove it from the training dataset.
labelled_remaining = labelled.drop(untouched.index).reset_index(drop=True)
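
A quick way to convince yourself that a hold-out split like this leaks nothing is to check it on a toy DataFrame. The sketch below uses made-up data and a hypothetical `row_id` column. It also passes a fixed `random_state`, which the cell above does not, purely so the example is reproducible.

```python
import pandas as pd

# Toy stand-in for the labelled DataFrame (hypothetical values).
toy = pd.DataFrame({"row_id": range(10), "category": list("ABABABABAB")})

# Hold out 3 rows with a fixed seed so the split can be reproduced.
held_out = toy.sample(3, random_state=0)
remaining = toy.drop(held_out.index)

# The two parts should not overlap, and together should cover every row.
overlap = set(held_out["row_id"]) & set(remaining["row_id"])
print(len(overlap), len(held_out) + len(remaining) == len(toy))
```

Note that the check has to run before `reset_index`, or rely on a stable identifier column, because resetting the index renumbers the rows.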

Now that we have an untouched dataset for testing down the line, we can start splitting out our model data.

In [None]:
SAMPLE_SIZE = 20000
RANDOM_STATE = 200

# Getting a sample of data
model_data = labelled_remaining.sample(SAMPLE_SIZE,random_state=RANDOM_STATE)
labelled_remaining = labelled_remaining.drop(model_data.index).reset_index(drop=True)

After splitting out all those datasets we can now validate they all look good.

In [None]:
print(f"RAW: {observations.shape}")
print(f"LABELLED: {labelled.shape}")
print(f"LABELLED REMAINING: {labelled_remaining.shape}")
print(f"MODEL DATA: {model_data.shape}")
print(f"UNTOUCHED: {untouched.shape}")

If that looks good then save the remaining for use in later Warm Start training.

In [None]:
labelled_remaining.to_csv("observations-unseen.csv")

# Part 2: One-hot Encoding

An important first step in preparing the data for the model is called "one-hot encoding". We need to do this because the model does not understand a term like "Slip/Trip" that might appear in the target category2 field. It cannot compute a loss function against a string to see how close or far a prediction is from the correct label.

To solve this in the multi-label classification context we need to encode the values using the sorted category list we prepared earlier.

In [None]:
print(category_list)

To do this we need to turn the target value, e.g. Slip/Trip, into a tensor where the value at the Slip/Trip index is 1 and the value at every other category index is 0.

As an example if there were three categories: Ant, Bee and Cricket then a labelled data point with the category "Ant" would become a target list of [1, 0, 0]. A category of "Cricket" would become [0, 0, 1]. 

In [None]:
for category in category_list:
    model_data.loc[model_data["category"] == category, category] = 1
    model_data[category] = model_data[category].fillna(0)

model_data["input"] = model_data["whatdidyousee"].astype(str)

pd.options.display.max_colwidth = 10
display(model_data.head())
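
The Ant, Bee and Cricket example can be reproduced in a few lines of plain Python. This is only an illustrative sketch with a hypothetical `one_hot` helper. The notebook itself builds the same encoding column by column in pandas.

```python
def one_hot(category, categories):
    """Return a list with 1 at the position of `category` and 0 elsewhere."""
    return [1 if c == category else 0 for c in categories]


categories = ["Ant", "Bee", "Cricket"]
print(one_hot("Ant", categories))      # [1, 0, 0]
print(one_hot("Cricket", categories))  # [0, 0, 1]
```

The position of the 1 depends on the order of the category list, which is why the list is sorted once and then reused everywhere.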

Let's collapse our long set of one-hot columns into a single list column that PyTorch can later turn into a tensor.

In [None]:
model_data["target_list"] = model_data[category_list].astype(bool).values.tolist()

Now let's reassign the Dataframe to just have the key columns. To train we only need the input and the target_list, but I'm including the category to make the results easier to read.

In [None]:
model_data = model_data[["input", "category", "target_list"]]

# Taking a look at the prepared dataset.
pd.options.display.max_colwidth = 500
display(model_data.head())

# Part 3: Train-Validation Split

We have a final testing set and a 20,000-record dataset to train our model. To get feedback during training, it is best practice to split the model dataset into two separate sets: Training and Validation.

The idea is that the model will loop over the training set in batches, repeatedly scoring its predictions against the true labels. The validation set is then used to check the state of the model after each epoch of training. If it scores well on the validation set, we know it has generalised well on this cut of the observation data.

In [None]:
TRAIN_SIZE = 0.8

train_dataset = model_data.sample(frac=TRAIN_SIZE)
valid_dataset = model_data.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)

print(f"FULL Dataset: {model_data.shape}")
print(f"TRAIN Dataset: {train_dataset.shape}")
print(f"VALID Dataset: {valid_dataset.shape}")

# Part 4: Model Prep

This is where we load the BERT tokenizer, which turns the raw text inputs into numbered tokens. We need to use the BERT tokenizer so that the tokens match the BERT model we will load later.

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

The torch neural network we are going to use requires a particular object structure as an input. I will build this class below.

In [None]:
import torch

class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.input = dataframe["input"]
        self.targets = self.data.target_list
        self.max_len = max_len
    
    def __len__(self):
        return len(self.input)
    
    def __getitem__(self, index):
        # Collapse repeated whitespace in the raw observation text.
        text = str(self.input[index])
        text = " ".join(text.split())

        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding="max_length",
            return_token_type_ids=True,
            truncation=True,
        )
        ids = inputs["input_ids"]
        mask = inputs["attention_mask"]
        # Read from the tokenizer output (`inputs`), not the raw text.
        token_type_ids = inputs["token_type_ids"]

        return {
            "ids": torch.tensor(ids, dtype=torch.long),
            "mask": torch.tensor(mask, dtype=torch.long),
            "token_type_ids": torch.tensor(token_type_ids, dtype=torch.long),
            "targets": torch.tensor(self.targets[index], dtype=torch.float),
        }
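
To make `padding="max_length"` and `truncation=True` more concrete, here is a simplified sketch of what happens to the ids and the attention mask. It is not the BERT tokenizer: the `pad_and_mask` helper, the token ids and the pad id of 0 are all assumptions for illustration. The real tokenizer also keeps the closing special token when it truncates, which this toy version does not.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate or pad `token_ids` to `max_len` and build the attention mask."""
    ids = token_ids[:max_len]
    mask = [1] * len(ids)
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding


# Hypothetical token ids, not real BERT vocabulary lookups.
ids, mask = pad_and_mask([101, 7, 8, 9, 102], max_len=8)
print(ids)   # [101, 7, 8, 9, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Every item then has the same length, so the DataLoader can stack a batch into one tensor, and the mask tells the model which positions are padding.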

Now I'm going to create a small function to put the CustomDataset object into the DataLoader torch module. 

In [None]:
from torch.utils.data import DataLoader

MAX_LEN = 16

def df_loader(df, batch_size):
    custom = CustomDataset(df, tokenizer, MAX_LEN)
    test_params = {
        "batch_size": batch_size,
        "shuffle": False,
        "num_workers": 0
        }
    return DataLoader(custom, **test_params)

Next I will create a BERTClass that inherits from the torch neural network module. This class is our model. It is initially instantiated with the weights from BERT and will be fine-tuned on our data.

In [None]:
import transformers

class BERTClass(torch.nn.Module):
    def __init__(self):
        super(BERTClass, self).__init__()
        self.l1 = transformers.BertModel.from_pretrained('bert-base-uncased', return_dict=False)
        self.l2 = torch.nn.Dropout(0.3)
        self.l3 = torch.nn.Linear(768, 29)

    def forward(self, ids, mask, token_type_ids):
        _, output_1 = self.l1(ids, attention_mask=mask, token_type_ids=token_type_ids)
        output_2 = self.l2(output_1)
        output = self.l3(output_2)
        return output

Next I will create a simple loss function using BCE (binary cross-entropy) with Logits Loss from torch. I have added positive weights to encourage the model to select a category.

In [None]:
def loss_fn(outputs, targets):
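    # Weight positive labels 5x; build the weights as floats on the same device as the outputs.
    pos_weight = torch.full((29,), 5.0, device=outputs.device)
    return torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)(outputs, targets)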
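
To see what the positive weight does, the per-element loss can be worked out by hand. The sketch below re-implements the formula from the PyTorch documentation for `BCEWithLogitsLoss` in plain Python, for a single logit and without the mean reduction. The function name is hypothetical.

```python
import math


def weighted_bce_with_logits(logit, target, pos_weight):
    """Per-element loss: -[pos_weight * y * log(s) + (1 - y) * log(1 - s)]."""
    s = 1.0 / (1.0 + math.exp(-logit))  # sigmoid of the raw model output
    return -(pos_weight * target * math.log(s) + (1 - target) * math.log(1 - s))


# An undecided logit of 0 on a positive label costs 5x more with pos_weight=5 ...
print(weighted_bce_with_logits(0.0, 1.0, pos_weight=1.0))
print(weighted_bce_with_logits(0.0, 1.0, pos_weight=5.0))
# ... while the cost on a negative label is unchanged.
print(weighted_bce_with_logits(0.0, 0.0, pos_weight=5.0))
```

Each target vector contains a single 1 among 29 entries, so without the weight the model could keep the loss low by predicting 0 everywhere. The weight makes a missed positive more expensive.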

In [None]:
import shutil

def load_ckp(checkpoint_fpath, model, optimizer):
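    # Load onto CPU first so a checkpoint saved on a GPU can be restored on any machine.
    checkpoint = torch.load(checkpoint_fpath, map_location="cpu")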
    model.load_state_dict(checkpoint["state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    valid_loss_min = checkpoint["valid_loss_min"]
    return model, optimizer, checkpoint["epoch"], valid_loss_min

def save_ckp(state, is_best, checkpoint_path, best_model_path):
    torch.save(state, checkpoint_path)
    if is_best:
        shutil.copyfile(checkpoint_path, best_model_path)

# Part 5: Training

Set the device that torch will use for training.

In [None]:
if torch.cuda.is_available(): # check for CUDA gpu
    device = torch.device("cuda")
elif torch.backends.mps.is_available(): # Check for Apple M1/M2 chip
    device = torch.device("mps")
else:
    device = torch.device("cpu") # Otherwise just use CPU

In [None]:
def train_model(
        start_epochs,
        n_epochs,
        valid_loss_min_input,
        training_loader,
        validation_loader,
        model,
        optimizer,
        checkpoint_path,
        best_model_path
):
    # Initialize valid loss minimum at input.
    valid_loss_min = valid_loss_min_input

    for epoch in range(start_epochs, n_epochs):
        train_loss = 0
        valid_loss = 0
        # Put model in training mode.
        model.train()

        print(f" -- Epoch {epoch}: Training Start -- ")
        
        for batch_idx, data in enumerate(training_loader):
            # Save batch info to device
            ids = data["ids"].to(device, dtype=torch.long)
            mask = data["mask"].to(device, dtype=torch.long)
            token_type_ids = data["token_type_ids"].to(device, dtype=torch.long)
            targets = data["targets"].to(device, dtype=torch.float)
            # Run prediction on model for batch
            outputs = model(ids, mask, token_type_ids)
            # Evaluate loss
            loss = loss_fn(outputs, targets)
            
            if batch_idx%5000 == 0:
                print(f"Epoch: {epoch}, Training Loss: {loss.item()}")

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            train_loss += (1 / (batch_idx + 1))*(loss.item() - train_loss)
        
        print(f" -- Epoch {epoch}: Training End -- ")

        print(f" -- Epoch {epoch}: Validation Start -- ")

        model.eval()

        with torch.no_grad():
            val_targets = []
            val_outputs = []
            for batch_idx, data in enumerate(validation_loader):
                # Save batch info to device
                ids = data["ids"].to(device, dtype=torch.long)
                mask = data["mask"].to(device, dtype=torch.long)
                token_type_ids = data["token_type_ids"].to(device, dtype=torch.long)
                targets = data["targets"].to(device, dtype=torch.float)
                # Evaluate model on batch
                outputs = model(ids, mask, token_type_ids)

                loss = loss_fn(outputs, targets)
                valid_loss += (1 / (batch_idx + 1))*(loss.item() - valid_loss)
                val_targets.extend(targets.cpu().detach().numpy().tolist())
                val_outputs.extend(torch.sigmoid(outputs).cpu().detach().numpy().tolist())

        print(f" -- Epoch {epoch}: Validation End --")

        # train_loss and valid_loss are already running means over their batches,
        # so no further division by the number of batches is needed.

        print(f"Epoch: {epoch}\n\tAverage Training Loss: {train_loss}\n\tAverage Validation Loss: {valid_loss}")

        checkpoint = {
            "epoch": epoch + 1, 
            "valid_loss_min": valid_loss,
            "state_dict": model.state_dict(),
            "optimizer": optimizer.state_dict()
        }

        save_ckp(checkpoint, False, checkpoint_path, best_model_path)

        if valid_loss <= valid_loss_min:
            print(f"Validation loss decreased ({valid_loss_min} --> {valid_loss}). Saving Model...")
            save_ckp(checkpoint, True, checkpoint_path, best_model_path)
            valid_loss_min = valid_loss

        print(f" -- Epoch {epoch} Done -- ")
    
    return model
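
The update `loss += (1 / (batch_idx + 1)) * (batch_loss - loss)` used for both losses inside the loop is an incremental mean: after every batch it already equals the arithmetic mean of the batch losses seen so far. A small sketch with made-up loss values and a hypothetical helper name shows the equivalence.

```python
def running_mean(values):
    """Incrementally update a mean the way the training loop does."""
    mean = 0.0
    for idx, value in enumerate(values):
        mean += (1 / (idx + 1)) * (value - mean)
    return mean


batch_losses = [0.9, 0.7, 0.5, 0.3]  # hypothetical per-batch losses
print(running_mean(batch_losses))
print(sum(batch_losses) / len(batch_losses))
```

Both lines print the same value up to floating-point rounding, so the figure reported at the end of an epoch is the average loss per batch.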

Now we can load in the data we prepared and split out earlier.

In [None]:
TRAIN_BATCH_SIZE = 32
VALID_BATCH_SIZE = 32
LEARNING_RATE = 1e-05

training_loader = df_loader(train_dataset, TRAIN_BATCH_SIZE)
validation_loader = df_loader(valid_dataset, VALID_BATCH_SIZE)
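
As a rough sanity check on these settings, the number of batches per epoch can be worked out from the constants defined earlier. The sketch assumes the DataLoader keeps the final partial batch, which is its default behaviour. `LOG_EVERY` is a hypothetical name for the interval used in the `batch_idx%5000` check.

```python
import math

SAMPLE_SIZE = 20000
TRAIN_SIZE = 0.8
TRAIN_BATCH_SIZE = 32
LOG_EVERY = 5000  # interval used in the `batch_idx % 5000` check

train_rows = round(SAMPLE_SIZE * TRAIN_SIZE)
batches_per_epoch = math.ceil(train_rows / TRAIN_BATCH_SIZE)
logged_batches = [i for i in range(batches_per_epoch) if i % LOG_EVERY == 0]

print(train_rows, batches_per_epoch, logged_batches)
```

With 500 batches per epoch, the in-loop training loss is printed only for the first batch of each epoch. A smaller interval would give more frequent feedback.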

We can also define the key components to begin training our model.

In [None]:
checkpoint_path = "./current_checkpoint.pt"
best_model = "./best_model.pt"
model = BERTClass()
model.to(device)
valid_loss_min_input = np.inf
optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)

That leaves just one thing left to do...

In [None]:
trained_model = train_model(1, 4, valid_loss_min_input, training_loader, validation_loader, model, optimizer, checkpoint_path, best_model)

# Part 6: Testing

Now that training on the initial batch is complete, there should be a best_model.pt file available in the directory. We can load that model using the checkpoint function above.

In [None]:
device = torch.device('cpu')
model = BERTClass()
model.to(device)
optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)
best_model = "./best_model.pt"

model, optimizer, epoch, valid_loss_min_input = load_ckp(best_model, model, optimizer)

Now we can load in the test dataset as saved previously.

In [None]:
test_data = pd.read_csv("observations-finaltest.csv")
display(test_data)

Let's grab some of the code I wrote above to make a function that preps the dataframe for testing or training.

In [None]:
def feature_prep(dataset, category_list):
    # Get important fields
    df = dataset[["whatdidyousee", "category"]].copy()

    # Removing Null categories to get labelled list.
    df = df.loc[df["category"].notnull()]
    df = df.reset_index(drop=True)

    # One-hot encoding of categories
    for category in category_list:
        df.loc[df["category"] == category, category] = 1
        df[category] = df[category].fillna(0)

    # Organise columns into correctly named fields.
    df["input"] = df["whatdidyousee"].astype(str)
    df["target_list"] = df[category_list].astype(bool).values.tolist()
    return df[["input", "category", "target_list"]]

In [None]:
TEST_SIZE = 1000
test_data = test_data.sample(TEST_SIZE).reset_index(drop=True)
test_data = feature_prep(test_data, category_list)

In [None]:
display(test_data)

In [None]:
test_loader = df_loader(test_data, VALID_BATCH_SIZE)

In [None]:
def do_validation(dataloader):
    model.eval()
    fin_targets = []
    fin_outputs = []
    with torch.no_grad():
        for data in dataloader:
            ids = data["ids"].to(device, dtype=torch.long)
            mask = data["mask"].to(device, dtype=torch.long)
            token_type_ids = data["token_type_ids"].to(device, dtype=torch.long)
            # Pull the labels for this batch so they can be returned alongside the outputs.
            targets = data["targets"].to(device, dtype=torch.float)
            outputs = model(ids, mask, token_type_ids)
            fin_targets.extend(targets.cpu().detach().numpy().tolist())
            fin_outputs.extend(torch.sigmoid(outputs).cpu().detach().numpy().tolist())
    return fin_outputs, fin_targets

Now that I've created a validation function and loaded the test data into the custom dataset, we can run the evaluation on the test records.

In [None]:
outputs, targets = do_validation(test_loader)

In [None]:
for o in outputs:
    for i, x in enumerate(o):
        if x < max(o):
            o[i] = 0
        else:
            o[i] = 1
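
With the outputs reduced to one-hot vectors, they can be compared with the targets row by row. The sketch below is not part of the original notebook. It assumes exact-match accuracy is the metric of interest and uses a hypothetical helper with made-up predictions.

```python
def one_hot_accuracy(outputs, targets):
    """Fraction of rows whose predicted one-hot vector equals the target."""
    matches = sum(1 for o, t in zip(outputs, targets) if list(o) == list(t))
    return matches / len(targets)


# Hypothetical predictions and labels for three observations.
example_outputs = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
example_targets = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
print(one_hot_accuracy(example_outputs, example_targets))
```

The same helper could be called with the `outputs` and `targets` lists from the cells above once the thresholding loop has run.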
