# Introduction

Over the last couple of months I have been fine-tuning an LLM on health and safety data. The fine-tuned model has since been deployed to production, where it is exposed to a front-end application through an API. Unfortunately, I neglected to document the research and development process, so this repo serves as a retrospective piece on my work. I will share the code used to train the model and the insights I learned along the way. Because the data and the model are company IP, I cannot share either of them, only my process for developing them.


# Background and Motivation

On construction sites we have many hazards that pose health and safety threats to our workers or members of the public. To capture these hazards our site workers use an app to log when they see an issue, such as: "Cables left laying on walkway", "Oil spill on pathway" etc. Thus, we have a huge amount of useful health and safety data stored in our database.

When entering an observation, the user is given the option to enter a category. The Health and Safety Team uses this category to triage and action the hazard: to decide which team to send it to, whether to notify someone immediately, and so on. For example, "Cables left laying on walkway" would probably be given the category "Slips/Trips".

Most of the time the appropriate category is set in the data, but around 18% of the time the category is empty, and sometimes the category that is set does not match the input text. This is where Machine Learning and LLMs can come in. With over 1 million records in the dataset, we can filter out a subset of labelled records with good observations and fine-tune a model to learn which category is most likely given the words and combinations of words in the text. We can then use the model to predict or suggest a category, improving the process of submitting an observation as well as triaging and actioning it.

# Part 1: Data Preparation

To start, we can load the data into a DataFrame in the notebook.

In [None]:
import pandas as pd

# Disable pandas' chained-assignment warning (SettingWithCopyWarning), which is otherwise raised later on.
pd.options.mode.chained_assignment = None

observations = pd.read_csv("observations-data.csv")
display(observations)

There are two key fields to pick out from the data: the free-text input (whatdidyousee) and the category (category2). The plan is to use whatdidyousee to predict category2.

In [None]:
# Use category2 because category1 is always "Hazard" due to the way the app stores the data.
labelled = observations[["whatdidyousee", "category2"]]

This is a supervised learning problem, so we need to remove the rows with null categories. This gives us a fully labelled DataFrame.

In [None]:
labelled = labelled.loc[labelled["category2"].notnull()]

# Clean up the labelled dataset: strip stray whitespace and reset the index.
labelled["category2"] = labelled["category2"].str.strip()
labelled = labelled.reset_index(drop=True)

display(labelled.head())
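
One thing `notnull()` does not catch is a category stored as an empty or whitespace-only string. Whether the real data contains such rows is an assumption on my part, but a minimal sketch of handling them, using made-up toy rows rather than the real data, could look like this:

```python
import pandas as pd

# Toy stand-in for the real data (hypothetical rows, for illustration only).
toy = pd.DataFrame({
    "whatdidyousee": [
        "Cables left laying on walkway",
        "Oil spill on pathway",
        "Blocked exit",
        "Missing guard rail",
    ],
    "category2": ["Slips/Trips ", None, "   ", "Slips/Trips"],
})

# Strip whitespace, then treat empty strings as missing values.
cleaned = toy["category2"].str.strip()
cleaned = cleaned.mask(cleaned == "")

# Share of rows without a usable category.
missing_share = cleaned.isna().mean()
print(f"Missing category share: {missing_share:.0%}")

# Keep only the rows that have a usable label.
labelled_toy = (
    toy.assign(category2=cleaned)
    .dropna(subset=["category2"])
    .reset_index(drop=True)
)
print(labelled_toy)
```

Running the same check on the real `observations` DataFrame would also be a quick way to confirm the share of empty categories.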

To train our model we will need to have a sorted list of possible category options that we can later convert to a Tensor so our model can perform computations on it. Let's begin by getting a unique list and sorting it.

In [None]:
category_list = sorted(labelled["category2"].unique())
print(category_list)
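
The reason for sorting is that each category's position in the list can then act as a stable integer id. As a rough sketch (the names `label2id` and `id2label` and the "Working at Height" category are assumptions for illustration, not taken from the original code), the mapping could be built like this:

```python
import pandas as pd

# Toy stand-in for the labelled data (hypothetical rows and categories).
toy = pd.DataFrame({
    "whatdidyousee": [
        "Cables left laying on walkway",
        "Oil spill on pathway",
        "Loose scaffold board",
    ],
    "category2": ["Slips/Trips", "Slips/Trips", "Working at Height"],
})

category_list = sorted(toy["category2"].unique())

# Map each category name to its position in the sorted list, and back again.
label2id = {cat: idx for idx, cat in enumerate(category_list)}
id2label = {idx: cat for cat, idx in label2id.items()}

# Integer-encode the label column; these ids are what would later go into a tensor.
toy["label"] = toy["category2"].map(label2id)
print(toy[["category2", "label"]])
```

Because the list is sorted, the same set of categories always produces the same ids, which matters when the saved labels are reloaded at inference time.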

Let's graph the category distribution.

In [None]:
labelled["category2"].value_counts()[category_list].plot(kind="bar")
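
Alongside the bar chart, it can help to look at the distribution as proportions so that rare categories stand out. The following is only a sketch on toy labels; the 20% cut-off and the placeholder category names are arbitrary assumptions:

```python
import pandas as pd

# Toy label column (placeholder category names apart from "Slips/Trips").
toy_labels = pd.Series(
    ["Slips/Trips"] * 6 + ["Category B"] * 3 + ["Category C"] * 1,
    name="category2",
)

# Proportion of rows per category, in alphabetical order.
shares = toy_labels.value_counts(normalize=True).sort_index()
print(shares)

# Flag categories that fall below an (arbitrary) share threshold.
RARE_THRESHOLD = 0.2
rare = shares[shares < RARE_THRESHOLD].index.tolist()
print(f"Rare categories: {rare}")
```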

If we are happy with that let's store the category labels for later use.

In [None]:
import json

# Save the sorted category labels so they can be reloaded later.
with open("observation_categories.json", "w") as f:
    json.dump({"categories": category_list}, f)
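
For completeness, here is a small round-trip sketch showing how the saved labels could be read back later. It writes to a temporary directory and uses placeholder category values so that it does not touch the real file:

```python
import json
import os
import tempfile

# Placeholder values standing in for the real sorted category list.
category_list = ["Category A", "Category B"]

# Write to a temporary directory rather than the real output file.
path = os.path.join(tempfile.mkdtemp(), "observation_categories.json")
with open(path, "w") as f:
    json.dump({"categories": category_list}, f)

# Read the labels back; list order (and therefore each label's index) is preserved.
with open(path) as f:
    loaded_categories = json.load(f)["categories"]

print(loaded_categories)
```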

Before we begin any training or data manipulation, let's set aside an untouched cut of the data for testing later. This is important because we will train over multiple epochs, and there will be some bias towards the validation set as the model and its training settings are adjusted to reduce the loss on that dataset.

Therefore, we need a sample of the data that the model has never seen before so we can properly evaluate its accuracy.

In [None]:
untouched = labelled.sample(10000)
untouched.to_csv("observations-finaltest.csv")

# After saving the test data, remove it from the training dataset.
labelled_remaining = labelled.drop(untouched.index).reset_index(drop=True)
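
A quick way to confirm that the held-out rows really were removed is to compare indexes. This is a sketch on toy data with hypothetical sizes; note that the comparison has to happen before `reset_index`, because resetting renumbers the rows and makes the two indexes incomparable:

```python
import pandas as pd

# Toy labelled data (hypothetical rows and placeholder categories).
labelled = pd.DataFrame({
    "whatdidyousee": [f"observation {i}" for i in range(100)],
    "category2": ["A", "B"] * 50,
})

# Hold out a test set, then drop those rows from the rest.
untouched = labelled.sample(10, random_state=0)
remaining = labelled.drop(untouched.index)

# No row index should appear in both sets, and no rows should be lost.
overlap = untouched.index.intersection(remaining.index)
assert overlap.empty, "held-out rows leaked into the remaining data"
assert len(untouched) + len(remaining) == len(labelled)

# Only now is it safe to renumber the remaining rows.
labelled_remaining = remaining.reset_index(drop=True)
```

Passing a `random_state` to `sample`, as done here, also makes the held-out set reproducible between runs.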

Now that we have an untouched dataset for testing down the line, we can start splitting out our model data.

In [None]:
SAMPLE_SIZE = 20000
RANDOM_STATE = 200

# Getting a sample of data
model_data = labelled_remaining.sample(SAMPLE_SIZE, random_state=RANDOM_STATE)
labelled_remaining = labelled_remaining.drop(model_data.index).reset_index(drop=True)
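
The sample above is purely random. If some categories are rare, one alternative (not what was done here) is a stratified sample that draws the same fraction from every category. A minimal sketch on toy data, with placeholder categories and an arbitrary fraction:

```python
import pandas as pd

# Toy data with an imbalanced label column (placeholder categories).
toy = pd.DataFrame({
    "whatdidyousee": [f"observation {i}" for i in range(100)],
    "category2": ["A"] * 80 + ["B"] * 20,
})

# Draw 25% of the rows from each category so the class balance is preserved.
stratified = toy.groupby("category2").sample(frac=0.25, random_state=200)

# Remove the sampled rows from the pool, as in the random-sample version.
toy_remaining = toy.drop(stratified.index).reset_index(drop=True)

print(stratified["category2"].value_counts())
```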

After splitting out all those datasets, we can check that their shapes look as expected.

In [None]:
print(f"RAW: {observations.shape}")
print(f"LABELLED: {labelled.shape}")
print(f"LABELLED REMAINING: {labelled_remaining.shape}")
print(f"MODEL DATA: {model_data.shape}")
print(f"UNTOUCHED: {untouched.shape}")
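
Beyond eyeballing the shapes, the row counts can be checked programmatically: the three splits should add up to the labelled total. A sketch on toy data with hypothetical sizes, following the same split steps as above:

```python
import pandas as pd

# Toy labelled data (placeholder categories).
labelled = pd.DataFrame({"category2": ["A", "B", "C", "D"] * 25})

# Same sequence of splits as in the notebook, with smaller sizes.
untouched = labelled.sample(10, random_state=0)
labelled_remaining = labelled.drop(untouched.index).reset_index(drop=True)

model_data = labelled_remaining.sample(20, random_state=200)
labelled_remaining = labelled_remaining.drop(model_data.index).reset_index(drop=True)

# Every labelled row should end up in exactly one of the three splits.
total = len(untouched) + len(model_data) + len(labelled_remaining)
assert total == len(labelled), f"expected {len(labelled)} rows, got {total}"
print(f"Split sizes add up: {total} rows")
```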

If that looks good, save the remaining labelled data for use in later Warm Start training.

In [None]:
labelled_remaining.to_csv("observations-unseen.csv")