# Introduction

I have spent the last couple of months training an LLM on health and safety data. After fine-tuning, the model was deployed to production and is exposed to a front-end application through an API. Unfortunately, I neglected to document the research and development process, so this repo serves as a retrospective piece on my work. I will share the code used to train the model and the insights I learned along the way. Because the data and the model are company IP, I cannot share them, only my process for developing them.


# Background and Motivation

On construction sites we have many hazards that pose health and safety threats to our workers or members of the public. To capture these hazards our site workers use an app to log when they see an issue, such as: "Cables left laying on walkway", "Oil spill on pathway" etc. Thus, we have a huge amount of useful health and safety data stored in our database.

When entering an observation, the user is given the option to enter a category, which the Health and Safety Team uses to triage and action the hazard: to decide which team to send it to, whether to notify someone immediately, etc. For example, "Cables left laying on walkway" would probably be given the category "Slips/Trips".

Most of the time the appropriate category is set, but around 18% of the time the category is empty, and sometimes the category that is set doesn't match the input text. This is where machine learning and LLMs come in. With over 1 million records in the dataset, we can filter out a subset of labelled records with good observations and fine-tune a model to learn which category is most likely given the words and combinations of words present. We can then use the model to predict or suggest a category, improving the process of submitting an observation as well as triaging and actioning it.

# Part 1: Data Analysis

To start, we can load the data into a pandas DataFrame in the notebook.

In [None]:
import pandas as pd

# Disable pandas' chained-assignment warning (SettingWithCopyWarning), which otherwise appears later on.
pd.options.mode.chained_assignment = None

observations = pd.read_csv("observations-data.csv")
display(observations)

There are two key fields to pick out from the data. The one with the free text input, and the one with the category.

In [None]:
# Use category2, as category1 is always "Hazard" due to the way the app stores the data.
# Select both columns with a single list of column names.
labelled = observations[["whatdidyousee", "category2"]]

This is a supervised learning problem, so we need to remove the rows with null categories.

In [None]:
labelled = labelled.loc[labelled["category2"].notnull()]

# Clean up the labelled dataset: strip stray whitespace and reset the index.
labelled["category2"] = labelled["category2"].str.strip()
labelled = labelled.reset_index(drop=True)

display(labelled.head())
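
One thing worth checking at this step: `notnull()` only removes true nulls, so a category stored as an empty or whitespace-only string would survive the filter above. The following is a minimal sketch on made-up rows, not the real dataset; the toy data and the names `toy`, `category`, `missing_share` and `cleaned` are assumptions for illustration. It treats blank strings as missing before filtering and reports the share of rows with no usable category.

```python
import pandas as pd

# Toy rows standing in for the real observations (illustrative only).
toy = pd.DataFrame(
    {
        "whatdidyousee": [
            "Cables left laying on walkway",
            "Oil spill on pathway",
            "Ladder not secured",
            "Loose scaffold board",
        ],
        "category2": ["Slips/Trips ", None, "   ", "Working at Height"],
    }
)

# Strip whitespace, then turn empty strings into missing values.
category = toy["category2"].str.strip()
category = category.mask(category == "")

# Share of rows with no usable category.
missing_share = category.isna().mean()

# Keep only the rows that have a usable category.
cleaned = (
    toy.assign(category2=category)
    .dropna(subset=["category2"])
    .reset_index(drop=True)
)

print(f"{missing_share:.0%} of toy rows have no usable category")
print(cleaned)
```

Whether blank strings actually occur in the real data is not something I can confirm here; the check is cheap enough to run once on the real `category2` column.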

Let's review the category distribution.

In [None]:
labelled["category2"].value_counts()[labelled["category2"].unique()].plot(kind="bar")
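
Alongside the bar chart, it can help to look at the distribution as proportions rather than raw counts. This is a rough sketch on invented labels; the category names other than "Slips/Trips" and the `imbalance_ratio` measure are my assumptions, not part of the original analysis. It uses `value_counts(normalize=True)` to get each category's share and compares the most and least common classes.

```python
import pandas as pd

# Toy labels (illustrative only; not the real category list).
labels = pd.Series(
    [
        "Slips/Trips",
        "Slips/Trips",
        "Slips/Trips",
        "Housekeeping",
        "Working at Height",
    ],
    name="category2",
)

# Relative frequency of each category, most common first.
shares = labels.value_counts(normalize=True)

# Ratio between the most and least common category: a rough imbalance measure.
imbalance_ratio = shares.max() / shares.min()

print(shares)
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

A large ratio would suggest that some categories have few examples to learn from, which is worth knowing before choosing how to sample the training subset.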

Before we begin any training or data manipulation