# Warm Start

This notebook takes the initial model produced by model_train.ipynb and uses the same training functionality to train on more labelled records. Training starts from the weights already stored in the model and tunes them further by exposing the model to more data.

You should be able to run this notebook repeatedly, for as long as unseen records remain, and each run should generally improve the results on the test dataset.

In [None]:
import pandas as pd
import torch 
import numpy as np
from data_helpers import feature_prep, split_sample
from model import load_checkpoint, BERTClass, train_model
from transformers import BertTokenizer

## Load in Best Model

Set the path to your preferred model here. Note that models trained in this notebook are saved to a separate warm_start folder so that they are not confused with the model loaded here.

In [None]:
load_model = "best_model.pt"

Now let's set our device.

In [None]:
if torch.cuda.is_available(): # check for CUDA gpu
    device = torch.device("cuda")
elif torch.backends.mps.is_available(): # Check for Apple M1/M2 chip
    device = torch.device("mps")
else:
    device = torch.device("cpu") # Otherwise just use CPU
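
The same device check can be wrapped in a small helper. This is a minimal sketch rather than part of the original notebook: `pick_device` is a hypothetical name, and the `getattr` guard is an assumption for older PyTorch builds that do not expose `torch.backends.mps`.

```python
import torch


def pick_device() -> torch.device:
    """Return the best available device: CUDA, then Apple MPS, then CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    # Older PyTorch builds have no torch.backends.mps, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


device = pick_device()
# Tensors fed to the model need to live on the same device as the model.
example = torch.zeros(2, 3, device=device)
```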

**NB:** Once the optimizer has been through a few epochs, the learning rate is so small that further training makes no significant improvement to the results. To rectify this, I have been resetting the optimizer here rather than using the one loaded from the checkpoint.

In [None]:
# Creating some variables
EPOCHS = 4
LEARNING_RATE = 1e-05

checkpoint_path = "warm_start/current_checkpoint.pt"
best_model = "warm_start/best_model.pt"
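valid_loss_min_input = np.inf  # np.Inf was removed in NumPy 2.0

# Initialising model components
model = BERTClass()
model.to(device)
# Adam needs the parameters to optimise; start from the fresh learning rate
optimizer_init = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)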
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Loading in model
model, optimizer, epoch, valid_loss_min_input = load_checkpoint(load_model, model, optimizer_init)
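
The optimizer reset described in the note above can be done in two ways. The sketch below shows both, using a tiny `torch.nn.Linear` as a hypothetical stand-in for the real model and a made-up stale learning rate. It does not show what `load_checkpoint` does internally, which this notebook does not define.

```python
import torch

LEARNING_RATE = 1e-05

# Hypothetical stand-in for the real model; any nn.Module behaves the same way.
model = torch.nn.Linear(4, 2)

# Simulate an optimizer state saved in a checkpoint with a very small learning rate.
old_optimizer = torch.optim.Adam(model.parameters(), lr=1e-08)
saved_state = old_optimizer.state_dict()

# Option 1: ignore the saved state and build a completely fresh optimizer.
fresh_optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

# Option 2: restore the saved state, then override only the learning rate.
# load_state_dict also restores the old lr, so it has to be reset afterwards.
restored_optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
restored_optimizer.load_state_dict(saved_state)
for group in restored_optimizer.param_groups:
    group["lr"] = LEARNING_RATE
```

Option 1 matches the reset described in the note. Option 2 keeps any optimizer state stored in the checkpoint while still raising the learning rate.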

## Data Sample

In the next few cells we take a cut of the remaining unseen observations for training, then remove those records from the unseen data so that they are not picked again.

In [None]:
import json

with open("observation_categories.json", "r") as f:
    categories = json.load(f)["categories"]

**NB:** observation_categories.json and observations-unseen.csv should have been created by model_train.ipynb.
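
For reference, here is a minimal sketch of the file layout that the loading cell above expects. Only the top-level `"categories"` key comes from this notebook. The list-of-strings value and the category names are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical category names, used only to illustrate the assumed layout.
example = {"categories": ["category_a", "category_b", "category_c"]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "observation_categories.json"
    path.write_text(json.dumps(example))

    # Same read pattern as the notebook cell above.
    with open(path, "r") as f:
        categories = json.load(f)["categories"]

n_labels = len(categories)
```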

In [None]:
model_data = pd.read_csv("observations-unseen.csv")
print(f"FULL Dataset: {model_data.shape}")
display(model_data.head())

We can now take a sample of the unseen data that we have loaded and use it to train the model.

In [None]:
SAMPLE_SIZE = 50000
TRAIN_SIZE = 0.8

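# Cap the sample size so a late run does not fail when fewer records remain
sample_data = model_data.sample(min(SAMPLE_SIZE, len(model_data)))
sampled_index = sample_data.index  # keep the row labels in case feature_prep changes them
sample_data = feature_prep(sample_data)
training_loader, validation_loader = split_sample(sample_data, TRAIN_SIZE)

# Remove the sampled rows from the unseen data so they are not picked again
model_data = model_data.drop(sampled_index).reset_index(drop=True)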

print(f"REMAINING Dataset: {model_data.shape}")
display(model_data)
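
The sample-then-drop pattern used above can be checked on a toy DataFrame. This is a small sketch with made-up records. `random_state` is set only to make the illustration repeatable and is not used in the notebook itself.

```python
import pandas as pd

# Toy stand-in for the unseen observations.
unseen = pd.DataFrame({"text": [f"record {i}" for i in range(10)]})

# Draw a sample, capped at the number of rows available.
sample_size = min(4, len(unseen))
sample = unseen.sample(sample_size, random_state=0)

# Dropping by the sampled index removes exactly the sampled rows.
remaining = unseen.drop(sample.index).reset_index(drop=True)

# No record should appear in both the sample and the remaining data.
overlap = set(sample["text"]) & set(remaining["text"])
```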

If that looks good, we can save the remaining unseen dataset back to the CSV.

In [None]:
model_data.to_csv("observations-unseen.csv", index=False)

## Training

We can pull in the training function and run it over the sample dataset with the loaded model.

In [None]:


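# Assumption: train_model counts epochs from start_epochs up to n_epochs,
# so a fresh warm-start run begins at 1 and runs for EPOCHS epochs.
train_model(
    1,                     # start_epochs
    EPOCHS,                # n_epochs
    valid_loss_min_input,
    training_loader,
    validation_loader,
    model,
    device,
    optimizer,
    checkpoint_path,
    best_model,            # best_model_path
)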
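
The `valid_loss_min_input` value passed to the training call suggests that the best model is only overwritten when the validation loss improves. The sketch below illustrates that gating logic under that assumption. It is not the actual body of `train_model`, `should_save_best` is a hypothetical helper, and the loss values are made up.

```python
import math


def should_save_best(valid_loss: float, valid_loss_min: float) -> bool:
    """Return True when this epoch's validation loss beats the best so far."""
    return valid_loss < valid_loss_min


valid_loss_min = math.inf  # nothing has been seen yet, so any loss is an improvement
saved_epochs = []

# Made-up validation losses for four epochs.
for epoch, valid_loss in enumerate([0.52, 0.47, 0.49, 0.45], start=1):
    if should_save_best(valid_loss, valid_loss_min):
        saved_epochs.append(epoch)   # the real code would write the best model here
        valid_loss_min = valid_loss  # remember the new best loss
```

Under this assumption, starting a warm-start run from the loaded minimum loss means the saved best model is replaced only when further training actually improves on it.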