<a href="https://colab.research.google.com/github/meursault42/AHWalkins/blob/master/sample_fine_tune_hf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
While training an LLM (large language model) from scratch is always possible (with sufficient computational oomph) *in practice* this is rarely done. There are a variety of reasons for this, but for many tasks beginning with a sufficiently large model (that you are able to read into memory) and then *fine tuning* that model on your task at hand will result in generally good performance at relatively low computational cost. As a result, this notebook (which was designed to run in collab) will walk through the basics of downloading, finetuning, and applying a test LLM on a text classification task. The difference between what is demonstrated here and a more operational version will largely be decided by the bigness of the model, the toyness of the task, and the deployment method, rather than the syntax shown.

The first step will be to install the transformers library from huggingface. If you are running on this on your PC or GFE or a machine without a GPU, you will likely first need to install python and cuda, along with dependencies, prior to installing tranformers. If you'd like to run this on a CPU you'll want to uncomment the line for "vanilla" transformers. Transformers[torch] refers to the variant of the library that works on torch for gpus. Side note: torch is a high level api that allows us to create and manipulate tensors that can be operated on GPUs.

This model was fine-tuned on a v100 for approximately 2 hours.

In [None]:
! pip install datasets evaluate
#!pip install transformers
!pip install transformers[torch]
from transformers import (
   AutoConfig,
   AutoTokenizer,
   TFAutoModelForSequenceClassification,
   AdamW,
   glue_convert_examples_to_features
)
from datasets import load_dataset, Dataset, DatasetDict

Note in the above cell, the pip command is prepended by a ! indicating a so-called "magic" command. If you run this code outside of a notebook, you may need to install the libraries outside of your run time directly in the environment.

The next thing we will do is select a model. There are many ways of determing which base LLM is best, but for this example I'll simply select a suitably small (and therefore *fast*) model. Huggingface has a complete list of models (415,892 as of creating this notebook) here: https://huggingface.co/models

At this point we are just holding the model name back, we'll talk about using it later on.

In [2]:
model_name = 'edmundhui/mental_health_trainer'

# Dataset and cleanup

Next we will need a target dataset to finetune against. While this particular set is coming from datasets, which itself is a part of huggingface and comes with a number of quality of life features as a data object, this can be replaced with any arbitrary dataset. Of note: there is a line below which transforms a pandas df to a dataset object.

 To start though, we'll simply download a dataset and do some minor cleaning. This dataset is a df of mental health posts that appeared on the website Reddit between 2019 and 2021 on a series of specific, mental health related subreddits. Note: some of these posts are quite graphic.

In [3]:
reddit_health = load_dataset('solomonk/reddit_mental_health_posts', 'train')

#this dataset doesn't have a test/train split. We'll need that for our task, so we can manually implement that here.
#we'll also want to rename a two columns to make our libraries happy.
reddit_health = reddit_health.rename_column("body", "text")
reddit_health = reddit_health.rename_column("subreddit", "label")

#we then extract the dataset from the dataset dict
reddit_health_ds = reddit_health['train']

#we'll also do a bit of light cleaning to scrub nones and nans
#datasets doesn't allow this natively, so we'll pop this into pandas rq and fix that up
reddit_health_ds = reddit_health_ds.to_pandas()
reddit_health_ds = reddit_health_ds.dropna()
label_set = set(reddit_health_ds['label'])
#this was discovered while trying to tokenize the dataset. You may run into a similar issue. This is the code I used to id troublesome entries:
#result_list = []
#for value in reddit_health['text']:
#  if value is None:  # Check for None or NaN
#      result_list.append("is None")
#  elif value != value :
#      result_list.append("is NaN")
#print(result_list)

#aaaand swap it back
reddit_health_ds = Dataset.from_pandas(reddit_health_ds)

#and encode the subreddit as a psuedo label, which we can imagine as a form of supervised topic modelling
reddit_health['train'] = reddit_health_ds
reddit_health = reddit_health.class_encode_column("label")

#we then do our test trainsplit, stratified on our output label to try and balance our classes. a 75/25 split is common for finetuning tasks.
#you may occasionally do a train/tune/test split if you wish to do a more complex set of optimizations, but
#here it should be unnecessary
reddit_health = reddit_health['train'].train_test_split(test_size=0.25, stratify_by_column="label")

#we've modified the dataset object, so we'll retain this so we can interpret the labels for later inspection
label_key = {0,1,2,3,4}
id2label = {k: v for k, v in zip(label_key, label_set)}
label2id = {v: k for k, v in id2label.items()}

Downloading readme:   0%|          | 0.00/425 [00:00<?, ?B/s]



Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/29.6M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/17.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/29.8M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/20.1M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Casting to class labels:   0%|          | 0/149679 [00:00<?, ? examples/s]

We can now take a look at our dataset before we tokenize it for finetuning.  Obviously you'll want to understand your data quite well for actual work, but we'll bypass that step for this example.

In [34]:
reddit_health["train"][80]

{'author': 'jessiethehutt',
 'text': 'I’ve always felt lonely when it came to my family. We are close, but in a superficial way, and they have always ignored my anxiety, depression, SH, and discounted my PTSD.\n\nMy SH was partially a way to prove to my family that I was “that depressed,” but all I got from them was shame. They occasionally bring up that it was hard on them to see me that miserable, kind of as a joke by the way they say it. \n\nMy mom is in her own world, as she is bipolar amongst other things, and she has never once mentioned the abuse she put me through when I was living with her as a child. The rest of my family simply looks away when they know I’m struggling with my mental health. I’d honestly be okay with a hug and a “it’s going to be okay.”\n\nI know it’s hard to help someone who has a mental illness, but I never ask for anything, don’t do anything that would be a hinderance, and don’t put my problems on them. I’ve walked on eggshells my whole life to not upset o

# Tokenization

Next we'll call the tokenizer from the model we defined above to tokenize our text. This will use a mapping function that's defined in the datasets library. Because we are civilized adults we will do this by defining a function, then applying it on our dataset. We'll also drop a few extraneous columns we don't really need.

In [4]:
reddit_health = reddit_health.remove_columns(['author','created_utc','id','num_comments','score','title','upvote_ratio','url','__index_level_0__'])

In [5]:
# Initialise tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_reddit_health = reddit_health.map(preprocess_function, batched=True, remove_columns=['text'])

from transformers import DataCollatorWithPadding
#this allows automated padding based on the input set length
#as a reminder: we require our inputs to be of a fixed length
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

tokenizer_config.json:   0%|          | 0.00/321 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Map:   0%|          | 0/112259 [00:00<?, ? examples/s]

Map:   0%|          | 0/37420 [00:00<?, ? examples/s]

OK, now that we are tokenized we can take a little peak at our dataset. You'll discover that we've lost our previously legible text and have ended up with some neat numbers instead.

In [8]:
tokenized_reddit_health['train'][80]

{'text': '[deleted]',
 'label': 1,
 'input_ids': [101, 1031, 17159, 1033, 102],
 'token_type_ids': [0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1]}

# Model Set-up

We are almost ready to go. Next step: since we are going to be fine tuning this model, we'll need a metric to perform the optimization pass. A common NLP metric is accuracy (although this is not always the best metric). Here though, we'll just use the most common set up, and check accuracy against our predicted labels (which are the subreddit names).

In [6]:
import evaluate

accuracy = evaluate.load("accuracy")
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

OK: last step. We need to initilize our model and it's parameters. Here you can see we've got the automodel for sequence classification (which is what we are doing), 5 labels (which is how many we have), and label to id sets (which we defined up at the top during pre-processing).  

In [7]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=5, id2label=id2label, label2id=label2id
)

config.json:   0%|          | 0.00/947 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

And now we just define the trainer arguments. I copied these from a walkthrough from huggingface and just made small changes. Model optimization parameterization is an endless well from which we would not return from. But some highlights:


*   learning rate: this is the size step taken in SGD optimization. Smaller means more accurate but slower training speed.
*   num_train_epochs: fine-tuning is usually done with *relatively* few. Usually between 2-50 is common. More is fine, but each epoch is essentially re-running the training process. YMMV but I got sick of paying for this after 2 hours. It will also depend on your training and eval loss. Early stopping is usually a good idea when going above 10.
*   batch size: the number of samples to review before re-calculating weights. Smaller means more accurate, but slower training speed.
*   weight decay: a regularization parameter for the optimizer (which is a twist on ADAM) which penalizes large weights during training.



In [10]:
training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_reddit_health["train"],
    eval_dataset=tokenized_reddit_health["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Let's GO!

In [11]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,0.8261,0.832707,0.681587
2,0.7555,0.851984,0.68287


TrainOutput(global_step=14034, training_loss=0.8148591797355574, metrics={'train_runtime': 6293.6132, 'train_samples_per_second': 35.674, 'train_steps_per_second': 2.23, 'total_flos': 5.146195418528211e+16, 'train_loss': 0.8148591797355574, 'epoch': 2.0})

# Don't forget to save

We've done it! Probably! Maybe not! Take a look at your training and eval loss to assess how well you've actually done. As you see above, we've done decently for relatively little effort. Our accuracy is around ~70%. For models you wish to publish or use in production this is nowhere near good enough. You'll want to set a baseline of .85 but aim towards .9. But for our example that's OK. Next step will be to save your model (assuming you like it and wish to retain it.) There are several ways of doing this: trainer.save_model will simply save the model to your local directory. huggingface_hub will allow you to share your model publically. Please take care if you are doing this with a model trained on VA data. You will need to get permission before doing so.

In [17]:
trainer.save_model ("sample_fine_tune_hf")

In [45]:
from huggingface_hub import notebook_login
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [46]:
trainer.push_to_hub()

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/4.60k [00:00<?, ?B/s]

'https://huggingface.co/meursault42/my_awesome_model/tree/main/'

# Inference with our freshly tuned model

OK with that out of the way, let's see how it does. I've grabbed a sample entry from our test set (which I'll note here the model was evaluated on: this is not true out of set generalization). For true out of set inference, you'll need to have a true test set which was not used for tuning (which, again, ours was).

In [58]:
samp_text = reddit_health["test"][145]['text']
samp_label = reddit_health["test"][145]['label']

There are a lot of ways of running your model once it is in memory, but I've stolen some more code from HF that implements a pipeline. This just means it can take your novel input, run it through the tokenizer and then run the model on it.

In [59]:
from transformers import pipeline

classifier = pipeline("text-classification", model="my_awesome_model")
classifier(samp_text)

[{'label': 'depression', 'score': 0.9957908391952515}]

The model believes the sample text came from the depression subreddit with extremely high certainty (99.579% to be precise: thanks softmax!). Let's see if that pans out for us

In [60]:
print(samp_label)
print(label2id)

0
{'depression': 0, 'ptsd': 1, 'ADHD': 2, 'OCD': 3, 'aspergers': 4}


Indeed it did. The model was correct. Hooray.

To be clear: this is merely a single instance of how to perform this task. There are *many* other NLP tasks, and this is probably one of the simplest. But hopefully this sample code gives you a sense of the workflow and steps needed to do simple finetuning for a labelled task.