<a href="https://colab.research.google.com/github/peterhanlon/public-notebooks/blob/main/7ctos-title-generator-finetune.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![7cos](https://7ctos.com/wp-content/uploads/2022/01/7CTO-Logo.png)

# Fine-tuning a model to create document titles from a body of text
Pete Hanlon 7/6/2023

Load the workbook in Google Colab [https://colab.research.google.com/]. Before you start running it, check that the "Runtime" is set to a GPU; this should happen automatically for this notebook.

The training will take about an hour and a half using the free Colab tier. If you have a Pro account you can speed things up significantly by using a larger batch size: the batch_size variable (8 in this notebook) is set in the key variables cell and passed to Seq2SeqTrainingArguments.



---


First, install the third-party service dependencies.

In [None]:
!pip install -q huggingface_hub wandb

Log in to Hugging Face with your access token.
If you don't have a Hugging Face account, create one (it's free) so that you can upload your model once it has been trained.

If you don't have a Hugging Face access token, create a write token at https://huggingface.co/settings/tokens. This will allow you to upload your model once it has been fine-tuned.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Start wandb (wandb.ai). This logs training progress to the wandb.ai website; you will need an account and an API key (it's free).

In [None]:
import wandb
import os
wandb.login()



---



# Setup the environment

Check we have all of the GPU memory available.

On an A100 GPU you should see something like "0MiB / 40960MiB" which indicates all the GPU memory is free

In [None]:
!nvidia-smi

Install the necessary Python dependencies

In [None]:
!pip install -q datasets transformers accelerate rouge-score nltk wandb

Install git large file support

In [None]:
!apt install git-lfs

Install the nltk sentence tokenizer, used later by nltk.sent_tokenize when computing the ROUGE metric

In [None]:
import nltk
nltk.download('punkt')

Set key variables used during training

In [None]:
batch_size = 8                        # Number of samples processed before the model is updated. If you are getting OOM errors, reduce this value to 4, 2 or 1.
num_train_epochs = 4                  # The number of epochs to train for, i.e. the number of times the trainer cycles through the training data.
base_model = "google/flan-t5-base"    # The name of the base model to fine-tune, in this example flan-t5-base
finetuned_model_name = 'nodissasemble/7CTOs-document-title-generator' # The name of the final fine-tuned model; change "nodissasemble" to your Hugging Face username

Determine if we have a GPU to use. This should also work on a CPU, but you're likely to be dead by the time the training completes :)

In [None]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")



---



# Load the training data

Download the raw training, test and validation files from GitHub. Downloading the training data from GitHub using curl isn't how I would normally load training data, but I wanted to make it obvious that you could use your own files. A neater solution is to upload your training data to Hugging Face and download it from there.

In [None]:
!curl https://raw.githubusercontent.com/peterhanlon/public-notebooks/main/data/title-train.jsonl > title-train.jsonl
!curl https://raw.githubusercontent.com/peterhanlon/public-notebooks/main/data/title-validation.jsonl > title-validation.jsonl
!curl https://raw.githubusercontent.com/peterhanlon/public-notebooks/main/data/title-test.jsonl > title-test.jsonl
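If you want to swap in your own files, each one should be in JSON Lines format: one JSON object per line. The sketch below writes and re-reads a tiny file using only the standard library. The rows and the file name my-title-train.jsonl are made up for illustration; the field names text, summary and title are the ones used later in this notebook.

```python
import json
import os
import tempfile

# Hypothetical rows; only the field names are taken from this notebook.
rows = [
    {"text": "Body of the first document.", "summary": "First summary.", "title": "First title"},
    {"text": "Body of the second document.", "summary": "Second summary.", "title": "Second title"},
]

# Write one JSON object per line (the JSON Lines format).
path = os.path.join(tempfile.mkdtemp(), "my-title-train.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read the file back to confirm every line parses and carries the expected keys.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), sorted(loaded[0].keys()))
```

A file written this way can be referenced from the data_files dictionary in place of the downloaded ones.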

Load the training, validation and test files into memory.

In [None]:
from datasets import load_dataset, load_metric

data_files = {
    "train": "./title-train.jsonl",
    "validation": "./title-validation.jsonl",
    "test": "./title-test.jsonl",
}

raw_datasets = load_dataset("json", data_files=data_files)

FYI - The dataset object itself is a DatasetDict, which contains one key each for the training, validation and test sets:

In [None]:
raw_datasets

FYI - Dump a single row of training data so that you can see what it looks like. 

In [None]:
raw_datasets["train"][0]



---



# Preprocess the training data

Before we can feed the training data to our model, we need to preprocess it. This is done by a Tokenizer, which tokenizes the inputs and puts them in the format the model expects.

In [None]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(base_model)

Create a function that will take a batch of rows, laid out as {'text': [...], 'summary': [...], 'title': [...]}, and tokenize the documents and titles.

Prefix the text "title:" on to each document. This will allow the model to accept prompts of the form "title: <text to summarize>".

In [None]:
max_input_length = 1024
max_target_length = 128

def preprocess_function(examples):
    inputs = ["title:" + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Tokenize the targets (titles); text_target replaces the deprecated as_target_tokenizer() context manager
    labels = tokenizer(text_target=examples["title"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

FYI - Tokenize a single row so you can see what it looks like:

*   input_ids is the tokenized document
*   labels is the tokenized title
*   attention_mask is a binary mask that marks which positions in input_ids hold real tokens (1) rather than padding (0)

In [None]:
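preprocess_function(raw_datasets['train'][:1])  # Slice rather than index: the function expects lists of values, as it gets under map(batched=True)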

Now tokenize all of the data. map runs preprocess_function over every split of the dataset; setting batched=True passes the rows to it in batches, which speeds up the process. The result is what will be used for training.

In [None]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
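With batched=True the function receives a dictionary-like object that maps each column name to a list of values, rather than one row at a time. The sketch below illustrates that layout in plain Python with made-up rows; to_batch and add_prefix are hypothetical helpers, and add_prefix only mirrors the first line of preprocess_function.

```python
# Three hypothetical rows, as you would see them when indexing a dataset one row at a time.
rows = [
    {"text": "First document body.", "title": "First"},
    {"text": "Second document body.", "title": "Second"},
    {"text": "Third document body.", "title": "Third"},
]

def to_batch(rows):
    """Turn a list of row dicts into one dict of columns (the batched layout)."""
    return {key: [row[key] for row in rows] for key in rows[0]}

def add_prefix(examples, prefix="title:"):
    """Mirror the first line of preprocess_function: build one prompt per document."""
    return [prefix + doc for doc in examples["text"]]

batch = to_batch(rows)
prompts = add_prefix(batch)
print(prompts[0])

# Passing a single row instead makes the loop walk over the characters of one string.
per_char = add_prefix(rows[0])
print(len(per_char), per_char[0])
```

This is why a function written for batched mapping should be tried out on a slice of the dataset rather than on a single indexed row.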



---



# Fine-tune the model

Now that our training data has been tokenized, we can download the pretrained base model and fine-tune it.

In [None]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(base_model).to(device)

Next we define the Seq2SeqTrainingArguments, a class that contains all the attributes used to customize the training.

batch_size is the number of samples processed before the model is updated. Reduce this number if you are getting CUDA OOM errors.

evaluation_strategy 'epoch' tells the trainer to evaluate the model after each epoch, so in this case 4 times.

In [None]:
args = Seq2SeqTrainingArguments(
    output_dir = finetuned_model_name,            # Where to output the final model and checkpoints
    evaluation_strategy = "epoch",                # The model is evaluated at the end of each epoch
    per_device_train_batch_size=batch_size,       # Number of samples per training batch on each device
    per_device_eval_batch_size=batch_size,        # Number of samples per evaluation batch on each device
    num_train_epochs=num_train_epochs,            # Number of times the trainer will run through the training set
    predict_with_generate=True,                   # Generates text from the model during evaluation - used to calculate the ROUGE score
    save_strategy="epoch",                        # Save a checkpoint after each epoch
    load_best_model_at_end=True,                  # Loads the best checkpoint at the end of training so we save the best model
    report_to="wandb",                            # Sends training metrics to wandb.ai
    push_to_hub=True,                             # Uploads the model to the Hugging Face Hub
)
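The batch size and epoch count together determine how many model updates the run performs. As a rough sketch, assuming a single device, no gradient accumulation and a final partial batch that is kept, the arithmetic looks like this. The dataset size of 1000 rows is a made-up figure, and total_update_steps is a hypothetical helper rather than part of the Trainer API.

```python
import math

def total_update_steps(num_examples, batch_size, num_epochs):
    """Number of model updates: batches per epoch (rounded up) times epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

num_examples = 1000  # hypothetical training set size
for candidate_batch_size in (1, 2, 4, 8):
    steps = total_update_steps(num_examples, candidate_batch_size, num_epochs=4)
    print(candidate_batch_size, steps)
```

Halving the batch size to get around an OOM error roughly doubles the number of updates per epoch, which is why smaller batches make the run take longer.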

Define a data collator. This assembles the tokenized rows into batches, padding the inputs and labels in each batch to a common length.

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
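To make the padding step concrete, here is a simplified pure-Python sketch of what a collator does to a batch of rows of unequal length. It is an illustration, not the DataCollatorForSeq2Seq implementation (which also returns tensors and can prepare decoder inputs). pad_batch is a hypothetical helper, the token ids are made up, and the pad id of 0 is an assumption chosen to match T5-style tokenizers.

```python
PAD_TOKEN_ID = 0      # assumed pad id for this sketch (T5-style tokenizers use 0)
LABEL_PAD_ID = -100   # label positions holding this value are ignored by the loss

def pad_batch(features):
    """Pad input_ids and labels to the longest row in the batch and build attention masks."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_input = len(f["input_ids"])
        n_label = len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * (max_input - n_input))
        batch["attention_mask"].append([1] * n_input + [0] * (max_input - n_input))
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * (max_label - n_label))
    return batch

# Two hypothetical tokenized rows of different lengths.
features = [
    {"input_ids": [11, 12, 13, 1], "labels": [21, 1]},
    {"input_ids": [14, 1], "labels": [22, 23, 24, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
print(batch["labels"])
```

The -100 padding in the labels is the reason compute_metrics has to swap those values for the pad token before decoding.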

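Load ROUGE as the metric we will use to measure our model's performance. Note that load_metric is deprecated in newer releases of the datasets library in favour of the separate evaluate library.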

In [None]:
metric = load_metric("rouge")

This is boilerplate code, taken directly from Hugging Face, to process the metrics. It just works.

In [None]:
import nltk
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}
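For intuition about the numbers this reports, ROUGE-1 measures the overlap of single words between the generated title and the reference title. The sketch below is a deliberately simplified version that splits on whitespace and skips stemming, so it will not reproduce the rouge-score package exactly. rouge1_f is a hypothetical helper and the two titles are made up.

```python
from collections import Counter

def rouge1_f(prediction, reference):
    """Simplified ROUGE-1 F-measure: unigram overlap, lowercased, whitespace tokens, no stemming."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count per word, so repeats are not over-credited.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge1_f("AI could threaten humanity", "AI could pose a threat to humanity")
print(round(score * 100, 4))
```

Here three of the four predicted words appear in the seven-word reference, giving a precision of 3/4 and a recall of 3/7.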

Now we build a Seq2SeqTrainer, since T5 is a Seq2Seq (encoder/decoder) model.

In [None]:
trainer = Seq2SeqTrainer(
    model,                                              # The base model to finetune
    args,                                               # Training arguments
    train_dataset=tokenized_datasets["train"],          # The tokenized training data
    eval_dataset=tokenized_datasets["validation"],      # The tokenized validation data
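    data_collator=data_collator,                        # Pads each batch of inputs and labels
    tokenizer=tokenizer,                                # Saved and uploaded alongside the model
    compute_metrics=compute_metrics                     # Computes ROUGE during evaluation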
)

We can now finetune our model by just calling the train method:

In [None]:
trainer.train()

Now upload the fine-tuned model to the Hugging Face Hub. Once uploaded, you will be able to see it in the Hugging Face model hub (https://huggingface.co/models).

In [None]:
trainer.push_to_hub()



---



# Perform inference using the new model

You can now load your fine-tuned model, which will be downloaded from Hugging Face, and perform inference with it (a fancy term for using the model).

At this point you can embed the code in the cells below to use your model in your own code.

In [None]:
from transformers import pipeline
title_generator = pipeline("text2text-generation", model=finetuned_model_name, tokenizer=finetuned_model_name)

Some text taken from the BBC about AI

In [None]:
text='''title: 
The Terminator sci-fi film franchise envisages a malevolent AI "Skynet" system bent on humanity's destruction.

Last week several firms warned AI could pose a threat to human existence.

Prime Minister Rishi Sunak is about to travel to the US where AI is one of the items he will be discussing.

AI describes the ability of computers to perform tasks typically requiring human intelligence.

When it came to AI, there was a "dystopian point of view that we can follow here. There's also a utopian point of view. Both can be possible", Mr Scully told the TechUK Tech Policy Leadership Conference in Westminster.

A dystopia is an imaginary place in which everything is as bad as possible.

"If you're only talking about the end of humanity because of some, rogue, Terminator-style scenario, you're going to miss out on all of the good that AI is already functioning - how it's mapping proteins to help us with medical research, how it's helping us with climate change.

"All of those things it's already doing and will only get better at doing."

The government recently put out a policy document on regulating AI which was criticised for not establishing a dedicated watchdog, and some think additional measures may eventually needed to deal with the most powerful future systems .

Marc Warner, a member of the AI Council, an expert body set up to advise the government, told BBC News last week a ban on the most powerful AI may be necessary.

However, he argued that "narrow AI" designed for particular tasks, such as systems that look for cancer in medical images, should be regulated on the same basis as existing tech.

Responding to reports on the possible dangers posed by AI, the prime minister's spokesperson said: "We are not complacent about the potential risks of AI, but it also provides significant opportunities.

"We can not proceed with AI without the guard rails in place."

'''

Pass the text to the model and generate a title for this article.

In [None]:
title_generator(text, num_beams=1, do_sample=False, truncation=True, min_length=1, max_length=200)
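Setting num_beams=1 with do_sample=False selects greedy decoding: at every step the single highest-scoring next token is kept, until an end token is produced or max_length is reached. The toy sketch below shows that loop. The score table is invented, and for simplicity each step looks only at the previous token, whereas a real model conditions on the input text and on everything generated so far.

```python
# Hypothetical next-token scores, keyed by the previous token.
NEXT_TOKEN_SCORES = {
    "<start>": {"AI": 0.6, "The": 0.4},
    "AI": {"risks": 0.5, "is": 0.3, "<end>": 0.2},
    "risks": {"debated": 0.7, "<end>": 0.3},
    "debated": {"<end>": 0.9, "AI": 0.1},
    "The": {"AI": 0.8, "<end>": 0.2},
    "is": {"<end>": 1.0},
}

def greedy_decode(scores, max_length):
    """Always pick the highest-scoring next token; stop at <end> or after max_length tokens."""
    tokens = []
    current = "<start>"
    while len(tokens) < max_length:
        candidates = scores[current]
        best = max(candidates, key=candidates.get)
        if best == "<end>":
            break
        tokens.append(best)
        current = best
    return tokens

print(" ".join(greedy_decode(NEXT_TOKEN_SCORES, max_length=200)))
```

Because nothing is sampled, running the same input twice gives the same title.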