# Training

This is a section of the `main` notebook. It covers the training of Google's **T5** model.

# III.1. Loading the model

Now that the dataset is fully preprocessed, we can start working on the final model we used.

Because our computational power is limited, we have to use a relatively light model that is well suited to text summarization.

Here, we will be working with Google's `flan-t5-base` checkpoint, an instruction-tuned variant of T5. First, let's load the model.

In [None]:
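# Load the tokenizer and the model directly from the Hugging Face Hub
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

# Longest input length reported by the tokenizer for this checkpoint
max_token_limit = tokenizer.model_max_length
print("max_token_limit", max_token_limit)

max_input_length = max_token_limit  # dialogues are truncated to the model limit
max_target_length = 30              # reference summaries are truncated to 30 tokens

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)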

# III.2. Tokenization

We cannot use the dataset directly, because models operate on numbers rather than on raw text. The first step is therefore to tokenize the text, that is, to convert it into sequences of integer token IDs.

Also, because we are using Hugging Face models, the **reference** summaries are called **labels**.

In [None]:
def tokenize_dataset(examples):
    model_inputs = tokenizer(examples['dialogue'], max_length=max_input_length, truncation=True)
    labels = tokenizer(examples['generated_summary'], max_length=max_target_length, truncation=True)
    model_inputs['labels'] = labels['input_ids']

    return model_inputs

tokenized_dataset = ds.map(tokenize_dataset)
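
To make the roles of `max_length`, `truncation` and the `labels` field more concrete, here is a minimal sketch that mimics the function above with a toy tokenizer. The vocabulary `TOY_VOCAB` and the helpers `toy_encode` and `toy_tokenize` are hypothetical names introduced for illustration only; the real T5 tokenizer uses SentencePiece subwords, not whitespace splitting.

```python
# Toy illustration only: a hypothetical whitespace vocabulary, not T5's real tokenizer.
TOY_VOCAB = {"<pad>": 0, "</s>": 1, "<unk>": 2,
             "hello": 3, "how": 4, "are": 5, "you": 6, "greeting": 7}

def toy_encode(text, max_length):
    """Map words to integer IDs, truncate, and end with the end-of-sequence ID."""
    ids = [TOY_VOCAB.get(word, TOY_VOCAB["<unk>"]) for word in text.lower().split()]
    # Keep at most max_length - 1 tokens so the end marker still fits
    return ids[: max_length - 1] + [TOY_VOCAB["</s>"]]

def toy_tokenize(example, max_input_length=5, max_target_length=3):
    """Same shape as tokenize_dataset: inputs from the dialogue, labels from the summary."""
    model_inputs = {"input_ids": toy_encode(example["dialogue"], max_input_length)}
    model_inputs["labels"] = toy_encode(example["generated_summary"], max_target_length)
    return model_inputs

example = {"dialogue": "Hello how are you today friend", "generated_summary": "greeting"}
encoded = toy_tokenize(example)
print(encoded)  # {'input_ids': [3, 4, 5, 6, 1], 'labels': [7, 1]}
```

The dialogue has six words but only four survive truncation, followed by the end marker; the summary becomes the `labels` list that the model is trained to produce.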

Working with a small dataset is very useful for debugging. The cell below extracts a small sample of the full dataset. The final version of the model, however, was trained on the full dataset.

In [None]:
small_dataset = tokenized_dataset.filter(lambda e, i: i < 10, with_indices=True)
small_dataset

# III.3. Training and evaluation of the model

# A. Evaluation of the model's quality

To monitor training, we first need a metric that tells us how well the model performs. Again, we use the ROUGE score to measure the quality of the generated summaries.

In [None]:
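import numpy as np
from nltk.tokenize import sent_tokenize

# NOTE: `scorer` is not defined in this section. It is assumed to be the ROUGE scorer
# created earlier in the notebook (for example a `rouge_score` RougeScorer
# configured for 'rouge1' and 'rougeL').

def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # -100 marks padded label positions; replace it with the pad token before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Put one sentence per line
    decoded_predictions = ["\n".join(sent_tokenize(pred.strip())) for pred in decoded_predictions]
    decoded_labels = ["\n".join(sent_tokenize(label.strip())) for label in decoded_labels]

    # Score each pair, reference first (assuming rouge_score's `score(target, prediction)`
    # order; the F-measure used below is the same in either order)
    result = [scorer.score(label, pred) for pred, label in zip(decoded_predictions, decoded_labels)]

    rougeL = [score['rougeL'].fmeasure * 100 for score in result]
    rouge1 = [score['rouge1'].fmeasure * 100 for score in result]

    # Average the per-example F-measures over the evaluation set
    result = {
        'rougeL': sum(rougeL)/len(rougeL),
        'rouge1': sum(rouge1)/len(rouge1),
    }

    return result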
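
To give an intuition for what the `rouge1` F-measure reports, here is a minimal sketch of a unigram-overlap score written from scratch. The function `toy_rouge1_f` is a hypothetical helper, not the scorer used in this notebook: it splits on whitespace only, whereas real ROUGE implementations apply their own tokenization and optional stemming, so the numbers can differ.

```python
from collections import Counter

def toy_rouge1_f(reference, prediction):
    """Unigram-overlap F-measure between a reference and a predicted summary."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())  # share of predicted words that match
    recall = overlap / sum(ref_counts.values())      # share of reference words recovered
    return 2 * precision * recall / (precision + recall)

reference = "the cat sat on the mat"
prediction = "the cat lay on the mat"
score = toy_rouge1_f(reference, prediction)
print(round(score * 100, 2))  # 83.33
```

Five of the six words match in each direction, so precision and recall are both 5/6 and the F-measure is about 83.33 on the 0 to 100 scale used in `compute_metrics`.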

# B. Training and quality of the final model

We need to generate summaries during evaluation in order to compute ROUGE scores while training.

Setting `predict_with_generate=True` in `Seq2SeqTrainingArguments` enables this.

In [None]:
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="notification-hub",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=4,
    predict_with_generate=True,
    push_to_hub=True,
)

The `DataCollatorForSeq2Seq` collator dynamically pads the inputs and the labels of each batch. We pass it the model because T5 is an encoder-decoder Transformer: the decoder inputs are obtained by shifting the labels one position to the right, and the collator uses the model to prepare them.

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
)
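
The padding and right-shifting described above can be sketched in plain Python. This is an illustration under stated assumptions, not the collator's actual code: the helpers `pad_batch` and `shift_right` are hypothetical, the token IDs are made up, and the sketch assumes T5's convention that the pad token (ID 0) also serves as the decoder start token, with `-100` as the label padding value ignored by the loss.

```python
PAD_ID = 0           # assumed: T5 pad token ID, also used as the decoder start token
LABEL_PAD_ID = -100  # label padding value, ignored by the cross-entropy loss

def pad_batch(sequences, pad_value):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]

def shift_right(labels):
    """Build decoder inputs: prepend the start token and drop the last label."""
    shifted = [PAD_ID] + labels[:-1]
    # -100 is not a valid token ID, so replace it with the pad token
    return [PAD_ID if token == LABEL_PAD_ID else token for token in shifted]

batch_labels = [[7, 8, 9, 1], [5, 1]]  # made-up label IDs, 1 = end of sequence
padded_labels = pad_batch(batch_labels, LABEL_PAD_ID)
decoder_input_ids = [shift_right(row) for row in padded_labels]

print(padded_labels)      # [[7, 8, 9, 1], [5, 1, -100, -100]]
print(decoder_input_ids)  # [[0, 7, 8, 9], [0, 5, 1, 0]]
```

Because the padding length depends only on the longest sequence in each batch, short batches stay short, which is what "dynamic" padding refers to.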

Finally, we instantiate the trainer with the standard arguments.

In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Let's train our model.

In [None]:
import nltk
nltk.download('punkt')

trainer.train()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

| Epoch | Training Loss | Validation Loss | RougeL | Rouge1 |
|---|---|---|---|---|
| 1 | 1.6358 | 1.555176 | 37.269752 | 43.354466 |
| 2 | 1.4758 | 1.54369 | 38.159236 | 43.892302 |
| 3 | 1.3991 | 1.553527 | 38.264955 | 44.083064 |
| 4 | 1.3365 | 1.562817 | 38.419383 | 44.56143 |

TrainOutput(global_step=9928, training_loss=1.4745837387589078, metrics={'train_runtime': 4348.3588, 'train_samples_per_second': 9.133, 'train_steps_per_second': 2.283, 'total_flos': 1.457011180505088e+16, 'train_loss': 1.4745837387589078, 'epoch': 4.0})
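
As a sanity check, the figures in this output can be related to the training arguments with a little arithmetic. The sketch below assumes a single device and no gradient accumulation (neither is stated in this section), so that one optimizer step corresponds to one batch of 4 examples.

```python
# Values copied from the TrainOutput above and from the training arguments
global_step = 9928
num_train_epochs = 4
per_device_train_batch_size = 4
train_runtime = 4348.3588          # seconds
train_samples_per_second = 9.133

# Assumption: single device, no gradient accumulation
steps_per_epoch = global_step // num_train_epochs
max_examples = steps_per_epoch * per_device_train_batch_size
min_examples = max_examples - per_device_train_batch_size + 1  # last batch may be partial

# Independent estimate from the reported throughput (rounded, so only approximate)
estimated_examples = train_samples_per_second * train_runtime / num_train_epochs

print(steps_per_epoch)             # 2482
print(min_examples, max_examples)  # 9925 9928
print(round(estimated_examples))   # 9928
```

Under these assumptions, both routes point to a training split of roughly 9,900 examples, processed at about 18 minutes per epoch.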

Now let's look at the final ROUGE scores.

In [None]:
trainer.evaluate()

We will push our model to the Hugging Face Hub so that it can be used later.

In [None]:
trainer.push_to_hub(commit_message="Training complete", tags="summarization")