# Fine Tune Text Summarizer With Hugging Face

Adapted from the Hugging Face summarization example notebook: https://github.com/huggingface/notebooks/blob/master/examples/summarization.ipynb

This notebook fine-tunes a model to summarize GitHub issues from the `fastai/fastai` repository, using each issue's title as the target summary for its body.

In [1]:
! pip install ghapi



In [2]:
from ghapi.core import GhApi
from ghapi.all import github_token, paged
import os, pickle
from fastcore.all import L

# Get the Data

Uncomment this block and run it the first time you use this notebook. It requires a [personal access token](https://docs.github.com/en/github/authenticating-to-github/creating-a-personal-access-token) stored in an environment variable named `GH_PAT`.

In [3]:
# api = GhApi(owner='fastai', repo='fastai', token=os.getenv('GH_PAT'))
# issues = L(paged(api.issues.list_for_repo, state='all')).concat()
# with open("issues.p", "wb") as f: pickle.dump(issues, f)  # cache the issues locally
# len(issues)

In [4]:
from datasets import load_metric
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq, Seq2SeqTrainer

model_checkpoint = "t5-small"
metric = load_metric("rouge")
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

In [5]:
print(model.training)

False


## Process The Data

In [6]:
import pickle
# Load the issues cached by the download step above
with open("issues.p", "rb") as f:
    issues = pickle.load(f)

In [7]:
pairs = (issues
 .filter(lambda x: x.body and x.title and len(x.body) > 10 and len(x.title) > 5)
 .map(lambda x: {'body':x.body, 'title':x.title}).shuffle()
  )
pairs[:2]

(#2) [{'body': "First reported bug here:\r\nhttps://forums.fast.ai/t/bug-learn-summary-does-not-work-on-2nd-transfer-learning/77897\r\n\r\n**tldr:** learn.summary() crashes out with the following summary when doing a 2nd cycle of transfer learning\r\n```\r\n 57         elif val <= self.first_its or val >= self.last_v + self.wait_for or val >= self.total:\r\n 58             cur_t = time.time()\r\n 59             avg_t = (cur_t - self.start_t) / val\r\n 60             self.wait_for = max(int(self.update_every / (avg_t+1e-8)),1)\r\n 61             self.pred_t = avg_t * self.total\r\n \r\n AttributeError: 'NBProgressBar' object has no attribute 'start_t'\r\n```\r\n\r\nWhen I revert back the fastai code (in the fastai2 repo to commit: 59d878d3cf233ea24eb8fd8987098f17edd8c8ef) this crash goes away. I've isolated it to the subsequent commit d9ed4a8337bab36d3680fd787494e83ebd2f9a4b, with refactor learn.summary() that causes this error.\r\n\r\n\r\n", 'title': 'learn.summary() crashes out on 2nd

In [8]:
train_pairs = pairs[:2800]
eval_pairs = pairs[2800:]

### Tokenize

In [9]:
prefix = "summarize: "

max_input_length = 1024
max_target_length = 128

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples.map(lambda x: x['body'])]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Set up the tokenizer for the targets (the issue titles)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(list(examples.map(lambda x: x['title'])), max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [10]:
from torch.utils.data import Dataset

class GH_Issues(Dataset):
    def __init__(self, data): self.data = data

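    # Count examples, not fields: len(self.data) would return the number of keys in the tokenizer output.
    def __len__(self): return len(self.data['input_ids'])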

    def __getitem__(self, idx):
        return {x: self.data[x][idx] for x in ['input_ids', 'attention_mask', 'labels']}

In [11]:
tokenized_datasets = {}
tokenized_datasets['train'] = GH_Issues(preprocess_function(train_pairs))
tokenized_datasets['eval'] = GH_Issues(preprocess_function(eval_pairs))
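
One thing worth checking at this point: the tokenizer output behaves like a dictionary keyed by field name, so `len()` on it counts the fields rather than the examples. A minimal sketch, with a plain dict of made-up token ids standing in for the real tokenizer output:

```python
# Stand-in for tokenizer output: three fields, four examples (token ids are made up).
encoded = {
    "input_ids": [[21, 5, 1], [8, 1], [13, 7, 2, 1], [4, 1]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1], [1, 1]],
    "labels": [[9, 1], [3, 1], [6, 1], [2, 1]],
}

n_fields = len(encoded)                 # counts the keys
n_examples = len(encoded["input_ids"])  # counts the examples

print(n_fields, n_examples)
```

If `len(tokenized_datasets['train'])` does not match `len(train_pairs)`, the trainer will see fewer examples than intended.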

## Fine Tuning The Model With Trainer

In [12]:
import nltk
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}
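
The `-100` replacement in `compute_metrics` is needed because padded label positions are filled with `-100`, which the loss ignores but which is not a token id the tokenizer can decode. A pure-Python sketch of that one step, assuming a pad token id of 0 (the ids in `labels` are made up):

```python
PAD_ID = 0        # assumed pad token id
IGNORE_ID = -100  # value used to mask padded label positions

labels = [[71, 12, 1, -100, -100], [5, 1, -100, -100, -100]]

# Same effect as np.where(labels != -100, labels, pad_id) on an array
decodable = [[tok if tok != IGNORE_ID else PAD_ID for tok in row] for row in labels]
print(decodable)
```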

In [13]:
batch_size = 16
args = Seq2SeqTrainingArguments(
    "test-summarization",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=5,
    predict_with_generate=True,
    fp16=True,
)

In [14]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding=True)

In [15]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["eval"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [16]:
trainer.train()



| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len | Runtime | Samples Per Second |
|---|---|---|---|---|---|---|---|---|---|
| 1 | No log | 4.610332 | 11.0 | 0.0 | 11.0 | 11.0 | 19.0 | 0.7235 | 4.146 |
| 2 | No log | 4.610332 | 11.0 | 0.0 | 11.0 | 11.0 | 19.0 | 0.673 | 4.458 |
| 3 | No log | 4.610332 | 11.0 | 0.0 | 11.0 | 11.0 | 19.0 | 0.6828 | 4.394 |
| 4 | No log | 4.610332 | 11.0 | 0.0 | 11.0 | 11.0 | 19.0 | 0.6866 | 4.37 |
| 5 | No log | 4.610412 | 11.0 | 0.0 | 11.0 | 11.0 | 19.0 | 0.6884 | 4.358 |


TrainOutput(global_step=5, training_loss=3.04730167388916, metrics={'train_runtime': 7.7661, 'train_samples_per_second': 0.644, 'total_flos': 2052989752320.0, 'epoch': 5.0, 'init_mem_cpu_alloc_delta': 2618056704, 'init_mem_gpu_alloc_delta': 242026496, 'init_mem_cpu_peaked_delta': 164737024, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 1319022592, 'train_mem_gpu_alloc_delta': 728832512, 'train_mem_cpu_peaked_delta': 0, 'train_mem_gpu_peaked_delta': 157294080})
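
The `global_step=5` in this output can be compared with what the data size predicts. A rough sketch of the arithmetic, assuming a single device, no gradient accumulation, and the 2,800-example training split from earlier; `expected_steps` is a hypothetical helper, not part of `transformers`:

```python
import math

def expected_steps(n_examples, batch_size, epochs):
    """Optimizer steps for a run on one device without gradient accumulation."""
    return math.ceil(n_examples / batch_size) * epochs

batch_size, epochs = 16, 5

planned = expected_steps(2800, batch_size, epochs)
print(planned)  # 875

# Largest training set consistent with the observed 5 steps over 5 epochs
observed_steps = 5
max_examples_seen = (observed_steps // epochs) * batch_size
print(max_examples_seen)  # 16
```

Five steps over five epochs means each epoch contained a single batch, which suggests the dataset length reported to the trainer was far smaller than 2,800 when this output was recorded. A run that short also stays below the default `logging_steps` of 500, which is one likely reason the `Training Loss` column shows "No log".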

# Questions

- In the training log, why does `Training Loss` report "No log"? Is this a bug?
- Don't we need to enable training mode before using the trainer? Does the trainer automatically enable training mode and then disable it when it's done? While browsing the code for `Trainer`, it appears that the attribute `is_in_train` is set to True and then set back to False at the end of training, but I am not 100% sure.
- Is there a way to grab the recommended training arguments for the model I'm using from the Hub, rather than manually specifying them myself using `Seq2SeqTrainingArguments`? It seems that the defaults for `Seq2SeqTrainingArguments` are not model-specific, but perhaps there is a way to get this?
- Similarly, is there a way to automatically grab from the model hub the metrics (or recommended metrics) that were used when the model was trained, just to ensure consistency?
- Why is `model` passed into `DataCollatorForSeq2Seq`? I can see from the code that it uses the model's `prepare_decoder_input_ids_from_labels` method, but isn't that something that would/should be available in a tokenizer instead? I'm just trying to build a better mental model of what is happening.
- Is there a way to create my own pipeline object, similar to the high-level magic you have for pretrained models? Is there a way I can leverage some of the same machinery so I don't have to write my own inference code? I know I can create it myself, but I'm wondering if I can reuse something you already have, so I don't have to build a thing that takes strings, tokenizes, numericalizes, does a forward pass, and then decodes the result back into a string with beam search, etc.