# Fine Tune Text Summarizer With Hugging Face

Trying to adapt and follow: https://github.com/huggingface/notebooks/blob/master/examples/summarization.ipynb

This will fine tune a model to summarize GitHub Issues from the GitHub repo fastai/fastai

In [1]:
! pip install ghapi



In [1]:
from ghapi.core import GhApi
from ghapi.all import github_token, paged
import os, pickle
from fastcore.all import L

# Get the Data

Uncomment this block and run it if this is the first time running this notebook.  You need to have a [personal access token](https://docs.github.com/en/github/authenticating-to-github/creating-a-personal-access-token) in an environment variable named `GH_PAT`. 

In [3]:
# api = GhApi(owner='fastai', repo='fastai', token=os.getenv('GH_PAT'))
# issues = L(paged(api.issues.list_for_repo, state='all')).concat()
# pickle.dump(issues, open( "issues.p", "wb" ) )
# len(issues)

In [2]:
from datasets import load_metric
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq, Seq2SeqTrainer

model_checkpoint = "t5-small"
metric = load_metric("rouge")
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint).to('cuda')

In [5]:
print(model.training)

False


## Process The Data

In [6]:
import pickle
issues = pickle.load(open( "issues.p", "rb" ))

In [7]:
pairs = (issues
 .filter(lambda x: x.body and x.title and len(x.body) > 10 and len(x.title) > 5)
 .map(lambda x: {'body':x.body, 'title':x.title}).shuffle()
  )
pairs[:2]

(#2) [{'body': "Currently there is a bug where if you run `lr_find()` twice in a row, you will get very different results such as below: \r\n![image](https://user-images.githubusercontent.com/7831895/96666881-f7f9e800-1325-11eb-858a-9eae1ae82b12.png)\r\n\r\nThis sort of pattern is common in models that already have trained weights. Since we know that the models weights are stored away, my investigation led me to believe that something would be wrong with how we are loading in the optimizer in `learn.load`. \r\n\r\nThe stem of the issue is the fact that if `self.opt` is none we call `create_opt` and then pass this *new* opt into `load_model` as seen below:\r\n```python\r\ndef load(self, file, with_opt=None, device=None, **kwargs):\r\n        if device is None and hasattr(self.dls, 'device'): device = self.dls.device\r\n        if self.opt is None: self.create_opt()\r\n        file = join_path_file(file, self.path/self.model_dir, ext='.pth')\r\n        load_model(file, self.model, self.o

In [8]:
train_pairs = pairs[:2800]
eval_pairs = pairs[2800:]

### Tokenize

In [5]:
prefix = "summarize: "

max_input_length = 1024
max_target_length = 128

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples.map(lambda x: x['body'])]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, padding=True, )

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(list(examples.map(lambda x: x['title'])), max_length=max_target_length, padding=True, truncation=True,)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [18]:
from torch.utils.data import Dataset

class GH_Issues(Dataset):
    def __init__(self, data): self.data = data

    def __len__(self): return len(self.data)
    
    def __getitem__(self, idx):
        return {x: self.data[x][idx] for x in ['input_ids', 'attention_mask', 'labels']}

In [19]:
tokenized_datasets = {}
tokenized_datasets['train'] = GH_Issues(preprocess_function(train_pairs))
tokenized_datasets['eval'] = GH_Issues(preprocess_function(eval_pairs))

## Fine Tuning The Model With Trainer

In [20]:
import nltk
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}

In [21]:
batch_size = 16
args = Seq2SeqTrainingArguments(
    "test-summarization",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=5,
    predict_with_generate=True,
    fp16=True,
    logging_steps=1
)

In [22]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding=True)

In [37]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["eval"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [38]:
trainer.train()



Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len,Runtime,Samples Per Second
1,13.485,13.994021,21.1957,4.7619,17.029,17.029,19.0,2.0527,1.462
2,12.7849,13.994021,21.1957,4.7619,17.029,17.029,19.0,1.5299,1.961
3,13.8799,13.994021,21.1957,4.7619,17.029,17.029,19.0,1.5406,1.947
4,13.0761,13.994021,21.1957,4.7619,17.029,17.029,19.0,1.3244,2.265
5,13.5541,13.971494,21.1957,4.7619,17.029,17.029,19.0,1.8974,1.581


TrainOutput(global_step=5, training_loss=13.355995559692383, metrics={'train_runtime': 10.6076, 'train_samples_per_second': 0.471, 'total_flos': 6120850083840.0, 'epoch': 5.0, 'init_mem_cpu_alloc_delta': 0, 'init_mem_gpu_alloc_delta': 0, 'init_mem_cpu_peaked_delta': 0, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 124715008, 'train_mem_gpu_alloc_delta': 485873152, 'train_mem_cpu_peaked_delta': 68141056, 'train_mem_gpu_peaked_delta': 1459021824})

In [3]:
test_text = """Hi,

I have recently been looking at using fastcore.script as a way to create lightweight scripts around functions which are also imported and used elsewhere. However, I seem to be experiencing some odd behaviour, which I will illustrate with a trivial example.

Let’s say I define a script `say_hello_script.py` which contains:
```
from fastcore.script import *


@call_parse
def say_hello(greeting: Param('Greeting', str) = 'Hello',
              name: Param('Name', str) = 'World'):
    print(f'{greeting}, {name}')
```

If I call this from the command line using `python say_hello_script.py --name Chris`, I get the expected output of `Hello, Chris.` All is good so far.

However, now I want to call this function from a different script, so I will create a script `greeter.py` which contains:
```
from say_hello_script import say_hello

if __name__ == '__main__':
    say_hello(greeting='hi', name='person')
```

Running `python greeter.py` I get the output of `Hello, World`, which seems to have bypassed my arguments! Is this the normal behaviour?

Thanks,
Chris"""

In [124]:
test_text2="""New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
... A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
... Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
... In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
... Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
... 2010 marriage license application, according to court documents.
... Prosecutors said the marriages were part of an immigration scam.
... On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
... After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
... Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
... All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
... Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
... Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
... The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
... Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
... Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
... If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
... """

In [6]:
inp = tokenizer.encode(prefix+test_text, 
                return_tensors="pt",
                max_length=max_input_length, 
                truncation=True, 
                padding=True).to('cuda')

# inp['labels'] = tokenizer('', 
#                 return_tensors="pt",
#                 max_length=max_input_length, 
#                 truncation=True, 
#                 padding=True).to('cuda').input_ids

In [11]:
tokenizer("Hello World foobar")

{'input_ids': [8774, 1150, 5575, 32, 1047, 1], 'attention_mask': [1, 1, 1, 1, 1, 1]}

In [12]:
tokenizer.encode("Hello World foobar")

[8774, 1150, 5575, 32, 1047, 1]

In [None]:
type(token)

In [7]:
outputs = model.generate(inp, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)

In [8]:
tokenizer.decode(outputs[0])

"<pad> a script <unk>say_hello_script.py<unk> contains: <unk> from fastcore.script import * @call_parse def say_hello(greeting: Param('Greeting', str) = 'Hello', name: Param('Name', str) = 'World'. if I call this from the command line using <unk>python say_hello_script.py --name Chris<unk>, I get the expected output of</s>"

In [101]:
np.argmax(out.logits.shape)

2

In [36]:
out.keys()

odict_keys(['loss', 'logits', 'past_key_values', 'encoder_last_hidden_state'])

# Questions

- We don't need to enable training before using the trainer?  Does the trainer automatically enable training mode, and then disable training mode when its done?  While browsing the code for Trainer, it appears that the attribute `is_in_train` is set to True and then later set to False at the end of training, but I am not 100% sure. 
- Is there a way to grab the recommended Training arguments for the model I'm using from the Hub rather that manually specifying them myself using `Seq2SeqTrainingArguments`?  It seems that the defaults for `Seq2SeqTrainingArguments` are not model-specific, but perhaps there is a way to get this?  
- Similarly is there a way to automatically grab the metrics or recommended metrics to use that were trained the model just to ensure consistency from the model hub?  
- Why is `model` passed into DataCollatorForSeq2Seq?  I can see from the code that it is using the model's `prepare_decoder_input_ids_from_labels` attribute, but isn't that something that would/should also be available in a tokenizer instead?  I'm just trying to build a better mental model of what is happening.  
- Is there a way to create my own pipeline object similar to the high level magic you have for pretrained models?  Like is there a way I can leverage some of the same machinery you have so I don't have to create my own inference machinery?  I know I can create it myself, but wondering if there is a way to avoid having to build a thing that takes strings, tokenizes, numericalizes, does a forward pass, then decodes that back into a string with beam search etc. 

In [None]:
model.sample()