First of all, make sure your environment has installed the latest version of [🤗 Optimum Graphcore](https://github.com/huggingface/optimum-graphcore) as well as other dependencies:

In [1]:
%pip install "optimum-graphcore>=0.5, <0.6" rouge-score nltk

Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25ldone
[?25hCollecting nltk
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.5/1.5 MB[0m [31m37.0 MB/s[0m eta [36m0:00:00[0m00:01[0m
Collecting absl-py
  Downloading absl_py-1.4.0-py3-none-any.whl (126 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m126.5/126.5 kB[0m [31m32.0 MB/s[0m eta [36m0:00:00[0m
Collecting joblib
  Downloading joblib-1.2.0-py3-none-any.whl (297 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m298.0/298.0 kB[0m [31m56.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting click
  Downloading click-8.1.3-py3-none-any.whl (96 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m96.6/96.6 kB[0m [31m28.2 MB/s[0m eta [36m0:00:00[0m
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... 

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!) then execute the following cell and input your username and password:

In [2]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

Then you need to install Git-LFS:

In [3]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.9.2-1).
0 upgraded, 0 newly installed, 0 to remove and 31 not upgraded.


Let's print out the versions of Transformers and Optimum Graphcore:

In [4]:
import transformers
import optimum.graphcore

print(transformers.__version__)
print(optimum.graphcore.__version__)

4.20.1
0.5.0


Values for machine size and cache directories can be configured through environment variables or directly in the notebook:

In [5]:
import os

pod_type = os.getenv("GRAPHCORE_POD_TYPE", "pod4")
executable_cache_dir = os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "/tmp/exe_cache/") + "/summarization"

# Fine-tuning a model on a summarization task

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) model for a summarization task. We will use the [XSum dataset](https://arxiv.org/pdf/1808.08745.pdf) (for extreme summarization) which contains BBC articles accompanied with single-sentence summaries.

![Widget inference on a summarization task](images/summarization.png)

We will see how to easily load the dataset for this task using 🤗 Datasets and how to fine-tune a model on it using the `IPUSeq2SeqTrainer` API.

In [8]:
model_checkpoint = "t5-base"

This notebook is built to run  with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a sequence-to-sequence version in the Transformers library and is supported by Optimum Graphcore. Here we picked the [`t5-small`](https://huggingface.co/t5-small) checkpoint. 

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [9]:
from datasets import load_dataset, load_metric

raw_datasets = load_dataset("xsum")
metric = load_metric("rouge")

Downloading builder script:   0%|          | 0.00/5.76k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/6.24k [00:00<?, ?B/s]

Downloading and preparing dataset xsum/default to /tmp/huggingface_caches/datasets/xsum/default/1.2.0/082863bf4754ee058a5b6f6525d0cb2b18eadb62c7b370b095d1364050a52b71...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/255M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.00M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/204045 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11332 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11334 [00:00<?, ? examples/s]

Dataset xsum downloaded and prepared to /tmp/huggingface_caches/datasets/xsum/default/1.2.0/082863bf4754ee058a5b6f6525d0cb2b18eadb62c7b370b095d1364050a52b71. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

  metric = load_metric("rouge")


Downloading builder script:   0%|          | 0.00/2.16k [00:00<?, ?B/s]

The `dataset` object itself is [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key for the training, validation and test set:

In [10]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})

To access an actual element, you need to select a split first, then give an index:

In [11]:
raw_datasets["train"][0]

 'summary': 'Clean-up operations are continuing across the Scottish Borders and Dumfries and Galloway after flooding caused by Storm Frank.',
 'id': '35232142'}

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [12]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [13]:
show_random_elements(raw_datasets["train"])

Unnamed: 0,document,summary,id
0,"The Finn has signed a new deal at Ferrari that will see him stay at the team until the end of 2018.\nThere are many fans of the charismatic driver, but just how much do you know about him?\nTake our quiz to find out.\nThis content will not work on your device, please check Javascript and cookies are enabled or update your browser",Great news for Formula 1 fans - actually the whole world - because Kimi Raikkonen is sticking around.,41015615
1,"Mido Macia was killed in 2013 when officers tied him to the back of a police van by his arms before driving off.\nHe was found in a pool of blood in police custody.\nJudge Bert Bam called the killing ""barbaric"" but acknowledged that it was not premeditated.\nSentencing the men, aged between 25 and 56, Mr Bam said: ""The continuous conduct of the accused concerning the injuries on the deceased was barbaric and totally inexplicable.\n""What made their conduct more reprehensible was their cowardly attack in the cell on a defenceless and already seriously injured man.""\nA lawyer for the former officers said they would appeal against the murder conviction.\nThe sentencing of these officers sends a message that police brutality will not be tolerated by the judiciary, following several recent cases.\nJudge Bert Bam sent a clear and unambiguous message that while the police are facing an onslaught from ruthless criminals - more than 60 police officers have been killed on duty this year alone - they cannot take the law into their own hands.\nAnother important factor here was the power of video material which is increasingly used as evidence in court.\nNo-one knows what would have happened if a bystander had not filmed Mido Macia being dragged behind a police vehicle with both his hands tied up. He was later found in a pool of blood in a holding cell.\nThere is huge public opinion support for the police as they try hard to keep communities safe against gun-toting thugs. However the people of Daveyton and indeed all South Africans will now be reassured that the rule of law prevails, even against the police.\nPolice pulled over 27-year-old Mr Macia in February 2013 after he allegedly parked a vehicle illegally in Daveyton, east of Johannesburg.\nFollowing a struggle, they overpowered the driver before handcuffing him to the back of a vehicle.\nHe was later assaulted in his cell and died from head injuries and internal bleeding.\nThe incident was filmed by a bystander and caused outrage among rights groups.",A judge has sentenced eight former South African policemen to 15 years in prison for the murder of a Mozambican taxi driver.,34786559
2,"Nick Hardwick said they were ""more dangerous"" for staff and prisoners and ""less effective"" at preparing people for release so they do not reoffend.\nHe also said he had seen prisoners who were ""out of it"" from taking legal highs.\nThe Ministry of Justice said it was investing in and reforming prisons.\nMr Hardwick told the Victoria Derbyshire programme: ""The deterioration isn't just bad for prisoners, it's bad for the communities into which they'll return because not enough has been done to stop them committing more crime.""\nHe said the deterioration was due to a combination of issues and the reasons had changed during his time in the role.\n""You've got too many prisoners, not enough staff, the men who are there now are more likely to be there for more violent offences, serving longer [sentences].\n""And particularly in the past year or two there's been a surge in the availability of drugs, particularly so-called legal highs and that then leads to bullying and debt, and that's created much worse conditions.""\nHe said there were lots of ways the drugs got into prisons, over walls, from prisoners, during visits, or through corrupt staff.\n""I was in a prison the other day, and this was quite unusual, there were so many prisoners under the influence that the worst - and it is dangerous, it kills people - they were taking to the hospital, the health centre, but the guys who were less badly affected they were leaving other prisoners to mind and look after,"" he said.\n""I walked round and saw these guys who were obviously out of it.""\nMr Hardwick said sometimes there were ""simply not enough staff"".\n""Sometimes I will go on to a wing and want to talk to someone about it and you can't find a member of staff to talk to.""\nEarlier this month, Justice Secretary Michael Gove announced that Victorian prisons would be closed and replaced with nine new establishments in England and Wales by 2020. Chancellor George Osborne confirmed the closure of Holloway women's prison in London in the Spending Review on Wednesday.\nMr Hardwick said he was encouraged that the government was ""finally listening to what we are saying,"" but that they ""had to deliver"".\nA Ministry of Justice spokesman said prisons needed reform.\n""It is only through more effective rehabilitation that we will reduce reoffending, cut crime and improve public safety.\n""That is why we are investing in a modern prison estate, where governors are empowered to run prisons in the way they think best, and prisoners are given a chance to work or learn.""\nThe Victoria Derbyshire programme is broadcast on weekdays between 09:15 and 11:00 on BBC Two and the BBC News channel.","The outgoing chief inspector of prisons has said there is ""no doubt"" jails have deteriorated in the five years he has been in the role.",34932548
3,"The Spaniard, 34, won his second title in 2006 with Renault and has come close three times since - in 2007 with McLaren and 2010 and 2012 with Ferrari.\nAsked on Spanish radio if he thought he could be champion again, he said: ""Yes, I think so.\n""I compete to win. If there was no chance, I wouldn't do it.""\nMcLaren racing director Eric Boullier has predicted the team will make ""massive progress"" in 2016.\nAlonso, whose best result last year was fifth place in Hungary, said: ""I do not think we are far from achieving a podium.\n""I would be surprised if we managed it at the start of the season, because in the first few races not everything will be in place.""\nHe said his feelings for the 2016 car were ""good"", adding: ""When everything is in place, we will make a very big improvements.""\nBut he said Mercedes are still ahead of everyone else.\nAsked about the Honda engine's performance compared to that of Mercedes, the Spaniard said: ""We will certainly have less power, between 30 and 80bhp, but not 200, no.""\nAlonso said he had ""never considered either retiring or taking a year out"", adding that this was ""unthinkable"".\nThis is a direct contradiction of remarks by McLaren chairman Ron Dennis, who said at the final race of last season that the idea of a sabbatical for Alonso had been discussed at one point in 2015.\nAlonso believes McLaren are the only team who could ultimately beat Mercedes but said: ""We are starting a project and we are still some way off.""\nHe said he had ""never regretted"" leaving Ferrari at the end of 2014, despite the team's upturn in fortunes.\nAlonso's replacement, four-time world champion Sebastian Vettel, won three races in 2015 to finish third in the standings.\n""I was offered to renew until 2019 and did not want to take it up,"" said Alonso. ""I would never have been world champion there. Now I enjoy Formula 1 more being 10 positions further behind.""",Fernando Alonso still believes he can win a third World Championship despite the struggles faced by his McLaren-Honda team.,35694454
4,"A year on from living in the restricted confines of a neck and back brace to aid recovery from career-threatening injuries sustained in a race-fall, and from hearing himself regularly written off, Beasley, 22, goes to Newbury this weekend with Alpha Delphini hoping to bag the biggest prize of his fledgling career.\nAnd with the scars now healed and grave concerns about his future career no longer echoing in his ears - one of which had to be partially rebuilt - both Beasley and the Bryan Smart-trained five-year-old could hardly be in better nick.\nA rich vein of late summer form has seen a string of visits to the winners' circle for the County Durham jockey, some of the most striking of which have come on the gritty Alpha Delphini who is now lining up in Newbury's five-furlong (1000m) Dubai International World Trophy on Saturday.\n""This time last year, I couldn't really do a great deal,"" said Beasley, ""I couldn't even put on my socks, so things have come a long way. I'm hopefully getting my career back on track with some nice winners and nice rides so my confidence is up.""\n""I fractured my skull, the middle of my neck, the middle of my back, took half a lug [ear] off, smashed my teeth and had a clot on the brain that got drained off, so I was in intensive care in Stoke [on Trent] for a while.\nThe winner of more than 30 races since returning to the saddle in March has little memory of the fall at Wolverhampton in July 2015 - in which his mount Cumbrianna suffered fatal injuries - but reels off a mind-boggling list of his own injuries with a shrug of the shoulders and an insistence it was all ""one of those things, I suppose.""\nHe went on: ""I fractured my skull, and there are six plates that will stay in there. I 'did' [fractured] the middle of my neck, the middle of my back, took half a lug [ear] off, smashed my teeth and had a clot on the brain that got drained off, so I was in intensive care in Stoke for a while.\n""It's all a bit of a blur, and I didn't know much about what was happening - my fiancée and my mother and a lot of my family came down from County Durham which was a bit of a hike, and probably more of a nightmare and trauma for them than for me.""\nOnce home there was three months living like ""a mechanical man"" in the brace - ""it was strapped right around me and there were rods and stuff from under my jaw all the way down to my lower back and waist"" - and plenty of time to think.\n""There was never a doubt in my mind that I'd be back,"" said Beasley who married Carla at Christmas, ""though I suppose I didn't really take into consideration how bad I was until I look back now.\n""But I was never negative - I was always positive - and I think that helped me get through it.""\nAfter a moment's pause he added: ""To be fair, before my accident I probably took the job a bit for granted, probably like a lot of the lads, but now I try to soak it up a lot more than I used to, and not just think of the next winner.\n""Getting back was a bit of a shock, it was like starting from scratch really. I had to learn to read a race again and had to build everything up - strength, confidence, and it took a good month just to get into the swing of things but everything's good now.""\nWhile Beasley, who has won four races with Alpha Delphini this year, is in the south for Newbury, he will be keeping an anxious eye on Scotland's premier flat race, the £200,000 William Hill Ayr Gold Cup.\nThere, another prolific part of his recent purple patch Nameitwhatyoulike lines up as one of the leading contenders behind favourite Growl for the six-furlong feature, one of the season's most celebrated sprint races. Rising star apprentice Adam McNamara takes the ride.\n""At Ripon he did it really gutsily,"" said Beasley. ""And then at York, he was even better. He travelled through the race the best he's travelled, so he goes in good form and hopefully at the top of his game.\nSo after all the turmoil following the fall, if everything worked out, how good would a televised pot at Newbury be?\n""It'd be the dream really. Obviously I'm living the dream at the minute after all that's happened but a decent prize would just top it off. Someone was looking down on me after the fall, and I'm very grateful and it'd be great if it continued.""\nNews from Newbury and commentary of the Ayr Gold Cup (15:45) on BBC Radio 5 Live on Saturday, 17 September","As flat racing's 'northern circuit' reflects on another bumper summer of success, few can feel a greater sense of satisfaction than jockey Connor Beasley.",37375145


The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [14]:
metric

Metric(name: "rouge", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}, usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each prediction
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLSum"`: rougeLsum splits text using `"
"`.
        See details in https://github.com/huggingface/datasets/issues/617
    use_stemmer: Bool indicating whether Porter stemmer should be used to strip word suffixes.
    use_aggregator: Return aggregates if this is set to True
Retu

You can call its `compute` method with your predictions and labels, which need to be list of decoded strings:

In [15]:
fake_preds = ["hello there", "general kenobi"]
fake_labels = ["hello there", "general kenobi"]
metric.compute(predictions=fake_preds, references=fake_labels)

{'rouge1': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rouge2': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeL': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeLsum': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))}

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer` which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put it in a format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [16]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Downloading:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/773k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.32M [00:00<?, ?B/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


By default, the call above will use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library.

You can directly call this tokenizer on one sentence or a pair of sentences:

In [17]:
tokenizer("Hello, this one sentence!")

{'input_ids': [8774, 6, 48, 80, 7142, 55, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here (just know they are required by the model we will instantiate later), you can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html) if you're interested.

Instead of one sentence, we can pass along a list of sentences:

In [18]:
tokenizer(["Hello, this one sentence!", "This is another sentence."])

{'input_ids': [[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

To prepare the targets for our model, we need to tokenize them inside the `as_target_tokenizer` context manager. This will make sure the tokenizer uses the special tokens corresponding to the targets:

In [19]:
with tokenizer.as_target_tokenizer():
    print(tokenizer(["Hello, this one sentence!", "This is another sentence."]))

{'input_ids': [[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}


If you are using one of the five T5 checkpoints we have to prefix the inputs with "summarize:" (the model can also translate and it needs the prefix to know which task it has to perform).

In [20]:
if model_checkpoint in ["t5-small", "t5-base", "t5-larg", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the three arguments. `padding="max_length"` will ensure that an input shorter than maximum length will be padded to the maximum length. `truncation=True` will ensure that an input longer than maximum length will be truncated to the maximum length. `max_length=max_input/target_length` sets the maximum length of a sequence.

Note that it is necessary to pad all the sentences to the same length since currently Graphcore's PyTorch implementation only runs in static mode.

In [None]:
max_input_length = 1024
max_target_length = 256


def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, padding="max_length", truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=max_target_length, padding="max_length", truncation=True)
        
    # Since we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    labels["input_ids"] = [
        [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
    ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [None]:
preprocess_function(raw_datasets['train'][:2])

To apply this function on all the pairs of sentences in our dataset, we just use the `map` method of our `dataset` object we created earlier. This will apply the function on all the elements of all the splits in `dataset`, so our training, validation and testing data will be preprocessed in one single command.

In [23]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

  0%|          | 0/205 [00:00<?, ?ba/s]

  0%|          | 0/12 [00:00<?, ?ba/s]

  0%|          | 0/12 [00:00<?, ?ba/s]

Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to map has changed (and thus requires to not use the cache data). For instance, it will properly detect if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files, you can pass `load_from_cache_file=False` in the call to `map` to not use the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts by batches together. This is to leverage the full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to treat the texts in a batch concurrently.

## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is of the sequence-to-sequence kind, we use the `AutoModelForSeq2SeqLM` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us.

In [33]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq
from optimum.graphcore import IPUConfig, IPUSeq2SeqTrainer, IPUSeq2SeqTrainingArguments

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

Downloading:   0%|          | 0.00/1.10k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/945M [00:00<?, ?B/s]

Note that  we don't get a warning like in our classification example. This means we used all the weights of the pretrained model and there is no randomly initialized head in this case.

To instantiate a `IPUSeq2SeqTrainer`, we will need to define four more things. The first thing we need to define is the `IPUConfig`, which is a class that specifies attributes and configuration parameters to compile and put the model on the device. We initialize it with one config name or path:

In [40]:
ipu_config_name = 'Graphcore/t5-small-ipu'
# ipu_config = IPUConfig.from_pretrained(
#     ipu_config_name,
#     executable_cache_dir=executable_cache_dir,
# )


ipu_config = IPUConfig(executable_cache_dir=executable_cache_dir)




The other thing we need to define is the `IPUSeq2SeqTrainingArguments`, which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [41]:
micro_batch_size = 1
gradient_accumulation_steps = 16

# model_name = model_checkpoint.split("/")[-1]
model_name = 'pszemraj/long-t5-tglobal-base-16384-book-summary'

args = IPUSeq2SeqTrainingArguments(
    model_name,
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=micro_batch_size,
    per_device_eval_batch_size=micro_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    pod_type=pod_type,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
#     predict_with_generate=True,
    dataloader_drop_last=True,
    logging_steps=100,
    push_to_hub=False,
)

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the three batch-size-related arguments, namely `micro_batch_size`, `gradient_accumulation_steps` and `pod_type` defined at the top of the cell and customize the weight decay. Since the `IPUSeq2SeqTrainer` will save the model regularly and our dataset is quite large, we tell it to make three saves maximum. Note that the `predict_with_generate` option is disabled since it is not supported yet.

The last argument to setup everything so we can push the model to the [Hub](https://huggingface.co/models) regularly during training. Remove it if you didn't follow the installation steps at the top of the notebook. If you want to save your model locally in a name that is different than the name of the repository it will be pushed, or if you want to push your model under an organization and not your name space, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `"sgugger/t5-finetuned-xsum"` or `"huggingface/t5-finetuned-xsum"`).

Then, we need a special kind of data collator, which will prepare the `decoder_input_ids`:

In [42]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

The last thing to define for our `IPUSeq2SeqTrainer` is how to compute the metrics from the predictions. We need to define a function for this, which will just use the `metric` we loaded earlier, and we have to do a bit of pre-processing to decode the predictions into texts:

In [43]:
import nltk
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}

Then we just need to pass all of this along with our datasets to the `IPUSeq2SeqTrainer`:

In [44]:
trainer = IPUSeq2SeqTrainer(
    model,
    ipu_config,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
#     compute_metrics=compute_metrics
    compute_metrics=None # Use None since predict_with_generate is not supported yet.
)

Overriding IPU config: gradient_accumulation_steps=16


KeyError: 'LongT5ForConditionalGeneration pipelined version not found in registry.'

We can now finetune our model by just calling the `train` method:

In [None]:
trainer.train()

You can now upload the result of the training to the Hub, just execute this instruction:

In [None]:
# trainer.push_to_hub()

You can now share this model with all your friends, family, favorite pets: they can all load it with the identifier `"your-username/the-name-you-picked"` so for instance:

```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("sgugger/my-awesome-model")
```