If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets as well as other dependencies. Uncomment the following cell and run it. Note the `rouge-score` and `nltk` dependencies - even if you've used 🤗 Transformers before, you may not have these installed!

In [None]:
#! pip install transformers datasets
#! pip install rouge-score nltk
#! pip install huggingface_hub

If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!) then run the following cell and input your token:

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Then you need to install Git-LFS and set up Git if you haven't already. Uncomment the following instructions and adapt them with your name and email:

In [None]:
# !apt install git-lfs
# !git config --global user.email "you@example.com"
# !git config --global user.name "Your Name"

Make sure your version of Transformers is at least 4.16.0 since some of the functionality we use was introduced in that version:

In [1]:
import transformers

print(transformers.__version__)

4.16.0.dev0
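If you would rather have the notebook fail early than read the printed version by eye, the check can be automated. The following is a minimal sketch using only the standard library; `release_tuple` and `meets_minimum` are hypothetical helpers written for this illustration, and they deliberately ignore suffixes such as `.dev0` (an assumption that treats `4.16.0.dev0` as good enough, as in the output above).

```python
import re


def release_tuple(version):
    """Return the leading numeric part of a version string as a 3-tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognised version string: {version!r}")
    parts = tuple(int(part) for part in match.group().split("."))
    # Pad short versions such as "4.16" so that tuples compare consistently.
    return parts + (0,) * (3 - len(parts))


def meets_minimum(installed, minimum):
    # Only the numeric release parts are compared; ".dev0"-style suffixes are ignored.
    return release_tuple(installed) >= release_tuple(minimum)


print(meets_minimum("4.16.0.dev0", "4.16.0"))
print(meets_minimum("4.15.2", "4.16.0"))
```

In the notebook itself you would pass `transformers.__version__` as the first argument.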


You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/seq2seq).

# Fine-tuning a model on a summarization task

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models for a summarization task. We will use the [XSum dataset](https://arxiv.org/pdf/1808.08745.pdf) (for extreme summarization), which contains BBC articles accompanied by single-sentence summaries.

![Widget inference on a summarization task](images/summarization.png)

We will see how to easily load the dataset for this task using 🤗 Datasets and how to fine-tune a model on it using Keras.

In [2]:
model_checkpoint = "t5-small"

This notebook is built to run with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a sequence-to-sequence version in the Transformers library. Here we pick the [`t5-small`](https://huggingface.co/t5-small) checkpoint.

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [3]:
from datasets import load_dataset, load_metric

raw_datasets = load_dataset("xsum")
metric = load_metric("rouge")

Using custom data configuration default
Reusing dataset xsum (/home/matt/.cache/huggingface/datasets/xsum/default/1.2.0/32c23220eadddb1149b16ed2e9430a05293768cfffbdfd151058697d4c11f934)


  0%|          | 0/3 [00:00<?, ?it/s]

The `raw_datasets` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key each for the training, validation and test sets:

In [4]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})

To access an actual element, you need to select a split first, then give an index:

In [5]:
raw_datasets["train"][0]

 'summary': 'Clean-up operations are continuing across the Scottish Borders and Dumfries and Galloway after flooding caused by Storm Frank.',
 'id': '35232142'}
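Indexing a split with an integer returns a dictionary with one value per column, while indexing with a slice returns a dictionary with one list per column. The sketch below mimics that behaviour with a plain Python dictionary; `toy_split`, `get_row` and `get_rows` are made-up names for illustration only and are not part of 🤗 Datasets.

```python
# Toy stand-in for one split: a dict mapping column names to lists of values.
toy_split = {
    "document": ["first article text", "second article text", "third article text"],
    "summary": ["first summary", "second summary", "third summary"],
    "id": ["1", "2", "3"],
}


def get_row(split, index):
    """Integer index: one value per column."""
    return {column: values[index] for column, values in split.items()}


def get_rows(split, rows):
    """Slice: one list of values per column."""
    return {column: values[rows] for column, values in split.items()}


print(get_row(toy_split, 0))
print(get_rows(toy_split, slice(0, 2)))
```

This difference matters later on, because any function applied to several examples at once receives the second form.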

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [6]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML


def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(
        dataset
    ), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset) - 1)
        while pick in picks:
            pick = random.randint(0, len(dataset) - 1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
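The retry loop in the function above draws indices until it has enough distinct ones. As a possible alternative, `random.sample` from the standard library picks distinct indices in one call. This is only a sketch: `pick_unique_indices` and its `seed` argument are hypothetical additions, not part of the original function.

```python
import random


def pick_unique_indices(dataset_length, num_examples=5, seed=None):
    """Pick `num_examples` distinct indices in [0, dataset_length)."""
    if num_examples > dataset_length:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)  # a fixed seed makes the picks repeatable
    return rng.sample(range(dataset_length), num_examples)


picks = pick_unique_indices(204045, num_examples=5, seed=0)
print(picks)
```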

In [7]:
show_random_elements(raw_datasets["train"])

Unnamed: 0,document,summary,id
0,"He was asked to give ""a little clue as to [his] inclinations"" about the BBC's forthcoming charter renewal by shadow culture secretary Chris Bryant.\nThe minister responded quoting Mr Bryant's 2005 description of elements of the licence fee as ""regressive"".\n""You will have to await our conclusions,"" Mr Whittingdale added.\n""I would say I very much agree when you observe 'elements of the licence fee are regressive because everyone has to pay it, so it falls as a greater percentage of the income on the poorest people'.""\nThe quote was taken from a 2005 debate in which Mr Bryant went on to say the licence fee was ""a good principle because it enables everyone in the country, whether rich or poor, to watch the best programming.""\nMr Whittingdale has been a critic of the licence fee in the past, saying last year it was ""worse than a poll tax"" and ""unsustainable"" in the long term.\nSome commentators have suggested his appointment could pose a threat to the corporation, prompting Mr Bryant to ask: ""Is Auntie safe in your hands?""\nThe current royal charter, which determines the level of the licence fee and the other terms under which the corporation operates, runs out at the end of 2016.\nThe culture secretary agreed with Mr Bryant that its renewal involved ""a tight timetable"" but that he was ""hoping we will be able to renew the charter on time"".\nBefore taking over at the Department for Culture, Media and Sport (DCMS), Mr Whittingdale chaired the House of Commons Culture, Media and Sport Committee.\nLast year the committee said the licence fee was becoming ""harder to justify and sustain"" while conceding ""there appears to be no realistic alternative... in the short term"".",John Whittingdale has faced questions in the House of Commons about the future of the BBC for the first time since becoming culture secretary.,33007198
1,"The crowds mingle in the sunshine, kids queue for ice-cream, adults for fish and chips; young Asian women in saris or headscarves enjoy the fairground rides and shirtless men with dogs on chains watch performers on temporary stages.\nIt is a tradition that dates back to Oldham's days of industrial greatness. Before World War One, the town produced more cotton than France and Germany combined. By the 1960s and 70s, more than 300 cotton mills working day and night had pulled in immigrant workers from across the Asian sub-continent.\n""This was a brilliant place for a young immigrant in the 1970s,"" Riaz Ahmad told me. He came here from Pakistan in 1974 at the age of 21, and has been a Labour councillor for much of the last 25 years.\n""The town was full of noise and smoke from the chimneys. The mayoress's chain of office has 365 diamonds - one for each of the cotton mills that operated here.""\nWhen the world was buying cotton, Oldham thrived, a genuine beneficiary of a global market. But then the world started making cotton, and much more cheaply; Oldham's decline was as sudden as it was dramatic. The tide of globalisation had turned. This year Oldham was named, by the Office for National Statistics, the most deprived town in England.\nThe townscape is still dominated by those famous old mills. But many are derelict, their windows smashed, their roofs stripped of copper and lead, water leaking into what was their immense production halls.\nThroughout, Oldham remained a Labour fortress. Though Winston Churchill was once the town's Liberal MP, Labour have been winning elections here hands down for as long as anyone can remember. But in June, Oldham, like most of post-industrial England, voted decisively to leave the EU, in defiance of Labour's pro-Remain stance.\n""I'd always voted Labour,"" Marlene Nurse, a retired schoolteacher told me. ""But then we got a leaflet through the door from UKIP. 
We didn't know anything about that party so we went along to their public meeting. [My husband Ian and I] met Paul Nuttall [UKIP deputy leader] there. It was an eye-opener. He was talking common sense.\n""We'd never heard anything like it from the Labour Party or the Conservatives. I thought - this is what we want. From then on we were committed.""\nIn 2005, UKIP polled less than 3% in Oldham; in a parliamentary by-election this year in Oldham West and Royton, it polled 23% - propelling the party into second place.\nOldham has drawn a new wave of migrants since EU enlargement in 2004.\n""Lots of east Europeans have been sent here because the housing is cheap and the rents are low,"" Marlene Nurse told me. ""The media try to say UKIP supporters are racist, but it's not true. We are ordinary, decent, honourable people.\n""Nigel Farage gets a terrible time in the media, he's barely allowed to finish his sentence before the interviewers cut him off. But if you actually listen to what he says, he makes sense - common sense.""\n""I am not complacent about this threat,"" says Riaz Ahmad. ""We will not sleepwalk into that. We are working in the estates to hold onto our support. But UKIP are a one-agenda party, and that issue is now resolved [by the EU referendum].\n""So what has UKIP got to offer? They'll have to reinvent themselves, so we'll see what they come up with.""\nThe EU referendum result revealed a profound divide in British society. 
It is not the traditional divide, between Conservative and Labour, but one that cuts through both major UK parties.\nIt is as though the Britain that thrived under globalisation and open markets, the Britain that voted to stay in the EU, failed to notice that another Britain had been incubating for decades in the dereliction to which the once proud industrial heartlands have been reduced since the 1970s.\nAn entrenched hostility to the EU has been building in that second Britain in direct proportion to the decline of its industrial base.\nThe vote on 23 June was, in Oldham and towns like it, a revolt against globalisation and a revolt against open markets. But how much was it also a revolt by people who have always voted Labour?\nReal votes in real elections are still going Labour's way in Oldham. But the EU referendum has shifted something in the alignment of political loyalties in the old industrial heartlands.\nAnti-EU and anti-Westminster sentiment emerges from it emboldened. Labour still has the numbers; but UKIP has the energy, the momentum, and a renewed sense of its own legitimacy as the authentic voice of those left behind in a globalised economy,\n""If I were a Labour MP or councillor here in Oldham, should I be scared?"" I asked Marlene Nurse. ""Oh I hope so,"" she said, and laughed. ""I certainly hope so!""",At Oldham's annual end-of-summer carnival you see the town's 20th-Century history in flesh and blood.,37449630
2,"The Labour-led city council is looking to save a further £2.2m in its budget by 2020, after an £8m cut was sent out to consultation in July.\nLib Dem councillor Adam Williams said the move would lead to ""further cuts to services people rely on"".\nThe Labour group said it was making ""difficult decisions on the back of a very tough funding position"".\nThe fresh round of cuts will be considered by the authority's cabinet on 22 August.\nServices where savings are being sought include children's centres, provisions to homelessness organisations, bus routes and recycling centres.\nMr Williams said: ""These services are the things that the most vulnerable people in Hull rely on on a daily basis.\n""It's just the wrong thing to be doing.""\nLabour councillor Daren Hale, deputy leader of the council, said: ""We are continuing to have to make difficult decisions on the back of a very tough funding position.\n""We will continue to plan and budget as best we can to maintain vital services, but with the relentless cuts in government funding we cannot avoid the prospect of some reductions in service.""\nThe authority said it had lost about £104m and shed more than 2,000 jobs since 2011 due to cuts from central government.","A new round of council savings in Hull is an attack on the city's ""most vulnerable"", a councillor has said.",37063003
3,"Woodrow's powerfully struck low free-kick - Burton's first and only effort on goal - went through Forest's wall and beat goalkeeper Jordan Smith.\nJohn Brayford made a crucial second-half block to deny Forest's Matty Cash, who also wasted a late chance.\nThe victory was Burton's first in five games and put them 18th in the table.\nForest have now lost five in seven games, with their only win in the past month coming against automatic promotion hopefuls Brighton, and they are just two points above the relegation places, with Blackburn - in 22nd spot - claiming a point in a 2-2 draw at Norwich City.\nTheir first league visit to Burton - managed by former Forest striker Nigel Clough - failed to produce anything like the drama of the reverse fixture in August, when Forest overcame Albion 4-3 in what was the Brewers' debut in the second tier.\nLasse Vigen Christensen and Jackson Irvine both fired shots over the bar before on-loan Fulham forward Woodrow put the hosts ahead.\nForest's Zach Clough scuffed two chances wide for the visitors in the first half, while Brewers goalkeeper Jon McLaughlin foiled Ben Brereton at the near post.\nAnd while Cash had two openings after the break, Burton survived to move three points clear of the bottom three.\nBurton Albion boss Nigel Clough:\n""The first goal was crucial. Whoever got it the opposition were going to struggle to break them down. We got it and looked after it.\n""There were a few scares and they had a lot of possession but we saw it through. I'm disappointed that we didn't do enough on the break.\n""In the first half of the season we were conceding goals late on in games which is one of the reasons we are where we are in the table, but the determination and resilience at the moment to see it through and get that clean sheet is what impressed me.""\nNottingham Forest manager Gary Brazil:\n""Obviously it's not the result we wanted. 
There wasn't a great deal in the game in terms of penalty box action and clear cut chances.\n""We had to be a bit better in possession and in that top third of the pitch to give ourselves opportunities to score, and in fairness we didn't really do that.\n""We felt if we could get one we would go on and get a second and win the game.""\nMatch ends, Burton Albion 1, Nottingham Forest 0.\nSecond Half ends, Burton Albion 1, Nottingham Forest 0.\nCorner, Nottingham Forest. Conceded by Tom Flanagan.\nAttempt saved. Matthew Cash (Nottingham Forest) right footed shot from the centre of the box is saved in the centre of the goal. Assisted by David Vaughan.\nHildeberto Pereira (Nottingham Forest) wins a free kick on the right wing.\nFoul by Lasse Vigen Christensen (Burton Albion).\nHildeberto Pereira (Nottingham Forest) wins a free kick in the defensive half.\nFoul by Lasse Vigen Christensen (Burton Albion).\nZach Clough (Nottingham Forest) wins a free kick in the defensive half.\nFoul by Tom Naylor (Burton Albion).\nAttempt missed. Tom Naylor (Burton Albion) right footed shot from outside the box misses to the right. Assisted by Jackson Irvine.\nAttempt blocked. Lloyd Dyer (Burton Albion) left footed shot from the left side of the box is blocked. Assisted by Jackson Irvine.\nFoul by Apostolos Vellios (Nottingham Forest).\nKyle McFadzean (Burton Albion) wins a free kick in the defensive half.\nAttempt blocked. Apostolos Vellios (Nottingham Forest) header from the centre of the box is blocked. Assisted by Ben Osborn with a cross.\nAttempt missed. Jackson Irvine (Burton Albion) right footed shot from outside the box misses to the left.\nSubstitution, Burton Albion. Tom Naylor replaces Cauley Woodrow.\nAttempt blocked. Ben Osborn (Nottingham Forest) left footed shot from the centre of the box is blocked. Assisted by Matthew Cash.\nCorner, Nottingham Forest. Conceded by Tom Flanagan.\nSubstitution, Nottingham Forest. Apostolos Vellios replaces Ben Brereton.\nAttempt missed. 
Daniel Fox (Nottingham Forest) left footed shot from outside the box is close, but misses to the right. Assisted by Joe Worrall.\nAttempt saved. Ben Brereton (Nottingham Forest) right footed shot from a difficult angle on the right is saved in the centre of the goal. Assisted by Hildeberto Pereira.\nSubstitution, Nottingham Forest. Hildeberto Pereira replaces Eric Lichaj.\nCorner, Nottingham Forest. Conceded by John Brayford.\nAttempt blocked. Matthew Cash (Nottingham Forest) right footed shot from the centre of the box is blocked. Assisted by Ben Brereton.\nAttempt missed. David Vaughan (Nottingham Forest) left footed shot from outside the box misses to the right. Assisted by Ben Osborn.\nCorner, Nottingham Forest. Conceded by Tom Flanagan.\nSubstitution, Nottingham Forest. Matthew Cash replaces Michael Mancienne.\nDelay over. They are ready to continue.\nDelay in match Cauley Woodrow (Burton Albion) because of an injury.\nBen Brereton (Nottingham Forest) is shown the yellow card.\nFoul by Ben Brereton (Nottingham Forest).\nKyle McFadzean (Burton Albion) wins a free kick in the defensive half.\nBen Osborn (Nottingham Forest) wins a free kick in the defensive half.\nFoul by Luke Murphy (Burton Albion).\nCorner, Burton Albion. Conceded by Eric Lichaj.\nCorner, Nottingham Forest. Conceded by Tom Flanagan.\nAttempt blocked. Jackson Irvine (Burton Albion) right footed shot from the right side of the box is blocked. Assisted by Lasse Vigen Christensen.\nSecond Half begins Burton Albion 1, Nottingham Forest 0.\nSubstitution, Burton Albion. Lucas Akins replaces Luke Varney because of an injury.",Cauley Woodrow scored the only goal as Burton Albion moved above of Nottingham Forest in the Championship with their first competitive win over the Reds.,39162727
4,"William Bawn was found with almost 100 category A images - the most serious - when police seized the device in October.\nThe 21-year-old admitted six charges, including three counts of making indecent images of children.\nThe judge at Carlisle Crown Court sentenced him to a 20-month jail term, suspended for two years.\nHe was also ordered to complete a rehabilitation requirement and made subject to a 10-year sexual harm prevention order.\nBawn, of Main Street, Frizington in Cumbria, was prohibited from using any computer equipment not fitted with police monitoring software after a similar conviction in 2014.\nThe latest offence came to light when a woman complained he had borrowed her computer but refused to give it back.","A man banned from using computers borrowed a device to download 1,000 indecent images of children.",37052121


The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [8]:
metric

Metric(name: "rouge", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}, usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each predictions
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLSum"`: rougeLsum splits text using `"
"`.
        See details in https://github.com/huggingface/datasets/issues/617
    use_stemmer: Bool indicating whether Porter stemmer should be used to strip word suffixes.
    use_agregator: Return aggregates if this is set to True
Retu

You can call its `compute` method with your predictions and labels, which need to be lists of decoded strings:

In [9]:
fake_preds = ["hello there", "general kenobi"]
fake_labels = ["hello there", "general kenobi"]
metric.compute(predictions=fake_preds, references=fake_labels)

{'rouge1': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rouge2': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeL': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeLsum': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))}
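To get a feeling for what these numbers measure, here is a simplified, hand-rolled version of ROUGE-1 for a single prediction and reference pair. It counts overlapping unigrams after whitespace splitting and lowercasing. The helper `rouge1` is an illustration only: it does no stemming and no aggregation into low/mid/high scores, so its values are not guaranteed to match the metric loaded above.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Unigram-overlap precision, recall and F-measure for one pair of strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection: each token counts at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure


print(rouge1("hello there", "hello there"))
print(rouge1("the cat sat", "the cat sat down"))
```

In the second call the prediction contains only words from the reference (precision 1.0) but misses one of its four words (recall 0.75).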

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in a format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [10]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

By default, the call above will use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library.

You can directly call this tokenizer on one sentence or a pair of sentences:

In [11]:
tokenizer("Hello, this is a sentence!")

{'input_ids': [8774, 6, 48, 19, 3, 9, 7142, 55, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here (just know they are required by the model we will instantiate later). You can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html) if you're interested.
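As a rough mental model of the two keys shown above, the sketch below builds `input_ids` and `attention_mask` with a toy whitespace tokenizer. Everything in it is an assumption made for illustration: `toy_vocab` and `toy_tokenize` are invented here, and the real T5 tokenizer splits text into subword pieces rather than whole words, so the IDs will not match.

```python
# Hypothetical toy vocabulary; the real checkpoint uses a much larger subword vocabulary.
toy_vocab = {"<pad>": 0, "</s>": 1, "<unk>": 2, "hello": 3, "this": 4, "is": 5, "a": 6, "sentence": 7}


def toy_tokenize(text):
    """Map words to IDs, append an end-of-sequence ID and build the attention mask."""
    for punctuation in ",.!":
        text = text.replace(punctuation, " ")
    tokens = text.lower().split()
    input_ids = [toy_vocab.get(token, toy_vocab["<unk>"]) for token in tokens]
    input_ids.append(toy_vocab["</s>"])  # mark the end of the sequence
    # 1 means "real token, attend to it"; padding added later would get 0.
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


print(toy_tokenize("Hello, this is a sentence!"))
```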

Instead of one sentence, we can pass along a list of sentences:

In [12]:
tokenizer(["Hello, this is a sentence!", "This is another sentence."])

{'input_ids': [[8774, 6, 48, 19, 3, 9, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

To prepare the targets for our model, we need to tokenize them inside the `as_target_tokenizer` context manager. This will make sure the tokenizer uses the special tokens corresponding to the targets:

In [13]:
with tokenizer.as_target_tokenizer():
    print(tokenizer(["Hello, this is a sentence!", "This is another sentence."]))

{'input_ids': [[8774, 6, 48, 19, 3, 9, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}


If you are using one of the five T5 checkpoints, you have to prefix the inputs with "summarize:" (the model can also translate, and it needs the prefix to know which task it has to perform).

In [14]:
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`. This will ensure that an input longer than what the selected model can handle will be truncated to the maximum length accepted by the model. The padding will be dealt with later on (in a data collator) so we pad examples to the longest length in the batch and not the whole dataset.

In [15]:
max_input_length = 1024
max_target_length = 128


def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["summary"], max_length=max_target_length, truncation=True
        )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
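The effect of `max_length` together with `truncation=True` can be pictured with a few lines of plain Python. This is a sketch under one stated assumption: that the end-of-sequence ID (1 in the outputs above) is kept as the last item after truncation. `truncate_with_eos` is a hypothetical helper, not a tokenizer method.

```python
def truncate_with_eos(token_ids, max_length, eos_id=1):
    """Trim a sequence ending in `eos_id` so that it has at most `max_length` items."""
    if len(token_ids) <= max_length:
        return list(token_ids)  # short sequences are left untouched
    # Keep the first max_length - 1 tokens and re-append the end-of-sequence ID.
    return list(token_ids[: max_length - 1]) + [eos_id]


long_sequence = list(range(10, 20)) + [1]  # ten made-up token IDs followed by EOS
print(truncate_with_eos(long_sequence, max_length=5))
print(truncate_with_eos([8774, 6, 1], max_length=5))
```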

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [16]:
preprocess_function(raw_datasets["train"][:2])

{'input_ids': [[21603, 10, 37, 423, 583, 13, 1783, 16, 20126, 16496, 6, 80, 13, 8, 844, 6025, 4161, 6, 19, 341, 271, 14841, 5, 7057, 161, 19, 4912, 16, 1626, 5981, 11, 186, 7540, 16, 1276, 15, 2296, 7, 5718, 2367, 14621, 4161, 57, 4125, 387, 5, 15059, 7, 30, 8, 4653, 4939, 711, 747, 522, 17879, 788, 12, 1783, 44, 8, 15763, 6029, 1813, 9, 7472, 5, 1404, 1623, 11, 5699, 277, 130, 4161, 57, 18368, 16, 20126, 16496, 227, 8, 2473, 5895, 15, 147, 89, 22411, 139, 8, 1511, 5, 1485, 3271, 3, 21926, 9, 472, 19623, 5251, 8, 616, 12, 15614, 8, 1783, 5, 37, 13818, 10564, 15, 26, 3, 9, 3, 19513, 1481, 6, 18368, 186, 1328, 2605, 30, 7488, 1887, 3, 18, 8, 711, 2309, 9517, 89, 355, 5, 3966, 1954, 9233, 15, 6, 113, 293, 7, 8, 16548, 13363, 106, 14022, 84, 47, 14621, 4161, 6, 243, 255, 228, 59, 7828, 8, 1249, 18, 545, 11298, 1773, 728, 8, 8347, 1560, 5, 611, 6, 255, 243, 72, 1709, 1528, 161, 228, 43, 118, 4006, 91, 12, 766, 8, 3, 19513, 1481, 410, 59, 5124, 5, 96, 196, 17, 19, 1256, 68, 27, 103, 317, 132

To apply this function to all the examples in our dataset, we just use the `map` method of the `raw_datasets` object we created earlier. This will apply the function to all the elements of all the splits in `raw_datasets`, so our training, validation and testing data will be preprocessed in one single command.

In [17]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

Loading cached processed dataset at /home/matt/.cache/huggingface/datasets/xsum/default/1.2.0/32c23220eadddb1149b16ed2e9430a05293768cfffbdfd151058697d4c11f934/cache-d301f9d3e75faf9c.arrow
Loading cached processed dataset at /home/matt/.cache/huggingface/datasets/xsum/default/1.2.0/32c23220eadddb1149b16ed2e9430a05293768cfffbdfd151058697d4c11f934/cache-452e300efe52fecf.arrow
Loading cached processed dataset at /home/matt/.cache/huggingface/datasets/xsum/default/1.2.0/32c23220eadddb1149b16ed2e9430a05293768cfffbdfd151058697d4c11f934/cache-b77df12ee40e6489.arrow


Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and thus that the cached data should not be used). For instance, it will properly detect if you change the model checkpoint in the notebook and rerun it. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts in batches. This is to leverage the full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to process the texts in a batch concurrently.
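Conceptually, a batched `map` slices the columns into chunks, hands each chunk to the function as a dictionary of lists, and concatenates the results. The sketch below imitates that with plain Python. `batched_map` and `count_words` are hypothetical helpers, and unlike the real method this toy version returns only the new columns.

```python
def batched_map(columns, function, batch_size=1000):
    """Apply `function` to chunks of rows, each passed as a dict of lists."""
    num_rows = len(next(iter(columns.values())))
    output = {}
    for start in range(0, num_rows, batch_size):
        batch = {name: values[start : start + batch_size] for name, values in columns.items()}
        result = function(batch)
        # Concatenate the per-batch results column by column.
        for name, values in result.items():
            output.setdefault(name, []).extend(values)
    return output


def count_words(batch):
    # Receives several documents at once and returns one value per document.
    return {"num_words": [len(doc.split()) for doc in batch["document"]]}


toy_columns = {"document": ["a b c", "d e", "f", "g h i j", "k"]}
print(batched_map(toy_columns, count_words, batch_size=2))
```

This is why `preprocess_function` is written to accept lists: with `batched=True` it is called once per chunk rather than once per example.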

## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is sequence-to-sequence (both the input and output are text sequences), we use the `TFAutoModelForSeq2SeqLM` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us.

In [18]:
from transformers import TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

2022-01-27 18:19:04.963271: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-27 18:19:04.969452: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-27 18:19:04.970133: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-27 18:19:04.971210: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags

Note that we don't get a warning like in our classification example. This means we used all the weights of the pretrained model and there is no randomly initialized head in this case.

Next we set some parameters like the learning rate and the `batch_size` and customize the weight decay.

The last two variables set everything up so we can push the model to the [Hub](https://huggingface.co/models) at the end of training. Remove both of them if you didn't follow the installation steps at the top of the notebook; otherwise you can change the value of `push_to_hub_model_id` to something you would prefer.

In [19]:
batch_size = 8
learning_rate = 2e-5
weight_decay = 0.01
num_train_epochs = 1

model_name = model_checkpoint.split("/")[-1]
push_to_hub_model_id = f"{model_name}-finetuned-xsum"
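For a sense of scale, the short calculation below works out how many full batches one pass over the training split contains with this batch size, and shows how the Hub model ID is derived from a checkpoint name. It is a sketch: `hub_model_id` is a hypothetical helper that repeats the string logic of the cell above, and `some-org/some-model` is a made-up checkpoint name. Whether a final partial batch is used depends on how the dataset is batched.

```python
num_train_rows = 204045  # size of the train split shown earlier
batch_size = 8

# Number of complete batches per epoch and the rows left over for a partial batch.
full_batches, leftover_rows = divmod(num_train_rows, batch_size)
print(full_batches, leftover_rows)


def hub_model_id(checkpoint, suffix="finetuned-xsum"):
    """Drop any 'organisation/' prefix from the checkpoint name and append a suffix."""
    return f"{checkpoint.split('/')[-1]}-{suffix}"


print(hub_model_id("t5-small"))
print(hub_model_id("some-org/some-model"))  # hypothetical namespaced checkpoint
```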

Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels. Note that our data collators are multi-framework, so make sure you set `return_tensors='tf'` so you get `tf.Tensor` objects back and not something else!

In [20]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")
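To make the padding behaviour concrete, here is a plain-Python sketch of what happens to one batch. It assumes right-padding, a pad token ID of 0, and -100 as the label padding value (the default `label_pad_token_id` of `DataCollatorForSeq2Seq`, which the loss ignores). `pad_batch` is a hypothetical helper and omits any extra inputs the real collator may prepare for the model.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad inputs and labels to the longest sequence in this batch only."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        input_gap = max_input - len(f["input_ids"])
        label_gap = max_label - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * input_gap)
        batch["attention_mask"].append(f["attention_mask"] + [0] * input_gap)  # 0 = ignore
        batch["labels"].append(f["labels"] + [label_pad_token_id] * label_gap)
    return batch


toy_features = [
    {"input_ids": [8774, 6, 1], "attention_mask": [1, 1, 1], "labels": [100, 1]},
    {"input_ids": [100, 1], "attention_mask": [1, 1], "labels": [8774, 6, 48, 1]},
]
print(pad_batch(toy_features))
```

Because the target lengths are computed per batch, a batch of short examples stays short instead of being padded to the longest example in the dataset.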

In [21]:
tokenized_datasets["train"]

Dataset({
    features: ['attention_mask', 'document', 'id', 'input_ids', 'labels', 'summary'],
    num_rows: 204045
})

Now we convert our input datasets to TF datasets using this collator. There's a built-in method for this: `to_tf_dataset()`. Make sure to specify the collator we just created as our `collate_fn`!

Computing the `ROUGE` metric can be slow because it requires the model to generate outputs token-by-token. To speed things up, we make a `generation_dataset` that contains only 200 examples from the validation dataset, and use this for `ROUGE` computations.

In [22]:
train_dataset = tokenized_datasets["train"].to_tf_dataset(
    batch_size=batch_size,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=True,
    collate_fn=data_collator,
)
validation_dataset = tokenized_datasets["validation"].to_tf_dataset(
    batch_size=8,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=False,
    collate_fn=data_collator,
)
generation_dataset = (
    tokenized_datasets["validation"]
    .shuffle()
    .select(list(range(200)))
    .to_tf_dataset(
        batch_size=batch_size,
        columns=["input_ids", "attention_mask", "labels"],
        shuffle=False,
        collate_fn=data_collator,
    )
)

Loading cached shuffled indices for dataset at /home/matt/.cache/huggingface/datasets/xsum/default/1.2.0/32c23220eadddb1149b16ed2e9430a05293768cfffbdfd151058697d4c11f934/cache-2574e81e35e5979b.arrow


Now we initialize our loss and optimizer and compile the model. Note that most Transformers models compute loss internally - we can train on this as our loss value simply by not specifying a loss when we `compile()`.

In [23]:
from transformers import AdamWeightDecay
import tensorflow as tf

optimizer = AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay)
model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! Please ensure your labels are passed as keys in the input dict so that they are accessible to the model during the forward pass. To disable this behaviour, please pass a loss argument, or explicitly pass loss=None if you do not want your model to compute a loss.
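
To give a feel for the kind of loss the model computes internally, here is a simplified pure-Python sketch of a token-level cross-entropy that skips padded label positions. It is an illustration under stated assumptions, not the library's code: the function name `masked_cross_entropy` is hypothetical, and we assume padded labels are marked with `-100` and excluded from the average.

```python
import math


def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over positions whose label is not `ignore_index`."""
    total, count = 0.0, 0
    for token_logits, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padded label position: contributes nothing to the loss
        # Numerically stable log-sum-exp over the vocabulary dimension.
        max_logit = max(token_logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in token_logits)
        )
        total += log_sum_exp - token_logits[label]
        count += 1
    return total / count if count else 0.0


# Three positions over a toy vocabulary of size 3; the last label is padding.
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 5.0]]
labels = [0, 1, -100]
loss = masked_cross_entropy(logits, labels)
print(loss)
```

Because the last position is ignored, changing its logits leaves the loss unchanged.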


Now we can train our model. We can also add a few optional callbacks here, which you can remove if they aren't useful to you. In no particular order, these are:
- `PushToHubCallback` will sync up our model with the Hub - this allows us to resume training from other machines, share the model after training is finished, and even test the model's inference quality midway through training!
- `TensorBoard` is a built-in Keras callback that logs TensorBoard metrics.
- `KerasMetricCallback` is a callback for computing advanced metrics. Many common NLP metrics, like ROUGE, are hard to fit into a compiled training loop because they depend on decoding predictions and labels back to strings with the tokenizer and on calling arbitrary Python functions to compute the metric. `KerasMetricCallback` wraps a metric function and outputs metrics as training progresses.

If this is the first time you've seen `KerasMetricCallback`, it's worth explaining what exactly is going on here. The callback takes two main arguments - a `metric_fn` and an `eval_dataset`. It then iterates over the `eval_dataset` and collects the model's outputs for each sample, before passing the `list` of predictions and the associated `list` of labels to the user-defined `metric_fn`. If the `predict_with_generate` argument is `True`, then it will call `model.generate()` for each input sample instead of `model.predict()` - this is useful for metrics that expect generated text from the model, like `ROUGE`.

This callback allows complex metrics to be computed each epoch that would not function as a standard Keras Metric. Metric values are printed each epoch, and can be used by other callbacks like `TensorBoard` or `EarlyStopping`.
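
The flow described above can be sketched in a few lines of plain Python. This is only a rough illustration of the collect-then-score pattern, not the callback's actual code: `run_metric_callback`, `toy_generate` and `toy_metric_fn` are hypothetical names, and the "model" is a stand-in that echoes part of its input.

```python
def run_metric_callback(generate_fn, eval_batches, metric_fn):
    """Collect predictions and labels over all batches, then score them once."""
    all_predictions, all_labels = [], []
    for batch in eval_batches:
        # With predict_with_generate=True the real callback calls model.generate();
        # generate_fn is a stand-in for that call.
        all_predictions.extend(generate_fn(batch["input_ids"]))
        all_labels.extend(batch["labels"])
    # The user-defined function receives a (predictions, labels) pair.
    return metric_fn((all_predictions, all_labels))


def toy_generate(input_ids):
    # Hypothetical "model": echoes the first two tokens of each input.
    return [ids[:2] for ids in input_ids]


def toy_metric_fn(eval_predictions):
    predictions, labels = eval_predictions
    exact = sum(p == l for p, l in zip(predictions, labels))
    return {"exact_match": exact / len(labels)}


eval_batches = [
    {"input_ids": [[1, 2, 3], [4, 5, 6]], "labels": [[1, 2], [4, 9]]},
    {"input_ids": [[7, 8, 9]], "labels": [[7, 8]]},
]
logs = run_metric_callback(toy_generate, eval_batches, toy_metric_fn)
print(logs)
```

The returned dictionary plays the role of the metric values that get reported each epoch.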

In [24]:
import numpy as np
import nltk


def metric_fn(eval_predictions):
    predictions, labels = eval_predictions
    decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    for label in labels:
        label[label < 0] = tokenizer.pad_token_id  # Replace masked label tokens
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Rouge expects a newline after each sentence
    decoded_predictions = [
        "\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_predictions
    ]
    decoded_labels = [
        "\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels
    ]
    result = metric.compute(
        predictions=decoded_predictions, references=decoded_labels, use_stemmer=True
    )
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    # Add mean generated length
    prediction_lens = [
        np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions
    ]
    result["gen_len"] = np.mean(prediction_lens)

    return result
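
For intuition about the scores returned by `metric_fn`, here is a deliberately simplified sketch of a ROUGE-1 F-measure based on unigram overlap. It is not the `rouge-score` implementation (there is no stemming, the tokenization is plain whitespace splitting, and there is no aggregation across examples), and the function name `rouge1_fmeasure` is hypothetical.

```python
from collections import Counter


def rouge1_fmeasure(prediction: str, reference: str) -> float:
    """Unigram-overlap F-measure between two strings (simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps, for each token, the smaller of the two counts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_fmeasure("the cat sat on the mat", "the cat lay on the mat")
print(score * 100)  # scaled to 0-100, as metric_fn does
```

Five of the six tokens overlap here, so precision and recall are both 5/6 and so is the F-measure.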

In [25]:
from transformers.keras_callbacks import PushToHubCallback, KerasMetricCallback
from tensorflow.keras.callbacks import TensorBoard

tensorboard_callback = TensorBoard(log_dir="./summarization_model_save/logs")

push_to_hub_callback = PushToHubCallback(
    output_dir="./summarization_model_save",
    tokenizer=tokenizer,
    hub_model_id=push_to_hub_model_id,
)

metric_callback = KerasMetricCallback(
    metric_fn, eval_dataset=generation_dataset, predict_with_generate=True
)

callbacks = [metric_callback, tensorboard_callback, push_to_hub_callback]

model.fit(
    train_dataset,
    validation_data=validation_dataset,
    epochs=num_train_epochs,
    callbacks=callbacks,
)

/home/matt/PycharmProjects/notebooks/examples/summarization_model_save is already a clone of https://huggingface.co/Rocketknight1/t5-small-finetuned-xsum. Make sure you pull the latest changes with `repo.git_pull()`.

<keras.callbacks.History at 0x7f378829ba00>

If you used the callback above, you can now share this model with your friends, family and favorite pets: they can all load it with the identifier `"your-username/the-name-you-picked"`, for instance:

```python
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained("your-username/my-awesome-model")
```