# Setup

Install the packages this Colab notebook needs from PyPI, including the Hugging Face `transformers`, `datasets` and `huggingface_hub` libraries.

In [None]:
!pip install folium==0.2.1
!pip install transformers datasets
!pip install rouge-score nltk
!pip install huggingface_hub

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting folium==0.2.1
  Downloading folium-0.2.1.tar.gz (69 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70.0/70.0 kB 5.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: folium
  Building wheel for folium (setup.py) ... done
  Created wheel for folium: filename=folium-0.2.1-py3-none-any.whl size=79794 sha256=f65920bdb575d80cd7751bbe5a3ecb4feefc553c1151c32591a4a9820fb8336e
  Stored in directory: /root/.cache/pip/wheels/00/0c/07/d7792a5444d5bb074361ac27da53cee9d5cce59a07fe9da5dd
Successfully built folium
Installing collected packages: folium
  Attempting uninstall: folium
    Found existing installation: folium 0.14.0
    Uninstalling folium-0.14.0:
      Successfully uninstalled folium-0.14.0
ERROR: pip's dependency resolver does not currently take into accou

Import the Hugging Face Hub login helper, which displays a login widget inside the notebook. It needs the personal access token below:

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

We also run `!git config` to set the credential helper, so that we can save our trained models to the Hugging Face Hub for later use.



In [None]:
from huggingface_hub import notebook_login
!git config --global credential.helper store
notebook_login()



The `git config` call above configures `git` in this **Colab** instance. It only takes effect when we push a model to the Hub.

Check the installed Transformers version (it needs to be >=4.16.0 for best functionality):

In [None]:
import transformers

print(transformers.__version__)

4.29.2
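The same check can be done in code rather than by eye. The sketch below uses only the standard library; `version_tuple` and `meets_minimum` are hypothetical helper names, not part of Transformers. It compares numeric components so that, for example, `4.9.0` is correctly treated as older than `4.16.0`.

```python
def version_tuple(version: str) -> tuple:
    """Turn a release string such as '4.29.2' into (4, 29, 2)."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as 'rc1' or 'dev0'
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "4.16.0") -> bool:
    """Return True when `installed` is at least `minimum`."""
    return version_tuple(installed) >= version_tuple(minimum)


# In the notebook, pass transformers.__version__ instead of a literal string.
print(meets_minimum("4.29.2"))
```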


# Fine-tuning a model on a summarization task

We start from a pre-trained model from the [Model Hub](https://huggingface.co/models) and fine-tune it on our data. Any transformer model that has a seq2seq version in the Transformers library can be used.

Below we choose [Google's T5](https://huggingface.co/t5-small):

In [None]:
model_checkpoint = "t5-small"

## Loading the dataset

We will use the [Hugging Face Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.

*   **ROUGE**, or **Recall-Oriented Understudy for Gisting Evaluation**, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation.

*   The **Extreme Summarization** (**XSum**) dataset is a dataset for evaluating abstractive single-document summarization systems. The goal is to create a short, one-sentence new summary answering the question “What is the article about?”. The dataset consists of 226,711 news articles, each accompanied by a one-sentence summary. The articles are collected from the BBC (2010 to 2017) and cover a wide variety of domains (e.g., News, Politics, Sports, Weather, Business, Technology, Science, Health, Family, Education, Entertainment and Arts). The official random split contains 204,045 (90%), 11,332 (5%) and 11,334 (5%) documents in the training, validation and test sets, respectively.
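To give a feel for what ROUGE measures, here is a minimal sketch of a ROUGE-1 style score (unigram overlap) in plain Python. It is a simplification under stated assumptions: it splits on whitespace and lowercases, whereas the `rouge-score` package applies its own tokenization and optional stemming, so the numbers will not match the library exactly. The function name `rouge1` is made up for this illustration.

```python
from collections import Counter


def rouge1(prediction: str, reference: str) -> dict:
    """Unigram-overlap precision, recall and F-measure on whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    fmeasure = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


# Five of the six tokens match, so precision and recall are both 5/6.
scores = rouge1("the cat sat on the mat", "the cat lay on the mat")
print(scores)
```

Recall asks how much of the reference summary is covered by the prediction, precision asks how much of the prediction is supported by the reference, and the F-measure balances the two.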


In [None]:
from datasets import load_dataset, load_metric

raw_datasets = load_dataset("xsum")
metric = load_metric("rouge")

Downloading and preparing dataset xsum/default to /root/.cache/huggingface/datasets/xsum/default/1.2.0/082863bf4754ee058a5b6f6525d0cb2b18eadb62c7b370b095d1364050a52b71...


Dataset xsum downloaded and prepared to /root/.cache/huggingface/datasets/xsum/default/1.2.0/082863bf4754ee058a5b6f6525d0cb2b18eadb62c7b370b095d1364050a52b71. Subsequent calls will reuse this data.



The `raw_datasets` object is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key each for the training, validation and test sets:

In [None]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})

To access an actual element, we need to select a split first, then give an index:

In [None]:
raw_datasets["train"][0]

 'summary': 'Clean-up operations are continuing across the Scottish Borders and Dumfries and Galloway after flooding caused by Storm Frank.',
 'id': '35232142'}

To get a sense of what the data looks like, the following function shows a few examples picked at random from the dataset.

In [None]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML


def show_random_elements(dataset, num_examples=5):
    """Display `num_examples` distinct, randomly chosen rows of `dataset` as an HTML table."""
    assert num_examples <= len(
        dataset
    ), "Can't pick more elements than there are in the dataset."
    # Sample distinct row indices without replacement.
    picks = random.sample(range(len(dataset)), num_examples)

    df = pd.DataFrame(dataset[picks])
    # Show class labels by name rather than by integer ID.
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(raw_datasets["train"])

Unnamed: 0,document,summary,id
0,"The investment comes as plans were announced to reinvent the corporation ""for a new generation"" and combat competition from media giants like Netflix and Amazon.\nDirector general Tony Hall said it was ""the biggest investment in children's services in a generation"".\nThe funding was unveiled as part of the BBC's first Annual Plan.\nSetting out the BBC's ambitions for the coming year, the extra money for children's content is going to be invested across the three years to 2019-20.\nLord Hall said: ""Our ambition to reinvent the BBC for a new generation is our biggest priority for next year. Every part of the BBC will need to contribute to meeting this challenge.""\nThe new investment, delivered following savings made across the BBC, will see the budget for children's programming reach £124.4m by 2019-20, up from the current figure of £110m.\nIn the three years, £31.4m will be spent online on content that will include video, live online programme extensions, blogs, vlogs, podcasts, quizzes, guides, games and apps.\nLord Hall said it was ""the biggest investment for a generation"" and will ""increasingly offer a personalised online offering for our younger viewers"".\nThe BBC said it wants to respond to changes to the way children ""are watching and consuming programmes"", adding: ""Investment in British content - particularly for the young - is vital, unless we want more of our culture shaped and defined by the rise of West Coast American companies.""\nBy David Sillito, media and arts correspondent\nOver the last six years, children's TV viewing has dropped by more than a quarter.\nYoungsters now spend more time online than they do in front of the television, around 15 hours a week. 
Even pre-schoolers spend more than eight hours a week online, according to Ofcom.\nNaturally then, the CBBC channel aimed at six to 12-year-olds has seen a drop in its audience, and increasingly children are choosing to use the BBC's iPlayer.\nViewing habits are changing, but so too is the content they are watching. Shorter video clips, interactive content and games are all going to increase.\nThe setting for all of this is a long-term decline in spending on British children's programmes by other broadcasters - ITV's programming went from 424 hours in 1998 to 64 in 2013 - and the dominance of US programming.\nThis will only increase in an online world dominated by the tech giants. Children's culture is being shaped by firms based on the west coast of America.\nThe annual plan also explains how the BBC is aiming to tackle such challenges as ""fake news"" with BBC News's Reality Check being expanded to fact-check social media claims, and work being done alongside Facebook to build trust.\nIt also shows how the corporation will ""rise to the challenge of better reflecting and representing a changing UK"" and how it is focusing on personalisation.\nThe BBC's creative plans for the next 12 months also include:\nThe annual plan is not the same as the BBC's annual report, which looks back over the previous year's performance and publishes details about the corporation's finances and spending. That report is expected later this month.\nFollow us on Facebook, on Twitter @BBCNewsEnts, or on Instagram at bbcnewsents. If you have a story suggestion email entertainment.news@bbc.co.uk.",The BBC is to spend an extra £34m on children's content over the next three years.,40489812
1,"But beneath the surface, Westminster life seethes with intrigue.\nOn the government side, the position of the prime minister and the ambitions of would-be successors are the subject of endless speculation, which means that all kinds of occasions can morph into opportunities for possible contenders to, ever so subtly, strut their stuff.\nWith the elections for select committee chairs due on Wednesday, there are plenty of MPs pursuing influential perches on the committee corridor.\nThe biggest prize is probably the chair of the Treasury Committee - vacated by Andrew Tyrie, who stood down at the last election; but the chairs of the Education and Business, Energy and Industrial Strategy Committees are also open.\nAnd, as some old-stagers shake their heads, two incumbent chairs, Julian Lewis in Defence, and Crispin Blunt in Foreign Affairs, are being challenged.\nLong-serving Foreign Affairs Committee member John Baron is seeking the chair of that committee, as is Tom Tugendhat from the 2015 intake. 
Another new-ish MP, Johnny Mercer is seeking the chair of Defence - will the torch be seized by a younger generation?\nThe key to understanding these races is that, where several MPs from one party are contesting a post, the winner is the one with the most appeal to the other parties - so Labour MPs, in effect, will pick the winner in hotly contested elections like the one for the Treasury Committee chair.\nElsewhere the All-Party Parliamentary Groups are springing into action...Margaret Hodge's APPG on tax, which is zeroing-in on a particularly sensitive political issue, is taking shape, and the APPG on cancer is launching its second inquiry into NHS England's Cancer Strategy, which will feed into a 'Britain Against Cancer' conference in Westminster on 5 December.\nHere's my rundown of the week ahead:\nThe Commons week begins (2.30pm) with Defence Questions - providing a final chance for the contenders for the post of Defence Committee chair, the incumbent, Julian Lewis, and his challenger, Johnny Mercer, with a final audition for the job.\nThe day's legislating is on the second reading of the Telecommunications Infrastructure (Relief from Non-Domestic Rates) Bill - which scraps business rates on new ultrafast broadband lines as part of the government's drive to speed up Britain's move to ultrafast broadband and 5G mobile coverage.\nOlder telecoms infrastructure will still be taxed as the government seeks to encourage a switch to ""gold standard"" fibre optics rather than upgrades to get better performance from old-style copper telephone lines.\nIn the Lords (2.30 pm) questions to ministers include one on setting up a Data Ethics Commission from the former CEO of TalkTalk, the Conservative, Baroness Harding of Winscombe. 
The day's main debate is on the current security situation in the UK - with 20 peers listed to speak at the time of writing.\nThe Commons meets at 11.30am for Foreign Office Questions (watch out for a bit of competition between Messrs Blunt, Baron and Tugendhat) with any statements or urgent questions following at 12.30pm.\nSlightly to everyone's surprise, the Commons will then deal with the detail of the Air Travel Organisers' Licensing Bill - given its second reading this week - in a committee of the whole House, with report stage and third reading following immediately after.\nThere has already been some Opposition chuntering that the government should have set up a bill committee, and some speculation about why they have not. Some accuse ministers of trying to bog down the House in legislative chores and general debates, because they dare not put controversial legislation before MPs. Expect these complaints to intensify, if this continues.\nOne piece of controversial legislation is expected to appear - but only to be presented, not debated. The Great Repeal Bill, as it will not be called, because Commons rules prohibit argumentative titles for legislation, will be formally published this week - with rumour pointing for Tuesday as the day for the formalities to be enacted. The actual second reading debate will not be until September, at the earliest.\nThe adjournment debate, led by the Conservative, Sir Paul Beresford, is on Section 136 of the Mental Health Act 1983 - an emergency power which allows the police to remove an apparently mentally disordered person from a public place to a place of safety, for up to 72 hours.\nIn Westminster Hall, the former Conservative chief whip Mark Harper - now on the backbenches - leads a debate (9.30am-11am) on balancing the public finances. 
There has been a lot of internal criticism that the Conservatives have failed to make the intellectual case for austerity, in the face of Jeremy Corbyn's attacks on spending cuts.\nThe newly elected Lib Dem Layla Moran discusses the role of children's centres in tackling social inequality (11am-11.30 am) and in the afternoon, Labour's Lucy Powell has a longer debate (2.30pm- 4pm) on government policies on social mobility - her aim is to give MPs a chance to debate the recent Social Mobility Commission report, Time for change: an assessment of government policies on social mobility 1997-2017, with ministers.\nShe has been carving out a niche as a social mobility campaigner since she left the Labour frontbench, and will use the debate to renew her calls (made with Nicky Morgan and Nick Clegg in the last Parliament) for more cross-party joint working on what works for social mobility. With the sensitive subject of grammar schools likely to feature, it may also see a bit of last-minute campaigning from candidates for the chair of the Education Committee.\nThe SNP's Dr Lisa Cameron has a debate (4.30pm-5.30pm) on consultation with disabled people on the effect on their services of the UK leaving the EU.\nIn the Lords (2.30pm) the usual half hour of questions to ministers is followed by an order to extend non-jury trials in Northern Ireland.\nThen peers will turn to the second reading of the Armed Forces (Flexible Working) Bill - which would allow for part time and flexible working arrangements for armed forces personnel: the idea is that the conditions of service must offer recruits a career that better reflects the realities of modern life - it is hoped that allowing greater flexibility over how long and where people work will help attract and keep the talent the forces need.\nThere will also be a short debate on the diplomatic crisis in the Gulf region and the steps being taken to de-escalate tensions and encourage Qatar to engage with its neighbours about their 
concerns about extremism.\nThe Commons begins (11.30am) with International Development Questions, followed at noon by Prime Minister's Questions.\nThese occasions may not have much cut-through with the public most of the time, but are increasingly critical for Conservative Party morale.\nThe rest of the day is devoted to a debate on the Grenfell Tower fire inquiry. One strand of this is the increasing pressure from senior MPs for some kind of parliamentary role in drawing up the terms of reference of major inquiries, and appointing the chair.\nThe PM is the 'sponsoring minister' for the inquiry and as a result the department handling arrangements is the Cabinet Office.\nIn Westminster Hall, the first debate (9.30am-11am) is on negotiations on the UK's future Euratom membership - the Labour MP Albert Owen (one of the contenders for the chair of the business and energy select committee) will warn of problems ahead if the UK leaves the EU atomic energy agency without making transitional arrangements to allow for the purchase of nuclear materials and technology from member nations, and maintain monitoring arrangements for British nuclear facilities.\nAn amendment highlighting the possible problems from leaving Euratom as part of the Brexit process did gain some traction during the debates on the bill to invoke Article 50, earlier this year, and Mr Owen anticipates some cross-party support and pressure on the government, during this debate.\nIn the afternoon, the Telford MP, Lucy Allan, leads a debate (2.30pm-4pm) on the challenges facing new towns.\nBut it is the debate on abuse and intimidation of candidates and the public in UK elections (4.30pm-5.30pm) that promises to be the most rancorous of the week. Expect much finger-pointing, with a number of MPs seeking payback for some bruising experiences during the campaign. 
The debate will be led by the Conservative backbencher, Simon Hart, who says many candidates faced ""numerous acts of vandalism, abuse, intimidation and general thuggishness - especially online"".\nHe wants to use the debate to raise the wider question of the impact of all this on recruitment and retention of candidates, public attitudes to voicing support for individuals, reporting and ultimately electoral outcomes. Intriguingly he also wants to pose the question: ""what is acceptable online activity from opponents in public office?""\nIn the Lords (3pm) there's an interesting looking question to the Leader of the House from Labour's Lord Soley on proposals ""to create a close and constructive relationship between the House of Lords and the European Parliament"".\nThen peers come to the final frontier, the second reading of the Space Industry Bill - it is less exciting than it sounds; the bill is concerned with providing a legal framework for a projected British spaceport.\nSpace technology is one of the UK's unsung industries, and the government is keen to expand it, by licensing a British spaceport ahead of rival countries like Portugal, which wants a launch facility in the Azores.\nThere are a number of rival sites in Cornwall and Scotland which might become the home for a launch facility, although the vehicles concerned will be aircraft lifting space vehicles to high altitude rather than Cape Canaveral style rockets. 
Before the election, the Commons Science and Technology Committee had run an inquiry into a draft version of the bill, and now peers will boldly legislate where no peer has legislated before.\nThat is followed by a short debate on risks to NHS sustainability arising from the United Kingdom's departure from the European Union - led by the former health minister, Lord Warner, who resigned the Labour whip in 2015 and now sits as a non-affiliated peer.\nOf course, in the Lords, their select committees are up and running - and their EU Committee meets (at 4pm) with the Secretary of State for Exiting the EU, David Davis, to discuss Brexit, the first round of negotiations, engagement with Parliament, citizens' rights and devolution.\nThe Commons opens (9.30 am) with Transport Questions, which could see the debut of whoever is elected as the new chair of the Transport Select Committee. And then comes the weekly Business Statement from the Leader of the House, Andrea Leadsom.\nThe main debate is a Commemoration of Passchendaele, the third battle of Ypres, one of the bloodiest of the First World War. The figures are disputed but there were an estimated 240,000 British, 8,525 French and 260,000 German casualties. 
One MP to watch will be the Conservative military historian Keith Simpson, who intends to look at the blunders by the generals who planned the allied offensive.\nExpect a number of MPs to speak about the sacrifices made by their ancestors, and to point to the losses sustained by New Zealand soldiers.\nThe adjournment debate, led by the Conservative, Alec Shelbrooke is on the prosecution of driving offences committed on private land - it follows from the death of nine-year-old Harry Whitlam killed by a drunk driver in a farmyard, in 2015.\nBecause this happened on private land the driver could only be prosecuted under health and safety legislation - and Mr Shelbrooke will be calling for a change in the law.\nIn the Lords (11 am) the Plaid Cymru peer Lord Wigley has a question on the sensitive issue of changes to the Barnett formula for Wales and Scotland arising from additional financial provision for Northern Ireland which follows the government's deal with the Northern Ireland Democratic Unionists.\nA series of Labour backbench debates on public service issues follows, with Baroness Andrews raising the impact of deregulation on public services, health and safety; Lord Haskel the cap on public sector pay and Lord Kennedy of Southwark on local government finance and arrangements beyond 2020.\nThere's also a short debate on security challenges and related human rights violations on the Korean peninsula led by Lord Alton of Liverpool.\nNeither House sits on Friday.","Parliament continues to mark time, with business in both Houses dominated by general debates and uncontroversial legislation.",40535201
2,"Media playback is unsupported on your device\n30 January 2015 Last updated at 18:01 GMT\nIn this week's episode of What's Up Africa, satirist Ikenna Azuike asks, ""Whose side are the Kenyan police on?""\nWatch Focus on Africa on BBC World News & partner stations across Africa every Friday from 17:30 GMT.\nWhat's Up Africa is a BBC and RNW co-production.","When police in Kenya used tear gas on primary schoolchildren protesting over a land grab of their playground, there was outrage across the country.",31063515
3,"The seven-year deal gives Adidas exclusive rights to produce kits for players along with licensed fan gear.\nIt will also mean Adidas provides uniforms for the 2016 World Cup of Hockey.\nThe German company takes over the rights from its subsidiary Reebok. The move is part of a company-wide effort to refocus the Adidas brands.\nAdidas North America President, Mark King said: ""For us [the deal] is a way to put these brands in sync.""\nThe strategy, according to Mr King, would allow Reebok to focus more on the fitness consumer, while Adidas focused on sports and their fans.\nAdidas faces competition in North America from brands like Under Armour and Nike. Nike took over from Adidas as the official supplier of kit for the National Basketball Association (NBA) after Adidas and the NBA failed to reach a renewal agreement earlier this year.\nSponsorship of the NHL will give Adidas access to a sports league with a following across the US and Canada.\nIn 2010 Reebok moved from focusing on team sport to fitness. The brand has deals with companies like Cross-Fit and Spartan Races.\nReebok's contract to supply the NHL continues for two more years, but the league said it would consider moving Adidas' agreement forward to the 2016/2017 season.\nThe cost of the deal was not disclosed by Adidas or the NHL.\nThey also ruled out the possibility of placing adverts on team jerseys, but may still consider the option for the World Cup.\n""The World Cup and other international events give us, among other things, an opportunity for experimentation,"" said executive director of the NHL Players Association Don Fehr.",The US National Hockey League has named Adidas as its sponsor to provide kits for the league's teams.,34265091
4,"It was the 32nd ATP title of the Scot's career as he prepares for the French Open at the end of the month.\nIn a rain-interrupted match, the top seed needed just over three hours to beat the world number 24 Kohlschreiber.\n""It was a really tough match, he served very close to the line and I was getting frustrated,"" said Murray, 27.\nHe became the first Briton to win an ATP-ranked event on clay since Buster Mottram in Palma in April 1976.\nMurray, playing in his first tournament since marrying long-term girlfriend Kim Sears last month, added: ""I didn't realise I was the first Brit to win on clay for so long, so that's obviously an honour.""\nHe and Kohlschreiber could meet again in the second round of this week's Madrid Masters.\nThe final resumed on Monday after heavy rain stopped play on Sunday evening with Kohlschreiber leading 3-2 in the first set.\nBoth players confidently held serve to take the opener into a tie-break, and it was world number three Murray, watched by new coach Jonas Bjorkman, who got the mini-break to snatch the set after almost an hour.\nMurray had three break points at 4-3 up in the second set, but Kohlschreiber held his nerve and then broke Murray in the 11th game before serving out to level the match.\nThere were no breaks in the third set and Murray clinched victory on his second match point in another tie-break when Kohlschreiber, the winner in Munich in 2007 and 2012, sent a backhand long.",Britain's Andy Murray has won his first clay-court title with a 7-6 (7-4) 5-7 7-6 (7-4) victory over German Philipp Kohlschreiber in the Munich Open final.,32582143


The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [None]:
metric

Metric(name: "rouge", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}, usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each prediction
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLSum"`: rougeLsum splits text using `"\n"`.
        See details in https://github.com/huggingface/datasets/issues/617
    use_stemmer: Bool indicating whether Porter stemmer should be used to strip word suffixes.
    use_aggregator: Return aggregates if this is set to True
Retu

You can call its `compute` method with your predictions and labels, which need to be lists of decoded strings:

In [None]:
fake_preds = ["Raf is the best dog", "gouf gouf"]
fake_labels = ["Raf is the best dog", "gouf gouf"]
metric.compute(predictions=fake_preds, references=fake_labels)

{'rouge1': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rouge2': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeL': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeLsum': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))}

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a Transformers `Tokenizer`, which tokenizes the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary), puts them in the format the model expects, and generates the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, model_max_length=512)  # model_max_length=512 is needed for correct instantiation because of deprecation issues


By default, the call above will use one of the fast tokenizers (backed by Rust) from the Hugging Face Tokenizers library.

We can directly call this tokenizer on one sentence or a pair of sentences:

In [None]:
tokenizer("Hello, this is a sentence!")

{'input_ids': [8774, 6, 48, 19, 3, 9, 7142, 55, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

Depending on the model selected, the dictionary returned by the cell above contains different keys. This makes no difference for training.

You can read more about these keys in [this tutorial](https://huggingface.co/transformers/preprocessing.html).

Instead of one sentence, we can pass along a list of sentences:

In [None]:
tokenizer(["Hello, this is a sentence!", "This is another sentence."])

{'input_ids': [[8774, 6, 48, 19, 3, 9, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}
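To make the shape of these outputs concrete, here is a toy sketch of a tokenizer in plain Python. Everything in it is hypothetical: the tiny `TOY_VOCAB`, the whitespace splitting and the helper names are stand-ins, while the real T5 tokenizer uses a much larger SentencePiece vocabulary. What it does mirror is the structure above: one ID per token, an end-of-sequence ID appended at the end, an `attention_mask` of ones, and a list of lists when a list of sentences is passed.

```python
# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {
    "</s>": 1, "<unk>": 2, "hello": 3, "this": 4,
    "is": 5, "a": 6, "sentence": 7, "another": 8,
}


def toy_encode(text):
    """Encode one string into input_ids plus an all-ones attention_mask."""
    ids = [TOY_VOCAB.get(word, TOY_VOCAB["<unk>"]) for word in text.lower().split()]
    ids.append(TOY_VOCAB["</s>"])  # end-of-sequence marker, like the trailing 1 above
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


def toy_tokenizer(text_or_texts):
    """Accept a single string or a list of strings."""
    if isinstance(text_or_texts, str):
        return toy_encode(text_or_texts)
    encoded = [toy_encode(text) for text in text_or_texts]
    # For a list, each key maps to one inner list per sentence.
    return {key: [item[key] for item in encoded] for key in ("input_ids", "attention_mask")}


single = toy_tokenizer("hello this is a sentence")
batch = toy_tokenizer(["hello this is a sentence", "this is another sentence"])
print(single)
print(batch)
```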

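To prepare the targets for our model, we tokenize them inside the `as_target_tokenizer` context manager, which makes sure the tokenizer uses the special tokens corresponding to the targets. (Recent Transformers releases deprecate this context manager in favor of passing the targets through the tokenizer's `text_target` argument.) For T5, the result is the same as for the inputs: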

In [None]:
with tokenizer.as_target_tokenizer():
    print(tokenizer(["Hello, this is a sentence!", "This is another sentence."]))

{'input_ids': [[8774, 6, 48, 19, 3, 9, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}




Since we are using one of the five T5 checkpoints, we have to prefix the inputs with "summarize:" (the model can also translate, and it needs the prefix to know which task to perform).

In [None]:
# The five original T5 checkpoints all expect a task prefix.
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`. This ensures that an input longer than what the selected model can handle is truncated to the maximum length accepted by the model. Padding is dealt with later on (in a data collator), so examples are padded to the longest length in their batch rather than in the whole dataset.

In [None]:
max_input_length = 1024
max_target_length = 128

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [None]:
preprocess_function(raw_datasets["train"][:2])

{'input_ids': [[21603, 10, 37, 423, 583, 13, 1783, 16, 20126, 16496, 6, 80, 13, 8, 844, 6025, 4161, 6, 19, 341, 271, 14841, 5, 7057, 161, 19, 4912, 16, 1626, 5981, 11, 186, 7540, 16, 1276, 15, 2296, 7, 5718, 2367, 14621, 4161, 57, 4125, 387, 5, 15059, 7, 30, 8, 4653, 4939, 711, 747, 522, 17879, 788, 12, 1783, 44, 8, 15763, 6029, 1813, 9, 7472, 5, 1404, 1623, 11, 5699, 277, 130, 4161, 57, 18368, 16, 20126, 16496, 227, 8, 2473, 5895, 15, 147, 89, 22411, 139, 8, 1511, 5, 1485, 3271, 3, 21926, 9, 472, 19623, 5251, 8, 616, 12, 15614, 8, 1783, 5, 37, 13818, 10564, 15, 26, 3, 9, 3, 19513, 1481, 6, 18368, 186, 1328, 2605, 30, 7488, 1887, 3, 18, 8, 711, 2309, 9517, 89, 355, 5, 3966, 1954, 9233, 15, 6, 113, 293, 7, 8, 16548, 13363, 106, 14022, 84, 47, 14621, 4161, 6, 243, 255, 228, 59, 7828, 8, 1249, 18, 545, 11298, 1773, 728, 8, 8347, 1560, 5, 611, 6, 255, 243, 72, 1709, 1528, 161, 228, 43, 118, 4006, 91, 12, 766, 8, 3, 19513, 1481, 410, 59, 5124, 5, 96, 196, 17, 19, 1256, 68, 27, 103, 317, 132

To apply this function to all the document/summary pairs in our dataset, we use the `map` method of the `raw_datasets` object we created earlier. This applies the function to every element of every split in `raw_datasets`, so our training, validation and testing data are preprocessed in a single command.

In [None]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

Map:   0%|          | 0/204045 [00:00<?, ? examples/s]

Map:   0%|          | 0/11332 [00:00<?, ? examples/s]

Map:   0%|          | 0/11334 [00:00<?, ? examples/s]

Note that we passed `batched=True` so that the texts are encoded in batches rather than one at a time. This leverages the full benefit of the fast tokenizer we loaded earlier, which uses multi-threading to process the texts in a batch concurrently.
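To make the batched calling convention concrete, here is a toy stand-in for `map(..., batched=True)` in plain Python. It is only a sketch: `map_batched` and `toy_preprocess` are hypothetical names, and the real `map` also keeps the original columns, caches results and shows progress bars. The default of 1000 rows per call mirrors the default `batch_size` of `map`.

```python
from typing import Callable, Dict, List

Batch = Dict[str, List]


def map_batched(columns: Batch, fn: Callable[[Batch], Batch], batch_size: int = 1000) -> Batch:
    """Toy stand-in for a batched map: call `fn` on slices of every column."""
    n_rows = len(next(iter(columns.values())))
    out: Batch = {}
    for start in range(0, n_rows, batch_size):
        # Each call receives a dict of lists, one list per column.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            out.setdefault(name, []).extend(values)
    return out


def toy_preprocess(examples: Batch) -> Batch:
    # Receives a dict of lists, just like `preprocess_function` does.
    return {"n_words": [len(doc.split()) for doc in examples["document"]]}


data = {"document": ["a b c", "d e", "f"], "summary": ["x", "y", "z"]}
result = map_batched(data, toy_preprocess, batch_size=2)
print(result)  # {'n_words': [3, 2, 1]}
```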

## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is sequence-to-sequence (both the input and output are text sequences), we use the `TFAutoModelForSeq2SeqLM` class. As with the tokenizer, the `from_pretrained` method will download and cache the model for us.

In [None]:
from transformers import TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Downloading tf_model.h5:   0%|          | 0.00/242M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

Next we set some hyperparameters: the batch size, the learning rate, the weight decay and the number of training epochs.

The last two variables set everything up so we can push the model to the [Hub](https://huggingface.co/models) at the end of training. Remove them if you didn't follow the installation steps at the top of the notebook; otherwise, you can change the value of `push_to_hub_model_id` to something you prefer.

In [None]:
batch_size = 8
learning_rate = 2e-5
weight_decay = 0.01
num_train_epochs = 3

model_name = model_checkpoint.split("/")[-1]
push_to_hub_model_id = f"{model_name}-finetuned-xsum_3epoch_batch8"
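As a quick sanity check on these settings, the number of optimizer steps follows from the size of the training split (204045 examples, as reported in the `map` progress output above). This is a sketch that assumes the final, smaller batch is kept rather than dropped; the result matches the 25506 steps per epoch shown in the training progress bar.

```python
import math

num_train_examples = 204045  # size of the training split reported by `map`
batch_size = 8
num_train_epochs = 3

# Assumes the final, smaller batch is kept rather than dropped.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch, total_steps)  # prints: 25506 76518
```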

Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels. Note that our data collators are multi-framework, so make sure you set `return_tensors='tf'` so you get `tf.Tensor` objects back and not something else!

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")
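To illustrate what this kind of collator does, here is a simplified plain-Python sketch of per-batch padding. It is not the library's implementation: `pad_batch` is a hypothetical helper, it returns lists rather than `tf.Tensor` objects, and it omits the decoder inputs the real collator can prepare when given the model. It assumes a pad token id of 0 (the T5 value) and pads labels with -100, the value the loss ignores.

```python
from typing import Dict, List

LABEL_PAD_ID = -100  # label positions with this id are ignored by the loss


def pad_batch(features: List[Dict[str, List[int]]], pad_token_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad inputs and labels to the longest sequence in this batch only."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        in_pad = max_in - len(f["input_ids"])
        lab_pad = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * in_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * in_pad)
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * lab_pad)
    return batch


features = [
    {"input_ids": [8774, 6, 48, 19, 3, 9, 7142, 55, 1], "attention_mask": [1] * 9, "labels": [8774, 6, 1]},
    {"input_ids": [100, 19, 430, 7142, 5, 1], "attention_mask": [1] * 6, "labels": [100, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"][1])  # [100, 19, 430, 7142, 5, 1, 0, 0, 0]
print(batch["labels"][1])     # [100, 1, -100]
```

The -100 padding is why the metric function later replaces negative label ids with the pad token id before decoding.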

In [None]:
tokenized_datasets["train"]

Now we convert our input datasets to TF datasets using this collator. There's a built-in method for this: `to_tf_dataset()`. Make sure to specify the collator we just created as our `collate_fn`!

Computing the `ROUGE` metric can be slow because it requires the model to generate outputs token-by-token. To speed things up, we make a `generation_dataset` that contains only 200 examples from the validation dataset, and use this for `ROUGE` computations.

In [None]:
train_dataset = tokenized_datasets["train"].to_tf_dataset(
    batch_size=batch_size,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=True,
    collate_fn=data_collator,
)
validation_dataset = tokenized_datasets["validation"].to_tf_dataset(
    batch_size=8,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=False,
    collate_fn=data_collator,
)
generation_dataset = (
    tokenized_datasets["validation"]
    .shuffle()
    .select(list(range(200)))
    .to_tf_dataset(
        batch_size=8,
        columns=["input_ids", "attention_mask", "labels"],
        shuffle=False,
        collate_fn=data_collator,
    )
)

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Now we initialize our optimizer and compile the model. Note that most Transformers models compute their loss internally. We can train on this as our loss value simply by not specifying a loss when we `compile()`.

In [None]:
from transformers import AdamWeightDecay
import tensorflow as tf

optimizer = AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay)
model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


Now we can train our model. We can also add a few optional callbacks here, which you can remove if they aren't useful to you. In no particular order, these are:
- `PushToHubCallback` will sync up our model with the Hub. This allows us to resume training from other machines, share the model after training is finished, and even test the model's inference quality midway through training!
- `TensorBoard` is a built-in Keras callback that logs TensorBoard metrics.
- `KerasMetricCallback` is a callback for computing advanced metrics. There are a number of common metrics in NLP like ROUGE which are hard to fit into your compiled training loop because they depend on decoding predictions and labels back to strings with the tokenizer, and calling arbitrary Python functions to compute the metric. The `KerasMetricCallback` will wrap a metric function, outputting metrics as training progresses.

If this is the first time you've seen `KerasMetricCallback`, it's worth explaining what exactly is going on here. The callback takes two main arguments - a `metric_fn` and an `eval_dataset`. It then iterates over the `eval_dataset` and collects the model's outputs for each sample, before passing the `list` of predictions and the associated `list` of labels to the user-defined `metric_fn`. If the `predict_with_generate` argument is `True`, then it will call `model.generate()` for each input sample instead of `model.predict()` - this is useful for metrics that expect generated text from the model, like `ROUGE`.

This callback allows complex metrics to be computed each epoch that would not function as a standard Keras Metric. Metric values are printed each epoch, and can be used by other callbacks like `TensorBoard` or `EarlyStopping`.
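The collect-then-score flow described above can be sketched in plain Python. This is a simplified illustration, not the callback's actual implementation: `evaluate_with_metric`, `echo_first_two` and `exact_match` are hypothetical names, and the toy "model" just echoes the first two input tokens.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Sequences = List[List[int]]
Batch = Tuple[Sequences, Sequences]  # (inputs, labels)


def evaluate_with_metric(
    eval_batches: Iterable[Batch],
    generate_fn: Callable[[Sequences], Sequences],
    metric_fn: Callable[[Tuple[Sequences, Sequences]], Dict[str, float]],
) -> Dict[str, float]:
    """Collect predictions and labels over all batches, then score them once."""
    predictions: Sequences = []
    labels: Sequences = []
    for inputs, batch_labels in eval_batches:
        predictions.extend(generate_fn(inputs))  # stands in for model.generate()
        labels.extend(batch_labels)
    return metric_fn((predictions, labels))


def echo_first_two(inputs: Sequences) -> Sequences:
    # Toy "model": predict the first two input tokens.
    return [seq[:2] for seq in inputs]


def exact_match(eval_predictions: Tuple[Sequences, Sequences]) -> Dict[str, float]:
    preds, refs = eval_predictions
    return {"exact_match": sum(p == r for p, r in zip(preds, refs)) / len(refs)}


eval_batches = [([[1, 2, 3], [4, 5]], [[1, 2], [9]]), ([[7, 8]], [[7, 8]])]
scores = evaluate_with_metric(eval_batches, echo_first_two, exact_match)
print(scores)
```

The key point is that `metric_fn` sees the whole evaluation set at once, which is what lets string-based metrics such as ROUGE be computed.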

In [None]:
import numpy as np
import nltk
nltk.download('punkt')

def metric_fn(eval_predictions):
    predictions, labels = eval_predictions
    decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    for label in labels:
        label[label < 0] = tokenizer.pad_token_id  # Replace masked label tokens
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Rouge expects a newline after each sentence
    decoded_predictions = [
        "\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_predictions
    ]
    decoded_labels = [
        "\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels
    ]
    result = metric.compute(
        predictions=decoded_predictions, references=decoded_labels, use_stemmer=True
    )
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    # Add mean generated length
    prediction_lens = [
        np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions
    ]
    result["gen_len"] = np.mean(prediction_lens)

    return result

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
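For intuition about the numbers `metric_fn` reports, here is a bare-bones ROUGE-1 F-measure in plain Python. It is only a sketch under simplifying assumptions: it splits on whitespace and lowercases, whereas the real metric uses its own tokenization, optional stemming, additional variants such as ROUGE-2 and ROUGE-L, and aggregation across examples. The function name `rouge1_f` is hypothetical.

```python
from collections import Counter


def rouge1_f(prediction: str, reference: str) -> float:
    """Unigram-overlap F-measure between a prediction and a reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f("the cat sat on the mat", "the cat lay on the mat")
print(round(score * 100, 1))  # prints: 83.3
```

Five of the six words overlap in each direction, so precision and recall are both 5/6 and so is the F-measure.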


In [None]:
from transformers.keras_callbacks import PushToHubCallback, KerasMetricCallback
from tensorflow.keras.callbacks import TensorBoard

tensorboard_callback = TensorBoard(log_dir="./summarization_model_save/logs")

push_to_hub_callback = PushToHubCallback(
    output_dir="./summarization_model_save",
    tokenizer=tokenizer,
    hub_model_id=push_to_hub_model_id,
)

metric_callback = KerasMetricCallback(
    metric_fn, eval_dataset=generation_dataset, predict_with_generate=True
)

callbacks = [metric_callback, tensorboard_callback, push_to_hub_callback]

model.fit(
    train_dataset,
    validation_data=validation_dataset,
    epochs=num_train_epochs,  # defined with the other hyperparameters above
    callbacks=callbacks,
)

Cloning https://huggingface.co/Evangeliaa/t5-small-finetuned-xsum_3epoch_batch8 into local empty directory.


Epoch 1/3
    6/25506 [..............................] - ETA: 47:45 - loss: 3.8015







Epoch 2/3



Epoch 3/3



InvalidArgumentError: ignored

In [None]:
from transformers import TFAutoModelForSeq2SeqLM
# model = TFAutoModelForSeq2SeqLM.from_pretrained("your-username/my-awesome-model")
model = TFAutoModelForSeq2SeqLM.from_pretrained("Evangeliaa/t5-small-model")

In [None]:
preds = [ "A selection of your pictures of Scotland sent in between 30 June and 7 July. Send your photos to scotlandpictures@bbc.co.uk or via Instagram at #bbcscotlandpics" ]
labels = ["All pictures are copyrighted."]
metric.compute(predictions=preds, references=labels)