<a href="https://colab.research.google.com/github/hsupeter/pychPushTest/blob/master/en_zh_Tw_Summarization_(PyTorch).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summarization
In this section we’ll take a look at how Transformer models can be used to condense long documents into summaries, a task known as text summarization. This is one of the most challenging NLP tasks as it requires a range of abilities, such as understanding long passages and generating coherent text that captures the main topics in a document. However, when done well, text summarization is a powerful tool that can speed up various business processes by relieving domain experts of the burden of reading long documents in detail.



In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Install the Transformers and Datasets libraries to run this notebook.

In [2]:
!pip install datasets transformers[sentencepiece]
!pip install accelerate
# To run the training on TPU, you will need to uncomment the following line:
# !pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
!apt install git-lfs


Although there already exist various fine-tuned models for summarization on the Hugging Face Hub, almost all of these are only suitable for English documents. So, to add a twist in this section, we’ll train a bilingual model for English and Chinese. By the end of this section, you’ll have a model that can summarize customer reviews.

You will need to set up Git; replace the email and name in the following cell with your own.

In [3]:
!git config --global user.email "you@example.com"
!git config --global user.name "Your Name"

You will also need to be logged in to the Hugging Face Hub. Execute the following and enter your credentials.

In [5]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


## Preparing a multilingual corpus
We’ll use the [Multilingual Amazon Reviews Corpus](https://huggingface.co/datasets/amazon_reviews_multi) to create our bilingual summarizer. This corpus consists of Amazon product reviews in six languages and is typically used to benchmark multilingual classifiers. However, since each review is accompanied by a short title, we can use the titles as the target summaries for our model to learn from! To get started, let’s download the English and Chinese (`zh`) subsets from the Hugging Face Hub:

In [6]:
from datasets import load_dataset

zhongHua_dataset = load_dataset("amazon_reviews_multi", "zh")
english_dataset = load_dataset("amazon_reviews_multi", "en")
print(english_dataset)
print('_'* 60)
print(zhongHua_dataset)

DatasetDict({
    train: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 200000
    })
    validation: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 5000
    })
})
____________________________________________________________
DatasetDict({
    train: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 200000
    })
    validation: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],


As you can see, for each language there are 200,000 reviews for the train split, and 5,000 reviews for each of the validation and test splits. The review information <u>we are interested in is contained in the `review_body` and `review_title` columns</u>. Let’s take a look at a few examples by creating a simple function that takes a random sample from the training set with the techniques we learned in [Chapter 5](https://huggingface.co/course/chapter5/1):

In [7]:
def show_samples(dataset, num_samples=3, seed=41):
    sample = dataset["train"].shuffle(seed=seed).select(range(num_samples))
    for example in sample:
        print(f"\n'>> Index: {example['review_id']}'")
        print(f"'>> Title: {example['review_title']}'")
        print(f"'>> Review: {example['review_body']}'")


show_samples(english_dataset)
print('_'* 60)
show_samples(zhongHua_dataset)


'>> Index: en_0020978'
'>> Title: Very poor quality.. breaks in no time'
'>> Review: Extremely poor product. Took theee attempts to install it. Broke in like a week afterwards. Get good quality product and not waste your money on this one.'

'>> Index: en_0915330'
'>> Title: Feels fake'
'>> Review: I never got this i have ordered it twice now i wrote person and all they got money for it but i never got the cords very un happy'

'>> Index: en_0422493'
'>> Title: Good'
'>> Review: Good product so far, very helpful'
____________________________________________________________

'>> Index: zh_0621232'
'>> Title: 假货'
'>> Review: 经与自己常用的对比，有异味~~确认为假货'

'>> Index: zh_0384855'
'>> Title: 最糟糕的快递和购物体验。'
'>> Review: 快递实在非常糟糕，货物被丢到门卫那里，之前没有任何电话通知，唯一发过一个短信，系统显示为诈骗短信。直到七天后我自己打电话问才知道。而且在我没收到

This sample shows the diversity of reviews one typically finds online, ranging from positive to negative (and everything in between!). Although the example with the “Good” title is not very informative, the other titles look like decent summaries of the reviews themselves. Training a summarization model on all 400,000 reviews would take far too long on a single GPU, so instead we’ll focus on generating summaries for a single domain of products. To get a feel for what domains we can choose from, let’s convert `english_dataset` to a `pandas.DataFrame` and compute the number of reviews per product category:

In [8]:
english_dataset.set_format("pandas")
english_df = english_dataset["train"][:]
# Show counts for top 20 products
english_df["product_category"].value_counts()[:20]

home                      17679
apparel                   15951
wireless                  15717
other                     13418
beauty                    12091
drugstore                 11730
kitchen                   10382
toy                        8745
sports                     8277
automotive                 7506
lawn_and_garden            7327
home_improvement           7136
pet_products               7082
digital_ebook_purchase     6749
pc                         6401
electronics                6186
office_product             5521
shoes                      5197
grocery                    4730
book                       3756
Name: product_category, dtype: int64

In [9]:
english_df

Unnamed: 0,review_id,product_id,reviewer_id,stars,review_body,review_title,language,product_category
0,en_0964290,product_en_0740675,reviewer_en_0342986,1,Arrived broken. Manufacturer defect. Two of th...,I'll spend twice the amount of time boxing up ...,en,furniture
1,en_0690095,product_en_0440378,reviewer_en_0133349,1,the cabinet dot were all detached from backing...,Not use able,en,home_improvement
2,en_0311558,product_en_0399702,reviewer_en_0152034,1,I received my first order of this product and ...,The product is junk.,en,home
3,en_0044972,product_en_0444063,reviewer_en_0656967,1,This product is a piece of shit. Do not buy. D...,Fucking waste of money,en,wireless
4,en_0784379,product_en_0139353,reviewer_en_0757638,1,went through 3 in one day doesn't fit correct ...,bubble,en,pc
...,...,...,...,...,...,...,...,...
199995,en_0046316,product_en_0980158,reviewer_en_0629807,5,"Cute slippers, my MIL loved them.",Nice and fit as advertised,en,shoes
199996,en_0956024,product_en_0954574,reviewer_en_0459072,5,My 6 year old likes this and keeps him engaged...,good to keep the kids engaged,en,toy
199997,en_0589358,product_en_0402982,reviewer_en_0199163,5,Replaced my battery with it. Works like new.,This works,en,wireless
199998,en_0970602,product_en_0873374,reviewer_en_0590563,5,"I like them, holding up well.",Well made.,en,industrial_supplies


In [10]:
zhongHua_dataset.set_format("pandas")
zhongHua_dataset_df = zhongHua_dataset["train"][:]
# Show counts for top 20 products
zhongHua_dataset_df["product_category"].value_counts()[:20]

book                      63058
digital_ebook_purchase    19006
apparel                   11804
shoes                      9877
beauty                     9401
kitchen                    9170
home                       8222
other                      7525
grocery                    7425
wireless                   6432
baby_product               6172
drugstore                  6072
sports                     6015
pc                         4821
toy                        3670
home_improvement           3239
watch                      3133
electronics                3059
luggage                    2984
office_product             2855
Name: product_category, dtype: int64

In [11]:
zhongHua_dataset_df

Unnamed: 0,review_id,product_id,reviewer_id,stars,review_body,review_title,language,product_category
0,zh_0626061,product_zh_0691762,reviewer_zh_0824776,1,Êú¨‰∫∫Ë¥¶Âè∑Ë¢´ÁõóÔºåËµÑÈáëË¢´Ê±üË•øÔºàÊù®Âª∫ÔºâÊå™Áî®ÔºåËØ∑‰∫öÈ©¨ÈÄäÂ∞ΩÂø´Êü•ÂÆûÔºåÂ∞ÜÊú¨‰∫∫ÁöÑ200ÂÖÉËµÑÈáëÈÄÄÂõû„ÄÇÊú¨‰∫∫Â∑≤‰∫é2...,Ê≠§‰π¶‰∏çÊòØÊú¨‰∫∫Ë¥≠‰π∞,zh,book
1,zh_0713738,product_zh_0123483,reviewer_zh_0518940,1,ËøôÁÆÄÁõ¥Â∞±ÊòØÂ§™Â∑Æ‰∫ÜÔºÅÂá∫ÁâàÁ§æÊÄé‰πàÂ∞±ËÉΩÂá∫ÁâàÂêóÔºüÊàë‰ª•‰∏∫ÊòØÁôæÂ∫¶ÊëòÂΩïÂë¢ÔºÅËøôÂà∞Â∫ïÊòØÂì™‰∏™È±ºÁõÆÊ∑∑Áè†ÁöÑÊïôÊéàÂïäÔºüÔºÅ...,ÁÆÄÁõ¥ÊòØÂ∫üËØùÔºÅ,zh,book
2,zh_0621612,product_zh_0670618,reviewer_zh_0040023,1,Ë¥≠‰π∞È°µÈù¢ÊòæÁ§∫1ÔΩû2Êó•ÂèëË¥ßÔºå‰ªòÊ¨æ‰πãÂêéÊòæÁ§∫Âçä‰∏™ÊúàÂêéÈÄÅËææÔºåÂÆûÈôÖÊî∂Âà∞ÂïÜÂìÅË∑ùÁ¶ª‰∏ãÂçïÊó•ÊúüÂ∑≤Áªè‰∏Ä‰∏™Â§öÊúà„ÄÇ ...,ÊúÄÁâõÈÄºÁöÑÈ¢ÑÂîÆ,zh,home_improvement
3,zh_0757997,product_zh_0379151,reviewer_zh_0794363,1,Èü≥ÁÆ±Êí≠ÊîæÊó∂Êñ≠Êñ≠Áª≠Áª≠ÁöÑÔºÅË¥®ÈáèÂÆåÂÖ®‰∏çË°åÔºåÁ¨¨‰∏ÄÊ¨°Âú®‰∫öÈ©¨ÈÄä‰π∞‰∏úË•øÔºåÊôïÔºÅÊÄé‰πàÊòØËøôÊ†∑ÁöÑÂëÄÔºüÊúâÂÆ¢ÊúçÂíåÊàëËÅîÁ≥ªÂêóÔºü,Ëø∑‰Ω†Èü≥Âìç,zh,other
4,zh_0086548,product_zh_0280958,reviewer_zh_0726131,1,Â≠óÂ§™Â∞èÂï¶ÔºåÂª∫ËÆÆ‰π∞Âà´ÁöÑÁâàÊú¨ÔºåÊÖé‰π∞ÂëÄÔºå‰∫≤‰ª¨ÔºåÊàëÂêéÊÇî‰π∞‰∫ÜËøô‰∏™ÁâàÊú¨ÔºÅÔºÅÔºÅ,ÊéíÁâàÂ§™ÂØÜÔºå‰∏çÈÄÇÂêàËèúÈ∏üÁî®ÔºåÁúãÂà∞ÁúºÁùõËä±‰∫ÜÔºÅ,zh,book
...,...,...,...,...,...,...,...,...
199995,zh_0336212,product_zh_0290549,reviewer_zh_0811077,5,‰π∞ÁöÑÊó∂ÂÄôÂÅöÊ¥ªÂä®Âæà‰æøÂÆúÔºåÊïàÊûúÁúüÊòØ‰∏çÈîôÔºåÊçÆËØ¥ÊòØÁ∫ØÂ§©ÁÑ∂ÁöÑÔºåÈùûÂ∏∏ÊªãÊ∂¶„ÄÇ,‰π∞ÁöÑÊó∂ÂÄôÂÅöÊ¥ªÂä®Âæà‰æøÂÆúÔºåÊïàÊûúÁúüÊòØ‰∏çÈîôÔºåÊçÆËØ¥ÊòØÁ∫ØÂ§©ÁÑ∂ÁöÑÔºåÈùûÂ∏∏ÊªãÊ∂¶„ÄÇ,zh,baby_product
199996,zh_0053535,product_zh_0652692,reviewer_zh_0826787,5,‰ªéÁîüÊ¥ªÁöÑÂ∞èÁªÜËäÇÂÖ•ÊâãÔºåËôΩÁÑ∂Â∑≤ÁªèËøá‰∫ÜÊó∂‰ª£Ôºå‰ΩÜÊòØÂæàÂ§öÁªÜËäÇËøòÊòØÂÄºÂæóÊàë‰ª¨Â≠¶‰π†,ÈõïÁà∑ÁöÑÁªèÂÖ∏‰ΩúÂìÅÊé®ËçêÂïä,zh,book
199997,zh_0023067,product_zh_0379439,reviewer_zh_0607766,5,ÈÄüÂ∫¶Âø´ÔºåË¥®Èáè‰πüÂ•ΩÔºå‰π¶ÁöÑÂÜÖÂÆπÊñ∞È¢ñÔºåÈ¢òÁõÆËøòÊúâËß£ÊûêÔºåÊòØ‰∏ÄÊú¨ÂÄºÂæóÊé®ËçêÁöÑÂ§ç‰π†ÂèÇËÄÉ‰π¶„ÄÇ,‰π¶‰∏çÈîô,zh,book
199998,zh_0723826,product_zh_0065445,reviewer_zh_0689101,5,Á¨¨‰∏ÄÊ¨°Áî®Ëøô‰πàÂ•ΩÁöÑÂç°ÔºåLOLËÉΩÂºÄÂà∞300Â§öFPS,ÂÖ®Êñ∞Âç°,zh,pc


The most popular products in the English dataset are about household items, clothing, and wireless electronics. To stick with the Amazon theme, though, let’s focus on summarizing book reviews; after all, this is what the company was founded on! We can see two product categories that fit the bill (`book` and `digital_ebook_purchase`), so let’s filter the datasets in both languages for just these products. As we saw in [Chapter 5](https://huggingface.co/course/chapter5/1), the `Dataset.filter()` function allows us to slice a dataset very efficiently, so we can define a simple function to do this:

In [12]:
def filter_books(example):
    return (
        example["product_category"] == "book"
        or example["product_category"] == "digital_ebook_purchase"
    )
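
Before running a predicate like this over the full dataset, it can help to try it on a few plain dictionaries. The sketch below is only an illustration: `BOOK_CATEGORIES`, `is_book_review`, and the toy rows are hypothetical names and data introduced here, and the set-based test is an assumed equivalent of the `or` expression above.

```python
# Hypothetical set-based variant of the predicate above (assumed equivalent)
BOOK_CATEGORIES = {"book", "digital_ebook_purchase"}


def is_book_review(example):
    # Keep a row only if its category is one of the two book categories
    return example["product_category"] in BOOK_CATEGORIES


# Toy rows standing in for dataset examples (made up for illustration)
toy_rows = [
    {"review_id": "a", "product_category": "book"},
    {"review_id": "b", "product_category": "wireless"},
    {"review_id": "c", "product_category": "digital_ebook_purchase"},
]

kept_ids = [row["review_id"] for row in toy_rows if is_book_review(row)]
print(kept_ids)
```

A predicate of this shape takes a single example (a dict of column values) and returns a boolean, which is the form `Dataset.filter()` expects when it is not batched.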

Now when we apply this function to `english_dataset` and `zhongHua_dataset`, the result will contain just those rows involving the book categories. Before applying the filter, let’s <u>switch the format of both datasets from "pandas" back to "arrow"</u>:

In [13]:
english_dataset.reset_format()
zhongHua_dataset.reset_format()

We can then apply the filter function, and as a sanity check let’s inspect a sample of reviews to see if they are indeed about books:

In [14]:
zhongHua_books = zhongHua_dataset.filter(filter_books)
english_books = english_dataset.filter(filter_books)
show_samples(english_books)
show_samples(zhongHua_books)


'>> Index: en_0821617'
'>> Title: Where are the page numbers??? And maybe a map?'
'>> Review: This book provides a light amount of detail on the most important places to visit on a trip to Japan. But it is lacking in 3 important items that I could not believe were missing from a travel guide. There is not a single map of any area discussed or even a map of Japan itself. Second, there are no page numbers?! Third, which goes hand in hand with the lack of page numbers, there is no index of what is in the guide. The Table of Contents list chapters but without page numbers you have to constantly fan through the pages to find a chapter. OK, so I added color coded tabs to everything which made it more acceptable, but why should a reader have to do this? Can't recommend it.'

'>> Index: en_0634721'
'>> Title: Good "behind the scenes" look at Air America and Nixon's secret war'
'>> Review: This was a good book, unpretentious, written in everyday style. The author kept a serious subject light a

In [15]:
print(english_books)
print(zhongHua_books)

DatasetDict({
    train: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 10505
    })
    validation: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 231
    })
    test: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 278
    })
})
DatasetDict({
    train: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 82064
    })
    validation: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 2003
    })
    test: Dataset({
        feature

Okay, we can see that the reviews are not strictly about books and might refer to things like calendars and electronic applications such as OneNote. Nevertheless, the domain seems about right to train a summarization model on. Before we look at various models that are suitable for this task, we have one last bit of data preparation to do: combining the English and Chinese reviews as a single `DatasetDict` object. 🤗 Datasets provides a handy [concatenate_datasets()](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.concatenate_datasets) function that (as the name suggests) will stack two `Dataset` objects on top of each other. So, to create our bilingual dataset, we’ll loop over each split, concatenate the datasets for that split, and shuffle the result to ensure our model doesn’t overfit to a single language:

In [16]:
from datasets import concatenate_datasets, DatasetDict

books_dataset = DatasetDict()

for split in english_books.keys():
    books_dataset[split] = concatenate_datasets(
        [english_books[split], zhongHua_books[split]]
    )
    books_dataset[split] = books_dataset[split].shuffle(seed=41)

# Peek at a few examples
show_samples(books_dataset)
print(books_dataset)


'>> Index: zh_0417784'
'>> Title: 观点、猜想、议论多于事实'
'>> Review: 作者的三本书质量越来越低，这一本有比较大的拼凑感觉，已经没有太多新鲜的东西，还非要凑成一本书，写了太多的干瘪的议论，就像高中的时候写议论文。'

'>> Index: zh_0022348'
'>> Title: 很一般'
'>> Review: 孩子学校最近流行这个，但是个人并不觉得作品有多值得读。纸张也就这个样子了，所实话盗版还是正版真心看不出来。所谓的解密卡也不是那么清晰。'

'>> Index: zh_0580291'
'>> Title: 一般'
'>> Review: 书应该还行，但卖家居然给我发了粘泥巴的书，希望下次能认真点'
DatasetDict({
    train: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 92569
    })
    validation: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars

This certainly looks like a mix of English and Chinese reviews! Now that we have a training corpus, <u>one final thing to check is the distribution of words in the reviews and their titles</u>. This is especially important for summarization tasks, where short reference summaries in the data can bias the model to only output one or two words in the generated summaries. The plots below show the word distributions, and we can see that the titles are heavily skewed toward just 1-2 words:

<img src = ".\review-lengths.png">
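
A word-count distribution of this kind can be approximated with a few lines of standard-library Python. The snippet below is a minimal sketch on made-up titles, not the code that produced the plot; in the notebook the titles would come from the `review_title` column, and counting whitespace-separated chunks is an assumption about how "words" are measured.

```python
from collections import Counter

# Toy titles made up for illustration; in the notebook these would come from
# the "review_title" column of books_dataset["train"]
titles = [
    "Good",
    "Feels fake",
    "很一般",
    "Where are the page numbers???",
    "Great read",
    "一般",
]

# Count how many titles have 1, 2, 3, ... whitespace-separated words
length_counts = Counter(len(title.split()) for title in titles)

for num_words, count in sorted(length_counts.items()):
    print(f"{num_words} word(s): {count} title(s)")
```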

To deal with this, we’ll filter out the examples with very short titles so that our model can produce more interesting summaries. Since we’re dealing with English and Chinese texts, we can use a rough heuristic to split the titles on whitespace and then use our trusty `Dataset.filter()` method as follows:

In [17]:
books_dataset = books_dataset.filter(lambda x: len(x["review_title"].split()) > 2)
books_dataset

DatasetDict({
    train: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 6701
    })
    validation: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 148
    })
    test: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 180
    })
})
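
The training split shrinks from 92,569 to 6,701 rows, which is fewer than the 10,505 English book reviews alone. One likely reason is that Chinese titles usually contain no spaces, so a whitespace split counts them as a single word. The sketch below illustrates this on three titles modeled on the samples shown earlier; `has_more_than_two_words` is a hypothetical helper that mirrors the lambda above.

```python
def has_more_than_two_words(title):
    # Same rough heuristic as the filter above: count whitespace-separated chunks
    return len(title.split()) > 2


examples = [
    "Very poor quality.. breaks in no time",  # 7 chunks, kept
    "Feels fake",  # 2 chunks, dropped
    "最糟糕的快递和购物体验。",  # no spaces, so 1 chunk, dropped
]

results = [has_more_than_two_words(title) for title in examples]
for keep, title in zip(results, examples):
    print(keep, repr(title))
```

If a more balanced bilingual corpus is needed, a language-aware length measure (for example, counting characters for Chinese titles) would be one option to explore.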

Now that we’ve prepared our corpus, let’s take a look at a few possible Transformer models that one might fine-tune on it!

## Models for text summarization

If you think about it, text summarization is a similar sort of task to machine translation: we have a body of text like a review that we’d like to “translate” into a shorter version that captures the salient features of the input. Accordingly, most Transformer models for summarization adopt the encoder-decoder architecture that we first encountered in [Chapter 1](https://huggingface.co/course/chapter1/1), although there are some exceptions like the GPT family of models which can also be used for summarization in few-shot settings. [The following table](https://huggingface.co/course/chapter7/5?fw=pt) lists some popular pretrained models that can be fine-tuned for summarization.

Monolingual: [GPT-2](https://huggingface.co/gpt2-xl), [PEGASUS](https://huggingface.co/google/pegasus-large), [T5](https://huggingface.co/t5-base), [BART](https://huggingface.co/facebook/bart-base)  
Multilingual: [mT5](https://huggingface.co/google/mt5-base), [mBART-50](https://huggingface.co/facebook/mbart-large-50)  

### Transformer
	
| Model | Description | Multilingual? |
|-----------------|-----------------|-----------------|
| [GPT-2](https://huggingface.co/gpt2-xl) | Although trained as an auto-regressive language model, you can make GPT-2 generate summaries by appending “TL;DR” at the end of the input text. | ❌ |
| [PEGASUS](https://huggingface.co/google/pegasus-large) | Uses a pretraining objective to predict masked sentences in multi-sentence texts. This pretraining objective is closer to summarization than vanilla language modeling and scores highly on popular benchmarks. | ❌ |
| [T5](https://huggingface.co/t5-base) | A universal Transformer architecture that formulates all tasks in a text-to-text framework; e.g., the input format for the model to summarize a document is `summarize: ARTICLE`. | ❌ |
| [mT5](https://huggingface.co/google/mt5-base) | A multilingual version of T5, pretrained on the multilingual Common Crawl corpus (mC4), covering 101 languages. | ✅ |
| [BART](https://huggingface.co/facebook/bart-base) | A novel Transformer architecture with both an encoder and a decoder stack trained to reconstruct corrupted input that combines the pretraining schemes of BERT and GPT-2. | ❌ |
| [mBART-50](https://huggingface.co/facebook/mbart-large-50) | A multilingual version of BART, pretrained on 50 languages. | ✅ |

As you can see from this table, the majority of Transformer models for summarization (and indeed most NLP tasks) are monolingual. This is great if your task is in a “high-resource” language like English or German, but less so for the thousands of other languages in use across the world. Fortunately, there is a class of multilingual Transformer models, like mT5 and mBART, that come to the rescue. These models are pretrained using language modeling, but with a twist: instead of training on a corpus of one language, they are trained jointly on texts in over 50 languages at once!

We’ll focus on mT5, an interesting architecture based on T5 that was pretrained in a text-to-text framework. In T5, every NLP task is formulated in terms of a prompt prefix like `summarize:` which conditions the model to adapt the generated text to the prompt. As shown in the figure below, this makes T5 extremely versatile, as you can solve many tasks with a single model!

Different tasks performed by the T5 architecture.

mT5 doesn’t use prefixes, but shares much of the versatility of T5 and has the advantage of being multilingual. Now that we’ve picked a model, let’s take a look at preparing our data for training.

✏️ Try it out! Once you’ve worked through this section, see how well mT5 compares to mBART by fine-tuning the latter with the same techniques. For bonus points, you can also try fine-tuning T5 on just the English reviews. Since T5 has a special prefix prompt, you’ll need to prepend `summarize:` to the input examples in the preprocessing steps below.
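
For the T5 exercise, prepending the prefix could look roughly like the sketch below. `PREFIX` and `add_prefix` are hypothetical names introduced for illustration, and the function assumes it receives a batch in the dict-of-lists form that `Dataset.map(..., batched=True)` passes to its callback.

```python
PREFIX = "summarize: "  # task prefix that T5 uses for summarization


def add_prefix(examples):
    # `examples` is a batch: a dict mapping column names to lists of values
    return {"review_body": [PREFIX + text for text in examples["review_body"]]}


# Toy batch made up for illustration
batch = {"review_body": ["Good product so far, very helpful"]}
prefixed = add_prefix(batch)
print(prefixed["review_body"][0])
```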



## Preprocessing the data

Our next task is to tokenize and encode our reviews and their titles. As usual, we begin by loading the tokenizer associated with the pretrained model checkpoint. We’ll use `mt5-small` as our checkpoint so we can fine-tune the model in a reasonable amount of time:

In [18]:
from transformers import AutoTokenizer

model_checkpoint = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

💡 In the early stages of your NLP projects, a good practice is to train a class of “small” models on a small sample of data. This allows you to debug and iterate faster toward an end-to-end workflow. Once you are confident in the results, you can always scale up the model by simply changing the model checkpoint!

Let’s test out the mT5 tokenizer on a small example:

In [19]:
inputs = tokenizer("I loved reading the Hunger Game!")
inputs

{'input_ids': [336, 259, 28387, 11807, 287, 62893, 295, 7233, 309, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Here we can see the familiar `input_ids` and `attention_mask` that we encountered in our first fine-tuning experiments back in [Chapter 3](https://huggingface.co/course/chapter3/1). Let’s decode these input IDs with the tokenizer’s `convert_ids_to_tokens()` function to see what kind of tokenizer we’re dealing with:

In [20]:
tokenizer.convert_ids_to_tokens(inputs.input_ids)

['▁I', '▁', 'loved', '▁reading', '▁the', '▁Hung', 'er', '▁Game', '!', '</s>']

The special Unicode character `▁` and end-of-sequence token `</s>` indicate that we’re dealing with the SentencePiece tokenizer, which is based on the Unigram segmentation algorithm discussed in [Chapter 6](https://huggingface.co/course/chapter6/1). <u>Unigram is especially useful for multilingual corpora since it allows SentencePiece to be agnostic about accents, punctuation, and the fact that many languages</u>, like Japanese, do not have whitespace characters.
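
To see what the `▁` marker encodes, the token list above can be glued back into text by hand. This is a minimal sketch of the idea rather than how the tokenizer decodes internally; the token list is copied from the output above.

```python
# Tokens copied from the output of convert_ids_to_tokens() above
tokens = ["▁I", "▁", "loved", "▁reading", "▁the", "▁Hung", "er", "▁Game", "!", "</s>"]

# Drop the end-of-sequence token, join the pieces, and turn each
# "▁" word-boundary marker back into a space
pieces = [tok for tok in tokens if tok != "</s>"]
text = "".join(pieces).replace("▁", " ").strip()
print(text)
```

Because word boundaries are stored inside the tokens themselves, the original spacing can be recovered without any language-specific rules.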

To tokenize our corpus, we have to deal with a subtlety associated with summarization: because <u>our labels are also text</u>, <u>it is possible that they exceed the model’s maximum context size</u>. <u>This means we need to apply truncation to both the reviews and their titles to ensure we don’t pass excessively long inputs to our model</u>. The tokenizers in 🤗 Transformers provide a nifty [as_target_tokenizer()](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.as_target_tokenizer) function that <u>allows you to tokenize the labels in parallel to the inputs</u>. This is typically done using a <u>context manager inside a preprocessing function</u> that first encodes the inputs, and then encodes the labels as a separate column. Here is an example of such a function for mT5:

### 💡 <a>For the model to learn correctly, the `input_ids` used for fine-tuning are the `input_ids` of the tokenized `review_body`, while the `labels` are the `input_ids` of the tokenized `review_title`.</a>


### 💡 <a>Note:</a>
#### <a>Because both the inputs and the labels are tokenized here, wrap the label tokenization in the `as_target_tokenizer()` context manager so that the titles are encoded as targets.</a>


```
with tokenizer.as_target_tokenizer():
    labels = tokenizer( ... )
```

#### <a>Only `review_title` is tokenized inside the indented `with` block; `review_body` is tokenized outside it.</a>

In [21]:
max_input_length = 512
max_target_length = 30


def preprocess_function(examples):
    # Tokenization produces input_ids, attention_mask, and labels.
    # (1) model_inputs holds the input_ids and attention_mask of the tokenized review_body
    model_inputs = tokenizer(
        examples["review_body"], max_length=max_input_length, truncation=True
    )
    # Set up the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["review_title"], max_length=max_target_length, truncation=True
        )
    # (2) The labels are the input_ids of the tokenized review_title,
    # which is what the model should learn to generate
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


Let’s walk through this code to understand what’s happening. The first thing we’ve done is define values for `max_input_length` and `max_target_length`, which set the upper limits for how long our reviews and titles can be. Since the review body is typically much larger than the title, we’ve scaled these values accordingly. Then, in the `preprocess_function()` itself we can see the reviews are first tokenized, followed by the titles with `as_target_tokenizer()`.
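
The effect of the two length limits can be seen in isolation with a toy stand-in for the tokenizer. Everything in this sketch is hypothetical: `toy_tokenize` simply splits on whitespace instead of using the mT5 vocabulary, and the limits are far smaller than the 512 and 30 used above so that the truncation is visible.

```python
MAX_INPUT_LENGTH = 8  # toy limits, far smaller than the 512 / 30 used above
MAX_TARGET_LENGTH = 3


def toy_tokenize(text, max_length):
    # Hypothetical stand-in for the real tokenizer: split on whitespace,
    # then truncate to at most `max_length` tokens
    return text.split()[:max_length]


def toy_preprocess(examples):
    # Inputs come from the review bodies ...
    model_inputs = {
        "input_ids": [toy_tokenize(t, MAX_INPUT_LENGTH) for t in examples["review_body"]]
    }
    # ... and labels come from the titles, truncated with their own limit
    model_inputs["labels"] = [
        toy_tokenize(t, MAX_TARGET_LENGTH) for t in examples["review_title"]
    ]
    return model_inputs


batch = {
    "review_body": ["This book provides a light amount of detail on the most important places"],
    "review_title": ["Where are the page numbers???"],
}
out = toy_preprocess(batch)
print(len(out["input_ids"][0]), len(out["labels"][0]))
```

The inputs and labels are truncated independently, which is the same structure `preprocess_function()` follows with the real tokenizer.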

With `preprocess_function()`, it is then a simple matter to tokenize the whole corpus using the handy `Dataset.map()` function we’ve used extensively throughout this course:

In [22]:
tokenized_datasets = books_dataset.map(preprocess_function, batched=True)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 6701
    })
    validation: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 148
    })
    test: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 180
    })
})

Now that the corpus has been preprocessed, let's take a look at some metrics that are commonly used for summarization. As we'll see, there is no silver bullet when it comes to measuring the quality of machine-generated text.

💡 You may have noticed that we used `batched=True` in our `Dataset.map()` function above. This encodes the examples in batches of 1,000 (the default) and allows you to make use of the multithreading capabilities of the fast tokenizers in 🤗 Transformers. Where possible, try using `batched=True` to get the most out of your preprocessing!
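
As a rough picture of what batched mapping means, the sketch below mimics the idea in plain Python: the mapped function receives a dict of column lists covering up to `batch_size` rows at a time. `toy_batched_map` and `count_words` are hypothetical helpers written for this illustration, not part of 🤗 Datasets.

```python
# Toy sketch of batched mapping: the function receives a dict of lists,
# one list per column, covering up to batch_size rows at a time.
def toy_batched_map(rows, function, batch_size=1000):
    """Apply `function` to column-wise batches and collect the output rows."""
    results = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Convert a list of row dicts into a dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        output = function(batch)
        # Convert the returned column lists back into individual rows.
        n = len(next(iter(output.values())))
        for i in range(n):
            results.append({key: values[i] for key, values in output.items()})
    return results

def count_words(batch):
    """Example batched function: one output value per input row."""
    return {"num_words": [len(text.split()) for text in batch["review_body"]]}

rows = [{"review_body": "a great read"}, {"review_body": "not for me at all"}] * 1250
mapped = toy_batched_map(rows, count_words)
print(len(mapped), mapped[0], mapped[1])
```

The 2,500 toy rows are processed in three chunks (1,000, 1,000, and 500), which is why a batched function must always work on lists rather than on single strings.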

## Metrics for text summarization

In comparison to most of the other tasks we’ve covered in this course, measuring the performance of text generation tasks like summarization or translation is not as straightforward. For example, given a review like “I loved reading the Hunger Games”, there are multiple valid summaries, like “I loved the Hunger Games” or “Hunger Games is a great read”. <u>Clearly</u>, <u>applying some sort of exact match between the generated summary and the label is not a good solution</u>; even humans would fare poorly under such a metric, because we all have our own writing style.

For summarization, one of the most commonly used metrics is the [ROUGE score](https://en.wikipedia.org/wiki/ROUGE_(metric)) (short for Recall-Oriented Understudy for Gisting Evaluation). The basic idea behind this metric is to compare a generated summary against a set of reference summaries that are typically created by humans. To make this more precise, suppose we want to compare the following two summaries:

In [23]:
generated_summary = "I absolutely loved reading the Hunger Games"
reference_summary = "I loved reading the Hunger Games"

One way to compare them could be to count the number of overlapping words, which in this case would be 6. However, this is a bit crude, so instead ROUGE is based on computing the *precision* and *recall* scores for the overlap.


> 🙋 Don’t worry if this is the first time you’ve heard of precision and recall; we’ll go through some explicit examples together to make it all clear. These metrics are usually encountered in classification tasks, so if you want to understand how precision and recall are defined in that context, we recommend checking out the scikit-learn [guides](https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html).



For ROUGE, recall measures how much of the reference summary is captured by the generated one. If we are just comparing words, recall can be calculated according to the following formula:  


> $\text{Recall} = \frac{\text{Number of overlapping words}}{\text{Total number of words in reference summary}}$




For our simple example above, this formula gives a perfect recall of 6/6 = 1; i.e., all the words in the reference summary have been produced by the model. This may sound great, but imagine if our generated summary had been “I really really loved reading the Hunger Games all night”. This would also have perfect recall, but is arguably a worse summary since it is verbose. To deal with these scenarios we also compute the precision, which in the ROUGE context measures how much of the generated summary was relevant:


> $\text{Precision} = \frac{\text{Number of overlapping words}}{\text{Total number of words in generated summary}}$



Applying this to our verbose summary gives a precision of 6/10 = 0.6, which is considerably worse than the precision of 6/7 = 0.86 obtained by our shorter one. In practice, both precision and recall are usually computed, and then the F1-score (the harmonic mean of precision and recall) is reported. We can do this easily in 🤗 Datasets by first installing the `rouge_score` package:

In [24]:
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.0.4-py2.py3-none-any.whl (22 kB)
Installing collected packages: rouge-score
Successfully installed rouge-score-0.0.4


In [25]:
from datasets import load_metric

rouge_score = load_metric("rouge")

Then we use the rouge_score.compute() function to calculate all the metrics at once:

In [26]:
scores = rouge_score.compute(
    predictions=[generated_summary], references=[reference_summary]
)
scores

{'rouge1': AggregateScore(low=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923), mid=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923), high=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923)),
 'rouge2': AggregateScore(low=Score(precision=0.6666666666666666, recall=0.8, fmeasure=0.7272727272727272), mid=Score(precision=0.6666666666666666, recall=0.8, fmeasure=0.7272727272727272), high=Score(precision=0.6666666666666666, recall=0.8, fmeasure=0.7272727272727272)),
 'rougeL': AggregateScore(low=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923), mid=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923), high=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923)),
 'rougeLsum': AggregateScore(low=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923), mid=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.92307692307

Whoa, there’s a lot of information in that output. What does it all mean? First, 🤗 Datasets actually computes confidence intervals for precision, recall, and F1-score; these are the `low`, `mid`, and `high` attributes you can see here. Moreover, 🤗 Datasets computes a variety of ROUGE scores which are based on different types of text granularity when comparing the generated and reference summaries. The `rouge1` variant is the overlap of unigrams, which is just a fancy way of saying the overlap of words and is exactly the metric we’ve discussed above. To verify this, let’s pull out the `mid` value of our scores:

In [27]:
scores["rouge1"].mid

Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923)

Great, the precision and recall numbers match up! Now what about those other ROUGE scores? `rouge2` measures the overlap between bigrams (think the overlap of pairs of words), while `rougeL` and `rougeLsum` measure the longest matching sequences of words by looking for the longest common subsequence in the generated and reference summaries, that is, words that appear in the same order but not necessarily next to each other. The “sum” in `rougeLsum` refers to the fact that this metric is computed at the summary level: the text is split into sentences on newlines and the matches are aggregated over the whole summary, while `rougeL` ignores newlines and treats each text as a single sequence.
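
To see where the `rougeL` numbers come from, the following sketch computes the longest common subsequence (LCS) of the two example summaries by hand. It is a simplified illustration that lowercases and splits on whitespace; the `rouge_score` package applies its own tokenization (and optional stemming), so treat `lcs_length` as a hypothetical helper rather than the library's implementation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

generated = "I absolutely loved reading the Hunger Games".lower().split()
reference = "I loved reading the Hunger Games".lower().split()

lcs = lcs_length(generated, reference)
rouge_l_precision = lcs / len(generated)  # share of generated tokens in the LCS
rouge_l_recall = lcs / len(reference)     # share of reference tokens in the LCS
print(lcs, round(rouge_l_precision, 4), rouge_l_recall)
```

The LCS has length 6 because every reference word appears in the generated summary in the same order, even though “absolutely” interrupts the match. This gives a precision of 6/7 and a recall of 1, in line with the `rougeL` values printed earlier.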

>✏️ Try it out! Create your own example of a generated and reference summary and see if the resulting ROUGE scores agree with a manual calculation based on the formulas for precision and recall. For bonus points, split the text into bigrams and compare the precision and recall for the rouge2 metric.
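
One possible way to do the manual check is sketched below. It counts overlapping n-grams with `collections.Counter` and applies the precision, recall, and F1 formulas directly. The helpers `ngrams` and `rouge_n` are hypothetical names, and the whitespace tokenization is an assumption that happens to match the library on this simple example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(generated, reference, n):
    """Precision, recall and F1 of the n-gram overlap between two strings."""
    gen = ngrams(generated.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((gen & ref).values())  # clipped n-gram matches
    precision = overlap / max(sum(gen.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

generated_summary = "I absolutely loved reading the Hunger Games"
reference_summary = "I loved reading the Hunger Games"

for n in (1, 2):
    p, r, f = rouge_n(generated_summary, reference_summary, n)
    print(f"rouge{n}: precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

For unigrams this gives 6/7, 1, and 12/13; for bigrams, 4 of the 6 generated bigrams and 4 of the 5 reference bigrams overlap, giving 0.6667, 0.8, and 0.7273. These agree with the `rouge1` and `rouge2` entries computed above.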



We’ll use these ROUGE scores to track the performance of our model, but before doing that let’s do something every good NLP practitioner should do: create a strong, yet simple baseline!


### Creating a strong baseline
<u>A common baseline for text summarization is to simply take the first three sentences of an article, often called the lead-3 baseline</u>. We could use full stops to track the sentence boundaries, but this will fail on acronyms like “U.S.” or “U.N.”, so instead we’ll use the `nltk` library, which includes a better algorithm to handle these cases. You can install the package using pip as follows:

💡 The gist of a text is usually conveyed in its first three sentences, so the lead-3 baseline keeps only the first three sentences of review_body after sentence splitting, with no training required.

In [28]:
!pip install nltk



In [29]:
import nltk

# download the punctuation rules
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True
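
Before moving on, here is a small sketch of why splitting on full stops is fragile. The example text is made up for illustration; it contains three sentences, but the naive split breaks inside the acronyms.

```python
text = "The U.S. economy grew last year. Analysts at the U.N. agreed. Markets rallied."

# Naive approach: treat every full stop followed by a space as a boundary.
naive_sentences = [s for s in text.split(". ") if s]
print(len(naive_sentences))
print(naive_sentences)
```

The naive split yields five fragments instead of three sentences, because “U.S.” and “U.N.” each end in a full stop followed by a space. The Punkt tokenizer downloaded above is designed to recognize abbreviations like these.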

Next, we import the sentence tokenizer from nltk and create a simple function to extract the first three sentences in a review. The convention in text summarization is to separate each summary with a newline, so let's also include this and test it on a training example:

In [30]:
from nltk.tokenize import sent_tokenize

def three_sentence_summary(text):
  return "\n".join(sent_tokenize(text)[:3])

print(three_sentence_summary(books_dataset["train"][2]["review_body"]))

I enjoyed reading this book tremendously.
I always liked reading about time travel, but usually you have to set logic aside.
Not in this book.


This seems to work, so let’s now implement a function that extracts these “summaries” from a dataset and computes the ROUGE scores for the baseline:

<a>Compute the metric between the first three sentences of review_body and review_title</a>

In [31]:
def evaluate_baseline(dataset, metric):
    summaries = [three_sentence_summary(text) for text in dataset["review_body"]]
    return metric.compute(predictions=summaries, references=dataset["review_title"])

We can then use this function to compute the ROUGE scores over the validation set and prettify them a bit using Pandas:

In [32]:
import pandas as pd

score = evaluate_baseline(books_dataset["validation"], rouge_score)
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
rouge_dict = dict((rn, round(score[rn].mid.fmeasure * 100, 2)) for rn in rouge_names)
rouge_dict

{'rouge1': 15.32, 'rouge2': 7.98, 'rougeL': 14.28, 'rougeLsum': 14.68}

We can see that <u>the rouge2 score is significantly lower</u> than the rest; this likely reflects the fact that review titles are typically concise and so the lead-3 baseline is too verbose. Now that we have a good baseline to work from, let’s turn our attention toward fine-tuning mT5!

# Fine-tuning mT5 with the Trainer API
Fine-tuning a model for summarization is very similar to the other tasks we’ve covered in this chapter. The first thing we need to do is load the pretrained model from the `mt5-small` checkpoint. Since summarization is a sequence-to-sequence task, we can load the model with the `AutoModelForSeq2SeqLM` class, which will automatically download and cache the weights:

In [33]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)


> 💡 If you’re wondering why you don’t see any warnings about fine-tuning the model on a downstream task, that’s because for sequence-to-sequence tasks we keep all the weights of the network. Compare this to our text classification model in [Chapter 3](https://huggingface.co/course/chapter3/1), where the head of the pretrained model was replaced with a randomly initialized network.


In [34]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
[1m[31mAuthenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store[0m


We’ll need to generate summaries in order to compute ROUGE scores during training. Fortunately, 🤗 Transformers provides dedicated [Seq2SeqTrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainingArguments) and [Seq2SeqTrainer](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainer) classes that can do this for us automatically! To see how this works, let’s first define the hyperparameters and other arguments for our experiments:

In [35]:
from transformers import Seq2SeqTrainingArguments

batch_size = 8
num_train_epochs = 7
# Show the training loss with every epoch
logging_steps = len(tokenized_datasets["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

args = Seq2SeqTrainingArguments(
    output_dir=f"{model_name}-finetuned-amazon-en-zh_TW",
    evaluation_strategy="epoch",
    learning_rate=5.6e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=num_train_epochs,
    predict_with_generate=True,
    logging_steps=logging_steps,
    push_to_hub=True,
)

Here, <a>the `predict_with_generate` argument has been set to indicate that we should generate summaries during evaluation so that we can compute ROUGE scores for each epoch</a>. As discussed in [Chapter 1](https://huggingface.co/course/chapter1/1), the decoder performs inference by predicting tokens one by one, and this is implemented by the model’s `generate()` method. <u>Setting `predict_with_generate=True` tells the Seq2SeqTrainer to use that method for evaluation</u>. We’ve also adjusted some of the default hyperparameters, like the learning rate, number of epochs, and weight decay, and we’ve <u>set the `save_total_limit` option to only save up to 2 checkpoints during training</u>. This is because even the “small” version of mT5 uses around a GB of hard drive space, and we can save a bit of room by limiting the number of copies we save.

The `push_to_hub=True` argument will allow us to push the model to the Hub after training; you’ll find the repository under your user profile in the location defined by `output_dir`. Note that you can specify the name of the repository you want to push to with the `hub_model_id` argument (in particular, you will have to use this argument to push to an organization). For instance, when we pushed the model to the huggingface-course organization, we added `hub_model_id="huggingface-course/mt5-finetuned-amazon-en-zh_TW"` to `Seq2SeqTrainingArguments`.

The next thing we need to do is provide the trainer with a `compute_metrics()` function so that we can evaluate our model during training. <u>For summarization this is a bit more involved than simply calling `rouge_score.compute()` on the model’s predictions</u>, since we need to <u>decode the outputs and labels into text</u> before we can compute the ROUGE scores. The following function does exactly that, and also makes use of the `sent_tokenize()` function from nltk to separate the summary sentences with newlines:

In [36]:
import numpy as np

def compute_metrics(eval_pred):
  predictions, labels = eval_pred
  
  # Replace -100 in the labels as we can't decode them
  labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
  # Decode reference summaries into text
  decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
  # ROUGE expects a newline after each sentence
  decoded_labels = ["\n".join(sent_tokenize(label.strip())) for label in decoded_labels]

  # Decode generated summaries into text
  decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)  
  decoded_preds = ["\n".join(sent_tokenize(pred.strip())) for pred in decoded_preds]

  # Compute ROUGE scores
  result = rouge_score.compute(
      predictions=decoded_preds, references=decoded_labels, use_stemmer=True
  )
  # Extract the median scores
  result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
  return {k: round(v, 4) for k, v in result.items()}  
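
The `-100` replacement step in the function above can be illustrated on toy data. The sketch below is a pure-Python equivalent of the `np.where` call; the token IDs are made up, and the padding ID of 0 is taken from the collator output discussed in this section.

```python
PAD_TOKEN_ID = 0      # padding ID assumed to be 0, as for the mT5 tokenizer
IGNORE_INDEX = -100   # value used to mask padded label positions in the loss

def replace_ignore_index(label_rows, pad_token_id=PAD_TOKEN_ID):
    """Pure-Python equivalent of np.where(labels != -100, labels, pad_token_id)."""
    return [
        [tok if tok != IGNORE_INDEX else pad_token_id for tok in row]
        for row in label_rows
    ]

# Two toy label sequences; the first one was padded with -100 by the collator.
labels = [[336, 11243, 1, -100, -100], [259, 262, 4302, 260, 1]]
print(replace_ignore_index(labels))
```

After the replacement, every entry is a valid vocabulary ID, and `skip_special_tokens=True` then drops the padding tokens during decoding.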


Next, we need to define a data collator for our sequence-to-sequence task. Since mT5 is an encoder-decoder Transformer model, <u>one subtlety with preparing our batches is that during decoding we need to shift the labels to the right by one</u>. <u>This is required to ensure that the decoder only sees the previous ground truth labels and not the current or future ones</u>, which would be easy for the model to memorize. This is <u>similar to how masked self-attention</u> is applied to the inputs in a task like [causal language modeling](https://huggingface.co/course/chapter7/6).
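
A minimal sketch of the right shift, assuming (as is the case for T5-style models, to our knowledge) that the decoder start token is the padding token with ID 0. The function `shift_right` and the token IDs are illustrative only.

```python
def shift_right(labels, decoder_start_token_id=0, pad_token_id=0):
    """Build decoder inputs by shifting the labels one position to the right."""
    shifted = [decoder_start_token_id] + labels[:-1]
    # Masked label positions (-100) are not valid token IDs, so use padding.
    return [pad_token_id if tok == -100 else tok for tok in shifted]

labels = [7483, 259, 2364, 1, -100, -100]  # toy IDs; 1 plays the role of </s>
decoder_input_ids = shift_right(labels)
print(decoder_input_ids)
```

At position `i` the decoder is fed `decoder_input_ids[i]`, which is the label from position `i - 1`, and is trained to predict `labels[i]`, so it never sees the token it is supposed to produce.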

Luckily, 🤗 Transformers provides a `DataCollatorForSeq2Seq` collator that will dynamically pad the inputs and the labels for us. To instantiate this collator, we simply need to provide the tokenizer and model:

In [37]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)


Let’s see what this collator produces when fed a small batch of examples. First, we need to remove the columns with strings because the collator won’t know how to pad these elements:

In [38]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 6701
    })
    validation: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 148
    })
    test: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 180
    })
})

In [39]:
tokenized_datasets = tokenized_datasets.remove_columns(
    books_dataset["train"].column_names
)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 6701
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 148
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 180
    })
})

Since the collator expects a list of dicts, where each dict represents a single example in the dataset, we also need to wrangle the data into the expected format before passing it to the data collator:



In [40]:
features = [tokenized_datasets["train"][i] for i in range(2)]
data_collator(features)

{'input_ids': tensor([[   336,  11243,    285,    461, 151964,    304,    714,   3435,    288,
           4996,    259,    262,   4302,    260,   1494,    339,    259,    262,
           2857,    259,  22243,    305,    259,  34200,   3435,    260,   3840,
            293,    263,  45511,  12878,    263,   8125,    305,    281,    281,
           1638,    898,   2108,    339,    259,    262,    259,  57344,  81156,
            265,    332,   4065,    288,  11243,   1638,    259,  42627,   3359,
            260,   1669,   1689,    263,    326,  43762,    288,  42522,   1866,
            339,    259,   1082,  12937,    299,    263,    790,  15070,    263,
            288,    390,    259,   1059,    288,  11243,   1638,    259,  42627,
           3359,    260,    259,   6397,    790,  15070,    263,    259,  47153,
           1001,  46378,   1001,  46378,   1866,    339,    281,   9848,    790,
            259,  16878,    332,   1001,    260,   1669,    259,  16878,    259,
           236

<a> üí° features Á∂ìÈÅé Collactor Èô§‰∫ÜÂ¢ûÂä†Áõ∏ÊáâÁöÑ [PAD] token Â§ñÔºåÈÇÑÂ¢ûÂä† 'decoder_input_ids' featureÔºåÈÄôÊòØÊää 'labels' Âè≥ÁßªÁï∂‰Ωú transformer decoder ÁöÑËº∏ÂÖ•</a>

The main thing to notice here is that the first example is longer than the second one, so the input_ids and attention_mask of the second example have been padded on the right with a [PAD] token (whose ID is 0). Similarly, we can see that the labels have been padded with -100s, to make sure the padding tokens are ignored by the loss function. And finally, <u>we can see a new decoder_input_ids which has shifted the labels to the right by inserting a [PAD] token in the first entry</u>.
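
The padding behavior described above can be mimicked in a few lines of plain Python. This is a simplified sketch, not the collator's actual code: `pad_batch` is a hypothetical helper, the token IDs are made up, and the real collator also returns tensors and adds `decoder_input_ids`.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Right-pad every field to the longest example in this batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
        n_label_pad = max_label - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_label_pad)
    return batch

features = [
    {"input_ids": [336, 11243, 285, 1], "attention_mask": [1, 1, 1, 1], "labels": [7483, 259, 1]},
    {"input_ids": [259, 1], "attention_mask": [1, 1], "labels": [298, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"][1], batch["attention_mask"][1], batch["labels"][1])
```

Padding to the longest example in each batch, rather than to a fixed global length, keeps the tensors as small as possible, which is what “dynamic” padding refers to.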

We finally have all the ingredients we need to train with! We now simply need to instantiate the trainer with the standard arguments:

In [40]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model, 
    args, 
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator, 
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Cloning https://huggingface.co/peterhsu/mt5-small-finetuned-amazon-en-zh_TW into local empty directory.



In [41]:
trainer.train()

***** Running training *****
  Num examples = 6701
  Num Epochs = 7
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 5866


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum |
|---|---|---|---|---|---|---|
| 1 | 7.5388 | 3.588842 | 12.6081 | 5.3611 | 12.3495 | 12.2926 |
| 2 | 4.0043 | 3.403824 | 13.8517 | 6.3417 | 13.4755 | 13.4913 |
| 3 | 3.6776 | 3.329428 | 15.1519 | 7.3842 | 14.8844 | 14.8458 |
| 4 | 3.4929 | 3.266764 | 15.6067 | 7.4016 | 15.3715 | 15.2908 |
| 5 | 3.387 | 3.285527 | 15.0546 | 7.3065 | 14.8271 | 14.7755 |
| 6 | 3.302 | 3.245691 | 15.0213 | 6.6597 | 14.6131 | 14.5641 |
| 7 | 3.2806 | 3.240798 | 15.8831 | 7.1676 | 15.5523 | 15.4954 |


Saving model checkpoint to mt5-small-finetuned-amazon-en-zh_TW/checkpoint-500
Configuration saved in mt5-small-finetuned-amazon-en-zh_TW/checkpoint-500/config.json
Model weights saved in mt5-small-finetuned-amazon-en-zh_TW/checkpoint-500/pytorch_model.bin
tokenizer config file saved in mt5-small-finetuned-amazon-en-zh_TW/checkpoint-500/tokenizer_config.json
Special tokens file saved in mt5-small-finetuned-amazon-en-zh_TW/checkpoint-500/special_tokens_map.json
Copy vocab file to mt5-small-finetuned-amazon-en-zh_TW/checkpoint-500/spiece.model
tokenizer config file saved in mt5-small-finetuned-amazon-en-zh_TW/tokenizer_config.json
Special tokens file saved in mt5-small-finetuned-amazon-en-zh_TW/special_tokens_map.json
Copy vocab file to mt5-small-finetuned-amazon-en-zh_TW/spiece.model
***** Running Evaluation *****
  Num examples = 148
  Batch size = 8
Saving model checkpoint to mt5-small-finetuned-amazon-en-zh_TW/checkpoint-1000
Configuration saved in mt5-small-finetuned-amazon-en-zh_TW/

TrainOutput(global_step=5866, training_loss=4.096298559184291, metrics={'train_runtime': 1943.3531, 'train_samples_per_second': 24.137, 'train_steps_per_second': 3.018, 'total_flos': 7986316915476480.0, 'train_loss': 4.096298559184291, 'epoch': 7.0})

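During training, you should see the training loss decrease and the ROUGE scores generally increase from epoch to epoch, although not necessarily monotonically (in the run above, the ROUGE scores dip at epochs 5 and 6 before improving again at epoch 7). Once the training is complete, you can see the final ROUGE scores by running `Trainer.evaluate()`: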

In [42]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 148
  Batch size = 8


{'epoch': 7.0,
 'eval_loss': 3.240798234939575,
 'eval_rouge1': 15.8831,
 'eval_rouge2': 7.1676,
 'eval_rougeL': 15.5523,
 'eval_rougeLsum': 15.4954,
 'eval_runtime': 7.1324,
 'eval_samples_per_second': 20.75,
 'eval_steps_per_second': 2.664}

In [43]:
trainer.push_to_hub(commit_message="Training complete", tags="summarization")

Saving model checkpoint to mt5-small-finetuned-amazon-en-zh_TW
Configuration saved in mt5-small-finetuned-amazon-en-zh_TW/config.json
Model weights saved in mt5-small-finetuned-amazon-en-zh_TW/pytorch_model.bin
tokenizer config file saved in mt5-small-finetuned-amazon-en-zh_TW/tokenizer_config.json
Special tokens file saved in mt5-small-finetuned-amazon-en-zh_TW/special_tokens_map.json
Copy vocab file to mt5-small-finetuned-amazon-en-zh_TW/spiece.model
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.


To https://huggingface.co/peterhsu/mt5-small-finetuned-amazon-en-zh_TW
   4bc7642..2a76b7f  main -> main

Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Sequence-to-sequence Language Modeling', 'type': 'text2text-generation'}, 'metrics': [{'name': 'Rouge1', 'type': 'rouge', 'value': 15.8831}]}
To https://huggingface.co/peterhsu/mt5-small-finetuned-amazon-en-zh_TW
   2a76b7f..e8ac447  main -> main



'https://huggingface.co/peterhsu/mt5-small-finetuned-amazon-en-zh_TW/commit/2a76b7fa61656e7a85cf9a13e0731767e8e0861d'

This will save the checkpoint and configuration files to `output_dir`, before uploading all the files to the Hub. By specifying the `tags` argument, we also ensure that the widget on the Hub will be one for a summarization pipeline instead of the default text generation one associated with the mT5 architecture (for more information about model tags, see the 🤗 [Hub documentation](https://huggingface.co/docs/hub/main#how-is-a-models-type-of-inference-api-and-widget-determined)). The output from `trainer.push_to_hub()` is a URL to the Git commit hash, so you can easily see the changes that were made to the model repository!

To wrap up this section, let’s take a look at how we can also fine-tune mT5 using the low-level features provided by 🤗 Accelerate.

# Fine-tuning mT5 with 🤗 Accelerate
Fine-tuning our model with 🤗 Accelerate is very similar to the text classification example we encountered in [Chapter 3](https://huggingface.co/course/chapter3/1). The main differences will be the need to explicitly <u>generate our summaries during training</u> and <u>define how we compute the ROUGE scores</u> (<u>recall that the Seq2SeqTrainer</u> took care of the generation for us). Let’s take a look at how we can implement these two requirements within 🤗 Accelerate!

## Preparing everything for training
The first thing we need to do is create a DataLoader for each of our splits. Since the PyTorch dataloaders expect batches of tensors, we need to set the format to "torch" in our datasets:

In [41]:
tokenized_datasets.set_format("torch")

Now that we’ve got datasets consisting of just tensors, the next thing to do is instantiate the `DataCollatorForSeq2Seq` again. For this we need to provide a fresh version of the model, so let’s load it again from our cache:

In [42]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

We can then instantiate the data collator and use this to define our dataloaders:

In [43]:
from torch.utils.data import DataLoader

batch_size = 8
train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=batch_size,
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], collate_fn=data_collator,
    batch_size=batch_size,
)


The next thing to do is define the optimizer we want to use. As in our other examples, we’ll use AdamW, which works well for most problems:

In [44]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)


Finally, we feed our model, optimizer, and dataloaders to the `accelerator.prepare()` method:

In [45]:
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

Now that we’ve prepared our objects, there are three remaining things to do:

1.   Define the learning rate schedule.
2.   Implement a function to post-process the summaries for evaluation.
3.   Create a repository on the Hub that we can push our model to.


For the learning rate schedule, we’ll use the standard linear one from previous sections:

In [46]:
from transformers import get_scheduler

num_train_epochs = 10
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
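
To get a feel for what the linear schedule does, here is a small sketch of the multiplier it applies to the base learning rate, as we understand it: a linear ramp during warmup (none here) followed by a linear decay to zero. `linear_lr_factor` is a hypothetical helper, and the step count assumes a single device, where 6,701 training examples in batches of 8 give 838 steps per epoch.

```python
def linear_lr_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
num_training_steps = 8380  # assumed: 10 epochs x 838 batches per epoch

for step in (0, 4190, 8380):
    print(step, base_lr * linear_lr_factor(step, 0, num_training_steps))
```

With no warmup, the learning rate starts at its full value of 2e-5, reaches half of that midway through training, and hits zero at the final step.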


For post-processing, we need a function that <u>splits the generated summaries into sentences that are separated by newlines</u>. <u>This is the format the ROUGE metric expects</u>, and we can achieve this with the following snippet of code:

In [47]:
def postprocess_text(preds, labels):
  preds = [pred.strip() for pred in preds]
  labels = [label.strip() for label in labels]

  # ROUGE expects a newline after each sentence
  preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
  labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels]

  return preds, labels
  

This should look familiar to you if you recall how we defined the compute_metrics() function of the Seq2SeqTrainer.  


Finally, we need to create a model repository on the Hugging Face Hub. For this, we can use the appropriately titled 🤗 Hub library. We just need to define a name for our repository, and the library has a utility function to combine the repository ID with the user profile:


In [48]:
from huggingface_hub import get_full_repo_name

model_name = "test-bert-finetuned-en-zh_TW-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name

'peterhsu/test-bert-finetuned-en-zh_TW-accelerate'
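The output above suggests the shape of the result: a namespace, a slash, and the repository name. The following sketch reproduces that string offline. The helper `full_repo_name_sketch` is hypothetical; the real `get_full_repo_name()` looks up the username from your stored Hub credentials rather than taking it as an argument.

```python
def full_repo_name_sketch(model_id, username, organization=None):
    """Hypothetical offline illustration of combining a namespace and a repo name."""
    # Prefer the organization as the namespace when one is given
    namespace = organization if organization is not None else username
    return f"{namespace}/{model_id}"


print(full_repo_name_sketch("test-bert-finetuned-en-zh_TW-accelerate", "peterhsu"))
```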

Now we can use this repository name to clone a local version to our results directory that will store the training artifacts:

In [49]:
from huggingface_hub import Repository

output_dir = "results-mt5-finetuned-squad-accelerate"
repo = Repository(output_dir, clone_from=repo_name)

Cloning https://huggingface.co/peterhsu/test-bert-finetuned-en-zh_TW-accelerate into local empty directory.


Download file pytorch_model.bin:   0%|          | 2.83k/1.12G [00:00<?, ?B/s]

Download file spiece.model:   0%|          | 1.58k/4.11M [00:00<?, ?B/s]

Download file tokenizer.json:   0%|          | 16.0k/15.6M [00:00<?, ?B/s]

Clean file spiece.model:   0%|          | 1.00k/4.11M [00:00<?, ?B/s]

Clean file tokenizer.json:   0%|          | 1.00k/15.6M [00:00<?, ?B/s]

Clean file pytorch_model.bin:   0%|          | 1.00k/1.12G [00:00<?, ?B/s]

This will allow us to push the artifacts back to the Hub by calling the `repo.push_to_hub()` method during training! Let’s now wrap up our analysis by writing out the training loop.


## Training loop
The training loop for summarization is quite similar to the other 🤗 Accelerate examples that we’ve encountered and is roughly split into four main steps:

1.   Train the model by iterating over all the examples in `train_dataloader` for each epoch.
2.   Generate model summaries at the end of each epoch, by first generating the tokens and then decoding them (and the reference summaries) into text.
3.   Compute the ROUGE scores using the same techniques we saw earlier.
4.   Save the checkpoints and push everything to the Hub. Here we rely on the nifty `blocking=False` argument of the `Repository` object so that we can push the checkpoints per epoch asynchronously. This allows us to continue training without having to wait for the somewhat slow upload associated with a GB-sized model!

These steps can be seen in the following block of code:

Use [unwrap_model()](https://huggingface.co/docs/accelerate/accelerator.html#accelerate.Accelerator.unwrap_model) to unwrap your model before saving it.

In [50]:
from tqdm.auto import tqdm
import torch
import numpy as np

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
  #Training
  model.train()
  for step, batch in enumerate(train_dataloader):
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)

    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()
    progress_bar.update(1)


  # Evaluation
  model.eval()
  for step, batch in enumerate(eval_dataloader):
    with torch.no_grad():
      generated_tokens = accelerator.unwrap_model(model).generate(
          batch["input_ids"],
          attention_mask=batch["attention_mask"],
      )

      generated_tokens = accelerator.pad_across_processes(
          generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
      )
      # If we did not pad to max length, we need to pad the labels too
      labels = accelerator.pad_across_processes(
          batch["labels"], dim=1, pad_index=tokenizer.pad_token_id
      )

      # Collect predictions and labels from all processes
      generated_tokens = accelerator.gather(generated_tokens).cpu().numpy()
      labels = accelerator.gather(labels).cpu().numpy()

      # Replace -100 in the labels as we can't decode them
      labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
      if isinstance(generated_tokens, tuple):
        generated_tokens = generated_tokens[0]
      
      decoded_preds = tokenizer.batch_decode(
          generated_tokens, skip_special_tokens=True
      )
      decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

      decoded_preds, decoded_labels = postprocess_text(
          decoded_preds, decoded_labels
      )

      rouge_score.add_batch(predictions=decoded_preds, references=decoded_labels)

  # Compute metrics
  result = rouge_score.compute()
  # Extract the median ROUGE scores
  result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
  result = {k: round(v, 4) for k, v in result.items()}
  print(f"Epoch {epoch}:", result)

  # Save and upload
  accelerator.wait_for_everyone()
  unwrapped_model = accelerator.unwrap_model(model)
  unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
  if accelerator.is_main_process:
    tokenizer.save_pretrained(output_dir)
    repo.push_to_hub(
        commit_message=f"Training in progress epoch {epoch}",
        blocking=False
    )


  0%|          | 0/8380 [00:00<?, ?it/s]

Epoch 0: {'rouge1': 1.9047, 'rouge2': 0.2977, 'rougeL': 1.8652, 'rougeLsum': 1.8626}
Epoch 1: {'rouge1': 2.4156, 'rouge2': 0.1866, 'rougeL': 2.313, 'rougeLsum': 2.3613}


Several commits (2) will be pushed upstream.


Epoch 2: {'rouge1': 2.6376, 'rouge2': 0.0901, 'rougeL': 2.6515, 'rougeLsum': 2.6535}


Several commits (3) will be pushed upstream.


Epoch 3: {'rouge1': 2.7015, 'rouge2': 0.1931, 'rougeL': 2.6353, 'rougeLsum': 2.6542}


Several commits (4) will be pushed upstream.


Epoch 4: {'rouge1': 8.3552, 'rouge2': 3.3531, 'rougeL': 8.4159, 'rougeLsum': 8.4451}


Several commits (5) will be pushed upstream.


Epoch 5: {'rouge1': 9.256, 'rouge2': 3.9193, 'rougeL': 9.2803, 'rougeLsum': 9.3266}


Several commits (6) will be pushed upstream.


Epoch 6: {'rouge1': 9.4027, 'rouge2': 3.8037, 'rougeL': 9.432, 'rougeLsum': 9.426}


Several commits (7) will be pushed upstream.


Epoch 7: {'rouge1': 11.1427, 'rouge2': 4.6925, 'rougeL': 11.165, 'rougeLsum': 11.1734}


Several commits (8) will be pushed upstream.


Epoch 8: {'rouge1': 11.2515, 'rouge2': 5.179, 'rougeL': 11.1607, 'rougeLsum': 11.1952}


Several commits (9) will be pushed upstream.


Epoch 9: {'rouge1': 11.4723, 'rouge2': 5.179, 'rougeL': 11.4448, 'rougeLsum': 11.457}


Several commits (10) will be pushed upstream.


The training loop looks a lot like the ones in [section 2](https://huggingface.co/course/chapter7/2) and [Chapter 3](https://huggingface.co/course/chapter3/1), with a few differences in the evaluation part, so let’s focus on that!
   

The first thing to note is that we use the [generate() method](https://huggingface.co/docs/transformers/v4.17.0/en/main_classes/model#transformers.generation_utils.GenerationMixin.generate) ([see also](https://huggingface.co/docs/transformers/internal/generation_utils)) to compute predictions, but this is a method on our base model, not the wrapped model 🤗 Accelerate created in the [prepare() method](https://huggingface.co/docs/accelerate/accelerator.html#accelerate.Accelerator.prepare). That’s why we [unwrap()](https://huggingface.co/docs/accelerate/accelerator.html#accelerate.Accelerator.unwrap_model) the model first, then call this method.

  
The second thing is that, like with [token classification](https://huggingface.co/course/chapter7/2), two processes may have padded the inputs and labels to different shapes, so we use [accelerator.pad_across_processes()](https://huggingface.co/docs/accelerate/accelerator.html#accelerate.Accelerator.pad_across_processes) to make the predictions and labels the same shape before calling the [gather() method](https://huggingface.co/docs/accelerate/accelerator.html#accelerate.Accelerator.gather). If we don’t do this, the evaluation will either error out or hang forever.
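The idea behind padding before gathering can be sketched with plain Python lists. This is only an illustration under simplifying assumptions: the helper `pad_to_common_length` is hypothetical, the "processes" are just two lists in one program, and the pad index 0 is chosen for the example. The real Accelerate methods operate on tensors and exchange sizes between processes.

```python
def pad_to_common_length(per_process_batches, pad_index):
    """Pad every sequence so that all processes share the same length along dim=1.

    Hypothetical helper for illustration only.
    """
    max_len = max(len(seq) for batch in per_process_batches for seq in batch)
    return [
        [seq + [pad_index] * (max_len - len(seq)) for seq in batch]
        for batch in per_process_batches
    ]


# Two simulated processes whose batches have different sequence lengths
proc0 = [[5, 6, 7], [8, 9, 0]]
proc1 = [[1, 2, 3, 4, 5], [6, 7, 0, 0, 0]]

padded = pad_to_common_length([proc0, proc1], pad_index=0)

# "Gathering" is then a simple concatenation along the batch dimension
gathered = [seq for batch in padded for seq in batch]
print(gathered)
```

Without the padding step, the rows from the two processes would have lengths 3 and 5 and could not be stacked into one rectangular array.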

In [None]:
repo.git_add()  # stage all files
# Commit the staged files with a message, if needed
# repo.git_commit()
repo.git_push()  # push to the Hub

Several commits (10) will be pushed upstream.
The progress bars may be unreliable.


Upload file pytorch_model.bin:   0%|          | 3.36k/1.12G [00:00<?, ?B/s]

In [None]:
from transformers import pipeline

# Change the username to your Hub profile
hub_model_id = "peterhsu/test-bert-finetuned-en-zh_TW-accelerate"
summarizer = pipeline("summarization", model=hub_model_id)

In [None]:
def print_summary(idx):
  review = books_dataset["test"][idx]["review_body"]
  title = books_dataset["test"][idx]["review_title"]
  summary = summarizer(review)[0]["summary_text"]
  print(f"'>>> Review: {review}'")
  print(f"\n'>>> Title: {title}'")
  print(f"\n'>>> Summary: {summary}'")

In [None]:
print_summary(124)