In [None]:
# default_exp data.text2text.summarization

In [None]:
#hide
%reload_ext autoreload
%autoreload 2
%matplotlib inline

# data.text2text.summarization

> This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data for summarization tasks using architectures like BART and T5.

In [None]:
#export
import ast
from functools import reduce

import torch
from transformers import *
from fastai.text.all import *

from blurr.utils import *
from blurr.data.core import *
from blurr.data.text2text.core import *

logging.set_verbosity_error()

In [None]:
#hide
import pdb

from nbdev.showdoc import *
from fastcore.test import *

from fastai import __version__ as fa_version
from torch import __version__ as pt_version
from transformers import __version__ as hft_version

print(f'Using pytorch {pt_version}')
print(f'Using fastai {fa_version}')
print(f'Using transformers {hft_version}')

Using pytorch 1.7.0
Using fastai 2.1.5
Using transformers 3.4.0


In [None]:
#cuda
torch.cuda.set_device(1)
print(f'Using GPU #{torch.cuda.current_device()}: {torch.cuda.get_device_name()}')

Using GPU #1: GeForce GTX 1080 Ti


## Summarization tokenization, batch transform, and DataBlock methods

Summarization tasks attempt to generate a human-understandable and sensible representation of a larger body of text (e.g., capture the meaning of a larger document in 1-3 sentences).

In [None]:
path = Path('./')
cnndm_df = pd.read_csv(path/'cnndm_sample.csv'); len(cnndm_df)

1000

In [None]:
cnndm_df.head(2)

Unnamed: 0,article,highlights,ds_type
0,"(CNN) -- Globalization washes like a flood over the world's cultures and economies. Floods can be destructive; however, they can also bring blessings, as the annual floods of the Nile did for ancient Egypt. The world's great universities can be crucial instruments in shaping, in a positive way, humankind's reaction to globalization and the development of humankind itself. Traditionally, universities have been defined and limited by location, creating an academic community and drawing students and scholars to that place. Eventually, some universities began to encourage students to study el...","John Sexton: Traditionally, universities have been defined and limited by location .\nGlobal campuses form a network of thought, innovation, he writes .\nFaculty can teach, Sexton says, students can team up in many cities at once .\nSexton: Research, scholarship can be shared and cultural ties made in ""century of knowledge""",train
1,"(CNN) -- Armenian President Robert Kocharian declared a state of emergency Saturday night after a day of clashes between police and protesters, a spokeswoman for the Armenian Foreign Ministry said. Opposition supporters wave an Armenian flag during a protest rally in Yerevan, Armenia, on Saturday. The protesters claim last month's presidential election was rigged. The state of emergency will ""hopefully bring some order"" to the capital, Yerevan, said Salpi Ghazarian, assistant to the Armenian foreign minister, who spoke to CNN early Sunday. The state of emergency could last until March 20, ...","NEW: Protest moves after crackdown at Freedom Square .\nOrder sought after protests over last month's election turn violent .\nDemonstrators say the election was fraudulent .\nState of emergency could last until March 20, official says .",train


In [None]:
pretrained_model_name = "facebook/bart-large-cnn"

hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name, 
                                                                               model_cls=BartForConditionalGeneration)

hf_arch, type(hf_tokenizer), type(hf_config), type(hf_model)

('bart',
 transformers.tokenization_bart.BartTokenizer,
 transformers.configuration_bart.BartConfig,
 transformers.modeling_bart.BartForConditionalGeneration)

We create a subclass of `HF_BeforeBatchTransform` for summarization tasks to add `decoder_input_ids` and `labels` to our inputs during training, which in turn allows the Hugging Face model to calculate the loss for us. See [here](https://huggingface.co/transformers/model_doc/bart.html#transformers.BartModel.forward) for more information on how these additional inputs are used in summarization and conversational training tasks.

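Note also that `labels` is simply the target `input_ids` offset by one position relative to `decoder_input_ids` (`input_ids[1:]` versus `input_ids[:-1]`), since the task is to predict the next token based on the current (and all previous) `decoder_input_ids`. Padding positions in `labels` are set to `-100` so that they are ignored when the loss is calculated.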

And lastly, we also update our targets to just be the `input_ids` of our target sequence so that fastai's `Learner.show_results` works (again, almost all the fastai bits require returning a single tensor to work).
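As a minimal sketch of the offset described above, the snippet below builds `decoder_input_ids` and `labels` from one target sequence. Plain Python lists stand in for tensors, and the function name and token ids are hypothetical, chosen only for illustration; it is not the transform's actual implementation.

```python
# -100 is the default ignore_index of PyTorch's cross-entropy loss
IGNORE_INDEX = -100

def build_decoder_inputs_and_labels(target_ids, pad_token_id):
    """Hypothetical helper: split one target sequence into decoder inputs and labels."""
    # The decoder sees every token except the last one ...
    decoder_input_ids = target_ids[:-1]
    # ... and is asked to predict every token except the first one.
    # Padding positions are replaced so the loss skips them.
    labels = [IGNORE_INDEX if t == pad_token_id else t for t in target_ids[1:]]
    return decoder_input_ids, labels

# Illustrative ids only: 0 = start, 2 = end, 1 = padding
target_ids = [0, 713, 16, 10, 2, 1, 1]
decoder_input_ids, labels = build_decoder_inputs_and_labels(target_ids, pad_token_id=1)

print(decoder_input_ids)  # [0, 713, 16, 10, 2, 1]
print(labels)             # [713, 16, 10, 2, -100, -100]
```

At every position `i`, `labels[i]` is the token that follows `decoder_input_ids[i]` in the original target sequence, which is what next-token prediction requires.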

In [None]:
#export
class HF_SummarizationInput(HF_BaseInput): pass

In [None]:
#export
class HF_SummarizationBeforeBatchTransform(HF_BeforeBatchTransform):
    
    def __init__(self, hf_arch, hf_tokenizer, max_length=None, padding=True, truncation=True, 
                 is_split_into_words=False, n_tok_inps=2, tok_kwargs={}, **kwargs):
                 
        super().__init__(hf_arch, hf_tokenizer, max_length=max_length, padding=padding, truncation=truncation, 
                         is_split_into_words=is_split_into_words, n_tok_inps=n_tok_inps, 
                         tok_kwargs=tok_kwargs.copy(), **kwargs)
        
    def encodes(self, samples):  
        samples = super().encodes(samples)
        if (len(samples[0]) == 1): return samples
        
        updated_samples = []
        for s in samples:
            s[0]['decoder_input_ids'] = s[1]['input_ids'][:-1].clone()
            s[0]['labels'] = s[1]['input_ids'][1:].clone()
            s[0]['labels'][s[0]['labels'] == self.hf_tokenizer.pad_token_id] = -100
            
            targ_ids = s[1]['input_ids']
            
            updated_samples.append((s[0], targ_ids))
        
        return updated_samples

We had to override the `encodes` method above because, while both our inputs and targets are technically the same things, we update the latter to consist of *only* the target `input_ids` so that methods like `Learner.show_results` work.

In [None]:
before_batch_tfm = HF_SummarizationBeforeBatchTransform(hf_arch, hf_tokenizer)
blocks = (HF_Text2TextBlock(before_batch_tfms=before_batch_tfm, input_return_type=HF_SummarizationInput), noop)

dblock = DataBlock(blocks=blocks, 
                   get_x=ColReader('article'), 
                   get_y=ColReader('highlights'), 
                   splitter=RandomSplitter())

Two lines! Notice we pass in `noop` for our targets (i.e., our summaries) because the batch transform will take care of both our inputs and targets.
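To illustrate why `noop` is enough for the targets, here is a rough sketch in plain Python. The toy whitespace tokenizer and both function names are assumptions made up for this example and are not part of blurr, fastai, or transformers. The point is only that raw `(text, summary)` string pairs can travel untouched until a batch-level step tokenizes both sides and returns the target as a single sequence of ids.

```python
def toy_tokenize(text, vocab):
    """Hypothetical stand-in for a real tokenizer: one id per whitespace token."""
    return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]

def toy_before_batch(samples, vocab):
    """Hypothetical batch-level step: tokenize the input and the target of each pair."""
    batch = []
    for text, summary in samples:
        inputs = {'input_ids': toy_tokenize(text, vocab)}
        # The target is returned as a bare sequence of ids, not a dictionary
        target_ids = toy_tokenize(summary, vocab)
        batch.append((inputs, target_ids))
    return batch

# With a no-op target block, samples reach the batch step as raw strings
samples = [("the cat sat on the mat", "cat sat")]
batch = toy_before_batch(samples, vocab={})

print(batch[0][0]['input_ids'])  # [0, 1, 2, 3, 0, 4]
print(batch[0][1])               # [1, 2]
```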

In [None]:
# dblock.summary(cnndm_df)

In [None]:
dls = dblock.dataloaders(cnndm_df, bs=4)

In [None]:
b = dls.one_batch()

In [None]:
len(b), b[0]['input_ids'].shape, b[1].shape

(2, torch.Size([4, 1024]), torch.Size([4, 78]))

In [None]:
#export
@typedispatch
def show_batch(x:HF_SummarizationInput, y, samples, dataloaders, ctxs=None, max_n=6, input_trunc_at=None, target_trunc_at=None, **kwargs):  
    hf_tokenizer = dataloaders.before_batch[0].hf_tokenizer
    
    res = L([ (hf_tokenizer.decode(s[0], skip_special_tokens=True)[:input_trunc_at], hf_tokenizer.decode(s[1], skip_special_tokens=True)[:target_trunc_at])
             for s in samples ])      
    
    display_df(pd.DataFrame(res, columns=['text', 'target'])[:max_n])
    return ctxs

In [None]:
dls.show_batch(dataloaders=dls, max_n=2, input_trunc_at=1000, target_trunc_at=250)

Unnamed: 0,text,target
0,"(CNN) -- Home to up to 10 percent of all known species, Mexico is recognized as one of the most biodiverse regions on the planet. The twin threats of climate change and human encroachment on natural environments are, however, threatening the existence of the country's rich wildlife. And there is a great deal to lose. In the United Nations Environment Program (UNEP) World Conservation Monitoring Centre's list of megadiverse countries Mexico ranks 11th. The list represents a group of 17 countries that harbor the majority of the Earth's species and are therefore considered extremely biodiverse. From its coral reefs in the Caribbean Sea to its tropical jungles in Chiapas and the Yucatan peninsula and its deserts and prairies in the north, Mexico boasts an incredibly rich variety of flora and fauna. Some 574 out of 717 reptile species found in Mexico -- the most in any country -- can only be encountered within its borders. It is home to 502 types of mammals, 290 species of birds, 1,150 var","Mexico hosts to up to 10 percent of all known species on Earth.\nIt is home to 502 types of mammals, 290 bird species and 26,000 types of plants.\nHuman development and climate change is placing a big strain on its biodiversity.\nThe Golden Eagle is un"
1,"Some U.S. officials this year are expected to get smartphones capable of handling classified government documents over cellular networks, according to people involved in the project. The phones will run a modified version of Google's Android software, which is being developed as part of an initiative that spans multiple federal agencies and government contractors, these people said. The smartphones are first being deployed to U.S. soldiers, people familiar with the project said. Later, federal agencies are expected to get phones for sending and receiving government cables while away from their offices, sources said. Eventually, local governments and corporations could give workers phones with similar software. The Army has been testing touchscreen devices at U.S. bases for nearly two years, said Michael McCarthy, a director for the Army's Brigade Modernization Command, in a phone interview. About 40 phones were sent to fighters overseas a year ago, and the Army plans to ship 50 more p","Government, military officials to get Android phones capable of sharing secret documents.\nThe phones will run a modified version of Google's Android software, sources say.\nContractor: Google ""more cooperative"" than Apple working with government on p"


## Tests

The tests below ensure the core DataBlock code above works for **all** pretrained summarization models available in Hugging Face. These tests are excluded from the CI workflow because of how long they would take to run and the amount of data that would need to be downloaded.

**Note**: Feel free to modify the code below to test whatever pretrained summarization models you are working with. If any of your pretrained summarization models fail, please submit a GitHub issue *(or a PR if you'd like to fix it yourself)*.

In [None]:
BLURR_MODEL_HELPER.get_models(task='ConditionalGeneration')

[transformers.modeling_bart.BartForConditionalGeneration,
 transformers.modeling_blenderbot.BlenderbotForConditionalGeneration,
 transformers.modeling_fsmt.FSMTForConditionalGeneration,
 transformers.modeling_mbart.MBartForConditionalGeneration,
 transformers.modeling_pegasus.PegasusForConditionalGeneration,
 transformers.modeling_prophetnet.ProphetNetForConditionalGeneration,
 transformers.modeling_t5.T5ForConditionalGeneration,
 transformers.modeling_xlm_prophetnet.XLMProphetNetForConditionalGeneration]

In [None]:
pretrained_model_names = [
    ('facebook/bart-base',BartForConditionalGeneration),
    ('t5-small', T5ForConditionalGeneration),
    ('google/pegasus-cnn_dailymail', PegasusForConditionalGeneration)
]

In [None]:
path = Path('./')
cnndm_df = pd.read_csv(path/'cnndm_sample.csv')

In [None]:
#slow
#hide_output
task = HF_TASKS_ALL.ConditionalGeneration
bsz = 2
seq_sz = 256
trg_seq_sz = 40

test_results = []
for model_name, model_cls in pretrained_model_names:
    error=None
    
    print(f'=== {model_name} ===\n')
    
    hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(model_name, 
                                                                                   task=task, 
                                                                                   model_cls=model_cls)
    print(f'architecture:\t{hf_arch}\ntokenizer:\t{type(hf_tokenizer).__name__}\n')
    
    before_batch_tfm = HF_SummarizationBeforeBatchTransform(hf_arch, hf_tokenizer, 
                                                            padding='max_length', 
                                                            max_length=[seq_sz, trg_seq_sz])
    
    blocks = (HF_TextBlock(before_batch_tfms=before_batch_tfm, input_return_type=HF_SummarizationInput), noop)
    
    def add_t5_prefix(inp): return f'summarize: {inp}' if (hf_arch == 't5') else inp

    dblock = DataBlock(blocks=blocks, 
                   get_x=Pipeline([ColReader('article'), add_t5_prefix]), 
                   get_y=ColReader('highlights'), 
                   splitter=RandomSplitter())

    dls = dblock.dataloaders(cnndm_df, bs=bsz) 
    b = dls.one_batch()
    
    try:
        print('*** TESTING DataLoaders ***\n')
        test_eq(len(b), 2)
        test_eq(len(b[0]['input_ids']), bsz)
        test_eq(b[0]['input_ids'].shape, torch.Size([bsz, seq_sz]))
        test_eq(len(b[1]), bsz)
        test_eq(b[1].shape, torch.Size([bsz, trg_seq_sz]))

        if (hasattr(hf_tokenizer, 'add_prefix_space')):
            test_eq(dls.before_batch[0].tok_kwargs['add_prefix_space'], True)
            
        test_results.append((hf_arch, type(hf_tokenizer).__name__, model_name, 'PASSED', ''))
        dls.show_batch(dataloaders=dls, max_n=2, input_trunc_at=1000)
        
    except Exception as err:
        test_results.append((hf_arch, type(hf_tokenizer).__name__, model_name, 'FAILED', err))

=== facebook/bart-base ===

architecture:	bart
tokenizer:	BartTokenizer

*** TESTING DataLoaders ***



Unnamed: 0,text,target
0,"Dan Condon believes in recycling. Just not when it comes to his hotel towels. Condon composts when he's at home in Boulder, Colorado. He eats local, organic and fair-trade food and drives a Honda CR-Z hybrid sports car. You might call him green. Except he's not so green when he travels for his work at an education nonprofit and stays in a hotel, which happens about 10 weeks per year. There, he uses a new towel every day. And don't try to bribe him with a drink or dessert coupon to get him to reuse the same one. ""I could care less about rewards for environmentally conscious behavior unless it's miles,"" Condon wrote in an e-mail. If hotels can't convince a hybrid-driving recycling enthusiast like Condon to go green while traveling, how can they possibly convince everyone else? 9 glamorous movie-star hotels. That's the problem of hotels trying to ""green"" your hotel stay. After guests have paid a pretty penny for a night at the inn, even the most environmental guests may want to treat the","Hotel guests who ""go green"" are happier with their stay.\nIncreasing water and energy costs are pushing hotels to cut costs wherever they can.\nMany hotels find that guests don't"
1,"Some U.S. officials this year are expected to get smartphones capable of handling classified government documents over cellular networks, according to people involved in the project. The phones will run a modified version of Google's Android software, which is being developed as part of an initiative that spans multiple federal agencies and government contractors, these people said. The smartphones are first being deployed to U.S. soldiers, people familiar with the project said. Later, federal agencies are expected to get phones for sending and receiving government cables while away from their offices, sources said. Eventually, local governments and corporations could give workers phones with similar software. The Army has been testing touchscreen devices at U.S. bases for nearly two years, said Michael McCarthy, a director for the Army's Brigade Modernization Command, in a phone interview. About 40 phones were sent to fighters overseas a year ago, and the Army plans to ship 50 more p","Government, military officials to get Android phones capable of sharing secret documents.\nThe phones will run a modified version of Google's Android software, sources say.\nContractor: Google ""more"


=== t5-small ===

architecture:	t5
tokenizer:	T5Tokenizer

*** TESTING DataLoaders ***



Unnamed: 0,text,target
0,"summarize: (CNN) -- Home to up to 10 percent of all known species, Mexico is recognized as one of the most biodiverse regions on the planet. The twin threats of climate change and human encroachment on natural environments are, however, threatening the existence of the country's rich wildlife. And there is a great deal to lose. In the United Nations Environment Program (UNEP) World Conservation Monitoring Centre's list of megadiverse countries Mexico ranks 11th. The list represents a group of 17 countries that harbor the majority of the Earth's species and are therefore considered extremely biodiverse. From its coral reefs in the Caribbean Sea to its tropical jungles in Chiapas and the Yucatan peninsula and its deserts and prairies in the north, Mexico boasts an incredibly rich variety of flora and fauna. Some 574 out of 717 reptile species found in Mexico -- the most in any country -- can only be encountered within its borders. It is home to 502 types of mammals, 290 species of birds,","Mexico hosts to up to 10 percent of all known species on Earth. It is home to 502 types of mammals, 290 bird species and 26,000 types of plants. Human development"
1,"summarize: Dan Condon believes in recycling. Just not when it comes to his hotel towels. Condon composts when he's at home in Boulder, Colorado. He eats local, organic and fair-trade food and drives a Honda CR-Z hybrid sports car. You might call him green. Except he's not so green when he travels for his work at an education nonprofit and stays in a hotel, which happens about 10 weeks per year. There, he uses a new towel every day. And don't try to bribe him with a drink or dessert coupon to get him to reuse the same one. ""I could care less about rewards for environmentally conscious behavior unless it's miles,"" Condon wrote in an e-mail. If hotels can't convince a hybrid-driving recycling enthusiast like Condon to go green while traveling, how can they possibly convince everyone else? 9 glamorous movie-star hotels. That's the problem of hotels trying to ""green"" your hotel stay. After guests have paid a pretty penny for a night at the inn, even the most environmental guests may want to","Hotel guests who ""go green"" are happier with their stay. Increasing water and energy costs are pushing hotels to cut costs wherever they can. Many hotels find that guests don't"


=== google/pegasus-cnn_dailymail ===

architecture:	pegasus
tokenizer:	PegasusTokenizer

*** TESTING DataLoaders ***



Unnamed: 0,text,target
0,"London (CNN) -- In 1948, a hospital outside London witnessed the birth of the Paralympic movement, as a Jewish doctor who had fled Nazi Germany sought to change the lives of patients with spinal injuries -- and inspire new hope in them through sport. The first ""Stoke Mandeville Games"" were organized in 1948 to coincide with the London Olympics, the second to be held in Britain. Named for the hospital in Buckinghamshire where Prof. Ludwig Guttmann's pioneering spinal injuries unit was based, the competitors in those initial Games -- 14 men and two women -- took part in a wheelchair archery contest. Many were military veterans injured on the battlefields of World War II. Just a year later, six teams competed at Stoke Mandeville -- with wheelchair netball, a forerunner of wheelchair basketball, being introduced -- as sport became a central part of a rehabilitation process that had been revolutionized by Guttmann. In 1956, a ""statement of intent"" was unveiled for the Games, which were by t","Paralympic movement was born in Stoke Mandeville, outside London, in 1948. 2012 Games will be the biggest yet, with 4,200 competitors from 165 countries. In an echo of the first,"
1,"Giglio, Italy (CNN) -- The nautical blue paint spelling out ""Costa Concordia"" has almost all bubbled and chipped off the bow of the once luxurious cruise liner after 20 months under salt water off the Italian island of Giglio. One can get glimpse of just what it's like in and under the Concordia by the vast array of mesmerizing underwater videos released by Italy's coast guard and the Titan Micoperi salvage team tasked with removing the rusting hulk. The seabed is still littered with sun deck chairs that floated from the ship's balconies and upper deck when it finally came to a rest in January 2012. Fish swim around the sunbed legs and seaweed has grown through some of the mesh seating. The beds are spread out in a surreal scene that looks like a set from an underwater science fiction film. Shoes, mattresses, dinner plates and thousands of pieces of cutlery shimmer in the divers' lights on a bed of sea grass. INTERACTIVE: How the 952-foot wreck will be raised. Divers have not been deep","Costa Concordia ran aground off Giglio in January 2012, killing 32 people. Bodies of two victims may still be inside the rotting 952-foot wreck. Divers call inside of ship a"


In [None]:
#slow
#hide_input
test_results_df = pd.DataFrame(test_results, columns=['arch', 'tokenizer', 'model_name', 'result', 'error'])
display_df(test_results_df)

Unnamed: 0,arch,tokenizer,model_name,result,error
0,bart,BartTokenizer,facebook/bart-base,PASSED,
1,t5,T5Tokenizer,t5-small,PASSED,
2,pegasus,PegasusTokenizer,google/pegasus-cnn_dailymail,PASSED,


## Cleanup

In [None]:
#hide
from nbdev.export import notebook2script
notebook2script()

Converted 00_utils.ipynb.
Converted 01_data-core.ipynb.
Converted 01a_data-token-classification.ipynb.
Converted 01b_data-question-answering.ipynb.
Converted 01za_data-text2text-core.ipynb.
Converted 01zb_data-text2text-language-modeling.ipynb.
Converted 01zc_data-text2text-summarization.ipynb.
Converted 02_modeling-core.ipynb.
Converted 02a_modeling-token-classification.ipynb.
Converted 02b_modeling-question-answering.ipynb.
Converted 02za_modeling-text2text-core.ipynb.
Converted 02zb_modeling-text2text-language-modeling.ipynb.
Converted 02zc_modeling-text2text-summarization.ipynb.
Converted 99a_examples-multilabel.ipynb.
Converted index.ipynb.
