In [None]:
# default_exp data.summarization

In [None]:
#hide
%reload_ext autoreload
%autoreload 2
%matplotlib inline

# data.summarization

> This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data for summarization tasks using architectures like BART and T5.

In [None]:
#export
import ast
from functools import reduce

import torch
from transformers import *
from fastai2.text.all import *

from blurr.utils import *
from blurr.data.core import *

In [None]:
#hide
import pdb

from nbdev.showdoc import *
from fastcore.test import *

In [None]:
#cuda
torch.cuda.set_device(1)
print(f'Using GPU #{torch.cuda.current_device()}: {torch.cuda.get_device_name()}')

Using GPU #1: GeForce GTX 1080 Ti


## Summarization tokenization, batch transform, and DataBlock methods

Summarization tasks attempt to generate a human-understandable and sensible representation of a larger body of text (e.g., capture the meaning of a larger document in 1-3 sentences).

In [None]:
path = Path('./')
cnndm_df = pd.read_csv(path/'cnndm_sample.csv'); len(cnndm_df)

1000

In [None]:
cnndm_df.head(2)

Unnamed: 0,article,highlights,ds_type
0,"(CNN) -- Globalization washes like a flood over the world's cultures and economies. Floods can be destructive; however, they can also bring blessings, as the annual floods of the Nile did for ancient Egypt. The world's great universities can be crucial instruments in shaping, in a positive way, humankind's reaction to globalization and the development of humankind itself. Traditionally, universities have been defined and limited by location, creating an academic community and drawing students and scholars to that place. Eventually, some universities began to encourage students to study el...","John Sexton: Traditionally, universities have been defined and limited by location .\nGlobal campuses form a network of thought, innovation, he writes .\nFaculty can teach, Sexton says, students can team up in many cities at once .\nSexton: Research, scholarship can be shared and cultural ties made in ""century of knowledge""",train
1,"(CNN) -- Armenian President Robert Kocharian declared a state of emergency Saturday night after a day of clashes between police and protesters, a spokeswoman for the Armenian Foreign Ministry said. Opposition supporters wave an Armenian flag during a protest rally in Yerevan, Armenia, on Saturday. The protesters claim last month's presidential election was rigged. The state of emergency will ""hopefully bring some order"" to the capital, Yerevan, said Salpi Ghazarian, assistant to the Armenian foreign minister, who spoke to CNN early Sunday. The state of emergency could last until March 20, ...","NEW: Protest moves after crackdown at Freedom Square .\nOrder sought after protests over last month's election turn violent .\nDemonstrators say the election was fraudulent .\nState of emergency could last until March 20, official says .",train


In [None]:
pretrained_model_name = "facebook/bart-large-cnn"

hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name, 
                                                                               model_cls=BartForConditionalGeneration)

hf_arch, type(hf_tokenizer), type(hf_config), type(hf_model)

Some weights of BartForConditionalGeneration were not initialized from the model checkpoint at facebook/bart-large-cnn and are newly initialized: ['final_logits_bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


('bart',
 transformers.tokenization_bart.BartTokenizer,
 transformers.configuration_bart.BartConfig,
 transformers.modeling_bart.BartForConditionalGeneration)

In [None]:
#export
class HF_SummarizationInput(list): pass

We create a subclass of `HF_BatchTransform` for summarization tasks to add `decoder_input_ids` and `labels` to our inputs during training, which in turn allows the huggingface model to calculate the loss for us. See [here](https://huggingface.co/transformers/model_doc/bart.html#transformers.BartModel.forward) for more information on how these additional inputs are used in summarization and conversational training tasks.

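Note also that `labels` is simply the target ids offset by one position relative to `decoder_input_ids` (`decoder_input_ids` drops the last target token, `labels` drops the first), since the task is to predict the next token based on the current (and all previous) `decoder_input_ids`.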

And lastly, we also update our targets to just be the `input_ids` of our target sequence so that fastai's `Learner.show_results` works (again, almost all the fastai bits require returning a single tensor to work).
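The relationship between the target ids, `decoder_input_ids`, and `labels` can be sketched with plain Python lists standing in for the token-id tensors. This is only an illustration: the helper name `build_decoder_inputs_and_labels`, the token ids, and the pad id of `1` are assumptions made for the example, not part of the library. The value `-100` mirrors the one used in the batch transform below and is the index that PyTorch's cross-entropy loss ignores by default.

```python
# Illustrative sketch only: plain lists stand in for token-id tensors.
PAD_ID = 1           # assumed pad token id for this example
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def build_decoder_inputs_and_labels(target_ids, pad_id=PAD_ID):
    """Split one target sequence into decoder inputs and next-token labels."""
    decoder_input_ids = target_ids[:-1]  # everything except the last token
    labels = target_ids[1:]              # everything except the first token
    # Padding positions should not contribute to the loss.
    labels = [IGNORE_INDEX if t == pad_id else t for t in labels]
    return decoder_input_ids, labels

# Made-up ids: start token, three word pieces, end token, two pad tokens.
target_ids = [0, 713, 16, 10, 2, 1, 1]
dec_in, labels = build_decoder_inputs_and_labels(target_ids)

print(dec_in)  # [0, 713, 16, 10, 2, 1]
print(labels)  # [713, 16, 10, 2, -100, -100]
```

At each position `i`, `labels[i]` is the token that follows `dec_in[i]` in the target sequence, which is what the decoder is trained to predict.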

In [None]:
#export
class HF_SummarizationBatchTransform(HF_BatchTransform):
    def __init__(self, hf_arch, hf_tokenizer, **kwargs):
        super().__init__(hf_arch, hf_tokenizer, HF_SummarizationInput, **kwargs)
        
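    def encodes(self, samples):  
        samples = super().encodes(samples)
        # inputs only (e.g., at inference time): there are no targets to process
        if (len(samples[0]) == 1): return samples
        
        updated_samples = []
        for s in samples:
            # decoder inputs are the target ids without the last token ...
            s[0]['decoder_input_ids'] = s[1]['input_ids'][:-1].clone()
            # ... and the labels are the target ids without the first token
            s[0]['labels'] = s[1]['input_ids'][1:].clone()
            # -100 tells the loss function to ignore padding positions
            s[0]['labels'][s[0]['labels'] == self.hf_tokenizer.pad_token_id] = -100
            
            # keep only the target input_ids so fastai gets a single tensor as the target
            targ_ids = s[1]['input_ids']
            
            updated_samples.append((s[0], targ_ids))
        
        return updated_samples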
    
    def decodes(self, encoded_samples):
        if (isinstance(encoded_samples, dict)): return self.hf_input_return_type([encoded_samples['input_ids']])
        return [encoded_samples]

We had to override the `decodes` method above because, while both our inputs and targets start out as the same kind of object, we update the latter to consist of *only* the target `input_ids` so that methods like `Learner.show_results` work. Nevertheless, because fastai remembers what they are, `HF_TokenizerTransform.decodes` will be called for both, and it works on a `list` of `input_ids`.
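The two branches of `decodes` can be sketched in isolation as follows. This is a simplified stand-in rather than the library code: `SummarizationInput` and `decode_side` are hypothetical names, and plain lists replace the tensors.

```python
class SummarizationInput(list):
    """Hypothetical stand-in for `HF_SummarizationInput`."""

def decode_side(encoded):
    # Inputs arrive as a dict of model arguments; keep only the token ids.
    if isinstance(encoded, dict):
        return SummarizationInput([encoded['input_ids']])
    # Targets were already reduced to bare token ids, so just wrap them in a list.
    return [encoded]

inputs = {'input_ids': [0, 5, 2], 'attention_mask': [1, 1, 1]}
targets = [0, 9, 2]

decoded_inputs = decode_side(inputs)    # a SummarizationInput holding the input ids
decoded_targets = decode_side(targets)  # a plain list holding the target ids
```

Either way, the result is a list whose single element is the sequence of token ids, which is the shape the tokenizer's decoding step expects for both sides.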

In [None]:
hf_batch_tfm = HF_SummarizationBatchTransform(hf_arch, hf_tokenizer)

blocks = ( 
    HF_TextBlock(hf_arch, hf_tokenizer), 
    HF_TextBlock(hf_arch, hf_tokenizer, hf_batch_tfm=hf_batch_tfm, max_length=150)
)

dblock = DataBlock(blocks=blocks, 
                   get_x=ColReader('article'), 
                   get_y=ColReader('highlights'), 
                   splitter=RandomSplitter())

In [None]:
# dblock.summary(cnndm_df)

In [None]:
dls = dblock.dataloaders(cnndm_df, bs=4)

In [None]:
b = dls.one_batch()

In [None]:
len(b), b[0]['input_ids'].shape, b[1].shape

(2, torch.Size([4, 512]), torch.Size([4, 150]))

In [None]:
#export
@typedispatch
def show_batch(x:HF_SummarizationInput, y, samples, dataloaders=None, ctxs=None, max_n=6, **kwargs):  
    res = L([ (s[0], s[1]) for s in samples ])          
    display_df(pd.DataFrame(res, columns=['text', 'target'])[:max_n])
    return ctxs

In [None]:
dls.show_batch(dataloaders=dls, max_n=2)

Unnamed: 0,text,target
0,"It's no secret that a battle has boiled over in the Republican Party. The fight has played out in the policy arena but also on the campaign trail. And since the inception of the tea party in 2009, it seemed like that wing had the upper hand. It slowly made effective inroads into a party many members of the vocal new group thought had lost its way. They elected a new breed of Republican into office, including Texas Sen. Ted Cruz and Kentucky Sen. Rand Paul, who surprised the political world by defeating establishment-backed candidates in their respective primaries. But those two successes haven't been the norm, especially in the Senate, as many inexperienced but ideologically more pure candidates have been unable to seal the deal. In 2010, Sharron Angle won the Senate primary in Nevada and Christine O'Donnell won in Delaware. Two years later, Richard Mourdock and Todd Akin won in Indiana and Missouri respectively. All four went on to lose against the Democrat. In a year in which Republicans have their best shot in several elections of regaining control of the Senate, party leaders are hoping to avoid general election stumbles. The Republican establishment is off to a good start this primary season, but it had an easy opener. Texas Sen. John Cornyn easily won his primary against tea party backed challenger Rep. Steve Stockman, who was largely persona-non-grata during his campaign. Still he spent $2.6 million in the final weeks leading up to the early March primary. Brian Walsh, a former Cornyn spokesman and former communications director for the National Republican Senatorial Committee (NRSC), which works to get Republicans elected to the Senate, said Cornyn did the work necessary to beat a challenger, including researching possible opponents and raising money. ""It's not a coincidence that John Cornyn won,"" he said. Republican incumbents fight back. Senate Republicans want to make sure that what happened in Texas happens elsewhere. 
Senate Republican Leader Mitch McConnell, who is facing a primary challenge from Matt Bevin, has taken the unusual step of publicly criticizing at least one group working to defeat some GOP incumbents, including McConnell - the Senate Conservative Fund. ""I think we are going to crush them everywhere,"" referring to SCF-backed candidates. ""I don't think they are going to have a single nominee anywhere in the country,"" he said in a recent interview with The New York Times. And in an interview with The Weekly Standard, McConnell said the organization is giving conservatism ""a bad name."" ""We know their business model is only","Republicans are taking an aggressive stance against intra-party opposition.\nOne conservative groups called Mississippi Republican incumbent a ""liberal""\nEstablishment trying to prevent candidates who can't win in general election."
1,"When Rachel Frederickson, 24, stepped out onto the stage at NBC's ""The Biggest Loser"" finale Tuesday night, some wondered if she had gone too far. While the show is known for its dramatic weight loss transformations -- most winners lose more than 50% of their body weight -- Frederickson appeared extremely thin. And the looks on trainers Bob Harper and Jillian Michaels' faces could be interpreted as shock -- or dismay. Frederickson went from 260 pounds to 105 pounds, losing 59.62% of her body weight. At 5 feet, 5 inches tall, that puts her body mass index at 17.5. Anything under 18.5 is considered by the National Institutes of Health to be underweight. NBC on Wednesday declined to comment on its $250,000 grand prize winner, a voice-over artist who lives in Los Angeles. Social media, however, was buzzing. ""Are you kidding?"" One woman wrote on the show's Facebook page. ""Rachel looks anorexic! She has gone from one extreme to the other!"" Other posts called Frederickson ""frail"" and said she seemed ""dizzy"" and ""disoriented"" on stage. ""She obviously worked incredibly hard to achieve her weight loss goals, but I am wondering if the pressure of winning a large cash prize caused her to take it too far,"" one said. Others posted on Frederickson's Facebook page, asking her to get help and expressing disappointment. ""I'm saddened that my 13-year-old daughter watched as you were rewarded for doing that to your body,"" one woman said. Frederickson as of Wednesday had not responded to the posts and had not commented on the controversy on Twitter. Michaels issued a statement late Wednesday on social media on behalf of herself and Harper, saying, ""Bob and I want to take a moment to congratulate all of the BL contestants on their hard work. We're not comfortable commenting on Rachel's journey because (we) weren't her trainers and weren't given an opportunity to work with her at any point. 
Any questions about the contestants on the Biggest Loser should be directed to the show's producers."" Asked by another Twitter user whether she was concerned about Frederickson's weight, Michaels posted, ""We won't comment because we had nothing to do with it and want no part in shaming someone with 'public statements.'"" HLNtv.com: Opinion: Dropping pounds nothing like 'Biggest Loser' BMI applies less on an individual level and is more utilized for population norms -- many people","Social media is buzzing over ""The Biggest Loser"" winner's 155-pound drop.\nLosing a lot of weight with unhealthy methods can be dangerous, experts say.\nBeing too thin can be as dangerous as being obese."


## Cleanup

In [None]:
#hide
from nbdev.export import notebook2script
notebook2script()

Converted 00_utils.ipynb.
Converted 01_data-core.ipynb.
Converted 01a_data-language-modeling.ipynb.
Converted 01c_data-question-answering.ipynb.
Converted 01d_data-token-classification.ipynb.
Converted 01e_data-summarization.ipynb.
Converted 02_modeling-core.ipynb.
Converted 02a_modeling-language-modeling.ipynb.
Converted 02c_modeling-question-answering.ipynb.
Converted 02d_modeling-token-classification.ipynb.
Converted 02e_modeling-summarization.ipynb.
Converted index.ipynb.
