In [None]:
# default_exp data.summarization

In [None]:
#hide
%reload_ext autoreload
%autoreload 2
%matplotlib inline

# data.summarization

> This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data for summarization tasks using architectures like BART and T5.

In [None]:
#export
import ast
from functools import reduce

import torch
from transformers import *
from fastai.text.all import *

from blurr.utils import *
from blurr.data.core import *

In [None]:
#hide
import pdb

from nbdev.showdoc import *
from fastcore.test import *

In [None]:
#cuda
torch.cuda.set_device(1)
print(f'Using GPU #{torch.cuda.current_device()}: {torch.cuda.get_device_name()}')

Using GPU #1: GeForce GTX 1080 Ti


## Summarization tokenization, batch transform, and DataBlock methods

Summarization tasks attempt to generate a human-understandable and sensible representation of a larger body of text (e.g., capture the meaning of a larger document in 1-3 sentences).

In [None]:
path = Path('./')
cnndm_df = pd.read_csv(path/'cnndm_sample.csv'); len(cnndm_df)

1000

In [None]:
cnndm_df.head(2)

Unnamed: 0,article,highlights,ds_type
0,"(CNN) -- Globalization washes like a flood over the world's cultures and economies. Floods can be destructive; however, they can also bring blessings, as the annual floods of the Nile did for ancient Egypt. The world's great universities can be crucial instruments in shaping, in a positive way, humankind's reaction to globalization and the development of humankind itself. Traditionally, universities have been defined and limited by location, creating an academic community and drawing students and scholars to that place. Eventually, some universities began to encourage students to study el...","John Sexton: Traditionally, universities have been defined and limited by location .\nGlobal campuses form a network of thought, innovation, he writes .\nFaculty can teach, Sexton says, students can team up in many cities at once .\nSexton: Research, scholarship can be shared and cultural ties made in ""century of knowledge""",train
1,"(CNN) -- Armenian President Robert Kocharian declared a state of emergency Saturday night after a day of clashes between police and protesters, a spokeswoman for the Armenian Foreign Ministry said. Opposition supporters wave an Armenian flag during a protest rally in Yerevan, Armenia, on Saturday. The protesters claim last month's presidential election was rigged. The state of emergency will ""hopefully bring some order"" to the capital, Yerevan, said Salpi Ghazarian, assistant to the Armenian foreign minister, who spoke to CNN early Sunday. The state of emergency could last until March 20, ...","NEW: Protest moves after crackdown at Freedom Square .\nOrder sought after protests over last month's election turn violent .\nDemonstrators say the election was fraudulent .\nState of emergency could last until March 20, official says .",train


In [None]:
pretrained_model_name = "facebook/bart-large-cnn"

hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name, 
                                                                               model_cls=BartForConditionalGeneration)

hf_arch, type(hf_tokenizer), type(hf_config), type(hf_model)

Some weights of BartForConditionalGeneration were not initialized from the model checkpoint at facebook/bart-large-cnn and are newly initialized: ['final_logits_bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


('bart',
 transformers.tokenization_bart.BartTokenizer,
 transformers.configuration_bart.BartConfig,
 transformers.modeling_bart.BartForConditionalGeneration)

In [None]:
#export
class HF_SummarizationInput(list): pass

We create a subclass of `HF_BatchTransform` for summarization tasks to add `decoder_input_ids` and `labels` to our inputs during training, which in turn allows the huggingface model to calculate the loss for us. See [here](https://huggingface.co/transformers/model_doc/bart.html#transformers.BartModel.forward) for more information on how these additional inputs are used in summarization and conversational training tasks.

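Note also that `labels` is simply the target `input_ids` offset by one position relative to `decoder_input_ids` (`decoder_input_ids` drops the last target token, `labels` drops the first), since the task is to predict the next token based on the current (and all previous) `decoder_input_ids`. Padding positions in `labels` are set to `-100` so that they are ignored when the loss is calculated.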
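To make the offset concrete, here is a minimal sketch in plain Python lists rather than tensors. The function name `build_decoder_inputs_and_labels`, the token ids, and the pad id of `1` are illustrative assumptions, not part of the library; only the slicing and the `-100` masking mirror what the batch transform does.

```python
PAD_ID = 1            # assumed pad token id, for illustration only
IGNORE_INDEX = -100   # label value skipped by PyTorch's cross-entropy loss by default

def build_decoder_inputs_and_labels(target_ids, pad_id=PAD_ID):
    """Hypothetical helper: split target ids into decoder inputs and labels."""
    # The decoder sees every target token except the last one ...
    decoder_input_ids = target_ids[:-1]
    # ... and must predict every target token except the first one.
    labels = [IGNORE_INDEX if t == pad_id else t for t in target_ids[1:]]
    return decoder_input_ids, labels

# Made-up target sequence: <start>, four tokens, <end>, then two pad tokens
target_ids = [0, 713, 16, 10, 4819, 2, 1, 1]
decoder_input_ids, labels = build_decoder_inputs_and_labels(target_ids)

print(decoder_input_ids)  # [0, 713, 16, 10, 4819, 2, 1]
print(labels)             # [713, 16, 10, 4819, 2, -100, -100]
```

At every position `i`, `labels[i]` is the token that follows `decoder_input_ids[i]` in the target, which is exactly the next-token objective described above.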

And lastly, we also update our targets to just be the `input_ids` of our target sequence so that fastai's `Learner.show_results` works (again, almost all the fastai bits require returning a single tensor to work).

In [None]:
#export
class HF_SummarizationBatchTransform(HF_BatchTransform):
    def __init__(self, hf_arch, hf_tokenizer, **kwargs):
        super().__init__(hf_arch, hf_tokenizer, HF_SummarizationInput, **kwargs)

    def encodes(self, samples):
        samples = super().encodes(samples)
        # no targets present (inputs only), so there is nothing to add
        if (len(samples[0]) == 1): return samples

        updated_samples = []
        for s in samples:
            # the decoder sees every target token except the last ...
            s[0]['decoder_input_ids'] = s[1]['input_ids'][:-1].clone()
            # ... and has to predict every target token except the first
            s[0]['labels'] = s[1]['input_ids'][1:].clone()
            # -100 marks padding positions so they are ignored by the loss
            s[0]['labels'][s[0]['labels'] == self.hf_tokenizer.pad_token_id] = -100

            # fastai expects the target to be a single tensor
            targ_ids = s[1]['input_ids']

            updated_samples.append((s[0], targ_ids))

        return updated_samples

    def decodes(self, encoded_samples):
        # inputs are dicts; targets are already just the input_ids
        if (isinstance(encoded_samples, dict)): return self.hf_input_return_type([encoded_samples['input_ids']])
        return [encoded_samples]

We had to override the `decodes` method above because, while our inputs and targets start out as the same kind of object, we update the latter to consist of *only* the target input_ids so that methods like `Learner.show_results` work. Nevertheless, because fastai remembers what they are, `HF_TokenizerTransform.decodes` will be called for both, and it works on a `list` of input_ids.
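The branching in `decodes` can be sketched in isolation. The names `SummarizationInput` and `decode_sample` below are hypothetical stand-ins (for `HF_SummarizationInput` and the method body), and plain lists take the place of tensors, so treat this as an illustration of the idea rather than the library's code.

```python
class SummarizationInput(list):
    """Hypothetical stand-in for HF_SummarizationInput."""

def decode_sample(encoded, input_type=SummarizationInput):
    # Inputs arrive as a dict; keep only the token ids and wrap them
    # in the typed list that show_batch dispatches on.
    if isinstance(encoded, dict):
        return input_type([encoded["input_ids"]])
    # Targets were already reduced to bare token ids, so a plain list is enough.
    return [encoded]

x = decode_sample({"input_ids": [0, 713, 2], "attention_mask": [1, 1, 1]})
y = decode_sample([0, 4819, 2])

print(type(x).__name__, x)  # SummarizationInput [[0, 713, 2]]
print(type(y).__name__, y)  # list [[0, 4819, 2]]
```

Either way the result is a list whose first element is the sequence of input_ids, which is the shape the tokenizer-level decoding step expects.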

In [None]:
hf_batch_tfm = HF_SummarizationBatchTransform(hf_arch, hf_tokenizer)

blocks = ( 
    HF_TextBlock(hf_arch, hf_tokenizer), 
    HF_TextBlock(hf_arch, hf_tokenizer, hf_batch_tfm=hf_batch_tfm, max_length=150, hf_input_idxs=[0,1])
)

dblock = DataBlock(blocks=blocks, 
                   get_x=ColReader('article'), 
                   get_y=ColReader('highlights'), 
                   splitter=RandomSplitter())

In [None]:
# dblock.summary(cnndm_df)

In [None]:
dls = dblock.dataloaders(cnndm_df, bs=4)

In [None]:
b = dls.one_batch()

In [None]:
len(b), b[0]['input_ids'].shape, b[1].shape

(2, torch.Size([4, 512]), torch.Size([4, 81]))

In [None]:
#export
@typedispatch
def show_batch(x:HF_SummarizationInput, y, samples, dataloaders=None, ctxs=None, max_n=6, **kwargs):  
    res = L([ (s[0], s[1]) for s in samples ])          
    display_df(pd.DataFrame(res, columns=['text', 'target'])[:max_n])
    return ctxs

In [None]:
dls.show_batch(dataloaders=dls, max_n=2)

Unnamed: 0,text,target
0,"London, England (CNN) -- Olympic triple-jump champion Christian Taylor knows all about putting his best foot forward. But in order to continue competing in the sport he loves, he's had to go back to square one. Retrain his muscle memory and try a new way. For an athlete who's used to constantly repeating his routine, day after day, year after year, it was a big deal. ""All my life I've jumped from my left foot -- that was my takeoff -- and even winning the 2012 Olympics, that was the foot I jumped from, so the idea of switching feet was pretty crazy,"" the American tells CNN's Human to Hero series. ""You have to almost use a different side of the brain. My left leg was muscle memory, I could do that day in, day out, and now to do that off my right, it took a little while to get over it."" So why take such a big risk? Taylor had no choice -- triple-jumping had taken its toll on his knee joints and he was facing the prospect of having to quit in his prime; at 22, he was the youngest man to win the discipline's Olympic title in 100 years. ""People around me would say, 'You've won so many things, can you not continue through the pain?' but it was to the point that it was either that I try and blow out my knee or I just give it up,"" says the 2011 world champion. ""I love this too much -- my passion for it exceeded the doubt -- and once I just committed to it, I was like this is it. I'm going to make it happen. ""My parents always brought me up with the saying, 'Where there's a will there's a way,' so I was willing to do what it took and now I'm making the way."" During the hard times that followed, he was able to draw on the religious beliefs instilled in him by his grandmother. ""My faith keeps me to who I am, because a lot of times there are a lot of distractions, a lot of pressures that come with the lifestyle,"" Taylor explains. ""But keeping that faith and just remembering who I am is very important to me. 
I have my daily devotions."" His persistence has paid off. While he finished outside the medals at last year's world championships, finishing fourth, he has completed a hat-trick of titles in the lucrative Diamond League series. A season's best leap of 17.51 meters at the Zurich meeting in August","Triple-jump star Christian Taylor overcomes career-threatening injury.\nKnee problems meant he had to reverse his leaping stride, or retire early.\nThe 24-year-old has won Diamond League series title for third year in a row.\nAmerican also competes in long jump and hopes for 400m relay place at Rio 2016."
1,"Joint Base Lewis-McChord, Washington (CNN) -- Army Staff Sgt. Calvin Gibbs has been sentenced to life in military prison with eligibility for parole in 10 years. A military court-martial Thursday found Gibbs guilty of murdering three Afghan civilians, illegally cutting off pieces of their corpses to keep as ""souvenirs"" and planting weapons to make the men appear as if they were Taliban fighters killed in legitimate firefights. He was reduced in rank to private and ordered to forfeit all pay and benefits. Whatever sentence Gibbs serves will be reduced by the 547 days he has already spent in prison. ""He said they were all dirty savages,"" prosecutor Maj. Andre Leblanc said at Gibbs' sentencing hearing. ""He is the savage, not the innocent Afghans he murdered. It is monstrous. What kind of savagery does it take to do this? To cut a finger off a victim and show it to people? This is a savage being"" Gibbs' attorney, Phillip Stackhouse, had asked the court for a sentence of life with parole so Gibbs would have the opportunity to be with his now-3-year-old son again. ""He has a long time to reflect on his life, what he has done and what he wants to do in the future,"" Stackhouse said. Gibbs is the highest ranking of five soldiers charged with being part of a rogue ""kill squad"" that targeted civilians. Another seven soldiers also were charged with lesser crimes including abusing drugs, keeping ""off the books"" weapons and intimidating a fellow soldier not to speak out against the platoon's alleged killings. Gibbs had pleaded not guilty. A prosecutor described Gibbs as a ""recruiting poster"" soldier. But the tall, clean-cut Gibbs and the ""kill squad"" he was convicted of leading turned into a public-relations nightmare for the military. ""Sgt. Gibbs had a charisma, he had a 'follow me' personality,"" Maj. Robert Stelle, a prosecutor in the case, told the court in closing arguments Wednesday. 
""But it was all a bunch of crap, he had his own mission: murder and depravity."" The murders Gibbs is accused of committing took place over a period of five months last year, while Gibbs led the 3rd Platoon of the Army's 5th Stryker Brigade in Kandahar Province, Afghanistan. Gibbs' platoon was tasked with patrolling small villages in the area to build relationships with an Afghan population wary of the U.S. presence in their country. Instead","Staff Sgt. Calvin Gibbs is sentenced to life in prison, with parole possible.\nNEW: He remains under investigation for the 2004 shooting death of a family in Iraq.\nProsecutors: Gibbs' platoon staged killings of civilians as firefights with Taliban.\nDefense says Gibbs' accusers were high on hashish at the time."


## Cleanup

In [None]:
#hide
from nbdev.export import notebook2script
notebook2script()

Converted 00_utils.ipynb.
Converted 01_data-core.ipynb.
Converted 01a_data-language-modeling.ipynb.
Converted 01c_data-question-answering.ipynb.
Converted 01d_data-token-classification.ipynb.
Converted 01e_data-summarization.ipynb.
Converted 02_modeling-core.ipynb.
Converted 02a_modeling-language-modeling.ipynb.
Converted 02c_modeling-question-answering.ipynb.
Converted 02d_modeling-token-classification.ipynb.
Converted 02e_modeling-summarization.ipynb.
Converted index.ipynb.
