In [None]:
# default_exp data.summarization

In [None]:
#hide
%reload_ext autoreload
%autoreload 2
%matplotlib inline

# data.summarization

> This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data for summarization tasks using architectures like BART and T5.

In [None]:
#export
import ast
from functools import reduce

import torch
from transformers import *
from fastai.text.all import *

from blurr.utils import *
from blurr.data.core import *

In [None]:
#hide
import pdb

from nbdev.showdoc import *
from fastcore.test import *

In [None]:
#cuda
torch.cuda.set_device(1)
print(f'Using GPU #{torch.cuda.current_device()}: {torch.cuda.get_device_name()}')

Using GPU #1: GeForce GTX 1080 Ti


## Summarization tokenization, batch transform, and DataBlock methods

Summarization tasks attempt to generate a human-understandable and sensible representation of a larger body of text (e.g., capture the meaning of a larger document in 1-3 sentences).

In [None]:
path = Path('./')
cnndm_df = pd.read_csv(path/'cnndm_sample.csv'); len(cnndm_df)

1000

In [None]:
cnndm_df.head(2)

Unnamed: 0,article,highlights,ds_type
0,"(CNN) -- Globalization washes like a flood over the world's cultures and economies. Floods can be destructive; however, they can also bring blessings, as the annual floods of the Nile did for ancient Egypt. The world's great universities can be crucial instruments in shaping, in a positive way, humankind's reaction to globalization and the development of humankind itself. Traditionally, universities have been defined and limited by location, creating an academic community and drawing students and scholars to that place. Eventually, some universities began to encourage students to study el...","John Sexton: Traditionally, universities have been defined and limited by location .\nGlobal campuses form a network of thought, innovation, he writes .\nFaculty can teach, Sexton says, students can team up in many cities at once .\nSexton: Research, scholarship can be shared and cultural ties made in ""century of knowledge""",train
1,"(CNN) -- Armenian President Robert Kocharian declared a state of emergency Saturday night after a day of clashes between police and protesters, a spokeswoman for the Armenian Foreign Ministry said. Opposition supporters wave an Armenian flag during a protest rally in Yerevan, Armenia, on Saturday. The protesters claim last month's presidential election was rigged. The state of emergency will ""hopefully bring some order"" to the capital, Yerevan, said Salpi Ghazarian, assistant to the Armenian foreign minister, who spoke to CNN early Sunday. The state of emergency could last until March 20, ...","NEW: Protest moves after crackdown at Freedom Square .\nOrder sought after protests over last month's election turn violent .\nDemonstrators say the election was fraudulent .\nState of emergency could last until March 20, official says .",train


In [None]:
pretrained_model_name = "facebook/bart-large-cnn"

hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name, 
                                                                               model_cls=BartForConditionalGeneration)

hf_arch, type(hf_tokenizer), type(hf_config), type(hf_model)

Some weights of BartForConditionalGeneration were not initialized from the model checkpoint at facebook/bart-large-cnn and are newly initialized: ['final_logits_bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


('bart',
 transformers.tokenization_bart.BartTokenizer,
 transformers.configuration_bart.BartConfig,
 transformers.modeling_bart.BartForConditionalGeneration)

In [None]:
#export
class HF_SummarizationInput(list): pass

We create a subclass of `HF_BatchTransform` for summarization tasks that adds `decoder_input_ids` and `labels` to our inputs during training, which in turn allows the huggingface model to calculate the loss for us. See [here](https://huggingface.co/transformers/model_doc/bart.html#transformers.BartModel.forward) for more information on how these additional inputs are used in summarization and conversational training tasks.

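Note also that `labels` is simply the target ids offset by one position relative to `decoder_input_ids` (`decoder_input_ids` drops the last target token, `labels` drops the first), since the task is to predict the next token based on the current (and all previous) `decoder_input_ids`. Padding positions in `labels` are set to -100 so that they are ignored when the loss is calculated.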

And lastly, we also update our targets to just be the `input_ids` of our target sequence so that fastai's `Learner.show_results` works (again, almost all the fastai bits require returning a single tensor to work).
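
To make the one-position offset concrete, here is a minimal sketch that uses plain Python lists in place of tensors. The token ids and the `PAD_ID` value are made up for illustration (they are assumptions, not values taken from the BART tokenizer); only the slicing and the -100 masking mirror what the transform does with the target `input_ids`.

```python
# Hypothetical values for illustration only.
PAD_ID = 1            # assumed pad token id; stands in for hf_tokenizer.pad_token_id
IGNORE_INDEX = -100   # label value used to exclude positions from the loss

# One padded target sequence (made-up token ids).
target_ids = [0, 713, 16, 10, 2, PAD_ID, PAD_ID]

# Decoder inputs: every target token except the last.
decoder_input_ids = target_ids[:-1]

# Labels: every target token except the first, so labels[t] is the token
# that should follow decoder_input_ids[t].
labels = target_ids[1:]

# Replace padding in the labels so those positions are ignored by the loss.
labels = [IGNORE_INDEX if tok == PAD_ID else tok for tok in labels]

for inp, lbl in zip(decoder_input_ids, labels):
    print(f"decoder input {inp:>4} -> label {lbl:>5}")
```

Both sequences end up one token shorter than the target, and each decoder input is paired with the token that comes next in the target.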

In [None]:
#export
class HF_SummarizationBatchTransform(HF_BatchTransform):
    def __init__(self, hf_arch, hf_tokenizer, **kwargs):
        super().__init__(hf_arch, hf_tokenizer, HF_SummarizationInput, **kwargs)
        
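    def encodes(self, samples):  
        samples = super().encodes(samples)
        # samples without a target: nothing to add
        if (len(samples[0]) == 1): return samples
        
        updated_samples = []
        for s in samples:
            # decoder inputs are the target ids without the last token ...
            s[0]['decoder_input_ids'] = s[1]['input_ids'][:-1].clone()
            # ... and labels are the target ids without the first token (the next token to predict)
            s[0]['labels'] = s[1]['input_ids'][1:].clone()
            # padding positions are set to -100 so they are ignored by the loss
            s[0]['labels'][s[0]['labels'] == self.hf_tokenizer.pad_token_id] = -100
            
            # the target becomes just the target input_ids (a single tensor)
            targ_ids = s[1]['input_ids']
            
            updated_samples.append((s[0], targ_ids))
        
        return updated_samples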
    
    def decodes(self, encoded_samples):
        if (isinstance(encoded_samples, dict)): return self.hf_input_return_type([encoded_samples['input_ids']])
        return [encoded_samples]

We had to override the `decodes` method above because, while our inputs and targets start out as the same kind of object, we update the latter to consist of *only* the target input_ids so that methods like `Learner.show_results` work. Nevertheless, because fastai remembers what they are, `HF_TokenizerTransform.decodes` will be called for both, and it works on a `list` of input_ids.
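
The branching in `decodes` can be sketched in isolation. This is a simplified illustration and not the library's implementation: `SummarizationInputSketch` and `decodes_sketch` are hypothetical names, and plain lists stand in for tensors.

```python
class SummarizationInputSketch(list):
    """Hypothetical stand-in for HF_SummarizationInput (a plain list subclass)."""


def decodes_sketch(encoded_samples):
    # Inputs arrive as a dict of model arguments: keep only the input ids,
    # wrapped in the typed list so code that dispatches on the type can find it.
    if isinstance(encoded_samples, dict):
        return SummarizationInputSketch([encoded_samples["input_ids"]])
    # Targets arrive as bare ids: wrap them in an ordinary list.
    return [encoded_samples]


# Made-up ids for illustration.
inputs = {"input_ids": [0, 713, 16, 2], "attention_mask": [1, 1, 1, 1]}
targets = [0, 9226, 2]

decoded_x = decodes_sketch(inputs)
decoded_y = decodes_sketch(targets)

print(type(decoded_x).__name__, decoded_x)
print(type(decoded_y).__name__, decoded_y)
```

Either way the result is a list holding one sequence of input ids, which is the shape a decoder working on a `list` of input_ids expects.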

In [None]:
hf_batch_tfm = HF_SummarizationBatchTransform(hf_arch, hf_tokenizer)

blocks = ( 
    HF_TextBlock(hf_arch, hf_tokenizer), 
    HF_TextBlock(hf_arch, hf_tokenizer, hf_batch_tfm=hf_batch_tfm, max_length=150, hf_input_idxs=[0,1])
)

dblock = DataBlock(blocks=blocks, 
                   get_x=ColReader('article'), 
                   get_y=ColReader('highlights'), 
                   splitter=RandomSplitter())

In [None]:
# dblock.summary(cnndm_df)

In [None]:
dls = dblock.dataloaders(cnndm_df, bs=4)

In [None]:
b = dls.one_batch()

In [None]:
len(b), b[0]['input_ids'].shape, b[1].shape

(2, torch.Size([4, 512]), torch.Size([4, 80]))

In [None]:
#export
@typedispatch
def show_batch(x:HF_SummarizationInput, y, samples, dataloaders=None, ctxs=None, max_n=6, **kwargs):  
    res = L([ (s[0], s[1]) for s in samples ])          
    display_df(pd.DataFrame(res, columns=['text', 'target'])[:max_n])
    return ctxs

In [None]:
dls.show_batch(dataloaders=dls, max_n=2)

Unnamed: 0,text,target
0,"(CNN) -- His shooting spree left at least 10 dead and millions terrified of bullets coming from an unseen sniper. But Mildred Muhammad believes she was the ultimate target of her ex-husband, John Allan Muhammad, the man dubbed the ""D.C. Sniper."" And for some time, Muhammad said she felt extreme guilt for the victims that were gunned down in grocery store parking lots and gas stations. The youngest was a 13-year-old boy who was shot while walking to his Maryland school. Muhammad spoke about the guilt she felt after the killing spree on CNN's ""Larry King Live"" on Monday night, the day before her ex-husband was scheduled to be executed. Muhammad said she has gradually gotten over her guilty feelings and focused on her three children. ""I felt that way initially because I had done everything I knew how to do to bring attention to how dangerous he was to me,"" Muhammad said. ""I had no idea his anger would extend beyond me, to include all people in his killings."" John Muhammad, the mastermind behind the Washington-area sniper attacks of 2002, is scheduled to die by lethal injection Tuesday evening at a state prison near Jarratt, Virginia. During two lengthy trials -- including one featuring testimony from young accomplice Lee Boyd Malvo -- and in several years of legal appeals, John Muhammad has continued to profess his innocence. Prosecutors say John Muhammad intended the killings to provide a smokescreen to cover up his real goal -- killing his ex-wife Mildred and gaining custody of his three children. Muhammad said she divorced John Muhammad because of abuse and has not visited him since he was in prison. ""I feel that all of my efforts, all of my energy is to help my children through this emotional turmoil that they are going through,"" said Muhammad. ""I don't have an emotional attachment to John."" John Muhammad's other ex-wife, Carol Williams, also talked to King Monday. 
Williams, John Muhammad's first wife, said she plans to visit him in prison with their son Tuesday before the execution. Williams also brought letters that John Muhammad wrote her from prison. ""Carol, I have missed my family for the past eight years. I don't want to be missed the day that these devils murder my innocent black (expletive),"" John Muhammad wrote in one of the letters. Williams said she was not surprised that John Muhammad still believed he was innocent. ""I'm praying for myself, for my son, and also for the families of the victims,"" Williams said.","John Muhammad's second ex-wife, Mildred, believes she was ultimate target of sniper spree.\nMuhammad: For long time, I felt extreme guilt for victims that were gunned down.\nCarol Williams, his first ex-wife, plans to visit Muhammad before execution and bring son.\nMuhammad has maintained his innocence in the deaths of at least 10 people in 2002."
1,"Boston (CNN) -- With only a half-dozen witnesses to go, lawyers for James ""Whitey"" Bulger have still not confirmed whether the reputed mob boss will take the stand on his own behalf. The delay has frustrated prosecutors who Wednesday told the judge they have the right to know whether they should be working on cross-examination or closing summations. On Day 33 of the federal racketeering trial, defense lawyers continued to raise questions about what the FBI did and did not do to prevent some of the murders Bulger is accused of committing, among them the death of Bulger crime associate Brian Halloran. Retired FBI Agent James Crawford testified that 10 days before Halloran and his friend Michael Donahue were murdered, he was approached by a woman described as having a ""close relationship"" with Steven ""The Rifleman"" Flemmi, Bulger's crime partner. Defense witness criticizes FBI's inaction on Bulger. The woman insisted on total anonymity knowing she'd be killed if Flemmi found out, telling the agent there were people in law enforcement on the Bulger payroll. The confidential informant told Crawford ""Flemmi was going to kill Halloran for being a snitch."" Crawford gave her his word he would not put it in writing. However, he did speak with a supervisor who said the information should be ""put on a back burner."" Ten days later Halloran was dead. The woman, believed to be Olga Davis, later approached the FBI agent asking for help finding her daughter Debra Davis who had disappeared. Debra was Flemmi's live-in girlfriend. During the trial, Flemmi testified he lured her to a home where Bulger strangled her. Defense lawyers plan to recall hit man John Martorano to testify Flemmi admitted ""accidentally"" strangling the stunning 26 year old. The defense also plans to show that Flemmi, not Bulger, had motive to kill Deborah Hussey. 
Hussey's mother, Marion Hussey, testified at a deposition that Flemmi called his common-law step-daughter a ""slut, a whore, a prostitute... doing drugs."" Also on the stand, retired FBI informant-coordinator Fred Davis. He testified that when he arrived at the Boston field office in the late 1970s there was a lot of ""paranoia"" in the bureau. ""They were nervous other agents in the office were leaking information."" Key among the suspected leakers was Bulger handler John Connolly who would ""show","Defense says a Bulger criminal partner was the one who wanted informant dead.\nTestimony comes after other former FBI agents testify about 1980s Boston FBI corruption.\nJames ""Whitey"" Bulger faces murder, other charges in a 32-count indictment."


## Cleanup

In [None]:
#hide
from nbdev.export import notebook2script
notebook2script()

Converted 00_utils.ipynb.
Converted 01_data-core.ipynb.
Converted 01a_data-token-classification.ipynb.
Converted 01b_data-question-answering.ipynb.
Converted 01e_data-summarization.ipynb.
Converted 01z_data-language-modeling.ipynb.
Converted 02_modeling-core.ipynb.
Converted 02a_modeling-token-classification.ipynb.
Converted 02b_modeling-question-answering.ipynb.
Converted 02e_modeling-summarization.ipynb.
Converted 02z_modeling-language-modeling.ipynb.
Converted index.ipynb.
