In [None]:
# default_exp data.text_generation

In [None]:
#hide
%reload_ext autoreload
%autoreload 2
%matplotlib inline

# data.text_generation

> This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data for text generation tasks using architectures like BART, T5, or good ol' GPT2.  Abstractive summarization and conversational agents are good examples of such tasks.

In [None]:
#export
import ast
from functools import reduce

import torch
from transformers import *
from fastai2.text.all import *

from blurr.utils import *
from blurr.data.core import *

In [None]:
#hide
import pdb

from nbdev.showdoc import *
from fastcore.test import *

In [None]:
#cuda
torch.cuda.set_device(1)
print(f'Using GPU #{torch.cuda.current_device()}: {torch.cuda.get_device_name()}')

Using GPU #1: GeForce GTX 1080 Ti


## Text Generation tokenization, batch transform, and DataBlock methods

Text generation tasks attempt to generate a human-understandable and sensible response to a prior text.  For example, in summarization, our objective is to capture the meaning of a larger document in 1-3 sentences.

In [None]:
path = Path('./')
cnndm_df = pd.read_csv(path/'cnndm_sample.csv'); len(cnndm_df)

1000

In [None]:
cnndm_df.head(2)

Unnamed: 0,article,highlights,ds_type
0,"(CNN) -- Globalization washes like a flood over the world's cultures and economies. Floods can be destructive; however, they can also bring blessings, as the annual floods of the Nile did for ancient Egypt. The world's great universities can be crucial instruments in shaping, in a positive way, humankind's reaction to globalization and the development of humankind itself. Traditionally, universities have been defined and limited by location, creating an academic community and drawing students and scholars to that place. Eventually, some universities began to encourage students to study el...","John Sexton: Traditionally, universities have been defined and limited by location .\nGlobal campuses form a network of thought, innovation, he writes .\nFaculty can teach, Sexton says, students can team up in many cities at once .\nSexton: Research, scholarship can be shared and cultural ties made in ""century of knowledge""",train
1,"(CNN) -- Armenian President Robert Kocharian declared a state of emergency Saturday night after a day of clashes between police and protesters, a spokeswoman for the Armenian Foreign Ministry said. Opposition supporters wave an Armenian flag during a protest rally in Yerevan, Armenia, on Saturday. The protesters claim last month's presidential election was rigged. The state of emergency will ""hopefully bring some order"" to the capital, Yerevan, said Salpi Ghazarian, assistant to the Armenian foreign minister, who spoke to CNN early Sunday. The state of emergency could last until March 20, ...","NEW: Protest moves after crackdown at Freedom Square .\nOrder sought after protests over last month's election turn violent .\nDemonstrators say the election was fraudulent .\nState of emergency could last until March 20, official says .",train


In [None]:
pretrained_model_name = "facebook/bart-large-cnn"

hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name, 
                                                                               model_cls=BartForConditionalGeneration)

hf_arch, type(hf_tokenizer), type(hf_config), type(hf_model)

Some weights of BartForConditionalGeneration were not initialized from the model checkpoint at facebook/bart-large-cnn and are newly initialized: ['final_logits_bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


('bart',
 transformers.tokenization_bart.BartTokenizer,
 transformers.configuration_bart.BartConfig,
 transformers.modeling_bart.BartForConditionalGeneration)

We're going to create a subclass of `HF_BatchTransform` (`HF_TextGenerationBatchTransform`) for generation tasks so that each batch also includes `decoder_input_ids`, which is useful for summarization and conversational training tasks (see [here](https://huggingface.co/transformers/model_doc/bart.html#transformers.BartModel.forward) for more information).  This also requires us to provide a `labels` attribute holding the target ids offset by one position relative to `decoder_input_ids`, since the task is to predict the next token based on the current (and all previous) `decoder_input_ids`.

Notice that we also update our targets to be just the `input_ids` of our target sequence so that fastai's `Learner.show_results` works (again, almost all the fastai bits require a tensor to be returned in order to function).
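To make the one-position offset between `decoder_input_ids` and `labels` concrete, here is a minimal sketch using plain Python lists rather than tensors. The token ids, `PAD_ID`, and `IGNORE_INDEX` names are made up for illustration; the real transform takes the pad id from the tokenizer and works on tensors.

```python
# Hypothetical token ids for one padded target sequence (values are made up).
PAD_ID = 1            # assumed pad token id; the transform uses hf_tokenizer.pad_token_id
IGNORE_INDEX = -100   # default ignore_index of PyTorch's cross-entropy loss

target_ids = [0, 713, 16, 10, 4819, 2, 1, 1]  # start ... end, then two pad tokens

# The decoder sees every token except the last one ...
decoder_input_ids = target_ids[:-1]

# ... and at each position it must predict the token that follows.
labels = target_ids[1:]

# Padding positions should not contribute to the loss.
labels = [IGNORE_INDEX if t == PAD_ID else t for t in labels]

for seen, expected in zip(decoder_input_ids, labels):
    print(f"decoder input {seen:>5} -> label {expected:>5}")
```

Both sequences end up one token shorter than the original target, and each label lines up with the decoder input that precedes it.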

In [None]:
#export
class HF_TextGenerationInput(list): pass

In [None]:
#export
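class HF_TextGenerationBatchTransform(HF_BatchTransform):
    def __init__(self, hf_arch, hf_tokenizer, **kwargs):
        super().__init__(hf_arch, hf_tokenizer, HF_TextGenerationInput, **kwargs)

    def encodes(self, samples):
        samples = super().encodes(samples)
        # no targets (e.g., at inference time), so there is nothing to add
        if (len(samples[0]) == 1): return samples

        updated_samples = []
        for s in samples:
            # the decoder sees every target token except the last ...
            s[0]['decoder_input_ids'] = s[1]['input_ids'][:-1].clone()
            # ... and must predict the token that follows at each position
            s[0]['labels'] = s[1]['input_ids'][1:].clone()
            # padding positions are set to -100 so that the loss ignores them
            s[0]['labels'][s[0]['labels'] == self.hf_tokenizer.pad_token_id] = -100

            # return only the target input_ids (a tensor) so fastai can show results
            targ_ids = s[1]['input_ids']

            updated_samples.append((s[0], targ_ids))

        return updated_samples

    def decodes(self, encoded_samples):
        # inputs arrive as a dict of tensors; targets arrive as a bare tensor of input_ids
        if (isinstance(encoded_samples, dict)): return self.hf_input_return_type([encoded_samples['input_ids']])
        return [encoded_samples]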

We had to override the `decodes` method above because, while both our inputs and targets are technically the same kind of thing, we update the latter to consist of *only* the target input_ids so that methods like `Learner.show_results` work.  Nevertheless, because fastai remembers what they are, `HF_TokenizerTransform.decodes` will be called for both, and it works on a `list` of input_ids.
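The branching in `decodes` can be pictured with a small stand-in that uses plain Python containers in place of tensors. `toy_decodes` and the sample values are hypothetical and only mirror the shape of the logic; the real method wraps the input case in `hf_input_return_type`, which this sketch simplifies to a plain list.

```python
def toy_decodes(encoded):
    """Hypothetical stand-in that mirrors the two cases handled by `decodes`."""
    if isinstance(encoded, dict):
        # Inputs arrive as a dict of model arguments; keep only the token ids.
        return [encoded["input_ids"]]
    # Targets arrive as bare token ids; wrap them so both cases return a list.
    return [encoded]

# Made-up values standing in for an encoded input and an encoded target.
inputs = {"input_ids": [0, 713, 16, 2], "attention_mask": [1, 1, 1, 1]}
targets = [0, 4819, 2]

decoded_inputs = toy_decodes(inputs)
decoded_targets = toy_decodes(targets)
print(decoded_inputs, decoded_targets)
```

Either way, the downstream decoding step receives a list whose single element is a sequence of input_ids.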

In [None]:
hf_batch_tfm = HF_TextGenerationBatchTransform(hf_arch, hf_tokenizer)

blocks = ( 
    HF_TextBlock(hf_arch, hf_tokenizer), 
    HF_TextBlock(hf_arch, hf_tokenizer, hf_batch_tfm=hf_batch_tfm, max_length=150)
)

dblock = DataBlock(blocks=blocks, 
                   get_x=ColReader('article'), 
                   get_y=ColReader('highlights'), 
                   splitter=RandomSplitter())

In [None]:
# dblock.summary(cnndm_df)

In [None]:
dls = dblock.dataloaders(cnndm_df, bs=4)

In [None]:
b = dls.one_batch()

In [None]:
len(b), b[0]['input_ids'].shape, b[1].shape

(2, torch.Size([4, 512]), torch.Size([4, 150]))

In [None]:
#export
@typedispatch
def show_batch(x:HF_TextGenerationInput, y, samples, hf_tokenizer, ctxs=None, max_n=6, **kwargs):  
    res = L([ (s[0], s[1]) for s in samples ])          
    display_df(pd.DataFrame(res, columns=['text', 'target'])[:max_n])
    return ctxs

In [None]:
dls.show_batch(hf_tokenizer=hf_tokenizer, max_n=2)

Unnamed: 0,text,target
0,"(CNN) -- Will elementary and middle school students soon be able to put up their own Facebook pages? It looks like it. According to news accounts, Facebook is considering doing away with its rule that no one under age 13 may have a Facebook page. Cranky math instructors and tyrannical P.E. coaches must be at least a tad nervous at the thought of what the little darlings might post about them. But the change could be a good thing if it encourages a reasonable amount of parent involvement. The things the rest of us do on Facebook -- reveal what we had for lunch, post way too many photos of our kids and pets, comment on what Queen Elizabeth wore for her 60th anniversary jubilee -- are considered too dangerous for 10-, 11-, or 12-year-olds to do, even if their ""public"" consists only of people they've invited to be their friends. One result is that millions of pre-teens -- 7.5 million, according to a Consumer Reports account last year -- have established profiles on Facebook, some using fictitious names. Five million of them are younger than 10. What's most disturbing is that in many cases, parents have helped their kids circumvent the rules. What, pray tell, does that teach their children? The parameters of the policy under consideration at Facebook are undetermined but would work something like PG-13 movies. One idea is that the child's page would be linked through the parent's page -- a sidecar, if you will. Savvy parents would give their children as much privacy as possible even if wincing sometimes at what they saw. But they'd also have the opportunity to step in if they noticed something that was hurtful, dangerous or inappropriate. Schools of thought: My kids won't be on Facebook anytime soon. Despite what some adults might think or read, teens, at least by their own account, are pretty responsible when it comes to social networking. 
According to the Pew Research Center's Internet and American Life Project, more than 7 out of 10 teens say that other teens with whom they're online are kind, not mean. Of course, they've seen instances of cruelty. But let's not forget they see that in school, too; in fact, according to what they told Pew, most of the bullying they see or experience happens in person. Only 15% say they've been victims of online cruelty -- interestingly, the same proportion of adults who report being hurt by other adults. According to Pew, teens are receiving lots of counsel about Internet","Facebook is considering changing rule to allow preteens to have their own pages.\nLaura Stepp: 7.5 million preteens are already on Facebook, 5 million under 10.\nChildren's pages could be linked through their parents' pages, she says, allowing supervision.\nStepp: Letting your child have a Facebook page is a chance to teach safe online behavior."
1,"(CNN) -- It's just another coming of age story -- one we've all heard before -- but now it's about us. Just as Holden Caulfield awoke to the excitement of the adult world around him and wanted to escape the phonies, youth voters brought a novel and intense energy to the world of politics during the 2008 election in an effort to escape the phonies we'd been listening to our whole lives. Our debut into the world of politics was significant: The candidate with overwhelming youth support, Barack Obama, came out on top. I was too young to vote in that election, but after volunteering for the Obama campaign, I felt what many first-time voters and volunteers felt after the last election: proud, accomplished and significant. Four years later, what was once to us the novel and exciting adult world of politics now seems bitter and partisan. We're a little bit older, less bright-eyed and a little more cynical. It is not surprising that a generation not tempered by past disappointments, that had hoped its representatives would work in good faith to fix America's problems, might be less enthusiastic this time around. The percentage of youth voters who plan on voting fell from 78% in 2008 to just 58% this summer. We're the least likely of any age group to vote in November. Opinion: What Democrats need to do in Charlotte. But what a mistake it would be for us to throw in the towel now. Just because our politics and government can disappoint us sometimes doesn't mean we should forget how far we've come. President Obama understands what our generation contributed in 2008. He knows where we stand on issues and he agrees with us -- he's been our biggest ally in Washington since the start of his presidency. The president's signature legislative achievement, the Affordable Care Act, allows us to stay on our parents' health plan until we are 26. 
That means we'll have health insurance when we graduate from college, which more and more of us will be able to do thanks to the president's push to double funding for Pell Grants and his insistence on keeping interest rates low for the 7.4 million students taking out student loans. Because of Obama's repeal of ""don't ask, don't tell,"" anyone can join the military, regardless of sexual orientation, an issue important to our generation. When Congress refused to pass the DREAM Act, Obama changed policy administratively, enabling immigrants who came to the country as children to avoid deportation. Opinion: Can Obama convince voters to turn to him again? Our president showed","Jack Schlossberg: Young voters invested energy and hope in the 2008 Obama campaign.\nHe says some aren't as enthusiastic after realizing political struggle is difficult.\nSchlossberg: President Obama has delivered on health care, student loans, climate issues.\nHe says part of growing up is realizing that change doesn't come without sustained effort."


## Cleanup

In [None]:
#hide
from nbdev.export import notebook2script
notebook2script()

Converted 00_utils.ipynb.
Converted 01_data-core.ipynb.
Converted 01a_data-language-modeling.ipynb.
Converted 01c_data-question-answering.ipynb.
Converted 01d_data-token-classification.ipynb.
Converted 01e_data-text-generation.ipynb.
Converted 02_modeling-core.ipynb.
Converted 02_training-summarization.ipynb.
Converted 02a_modeling-language-modeling.ipynb.
Converted 02c_modeling-question-answering.ipynb.
Converted 02d_modeling-token-classification.ipynb.
Converted 02e_modeling-text-generation.ipynb.
Converted index.ipynb.
