In [None]:
# default_exp examples.blurr_high_level_api

In [None]:
#all_slow

In [None]:
#hide
%reload_ext autoreload
%autoreload 2
%matplotlib inline

# Using the high-level Blurr API

> This notebook shows all of the high-level `BlearnerFor<Task>` classes in action, with the raw data sourced from the [Hugging Face Datasets library](https://huggingface.co/docs/datasets/index.html).

In [None]:
#export
import os

from datasets import load_dataset, concatenate_datasets
from transformers import *
from fastai.text.all import *

from blurr.utils import *
from blurr.data.core import *
from blurr.modeling.core import *

from blurr.data.language_modeling import BertMLMStrategy, CausalLMStrategy
from blurr.modeling.language_modeling import *

from blurr.modeling.token_classification import *
from blurr.modeling.question_answering import *
from blurr.modeling.seq2seq.summarization import *
from blurr.modeling.seq2seq.translation import *

logging.set_verbosity_error()

In [None]:
#hide_input
import pdb

from fastcore.test import *
from nbverbose.showdoc import show_doc

os.environ["TOKENIZERS_PARALLELISM"] = "false"
print("Here's what we're running with ...\n")
print_versions('torch fastai transformers')

torch: 1.7.1
fastai: 2.5.0
transformers: 4.9.2


In [None]:
#cuda
#hide_input
torch.cuda.set_device(1)
print(f'Using GPU #{torch.cuda.current_device()}: {torch.cuda.get_device_name()}')

Using GPU #1: GeForce GTX 1080 Ti


While most of the code and examples in the documentation show how to work with Blurr given a pandas DataFrame, this set of examples shows how to use the high-level Blurr API with any Hugging Face dataset. The high-level API provides one-liners to build your DataBlock, DataLoaders, and Learner (with sensible defaults) from a DataFrame, a CSV file, or a list of dictionaries, as we do here.

## Sequence Classification

### Multiclassification (one input)

In [None]:
#hide
try: del learn; torch.cuda.empty_cache()
except: pass

In [None]:
raw_datasets = load_dataset('glue', 'cola') 
print(f'{raw_datasets}\n')
print(f'{raw_datasets["train"][0]}\n')
print(f'{raw_datasets["train"].features}\n')

Reusing dataset glue (/home/wgilliam/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})

{'idx': 0, 'label': 1, 'sentence': "Our friends won't buy this analysis, let alone the next one we propose."}

{'sentence': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['unacceptable', 'acceptable'], names_file=None, id=None), 'idx': Value(dtype='int32', id=None)}



Capture the indexes for both the train and validation sets, use `concatenate_datasets` from the datasets library to combine them into a single dataset, and then use `IndexSplitter` to define the train/validation split:

In [None]:
train_ds = raw_datasets['train']#.select(range(10000))
valid_ds = raw_datasets['validation']#.select(range(2000))

In [None]:
n_train, n_valid = train_ds.num_rows, valid_ds.num_rows
train_idxs, valid_idxs = L(range(n_train)), L(range(n_train, n_train + n_valid))
raw_ds = concatenate_datasets([train_ds, valid_ds])
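
The index bookkeeping above does not depend on Blurr or fastai. As a minimal sketch in plain Python (the helper name `split_indices` is hypothetical, not part of any library), the validation indexes are simply offset by the size of the training set, because the validation rows are appended after the training rows:

```python
def split_indices(n_train, n_valid):
    """Return (train_idxs, valid_idxs) for a dataset built by appending
    n_valid validation rows after n_train training rows."""
    train_idxs = list(range(n_train))
    valid_idxs = list(range(n_train, n_train + n_valid))
    return train_idxs, valid_idxs

# Sizes taken from the CoLA output above: 8551 train rows, 1043 validation rows.
train_idxs, valid_idxs = split_indices(8551, 1043)
print(len(train_idxs), valid_idxs[0], valid_idxs[-1])
```

Only `valid_idxs` has to be handed to `IndexSplitter`; every index not listed there ends up in the training split.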

In [None]:
dl_kwargs = {'bs': 4, 'val_bs': 8}
learn_kwargs = { 'metrics': [accuracy] }

learn = BlearnerForSequenceClassification.from_dictionaries(raw_ds, 'distilroberta-base', 
                                                            text_attr='sentence', label_attr='label',
                                                            dblock_splitter=IndexSplitter(valid_idxs),
                                                            dl_kwargs=dl_kwargs, learner_kwargs=learn_kwargs)
learn = learn.to_fp16()

In [None]:
learn.dls.show_batch(dataloaders=learn.dls, trunc_at=500, max_n=5)

Unnamed: 0,text,target
0,"Everybody who has ever, worked in any office which contained any typewriter which had ever been used to type any letters which had to be signed by any administrator who ever worked in any department like mine will know what I mean.",1
1,"The more contented the nurses began to believe that we were going to pretend to be, the more angry we grew at the doctors.",0
2,The girl who the by works in a skyscraper and in a quonset but has a dimple on her nose.,0
3,"Girl in the red coat, this girl in the red coat will put a picture of Bill on your desk before tomorrow.",0


In [None]:
learn.fit_one_cycle(1, lr_max=2e-3)

epoch,train_loss,valid_loss,accuracy,time
0,0.497798,0.511438,0.760307,01:04


In [None]:
learn.show_results(learner=learn, max_n=5)

Unnamed: 0,text,target,prediction
0,"Scientists at the South Hanoi Institute of Technology have succeeded in raising one dog with five legs, another with a cow's liver, and a third with no head.",1,1
1,"The newspaper has reported that they are about to appoint someone, but I can't remember who the newspaper has reported that they are about to appoint.",1,1
2,"Sandy is very anxious to see if the students will be able to solve the homework problem in a particular way, but she won't tell us which.",1,1
3,"Most columnists claim that a senior White House official has been briefing them, and the newspaper today reveals which one.",1,1
4,"Sandy was trying to work out which students would be able to solve a certain problem, but she wouldn't tell us which one.",0,1


`Learner.blurr_predict` works here too

In [None]:
learn.blurr_predict('Blurr aint no joke yo')

[(('1',), (#1) [tensor(1)], (#1) [tensor([0.4573, 0.5427])])]
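
In the output above the predicted class appears as the index `'1'`, followed by the class probabilities. As a small sketch in plain Python (the probabilities are copied from that output and the label names from the `ClassLabel` feature printed earlier; `decode_prediction` is a hypothetical helper, not a Blurr function), the index can be mapped back to a readable label:

```python
# Label names as printed by raw_datasets["train"].features for CoLA.
label_names = ['unacceptable', 'acceptable']

def decode_prediction(probs, names):
    """Return (label_name, probability) for the highest-scoring class."""
    best_idx = max(range(len(probs)), key=lambda i: probs[i])
    return names[best_idx], probs[best_idx]

label, prob = decode_prediction([0.4573, 0.5427], label_names)
print(label, prob)
```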

### Multiclassification (two inputs)

In [None]:
#hide
try: del learn; torch.cuda.empty_cache()
except: pass

In [None]:
raw_datasets = load_dataset('glue', 'mrpc') 
print(f'{raw_datasets}\n')
print(f'{raw_datasets["train"][0]}\n')
print(f'{raw_datasets["train"].features}\n')

Reusing dataset glue (/home/wgilliam/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

{'idx': 0, 'label': 1, 'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'}

{'sentence1': Value(dtype='string', id=None), 'sentence2': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None), 'idx': Value(dtype='int32', id=None)}



In [None]:
train_ds = raw_datasets['train']#.select(range(10000))
valid_ds = raw_datasets['validation']#.select(range(2000))

In [None]:
n_train, n_valid = train_ds.num_rows, valid_ds.num_rows
train_idxs, valid_idxs = L(range(n_train)), L(range(n_train, n_train + n_valid))
raw_ds = concatenate_datasets([train_ds, valid_ds])

In [None]:
dl_kwargs = {'bs': 4, 'val_bs': 8}
learn_kwargs = { 'metrics': [F1Score(), accuracy] }

learn = BlearnerForSequenceClassification.from_dictionaries(raw_ds, 'distilroberta-base', 
                                                            text_attr=['sentence1', 'sentence2'], 
                                                            label_attr='label',
                                                            dblock_splitter=IndexSplitter(valid_idxs),
                                                            dl_kwargs=dl_kwargs, learner_kwargs=learn_kwargs)
learn = learn.to_fp16()

In [None]:
learn.dls.show_batch(dataloaders=learn.dls, trunc_at=500, max_n=5)

Unnamed: 0,text,target
0,""" In Iraq, "" Sen. Pat Roberts, R-Kan., chairman of the intelligence committee, said on CNN's "" Late Edition "" Sunday, "" we're now fighting an anti-guerrilla... effort. "" "" In Iraq, "" Sen. Pat Roberts ( R-Kan. ), chairman of the intelligence committee, said on CNN's "" Late Edition "" yesterday, "" we're now fighting an anti-guerrilla... effort. """,1
1,"With Claritin's decline, Schering-Plough's best-selling products now are two drugs used together to treat hepatitis C, the antiviral pill ribavirin and an interferon medicine called Peg-Intron. With Claritin's decline, Schering-Plough's best-selling products are now antiviral drug ribavirin and an interferon medicine called Peg-Intron -- two drugs used together to treat hepatitis C.",1
2,"The 30-year bond US30YT = RR grew 1-3 / 32 for a yield of 4.30 percent, down from 4.35 percent late Wednesday. At 11 a.m. ( 1500 GMT ), the 10-year note US10YT = RR was up 11 / 32 for a yield of 3.36 percent from 3.40 percent Wednesday.",0
3,"Under the NBC proposal, Vivendi would merge its U.S. film and TV business with NBC's broadcast network, Spanish-language network and cable channels including CNBC and Bravo. Under a deal with General Electric's NBC, Vivendi's film and TV business would merge with NBC's broadcast network, Spanish- language network and cable channels including CNBC and Bravo.",1


In [None]:
learn.fit_one_cycle(1, lr_max=2e-3)

epoch,train_loss,valid_loss,f1_score,accuracy,time
0,0.536673,0.467462,0.852989,0.776961,00:29


In [None]:
learn.show_results(learner=learn, max_n=5)

Unnamed: 0,text,target,prediction
0,"He said the foodservice pie business doesn 't fit the company's long-term growth strategy. "" The foodservice pie business does not fit our long-term growth strategy.",1,1
1,"The worm attacks Windows computers via a hole in the operating system, an issue Microsoft on July 16 had warned about. The worm attacks Windows computers via a hole in the operating system, which Microsoft warned of 16 July.",1,1
2,The U.S. Supreme Court will hear arguments on Wednesday on whether companies can be sued under the Americans with Disabilities Act for refusing to rehire rehabilitated drug users. The high court will hear arguments today on whether companies can be sued under the ADA for refusing to rehire rehabilitated drug users.,1,1
3,"Magnarelli said Racicot hated the Iraqi regime and looked forward to using his long years of training in the war. His wife said he was "" 100 percent behind George Bush "" and looked forward to using his years of training in the war.",0,0
4,"The announcement, which economists said was not a surprise, may be bittersweet for the millions of Americans without jobs. Economists said the announcement was not a surprise, and politicians said it offered little comfort to the millions of Americans without jobs.",1,1


### Multilabel classification

In [None]:
#hide
try: del learn; torch.cuda.empty_cache()
except: pass

In [None]:
raw_datasets = load_dataset('civil_comments')
print(f'{raw_datasets}\n')
print(f'{raw_datasets["train"][0]}\n')
print(f'{raw_datasets["train"].features}\n')

Using custom data configuration default
Reusing dataset civil_comments (/home/wgilliam/.cache/huggingface/datasets/civil_comments/default/0.9.0/e7a3aacd2ab7d135fa958e7209d10b1fa03807d44c486e3c34897aa08ea8ffab)


DatasetDict({
    train: Dataset({
        features: ['text', 'toxicity', 'severe_toxicity', 'obscene', 'threat', 'insult', 'identity_attack', 'sexual_explicit'],
        num_rows: 1804874
    })
    validation: Dataset({
        features: ['text', 'toxicity', 'severe_toxicity', 'obscene', 'threat', 'insult', 'identity_attack', 'sexual_explicit'],
        num_rows: 97320
    })
    test: Dataset({
        features: ['text', 'toxicity', 'severe_toxicity', 'obscene', 'threat', 'insult', 'identity_attack', 'sexual_explicit'],
        num_rows: 97320
    })
})

{'identity_attack': 0.0, 'insult': 0.0, 'obscene': 0.0, 'severe_toxicity': 0.0, 'sexual_explicit': 0.0, 'text': "This is so cool. It's like, 'would you want your mother to read this??' Really great idea, well done!", 'threat': 0.0, 'toxicity': 0.0}

{'text': Value(dtype='string', id=None), 'toxicity': Value(dtype='float32', id=None), 'severe_toxicity': Value(dtype='float32', id=None), 'obscene': Value(dtype='float32', id=None), 'thr

In [None]:
lbl_cols =  ['identity_attack', 'insult', 'obscene', 'toxicity', 'severe_toxicity', 'sexual_explicit', 'threat']

In [None]:
train_ds = raw_datasets['train'].select(range(10000))
valid_ds = raw_datasets['validation'].select(range(2000))

In [None]:
n_train, n_valid = len(train_ds), len(valid_ds)
train_idxs, valid_idxs = L(range(n_train)), L(range(n_train, n_train + n_valid))
raw_ds = concatenate_datasets([train_ds, valid_ds])

The labels need to be one-hot encoded as ints (the raw data has them as floats). We could also do this kind of preprocessing by passing a `preprocess_func` to our `BlearnerForSequenceClassification` factory method, which is especially useful if the preprocessing depends on one or more of the Hugging Face objects (e.g., config, tokenizer, model, architecture).

In [None]:
def make_ohe(item):
    for k in item.keys():
        if (k in lbl_cols):
            item[k] = int(np.round(item[k]))
    return item

raw_ds = raw_ds.map(make_ohe)

Loading cached processed dataset at /home/wgilliam/.cache/huggingface/datasets/civil_comments/default/0.9.0/e7a3aacd2ab7d135fa958e7209d10b1fa03807d44c486e3c34897aa08ea8ffab/cache-615cf128814440d8.arrow
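
Note that `np.round` rounds halves to the nearest even value, so a score of exactly 0.5 becomes 0. If a different cut-off is wanted, an explicit threshold makes the rule visible. A minimal sketch in plain Python, assuming a hypothetical `binarize_labels` helper and a made-up example row:

```python
def binarize_labels(item, label_cols, threshold=0.5):
    """Return a copy of item with each label column set to 1 if its score
    is >= threshold, else 0. Non-label columns are left untouched."""
    out = dict(item)
    for col in label_cols:
        out[col] = int(out[col] >= threshold)
    return out

# Made-up row for illustration only; real rows come from civil_comments.
row = {'text': 'example comment', 'toxicity': 0.5, 'insult': 0.7, 'threat': 0.0}
print(binarize_labels(row, ['toxicity', 'insult', 'threat']))
```

A function of this shape could be passed to `raw_ds.map` in the same way as `make_ohe`, for example through `functools.partial` to bind `label_cols`.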


In [None]:
dl_kwargs = {'bs': 4, 'val_bs': 8}
learn_kwargs = { 'metrics': [F1ScoreMulti(), accuracy_multi] }

# using a List[dict] such as a Hugging Face dataset
learn = BlearnerForSequenceClassification.from_dictionaries(raw_ds, 'distilroberta-base', 
                                                            text_attr='text', label_attr=lbl_cols,
                                                            dblock_splitter=IndexSplitter(valid_idxs),
                                                            dl_kwargs=dl_kwargs, learner_kwargs=learn_kwargs)
learn = learn.to_fp16()

In [None]:
learn.dls.show_batch(dataloaders=learn.dls, trunc_at=500, max_n=5)

Unnamed: 0,text,target
0,"I have had a question about Einstein's Special Theory of Relativity for some time which scientists all seem to run away from. Until 1887 the equations used for Relativity were the Galilean transformation equations.\n\n x'=x-vt\n y'=y\n z'=z\n t'=t\n\nAfter 1887, scientists threw away the Galilean transfo",[]
1,"I've went to Pizza Jerk three times and to Red Sauce twice. Pizza Jerk's menu is larger and more complicated, and they changed several menu items after the first two, which is why it got that third visit.\n\nThe photos are assigned after all of my visits (so that we don't tip off owners to a coming review) and also after the first draft of my review is written. I consult with the art director about what to take photos of. In this case, yes, I did say to take more shots of Pizza Jerk.\n\nA few days",[]
2,"The problem is that tens of millions of adult Americans work for less than a living wage. I work in a production environment. My job doesn't require a degree, but we have to do math, and we have to be focused. There are no teenagers at my workplace. Most of us are middle-aged, experienced people with a good work ethic. When someone comes in who's not so bright, or has bad work habits, they get cycled out very quickly. Yet most of us earn less than $12 per hour. Our economy has devolved t",[]
3,"Craig, light rail DOES NOT RELIEVE CONGESTION - NEVER HAS< NEVER WILL. It carries less than one lane of traffic worth of people at many times the cost. It is a TOY TRAIN, not real transportation. Why don't you explain why people will walk 1/4 mile to a bus stop, wait 5 min, then transfer to a train, wait another 5 minutes in a bad neighborhood to transfer to the second bus that drops you off 1/4 mile from your destination. \nTraveling to work by transit takes, on average, twice the time of drivi",[]


In [None]:
learn.fit_one_cycle(1, lr_max=2e-3)

epoch,train_loss,valid_loss,f1_score,accuracy_multi,time
0,0.033941,0.041363,0.130769,0.987642,01:25




In [None]:
learn.show_results(learner=learn, trunc_at=500, max_n=5)



Unnamed: 0,text,target,prediction
0,"Everyone tries to hack everyone else. I have no doubt Russia would try to hack even canada. However, the US has been doing the same, if we recall Snowden.\n\nEven Merkel's phone conversations were being tapped by the CIA. \n\nThe real purpose of this issue is political. Trump is upset because people are trying to imply that he didn't deserve his victory, that the Russians helped him. It's an ego thing. Good CEOs sometimes have giant egos. I have no problem with that as long as they produce results, I gladly buy shares in their company.\n\nOtoh, Russia did invade Crimea recently, and their missile brought down a commercial airliner and killed lots of innocent people. The world has a right to be annoyed at the Russians.\n\nIf you want to find evidence of Russians hacking, you will find them. But if you want to find China or some guy in a basement somewhere, I have no doubt you can find the same as well. Whether they succeeded or not, that's hard to prove, but there's lots of blackhats",[],[]
1,"Thank you for this article.\nErrata: My father, Dr. James Ford Lewis, was minister of the First Unitarian Church in Portland in '58-'60; he wasn't Portland's ""first Unitarian minister."" The church was founded here in 1867 (http://bit.ly/29ClQCp). \nI met Mr. Ellison at Sacramento City College, not at my own university.\nAt PCS, for ""Astoria"" I am coaching the Shoshoni language, not ""Shoshona,"" which does not exist, and there is no such thing as ""Scotch-Canadian patois."" The Scots-Canadian accent is what I spoke of. The Iowa, Arikara, Hawaiian languages will also be heard, along with another 10+ accents of English. \nMy work as OnStar's voice began with years of work at General Magic, where I recorded tens of thousands of prompts (rather than ""a succession of responses"") for a system, not doomed but premature, that supported 2.5 million users at its peak. (http://bit.ly/29Q7P7a). And I am the longest-working pro voice in *speech recognition,* not in general. That would be some bold claim!",[],[]
2,"Mr. Alali, I am sympathetic to your position and feelings. As a Canadian I hold no ill will towards you or your family relocating to Canada. You should be aware that you and your family have been used as political pawns following the glib and ill conceived election promise made by our Prime Minister to bring 25 thousand of your compatriots to Canada by the end of 2015. It sounds as if Mr McCallum and assistants scoured UN refugee lists in an effort to press gang hapless individuals and coerce them into settling and being shipped to Canada. The expedition of your arrival was made with no regard to the logistics of accommodating vulnerable and traumatized families in a respectful and decent manner following your staged and publicized arrivals. As for your future in Canada I fear you will be lucky to find some subsistence level employment. The chances are that your children, if you allow them to assimilate into the Canadian culture, will thrive and have a rewarding life in here. Good Luck",[],[]
3,"Chittester quotes Biden--an excommunicated Catholic.\nHilliary did not represent Catholics-- which is why Trump won the Catholic vote.\nThese posters represent PC--progressive communists and propaganda suckers. They seem to soak up every piece of propaganda issued by progressive operatives.\nI' m sure they all "" believe"" in AGW and population control as the remedy through abortion, they "" believe"" Adam and Eve are mythological but evolution-- long ago debunked-- is fact, they\n "" belive"" women' s rights and feminism are not really code words for abortion and they \n"" believe"" poor Hilliary missed her chance to make partial birth abortions taxpayer funded, religious rights non- existent, create more world chaos, incite more racial riots and divisive conflict, shut down oil companies and erect windmills, gut the country of any remaining blue collar jobs, ensure no conservative speaks outbin any college classroom, make sure all young college men have no rights when accused by disturbed girls.",[],[]
4,"Regarding ""Lonely Woman"" Yes, involve herself in organizations but perhaps not a church. I enjoy my church and it is important but... Many have gossips. When living in Denver my best friend Linda and I attended together for years and I was a maid of honer in her wedding. When they moved, I attended alone and some wives began speculating I was looking for a husband. Some of the single guys heard variations of that speculation. During service a guy slid across to me and began flirting. I whispered loud enough for all around me to hear, ""Who are you! Get lost!"" The minister heard. He took a second to force down a laugh and nodded at me. The guy slithered away. The gossips saw and were nice to me afterward but that was my last day. I still saw friends who assured me it was only a few bad apples. Point being to ""Lonely Woman"" Some churches have gossip mongrels who ruin it for young, decent women. Here's an oldie but goodie for you gossips. ""Do not bare false witness against your neighbor.""",[],[]


## Token Classification

In [None]:
#hide
try: del learn; torch.cuda.empty_cache()
except: pass

In [None]:
raw_datasets = load_dataset('germeval_14') 
print(f'{raw_datasets}\n')
print(f'{raw_datasets["train"][0]}\n')
print(f'{raw_datasets["train"].features}\n')

Reusing dataset germ_eval14 (/home/wgilliam/.cache/huggingface/datasets/germ_eval14/germeval_14/2.0.0/0f174b84866aa3b8ebae65c271610520be4422405d7e8467bd24cfd493d325f0)


DatasetDict({
    train: Dataset({
        features: ['id', 'source', 'tokens', 'ner_tags', 'nested_ner_tags'],
        num_rows: 24000
    })
    validation: Dataset({
        features: ['id', 'source', 'tokens', 'ner_tags', 'nested_ner_tags'],
        num_rows: 2200
    })
    test: Dataset({
        features: ['id', 'source', 'tokens', 'ner_tags', 'nested_ner_tags'],
        num_rows: 5100
    })
})

{'id': '0', 'ner_tags': [19, 0, 0, 0, 7, 0, 0, 0, 0, 19, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'nested_ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'source': 'n-tv.de vom 26.02.2005 [2005-02-26] ', 'tokens': ['Schartau', 'sagte', 'dem', '"', 'Tagesspiegel', '"', 'vom', 'Freitag', ',', 'Fischer', 'sei', '"', 'in', 'einer', 'Weise', 'aufgetreten', ',', 'die', 'alles', 'andere', 'als', 'überzeugend', 'war', '"', '.']}

{'id': Value(dtype='string', id=None), 'source': Value(dtype='string', id=None), 'tokens': Sequence(feature=Value(dtype='s

In [None]:
train_ds = raw_datasets['train']#.select(range(1000))
valid_ds = raw_datasets['validation']#.select(range(500))

In [None]:
n_train, n_valid = train_ds.num_rows, valid_ds.num_rows
train_idxs, valid_idxs = L(range(n_train)), L(range(n_train, n_train + n_valid))
raw_ds = concatenate_datasets([train_ds, valid_ds])

We can grab the labels a token can be associated with, as we do here, or we can let the `BlearnerForTokenClassification` factory methods figure them out for us.

In [None]:
labels = train_ds.features['ner_tags'].feature.names
len(labels)

25

As we need to pass the tag (not the index) for each example's tokens in a list, we use the handy `Dataset.map` method to create a new attribute, "token_labels", with that data. This could also be done by passing a `preprocess_func` to a `BlearnerForTokenClassification` factory method, which is especially useful if we need one or more of the Hugging Face objects (e.g., tokenizer, model, config, or architecture name).

In [None]:
def get_item_labels(example):
    example['token_labels'] = [ labels[tag_idx] for tag_idx in example['ner_tags'] ]
    return example
                         
raw_ds = raw_ds.map(get_item_labels)

  0%|          | 0/26200 [00:00<?, ?ex/s]
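
The same index-to-name lookup can be tried without downloading the dataset. The sketch below is plain Python and uses a shortened, hypothetical label list and a made-up example (the real list has 25 entries); `add_token_labels` is an illustrative helper, not a Blurr or datasets function:

```python
# Hypothetical, shortened label list for illustration only.
label_names = ['O', 'B-LOC', 'I-LOC', 'B-PER', 'I-PER']

def add_token_labels(example, names, tags_attr='ner_tags', out_attr='token_labels'):
    """Return a copy of example with an extra list of tag names,
    one per tag index found in example[tags_attr]."""
    out = dict(example)
    out[out_attr] = [names[idx] for idx in example[tags_attr]]
    return out

# Made-up example: one tag index per token.
example = {'tokens': ['Jane', 'Doe', 'visits', 'Berlin'], 'ner_tags': [3, 4, 0, 1]}
print(add_token_labels(example, label_names)['token_labels'])
```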

In [None]:
learn = BlearnerForTokenClassification.from_dictionaries(raw_ds, 'bert-base-multilingual-cased', 
                                                         tokens_attr='tokens', token_labels_attr='token_labels', 
                                                         labels=labels, dblock_splitter=IndexSplitter(valid_idxs), 
                                                         dl_kwargs={'bs':2})

learn.unfreeze()
fit_cbs = [HF_TokenClassMetricsCallback()]

In [None]:
learn.dls.show_batch(dataloaders=learn.dls, max_n=2)

Unnamed: 0,token / target label
0,"[('Andere', 'O'), ('Albumtitel', 'O'), ('sind', 'O'), ('an', 'O'), ('bekannte', 'O'), ('Begriffe', 'O'), ('angelehnt', 'O'), (':', 'O'), ('Fettes', 'B-OTH'), ('Brot', 'I-OTH'), ('für', 'I-OTH'), ('die', 'I-OTH'), ('Welt', 'I-OTH'), ('(', 'O'), ('Brot', 'O'), ('für', 'O'), ('die', 'O'), ('Welt', 'O'), ('),', 'O'), ('Auf', 'O'), ('einem', 'B-OTH'), ('Auge', 'I-OTH'), ('blöd', 'I-OTH'), ('(', 'I-OTH'), ('„', 'O'), ('Auf', 'O'), ('einem', 'O'), ('Auge', 'O'), ('blind', 'O'), ('),', 'O'), ('Am', 'O'), ('Wasser', 'O'), ('gebaut', 'O'), ('(', 'B-OTH'), ('„', 'I-OTH'), ('Nah', 'I-OTH'), ('am', 'O'), ('Wasser', 'O'), ('gebaut', 'O'), (')', 'O'), ('und', 'O'), ('Strom', 'O'), ('und', 'O'), ('Drang', 'O'), ('(', 'O'), ('„', 'B-OTH'), ('Sturm', 'I-OTH'), ('und', 'I-OTH'), ('Drang', 'O'), (').', 'O')]"
1,"[('Das', 'O'), ('Spiel', 'O'), ('Während', 'O'), ('der', 'O'), ('drei', 'O'), ('Jahrtausende,', 'O'), ('in', 'O'), ('denen', 'O'), ('das', 'O'), ('Spiel', 'O'), ('gespielt', 'O'), ('wurde,', 'O'), ('änderten', 'O'), ('sich', 'O'), ('die', 'O'), ('Regeln', 'O'), ('immer', 'O'), ('wieder', 'O'), (':', 'O'), ('Die', 'O'), ('Zahl', 'O'), ('der', 'O'), ('teilnehmenden', 'O'), ('Spieler,', 'O'), ('die', 'O'), ('Körperstellen', 'O'), ('mit', 'O'), ('denen', 'O'), ('der', 'O'), ('Ball', 'O'), ('berührt', 'O'), ('werden', 'O'), ('durfte,', 'O'), ('die', 'O'), ('Folgen', 'O'), ('eines', 'O'), ('verlorenen', 'O'), ('Spieles.', 'O')]"


In [None]:
learn.fit_one_cycle(1, lr_max= 3e-5, moms=(0.8,0.7,0.8), cbs=fit_cbs)

epoch,train_loss,valid_loss,accuracy,precision,recall,f1,time
0,0.07718,0.062497,0.979905,0.855273,0.826228,0.8405,16:19




In [None]:
learn.show_results(learner=learn, max_n=2, trunc_at=10)

Unnamed: 0,token / target label / predicted label
0,"[('Darüber', 'O', 'O'), ('hinaus', 'O', 'O'), ('produziert', 'O', 'O'), ('der', 'O', 'O'), ('hr', 'B-ORG', 'O'), ('allein', 'O', 'B-ORG'), ('oder', 'O', 'O'), ('federführend', 'O', 'O'), ('mit', 'O', 'O'), ('anderen', 'O', 'O')]"
1,"[('Dass', 'O', 'O'), ('eine', 'O', 'O'), ('interne', 'O', 'O'), ('Lösung,', 'O', 'O'), ('für', 'O', 'O'), ('die', 'O', 'O'), ('sich', 'O', 'O'), ('Benjamin', 'O', 'O'), ('Schwarz', 'B-PER', 'O'), ('und', 'I-PER', 'B-PER')]"


In [None]:
print(learn.token_classification_report)

              precision    recall  f1-score   support

         LOC       0.91      0.87      0.89       798
    LOCderiv       0.90      0.83      0.86       256
     LOCpart       0.62      0.73      0.67        44
         ORG       0.81      0.73      0.77       549
    ORGderiv       0.00      0.00      0.00         0
     ORGpart       0.79      0.75      0.77        96
         OTH       0.68      0.69      0.69       264
    OTHderiv       0.69      0.73      0.71        15
     OTHpart       0.50      0.64      0.56        14
         PER       0.94      0.92      0.93       725
    PERderiv       0.00      0.00      0.00         0
     PERpart       0.11      0.29      0.16         7

   micro avg       0.86      0.83      0.84      2768
   macro avg       0.58      0.60      0.58      2768
weighted avg       0.86      0.83      0.84      2768



`Learner.blurr_predict_tokens` works here too

In [None]:
txt = "I live in California, but I'd love to travel to Scotland and visit the Macallan distillery."
txt2 = "Jane Doe loves working for ohmeow.com."

In [None]:
res = learn.blurr_predict_tokens([txt.split(), txt2.split()])
for r in res: print(f'{[(tok, lbl) for tok,lbl in zip(r[0],r[1]) ]}\n')

[('I', 'O'), ('live', 'O'), ('in', 'O'), ('California,', 'B-LOC'), ('but', 'O'), ("I'd", 'O'), ('love', 'O'), ('to', 'O'), ('travel', 'O'), ('to', 'O'), ('Scotland', 'B-LOC'), ('and', 'O'), ('visit', 'O'), ('the', 'O'), ('Macallan', 'B-ORG'), ('distillery.', 'O')]

[('Jane', 'B-PER'), ('Doe', 'I-PER'), ('loves', 'O'), ('working', 'O'), ('for', 'O'), ('ohmeow.com.', 'B-OTH')]
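
The (token, label) pairs above use the B-/I-/O tagging scheme, so consecutive tokens of one entity can be merged afterwards. As a rough sketch in plain Python (the `group_entities` helper is hypothetical and not part of Blurr), one way to do that is:

```python
def group_entities(tagged_tokens):
    """Merge (token, tag) pairs in B-/I-/O format into (entity_text, entity_type) spans."""
    entities, current_tokens, current_type = [], [], None
    for token, tag in tagged_tokens:
        starts_new = tag.startswith('B-') or (tag.startswith('I-') and tag[2:] != current_type)
        if starts_new:
            # Close any open entity before starting the next one.
            if current_tokens:
                entities.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith('I-'):
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # An 'O' tag closes any open entity.
            if current_tokens:
                entities.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((' '.join(current_tokens), current_type))
    return entities

# Pairs copied from the second prediction above.
tagged = [('Jane', 'B-PER'), ('Doe', 'I-PER'), ('loves', 'O'),
          ('working', 'O'), ('for', 'O'), ('ohmeow.com.', 'B-OTH')]
print(group_entities(tagged))
```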



## Question Answering

In [None]:
#hide
try: del learn; torch.cuda.empty_cache()
except: pass

In [None]:
raw_datasets = load_dataset('squad_v2')
print(f'{raw_datasets}\n')
print(f'{raw_datasets["train"][0]}\n')
print(f'{raw_datasets["train"].features}\n')

Reusing dataset squad_v2 (/home/wgilliam/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/ba48bc29b974701e9ba8d80ac94f3e3df924aba41b764dcf9851debea7c672e4)


DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 130319
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11873
    })
})

{'answers': {'answer_start': [269], 'text': ['in the late 1990s']}, 'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-

In [None]:
train_ds = raw_datasets['train'].select(range(1000))

We use the `preprocess_func` here because the preprocessing depends on the Hugging Face tokenizer, which varies with the pretrained model we use for the task.

In [None]:
def preprocess_ds(ds, hf_arch, hf_config, hf_tokenizer, hf_model, max_seq_len, 
                  context_attr, question_attr, answer_text_attr, tok_ans_start, tok_ans_end):
    
    def _preprocess(item):
        tok_kwargs = {}
        if(hf_tokenizer.padding_side == 'right'):
            tok_input = hf_tokenizer.convert_ids_to_tokens(hf_tokenizer.encode(item[question_attr], item[context_attr]), 
                                                           **tok_kwargs)
        else:
            tok_input = hf_tokenizer.convert_ids_to_tokens(hf_tokenizer.encode(item[context_attr], item[question_attr]), 
                                                           **tok_kwargs)

        tok_ans = hf_tokenizer.tokenize(str(item['answers']['text'][0]), **tok_kwargs)
        
        start_idx, end_idx = 0,0
        
        if(len(tok_input) < max_seq_len):
            for idx, tok in enumerate(tok_input):
                try:
                    if (tok == tok_ans[0] and tok_input[idx:idx + len(tok_ans)] == tok_ans): 
                        start_idx, end_idx = idx, idx + len(tok_ans)
                        break
                except: pass

        item['tokenized_input'] = tok_input
        item['tokenized_input_len'] = len(tok_input)
        item['tok_answer_start'] = start_idx
        item['tok_answer_end'] = end_idx

        return item
    
    ds = ds.map(_preprocess)
    return ds
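
The core of `_preprocess` is a first-match search for the tokenized answer inside the tokenized input, falling back to `(0, 0)` when nothing is found. A minimal sketch of just that step in plain Python, using a hypothetical `find_answer_span` helper and a made-up, already-tokenized input:

```python
def find_answer_span(tok_input, tok_ans):
    """Return (start, end) of the first occurrence of tok_ans in tok_input,
    with end exclusive, or (0, 0) if there is no match."""
    n = len(tok_ans)
    if n == 0:
        return 0, 0
    for idx in range(len(tok_input) - n + 1):
        if tok_input[idx:idx + n] == tok_ans:
            return idx, idx + n
    return 0, 0

# Made-up tokens for illustration; real ones come from the Hugging Face tokenizer.
tok_input = ['[CLS]', 'when', '?', '[SEP]', 'in', 'the', 'late', '1990s', '[SEP]']
print(find_answer_span(tok_input, ['the', 'late', '1990s']))
```

Checking for an empty answer up front avoids relying on a bare `except` to swallow the `IndexError` that `tok_ans[0]` would otherwise raise.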

In [None]:
pretrained_model_name = 'bert-large-uncased-whole-word-masking-finetuned-squad'

learn = BlearnerForQuestionAnswering.from_dataframe(train_ds, pretrained_model_name,
                                                    preprocess_func=preprocess_ds, max_seq_len=256,
                                                    dblock_splitter=RandomSplitter(), dl_kwargs={ 'bs': 4 })
learn = learn.to_fp16()

  0%|          | 0/1000 [00:00<?, ?ex/s]

In [None]:
learn.dls.show_batch(dataloaders=learn.dls, max_n=2, trunc_at=500)

Unnamed: 0,text,start/end,answer
0,"which prominent star felt the 2009 female video of the year award should have went to beyonce instead of taylor swift? on april 4, 2008, beyonce married jay z. she publicly revealed their marriage in a video montage at the listening party for her third studio album, i am... sasha fierce, in manhattan's sony club on october 22, 2008. i am... sasha fierce was released on november 18, 2008 in the united states. the album formally introduces beyonce's alter ego sasha fierce, conceived during the mak","(0, 0)",
1,"her third album, "" i am... sasha fierce "" was released when? on april 4, 2008, beyonce married jay z. she publicly revealed their marriage in a video montage at the listening party for her third studio album, i am... sasha fierce, in manhattan's sony club on october 22, 2008. i am... sasha fierce was released on november 18, 2008 in the united states. the album formally introduces beyonce's alter ego sasha fierce, conceived during the making of her 2003 single "" crazy in love "", selling 482, 000","(0, 0)",


In [None]:
learn.fit_one_cycle(1, lr_max=1e-3)

epoch,train_loss,valid_loss,time
0,2.205399,1.983614,00:46


In [None]:
learn.show_results(learner=learn, skip_special_tokens=True, max_n=2, trunc_at=500)

Unnamed: 0,text,start/end,answer,pred start/end,pred answer
0,"who released the single girls love beyonce? her debut single, "" crazy in love "" was named vh1's "" greatest song of the 2000s "", nme's "" best track of the 00s "" and "" pop song of the century "", considered by rolling stone to be one of the 500 greatest songs of all time, earned two grammy awards and is one of the best - selling singles of all time at around 8 million copies. the music video for "" single ladies ( put a ring on it ) "", which achieved fame for its intricate choreography and its deplo","(0, 0)",,"(171, 172)",drake
1,"who said that she is the reigning national voice? in the new yorker music critic jody rosen described beyonce as "" the most important and compelling popular musician of the twenty - first century..... the result, the logical end point, of a century - plus of pop. "" when the guardian named her artist of the decade, llewyn - smith wrote, "" why beyonce? [... ] because she made not one but two of the decade's greatest singles, with crazy in love and single ladies ( put a ring on it ), not to mention","(0, 0)",,"(0, 0)",


## Language modeling

In [None]:
#hide
try: del learn; torch.cuda.empty_cache()
except: pass

In [None]:
raw_datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')
print(f'{raw_datasets}\n')
print(f'{raw_datasets["train"][0]}\n')
print(f'{raw_datasets["train"].features}\n')

Reusing dataset wikitext (/home/wgilliam/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20)


DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['text'],
        num_rows: 36718
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})

{'text': ''}

{'text': Value(dtype='string', id=None)}



In [None]:
train_ds = raw_datasets['train'].select(range(1000))
valid_ds = raw_datasets['validation'].select(range(1000))

In [None]:
n_train, n_valid = train_ds.num_rows, valid_ds.num_rows
train_idxs, valid_idxs = L(range(n_train)), L(range(n_train, n_train + n_valid))
raw_ds = concatenate_datasets([train_ds, valid_ds])
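The index bookkeeping above relies on the concatenated dataset keeping the training rows first and the validation rows second, so that `valid_idxs` can be handed to `IndexSplitter`. A minimal sketch of that idea with plain lists standing in for the datasets (the `combined` list is a stand-in, not a real `Dataset`):

```python
n_train, n_valid = 1000, 1000

# Training rows occupy positions [0, n_train); validation rows follow them
train_idxs = list(range(n_train))
valid_idxs = list(range(n_train, n_train + n_valid))

# Stand-in for concatenate_datasets([train_ds, valid_ds])
combined = ['train'] * n_train + ['valid'] * n_valid

# Every index lands on a row from the expected split, and the two index
# sets cover the combined data without overlapping
train_ok = all(combined[i] == 'train' for i in train_idxs)
valid_ok = all(combined[i] == 'valid' for i in valid_idxs)
disjoint = set(train_idxs).isdisjoint(valid_idxs)
print(train_ok, valid_ok, disjoint)
```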

In [None]:
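def remove_empty_text(example):
    # Despite the name, no rows are dropped: empty or whitespace-only text
    # is replaced with a two-space placeholder
    if (example['text'].strip() == ''): example['text'] = '  '
    return example

raw_ds = raw_ds.map(remove_empty_text)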

  0%|          | 0/2000 [00:00<?, ?ex/s]

### Causal language modeling

In [None]:
learn = BlearnerForLM.from_dictionaries(raw_ds, 'gpt2', text_attr='text', 
                                        lm_strategy_cls=CausalLMStrategy,
                                        dblock_splitter=IndexSplitter(valid_idxs), 
                                        dl_kwargs={'bs':2}).to_fp16()

  warn(f"You are shadowing an attribute ({name}) that exists in the learner. Use `self.learn.{name}` to avoid this")
Using pad_token, but it is not set yet.


In [None]:
learn.dls.show_batch(dataloaders=learn.dls, max_n=2, trunc_at=250)

Unnamed: 0,text,target
0,"A lookout aboard Weehawken spotted Atlanta at 04 : 10 on the morning of 17 June. When the latter ship closed to within about 1 @.@ 5 miles ( 2 @.@ 4 km ) of the two Union ships, she fired one round from her bow gun that passed over Weehawken and lan","lookout aboard Weehawken spotted Atlanta at 04 : 10 on the morning of 17 June. When the latter ship closed to within about 1 @.@ 5 miles ( 2 @.@ 4 km ) of the two Union ships, she fired one round from her bow gun that passed over Weehawken and lande"
1,"N @-@ 88 starts at the Nebraska – Wyoming state line in Banner County, where WYO 151 ends, and travels northeast. The road quickly bends east after less than one mile ( 1 @.@ 6 km ), and continues in a straight line. For the next twenty miles ( 32 k","@-@ 88 starts at the Nebraska – Wyoming state line in Banner County, where WYO 151 ends, and travels northeast. The road quickly bends east after less than one mile ( 1 @.@ 6 km ), and continues in a straight line. For the next twenty miles ( 32 km"
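In the batch above, each target is the input text moved forward by one token, which is the next-token objective of causal language modeling. A rough sketch of that relationship follows. It uses whole words as tokens for readability, whereas GPT-2 actually uses byte-pair subword tokens, and it makes no claim about where Blurr or the model performs the shift internally.

```python
# Word-level tokens for illustration only (an assumption; GPT-2 uses BPE)
tokens = ['A', 'lookout', 'aboard', 'Weehawken', 'spotted', 'Atlanta']

# At each position the model sees the input token and must predict the next one
inputs = tokens[:-1]
targets = tokens[1:]

for x, y in zip(inputs, targets):
    print(f'{x!r:>12} -> {y!r}')
```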


In [None]:
learn.fit_one_cycle(1, lr_max=3e-4, cbs=[BlearnerForLM.get_metrics_cb()])

epoch,train_loss,valid_loss,perplexity,lm_accuracy,time
0,3.796121,4.396461,81.163162,0.276044,00:40
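The `perplexity` column is consistent with the exponential of the validation loss, which is the usual definition when the loss is the mean cross-entropy per token. A quick check against the numbers reported above:

```python
import math

valid_loss = 4.396461          # valid_loss reported for the epoch above
reported_perplexity = 81.163162

# Perplexity as exp(mean cross-entropy loss)
perplexity = math.exp(valid_loss)
print(round(perplexity, 2), reported_perplexity)
```

The two values agree to within rounding of the printed loss.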


In [None]:
learn.show_results(learner=learn, max_n=2, trunc_at=500)

Unnamed: 0,text,target,prediction
0,"Meridian is rightly considered an architectural treasure trove being one the nations most intact cities from the turn of the last century. Architecture students from around the nation and Canada are known to visit Meridian in groups as part of their coursework due to numerous structures in the city having been designed by noted architects. The only home in the US south designed by noted Canadian born Architect Louis S. Curtiss, famous for inventing the glass curtain wall skyscraper, is extant o","is rightly considered an architectural treasure trove being one the nations most intact cities from the turn of the last century. Architecture students from around the nation and Canada are known to visit Meridian in groups as part of their coursework due to numerous structures in the city having been designed by noted architects. The only home in the US south designed by noted Canadian born Architect Louis S. Curtiss, famous for inventing the glass curtain wall skyscraper, is extant on Highlan","\n a the.. of most and in the time of the century century.\n is are the the world are around are rightly for be the. their of well of the academic.. to the historical and the city. been built by the architect and\n city of the city of of by renowned architects architect and anda.Siss. and for hising the first and,,raper, famous well in the Park, The onlyfort, by Mile and, a considered an of the most buildingsistico buildingsrap in the world. is generally considered to the'ss Three Threeman. The onl"
1,"Headlam became Officer Commanding North @-@ Western Area in January 1946. Posted to Britain at the end of the year, he attended the Royal Air Force Staff College, Andover, and served with RAAF Overseas Headquarters, London. On his return to Australia, in November 1947, he became Director of Training at RAAF Headquarters. In November 1950, Headlam was appointed to take over command of No. 90 ( Composite ) Wing from Group Captain Paddy Heffernan. Headquartered at RAF Changi, Singapore, No. 90 Win","lam became Officer Commanding North @-@ Western Area in January 1946. Posted to Britain at the end of the year, he attended the Royal Air Force Staff College, Andover, and served with RAAF Overseas Headquarters, London. On his return to Australia, in November 1947, he became Director of Training at RAAF Headquarters. In November 1950, Headlam was appointed to take over command of No. 90 ( Composite ) Wing from Group Captain Paddy Heffernan. Headquartered at RAF Changi, Singapore, No. 90 Wing con","of, ofing Officer AmericaThe\n-\n of the of.\n by the by the end of the war. and was the Royal Navy Force Academy, in andres, and the as the.,as Air, and, Posted the return to the, he 1946, he was Officer of the, theAAF,, On 1947, he of became appointed Director the charge the of the. 1 SquadronAir ) of, the 1,. (berternan, Inlam in R Headquartersi, China. and. 90 (, byAAF,, in the waray War.\n. 90, Composite ),, No theid,.,ns, flying flying ( Bomber ) Squadron, flying Av A--,, (aras,, aircraftinco"


`Learner.blurr_generate` works here too

In [None]:
learn.blurr_generate('Blurr is fun to work with because', max_length=50, do_sample=True, top_k=25)

[' Blurr is fun to work with because , you are the best\n \xa0and you   you are a good friend  \xa0we \xa0have  \xa0we \xa0have \xa0we ']

### Masked language modeling

In [None]:
#hide
try: del learn; torch.cuda.empty_cache()
except: pass

In [None]:
learn = BlearnerForLM.from_dictionaries(raw_ds, 'bert-base-uncased', text_attr='text', 
                                        lm_strategy_cls=BertMLMStrategy,
                                        dblock_splitter=IndexSplitter(valid_idxs), 
                                        dl_kwargs={'bs':2}).to_fp16()

In [None]:
learn.fit_one_cycle(1, lr_max=3e-4, cbs=[BlearnerForLM.get_metrics_cb()])

epoch,train_loss,valid_loss,perplexity,lm_accuracy,time
0,0.869442,0.880874,2.413007,0.662066,00:39


In [None]:
learn.show_results(learner=learn, max_n=2, trunc_at=500)

Unnamed: 0,text,target,prediction
0,"meridian is right ##ly [MASK] an architectural treasure tr ##ove being [MASK] the nations most intact [MASK] from the turn [MASK] the last century . architecture students [MASK] around the nation and canada are known to visit [MASK] in [groups] as part [MASK] their course [MASK] due to numerous structures in the city having been designed by noted architects . the [MASK] home in the us south designed by [MASK] canadian born architect louis s . [MASK] , famous [MASK] in ##venting the glass [infrared] wall skyscraper , is extant on highland park [MASK] the frank fort designed three ##foot [MASK] is generally considered one of the best [MASK] [MASK] skyscraper ##s in the us and is often [MASK] to detroit ' s famed fisher building . noted california architect wallace ne [judgments] designed a number of homes in meridian as well as in the alabama black [MASK] which ad ##jo ##ins the city across the nearby alabama state line . he had relatives in meridian and selma who [switch] executives in the [MASK] thriving railroad industry and [MASK] take [MASK] in the area when [MASK] in california were lean . his work is mostly concentrated in the lower numbered blocks of pop ##lar springs [MASK] where his [MASK] ##6 pop ##lar [MASK] drive is [often] compared to the similarly designed falcon lair , the [MASK] hills home in benedict canyon of [MASK] valentin [MASK] . one ne ##ff work was [MASK] to an expansion of anderson [MASK] in 1990 and another in marion park [burned] in the [MASK] . 
the meridian post office with its interior done entirely of bronze and [MASK] marble [MASK] [MASK] noteworthy as a very fine example of the type of post office structures built in thriving and well to [MASK] cities [MASK] [MASK] 1920s and originally had [MASK] ##ique lighting which was [removed] sadly during [MASK] [driving] re ##mo ##del ##ing [and] which are now in private residences on [MASK] [manager] springs drive and in north hills .","meridian is right ##ly [considered] an architectural treasure tr ##ove being [one] the nations most intact [cities] from the turn [of] the last century . architecture students [from] around the nation and canada are known to visit [meridian] in [groups] as part [of] their course [##work] due to numerous structures in the city having been designed by noted architects . the [only] home in the us south designed by [noted] canadian born architect louis s . [curtiss] , famous [for] in ##venting the glass [curtain] wall skyscraper , is extant on highland park [.] the frank fort designed three ##foot [building] is generally considered one of the best [art] [deco] skyscraper ##s in the us and is often [compared] to detroit ' s famed fisher building . noted california architect wallace ne [##ff] designed a number of homes in meridian as well as in the alabama black [belt] which ad ##jo ##ins the city across the nearby alabama state line . he had relatives in meridian and selma who [were] executives in the [then] thriving railroad industry and [would] take [commissions] in the area when [commissions] in california were lean . his work is mostly concentrated in the lower numbered blocks of pop ##lar springs [drive] where his [251] ##6 pop ##lar [springs] drive is [often] compared to the similarly designed falcon lair , the [beverly] hills home in benedict canyon of [rudolph] valentin [##o] . one ne ##ff work was [lost] to an expansion of anderson [hospital] in 1990 and another in marion park [burned] in the [1950s] . 
the meridian post office with its interior done entirely of bronze and [verde] marble [is] [also] noteworthy as a very fine example of the type of post office structures built in thriving and well to [do] cities [in] [the] 1920s and originally had [lal] ##ique lighting which was [removed] sadly during [a] [1960s] re ##mo ##del ##ing [and] which are now in private residences on [pop] [##lar] springs drive and in north hills .","meridian is right ##ly [considered] an architectural treasure tr ##ove being [among] the nations most intact [homes] from the turn [of] the last century . architecture students [from] around the nation and canada are known to visit [meridian] in [groups] as part [of] their course [,] due to numerous structures in the city having been designed by noted architects . the [first] home in the us south designed by [the] canadian born architect louis s . [smith] , famous [for] in ##venting the glass [-] wall skyscraper , is extant on highland park [.] the frank fort designed three ##foot [building] is generally considered one of the best [high] [brick] skyscraper ##s in the us and is often [compared] to detroit ' s famed fisher building . noted california architect wallace ne [##ff] designed a number of homes in meridian as well as in the alabama black [hills] which ad ##jo ##ins the city across the nearby alabama state line . he had relatives in meridian and selma who [were] executives in the [once] thriving railroad industry and [would] take [residence] in the area when [conditions] in california were lean . his work is mostly concentrated in the lower numbered blocks of pop ##lar springs [drive] where his [66] ##6 pop ##lar [springs] drive is [often] compared to the similarly designed falcon lair , the [north] hills home in benedict canyon of [the] valentin [##o] . one ne ##ff work was [donated] to an expansion of anderson [park] in 1990 and another in marion park [burned] in the [1990s] . 
the meridian post office with its interior done entirely of bronze and [white] marble [,] [is] noteworthy as a very fine example of the type of post office structures built in thriving and well to [do] cities [in] [the] 1920s and originally had [angel] ##ique lighting which was [removed] sadly during [the] [drive] re ##mo ##del ##ing [and] which are now in private residences on [both] [spring] springs drive and in north hills ."
1,"during january 1942 , [MASK] . 2 squadron ' [MASK] aircraft [MASK] dispersed at pen [##fu] ##i [MASK] boer [MASK] island , and darwin . the pen ##fu [MASK] detachment [MASK] japanese shipping taking part in [[unused60]] invasion of ce ##le ##bes . two hudson ##s shot down or [MASK] three japanese [MASK] ##planes that attacked them as [MASK] [MASK] bombing a transport ship on 11 [MASK] [MASK] [MASK] [resentment] day both hudson ##s were themselves shot down by mitsubishi zero ##s . pen ##fu ##i was bombed by the [japanese] for [MASK] first time on 26 january 1942 , [and] attacked regularly thereafter , [MASK] some [MASK] [MASK] the intact hudson ##s were [MASK] to darwin but head [MASK] and his staff remained at pen ##fu ##i [MASK] [MASK] the base to [MASK] [MASK] by aircraft during reconnaissance missions from australia [MASK] on 18 february , [MASK] ##lam was ordered to evacuate [MASK] his personnel except a [MASK] party to demo ##lish the [MASK] with assistance from sparrow force . he returned to darwin the following day , just as the [MASK] experienced its first raid by the [MASK] [MASK] four of no [##mage] 2 squadron ' [##ona] hudson ##s were destroyed in the attack ; the [MASK] were relocated [to] daly [MASK] , [where] [MASK] [MASK] to carry out [MASK] and bombing [MASK] [MASK] [##lette] targets in [MASK] .","during january 1942 , [no] . 2 squadron ' [s] aircraft [were] dispersed at pen [##fu] ##i [,] boer [##oe] island , and darwin . the pen ##fu [##i] detachment [attacked] japanese shipping taking part in [the] invasion of ce ##le ##bes . two hudson ##s shot down or [damaged] three japanese [float] ##planes that attacked them as [they] [were] bombing a transport ship on 11 [january] [;] [the] [next] day both hudson ##s were themselves shot down by mitsubishi zero ##s . pen ##fu ##i was bombed by the [japanese] for [the] first time on 26 january 1942 , [and] attacked regularly thereafter , [damaging] some [aircraft] [.] 
the intact hudson ##s were [withdrawn] to darwin but head [##lam] and his staff remained at pen ##fu ##i [to] [enable] the base to [be] [used] by aircraft during reconnaissance missions from australia [.] on 18 february , [head] ##lam was ordered to evacuate [all] his personnel except a [small] party to demo ##lish the [airfield] with assistance from sparrow force . he returned to darwin the following day , just as the [city] experienced its first raid by the [japanese] [.] four of no [.] 2 squadron ' [s] hudson ##s were destroyed in the attack ; the [remainder] were relocated [to] daly [waters] , [where] [they] [continued] to carry out [reconnaissance] and bombing [missions] [against] [japanese] targets in [timor] .","during january 1942 , [no] . 2 squadron ' [s] aircraft [were] dispersed at pen [##fu] ##i [,] boer [##e] island , and darwin . the pen ##fu [##i] detachment [attacked] japanese shipping taking part in [the] invasion of ce ##le ##bes . two hudson ##s shot down or [destroyed] three japanese [float] ##planes that attacked them as [well] [as] bombing a transport ship on 11 [january] [,] [,] [the] day both hudson ##s were themselves shot down by mitsubishi zero ##s . pen ##fu ##i was bombed by the [japanese] for [the] first time on 26 january 1942 , [and] attacked regularly thereafter , [and] some [of] [of] the intact hudson ##s were [moved] to darwin but head [##lam] and his staff remained at pen ##fu ##i [,] [allowing] the base to [be] [attacked] by aircraft during reconnaissance missions from australia [.] on 18 february , [head] ##lam was ordered to evacuate [all] his personnel except a [small] party to demo ##lish the [base] with assistance from sparrow force . he returned to darwin the following day , just as the [base] experienced its first raid by the [japanese] [.] four of no [.] 
2 squadron ' [s] hudson ##s were destroyed in the attack ; the [rest] were relocated [to] daly [city] , [where] [they] [continued] to carry out [reconnaissance] and bombing [the] [col] [##ore] targets in [darwin] ."
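The bracketed tokens in the results above show three kinds of corruption: `[MASK]` replacements, random replacements (for example `[infrared]`), and tokens left unchanged (for example `[groups]`). That pattern matches the masking scheme described for BERT, where roughly 15% of tokens are selected and, of those, about 80% become the mask token, 10% become a random token, and 10% stay as they are. The sketch below illustrates that scheme in plain Python. The function `bert_mask`, its parameters, and the tiny vocabulary are hypothetical, and `BertMLMStrategy` may differ in its details.

```python
import random


def bert_mask(tokens, vocab, mask_token='[MASK]', mlm_prob=0.15, seed=0):
    """Apply BERT-style masking; return (corrupted_tokens, labels).

    labels[i] holds the original token where position i was selected for
    prediction, and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mlm_prob:
            labels.append(tok)
            r = rng.random()
            if r < 0.8:
                corrupted.append(mask_token)          # ~80%: mask token
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))   # ~10%: random token
            else:
                corrupted.append(tok)                 # ~10%: unchanged
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels


# Hypothetical token sequence and vocabulary (illustration only)
sample = 'meridian is rightly considered an architectural treasure'.split() * 10
toy_vocab = ['drive', 'park', 'city', 'often']

corrupted, labels = bert_mask(sample, toy_vocab)
n_selected = sum(label is not None for label in labels)
print(f'{n_selected} of {len(sample)} tokens selected for prediction')
```

Only the selected positions contribute to the loss, which is why the `lm_accuracy` reported for masked language modeling is computed over far fewer tokens than in the causal case.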


In [None]:
tfm = first_blurr_tfm(learn.dls)

`Learner.blurr_fill_mask` works here too

In [None]:
learn.blurr_fill_mask(f'Blurr is a {tfm.hf_tokenizer.mask_token}.', n_preds=5)

['Blurr is a joke.',
 'Blurr is a word.',
 'Blurr is a name.',
 'Blurr is a metaphor.',
 'Blurr is a term.']

## Summarization

In [None]:
#hide
try: del learn; torch.cuda.empty_cache()
except: pass

In [None]:
raw_datasets = load_dataset("cnn_dailymail", '3.0.0')
print(f'{raw_datasets}\n')
print(f'{raw_datasets["train"][0]}\n')
print(f'{raw_datasets["train"].features}\n')

Reusing dataset cnn_dailymail (/home/wgilliam/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/3cb851bf7cf5826e45d49db2863f627cba583cbc32342df7349dfe6c38060234)


DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})

{'article': 'It\'s official: U.S. President Barack Obama wants lawmakers to weigh in on whether to use military force in Syria. Obama sent a letter to the heads of the House and Senate on Saturday night, hours after announcing that he believes military action against Syrian targets is the right step to take over the alleged use of chemical weapons. The proposed legislation from Obama asks Congress to approve the use of military force "to deter, disrupt, prevent and degrade the potential for future uses of chemical weapons or other weapons of mass destruction." It\'s a step that is set to turn an international crisis into a fierce domestic political battle. The

In [None]:
train_ds = raw_datasets['train'].select(range(1000))
valid_ds = raw_datasets['validation'].select(range(1000))

In [None]:
n_train, n_valid = train_ds.num_rows, valid_ds.num_rows
train_idxs, valid_idxs = L(range(n_train)), L(range(n_train, n_train + n_valid))
raw_ds = concatenate_datasets([train_ds, valid_ds])

In [None]:
learn = BlearnerForSummarization.from_dictionaries(raw_ds, 'facebook/bart-large-cnn', 
                                                   text_attr='article', summary_attr='highlights', 
                                                   max_length=256, max_target_length=130,
                                                   dblock_splitter=IndexSplitter(valid_idxs),
                                                   dl_kwargs={'bs':2}).to_fp16()

In [None]:
learn.dls.show_batch(dataloaders=learn.dls, max_n=2, input_trunc_at=500, target_trunc_at=250)

Unnamed: 0,text,target
0,"<s> (CNN) -- When Ji Yeqing awakened, she was already in the recovery room. Chinese authorities had dragged her out of her home and down four flights of stairs, she said, restraining and beating her husband as he tried to come to her aid. They whisked her into a clinic, held her down on a bed and forced her to undergo an abortion. Her offense? Becoming pregnant with a second child, in violation of China's one-child policy. ""After the abortion, I felt empty, as if something was scooped out of me,","China's one-child policy results in forced abortions and sterilizations, activists say.\nWomen tell of emotional and physical consequences from the procedures.\nActivist Chen Guangcheng works to advocate for victims of such practices."
1,"<s> (CNN) -- The player tumbles to the ground, writhing around as if he has been mortally wounded. Television replays, however, show that his opponent has made no contact at all. It's an ever-increasing sight on football grounds around the world, and -- in the English Premier League, at least -- it's becoming an increasingly emotive issue. Santi Cazorla was labeled a ""con artist"" after his theatricals earned Arsenal a match-turning penalty kick in a game against West Brom on Saturday. Earlier th",Arsenal midfielder Santi Cazorla's theatrics reignite debate about diving in football.\nFormer anti-doping chief gives five reasons why athletes choose to cheat.\nWorld's major sports organizations face major battle to combat sporting fraud.


In [None]:
metrics_cb = BlearnerForSummarization.get_metrics_cb()
learn.fit_one_cycle(1, lr_max=4e-5, cbs=[metrics_cb])

epoch,train_loss,valid_loss,rouge1,rouge2,rougeL,bertscore_precision,bertscore_recall,bertscore_f1,time
0,1.730929,1.827253,0.342983,0.142611,0.240346,0.869126,0.892681,0.880644,12:17


In [None]:
learn.show_results(learner=learn, max_n=2, input_trunc_at=500, target_trunc_at=250)

Unnamed: 0,text,target,prediction
0,"(CNN)Reading the headlines out of Madison, Wisconsin, it's hard not to think about Ferguson, Missouri. But law enforcement's response to the shooting of 19-year-old Tony Robinson will not unfold in the same chaotic, violent and distrusting way as the shooting of 18-year-old Michael Brown, Madison's top police leaders vowed. ""I think it's very clear that Madison, Wisconsin, is not Ferguson, Missouri,"" said Jim Palmer, the executive director of the Wisconsin Professional Police Association. The h",Police officials in Madison say their responses to shooting by officer reflect their role in community.\nOne example: Madison chief talked to teen's family soon after shooting.\nA month went by before Ferguson chief apologized to Brown's family.,"Madison Police Chief Mike Koval has been out front and outspoken about Tony Robinson's shooting .\nThe head of the state's largest law enforcement group says the Madison police department has a strong relationship with the people it serves .\n""I think"
1,"London (CNN)Mohammed Emwazi, the British-Kuwaiti ISIS fighter the world knows as ""Jihadi John,"" was fuming with righteous indignation when he met with a representative of Cage Prisoners -- a Muslim advocacy group now known as CAGE -- shortly after being deported from Tanzania in August 2009. In a meeting recorded by the advocacy group, he claimed his plans for a safari vacation were ruined when he was detained at the airport and sent back first to Amsterdam and then to Dover, England, where he","The man now known as ""Jihadi John"" was on Britain's terror radar for several years.\nCourt documents claim he was a member of a terror recruitment network in London.\nThe papers spell out his connections to other known terrorists in Somalia.","Mohammed Emwazi is the British-Kuwaiti ISIS fighter the world knows as ""Jihadi John"" .\nCourt documents obtained by CNN paint a very different picture of his life .\nEmwazi was on British security services' radar before he made the trip to Tanzania, d"


`Learner.blurr_generate` works here too

In [None]:
test_article = """
About 10 men armed with pistols and small machine guns raided a casino in Switzerland and made off 
into France with several hundred thousand Swiss francs in the early hours of Sunday morning, police said. 
The men, dressed in black clothes and black ski masks, split into two groups during the raid on the Grand Casino 
Basel, Chief Inspector Peter Gill told CNN. One group tried to break into the casino's vault on the lower level 
but could not get in, but they did rob the cashier of the money that was not secured, he said. The second group 
of armed robbers entered the upper level where the roulette and blackjack tables are located and robbed the 
cashier there, he said. As the thieves were leaving the casino, a woman driving by and unaware of what was 
occurring unknowingly blocked the armed robbers' vehicles. A gunman pulled the woman from her vehicle, beat 
her, and took off for the French border. The other gunmen followed into France, which is only about 100 
meters (yards) from the casino, Gill said. There were about 600 people in the casino at the time of the robbery. 
There were no serious injuries, although one guest on the Casino floor was kicked in the head by one of the 
robbers when he moved, the police officer said. Swiss authorities are working closely with French authorities, 
Gill said. The robbers spoke French and drove vehicles with French lRicense plates. CNN's Andreena Narayan 
contributed to this report.
"""

In [None]:
outputs = learn.blurr_generate(test_article, num_return_sequences=3)

for idx, o in enumerate(outputs):
    print(f'=== Prediction {idx+1} ===\n{o}\n')

=== Prediction 1 ===
 About 10 men robbed a casino in Switzerland and made off with several hundred thousand Swiss francs, police say .
The robbers spoke French and drove vehicles with French lRicense plates .
There were no serious injuries, although one guest was kicked in the head by one of the robbers .
A woman driving by unknowingly blocked the robbers' vehicles and was beaten to death .

=== Prediction 2 ===
 About 10 men robbed a casino in Switzerland and made off with several hundred thousand Swiss francs, police say .
The robbers spoke French and drove vehicles with French lRicense plates .
There were no serious injuries, although one guest was kicked in the head by one of the robbers .
A woman driving by unknowingly blocked the robbers' vehicles, and a gunman beat her up .

=== Prediction 3 ===
 About 10 men robbed a casino in Switzerland and made off with several hundred thousand Swiss francs, police say .
The robbers spoke French and drove vehicles with French lRicense plate

## Translation

In [None]:
#hide
try: del learn; torch.cuda.empty_cache()
except: pass

In [None]:
raw_datasets = load_dataset('wmt16', 'de-en')
print(f'{raw_datasets}\n')
print(f'{raw_datasets["train"][0]}\n')
print(f'{raw_datasets["train"].features}\n')

Reusing dataset wmt16 (/home/wgilliam/.cache/huggingface/datasets/wmt16/de-en/1.0.0/0d9fb3e814712c785176ad8cdb9f465fbe6479000ee6546725db30ad8a8b5f8a)


DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 4548885
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 2169
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 2999
    })
})

{'translation': {'de': 'Wiederaufnahme der Sitzungsperiode', 'en': 'Resumption of the session'}}

{'translation': Translation(languages=['de', 'en'], id=None)}



In [None]:
train_ds = raw_datasets['train'].select(range(1000))
valid_ds = raw_datasets['validation'].select(range(1000))

In [None]:
n_train, n_valid = train_ds.num_rows, valid_ds.num_rows
train_idxs, valid_idxs = L(range(n_train)), L(range(n_train, n_train + n_valid))
raw_ds = concatenate_datasets([train_ds, valid_ds])

In [None]:
def make_dict(item):
    return item['translation']

raw_ds = raw_ds.map(make_dict)

Loading cached processed dataset at /home/wgilliam/.cache/huggingface/datasets/wmt16/de-en/1.0.0/0d9fb3e814712c785176ad8cdb9f465fbe6479000ee6546725db30ad8a8b5f8a/cache-2d203da04becbf79.arrow
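`make_dict` returns the nested `translation` dictionary so that `de` and `en` become top-level attributes that `src_lang_attr` and `trg_lang_attr` can refer to. The sketch below emulates that effect on a plain list of dictionaries. It assumes, as is my understanding of `Dataset.map`, that the returned dictionary is merged into each example rather than replacing it.

```python
def make_dict(item):
    return item['translation']


# One row in the same shape as the wmt16 example printed above
rows = [
    {'translation': {'de': 'Wiederaufnahme der Sitzungsperiode',
                     'en': 'Resumption of the session'}},
]

# Emulate the merge: keep existing keys and add the returned ones
flattened = [{**row, **make_dict(row)} for row in rows]

print(sorted(flattened[0].keys()))
print(flattened[0]['de'], '->', flattened[0]['en'])
```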


In [None]:
learn = BlearnerForTranslation.from_dataframe(raw_ds, 'Helsinki-NLP/opus-mt-de-en', 
                                              src_lang_name='German', src_lang_attr='de', 
                                              trg_lang_name='English', trg_lang_attr='en', 
                                              dblock_splitter=RandomSplitter(),
                                              dl_kwargs={'bs':2}).to_fp16()

In [None]:
learn.dls.show_batch(dataloaders=learn.dls, max_n=2, input_trunc_at=500, target_trunc_at=250)

Unnamed: 0,text,target
0,"▁Angesichts▁dieser Situation▁muß▁aus dem▁Bericht, den das▁Parlament annimmt,▁klar▁hervorgehen,▁daß▁Maßnahmen▁notwendig▁sind, die▁eindeutig auf die▁Bekämpfung der relativen▁Armut und der Arbeitslosigkeit▁gerichtet▁sind.▁Maßnahmen▁wie die für diese▁Zwecke▁angemessene▁Verwendung der▁Strukturfonds, die▁häufig▁unsachgemäß▁eingesetzt▁werden, und▁zwar mit▁zentralen▁staatlichen▁Politiken, die▁Modernisierung der▁Bereiche Telekommunikation und▁Kommunikation,▁indem man vor▁allem die am▁wenigsten▁entwickelt","Given this situation, the report approved by Parliament must highlight the need for measures that aim unequivocally to fight relative poverty and unemployment: measures such as the appropriate use of structural funds for these purposes, which are oft"
1,"Wir▁sollten▁auch die▁Bescheidenheit aufbringen einzusehen,▁daß wir nicht▁erst eine▁Woche vor der▁eigentlichen▁Debatte in▁diesem Haus die▁Mechanismen▁einrichten▁können, die▁erforderlich▁sind, um eine▁strategische▁Debatte▁durchzuführen, die sich nicht▁nur auf eine▁Präsentation und▁Erläuterungen▁seitens des▁Präsidenten der▁Kommission▁beschränkt,▁sondern in der▁auch ein▁Fünfjahresprogramm▁vorgelegt▁wird.▁Nur so▁können wir der▁Kommission▁rechtzeitig▁unsere▁Wünsche▁übermitteln und diese▁entsprechend▁d","We should also have the humility to recognise that, if we wanted to have a strategic debate accompanied not just by a presentation and elucidation by the President of the Commission, but also by a five-year programme, we should have the mechanisms in"


In [None]:
metrics_cb = BlearnerForTranslation.get_metrics_cb()
learn.fit_one_cycle(1, lr_max=4e-5, cbs=[metrics_cb])

[nltk_data] Downloading package wordnet to /home/wgilliam/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


epoch,train_loss,valid_loss,bleu,meteor,sacrebleu,time
0,1.259604,1.223431,0.343715,0.569973,32.854164,02:07
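Note that `bleu` is reported on a 0-1 scale while `sacrebleu` is reported on a 0-100 scale, so 0.34 and 32.85 are in the same ballpark. Both are built on clipped n-gram precision between a prediction and its reference. The sketch below is a simplified illustration of that one ingredient, assuming whitespace tokenization and made-up sentences; it leaves out the brevity penalty and the geometric mean over n-gram orders, and it is not the implementation used by the metrics callback.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate: str, reference: str, n: int) -> float:
    """Share of candidate n-grams that also occur in the reference,
    counting each n-gram at most as often as the reference contains it."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total

# Illustrative sentences only (not taken from the dataset).
prediction = "I like drinking beer"
reference = "I like to drink beer"

print(clipped_precision(prediction, reference, 1))            # 0.75
print(round(clipped_precision(prediction, reference, 2), 3))  # 0.333
```

Three of the four predicted words appear in the reference, but only one of the three predicted bigrams does, which is why higher-order precision drops quickly when word order or word forms differ.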


In [None]:
learn.show_results(learner=learn, max_n=2, input_trunc_at=500, target_trunc_at=250)

Unnamed: 0,text,target,prediction
0,"Nach▁meiner▁Ansicht▁würde diese▁zweite▁Hypothese▁einem▁Verzicht auf▁unsere▁Verantwortung▁als▁Parlament und▁darüber▁hinaus dem Aufwerfen▁einer▁originellen These,▁einer▁unbekannten▁Methode▁gleichkommen, die▁darin bestände, den▁Fraktionen die programmatische▁Rede der▁Kommission in▁schriftlicher Form eine▁Woche▁vorher - und nicht,▁wie▁vereinbart, am Tag▁zuvor - zur▁Kenntnis zu▁geben,▁wobei zu▁berücksichtigen▁ist,▁daß das▁Legislativprogramm im▁Februar▁diskutiert▁werden▁wird, so▁daß wir auf die▁Ausspr","In my opinion, this second hypothesis would imply the failure of Parliament in its duty as a Parliament, as well as introducing an original thesis, an unknown method which consists of making political groups aware, in writing, of a speech concerning","In my view, this second hypothesis would be tantamount to abandoning our responsibilities as Parliament and, moreover, to raising an original thesis, an unknown method, which would consist in informing the political groups of the Commission' s progra"
1,"Sie▁wird▁aber auf▁Seite 5▁dieser▁Leitlinien▁ganz▁eindeutig▁genannt, und▁ich▁möchte▁darauf▁verweisen -▁weil▁sie▁mich▁dazu▁aufgefordert▁haben -,▁daß diese▁Partnerschaft für▁mich - und▁ich▁habe▁lange▁genug eine Region▁betreut, um dies▁beurteilen zu▁können - ein▁sehr▁wirkungsvolles Instrument zur▁Mobilisierung der▁geistigen▁Ressourcen auf▁lokaler▁Ebene▁ist -▁sowohl derer im▁öffentlichen▁Sektor - die Stadt- und▁Gemeinderäte, den▁schulischen und▁gesellschaftlichen▁Bereich, die▁Vereine und▁Verbände -▁a","However, I do wish to mention - since you have asked me to do so - that, as far as I am concerned, this partnership - and I spent long enough as a regional administrator within my own country to be able to say this most sincerely - is a tool, one use","However, it is clearly mentioned on page 5 of these guidelines, and I would like to point out - because they have called on me - that this partnership is a very effective instrument for mobilising intellectual resources at local level - both in the p"


`Learner.blurr_generate` works for translation too: pass it a German sentence and it returns a list of English translations.

In [None]:
test_de = "Ich trinke gerne Bier"

In [None]:
learn.blurr_generate(test_de)

['I like drinking beer']

## Summary

Whether you want to work with Blurr's low-, mid-, or high-level API, we've got you covered :)

In [None]:
#hide
from nbdev.export import notebook2script
notebook2script()

Converted 00_utils.ipynb.
Converted 01_data-core.ipynb.
Converted 01_modeling-core.ipynb.
Converted 02_data-language-modeling.ipynb.
Converted 02_modeling-language-modeling.ipynb.
Converted 03_data-token-classification.ipynb.
Converted 03_modeling-token-classification.ipynb.
Converted 04_data-question-answering.ipynb.
Converted 04_modeling-question-answering.ipynb.
Converted 10_data-seq2seq-core.ipynb.
Converted 10_modeling-seq2seq-core.ipynb.
Converted 11_data-seq2seq-summarization.ipynb.
Converted 11_modeling-seq2seq-summarization.ipynb.
Converted 12_data-seq2seq-translation.ipynb.
Converted 12_modeling-seq2seq-translation.ipynb.
Converted 99a_examples-high-level-api.ipynb.
Converted 99b_examples-glue.ipynb.
Converted 99c_examples-glue-plain-pytorch.ipynb.
Converted 99d_examples-multilabel.ipynb.
Converted index.ipynb.
