# Introducing BART
> Episode 1 -- a mysterious new Seq2Seq model with state of the art summarization performance visits a popular open source library

- toc: true 
- badges: true
- comments: true
- categories: [jupyter]
- image: images/text_infilling.png

### Overview

For the past few weeks, I worked on integrating BART into [transformers](https://github.com/huggingface/transformers/). This post covers the high-level differences between BART and its predecessors and how to use the new `BartForConditionalGeneration` to summarize documents. Leave a comment below if you have any questions!

### Background: Seq2Seq Pretraining
In October 2019, teams from Google and Facebook published new transformer papers: [T5](https://arxiv.org/abs/1910.10683) and [BART](https://arxiv.org/abs/1910.13461). Both papers achieved better downstream performance on generation tasks, like abstractive summarization and dialogue, with two changes:
- add a causal decoder to BERT's bidirectional encoder architecture
- replace BERT's fill-in-the-blank cloze task with a more complicated mix of pretraining tasks.

<!-- **Tasks:** Historically, Seq2Seq models have been used for text generation tasks like summarization and translation. "BART is particularly effective when finetuned for text generation, but also matches the performance of RoBERTa on GLUE and SQuAD", with only 10% more parameters. -->

Now let's dig deeper into the big Seq2Seq pretraining idea!

#### Bert vs. GPT2
As the BART authors write,
> (BART) can be seen as generalizing Bert (due to the bidirectional encoder) and GPT2 (with the left to right decoder). 


Bert is pretrained to try to predict masked tokens, and uses the whole sequence to get enough info to make a good guess. This is good for tasks where the prediction at position `i` is allowed to utilize information from positions after `i`, but less useful for tasks, like text generation, where the prediction for position `i` can only depend on previously generated words.

In code, the idea of "what information can be used when predicting the token at position `i`" is controlled by an argument called `attention_mask`[^2]. A value of 1 in the attention mask means that the model can use information from the column's word when predicting the row's word.
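
To make the "1 means visible" rule concrete, here is a minimal plain-Python sketch of how one row of a mask could be applied to attention scores. The function name `masked_softmax` and the toy scores are made up for illustration, and the sketch assumes at least one column is visible; real implementations do this on batched tensors.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over `scores`, ignoring positions where `mask` is 0."""
    # Masked positions get -inf, so exp(-inf) = 0 and they receive no weight.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw attention scores for one output row over four input columns.
scores = [2.0, 1.0, 0.5, 3.0]
visible = masked_softmax(scores, [1, 1, 1, 1])  # every column may contribute
hidden = masked_softmax(scores, [1, 1, 0, 0])   # last two columns are blocked
```

With the second mask, the last two columns end up with exactly zero weight, so nothing from those words can reach the prediction.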
    
<!-- > Note: In this post, we show attention masks in grids where each row `y` represents an output token, and each column `x` represents an input token. If the square at `(y3, x4)` is black. It means that our prediction for `y3` is allowed to utilize information from `x4`. During pretraining, `x` would be the corrupted document, and `y` would be the original. -->


Here is Bert's "Fully-visible"[^3] `attention_mask`: 
    
<!-- ![](./bert_mac_small.jpg) -->
<!-- ![](./diagram_bartpost_v2.jpg) -->
<!-- ![](./bert_excel_v2.jpg) -->
![](https://github.com/sshleifer/blog_v2/blob/master/_notebooks/diagram_bert_v5.png?raw=1)

[^2]: This is the same parameter that is used to make model predictions invariant to pad tokens.
[^3]: "Fully-Visible" and "bidirectional" are used interchangeably. Same with "causal" and "autoregressive".

GPT2, meanwhile, is pretrained to predict the next word using a causal mask, and is more effective for generation tasks, but less effective on downstream tasks where the whole input yields information for the output.

Here is the `attention_mask` for GPT2:

![](https://github.com/sshleifer/blog_v2/blob/master/_notebooks/diagram_bartpost_gpt2.jpg?raw=1)


The prediction for "eating" only utilizes previous words: "`<BOS>` I love".
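
The two mask shapes discussed so far can be sketched in a few lines of plain Python. The helper names `fully_visible_mask` and `causal_mask` are hypothetical, and the sketch assumes rows are aligned so that row `i` predicts the token that follows input `i`.

```python
def fully_visible_mask(n):
    """Every output row may look at every input column (BERT-style)."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Row i may look only at columns 0..i (GPT2-style)."""
    return [[1 if col <= row else 0 for col in range(n)] for row in range(n)]

inputs = ["<BOS>", "I", "love", "eating"]
mask = causal_mask(len(inputs))

# Row 2 is the step that predicts "eating": it sees "<BOS> I love" only.
visible_to_eating = [tok for tok, m in zip(inputs, mask[2]) if m == 1]
print(visible_to_eating)  # ['<BOS>', 'I', 'love']
```

The causal mask is just the lower triangle of the fully visible one.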

#### Encoder-Decoder
Our new friends, like BART, get the best of both worlds. 

The encoder's `attention_mask` is fully visible, like BERT:
![](https://github.com/sshleifer/blog_v2/blob/master/_notebooks/seq2seq_enc_v5.png?raw=1)

The decoder's `attention_mask` is causal, like GPT2:

![](https://github.com/sshleifer/blog_v2/blob/master/_notebooks/seq2seq_dec.png?raw=1)

<!-- ![](./causal_with_prefix.jpg) -->


<!-- We can think about this `attention_mask` as smushing together our previous two attention masks, or "Causal Mask  with a fully visible prefix" in fancier terms.[^4] -->

<!-- [^4]: The UniLM paper presents this as a"causal mask with a fully visible prefix" -->
<!-- , as the UniLM The indices dont line up perfectly for the smush to work, but tokens 1 and 2 are the fully visible prefix (or the input to the encoder) and tokens 3,4,5 are the causally masked suffix (or inputs to the decoder). In summarization terms, you could imagine tokens 1 and 2 as the article, and we generate tokens 3-5 auto-regressively. -->

<!-- ![](./t5_mask_diagram.png) -->

The encoder and decoder are connected by cross-attention, where each decoder layer performs attention over the encoder's final hidden states. This presumably nudges the models towards generating output that is closely connected to the original input.
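
Here is a rough, single-head sketch of that cross-attention step in plain Python. It is a simplification under stated assumptions: real models apply learned query/key/value projections and use several heads, which are omitted here, and the function name `cross_attention` and the toy vectors are invented for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross_attention(decoder_states, encoder_states):
    """For each decoder position, return a weighted average of encoder states."""
    d = len(encoder_states[0])
    outputs = []
    for query in decoder_states:
        # Score the query against every encoder position (no causal mask here).
        scores = [dot(query, key) / math.sqrt(d) for key in encoder_states]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Mix the encoder states according to the attention weights.
        mixed = [sum(w * state[i] for w, state in zip(weights, encoder_states))
                 for i in range(d)]
        outputs.append(mixed)
    return outputs

# Toy numbers: 3 encoder positions, 2 decoder positions, hidden size 2.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_states = [[5.0, 0.0], [0.0, 5.0]]
context = cross_attention(decoder_states, encoder_states)
```

Each decoder position gets one context vector built entirely from encoder states, which is how information from the input document reaches the generated text.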


#### Pretraining: Fill In the Span
Bart and T5 are both pretrained[^5] on tasks where **spans** of text are replaced by mask tokens. The model must learn to reconstruct the original document. Figure 1 from the BART paper explains it well:

![](https://github.com/sshleifer/blog_v2/blob/master/_notebooks/text_infilling.png?raw=1)
In this example, the original document is `A B C D E`. The span `[C, D]` is masked before encoding and an extra mask is inserted before `B`, leaving the corrupted document `'A _ B _ E'` as input to the encoder.

The autoregressive decoder (autoregressive means "uses a causal mask") must reconstruct the original document, using the encoder's output and the previous uncorrupted tokens.
[^5]: This is a bit of a simplification. Both papers experiment with many different pretraining tasks, and find that this one performs well. T5 uses a "replace corrupted spans" task: instead of a shared mask token, each corrupted span is replaced with its own unique sentinel token.
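
The corruption in the figure can be reproduced with a small sketch. The function name `text_infill` and the `(start, length)` span format are assumptions made for this example; the paper samples the spans randomly, whereas here they are hard-coded to match the figure.

```python
MASK = "_"

def text_infill(tokens, spans):
    """Replace each (start, length) span with a single MASK token.

    A span of length 0 deletes nothing and simply inserts a MASK at `start`.
    Spans are assumed to be sorted and non-overlapping.
    """
    corrupted, cursor = [], 0
    for start, length in spans:
        corrupted.extend(tokens[cursor:start])  # copy untouched tokens
        corrupted.append(MASK)                  # one mask per span, whatever its length
        cursor = start + length                 # skip the masked tokens
    corrupted.extend(tokens[cursor:])
    return corrupted

original = ["A", "B", "C", "D", "E"]
# Zero-length span before "B" (index 1), then the span [C, D] (index 2, length 2).
corrupted = text_infill(original, [(1, 0), (2, 2)])
print(" ".join(corrupted))  # A _ B _ E
```

Because a two-token span and a zero-token span both collapse to a single mask, the model cannot tell from the input how many tokens are missing.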

### Summarization

In summarization tasks, the `input` sequence is the document we want to summarize, and the `output` sequence is a ground truth summary.
Seq2Seq architectures can be directly finetuned on summarization tasks, without any new randomly initialized heads. The pretraining task is also a good match for the downstream task. In both settings, the model must copy from the input document, with modification. The numbers confirm this: all the new fancy Seq2Seq models do a lot better than the old less-fancy guys on the CNN/Daily Mail abstractive summarization task, and BART does especially well.

|                Model |   Rouge2 | Model Size   | Pretraining   |
|:---------------------|---------:|:-------------|:--------------|
| PT-Gen               |    17.28 | 22 M         | None          |
| TransformerAbs       |    17.76 | 200 M        | None          |
| BertSumABS           |    19.39 | 220 M        | Encoder       |
| UniLM                |    20.3  | 340 M        | Seq2Seq       |
| T5-base              |    20.34 | 220 M        | Seq2Seq       |
| Bart                 |    21.28 | 406 M        | Seq2Seq       |
| T5-11B               |    21.55 | 11 B         | Seq2Seq       |

- `BertSumABS` (from [*Text Summarization with Pretrained Encoders*](https://arxiv.org/abs/1908.08345)) uses a Seq2Seq architecture but doesn't pretrain the decoder. `TransformerAbs`, from the same paper, uses a slightly smaller model and no pretraining.
- `PT-Gen` is from [Get To The Point: Summarization with Pointer-Generator Networks](https://arxiv.org/pdf/1704.04368.pdf)
- [UniLM](https://arxiv.org/abs/1905.03197) is a "Prefix-LM" with a similar masking strategy to Bart and T5.
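
For readers unfamiliar with the Rouge2 column, the metric is based on bigram overlap between a generated summary and a reference summary. The sketch below is a simplified illustration, not the official scorer: it lowercases and splits on whitespace, skips the stemming and tokenization rules that published numbers rely on, and returns a value between 0 and 1 rather than on the 0-100 scale that the table appears to use. The function names are made up for this example.

```python
from collections import Counter

def bigrams(text):
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

def rouge2_f1(candidate, reference):
    """Bigram-overlap F1 between a candidate summary and a reference."""
    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum((cand & ref).values())  # clipped count of shared bigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the cat sat on the mat"
candidate = "the cat sat on a mat"
score = rouge2_f1(candidate, reference)  # 3 of 5 bigrams match on each side
```

Changing a single word breaks two bigrams, which is part of why gains of a point or two on this metric are hard to come by.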


### Demo: BartForConditionalGeneration 

In [None]:
!pip install transformers

In [None]:
import torch
import transformers
from transformers import BartTokenizer, BartForConditionalGeneration
from IPython.display import display, Markdown

torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [None]:
LONG_BORING_TENNIS_ARTICLE = """
My girlfriend and I have been in a monogomous relationship for fifteen months now, and we have been sexually active. She was recently on birth control, but stopped taking it on November 26 because the pills were very negitivly affecting her during her PMS. Our relationship was rocky, and the details are fuzzy, but sometime soon after stopping the pills (a few days) she got her period. Her period ended, (we BELIEVE) on December 6. On December 17, she and I had unprotected sex on-and-off. I did not ejaculate in her body. We checked to ensure my penis was dry every time we put it in. However, we were still concerned. December 19 at 1:00 PM she took the first pill of the two-pill Next Choice Two Step. 1:00 PM was 42 hours into the 72 hour allotted time-frame. Today is December 23, and her and I are very worried. We are not entirely sure when we should expect her period, and we are nervous. I need some peace-of-mind from a professional opinion. Do you believe she could be pregnant?

""".replace('\n','')

In [None]:
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')


In [None]:
# NOTE: ARTICLE2 is defined in a cell near the end of this notebook; run that cell first.
article_input_ids = tokenizer.batch_encode_plus([ARTICLE2], return_tensors='pt', max_length=1024, truncation=True)['input_ids']
summary_ids = model.generate(article_input_ids,
                             num_beams=4,
                             length_penalty=2.0,
                             max_length=142,
                             no_repeat_ngram_size=3)

# Decode the single generated sequence back into text
summary_txt = tokenizer.decode(summary_ids.squeeze(), skip_special_tokens=True)
display(Markdown('> **Summary:** ' + summary_txt))

> **Summary:** Mr Tan Ah Kow was accompanied by his son, Mr Tan Ah Beng, for the examination. He has had hypertension andhyperlipidemia since 1990 and suffered several strokes in 2005. The clinical impression was that he was manifesting behavioural and psychological symptoms secondary to Dementia. He is at present incontinent, and isunable to bathe or use the toilet on his own.
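
The `no_repeat_ngram_size=3` argument passed to `generate` above prevents any 3-gram from appearing twice in the output. A toy version of the idea, which is an illustration and not the library's actual implementation, looks like this (the function name `banned_next_tokens` is hypothetical):

```python
def banned_next_tokens(generated, n):
    """Tokens that would complete an n-gram already present in `generated`."""
    if len(generated) < n:
        return set()
    # The last n-1 tokens are the prefix that the next token would extend.
    prefix = tuple(generated[-(n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(generated) - n + 1):
        ngram = tuple(generated[i:i + n])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

generated = ["the", "lawn", "tennis", "association", "is", "the", "lawn"]
print(banned_next_tokens(generated, n=3))  # {'tennis'}
```

At each decoding step, the banned tokens would have their scores pushed down so that beam search cannot pick them.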

GPT2, which in fairness is not finetuned for summarization, cannot really continue the tennis article sensibly.

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
gpt2_tok = GPT2Tokenizer.from_pretrained('gpt2')
gpt2_model = GPT2LMHeadModel.from_pretrained('gpt2')

In [None]:
# truncate to 869 tokens so that we have space to generate another 155
enc = gpt2_tok.encode(LONG_BORING_TENNIS_ARTICLE, truncation=True, max_length=1024-155, return_tensors='pt')
# Generate another 155 tokens
source_and_summary_ids = gpt2_model.generate(enc, max_length=1024, do_sample=False)
# Only show the new ones
end_of_source = "An official statement said:"
_, summary_gpt2 = gpt2_tok.decode(source_and_summary_ids[0]).split(end_of_source)
display(Markdown('> **GPT2:** ' + summary_gpt2))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


> **GPT2:**  'To have a player like James Ward, Kyle Edmund, Liam Broady and Aljaz Bedene in the top 100 is a huge achievement for the Lawn Tennis Association. The Lawn Tennis Association is committed to the development of the sport and the development of the sport's players. The Lawn Tennis Association is committed to the development of the sport and the development of the sport's players. The Lawn Tennis Association is committed to the development of the sport and the development of the sport's players. The Lawn Tennis Association is committed to the development of the sport and the development of the sport's players. The Lawn Tennis Association is committed to the development of the sport and the development of the sport's players. The Lawn Tennis Association is committed to the development of the sport and the development of the

In [None]:
a, summary_gpt2 = gpt2_tok.decode(source_and_summary_ids[0]).split(end_of_source)
display(Markdown('> **GPT2:** ' + a))

In [None]:
print(a)

 Andy Murray  came close to giving himself some extra preparation time for his wedding next week before ensuring that he still has unfinished tennis business to attend to. The world No 4 is into the semi-finals of the Miami Open, but not before getting a scare from 21 year-old Austrian Dominic Thiem, who pushed him to 4-4 in the second set before going down 3-6 6-4, 6-1 in an hour and three quarters. Murray was awaiting the winner from the last eight match between Tomas Berdych and Argentina's Juan Monaco. Prior to this tournament Thiem lost in the second round of a Challenger event to soon-to-be new Brit Aljaz Bedene. Andy Murray pumps his first after defeating Dominic Thiem to reach the Miami Open semi finals. Muray throws his sweatband into the crowd after completing a 3-6, 6-4, 6-1 victory in Florida. Murray shakes hands with Thiem who he described as a'strong guy' after the game. And Murray has a fairly simple message for any of his fellow British tennis players who might be agita

More importantly, these snippets show that even though `BartForConditionalGeneration` is a Seq2Seq model and `GPT2LMHeadModel` is not, they can be invoked in similar ways for generation.

### Conclusion
Our first release of `BartModel` prioritized moving quickly and keeping the code simple, but it's still a work in progress. I am currently working on making the implementation in transformers faster and more memory efficient, so stay tuned for episode 2!

A big thank you to Sasha Rush, Patrick von Platen, Thomas Wolf, Clement Delangue, Victor Sanh, Yacine Jernite, Harrison Chase and Colin Raffel for their feedback on earlier versions of this post, and to the BART authors for releasing their code and answering questions on GitHub.

In [None]:
ARTICLE = """
According to current live statistics at the time of editing this letter, Russia has been the third country in the world to be affected by COVID-19 with both new cases and death rates rising. It remains in a position of advantage due to the later onset of the viral spread within the country since the worldwide disease outbreak.
Most of the multidisciplinary hospitals have been repurposed as dedicated COVID-19 centres, so the surgeons started working as infectious disease specialists. Such a reallocation of health care capacity results in the effective management of this epidemiological problem 1 . The staff has undergone on-line 36-hour training course to become qualified in coronavirus infection treatment.
The surgeons of COVID-19 dedicated hospitals do rarely practice surgery. When ICU patients need mechanical ventilation, percutaneous tracheostomy under endoscopic control is mostly performed, as it decreases the aerosol formation, viral load on staff and complications, associated with an endotracheal tube in comparison with surgical tracheostomy 2 . However, it is still associated with the risk of aerosol formation, so different approaches should be considered for a long-time perspective 3 .
The majority of the studies dedicated to colorectal diseases are temporarily paused. The teaching and training are mostly translated via online platforms, which has excluded the opportunity to get clinical experience in surgery 4 .
The approach to patient routing has changed significantly. If one is not diagnosed with COVID-19 CT scan and laboratory testing are provided immediately. The patients should be admitted to the surgical department, where treatment is provided only to those COVID-19 negative.
The patient isolated for more than 2 weeks and COVID-19 negative as a result of 2 subsequent tests is admitted to the surgical department with an option to readmission to the infectious department and can be treated by surgical staff, which does not work with COVID-19 positive patients.The patient, diagnosed with coronavirus infection and treated at home is admitted to COVID-19 dedicated multidisciplinary hospital, where surgical care is provided. Those treated in infectious diseases hospital or COVID-19 dedicated centre are managed by the surgical team present.Surgery has become highly elective, being mostly available for high-risk patients with emergencies, malignancies, cardiovascular pathologies or infections. Preoperative testing in surgical patients with respiratory symptoms and history of travelling or contacting with COVID-19 positive people and postoperative recovery in the operating unit seem to be highly effective measures 5 . A lot of rearrangements are performed locally regarding personal protective equipment, the organization of scrubbing, donning and doffing, and dedicated changing areas. Moreover, observational departments are organized in surgical hospitals for patient allocation before coronavirus infection status is defined 6 .Surgery for benign disorders, precancerous lesions, and reconstructive procedures are currently postponed. Regarding colorectal cancer, surgical treatment may be postponed, if it is a non-obstructing disease 7 .
Laparoscopic surgery and diathermy are limited as well. The importance of special operating theatre for COVID-19 patients with negative pressure ventilation, unidirectional laminar flow, as well as the use of smoke evacuation systems during surgery are taken into account 8 .
Such an electiveness of surgery is concerning, as it might cause a worldwide healthcare catastrophe in the post-pandemic era 5 . More efforts should be taken to expand the amount and types of surgical procedures performed.
Due to the early preventive and corrective actions we have already reached the plateau in new cases curve, counting for up to 8 984 cases identified at the time of writing this paper (June 7 th , 2020), with a mortality rate of 1\u22c55075%. These statistical outcomes are resulted by a 68-day lockdown, admission regime, and healthcare rearrangement. Thus, the multistep restriction lifting has already started to consistently recover in both social and economic aspects.
 
  """.replace('\n','')

In [None]:
ARTICLE2 = """
Mr Tan Ah Kow was accompanied by his son, Mr Tan Ah Beng, for the examination.
Mr Tan is a 55 year old man, who is divorced, and unemployed. Mr Tan is currently
living with his son, Ah Beng, in Ah Beng’s flat. Mr Tan Ah Beng informed me that
Mr Tan Ah Kow used to work as a cleaner.
Mr Tan Ah Kow has a history of medical conditions. He has had hypertension and
hyperlipidemia since 1990 and suffered several strokes in 2005. He subsequently
developed heart problems (cardiomyopathy), cardiac failure and chronic renal disease
and was treated in ABC Hospital.
He was last admitted to the ABC Hospital on 1 April 2010 till 15 April 2010, during
which he was diagnosed to have suffered from a stroke. This was confirmed by CT
and MRI brain scans.
Thereafter, he was transferred to XYZ Hospital for stroke rehabilitation on 15 April
2010.
After that, Mr Tan was referred to Blackacre Hospital for follow-up treatment from in
November 2010. The clinical impression was that he was manifesting behavioural and
psychological symptoms secondary to Dementia.
The clinical impression was that he was manifesting behavioural and psychological
symptoms secondary to Dementia.
I was informed by Mr Tan Ah Beng that Mr Tan is at present incontinent, and is
unable to bathe or use the toilet on his own. He is, however, able to feed himself.
I have observed a gradual deterioration in his cognitive ability and physical state over
the years.

""".replace('\n','')

In [None]:
from transformers import pipeline
summarizer = pipeline("summarization")

In [None]:
print(summarizer(ARTICLE2, max_length=150, min_length=30))

[{'summary_text': ' Mr Tan Ah Kow is a 55 year old man, who is divorced, and unemployed . He has had hypertension andhyperlipidemia since 1990 and suffered several strokes in 2005 . The clinical impression was that he was manifesting behavioural and psychological symptoms secondary to Dementia .'}]
