# bigbird-pegasus-large-arxiv - How to use
Mostly adapted from the model card: https://huggingface.co/google/bigbird-pegasus-large-arxiv

**Input text (only abstract from an article):**  

Scientific articles can be annotated with short sentences, called highlights, providing readers with an
at-a-glance overview of the main findings. Highlights are usually manually specified by the authors. This
paper presents a supervised approach, based on regression techniques, with the twofold aim at
automatically extracting highlights of past articles with missing annotations and simplifying the process of
manually annotating new articles. To this end, regression models are trained on a variety of features extracted
from previously annotated articles. The proposed approach extends existing extractive approaches by
predicting a similarity score, based on n-gram co-occurrences, between article sentences and highlights.
The experimental results, achieved on a benchmark collection of articles ranging over heterogeneous
topics, show that the proposed regression models perform better than existing methods, both supervised
and not.

**Output text:**  

this paper presents supervised regression techniques with the aim at automatically extracting highlights of missing articles and at simplifying the process of extracting a variety of features extracted from previously annotated articles.

In [None]:
!pip install datasets

In [None]:
!pip install transformers

In [None]:
from transformers import BigBirdPegasusForConditionalGeneration, AutoTokenizer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("google/bigbird-pegasus-large-arxiv")

# by default encoder attention is `block_sparse` with num_random_blocks=3, block_size=64
# decoder attention type can't be changed & will be "original_full"
# `attention_type` applies to the encoder only; it is set to the default explicitly here,
# pass attention_type="original_full" instead to switch the encoder to full attention
model = BigBirdPegasusForConditionalGeneration.from_pretrained("google/bigbird-pegasus-large-arxiv", attention_type="block_sparse")

# you can change `block_size` & `num_random_blocks` like this:
#model = BigBirdPegasusForConditionalGeneration.from_pretrained("google/bigbird-pegasus-large-arxiv", block_size=16, num_random_blocks=2)

text = "Scientific articles can be annotated with short sentences, called highlights, providing readers with an at-a-glance overview of the main findings. Highlights are usually manually specified by the authors. This paper presents a supervised approach, based on regression techniques, with the twofold aim at automatically extracting highlights of past articles with missing annotations and simplifying the process of manually annotating new articles. To this end, regression models are trained on a variety of features extracted from previously annotated articles. The proposed approach extends existing extractive approaches by predicting a similarity score, based on n-gram co-occurrences, between article sentences and highlights. The experimental results, achieved on a benchmark collection of articles ranging over heterogeneous topics, show that the proposed regression models perform better than existing methods, both supervised and not."
inputs = tokenizer(text, return_tensors='pt')

prediction = model.generate(**inputs)
prediction = tokenizer.batch_decode(prediction)

print(prediction)

Attention type 'block_sparse' is not possible if sequence_length: 156 <= num global tokens: 2 * config.block_size + min. num sliding tokens: 3 * config.block_size + config.num_random_blocks * config.block_size + additional buffer: config.num_random_blocks * config.block_size = 704 with config.block_size = 64, config.num_random_blocks = 3. Changing attention type to 'original_full'...


['<s> this paper presents supervised regression techniques with the aim at automatically extracting highlights of missing articles and at simplifying the process of extracting a variety of features extracted from previously annotated articles.</s>']
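The warning in the output above spells out the minimum input length for block-sparse attention as a sum of four terms. The following is a minimal sketch that only reproduces that arithmetic; the function names `min_block_sparse_length` and `falls_back_to_full_attention` are hypothetical helpers for illustration and are not part of `transformers`.

```python
def min_block_sparse_length(block_size: int = 64, num_random_blocks: int = 3) -> int:
    """Threshold quoted in the warning; inputs at or below it fall back to full attention."""
    global_tokens = 2 * block_size                   # "num global tokens"
    sliding_tokens = 3 * block_size                  # "min. num sliding tokens"
    random_tokens = num_random_blocks * block_size   # random blocks
    buffer_tokens = num_random_blocks * block_size   # "additional buffer"
    return global_tokens + sliding_tokens + random_tokens + buffer_tokens


def falls_back_to_full_attention(sequence_length: int,
                                 block_size: int = 64,
                                 num_random_blocks: int = 3) -> bool:
    """Hypothetical check mirroring the `sequence_length <= threshold` test in the warning."""
    return sequence_length <= min_block_sparse_length(block_size, num_random_blocks)


# Default configuration from the warning: 128 + 192 + 192 + 192
print(min_block_sparse_length())            # 704
# The 156-token abstract is too short, so full attention is used
print(falls_back_to_full_attention(156))    # True
# Smaller blocks lower the threshold: 32 + 48 + 32 + 32
print(min_block_sparse_length(16, 2))       # 144
```

With the default configuration, the abstract (156 tokens) is well below the threshold of 704, which is why the encoder silently switches to `original_full` for this example.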


## Running the model on an article from the arXiv dataset (test set)

In [None]:
scientific_papers = load_dataset('scientific_papers', 'arxiv')

Downloading and preparing dataset scientific_papers/arxiv (download: 4.20 GiB, generated: 7.06 GiB, post-processed: Unknown size, total: 11.26 GiB) to /root/.cache/huggingface/datasets/scientific_papers/arxiv/1.1.1/306757013fb6f37089b6a75469e6638a553bd9f009484938d8f75a4c5e84206f...


Generated splits: train 203037 examples, validation 6436 examples, test 6440 examples.

Dataset scientific_papers downloaded and prepared to /root/.cache/huggingface/datasets/scientific_papers/arxiv/1.1.1/306757013fb6f37089b6a75469e6638a553bd9f009484938d8f75a4c5e84206f. Subsequent calls will reuse this data.


In [None]:
scientific_papers['train'][0]

{'abstract': ' additive models play an important role in semiparametric statistics . \n this paper gives learning rates for regularized kernel based methods for additive models . \n these learning rates compare favourably in particular in high dimensions to recent results on optimal learning rates for purely nonparametric regularized kernel based quantile regression using the gaussian radial basis function kernel , provided the assumption of an additive model is valid . \n additionally , a concrete example is presented to show that a gaussian function depending only on one variable lies in a reproducing kernel hilbert space generated by an additive gaussian kernel , but does not belong to the reproducing kernel hilbert space generated by the multivariate gaussian kernel of the same variance .    * \n key words and phrases . * additive model , kernel , quantile regression , semiparametric , rate of convergence , support vector machine . ',
 'article': 'additive models @xcite provide an 

Original article: https://arxiv.org/pdf/1009.3123.pdf

In [None]:
article = scientific_papers['test'][0]['article']
print(article)

In [None]:
inputs = tokenizer(article, return_tensors='pt', truncation=True)

prediction = model.generate(**inputs)
prediction = tokenizer.batch_decode(prediction)
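As the earlier printed output shows, strings decoded this way keep the `<s>` and `</s>` markers. In `transformers`, `batch_decode` accepts `skip_special_tokens=True`, which is probably the more direct way to drop them. The sketch below is a hypothetical pure-Python stand-in (`strip_special_tokens` is not a library function) so that it runs without downloading the model; the token pair `("<s>", "</s>")` is assumed from the output shown earlier.

```python
def strip_special_tokens(text: str, special_tokens=("<s>", "</s>")) -> str:
    """Remove the given marker strings and normalise whitespace (illustrative helper)."""
    for token in special_tokens:
        text = text.replace(token, "")
    # Collapse any leftover runs of whitespace into single spaces
    return " ".join(text.split())


# Decoded output copied from the first example in this notebook
decoded = ['<s> this paper presents supervised regression techniques with the aim at '
           'automatically extracting highlights of missing articles and at simplifying '
           'the process of extracting a variety of features extracted from previously '
           'annotated articles.</s>']

cleaned = [strip_special_tokens(summary) for summary in decoded]
print(cleaned[0])
```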

In [None]:
inputs

{'input_ids': tensor([[11694,  2391,   137,   129, 35851,   122,   613,  9750,   108,   568,
          4461,   108,   876,  2557,   122,   142,   134,   121,   304,   121,
         73178,  4859,   113,   109,   674,  4469,   107, 26898,   127,   832,
          6672,  4799,   141,   109,  3802,   107,   182,   800,  3702,   114,
         15561,  1014,   108,   451,   124, 19363,  1739,   108,   122,   109,
         60869,  2560,   134,  2093, 28170,  4461,   113,   555,  2391,   122,
          2362, 37271,   111, 30960,   109,   366,   113,  6672,   142,  2957,
          7558,   177,  2391,   107,   413,   136,   370,   108, 19363,  1581,
           127,  2492,   124,   114,   809,   113,   556, 13317,   135,  2255,
         35851,  2391,   107,   139,  2962,  1014,  8998,  1385, 73840,  4166,
           141, 18972,   114, 23956,  2135,   108,   451,   124,  3178,   121,
         14506,  1229,   121, 89453,   116,   108,   317,   974,  9750,   111,
          4461,   107,   139,  7707,  

In [1]:
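# model prediction; `prediction` must already exist from the generation cell above,
# otherwise (e.g. after a runtime restart) this raises the NameError shown below
print(prediction)

NameError: ignored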

In [None]:
# ground truth (article's abstract)
print(scientific_papers['test'][0]['abstract'])
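Printing the prediction next to the ground-truth abstract only allows a comparison by eye. As a rough numeric check, one could compute a word-overlap score between the two strings. The sketch below is a simplified unigram-overlap F1 in the spirit of ROUGE-1; it is not the official ROUGE implementation (no stemming, and tokenisation is plain whitespace splitting), and `unigram_f1` is a hypothetical helper name.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified unigram-overlap F1 between two strings (illustrative, not official ROUGE)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Multiset intersection: each shared word counts up to its minimum frequency
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Toy strings only; two of three words match in each direction
print(unigram_f1("the cat sat", "the cat ran"))
```

In this notebook it could be called as `unigram_f1(prediction[0], scientific_papers['test'][0]['abstract'])`, ideally after removing the special tokens from the decoded prediction.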