<a href="https://colab.research.google.com/github/otanet/NLP_seminar_20201022/blob/main/bart_summarization_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summarization example with BART
*   This notebook shows how to generate a summary with a BART model that has been fine-tuned on summarization data.
*   It uses Hugging Face's [transformers](https://github.com/huggingface/transformers) library.
*   It does not cover fine-tuning BART itself.
*   For an example of fine-tuning BART, see the [seq2seq examples](https://github.com/huggingface/transformers/tree/master/examples/seq2seq).



## Environment setup

In [None]:
!pip install transformers==3.3.1

Collecting transformers==3.3.1
  Downloading https://files.pythonhosted.org/packages/19/22/aff234f4a841f8999e68a7a94bdd4b60b4cebcfeca5d67d61cd08c9179de/transformers-3.3.1-py3-none-any.whl (1.1MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting sentencepiece!=0.1.92
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
Collecting tokenizers==0.8.1.rc2
  Downloading https://files.pythonhosted.org/packages/80/83/8b9fccb9e48eeb575ee19179e2bdde0ee9a1904f97de5f02d19016b8804f/tokenizers-0.8.1rc2-cp36-cp36m-manylinux1_x86_64.whl (3.0M

## Loading the model and tokenizer

In [None]:
from transformers import BartTokenizer, BartForConditionalGeneration

# Load the BART checkpoint fine-tuned for summarization, plus its matching tokenizer
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')


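## Specifying the source text

*   Specify the source text, then tokenize it and convert it to token ids.
*   Change the contents of `ARTICLE_TO_SUMMARIZE = ""` below to summarize any text you like.
*   The example here uses part of the news text from [this CNN article](https://edition.cnn.com/2020/10/19/politics/donald-trump-joe-biden-election-2020-coronavirus-fauci-masks/index.html).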
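
To make "tokenize and convert to ids" concrete, here is a minimal pure-Python sketch of what the step produces. It is not how `BartTokenizer` works internally: BART uses byte-level BPE over a vocabulary of roughly 50k entries, whereas `TOY_VOCAB` and `toy_encode` are hypothetical names invented for this illustration and split on whitespace. The special-token ids (`<s>` = 0, `</s>` = 2, `<unk>` = 3) do follow the BART vocabulary, which is why the real `input_ids` start with 0 and end with 2.

```python
# Special-token ids as in the BART vocabulary: <s>=0, </s>=2, <unk>=3.
BOS_ID, EOS_ID, UNK_ID = 0, 2, 3

# Hypothetical toy vocabulary (assumption for illustration only).
TOY_VOCAB = {"the": 4, "pandemic": 5, "is": 6, "running": 7,
             "out": 8, "of": 9, "control": 10}


def toy_encode(text, max_length):
    """Whitespace-tokenize, map words to ids, truncate, and wrap in <s> ... </s>.

    Assumes max_length >= 2 so that both special tokens fit.
    """
    words = text.lower().split()
    ids = [TOY_VOCAB.get(word, UNK_ID) for word in words]
    # Truncate so the total length, including <s> and </s>, stays within max_length.
    ids = ids[: max_length - 2]
    input_ids = [BOS_ID] + ids + [EOS_ID]
    # Every position holds a real token (no padding), so the mask is all ones.
    attention_mask = [1] * len(input_ids)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


print(toy_encode("The pandemic is spreading", max_length=1024))
print(toy_encode("the pandemic is running out of control", max_length=5))
```

The second call shows the effect of `truncation=True` with a small `max_length`: words beyond the limit are dropped, but the closing `</s>` is still appended.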


---
Source text excerpted from the news article above

(CNN)President Donald Trump and the pandemic he is supposed to be fighting are running out of control with the two weeks until Election Day shaping up as among the most ugly and divisive periods ever ahead of a presidential vote. He's on a fresh collision course with Dr. Anthony Fauci, who's publicly questioning why Trump thinks mask wearing is weak after a wild weekend that saw the President, who's trailing former Vice President Joe Biden in the polls and still playing to his base, pack swing state rallies that flouted his government's Covid-19 protocols.



In [None]:
ARTICLE_TO_SUMMARIZE = "(CNN)President Donald Trump and the pandemic he is supposed to be fighting are running out of control with the two weeks until Election Day shaping up as among the most ugly and divisive periods ever ahead of a presidential vote. He's on a fresh collision course with Dr. Anthony Fauci, who's publicly questioning why Trump thinks mask wearing is weak after a wild weekend that saw the President, who's trailing former Vice President Joe Biden in the polls and still playing to his base, pack swing state rallies that flouted his government's Covid-19 protocols."
inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=1024, return_tensors='pt', truncation=True)
print(inputs)

{'input_ids': tensor([[    0,  1640, 16256,    43,  6517,   807,   140,     8,     5, 23387,
         14414,    37,    16,  3518,     7,    28,  2190,    32,   878,    66,
             9,   797,    19,     5,    80,   688,   454,  7713,  1053, 16383,
            62,    25,   566,     5,   144, 11355,     8, 16067,  5788,   655,
           789,     9,    10,  1939,   900,     4,    91,    18,    15,    10,
          2310,  7329,   768,    19,   925,     4,  3173,   274,  1180,  2520,
             6,    54,    18,  3271,  8026,   596,   140,  4265, 11445,  2498,
            16,  3953,    71,    10,  3418,   983,    14,   794,     5,   270,
             6,    54,    18, 12564,   320,  3287,   270,  2101, 15478,    11,
             5,  4583,     8,   202,   816,     7,    39,  1542,     6,  6356,
          7021,   194, 10881,    14,  2342, 23100,    39,   168,    18, 19150,
           808,    12,  1646, 18956,     4,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

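## Generating the summary

*   `num_beams` sets the beam width, that is, how many candidate sequences beam search keeps at each step. Larger values widen the search but slow down inference.
*   `max_length` sets the maximum length of the generated summary in tokens. Generation is cut off once the output reaches this length.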
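
The effect of these two parameters can be seen in a toy beam search. The sketch below is not the `transformers` implementation (which also applies length penalties, `early_stopping`, and other options); `TOY_LOGPROBS` is a hypothetical stand-in for the model in which the next-token probabilities depend only on the previous token, and `beam_search` is an illustrative helper. In this toy, `max_length` counts the start token as well.

```python
import math

# Hypothetical toy "model" (assumption): next-token log-probabilities
# conditioned only on the previous token.
TOY_LOGPROBS = {
    "<s>": {"a": math.log(0.6), "b": math.log(0.4)},
    "a": {"x": math.log(0.5), "y": math.log(0.5)},
    "b": {"z": math.log(0.9), "</s>": math.log(0.1)},
    "x": {"</s>": 0.0},
    "y": {"</s>": 0.0},
    "z": {"</s>": 0.0},
}


def beam_search(num_beams, max_length):
    """Return (tokens, log_prob) of the best sequence found."""
    beams = [(["<s>"], 0.0)]
    finished = []
    while beams:
        # Expand every live beam by every possible next token.
        candidates = []
        for tokens, score in beams:
            for token, logprob in TOY_LOGPROBS[tokens[-1]].items():
                candidates.append((tokens + [token], score + logprob))
        # Keep only the num_beams highest-scoring candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:num_beams]:
            # A sequence stops at </s>, or is forcibly cut at max_length.
            if tokens[-1] == "</s>" or len(tokens) >= max_length:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
    return max(finished, key=lambda c: c[1])


# num_beams=1 is greedy search: it commits to "a" and ends with probability 0.3.
print(beam_search(num_beams=1, max_length=10))
# num_beams=2 also keeps "b" alive and finds the better sequence (probability 0.36).
print(beam_search(num_beams=2, max_length=10))
# A small max_length cuts the sequence off before </s> is produced.
print(beam_search(num_beams=2, max_length=3))
```

With a beam width of 1 the search never revisits the initially less likely token `b`, so it misses the sequence with the highest overall probability; widening the beam recovers it at the cost of scoring more candidates per step.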

In [None]:
summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=120, early_stopping=True)
print(summary_ids)
decoded_summary = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids]
print(decoded_summary)

tensor([[    2,     0,  6517,   807,   140,     8,     5, 23387, 14414,    37,
            16,  3518,     7,    28,  2190,    32,   878,    66,     9,   797,
             4,    20,    80,   688,   454,  7713,  1053, 16383,    62,    25,
           566,     5,   144, 11355,     8, 16067,  5788,   655,   789,     9,
            10,  1939,   900,     4,    91,    18,    15,    10,  2310,  7329,
           768,    19,   925,     4,  3173,   274,  1180,  2520,     6,    54,
            18,  3271,  8026,   596,   140,  4265, 11445,  2498,    16,  3953,
             4,     2]])
["President Donald Trump and the pandemic he is supposed to be fighting are running out of control. The two weeks until Election Day shaping up as among the most ugly and divisive periods ever ahead of a presidential vote. He's on a fresh collision course with Dr. Anthony Fauci, who's publicly questioning why Trump thinks mask wearing is weak."]


Generated summary


---


"President Donald Trump and the pandemic he is supposed to be fighting are running out of control. The two weeks until Election Day shaping up as among the most ugly and divisive periods ever ahead of a presidential vote. He's on a fresh collision course with Dr. Anthony Fauci, who's publicly questioning why Trump thinks mask wearing is weak."