# Text Summarization

Summarization falls into two broad types: extractive and abstractive.

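1. Extractive: selects important sentences from the original text and arranges them to form the summary.
2. Abstractive: uses natural language processing techniques to generate new sentences that form the summary.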
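
To make the extractive idea concrete, here is a minimal sketch that scores each sentence by the average frequency of its words and keeps the top ones in their original order. The helper names (`split_sentences`, `extractive_summary`), the regex-based sentence splitter, and the frequency score are assumptions chosen for illustration; they are not how the libraries used later in this notebook work.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text, num_sentences=2):
    """Pick the highest-scoring sentences and return them in original order."""
    sentences = split_sentences(text)
    # Word frequencies over the whole text serve as a crude importance signal.
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore document order
    return [sentences[i] for i in chosen]


sample = (
    "Summarization shortens a text. "
    "Extractive summarization selects sentences from the text. "
    "The weather was pleasant yesterday."
)
summary = extractive_summary(sample, num_sentences=2)
print(summary)
```

Every sentence in the result is copied verbatim from the input, which is the defining property of an extractive summary.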

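This notebook introduces several existing approaches to summarization. We start with sumy, a Python library that implements several widely used summarization algorithms. Next, we use the gensim implementation. After that, we try Summa and BERT-based extractive summarization.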

## Setup

### Installing packages

According to gensim's [migration guide](https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#12-removed-gensimsummarization), `gensim.summarization` was removed in gensim 4.x, so we install the 3.x series.

In [1]:
!pip -q install nltk==3.2.5 sumy==0.8.1 gensim==3.8.3 transformers==4.11.0 summa==1.2.0 sentencepiece==0.1.96 bert-extractive-summarizer==0.8.1



### Imports

In [2]:
import nltk
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### Uploading the data

Upload `nlphistory.txt`, which is included in this chapter's folder, as the text to summarize.

In [3]:
from google.colab import files

uploaded = files.upload()

Saving nlphistory.txt to nlphistory.txt


### Loading the data

In [4]:
with open("nlphistory.txt", encoding="utf-8") as f:
  text = f.read()

## Summarization with Sumy

[Sumy](https://github.com/miso-belica/sumy) provides the following summarization algorithms.

1. Luhn: a heuristic method
2. LSA: latent semantic analysis
3. LexRank: an unsupervised approach inspired by PageRank and HITS
4. TextRank: graph-based summarization using keywords extracted from the text
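
LexRank and TextRank both rank sentences on a graph whose edges encode sentence similarity. The sketch below illustrates that shared idea with a PageRank-style power iteration. It is a simplification under stated assumptions: Jaccard word overlap stands in for the similarity measures the real algorithms use, and `similarity_matrix` and `rank_sentences` are hypothetical helper names, not part of Sumy.

```python
import re


def tokenize(sentence):
    """Lowercased set of words in a sentence."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def similarity_matrix(sentences):
    """Jaccard word overlap between every pair of distinct sentences."""
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and tokens[i] | tokens[j]:
                matrix[i][j] = len(tokens[i] & tokens[j]) / len(tokens[i] | tokens[j])
    return matrix


def rank_sentences(sentences, damping=0.85, iterations=50):
    """PageRank-style power iteration over the similarity graph."""
    n = len(sentences)
    matrix = similarity_matrix(sentences)
    # Row-normalize so each row holds transition probabilities.
    # A sentence with no similar neighbor jumps uniformly to any sentence.
    transition = []
    for row in matrix:
        total = sum(row)
        transition.append([v / total for v in row] if total else [1.0 / n] * n)

    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(scores[j] * transition[j][i] for j in range(n))
            for i in range(n)
        ]
    return scores


example = [
    "Graph ranking scores sentences by similarity.",
    "Sentences with high similarity to many sentences rank high.",
    "Bananas are yellow.",
]
scores = rank_sentences(example)
for sentence, value in zip(example, scores):
    print(f"{value:.3f}  {sentence}")
```

The unrelated third sentence shares no words with the others, so it ends up with the lowest score. A summarizer built on this would return the top-scoring sentences.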


In [5]:
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.html import HtmlParser
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.text_rank import TextRankSummarizer

num_sentences_in_summary = 2
url = "https://en.wikipedia.org/wiki/Automatic_summarization"
parser = HtmlParser.from_url(url, Tokenizer("english"))

# Display labels, paired in order with the summarizer instances below
summarizer_names = (
    "TextRankSummarizer:",
    "LexRankSummarizer:",
    "LuhnSummarizer:",
    "LsaSummarizer:",
)
summarizers = [
    TextRankSummarizer(),
    LexRankSummarizer(),
    LuhnSummarizer(),
    LsaSummarizer(),
]

# Print a two-sentence summary of the page from each algorithm
for name, summarizer in zip(summarizer_names, summarizers):
    print(name)
    for sentence in summarizer(parser.document, num_sentences_in_summary):
        print(sentence)
    print("-" * 30)

TextRankSummarizer:
For text, extraction is analogous to the process of skimming, where the summary (if available), headings and subheadings, figures, the first and last paragraphs of a section, and optionally the first and last sentences in a paragraph are read before one chooses to read the entire document in detail.
Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 (i.e., the stationary distribution of the random walk on the graph).
------------------------------
LexRankSummarizer:
An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.
The main difficulty in supervised extractive summarization is that the known summaries must be manually created by extracting sentences so the sentences in an original training documen

Sumy offers other summarization algorithms and options as well. Exploring them is left as an exercise for the reader.

## Summarization with gensim

In [6]:
from gensim import corpora
from gensim.summarization import summarize, summarize_corpus
from gensim.summarization.textcleaner import split_sentences

# summarize method extracts the most relevant sentences in a text
print("Summarize:\n", summarize(text, word_count=200, ratio=0.1))


# the summarize_corpus selects the most important documents in a corpus:
sentences = split_sentences(text)  # Creates a corpus where each document is a sentence.
tokens = [sentence.split() for sentence in sentences]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(sentence_tokens) for sentence_tokens in tokens]

# Extracts the most important documents (shown here in BoW representation)
print("-" * 30, "\nSummarize Corpus\n", summarize_corpus(corpus, ratio=0.1))

Summarize:
 Some notably successful natural language processing systems developed in the 1960s were SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies, and ELIZA, a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum between 1964 and 1966.
This was due to both the steady increase in computational power (see Moore's law) and the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing.[3] Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules.
However, part-of-speech tagging introduced the use of hidden Markov models to natural language processing, and increasingly, research has focused on statistical models, which make soft, proba

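`summarize` takes two parameters, `word_count` and `ratio`, and adjusting them changes the output. When both are given, as in the cell above, `word_count` takes precedence and `ratio` is ignored.

1. `word_count`: the maximum number of words in the summary
2. `ratio`: the fraction of the sentences in the original text to return as output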
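
The interplay of the two parameters can be sketched in a few lines. The function below is a hypothetical helper written for illustration, not gensim's implementation: it assumes the sentences are already sorted by importance and applies either a sentence ratio or a word budget.

```python
def select_sentences(ranked_sentences, ratio=0.2, word_count=None):
    """Select from sentences already sorted by importance (best first).

    Hypothetical helper for illustration only, not gensim's code.
    """
    if word_count is None:
        # Keep a fixed fraction of the sentences (at least one).
        keep = max(1, int(len(ranked_sentences) * ratio))
        return ranked_sentences[:keep]

    # With a word budget, `ratio` is ignored: add sentences until the
    # next one would exceed the budget.
    selected, used = [], 0
    for sentence in ranked_sentences:
        length = len(sentence.split())
        if used + length > word_count:
            break
        selected.append(sentence)
        used += length
    return selected


ranked = ["a b c d", "e f g", "h i", "j"]
print(select_sentences(ranked, ratio=0.5))                # half of the sentences
print(select_sentences(ranked, ratio=0.5, word_count=5))  # word budget wins
```

The second call shows why changing `ratio` has no visible effect once `word_count` is set.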

## Summarization with Summa

Summa's summarizer also uses TextRank, but in an optimized form. The following paper describes the optimizations in detail.

- [Variations of the Similarity Function of TextRank for Automated Summarization](https://arxiv.org/pdf/1602.03606.pdf).
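
As background for the paper's title, the similarity function of the original TextRank formulation counts the words two sentences share and normalizes by the logarithm of each sentence's length. The sketch below implements that baseline only. The tokenization with `\w+` is an assumption, and the variations evaluated in the paper are not reproduced here.

```python
import math
import re


def textrank_similarity(sentence_a, sentence_b):
    """Word overlap normalized by the log lengths of both sentences."""
    words_a = re.findall(r"\w+", sentence_a.lower())
    words_b = re.findall(r"\w+", sentence_b.lower())
    if not words_a or not words_b:
        return 0.0
    denominator = math.log(len(words_a)) + math.log(len(words_b))
    if denominator <= 0:
        # Two single-word sentences would make the denominator zero.
        return 0.0
    overlap = len(set(words_a) & set(words_b))
    return overlap / denominator


print(textrank_similarity("the cat sat", "the cat ran"))
print(textrank_similarity("the cat sat", "bananas are yellow"))
```

The log normalization keeps long sentences from dominating simply because they contain more words.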

In [7]:
from summa import summarizer

print("Summary:")
print(summarizer.summarize(text, ratio=0.1))

Summary:
However, part-of-speech tagging introduced the use of hidden Markov models to natural language processing, and increasingly, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data.
In the 2010s, representation learning and deep neural network-style machine learning methods became widespread in natural language processing, due in part to a flurry of results showing that such techniques[4][5] can achieve state-of-the-art results in many natural language tasks, for example in language modeling,[6] parsing,[7][8] and many others.


## Extractive summarization with BERT

In [10]:
from summarizer import Summarizer

model = Summarizer()
# The model returns the summary as a single string
result = model(text, min_length=200, ratio=0.01)
print("Summarization")
print(result)


Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Summarization
The authors claimed that within three or five years, machine translation would be a solved problem.[2] However, real progress was much slower, and after the ALPAC report in 1966, which found that ten-year-long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. This was due to both the steady increase in computational power (see Moore's law) and the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing.[3] Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules.


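## Abstractive summarization

In the past, summaries were often generated with `RNN`-based models. More recently, `Transformer`-based models have become the norm. As also noted in Chapter 7, abstractive summarization is a topic that needs further research, and it may still be too early to adopt it in real applications.

Here, we use T5 (Text-To-Text Transfer Transformer) to perform abstractive text summarization.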

In [11]:
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Load the model and tokenizer
model_name = "t5-small"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


In [12]:
text = """
don’t build your own MT system if you don’t have to. It is more practical to make use of the translation APIs. When we use such APIs, it is important to pay closer attention to pricing policies. It would perhaps make sense to store the translations of frequently used text (called a translation memory or a translation cache).

If you’re working with a entirely new language, or say a new domain where existing translation APIs do poorly, it would make sense to start with a domain knowledge based rule based translation system addressing the restricted scenario you deal with. Another approach to address such data scarce scenarios is to augment your training data by doing “back translation”. Let us say we want to translate from English to Navajo language. English is a popular language for MT, but Navajo is not. We do have a few examples of English-Navajo translation. In such a case, one can build a first MT model between Navajo-English, and use this system to translate a few Navajo sentences into English. At this point, these machine translated Navajo-English pairs can be added as additional training data to English-Navajo MT system. This results in a translation system with more examples to train on (even though some of these examples are synthetic). In general, though, if accuracy of translation is paramount, it would perhaps make sense to form a hybrid MT system which combines the neural models with rules and some form of post-processing, though.

"""

Prefixing the text with `summarize:` tells the model that the task is summarization. For translation from English to German, the prefix is `translate English to German:`. See the following documentation for details.

- [T5](https://huggingface.co/transformers/model_doc/t5.html)
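
Because T5 treats every task as text-to-text, switching tasks only means changing the string in front of the input. The helper below is a minimal sketch of that convention. `TASK_PREFIXES`, its keys, and `add_task_prefix` are hypothetical names introduced for illustration, and the whitespace collapsing is an assumption that differs slightly from the preprocessing in the next cell.

```python
# Hypothetical task labels mapped to the prefixes T5 was trained with
TASK_PREFIXES = {
    "summarize": "summarize: ",
    "en_to_de": "translate English to German: ",
}


def add_task_prefix(task, text):
    """Return `text` on a single line with the prefix for `task` prepended."""
    # Collapse newlines and repeated spaces into single spaces.
    cleaned = " ".join(text.split())
    return TASK_PREFIXES[task] + cleaned


print(add_task_prefix("summarize", "First line.\nSecond line."))
print(add_task_prefix("en_to_de", "The house is wonderful."))
```

The resulting string is what gets passed to the tokenizer.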

In [13]:
# Preprocessing: remove newlines and prepend the task prefix
preprocess_text = text.strip().replace("\n", "")
t5_prepared_text = "summarize: " + preprocess_text
print("original text preprocessed:")
print(preprocess_text)

device = torch.device("cpu")
tokenized_text = tokenizer.encode(t5_prepared_text, return_tensors="pt").to(device)

original text preprocessed:
don’t build your own MT system if you don’t have to. It is more practical to make use of the translation APIs. When we use such APIs, it is important to pay closer attention to pricing policies. It would perhaps make sense to store the translations of frequently used text (called a translation memory or a translation cache). If you’re working with a entirely new language, or say a new domain where existing translation APIs do poorly, it would make sense to start with a domain knowledge based rule based translation system addressing the restricted scenario you deal with. Another approach to address such data scarce scenarios is to augment your training data by doing “back translation”. Let us say we want to translate from English to Navajo language. English is a popular language for MT, but Navajo is not. We do have a few examples of English-Navajo translation. In such a case, one can build a first MT model between Navajo-English, and use this system to trans

In [14]:
# Generate the summary
summary_ids = model.generate(
    tokenized_text,
    num_beams=4,
    no_repeat_ngram_size=2,
    min_length=30,
    max_length=100,
    early_stopping=True,
)
# More parameters are listed at https://huggingface.co/transformers/model_doc/t5.html

output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("Summarized text:")
print(output)

Summarized text:
it is more practical to make use of the translation APIs. if you’re working with a completely new language, it would make sense to store translations of frequently used text (called translation memory or translation cache)
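
The `no_repeat_ngram_size=2` argument passed to `generate` above prevents any 2-gram from appearing twice in the output. The sketch below shows the bookkeeping behind such a constraint on a list of word tokens. `banned_next_tokens` is a hypothetical helper for illustration; the library applies the same idea to token IDs inside its decoding loop.

```python
def banned_next_tokens(generated, ngram_size):
    """Tokens that would complete an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size:
        return set()  # no complete n-gram exists yet
    # The last n-1 tokens form the prefix of the n-gram being built.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


tokens = ["it", "is", "more", "it"]
print(banned_next_tokens(tokens, ngram_size=2))
```

Here the sequence already contains "it is", so after the second "it" the token "is" is excluded from the candidates for the next step.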
