# Text Summarization

- [Hugging Face ecosystem](https://huggingface.co/)

## CNN / DailyMail dataset

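- `CNN / DailyMail dataset`: about 300,000 pairs of a news article and its summary
  - Each summary comes from the bullet points that CNN and DailyMail attach to their articles
    - An important property is that the summaries are abstractive rather than extractive
    - That is, they are written as new sentences instead of being excerpts copied from the article body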
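
To make the extractive/abstractive distinction concrete, here is a minimal sketch that measures how many of a summary's bigrams never occur in the source text. The helper names (`ngrams`, `novel_ngram_ratio`) and the toy strings are illustrative assumptions, not part of the dataset or of any library.

```python
import re


def ngrams(text, n):
    """Return the set of lowercase word n-grams in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_ratio(article, summary, n=2):
    """Fraction of the summary's n-grams that do not appear in the article."""
    summary_ngrams = ngrams(summary, n)
    if not summary_ngrams:
        return 0.0
    novel = summary_ngrams - ngrams(article, n)
    return len(novel) / len(summary_ngrams)


# Toy strings (hypothetical, for illustration only)
article = "The young actor says he has no plans to fritter his cash away on fast cars."
extractive = "The young actor says he has no plans to fritter his cash away."
abstractive = "Radcliffe will not waste his money."

extractive_ratio = novel_ngram_ratio(article, extractive)
abstractive_ratio = novel_ngram_ratio(article, abstractive)
print(extractive_ratio, abstractive_ratio)
```

A ratio near 0 suggests the summary was copied from the article, while a ratio near 1 suggests it was rephrased. Real highlights usually fall somewhere in between.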
  

### Package installation

In [1]:
!pip install transformers nltk datasets

### Data load

- The `CNN / DailyMail` dataset has three features
1. `article`: the news article
2. `highlights`: the summary
3. `id`: a unique ID for each article

In [2]:
from datasets import load_dataset

# Takes roughly 2 minutes
dataset = load_dataset("cnn_dailymail", version="3.0.0")
print(f'Features: {dataset["train"].column_names}')

Features: ['article', 'highlights', 'id']


In [3]:
sample = dataset['train'][0]
print(f"Article (first 500 characters, total length: {len(sample['article'])}):")
print(sample['article'][:500])
print(f"\nSummary (length: {len(sample['highlights'])})")
print(sample['highlights'])

Article (first 500 characters, total length: 2527):
LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as s

Summary (length: 217)
Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .
Young actor says he has no plans to fritter his cash away .
Radcliffe's earnings from first five Potter films have been held in trust fund .


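## Text summarization pipeline
- We try out several models that can perform text summarization
- Each model has a different maximum input size, but to compare outputs on identical input we limit the input text to 2,000 characters for every model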
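
A plain slice at exactly 2,000 characters can cut the last word in half. As a hedged alternative, the sketch below backs up to the previous whitespace when the cut lands inside a word. `truncate_at_word` is a hypothetical helper introduced here for illustration; the notebook itself uses a simple slice.

```python
def truncate_at_word(text, max_chars=2000):
    """Truncate `text` to at most `max_chars` characters without splitting a word."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # The cut is mid-word only if both sides of the boundary are non-whitespace.
    mid_word = bool(cut) and not cut[-1].isspace() and not text[max_chars].isspace()
    if mid_word:
        # Drop the trailing partial word (a single unbroken word is kept as-is).
        cut = cut.rsplit(None, 1)[0]
    return cut.rstrip()


demo = "alpha beta gamma"
print(truncate_at_word(demo, max_chars=8))
print(truncate_at_word(demo, max_chars=10))
```

Character limits are only a rough proxy for model limits, which are counted in tokens, so this keeps the comparison simple rather than exact.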

In [4]:
sample_text = dataset['train'][0]['article'][:2000]

sample_text

'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Details o

In [5]:
# Store the summary generated by each model in a dictionary
summaries = {}

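By convention, a summary is displayed with one sentence per line. We could insert a newline after every period, but that approach mishandles strings such as `U.S.` and `U.N.`.

The `NLTK` package provides a more sophisticated algorithm that distinguishes sentence-ending punctuation from the punctuation inside abbreviations, so we use `NLTK` to place the line breaks correctly.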
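
To see why the naive rule falls short, the following sketch splits after every period that is followed by whitespace. `naive_sentence_split` is a hypothetical helper written only for this comparison; it uses the standard `re` module and nothing from `NLTK`.

```python
import re


def naive_sentence_split(text):
    """Split wherever a period is followed by whitespace (the naive rule)."""
    return [piece for piece in re.split(r"(?<=\.)\s+", text) if piece]


example = "The U.S. are a country. The U.N. is an organization."
naive = naive_sentence_split(example)
for piece in naive:
    print(piece)
```

The abbreviations are treated as sentence endings, so the two sentences are broken into four fragments. This is the failure that a sentence tokenizer aware of abbreviations is meant to avoid.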

In [6]:
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [7]:
example = "The U.S. are a country. The U.N. is an organization."
sent_tokenize(example)

['The U.S. are a country.', 'The U.N. is an organization.']

## Loading the summarization models
- `GPT-2`
- `T5`
- `BART`

### `GPT-2`

- [paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
- Prompt
```text
\nTL;DR:\n
```

In [8]:
from transformers import pipeline, set_seed

set_seed(42)  # fix the random seed so the sampled output is reproducible

In [9]:
pipe = pipeline("text-generation", model='gpt2')  # or 'gpt2-xl' for the largest checkpoint
pipe

<transformers.pipelines.text_generation.TextGenerationPipeline at 0x7ab27c7408e0>

In [10]:
gpt2_query = sample_text + '\nTL;DR:\n'  # append the TL;DR prompt
gpt2_query

'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Details o

In [11]:
pipe_out = pipe(gpt2_query, max_length=512, clean_up_tokenization_spaces=True)
pipe_out

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box off

In [12]:
summaries['gpt2'] = "\n".join(sent_tokenize(pipe_out[0]['generated_text'][len(gpt2_query):]))
print(summaries['gpt2'])

I have seen "My Boy Jack" and written the screenplay, and it has received rave reviews (see here for details or check it out at The Movie Critic).
I expect he will be in the UK from now on.
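
The text-generation pipeline returns the prompt followed by the continuation, which is why the cell above slices off `len(gpt2_query)` characters. That slice assumes the decoded text starts with the prompt exactly as it was passed in. A slightly more defensive sketch is shown below; `extract_continuation` and its `marker` fallback are hypothetical and introduced only for illustration.

```python
def extract_continuation(generated_text, prompt, marker="TL;DR:"):
    """Return only the text the model added after the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fallback: the decoded prompt differs slightly (e.g. whitespace cleanup),
    # so take everything after the first occurrence of the marker instead.
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else generated_text.strip()


# Stand-in strings (assumed; no model is called here)
prompt = "Some article text.\nTL;DR:\n"
exact = prompt + "A short summary."
altered = "Some article text. \nTL;DR:\n A short summary."

print(extract_continuation(exact, prompt))
print(extract_continuation(altered, prompt))
```

The fallback would pick the wrong position if the article itself contained the marker, so treat it as a convenience, not a guarantee.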


### `T5`

- [paper](https://arxiv.org/pdf/1910.10683.pdf)

- Prompt
```text
summarize: <ARTICLE>
```

In [13]:
pipe = pipeline("summarization", model='t5-base')  # t5-large
pipe_out = pipe(sample_text)
summaries['t5'] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [18]:
print(summaries['t5'])

young actor says he has no plans to fritter his cash away on fast cars, drink .
"i don't think I'll be particularly extravagant," he tells an australian interviewer .
details of how he'll mark his landmark birthday are under wraps .
despite growing fame and riches, the actor says his feet are firmly on the ground .


### `BART`

- [paper](https://arxiv.org/pdf/1910.13461.pdf)

In [15]:
pipe = pipeline('summarization', model='facebook/bart-large-cnn')
pipe_out = pipe(sample_text)
summaries['bart'] = '\n'.join(sent_tokenize(pipe_out[0]["summary_text"]))

In [17]:
print(summaries['bart'])

Harry Potter star Daniel Radcliffe turns 18 on Monday.
He gains access to a reported £20 million ($41.1 million) fortune.
Radcliffe says he has no plans to fritter his cash away on fast cars, drink and parties.
His earnings from the first five Potter films have been held in a trust fund.


## Comparing the summaries

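- `GPT-2`: not trained on the CNN/DM summarization task
- `T5`: trained on a mixture of tasks that includes summarization; CNN/DM is one of the tasks in that mixture, but the model is not fine-tuned on it exclusively
- `BART`: fine-tuned specifically for summarization on CNN/DM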
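
A rough way to put a number on this comparison is to check how many of the reference words each model's summary recovers. The sketch below is a crude stand-in for a unigram-recall metric, not an implementation of ROUGE. `unigram_recall` is a hypothetical helper, and the strings are copied from the outputs shown in this notebook.

```python
import re


def unigram_recall(reference, candidate):
    """Share of distinct reference words that also appear in the candidate."""
    ref_words = set(re.findall(r"\w+", reference.lower()))
    cand_words = set(re.findall(r"\w+", candidate.lower()))
    return len(ref_words & cand_words) / len(ref_words) if ref_words else 0.0


reference = (
    "Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday . "
    "Young actor says he has no plans to fritter his cash away . "
    "Radcliffe's earnings from first five Potter films have been held in trust fund ."
)
candidates = {
    "gpt2": (
        'I have seen "My Boy Jack" and written the screenplay, and it has received '
        "rave reviews (see here for details or check it out at The Movie Critic). "
        "I expect he will be in the UK from now on."
    ),
    "t5": (
        "young actor says he has no plans to fritter his cash away on fast cars, drink . "
        "\"i don't think I'll be particularly extravagant,\" he tells an australian interviewer . "
        "details of how he'll mark his landmark birthday are under wraps . "
        "despite growing fame and riches, the actor says his feet are firmly on the ground ."
    ),
    "bart": (
        "Harry Potter star Daniel Radcliffe turns 18 on Monday. "
        "He gains access to a reported £20 million ($41.1 million) fortune. "
        "Radcliffe says he has no plans to fritter his cash away on fast cars, drink and parties. "
        "His earnings from the first five Potter films have been held in a trust fund."
    ),
}

scores = {name: unigram_recall(reference, text) for name, text in candidates.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Word overlap ignores word order and meaning, so it only supports the qualitative reading of the summaries and does not replace it.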

In [20]:
print("GROUND TRUTH")
print(dataset['train'][0]['highlights'])
print()

for model_name in summaries:
  print(model_name.upper())
  print(summaries[model_name])
  print("")

GROUND TRUTH
Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .
Young actor says he has no plans to fritter his cash away .
Radcliffe's earnings from first five Potter films have been held in trust fund .

GPT2
I have seen "My Boy Jack" and written the screenplay, and it has received rave reviews (see here for details or check it out at The Movie Critic).
I expect he will be in the UK from now on.

T5
young actor says he has no plans to fritter his cash away on fast cars, drink .
"i don't think I'll be particularly extravagant," he tells an australian interviewer .
details of how he'll mark his landmark birthday are under wraps .
despite growing fame and riches, the actor says his feet are firmly on the ground .

BART
Harry Potter star Daniel Radcliffe turns 18 on Monday.
He gains access to a reported £20 million ($41.1 million) fortune.
Radcliffe says he has no plans to fritter his cash away on fast cars, drink and parties.
His earnings from the first five Po