In [1]:
from datasets import load_dataset

# "3.0.0" is the dataset configuration name, passed as the second positional argument.
dataset = load_dataset("cnn_dailymail", "3.0.0")
print(f"Features: {dataset['train'].column_names}")

Features: ['article', 'highlights', 'id']





In [3]:
sample = dataset["train"][1]

print(f"""Article (excerpt of 500 characters, total length: {len(sample["article"])}):""")
print(sample["article"][:500])
print(f'\nSummary (length: {len(sample["highlights"])}):')
print(sample["highlights"])

Article (excerpt of 500 characters, total length: 4051):
Editor's note: In our Behind the Scenes series, CNN correspondents share their experiences in covering news and analyze the stories behind the events. Here, Soledad O'Brien takes users inside a jail where many of the inmates are mentally ill. An inmate housed on the "forgotten floor," where many mentally ill inmates are housed in Miami before trial. MIAMI, Florida (CNN) -- The ninth floor of the Miami-Dade pretrial detention facility is dubbed the "forgotten floor." Here, inmates with the most s

Summary (length: 281):
Mentally ill inmates in Miami are housed on the "forgotten floor"
Judge Steven Leifman says most are there as a result of "avoidable felonies"
While CNN tours facility, patient shouts: "I am the son of the president"
Leifman says the system is unjust and he's fighting for change .


In [4]:
sample_text = dataset["train"][1]["article"][:2000]
summaries = {}

In [6]:
import nltk
from nltk.tokenize import sent_tokenize

In [7]:
nltk.download("punkt")

string = "The U.S. are a country. The U.N. is an organization."
sent_tokenize(string)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/parkhyerin/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['The U.S. are a country.', 'The U.N. is an organization.']
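For contrast, here is a minimal sketch of what a naive splitter does with the same string. It assumes that "a period followed by a space" marks a sentence boundary, which is exactly the assumption that abbreviations such as "U.S." and "U.N." break. The variable name `naive_sentences` is only illustrative.

```python
# The same string used in the cell above.
string = "The U.S. are a country. The U.N. is an organization."

# Naive approach: treat every period followed by a space as a sentence boundary.
naive_sentences = string.split(". ")

for sentence in naive_sentences:
    print(repr(sentence))
```

The naive split yields four fragments instead of two sentences, because it cuts after each abbreviation. `sent_tokenize` keeps the abbreviations intact, as the output above shows.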

In [9]:
def three_sentence_summary(text):
    """Baseline: return the first three sentences of the text, one per line."""
    return "\n".join(sent_tokenize(text)[:3])

summaries["baseline"] = three_sentence_summary(sample_text)

summaries["baseline"]

'Editor\'s note: In our Behind the Scenes series, CNN correspondents share their experiences in covering news and analyze the stories behind the events.\nHere, Soledad O\'Brien takes users inside a jail where many of the inmates are mentally ill. An inmate housed on the "forgotten floor," where many mentally ill inmates are housed in Miami before trial.\nMIAMI, Florida (CNN) -- The ninth floor of the Miami-Dade pretrial detention facility is dubbed the "forgotten floor."'

In [10]:
from transformers import pipeline, set_seed

# GPT-2

set_seed(42)
pipe = pipeline("text-generation", model="gpt2-xl")
gpt2_query = sample_text + "\nTL;DR:\n"
pipe_out = pipe(gpt2_query, max_length=512, clean_up_tokenization_spaces=True)
summaries["gpt2"] = "\n".join(
    sent_tokenize(pipe_out[0]["generated_text"][len(gpt2_query) :])
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [11]:
summaries["gpt2"]

'1.\nIn Miami, where about one-third of all inmates are mentally ill, many prisoners on the "forgotten floor" are mentally ill and suffering from mental illness.\nI was in Miami for a trial.\nIt was a high-profile case about murder -- it would be the most high-profile murder case in Miami-Dade history.\nI\'ve been covering a lot'
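The GPT-2 cell recovers the summary by slicing the prompt off the front of `generated_text`, since the text-generation pipeline returns the prompt together with the continuation by default. The sketch below isolates that step in a small helper. The name `extract_continuation` and the sample strings are hypothetical stand-ins, not part of the notebook, and the prefix check is an added safeguard: slicing by length only works if the returned text really begins with the prompt, which tokenization cleanup could in principle change.

```python
def extract_continuation(generated_text: str, prompt: str) -> str:
    """Return only the text that follows the prompt in the generated output."""
    if not generated_text.startswith(prompt):
        # Slicing by length would silently cut at the wrong position here.
        raise ValueError("generated text does not start with the prompt")
    return generated_text[len(prompt):]


# Hypothetical stand-ins for gpt2_query and pipe_out[0]["generated_text"].
prompt = "Some article text.\nTL;DR:\n"
generated_text = prompt + "A short summary."

continuation = extract_continuation(generated_text, prompt)
print(continuation)
```

An alternative that may avoid the slicing altogether is passing `return_full_text=False` to the pipeline call, which asks it to return only the newly generated text.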

In [12]:
# T5

pipe = pipeline("summarization", model="t5-large")
pipe_out = pipe(sample_text)
summaries["t5"] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))



In [13]:
summaries["t5"]

'mentally ill inmates are housed on the ninth floor of a florida jail .\nmost face drug charges or charges of assaulting an officer .\njudge says arrests often result from confrontations with police .\none-third of all people in Miami-dade county jails are mental ill .'

In [14]:
# BART
pipe = pipeline("summarization", model="facebook/bart-large-cnn")
pipe_out = pipe(sample_text)
summaries["bart"] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))



In [15]:
summaries["bart"]

'Mentally ill inmates are housed on the "forgotten floor" of Miami-Dade jail.\nMost often, they face drug charges or charges of assaulting an officer.\nJudge Steven Leifman says the arrests often result from confrontations with police.\nHe says about one-third of all people in the county jails are mentally ill.'

In [17]:
# Pegasus
pipe = pipeline("summarization", model="google/pegasus-cnn_dailymail")
pipe_out = pipe(sample_text)
# The checkpoint marks line breaks with "<n>". The first replace tidies " .<n>";
# the second handles any "<n>" that is not preceded by " .".
summaries["pegasus"] = (
    pipe_out[0]["summary_text"].replace(" .<n>", ".\n").replace("<n>", "\n")
)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [18]:
summaries["pegasus"]

'Mentally ill inmates in Miami are housed on the "forgotten floor"\nThe ninth floor is where they\'re held until they\'re ready to appear in court.\nMost often, they face drug charges or charges of assaulting an officer.\nThey end up on the ninth floor severely mentally disturbed .'
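With a summary stored per model, a natural next step is to read them side by side against the reference highlights. The following is a minimal sketch of such a report. The helper `format_summaries` is hypothetical, and the demo uses only the first line of the reference and of two model outputs shown above, rather than the full `summaries` dictionary, so that the block runs on its own.

```python
def format_summaries(reference: str, summaries: dict[str, str]) -> str:
    """Build a plain-text report: the reference first, then each model's summary."""
    blocks = ["GROUND TRUTH", reference, ""]
    for name, summary in summaries.items():
        blocks.extend([name.upper(), summary, ""])
    return "\n".join(blocks).rstrip()


# First lines only, copied from the outputs above (excerpts, not full summaries).
reference_excerpt = 'Mentally ill inmates in Miami are housed on the "forgotten floor"'
summaries_excerpt = {
    "t5": "mentally ill inmates are housed on the ninth floor of a florida jail .",
    "bart": 'Mentally ill inmates are housed on the "forgotten floor" of Miami-Dade jail.',
}

report = format_summaries(reference_excerpt, summaries_excerpt)
print(report)
```

In the notebook itself, the same helper could be called as `format_summaries(dataset["train"][1]["highlights"], summaries)`.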