[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kjmazidi/NLP/blob/master/Part_7-AI_Trends/Chapter_27_LLM/HF-NLP-tasks.ipynb)

###### Code accompanies *Natural Language Processing* by KJG Mazidi, all rights reserved.

### Summarization

This notebook demonstrates another NLP task: summarization.

In [1]:
# Transformers installation
! pip install transformers datasets


Note that the following code specifies the task, summarization, and also identifies a particular model to use.

In [7]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")




This raw text is from Project Gutenberg, and is the second paragraph in Mary Shelley's *Frankenstein*. Summarization models are typically trained on expository text such as news articles, not narrative, but let's see how the model does.

In [8]:
raw_text = """I am already far north of London, and as I walk in the streets of
Petersburgh, I feel a cold northern breeze play upon my cheeks, which
braces my nerves and fills me with delight. Do you understand this
feeling? This breeze, which has travelled from the regions towards
which I am advancing, gives me a foretaste of those icy climes.
Inspirited by this wind of promise, my daydreams become more fervent
and vivid. I try in vain to be persuaded that the pole is the seat of
frost and desolation; it ever presents itself to my imagination as the
region of beauty and delight. There, Margaret, the sun is for ever
visible, its broad disk just skirting the horizon and diffusing a
perpetual splendour. There—for with your leave, my sister, I will put
some trust in preceding navigators—there snow and frost are banished;
and, sailing over a calm sea, we may be wafted to a land surpassing in
wonders and in beauty every region hitherto discovered on the habitable
globe. Its productions and features may be without example, as the
phenomena of the heavenly bodies undoubtedly are in those undiscovered
solitudes. What may not be expected in a country of eternal light? I
may there discover the wondrous power which attracts the needle and may
regulate a thousand celestial observations that require only this
voyage to render their seeming eccentricities consistent for ever. I
shall satiate my ardent curiosity with the sight of a part of the world
never before visited, and may tread a land never before imprinted by
the foot of man. These are my enticements, and they are sufficient to
conquer all fear of danger or death and to induce me to commence this
laborious voyage with the joy a child feels when he embarks in a little
boat, with his holiday mates, on an expedition of discovery up his
native river. But supposing all these conjectures to be false, you
cannot contest the inestimable benefit which I shall confer on all
mankind, to the last generation, by discovering a passage near the pole
to those countries, to reach which at present so many months are
requisite; or by ascertaining the secret of the magnet, which, if at
all possible, can only be effected by an undertaking such as mine."""

In [11]:
len(raw_text)

2200

In [9]:
summary = pipe(raw_text)

In [10]:
print(summary)

[{'summary_text': ' Margaret, Margaret, writes: "I am already far north of London, and as I walk in the streets of St.Petersburgh, I feel a cold northern breeze play upon my cheeks, which…braces my nerves and fills me with delight" "I try in vain to be persuaded that the pole is the seat of a seat of cold and desolation; it ever presents itself to my imagination as the region of beauty and delight"'}]
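
The call `pipe(raw_text)` relies on the model's default generation settings. As a hedged sketch, the summarization pipeline also accepts `max_length` and `min_length` keyword arguments, which are measured in tokens rather than characters. The helper name `summarize_with_limits` and the default values below are illustrative assumptions, not part of the original notebook.

```python
def summarize_with_limits(summarizer, text, max_length=80, min_length=20):
    """Call a summarization pipeline with explicit length bounds (in tokens).

    `summarizer` is any callable that behaves like a Hugging Face
    summarization pipeline: it returns a list of dicts with a
    'summary_text' key. The default bounds are arbitrary examples.
    """
    if min_length >= max_length:
        raise ValueError("min_length must be smaller than max_length")
    # do_sample=False asks for deterministic decoding, so repeated calls agree
    result = summarizer(
        text,
        max_length=max_length,
        min_length=min_length,
        do_sample=False,
    )
    return result[0]["summary_text"]

# In this notebook the call would look like:
# short_summary = summarize_with_limits(pipe, raw_text, max_length=60, min_length=15)
```

Shorter bounds force the model to be more selective, which can help or hurt depending on the text.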


In [21]:
len(summary[0]['summary_text'])

382

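The original raw text has 2200 characters and the summary has 382, so the text was reduced to about 17% of its original length. But is the summary good? It mostly quotes two sentences from the start of the passage, and it contains errors: the summarizer confused the narrator with the recipient of the letter (Margaret), and it garbled "the seat of frost and desolation" into "the seat of a seat of cold and desolation". In all fairness, the passage alone does not give the summarizer enough context to identify the narrator.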
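
The 17% figure is simply the summary length divided by the original length. A minimal sketch of that arithmetic follows; the function name `compression_ratio` is a hypothetical helper, and the stand-in strings only reproduce the two character counts reported above.

```python
def compression_ratio(original: str, summary: str) -> float:
    """Return len(summary) / len(original), measured in characters."""
    if not original:
        raise ValueError("original text must not be empty")
    return len(summary) / len(original)

# Stand-in strings with the same lengths as the texts in this notebook
original_stub = "x" * 2200
summary_stub = "y" * 382

ratio = compression_ratio(original_stub, summary_stub)
print(f"{ratio:.1%}")  # 17.4%
```

In the notebook itself, the equivalent call would be `compression_ratio(raw_text, summary[0]['summary_text'])`.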

### How to use the model and tokenizer

The following code loads the tokenizer and the model explicitly and then passes both to the pipeline. This produces the same result as above, but gives a little more insight into the Transformers API.

In [22]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-cnn-12-6")
model = AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-cnn-12-6")



In [25]:
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)
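
Even with an explicit tokenizer and model, the pipeline still hides three steps: tokenizing, generating, and decoding. The sketch below shows roughly what those steps look like when called by hand. The function name `summarize_manually` and the 1024-token input limit are assumptions for illustration, and the output may differ slightly from the pipeline's because the pipeline applies its own defaults and post-processing.

```python
def summarize_manually(text, tokenizer, model, max_input_tokens=1024, **generate_kwargs):
    """Tokenize, generate, and decode without using pipeline().

    `tokenizer` and `model` are expected to behave like the objects returned
    by AutoTokenizer.from_pretrained and AutoModelForSeq2SeqLM.from_pretrained.
    """
    # 1. Tokenize: turn the string into token-id tensors, truncating long inputs
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_input_tokens,
    )
    # 2. Generate: the seq2seq model produces the token ids of the summary
    output_ids = model.generate(**inputs, **generate_kwargs)
    # 3. Decode: convert the first generated sequence back into a string
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# In this notebook the call would look like:
# manual_summary = summarize_manually(raw_text, tokenizer, model)
```

Passing the tokenizer and model as arguments keeps the function independent of any particular checkpoint.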


In [26]:
summary2 = summarizer(raw_text)

In [27]:
print(summary2)

[{'summary_text': ' Margaret, Margaret, writes: "I am already far north of London, and as I walk in the streets of St.Petersburgh, I feel a cold northern breeze play upon my cheeks, which…braces my nerves and fills me with delight" "I try in vain to be persuaded that the pole is the seat of a seat of cold and desolation; it ever presents itself to my imagination as the region of beauty and delight"'}]


In [28]:
len(summary2[0]['summary_text'])

382
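
Both summaries are 382 characters long, but equal length does not by itself prove equal text. A direct comparison settles it; the sketch below uses a hypothetical helper `summaries_match` and stand-in data shaped like the pipeline output.

```python
def summaries_match(first, second):
    """Compare the 'summary_text' of two pipeline outputs (lists of dicts)."""
    return first[0]["summary_text"] == second[0]["summary_text"]

# Stand-in outputs with the same structure the pipeline returns
out_a = [{"summary_text": "same text"}]
out_b = [{"summary_text": "same text"}]
out_c = [{"summary_text": "different"}]

print(summaries_match(out_a, out_b))  # True
print(summaries_match(out_a, out_c))  # False
```

In the notebook itself, the equivalent call would be `summaries_match(summary, summary2)`.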