### Transformer

In [1]:

from transformers import BertModel
from transformers import pipeline

### Search semantics

The Hugging Face summarization task page lists models that support summarization. In this section, we will use the following resources:

- data: We will work with the `xsum` dataset, a collection of BBC articles and their corresponding summaries. This dataset serves as the foundation for our tasks.

- model: Our chosen model is `t5-small`, which has 60 million parameters (equivalent to 242MB for PyTorch). T5 is an encoder-decoder model developed by Google that supports various tasks, including summarization, translation, question answering, and text classification.

In [1]:
from datasets import load_dataset
import pandas as pd

xsum_dataset = load_dataset(
    "xsum", version="1.2.0"
) 

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


This dataset provides 3 columns:

- document: the BBC article text
- summary: a "ground-truth" summary. Note how subjective this "ground truth" is: is it the same summary you would write? This is a great example of how many LLM applications do not have obvious "right" answers.
- id: article ID

In [3]:
# Taking a sample of 1000 rows
xsum_sample = xsum_dataset["train"].select(range(1000)).to_pandas()

for idx, row in xsum_sample.sample(2).iterrows():
    print("")
    print(f"Document: {row['document']}")
    print(f"Summary: {row['summary']}")


Document: Powys council's cabinet said the loss of £1.6m over the next three years has affected the number of classes it could afford and it was launching a review.
It will look at whether its sixth forms are financially viable and educationally sustainable.
The Welsh government said it is working with Powys to minimise the impact of cuts to learning.
The council launched a similar review three years ago but eventually decided not to shut any sixth forms in the county.
Since then, the council has backed the takeover of the struggling John Beddoes School in Presteigne by Newtown High School, meaning the number of sixth forms in the county will drop from 13 to 12 from April.
Council cabinet member for learning Myfanwy Alexander said: "Changes to the way post-16 funding is delivered and a decline in pupil numbers have had a severe impact on Powys sixth forms.
"Learner choice will be hit hard and the sustainability of Powys sixth forms will be seriously affected."
In September 2012, the c

In [8]:
# Combining 'document' and 'summary' columns
xsum_sample["combined"] = (
    "Document: " + xsum_sample.document.str.strip() + "; Summary: " + xsum_sample.summary.str.strip()
)

'Document: The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed.\nRepair work is ongoing in Hawick and many roads in Peeblesshire remain badly affected by standing water.\nTrains on the west coast mainline face disruption due to damage at the Lamington Viaduct.\nMany businesses and householders were affected by flooding in Newton Stewart after the River Cree overflowed into the town.\nFirst Minister Nicola Sturgeon visited the area to inspect the damage.\nThe waters breached a retaining wall, flooding many commercial properties on Victoria Street - the main shopping thoroughfare.\nJeanette Tate, who owns the Cinnamon Cafe which was badly affected, said she could not fault the multi-agency response once the flood hit.\nHowever, she said more preventative work could have been carried out to ensure the retaining wall did not fail.\n"It is difficult but I do think there is so much publicity for Dumfries and the Nith - and I totally appreciate 

In [9]:
from sentence_transformers import SentenceTransformer

# Encode the combined text into dense vectors (768 dimensions for this model)
encoder = SentenceTransformer("paraphrase-mpnet-base-v2")
encoded_data = encoder.encode(xsum_sample["combined"])

In [12]:
import faiss
import numpy as np

index = faiss.IndexIDMap(faiss.IndexFlatIP(768))
faiss.normalize_L2(encoded_data)
index.add_with_ids(encoded_data, np.arange(len(encoded_data)))
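
The cell above pairs `faiss.normalize_L2` with an inner-product index. The reason is that the inner product of two unit-length vectors equals their cosine similarity, so the scores returned by the search can be read as cosine similarities. The following is a minimal pure-Python sketch of that identity; the helper functions and the 3-dimensional toy vectors are illustrative assumptions, not part of FAISS or of the notebook's data.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (the per-row effect of L2 normalization)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def inner_product(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

# Toy "embeddings" (hypothetical values, not real model output).
doc = [3.0, 4.0, 0.0]
query = [1.0, 2.0, 2.0]

ip_normalized = inner_product(l2_normalize(doc), l2_normalize(query))
cos = cosine_similarity(doc, query)
print(ip_normalized, cos)  # the two values agree up to floating-point error
```

Note that `faiss.normalize_L2` modifies its argument in place, so `encoded_data` holds unit-length vectors after the cell runs.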

In [13]:
search_text = "harry potter"

In [18]:
search_vector = encoder.encode(search_text)
_vector = np.array([search_vector])
faiss.normalize_L2(_vector)

k = 1
distances, ann = index.search(_vector, k=k)
results = pd.DataFrame({'distances': distances[0], 'ann': ann[0]})

In [19]:
distances

array([[0.59804404]], dtype=float32)

In [None]:
xsum_sample["summary"][results['ann'][0]]

In [None]:
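# k=1 returns a single neighbour, so position 0 is the only valid index here.
# Show the matching article instead; raise k above to inspect further results.
xsum_sample["document"][results['ann'][0]]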

### Summarization

In [33]:
summarizer = pipeline(
    task="summarization",
    model="t5-small",
    min_length=20,
    max_length=60,
    truncation=True,
) 

In [34]:
xsum_sample["summary"][0]

'Clean-up operations are continuing across the Scottish Borders and Dumfries and Galloway after flooding caused by Storm Frank.'

In [37]:
output = summarizer(xsum_sample["document"][0])

In [38]:
output[0]['summary_text']

'the full cost of damage in Newton Stewart is still being assessed . many roads in peeblesshire remain badly affected by standing water . the water breached a retaining wall, flooding many commercial properties .'

In [42]:
# Apply to a batch of articles
def summarize(text):
    # Run the pipeline on one article and return only the summary string
    output = summarizer(text)
    return output[0]['summary_text']

xsum_sample["document"].sample(10).apply(summarize)

471    bale and Ronaldo meet in a semi-final at a maj...
882    police responded to reports of cars being driv...
87     the reef is experiencing its worst coral bleac...
101    the 77th Brigade will be formally created in A...
768    president's long-time rival is on trial for wa...
364    universities Wales will publish its manifesto ...
880    the 14 men, from across England and Northern I...
91     cancer research UK found 39% of Scots consumed...
967    the body parts and tissue samples were retaine...
225    the independent police Complaints Commission h...
Name: document, dtype: object

### Search and sampling in inference

You may see parameters like `num_beams`, `do_sample`, etc. specified in Hugging Face pipelines.  These are inference configurations.

LLMs work by predicting (generating) the next token, then the next, and so on.  The goal is to generate a high probability sequence of tokens, which is essentially a search through the (enormous) space of potential sequences.

To do this search, LLMs use one of two main methods:
* **Search**: Given the tokens generated so far, pick the next most likely token in a "search."
   * **Greedy search** (default): Pick the single next most likely token in a greedy search.
   * **Beam search**: Greedy search can be extended via beam search, which searches down several sequence paths, via the parameter `num_beams`.
* **Sampling**: Given the tokens generated so far, pick the next token by sampling from the predicted distribution of tokens.
   * **Top-K sampling**: The parameter `top_k` modifies sampling by limiting it to the `k` most likely tokens.
   * **Top-p sampling**: The parameter `top_p` modifies sampling by limiting it to the most likely tokens up to probability mass `p`.

You can toggle between search and sampling via parameter `do_sample`.

For more background on search and sampling, see [this Hugging Face blog post](https://huggingface.co/blog/how-to-generate).
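
To make the sampling options more concrete, here is a small sketch of greedy selection, top-K filtering, and top-p filtering applied to a single next-token distribution. It is a simplified illustration under stated assumptions: the token probabilities and the helper functions `top_k_filter` and `top_p_filter` are hypothetical, and this is not how the Hugging Face library implements these options internally.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    """Keep the most likely tokens until their cumulative mass reaches p, then renormalize."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

# Hypothetical next-token distribution (illustrative values only).
next_token_probs = {"flood": 0.5, "storm": 0.25, "rain": 0.125, "sun": 0.125}

# Greedy search: always take the single most likely token.
greedy_choice = max(next_token_probs, key=next_token_probs.get)

# Sampling restricted to the 2 most likely tokens.
top2 = top_k_filter(next_token_probs, k=2)

# Sampling restricted to the smallest set covering 80% of the probability mass.
nucleus = top_p_filter(next_token_probs, p=0.8)

# Draw one token from the filtered distribution (seeded for repeatability).
rng = random.Random(0)
sampled = rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]

print(greedy_choice, top2, nucleus, sampled)
```

Smaller values of `k` or `p` shrink the candidate set, which moves sampling closer to greedy search.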

We will illustrate these various options below using our summarization pipeline.

In [23]:
# We can instead do a beam search by specifying num_beams.
summarizer(xsum_sample["document"][0], num_beams=10)

[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in peeblesshire remain badly affected by standing water . the water breached a retaining wall, flooding many commercial properties .'}]
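
The summary above happens to match the greedy output, but beam search can find sequences that greedy search misses. The toy sketch below shows how: it keeps `num_beams` partial sequences at each step instead of one. The `TOY_MODEL` table and the `beam_search` function are hypothetical and exist only for illustration; they are not the pipeline's implementation.

```python
import math

# Hypothetical toy "language model": next-token probabilities for each prefix.
TOY_MODEL = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.4, "dog": 0.3, "end": 0.3},
    ("a",): {"storm": 0.9, "cat": 0.1},
}

def beam_search(model, num_beams, steps):
    """Return the best sequence found and its log-probability."""
    beams = [((), 0.0)]  # (token prefix, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, prob in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        # Keep only the num_beams highest-scoring partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams[0]

# num_beams=1 reduces to greedy search.
greedy_seq, greedy_lp = beam_search(TOY_MODEL, num_beams=1, steps=2)
beam_seq, beam_lp = beam_search(TOY_MODEL, num_beams=2, steps=2)

print(greedy_seq, math.exp(greedy_lp))
print(beam_seq, math.exp(beam_lp))
```

In this toy model, greedy search commits to the likelier first token and ends with a lower overall sequence probability, while two beams keep the alternative first token alive long enough to find the better sequence.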

In [25]:
# Alternatively, we could use sampling.
summarizer(xsum_sample["document"][0], do_sample=True)

[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in peeblesshire remain badly affected by standing water . the waters breached a retaining wall, flooding many commercial properties .'}]

In [26]:
# We can modify sampling to be more greedy by limiting sampling to the top_k or top_p most likely next tokens.
summarizer(xsum_sample["document"][0], do_sample=True, top_k=10, top_p=0.8)

[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in peeblesshire remain badly affected by standing water . the water breached a retaining wall, flooding many commercial properties .'}]