### Transformer

In [1]:
from datasets import load_dataset
from transformers import BertModel
from transformers import pipeline
import pandas as pd
import numpy as np

### Search semantics

The Hugging Face summarization task page lists models that support summarization. In this section, we will use the following resources:

- data: We will work with the `xsum` dataset, a collection of BBC articles and their corresponding summaries. This dataset is the foundation for the tasks below.

- model: We use `t5-small`, which has 60 million parameters (242MB for PyTorch). T5 is an encoder-decoder model developed by Google that supports a range of tasks, including summarization, translation, question answering, and text classification.

In [2]:
xsum_dataset = load_dataset(
    "xsum", version="1.2.0"
)  

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


This dataset provides 3 columns:

- document: the BBC article text
- summary: a "ground-truth" summary. Note how subjective this "ground truth" is. Is this the same summary you would write? This is a great example of how many LLM applications do not have obvious "right" answers.
- id: article ID

In [3]:
# Taking a sample of the first 1000 rows
xsum_sample = xsum_dataset["train"].select(range(1000)).to_pandas()

# Combining 'document' and 'summary' columns
xsum_sample["combined"] = (
    "Document: " + xsum_sample.document.str.strip() + "; Summary: " + xsum_sample.summary.str.strip()
)

In [4]:
from sentence_transformers import SentenceTransformer

# Encode the combined text into dense embedding vectors
encoder = SentenceTransformer("paraphrase-mpnet-base-v2")
encoded_data = encoder.encode(xsum_sample["combined"])

In [5]:
encoded_data

array([[-0.12973613, -0.07995621, -0.02103525, ...,  0.01458147,
        -0.04181118,  0.05969834],
       [-0.10183043, -0.00813398,  0.01535375, ...,  0.03995895,
        -0.10245819,  0.08624592],
       [-0.06544365, -0.22466174,  0.01042669, ...,  0.06865789,
         0.0731439 ,  0.01244215],
       ...,
       [ 0.04156043,  0.15200093,  0.04194619, ...,  0.05652885,
         0.0718336 , -0.05565466],
       [-0.06733167,  0.10981705, -0.07706451, ...,  0.05933914,
        -0.0320871 , -0.04915017],
       [-0.1334438 , -0.14712858,  0.00372515, ...,  0.01369817,
         0.0259726 , -0.09492668]], dtype=float32)

In [6]:
import faiss

In [7]:
index = faiss.IndexIDMap(faiss.IndexFlatIP(768))

In [8]:
faiss.normalize_L2(encoded_data)
index.add_with_ids(encoded_data, np.arange(len(encoded_data)))

In [9]:
search_text = "harry potter"

In [None]:
search_vector = encoder.encode(search_text)
_vector = np.array([search_vector])
faiss.normalize_L2(_vector)

k = 2
distances, ann = index.search(_vector, k=k)
results = pd.DataFrame({'distances': distances[0], 'ann': ann[0]})

In [None]:
distances

In [None]:
xsum_sample["summary"][results['ann'][0]]

In [None]:
xsum_sample["summary"][results['ann'][1]]
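
Because the embeddings are L2-normalized before being added to an inner-product index (`IndexFlatIP`), the scores returned in `distances` are cosine similarities, so larger values mean closer matches. The following minimal sketch reproduces that idea with plain NumPy on made-up 3-dimensional vectors. The helper names `l2_normalize` and `top_k_inner_product` are hypothetical and are not part of FAISS; the toy vectors are illustrative, not real encoder output.

```python
import numpy as np

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Return a copy of `vectors` with each row scaled to unit L2 norm."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms

def top_k_inner_product(query: np.ndarray, corpus: np.ndarray, k: int):
    """Brute-force search: scores and row ids of the k largest inner products."""
    scores = corpus @ query
    order = np.argsort(-scores)[:k]
    return scores[order], order

# Toy "embeddings" (made-up values, assumed for illustration only).
corpus = np.array(
    [[1.0, 0.0, 0.0],
     [0.0, 2.0, 1.0],
     [3.0, 3.0, 0.0]],
    dtype=np.float32,
)
query = np.array([[2.0, 2.0, 0.0]], dtype=np.float32)

corpus_unit = l2_normalize(corpus)
query_unit = l2_normalize(query)[0]

# Inner product of unit vectors...
scores, ids = top_k_inner_product(query_unit, corpus_unit, k=2)

# ...matches cosine similarity computed directly on the raw vectors.
cosine = (corpus @ query[0]) / (
    np.linalg.norm(corpus, axis=1) * np.linalg.norm(query[0])
)
print(ids, scores)
```

Row 2 points in the same direction as the query, so it is ranked first with a score close to 1 even though its raw magnitude differs.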

### Summarization

In [33]:
summarizer = pipeline(
    task="summarization",
    model="t5-small",
    min_length=20,
    max_length=60,
    truncation=True,
) 

In [34]:
xsum_sample["summary"][0]

'Clean-up operations are continuing across the Scottish Borders and Dumfries and Galloway after flooding caused by Storm Frank.'

In [37]:
output = summarizer(xsum_sample["document"][0])

In [38]:
output[0]['summary_text']

'the full cost of damage in Newton Stewart is still being assessed . many roads in peeblesshire remain badly affected by standing water . the water breached a retaining wall, flooding many commercial properties .'

In [42]:
# Apply the summarizer to a random sample of 10 articles
def summarize(text):
    # The pipeline returns a list with one dict per input text
    output = summarizer(text)
    return output[0]['summary_text']

xsum_sample["document"].sample(10).apply(summarize)

471    bale and Ronaldo meet in a semi-final at a maj...
882    police responded to reports of cars being driv...
87     the reef is experiencing its worst coral bleac...
101    the 77th Brigade will be formally created in A...
768    president's long-time rival is on trial for wa...
364    universities Wales will publish its manifesto ...
880    the 14 men, from across England and Northern I...
91     cancer research UK found 39% of Scots consumed...
967    the body parts and tissue samples were retaine...
225    the independent police Complaints Commission h...
Name: document, dtype: object

### Search and sampling in inference

You may see parameters like `num_beams`, `do_sample`, etc. specified in Hugging Face pipelines. These are inference configurations.

LLMs work by predicting (generating) the next token, then the next, and so on. The goal is to generate a high-probability sequence of tokens, which is essentially a search through the (enormous) space of potential sequences.

To do this search, LLMs use one of two main methods:
* **Search**: Given the tokens generated so far, deterministically pick the next token based on its predicted probability.
   * **Greedy search** (default): At each step, pick the single most likely next token.
   * **Beam search**: Extends greedy search by tracking several candidate sequences in parallel. The parameter `num_beams` sets how many.
* **Sampling**: Given the tokens generated so far, pick the next token by sampling from the predicted distribution of tokens.
   * **Top-K sampling**: The parameter `top_k` modifies sampling by limiting it to the `k` most likely tokens.
   * **Top-p sampling**: The parameter `top_p` modifies sampling by limiting it to the smallest set of most likely tokens whose cumulative probability reaches `p`.

You can toggle between search and sampling via parameter `do_sample`.
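
To make the difference between these options concrete, here is a small sketch on a made-up next-token distribution over five tokens. It is a simplified illustration, not the Hugging Face implementation: the functions `top_k_filter` and `top_p_filter`, the vocabulary, and the probabilities are all assumptions chosen for the example.

```python
import numpy as np

def top_k_filter(probs: np.ndarray, k: int) -> np.ndarray:
    """Keep the k most likely tokens, zero out the rest, and renormalize."""
    keep = np.argsort(-probs)[:k]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def top_p_filter(probs: np.ndarray, p: float) -> np.ndarray:
    """Keep the smallest set of most likely tokens whose total mass reaches p."""
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    # Number of tokens needed for the cumulative mass to reach p.
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# Hypothetical next-token distribution (values invented for illustration).
vocab = ["the", "water", "flood", "storm", "cat"]
probs = np.array([0.5, 0.25, 0.15, 0.07, 0.03])

# Greedy search: always take the single most likely token.
greedy_token = vocab[int(np.argmax(probs))]

# Top-K and top-p restrict the distribution before sampling from it.
top_k_probs = top_k_filter(probs, k=2)
top_p_probs = top_p_filter(probs, p=0.8)

# Sampling: draw the next token at random from the restricted distribution.
rng = np.random.default_rng(0)
sampled_token = rng.choice(vocab, p=top_k_probs)

print(greedy_token, top_k_probs, top_p_probs, sampled_token)
```

With `k=2` only "the" and "water" can be sampled; with `p=0.8` the first three tokens are kept because the top two cover only 0.75 of the probability mass.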

For more background on search and sampling, see [this Hugging Face blog post](https://huggingface.co/blog/how-to-generate).

We will illustrate these various options below using our summarization pipeline.

In [23]:
# We can instead do a beam search by specifying num_beams.
summarizer(xsum_sample["document"][0], num_beams=10)

[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in peeblesshire remain badly affected by standing water . the water breached a retaining wall, flooding many commercial properties .'}]
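
Here the beam search output happens to match the greedy output, but the two can differ. The toy sketch below shows why: the table `NEXT` is a hypothetical model in which the next-token probabilities depend only on the previous token, and `beam_search` is an illustrative function written for this example, not the algorithm as implemented in `transformers`. Calling it with `num_beams=1` is equivalent to greedy search.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
NEXT = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.55, "y": 0.45},
    "B": {"z": 0.9, "w": 0.1},
}

def beam_search(num_beams: int, steps: int):
    """Return the surviving beams as (log-probability, tokens), best first."""
    beams = [(0.0, ["<s>"])]
    for _ in range(steps):
        candidates = []
        for logp, tokens in beams:
            # Extend every surviving sequence by every possible next token.
            for token, prob in NEXT[tokens[-1]].items():
                candidates.append((logp + math.log(prob), tokens + [token]))
        # Keep only the num_beams highest-scoring sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:num_beams]
    return beams

greedy_logp, greedy_tokens = beam_search(num_beams=1, steps=2)[0]
beam_logp, beam_tokens = beam_search(num_beams=2, steps=2)[0]

print(greedy_tokens, math.exp(greedy_logp))
print(beam_tokens, math.exp(beam_logp))
```

Greedy search commits to "A" because it is the most likely first token, and ends with a sequence of probability 0.33. With two beams, the path through "B" stays alive and wins overall with probability 0.36.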

In [25]:
# Alternatively, we could use sampling.
summarizer(xsum_sample["document"][0], do_sample=True)

[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in peeblesshire remain badly affected by standing water . the waters breached a retaining wall, flooding many commercial properties .'}]

In [26]:
# We can make sampling more conservative by limiting it to the top_k and/or top_p most likely next tokens.
summarizer(xsum_sample["document"][0], do_sample=True, top_k=10, top_p=0.8)

[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in peeblesshire remain badly affected by standing water . the water breached a retaining wall, flooding many commercial properties .'}]