<div class="alert alert-info">
    
➡️ Before you start, make sure that you are familiar with the **[study guide](https://liu-nlp.ai/text-mining/logistics/)**, in particular the rules around **cheating and plagiarism** (found in the course memo).

➡️ If you use code from external sources (e.g. StackOverflow, ChatGPT, ...) as part of your solutions, don't forget to add a reference to these source(s) (for example as a comment above your code).

➡️ Make sure you fill in all cells that say **`YOUR CODE HERE`** or **YOUR ANSWER HERE**.  You normally shouldn't need to modify any of the other cells.

</div>

# L5: Text Summarization

In this lab, you will use large language models (LLMs) to generate summaries of (short) news articles.  In the first part, you will use an encoder language model to perform extractive text summarization, while in the second part, you will use a decoder LLM to perform abstractive text summarization.

**_A technical note:_** This lab is made to work without requiring a GPU.  If you have access to a GPU, you will be able to use larger (and better) models than the ones used by default in this notebook, and inference time should be much faster.  In order for this to work, you need to make sure that you have CUDA installed, that [PyTorch is installed for the correct CUDA version](https://pytorch.org/get-started/locally/), and also that you have installed the GPU-enabled version of [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) (which is the default, but not the one used in the lab environment).

In [None]:
# Define some helper functions that are used in this notebook

%matplotlib inline
from IPython.display import display, HTML

def success():
    display(HTML('<div class="alert alert-success"><strong>Checks have passed!</strong></div>'))

## Dataset 1: DBpedia

We have prepared a data set containing a small sample of sentences from DBpedia, each containing the word _record_, but describing different types of entities: either a company, a (music) album, or an athlete.

In [None]:
import pandas as pd
import numpy as np
import bz2

with bz2.open("dbpedia_record_sample.json.bz2", "rt", encoding="utf-8") as f:
    df = pd.read_json(f)

There are two labelled columns in the data set: `sentence` (the sentence from DBpedia), and `label` (the category of the DBpedia entry where the sentence is from).  The `label` column can take three values: Company, Album, or Athlete. 

In [None]:
pd.set_option('display.max_colwidth', None)
df.head()

## Introduction (_or_ Problem 0): Sentence embeddings

Neural language models can be used to produce _embeddings_ of words, sentences, or entire documents, i.e., representations of a string of text in a high-dimensional vector space.

In this lab, we are interested in _sentence embeddings_. The library [SentenceTransformers](https://www.sbert.net/) is specifically made for embedding sentences with neural, transformer-based models.  We start by importing this library and loading a small, pre-trained language model.

**_Note:_** The first time you run this, the model will automatically be downloaded onto your machine (ca. 90 MB).

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")

It’s very easy to obtain sentence embeddings from a loaded model:

In [None]:
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
model.encode(sentences)

As you can see, the result of calling `model.encode()` is an array of vectors of floating-point numbers.  Let’s check the _dimensionality_ of these embeddings:

In [None]:
embeddings = model.encode(sentences)
embeddings.shape

Your first task is to simply **encode all sentences from the DBpedia dataset** using this embedding model, and also produce a list of the corresponding _labels_ for later use.

In [None]:
embeddings = ...
labels = []       # labels should be a list so that labels[i] corresponds to the sentence encoded by embeddings[i]

# YOUR CODE HERE
raise NotImplementedError()

## Problem 1: Visualisation with t-SNE

A downside of embeddings like these is that a 384-dimensional embedding space is hard to visualise or interpret.  We need a _dimensionality reduction_ technique like [t-SNE](https://scikit-learn.org/1.5/modules/manifold.html#t-sne) to produce a visualization of vectors in such a space.  This is conveniently implemented by scikit-learn, so we first import the relevant classes:

In [None]:
from sklearn.manifold import TSNE
import seaborn as sns
sns.set(style="white")

Your task is to **visualise the sentence embeddings from the DBpedia dataset** by using t-SNE to map them into a two-dimensional space, then plot the resulting points as a scatterplot, using the category labels for the _color_ of each point.  This gives you a way to visualise the vectors with respect to the category labels.

Complete the function in the cell below, where we have already instantiated the TSNE class.  You can use either plain Matplotlib or Seaborn for the plot.

In [None]:
def plot_tsne(vectors, labels, perplexity=30.0, max_iter=1000):
    """Compute and plot a t-SNE reduction of the given vectors.
    
    Arguments:
        vectors: A list of embedding vectors.
        labels: A list of class labels; must have the same length as `vectors`.
        perplexity (float): A hyperparameter of the t-SNE algorithm; recommended values
            are between 5 and 50, and can result in significantly different results.
        max_iter (int): A hyperparameter of the t-SNE algorithm, controlling the maximum
            number of iterations of the optimization algorithm.

    Returns:
        Nothing, but shows the plot.
    """
    tsne = TSNE(verbose=True, perplexity=perplexity, max_iter=max_iter)
    # YOUR CODE HERE
    raise NotImplementedError()

#### 🤞 Test your code

If your implementation is correct, you should be able to produce the plot by running the following cell:

In [None]:
plot_tsne(embeddings, labels, perplexity=30.0, max_iter=1000)

Afterwards, **re-run the visualisation** a few times with different values for the `perplexity` and `max_iter` parameters, and observe if and how the resulting visualisation changes.  Take a moment to consider how you would interpret the results; you will need this for the reflection part.

## Dataset 2: CNN/DailyMail

In the remainder of this lab, we will look specifically at the task of text summarization. For this, we are using a subset of the CNN/DailyMail&nbsp;3.0.0 data set, a popular dataset for summarization composed of news articles.

In [None]:
with bz2.open("cnn_dailymail_3.0.0_shorts.json.bz2", "rt", encoding="utf-8") as f:
    news_df = pd.read_json(f)

There are two labelled columns in the data set: `article` (the full news article), and `highlights` (highlights from the article, which we will treat as a “reference summary”).

The entire dataset contains 5,000 news articles, but since some of the techniques we will explore here can be quite compute-intensive (depending on the hardware used to run this lab), we limit ourselves to a hand-picked selection of ten articles for some of the exercises:

In [None]:
indices = [6, 53, 56, 340, 730, 1940, 1983, 2404, 2826, 4673]
short_news_df = news_df.iloc[indices]
short_news_df

## Problem 2: Extractive summarization

In this problem, we will produce a summary of a news article by extracting a small number of sentences from it.

### Task 2.1: Extracting sentence embeddings

In the data set, each news article is given as a single string.  Your first task is to **split the text up into sentences** and run them through the sentence embedding model from the Introduction.  You can perform the sentence splitting with the help of spaCy:

In [None]:
import spacy
nlp = spacy.load("en_core_web_md")

In [None]:
def get_sentences_and_embeddings(text):
    """Splits the text into sentences and computes embeddings for each of them.

    Arguments:
        text: The text to process, e.g. one entire news article.

    Returns:
        A tuple (sentences, sentence_vectors).  `sentences` should be a list of 
        sentences from the text, while `sentence_vectors` should be a list of
        embedding vectors corresponding to these sentences.
    """
    # YOUR CODE HERE
    raise NotImplementedError()

#### 🤞 Test your code

The following cell shows how your function should be called, and sanity-checks the returned value for one of the news articles:

In [None]:
sents, vecs = get_sentences_and_embeddings(short_news_df.iloc[2]["article"])

# Check if the article appears to be split up correctly
assert len(sents) == len(vecs) == 13, "The news article should produce 13 sentences and 13 vectors"
assert sents[4] == 'The Disney Wonder is registered in the Bahamas.', "The fifth sentence in the article should be 'The Disney Wonder is registered in the Bahamas.'"
success()

### Task 2.2: Extractive summarization with MMR

You now have all the necessary inputs to produce an extractive summary of a news article. Write a function that takes a list of sentences and their corresponding embedding vectors, as well as a number of sentences to extract from it. Your function should then implement the **maximum marginal relevance (MMR) algorithm** as follows:

1. Initially, your candidate set $C$ contains all sentences in the news article, and the set of selected sentences $S$ is empty.
2. As the “profile vector” $p$, use the **centroid** of all the sentence vectors from the news article.
3. Pick the next sentence to extract from $C \setminus S$ using the marginal relevance formula:

$$
s_i = \operatorname*{arg\,max}_{s \in C \setminus S} ~\left( \textrm{sim}(s, p) - \max_{s_j \in S} ~\textrm{sim}(s, s_j) \right)
$$

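In this formula, “sim” is the plain **cosine similarity** between the vectors.  We recommend you use the [`cosine_similarity` function from scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) for this purpose.  While $S$ is still empty, the second term has nothing to maximise over; since it would be the same for every candidate anyway, you can treat it as zero, so the first sentence picked is simply the one most similar to $p$.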

4. Repeat step 3 until you have extracted $n$ sentences, where $n$ is given as an argument to your function.
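The selection loop in steps 1–4 can be illustrated on toy data. The following is a minimal sketch in plain Python, not the reference solution: the helper names `cosine` and `mmr_select` and the two-dimensional toy vectors are made up for illustration, and treating the redundancy term as zero while $S$ is empty is an assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mmr_select(vectors, n):
    """Return the indices of up to n vectors, in the order MMR selects them."""
    dim = len(vectors[0])
    # Profile vector p: the centroid (component-wise mean) of all vectors
    centroid = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    selected = []
    while len(selected) < min(n, len(vectors)):
        best_index, best_score = None, -math.inf
        for i, v in enumerate(vectors):
            if i in selected:
                continue  # only consider candidates in C \ S
            relevance = cosine(v, centroid)
            # Assumption: the redundancy term is 0 while nothing is selected yet
            redundancy = max((cosine(v, vectors[j]) for j in selected), default=0.0)
            score = relevance - redundancy
            if score > best_score:
                best_index, best_score = i, score
        selected.append(best_index)
    return selected

# Hand-made toy vectors: the first two are near-duplicates of each other
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(mmr_select(toy_vectors, n=2))
```

On these toy vectors, the sketch first picks index 1 (closest to the centroid) and then index 2 rather than index 0, because index 0 is almost identical to the vector that was already selected.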

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

def summarize(sentences, vectors, n=3):
    """Produce an extractive summary from a list of sentences and their vectors.

    Arguments:
        sentences (list): A list of sentences.
        vectors (list): A list of vectors, one for each sentence.
        n (int): The number of sentences to extract for the summary.

    Returns:
        A summary of `n` extracted sentences, as a single string.
    """
    # YOUR CODE HERE
    raise NotImplementedError()

#### 🤞 Test your code

The following cell puts it all together: it extracts the sentences and embeddings for one news article, generates the extractive summary, and displays it:

In [None]:
news = short_news_df.iloc[4]
sents, vecs = get_sentences_and_embeddings(news["article"])
summary = summarize(sents, vecs, n=3)

# Show the article, highlights, and extracted summary in a table
pd.DataFrame({
    "article": [news["article"]],
    "highlights": [news["highlights"]],
    "summary": [summary]
})

If your implementation is correct, your summary should look like this:

> The Premier League has announced that next season's competition will begin on August 8. The move could allow Roy Hodgson more time to prepare his squad for Euro 2016, if England qualify . The Premier League announced the news on Twitter on Wednesday .

Try it out with different news articles to see what happens!

## Problem 3: Evaluating text summarization

In this problem, you will implement a ROUGE metric to automatically compare the extracted summaries against the “highlights” column of the data set.

### Task 3.1: Implementing ROUGE-2

Concretely, you should implement **ROUGE-2**, which is the version of the ROUGE metric based on bigram overlap between a system output and a reference.  In our case, the “system output” is the (concatenated) string of sentences from the generated summary, while the “reference” is the `highlights` column from the news data set.  The ROUGE-2 score is the F1-score computed from the **number of overlapping bigrams** compared against the **total number of bigrams** in the system output and the reference.

You will need to tokenize the inputs in order to compute the bigram overlap.  The caveat here is that different tokenizers may result in different ROUGE scores.  For the purpose of this problem, you should use spaCy to tokenize your input.  If in doubt, you can refer to [the spaCy 101 documentation on tokenization](https://spacy.io/usage/spacy-101#annotations-token).
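To make the bigram arithmetic concrete, here is a small sketch that works on token lists that have already been produced by some tokenizer. The function names `bigram_counts` and `bigram_f1` are hypothetical, the token lists are written by hand (assuming the final period is its own token), and counting repeated bigrams with clipped counts is an assumption rather than something this lab prescribes.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count the bigrams (pairs of adjacent tokens) in a list of tokens."""
    return Counter(zip(tokens, tokens[1:]))

def bigram_f1(system_tokens, reference_tokens):
    """F1-score over bigram overlap between two token lists."""
    system_bigrams = bigram_counts(system_tokens)
    reference_bigrams = bigram_counts(reference_tokens)
    # Counter intersection keeps the minimum count of each shared bigram
    # (assumption: repeated bigrams are counted with clipping)
    overlap = sum((system_bigrams & reference_bigrams).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(system_bigrams.values())
    recall = overlap / sum(reference_bigrams.values())
    return 2 * precision * recall / (precision + recall)

system_tokens = ["This", "is", "my", "summary", "."]
reference_tokens = ["This", "is", "the", "summary", "."]
print(bigram_f1(system_tokens, reference_tokens))
```

Each list has four bigrams, of which two (`This is` and `summary .`) are shared, so precision and recall are both 2/4 and the F1-score is 0.5.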


In [None]:
def rouge_2(system, reference):
    """Compute the ROUGE-2 score between a system output and a reference.
    
    Arguments:
        system (str): The system output, as a single string.
        reference (str): The reference to compare against, as a single string.

    Returns:
        The F1-score of the ROUGE-2 metric between system output and reference.
    """
    # YOUR CODE HERE
    raise NotImplementedError()

#### 🤞 Test your code

The following cell tests your implementation of ROUGE-2 with some toy examples:

In [None]:
assert rouge_2("System output.", "Reference summary.") == 0.0, "Two strings without any bigram overlap should return a score of zero"
assert rouge_2("Two identical strings.", "Two identical strings.") == 1.0, "Two identical strings should return a score of one"
assert rouge_2("This is my summary.", "This is the summary.") == 0.5, "In this example, when using spaCy’s tokenization, exactly half of the bigrams overlap, so the ROUGE-2 score should be 0.5"
success()

### Task 3.2: Baselines for extractive summarization

Now that you have an implementation of an evaluation metric, you can use it to evaluate the extractive summaries on the small set of news articles in `short_news_df`.

After running the following cell:
- `extractive` should contain the extractive summaries for all articles in `short_news_df`, one summary per article, with **two sentences per summary**.
- `extractive_rouge_2` should contain the average ROUGE-2 score for the extractive summaries, when evaluated against the "highlights" column in `short_news_df`.

In [None]:
extractive = []            # Summaries of all articles in `short_news_df`
extractive_rouge_2 = ...   # Average ROUGE-2 score
# YOUR CODE HERE
raise NotImplementedError()

We cannot say much about the ROUGE-2 score unless we compare it against a **baseline**.  A simple baseline for text summarization is to just take the first $n$&nbsp;sentences from the article.  In the following cell, you should compute the ROUGE-2 score of this baseline, i.e., taking the **first two sentences** of the article as your “summary”.

In [None]:
baseline_rouge_2 = ...     # Average ROUGE-2 score of baseline
# YOUR CODE HERE
raise NotImplementedError()

#### 🤞 Test your code

The following cell prints both ROUGE-2 scores:

In [None]:
print(f"ROUGE-2 (baseline)  : {baseline_rouge_2:.4f}")
print(f"ROUGE-2 (extractive): {extractive_rouge_2:.4f}")

## Problem 4: Prompting an instruction-tuned LLM

In order to perform _abstractive_ summarization, we will turn to an open-source, instruction-fine-tuned large language model (LLM).  Concretely, we will use a quantized version of a [Llama 3.2](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/) model, released in September 2024, that we load via the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) library.  _Quantization_ is a technique that stores a model's weights at lower numerical precision, shrinking the model so that it runs faster and on smaller hardware, at the cost of some of its performance.
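To give a feel for what quantization does to a model's weights, here is a toy sketch of uniform 4-bit quantization in plain Python. It is only an illustration: the functions `quantize` and `dequantize` and the example weights are made up, and the scheme is much simpler than the one used in the model file loaded below.

```python
def quantize(weights, bits=4):
    """Map float weights to small signed integer codes plus one shared scale."""
    levels = 2 ** (bits - 1) - 1          # largest code, e.g. 7 for 4 bits
    # Fall back to a scale of 1.0 if all weights are zero
    scale = (max(abs(w) for w in weights) / levels) or 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]

weights = [0.7, -0.3, 0.12, 0.0]          # hand-made example values
codes, scale = quantize(weights)
restored = dequantize(codes, scale)
print(codes)
print(restored)
```

The codes fit into four bits each instead of a full floating-point number, but the restored values are only approximations of the original weights; this rounding error is the price paid for the smaller model.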

The model is loaded in the following cell. Please note:

- **If you are running this on the LiU computers,** the following cell will use the model file that we already downloaded to the shared course folder.
- **If you are running this on your own computer,** the following cell will automatically download the model from Hugging Face the first time you run it, which **requires approx.&nbsp;2&nbsp;GB of disk space!**

In [None]:
import os
from llama_cpp import Llama

# Path to the model on the LiU computers
model_path = "/courses/TDDE16/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf"

if os.path.exists(model_path):
    # Running on the LiU computers
    llm = Llama(model_path, verbose=False)
else:
    # *NOT* running on the LiU computers
    llm = Llama.from_pretrained(
        repo_id="bartowski/Llama-3.2-3B-Instruct-GGUF",
        filename="Llama-3.2-3B-Instruct-Q4_K_M.gguf",   # 2GB
        verbose=False,
    )

print(f"Using model: {llm.metadata.get('general.name')}")
print(f"Stored at: {llm.model_path}")

Generating text with this model is simple — we can just call the `llm` object directly.  Note that we don’t have to (and in fact _shouldn’t_) perform any preprocessing on the text input, as this is all done within the model.

In [None]:
llm("What is the capital of Sweden?", max_tokens=64)

**Run the cell above multiple times** and see how the output changes!

You should notice two things:
1. The call to `llm()` returns more than just the generated text; the generated output is wrapped in a bunch of metadata about the model and the generation.
2. The generated text contains more than just the answer to our question, adding more or less relevant (or sensible) output after the answer itself. _(If you haven’t observed that, run the generation a few more times!)_

Let’s look at the second issue first.  Instruction-tuned models like Llama are fine-tuned with a specific _chat_ or _prompt template_ that structures the input in a pre-defined way.  For Llama models, this template might look approximately like this:

```
<|start_header_id|>system<|end_header_id|>
Answer the user query as accurately as possible.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
What is the capital of Sweden?<|eot_id|>
```

There are different, clearly separated parts in the template, corresponding to different _roles_ of the respective input:

- The **system** role contains global instructions for the LLM (e.g. how to behave, or meta information such as its knowledge cutoff date).
- The **user** role contains instructions from the user.
- The **input** role (not shown above) contains additional _input data_ that the LLM can refer to when processing the user query.
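How a list of role-tagged messages turns into a single prompt string can be sketched as follows. The function `render_prompt` is hypothetical and only reproduces the approximate template shown above; the actual template applied by the library differs in its details.

```python
def render_prompt(messages):
    """Render role-tagged messages in the approximate template shown above."""
    parts = []
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n"
            f"{message['content']}<|eot_id|>"
        )
    return "\n".join(parts)

messages = [
    {"role": "system", "content": "Answer the user query as accurately as possible."},
    {"role": "user", "content": "What is the capital of Sweden?"},
]
print(render_prompt(messages))
```

The point of the sketch is that each message keeps its role visibly separate from its content, which is what lets the model tell global instructions apart from the user's query.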

The llama-cpp-python library provides a function for automatically applying the correct template for the loaded model; it takes a list of _messages_, which are text strings with their corresponding roles.  Here’s an example of how to call it:

In [None]:
llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of Sweden?"
        },
    ],
    max_tokens=64,
)

Hopefully, the output now only contains the answer to the question, and no further chatter! You should also see that, in contrast to simply calling `llm()`, using the prompt template makes the model response appear in a message with the **assistant** role.
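Since the response is wrapped in metadata, you will usually want to pull out just the generated text. The sketch below uses a hand-written mock dictionary rather than real model output; it assumes the OpenAI-style layout (`choices`, then `message`, then `content`), which you should confirm against the output of the cell above. The helper name `extract_reply` is made up.

```python
# Hand-written stand-in for a chat completion response (not real model output)
mock_response = {
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": " The capital of Sweden is Stockholm. ",
            },
            "finish_reason": "stop",
        }
    ],
}

def extract_reply(response):
    """Return the assistant's text from a chat-completion-style dictionary."""
    # Assumption: the first entry in "choices" holds the assistant message
    return response["choices"][0]["message"]["content"].strip()

print(extract_reply(mock_response))
```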

### Task 4.1: Understanding prompt templates

Your task now is to test the behaviour of prompt templates (and your understanding of them) with the following exercise:

**Write a function that returns the capital of a country by prompting the LLM.**  In particular, your function should:

- use the "system" message to make the LLM return only the desired answer, rather than a full sentence (i.e., not _"The capital of Sweden is Stockholm."_ but just _"Stockholm"_)
- use the "input" message to supply the name of the country
- return only a text string with the generated answer

In [None]:
def get_capital_of(country):
    """Prompt an LLM for the capital of a country.
    
    Arguments:
        country (str): The name of the country.

    Returns:
        A text string containing the LLM’s response.
    """
    # YOUR CODE HERE
    raise NotImplementedError()

#### 🤞 Test your code

The following cell should return only “Stockholm”:

In [None]:
get_capital_of("Sweden")

If your code works with the input “Sweden”, **re-run the function with different inputs** to see what happens.  Does the LLM return the correct answer even for lesser-known countries?  What happens if you give it input that’s not an actual country?

In [None]:
# Example — try out your own inputs!
get_capital_of("Iceland")

## Problem 5: Abstractive summarization

We now have everything we need to perform abstractive summarization, by prompting our LLM to generate a summary based on the input text.

### Task 5.1: Prompting for summaries

In the same vein as in Task 4.1, **write a function that returns a summary of the input text** by prompting the LLM.  Note the following:

- Since we could specify the desired number of sentences for the extractive summaries (in Task&nbsp;2.2), we want to have the same functionality here, so construct your messages in a way that requests the given number of sentences from the LLM.  (Note that this is not an absolute guarantee that the LLM will always follow this instruction, but it should do so in most cases.)
- As before, construct your messages in such a way that the returned response will only contain the requested summary, without any further text (such as “Sure, here is a summary...”).

In [None]:
def llm_summarize(text, n=3):
    """Produce an abstractive summary for a given text.
    
    Arguments:
        text (str): The text to summarize.
        n (int): The number of sentences to request for the summary.

    Returns:
        A text string containing the abstractive summary produced by the LLM.
    """
    # YOUR CODE HERE
    raise NotImplementedError()

#### 🤞 Test your code

The following cell should produce a summary for one of the news articles:

In [None]:
news = short_news_df.iloc[4]
summary = llm_summarize(news["article"], n=3)

# Show the article, highlights, and generated summary in a table
pd.DataFrame({
    "article": [news["article"]],
    "highlights": [news["highlights"]],
    "summary": [summary]
})

### Task 5.2: Generate all summaries & evaluate

The only thing left to do is to generate summaries for all the articles in our (tiny) dataset, compute their ROUGE-2 scores against the "highlights" column, and compare the results against both the baseline and the extractive summaries.

After running the following cell:
- `abstractive` should contain the abstractive summaries for all articles in `short_news_df`, one summary per article.
  -  **Important:** Since we used two sentences for the extractive summarization & baselines, we should also prompt for **two-sentence summaries** here.
- `abstractive_rouge_2` should contain the average ROUGE-2 score for the abstractive summaries, when evaluated against the "highlights" column.

**_Note:_** Of all the cells in this notebook, this one will probably take the longest to run; in our testing, it took around 3–5 minutes on the lab computers.  You may want to [use a progress bar](https://rich.readthedocs.io/en/stable/progress.html) or print an update in each loop iteration to get an idea of how many articles have already been processed.
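Printing an update in each iteration needs nothing beyond the standard library. The following is one possible sketch; the helper `process_with_progress` is made up, and `str.upper` merely stands in for a slow call such as generating a summary.

```python
import time

def process_with_progress(items, process):
    """Apply `process` to each item, printing a status line after each one."""
    results = []
    start = time.perf_counter()
    for position, item in enumerate(items, start=1):
        results.append(process(item))
        elapsed = time.perf_counter() - start
        print(f"[{position}/{len(items)}] done after {elapsed:.1f}s")
    return results

# Stand-in workload; in the lab, `process` would be your summarization function
results = process_with_progress(["first article", "second article"], str.upper)
```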

In [None]:
%%time

abstractive = []             # Summaries of all articles in `short_news_df`
abstractive_rouge_2 = ...    # Average ROUGE-2 score
# YOUR CODE HERE
raise NotImplementedError()

#### 🤞 Test your code

The following cell prints all ROUGE-2 scores, and the cell after that displays a table with the highlights and generated summaries, side-by-side:

In [None]:
print(f"ROUGE-2 (baseline)   : {baseline_rouge_2:.4f}")
print(f"ROUGE-2 (extractive) : {extractive_rouge_2:.4f}")
print(f"ROUGE-2 (abstractive): {abstractive_rouge_2:.4f}")

In [None]:
pd.set_option('display.max_colwidth', None)
pd.DataFrame({
    "highlights": short_news_df.loc[:, "highlights"],
    "extractive": extractive,
    "abstractive": abstractive
})

As before, the summaries (and therefore the ROUGE scores) will most likely be different if you run the same prompt again, so we encourage you to run it at least twice before you write up your reflection!

## Individual reflection

<div class="alert alert-info">
    <strong>After you have solved the lab,</strong> write a <em>brief</em> reflection (max. one A4 page) on the question(s) below.  Remember:
    <ul>
        <li>You are encouraged to discuss this part with your lab partner, but you should each write up your reflection <strong>individually</strong>.</li>
        <li><strong>Do not put your answers in the notebook</strong>; upload them in the separate submission opportunity for the reflections on Lisam.</li>
    </ul>
</div>

1. In Problem&nbsp;1, you visualised sentence embeddings with the help of t-SNE.  How do you interpret the results you got?  Were there any patterns, and/or did the parameters of t-SNE make a difference?
2. Pick one example from the final evaluation in Task&nbsp;5.2 and discuss what qualitative differences you see between the extractive and abstractive summaries.  Do the ROUGE-2 scores you got agree with your own judgment on how “good” the summaries are?  What conclusion do you draw from this experiment?

**Congratulations on finishing this lab! 👍**

<div class="alert alert-info">
    
➡️ Before you submit, **make sure the notebook can be run from start to finish** without errors.  For this, _restart the kernel_ and _run all cells_ from top to bottom. In Jupyter Notebook version 7 or higher, you can do this via "Run$\rightarrow$Restart Kernel and Run All Cells..." in the menu (or the "⏩" button in the toolbar).

</div>