# Assignment 3: Summarization with LLMs

**Description:** This assignment covers summarization, the task of generating an abridged version of an input. With the ascendance of LLMs, we have a new way of generating summaries: rather than fine-tuning a model to generate summaries, we can simply provide explicit instructions for the summary we want the model to generate. By finishing this assignment you should also be able to develop an intuition for:


* How well summarization systems work
* The effects of hyperparameters on outcomes
* The effects of prompts on the output of an LLM
* Evaluation of output using ROUGE



This notebook must be run on Google Colab because it requires a GPU. By default, Colab will configure a GPU when you open the notebook. Summarization commands can take up to five minutes to run, depending on the hyperparameters you use.


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2025-fall-main/blob/master/assignment/a3/SummarizationLLM_test.ipynb)

The overall assignment structure is as follows:

 Setup

1. Gemma 2 for abstractive summarization




**INSTRUCTIONS:**

* Questions are always indicated as **QUESTION:**, so you can search for this string to make sure you answered all of the questions. You are expected to fill out, run, and submit this notebook, as well as to answer the questions in the **answers** file as you did in a1 and a2.

* **### YOUR CODE HERE** indicates that you are supposed to write code.

* In order to complete the assignment with the Gemma model you will need to get an account on [Hugging Face](https://huggingface.co). It is free. Once you have the account, you will need to create an Access Token. Go to Access Token under your profile and generate a token with write permissions for Colab. Copy that token and add it to the secrets in your Colab account, using `HF_TOKEN` as the name and your access token string as the value.

* In addition, you will need to visit the [Model Card for the Gemma 2 model](https://huggingface.co/google/gemma-2-9b-it). At the top of the page you will see a notice saying you need to request permission to use the model. While logged in to your Hugging Face account, click the button to request permission. It can sometimes take 10 to 15 minutes to get approved. Once you are approved, the message on the Model Card will change to indicate you have been granted access to the model.
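Before loading the model, it can help to confirm that the notebook can actually see your token. The sketch below is one way to do that. `google.colab.userdata.get` is the Colab call for reading a secret (it raises an error in Colab if the secret is missing or notebook access is not enabled). The helper name `get_hf_token` and the environment-variable fallback for runs outside Colab are assumptions added for illustration, not part of the assignment.

```python
import os


def get_hf_token(name="HF_TOKEN"):
    """Return the Hugging Face token from Colab secrets, or from the environment outside Colab."""
    try:
        # This module is only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        # Assumed fallback for local runs: an environment variable with the same name.
        return os.environ.get(name)
    return userdata.get(name)


token = get_hf_token()
print("Token found" if token else "Token missing: add HF_TOKEN to your Colab secrets")
```

The token value itself is never printed, which keeps it out of the saved notebook output.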


## Setup

In [1]:
!pip install -q -U transformers
!pip install -q -U accelerate
!pip install -q -U bitsandbytes
!pip install -q -U flash_attn
!pip install -q -U datasets==3.6.0

In [2]:
#help track which versions of libraries we're using
!pip list | grep transformers
!pip list | grep accelerate
!pip list | grep bitsandbytes
!pip list | grep datasets

In [3]:
import datasets
from transformers import pipeline, BitsAndBytesConfig
import bitsandbytes as bnb
import torch
import random
import pandas as pd
from tqdm import tqdm


In [4]:
!pip install -q evaluate
import evaluate

In [5]:
!pip install -q rouge_score

In [6]:
#let's make longer output readable without horizontal scrolling
from pprint import pprint

Now let's get the data we're going to use.

In [7]:
#import datasets
#import random
#import torch
from datasets import load_dataset

def load_and_sample_dataset(num_samples=11):
    """
    Load and sample records from the X-Sum dataset
    """
    #dataset = datasets.load_dataset("xsum", split="train", cache_dir=None, trust_remote_code=True)

    dataset = load_dataset("EdinburghNLP/xsum", split="test")
    selected_indices = random.sample(range(len(dataset)), num_samples)
    selected_samples = dataset.select(selected_indices)
    return selected_samples

In [8]:
# Set random seed for reproducibility
random.seed(42)
torch.manual_seed(42)

# Load dataset
print("Loading dataset...")
dataset = load_and_sample_dataset()
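Seeding matters here because `random.sample` decides which 11 records you will be scored on. The sketch below illustrates the idea with a dedicated `random.Random` instance rather than the global seed used in the notebook. The population size of 1,000 and the helper name `sample_indices` are hypothetical stand-ins, not the real dataset size or part of the assignment code.

```python
import random


def sample_indices(population_size, num_samples, seed):
    """Draw `num_samples` distinct indices using a dedicated, seeded generator."""
    rng = random.Random(seed)  # isolated from the global random state
    return rng.sample(range(population_size), num_samples)


first = sample_indices(1000, 11, seed=42)
second = sample_indices(1000, 11, seed=42)
other = sample_indices(1000, 11, seed=7)

print(first == second)  # same seed, same indices
print(first == other)   # a different seed will almost certainly give different indices
```

With the notebook's global `random.seed(42)`, the same reasoning applies as long as the seed is set immediately before the sampling call each time the cell is rerun.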

In [9]:
display(dataset)

What do our input documents look like?  Let's see the first of them.

In [11]:
dataset[0]['document']

And what does the corresponding summary look like?  This is our target.

In [12]:
dataset[0]['summary']

We'll also take advantage of a Hugging Face abstraction called a pipeline.  It is an easy way of experimenting with a model in inference mode.  We'll use that here to experiment with prompts (and possibly some hyperparameters) to improve the quality of our results.

It takes a while to load this model -- on the order of ten minutes -- but once it is loaded you can keep reusing the loaded model and improve your prompt.



In [13]:
"""
Initialize the pipeline with bitsandbytes quantization
"""
# Configure bitsandbytes for 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Initialize pipeline
model_id = "google/gemma-2-9b-it"

summarizer = pipeline(
   "text-generation",
   model=model_id,
   model_kwargs={"dtype": torch.bfloat16, "quantization_config": quantization_config},
   device_map="auto",
   trust_remote_code=True,
)
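To see why 4-bit quantization helps on a Colab GPU, here is some back-of-the-envelope arithmetic. It assumes roughly 9 billion parameters (a rounded figure taken from the model name) and counts weights only. Quantization constants, any layers left unquantized, activations, and the KV cache all add to the real footprint, so treat these numbers as a rough lower bound.

```python
def approx_weight_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 9e9  # assumption: rounded parameter count, not an exact figure

print(f"bf16 : {approx_weight_gb(NUM_PARAMS, 16):.1f} GB")
print(f"4-bit: {approx_weight_gb(NUM_PARAMS, 4):.1f} GB")
```

Under these assumptions the weights shrink from about 18 GB to about 4.5 GB.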

As a reminder, here's the record we're dealing with.

In [14]:
dataset[0]

Let's generate just one summary so we can see what it looks like.

In [15]:
prompt = [
            {"role": "user", "content": "Generate a summary of this text: " + dataset[0]['document']}
        ]



outputs = summarizer(
  prompt,
  max_new_tokens=256,
  do_sample = True,
  temperature = 0.3,
  top_p = 0.95
)

summary = outputs[0]["generated_text"][-1]

Let's see what the generated summary looks like.

In [16]:
summary
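When the pipeline is given a list of chat messages, `generated_text` comes back as the conversation with the model's reply appended, which is why the code above takes the last element. The sketch below uses a hand-written `mock_outputs` value (not real model output) to show the shape you should expect, assuming a recent version of `transformers`.

```python
# Hand-written stand-in for what the pipeline returns; the text is made up.
mock_outputs = [
    {
        "generated_text": [
            {"role": "user", "content": "Generate a summary of this text: ..."},
            {"role": "assistant", "content": "A made-up one-sentence summary."},
        ]
    }
]

last_message = mock_outputs[0]["generated_text"][-1]
print(last_message["role"])     # the role of the final message
print(last_message["content"])  # the summary string to pass to ROUGE
```

The `content` field, not the whole message dict, is what gets scored.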

How does it compare with the reference? Let's compare your candidate and the reference using the ROUGE metric.
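To build intuition for what the scores mean, here is a simplified sketch of ROUGE-N as clipped n-gram overlap. It uses lowercasing and whitespace tokenization only, so its numbers will not exactly match the `evaluate` implementation, which applies its own tokenization. The function names and the two example sentences are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N: precision, recall and F1 over clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # each n-gram counts at most as often as in both
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n(candidate, reference, n=1))
print(rouge_n(candidate, reference, n=2))
```

Changing a single word costs one unigram match but two bigram matches, which is why ROUGE-2 scores are typically much lower than ROUGE-1. A long, verbose candidate also lowers precision, so summary length is worth keeping in mind when you write your prompt.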

In [17]:
rouge = evaluate.load('rouge')


# Process each sample
print("Generating summaries and calculating ROUGE scores...")



# Calculate ROUGE scores
predictions = [summary['content']]
references = [[dataset[0]['summary']]]
rouge_scores = rouge.compute(predictions=predictions, references=references)
rouge_scores

Now it's your turn.  Please improve the prompt below so that the average ROUGE scores across the entire data sample of 11 records exceed these thresholds:
* Rouge-1 > 0.2
* Rouge-2 > 0.03
* Rouge-L > 0.15

You may use sampling with Top K or Top P and temperature if you like, but the prompt is what will have the greatest effect on your output.  Your prompt should give instructions that are as specific as possible.  These LLMs are trained to follow instructions, so be very specific in your request.  Individual words can make a large difference, so take a little time to experiment with synonyms and alternate ways of phrasing things.
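If you do experiment with sampling, it helps to know what temperature and Top P do to the next-token distribution. The following is a simplified sketch in plain Python, not the `transformers` implementation; the logits are made up and the function names are illustrative.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature (lower = more peaked)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
print(apply_temperature(logits, 1.0))
print(apply_temperature(logits, 0.3))
print(top_p_filter([0.5, 0.3, 0.2], top_p=0.7))
```

A low temperature concentrates probability on the top token, and Top P then discards the unlikely tail, so both settings push the output toward more predictable text.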

In [18]:
# Store results for aggregate scoring
results = []

Enter your prompt in the space below and then run the code.  

In [19]:
dataset[6]

In [20]:
for idx, sample in enumerate(tqdm(dataset)):
    try:
      prompt = [
      ### YOUR CODE HERE



      ### END YOUR CODE
              ]


      # Generate summary via the pipeline
      outputs = summarizer(
                          prompt,
                          max_new_tokens=512,
      )

      summary = outputs[0]["generated_text"][-1]

      # Calculate ROUGE scores
      predictions = [summary['content']]
      references = [[sample['summary']]]
      rouge_scores = rouge.compute(predictions=predictions, references=references)


      # Store results
      results.append({
          'id': idx,
          'original_text': sample['document'][:500],  # Store truncated text for readability
          'reference_summary': sample['summary'],
          'generated_summary': summary['content'],  # store the text, not the whole message dict
           **rouge_scores
      })

      # Print progress update every 10 samples
      if (idx + 1) % 10 == 0:
          print(f"\nProcessed {idx + 1} samples")
          print(f"Latest ROUGE-1: {rouge_scores['rouge1']:.4f}")

    except Exception as e:
      print(f"Error processing sample {idx}: {str(e)}")
      continue

Calculate and print the average scores.

In [21]:
# Convert results to DataFrame
results_df = pd.DataFrame(results)

# Calculate and print average ROUGE scores
avg_scores = results_df[['rouge1', 'rouge2', 'rougeL']].mean()
print("\nAverage ROUGE Scores:")
for metric, score in avg_scores.items():
   print(f"{metric}: {score:.4f}")

# Print some example summaries
print("\nExample Summaries:")
for i in range(min(5, len(results_df))):
   print(f"\nExample {i+1}:")
   print(f"Reference: {results_df.iloc[i]['reference_summary']}")
   print(f"Generated: {results_df.iloc[i]['generated_summary']}")
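A small check like the following can confirm whether the averages clear the thresholds listed earlier. The helper name `meets_thresholds` and the example averages are hypothetical; in the notebook you would pass `avg_scores` instead of the made-up dictionary.

```python
THRESHOLDS = {"rouge1": 0.2, "rouge2": 0.03, "rougeL": 0.15}  # from the assignment text


def meets_thresholds(averages, thresholds=THRESHOLDS):
    """Map each metric to True if its average strictly exceeds the threshold."""
    return {metric: averages[metric] > limit for metric, limit in thresholds.items()}


example_averages = {"rouge1": 0.25, "rouge2": 0.02, "rougeL": 0.18}  # made-up numbers
status = meets_thresholds(example_averages)
print(status)
print("All thresholds met:", all(status.values()))
```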

**QUESTION:**

1.1 What is the number of words in your prompt once you've met the scoring criteria?

1.2 What is the avg ROUGE-1 score you get once you've met the scoring criteria?

1.3 What is the avg ROUGE-2 score you get once you've met the scoring criteria?

1.4 What is the avg ROUGE-L score you get once you've met the scoring criteria?

1.5 How helpful do you find ROUGE to be in creating better summaries?  How do you think it could be improved? Please write a five sentence response in the text cell below.

*** YOUR ANSWER TO QUESTION 1.5 HERE ***

*** END YOUR ANSWER ***