# Testing Pegasus Summarization on BBC and BBC Sports


## About PEGASUS

In December 2019, the Google Brain team introduced PEGASUS, a state-of-the-art abstractive summarization model. The name stands for Pre-training with Extracted Gap-sentences for Abstractive Summarization. This notebook does not train or fine-tune anything; it only uses a pre-trained checkpoint to generate summaries for our text.
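To make the "gap-sentences" part of the name concrete, here is a toy sketch of how one pre-training example could be built: the sentence that best represents the rest of the document is replaced by a mask token, and the model is trained to generate it. The word-overlap score is a simplified stand-in for the ROUGE-based selection described in the PEGASUS paper, and `split_sentences`, `overlap_score`, `make_gap_sentence_example` and `MASK_TOKEN` are hypothetical names used only for illustration.

```python
import re

MASK_TOKEN = "[MASK1]"  # illustrative placeholder for a masked sentence


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def overlap_score(sentence, others):
    """Fraction of the sentence's unique words that also occur in the other sentences."""
    words = set(re.findall(r"\w+", sentence.lower()))
    rest = set(re.findall(r"\w+", " ".join(others).lower()))
    return len(words & rest) / len(words) if words else 0.0


def make_gap_sentence_example(text, n_gaps=1):
    """Mask the n_gaps highest-scoring sentences; return (masked_input, target)."""
    sentences = split_sentences(text)
    scores = [
        overlap_score(s, sentences[:i] + sentences[i + 1:])
        for i, s in enumerate(sentences)
    ]
    # Pick the top-scoring sentence indices, then restore document order
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    gaps = sorted(ranked[:n_gaps])
    masked_input = " ".join(
        MASK_TOKEN if i in gaps else s for i, s in enumerate(sentences)
    )
    target = " ".join(sentences[i] for i in gaps)
    return masked_input, target


doc = (
    "Dibaba set a new world record in Boston. "
    "The weather was cold. "
    "Dibaba said the record in Boston was a surprise."
)
masked_input, target = make_gap_sentence_example(doc)
print(masked_input)
print(target)
```

The pre-trained checkpoint used later in this notebook has already gone through this kind of training, so none of this needs to be run to generate summaries.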

**References Used**

- https://signal.onepointltd.com/post/102ghb9/exploring-pegasus-a-new-text-summarization-nlp-model
- https://huggingface.co/transformers/model_doc/pegasus.html#usage-example

## Import Libraries and Settings

In [1]:
import os

import numpy as np
import pandas as pd
import seaborn as sns

# conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch
import torch

# conda install -c conda-forge python-dotenv
from dotenv import load_dotenv

# conda install -c anaconda sqlalchemy
from sqlalchemy import create_engine

# conda install -c conda-forge transformers
from transformers import PegasusForConditionalGeneration, PegasusTokenizer



In [2]:
sns.set_theme(style="whitegrid")

In [3]:
pd.options.display.max_rows = 1000

In [4]:
load_dotenv()  # => True when a .env file was found and loaded

True

In [5]:
# Load secrets from the .env file
db_name = os.getenv("db_name")
db_username = os.getenv("db_username")
db_password = os.getenv("db_password")
db_table_schema = os.getenv("db_table_schema")

# SQLAlchemy 1.4+ only accepts the "postgresql://" scheme; "postgres://" raises an error
connection_string = f"postgresql://{db_username}:{db_password}@localhost:5432/{db_name}"
engine = create_engine(connection_string)
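One caveat with building the URL by hand: if the username or password contains characters such as `@` or `/`, the connection string is parsed incorrectly unless those characters are URL-encoded. The sketch below shows one way to handle that using only the standard library; `build_connection_string` is a hypothetical helper, not part of the original notebook, and the credentials are made up.

```python
from urllib.parse import quote_plus


def build_connection_string(username, password, db_name, host="localhost", port=5432):
    """Build a PostgreSQL URL, URL-encoding the credentials (hypothetical helper)."""
    return (
        f"postgresql://{quote_plus(username)}:{quote_plus(password)}"
        f"@{host}:{port}/{db_name}"
    )


# Made-up credentials containing characters that must be escaped
print(build_connection_string("user", "p@ss/word", "news"))
# postgresql://user:p%40ss%2Fword@localhost:5432/news
```

The resulting string can be passed to `create_engine` in the same way as the hand-built one.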

## Import Datasets

In [6]:
# List of available tables in the DB
q = """
SELECT * 
FROM information_schema.tables
WHERE table_catalog = '{db_name}'
AND table_schema = '{db_table_schema}';
""".format(
    db_name = db_name,
    db_table_schema = db_table_schema
)

pd.read_sql(q, con=engine)[["table_name"]]

| | table_name |
|---|---|
| 0 | AllTheNews21 |
| 1 | BBCArticles |
| 2 | BBCSportsArticles |


In [7]:
# BBCSports Dataset
q = """
SELECT *
FROM public."BBCSportsArticles";
"""

bbc_sports = pd.read_sql(q, con=engine)
display(bbc_sports.shape)
display(bbc_sports.head())

(737, 3)

| | category | titles | contents |
|---|---|---|---|
| 0 | athletics | Claxton hunting first major medal | British hurdler Sarah Claxton is confident she... |
| 1 | athletics | O'Sullivan could run in Worlds | Sonia O'Sullivan has indicated that she would ... |
| 2 | athletics | Greene sets sights on world title | Maurice Greene aims to wipe out the pain of lo... |
| 3 | athletics | IAAF launches fight against drugs | The IAAF - athletics' world governing body - h... |
| 4 | athletics | Dibaba breaks 5,000m world record | Ethiopia's Tirunesh Dibaba set a new world rec... |


## Abstractive Summarization With Pegasus on `bbc_sports`

In [8]:
# Generate PEGASUS summaries

# Choose the checkpoint: PEGASUS fine-tuned on XSum
model_name = "google/pegasus-xsum"

# Use the GPU when one is available, otherwise fall back to the CPU
torch_device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the tokenizer and model once, rather than on every function call
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)
model.eval()


def generate_pegasus_summary(src_text):
    """Return a short abstractive summary for a single article."""
    # Tokenize the article; calling the tokenizer directly replaces the
    # deprecated prepare_seq2seq_batch. Overlong inputs are truncated.
    batch = tokenizer(
        [src_text],
        truncation=True,
        padding="longest",
        return_tensors="pt",
    ).to(torch_device)

    # Generate summary token IDs without tracking gradients
    with torch.no_grad():
        summary_ids = model.generate(**batch)

    # Decode the token IDs back into text
    tgt_text = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)

    # Finally, return the short summary
    return tgt_text[0]

### Appending Summaries to DF

In [None]:
bbc_sports["summary_pegasus"] = bbc_sports["contents"].map(generate_pegasus_summary)
bbc_sports
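The cell above summarizes one article per call, which can be slow for all 737 rows. A possible alternative is to pass several articles to the model at once. The sketch below shows only the batching logic: `chunked` and `summarize_in_batches` are hypothetical helpers, and `fake_summarize_batch` is a stand-in used so the example runs without downloading a model. In the notebook, the `summarize_batch` argument would be a function that tokenizes a list of articles, calls `model.generate`, and decodes the results.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def summarize_in_batches(texts, summarize_batch, batch_size=8):
    """Apply `summarize_batch` (list of str -> list of str) batch by batch, keeping order."""
    summaries = []
    for batch in chunked(list(texts), batch_size):
        summaries.extend(summarize_batch(batch))
    return summaries


def fake_summarize_batch(batch):
    """Stand-in for a real model call: keep the text before the first full stop."""
    return [text.split(".")[0] for text in batch]


articles = ["First story. More detail.", "Second story. More detail.", "Third story."]
print(summarize_in_batches(articles, fake_summarize_batch, batch_size=2))
# ['First story', 'Second story', 'Third story']
```

Because the output order matches the input order, the returned list could be assigned directly to `bbc_sports["summary_pegasus"]`. A suitable `batch_size` depends on available GPU memory and is an assumption to tune, not a recommendation.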