# Testing T5 Transfer Learning Summarization on BBC Sports

In [1]:
!pip install transformers==4.2.0

Collecting transformers==4.2.0
  Downloading transformers-4.2.0-py3-none-any.whl (1.8 MB)
[?25l[K     |▏                               | 10 kB 25.8 MB/s eta 0:00:01[K     |▍                               | 20 kB 30.5 MB/s eta 0:00:01[K     |▋                               | 30 kB 33.1 MB/s eta 0:00:01[K     |▊                               | 40 kB 25.8 MB/s eta 0:00:01[K     |█                               | 51 kB 22.1 MB/s eta 0:00:01[K     |█▏                              | 61 kB 18.4 MB/s eta 0:00:01[K     |█▎                              | 71 kB 17.0 MB/s eta 0:00:01[K     |█▌                              | 81 kB 17.7 MB/s eta 0:00:01[K     |█▊                              | 92 kB 17.2 MB/s eta 0:00:01[K     |█▉                              | 102 kB 17.7 MB/s eta 0:00:01[K     |██                              | 112 kB 17.7 MB/s eta 0:00:01[K     |██▎                             | 122 kB 17.7 MB/s eta 0:00:01[K     |██▍                             | 133

In [2]:
!pip install torch

You should consider upgrading via the '/opt/python/envs/default/bin/python -m pip install --upgrade pip' command.[0m


## Import Libraries and Settings

In [3]:
import torch
import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModelWithLMHead

In [4]:
# Initialize pre-trained tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelWithLMHead.from_pretrained("t5-base", return_dict=True)

# Runtime: t5-base: 49s



## Transfer Learning Summarization on `bbc_sports`

In [5]:
# Read BBC Sports
bbc_sports = pd.read_csv("/data/workspace_files/bbc_sports.csv")
print(bbc_sports.shape)
bbc_sports.head()

(737, 4)


Unnamed: 0.1,Unnamed: 0,category,titles,contents
0,0,athletics,Claxton hunting first major medal,British hurdler Sarah Claxton is confident she...
1,1,athletics,O'Sullivan could run in Worlds,Sonia O'Sullivan has indicated that she would ...
2,2,athletics,Greene sets sights on world title,Maurice Greene aims to wipe out the pain of lo...
3,3,athletics,IAAF launches fight against drugs,The IAAF - athletics' world governing body - h...
4,4,athletics,"Dibaba breaks 5,000m world record",Ethiopia's Tirunesh Dibaba set a new world rec...


In [6]:
# Generating t5 summary
t5_summaries = np.array([])

In [7]:
# Loop through the texts to generate the summaries
for txt in bbc_sports["contents"]:

    # Tokenize and tensorize the text
    # For tasks in T5, add the task verb. In our case: summarize
    # Max length of tokens supported by T5 is 512
    inputs = tokenizer.encode("summarize: ", txt, return_tensors="pt", max_length=512, truncation=True)

    # Generate the summaries: 18 words < summary < 150 words
    outputs = model.generate(inputs, max_length=150, min_length=18, length_penalty=5, num_beams=2)

    # Convert summary output tensor IDs to text
    summary = tokenizer.decode(outputs[0])

    # Append result
    t5_summaries = np.append(t5_summaries, summary)


# Runtime Total: t5-base: BBC-Sports = 1h40m

In [8]:
# Check list of summaries
t5_summaries

array(["<pad> the 25-year-old has already smashed the record over 60m hurdles twice this season, setting a new mark of 7.96 seconds to win the AAAs title. she is confident she can win her first major medal at next month's European Indoors in Madrid.</s>",
       "<pad> Sonia O'Sullivan is preparing for the London marathon on 17 April. the 35-year-old is currently training at her base in australia. she will also take part in the Bupa Great Ireland Run on 9 April in Dublin.</s>",
       '<pad> Maurice Greene aims to win the world 100m title in athens this summer. the american lost the semi-final to fellow american Justin Gatlin in 9.87 seconds. he will face mark Lewis-francis in the 60m at the british grand prix on friday.</s>',
       '<pad> two task forces have been set up to examine doping and nutrition issues. about 60 people attended the meeting in Monaco, including IAAF chief Lamine Diack and Namibian athlete Frankie Fredericks. the two task forces will report back to the IAAF Coun

In [9]:
bbc_sports["summary_t5"] = t5_summaries
bbc_sports

Unnamed: 0.1,Unnamed: 0,category,titles,contents,summary_t5
0,0,athletics,Claxton hunting first major medal,British hurdler Sarah Claxton is confident she...,<pad> the 25-year-old has already smashed the ...
1,1,athletics,O'Sullivan could run in Worlds,Sonia O'Sullivan has indicated that she would ...,<pad> Sonia O'Sullivan is preparing for the Lo...
2,2,athletics,Greene sets sights on world title,Maurice Greene aims to wipe out the pain of lo...,<pad> Maurice Greene aims to win the world 100...
3,3,athletics,IAAF launches fight against drugs,The IAAF - athletics' world governing body - h...,<pad> two task forces have been set up to exam...
4,4,athletics,"Dibaba breaks 5,000m world record",Ethiopia's Tirunesh Dibaba set a new world rec...,<pad> Ethiopia's Tirunesh Dibaba sets new worl...
...,...,...,...,...,...
732,732,tennis,Agassi into second round in Dubai,Fourth seed Andre Agassi beat Radek Stepanek 6...,<pad> andre Agassi beats Radek Stepanek 6-4 7-...
733,733,tennis,Mauresmo fights back to win title,World number two Amelie Mauresmo came from a s...,<pad> amelie mauresmo wins the diamond games i...
734,734,tennis,Federer wins title in Rotterdam,World number one Roger Federer won the World I...,<pad> world number one beats Ivan Ljubicic 5-7...
735,735,tennis,GB players warned over security,Britain's Davis Cup players have been warned n...,<pad> a suicide bombing in a nightclub in the ...


In [10]:
# Export result to CSV
bbc_sports.to_csv("/data/workspace_files/bbc_sports_t5_summarized.csv")