# Project: What's The Hot Topic In Town? - kelvin.ahiakpor & emmanuel.acquaye

# Phase 4         
Summarize News Articles

### Natural Language Processing

This notebook addresses Phase 4 of the What's The Hot Topic In Town? project: **Summarize News Articles**.  
The self-created rubric in our repository states the requirement for a proper execution of this phase as follows.   
**Description:** Fine-tune the BERT base cased model from Hugging Face with a labeled dataset of news articles
and generate one-paragraph news summaries with it.

### Repository Link

Here is a link to our repository:

[What's The Hot Topic In Town?](https://github.com/kelvin-ahiakpor/Whats.The.Hot.Topic.In.Town)

### Imports

In [None]:
%%capture
!pip install datasets==1.0.2
!pip install transformers==4.2.1
!pip install torch evaluate accelerate pandas rouge_score wandb sacrebleu

import os
import wandb
import torch
import shutil
import evaluate
import datasets
import accelerate
import transformers

import pandas as pd
from torch.cuda.amp import autocast
from IPython.display import display, HTML
from datasets import ClassLabel
from datasets import load_metric

from transformers import pipeline
from transformers import BertTokenizerFast
from transformers import EncoderDecoderModel
from transformers import Seq2SeqTrainingArguments
from transformers import Seq2SeqTrainer

from IPython.display import clear_output

from dataclasses import dataclass, field
from typing import Optional

### Setting job timeout for computation

In [None]:
os.environ['JOBLIB_START_METHOD'] = 'loky'
os.environ['JOBLIB_TIMEOUT'] = '300'

### Setting up weights and biases (WANDB) to log training metrics

In [None]:
os.environ['WANDB_API_KEY'] = 'removed-for-privacy'
wandb.login()

[34m[1mwandb[0m: Currently logged in as: [33mkelvin-ahiakpor[0m ([33mkelvin-ahiakpor-ashesi-university[0m). Use [1m`wandb login --relogin`[0m to force relogin


True

### Selecting the first GPU if one is available

In [None]:
device = 0 if torch.cuda.is_available() else -1

### Loading training data

In [None]:
train_data = datasets.load_dataset("cnn_dailymail", "3.0.0", split="train")




### Custom settings for DataFrame inspection
Predictions will be stored in a DataFrame so that they can be displayed in a readable form in this notebook.

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.precision', 3)

### Understanding the data

##### Inspecting dataset's metadata

In [None]:
train_data

Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 287113
})

In [None]:
train_data.features

{'article': Value(dtype='string', id=None),
 'highlights': Value(dtype='string', id=None),
 'id': Value(dtype='string', id=None)}

In [None]:
train_data.info.splits

{'train': SplitInfo(name='train', num_bytes=1261703785, num_examples=287113, shard_lengths=[115705, 115704, 55704], dataset_name='cnn_dailymail'),
 'validation': SplitInfo(name='validation', num_bytes=57732412, num_examples=13368, shard_lengths=None, dataset_name='cnn_dailymail'),
 'test': SplitInfo(name='test', num_bytes=49925732, num_examples=11490, shard_lengths=None, dataset_name='cnn_dailymail')}

So far, we see that our dataset is already split into train, validation, and test sets with 287113, 13368, and 11490 examples, respectively.

##### Peek at training data

Our input is called article and our labels are called highlights. Let's now print out the first example of the training data to get a feeling for the data.

In [None]:
df = pd.DataFrame(train_data[:1])
del df["id"]
for column, typ in train_data.features.items():
    if isinstance(typ, ClassLabel):
        df[column] = df[column].transform(lambda i: typ.names[i])
display(HTML(df.to_html()))

Unnamed: 0,article,highlights
0,"LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in ""Harry Potter and the Order of the Phoenix"" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. ""I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar,"" he told an Australian interviewer earlier this month. ""I don't think I'll be particularly extravagant. ""The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs."" At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film ""Hostel: Part II,"" currently six places below his number one movie on the UK box office chart. Details of how he'll mark his landmark birthday are under wraps. His agent and publicist had no comment on his plans. ""I'll definitely have some sort of party,"" he said in an interview. ""Hopefully none of you will be reading about it."" Radcliffe's earnings from the first five Potter films have been held in a trust fund which he has not been able to touch. Despite his growing fame and riches, the actor says he is keeping his feet firmly on the ground. ""People are always looking to say 'kid star goes off the rails,'"" he told reporters last month. ""But I try very hard not to go that way because it would be too easy for them."" His latest outing as the boy wizard in ""Harry Potter and the Order of the Phoenix"" is breaking records on both sides of the Atlantic and he will reprise the role in the last two films. Watch I-Reporter give her review of Potter's latest » . There is life beyond Potter, however. 
The Londoner has filmed a TV movie called ""My Boy Jack,"" about author Rudyard Kipling and his son, due for release later this year. He will also appear in ""December Boys,"" an Australian film about four boys who escape an orphanage. Earlier this year, he made his stage debut playing a tortured teenager in Peter Shaffer's ""Equus."" Meanwhile, he is braced for even closer media scrutiny now that he's legally an adult: ""I just think I'm going to be more sort of fair game,"" he told Reuters. E-mail to a friend . Copyright 2007 Reuters. All rights reserved.This material may not be published, broadcast, rewritten, or redistributed.",Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .\nYoung actor says he has no plans to fritter his cash away .\nRadcliffe's earnings from first five Potter films have been held in trust fund .


The input data seems to consist of short news articles. Interestingly, the labels appear to be bullet-point-like summaries. At this point, we should probably take a look at a couple of other examples to get a better feeling for the data.

In [None]:
df = pd.DataFrame(train_data[1:4])
del df["id"]
for column, typ in train_data.features.items():
    if isinstance(typ, ClassLabel):
        df[column] = df[column].transform(lambda i: typ.names[i])
display(HTML(df.to_html()))

Unnamed: 0,article,highlights
0,"Editor's note: In our Behind the Scenes series, CNN correspondents share their experiences in covering news and analyze the stories behind the events. Here, Soledad O'Brien takes users inside a jail where many of the inmates are mentally ill. An inmate housed on the ""forgotten floor,"" where many mentally ill inmates are housed in Miami before trial. MIAMI, Florida (CNN) -- The ninth floor of the Miami-Dade pretrial detention facility is dubbed the ""forgotten floor."" Here, inmates with the most severe mental illnesses are incarcerated until they're ready to appear in court. Most often, they face drug charges or charges of assaulting an officer --charges that Judge Steven Leifman says are usually ""avoidable felonies."" He says the arrests often result from confrontations with police. Mentally ill people often won't do what they're told when police arrive on the scene -- confrontation seems to exacerbate their illness and they become more paranoid, delusional, and less likely to follow directions, according to Leifman. So, they end up on the ninth floor severely mentally disturbed, but not getting any real help because they're in jail. We toured the jail with Leifman. He is well known in Miami as an advocate for justice and the mentally ill. Even though we were not exactly welcomed with open arms by the guards, we were given permission to shoot videotape and tour the floor. Go inside the 'forgotten floor' » . At first, it's hard to determine where the people are. The prisoners are wearing sleeveless robes. Imagine cutting holes for arms and feet in a heavy wool sleeping bag -- that's kind of what they look like. They're designed to keep the mentally ill patients from injuring themselves. That's also why they have no shoes, laces or mattresses. Leifman says about one-third of all people in Miami-Dade county jails are mentally ill. So, he says, the sheer volume is overwhelming the system, and the result is what we see on the ninth floor. 
Of course, it is a jail, so it's not supposed to be warm and comforting, but the lights glare, the cells are tiny and it's loud. We see two, sometimes three men -- sometimes in the robes, sometimes naked, lying or sitting in their cells. ""I am the son of the president. You need to get me out of here!"" one man shouts at me. He is absolutely serious, convinced that help is on the way -- if only he could reach the White House. Leifman tells me that these prisoner-patients will often circulate through the system, occasionally stabilizing in a mental hospital, only to return to jail to face their charges. It's brutally unjust, in his mind, and he has become a strong advocate for changing things in Miami. Over a meal later, we talk about how things got this way for mental patients. Leifman says 200 years ago people were considered ""lunatics"" and they were locked up in jails even if they had no charges against them. They were just considered unfit to be in society. Over the years, he says, there was some public outcry, and the mentally ill were moved out of jails and into hospitals. But Leifman says many of these mental hospitals were so horrible they were shut down. Where did the patients go? Nowhere. The streets. They became, in many cases, the homeless, he says. They never got treatment. Leifman says in 1955 there were more than half a million people in state mental hospitals, and today that number has been reduced 90 percent, and 40,000 to 50,000 people are in mental hospitals. The judge says he's working to change this. Starting in 2008, many inmates who would otherwise have been brought to the ""forgotten floor"" will instead be sent to a new mental health facility -- the first step on a journey toward long-term treatment, not just punishment. Leifman says it's not the complete answer, but it's a start. Leifman says the best part is that it's a win-win solution. 
The patients win, the families are relieved, and the state saves money by simply not cycling these prisoners through again and again. And, for Leifman, justice is served. E-mail to a friend .","Mentally ill inmates in Miami are housed on the ""forgotten floor""\nJudge Steven Leifman says most are there as a result of ""avoidable felonies""\nWhile CNN tours facility, patient shouts: ""I am the son of the president""\nLeifman says the system is unjust and he's fighting for change ."
1,"MINNEAPOLIS, Minnesota (CNN) -- Drivers who were on the Minneapolis bridge when it collapsed told harrowing tales of survival. ""The whole bridge from one side of the Mississippi to the other just completely gave way, fell all the way down,"" survivor Gary Babineau told CNN. ""I probably had a 30-, 35-foot free fall. And there's cars in the water, there's cars on fire. The whole bridge is down."" He said his back was injured but he determined he could move around. ""I realized there was a school bus right next to me, and me and a couple of other guys went over and started lifting the kids off the bridge. They were yelling, screaming, bleeding. I think there were some broken bones."" Watch a driver describe his narrow escape » . At home when he heard about the disaster, Dr. John Hink, an emergency room physician, jumped into his car and rushed to the scene in 15 minutes. He arrived at the south side of the bridge, stood on the riverbank and saw dozens of people lying dazed on an expansive deck. They were in the middle of the Mississippi River, which was churning fast, and he had no way of getting to them. He went to the north side, where there was easier access to people. Ambulances were also having a hard time driving down to the river to get closer to the scene. Working feverishly, volunteers, EMTs and other officials managed to get 55 people into ambulances in less than two hours. Occasionally, a pickup truck with a medic inside would drive to get an injured person and bring him back up even ground, Hink told CNN. The rescue effort was controlled and organized, he said; the opposite of the lightning-quick collapse. ""I could see the whole bridge as it was going down, as it was falling,"" Babineau said. ""It just gave a rumble real quick, and it all just gave way, and it just fell completely, all the way to the ground. 
And there was dust everywhere and it was just like everyone has been saying: It was just like out of the movies."" Babineau said the rear of his pickup truck was dangling over the edge of a broken-off section of the bridge. He said several vehicles slid past him into the water. ""I stayed in my car for one or two seconds. I saw a couple cars fall,"" he said. ""So I stayed in my car until the cars quit falling for a second, then I got out real quick, ran in front of my truck -- because behind my truck was just a hole -- and I helped a woman off of the bridge with me. ""I just wanted off the bridge, and then I ran over to the school bus. I started grabbing kids and handing them down. It was just complete chaos."" He said most of the children were crying or screaming. He and other rescuers set them on the ground and told them to run to the river bank, but a few needed to be carried because of their injuries. See rescuers clamber over rubble » . Babineau said he had no rescue training. ""I just knew what I had to do at the moment."" Melissa Hughes, 32, of Minneapolis, told The Associated Press that she was driving home when the western edge of the bridge collapsed under her. ""You know that free-fall feeling? I felt that twice,"" Hughes said. A pickup landed on top of her car, but she was not hurt. ""I had no idea there was a vehicle on my car,"" she told AP. ""It's really very surreal."" Babineau told the Minneapolis Star-Tribune: ""On the way down, I thought I was dead. I literally thought I was dead. ""My truck was completely face down, pointed toward the ground, and my truck got ripped in half. It was folded in half, and I can't believe I'm alive."" See and hear eyewitness accounts » . Bernie Toivonen told CNN's ""American Morning"" that his vehicle was on a part of the bridge that ended up tilted at a 45-degree angle. ""I knew the deck was going down, there was no question about it, and I thought I was going to die,"" he said. 
After the bridge settled and his car remained upright, ""I just put in park, turned the key off and said, 'Oh, I'm alive,' "" he said. E-mail to a friend .","NEW: ""I thought I was going to die,"" driver says .\nMan says pickup truck was folded in half; he just has cut on face .\nDriver: ""I probably had a 30-, 35-foot free fall""\nMinnesota bridge collapsed during rush hour Wednesday ."
2,"WASHINGTON (CNN) -- Doctors removed five small polyps from President Bush's colon on Saturday, and ""none appeared worrisome,"" a White House spokesman said. The polyps were removed and sent to the National Naval Medical Center in Bethesda, Maryland, for routine microscopic examination, spokesman Scott Stanzel said. Results are expected in two to three days. All were small, less than a centimeter [half an inch] in diameter, he said. Bush is in good humor, Stanzel said, and will resume his activities at Camp David. During the procedure Vice President Dick Cheney assumed presidential power. Bush reclaimed presidential power at 9:21 a.m. after about two hours. Doctors used ""monitored anesthesia care,"" Stanzel said, so the president was asleep, but not as deeply unconscious as with a true general anesthetic. He spoke to first lady Laura Bush -- who is in Midland, Texas, celebrating her mother's birthday -- before and after the procedure, Stanzel said. Afterward, the president played with his Scottish terriers, Barney and Miss Beazley, Stanzel said. He planned to have lunch at Camp David and have briefings with National Security Adviser Stephen Hadley and White House Chief of Staff Josh Bolten, and planned to take a bicycle ride Saturday afternoon. Cheney, meanwhile, spent the morning at his home on Maryland's eastern shore, reading and playing with his dogs, Stanzel said. Nothing occurred that required him to take official action as president before Bush reclaimed presidential power. The procedure was supervised by Dr. Richard Tubb, Bush's physician, and conducted by a multidisciplinary team from the National Naval Medical Center in Bethesda, Maryland, the White House said. Bush's last colonoscopy was in June 2002, and no abnormalities were found, White House spokesman Tony Snow said. The president's doctor had recommended a repeat procedure in about five years. 
A colonoscopy is the most sensitive test for colon cancer, rectal cancer and polyps, small clumps of cells that can become cancerous, according to the Mayo Clinic. Small polyps may be removed during the procedure. Snow said on Friday that Bush had polyps removed during colonoscopies before becoming president. Snow himself is undergoing chemotherapy for cancer that began in his colon and spread to his liver. Watch Snow talk about Bush's procedure and his own colon cancer » . ""The president wants to encourage everybody to use surveillance,"" Snow said. The American Cancer Society recommends that people without high risk factors or symptoms begin getting screened for signs of colorectal cancer at age 50. E-mail to a friend .","Five small polyps found during procedure; ""none worrisome,"" spokesman says .\nPresident reclaims powers transferred to vice president .\nBush undergoes routine colonoscopy at Camp David ."


Next, let's get a sense of the length of input data and labels.  
As models compute length in *token-length*, we will make use of the `bert-base-cased` tokenizer to compute the article and summary length.

##### Quick statistics
Article and Summary Length,
Histogram Plots,
Correlation Matrix,
Categorical Variable Selection &
Important Numeric Features from PCA

In [None]:
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")


Next, we make use of `.map()` to compute the length of the article and its summary. Since `bert-base-cased` can process at most 512 tokens, we are also interested in the percentage of input samples that are longer than this maximum.
Similarly, we compute the percentage of summaries that are longer than 64 and 128 tokens, respectively, as these figures will help us meet our goal of providing multi-paragraph summaries if possible.

We can define the `.map()` function as follows.

In [None]:
# map article and summary len to dict as well as if sample is longer than 512 tokens
def map_to_length(x):
    x["article_len"] = len(tokenizer(x["article"]).input_ids)
    x["article_longer_512"] = int(x["article_len"] > tokenizer.model_max_length)
    x["summary_len"] = len(tokenizer(x["highlights"]).input_ids)
    x["summary_longer_64"] = int(x["summary_len"] > 64)
    x["summary_longer_128"] = int(x["summary_len"] > 128)
    return x
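
To make the bookkeeping in `map_to_length` easy to try without downloading anything, here is a minimal offline sketch. It assumes a whitespace split as a stand-in for the `bert-base-cased` tokenizer (real WordPiece counts will be higher), and `toy_map_to_length` and `toy_example` are hypothetical names used only for illustration.

```python
# Sketch only: whitespace tokens stand in for WordPiece tokens (an assumption).
MAX_ARTICLE_TOKENS = 512  # maximum input length of bert-base-cased

def toy_map_to_length(example, max_len=MAX_ARTICLE_TOKENS):
    """Return a copy of `example` with the same length columns as map_to_length."""
    out = dict(example)
    out["article_len"] = len(example["article"].split())
    out["article_longer_512"] = int(out["article_len"] > max_len)
    out["summary_len"] = len(example["highlights"].split())
    out["summary_longer_64"] = int(out["summary_len"] > 64)
    out["summary_longer_128"] = int(out["summary_len"] > 128)
    return out

# A made-up example: a 600-"token" article and a 3-"token" summary.
toy_example = {"article": "word " * 600, "highlights": "short summary ."}
toy_stats = toy_map_to_length(toy_example)
print(toy_stats["article_len"], toy_stats["article_longer_512"], toy_stats["summary_len"])
```

The flag columns are stored as 0/1 integers so that summing them later gives the number of over-length samples directly.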

It should be sufficient to look at the first 10000 samples. We can speed up the mapping by using multiple processes with `num_proc=4`.

In [None]:
sample_size = 10000
data_stats = train_data.select(range(sample_size)).map(map_to_length, num_proc=4)


Having computed the length for the first 10000 samples, we can now average them together. We can make use of the `.map()` function with `batched=True` and `batch_size=-1` to have access to all 10000 samples within the `.map()` function.

We have also identified that some articles are too long and can cause training errors. We will use `.filter()` to remove them in a data cleaning section below.

In [None]:
def compute_and_print_stats(x):
    if len(x["article_len"]) == sample_size:
        print(
            "Article Mean: {}, %-Articles > 512:{}, Summary Mean:{}, %-Summary > 64:{}, %-Summary > 128:{}".format(
                sum(x["article_len"]) / sample_size,
                sum(x["article_longer_512"]) / sample_size,
                sum(x["summary_len"]) / sample_size,
                sum(x["summary_longer_64"]) / sample_size,
                sum(x["summary_longer_128"]) / sample_size,
            )
        )

output = data_stats.map(compute_and_print_stats, batched=True, batch_size=-1,)


Article Mean: 827.079, %-Articles > 512:0.7344, Summary Mean:59.5725, %-Summary > 64:0.3499, %-Summary > 128:0.0


We can see that on average an article contains 827 tokens, with *ca.* 3/4 of the articles being longer than the model's `max_length` of 512. The summary is on average about 60 tokens long (59.57). About 35% of our 10000-sample summaries are longer than 64 tokens, but none are longer than 128 tokens.
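
Note that the values printed above are fractions rather than percentages, so 0.7344 corresponds to 73.44%. The following small sketch shows the same arithmetic on made-up token counts; `fraction_longer_than` and `toy_article_lens` are hypothetical names, not part of the notebook.

```python
def fraction_longer_than(lengths, limit):
    """Fraction (between 0.0 and 1.0) of entries in `lengths` that exceed `limit`."""
    return sum(length > limit for length in lengths) / len(lengths)

toy_article_lens = [300, 700, 900, 1200]  # made-up token counts
mean_len = sum(toy_article_lens) / len(toy_article_lens)
frac_over_512 = fraction_longer_than(toy_article_lens, 512)

# The ".2%" format multiplies by 100, turning the fraction into a percentage.
print(f"mean={mean_len}, over 512: {frac_over_512:.2%}")
```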

`bert-base-cased` is limited to 512 tokens, which means we would have to cut possibly important information from the article. Because most of the important information is often found at the beginning of articles and because we want to be computationally efficient, we decide to stick to `bert-base-cased` with a `max_length` of 512 in this notebook. This choice is not optimal but has been shown to yield [good results](https://arxiv.org/abs/1907.12461) on CNN/Dailymail.


#### Some notes so far

1. Evaluation :   

    **Case-sensitivity**
    - The text in our peeks was *cased* (it mixes upper-case and lower-case letters). This means that we have to be careful if we want to use *case-insensitive* models. As *CNN/Dailymail* is a summarization dataset, the model will be evaluated using the *ROUGE* metric. Checking the description of *ROUGE* in 🤗datasets, *cf.* [here](https://huggingface.co/metrics/rouge), we can see that the metric is *case-insensitive*, meaning that *upper case* letters will be normalized to *lower case* letters during evaluation. Thus, we could safely leverage *uncased* checkpoints, such as `bert-base-uncased`; in this notebook we nevertheless use `bert-base-cased`.
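
The case-insensitivity noted above can be illustrated with a simplified bigram-overlap score. This is only a sketch under stated assumptions: it splits on whitespace and lowercases the text, whereas the real ROUGE implementation also strips punctuation and can apply stemming. `bigrams` and `toy_rouge2` are hypothetical helper names.

```python
from collections import Counter

def bigrams(text):
    """Count the bigrams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()  # lowercasing mirrors ROUGE's normalization
    return Counter(zip(tokens, tokens[1:]))

def toy_rouge2(prediction, reference):
    """Simplified ROUGE-2: (precision, recall, F-measure) on bigram overlap."""
    pred, ref = bigrams(prediction), bigrams(reference)
    overlap = sum((pred & ref).values())  # clipped count of shared bigrams
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f_measure = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

reference = "Harry Potter star turns 18 Monday"
cased = toy_rouge2("Harry Potter star turns 18", reference)
uncased = toy_rouge2("harry potter star turns 18", reference)
print(cased == uncased)
```

Because both predictions are lowercased before comparison, the cased and uncased variants receive identical scores in this sketch.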

# Task 1
Test the pretrained BERT-Based model for summarization

Why use bert2BERT and not BERT_BASE?  
bert2BERT makes pretrained language models reusable.
In recent years, researchers have tended to pre-train ever-larger language models to explore the upper limit of deep models. However, pre-training large language models requires intensive computational resources, and most models are trained from scratch without reusing existing pre-trained models, which is wasteful. bert2BERT allows us to effectively transfer the knowledge of an existing smaller pre-trained model like BERT_BASE to a large model through parameter initialization and significantly improve the pre-training efficiency of the large model. [1]

**Loading the pretrained BART summarization model `facebook/bart-large-cnn` through the summarization pipeline**

In [None]:
summarizer = pipeline("summarization", model="facebook/bart-large-cnn",device=device)


In [None]:
def summarize_article(article, tokenizer, summarizer):
    # The pipeline tokenizes the raw article with its own BART tokenizer and already
    # runs on `device`, so the BERT `tokenizer` argument is not used here; it is kept
    # only so that existing calls with three arguments keep working.
    summary = summarizer(article, max_length=150, min_length=50, do_sample=True)
    return summary[0]['summary_text']

**Generating a few sample summaries with it**

In [None]:
news_articles = [example["article"] for example in train_data.select(range(3))]
target_summaries = [example["highlights"] for example in train_data.select(range(3))]

Summarize the articles and store the results in the `pre_train_summaries` list. We set `do_sample` to `True` to avoid purely deterministic decoding and provide a more natural, human-like summary.

In [None]:
pre_train_summaries = []
for article in news_articles:
    try:
        summary = summarize_article(article, tokenizer, summarizer)
        pre_train_summaries.append(summary)
    except Exception as e:
        pre_train_summaries.append("Error, article was too long")
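
The effect of the `do_sample=True` setting used above can be illustrated with a toy next-token distribution. This is a conceptual sketch, not how the pipeline is implemented: the distribution and the helper names `greedy_pick` and `sample_pick` are made up for illustration.

```python
import random

# Made-up probabilities for the next token (an assumption for illustration).
toy_next_token_probs = {"the": 0.5, "a": 0.3, "one": 0.2}

def greedy_pick(probs):
    """Deterministic choice: always the most probable token."""
    return max(probs, key=probs.get)

def sample_pick(probs, rng):
    """Stochastic choice: draw a token in proportion to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
greedy_choices = {greedy_pick(toy_next_token_probs) for _ in range(20)}
sampled_choices = {sample_pick(toy_next_token_probs, rng) for _ in range(200)}
print(greedy_choices, sampled_choices)
```

Greedy selection always returns the same token, while sampling occasionally picks less likely tokens, which is why sampled summaries vary from run to run.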

Create a dataframe containing the pretrained BART summaries and the target summaries from our fine-tuning dataset

In [None]:
summaries = pd.DataFrame({
    'Pre-trained BART Summaries': pre_train_summaries,
    'Target Summaries': target_summaries
})

In [None]:
summaries

Unnamed: 0,Pre-trained BART Summaries,Target Summaries
0,Harry Potter star Daniel Radcliffe turns 18 on Monday. He gains access to a reported £20 million ($41.1 million) fortune. Radcliffe's earnings from the first five Potter films have been held in a trust fund. Details of how he'll mark his landmark birthday are under wraps.,Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .\nYoung actor says he has no plans to fritter his cash away .\nRadcliffe's earnings from first five Potter films have been held in trust fund .
1,"Judge Steven Leifman is an advocate for justice and the mentally ill. About one-third of all people in Miami-Dade county jails are mentally ill, he says. He says the sheer volume is overwhelming the system. Starting in 2008, many inmates will be sent to a new mental health facility.","Mentally ill inmates in Miami are housed on the ""forgotten floor""\nJudge Steven Leifman says most are there as a result of ""avoidable felonies""\nWhile CNN tours facility, patient shouts: ""I am the son of the president""\nLeifman says the system is unjust and he's fighting for change ."
2,"NEW: ""I probably had a 30-, 35-foot free fall,"" survivor Gary Babineau says. NEW: ""My truck was completely face down, pointed toward the ground, and my truck got ripped in half,"" he says. Dr. John Hink says he saw dozens of people lying dazed on an expansive deck. ""It just gave a rumble real quick, and it all just gave way,"" survivor says.","NEW: ""I thought I was going to die,"" driver says .\nMan says pickup truck was folded in half; he just has cut on face .\nDriver: ""I probably had a 30-, 35-foot free fall""\nMinnesota bridge collapsed during rush hour Wednesday ."


These are good summaries, and this is what we will be fine-tuning `bert-base-cased` to do in the next task.

# Task 2
Fine-Tune bert-base-cased as bert2BERTMK on the cnn_dailymail dataset
KM - Kelvin and Manuel

**Note**: We are using the CNN/Daily Mail dataset because it is tailored for abstractive summarization, whereas the BBC XSum dataset is for extreme summarization. We need abstractive summarization because it can help us achieve our goal of multi-sentence summaries.

### Loading validation data

First, we load 10% of the validation dataset for faster validation:

In [None]:
val_data = datasets.load_dataset("cnn_dailymail", "3.0.0", split="validation[:10%]")

### Data Preprocessing

In [None]:
tokenizer.bos_token = tokenizer.cls_token
tokenizer.eos_token = tokenizer.sep_token

In [None]:
batch_size=4
encoder_max_length=512
decoder_max_length=128

In [None]:
def process_data_to_model_inputs(batch):
    # tokenize the inputs and labels
    inputs = tokenizer(batch["article"], padding="max_length", truncation=True, max_length=encoder_max_length)
    outputs = tokenizer(batch["highlights"], padding="max_length", truncation=True, max_length=decoder_max_length)

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask
    batch["labels"] = outputs.input_ids.copy()

    # because BERT automatically shifts the labels, the labels correspond exactly to `decoder_input_ids`.
    # We have to make sure that the PAD token is ignored
    batch["labels"] = [[-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in
                       batch["labels"]]

    return batch
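
The padding mask applied in `process_data_to_model_inputs` can be checked on a tiny example. This is a minimal sketch with made-up token ids: it assumes the pad id is 0 (as in BERT vocabularies) and relies on -100 being the default index ignored by PyTorch's cross-entropy loss. `mask_padding` and `toy_labels` are hypothetical names.

```python
PAD_ID = 0           # assumed id of [PAD]
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding(label_ids, pad_id=PAD_ID):
    """Replace every padding id with IGNORE_INDEX so it does not contribute to the loss."""
    return [[IGNORE_INDEX if tok == pad_id else tok for tok in seq] for seq in label_ids]

# Two made-up label sequences, padded to length 6.
toy_labels = [[101, 7, 8, 102, 0, 0], [101, 9, 102, 0, 0, 0]]
masked = mask_padding(toy_labels)
print(masked)
```

Only the padded positions change; the real tokens, including the start and end markers, are still used for training.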

In [None]:
# use only the first 140,000 training examples - DELETE LINE FOR FULL TRAINING
train_data = train_data.select(range(140000)) #70000

train_data = train_data.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=["article", "highlights", "id"]
)
train_data.set_format(
    type="torch", columns=["input_ids", "attention_mask", "labels"],
)


In [None]:
# optionally use only the first 1,200 validation examples (currently disabled, so all loaded examples are used)
#val_data = val_data.select(range(1200))

val_data = val_data.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=["article", "highlights", "id"]
)
val_data.set_format(
    type="torch", columns=["input_ids", "attention_mask", "labels"],
)


### Warm-starting the Encoder-Decoder Model
The warning messages (suppressed here by `%%capture`) are normal; they show that some weights are being randomly initialized.

In [None]:
%%capture
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-cased", "bert-base-cased")

In [None]:
# set special tokens
bert2bert.config.decoder_start_token_id = tokenizer.bos_token_id
bert2bert.config.eos_token_id = tokenizer.eos_token_id
bert2bert.config.pad_token_id = tokenizer.pad_token_id

# copy the decoder's vocabulary size to the shared config
bert2bert.config.vocab_size = bert2bert.config.decoder.vocab_size

# sensible parameters for beam search
bert2bert.config.max_length = 142
bert2bert.config.min_length = 56
bert2bert.config.no_repeat_ngram_size = 3
bert2bert.config.early_stopping = True
bert2bert.config.length_penalty = 2.0
bert2bert.config.num_beams = 4
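
To give a feeling for what `no_repeat_ngram_size = 3` does during generation, the sketch below computes which next tokens would repeat an already generated trigram. It is a simplified illustration on word strings rather than token ids, assumes an n-gram size of at least 2, and `banned_next_tokens` is a hypothetical helper, not a library function.

```python
def banned_next_tokens(generated, ngram_size=3):
    """Tokens that would complete an n-gram already present in `generated`."""
    if len(generated) < ngram_size:
        return set()
    # The last (n-1) tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated[-(ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        if tuple(generated[i:i + ngram_size - 1]) == prefix:
            banned.add(generated[i + ngram_size - 1])
    return banned

toy_sequence = ["the", "bridge", "collapsed", "and", "the", "bridge"]
print(banned_next_tokens(toy_sequence))
```

Here "the bridge collapsed" has already been generated, so after a second "the bridge" the token "collapsed" would be blocked, which discourages the looping repetitions that summarization models often produce.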

### **Fine-Tuning Warm-Started Encoder-Decoder Models**

For the `EncoderDecoderModel` framework, we will use the `Seq2SeqTrainingArguments` and the `Seq2SeqTrainer` which have been imported already.

### Load rouge for validation

In [None]:
rouge = datasets.load_metric("rouge", trust_remote_code=True)


Downloading builder script:   0%|          | 0.00/2.17k [00:00<?, ?B/s]

In [None]:
def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    # all unnecessary tokens are removed
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }

**Setting training arguments**

For our model training, we referred to Patrick von Platen's demo of the BERT2BERT model to guide our parameter choices. The parameters initially suggested were as follows:

training_args = Seq2SeqTrainingArguments(  
    output_dir="./",  
    evaluation_strategy="steps",  
    per_device_train_batch_size=batch_size,  
    per_device_eval_batch_size=batch_size,  
    predict_with_generate=True,  
    logging_steps=2,  # set to 1000 for full training  
    save_steps=16,  # set to 500 for full training  
    eval_steps=4,  # set to 8000 for full training  
    warmup_steps=1,  # set to 2000 for full training  
    max_steps=16, # delete for full training  
    overwrite_output_dir=True,  
    save_total_limit=3,  
    fp16=True,  
)  

Due to GPU memory limitations, we were unable to perform a grid search for hyperparameter tuning. We therefore adjusted some of these parameters to fit our computational resources and training constraints, consulting ChatGPT and Faculty Intern Kweku Yamoah on the final settings.

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="./",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    predict_with_generate=True,
    evaluation_strategy = 'steps',
    do_train=True,
    do_eval=True,
    logging_steps=1000,  # log the training loss every 1000 steps
    save_steps=1000,  # save a checkpoint every 1000 steps
    eval_steps=8000,  # evaluate on the validation set every 8000 steps
    warmup_steps=2000,  # warm up the learning rate over the first 2000 steps
    overwrite_output_dir=True,
    save_total_limit=3,
    fp16=True,
    report_to="wandb"  # Enable W&B logging
)
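
As a rough illustration of what `warmup_steps=2000` means, the sketch below reproduces a linear warmup followed by a linear decay in plain Python. It assumes the Trainer defaults (a base learning rate of 5e-5 and the `linear` scheduler), since neither is overridden above, and it takes the 105,000 total steps from the training output reported in this notebook. `linear_warmup_lr` is a hypothetical helper, not a library function.

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup_steps=2000, total_steps=105000):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup phase.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr down to 0 at the last step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 1000, 2000, 53500, 105000):
    print(step, f"{linear_warmup_lr(step):.2e}")
```

The learning rate peaks at step 2,000 and is back to half its peak value around the midpoint of the decay phase.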



**Instantiate trainer**

In [None]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
bert2bert.to(device)

EncoderDecoderModel(
  (encoder): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, el

In [None]:
trainer = Seq2SeqTrainer(
    model=bert2bert,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_data,
    eval_dataset=val_data,
)

Cool! Finally, we start training.

**Train!**

In [None]:
trainer.train()

  decoder_attention_mask = decoder_input_ids.new_tensor(decoder_input_ids != self.config.pad_token_id)



| Step | Training Loss | Validation Loss | Rouge2 Precision | Rouge2 Recall | Rouge2 Fmeasure |
|---|---|---|---|---|---|
| 8000 | 2.8637 | 3.408133 | 0.0622 | 0.0861 | 0.0701 |
| 16000 | 3.1021 | 2.937377 | 0.0864 | 0.1228 | 0.0984 |
| 24000 | 2.9059 | 2.806555 | 0.0924 | 0.1388 | 0.1077 |
| 32000 | 2.7952 | 2.713385 | 0.0966 | 0.1423 | 0.1114 |
| 40000 | 2.4716 | 2.675352 | 0.0965 | 0.1438 | 0.1115 |
| 48000 | 2.4718 | 2.62947 | 0.0991 | 0.1503 | 0.1162 |
| 56000 | 2.4049 | 2.573213 | 0.1056 | 0.1578 | 0.1227 |
| 64000 | 2.368 | 2.520559 | 0.1048 | 0.154 | 0.1204 |
| 72000 | 2.0752 | 2.530947 | 0.1059 | 0.1564 | 0.1224 |
| 80000 | 2.0817 | 2.503957 | 0.1064 | 0.1577 | 0.1232 |



TrainOutput(global_step=105000, training_loss=2.4967903680710566, metrics={'train_runtime': 24469.1844, 'train_samples_per_second': 17.164, 'train_steps_per_second': 4.291, 'total_flos': 2.5764875329536e+17, 'train_loss': 2.4967903680710566, 'epoch': 3.0})
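
The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes a single GPU and infers the effective batch size from the ratio of samples per second to steps per second, since `batch_size` is defined earlier in the notebook.

```python
train_examples = 140_000      # size of the training subset selected above
num_epochs = 3                # 'epoch': 3.0 in TrainOutput
samples_per_second = 17.164   # 'train_samples_per_second'
steps_per_second = 4.291      # 'train_steps_per_second'
runtime_seconds = 24469.1844  # 'train_runtime'

# Each optimizer step consumes one batch, so the ratio gives the batch size.
effective_batch = round(samples_per_second / steps_per_second)
steps_per_epoch = train_examples // effective_batch
total_steps = steps_per_epoch * num_epochs
runtime_hours = runtime_seconds / 3600

print(effective_batch, steps_per_epoch, total_steps, round(runtime_hours, 1))
# 4 35000 105000 6.8
```

The result matches `global_step=105000`, which is also why the final checkpoint is named `checkpoint-105000`.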

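After 3 epochs (105,000 steps), the average training loss was about 2.50. At the last validation step shown above (step 80,000), the model reached a ROUGE-2 F-measure of 0.1232, i.e. **12.32**, which is below the **18.22** reported in Patrick von Platen's BERT2BERT demo.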

# Task 3
Evaluate the fine-tuned model with the ROUGE-2 Metric

### Evaluation
**What is ROUGE?** ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a family of n-gram co-occurrence statistics that compare a candidate summary with a reference summary.  
**Brief on ROUGE-2:** ROUGE-2 measures the overlap of bigrams (two-word sequences) between the candidate summary and the reference summary. Precision is the fraction of candidate bigrams that also appear in the reference, recall is the fraction of reference bigrams that appear in the candidate, and the F-measure is their harmonic mean.

**Interpretation of Scores**  
A higher ROUGE-2 score means the generated summary shares more bigrams with the reference summary. This suggests that the model is capturing more of the important two-word sequences from the reference, and it is usually read as a sign of better summary quality [4].  
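
To make the definition concrete, here is a simplified sketch of ROUGE-2 in plain Python. It assumes whitespace tokenization and lowercasing only, so its scores will not exactly match the `rouge` metric used in this notebook, which applies its own tokenization and aggregation. The helper names `bigrams` and `rouge2` are ours.

```python
from collections import Counter


def bigrams(text):
    """Count the bigrams of a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2(candidate, reference):
    """Return (precision, recall, f-measure) of the bigram overlap."""
    cand, ref = bigrams(candidate), bigrams(reference)
    # Counter intersection keeps the minimum count, i.e. clipped matches.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if overlap == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


p, r, f = rouge2(
    "the bridge fell during rush hour",
    "the bridge collapsed during rush hour",
)
print(f"{p:.2f} {r:.2f} {f:.2f}")  # 0.60 0.60 0.60
```

Three of the five bigrams on each side match ("the bridge", "during rush", "rush hour"), so precision, recall and F-measure are all 0.60.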

We finished training our model. Let's now evaluate the model on the test data. We make use of the dataset's handy `.map()` function to generate a summary of each sample of the test data.

### Loading test data

In [None]:
test_data = datasets.load_dataset("cnn_dailymail", "3.0.0", split="test")

In [None]:
# evaluate on the first 5,700 test examples - DELETE NEXT LINE TO USE THE FULL TEST SET
test_data = test_data.select(range(5700))

### Load the fine tuned model

In [None]:
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
bert2bertMK = EncoderDecoderModel.from_pretrained("/content/checkpoint-105000")
bert2bertMK.gradient_checkpointing_enable()
bert2bertMK.to("cuda")

EncoderDecoderModel(
  (encoder): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elemen

In [None]:
batch_size = 16 #originally 16

# map data correctly
def generate_summary(batch):
    # Tokenizer will automatically set [BOS] <text> [EOS]
    # cut off at BERT max length 512
    inputs = tokenizer(batch["article"], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs.input_ids.to('cuda')
    attention_mask = inputs.attention_mask.to('cuda')

    with autocast():
        outputs = bert2bertMK.generate(input_ids, attention_mask=attention_mask)

    # all special tokens will be removed
    output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    batch["pred"] = output_str

    return batch

results = test_data.map(generate_summary, batched=True, batch_size=batch_size, remove_columns=["article"])

pred_str = results["pred"]
label_str = results["highlights"]

rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid

print(rouge_output)

Map:   0%|          | 0/5700 [00:00<?, ? examples/s]

Score(precision=0.15225139568423074, recall=0.1674391018626793, fmeasure=0.15436972362742502)


### Show some sample summaries with the fine-tuned model

**Function to decode tensors**

In [None]:
def tensor_to_text(tensor_ids):
    # Convert tensor IDs to text
    text = tokenizer.decode(tensor_ids, skip_special_tokens=True)
    return text

In [None]:
def summarize_article(article, tokenizer, summarizer):
    inputs = tokenizer(article, return_tensors="pt", max_length=512, truncation=True)
    inputs = {key: val.to("cuda") for key, val in inputs.items()}  # Ensure inputs are on GPU
    summary_ids = summarizer.generate(
        inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=150,  # Adjust based on your summarization length
        num_beams=4,
        length_penalty=2.0,
        early_stopping=True,
        do_sample=True  # Set to True for more diverse summaries
    )
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

**Generating a few sample summaries with it**

In [None]:
news_articles = [tensor_to_text(example["input_ids"]) for example in train_data.select(range(3))]

Summarize the articles and store the results in the `fine_tuned_summaries` list. We set `do_sample` to `True` so that tokens are sampled during beam search instead of being chosen deterministically, which tends to give more varied, natural-sounding summaries.

In [None]:
fine_tuned_summaries = []
for article in news_articles:
    try:
        summary = summarize_article(article, tokenizer, bert2bertMK)
        fine_tuned_summaries.append(summary)
    except Exception as e:
        # record the actual failure instead of assuming a length problem
        fine_tuned_summaries.append(f"Error: {e}")

Create a dataframe containing the fine-tuned BERT summaries and the target summaries from our fine-tuning dataset. `target_summaries` holds the reference `highlights` of the same three training articles.

In [None]:
summaries = pd.DataFrame({
    'Fine-tuned bert2BERTMK  Summaries': fine_tuned_summaries,
    'Target Summaries': target_summaries
})

In [None]:
fine_tuned_summaries

['Harry Potter star Daniel Radcliffe earns $ 41. 1 million fortune. The 18 - year - old will be able to gamble in a casino or see the horror film " Hostel : Part II " Radcliffe\'s earnings from the first five Potter films have been held in a trust fund. He\'ll also appear in " December Boys " and " December Boy "',
 'Miami - Dade pretrial detention facility dubbed " the forgotten floor " Judge Steven Leifman says inmates with the most severe mental illnesses are incarcerated. Inmate : " I am the son of the president of the presidency " Prosecutors say one - third of inmates are mentally ill.',
 'NEW : " I probably had a 30 -, 35 - foot free fall, " survivor says. " I could see the whole bridge as it was going down, " a passenger says. Rescue efforts were controlled and organized, emergency room physician says. The Minnesota bridge fell all the way down, hitting a broken - off section.']

In [None]:
target_summaries

["Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .\nYoung actor says he has no plans to fritter his cash away .\nRadcliffe's earnings from first five Potter films have been held in trust fund .",
 'Mentally ill inmates in Miami are housed on the "forgotten floor"\nJudge Steven Leifman says most are there as a result of "avoidable felonies"\nWhile CNN tours facility, patient shouts: "I am the son of the president"\nLeifman says the system is unjust and he\'s fighting for change .',
 'NEW: "I thought I was going to die," driver says .\nMan says pickup truck was folded in half; he just has cut on face .\nDriver: "I probably had a 30-, 35-foot free fall"\nMinnesota bridge collapsed during rush hour Wednesday .']

In [None]:
summaries

Unnamed: 0,Fine-tuned bert2BERTMK Summaries,Target Summaries
0,"Harry Potter star Daniel Radcliffe earns $ 41. 1 million fortune. The 18 - year - old will be able to gamble in a casino or see the horror film "" Hostel : Part II "" Radcliffe's earnings from the first five Potter films have been held in a trust fund. He'll also appear in "" December Boys "" and "" December Boy """,Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .\nYoung actor says he has no plans to fritter his cash away .\nRadcliffe's earnings from first five Potter films have been held in trust fund .
1,"Miami - Dade pretrial detention facility dubbed "" the forgotten floor "" Judge Steven Leifman says inmates with the most severe mental illnesses are incarcerated. Inmate : "" I am the son of the president of the presidency "" Prosecutors say one - third of inmates are mentally ill.","Mentally ill inmates in Miami are housed on the ""forgotten floor""\nJudge Steven Leifman says most are there as a result of ""avoidable felonies""\nWhile CNN tours facility, patient shouts: ""I am the son of the president""\nLeifman says the system is unjust and he's fighting for change ."
2,"NEW : "" I probably had a 30 -, 35 - foot free fall, "" survivor says. "" I could see the whole bridge as it was going down, "" a passenger says. Rescue efforts were controlled and organized, emergency room physician says. The Minnesota bridge fell all the way down, hitting a broken - off section.","NEW: ""I thought I was going to die,"" driver says .\nMan says pickup truck was folded in half; he just has cut on face .\nDriver: ""I probably had a 30-, 35-foot free fall""\nMinnesota bridge collapsed during rush hour Wednesday ."


Our model reaches a ROUGE-2 F-measure of **15.44** on the first 5,700 test articles, which is about 2.8 points below the **18.22** that Patrick von Platen reports on Hugging Face.

# Task 4
Save the fine-tuned model

###  Save Fine-Tuned Model
Save the model and tokenizer to the 'model' folder in the current directory

In [None]:
bert2bert.save_pretrained('./model')

### Save Tokenizer

In [None]:
tokenizer.save_pretrained('./model')

('./model/tokenizer_config.json',
 './model/special_tokens_map.json',
 './model/vocab.txt',
 './model/added_tokens.json',
 './model/tokenizer.json')

### Split model into parts for GitHub upload

In [None]:
def split_file(file_path, chunk_size_mb=50):
    chunk_size = chunk_size_mb * 1024 * 1024  # Convert MB to bytes
    with open(file_path, 'rb') as f:
        chunk = f.read(chunk_size)
        i = 0
        while chunk:
            with open(f"{file_path}.part{i}", 'wb') as chunk_file:
                chunk_file.write(chunk)
            i += 1
            chunk = f.read(chunk_size)

In [None]:
model_dir = './model'

In [None]:
# Split only the .bin file
for filename in os.listdir(model_dir):
    if filename.endswith('.bin'):  # Only split .bin files
        file_path = os.path.join(model_dir, filename)
        split_file(file_path, chunk_size_mb=50)  # Split into 50 MB chunks
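
Whoever downloads the parts has to put them back together before loading the model. The sketch below shows one way this could be done. `join_file` is a hypothetical helper that is not part of the notebook, and the demo uses a tiny temporary file with a made-up name instead of the real weights. The local `split_file` takes a size in bytes so the demo stays small.

```python
import os
import re
import tempfile


def split_file(file_path, chunk_size):
    """Same idea as the notebook's split_file, with the chunk size in bytes."""
    with open(file_path, "rb") as f:
        i = 0
        chunk = f.read(chunk_size)
        while chunk:
            with open(f"{file_path}.part{i}", "wb") as chunk_file:
                chunk_file.write(chunk)
            i += 1
            chunk = f.read(chunk_size)


def join_file(file_path):
    """Rebuild file_path from file_path.part0, .part1, ... in numeric order."""
    directory = os.path.dirname(file_path) or "."
    pattern = re.compile(re.escape(os.path.basename(file_path)) + r"\.part(\d+)")
    parts = []
    for name in os.listdir(directory):
        match = pattern.fullmatch(name)
        if match:
            # Sort on the integer index so that part10 comes after part9.
            parts.append((int(match.group(1)), name))
    with open(file_path, "wb") as out:
        for _, name in sorted(parts):
            with open(os.path.join(directory, name), "rb") as part:
                out.write(part.read())
    return len(parts)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_weights.bin")  # made-up file name
    payload = bytes(range(25))
    with open(path, "wb") as f:
        f.write(payload)
    split_file(path, chunk_size=10)
    os.remove(path)              # keep only the parts, as in the repository
    n_parts = join_file(path)
    with open(path, "rb") as f:
        restored = f.read()

print(n_parts, restored == payload)  # 3 True
```

Sorting by the numeric suffix matters: a plain string sort would place `part10` before `part2` and corrupt the rebuilt file.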

In [None]:
zip_file_name = './model.zip'
shutil.make_archive(zip_file_name.replace('.zip', ''), 'zip', model_dir)

print(f"Model saved and zipped at {zip_file_name}")

Model saved and zipped at ./model.zip


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Now that our model is trained and saved, we can use it in our deployed application to summarize news articles in real time!

# References

**Bibliography**  
[1] Snowflake Inc. 2024. Connect Streamlit to Google Cloud Storage - Streamlit Docs. docs.streamlit.io. Retrieved July 19, 2024 from https://docs.streamlit.io/develop/tutorials/databases/gcs  
[2] Keras Team. 2024. Keras documentation: InceptionV3. keras.io. Retrieved July 19, 2024 from https://keras.io/api/applications/inceptionv3/  
[3] TensorFlow. 2024. Load video data | TensorFlow Core. TensorFlow. Retrieved July 19, 2024 from https://www.tensorflow.org/tutorials/load_data/video#create_frames_from_each_video_file  
[4] TensorFlow. 2024. tf.keras.applications.inception_v3.decode_predictions | TensorFlow v2.16.1. TensorFlow. Retrieved July 19, 2024 from https://www.tensorflow.org/api_docs/python/tf/keras/applications/inception_v3/decode_predictions