In [23]:
! pip install bert-extractive-summarizer datasets rouge-score

In [11]:
from summarizer import Summarizer
from datasets import load_metric

import numpy as np
import pandas as pd

from tqdm.auto import tqdm
tqdm.pandas()

# Data

In [7]:
df = pd.read_csv('/data/extractive_data.csv')
df.head()

Unnamed: 0,text,summary
0,European losses hit GM's profits\n\nGeneral Mo...,"For the whole of 2004, GM earned $3.7bn, down ..."
1,Yangtze Electric's profits double\n\nYangtze E...,"Yangtze Electric Power, the operator of China'..."
2,Why few targets are better than many\n\nThe ec...,That's why the Kok report recommends that the ...
3,Lufthansa flies back to profit\n\nGerman airli...,German airline Lufthansa has returned to profi...
4,Japanese growth grinds to a halt\n\nGrowth in ...,"The growth falls well short of expectations, b..."


# Model

In [6]:
model = Summarizer()

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [12]:
df['prediction'] = df.text.progress_map(model)

  0%|          | 0/2225 [00:00<?, ?it/s]
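According to the bert-extractive-summarizer README, the library embeds each sentence, clusters the embeddings, and keeps the sentences closest to the cluster centroids. The sketch below illustrates only that selection idea in plain Python. It uses made-up 2-D vectors in place of BERT embeddings, and `kmeans` and `select_sentences` are hypothetical helpers, not functions from the library.

```python
import math

def kmeans(points, k, iters=20):
    """Tiny k-means; the first k points seed the centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # Assign every point to its nearest centroid.
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members (empty clusters stay put).
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids

def select_sentences(sentences, embeddings, k):
    """Keep the sentence nearest to each centroid, in original document order."""
    chosen = set()
    for centroid in kmeans(embeddings, k):
        best = min(range(len(embeddings)),
                   key=lambda i: math.dist(embeddings[i], centroid))
        chosen.add(best)
    return [sentences[i] for i in sorted(chosen)]

# Toy data: two well-separated groups of three "sentence embeddings" each.
sentences = ["s0", "s1", "s2", "s3", "s4", "s5"]
embeddings = [[0, 0], [1, 0], [2, 0], [10, 10], [11, 10], [12, 10]]
summary = select_sentences(sentences, embeddings, k=2)
print(summary)
```

With real embeddings, the number of clusters plays the role that `ratio` or `num_sentences` plays when calling the model.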

# Comparison - dummy result

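As a naive baseline, we take everything up to the first period in each text as the summary. This is only a rough stand-in for the first sentence: the headline is included, and a decimal point such as the one in "4.7%" would also end the split.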

In [19]:
df['dummy'] = df.text.progress_map(lambda text: text.split('.')[0])

  0%|          | 0/2225 [00:00<?, ?it/s]
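The baseline splits on the first period, which is fragile. A slightly more careful variant could look like the sketch below. It assumes the texts follow a "headline, blank line, body" layout (as the rows shown in `df.head()` suggest), and `first_sentence` is a hypothetical helper, not part of the notebook above.

```python
import re

def first_sentence(text: str) -> str:
    """Return the first sentence of the body, skipping the headline if present."""
    # Assumed layout: "headline\n\nbody". Without a blank line, use the whole text.
    parts = text.split("\n\n", 1)
    body = parts[1] if len(parts) == 2 else parts[0]
    # End the sentence at ., ! or ? followed by whitespace,
    # so decimals such as "4.7%" or "$51.2bn" are left intact.
    match = re.search(r"(?<=[.!?])\s+", body)
    return body[:match.start()] if match else body.strip()

sample = ("European losses hit GM's profits\n\n"
          "GM's revenues rose 4.7% to $51.2bn. The company expects lower profits.")
print(first_sentence(sample))
print(sample.split('.')[0])  # the naive split, for comparison
```

Swapping this in would change the baseline scores, so the numbers reported in this notebook apply only to the original `split('.')` version.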

# Performance

In [36]:
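# ROUGE via datasets.load_metric (deprecated in later datasets releases); needs the rouge-score package.
metric = load_metric("rouge")

def rouge_scores(candidates, references):
    """Return the mid aggregate ROUGE F-measures as percentages, rounded to one decimal."""
    result = metric.compute(predictions=candidates, references=references, use_stemmer=True)
    # Each value holds low/mid/high aggregate scores; keep the mid F-measure, scaled to 0-100.
    result = {key: round(value.mid.fmeasure * 100, 1) for key, value in result.items()}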
    return result
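To make the reported numbers easier to read, here is a simplified sketch of what a ROUGE-1 F-measure captures: clipped unigram overlap between candidate and reference, combined into precision, recall, and their harmonic mean. It is an illustration only. It uses lowercased whitespace tokens and no stemming, so its values will not match the `rouge` metric used above, and `rouge1_f` is a hypothetical name.

```python
from collections import Counter

def rouge1_f(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F-measure on lowercased whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f("the cat sat", "the cat sat on the mat"))
```

In this example every candidate token appears in the reference (precision 1.0), but only half of the reference tokens are covered (recall 0.5), which is the typical trade-off for a short extractive summary.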

In [37]:
rouge_scores(df.dummy, df.summary)

{'rouge1': 24.0, 'rouge2': 18.4, 'rougeL': 21.5, 'rougeLsum': 21.9}

In [39]:
rouge_scores(df.prediction, df.summary)

{'rouge1': 50.6, 'rouge2': 37.9, 'rougeL': 36.5, 'rougeLsum': 37.2}

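Not bad! The BERT summarizer roughly doubles the baseline on ROUGE-1 (50.6 vs. 24.0) and ROUGE-2 (37.9 vs. 18.4).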

# Variations

In [33]:
print(model(df.text[0], ratio=0.3))

European losses hit GM's profits

General Motors (GM) saw its net profits fall 37% in the last quarter of 2004, as it continued to be hit by losses at its European operations. GM's revenues rose 4.7% to $51.2bn from $48.8bn a year earlier. GM reported solid overall results in 2004, despite challenging competitive conditions in many markets around the globe," GM chairman and chief executive Rick Wagoner said in a statement. The company recently announced that it expected profits in 2005 to be lower than in 2004.


In [35]:
# Estimate how many sentences to keep, then summarize with that count.
opt_k = model.calculate_optimal_k(df.text[0], k_max=10)
print(f'Optimal K: {opt_k}\n----------------')
print(model(df.text[0], num_sentences=opt_k))

Optimal K: 3
----------------
European losses hit GM's profits

General Motors (GM) saw its net profits fall 37% in the last quarter of 2004, as it continued to be hit by losses at its European operations. GM's revenues rose 4.7% to $51.2bn from $48.8bn a year earlier. GM reported solid overall results in 2004, despite challenging competitive conditions in many markets around the globe," GM chairman and chief executive Rick Wagoner said in a statement.
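Choosing an "optimal" number of clusters is commonly done with an elbow heuristic: compute the clustering inertia for each candidate k and pick the point where the curve bends most sharply. The sketch below shows one generic way to find that bend using second differences. It is not necessarily how `calculate_optimal_k` works internally, `elbow_k` is a hypothetical helper, and the inertia values are invented for illustration.

```python
def elbow_k(inertias):
    """inertias[i] is the inertia for k = i + 1; return the k at the sharpest bend."""
    if len(inertias) < 3:
        # Too few points to measure curvature; fall back to the largest k tried.
        return len(inertias)
    # Second difference at each interior point: how abruptly the improvement slows down.
    bends = [inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
             for i in range(1, len(inertias) - 1)]
    # bends[0] corresponds to k = 2, hence the offset.
    return bends.index(max(bends)) + 2

# Invented inertias for k = 1..6: big gains up to k = 3, small gains afterwards.
toy_inertias = [100, 60, 20, 15, 12, 10]
print(elbow_k(toy_inertias))
```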
