### Install packages

In [1]:
!pip install transformers
!pip install bert-extractive-summarizer



Collecting spacy==2.0.12
  Downloading spacy-2.0.12.tar.gz (22.0 MB)
[K     |████████████████████████████████| 22.0 MB 4.1 MB/s eta 0:00:01
[?25hCollecting cymem<1.32,>=1.30
  Downloading cymem-1.31.2.tar.gz (33 kB)
Collecting dill<0.3,>=0.2
  Downloading dill-0.2.9.tar.gz (150 kB)
[K     |████████████████████████████████| 150 kB 4.8 MB/s eta 0:00:01
[?25hCollecting murmurhash<0.29,>=0.28
  Downloading murmurhash-0.28.0.tar.gz (23 kB)
Collecting plac<1.0.0,>=0.9.6
  Downloading plac-0.9.6-py2.py3-none-any.whl (20 kB)
Collecting preshed<2.0.0,>=1.0.0
  Downloading preshed-1.0.1.tar.gz (112 kB)
[K     |████████████████████████████████| 112 kB 5.7 MB/s eta 0:00:01
[?25hCollecting regex==2017.4.5
  Downloading regex-2017.04.05.tar.gz (601 kB)
[K     |████████████████████████████████| 601 kB 4.2 MB/s eta 0:00:01
Collecting thinc<6.11.0,>=6.10.3
  Downloading thinc-6.10.3.tar.gz (1.2 MB)
[K     |████████████████████████████████| 1.2 MB 4.5 MB/s eta 0:00:01
[?25hCollecting ujson>=1.3

In [15]:
!pip install --upgrade transformers
!pip install torch

Requirement already up-to-date: transformers in /home/vasil/.local/lib/python3.8/site-packages (4.26.1)


### Imports and load dataset

In [2]:
from summarizer import Summarizer,TransformerSummarizer

  from .autonotebook import tqdm as notebook_tqdm


In [3]:
import numpy as np
import pandas as pd
import nltk
import re
df = pd.read_csv('../../WikiHow-Dataset/wikihowAll.csv', delimiter=',')

### Preprocessing

In [24]:
from nltk.tokenize import sent_tokenize, word_tokenize

def sent_tokenize_summaries(summary):
    summary = re.sub(r'[.]+[,]+[\n]', ".\n", summary)
    summary = re.sub(r'[\n]+', "", summary)
    return sent_tokenize(summary)

### Evaluation

In [5]:
from rouge import Rouge

def evaluate_rouge_score(reference_summaries, generated_summaries):
    rouge_scores = []
    for reference_summary, generated_summary in zip(reference_summaries, generated_summaries):
        reference_summary = ' '.join(reference_summary)
        generated_summary = ' '.join(generated_summary)
        rouge_scores.append(Rouge().get_scores(generated_summary, reference_summary, avg=True))
    return rouge_scores    


In [6]:
from nltk.translate.bleu_score import sentence_bleu

def evaluate_bleu_score(reference_summaries, generated_summaries):
    bleu_scores = []
    for reference_summary, generated_summary in zip(reference_summaries, generated_summaries):
        reference_summary = ' '.join(reference_summary)
        generated_summary = ' '.join(generated_summary)
        bleu_scores.append(sentence_bleu([reference_summary], generated_summary))
        
    return bleu_scores

### BERT summarizer

In [7]:
df['headline'][0]

'\nKeep related supplies in the same area.,\nMake an effort to clean a dedicated workspace after every session.,\nPlace loose supplies in large, clearly visible containers.,\nUse clotheslines and clips to hang sketches, photos, and reference material.,\nUse every inch of the room for storage, especially vertical space.,\nUse chalkboard paint to make space for drafting ideas right on the walls.,\nPurchase a label maker to make your organization strategy semi-permanent.,\nMake a habit of throwing out old, excess, or useless stuff each month.'

In [8]:
bert_model = Summarizer()
bert_summary = ''.join(bert_model(df['text'][0], min_length=60))
print(bert_summary)

reference_summaries = [sent_tokenize_summaries(df['headline'][0])]
generated_summaries = [sent_tokenize(bert_summary)]

print("----------------------------")
print(reference_summaries)
print("----------------------------")
print(generated_summaries)
print("----------------------------")
print(evaluate_rouge_score(reference_summaries, generated_summaries))

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


If you're a photographer, keep all the necessary lens, cords, and batteries in the same quadrant of your home or studio. , As visual people, a lot of artist clutter comes from a desire to keep track of supplies visually instead of tucked out of sight. Instead of spending all of your mental energy looking for or storing things, you can just follow the labels, freeing your mind to think about art., If it isn't essential or part of a project, either throw it out or file it away for later. Artists are constantly making new things, experimenting, and making a mess. This is a good thing, but only if you set aside time to declutter.
----------------------------
[['\nKeep related supplies in the same area.', 'Make an effort to clean a dedicated workspace after every session.', 'Place loose supplies in large, clearly visible containers.', 'Use clotheslines and clips to hang sketches, photos, and reference material.', 'Use every inch of the room for storage, especially vertical space.', 'Use cha

In [31]:
from transformers import pipeline

# Load the pre-trained BERT-based summarization model
summarizer = pipeline("summarization")

input_text = df['text'][1]

# Generate summary using the pre-trained BERT model
summary = summarizer(input_text, max_length=120, min_length=30, do_sample=False)

# Print the summary
print(summary[0]['summary_text'])
generated_summaries = [sent_tokenize_summaries(summary[0]['summary_text'])]
reference_summaries = [sent_tokenize_summaries(df['headline'][1])]

print(evaluate_rouge_score(reference_summaries, generated_summaries))

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


 Painting a NeoPopRealist mural requires a suitable location, with the right surface that can be painted . Painting a mural always requires some preparation. You‘ll need equipment and effort, but planning and attention to detail will help you succeed .
[{'rouge-1': {'r': 0.15294117647058825, 'p': 0.40625, 'f': 0.22222221824822855}, 'rouge-2': {'r': 0.032, 'p': 0.1111111111111111, 'f': 0.0496894375217008}, 'rouge-l': {'r': 0.15294117647058825, 'p': 0.40625, 'f': 0.22222221824822855}}]


In [16]:
def summarize(article):
    summary = summarizer(article, max_length=120, min_length=30, do_sample=False)
    return sent_tokenize_summaries(summary[0]['summary_text'])

In [34]:
len(word_tokenize(input_text))

702

In [26]:
headlines = []
articles = []
i = 0
for index, row in df.iterrows():
    abstract = row['headline']
    article = row['text']
    if i > 10:
        break
    if isinstance(article, str) and isinstance(abstract, str):
        if len(abstract) < (0.75 * len(article)) and len(word_tokenize(article)) < 800:
            # remove extra commas in abstracts
            abstract = re.sub(r'[.]+[,]+[\n]', ".\n", abstract)
            abstract = abstract.replace(".,", ".")
            # remove extra commas in articles
            article = re.sub(r'[.]+[\n]+[,]', ".\n", article)
            
            headlines.append(abstract)
            articles.append(article)
            i+=1
print("Total number of documents: ", i)

Total number of documents:  11


In [27]:
from transformers import pipeline

reference_summaries = [sent_tokenize_summaries(summary) for summary in headlines]

print("Generating summaries")
summarizer = pipeline("summarization")
generated_summaries = [summarize(text) for text in articles]

print("Evaluating rouge scores")
# Evaluate the generated summaries using the ROUGE score
rouge_scores = evaluate_rouge_score(reference_summaries, generated_summaries)

total_precision_1 = 0

total_recall_1 = 0

total_f_1 = 0

total_precision_l = 0

total_recall_l = 0

total_f_l = 0

for k in rouge_scores:
    total_precision_1 += k['rouge-1']['p']
    total_recall_1 += k['rouge-1']['r']
    total_f_1 += k['rouge-1']['f']
    total_precision_l += k['rouge-l']['p']
    total_recall_l += k['rouge-l']['r']
    total_f_l += k['rouge-l']['f']

print('Average Rouge-1 score precision:', total_precision_1 / i)
print('Average Rouge-1 score recall:', total_recall_1 / i)
print('Average Rouge-1 score f :', total_f_1 / i)

print('Average Rouge-l score precision:', total_precision_l / i)
print('Average Rouge-l score recall:', total_recall_l / i)
print('Average Rouge-l score f :', total_f_l / i)


print("Evaluating BLEU scores")
# Evaluate the generated summaries using the BLEU score
bleu_scores = evaluate_bleu_score(reference_summaries, generated_summaries)

# Average BLEU score
avg_bleu_score = sum(bleu_scores) / len(bleu_scores)
print("Average BLEU score:", avg_bleu_score)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


Generating summaries
Evaluating rouge scores
Average Rouge-1 score precision: 0.2582834356253031
Average Rouge-1 score recall: 0.2356337252095572
Average Rouge-1 score f : 0.2299792950322822
Average Rouge-l score precision: 0.2513941256527422
Average Rouge-l score recall: 0.23140426483036416
Average Rouge-l score f : 0.22474324734620088
Evaluating BLEU scores
Average BLEU score: 0.24368166918165093


In [19]:
print(generated_summaries)

[[" As you start planning for a project or work, you'll likely be gathering scraps of inspiration and test sketches .", 'Organizing your work and progress frees your mind to actually be creative, instead of worrying about logistics .'], [' Your reel is a short video showcasing the breadth and depth of your skills as an artist .', 'If you are enrolled in a VFX program, talk to your career counselors to see what opportunities might be available .', 'Reach out to studios to see if they have any spots for paid or unpaid interns .'], [' Most VFX work situations require that you communicate with a number of people as you complete a task .', 'Join an industry group, such as the Visual Effects Society (VES) Follow their activities and attend events when you can .']]


In [20]:
print(reference_summaries)

[['\nKeep your reference materials, sketches, articles, photos, etc, in one easy to find place.', 'Make "studies," or practice sketches, to organize effectively for larger projects.', 'Limit the supplies you leave out to the project at hand.', 'Keep an updated list of all of the necessary supplies, and the quantities of each.', 'Break down bigger works into more easily completed parts.'], ['\nCreate a compelling reel or portfolio.', 'Land an internship.', 'Consider self-employment.', 'Sign on with a design company or studio.', 'Move up to a supervisor position.'], ['\nJoin a professional society.', 'Enjoy working with a team.', 'Expect long work hours.', 'Spend time on a TV or film set.']]
