<a href="https://colab.research.google.com/github/kevinetienne/liveprojectnlp/blob/main/2_summarise_workflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summarize news articles with NLP

## 2. Extracting Summaries from the Stories
### 2.1 Preprocessing the Stories and Their Summaries



In [3]:
import pickle


# main_list holds a list of dicts, one per story
# main_list = [{"news": "...", "highlights": ["..."]}]
main_list = []
with open("workflow_1.pkl", "rb") as f:
    main_list = pickle.load(f)

len(main_list)

92579

In [24]:
# Remove punctuation and digits

import string


chars_to_remove = string.punctuation + string.digits
table = str.maketrans(dict.fromkeys(chars_to_remove))

def preprocess(text):
    """Preprocess the text.

    Remove leading and trailing whitespace with `strip`.
    Use `translate` to remove punctuation and digits.
    Then use `lower` to convert the text to lowercase.
    Finally, remove every occurrence of "cnn".
    """
    return text.strip().translate(table).lower().replace("cnn", "")

preprocessed_main_list = []
for story in main_list:
    # split the story
    sentences = story["news"].split(".")
    preprocessed_news = []
    for s in sentences:
        s_cleaned = preprocess(s)
        if len(s_cleaned) > 5:
            preprocessed_news.append(s_cleaned)
    preprocessed_highlights = [preprocess(h) for h in story["highlights"] if h]
    preprocessed_story = {
        "news": preprocessed_news,
        "highlights": preprocessed_highlights,
    }
    preprocessed_main_list.append(preprocessed_story)

# check list is still the same size
len(preprocessed_main_list)

92579
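To see what `preprocess` does to a single sentence, here is a small standalone sketch. It rebuilds the same translation table; the sample sentence is made up for illustration and is not taken from the dataset.

```python
import string

# Map every punctuation character and digit to None so translate() deletes it.
chars_to_remove = string.punctuation + string.digits
table = str.maketrans(dict.fromkeys(chars_to_remove))


def preprocess(text):
    """Strip, drop punctuation and digits, lowercase, and remove "cnn"."""
    return text.strip().translate(table).lower().replace("cnn", "")


# Hypothetical input, not from the dataset.
sample = "  (CNN) The 12-year-old's team won 3 games!  "
cleaned = preprocess(sample)
print(repr(cleaned))
```

Because `strip` runs before the characters are deleted, a leading space can survive, and a deleted digit leaves a double space behind. This is the likely source of the double spaces visible in the preprocessed output, such as `the  newbery`.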

In [25]:
# check first two items
preprocessed_main_list[:2]

[{'news': ['your guide to the best new childrens and young adult literature is here',
   'the winners of the  newbery caldecott printz coretta scott king and other prestigious youth media awards were announced monday morning by the american library association',
   'in addition to books these awards highlight videos and other creative materials produced for children over the past year',
   'diverse authors and titles from  such as the crossover by kwame alexander and brown girl dreaming by jacqueline woodson were highlighted with many awards causing the audience to cheer the choices by the committees',
   'the caldecott medal went to the adventures of beekle the unimaginary friend illustrated and written by dan santat which follows the journey of an imaginary friend in search of his perfect match',
   'the newbery medal was awarded to the crossover by kwame alexander a story about family and brotherhood told through verse by yearold twin basketball players josh and jordan bell',
   'pa

### 2.2 Extracting Summaries from the Stories with ROUGE Score

Install rouge

In [None]:
!pip install rouge

Compute the ROUGE-1 F1 score of every news sentence (hypothesis) against each highlight (reference)
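As a rough illustration of what a ROUGE-1 F1 score measures, the sketch below computes unigram-overlap precision, recall, and F1 by hand. It assumes whitespace tokenization and clipped word counts; the `rouge` package may tokenize or smooth differently, so its numbers can differ slightly. The function name `rouge1_f1` and the example sentences are hypothetical.

```python
from collections import Counter


def rouge1_f1(hypothesis, reference):
    """Unigram-overlap F1 between two whitespace-tokenized strings."""
    hyp_counts = Counter(hypothesis.split())
    ref_counts = Counter(reference.split())
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 4))
```

Here five of the six words overlap on each side, so precision and recall are both 5/6 and the F1 score is also 5/6.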

In [26]:
import time

from rouge import Rouge


rouge = Rouge()

# for each sentence in the story (hypothesis) and each highlight (reference), calculate the rouge score.
results = []
size = len(preprocessed_main_list)
tic = time.perf_counter()
for idx, story in enumerate(preprocessed_main_list):
    if ((idx + 1) % 1000 == 0):
        toc = time.perf_counter()
        t = toc - tic
        print("Processing {}/{} ({} secs)".format(idx + 1, size, t))
        tic = time.perf_counter()
    hypothesis = list(story["news"])
    hypothesis_size = len(hypothesis)
    story_highlight_scores = {}

    for reference in story["highlights"]:
        scores = rouge.get_scores(hypothesis, [reference] * hypothesis_size)

        # collect only the ROUGE-1 f1 score of each sentence
        scoresf1 = [score["rouge-1"]["f"] for score in scores]
        # rank sentence positions by score; ranking positions instead of score
        # values keeps tied scores mapped to their own sentences and avoids
        # reusing the outer loop variable `idx`
        top5_ids = sorted(range(hypothesis_size), key=scoresf1.__getitem__, reverse=True)[:5]
        story_highlight_scores[reference] = [{hypothesis[i]: scoresf1[i]} for i in top5_ids]

    results.append(story_highlight_scores)

Processing 1000/92579 (64.05570191101287 secs)
Processing 2000/92579 (55.00178514199797 secs)
Processing 3000/92579 (34.859819326986326 secs)
Processing 4000/92579 (38.42961579200346 secs)
Processing 5000/92579 (39.32746518799104 secs)
Processing 6000/92579 (38.90023352098069 secs)
Processing 7000/92579 (39.837506090989336 secs)
Processing 8000/92579 (39.3243066839932 secs)
Processing 9000/92579 (38.42691326097702 secs)
Processing 10000/92579 (38.97155493500759 secs)
Processing 11000/92579 (38.092573325004196 secs)
Processing 12000/92579 (39.972597640997265 secs)
Processing 13000/92579 (37.486394456995185 secs)
Processing 14000/92579 (37.77527989799273 secs)
Processing 15000/92579 (40.63450039498275 secs)
Processing 16000/92579 (40.920778077997966 secs)
Processing 17000/92579 (37.13438914300059 secs)
Processing 18000/92579 (42.85024805399007 secs)
Processing 19000/92579 (44.72600366099505 secs)
Processing 20000/92579 (45.31561582698487 secs)
Processing 21000/92579 (38.8217434649996 sec
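Selecting the top-scoring sentences for one highlight can also be sketched with `heapq.nlargest` over sentence positions. This is an illustrative alternative, not the notebook's code; the helper name `top_k_sentences` and the toy scores are assumptions.

```python
import heapq


def top_k_sentences(sentences, scores, k=5):
    """Return up to k (sentence, score) pairs with the highest scores."""
    # Rank positions rather than score values, so tied scores keep their own sentence.
    best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return [(sentences[i], scores[i]) for i in best]


# Hypothetical sentences and F1 scores for illustration.
sentences = ["s0", "s1", "s2", "s3"]
scores = [0.2, 0.7, 0.7, 0.1]
top2 = top_k_sentences(sentences, scores, k=2)
print(top2)
```

Working with positions matters when two sentences share the same score: looking a score up with `list.index` would return the first matching sentence both times.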

Display the scores for the first two stories

In [27]:
results[:2]

[{'the winners of the newbery caldecott printz and other prestigious awards were announced': [{'the winners of the  newbery caldecott printz coretta scott king and other prestigious youth media awards were announced monday morning by the american library association': 0.6842105218144046},
   {'the ala youth media awards were announced during the organizations winter meeting in chicago and selected by a national judging committee of librarians and childrens literature experts': 0.34999999561250006},
   {'in addition to books these awards highlight videos and other creative materials produced for children over the past year': 0.2499999951757813},
   {'the caldecott medal went to the adventures of beekle the unimaginary friend illustrated and written by dan santat which follows the journey of an imaginary friend in search of his perfect match': 0.2222222181135803},
   {'diverse authors and titles from  such as the crossover by kwame alexander and brown girl dreaming by jacqueline woodson 

Prepare the list for the dataframe

In [50]:
# story id, sentence id, sentence, summary candidate
data = []

for story_id, story in enumerate(preprocessed_main_list):
    for sentence_id, sentence in enumerate(story["news"]):
        # a sentence is a summary candidate if it appears in the top 5 scores
        # of at least one reference (highlight)
        summary_candidate = 0
        for top5_scores in results[story_id].values():
            top5_sentences = {k for d in top5_scores for k in d}
            if sentence in top5_sentences:
                summary_candidate = 1
                break
        # append one row per sentence, not one row per highlight
        data.append([story_id, sentence_id, sentence, summary_candidate])

# check first two rows
data[:2]

[[0,
  0,
  'your guide to the best new childrens and young adult literature is here',
  0],
 [0,
  1,
  'the winners of the  newbery caldecott printz coretta scott king and other prestigious youth media awards were announced monday morning by the american library association',
  1]]
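The labeling rule (a sentence is a summary candidate when it is among the top-scoring sentences for at least one highlight) can be sketched on toy data. The helper `label_sentences` and the sample story are hypothetical; the score structure mirrors one entry of `results`.

```python
def label_sentences(story_sentences, story_scores):
    """Return [sentence_id, sentence, label] rows for one story."""
    # Union of the top-scoring sentences across all highlights of the story.
    candidates = {
        sentence
        for top_scores in story_scores.values()
        for entry in top_scores
        for sentence in entry
    }
    return [
        [sentence_id, sentence, int(sentence in candidates)]
        for sentence_id, sentence in enumerate(story_sentences)
    ]


# Hypothetical story with two highlights.
story_sentences = ["alpha beta", "gamma delta", "epsilon zeta"]
story_scores = {
    "highlight one": [{"alpha beta": 0.9}],
    "highlight two": [{"epsilon zeta": 0.8}],
}
rows = label_sentences(story_sentences, story_scores)
print(rows)
```

Building the union once per story yields exactly one row per sentence, however many highlights the story has.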

Build the dataframe

In [51]:
import pandas as pd


df = pd.DataFrame(data, columns=["story_id", "sentence_id", "sentence", "summary_candidate"])

# check first story for summary candidates
df[(df["story_id"] == 0) & (df["summary_candidate"] == 1)]

| | story_id | sentence_id | sentence | summary_candidate |
|---|---|---|---|---|
| 1 | 0 | 1 | the winners of the newbery caldecott printz c... | 1 |
| 2 | 0 | 2 | in addition to books these awards highlight vi... | 1 |
| 3 | 0 | 3 | diverse authors and titles from such as the c... | 1 |
| 4 | 0 | 4 | the caldecott medal went to the adventures of ... | 1 |
| 7 | 0 | 7 | the ala youth media awards were announced duri... | 1 |


Save in pickle format

In [52]:
import pickle


# serialize the dataframe straight into the file; the with block closes it
with open("workflow_2.pkl", "wb") as f:
    pickle.dump(df, f)
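A quick way to gain confidence in a saved pickle is a round-trip check: write the object, read it back, and compare. The sketch below uses a plain list of rows and a temporary directory as stand-ins; the file name `example.pkl` and the rows are hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical rows standing in for the real data.
rows = [[0, 0, "first sentence", 0], [0, 1, "second sentence", 1]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pkl")
    # Write the object, then read it back from the same file.
    with open(path, "wb") as f:
        pickle.dump(rows, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == rows)
```

For a dataframe, the same idea would apply with a dataframe-aware comparison instead of `==`.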