# Label Data For Summarization
This notebook uses data from the ["cnn_dailymail" dataset from Tensorflow Dataset](https://www.tensorflow.org/datasets/catalog/cnn_dailymail).

The goal of this project is to simplify the summarization task. Instead of generating abstract summaries, I will instead extract sentences which best summarize the document. To define how to best summarize the document I use the rouge1 score. For this I am using the rouge-scorer python library. 

To classify the sentences as part of the summary or not, this notebook uses a greedy algorithm. Like described in the paper below. It first looks at the rouge score for each sentence in the document compared with the summary. It selects the one that improves the rouge score the most. Then it tries to find another sentence that further improves this metric. It does this until it has found N or the rouge metric is not improving. 

The CNN/Daily Mail dataset is split into train/validation/test subsets. This notebook will create `<name>.labeled.parquet` files for each subset. The data comes from Tensorflow Datasets [cnn_dailymail](https://www.tensorflow.org/datasets/catalog/cnn_dailymail)

This can take a while to run, mainly because of the rouge score optimization. The dataset has around 300 thousand examples which takes about 2 hours to process on a 2019 8 core MacBook Pro. 

This can also creates data sets for 3 different sentence tokenizers so we can evaluate if there is a better option. They are Spark NLP, Spacy and NLTK. See the `sent_splitters` dict to set which ones you want to use. 

The idea for labeling like this comes from this paper:

SummaRuNNer: A Recurrent Neural Network based Sequence Model for Extractive Summarization of Documents

Ramesh Nallapati, Feifei Zhai, Bowen Zhou

https://arxiv.org/pdf/1611.04230.pdf


In [None]:
# Max number of sentences per article
# Using 1 for now 
MAX_SUMMARY_SENTENCES_PER_ARTICLE = 1

In [5]:
from rouge_score import rouge_scorer
import spacy
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
import pandas as pd
import tqdm
import pyarrow
import os
import multiprocessing
from multiprocessing import Pool
from pyspark.sql import SparkSession
import nltk
nltk.download('punkt')
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import SentenceDetector
from pyspark.ml import Pipeline
import tabulate


[nltk_data] Downloading package punkt to /Users/jzeimen/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [6]:
def label_example(sents, summary, scorer, n=MAX_SUMMARY_SENTENCES_PER_ARTICLE):
    """Greedily find the three sentences that optimize the Rouge1 fmeasure score """
    best_sents = []
    best_sents_concat = ""
    best_rouge = 0
    best_rouge_precision = 0
    best_rouge_recall = 0
    for i in range(0,n):
        next_best_sent = 0
        found_better_sent = False
        for index, sent in enumerate(sents):
            if index in best_sents:
                continue
            score = scorer.score(sent + " " + best_sents_concat,summary)
            fmeasure = score['rouge1'].fmeasure
            if fmeasure >= best_rouge:
                best_rouge = fmeasure
                best_rouge_precision = score['rouge1'].precision
                best_rouge_recall = score['rouge1'].recall
                found_better_sent=True
                next_best_sent = index

        if found_better_sent:
            best_sents.append(next_best_sent)
            best_sents_concat += " " + sents[next_best_sent]
        else:
            break
    label = [0] * len(sents)
    for i in best_sents:
        label[i] = 1
    return label, best_rouge, best_rouge_precision, best_rouge_recall

def start_spark():
    builder = SparkSession.builder \
        .appName("Spark NLP") \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "1000M") \
        .config("spark.driver.maxResultSize", "20G") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.6.3") # Change this updating spark-nlp
    return builder.getOrCreate()

def split_sentences_spacy(df):
    nlp = spacy.load('en_core_web_sm')
    df['sentences'] = df['article'].map(lambda x: [str(i) for i in list(nlp(x, disable=['tagger', 'ner']).sents)])
    return df

def split_sentences_nltk(df):
    df['sentences'] = df['article'].map(lambda x: nltk.tokenize.sent_tokenize(x))
    return df

def split_sentences_spark_nlp(df):
    spark = start_spark()
    sdf = spark.createDataFrame(df)
    da = DocumentAssembler().setInputCol('article')
    sentenceDetector = SentenceDetector().setInputCols(['document']).setOutputCol('sentences').setExplodeSentences(False)
    fin = Finisher().setInputCols(['sentences']).setOutputCols(['sentences'])
    pipeline = Pipeline(stages=[da, sentenceDetector, fin])
    model = pipeline.fit(sdf)
    return model.transform(sdf).toPandas()

def label_sentences_row(row):
    """Create a label for each sentence in df.sentences"""
    scorer = rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)
    row['labels'], row['rouge_fmeasure'], row['rouge_precision'], row['rouge_recall'] = label_example(row['sentences'], row['summary'],scorer)
    return row

def label_sentences(df):
    return df.apply(label_sentences_row, axis=1)


## Uncomment the one you want to process. Since I have found spacy to work best for my purposes
# I just leave that uncommented
sent_splitters = {
    #'spark_nlp': split_sentences_spark_nlp,
    'spacy_fmeasure-1': split_sentences_spacy
    #'nltk': split_sentences_nltk
}

In [7]:
def process_tfds(dataset, sentence_splitter_name, n_cores=multiprocessing.cpu_count()):
    """Turns Tensorflow Dataset into labeled pandas dataframe using multiprocessing for speedup"""
    n_jobs=100
    examples = [{'article': i['article'].numpy().decode(), 'summary':i['highlights'].numpy().decode()} for i in dataset]
    if(len(examples) > 200000):
        n_jobs=2000
    print("{} examples to process in {} batches. (~{} examples per batch)".format(len(examples), n_jobs, len(examples)//n_jobs))
    print("Tokenizing Sentences")
    df = pd.DataFrame(examples)
    if sentence_splitter_name == 'spark_nlp':
        df = sent_splitters[sentence_splitter_name](df).copy()
    else:
        pool = Pool(n_cores)
        df = pd.DataFrame(examples)
        df_split = np.array_split(df, n_jobs)
        df = pd.concat(tqdm.tqdm(pool.imap_unordered(sent_splitters[sentence_splitter_name], df_split), total=n_jobs, unit="batch", smoothing=0))
        pool.close()
        pool.join()
        
    
    print("Labeling Sentences")
    pool = Pool(n_cores)
    df_split = np.array_split(df, n_jobs)
    df = pd.concat(tqdm.tqdm(pool.imap_unordered(label_sentences, df_split), total=n_jobs, unit="batch", smoothing=0))
    pool.close()
    pool.join()
    return df

In [8]:
datasets = tfds.load("cnn_dailymail")

In [9]:
for name in sent_splitters.keys():
    print("Using {} to tokenize sentences".format(name))
    if not os.path.exists(name):
        os.mkdir(name)
    for key in datasets.keys():
        file_name = os.path.join(name, key + ".labeled.parquet")
        if os.path.exists(file_name):
            print("{} dataset is already done".format(key))
            continue
        print("Processing {} dataset".format(key))
        df = process_tfds(datasets[key], name)
        df.to_parquet(file_name)

Using spacy_fmeasure-1 to tokenize sentences
Processing test dataset
11490 examples to process in 100 batches. (~114 examples per batch)
Tokenizing Sentences


100%|██████████| 100/100 [01:08<00:00,  1.47batch/s]


Labeling Sentences


100%|██████████| 100/100 [01:21<00:00,  1.23batch/s]


Processing train dataset
287113 examples to process in 2000 batches. (~143 examples per batch)
Tokenizing Sentences


100%|██████████| 2000/2000 [35:40<00:00,  1.07s/batch]


Labeling Sentences


100%|██████████| 2000/2000 [35:46<00:00,  1.07s/batch]


Processing validation dataset
13368 examples to process in 100 batches. (~133 examples per batch)
Tokenizing Sentences


100%|██████████| 100/100 [01:30<00:00,  1.11batch/s]


Labeling Sentences


100%|██████████| 100/100 [01:31<00:00,  1.10batch/s]


In [6]:
pd.read_parquet("spark_nlp/test.labeled.parquet")

Unnamed: 0,article,summary,sentences,labels,rouge
2,Dougie Freedman is on the verge of agreeing a ...,Nottingham Forest are close to extending Dougi...,[Dougie Freedman is on the verge of agreeing a...,"[0, 1, 1, 0, 0, 1]",0.575758
3,Liverpool target Neto is also wanted by PSG an...,Fiorentina goalkeeper Neto has been linked wit...,[Liverpool target Neto is also wanted by PSG a...,"[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]",0.675000
0,Ever noticed how plane seats appear to be gett...,Experts question if packed out planes are put...,[Ever noticed how plane seats appear to be get...,"[0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]",0.735294
1,A drunk teenage boy had to be rescued by secur...,Drunk teenage boy climbed into lion enclosure ...,[A drunk teenage boy had to be rescued by secu...,"[1, 0, 1, 1]",0.972222
6,The amount of time people spend listening to B...,Figures show that while millions still tune in...,[The amount of time people spend listening to ...,"[0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.505495
...,...,...,...,...,...
197,I yield to no one in my love of the old days —...,The weekend saw BBC's FA Cup coverage compete ...,[I yield to no one in my love of the old days ...,"[1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.627907
198,An Israeli tourist captured the hilarious mome...,Video was captured at the Ngorongoro Conservat...,[An Israeli tourist captured the hilarious mom...,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, ...",0.728814
199,The Tennessee Supreme Court postponed executio...,Tennessee Supreme Court vacates execution date...,[The Tennessee Supreme Court postponed executi...,"[0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.878049
180,Robbie Knievel has been arrested for allegedly...,"The daredevil, 52, 'was speeding in an SUV whe...",[Robbie Knievel has been arrested for allegedl...,"[1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0.793103


In [10]:
import matplotlib.pyplot as plt
for splitter in sent_splitters.keys():
    print(splitter)
    df = pd.read_parquet(os.path.join(splitter,"train.labeled.parquet"))
    print("Mean number of sentences: {}".format(df.sentences.map(len).mean()))
    print("Mean rouge precision: {}".format(df.rouge_precision.mean()))
    print("Mean rouge recall: {}".format(df.rouge_recall.mean()))
    print("Mean rouge F1 score: {}".format(df.rouge_fmeasure.mean()))
    print("")

spacy_fmeasure-1
Mean number of sentences: 42.84176961684076
Mean rouge precision: 0.34828482673063893
Mean rouge recall: 0.6143906642289676
Mean rouge F1 score: 0.42733644090887524



In [59]:
spark_nlp_df = pd.read_parquet(os.path.join("spark_nlp","train.labeled.parquet"))
spacy_df = pd.read_parquet(os.path.join("spacy", "train.labeled.parquet"))

In [60]:
def show_example(index):
    summary = spark_nlp_df.iloc[index].summary
    spark_sentences = spark_nlp_df.iloc[index].sentences
    spacy_sentences = spacy_df[spacy_df.summary == summary].iloc[0].sentences
    return tabulate.tabulate({"spark nlp sentences":spark_sentences, "spacy sentences": spacy_sentences}, headers="keys", tablefmt='html')

Below is an example where Spark NLP doesn't do as good of a job at parsing the sentences out. It will tend to think things in quotes should be kept in 1 sentence. Those long quote's are pretty common and will make for poor summarization so this leads me to go with spacy's sentence tokenization.

In [61]:
show_example(10)

spark nlp sentences,spacy sentences
A new plane seat design looks set to revolutionise long haul flights.,A new plane seat design looks set to revolutionise long haul flights.
"Designed for use in premium economy and business class, at the flick of a switch, the seat can be transformed from single upright seat, to four-foot 'couch' and even into a full length double bed.","Designed for use in premium economy and business class, at the flick of a switch, the seat can be transformed from single upright seat, to four-foot 'couch' and even into a full length double bed."
"It will allow young passengers in premium economy to lie flat, and will stagger seating to avoid the dreaded elbow clash - a very common complaint on long-haul flights.","It will allow young passengers in premium economy to lie flat, and will stagger seating to avoid the dreaded elbow clash - a very common complaint on long-haul flights."
"Hong Kong designer James SH Lee of Paperclip Design, came up with the concept, named Butterfly seating, which recently one first prize at IATA's 2014 Passenger Innovation Awards.","Hong Kong designer James SH Lee of Paperclip Design, came up with the concept, named Butterfly seating, which recently one first prize at IATA's 2014 Passenger Innovation Awards."
A business class seat in upright mode.,A business class seat in upright mode.
Passengers get a spacious private suite with a seat plus a side couch and direct aisle access .,Passengers get a spacious private suite with a seat plus a side couch and direct aisle access .
The idea of Butterfly seating is to allow for individual demand of the cabin on each individual flight - despite the limited space on an aircraft.,The idea of Butterfly seating is to allow for individual demand of the cabin on each individual flight - despite the limited space on an aircraft.
"While only the business class passengers will be able to transform their seats into a full length bed, this is the first time that premium economy travellers will have access to a 'couch' allowing younger passengers the chance to sleep more comfortably.","While only the business class passengers will be able to transform their seats into a full length bed, this is the first time that premium economy travellers will have access to a 'couch' allowing younger passengers the chance to sleep more comfortably."
"Lee told MailOnline Travel: 'This flexibility allows airlines to make use of resource more efficiently so that the cost is lowered. The layout maximises bed space, with the sleeping surface utilizing nearly every inch of available floor area . 'They can also react to fluctuations much quicker than before, making them much more resilient to the risks of changing market conditions.' 'For passengers this equates to more stable fares long term, and the flexibility means more options for them. 'In the past if the business class cabin is full, then it's sold out. But with butterfly they can turn some premium economy seats into business class, and vice versa, if there's a need. At 53cm, the seating will be as wide as many current business class seats in a 777 sized cabin. Fliers in business class can transform the seat into a large bed, accommodating various sleeping positions . In premium economy, young passengers will be able to lie flat as their seat can be transformed into a 'couch' The seats will also feature large cocktail trays, seat pockets on the side and an adjustable ottoman.",Lee told MailOnline Travel: 'This flexibility allows airlines to make use of resource more efficiently so that the cost is lowered.
"Meanwhile, passengers in business class will be able to transform the seat into one of the largest bed surfaces currently available.","The layout maximises bed space, with the sleeping surface utilizing nearly every inch of available floor area . '"


In this example below the spacy tokenizer would tokenize thigns after colons, which doesn't really make sense. 

In [62]:
show_example(6)

spark nlp sentences,spacy sentences
"Taxi company Uber's low-cost carpooling service, UberPOP, is set to be banned in France from January next year, the government said.","Taxi company Uber's low-cost carpooling service, UberPOP, is set to be banned in France from January next year, the government said."
The ruling comes after hundreds of taxi drivers blocked roads around Paris to protest what they claim are its unfair business practices.,The ruling comes after hundreds of taxi drivers blocked roads around Paris to protest what they claim are its unfair business practices.
"Drivers blocked the roads heading from the Roissy Charles de Gaulle airport, then inched toward the French capital in their latest protest of the ride-sharing company.","Drivers blocked the roads heading from the Roissy Charles de Gaulle airport, then inched toward the French capital in their latest protest of the ride-sharing company."
"Un appy: Uber's low-cost carpooling service, UberPOP, will be banned in France from January 1 .",Un appy:
"The new law tightening regulations for chauffeured rides will effectively ban the UberPOP service as of January 1st, Pierre-Henry Brandet, spokesman for France's Interior Ministry, said.","Uber's low-cost carpooling service, UberPOP, will be banned in France from January 1 ."
"'Currently, people who use UberPop are not protected if there is an accident. So not only is it illegal to offer this service but for the consumer there is a real danger,' Brandet told the BFM television network. France is the latest of several places where Uber has faced challenges to its service, which matches people seeking rides with drivers through a cellphone app. Traditional taxis say Uber has an unfair advantage because its drivers don't face the same requirements, insurance and taxes. On Friday, a French court stopped short of banning the company but ordered Uber to make changes, including omitting 'all mention suggesting it is legal' for its drivers to act like taxis — that is, driving around and waiting for clients.","The new law tightening regulations for chauffeured rides will effectively ban the UberPOP service as of January 1st, Pierre-Henry Brandet, spokesman for France's Interior Ministry, said. '"
French motorcycle police escort striking Paris taxis which take part in a demonstration over the Paris ring road heading into the capital from the Roissy Charles de Gaulle airport .,"Currently, people who use UberPop are not protected if there is an accident."
"Parisian taxi drivers are fed up with what they see as unfair competition from Uber's UberPOP, which uses non-professional drivers using their own cars to take on passengers at budget rate .","So not only is it illegal to offer this service but for the consumer there is a real danger,'"
New rules: A French court stopped short of banning the company but ordered Uber to make changes .,Brandet told the BFM television network.
"This comes after Uber services were banned in Spain, Holland and the Indian capital New Delhi just last week.","France is the latest of several places where Uber has faced challenges to its service, which matches people seeking rides with drivers through a cellphone app."
