# Extractive and Abstractive Text Summarisation

Text summarization can be done using multiple ways however all the algorithms can be categorized under extractive or abstractive methods. Extractive methods essentially use the same words in the text while Abstractive methods attempt to extract content which may not be directly present in the text. I have attempted to use text rank algorithm for extractive methods and LSTM based encoder decoder architecture for abstractive methods.
The data set used for text extraction is a Kaggle dataset available at https://www.kaggle.com/sunnysai12345/news-summary

Please note that exractive text summariser takes about 20 minutes to generate the summary however abstractive deep learning summariser takes close to 24 hours to complete.

In [1]:
# import required packages
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import re
from nltk.corpus import stopwords
from collections import defaultdict
from string import punctuation
from heapq import nlargest
import gensim
from keras.layers.recurrent import LSTM
from keras.layers.embeddings import Embedding
from keras.layers import Dense, Activation, RepeatVector, Input, \
concatenate, Permute, Flatten, Multiply, TimeDistributed, Dropout
from keras.models import Sequential
from keras.callbacks import LambdaCallback
from keras.models import Model
import math
import keras.backend as K

Using TensorFlow backend.


In [2]:
# dataset downloaded from kaggle - https://www.kaggle.com/sunnysai12345/news-summary
df_small = pd.read_csv('./news_summary.csv', encoding='latin1')
# utilize only first 700 rows of dataframe because of long execution time.
df_small = df_small.head(700)
df_small.head()

Unnamed: 0.1,Unnamed: 0,author,date,headlines,read_more,text,ctext
0,0,Chhavi Tyagi,"03 Aug 2017,Thursday",Daman & Diu revokes mandatory Rakshabandhan in...,http://www.hindustantimes.com/india-news/raksh...,The Administration of Union Territory Daman an...,The Daman and Diu administration on Wednesday ...
1,1,Daisy Mowke,"03 Aug 2017,Thursday",Malaika slams user who trolled her for 'divorc...,http://www.hindustantimes.com/bollywood/malaik...,Malaika Arora slammed an Instagram user who tr...,"From her special numbers to TV?appearances, Bo..."
2,2,Arshiya Chopra,"03 Aug 2017,Thursday",'Virgin' now corrected to 'Unmarried' in IGIMS...,http://www.hindustantimes.com/patna/bihar-igim...,The Indira Gandhi Institute of Medical Science...,The Indira Gandhi Institute of Medical Science...
3,3,Sumedha Sehra,"03 Aug 2017,Thursday",Aaj aapne pakad liya: LeT man Dujana before be...,http://indiatoday.intoday.in/story/abu-dujana-...,Lashkar-e-Taiba's Kashmir commander Abu Dujana...,Lashkar-e-Taiba's Kashmir commander Abu Dujana...
4,4,Aarushi Maheshwari,"03 Aug 2017,Thursday",Hotel staff to get training to spot signs of s...,http://indiatoday.intoday.in/story/sex-traffic...,Hotels in Maharashtra will train their staff t...,Hotels in Mumbai and other Indian cities are t...


#### Similarity score

I have used the concept of synsets to calculate similarity between two sets. The similarity score is used as a metric of comparision between two texts.

In [3]:
# utility function to generate pos tags for text
def generate_pos_tags(text):
    pos_text = nltk.pos_tag(nltk.word_tokenize(text.lower()))
    return pos_text

# utility function to convert penn tree bank tag to wordnet
def penn_to_wn(tag):
    if tag.startswith('J'):
        return 'a'
    
    if tag.startswith('N'):
        return 'n'
    
    if tag.startswith('R'):
        return 'r'
 
    if tag.startswith('V'):
        return 'v'
 
    return None

# get synset from wordnet
def get_synset(word, tag):
    wn_tag = penn_to_wn(tag)
    if wn_tag is None:
        return None
    try:
        return wn.synsets(word, wn_tag)[0]
    except:
        return None

In [4]:
# source https://www.aaai.org/Papers/AAAI/2006/AAAI06-123.pdf
def calculate_similarity(text1, text2):
    '''Utility function to calculate similarity'''
    text1 = generate_pos_tags(text1)
    text2 = generate_pos_tags(text2)
    
    synsets1 = [get_synset(*tagged_word) for tagged_word in text1]
    synsets2 = [get_synset(*tagged_word) for tagged_word in text2]
 
    # Filter out the Nones
    synsets1 = [ss for ss in synsets1 if ss]
    synsets2 = [ss for ss in synsets2 if ss]
 
    score, count = 0.0, 0
 
    for synset in synsets1:
        similarity_values = []
        
        best_score = 0.0
        for ss in synsets2:
            similarity = synset.path_similarity(ss)
            if similarity is not None:
                similarity_values.append(similarity)
            if len(similarity_values):
                # take the max similarity score
                best_score = max(similarity_values)
        if best_score > 0.0:
            score += best_score
            count += 1
    if count > 0:
        # calculate the average similarity score
        score /= count
    return score

In [5]:
# assuming corp is a dataframe which has columns with names 'text' and 'ctext'
def generate_summary_and_similarity_score(corp, corp_summarizer, summariser_name):
    '''This function joins the final summary and calculates the similarity score.
    It further adds the summary and the similarity score to the dataset. 
    Please note that this function only joins the summary'''
    summary_list = []
    similarity_list = []
    for ctext, text in zip(corp['ctext'], corp['text']):
        summary = ''
        similarity = 0.0
        if ctext is not None and str(ctext) != 'NA':
            summary = summary.join(corp_summarizer(ctext))
            similarity = calculate_similarity(summary, str(text))
        summary_list.append(summary)
        similarity_list.append(similarity)
    corp['summary-'+summariser_name] = summary_list
    corp['similarity-score-'+summariser_name] = similarity_list

# using dissimilarity score like a loss function - minimise dissimilarity to minimise loss        
def get_cumulative_similarity_and_dissimilarity_score(corp, summariser_name):
    '''Calculate cumulative dissimilarity and similarity scores.'''
    similarity = 0.0
    dissimilarity = 0.0
    for similarity_score in corp['similarity-score-'+summariser_name]:
        if similarity_score is not None:
            similarity+= similarity_score
            # dissimilarity score is assumed to be 1 - similarity score for each row
            dissimilarity+= (1 - similarity_score)
    return dissimilarity, similarity

In [6]:
# On inspecting the dataset we find that in many cases, there is no space between 
# full point at the end of one sentence and the begining of the next sentence
def preprocess_data(corp):
    '''Preprocess the dataset'''
    sanitized_text = []
    corp['headlines'].fillna('NA')
    corp['text'].fillna('NA')
    corp['ctext'].fillna('NA')
    for ctext in corp['ctext']:
        ctext_str = re.sub(r'\.', '. ',str(ctext))
        ctext_str = re.sub(r'don\'t', 'do not',str(ctext))
        ctext_str = re.sub(r'isn\'t', 'is not',str(ctext))
        ctext_str = re.sub(r'won\'t', 'will not',str(ctext))
        ctext_str = re.sub(r'i\'m', 'i am',str(ctext))
        ctext_str = re.sub(r'who\'re', 'who are',str(ctext))
        ctext_str = re.sub(r'all\'s', 'all is',str(ctext))
        ctext_str = re.sub(r'couldn\'t', 'could not',str(ctext))
        ctext_str = re.sub(r'you\'d', 'you would',str(ctext))
        ctext_str = re.sub(r'don\'ts', 'donts',str(ctext))
        ctext_str = re.sub(r'they\'d', 'they would',str(ctext))
        ctext_str = re.sub(r'b\'coz', 'because',str(ctext))
        ctext_str = re.sub(r'do\'s', 'dos',str(ctext))
        ctext_str = re.sub(r'you\'re', 'you are',str(ctext))
        ctext_str = re.sub(r'he\'d', 'he would',str(ctext))
        ctext_str = re.sub(r'aren\'t', 'are not',str(ctext))
        ctext_str = re.sub(r'hasn\'t', 'has not',str(ctext))
        ctext_str = re.sub(r'he\'ll', 'he will',str(ctext))
        ctext_str = re.sub(r'aren\'t', 'are not',str(ctext))
        ctext_str = re.sub(r'ain\'t', 'am not',str(ctext))
        ctext_str = re.sub(r'i\'d', 'i would',str(ctext))
        ctext_str = re.sub(r'they\'re', 'they are',str(ctext))
        ctext_str = re.sub(r'she\'d', 'she would',str(ctext))
        ctext_str = re.sub(r'wouldn\'t', 'would not',str(ctext))
        ctext_str = re.sub(r'y\'all', 'you',str(ctext))
        ctext_str = re.sub(r'they\'ll', 'they will',str(ctext))
        ctext_str = re.sub(r'would\'ve', 'would have',str(ctext))
        ctext_str = re.sub(r'you\'ll', 'you will',str(ctext))
        ctext_str = re.sub(r'weren\'t', 'were not',str(ctext))
        ctext_str = re.sub(r'ma\'am', 'madam',str(ctext))
        ctext_str = re.sub(r'didn\'t', 'did not',str(ctext))
        ctext_str = re.sub(r'hon\'ble', 'honorable',str(ctext))
        ctext_str = re.sub(r'it\'ll', 'it will',str(ctext))
        ctext_str = re.sub(r'li\'l', 'little',str(ctext))
        ctext_str = re.sub(r'i\'ll', 'i will',str(ctext))
        ctext_str = re.sub(r'\'', '',str(ctext))
        sanitized_text.append(ctext_str.lower())
    corp['ctext'] = sanitized_text

preprocess_data(df_small)

### A. Extraction using text rank algorithm

Text rank algorithm uses the word frequency distribution to get the most common words and hence the sentences which have these common words. This algorithm does a good job tosummarise the content. I tried to use the comparison of similarity and dissimilarity score to indicate whether the summary is effective. For a dataset like which has news articles, this is a very effective method. This model took about 20 minutes to generate the summary for all texts in the corpus

In [7]:
#A: Extractive algorithms: Summariser based on word frequency - text rank algorithm
def get_word_freq(ctext):
    '''Function to get the word frequency distribution'''
    words = nltk.word_tokenize(str(ctext).lower())
    word_list = []
    for word in words:
        if word not in stopwords.words('english') and word not in punctuation:
            word_list.append(word)
    freq = nltk.FreqDist(word_list)
    return freq             

get_word_freq(df_small)

# get top 3 sentences with 5 most common words
def summarize_word_frequencies(ctext):
    '''This function uses the text rank algorithm to extract the summary'''
    sentences = nltk.sent_tokenize(str(ctext).lower())
    sent_scores = defaultdict(int)
    common_words = []
    fdist = get_word_freq(ctext)
    # get the 5 most frequent words in the text
    for word_freq in fdist.most_common(5):
        common_words.append(word_freq[0])
    sent_count = -1
    # get the top 3 sentences based on word frequency distribution of most common words
    for sentence in sentences:
        sent_count += 1
        sent_score = 0
        for word in nltk.word_tokenize(sentence):
            if word in common_words:
                sent_score += fdist[word]
        # scores are calculated for all sentences    
        sent_scores[sent_count] = sent_score
    # top 3 sentences based on scores are selected as the summary
    imp_sents_index = nlargest(3, sent_scores, key=sent_scores.get)
    imp_sents_list = []
    for index in imp_sents_index:
        imp_sents_list.append(sentences[index])
    return ' '.join(imp_sents_list)
        
generate_summary_and_similarity_score(df_small, summarize_word_frequencies, 'word-freq')
# printed the scores for comparison
get_cumulative_similarity_and_dissimilarity_score(df_small, 'word-freq')

(316.25684940943626, 383.7431505905638)

### B. Summariser using Abstractive deep learning methods

Abstractive summarization will help to get more than just summary of the text. The intention is to capture the meaning of the text to a certain extent. I have used the following approach for the same:
1. Generate vectors of the available corpus using word2vec
2. Split data set into train and test sets
3. Use LSTM based encoder / decoder model using 2 encoder layers and one decoder layer - I also tried to use attention based model however was unable to integrate the attention layer in the model
4. Train the model and generate the summary

Although I expected this to work however the model generated summary which was unrelated to the given text and was  gramatically incorrect. Also this takes a very long time to train. I had to run the model for more than 24 hours to complete 100 epochs.

In [8]:
#B: Summariser using Abstractive deep learning methods
#get all sentences from data and summary to generate word2vec model
def generate_word2vec_model(corp):
    '''Generate word2vec model for the given corpus'''
    sentences_for_model = []
    for sent in corp['ctext'] + corp['text']:
        sents = nltk.sent_tokenize(sent)
        for sent in sents:
            words_in_sent = []
            for word in nltk.word_tokenize(sent):
                if word not in punctuation:
                    words_in_sent.append(word.lower())
            sentences_for_model.append(words_in_sent)
    model = gensim.models.Word2Vec(sentences_for_model, min_count=1, size=128)
    model.save('model.embeddings')
    corp_model = gensim.models.Word2Vec.load('model.embeddings')
    return corp_model

def word2idx(word, model):
    '''Utility function to convert a given word to index'''
    if(word in model.wv.vocab):
        return model.wv.vocab[word].index
    return 0

def idx2word(idx, model):
    '''Utility function to convert index to word'''
    return model.wv.index2word[idx]

def format_data(corp, model):
    '''This function converts text to vector data'''
    x_train = []
    y_train = []
    max_num_words_ctext = max([len(nltk.word_tokenize(sent)) for sent in corp['ctext']])
    max_num_words_text = max([len(nltk.word_tokenize(sent)) for sent in corp['text']])
    x_train = np.zeros([corp.shape[0], max_num_words_ctext], dtype=np.int32)
    y_train = np.zeros([corp.shape[0], max_num_words_text], dtype=np.int32)
    print(max_num_words_ctext)
    print(max_num_words_text)
    # generate vectors for article data
    for i, text in enumerate(corp['ctext']):
        for w, word in enumerate(nltk.word_tokenize(text)):
            x_train[i, w] = word2idx(word.lower(), model)
    # generate vectors for summary data        
    for i, text in enumerate(corp['text']):
        for w, word in enumerate(nltk.word_tokenize(text)):
            y_train[i, w] = word2idx(word.lower(), model)

    return x_train, y_train, max_num_words_text

# these parameters will be used to contruct the model
corp_model = generate_word2vec_model(df_small)
pretrained_weights = corp_model.wv.vectors
vocab_size, embedding_size = pretrained_weights.shape
# standarize the word vectors so that the vectors have values between 0 and 1
x_train, y_train, max_num_words_text = format_data(df_small, corp_model)
x_train = x_train/vocab_size
y_train = y_train/vocab_size

validation_size = 0.2
test_size = 0.1
train_index = math.floor(x_train.shape[0]*(1-validation_size-test_size))

x_test = x_train[train_index:]

# print the orinigal article and the given summary to compare after every 20 eopchs
x_sample = x_train[train_index:train_index+1]
print('-----> Original article')
print(df_small.iloc[train_index+1]['ctext'])
print('------> Given Summary')
print(df_small.iloc[train_index+1]['text'])

x_train = x_train[:train_index]
y_train = y_train[:train_index]

def print_prediction(prediction):
    '''This function will be executed after every 20 epochs to print the summary for a sample text '''
    prediction = prediction*vocab_size
    summary = []
    for s in prediction:
        sent = []
        for w in s:
            sent.append(idx2word(int(w), corp_model))
        summary.append(' '.join(sent))
    print('---->')
    print('. '.join(summary))
    return ' '.join(summary)

def on_epoch_end(epoch, _):
    '''Function call at the end of each epoch'''
    if (epoch+1)%20 == 0:
        print('\n Generating Summary after epoch: %d' % epoch)
        prediction = model.predict(x=[x_sample, x_sample])
        print_prediction(prediction)

# define a custom loss function to play around with Keras
def custom_loss(y_true, y_pred):
    return y_true - y_pred
    

# article input model
# input layer
inputs1 = Input(shape=(x_train.shape[1],))
# embedding with pretrained weights with the news article text
article1 = Embedding(input_dim=vocab_size, output_dim=embedding_size, weights=[pretrained_weights])(inputs1)
# generate LSTM model with the above embedding
article2 = LSTM(units=embedding_size, dropout=0.05)(article1)
# repeat vector to connect to the next layer
article3 = RepeatVector(x_train.shape[1])(article2)
# summary input model
# input shape for the second input layer
inputs2 = Input(shape=(x_train.shape[1],))
# embedding with pretrained weights with the news article text for second input layer
summ1 = Embedding(input_dim=vocab_size, output_dim=embedding_size, weights=[pretrained_weights])(inputs2)
# decoder model
# concatenate layer to connect encoder and decoder
decoder1 = concatenate([article3, summ1])
# LSTM model for decoder
decoder2 = LSTM(units=embedding_size, dropout=0.05)(decoder1)
# output layer using softmax activation
outputs = Dense(y_train.shape[1], activation='softmax')(decoder2)
# attention layer
#attention = TimeDistributed(Dense(1, activation = 'tanh'))(inputs2)
#attention = Flatten()(attention)
#attention = Multiply()([outputs, attention])
#attention = Activation('softmax')(attention)
#attention = Permute([2, 1])(attention)
 
#decoder_dense = Dense(vocab_size,activation='softmax')
#outputs = decoder_dense(attention)

# tie it together [article, summary] [word]
model = Model(inputs=[inputs1, inputs2], outputs=outputs)
# compile and fit the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=[custom_loss])

model.fit([x_train, x_train], y_train, batch_size=128, 
          epochs=100, callbacks=[LambdaCallback(on_epoch_end=on_epoch_end)])

4426
90
-----> Original article
in dramatic developments in bihar, nitish kumar on wednesday resigned as chief minister, dumping the rjd and congress to stitch a new alliance with bjp, which quickly announced support to a new government under him. kumar, whose resignation was immediately accepted by governor keshari nath tripathi, will take oath at the chief minister at 5pm on thursday.?in the circumstances that prevail in bihar, it became difficult to run the grand alliance government,? kumar told reporters outside raj bhavan after submitting his resignation to governor keshri nath tripathi.prime minister narendra modi hailed kumar?s resignation, saying in doing so he has joined the fight against corruption. immediately after kumar announced his resignation, modi tweeted: ?congratulations! mr nitish kumar for joining the fight against corruption.?here are wednesday?s highlights on the political crisis in bihar:12.30am: lalu prasad says: ?nitish kumar has underlined that he is an oppor

Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Instructions for updating:
Use tf.cast instead.
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100

 Generating Summary after epoch: 19
---->
white tweeted find education role himself hours charges complaint face -- wanted 27 passengers trying son claimed meet ministry 1 each resignation deputy health end your sushil raj held night directed along really plan started hours leaders till doing said.the ministers general sent issued rashtrapati raj lot others past saturday traffic deputy due getting tejashwi end great development hours great cm filed pradesh month life team however singh how being being if at or had who are it on on s a that in in that in for a for
Epoch 21/100
Ep

Epoch 59/100
Epoch 60/100

 Generating Summary after epoch: 59
---->
union agency justice investigation tax north received video really 2016 education twitter railway 2 face possible raj 2017 sushil village 12 night top major lot near raj grand rajya lok war force outside call night near water making recently war sent august raised news always statement claimed ministers money election each area killed ram won election station raj airport authorities among really pradesh khan later come today crore work delhi film them from all we who we with is is with to that to to that to and to to
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78/100
Epoch 79/100
Epoch 80/100

 Generating Summary after epoch: 79
---->
fight corporation leave release child 12 rajya sent information open ministry once inside having press ye

<keras.callbacks.History at 0x144c1def0>

In [9]:
# Comparison with extractive summary vs generated abstractive summary
print('-----> Original article')
print(df_small.iloc[train_index+1]['ctext'])
print('------> Given Summary')
print(df_small.iloc[train_index+1]['text'])
print('------> Extractive Summary')
print(df_small.iloc[train_index+1]['summary-word-freq'])
print('------> Abstractive Summary')
prediction = model.predict(x=[x_sample, x_sample])
prediction = print_prediction(prediction)

-----> Original article
in dramatic developments in bihar, nitish kumar on wednesday resigned as chief minister, dumping the rjd and congress to stitch a new alliance with bjp, which quickly announced support to a new government under him. kumar, whose resignation was immediately accepted by governor keshari nath tripathi, will take oath at the chief minister at 5pm on thursday.?in the circumstances that prevail in bihar, it became difficult to run the grand alliance government,? kumar told reporters outside raj bhavan after submitting his resignation to governor keshri nath tripathi.prime minister narendra modi hailed kumar?s resignation, saying in doing so he has joined the fight against corruption. immediately after kumar announced his resignation, modi tweeted: ?congratulations! mr nitish kumar for joining the fight against corruption.?here are wednesday?s highlights on the political crisis in bihar:12.30am: lalu prasad says: ?nitish kumar has underlined that he is an opportunist. 

---->
wednesday capital late fire authorities authorities lot raised yet comes lok online member issued htshowbiz really commission killed services 1 election child among filed social received social rajya read resignation yet children already really started here raj following outside shared charges building open himself call statement lot north here lot north past working cricket must end himself raj went role due yet without maharashtra well governor taken woman found so my them be when we '' we with on was it the that the the that the of the the


### Conclusion

As we see above, extractive summary did a good job to summarise the content. Infact it gave more details than the actual summary. However abstractive method generated unrelated words and was unable to summarise the text. 