### Extractive text summarization of news articles using TextRank algorithm

#### Aim:

In this project, the aim is to build a system that summarizes news articles from sources such as BBC, CNN, and The New York Times into shorter texts that readers can use for their own needs. It uses one of the extractive text summarization approaches, the <b><i>TextRank</i></b> algorithm.

#### Dataset:
The data used for both the text summarization and text classification tasks are news articles from British Broadcasting Corporation News (BBC News), as used by Greene and Cunningham in their work on kernel document clustering. The dataset consists of 2225 news articles with their summaries, covering the period 2004-2005, under 5 labels: business, entertainment, politics, sports, and technology. The input is read from text files (.txt).

#### Theory and Methodology:

Text summarization is the process of producing a concise and meaningful version of a text from sources such as Twitter, Facebook, or books. There are two kinds of text summarization:
* Extractive Text Summarization.
* Abstractive Text Summarization.
<br><br>
<b><u>Extractive Text Summarization</u></b>: <br>
This method builds the summary from phrases and sentences taken from the given text. Hence, finding the right sentences and phrases is the central task in this method. <br><br>
<b><u>Abstractive Text Summarization</u></b>: <br>
This method generates an entirely new summary using natural language processing techniques. The tokens in the generated summary may not appear in the given text.

In this notebook, I demonstrate the implementation of <b>Extractive Text Summarization</b> using the <b>TextRank</b> algorithm, and reserve <b>Abstractive Text Summarization</b> for a comparative study of the two techniques in my master thesis. <br>

##### TextRank Algorithm

We need to understand the <b>PageRank</b> algorithm before TextRank, because TextRank works in a similar way. PageRank's main purpose is to rank web pages in online search results. To rank the pages, we need a <b>PageRank score</b> for each page, which is the probability of a user visiting that page. The transition probabilities are stored in an n x n matrix, where n is the number of web pages.
Each entry of the matrix represents the probability of the user's transition from one page to another. Likewise, TextRank uses sentences instead of web pages, with the similarity between two sentences taking the place of the transition probability.

<img src="TextRank_working.png" align="center" style = "width : 800px">
<div style="text-align:center">
    Source : <a href="https://licor.me/post/introduction-to-automatic-text-summarization/">Introduction to Automatic Text Summarization</a>
</div>
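
To make the PageRank idea concrete, here is a minimal power-iteration sketch in plain Python. The 3-page web, the damping factor of 0.85, and the function name `pagerank` are illustrative assumptions, not part of this notebook's pipeline (which relies on `networkx` further down).

```python
def pagerank(transition, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration on a row-stochastic transition matrix (list of lists)."""
    n = len(transition)
    rank = [1.0 / n] * n  # start from a uniform distribution
    for _ in range(max_iter):
        new_rank = []
        for j in range(n):
            # Probability mass flowing into page j from every page i.
            incoming = sum(rank[i] * transition[i][j] for i in range(n))
            new_rank.append((1.0 - damping) / n + damping * incoming)
        if sum(abs(a - b) for a, b in zip(new_rank, rank)) < tol:
            return new_rank
        rank = new_rank
    return rank

# Hypothetical 3-page web: page 0 links to pages 1 and 2,
# page 1 links to page 2, and page 2 links back to page 0.
transition = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]
scores = pagerank(transition)
```

Page 2 receives links from both other pages and ends up with the highest score. In TextRank the rows would instead be derived from sentence-to-sentence similarities.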

In [1]:
import numpy as np
import pandas as pd
import nltk
import spacy
from nltk import word_tokenize,sent_tokenize
#nltk.download('punkt') # one time execution
import os
from pickle import load,dump #to load or dump files
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx

In [2]:
categories = os.listdir("Data/BBC_News_Summary/News_Articles")
categories = [c for c in categories if not c.startswith(".")] #skip system-generated entries such as .DS_Store, wherever they appear in the list
categories

['business', 'sport', 'politics', 'tech']

In [3]:
# Reading articles
articles = []
summaries = []
for newsType in categories:
    article_Files = os.listdir("Data/BBC_News_Summary/News_Articles/"+newsType)
    summary_Files = os.listdir("Data/BBC_News_Summary/Summaries/"+newsType)
    for article_File in article_Files:
        with open("Data/BBC_News_Summary/News_Articles/"+newsType+"/"+article_File,"r",encoding='utf-8',errors='ignore') as source_1:
            article = source_1.read().strip().splitlines()
        articles.append(" ".join(article))
    for summary_File in summary_Files:
        with open("Data/BBC_News_Summary/Summaries/"+newsType+"/"+summary_File,"r",encoding='utf-8',errors='ignore') as source_2:
            summary = source_2.read().strip().splitlines()
        summaries.append(" ".join(summary))
article_summary_dict = {"articles":articles,"summaries":summaries}
news_summary_data = pd.DataFrame.from_dict(article_summary_dict)
news_summary_data.head(10)

| | articles | summaries |
|---|---|---|
| 0 | UK economy facing 'major risks' The UK manufa... | "Despite some positive news for the export sec... |
| 1 | Aids and climate top Davos agenda Climate cha... | At the same time, about 100,000 people are exp... |
| 2 | Asian quake hits European shares Shares in Eu... | The unfolding scale of the disaster in south A... |
| 3 | India power shares jump on debut Shares in In... | Shares in India's largest power producer, Nati... |
| 4 | Lacroix label bought by US firm Luxury goods ... | LVMH said the French designer's haute couture ... |
| 5 | Insurance bosses plead guilty Another three U... | A Marsh spokeswoman said Mr Bewlay was no long... |
| 6 | Turkey-Iran mobile deal 'at risk' Turkey's in... | Turkcell's mobile deal is the second Turkish i... |
| 7 | Parmalat to return to stockmarket Parmalat, t... | The plan is likely to give creditors of Parmal... |
| 8 | WorldCom director admits lying The former chi... | Mr Sullivan, 42, has already pleaded guilty to... |
| 9 | Ebbers denies WorldCom fraud Former WorldCom ... | Mr Ebbers relationship to Mr Sullivan is key t... |


In [4]:
news_summary_data.iloc[0,1]

'"Despite some positive news for the export sector, there are worrying signs for manufacturing," the BCC said.The BCC found confidence increased in the quarter across both the manufacturing and service sectors although overall it failed to reach the levels at the start of 2004.The outlook for the service sector was "uncertain" despite an increase in exports and orders over the quarter, the BCC noted.The UK manufacturing sector will continue to face "serious challenges" over the next two years, the British Chamber of Commerce (BCC) has said.However, the BCC found the whole UK economy still faced "major risks" and warned that growth is set to slow.The reduced threat of interest rate increases had contributed to improved confidence, it said.The rise came despite exchange rates being cited as a major concern.'

### Text Preprocessing

In [5]:
#nltk.download('stopwords')
# if stopwords downloaded, no need to execute the above command

In [6]:
import string
import contractions
import nltk
from nltk.corpus import stopwords

stop_words = stopwords.words('english')

# Characters to strip: ASCII punctuation plus the typographic apostrophe, but not the plain apostrophe
punct = (set(string.punctuation) | {'’'}) - {"'"}

def remove_punctuations(text):
    """Return text with every character in punct removed."""
    return "".join(char for char in text if char not in punct)

In [7]:
def cleaned_Text(Sent):
    cleaned_desc = []
    for sentences in Sent:
        encoding = sentences.encode("ascii","ignore")
        sent = encoding.decode().split()
        cleaned_Sent = [text.lower() for text in sent]
        cleaned_Sent = [contractions.fix(text) for text in cleaned_Sent]
        cleaned_Sent = [remove_punctuations(text) for text in cleaned_Sent]
        cleaned_Sent = [text for text in cleaned_Sent if(text.isalpha())]
        cleaned_Sent = [text for text in cleaned_Sent if text not in stop_words]
        cleaned_Sent = " ".join(cleaned_Sent)
        cleaned_desc.append(cleaned_Sent)
    return cleaned_desc
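
To see what each cleaning step removes, here is a dependency-free sketch of the same idea. The tiny `STOP_WORDS` set and the `clean` helper are hypothetical stand-ins (the cell above uses NLTK's English stop-word list and the `contractions` package, which are left out here).

```python
import string

# Hypothetical mini stop-word list, for illustration only.
STOP_WORDS = {"the", "in", "to", "a", "and", "from", "than", "will", "it", "more"}
# Same choice as above: strip punctuation but keep the plain apostrophe.
PUNCT = set(string.punctuation) - {"'"}

def clean(sentence):
    tokens = sentence.lower().split()
    tokens = ["".join(ch for ch in tok if ch not in PUNCT) for tok in tokens]
    # isalpha() drops numbers ("3", "2004") and any token that still
    # contains an apostrophe ("group's").
    tokens = [tok for tok in tokens if tok.isalpha()]
    tokens = [tok for tok in tokens if tok not in STOP_WORDS]
    return " ".join(tokens)

cleaned = clean("The group's survey forecast growth will slow from 3% in 2004.")
# cleaned == "survey forecast growth slow"
```

Note that possessives and figures disappear entirely, which is why the cleaned sentences later in this notebook contain no years or percentages.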

In [8]:
sample_Input = article_summary_dict['articles'][0]
sample_Input

'UK economy facing \'major risks\'  The UK manufacturing sector will continue to face "serious challenges" over the next two years, the British Chamber of Commerce (BCC) has said.  The group\'s quarterly survey of companies found exports had picked up in the last three months of 2004 to their best levels in eight years. The rise came despite exchange rates being cited as a major concern. However, the BCC found the whole UK economy still faced "major risks" and warned that growth is set to slow. It recently forecast economic growth will slow from more than 3% in 2004 to a little below 2.5% in both 2005 and 2006.  Manufacturers\' domestic sales growth fell back slightly in the quarter, the survey of 5,196 firms found. Employment in manufacturing also fell and job expectations were at their lowest level for a year.  "Despite some positive news for the export sector, there are worrying signs for manufacturing," the BCC said. "These results reinforce our concern over the sector\'s persisten

### Sentence tokenizing

In [9]:
sentences = sent_tokenize(sample_Input)
sentences

['UK economy facing \'major risks\'  The UK manufacturing sector will continue to face "serious challenges" over the next two years, the British Chamber of Commerce (BCC) has said.',
 "The group's quarterly survey of companies found exports had picked up in the last three months of 2004 to their best levels in eight years.",
 'The rise came despite exchange rates being cited as a major concern.',
 'However, the BCC found the whole UK economy still faced "major risks" and warned that growth is set to slow.',
 'It recently forecast economic growth will slow from more than 3% in 2004 to a little below 2.5% in both 2005 and 2006.',
 "Manufacturers' domestic sales growth fell back slightly in the quarter, the survey of 5,196 firms found.",
 'Employment in manufacturing also fell and job expectations were at their lowest level for a year.',
 '"Despite some positive news for the export sector, there are worrying signs for manufacturing," the BCC said.',
 '"These results reinforce our concern 

### Extracting Word Embeddings

In [19]:
word_embeddings = {}
# Each line holds a word followed by the components of its vector
with open('data/glove.6B/glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs
len(word_embeddings)

400000

In [20]:
with open('Data/word_embeddings.p', 'wb') as f: #context manager closes the file after writing
    dump(word_embeddings, f)
#word_embeddings = load(open("Data/word_embeddings.p","rb"))
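
For readers without the GloVe file at hand, the parsing step can be tried on an in-memory sample. The two lines below are made-up values in the same layout (a token followed by its vector components), and plain lists stand in for NumPy arrays.

```python
import io

# Two made-up lines in the GloVe text layout: token, then vector components.
sample = io.StringIO(
    "economy 0.1 -0.2 0.3\n"
    "growth 0.0 0.5 -0.5\n"
)

embeddings = {}
for line in sample:
    token, *components = line.split()
    embeddings[token] = [float(x) for x in components]

# Every vector in a GloVe file has the same length; here it is 3.
dim = len(next(iter(embeddings.values())))
```

Reading the dimension from the loaded vectors, rather than hard-coding it, keeps later steps consistent if a different GloVe file is swapped in.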

### Vector representation of sentences

In [11]:
#Cleaned a sample article from the dataframe
sentences = cleaned_Text(sentences)
sentences

['uk economy facing uk manufacturing sector continue face serious challenges next two years british chamber commerce bcc said',
 'quarterly survey companies found exports picked last three months best levels eight years',
 'rise came despite exchange rates cited major concern',
 'however bcc found whole uk economy still faced major risks warned growth set slow',
 'recently forecast economic growth slow little',
 'domestic sales growth fell back slightly quarter survey firms found',
 'employment manufacturing also fell job expectations lowest level year',
 'despite positive news export sector worrying signs manufacturing bcc said',
 'results reinforce concern persistent inability sustain recovery',
 'outlook service sector uncertain despite increase exports orders quarter bcc noted',
 'bcc found confidence increased quarter across manufacturing service sectors although overall failed reach levels start',
 'reduced threat interest rate increases contributed improved confidence said',
 'b

In [12]:
sentence_vectors = []
for i in sentences:
    if len(i) != 0:
        # glove.6B.100d vectors have 100 dimensions, so the fallback for unknown words must match
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
    else:
        v = np.zeros((100,))
    sentence_vectors.append(v)
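
The averaging step can be checked by hand on toy data. The sketch below uses made-up 2-dimensional embeddings and a hypothetical helper `sentence_vector`. It divides by the exact word count, whereas the cell above adds 0.001 to the denominator.

```python
def sentence_vector(sentence, embeddings, dim):
    """Average the vectors of the words in a cleaned sentence."""
    words = sentence.split()
    if not words:
        return [0.0] * dim
    zero = [0.0] * dim
    total = [0.0] * dim
    for word in words:
        # Out-of-vocabulary words contribute a zero vector of the same length.
        vec = embeddings.get(word, zero)
        total = [t + v for t, v in zip(total, vec)]
    return [t / len(words) for t in total]

# Toy 2-dimensional embeddings (made up for illustration).
toy = {"growth": [1.0, 0.0], "slow": [0.0, 1.0]}
vec = sentence_vector("growth slow unknownword", toy, dim=2)
# vec is approximately [1/3, 1/3]: the unknown word pulls the average toward zero
```

The fallback vector has to match the embedding dimension. Otherwise the element-wise sum fails as soon as a sentence contains an unknown word.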

### Similarity Matrix

In [13]:
similarity_matrix = np.zeros([len(sentences), len(sentences)])
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            similarity_matrix[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,-1), sentence_vectors[j].reshape(1,-1))[0,0]
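
As a reference for what each matrix entry means, here is a plain-Python sketch of cosine similarity and the resulting matrix. The helper names and the toy vectors are assumptions for illustration, and returning 0.0 for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def build_similarity_matrix(vectors):
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]  # diagonal stays 0: no self-loops
    for i in range(n):
        for j in range(i + 1, n):
            # Cosine similarity is symmetric, so each pair is computed once.
            sim[i][j] = sim[j][i] = cosine(vectors[i], vectors[j])
    return sim

toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
sim = build_similarity_matrix(toy_vectors)
```

Because the matrix is symmetric, only the upper triangle needs computing, which halves the work of the double loop for long articles.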

### PageRank Algorithm

In [14]:
graph = nx.from_numpy_array(similarity_matrix)
scores = nx.pagerank(graph)

### Ranking the sentences and generating summary

In [15]:
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)

In [13]:
def generate_Summary(limit, ranked_Sentences):
    # Slicing avoids an IndexError when an article has fewer than `limit` sentences
    summary = [sentence for _, sentence in ranked_Sentences[:limit]]
    return ". ".join(summary)
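
The ranking in this notebook is computed on the cleaned sentences, so the generated summary is lowercased and has no stop words. One possible variation, sketched below with made-up data, keeps the sentence indices so that the scores can select the original sentences and return them in article order. It assumes the cleaned list and the original list are aligned index for index. `summary_from_scores` is a hypothetical helper, not part of the code above.

```python
def summary_from_scores(original_sentences, scores, limit=5):
    """Pick the top `limit` sentences by score and return them in article order."""
    by_score = sorted(range(len(original_sentences)),
                      key=lambda i: scores[i], reverse=True)
    top = by_score[:limit]
    return " ".join(original_sentences[i] for i in sorted(top))

# Made-up sentences and scores standing in for the sent_tokenize output
# and the dictionary returned by nx.pagerank.
original = ["The economy grew.", "Shares fell sharply.", "Analysts were surprised."]
toy_scores = {0: 0.2, 1: 0.5, 2: 0.3}
summary = summary_from_scores(original, toy_scores, limit=2)
# summary == "Shares fell sharply. Analysts were surprised."
```

Restoring article order is a design choice: it tends to read more naturally than listing sentences by descending score.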

### Original Summary

In [17]:
article_summary_dict['summaries'][0]

'"Despite some positive news for the export sector, there are worrying signs for manufacturing," the BCC said.The BCC found confidence increased in the quarter across both the manufacturing and service sectors although overall it failed to reach the levels at the start of 2004.The outlook for the service sector was "uncertain" despite an increase in exports and orders over the quarter, the BCC noted.The UK manufacturing sector will continue to face "serious challenges" over the next two years, the British Chamber of Commerce (BCC) has said.However, the BCC found the whole UK economy still faced "major risks" and warned that growth is set to slow.The reduced threat of interest rate increases had contributed to improved confidence, it said.The rise came despite exchange rates being cited as a major concern.'

### Generated Summary

In [18]:
generated_Summary = generate_Summary(5, ranked_sentences)
generated_Summary

'bcc found confidence increased quarter across manufacturing service sectors although overall failed reach levels start however bcc found whole uk economy still faced major risks warned growth set slow rates kept hold since amid signs falling consumer confidence slowdown output outlook service sector uncertain despite increase exports orders quarter bcc noted consumer spending set decelerate significantly next months unlikely investment exports rise sufficiently strongly pick slack'

### Similarity between original and generated summary

In [19]:
nlp = spacy.load("en_core_web_lg")
round(nlp(generated_Summary).similarity(nlp(article_summary_dict['summaries'][0])),2)

0.92

### Creating function to generate summary for all inputs

In [14]:
def summarize_News(article):
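    limit = 5 #Number of sentences to be present in a summary, 5 is default.
    #The pickle is reloaded on every call; load it once outside the function when summarizing many articles
    with open("Data/word_embeddings.p","rb") as f:
        word_embeddings = load(f)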
    sentences = sent_tokenize(article)
    sentences = cleaned_Text(sentences)
    sentence_vectors = []
    
    #Vector representation
    for i in sentences:
        if len(i) != 0:
            v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
        else:
            v = np.zeros((100,))
        sentence_vectors.append(v)
    
    #Similarity matrix
    similarity_matrix = np.zeros([len(sentences), len(sentences)])
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i != j:
                similarity_matrix[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
    
    #PageRank Algorithm
    graph = nx.from_numpy_array(similarity_matrix)
    scores = nx.pagerank(graph)
    
    #Ranking documents
    ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)
    
    if(len(ranked_sentences) > 0):
        #returning generated summary
        return generate_Summary(limit, ranked_sentences)

In [15]:
len(article_summary_dict['articles'])

1839

In [16]:
summaries = [summarize_News(article_summary_dict['articles'][data]) for data in range(5)]
summaries

['bcc found confidence increased quarter across manufacturing service sectors although overall failed reach levels start. however bcc found whole uk economy still faced major risks warned growth set slow. rise came despite exchange rates cited major concern. outlook service sector uncertain despite increase exports orders quarter bcc noted. consumer spending set decelerate significantly next months unlikely investment exports rise sufficiently strongly pick slack',
 'themes stressed mr blair whose attendance announced last minute wants dominate chairmanship group industrialised states. business political leaders around globe listen uk prime minister tony opening speech wednesday. survey published eve conference pricewaterhousecoopers said four ten business leaders confident companies would see sales rise. aids climate top davos agenda climate change fight aids leading list concerns first day world economic forum swiss resort davos. aside mr blair world leaders expected attend including

### Evaluation

In [17]:
from rouge import Rouge

In [18]:
rouge = Rouge()
rouge_Scores = [rouge.get_scores(summaries[i],article_summary_dict['summaries'][i]) for i in range(5)]

In [19]:
rouge_Scores

[[{'rouge-1': {'f': 0.4040403997015611,
    'p': 0.6349206349206349,
    'r': 0.2962962962962963},
   'rouge-2': {'f': 0.14285713853186185,
    'p': 0.22580645161290322,
    'r': 0.1044776119402985},
   'rouge-l': {'f': 0.4966442907184362,
    'p': 0.6981132075471698,
    'r': 0.3854166666666667}}],
 [{'rouge-1': {'f': 0.26997244790018904,
    'p': 0.5104166666666666,
    'r': 0.18352059925093633},
   'rouge-2': {'f': 0.09972298781163451,
    'p': 0.18947368421052632,
    'r': 0.06766917293233082},
   'rouge-l': {'f': 0.342205318846304,
    'p': 0.5357142857142857,
    'r': 0.25139664804469275}}],
 [{'rouge-1': {'f': 0.28242074612827944,
    'p': 0.7205882352941176,
    'r': 0.17562724014336917},
   'rouge-2': {'f': 0.09275362005864325,
    'p': 0.23880597014925373,
    'r': 0.05755395683453238},
   'rouge-l': {'f': 0.3809523770656472,
    'p': 0.7213114754098361,
    'r': 0.25882352941176473}}],
 [{'rouge-1': {'f': 0.3731343234127869,
    'p': 0.4166666666666667,
    'r': 0.3378378378
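
To read the `p`, `r`, and `f` fields above, it helps to compute ROUGE-1 by hand on a short pair. The sketch below counts overlapping unigrams with whitespace tokenization. The `rouge` package may tokenize and count differently, so its numbers need not match this simplified version exactly.

```python
from collections import Counter

def rouge_1(hypothesis, reference):
    """Unigram precision, recall, and F1 between two whitespace-tokenized strings."""
    hyp = Counter(hypothesis.split())
    ref = Counter(reference.split())
    overlap = sum((hyp & ref).values())  # matches, clipped by the reference counts
    p = overlap / max(sum(hyp.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}

result = rouge_1("growth set to slow", "bcc warned that growth is set to slow")
# All 4 hypothesis words occur in the 8-word reference: p = 1.0, r = 0.5
```

High precision with low recall, as in the scores above, means most generated words appear in the reference summary but the reference contains much more text than the generated one covers.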

In [20]:
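from nltk.translate.bleu_score import sentence_bleu
# Note: sentence_bleu expects (list of tokenized references, tokenized hypothesis).
# Here each reference word and the hypothesis are plain strings, so n-grams are taken over characters.
bleu_score = [sentence_bleu(article_summary_dict['summaries'][i].split(),summaries[i]) for i in range(5)]
bleu_score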

[0.21318332348136937,
 0.20180804966447832,
 0.24290136946927754,
 0.20035324211525896,
 0.15562331756599762]
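
BLEU is built from clipped n-gram precisions, and NLTK's `sentence_bleu` takes a list of tokenized references plus a tokenized hypothesis. A Python string passed where a token list is expected is iterated character by character. The sketch below is not NLTK's implementation. It is a simplified, hypothetical `clipped_precision` helper that shows how much the two readings can differ on a made-up pair.

```python
from collections import Counter

def clipped_precision(reference, hypothesis, n=1):
    """Clipped n-gram precision of `hypothesis` against one `reference` sequence."""
    def ngrams(seq):
        return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    hyp, ref = ngrams(hypothesis), ngrams(reference)
    total = sum(hyp.values())
    return sum((hyp & ref).values()) / total if total else 0.0

reference = "growth is set to slow"
hypothesis = "slowly growing"

# Word level: neither "slowly" nor "growing" occurs in the reference.
token_level = clipped_precision(reference.split(), hypothesis.split())
# Character level: most letters of the hypothesis also occur in the reference.
char_level = clipped_precision(reference, hypothesis)
```

Here the word-level precision is 0.0 while the character-level precision is well above 0.5, so scores computed on raw strings are not comparable with word-level BLEU values reported elsewhere.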