https://medium.com/analytics-vidhya/an-introduction-to-text-summarization-using-the-textrank-algorithm-with-python-implementation-2370c39d0c60

(original) https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/

<b>Text Summarization</b> can broadly be divided into two categories: Extractive Summarization and Abstractive Summarization.

* Extractive Summarization: These methods rely on extracting several parts, such as phrases and sentences, from a piece of text and stacking them together to create a summary. Therefore, identifying the right sentences for summarization is of utmost importance in an extractive method.
* Abstractive Summarization: These methods use advanced NLP techniques to generate an entirely new summary. Some parts of this summary may not even appear in the original text.

We will be focusing on the <b>extractive summarization</b> technique in this notebook.

<b>Problem Statement</b><br>
I always try to keep up with what’s happening in tennis by religiously going through as many online tennis updates as possible. However, this has proven to be a rather difficult job! There are far too many resources, and time is a constraint.

Therefore, I decided to design a system that could prepare a bullet-point summary for me by scanning through multiple articles. We will apply the TextRank algorithm to a dataset of scraped articles with the aim of producing a concise summary.

This is essentially a <b>single-domain-multiple-documents</b> summarization task, i.e., we will take multiple articles as input and generate a single bullet-point summary. Multi-domain text summarization is not covered in this notebook.

<b>Steps</b><br>
* Read the data
* Split the text into sentences
* Text preprocessing - remove punctuation, convert to lowercase, remove stopwords
* Vector representation of sentences using GloVe word vectors
* Similarity matrix preparation
* Applying the PageRank algorithm
* Summary Extraction

In [3]:
import numpy as np
import pandas as pd
import nltk
import re

In [None]:
nltk.download('punkt') # one time execution

In [4]:
df = pd.read_csv("tennis_articles_v4.csv")
df.head()

Unnamed: 0,article_id,article_text,source
0,1,Maria Sharapova has basically no friends as te...,https://www.tennisworldusa.org/tennis/news/Mar...
1,2,"BASEL, Switzerland (AP), Roger Federer advance...",http://www.tennis.com/pro-game/2018/10/copil-s...
2,3,Roger Federer has revealed that organisers of ...,https://scroll.in/field/899938/tennis-roger-fe...
3,4,Kei Nishikori will try to end his long losing ...,http://www.tennis.com/pro-game/2018/10/nishiko...
4,5,"Federer, 37, first broke through on tour over ...",https://www.express.co.uk/sport/tennis/1036101...


In [5]:
df['article_text'][0]

"Maria Sharapova has basically no friends as tennis players on the WTA Tour. The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much. I think everyone knows this is my job here. When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. I'm a pretty competitive girl. I say my hellos, but I'm not sending any players flowers as well. Uhm, I'm not really friendly or close to many players. I have not a lot of friends away from the courts.' When she said she is not really close to a lot of players, is that something strategic that she is doing? Is it different on the men's tour than the women's tour? 'No, not at all. I think just because you're in the same 

In [6]:
from nltk.tokenize import sent_tokenize
sentences = [] 
for s in df['article_text']:
    sentences.append(sent_tokenize(s))
# flatten list
sentences = [y for x in sentences for y in x]

In [7]:
sentences[:5]

['Maria Sharapova has basically no friends as tennis players on the WTA Tour.',
 "The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much.",
 'I think everyone knows this is my job here.',
 "When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.",
 "I'm a pretty competitive girl."]

In [1]:
# download and unpack the pretrained GloVe vectors (one time execution)
# the URL is left unquoted because single quotes are not stripped by the Windows shell
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip


In [8]:
# Extract word vectors
word_embeddings = {}
# each line holds a word followed by its vector components
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs
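
To make the file format concrete, here is a minimal sketch that parses two made-up lines laid out the way the loop above expects: a word followed by space-separated numbers. The words, the values, the 4-dimensional size and the name `toy_embeddings` are assumptions for illustration only; the real `glove.6B.100d.txt` has 100 values per line.

```python
import io
import numpy as np

# Two made-up lines in the "<word> <v1> <v2> ... <vN>" layout.
# Only 4 values per word are used here to keep the example short.
sample = io.StringIO(
    "tennis 0.1 -0.2 0.3 0.4\n"
    "match 0.5 0.0 -0.1 0.2\n"
)

toy_embeddings = {}
for line in sample:
    values = line.split()
    # First token is the word, the rest are the vector components.
    toy_embeddings[values[0]] = np.asarray(values[1:], dtype="float32")

print(toy_embeddings["tennis"].shape)
```

Each dictionary entry maps a word to a fixed-length NumPy array, which is what the later averaging step relies on.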

In [13]:
word_embeddings['the']

array([-0.038194  , -0.24487001,  0.72812003, -0.39961001,  0.083172  ,
        0.043953  , -0.39140999,  0.3344    , -0.57545   ,  0.087459  ,
        0.28786999, -0.06731   ,  0.30906001, -0.26383999, -0.13231   ,
       -0.20757   ,  0.33395001, -0.33848   , -0.31742999, -0.48335999,
        0.1464    , -0.37303999,  0.34577   ,  0.052041  ,  0.44946   ,
       -0.46970999,  0.02628   , -0.54154998, -0.15518001, -0.14106999,
       -0.039722  ,  0.28277001,  0.14393   ,  0.23464   , -0.31020999,
        0.086173  ,  0.20397   ,  0.52623999,  0.17163999, -0.082378  ,
       -0.71787   , -0.41531   ,  0.20334999, -0.12763   ,  0.41367   ,
        0.55186999,  0.57907999, -0.33476999, -0.36559001, -0.54856998,
       -0.062892  ,  0.26583999,  0.30204999,  0.99774998, -0.80480999,
       -3.0243001 ,  0.01254   , -0.36941999,  2.21670008,  0.72201002,
       -0.24978   ,  0.92136002,  0.034514  ,  0.46744999,  1.10790002,
       -0.19358   , -0.074575  ,  0.23353   , -0.052062  , -0.22

In [14]:
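# remove punctuation, numbers and special characters
# (regex=True is needed in pandas >= 2.0, where str.replace treats the pattern literally by default)
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)
# convert to lowercase
clean_sentences = [s.lower() for s in clean_sentences]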
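
To see what this cleaning does to a single sentence, the following sketch applies the same character class with the standard `re` module. The helper name `clean_sentence` and the sample text are hypothetical and not part of the notebook.

```python
import re

def clean_sentence(text):
    """Replace every non-letter character with a space, then lowercase."""
    return re.sub(r"[^a-zA-Z]", " ", text).lower()

sample = "Federer, 37, won 6-4!"
cleaned = clean_sentence(sample)
# Digits and punctuation become spaces, so only the words survive a split.
print(cleaned.split())
```

Note that apostrophes are replaced as well, so a contraction such as "I'm" ends up as the two tokens "i" and "m".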

In [15]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\285850\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [16]:
from nltk.corpus import stopwords 
stop_words = stopwords.words('english')

In [17]:
# function to remove stopwords 
def remove_stopwords(sen):     
    sen_new = " ".join([i for i in sen if i not in stop_words])          
    return sen_new

In [18]:
# remove stopwords from the sentences 
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]
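
As a quick illustration of the effect, here is a sketch of the same filtering on one cleaned sentence. It uses a tiny hand-picked stop list (`toy_stop_words`, an assumption for this example) instead of the NLTK list, and stores it in a set so that membership checks do not scan a list.

```python
# Hypothetical mini stop list; the notebook uses NLTK's full English list.
toy_stop_words = {"i", "m", "a", "the", "and", "to", "is"}

def drop_stopwords(sentence, stop_words):
    """Keep only the tokens that are not in stop_words."""
    return " ".join(tok for tok in sentence.split() if tok not in stop_words)

result = drop_stopwords("i m a pretty competitive girl", toy_stop_words)
print(result)  # pretty competitive girl
```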

In [45]:
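# represent each sentence as the average of its word vectors
sentence_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        # words missing from GloVe contribute a zero vector; +0.001 guards against division by zero
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
    else:
        # empty sentence (nothing left after cleaning)
        v = np.zeros((100,))
    sentence_vectors.append(v)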
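
The averaging can be checked by hand on a toy case. The sketch below uses made-up 3-dimensional embeddings (`toy_embeddings`, an assumption; the notebook uses 100-dimensional GloVe vectors) and a hypothetical helper `sentence_vector` that follows the same logic as the cell above.

```python
import numpy as np

# Made-up 3-dimensional embeddings for illustration only.
toy_embeddings = {
    "tennis": np.array([1.0, 0.0, 2.0]),
    "match": np.array([3.0, 2.0, 0.0]),
}
dim = 3

def sentence_vector(sentence, embeddings, dim):
    """Average the word vectors of a sentence; unknown words count as zeros."""
    words = sentence.split()
    if not words:
        return np.zeros(dim)
    total = sum(embeddings.get(w, np.zeros(dim)) for w in words)
    # The small constant mirrors the notebook's guard against division by zero.
    return total / (len(words) + 0.001)

v = sentence_vector("tennis match", toy_embeddings, dim)
print(v)  # roughly [2, 1, 1]: the sum [4, 2, 2] divided by 2.001
```

One consequence of this scheme is that out-of-vocabulary words still count in the denominator, so a sentence with many unknown words is pulled toward the zero vector.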

In [46]:
len(sentence_vectors)

119

In [30]:
sentence_vectors[0]

array([ -2.08140463e-01,   2.64891267e-01,   3.44418883e-01,
        -2.13079229e-01,   2.34621353e-02,   1.87433153e-01,
         1.67914018e-01,   3.02313745e-01,  -1.20291933e-01,
        -2.33991519e-01,   2.22425133e-01,   9.25958529e-02,
         1.49705827e-01,  -5.10089956e-02,   9.38714854e-03,
        -6.55026212e-02,  -1.49192229e-01,   2.92601977e-02,
        -3.40117455e-01,   2.51018763e-01,   4.25288707e-01,
         2.04429641e-01,   1.91092212e-02,   4.39891545e-03,
         1.83485880e-01,   2.51849461e-02,  -3.12460393e-01,
        -4.94591355e-01,   9.84878764e-02,  -2.68122971e-01,
        -1.23669095e-01,   2.27788553e-01,   5.55112027e-04,
        -6.91322237e-02,   2.67278194e-01,   3.98785323e-01,
        -3.26870799e-01,   3.44963744e-02,  -1.63259227e-02,
        -2.70772815e-01,  -2.78630346e-01,  -1.43367171e-01,
         3.88507903e-01,  -3.47312719e-01,  -1.81084722e-01,
        -2.34724343e-01,   4.04098988e-01,  -5.19586146e-01,
         3.77993017e-01,

In [47]:
# similarity matrix 
sim_mat = np.zeros([len(sentences), len(sentences)])

In [48]:
from sklearn.metrics.pairwise import cosine_similarity
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
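
The double loop calls `cosine_similarity` once per sentence pair, which becomes slow as the number of sentences grows. A possible alternative, sketched here with plain NumPy under the assumption that all sentence vectors have the same length, computes the whole matrix in one step. The function name `cosine_similarity_matrix` and the toy vectors are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity with a zeroed diagonal."""
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Avoid dividing by zero for all-zero sentence vectors.
    norms[norms == 0] = 1.0
    unit = X / norms
    sim = unit @ unit.T
    # A sentence should not vote for itself, matching the `if i != j` check.
    np.fill_diagonal(sim, 0.0)
    return sim

toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
sim = cosine_similarity_matrix(toy_vectors)
print(np.round(sim, 3))
```

scikit-learn's `cosine_similarity` also accepts a full 2-D array and returns all pairwise similarities, so passing the stacked sentence vectors in a single call is another option.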

In [54]:
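import networkx as nx
# build a weighted graph whose edge weights are the sentence similarities
# (from_numpy_matrix was removed in NetworkX 3.0; from_numpy_array replaces it)
nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)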
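
For intuition about what the ranking step computes, here is a rough power-iteration sketch of PageRank on a small weighted matrix. It is not NetworkX's implementation: the function name `pagerank_power_iteration`, the toy matrix and the stopping rule are assumptions, the weights are assumed to be non-negative, and the damping factor 0.85 matches the default `alpha` of `nx.pagerank`.

```python
import numpy as np

def pagerank_power_iteration(weights, damping=0.85, tol=1e-8, max_iter=200):
    """Approximate PageRank scores for a non-negative weighted adjacency matrix."""
    W = np.asarray(weights, dtype=float)
    n = W.shape[0]
    row_sums = W.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    # Normalise each row into transition probabilities; rows with no
    # outgoing weight are treated as linking uniformly to every node.
    P = np.where(row_sums > 0, W / safe_sums, 1.0 / n)
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_scores = (1 - damping) / n + damping * (P.T @ scores)
        if np.abs(new_scores - scores).sum() < tol:
            return new_scores
        scores = new_scores
    return scores

# Toy graph: node 0 is connected to both other nodes, which are not connected to each other.
toy_weights = [[0.0, 1.0, 1.0],
               [1.0, 0.0, 0.0],
               [1.0, 0.0, 0.0]]
toy_scores = pagerank_power_iteration(toy_weights)
print(toy_scores.argmax())  # 0
```

The node that the others point to most strongly collects the highest score, which is the sense in which a sentence similar to many other sentences is ranked as important.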

In [59]:
scores

{0: 0.008072651865276066,
 1: 0.008501993259365037,
 2: 0.0078119318398216,
 3: 0.009293791260564146,
 4: 0.007500319295916385,
 5: 0.008146814785247897,
 6: 0.008477413381565426,
 7: 0.008251000819455925,
 8: 0.008596957762357726,
 9: 0.008257144250233068,
 10: 0.0012695751770095795,
 11: 0.008860552409260931,
 12: 0.00808354331891815,
 13: 0.008156804650453691,
 14: 0.008443316877797879,
 15: 0.008556893043719335,
 16: 0.007812826653904838,
 17: 0.008071958049751223,
 18: 0.008406020961271342,
 19: 0.0088478922486596,
 20: 0.008860865186249187,
 21: 0.007421917083656914,
 22: 0.008223434004980818,
 23: 0.008991766451813816,
 24: 0.00846397039711992,
 25: 0.006701898152599973,
 26: 0.008232471647417278,
 27: 0.008913135600535109,
 28: 0.009061682997321345,
 29: 0.009093905696447463,
 30: 0.00924452161472398,
 31: 0.008994323963323616,
 32: 0.0072368691033659,
 33: 0.008709093081740912,
 34: 0.00891913055441074,
 35: 0.009097421351821915,
 36: 0.007715970774354713,
 37: 0.0088834520756

In [66]:
ranked_sentences = sorted(((scores[i],s) for i,s in 
                           enumerate(sentences)), reverse=True)
# Extract top 10 sentences as the summary 
for i in range(10):
    print(ranked_sentences[i][1])

When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.
Major players feel that a big event in late November combined with one in January before the Australian Open will mean too much tennis and too little rest.
Speaking at the Swiss Indoors tournament where he will play in Sundays final against Romanian qualifier Marius Copil, the world number three said that given the impossibly short time frame to make a decision, he opted out of any commitment.
"I felt like the best weeks that I had to get to know players when I was playing were the Fed Cup weeks or the Olympic weeks, not necessarily during the tournaments.
Currently in ninth place, Nishikori with a win could move to within 125 points of the cut for the eight-man event in London 
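
Printing sentences strictly by score can make the summary jump between topics. One possible variation, sketched below with made-up sentences and scores, selects the top `n` sentences and then restores their original order. The helper `top_sentences_in_order` is hypothetical and not part of the notebook.

```python
import heapq

def top_sentences_in_order(sentences, scores, n):
    """Pick the n highest-scoring sentences and return them in their original order."""
    top_indices = heapq.nlargest(n, range(len(sentences)), key=lambda i: scores[i])
    return [sentences[i] for i in sorted(top_indices)]

toy_sentences = ["A opens.", "B adds detail.", "C is central.", "D closes."]
toy_scores = {0: 0.20, 1: 0.10, 2: 0.45, 3: 0.25}
summary = top_sentences_in_order(toy_sentences, toy_scores, 2)
print(summary)  # ['C is central.', 'D closes.']
```

Since `scores` is indexed by sentence position, the same function works with the dictionary returned by `nx.pagerank`.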

<b>What's Next</b>

Problem-specific:

* Multiple domain text summarization
* Single document summarization
* Cross-language text summarization (source in some language and summary in another language)

Algorithm-specific:

* Text summarization using RNNs and LSTMs
* Text summarization using Reinforcement Learning
* Text summarization using Generative Adversarial Networks (GANs)