In [None]:
# https://www.analyticsvidhya.com/blog/2019/06/comprehensive-guide-text-summarization-using-deep-learning-python/
# https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/
# https://www.analyticsvidhya.com/blog/2018/09/introduction-graph-theory-applications-python/     ---- Graph Theory

In [14]:
import numpy as np
import pandas as pd
import nltk
#nltk.download('punkt') # one time execution
import re
from nltk.tokenize import sent_tokenize

In [5]:
# Read the Data
df = pd.read_csv("tennis_articles.csv",encoding='latin-1')

In [6]:
# Inspect the Data
df.head()

Unnamed: 0,article_id,article_title,article_text,source
0,1,"I do not have friends in tennis, says Maria Sh...",Maria Sharapova has basically no friends as te...,https://www.tennisworldusa.org/tennis/news/Mar...
1,2,Federer defeats Medvedev to advance to 14th Sw...,"BASEL, Switzerland (AP)  Roger Federer advanc...",http://www.tennis.com/pro-game/2018/10/copil-s...
2,3,Tennis: Roger Federer ignored deadline set by ...,Roger Federer has revealed that organisers of ...,https://scroll.in/field/899938/tennis-roger-fe...
3,4,Nishikori to face off against Anderson in Vien...,Kei Nishikori will try to end his long losing ...,http://www.tennis.com/pro-game/2018/10/nishiko...
4,5,Roger Federer has made this huge change to ten...,"Federer, 37, first broke through on tour over ...",https://www.express.co.uk/sport/tennis/1036101...


In [7]:
df['article_text'][0]

"Maria Sharapova has basically no friends as tennis players on the WTA Tour. The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much. I think everyone knows this is my job here. When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net. So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. I'm a pretty competitive girl. I say my hellos, but I'm not sending any players flowers as well. Uhm, I'm not really friendly or close to many players. I have not a lot of friends away from the courts.' When she said she is not really close to a lot of players, is that something strategic that she is doing? Is it different on the men's tour than the women's tour? 'No, not at all. I think just because you're in the same

In [8]:
df['article_text'][1]

"BASEL, Switzerland (AP) \x97 Roger Federer advanced to the 14th Swiss Indoors final of his career by beating seventh-seeded Daniil Medvedev 6-1, 6-4 on Saturday. Seeking a ninth title at his hometown event, and a 99th overall, Federer will play 93th-ranked Marius Copil on Sunday. Federer dominated the 20th-ranked Medvedev and had his first match-point chance to break serve again at 5-1. He then dropped his serve to love, and let another match point slip in Medvedev's next service game by netting a backhand. He clinched on his fourth chance when Medvedev netted from the baseline. Copil upset expectations of a Federer final against Alexander Zverev in a 6-3, 6-7 (6), 6-4 win over the fifth-ranked German in the earlier semifinal. The Romanian aims for a first title after arriving at Basel without a career win over a top-10 opponent. Copil has two after also beating No. 6 Marin Cilic in the second round. Copil fired 26 aces past Zverev and never dropped serve, clinching after 2 1/2 hours 

In [9]:
df['article_text'][2]

'Roger Federer has revealed that organisers of the re-launched and condensed Davis Cup gave him three days to decide if he would commit to the controversial competition. Speaking at the Swiss Indoors tournament where he will play in Sunday\x92s final against Romanian qualifier Marius Copil, the world number three said that given the impossibly short time frame to make a decision, he opted out of any commitment. \x93They only left me three days to decide,\x94 Federer said. \x93I didn\x92t to have time to consult with all the people I had to consult. \x93I could not make a decision in that time, so I told them to do what they wanted.\x94 The 20-time Grand Slam champion has voiced doubts about the wisdom of the one-week format to be introduced by organisers Kosmos, who have promised the International Tennis Federation up to $3 billion in prize money over the next quarter-century. The competition is set to feature 18 countries in the November 18-24 finals in Madrid next year, and will repl

In [16]:
# Split every article into sentences with NLTK's sent_tokenize and pool them in one list
sentences = []
for i in df['article_text']:
    sentences.extend(sent_tokenize(i))

In [12]:
# Load the pre-trained GloVe word vectors (100 dimensions) into a dict: word -> vector
word_embeddings = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs
len(word_embeddings)

400000
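The loading loop relies on each line of the GloVe file having the form `word v1 v2 ...`. As a minimal sketch of that parsing step, the snippet below runs the same logic on an in-memory stand-in for the file. The helper name `load_glove` and the 3-dimensional vectors are made up for illustration; they are not part of the notebook or of GloVe.

```python
import io
import numpy as np

def load_glove(lines):
    """Parse GloVe-style text lines ('word v1 v2 ...') into a dict of float32 arrays.

    Hypothetical helper for illustration; it mirrors the loop in the cell above.
    """
    embeddings = {}
    for line in lines:
        values = line.split()
        if not values:  # skip blank lines
            continue
        embeddings[values[0]] = np.asarray(values[1:], dtype="float32")
    return embeddings

# Made-up 3-dimensional vectors standing in for the real 100-dimensional file.
fake_file = io.StringIO("tennis 0.1 0.2 0.3\nmatch -0.5 0.0 0.25\n")
toy_embeddings = load_glove(fake_file)
print(len(toy_embeddings), toy_embeddings["match"].shape)
```

Passing any iterable of lines (an open file handle works too) keeps the parsing testable without the large GloVe download.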

In [18]:
# Text preprocessing
# Clean the text before vectorizing: replace punctuation, digits, and special characters with spaces.
# regex=True is needed because pandas 2.0+ treats the pattern as a literal string by default.
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)

# Convert to lowercase
clean_sentences = [s.lower() for s in clean_sentences]
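To see what this cleaning step does to a single sentence, here is a small sketch that applies the same pattern with the standard-library `re` module. The sample string is a made-up variation on one of the article sentences, with a digit added so that number removal is visible.

```python
import re

# Hypothetical sample with stray control characters (\x93, \x94), a digit, and punctuation.
sample = "\x93They only left me 3 days to decide,\x94 Federer said."

# Replace every character that is not an ASCII letter with a space, then lowercase.
cleaned = re.sub(r"[^a-zA-Z]", " ", sample).lower()

# Splitting on whitespace makes the surviving tokens easy to inspect.
tokens = cleaned.split()
print(tokens)
```

Because apostrophes are also replaced, a contraction such as "I'm" is split into the two tokens "i" and "m".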

In [19]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [20]:
# Remove stopwords from a tokenized sentence (a list of words) and rejoin the rest into a string
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

In [21]:
# remove stopwords from the sentences
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

In [22]:
# Fetch the 100-dimensional vector for each word in a sentence, then average those
# vectors to get a single vector for the sentence. Words missing from GloVe count as zeros.

sentence_vectors = []
for i in clean_sentences:
  if len(i) != 0:
    v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
  else:
    v = np.zeros((100,))
  sentence_vectors.append(v)
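The averaging step can be checked by hand on a toy vocabulary. The sketch below wraps the same arithmetic in a hypothetical helper, `sentence_vector`, and uses made-up 2-dimensional embeddings. The word "basel" is deliberately left out of the toy vocabulary to show that an unknown word contributes a zero vector but still counts in the denominator.

```python
import numpy as np

def sentence_vector(sentence, embeddings, dim):
    """Average the word vectors of a cleaned sentence (hypothetical helper).

    Unknown words contribute zeros; 0.001 is added to the word count,
    as in the cell above.
    """
    words = sentence.split()
    if not words:
        return np.zeros(dim)
    total = sum(embeddings.get(w, np.zeros(dim)) for w in words)
    return total / (len(words) + 0.001)

# Made-up 2-dimensional embeddings; "basel" is intentionally missing.
toy = {"federer": np.array([1.0, 0.0]), "wins": np.array([0.0, 1.0])}
v = sentence_vector("federer wins basel", toy, dim=2)
empty = sentence_vector("", toy, dim=2)
print(v, empty)
```

One consequence of this design is that sentences with many out-of-vocabulary words get shorter vectors, although cosine similarity ignores vector length.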

In [23]:
sentence_vectors[0]

array([ 0.09423097,  0.13669517,  0.58544935,  0.20381502, -0.1671801 ,
        0.34834924,  0.17839483,  0.37101227, -0.20376579, -0.12805886,
        0.36699214,  0.07576428, -0.03720036,  0.22964005, -0.31566828,
       -0.0996928 , -0.00833809,  0.27326583, -0.24724656,  0.4564281 ,
       -0.12987752,  0.33414586,  0.12960431,  0.72200727,  0.2367894 ,
       -0.08360017,  0.1354242 , -0.9088701 ,  0.59379951,  0.1741106 ,
       -0.31913759,  0.25016489,  0.22933419, -0.19942921,  0.35909751,
       -0.09988502, -0.43280267,  0.33715736, -0.17687914, -0.12392576,
       -0.09833796, -0.183877  ,  0.21569655, -0.36719825,  0.11650406,
       -0.23125742, -0.18450755, -0.0596488 ,  0.45691286, -0.21181901,
       -0.16230433, -0.30980917,  0.14066741,  0.19803775,  0.09769779,
       -1.35030617, -0.01634588,  0.26587548,  0.40476317,  0.68378705,
       -0.23487752,  0.42370602, -0.24793846,  0.25880627, -0.07375703,
       -0.17501676,  0.03179123,  0.43157982,  0.10041995, -0.11

In [25]:
# Initialize the similarity matrix with zeros (one row and one column per sentence)
sim_mat = np.zeros([len(sentences), len(sentences)])

In [26]:
# We will use Cosine Similarity to compute the similarity between a pair of sentences.
from sklearn.metrics.pairwise import cosine_similarity

In [27]:
for i in range(len(sentences)):
  for j in range(len(sentences)):
    if i != j:
      sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
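The nested loop makes one `cosine_similarity` call per sentence pair, which gets slow as the number of sentences grows. As an alternative sketch, and not the notebook's method, the whole matrix can be computed at once by normalizing the rows and taking a matrix product. The function name `cosine_similarity_matrix` and the toy vectors are assumptions for illustration. All-zero vectors are given similarity 0 here.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity with a zeroed diagonal (hypothetical helper)."""
    X = np.vstack(vectors).astype(float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid dividing by zero for all-zero rows
    X_unit = X / norms               # each non-zero row now has length 1
    sim = X_unit @ X_unit.T          # dot products of unit vectors = cosines
    np.fill_diagonal(sim, 0.0)       # ignore self-similarity, as in the loop above
    return sim

# Made-up 2-dimensional "sentence vectors", including one all-zero vector.
toy_vectors = [np.array([1.0, 0.0]), np.array([0.0, 2.0]),
               np.array([3.0, 3.0]), np.zeros(2)]
sim = cosine_similarity_matrix(toy_vectors)
print(np.round(sim, 3))
```

scikit-learn's `cosine_similarity` also accepts the full 2-D array of stacked vectors and returns all pairwise values in one call; the diagonal would then still need to be zeroed.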

In [29]:
import networkx as nx
nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)
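To give some intuition for what the scores mean, here is a rough power-iteration sketch of weighted PageRank. It is an illustration of the idea, not networkx's implementation: the function name, tolerance, and toy matrix are assumptions, and it expects non-negative weights. The damping factor of 0.85 matches the default of `nx.pagerank`.

```python
import numpy as np

def pagerank_power_iteration(weights, damping=0.85, tol=1e-8, max_iter=100):
    """Illustrative weighted PageRank via power iteration (hypothetical helper)."""
    W = np.asarray(weights, dtype=float)
    n = W.shape[0]
    row_sums = W.sum(axis=1, keepdims=True)
    # Turn each row into transition probabilities; rows with no weight jump uniformly.
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    P = np.where(row_sums > 0, W / safe_sums, 1.0 / n)
    rank = np.full(n, 1.0 / n)       # start from a uniform distribution
    for _ in range(max_iter):
        new_rank = (1 - damping) / n + damping * (rank @ P)
        converged = np.abs(new_rank - rank).sum() < tol
        rank = new_rank
        if converged:
            break
    return rank

# Made-up symmetric similarity matrix for three sentences.
toy_sim = [[0.0, 0.9, 0.1],
           [0.9, 0.0, 0.8],
           [0.1, 0.8, 0.0]]
toy_scores = pagerank_power_iteration(toy_sim)
print(toy_scores)
```

In the toy matrix, the middle sentence is strongly similar to both of the others, so it receives the highest score. That is the sense in which TextRank treats a sentence as "central".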

In [33]:
# Rank the sentences by PageRank score, highest first, so that the top N can form the summary.
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)
ranked_sentences

[(0.008798702471018703,
  '\x93I was on a nice trajectorythen,\x94 Reid recalled.\x93If I hadn\x92t got sick, I think I could have started pushing towards the second week at the slams and then who knows.\x94 Duringa comeback attempt some five years later, Reid added Bernard Tomic and 2018 US Open Federer slayer John Millman to his list of career scalps.'),
 (0.008749096541587417,
  '\x93Full effort Nick could live out his tennis like a (Tomas) Berdych or (Jo- Wilfried) Tsonga, consistently making second week,quarters, semis, finals of slams - and then hopefully more.'),
 (0.0087181363423051,
  'Major players feel that a big event in late November combined with one in January before the Australian Open will mean too much tennis and too little rest.'),
 (0.008671127349766841,
  "\x93I just felt like it really kind of changed where people were a little bit, definitely in the '90s, a lot more quiet, into themselves, and then it started to become better.\x94 Meanwhile, Federer is hoping he 

In [32]:
# Extract top 10 sentences as the summary
for i in range(10):
  print(ranked_sentences[i][1])
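Printing the top sentences by score can make the summary jump around, because the ranking ignores where each sentence appeared. As a sketch under that assumption, the hypothetical helper below picks the top `n` sentences and can optionally restore their original order. The names `top_sentences` and `keep_order` and the toy data are illustrative, not part of the notebook.

```python
import heapq

def top_sentences(sentences, scores, n, keep_order=False):
    """Return the n highest-scoring sentences (hypothetical helper).

    scores maps a sentence index to its score, like the dict from nx.pagerank.
    With keep_order=True the selected sentences follow their original order.
    """
    top_idx = heapq.nlargest(n, range(len(sentences)), key=lambda i: scores[i])
    if keep_order:
        top_idx = sorted(top_idx)
    return [sentences[i] for i in top_idx]

# Made-up sentences and scores for illustration.
toy_sentences = ["A wins.", "B loses.", "C retires.", "D returns."]
toy_scores = {0: 0.3, 1: 0.4, 2: 0.1, 3: 0.2}

by_score = top_sentences(toy_sentences, toy_scores, n=2)
by_position = top_sentences(toy_sentences, toy_scores, n=2, keep_order=True)
print(by_score)
print(by_position)
```

Since the notebook pools sentences from all articles into one list, "original order" here means the order in that pooled list.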

I was on a nice trajectorythen, Reid recalled.If I hadnt got sick, I think I could have started pushing towards the second week at the slams and then who knows. Duringa comeback attempt some five years later, Reid added Bernard Tomic and 2018 US Open Federer slayer John Millman to his list of career scalps.
Full effort Nick could live out his tennis like a (Tomas) Berdych or (Jo- Wilfried) Tsonga, consistently making second week,quarters, semis, finals of slams - and then hopefully more.
Major players feel that a big event in late November combined with one in January before the Australian Open will mean too much tennis and too little rest.
I just felt like it really kind of changed where people were a little bit, definitely in the '90s, a lot more quiet, into themselves, and then it started to become better. Meanwhile, Federer is hoping he can improve his service game as he hunts his ninth Swiss Indoors title this week.
I felt like the best weeks that I had to get to know pla