In [69]:
import numpy as np
import pandas as pd
import nltk
import re
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx

## **Read Dataset**

In [41]:
df = pd.read_csv("tennis_articles_v4.csv")
df

Unnamed: 0,article_text
0,Maria Sharapova has basically no friends as te...
1,"BASEL, Switzerland (AP), Roger Federer advance..."
2,Roger Federer has revealed that organisers of ...
3,Kei Nishikori will try to end his long losing ...
4,"Federer, 37, first broke through on tour over ..."
5,Nadal has not played tennis since he was force...
6,"Tennis giveth, and tennis taketh away. The end..."
7,Federer won the Swiss Indoors last week by bea...


In [42]:
df['article_text'][7]

'Federer won the Swiss Indoors last week by beating Romanian qualifier Marius Copil in the final. The 37-year-old claimed his 99th ATP title and is hunting the century in the French capital this week. Federer has been handed a difficult draw where could could come across Kevin Anderson, Novak Djokovic and Rafael Nadal in the latter rounds. But first the 20-time Grand Slam winner wants to train on the Paris Masters court this afternoon before deciding whether to appear for his opening match against either Milos Raonic or Jo-Wilfried Tsonga. "On Monday, I am free and will look how I feel," Federer said after winning the Swiss Indoors. "On Tuesday I will fly to Paris and train in the afternoon to be ready for my first match on Wednesday night. "I felt good all week and better every day. "We also had the impression that at this stage it might be better to play matches than to train. "And as long as I fear no injury, I play." Federer\'s success in Basel last week was the ninth time he has w

## **Sentences Splitting**

In [43]:
#split text in all articles into sentences
sentences = []
for s in df['article_text']:
    sentences.append(sent_tokenize(s))

In [44]:
print(df['article_text'][0])

Maria Sharapova has basically no friends as tennis players on the WTA Tour. The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much. I think everyone knows this is my job here. When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. I'm a pretty competitive girl. I say my hellos, but I'm not sending any players flowers as well. Uhm, I'm not really friendly or close to many players. I have not a lot of friends away from the courts.' When she said she is not really close to a lot of players, is that something strategic that she is doing? Is it different on the men's tour than the women's tour? 'No, not at all. I think just because you're in the same s

In [45]:
print(len(df['article_text'][0]))

1561


In [46]:
sentences

[['Maria Sharapova has basically no friends as tennis players on the WTA Tour.',
  "The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much.",
  'I think everyone knows this is my job here.',
  "When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.",
  "I'm a pretty competitive girl.",
  "I say my hellos, but I'm not sending any players flowers as well.",
  "Uhm, I'm not really friendly or close to many players.",
  "I have not a lot of friends away from the courts.'",
  'When she said she is not really close to a lot of players, is that something strategic that she is doing?',
  "Is it different on the men's tour than the women's tour?",
  "'No, 

In [47]:
len(sentences)

8

In [48]:
#flatten sentences
sentences = [y for x in sentences for y in x]
sentences

['Maria Sharapova has basically no friends as tennis players on the WTA Tour.',
 "The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much.",
 'I think everyone knows this is my job here.',
 "When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.",
 "I'm a pretty competitive girl.",
 "I say my hellos, but I'm not sending any players flowers as well.",
 "Uhm, I'm not really friendly or close to many players.",
 "I have not a lot of friends away from the courts.'",
 'When she said she is not really close to a lot of players, is that something strategic that she is doing?',
 "Is it different on the men's tour than the women's tour?",
 "'No, not at all.

In [49]:
sentences[1]

"The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much."

In [50]:
len(sentences)

119

## **Text Preprocessing**

### **1- Remove Punctuation, numbers & Special Characters**

In [51]:
cleaned_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ")

  cleaned_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ")


In [52]:
cleaned_sentences

0      Maria Sharapova has basically no friends as te...
1      The Russian player has no problems in openly s...
2            I think everyone knows this is my job here 
3      When I m on the courts or when I m on the cour...
4                         I m a pretty competitive girl 
                             ...                        
114    It makes me incredibly happy to win my home to...
115     I do not know if it s maybe my last title  so...
116     Maybe I should celebrate as if it were my las...
117     There are very touching moments  seeing the b...
118    Because it was not always easy in the last wee...
Length: 119, dtype: object

In [53]:
#make all characters in lowercase
cleaned_sentences = [s.lower() for s in cleaned_sentences]

In [54]:
cleaned_sentences

['maria sharapova has basically no friends as tennis players on the wta tour ',
 'the russian player has no problems in openly speaking about it and in a recent interview she said   i don t really hide any feelings too much ',
 'i think everyone knows this is my job here ',
 'when i m on the courts or when i m on the court playing  i m a competitor and i want to beat every single person whether they re in the locker room or across the net so i m not the one to strike up a conversation about the weather and know that in the next few minutes i have to go and try to win a tennis match ',
 'i m a pretty competitive girl ',
 'i say my hellos  but i m not sending any players flowers as well ',
 'uhm  i m not really friendly or close to many players ',
 'i have not a lot of friends away from the courts  ',
 'when she said she is not really close to a lot of players  is that something strategic that she is doing ',
 'is it different on the men s tour than the women s tour ',
 ' no  not at all 

### **2- Remove Stop Words**

In [55]:
stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [56]:
def remove_stopwords(sentence):
    new_sentences = " ".join([i for i in sentence if i not in stop_words])
    return new_sentences

In [59]:
cleaned_sentences = [remove_stopwords(i.split()) for i in cleaned_sentences]

In [60]:
cleaned_sentences

['maria sharapova basically friends tennis players wta tour',
 'russian player problems openly speaking recent interview said really hide feelings much',
 'think everyone knows job',
 'courts court playing competitor want beat every single person whether locker room across net one strike conversation weather know next minutes go try win tennis match',
 'pretty competitive girl',
 'say hellos sending players flowers well',
 'uhm really friendly close many players',
 'lot friends away courts',
 'said really close lot players something strategic',
 'different men tour women tour',
 '',
 'think sport mean friends everyone categorized tennis player going get along tennis players',
 'think every person different interests',
 'friends completely different jobs interests met different parts life',
 'think everyone thinks tennis players greatest friends',
 'ultimately tennis small part',
 'many things interested',
 'basel switzerland ap roger federer advanced th swiss indoors final career beati

### **Use Glove Pre-Trained Word Vector**

In [62]:
word_embedding = {}
file = open('glove.6B.100d.txt',encoding  = 'utf-8')

In [63]:
for line in file:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype = 'float32')
    word_embedding[word] = coefs
file.close()

In [64]:
sentence_vectors = []
for i in cleaned_sentences:
    if len(i) != 0:
        vector = sum([word_embedding.get(w, np.zeros((100,))) for w in i.split()]) / len(i.split())+ 0.001
    else:
        vector = np.zeros((100,))
    sentence_vectors.append(vector)
print(sentence_vectors)

[array([ 0.05248899,  0.1115585 ,  0.6960863 ,  0.19019175, -0.09481975,
        0.32132903,  0.27169585,  0.54350865, -0.30497628, -0.15538362,
        0.3711739 ,  0.08195937,  0.00941499,  0.24860251, -0.368389  ,
       -0.07511401,  0.08186837,  0.2316725 , -0.26943624,  0.51489264,
       -0.0602625 ,  0.38894886,  0.10413426,  0.7735913 ,  0.26099274,
       -0.07861687,  0.14316137, -0.961765  ,  0.75599873,  0.06133361,
       -0.45762748,  0.23780991,  0.23018129, -0.15547289,  0.3986824 ,
       -0.022275  , -0.50458425,  0.4143045 , -0.28479502, -0.13424838,
       -0.13611525, -0.14799123,  0.33857974, -0.34858418,  0.15450363,
       -0.23237082, -0.19748561, -0.1268375 ,  0.50912744, -0.367683  ,
       -0.22750087, -0.31434616,  0.1371665 ,  0.22353502,  0.120515  ,
       -1.7092874 , -0.10341668,  0.34638995,  0.55548877,  0.79233503,
       -0.2626269 ,  0.50224644, -0.15393819,  0.24079238, -0.048445  ,
       -0.13842238, -0.0059623 ,  0.4533    ,  0.14551626, -0.1

In [65]:
len(sentence_vectors)

119

## **Initialize Similarity Matrix**

In [70]:
similarity_matrix = np.zeros([len(sentences), len(sentences)])

In [72]:
#generate similarity matrix of sentences
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            similarity_matrix[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]

In [73]:
nx_graph = nx.from_numpy_array(similarity_matrix)
scores = nx.pagerank(nx_graph)

In [74]:
ranked_sentences  = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse = True)

## **Generate Summary**

In [79]:
number_of_summary = 5
for i in range(number_of_summary):
    print(ranked_sentences[i][1])

When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.
Major players feel that a big event in late November combined with one in January before the Australian Open will mean too much tennis and too little rest.
Speaking at the Swiss Indoors tournament where he will play in Sundays final against Romanian qualifier Marius Copil, the world number three said that given the impossibly short time frame to make a decision, he opted out of any commitment.
"I felt like the best weeks that I had to get to know players when I was playing were the Fed Cup weeks or the Olympic weeks, not necessarily during the tournaments.
Currently in ninth place, Nishikori with a win could move to within 125 points of the cut for the eight-man event in London 