<a href="https://colab.research.google.com/github/mahyak/Sample_NLP/blob/master/TextRankSummarizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

TextRank is an extractive and unsupervised text summarization technique. The flow of the TextRank algorithm:


*   Combine the articles
*   Split the texts to indivisula sentences
*   Word embedding
*   Calculate the similarities between sentences
*   Convert similarity matrix to graph and score them
*   Use top ranked sentences for summary

Domain: single domain multiple documents summarization task

In [4]:
from google.colab import drive
drive.mount('/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /drive


In [32]:
import numpy as np
import pandas as pd
import re
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx

import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords

from nltk.tokenize import sent_tokenize


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [0]:
df = pd.read_csv('/drive/My Drive/Colab Notebooks/tennis_articles_v4.csv')

sentences = []

for sentence in df['article_text']:
  sentences.append(sent_tokenize(sentence))

sentences = [y for x in sentences for y in x]

word embedding: GloVe

Wikipedia + Gigaword 5 GloVe vectors

In [12]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip

--2020-05-04 15:48:40--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2020-05-04 15:48:40--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2020-05-04 15:48:41--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2020-0

Text preprocessing:


*   Remove punctuations
*   Remove stop words

In [0]:
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ")
clean_sentences = [s.lower() for s in clean_sentences]

stop_words = stopwords.words('english')

def remove_stopwords(sen):
  sentence_new = " ".join([i for i in sen if i not in stop_words])
  return sentence_new

clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

In [0]:
word_embeddings = {}
f = open('glove.6B.100d.txt', encoding='utf-8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    word_embeddings[word] = coefs
f.close()

Create vector for our sentences
1- fetch vectors (each of size 100 elements) for the constituent words in a sentence 
2- take mean/average of those vectors to arrive at a consolidated vector for the sentence

In [0]:
sentences_vectors = []

for sentence in clean_sentences:
  if len(sentence) != 0:
    vector = sum([word_embeddings.get(word, np.zeros((100,))) for word in sentence.split()])/(len(sentence.split())+0.001)
  else:
    vector = np.zeros((100,))

  sentences_vectors.append(vector)

Use cosine similarity approach for finding the similarities between sentences

In [0]:
similarity_matrix = np.zeros([len(sentences), len(sentences)])

for i in range(len(sentence)):
  for j in range(len(sentence)):
    if i != j:
      similarity_matrix[i][j] = cosine_similarity(sentences_vectors[i].reshape(1,100), sentences_vectors[j].reshape(1,100))[0,0]


Convert similarity matrix to graph

*   Node: sentences
*   Edge: similarity score between two nodes



In [0]:
nx_graph = nx.from_numpy_array(similarity_matrix)
scores = nx.pagerank(nx_graph)

Sort sentences based on their scores and seperate top 10

In [45]:
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)

summary = ''

for i in range(10):
  summary += " "+ranked_sentences[i][1]

print(summary)

 When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. I think just because you're in the same sport doesn't mean that you have to be friends with everyone just because you're categorized, you're a tennis player, so you're going to get along with tennis players. When she said she is not really close to a lot of players, is that something strategic that she is doing? I think everyone just thinks because we're tennis players we should be the greatest of friends. Uhm, I'm not really friendly or close to many players. The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much. But ultimately tennis is just a very small part of what we do. He then dropped