In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
from nltk import tokenize
from nltk.corpus import stopwords
import re
import contractions

# TextRank

TextRank is an extractive text summarization algorithm based on PageRank, the link-analysis algorithm used by Google. It represents each sentence as a vector (in our case, the average of the sentence's Word2Vec word embeddings) and calculates the cosine similarity between every pair of sentence vectors. The similarity matrix is then converted to a weighted graph, with the sentences as vertices and the pairwise similarities as edge weights.
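
As a rough illustration of the similarity step, the sketch below builds a pairwise cosine-similarity matrix in plain Python so it runs without the notebook's dependencies. The names `cosine`, `similarity_matrix`, and `toy_vectors` are hypothetical, and the 2-D vectors are made up for illustration; they are not Word2Vec embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Pairwise cosine similarities with a zero diagonal (no self-loops)."""
    n = len(vectors)
    return [
        [cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]

# Hypothetical 2-D "sentence vectors", for illustration only.
toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
toy_sim = similarity_matrix(toy_vectors)
```

Each entry `toy_sim[i][j]` plays the role of the edge weight between sentence `i` and sentence `j` in the graph.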

The output of PageRank is a probability distribution over pages: the likelihood that a user clicking links at random ends up on each page. In our case, we interpret this likelihood as the importance of a sentence. We then take the top k sentences and concatenate them into a summary.
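
To make the ranking step concrete, here is a minimal power-iteration sketch of PageRank on a weighted graph, followed by top-k selection. It is not the NetworkX implementation used later in this notebook. It assumes non-negative edge weights, a damping factor of 0.85, and a hypothetical 3-sentence similarity matrix (`toy_weights`); the function names are illustrative as well.

```python
def pagerank_scores(weights, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration over a weighted adjacency matrix (list of lists).

    Assumes non-negative weights. Returns scores that sum to 1.
    """
    n = len(weights)
    row_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        new_scores = []
        for j in range(n):
            incoming = 0.0
            for i in range(n):
                if row_sums[i] > 0:
                    # Node i passes its score along edges in proportion to weight.
                    incoming += scores[i] * weights[i][j] / row_sums[i]
                else:
                    # Node with no outgoing weight: spread its score evenly.
                    incoming += scores[i] / n
            new_scores.append((1.0 - damping) / n + damping * incoming)
        converged = sum(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores

def top_k(scores, sentences, k):
    """Return the k sentences with the highest scores, best first."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:k]]

# Hypothetical similarity matrix: sentence B is strongly tied to both A and C.
toy_weights = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.8],
    [0.1, 0.8, 0.0],
]
toy_sentences = ["sentence A", "sentence B", "sentence C"]
toy_scores = pagerank_scores(toy_weights)
toy_summary = top_k(toy_scores, toy_sentences, k=2)
```

Because the scores form a probability distribution, they can be compared directly across sentences; the sentence with the strongest overall connections ends up ranked first.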

In [2]:
data = pd.read_csv('data/cleaned_data.csv')
data = data[['Summary', 'Text']]

In [3]:
""" Here we set text to lower case, expand contractions, remove parentheticals,
    strip quotes and possessive 's, remove punctuation, remove stopwords,
    and remove short words """

stop = set(stopwords.words('english'))
def clean_text(text):
    ret = text.lower()
    ret = contractions.fix(ret)            # expand contractions on the lowercased text
    ret = re.sub(r'\([^)]*\)', '', ret)    # drop parentheticals
    ret = re.sub('"','', ret)              # drop double quotes
    ret = re.sub(r"'s\b","", ret)          # strip possessive 's
    ret = re.sub("[^a-zA-Z]", " ", ret)    # replace remaining non-letters with spaces

    #Remove stopwords and any words shorter than 3 letters
    long_words = [w for w in ret.split() if w not in stop and len(w) >= 3]
    return " ".join(long_words)

In [4]:
test = data['Text'][0]
label = data['Summary'][0]
text = test.replace('\\', '').replace('/', '').replace('.,', '.').replace('.;,', '.')
lbl = label.replace('\\', '').replace('/', '').replace('.,', '. ').replace('.;,', '. ')

In [5]:
from nltk.tokenize import sent_tokenize

sentence = sent_tokenize(text)
clean = [clean_text(sen) for sen in sentence]

#Word2Vec
from gensim.models import Word2Vec

all_words = [i.split() for i in clean]
model = Word2Vec(all_words, min_count=1, vector_size=300)

#Sentence vector = mean of the sentence's word vectors
sent_vector=[]
for words in all_words:
    if words:
        sent_vector.append(np.mean([model.wv[w] for w in words], axis=0))
    else:
        #Sentence is empty after cleaning; use a zero vector to avoid dividing by zero
        sent_vector.append(np.zeros(model.wv.vector_size))

In [6]:
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm

#Generate Similarity Matrix
mat = np.zeros([len(sentence), len(sentence)])
for i in tqdm(range(len(sentence))):
    for j in range(len(sentence)):
        if i != j:
            mat[i][j] = cosine_similarity(sent_vector[i].reshape(1,-1), sent_vector[j].reshape(1, -1))[0,0]

100%|███████████████████████████████████████████| 29/29 [00:00<00:00, 80.85it/s]


In [7]:
import networkx as net

#Implement TextRank using pre-built PageRank
#(pagerank_numpy was deprecated and then removed in NetworkX 3.0, so use pagerank)
graph = net.from_numpy_array(mat)
scores = net.pagerank(graph)


In [18]:
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentence)), reverse=True)

result = ""
for i in range(5):
    result += " " + ranked_sentences[i][1]
print(result)

 Turning one wall into a chalkboard gives you a perfect space for ideas, sketches, and planning without requiring extra equipment or space. Paint over jars or storage equipment, allowing you to relabel them with chalk as your needs change. This usually has the opposite effect, leading to lost items and uncertainty when cleaning, but an afternoon with a label maker can solve everything. The upper reaches of the room are often the most under-utilized, but provide vital space for all your tools and materials. Discard trash or unnecessary materials and wipe down dirty surfaces.


In [32]:
!pip3 install rouge-score


# Scoring

For scoring, I first extract keywords from the TextRank output and from the human-written summary (the label). The overlap between the two keyword sets tells us how many match, which gives a precision and a recall score, and thus an F1 score. I also calculate the ROUGE-1 score, both between the concatenated keyword lists and between the summaries themselves.
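
The keyword-overlap scoring could look something like the sketch below. The helper `keyword_overlap` and both keyword lists are hypothetical and made up for illustration; they are not the keywords extracted later in this notebook.

```python
def keyword_overlap(predicted, reference):
    """Precision, recall, and F1 of predicted keywords against reference keywords."""
    predicted, reference = set(predicted), set(reference)
    matched = predicted & reference
    precision = len(matched) / len(predicted) if predicted else 0.0
    recall = len(matched) / len(reference) if reference else 0.0
    # F1 is the harmonic mean of precision and recall; define it as 0 with no matches.
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1

# Hypothetical keyword lists, for illustration only.
summary_keywords = ["space", "chalkboard", "equipment", "materials"]
label_keywords = ["space", "chalkboard", "supplies", "containers", "label"]
precision, recall, f1 = keyword_overlap(summary_keywords, label_keywords)
```

Here precision is the share of the summary's keywords that also appear in the label, and recall is the share of the label's keywords that the summary recovered.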

In [49]:
data

Unnamed: 0,Summary,Text
0,"Keep related supplies in the same area.,Make a...","If you're a photographer, keep all the necess..."
1,Create a sketch in the NeoPopRealist manner of...,See the image for how this drawing develops s...
2,"Get a bachelor’s degree.,Enroll in a studio-ba...",It is possible to become a VFX artist without...
3,Start with some experience or interest in art....,The best art investors do their research on t...
4,"Keep your reference materials, sketches, artic...","As you start planning for a project or work, ..."
...,...,...
214288,"Consider changing the spelling of your name.,A...","If you have a name that you like, you might f..."
214289,"Try out your name.,Don’t legally change your n...",Your name might sound great to you when you s...
214290,"Understand the process of relief printing.,Exa...",Relief printing is the oldest and most tradit...
214291,"Understand the process of intaglio printing.,L...","Intaglio is Italian for ""incis­ing,"" and corr..."


In [16]:
#Create corpus for tf-idf
corp = data["Text"][:10000]
#for row in data["Text"][:1000]:
#    corp = corp + " " + row
corp = corp.apply(clean_text).to_list()

In [25]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords

class TF_IDF():
    def __init__(self, corpus):
        self.text = corpus
        #Recent scikit-learn versions require stop_words to be a list (or 'english'), not a set
        self.stopwords = stopwords.words("english")
        self.cv = CountVectorizer(max_df=0.85, stop_words=self.stopwords)
        self.wordcount = self.cv.fit_transform(corpus)
    
        self.transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
        self.transformer.fit(self.wordcount)
    
    def sort_vals(self, matrix):
        tuples = zip(matrix.col, matrix.data)
        return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)
    
    def extract_top_k(self, feature_names, items, k=10):
        #Map each of the k highest-scoring features to its rounded tf-idf score
        return {feature_names[idx]: round(score, 3) for idx, score in items[:k]}

    def extract_keywords(self, doc, k=10):
        #get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
        feature_names = self.cv.get_feature_names_out()
        tf_idf_vector = self.transformer.transform(self.cv.transform([doc]))

        #Sort the document's non-zero tf-idf entries from highest to lowest score
        sort_items = self.sort_vals(tf_idf_vector.tocoo())
        keywords = self.extract_top_k(feature_names, sort_items, k)

        print("\nDocument")
        print(doc)
        print("\nKeywords")
        for word in keywords:
            print(word, keywords[word])
        
        return keywords

In [26]:
tf = TF_IDF(corp)
res_keyword = tf.extract_keywords(result, 10)
lbl_keyword = tf.extract_keywords(label, 10)


Document
 Turning one wall into a chalkboard gives you a perfect space for ideas, sketches, and planning without requiring extra equipment or space. Paint over jars or storage equipment, allowing you to relabel them with chalk as your needs change. This usually has the opposite effect, leading to lost items and uncertainty when cleaning, but an afternoon with a label maker can solve everything. The upper reaches of the room are often the most under-utilized, but provide vital space for all your tools and materials. Discard trash or unnecessary materials and wipe down dirty surfaces.

Keywords
equipment 0.248
relabel 0.246
space 0.245
materials 0.233
chalkboard 0.207
utilized 0.2
jars 0.2
maker 0.196
sketches 0.187
uncertainty 0.181

Document
Keep related supplies in the same area.,Make an effort to clean a dedicated workspace after every session.,Place loose supplies in large, clearly visible containers.,Use clotheslines and clips to hang sketches, photos, and reference material.,Use 

In [24]:
from rouge_score import rouge_scorer

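scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
#Note: score() takes (target, prediction). The generated summary is passed first here,
#so the reported precision and recall are swapped relative to the usual convention.
scores = scorer.score(result,
                      label)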

In [21]:
print(result)
print("\n")
print(lbl)

 Turning one wall into a chalkboard gives you a perfect space for ideas, sketches, and planning without requiring extra equipment or space. Paint over jars or storage equipment, allowing you to relabel them with chalk as your needs change. This usually has the opposite effect, leading to lost items and uncertainty when cleaning, but an afternoon with a label maker can solve everything. The upper reaches of the room are often the most under-utilized, but provide vital space for all your tools and materials. Discard trash or unnecessary materials and wipe down dirty surfaces.


Keep related supplies in the same area. Make an effort to clean a dedicated workspace after every session. Place loose supplies in large, clearly visible containers. Use clotheslines and clips to hang sketches, photos, and reference material. Use every inch of the room for storage, especially vertical space. Use chalkboard paint to make space for drafting ideas right on the walls. Purchase a label maker to make yo

In [22]:
scores

{'rouge1': Score(precision=0.3411764705882353, recall=0.30526315789473685, fmeasure=0.32222222222222224),
 'rougeL': Score(precision=0.1411764705882353, recall=0.12631578947368421, fmeasure=0.13333333333333333)}
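
For intuition about what the ROUGE-1 numbers measure, here is a simplified unigram-overlap sketch. It is not the `rouge_score` implementation: it assumes a plain lowercase alphanumeric tokenizer and applies no stemming, so it would not reproduce the scores above exactly. The function name `rouge1` and the two example strings are hypothetical.

```python
import re
from collections import Counter

def rouge1(target, prediction):
    """Simplified ROUGE-1: clipped unigram overlap between target and prediction."""
    target_counts = Counter(re.findall(r"[a-z0-9]+", target.lower()))
    prediction_counts = Counter(re.findall(r"[a-z0-9]+", prediction.lower()))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((target_counts & prediction_counts).values())
    precision = overlap / max(sum(prediction_counts.values()), 1)
    recall = overlap / max(sum(target_counts.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure

# Hypothetical reference (target) and generated (prediction) texts.
p, r, f = rouge1("use chalkboard paint on walls",
                 "paint a wall with chalkboard paint")
```

Precision divides the overlap by the number of tokens in the prediction, while recall divides it by the number of tokens in the target, which is why the argument order matters.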