# Automatic Text Summarization


**Goal:** Generate a concise summary of the texts.

**How:** Unsupervised extractive summarization using the TextRank algorithm.

**Data:** Manually Annotated Sub-Corpus First Release.


In [1]:
# importing packages & libraries
import numpy as np
import pandas as pd
import nltk
import re
import networkx as nx

from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from sklearn.metrics.pairwise import cosine_similarity

# nltk.download('punkt')
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', -1)
pd.set_option('max_colwidth', 800)

In [2]:
! pip install networkx



### Data

In [3]:
# reading data
df = pd.read_csv('./data/all_data_new.csv')
df = df[['file_name','text']]

In [4]:
df.head(2)

Unnamed: 0,file_name,text
0,110CYL069.txt,"November 15, 1996 Dear Personal Donor: In the short while since Goodwill helped him find his job, Robert has learned to thoroughly clean a motel room in about 40 minutes. His job objectives call for him to do it in 30. He has no time to waste. Neither do we. With the help of friends like you, Goodwill has continued to adapt our services to meet the human needs of our changing society. We don't waste time as we are helping the community. And we don't waste money. The gift that I am asking you to make will be used to continue our mission of helping people prepare for, find and keep jobs. In their December, 1995 review of the nation's best charities, U.S. News & World Report called Goodwill one of th..."
1,110CYL068.txt,"Dear , A few months ago you received a letter from me telling the success stories of people who got jobs with Goodwill's help. Here's another story of success from what might seem like an unlikely source: Goodwill's controller, Juli. She tells me that the 3,666 people we helped find jobs in 1998 earned approximately $49 million dollars. In addition to that, by helping them find jobs, Goodwill reduced the state's Public Support tab by an estimated $4 million. Your gift to Goodwill will help us do even more this year because your gift will be used to directly support our work. What kind of work does Goodwill do? Goodwill finds jobs for people with mental and physical disabilities. After Maureen's job coach taug..."


### Getting sentences

In [5]:
# splitting each text into sentences and flattening them into a single list
sentences = [sent for text in df['text'] for sent in sent_tokenize(text)]

In [6]:
# printing 2 sentences
sentences[:2]

['      November 15, 1996       Dear Personal Donor:       In the short while since Goodwill helped him find his job, Robert has       learned to thoroughly clean a motel room in about 40 minutes.',
 'His job       objectives call for him to do it in 30.']

### Word Embeddings

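**Good:** GloVe vectors place words with similar meanings close together, so sentences can be compared by meaning rather than by exact word overlap.

**Not so good:** The `glove.6B` archive is an 822 MB download.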

In [7]:
!wget http://nlp.stanford.edu/data/glove.6B.zip

--2020-12-21 16:35:11--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu... 171.64.67.140
Connecting to nlp.stanford.edu|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2020-12-21 16:35:12--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2020-12-21 16:35:13--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu... 171.64.64.22
Connecting to downloads.cs.stanford.edu|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: 'glove.6B.zip.3'


2020-12-21 16:46:17 (1.24 MB/s) - 'glove.6B.zip.3' saved [862182613/862182613]



In [9]:
!unzip glove*.zip

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


In [11]:
# extracting word vectors (each line holds a word followed by its 100 coefficients)
word_embeddings = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs
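
The parsing step can be tried without the full download. The sketch below assumes only the line layout used by the GloVe text files (a token followed by space-separated numbers); the two sample lines and the `load_embeddings` helper are made up for illustration and use 3 components instead of 100.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout: a token, then its vector components.
# Real glove.6B.100d.txt lines carry 100 components; 3 are used here for brevity.
sample = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "job 0.4 0.5 0.6\n"
)

def load_embeddings(handle):
    """Parse GloVe-style lines into a {word: vector} dict (hypothetical helper)."""
    embeddings = {}
    for line in handle:
        values = line.split()
        embeddings[values[0]] = np.asarray(values[1:], dtype="float32")
    return embeddings

toy_embeddings = load_embeddings(sample)
print(len(toy_embeddings), toy_embeddings["job"].shape)
```

Passing an open file handle instead of the `StringIO` object would parse the real file the same way.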

In [12]:
# getting the number of words in the embeddings
len(word_embeddings)

400000

In [13]:
# removing punctuation, numbers and special characters
# (regex=True is needed in pandas >= 2.0, where str.replace matches literally by default)
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)

# converting to lowercase
clean_sentences = [s.lower() for s in clean_sentences]

In [14]:
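# defining stop words (needs nltk.download('stopwords') on first use);
# a set makes the membership test in the next cell fast
stop_words = set(stopwords.words('english'))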

In [15]:
# removing stopwords
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

In [16]:
# removing stopwords from the sentences
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]
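
The cleaning steps above can be checked on a single sentence. This is a minimal sketch using only the standard library; the stop-word set is a tiny hypothetical stand-in for NLTK's English list, and `clean_sentence` is an illustrative helper, not part of the notebook.

```python
import re

# Hypothetical, tiny stop-word set; the notebook uses NLTK's English list instead.
toy_stop_words = {"in", "the", "to", "a", "his", "has"}

def clean_sentence(sentence, stop_words):
    """Keep letters only, lowercase, and drop stop words (illustrative helper)."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence)
    tokens = letters_only.lower().split()
    return " ".join(t for t in tokens if t not in stop_words)

cleaned = clean_sentence("His job objectives call for him to do it in 30.", toy_stop_words)
print(cleaned)  # job objectives call for him do it
```

Note that the number "30" disappears along with the punctuation, which is the intended effect of the `[^a-zA-Z]` pattern.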

### Vectors

In [18]:
# creating sentence vectors: the average of each sentence's word vectors
# (words missing from GloVe count as zero vectors)
sentence_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
    else:
        v = np.zeros((100,))
    sentence_vectors.append(v)
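
One consequence of averaging is worth seeing on a toy case: a sentence vector built this way depends on which words occur, not on their order. The sketch below assumes made-up 3-dimensional embeddings and a hypothetical `sentence_vector` helper that mirrors the cell above.

```python
import numpy as np

# Made-up 3-dimensional embeddings; the notebook uses 100-dimensional GloVe vectors.
toy_embeddings = {
    "goodwill": np.array([1.0, 0.0, 0.0]),
    "helps": np.array([0.0, 1.0, 0.0]),
    "people": np.array([0.0, 0.0, 1.0]),
}

def sentence_vector(sentence, embeddings, dim=3):
    """Average the vectors of the words in a cleaned sentence (illustrative helper)."""
    words = sentence.split()
    if not words:
        return np.zeros(dim)
    # Unknown words contribute a zero vector, as in the cell above.
    total = sum(embeddings.get(w, np.zeros(dim)) for w in words)
    return total / (len(words) + 0.001)

a = sentence_vector("goodwill helps people", toy_embeddings)
b = sentence_vector("people helps goodwill", toy_embeddings)
print(np.allclose(a, b))  # same words in a different order
```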

### Similarity Matrix

In [19]:
# defining a similarity matrix
sim_mat = np.zeros([len(sentences), len(sentences)])

In [20]:
# filling the similarity matrix with all pairwise cosine similarities in one call;
# the diagonal is reset to 0 so that no sentence is linked to itself
sim_mat = cosine_similarity(np.vstack(sentence_vectors))
np.fill_diagonal(sim_mat, 0)
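
For intuition about what this matrix holds, here is a NumPy-only sketch that computes all pairwise cosine similarities at once on toy vectors. The `cosine_similarity_matrix` helper is hypothetical; it is meant to illustrate the computation, not to replace scikit-learn's function.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities with a zeroed diagonal (illustrative helper)."""
    matrix = np.vstack(vectors).astype(float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero sentence vectors get similarity 0, not NaN
    unit = matrix / norms
    sims = unit @ unit.T
    np.fill_diagonal(sims, 0.0)  # mirror the `i != j` rule: no self-similarity
    return sims

toy_vectors = [np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.array([0.0, 0.0])]
toy_sim = cosine_similarity_matrix(toy_vectors)
print(toy_sim.round(3))
```

The matrix is symmetric, so the resulting sentence graph is undirected.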

### TextRank Algorithm

In [22]:
# converting the similarity matrix to a weighted graph (nodes = sentences)
nx_graph = nx.from_numpy_array(sim_mat)
# TextRank scores are PageRank scores computed on this sentence graph
scores = nx.pagerank(nx_graph)
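
To make the ranking step less of a black box, the sketch below runs a plain power iteration on a small similarity matrix. It is an illustration of the weighted PageRank idea under the assumption of non-negative weights, not networkx's implementation; `pagerank_scores`, its parameters and the toy matrix are all hypothetical.

```python
import numpy as np

def pagerank_scores(sim, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration on a non-negative, weighted similarity matrix (illustrative)."""
    n = sim.shape[0]
    row_sums = sim.sum(axis=1, keepdims=True)
    has_links = row_sums > 0
    # Each row becomes a probability distribution over neighbours;
    # rows without any weight spread their score uniformly.
    transition = np.where(has_links, sim / np.where(has_links, row_sums, 1.0), 1.0 / n)
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        updated = (1 - damping) / n + damping * (transition.T @ scores)
        if np.abs(updated - scores).sum() < tol:
            return updated
        scores = updated
    return scores

toy_sim = np.array([
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.8],
    [0.1, 0.8, 0.0],
])
toy_scores = pagerank_scores(toy_sim)
print(toy_scores.round(3), int(toy_scores.argmax()))
```

The middle sentence is strongly similar to both of the others, so it collects the highest score; this is the sense in which TextRank treats "central" sentences as summary candidates.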

In [23]:
# defining ranked sentences
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)

In [24]:
# extracting top 10 sentences as the summary
for i in range(10):
    print(ranked_sentences[i][1])

Some are quite unique and their reporters cannot imagine their origin; others, like Penn Station , “what one family terms a child's misinterpretation of a famous line or phrase,” are clear: the generic term comes from the Lord's Prayer—“And lead us not into Penn Station.”  Obviously, that works only for kids familiar with New York City.
Much of it, of course, using the standards of top newspapers, cannot be used since it is largely gossip we are repeating, although it certainly could make for some very good stories.
With this in mind, I am asking each one of you to make a personal       contribution of $50, $100 or even $1,000 to show that you believe in the       work that we do and are willing to support it with both your time and       your finances.
And he saw this as one way of using this event as a way to                 deal with the Iraq problem."
A handful might have led colorful existences, some are objects of interest because they died early, committed suicide, were related 

In [25]:
ranked_sentences

[(0.0005315330584533033,
  "Some are quite unique and their reporters cannot imagine their origin; others, like Penn Station , “what one family terms a child's misinterpretation of a famous line or phrase,” are clear: the generic term comes from the Lord's Prayer—“And lead us not into Penn Station.”  Obviously, that works only for kids familiar with New York City."),
 (0.0005304741547640367,
  'Much of it, of course, using the standards of top newspapers, cannot be used since it is largely gossip we are repeating, although it certainly could make for some very good stories.'),
 (0.0005304110893663992,
  'With this in mind, I am asking each one of you to make a personal       contribution of $50, $100 or even $1,000 to show that you believe in the       work that we do and are willing to support it with both your time and       your finances.'),
 (0.0005292909040685781,
  'And he saw this as one way of using this event as a way to                 deal with the Iraq problem."'),
 (0.0005

In [26]:
len(ranked_sentences)

2229
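
The top-ranked sentences above are printed in score order. If the summary should read in the order the sentences appeared, one option is to select by score and then sort the selection by position. The helper name and the toy data below are assumptions made for illustration; `scores` is modelled as a dict keyed by sentence index, like the one `nx.pagerank` returns.

```python
def top_n_in_document_order(sentences, scores, n):
    """Pick the n highest-scoring sentences, then restore their original order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:n])
    return [sentences[i] for i in chosen]

toy_sentences = ["First point.", "Minor aside.", "Key claim.", "Closing remark."]
toy_scores = {0: 0.30, 1: 0.10, 2: 0.40, 3: 0.20}
summary = top_n_in_document_order(toy_sentences, toy_scores, 2)
print(summary)  # ['First point.', 'Key claim.']
```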