
### Text Summarizer (#2) - Word Frequencies
**(June 17, 2019)**

An extractive text summarizer that scores sentences by word frequency, following the approach described in https://stackabuse.com/text-summarization-with-nltk-in-python/

In [1]:
#!/usr/bin/env python
# coding: utf-8

import numpy as np
import pandas as pd
import textwrap

# string.punctuation is used to strip punctuation when counting words
import string

import nltk  # Natural Language Toolkit ($ pip install nltk)
nltk.download('stopwords')  # stop-word lists

import nltk.data

from nltk.tokenize import sent_tokenize, word_tokenize, WhitespaceTokenizer
nltk.download('punkt')  # tokenizer models used by sent_tokenize

from nltk.cluster.util import cosine_distance
import networkx as nx

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [0]:
## Function that reads a text file and returns its non-empty lines (paragraphs)
def text_to_para(filename):

    # read the file and close it when done
    with open(filename) as f:
        para_list = f.read().splitlines()

    # drop empty lines
    para_list = [value for value in para_list if value != '']

    return para_list

## function that outputs the sentences in a paragraph
def sents(para): 
    
    return sent_tokenize(para)

## function that takes in a file and outputs a list of all its sentences
def raw_sents(filename):
    
    sent = []
    
    paragraphs = text_to_para(filename)
    
    for paragraph in paragraphs:
        sent += sents(paragraph)
        
    return sent

# Takes in a sentence string and outputs the words in the sentence as a list
def words_sent(sentence): 
    
    # split the sentence on whitespace; punctuation stays attached to the words
    word_list_punct = WhitespaceTokenizer().tokenize(sentence)

    # strip ASCII punctuation (string.punctuation) from each word
    word_list = [elem.translate(str.maketrans('', '', string.punctuation))
                 for elem in word_list_punct]
    return word_list


## function that takes in a file and outputs a list of all its words
def raw_words(filename):
    
    word = []
    
    paragraphs = text_to_para(filename)
    
    for paragraph in paragraphs:
        for sent in sents(paragraph):
            word += words_sent(sent)
        
    return word
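
The punctuation-stripping step in `words_sent` can be tried in isolation. The following is a minimal sketch, not the notebook's exact code: `str.split()` stands in for `WhitespaceTokenizer` so that the example needs only the standard library, and `strip_punct` and `PUNCT_TABLE` are hypothetical names. It shows two side effects worth knowing about: a token made only of ASCII punctuation becomes an empty string, and non-ASCII punctuation such as curly quotes is left untouched because `string.punctuation` covers ASCII only.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def strip_punct(tokens):
    """Remove ASCII punctuation from each token; a token may become empty."""
    return [tok.translate(PUNCT_TABLE) for tok in tokens]


# str.split() is an assumption standing in for WhitespaceTokenizer.
tokens = "Wait, what? Yes - really!".split()
cleaned = strip_punct(tokens)
print(cleaned)  # ['Wait', 'what', 'Yes', '', 'really']

# Dropping empty strings keeps them out of any later word count.
non_empty = [tok for tok in cleaned if tok]
print(non_empty)  # ['Wait', 'what', 'Yes', 'really']

# Curly quotes are not in string.punctuation, so they survive.
curly = strip_punct(['“Los', 'dictadores”'])
print(curly)  # ['“Los', 'dictadores”']
```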


In [0]:
# define filename; from google drive
filename = '/content/sample_essay.txt'

In [0]:
# create list of sentences
sentence_list = raw_sents(filename)

In [0]:
# list of stop "words" in english
stopwords = nltk.corpus.stopwords.words('english')

# count the number of occurrences of each non-stop word

# dictionary mapping each word to its number of occurrences
word_frequencies = {}

# loop through words in the text
for word in raw_words(filename):

    # skip stop words; every other word becomes a dictionary key
    # whose count is incremented by 1
    # (words keep their original case here, so a capitalized stop word
    # such as "The" is not filtered out by the lowercase stop-word list)
    if word not in stopwords:
        word_frequencies[word] = word_frequencies.get(word, 0) + 1
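
The counting loop above can also be written with `collections.Counter` from the standard library. This is a sketch under stated assumptions rather than the notebook's code: the stop-word set and the word list are small hypothetical stand-ins for `nltk.corpus.stopwords.words('english')` and `raw_words(filename)`.

```python
from collections import Counter

# Hypothetical stand-ins for the NLTK stop-word list and the essay's words.
stopwords = {'the', 'a', 'of', 'and'}
words = ['the', 'power', 'of', 'narrative', 'and',
         'the', 'narrative', 'of', 'power', 'narrative']

# Count every word that is not a stop word.
word_counts = Counter(w for w in words if w not in stopwords)

print(word_counts.most_common(2))  # [('narrative', 3), ('power', 2)]
```

A `Counter` is a `dict` subclass, so it could be passed to the later normalization and scoring steps unchanged.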

In [0]:
# convert counts to relative frequencies by dividing
# by the maximum number of occurrences

maximum_frequency = max(word_frequencies.values())

# loop through words in dictionary
for word in word_frequencies.keys():

    # normalize by max frequency
    word_frequencies[word] = word_frequencies[word] / maximum_frequency
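
As a small worked example of this normalization, the sketch below divides a few made-up counts by their maximum. The counts are hypothetical, and the sketch builds a new dictionary (`relative`) instead of overwriting the counts in place, which is a design choice of the sketch and not what the cell above does.

```python
# Hypothetical raw counts for three words.
counts = {'narrative': 3, 'power': 2, 'truth': 1}

# The most frequent word defines the scale.
max_count = max(counts.values())

# Each word's relative frequency lies in (0, 1]; the most frequent word gets 1.0.
relative = {word: count / max_count for word, count in counts.items()}

for word, freq in relative.items():
    print(f"{word}: {freq:.3f}")
```

Keeping the raw counts separate makes it possible to rerun the normalization cell without dividing already-normalized values a second time.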

In [0]:
# computes the score for a sentence by adding the
# frequencies for each word in the sentence

# create empty dictionary where keys are sentences and values are scores
sentence_scores = {}

# loop through sentences
for sent in sentence_list:

    # only consider sentences with fewer than 30 words
    if len(sent.split(' ')) >= 30:
        continue

    # tokenize the lowercased sentence into words
    for word in nltk.word_tokenize(sent.lower()):

        # only words present in the frequency dictionary contribute
        if word in word_frequencies:

            # add the word's frequency to the sentence's score,
            # starting from 0 the first time the sentence is seen
            sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[word]
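
The scoring rule can be packaged as a function and checked on a toy input. This is a minimal sketch, assuming a regular-expression tokenizer in place of `nltk.word_tokenize` and `str.split()` in place of `split(' ')`. The function name `score_sentences`, its `max_words` parameter, and the sample data are all hypothetical.

```python
import re


def score_sentences(sentences, word_freq, max_words=30):
    """Sum the frequencies of known words for each sufficiently short sentence."""
    scores = {}
    for sent in sentences:
        # Skip sentences with max_words or more whitespace-separated words.
        if len(sent.split()) >= max_words:
            continue
        # Assumed tokenizer: runs of lowercase letters and apostrophes.
        tokens = re.findall(r"[a-z']+", sent.lower())
        score = sum(word_freq.get(tok, 0.0) for tok in tokens)
        # Like the loop above, only record sentences with a matching word.
        if score > 0:
            scores[sent] = score
    return scores


word_freq = {'narrative': 1.0, 'power': 0.5}
sentences = ["Narrative has power.", "Nothing relevant here.", "Power."]
scores = score_sentences(sentences, word_freq)
print(scores)  # {'Narrative has power.': 1.5, 'Power.': 0.5}
```

Because the score is a plain sum, longer sentences tend to score higher; the word-count cutoff is what keeps very long sentences out of the summary.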

In [8]:
# select the 7 highest-scoring sentences (in descending score order)
# and join them into a summary
import heapq  

summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)  
print(textwrap.fill(summary, 50))

How could so many identify with her work even
amidst the clear space between Rand’s truth and
the lived truth of people’s lives? Frustration,
confusion, and hopelessness are the negative end
results when one applies narratives outside of the
simple regimes of applicability in which these
stories were first created. In spite of the name,
this framing has nothing superficially — and
perhaps everything deeply — to do with the history
of racism in this country. What unites autocrats
and novelists is the simultaneous — and thus
dangerous — fungibility and potency of their
narratives. And what they take from the
audience — validation of themselves — is what the
audience believes, inaccurately, “Los dictadores”
are providing to them. I imagine that many people
see Díaz’s writing as an outcropping of the
current ethnic and cultural zeitgeist that grips
this country. But what audiences are really
receiving is a constructed reality which fails
constantly to mesh sensibly with an outside world.
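
Because `heapq.nlargest` returns sentences in descending score order, the summary above does not follow the order of the essay. One possible variation, sketched here under the assumption that the sentences are distinct, selects the top `n` sentences and then emits them in their original order. The helper name `top_sentences_in_order` and the sample data are hypothetical.

```python
import heapq


def top_sentences_in_order(sentences, scores, n):
    """Pick the n highest-scoring sentences, then restore document order."""
    top = set(heapq.nlargest(n, scores, key=scores.get))
    return [sent for sent in sentences if sent in top]


sentences = ['A first.', 'B second.', 'C third.', 'D fourth.']
scores = {'A first.': 0.2, 'B second.': 0.7, 'C third.': 0.1, 'D fourth.': 0.9}

by_score = heapq.nlargest(2, scores, key=scores.get)
print(by_score)  # ['D fourth.', 'B second.']

in_order = top_sentences_in_order(sentences, scores, 2)
print(in_order)  # ['B second.', 'D fourth.']
```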
