# PyAbst Example

## Introduction

In [6]:
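# Requires the NLTK 'stopwords' corpus and 'punkt' tokenizer data ('punkt_tab' on recent NLTK releases); see nltk.download.
import nltk
from collections import Counter
from nltk.corpus import stopwords
import re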

### Example News Article: Emma Haruka Iwao smashes pi world record with Google help

The value of the number pi has been calculated to a new world record length of 31 trillion digits, far past the previous record of 22 trillion. Emma Haruka Iwao, a Google employee from Japan, found the new digits with the help of the company's cloud computing service.<br>

Pi is the number you get when you divide a circle's circumference by its diameter. The first digits, 3.14, are well known but the number is infinitely long. Extending the known sequence of digits in pi is very difficult because the number follows no set pattern.<br>

Pi is used in engineering, physics, supercomputing and space exploration - because its value can be used in calculations for waves, circles and cylinders. The pursuit of longer versions of pi is a long-standing pastime among mathematicians. And Ms Iwao said she had been fascinated by the number since she had been a child.<br>

The calculation required 170TB of data (for comparison, 200,000 music tracks take up 1TB) and took 25 virtual machines 121 days to complete.<br>
<br>
<i>Source: https://www.bbc.co.uk/news/technology-47524760</i>


### Example News Article Ingest

In [4]:
input_text = '''The value of the number pi has been calculated to a new world record length of 31 trillion digits, far past the previous record of 22 trillion. Emma Haruka Iwao, a Google employee from Japan, found the new digits with the help of the companys cloud computing service. Pi is the number you get when you divide a circles circumference by its diameter. The first digits, 3.14, are well known but the number is infinitely long. Extending the known sequence of digits in pi is very difficult because the number follows no set pattern. Pi is used in engineering, physics, supercomputing and space exploration - because its value can be used in calculations for waves, circles and cylinders. The pursuit of longer versions of pi is a long-standing pastime among mathematicians. And Ms Iwao said she had been fascinated by the number since she had been a child. The calculation required 170TB of data (for comparison, 200,000 music tracks take up 1TB) and took 25 virtual machines 121 days to complete.'''

### PyAbst Source Code

In [17]:
def PyAbst(text, target_words=(), word_weight=1):
    '''Return the highest-scoring sentences of text, scored by summed word frequency, and the percentage by which the text was reduced.'''


    # Define StopWords corpus.
    new_stopwords = ['said', 'so']  # Additional StopWords.
    stopwords = nltk.corpus.stopwords.words('english')
    stopwords = stopwords + new_stopwords


    # Build upper-case, lower-case and capitalised variants of each target word.
    target_words_combinations = []
    for i in target_words:
        target_words_combinations.extend([i.upper(), i.lower(), i.capitalize()])


    # Unprocessed text.
    input_text = text  # Kept to compute the reduction percentage.
    text = text.replace('?', '.')  # Treat ? as a sentence terminator.
    text = text.replace('!', '.')  # Treat ! as a sentence terminator.
    text = [[x] for x in text.split('.')]  # One single-element list per sentence. NOTE: this also splits decimals such as 3.14.
    text = [[x.lstrip() for x in listx] for listx in text]  # Remove leading whitespace from each sentence.
    text = [[x + '.' for x in listx] for listx in text]  # Restore the period at the end of each sentence.
    text = text[:-1]  # Drop the trailing ['.'] entry; assumes the text ends with a terminator.

    processed_text = [] #NOTE: This is a duplicate of unprocessed text with additional processing methods.
    for i in text:
        i = [re.sub(r'[^\w\s]', '', j).lower() for j in i]  # Remove punctuation and lower all words.
        i = [nltk.word_tokenize(j) for j in i]  # Tokenize words using NLTK.
        i = [item for sublist in i for item in sublist]  # Flatten the list of lists.
        i = [j for j in i if j not in stopwords]  # Remove StopWords using NLTK.
        processed_text.append(i)


    sentences_unpacked = [item for sublist in processed_text for item in sublist] # Unpacked list of lists.


    def replace_list_dict(words, dictionary):
        '''Pair each word with its count from the dictionary: word -> (word, count).'''
        return [(item, dictionary.get(item, 0)) for item in words]


    sentences_list = []
    for i in processed_text: # Converts list (word) element to (word, frequency).
        sentences_list.append(replace_list_dict(i, Counter(sentences_unpacked)))


    weighted_sentences_list = []
    for i in sentences_list:  # Replaces list (word, frequency) element with (word, frequency * weight) if word is in target list.
        weighted_sentences_list.append([(t[0], t[1] * word_weight) if t[0] in target_words_combinations else (t[0], t[1]) for t in i])


    sentences_list_scores = []
    for i in weighted_sentences_list:
        sum_score = sum(x[1] for x in i)  # Sum the (weighted) frequencies of the words in this sentence.
        sentences_list_scores.append(sum_score)


    sentences_length = []
    for i in processed_text:
        sentence_length = len(i) # Evaluate how many words are in each sentence.
        sentences_length.append(sentence_length)


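    # (score, score per word). Computed for reference only: the ranking below uses the raw scores.
    weighted_scores = [(x, x // y if y else 0) for x, y in zip(sentences_list_scores, sentences_length)]
    index = max(1, int(len(processed_text) * 0.4))  # Number of sentences to return: 40% of the total, at least one.
    # Indexes of the top-scoring sentences, restored to document order.
    reduced_indexes = sorted(sorted(range(len(sentences_list_scores)), key=lambda i: sentences_list_scores[i])[-index:])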


    reduced_text = list(text[i] for i in reduced_indexes) # Only return the sentences with index in reduced_indexes.
    reduced_text = [item for sublist in reduced_text for item in sublist] # Unpacked list of lists.
    reduced_text = ' '.join(reduced_text)


    # Evaluate text reduction percentage.
    original_length = len(input_text)
    reduced_length = len(reduced_text)
    percentage_diff = str(int(((original_length - reduced_length) / original_length) * 100)) + '%'


    return reduced_text, percentage_diff
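
The core scoring idea can be illustrated without NLTK. The block below is a simplified sketch, not the function above: `tokenize`, `summarize` and the tiny `STOPWORDS` set are hypothetical names introduced here, target-word weighting is left out, and keeping at least one sentence is a choice made for the sketch.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; PyAbst uses NLTK's English list.
STOPWORDS = {"the", "a", "an", "of", "is", "in", "and", "to", "it", "on"}


def tokenize(sentence):
    """Lowercase a sentence, strip punctuation and drop stopwords."""
    words = re.sub(r"[^\w\s]", "", sentence).lower().split()
    return [w for w in words if w not in STOPWORDS]


def summarize(sentences, fraction=0.4):
    """Keep the top-scoring fraction of sentences, in their original order."""
    tokenized = [tokenize(s) for s in sentences]
    # Frequency of every remaining word across the whole text.
    freq = Counter(w for words in tokenized for w in words)
    # A sentence's score is the sum of the frequencies of its words.
    scores = [sum(freq[w] for w in words) for words in tokenized]
    keep = max(1, int(len(sentences) * fraction))
    top = sorted(range(len(sentences)), key=lambda i: scores[i])[-keep:]
    return [sentences[i] for i in sorted(top)]


sentences = [
    "Pi is a number.",
    "Pi digits follow no pattern.",
    "Cats sleep a lot.",
    "The number pi has many digits.",
    "Dogs bark.",
]
print(summarize(sentences))
```

Because the score is a raw sum, longer sentences that reuse frequent words tend to rank highest, which is also how the ranking in `PyAbst` behaves.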

### Example 1: Default Arguments

In [18]:
summary, reduction = PyAbst(input_text, [], 1)  # Call once and unpack both results.
print(summary, '\n')
print('Original text reduced by: ', reduction)

The value of the number pi has been calculated to a new world record length of 31 trillion digits, far past the previous record of 22 trillion. Emma Haruka Iwao, a Google employee from Japan, found the new digits with the help of the companys cloud computing service. Extending the known sequence of digits in pi is very difficult because the number follows no set pattern. Pi is used in engineering, physics, supercomputing and space exploration - because its value can be used in calculations for waves, circles and cylinders. 

Original text reduced by:  46%


### Example 2: Suppressing the word 'Japan'

In [19]:
summary, reduction = PyAbst(input_text, ['Japan'], -20000)  # Call once and unpack both results.
print(summary, '\n')
print('Original text reduced by: ', reduction)

The value of the number pi has been calculated to a new world record length of 31 trillion digits, far past the previous record of 22 trillion. Extending the known sequence of digits in pi is very difficult because the number follows no set pattern. Pi is used in engineering, physics, supercomputing and space exploration - because its value can be used in calculations for waves, circles and cylinders. The calculation required 170TB of data (for comparison, 200,000 music tracks take up 1TB) and took 25 virtual machines 121 days to complete. 

Original text reduced by:  45%
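
The effect of a large negative weight can be seen in isolation. The following is a minimal sketch that assumes the sentences are already tokenized; `sentence_score` is a hypothetical helper that mirrors the weighting step in `PyAbst`, where the frequency of a target word is multiplied by `word_weight`.

```python
from collections import Counter


def sentence_score(words, freq, target_words=(), word_weight=1):
    """Sum word frequencies, scaling the frequency of target words by word_weight."""
    targets = {t.lower() for t in target_words}
    return sum(freq[w] * word_weight if w in targets else freq[w] for w in words)


# Hypothetical, already-tokenized sentences (lowercased, stopwords removed).
tokenized = [
    ["google", "employee", "japan", "found", "digits"],
    ["pi", "digits", "follow", "pattern"],
]
freq = Counter(w for words in tokenized for w in words)

plain = [sentence_score(words, freq) for words in tokenized]
suppressed = [sentence_score(words, freq, ["Japan"], -20000) for words in tokenized]
print(plain)       # [6, 5]
print(suppressed)  # [-19995, 5]
```

With the default weight the first sentence ranks higher. Once 'japan' carries a weight of -20000, its sentence falls to the bottom of the ranking and drops out of the summary.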


### Example 3: Amplifying the word 'number'

In [20]:
summary, reduction = PyAbst(input_text, ['number'], 200)  # Call once and unpack both results.
print(summary, '\n')
print('Original text reduced by: ', reduction)

The value of the number pi has been calculated to a new world record length of 31 trillion digits, far past the previous record of 22 trillion. Pi is the number you get when you divide a circles circumference by its diameter. Extending the known sequence of digits in pi is very difficult because the number follows no set pattern. And Ms Iwao said she had been fascinated by the number since she had been a child. 

Original text reduced by:  58%


### Smmry Comparison

When the same text is submitted to Smmry ([www.smmry.com](www.smmry.com)) with the summary length set to 5 sentences, the result closely matches Example 3. The only difference is that PyAbst returns <i>'And Ms Iwao said she had been fascinated by the number since she had been a child.'</i> in place of <i>'The first digits, 3.14, are well known but the number is infinitely long.'</i>. The Smmry output is:

The value of the number pi has been calculated to a new world record length of 31 trillion digits, far past the previous record of 22 trillion. Pi is the number you get when you divide a circles circumference by its diameter. The first digits, 3.14, are well known but the number is infinitely long. Extending the known sequence of digits in pi is very difficult because the number follows no set pattern.
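
One possible contributor to this difference is sentence splitting. `PyAbst` splits on every period, so the decimal point in 3.14 breaks that sentence into two fragments that are scored separately. The sketch below shows an alternative that splits only where a terminator is followed by whitespace. `split_sentences` is a hypothetical helper, not part of `PyAbst`, and it would still split after abbreviations such as "Ms." when they are followed by a space.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' only when the terminator is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


sample = "The first digits, 3.14, are well known. Pi is infinitely long! Is it random?"

naive = sample.split(".")          # Splits inside 3.14 as well.
safer = split_sentences(sample)    # Leaves the decimal intact.
print(naive)
print(safer)
```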