In [24]:
import os
import string
from string import punctuation
import torch
import tensorflow as tf
import transformers
import summa
from summa.summarizer import summarize
import benepar # requires Tensorflow, although we'll use torch otherwise
import nltk
from nltk import tokenize
from nltk.tokenize import sent_tokenize
import re
import spacy
import scipy

In [2]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from sentence_transformers import SentenceTransformer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2") # we'll use GPT2 to generate sentences
# load BERT model
model_BERT = SentenceTransformer('bert-base-nli-mean-tokens') # we'll use BERT to filter sentences based on similarity

HBox(children=(HTML(value='Downloading'), FloatProgress(value=0.0, max=548118077.0), HTML(value='')))




100%|██████████| 405M/405M [00:52<00:00, 7.65MB/s] 


In [3]:
nltk.download("punkt")

[nltk_data] Downloading package punkt to /Users/mark/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
import transformers
import sentence_transformers
print(transformers.__version__)
print(sentence_transformers.__version__)

3.3.1
0.3.8


In [7]:
nlp = spacy.load("en")
#nltk.load("punkt")
benepar.download("benepar_en2")
benepar_parser = benepar.Parser("benepar_en2")

[nltk_data] Downloading package benepar_en2 to
[nltk_data]     /Users/mark/nltk_data...
[nltk_data]   Package benepar_en2 is already up-to-date!














### Part I: Get our text file and do some preprocessing

Let's load our document

In [8]:
text = ""
with open("../sample_texts/amazon.txt", "r") as f:
    text = f.read()

Let's strip any punctuation from our text:

In [9]:
def clean_text(text):
    """
        Wrapper function to perform any text cleaning 
        that we'd want to do
    """
    text = text.strip(punctuation)
    return text

In [10]:
# clean our text
text = clean_text(text)

### Part II: Let's summarize our text, using the summarizer

Here's the implementation, from the `summa` library, that we'll be loading in: 
https://github.com/summanlp/textrank/blob/master/summa/summarizer.py

It is an implementation of the TextRank algorithm, detailed in the following paper: 
https://www.aclweb.org/anthology/W04-3252.pdf

Here's a great summary detailing how the TextRank algorithm works (fun fact - it was inspired by the PageRank algorithm, which inspired the creation of Google by the algorithm's creators, Larry Page and Sergey Brin, cited in this paper: http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf): https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/

In [11]:
def get_sentences(text, ratio = 0.3):
    """
        Get our sentences to use. 
        
    """
    
    # sets up our sentence to be summarized
    sentences = summarize(text, ratio = ratio)
    
    # split by sentence, using sent_tokenize method
    sentences_list = tokenize.sent_tokenize(sentences)
    
    # do some regex cleaning
    cleaned_sentences_list = [re.split(r'[:;]+', x)[0] for x in sentences_list]
    
    return cleaned_sentences_list

In [12]:
# let's see how this looks
cleaned_text = get_sentences(text)
cleaned_text

['In 2002, the corporation started Amazon Web Services (AWS), which provided data on Web site popularity, Internet traffic patterns and other statistics for marketers and developers.',
 'That same year, the company started Fulfillment by Amazon which managed the inventory of individuals and small companies selling their belongings through the company internet site.',
 "Amazon.com's product lines available at its website include several media (books, DVDs, music CDs, videotapes and software), apparel, baby products, consumer electronics, beauty products, gourmet food, groceries, health and personal-care items, industrial & scientific supplies, kitchen items, jewelry, watches, lawn and garden items, musical instruments, sporting goods, tools, automotive items and toys & games.",
 'Amazon first launched its distribution network in 1997 with two fulfillment centers in Seattle and New Castle, Delaware.']

Let's remove the punctuation again. 

In [13]:
cleaned_text = [clean_text(x) for x in cleaned_text]

Let's see how this looks so far:

In [14]:
for sentence in cleaned_text:
    print(sentence)
    print("\n")

In 2002, the corporation started Amazon Web Services (AWS), which provided data on Web site popularity, Internet traffic patterns and other statistics for marketers and developers


That same year, the company started Fulfillment by Amazon which managed the inventory of individuals and small companies selling their belongings through the company internet site


Amazon.com's product lines available at its website include several media (books, DVDs, music CDs, videotapes and software), apparel, baby products, consumer electronics, beauty products, gourmet food, groceries, health and personal-care items, industrial & scientific supplies, kitchen items, jewelry, watches, lawn and garden items, musical instruments, sporting goods, tools, automotive items and toys & games


Amazon first launched its distribution network in 1997 with two fulfillment centers in Seattle and New Castle, Delaware




### Part III: Split up our sentences, using the Berkley Constituency parser

We're going to use the Berkley Constituency parser to split a sentence at the ending verb phrase or noun phrase. 

For example, if the sentence were:

`Jeff Bezos was working on Wall Street before he started Amazon, but he moved west because of the opportunities there`, 

we can take several approaches to generate false sentences based off this, such as changing a verb (e.g., changing "started" to "left"), changing a noun (e.g., changing "Amazon" to "Microsoft"), adding negation (e.g., changing "working on Wall Street" to "not working on Wall Street") or changing a named entity (e.g,. changing "Jeff Bezos" to "Bill Gates")

For our use case, we'll start by just changing a noun phrase or changing a verb phrase. In our particular implementation, we'll split the sentence at the ending verb phrase or noun phrase. 

If we were to split at the end of the last verb phrase for our example above, we'd get something like:

`["Jeff Bezos was working on Wall Street before he started Amazon, but he", "moved west because of the opportunities there"]`

Now, what we can do is take the first part of our phrase (`"Jeff Bezos was working on Wall Street before he started Amazon, but he"`) and we can ask GPT2 to complete the sentence for us. 

Similarly, if we were to split at the end of the last noun phrase for our example, we'd get something like:

`["Jeff Bezos was working on Wall Street before he started Amazon, but he moved west because of the ", "opportunities there"]`

We can take the first part of the phrase (`"Jeff Bezos was working on Wall Street before he started Amazon, but he moved west because of the "`) and ask GPT2 to complete the sentence for us. 

In this way, we can create false sentences using GPT2. We leverage the Berkley Constituency parser because it parses our sentence in a way such that we'll be able to isolate the last verb phrase or last noun phrase (depending on what comes last in the tree)

In [15]:
def get_flattened(tree):
    
    """
        Flattens the tree structure that we'll get from the Berkley
        parser, to allow us to easily work with it
    """
    
    final_sentence_str = None
    if tree is not None:
        sent_str = [" ".join(x.leaves()) for x in list(tree)]
        final_sentence_str = [" ".join(sent_str)][0]
    return final_sentence_str
        

In [16]:
def get_last_portion(main_string, substring):
    
    """
        Here, we get a string of our last verbphrase or
        nounphrase. 
    """
    
    combined_substring = substring.replace(" ", "")
    
    main_string_list = main_string.split()
    
    last_index = len(main_string_list)
    
    for i in range(last_index):
        
        check_string_list = main_string_list[i:]
        
        check_string = "".join(check_string_list)
        
        if check_string == combined_substring:
            return " ".join(main_string_list[:i])
        
    return None

In [17]:
def get_rightmost_VP_or_NP(tree, last_NP = None, last_VP = None):
    
    """
    
        Recursive function, to get the rightmost verb phrase (VP) 
        or noun phrase (NP), which corresponds to the VP or NP 
        that occurs at the end of the sentence
        
    """
    
    # if we don't have more nodes to traverse, we know we've hit the end
    if len(tree.leaves()) == 1:
        return get_flattened(last_NP), get_flattened(last_VP)
    
    # get our last subtree
    last_subtree = tree[-1]
    
    # check if we either have NP or VP:
    if last_subtree.label() == "NP":
        last_NP = last_subtree
    elif last_subtree.label() == "VP":
        last_VP = last_subtree
        
    return get_rightmost_VP_or_NP(last_subtree, last_NP, last_VP)
    

In [18]:
def get_sentence_completions(all_sentences):
    
    """
        Returns a dictionary of our sentences as well 
        as the same sentences, just without their terminal
        VP or NP
    """
    
    sentence_completion_dict = {}
    
    # loop through all of our sentences
    for individual_sentence in all_sentences:

        # parse any additional punctuation
        sentence = individual_sentence.strip(r"?:!.,;") 
        
        # get parsed tree
        tree = benepar_parser.parse(sentence)
        
        last_NP, last_VP = get_rightmost_VP_or_NP(tree)
        
        phrases = []
        
        if last_VP is not None:
            VP_string = get_last_portion(sentence, last_VP)
            if VP_string is not None:
                phrases.append(VP_string)
            else:
                phrases.append("")
        if last_NP is not None:
            NP_string = get_last_portion(sentence, last_NP)
            if NP_string is not None:
                phrases.append(NP_string)
            else:
                phrases.append("")
             
        # get our sentence that we want GPT2 to complete
        longest_phrase = sorted(phrases, key=len, reverse=True)
        
        if len(longest_phrase) == 2:
            first_sentence_len = len(longest_phrase[0].split())
            second_sentence_len = len(longest_phrase[1].split())
            
            if (first_sentence_len - second_sentence_len) > 4:
                del longest_phrase[1]
                
        if len(longest_phrase) > 0:
            sentence_completion_dict[sentence] = longest_phrase
            
    return sentence_completion_dict
            

Now that we've defined the functions, let's get our sentences for GPT2 to complete:

In [19]:
sentence_completion_dict = get_sentence_completions(cleaned_text)

In [20]:
sentence_completion_dict

{'In 2002, the corporation started Amazon Web Services (AWS), which provided data on Web site popularity, Internet traffic patterns and other statistics for marketers and developers': ['In 2002, the corporation started Amazon Web Services (AWS), which provided data on Web site popularity, Internet traffic patterns and other statistics for'],
 'That same year, the company started Fulfillment by Amazon which managed the inventory of individuals and small companies selling their belongings through the company internet site': ['That same year, the company started Fulfillment by Amazon which managed the inventory of individuals and small companies selling their belongings through',
  'That same year, the company started Fulfillment by Amazon which managed the inventory of individuals and small companies'],
 "Amazon.com's product lines available at its website include several media (books, DVDs, music CDs, videotapes and software), apparel, baby products, consumer electronics, beauty product

### Part IV: Filter sentences and generate false sentences

For our use case, we'll use GPT2 to generate false sentences, and we'll use BERT to determine the similarity of our sentences (since we only want to keep the sentences that are not similar to our original sentence, so that they will be clearly, unequivocally false)


In [21]:
def sort_by_similarity(original_sentence, new_sentences_list, num_vals = 3):
    
    """
    
        Sort our GPT-2 generated sentences by how similar they are to our original sentence. 
        We want to select sentences that are not similar to our original sentence (since these
        are going to be the statements that are most clearly false)
        
        We will use BERT to perform the similarity calculation
        
        Args:
            • original_sentence: our original sentence
            • new_sentences_list: our new fake sentences
            • num_vals: number of dissimilar sentences that we want to use
        
    """
    
    # encode the sentences from GPT2 into BERT's format (each sentence is a 1-D vector with 768 columns)
    sentence_embeddings = model_BERT.encode(new_sentences_list)
    
    # do same for original sentence
    original_sentence_list = [original_sentence]
    original_sentence_embeddings = model_BERT.encode(original_sentence_list)
    
    # get number of matches, then loop through and sort by dissimilarity
    number_top_matches = len(new_sentences_list)
    
    dissimilar_sentences = []
    
    for query, query_embedding in zip(original_sentence_list, original_sentence_embeddings):

        # calculate distance between original sentence and false sentences
        distances = scipy.spatial.distance.cdist([query_embedding], sentence_embeddings, "cosine")[0]
        
        # get list of distances + indices, then sort
        results = zip(range(len(distances)), distances)
        results = sorted(results, key=lambda x:x[1])
        
        # get dissimilarity score (our distance so far shows)
        # how close they are, so we want embeddings
        # that are far away from the sentence embedding
        for idx, distance in reversed(results[0:number_top_matches]):
            
            score = 1 - distance
            
            if score < 0.9: # arbitrary threshold
                dissimilar_sentences.append(new_sentences_list[idx].strip())
                
    # sort the dissimilar sentences ascending, and get first n (so, lowest scores = furthest away)
    sorted_dissimilar_sentences = sorted(dissimilar_sentences, key = len)
    
    return sorted_dissimilar_sentences[:num_vals]
                
        
    

In [22]:
def generate_sentences(partial_sentence, full_sentence):
    
    """
        Generate false sentences, using GPT2, based off partial sentence
    """
    
    input_ids = torch.tensor([tokenizer.encode(partial_sentence)])
    
    maximum_length = len(partial_sentence.split()) + 80
    
    # get outputs
    sample_outputs = model.generate(input_ids, 
                                    do_sample = True,
                                    max_length = maximum_length, 
                                    top_p = 0.9, 
                                    top_k = 50, 
                                    repitition_penalty = 10.0,
                                    num_return_sequences = 10)
    generated_sentences = []
    
    for i, sample_output in enumerate(sample_outputs):
        
        decoded_sentences = tokenizer.decode(sample_output, skip_special_tokens = True)
        decoded_sentences_list = tokenize.sent_tokenize(decoded_sentences)
        generated_sentences.append(decoded_sentences_list[0])
        
    top_3_sentences = sort_by_similarity(full_sentence, generated_sentences)
    
    return top_3_sentences
    
    
    
    

In [25]:
index = 1
choice_list = ["a)","b)","c)","d)","e)","f)"]
for key_sentence in sentence_completion_dict:
    
    # get our partial sentence
    partial_sentences = sentence_completion_dict[key_sentence]
    
    # start creating false sentences
    false_sentences = []
    print(f"Our true sentence: {key_sentence}")
    
    # loop through partial sentencesf
    for partial_sentence in partial_sentences:
        
        # create our false sentences
        false_sents = generate_sentences(partial_sentence, key_sentence)
        
        false_sentences.extend(false_sents)
        
    print("False sentences (created by GPT2):")
    
    for idx, false_sent in enumerate(false_sentences):
        
        print(f"{choice_list[idx]} {false_sent}")
        
    index = index + 1
    
    print("\n\n")
    

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


Our true sentence: In 2002, the corporation started Amazon Web Services (AWS), which provided data on Web site popularity, Internet traffic patterns and other statistics for marketers and developers


Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


False sentences (created by GPT2):



Our true sentence: That same year, the company started Fulfillment by Amazon which managed the inventory of individuals and small companies selling their belongings through the company internet site


Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence
Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


False sentences (created by GPT2):
a) That same year, the company started Fulfillment by Amazon which managed the inventory of individuals and small companies selling their belongings through an online marketplace and was now taking orders for $500.
b) That same year, the company started Fulfillment by Amazon which managed the inventory of individuals and small companies for the U.S. Government.
c) That same year, the company started Fulfillment by Amazon which managed the inventory of individuals and small companies in three countries: Mexico, South America and Japan.



Our true sentence: Amazon.com's product lines available at its website include several media (books, DVDs, music CDs, videotapes and software), apparel, baby products, consumer electronics, beauty products, gourmet food, groceries, health and personal-care items, industrial & scientific supplies, kitchen items, jewelry, watches, lawn and garden items, musical instruments, sporting goods, tools, automotive items and to

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


False sentences (created by GPT2):



Our true sentence: Amazon first launched its distribution network in 1997 with two fulfillment centers in Seattle and New Castle, Delaware
False sentences (created by GPT2):
a) Amazon first launched its distribution network in 1997 with two fulfillment centers in Seattle and New Castle, Ohio, and has been in operation for almost 20 years.
b) Amazon first launched its distribution network in 1997 with two fulfillment centers in Seattle and New Castle, Ga., where about 2,000 orders a week come from across the country.
c) Amazon first launched its distribution network in 1997 with two fulfillment centers in Seattle and New Castle, Pennsylvania, and now stores at Amazon.com and in over 140 countries.



