In [1]:
import os
import nltk
#nltk.download('stopwords')
#nltk.download('popular')


In [2]:
# load BERT's summarizer, which we'll use to make summaries of our text
from summarizer import Summarizer



In [3]:
import pprint
import itertools
import re
import pke
import string
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from flashtext import KeywordProcessor
import requests
import json
import random
import pywsd
from pywsd.similarity import max_similarity
from pywsd.lesk import adapted_lesk
from pywsd.lesk import simple_lesk
from pywsd.lesk import cosine_lesk
from nltk.corpus import wordnet as wn

Warming up PyWSD (takes ~10 secs)... took 30.87401008605957 secs.


### Part I: Let's load in our file

For the purposes of this demo, we'll load a file called `"australia.txt"`, which is just a snippet of the Wikipedia entry for Australia. 

In [4]:
with open("../../sample_texts/australia.txt") as f:
    text = f.read()

In [5]:
text

"Australia, officially known as the Commonwealth of Australia, is a sovereign country comprising the mainland of the Australian continent, the island of Tasmania, and numerous smaller islands. It is the largest country in Oceania and the world's sixth-largest country by total area. The population of 26 million is highly urbanised and heavily concentrated on the eastern seaboard. Australia's capital is Canberra, and its largest city is Sydney. The country's other major metropolitan areas are Melbourne, Brisbane, Perth, and Adelaide. Indigenous Australians inhabited the continent for about 65,000 years prior to the first arrival of Dutch explorers in the early 17th century, who named it New Holland. In 1770, Australia's eastern half was claimed by Great Britain and initially settled through penal transportation to the colony of New South Wales from 26 January 1788, a date which became Australia's national day. The population grew steadily in subsequent decades, and by the time of an 1850

### Part II: Set up our BERT summarizer

In [6]:
model = Summarizer()

In [7]:
# run the BERT extractive summarizer on the text;
# ratio = 0.5 asks it to keep roughly half of the sentences
result = model(text, 
               min_length = 30, 
               max_length = 500, 
               ratio = 0.5)

In [8]:
summarized_text = ''.join(result)

In [9]:
print(summarized_text) # BERT Summarizer only uses a subset of the text

Australia, officially known as the Commonwealth of Australia, is a sovereign country comprising the mainland of the Australian continent, the island of Tasmania, and numerous smaller islands. Australia's capital is Canberra, and its largest city is Sydney. In 1770, Australia's eastern half was claimed by Great Britain and initially settled through penal transportation to the colony of New South Wales from 26 January 1788, a date which became Australia's national day. The population grew steadily in subsequent decades, and by the time of an 1850s gold rush, most of the continent had been explored by European settlers and an additional five self-governing crown colonies established. On 1 January 1901, the six colonies federated, forming the Commonwealth of Australia.


### Part III: Keyword Extraction

Our next step is to extract the important keywords from the text. Once we have them, we'll map each keyword to the sentences that mention it.

In [10]:
def get_nouns_multipartite(text):
    
    """
    
        Determines which noun phrases are important using MultipartiteRank, the pke model
        instantiated below. It builds on TopicRank, which in turn extends TextRank, a
        graph-based model derived from the PageRank algorithm. 
        
        The following are relevant resources for TopicRank:
            https://www.aclweb.org/anthology/I13-1062.pdf
            https://smirnov-am.github.io/extracting-keyphrases-from-texts-unsupervised-algorithm-topicrank/
            https://github.com/smirnov-am/pytopicrank
        
        In short, TopicRank works as follows:
            1. Use nltk to identify part-of-speech (POS)
            2. Identify the longest sequences of adjectives and nouns; these constitute our keyphrases
            3. Convert each keyphrase into a term frequency vector using Bag-of-Words (BOW)
            4. Find clusters of keyphrases (topics), using Hierarchical Agglomerative Clustering (HAC)
            5. Use clusters as graph vertices, and the sum of distances between the keyphrases of each topic pair as edge weight
            6. Apply PageRank to identify the most prominent topics
            7. For the top N topics, extract the most significant keyphrases that represent each topic
    
    """
    
    output = []
    
    # initialize our multipartite graph keyphrase extraction model
    extractor = pke.unsupervised.MultipartiteRank()
    
    extractor.load_document(input=text)
    
    # get the POS that we're looking for
    pos = {'PROPN', 'ADJ', 'NOUN'}
    
    # get stoplist, words to avoid
    stoplist = list(string.punctuation)
    stoplist += ['-lrb-', '-rrb-', '-lcb-', '-rcb-', '-lsb-', '-rsb-']
    stoplist += stopwords.words('english')
    
    # select the longest sequence of nouns, adjectives, that do not contain punctuation marks or stopwords
    # and let's choose these as our candidates
    extractor.candidate_selection(pos=pos, stoplist=stoplist)
    
    # build Multipartite graph and rank candidates using random walk
    extractor.candidate_weighting(alpha=1.1,
                                  threshold=0.75,
                                  method='average')
    
    # get the 20-highest scored candidates, and let's use these as keyphrases
    keyphrases = extractor.get_n_best(n=20)
    
    for key in keyphrases:
        output.append(key[0])
        
    return output
    
    

In [28]:
get_nouns_multipartite(summarized_text)

['australia',
 'island',
 'colony',
 'australian continent',
 'january',
 'commonwealth',
 'tasmania',
 'capital',
 'new south wales',
 'canberra',
 'numerous smaller islands',
 'mainland',
 'largest city',
 'national day',
 'sydney',
 'eastern half',
 'sovereign country',
 'penal transportation',
 'date',
 'great britain']
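
To make the ranking step in the docstring above a bit more concrete, here's a minimal sketch of weighted PageRank by power iteration. It is not pke's implementation: the function name `pagerank`, the damping value, and the tiny topic graph are all assumptions made up for illustration.

```python
def pagerank(weights, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration.

    weights: dict mapping node -> {neighbour: edge weight} (outgoing edges).
    Returns a dict mapping node -> score; the scores sum to 1.
    """
    nodes = list(weights)
    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # every node starts each round with the "teleport" share
        new_scores = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            out_total = sum(weights[node].values())
            if out_total == 0:
                # dangling node: spread its score evenly over all nodes
                for other in nodes:
                    new_scores[other] += damping * scores[node] / n
                continue
            # pass the node's score to its neighbours, in proportion to edge weight
            for neighbour, weight in weights[node].items():
                new_scores[neighbour] += damping * scores[node] * weight / out_total
        scores = new_scores
    return scores


# Toy, hand-written topic graph (undirected, so each edge is listed both ways).
toy_graph = {
    "australia": {"continent": 2.0, "colony": 1.0, "capital": 1.0},
    "continent": {"australia": 2.0},
    "colony": {"australia": 1.0},
    "capital": {"australia": 1.0},
}

ranks = pagerank(toy_graph)
ranked_topics = sorted(ranks, key=ranks.get, reverse=True)
print(ranked_topics)
```

The topic with the heaviest connections ends up on top, which is the intuition behind taking the n best candidates after the weighting step.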

Now, let's extract the keywords from the full text:

In [11]:
keywords = get_nouns_multipartite(text)
print(keywords)

['australia', 'sovereign country', 'australian continent', 'island', 'colony', 'eastern seaboard', 'commonwealth', 'largest country', 'new holland', 'population', 'mainland', 'continent', 'january', 'tasmania', 'capital', 'canberra', 'brisbane', 'concentrated', 'largest city', 'oceania']


In [12]:
# make sure that we're only getting words that were in our
# BERT summarized text (since BERT only uses a subset for its summarization)
filtered_keys = []
for keyword in keywords:
    if keyword.lower() in summarized_text.lower():
        filtered_keys.append(keyword)
        
print(filtered_keys)

['australia', 'sovereign country', 'australian continent', 'island', 'colony', 'commonwealth', 'population', 'mainland', 'continent', 'january', 'tasmania', 'capital', 'canberra', 'largest city']
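
One thing to keep in mind: the filter above is a plain substring test, so a keyword also counts as "found" when it only appears inside a longer word. If that matters, a whole-word check is one option. Here's a minimal sketch; the helper name `keep_whole_word_matches` and the sample sentence are made up for illustration.

```python
import re

def keep_whole_word_matches(keywords, text):
    """Keep keywords that occur in text as whole words or phrases (case-insensitive)."""
    kept = []
    for keyword in keywords:
        # \b anchors the match at word boundaries; re.escape guards against
        # keywords that contain regex metacharacters
        pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        if pattern.search(text):
            kept.append(keyword)
    return kept

sample_text = "Australia comprises the mainland and numerous smaller islands."
sample_keywords = ["australia", "island", "mainland", "oceania"]

print(keep_whole_word_matches(sample_keywords, sample_text))
# the plain substring test would also keep "island", because it occurs inside "islands"
```

Whether the stricter check is desirable depends on the text: here it would treat "island" and "islands" as different words.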


### Part IV: Sentence Mapping

Now, for each keyword, let's get the sentences that contain it

In [13]:
def tokenize_sentence(text):
    """
    
        Splits our text into sentences
        
        e.g., "How are you? I am fine." --> ["How are you?", "I am fine."]
    
    """
    
    # separate our text into sentences
    sentences = sent_tokenize(text)
    
    # strip away spaces at beginning and end
    sentences = [sentence.strip() for sentence in sentences]

    return sentences

In [14]:
def get_sentences_for_keyword(keywords, sentences):
    
    """
    
        For each keyword, find the sentence(s) that correspond to that keyword
    
    """
    
    keyword_processor = KeywordProcessor() # flashtext: a fast alternative to regex-based keyword matching
    keyword_sentences = {}
    
    # loop through all keywords
    for word in keywords:
        keyword_sentences[word] = []
        keyword_processor.add_keyword(word)
        
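    # loop through each sentence and collect the keywords found in it
    # note: extract_keywords reports a keyword once per occurrence, so a sentence
    # that mentions a keyword twice is appended twice
    for sentence in sentences:
        keywords_found = keyword_processor.extract_keywords(sentence)
        for key in keywords_found:
            keyword_sentences[key].append(sentence)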
            
    for key in keyword_sentences.keys():
        values = keyword_sentences[key]
        values = sorted(values, key=len, reverse=True)
        keyword_sentences[key] = values
    
    return keyword_sentences

Now, let's get the sentences corresponding to our keywords

In [15]:
sentences = tokenize_sentence(summarized_text)

In [16]:
keyword_sentence_mapping = get_sentences_for_keyword(filtered_keys, sentences)

In [17]:
print(keyword_sentence_mapping)

{'australia': ["In 1770, Australia's eastern half was claimed by Great Britain and initially settled through penal transportation to the colony of New South Wales from 26 January 1788, a date which became Australia's national day.", "In 1770, Australia's eastern half was claimed by Great Britain and initially settled through penal transportation to the colony of New South Wales from 26 January 1788, a date which became Australia's national day.", 'Australia, officially known as the Commonwealth of Australia, is a sovereign country comprising the mainland of the Australian continent, the island of Tasmania, and numerous smaller islands.', 'Australia, officially known as the Commonwealth of Australia, is a sovereign country comprising the mainland of the Australian continent, the island of Tasmania, and numerous smaller islands.', 'On 1 January 1901, the six colonies federated, forming the Commonwealth of Australia.', "Australia's capital is Canberra, and its largest city is Sydney."], '
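
Notice that the output above lists some sentences twice. That is consistent with the matcher reporting a keyword once for every occurrence in a sentence. Here's a minimal, standard-library-only sketch of a mapping that records each sentence once per keyword. It uses `re` rather than flashtext, the function name `map_keywords_to_sentences` is made up, and it assumes the input sentences are themselves distinct.

```python
import re

def map_keywords_to_sentences(keywords, sentences):
    """Map each keyword to the distinct sentences that mention it, longest first."""
    mapping = {keyword: [] for keyword in keywords}
    for sentence in sentences:
        for keyword in keywords:
            pattern = r"\b" + re.escape(keyword) + r"\b"
            # search() reports a sentence once, however often the keyword occurs in it
            if re.search(pattern, sentence, re.IGNORECASE):
                mapping[keyword].append(sentence)
    for keyword in mapping:
        mapping[keyword].sort(key=len, reverse=True)
    return mapping

sample_sentences = [
    "Australia, officially known as the Commonwealth of Australia, is a sovereign country.",
    "Australia's capital is Canberra, and its largest city is Sydney.",
]
sample_mapping = map_keywords_to_sentences(["australia", "canberra"], sample_sentences)
print(sample_mapping["australia"])
```

The trade-off is speed: this loops over every keyword for every sentence, which is fine for a short summary but slower than flashtext on large keyword lists.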

### Part V: Generate distractors (false MC options)

Now that we have the sentences that correspond to each keyword, let's now create our MC options by adding the false options. 

First, let's use WordNet to get our distractors. WordNet is a lexical database that groups words into sets of synonyms (synsets) and links those sets through semantic relations; it covers proper nouns as well. It is more general than a thesaurus: for example, a table and a chair are linked because both are kinds of furniture.

Here's an explanation of how WordNet works: https://wordnet.princeton.edu/
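
To see how this kind of hierarchy gives us distractors, here's a minimal sketch on a toy "is-a" table. The table is hand-written for illustration (the siblings of "island" are borrowed from the distractor output further down in this notebook), and the names `TOY_HYPERNYM` and `sibling_terms` are made up; real WordNet is far larger and allows several hypernyms per sense.

```python
# Toy "is-a" hierarchy: child -> parent (hypernym). Not actual WordNet data.
TOY_HYPERNYM = {
    "island": "land",
    "peninsula": "land",
    "isthmus": "land",
    "cape": "land",
    "land": "object",
}

def sibling_terms(word, hypernym_of):
    """Return the co-hyponyms of word: the other children of the same parent."""
    parent = hypernym_of.get(word)
    if parent is None:
        return []
    return sorted(child for child, p in hypernym_of.items()
                  if p == parent and child != word)

print(sibling_terms("island", TOY_HYPERNYM))
```

Siblings make plausible wrong answers because they belong to the same category as the correct one.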

In [18]:
def get_distractors_wordnet(syn, word):
    
    """
    
        Uses WordNet to find words that can be used as distractors for MC questions
    
    """
    
    distractors = []
    
    word = word.lower()
    
    orig_word = word
    
    if len(word.split()) > 0:
        word = word.replace(" ", "_")
        
    # get any hypernyms (words whose meaning includes the meaning of a more specific word)
    # e.g., "animal" is a hypernym of "elephant"
    # and any hyponyms (words that denote a subcategory of a more general class)
    # e.g., "elephant" is a hyponym of "animal"
    
    hypernym = syn.hypernyms()
    if len(hypernym) == 0:
        return distractors
    
    # collect the sibling terms (co-hyponyms): other hyponyms of the first hypernym
    for item in hypernym[0].hyponyms():
        name = item.lemmas()[0].name()
        name = name.replace("_", " ") # los_angeles -> los angeles
        
        # skip the keyword itself; compare case-insensitively, since lemma names
        # can be capitalized (e.g., "Los_Angeles")
        if name.lower() == orig_word:
            continue
        name = " ".join(w.capitalize() for w in name.split()) # los angeles -> Los Angeles
        if name not in distractors:
            distractors.append(name)
            
    return distractors
    

In [19]:
def get_wordsense(sent, word):
    
    """
    
        Get the sense (synset) of a word, in context, using (1) Lesk algorithm and (2) max similarity
        Useful for word sense disambiguation tasks (e.g., one word means different things, 
        based on context)
    
        Paper: https://thesai.org/Downloads/Volume11No3/Paper_30-Adapted_Lesk_Algorithm.pdf
        
        The goal here is to see if the word has synonyms (or words close in meaning)
        that we could potentially use as answer choices
        
    """
    
    word = word.lower()
    
    if len(word.split()) > 0:
        word = word.replace(" ", "_")
        
    # get the candidate noun synsets (senses) of the word
    synsets = wn.synsets(word, 'n')
    
    if synsets:
        
        # get similarity between possible synsets of all words in 
        # context sentence and possible synsets of ambiguous words, 
        # to determine "context" of the word of interest and what it 
        # "should" mean
        wup = max_similarity(sent, word, "wup", pos = 'n')
        
        # use Lesk algorithm, which will assume that words in the same
        # "neighborhood", or area of text, will tend to share the same topic. 
    
        adapted_lesk_output = adapted_lesk(sent, word, pos = "n")
        # of the two candidate senses, keep the one that WordNet lists first
        lowest_index = min(synsets.index(wup), synsets.index(adapted_lesk_output))
        return synsets[lowest_index]
    else:
        print(f"No synonyms found for the word {word}")
        return None
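
To give a feel for what Lesk does, here's a minimal sketch of the basic (simplified) variant: pick the sense whose definition shares the most words with the sentence. This is not pywsd's `adapted_lesk`, which is richer; the glosses, the stopword list, and the function name `simplified_lesk` are hand-written assumptions for illustration.

```python
def simplified_lesk(context_sentence, glosses, stopwords=frozenset()):
    """Pick the sense whose gloss shares the most words with the context.

    glosses: dict mapping a sense label -> its dictionary definition.
    Ties go to the sense listed first.
    """
    def tokens(text):
        # lowercase, strip surrounding punctuation, drop stopwords
        words = (w.strip(".,;:!?'\"()").lower() for w in text.split())
        return {w for w in words if w and w not in stopwords}

    context = tokens(context_sentence)
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses.items():
        overlap = len(context & tokens(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


# Hand-written toy glosses for two senses of "capital" (not WordNet's wording).
TOY_GLOSSES = {
    "capital (city)": "a city that is the seat of government of a country",
    "capital (money)": "wealth in the form of money or property owned by a business",
}
TOY_STOPWORDS = frozenset({"a", "the", "of", "is", "in", "or", "by", "and", "its", "that"})

sentence = "Australia's capital is Canberra, and its largest city is Sydney."
print(simplified_lesk(sentence, TOY_GLOSSES, TOY_STOPWORDS))
```

Here the word "city" appears in both the sentence and the first gloss, so the city sense wins.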
    
    
    

In [20]:
def get_distractors_conceptnet(word):
    
    """
    
        Get distractors using ConceptNet, which connects words and
        phrases in a knowledge graph, and uses distance metrics to 
        calculate similarity
        
        Links:
            http://conceptnet.io/
            https://arxiv.org/pdf/1612.03975.pdf
    
    """
    
    word = word.lower()
    original_word = word
    if len(word.split()) > 0:
        word = word.replace(" ", "_")

    distractor_list = []

    # build the ConceptNet query URL
    url = f"http://api.conceptnet.io/query?node=/c/en/{word}/n&rel=/r/PartOf&start=/c/en/{word}&limit=5"

    obj = requests.get(url).json()
    
    # first hop: find the wholes that the word is a part of
    for edge in obj["edges"]:
        
        link = edge["end"]["term"]
        
        # second hop: find the other parts of each whole
        url2 = f"http://api.conceptnet.io/query?node={link}&rel=/r/PartOf&end={link}&limit=10"
        
        obj2 = requests.get(url2).json()
        
        for edge2 in obj2["edges"]:
            word2 = edge2["start"]["label"]
            
            if word2 not in distractor_list and original_word not in word2.lower():
                distractor_list.append(word2)
                
    return distractor_list
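
The function above makes two hops over `PartOf` edges: from the word to the wholes it belongs to, then from each whole back to its other parts. Here's a minimal offline sketch of that traversal. The edge list is hand-written for illustration and was not fetched from the ConceptNet API; the names `TOY_PART_OF` and `part_of_siblings` are made up.

```python
# Toy (part, whole) pairs standing in for ConceptNet /r/PartOf edges.
TOY_PART_OF = [
    ("canberra", "australia"),
    ("sydney", "australia"),
    ("tasmania", "australia"),
    ("auckland", "new zealand"),
]

def part_of_siblings(word, part_of_edges):
    """Return the other parts of every whole that word belongs to, in edge order."""
    # first hop: the wholes that the word is a part of
    wholes = [whole for part, whole in part_of_edges if part == word]
    siblings = []
    # second hop: the other parts of each of those wholes
    for whole in wholes:
        for part, candidate_whole in part_of_edges:
            if candidate_whole == whole and part != word and part not in siblings:
                siblings.append(part)
    return siblings

print(part_of_siblings("canberra", TOY_PART_OF))
```

Running the traversal offline like this is also a handy way to test the logic without depending on the network.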
        
        
        

Now, we have our algorithms in place to get our distractors (our options that we want to use in the multiple-choice questions). Let's begin to implement them. 

In [21]:
key_distractor_list = {}

In [22]:
def get_distractors(keyword_sentence_mapping):
    
    """
    
        For each of our keywords (each of which denotes a key "topic" 
        of our text), get distractors that we can use as alternative
        options for MC questions
        
    """
    # get output dict to use
    key_distractor_list = {}
    
    # loop through our keywords and sentences for each keyword
    for keyword in keyword_sentence_mapping:

        # look up the keyword's WordNet sense, using its first (longest) sentence as context
        wordsense = get_wordsense(keyword_sentence_mapping[keyword][0], keyword)

        # if we found a sense, use WordNet to get sibling terms of the keyword
        if wordsense:
            distractors = get_distractors_wordnet(wordsense, keyword)
            # if we can't get any from WordNet, use ConceptNet
            if len(distractors) == 0:
                distractors = get_distractors_conceptnet(keyword)
            if len(distractors) != 0:
                key_distractor_list[keyword] = distractors

        # otherwise, use ConceptNet
        else:
            distractors = get_distractors_conceptnet(keyword)

            if len(distractors) != 0:
                key_distractor_list[keyword] = distractors
                
    return key_distractor_list

Now, let's get our distractors

In [23]:
key_distractor_list = {}

In [24]:
key_distractor_list = get_distractors(keyword_sentence_mapping)

No synonyms found for the word sovereign_country
No synonyms found for the word australian_continent
No synonyms found for the word largest_city


Now that we've successfully obtained the distractors, let's see how the options look:

In [25]:
# initialize our question
question_num = 1

In [26]:
for each in key_distractor_list:
    
    # get the sentence that we want to ask about
    sentence = keyword_sentence_mapping[each][0]
    
    # escape the keyword so that any regex metacharacters in it are matched literally
    pattern = re.compile(re.escape(each), re.IGNORECASE)
    
    # add blank for our answer
    output = pattern.sub("___________", sentence)
    
    # print out our question
    question = f"{question_num} {output}"
    print(question)
    
    # populate our choices
    choices = [each.capitalize()] + key_distractor_list[each]
    
    top_4_options = choices[:4]
    random.shuffle(top_4_options)
    option_choices = ['a', 'b', 'c', 'd']
    
    for idx, choice in enumerate(top_4_options):
        print(f"\t{option_choices[idx]}) {choice}")
        
    # let's see what the other options were:
    print(f"\nMore options: {choices[4:]}\n\n")
    
    # update the number for our question
    question_num = question_num + 1
    print("=====================================")
    
    

1 In 1770, ___________'s eastern half was claimed by Great Britain and initially settled through penal transportation to the colony of New South Wales from 26 January 1788, a date which became ___________'s national day.
	a) Canberra
	b) Eyre
	c) Moreton Bay
	d) Australia

More options: ['Simpson Desert', 'Eyre Peninsula', 'Namoi', 'Darling', 'Nullarbor Plain', 'Northern Territory', 'South America', 'Gondwanaland', 'Africa', 'Eurasia', 'Old World']


2 Australia, officially known as the Commonwealth of Australia, is a sovereign country comprising the mainland of the Australian continent, the ___________ of Tasmania, and numerous smaller ___________s.
	a) Cape
	b) Beachfront
	c) Island
	d) Archipelago

More options: ['Coastal Plain', 'Floor', 'Foreland', 'Forest', 'Isthmus', 'Landmass', 'Mainland', 'Neck', 'Oxbow', 'Peninsula', 'Plain', 'Slash', 'Wonderland']


3 In 1770, Australia's eastern half was claimed by Great Britain and initially settled through penal transportation to the ____
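
The loop above prints each question but doesn't keep track of which letter is correct. Here's a minimal sketch of a helper that returns the blanked sentence, the shuffled options, and the answer letter. The name `make_question`, its parameters, and the sample distractors (city names taken from the source text) are assumptions for illustration.

```python
import random
import re

def make_question(sentence, answer, distractors, rng=None, blank="___________"):
    """Blank out answer in sentence and return (stem, options, correct_letter)."""
    rng = rng or random.Random()
    # escape the answer so it is matched literally, ignoring case
    stem = re.sub(re.escape(answer), blank, sentence, flags=re.IGNORECASE)
    # the correct answer plus at most three distractors
    options = [answer.capitalize()] + list(distractors[:3])
    rng.shuffle(options)
    correct_letter = "abcd"[options.index(answer.capitalize())]
    return stem, options, correct_letter

stem, options, correct = make_question(
    "Australia's capital is Canberra, and its largest city is Sydney.",
    "canberra",
    ["Perth", "Adelaide", "Brisbane", "Melbourne"],
    rng=random.Random(0),  # fixed seed so the shuffle is repeatable
)
print(stem)
for letter, option in zip("abcd", options):
    print(f"\t{letter}) {option}")
print("answer:", correct)
```

Returning the letter alongside the options makes it easy to build an answer key later.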

In [27]:
key_distractor_list

{'australia': ['Eyre',
  'Moreton Bay',
  'Canberra',
  'Simpson Desert',
  'Eyre Peninsula',
  'Namoi',
  'Darling',
  'Nullarbor Plain',
  'Northern Territory',
  'South America',
  'Gondwanaland',
  'Africa',
  'Eurasia',
  'Old World'],
 'island': ['Archipelago',
  'Beachfront',
  'Cape',
  'Coastal Plain',
  'Floor',
  'Foreland',
  'Forest',
  'Isthmus',
  'Landmass',
  'Mainland',
  'Neck',
  'Oxbow',
  'Peninsula',
  'Plain',
  'Slash',
  'Wonderland'],
 'colony': ['Administration',
  'Christendom',
  'Church',
  'College',
  'Constituency',
  'Corps',
  'Diaspora',
  'Electoral College',
  'Immigration',
  'Inspectorate',
  'Jury',
  'Leadership',
  'Membership',
  'Militia',
  'Occupational Group',
  'Opposition',
  'Panel',
  'Public',
  'Registration',
  'Representation',
  'Sacred College',
  'School',
  'Staff',
  'Ulema',
  'University',
  'Vote'],
 'commonwealth': ['American State',
  'Australian State',
  'Canadian Province',
  'Eparchy',
  'Italian Region',
  'Soviet 