<hr style="position: relative; border: none; top: 0px; height: 1px;   background: black;"/>

# Keyphrase Extraction Short Version

<hr style="position: relative; border: none; top: 0px; height: 1px;   background: black;"/>

This notebook contains the keyphrase extraction function as presented in the paper.

In [1]:
from nltk.util import ngrams
from collections import Counter
import numpy as np

In [2]:
# Extract n-grams from a list of tokens.
# tokens: the list of words to group into n-grams.
# n_grams: the number of words in each n-gram.
# Words repeated within a single n-gram are kept only once,
# so a returned phrase can be shorter than n_grams words.
def get_ngrams(tokens: list, n_grams: int) -> list:
    n_grams_list = list(ngrams(tokens, n_grams))
    n_grams_norm = []
    for gram in n_grams_list:
        gram_norm = []
        for word in gram:
            if word not in gram_norm:
                gram_norm.append(word)
        n_grams_norm.append(" ".join(gram_norm))
    return n_grams_norm

# The proposed function to automatically extract
# keyphrases from a block of text.
# text: the text to extract keyphrases from.
# top: the number of top-scoring keyphrases to return.
# start: the minimum number of words a keyphrase can have.
# end: the maximum number of words a keyphrase can have.
def get_keyphrase_list(text: str, top: int,
                       start: int, end: int) -> dict:
    # Lowercase the text, flatten newlines, and split into tokens
    text = text.lower().replace('\n', ' ')
    tokens = text.split(' ')
    tokens = [x for x in tokens if x != '']
    # Accepted keyphrases and their accumulated scores
    com = {}
    for n_grams_count in range(end, start - 1, -1):
        # Get n-grams for text
        current = get_ngrams(tokens, n_grams_count)
        # Get the occurrence of each n-gram
        # Using collections.Counter
        cnt = Counter(current)
        # Compute the mean occurrence count, rounded up,
        # and keep only the n-grams that occur more often
        # than that threshold
        mean = np.ceil(np.mean(list(cnt.values())))
        common = cnt.most_common()
        # Keep n-grams whose count is above the threshold
        cnt = [[x, b] for x, b in common if b > mean]
        # Merge each n-gram with the keyphrases already stored
        for word, count in cnt:
            # Find stored keyphrases sharing at least one word
            sims = [x for x in com.keys()
                    if len(set(word.split(' '))
                    .intersection(set(x.split(' ')))) > 0]
            if len(sims) == 0:
                com[word] = count
            else:
                for sim in sims:
                    com[sim] = com[sim] + count
    # Return only the top-scoring keyphrases
    return dict(Counter(com).most_common(top))
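
The thresholding step used inside `get_keyphrase_list` can be tried in isolation on a tiny input. The following is a minimal sketch, not the notebook's own code: `simple_ngrams` is a hypothetical zip-based stand-in for `nltk.util.ngrams`, and `above_mean` is a hypothetical helper that reproduces the "count strictly greater than the rounded-up mean" filter using only the standard library.

```python
from collections import Counter
from math import ceil

def simple_ngrams(tokens, n):
    # Hypothetical stand-in for nltk.util.ngrams:
    # sliding windows of n consecutive tokens.
    return list(zip(*(tokens[i:] for i in range(n))))

def above_mean(grams):
    # Keep only the grams whose count is strictly greater than
    # the mean count rounded up, mirroring the filter above.
    counts = Counter(grams)
    if not counts:
        return []
    threshold = ceil(sum(counts.values()) / len(counts))
    return [(g, c) for g, c in counts.most_common() if c > threshold]

# Illustrative tokens, not taken from the sample text.
tokens = "a b a b a b c d".split()
bigrams = [" ".join(g) for g in simple_ngrams(tokens, 2)]
kept = above_mean(bigrams)
print(kept)  # [('a b', 3)]
```

Here the four distinct bigrams have counts 3, 2, 1, and 1, so the mean is 1.75 and the threshold is 2. Because the comparison is strict, "b a" (count 2) is dropped and only "a b" survives.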
    

In [3]:
text = """
Terry Pratchett's profoundly irreverent, bestselling novels have garnered him
a revered position in the halls of parody next to the likes of Mark Twain, Kurt Vonnegut, 
Douglas Adams, and Carl Hiaasen. The Color of Magic is Terry Pratchett's maiden 
voyage through the now-legendary land of Discworld. This is where it all begins -- with 
the tourist Twoflower and his wizard guide, Rincewind. On a world supported on the back of a 
giant turtle (sex unknown), a gleeful, explosive, wickedly eccentric expedition sets out. 
There's an avaricious but inept wizard, a naive tourist whose luggage moves on hundreds of dear 
little legs, dragons who only exist if you believe in them, and of course THE EDGE of the planet..
Born Terence David John Pratchett, Sir Terry Pratchett sold his first story when he was thirteen, 
which earned him enough money to buy a second-hand typewriter. His first novel, a humorous fantasy 
entitled The Carpet People, appeared in 1971 from the publisher Colin Smythe.
Terry worked for many years as a journalist and press officer, writing in his spare time and publishing a
number of novels, including his first Discworld novel, The Color of Magic, in 1983. In 1987, he turned to
writing full time.There are over 40 books in the Discworld series, of which four are written for children.
The first of these, The Amazing Maurice and His Educated Rodents, won the Carnegie Medal.
A non-Discworld book, Good Omens, his 1990 collaboration with Neil Gaiman, has been a longtime bestseller 
and was reissued in hardcover by William Morrow in early 2006 (it is also available as a mass market 
paperback - Harper Torch, 2006 - and trade paperback - Harper Paperbacks, 2006).
In 2008, Harper Children's published Terry's standalone non-Discworld YA novel, Nation. Terry
published Snuff in October 2011. Regarded as one of the most significant contemporary English-language 
satirists, Pratchett has won numerous literary awards, was named an Officer of the British Empire (OBE) 
“for services to literature” in 1998, and has received honorary doctorates from the University of Warwick 
in 1999, the University of Portsmouth in 2001, the University of Bath in 2003, the University of Bristol in 
2004, Buckinghamshire New University in 2008, the University of Dublin in 2008, Bradford University in 2009, 
the University of Winchester in 2009, and The Open University in 2013 for his contribution to Public Service.
In Dec. of 2007, Pratchett disclosed that he had been diagnosed with Alzheimer's disease. On 18 Feb, 2009, 
he was knighted by Queen Elizabeth II.He was awarded the World Fantasy Life Achievement Award in 2010.
Sir Terry Pratchett passed away on 12th March 2015. (
""".lower()

In [4]:
get_keyphrase_list(text, 10, 1, 3)

{'the university of': 79,
 'in 2008,': 32,
 'his first': 15,
 'a': 12,
 'and': 10,
 'terry': 6,
 'to': 5,
 'on': 5,
 'was': 5,
 'pratchett': 4}
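
The scores in this output are not raw frequencies. Whenever a shorter n-gram shares at least one word with keyphrases that are already stored, its count is added to each of them instead of creating a new entry. The sketch below isolates that merge rule; `merge_counts` is a hypothetical helper name and the counts are invented for illustration, not taken from the sample text.

```python
def merge_counts(stored, candidates):
    # stored: dict mapping accepted keyphrases to their scores.
    # candidates: list of (phrase, count) pairs for shorter n-grams.
    merged = dict(stored)
    for phrase, count in candidates:
        words = set(phrase.split(" "))
        # Stored keyphrases that share at least one word with the candidate.
        similar = [p for p in merged if words & set(p.split(" "))]
        if not similar:
            # No overlap: the candidate becomes a keyphrase of its own.
            merged[phrase] = count
        else:
            # Overlap: the candidate's count is added to every match.
            for p in similar:
                merged[p] += count
    return merged

# Invented counts, for illustration only.
stored = {"the university of": 3}
candidates = [("university of", 3), ("the", 10), ("terry", 2)]
print(merge_counts(stored, candidates))
# {'the university of': 16, 'terry': 2}
```

A frequent word such as "the" therefore inflates the score of any longer phrase that contains it. Overlap is tested on whole space-separated tokens, so a token that keeps its punctuation (such as `2008,`) does not match the bare word.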