# Tip of the week: "The KEYNG (read *king*) is dead, long live the KEYNG!"


For a long time the go-to library for graph-based keyword extraction has been Gensim. However, [version 4.0.0](https://github.com/RaRe-Technologies/gensim/releases), which was *just* released and provides amazing performance improvements, [removed the entire summarisation module](https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#12-removed-gensimsummarization), which includes the keyword extraction functionality.

This motivates an update on **keyword extraction**, following up on a previous Tip of the week on extractive summarization, which included a section about keyword extraction with Gensim.

This notebook gives a brief overview of **pke**, an **open-source keyphrase extraction toolkit** that is easy to use and provides a wide range of keyword extraction methods, which makes it easy to benchmark different approaches and choose the right algorithm for the problem at hand.

In contrast to Gensim, which only provided keyword extraction using the widely known TextRank algorithm, `pke` offers [statistical models](https://boudinfl.github.io/pke/build/html/unsupervised.html#statistical-models), a [variety of TextRank flavours](https://boudinfl.github.io/pke/build/html/unsupervised.html#graph-based-models), as well as [simple supervised methods](https://boudinfl.github.io/pke/build/html/supervised.html) which even come pre-trained.

In `pke`, **preprocessing is built-in** using **spaCy**. This means the times of tormenting non-English languages with numerous preprocessing steps in order to make them look like English are finally over. *Bon vent !*

Follow the code in this notebook to see how to use `pke` to extract keywords and how the keywords extracted by different methods compare.


## 🏗 Getting started: Install packages & download models

The cells below set up everything required to get started with keyword extraction:

* Install packages
* Download additional resources

In [None]:
# Install pke
!pip install --quiet git+https://github.com/boudinfl/pke.git

# Download additional resources
!python -m nltk.downloader stopwords
!python -m nltk.downloader universal_tagset
!python -m spacy download en # Download English model
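
After running the install cell, a quick sanity check can confirm that the packages are importable before moving on. The snippet below is a minimal sketch using only the standard library; the helper `missing_modules` and the list of module names are assumptions for illustration (pandas is assumed to already be present in the notebook environment), and it does not verify that the NLTK or spaCy resources were downloaded.

```python
from importlib.util import find_spec

# Top-level module names assumed from the install cell (hypothetical list).
REQUIRED_MODULES = ("pke", "nltk", "spacy", "pandas")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules()
print("Missing packages:", ", ".join(missing) if missing else "none")
```

If anything is reported as missing, re-running the install cell (or restarting the runtime) is usually the first thing to try.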

## 🧰 Keyword Extraction using pke

`pke` provides implementations of the following keyword extraction algorithms:

* Statistical models:
    * TF–IDF
    * KPMiner
    * YAKE
* Graph-based models:
    * TextRank
    * SingleRank
    * TopicRank
    * TopicalPageRank
    * PositionRank
    * MultipartiteRank
* Supervised models
    * Kea
    * WINGNUS
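
To give an intuition for the graph-based family listed above, here is a minimal pure-Python sketch of the TextRank idea: build a word co-occurrence graph and rank the words with PageRank. It is a toy illustration and not `pke`'s implementation. The function name `toy_textrank`, the tiny stoplist, and the defaults are assumptions; in particular it filters stopwords instead of using part-of-speech tags, and it applies the window after filtering.

```python
import re
from collections import defaultdict

# Tiny illustrative stoplist; a real setup would filter by part of speech.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "is", "are", "to", "with"}


def toy_textrank(text, window=2, damping=0.85, iterations=50, top_n=5):
    """Rank words by PageRank over an undirected word co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

    # Connect words that appear within `window` positions of each other.
    neighbours = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)

    # Power iteration: each word passes its score on to its neighbours.
    scores = {word: 1.0 for word in neighbours}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbours[n]) for n in neighbours[word])
            for word in neighbours
        }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(toy_textrank("We introduce a graph-based ranking model for text processing "
                   "and show how this ranking model can be used for keyword extraction."))
```

Words that co-occur with many different neighbours end up with the highest scores, which is why well-connected terms tend to surface as keywords.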

The code below wraps several of these extraction methods into convenience functions that use the default parameters and only require an (English) text from which keywords will be extracted.

In [None]:
import string
from itertools import zip_longest

import pke
import pandas as pd
from nltk.corpus import stopwords

# Convenience functions for pke keyword extraction

## Supervised models
def extract_kea_keywords(text, top_n=10, language='en', normalization=None, 
                         only_keywords=True):
    stoplist = stopwords.words('english')
    extractor = pke.supervised.Kea()
    extractor.load_document(input=text, language=language, normalization=normalization)
    extractor.candidate_selection(stoplist=stoplist)
    extractor.candidate_weighting()
    keyphrases = extractor.get_n_best(n=top_n)
    if only_keywords:
        keyphrases = [phrase for phrase, score in keyphrases]
    return keyphrases

## Statistical models
def extract_tfidf_keywords(text, top_n=10, language='en', normalization=None, 
                           n_grams=3, only_keywords=True):
    stoplist = list(string.punctuation)
    stoplist += stopwords.words('english')
    extractor = pke.unsupervised.TfIdf()
    extractor.load_document(input=text, language=language, normalization=normalization)
    extractor.candidate_selection(n=n_grams, stoplist=stoplist)
    extractor.candidate_weighting()
    keyphrases = extractor.get_n_best(n=top_n)
    if only_keywords:
        keyphrases = [phrase for phrase, score in keyphrases]
    return keyphrases

def extract_kp_miner_keywords(text, top_n=10, language='en', normalization=None, 
                              lasf=2, cutoff=200, alpha=2.3, sigma=3.0, 
                              only_keywords=True):
    extractor = pke.unsupervised.KPMiner()
    extractor.load_document(input=text, language=language, normalization=normalization)
    extractor.candidate_selection(lasf=lasf, cutoff=cutoff)
    extractor.candidate_weighting(alpha=alpha, sigma=sigma)
    keyphrases = extractor.get_n_best(top_n)
    if only_keywords:
        keyphrases = [phrase for phrase, score in keyphrases]
    return keyphrases

def extract_yake_keywords(text, top_n=10, normalization=None, window=2, 
                          threshold=0.8, language='en', n=3, use_stems=False, 
                          only_keywords=True):
    stoplist = stopwords.words('english')
    extractor = pke.unsupervised.YAKE()
    extractor.load_document(input=text, language=language, normalization=normalization)
    extractor.candidate_selection(n=n, stoplist=stoplist)
    extractor.candidate_weighting(window=window, stoplist=stoplist, use_stems=use_stems)
    keyphrases = extractor.get_n_best(n=top_n, threshold=threshold)
    if only_keywords:
        keyphrases = [phrase for phrase, score in keyphrases]
    return keyphrases


## Graph based algorithms
def extract_textrank_keywords(text, top_n=10, language='en', normalization=None, 
                              window=2, top_percent=0.33, only_keywords=True):
    pos = {'NOUN', 'PROPN', 'ADJ'}
    extractor = pke.unsupervised.TextRank()
    extractor.load_document(input=text, language=language, normalization=normalization)
    extractor.candidate_weighting(window=window, pos=pos, top_percent=top_percent)
    keyphrases = extractor.get_n_best(n=top_n)
    if only_keywords:
        keyphrases = [phrase for phrase, score in keyphrases]
    return keyphrases

def extract_singlerank_keywords(text, top_n=10, language='en', normalization=None,
                                window=10, only_keywords=True):
    pos = {'NOUN', 'PROPN', 'ADJ'}
    extractor = pke.unsupervised.SingleRank()
    extractor.load_document(input=text, language=language, normalization=normalization)
    extractor.candidate_selection(pos=pos)
    extractor.candidate_weighting(window=window, pos=pos)
    keyphrases = extractor.get_n_best(n=top_n)
    if only_keywords:
        keyphrases = [phrase for phrase, score in keyphrases]
    return keyphrases

def extract_topicrank_keywords(text, top_n=10, language='en', only_keywords=True):
    extractor = pke.unsupervised.TopicRank()
    extractor.load_document(input=text, language=language)
    extractor.candidate_selection()
    extractor.candidate_weighting()
    keyphrases = extractor.get_n_best(n=top_n)
    if only_keywords:
        keyphrases = [phrase for phrase, score in keyphrases]
    return keyphrases

def extract_multipartiterank_keywords(text, top_n=10, language='en', alpha=1.1, 
                                      threshold=0.74, method='average', 
                                      only_keywords=True):
    stoplist = list(string.punctuation)
    stoplist += ['-lrb-', '-rrb-', '-lcb-', '-rcb-', '-lsb-', '-rsb-']
    stoplist += stopwords.words('english')
    pos = {'NOUN', 'PROPN', 'ADJ'}
    extractor = pke.unsupervised.MultipartiteRank()
    extractor.load_document(input=text, language=language)
    extractor.candidate_selection(pos=pos, stoplist=stoplist)
    extractor.candidate_weighting(alpha=alpha, threshold=threshold, method=method)
    keyphrases = extractor.get_n_best(n=top_n)
    if only_keywords:
        keyphrases = [phrase for phrase, score in keyphrases]
    return keyphrases
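
All of the wrappers above follow the same pattern: load the document, select candidates, weight them, and keep the n best. The sketch below mimics those four steps in plain Python to show what each stage contributes. It is a toy stand-in under stated assumptions, not how `pke` works internally: the name `toy_keyphrases`, the rule of dropping n-grams that start or end with a stopword, and the "frequency times length" weight are all made up for illustration.

```python
import re
from collections import Counter


def toy_keyphrases(text, stoplist, max_len=3, top_n=5):
    """Toy version of: load text -> select candidates -> weight -> take n best."""
    # 1. "Load" the document: lowercase and split into word tokens.
    tokens = re.findall(r"[a-z]+", text.lower())

    # 2. Candidate selection: all n-grams of up to max_len words that
    #    neither start nor end with a stopword.
    candidates = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i : i + n]
            if gram[0] in stoplist or gram[-1] in stoplist:
                continue
            candidates[" ".join(gram)] += 1

    # 3. Candidate weighting: raw frequency times phrase length, so that
    #    repeated multi-word phrases rank above single words.
    weights = {phrase: count * len(phrase.split()) for phrase, count in candidates.items()}

    # 4. n-best: sort by weight (ties broken alphabetically) and keep the top.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


print(toy_keyphrases(
    "keyword extraction is fun and keyword extraction is fast",
    stoplist={"is", "and"},
))
```

Swapping out step 3 is essentially what distinguishes the different methods from each other, while steps 1, 2 and 4 stay broadly similar.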

The next cell:

* **collects the above functions** for keyword extraction together with a set of keyword arguments for easy access in a dictionary,
* sets a **default subset of extraction functions** to compare, and 
* defines a **convenience function** that simplifies the **comparison** of the different extraction methods.

In [None]:
# Define extraction functions, labels, and set parameters
top_n = 10

KEYWORD_EXTRACTION_FUNCTIONS = {
    # Statistical models
    'TFIDF': (
        extract_tfidf_keywords, 
        {'top_n': top_n},
    ),
    'KPMiner': (
        extract_kp_miner_keywords, 
        {'top_n': top_n},
    ),
    'YAKE': (
        extract_yake_keywords, 
        {'top_n': top_n},
    ),
    # Graph-based models
    'TextRank': (
        extract_textrank_keywords, 
        {'top_n': top_n, 'window': 2},
    ),
    'SingleRank': (
        extract_singlerank_keywords, 
        {'top_n': top_n, 'window': 10},
    ),
    'TopicRank': (
        extract_topicrank_keywords, 
        {'top_n': top_n},
    ),
    'MultipartiteRank': (
        extract_multipartiterank_keywords, 
        {'top_n': top_n},
    ),
    # Supervised
    'KEA': (
        extract_kea_keywords, 
        {'top_n': top_n},
    ),
}

DEFAULT_SELECTION = ['TFIDF', 'YAKE', 'TextRank', 'TopicRank', 'KEA']

def compare_keyword_extraction_algorithms(text, 
                                          keyword_extraction_functions=None,
                                          selection=None):
    """Convenience function to compare keywords extracted from the given text.

    Args:
        text (str): Text to extract keywords from.
        keyword_extraction_functions (dict): Dict containing labels as keys and
            a tuple of (extraction_function, kwargs) as values. Defaults to None.
        selection (list): List of names of the algorithms to use for keyword
            extraction. See keyword_extraction_functions for possible values
            and/or to change arguments. Defaults to None.
    """
    if keyword_extraction_functions is None:
        keyword_extraction_functions = KEYWORD_EXTRACTION_FUNCTIONS
    if selection is None:
        selection = DEFAULT_SELECTION
    
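    # Extract keywords in the order given by `selection`, so that each
    # column lines up with its label regardless of the dict's key order
    extracted = []
    for name in selection:
        extraction_fn, kwargs = keyword_extraction_functions[name]
        extracted.append(extraction_fn(text, **kwargs))

    # Create DataFrame with extracted keywords (shorter lists padded with "")
    all_keywords = pd.DataFrame(
        zip_longest(*extracted, fillvalue=""),
        columns=selection,
    )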
    
    # Display table
    display(all_keywords)
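
The comparison table relies on `itertools.zip_longest` to turn one keyword list per method into table rows, since different methods may return different numbers of keywords. The following stand-alone sketch shows that step in isolation; the method names and keyword lists are made-up placeholders, not real extraction output.

```python
from itertools import zip_longest

# Hypothetical keyword lists of unequal length, one per method.
keywords_by_method = {
    "MethodA": ["graph", "ranking model", "text processing"],
    "MethodB": ["keyword"],
}

# Transpose the per-method lists into table rows, padding short lists with "".
rows = list(zip_longest(*keywords_by_method.values(), fillvalue=""))
header = list(keywords_by_method)

print(header)
for row in rows:
    print(row)
```

Taking both the header and the columns from the same dictionary keeps labels and values in the same order, which matters because the rows carry no method names of their own.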

With the keyword extraction functions implemented, let's define a **few short example texts** which will be used below for keyword extraction.

In [None]:
texts = [
    # Dartmouth Workshop
    # https://en.wikipedia.org/wiki/Dartmouth_workshop
    (
        "The Dartmouth Summer Research Project on Artificial Intelligence was "
        "a 1956 summer workshop widely considered to be the founding event of "
        "artificial intelligence as a field. The project lasted approximately "
        "six to eight weeks and was essentially an extended brainstorming "
        "session. Eleven mathematicians and scientists originally planned to "
        "attend; not all of them attended, but more than ten others came for "
        "short times."
    ),
    # Abstract TextRank Paper
    (
        "In this paper, we introduce TextRank – a graph-based ranking model "
        "for text processing, and show how this model can be successfully "
        "used in natural language applications. In particular, we propose "
        "two innovative unsupervised methods for keyword and sentence "
        "extraction, and show that the results obtained compare favorably "
        "with previously published results on established benchmarks."
    ),
    # News
    # https://www.nytimes.com/live/2021/02/09/us/trump-impeachment-trial
    (
        "The House managers prosecuting former President Donald J. Trump "
        "opened his Senate impeachment trial on Tuesday with a vivid and "
        "graphic sequence of footage of his supporters storming the Capitol "
        "last month in an effort to prevent Congress from finalizing his "
        "election defeat.\n"
        "The managers wasted no time moving immediately to their most powerful "
        "evidence: the explicit visual record of the deadly Capitol siege "
        "that threatened the lives of former Vice President Mike Pence and "
        "members of both houses of Congress juxtaposed against Mr. Trump’s "
        "own words encouraging members of the mob at a rally beforehand.\n"
        "The scenes of mayhem and violence — punctuated by expletives rarely "
        "heard on the floor of the Senate — highlighted the drama of the "
        "trial in gut-punching fashion for the senators who lived through "
        "the events barely a month ago and now sit as quasi-jurors. On the "
        "screens, they saw enraged extremists storming barricades, beating "
        "police officers, setting up a gallows and yelling, “Take the "
        "building,” “Fight for Trump” and “Pence is a traitor! Traitor Pence!”"
    ),
    # Recipe
    # https://www.nytimes.com/2021/02/08/dining/birria-recipes.html
    (
        "You go to Birrieria Nochistlán for the Moreno family’s "
        "Zacatecan-style birria — a big bowl of hot goat meat submerged "
        "in a dark pool of its own concentrated cooking juices.\n"
        "Right out of the pot, the steamed meat isn’t just tender, but "
        "in places deliciously sticky, smudged with chile adobo, falling "
        "apart, barely even connected to the bone. It comes with thick, "
        "soft tortillas, made to order, and a vibrant salsa roja. "
        "The Moreno family has been serving birria exactly like this for "
        "about 20 years.\n"
        "“Sometimes I think we should update our menu,” said Rosio Moreno, "
        "23, whose parents started the business out of their home in East "
        "Los Angeles. “But we don’t want to change the way we do things "
        "because of the hype.”"
    ),
]

In [None]:
# Compare the keywords extracted by the given algorithms
selected_algorithms = ['TFIDF', 'KPMiner', 'YAKE', 'TextRank', 'TopicRank', 'KEA']

for text in texts:
    compare_keyword_extraction_algorithms(text, selection=selected_algorithms)
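
Reading the tables side by side works well for a handful of texts. If a rough number is wanted as well, one possible approach is to measure how much the keyword lists of two methods overlap. The sketch below uses the Jaccard index for this; the helper names `jaccard` and `pairwise_overlap` and the example lists are assumptions for illustration and are not part of `pke`.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of distinct keywords that two lists have in common (0.0 to 1.0)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def pairwise_overlap(keywords_by_method):
    """Jaccard overlap for every pair of methods in the given dict."""
    return {
        (first, second): jaccard(keywords_by_method[first], keywords_by_method[second])
        for first, second in combinations(keywords_by_method, 2)
    }


# Made-up keyword lists standing in for real extraction results.
example = {
    "MethodA": ["artificial intelligence", "summer workshop", "project"],
    "MethodB": ["artificial intelligence", "dartmouth", "project"],
}
print(pairwise_overlap(example))
```

A low overlap does not mean one method is wrong; it only indicates that the methods emphasise different aspects of the text.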

## 🧑‍🔬 Try it yourself!

**Task**: 

1. Insert your own text that you would like to extract keywords from
2. Select the desired keyword extraction methods
3. Extract keywords

In [None]:
# Task 1: Add your own input text from which you want to extract keywords
text = "Replace this string rambling on about keyword extraction and how great it is with your own text"

# Task 2: Select the desired keyword extraction methods you want to compare
selected_algorithms = ['TFIDF', 'YAKE', 'TextRank', 'TopicRank', 'KEA']

# Task 3: Execute this cell to compare the extracted keywords
compare_keyword_extraction_algorithms(text, selection=selected_algorithms)

## Summary

When starting a new project that can benefit from keyword extraction, **we recommend** trying **`pke`** first. It is easy to use, offers a good selection of keyword extraction methods (*batteries included*), and, if nothing else, provides strong baselines for more advanced methods.

TextRank is a good starting point that only requires part-of-speech tagging. If this information is unavailable, YAKE is an interesting alternative with the fewest dependencies.
Lastly, even though all models come pre-trained, TF–IDF and supervised models can yield much improved results if a training corpus or a large collection of similar documents is at hand.
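
To see why a collection of similar documents helps TF–IDF, consider the minimal sketch below. It scores each word by its frequency in the document, scaled down by how many documents of a reference corpus contain it. The function `tfidf_scores` and the smoothed IDF formula are assumptions chosen for illustration (a common textbook variant); `pke` may compute its weights differently.

```python
import math
from collections import Counter


def tfidf_scores(document, corpus):
    """Score each word of `document` by term frequency times inverse document frequency."""
    term_freq = Counter(document.lower().split())

    # Document frequency: in how many corpus documents does each word occur?
    doc_freq = Counter()
    for other in corpus:
        doc_freq.update(set(other.lower().split()))

    # Smoothed IDF: words found in every corpus document get a weight of zero.
    n_docs = len(corpus)
    return {
        word: count * math.log((1 + n_docs) / (1 + doc_freq[word]))
        for word, count in term_freq.items()
    }


corpus = ["the cat sat", "the dog ran", "the bird flew"]
print(tfidf_scores("the cat", corpus))
```

Words that appear throughout the corpus are pushed towards zero, while words specific to the document keep a high score. The closer the corpus is to the target domain, the better this separation tends to work.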

As this notebook shows, the hole in the NLP practitioner's tool belt left by the removal of Gensim's keyword extraction functionality can easily be filled by an entire toolkit: `pke`.

The times of only having a hammer to solve keyword extraction are over, so non-English languages no longer need to look like nails!


## Resources

### 📚 Libraries & Packages

* [`pke` python keyphrase extraction](https://github.com/boudinfl/pke): Neat library implementing, amongst others, TF-IDF, YAKE, KPMiner, TextRank, SingleRank, TopicRank, TopicalPageRank, PositionRank, MultipartiteRank, KEA, and WINGNUS. Uses GPLv3 licence. [[documentation](https://boudinfl.github.io/pke/)]
* [YAKE](https://github.com/LIAAD/yake): An alternative implementation from the authors of the YAKE paper.
* [PyTextRank](https://github.com/DerwenAI/pytextrank): An alternative Python implementation of *TextRank* as a *spaCy pipeline extension*.
* [Gensim 3.8](https://radimrehurek.com/gensim_3.8.3/summarization/keywords.html): Most widely used package for keyword extraction. **Version 4.0 [removes the summarization module](https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#12-removed-gensimsummarization)** (which includes keyword extraction) because of bad performance. (Is this a good or a bad thing? 🙃)


### 📄 Overview Papers

* *Keyword extraction: a review of methods and approaches* by Slobodan Beliga (2014) [[paper](http://langnet.uniri.hr/papers/beliga/Beliga_KeywordExtraction_a_review_of_methods_and_approaches.pdf)]
* *A Review of Keyphrase Extraction* by Eirini Papagiannopoulou and Grigorios Tsoumakas (2019) [[paper](https://arxiv.org/pdf/1905.05044)]