# Neural Keyword Extraction

In a [previous quick tip](https://github.com/ml6team/quick-tips/tree/main/nlp/2021_03_18_pke_keyword_extraction) we looked at [`pke`](https://boudinfl.github.io/pke/build/html/index.html) as a replacement for Gensim's [recently removed keyword extraction module](https://github.com/RaRe-Technologies/gensim/releases/tag/4.0.0).
`pke` comes with batteries included: it has preprocessing built in, supports non-English languages, and provides a wide range of keyword extraction methods: statistical, graph-based, and supervised.

This makes `pke` a great choice to get started with keyword extraction, experiment with different methods and generate baselines to improve upon.

But what if the required performance is not met by these *classical* methods?

Promising, newly developed extraction methods fall into the category of **neural keyword extraction**.

These methods often use **sequence-to-sequence models** based on recurrent neural networks (RNNs), such as Long Short-Term Memory (LSTM) networks. Their objective is to transform a sequence of input words (the given document) into an abstract intermediate representation and generate a sequence of keywords from it.

These models do not use words or phrases directly, which enables them to generate unseen keywords as well. This is called **keyword generation** and combines abstractive as well as extractive keywords.
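
To make the distinction concrete, the minimal sketch below splits a list of generated keywords into those that occur verbatim in the text (extractive) and those that do not (abstractive). The helper name `split_present_absent` and the example keywords are assumptions made up for illustration, and the case-insensitive substring match is a deliberately naive stand-in for proper tokenisation or stemming.

```python
def split_present_absent(text, keywords):
    """Split keywords into those found verbatim in the text and the rest.

    Uses a naive, case-insensitive substring match (an assumption of this
    sketch, not a method described in this notebook).
    """
    lowered = text.lower()
    present = [kw for kw in keywords if kw.lower() in lowered]
    absent = [kw for kw in keywords if kw.lower() not in lowered]
    return present, absent


text = "The House managers opened the Senate impeachment trial on Tuesday."
generated = ["Impeachment", "Senate", "US Politics"]  # hypothetical model output

present, absent = split_present_absent(text, generated)
print(present)  # extractive keywords: they appear in the text
print(absent)   # abstractive keywords: generated, but not found in the text
```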

They report **impressive performance** increases over the classical extraction methods.

However, training such models **requires a large collection of documents** and annotated keywords, since the final training step is usually supervised.

In addition, many of the models' repositories are not well maintained, which makes it **difficult to train** them, especially on different languages or domains.

This leaves the question of how neural keyword extraction can already be used today.

Follow the sections below to learn how to use **two approaches in the neural keyword extraction category** and how they compare to classical extraction methods. 👇

## 🏗 Getting started: Install packages & download models

The cells below set up everything that is required to get started with keyword extraction:

* Install packages
* Download additional resources

In [23]:
!pip install git+https://github.com/boudinfl/pke.git
!pip install transformers
!pip install keybert

# Download additional resources
!python -m nltk.downloader stopwords
!python -m nltk.downloader universal_tagset
!python -m spacy download en # Download English model


## KeyBERT

Strictly speaking, [KeyBERT](https://github.com/MaartenGr/KeyBERT) is not an end-to-end neural keyword extraction model.

Nonetheless, the **underlying idea** is as clever as it is simple: it compares embeddings of words with embeddings of texts and selects the set of keywords that are most similar to the entire text.
Both **word embeddings** and **text embeddings** are generated using **state-of-the-art neural models**, which have led to tremendous performance improvements in other tasks.

The benefit of this approach is that the models used are trained in an unsupervised manner and thus **do not require an annotated dataset** of keywords. In addition, they are available in a number of **non-English languages** as well.
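
The core idea can be sketched in a few lines of plain Python: score every candidate phrase by the cosine similarity between its embedding and the embedding of the whole text, then keep the best ones. This is only an illustration of the principle, not KeyBERT's actual implementation. The function names and the three-dimensional vectors are made up for this sketch; in practice the embeddings would come from a neural model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(doc_embedding, candidate_embeddings, top_n=2):
    """Return the top_n (candidate, score) pairs most similar to the text."""
    scored = [
        (candidate, cosine_similarity(doc_embedding, embedding))
        for candidate, embedding in candidate_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors, invented for illustration only. A real setup would obtain
# them from a neural embedding model.
doc_embedding = [0.9, 0.1, 0.3]
candidate_embeddings = {
    "keyword extraction": [0.8, 0.2, 0.3],
    "quick tip": [0.1, 0.9, 0.2],
    "neural models": [0.7, 0.0, 0.6],
}

for candidate, score in rank_candidates(doc_embedding, candidate_embeddings):
    print(f"{candidate}: {score:.3f}")
```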

Let's see below how KeyBERT can be used in practice.

In [2]:
from keybert import KeyBERT

In [62]:
kw_model = KeyBERT()

def extract_keybert_keywords(text, keyphrase_ngram_range=(1,1), stop_words='english',
                             use_maxsum=True, nr_candidates=20, top_n=5,
                             use_mmr=False, diversity=0.7,
                             only_keywords=True):
    keyphrases = kw_model.extract_keywords(text, keyphrase_ngram_range=keyphrase_ngram_range, 
                                           stop_words=stop_words, use_maxsum=use_maxsum, 
                                           use_mmr=use_mmr, diversity=diversity, 
                                           nr_candidates=nr_candidates, top_n=top_n)
    if only_keywords:
        keyphrases = [phrase for phrase, score in keyphrases]
    return keyphrases
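
The wrapper above exposes the `use_mmr` and `diversity` arguments. Maximal Marginal Relevance (MMR) trades relevance to the text against redundancy among the keywords already chosen. The sketch below shows the general greedy MMR idea on precomputed similarities; it is not KeyBERT's code, and the function name `mmr_select` as well as all similarity values are assumptions invented for illustration.

```python
def mmr_select(doc_sims, pairwise_sims, top_n=2, diversity=0.7):
    """Greedy Maximal Marginal Relevance over precomputed similarities.

    doc_sims: dict mapping candidate -> similarity to the document.
    pairwise_sims: dict mapping frozenset({a, b}) -> similarity between
        the candidates a and b.
    """
    # Start with the candidate that is most similar to the document.
    selected = [max(doc_sims, key=doc_sims.get)]
    remaining = [c for c in doc_sims if c != selected[0]]

    while remaining and len(selected) < top_n:
        def mmr_score(candidate):
            # Redundancy: highest similarity to any keyword chosen so far.
            redundancy = max(
                pairwise_sims[frozenset((candidate, chosen))]
                for chosen in selected
            )
            return (1 - diversity) * doc_sims[candidate] - diversity * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Made-up similarity values, for illustration only.
doc_sims = {"goat meat": 0.80, "hot goat meat": 0.78, "salsa roja": 0.60}
pairwise_sims = {
    frozenset(("goat meat", "hot goat meat")): 0.95,
    frozenset(("goat meat", "salsa roja")): 0.30,
    frozenset(("hot goat meat", "salsa roja")): 0.28,
}

print(mmr_select(doc_sims, pairwise_sims, top_n=2, diversity=0.7))
print(mmr_select(doc_sims, pairwise_sims, top_n=2, diversity=0.0))
```

With a high `diversity` the near-duplicate phrase is skipped in favour of a less similar one; with `diversity=0.0` the selection falls back to pure relevance.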

## BART-based model

🤗 Hugging Face Model Hub is *the* prime address when it comes to discovering state-of-the-art models that are easy to use. However, with respect to neural keyword (or keyphrase) extraction there are only two models available as of this writing.

Both models were created by [Ankur Singh](https://huggingface.co/ankur310794) and use a **BART-based sequence-to-sequence architecture**. Unfortunately, few details are available about the training specifics.

[One model](https://huggingface.co/ankur310794/bart-base-keyphrase-generation-kpTimes) was trained using the [KPTimes dataset](https://aclanthology.org/W19-8617/), a large dataset consisting of **English news articles** and hand-annotated keywords.

[The other](https://huggingface.co/ankur310794/bart-base-keyphrase-generation-openkp) was trained using the [OpenKP dataset](https://github.com/microsoft/OpenKP), which contains a large number of **English web documents**, each annotated with up to three most relevant keywords. This restriction holds true for the model as well: it will return at most three keywords.

In [61]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
  
huggingface_models = {
    # Trained on OpenKP: Returns up to 3 keywords
    'openkp': "ankur310794/bart-base-keyphrase-generation-openkp",
    # Trained on KPTimes
    'kptimes': "ankur310794/bart-base-keyphrase-generation-kpTimes",
}

def load_model(model_name):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    return tokenizer, model

tokenizer, model = load_model(huggingface_models['kptimes'])

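def extract_keywords_using_bart(text):
    # Calling the tokenizer directly replaces the deprecated
    # `prepare_seq2seq_batch`; truncation keeps long inputs within the
    # model's maximum input length.
    encoded_text = tokenizer(
        [text],
        truncation=True,
        return_tensors="pt",
    )
    encoded_keywords = model.generate(**encoded_text)
    raw_keywords = tokenizer.batch_decode(
        encoded_keywords,
        skip_special_tokens=True,
    )
    # The generated output is split on ';' to obtain the individual keywords
    keywords = [keyword.strip() for keyword_string in raw_keywords
                                for keyword in keyword_string.split(';')]
    return keywords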
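
The last step of the function above splits the generated string on `;`. The slightly more defensive variant below also drops empty entries and case-insensitive duplicates. It is a sketch only: the helper name `parse_generated_keywords` and the raw string are hypothetical, not actual model output.

```python
def parse_generated_keywords(raw_output, separator=";"):
    """Split a generated keyword string, dropping blanks and duplicates."""
    keywords = []
    seen = set()
    for part in raw_output.split(separator):
        keyword = part.strip()
        key = keyword.lower()
        # Skip empty fragments and keywords that only differ in casing.
        if keyword and key not in seen:
            seen.add(key)
            keywords.append(keyword)
    return keywords


# Hypothetical raw output, made up for illustration.
raw = "Donald Trump;US Politics; Impeachment;;donald trump"
print(parse_generated_keywords(raw))
# ['Donald Trump', 'US Politics', 'Impeachment']
```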

## ⚔️ Classical vs. Neural Keyword Extraction

### Classical extraction methods

The code below wraps several extraction methods from pke into convenience functions that use the default parameters and only require an (English) text from which keywords will be extracted.

In [39]:
import string
from itertools import zip_longest

import pke
import pandas as pd
from nltk.corpus import stopwords

# Convenience functions for pke keyword extraction

## Statistical models
def extract_tfidf_keywords(text, top_n=10, language='en', normalization=None, 
                           n_grams=3, only_keywords=True):
    stoplist = list(string.punctuation)
    stoplist += stopwords.words('english')
    extractor = pke.unsupervised.TfIdf()
    extractor.load_document(input=text, language=language, normalization=normalization)
    extractor.candidate_selection(n=n_grams, stoplist=stoplist)
    extractor.candidate_weighting()
    keyphrases = extractor.get_n_best(n=top_n)
    if only_keywords:
        keyphrases = [phrase for phrase, score in keyphrases]
    return keyphrases

def extract_yake_keywords(text, top_n=10, normalization=None, window=2, 
                          threshold=0.8, language='en', n=3, use_stems=False, 
                          only_keywords=True):
    stoplist = stopwords.words('english')
    extractor = pke.unsupervised.YAKE()
    extractor.load_document(input=text, language=language, normalization=normalization)
    extractor.candidate_selection(n=n, stoplist=stoplist)
    extractor.candidate_weighting(window=window, stoplist=stoplist, use_stems=use_stems)
    keyphrases = extractor.get_n_best(n=top_n, threshold=threshold)
    if only_keywords:
        keyphrases = [phrase for phrase, score in keyphrases]
    return keyphrases

## Graph-based algorithms
def extract_textrank_keywords(text, top_n=10, language='en', normalization=None, 
                              window=2, top_percent=0.33, only_keywords=True):
    pos = {'NOUN', 'PROPN', 'ADJ'}
    extractor = pke.unsupervised.TextRank()
    extractor.load_document(input=text, language=language, normalization=normalization)
    extractor.candidate_weighting(window=window, pos=pos, top_percent=top_percent)
    keyphrases = extractor.get_n_best(n=top_n)
    if only_keywords:
        keyphrases = [phrase for phrase, score in keyphrases]
    return keyphrases

def extract_topicrank_keywords(text, top_n=10, language='en', only_keywords=True):
    extractor = pke.unsupervised.TopicRank()
    extractor.load_document(input=text, language=language)
    extractor.candidate_selection()
    extractor.candidate_weighting()
    keyphrases = extractor.get_n_best(n=top_n)
    if only_keywords:
        keyphrases = [phrase for phrase, score in keyphrases]
    return keyphrases

The next cell:

* **collects the above functions** for keyword extraction together with a set of keyword arguments for easy access in a dictionary,
* sets a **default subset of extraction functions** to compare, and 
* defines a **convenience function** that simplifies the **comparison** of the different extraction methods.

In [63]:
# Define extraction functions, labels, and set parameters
top_n = 10

KEYWORD_EXTRACTION_FUNCTIONS = {
    # Neural Keyword Extraction 
    'KeyBERT': (
        extract_keybert_keywords, 
        {
            'keyphrase_ngram_range': (1,2),
            'stop_words': 'english',
            'use_maxsum': True, 
            'nr_candidates': 20, 
            'top_n': 10, 
            'use_mmr': False, 
            'diversity': 0.7,
        },
    ),
    'BART-based': (
        extract_keywords_using_bart, 
        {},
    ),

    # Statistical models
    'TFIDF': (
        extract_tfidf_keywords, 
        {'top_n': top_n},
    ),
    'YAKE': (
        extract_yake_keywords, 
        {'top_n': top_n},
    ),
    # Graph-based models
    'TextRank': (
        extract_textrank_keywords, 
        {'top_n': top_n ,'window': 2},
    ),
    'TopicRank': (
        extract_topicrank_keywords, 
        {'top_n': top_n},
    ),
}

DEFAULT_SELECTION = ['KeyBERT', 'BART-based', 'TFIDF', 'YAKE', 'TextRank', 'TopicRank']

def compare_keyword_extraction_algorithms(text,
                                          keyword_extraction_functions=None,
                                          selection=None):
    """Convenience function to compare keywords extracted from the given text.

    Args:
        text (str): Text to extract keywords from.
        keyword_extraction_functions (dict): Dict containing labels as keys and
            a tuple of (extraction_function, kwargs) as values. Defaults to None.
        selection (list): List of names of algorithms to use for keyword
            extraction. See keyword_extraction_functions for possible values
            and/or to change arguments. Defaults to None.
    """
    if keyword_extraction_functions is None:
        keyword_extraction_functions = KEYWORD_EXTRACTION_FUNCTIONS
    if selection is None:
        selection = DEFAULT_SELECTION
    
    # Create DataFrame with extracted keywords. Iterate over `selection`
    # (not the dict) so the column order always matches the column labels.
    all_keywords = pd.DataFrame(
        zip_longest(
            *(keyword_extraction_functions[name][0](
                  text, **keyword_extraction_functions[name][1])
              for name in selection),
            fillvalue="",
        ),
        columns=selection,
    )
    
    # Display table
    display(all_keywords)
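
Because the methods return different numbers of keywords, the comparison function relies on `itertools.zip_longest` to pad the shorter columns with empty strings before building the table. The small standalone sketch below shows that padding step on two short lists. The values are copied from the first result table of this notebook, and the variable names are chosen for illustration.

```python
from itertools import zip_longest

# Two keyword lists of unequal length.
columns = {
    "BART-based": ["Dartmouth University", "Artificial intelligence"],
    "TextRank": ["summer research", "artificial intelligence", "summer"],
}

# zip_longest turns the column lists into rows and fills the gaps with "".
rows = list(zip_longest(*columns.values(), fillvalue=""))
for row in rows:
    print(row)
```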

### Extracted Keywords

With the keyword extraction functions implemented, let's define a **few short example texts** which will be used below for keyword extraction.

In [64]:
texts = [
    # Dartmouth Workshop
    # https://en.wikipedia.org/wiki/Dartmouth_workshop
    (
        "The Dartmouth Summer Research Project on Artificial Intelligence was "
        "a 1956 summer workshop widely considered to be the founding event of "
        "artificial intelligence as a field. The project lasted approximately "
        "six to eight weeks and was essentially an extended brainstorming "
        "session. Eleven mathematicians and scientists originally planned to "
        "attend; not all of them attended, but more than ten others came for "
        "short times."
    ),
    # Abstract TextRank Paper
    (
        "In this paper, we introduce TextRank – a graph-based ranking model " 
        "for text processing, and show how this model can be successfully "
        "used in natural language applications. In particular, we propose "
        "two innovative unsupervised methods for keyword and sentence "
        "extraction, and show that the results obtained compare favorably "
        "with previously published results on established benchmarks."
     ),
    # News
    # https://www.nytimes.com/live/2021/02/09/us/trump-impeachment-trial
    (
        "The House managers prosecuting former President Donald J. Trump "
        "opened his Senate impeachment trial on Tuesday with a vivid and "
        "graphic sequence of footage of his supporters storming the Capitol "
        "last month in an effort to prevent Congress from finalizing his "
        "election defeat.\n"
        "The managers wasted no time moving immediately to their most powerful "
        "evidence: the explicit visual record of the deadly Capitol siege "
        "that threatened the lives of former Vice President Mike Pence and "
        "members of both houses of Congress juxtaposed against Mr. Trump’s "
        "own words encouraging members of the mob at a rally beforehand.\n"
        "The scenes of mayhem and violence — punctuated by expletives rarely "
        "heard on the floor of the Senate — highlighted the drama of the "
        "trial in gut-punching fashion for the senators who lived through "
        "the events barely a month ago and now sit as quasi-jurors. On the "
        "screens, they saw enraged extremists storming barricades, beating "
        "police officers, setting up a gallows and yelling, “Take the "
        "building,” “Fight for Trump” and “Pence is a traitor! Traitor Pence!”"
    ),
    # Recipe
    # https://www.nytimes.com/2021/02/08/dining/birria-recipes.html
    (
        "You go to Birrieria Nochistlán for the Moreno family’s "
        "Zacatecan-style birria — a big bowl of hot goat meat submerged "
        "in a dark pool of its own concentrated cooking juices.\n"
        "Right out of the pot, the steamed meat isn’t just tender, but "
        "in places deliciously sticky, smudged with chile adobo, falling "
        "apart, barely even connected to the bone. It comes with thick, "
        "soft tortillas, made to order, and a vibrant salsa roja. "
        "The Moreno family has been serving birria exactly like this for "
        "about 20 years.\n"
        "“Sometimes I think we should update our menu,” said Rosio Moreno, "
        "23, whose parents started the business out of their home in East "
        "Los Angeles. “But we don’t want to change the way we do things "
        "because of the hype.”"
    ),
    # The text within this notebook:
    (
        "In a pervious quick tip we looked at pke as a replacement for Gensim's "
        "recently removed keyword extraction module. pke comes with batteries "
        "included: It has preprocessing build in, supports non-English languages, "
        "and provides a wide range of keyword extraction methods: statistical, "
        "graph-based, and supervised.\n" 
        "This makes pke a great choice to get started with keyword extraction, "
        "experiment with different methods and generate baselines to improve upon.\n"
        "But what if the required performance is not met by these classical methods?\n"
        "Newly-developed auspicious extraction methods fall into the category of "
        "neural keyword extraction.\n"
        "The methods often utilise sequence-to-sequence models based on recurrent "
        "neural networks (RNNs) or Long Short-Term Memory (LSTM). Their objective "
        "is to transform a sequence of input words (the given document) into an "
        "abstract intermediate representation and generate a sequence of keywords "
        "from it.\n"
        "These models do not use words or phrases directly, which enables them to "
        "generate unseen keywords as well. This is called keyword generation and "
        "combines abstractive as well as extractive keywords.\n"
        "They report impressive performance increases over the classical "
        "extraction methods.\n"
        "However, training such models requires a large collection of documents "
        "and annotated keywords, since the final training step is usually "
        "supervised.\n"
        "In addition, many of the models' repositories are not well maintained "
        "which makes it difficult to train them, especially on different "
        "languages or domains.\n"
        "This leaves the quesition of how neural keyword extraction can already "
        "be used today.\n"
        "Follow the below sections in this notebook to learn how to use two "
        "approaches in the neural keyword extraction category and how they "
        "compare to classical extraction methods."
    ),
]

In [66]:
# Compare the keywords extracted by the given algorithms
selected_algorithms = ['KeyBERT', 'BART-based', 'TFIDF','YAKE', 'TextRank', 'TopicRank']

for text in texts:
    compare_keyword_extraction_algorithms(text, selection=selected_algorithms)



| | KeyBERT | BART-based | TFIDF | YAKE | TextRank | TopicRank |
|---|---|---|---|---|---|---|
| 0 | field project | Dartmouth University | artificial | dartmouth summer research | summer research | artificial intelligence |
| 1 | artificial | Artificial intelligence | artificial intelligence | summer research project | artificial intelligence | mathematicians |
| 2 | originally planned | | intelligence | 1956 summer workshop | summer | field |
| 3 | intelligence | | summer | summer workshop widely | brainstorming | event |
| 4 | summer research | | dartmouth summer | artificial intelligence | short | extended brainstorming session |
| 5 | scientists originally | | dartmouth summer research | workshop widely considered | | project |
| 6 | dartmouth | | summer research | dartmouth summer | | scientists |
| 7 | brainstorming session | | summer research project | summer research | | weeks |
| 8 | project lasted | | 1956 | research project | | dartmouth summer research project |
| 9 | dartmouth summer | | 1956 summer | 1956 summer | | summer workshop |




| | KeyBERT | BART-based | TFIDF | YAKE | TextRank | TopicRank |
|---|---|---|---|---|---|---|
| 0 | unsupervised methods | Text | results | natural language applications | sentence extraction | model |
| 1 | processing model | Computers and the Internet | introduce | based ranking model | text processing | results |
| 2 | ranking | | introduce textrank | introduce textrank | unsupervised | keyword |
| 3 | graph based | | textrank | based ranking | language | innovative unsupervised methods |
| 4 | language applications | | based | text processing | | sentence extraction |
| 5 | text | | based ranking | language applications | | text processing |
| 6 | paper introduce | | based ranking model | successfully used | | graph |
| 7 | sentence extraction | | ranking | natural language | | natural language applications |
| 8 | extraction results | | ranking model | ranking model | | particular |
| 9 | based ranking | | text processing | textrank | | textrank |




| | KeyBERT | BART-based | TFIDF | YAKE | TextRank | TopicRank |
|---|---|---|---|---|---|---|
| 0 | trump | Donald Trump | pence | former president donald | capitol last | senate impeachment trial |
| 1 | mr trump | US Politics | trump | house managers prosecuting | j. trump | members |
| 2 | house managers | Impeachment | managers | prosecuting former president | enraged extremists | congress |
| 3 | fight trump | House of Representatives | president | former vice president | own words | capitol last month |
| 4 | mike pence | Senate | senate | capitol last month | powerful evidence | house managers |
| 5 | prosecuting | Congress | storming | vice president mike | election defeat | pence |
| 6 | senate highlighted | | capitol | president mike pence | graphic sequence | former vice president mike pence |
| 7 | opened senate | | members | managers prosecuting former | house managers | mayhem |
| 8 | managers prosecuting | | congress | senate impeachment trial | capitol | graphic sequence |
| 9 | impeachment trial | | traitor | president donald | president | trial |




| | KeyBERT | BART-based | TFIDF | YAKE | TextRank | TopicRank |
|---|---|---|---|---|---|---|
| 0 | hot goat | Restaurant | moreno | concentrated cooking juices | concentrated cooking | moreno family |
| 1 | deliciously sticky | Los Angeles | moreno family | birrieria nochistlán | goat meat | style birria |
| 2 | goat meat | | family | hot goat meat | big bowl | zacatecan |
| 3 | places deliciously | | birria | goat meat submerged | style birria | places |
| 4 | nochistlán moreno | | meat | cooking juices | birrieria nochistlán | sticky |
| 5 | meat isn | | birrieria | big bowl | los | big bowl |
| 6 | rosio moreno | | birrieria nochistlán | hot goat | salsa | thick |
| 7 | vibrant salsa | | nochistlán | dark pool | moreno | hot goat meat |
| 8 | birrieria | | zacatecan | concentrated cooking | meat | home |
| 9 | birria exactly | | style birria | goat meat | birria | soft tortillas |




| | KeyBERT | BART-based | TFIDF | YAKE | TextRank | TopicRank |
|---|---|---|---|---|---|---|
| 0 | methods unable | Search Engines | pke | pervious quick tip | extraction methods | keyword extraction module |
| 1 | keyword extraction | Gensim | keyword extraction | recently removed keyword | - english | pke |
| 2 | languages provdes | | extraction | keyword extraction module | wide range | different methods |
| 3 | pervious quick | | methods | recently removed | extraction | experiment |
| 4 | improve methods | | keyword | removed keyword extraction | quick | statistical |
| 5 | replacement gensim | | pervious | pervious quick | - | replacement |
| 6 | methods generate | | pervious quick | quick tip | methods | gensim |
| 7 | extraction module | | pervious quick tip | keyword extraction | great | graph |
| 8 | pke comes | | quick tip | extraction module | | great choice |
| 9 | pke replacement | | tip | gensim | | wide range |


## 🧑‍🔬 Try it yourself!

**Task**: 

1. Insert your own text that you would like to extract keywords from
2. Select the desired keyword extraction methods
3. Extract keywords

In [None]:
# Task 1: Add your own input text
text = "Replace this string rambling on about keyword extraction and how great it is with your own text"

# Task 2: Select the desired keyword extraction methods you want to compare
selected_algorithms = ['KeyBERT', 'BART-based', 'TFIDF', 'YAKE', 'TextRank', 'TopicRank']

# Task 3: Execute this cell to compare the extracted keywords
compare_keyword_extraction_algorithms(text, selection=selected_algorithms)


## Summary

This notebook gave a brief overview of neural keyword extraction.

If an annotated dataset is available, or if the problem domain is closely related to one of the existing datasets, and it is possible to invest some time into training a model yourself, neural keyword extraction **promises great performance**.

We further presented two approaches that are readily available: **KeyBERT**, which uses the **similarity** between word and text **embeddings** to find the keywords that best describe a text, and an **end-to-end neural keyword extraction model** based on BART.

For now, we recommend first trying `pke` to establish a solid baseline and then exploring the extraction methods presented here as an addition.


## Resources

### 📚 Libraries & Packages

* [**KeyBERT**](https://github.com/MaartenGr/KeyBERT): Keyword extraction method that chooses keywords whose word embeddings are most similar to the embedding of the entire text.
* **BART-based neural keyword extraction** models [trained on the KPTimes dataset](https://huggingface.co/ankur310794/bart-base-keyphrase-generation-kpTimes) or [OpenKP dataset](https://huggingface.co/ankur310794/bart-base-keyphrase-generation-openkp) on 🤗 Model Hub.
* [**`pke`** python keyphrase extraction](https://github.com/boudinfl/pke): Neat library implementing, amongst others, TF-IDF, YAKE, KPMiner, TextRank, SingleRank, TopicRank, TopologicalPageRank, PositionRank, MultipartiteRank, KEA, and WINGNUS. Uses the GPLv3 licence. [[documentation](https://boudinfl.github.io/pke/)]
* [**YAKE**](https://github.com/LIAAD/yake): An alternative implementation from the authors of the YAKE paper.


### 📄 Overview Papers

* *Keyword extraction: a review of methods and approaches* by Slobodan Beliga (2014) [[paper](http://langnet.uniri.hr/papers/beliga/Beliga_KeywordExtraction_a_review_of_methods_and_approaches.pdf)]
* *A Review of Keyphrase Extraction* by Eirini Papagiannopoulou and Grigorios Tsoumakas (2019) [[paper](https://arxiv.org/pdf/1905.05044)]
