<a href="https://colab.research.google.com/github/gulabpatel/NER/blob/main/Part_4_1_KeyBERT_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Tutorial** - Keyword Extraction with KeyBERT

## KeyBERT
KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document.
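
The core idea described above (embed the document, embed each candidate, rank candidates by similarity to the document) can be sketched without any model at all. This is a minimal illustration, not KeyBERT's implementation: `toy_embed` is a hypothetical stand-in that counts character trigrams instead of using BERT embeddings, and `rank_candidates` is a name made up for this example.

```python
import math
from collections import Counter

def toy_embed(text):
    """Toy stand-in for a BERT embedding (assumption): character-trigram counts."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_candidates(document, candidates, top_n=3):
    """Return the top_n candidates most similar to the document, best first."""
    doc_vec = toy_embed(document)
    scored = [(c, round(cosine(toy_embed(c), doc_vec), 4)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

document = "Supervised learning maps an input to an output using labeled training data."
candidates = ["supervised learning", "labeled training data", "weather forecast"]
ranked = rank_candidates(document, candidates, top_n=2)
print(ranked)
```

With a real embedding model the scores reflect meaning rather than shared characters, but the ranking step has the same shape.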

# **Installing KeyBERT**

In [1]:
%%capture
# !pip install keybert[all]
!pip install keybert

**NOTE**: Installing `keybert[all]` pulls in all embedding backends (spaCy, Torch, Gensim, USE, etc.), so it may take a while. If you only want to use sentence-transformers, I would advise `pip install keybert`.

# **KeyBERT**
Using KeyBERT is rather straightforward: we choose a document that we want keywords/keyphrases from and pass it through our keyword model:

In [1]:
from keybert import KeyBERT

doc = """
         Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs.[1] It infers a
         function from labeled training data consisting of a set of training examples.[2]
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal).
         A supervised learning algorithm analyzes the training data and produces an inferred function,
         which can be used for mapping new examples. An optimal scenario will allow for the
         algorithm to correctly determine the class labels for unseen instances. This requires
         the learning algorithm to generalize from the training data to unseen situations in a
         'reasonable' way (see inductive bias).
      """

In [2]:
kw_model = KeyBERT()
kw_model.extract_keywords(doc)

[('supervised', 0.6523),
 ('labeled', 0.4702),
 ('learning', 0.467),
 ('training', 0.3858),
 ('labels', 0.3728)]

**NOTE**: Pass `model="xlm-r-bert-base-nli-stsb-mean-tokens"` to `KeyBERT()` to select a model that supports 50+ languages.

## Keyphrase Length
Use `keyphrase_ngram_range` to control the length of the resulting keywords/keyphrases:



In [3]:
kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1))

[('supervised', 0.6523),
 ('labeled', 0.4702),
 ('learning', 0.467),
 ('training', 0.3858),
 ('labels', 0.3728)]

To extract keyphrases, simply set `keyphrase_ngram_range` to (1, 2) or higher depending on the number of words you would like in the resulting keyphrases:

In [4]:
kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 3))

[('supervised learning algorithm', 0.6834),
 ('supervised learning', 0.6658),
 ('supervised learning example', 0.6641),
 ('supervised learning machine', 0.6528),
 ('function labeled training', 0.6526)]

Note that `stop_words` is set to `"english"` by default. If you set it to `None`, stopwords are no longer filtered out and some of them will appear in longer keyphrases:

In [5]:
kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 3), stop_words=None)

[('supervised learning is', 0.7048),
 ('supervised learning algorithm', 0.6834),
 ('supervised learning', 0.6658),
 ('supervised', 0.6523),
 ('in supervised learning', 0.6474)]
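
To make the effect of `keyphrase_ngram_range` and `stop_words` concrete, here is a small sketch of how word n-gram candidates can be generated. It assumes stopwords are dropped before n-grams are formed, which would explain a phrase such as "supervised learning machine" in the output above. The function `ngram_candidates` and its tiny stopword set are hypothetical and only illustrate the idea; they are not part of KeyBERT.

```python
import re

def ngram_candidates(text, ngram_range=(1, 1), stop_words=frozenset()):
    """Return unique word n-grams for every length in ngram_range, in order of first appearance."""
    # Lowercase, tokenize, and drop stopwords BEFORE building n-grams (assumption).
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in stop_words]
    low, high = ngram_range
    seen = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase not in seen:
                seen.append(phrase)
    return seen

text = "Supervised learning is the machine learning task"
unigrams = ngram_candidates(text, (1, 1), stop_words={"is", "the"})
phrases = ngram_candidates(text, (1, 3), stop_words={"is", "the"})
with_stopwords = ngram_candidates(text, (1, 3))

print(unigrams)
print(phrases)
```

Without a stopword set, `with_stopwords` also contains phrases such as "supervised learning is", which matches the behaviour shown with `stop_words=None`.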

## Max Sum Similarity
To diversify the results, we take the 2 x top_n most similar words/phrases to the document. Then, we take all top_n combinations from the 2 x top_n words and extract the combination whose members are the least similar to each other by cosine similarity. The size of this candidate pool is controlled by `nr_candidates` (20 in the call below).

In [6]:
kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 3),
                          use_maxsum=True, nr_candidates=20, top_n=5)

[('machine learning task', 0.5496),
 ('supervisory signal supervised', 0.5705),
 ('learning function', 0.5724),
 ('labeled training data', 0.5959),
 ('examples supervised', 0.6063)]
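
The selection rule described in this section can be sketched in a few lines. This is an illustrative version under stated assumptions, not KeyBERT's code: the 2-D vectors are hand-picked toy values rather than real embeddings, and `max_sum_selection` is a hypothetical name.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def max_sum_selection(doc_vec, cand_vecs, top_n, nr_candidates):
    # Step 1: keep the nr_candidates phrases that are closest to the document.
    by_doc_sim = sorted(cand_vecs, key=lambda name: cosine(cand_vecs[name], doc_vec), reverse=True)
    pool = by_doc_sim[:nr_candidates]

    # Step 2: among all subsets of size top_n, pick the one whose members
    # are least similar to each other (smallest sum of pairwise similarities).
    def pairwise_sum(subset):
        return sum(cosine(cand_vecs[a], cand_vecs[b]) for a, b in combinations(subset, 2))

    return min(combinations(pool, top_n), key=pairwise_sum)

# Toy 2-D "embeddings" (assumption: made-up values for illustration only).
doc_vector = (1.0, 1.0)
candidate_vectors = {
    "supervised learning": (1.0, 0.9),
    "supervised learning algorithm": (1.0, 0.8),
    "class labels": (0.2, 1.0),
    "weather": (-1.0, 0.0),
}
selected = max_sum_selection(doc_vector, candidate_vectors, top_n=2, nr_candidates=3)
print(selected)
```

Note that "weather" is never selected even though it is the most dissimilar phrase: it does not make it into the candidate pool in step 1. Because step 2 enumerates every combination, the cost grows quickly with the pool size.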

## Maximal Marginal Relevance

To diversify the results, we can use Maximal Marginal Relevance (MMR), which is also based on cosine similarity, to create keywords/keyphrases. The results with **high diversity**:

In [7]:
kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 3),
                          use_mmr=True, diversity=0.7)

[('supervised learning algorithm', 0.6834),
 ('class labels unseen', 0.3239),
 ('value called supervisory', 0.2705),
 ('unseen situations reasonable', 0.2158),
 ('pairs infers function', 0.1953)]

The results with **low diversity**:



In [8]:
kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 3),
                          use_mmr=True, diversity=0.2)

[('supervised learning algorithm', 0.6834),
 ('supervised learning', 0.6658),
 ('supervised learning example', 0.6641),
 ('function labeled training', 0.6526),
 ('supervised', 0.6523)]
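
The trade-off that `diversity` controls can be illustrated with a small greedy MMR sketch. It assumes the common formulation in which each step scores a candidate as `(1 - diversity) * similarity_to_document - diversity * max_similarity_to_already_selected`. The vectors are toy values and `mmr_select` is a hypothetical name, so treat this as a sketch rather than KeyBERT's exact implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mmr_select(doc_vec, cand_vecs, top_n, diversity):
    """Greedy Maximal Marginal Relevance over named candidate vectors."""
    doc_sim = {name: cosine(vec, doc_vec) for name, vec in cand_vecs.items()}
    # Start with the candidate that is most similar to the document.
    selected = [max(doc_sim, key=doc_sim.get)]
    remaining = [name for name in cand_vecs if name not in selected]

    while remaining and len(selected) < top_n:
        def score(name):
            # Penalize candidates that resemble something already selected.
            redundancy = max(cosine(cand_vecs[name], cand_vecs[s]) for s in selected)
            return (1 - diversity) * doc_sim[name] - diversity * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy 2-D "embeddings" (assumption: made-up values for illustration only).
doc_vector = (1.0, 1.0)
candidate_vectors = {
    "supervised learning": (1.0, 0.9),
    "supervised learning algorithm": (1.0, 0.8),
    "class labels": (0.2, 1.0),
}
high_diversity = mmr_select(doc_vector, candidate_vectors, top_n=2, diversity=0.7)
low_diversity = mmr_select(doc_vector, candidate_vectors, top_n=2, diversity=0.2)
print(high_diversity)
print(low_diversity)
```

With high diversity the second pick moves away from the first one, while with low diversity the near-duplicate phrase wins, mirroring the two outputs above.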

# **Embedding Models**
In this section, we will go through all embedding models and backends that are supported in KeyBERT.

## Sentence Transformers
You can select any model from sentence-transformers [here](https://www.sbert.net/docs/pretrained_models.html) and pass it through KeyBERT with `model`:

In [9]:
kw_model = KeyBERT(model="xlm-r-bert-base-nli-stsb-mean-tokens")
kw_model.extract_keywords(doc)

[('learning', 0.6026),
 ('training', 0.518),
 ('algorithm', 0.471),
 ('analyzes', 0.4646),
 ('supervised', 0.4624)]

Or we can select a SentenceTransformer model with our own parameters:

In [11]:
from sentence_transformers import SentenceTransformer
# device="cuda" requires a CUDA-capable GPU; use device="cpu" otherwise
sentence_model = SentenceTransformer("xlm-r-bert-base-nli-stsb-mean-tokens", device="cuda")

In [12]:
kw_model = KeyBERT(model=sentence_model)
kw_model.extract_keywords(doc)

[('learning', 0.6026),
 ('training', 0.518),
 ('algorithm', 0.471),
 ('analyzes', 0.4646),
 ('supervised', 0.4624)]

## Flair
Flair allows you to choose almost any embedding model that is publicly available.  
Flair can be used as follows:

In [15]:
!pip install -q flair

In [16]:
from flair.embeddings import TransformerDocumentEmbeddings
roberta = TransformerDocumentEmbeddings('roberta-base')

In [17]:
kw_model = KeyBERT(model=roberta)
kw_model.extract_keywords(doc)

[('training', 0.9911),
 ('bias', 0.9911),
 ('inferred', 0.9911),
 ('scenario', 0.9911),
 ('way', 0.9911)]

You can select any 🤗 transformers model [here](https://huggingface.co/models).

Moreover, you can use Flair to pool word embeddings into document embeddings. Under the hood, Flair simply averages all word embeddings in a document. We can then pass the pooled embedding model to KeyBERT so that those word embeddings are used as document embeddings:

In [18]:
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings

# Note: 'crawl' loads FastText Common Crawl vectors, despite the variable name
glove_embedding = WordEmbeddings('crawl')
document_glove_embeddings = DocumentPoolEmbeddings([glove_embedding])

In [19]:
kw_model = KeyBERT(model=document_glove_embeddings)
kw_model.extract_keywords(doc)

[('function', 0.4896),
 ('output', 0.4621),
 ('data', 0.4577),
 ('learning', 0.4538),
 ('input', 0.4524)]
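
The pooling step mentioned above (averaging word embeddings into one document embedding) amounts to a component-wise mean. Here is a minimal sketch with made-up 2-D vectors. The lookup table `toy_vectors` and the choice to skip unknown words are assumptions for illustration and do not describe Flair's internals.

```python
def mean_pool(word_vectors):
    """Average a list of equal-length vectors component by component."""
    dim = len(word_vectors[0])
    return [sum(vec[i] for vec in word_vectors) / len(word_vectors) for i in range(dim)]

# Hypothetical word-embedding table with toy 2-D vectors.
toy_vectors = {
    "supervised": [1.0, 0.0],
    "learning": [0.0, 1.0],
    "algorithm": [1.0, 1.0],
}

tokens = ["supervised", "learning", "algorithm", "unknownword"]
# Assumption: words without a vector are simply skipped.
known = [toy_vectors[t] for t in tokens if t in toy_vectors]
doc_embedding = mean_pool(known)
print(doc_embedding)
```

Because every word contributes equally, long documents tend to drift toward a generic average, which is one reason pooled word embeddings usually suit short texts better.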

## Spacy
spaCy is a widely used NLP library that now also ships transformer-based pipelines, which makes it interesting to use in KeyBERT.

We start with a non-transformer-based model, which we have to download first:

In [20]:
%%capture
!python -m spacy download en_core_web_md

Next, simply load the model into a spaCy `nlp` instance and pass it through KeyBERT:

In [21]:
import spacy
nlp = spacy.load("en_core_web_md", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

In [22]:
kw_model = KeyBERT(model=nlp)
kw_model.extract_keywords(doc)

[('function', 0.7746),
 ('typically', 0.7282),
 ('consisting', 0.7276),
 ('instances', 0.7218),
 ('generalize', 0.7118)]

We can also use spaCy's transformer-based models, which also have to be downloaded first:

In [23]:
%%capture
!python -m spacy download en_core_web_trf

As before, we simply load the model and pass it through KeyBERT. Note that we exclude several pipeline components (tagger, parser, NER, attribute ruler, lemmatizer) because KeyBERT does not use them.

In [None]:
import spacy
from thinc.api import set_gpu_allocator, require_gpu

# Select the GPU before loading the pipeline so the model is placed on it
set_gpu_allocator("pytorch")
require_gpu(0)
nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

In [None]:
kw_model = KeyBERT(model=nlp)
kw_model.extract_keywords(doc)

## Universal Sentence Encoder (USE)
The Universal Sentence Encoder encodes text into high-dimensional vectors that are used here for embedding the documents. The model is trained and optimized for greater-than-word-length text, such as sentences, phrases or short paragraphs.

In [26]:
import tensorflow_hub
embedding_model = tensorflow_hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

In [27]:
kw_model = KeyBERT(model=embedding_model)
kw_model.extract_keywords(doc)

[('training', 0.2578),
 ('learning', 0.231),
 ('algorithm', 0.2202),
 ('data', 0.1965),
 ('pairs', 0.1921)]

## Gensim
For Gensim, KeyBERT supports its `gensim.downloader` module. Here, we can download any word embedding model to be used in KeyBERT. Note that Gensim is primarily used for word embedding models. This typically works best for short documents since the word embeddings are pooled.

In [None]:
import gensim.downloader as api
ft = api.load('fasttext-wiki-news-subwords-300')

In [None]:
kw_model = KeyBERT(model=ft)
kw_model.extract_keywords(doc)

## Custom Backend
If your backend or model cannot be found in the ones currently available, you can use the `BaseEmbedder` class to create your own backend. Below, you will find an example of creating a SentenceTransformer backend for KeyBERT:

In [None]:
from keybert.backend import BaseEmbedder
from sentence_transformers import SentenceTransformer

class CustomEmbedder(BaseEmbedder):
    def __init__(self, embedding_model):
        super().__init__()
        # Any object with an encode() method, here a SentenceTransformer
        self.embedding_model = embedding_model

    def embed(self, documents, verbose=False):
        # Return one embedding per input document/word
        embeddings = self.embedding_model.encode(documents, show_progress_bar=verbose)
        return embeddings

# Create custom backend
distilbert = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
custom_embedder = CustomEmbedder(embedding_model=distilbert)

In [None]:
kw_model = KeyBERT(model=custom_embedder)
kw_model.extract_keywords(doc)

GitHub: https://github.com/MaartenGr/KeyBERT/tree/master

# **Candidates**
In some cases, you might want to use candidate keywords generated by other keyword algorithms or retrieved from a select list of possible keywords/keyphrases. In KeyBERT, you can easily use those candidate keywords to perform keyword extraction. We are going to create these candidates with [YAKE](https://github.com/LIAAD/yake), another great tool for extracting keywords.

We start by installing yake:

In [None]:
%%capture
!pip install yake

Next, we will create 20 candidate keywords with YAKE:

In [None]:
import yake

kw_extractor = yake.KeywordExtractor(top=20)
candidates = kw_extractor.extract_keywords(doc)
candidates = [candidate[0] for candidate in candidates]

Finally, we are going to pass these candidates to KeyBERT and use MMR to select the top 5 keywords/keyphrases:

In [None]:
kw_model = KeyBERT()
kw_model.extract_keywords(doc, candidates=candidates, use_mmr=True, diversity=0.5)