# Hands-on session with pke - part 2

This notebook provides a series of examples on how to parameterize the keyphrase extraction models in `pke`.
More specifically, we will see how to customize the identification of keyphrase candidates and how to use different models implemented in `pke`.

In a second time, we will conduct a series of experiments to compare several models on Inspec, a commonly-used benchmark dataset for keyphrase extraction that contains bibliographic records (i.e. title/abstract from scientific papers).

As a reminder, `pke` provides a standardized API for extracting keyphrases from a document by typing the following 5 lines:

```python
import pke

extractor = pke.unsupervised.TfIdf()        # initialize a keyphrase extraction model, here TFxIDF
extractor.load_document(input='text')       # load the content of the document (str or spacy Doc)
extractor.candidate_selection()             # identify keyphrase candidates
extractor.candidate_weighting()             # weight keyphrase candidates
keyphrases = extractor.get_n_best(n=5)      # select the 5-best candidates as keyphrases
```

## Preamble on keyphrase extraction datasets using 🤗 datasets

For simplicity and ease of use, we rely on the `datasets` module from 🤗 huggingface to load and access sample documents.



In [1]:
from datasets import load_dataset

# load the inspec dataset
dataset = load_dataset('boudinfl/inspec', "all")

# let's have a look at one sample document from the validation split
sample = dataset["validation"][233]

print("id: {}".format(sample["id"]))
print("title: {}...".format(sample["title"][:50]))
print("abstract: {}...".format(sample["abstract"][:50]))
print("controlled keyphrases: {}; ...".format("; ".join(sample["contr"][:3])))
print("uncontrolled keyphrases: {}; ...".format("; ".join(sample["uncontr"][:3])))

Reusing dataset inspec (/Users/boudin-f/.cache/huggingface/datasets/boudinfl___inspec/all/1.0.1/f333b3e8c7190f09ecbc2eee2706f13dd7370a0f3d72bb15ceb6e34ee90a6aa7)


  0%|          | 0/3 [00:00<?, ?it/s]

id: 1895
title: An algorithm combining neural networks with fundam...
abstract: An algorithm combining neural networks with the fu...
controlled keyphrases: chromium alloys; iron alloys; neural nets; ...
uncontrolled keyphrases: algorithm; neural networks; fundamental parameters; ...


## Model parameterization - candidate selection

Candidate selection is a crucial stage in keyphrase extraction as it determines the size of the search space (i.e. number of candidates to rank/weight) and the upper bound performance (i.e. maximum recall).
Here, we will see how to configure the candidate selection method in `pke` to achieve the best compromise between search space and maximum performance.

In order to compare candidate selection methods, we compute the maximum recall score against the gold standard (human-assigned) keyphrases as

$$max\_recall = \frac{| candidates \cap references|}{|references|}$$

In [2]:
import pke

# initialize a simple model that ranks candidates using their position
extractor = pke.unsupervised.FirstPhrases()

# the text to process is the title concatenated to the abstract
text = sample["title"] + ". " + sample["abstract"]

# the references in stemmed form to compute the maximum recall
references = sample["uncontr_stems"]

# load the document using the initialized model
extractor.load_document(input=text, language='en')

### Setting up a linguistic-based selection method

In [3]:
grammar = r"""
                NP:
                    {<NOUN|PROPN>+}
            """

extractor.grammar_selection(grammar=grammar)

# let's see how many candidates are identified
print("{} keyphrase candidates were identified".format(len(extractor.candidates)))

# print out a sample
candidates = [*extractor.candidates]
print("- Subsample of candidates:", ' ; '.join(candidates[:5]))

# compute the maximum recall
max_recall = len(set(references) & set(candidates)) / len(set(references))
print("- Maximum recall: {:.3f}".format(max_recall))

# identify missed reference keyphrases
missed = set(references) - set(candidates)
print("- Missed reference keyphrases: {}".format(missed))

28 keyphrase candidates were identified
- Subsample of candidates: algorithm ; network ; paramet ; paramet equat ; nnfp
- Maximum recall: 0.529
- Missed reference keyphrases: {'neural network', 'hyperbol function model', 'fundament paramet', 'nonlinear matrix effect', 'fundament paramet equat', 'complex multivari system', 'lachance-trail model', 'theoret correct model'}


### <span style="background:lightpink">Exercice ✍️</span>

try modifying/adding PoS patterns of the grammar to increase the maximum recall, for example by allowing predicative adjectives (e.g. `<ADJ>+`).

### Setting up a n-gram-based selection method

In [4]:
import re

# here we use a simple n-gram selection for candidates
extractor.ngram_selection(n=3)

# filter out spurious candidates 
for i, candidate in enumerate(list(extractor.candidates.keys())):
    # remove if containing punctuation marks
    if re.search(r'\.|\?|\!|\,', candidate):
        del extractor.candidates[candidate]

# let's see how many candidates are identified
print("{} keyphrase candidates were identified".format(len(extractor.candidates)))

# print out a sample
candidates = [*extractor.candidates]
print("- Subsample of candidates:", ' ; '.join(candidates[:5]))

# compute the maximum recall
max_recall = len(set(references) & set(candidates)) / len(set(references))
print("- Maximum recall: {:.3f}".format(max_recall))

# identify missed reference keyphrases
missed = set(references) - set(candidates)
print("- Missed reference keyphrases: {}".format(missed))

363 keyphrase candidates were identified
- Subsample of candidates: an ; an algorithm ; an algorithm combin ; algorithm ; algorithm combin
- Maximum recall: 0.941
- Missed reference keyphrases: {'nonlinear matrix effect'}


### <span style="background:lightpink">Exercice ✍️</span>

try removing unrelevant candidates to reduce the search space by adding constraints in the filtering process.

## Model parameterization - candidate weighting/ranking

The keyphrase extraction model that we use in `pke` define how candidates are weighted. For example, in TopicRank, candidates are weighted using a graph-based ranking model whereas in Yake, candidates are weighted using a combination of statistical features (e.g. position, frequency). Here, we will see how to use different models implemented in `pke`. For comparison purposes, we will use a unified candidate selection method (as presented above). Models are evaluated against the gold standard (human-assigned) keyphrases by computing the precision, recall and f-measure at the top-N extracted keyphases as:

$$ precision@N = \frac{| top-N candidates \cap references|}{|top-N candidates|} $$

$$ recall@N = \frac{| top-N candidates \cap references|}{|references|} $$

$$ f-measure@N = 2 \times \frac{precision@N \cdot recall@N }{precision@N + recall@N} $$

In [5]:
# the text to process is the title concatenated to the abstract
text = sample["title"] + ". " + sample["abstract"]

# the unified grammar for candidate selection
grammar="NP: {<ADJ>*<NOUN|PROPN>+}"

### Baseline model: TopicRank

In [6]:
extractor = pke.unsupervised.TopicRank()
extractor.load_document(input=text, language='en')
extractor.grammar_selection(grammar=grammar)
extractor.candidate_weighting()
keyphrases = extractor.get_n_best(n=5, stemming=True)

top5 = [candidate for candidate, weight in keyphrases]
print("top-5 keyphrases:", '; '.join(top5))

# evaluate the Precision / Recall / F-measure of the model
P = len(set(top5) & set(references)) / len(top5)
R = len(set(top5) & set(references)) / len(references)
F = 2 * (P*R) / (P+R)
print("P@5: {:.3f} R@5: {:.3f} F@5: {:.3f}".format(P, R, F))

top-5 keyphrases: fundament paramet; nnfp; algorithm; classic theoret correct model; non-linear matrix effect
P@5: 0.400 R@5: 0.118 F@5: 0.182


### A better graph-based model: MultipartiteRank

In [7]:
extractor = pke.unsupervised.YAKE()
extractor.load_document(input=text, language='en')
extractor.grammar_selection(grammar=grammar)
extractor.candidate_weighting()
keyphrases = extractor.get_n_best(n=5, stemming=True)

top5 = [candidate for candidate, weight in keyphrases]
print("top-5 keyphrases:", '; '.join(top5))

# evaluate the Precision / Recall / F-measure of the model
P = len(set(top5) & set(references)) / len(top5)
R = len(set(top5) & set(references)) / len(references)
F = 2 * (P*R) / (P+R)
print("P@5: {:.3f} R@5: {:.3f} F@5: {:.3f}".format(P, R, F))

top-5 keyphrases: nnfp algorithm; neural network; fundament paramet; fundament paramet equat; fundament paramet approach
P@5: 0.800 R@5: 0.235 F@5: 0.364


### Supervised model: Kea

In [8]:
extractor = pke.supervised.Kea()
extractor.load_document(input=text, language='en')
extractor.grammar_selection(grammar=grammar)
extractor.candidate_weighting()
keyphrases = extractor.get_n_best(n=5, stemming=True)

top5 = [candidate for candidate, weight in keyphrases]
print("top-5 keyphrases:", '; '.join(top5))

# evaluate the Precision / Recall / F-measure of the model
P = len(set(top5) & set(references)) / len(top5)
R = len(set(top5) & set(references)) / len(references)
F = 2 * (P*R) / (P+R)
print("P@5: {:.3f} R@5: {:.3f} F@5: {:.3f}".format(P, R, F))



top-5 keyphrases: nnfp algorithm; neural network; fundament paramet; fundament paramet equat; non-linear matrix effect
P@5: 0.800 R@5: 0.235 F@5: 0.364


https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations


<span style="background:lightpink">
    Exercice: try using another model.
</span>

## Benchmarking models on the inspec dataset

In [9]:
# preprocessing the dataset using spacy
import re
import spacy
from tqdm.notebook import tqdm
from spacy.tokenizer import _get_regex_pattern

nlp = spacy.load("en_core_web_sm")

# Tokenization fix for in-word hyphens (e.g. non-linear is kept as one token)
re_token_match = _get_regex_pattern(nlp.Defaults.token_match)
re_token_match = f"({re_token_match}|\w+-\w+)"
nlp.tokenizer.token_match = re.compile(re_token_match).match

docs = []
for sample in tqdm(dataset['test']):
    docs.append(nlp(sample["title"]+". "+sample["abstract"]))

  0%|          | 0/500 [00:00<?, ?it/s]

In [10]:
# extract keyphrases
keyphrases = []
for i, doc in enumerate(tqdm(docs)):
    extractor = pke.unsupervised.FirstPhrases()
    extractor.load_document(input=doc, language='en')
    extractor.grammar_selection(grammar=grammar)
    extractor.candidate_weighting()
    keyphrases.append([u for u,v in extractor.get_n_best(n=5, stemming=True)])

  0%|          | 0/500 [00:00<?, ?it/s]

In [11]:
import numpy as np

# evaluate keyphrases
scores = []
for i, output in enumerate(tqdm(keyphrases)):
    references = dataset['test'][i]["uncontr_stems"]
    P = len(set(output) & set(references)) / len(output)
    R = len(set(output) & set(references)) / len(references)
    F = 0.0
    if (P+R) > 0:
        F = 2 * (P*R) / (P+R)
    scores.append((P, R, F))

avg_scores = np.mean(scores, axis=0)
print("P@5: {:.3f} R@5: {:.3f} F@5: {:.3f}".format(avg_scores[0], avg_scores[1], avg_scores[2]))

  0%|          | 0/500 [00:00<?, ?it/s]

P@5: 0.336 R@5: 0.204 F@5: 0.239
