# Hands-on session with pke - part 2

This notebook provides a series of examples on how to parameterize the keyphrase extraction models in `pke`.
More specifically, we will see how to customize the identification of keyphrase candidates and how to use different models implemented in `pke`.

As a reminder, `pke` provides a standardized API for extracting keyphrases from a document by typing the following 5 lines:

```python
import pke

extractor = pke.unsupervised.TfIdf()        # initialize a keyphrase extraction model, here TFxIDF
extractor.load_document(input='text')       # load the content of the document (str or spacy Doc)
extractor.candidate_selection()             # identify keyphrase candidates
extractor.candidate_weighting()             # weight keyphrase candidates
keyphrases = extractor.get_n_best(n=5)      # select the 5-best candidates as keyphrases
```

### Preamble - initializing a simple model and a sample document

In [13]:
import pke

# sample document (1895.abstr from the Inspec dataset)
sample = """An algorithm combining neural networks with fundamental parameters.
An algorithm combining neural networks with the fundamental parameters equations (NNFP) is proposed for making
corrections for non-linear matrix effects in x-ray fluorescence analysis. In the algorithm, neural networks were
applied to relate the concentrations of components to both the measured intensities and the relative theoretical
intensities calculated by the fundamental parameter equations. The NNFP algorithm is compared with the classical
theoretical correction models, including the fundamental parameters approach, the Lachance-Traill model, a
hyperbolic function model and the COLA algorithm. For an alloy system with 15 measured elements, in most cases,
the prediction errors of the NNFP algorithm are lower than those of the fundamental parameters approach, the
Lachance-Traill model, the hyperbolic function model and the COLA algorithm separately. If there are the serious
matrix effects, such as matrix effects among Cr, Fe and Ni, the NNFP algorithm generally decreased predictive
errors as compared with the classical models, except for the case of Cr by the fundamental parameters approach.
The main reason why the NNFP algorithm has generally a better predictive ability than the classical theoretical
correction models might be that neural networks can better calibrate the non-linear matrix effects in a complex
multivariate system.""".replace("\n", " ")

# initialize a simple model that ranks candidates using their position
extractor = pke.unsupervised.FirstPhrases()

# load the document using the initialized model
extractor.load_document(input=sample, language='en')

## Model parameterization - candidate selection

Candidate selection is a crucial stage in keyphrase extraction as it determines the size of the search space (i.e. number of candidates to rank/weight) and the upper bound performance (i.e. maximum recall).
Here, we will see how to configure the candidate selection method in `pke` to achieve the best compromise between search space and maximum performance.

In order to compare candidate selection methods, we compute the maximum recall score against the gold-standard (human-assigned) keyphrases as

$$max\_recall = \frac{| \text{candidates} \cap \text{references}|}{|\text{references}|}$$

Candidate and reference keyphrases are stemmed (using `nltk`'s Porter stemmer) to reduce the number of mismatches.

In [26]:
# gold-standard keyphrases for the sample document (1895.abstr, keyphrases are in stemmed form)
references = ['algorithm', 'neural network', 'fundament paramet', 'fundament paramet equat',
              'nonlinear matrix effect', 'x-ray fluoresc analysi', 'intens', 'nnfp algorithm',
              'theoret correct model', 'lachance-trail model', 'hyperbol function model',
              'cola algorithm', 'alloy system', 'cr', 'fe', 'ni', 'complex multivari system']

def max_recall(candidates, references):
    return len(set(references) & set(candidates)) / len(set(references))

### Setting up a linguistic-based selection method

In [27]:
grammar = r"""
                NP:
                    {<NOUN|PROPN>+}
            """

extractor.grammar_selection(grammar=grammar)

# let's see how many candidates are identified
print("{} keyphrase candidates were identified".format(len(extractor.candidates)))

# print out a sample
candidates = [*extractor.candidates]
print("- Subsample of candidates:", ' ; '.join(candidates[:5]))

# compute the maximum recall
print("- Maximum recall: {:.3f}".format(max_recall(candidates, references)))

# identify missed reference keyphrases
missed = set(references) - set(candidates)
print("- Missed reference keyphrases: {}".format(missed))

28 keyphrase candidates were identified
- Subsample of candidates: algorithm ; network ; paramet ; paramet equat ; nnfp
- Maximum recall: 0.529
- Missed reference keyphrases: {'fundament paramet equat', 'hyperbol function model', 'complex multivari system', 'theoret correct model', 'fundament paramet', 'neural network', 'lachance-trail model', 'nonlinear matrix effect'}
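
As a quick sanity check, the reported maximum recall can be recomputed from the printed output alone. The sketch below copies the reference list and the missed set from the cells above; it does not call `pke`, and the variable name `matched` is introduced only for illustration.

```python
# gold-standard keyphrases, copied from the cell above
references = ['algorithm', 'neural network', 'fundament paramet', 'fundament paramet equat',
              'nonlinear matrix effect', 'x-ray fluoresc analysi', 'intens', 'nnfp algorithm',
              'theoret correct model', 'lachance-trail model', 'hyperbol function model',
              'cola algorithm', 'alloy system', 'cr', 'fe', 'ni', 'complex multivari system']

# missed reference keyphrases, copied from the printed output
missed = {'fundament paramet equat', 'hyperbol function model', 'complex multivari system',
          'theoret correct model', 'fundament paramet', 'neural network',
          'lachance-trail model', 'nonlinear matrix effect'}

# references that do appear among the candidates
matched = set(references) - missed
recall = len(matched) / len(set(references))

print("{} / {} = {:.3f}".format(len(matched), len(set(references)), recall))
# prints: 9 / 17 = 0.529
```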


### <span style="background:lightpink">Exercise ✍️</span>

Try modifying or adding PoS patterns in the grammar to increase the maximum recall, for example by allowing attributive adjectives (e.g. `<ADJ>+`) before the nouns.

### Setting up an n-gram-based selection method

In [28]:
# here we use a simple n-gram selection for candidates
extractor.ngram_selection(n=3)

# filter out spurious candidates
for candidate in list(extractor.candidates):

    # get the words of the candidate (from its first surface form)
    words = [w.lower() for w in extractor.candidates[candidate].surface_forms[0]]

    # remove candidates containing stopwords
    if set(extractor.stoplist) & set(words):
        del extractor.candidates[candidate]

# let's see how many candidates are identified
print("{} keyphrase candidates were identified".format(len(extractor.candidates)))

# print out a sample
candidates = [*extractor.candidates]
print("- Subsample of candidates:", ' ; '.join(candidates[:5]))

# compute the maximum recall
print("- Maximum recall: {:.3f}".format(max_recall(candidates, references)))

# identify missed reference keyphrases
missed = set(references) - set(candidates)
print("- Missed reference keyphrases: {}".format(missed))

173 keyphrase candidates were identified
- Subsample of candidates: algorithm ; algorithm combin ; algorithm combin neural ; combin ; combin neural
- Maximum recall: 0.941
- Missed reference keyphrases: {'nonlinear matrix effect'}
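
The one remaining miss is probably a spelling mismatch rather than a selection failure: the reference is written `nonlinear`, whereas the document uses `non-linear`. The sketch below shows how a hyphen-insensitive comparison would reconcile the two. The helper `normalize` is hypothetical (it is not part of `pke`), and the candidate string is assumed to be the stemmed form that appears in the model outputs later in this notebook.

```python
def normalize(phrase):
    """Remove hyphens so that 'non-linear' and 'nonlinear' compare equal."""
    return phrase.replace("-", "")

reference = "nonlinear matrix effect"
candidate = "non-linear matrix effect"  # assumed stemmed candidate form

# exact string matching treats the two spellings as different keyphrases
print(reference == candidate)                        # prints: False

# after normalization on both sides, they match
print(normalize(reference) == normalize(candidate))  # prints: True
```

Any such normalization has to be applied to candidates and references alike; otherwise hyphenated references such as `x-ray fluoresc analysi` would stop matching.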


### <span style="background:lightpink">Exercise ✍️</span>

Try removing more spurious candidates to reduce the search space, for example by removing candidates that contain punctuation marks as words.
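
One possible starting point is a small predicate that flags word lists containing a punctuation-only token. The sketch below is self-contained and runs on hypothetical surface forms rather than on a real `pke` extractor; the function name `contains_punctuation` and the sample data are assumptions made for illustration.

```python
import string

def contains_punctuation(words):
    """Return True if any word consists only of punctuation characters."""
    return any(w and all(ch in string.punctuation for ch in w) for w in words)

# hypothetical candidates mapped to the words of one surface form
surface_forms = {
    "neural network": ["neural", "networks"],
    "paramet .": ["parameters", "."],
    "x-ray fluoresc analysi": ["x-ray", "fluorescence", "analysis"],
    "( nnfp )": ["(", "NNFP", ")"],
}

# keep only the candidates without punctuation-only words
kept = [c for c, words in surface_forms.items() if not contains_punctuation(words)]
print(kept)
# prints: ['neural network', 'x-ray fluoresc analysi']
```

Words such as `x-ray` are kept because they are not made up of punctuation only. In the filtering loop of the cell above, the same test could be applied to `words` alongside the stopword check.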

## Model parameterization - candidate weighting/ranking

The keyphrase extraction model used in `pke` defines how candidates are weighted. For example, TopicRank weights candidates with a graph-based ranking model, whereas YAKE combines statistical features (e.g. position, frequency). Here, we will see how to use different models implemented in `pke`. For comparison purposes, we will use a unified candidate selection method based on the following PoS grammar:

In [29]:
# the unified grammar for candidate selection
grammar="NP: {<ADJ>*<NOUN|PROPN>+}"
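
To get a feel for what this pattern selects, compared with the noun-only pattern used earlier, the sketch below emulates PoS pattern matching with a plain regular expression over hand-tagged tokens. It is an illustration only, not how `pke` implements `grammar_selection`; the function `np_chunks`, the one-letter tag encoding and the hand-assigned tags are all assumptions, and a real tagger may label these words differently.

```python
import re

def np_chunks(tagged, pattern):
    """Return the spans of (word, tag) pairs whose tag sequence matches `pattern`."""
    # encode one character per token so regex offsets map back to token indices
    codes = {"ADJ": "A", "NOUN": "N", "PROPN": "N"}
    encoded = "".join(codes.get(tag, "x") for _, tag in tagged)
    return [" ".join(word for word, _ in tagged[m.start():m.end()])
            for m in re.finditer(pattern, encoded)]

# hand-tagged tokens (an assumption, not the output of a real tagger)
tagged = [("neural", "ADJ"), ("networks", "NOUN"), ("can", "AUX"),
          ("better", "ADV"), ("calibrate", "VERB"), ("the", "DET"),
          ("non-linear", "ADJ"), ("matrix", "NOUN"), ("effects", "NOUN")]

# noun-only pattern, analogous to <NOUN|PROPN>+
print(np_chunks(tagged, r"N+"))
# prints: ['networks', 'matrix effects']

# pattern with optional leading adjectives, analogous to <ADJ>*<NOUN|PROPN>+
print(np_chunks(tagged, r"A*N+"))
# prints: ['neural networks', 'non-linear matrix effects']
```

With the noun-only pattern, the adjectives are dropped from the candidates, which is consistent with reference keyphrases such as `neural network` being missed in the first experiment.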

Models are evaluated against the gold-standard keyphrases by computing the precision, recall and F-measure at the top-N extracted keyphrases as:

$$ P@N = \frac{| \text{top-N keyphrases} \cap \text{references}|}{|\text{top-N keyphrases}|} $$

$$ R@N = \frac{| \text{top-N keyphrases} \cap \text{references}|}{|\text{references}|} $$

$$ F_1@N = \frac{2 \cdot P@N \cdot R@N }{P@N + R@N} $$

In [32]:
def evaluate(top_N_keyphrases, references):
    P = len(set(top_N_keyphrases) & set(references)) / len(top_N_keyphrases)
    R = len(set(top_N_keyphrases) & set(references)) / len(references)
    F = (2*P*R)/(P+R) if (P+R) > 0 else 0 
    return (P, R, F)

### Baseline model: TopicRank

In [34]:
extractor = pke.unsupervised.TopicRank()
extractor.load_document(input=sample, language='en')
extractor.grammar_selection(grammar=grammar)
extractor.candidate_weighting()
keyphrases = extractor.get_n_best(n=5, stemming=True)

top5 = [candidate for candidate, weight in keyphrases]
print("top-5 keyphrases:", '; '.join(top5))

# evaluate the model
P, R, F = evaluate(top5, references)
print("P@5: {:.3f} R@5: {:.3f} F@5: {:.3f}".format(P, R, F))

top-5 keyphrases: fundament paramet; nnfp; algorithm; classic theoret correct model; non-linear matrix effect
P@5: 0.400 R@5: 0.118 F@5: 0.182
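
The scores above can be checked by hand from the printed top-5 list. The sketch below copies that list and the reference keyphrases from this notebook and recomputes the three measures without calling `pke`; the variable `hits` is introduced only for illustration.

```python
# gold-standard keyphrases, copied from the candidate selection section
references = ['algorithm', 'neural network', 'fundament paramet', 'fundament paramet equat',
              'nonlinear matrix effect', 'x-ray fluoresc analysi', 'intens', 'nnfp algorithm',
              'theoret correct model', 'lachance-trail model', 'hyperbol function model',
              'cola algorithm', 'alloy system', 'cr', 'fe', 'ni', 'complex multivari system']

# top-5 keyphrases, copied from the TopicRank output above
top5 = ['fundament paramet', 'nnfp', 'algorithm',
        'classic theoret correct model', 'non-linear matrix effect']

# extracted keyphrases that exactly match a reference
hits = set(top5) & set(references)

P = len(hits) / len(top5)
R = len(hits) / len(references)
F = 2 * P * R / (P + R) if (P + R) > 0 else 0.0

print(sorted(hits))
# prints: ['algorithm', 'fundament paramet']
print("P@5: {:.3f} R@5: {:.3f} F@5: {:.3f}".format(P, R, F))
# prints: P@5: 0.400 R@5: 0.118 F@5: 0.182
```

Exact matching is strict: `classic theoret correct model` differs from the reference `theoret correct model` by a single leading word and therefore earns no credit.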


### A strong baseline model: MultipartiteRank

In [42]:
extractor = pke.unsupervised.MultipartiteRank()
extractor.load_document(input=sample, language='en')
extractor.grammar_selection(grammar=grammar)
extractor.candidate_weighting()
keyphrases = extractor.get_n_best(n=5, stemming=True)

top5 = [candidate for candidate, weight in keyphrases]
print("top-5 keyphrases:", '; '.join(top5))

# evaluate the model
P, R, F = evaluate(top5, references)
print("P@5: {:.3f} R@5: {:.3f} F@5: {:.3f}".format(P, R, F))

top-5 keyphrases: fundament paramet; algorithm; neural network; nnfp; classic theoret correct model
P@5: 0.600 R@5: 0.176 F@5: 0.273


### <span style="background:lightpink">Exercise ✍️</span>

Try using another model, for example one of the other unsupervised models implemented in `pke` (`FirstPhrases`, `TextRank`, `TfIdf`, `YAKE`) or a supervised model such as `Kea`.