# Hands-on session with pke

## Part 2 : model parameterization

As we have seen, keyphrase extraction is commonly treated as a three-stage process involving:
 1. identifying keyphrase candidates
 2. weighting these candidates
 3. selecting the N-best candidates as keyphrases

Each of the above stages can be parameterized/modified in `pke` and we will see how below.
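
Before turning to `pke` itself, the three stages can be sketched in plain Python. This is a toy illustration only: the stopword list, the function names and the position-based weighting are assumptions made for this example and are not `pke`'s implementation.

```python
import re

# tiny illustrative stopword list (an assumption, not pke's stoplist)
STOPWORDS = {"a", "and", "by", "for", "in", "is", "of", "that", "the", "to"}

def select_candidates(tokens):
    """Stage 1: maximal runs of non-stopword tokens, mapped to their first offset."""
    candidates, run, start = {}, [], 0
    for i, tok in enumerate(tokens + [None]):  # the trailing None flushes the last run
        if tok is not None and tok not in STOPWORDS:
            if not run:
                start = i
            run.append(tok)
        elif run:
            candidates.setdefault(" ".join(run), start)
            run = []
    return candidates

def weight_candidates(candidates):
    """Stage 2: the earlier a candidate first occurs, the higher its weight."""
    return {cand: -offset for cand, offset in candidates.items()}

def n_best(weights, n):
    """Stage 3: keep the n highest-weighted candidates."""
    return sorted(weights, key=weights.get, reverse=True)[:n]

text = "adaptive resource management for distributed systems and the management of resources"
tokens = re.findall(r"\w+", text.lower())
top = n_best(weight_candidates(select_candidates(tokens)), n=2)
print(top)  # ['adaptive resource management', 'distributed systems']
```

Each function can be swapped independently, which is the same separation of concerns that the parameterization options in the rest of this part rely on.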

As a reminder, `pke` provides a standardized API for extracting keyphrases from a document by typing the following 5 lines:

```python
import pke

extractor = pke.unsupervised.TopicRank()             # initialize a model, here TopicRank
extractor.load_document(input='text', language='en') # load the document
extractor.candidate_selection()                      # identify keyphrase candidates
extractor.candidate_weighting()                      # weight candidates 
keyphrases = extractor.get_n_best(n=10)              # select the N-best candidates
```

In [1]:
import pke

# sample document (C-41 from the SemEval-2010 dataset)
sample = """Evaluating Adaptive Resource Management for Distributed Real-Time Embedded Systems.
A challenging problem faced by researchers and developers
of distributed real-time and embedded (DRE) systems is 
devising and implementing effective adaptive resource 
management strategies that can meet end-to-end quality of service
(QoS) requirements in varying operational conditions. This
paper presents two contributions to research in adaptive 
resource management for DRE systems. First, we describe the
structure and functionality of the Hybrid Adaptive 
Resourcemanagement Middleware (HyARM), which provides 
adaptive resource management using hybrid control techniques
for adapting to workload fluctuations and resource 
availability. Second, we evaluate the adaptive behavior of HyARM
via experiments on a DRE multimedia system that distributes
video in real-time. Our results indicate that HyARM yields
predictable, stable, and high system performance, even in the
face of fluctuating workload and resource availability."""

# normalize spacing (replace newlines with spaces)
sample = sample.replace("\n", " ")

# gold keyphrases for the sample document 
references = ['adapt resourc manag', 'distribut real-time embed system', 
              'end-to-end qualiti of servic', 'hybrid adapt resourcemanag middlewar',
              'hybrid control techniqu', 'real-time video distribut system',
              'real-time corba specif', 'video encod/decod', 'resourc reserv mechan',
              'dynam environ', 'stream servic', 'distribut real-time emb system',
              'hybrid system', 'qualiti of servic']

# initialize a simple model that ranks candidates using their position
extractor = pke.unsupervised.FirstPhrases()

# load the document using the initialized model
extractor.load_document(input=sample, language='en')

### Stage-1 : candidate selection

Candidate selection is a crucial step in keyphrase extraction because it determines both the size of the search space (i.e. the number of candidates to rank) and the upper bound on performance (i.e. the maximum achievable recall).
Here, we will see how to configure the candidate selection method in `pke` to achieve the best compromise between search space and maximum performance.
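
To make the trade-off concrete, here is a minimal sketch of the maximum-recall computation on made-up candidate lists. The helper name `max_recall` and the two candidate lists are assumptions for illustration; only the formula mirrors the one used in the cells of this section.

```python
def max_recall(candidates, references):
    """Fraction of gold keyphrases that appear among the candidates."""
    gold = set(references)
    if not gold:
        return 0.0
    return len(gold & set(candidates)) / len(gold)

# four gold keyphrases (stemmed), taken from the reference list above
gold = ["adapt resourc manag", "hybrid system", "qualiti of servic", "dynam environ"]

# a small search space and a larger one (both invented for this example)
small_space = ["adapt resourc manag", "challeng problem"]
large_space = small_space + ["hybrid system", "qualiti of servic", "research", "real"]

print(len(small_space), max_recall(small_space, gold))  # 2 0.25
print(len(large_space), max_recall(large_space, gold))  # 6 0.75
```

The larger space raises the recall ceiling, but it also gives the ranking stage more incorrect candidates to sort out.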

#### Default method (from FirstPhrases)

In [2]:
# First, we apply the default candidate selection method, which is model-dependent.
# here for FirstPhrases, candidates are the longest sequences of nouns and adjectives
extractor.candidate_selection()

# let's see how many candidates are identified
print("{} keyphrase candidates were identified".format(len(extractor.candidates)))

# print out a sample
candidates = [*extractor.candidates]
print("sample of candidates:", ' ; '.join(candidates[:5]))

# compute the maximum recall
max_recall = len(set(references) & set(candidates)) / len(set(references))
print("Maximum recall: {:.3f}".format(max_recall))

41 keyphrase candidates were identified
sample of candidates: adapt resourc manag ; real ; time embed system ; challeng problem ; research
Maximum recall: 0.143


#### Identify keyphrase candidates using PoS patterns

In [3]:
# first we need to empty the candidates
extractor.candidates.clear()

# here we use a simple grammar that defines candidates as noun phrases (NP)
grammar = r"""
                NBAR:
                    {<NOUN|PROPN|ADJ>{,2}<NOUN|PROPN>} 
                    
                NP:
                    {<NBAR>}
                    {<NBAR><ADP><NBAR>}
            """

extractor.grammar_selection(grammar=grammar)

# let's see how many candidates are identified
print("{} keyphrase candidates were identified".format(len(extractor.candidates)))

# print out a sample
candidates = [*extractor.candidates]
print("sample of candidates:", ' ; '.join(candidates[:5]))

# compute the maximum recall
max_recall = len(set(references) & set(candidates)) / len(set(references))
print("Maximum recall: {:.3f}".format(max_recall))

38 keyphrase candidates were identified
sample of candidates: adapt resourc manag ; real ; time embed system ; challeng problem ; research
Maximum recall: 0.143
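
To see what a PoS pattern such as `<NOUN|PROPN|ADJ>{,2}<NOUN|PROPN>` selects, the sketch below matches it against a hand-tagged token sequence using only the standard `re` module. It is an approximation written for this tutorial: the one-letter tag encoding, the function name `nbar_chunks` and the hand-assigned tags are assumptions, and `pke`'s own chunking may behave differently in edge cases.

```python
import re

# One-letter codes for the PoS tags the pattern cares about;
# every other tag collapses to "x".
TAG_CODES = {"NOUN": "N", "PROPN": "P", "ADJ": "A"}

def nbar_chunks(tagged_tokens):
    """Return the spans matching <NOUN|PROPN|ADJ>{,2}<NOUN|PROPN>, left to right."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged_tokens)
    chunks = []
    # up to two nouns/proper nouns/adjectives, then one noun or proper noun
    for match in re.finditer(r"[NPA]{0,2}[NP]", codes):
        words = [word for word, _ in tagged_tokens[match.start():match.end()]]
        chunks.append(" ".join(words))
    return chunks

# hand-tagged tokens, for illustration only
tagged = [("hybrid", "ADJ"), ("control", "NOUN"), ("techniques", "NOUN"),
          ("for", "ADP"), ("adapting", "VERB"), ("to", "ADP"),
          ("workload", "NOUN"), ("fluctuations", "NOUN")]

chunks = nbar_chunks(tagged)
print(chunks)  # ['hybrid control techniques', 'workload fluctuations']
```

Because the pattern caps a chunk at three tokens, longer noun phrases are split into pieces, which is one way a restrictive grammar can lower the maximum recall.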


#### Identify n-grams as keyphrase candidates

In [4]:
# first we need to empty the candidates
extractor.candidates.clear()

# here we use a simple n-gram selection for candidates
extractor.ngram_selection(n=3)

# let's see how many candidates are identified
print("{} keyphrase candidates were identified".format(len(extractor.candidates)))

# print out a sample
candidates = [*extractor.candidates]
print("sample of candidates:", ' ; '.join(candidates[:5]))

# compute the maximum recall
max_recall = len(set(references) & set(candidates)) / len(set(references))
print("Maximum recall: {:.3f}".format(max_recall))

378 keyphrase candidates were identified
sample of candidates: evalu ; evalu adapt ; evalu adapt resourc ; adapt ; adapt resourc
Maximum recall: 0.214
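
The jump in the number of candidates follows directly from how n-grams are enumerated. The sketch below lists every n-gram of up to three tokens for a short stemmed sequence and checks the count against a closed-form sum. The helper names are hypothetical, and the sketch omits any filtering that `pke` may apply on top of the raw enumeration.

```python
def ngrams_up_to(tokens, n):
    """All contiguous token sequences of length 1..n, in order of start position."""
    grams = []
    for start in range(len(tokens)):
        for size in range(1, n + 1):
            if start + size <= len(tokens):
                grams.append(" ".join(tokens[start:start + size]))
    return grams

def ngram_count(num_tokens, n):
    """Number of n-grams of length 1..n in a sequence of num_tokens tokens."""
    return sum(max(num_tokens - size + 1, 0) for size in range(1, n + 1))

tokens = ["evalu", "adapt", "resourc", "manag"]
grams = ngrams_up_to(tokens, n=3)

print(" ; ".join(grams[:5]))  # evalu ; evalu adapt ; evalu adapt resourc ; adapt ; adapt resourc
print(len(grams), ngram_count(len(tokens), 3))  # 9 9
```

The count grows roughly as n times the number of tokens, so n-gram selection buys a higher recall ceiling at the price of a much larger search space.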


### Stage-2 : candidate weighting/ranking

The keyphrase extraction model that we use in `pke` defines how keyphrase candidates are weighted. For example, in TopicRank, candidates are weighted using a graph-based ranking model, whereas in YAKE, candidates are weighted using a combination of statistical features (e.g. position, frequency). Here, we will see how to use different models implemented in `pke`. For comparison purposes, we will use the same grammar-based candidate selection method for every model.

#### Baseline model: TopicRank

In [9]:
# initialize a TopicRank model
extractor = pke.unsupervised.TopicRank()

# load the document
extractor.load_document(input=sample, language='en')

# identify Noun Phrases as keyphrase candidates
extractor.grammar_selection(grammar="NP: {<ADJ>*<NOUN|PROPN>+}")

# weight candidates
extractor.candidate_weighting(threshold=0.74) # the threshold parameter is used to compute topics

# get the 5 best keyphrases
keyphrases = [candidate for candidate, weight in extractor.get_n_best(n=5, stemming=True)]
print("keyphrases:", ' ; '.join(keyphrases))

# evaluate the Precision / Recall / F-measure of the model
P = len(set(keyphrases) & set(references)) / len(keyphrases)
R = len(set(keyphrases) & set(references)) / len(references)
F = 2 * (P*R) / (P+R) if (P+R) > 0 else 0.0  # avoid a division by zero when nothing matches
print("P@5: {:.3f} R@5: {:.3f} F@5: {:.3f}".format(P, R, F))

keyphrases: adapt resourc manag ; dre ; hyarm ; time embed system ; hybrid adapt
P@5: 0.200 R@5: 0.071 F@5: 0.105
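
Since the same evaluation is repeated for every model, it can be convenient to wrap it in a helper. The sketch below is one possible version (the function name `precision_recall_f1` is an assumption, not a `pke` function); it reuses the gold keyphrases and the five TopicRank keyphrases printed above, and guards against empty inputs and zero matches.

```python
def precision_recall_f1(predicted, references):
    """Exact-match precision, recall and F-measure over stemmed keyphrases."""
    gold = set(references)
    hits = len(set(predicted) & gold)
    p = hits / len(predicted) if predicted else 0.0
    r = hits / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f

references = ['adapt resourc manag', 'distribut real-time embed system',
              'end-to-end qualiti of servic', 'hybrid adapt resourcemanag middlewar',
              'hybrid control techniqu', 'real-time video distribut system',
              'real-time corba specif', 'video encod/decod', 'resourc reserv mechan',
              'dynam environ', 'stream servic', 'distribut real-time emb system',
              'hybrid system', 'qualiti of servic']

keyphrases = ['adapt resourc manag', 'dre', 'hyarm', 'time embed system', 'hybrid adapt']

p, r, f = precision_recall_f1(keyphrases, references)
print("P@5: {:.3f} R@5: {:.3f} F@5: {:.3f}".format(p, r, f))  # P@5: 0.200 R@5: 0.071 F@5: 0.105
```

Only one of the five keyphrases ('adapt resourc manag') matches one of the 14 references, which gives P = 1/5 and R = 1/14.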


#### Strong graph-based model: MultipartiteRank

In [10]:
# initialize a MultipartiteRank model
extractor = pke.unsupervised.MultipartiteRank()

# load the document
extractor.load_document(input=sample, language='en')

# identify Noun Phrases as keyphrase candidates
extractor.grammar_selection(grammar="NP: {<ADJ>*<NOUN|PROPN>+}")

# weight candidates
extractor.candidate_weighting(threshold=0.74,  # parameter used to compute topics
                              alpha=1.1)       # parameter that controls the strength of the weight adjustment

# get the 5 best keyphrases
keyphrases = [candidate for candidate, weight in extractor.get_n_best(n=5, stemming=True)]
print("keyphrases:", ' ; '.join(keyphrases))


# evaluate the Precision / Recall / F-measure of the model
P = len(set(keyphrases) & set(references)) / len(keyphrases)
R = len(set(keyphrases) & set(references)) / len(references)
F = 2 * (P*R) / (P+R) if (P+R) > 0 else 0.0  # avoid a division by zero when nothing matches
print("P@5: {:.3f} R@5: {:.3f} F@5: {:.3f}".format(P, R, F))

keyphrases: adapt resourc manag ; time embed system ; dre ; hyarm ; hybrid adapt
P@5: 0.200 R@5: 0.071 F@5: 0.105


#### Supervised model: Kea

In [11]:
# initialize a Kea model
extractor = pke.supervised.Kea()

# load the document
extractor.load_document(input=sample, language='en')

# identify Noun Phrases as keyphrase candidates
extractor.grammar_selection(grammar="NP: {<ADJ>*<NOUN|PROPN>+}")

# weight candidates
extractor.candidate_weighting()

# get the 5 best keyphrases
keyphrases = [candidate for candidate, weight in extractor.get_n_best(n=5, stemming=True)]
print("keyphrases:", ' ; '.join(keyphrases))


# evaluate the Precision / Recall / F-measure of the model
P = len(set(keyphrases) & set(references)) / len(keyphrases)
R = len(set(keyphrases) & set(references)) / len(references)
F = 2 * (P*R) / (P+R) if (P+R) > 0 else 0.0  # avoid a division by zero when nothing matches
print("P@5: {:.3f} R@5: {:.3f} F@5: {:.3f}".format(P, R, F))



keyphrases: hyarm ; adapt resourc manag ; time embed system ; dre ; effect adapt resourc
P@5: 0.200 R@5: 0.071 F@5: 0.105


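Note: Kea relies on a pre-trained model that is loaded from a serialized file. See the scikit-learn notes on the security and maintainability limitations of persisted models: https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations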


### Stage-3 : selecting the N-best candidates

In this last stage, the highest weighted keyphrase candidates are selected as output keyphrases. Here, we will see how to configure this selection.
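
One of the options shown in the cell that follows is redundancy removal, described there as removing candidates contained in higher-ranked ones. The sketch below is a plain-Python approximation of that idea, applied to a fixed ranked list of stemmed candidates. The helper names are assumptions; unlike `get_n_best`, it only filters the list it is given and does not back-fill from lower-ranked candidates.

```python
def is_contained(candidate, selected):
    """True if the words of `candidate` form a contiguous run inside `selected`."""
    cand, sel = candidate.split(), selected.split()
    return any(sel[i:i + len(cand)] == cand
               for i in range(len(sel) - len(cand) + 1))

def remove_redundant(ranked):
    """Walk the ranking top-down, dropping candidates contained in a kept one."""
    kept = []
    for candidate in ranked:
        if not any(is_contained(candidate, k) for k in kept):
            kept.append(candidate)
    return kept

# ranked, stemmed candidates (best first)
ranked = ["adapt resourc manag", "time embed system", "dre", "hyarm",
          "hybrid adapt", "end", "system", "time", "real", "workload fluctuat"]

filtered = remove_redundant(ranked)
print(filtered)
# ['adapt resourc manag', 'time embed system', 'dre', 'hyarm',
#  'hybrid adapt', 'end', 'real', 'workload fluctuat']
```

Here 'system' and 'time' are dropped because both occur inside the higher-ranked 'time embed system', whereas 'hybrid adapt' is kept because it is not a contiguous part of 'adapt resourc manag'.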

In [29]:
# initialize a MultipartiteRank model
extractor = pke.unsupervised.MultipartiteRank()

# load the document
extractor.load_document(input=sample, language='en')

# identify Noun Phrases as keyphrase candidates
extractor.grammar_selection(grammar="NP: {<ADJ>*<NOUN|PROPN>+}")

# weight candidates
extractor.candidate_weighting()

# print out the 10 best candidates
print("top-10 keyphrases")
print("\n".join([ "{}:{}".format(u, v) for u, v in extractor.get_n_best(n=10)]))
print("*"*10)

# same in stemmed forms
print('10-best in stemmed forms', extractor.get_n_best(n=10, stemming=True))

# same but with redundancy removal (removing candidates contained in higher-ranked ones)
print('10-best in stemmed forms w/o redundancy', extractor.get_n_best(n=10, redundancy_removal=True, stemming=True))

top-10 keyphrases
adaptive resource management:0.08897405365022867
time embedded systems:0.05966042113023676
dre:0.05322017865856586
hyarm:0.046889644710346554
hybrid adaptive:0.036936501953574175
end:0.034285447267345746
systems:0.03426425286057176
time:0.0317259787561323
real:0.031146977659670304
workload fluctuations:0.02882190551192987
**********
10-best in stemmed forms [('adapt resourc manag', 0.08897405365022867), ('time embed system', 0.05966042113023676), ('dre', 0.05322017865856586), ('hyarm', 0.046889644710346554), ('hybrid adapt', 0.036936501953574175), ('end', 0.034285447267345746), ('system', 0.03426425286057176), ('time', 0.0317259787561323), ('real', 0.031146977659670304), ('workload fluctuat', 0.02882190551192987)]
10-best in stemmed forms w/o redundancy [('adapt resourc manag', 0.08897405365022867), ('time embed system', 0.05966042113023676), ('dre', 0.05322017865856586), ('hyarm', 0.046889644710346554), ('hybrid adapt', 0.036936501953574175), ('end', 0.0342854472673457