# Hands-on session with pke

## Part 2 : model parameterization

As we have seen, keyphrase extraction is commonly treated as a three-stage process involving:
 1. identifying keyphrase candidates
 2. weighting these candidates
 3. selecting the N-best candidates as keyphrases

Each of these stages can be parameterized in `pke`; the sections below show how.

As a reminder, `pke` provides a standardized API that extracts keyphrases from a document in five lines:

```python
import pke

extractor = pke.unsupervised.TopicRank()             # initialize a model, here TopicRank
extractor.load_document(input='text', language='en') # load the document
extractor.candidate_selection()                      # identify keyphrase candidates
extractor.candidate_weighting()                      # weight candidates 
keyphrases = extractor.get_n_best(n=10)              # select the N-best candidates
```
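
To make the three stages concrete independently of `pke`, here is a minimal toy sketch in plain Python. It is not how `pke` or TopicRank works internally: the stoplist, the frequency-based weighting and the helper names (`select_candidates`, `weight_candidates`, `get_n_best`) are assumptions chosen only for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stoplist (an assumption; real systems use fuller lists).
STOPWORDS = {"a", "an", "the", "of", "for", "in", "is", "are", "and", "by"}


def select_candidates(text):
    """Stage 1: split the text at punctuation and stopwords, keep the word runs."""
    candidates = []
    for chunk in re.split(r"[^\w\s]", text.lower()):
        run = []
        for word in chunk.split():
            if word in STOPWORDS:
                if run:
                    candidates.append(" ".join(run))
                run = []
            else:
                run.append(word)
        if run:
            candidates.append(" ".join(run))
    return candidates


def weight_candidates(candidates):
    """Stage 2: weight each distinct candidate by its number of occurrences."""
    return Counter(candidates)


def get_n_best(weights, n):
    """Stage 3: return the n highest-weighted candidates."""
    return [candidate for candidate, _ in weights.most_common(n)]


text = "Ion exchange is modelled. A model of ion exchange is proposed."
keyphrases = get_n_best(weight_candidates(select_candidates(text)), n=1)
print(keyphrases)
```

Each function maps onto one stage, so swapping the selection or weighting strategy only touches one of them. That separation is what the rest of this part exploits in `pke`.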

```python
import pke

# sample document (2040.abstr from the Hulth-2003 dataset)
sample = """Inverse problems for a mathematical model of ion exchange in a compressible ion exchanger.
A mathematical model of ion exchange is considered, allowing for ion exchanger compression in the process
of ion exchange. Two inverse problems are investigated for this model, unique solvability is proved, and
numerical solution methods are proposed. The efficiency of the proposed methods is demonstrated by a
numerical experiment."""

# normalize spacing (replace newlines with whitespaces)
sample = sample.replace("\n", " ")
```

### Stage-1 : candidate selection

Here, we will see how to configure the candidate selection method in `pke`. This step matters because it sets the upper bound on recall: a keyphrase that is never selected as a candidate cannot be extracted, whatever the weighting method.
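
The upper bound can be made explicit with a small sketch. The candidate list is the stemmed output that the default selection produces for the sample document in this notebook. The reference keyphrases are made up for illustration and are not the Hulth-2003 annotations, and `max_recall` is a hypothetical helper, not part of `pke`.

```python
def max_recall(candidates, references):
    """Fraction of reference keyphrases that appear among the candidates.

    No weighting or ranking method can score higher than this, because it
    can only return phrases that were selected as candidates.
    """
    references = set(references)
    if not references:
        return 0.0
    return len(set(candidates) & references) / len(references)


# stemmed candidates from the default selection on the sample document
candidates = [
    "invers problem", "mathemat model", "ion exchang",
    "compress ion exchang", "ion exchang compress", "process", "model",
    "uniqu solvabl", "numer solut method", "effici", "method", "numer experi",
]

# HYPOTHETICAL reference keyphrases (stemmed), for illustration only
references = ["ion exchang", "invers problem", "mathemat model", "ion exchang process"]

print(max_recall(candidates, references))
```

With these made-up references, one of the four phrases is never proposed as a candidate, so at most three quarters of them can be recovered.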

```python
# let us start with a TopicRank model
extractor = pke.unsupervised.TopicRank()

# load the document using the initialized model
extractor.load_document(input=sample, language='en')

# by default, TopicRank selects the longest sequences of nouns and
# adjectives as keyphrase candidates
extractor.candidate_selection()

# for each keyphrase candidate
for i, candidate in enumerate(extractor.candidates):

    # the candidate is in stemmed form, we can find the (first occurring)
    # surface form using the candidates dictionary structure
    surface_form = ' '.join(extractor.candidates[candidate].surface_forms[0]).lower()

    # print out the candidate id, its stemmed form and first occurring surface form
    print("candidate {}: {} (stemmed) ; {}".format(i, candidate, surface_form))
```

candidate 0: invers problem (stemmed) ; inverse problems
candidate 1: mathemat model (stemmed) ; mathematical model
candidate 2: ion exchang (stemmed) ; ion exchange
candidate 3: compress ion exchang (stemmed) ; compressible ion exchanger
candidate 4: ion exchang compress (stemmed) ; ion exchanger compression
candidate 5: process (stemmed) ; process
candidate 6: model (stemmed) ; model
candidate 7: uniqu solvabl (stemmed) ; unique solvability
candidate 8: numer solut method (stemmed) ; numerical solution methods
candidate 9: effici (stemmed) ; efficiency
candidate 10: method (stemmed) ; methods
candidate 11: numer experi (stemmed) ; numerical experiment
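
The next cell replaces the default selection with a part-of-speech grammar. As a rough intuition for what such a grammar does, the sketch below matches the simple pattern "optional adjectives followed by one or more nouns" over a tag sequence using only the standard `re` module. This is an imitation for illustration, not `pke`'s implementation; the function `chunk_by_pos` is hypothetical and the PoS tags are assigned by hand rather than by a tagger.

```python
import re


def chunk_by_pos(tagged_tokens, pattern):
    """Return the word sequences whose tag sequence matches `pattern`.

    Tags are serialized as "<ADJ><NOUN>..." so that a regular expression
    over tags can be applied; match offsets are then mapped back to tokens.
    """
    tag_string = ""
    starts = {}  # character offset in tag_string -> token index
    for index, (_, tag) in enumerate(tagged_tokens):
        starts[len(tag_string)] = index
        tag_string += "<{}>".format(tag)
    starts[len(tag_string)] = len(tagged_tokens)

    chunks = []
    for match in re.finditer(pattern, tag_string):
        first, last = starts[match.start()], starts[match.end()]
        chunks.append(" ".join(word for word, _ in tagged_tokens[first:last]))
    return chunks


# hand-tagged fragment of the sample document (tags are an assumption)
tagged = [
    ("unique", "ADJ"), ("solvability", "NOUN"), ("is", "AUX"),
    ("proved", "VERB"), ("and", "CCONJ"), ("numerical", "ADJ"),
    ("solution", "NOUN"), ("methods", "NOUN"), ("are", "AUX"),
    ("proposed", "VERB"),
]

# optional adjectives followed by one or more (proper) nouns
chunks = chunk_by_pos(tagged, r"(?:<ADJ>)*(?:<NOUN>|<PROPN>)+")
print(chunks)
```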


```python
# load the document using the initialized model
extractor.load_document(input=sample, language='en')

# a simpler alternative grammar: optional adjectives followed by one or more nouns
# grammar = "NP: {<ADJ>*<NOUN|PROPN>+}"

# NBAR: up to two nouns/adjectives followed by a noun
# NP: a single NBAR, or two NBARs joined by an adposition (e.g. "of")
grammar = r"""
                NBAR:
                    {<NOUN|PROPN|ADJ>{,2}<NOUN|PROPN>}

                NP:
                    {<NBAR>}
                    {<NBAR><ADP><NBAR>}
            """

# select the candidates that match the PoS grammar
extractor.grammar_selection(grammar=grammar)

# for each keyphrase candidate
for i, candidate in enumerate(extractor.candidates):

    # the candidate is in stemmed form, we can find the (first occurring)
    # surface form using the candidates dictionary structure
    surface_form = ' '.join(extractor.candidates[candidate].surface_forms[0]).lower()

    # print out the candidate id, its stemmed form and first occurring surface form
    print("candidate {}: {} (stemmed) ; {}".format(i, candidate, surface_form))
```


candidate 0: invers problem (stemmed) ; inverse problems
candidate 1: mathemat model (stemmed) ; mathematical model
candidate 2: ion exchang (stemmed) ; ion exchange
candidate 3: compress ion exchang (stemmed) ; compressible ion exchanger
candidate 4: ion exchang compress (stemmed) ; ion exchanger compression
candidate 5: process (stemmed) ; process
candidate 6: model (stemmed) ; model
candidate 7: uniqu solvabl (stemmed) ; unique solvability
candidate 8: numer solut method (stemmed) ; numerical solution methods
candidate 9: effici (stemmed) ; efficiency
candidate 10: method (stemmed) ; methods
candidate 11: numer experi (stemmed) ; numerical experiment
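
The next cell switches to n-gram selection. As a minimal sketch of what "all n-grams up to length 3" means, the function below enumerates every contiguous word sequence of one to `n` tokens. `ngrams_up_to` is a hypothetical helper written for this illustration, not a `pke` function, and it skips the stemming that `pke` applies.

```python
def ngrams_up_to(tokens, n=3):
    """Enumerate all contiguous sequences of 1 to n tokens, left to right."""
    result = []
    for start in range(len(tokens)):
        # the end index is exclusive and capped by the sentence length
        for end in range(start + 1, min(start + n, len(tokens)) + 1):
            result.append(" ".join(tokens[start:end]))
    return result


tokens = ["inverse", "problems", "for", "a"]
grams = ngrams_up_to(tokens, n=3)
print(grams)
```

Because nothing is filtered, sequences made only of function words (such as "for a") are kept as well. This is why n-gram selection yields far more candidates than the PoS-based methods, many of which cannot be keyphrases.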


```python
# load the document using the initialized model
extractor.load_document(input=sample, language='en')

# select all n-grams of up to 3 words as candidates
extractor.ngram_selection(n=3)

# for each keyphrase candidate
for i, candidate in enumerate(extractor.candidates):

    # the candidate is in stemmed form, we can find the (first occurring)
    # surface form using the candidates dictionary structure
    surface_form = ' '.join(extractor.candidates[candidate].surface_forms[0]).lower()

    # print out the candidate id, its stemmed form and first occurring surface form
    print("candidate {}: {} (stemmed) ; {}".format(i, candidate, surface_form))
```


candidate 0: invers problem (stemmed) ; inverse problems
candidate 1: mathemat model (stemmed) ; mathematical model
candidate 2: ion exchang (stemmed) ; ion exchange
candidate 3: compress ion exchang (stemmed) ; compressible ion exchanger
candidate 4: ion exchang compress (stemmed) ; ion exchanger compression
candidate 5: process (stemmed) ; process
candidate 6: model (stemmed) ; model
candidate 7: uniqu solvabl (stemmed) ; unique solvability
candidate 8: numer solut method (stemmed) ; numerical solution methods
candidate 9: effici (stemmed) ; efficiency
candidate 10: method (stemmed) ; methods
candidate 11: numer experi (stemmed) ; numerical experiment
candidate 12: invers (stemmed) ; inverse
candidate 13: invers problem for (stemmed) ; inverse problems for
candidate 14: problem (stemmed) ; problems
candidate 15: problem for (stemmed) ; problems for
candidate 16: problem for a (stemmed) ; problems for a
candidate 17: for (stemmed) ; for
candidate 18: for a (stemmed) ; for a
candidate 