In [1]:
!pip install git+https://github.com/boudinfl/pke.git@v2.0
!pip install datasets
!pip install ipywidgets

Collecting git+https://github.com/boudinfl/pke.git@v2.0
  Cloning https://github.com/boudinfl/pke.git (to revision v2.0) to /private/var/folders/_s/dsym612j14gggkqchsd35clh0000gn/T/pip-req-build-bsahm40a
  Running command git clone --filter=blob:none --quiet https://github.com/boudinfl/pke.git /private/var/folders/_s/dsym612j14gggkqchsd35clh0000gn/T/pip-req-build-bsahm40a
  Running command git checkout -b v2.0 --track origin/v2.0
  Switched to a new branch 'v2.0'
  Branch 'v2.0' set up to track remote branch 'v2.0' from 'origin'.
  Resolved https://github.com/boudinfl/pke.git to commit 43b9783f5397df3d1741feaadf0ff64f401ce19c
  Preparing metadata (setup.py) ... done
Collecting en_core_web_sm@ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0.tar.gz
  Using cached en_core_web_sm-3.2.0-py3-none-any.whl






# Hands-on session with pke - part 3

This notebook provides an end-to-end example of model benchmarking on Inspec, a commonly used keyphrase extraction dataset of bibliographic records (i.e., titles and abstracts of scientific papers).

## Preamble on keyphrase extraction datasets using 🤗 datasets

For simplicity and ease of use, we rely on the 🤗 Hugging Face `datasets` library to load and access sample documents from the Inspec dataset.



In [2]:
from datasets import load_dataset

# load the inspec dataset
dataset = load_dataset('boudinfl/inspec', "all")

# let's have a look at one sample document from the test split
sample = dataset["test"][0]

print("id: {}".format(sample["id"]))
print("title: {}...".format(sample["title"][:50]))
print("abstract: {}...".format(sample["abstract"][:50]))
print("gold-standard keyphrases: {}; ...".format("; ".join(sample["uncontr"][:3])))

Reusing dataset inspec (/Users/boudin-f/.cache/huggingface/datasets/boudinfl___inspec/all/1.0.1/f333b3e8c7190f09ecbc2eee2706f13dd7370a0f3d72bb15ceb6e34ee90a6aa7)



id: 2007
title: The creation of a high-fidelity finite element mod...
abstract: A detailed finite element model of the human kidne...
gold-standard keyphrases: high-fidelity finite element model; kidney; trauma research; ...


## Benchmarking models

### step-1: let's start by preprocessing the dataset using spaCy

In [3]:
import re
import spacy
from tqdm.notebook import tqdm
from spacy.tokenizer import _get_regex_pattern

nlp = spacy.load("en_core_web_sm")

# Tokenization fix for in-word hyphens (e.g. 'non-linear' would be kept 
# as one token instead of default spacy behavior of 'non', '-', 'linear')
re_token_match = _get_regex_pattern(nlp.Defaults.token_match)
re_token_match = rf"({re_token_match}|\w+-\w+)"
nlp.tokenizer.token_match = re.compile(re_token_match).match

# populates a docs list with spacy doc objects
docs = []
for sample in tqdm(dataset['test']):
    docs.append(nlp(sample["title"]+". "+sample["abstract"]))

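To see what the extra `\w+-\w+` alternative contributes, the pattern can be tried in isolation with the standard `re` module. This is only a minimal sketch: `base_pattern` is a hypothetical stand-in for spaCy's default `token_match` regex (the real one is more elaborate), and `matches_as_single_token` is an illustrative helper that is not part of spaCy or pke.

```python
import re

# Hypothetical stand-in for the pattern returned by
# _get_regex_pattern(nlp.Defaults.token_match); the real one is more elaborate.
base_pattern = r"https?://\S+"

# Same construction as in the preprocessing cell: keep the default
# alternatives and add one more for in-word hyphens.
hyphen_aware = re.compile(rf"({base_pattern}|\w+-\w+)")


def matches_as_single_token(text):
    """Return True if the combined pattern matches at the start of `text`."""
    return hyphen_aware.match(text) is not None


for text in ["non-linear", "high-fidelity", "kidney", "-linear"]:
    print(f"{text!r}: {matches_as_single_token(text)}")
```

In this sketch only the two hyphenated words match, so only they would be candidates for being kept whole; plain words such as `kidney` fall through to the tokenizer's usual rules.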

### step-2: run the desired models on the dataset and store extracted keyphrases

In [4]:
from pke.unsupervised import FirstPhrases, TopicRank, PositionRank, MultipartiteRank

outputs = {}
for model in [FirstPhrases, TopicRank, PositionRank, MultipartiteRank]:
    outputs[model.__name__] = []
    
    extractor = model()
    for i, doc in enumerate(tqdm(docs)):
        extractor.load_document(input=doc, language='en')
        extractor.grammar_selection(grammar="NP: {<ADJ>*<NOUN|PROPN>+}")
        extractor.candidate_weighting()
        outputs[model.__name__].append([u for u,v in extractor.get_n_best(n=5, stemming=True)])

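The grammar `NP: {<ADJ>*<NOUN|PROPN>+}` keeps sequences of zero or more adjectives followed by one or more nouns or proper nouns. The sketch below mimics that selection rule in plain Python over hand-tagged tokens. It illustrates the pattern only and is not pke's implementation; the function name `noun_phrase_candidates`, the example phrase, and its part-of-speech tags are all assumptions made for the illustration.

```python
def noun_phrase_candidates(tagged_tokens):
    """Return maximal spans matching ADJ* (NOUN|PROPN)+.

    `tagged_tokens` is a list of (token, pos_tag) pairs.
    """
    candidates = []
    i, n = 0, len(tagged_tokens)
    while i < n:
        # consume zero or more adjectives
        j = i
        while j < n and tagged_tokens[j][1] == "ADJ":
            j += 1
        # consume one or more nouns or proper nouns
        k = j
        while k < n and tagged_tokens[k][1] in ("NOUN", "PROPN"):
            k += 1
        if k > j:
            # at least one noun was found: keep the whole span
            candidates.append(" ".join(tok for tok, _ in tagged_tokens[i:k]))
            i = k
        else:
            # no noun follows: skip the adjectives (or the current token)
            i = max(j, i + 1)
    return candidates


# Hand-tagged toy input (tags assigned manually for this example).
tagged = [
    ("efficient", "ADJ"), ("non-linear", "ADJ"), ("solver", "NOUN"),
    ("for", "ADP"),
    ("sparse", "ADJ"), ("linear", "ADJ"), ("systems", "NOUN"),
]
print(noun_phrase_candidates(tagged))
```

Adjectives that are not followed by a noun are dropped, which is why a candidate always ends in a noun or proper noun under this rule.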

### step-3: evaluate the performance of each model

In [5]:
import numpy as np

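def evaluate(top_N_keyphrases, references):
    # number of extracted keyphrases that exactly match a reference
    matches = len(set(top_N_keyphrases) & set(references))
    # guard against empty predictions or references (division by zero)
    P = matches / len(top_N_keyphrases) if top_N_keyphrases else 0
    R = matches / len(references) if references else 0
    F = (2*P*R)/(P+R) if (P+R) > 0 else 0
    return (P, R, F)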

# loop through the models
for model in outputs:
    
    # compute the P, R, F scores for the model
    scores = []
    for i, output in enumerate(tqdm(outputs[model])):
        references = dataset['test'][i]["uncontr_stems"]
        scores.append(evaluate(output, references))
    
    # compute the average scores
    avg_scores = np.mean(scores, axis=0)
    
    # print out the performance of the model
    print("Model: {} P@5: {:.3f} R@5: {:.3f} F@5: {:.3f}".format(model, avg_scores[0], avg_scores[1], avg_scores[2]))

Model: FirstPhrases P@5: 0.336 R@5: 0.204 F@5: 0.239
Model: TopicRank P@5: 0.345 R@5: 0.207 F@5: 0.243
Model: PositionRank P@5: 0.386 R@5: 0.239 F@5: 0.277
Model: MultipartiteRank P@5: 0.354 R@5: 0.211 F@5: 0.249
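To make the scores concrete, here is a small worked example of the exact-match computation on a single toy document. The function restates the logic of `evaluate` so the snippet runs on its own; the `predicted` and `references` lists are made up for illustration and do not come from Inspec.

```python
def evaluate(top_N_keyphrases, references):
    """Exact-match precision, recall and F1 for one document."""
    matches = len(set(top_N_keyphrases) & set(references))
    P = matches / len(top_N_keyphrases)
    R = matches / len(references)
    F = (2 * P * R) / (P + R) if (P + R) > 0 else 0
    return (P, R, F)


# Toy data (hypothetical): 5 predictions, 4 references, 2 exact matches.
predicted = ["graph model", "keyphrase", "rank", "topic", "document"]
references = ["keyphrase", "topic", "extraction", "benchmark"]

P, R, F = evaluate(predicted, references)
print("P@5: {:.3f} R@5: {:.3f} F@5: {:.3f}".format(P, R, F))
# P = 2/5, R = 2/4, F = 2*P*R / (P+R)
```

Two of the five predictions are correct, so precision is 0.4; two of the four references are recovered, so recall is 0.5. The per-model figures reported above are the means of these per-document values over the test split.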
