
In [1]:
!pip install git+https://github.com/boudinfl/pke.git
!pip install matplotlib
!python -m spacy download en_core_web_sm


# Hands-on session with pke - part 1

This notebook gives a brief introduction to keyphrase extraction using `pke`, an open-source Python-based keyphrase extraction toolkit. `pke` provides an end-to-end keyphrase extraction pipeline in which each component can be easily modified or extended to develop new models.

The overall architecture of `pke` is depicted in the Figure below.
Extracting keyphrases from an input document involves three stages.
First, **keyphrase candidates** (i.e. words and phrases that are eligible to be keyphrases) are selected from the content of the document; this populates the `self.candidates` dictionary. Second, **candidates are either ranked** using a candidate weighting function (unsupervised approaches) **or classified as keyphrases or not** using a set of extracted features (supervised approaches); this populates the `self.weights` dictionary. Third, the top-N highest-weighted candidates, or those classified as keyphrases with the highest confidence scores, are selected as keyphrases.

![pke_architecture.png](attachment:pke_architecture.png)

`pke` provides a standardized API for extracting keyphrases from a document:

```python
import pke

extractor = pke.unsupervised.TfIdf()                # initialize a keyphrase extraction model, here TFxIDF
extractor.load_document(input='text')               # load the content of the document  (str or spacy Doc)
extractor.candidate_selection()                     # identify keyphrase candidates
extractor.candidate_weighting()                     # weight keyphrase candidates
keyphrases = extractor.get_n_best(n=10)             # select the 10-best candidates as keyphrases
```
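
To make the three stages concrete, here is a minimal toy sketch that mirrors the shape of the API above without using `pke` itself. The `ToyExtractor` class, its whitespace tokenization, and its frequency-based weighting are all assumptions made for illustration; they are not how `pke` models work internally.

```python
class ToyExtractor:
    """Toy three-stage extractor (hypothetical, not part of pke)."""

    def __init__(self):
        self.tokens = []
        self.candidates = {}   # candidate -> list of token offsets
        self.weights = {}      # candidate -> score

    def load_document(self, text):
        # naive whitespace tokenization, lowercased, trailing punctuation stripped
        self.tokens = [t.strip(".,").lower() for t in text.split()]

    def candidate_selection(self, min_length=4):
        # stage 1: every sufficiently long token is a candidate
        for offset, token in enumerate(self.tokens):
            if len(token) >= min_length:
                self.candidates.setdefault(token, []).append(offset)

    def candidate_weighting(self):
        # stage 2: weight each candidate by its number of occurrences
        for candidate, offsets in self.candidates.items():
            self.weights[candidate] = len(offsets)

    def get_n_best(self, n=10):
        # stage 3: keep the n highest-weighted candidates
        ranked = sorted(self.weights.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:n]


toy = ToyExtractor()
toy.load_document("The client called. The client is happy with the mortgage.")
toy.candidate_selection()
toy.candidate_weighting()
print(toy.get_n_best(n=1))
```

Real models differ mainly in stages 1 and 2: which spans count as candidates, and how they are scored.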

## Graph-based keyphrase extraction with TopicRank

[TopicRank (Bougouin et al., 2013)](https://aclanthology.org/I13-1062/) is an unsupervised graph-based ranking model for keyphrase extraction that is often used as a baseline by the research community.
TopicRank relies on a graph-based topical representation of the input document, and uses a random walk algorithm derived from PageRank to estimate the importance of each topic (node).
The most representative phrase candidates belonging to the highest-scored topics are then selected as keyphrases.

This notebook presents an end-to-end example of keyphrase extraction using TopicRank implemented in `pke`.

### step-1: let's start by importing `pke` and initializing a `TopicRank` model

In [2]:
import pke

# initialize a TopicRank keyphrase extraction model
extractor = pke.unsupervised.TopicRank()

### step-2: what we need now is a sample document

In [23]:
# sample document (a short free-text comment about a HELOC loan)
sample = """UWM made this HELOC easy from import to CTC. A slight hiccup when UW missed that the client was allowed to have 1x30 days late on his mortgage as long as it wasn't within the last 6 months. Declined the approval at first but then got back on course with a quick re-look at the guidelines. Smooth sailing after that and client is happy.""".replace("\n", " ")

### step-3: we can load the sample document using the pke model

When raw text is given to a `pke` model, `spacy`/`nltk` is used to pre-process the text (sentence splitting, tokenization, Part-of-Speech tagging, stemming).

In [24]:
# load the document using the initialized model
# text preprocessing is carried out using spacy
extractor.load_document(input=sample, language='en')

In [25]:
# loading a document populates the extractor.sentences list
# let's have a look at the pre-processed text

# for each sentence in the document
for i, sentence in enumerate(extractor.sentences):

    # print out the sentence id, its tokens, its stems and the corresponding Part-of-Speech tags
    print("sentence {}:".format(i))
    print(" - words: {} ...".format(' '.join(sentence.words[:5])))
    print(" - stems: {} ...".format(' '.join(sentence.stems[:5])))
    print(" - PoS: {} ...".format(' '.join(sentence.pos[:5])))

sentence 0:
 - words: UWM made this HELOC easy ...
 - stems: uwm made thi heloc easi ...
 - PoS: PROPN VERB DET PROPN ADJ ...
sentence 1:
 - words: A slight hiccup when UW ...
 - stems: a slight hiccup when uw ...
 - PoS: DET ADJ NOUN SCONJ PROPN ...
sentence 2:
 - words: Declined the approval at first ...
 - stems: declin the approv at first ...
 - PoS: VERB DET NOUN ADP ADV ...
sentence 3:
 - words: Smooth sailing after that and ...
 - stems: smooth sail after that and ...
 - PoS: ADJ NOUN ADP DET CCONJ ...


### step-4: identifying keyphrase candidates

In [26]:
# identify the keyphrase candidates using TopicRank's default strategy
# i.e. the longest sequences of nouns, proper nouns and adjectives `(NOUN|PROPN|ADJ)+`
extractor.candidate_selection()
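
The "longest sequence" strategy can be sketched in a few lines of plain Python. This is only an illustration, not `pke`'s implementation: the first five PoS tags of sentence 0 come from the pre-processing output, while the tags for the remaining tokens and the helper name `longest_sequences` are assumptions.

```python
from itertools import groupby

# (word, PoS) pairs for sentence 0; the first five tags come from the
# pre-processing output, the remaining ones are assumed for illustration
tagged = [
    ("UWM", "PROPN"), ("made", "VERB"), ("this", "DET"),
    ("HELOC", "PROPN"), ("easy", "ADJ"), ("from", "ADP"),
    ("import", "NOUN"), ("to", "ADP"), ("CTC", "PROPN"), (".", "PUNCT"),
]

VALID_POS = {"NOUN", "PROPN", "ADJ"}


def longest_sequences(tagged_tokens, valid_pos=VALID_POS):
    """Return (offset, phrase) for each maximal run of tokens with a valid PoS."""
    sequences = []
    offset = 0
    # groupby splits the sentence into alternating valid / invalid runs
    for is_valid, group in groupby(tagged_tokens, key=lambda wp: wp[1] in valid_pos):
        words = [word for word, _ in group]
        if is_valid:
            sequences.append((offset, " ".join(words)))
        offset += len(words)
    return sequences


for offset, phrase in longest_sequences(tagged):
    print(offset, phrase)
```

Under these assumed tags, the sketch yields `UWM`, `HELOC easy`, `import` and `CTC` at offsets 0, 3, 6 and 8, which matches the candidates `pke` finds in sentence 0.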

In [27]:
# identifying keyphrase candidates populates the extractor.candidates dictionary
# let's have a look at the keyphrase candidates

# for each keyphrase candidate
for i, candidate in enumerate(extractor.candidates):

    # print out the candidate id, its stemmed form
    print("candidate {}: {} (stemmed form)".format(i, candidate))

    # print out the surface forms of the candidate
    print(" - surface forms:", [ " ".join(u) for u in extractor.candidates[candidate].surface_forms])

    # print out the corresponding offsets
    print(" - offsets:", extractor.candidates[candidate].offsets)

    # print out the corresponding sentence ids
    print(" - sentence_ids:", extractor.candidates[candidate].sentence_ids)

    # print out the corresponding PoS patterns
    print(" - pos_patterns:", extractor.candidates[candidate].pos_patterns)

candidate 0: uwm (stemmed form)
 - surface forms: ['UWM']
 - offsets: [0]
 - sentence_ids: [0]
 - pos_patterns: [['PROPN']]
candidate 1: heloc easi (stemmed form)
 - surface forms: ['HELOC easy']
 - offsets: [3]
 - sentence_ids: [0]
 - pos_patterns: [['PROPN', 'ADJ']]
candidate 2: import (stemmed form)
 - surface forms: ['import']
 - offsets: [6]
 - sentence_ids: [0]
 - pos_patterns: [['NOUN']]
candidate 3: ctc (stemmed form)
 - surface forms: ['CTC']
 - offsets: [8]
 - sentence_ids: [0]
 - pos_patterns: [['PROPN']]
candidate 4: slight hiccup (stemmed form)
 - surface forms: ['slight hiccup']
 - offsets: [11]
 - sentence_ids: [1]
 - pos_patterns: [['ADJ', 'NOUN']]
candidate 5: client (stemmed form)
 - surface forms: ['client', 'client']
 - offsets: [18, 65]
 - sentence_ids: [1, 3]
 - pos_patterns: [['NOUN'], ['NOUN']]
candidate 6: day (stemmed form)
 - surface forms: ['days']
 - offsets: [24]
 - sentence_ids: [1]
 - pos_patterns: [['NOUN']]
candidate 7: mortgag (stemmed form)
 - surfac

### step-5: ranking keyphrase candidates

In [28]:
# In TopicRank, candidate weighting is a three-step process:
#  1. candidate clustering (grouping keyphrase candidates into topics)
#  2. graph construction (building a complete, weighted graph of topics)
#  3. rank topics (nodes) using a random walk algorithm
extractor.candidate_weighting()

In [29]:
# let's have a look at the topics

# for each topic of the document
for i, topic in enumerate(extractor.topics):

    # print out the topic id and the candidates it groups together
    print("topic {}: {} ".format(i, ';'.join(topic)))

topic 0: approv 
topic 1: client 
topic 2: cours 
topic 3: ctc 
topic 4: day 
topic 5: guidelin 
topic 6: happi 
topic 7: heloc easi 
topic 8: import 
topic 9: month 
topic 10: mortgag 
topic 11: quick re-look 
topic 12: slight hiccup 
topic 13: smooth sail 
topic 14: uwm 
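
Every topic above contains a single candidate. A plausible explanation is that no two candidates in this short sample share a stem. The sketch below assumes a stem-overlap (Jaccard) distance, the kind of measure the TopicRank paper describes for clustering; the function name and the `mortgag rate` candidate are hypothetical, and `pke`'s actual clustering code may differ in its details.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| computed over the stems of two candidates."""
    set_a, set_b = set(a.split()), set(b.split())
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


# a few stemmed candidates taken from the topics listed above
candidates = ["heloc easi", "slight hiccup", "smooth sail", "client", "mortgag"]

# no pair shares a stem, so every pairwise distance is maximal (1.0)
distances = [jaccard_distance(a, b) for a, b in combinations(candidates, 2)]
print(set(distances))

# a hypothetical candidate that does overlap with "mortgag"
print(jaccard_distance("mortgag rate", "mortgag"))
```

With all distances at their maximum, a clustering step that merges only sufficiently similar candidates leaves each candidate in its own topic.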


In [30]:
# let's have a look at the graph-based representation of the document
#
# here, nodes are topics, edges between topics are weighted according to
# the strength of their semantic relation measured by the reciprocal distances
# between the offset positions of the candidate keyphrases

import networkx as nx
import matplotlib.pyplot as plt
%matplotlib notebook

# set the labels as list of candidates for each topic
labels = {i: ';'.join(topic) for i, topic in enumerate(extractor.topics)}

# set the weights of the edges
edge_weights = [extractor.graph[u][v]['weight'] for u,v in extractor.graph.edges()]

# set the weights of the nodes (topic weights are stored in _w attribute)
sizes = [10e3*extractor._w[i] for i, topic in enumerate(extractor.topics)]

# draw the graph
nx.draw_shell(extractor.graph, with_labels=True, labels=labels, width=edge_weights, node_size=sizes)
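
The edge weighting described in the comments of this cell can be illustrated with the offsets printed in step-4. This is a simplified sketch that sums plain reciprocal distances between occurrences; the helper `edge_weight` is hypothetical, and `pke`'s own implementation may additionally adjust the distance for multi-word candidates.

```python
from itertools import product

# offsets copied from the candidate listing in step-4
offsets = {
    "uwm": [0],
    "slight hiccup": [11],
    "client": [18, 65],
}


def edge_weight(offsets_a, offsets_b):
    """Sum of reciprocal distances between every pair of occurrences."""
    return sum(1.0 / abs(p_a - p_b) for p_a, p_b in product(offsets_a, offsets_b))


# client / slight hiccup: 1/|18-11| + 1/|65-11|
near = edge_weight(offsets["client"], offsets["slight hiccup"])
# client / uwm: 1/|18-0| + 1/|65-0|
far = edge_weight(offsets["client"], offsets["uwm"])
print(near, far)
```

Topics whose candidates occur close together get a heavier edge, which is why `client` is linked more strongly to `slight hiccup` than to `uwm` in this sketch.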

In [31]:
# let's have a look at the weights/ranks of the topics

# In TopicRank, weights are computed for each topic, and only one
# representative candidate per topic (by default the first occurring
# one) is kept

# for each representative candidate
for candidate, weight in extractor.weights.items():

    # print out the candidate (in stemmed form) and its weight
    print('{}: {}'.format(candidate, weight))

approv: 0.052424241898000254
client: 0.10624159911651665
cours: 0.05932677236892589
ctc: 0.07716255540944406
day: 0.05326696154118079
guidelin: 0.07498721087272281
happi: 0.057961774765564376
heloc easi: 0.07543035859990932
import: 0.08230950371129672
month: 0.05065246661463455
mortgag: 0.05076752060310962
quick re-look: 0.07032353593918264
slight hiccup: 0.06362960165940743
smooth sail: 0.07371441014320135
uwm: 0.05180148675690363
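
The topic weights above sum to approximately 1, as expected from a PageRank-style random walk. The following sketch shows one way such scores can be computed by power iteration on a weighted undirected graph. The three-topic graph and its edge weights are made up for illustration, and the damping factor of 0.85 is an assumed (commonly used) value rather than something taken from this notebook.

```python
def weighted_pagerank(graph, damping=0.85, iterations=100):
    """graph: {node: {neighbour: weight}} describing an undirected weighted graph."""
    nodes = list(graph)
    scores = {node: 1.0 / len(nodes) for node in nodes}
    # total edge weight attached to each node, used to normalise transitions
    strength = {node: sum(graph[node].values()) for node in nodes}
    for _ in range(iterations):
        new_scores = {}
        for node in nodes:
            # score flowing in from each neighbour, proportional to edge weight
            incoming = sum(
                scores[other] * weight / strength[other]
                for other, weight in graph[node].items()
            )
            new_scores[node] = (1.0 - damping) / len(nodes) + damping * incoming
        scores = new_scores
    return scores


# hypothetical three-topic graph; the weights are invented for this example
toy_graph = {
    "client": {"mortgag": 0.5, "uwm": 0.3},
    "mortgag": {"client": 0.5, "uwm": 0.1},
    "uwm": {"client": 0.3, "mortgag": 0.1},
}
scores = weighted_pagerank(toy_graph)
for topic, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(topic, round(score, 3))
```

In this toy graph the most strongly connected topic (`client`) ends up with the highest score, which mirrors the role `client` plays in the real output.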


### step-6: selecting the N-best candidates as keyphrases

In [33]:
# Get the N-best candidates (here, 10) as keyphrases
keyphrases = extractor.get_n_best(n=10, stemming=False)

# for each of the best candidates
for i, (candidate, score) in enumerate(keyphrases):

    # print out its rank, phrase and score
    print("rank {}: {} ({})".format(i, candidate, score))

rank 0: client (0.10624159911651665)
rank 1: import (0.08230950371129672)
rank 2: ctc (0.07716255540944406)
rank 3: heloc easy (0.07543035859990932)
rank 4: guidelines (0.07498721087272281)
rank 5: smooth sailing (0.07371441014320135)
rank 6: quick re-look (0.07032353593918264)
rank 7: slight hiccup (0.06362960165940743)
rank 8: course (0.05932677236892589)
rank 9: happy (0.057961774765564376)
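
At its core, N-best selection is a sort of the weights dictionary. The sketch below uses a handful of stemmed candidates and weights copied from the step-5 output; the helper `n_best` is hypothetical, and it leaves out what `get_n_best` does on top of sorting, such as mapping stems back to surface forms when `stemming=False`.

```python
# a few stemmed candidates and their weights, copied from the step-5 output
weights = {
    "approv": 0.052424241898000254,
    "client": 0.10624159911651665,
    "ctc": 0.07716255540944406,
    "import": 0.08230950371129672,
    "uwm": 0.05180148675690363,
}


def n_best(weights, n):
    """Return the n highest-weighted (candidate, weight) pairs."""
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)[:n]


for rank, (candidate, score) in enumerate(n_best(weights, n=3)):
    print("rank {}: {} ({})".format(rank, candidate, score))
```

The first three entries (`client`, `import`, `ctc`) agree with ranks 0 to 2 in the output above.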


## Conclusion

Now that we are familiar with the three-stage process involved in keyphrase extraction (candidate selection, candidate ranking, N-best selection), as well as with the `pke` API, we are ready for part-2, in which we experiment with different models and parameters and see how to evaluate the quality of the produced keyphrases.

# KeyBERT

In [34]:
pip install keybert


In [36]:
from keybert import KeyBERT

# Init KeyBERT
kw_model = KeyBERT()
# pass the whole document: `sample[0]` is only its first character ("U"),
# for which KeyBERT finds no candidates and returns an empty list
kw_model.extract_keywords(sample, stop_words=None)

In [39]:
# Extract the top 5 keyphrases, using word n-grams of length 1 to 3 as candidates
kw_model.extract_keywords(docs=sample, stop_words=None, keyphrase_ngram_range=(1, 3), top_n=5)


[('mortgage as long', 0.4596),
 ('heloc easy from', 0.4226),
 ('heloc', 0.4217),
 ('heloc easy', 0.4079),
 ('mortgage', 0.4016)]
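
Results such as `heloc easy from` and `mortgage as long` are not noun phrases because, with `keyphrase_ngram_range=(1, 3)` and `stop_words=None`, every word n-gram of length 1 to 3 is a candidate. The sketch below illustrates that candidate space with a simplified whitespace tokenizer; the helper `word_ngrams` is hypothetical, and KeyBERT's actual tokenization (a scikit-learn `CountVectorizer` by default) differs in its details.

```python
def word_ngrams(text, ngram_range=(1, 3)):
    """Return all lowercased word n-grams whose length is within ngram_range."""
    tokens = text.lower().split()
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # slide a window of n tokens over the text
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


grams = word_ngrams("UWM made this HELOC easy from import to CTC")
heloc_grams = [g for g in grams if g.startswith("heloc")]
print(heloc_grams)
```

Each candidate is then scored against the document, so overlapping n-grams built around the same word (here `heloc`) can all appear in the top results.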

In [42]:
pip install keyphrase_vectorizers


In [43]:
from keyphrase_vectorizers import KeyphraseCountVectorizer
kw_model.extract_keywords(docs=sample, vectorizer=KeyphraseCountVectorizer())


[('heloc', 0.4217),
 ('mortgage', 0.4016),
 ('uwm', 0.3886),
 ('uw', 0.3168),
 ('ctc', 0.2754)]