# Keyword and keyphrase extraction with `keybert` and `keyphrase-vectorizers`
### The _PatternRank_ approach
###### [Schopf _et al._, 2022](https://arxiv.org/pdf/2210.05245.pdf)
---

# 1️⃣ `keybert`
* _cf._ [Grootendorst (2020)](https://doi.org/10.5281/zenodo.4461265)
* Python library that extracts the keywords/keyphrases most similar to a document, using BERT embeddings<br>
⚠️ the length of the n-grams to extract must be specified, even though the optimal length is not known in advance;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`keyphrase_ngram_range=(1, 3)`: extract three kinds of sequences: uni-, bi-, and trigrams<br>
⚠️ the grammaticality of the phrases is not taken into account (e.g. « scientifique les planches »)
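
To see why fixed-length n-grams ignore grammaticality, the sketch below imitates in plain Python how a count-based vectorizer enumerates candidates: tokenize, drop stop words, then slide windows of 1 to 3 tokens over what remains. The function `ngram_candidates`, the sample sentence, and the two-word stop list are illustrative assumptions; this is not the code of `keybert` or scikit-learn.

```python
import re

def ngram_candidates(text, ngram_range=(1, 3), stop_words=()):
    """Enumerate word n-grams the way a simple count vectorizer would (simplified)."""
    # Lowercase and keep tokens of at least two word characters
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    # Stop words are dropped BEFORE the n-grams are built
    tokens = [t for t in tokens if t not in stop_words]
    shortest, longest = ngram_range
    candidates = []
    for n in range(shortest, longest + 1):
        for i in range(len(tokens) - n + 1):
            candidates.append(" ".join(tokens[i:i + n]))
    return candidates

# Hypothetical sentence and stop list, chosen only for illustration
sentence = "La valeur scientifique des planches"
candidates = ngram_candidates(sentence, (1, 3), stop_words={"la", "des"})
print(candidates)
```

Because stop words are removed before the windows are built, the trigram `valeur scientifique planches` joins words that were not adjacent in the sentence, and no check is made that the result is a well-formed phrase.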

**_Maximal Marginal Relevance_**

* To diversify the extracted keywords/keyphrases, one can use _Maximal Marginal Relevance_ (_MMR_), which is also based on cosine similarity:
 * `use_mmr=True, diversity=[0-1]` (the degree of diversity, between 0 and 1)
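
The selection rule behind MMR can be sketched in a few lines. This is a simplified illustration on hand-made 2-D vectors, not KeyBERT's implementation: the helper names `cosine` and `mmr`, the toy vectors, and the keyword labels are all assumptions. At each step the candidate kept is the one maximising `(1 - diversity) * similarity to the document - diversity * similarity to the keywords already selected`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mmr(doc_vec, cand_vecs, top_n, diversity):
    """Greedy Maximal Marginal Relevance over a dict {keyword: vector}."""
    doc_sim = {w: cosine(v, doc_vec) for w, v in cand_vecs.items()}
    # Start with the candidate closest to the document
    selected = [max(doc_sim, key=doc_sim.get)]
    while len(selected) < min(top_n, len(cand_vecs)):
        remaining = [w for w in cand_vecs if w not in selected]

        def score(w):
            # Penalise candidates that resemble an already selected keyword
            redundancy = max(cosine(cand_vecs[w], cand_vecs[s]) for s in selected)
            return (1 - diversity) * doc_sim[w] - diversity * redundancy

        selected.append(max(remaining, key=score))
    return selected

# Toy embeddings (assumed values): two near-duplicates and one distinct topic
document = (1.0, 0.0)
candidates = {
    "sclérose": (1.0, 0.1),
    "sclérose latérale": (1.0, 0.2),
    "foie": (0.6, 0.8),
}
no_diversity = mmr(document, candidates, top_n=2, diversity=0.0)
high_diversity = mmr(document, candidates, top_n=2, diversity=0.7)
print(no_diversity, high_diversity)
```

With `diversity=0.0` the two near-duplicate phrases are returned; with `diversity=0.7` the second slot goes to the less similar but more distinct candidate.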



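**Stop words**

Stop-word lists come from the vectorizer used with KeyBERT, not from KeyBERT itself.

* `stop_words=None`: no stop-word list is applied
* `stop_words='french'`: a French stop-word list is applied. This string form is accepted by `KeyphraseCountVectorizer`, which relies on NLTK's lists; scikit-learn's `CountVectorizer` only ships a built-in list for `'english'`, so a French list has to be passed to it explicitly as a Python list.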

In [None]:
!pip install keybert
!pip install nltk
!pip install spacy
import torch
import os
from sentence_transformers import SentenceTransformer
from keybert import KeyBERT
from sklearn.feature_extraction.text import CountVectorizer
import nltk
import spacy  # required for spacy.load() below

from google.colab import drive
# Mount Google Drive
drive.mount('/content/drive')

# Initialize the Sentence Transformer Model
sentence_model = SentenceTransformer("distiluse-base-multilingual-cased-v1")
kw_model = KeyBERT(model=sentence_model)

# Set up French stop words (two options)
## Option 1: spaCy

# Download and load the spaCy French model
!python -m spacy download fr_core_news_lg
nlp = spacy.load('fr_core_news_lg')
# Convert spaCy's set of stop words to a list
french_stop_words = list(nlp.Defaults.stop_words)


## Option 2: NLTK (uncomment to use instead of spaCy)
# nltk.download('stopwords')
# from nltk.corpus import stopwords
# french_stop_words = stopwords.words('french')

# Initialize CountVectorizer with French stop words
vectorizer = CountVectorizer(ngram_range=(1, 3), stop_words=french_stop_words)

# Assuming Google Drive is mounted and paths are correctly set up
path = '/content/drive/MyDrive/ObTIC/ateliers/extraction_mots_cles/corpus/'
file_name = 'echantillon_charcot.txt'
file_path = '../output/test_keybert_sorted.txt'

# Function to sort keywords
def sort_keywords_by_score(keywords):
    # Sort keywords based on the score in descending order
    return sorted(keywords, key=lambda x: x[1], reverse=True)

# List to store all keywords
all_keywords = []

# Read the file
with open(os.path.join(path, file_name), 'r', encoding='utf-8') as myfile:
    raw_data = myfile.readlines()

# Extract keywords chunk by chunk (200 lines each),
# including the final chunk even when it is shorter than 200 lines
chunk_size = 200
for start in range(0, len(raw_data), chunk_size):
    data = " ".join(raw_data[start:start + chunk_size])
    keywords = kw_model.extract_keywords(data, vectorizer=vectorizer, use_mmr=True, diversity=0.7)
    all_keywords.extend(keywords)

# Sort all keywords once after extraction
sorted_keywords = sort_keywords_by_score(all_keywords)

# Write sorted keywords to the output file
with open(os.path.join(path, file_path), 'w', encoding='utf-8') as outfile:
    for keyword, score in sorted_keywords:
        print(f"{keyword}: {score}")
        outfile.write(f"{keyword}: {score}\n")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


cellule corne antérieure: 0.5327
postérieure cordon postérieur: 0.5078
valeur scientifique planches: 0.4976
cervico dorsale: 0.465
région cervicale figure: 0.4572
cirrhose cancer primitif: 0.4355
altération cellules ganglionnaires: 0.4032
anatomie pathologique moëlle: 0.3931
lcucocythcs substance granuleuse: 0.3474
complètement détruite: 0.334
disparition tissu fibrillaire: 0.3313
reproduction aide photogravure: 0.3177
épinière 45 planches: 0.2559
pyramidaux région: 0.2419
passage cordon latéral: 0.2348
précédente montrer détail: 0.2306
paul auteur date: 0.2187
préparation précédente vu: 0.2106
colorée méthode: 0.1828
38: 0.1808
ii canal central: 0.1677
texte explicatif seulement: 0.1456
deux foyers: 0.1446
préparation montre point: 0.141
zones analogues celles: 0.1364
ii parties indemnes: 0.1195
particulièrement intenses régions: 0.1101
droit figure 14: 0.0942
figure 29: 0.0911
supérieure colorée méthode: 0.0847
moyenne colorée méthode: 0.0794
plus fort: 0.

# 2️⃣ PatternRank
* `keybert` + **`keyphrase-vectorizers`** = PatternRank<br>
❇️ no need to specify the length of the n-grams to extract, since the library infers it itself<br>
❇️ the grammaticality of the phrases is taken into account, because candidates are extracted with part-of-speech patterns (e.g. `<N.*>*<ADJ.*>*<ADJ.*>+` --> _sclérose latérale amyotrophique_)
* _cf._ [Schopf _et al._ (2022)](https://arxiv.org/pdf/2210.05245.pdf) and [Schopf (2022)](https://towardsdatascience.com/enhancing-keybert-keyword-extraction-results-with-keyphrasevectorizers-3796fa93f4db)
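
A rough way to picture part-of-speech filtering is to match a pattern over the sequence of tags rather than over the words. The sketch below is a simplified imitation, not the `keyphrase-vectorizers` implementation: the tags are assigned by hand (in practice they would come from the spaCy pipeline), the function name `extract_noun_phrases` is hypothetical, and the pattern is `<N.*>+<ADJ.*>*`, the one passed to `KeyphraseCountVectorizer` in this notebook.

```python
import re

def extract_noun_phrases(tagged_tokens):
    """Return phrases made of one or more N* tags followed by zero or more ADJ* tags."""
    # Encode each token's tag as one letter so a plain regex can scan the sequence:
    # N for tags matching <N.*>, A for tags matching <ADJ.*>, O for everything else
    codes = "".join(
        "N" if tag.startswith("N") else "A" if tag.startswith("ADJ") else "O"
        for _, tag in tagged_tokens
    )
    phrases = []
    for match in re.finditer(r"N+A*", codes):
        words = [token for token, _ in tagged_tokens[match.start():match.end()]]
        phrases.append(" ".join(words))
    return phrases

# Hand-tagged example sentence (tags are assumptions, not spaCy output)
tagged = [
    ("la", "DET"), ("sclérose", "NOUN"), ("latérale", "ADJ"),
    ("amyotrophique", "ADJ"), ("détruit", "VERB"), ("les", "DET"),
    ("cellules", "NOUN"), ("ganglionnaires", "ADJ"),
]
phrases = extract_noun_phrases(tagged)
print(phrases)
```

Candidate length is no longer fixed: each phrase is as long as the tag sequence that matches the pattern, and determiners or verbs never end up inside a candidate.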



In [3]:
!pip install keyphrase-vectorizers
!pip install flair
from keyphrase_vectorizers import KeyphraseCountVectorizer
from keybert import KeyBERT
from flair.embeddings import TransformerDocumentEmbeddings
import os
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Set paths
path = '/content/drive/MyDrive/ObTIC/ateliers/extraction_mots_cles/corpus/'
input_file_name = 'echantillon_charcot.txt'
output_file_name = '../output/output_pattern_rank_charcot_sorted_20.txt'

# Initialize KeyBERT with a multilingual model
kw_model = KeyBERT(model=TransformerDocumentEmbeddings('bert-base-multilingual-cased'))

# Setup vectorizer with specific language and pattern settings
vectorizer = KeyphraseCountVectorizer(spacy_pipeline='fr_core_news_lg', pos_pattern='<N.*>+<ADJ.*>*', stop_words='french')

# Function to sort keywords by score
def sort_keywords_by_score(keywords):
    return sorted(keywords, key=lambda x: x[1], reverse=True)

# List to store all keywords
all_keywords = []

# Read the file
with open(os.path.join(path, input_file_name), 'r', encoding='utf-8') as input_file:
    raw_data = input_file.readlines()

# Process the text in chunks of 20 lines
chunk_size = 20
for start in range(0, len(raw_data), chunk_size):
    data = " ".join(raw_data[start:start + chunk_size]).replace('\n', ' ')
    try:
        # Extract keyphrases
        kp = kw_model.extract_keywords(data, vectorizer=vectorizer)
        all_keywords.extend(kp)
    except ValueError as e:
        # `start` is the first line of the chunk that failed
        print(f"An error occurred while processing the chunk starting at line {start}: {e}")

# Sort all keywords at the end
sorted_keywords = sort_keywords_by_score(all_keywords)

# Write sorted keywords to output file
with open(os.path.join(path, output_file_name), 'w', encoding='utf-8') as output_file:
    for keyword, score in sorted_keywords:
        print(f"{keyword}: {score}")
        output_file.write(f"{keyword}: {score}\n")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
tissu gingival: 0.9184
érysipèle périodique annuel: 0.9109
travaux récents: 0.91
faisceau pyramidal direct: 0.9064
foie: 0.9052
syringomyélie: 0.9048
foie: 0.9036
rétractions fibro: 0.9008
goitre exophtalmique: 0.9008
fibrome: 0.8995
faisceau pyramidal: 0.899
mercredi médical: 0.8975
syringomyélie: 0.8972
enveloppe souple imperméable: 0.8933
anatomie morbide: 0.8932
dégénéralion secondaire: 0.8931
sclérose: 0.8929
tissu fibrillaire: 0.8925
kyste crico: 0.8889
sclérose: 0.8882
sclérose: 0.8873
sclérose latérale amyotrophique: 0.8873
onanoit: 0.8865
netteté: 0.8852
tissu gliomateux: 0.8845
sclérose: 0.8836
sclérose: 0.8832
dégénération consécutive: 0.8829
tissu fibrillaire: 0.8829
faisceau pyramidal sclérosé: 0.8812
professeur stiiaus: 0.8804
cavité pathologique: 0.8791
sclérose: 0.879
anatomie pathologique contemplative: 0.8784
scl√
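
The same keyphrase can appear several times in this listing because each 20-line chunk is scored separately. As an optional post-processing step, not part of the original notebook, one could keep only the best score per keyphrase. The function name `best_score_per_keyword` is hypothetical; the sample pairs are copied from the output above.

```python
def best_score_per_keyword(keywords):
    """Collapse repeated keyphrases, keeping the highest score seen for each."""
    best = {}
    for keyword, score in keywords:
        if keyword not in best or score > best[keyword]:
            best[keyword] = score
    # Highest score first, as in the rest of the notebook
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

# A few (keyphrase, score) pairs taken from the PatternRank output
sample = [
    ("foie", 0.9052),
    ("syringomyélie", 0.9048),
    ("foie", 0.9036),
    ("syringomyélie", 0.8972),
]
deduplicated = best_score_per_keyword(sample)
print(deduplicated)
```

Taking the maximum is only one possible choice; averaging the scores or counting occurrences would rank frequent keyphrases differently.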

# 📡 Finding shared keyphrases

In [35]:
# Mount Google Drive
# from google.colab import drive
# drive.mount('/content/drive')
import re
import numpy as np
pattern = re.compile(r":.*\n")
charcot_pr = "/content/drive/MyDrive/ObTIC/ateliers/extraction_mots_cles/output/output_pattern_rank_charcot_sorted_20.txt"
autres_pr = "/content/drive/MyDrive/ObTIC/ateliers/extraction_mots_cles/output/output_pattern_rank_autres_sorted_20.txt"



# Keep only the keyphrases, without their scores
with open(charcot_pr, 'r', encoding='utf-8') as input_file_charcot, open(autres_pr, 'r', encoding='utf-8') as input_file_autres:
    raw_data_charcot = input_file_charcot.readlines()
    raw_data_autres = input_file_autres.readlines()

# Strip everything from the first colon to the end of the line
res_charcot = [pattern.sub("", line) for line in raw_data_charcot]
res_autres = [pattern.sub("", line) for line in raw_data_autres]

# np.intersect1d returns the sorted, unique values present in both lists
common_elements = np.intersect1d(res_charcot, res_autres)
for c in common_elements.tolist():
    print(c)



cordons
deux moitiés
foie
hypnotisme
observation iv
planche ii
planche iv
planche ix
planche vii
planche viii
planche xi
planche xii
planche xiii
planche xiv
planche xvi
planche xvii
planche xxi
planche xxiii
région lombaire
