# Tutorial for term extraction
This tutorial describes the main steps and options for using the term extractors of the BioTermCategorizer library.

### Importing the library
To import the library and all its functions, run the following code:

In [1]:
import sys, os, re

#set the path to the library
general_path = os.getcwd().split("BioTermCategorizer")[0]+"BioTermCategorizer/"
sys.path.append(general_path+'biotermcategorizer/')
general_path

'/mnt/c/Users/Sergi/Documents/BioTermCategorizer/'
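The string split above assumes that the notebook runs somewhere inside a directory named `BioTermCategorizer`. A slightly more defensive variant is sketched here with `pathlib`: it walks up from a starting directory and fails loudly if no such folder is found. The helper name `find_project_root` is hypothetical and not part of the library; unlike the split, it returns the closest matching ancestor rather than the outermost one.

```python
import sys
from pathlib import Path


def find_project_root(start, project_name="BioTermCategorizer"):
    """Return `start` or its closest ancestor whose name is `project_name`."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if candidate.name == project_name:
            return candidate
    raise FileNotFoundError(f"No directory named {project_name!r} at or above {start}")


if __name__ == "__main__":
    try:
        root = find_project_root(Path.cwd())
    except FileNotFoundError as err:
        print(err)
    else:
        # Same effect as the sys.path.append call in the cell above.
        sys.path.append(str(root / "biotermcategorizer"))
        print(root)
```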

Additionally, the main class `TermExtractor` must be imported:

In [2]:
from TermExtractor import TermExtractor

[nltk_data] Downloading package stopwords to /home/sergi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/sergi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Using TermExtractor's basic functions
#### The TermExtractor object
An instance of the class `TermExtractor` must first be assigned to a variable, as shown below. The constructor accepts the following parameters:
- `extraction_methods` (list): List of keyword extraction methods. Use a list even if only one extraction method is provided. Default value: `["textrank"]`.
- `categorization_method` (str): Method for text categorization. Default value: `"setfit"`.
- `language` (str): Language for text processing. Only implemented in Spanish. Default value: `"spanish"`.
- `max_tokens` (int): Maximum number of tokens for keyword extraction. Default value: `3`.
- `pos` (bool): Whether to use Part of Speech sequences in KeyBert. Default value: `False`.
- `pos_pattern` (str): The Part of Speech regex pattern. Default value: `"<NOUN.*>*<ADP.*>*<NOUN.*>*<ADJ.*>*|<PROPN.*>+|<VERB.*>+"`.
- `join` (bool): Whether to join keywords from different methods and remove overlaps among them. When `join=True`, only a list of keywords is provided for all the selected extraction methods, as opposed to when `join=False`. Default value: `False`.
- `postprocess` (bool): Whether to postprocess the extracted terms to remove and modify the ones with little meaning. Default value: `True`.
- `n` (int): Maximum number of labels for a single sample. Default value: `1`.
- `thr_setfit` (float): Threshold for SetFit classification. Default value: `0.5`.
- `thr_transformers` (float): Threshold for Transformers classification. Default value: `-1`.
- `n_clusters` (int): Number of clusters for clustering. Default value: `None`.
- `categorizer_model_path` (str): Path to the categorizer model. Default value: `None`.
- `output_path` (str): Path to save the trained model. Default value: `"./trained_model"`.
- `clustering_model` (str): Clustering model for keyword clustering. Default value: `"cambridgeltl/SapBERT-from-PubMedBERT-fulltext"`.
- `classifier_model` (str): Classifier model for classification. Default value: `"/mnt/c/Users/Sergi/Desktop/BSC/spanish_sapbert_models/sapbert_15_noparents_1epoch"`.
- `**kwargs` (dict): Additional keyword arguments. Default value: `None`.
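Because `extraction_methods` expects a list even for a single method, a small guard can catch a bare string before it reaches the constructor. The sketch below is a hypothetical helper, not part of the library; the accepted names are the lowercase strings used in this tutorial's examples, and accepting other capitalizations is an assumption of the sketch.

```python
KNOWN_METHODS = ("rake", "yake", "textrank", "keybert")


def as_method_list(methods):
    """Return `methods` as a list of lowercase method names, rejecting unknown ones."""
    if isinstance(methods, str):
        methods = [methods]  # wrap a bare string instead of iterating its characters
    methods = [m.lower() for m in methods]
    unknown = [m for m in methods if m not in KNOWN_METHODS]
    if unknown:
        raise ValueError(f"Unknown extraction method(s): {unknown}")
    return methods


print(as_method_list("rake"))               # ['rake']
print(as_method_list(["RAKE", "keybert"]))  # ['rake', 'keybert']
```

The returned list could then be passed as `extraction_methods=...` when creating the extractor.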

An instance of `TermExtractor` with the default parameters is created below:

In [5]:
extractor = TermExtractor()

Once the extractor has been instantiated, several attributes of the `TermExtractor` object can already be inspected. The following attributes are accessible through dot notation:
- `extraction_methods` (list): List of the selected extraction methods.
- `extractors` (dict): Dictionary of the `Extractor` class objects to be used.
- `categorization_method` (str): Selected categorization method.
- `categorizer`: Instance of the `SetFitClassifier`, `TransformersClassifier` or `Clustering` class.
- `join` (bool): Value provided for the `join` parameter.
- `postprocess` (bool): Value provided for the `postprocess` parameter.
- `kwargs` (dict): Dictionary of the extra parameters provided.
- `keywords` (list): List of `Keyword` class objects holding the extracted keywords. Since keyword extraction has not been run yet, this attribute is `None`.

In [4]:
#attributes with the default settings
print("The default settings use the extractor", str(extractor.extraction_methods), "and the categorizer", str(extractor.categorization_method))
print("The selected extractors are", str(extractor.extractors))
print("The selected categorizer is", str(extractor.categorizer))
print("The parameter join is", str(extractor.join), "and the parameter postprocess is", str(extractor.postprocess))
print("The following extra parameters have been introduced:", str(extractor.kwargs))
print("At this moment the extracted keywords are", str(extractor.keywords))

The default settings use the extractor ['textrank'] and the categorizer setfit
The selected extractors are {'textrank': <extractors.TextRankExtractor.TextRankExtractor object at 0x7f8480915660>}
The selected categorizer is <categorizers.SetFitClassifier.SetFitClassifier object at 0x7f856c94e4a0>
The parameter join is False and the parameter postprocess is True
The following extra parameters have been introduced: {}
At this moment the extracted keywords are None


When the `TermExtractor` object is called on a string of text, it automatically extracts the keywords and classifies them according to the designated parameters. These keywords are stored in the attribute `TermExtractor.keywords`, which is a list of `Keyword` class objects.

In [6]:
#shortened text for a smaller number of keywords
text = """Presentamos el caso de un recién nacido que acude a urgencias por descubrir un aumento del hemiescroto izquierdo, así como un color violáceo del mismo; [...] No presenta fiebre, ni otro dato clínico adicional. [...] En la ecografía escrotal encontramos un teste derecho de características normales; el teste izquierdo se encuentra aumentado de tamaño, con irregularidades en su interior, con una zona anecoica que puede sugerir necrosis, rodeado de una zona hiperecoica sugerente de calcificación ; todo ello compatible con torsión testicular evolucionada. [...]"""

extractor(text)

#### The Keyword class
Objects of type `Keyword` also have several attributes that can be accessed through dot notation:
- `text` (str): The extracted keyword text.
- `extraction_method` (str): The extraction method used to find the keyword.
- `score` (float): The relevance score of the keyword. This score varies in range depending on the used method.
- `span` (list): List of two ints `[ini, fin]` containing the starting and ending indices of the keyword in the input text.
- `categorization_method` (str): The categorization method used to categorize the keyword.
- `label` (list): label or labels assigned to the keyword.

The user can either access these attributes of each keyword individually or print the keyword directly, which returns all the attributes in a structured manner.

In [6]:
#printing the words directly
print("Printing the keywords directly:\n", extractor.keywords)

#printing attributes individually
words = [(kw.text,kw.extraction_method,kw.span,kw.score,kw.categorization_method,kw.label) for kw in extractor.keywords]
print("\nPrinting the keywords individually:\n", words)

Printing the keywords directly:
 [<Keyword(text='torsión testicular', span='[524, 541]', extraction method='textrank', score='0.12600626989711727', categorization method='setfit', class='['ENFERMEDAD']')>, <Keyword(text='características normales', span='[273, 296]', extraction method='textrank', score='0.10035908282536155', categorization method='setfit', class='['SINTOMA']')>, <Keyword(text='fiebre', span='[170, 175]', extraction method='textrank', score='0.07631917863343891', categorization method='setfit', class='['SINTOMA']')>, <Keyword(text='tamaño', span='[344, 349]', extraction method='textrank', score='0.07574647594218213', categorization method='setfit', class='['NO_CATEGORY']')>, <Keyword(text='zona anecoica', span='[396, 408]', extraction method='textrank', score='0.07092914714476105', categorization method='setfit', class='['GEO_GEN']')>, <Keyword(text='teste izquierdo', span='[302, 316]', extraction method='textrank', score='0.0686690744025542', categorization method='setf
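Since each keyword carries its own `text`, `score` and `label`, the list can be reorganized with plain Python. The sketch below groups keyword texts by label, highest score first. `KeywordStub` is a stand-in defined only so the example runs without the library, and the scores are rounded from the output above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class KeywordStub:
    """Stand-in with three of the attributes documented for Keyword objects."""
    text: str
    score: float
    label: list


def group_by_label(keywords):
    """Map each label to the keyword texts carrying it, best score first."""
    groups = defaultdict(list)
    for kw in sorted(keywords, key=lambda k: k.score, reverse=True):
        for lab in kw.label:
            groups[lab].append(kw.text)
    return dict(groups)


keywords = [
    KeywordStub("fiebre", 0.0763, ["SINTOMA"]),
    KeywordStub("torsión testicular", 0.1260, ["ENFERMEDAD"]),
    KeywordStub("características normales", 0.1004, ["SINTOMA"]),
]
print(group_by_label(keywords))
# {'ENFERMEDAD': ['torsión testicular'], 'SINTOMA': ['características normales', 'fiebre']}
```

With the real objects, the same function should work on `extractor.keywords`, provided the attributes behave as documented.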

### Use of different extractors
The keyword extraction implemented in `TermExtractor` is performed in an unsupervised manner. The four extraction algorithms are RAKE, YAKE!, TextRank and KeyBERT; and they can be used individually or together. In addition to that, the parameters `max_tokens`, `join` and `postprocess` can be modified.

Some examples of these options are shown below:

In [7]:
#extractor using RAKE as extraction method, a maximum of 5 tokens and postprocessing
extractor1 = TermExtractor(extraction_methods=["rake"], max_tokens=5)

#extractor using 2 extraction methods with join=True and no postprocessing
extractor2 = TermExtractor(extraction_methods=["rake","yake"], join=True, postprocess=False)

#extractor using 3 extraction methods with join=True and postprocessing
extractor3 = TermExtractor(extraction_methods=["rake","keybert","textrank"], join=True)

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
No sentence-transformers model found with name /mnt/c/Users/Sergi/Desktop/BSC/spanish_sapbert_models/sapbert_15_parents_1epoch. Creating a new one with MEAN pooling.


The keywords extracted by each of these extractors are shown below:

In [8]:
#extractor using RAKE, a maximum of 5 tokens and postprocessing
extractor1(text)
print("Using extractor1:\n", extractor1.keywords, "\n")

#extractor using 2 extraction methods with join=True and no postprocessing
extractor2(text)
print("Using extractor2:\n", extractor2.keywords, "\n")

#extractor using 3 extraction methods with join=True and postprocessing
extractor3(text)
print("Using extractor3:\n", extractor3.keywords, "\n")

Using extractor1:
 [<Keyword(text='torsión testicular evolucionada', span='[524, 554]', extraction method='rake', score='9.0', categorization method='setfit', class='['ENFERMEDAD']')>, <Keyword(text='puede sugerir necrosis', span='[414, 435]', extraction method='rake', score='9.0', categorization method='setfit', class='['SINTOMA']')>, <Keyword(text='ecografía escrotal encontramos', span='[222, 251]', extraction method='rake', score='9.0', categorization method='setfit', class='['PROCEDIMIENTO']')>, <Keyword(text='dato clínico adicional', span='[186, 207]', extraction method='rake', score='9.0', categorization method='setfit', class='['SINTOMA']')>, <Keyword(text='zona hiperecoica sugerente', span='[453, 478]', extraction method='rake', score='8.5', categorization method='setfit', class='['SINTOMA']')>, <Keyword(text='zona anecoica', span='[396, 408]', extraction method='rake', score='4.5', categorization method='setfit', class='['GEO_GEN']')>, <Keyword(text='teste izquierdo', span='[3
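With `join=True`, the keywords of all selected methods are merged and overlapping ones are removed. The library's exact rule is not shown in this tutorial; the sketch below illustrates one plausible strategy, keeping the longest candidate among overlapping spans. The function name and the tuple layout are assumptions made for the example, and the spans are taken from the outputs above.

```python
def remove_overlaps(candidates):
    """Keep the longest candidates whose [ini, fin] spans (inclusive) do not overlap.

    `candidates` is a list of (text, [ini, fin]) tuples.
    """
    kept = []
    # Longest spans first, so the shorter overlapping ones are the ones dropped.
    for text, (ini, fin) in sorted(candidates, key=lambda c: c[1][0] - c[1][1]):
        if all(fin < k_ini or ini > k_fin for _, (k_ini, k_fin) in kept):
            kept.append((text, [ini, fin]))
    return sorted(kept, key=lambda c: c[1][0])  # restore document order


candidates = [
    ("torsión testicular", [524, 541]),               # as found by TextRank
    ("torsión testicular evolucionada", [524, 554]),  # as found by RAKE
    ("zona anecoica", [396, 408]),
]
print(remove_overlaps(candidates))
# [('zona anecoica', [396, 408]), ('torsión testicular evolucionada', [524, 554])]
```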

#### Extraction based on Part of Speech tags with KeyBERT
In addition to extracting keywords based on a maximum number of tokens, the KeyBERT extractor can also extract them based on specific Part of Speech (PoS) patterns. An example is shown below:

In [9]:
#extractor using KeyBERT as extraction method with the default PoS pattern
extractor4 = TermExtractor(extraction_methods=["keybert"], pos=True)

#extractor using KeyBERT as extraction method with a specified PoS pattern
extractor5 = TermExtractor(extraction_methods=["keybert"], pos=True, pos_pattern="<NOUN.*>*<ADJ.*>*")

No sentence-transformers model found with name /mnt/c/Users/Sergi/Desktop/BSC/spanish_sapbert_models/sapbert_15_parents_1epoch. Creating a new one with MEAN pooling.


The keywords extracted by each of these extractors are shown below:

In [10]:
#extractor using KeyBERT as extraction method with the default PoS pattern
extractor4(text)
print("Using extractor4:\n", extractor4.keywords, "\n")

#extractor using KeyBERT as extraction method with a specified PoS pattern
extractor5(text)
print("Using extractor5:\n", extractor5.keywords, "\n")

Using extractor4:
 [<Keyword(text='teste derecho de características normales', span='[256, 296]', extraction method='keybert', score='0.6458', categorization method='setfit', class='['SINTOMA']')>, <Keyword(text='torsión testicular evolucionada', span='[524, 554]', extraction method='keybert', score='0.6277', categorization method='setfit', class='['ENFERMEDAD']')>, <Keyword(text='aumento del hemiescroto izquierdo', span='[79, 111]', extraction method='keybert', score='0.6274', categorization method='setfit', class='['SINTOMA']')>, <Keyword(text='ecografía escrotal', span='[222, 239]', extraction method='keybert', score='0.5957', categorization method='setfit', class='['PROCEDIMIENTO']')>, <Keyword(text='zona hiperecoica', span='[453, 468]', extraction method='keybert', score='0.5505', categorization method='setfit', class='['SINTOMA']')>] 

Using extractor5:
 [<Keyword(text='torsión testicular evolucionada', span='[524, 554]', extraction method='keybert', score='0.6039', categorizatio
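To make the meaning of a pattern such as `<NOUN.*>*<ADJ.*>*` more concrete, the sketch below matches it against a sequence of PoS tags using only the standard `re` module. It illustrates how such a pattern can be read, not how KeyBERT or the library evaluates it. The function names are hypothetical and the tags in the example were assigned by hand. It is meant for simple concatenative patterns like the one passed to `extractor5`; alternations in which one branch can match the empty string would need extra handling.

```python
import re


def compile_tag_pattern(pattern):
    """Turn a pattern such as '<NOUN.*>*<ADJ.*>*' into a regex over a string of <TAG> items."""
    def wrap(match):
        # Inside a tag, '.' may match any character except the tag delimiters.
        inner = match.group(1).replace(".", "[^<>]")
        return "(?:<" + inner + ">)"
    return re.compile(re.sub(r"<([^<>]+)>", wrap, pattern))


def match_phrases(tagged_tokens, pattern):
    """Return the token sequences whose PoS tags match `pattern`."""
    regex = compile_tag_pattern(pattern)
    tag_string = ""
    starts, ends = {}, {}  # character offset in tag_string -> token index
    for i, (_, tag) in enumerate(tagged_tokens):
        starts[len(tag_string)] = i
        tag_string += "<" + tag + ">"
        ends[len(tag_string)] = i
    phrases = []
    for m in regex.finditer(tag_string):
        if m.start() == m.end():
            continue  # the pattern can match the empty string; skip those hits
        first, last = starts[m.start()], ends[m.end()]
        phrases.append(" ".join(tok for tok, _ in tagged_tokens[first:last + 1]))
    return phrases


tagged = [
    ("zona", "NOUN"), ("hiperecoica", "ADJ"), ("sugerente", "ADJ"),
    ("de", "ADP"), ("calcificación", "NOUN"),
]
print(match_phrases(tagged, "<NOUN.*>*<ADJ.*>*"))
# ['zona hiperecoica sugerente', 'calcificación']
```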

### Use of different categorizers
The multilabel keyword classification included in `TermExtractor` has been implemented in both a supervised and an unsupervised manner. The supervised classification is conducted with either a SetFit classifier (from the SetFit library) or an `AutoModelForSequenceClassification` model (from the Transformers library). The unsupervised categorization is conducted with a KMeans clustering algorithm.

#### Classification with SetFit
The SetFit classifier is selected by setting the parameter `categorization_method` to `"setfit"`. The default model has been trained on a corpus with 22 classes, but another trained model can be specified with the parameter `categorizer_model_path`. Classification is multilabel: each label is assigned a score between 0 and 1 during prediction. The user can set the minimum score for a label to be considered with `thr_setfit` (0.5 by default) and the maximum number of labels per keyword with `n` (1 by default). Some examples are shown below:

In [11]:
#default settings and specifying a model path
extractor6 = TermExtractor(categorization_method="setfit", categorizer_model_path='/mnt/c/Users/Sergi/Desktop/BSC/modelos_entrenados/SetFit/noparents_sp')

extractor6(text)
print("Using extractor6:\n", extractor6.keywords)



Using extractor6:
 [<Keyword(text='torsión testicular', span='[524, 541]', extraction method='textrank', score='0.12600626989711727', categorization method='setfit', class='['ENFERMEDAD']')>, <Keyword(text='características normales', span='[273, 296]', extraction method='textrank', score='0.10035908282536155', categorization method='setfit', class='['SINTOMA']')>, <Keyword(text='fiebre', span='[170, 175]', extraction method='textrank', score='0.07631917863343891', categorization method='setfit', class='['SINTOMA']')>, <Keyword(text='tamaño', span='[344, 349]', extraction method='textrank', score='0.07574647594218213', categorization method='setfit', class='['NO_CATEGORY']')>, <Keyword(text='zona anecoica', span='[396, 408]', extraction method='textrank', score='0.07092914714476105', categorization method='setfit', class='['GEO_GEN']')>, <Keyword(text='teste izquierdo', span='[302, 316]', extraction method='textrank', score='0.0686690744025542', categorization method='setfit', class='['

In [12]:
#allowing a multilabel output (up to 3 labels over 0.2)
extractor7 = TermExtractor(categorization_method="setfit", thr_setfit=0.2, n=3)

extractor7(text)
print("Using extractor7:\n", extractor7.keywords)



Using extractor7:
 [<Keyword(text='torsión testicular', span='[524, 541]', extraction method='textrank', score='0.12600626989711727', categorization method='setfit', class='['ENFERMEDAD']')>, <Keyword(text='características normales', span='[273, 296]', extraction method='textrank', score='0.10035908282536155', categorization method='setfit', class='['SINTOMA']')>, <Keyword(text='fiebre', span='[170, 175]', extraction method='textrank', score='0.07631917863343891', categorization method='setfit', class='['SINTOMA']')>, <Keyword(text='tamaño', span='[344, 349]', extraction method='textrank', score='0.07574647594218213', categorization method='setfit', class='['NO_CATEGORY']')>, <Keyword(text='zona anecoica', span='[396, 408]', extraction method='textrank', score='0.07092914714476105', categorization method='setfit', class='['GEO_GEN']')>, <Keyword(text='teste izquierdo', span='[302, 316]', extraction method='textrank', score='0.0686690744025542', categorization method='setfit', class='['

In this example, the mention "teste izquierdo" has been assigned two labels.
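The interplay between the threshold and `n` can be pictured as a filter followed by a cut-off. The sketch below is an illustration under assumptions, not the library's code: the scores are made up, the comparison is taken to be strict, and falling back to `NO_CATEGORY` when no label passes the threshold is a guess based on the label seen in the outputs above.

```python
def select_labels(scores, threshold=0.5, n=1, fallback="NO_CATEGORY"):
    """Pick at most `n` labels whose score is above `threshold`, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    chosen = [label for label, score in ranked if score > threshold][:n]
    return chosen or [fallback]


# Made-up scores for a single keyword.
scores = {"SINTOMA": 0.41, "PROCEDIMIENTO": 0.27, "ENFERMEDAD": 0.08}
print(select_labels(scores))                      # ['NO_CATEGORY']
print(select_labels(scores, threshold=0.2, n=3))  # ['SINTOMA', 'PROCEDIMIENTO']
```

Lowering the threshold lets weaker labels through, while `n` caps how many of them are kept.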
#### Classification with AutoModelForSequenceClassification
The `AutoModelForSequenceClassification` model of the Transformers library is selected by setting the parameter `categorization_method` to `"transformers"`. The default model has been trained on the same corpus as the SetFit classifier, but another trained model can be specified with the parameter `categorizer_model_path`. Classification is multilabel: each label is assigned a logit during prediction. The user can set the minimum logit for a label to be considered with `thr_transformers` (-1 by default) and the maximum number of labels per keyword with `n` (1 by default). An example is shown below:

In [7]:
#allowing a multilabel output (up to 2 labels over -2)
extractor8 = TermExtractor(categorization_method="transformers", thr_transformers=-2, n=2)

extractor8(text)
print("Using extractor8:\n", extractor8.keywords)



Using extractor8:
 [<Keyword(text='torsión testicular', span='[524, 541]', extraction method='textrank', score='0.12600626989711727', categorization method='setfit', class='['ENFERMEDAD']')>, <Keyword(text='características normales', span='[273, 296]', extraction method='textrank', score='0.10035908282536155', categorization method='setfit', class='['SINTOMA']')>, <Keyword(text='fiebre', span='[170, 175]', extraction method='textrank', score='0.07631917863343891', categorization method='setfit', class='['SINTOMA']')>, <Keyword(text='tamaño', span='[344, 349]', extraction method='textrank', score='0.07574647594218213', categorization method='setfit', class='['NO_CATEGORY']')>, <Keyword(text='zona anecoica', span='[396, 408]', extraction method='textrank', score='0.07092914714476105', categorization method='setfit', class='['GEO_GEN']')>, <Keyword(text='teste izquierdo', span='[302, 316]', extraction method='textrank', score='0.0686690744025542', categorization method='setfit', class='['
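Because `thr_transformers` is expressed on the logit scale, negative values are meaningful. Assuming the logits are independent per-label scores (the usual multilabel setup; the tutorial does not state whether the library applies a sigmoid), the sketch below shows which probabilities a few logit thresholds correspond to.

```python
import math


def sigmoid(logit):
    """Map a logit to the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-logit))


for thr in (-2, -1, 0):
    print(f"logit threshold {thr:>2} ~ probability {sigmoid(thr):.3f}")
# logit threshold -2 ~ probability 0.119
# logit threshold -1 ~ probability 0.269
# logit threshold  0 ~ probability 0.500
```

Under this reading, the default of -1 is more permissive than the 0.5 score threshold used for SetFit.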

#### Clustering with KMeans
KMeans clustering is selected by setting the parameter `categorization_method` to `"clustering"`. By default, the pretrained model "cambridgeltl/SapBERT-from-PubMedBERT-fulltext" is used, and the clusters are then fitted directly on the data provided for prediction. Another pretrained model can be specified with the parameter `clustering_model`; it is fitted on the given data in the same way. In both cases the trained model is stored in the directory given by the parameter `output_path`, which defaults to `"./trained_model"`.
Alternatively, a fully trained model can be specified with the parameter `categorizer_model_path`; in that case no further training is performed and the model is not saved.

Clustering assigns exactly one label to each mention, and the number of clusters must be specified with the parameter `n_clusters`. Some examples are shown below:

In [11]:
#default parameters (default model trained on the text) and 5 clusters
extractor9 = TermExtractor(categorization_method="clustering", n_clusters=5)
print(extractor9.categorization_method)
extractor9(text)
print("Using extractor9:\n", extractor9.keywords)



setfit
Using extractor9:
 [<Keyword(text='torsión testicular', span='[524, 541]', extraction method='textrank', score='0.12600626989711727', categorization method='setfit', class='['ENFERMEDAD']')>, <Keyword(text='características normales', span='[273, 296]', extraction method='textrank', score='0.10035908282536155', categorization method='setfit', class='['SINTOMA']')>, <Keyword(text='fiebre', span='[170, 175]', extraction method='textrank', score='0.07631917863343891', categorization method='setfit', class='['SINTOMA']')>, <Keyword(text='tamaño', span='[344, 349]', extraction method='textrank', score='0.07574647594218213', categorization method='setfit', class='['NO_CATEGORY']')>, <Keyword(text='zona anecoica', span='[396, 408]', extraction method='textrank', score='0.07092914714476105', categorization method='setfit', class='['GEO_GEN']')>, <Keyword(text='teste izquierdo', span='[302, 316]', extraction method='textrank', score='0.0686690744025542', categorization method='setfit', cl
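For readers unfamiliar with KMeans, the idea is to alternate between assigning each vector to its nearest centroid and moving each centroid to the mean of its members. The pure-Python sketch below demonstrates this on toy 2-D points that stand in for keyword embeddings. It is not the library's implementation, and the deterministic initialization (the first `k` points) is a simplification chosen to keep the example reproducible.

```python
import math


def kmeans(points, k, iterations=20):
    """Tiny k-means: return one cluster index per point."""
    centroids = [list(p) for p in points[:k]]  # deterministic init: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Toy vectors: two well-separated groups.
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9)]
print(kmeans(points, k=2))  # [0, 1, 0, 1]
```

As in the tutorial, each item receives exactly one cluster index, and the number of clusters has to be chosen in advance.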

#### Training the categorizers
Up to this point, the categorizers shown (both the classifiers and the clustering model) have used pretrained models. However, these categorizers can also be trained on the user's own data to achieve a more tailored result.
##### Training SetFit


##### Training AutoModelForSequenceClassification

##### Training the KMeans Clustering