# Tutorial for term extraction
This tutorial walks through the main steps for using the term extractors in the BioTermCategorizer library.

### Importing the library
To import the library and all its functions, run the following code:

In [3]:
import sys, os, re

#set the path to the library
general_path = os.getcwd().split("BioTermCategorizer")[0]+"BioTermCategorizer/"
sys.path.append(general_path+'biotermcategorizer/')
general_path

'/mnt/c/Users/Sergi/Documents/BioTermCategorizer/'
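
The string split above assumes that the notebook runs somewhere inside a directory named `BioTermCategorizer`. As a hedged alternative, the sketch below walks up from a starting directory with `pathlib` and raises a clear error when no such folder is found. The helper names `find_repo_root` and `add_library_to_path` are hypothetical and not part of the library; note also that this version picks the closest matching ancestor, whereas the split picks the first match in the path string.

```python
import sys
from pathlib import Path


def find_repo_root(start: Path, repo_name: str = "BioTermCategorizer") -> Path:
    """Return the closest ancestor of `start` (or `start` itself) named `repo_name`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if candidate.name == repo_name:
            return candidate
    raise FileNotFoundError(f"No directory named {repo_name!r} at or above {start}")


def add_library_to_path(start: Path) -> Path:
    """Append <repo>/biotermcategorizer to sys.path and return that directory."""
    package_dir = find_repo_root(start) / "biotermcategorizer"
    if str(package_dir) not in sys.path:
        sys.path.append(str(package_dir))
    return package_dir
```

In a notebook this would typically be called as `add_library_to_path(Path.cwd())` before importing `TermExtractor`.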

Additionally, the main class `TermExtractor` must be imported:

In [4]:
from TermExtractor import TermExtractor


### Using TermExtractor
#### Instance of TermExtractor
First, create an instance of the class `TermExtractor` and assign it to a variable, as shown below. The constructor accepts the following parameters:
- `extraction_methods` (list): List of extraction methods to use. Pass a list even if only one extraction method is provided.
- `language` (str): Language for text processing. Only Spanish is currently implemented.
- `max_tokens` (int): Maximum number of tokens for extracted terms.
- `join` (bool): Whether to merge the terms obtained with different methods and remove overlaps among them. With `join=True`, a single combined list of keywords is produced for all the selected extraction methods; with `join=False`, each method's keywords are kept as extracted, without merging.
- `postprocess` (bool): Whether to postprocess the extracted terms to remove or modify the ones that carry little meaning.

The default values for these five parameters are as follows:
`extraction_methods=["textrank"], language="spanish", max_tokens=3, join=False, postprocess=True`
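
To make the `join` option more concrete, the sketch below shows one possible way to merge candidate terms from several methods and drop the overlapping ones. It is an illustration under stated assumptions, not the library's actual algorithm: the tuple layout `(text, method, start, end)`, the helper name `merge_without_overlaps`, and the "keep the longest span" rule are all hypothetical. The candidate spans are illustrative values that match the sample sentence used later in this tutorial.

```python
from typing import List, Tuple

# (text, method, start, end) with an inclusive end index; hypothetical layout
Candidate = Tuple[str, str, int, int]


def merge_without_overlaps(candidates: List[Candidate]) -> List[Candidate]:
    """Greedily keep the longest candidates whose spans do not overlap."""
    # Longest spans first; ties are broken by the earliest start position.
    ordered = sorted(candidates, key=lambda c: (-(c[3] - c[2]), c[2]))
    kept: List[Candidate] = []
    for cand in ordered:
        _, _, start, end = cand
        overlaps = any(
            start <= k_end and k_start <= end for _, _, k_start, k_end in kept
        )
        if not overlaps:
            kept.append(cand)
    # Return the surviving candidates in reading order.
    return sorted(kept, key=lambda c: c[2])


candidates = [
    ("esporas", "rake", 72, 78),
    ("menos las esporas", "yake", 62, 78),
    ("esterilización", "rake", 106, 119),
    ("eliminan por esterilización", "yake", 93, 119),
]
merged = merge_without_overlaps(candidates)
print([text for text, _, _, _ in merged])
```

Under this rule, a short term that is fully contained in a longer one is discarded in favour of the longer term. The real `join=True` behaviour may use a different criterion, such as the score.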

Some examples for creating a first instance of the extractor are shown below.

In [4]:
#extractor with default parameters
extractor1 = TermExtractor()

#extractor using Rake as the extraction method, with a maximum of 5 tokens and postprocessing (the default)
extractor2 = TermExtractor(extraction_methods=["rake"], max_tokens=5)

#extractor using all extraction methods separately and no postprocessing
extractor3 = TermExtractor(extraction_methods=["rake","yake","textrank"], postprocess=False)

#extractor using all extraction methods with join=True and no postprocessing
extractor4 = TermExtractor(extraction_methods=["rake","yake","textrank"], join=True, postprocess=False)

#extractor using all extraction methods with join=True and postprocessing
extractor5 = TermExtractor(extraction_methods=["rake","yake","textrank"], join=True)

Once the extractor instance is created, some attributes of the `TermExtractor` object can already be inspected. The following attributes are available through dot notation:
- `extraction_methods` (list): List of the selected extraction methods.
- `extractors` (dict): Dictionary mapping each selected method name to the extractor object that will be used.
- `join` (bool): Value provided for the `join` parameter.
- `postprocess` (bool): Value provided for the `postprocess` parameter.
- `keywords` (list): List of `Keyword` objects holding the extracted keywords. Since the extraction has not been run yet, this attribute is `None`.

In [6]:
#attributes with the default settings
print("The default settings use the extractor", str(extractor1.extraction_methods), "with the parameter join =", str(extractor1.join))
print("The selected extractor is", str(extractor1.extractors))
print("The extracted keywords are", str(extractor1.keywords))

#attributes using all extraction methods with join=True and no postprocessing
print("\nExtractors", str(extractor4.extraction_methods), "with the parameter join =", str(extractor4.join))
print("The selected extractors are", str(extractor4.extractors))
print("The extracted keywords are", str(extractor4.keywords))

The default settings use the extractor ['textrank'] with the parameter join = False
The selected extractor is {'textrank': <extractors.TextRankExtractor.TextRankExtractor object at 0x7f8cdddea170>}
The extracted keywords are None

Extractors ['rake', 'yake', 'textrank'] with the parameter join = True
The selected extractors are {'rake': <extractors.RakeExtractor.RakeExtractor object at 0x7f8cdbe911b0>, 'yake': <extractors.YakeExtractor.YakeExtractor object at 0x7f8cdbe91240>, 'textrank': <extractors.TextRankExtractor.TextRankExtractor object at 0x7f8cdbe90c70>}
The extracted keywords are None


#### Calling TermExtractor to extract keywords
Once the `TermExtractor` instance has been assigned to a variable, that variable can be called on a piece of text to find its keywords. The keywords are stored in the `keywords` attribute of the `TermExtractor` object as objects of type `Keyword`.

Objects of type `Keyword` also have several attributes that can be accessed through dot notation:
- `text` (str): The extracted keyword text.
- `method` (str): The extraction method used to find the keyword.
- `score` (float): The relevance score of the keyword. The range of this score depends on the method used.
- `span` (list): List of two ints, `[ini, fin]`, with the start and end indices of the keyword in the input text.

The user can either access these attributes of each keyword individually or print the keyword directly, which returns all the attributes in a structured manner.

In [7]:
#sample text
text = "La desinfección total puede eliminar cualquier microorganismo menos las esporas, que solo se eliminan por esterilización."

#with default parameters
extractor1(text)

#printing the words directly
print("Printing the keywords directly:\n", extractor1.keywords)

#printing attributes individually
words = [(kw.text,kw.method,kw.span,kw.score) for kw in extractor1.keywords]
print("\nPrinting the keywords individually:\n", words)

Printing the keywords directly:
 [<Keyword(text='esterilización', method='textrank', score='0.24598459598535585', span='[106, 119]')>, <Keyword(text='desinfección total', method='textrank', score='0.09198286384736273', span='[3, 20]')>, <Keyword(text='cualquier microorganismo', method='textrank', score='0.09157193650587012', span='[37, 60]')>, <Keyword(text='menos las esporas', method='textrank', score='0.07212809468876731', span='[62, 78]')>]

Printing the keywords individually:
 [('esterilización', 'textrank', [106, 119], 0.24598459598535585), ('desinfección total', 'textrank', [3, 20], 0.09198286384736273), ('cualquier microorganismo', 'textrank', [37, 60], 0.09157193650587012), ('menos las esporas', 'textrank', [62, 78], 0.07212809468876731)]
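
In the output above, each `span` appears to use an inclusive end index: 'esterilización' has 14 characters and is reported as `[106, 119]`. Assuming that convention holds, a keyword can be recovered from the input text by slicing up to `fin + 1`. The snippet below copies the spans from the TextRank output above instead of calling the library, and `slice_span` is a hypothetical helper, not part of BioTermCategorizer.

```python
text = ("La desinfección total puede eliminar cualquier microorganismo "
        "menos las esporas, que solo se eliminan por esterilización.")

# (keyword text, span) pairs copied from the TextRank output above
reported = [
    ("esterilización", [106, 119]),
    ("desinfección total", [3, 20]),
    ("cualquier microorganismo", [37, 60]),
    ("menos las esporas", [62, 78]),
]


def slice_span(source: str, span: list) -> str:
    """Return the substring for an [ini, fin] span whose end index is inclusive."""
    ini, fin = span
    return source[ini:fin + 1]


for keyword, span in reported:
    recovered = slice_span(text, span)
    print(f"{span}: {recovered!r} (matches: {recovered == keyword})")
```

A plain `text[ini:fin]` slice would drop the last character of each keyword, so the `+ 1` matters when highlighting or annotating terms in the original text.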


In [8]:
#with Rake and 5 tokens at most and postprocessing
extractor2(text)
print(extractor2.keywords, "\n")

#with all algorithms and join=False and no postprocessing
extractor3(text)
print(extractor3.keywords, "\n")

#with all algorithms and join=True and no postprocessing
extractor4(text)
print(extractor4.keywords, "\n")

#with all algorithms, join=True and postprocessing
extractor5(text)
print(extractor5.keywords)

[<Keyword(text='solo', method='rake', score='1.0', span='[85, 88]')>, <Keyword(text='esterilización', method='rake', score='1.0', span='[106, 119]')>, <Keyword(text='esporas', method='rake', score='1.0', span='[72, 78]')>, <Keyword(text='eliminan', method='rake', score='1.0', span='[93, 100]')>] 

[<Keyword(text='solo', method='rake', score='1.0', span='[85, 88]')>, <Keyword(text='esterilización', method='rake', score='1.0', span='[106, 119]')>, <Keyword(text='esporas', method='rake', score='1.0', span='[72, 78]')>, <Keyword(text='eliminan', method='rake', score='1.0', span='[93, 100]')>, <Keyword(text='menos las esporas', method='yake', score='0.016559150827736194', span='[62, 78]')>, <Keyword(text='eliminan por esterilización', method='yake', score='0.016559150827736194', span='[93, 119]')>, <Keyword(text='desinfección total puede', method='yake', score='0.03339840940482845', span='[3, 26]')>, <Keyword(text='total puede eliminar', method='yake', score='0.03339840940482845', span='[16