In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from nb_utils import showAttribs

# Module: eliana.preprocessing

Classes for preprocessing events. The approach is similar to NLP preprocessing, but it targets event logs specifically.

## Tokenizers

The tokenizers in the eliana.preprocessing module are based on regular expressions and support multiple inheritance, so complex behavior can be built by combining simpler classes. New tokenizers can be derived from the base tokenizers shown below.

### Abstract Tokenizers

The AbstractTokenizer provides the interface shared by the rest of the tokenizers in eliana.preprocessing.

In [3]:
from eliana.preprocessing import AbstractTokenizer
showAttribs(AbstractTokenizer)

AbstractTokenizer   : Main methods for tokenizers.

Constructor
__init__            : Initializes the AbstractTokenizer with optional keyword arguments.

Properties
options             : (bool) remove_extra_spaces, to_lowercase, strip

Methods
help                : Provides help information about the tokenizer class, including its inheritance and tokenization order.
normalize           : Applies option normalization to the input text.
tokenize            : Apply tokenization to the provided text.
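
The three boolean options listed above can be pictured with a small standalone sketch. This is not eliana's implementation: the function below is a hypothetical stand-in for `normalize`, and both the order of the steps and the use of `SimpleNamespace` for the options are assumptions.

```python
import re
from types import SimpleNamespace


def normalize(text, options):
    """Hypothetical stand-in for the normalization step (not eliana's code)."""
    if options.remove_extra_spaces:
        # Collapse runs of spaces into a single space.
        text = re.sub(r" +", " ", text)
    if options.to_lowercase:
        text = text.lower()
    if options.strip:
        # Drop leading and trailing whitespace.
        text = text.strip()
    return text


opts = SimpleNamespace(remove_extra_spaces=True, to_lowercase=True, strip=True)
print(repr(normalize("  Hello    WORLD  ", opts)))  # 'hello world'
```

With all three options set to `False`, the text passes through unchanged, which matches the defaults shown in the `help()` outputs below.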


## Main Tokenizers

In [4]:
from eliana.preprocessing import RegExpTokenizer, Numbers, UTCdate, Punctuation

### RegExpTokenizer

In [5]:
print(RegExpTokenizer().help())

Tokenizer class "RegExpTokenizer"
Inherits from:
Options: namespace(remove_extra_spaces=False, to_lowercase=False, strip=False)

Tokenization is done in the following order

RegExpTokenizer:     A tokenizer that uses regular expressions to tokenize text. Inherits from AbstractTokenizer.


In [6]:
# Example
tkn = RegExpTokenizer(remove_extra_spaces=True)
tkn.add_regexp( r"xx(.+)xx", r"\1")

text = "xxabcxx      jj cc"

print(f"Options: {tkn.options.__dict__}")
print(f"Regexps: {tkn.regexps}")
print(f"Original : '{text}'")
print(f"Tokenized: '{tkn.tokenize(text)}'")

Options: {'remove_extra_spaces': True, 'to_lowercase': False, 'strip': False}
Regexps: [('(\\{\\})+', '{}'), ('xx(.+)xx', '\\1')]
Original : 'xxabcxx      jj cc'
Tokenized: 'abc jj cc'
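
The `Regexps` output above suggests that a tokenizer holds an ordered list of `(pattern, replacement)` pairs. A minimal sketch of that idea using only the standard `re` module follows. The helper `apply_rules` is hypothetical, and whether eliana collapses spaces before or after applying the rules is an assumption.

```python
import re

# The same two rules that appear in the output above.
rules = [
    (r"(\{\})+", "{}"),     # collapse repeated "{}" tokens into one
    (r"xx(.+)xx", r"\1"),   # keep only the text between the "xx" markers
]


def apply_rules(text, rules):
    """Apply each (pattern, replacement) pair in order (hypothetical helper)."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


result = apply_rules("xxabcxx      jj cc", rules)
result = re.sub(r" +", " ", result)  # mimic remove_extra_spaces=True
print(repr(result))  # 'abc jj cc'
```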


### Numbers

In [7]:
print(Numbers().help())

Tokenizer class "Numbers"
Inherits from: RegExpTokenizer
Options: namespace(remove_extra_spaces=False, to_lowercase=False, strip=False)

Tokenization is done in the following order

RegExpTokenizer:     A tokenizer that uses regular expressions to tokenize text. Inherits from AbstractTokenizer.

Numbers: Transform numbers using {} as token, but ignore numbers that are parts of a word
    1 -> {}
    -10.54 -> {}
    9.1e-2.1 -> {}
    There are 2 telescopes: UT1, UT2 -> There are {} telescopes: UT1, UT2
    2Good2be_True -> 2Good2be_True


In [8]:
# Example
tkn = Numbers()
text = "abc 123 456.78 Not99Tokenized"
print(f"Options: {tkn.options.__dict__}")
print(f"Original : '{text}'")
print(f"Tokenized: '{tkn.tokenize(text)}'")

Options: {'remove_extra_spaces': False, 'to_lowercase': False, 'strip': False}
Original : 'abc 123 456.78 Not99Tokenized'
Tokenized: 'abc {} {} Not99Tokenized'
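
One way to get the "ignore numbers that are part of a word" behavior is to use lookarounds. The pattern below is an illustrative assumption, not the pattern used by `Numbers`, and it does not try to reproduce every case listed in `help()`.

```python
import re

# A standalone number: optional sign, digits, optional decimals and exponent,
# not glued to a word character or a dot on either side.
NUMBER = re.compile(r"(?<![\w.])[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?(?![\w.])")


def mask_numbers(text, token="{}"):
    """Replace standalone numbers with a placeholder token (hypothetical helper)."""
    return NUMBER.sub(token, text)


print(mask_numbers("abc 123 456.78 Not99Tokenized"))
# abc {} {} Not99Tokenized
print(mask_numbers("There are 2 telescopes: UT1, UT2"))
# There are {} telescopes: UT1, UT2
```

The lookbehind and lookahead are what leave identifiers such as `UT1` or `2Good2be_True` untouched.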


### UTCdate

In [9]:
print(UTCdate().help())

Tokenizer class "UTCdate"
Inherits from: RegExpTokenizer
Options: namespace(remove_extra_spaces=False, to_lowercase=False, strip=False)

Tokenization is done in the following order

RegExpTokenizer:     A tokenizer that uses regular expressions to tokenize text. Inherits from AbstractTokenizer.

UTCdate: Transform UTC dates using {} as token
    2022-10-01T00:43:01.123 -> {}
    Started at 2019-04-01T22:29:07 (underlined) -> Started at {} (underlined)


In [10]:
# Example
tkn = UTCdate()
tkn.tokenize("abc 2019-12-31T00:00:00 2019-12-31T23:59:59.999 smtg else")

'abc {} {} smtg else'
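
The timestamps in these examples follow the `YYYY-MM-DDThh:mm:ss[.fff]` layout. A standalone sketch that masks them could look as follows. The pattern is an assumption about the format and is not taken from `UTCdate`.

```python
import re

# Date, literal "T", time, and an optional fractional-seconds part.
UTC_DATE = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?")


def mask_dates(text, token="{}"):
    """Replace UTC-style timestamps with a placeholder token (hypothetical helper)."""
    return UTC_DATE.sub(token, text)


print(mask_dates("abc 2019-12-31T00:00:00 2019-12-31T23:59:59.999 smtg else"))
# abc {} {} smtg else
```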

### Punctuation

In [11]:
print(Punctuation().help())

Tokenizer class "Punctuation"
Inherits from: RegExpTokenizer
Options: namespace(remove_extra_spaces=False, to_lowercase=False, strip=False)

Tokenization is done in the following order

RegExpTokenizer:     A tokenizer that uses regular expressions to tokenize text. Inherits from AbstractTokenizer.

Punctuation: Remove all punctuation
    Original: Hi! I'm counting 1,2, 3 ... and so on.
    Tokenized: Hi I m counting 1 2 3 and so on


In [12]:
# Example
tkn = Punctuation()
tkn.tokenize("Hi! I'm counting 1,2, 3 ... and so on.")

'Hi  I m counting 1 2  3     and so on '
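
The result above contains runs of spaces, while the example in `help()` does not. This is consistent with each punctuation character being replaced by a single space, with the runs collapsed only when `remove_extra_spaces` is enabled. The sketch below reproduces that behavior with the standard `re` module. The character class is an assumption, not the library's pattern, and it keeps `{` and `}` so that `{}` tokens would survive.

```python
import re

# Anything that is not a word character, whitespace, or a brace.
PUNCTUATION = re.compile(r"[^\w\s{}]")


def strip_punctuation(text):
    """Replace each punctuation character with one space (hypothetical helper)."""
    return PUNCTUATION.sub(" ", text)


raw = strip_punctuation("Hi! I'm counting 1,2, 3 ... and so on.")
print(repr(raw))
# 'Hi  I m counting 1 2  3     and so on '

# Collapsing spaces and stripping gives the form shown in help().
print(repr(re.sub(r" +", " ", raw).strip()))
# 'Hi I m counting 1 2 3 and so on'
```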

### Composed Tokenizer Example

In [13]:
class VltTokenizer(UTCdate, Numbers, Punctuation):
    """ Composing example
    """
    def __init__(self, **kwargs):
        kwargs.setdefault('remove_extra_spaces', True)
        kwargs.setdefault('to_lowercase', True)
        kwargs.setdefault('strip', True)
        super().__init__(**kwargs)

In [14]:
print(VltTokenizer().help())

Tokenizer class "VltTokenizer"
Inherits from: UTCdate, Numbers, Punctuation, RegExpTokenizer
Options: namespace(remove_extra_spaces=True, to_lowercase=True, strip=True)

Tokenization is done in the following order

UTCdate: Transform UTC dates using {} as token
    2022-10-01T00:43:01.123 -> {}
    Started at 2019-04-01T22:29:07 (underlined) -> Started at {} (underlined)

Numbers: Transform numbers using {} as token, but ignore numbers that are parts of a word
    1 -> {}
    -10.54 -> {}
    9.1e-2.1 -> {}
    There are 2 telescopes: UT1, UT2 -> There are {} telescopes: UT1, UT2
    2Good2be_True -> 2Good2be_True

Punctuation: Remove all punctuation
    Original: Hi! I'm counting 1,2, 3 ... and so on.
    Tokenized: Hi I m counting 1 2 3 and so on

RegExpTokenizer:     A tokenizer that uses regular expressions to tokenize text. Inherits from AbstractTokenizer.

VltTokenizer:  Composing example


In [15]:
tkn = VltTokenizer()
tkn.tokenize("wat2tcs lt3aga w2fors (bobWish_105797) ITERATION=10 [bob_234] lt4aag cmd77 2022-10-01T00:43:01.123")

'wat2tcs lt3aga w2fors bobwish_105797 iteration {} bob_234 lt4aag cmd77 {}'

In [16]:
tkn = VltTokenizer(to_lowercase=False)
tkn.tokenize("wat2tcs lt3aga w2fors (bobWish_105797) ITERATION=10 [bob_234] lt4aag cmd77 2022-10-01T00:43:01.123")

'wat2tcs lt3aga w2fors bobWish_105797 ITERATION {} bob_234 lt4aag cmd77 {}'
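
To illustrate how composition through multiple inheritance can yield an ordered rule chain, here is a self-contained sketch. It is not how eliana is implemented: every class name starting with `Mini` is hypothetical, the patterns are assumptions, and collecting rules by walking the method resolution order (MRO) is only one possible design.

```python
import re


class MiniTokenizer:
    """Hypothetical base class: gathers rules from the MRO and applies them."""

    rules = []

    def __init__(self, **kwargs):
        self.remove_extra_spaces = kwargs.get("remove_extra_spaces", False)
        self.to_lowercase = kwargs.get("to_lowercase", False)
        self.strip = kwargs.get("strip", False)
        # Walk the MRO so that rules run in the order the bases are listed.
        self.regexps = []
        for cls in type(self).__mro__:
            self.regexps.extend(cls.__dict__.get("rules", []))

    def tokenize(self, text):
        for pattern, replacement in self.regexps:
            text = re.sub(pattern, replacement, text)
        if self.remove_extra_spaces:
            text = re.sub(r" +", " ", text)
        if self.to_lowercase:
            text = text.lower()
        if self.strip:
            text = text.strip()
        return text


class MiniDates(MiniTokenizer):
    rules = [(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?", "{}")]


class MiniNumbers(MiniTokenizer):
    rules = [(r"(?<![\w.])[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?(?![\w.])", "{}")]


class MiniPunctuation(MiniTokenizer):
    rules = [(r"[^\w\s{}]", " ")]  # keep braces so "{}" tokens survive


class MiniVlt(MiniDates, MiniNumbers, MiniPunctuation):
    def __init__(self, **kwargs):
        # setdefault only fills in missing keys, so callers can still override.
        kwargs.setdefault("remove_extra_spaces", True)
        kwargs.setdefault("to_lowercase", True)
        kwargs.setdefault("strip", True)
        super().__init__(**kwargs)


line = ("wat2tcs lt3aga w2fors (bobWish_105797) ITERATION=10 "
        "[bob_234] lt4aag cmd77 2022-10-01T00:43:01.123")
print(MiniVlt().tokenize(line))
# wat2tcs lt3aga w2fors bobwish_105797 iteration {} bob_234 lt4aag cmd77 {}
```

The order of the bases matters in this sketch. If the number rule ran before the date rule, the year `2022` would be masked first and the timestamp would no longer match the date pattern.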

## Log Colorizer
The LogColorizer learns templates and tokenization rules from the statistical properties of a trace of logs.

In [17]:
from eliana.preprocessing import LogColorizer
showAttribs(LogColorizer)

LogColorizer        : Model to learn tokenizers from a dataset of traces.

Constructor
__init__            : Initializes the LogColorizer with a tokenizer or a custom tokenization function.

Properties
regexps             : List of regular expressions used by the tokenizer, including post-processing patterns.
special             : Symbol used as a placeholder for numeric or variable values in templates. Default is '§'.
templates           : List of templates used for matching traces.
wildcard            : Placeholder for variable text in templates. Default is "{}".

Methods
fit                 : Learns templates and tokenization rules from a dataset of traces.
fit_on_traces       : None
save                : Saves the tokenizer object to a file.
tokenize            : Tokenizes a word or phrase using the provided tokenizer and post-processing steps.


See the [LogColorizer](03-preprocessing.log_colorizer.ipynb) documentation for details.