# Analiticcl tutorial (using Python)

Analiticcl is an approximate string matching or fuzzy-matching system that can be used for spelling
correction or text normalisation (such as post-OCR correction or post-HTR correction). Texts can be checked against a
validated or corpus-derived lexicon (with or without frequency information) and spelling variants will be returned.

To understand the theoretical background behind analiticcl, we recommend that you also watch [this presentation
video](https://diode.zone/w/kkrqA4MocGwxyC3s68Zsq7), which was presented at the KNAW Humanities Cluster in January 2022.

Analiticcl can be invoked from either the command-line or via Python using the binding.
In this tutorial, we will use the latter option and explore some of the functionality of analiticcl.

## Installation

First of all, we need to install analiticcl. In a Jupyter Notebook this is simply accomplished as follows:

In [1]:
%pip install analiticcl

Defaulting to user installation because normal site-packages is not writeable
Collecting analiticcl
  Downloading analiticcl-0.4.4-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.4 MB)
Installing collected packages: analiticcl
Successfully installed analiticcl-0.4.4
Note: you may need to restart the kernel to use updated packages.


If you work from the command line instead, create a Python virtual environment and install analiticcl in it as follows:

```
$ python -m venv env
$ . env/bin/activate
$ pip install analiticcl
```

Now that analiticcl is installed, we can import the module. As we usually only need three main classes, we import only these:

In [1]:
from analiticcl import VariantModel, Weights, SearchParameters

## Data preparation

Analiticcl doesn't do much out-of-the-box and is only as good as the data you feed it. It is a fairly low-level tool that is quite versatile, and it's up to you to wield it effectively. It specifically needs *lexicons* or *variant lists* to operate; these contain the words or phrases that the system will match against.

**Advanced note:** All input for analiticcl must be UTF-8 encoded and use unix-style line endings. NFC unicode normalisation is strongly recommended.
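
The requirements in this note can be met with a small preprocessing step. The following is a minimal sketch using only the Python standard library; the helper name `normalise_text` is our own and is not part of analiticcl:

```python
import unicodedata

def normalise_text(text: str) -> str:
    """Return text in NFC form with unix-style line endings."""
    # Convert Windows (CR LF) and old Mac (CR) line endings to LF
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Compose characters, e.g. 'e' + combining acute accent becomes a single 'é'
    return unicodedata.normalize("NFC", text)

# Decomposed accents and Windows line endings, as they may occur in a raw lexicon
raw = "cafe\u0301\r\nnai\u0308ve\r\n"
clean = normalise_text(raw)
print(clean.encode("utf-8"))
```

When writing the result back to disk, open the file with `encoding="utf-8"` and `newline="\n"` so that Python does not reintroduce platform-specific line endings.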

### Alphabet file

First of all, we need an *alphabet file*, which simply defines all characters in the alphabet, grouping certain character variants together if desired. See the [README.md](README.md) for further documentation on this. We simply take the example alphabet file that is supplied with analiticcl. The alphabet file is a TSV file (tab separated fields) containing all characters of the alphabet. Each line describes a
single alphabet 'character'; all columns on the same line are considered equivalent variants of the same character from the perspective of analiticcl:

In [2]:
alphabet_file = "examples/simple.alphabet.tsv"

with open(alphabet_file,'r', encoding='utf-8') as f:
    print(f.read())

a	A	á	à	Á	À	ä	Ä	ã	Ã	â	Â
e	E	ë	é	è	ê	Ë	É	È	Ê	æ	Æ
o	O	ö	ó	ò	õ	ô	Ö	Ó	Ò	Õ	Ô	å	Å	ø	œ
i	I	ï	í	Í
u	U	ú	Ú	ü	Ü
y	Y
b	B
c	C
d	D
f	f
g	G
h	H
k	k
l	L
m	M
n	N	ñ	Ñ
p	P
r	R
s	S
t	T
j	J
v	V
w	W
q	Q
x	X
z	Z
"	``	''
'
\s	\t
.	,	:	?	!
0	1	2	3	4	5	6	7	8	9



### Lexicon

In this tutorial we will use an English lexicon from the [GNU aspell](http://aspell.net/) project, a commonly used spell checker library. It simply contains one word per line. An example is supplied with analiticcl:

In [3]:
lexicon_file = "examples/eng.aspell.lexicon"

## Variant Model

### Building

We now have all we need to build our first variant model using analiticcl. A variant model enables quick and efficient matching of any input against the specified lexicons, and in doing so finds variants of the input (or variants of the lexicon entries, it's only a matter of perspective).

In [4]:
model = VariantModel(alphabet_file, Weights())

For the time being we're content with the default weights (more about these later), passed as second parameter.

The model is still empty upon instantiation. We need to feed it with one or more lexicons. Let's pass the English aspell lexicon:

In [5]:
model.read_lexicon(lexicon_file)

After loading all lexicons, we build the model as follows:

In [10]:
model.build()

Computing anagram values for all items in the lexicon...
 - Found 119773 instances
Adding all instances to the index...
 - Found 108802 anagrams
Creating sorted secondary index...
Sorting secondary index...
 - Found 27 anagrams of length 1
 - Found 248 anagrams of length 2
 - Found 942 anagrams of length 3
 - Found 2593 anagrams of length 4
 - Found 5623 anagrams of length 5
 - Found 10163 anagrams of length 6
 - Found 14617 anagrams of length 7
 - Found 16911 anagrams of length 8
 - Found 16391 anagrams of length 9
 - Found 13930 anagrams of length 10
 - Found 10650 anagrams of length 11
 - Found 7194 anagrams of length 12
 - Found 4434 anagrams of length 13
 - Found 2459 anagrams of length 14
 - Found 1384 anagrams of length 15
 - Found 667 anagrams of length 16
 - Found 339 anagrams of length 17
 - Found 128 anagrams of length 18
 - Found 62 anagrams of length 19
 - Found 20 anagrams of length 20
 - Found 9 anagrams of length 21
 - Found 8 anagrams of length 22
 - Found 2 anagrams o

### Simple Querying

Now that the model is built, we can query it. Let's first take an existing word that's in the model:

In [11]:
variants = model.find_variants("separate", SearchParameters())
for variant in variants:
    print(variant)

{'text': 'separate', 'score': 1.0, 'dist_score': 1.0, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'text': 'separated', 'score': 0.8125, 'dist_score': 0.8125, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'text': 'separates', 'score': 0.8125, 'dist_score': 0.8125, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'text': 'separately', 'score': 0.75, 'dist_score': 0.75, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'text': "separate's", 'score': 0.75, 'dist_score': 0.75, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'text': 'separative', 'score': 0.734375, 'dist_score': 0.734375, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'text': 'separator', 'score': 0.71875, 'dist_score': 0.71875, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'text': 'separable', 'score': 0.703125, 'dist_score': 0.703125, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'

As expected, the word itself is returned with a perfect score of *1.0*, along with various lower-ranking variants. Each variant is represented as a dictionary with the following keys:

* ``text`` - The textual value (str) of the variant as it occurs in the lexicon
* ``score`` - The combined score of this variant (float), a weighted combination of `dist_score` and `freq_score`
* ``dist_score`` - The distance score (float). A perfect match always has score *1.0*.
* ``freq_score`` - The frequency score (float), in case lexicons have frequency information. The most frequent match always has score *1.0*.
* ``lexicons`` - The lexicons where the match was found (list). This is useful in case you loaded multiple lexicons and may even serve as a simple form of tagging.
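
Since each variant is a plain dictionary, results can be post-processed with ordinary Python. The sketch below picks the best variant above a minimum score; the sample data is copied (shortened) from the output above, and the helper `best_variant` is our own illustration, not part of analiticcl:

```python
from typing import Optional

# Sample result, copied from the output above (shortened to two variants)
variants = [
    {'text': 'separate', 'score': 1.0, 'dist_score': 1.0, 'freq_score': 1.0,
     'lexicons': ['examples/eng.aspell.lexicon']},
    {'text': 'separated', 'score': 0.8125, 'dist_score': 0.8125, 'freq_score': 1.0,
     'lexicons': ['examples/eng.aspell.lexicon']},
]

def best_variant(variants: list, min_score: float = 0.0) -> Optional[dict]:
    """Return the highest-scoring variant, or None if none reaches min_score."""
    candidates = [v for v in variants if v['score'] >= min_score]
    return max(candidates, key=lambda v: v['score'], default=None)

best = best_variant(variants, min_score=0.7)
print(best['text'], best['score'])
```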

Let's now try it with misspelled input that is not in the model. Even though there is no exact match, we expect the properly spelled variant to come out on top:

In [12]:
variants = model.find_variants("seperate", SearchParameters())
for variant in variants:
    print(variant)

{'text': 'separate', 'score': 0.734375, 'dist_score': 0.734375, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'text': 'operate', 'score': 0.6875, 'dist_score': 0.6875, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'text': 'desperate', 'score': 0.6875, 'dist_score': 0.6875, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'text': 'temperate', 'score': 0.6875, 'dist_score': 0.6875, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'text': 'serrate', 'score': 0.65625, 'dist_score': 0.65625, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'text': 'exasperate', 'score': 0.625, 'dist_score': 0.625, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'text': 'separated', 'score': 0.609375, 'dist_score': 0.609375, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'text': 'separates', 'score': 0.609375, 'dist_score': 0.609375, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lex

The `find_variants` method is used to *query* the model directly. Parameters can be specified as part of ``SearchParameters`` using keyword arguments. The following are supported:

* ``max_edit_distance`` - Maximum edit distance (Damerau-Levenshtein). Insertions, deletions, substitutions and transpositions all have the same cost (1). It is recommended to set this value slightly lower than the maximum anagram distance. This may take an absolute integer value, i.e. the number of edits, a floating point value in the range 0-1 that expresses a ratio of the total length of the text fragment under consideration, or a tuple of a floating point value and an integer (same interpretation as above) with the integer acting as a limit.
* ``max_anagram_distance`` - Maximum anagram distance (a heuristic approximation of edit distance). This may take an absolute integer value, i.e. the difference in characters (regardless of order), a floating point value in the range 0-1 that expresses a ratio of the total length of the text fragment under consideration, or a tuple of a floating point value and an integer (same interpretation as above) with the integer acting as a limit.
* ``score_threshold`` - Require scores to meet this threshold (float); variants with lower scores are pruned.
* ``cutoff_threshold`` - Cut-off threshold: if a score in the ranking is a specific factor greater than the best score, the ranking will be cut off at that point and the score not included. Should be set to a value like 2.
* ``freq_weight`` - Weight attributed to the frequency information in frequency reranking, in relation to the similarity (distance) component (0 = disabled).
* ``max_matches`` - Number of matches to return per input (set to 0 for unlimited if you want to exhaustively return every possibility within the specified anagram and edit distance).

Example with some constraining parameters:

In [None]:
variants = model.find_variants("seperate", SearchParameters(max_anagram_distance=2, max_edit_distance=2, max_matches=1))
for variant in variants:
    print(variant)
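
The three accepted forms for `max_edit_distance` and `max_anagram_distance` (integer, ratio, or ratio with limit) can be illustrated with a small stand-alone sketch. The function `absolute_limit` is hypothetical and only mirrors the description above; in particular, rounding the ratio down is our assumption, and analiticcl's internal rounding may differ:

```python
import math
from typing import Tuple, Union

Threshold = Union[int, float, Tuple[float, int]]

def absolute_limit(threshold: Threshold, length: int) -> int:
    """Resolve a distance threshold to an absolute number of characters.

    ASSUMPTION: ratios are rounded down; analiticcl's own rounding may differ.
    """
    if isinstance(threshold, tuple):
        ratio, limit = threshold
        # A ratio of the input length, capped by the integer limit
        return min(math.floor(ratio * length), limit)
    if isinstance(threshold, float):
        # A ratio of the input length
        return math.floor(threshold * length)
    return threshold  # already an absolute value

length = len("seperate")  # 8 characters
for threshold in (2, 0.25, (0.5, 3)):
    print(threshold, "->", absolute_limit(threshold, length))
```

A ratio scales the allowed distance with the input length, which keeps short words from matching almost anything, while the optional limit prevents very long inputs from triggering expensive searches.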

Note that in these examples, the resulting `freq_score` is always *1.0* (the maximum) because there is no frequency information associated with our lexicon. Our aspell lexicon was just a plain list of words. Analiticcl often produces better results if corpus-derived frequency information is available in the lexicon. Frequencies can be provided as simple counts (integers) in the second column of the tab separated format. The system will then be more inclined to pick high-frequency words over those with low frequency. The weight of the frequency component can be set via `freq_weight` in `SearchParameters`: a value of 0.5 means that the frequency score is deemed half as important as the distance score, a value of 2 means it is twice as important, and a value of 0 disables it entirely (default). A value of 0.25 might be a reasonable setting if you have frequency information in your lexicon(s). If you load multiple lexicons with frequency information, all frequencies must be expressed on the same scale (e.g. derived from the same corpus).

The weights for the similarity/distance computations are set via keyword arguments on initialisation of the `Weights` class, which was, if you recall, passed once when we instantiated the `VariantModel`. We distinguish the following weights. All are floating point values, and each corresponds to a component in the similarity/distance computation:

* ``ld`` - Edit/Levenshtein distance. This is usually the main component (with highest weight) in the similarity computation (default: 0.5)
* ``lcs`` - Longest Common Substring (default: 0.125)
* ``prefix`` - Prefix match. Looks at common prefixes (default: 0.125)
* ``suffix`` - Suffix match. Looks at common suffixes (default: 0.125)
* ``case`` - Casing. Looks at uppercase/lowercase differences  (default: 0.125)

You can set a value to 0.0 to disable a component. The weights are most easily interpretable if their sum is 1.0, but this is not a requirement (the system will do the normalisation for you).
The result of the computation using these weights ends up in the results dictionary under the `dist_score` key.
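
To illustrate what normalisation of the weights means, here is a minimal sketch that combines component scores as a weighted mean. This is an assumption made for illustration: the text above only states that the weights are normalised, not the exact formula analiticcl uses, and the component scores below are made up:

```python
def combine(component_scores: dict, weights: dict) -> float:
    """Weighted mean of component scores; weights are normalised by their sum."""
    total = sum(weights.values())
    return sum(weights[name] * component_scores[name] for name in weights) / total

# The default weights listed above
default_weights = {"ld": 0.5, "lcs": 0.125, "prefix": 0.125, "suffix": 0.125, "case": 0.125}
# Hypothetical component scores for one candidate (not taken from analiticcl)
scores = {"ld": 0.75, "lcs": 0.5, "prefix": 1.0, "suffix": 0.5, "case": 1.0}

print(combine(scores, default_weights))

# Doubling every weight leaves the result unchanged: only the proportions matter
doubled = {name: 2 * weight for name, weight in default_weights.items()}
print(combine(scores, doubled))
```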

## Searching

Using the `find_variants()` method, you query analiticcl's variant model with an exact input string and ask it to correct it as a single unit. This
effectively implements the *correction* part of a spelling-correction system, but does not really handle the *detection*
aspect that automatically determines which part of the input needs correction in the first place.

If you want to *detect* and subsequently *correct* possible errors in running text, you need the `find_all_matches()` method.
The input is running text (a string) and analiticcl will return all matches it can find:

In [13]:
matches = model.find_all_matches("We would like seperate beds", SearchParameters(unicodeoffsets=True))
for match in matches:
    print(match)

{'input': 'We', 'offset': {'begin': 0, 'end': 2}, 'variants': [{'text': 'we', 'score': 0.875, 'dist_score': 0.875, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}, {'text': 'Wei', 'score': 0.625, 'dist_score': 0.625, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}, {'text': 'Web', 'score': 0.625, 'dist_score': 0.625, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}, {'text': 'Wed', 'score': 0.625, 'dist_score': 0.625, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}, {'text': 'wee', 'score': 0.5625, 'dist_score': 0.5625, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}, {'text': 'E', 'score': 0.5, 'dist_score': 0.5, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}, {'text': 'W', 'score': 0.5, 'dist_score': 0.5, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}, {'text': 'awe', 'score': 0.5, 'dist_score': 0.5, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}, {'text':

Each match refers back to the input string using *begin* and *end* offsets (the end is non-inclusive). If `unicodeoffsets` is set in `SearchParameters`, these are unicode codepoints; otherwise they are UTF-8 bytes (analiticcl's internal representation). In Python, you will almost always want to set `unicodeoffsets=True`, as that corresponds to how Python deals with string indexing.
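
The difference between the two kinds of offsets only shows up once the text contains non-ASCII characters. The following stand-alone sketch (plain Python, no analiticcl involved; the helper `byte_to_codepoint_offset` is our own) shows how the same span gets different offsets and how byte offsets could be converted if needed:

```python
text = "Zo\u00eb wants seperate beds"  # "Zoë wants seperate beds"
word = "seperate"

# Codepoint offsets: what Python string slicing uses
begin = text.index(word)
end = begin + len(word)  # non-inclusive end

# The same span expressed in UTF-8 bytes ('ë' takes two bytes)
byte_begin = len(text[:begin].encode("utf-8"))
byte_end = len(text[:end].encode("utf-8"))

def byte_to_codepoint_offset(text: str, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset into a codepoint offset."""
    return len(text.encode("utf-8")[:byte_offset].decode("utf-8"))

print(begin, end)            # codepoint offsets
print(byte_begin, byte_end)  # byte offsets, shifted by one because of 'ë'
print(text[begin:end])       # slicing with codepoint offsets yields the word
```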

Now you might think that processing words in a sentence is equal to simply splitting the words (i.e. tokenisation) and calling `find_variants()` on each word, but analiticcl actually does way more than that:

* Entries in lexicons need not be single words, we support higher-order n-grams. You can pass keyword argument `max_ngram` to `SearchParameters` with an integer value indicating the maximum order of n-grams you want to support (1 = unigrams (default), 2 = bigrams, etc).
* The system tries to find the optimal path in case of conflicting solutions.
* Analiticcl does invoke `find_variants()` behind the scenes, but it can parallelise calls to leverage multiple processor cores and speed up the process. Set keyword parameter `single_thread=True` for `SearchParameters` if you don't want this behaviour.
* Analiticcl can solve run-ons and splits:

In [17]:
matches = model.find_all_matches("We would like sep arate beds", SearchParameters(unicodeoffsets=True))
print(matches[3])

{'input': 'sep arate', 'offset': {'begin': 14, 'end': 23}, 'variants': [{'text': 'separate', 'score': 0.7499999999999999, 'dist_score': 0.7499999999999999, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}, {'text': 'separated', 'score': 0.625, 'dist_score': 0.625, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}, {'text': 'separates', 'score': 0.625, 'dist_score': 0.625, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}, {'text': 'separative', 'score': 0.5694444444444445, 'dist_score': 0.5694444444444445, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}, {'text': 'separately', 'score': 0.5694444444444444, 'dist_score': 0.5694444444444444, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}, {'text': "separate's", 'score': 0.5694444444444444, 'dist_score': 0.5694444444444444, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}]}


When using ``find_all_matches()``, the following keyword arguments are supported for `SearchParameters` in addition to the ones already mentioned before:

* `consolidate_matches` - (boolean) Consolidate matches and extract a single most likely sequence. If set to `False`, all possible matches (including overlapping ones) are returned.
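
Once you have a consolidated (non-overlapping) list of matches with codepoint offsets, turning them into corrected text is a matter of string slicing. The sketch below works on an abridged sample in the shape of the output shown earlier; the helper `apply_corrections` and its `min_score` threshold are our own and not part of analiticcl:

```python
# Abridged sample in the shape returned by find_all_matches(..., unicodeoffsets=True)
text = "We would like sep arate beds"
matches = [
    {'input': 'We', 'offset': {'begin': 0, 'end': 2},
     'variants': [{'text': 'we', 'score': 0.875}]},
    {'input': 'sep arate', 'offset': {'begin': 14, 'end': 23},
     'variants': [{'text': 'separate', 'score': 0.7499999999999999}]},
]

def apply_corrections(text: str, matches: list, min_score: float = 0.7) -> str:
    """Replace each matched span by its top variant if it scores at least min_score."""
    # Work from right to left so that earlier offsets stay valid after a replacement
    for match in sorted(matches, key=lambda m: m['offset']['begin'], reverse=True):
        if not match['variants']:
            continue  # nothing found for this span
        top = match['variants'][0]  # variants are ranked best-first
        if top['score'] < min_score or top['text'].lower() == match['input'].lower():
            continue  # too uncertain, or only a casing difference
        begin, end = match['offset']['begin'], match['offset']['end']
        text = text[:begin] + top['text'] + text[end:]
    return text

corrected = apply_corrections(text, matches)
print(corrected)
```

Skipping replacements that differ only in casing is a deliberate simplification here, so that a sentence-initial "We" is not lowercased.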

### Variant lists

Thus far we have used a simple lexicon with one word per line. Additionally, we know we can also load lexicons with corpus-derived frequency information. Now we want to show that you can also load explicit variant lists.
A variant list explicitly relates spelling variants to preferred forms, and in doing so goes a step further than a simple lexicon, which only
specifies the validated or corpus-derived form.

A variant list is *directed* and *weighted*: it specifies a normalised/preferred form first, and then specifies variants and variant scores (and optionally, frequencies). Take the following example (all fields are tab separated):

```tsv
separate	seperate	1.0	seprate	1.0
```

This states that the preferred word *separate* has two variants (misspellings in this case), and both have a score
(0-1) that expresses how likely it is that the variant maps to the preferred word. Let's load this into analiticcl using `read_variants`:

In [9]:
with open("example.variantlist.tsv","w",encoding="utf-8") as f:
    f.write("separate	seperate	1.0	seprate	1.0\n")

model2 = VariantModel(alphabet_file, Weights())
model2.read_variants("example.variantlist.tsv", transparent=True)
model2.build()

Computing anagram values for all items in the lexicon...
 - Found 3 instances
Adding all instances to the index...
 - Found 3 anagrams
Creating sorted secondary index...
Sorting secondary index...
 - Found 1 anagrams of length 7
 - Found 2 anagrams of length 8
Constructing Language Model...
 - No language model provided


A variant list can be either transparent or not. If it is transparent, none of the variants it lists will be returned (they are transparent); only the preferred form is. Such a list is also sometimes known as an error list.
If a list is non-transparent, the variants may be returned as valid matches.

Now if we query for the misspelling *seperate* we get a *perfect* (1.0) match via the variant list. The `via` key records the variant through which it was matched transparently:

In [10]:
variants = model2.find_variants("seperate", SearchParameters(max_anagram_distance=2, max_edit_distance=2, max_matches=1))
for variant in variants:
    print(variant)

{'text': 'separate', 'score': 1.0, 'dist_score': 1.0, 'freq_score': 1.0, 'via': 'seperate', 'lexicons': ['example.variantlist.tsv']}


Why would you want to use a variant list? We've seen that even with a simple lexicon we were already able to correctly catch this misspelling. Analiticcl, after all, is precisely designed to match variants without needing to make them explicit.
Still, a variant list may help to bridge larger edit distances without needing to increase the `max_edit_distance` or `max_anagram_distance` of the algorithm (which comes with a performance penalty). It is especially useful in cases of error correction or normalisation where there is a large difference between the preferred form and the (possibly erroneous) variant. It can also be used to link entirely different orthographic synonyms to a preferred normalised form.
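
If your variants live in a Python data structure rather than in a file, a variant list in the tab separated format described above can be generated as follows. This is a minimal sketch: the helper `variantlist_line` is our own, and the *thru* entry is a made-up example of a variant at a larger edit distance:

```python
def variantlist_line(preferred: str, variants: list) -> str:
    """Format one variant list entry: preferred form, then variant/score pairs."""
    fields = [preferred]
    for variant, score in variants:
        fields += [variant, str(score)]
    return "\t".join(fields)

entries = {
    "separate": [("seperate", 1.0), ("seprate", 1.0)],
    "through": [("thru", 0.9)],  # hypothetical entry bridging a larger distance
}
content = "\n".join(variantlist_line(p, v) for p, v in entries.items()) + "\n"
print(content, end="")
```

The resulting string can be written to a file with `encoding="utf-8"` and then loaded using `read_variants` as shown earlier.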

### Background Lexicon

We cannot overstate the importance of the background lexicon in reducing false positives. Analiticcl will eagerly
attempt to match your test input to whatever lexicons or variant lists you provide. This demands a certain degree of completeness in your
lexicons. If your lexicon contains a relatively rare word like "boulder" and not a more common word like "builder", then
analiticcl will happily suggest all instances of "builder" to be "boulder". The risk for this increases as the allowed
edit distances increase.

With `model2` from the previous section we now have a good example to illustrate this. This model contains only one entry from a variant list and has no further background lexicon. If we query it with, for example, the word *operate* (fairly close in edit distance to the misspellings *seperate* and *seprate* in our variant list), we get *separate* as the result:

In [11]:
variants = model2.find_variants("operate", SearchParameters(max_anagram_distance=2, max_edit_distance=2, max_matches=1))
for variant in variants:
    print(variant)

{'text': 'separate', 'score': 0.625, 'dist_score': 0.625, 'freq_score': 1.0, 'via': 'seprate', 'lexicons': ['example.variantlist.tsv']}


Background lexicons should also contain morphological variants and not just lemmas. Ideally, a background lexicon is derived automatically from a fully spell-checked corpus. Analiticcl **will not** work for you if you just feed it some small lexicons and no sufficiently complete background lexicon, unless you are sure your test texts have a very constrained vocabulary.
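
As a rough sketch of how such a background lexicon could be derived from a corpus, the following counts all word forms (so morphological variants are kept apart) and formats them with the frequency in the second column. The tokenisation and the lowercasing are deliberate simplifications of our own, and the one-line corpus is a toy example:

```python
import re
from collections import Counter

def build_frequency_lexicon(corpus: str) -> Counter:
    """Count word forms (not lemmas) in a spell-checked corpus."""
    # Naive tokenisation on letters with an optional apostrophe part;
    # lowercasing is a simplification that discards casing information
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", corpus.lower())
    return Counter(tokens)

corpus = "The builder builds. The builders built a boulder wall."
lexicon = build_frequency_lexicon(corpus)

# One entry per line: word, tab, absolute frequency
lines = [f"{word}\t{count}" for word, count in lexicon.most_common()]
print("\n".join(lines))
```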

### Confusable lists

When analiticcl computes distances, it relies on the *alphabet* you provided. Any substitution of one character in the alphabet for another carries the same weight for the distance algorithm. In practice, however, certain characters are confused more often than others.

Take a word like *analysis*: we may imagine people misspelling it as *analisys*, *analysys* or *analisis*, considering that the *i* and *y* have the same phonetic expression in this context.
Analiticcl allows you to express such *confusables* in a *confusable list*.

The confusable list is a TSV file (tab separated fields) containing known confusable patterns and weights to assign to
these patterns when they are found. The file contains one confusable pattern per line. The patterns are expressed in the
edit script language of [sesdiff](https://github.com/proycon/sesdiff). Consider the following example:

```tsv
-[y]+[i]	1.1
```

This pattern expresses a deletion of the letter ``y`` followed by an insertion of ``i``, which comes down to a substitution
of ``y`` by ``i``. Edits that match this confusable pattern receive the weight *1.1*, meaning such an edit is
given preference over other edits, which by definition have weight *1.0*. Weights greater than
*1.0* are given preference in the score weighting; weights smaller than ``1.0`` imply a penalty. When multiple
confusable patterns match, the product of their weights is taken. The final weight is applied to the whole candidate
score, so weights should be values fairly close to ``1.0`` in order not to introduce too large bonuses/penalties.

The edit script language from sesdiff also allows for matching on immediate context. Consider the following variant of the above,
which only matches the substitution when it comes between two *s* characters (like in *analysys* -> *analysis*).

```tsv
=[s]-[y]+[i]=[s]	1.1
```

To force matches at the beginning or end, start the pattern with ``^`` or end it with ``$``, respectively. A further description of the edit script language
can be found in the [sesdiff](https://github.com/proycon/sesdiff) documentation.

A confusable list is loaded using the `read_confusablelist(filename)` method.
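
The effect of confusable weights on a candidate score can be illustrated with a small calculation. The sketch below assumes that the combined weight is simply multiplied with the candidate score, which is our reading of the description above rather than a statement about analiticcl's internals; the score of 0.8 is made up:

```python
import math

def apply_confusable_weights(score: float, matched_weights: list) -> float:
    """Scale a candidate score by the product of all matching confusable weights."""
    # ASSUMPTION: plain multiplication, without any clamping of the result
    return score * math.prod(matched_weights)  # an empty list gives factor 1.0

score = 0.8  # hypothetical candidate score
boosted = apply_confusable_weights(score, [1.1])     # one favoured pattern matches
mixed = apply_confusable_weights(score, [1.1, 0.9])  # a bonus and a penalty combine
unchanged = apply_confusable_weights(score, [])      # no pattern matches
print(boosted, mixed, unchanged)
```

Because the weights are multiplied, even a seemingly small value such as 1.5 would let a poor candidate overtake a much better one, which is why values close to 1.0 are advised.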

## Context information

In everything that we've seen thus far, except for the confusable lists and n-grams, context does not play a role yet. However, context is often one of the most important cues in language.
There are two ways to take context into account when searching via `find_all_matches()`: language modelling and/or via context rules.

### Language modelling

In order to consider context information, analiticcl can construct and apply a simple n-gram language model. The input for this language
model is an n-gram frequency list, provided through `read_lm(filename)`.

The input file should be a corpus-derived list of unigrams and bigrams, optionally also trigrams (and even all up to quintgrams if
needed; higher-order n-grams are not supported though). This is a TSV file containing the n-gram in the first column
(the space character acts as token separator), and the absolute frequency count in the second column. It is also recommended
that it contains the special tokens ``<bos>`` (begin of sentence) and ``<eos>`` (end of sentence). The items in this list are
**NOT** used for variant matching; use ``read_lexicon()`` in addition if you want to also match against
these items. It is fine to have an entry in both the language model and lexicon; it will be stored only once
internally to save memory.
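
An n-gram frequency list in the format just described can be derived from a tokenised corpus with a few lines of Python. The following is a minimal sketch on a two-sentence toy corpus; the helper `ngram_counts`, the whitespace tokenisation and the lowercasing are our own simplifications:

```python
from collections import Counter

def ngram_counts(sentences: list, max_order: int = 2) -> Counter:
    """Count all n-grams up to max_order, with <bos>/<eos> markers per sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<bos>"] + sentence.lower().split() + ["<eos>"]
        for order in range(1, max_order + 1):
            for i in range(len(tokens) - order + 1):
                # Tokens within an n-gram are separated by a single space
                counts[" ".join(tokens[i:i + order])] += 1
    return counts

sentences = ["We would like separate beds", "We would like breakfast"]
counts = ngram_counts(sentences)

# One n-gram per line: n-gram, tab, absolute frequency
lm_lines = [f"{ngram}\t{count}" for ngram, count in sorted(counts.items())]
print("\n".join(lm_lines))
```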

When language modelling is used, the following weights for `SearchParameters` become relevant and determine the balance between the language model and the variant model:

* ``lm_weight`` - The weight of the language model (float, default 1.0)
* ``variantmodel_weight`` - The weight of the variant model (float, default 3.0)

Note that even in the variant model, you *may* have a frequency component, whilst frequency information is also expressed in the language model. Take this into account when assigning weights if you don't want frequency information to take on too strong a role. Assigning weights is not trivial and not an exact science. It's a balancing act, and values are best determined experimentally by trying out some and seeing what works best for you. Typically the variant model should carry more weight than the language model, otherwise the language model will simply choose fortuitous words that have little relation with the original input.

### Context rules

Another way to consider context information is through context rules. The context rules define certain patterns that are
to be either favoured or penalized (similar to the confusable lists we saw before). The context rules are expressed in a tab separated file which can be passed to
analiticcl using ``read_contextrules(filename)``. The first column contains a sequence separated by semicolons, and the second a
score close to 1.0 (lower scores penalize the pattern, higher scores favour it):

```tsv
hello ; world	1.1
```

This means that if the words "hello world" appear as a solution for a text/sentence, its total context score will be boosted
(proportional to the length of the match), effectively preferring this solution over others. This context score is an
independent component in the final score function and its weight can be set using ``contextrules_weight`` in ``SearchParameters``. Its value is relative to both `lm_weight` and `variantmodel_weight`.

Note that the words also need to be in a lexicon you provide for a rule to work. You can express disjunctions using the
pipe character (``|``), as follows:

```tsv
hello|hi ; world|planet	1.1
```

This will match all four possible combinations. Rather than match the text, you can match specific lexicons you loaded
using the `@` prefix. This makes sense mainly if you use different lexicons and could be used as a form of elementary tagging:

```tsv
@greetings.tsv ; world	1.1
```

Here too you can create disjunctions using the pipe character:

```tsv
@greetings.tsv|@curses.tsv ; world	1.1
```

If you want to negate a match, just add ``!`` as a prefix. This also works in combination with ``@``, allowing you to match anything *except* the words from a particular lexicon. If you want to negate an entire disjunction, use parentheses like ``!(a|b|c)``.

There are two standalone characters you may use in matching:

* ``?`` - Matches anything
* ``^`` - Matches anything that does not match with *any* lexicon (i.e. out of vocabulary words)

Note that in all cases, you'll still need to explicitly load the lexicons (or variant lists) using ``read_lexicon()``, ``read_variants()``,
etc.

The rules are applied in the exact order you specify them. Note that any given word in a text may only match against
one pattern (the first that is found). When defining context rules, you'll generally want to specify longer rules before
shorter ones, as otherwise the longer rules might never be considered. In the following example, the second
pattern would never apply because the first one already matches:

```tsv
hello	1.1
hello ; world	1.1
```
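
If your rules are generated or collected from several sources, you can enforce the longest-first ordering programmatically before writing the file. A minimal sketch, with the helper `rule_length` being our own:

```python
def rule_length(line: str) -> int:
    """Number of elements in the sequence (first column) of a context rule."""
    pattern = line.split("\t")[0]
    return len(pattern.split(";"))

rules = [
    "hello\t1.1",
    "hello ; world\t1.1",
]

# sorted() is stable, so rules of equal length keep their original relative order
ordered = sorted(rules, key=rule_length, reverse=True)
print("\n".join(ordered))
```

After sorting, the two-word rule comes first, so it gets the chance to match before the single-word rule claims the word "hello".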

### Entity Tagging

Analiticcl can be used as a simple entity tagger using its context rules. Make sure you understand the above section before you
continue reading.

You may pass two additional tab-separated columns to the context rules file, the third column specifies a tag to assign
to any matches, and an *optional* fourth column specifies an offset for tagging (more about this later). For example:

```tsv
hello ; world	1.1	greeting
```

Any instances of "hello world" will be assigned the tag "greeting", more specifically "hello" will be assigned the tag
"greeting" and gets sequence number 0, "world" gets the same tag and sequence number 1.

If you want to tag only a subset and leave certain left or right context untagged, then you can do so by specifying an
offset (in matches aka words, not characters). Such an offset takes the form ``offset:length``. For example:

```tsv
hello ; world	1.1	greeting	1:1
```

In this case only the word "world" will get the tag greeting (and sequence number 0).
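
The ``offset:length`` notation and the resulting sequence numbers can be mimicked in a few lines of plain Python. This is only an illustration of the semantics described above; the helper `tagged_words` is hypothetical and not part of analiticcl:

```python
def tagged_words(words: list, span: str) -> list:
    """Return (word, sequence number) pairs selected by an 'offset:length' span."""
    offset, length = (int(part) for part in span.split(":"))
    selected = words[offset:offset + length]
    # Sequence numbers restart at 0 within the tagged subset
    return [(word, seq) for seq, word in enumerate(selected)]

matched = ["hello", "world"]       # the words matched by the rule 'hello ; world'
print(tagged_words(matched, "0:2"))  # tag the whole match
print(tagged_words(matched, "1:1"))  # tag only the second word
```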

It is also possible to assign multiple (even overlapping) tags with a single context rule. Use a semicolon to separate multiple tags and multiple tag offsets (there must be an equal number of each). However, it is not possible to apply multiple context rules once one has matched:

```tsv
@firstname.tsv ; @lastname.tsv	1.0	person;firstname;lastname	0:2;0:1;1:2
```

This mechanism can also be used to assign tags based on lexicons whilst allowing some form of lexicon weighting, even if
no further context is included:

```tsv
@greetings.tsv	1.0	greeting
in|to|from ; @city.tsv	1.1	location	1:1
@firstname.tsv ; @lastname.tsv	1.0	person
```



