# Analiticcl tutorial (using Python)

Analiticcl is an approximate string matching or fuzzy-matching system that can be used for spelling
correction or text normalisation (such as post-OCR correction or post-HTR correction). Texts can be checked against a
validated or corpus-derived lexicon (with or without frequency information) and spelling variants will be returned.

## Installation

Analiticcl can be invoked either from the command line or from Python via its Python binding.
In this tutorial, we will use the latter option and explore some of the functionality of analiticcl.

First of all, we need to install analiticcl. In a Jupyter Notebook this is simply accomplished as follows:

In [1]:
%pip install analiticcl

Defaulting to user installation because normal site-packages is not writeable
Collecting analiticcl
  Downloading analiticcl-0.4.4-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.4 MB)
Installing collected packages: analiticcl
Successfully installed analiticcl-0.4.4
Note: you may need to restart the kernel to use updated packages.


When working from the command line instead, create a Python virtual environment and install analiticcl in it as follows:

```
$ python -m venv env
$ . env/bin/activate
$ pip install analiticcl
```

Now that analiticcl is installed, we can import the module. As we usually only need three main classes, we import only these:

In [1]:
from analiticcl import VariantModel, Weights, SearchParameters

## Data preparation

Analiticcl doesn't do much out-of-the-box and is only as good as the data you feed it. It specifically needs *lexicons* or *variant lists* to operate; these contain the words or phrases that the system will match against.

**Advanced note:** All input for analiticcl must be UTF-8 encoded and use unix-style line endings. NFC unicode normalisation is strongly recommended.
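
If your own data does not yet meet these requirements, a small preprocessing step can help. The following is a minimal sketch using only the Python standard library; the function names `normalise_text` and `normalise_file` are hypothetical helpers for this tutorial and are not part of analiticcl:

```python
import unicodedata


def normalise_text(text: str) -> str:
    """Convert line endings to unix style and apply NFC unicode normalisation."""
    # Replace Windows (\r\n) and old Mac (\r) line endings with \n
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Compose base characters and combining marks (e.g. e + U+0301 becomes é)
    return unicodedata.normalize("NFC", text)


def normalise_file(source: str, target: str) -> None:
    """Hypothetical helper: rewrite a file as UTF-8, NFC, unix line endings."""
    # newline="" keeps the original line endings intact while reading
    with open(source, "r", encoding="utf-8", newline="") as f:
        text = f.read()
    # newline="\n" prevents Python from translating \n on write
    with open(target, "w", encoding="utf-8", newline="\n") as f:
        f.write(normalise_text(text))


# A decomposed é followed by a Windows line ending
print(repr(normalise_text("cafe\u0301\r\n")))
```

Running such a step over lexicons and input texts alike keeps visually identical characters from being treated as different ones.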

### Alphabet file

We first need an *alphabet file*, which defines all characters in the alphabet, grouping certain character variants together if desired. See the [README.md](README.md) for further documentation on this. Here we simply take the example alphabet file that is supplied with analiticcl. The alphabet file is a TSV file (tab-separated fields) containing all characters of the alphabet. Each line describes a single alphabet 'character'; all columns on the same line are considered equivalent variants of the same character from the perspective of analiticcl:

In [2]:
alphabet_file = "examples/simple.alphabet.tsv"

with open(alphabet_file,'r', encoding='utf-8') as f:
    print(f.read())

a	A	á	à	Á	À	ä	Ä	ã	Ã	â	Â
e	E	ë	é	è	ê	Ë	É	È	Ê	æ	Æ
o	O	ö	ó	ò	õ	ô	Ö	Ó	Ò	Õ	Ô	å	Å	ø	œ
i	I	ï	í	Í
u	U	ú	Ú	ü	Ü
y	Y
b	B
c	C
d	D
f	f
g	G
h	H
k	k
l	L
m	M
n	N	ñ	Ñ
p	P
r	R
s	S
t	T
j	J
v	V
w	W
q	Q
x	X
z	Z
"	``	''
'
\s	\t
.	,	:	?	!
0	1	2	3	4	5	6	7	8	9
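
To make the grouping concrete, the sketch below parses alphabet data in this format and maps every variant to the index of the line it appears on, so that variants on the same line end up in the same class. This only illustrates the file format: `parse_alphabet` is a hypothetical helper, not analiticcl's own loader, and escapes such as `\s` and `\t` are simply kept as literal strings here.

```python
def parse_alphabet(tsv_text: str) -> dict:
    """Map each character variant to the index of its alphabet line."""
    classes = {}
    index = 0
    for line in tsv_text.split("\n"):
        if not line.strip():
            continue  # skip empty lines
        # All tab-separated columns on one line are equivalent variants
        for variant in line.split("\t"):
            if variant:
                classes[variant] = index
        index += 1
    return classes


# A three-line excerpt in the same format as the alphabet file
sample = "a\tA\tá\nb\tB\nn\tN\tñ\n"
classes = parse_alphabet(sample)

print(classes["a"] == classes["á"])  # same line, so same class
print(classes["a"] == classes["b"])  # different lines, so different classes
```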



### Lexicon

In this tutorial we will use an English lexicon from the [GNU aspell](http://aspell.net/) project, a commonly used spell checker library. It simply contains one word per line. An example is supplied with analiticcl:

In [3]:
lexicon_file = "examples/eng.aspell.lexicon"
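
If you want a corpus-derived lexicon with frequency information instead, you could generate one yourself. The sketch below counts word forms in a text and produces one entry per line. It assumes a tab-separated layout with the word in the first column and its frequency in the second; check the README for the exact lexicon format before relying on this. The function `lexicon_lines` and the lowercasing step are choices made for this illustration only.

```python
import re
from collections import Counter


def lexicon_lines(corpus: str) -> list:
    """Return 'word<TAB>frequency' lines, most frequent words first."""
    # Extract runs of letters (unicode-aware), ignoring digits and punctuation
    words = re.findall(r"[^\W\d_]+", corpus.lower())
    counts = Counter(words)
    return [f"{word}\t{count}" for word, count in counts.most_common()]


corpus = "The cat saw the other cat. The end."
for line in lexicon_lines(corpus):
    print(line)
```

The resulting lines can be written to a UTF-8 file and passed to `read_lexicon` in the same way as the aspell lexicon.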

## Variant Model

### Building

We now have all we need to build our first variant model using analiticcl. A variant model enables quick and efficient matching of any input against the specified lexicons, and in doing so finds variants of the input (or variants of the lexicon entries; it's only a matter of perspective).

In [4]:
model = VariantModel(alphabet_file, Weights())

For the time being we're content with the default weights (more about these later), passed as the second parameter.

The model is still empty upon instantiation. We need to feed it with one or more lexicons. Let's pass the English aspell lexicon:

In [5]:
model.read_lexicon(lexicon_file)

After loading all lexicons, we build the model as follows:

In [10]:
model.build()

Computing anagram values for all items in the lexicon...
 - Found 119773 instances
Adding all instances to the index...
 - Found 108802 anagrams
Creating sorted secondary index...
Sorting secondary index...
 - Found 27 anagrams of length 1
 - Found 248 anagrams of length 2
 - Found 942 anagrams of length 3
 - Found 2593 anagrams of length 4
 - Found 5623 anagrams of length 5
 - Found 10163 anagrams of length 6
 - Found 14617 anagrams of length 7
 - Found 16911 anagrams of length 8
 - Found 16391 anagrams of length 9
 - Found 13930 anagrams of length 10
 - Found 10650 anagrams of length 11
 - Found 7194 anagrams of length 12
 - Found 4434 anagrams of length 13
 - Found 2459 anagrams of length 14
 - Found 1384 anagrams of length 15
 - Found 667 anagrams of length 16
 - Found 339 anagrams of length 17
 - Found 128 anagrams of length 18
 - Found 62 anagrams of length 19
 - Found 20 anagrams of length 20
 - Found 9 anagrams of length 21
 - Found 8 anagrams of length 22
 - Found 2 anagrams o

### Querying

Now that the model is built, we can query it. Let's first take an existing word that is in the model:

In [11]:
variants  = model.find_variants("separate", SearchParameters())
for variant in variants:
    print(variant)

{'text': 'separate', 'score': 1.0, 'dist_score': 1.0, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'text': 'separated', 'score': 0.8125, 'dist_score': 0.8125, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'text': 'separates', 'score': 0.8125, 'dist_score': 0.8125, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'text': 'separately', 'score': 0.75, 'dist_score': 0.75, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'text': "separate's", 'score': 0.75, 'dist_score': 0.75, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'text': 'separative', 'score': 0.734375, 'dist_score': 0.734375, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'text': 'separator', 'score': 0.71875, 'dist_score': 0.71875, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'text': 'separable', 'score': 0.703125, 'dist_score': 0.703125, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'

As expected, the word itself is returned with a perfect score of *1.0*, along with various lower-ranking variants. Each variant is represented as a dictionary with the following keys:

* ``text`` - The textual value (str) of the variant as it occurs in the lexicon
* ``score`` - The combined score of this variant (float), a weighted combination of `dist_score` and `freq_score`
* ``dist_score`` - The distance score (float). A perfect match always has score *1.0*.
* ``freq_score`` - The frequency score (float), in case lexicons have frequency information. The most frequent match always has score *1.0*.
* ``lexicons`` - The lexicons that were matched (list).
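
Because each variant is a plain dictionary, the results are easy to post-process in Python. As a small sketch, the code below keeps only variants with a combined score above a chosen minimum. The sample data is copied from the first four results shown above; `filter_variants` is a hypothetical helper and not part of analiticcl.

```python
# First four results of the "separate" query, copied from the output above
variants = [
    {'text': 'separate', 'score': 1.0, 'dist_score': 1.0, 'freq_score': 1.0,
     'lexicons': ['examples/eng.aspell.lexicon']},
    {'text': 'separated', 'score': 0.8125, 'dist_score': 0.8125, 'freq_score': 1.0,
     'lexicons': ['examples/eng.aspell.lexicon']},
    {'text': 'separates', 'score': 0.8125, 'dist_score': 0.8125, 'freq_score': 1.0,
     'lexicons': ['examples/eng.aspell.lexicon']},
    {'text': 'separately', 'score': 0.75, 'dist_score': 0.75, 'freq_score': 1.0,
     'lexicons': ['examples/eng.aspell.lexicon']},
]


def filter_variants(variants, min_score):
    """Keep variants whose combined score is at least min_score, best first."""
    kept = [v for v in variants if v['score'] >= min_score]
    # sorted() is stable, so equally scored variants keep their original order
    return sorted(kept, key=lambda v: v['score'], reverse=True)


for v in filter_variants(variants, 0.8):
    print(v['text'], v['score'])
```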

Let's now try it with misspelled input that is not in the model. Even though there is no exact match, we expect the properly spelled variant to come out on top:

In [12]:
variants = model.find_variants("seperate", SearchParameters())
for variant in variants:
    print(variant)

{'text': 'separate', 'score': 0.734375, 'dist_score': 0.734375, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'text': 'operate', 'score': 0.6875, 'dist_score': 0.6875, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'text': 'desperate', 'score': 0.6875, 'dist_score': 0.6875, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'text': 'temperate', 'score': 0.6875, 'dist_score': 0.6875, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'text': 'serrate', 'score': 0.65625, 'dist_score': 0.65625, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'text': 'exasperate', 'score': 0.625, 'dist_score': 0.625, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'text': 'separated', 'score': 0.609375, 'dist_score': 0.609375, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lexicon']}
{'text': 'separates', 'score': 0.609375, 'dist_score': 0.609375, 'freq_score': 1.0, 'lexicons': ['examples/eng.aspell.lex
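
To build some intuition for this ranking, it helps to count the edits that separate the input from a candidate. The sketch below implements the optimal string alignment variant of Damerau-Levenshtein distance, in which insertions, deletions, substitutions and transpositions of adjacent characters each cost 1. It is an independent illustration: which exact variant analiticcl uses internally, and how the distance feeds into `dist_score`, are not covered here, and `edit_distance` is a hypothetical name.

```python
def edit_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between a[:i] and b[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(edit_distance("seperate", "separate"))  # one substitution: 1
print(edit_distance("seperate", "operate"))   # one deletion and one substitution: 2
```

The correct spelling is a single substitution away from the input, whereas a candidate like `operate` needs two edits, which is in line with `separate` being ranked first.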

The `find_variants` method is used to *query* the model directly. Parameters can be specified as part of ``SearchParameters`` using keyword arguments. The following are supported:

* ``max_edit_distance`` - Maximum edit distance (Damerau-Levenshtein). Insertions, deletions, substitutions and transpositions all have the same cost (1). It is recommended to set this value slightly lower than the maximum anagram distance. This may take an absolute integer value, i.e. the number of edit operations, a floating point value in the range 0-1 to express a ratio of the total length of the text fragment under consideration, or a tuple of a floating point value and an integer (same interpretation as above) with the integer acting as a limit.
* ``max_anagram_distance`` - Maximum anagram distance (a heuristic approximation of edit distance). This may take an absolute integer value, i.e. the difference in characters (regardless of order), a floating point value in the range 0-1 to express a ratio of the total length of the text fragment under consideration, or a tuple of a floating point value and an integer (same interpretation as above) with the integer acting as a limit.
* ``score_threshold`` - Require scores to meet this threshold (float); variants scoring below it are pruned.
* ``cutoff_threshold`` - Cut-off threshold: if a score in the ranking is a specific factor greater than the best score, the ranking will be cut-off at that point and the score not included. Should be set to a value like 2.
* ``freq_weight`` - Weight attributed to the frequency information in frequency reranking, in relation to the similarity (distance) component (0 = disabled).
* ``max_matches`` - Number of matches to return per input (set to 0 for unlimited if you want to exhaustively return every possibility within the specified anagram and edit distance).
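
The three ways of specifying a distance threshold (integer, float, or tuple) can be illustrated with a short sketch that converts each form to an absolute maximum for an input of a given length. This is our reading of the description above, not analiticcl's actual code: `resolve_max_distance` is a hypothetical helper, and rounding down is an assumption.

```python
import math


def resolve_max_distance(threshold, length: int) -> int:
    """Turn an int, float or (float, int) threshold into an absolute maximum."""
    if isinstance(threshold, tuple):
        # (ratio, limit): relative to the input length, capped by the limit
        ratio, limit = threshold
        return min(math.floor(ratio * length), limit)
    if isinstance(threshold, float):
        # ratio of the total length of the input (assumed to round down)
        return math.floor(threshold * length)
    # absolute integer value, independent of the input length
    return int(threshold)


word = "seperate"  # 8 characters
print(resolve_max_distance(2, len(word)))         # absolute value
print(resolve_max_distance(0.25, len(word)))      # a quarter of the length
print(resolve_max_distance((0.5, 3), len(word)))  # half the length, at most 3
```

Under this reading, a call such as `SearchParameters(max_edit_distance=(0.5, 3))` would allow longer inputs more edits than short ones, but never more than three.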

### Searching









