# Analiticcl tutorial (using Python)

Analiticcl is an approximate string matching or fuzzy-matching system that can be used for spelling
correction or text normalisation (such as post-OCR correction or post-HTR correction). Texts can be checked against a
validated or corpus-derived lexicon (with or without frequency information) and spelling variants will be returned.

## Installation

Analiticcl can be invoked either from the command line or from Python using the Python binding.
In this tutorial, we will use the latter option and explore some of the functionality of analiticcl.

First, we need to install analiticcl. In a Jupyter Notebook, this is done as follows:

In [None]:
%pip install analiticcl

When working from the command line instead, do the following to create a Python virtual environment and install analiticcl in it:

```
$ python -m venv env
$ . env/bin/activate
$ pip install analiticcl
```
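
After installing, it can be useful to confirm that the package is visible to the Python interpreter you are actually running (for instance, that the virtual environment is active). The following is a minimal sketch using only the standard library; the helper name `is_installed` is introduced here for illustration and is not part of analiticcl:

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the named top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


# Reports whether the analiticcl package can be imported from this interpreter.
print("analiticcl importable:", is_installed("analiticcl"))
```

If this prints `False`, the most likely cause is that the package was installed into a different environment than the one running the notebook or script.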

Now that analiticcl is installed, we can import the module. We usually need only three main classes, so we import just these:

In [None]:
from analiticcl import VariantModel, Weights, SearchParameters

## Data preparation

Analiticcl doesn't do much out of the box and is only as good as the data you feed it. It specifically needs *lexicons* or *variant lists* to operate. These contain the words or phrases that the system will match against.

**Advanced note:** All input for analiticcl must be UTF-8 encoded and use Unix-style line endings. NFC Unicode normalisation is strongly recommended.
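
If your data does not yet meet these requirements, a small preprocessing step can bring it in line. The sketch below uses only the Python standard library; the function names `normalise_text` and `normalise_file` are hypothetical helpers introduced here, not part of analiticcl, and the sketch assumes the source file is already readable as UTF-8:

```python
import unicodedata


def normalise_text(text: str) -> str:
    """Convert line endings to Unix style and apply NFC normalisation."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return unicodedata.normalize("NFC", text)


def normalise_file(source_path: str, target_path: str) -> None:
    """Read a UTF-8 file and write a normalised copy to target_path."""
    # newline="" keeps the original line endings intact while reading,
    # so that normalise_text can convert them explicitly.
    with open(source_path, "r", encoding="utf-8", newline="") as source:
        cleaned = normalise_text(source.read())
    # newline="\n" prevents Python from translating "\n" on output.
    with open(target_path, "w", encoding="utf-8", newline="\n") as target:
        target.write(cleaned)
```

Running `normalise_file` on a lexicon before loading it means that a decomposed sequence such as `e` followed by a combining acute accent is stored as the single precomposed character `é`.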

### Alphabet file

We first need an *alphabet file*, which defines all characters in the alphabet and groups certain character variants together if desired. See the [README.md](README.md) for further documentation on this. We simply take the example alphabet file that is supplied with analiticcl. The alphabet file is a TSV file (tab-separated fields) containing all characters of the alphabet. Each line describes a single alphabet 'character'; all columns on the same line are considered equivalent variants of the same character from the perspective of analiticcl:

In [1]:
alphabet_file = "examples/simple.alphabet.tsv"

with open(alphabet_file, 'r', encoding='utf-8') as f:
    print(f.read())

a	A	á	à	Á	À	ä	Ä	ã	Ã	â	Â
e	E	ë	é	è	ê	Ë	É	È	Ê	æ	Æ
o	O	ö	ó	ò	õ	ô	Ö	Ó	Ò	Õ	Ô	å	Å	ø	œ
i	I	ï	í	Í
u	U	ú	Ú	ü	Ü
y	Y
b	B
c	C
d	D
f	f
g	G
h	H
k	k
l	L
m	M
n	N	ñ	Ñ
p	P
r	R
s	S
t	T
j	J
v	V
w	W
q	Q
x	X
z	Z
"	``	''
'
\s	\t
.	,	:	?	!
0	1	2	3	4	5	6	7	8	9
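
To make the grouping concrete, the sketch below parses a few alphabet lines into a lookup table that maps every variant to the first column of its line. This only illustrates the file format: how analiticcl represents the alphabet internally is not described here, and treating the first column as the "canonical" form is an assumption made for this example. The name `parse_alphabet` and the sample data are likewise introduced for illustration.

```python
def parse_alphabet(tsv_text: str) -> dict[str, str]:
    """Map every character variant to the first column of its line."""
    canonical = {}
    for line in tsv_text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        variants = line.split("\t")
        for variant in variants:
            canonical[variant] = variants[0]
    return canonical


# A three-line excerpt in the same tab-separated layout as the file above.
sample = "a\tA\tá\tà\nn\tN\tñ\tÑ\ny\tY\n"
mapping = parse_alphabet(sample)
print(mapping["Ñ"], mapping["à"], mapping["Y"])  # prints: n a y
```

With such a grouping, `Ñ` and `n` count as the same alphabet character, so differences in case or diacritics within a line do not count as distinct characters.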



### Lexicon

In this tutorial we will use an English lexicon from the [GNU aspell](http://aspell.net/) project, a commonly used spell checker library. It simply contains one word per line. An example is supplied with analiticcl:

In [None]:
lexicon_file = "examples/eng.aspell.lexicon"
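
To get a feel for the lexicon format, the following sketch reads entries line by line. It assumes that frequency information, when present, is given as a second tab-separated column holding an integer. That layout is an assumption for this example, so check the README for the exact format analiticcl expects. The helper `read_lexicon_entries` and the sample frequency value are made up for illustration.

```python
import io
from typing import Iterator, Optional, TextIO


def read_lexicon_entries(handle: TextIO) -> Iterator[tuple[str, Optional[int]]]:
    """Yield (word, frequency) pairs; frequency is None if the column is absent."""
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        word, _, frequency = line.partition("\t")
        yield (word, int(frequency) if frequency else None)


# Illustrative in-memory lexicon: one plain entry and one with a made-up frequency.
sample = io.StringIO("separate\ndesperate\t42\n")
entries = list(read_lexicon_entries(sample))
print(entries)  # prints: [('separate', None), ('desperate', 42)]
```

The same function can be applied to a real file by passing `open(lexicon_file, encoding="utf-8")` instead of the `StringIO` object.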

## Variant Model

We now have all we need to build our first variant model using analiticcl. A variant model quickly and efficiently matches any input against the specified lexicons, and in doing so finds variants of the input (or variants of the lexicon entries; it is only a matter of perspective).
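
As a conceptual aside, the idea of matching input against a lexicon can be tried out with Python's standard library alone. The sketch below uses `difflib.get_close_matches`, which is a different and much simpler technique than the one analiticcl implements; it is shown only to illustrate what "finding variants in a lexicon" means. The three-word `toy_lexicon` is made up for this example.

```python
from difflib import get_close_matches

# A tiny stand-in lexicon; a real lexicon has many thousands of entries.
toy_lexicon = ["separate", "desperate", "operate"]

# Return up to three lexicon entries that resemble the misspelled input,
# ordered from most to least similar.
matches = get_close_matches("seperate", toy_lexicon, n=3, cutoff=0.6)
print(matches)
```

The most similar entry comes first in the returned list. Comparing the input with every lexicon entry in this way does not scale to large lexicons, which is why a dedicated variant model is built in the next step.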

In [None]:
model = VariantModel(alphabet_file, Weights())

For the time being, we are content with the default weights (more about these later), passed as the second parameter.

The model is still empty upon instantiation. We need to feed it with one or more lexicons. Let's pass the English aspell lexicon:

In [None]:
model.read_lexicon(lexicon_file)
model.build()  # build the model once all lexicons have been read

Now that the model is loaded, we can query it. Let's first take an existing word that's in the model:

In [None]:
variants = model.find_variants("separate")
print(variants)

Now let's try it with misspelled input that is not in the model. We expect the properly spelled variant to come out on top:

In [None]:
variants = model.find_variants("seperate")
print(variants)
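
The raw printout can be hard to read when many variants are returned. Assuming the result consists of plain, JSON-serialisable Python structures (lists, dictionaries, strings and numbers), it can be pretty-printed with the standard `json` module. The helper `format_variants` and the placeholder data below are introduced for illustration only and do not reflect actual analiticcl output:

```python
import json


def format_variants(variants) -> str:
    """Render a result structure as indented JSON, keeping non-ASCII characters readable."""
    return json.dumps(variants, ensure_ascii=False, indent=4)


# Placeholder structure, NOT real analiticcl output; it only shows the formatting.
placeholder = [{"example_key": "naïve"}]
print(format_variants(placeholder))
```

In the notebook, the same helper would be called as `print(format_variants(variants))`.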