# Tokenizing Strategy

BoTokenizer is nothing more than a convenience class that combines several tools in pybo into a tokenizer.

If many of the available tools are used in the tokenizer, it is because pybo's initial aim was to answer the need for a Tibetan tokenizer. As other needs arise, other tools and modules will be added.

Tokenizing means correctly identifying words within an input string.

If Tibetan were a language with clear and unambiguous word boundaries, we might have opted for a negative strategy: identifying the words by identifying what lies between them, and then finding an appropriate way to deal with the exceptions.

However, Tibetan only separates syllables with tseks (dots), so we need to adopt the opposite strategy: for a given starting point in the input string, find which words can fit in the following syllables, decide which of the candidates is correct, and then look for the next word from the point where the chosen one ends.

Given the words `ab`, `abc`, `ba` and `cde` and the input string `abcdefgh`, here is how we would proceed:

1. Starting point: 0.
2. Check how many words can fit in from that point: `ab` and `abc` are found
3. Decide what word is the correct one in the present context: `ab` (arbitrary decision in this example)
4. Starting point: 2.
5. Check how many words can fit in from that point: `cde` is found
6. Decide what word is the correct one: `cde` (the only one)
7. Starting point: 5.
8. Check how many words can fit in from that point: none.
9. `f` is decided to be a non-word token.
10. Starting point: 6.
11. Check how many words can fit in from that point: none.
12. `g` is decided to be a non-word token.
13. Starting point: 7.
14. Check how many words can fit in from that point: none.
15. `h` is decided to be a non-word token.

In the end, the chain of tokens is: [`ab`, `cde`, `f`, `g`, `h`]
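The walkthrough above can be sketched in a few lines of Python. This is only an illustration of the strategy, not pybo's implementation: `tokenize` is a hypothetical helper, the word list is a plain Python list rather than a trie, and picking the shortest candidate is an arbitrary rule that merely reproduces the choice of `ab` made in step 3.

```python
def tokenize(text, words):
    """Greedy left-to-right segmentation over a known word list."""
    tokens = []
    start = 0
    while start < len(text):
        # Collect every word that fits at the current starting point.
        candidates = [w for w in words if text.startswith(w, start)]
        if candidates:
            # Pick one candidate. Choosing the shortest is arbitrary and
            # only mirrors the decision made in the walkthrough.
            token = min(candidates, key=len)
        else:
            # No word fits: emit a single character as a non-word token.
            token = text[start]
        tokens.append(token)
        # The next word starts right after the chosen token.
        start += len(token)
    return tokens


print(tokenize("abcdefgh", ["ab", "abc", "ba", "cde"]))
# ['ab', 'cde', 'f', 'g', 'h']
```

A real tokenizer would replace the arbitrary rule with a decision based on context, which is where most of the difficulty lies.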

# Building-blocks of a tokenizer

From the above, we see the tokenizer will revolve around two things:

 1. lexical resources: a list of all the valid Tibetan words
 2. a mechanism to walk the input string while deciding where tokens start and end.

## 1. Lexical resources

The raw lists of words are stored [here](./pybo/resources/trie) inside the `resources` folder of pybo.

These lists need to be crafted with great care. The produced tokens directly depend on the content of these lists. Let's say we had an additional word `fg` in the example above, or that the word `ab` was missing. The output would be completely different.
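To make this sensitivity concrete, here is a small sketch that reruns the toy example with a modified word list. `segment` is a hypothetical helper that applies the same arbitrary rule as the walkthrough (the shortest candidate wins); it is not part of pybo.

```python
def segment(text, words):
    # Same toy strategy as in the walkthrough: the shortest candidate wins,
    # and a character with no candidate becomes a non-word token.
    tokens, start = [], 0
    while start < len(text):
        candidates = [w for w in words if text.startswith(w, start)]
        token = min(candidates, key=len) if candidates else text[start]
        tokens.append(token)
        start += len(token)
    return tokens


base = ["ab", "abc", "ba", "cde"]
with_fg = segment("abcdefgh", base + ["fg"])
without_ab = segment("abcdefgh", [w for w in base if w != "ab"])

print(with_fg)     # ['ab', 'cde', 'fg', 'h']
print(without_ab)  # ['abc', 'd', 'e', 'f', 'g', 'h']
```

Removing the single entry `ab` shifts every later boundary, so `cde` is never found even though it is still in the list.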

The difficulty in deciding which words to add or remove lies in drawing the line between a regular word and a compound. Apart from a few exceptions, we don't want compounded expressions, compounded words or concatenated words in our lists. Yet Tibetan dictionaries contain many of them, and some even list full titles as entries.

Because of this situation, we provide means to add or deactivate entries in the trie structure used to host the lexical entries in the tokenizer.

### pybotrie.py

PyBoTrie builds on BasicTrie (that is, it subclasses it) to provide higher-level facilities used for tokenizing.

#### Profiles

#### Building and saving the trie

In [1]:
from pybo import PyBoTrie, BoSyl

trie = PyBoTrie(BoSyl(), profile='POS')

Loading Trie...
Time: 2.3894572257995605


As the printed message shows, PyBoTrie only loaded a trie that had been pickled and saved on disk. It checks for the existence of a trie on disk before deciding to build one from the lexical resources.
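The load-or-build behaviour can be sketched with the standard `pickle` module. This is a generic caching pattern under our own assumptions, not PyBoTrie's actual code: `load_or_build`, `build_word_set`, the `rebuild` flag and the file name are all hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(cache_path, build, rebuild=False):
    """Return the cached object if present, otherwise build and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists() and not rebuild:
        # Fast path: reuse what was pickled on a previous run.
        with cache_path.open("rb") as f:
            return pickle.load(f)
    # Slow path: build from the raw resources, then save for next time.
    obj = build()
    with cache_path.open("wb") as f:
        pickle.dump(obj, f)
    return obj


calls = []


def build_word_set():
    # Stand-in for the expensive step of reading the word lists.
    calls.append("build")
    return {"ab", "abc", "ba", "cde"}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "trie.pickled"
    first = load_or_build(path, build_word_set)                # builds and saves
    second = load_or_build(path, build_word_set)               # loads from disk
    third = load_or_build(path, build_word_set, rebuild=True)  # forces a rebuild

print(len(calls))  # 2
```

Only the first and third calls run the build step, which is why loading is so much faster than building.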

You can also choose to rebuild a trie like so:

In [2]:
trie.rebuild_trie()

building Trie... Time: 6.829641819000244


This functionality comes in handy when you modify an existing trie, either by deactivating a word or by adding entries:

In [3]:
word = 'བཀྲ་ཤིས་'
print(trie.has_word(word))  # inherited from BasicTrie

{'exists': True, 'data': 'NOUNᛃᛃᛃ'}


#### Using the trie

`has_word()` loops over every character in its argument and walks down the trie. If it cannot reach the end of the input string, the word is not present in the lexical resources. If it does reach the end, it checks that this is also the end of a word (to avoid matching half-words) and returns the information stored with that word.
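The lookup logic described here can be reproduced with a toy trie. The sketch below is written from scratch for illustration: `Node`, `ToyTrie` and the `leaf` attribute are our own names, and nothing here is taken from BasicTrie's source.

```python
class Node:
    def __init__(self):
        self.children = {}   # maps one character to the next Node
        self.leaf = False    # True when a word ends on this node
        self.data = None     # information stored with the word


class ToyTrie:
    def __init__(self):
        self.head = Node()

    def add(self, word, data=None):
        node = self.head
        for char in word:
            node = node.children.setdefault(char, Node())
        node.leaf = True
        node.data = data

    def has_word(self, word):
        node = self.head
        for char in word:
            node = node.children.get(char)
            if node is None:
                # Could not walk to the end of the input.
                return {"exists": False}
        if not node.leaf:
            # Reached the end of the input, but only part of a longer word.
            return {"exists": False}
        return {"exists": True, "data": node.data}


toy = ToyTrie()
toy.add("abc", data="NOUN")
print(toy.has_word("abc"))  # {'exists': True, 'data': 'NOUN'}
print(toy.has_word("ab"))   # {'exists': False}
print(toy.has_word("abd"))  # {'exists': False}
```

The second lookup shows the half-word case: `ab` can be walked in full, but no entry ends there.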

Here is what is actually happening:

In [4]:
# setting the current node to the root of the trie
current_node = trie.head
print(f'initial value of current_node: {list(current_node.children.keys())}\n')
for n, letter in enumerate(word):
    print(f'{n}: "{letter}"', end='\t')
    if current_node:  # ensures we can continue walking
        print(f'letter is in current_node: {letter in current_node.children}')
        current_node = trie.walk(letter, current_node)
        if current_node:  # walk() may have hit a dead end
            print(f'\tnew value: {list(current_node.children.keys())}\n')

initial value of current_node: ['ཀ', 'ག', 'ད', 'བ', 'མ', 'འ', 'ས', 'ཡ', 'ཐ', 'ར', 'ཤ', 'ན', 'པ', 'ཅ', 'ཌ', 'ཧ', 'ཏ', 'ལ', 'ཨ', 'ཛ', 'ཙ', 'ཝ', 'ཁ', 'ང', 'ཆ', 'ཇ', 'ཉ', 'ཕ', 'ཚ', 'ཞ', 'ཟ', 'ཊ', 'ཥ', '༺', '༐', 'ཪ', 'ྐ', '྄', 'ཋ', 'ཎ', ' ', 'ྨ', 'ྴ']

0: "བ"	letter is in current_node: True
	new value: ['ཅ', 'ཙ', 'ར', 'ཱ', 'ི', 'ུ', 'ཻ', 'ཛ', 'ེ', 'ྷ', 'ན', '་', 'ཀ', 'ས', 'ག', 'ཏ', 'ལ', 'ད', 'འ', 'ང', 'བ', 'མ', 'ོ', 'ྱ', 'ྲ', 'ླ', 'ཞ', 'ཟ', 'ཤ', 'ཉ', 'ཊ', 'ཎ', 'པ', 'ཕ', 'ཡ', 'ཥ', 'ཧ', 'ཽ', 'ཾ', 'ྠ', 'ྨ', 'ྀ']

1: "ཀ"	letter is in current_node: True
	new value: ['ག', 'ང', 'ར', 'ས', 'འ', 'ུ', 'ོ', 'ྲ', 'ད', 'ན', 'བ', 'ལ', 'ྱ', 'ླ', 'ྟ', 'ྐ', 'ྚ', 'ྭ', 'ྵ']

2: "ྲ"	letter is in current_node: True
	new value: ['་', 'མ', 'ེ', 'ག', 'བ', 'ལ', 'ས', 'ི', 'ུ', 'ོ', 'ྀ', 'ཱ', 'ཾ', 'ཿ']

3: "་"	letter is in current_node: True
	new value: ['ཤ', 'བ', 'ཝ', 'ར', 'མ']

4: "ཤ"	letter is in current_node: True
	new value: ['ི', 'ྲ']

5: "ི"	letter is in current_node: True
	new value: ['ས']

6: "ས"	letter is in

Now that we have fed the whole word to the trie and `current_node` is still not `None`, we have to check that we are not in the middle of a word:

In [5]:
print(f'word exists as an entry in the lexical resources: {current_node.is_match()}')
print(f'the data about this word stored in the trie is: "{current_node.data}"')

word exists as an entry in the lexical resources: True
the data about this word stored in the trie is: "NOUNᛃᛃᛃ"


In short, to use PyBoTrie we need to:
 - store the current node in a variable (`trie.head` the first time)
 - use `walk()` to go one step down the trie
 - use `is_match()` to know if we have a match or not
 - use the content of the other attributes of the node (`data`, `freq` and `skrt`)
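These four steps are also what lets a tokenizer collect every word that fits from a given starting point. The sketch below uses the names shown above (`head`, `walk()`, `is_match()`, `data`) against a stand-in trie so that it runs without pybo. `StubNode`, `StubTrie` and `matches_from` are hypothetical, and we assume that `walk()` returns `None` when no child matches.

```python
class StubNode:
    def __init__(self):
        self.children = {}
        self.leaf = False
        self.data = None

    def is_match(self):
        return self.leaf


class StubTrie:
    def __init__(self, entries):
        self.head = StubNode()
        for word, data in entries.items():
            node = self.head
            for char in word:
                node = node.children.setdefault(char, StubNode())
            node.leaf, node.data = True, data

    def walk(self, char, node):
        # Returns the child reached by `char`, or None at a dead end.
        return node.children.get(char)


def matches_from(trie, text, start=0):
    """List (word, data) for every entry that begins at `start`."""
    found = []
    current_node = trie.head                               # start from the root
    for end in range(start, len(text)):
        current_node = trie.walk(text[end], current_node)  # one step down
        if current_node is None:
            break                                          # nothing longer can match
        if current_node.is_match():                        # a complete word ends here
            found.append((text[start:end + 1], current_node.data))
    return found


stub = StubTrie({"ab": "NOUN", "abc": "VERB", "ba": "NOUN", "cde": "NOUN"})
print(matches_from(stub, "abcdefgh"))     # [('ab', 'NOUN'), ('abc', 'VERB')]
print(matches_from(stub, "abcdefgh", 2))  # [('cde', 'NOUN')]
print(matches_from(stub, "abcdefgh", 5))  # []
```

Because `is_match()` is checked at every step rather than only at the end, a single walk yields all the candidates among which the tokenizer then has to choose.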

#### Modifying the trie

In [6]:
trie.remove_word(word)  # inherited from BasicTrie
print(trie.has_word(word))

trie.inflect_n_add(word, ins='data', pos='NOUN')
print(trie.has_word(word))

{'exists': False}
{'exists': True, 'data': 'NOUNᛃᛃᛃ'}


`inflect_n_add()` adds 

Now, we can check for the existence of a given word in our trie:

In [7]:
trie.has_word('བཀྲ་ཤིས་')

{'exists': True, 'data': 'NOUNᛃᛃᛃ'}

Let's try to see if we have a couple of other words:

In [8]:
worda = 'མཐའ་'
worda_affixed1 = 'མཐར་'
worda_affixed2 = 'མཐའིའོ་'
for word in [worda, worda_affixed1, worda_affixed2]:
    print(trie.has_word(word))

{'exists': True, 'data': 'NOUNᛃᛃᛃ'}
{'exists': True, 'data': 'NOUNᛃlaᛃ1ᛃaa'}
{'exists': True, 'data': 'NOUNᛃgi+oᛃ4ᛃaa'}


All three words exist, yet they carry different data:

The first one doesn't have any information about affixed particles.

The second one says that:
 - the word's Part-Of-Speech is a noun
 - the affixed particle is a ladon (ལ་དོན།)
 - the affixed particle spans 1 character (ར)
 - the hosting word ends with a འ in its unaffixed form (མཐའ་).

The third one shows that:
 - its POS is also a noun
 - the affixed particles are:
     - a genitive (བྱེད་སྒྲ་): gi (གི་) is the chosen canonical form of this case
     - a terminative (རྫོགས་ཚིག་): o (འོ་) is the chosen form.
 - the affixed particles span 4 characters (འིའོ)
 - the hosting word ends with a འ in its unaffixed form (མཐའ་).
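Reading these strings by hand quickly becomes tedious, so here is a small sketch that splits them into fields and rebuilds the unaffixed form. It is based only on the three outputs shown above: the field names, `parse_data` and `unaffixed` are our own inventions, and the reconstruction rule (drop the affixed characters, restore འ when the last field is `aa`) is an inference from this example rather than documented pybo behaviour.

```python
SEP = "ᛃ"           # field separator seen in the printed data strings
TSEK = "\u0f0b"     # the tsek, ་
A_CHUNG = "\u0f60"  # the letter འ


def parse_data(data):
    """Split a data string into named fields (names are illustrative)."""
    pos, affix, length, aa = data.split(SEP)
    return {
        "pos": pos,
        "affixes": affix.split("+") if affix else [],
        "affix_len": int(length) if length else 0,
        "aa": aa == "aa",
    }


def unaffixed(word, info):
    """Rebuild the host word by dropping the affixed characters."""
    if not info["affix_len"]:
        return word
    core = word.rstrip(TSEK)
    core = core[:-info["affix_len"]]
    if info["aa"]:
        core += A_CHUNG  # restore the final འ of the host word
    return core + TSEK


info = parse_data("NOUNᛃgi+oᛃ4ᛃaa")
print(info["affixes"], info["affix_len"], info["aa"])  # ['gi', 'o'] 4 True
print(unaffixed("མཐའིའོ་", info) == "མཐའ་")  # True
```

Applied to the second word, the same helper turns མཐར་ back into མཐའ་ by dropping one character and restoring the འ.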

In [11]:
wordb = 'རྒྱ་མཚོ་'
wordb_affixed1 = 'རྒྱ་མཚོའམ་'
wordb_affixed2 = 'རྒྱ་མཚོའང་'
for word in [wordb, wordb_affixed1, wordb_affixed2]:
    print(trie.has_word(word))

{'exists': True, 'data': 'NOUNᛃᛃᛃ'}
{'exists': True, 'data': 'OTHERᛃᛃᛃ'}
{'exists': True, 'data': 'NOUNᛃangᛃ2ᛃ'}
