usage.py prints redundant characters ᛃᛃᛃ #4

Closed
mikkokotila opened this issue Mar 24, 2018 · 3 comments

Comments

@mikkokotila
Contributor

When I run the usage.py code in a Jupyter Notebook, I get this:

Loading Trie...
Time: 5.155517816543579
" ཤི་"/VERBᛃᛃᛃ, "བཀྲ་ཤིས་  "/NOUNᛃᛃᛃ, "tr"/non, " བདེ་་ལེ གས"/NOUNᛃᛃᛃ, "།"/punct, " བཀྲ་ཤིས་"/NOUNᛃᛃᛃ, "བདེ་ལེགས་"/NOUNᛃᛃᛃ, "ཀཀ"/non

Are these ᛃᛃᛃ redundant, or is something not printing properly? It seems like I'm getting a response for the whole original string, so I'm guessing they're redundant?

@drupchen
Collaborator

This is the current expected behaviour of BoTrie when the inflect_n_add method is used.
The idea is that a tag is different from a POS in that it contains more information than the part of speech per se (which is nothing more than a primary UD tag).

Simply put, imagine you had inflected the entry while creating the Trie, and imagine the input string you were tokenizing contained 'ཤིའོ་'; you would end up with the following:

" ཤིའོ་"/VERBᛃoᛃ2ᛃFalse,

which means that your token is inflected in the terminal case (o), and that in order to reconstruct the unaffixed token, you need to delete 2 chars from the cleaned syllable and not add a འ (False) at the end of the token.

The extra information here pertains to the syllables that have affixed case particles. In order to correctly tokenize affixed words, we want to have the info required to reconstruct the unaffixed word, as well as the full version of the case particle.
This is embedded in three fields delimited by the characters you are referring to.
The content of these fields is produced here.
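
For what it's worth, here is a minimal sketch of how those ᛃ-delimited fields could be unpacked and used to rebuild the unaffixed form. The helper names are made up for illustration, this is not the library's own API, and it assumes it is given the cleaned syllable (no tsek):

def parse_tag(tag):
    # 'VERBᛃoᛃ2ᛃFalse' -> ('VERB', 'o', 2, False); 'NOUNᛃᛃᛃ' -> ('NOUN', '', 0, False)
    pos, affix, to_delete, needs_aa = tag.split('ᛃ')
    return pos, affix, int(to_delete) if to_delete else 0, needs_aa == 'True'

def unaffixed(cleaned_syllable, tag):
    # drop the affixed particle's chars and optionally restore a final འ
    pos, affix, to_delete, needs_aa = parse_tag(tag)
    base = cleaned_syllable[:-to_delete] if to_delete else cleaned_syllable
    return base + 'འ' if needs_aa else base

print(unaffixed('ཤིའོ', 'VERBᛃoᛃ2ᛃFalse'))  # -> ཤི : delete 2 chars, no འ appended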

Anyhow, I will be documenting all this, so you will have a clearer idea of what is happening and can decide whether or not you want to modify the behaviour I coded.

@mikkokotila
Contributor Author

Coming back to this... maybe this could be reopened to discuss the best way to reach the goal you have. At the moment it seems that the extra characters in the token attribute 'tag' are redundant (as they're always exactly the same). Is it possible to remove them? They will cause a lot of confusion (even if documented) and give the impression that something is broken.

'NOUNᛃᛃᛃ',
'NOUNᛃᛃᛃ',
'ADPᛃᛃᛃ',
'NOUNᛃᛃᛃ',
'VERBᛃᛃᛃ',
'NOUNᛃᛃᛃ',
'VERBᛃᛃᛃ',
'ADPᛃᛃᛃ',
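
Just to illustrate, a possible post-processing workaround (sketched here, not a library feature): strip the ᛃ-delimited fields whenever they carry no affixation info, so 'NOUNᛃᛃᛃ' becomes plain 'NOUN' while a tag like 'VERBᛃoᛃ2ᛃFalse' is left untouched.

def clean_pos(tag):
    fields = tag.split('ᛃ')
    # keep the full tag only when some affixation field carries information
    return tag if any(fields[1:]) else fields[0]

tags = ['NOUNᛃᛃᛃ', 'VERBᛃᛃᛃ', 'VERBᛃoᛃ2ᛃFalse']
print([clean_pos(t) for t in tags])  # ['NOUN', 'VERB', 'VERBᛃoᛃ2ᛃFalse']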

@mikkokotila
Contributor Author

Ok, I take it back... I can now see examples where it's not redundant. Nevertheless, the question of what would be the cleanest way to achieve what you're after still stands.
