Skip to content
/ triethai Public

Trie for search, approximate search, and spell correction. Designed for Thai language, in mind, but should be pretty generic for other languages :)

License

Notifications You must be signed in to change notification settings

peune/triethai

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

22 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TrieThai

Trie for storing dictionary with search, approximate search, and spell correction functions. Designed for Thai language, in mind, but should be pretty generic for other languages :)

ทรัยสำหรับเก็บรายการคำศัพท์ พร้อมกับฟังก์ชันในการ ค้นคืน, ค้นคืน โดยประมาณ, การแก้การสะกดผิด

Descriptions

Search returns the closest word in the dictionary, with respect to edit distance to the input query string. If the query exists in the dictionary, the same string will be returned. Approximate search also perform the same function but with heuristic that tries to reduce the memory during runtime. This should make it faster for large dictionary.

Spell correction perform simultanously correction & word segmentation. If there are no spelling error, it returns similar result as maximal matching word segmentation.

For more detail, please refer to Fun with string, Trie and edit distance (in Thai).

Usage

We need to load dictionary first into list of strings and Trie.

words,root = load_dictionary(dictionary_filename)

Then we may choose which operation to perform

Search

idx, score = nn_search(root, query)
print('query \'%s\' || closest \'%s\', distance %.2f' % (query, words[idx], score))

Approximate search

Here we apply beam search to Trie search. This means that at every newly reading character, we search for best candidate. Then current candidates whose score is higher than best candidate score + beam will be prunned. We can set, for example, beam = 3.

idx, score = approx_nn_search(root, query, 3)
print('query \'%s\' || closest \'%s\', distance %.2f' % (query, words[idx], score))

Spelling correction

Beam can be 3, for example. The transit_cost force the model to prefer path with minimal number of words. We may set, transit_cost=1, for example.

res = spell_correction(words, root, beam, transit_cost, separator, query)
print(res)

Comand line example

Search

python triethai.py search -d dict.txt -q ตัวแปล

Should produce

search
query 'ตัวแปล' || closest 'ตัวแปร', distance 1

Approximate search

python triethai.py approx_search -d dict.txt -q ตัวแปล

Should produce

approx_search
query 'ตัวแปล' || closest 'ตัวแปร', distance 1

Spell correction

python triethai.py spell -d dict -q การปับฅ่าตัสแปฮ

Should produce

spell
การปรับค่าตัวแปร

We can also insert word segmentation separator, for example '|' by

python triethai.py spell -d dict -q การปับฅ่าตัสแปฮ --sep '|'

Should produce

spell
การ|ปรับ|ค่า|ตัวแปร

Contributor

About

Trie for search, approximate search, and spell correction. Designed for Thai language, in mind, but should be pretty generic for other languages :)

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages