Trie for storing dictionary with search, approximate search, and spell correction functions. Designed for Thai language, in mind, but should be pretty generic for other languages :)
ทรัยสำหรับเก็บรายการคำศัพท์ พร้อมกับฟังก์ชันในการ ค้นคืน, ค้นคืน โดยประมาณ, การแก้การสะกดผิด
Search returns the closest word in the dictionary, with respect to edit distance to the input query string. If the query exists in the dictionary, the same string will be returned. Approximate search also perform the same function but with heuristic that tries to reduce the memory during runtime. This should make it faster for large dictionary.
Spell correction perform simultanously correction & word segmentation. If there are no spelling error, it returns similar result as maximal matching word segmentation.
For more detail, please refer to Fun with string, Trie and edit distance (in Thai).
We need to load dictionary first into list of strings and Trie.
words,root = load_dictionary(dictionary_filename)
Then we may choose which operation to perform
idx, score = nn_search(root, query)
print('query \'%s\' || closest \'%s\', distance %.2f' % (query, words[idx], score))
Here we apply beam search to Trie search. This means that at every newly reading character, we search for best candidate. Then current candidates whose score is higher than best candidate score + beam will be prunned. We can set, for example, beam = 3.
idx, score = approx_nn_search(root, query, 3)
print('query \'%s\' || closest \'%s\', distance %.2f' % (query, words[idx], score))
Beam can be 3, for example. The transit_cost force the model to prefer path with minimal number of words. We may set, transit_cost=1, for example.
res = spell_correction(words, root, beam, transit_cost, separator, query)
print(res)
python triethai.py search -d dict.txt -q ตัวแปล
Should produce
search
query 'ตัวแปล' || closest 'ตัวแปร', distance 1
python triethai.py approx_search -d dict.txt -q ตัวแปล
Should produce
approx_search
query 'ตัวแปล' || closest 'ตัวแปร', distance 1
python triethai.py spell -d dict -q การปับฅ่าตัสแปฮ
Should produce
spell
การปรับค่าตัวแปร
We can also insert word segmentation separator, for example '|' by
python triethai.py spell -d dict -q การปับฅ่าตัสแปฮ --sep '|'
Should produce
spell
การ|ปรับ|ค่า|ตัวแปร