Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix incorrect matching of Thai characters in FTS5 #216

Merged
merged 3 commits into from
Dec 13, 2021

Conversation

missinglink
Copy link
Member

There appears to be a bug in the FTS5 analysis module regarding Thai characters.

It should be that the tokens.token column and the fulltext.fulltext column contain the same token and therefore have the same matching functionality, the FTS5 table is significantly faster (even when including the JOIN) so we generally use that instead of the index on tokens.

However it turns out, for Thai at least, that this isn't true:

SELECT COUNT(*)
FROM tokens t1
WHERE t1.token = '';
0

SELECT COUNT(*)
FROM fulltext f1
JOIN tokens t1 ON (f1.rowid = t1.rowid)
WHERE f1.fulltext MATCH '"จ"';
2904

It seems that the FTS5 module erroneously performs prefix matching of this character despite the query not containing a star and therefore not requesting a prefix search.

This PR is a simple workaround for this, and any other cases where the two columns don't match, we simple check both.

@missinglink
Copy link
Member Author

missinglink commented Dec 13, 2021

this needs some work before merging, there is one query which passes the variable "${object}" OR "${object}"*, using TRIM('"') wouldn't be sufficient here and would generate a query parsing error.

It may be a simple task of replacing it with foo = $object OR foo LIKE ($object || '%') but might require splitting the file into two again.

@missinglink
Copy link
Member Author

missinglink commented Dec 13, 2021

Resolved issue mentioned above, although I'm not convinced the quoting is correct, its quite messy 🤷
Need to do a review of the quoting before merging

eg. should MATCH $object OR $object* be MATCH "$object" OR "$object"*?
https://www.sqlite.org/fts5.html#fts5_phrases

@missinglink
Copy link
Member Author

missinglink commented Dec 13, 2021

pushed a third commit to fix the quoting, its a little confusing but the double quotes must be applied before passing the variable to the prepared statement.

when passing a string variable to a prepared statement it will be wrapped in single quotes, so doing it like results in '"foo"' rather than generating "'foo'" when trying to do the quoting in the SQL statement itself.

@missinglink missinglink merged commit a60df73 into master Dec 13, 2021
@missinglink missinglink deleted the query-performance branch December 13, 2021 14:01
@missinglink
Copy link
Member Author

for posterity, here are some examples of terms which incorrectly matched:

จังหวัดกาลีมันตันกลาง|102020315|tha|preferred|จังหวัดกาลีมันตันกลาง
จังหวัดกาลีมันตันใต้|102020317|tha|preferred|จังหวัดกาลีมันตันใต้
จังหวัดกาลีมันตันตะวันตก|102020321|tha|preferred|จังหวัดกาลีมันตันตะวันตก
จังหวัดชวาตะวันออก|102020387|tha|preferred|จังหวัดชวาตะวันออก
จังหวัดชวากลาง|102020389|tha|preferred|จังหวัดชวากลาง
จังหวัดชวาตะวันตก|102020393|tha|preferred|จังหวัดชวาตะวันตก
จัมบิ|102020423|tha|preferred|จัมบิ
จัมบี|102020423|tha|variant|จัมบี
จังหวัดปาปัว|102020451|tha|preferred|จังหวัดปาปัว
จีมาฮี|102020649|tha|preferred|จีมาฮี
จังหวัดบันเติน|102020961|tha|preferred|จังหวัดบันเติน
จังหวัดมาลูกูเหนือ|102021159|tha|preferred|จังหวัดมาลูกูเหนือ
จังหวัดปาปัวตะวันตก|102021355|tha|preferred|จังหวัดปาปัวตะวันตก
จังหวัดซูลาเวซีตะวันตก|102021357|tha|preferred|จังหวัดซูลาเวซีตะวันตก
จังหวัดบันเติน|102021535|tha|preferred|จังหวัดบันเติน
จังหวัดจัมบี|102021657|tha|preferred|จังหวัดจัมบี
จังหวัดลัมปุง|102022643|tha|preferred|จังหวัดลัมปุง
จูปิง|102023945|tha|preferred|จูปิง

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant