Persian Language Enhancement #545
ariaieboy
started this conversation in
Feedback & Feature Proposal
Replies: 1 comment
-
Hello @ariaieboy, your discussion seems to be highly related to #139. Thank you very much for your feedback; it helps me a lot in enhancing language support in Meilisearch!
-
Hello,
I am using Meilisearch on a new project in the Persian language. It works great in many cases, but there are some edge cases where Meilisearch needs some enhancement.
The first issue is the letter آ (U+0622). It usually appears as the first letter of a word, and it is equivalent to the character ا (U+0627). This letter is shared between Arabic and Persian, but it is not as complicated in Persian as it is in Arabic: Arabic has multiple alef letters, while Persian has only these two kinds of alef, which are equivalent to each other.
For example, these words are equivalent:
آب => اب | آقا => اقا | آسفالت => اسفالت
A nice quick workaround would be a character replacement of U+0622 with U+0627.
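The alef replacement described above could be sketched like this in Python. It is only an illustration of the workaround applied before indexing or querying; `normalize_alef` is a hypothetical helper name, not something Meilisearch provides.

```python
# Hypothetical pre-processing helper; not part of Meilisearch.
ALEF_WITH_MADDA = "\u0622"  # آ
ALEF = "\u0627"             # ا


def normalize_alef(text: str) -> str:
    """Replace every U+0622 with U+0627 so both spellings compare equal."""
    return text.replace(ALEF_WITH_MADDA, ALEF)


if __name__ == "__main__":
    for word in ("آب", "آقا", "آسفالت"):
        print(word, "=>", normalize_alef(word))
```

The same function would have to run on both the documents and the search query, otherwise the two sides would still not match.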
The second issue is what we call the half-space in Persian; Unicode calls it the zero-width non-joiner (ZWNJ), U+200C. For search purposes, this character is equivalent to a space in Persian.
Examples:
می‌توان => می توان | کتاب‌ها => کتاب ها
In a perfect world this character is meaningful: it tells you that two words are related. For example, کتاب means "book" and ها acts like the "s" in "books", so instead of a space we should use the half-space. But about 95% of the time, in text typed on a computer, users use a regular space, and that breaks Meilisearch results.
As with the previous issue, the quick workaround is to replace the half-space with a space, both in the index and in the user input.
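A minimal sketch of the half-space workaround, assuming the text can be pre-processed in Python before it reaches the search engine. `replace_half_space` is a hypothetical name used only for illustration.

```python
# Hypothetical pre-processing helper; not part of Meilisearch.
ZWNJ = "\u200c"  # zero-width non-joiner, the Persian "half-space"


def replace_half_space(text: str) -> str:
    """Turn every ZWNJ into a regular space."""
    return text.replace(ZWNJ, " ")


if __name__ == "__main__":
    # The same function is applied to the indexed text and to the query,
    # so both sides end up in the same form.
    indexed = replace_half_space("کتاب\u200cها")
    query = replace_half_space("کتاب ها")
    print(indexed == query)
```

Applying it on both sides is the important part: a query typed with a regular space and a document written with the half-space then produce the same string.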
The third issue is the Tatweel character (U+0640).
This character is rarely used, especially in Persian, and it may not be necessary to handle it in your tokenizer, but covering this edge case would help improve search results in general.
The Tatweel carries no meaning and can be removed from any string. Tatweel, or kashida, is a type of justification in Arabic script, and in Persian it is usually used to make a word look better visually.
Examples:
حــمید => حمید | رحــــــــــــــیــــــــــــــم => رحیم
In this case the quick workaround would be replacing the kashida with nothing, that is, removing it.
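Removing the Tatweel is again a single replacement. The sketch below, in Python, also folds in the two earlier workarounds (alef and half-space) through one translation table, on the assumption that all three would be applied together before indexing and querying. `normalize_persian` and `PERSIAN_TABLE` are hypothetical names, not Meilisearch APIs.

```python
# Hypothetical pre-processing helper; not part of Meilisearch.
# Keys are code points; a value of None deletes the character.
PERSIAN_TABLE = {
    0x0622: 0x0627,  # آ (alef with madda) -> ا (alef)
    0x200C: " ",     # zero-width non-joiner (half-space) -> space
    0x0640: None,    # Tatweel / kashida -> removed
}


def normalize_persian(text: str) -> str:
    """Apply the three character-level workarounds in one pass."""
    return text.translate(PERSIAN_TABLE)


if __name__ == "__main__":
    print(normalize_persian("حــمید"))
    print(normalize_persian("رحــــــــــــــیــــــــــــــم"))
```

A single `str.translate` call keeps the three rules in one place, so the index and the user input are guaranteed to go through the same mapping.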