International characters: Diacritics normalization in text search #41
Comments
Yes, Alejandro, this is the case, but (likely) has to be addressed when indexing and querying. I have done some experiments in this branch: https://github.com/danka74/snowstorm/tree/swedish-experimental-dk but here indexing is hard-coded, which is not what we want, see this commit: danka74@701f082 |
Hi Daniel, that's exactly what is needed, you are right. Thanks |
@alopezo , this would unfortunately not work for Swedish with the characters ÅÄÖ which should not be folded. I see that in the few English words where Scandinavian characters are used (like Ångström, the length unit) SNOMED has used a folded term (here Angstrom), so maybe there is a "universal" set of characters which should be folded (e.g. É to E) which excludes ÅÄÖ. |
Defining that set would be a great first step, and much simpler to implement as it would not require additional configurations on index/search, for example in spanish we would like to fold: áéíóúüñ Maybe we can start with a short list of these and check use cases from other languages. /Alejandro |
@alopezo Elasticsearch has built in support for appropriate character folding in each language. We plan to add a feature to Snowstorm to allow search to work for all languages where the correct language index analyser is picked at index time using the description language code field. The correct analyser would also need to be used at search time in some cases. I'm still thinking about the best way to achieve this. Perhaps the Accept-Language header in the search request could be used to select a set of language specific search analysers? |
Yes, this would be a good solution for us: accent folding for Spanish based on the Accept-Language header. I'm reading the documentation of the language analyzers for Elasticsearch: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html They don't include folding, but it seems like a straightforward task to add. Thanks! |
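The approach discussed above can be sketched as index settings that rebuild the built-in Spanish analyzer with an `asciifolding` filter inserted into the chain, following the Elasticsearch language-analyzer documentation. This is a hedged illustration, not Snowstorm's actual configuration; the analyzer name `spanish_folded` is invented for the example (the settings body is shown as a Python dict for readability):

```python
import json

# Sketch: rebuild the built-in Spanish analyzer and add asciifolding,
# so accented input (vías) and unaccented input (vias) index identically.
# Analyzer/filter names here are illustrative, not Snowstorm's.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "spanish_stop": {"type": "stop", "stopwords": "_spanish_"},
                "spanish_stemmer": {"type": "stemmer", "language": "light_spanish"},
            },
            "analyzer": {
                "spanish_folded": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "asciifolding",   # folds á é í ó ú ü ñ
                        "spanish_stop",
                        "spanish_stemmer",
                    ],
                }
            },
        }
    }
}

print(json.dumps(index_settings, indent=2))
```

The same analyzer would need to be applied at search time so the query terms are folded the same way as the indexed terms.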
I'm compiling the list of characters for each language which should not be folded/simplified because they are unique letters in that language. In Swedish, the characters which should not be folded are: å, ä, ö. I'm making the assumption that all characters can be made lowercase during processing for search, regardless of diacritics, so we only need to capture the lowercase versions of each character which must not be folded. |
Some more: Perhaps a request to the Content Managers AG? |
- New config `search.language.charactersNotFolded.xx=yyy`
- Add Description field `termFolded`.
- Manual term folding implementation.
- Config loaded as UTF-8.
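The "manual term folding" step above can be sketched as follows: lowercase the term, then strip combining diacritical marks from every character except those configured as not-folded for the language. This is a minimal illustration of the idea, not Snowstorm's actual implementation; the function name `fold_term` is invented for the example:

```python
import unicodedata

def fold_term(term: str, not_folded: frozenset = frozenset()) -> str:
    """Lowercase a term and strip diacritics, except for characters that
    are distinct letters in the target language (e.g. å, ä, ö in Swedish).
    A sketch of the manual-folding idea, not Snowstorm's code."""
    out = []
    for ch in term.lower():
        if ch in not_folded:
            out.append(ch)  # protected character: keep as-is
        else:
            # Decompose (NFD) and drop the combining diacritical marks.
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)

print(fold_term("vías respiratorias"))        # → "vias respiratorias"
print(fold_term("blåbär", frozenset("åäö")))  # → "blåbär" (å, ä protected)
print(fold_term("Ångström"))                  # → "angstrom"
```

With a per-language `charactersNotFolded` set, Spanish can fold áéíóúüñ while Swedish keeps åäö intact, matching the discussion above.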
This is useful if the search configuration for international character handling has changed.
Posted a discussion item on the CMAG discussion page! |
This feature is working, so I'll close this ticket. |
The order of the Danish letters is: æ ø å / Æ Ø Å. |
Thanks @CWdanielsen, I've added these in the develop branch. They will go out in the next release. |
Thanks Kai, and they are the last three letters in the DK alphabet after a-z/A-Z. |
The Elasticsearch index does not normalize diacritics. For example, in the Spanish edition, using the “findConcepts” API, searching for “vías resp” and “vias resp” (from “vías respiratorias”, “respiratory tract”) produces different results.
Example:
https://snowstorm.msal.gov.ar/MAIN/concepts?activeFilter=true&term=v%C3%ADas%20resp&offset=0&limit=1
https://snowstorm.msal.gov.ar/MAIN/concepts?activeFilter=true&term=vias%20resp&offset=0&limit=1
The browser implementation has a diacritics normalization algorithm applied at index creation and at search time, and Spanish users expect that writing a word with or without an accent will produce the same results (vía vs via).
Searching the latest Elasticsearch documentation, one way to resolve this is to use multiple fields with different analyzers and a multi_match query of type “most_fields”.
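A hedged sketch of that multi-field approach: index the term twice, once with the language analyzer and once with a folded analyzer, then query both sub-fields with a `most_fields` multi_match so that accented and unaccented input match. The field names `term` and `term.folded` are illustrative, not Snowstorm's actual mapping (the request body is shown as a Python dict):

```python
import json

# Sketch of a most_fields multi_match across an analyzed field and a
# diacritics-folded sub-field; field names are assumptions for illustration.
query = {
    "query": {
        "multi_match": {
            "query": "vias resp",
            "type": "most_fields",
            "fields": ["term", "term.folded"],
        }
    }
}

print(json.dumps(query, indent=2))
```

With `most_fields`, documents that match on both sub-fields score higher, so exact accented matches still rank above folded-only matches.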