International characters: Diacritics normalization in text search #41
Comments
Yes, Alejandro, this is the case, but (likely) has to be addressed when indexing and querying. I have done some experiments in this branch: https://github.com/danka74/snowstorm/tree/swedish-experimental-dk but here indexing is hard-coded, which is not what we want, see this commit: danka74@701f082 |
Hi Daniel, that's exactly what is needed, you are right. Thanks |
@alopezo , this would unfortunately not work for Swedish with the characters ÅÄÖ which should not be folded. I see that in the few English words where Scandinavian characters are used (like Ångström, the length unit) SNOMED has used a folded term (here Angstrom), so maybe there is a "universal" set of characters which should be folded (e.g. É to E) which excludes ÅÄÖ. |
Defining that set would be a great first step, and much simpler to implement as it would not require additional configurations on index/search, for example in spanish we would like to fold: áéíóúüñ Maybe we can start with a short list of these and check use cases from other languages. /Alejandro |
@alopezo Elasticsearch has built in support for appropriate character folding in each language. We plan to add a feature to Snowstorm to allow search to work for all languages where the correct language index analyser is picked at index time using the description language code field. The correct analyser would also need to be used at search time in some cases. I'm still thinking about the best way to achieve this. Perhaps the Accept-Language header in the search request could be used to select a set of language specific search analysers? |
Yes, this would be a good solution for us: accent folding for Spanish based on the Accept-Language header. I'm reading the documentation of the language analyzers for Elasticsearch: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html They don't include folding, but it seems like a straightforward task to add. Thanks! |
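The approach discussed above can be sketched as index settings that rebuild the built-in Spanish analyzer with an `asciifolding` filter inserted into the chain, following the Elasticsearch language-analyzer documentation. This is a hedged illustration, not Snowstorm's actual configuration; the analyzer name `spanish_folded` is invented for the example (the settings body is shown as a Python dict for readability):

```python
import json

# Sketch: rebuild the built-in Spanish analyzer and add asciifolding,
# so accented input (vías) and unaccented input (vias) index identically.
# Analyzer/filter names here are illustrative, not Snowstorm's.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "spanish_stop": {"type": "stop", "stopwords": "_spanish_"},
                "spanish_stemmer": {"type": "stemmer", "language": "light_spanish"},
            },
            "analyzer": {
                "spanish_folded": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "asciifolding",   # folds á é í ó ú ü ñ
                        "spanish_stop",
                        "spanish_stemmer",
                    ],
                }
            },
        }
    }
}

print(json.dumps(index_settings, indent=2))
```

The same analyzer would need to be applied at search time so the query terms are folded the same way as the indexed terms.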
I'm compiling the list of characters for each language which should not be folded/simplified because they are unique letters in that language. In Swedish, the characters which should not be folded are: å, ä, ö. I'm making the assumption that all characters can be made lowercase during processing for search, regardless of diacritics, so we only need to capture the lowercase versions of each character which must not be folded. |
Some more: Perhaps a request to the Content Managers AG? |
- New config `search.language.charactersNotFolded.xx=yyy`
- Add Description field `termFolded`.
- Manual term folding implementation.
- Config loaded as UTF-8.
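The "manual term folding" step above can be sketched as follows: lowercase the term, then strip combining diacritical marks from every character except those configured as not-folded for the language. This is a minimal illustration of the idea, not Snowstorm's actual implementation; the function name `fold_term` is invented for the example:

```python
import unicodedata

def fold_term(term: str, not_folded: frozenset = frozenset()) -> str:
    """Lowercase a term and strip diacritics, except for characters that
    are distinct letters in the target language (e.g. å, ä, ö in Swedish).
    A sketch of the manual-folding idea, not Snowstorm's code."""
    out = []
    for ch in term.lower():
        if ch in not_folded:
            out.append(ch)  # protected character: keep as-is
        else:
            # Decompose (NFD) and drop the combining diacritical marks.
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)

print(fold_term("vías respiratorias"))        # → "vias respiratorias"
print(fold_term("blåbär", frozenset("åäö")))  # → "blåbär" (å, ä protected)
print(fold_term("Ångström"))                  # → "angstrom"
```

With a per-language `charactersNotFolded` set, Spanish can fold áéíóúüñ while Swedish keeps åäö intact, matching the discussion above.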
This is useful if the search configuration for international character handling has changed.
Posted a discussion item on the CMAG discussion page! |
This feature is working, so I'll close this ticket. |
The order of the Danish letters is: æ ø å / Æ Ø Å. |
Thanks @CWdanielsen, I've added these in the develop branch. They will go out in the next release. |
Thanks Kai, and they are the last three letters in the DK alphabet after a-z/A-Z. |
The Elasticsearch index does not normalize diacritics. For example, in the Spanish edition, using the “findConcepts” API, searching for “vías resp” and “vias resp” (from “vías respiratorias”, “respiratory tract”) produces different results.
Example:
https://snowstorm.msal.gov.ar/MAIN/concepts?activeFilter=true&term=v%C3%ADas%20resp&offset=0&limit=1
https://snowstorm.msal.gov.ar/MAIN/concepts?activeFilter=true&term=vias%20resp&offset=0&limit=1
The browser implementation has a diacritics normalization algorithm applied at index creation and at search time, and Spanish users expect that writing a word with or without an accent will produce the same results (vía vs via).
Searching the latest Elasticsearch documentation, one way to resolve this is to use multiple fields with different analyzers and a multi_match query of type “most_fields”.
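A hedged sketch of that multi-field approach: index the term twice, once with the language analyzer and once with a folded analyzer, then query both sub-fields with a `most_fields` multi_match so that accented and unaccented input match. The field names `term` and `term.folded` are illustrative, not Snowstorm's actual mapping (the request body is shown as a Python dict):

```python
import json

# Sketch of a most_fields multi_match across an analyzed field and a
# diacritics-folded sub-field; field names are assumptions for illustration.
query = {
    "query": {
        "multi_match": {
            "query": "vias resp",
            "type": "most_fields",
            "fields": ["term", "term.folded"],
        }
    }
}

print(json.dumps(query, indent=2))
```

With `most_fields`, documents that match on both sub-fields score higher, so exact accented matches still rank above folded-only matches.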