
Use ASCII synonyms for non-ASCII characters #27

Merged: 25 commits into development, Feb 9, 2022
Conversation

ageorgou
Contributor

@ageorgou ageorgou commented Sep 4, 2019

Will fix #18.

  • Add a new analyzer for the fields that may contain non-ASCII characters (gw, cf, norms_n and norms_f fields)
  • Stop indexing the headword field (not necessary but removes duplication)
  • Make sure that the preprocessing does not remove any non-ASCII characters! (e.g. that Unicode sequences are understood correctly)
  • Test that the analyzer specification is correct based on the expected substitutions (see examples on the official docs)
  • Test that other fields like cf.sort behave as before, i.e. don't use the new analyzer
  • Test that searching with ASCII substitutions works as expected
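The analyzer described above can be sketched as plain index-settings dicts. This is a minimal illustration, not the PR's actual mapping: the substitution table, analyzer name, and choice of tokenizer below are all assumptions for the sake of the example.

```python
# Sketch of index settings wiring a "mapping" char_filter that replaces
# non-ASCII characters with ASCII synonyms, for use on the gw, cf,
# norms_n and norms_f fields. The substitutions are illustrative only.
ascii_synonyms = {
    "\u0161": "sz",   # s with caron
    "\u1e63": "s,",   # s with dot below
    "\u1e6d": "t,",   # t with dot below
}

settings = {
    "analysis": {
        "char_filter": {
            "ascii_fold": {  # hypothetical name
                "type": "mapping",
                "mappings": [f"{src} => {dst}" for src, dst in ascii_synonyms.items()],
            }
        },
        "analyzer": {
            "ascii_synonym_analyzer": {  # hypothetical name
                "type": "custom",
                "char_filter": ["ascii_fold"],
                "tokenizer": "whitespace",
                "filter": ["lowercase"],
            }
        },
    }
}

def fold(text: str) -> str:
    """Apply the same substitutions in Python; the preprocessing
    must preserve the non-ASCII input for this to work."""
    for src, dst in ascii_synonyms.items():
        text = text.replace(src, dst)
    return text

print(fold("\u0161arru"))  # -> szarru
```

A query for "szarru" and a query for the original non-ASCII form would then both match the same indexed token.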

@ageorgou ageorgou added the data Ingestion and preprocessing of data label Sep 4, 2019
The headword field poses some problems for the new analyzer we
are planning.

The analyzer will contain a filter to replace non-ASCII characters
with ASCII "synonyms". Some of these synonyms contain characters
that the standard tokenizer strips out, therefore the filter cannot
be used at the same time as that tokenizer.
The headword field is unique in that it both contains non-ASCII
characters and needs to be tokenized (to separate the guideword
from the cuneiform and part of speech). This means that we would
need a more complex way of indexing it.
Fortunately, it contains the same information as gw and cf (as well
as the part of speech, which we don't do anything with at the
moment), so we don't lose anything by removing it from the search.
Currently only used for the cf field
The Index object exposes some methods of the index client (like
put_mapping) but gives a higher-level abstraction, which will
hopefully lead to more concise code.
Version 7 of Elasticsearch has introduced some breaking changes,
particularly the removal of document types. This will not be hard
to adapt to (we only have one document type in our index anyway),
but newer versions of the elasticsearch_dsl package have changed
their API to follow this. As we mostly use ES 5/6 locally and on
the Oracc server, for now we should stick to the previous major
version of the package.
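In practice the pin could live in the requirements file, e.g. (the exact bounds are an assumption; the lower bound reflects the 6.2.1 bug fixed by 6.4.0 noted below):

```
# requirements.txt -- stay on the pre-7 API until we migrate off document types
elasticsearch-dsl>=6.4.0,<7.0.0
```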
@ageorgou ageorgou changed the base branch from master to development September 9, 2019 16:47
Also, don't update/run ES on the PEP8 test instance
Earlier versions of elasticsearch_dsl set a default document type
when creating an index. This leads to an error when we try to add
the mapping for our own document type ("entry"), because an index
can only have one type. The fix avoids the clash by passing the
intended document type from the start.
(This problem appears in versions including 6.2.1, but has been
fixed by 6.4.0.)
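The fix can be sketched as creating the index with the intended mapping type in the initial request, so no default type is ever registered. This is a minimal illustration using a raw create-index body; the field definitions are placeholders, not the project's actual mapping.

```python
# Declaring the "entry" type in the create-index body means the index
# never acquires a default document type, so no type conflict arises.
# Field definitions below are placeholders for illustration.
create_body = {
    "mappings": {
        "entry": {
            "properties": {
                "cf": {"type": "text"},
                "gw": {"type": "text"},
            }
        }
    }
}

# With the low-level client this would be submitted as:
#   es.indices.create(index="oracc", body=create_body)
# (not executed here, since it needs a running Elasticsearch 6 node)

assert list(create_body["mappings"]) == ["entry"]  # exactly one doc type
```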
@ageorgou
Contributor Author

ageorgou commented Feb 9, 2022

Merging this so we can start from a cleaner point. Remaining tasks noted in #32.

@ageorgou ageorgou marked this pull request as ready for review February 9, 2022 17:03
@ageorgou ageorgou merged commit 3e4980c into development Feb 9, 2022
Successfully merging this pull request may close these issues.

Support ASCII representations of non-ASCII characters