
Use ASCII synonyms for non-ASCII characters #27

Merged: 25 commits into development, Feb 9, 2022
Conversation

ageorgou
Contributor

@ageorgou ageorgou commented Sep 4, 2019

Will fix #18.

  • Add a new analyzer for the fields that may contain non-ASCII characters (gw, cf, norms_n and norms_f fields)
  • Stop indexing the headword field (not necessary but removes duplication)
  • Make sure that the preprocessing does not remove any non-ASCII characters! (e.g. that Unicode sequences are understood correctly)
  • Test that the analyzer specification is correct based on the expected substitutions (see examples on the official docs)
  • Test that other fields like cf.sort behave as before, i.e. don't use the new analyzer
  • Test that searching with ASCII substitutions works as expected
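The analyzer described above can be sketched as plain index-settings dicts. This is a minimal illustration, not the PR's actual mapping: the substitution table, analyzer name, and choice of tokenizer below are all assumptions for the sake of the example.

```python
# Sketch of index settings wiring a "mapping" char_filter that replaces
# non-ASCII characters with ASCII synonyms, for use on the gw, cf,
# norms_n and norms_f fields. The substitutions are illustrative only.
ascii_synonyms = {
    "\u0161": "sz",   # s with caron
    "\u1e63": "s,",   # s with dot below
    "\u1e6d": "t,",   # t with dot below
}

settings = {
    "analysis": {
        "char_filter": {
            "ascii_fold": {  # hypothetical name
                "type": "mapping",
                "mappings": [f"{src} => {dst}" for src, dst in ascii_synonyms.items()],
            }
        },
        "analyzer": {
            "ascii_synonym_analyzer": {  # hypothetical name
                "type": "custom",
                "char_filter": ["ascii_fold"],
                "tokenizer": "whitespace",
                "filter": ["lowercase"],
            }
        },
    }
}

def fold(text: str) -> str:
    """Apply the same substitutions in Python; the preprocessing
    must preserve the non-ASCII input for this to work."""
    for src, dst in ascii_synonyms.items():
        text = text.replace(src, dst)
    return text

print(fold("\u0161arru"))  # -> szarru
```

A query for "szarru" and a query for the original non-ASCII form would then both match the same indexed token.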

@ageorgou ageorgou added the data Ingestion and preprocessing of data label Sep 4, 2019
The headword field poses some problems for the new analyzer we
are planning.

The analyzer will contain a filter to replace non-ASCII characters
with ASCII "synonyms". Some of these synonyms contain characters
that the standard tokenizer strips out, therefore the filter cannot
be used at the same time as that tokenizer.
The headword field is unique in that it both contains non-ASCII
characters and needs to be tokenized (to separate the guideword
from the cuneiform and part of speech). This means that we would
need a more complex way of indexing it.
Fortunately, it contains the same information as gw and cf (as well
as the part of speech, which we don't do anything with at the
moment), so we don't lose anything by removing it from the search.
Currently only used for the cf field
The Index object exposes some methods of the index client (like
put_mapping) but gives a higher-level abstraction, which will
hopefully lead to more concise code.
Version 7 of Elasticsearch has introduced some breaking changes,
particularly the removal of document types. This will not be hard
to adapt to (we only have one document type in our index anyway),
but newer versions of the elasticsearch_dsl package have changed
their API to follow this. As we mostly use ES 5/6 locally and on
the Oracc server, for now we should stick to the previous major
version of the package.
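In practice the pin could live in the requirements file, e.g. (the exact bounds are an assumption; the lower bound reflects the 6.2.1 bug fixed by 6.4.0 noted below):

```
# requirements.txt -- stay on the pre-7 API until we migrate off document types
elasticsearch-dsl>=6.4.0,<7.0.0
```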
@ageorgou ageorgou changed the base branch from master to development September 9, 2019 16:47
Also, don't update/run ES on the PEP8 test instance
Earlier versions of elasticsearch_dsl set a default document type
when creating an index. This leads to an error when we try to add
the mapping for our own document type ("entry"), because an index
can only have one type. The fix avoids the clash by passing the
intended document type from the start.
(This problem appears in versions including 6.2.1, but has been
fixed by 6.4.0.)
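The fix can be sketched as creating the index with the intended mapping type in the initial request, so no default type is ever registered. This is a minimal illustration using a raw create-index body; the field definitions are placeholders, not the project's actual mapping.

```python
# Declaring the "entry" type in the create-index body means the index
# never acquires a default document type, so no type conflict arises.
# Field definitions below are placeholders for illustration.
create_body = {
    "mappings": {
        "entry": {
            "properties": {
                "cf": {"type": "text"},
                "gw": {"type": "text"},
            }
        }
    }
}

# With the low-level client this would be submitted as:
#   es.indices.create(index="oracc", body=create_body)
# (not executed here, since it needs a running Elasticsearch 6 node)

assert list(create_body["mappings"]) == ["entry"]  # exactly one doc type
```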
@ageorgou
Contributor Author

ageorgou commented Feb 9, 2022

Merging this so we can start from a cleaner point. Remaining tasks noted in #32.

@ageorgou ageorgou marked this pull request as ready for review February 9, 2022 17:03
@ageorgou ageorgou merged commit 3e4980c into development Feb 9, 2022
Successfully merging this pull request may close these issues.

Support ASCII representations of non-ASCII characters