ensure parent endonyms exist for all countries and megacities #314

Merged 2 commits on Sep 23, 2022


missinglink commented Sep 13, 2022

This PR attempts to resolve a long-standing issue in Pelias where parent properties can only be specified in English (or in the 'default language').

For example, querying for a country directly works fine: you can query for Germany, Deutschland or Allemagne to find Germany. The search logic usually targets the 'default language' and the target language of the User-Agent.

The issue arises when the country name is used in support of another query. For example, 10 Torstraße Germany works as expected, but 10 Torstraße Deutschland fails.
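
To make the failure mode concrete, here is a minimal sketch. It is not Pelias's actual query logic: the record shape is simplified, and `matchesParent` is a hypothetical helper that only checks whether a query token equals one of the names stored for a parent field.

```javascript
// Simplified, hypothetical record: parent.country carries a single English name.
const record = {
  name: { default: '10 Torstraße' },
  parent: { country: ['Germany'] }
};

// Hypothetical helper: does the token match any name stored for this parent field?
function matchesParent(doc, field, token) {
  const names = doc.parent[field] || [];
  return names.some((n) => n.toLowerCase() === token.toLowerCase());
}

console.log(matchesParent(record, 'country', 'Germany'));     // true
console.log(matchesParent(record, 'country', 'Deutschland')); // false
```

With only one stored name per parent, any other spelling of the country has nothing to match against.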

This is really not ideal since it's very English-centric. In this German example it's particularly odd that English is supported but the official language of the country isn't.

The reason dates back to the original schema design in ~2014, where the parent properties weren't modelled with multiple languages in mind the way the name.* fields were, so it's been tricky to fix.

Coupled with that is the design of the PIP service and this repo, wof-admin-lookup. The service is designed in such a way that it only ever loads and serves a single name for a place. Changing this interface would be a breaking change that I don't have the bandwidth to tackle at the moment.

This PR provides some relief by adding dictionaries of endonyms for countries and megacities, which will optionally be added as aliases to every record (under a pelias/config flag).

It's not clear at this stage what effect adding multiple aliases to half a billion records will have on the size of the index, performance and query quality, so for now I've pared it down to just countries and megacities.

In the future, depending on the success of this PR, we can expand to cover exonyms (likely only a subset of languages). However, it may be preferable to reconsider the schema design at that point rather than clump all languages in the same field.


how it works:

  • the src/data/aliases/country-language-map.json file is generated using the provided sql file from a WOF bundle
  • the file contains a list of countries, their WOF IDs and a list of their local and spoken languages
  • for each placetype we can generate a src/data/aliases/{placetype}-endonyms.psv file containing aliases
  • when the config flag imports.adminLookup.useEndonyms is enabled, the code in this PR is activated
  • for each record, add any missing aliases to the parent.* properties only, based on the parent IDs previously assigned
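
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `addEndonyms`, the `Map`-based dictionary, the plain-object document shape and the example ID are all assumptions made for the sketch. Only the imports.adminLookup.useEndonyms flag name comes from the description above.

```javascript
// Hypothetical in-memory form of a {placetype}-endonyms.psv dictionary:
// parent ID -> list of alias names. The ID and names are illustrative.
const endonyms = {
  country: new Map([['85633111', ['Deutschland']]])
};

// Add any missing aliases to parent.* only; other properties are untouched.
function addEndonyms(doc, config, dictionaries) {
  // Do nothing unless the feature flag is explicitly enabled.
  const enabled = config?.imports?.adminLookup?.useEndonyms === true;
  if (!enabled || !doc.parent) { return doc; }

  for (const [placetype, dictionary] of Object.entries(dictionaries)) {
    const ids = doc.parent[`${placetype}_id`] || [];
    const names = doc.parent[placetype] || [];
    for (const id of ids) {
      for (const alias of dictionary.get(String(id)) || []) {
        // Skip aliases that are already present.
        if (!names.includes(alias)) { names.push(alias); }
      }
    }
    if (names.length > 0) { doc.parent[placetype] = names; }
  }
  return doc;
}

const config = { imports: { adminLookup: { useEndonyms: true } } };
const doc = { parent: { country: ['Germany'], country_id: ['85633111'] } };
console.log(addEndonyms(doc, config, endonyms).parent.country);
// [ 'Germany', 'Deutschland' ]
```

Keying the dictionary by parent ID rather than by name means the lookup does not depend on which label the PIP service happened to return.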

@missinglink

Couple of open questions:

  • Do we want to handle dependency in the same way we handle country?
  • The pelias/whosonfirst importer doesn't use wof-admin-lookup, how can we cover that codebase too?


missinglink commented Sep 14, 2022

Enabling this feature for openstreetmap and openaddresses (the vast majority of records in the index) resulted in a modest ~1% increase in the Elasticsearch snapshot size:

[Screenshot 2022-09-14 at 14 59 44]

@missinglink

This PR seems to be effective in resolving the issue and comes with negligible additional disk requirements (~1%):

[Screenshot 2022-09-15 at 14 20 56]

I'm happy to merge this. Ideally we can pair it with a PR to the acceptance-tests repo to cover this feature.


missinglink commented Sep 19, 2022

I spent some more time testing this today. It works great, but there's another class of problem I hadn't considered, which can be resolved with the same method.

What I didn't realize is that the inverse of this issue is also a problem: in some cases WOF uses the endonym as the primary label rather than the English name. I had assumed that using English was a policy.

So, for example, I expected to find Köln with a wof:name of Cologne (i.e. in English). That is the case for the Germany record, but it isn't universally true.

The issue in Pelias (autocomplete) is that you can find a record with "Domkloster 4 Köln" but not "Domkloster 4 Cologne", the inverse of the issue mentioned above.

The fix is very simple. I had actually already written the code but left it commented out: if (k === 'name:eng_x_preferred') { return true; }. This line means that the English name is always added as an alias.
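
For context, here is a hedged sketch of how such a line might sit inside a filter over WOF name keys. Only the `name:eng_x_preferred` check comes from the comment above. The `keepNameKey` function, the `languages` argument (imagined as coming from country-language-map.json) and the sample properties are assumptions for illustration.

```javascript
// Hypothetical filter over WOF name:* keys.
function keepNameKey(k, languages) {
  // Always keep the English name, so 'Cologne' becomes an alias even when
  // the primary label is the endonym 'Köln'.
  if (k === 'name:eng_x_preferred') { return true; }
  // Otherwise keep only preferred names in a local or spoken language.
  return languages.some((lang) => k === `name:${lang}_x_preferred`);
}

// Illustrative WOF-style properties for Cologne.
const properties = {
  'name:deu_x_preferred': ['Köln'],
  'name:eng_x_preferred': ['Cologne'],
  'name:fra_x_preferred': ['Cologne'],
  'name:ita_x_preferred': ['Colonia']
};

const aliases = Object.keys(properties)
  .filter((k) => keepNameKey(k, ['deu']))
  .flatMap((k) => properties[k]);

console.log(aliases); // [ 'Köln', 'Cologne' ]
```

The French and Italian names are exonyms and are dropped, which matches the decision above to leave exonyms for a later change.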

The new commit 1387a75 shows the changes this line makes to the dictionaries.

I'll re-run the build and test again to ensure it's ready to merge


Joxit commented Sep 19, 2022

Hi there,

This PR reminds me of another one I did a few years ago, pelias/whosonfirst#492, where I added all exonyms to WOF documents. The result was a bit disappointing for a world build:

|      | original  | PR           |
|------|-----------|--------------|
| Size | 3,2G      | 48G          |
| Time | 41m6,438s | 5h36m20,755s |

Endonyms seem to be a good first step anyway 👍

related: pelias/api#1296

@missinglink

This looks good. It adds ~1% to the disk requirements and possibly some additional build time.

Since this is behind a feature flag and the test cases pass, I'm happy to squash-and-merge this.

There's still some opportunity to extend this PR in the future, since I know there are things other developers might want to add.
