Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue 221 (partially) #288

Merged
merged 6 commits into from
Mar 11, 2018
Merged

Fix issue 221 (partially) #288

merged 6 commits into from
Mar 11, 2018

Conversation

hbruch
Copy link
Collaborator

@hbruch hbruch commented Jan 10, 2018

Fix #221 (partially). This PR improves housenumber matching, especially in cases where the OSM housenumber suffix

Tests are provided via geocoders/geocoder-tester#28 which highlight the improvements as well as further existing issues.

Still open:

  • Missing fallback in case a housenumber suffix not existing in the data is requested (e.g. Lister Meile 29D, Hannover)
  • instead of the exactly matching housenumber a match with a name attribute and matching numeral ngram might be returned (Anderter Straße 1 Hannover)

@karussell
Copy link
Collaborator

Cool, thanks!

May I ask how you test and iterate in general? Do you have a nominatim installation and feed photon after each change or do you reindex somehow?

@hbruch
Copy link
Collaborator Author

hbruch commented Jan 10, 2018

I'm running a local nominatim installation with a smaller region from which I feed photon. For this region, I'm running a small test set via geocoder/geocoder-tester. If results look promising, I deploy it to a server running a nominatim instance with a germany dataset which is fed to photon again (takes about 10h). For this I run geocoder-test for germany and an internal reference dataset of ~1000 addresses.

For the reference dataset, some aggregated stats (Max distance to reference coord < 100m? Dice similarity coefficient > 0,5? housenumber found?) give a good indication of match quality. However, these tests have a bias to whole address geocoding, so type-ahead behavior is something I check manually via (limited) explorative testing.

Reindexing is a good idea and might significantly speed up development cycles, did you already? Another option I thought of was importing the json-dump. This would lower initial barriers for new contributers, as they would not need to setup nominatim (but currently no JsonImporter exists, does it?).

How do you iterate?

@karussell
Copy link
Collaborator

How do you iterate?

Never touched major feeding stuff before and modified just the querying part. But as I'm probably now a bit more into this topic I'll do similar to you and will also create country extracts easier see #281

We should invest in this topic as this makes developing new stuff easier. Also performance testing is important but complex ... so many things to do :)

Reindexing is a good idea and might significantly speed up development cycles, did you already?

No

This would lower initial barriers for new contributers, as they would not need to setup nominatim

Yes, good idea. And I hope with the extracts this will improve the work too

but currently no JsonImporter exists

Yes

@hbruch
Copy link
Collaborator Author

hbruch commented Jan 10, 2018

I just realized that the json dump format corresponds to the elastic bulk format, so it should be feasible to provide an importer. Various bulk importer exist (stream2es, bulkEs), but photon should provide a self-contained json import.

I think a reindexing of an existing photon index would not work, as all context information is dropped from source during the import. (if it should be indexed at all is another question, I'm currently thinking about, as many unexpected matches are caused by 'invisible' context matches...)

@lonvia
Copy link
Collaborator

lonvia commented Jan 11, 2018

@karussell Should we create a world-wide DB with this PR to play around with?

Can we move the (very interesting) testing discussion into a separate issue?

@karussell
Copy link
Collaborator

@hbruch are you using your newly added tests or can you recommend what else has to be green?

@karussell Should we create a world-wide DB with this PR to play around with?

Let us fix&merge the "decay" problem before (#296) and then I'll do one global export with this PR.

@hbruch
Copy link
Collaborator Author

hbruch commented Jan 22, 2018

@hbruch are you using your newly added tests or can you recommend what else has to be green?

Yes, though there are some failing tests because photon does not return the expected result as first result. Returning them at least in the first couple of matches is better than not at all, but I was unsure about tuning the test case for photon by increasing the limit to allow looking into not only the first result.

I checked against an internal reference database of (german) addresses and for the germany test-suite of geocoder-tester, all formerly green tests stayed green. This is hopefully the case for the planet db as well, but still to check. What I didn't check, but surely is worth doing, is behavior for british addresses, i.e. postal codes.

@karussell karussell mentioned this pull request Feb 19, 2018
4 tasks
@karussell
Copy link
Collaborator

karussell commented Feb 24, 2018

@karussell Should we create a world-wide DB with this PR to play around with?

@lonvia please have a look #314 (comment)

@karussell
Copy link
Collaborator

Ah, the problem is now that photon.komoot.de also runs with this data and we cannot easily see the improvements. This was my fault writing this into the "latest DB"

@karussell
Copy link
Collaborator

There are many more tests passing now, nice! Especially in geocoder_tester/world/france/nordpasdecalais/test_public_buildings_addresses.csv which are related to this issue, I think.

But it looks like some got worse but when looking in the details it seems that the tests are too strict (radius=100m) and the OSM data is too bad (just a way, no housenumbers, like for 60, avenue de Lattre-de-Tassigny 59190 Hazebrouck)

@lonvia
Copy link
Collaborator

lonvia commented Feb 25, 2018

I noticed one oddity: partial addresses with a numerical house number do much worse now than the same address with a housenumber with letter. Search for 'Kiefernstr 9a' and you get the buildings returned. Search for 'Kiefernstr 9' and there are only roads, even though there are results that should match in the database ('Kiefernstr 9, dresden' for one example). Is that somehow related to these changes or should I open a new issue?

@hbruch
Copy link
Collaborator Author

hbruch commented Feb 25, 2018

Could we verify, if it performs worse than before? Current photon.komoot.de db is built with the same version, isn't it?

A slightly different weighting could occur due to an added 'unique' filter, but I think the general issue existed before.

Searching for Kiefernstr 9 returns streets with postalcodes starting with 9 (as postal codes are sliced by the egde-ngram tokenizer).

Not sure, if we should support search as you type for partial numerals, neither postal codes nor housenumbers.

@lonvia
Copy link
Collaborator

lonvia commented Feb 25, 2018

Ok, the partial search would explain it. I noticed some other oddities with that but that's really for a different issue. Otherwise this looks ok to me.

Copy link
Collaborator

@lonvia lonvia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Kiefernstr anomaly has disappeared with the latest test instance. Maybe that was just due to the mixed setup. All tests I could come up with look good.

@hbruch
Copy link
Collaborator Author

hbruch commented Mar 8, 2018

I still wonder why there are apparently more new failing than new passing tests now. Will have a look into it tomorrow.

@karussell
Copy link
Collaborator

karussell commented Mar 8, 2018

Thanks!

And yes, this is strange. Still at least 10 are wrongly reported as written in #314 (comment) regarding the geocoder-tester bug.

@hbruch
Copy link
Collaborator Author

hbruch commented Mar 9, 2018

Thank you for setting up the both instances.

Analyzing the differences, I see different issues, non of which I see related to this PR:

  1. Most cases I looked into had no housenumber in OSM.
    As street deduplication is not deterministic, in some cases photon returns matching
    osm ways, which may be farer away than the usually allowed tolerance of 100m.
    This will result in flaky tests, as they may fail for one run/instance but not for another.
    I think we should change the tolerance, if this is ok for the geocoder-tester team as well.

  2. querying only for one/two digit numbers with location bias is sometimes not
    returning a house number close to the location, but only a street with a postal code
    starting with these two digits (which is be pretty unspecific)
    http://138.201.17.75:2288/api?q=57&lon=6.155775&lat=49.08675680

  3. a couple of tests are marked as failing and passing.
    Last whitespace character in the query string results in a comparison issue, as lines from compare report are imported with line.strip(). For me, utf-8/cp1252 issues caused trouble as well.

  4. A very strange issue to me is " 20 rue jules saulnier, 93200 saint denis".
    This has a different postal code (93400) in the new photon instance.
    Don't think anything changed there with Fix issue 221 (partially) #288, but possibly there is a non-determinisnm/bug exporting
    estimated postal_codes from nominatim(?) .

  5. differing result counts.
    E.g. for airport tests I checked, I get less than limit results for the Fix issue 221 (partially) #288 branch. Probably because Fix issue 221 (partially) #288 is not rebased and master already contains increase limit to improve de-duplication count #301

So to me, #288 seems to perform better on housenumbers.
However, some non-determinisms intervene with easy test result interpretation.

@karussell
Copy link
Collaborator

Oh, wow. Thanks for this in-depth analysis. This makes sense! So as a result: IMO we can merge this issue and need to improve our testing pipeline.

For me, utf-8/cp1252 issues caused trouble as well.

This could be an issue with the one server I had with utf-8 (will try on a different)

a couple of tests are marked as failing and passing

I think this is unrelated to the utf-8 things as this happened to me with geocoder-tester for my local proper functioning laptop too (you can run just these tests to see what I mean)

Could we reduce the non-deterministic behaviour somehow (e.g. using a sorted search)?

@karussell karussell merged commit a73a5da into komoot:master Mar 11, 2018
@karussell karussell added this to the 0.3.0 milestone Mar 25, 2018
@hbruch hbruch deleted the issue_221 branch May 23, 2020 07:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unable to find address even if exact match
3 participants