Fix issue 221 (partially) #288

hbruch · 2018-01-10T13:46:33Z

Fix #221 (partially). This PR improves housenumber matching, especially in cases where the OSM housenumber suffix

is in uppercase as in Malplaquetstraße 5A, 13347 Berlin, Germany (see Unable to find address even if exact match #221) or
is separated via whitespace as in Osterstr. 42 A Hannover, Germany.

Tests are provided via geocoders/geocoder-tester#28 which highlight the improvements as well as further existing issues.

Still open:

Missing fallback in case a housenumber suffix not existing in the data is requested (e.g. Lister Meile 29D, Hannover)
instead of the exactly matching housenumber a match with a name attribute and matching numeral ngram might be returned (Anderter Straße 1 Hannover)

Signed-off-by: Holger Bruch <holger.bruch@systect.de>

karussell · 2018-01-10T13:49:13Z

Cool, thanks!

May I ask how you test and iterate in general? Do you have a nominatim installation and feed photon after each change or do you reindex somehow?

hbruch · 2018-01-10T15:03:22Z

I'm running a local nominatim installation with a smaller region from which I feed photon. For this region, I'm running a small test set via geocoder/geocoder-tester. If results look promising, I deploy it to a server running a nominatim instance with a germany dataset which is fed to photon again (takes about 10h). For this I run geocoder-test for germany and an internal reference dataset of ~1000 addresses.

For the reference dataset, some aggregated stats (Max distance to reference coord < 100m? Dice similarity coefficient > 0,5? housenumber found?) give a good indication of match quality. However, these tests have a bias to whole address geocoding, so type-ahead behavior is something I check manually via (limited) explorative testing.

Reindexing is a good idea and might significantly speed up development cycles, did you already? Another option I thought of was importing the json-dump. This would lower initial barriers for new contributers, as they would not need to setup nominatim (but currently no JsonImporter exists, does it?).

How do you iterate?

karussell · 2018-01-10T15:18:59Z

How do you iterate?

Never touched major feeding stuff before and modified just the querying part. But as I'm probably now a bit more into this topic I'll do similar to you and will also create country extracts easier see #281

We should invest in this topic as this makes developing new stuff easier. Also performance testing is important but complex ... so many things to do :)

Reindexing is a good idea and might significantly speed up development cycles, did you already?

No

This would lower initial barriers for new contributers, as they would not need to setup nominatim

Yes, good idea. And I hope with the extracts this will improve the work too

but currently no JsonImporter exists

Yes

hbruch · 2018-01-10T20:47:32Z

I just realized that the json dump format corresponds to the elastic bulk format, so it should be feasible to provide an importer. Various bulk importer exist (stream2es, bulkEs), but photon should provide a self-contained json import.

I think a reindexing of an existing photon index would not work, as all context information is dropped from source during the import. (if it should be indexed at all is another question, I'm currently thinking about, as many unexpected matches are caused by 'invisible' context matches...)

lonvia · 2018-01-11T19:55:28Z

@karussell Should we create a world-wide DB with this PR to play around with?

Can we move the (very interesting) testing discussion into a separate issue?

karussell · 2018-01-22T11:48:39Z

@hbruch are you using your newly added tests or can you recommend what else has to be green?

@karussell Should we create a world-wide DB with this PR to play around with?

Let us fix&merge the "decay" problem before (#296) and then I'll do one global export with this PR.

hbruch · 2018-01-22T13:25:56Z

@hbruch are you using your newly added tests or can you recommend what else has to be green?

Yes, though there are some failing tests because photon does not return the expected result as first result. Returning them at least in the first couple of matches is better than not at all, but I was unsure about tuning the test case for photon by increasing the limit to allow looking into not only the first result.

I checked against an internal reference database of (german) addresses and for the germany test-suite of geocoder-tester, all formerly green tests stayed green. This is hopefully the case for the planet db as well, but still to check. What I didn't check, but surely is worth doing, is behavior for british addresses, i.e. postal codes.

karussell · 2018-02-24T17:38:19Z

@karussell Should we create a world-wide DB with this PR to play around with?

@lonvia please have a look #314 (comment)

karussell · 2018-02-24T17:42:53Z

Ah, the problem is now that photon.komoot.de also runs with this data and we cannot easily see the improvements. This was my fault writing this into the "latest DB"

karussell · 2018-02-25T13:07:51Z

There are many more tests passing now, nice! Especially in geocoder_tester/world/france/nordpasdecalais/test_public_buildings_addresses.csv which are related to this issue, I think.

But it looks like some got worse but when looking in the details it seems that the tests are too strict (radius=100m) and the OSM data is too bad (just a way, no housenumbers, like for 60, avenue de Lattre-de-Tassigny 59190 Hazebrouck)

lonvia · 2018-02-25T15:40:37Z

I noticed one oddity: partial addresses with a numerical house number do much worse now than the same address with a housenumber with letter. Search for 'Kiefernstr 9a' and you get the buildings returned. Search for 'Kiefernstr 9' and there are only roads, even though there are results that should match in the database ('Kiefernstr 9, dresden' for one example). Is that somehow related to these changes or should I open a new issue?

hbruch · 2018-02-25T17:57:11Z

Could we verify, if it performs worse than before? Current photon.komoot.de db is built with the same version, isn't it?

A slightly different weighting could occur due to an added 'unique' filter, but I think the general issue existed before.

Searching for Kiefernstr 9 returns streets with postalcodes starting with 9 (as postal codes are sliced by the egde-ngram tokenizer).

Not sure, if we should support search as you type for partial numerals, neither postal codes nor housenumbers.

lonvia · 2018-02-25T18:53:12Z

Ok, the partial search would explain it. I noticed some other oddities with that but that's really for a different issue. Otherwise this looks ok to me.

lonvia

The Kiefernstr anomaly has disappeared with the latest test instance. Maybe that was just due to the mixed setup. All tests I could come up with look good.

hbruch · 2018-03-08T22:01:55Z

I still wonder why there are apparently more new failing than new passing tests now. Will have a look into it tomorrow.

karussell · 2018-03-08T22:04:52Z

Thanks!

And yes, this is strange. Still at least 10 are wrongly reported as written in #314 (comment) regarding the geocoder-tester bug.

hbruch · 2018-03-09T13:17:55Z

Thank you for setting up the both instances.

Analyzing the differences, I see different issues, non of which I see related to this PR:

Most cases I looked into had no housenumber in OSM.
As street deduplication is not deterministic, in some cases photon returns matching
osm ways, which may be farer away than the usually allowed tolerance of 100m.
This will result in flaky tests, as they may fail for one run/instance but not for another.
I think we should change the tolerance, if this is ok for the geocoder-tester team as well.
querying only for one/two digit numbers with location bias is sometimes not
returning a house number close to the location, but only a street with a postal code
starting with these two digits (which is be pretty unspecific)
http://138.201.17.75:2288/api?q=57&lon=6.155775&lat=49.08675680
a couple of tests are marked as failing and passing.
Last whitespace character in the query string results in a comparison issue, as lines from compare report are imported with line.strip(). For me, utf-8/cp1252 issues caused trouble as well.
A very strange issue to me is " 20 rue jules saulnier, 93200 saint denis".
This has a different postal code (93400) in the new photon instance.
Don't think anything changed there with Fix issue 221 (partially) #288, but possibly there is a non-determinisnm/bug exporting
estimated postal_codes from nominatim(?) .
differing result counts.
E.g. for airport tests I checked, I get less than limit results for the Fix issue 221 (partially) #288 branch. Probably because Fix issue 221 (partially) #288 is not rebased and master already contains increase limit to improve de-duplication count #301

So to me, #288 seems to perform better on housenumbers.
However, some non-determinisms intervene with easy test result interpretation.

karussell · 2018-03-09T13:23:59Z

Oh, wow. Thanks for this in-depth analysis. This makes sense! So as a result: IMO we can merge this issue and need to improve our testing pipeline.

For me, utf-8/cp1252 issues caused trouble as well.

This could be an issue with the one server I had with utf-8 (will try on a different)

a couple of tests are marked as failing and passing

I think this is unrelated to the utf-8 things as this happened to me with geocoder-tester for my local proper functioning laptop too (you can run just these tests to see what I mean)

Could we reduce the non-deterministic behaviour somehow (e.g. using a sorted search)?

hbruch added 6 commits January 10, 2018 13:51

For indexing replace ,. by whitespace so tokens are split

bfc05d2

Signed-off-by: Holger Bruch <holger.bruch@systect.de>

Analyze housenumbers with standard analyzer

1d6aa42

Signed-off-by: Holger Bruch <holger.bruch@systect.de>

Add specific analyzer to index housenumbers

072a646

Add housenumber and suffix joining to index_ngram

8f1e0e2

Add punctuationgreedy for ngram analyzers too

211edf5

Add unique token filter to avoid overweighting same single char ngrams

8501964

Signed-off-by: Holger Bruch <holger.bruch@systect.de>

hbruch mentioned this pull request Jan 12, 2018

Provide json dump importer #291

Open

karussell mentioned this pull request Jan 22, 2018

Integration external test cases #23

Open

karussell mentioned this pull request Feb 19, 2018

release 0.3 preparation #314

Closed

4 tasks

lonvia approved these changes Mar 8, 2018

View reviewed changes

karussell merged commit a73a5da into komoot:master Mar 11, 2018

This was referenced Mar 13, 2018

Hyphenated housenumber are not always found #265

Closed

Housenumber tuning #197

Closed

karussell added this to the 0.3.0 milestone Mar 25, 2018

hbruch deleted the issue_221 branch May 23, 2020 07:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix issue 221 (partially) #288

Fix issue 221 (partially) #288

hbruch commented Jan 10, 2018 •

edited by karussell

karussell commented Jan 10, 2018

hbruch commented Jan 10, 2018

karussell commented Jan 10, 2018

hbruch commented Jan 10, 2018

lonvia commented Jan 11, 2018

karussell commented Jan 22, 2018

hbruch commented Jan 22, 2018

karussell commented Feb 24, 2018 •

edited

karussell commented Feb 24, 2018

karussell commented Feb 25, 2018

lonvia commented Feb 25, 2018

hbruch commented Feb 25, 2018

lonvia commented Feb 25, 2018

lonvia left a comment

hbruch commented Mar 8, 2018

karussell commented Mar 8, 2018 •

edited

hbruch commented Mar 9, 2018

karussell commented Mar 9, 2018

Fix issue 221 (partially) #288

Fix issue 221 (partially) #288

Conversation

hbruch commented Jan 10, 2018 • edited by karussell

karussell commented Jan 10, 2018

hbruch commented Jan 10, 2018

karussell commented Jan 10, 2018

hbruch commented Jan 10, 2018

lonvia commented Jan 11, 2018

karussell commented Jan 22, 2018

hbruch commented Jan 22, 2018

karussell commented Feb 24, 2018 • edited

karussell commented Feb 24, 2018

karussell commented Feb 25, 2018

lonvia commented Feb 25, 2018

hbruch commented Feb 25, 2018

lonvia commented Feb 25, 2018

lonvia left a comment

Choose a reason for hiding this comment

hbruch commented Mar 8, 2018

karussell commented Mar 8, 2018 • edited

hbruch commented Mar 9, 2018

karussell commented Mar 9, 2018

hbruch commented Jan 10, 2018 •

edited by karussell

karussell commented Feb 24, 2018 •

edited

karussell commented Mar 8, 2018 •

edited