release 0.3 preparation #314
I have created a branch whose HEAD I'll use for the tests: https://github.com/komoot/photon/tree/test03
This is now testable at http://46.4.67.134:2322/api?q=dresden
Somehow many tests in geocoder_tester/world/poi/test_airports.csv fail now (which issue could be causing this?)
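For anyone who wants to check such a test instance programmatically rather than in the browser, here is a minimal sketch (assuming Java 11+ for java.net.http; host, port and query are taken from the link above):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal sketch: fetch the GeoJSON response of the test instance and print it.
// Swap host/port to query another instance (e.g. photon.komoot.de) for comparison.
public class QueryTestInstance {
    public static void main(String[] args) throws Exception {
        String url = "http://46.4.67.134:2322/api?q=dresden&limit=5";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // GeoJSON FeatureCollection of matches
    }
}
```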
@karussell Do we have a version built from master somewhere for comparison? photon.komoot.de seems to be running the test03 branch already as well.
Yeah, photon uses the current data (where, due to indexing, most of this is fixed), although with an older jar file (beta3). I've therefore triggered an export with the latest master again and will later trigger another export with only #288 merged. (Issue #306 can be added later for both, and #311 should be handled separately, as otherwise we do not know which issue is causing which problems.)
Ok, I have now:
For e.g. q=80&lat=49.166561&lon=6.040720 you can see that the installations are really different :) (Might be very slow, as I don't really have enough RAM for both on the same server.)
Uhm, no. There is something wrong with the indices of the branch.
Have now merged only #288: http://46.4.67.134:2288/api?q=Malplaquetstra%C3%9Fe%205A,%2013347%20Berlin,%20Germany
The master seems to return the correct value as well, but I probably did something wrong there, because when you test autocompletion only on GraphHopper Maps this address cannot be found. So I would merge this feature.
But the real issue is that we have to do world-wide testing for any pull request that changes indexing, to be sure that we have no regression. If we do this without automation, the reviewer of the PR could have a hard time reproducing the results because the submitter used a different data set or whatever, or we would risk merging broken code. But as a world-wide export takes so long, we cannot easily do an export in parallel to our regular "master-export mechanism", at least not if we have more than one PR per month. Requiring a world-wide export from the PR submitter would also make things worse. The ideal scenario would be:
Perhaps we can improve the situation by distinguishing between index changes and export changes. If only indexing is affected, #293 (with identical id handling added) will improve import times by a factor of 5-10. Providing a world-wide Nominatim JSON dump would also reduce the burden on submitters at least a bit, as they would not need to set up a planet Nominatim instance to verify their changes themselves before requesting a pull (thinking of me :-)
To be strictly comparable, I would work with static test data in any case. A JSON dump is one option. The other option would be to create a database dump of a Nominatim database that contains only what is strictly necessary to create a Photon dump. I'd then set that up on a different machine and just do exports round the clock for all open PRs as well as the latest master. It still takes two days before they are available, but that should be manageable with the number of PRs we have at the moment. We could also provide the dump publicly for developers, so that their Nominatim setup would just consist of running pg_restore.
Oha, now I understand why the state of master seemed to be the same as the new branch and why there were two indices: the two photon instances on the same server joined the same cluster :D!
New attempt:
Now we see the difference for e.g. q=Malplaquetstra%C3%9Fe%205A,%2013347%20Berlin,%20Germany
I ran geocoder-tester for both and the differences are listed here: https://gist.github.com/karussell/c847cf4915151fac4c473a932f285e1c
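Independent of geocoder-tester, a quick way to spot-check a single query against two instances is to fetch and compare the raw responses. A rough sketch (the ports are the test instances from this thread; geocoder-tester itself checks expected results from its CSV files rather than diffing JSON):

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Rough sketch: send the same query to two photon instances and report whether
// the raw GeoJSON responses differ -- handy for a quick manual regression check.
public class CompareInstances {
    static String fetch(HttpClient client, String base, String query) throws Exception {
        String url = base + "/api?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8);
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String query = "Malplaquetstraße 5A, 13347 Berlin, Germany";
        String branch = fetch(client, "http://46.4.67.134:2288", query); // instance with #288
        String master = fetch(client, "http://46.4.67.134:2322", query); // master instance
        System.out.println(branch.equals(master) ? "identical responses" : "responses differ");
    }
}
```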
Fixed the UTF problem of the test runner and used the latest geocoder-tester, which avoids the duplicate entries: basically all failures & successes for geocoder_tester/world/france/lorraine/test_addresses.csv disappeared, as did the failure of "eiffel to" and the success of "piazza di villa carpegna 00165 roma". See the nice analysis by @hbruch in #288 (comment).
The easiest was to provide a master branch with 306 merged, but 288 with 306 would also be easy, as 306 only affects query time. The result is that 8 new tests pass:
Ok, I tested 288 with 306 and there are 17 new tests passing, but 2 new failures:
The first test is
which now just returns "Berlin" but should return "Werder (Havel)". The second test is
which now returns 1. "Doberlug-Kirchhain Rosa ..." and 2. "Berlin Rosa ..." but should be 1. "Berlin" and 2. "Doberlug-Kirchhain".
The Doberlug-Kirchhain address has a name attribute with value "22". This match boosts its ranking more than the context-only match of the Berlin result. To me, this looks more like a data issue than a photon issue.
Ah, ok. Makes sense. I have also edited the comments regarding the first test, which I initially misunderstood. Now this should pass IMO, but it is not even returned for limit=10, only if I use
It is also not wrong to return Berlin and already
Using a limit of 64, the seventh (and last) result is the intended one. I suppose many matches are discarded due to street deduplication: http://138.201.17.75:2288/api?lat=52.3879&lon=13.0582&limit=64&q=Friedrichstra%C3%9Fe
But it seems that the formula does not spread sufficiently, and I'd guess that Elasticsearch's BM25 similarity ranking with field-length norm ranks the Berlin Friedrichstraße much higher than the one in Werder (Havel).
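For reference, the textbook BM25 scoring being referred to looks like this (a sketch; k1 = 1.2 and b = 0.75 are the usual Elasticsearch defaults, the similarity actually configured in the index may differ):

```latex
\mathrm{score}(D,q) = \sum_{t \in q} \mathrm{IDF}(t)\cdot
  \frac{f(t,D)\,(k_1+1)}
       {f(t,D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
```

Here f(t,D) is the term frequency in the field, |D| the field length and avgdl the average field length; the factor 1 - b + b·|D|/avgdl is the field-length norm, which favours matches in shorter fields, in line with the speculation above.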
The Friedrichstrasse ranking comes from the importance. The street in Berlin is important enough to have its own Wikipedia entry. I wouldn't call it an error if it appears first in the list when the search is anywhere near Berlin. I've played around with the 306 test instance and it looks good. Pending my nitpicking comments over in the PR, merging sounds fine.
#311 is now ready for testing: http://138.201.17.75:2311/api?q=%20rue%20du%20mont%20d%27or%20Saint-Didier-au-Mont-d%27Or
When comparing to the old state where I merged 306 into 288 (which should be ~master), this results in 19 new failures and 81 new passes. Hmmh, maybe I made an error somewhere, as the house numbers should not make a difference (?). Will trigger a clean master export to make sure.
Hm. Comparing http://138.201.17.75:2311 and http://photon.komoot.de for all failures, I don't see any differences(?). The 8 failures for cities in Poland have their root cause in #95: photon returns admin level 6, 7 and 8 boundaries, and currently the admin level 8 boundary (city/town) is apparently returned second for the failing tests. A quick fix would be to increase the limit in geocoder_tester, but I think solving #95 would be the better fix. For 10 failures there seems to be no housenumber information in OSM, resulting again (as already for #288) in ways farther away than 100m. Only for 5 RUE LEO LAGRANGE 59265 AUBIGNY AU BAC is the correct housenumber returned second. These partial postal code matches are annoying and merit their own issue. To me, all failures are unrelated to 311.
The problem is that the data there is updated automatically. I need to avoid linking these test exports to "photon-db-latest". As a new export will be triggered, we can wait and see the differences in a few days.
Thanks for your analysis!
The new master including a fresh export is now running at http://46.4.67.134:2322/api?q=dresden and it does not work for the cases mentioned in #311 (the official photon demo will probably update to it automatically soonish too). Will test tomorrow how #311 compares to it. Update: the result of the comparison is here.
So this looks even better than the comparison before, except that these failures were added:
I don't understand the Dresden problem: the Leipziger Bahnhof is really close, but it doesn't have any "Neustadt" name in its properties. Maybe we need to weight stations according to the number of platforms? E.g. Dresden-Neustadt has 4. Also, "historic" should get a lower importance, as it means "inactive". It looks like the node for Dresden-Neustadt does not appear within the first 10 results.
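Purely to illustrate that idea (not photon code; platformCount and the historic flag are hypothetical values that would have to be derived from OSM tags), such a weighting could look roughly like this:

```java
// Hypothetical sketch only -- not photon's actual importance handling.
public class StationImportance {
    // Boost stations with more platforms and damp "historic" (i.e. inactive) ones.
    static double adjust(double importance, int platformCount, boolean historic) {
        double boosted = importance * (1.0 + 0.1 * Math.log1p(platformCount));
        return historic ? boosted * 0.5 : boosted;
    }

    public static void main(String[] args) {
        System.out.println(adjust(0.4, 4, false)); // e.g. Dresden-Neustadt with 4 platforms
        System.out.println(adjust(0.4, 1, true));  // a single-platform historic station
    }
}
```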
Oh, but in general this POI search is bad (for the master instance too!). E.g. if you search for "Bahnhof Dresden", the main station is not returned within the first 10 results, because its official name is "Dresden Hbf" and only the ref name is "Dresden Hauptbahnhof" (so still different from "Dresden Bahnhof")! So we should probably merge #311 and improve this via #318 or related.
Oh, yes. That is the right question. Maybe it's the random ordering we get when feeding, as @hbruch already speculated in a different issue? I think we should abandon all iterations over HashSets and use LinkedHashSets instead (potentially possible with maps too); see the sketch below.
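A minimal illustration of that point (plain JDK collections, Java 9+, not photon code): HashSet iteration order follows the hash table layout rather than insertion order, so a feed driven by it is effectively unordered, while LinkedHashSet iterates in insertion order.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Not photon code -- just demonstrates the iteration-order difference.
public class SetOrderDemo {
    public static void main(String[] args) {
        List<String> docs = List.of("Berlin", "Werder (Havel)", "Dresden-Neustadt", "Doberlug-Kirchhain");

        Set<String> hashed = new HashSet<>(docs);        // order depends on hash codes and table size
        Set<String> linked = new LinkedHashSet<>(docs);  // order is the insertion order above

        System.out.println("HashSet:       " + hashed);
        System.out.println("LinkedHashSet: " + linked);
    }
}
```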
Will merge and then produce a new export again.
New data for the latest branch is here: http://46.4.67.134:2322/api?q=berlin
Any objections to releasing this state as 0.3?
Fine with me.
Fine with me as well.
Released now at https://github.com/komoot/photon/releases/tag/0.3.0
Will try to build a global and nl+de export for the 0.3-beta3 release including the following PRs:
If this is positive, we'll include the specific PR. If not, we'll release 0.3.0 without it.