Places: Normalize Location Names #1664

Closed
lastzero opened this issue Oct 25, 2021 · 33 comments
Labels: enhancement, released, ux

@lastzero
Member

Location metadata sometimes contains abbreviations for US states instead of their full name:

https://assets.ltkcontent.com/files/US-State-Abbreviations.pdf

This leads to different albums for the same state in https://demo.photoprism.org/states.

The obvious solution is to expand abbreviations and always use the full name.
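
As a rough illustration of the expansion described above, here is a minimal Python sketch. It is not PhotoPrism's implementation: the table `US_STATES` is a tiny illustrative subset and the function name `normalize_state` is hypothetical.

```python
# Illustrative subset only (assumption); a real table would cover every state.
US_STATES = {
    "CA": "California",
    "NY": "New York",
    "TX": "Texas",
}


def normalize_state(state: str, country_code: str) -> str:
    """Expand a US state abbreviation to its full name; leave other values as-is."""
    name = state.strip()
    if country_code.lower() == "us":
        # Unknown values (including full names) fall through unchanged.
        return US_STATES.get(name.upper(), name)
    return name


print(normalize_state("CA", "us"))          # abbreviation is expanded
print(normalize_state("California", "us"))  # full name is kept
```

Keying the lookup on the country code avoids expanding a two-letter value that happens to match a US abbreviation in another country.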

@lastzero lastzero added the enhancement Optimization, improvement or maintenance task label Oct 25, 2021
@lastzero lastzero self-assigned this Oct 25, 2021
@lastzero lastzero added the ux Impacts User Experience label Oct 27, 2021
@lastzero lastzero added the please-test Ready for acceptance test label Nov 9, 2021
@kvalev
Contributor

kvalev commented Nov 11, 2021

Quick question - do I need to run a full rescan for the new names to be applied? And are existing "duplicate" state albums going to be merged?

@lastzero lastzero changed the title from "Places: Expand abbreviations of US states to full names" to "Places: Expand abbreviations of states to their full names" Nov 11, 2021
@lastzero
Member Author

Please wait with updating your production instance. It's not done yet! Found a few issues while testing...

@lastzero
Member Author

In particular, the problem also exists for other countries. Not just the US. Doing our best to also add a command for updating location data only, see GitHub Discussions. Was already answered there today.

@kvalev
Contributor

kvalev commented Nov 11, 2021

> Please wait with updating your production instance. It's not done yet! Found a few issues while testing...

Okay will postpone it for now. I guess you can remove the please-test label just in case anyone else decides to give it a go as well.

@kvalev
Contributor

kvalev commented Nov 11, 2021

> In particular, the problem also exists for other countries. Not just the US. Doing our best to also add a command for updating location data only, see GitHub Discussions. Was already answered there today.

Yes, I encountered the same problem with other countries as well, for example ones that do not use Latin script (in my case Arabic or Cyrillic). Sometimes the state is in Latin, sometimes in Arabic/Cyrillic.

@lastzero
Member Author

Well, we are testing right now.... please-test doesn't mean there are no issues anymore! 😉

@lastzero
Member Author

@kvalev You are welcome to send PRs with additional mappings 👇

https://github.com/photoprism/photoprism/blob/develop/pkg/txt/states.go

@lastzero
Member Author

Started a new Development Preview build for testing (wait until it is green and has been uploaded to Docker Hub):

https://drone.photoprism.app/photoprism/photoprism/2300

@lastzero lastzero added the please-test Ready for acceptance test label Nov 11, 2021
@lastzero lastzero changed the title from "Places: Expand abbreviations of states to their full names" to "Places: Normalize state names worldwide" Nov 11, 2021
@lastzero lastzero changed the title from "Places: Normalize state names worldwide" to "Places: Normalize state names" Nov 12, 2021
@lastzero lastzero changed the title from "Places: Normalize state names" to "Places: Normalize State Names" Nov 12, 2021
@lastzero
Member Author

Next Development Preview comes with a photoprism places update command! 💐

https://drone.photoprism.app/photoprism/photoprism/2306/1/5

It's brand new and pretty much untested outside my development environment. Please report issues if you find any. Since this will send many requests per second to our backend API, we need to have an eye on performance / server load. Therefore, at this time, it is only for sponsors who help us finance the infrastructure.

@lastzero
Member Author

lastzero commented Nov 20, 2021

No we don't... and our todo list doesn't allow spending weeks with this either :D

Let's see if the current solution works for everyone. It's actually "better" than what other geodata services provide in that we try to fix the data. Just like we normalize a ton of other details you might not notice.

Take this example:

https://places.photoprism.app/v1/location/357ca2f2d44c

The international names "deoksugung-gil" and "seoul" are returned as keywords for searching whereas they were originally put in the city and street fields (in brackets after the local name).

@kvalev
Contributor

kvalev commented Nov 20, 2021

> No we don't... and our todo list doesn't allow spending weeks with this either :D

Pity. But if you agree with this feature in principle and are willing to host the data and open source the project, I can contribute it. This will also significantly improve the data quality.

> Let's see if the current solution works for everyone. It's actually "better" than what other geodata services provide in that we try to fix the data. Just like we normalize a ton of other details you might not notice.

I ran a few more tests on my data, and in some cases it works quite nicely, for example Mexico City.
In other cases it's probably better if there isn't a fallback :D For example, the Khumjung village in Nepal is now in the state of Khumjung instead of Eastern Development Region. The solution I mentioned above correctly computes the state. What's more, the data provides several translations, which opens the door to some localization options.

For completeness' sake, here is the Overpass query:

is_in(27.991812625201433, 86.84189688596203);
area._[admin_level="4"];
out meta;

@lastzero
Member Author

> > No we don't... and our todo list doesn't allow spending weeks with this either :D
>
> Pity. But if you agree with this feature in principle and are willing to host the data and open source the project, I can contribute it. This will also significantly improve the data quality.

What server hardware does it need? When our funding improves, we can consider it ;)

Maybe we can also use a public API for free if we ask politely. You're welcome to reach out in our name.

> For completeness' sake, here is the Overpass query:
>
> is_in(27.991812625201433, 86.84189688596203);
> area._[admin_level="4"];
> out meta;

From what I understood, a simple HTTP request with this as query parameter will do, in case a field is missing?

@kvalev
Contributor

kvalev commented Nov 20, 2021

> What server hardware does it need? When our funding improves, we can consider it ;)

The hardware requirements are surprisingly low. From http://overpass-api.de/no_frills.html:

> Concerning hardware, I suggest at least 4 GB of RAM. The more RAM is available, the better, because caching of disk content in the RAM will significantly speed up Overpass API. The processor speed will have little relevance. For the hard disk, it depends on what you want to install. A full planet database with minutely updates should have at least 250 GB of hard disk space at disposal. Without minute diffs and meta data, 100 GB would already suffice.

> Maybe we can also use a public API for free if we ask politely. You're welcome to reach out in our name.

The usage policy of some of the public instances is fairly liberal:
https://wiki.openstreetmap.org/wiki/Overpass_API#Public_Overpass_API_instances

I can ask the Kumi Systems project on your behalf if you want to go this route.

> From what I understood, a simple HTTP request with this as query parameter will do, in case a field is missing?

Exactly, the cURL request would be:

curl --location --request POST 'https://overpass-api.de/api/interpreter' \
--header 'Content-Type: application/x-www-form-urlencoded' \
--data-urlencode 'data=is_in(48.2082517,16.3742072);area._[admin_level="4"];out meta;'
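
The same request can be sketched in Python using only the standard library. This is a minimal sketch that mirrors the cURL call above; the helper names `build_state_query` and `build_request` are hypothetical, and the request object is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint taken from the cURL example above.
OVERPASS_URL = "https://overpass-api.de/api/interpreter"


def build_state_query(lat: float, lng: float, admin_level: int = 4) -> str:
    """Build the Overpass QL query for the area at the given admin level."""
    return f'is_in({lat},{lng});area._[admin_level="{admin_level}"];out meta;'


def build_request(lat: float, lng: float) -> Request:
    """Prepare (but do not send) the form-encoded POST request."""
    body = urlencode({"data": build_state_query(lat, lng)}).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(OVERPASS_URL, data=body, headers=headers, method="POST")


req = build_request(48.2082517, 16.3742072)
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
```

Passing `req` to `urllib.request.urlopen` would actually send it; when doing so against a public instance, its usage policy applies.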

@lastzero
Member Author

Picture and data quality is very important to us. If this is the (best) way to improve it, then let's do it.

Next, we need to finish multi-user support, which should keep us busy until holiday season starts. So other enhancements may need to wait until early next year.

@lastzero
Member Author

Started another Development Preview build for testing: https://drone.photoprism.app/photoprism/photoprism/2337/1/5

@kvalev
Contributor

kvalev commented Nov 27, 2021

I see that you have modified the behavior of the places API with respect to the state. Do you mind sharing what's changed?

@lastzero
Member Author

Are you happy with the results? It's a custom S2 cell based backend which adds and merges missing information from Photon and Nominatim. We've set up a new instance using the latest OSM data in the past few days. That's also why we didn't release earlier.

@kvalev
Contributor

kvalev commented Nov 28, 2021

Hi @lastzero, I think it works quite well now, especially compared to the previous version (I don't remember the one before that tbh). All of the errors that I have encountered in my tests so far fall into one of the following two categories:

  • In border regions (most often exactly on the border) sometimes the data is wrong (correct state, but wrong country or the other way around). For example here is one cell where Bayern is in Austria (which might even be offensive to some :D)
  • In Eastern Europe, Asia and Africa the state data is a bit unreliable - sometimes a "state" is just a single beach or maybe a neighborhood (not even the whole city).

@lastzero
Member Author

@kvalev see:

If we have different data for a border region, it is probably because we are using S2 Cells, which are cells and not exact coordinates. So if 1m decides the state or city, you may get a different response.

On the other hand, you will be revealing less specific information that will protect your privacy to a greater degree than other APIs.

If you really want to give us the exact location down to the millimeter, I suppose we could make that possible. But that only makes sense if the GPS in your device has that kind of resolution.

@kvalev
Contributor

kvalev commented Nov 28, 2021

In my example you already have the exact location down to a centimeter, so that should help with debugging. Anyway, I understand GPS accuracy quite well, so in a few edge cases like borders most users would accept the location being a bit off, but in my example the data is straight up wrong. The state was correctly determined, but it was placed in a different country?!?! Especially when you consider that the Nominatim API returns the correct data.

@lastzero
Member Author

lastzero commented Nov 28, 2021

We don't use the exact location, our cells are about 5m:

👉  https://s2geometry.io/resources/s2cell_statistics
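
The cell size can be checked from an ID itself. The sketch below assumes the 12-digit hex IDs used by the places API (such as `357ca2f2d44c` in the example URL earlier in this thread) are standard S2 cell tokens, i.e. the 64-bit cell ID in hex with trailing zeros stripped. Under that assumption, the position of the lowest set bit gives the cell level. The function names are hypothetical.

```python
def s2_token_to_id(token: str) -> int:
    """Convert an S2 token (optionally prefixed with 's2:') to its 64-bit cell ID."""
    hex_digits = token.split(":")[-1]
    # Tokens drop trailing zeros; pad back to 16 hex digits (64 bits).
    return int(hex_digits.ljust(16, "0"), 16)


def s2_level(cell_id: int) -> int:
    """Return the S2 level (0..30) encoded by the lowest set bit of the cell ID."""
    if cell_id == 0:
        raise ValueError("0 is not a valid S2 cell ID")
    lowest_bit = cell_id & -cell_id
    # The marker bit sits at position 2 * (30 - level).
    return 30 - (lowest_bit.bit_length() - 1) // 2


print(s2_level(s2_token_to_id("357ca2f2d44c")))  # prints 21
```

Level 21 is in line with the "about 5m" figure when compared against the cell statistics linked above.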

@lastzero
Member Author

Going to take a look at the Bayern example as soon as possible. Grilling a steak now. It's Sunday evening :)

@lastzero
Member Author

Probably Photon is off and Nominatim is right. The merged result then doesn't make sense.
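
To illustrate how merging field by field can produce a result like "Bayern, Austria", here is a purely hypothetical Python sketch. The response dictionaries, the field names, and both merge functions are assumptions made for illustration; they do not describe how the places backend actually works.

```python
def merge_fieldwise(primary: dict, fallback: dict) -> dict:
    """Take each non-empty field from primary, filling gaps from fallback."""
    merged = dict(fallback)
    merged.update({key: value for key, value in primary.items() if value})
    return merged


def merge_consistent(primary: dict, fallback: dict) -> dict:
    """Fill gaps only if both sources agree on the country.

    Trusting the fallback on disagreement is an assumption of this sketch.
    """
    if primary.get("country") and primary["country"] != fallback.get("country"):
        return dict(fallback)
    return merge_fieldwise(primary, fallback)


# Hypothetical responses for a point just on the German side of the border.
photon = {"country": "Austria", "state": ""}
nominatim = {"country": "Germany", "state": "Bayern"}

print(merge_fieldwise(photon, nominatim))   # {'country': 'Austria', 'state': 'Bayern'}
print(merge_consistent(photon, nominatim))  # {'country': 'Germany', 'state': 'Bayern'}
```

The first merge mixes a country from one source with a state from the other, which is the kind of inconsistent record reported above; the second keeps each record internally consistent.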

@lastzero
Member Author

Send more examples so we can use them for our automated tests. Can't fix issues we don't know about.

@kvalev
Contributor

kvalev commented Nov 28, 2021

I looked through most of the locations and here are the most obvious ones that I could spot:

Places on the German/Austrian border where the state is Bayern and the country is Austria:

s2:479cf539b754
s2:479cf539c424
s2:479cf539c6b4
s2:479cf539da1c

s2:479c8df0b68c
s2:479c9209e37c
s2:479c920b7314
s2:479c920b731c

s2:479c8acaae24
s2:479c8acaae74
s2:479c8acab2cc
s2:479cf5350b64
s2:479cf5350b7c

Places on the Bulgarian/Greek border, where the name is in Bulgarian, the street is in Greek, label and country are Bulgaria, but the state and keywords are in Greek:

s2:14a988c8c95c
s2:14a988ce6e44
s2:14a988ce6fa4
s2:14a988d41bdc
s2:14a988d50efc

And as a good example, here is a place on the Austrian/Swiss border, where everything is correct:

s2:479b4b5ebf44

Hope this helps.

@lastzero
Member Author

Excellent, that helps a lot! If it's that obvious, it should be easy to fix.

The reason why we don't just forward all requests to Nominatim is performance: usually it takes more than 100 ms to process a request. This can significantly reduce indexing performance if you otherwise have a fast CPU and internet connection. It's all about tradeoffs here.

When you can waste unlimited resources, it's easier to implement a "great" solution.
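
A back-of-the-envelope calculation shows why 100 ms per request matters. This sketch assumes one uncached lookup per photo and uses the 100 ms figure above as the per-request latency; the library size and the function name are made up for illustration.

```python
def added_indexing_seconds(photos: int, latency_ms: float = 100.0, parallel: int = 1) -> float:
    """Estimate the extra indexing time spent waiting on location lookups."""
    return photos * latency_ms / 1000 / parallel


# Hypothetical library of 10,000 geotagged photos, looked up one at a time.
seconds = added_indexing_seconds(10_000)
print(f"{seconds:.0f} s (~{seconds / 60:.1f} min)")
```

Running lookups in parallel or caching results per cell would reduce this, at the cost of more load on the geocoding service.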

@lastzero
Member Author

lastzero commented Nov 28, 2021

Follow up issue:

@lastzero lastzero added released Available in the stable release and removed please-test Ready for acceptance test labels Nov 28, 2021
Projects
Status: Release 🌈
3 participants