Increase max key and value length #2025

Open
westnordost opened this issue Oct 12, 2018 · 50 comments

Comments

@westnordost

westnordost commented Oct 12, 2018

Some OSM tags do not (always) fit into the 255-character-limit imposed by the OSM API. This is a feature request to either increase the limit or remove it altogether.

Discussion

There is a discussion on the tagging mailing list right now about how to deal with a very complicated opening hours value that is too long to fit into one tag value.

Citing @simonpoole :

We have a number of keys for which the values can easily exceed 255 chars besides opening_hours; lane destinations and conditional restrictions are good candidates. Not to mention changeset tags. In other words, it is a general problem which should be tackled with a general solution.

So we have passed the point where value lengths above 255 characters are unrealistic, even for structured data, that is, data that is not prose.

Solutions

In the above discussion, the general opinion seemed to be that the OSM API will not change (until the unicorny API 0.7 appears) and thus we have to work around this.

There have been some suggestions towards a general-purpose syntax splitting up the actual value in different keys, like opening_hours, opening_hours#2, but I think this is a very bad solution, because it requires each and every application and library that works with OSM data to implement this kind of tag concatenation - for any key, not just opening_hours, because this kind of enumeration scheme is supposed to be a generic solution.
We can already extrapolate how well that would play out by looking at the software support for things like shop=books;stationery. (Hint: not at all ;-) )

In #1593, I read that the limitation to 255 characters is actually a legacy from back when the OSM database was running on MySQL. So, removing that limitation, or increasing it somewhat, seems to be the swiftest solution because a weird workaround is avoided and it doesn't require a change in applications that use OSM data.
The only change that should be done would be to remove/change the client-side checks for tag lengths (in editors, primarily).

@tomhughes
Member

I think the answer to how to store complicated opening hours is not to store opening hours in OSM because it's insane to store something as volatile and non-geo related as that ;-)

@bhousel
Member

bhousel commented Oct 12, 2018

From my perspective, it would be super to just increase the tagvalue length in the database to something like varchar(2000), bump the API minor version, and move on. Inventing tags to work around the existing limit is just asking for trouble.

@tomhughes
Member

Well you might as well get rid of it altogether because any length limit in postgres is entirely arbitrary - the storage for char(N), varchar(N) and text is all identical - the first two just do a length check on insert is all.

That said, such large values are really stretching the basic concept of tags if you ask me, especially for the name.

@bhousel
Member

bhousel commented Oct 12, 2018

@tomhughes : there are a few legitimate geo uses of larger tagvalues, like destination sign text or inscriptions on physical monuments, things like that. We should have some kind of limit, but it should be larger than varchar(255).

Aside: I don't think OSM note text is limited by the API or database, and it should be. Uploading a multi-gigabyte OSM note is left as an exercise for the reader.

@woodpeck
Contributor

I agree with TomH - there's no technical reason for having a limit, but it serves as a reminder for people that tags are supposed to be human-readable and if you find yourself adding structured data to OSM in a way that makes you hit the 255 character limit, then you are very likely doing things you shouldn't. An opening hours string longer than 255 characters is practically impossible to understand or edit without specialist software (i.e. an editor that has support for a complex opening hours schema). This is undesirable to have, because once our tags are so complicated that they need software support to edit, they also lose any flexibility - once support is baked into N software programs, the tag is certain to never change, even if the circumstances would make it necessary.

So my suggestion is, don't use tags longer than 255 characters, and if you invent a hack around this limitation, at least it's going to be unsupported by most of the applications and therefore condemned to be a niche thing.

If opening hours become more complicated than 255 characters then I would prefer it to just revert to a human-readable string that describes the issue, which is very likely to fit in 255 characters.

@mmd-osm
Contributor

mmd-osm commented Oct 12, 2018

it would be super to just increase the tagvalue length in the database to something like varchar(2000), bump the API minor version, and move on

That will for sure not work. There are lots of downstream data processors, and you need to look at this change end-to-end - starting with osmosis for the replication, tools like osmium that have string length restrictions and special size optimizations in place, many different editors assuming certain size limits, ...

@bhousel
Member

bhousel commented Oct 12, 2018

That will for sure not work. There are lots of downstream data processors, and you need to look at this change end-to-end - starting with osmosis for the replication, tools like osmium that have string length restrictions and special size optimizations in place, many different editors assuming certain size limits, ...

Oh to be sure, I don't think this will actually ever happen. But if it did, iD would be fine with it.

@mmd-osm
Contributor

mmd-osm commented Oct 12, 2018

Yeah, right, my point was more to raise a bit of awareness to think about such a change in a much wider context. People tend to be too much focused on the API itself, but there's so much other stuff going on in other parts of the ecosystem. Things like a size limit can be buried deep into some library, and scattered across lots of applications that all need updating then.

See this tiny bit in osmium: https://github.com/osmcode/libosmium/blob/master/include/osmium/osm/types.hpp#L69 - it affects object handling + pbf parsing...

@westnordost
Author

westnordost commented Oct 12, 2018

That said, such large values are really stretching the basic concept of tags if you ask me, especially for the name.

@tomhughes Hmm, is this a real argument? What is the basic concept of tags, and why does it make sense for OSM to adhere to that definition of this basic concept of tags?

It is a matter of fact that there are approved and widely used definitions for tags that contain structured data, which can also become quite long, as we see. If you think that these things have no place in OSM, and considering how well established tags like opening hours are, does that mean that you are unhappy with the general direction OSM has taken?

@westnordost
Author

@simonpoole
Contributor

simonpoole commented Oct 12, 2018

Just changing the API minor version (without anything else) would require a major effort to get everything working again.

Not against doing it, but it will literally break nearly everything.

PS: OSM API numbering happened before semver as we know it now, so no data consumer can assume any specific guarantees wrt backwards compatibility.

@mmd-osm
Contributor

mmd-osm commented Oct 12, 2018

After all, the API supports 255 Unicode characters. Instead of writing 13:00 you can simply write 🕐 which counts as a single character. This saves a whopping 4 characters already, and it's human readable (sort of).

(This comment may contain a bit of irony and shouldn't be taken too seriously.)

@westnordost
Author

westnordost commented Oct 12, 2018

Okay, so I see there are two concerns in this thread about changing this:

  1. it will break applications that work with OSM data, to varying degrees
  2. the general notion that tags should be short and/or not contain text or long structured data

Regarding point 1, if it really is a major problem, the clear answer to that is API versioning. API 0.6 will then simply not return any keys or values longer than 255 characters, API 0.7 will. Not sure if it is currently possible to deploy several API versions in parallel, but the URI scheme certainly looks like it was allotted for that use case.

Regarding point 2, addressed here. Either I do not understand the argument, or it is a matter of opinion.

@mmd-osm
Contributor

mmd-osm commented Oct 12, 2018

To answer your question: cgimap is already prepared to run multiple API versions in parallel (assuming this is somehow meaningful from a data model pov). Not sure about the Rails port, though.

I'm quite skeptical whether hiding some key-value pairs depending on their length is a good idea. Let's say a 0.6 user wants to add some new tag with some value, and uploads that to the server. However, the same tag already exists on the server with a much longer value, and the server would have to refuse such an upload. The error message will be quite confusing though, as the existing entry is invisible to the 0.6 user. Even worse would be overwriting an existing entry that a user has never seen.

What's your take on minutely diffs and planet files, then? --> https://planet.openstreetmap.org/

@westnordost
Author

Well, v0.6 would need to show the too-long entries but clearly marked as abbreviated and perhaps read-only (v0.6 api rejects changes to this tag)

@mmd-osm
Contributor

mmd-osm commented Oct 12, 2018

Hmm... API 0.6 has no way to convey out-of-band information about tag values having more than 255 characters. So whatever schema you use to abbreviate long strings, you still don't know if the server holds more than 255 characters or some user is only playing funny tricks by adding their own abbreviation characters (or you need to handle that special case...). There's probably some need for a convention for abbreviation characters, and editor applications need to be aware of that as well to keep those values read-only.
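
To make the ambiguity concrete, here is a minimal sketch. The marker character and both function names are hypothetical; nothing like this exists in API 0.6, and no convention has been agreed on.

```python
LIMIT = 255
MARKER = "\u2026"  # hypothetical abbreviation marker, not an agreed convention


def abbreviate_for_v06(value):
    """Shorten a long value to LIMIT characters, ending in MARKER."""
    if len(value) <= LIMIT:
        return value
    return value[:LIMIT - len(MARKER)] + MARKER


def looks_abbreviated(value):
    """Best guess a 0.6 client could make from the value alone."""
    return len(value) == LIMIT and value.endswith(MARKER)


server_value = "x" * 300            # stored value, too long for 0.6
served = abbreviate_for_v06(server_value)
user_value = "x" * 254 + MARKER     # typed by a user, exactly 255 characters
```

Here `served` and `user_value` are identical strings, so `looks_abbreviated` reports both as abbreviated even though only one of them was shortened by the server.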

@westnordost
Author

That is correct. The goal and intention of this construct would only be to provide backwards compatibility until all the data consumers made that change to v0.7.

@matkoniecz
Contributor

inscriptions on physical monuments

I also run into this problem.

I'm quite skeptical whether hiding some key-value pairs depending on their length is a good idea. Let's say a 0.6 user wants to add some new tag with some value, and uploads that to the server.

What about keeping read access in v0.6 but disabling edit support? That might avoid the "it will literally break nearly everything" problem.

@mmd-osm
Contributor

mmd-osm commented Oct 14, 2018

As a very first step, the API should include the current max{key,value} lengths in https://api.openstreetmap.org/api/0.6/capabilities. Then editors could start respecting those settings rather than using some hardcoded value. It would enable us to switch at least all editors to a larger field length all at once, simply by rolling out a new version of the osm website (+cgimap).
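
A minimal sketch of how an editor might read such limits. It assumes the capabilities document gained a `<tags>` element with `max_key_length` and `max_value_length` attributes; those names are invented for illustration and are not taken from the real API response. When the element is missing, the sketch falls back to the hardcoded 255.

```python
import xml.etree.ElementTree as ET

DEFAULT_MAX = 255  # the limit editors hardcode today

# Hypothetical capabilities response; the <tags> element and its
# attributes are assumptions made for this sketch, not the real API.
SAMPLE = """<osm version="0.6">
  <api>
    <version minimum="0.6" maximum="0.6"/>
    <tags max_key_length="255" max_value_length="2000"/>
  </api>
</osm>"""


def tag_limits(capabilities_xml):
    """Return (max_key_length, max_value_length), falling back to 255."""
    root = ET.fromstring(capabilities_xml)
    tags = root.find("./api/tags")
    if tags is None:
        # Older server without the (hypothetical) element.
        return DEFAULT_MAX, DEFAULT_MAX
    return (int(tags.get("max_key_length", DEFAULT_MAX)),
            int(tags.get("max_value_length", DEFAULT_MAX)))


max_key, max_value = tag_limits(SAMPLE)
```

The fallback matters for the rollout described here: an editor built this way keeps working against a server that has not been updated yet.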

This sounds much easier than it really is: we've got lots of editing applications, and sometimes people use ancient versions. Some older apps might be unsupported, and will break. Preventing them from uploading corrupted data to OSM seems like a logical requirement.

My optimistic assumption is that this step alone takes at least 1 year.

Meanwhile, we need to figure out what to do with all those data consumers. Expect plenty of time for research. I have no idea what kind of fixes will be needed for tools like osmconvert, osmium, and many other binary file format based apps, including many mobile apps. This will be the most difficult part of the whole endeavour. Changing a few limits in editing apps will be a piece of cake in comparison.

Deliberately I left the decision open as to whether switching from 0.6 -> 0.7 would be necessary, as it causes all sorts of pains on its own.

@HolgerJeromin
Contributor

Just for the record:

In #1593, I read that the limitation to 255 characters is actually a legacy from back when the OSM database was running on MySQL

This seems to be wrong. API v0.5 had in fact some longer content in the db:
https://wiki.openstreetmap.org/wiki/API_v0.6_(Archive)

@simonpoole
Contributor

@HolgerJeromin the way I understood it, for 0.6 it was reverted back to use 255 all over the place "to be more consistent". That doesn't change why most of the strings were originally 255 chars long to start with.

@tomhughes
Member

So mysql really did have limits - it doesn't have default unlimited text fields like postgres. So there had to be some limit and I think as much as anything that 255 was possibly just the rails default.

@mmd-osm
Contributor

mmd-osm commented Oct 15, 2018

SVN revision 2489 has a create_database.sql script for MySQL with explicit k + v varchar(255):

https://svn.openstreetmap.org/!svn/bc/2489/sites/rails_port/db/create_database.sql

DROP TABLE IF EXISTS `current_way_tags`;
CREATE TABLE `current_way_tags` (
  `id` bigint(64) default NULL,
  `k` varchar(255) default NULL,
  `v` varchar(255) default NULL,
  KEY `current_way_tags_id_idx` (`id`),
  FULLTEXT KEY `current_way_tags_v_idx` (`v`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

I don't see the actual date for rev 2489 on the web interface, but it should be somewhat early in the project. It's from 2007. Earlier MySQL schema files indicate that it's probably even back to 2005.

@westnordost
Author

@tomhughes Would you accept a PR that extends the capabilities API call as @mmd-osm suggested?

@tomhughes
Member

Don't see why not.

@pyrog

pyrog commented Dec 29, 2019

there are a few legitimate geo uses of larger tagvalues, like inscriptions on physical monuments

Some contributions use multiple inscription tags, e.g. node 700852731

  • historic=memorial
  • inscription=At the top of these steps is Mount Skipet, the key site in the birth of the town's carpet industry. Here from c. 1749 John Pearsall and John Broom based their partnership. Pearsall is regarded as the founder of the industry in 1735 when he wove the first
  • inscription:2="Kidderminster Carpet". Named after the town, the carpet was a flat reversible weave with a pattern on both sides, last woven in the town in 1932. The story goes that Broom brought the secrets of "Brussels" weaving to the town, and the first pile carpet
  • inscription:3=was woven by Pearsall and Broom on Moun Skipet.
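
For illustration, this is roughly what every data consumer would have to implement to read such a split value back. It is only a sketch: the function name is hypothetical, and the `key`, `key:2`, `key:3` numbering is just the scheme used on this node, not an agreed convention.

```python
import re


def join_split_tag(tags, key):
    """Concatenate key, key:2, key:3, ... in numeric order."""
    pattern = re.compile(re.escape(key) + r"(?::(\d+))?")
    parts = []
    for k, v in tags.items():
        m = pattern.fullmatch(k)
        if m:
            # The unsuffixed key counts as part 1.
            index = int(m.group(1)) if m.group(1) else 1
            parts.append((index, v))
    # Joining with a space is a guess: it fits values split at word
    # boundaries, but not values split in the middle of a word.
    return " ".join(v for _, v in sorted(parts))


example = {
    "historic": "memorial",
    "inscription:3": "third part.",
    "inscription": "First part",
    "inscription:2": "second part,",
}
full_text = join_split_tag(example, "inscription")
```

Even this small helper has to guess the separator, which is one reason a generic splitting scheme is hard to get right across many tools.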

@pyrog

pyrog commented Dec 29, 2019

@woodpeck

it serves as a reminder for people that tags are supposed to be human-readable […]
An opening hours string longer than 255 characters is practically impossible to understand or edit without specialist software

Opening hours strings are difficult for humans to understand 😉
But software like OsmAnd displays them in a human-friendly form (or at least just says open or closed).
For inscriptions, the tool is a simple text editor plus copy and paste into the OSM editor.

If opening hours become more complicated than 255 characters then I would prefer it to just revert to a human-readable string that describes the issue, which is very likely to fit in 255 characters.

We found a case where a public swimming pool has 5 different timetables per week, and 7 for school holidays.
The total length was close to 255 characters, but above it.

SH Mo,Fr 12:00-20:00;
SH Tu 12:00-21:00;
SH We 10:00-18:00;
SH Th 12:00-20:15;
SH Sa,Su 10:00-13:00,15:00-18:00;

Mo 12:00-14:00,17:00-20:00;
Tu 11:45-14:00,16:30-21:00;
We 11:30-13:00,15:00-18:00;
Th 11:45-14:00,16:30-20:00;
Fr 11:45-14:00,17:00-20:00

@mmd-osm
Contributor

mmd-osm commented Jul 25, 2020

Fast forward 1.5 years, there’s no progress even on the very first step (extending the capabilities endpoint).

My suggestion would be to close this issue. Beyond the capabilities issue, there’s a large number of downstream consumers that assume 255 chars and abort processing, etc. You’d have to align with every one of them, which is not realistic imho.

@pyrog

pyrog commented Jul 25, 2020

We have the same issue with multi-valued keys like website, wikimedia_commons… if the URLs are too long.
Or with subject:wikidata, e.g. for a memorial with a lot of persons.

there's no technical reason for having a limit, but it serves as a reminder for people that tags are supposed to be human-readable

Editors could warn users but leave them the choice to go beyond 255 😃

@mmd-osm
Contributor

mmd-osm commented Jul 25, 2020

As I mentioned, you need to change every downstream consumer, and the API. Letting one editor go beyond the limit will not work.

@pyrog

pyrog commented Jul 25, 2020

Ok. But what is the risk if some downstream consumers are not updated ?

  • Analysers could crash : yes, but they could be "quickly" patched.
  • Editors may limit value length to 255 chars and lose the end: we could retrieve the previous value from the history. And it's not worse than now.

First, we could list all editors currently used. (see comment in Overpass-API/issues/189).

Then, we could check them with a test API (a static URL that sends one object with a long value)
And/or send emails to their developers asking them to check their tools themselves.

Finally, only after main "consumers" are tested/updated, push this in production.

@mmd-osm
Contributor

mmd-osm commented Jul 25, 2020

Just to name a few examples:
Nominatim: will no longer update.
Tile server: will no longer update.
Replication: will no longer work in the future.
Planet generation: will create incompatible pbf or fail.
Editors truncating data to 255 chars causing data corruption.
...and the list goes on and on.

@pyrog

pyrog commented Jul 25, 2020

Will they no longer update, or could they no longer update?
How can we be sure if we don't test them?

I know the task is difficult but I don't suggest to do this in one week-end 😉

@mmd-osm
Contributor

mmd-osm commented Jul 25, 2020

Well the process will abort with an error message, so it’s basically k.o.

You can see those checks in the source code all over.

@pyrog

pyrog commented Jul 25, 2020

Right, but if we remove these checks, will all the tools really crash?

In Nominatim, I can't find 255 or 256 in the source code.
In nominatim.c the "limit" is the size of the int C type (±32768).
In libosmium the limit is hardcoded to 256 characters.

@mmd-osm
Contributor

mmd-osm commented Jul 25, 2020

Yes, this limit in libosmium affects osm2pgsql and subsequent consumers like Nominatim. As that’s a header only library, you would need to recompile every tool that uses it using a new version of the library.

That’s just one example... you would need to evaluate a larger part of the osm ecosystem to assess the overall impact.

Coordinating the rollout of such a change isn’t trivial either and would probably take quite a significant amount of time.

@pyrog

pyrog commented Jul 25, 2020

  1. It's not a big deal for their developers.

  2. Currently, we don't have any tool to know the OSM "clients" precisely.

  3. Yes, it's not trivial, but we have time. Don't we ?

@pyrog

pyrog commented Jul 25, 2020

Planet generation: will create incompatible pbf or fail.

The length of a serialized object is coded as a variable length integer:
"Protocol Buffers use a variable-bit encoding for integers. An integer is encoded at 7 bits per byte, where the high bit indicates whether or not the next byte is to be read.…"
Source

So the size of a key or a value is "unlimited" in practice.
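
A small sketch of the varint encoding quoted above, written from the description rather than taken from any PBF library (the function names are my own). It only shows that the length prefix itself is not capped at 255; it says nothing about limits that individual PBF readers impose on top.

```python
def encode_varint(n):
    """Encode a non-negative integer as a Protocol Buffers varint."""
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        byte = n & 0x7F          # lowest 7 bits go into this byte
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)


def decode_varint(data):
    """Decode a varint from the start of data; return (value, bytes_read)."""
    value = 0
    for i, byte in enumerate(data):
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return value, i + 1
    raise ValueError("truncated varint")


length_prefix = encode_varint(2000)  # a 2000-byte string needs a 2-byte prefix
```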

@xorgy

xorgy commented Sep 20, 2022

Is this limitation documented for real anywhere? When you say "char" what exactly do you mean? Do you mean 255 UTF-8 bytes? libosmium says the strings can be up to 256 four byte codepoints, which is 1024 bytes of whatever you want. MySQL schema interpretation of varchar is different between versions, so can either mean 255 codepoints or 255 bytes of UTF-8.

It seems that people are further confused about this in the ecosystem, because MarcusWolschon/osmeditor4android#1401 references this issue, but the "fix" they have only truncates values based on UTF-16 values, which will allow values larger than 255 UTF-8 bytes, but will fail on values 255 codepoints or fewer, when those codepoints are represented with a surrogate pair.

@tomhughes
Member

I can't actually find any documentation of exactly what the rails validator polices but my guess is that it's the ruby string length which is probably codepoints rather than bytes or grapheme clusters.

So 255 UTF-16 values while wrong isn't totally terrible as characters outside the BMP will be relatively rare, and it will reject strings that the API would allow rather than allowing strings that the API will reject.
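
To illustrate the three ways of counting being discussed, a minimal Python sketch (the helper names are my own; which of these counts the API actually enforces is exactly the open question here):

```python
def code_points(s):
    """Number of Unicode code points (what Python's len() counts)."""
    return len(s)


def utf8_bytes(s):
    """Number of bytes in the UTF-8 encoding."""
    return len(s.encode("utf-8"))


def utf16_units(s):
    """Number of 16-bit code units in the UTF-16 encoding."""
    return len(s.encode("utf-16-le")) // 2


clock = "\U0001F550"   # the clock emoji, outside the Basic Multilingual Plane
value = clock * 200    # 200 code points, but 400 UTF-16 units, 800 UTF-8 bytes

fits_by_code_points = code_points(value) <= 255
fits_by_utf16_units = utf16_units(value) <= 255
```

For this value the code point check passes and the UTF-16 check fails. Since the UTF-16 count is never smaller than the code point count, a UTF-16 based check can only be stricter than a code point limit.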

@xorgy

xorgy commented Sep 20, 2022

@tomhughes Where would I look for the actual code that ingests these values and decides whether to accept them or not?

@mmd-osm
Contributor

mmd-osm commented Sep 20, 2022

Relevant code is here: https://github.com/zerebubuth/openstreetmap-cgimap/blob/master/include/cgimap/util.hpp#L21-L35
If that function returns a value > 255, the diff upload will reject it.

@tomhughes
Member

Well, there are two such pieces of code: one in rails, which uses the rails validates_length that I discussed, and the cgimap version which @mmd-osm referenced.

@xorgy

xorgy commented Sep 20, 2022

Relevant code is here: https://github.com/zerebubuth/openstreetmap-cgimap/blob/master/include/cgimap/util.hpp#L21-L35 If that function returns a value > 255, the diff upload will reject it.

So it is the UTF-16 length after all... unless wchar_t isn't UTF-16LE, which it has every right not to be thanks to the C++ standard. ;+ )

@tomhughes
Member

tomhughes commented Sep 20, 2022

It won't be in rails and I really hope it isn't in cgimap but frankly I have no idea what the C multibyte routines will do on any given platform and frankly I'm rather concerned to discover we're relying on them!

Really this is not the appropriate place to be discussing it though - between us we've just polluted his ticket with a dozen or more off topic comments :-(

@xorgy

xorgy commented Sep 20, 2022

Really this is not the appropriate place to be discussing it though - between us we've just polluted his ticket with a dozen or more off topic comments :-(

Yeah, I felt that way, but I couldn't find any other discussion of the limit aside from tickets on specific pieces of software, nor any documentation of it except the vague "255 characters" you see throughout the wiki. Is there a better venue?

@tomhughes
Member

Opening a new ticket or asking on the dev or rails-dev lists would have been good choices.

@mmd-osm
Contributor

mmd-osm commented Sep 20, 2022

Maybe you could provide a testcase on https://cpp.godbolt.org/ to demonstrate the issue?

@tomhughes
Member

Are you talking to me about my concerns about the multibyte stuff? My worry is that the routines, as I understand things, leave a lot open to interpretation and may be sensitive to the environment, so what happens on godbolt may not match what happens in production.

I mean, it probably works fine given the setlocale; I've just always steered clear of them myself.

If it works as intended then it's counting codepoints though (which is good) not UTF16 values.

@mmd-osm
Contributor

mmd-osm commented Sep 20, 2022

That's why we have a testcase for that. See the other issue.
