Increase key and value max length #2025

westnordost · 2018-10-12T15:59:04Z

Some OSM tags do not (always) fit into the 255-character-limit imposed by the OSM API. This is a feature request to either increase the limit or remove it altogether.

Discussion

There is a discussion on the tagging mailing list right now about how to deal with a very complicated opening hours value that is too long to fit into one tag value.

Citing @simonpoole :

We have a number of keys for which the values can easily exceed 255 chars. Besides opening_hours, lane destinations and conditional restrictions are good candidates, not to mention changeset tags. In other words, it is a general problem which should be tackled with a general solution.

So we have passed the point where value lengths above 255 characters are unrealistic, even for structured data, that is, data that is not prose.

Solutions

In the above discussion, the general opinion seemed to be that the OSM API will not change (until the unicorny API 0.7 appears) and thus we have to work around this.

There have been some suggestions for a general-purpose syntax that splits the actual value across several keys, like opening_hours, opening_hours#2, but I think this is a very bad solution. It requires each and every application and library that works with OSM data to implement this kind of tag concatenation, and for any key, not just opening_hours, because this kind of enumeration scheme is supposed to be a generic solution.
We can already extrapolate how well that would play out when looking at the software support for things like shop=books;stationary. (Hint: not at all ;-) )

In #1593, I read that the limitation to 255 characters is actually a legacy from back when the OSM database was running on MySQL. So, removing that limitation, or increasing it somewhat, seems to be the swiftest solution because a weird workaround is avoided and it doesn't require a change in applications that use OSM data.
The only change that should be done would be to remove/change the client-side checks for tag lengths (in editors, primarily).

tomhughes · 2018-10-12T16:06:22Z

I think the answer to how to store complicated opening hours is not to store opening hours in OSM because it's insane to store something as volatile and non-geo related as that ;-)

bhousel · 2018-10-12T16:10:57Z

From my perspective, it would be super to just increase the tagvalue length in the database to something like varchar(2000), bump the API minor version, and move on. Inventing tags to work around the existing limit is just asking for trouble.

tomhughes · 2018-10-12T16:16:35Z

Well you might as well get rid of it altogether because any length limit in postgres is entirely arbitrary - the storage for char(N), varchar(N) and text is all identical - the first two just do a length check on insert is all.

That said, such large values are really stretching the basic concept of tags if you ask me, especially for the name.

bhousel · 2018-10-12T16:19:37Z

@tomhughes : there are a few legitimate geo uses of larger tagvalues, like destination sign text or inscriptions on physical monuments, things like that. We should have some kind of limit, but it should be larger than varchar(255).

Aside: I don't think OSM note text is limited by the API or database, and it should be. Uploading a multi-gigabyte OSM note is left as an exercise for the reader.

woodpeck · 2018-10-12T16:23:04Z

I agree with TomH - there's no technical reason for having a limit, but it serves as a reminder for people that tags are supposed to be human-readable and if you find yourself adding structured data to OSM in a way that makes you hit the 255 character limit, then you are very likely doing things you shouldn't. An opening hours string longer than 255 characters is practically impossible to understand or edit without specialist software (i.e. an editor that has support for a complex opening hours schema). This is undesirable to have, because once our tags are so complicated that they need software support to edit, they also lose any flexibility - once support is baked into N software programs, the tag is certain to never change, even if the circumstances would make it necessary.

So my suggestion is, don't use tags longer than 255 characters, and if you invent a hack around this limitation, at least it's going to be unsupported by most of the applications and therefore condemned to be a niche thing.

If opening hours become more complicated than 255 characters then I would prefer it to just revert to a human-readable string that describes the issue, which is very likely to fit in 255 characters.

mmd-osm · 2018-10-12T16:51:21Z

it would be super to just increase the tagvalue length in the database to something like varchar(2000), bump the API minor version, and move on

That will for sure not work. There are lots of downstream data processors, and you need to look at this change end-to-end - starting with osmosis for the replication, tools like osmium that have string length restrictions and special size optimizations in place, many different editors assuming certain size limits, ...

bhousel · 2018-10-12T16:55:07Z

That will for sure not work. There are lots of downstream data processors, and you need to look at this change end-to-end - starting with osmosis for the replication, tools like osmium that have string length restrictions and special size optimizations in place, many different editors assuming certain size limits, ...

Oh to be sure, I don't think this will actually ever happen. But if it did, iD would be fine with it.

mmd-osm · 2018-10-12T17:02:08Z

Yeah, right, my point was more to raise a bit of awareness, so that such a change is considered in a much wider context. People tend to be too focused on the API itself, but there's so much other stuff going on in other parts of the ecosystem. Things like a size limit can be buried deep in some library, and scattered across lots of applications that all need updating then.

See this tiny bit in osmium: https://github.com/osmcode/libosmium/blob/master/include/osmium/osm/types.hpp#L69 - it affects object handling + pbf parsing...

westnordost · 2018-10-12T17:04:15Z

That said such large values is really stretching the basic concept of tags if you ask me, especially for the name.

@tomhughes Hmm, is this a real argument? What is the basic concept of tags, and why does it make sense for OSM to adhere to that definition of this basic concept of tags?

It is a matter of fact that there are approved and widely used definitions for tags that contain structured data, which can also become quite long, as we see. If you think that these things have no place in OSM, then, given how well established tags like opening hours are, does that mean that you are unhappy with the general direction OSM has taken?

westnordost · 2018-10-12T17:05:48Z

@mmd-osm It's also here: https://github.com/westnordost/osmapi/blob/master/src/main/java/de/westnordost/osmapi/map/data/OsmTags.java#L38 but I can change this in a matter of minutes.

simonpoole · 2018-10-12T17:12:53Z

Just changing the API minor version (without anything else) would require a major effort to get everything working again.

Not against doing it, but it will literally break nearly everything.

PS: OSM API numbering happened before semver as we know it now, so no data consumer can assume any specific guarantees wrt backwards compatibility.

mmd-osm · 2018-10-12T17:40:46Z

After all, the API supports 255 Unicode characters. Instead of writing 13:00 you can simply write 🕐 which counts as a single character. This saves a whopping 4 characters already, and it's human readable (sort of).

(This comment may contain a bit of irony and shouldn't be taken too seriously.)

westnordost · 2018-10-12T18:45:06Z

Okay, so I see there are two concerns in this thread about changing this:

1. It will break applications that work with OSM data, to varying degrees.
2. The general notion that tags should be short and/or not contain text or long structured data.

Regarding point 1, if it really is a major problem, the clear answer to that is API versioning. API 0.6 will then simply not return any keys or values longer than 255 characters, API 0.7 will. Not sure if it is currently possible to deploy several API versions in parallel, but the URI scheme certainly looks like it was allotted for that use case.

Regarding point 2, addressed here. Either I do not understand the argument, or it is a matter of opinion.

mmd-osm · 2018-10-12T18:51:51Z

To answer your question: cgimap is already prepared to run multiple API versions in parallel (assuming this is somehow meaningful from a data model pov). Not sure about the Rails port, though.

I'm quite skeptical whether hiding some key-value pairs depending on their length is a good idea. Let's say a 0.6 user wants to add some new tag with some value, and uploads that to the server. However, the same tag already exists on the server with a much longer value, and the server would have to refuse such an upload. The error message will be quite confusing though, as the existing entry is invisible to the 0.6 user. Even worse would be overwriting an existing entry that the user has never seen.

What's your take on minutely diffs and planet files, then? --> https://planet.openstreetmap.org/

westnordost · 2018-10-12T19:15:09Z

Well, v0.6 would need to show the too-long entries but clearly marked as abbreviated and perhaps read-only (v0.6 api rejects changes to this tag)

mmd-osm · 2018-10-12T19:30:31Z

Hmm... API 0.6 has no way to convey out-of-band information about tag values having more than 255 characters. So whatever schema you use to abbreviate long strings, you still don't know if the server holds more than 255 characters or some user is only playing funny tricks by adding their own abbreviation characters (or you need to handle that special case...). There's probably some need for a convention for abbreviation characters, and editor applications need to be aware of that as well to keep those values read-only.
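
To illustrate the ambiguity, here is a minimal Python sketch. The `view_for_api_06` helper and the `…` marker are hypothetical and made up for this example; nothing like them exists in API 0.6.

```python
MAX_LEN = 255
MARKER = "…"  # hypothetical abbreviation marker, not part of any OSM API


def view_for_api_06(stored_value: str) -> str:
    """Return the value as a hypothetical abbreviating 0.6 endpoint would show it."""
    if len(stored_value) <= MAX_LEN:
        return stored_value
    # Cut the value so that the result, including the marker, is MAX_LEN long.
    return stored_value[:MAX_LEN - len(MARKER)] + MARKER


long_value = "x" * 300            # genuinely longer than the limit
trick_value = "x" * 254 + MARKER  # exactly 255 characters, typed in by a user

# Both values look identical through the abbreviated view, so a client
# cannot tell a truncated value from one that merely ends in the marker.
print(view_for_api_06(long_value) == view_for_api_06(trick_value))
```

Any in-band marker has this problem unless the marker itself is forbidden in ordinary values, which would be yet another convention for editors to learn.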

westnordost · 2018-10-12T19:40:22Z

That is correct. The goal and intention of this construct would only be to provide backwards compatibility until all the data consumers made that change to v0.7.

matkoniecz · 2018-10-13T09:06:57Z

inscriptions on physical monuments

I also ran into this problem.

I'm quite skeptical, if hiding some key-value pairs depending on the length is a good idea. Let's say, a 0.6 user wants to add some new tag with some value, and uploads that to the server.

What about keeping read access for v0.6 but disabling edit support? That may allow us to avoid the "it will literally break nearly everything" problem.

mmd-osm · 2018-10-14T10:36:21Z

As a very first step, the API should include the current max{key,value} lengths in https://api.openstreetmap.org/api/0.6/capabilities. Then editors could start respecting those settings rather than using some hardcoded value. It would enable us to switch at least all editors to a larger field length all at once, simply by rolling out a new version of the osm website (+cgimap).
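
A rough sketch of what the editor side could look like, in Python. The `<tags>` element and its `max_key_length` / `max_value_length` attributes are assumptions for illustration only; the capabilities response does not define them at the time of this comment, and the sample document below is hand-written rather than fetched from the server.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; the <tags> element and its attributes are hypothetical.
SAMPLE_CAPABILITIES = """
<osm version="0.6">
  <api>
    <version minimum="0.6" maximum="0.6"/>
    <tags max_key_length="255" max_value_length="255"/>
  </api>
</osm>
"""

FALLBACK_LIMIT = 255  # the value editors hardcode today


def read_tag_limits(capabilities_xml: str) -> tuple[int, int]:
    """Return (max key length, max value length), falling back to 255."""
    root = ET.fromstring(capabilities_xml.strip())
    tags = root.find("./api/tags")
    if tags is None:
        # Server does not advertise limits: keep the old behaviour.
        return FALLBACK_LIMIT, FALLBACK_LIMIT
    return (
        int(tags.get("max_key_length", FALLBACK_LIMIT)),
        int(tags.get("max_value_length", FALLBACK_LIMIT)),
    )


def tag_fits(key: str, value: str, limits: tuple[int, int]) -> bool:
    """Check a tag against the advertised limits (counted in code points)."""
    max_key, max_value = limits
    return len(key) <= max_key and len(value) <= max_value


limits = read_tag_limits(SAMPLE_CAPABILITIES)
print(limits, tag_fits("inscription", "x" * 300, limits))
```

The fallback matters: an editor written this way keeps working against a server that does not advertise anything.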

This sounds much easier than it really is: we've got lots of editing applications, and sometimes people use ancient versions. Some older apps might be unsupported, and will break. Preventing them from uploading corrupted data to OSM seems like a logical requirement.

My optimistic assumption is that this step alone takes at least 1 year.

Meanwhile, we need to figure out, what to do with all those data consumers. Expect plenty of time for research. I have no idea, what kind of fixes will be needed for tools like osmconvert, osmium, and many other binary file format based apps, including many mobile apps. This will be the most difficult part of the whole endeavour. Changing a few limits in editing apps will be a piece of cake in comparison.

Deliberately I left the decision open as to whether switching from 0.6 -> 0.7 would be necessary, as it causes all sorts of pains on its own.

HolgerJeromin · 2018-10-14T20:29:19Z

Just for the record:

In #1593, I read that the limitation to 255 characters is actually a legacy from back when the OSM database was running on MySQL

This seems to be wrong. API v0.5 had in fact some longer content in the db:
https://wiki.openstreetmap.org/wiki/API_v0.6_(Archive)

simonpoole · 2018-10-14T21:48:34Z

@HolgerJeromin the way I understood it, for 0.6 it was reverted back to use 255 all over the place "to be more consistent". That doesn't change why most of the strings were originally 255 chars long to start with.

tomhughes · 2018-10-14T21:52:53Z

So mysql really did have limits - it doesn't have default unlimited text fields like postgres. So there had to be some limit and I think as much as anything that 255 was possibly just the rails default.

mmd-osm · 2018-10-15T10:26:27Z

SVN revision 2489 has a create_database.sql script for MySQL with explicit k + v varchar(255):

https://svn.openstreetmap.org/!svn/bc/2489/sites/rails_port/db/create_database.sql

DROP TABLE IF EXISTS `current_way_tags`;
CREATE TABLE `current_way_tags` (
  `id` bigint(64) default NULL,
  `k` varchar(255) default NULL,
  `v` varchar(255) default NULL,
  KEY `current_way_tags_id_idx` (`id`),
  FULLTEXT KEY `current_way_tags_v_idx` (`v`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

~~I don't see the actual date for rev 2489 on the web interface, but it should be somewhat early in the project.~~ It's from 2007. Earlier MySQL schema files indicate that it's probably even back to 2005.

westnordost · 2018-10-21T10:56:37Z

@tomhughes Would you accept a PR that extends the capabilities API call as @mmd-osm suggested?

tomhughes · 2018-10-21T11:00:21Z

Don't see why not.

pyrog · 2019-12-29T14:22:53Z

there are a few legitimate geo uses of larger tagvalues, like inscriptions on physical monuments

Some contributions use multiple inscription tags, e.g. node 700852731

historic=memorial
inscription=At the top of these steps is Mount Skipet, the key site in the birth of the town's carpet industry. Here from c. 1749 John Pearsall and John Broom based their partnership. Pearsall is regarded as the founder of the industry in 1735 when he wove the first
inscription:2="Kidderminster Carpet". Named after the town, the carpet was a flat reversible weave with a pattern on both sides, last woven in the town in 1932. The story goes that Broom brought the secrets of "Brussels" weaving to the town, and the first pile carpet
inscription:3=was woven by Pearsall and Broom on Moun Skipet.
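
For what it's worth, this is roughly what every data consumer would have to do to read such a value back. A minimal Python sketch, assuming the `key`, `key:2`, `key:3` numbering used on that node; the `join_numbered_tag` helper and the choice of separator are made up for illustration, since no convention defines them.

```python
import re


def join_numbered_tag(tags: dict[str, str], key: str, separator: str = " ") -> str:
    """Concatenate key, key:2, key:3, ... in numeric order.

    The separator is a guess: the split on the node above lost the space
    between the parts, but nothing guarantees that for other objects.
    """
    pattern = re.compile(rf"^{re.escape(key)}(?::(\d+))?$")
    parts = []
    for k, v in tags.items():
        match = pattern.match(k)
        if match:
            # The unnumbered key counts as part 1.
            index = int(match.group(1)) if match.group(1) else 1
            parts.append((index, v))
    parts.sort(key=lambda part: part[0])
    return separator.join(v for _, v in parts)


example = {
    "historic": "memorial",
    "inscription:3": "third part",
    "inscription": "first part",
    "inscription:2": "second part",
}
print(join_numbered_tag(example, "inscription"))
```

Even this small helper has to guess the separator and the suffix syntax, which is the interoperability problem raised at the top of the thread.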

pyrog · 2019-12-29T14:35:05Z

@woodpeck

it serves as a reminder for people that tags are supposed to be human-readable […]
An opening hours string longer than 255 characters is practically impossible to understand or edit without specialist software

Opening hours strings are difficult for humans to understand 😉
But software like OsmAnd displays them in a human-friendly form (or at least just says open or closed).
For inscriptions, the tool is a simple text editor, with copy and paste into the OSM editor.

If opening hours become more complicated than 255 characters then I would prefer it to just revert to a human-readable string that describes the issue, which is very likely to fit in 255 characters.

We found a case where a public swimming pool has 5 different timetables per week, and 7 for school holidays.
The total length was close to 255 characters, but over the limit.

SH Mo,Fr 12:00-20:00;
SH Tu 12:00-21:00;
SH We 10:00-18:00;
SH Th 12:00-20:15;
SH Sa,Su 10:00-13:00,15:00-18:00;

Mo 12:00-14:00,17:00-20:00;
Tu 11:45-14:00,16:30-21:00;
We 11:30-13:00,15:00-18:00;
Th 11:45-14:00,16:30-20:00;
Fr 11:45-14:00,17:00-20:00

mmd-osm · 2020-07-25T09:53:23Z

Fast forward 1.5 years, there’s no progress even on the very first step (extending the capabilities endpoint).

My suggestion would be to close this issue. Beyond the capabilities issue, there’s a large number of downstream consumers that assume 255 chars and abort processing, etc. You’d need to align with every one of them, which is not realistic imho.

pyrog · 2020-07-25T10:13:05Z

We have the same issue with multi-valued keys like website, wikimedia_commons… if the URLs are too long.
Or with subject:wikidata, e.g. for a memorial with a lot of persons.

there's no technical reason for having a limit, but it serves as a reminder for people that tags are supposed to be human-readable

Editors could warn users but leave them the choice to go beyond 255 😃

mmd-osm · 2020-07-25T10:15:41Z

As I mentioned, you need to change every downstream consumer, and the API. Letting one editor go beyond the limit will not work.

pyrog · 2020-07-25T12:56:37Z

OK. But what is the risk if some downstream consumers are not updated?

Analysers could crash: yes, but they could be "quickly" patched.
Editors may limit value length to 255 chars and lose the end: we could retrieve the previous value from the history. And it's not worse than now.

First, we could list all editors currently in use (see comment in Overpass-API/issues/189).

Then, we could check them against a test API (a static URL that sends one object with a long value)
and/or send emails to their developers so they can check their tools themselves.

Finally, only after the main "consumers" are tested/updated, push this to production.

mmd-osm · 2020-07-25T13:02:02Z

Just to name a few examples:
Nominatim: will no longer update.
Tile server: will no longer update.
Replication: will no longer work in the future.
Planet generation: will create incompatible pbf or fail.
Editors truncating data to 255 chars causing data corruption.
...and the list goes on and on.

pyrog · 2020-07-25T13:06:48Z

Will they no longer update, or could they no longer update?
How can we be sure if we don't test them?

I know the task is difficult, but I'm not suggesting we do this in one weekend 😉

mmd-osm · 2020-07-25T13:08:03Z

Well the process will abort with an error message, so it’s basically k.o.

You can see those checks in the source code all over.

pyrog · 2020-07-25T13:12:32Z

Right, but if we remove these checks, will all the tools really crash?

In Nominatim, I can't find 255 or 256 in the source code.
In nominatim.c the "limit" is the size of the int C type (±32768).
In libosmium the limit is hardcoded to 256 characters.

mmd-osm · 2020-07-25T13:45:21Z

Yes, this limit in libosmium affects osm2pgsql and subsequent consumers like Nominatim. As that’s a header only library, you would need to recompile every tool that uses it using a new version of the library.

That’s just one example... you would need to evaluate a larger part of the osm ecosystem to assess the overall impact.

Coordinating the rollout of such a change isn’t trivial either and would probably take quite a significant amount of time.

pyrog · 2020-07-25T14:38:53Z

It's not a big deal for their developers.
Currently, we don't have any tool to know the OSM "clients" precisely.
Yes, it's not trivial, but we have time, don't we?

pyrog · 2020-07-25T14:43:47Z

Planet generation: will create incompatible pbf or fail.

The length of a serialized object is coded as a variable length integer:
"Protocol Buffers use a variable-bit encoding for integers. An integer is encoded at 7 bits per byte, where the high bit indicates whether or not the next byte is to be read.…"
Source

So the size of a key or a value is "unlimited" in practice.
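
A minimal Python sketch of that encoding, to show that a length prefix above 255 is unremarkable. The function names are made up for illustration. This only covers the wire format; it says nothing about limits that individual PBF readers or writers impose on top of it.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a Protocol Buffers base-128 varint."""
    if n < 0:
        raise ValueError("varint sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = n & 0x7F          # lowest 7 bits
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode a single varint from the start of data."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")


for length in (255, 300, 2000):
    encoded = encode_varint(length)
    print(length, encoded.hex(), decode_varint(encoded))
```

A length of 255 already needs two bytes in this encoding, and so does 2000, so the format itself draws no line at 255.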

xorgy · 2022-09-20T19:56:52Z

Is this limitation documented for real anywhere? When you say "char" what exactly do you mean? Do you mean 255 UTF-8 bytes? libosmium says the strings can be up to 256 four byte codepoints, which is 1024 bytes of whatever you want. MySQL schema interpretation of varchar is different between versions, so can either mean 255 codepoints or 255 bytes of UTF-8.

It seems that people are further confused about this in the ecosystem, because MarcusWolschon/osmeditor4android#1401 references this issue, but the "fix" they have only truncates values based on UTF-16 values, which will allow values larger than 255 UTF-8 bytes, but will fail on values 255 codepoints or fewer, when those codepoints are represented with a surrogate pair.
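
The three interpretations can differ widely for the same string. A minimal Python sketch; the `lengths` helper is made up for illustration and is not what rails, cgimap or any editor actually runs.

```python
def lengths(value: str) -> dict[str, int]:
    """Measure one string under three common definitions of 'length'."""
    return {
        "codepoints": len(value),                            # Unicode code points
        "utf8_bytes": len(value.encode("utf-8")),            # bytes on the wire
        "utf16_units": len(value.encode("utf-16-le")) // 2,  # e.g. Java/JS string length
    }


# U+1F550 (clock face one o'clock) lies outside the BMP: 4 bytes in UTF-8
# and a surrogate pair (2 code units) in UTF-16.
value = "\U0001F550" * 200
print(lengths(value))
```

A value of 200 such characters is within a 255 code point limit, but it exceeds 255 when counted in UTF-16 code units or in UTF-8 bytes.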

tomhughes · 2022-09-20T20:04:56Z

I can't actually find any documentation of exactly what the rails validator polices but my guess is that it's the ruby string length which is probably codepoints rather than bytes or grapheme clusters.

So 255 UTF-16 values while wrong isn't totally terrible as characters outside the BMP will be relatively rare, and it will reject strings that the API would allow rather than allowing strings that the API will reject.

xorgy · 2022-09-20T20:06:34Z

@tomhughes Where would I look for the actual code that ingests these values and decides whether to accept them or not?

mmd-osm · 2022-09-20T20:07:27Z

Relevant code is here: https://github.com/zerebubuth/openstreetmap-cgimap/blob/master/include/cgimap/util.hpp#L21-L35
If that function returns a value > 255, the diff upload will reject it.

tomhughes · 2022-09-20T20:08:16Z

Well, there are two such pieces of code: one in rails, which uses the rails validates_length that I discussed, and the cgimap version which @mmd-osm referenced.

xorgy · 2022-09-20T20:14:37Z

Relevant code is here: https://github.com/zerebubuth/openstreetmap-cgimap/blob/master/include/cgimap/util.hpp#L21-L35 If that function returns a value > 255, the diff upload will reject it.

So it is the UTF-16 length after all... unless wchar_t isn't UTF-16LE, which it has every right not to be thanks to the C++ standard. ;+ )

tomhughes · 2022-09-20T20:16:29Z

It won't be in rails and I really hope it isn't in cgimap but frankly I have no idea what the C multibyte routines will do on any given platform and frankly I'm rather concerned to discover we're relying on them!

Really this is not the appropriate place to be discussing it though - between us we've just polluted this ticket with a dozen or more off topic comments :-(

xorgy · 2022-09-20T20:19:44Z

Really this is not the appropriate place to be discussing it though - between us we've just polluted this ticket with a dozen or more off topic comments :-(

Yeah, I felt that way, but I couldn't find any other discussion of the limit aside from tickets on specific pieces of software, nor any documentation of it except the vague "255 characters" you see throughout the wiki. Is there a better venue?

tomhughes · 2022-09-20T20:20:56Z

Opening a new ticket or asking on the dev or rails-dev lists would have been good choices.

mmd-osm · 2022-09-20T20:25:40Z

Maybe you could provide a testcase on https://cpp.godbolt.org/ to demonstrate the issue?

tomhughes · 2022-09-20T20:33:48Z

Are you talking to me about my concerns about the multibyte stuff? My worry is that the routines, as I understand things, leave a lot open to interpretation and may be sensitive to the environment, so what happens on godbolt may not match what happens in production.

I mean, it probably works fine given the setlocale; I've just always steered clear of them myself.

If it works as intended then it's counting codepoints though (which is good), not UTF-16 values.

mmd-osm · 2022-09-20T20:43:34Z

That's why we have a testcase for that. See the other issue.

pyrog mentioned this issue Jul 25, 2020

Search by changeset metadata ? drolbr/Overpass-API#189

Open

simonpoole mentioned this issue May 23, 2021

ConditionalRestrictionFragment — OSM API error; long tag values yield XML lines which exceed 255 characters MarcusWolschon/osmeditor4android#1401

Closed

Lee-Carre mentioned this issue Nov 1, 2021

StreetComplete does not expect the 255-character limit on keys/values streetcomplete/StreetComplete#3471

Closed

matkoniecz mentioned this issue Aug 16, 2022

What https://blog.openstreetmap.org/2022/06/02/announcement-data-model-study/ will cover? osmlab/osm-data-model#7

Closed

xorgy mentioned this issue Sep 20, 2022

The specific limits on keys and values are unclear. #3706

Closed

xorgy mentioned this issue Sep 20, 2022

Clarify and document restrictions on key and value strings. zerebubuth/openstreetmap-cgimap#278

Closed
