Complex Properties Overhaul #121

flippmoke · 2018-07-20T19:40:51Z

This is different from the simpler approach to adding lists and maps in #117. It is a complete overhaul of the way properties are encoded. The goal of this change is to allow higher levels of compression of values by:

Allowing integer values to be inlined rather than pointing to indexes
Making indexed positions of properties point into packed types
Keys and values share the same string storage system

This also allows for null values.

Solves #75 and #62

flippmoke · 2018-07-20T19:43:35Z

Another slight change proposed by @kkaefer that I should mention (not currently reflected in this PR):

Using the special index "0" to mean "specified inline" for any of the index types

This could allow all types to be inlined.

e-n-f · 2018-07-20T19:57:32Z

3.0/vector_tile.proto

+                repeated double double_values = 8 [ packed = true ];
+                repeated float float_values = 9 [ packed = true ];
+                repeated int64 int64_values = 10 [ packed = true ];
+                repeated uint64 uint64_values = 11 [ packed = true ];


Please use sint64 instead of int64 since this would only be preferred over uint64 when the value is negative.
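To illustrate the size difference behind this request, here is a minimal Python sketch that counts varint bytes for a value under the int64 and sint64 wire encodings. It is not taken from the PR or from any protobuf library; `varint_len`, `int64_wire_value`, and `sint64_wire_value` are hypothetical helper names.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def varint_len(value: int) -> int:
    """Bytes needed to varint-encode an unsigned 64-bit value (7 payload bits per byte)."""
    assert 0 <= value <= MASK64
    length = 1
    while value >= 0x80:
        value >>= 7
        length += 1
    return length

def int64_wire_value(n: int) -> int:
    """int64: a negative number is written as its 64-bit two's complement."""
    return n & MASK64

def sint64_wire_value(n: int) -> int:
    """sint64: zigzag maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ..."""
    return ((n << 1) ^ (n >> 63)) & MASK64

for n in (-1, -1000, 1000):
    as_int64 = varint_len(int64_wire_value(n))
    as_sint64 = varint_len(sint64_wire_value(n))
    print(f"{n}: int64 uses {as_int64} bytes, sint64 uses {as_sint64} bytes")
```

Every negative int64 costs the full 10 bytes, while zigzag keeps small negative numbers short at the price of one extra bit for positive ones.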

e-n-f · 2018-07-20T19:59:05Z

3.0/vector_tile.proto

+                //              |     | (if 4th bit is 1 is map)
+                //              |     |   remaining bits are the number of key_index and 
+                //              |     |   complex_value pairs to follow (same as properties)
+                repeated uint64 properties = 5 [ packed = true ];


I will try an experimental implementation of this so we can see if it actually makes the tiles significantly smaller.

e-n-f · 2018-07-20T19:59:37Z

3.0/vector_tile.proto

+                // 
+                //     Type     | Id  | Parameter
+                // ---------------------------------
+                // inline int   |  0  | value of integer ( values between -2^60+1 to 2^60-1 )


Please use sint instead of int since this would only be preferred over uint when values are negative.

e-n-f · 2018-07-20T21:54:35Z

An experimental implementation of this scheme is in mapbox/tippecanoe#611

In a first, superficial test, it reduces the Natural Earth countries (to z5) to 99.55% of their usual size. There may be other data sets that show meaningful improvement though.

flippmoke · 2018-07-23T17:31:07Z

I have experimented a little more with your branch and put in a few different sets of data:

North American Road Data

For one set of data, I used some North American road data. Here are the resulting file sizes:

-rw-r--r-- 1 mthompson 410M Jul 23 12:23 na_roads_blake.mbtiles
-rw-r--r-- 1 mthompson 430M Jul 23 11:42 na_roads_master.mbtiles

or

-rw-r--r-- 1 mthompson 428904448 Jul 23 12:23 na_roads_blake.mbtiles
-rw-r--r-- 1 mthompson 450768896 Jul 23 11:42 na_roads_master.mbtiles

This is about a 5% reduction in size.

Sample data format properties:

"properties": { "prefix": null, "number": "331", "class": "State", "type": "Other Paved", "divided": null, "country": "United States", "state": "Alabama", "note": null, "scalerank": 11, "uident": 142, "length": 4.559190, "rank": 0, "continent": "North America" }

My guess is that the reduction in this case is mostly due to inline integers.

OSM Based Point Data

File sizes:

-rw-r--r-- 1 mthompson 832K Jul 23 12:17 kathmandu_blake.mbtiles
-rw-r--r-- 1 mthompson 888K Jul 23 11:43 kathmandu_master.mbtiles

or

-rw-r--r-- 1 mthompson 851968 Jul 23 12:17 kathmandu_blake.mbtiles
-rw-r--r-- 1 mthompson 909312 Jul 23 11:43 kathmandu_master.mbtiles

This is about a 6% reduction.

Example properties:

"properties": { "osm_id": 3483919843.0, "access": null, "aerialway": null, "aeroway": null, "amenity": null, "area": null, "barrier": null, "bicycle": null, "brand": null, "bridge": null, "boundary": null, "building": null, "capital": null, "covered": null, "culvert": null, "cutting": null, "disused": null, "ele": null, "embankment": null, "foot": null, "harbour": null, "highway": null, "historic": null, "horse": null, "junction": null, "landuse": null, "layer": null, "leisure": null, "lock": null, "man_made": null, "military": null, "motorcar": null, "name": null, "natural": null, "oneway": null, "operator": null, "poi": null, "population": null, "power": null, "place": null, "railway": null, "ref": null, "religion": null, "route": null, "service": null, "shop": null, "sport": null, "surface": null, "toll": null, "tourism": null, "tower:type": null, "tunnel": null, "water": null, "waterway": null, "wetland": null, "width": null, "wood": null, "z_order": null, "tags": "\"ford\"=>\"yes\"" }

e-n-f · 2018-07-23T18:04:46Z

Interesting. Thanks for the additional research. Can you add links to the files you are testing with?

e-n-f · 2018-07-23T22:12:39Z

New experiment with sorting the values but retaining the v2 encoding:

➤ ./tippecanoe -Voriginal -zg -f -o kathmandu-original.mbtiles ../kathmandu_nepal_osm_point.geojson
For layer 0, using name "kathmandu_nepal_osm_point"
12681 features, 2459902 bytes of geometry, 4 bytes of separate metadata, 389570 bytes of string pool
Choosing a maxzoom of -z11 for features about 228 feet (70 meters) apart
  99.9%  11/1508/860
➤ ./tippecanoe -Vreordered -zg -f -o kathmandu-reordered.mbtiles ../kathmandu_nepal_osm_point.geojson
For layer 0, using name "kathmandu_nepal_osm_point"
12681 features, 2459902 bytes of geometry, 4 bytes of separate metadata, 389570 bytes of string pool
Choosing a maxzoom of -z11 for features about 228 feet (70 meters) apart
  99.9%  11/1508/860
➤ ./tippecanoe -Vblake -zg -f -o kathmandu-blake.mbtiles ../kathmandu_nepal_osm_point.geojson
For layer 0, using name "kathmandu_nepal_osm_point"
12681 features, 2459902 bytes of geometry, 4 bytes of separate metadata, 389570 bytes of string pool
Choosing a maxzoom of -z11 for features about 228 feet (70 meters) apart
  99.9%  11/1508/860
➤ ls -l kathmandu-*mbtiles
-rw-r--r-- 1 enf staff 577536 Jul 23 14:58 kathmandu-blake.mbtiles
-rw-r--r-- 1 enf staff 655360 Jul 23 14:58 kathmandu-original.mbtiles
-rw-r--r-- 1 enf staff 602112 Jul 23 14:58 kathmandu-reordered.mbtiles

Blake format: 88% of original size
Reordered format: 92% of original size

➤ ./tippecanoe -Voriginal --no-tile-size-limit -zg -f -o neroads-original.mbtiles ../north-america-roads_natural-earth.geojson
For layer 0, using name "northamericaroads_naturalearth"
49183 features, 24440999 bytes of geometry, 2317328 bytes of separate metadata, 775873 bytes of string pool
Choosing a maxzoom of -z4 for features about 24225 feet (7384 meters) apart
Choosing a maxzoom of -z8 for resolution of about 1118 feet (340 meters) within features
  99.9%  8/41/96
➤ ./tippecanoe -Vreordered --no-tile-size-limit -zg -f -o neroads-reordered.mbtiles ../north-america-roads_natural-earth.geojson
For layer 0, using name "northamericaroads_naturalearth"
49183 features, 24440999 bytes of geometry, 2317328 bytes of separate metadata, 775873 bytes of string pool
Choosing a maxzoom of -z4 for features about 24225 feet (7384 meters) apart
Choosing a maxzoom of -z8 for resolution of about 1118 feet (340 meters) within features
  99.9%  8/41/96
➤ ./tippecanoe -Vblake --no-tile-size-limit -zg -f -o neroads-blake.mbtiles ../north-america-roads_natural-earth.geojson
For layer 0, using name "northamericaroads_naturalearth"
49183 features, 24440999 bytes of geometry, 2317328 bytes of separate metadata, 775873 bytes of string pool
Choosing a maxzoom of -z4 for features about 24225 feet (7384 meters) apart
Choosing a maxzoom of -z8 for resolution of about 1118 feet (340 meters) within features
  99.9%  8/57/96
➤ ls -l neroads-*mbtiles
-rw-r--r-- 1 enf staff 17453056 Jul 23 15:05 neroads-blake.mbtiles
-rw-r--r-- 1 enf staff 18923520 Jul 23 15:03 neroads-original.mbtiles
-rw-r--r-- 1 enf staff 18317312 Jul 23 15:04 neroads-reordered.mbtiles

Blake format: 92% of original size
Reordered format: 97% of original size

At least now I know it's worth spending a little extra time to sort the values before writing out the tile, even if there is some additional advantage to either inlining values or using repeated messages.
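For anyone re-checking the percentages, this small Python sketch recomputes them from the byte counts in the two `ls -l` listings above. The `sizes` dictionary and `percent_of_original` are illustrative names only; the numbers are copied from the listings.

```python
# Byte counts copied from the ls -l listings in this comment.
sizes = {
    "kathmandu": {"original": 655360, "reordered": 602112, "blake": 577536},
    "neroads": {"original": 18923520, "reordered": 18317312, "blake": 17453056},
}

def percent_of_original(size: int, original: int) -> float:
    """Size of a tileset expressed as a percentage of the v2-encoded original."""
    return 100.0 * size / original

for dataset, by_variant in sizes.items():
    for variant in ("blake", "reordered"):
        pct = percent_of_original(by_variant[variant], by_variant["original"])
        print(f"{dataset} {variant}: {pct:.1f}% of original size")
```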

e-n-f · 2018-07-23T22:19:45Z

The Natural Earth roads are improved slightly by also sorting the keys:

➤ ls -l neroads-*mbtiles
-rw-r--r-- 1 enf staff 17453056 Jul 23 15:05 neroads-blake.mbtiles
-rw-r--r-- 1 enf staff 18923520 Jul 23 15:03 neroads-original.mbtiles
-rw-r--r-- 1 enf staff 18292736 Jul 23 15:17 neroads-reordered.mbtiles

e-n-f · 2018-07-23T23:00:53Z

Blake's format, but without inline ints:

Kathmandu: 94% instead of 88%
Natural Earth roads: 97% instead of 92%

So I think inlining is helping more than repeated messages are.

e-n-f · 2018-07-24T17:29:01Z

Inlining floats does help a little, but the difference is in the noise (91.97% vs 92.16%):

-rw-r--r-- 1 enf staff 17403904 Jul 24 10:21 neroads-blake-float.mbtiles
-rw-r--r-- 1 enf staff 17440768 Jul 24 10:20 neroads-blake-regular.mbtiles

This also highlights that we need more than 3 bits for types. In fact this PR already actually requires 4, because it specifies "list / map" as type 8, which won't fit in 3 bits. I'll add a 4th type bit to the test implementation and recalculate.

e-n-f · 2018-07-24T17:33:20Z

Adding the 4th type bit raises the roads from using 92.16% to 92.49% of the original tileset size.
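One way to see where the extra bytes come from: each additional type bit shifts the parameter left by one more position, so some values cross a varint byte boundary sooner. The Python sketch below is only an illustration; `pack`, `varint_len`, and the type id 5 are assumptions for the example, not part of the test implementation.

```python
def varint_len(value: int) -> int:
    """Bytes needed to varint-encode an unsigned value (7 payload bits per byte)."""
    length = 1
    while value >= 0x80:
        value >>= 7
        length += 1
    return length

def pack(type_id: int, parameter: int, type_bits: int) -> int:
    """Put the type in the low bits and the parameter in the remaining high bits."""
    assert 0 <= type_id < (1 << type_bits)
    return (parameter << type_bits) | type_id

TYPE_ID = 5  # illustrative type id only

for parameter in (15, 16, 2047, 2048):
    with_3_bits = varint_len(pack(TYPE_ID, parameter, 3))
    with_4_bits = varint_len(pack(TYPE_ID, parameter, 4))
    print(f"parameter {parameter}: {with_3_bits} byte(s) with 3 type bits, "
          f"{with_4_bits} byte(s) with 4 type bits")
```

Only parameters that sit just below a boundary (15 and 2047 here) grow by a byte, which is consistent with the overall cost being a fraction of a percent.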

-rw-r--r-- 1 enf staff 17502208 Jul 24 10:31 neroads-blake-regular.mbtiles

mourner

Overall I really like the flat approach. While we introduce 4 bits per value, this should be more than offset by not wrapping each value as a separate tagged message, and nested properties fit here naturally.

mourner · 2018-07-27T10:53:11Z

3.0/vector_tile.proto

+                // list / map   |  8  | (if 4th bit is 0 is list)
+                //              |     |   remaining bits are length of the list where
+                //              |     |   each item in the list is a complex value
+                //              |     | (if 4th bit is 1 is map)


This is a bit confusing — the id 8 is 0b1000, but if the 4th bit is 1 (so that it becomes 0b1001), the id equals 9. Then why not just indicate 8 for list and 9 for map instead of mentioning the fourth bit?

I was attempting to get away with just 3 bits so that we could represent higher int values without having to use the int index system. I am not against 4 bits.

mourner · 2018-07-27T10:54:55Z

3.0/vector_tile.proto

+                //
+                // The properties field is much like the tags value in the it is two integers
+                // pairs that reference key and value pairs however, it is broken out into a
+                // "key_index" and an "complex_value". 


Nit: had to read many times to understand the sentence. the -> that? also, ; before "however" would help

mourner · 2018-07-27T11:18:43Z

Also, a question: how does an encoder decide whether to inline a value or put it in the packed array? Should we always inline int/sint values? And if we remove the "index to int/sint" types, the types fit into 3 bits again (including list/map without an additional bit).

flippmoke · 2018-07-27T14:49:33Z

@mourner We cannot currently remove the indexed integer types, because there is a limit to the size of value that can be represented inline. Here is how @ericfischer currently implemented the choice between inline and indexed. I think this makes sense overall. The only time an index might save a few bits over inlining is for values that are larger than can be represented by 24-bit integers, are highly repeated (probably more than 3 or 4 times), and have a low index value. This could be calculated on the fly if required, but I don't know that we need such a complex implementation. Overall, inlining seems to save space on average, so we may not need such complex code.
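The implementation referred to above is not reproduced in this thread, so what follows is only a byte-counting sketch of the trade-off, under loudly stated assumptions: 4 type bits, unsigned values, varint-encoded table entries, and no accounting for table headers. The type ids and the names `inline_cost` and `indexed_cost` are hypothetical.

```python
TYPE_BITS = 4      # assumption: 4 type bits, as discussed in this thread
INLINE_UINT = 5    # illustrative type ids, not normative
INDEXED_UINT = 4

def varint_len(value: int) -> int:
    """Bytes needed to varint-encode an unsigned value (7 payload bits per byte)."""
    length = 1
    while value >= 0x80:
        value >>= 7
        length += 1
    return length

def inline_cost(value: int, occurrences: int) -> int:
    """Total bytes if every occurrence carries the value itself."""
    word = (value << TYPE_BITS) | INLINE_UINT
    return occurrences * varint_len(word)

def indexed_cost(value: int, index: int, occurrences: int) -> int:
    """Total bytes for one table entry plus one index reference per occurrence."""
    table_entry = varint_len(value)  # assumption: the table stores varints
    reference = varint_len((index << TYPE_BITS) | INDEXED_UINT)
    return table_entry + occurrences * reference

# A value just past 24 bits, stored at a low table index and used four times.
print(inline_cost(2**24, 4), indexed_cost(2**24, 3, 4))
# A small value used four times.
print(inline_cost(7, 4), indexed_cost(7, 3, 4))
```

Under these assumptions the index only pays off for large, repeated values that land at a low index, and an encoder that always inlines whatever fits stays much simpler.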

joto · 2018-08-09T14:50:02Z

Here are some random thoughts:

Large integers that can't be inlined (due to us using those 4 bits for the type) could still be inlined by having a special type that says: next integer is the actual value.
We might not need both the uint64/sint64_values tables, because internally they are both varints and we know the type from the type field
We could put all tables into one large buffer, using offsets instead of indexes into those tables; we already know the value type. From an encoding point of view, all those tables in a layer are a problem because we need to keep them in memory until the layer is finished, so this could simplify things. (A variant of this would be to store the data inline the first time and use offsets after that.)
The proposal removes the distinction between keys and values tables and only has the string_values field. This simplifies things slightly (and saves a few bytes for the second table header) but probably makes the index values of the often-used keys larger (unless you take care to put all keys into the string table first). It also makes reuse of the keys table impossible in shaving or similar use cases. Creating this table is also more expensive in the first place, because the key space is usually small, which makes it easier to find the index of a given key. On the other hand, with nested maps a single table is a bit easier, because there is no question where all the strings are.
Keys are always strings, so we don't need to store the type for the keys. Saves the 4 bits.
How common is it for float/double values to actually occur multiple times in a layer, so that the lookup table makes sense? Maybe it is better to convert them to ints somehow and store them inline?

e-n-f · 2018-08-09T17:32:51Z

I am OK with inlining large integers as internal varints if we're also doing that with lists and maps. The value I see in not making any types variable-length is that it is nice to be able to know how many attributes there are just by dividing the length of the list by 2. But I'm not sure how much that really matters.
We either need separate signed and unsigned integer types, or we need to zigzag unsigned integers as well, since non-zigzag negative numbers take so much extra space.
Is there a way to represent the one-large-buffer approach in standard protobuf syntax, or does that make the format protozero-only?
No objection from me to a separate keys table.
I tried inlining floats and it didn't make much difference in size, and makes the format harder to describe and implement. Inlining doubles would require a larger-than-64-bit integer type to pack them into. It might be worth trying encoding the mantissa and exponent as a pair of varints and see how that works out, though.

e-n-f · 2018-08-09T21:10:04Z

The mantissas of floating point numbers seem to be fairly uniformly distributed across the [.5…1) interval, so there's probably not much potential for giving more common mantissas shorter representations. Low exponents are more common than high ones, though, so we might be able to squeeze a little bit out there.
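For reference, the mantissa/exponent split discussed here can be reproduced with Python's standard `math.frexp`, which returns a mantissa in the same [0.5, 1) interval. This is only a sketch of the decomposition, not a proposed encoding; scaling the mantissa by 2^53 to get an integer is an assumption made for illustration, and `split_double` and `join_double` are hypothetical names.

```python
import math

def split_double(x: float):
    """Split a finite double into (integer mantissa, binary exponent)."""
    mantissa, exponent = math.frexp(x)      # 0.5 <= |mantissa| < 1 for x != 0
    mantissa_int = int(mantissa * (1 << 53))  # exact: doubles carry 53 significant bits
    return mantissa_int, exponent

def join_double(mantissa_int: int, exponent: int) -> float:
    """Inverse of split_double."""
    return math.ldexp(mantissa_int, exponent - 53)

# 4.559190 is the "length" value from the sample road properties above.
for x in (4.559190, 11.0, 0.001):
    mantissa_int, exponent = split_double(x)
    assert join_double(mantissa_int, exponent) == x
    print(x, mantissa_int, exponent)
```

For nonzero normal doubles the integer mantissa always falls in [2^52, 2^53), so it has little room to shrink as a varint, while the exponent is a small number for everyday magnitudes.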

joto · 2018-08-16T12:41:41Z

Regarding special encoding of floating point numbers: I don't think it is worth it to come up with complex schemes here. I had thought about just using raw bytes stored in a string field or something. But while that might be easy to use in C++, it will be more difficult in JS or so.

Regarding the keys/string_values tables: With the encoding proposed here it doesn't cost us anything to split these up, because each string is encoded by itself anyway. But as mentioned it will lead to smaller index numbers, which, especially for the keys, is probably worth it. Here is another idea though: Currently all keys/string_values are directly in the layer object. If we push this down one level and have an intermediate string_table object, it could be more efficient. It would allow us to jump over the whole table or copy the whole table in one go. The cost is one more byte for the type and one varint for the length of the whole table. Double that if we have separate keys/string_values tables.

Regarding integer value encodings: If we put large integers that can't be inlined as separate varint in the properties array, we will hit a bad case for varints. Because they are always large, chances are they will get even larger as varints (max 10 bytes compared to 8 bytes for the int itself). So there is some inefficiency there. On the other hand, if we want to put them in an index table, we can use a fixed size type instead of a varint, which would avoid this and also make access more efficient, because we can directly address values in those tables without having to decode them first. So I think if we keep the tables, they should be of type (s)fixed32/64 instead of (s/u)int32/64. It would still be an indirect access which likely is slower than inlining though.
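The "max 10 bytes compared to 8 bytes" point can be checked with a few lines of Python. This is a plain byte count, with `varint_len` as a hypothetical helper; it says nothing about decoding speed.

```python
FIXED64_BYTES = 8  # a fixed64/sfixed64 entry always takes 8 bytes

def varint_len(value: int) -> int:
    """Bytes needed to varint-encode an unsigned 64-bit value (7 payload bits per byte)."""
    assert 0 <= value < (1 << 64)
    length = 1
    while value >= 0x80:
        value >>= 7
        length += 1
    return length

for bits in (49, 56, 57, 63, 64):
    largest = (1 << bits) - 1  # largest value that needs exactly this many bits
    print(f"{bits}-bit value: varint {varint_len(largest)} bytes, "
          f"fixed {FIXED64_BYTES} bytes")
```

A varint matches the fixed-size cost up to 56 bits and exceeds it from 57 bits onward, so a table that only ever receives values too large to inline gains little from varints.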

We either need separate signed and unsigned integer types, or we need to zigzag unsigned integers as well, since non-zigzag negative numbers take so much extra space.

This is one of those cases where we are hitting the limits of protobuf encoding again. We know the type, so we could do the zigzag encoding ourselves for sints and not for uints. For the C++ code this doesn't matter, because we do the zigzag encoding ourselves anyway, but for anybody using the protobuf encodings, we either need two tables, or they have to do the zigzag encoding outside the protobuf lib.

e-n-f · 2018-08-16T16:47:29Z

Glad to hear that inlining floats sounds like it is off the table.
I'm fine with putting keys and strings in separate tables, and within a container object if that is considered useful. I'll try changing my prototype to do that. Should we put all the attributes inside that message, or is there a case where readers just want to skip/copy the strings?
Good point that the integers-by-reference should be fixed-size instead of varint, since they will always be large if they didn't get inlined. I'll change my prototype and this .proto file to do that.
If we talk about inlining signed integers at all, we are inherently doing bit-packing, so we have to explicitly talk about either zigzagging or sign extension, and zigzagging is probably the better choice of the two. All clients, no matter what language they are written in, will have to be able to unpack inlined integers.

On a different topic:

If we inline lists and hashes, meaning that we mix single-word and multi-word values in the attribute list, we need to be clear about which sets of types occupy only a single slot and which refer to suffixes, so that soon-to-exist clients can skip over types that future versions of the standard may define. I think we should be explicit that types 9 through 15 only use a single slot and are either single-word inline types or reference types, not multi-word inline types.

joto · 2018-08-17T16:12:57Z

3.0/vector_tile.proto

-                // uses the properties field instead. This would only be used if version
-                // for a layer is 3 or greater and tags should not be used at that point
+                // Additional tags (or all the tags) of this feature may be
+                // encoded as repeated pairs of 32-bit integers, to take


Didn't we want to get away from the "tags" name and use "properties" instead? Also the properties field has 64bit uints, not 32 bit ints. And this is not necessarily "pairs" when we deal with lists and maps.

Good point, will reword. @flippmoke and I want to consistently call it "attributes" to match what OGC does.

joto · 2018-08-17T16:14:52Z

3.0/vector_tile.proto

@@ -69,6 +112,12 @@ message Tile {
                // See https://github.com/mapbox/vector-tile-spec/issues/47
                optional uint32 extent = 5 [ default = 4096 ];

+                repeated string string_values = 7;
+                repeated double double_values = 8 [ packed = true ];
+                repeated float float_values = 9 [ packed = true ];


Can we try to keep the types ordered consistently throughout the .proto file? Some places have float first, then double; others use a different order.

Sounds good to me. I'll make that edit.

joto · 2018-08-17T16:15:58Z

3.0/vector_tile.proto

+                repeated float float_values = 9 [ packed = true ];
+                repeated sfixed64 sfixed64_values = 10 [ packed = true ];
+                repeated fixed64 fixed64_values = 11 [ packed = true ];
+


I suggest these should get "logical" names like signed_integer_values or so instead of ones based on the encoding sfixed....

Also fine with me.

joto · 2018-08-17T16:17:49Z

3.0/vector_tile.proto

+                //              |     |   each item in the list is a complex value
+                //              |     | (if 4th bit is 1 is map)
+                //              |     |   remaining bits are the number of key_index and 
+                //              |     |   complex_value pairs to follow (same as properties)


Can we simply make these list -> 8, map -> 9? The extra bit is confusing and doesn't buy us anything, because we already have 9 values (0-8) for the Id anyway.

Sounds good to me, since we have enough type fields to spare. I think they were combined only because it looked like the types would fit in 3 bits.

joto · 2018-08-17T16:18:00Z

3.0/vector_tile.proto

+                // an index position into a value storage of the layer.
+                // 
+                // uint64t type = complex_value & 0x0F; // First 4 Bits
+                // uint64t parameter = complex_value >> 4;


Good point, will fix

flippmoke · 2018-08-17T18:16:04Z

3.0/vector_tile.proto

+                // bool/null    |  7  | value of 0 = false, 1 = true, 2 = null
+                // list         |  8  | value is the number of sub-attributes to follow:
+                //              |     |   each item in the list is a complex value
+                // map          |  9  | value is the number of sub-attributes to follow:


A question on the intent of the wording here: is the number of sub-attributes to follow the number of key/value pairs, or the total number of keys and values?

I meant it to be the number of pairs, not the total number of words to follow. Thanks. I'll reword.
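To make the "number of pairs" reading concrete, here is a minimal Python decoder sketch for the flat layout. It covers only the type ids quoted in the diff above (5 through 9) and assumes that each key is a plain, untyped index into a separate keys table; that assumption, the helper names, and the sample data are illustrative and not taken from the .proto file.

```python
INLINE_UINT, INLINE_SINT, BOOL_NULL, LIST, MAP = 5, 6, 7, 8, 9

def zigzag_decode(n: int) -> int:
    return (n >> 1) ^ -(n & 1)

def decode_value(words, pos, keys):
    """Decode one complex value starting at words[pos]; return (value, next_pos)."""
    word = words[pos]
    type_id, parameter = word & 0x0F, word >> 4  # low 4 bits: type, rest: parameter
    pos += 1
    if type_id == INLINE_UINT:
        return parameter, pos
    if type_id == INLINE_SINT:
        return zigzag_decode(parameter), pos
    if type_id == BOOL_NULL:
        return {0: False, 1: True, 2: None}[parameter], pos
    if type_id == LIST:
        items = []
        for _ in range(parameter):           # parameter = number of items
            item, pos = decode_value(words, pos, keys)
            items.append(item)
        return items, pos
    if type_id == MAP:
        result = {}
        for _ in range(parameter):           # parameter = number of key/value pairs
            key = keys[words[pos]]
            value, pos = decode_value(words, pos + 1, keys)
            result[key] = value
        return result, pos
    raise ValueError(f"type {type_id} is not handled in this sketch")

def decode_attributes(words, keys):
    """Top level: repeated (key_index, complex_value) until the words run out."""
    attributes, pos = {}, 0
    while pos < len(words):
        key = keys[words[pos]]
        value, pos = decode_value(words, pos + 1, keys)
        attributes[key] = value
    return attributes

keys = ["rank", "tags", "ford", "levels"]
words = [0, 5,           # rank: inline uint 0
         1, 25, 2, 23,   # tags: map with 1 pair -> ford: true
         3, 40, 22, 53]  # levels: list of 2 -> inline sint -1, inline uint 3
print(decode_attributes(words, keys))
```

With a pair count, the map in the example is introduced by `(1 << 4) | 9` even though two words follow it, and the list by `(2 << 4) | 8`.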

flippmoke · 2018-08-23T15:26:03Z

3.0/vector_tile.proto

+                // int          |  3  | index into layer.attribute_pool.signed_integer_values
+                // uint         |  4  | index into layer.attribute_pool.unsigned_integer_values
+                // inline uint  |  5  | value of unsigned integer (values between 0 to 2^60-1)
+                // inline sint  |  6  | value of zigzag-encoded integer (values between -2^59 to 2^59-1)


We probably should change this to be 2^56 for uint and 2^55 for signed due to the way varints are encoded.

I think we need to differentiate between what encodings are possible and what encodings are recommended. The spec may well say that this or that encoding is recommended because it is usually better, but still require readers to understand a different encoding.

flippmoke · 2018-09-27T20:26:19Z

Closing in favor of #123

Example complex value setup to perhaps increase compression of properties

7217923

flippmoke changed the base branch from master to v3.0-development July 20, 2018 19:40

Small fixes to properties wording

fe9139a

Fix wording yet again

736b0ec

e-n-f reviewed Jul 20, 2018

e-n-f mentioned this pull request Jul 20, 2018

EXPERIMENTAL work on prototyping the VT3 vector tile spec revision mapbox/tippecanoe#611

Open

4 bits are needed for value types. Use sint64 instead of int64.

5f954d6

mourner reviewed Jul 27, 2018

Correct grammar

66e465c

e-n-f added 2 commits August 16, 2018 11:08

Change tables of integers by reference to be fixed-size

5bd6daa

Merge branch 'blake_properties' of github.com:mapbox/vector-tile-spec into blake_properties

d5d96e5

joto reviewed Aug 17, 2018

e-n-f added 2 commits August 17, 2018 10:44

Reword and reorder for consistency. Add separate attribute pool message.

af1d5d9

Be clear about undefined types 10 through 15

3f262a4

flippmoke commented Aug 17, 2018

Be clear that the count for a map refers to the number of pairs

522bca8

joto mentioned this pull request Aug 20, 2018

Changes for advanced attributes. #123

Merged

flippmoke commented Aug 23, 2018

e-n-f mentioned this pull request Sep 1, 2018

Feature Request: Minify attribute names mapbox/tippecanoe#638

Closed

e-n-f added 2 commits September 4, 2018 16:12

Revise to match #123

001e3ed

Move string values back to the top level of the layer

2f95642

joto mentioned this pull request Sep 22, 2018

VT3 support mapbox/vtzero#43

Open

7 tasks

flippmoke closed this Sep 27, 2018
