Phase 3: encoding #20

springmeyer · 2017-06-30T17:24:27Z

This issue is to track overall discussion on the next phase of development for this library: encoding.

The previous phases were Phase 1 (enumerations) and Phase 2 (decoding) (landed in mapbox-gl-native in mapbox/mapbox-gl-native#9312 🎉 ).

The primary goals for the encoding implementation are:

High performance / minimal overhead / low to zero allocation
Headers are modular and able to be used in a zero cost way (you don't get any extra dependencies than absolutely required for a given header - e.g. no dependency on geometry.hpp or variant unless the header absolutely needs it)
API is low-level like protozero. Downstream developers will be able to easily abstract on top of it be be able to integrate into libraries like mapnik-vector-tile / tippecanoe.

Roles

@joto will lead development of design over next 3 weeks (July 3-21)
@flippmoke will support, test, advise on viability of API for integration downstream

/cc @mapbox/core-tech

joto · 2017-07-06T07:38:58Z

Here are some notes/todo items:

Use cases

Common use cases that the library must support:

Reading VT, decoding all layers.
Reading VT, only decode some layers based on type and/or name.
When reading layer, only use some features based on tags, only decode geometry when needed
Copy parts of VT based on layer type/name and/or feature tags
Copy parts of a VT and add newly generated parts (layers, features)
Create VT from scratch
Work together with geometry.hpp lib. Should be able to create/read geometry.hpp geometries.

It would be good to have small test programs for all use cases. They can be used for development to make sure use cases are supported and the interface is easy to use and later are the basis for tests and docs.

Features

When writing features, some kind of rollback is needed if the user decides to abandon writing a feature. This is used for instance when it turns out a geometry is invalid.
Think about range checking for geometry values (fit in extent plus some extra? fit in 16 bit?). Avoid values that might overflow in our code or in user code.
Think about range checks in names and string tags to avoid overflows and excessive memory use.
Geometry pre-allocation (reserve()) based on count in geometry commands?
Error reporting. Add error messages in exceptions. Think about classification of exception classes.
Check iterator type traits for layer/feature iterators. Is this an input iterator or a forward iterator?

TODO

We need more test data. A collection of "typical" vector tiles with varying content would be useful. Small tiles, large tiles, just on layer, many layers, different tag value types, different geometries etc.

mapsam · 2017-07-11T17:53:28Z

Thanks for writing this up @joto!

We need more test data

I'd be interested in helping out with this. I started working on mvt-fixtures (which is currently being used by this library) and can add more there. Also recognizing having fixtures in a separate repository might be a pain to maintain, happy to add some here.

joto · 2017-07-13T12:49:26Z

@mapsam I think having the fixtures in an extra repository is okay. Yes, a bit harder to use here, but easier to use across projects.

Here are some things you can help with:

We could use a collection of vector tiles from different sources. Something like the one started for the benchmarks, but more diverse sources: Different zoom levels, different types (street map tiles, QA tiles, very simple tiles without tags, complex tiles with many tags and layers, etc.) Whatever people use vector tiles for. Doesn't have to be all copied in here, but some URLs could also help as a first step. The collection is a good basis to make sure different use cases are handled well in the lib. It is too easy to overlook use cases and assume things which might not be true for all vector tiles. Typical vector tiles will only have very few tag keys per layer for instance, but that is not true for QA tiles. So this is to make sure we cover all bases.
We need some systematic way of creating tiles with all (or as many as possible) legal and illegal corner cases (like no features in a layer, etc.) There is a good start in mvt-fixtures/fixtures/invalid, but I suspect there must be more.
We need tiles that are deliberately created with invalid data. Such as a length field in the data saying there are 20 billion points in this linestring even if there aren't. This is essential to test our library against attackss. This needs knowledge of the underlying data format. Best approach here: Write a program that creates a normal tile, then change a byte here or there and write it to disk. Keep the program as documentation and when you want to change the test examples later.
We should think about some way of structuring the test vector tiles and have some description of the tests not just in the README, but more structured format. For instance a json file for each test that has the information: Is this a valid tile according to spec version 1? version 2? Not valid but should be readable anyway? Not valid and must throw error? Recoverable error (ignore this feature but keep reading other features) or unrecoverable error (somebody is messing with us, stop reading from this tile)? I am sure there are more criteria here. It is important to think about the different categories of errors here, so the library can do the right thing.

mapsam · 2017-07-19T17:40:45Z

@joto I like the idea of improving how to describe fixtures, such as a JSON object. Could we use https://github.com/yinqiwen/pbjson to create protobufs from these JSON objects?

joto · 2017-07-24T08:19:12Z

@mapsam I was just thinking about using JSON as a way to encode some metadata about the tests not the test data itself. The test data can be created in many ways, for some it might be possible to use pbjson or something like it, for other we really need our own code where we can fiddle with the bits.

Test data should be easy to create and we also want to make sure that we don't introduce too many dependencies. After all we don't want to test pbjson or anything like it but our own code, so it might be better to stick with tried and tested libs like the official Google protocol buffers libraries to create the test data. And for the tests we don't need the performance that a C++ lib will provide, maybe the convenience of a Javascript or Python library is better here. But feel free to experiment!

springmeyer · 2017-08-01T20:21:32Z

We need tiles that are deliberately created with invalid data. Such as a length field in the data saying there are 20 billion points in this linestring even if there aren't. This is essential to test our library against attackss. This needs knowledge of the underlying data format. Best approach here: Write a program that creates a normal tile, then change a byte here or there and write it to disk. Keep the program as documentation and when you want to change the test examples later.

This is really well articulated and would be very very valuable. I agree that it is critical to have the changes to make the tiles invalid be targeted, specific, and documented in code (so you could return when you think of another invalid change).

I will add however that the random approach is also valuable in finding edge cases (fuzzing) things that would break with invalid input we can't think of and I noticed https://github.com/google/libprotobuf-mutator which could be used for that effort.

springmeyer · 2017-09-10T23:08:15Z

Going to close this ticket as I think we are resolved on all the issues discussed above.

To recap:

The encoding implementation is done, thanks to @joto. We've decided it will live in a newly named repo that has no dependency on anything but protozero. See mapbox/vtzero#1.

The small, single-purpose/malicious fixture discussion has blossomed into mvt-fixtures 3.x. The suite is ready to use to add many more fixtures and we can track work there. See mapbox/mvt-fixtures#15 and mapbox/mvt-fixtures#21

The large/varied tile samples discussed above is not done, but would be easy to start working on if we had a ticket dedicated to the idea. So I've created mapbox/mvt-fixtures#22.

springmeyer mentioned this issue Jul 3, 2017

Align property parsing with JS implementation #16

Merged

springmeyer mentioned this issue Aug 7, 2017

JS refactor mapbox/mvt-fixtures#12

Merged

3 tasks

springmeyer mentioned this issue Sep 10, 2017

Minimal vector tile C++ decoder and encoder mapbox/vtzero#1

Closed

6 tasks

springmeyer closed this as completed Sep 10, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Phase 3: encoding #20

Phase 3: encoding #20

springmeyer commented Jun 30, 2017 •

edited

Loading

joto commented Jul 6, 2017 •

edited

Loading

mapsam commented Jul 11, 2017

joto commented Jul 13, 2017

mapsam commented Jul 19, 2017

joto commented Jul 24, 2017

springmeyer commented Aug 1, 2017 •

edited

Loading

springmeyer commented Sep 10, 2017

Phase 3: encoding #20

Phase 3: encoding #20

Comments

springmeyer commented Jun 30, 2017 • edited Loading

The primary goals for the encoding implementation are:

Roles

joto commented Jul 6, 2017 • edited Loading

Use cases

Features

TODO

mapsam commented Jul 11, 2017

joto commented Jul 13, 2017

mapsam commented Jul 19, 2017

joto commented Jul 24, 2017

springmeyer commented Aug 1, 2017 • edited Loading

springmeyer commented Sep 10, 2017

springmeyer commented Jun 30, 2017 •

edited

Loading

joto commented Jul 6, 2017 •

edited

Loading

springmeyer commented Aug 1, 2017 •

edited

Loading