Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fancy encoding for properties #26

Closed
tmcw opened this issue Dec 10, 2014 · 7 comments
Closed

Fancy encoding for properties #26

tmcw opened this issue Dec 10, 2014 · 7 comments

Comments

@tmcw
Copy link
Contributor

tmcw commented Dec 10, 2014

Since @mourner is checking this library out, might as well:

Could we encode properties in a more efficient way? Right now we're re-encoding property name & value for every single feature. This could save a lot of space if (a) we encode names once and then do an array or (b) we do some magic around structs

@mourner
Copy link
Member

mourner commented Dec 10, 2014

I like the indexes trick from vector-tile-spec, it's pretty neat: https://github.com/mapbox/vector-tile-spec/blob/master/1.0.1/vector_tile.proto#L32-L39 — encode both names and values once; use a single integer array for pairs of indexes to both

@brendan-ward
Copy link

For stricter schema based formas (e.g., shapefile) going into this, I think it would make sense to encode names once, and then encode an array of values (using oneof approach) by feature, e.g.,

message feature {
    repeated geometry geometries = 1;
    repeated property properties = 2;
    optional id id = 3;
}

message property {
    oneof value {
        string string_value = 1;
        float float_value = 2;
        ....
    }
}

Can you get away with the same approach for non-strict schema formats like GeoJSON? I was thinking in those cases it would be possible to scan the GeoJSON and extract all unique keys, then store a bitmask (encoded as repeated ints?) of the keys present with each feature to use for decoding. Something like:

pseudocode:

all keys = ["A", "B", "C"]
bitmask (for feature 1) =  010     (meaning only "B" was present for this feature)
bitmask encoded as int value =  2   (obviously bitmask would need to be divided into word sizes appropriate for the ```int``` type used)

@mourner
Copy link
Member

mourner commented Dec 11, 2014

@brendan-ward it may work but it's hard and slow to decode and is a bit complicated... What do you think about storing values as a global array as well? The way I commented above is simple and should be almost as efficient.

@brendan-ward
Copy link

@mourner The index approach you suggested could work great too; I was only offering up other alternatives I had been considering (should have made that more clear, sorry). Definitely not advocating for slower performance or complexity of implementation.

I guess my concern would be that while keys at the layer level would be relatively small, values array could grow very large in this case in a way that wouldn't be the case for tiled data: it would be num features * num properties per feature. Still within bounds of uint32 in the cases I can think through though!

@mourner
Copy link
Member

mourner commented Dec 11, 2014

Well, we could bump it to uint64 — it's still varint-encoded so shouldn't affect sizes, but I think uint32 (0 to 4294967295) is plenty.

@brendan-ward
Copy link

@mourner the more I look at your approach, the more I'm also convinced it is the right way to go, especially if you are planning to store unique string values (why not extend that to ints as well?). I definitely retract my much more complicated approach!

@mourner
Copy link
Member

mourner commented Dec 16, 2014

Closing in favor of #27

@mourner mourner closed this as completed Dec 16, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants