This repository has been archived by the owner on Sep 1, 2023. It is now read-only.

Evaluate options for a language-independent checkpoint/serialization format for the CLA #333

Closed
rhyolight opened this issue Oct 25, 2013 · 20 comments

@rhyolight
Member

The current CLA model checkpoint uses the pickle module. As we move towards multiple language support and more external model sharing, we should define a language-independent format for serializing the CLA.

The major objectives for this would be cross-language implementation (i.e. we don't have to create serialize functions separately for each language), speed, checkpoint size, and ease of development and versioning.
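For reference, a minimal sketch of what the pickle-based checkpointing looks like today. `Model`, `save_checkpoint`, and `load_checkpoint` are hypothetical stand-ins for illustration, not the real NuPIC classes:

```python
import os
import pickle
import tempfile

class Model:
    """Hypothetical stand-in for a CLA model; not the real NuPIC class."""
    def __init__(self, columns, permanences):
        self.columns = columns
        self.permanences = permanences

def save_checkpoint(model, path):
    # pickle ties the checkpoint to Python (and to this exact class layout),
    # which is the language dependence this issue wants to remove
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

model = Model(columns=[0, 1, 2], permanences=[0.2, 0.4, 0.6])
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
save_checkpoint(model, path)
restored = load_checkpoint(path)
```

A checkpoint written this way can only be read back by a Python process that has the same class definition available, which is why a schema-based, language-independent format is attractive.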

Status

  • Protocol Buffers - The most mature. Has implementations in many languages.
  • Cap'n Proto - Has C++11 and Python implementations. Has speed advantages and potentially future ability for mmap. Much less mature than Protocol Buffers but seems better designed and has active development.
  • Rejected Flat Buffers - From Google, similar in spirit to Cap'n Proto but less mature. Probably not as good as Cap'n Proto since a fairly specific use case drives its implementation details. Also has no Python implementation.
  • Rejected Some form of binary JSON (MessagePack, BJSON) - Rejected because these options don't offer easy backwards compatibility.
  • Rejected Thrift - Inferior to Protocol Buffers and more difficult to use.
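To make the backwards-compatibility point concrete, here is a small sketch of the burden schemaless formats impose, using stdlib `json` as a stand-in for a binary JSON format like MessagePack. Without numbered fields and schema-level defaults, every reader must patch up old checkpoints by hand; the field names below are made up for illustration:

```python
import json

def load_sp_state(blob):
    state = json.loads(blob)
    # A v1 checkpoint predates the (hypothetical) "boostStrength" field, so
    # the reader has to know about and supply the default itself; schema-based
    # formats like Protocol Buffers handle this via field numbers and defaults.
    state.setdefault("boostStrength", 1.0)
    return state

# An old checkpoint written before "boostStrength" existed
old_blob = json.dumps({"version": 1, "numColumns": 2048})
state = load_sp_state(old_blob)
```

Every such default has to be maintained in every language binding forever, which is the maintenance cost behind rejecting the binary-JSON options.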
@ghost ghost assigned scottpurdy Oct 25, 2013
@iandanforth
Contributor

Thrift is a PITA both conceptually and in implementation.


@scottpurdy
Contributor

@iandanforth Yes I have heard (and experienced) similar opinions. Frank mentioned the same in JIRA I believe. Something like MessagePack is easy to use but is harder to maintain and doesn't work as well across languages (as far as I can tell).

I am leaning towards protobufs or similar. I quite like Cap'n Proto and there is active development, but when I last looked it was still a bit raw. One potential advantage of it over PB is that it has planned support for memory mapping, which could make deserialize-run-serialize operations very fast (this is the current Grok use case).

@scottpurdy scottpurdy assigned vsinha and unassigned scottpurdy May 30, 2014
@scottpurdy scottpurdy changed the title Decide on a language-independent checkpoint/serialization format for the CLA Evaluate options for a language-independent checkpoint/serialization format for the CLA Jun 2, 2014
@scottpurdy
Contributor

Here is my branch with initial Python spatial pooler test:
https://github.com/scottpurdy/nupic/tree/sp-capnp

The files are in nupic/proto.

My initial results weren't very good. I suspect there is type conversion going on. It would probably be better to do the first tests in C++ where it is more obvious what is happening. And I wasn't doing thorough profiling, just saw that the time to create the model and the time to feed a record in were longer. From a theoretical standpoint, I am fairly confident that we can make serialization/deserialization and runtime both faster though.

@utensil
Member

utensil commented Jun 5, 2014

👍

@scottpurdy
Contributor

There are currently four different options we want to measure times for: C++ and Python protocol buffers and C++ and Python Cap'n Proto buffers.

There are two ways to use these. In the simple scenario, we leave the code pretty much the same and simply create and populate the buffers when we need to serialize. The other is to actually use the buffer in memory during execution. This doesn't work for some fields, like sparse matrices, but it works for almost everything else.

In the former case, we want to measure the times for the following operations:

  • Time to create and populate the buffer
  • Time to serialize the buffer to disk
  • Time to deserialize the buffer into memory
  • Time to create the object and copy data from the buffer into it

In the case that we use the buffer in memory during execution, we would want to measure the time it takes to run records through, in addition to the times for copying/serializing/deserializing.

Finally, when doing these timing tests it is important to run some records through the pre-serialization and post-deserialization objects and compare the results to ensure that everything is implemented correctly (it wouldn't be a fair timing test if some pieces were left out!).
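A rough sketch of such a harness, with `pickle` standing in for the serializer under test (protobuf or Cap'n Proto would be dropped in instead) and a toy `run_records` in place of real SP/TM execution; all names here are illustrative, not NuPIC APIs:

```python
import os
import pickle
import tempfile
import time

def time_op(fn):
    """Run fn once, returning its result and the elapsed wall-clock time."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

def run_records(state, records):
    # deterministic stand-in for feeding records through the model, so the
    # pre-serialization and post-deserialization outputs can be compared
    return [sum(r) + state["seed"] for r in records]

def save(obj, path):
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load(path):
    with open(path, "rb") as f:
        return pickle.load(f)

state = {"seed": 7, "permanences": list(range(1000))}
records = [[1, 2], [3, 4]]
before = run_records(state, records)

path = os.path.join(tempfile.mkdtemp(), "ckpt.bin")
blob, t_populate = time_op(lambda: pickle.dumps(state))  # create/populate buffer
_, t_serialize = time_op(lambda: save(state, path))      # serialize buffer to disk
restored, t_deserialize = time_op(lambda: load(path))    # deserialize into memory

# fairness check: identical results before and after the round trip
assert run_records(restored, records) == before
```

The final assertion is the correctness check described above; a serializer that silently dropped a field would fail it rather than post an artificially good time.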

@utensil
Member

utensil commented Jun 21, 2014

Yeah, I've seen the code, and it's taking the "actually use the buffer in memory during execution" approach.

@utensil
Member

utensil commented Jun 21, 2014

It would probably be better to do the first tests in C++ where it is more obvious what is happening. And I wasn't doing thorough profiling, just saw that the time to create the model and the time to feed a record in were longer.

It seems that Cap’n Proto does the encoding when feeding the data, and decoding on retrieval:

Cap’n Proto gets a perfect score because there is no encoding/decoding step. The Cap’n Proto encoding is appropriate both as a data interchange format and an in-memory representation, so once your structure is built, you can simply write the bytes straight out to disk!

So I guess it's definitely not an option to use Cap'n Proto as the internal structure of the SP and such; it would slow things down.

@vsinha
Member

vsinha commented Jun 23, 2014

Cap'n Proto is currently C++11-only and I haven't been able to get it to link with the current nupic.core (or even Marek's C++11 branch).

Performance with protocol buffers looks good: both creating the protocol buffer and serializing it, as well as deserializing and loading variables back into the class fields, were about 2x faster than the current implementation. Right now my implementation doesn't use the protobuf object in memory throughout SP execution; it allocates and populates it when the save function is called.

@h2suzuki

I'm wondering... what kind of API do we need for language bindings?
Should it be an IDL-style interface allowing generic RPC (procedural)?
Or should it be a memcached- or SQL-style text/binary protocol allowing generic data access (data-oriented)?

It also depends on the medium we communicate over: a dynamic linking mechanism or mere network packets. We can bind at multiple levels. Does efficiency matter? Speed can be optimized for throughput, latency, operations per second, etc.

@rhyolight rhyolight modified the milestones: Sprint 27, Sprint 26 Jul 25, 2014
@rhyolight
Member Author

When I get back from vacation, I want to have a discussion about updating to C++11. That might change the implementation of this issue significantly.

@utensil
Member

utensil commented Aug 8, 2014

There's #130 taking the approach of using Google protocol buffers. Just wondering if anyone has noticed and evaluated FlatBuffers, which is also a successor to protocol buffers and developed by Google. At first glance, it also provides "Access to serialized data without parsing/unpacking" and has miscellaneous efficiency improvements, just like Cap'n Proto.

@rhyolight rhyolight modified the milestones: Sprint 27, Sprint 28 Aug 8, 2014
@rhyolight
Member Author

@scottpurdy Would you call this ticket complete?

@utensil
Member

utensil commented Aug 22, 2014

rhyolight commented a day ago

@scottpurdy Would you call this ticket complete?

If it's complete, what's the conclusion?

@scottpurdy
Contributor

We don't have a decision on this yet. I think it makes sense to keep tracking here. I will create a follow up issue to track the implementation to be done after we finalize the decision.

@scottpurdy
Contributor

Now that we have a C++11 nupic.core I am going to attempt Cap'n Proto again.

@scottpurdy
Contributor

We have more motivation for this issue from #1231 which is a somewhat serious bug in NuPIC.

@rhyolight rhyolight modified the milestones: Sprint 28, Serialization Sep 18, 2014
@rhyolight
Member Author

@scottpurdy Once C++11 is finished across nupic and nupic.core, the decision is to start a Cap'n Proto implementation, right?

@scottpurdy
Contributor

@rhyolight - that is my current plan, yes

@rhyolight
Member Author

Shall we close this yet?



@rhyolight
Member Author

Closing, assuming we're going with Cap'n Proto in #1336.
