Design and API

haberman edited this page Apr 7, 2011 · 9 revisions

upb is a standalone implementation of protocol buffers written in C. The primary goals are:

  • performance, but without requiring compile-time per-message code generation.
  • portability (ANSI C with no dependencies).
  • small code size and memory footprint.
  • good dynamic language support (an API that is easy to write efficient dynamic language extensions for).

The core library is ANSI C and implements a table-based decoder. There is also an x86-64-specific JIT that delivers a significant (~3x) speedup over the table-based decoder. Client code does not need to change to use the JIT.

Event-Based Decoding

upb's lowest layer is an event-based (or "stream-based") decoder. This means that you register callbacks with the decoder, and these callbacks are called when a value is parsed, much like SAX for XML. This is in contrast to Google's protobuf release, which takes a more DOM-like approach where the data is unpacked into a data structure with getters and setters.

  // Google protobuf approach: I create a message, and parse from the
  // string into the message.
  MyMessageType msg;
  msg.ParseFromString(str);
  cout << msg.my_field();

  // upb approach: I register callbacks that will be called
  // when specific values are parsed.
  upb::Handlers handlers;
  upb::FieldDef *f = m->GetFieldByName("my_field");
  handlers.RegisterValueCallback(f, &MyCallback, upb::Value(5));
  upb::Decoder d(&handlers);
  d.Reset(input_stream, closure);

  // Calls MyCallback() when "my_field" is parsed.
  d.Decode(); 

upb::Flow MyCallback(void *closure, upb::Value fval, upb::Value val) {
  // "closure" is our context where we can put our app's data for this message.
  // "fval" is the value we bound to this field when we registered the callback (5).
  // "val" is the value that was actually parsed for the message.
  MyMessage *m = (MyMessage*)closure;
  m->SetMyField(val);

  // Could return UPB_BREAK instead if we want to stop decoding now.
  return UPB_CONTINUE;
}

The streaming decoder takes more code to use and is therefore somewhat less convenient. However, it is significantly more efficient, and for applications that need to be maximally efficient, the extra work is worth it.

Using a streaming decoder also lets you decode protobuf data into your own data structures. For example, suppose you wanted to represent the data in memory as an STL map or set -- with an event-based decoder you can put the values into your custom data structure efficiently. With Google's protobuf, you would waste a lot of work copying from one structure type to another.
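To make the "decode into your own data structure" point concrete, here is a minimal sketch of an event-based decoder that stores values straight into a std::map. It is not upb's API: ValueCallback, ReadVarint, DecodeVarintFields and StoreInMap are hypothetical names made up for this illustration, and the sketch only handles varint fields (wire type 0) from the protobuf wire format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical callback type (not upb's): called once per parsed varint
// field. Returning false stops decoding early.
using ValueCallback = bool (*)(void *closure, uint32_t field_number,
                               uint64_t value);

// Reads one base-128 varint from buf at *pos. Returns false if the input is
// truncated or the varint is longer than 10 bytes.
static bool ReadVarint(const std::string &buf, size_t *pos, uint64_t *out) {
  uint64_t result = 0;
  for (int shift = 0; shift < 64; shift += 7) {
    if (*pos >= buf.size()) return false;
    uint8_t byte = static_cast<uint8_t>(buf[(*pos)++]);
    result |= static_cast<uint64_t>(byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) {
      *out = result;
      return true;
    }
  }
  return false;
}

// Walks a buffer of varint fields and fires the callback for each one.
// Any other wire type is rejected to keep the sketch short.
bool DecodeVarintFields(const std::string &buf, ValueCallback cb,
                        void *closure) {
  assert(cb != nullptr);
  size_t pos = 0;
  while (pos < buf.size()) {
    uint64_t tag = 0, value = 0;
    if (!ReadVarint(buf, &pos, &tag)) return false;
    if ((tag & 7) != 0) return false;  // Low 3 bits of the tag: wire type.
    if (!ReadVarint(buf, &pos, &value)) return false;
    // Upper bits of the tag are the field number.
    if (!cb(closure, static_cast<uint32_t>(tag >> 3), value)) return true;
  }
  return true;
}

// The callback writes into the application's own container; no intermediate
// message object is ever built.
bool StoreInMap(void *closure, uint32_t field_number, uint64_t value) {
  auto *fields = static_cast<std::map<uint32_t, uint64_t> *>(closure);
  (*fields)[field_number] = value;
  return true;
}
```

Calling DecodeVarintFields(buf, &StoreInMap, &my_map) fills my_map as the bytes are scanned. Swapping in a different callback would fill a set, a vector, or a language-binding object instead, with no change to the decoding loop.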

Also, you can efficiently layer a DOM-like model on top of a SAX-like model. In other words, you could efficiently make Google protobuf structs use upb's parser, and the result would be as efficient as what Google protobuf does already. The reverse is not true -- implementing a SAX parser on top of a DOM-based parser is not as efficient as using SAX alone.

upb::Handlers: specifying callbacks for the streaming parse.

The upb::Handlers object is logically a table mapping fields to action handlers. For every scalar field (numbers, enums, strings) there can be a corresponding value handler. Note that each field can have its own callback (the handler for my_field1 can be different from my_field2), so the callback can be highly specialized for the field. You can also bind a value, called the "fval" (field value), to each field's callback; like the callback, it is specific to that one field.

For submessages and groups, there can be a "startsubmsg" and "endsubmsg" handler attached, and like the value handlers each separate submessage field can have a separate set of startsubmsg/endsubmsg handlers. And finally there can be overall "startmsg" and "endmsg" handlers that are called when parsing begins and ends, respectively.

A user-provided void *closure is passed by the decoder to all of the callbacks, and can be used to store whatever data is required (for example, the message we are populating). A field-specific fval can be bound to a specific callback to pass field-specific information to the callback. And of course the parsed value itself is passed to the callback.

// Empty set of handlers
upb::Flow startmsg(void *closure) {
  // Called when the message begins.
  return UPB_CONTINUE;
}

void endmsg(void *closure, upb::Status *status) {
  // Called when processing ends, whether in success or failure.
  // This callback is guaranteed to be called eventually, so it can be used
  // to perform any cleanup that is required.
  //
  // "status" indicates the final status of processing, and can be modified
  // to affect the final status.
}

upb::Flow value(void *closure, upb::Value fval, upb::Value val) {
  // A single value was parsed.
  return UPB_CONTINUE;
}

upb::Flow startsubmsg(void *closure, upb::Value fval) {
  // A submsg is beginning.
  return UPB_CONTINUE;
}

upb::Flow endsubmsg(void *closure, upb::Value fval) {
  // A submessage has ended.
  return UPB_CONTINUE;
}

upb::Flow unknownval(void *closure, upb::String *str) {
  // An unknown value was encountered.
  return UPB_CONTINUE;
}

Note that the upb::Handlers object is not coupled to the Decoder -- any class can call the given handlers. For example, a class that parses the protobuf text format can use the same handlers class, which means that the handlers themselves do not need to be aware of what format the data is coming from.
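The decoupling described above can be sketched with a toy handlers table that two unrelated producers drive. Everything here is a hypothetical stand-in rather than upb's real interface: the Flow enum, the Handlers class with RegisterValueHandler and DispatchValue, and the two producers FeedPairs and FeedText (which reads a made-up "field=value" text format) exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

enum Flow { kContinue, kBreak };
using ValueHandler = Flow (*)(void *closure, int64_t fval, int64_t val);

// Logically a table from field number to (handler, bound fval).
class Handlers {
 public:
  void RegisterValueHandler(uint32_t field, ValueHandler h, int64_t fval) {
    assert(h != nullptr);
    table_[field] = Entry{h, fval};
  }
  // Any producer of events may call this. Fields with no handler are skipped.
  Flow DispatchValue(void *closure, uint32_t field, int64_t val) const {
    auto it = table_.find(field);
    if (it == table_.end()) return kContinue;
    return it->second.handler(closure, it->second.fval, val);
  }

 private:
  struct Entry {
    ValueHandler handler;
    int64_t fval;
  };
  std::map<uint32_t, Entry> table_;
};

// Producer 1: events come from an in-memory list of (field, value) pairs.
void FeedPairs(const Handlers &h, void *closure,
               const std::vector<std::pair<uint32_t, int64_t>> &pairs) {
  for (const auto &p : pairs) {
    if (h.DispatchValue(closure, p.first, p.second) == kBreak) return;
  }
}

// Producer 2: events come from a toy text format such as "1=3 2=4".
// Tokens without '=' are ignored; non-numeric parts would make stoul/stoll
// throw, which this sketch does not handle.
void FeedText(const Handlers &h, void *closure, const std::string &text) {
  std::istringstream in(text);
  std::string token;
  while (in >> token) {
    size_t eq = token.find('=');
    if (eq == std::string::npos) continue;
    uint32_t field = static_cast<uint32_t>(std::stoul(token.substr(0, eq)));
    int64_t val = std::stoll(token.substr(eq + 1));
    if (h.DispatchValue(closure, field, val) == kBreak) return;
  }
}

// One handler shared by several fields; the fval tells it how to weight
// the value, and the closure is the running total.
Flow AddScaled(void *closure, int64_t fval, int64_t val) {
  *static_cast<int64_t *>(closure) += fval * val;
  return kContinue;
}
```

Registering AddScaled for field 1 with fval 1 and for field 2 with fval 100 and then feeding the same events through FeedPairs or FeedText produces the same total, since the handler never learns which producer called it.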

upb::SymbolTable: Loading types dynamically at runtime.

While it's possible to use a upb::Handlers object by specifying the field numbers and types yourself, it is more convenient to load protobuf descriptors at runtime. Descriptors contain a .proto definition in binary form along with all the names and types of the messages. With upb, these can be loaded and stored in a upb::SymbolTable class, which contains a namespace of messages that were parsed from one or more descriptors.

  upb::SymbolTable *s = new upb::SymbolTable;
  upb::ParseDescriptor(s, descriptor_str, &status);
  upb::MessageDef *m = s->Lookup("MyMessageType");
  upb::FieldDef *f = m->GetFieldByName("my_field");

Protobuf descriptors are themselves protobufs, so the regular protobuf decoder is used internally by upb::ParseDescriptor. However, descriptors could be loaded from other formats (like protobuf text format) since the SymbolTable internally is just using a upb::Handlers object.
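As a rough picture of what a symbol table holds once descriptors are loaded, the following sketch keeps a namespace of message definitions, each of which resolves fields by name or by number. The class and method names (FieldDef, MessageDef, SymbolTable, Add, Lookup, FindFieldByName, FindFieldByNumber) are assumptions chosen for this example and should not be read as upb's actual definitions; the table is also filled by hand here, whereas the text above describes it being populated by parsing a descriptor.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

struct FieldDef {
  std::string name;
  uint32_t number;
};

// One message type: fields are indexed both by number (what the wire format
// carries) and by name (what application code usually knows).
class MessageDef {
 public:
  void AddField(const std::string &name, uint32_t number) {
    assert(number > 0);  // Protobuf field numbers start at 1.
    by_number_[number] = FieldDef{name, number};
    number_by_name_[name] = number;
  }
  const FieldDef *FindFieldByNumber(uint32_t number) const {
    auto it = by_number_.find(number);
    return it == by_number_.end() ? nullptr : &it->second;
  }
  const FieldDef *FindFieldByName(const std::string &name) const {
    auto it = number_by_name_.find(name);
    if (it == number_by_name_.end()) return nullptr;
    return FindFieldByNumber(it->second);
  }

 private:
  std::map<uint32_t, FieldDef> by_number_;
  std::map<std::string, uint32_t> number_by_name_;
};

// A namespace of message types keyed by name.
class SymbolTable {
 public:
  // Returns the definition for full_name, creating an empty one if needed.
  MessageDef *Add(const std::string &full_name) {
    return &messages_[full_name];
  }
  const MessageDef *Lookup(const std::string &full_name) const {
    auto it = messages_.find(full_name);
    return it == messages_.end() ? nullptr : &it->second;
  }

 private:
  std::map<std::string, MessageDef> messages_;
};
```

The two-way index is the useful part: application code registers a callback by field name, while a decoder that has just read a tag off the wire looks the same field up by number.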

JITing for speed

While the upb table-based decoder can parse at 300MB/s or more, we can do better by generating code for a specific message type. The message-specific code takes advantage of the fact that protobufs are usually encoded in field number order. This yields more predictable branches than a table-based approach, and predictable branches perform much better on modern superscalar CPUs. We can also bind the callback and fval tightly into the code as immediate arguments. Using these optimizations we can parse protobufs at >1GB/s.
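The field-order idea can be illustrated in portable C++, without any machine-code generation. The sketch below is a hand-specialized decoder for a hypothetical message Point3 with three varint fields numbered 1, 2 and 3. It first compares the incoming tag against the one it expects next and only falls back to a generic search when that guess is wrong. Point3, Stats and DecodePoint3 are invented for this example, single-byte tags are assumed (field numbers below 16), and a real JIT would emit specialized machine code rather than run a loop like this.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

struct Point3 {
  uint64_t x = 0, y = 0, z = 0;  // Fields 1, 2 and 3.
};

struct Stats {
  int predicted = 0;  // Tags that matched the expected next field.
  int fallback = 0;   // Tags that needed the generic lookup.
};

// Reads one base-128 varint; false on truncated or overlong input.
static bool ReadVarint(const std::string &buf, size_t *pos, uint64_t *out) {
  uint64_t result = 0;
  for (int shift = 0; shift < 64; shift += 7) {
    if (*pos >= buf.size()) return false;
    uint8_t byte = static_cast<uint8_t>(buf[(*pos)++]);
    result |= static_cast<uint64_t>(byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) {
      *out = result;
      return true;
    }
  }
  return false;
}

bool DecodePoint3(const std::string &buf, Point3 *out, Stats *stats) {
  assert(out != nullptr && stats != nullptr);
  // Tag byte = (field_number << 3) | wire_type, with wire type 0 (varint).
  static const uint8_t kTags[3] = {0x08, 0x10, 0x18};
  uint64_t *slots[3] = {&out->x, &out->y, &out->z};
  size_t pos = 0;
  int expected = 0;  // Index of the field we guess comes next.
  while (pos < buf.size()) {
    uint8_t tag = static_cast<uint8_t>(buf[pos++]);
    int index = -1;
    if (tag == kTags[expected]) {
      index = expected;  // Fast path: a single compare that usually succeeds.
      ++stats->predicted;
    } else {
      for (int i = 0; i < 3; ++i) {  // Slow path: generic search.
        if (tag == kTags[i]) index = i;
      }
      if (index < 0) return false;  // Unknown fields are not skipped here.
      ++stats->fallback;
    }
    if (!ReadVarint(buf, &pos, slots[index])) return false;
    expected = (index + 1) % 3;  // Guess that fields continue in order.
  }
  return true;
}
```

When the input is serialized in field number order, every tag takes the fast path and Stats::fallback stays at zero. Out-of-order input still decodes correctly but pays for the generic search.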