gRPC Python Runtime Proto Parsing

Abstract

Generated protobuf code is hard to manage for those without monorepos equipped with a hermetic build system. This has been echoed by the maintainers of libraries wrapping gRPC as well as direct users of gRPC. In this document, we aim to lay out a more Pythonic way of dealing with Protocol Buffer-backed modules.

Background

gRPC and protocol buffers evolved in a Google ecosystem supported by a monorepo and hermetic build system. When engineers check a .proto file into source control using that toolchain, the resultant code generated by the protocol compiler does not end up in source control. Instead, the code is generated on demand during builds and cached in a distributed build artifact store. While such sophisticated machinery has started to become available in the open source community (e.g. Bazel), it has not yet gained much traction. Instead, small Git repos and non-hermetic language-specific build tools are the norm. As a result, code generated by the protocol compiler is often checked into source control alongside the code which uses it.

At the least, this results in surprises when an update to a .proto file does not result in an update to the behavior of the user’s code. However, when client and server code lives in separate repos, this can result in aliasing, where one repository houses generated code from an earlier version of the protocol buffer than the other.

Open source users are aware of this gap in the ecosystem and are actively looking for ways to fill it. Many have settled on protocol buffer monorepos as a solution to the problem, wherein all .proto files for an organization are placed in a single source code repository and included by all other repositories as a submodule. But even this is not a complete solution. In addition, some mechanism must be put in place for the repositories housing the client and server code to retrieve the desired protocol buffer and generate code for the target language.

The protocol compiler paired with Google's build system also means that the average engineer never has to manually invoke protoc. Instead, when an update is made to a .proto file and some code file references it, the code for the protocol buffer is regenerated without any manual intervention on the part of the engineer. Compare that to today's workflow for gRPC users:

  1. Update the .proto file.
  2. Manually regenerate the code (remembering how to use all of the CLI flags).
  3. Make the necessary corresponding updates to code using the protocol buffer.
  4. Rerun the application.

It's easy for several of those steps to slip one's mind while developing. Moreover, figuring out how to invoke protoc in a way that meshes with your imports can be quite difficult. Python developers in particular are unused to build-time steps such as these; it is much more common to perform them at runtime.
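For reference, the manual regeneration in step 2 typically looks something like the following (the exact include paths and output directories vary by project):

python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. foo.proto

This produces foo_pb2.py and foo_pb2_grpc.py alongside foo.proto, and those generated files are then usually checked into source control.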

Related Proposals

Proposal

Today, users import protobuf code by invoking protoc to generate files called foo_pb2.py and foo_pb2_grpc.py and then including the following lines of code:

import foo_pb2
import foo_pb2_grpc

With this proposal, they will also have the option of skipping the manual protoc invocation entirely and instead writing

protos = grpc.protos('foo.proto')
services = grpc.services('foo.proto')

These two new functions return the same module objects as the import foo_pb2 and import foo_pb2_grpc statements. In order to maintain interoperability with any proto-backed modules loaded in the same process, after these functions are invoked, import foo_pb2 and import foo_pb2_grpc will be no-ops. That is, a side effect of calling grpc.protos and grpc.services is the insertion of the returned modules into the per-process module cache. This ensures that, regardless of whether the application calls grpc.protos('foo.proto') or runs import foo_pb2, and regardless of the order in which it does so, only a single version of the module will ever be loaded into the process. This avoids situations in which interoperability breaks because two modules expect the same protobuf-level message type but end up with two different Python-level Message classes, one backed by a _pb2.py file and one backed by a .proto file.

The wrapper function around the import serves several purposes here. First, it puts the user in control of naming the module (in a manner similar to JavaScript), meaning the user never has to concern themselves with the confusing _pb2 suffix. Second, the function provides a wrapping layer through which the library can provide guidance in the case of failed imports.
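As a rough sketch of how the returned modules slot into existing gRPC idioms, assume a hypothetical foo.proto that defines a FooRequest message with a name field and a FooService service with a GetFoo method (these names are illustrative only):

import grpc

protos = grpc.protos('foo.proto')
services = grpc.services('foo.proto')

# The returned modules are used exactly like their _pb2 counterparts.
channel = grpc.insecure_channel('localhost:50051')
stub = services.FooServiceStub(channel)
response = stub.GetFoo(protos.FooRequest(name='example'))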

To be precise, we propose the introduction of three new functions with the following signatures:

from typing import Optional, Text, Tuple
import types

def protos(proto_file: Text,
           runtime_generated: Optional[bool] = True) -> types.ModuleType:
    pass

def services(proto_file: Text,
             runtime_generated: Optional[bool] = True) -> types.ModuleType:
    pass

def protos_and_services(proto_file: Text,
                        runtime_generated: Optional[bool] = True) -> Tuple[types.ModuleType, types.ModuleType]:
    pass

The final function, protos_and_services, is a simple convenience function allowing the user to import protos and services in a single function call. All three of these functions will be idempotent. That is, like the Python built-in import statement, after an initial call, subsequent invocations will not result in a reload of the ".proto" file from disk.
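For example, a caller needing both modules might write the following sketch, which also illustrates the idempotence described above:

import grpc

protos, services = grpc.protos_and_services('foo.proto')

# A second call does not re-parse foo.proto; the cached modules are returned.
protos_again, services_again = grpc.protos_and_services('foo.proto')
assert protos is protos_again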

The change will be entirely backward compatible. Users manually invoking protoc today will not be required to change their build process.

Import Paths

These functions will behave like normal import statements. sys.path will be used to search for .proto files. The path under which each particular .proto file was found will be passed to the protobuf parser as the root of the tree (equivalent to the -I flag of protoc). This means that a file located at ${SYS_PATH_ENTRY}/foo/bar/baz.proto will result in the instantiation of a module with the fully qualified name foo.bar.baz_pb2. Users are expected to have a directory structure mirroring their desired import structure.
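As an illustrative sketch of that lookup rule, assuming a proto file located at /home/user/project/foo/bar/baz.proto:

import sys
import grpc

# Make /home/user/project a sys.path entry (equivalent to protoc's -I flag).
sys.path.insert(0, '/home/user/project')

# The file is found relative to that entry, so the instantiated module has the
# fully qualified name foo.bar.baz_pb2.
baz_protos = grpc.protos('foo/bar/baz.proto')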

Users have reported that getting Protobuf paths to match Python import paths is quite tricky today using protoc. In the case of a failure to import, the library will print troubleshooting information alongside the error.

Best Practices

In general, our recommendation will be for users to align their Python import path, Python fully qualified module name, Protobuf import path, and Protobuf package name. This will ensure that only one entry will ever be required on the PYTHONPATH. As an example, suppose you had the following file at "src/protos/foo/bar.proto":

syntax = "proto3";

package foo;

message BarMessage {
  ...
}

And the following at "src/protos/foo/baz.proto":

syntax = "proto3";

package foo;

import "foo/bar.proto";

...

Then, after ensuring that the src/protos/ directory is on sys.path, either by running from that directory or by specifying it with the PYTHONPATH environment variable, you will be able to import as follows:

bar_protos = grpc.protos('foo/bar.proto')
baz_protos = grpc.protos('foo/baz.proto')

The critical bit here is that all import statements within ".proto" files must be resolvable along a path on sys.path. Suppose instead that baz.proto had imported bar.proto with import "src/protos/foo/bar.proto". Then, in order to resolve the import, at least two paths would have to be on sys.path: the repo root and src/protos/. For simplicity's sake, the root used in calls to grpc.protos and the root used in protobuf import statements should be unified.
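Putting the pieces together, a minimal sketch of the recommended setup for the layout above (assuming the process is not already started from src/protos/):

import sys

# src/protos is the single root shared by grpc.protos() calls and by the
# import statements inside the .proto files themselves.
sys.path.insert(0, 'src/protos')

import grpc

bar_protos = grpc.protos('foo/bar.proto')
baz_protos = grpc.protos('foo/baz.proto')

# Messages defined in bar.proto are accessed as attributes of the returned
# module, e.g. bar_protos.BarMessage().
message = bar_protos.BarMessage()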

In order to avoid naming clashes between protos you've authored yourself and any other protos pulled in by third-party dependencies, the root directory of your proto source tree should have a universally unique name. Ideally, this uniqueness should be guaranteed by your organization having ownership of the package by that name on PyPI. See PEP 423 for a deeper discussion of how package name uniqueness is handled in the Python ecosystem.

Arbitrary Protos

It should be stated that, in practice, it is possible to use these new functions to load totally arbitrary protos. Suppose you wrote a server that took .proto files as inputs from clients, instantiated modules from them, and returned some data about each file, for example the number of message types it contains. This could become problematic as new syntax features are added to the Protobuf specification; in the worst case, it would require a redeploy of the server with a sufficiently up-to-date version of grpcio-tools. Regardless, we claim no support for this use case. The intent of these functions is to enable the import of fixed .proto files known at build time.

Dependency Considerations

gRPC makes a point of not incurring a direct dependency on protocol buffers, and it is not the intent of this feature to change that. Instead, the implementations of these new functions will live in the grpcio-tools package, which necessarily already has a hard dependency on protobuf. If the grpcio package finds that grpc_tools is importable, it will import it and use the implementations found there to back the protos and services functions. Otherwise, it will raise a NotImplementedError.

In order to take advantage of newer language features, support will only be added for Python versions 3.6+. All other versions will result in a NotImplementedError.
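As a consequence, applications that need to run in environments where grpcio-tools is not installed (or on older Python versions) can guard the call and fall back to code generated ahead of time; a sketch, assuming a pre-generated foo_pb2 module is also available:

import grpc

try:
    # Requires the grpcio-tools package and Python 3.6+.
    protos = grpc.protos('foo.proto')
except NotImplementedError:
    # Fall back to code generated ahead of time with protoc.
    import foo_pb2 as protos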

Implementation

This proposal has been implemented here. This implementation uses the C++ protobuf runtime already built into the grpcio-tools C extension to parse the protocol buffers and generate textual Python code in memory. This code is then used to instantiate the modules to be provided to the calling application.

Alternatives Considered

Consideration was given to implementing the functionality of what is here presented as grpc.protos in the protobuf Python package. After a thorough investigation, we found that this feature could not be implemented in a way that satisfied both the compatibility requirements of the gRPC library and the Protobuf library. However, we are able to provide all desired functionality with an implementation entirely in the grpcio package. If at some point in the future this is no longer the case and an implementation of the required functionality in the protobuf repo is feasible, our grpc.protos function can simply proxy to it.