# Why you should use the new proto3 optional keyword
> "In this post I will attempt to explain the importance of optional in proto3, and what the consequences are of its absence."

- toc: true
- branch: master
- badges: true
- comments: true
- categories: [fastpages, jupyter]
- hide: false
- search_exclude: true

In [1]:
# hide
# Install protoc 3.15.8 (the latest release at the time of writing) so that this demo works
!curl -L -o /tmp/protoc-3.15.8-linux-x86_64.zip https://github.com/protocolbuffers/protobuf/releases/download/v3.15.8/protoc-3.15.8-linux-x86_64.zip
!mkdir -p $HOME/.local/proto
!unzip -o /tmp/protoc-3.15.8-linux-x86_64.zip -d $HOME/.local/proto
# ensuring the directories exist
!mkdir -p code_snippets_2021_05_29

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   636  100   636    0     0   7310      0 --:--:-- --:--:-- --:--:--  7310
100 1600k  100 1600k    0     0  1679k      0 --:--:-- --:--:-- --:--:-- 24.0M
Archive:  /tmp/protoc-3.15.8-linux-x86_64.zip
  inflating: /home/julien/.local/proto/include/google/protobuf/wrappers.proto  
  inflating: /home/julien/.local/proto/include/google/protobuf/field_mask.proto  
  inflating: /home/julien/.local/proto/include/google/protobuf/api.proto  
  inflating: /home/julien/.local/proto/include/google/protobuf/struct.proto  
  inflating: /home/julien/.local/proto/include/google/protobuf/descriptor.proto  
  inflating: /home/julien/.local/proto/include/google/protobuf/timestamp.proto  
  inflating: /home/julien/.local/proto/include/google/protobuf/compiler/plugin.proto  
  inflating: /home/julien/.local/proto/include/google/protobuf/empty.prot

## A convincing argument, not by definition, but by observation

Protocol Buffers (protobufs) are a popular binary message format, often used as an alternative to JSON.
Although less human-readable, the encoding is much more compact.
Message passing between services is critical for
isolated machine learning services (for example, a classifier
running on a GPU), which makes proto3 a strong
candidate for these applications.

There are two versions of the language: proto2 and proto3. Proto3 is the successor
to proto2. However, up until release 3.12,
it was missing a very important feature: `optional` values. See [this issue](https://github.com/protocolbuffers/protobuf/issues/1606)
for a long discussion about this.

Since protobuf [3.12](https://github.com/protocolbuffers/protobuf/releases/tag/v3.12.0)
(May 2020), proto3 has supported an `optional` keyword for primitive fields,
first as an experimental feature (behind the `--experimental_allow_proto3_optional` flag)
and then, from [3.15](https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0)
(Feb 2021) onwards, as an officially supported one.

Why has this change been made?
In this post, I will attempt to convince you that the optional field is necessary,
not by definition, but by observation.
It is not the intention of this post to go into any details specific to
the protos themselves, which are well described in the references mentioned
above.


All the code shown here can be run by clicking the Google Colab link on this page.

Let's just begin by creating a proto.

In [2]:
# Here we'll write to this file in python, since we are in a jupyter notebook.
PROTO_DIR = 'code_snippets_2021_05_29'
filename = PROTO_DIR + '/user_1.proto'

my_proto = """
syntax = "proto3";
package user_1;


message User {
    string name  = 1;
    uint32 age = 2;
}
"""

with open(filename, 'w') as file:
  file.write(my_proto)

Now let's compile it to Python:

In [3]:
!$HOME/.local/proto/bin/protoc code_snippets_2021_05_29/*.proto --python_out=.

In [4]:
from code_snippets_2021_05_29 import user_1_pb2

Now let's create a user with no fields

In [5]:
user = user_1_pb2.User()

In [6]:
from google.protobuf.json_format import MessageToDict

Now let's print the contents of this user.
One good way is just to convert it to a 
python dictionary.

In [7]:
MessageToDict(user)

{}

It is blank as expected. Now let's define a user with a name and age:

In [8]:
user = user_1_pb2.User(name='John', age=10)
MessageToDict(user)

{'name': 'John', 'age': 10}

John has age 10. That makes sense.
Now let's create a newborn, Sally, of age 0:

In [9]:
user = user_1_pb2.User(name='Sally', age=0)
MessageToDict(user)

{'name': 'Sally'}

The age is missing! That can be a problem.
Let's use the `including_default_value_fields` flag to force the age to be filled in.

In [10]:
user = user_1_pb2.User(name='Sally', age=0)
MessageToDict(user, including_default_value_fields=True)

{'name': 'Sally', 'age': 0}

Great. We have the age again.
Now let's add an additional requirement. Sometimes
we may want to register someone without actually knowing their age.
In this case, let's treat the age as optional by simply not setting it.
Let's create a new user, Randall:

In [11]:
user = user_1_pb2.User(name='Randall')
MessageToDict(user, including_default_value_fields=True)

{'name': 'Randall', 'age': 0}

## A clear problem of missing information

We have a problem. We cannot create a user in a way that encodes
the fact that the age is missing!

Without having read anything about protobufs, simply by probing this system,
we can see there is an information problem here. The value 0 cannot
mean both "missing" and an actual age of zero!
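
To make the collision concrete, here is a toy model in plain Python. It does not use the real protobuf library: `encode_implicit` is a hypothetical helper, written only for this illustration, that mimics the rule observed above (a field equal to its default value is left out).

```python
# Toy model of the behaviour observed above; this is NOT the protobuf library.
# `encode_implicit` is a hypothetical helper invented for this sketch.

def encode_implicit(name="", age=0):
    """Keep only the fields that differ from their default value."""
    fields = {}
    if name != "":
        fields["name"] = name
    if age != 0:
        fields["age"] = age
    return fields


newborn = encode_implicit(name="Sally", age=0)  # the age is known: it is 0
unknown = encode_implicit(name="Sally")         # the age was never provided

print(newborn)             # {'name': 'Sally'}
print(unknown)             # {'name': 'Sally'}
print(newborn == unknown)  # True: a receiver cannot tell the two apart
```

Two different intentions collapse into the same encoded result, so the distinction is lost before the message ever reaches the receiver.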


How could the developers have missed this? It is unclear, but feel free to read [this issue](https://github.com/protocolbuffers/protobuf/issues/1606) for the full discussion.

It turns out that, since protobuf 3.12, there is an easy fix:

In [12]:
# Here we'll write to this file in python, since we are in a jupyter notebook.
filename = 'code_snippets_2021_05_29/user_2_optional.proto'

my_proto = """
syntax = "proto3";
package user_2;


message User {
    string name  = 1;
    optional uint32 age = 2;
}
"""
with open(filename, 'w') as file:
  file.write(my_proto)


# compile the proto
!$HOME/.local/proto/bin/protoc code_snippets_2021_05_29/*.proto --python_out=.

Now let's create the user again

In [13]:
from code_snippets_2021_05_29 import user_2_optional_pb2

user_2 = user_2_optional_pb2.User(name='Randall')
MessageToDict(user_2, including_default_value_fields=True)

{'name': 'Randall'}

Perfect! This is exactly what we wanted. All the other cases work as expected as well.

In [14]:
print('John with an age')
print(MessageToDict(user_2_optional_pb2.User(name='John', age=21)))

print('Sally the newborn')
print(MessageToDict(user_2_optional_pb2.User(name='Sally', age=0)))

print('Randall the unknown')
print(MessageToDict(user_2_optional_pb2.User(name='Randall'), including_default_value_fields=True))

John with an age
{'name': 'John', 'age': 21}
Sally the newborn
{'name': 'Sally', 'age': 0}
Randall the unknown
{'name': 'Randall'}


But how is this happening?
Without knowing anything about protobufs, you 
can also look at their raw outputs as bytes:

In [15]:
print('John with an age')
print(user_2_optional_pb2.User(name='John', age=21).SerializeToString())
print()

print('Sally the newborn before optional (notice the information of 0 is missing)')
print(user_1_pb2.User(name='Sally', age=0).SerializeToString())
print()

print('Sally the newborn after using optional')
print(user_2_optional_pb2.User(name='Sally', age=0).SerializeToString())
print()

print('Randall the unknown')
print(user_2_optional_pb2.User(name='Randall').SerializeToString())

John with an age
b'\n\x04John\x10\x15'

Sally the newborn before optional (notice the information of 0 is missing)
b'\n\x05Sally'

Sally the newborn after using optional
b'\n\x05Sally\x10\x00'

Randall the unknown
b'\n\x07Randall'


By simply observing the outputs, one can see that the `optional`
keyword has encoded additional information. For Sally,
whose age was not declared with the `optional` keyword, the age
is missing from the byte string, whereas for Sally
whose age was declared with the `optional` keyword, it is present.
One sees this because the first byte string is a prefix of the second, which ends with two extra bytes, `\x10\x00`.
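
For readers who want to look inside those byte strings, below is a minimal sketch of a wire-format reader in plain Python. It is not the protobuf library: `read_varint` and `decode_fields` are hypothetical helpers written for this post, and they only handle the two wire types that the `User` message uses (varints and length-delimited strings).

```python
# Minimal protobuf wire-format reader, for illustration only.
# Handles just wire type 0 (varint) and wire type 2 (length-delimited).
# `read_varint` and `decode_fields` are hypothetical helpers for this sketch.

def read_varint(data, pos):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:  # high bit clear: this was the last byte
            return value, pos
        shift += 7


def decode_fields(data):
    """Return a list of (field_number, value) pairs found in data."""
    fields, pos = [], 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:    # varint, e.g. uint32
            value, pos = read_varint(data, pos)
        elif wire_type == 2:  # length-delimited, e.g. string
            length, pos = read_varint(data, pos)
            value = data[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} is not handled in this sketch")
        fields.append((field_number, value))
    return fields


print(decode_fields(b'\n\x05Sally'))          # [(1, 'Sally')]
print(decode_fields(b'\n\x05Sally\x10\x00'))  # [(1, 'Sally'), (2, 0)]
print(decode_fields(b'\n\x04John\x10\x15'))   # [(1, 'John'), (2, 21)]
```

The extra bytes `\x10\x00` are field number 2 (`age`) carrying the value 0, which is exactly the information that was dropped without `optional`.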

Of course, one can refer to the documentation to understand this.
I am intentionally avoiding an explanation in order to convince
you by observation. This is because, in the real world, we often
have to make quick, intuitive, early decisions based on logical observations.

## Under the hood
Under the hood, this is converting the primitive type to a [oneof field](https://github.com/protocolbuffers/protobuf/blob/master/docs/implementing_proto3_presence.md#background):

> To minimize this risk we chose a descriptor representation that is semantically compatible with existing proto3 reflection. Every proto3 optional field is placed into a one-field oneof.

However, I believe all that matters to us is the general outcome: a default value that has been explicitly set is now serialized.

From the Python API standpoint, this adds a presence check,
`message.HasField(field_name)`, for these primitive types.
Note that previously this was only allowed for `Message`
types and not primitive types.

Why does this matter? Let's look at the following example, which reuses
the two `User` protos defined above, each with a `name` and an `age`.

In [16]:
sally_1 = user_1_pb2.User(name='Sally', age=0)
sally_2 = user_2_optional_pb2.User(name='Sally', age=0)
randall_2 = user_2_optional_pb2.User(name='Randall')


# sally_1.HasField('age') raises because age is a primitive declared without `optional`:
print('Does Sally 1 have an age?')
try:
    sally_1.HasField('age')
except Exception as error:
    print(f'Exception: {error}')

# However, sally_2 does:
has_age = sally_2.HasField('age')
print(f'sally_2 has age defined: {has_age}')


# and Randall has no age defined
has_age = randall_2.HasField('age')
print(f'randall_2 has age defined: {has_age}')

Does Sally 1 have an age?
Exception: Can't test non-optional, non-submessage field "User.age" for presence in proto3.
sally_2 has age defined: True
randall_2 has age defined: False


## Alternate Solutions
Before this change, there were several alternative solutions.
However, none of them were very intuitive. Previously you could have:
- used a `oneof` field type. In this case, the user would have to check
  which field was set in order to detect a null value
  (note that this is actually what happens under the hood
  when using `optional`, but with much more elegant syntax)
- used one of the [protobuf wrapper types](https://github.com/protocolbuffers/protobuf/blob/master/src/google/protobuf/wrappers.proto). These wrap primitive types in a message. Since a message does encode presence, this solves the problem. However, looking up a value involves first checking presence and then accessing the `value` attribute of the message (since the primitive is actually defined there)
- used two fields. The first would be a bool that is true if the value is present; the second would be the actual value.
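
As a rough illustration of these three workarounds, the cell below holds a proto definition as a Python string, in the same style as the earlier cells. The package, message and field names (`UserWithOneof`, `UserWithWrapper`, `UserWithFlag`, `age_oneof`, `age_is_set`) are invented for this sketch, and the definition is only printed here, not compiled.

```python
# Hypothetical proto sketching the three pre-`optional` workarounds.
# All message and field names are invented for illustration; nothing is compiled.
alternatives_proto = """
syntax = "proto3";
package user_alternatives;

import "google/protobuf/wrappers.proto";

// 1. A one-field oneof: check WhichOneof("age_oneof") before reading age.
message UserWithOneof {
    string name = 1;
    oneof age_oneof {
        uint32 age = 2;
    }
}

// 2. A wrapper type: check HasField("age"), then read age.value.
message UserWithWrapper {
    string name = 1;
    google.protobuf.UInt32Value age = 2;
}

// 3. Two fields: read age only when age_is_set is true.
message UserWithFlag {
    string name = 1;
    bool age_is_set = 2;
    uint32 age = 3;
}
"""

print(alternatives_proto)
```

Each variant pushes the presence check onto the reader of the message in a slightly different way, which is what the single `optional` keyword avoids.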


You can find a summary of such solutions
in this Stack Overflow post
[here](https://stackoverflow.com/questions/42622015/how-to-define-an-optional-field-in-protobuf-3)
(note that the accepted answer for this post has
since been changed to use `optional`).

## Conclusion
That's it! Very simple. This ends the post, but feel free to read the issue and
references linked above for some history.