In [1]:
import numpy as np
import tensorflow as tf
import IPython.display as display

To read data efficiently, it can be helpful to serialize the data and store it in a set of files (100-200MB each) that can each be read linearly.
This is especially true if the data is being streamed over a network. Serialized files are also useful for caching the results of data preprocessing.

The TFRecord format is a simple format for storing a sequence of binary records.

Protocol buffers are a cross-platform, cross-language library for efficient serialization of structured data.

Protocol messages are defined by .proto files, which are often the easiest way to understand a message type.

The tf.Example message (or protobuf) is a flexible message type that represents a {"string": value} mapping. It is designed for use with TensorFlow and is used throughout the higher-level APIs such as TFX.

This notebook demonstrates how to create, parse, and use the tf.Example message, and then how to serialize, write, and read tf.Example messages to and from .tfrecord files.

In order to convert a standard TensorFlow type to a tf.Example-compatible tf.train.Feature, we can use the shortcut functions below. 

Note that each function takes a scalar input value and returns a tf.train.Feature containing one of the three list types: tf.train.BytesList, tf.train.FloatList, or tf.train.Int64List.

In [2]:
def _bytes_feature(value):
  """Returns a bytes_list from a string / byte."""
  if isinstance(value, type(tf.constant(0))):
    value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
  return tf.train.Feature(bytes_list = tf.train.BytesList(value = [value]))

def _float_feature(value):
  """Returns a float_list from a float / double."""
  return tf.train.Feature(float_list = tf.train.FloatList(value = [value]))

def _int64_feature(value):
  """Returns an int64_list from a bool / enum / int / uint."""
  return tf.train.Feature(int64_list = tf.train.Int64List(value = [value]))

To stay simple, this example only uses scalar inputs. The simplest way to handle non-scalar features is to use tf.io.serialize_tensor to convert tensors to binary strings. Strings are scalars in TensorFlow. Use tf.io.parse_tensor to convert the binary string back to a tensor.
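
The round trip for a non-scalar feature might look like the sketch below. The 2x3 matrix and the variable names are illustrative assumptions, not part of the dataset used in this notebook, and the tf.train.Feature is built directly so the snippet runs on its own.

```python
import numpy as np
import tensorflow as tf

# Illustrative non-scalar feature: a 2x3 float32 matrix (assumed data).
matrix = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], dtype=tf.float32)

# Serialize the whole tensor into a single scalar string tensor.
serialized = tf.io.serialize_tensor(matrix)

# A scalar byte string fits into a BytesList like any other bytes value.
matrix_feature = tf.train.Feature(
    bytes_list=tf.train.BytesList(value=[serialized.numpy()]))

# To recover the tensor, pass the stored bytes and the original dtype.
stored_bytes = matrix_feature.bytes_list.value[0]
restored = tf.io.parse_tensor(stored_bytes, out_type=tf.float32)

print(restored.shape)
```

The dtype has to be supplied when parsing, so it is worth recording alongside the feature name when the data is written.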

In [3]:
print(_bytes_feature(b'test_string'))
print(_bytes_feature(u'test_bytes'.encode('utf-8')))

print(_float_feature(np.exp(1)))

print(_int64_feature(True))
print(_int64_feature(1))

bytes_list {
  value: "test_string"
}

bytes_list {
  value: "test_bytes"
}

float_list {
  value: 2.7182817459106445
}

int64_list {
  value: 1
}

int64_list {
  value: 1
}



All proto messages can be serialized to a binary string using the .SerializeToString method.

In [4]:
feature = _float_feature(np.exp(1))

feature.SerializeToString()

b'\x12\x06\n\x04T\xf8-@'
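
The serialized bytes can be turned back into a message with the FromString class method, the same call used on tf.train.Example later in this notebook. The sketch below rebuilds the float feature directly so it runs on its own; the variable names are illustrative.

```python
import numpy as np
import tensorflow as tf

# Build the same kind of float feature as above and serialize it.
feature = tf.train.Feature(float_list=tf.train.FloatList(value=[np.exp(1)]))
serialized_feature = feature.SerializeToString()

# Parse the bytes back into a tf.train.Feature message.
decoded_feature = tf.train.Feature.FromString(serialized_feature)

# The three list types form a protobuf oneof named "kind";
# WhichOneof reports which of them is populated.
kind = decoded_feature.WhichOneof('kind')
print(kind)
print(decoded_feature.float_list.value[0])
```

The value is stored as a 32-bit float, so it matches np.exp(1) only to about seven significant digits.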

### Creating a tf.Example message

Suppose you want to create a tf.Example message from existing data. In practice, the dataset may come from anywhere, but the procedure of creating the tf.Example message from a single observation will be the same:

1. Within each observation, convert each value to a tf.train.Feature containing one of the three compatible types, using one of the functions above.

2. Create a map (dictionary) from the feature name string to the encoded feature value.

3. Convert the map produced in step 2 to a Features message.

In [5]:
# The number of observations in the dataset.
n_observations = int(1e4)

# Boolean feature, encoded as False or True.
feature0 = np.random.choice([False, True], n_observations)

# Integer feature, random from 0 to 4.
feature1 = np.random.randint(0, 5, n_observations)

# String feature
strings = np.array([b'cat', b'dog', b'chicken', b'horse', b'goat'])
feature2 = strings[feature1]

# Float feature, from a standard normal distribution
feature3 = np.random.randn(n_observations)

Each of these features can be coerced into a tf.Example-compatible type using one of _bytes_feature, _float_feature, or _int64_feature.

You can then create a tf.Example message from these encoded features.

In [6]:
def serialize_example(feature0, feature1, feature2, feature3):
  """
  Creates a tf.Example message ready to be written to a file.
  """
  # Create a dictionary mapping the feature name to the tf.Example-compatible
  # data type.
  feature = {
      'feature0': _int64_feature(feature0),
      'feature1': _int64_feature(feature1),
      'feature2': _bytes_feature(feature2),
      'feature3': _float_feature(feature3),
  }

  # Create a Features message using tf.train.Example.

  example_proto = tf.train.Example(features = tf.train.Features(feature = feature))
  return example_proto.SerializeToString()

In [7]:
# Encode

serialized_example = serialize_example(False, 4, b'goat', 0.9876)
serialized_example

b'\nR\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04[\xd3|?\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00'

In [8]:
# Decode

example_proto = tf.train.Example.FromString(serialized_example)
example_proto

features {
  feature {
    key: "feature0"
    value {
      int64_list {
        value: 0
      }
    }
  }
  feature {
    key: "feature1"
    value {
      int64_list {
        value: 4
      }
    }
  }
  feature {
    key: "feature2"
    value {
      bytes_list {
        value: "goat"
      }
    }
  }
  feature {
    key: "feature3"
    value {
      float_list {
        value: 0.9876000285148621
      }
    }
  }
}
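
Once decoded, individual values can be read back either from the proto itself or as tensors. The sketch below rebuilds an equivalent message so it runs on its own; `feature_description` is an illustrative name, and its dtypes and empty shapes are assumptions that match the scalar features used here.

```python
import tensorflow as tf

# Rebuild a message equivalent to the one decoded above.
example_proto = tf.train.Example(features=tf.train.Features(feature={
    'feature0': tf.train.Feature(int64_list=tf.train.Int64List(value=[0])),
    'feature1': tf.train.Feature(int64_list=tf.train.Int64List(value=[4])),
    'feature2': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'goat'])),
    'feature3': tf.train.Feature(float_list=tf.train.FloatList(value=[0.9876])),
}))
serialized_example = example_proto.SerializeToString()

# Option 1: read values from the decoded proto. Each feature holds a list,
# so a scalar is the first (and only) element of the matching list type.
decoded = tf.train.Example.FromString(serialized_example)
feature_map = decoded.features.feature
feature1_value = feature_map['feature1'].int64_list.value[0]
feature2_value = feature_map['feature2'].bytes_list.value[0]
feature3_value = feature_map['feature3'].float_list.value[0]

# Option 2: parse the serialized bytes into tensors. The description
# tells the parser the dtype and shape to expect for each feature name.
feature_description = {
    'feature0': tf.io.FixedLenFeature([], tf.int64),
    'feature1': tf.io.FixedLenFeature([], tf.int64),
    'feature2': tf.io.FixedLenFeature([], tf.string),
    'feature3': tf.io.FixedLenFeature([], tf.float32),
}
parsed = tf.io.parse_single_example(serialized_example, feature_description)

print(feature1_value, feature2_value, feature3_value)
print(parsed['feature2'].numpy())
```

Reading from the proto is convenient for inspection in Python, while the tensor route is the one that fits inside a TensorFlow input pipeline.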