In [1]:
import numpy as np
import tensorflow as tf
import IPython.display as display

To read data efficiently, it can be helpful to serialize the data and store it in a set of files (100-200 MB each) that can each be read linearly.
This is especially true if the data is being streamed over a network. It can also be useful for caching the results of data preprocessing.

The TFRecord format is a simple format for storing a sequence of binary records.

Protocol buffers are a cross-platform, cross-language library for efficient serialization of structured data.

Protocol messages are defined by .proto files, and reading these is often the easiest way to understand a message type.

The tf.Example message (or protobuf) is a flexible message type that represents a {"string": value} mapping. It is designed for use with TensorFlow and is used throughout the higher-level APIs such as TFX.

This notebook will demonstrate how to create, parse, and use the tf.Example message, and then serialize, write, and read tf.Example messages to and from .tfrecord files.
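
As a quick preview of the {"string": value} mapping described above, here is a minimal sketch that builds a tf.Example directly from tf.train.Feature messages, serializes it, and parses it back. The feature names "label" and "name" and their values are hypothetical, chosen only for illustration.

```python
import tensorflow as tf

# Hypothetical feature names and values, used only to illustrate the mapping.
example = tf.train.Example(
    features=tf.train.Features(
        feature={
            "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
            "name": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"cat"])),
        }
    )
)

# Every protocol buffer message can be serialized to a binary string ...
serialized = example.SerializeToString()

# ... and decoded back into a message of the same type.
restored = tf.train.Example.FromString(serialized)

print(restored.features.feature["label"].int64_list.value[0])
print(restored.features.feature["name"].bytes_list.value[0])
```

Each key in the feature map is a plain string, and each value is a tf.train.Feature holding exactly one of the three list types.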

To convert a standard TensorFlow type to a tf.Example-compatible tf.train.Feature, we can use the shortcut functions below.

Note that each function takes a scalar input value and returns a tf.train.Feature containing one of the three list types: tf.train.BytesList, tf.train.FloatList, or tf.train.Int64List.

In [2]:
def _bytes_feature(value):
  """Returns a tf.train.Feature wrapping a bytes_list built from a string / byte."""
  if isinstance(value, type(tf.constant(0))):
    value = value.numpy()  # BytesList won't unpack a string from an EagerTensor.
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
  """Returns a tf.train.Feature wrapping a float_list built from a float / double."""
  return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
  """Returns a tf.train.Feature wrapping an int64_list built from a bool / enum / int / uint."""
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

To stay simple, this example only uses scalar inputs. The simplest way to handle non-scalar features is to use tf.io.serialize_tensor to convert tensors to binary strings. Strings are scalars in TensorFlow. Use tf.io.parse_tensor to convert the binary string back to a tensor.
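
The non-scalar round trip described above can be sketched as follows. The 2x2 matrix is a made-up input for illustration. Note that the dtype has to be supplied again when parsing, because tf.io.parse_tensor takes it as the out_type argument.

```python
import numpy as np
import tensorflow as tf

# Hypothetical non-scalar input: a 2x2 float32 matrix.
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]], dtype=tf.float32)

# Serialize the whole tensor into a single scalar string tensor.
serialized = tf.io.serialize_tensor(matrix)

# A scalar byte string fits in a BytesList like any other bytes value.
feature = tf.train.Feature(
    bytes_list=tf.train.BytesList(value=[serialized.numpy()])
)

# Recover the original tensor; out_type must match the dtype that was serialized.
restored = tf.io.parse_tensor(feature.bytes_list.value[0], out_type=tf.float32)

print(restored.shape)
```

The shape is recovered from the serialized bytes, so only the dtype needs to be tracked separately.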

In [3]:
print(_bytes_feature(b'test_string'))
print(_bytes_feature(u'test_bytes'.encode('utf-8')))

print(_float_feature(np.exp(1)))

print(_int64_feature(True))
print(_int64_feature(1))

bytes_list {
  value: "test_string"
}

bytes_list {
  value: "test_bytes"
}

float_list {
  value: 2.7182817459106445
}

int64_list {
  value: 1
}

int64_list {
  value: 1
}
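
The float output above differs slightly from np.exp(1) because FloatList stores 32-bit floats, so the 64-bit input is rounded when it is stored. The sketch below illustrates this, and also shows that a single tf.train.Feature, like any protocol buffer message, can be serialized to a binary string with SerializeToString and decoded again with FromString.

```python
import numpy as np
import tensorflow as tf

# Same value as in the cell above: e as a 64-bit float.
feature = tf.train.Feature(float_list=tf.train.FloatList(value=[np.exp(1)]))

# Serialize the single feature to bytes and decode it again.
serialized = feature.SerializeToString()
restored = tf.train.Feature.FromString(serialized)

stored = restored.float_list.value[0]

# The stored value equals the float32 rounding of e, not the float64 original.
print(stored == float(np.float32(np.exp(1))))
print(stored == np.exp(1))
```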

