The `TFRecord` format is a simple format for storing a sequence of binary records.

The `tf.Example` message (or protobuf) is a flexible message type that represents a `{"string": value}` mapping. It is designed for use with TensorFlow and is used throughout the higher-level APIs such as TFX.

## Setup

In [1]:
import tensorflow as tf

import numpy as np
import IPython.display as display

## `tf.Example`

### Data types for `tf.Example`

Fundamentally, a `tf.Example` is a `{"string": tf.train.Feature}` mapping.

The `tf.train.Feature` message type can accept one of the following three types. Most other generic types can be coerced into one of these:
- `tf.train.BytesList`
    - string
    - byte
- `tf.train.FloatList`
    - float (float32)
    - double (float64)
- `tf.train.Int64List`
    - bool
    - enum
    - int32
    - uint32
    - int64
    - uint64
    
In order to convert a standard TensorFlow type to a `tf.Example`-compatible `tf.train.Feature`, we can use the shortcut functions below. Note that each function takes a scalar input value and returns a `tf.train.Feature` containing one of the three `list` types above:

In [2]:
# The following functions can be used to convert a value to a type compatible
# with tf.Example.

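def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    if isinstance(value, type(tf.constant(0))):
        # BytesList won't unpack a string from an EagerTensor.
        value = value.numpy()
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))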

print(_bytes_feature(b'test_string'))
print(_bytes_feature('test_bytes'.encode('utf-8')))

bytes_list {
  value: "test_string"
}

bytes_list {
  value: "test_bytes"
}



In [3]:
def _float_feature(value):
    """Returns a float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

print(_float_feature(np.exp(1)))

float_list {
  value: 2.7182817459106445
}



In [4]:
def _int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

print(_int64_feature(True))
print(_int64_feature(1))
#print(_int64_feature(1.0)) # Error

int64_list {
  value: 1
}

int64_list {
  value: 1
}
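The helper functions above each take a scalar. For a non-scalar value, one possible approach is to serialize the whole tensor to a byte-string with `tf.io.serialize_tensor` and store it in a `BytesList`; `tf.io.parse_tensor` converts it back. The sketch below is only an illustration, and the variable names are our own:

```python
import tensorflow as tf

matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])

# Serialize the whole tensor (dtype and shape included) to one byte-string.
serialized = tf.io.serialize_tensor(matrix)

# Store the byte-string as a single-element BytesList.
feature = tf.train.Feature(
    bytes_list=tf.train.BytesList(value=[serialized.numpy()]))

# Recover the tensor; the caller has to supply the dtype.
restored = tf.io.parse_tensor(feature.bytes_list.value[0], out_type=tf.float32)
print(restored.shape)
```

The shape travels inside the serialized bytes, but the dtype passed to `tf.io.parse_tensor` has to match the dtype that was written.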



All proto messages can be **serialized to a binary-string** using the `.SerializeToString` method:

In [5]:
feature = _float_feature(np.exp(1))
feature.SerializeToString()

b'\x12\x06\n\x04T\xf8-@'
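To see what those eight bytes contain, the following sketch decodes them by hand using only the standard library. It assumes the field numbers from TensorFlow's `feature.proto` (`float_list` is field 2 of `Feature`, `value` is field 1 of `FloatList` and is stored packed) and it only handles single-byte tags and lengths, so it is an illustration rather than a general protobuf parser. The name `decode_float_feature` is hypothetical.

```python
import struct


def decode_float_feature(buf: bytes) -> list:
    """Decode a serialized Feature that holds a packed FloatList.

    Assumes every tag and length fits in one byte (true for short messages).
    """
    # Outer message: tag 0x12 is field 2 (float_list), wire type 2 (length-delimited).
    tag, length = buf[0], buf[1]
    assert tag == (2 << 3) | 2, "expected Feature.float_list"
    float_list = buf[2:2 + length]

    # Inner message: tag 0x0a is field 1 (value), wire type 2 (packed floats).
    tag, length = float_list[0], float_list[1]
    assert tag == (1 << 3) | 2, "expected packed FloatList.value"
    payload = float_list[2:2 + length]

    # Packed floats are consecutive little-endian float32 values.
    count = len(payload) // 4
    return list(struct.unpack(f"<{count}f", payload))


values = decode_float_feature(b'\x12\x06\n\x04T\xf8-@')
print(values)
```

The single decoded value is the float32 approximation of `np.exp(1)`, which is why the printed feature shows `2.7182817459106445` rather than the full float64 value.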

### Creating a `tf.Example` message

In [6]:
# The number of observations in the dataset.
n_observations = int(1e4)
n_observations

10000

In [7]:
# Boolean feature, encoded as False or True
feature0 = np.random.choice([False, True], n_observations)
feature0

array([False,  True,  True, ..., False,  True,  True])

In [8]:
# Integer feature, random from 0 to 4.
feature1 = np.random.randint(0, 5, n_observations)
feature1

array([4, 3, 4, ..., 1, 2, 3])

In [9]:
# String feature
strings = np.array([b'cat', b'dog', b'chicken', b'horse', b'goat'])
feature2 = strings[feature1]
feature2

array([b'goat', b'horse', b'goat', ..., b'dog', b'chicken', b'horse'],
      dtype='|S7')

In [10]:
# Float feature, from a standard normal distribution
feature3 = np.random.randn(n_observations)
feature3

array([ 0.0210511 ,  0.82749023, -0.69820937, ..., -1.93800073,
        1.22599431,  1.55077696])

Each of these features can be coerced into a `tf.Example`-compatible type using one of `_bytes_feature`, `_float_feature`, `_int64_feature`. We can then create a `tf.Example` message from these encoded features:

In [11]:
def serialize_example(feature0, feature1, feature2, feature3):
    feature = {
        "feature0": _int64_feature(feature0),
        "feature1": _int64_feature(feature1),
        "feature2": _bytes_feature(feature2),
        "feature3": _float_feature(feature3)
    }
    
    example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
    
    return example_proto.SerializeToString()

In [12]:
# This is an example observation from the dataset.
serialized_example = serialize_example(False, 4, b'goat', 0.9876)
serialized_example

b'\nR\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04[\xd3|?'

In [13]:
# To decode the message use the tf.train.Example.FromString method
example_proto = tf.train.Example.FromString(serialized_example)
example_proto

features {
  feature {
    key: "feature0"
    value {
      int64_list {
        value: 0
      }
    }
  }
  feature {
    key: "feature1"
    value {
      int64_list {
        value: 4
      }
    }
  }
  feature {
    key: "feature2"
    value {
      bytes_list {
        value: "goat"
      }
    }
  }
  feature {
    key: "feature3"
    value {
      float_list {
        value: 0.9876000285148621
      }
    }
  }
}
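Once decoded, the individual values can be read back from the message with ordinary protobuf attribute access. The sketch below builds a small two-feature message of its own (so it runs on its own) and reads the fields; it is an illustration, not the only way to inspect a message.

```python
import tensorflow as tf

example = tf.train.Example(features=tf.train.Features(feature={
    "feature1": tf.train.Feature(int64_list=tf.train.Int64List(value=[4])),
    "feature2": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"goat"])),
}))

# Round-trip through the binary form, as in the cells above.
decoded = tf.train.Example.FromString(example.SerializeToString())

# `features.feature` behaves like a dict keyed by feature name.
feature_map = decoded.features.feature
animal = feature_map["feature2"].bytes_list.value[0]
count = feature_map["feature1"].int64_list.value[0]

# WhichOneof reports which of the three list types a Feature holds.
kind = feature_map["feature2"].WhichOneof("kind")
print(animal, count, kind)
```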

## TFRecords format details

A TFRecord file contains a **sequence of records**. The file can only be read sequentially.

Each record contains a byte-string, for the data-payload, plus the data-length, and CRC32C (32-bit CRC using the Castagnoli polynomial) hashes for integrity checking.

Each record is stored in the following format:

    uint64 length
    uint32 masked_crc32_of_length
    byte   data[length]
    uint32 masked_crc32_of_data
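Because every record starts with its own length, a reader can walk a file from the front, one record at a time. The following sketch does this over an in-memory buffer with the standard library. It assumes little-endian integers and, to stay short, skips the two CRC fields instead of verifying them. The helpers `iter_records` and `frame` are hypothetical, and `frame` writes placeholder CRCs, so its output is for demonstration only and is not a valid file for TensorFlow.

```python
import struct


def iter_records(buf: bytes):
    """Yield each payload from a buffer of concatenated records.

    The two CRC fields are skipped, not verified.
    """
    offset = 0
    while offset < len(buf):
        # uint64 length, little-endian.
        (length,) = struct.unpack_from("<Q", buf, offset)
        offset += 8 + 4           # skip length and masked_crc32_of_length
        yield buf[offset:offset + length]
        offset += length + 4      # skip data and masked_crc32_of_data


def frame(payload: bytes) -> bytes:
    """Frame a payload with placeholder (zero) CRCs, for demonstration only."""
    return struct.pack("<Q", len(payload)) + b"\x00" * 4 + payload + b"\x00" * 4


buf = frame(b"cat") + frame(b"chicken")
records = list(iter_records(buf))
print(records)
```

There is no index in this layout: reaching the n-th record means reading the length of every record before it, which is why the file is read sequentially.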

The records are concatenated together to produce the file. The CRC is the CRC32C checksum mentioned above, and the mask of a CRC is:

    masked_crc = ((crc >> 15) | (crc << 17)) + 0xa282ead8ul
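As a rough illustration of the checksum and the mask, the sketch below implements CRC32C bit by bit in pure Python and applies the mask in 32-bit unsigned arithmetic (Python integers are unbounded, so the `& 0xFFFFFFFF` steps are needed). It assumes little-endian integers in the record framing. The function names are our own, and a production implementation would use a table-driven or hardware-accelerated CRC instead.

```python
import struct

MASK_DELTA = 0xA282EAD8
U32 = 0xFFFFFFFF


def crc32c(data: bytes) -> int:
    """Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = U32
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ U32


def masked_crc(data: bytes) -> int:
    """Rotate the CRC right by 15 bits and add the constant, modulo 2**32."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & U32
    return (rotated + MASK_DELTA) & U32


def encode_record(payload: bytes) -> bytes:
    """Frame one payload: length, masked CRC of length, data, masked CRC of data."""
    length = struct.pack("<Q", len(payload))
    return (length
            + struct.pack("<I", masked_crc(length))
            + payload
            + struct.pack("<I", masked_crc(payload)))


record = encode_record(b"goat")
print(len(record))  # 8 + 4 + 4 + 4 = 20 bytes
```

The standard CRC32C check value gives a quick sanity test for the checksum: `crc32c(b"123456789")` should equal `0xE3069283`.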

## TFRecord files using `tf.data`

### Writing a TFRecord file

The easiest way to **get the data into a dataset** is to use the `from_tensor_slices` method.

In [15]:
feature1

array([4, 3, 4, ..., 1, 2, 3])

In [16]:
tf.data.Dataset.from_tensor_slices(feature1)

<TensorSliceDataset shapes: (), types: tf.int64>

In [17]:
# Applied to a tuple of arrays, it returns a dataset of tuples
features_dataset = tf.data.Dataset.from_tensor_slices((feature0, feature1, feature2, feature3))
features_dataset

<TensorSliceDataset shapes: ((), (), (), ()), types: (tf.bool, tf.int64, tf.string, tf.float64)>

In [18]:
# View data
for f0, f1, f2, f3 in features_dataset.take(1):
    print(f0)
    print(f1)
    print(f2)
    print(f3)

tf.Tensor(False, shape=(), dtype=bool)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(b'goat', shape=(), dtype=string)
tf.Tensor(0.021051098542752146, shape=(), dtype=float64)


Use the `tf.data.Dataset.map` method to apply a function to each element of a Dataset.

The mapped function must operate in TensorFlow graph mode—it must operate on and return `tf.Tensors`. A non-tensor function, like `serialize_example`, can be wrapped with `tf.py_function` to make it compatible.

Using `tf.py_function` **requires you to specify the shape and type information** that is otherwise unavailable:

In [20]:
def tf_serialize_example(f0, f1, f2, f3):
    tf_string = tf.py_function(
        serialize_example,
        inp=(f0, f1, f2, f3),
        Tout=tf.string
    )
    return tf.reshape(tf_string, ())

In [21]:
tf_serialize_example(f0, f1, f2, f3)

<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04Zs\xac<'>

In [22]:
# Apply this function to each element in the dataset
serialized_features_dataset = features_dataset.map(tf_serialize_example)
serialized_features_dataset

<MapDataset shapes: (), types: tf.string>

In [23]:
def generator():
    for features in features_dataset:
        yield serialize_example(*features)

In [24]:
serialized_features_dataset = tf.data.Dataset.from_generator(
    generator,
    output_types=tf.string,
    output_shapes=()
)
serialized_features_dataset

<FlatMapDataset shapes: (), types: tf.string>
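In more recent TensorFlow 2 releases (assumed here: 2.4 or later), `from_generator` also accepts a single `output_signature` argument in place of the separate `output_types` and `output_shapes`. The sketch below uses a stand-in generator that yields plain byte-strings, so that it runs without the dataset built above:

```python
import tensorflow as tf


def generator():
    # Stand-in for serialized tf.Example messages.
    for payload in (b"first", b"second", b"third"):
        yield payload


dataset = tf.data.Dataset.from_generator(
    generator,
    # One TensorSpec describes both the shape and the dtype of each element.
    output_signature=tf.TensorSpec(shape=(), dtype=tf.string),
)

records = [record.numpy() for record in dataset]
print(records)
```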

In [25]:
# And write them to a TFRecord file
filename = 'test.tfrecord'
writer = tf.data.experimental.TFRecordWriter(filename)
writer.write(serialized_features_dataset)
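Records can also be written one at a time from plain Python with `tf.io.TFRecordWriter`, without building a dataset first. The sketch below is a minimal illustration: the helper `make_example`, the feature name `index`, and the temporary file path are our own choices.

```python
import os
import tempfile

import tensorflow as tf


def make_example(index: int) -> bytes:
    """Serialize a one-feature tf.Example (hypothetical helper)."""
    feature = {"index": tf.train.Feature(
        int64_list=tf.train.Int64List(value=[index]))}
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()


path = os.path.join(tempfile.mkdtemp(), "sketch.tfrecord")

# The writer is a context manager; each write() call appends one record.
with tf.io.TFRecordWriter(path) as writer:
    for i in range(3):
        writer.write(make_example(i))

# Read the file back to confirm that three records were stored in order.
indices = [
    tf.train.Example.FromString(raw.numpy())
    .features.feature["index"].int64_list.value[0]
    for raw in tf.data.TFRecordDataset(path)
]
print(indices)
```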

### Reading a TFRecord file

We can also read the TFRecord file using the `tf.data.TFRecordDataset` class.

More information on consuming `TFRecord` files using `tf.data` can be found in the `tf.data` guide.

Using `TFRecordDataset`s can be useful for standardizing input data and optimizing performance.

In [26]:
filenames = [filename]
raw_dataset = tf.data.TFRecordDataset(filenames)
raw_dataset

<TFRecordDatasetV2 shapes: (), types: tf.string>

At this point the dataset contains serialized `tf.train.Example` messages. When iterated over it returns these as scalar string tensors.

Use the `.take` method to only show the first 10 records.

In [27]:
for raw_record in raw_dataset.take(10):
    print(repr(raw_record))

<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04Zs\xac<'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nS\n\x15\n\x08feature2\x12\t\n\x07\n\x05horse\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04f\xd6S?\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x03'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\xd9\xbd2\xbf\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nS\n\x15\n\x08feature2\x12\t\n\x07\n\x05horse\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\xe5\xbb\x89?\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x03'>
...

These tensors can be parsed using the function below. Note that the `feature_description` is necessary here because **datasets use graph-execution**, and need this description to build their shape and type signature:

In [28]:
# Create a description of the features
feature_description = {
    'feature0': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'feature1': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'feature2': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'feature3': tf.io.FixedLenFeature([], tf.float32, default_value=0.0),
}
feature_description

{'feature0': FixedLenFeature(shape=[], dtype=tf.int64, default_value=0),
 'feature1': FixedLenFeature(shape=[], dtype=tf.int64, default_value=0),
 'feature2': FixedLenFeature(shape=[], dtype=tf.string, default_value=''),
 'feature3': FixedLenFeature(shape=[], dtype=tf.float32, default_value=0.0)}

In [29]:
def _parse_function(example_proto):
    return tf.io.parse_single_example(example_proto, feature_description)

Apply `_parse_function` to each item in the dataset using the `tf.data.Dataset.map` method. Alternatively, `tf.io.parse_example` can parse a whole batch at once.

In [30]:
parsed_dataset = raw_dataset.map(_parse_function)
parsed_dataset

<MapDataset shapes: {feature0: (), feature1: (), feature2: (), feature3: ()}, types: {feature0: tf.int64, feature1: tf.int64, feature2: tf.string, feature3: tf.float32}>
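For comparison, batch parsing with `tf.io.parse_example` might look like the sketch below. It builds two small serialized examples of its own so that it can run in isolation; the feature names `animal` and `score` and the helper `make_example` are hypothetical and not part of the dataset above.

```python
import tensorflow as tf


def make_example(animal: bytes, score: float) -> bytes:
    """Serialize a two-feature tf.Example (hypothetical helper)."""
    feature = {
        "animal": tf.train.Feature(bytes_list=tf.train.BytesList(value=[animal])),
        "score": tf.train.Feature(float_list=tf.train.FloatList(value=[score])),
    }
    return tf.train.Example(
        features=tf.train.Features(feature=feature)).SerializeToString()


description = {
    "animal": tf.io.FixedLenFeature([], tf.string, default_value=""),
    "score": tf.io.FixedLenFeature([], tf.float32, default_value=0.0),
}

serialized = [make_example(b"goat", 0.5), make_example(b"horse", 1.5)]
dataset = tf.data.Dataset.from_tensor_slices(serialized)

# Batch first, then parse each batch of serialized strings in one call.
parsed_batches = dataset.batch(2).map(
    lambda batch: tf.io.parse_example(batch, description))

first_batch = next(iter(parsed_batches))
print(first_batch["animal"].numpy(), first_batch["score"].numpy())
```

Each feature in the result carries a leading batch dimension, so a scalar `FixedLenFeature` yields a tensor of shape `(batch_size,)`.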

**Use eager execution to display** the observations in the dataset. There are 10,000 observations in this dataset, but we will only display the first 10. The data is displayed as a dictionary of features. Each item is a `tf.Tensor`, and the numpy element of this tensor displays the value of the feature.

In [31]:
for parsed_record in parsed_dataset.take(10):
    print(repr(parsed_record))

{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=0.021051098>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=3>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'horse'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=0.8274902>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=-0.69820935>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=3>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'horse'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=1.0760466>}
...

Here, the `tf.io.parse_single_example` function unpacks the `tf.Example` fields into standard tensors.