In [1]:
import tensorflow as tf

# [Tensorflow Records? What they are and how to use them](https://medium.com/mostly-ai/tensorflow-records-what-they-are-and-how-to-use-them-c46bc4bbb564)

Interest in `Tensorflow` has increased steadily since its introduction in November 2015. A lesser-known component of `Tensorflow` is the [TFRecord file format](https://www.tensorflow.org/api_docs/python/tf/io), Tensorflow’s own binary storage format.

If you are working with large datasets, using a binary file format for storage of your data can have a significant impact on the performance of your import pipeline and as a consequence on the training time of your model. Binary data takes up less space on disk, takes less time to copy and can be read much more efficiently from disk. This is especially true if your data is stored on spinning disks, due to the much lower read/write performance in comparison with SSDs.

However, pure performance isn’t the only advantage of the `TFRecord` file format. It is optimized for use with `Tensorflow` in multiple ways. To start with, it makes it easy to combine multiple datasets and integrates seamlessly with the data import and preprocessing functionality provided by the library. Especially for datasets that are too large to be stored fully in memory this is an advantage as only the data that is required at the time (e.g. a batch) is loaded from disk and then processed. Another major advantage of `TFRecords` is that it is possible to store sequence data — for instance, a time series or word encodings — in a way that allows for very efficient and (from a coding perspective) convenient import of this type of data. Check out the [Reading Data](https://www.tensorflow.org/api_guides/python/reading_data) guide to learn more about reading TFRecord files.

So, there are a lot of advantages to using `TFRecords`. But where there is light, there must be shadow, and in the case of `TFRecords` the downside is that you have to convert your data to this format in the first place, and only limited documentation is available on how to do that. An [official tutorial](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/how_tos/reading_data/convert_to_records.py) and a number of articles about writing TFRecords exist, but I found that they only got me part of the way toward solving my problem.

In this post I will explain the components required to structure and write a `TFRecord` file, and explain in detail how to write different types of data. This will help you get started on tackling your own challenges.

## Structuring TFRecords
A `TFRecord` file stores your data as a sequence of binary strings. This means you need to specify the structure of your data before you write it to the file. `Tensorflow` provides two components for this purpose: [tf.train.Example](https://www.tensorflow.org/api_docs/python/tf/train/Example) and [tf.train.SequenceExample](https://www.tensorflow.org/api_docs/python/tf/train/SequenceExample). You have to store each sample of your data in one of these structures, then serialize it and use a [tf.python_io.TFRecordWriter](https://www.tensorflow.org/api_docs/python/tf/io/TFRecordWriter) to write it to disk.
> `tf.train.Example` isn’t a normal Python class, but a [protocol buffer](https://en.wikipedia.org/wiki/Protocol_Buffers).

As a software developer, the main problem I had at the beginning was that many of the components in the `Tensorflow` API don’t have a description of the attributes or methods of the class. For instance, for `tf.train.Example` only a “.proto” file with cryptic structures called “message” is provided, along with examples in pseudocode. The reason for this is that `tf.train.Example` isn’t a normal Python class, but a [protocol buffer](https://en.wikipedia.org/wiki/Protocol_Buffers). A protocol buffer is a method developed by Google to serialize structured data in an efficient way. I will now discuss the two main ways to structure `Tensorflow` `TFRecords`, give an overview of the components from a developer’s point of view, and provide a detailed example of how to use `tf.train.Example` and `tf.train.SequenceExample`.

## Movie recommendations using `tf.train.Example`
> If your dataset consists of features, where each feature is a list of values of the same type, `tf.train.Example` is the right component to use.

Let’s use the movie recommendation application from the [Tensorflow documentation](https://github.com/tensorflow/tensorflow/blob/r1.5/tensorflow/core/example/example.proto) as an example:

|Age|Movie|Movie Ratings|Suggestion|Suggestion Purchased|Purchase Price|
|---|:---:|---|---|---|---|
|29|The Shawshank Redemption|9.0|Inception|1.0|9.99|
| &nbsp; |Fight Club|9.7| &nbsp; | &nbsp; | &nbsp; |

We have a number of features, each being a list where every entry has the same data type. In order to store these features in a TFRecord, we first need to create the lists that constitute the features.

[tf.train.BytesList](https://www.tensorflow.org/api_docs/python/tf/train/BytesList), [tf.train.FloatList](https://www.tensorflow.org/api_docs/python/tf/train/FloatList), and [tf.train.Int64List](https://www.tensorflow.org/api_docs/python/tf/train/Int64List) are at the core of a [tf.train.Feature](https://www.tensorflow.org/api_docs/python/tf/train/Feature). All three have a single attribute, `value`, which expects a list of bytes, floats, or ints respectively.

In [2]:
movie_name_list = tf.train.BytesList(value=[b'The Shawshank Redemption', b'Fight Club'])
movie_rating_list = tf.train.FloatList(value=[9.0, 9.7])

Python strings need to be converted to bytes (e.g. `my_string.encode('utf-8')`) before they are stored in a `tf.train.BytesList`.
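The conversion can be sketched with plain Python, without `Tensorflow` at all. The titles below are hypothetical sample values (the second one is chosen only because it contains a non-ASCII character); the point is that encoding on the way in and decoding on the way out recovers the original strings.

```python
# Hypothetical sample titles; any Python str values behave the same way.
titles = ['The Shawshank Redemption', 'Amélie']

# Encode each str to UTF-8 bytes, the form a tf.train.BytesList expects.
encoded_titles = [title.encode('utf-8') for title in titles]

# After reading the bytes back, decode them to recover the original strings.
decoded_titles = [raw.decode('utf-8') for raw in encoded_titles]

print(encoded_titles)
print(decoded_titles == titles)
```

The list `encoded_titles` is what you would pass as `value=` to `tf.train.BytesList`.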

[tf.train.Feature](https://www.tensorflow.org/api_docs/python/tf/train/Feature) wraps a list of data of a specific type so Tensorflow can understand it. It has a single attribute, which is a union of `bytes_list/float_list/int64_list`. Being a union, the stored list can be of type `tf.train.BytesList` (attribute name `bytes_list`), `tf.train.FloatList` (attribute name `float_list`), or `tf.train.Int64List` (attribute name `int64_list`).

In [3]:
movie_names = tf.train.Feature(bytes_list=movie_name_list)
movie_ratings = tf.train.Feature(float_list=movie_rating_list)
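To see the union behaviour in practice, the following sketch builds one feature of each kind and asks the protocol buffer which member is populated. It assumes that the union is declared as a `oneof` named `kind`, which is how it appears in the `feature.proto` file that ships with `Tensorflow`; the variable names and values are illustrative only.

```python
import tensorflow as tf

# Build one feature of each kind; the values are illustrative only.
name_feature = tf.train.Feature(
    bytes_list=tf.train.BytesList(value=[b'Fight Club']))
rating_feature = tf.train.Feature(
    float_list=tf.train.FloatList(value=[9.7]))
age_feature = tf.train.Feature(
    int64_list=tf.train.Int64List(value=[29]))

# WhichOneof reports which member of the union is currently populated
# (assumption: the oneof is called 'kind', as in feature.proto).
kinds = [feature.WhichOneof('kind')
         for feature in (name_feature, rating_feature, age_feature)]

print(kinds)
```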

[tf.train.Features](https://www.tensorflow.org/api_docs/python/tf/train/Features) is a collection of named features. It has a single attribute, `feature`, that expects a dictionary where the key is the name of the feature and the value is a `tf.train.Feature`.

In [4]:
movie_dict = {
  'Movie Names': movie_names,
  'Movie Ratings': movie_ratings
}
movies = tf.train.Features(feature=movie_dict)

[tf.train.Example](https://www.tensorflow.org/api_docs/python/tf/train/Example) is one of the main components for structuring a `TFRecord`. A `tf.train.Example` stores features in a single attribute, `features`, of type `tf.train.Features`.

In [5]:
example = tf.train.Example(features=movies)

In contrast to the previous components, [tf.python_io.TFRecordWriter](https://www.tensorflow.org/api_docs/python/tf/io/TFRecordWriter) actually is a Python class. It accepts a file path in its `path` attribute and creates a writer object that works just like any other file object. The `TFRecordWriter` class offers `write`, `flush` and `close` methods. The `write` method accepts a byte string as its parameter and writes it to disk, meaning that structured data must be serialized first. To this end, `tf.train.Example` and `tf.train.SequenceExample` provide `SerializeToString` methods:

In [6]:
# "example" is of type tf.train.Example.
with tf.python_io.TFRecordWriter('movie_ratings.tfrecord') as writer:
  writer.write(example.SerializeToString())
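As a rough illustration of what serialization does, the sketch below serializes a small `tf.train.Example` and rebuilds it from the resulting bytes. It relies on `FromString`, a class method that the protocol buffer Python runtime provides on message classes; the single feature used here is just a cut-down version of the movie ratings above.

```python
import tensorflow as tf

example = tf.train.Example(features=tf.train.Features(feature={
    'Movie Ratings': tf.train.Feature(
        float_list=tf.train.FloatList(value=[9.0, 9.7])),
}))

# SerializeToString returns a Python bytes object, ready for writer.write().
serialized = example.SerializeToString()

# FromString rebuilds an equivalent message from those bytes.
restored = tf.train.Example.FromString(serialized)

print(type(serialized).__name__)
print(restored == example)
```

Because the writer only ever sees opaque bytes, anything that reads the file later has to know which message type to parse them back into.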

In our example, each `TFRecord` represents the movie ratings and corresponding suggestions of a single user (a single sample). Writing recommendations for all users in the dataset follows the same process. It is important that the type of a feature (e.g. float for the movie rating) is the same across all samples in the dataset. This conformance criterion and others are defined in the [protocol buffer definition](https://github.com/tensorflow/tensorflow/blob/r1.5/tensorflow/core/example/example.proto) of `tf.train.Example`.
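A minimal sketch of writing several users into one file could look as follows. The second user and the helper `make_example` are hypothetical and exist only for illustration; the sketch also assumes a `Tensorflow` version that exposes the writer as `tf.io.TFRecordWriter` (the location the documentation link above points to). Routing every sample through the same helper is one simple way to keep feature types consistent across the dataset.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical samples; the second user is invented for illustration.
users = [
    {'Age': 29,
     'Movie': ['The Shawshank Redemption', 'Fight Club'],
     'Movie Ratings': [9.0, 9.7]},
    {'Age': 41,
     'Movie': ['Inception'],
     'Movie Ratings': [8.5]},
]


def make_example(user):
    """Build a tf.train.Example using the same feature types for every user."""
    return tf.train.Example(features=tf.train.Features(feature={
        'Age': tf.train.Feature(
            int64_list=tf.train.Int64List(value=[user['Age']])),
        'Movie': tf.train.Feature(
            bytes_list=tf.train.BytesList(
                value=[m.encode('utf-8') for m in user['Movie']])),
        'Movie Ratings': tf.train.Feature(
            float_list=tf.train.FloatList(value=user['Movie Ratings'])),
    }))


# Write into a fresh temporary directory rather than the working directory.
records_path = os.path.join(tempfile.mkdtemp(), 'all_users.tfrecord')
with tf.io.TFRecordWriter(records_path) as writer:
    for user in users:
        # One serialized Example per user, appended to the same file.
        writer.write(make_example(user).SerializeToString())

print(os.path.getsize(records_path) > 0)
```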

Here’s a complete example that writes the features to a `TFRecord` file, then reads the file back in and prints the parsed features.

In [7]:
# Create example data
data = {
    'Age': 29,
    'Movie': ['The Shawshank Redemption', 'Fight Club'],
    'Movie Ratings': [9.0, 9.7],
    'Suggestion': 'Inception',
    'Suggestion Purchased': 1.0,
    'Purchase Price': 9.99
}

print(data)

{'Age': 29, 'Movie': ['The Shawshank Redemption', 'Fight Club'], 'Movie Ratings': [9.0, 9.7], 'Suggestion': 'Inception', 'Suggestion Purchased': 1.0, 'Purchase Price': 9.99}


In [8]:
# Create the Example
example = tf.train.Example(features=tf.train.Features(feature={
    'Age': tf.train.Feature(
        int64_list=tf.train.Int64List(value=[data['Age']])),
    'Movie': tf.train.Feature(
        bytes_list=tf.train.BytesList(
            value=[m.encode('utf-8') for m in data['Movie']])),
    'Movie Ratings': tf.train.Feature(
        float_list=tf.train.FloatList(value=data['Movie Ratings'])),
    'Suggestion': tf.train.Feature(
        bytes_list=tf.train.BytesList(
            value=[data['Suggestion'].encode('utf-8')])),
    'Suggestion Purchased': tf.train.Feature(
        float_list=tf.train.FloatList(
            value=[data['Suggestion Purchased']])),
    'Purchase Price': tf.train.Feature(
        float_list=tf.train.FloatList(value=[data['Purchase Price']]))
}))

print(example)

features {
  feature {
    key: "Age"
    value {
      int64_list {
        value: 29
      }
    }
  }
  feature {
    key: "Movie"
    value {
      bytes_list {
        value: "The Shawshank Redemption"
        value: "Fight Club"
      }
    }
  }
  feature {
    key: "Movie Ratings"
    value {
      float_list {
        value: 9.0
        value: 9.699999809265137
      }
    }
  }
  feature {
    key: "Purchase Price"
    value {
      float_list {
        value: 9.989999771118164
      }
    }
  }
  feature {
    key: "Suggestion"
    value {
      bytes_list {
        value: "Inception"
      }
    }
  }
  feature {
    key: "Suggestion Purchased"
    value {
      float_list {
        value: 1.0
      }
    }
  }
}



In [9]:
# Write TFRecord file
with tf.python_io.TFRecordWriter('customer_1.tfrecord') as writer:
    writer.write(example.SerializeToString())

In [10]:
# Read and print data:
sess = tf.InteractiveSession()

# Read TFRecord file
reader = tf.TFRecordReader()
filename_queue = tf.train.string_input_producer(['customer_1.tfrecord'])

_, serialized_example = reader.read(filename_queue)

# Define features
read_features = {
    'Age': tf.FixedLenFeature([], dtype=tf.int64),
    'Movie': tf.VarLenFeature(dtype=tf.string),
    'Movie Ratings': tf.VarLenFeature(dtype=tf.float32),
    'Suggestion': tf.FixedLenFeature([], dtype=tf.string),
    'Suggestion Purchased': tf.FixedLenFeature([], dtype=tf.float32),
    'Purchase Price': tf.FixedLenFeature([], dtype=tf.float32)}

# Extract features from serialized data
read_data = tf.parse_single_example(serialized=serialized_example,
                                    features=read_features)

# Many tf.train functions use tf.train.QueueRunner,
# so we need to start it before we read
tf.train.start_queue_runners(sess)

# Print features
for name, tensor in read_data.items():
    print('{}: {}'.format(name, tensor.eval()))

Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.TFRecordDataset`.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.from_tensor_slices(string_tensor).shuffle(tf.shape(input_tensor, out_type=tf.int64)[0]).repeat(num_epochs)`. If `shuffle=False`, omit the `.shuffle(...)`.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.from_tensor_slices(input_tensor).shuffle(tf.shape(input_tensor, out_type=tf.int64)[0]).repeat(num_epochs)`. If `shuffle=False`, omit the `.shuffle(...)`.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.from_tensors(tensor).repeat(num_epochs)`.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.

Now that we’ve covered the structure of `TFRecords`, the process of reading them is straightforward:
   1. Read the TFRecord using a [tf.TFRecordReader](https://www.tensorflow.org/api_docs/python/tf/TFRecordReader).
   2. Define the features you expect in the `TFRecord` by using [tf.FixedLenFeature](https://www.tensorflow.org/api_docs/python/tf/io/FixedLenFeature) and [tf.VarLenFeature](https://www.tensorflow.org/api_docs/python/tf/io/VarLenFeature), depending on what has been defined during the definition of `tf.train.Example`.
   3. Parse one `tf.train.Example` (one record) at a time using [tf.parse_single_example](https://www.tensorflow.org/api_docs/python/tf/io/parse_single_example).
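The same three steps can also be sketched with `tf.data.TFRecordDataset`, the replacement that the deprecation notices in the output above recommend. This is a sketch under the assumption of Tensorflow 2.x with eager execution, not the queue-based setup used in the cell above. The helper `parse_record`, the temporary file path and the reduced feature set are illustrative choices.

```python
import os
import tempfile

import tensorflow as tf

# Write a small record first so the sketch is self-contained.
example = tf.train.Example(features=tf.train.Features(feature={
    'Age': tf.train.Feature(int64_list=tf.train.Int64List(value=[29])),
    'Movie': tf.train.Feature(bytes_list=tf.train.BytesList(
        value=[b'The Shawshank Redemption', b'Fight Club'])),
    'Movie Ratings': tf.train.Feature(
        float_list=tf.train.FloatList(value=[9.0, 9.7])),
}))
path = os.path.join(tempfile.mkdtemp(), 'customer_1.tfrecord')
with tf.io.TFRecordWriter(path) as writer:
    writer.write(example.SerializeToString())

# Step 2: describe the features expected in each record.
read_features = {
    'Age': tf.io.FixedLenFeature([], dtype=tf.int64),
    'Movie': tf.io.VarLenFeature(dtype=tf.string),
    'Movie Ratings': tf.io.VarLenFeature(dtype=tf.float32),
}


def parse_record(serialized):
    """Step 3: parse one serialized tf.train.Example."""
    return tf.io.parse_single_example(serialized, read_features)


# Step 1: read the file; each dataset element is one serialized record.
dataset = tf.data.TFRecordDataset([path]).map(parse_record)

for parsed in dataset:
    age = int(parsed['Age'].numpy())
    # VarLenFeature yields a SparseTensor; .values holds the stored entries.
    movies = [m.decode('utf-8') for m in parsed['Movie'].values.numpy()]
    ratings = parsed['Movie Ratings'].values.numpy().tolist()
    print(age, movies, ratings)
```

Note that the ratings come back as 32-bit floats, so `9.7` is only recovered approximately, just as in the printed `tf.train.Example` earlier.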