https://www.tensorflow.org/tutorials/load_data/tf_records

* The <b>TFRecord format</b> (`.tfrecord`) is a simple format for storing a sequence of binary records.

* <b>Protocol buffers</b> are a cross-platform, cross-language library for efficient serialization of structured data.
  
  Protocol messages are defined in `.proto` files, which are often the easiest way to understand a message type.
  
* The `tf.Example` message (or protobuf) is a flexible message type that represents a `{"string": value}` mapping. It is designed for use with TensorFlow and is used throughout the higher-level APIs such as TFX.
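
To make "a sequence of binary records" more concrete, here is a rough plain-Python sketch of how records can be framed on disk. It assumes the layout described in the TensorFlow documentation: a little-endian `uint64` length, a masked CRC-32C of the length, the payload bytes, and a masked CRC-32C of the payload. The helper names (`crc32c`, `masked_crc`, `frame_record`, `iter_records`) are invented for this illustration and are not TensorFlow APIs.

```python
import io
import struct

def crc32c(data):
    """Bitwise CRC-32C (Castagnoli polynomial); slow but short."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def masked_crc(data):
    """Rotate the CRC right by 15 bits, then add a constant (mod 2**32)."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF

def frame_record(payload):
    """Frame one payload: length, CRC of length, payload, CRC of payload."""
    length = struct.pack('<Q', len(payload))
    return (length + struct.pack('<I', masked_crc(length)) +
            payload + struct.pack('<I', masked_crc(payload)))

def iter_records(stream):
    """Yield each payload from a framed byte stream, checking both CRCs."""
    while True:
        header = stream.read(12)
        if not header:
            return  # clean end of stream
        length_bytes, length_crc = header[:8], header[8:]
        if struct.unpack('<I', length_crc)[0] != masked_crc(length_bytes):
            raise ValueError('corrupt length header')
        (length,) = struct.unpack('<Q', length_bytes)
        payload = stream.read(length)
        (payload_crc,) = struct.unpack('<I', stream.read(4))
        if payload_crc != masked_crc(payload):
            raise ValueError('corrupt payload')
        yield payload

# Two framed records back to back form a valid "file".
stream = io.BytesIO(frame_record(b'first') + frame_record(b'second record'))
print(list(iter_records(stream)))
```

In practice the payload of each record is usually a serialized `tf.Example`, and `tf.io.TFRecordWriter` handles the framing; the sketch is only meant to show that the container itself knows nothing about the payload's structure.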

In [1]:
import tensorflow as tf
import numpy as np
import os

## Data types for `tf.Example`

* Fundamentally a `tf.Example` is a `{"string": tf.train.Feature}` mapping.
* The `tf.train.Feature` message type accepts one of the following three list types. Each is shown with the value types that can be coerced into it:
  1. `tf.train.BytesList`
     * `string`
     * `byte`
  2. `tf.train.FloatList`
     * `float32` (`float`)
     * `float64` (`double`)
  3. `tf.train.Int64List`
     * `bool`
     * `enum`
     * `int32`
     * `uint32`
     * `int64`
     * `uint64`
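
As a small sketch of how Python and NumPy values land in these three list types, the snippet below builds each list directly. It uses only the `tf.train` message classes named above; the variable names are illustrative. Note that `FloatList` holds 32-bit floats, so a `float64` input keeps only `float32` precision.

```python
import numpy as np
import tensorflow as tf

# FloatList stores 32-bit floats, so a float64 input loses some precision.
float_list = tf.train.FloatList(value=[np.exp(1)])
stored = float_list.value[0]
print(stored, 'vs', np.exp(1))

# Int64List accepts bools and ints; True is stored as the integer 1.
int_list = tf.train.Int64List(value=[True, 7])
print(list(int_list.value))

# BytesList needs bytes, so text has to be encoded first.
bytes_list = tf.train.BytesList(value=['chicken'.encode('utf-8')])
print(list(bytes_list.value))
```

Each list type can also hold several values at once, as the `Int64List` above does.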

In [2]:
# The following functions can be used to convert a value to a type compatible
# with tf.Example.

def _bytes_feature(value):
  """Returns a bytes_list from a string / byte."""
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
  """Returns a float_list from a float / double."""
  return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
  """Returns an int64_list from a bool / enum / int / uint."""
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

In [3]:
print(_bytes_feature(b'test_string'))
print(_bytes_feature(u'test_bytes'.encode('utf-8')))

print(_float_feature(np.exp(1)))

print(_int64_feature(True))
print(_int64_feature(1))

bytes_list {
  value: "test_string"
}

bytes_list {
  value: "test_bytes"
}

float_list {
  value: 2.7182817459106445
}

int64_list {
  value: 1
}

int64_list {
  value: 1
}



All proto messages can be serialized to a binary string using the `.SerializeToString()` method.

In [4]:
feature = _float_feature(np.exp(1))

feature.SerializeToString()

b'\x12\x06\n\x04T\xf8-@'
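
The reverse direction can be sketched the same way: the bytes from `.SerializeToString()` can be parsed back into a message with the class's `FromString()` method. The check with `WhichOneof('kind')` relies on the `oneof` group in `feature.proto` being named `kind`; treat that detail as an assumption to verify against the `.proto` file of your TensorFlow version.

```python
import numpy as np
import tensorflow as tf

feature = tf.train.Feature(float_list=tf.train.FloatList(value=[np.exp(1)]))
serialized = feature.SerializeToString()

# Parse the binary string back into a new Feature message.
restored = tf.train.Feature.FromString(serialized)

print(type(serialized))             # the wire format is a plain bytes object
print(restored.WhichOneof('kind'))  # which of the three list types is set
print(restored.float_list.value[0])
```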

## Creating a `tf.Example` message from existing data

1. Within each observation, each value needs to be converted to a `tf.train.Feature` containing one of the 3 compatible types, using one of the functions above.

2. Create a map (dictionary) from the feature name string to the <b>encoded feature value</b> produced in #1.

3. The map produced in #2 is converted to a <b>Features message</b>.

In [5]:
# the number of observations in the dataset
n_observations = int(1e4)

# boolean feature, encoded as False or True
feature0 = np.random.choice([False, True], n_observations)

# integer feature, random from 0 .. 4
feature1 = np.random.randint(0, 5, n_observations)

# string feature
strings = np.array([b'cat', b'dog', b'chicken', b'horse', b'goat'])
feature2 = strings[feature1]

# float feature, from a standard normal distribution
feature3 = np.random.randn(n_observations)

In [6]:
def serialize_example(feature0, feature1, feature2, feature3):
    """
    Creates a tf.Example message ready to be written to a file.
    """

    # Create a dictionary mapping the feature name to the tf.Example-compatible
    # data type.

    feature = {
      'feature0': _int64_feature(feature0),
      'feature1': _int64_feature(feature1),
      'feature2': _bytes_feature(feature2),
      'feature3': _float_feature(feature3),
    }

    # Wrap the dictionary in a Features message and build the tf.train.Example.

    example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
    return example_proto.SerializeToString()

* Note: `tf.train.Features` wraps the dictionary of `tf.train.Feature` messages; pass the result as the `features` argument of `tf.train.Example`.

In [7]:
# This is an example observation from the dataset.

serialized_example = serialize_example(False, 4, b'goat', 0.9876)
serialized_example

b'\nR\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04[\xd3|?\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04'

To decode the message, use the `tf.train.Example.FromString()` method.

In [8]:
example_proto = tf.train.Example.FromString(serialized_example)
example_proto

features {
  feature {
    key: "feature0"
    value {
      int64_list {
        value: 0
      }
    }
  }
  feature {
    key: "feature1"
    value {
      int64_list {
        value: 4
      }
    }
  }
  feature {
    key: "feature2"
    value {
      bytes_list {
        value: "goat"
      }
    }
  }
  feature {
    key: "feature3"
    value {
      float_list {
        value: 0.9876000285148621
      }
    }
  }
}
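
Once decoded, individual values can be read back out of the message. The following is a minimal sketch that rebuilds a two-feature example (the feature names mirror the ones above; the variable names are illustrative) and reads its fields through the `features.feature` map.

```python
import tensorflow as tf

example = tf.train.Example(features=tf.train.Features(feature={
    'feature1': tf.train.Feature(int64_list=tf.train.Int64List(value=[4])),
    'feature2': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'goat'])),
}))
decoded = tf.train.Example.FromString(example.SerializeToString())

# `features.feature` behaves like a dict keyed by feature name.
feature_map = decoded.features.feature
animal = feature_map['feature2'].bytes_list.value[0]
count = feature_map['feature1'].int64_list.value[0]

print(animal.decode('utf-8'), count)
print(sorted(feature_map.keys()))
```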

# TFRecord files using `tf.io`

In [9]:
data_dir = 'data'
if not os.path.exists(data_dir):
    os.makedirs(data_dir)

## Writing

In [10]:
# Write the `tf.Example` observations to the file.
with tf.io.TFRecordWriter(os.path.join(data_dir, 'test.tfrecords')) as writer:
    for i in range(n_observations):
        example = serialize_example(feature0[i], feature1[i], feature2[i], feature3[i])
        writer.write(example)

## Reading

In [11]:
# Create a record iterator
record_iterator = tf.io.tf_record_iterator(os.path.join(data_dir, 'test.tfrecords'))

# Iterate through the TFRecords
for string_record in record_iterator:
    example = tf.train.Example()
    example.ParseFromString(string_record) # decode the record message
    print(example)
    # Exit after 1 iteration as this is purely demonstrative.
    break

W0803 00:27:21.795620 16272 deprecation.py:323] From <ipython-input-11-3cfec0d824f4>:2: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`


features {
  feature {
    key: "feature0"
    value {
      int64_list {
        value: 1
      }
    }
  }
  feature {
    key: "feature1"
    value {
      int64_list {
        value: 0
      }
    }
  }
  feature {
    key: "feature2"
    value {
      bytes_list {
        value: "cat"
      }
    }
  }
  feature {
    key: "feature3"
    value {
      float_list {
        value: 0.4755978286266327
      }
    }
  }
}
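
The deprecation warning above points to `tf.data.TFRecordDataset` as the replacement. Here is a hedged sketch of that approach, assuming TensorFlow 2.x (or 1.x with eager execution enabled). It writes a small throwaway file to a temporary directory so it can run on its own; the file name and feature values are made up for the example.

```python
import os
import tempfile
import tensorflow as tf

# Write a tiny throwaway file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'sketch.tfrecords')
with tf.io.TFRecordWriter(path) as writer:
    for value in [b'cat', b'dog']:
        example = tf.train.Example(features=tf.train.Features(feature={
            'feature2': tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[value])),
        }))
        writer.write(example.SerializeToString())

# Under eager execution a dataset is a plain Python iterable of string tensors.
decoded = []
for raw_record in tf.data.TFRecordDataset(path):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())  # tensor -> bytes -> message
    decoded.append(example.features.feature['feature2'].bytes_list.value[0])

print(decoded)
```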



## Reading into `tf.data.TFRecordDataset` using `tf.io.parse_single_example()`

In [12]:
# Describe the shape, type, and default value of each feature.
feature_description = {
    'feature0': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'feature1': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'feature2': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'feature3': tf.io.FixedLenFeature([], tf.float32, default_value=0.0),
}

def _parse_function(example_proto):
    # Parse the input tf.Example proto using the dictionary above.
    return tf.io.parse_single_example(example_proto, feature_description)

<b>Note</b>: you can use `tf.io.parse_example()` to parse a whole batch at once.
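
As a sketch of the batch variant mentioned in the note, the snippet below parses two serialized examples in one call using `tf.io.parse_example` (the `tf.io` name for the batch parser). It assumes TensorFlow 2.x with eager execution so that results can be inspected with `.numpy()`; the helper `make_serialized` is hypothetical and exists only for this example.

```python
import tensorflow as tf

def make_serialized(number, name):
    """Hypothetical helper: build and serialize a two-feature tf.Example."""
    example = tf.train.Example(features=tf.train.Features(feature={
        'feature1': tf.train.Feature(
            int64_list=tf.train.Int64List(value=[number])),
        'feature2': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[name])),
    }))
    return example.SerializeToString()

# A 1-D string tensor holding one serialized example per element.
batch = tf.constant([make_serialized(0, b'cat'), make_serialized(4, b'goat')])

description = {
    'feature1': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'feature2': tf.io.FixedLenFeature([], tf.string, default_value=''),
}

# Each parsed feature gains a leading batch dimension.
parsed = tf.io.parse_example(batch, description)
print(parsed['feature1'].numpy())
print(parsed['feature2'].numpy())
```

Parsing after batching usually means fewer, larger parsing calls than mapping `parse_single_example` over every record.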

In [13]:
# read tfrecord files into a TFRecordDataset
raw_dataset = tf.data.TFRecordDataset(filenames=[os.path.join(data_dir, 'test.tfrecords')])

In [14]:
# Test _parse_function
# -------------------
iterator = raw_dataset.make_one_shot_iterator()
next_item = iterator.get_next()
parsed_item = _parse_function(next_item)

with tf.Session() as sess:
    raw_value, parsed_value = sess.run([next_item, parsed_item])
    print('Raw example: ', raw_value)
    print('Parsed example: ', parsed_value)
# -------------------

W0803 00:27:22.213502 16272 deprecation.py:323] From <ipython-input-14-708667d8f491>:3: DatasetV1.make_one_shot_iterator (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `for ... in dataset:` to iterate over a dataset. If using `tf.estimator`, return the `Dataset` object directly from your input function. As a last resort, you can use `tf.compat.v1.data.make_one_shot_iterator(dataset)`.


Raw example:  b'\nQ\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\x8f\x81\xf3>\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x13\n\x08feature2\x12\x07\n\x05\n\x03cat\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x00'
Parsed example:  {'feature0': 1, 'feature1': 0, 'feature2': b'cat', 'feature3': 0.47559783}


Apply the parse function to each item in the dataset using the `tf.data.Dataset.map()` method:

In [15]:
parsed_dataset = raw_dataset.map(_parse_function)

In [16]:
iterator = parsed_dataset.make_one_shot_iterator()
next_item = iterator.get_next()

with tf.Session() as sess:
    print(sess.run(next_item))

{'feature0': 1, 'feature1': 0, 'feature2': b'cat', 'feature3': 0.47559783}
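
For comparison, here is a sketch of the same map-and-iterate pattern without `tf.Session` or `make_one_shot_iterator`, following the `for ... in dataset:` advice in the deprecation message above. It assumes TensorFlow 2.x, where eager execution is on by default, and writes its own small temporary file; the feature values are made up.

```python
import os
import tempfile
import tensorflow as tf

# Write three small records to a temporary file.
path = os.path.join(tempfile.mkdtemp(), 'sketch.tfrecords')
with tf.io.TFRecordWriter(path) as writer:
    for i in range(3):
        example = tf.train.Example(features=tf.train.Features(feature={
            'feature1': tf.train.Feature(
                int64_list=tf.train.Int64List(value=[i])),
            'feature3': tf.train.Feature(
                float_list=tf.train.FloatList(value=[i * 0.5])),
        }))
        writer.write(example.SerializeToString())

description = {
    'feature1': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'feature3': tf.io.FixedLenFeature([], tf.float32, default_value=0.0),
}

def parse(example_proto):
    # Turn one serialized tf.Example into a dict of scalar tensors.
    return tf.io.parse_single_example(example_proto, description)

parsed_dataset = tf.data.TFRecordDataset(path).map(parse)

# Iterate directly; each element is a dict of eager tensors.
rows = [{key: value.numpy() for key, value in row.items()}
        for row in parsed_dataset]
print(rows)
```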
