# Using TFRecords and tf.Example 

To read data efficiently, it can be helpful to serialize your data and store it in a set of files (100-200MB each) that can each be read linearly. This is especially true if the data is being streamed over a network.

- The TFRecord format is a simple format for storing a sequence of binary records.

- [Protocol buffers](https://developers.google.com/protocol-buffers/) are a cross-platform, cross-language library for efficient serialization of structured data.

- The `tf.Example` message (or protobuf) is a flexible message type that represents a `{"string": value}` mapping. It is designed for use with TensorFlow and is used throughout the higher-level APIs such as [TFX](https://www.tensorflow.org/tfx/).

This notebook will demonstrate how to create, parse, and use the `tf.Example` message, and then serialize, write, and read `tf.Example` messages to and from `.tfrecord` files.

## Import all the necessary libraries

In [1]:
import numpy as np
import tensorflow as tf

tf.__version__

'2.0.0'

## tf.Example

### Data types for `tf.Example`

Fundamentally, a `tf.Example` is a `{"string": tf.train.Feature}` mapping.

The [tf.train.Feature](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/train/Feature) message type can accept one of the following three types (see the [`.proto` file](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/feature.proto) for reference). Most other generic types can be coerced into one of these:

1. [tf.train.BytesList](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/train/BytesList) (the following types can be coerced)
    - `string (need to convert str to bytes)`
    - `byte`
   
2. [tf.train.FloatList](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/train/FloatList) (the following types can be coerced)

    - `float (float32)`
    - `double (float64)`

3. [tf.train.Int64List](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/train/Int64List) (the following types can be coerced)

    - `bool`
    - `enum`
    - `int32`
    - `uint32`
    - `int64`
    - `uint64`
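
Each of the three list types can also be constructed directly, and each one holds a list of values rather than a single value. The snippet below is a minimal sketch of that; the variable names and sample values are made up for illustration.

```python
import tensorflow as tf

# Sample values are illustrative only.
bytes_list = tf.train.BytesList(value=[b'cat', 'dog'.encode('utf-8')])
float_list = tf.train.FloatList(value=[0.5, 2.25])
int64_list = tf.train.Int64List(value=[int(True), 7])  # bool converted to 0 or 1

# A Feature holds exactly one of the three list types.
feature = tf.train.Feature(int64_list=int64_list)

print(list(bytes_list.value))
print(list(float_list.value))
print(list(int64_list.value))
print(feature.WhichOneof('kind'))  # name of the list type that is set
```

`WhichOneof('kind')` reports which of the three list fields is populated, which is handy when the type of a feature is not known in advance.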

In order to convert a standard TensorFlow type to a `tf.Example`-compatible [tf.train.Feature](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/train/Feature), you can use the shortcut functions below. Note that each function takes a scalar input value and returns a `tf.train.Feature` containing one of the three `list` types above:

In [2]:
# The following functions can be used to convert a value to a type compatible
# with tf.Example.

def _bytes_feature(value):
    """Returns a bytes_list from a string/byte."""
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
    """Return a float_list form a float/double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
    """Return a int64_list from a bool/enum/int/uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
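
The helpers above wrap a single scalar. One possible way to store a non-scalar value is to serialize the whole tensor to a byte string with `tf.io.serialize_tensor` and put that in a `BytesList`. The sketch below assumes eager execution (the TensorFlow 2 default) and uses a made-up 2x2 tensor.

```python
import tensorflow as tf

# Hypothetical 2x2 tensor standing in for a non-scalar feature.
tensor = tf.constant([[1.0, 2.0], [3.0, 4.0]], dtype=tf.float32)

# serialize_tensor returns a scalar string tensor; .numpy() yields raw bytes.
tensor_bytes = tf.io.serialize_tensor(tensor).numpy()
feature = tf.train.Feature(bytes_list=tf.train.BytesList(value=[tensor_bytes]))

# Reading the tensor back requires knowing the dtype that was serialized.
restored = tf.io.parse_tensor(feature.bytes_list.value[0], out_type=tf.float32)
print(restored.numpy())
```

The dtype is not recovered automatically, so it has to be recorded somewhere (for example, by convention per feature name) when the data is written.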

In [3]:
# tf.train.BytesList
print(_bytes_feature(b'test_string'))
print(_bytes_feature('test_string'.encode('utf8')))

# tf.train.FloatList
print(_float_feature(np.exp(1)))

# tf.train.Int64List
print(_int64_feature(True))
print(_int64_feature(1))

bytes_list {
  value: "test_string"
}

bytes_list {
  value: "test_string"
}

float_list {
  value: 2.7182817459106445
}

int64_list {
  value: 1
}

int64_list {
  value: 1
}
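
The float printed above differs slightly from `np.exp(1)` because `tf.train.FloatList` stores 32-bit floats, so a Python float (64-bit) is rounded when it is stored. The sketch below reproduces that rounding with only the standard library; `to_float32` is a hypothetical helper written for this illustration.

```python
import math
import struct

def to_float32(x):
    """Round a Python float (float64) to the nearest float32 and back."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

print(math.e)              # full float64 precision
print(to_float32(math.e))  # value after a round trip through float32
print(to_float32(0.5))     # exactly representable values are unchanged
```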



#### Serialized

All proto messages can be serialized to a binary-string using the `.SerializeToString` method.

In [4]:
feature = _float_feature(np.exp(1))
serialized_feature = feature.SerializeToString()

print(feature)
print("Serialized to binary-string: {}".format(serialized_feature))

float_list {
  value: 2.7182817459106445
}

Serialized to binary-string: b'\x12\x06\n\x04T\xf8-@'
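
To make the binary string less opaque, the bytes can be decoded by hand using the protocol buffer wire format. The sketch below is illustrative only: it assumes every tag and length fits in a single byte, which holds for this short message but not in general, and `split_tag` is a hypothetical helper.

```python
import struct

# Bytes copied from the serialized Feature printed above.
serialized_feature = b'\x12\x06\n\x04T\xf8-@'

def split_tag(tag_byte):
    """Split a one-byte protobuf tag into (field_number, wire_type)."""
    return tag_byte >> 3, tag_byte & 0x07

# Outer message (Feature): which field is set and how long is its payload.
outer_field, outer_wire = split_tag(serialized_feature[0])
outer_length = serialized_feature[1]
payload = serialized_feature[2:2 + outer_length]

# Inner message (FloatList): a packed run of little-endian float32 values.
inner_field, inner_wire = split_tag(payload[0])
inner_length = payload[1]
packed_floats = payload[2:2 + inner_length]

(value,) = struct.unpack('<f', packed_floats)

print(outer_field, outer_wire, outer_length)  # field 2 of Feature is float_list
print(inner_field, inner_wire, inner_length)  # field 1 of FloatList is value
print(value)
```

Wire type 2 means "length-delimited", which is used both for nested messages and for packed repeated fields.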


### Creating a tf.Example message

In this notebook, we will create a dataset using NumPy, and create a `tf.Example` message from this dataset. In practice, the dataset may come from anywhere, but the procedure of creating the `tf.Example` message from a single observation will be the same:

1. Each value needs to be converted to a [tf.train.Feature](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/train/Feature) containing one of the three compatible types, using one of the functions above.
2. You create a map (dictionary) from the feature name string to the encoded feature value produced in step 1.
3. The map produced in step 2 is converted to a [Features message](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/feature.proto#L85).

In [5]:
def create_tf_example(feature0, feature1, feature2, feature3):
    """
    Creates a tf.Example message ready to be written to a file.
    """
    
    # Create a dictionary mapping the feature name to 
    # the tf.Example-compatible data type
    feature = {
        'feature0': _int64_feature(feature0),
        'feature1': _int64_feature(feature1),
        'feature2': _bytes_feature(feature2),
        'feature3': _float_feature(feature3),
    }
    
    # Wrap the dictionary in a Features message and build the tf.train.Example.
    example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
    return example_proto

def serialize_example(example_proto):
    return example_proto.SerializeToString()

In [6]:
example_proto = create_tf_example(False, 4, b'goat', 0.9876)
serialized_example = serialize_example(example_proto)

print(serialized_example)
print()
print(example_proto)

b'\nR\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04[\xd3|?\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00'

features {
  feature {
    key: "feature0"
    value {
      int64_list {
        value: 0
      }
    }
  }
  feature {
    key: "feature1"
    value {
      int64_list {
        value: 4
      }
    }
  }
  feature {
    key: "feature2"
    value {
      bytes_list {
        value: "goat"
      }
    }
  }
  feature {
    key: "feature3"
    value {
      float_list {
        value: 0.9876000285148621
      }
    }
  }
}



#### Decode the serialized message

In [7]:
decode_example = tf.train.Example.FromString(serialized_example)
decode_example

features {
  feature {
    key: "feature0"
    value {
      int64_list {
        value: 0
      }
    }
  }
  feature {
    key: "feature1"
    value {
      int64_list {
        value: 4
      }
    }
  }
  feature {
    key: "feature2"
    value {
      bytes_list {
        value: "goat"
      }
    }
  }
  feature {
    key: "feature3"
    value {
      float_list {
        value: 0.9876000285148621
      }
    }
  }
}
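
Once decoded, the individual values can be read back through the message's fields. The sketch below rebuilds the same example so that it runs on its own; `example_to_dict` is a hypothetical helper, not part of TensorFlow, and it assumes every feature has one of the three list types set.

```python
import tensorflow as tf

# Rebuild the example from above so the snippet is self-contained.
example = tf.train.Example(features=tf.train.Features(feature={
    'feature0': tf.train.Feature(int64_list=tf.train.Int64List(value=[0])),
    'feature1': tf.train.Feature(int64_list=tf.train.Int64List(value=[4])),
    'feature2': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'goat'])),
    'feature3': tf.train.Feature(float_list=tf.train.FloatList(value=[0.9876])),
}))
decoded = tf.train.Example.FromString(example.SerializeToString())

def example_to_dict(example_proto):
    """Convert a tf.train.Example into {feature name: list of Python values}."""
    result = {}
    for name, feature in example_proto.features.feature.items():
        kind = feature.WhichOneof('kind')  # 'bytes_list', 'float_list' or 'int64_list'
        result[name] = list(getattr(feature, kind).value)
    return result

values = example_to_dict(decoded)
print(values['feature1'])
print(values['feature2'])
```

Every entry comes back as a list, even for scalar features, because each `tf.train.Feature` wraps a list type.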