# TFRecord files using tf.data

The [tf.data](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data) module also provides tools for reading and writing data in TensorFlow.

## Import all the necessary libraries

In [1]:
import numpy as np
import tensorflow as tf

np.random.seed(5)

tf.__version__

'2.0.0'

## Utils

In [2]:
# The following functions can be used to convert a value to a type compatible
# with tf.Example.

def _bytes_feature(value):
    """Returns a bytes_list from a string/byte."""
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
    """Returns a float_list from a float/double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
    """Returns an int64_list from a bool/enum/int/uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
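
As a quick check of what these helpers produce, the sketch below builds one `tf.train.Feature` of each kind directly, using the same constructor calls that the helpers wrap. The sample values (`b'goat'`, `0.5`, `True`) are made up for illustration.

```python
import tensorflow as tf

# One Feature per list type, built with the same calls the helpers above wrap.
bytes_feat = tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'goat']))
float_feat = tf.train.Feature(float_list=tf.train.FloatList(value=[0.5]))
# A bool is stored as the integer 0 or 1.
int_feat = tf.train.Feature(int64_list=tf.train.Int64List(value=[int(True)]))

# A Feature holds exactly one of the three lists; WhichOneof reports which.
for feat in (bytes_feat, float_feat, int_feat):
    print(feat.WhichOneof('kind'))
```

Each `Feature` carries a list, so even a single scalar is wrapped as a one-element list.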

In [3]:
def serialize_example(feature0, feature1, feature2, feature3):
    """
    Creates a tf.Example message ready to be written to a file.
    """
    
    # Create a dictionary mapping the feature name to 
    # the tf.Example-compatible data type
    feature = {
        'feature0': _int64_feature(feature0),
        'feature1': _int64_feature(feature1),
        'feature2': _bytes_feature(feature2),
        'feature3': _float_feature(feature3),
    }
    
    # Create a Features message using tf.train.Example.
    example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
    return example_proto.SerializeToString()
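
The serialized result is an ordinary Python `bytes` object, and it can be decoded back into a message with the protobuf `FromString` class method. The following is a minimal sketch of that round trip; the two-feature example and its values are made up for illustration and are not the four-feature record used in this notebook.

```python
import tensorflow as tf

# A small, hypothetical example with two features.
feature = {
    'feature0': tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
    'feature2': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'goat'])),
}
example_proto = tf.train.Example(features=tf.train.Features(feature=feature))

# Serialize to a binary string, then decode it back into a tf.train.Example.
serialized = example_proto.SerializeToString()
decoded = tf.train.Example.FromString(serialized)

print(type(serialized))
print(decoded.features.feature['feature2'].bytes_list.value[0])
```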

## Writing a TFRecord file

The easiest way to get the data into a dataset is to use the `from_tensor_slices` method.
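
To illustrate how `from_tensor_slices` slices its inputs, here is a small sketch with two made-up arrays (`flags` and `scores` are hypothetical names). Each dataset element pairs the i-th entry of every array.

```python
import numpy as np
import tensorflow as tf

# Two hypothetical arrays with the same length along the first dimension.
flags = np.array([True, False, True])
scores = np.array([0.1, 0.2, 0.3])

# Slicing a tuple of arrays yields one tuple of scalars per row.
toy_dataset = tf.data.Dataset.from_tensor_slices((flags, scores))

pairs = [(bool(f.numpy()), float(s.numpy())) for f, s in toy_dataset]
print(pairs)
```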

### Create the observations data

In [4]:
# The number of observations in the dataset.
n_observations = int(1e4)

# Boolean feature, encoded as False or True.
feature0 = np.random.choice([False, True], n_observations)

# Integer feature, random from 0 to 4.
feature1 = np.random.randint(0, 5, n_observations)

# String feature
strings = np.array([b'cat', b'dog', b'chicken', b'horse', b'goat'])
feature2 = strings[feature1]

# Float feature, from a standard normal distribution
feature3 = np.random.randn(n_observations)

print(feature0[0])
print(feature1[0])
print(feature2[0])
print(feature3[0])

True
3
b'horse'
-1.5667003260129333


### Create a `Dataset` from the observations data in memory

In [5]:
# All feature arrays must have the same size along the first dimension.
features_dataset = tf.data.Dataset.from_tensor_slices((feature0, feature1, feature2, feature3))
features_dataset

<TensorSliceDataset shapes: ((), (), (), ()), types: (tf.bool, tf.int64, tf.string, tf.float64)>

Take one example from the dataset.

In [6]:
# Use `take(1)` to only pull one example from the dataset.
for f0, f1, f2, f3 in features_dataset.take(1):
    print(f0)
    print(f1)
    print(f2)
    print(f3)

tf.Tensor(True, shape=(), dtype=bool)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(b'horse', shape=(), dtype=string)
tf.Tensor(-1.5667003260129333, shape=(), dtype=float64)


### Create a tf.Example message from the `Dataset`

Use the [tf.data.Dataset.map](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data/Dataset#map) method to apply a function to each element of a Dataset.

The mapped function must operate in TensorFlow graph mode: it must take `tf.Tensor` arguments and return `tf.Tensor` values.

- A non-tensor function, like `serialize_example`, can be wrapped with [tf.py_function](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/py_function) to make it compatible.

In [7]:
# tf.py_function needs the shape and type information to be specified,
# because it is otherwise unavailable.
def tf_serialize_example(f0, f1, f2, f3):
    tf_string = tf.py_function(
        serialize_example,
        (f0, f1, f2, f3), # pass these args to the above function.
        tf.string,        # the return type is `tf.string`.
    )     
    
    return tf.reshape(tf_string, ()) # The result is a scalar

In [8]:
tf_serialize_example(f0, f1, f2, f3)

<tf.Tensor: id=25, shape=(), dtype=string, numpy=b'\nS\n\x15\n\x08feature2\x12\t\n\x07\n\x05horse\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\xa3\x89\xc8\xbf\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x03'>

#### Apply this function (`tf_serialize_example`) to each element in the dataset to get a new dataset

In [9]:
serialized_features_dataset = features_dataset.map(tf_serialize_example)
serialized_features_dataset

<MapDataset shapes: (), types: tf.string>
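
As an alternative to wrapping the Python function with `tf.py_function`, the serialized records can also be produced by a Python generator and turned into a dataset with `tf.data.Dataset.from_generator`. The sketch below uses a made-up two-feature record; `serialize_pair`, `toy_features`, and the feature names are hypothetical. It passes `output_types` and `output_shapes`, while newer TensorFlow releases recommend `output_signature` instead.

```python
import numpy as np
import tensorflow as tf

def serialize_pair(flag, label):
    """Serializes one (int, bytes) pair as a tf.train.Example (hypothetical helper)."""
    feature = {
        'flag': tf.train.Feature(int64_list=tf.train.Int64List(value=[flag])),
        'label': tf.train.Feature(bytes_list=tf.train.BytesList(value=[label])),
    }
    return tf.train.Example(
        features=tf.train.Features(feature=feature)).SerializeToString()

toy_features = tf.data.Dataset.from_tensor_slices(
    (np.array([1, 0, 1]), np.array([b'cat', b'dog', b'goat'])))

def generator():
    # Runs as plain Python, so .numpy() is available on each element.
    for flag, label in toy_features:
        yield serialize_pair(int(flag.numpy()), label.numpy())

serialized_toy = tf.data.Dataset.from_generator(
    generator, output_types=tf.string, output_shapes=())

records = [record.numpy() for record in serialized_toy]
print(len(records))
```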

### Write tf.Example to a TFRecord file

In [10]:
filename = 'test.tfrecord'
writer = tf.data.experimental.TFRecordWriter(filename)
writer.write(serialized_features_dataset)
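
Records can also be written one at a time from plain Python with `tf.io.TFRecordWriter`, which may be preferable on newer TensorFlow releases where, as far as I know, `tf.data.experimental.TFRecordWriter` is marked as deprecated. A minimal sketch, assuming a temporary file path and a made-up `index` feature:

```python
import os
import tempfile
import tensorflow as tf

# Hypothetical output path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'toy.tfrecord')

# The writer is a context manager; each write() call appends one record.
with tf.io.TFRecordWriter(path) as writer:
    for i in range(3):
        example = tf.train.Example(features=tf.train.Features(feature={
            'index': tf.train.Feature(int64_list=tf.train.Int64List(value=[i])),
        }))
        writer.write(example.SerializeToString())

# Read the file back to confirm how many records were written.
n_records = sum(1 for _ in tf.data.TFRecordDataset(path))
print(n_records)
```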

## Reading a TFRecord file

Read the TFRecord file back using the [tf.data.TFRecordDataset](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data/TFRecordDataset) class.

Using a `TFRecordDataset` can be useful for standardizing input data and optimizing performance.

In [11]:
filenames = [filename]
raw_dataset = tf.data.TFRecordDataset(filenames)
raw_dataset

<TFRecordDatasetV2 shapes: (), types: tf.string>

We now have a dataset that contains serialized [tf.train.Example](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/train/Example) messages. When iterated over, it returns them as scalar string tensors.

* Note: iterating over a [`tf.data.Dataset`](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data/Dataset) only works with eager execution enabled.

In [12]:
# Use the `.take` method to only show the first 5 records.
for raw_record in raw_dataset.take(5):
    print(repr(raw_record))

<tf.Tensor: id=60058, shape=(), dtype=string, numpy=b'\nS\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x03\n\x15\n\x08feature2\x12\t\n\x07\n\x05horse\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\xa3\x89\xc8\xbf'>
<tf.Tensor: id=60059, shape=(), dtype=string, numpy=b'\nR\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04@\xb9\x9b?'>
<tf.Tensor: id=60060, shape=(), dtype=string, numpy=b'\nU\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04H\x80\x87=\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x02\n\x17\n\x08feature2\x12\x0b\n\t\n\x07chicken'>
<tf.Tensor: id=60061, shape=(), dtype=string, numpy=b'\nQ\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x01\n\x13\n\x08feature2\x12\x07\n\x05\n\x03dog\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\x8e\x13\xf8?'>


### Parse the `Dataset`

These tensors can be parsed using the function below.

- Note that `feature_description` is necessary here because datasets use graph execution and need this description to build their shape and type signature:

In [13]:
# Create a description of the features.
feature_description = {
    'feature0': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'feature1': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'feature2': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'feature3': tf.io.FixedLenFeature([], tf.float32, default_value=0.0),
}

def _parse_function(example_proto):
    # Parse the input `tf.Example` proto using the dictionary above.
    return tf.io.parse_single_example(example_proto, feature_description)
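
`tf.io.parse_single_example` handles one scalar string at a time. For a batch of serialized records, `tf.io.parse_example` accepts a vector of strings and returns batched tensors. The sketch below shows this on three in-memory records; `make_example` and `toy_description` are hypothetical names, and only one feature is included to keep it short.

```python
import tensorflow as tf

def make_example(value):
    """Serializes a single-feature tf.train.Example (hypothetical helper)."""
    return tf.train.Example(features=tf.train.Features(feature={
        'feature1': tf.train.Feature(int64_list=tf.train.Int64List(value=[value])),
    })).SerializeToString()

# A 1-D string tensor holding three serialized examples.
serialized_batch = tf.constant([make_example(v) for v in (3, 4, 2)])

toy_description = {
    'feature1': tf.io.FixedLenFeature([], tf.int64, default_value=0),
}

# Each feature comes back with a leading batch dimension.
parsed_batch = tf.io.parse_example(serialized_batch, toy_description)
print(parsed_batch['feature1'].numpy())
```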

In [14]:
parsed_dataset = raw_dataset.map(_parse_function)
parsed_dataset

<MapDataset shapes: {feature0: (), feature1: (), feature2: (), feature3: ()}, types: {feature0: tf.int64, feature1: tf.int64, feature2: tf.string, feature3: tf.float32}>

In [15]:
for parsed_record in parsed_dataset.take(5):
    print(repr(parsed_record))

{'feature0': <tf.Tensor: id=60094, shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: id=60095, shape=(), dtype=int64, numpy=3>, 'feature2': <tf.Tensor: id=60096, shape=(), dtype=string, numpy=b'horse'>, 'feature3': <tf.Tensor: id=60097, shape=(), dtype=float32, numpy=-1.5667003>}
{'feature0': <tf.Tensor: id=60098, shape=(), dtype=int64, numpy=0>, 'feature1': <tf.Tensor: id=60099, shape=(), dtype=int64, numpy=4>, 'feature2': <tf.Tensor: id=60100, shape=(), dtype=string, numpy=b'goat'>, 'feature3': <tf.Tensor: id=60101, shape=(), dtype=float32, numpy=1.2165909>}
{'feature0': <tf.Tensor: id=60102, shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: id=60103, shape=(), dtype=int64, numpy=2>, 'feature2': <tf.Tensor: id=60104, shape=(), dtype=string, numpy=b'chicken'>, 'feature3': <tf.Tensor: id=60105, shape=(), dtype=float32, numpy=0.066162646>}
{'feature0': <tf.Tensor: id=60106, shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: id=60107, shape=(), dtype=int64, numpy=1>, 

In [16]:
for raw_record in raw_dataset.take(1):    
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    print(example)

features {
  feature {
    key: "feature0"
    value {
      int64_list {
        value: 1
      }
    }
  }
  feature {
    key: "feature1"
    value {
      int64_list {
        value: 3
      }
    }
  }
  feature {
    key: "feature2"
    value {
      bytes_list {
        value: "horse"
      }
    }
  }
  feature {
    key: "feature3"
    value {
      float_list {
        value: -1.5667003393173218
      }
    }
  }
}
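
The `feature3` value printed here differs slightly from the float64 value generated earlier (`-1.5667003260129333`) because a `FloatList` stores 32-bit floats. The sketch below reproduces that rounding with the standard-library `struct` module alone. It illustrates the precision loss and is not how TensorFlow performs the conversion internally.

```python
import struct

original = -1.5667003260129333  # float64 value produced by NumPy above

# Pack as a little-endian 32-bit float, then unpack back to a Python float.
packed = struct.pack('<f', original)
stored = struct.unpack('<f', packed)[0]

print(packed)  # the same four bytes appear in the serialized record
print(stored)  # the rounded value that the parsed Example reports
```

If full float64 precision matters, one option is to store the raw bytes of the value in a `BytesList` instead, at the cost of decoding it manually when reading.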

