# What is a TFRecord file?

TFRecord is the recommended file format for TensorFlow, as introduced in the TensorFlow guide on reading data (https://www.tensorflow.org/api_guides/python/reading_data#Reading_from_files).

Once you have prepared a TFRecord file, you can load it with the tf.contrib.data Dataset API:

In [None]:
import tensorflow as tf
dataset = tf.contrib.data.TFRecordDataset('filename.tfrecords')

Before making a TFRecord file, you should understand protocol buffers, because the records written to the file here are serialized protocol buffer messages.

According to [Google Protocol Buffers](https://developers.google.com/protocol-buffers/), protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data: like XML, but smaller, faster, and simpler.

Put simply, a protocol buffer can be regarded as a container class in which the input data is stored.

Converting data to a TFRecord file with a protocol buffer class takes the following steps:

1. Instantiate the protocol buffer class.
2. Fill in the fields of the instance with the data.
3. Serialize the instance with its SerializeToString method.
4. Write the serialized string to a TFRecord file with tf.python_io.TFRecordWriter.

# Let's examine the structure of the protocol buffer classes

Example has a single field, 'features', of type 'Features'.

In [None]:
message Example {
  Features features = 1;
};

Features has a single field, 'feature', of type 'map<string, Feature>'.

In [None]:
message Features {
  // Map from feature name to feature.
  map<string, Feature> feature = 1;
};

Feature has a single field, which is exactly one of the following:

1. bytes_list of BytesList type
2. float_list of FloatList type
3. int64_list of Int64List type

In [None]:
// Containers for non-sequential data.
message Feature {
  // Each feature can be exactly one kind.
  oneof kind {
    BytesList bytes_list = 1;
    FloatList float_list = 2;
    Int64List int64_list = 3;
  }
};

Each of these list types is defined as follows.

In [None]:
// Containers to hold repeated fundamental values.
message BytesList {
  repeated bytes value = 1;
}
message FloatList {
  repeated float value = 1 [packed = true];
}
message Int64List {
  repeated int64 value = 1 [packed = true];
}

So the innermost field we need to fill in is the 'value' field.

How do we fill in this field?

Just reference the field names from outermost to innermost and assign the value.

**Note: the 'map' type works like a Python dictionary, and the 'repeated' type works like a Python sequence.**

In [None]:
example = tf.train.Example() # instantiate the Example protocol buffer class, which has a 'features' field
example.features.feature['feature_name'].float_list.value[:] = [10,20,30] # assign a Python sequence, replacing the current contents

# Values can also be added incrementally; note that these calls add to the list assigned above
example.features.feature['feature_name'].float_list.value.append(10) # append one value
example.features.feature['feature_name'].float_list.value.extend([20,30]) # append every value in a list
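
The access pattern above can be pictured with plain Python containers. The following is only an analogy, not the real protobuf API: 'FeatureSketch' and its 'set' method are hypothetical names invented for this illustration. It models the map field as a dictionary that creates missing entries on first access, the repeated 'value' field as a list, and the oneof as a single slot that remembers which kind was set last.

```python
from collections import defaultdict


class FeatureSketch:
    """Hypothetical stand-in for Feature: holds exactly one kind of list."""

    KINDS = ("bytes_list", "float_list", "int64_list")

    def __init__(self):
        self.kind = None  # which member of the oneof is currently set
        self.value = []   # plays the role of the repeated 'value' field

    def set(self, kind, values):
        if kind not in self.KINDS:
            raise ValueError("unknown kind: {}".format(kind))
        self.kind = kind           # setting one kind replaces any other
        self.value = list(values)  # copy the sequence, like value[:] = ...


# map<string, Feature>: a missing key is created the first time it is accessed.
feature = defaultdict(FeatureSketch)
feature["image"].set("float_list", [0.0, 0.5, 1.0])
feature["label"].set("int64_list", [7])
feature["image"].value.extend([0.25])  # repeated fields grow like lists

print(sorted(feature))         # ['image', 'label']
print(feature["image"].value)  # [0.0, 0.5, 1.0, 0.25]
print(feature["label"].kind)   # int64_list
```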

# Let's make a practical example

**(code run on a 2013 MacBook Air)**

Make TFRecord files from the MNIST NumPy arrays.

In [1]:
import tensorflow as tf
import numpy as np
import time

In [2]:
(x_train, y_train), (x_test, y_test) = tf.contrib.keras.datasets.mnist.load_data()

In [3]:
print(x_train.shape, x_train.dtype)
print(y_train.shape, y_train.dtype)
print(x_test.shape, x_test.dtype)
print(y_test.shape, y_test.dtype)

(60000, 28, 28) uint8
(60000,) uint8
(10000, 28, 28) uint8
(10000,) uint8


In [4]:
x_train = x_train.astype(float) # cast from int to float
x_test = x_test.astype(float)

x_train = x_train / 255 # scale to 0-1 float value
x_test = x_test / 255

x_train = x_train.reshape(-1, 28*28) # reshape from (samples, height, width) to (samples, feature dimensions)
x_test = x_test.reshape(-1, 28*28)
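
The scaling and reshaping in the cell above can be checked by hand on a tiny input. This pure-Python sketch uses a made-up 2x2 'image' in place of a 28x28 one, and the variable names are illustrative only. It shows that dividing by 255 maps uint8 pixel values into the 0-1 range and that flattening concatenates the rows in order.

```python
# A hypothetical 2x2 image with uint8-style pixel values (0..255).
image = [[0, 51],
         [255, 102]]

# Scale to 0-1 and flatten row by row, mirroring x / 255 and reshape(-1, H*W).
flat = [pixel / 255 for row in image for pixel in row]

print(len(flat))  # 4 values, just as a 28x28 image yields 28*28 = 784
print(flat)       # [0.0, 0.2, 1.0, 0.4]
```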

In [5]:
train_writer = tf.python_io.TFRecordWriter(path='./train.tfrecords')
test_writer = tf.python_io.TFRecordWriter(path='./test.tfrecords')

start = time.time()
for x, y in zip(x_train, y_train):
    train_exam = tf.train.Example()
    train_exam.features.feature['image'].float_list.value[:] = x # 28*28 length float list
    train_exam.features.feature['label'].int64_list.value.append(y) # single int value
    train_writer.write(train_exam.SerializeToString())
for x, y in zip(x_test, y_test):
    test_exam = tf.train.Example()
    test_exam.features.feature['image'].float_list.value[:] = x
    test_exam.features.feature['label'].int64_list.value.append(y)
    test_writer.write(test_exam.SerializeToString())
    
train_writer.close()
test_writer.close()

print('{:.0f}s elapsed'.format(time.time()-start))

120s elapsed
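
Because the pixels are stored in a float_list, each value takes 4 bytes on disk instead of the 1 byte of the original uint8 array. The back-of-the-envelope estimate below counts only the image payload and ignores the label, the protobuf field tags, and the per-record framing, so the real files will be somewhat larger. The function name is a hypothetical helper for this calculation.

```python
BYTES_PER_FLOAT = 4         # a protobuf 'float' is a 32-bit value
VALUES_PER_IMAGE = 28 * 28  # 784 pixels per MNIST image


def image_payload_bytes(num_examples):
    """Approximate bytes used by the 'image' float_list values alone."""
    return num_examples * VALUES_PER_IMAGE * BYTES_PER_FLOAT


train_bytes = image_payload_bytes(60000)
test_bytes = image_payload_bytes(10000)
print(train_bytes / 1e6)  # 188.16 (MB)
print(test_bytes / 1e6)   # 31.36 (MB)
```

Storing the raw uint8 bytes in a bytes_list and decoding them at read time would need roughly a quarter of that space, at the cost of a different parsing step from the one used in this notebook.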


Now you have the MNIST train and test datasets in TFRecord format.

Build a simple neural network and evaluate it!

In [6]:
filename = tf.placeholder(dtype=tf.string, shape=[])
dataset_train = tf.contrib.data.TFRecordDataset(filename)

def _parse_function(example_serialized):
    features = {'image': tf.FixedLenFeature(shape=[28*28], dtype=tf.float32),
               'label': tf.FixedLenFeature(shape=[], dtype=tf.int64)}
    parsed_features = tf.parse_single_example(example_serialized, features)
    return parsed_features['image'], parsed_features['label']

dataset = dataset_train.map(_parse_function)

In [7]:
dataset = dataset.shuffle(60000)
dataset = dataset.batch(128)

iterator = dataset.make_initializable_iterator()
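
To make the effect of shuffle and batch concrete, here is a pure-Python sketch. It is not how the Dataset API is implemented: 'shuffle_then_batch' is a hypothetical helper that only mimics the observable result when the shuffle buffer is at least as large as the dataset, as it is for the 60000 training examples here.

```python
import random


def shuffle_then_batch(items, batch_size, seed=0):
    """Fully shuffle the items, then cut them into consecutive batches."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


batches = shuffle_then_batch(range(60000), batch_size=128)
print(len(batches))      # 469 batches per epoch
print(len(batches[-1]))  # 96: the last batch is smaller than 128
```

The smaller final batch is why each pass over the data ends with a batch that does not have 128 examples.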

In [8]:
dataset_batched = iterator.get_next()
features = dataset_batched[0]
labels = dataset_batched[1]

keep_prob = tf.placeholder(dtype=tf.float32, shape=[])
training = tf.placeholder(dtype=tf.bool, shape=[])

outputs = tf.layers.dense(features, 512, activation=tf.nn.relu)
outputs = tf.layers.dropout(outputs, rate=1-keep_prob, training=training)
outputs = tf.layers.dense(outputs, 512, activation=tf.nn.relu)
outputs = tf.layers.dropout(outputs, rate=1-keep_prob, training=training)
logits = tf.layers.dense(outputs, 10)

loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
update_step = optimizer.minimize(loss)
accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(logits, 1), labels), tf.float32))

global_initializer = tf.global_variables_initializer()
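
For readers who want to see what the loss and accuracy nodes compute, the following pure-Python sketch reproduces the arithmetic for a tiny batch. The function names are illustrative, not TensorFlow APIs, and the logits are made-up numbers with 3 classes instead of 10.

```python
import math


def sparse_softmax_cross_entropy(logits, label):
    """Cross-entropy of one example given raw logits and an integer label."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum = math.log(sum(math.exp(v - shift) for v in logits))
    return -(logits[label] - shift - log_sum)


def batch_loss_and_accuracy(batch_logits, labels):
    """Mean loss and fraction of correct argmax predictions over a batch."""
    losses = [sparse_softmax_cross_entropy(row, y)
              for row, y in zip(batch_logits, labels)]
    correct = [max(range(len(row)), key=row.__getitem__) == y
               for row, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses), sum(correct) / len(correct)


# Two hypothetical examples: the first is predicted correctly, the second is not.
logits = [[2.0, 0.5, 0.1], [0.0, 0.0, 0.0]]
labels = [0, 2]
loss, acc = batch_loss_and_accuracy(logits, labels)
print(acc)  # 0.5
```

With all-equal logits the per-example loss equals the logarithm of the number of classes, which is a handy sanity check for an untrained network.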

In [12]:
epochs = 10

with tf.Session() as sess:
    sess.run(global_initializer)
    for epoch in range(epochs):
        start = time.time()
        train_acc, train_loss = [], []
        sess.run(iterator.initializer, feed_dict={filename: './train.tfrecords'})
        while True:
            try:
                _, acc, _loss = sess.run([update_step, accuracy, loss], feed_dict={keep_prob: 0.7, training: True})
                train_acc.append(acc)
                train_loss.append(_loss)
            except tf.errors.OutOfRangeError:
                log = 'Epoch {}/{}, \n\ttrain_acc={:.3f}, train_loss={:.3f}, time={:.0f}s'
                print(log.format(epoch+1, epochs, np.mean(train_acc), np.mean(train_loss), time.time()-start))
                
                start = time.time()
                test_acc, test_loss = [], []
                sess.run(iterator.initializer, feed_dict={filename: './test.tfrecords'})
                while True:
                    try:
                        acc, _loss = sess.run([accuracy, loss], feed_dict={keep_prob: 1.0, training: False})
                        test_acc.append(acc)
                        test_loss.append(_loss)
                    except tf.errors.OutOfRangeError:
                        log = '\ttest_acc={:.3f}, test_loss={:.3f}, time={:.0f}s'
                        print(log.format(np.mean(test_acc), np.mean(test_loss), time.time()-start))
                        break
                
                break # go to next epoch

Epoch 1/10, 
	train_acc=0.917, train_loss=0.273, time=14s
	test_acc=0.967, test_loss=0.108, time=1s
Epoch 2/10, 
	train_acc=0.965, train_loss=0.116, time=13s
	test_acc=0.973, test_loss=0.085, time=1s
Epoch 3/10, 
	train_acc=0.974, train_loss=0.084, time=13s
	test_acc=0.977, test_loss=0.073, time=1s
Epoch 4/10, 
	train_acc=0.979, train_loss=0.067, time=12s
	test_acc=0.979, test_loss=0.068, time=1s
Epoch 5/10, 
	train_acc=0.982, train_loss=0.057, time=14s
	test_acc=0.982, test_loss=0.059, time=1s
Epoch 6/10, 
	train_acc=0.984, train_loss=0.050, time=15s
	test_acc=0.980, test_loss=0.072, time=1s
Epoch 7/10, 
	train_acc=0.986, train_loss=0.044, time=13s
	test_acc=0.982, test_loss=0.060, time=1s
Epoch 8/10, 
	train_acc=0.988, train_loss=0.037, time=13s
	test_acc=0.981, test_loss=0.066, time=1s
Epoch 9/10, 
	train_acc=0.988, train_loss=0.035, time=15s
	test_acc=0.982, test_loss=0.066, time=1s
Epoch 10/10, 
	train_acc=0.989, train_loss=0.034, time=13s
	test_acc=0.982, test_loss=0.068, time=1s
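
One detail of the loop above: np.mean over the per-batch accuracies gives every batch equal weight, although the final batch of each pass is smaller (16 examples for the 10000 test images with a batch size of 128). The sketch below uses made-up per-batch accuracies to show how a size-weighted mean would be computed. With dozens of batches per pass the difference is typically small, so the reported numbers are a close approximation.

```python
def weighted_mean(values, weights):
    """Mean of values where each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical accuracies for three batches; the last batch holds only 16 examples.
batch_acc = [0.75, 0.75, 0.5]
batch_size = [128, 128, 16]

plain = sum(batch_acc) / len(batch_acc)          # equal weight per batch
weighted = weighted_mean(batch_acc, batch_size)  # weight proportional to batch size
print(plain)
print(weighted)
```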