
__TFRecord__ is TensorFlow's binary data format. A TFRecord file stores a sequence of records, each of which is typically a serialized __tf.train.Example__ *Protobuf object*.

__Protobuf (Protocol Buffers)__ is a method of serializing structured data, similar to Thrift, designed to be smaller and faster than XML. To use it, define data structures (called messages) and services in a proto definition file (.proto) and compile it with *protoc*.
- The binary encoding is not self-describing: there is no way to tell the names, meaning, or full datatypes of fields without an external specification (the .proto file). An ASCII (text) serialization also exists for human-readable output.
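
To make the "external specification" point concrete, here is a loose analogy using Python's standard `struct` module rather than Protobuf itself. Protobuf's wire format does carry field numbers and coarse wire types, so the comparison is imperfect, but the idea is similar: the bytes alone do not say what the fields are called or how they should be interpreted. The record layout and field names below are made up for illustration.

```python
import struct

# Hypothetical record layout, known only to whoever holds this "schema":
# little-endian int32 label, then two float32 values (height, width).
SCHEMA = "<iff"

payload = struct.pack(SCHEMA, 7, 28.0, 28.0)  # 12 raw bytes, no names inside

# With the right layout the values come back as written.
label, height, width = struct.unpack(SCHEMA, payload)

# The same bytes also decode without error under a different layout,
# which is why the reader must know the specification in advance.
as_ints = struct.unpack("<iii", payload)
```

Only the first field survives the wrong layout here, because an int32 was read as an int32; the two floats come back as unrelated integers.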



### Example: encode and decode an image dataset
Encode:
1. Create a TFRecord file writer
2. Convert image to bytes
3. Create an instance of tf.train.Example (the Protobuf message stored in each record) and add label, shape, and image content to it.
4. Write via TFRecord file writer

Decode:
1. Create a queue of all files to be read
2. Create a TFRecord reader
3. Read from queue
4. Specify and parse feature types of the example
5. Cast each feature to proper types
6. Apply other characteristics that you already should know about the data such as shape

Keep in mind that label, shape, and image returned are tensor objects. To get their values, you’ll have to eval them in tf.Session().
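
The byte-level part of these steps can be tried without TensorFlow. The following is a minimal NumPy-only sketch, assuming a made-up 2x3 RGB array in place of a real image file: it turns the shape and the pixels into byte strings, then reverses the process by reading each byte string back with the dtype it was written with and restoring the shape.

```python
import numpy as np

# Made-up 2x3 RGB "image" standing in for a file loaded from disk.
image = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)

# Encode: both the shape and the pixels become raw byte strings.
shape_bytes = np.array(image.shape, np.int32).tobytes()
image_bytes = image.tobytes()

# Decode: reinterpret each byte string with the dtype it was written with.
shape = np.frombuffer(shape_bytes, dtype=np.int32)
flat = np.frombuffer(image_bytes, dtype=np.uint8)

# The flat buffer does not remember its dimensions, so restore them.
restored = flat.reshape(tuple(shape))
```

This is why the shape is stored alongside the image: the decoded pixel buffer is one-dimensional until it is reshaped.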

In [None]:

## ENCODE
from PIL import Image
import numpy as np
import tensorflow as tf

# First, we need to read in the image and convert it to a byte string
def get_image_binary(filename):
    image = Image.open(filename)
    image = np.asarray(image, np.uint8)
    shape = np.array(image.shape, np.int32)
    return shape.tobytes(), image.tobytes()  # raw bytes of the shape and of the pixel data

def write_to_tfrecord(label, shape, binary_image, tfrecord_file):
    """ Write a single sample to a TFRecord file. To write more samples, just use a loop!
    label, shape, and binary_image must all be byte strings.
    """
    writer = tf.python_io.TFRecordWriter(tfrecord_file)  # Create a TFRecord writer
    # Create an instance of tf.train.Example (the protobuf message stored as one record)
    # and add label, shape, and image content to it
    example = tf.train.Example(features=tf.train.Features(feature={
        'label': tf.train.Feature(bytes_list=tf.train.BytesList(value=[label])),
        'shape': tf.train.Feature(bytes_list=tf.train.BytesList(value=[shape])),
        'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[binary_image]))
        }))
    # Write via the TFRecord file writer
    writer.write(example.SerializeToString())
    writer.close()

#########################################################################
## DECODE
#
def read_from_tfrecord(filenames): 
    # create a queue of all files to be read
    tfrecord_file_queue = tf.train.string_input_producer(filenames, name = 'queue') 
    # Create a TFRecord reader
    reader = tf.TFRecordReader() 
    # Read from queue
    _, tfrecord_serialized = reader.read(tfrecord_file_queue)
    # label, shape, and image are stored as bytes here, but a serialized tf.train.Example protobuf can also hold int64 or float values
    tfrecord_features = tf.parse_single_example(tfrecord_serialized, 
        features = {
            'label': tf.FixedLenFeature([], tf.string), 
            'shape': tf.FixedLenFeature([], tf.string), 
            'image': tf.FixedLenFeature([], tf.string), 
        }, name = 'features')
    # image was saved as uint8, so we have to decode as uint8.
    image = tf.decode_raw(tfrecord_features['image'], tf.uint8)
    shape = tf.decode_raw(tfrecord_features['shape'], tf.int32)
    # the image tensor is flattened out, so we have to reconstruct the shape
    image = tf.reshape(image, shape)
    label = tf.cast(tfrecord_features['label'], tf.string)
    return label, shape, image

In [4]:
import numpy as np
import tensorflow as tf
N_SAMPLES = 1000
NUM_THREADS = 4
# Generating some simple data
# create 1000 random samples, each a 1D array of 4 values drawn from a
# normal distribution with mean 1 and standard deviation 10
data = 10 * np.random.randn(N_SAMPLES, 4) + 1
# create 1000 random labels of 0 and 1
target = np.random.randint(0, 2, size=N_SAMPLES)
queue = tf.FIFOQueue(capacity=50, dtypes=[tf.float32, tf.int32], shapes=[[4], []])
enqueue_op = queue.enqueue_many([data, target])
dequeue_op = queue.dequeue()
# create NUM_THREADS threads to do the enqueueing
qr = tf.train.QueueRunner(queue, [enqueue_op] * NUM_THREADS)
with tf.Session() as sess:
    # Create a coordinator, launch the queue runner threads.
    coord = tf.train.Coordinator()
    enqueue_threads = qr.create_threads(sess, coord=coord, start=True)
    for step in range(100):  # do 100 iterations
        if coord.should_stop():
            break
        data_batch, label_batch = sess.run(dequeue_op)
    coord.request_stop()
    coord.join(enqueue_threads)

In [None]:
# example of using tf.train.Coordinator() for normal threads

import threading
import tensorflow as tf

# thread body: loop until the coordinator indicates a stop was requested.
# if some condition becomes true, ask the coordinator to stop.
def my_loop(coord, max_steps=1000):
    step = 0
    while not coord.should_stop():
        step += 1              # placeholder for "... do something ..."
        if step >= max_steps:  # placeholder for "... some condition ..."
            coord.request_stop()

# main code: create a coordinator.
coord = tf.train.Coordinator()
# create 10 threads that run 'my_loop()'
# you can also create threads using QueueRunner as in the example above
threads = [threading.Thread(target=my_loop, args=(coord,)) for _ in range(10)]
# start the threads and wait for all of them to stop.
for t in threads:
    t.start()
coord.join(threads)

We’ve learned that there are 3 different ways to read data into a TensorFlow program. The first is
through constants (which will seriously bloat your graph, as you’ll see in assignment 2).
The second is through feed_dict, which has the drawback of first loading the data from storage to
the client and then from the client to the workers; this can be slow, especially when the client and
workers are on different machines. The third, and a common practice, is to use data readers to load
your data directly from storage to the workers. In theory, this means the amount of data you can
load is limited only by your storage and not by your device.

There are built-in readers for several common data types. The most versatile one is
TextLineReader, which reads any file delimited by newlines and returns one line of that file
with each call. There is also a reader for files made of fixed-length records, a reader for
entire files, and a reader for files in the TFRecord format.

tf.TextLineReader
Outputs the lines of a file delimited by newlines
E.g. text files, CSV files

tf.FixedLengthRecordReader
Outputs fixed-length records from a file. This is useful when every record has the same size
E.g. each MNIST image has 28 x 28 pixels, each CIFAR-10 image 32 x 32 x 3

tf.WholeFileReader
Outputs the entire file content. This is useful when each file contains one sample

tf.TFRecordReader
Reads samples from TensorFlow's own binary format (TFRecord)

tf.ReaderBase
Allows you to create your own readers

To use a data reader, we first need to create a queue that holds the names of all the files we want
to read, using tf.train.string_input_producer.

filename_queue = tf.train.string_input_producer(["heart.csv"])
reader = tf.TextLineReader(skip_header_lines=1)

Setting skip_header_lines=1 means the reader skips the first line of every file in the queue.

My friend encouraged me to think of readers as ops that return a different value every time you
call them, similar to Python generators. So when you call reader.read(), it returns a pair (key,
value), in which key identifies the file and record (useful for debugging if you have some
weird records) and value is a scalar string.
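
The generator analogy can be sketched in plain Python. This is only an illustration of the idea, not how TensorFlow implements readers: `read_lines` is a hypothetical helper, the key format `filename:line_number` is chosen here for readability, and the CSV contents are made up. Unlike a generator, reader.read() is a graph op, so its values only appear when it is run in a session.

```python
import os
import tempfile

def read_lines(filenames, skip_header_lines=0):
    """Yield (key, value) pairs, one per line, across all files.

    Hypothetical stand-in for a line reader; the key identifies the
    file and the line the value came from.
    """
    for filename in filenames:
        with open(filename) as f:
            for line_number, line in enumerate(f, start=1):
                if line_number <= skip_header_lines:
                    continue  # the header is skipped in every file
                yield "%s:%d" % (filename, line_number), line.rstrip("\n")

# Write a tiny CSV with made-up contents so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "toy.csv")
with open(path, "w") as f:
    f.write("age,label\n52,1\n61,0\n")

records = read_lines([path], skip_header_lines=1)
first_key, first_value = next(records)    # first data row, header skipped
second_key, second_value = next(records)  # the next call gives the next row
```

Each call to `next(records)` plays the role of one evaluation of reader.read(): same call, different record.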

## ByteString

In [9]:
a = np.array([[1,2,3],[7,8,9]])

In [11]:
b = a.tobytes()

In [14]:
c = np.array(b)
print(c)

array('\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x07\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\t',
      dtype='|S48')

In [16]:
dt = np.dtype(int)
np.frombuffer(b, dtype=dt)

array([1, 2, 3, 7, 8, 9])
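
The cells above suggest that the raw bytes carry neither the dtype nor the shape of the original array. A small sketch of that point follows, assuming the dtype is pinned to int64 (the cells above use the platform default `int`, which is not 8 bytes everywhere): decoding with the original dtype recovers the values, while decoding with a different dtype raises no error and gives a different number of elements.

```python
import numpy as np

# dtype pinned explicitly so the byte count is the same on every platform
a = np.array([[1, 2, 3], [7, 8, 9]], dtype=np.int64)
b = a.tobytes()  # 6 values * 8 bytes each = 48 raw bytes

# Correct round trip: same dtype, then restore the 2x3 shape.
same = np.frombuffer(b, dtype=np.int64).reshape(a.shape)

# Wrong dtype: no error is raised, but the 48 bytes are now read
# as twelve 4-byte integers instead of six 8-byte ones.
wrong = np.frombuffer(b, dtype=np.int32)
```

This is the same reason the decode step earlier has to name tf.uint8 and tf.int32 explicitly when calling tf.decode_raw.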