
__TFRecord__ is TensorFlow's binary data format. A TFRecord file is a sequence of records, each of which typically holds a serialized __tf.train.Example__ *Protobuf object*.

__Protobuf (Protocol Buffers)__ is a method of serializing structured data, similar to Thrift, designed to be smaller and faster than XML. To use it, define data structures (called messages) and services in a proto definition file (.proto) and compile it with *protoc*.
- The binary format is not self-describing: there is no way to tell the names, meaning, or full datatypes of fields without an external specification (the .proto file). A human-readable ASCII serialization also exists.
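
To see why the binary format needs an external specification, here is a minimal sketch of how a single integer field is laid out on the wire. The helper names `encode_varint` and `encode_int_field` are made up for illustration and are not part of the Protobuf library. The serialized bytes carry only a field number and a wire type, never the field's name.

```python
def encode_varint(value):
    """Encode a non-negative integer as a Protobuf base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)

def encode_int_field(field_number, value):
    """Encode one varint field: a key (field number + wire type 0), then the value."""
    key = (field_number << 3) | 0    # wire type 0 means varint
    return encode_varint(key) + encode_varint(value)

# A message whose field number 1 is set to 150 serializes to three bytes.
wire_bytes = encode_int_field(1, 150)
print(wire_bytes.hex())  # 089601
```

Nothing in `08 96 01` says whether field 1 is called `label`, `width`, or anything else; that mapping lives only in the .proto file.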



### Example: encode and decode an image dataset
Encode:
1. Create a TFRecord file writer
2. Convert image to bytes
3. Create an instance of tf.train.Example (which becomes one record in the TFRecord file) and add label, shape, and image content to it.
4. Write via TFRecord file writer

Decode:
1. Create a queue of all files to be read
2. Create a TFRecord reader
3. Read from queue
4. Specify and parse feature types of the example
5. Cast each feature to proper types
6. Apply other properties that you already know about the data, such as shape

Keep in mind that the label, shape, and image returned are tensor objects. To get their values, you’ll have to evaluate them in a tf.Session() after starting the queue runners.
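
The "convert image to bytes" step and the later "cast" and "shape" steps amount to a byte round trip. The sketch below shows that round trip with NumPy alone, using a small synthetic array in place of a real image; it does not touch TensorFlow, and the variable names are illustrative.

```python
import numpy as np

# Stand-in for a decoded image: a 2 x 3 RGB array of uint8 pixels.
image = np.arange(18, dtype=np.uint8).reshape(2, 3, 3)

# Encode: keep the raw pixel bytes and, separately, the shape as int32 bytes.
shape_bytes = np.array(image.shape, np.int32).tobytes()
image_bytes = image.tobytes()

# Decode: the bytes alone are a flat buffer, so the dtype and the shape
# must be supplied again to rebuild the array.
shape = np.frombuffer(shape_bytes, dtype=np.int32)
restored = np.frombuffer(image_bytes, dtype=np.uint8).reshape(tuple(shape))

print(len(image_bytes), shape.tolist())  # 18 [2, 3, 3]
print(np.array_equal(image, restored))   # True
```

This is why the shape is stored next to the image: without it, the decoder only has a flat run of bytes.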

In [None]:

## ENCODE
import numpy as np
import tensorflow as tf
from PIL import Image

# First, we need to read in the image and convert it to a byte string
def get_image_binary(filename):
    image = Image.open(filename)
    image = np.asarray(image, np.uint8)
    shape = np.array(image.shape, np.int32)
    return shape.tobytes(), image.tobytes() # convert the shape and image arrays to raw bytes

def write_to_tfrecord(label, shape, binary_image, tfrecord_file):
    """ Write a single sample to a TFRecord file. To write more samples, just use a loop!
    label, shape, and binary_image must all be byte strings.
    """
    writer = tf.python_io.TFRecordWriter(tfrecord_file)  # Create a TFRecord writer
    # Create an instance of tf.train.Example (one record) and add label, shape, and image content to it
    example = tf.train.Example(features=tf.train.Features(feature={
        'label': tf.train.Feature(bytes_list=tf.train.BytesList(value=[label])),
        'shape': tf.train.Feature(bytes_list=tf.train.BytesList(value=[shape])),
        'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[binary_image]))
        }))
    # write via the TFRecord file writer
    writer.write(example.SerializeToString())
    writer.close()

#########################################################################
## DECODE
#
def read_from_tfrecord(filenames): 
    # create a queue of all files to be read
    tfrecord_file_queue = tf.train.string_input_producer(filenames, name = 'queue') 
    # Create a TFRecord reader
    reader = tf.TFRecordReader() 
    # Read from queue
    _, tfrecord_serialized = reader.read(tfrecord_file_queue)
    # label and image are stored as bytes but could be stored as int64/float values in a serialized tf.Example protobuf
    tfrecord_features = tf.parse_single_example(tfrecord_serialized, 
        features = {
            'label': tf.FixedLenFeature([], tf.string), 
            'shape': tf.FixedLenFeature([], tf.string), 
            'image': tf.FixedLenFeature([], tf.string), 
        }, name = 'features')
    # image was saved as uint8, so we have to decode as uint8.
    image = tf.decode_raw(tfrecord_features['image'], tf.uint8)
    shape = tf.decode_raw(tfrecord_features['shape'], tf.int32)
    # the image tensor is flattened out, so we have to reconstruct the shape
    image = tf.reshape(image, shape)
    label = tf.cast(tfrecord_features['label'], tf.string)
    return label, shape, image

# Multi-threading in TensorFlow

- Multiple threads prepare training examples and push them in the queue.
- A training thread executes a training op that dequeues mini-batches from the queue.

The Session object is designed to be multithreaded, so multiple threads can easily use the same session and run ops in parallel.

TensorFlow provides two classes to help with the threading: *tf.train.Coordinator* and *tf.train.QueueRunner*. These two classes are designed to be used together. 

- __Coordinator__ 
  - helps multiple threads stop together 
  - reports exceptions to a program that waits for them to finish. 

- __QueueRunner__ is used to create a number of threads cooperating to enqueue tensors in the same queue. The queues themselves provide the methods enqueue, enqueue_many, and dequeue. The available queue types are:
    - tf.FIFOQueue: a queue that dequeues elements in first-in, first-out order
    - tf.RandomShuffleQueue: dequeues elements in a random order
    - tf.PaddingFIFOQueue: a FIFOQueue that supports batching variable-sized tensors by padding
    - tf.PriorityQueue: a FIFOQueue whose enqueues and dequeues take another argument: the priority.

dequeue_many is not allowed unless the queue was created with fixed shapes. If you want to get multiple elements at once for your batch training, use *tf.train.batch*, or *tf.train.shuffle_batch* if you want your batch to be shuffled.
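
The producer and consumer pattern described above can be sketched with Python's standard library alone. This is only an analogy, not TensorFlow's API: `queue.Queue` stands in for a TensorFlow queue, plain threads stand in for the enqueue threads, and the constants are arbitrary.

```python
import queue
import threading

NUM_THREADS = 4
ITEMS_PER_THREAD = 25
BATCH_SIZE = 10

q = queue.Queue(maxsize=50)  # bounded, like the capacity of a TensorFlow queue

def producer(thread_id):
    # Each producer thread prepares examples and pushes them onto the shared queue.
    for i in range(ITEMS_PER_THREAD):
        q.put((thread_id, i))  # blocks while the queue is full

producers = [threading.Thread(target=producer, args=(t,)) for t in range(NUM_THREADS)]
for t in producers:
    t.start()

# The "training" loop: dequeue one element at a time and group them into mini-batches.
batches = []
for _ in range(NUM_THREADS * ITEMS_PER_THREAD // BATCH_SIZE):
    batches.append([q.get() for _ in range(BATCH_SIZE)])  # blocks while the queue is empty

for t in producers:
    t.join()

print(len(batches), len(batches[0]))  # 10 10
```

The order of elements inside each batch depends on thread scheduling, which is the same reason dequeued examples arrive in no guaranteed order when several enqueue threads feed one queue.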

### An example Coordinator+Queue

In [7]:
import numpy as np
import tensorflow as tf
N_SAMPLES = 1000
NUM_THREADS = 4

# Generating simple data
# create random samples with 4 features and random binary labels (0/1)
data = np.random.randn(N_SAMPLES, 4)
label = np.random.randint(0, 2, size = N_SAMPLES)

# Define queue
queue = tf.FIFOQueue(capacity = 50, dtypes =[tf.float32, tf.int32], shapes =[[4], []])
# Define enqueue and dequeue ops
enqueue_op = queue.enqueue_many([data, label])
data_sample_op, label_sample_op = queue.dequeue() # dequeue op

# Define QueueRunner on the queue with NUM_THREADS to do *enqueue* op
qr = tf.train.QueueRunner(queue, [enqueue_op] * NUM_THREADS)
with tf.Session() as sess:
    # Create a thread coordinator, launch the queue runner threads.
    coord = tf.train.Coordinator()
    # Create threads (enqueue) and assign their coordinator to the one just created.
    enqueue_threads = qr.create_threads(sess, coord = coord, start = True)
    for step in range(100): # do 100 iterations
        if coord.should_stop():
            break
        # dequeue one
        data_batch, label_batch = sess.run([data_sample_op, label_sample_op]) # dequeued tensors can be used like placeholders
        #print(data_batch)
        #print('--')
    coord.request_stop()
    coord.join(enqueue_threads)

### Example of using Coordinator to manage normal multi-threading in Python
Taken from [TensorFlow Threading_and_Queues](https://www.tensorflow.org/programmers_guide/threading_and_queues)

In [None]:
# example of using tf.train.Coordinator() for normal threads
import threading
import tensorflow as tf

def my_loop(coord):
    """
    thread body: loop until the coordinator indicates a stop was requested.
    if some condition becomes true, ask the coordinator to stop.
    """
    while not coord.should_stop():
        ... do something ...
        if ... some condition ...:
            coord.request_stop()

# main code: create a coordinator.
coord = tf.train.Coordinator()
# create 10 threads that run 'my_loop()'
# you can also create threads using QueueRunner as in the example above
threads = [threading.Thread(target=my_loop, args=(coord,)) for _ in range(10)]
# start the threads and wait for all of them to stop.
for t in threads:
    t.start()
coord.join(threads)
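
The loop above is pseudocode, so here is a runnable sketch of the same stop protocol that needs no TensorFlow. `SimpleCoordinator` is a hypothetical stand-in built on `threading.Event`; it mimics only should_stop, request_stop, and join, and leaves out the exception reporting that the real coordinator provides. The shared counter and its limit are made up to give the threads something to do.

```python
import threading

class SimpleCoordinator:
    """Hypothetical stand-in for a coordinator, built on threading.Event."""
    def __init__(self):
        self._stop_event = threading.Event()

    def should_stop(self):
        return self._stop_event.is_set()

    def request_stop(self):
        self._stop_event.set()

    def join(self, threads):
        for t in threads:
            t.join()

counter_lock = threading.Lock()
counter = 0

def my_loop(coord, limit):
    """Increment a shared counter until any thread sees it reach the limit."""
    global counter
    while not coord.should_stop():
        with counter_lock:
            if counter >= limit:
                coord.request_stop()  # every thread's loop sees this and exits
            else:
                counter += 1

coord = SimpleCoordinator()
threads = [threading.Thread(target=my_loop, args=(coord, 1000)) for _ in range(10)]
for t in threads:
    t.start()
coord.join(threads)

print(counter, coord.should_stop())  # 1000 True
```

One thread requesting a stop is enough: all ten loops check the same flag, so they finish together.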

## Read in Data
There are three ways to read in data:
- Constants: will seriously bloat your graph -- which you’ll see in assignment 2.
- Feed dict: has the drawback of first loading the data from storage to the client and then from the client to the workers, which can be slow, especially when the client and workers are on different machines. 
- Data Readers: load your data directly from storage to the workers. In theory, this means that the amount of data you can load is limited only by your storage and not by your device.
  - __TextLineReader:__ Outputs the lines of a file delimited by newlines, e.g. text files, CSV files
  - __FixedLengthRecordReader:__ Outputs fixed-length records, for files in which every record has the same length, e.g. each MNIST image is 28 x 28 pixels, each CIFAR-10 image 32 x 32 x 3
  - __WholeFileReader:__ Outputs the entire file content. This is useful when each file contains one sample
  - __TFRecordReader:__ Reads samples from TFRecord files (TensorFlow's own binary format)
  - __ReaderBase:__ Allows you to create your own readers
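
To make the idea of fixed-length records concrete, here is a plain-Python sketch that slices a binary stream into equal-sized records. It is not the TensorFlow reader. The record layout (one label byte followed by 32 x 32 x 3 pixel bytes) is an assumption chosen for illustration, and the data is synthetic.

```python
import io

LABEL_BYTES = 1
IMAGE_BYTES = 32 * 32 * 3          # one 32 x 32 RGB image, one byte per channel
RECORD_BYTES = LABEL_BYTES + IMAGE_BYTES

def read_fixed_length_records(stream, record_bytes):
    """Yield consecutive records of exactly record_bytes bytes from a binary stream."""
    while True:
        record = stream.read(record_bytes)
        if len(record) < record_bytes:  # end of stream (or a truncated final record)
            return
        yield record

# Synthetic stand-in for a data file: three records with labels 7, 2, 9 and constant pixels.
fake_file = io.BytesIO(
    b"".join(bytes([label]) + bytes([label]) * IMAGE_BYTES for label in (7, 2, 9))
)

labels = [record[0] for record in read_fixed_length_records(fake_file, RECORD_BYTES)]
print(labels)  # [7, 2, 9]
```

Because every record has the same size, the reader needs no delimiters: record boundaries follow from the byte count alone.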
 

To use a data reader, we first need to create a queue holding the names of all the files we want to read. This is done with __tf.train.string_input_producer__, which returns a queue of strings.

In [9]:
filename_queue = tf.train.string_input_producer(["heart.csv"])
reader = tf.TextLineReader(skip_header_lines=1) # skip header row
key, val = reader.read(filename_queue)

__Reader:__ My friend encouraged me to think of readers as ops that return a different value every time you
call them -- similar to Python generators. So when you call reader.read(), it’ll return a pair (key,
value), in which key identifies the file and record (useful for debugging if you have some
weird records), and value is a scalar string.
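
The generator analogy can be made literal. The sketch below is a plain Python generator, not the TensorFlow reader: `text_line_reader` is a made-up name, the CSV content is invented, and the `filename:line_number` key format simply mirrors the key printed in the example that follows.

```python
import io

def text_line_reader(filename, stream, skip_header_lines=0):
    """Yield (key, value) pairs, one per line, after skipping the header lines."""
    for line_number, line in enumerate(stream, start=1):
        if line_number <= skip_header_lines:
            continue
        yield "{}:{}".format(filename, line_number), line.rstrip("\n")

# Made-up two-column CSV content standing in for a real file.
csv_text = "age,label\n52,1\n63,0\n"
reader = text_line_reader("example.csv", io.StringIO(csv_text), skip_header_lines=1)

print(next(reader))  # ('example.csv:2', '52,1')
print(next(reader))  # ('example.csv:3', '63,0')
```

Each call to `next` advances the reader by one record, just as each run of the read op returns the next line.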

In [None]:
filename_queue = tf.train.string_input_producer(["heart.csv"])
reader = tf.TextLineReader(skip_header_lines=1)
key, value = reader.read(filename_queue)
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    for _ in range(1): # generate 1 example
        # use new names so the key and value ops are not overwritten by their results
        key_out, value_out = sess.run([key, value])
        print(value_out) # 144,0.01,4.41,28.61,Absent,55,28.87,2.06,63,1
        print(key_out) # heart.csv:2
    coord.request_stop()
    coord.join(threads)