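### Reading data
A TensorFlow program can read data in three ways:
1. Feeding: Python code supplies the data at every step of the TensorFlow program.
1. Reading from files: an input pipeline at the start of the TensorFlow graph reads the data from files.
1. Preloading: a constant or variable defined in the TensorFlow graph holds all of the data (suitable only for small datasets).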

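#### Feeding
TensorFlow's feed mechanism lets you inject data into any tensor in the computation graph, so Python code can place data directly into the graph. Supply the data through the feed_dict argument of the run() or eval() call that starts the computation.

Although you can feed a replacement value for any tensor, including constants and variables, the best practice is to use a placeholder op node. A placeholder exists solely to serve as the target of a feed: it is uninitialized when declared and holds no data, and TensorFlow raises an error if the graph is run without feeding it, so never forget to supply data for a placeholder.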
```
    with tf.Session():
      input = tf.placeholder(tf.float32)
      classifier = ...
      print(classifier.eval(feed_dict={input: my_python_preprocessing_fn()}))
```
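
The rule that a placeholder must always be fed can be illustrated without TensorFlow. The following is only a toy model in plain Python: the Placeholder class and the run() helper are hypothetical names invented for this sketch and are not part of any TensorFlow API.

```python
class Placeholder:
    """Hypothetical stand-in for a graph node that holds no data of its own."""

    def __init__(self, name):
        self.name = name


def run(compute, inputs, feed_dict=None):
    """Evaluate compute(*values), taking each input's value from feed_dict."""
    feed_dict = feed_dict or {}
    missing = [p.name for p in inputs if p not in feed_dict]
    if missing:
        # Mirrors the rule described above: an unfed placeholder is an error.
        raise ValueError("no value fed for placeholder(s): " + ", ".join(missing))
    return compute(*(feed_dict[p] for p in inputs))


x = Placeholder("x")

# Feeding a value for x lets the computation run.
doubled = run(lambda v: [2 * e for e in v], [x], feed_dict={x: [1.0, 2.0]})
print(doubled)

# Forgetting the feed raises an error instead of silently using stale data.
try:
    run(lambda v: v, [x])
    fed_error = None
except ValueError as err:
    fed_error = str(err)
    print("error:", fed_error)
```

Failing loudly on a missing feed is the design point here: a placeholder has no sensible default, so an error is safer than an arbitrary value.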


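#### Reading data from files
A typical file-reading pipeline consists of the following steps:
1. Create the list of filenames
1. Create the filename queue
1. Create the reader and decoder
1. Create the example queue

![image](./images/input_data_flowchart.png)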
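
The four steps can be sketched in plain Python to make the data flow concrete. This is a single-threaded illustration under assumptions, not TensorFlow's implementation: the temporary files, their contents, and the helper names read_lines and decode are all made up for the example.

```python
import os
import queue
import tempfile

# Hypothetical sample files, created only so that the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
contents = {"A.csv": "1 11111\n2 22222\n", "B.csv": "4 444444\n5 545555\n"}

# Step 1: the list of filenames.
filenames = []
for name, text in contents.items():
    path = os.path.join(tmp_dir, name)
    with open(path, "w") as f:
        f.write(text)
    filenames.append(path)

# Step 2: the filename queue.
filename_queue = queue.Queue()
for path in filenames:
    filename_queue.put(path)


# Step 3: a reader (yields raw lines) and a decoder (turns a line into fields).
def read_lines(path):
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")


def decode(line):
    label, value = line.split(" ")
    return float(label), float(value)


# Step 4: the example queue that a training loop would consume.
example_queue = queue.Queue()
while not filename_queue.empty():
    for line in read_lines(filename_queue.get()):
        example_queue.put(decode(line))

examples = [example_queue.get() for _ in range(example_queue.qsize())]
print(examples)
```

In the real pipeline each stage runs in its own threads, so reading and decoding overlap with training; the sketch runs the stages one after another only to keep the flow easy to follow.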

#### The three data file formats recommended by TensorFlow
![image](./images/three.png)

#### 1. Generate files that store the example data

```
import tensorflow as tf

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

num_shards = 2
instances_per_shard = 2
for i in range(num_shards):
    # Shard filename: shard index and shard count, zero-padded to 5 digits
    filename = 'data.tfrecords-%.5d-of-%.5d' % (i, num_shards)

    # Create the writer
    writer = tf.python_io.TFRecordWriter(filename)

    for j in range(instances_per_shard):
        # Each Example records only which file it belongs to (i)
        # and its position within that file (j).
        example = tf.train.Example(features=tf.train.Features(feature={
            'i': _int64_feature(i),
            'j': _int64_feature(j)}))
        writer.write(example.SerializeToString())

    writer.close()
```
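
To see which filenames the format string in the loop above produces, and why a pattern such as data.tfrecords-* finds them again, here is a small check in plain Python. Using fnmatch to stand in for glob-style matching is an assumption of this sketch; it does not call TensorFlow.

```python
import fnmatch

num_shards = 2

# Same format string as above: shard index and shard count, zero-padded to 5 digits.
shard_names = ['data.tfrecords-%.5d-of-%.5d' % (i, num_shards)
               for i in range(num_shards)]
print(shard_names)

# A glob-style pattern matches every shard and ignores unrelated files.
candidates = shard_names + ['notes.txt']
matched = fnmatch.filter(candidates, 'data.tfrecords-*')
print(matched)
```

Encoding both the shard index and the shard count in the name makes it easy to tell whether a set of files is complete.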

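#### 2. Read the files
1. tf.train.string_input_producer creates an input queue from the list of files passed as its argument. It handles multiple files at once, explicitly creates a Queue, and implicitly creates the QueueRunner that fills it.
1. The resulting queue can be consumed by several file-reading threads at the same time, and it distributes the files evenly among those threads.
1. Once every file in the queue has been processed, the files from the initial list are added to the queue again. The num_epochs parameter caps the number of passes over the file list; after all files have been used for that many epochs, further reads raise an OutOfRange error (tf.errors.OutOfRangeError).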
   
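>When num_epochs is passed to string_input_producer, the epoch count is kept in a local variable, so initialize with the following code:
```
    init_op = tf.group(tf.global_variables_initializer(),
                       tf.local_variables_initializer())
    sess.run(init_op)
```
>Otherwise an error is raised.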
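
The epoch behaviour described above can be modelled in a few lines of plain Python. This is a toy model, not TensorFlow code: FilenameQueue and OutOfRange are hypothetical names, with OutOfRange standing in for tf.errors.OutOfRangeError.

```python
class OutOfRange(Exception):
    """Hypothetical stand-in for tf.errors.OutOfRangeError."""


class FilenameQueue:
    """Toy filename queue that replays its file list for num_epochs passes."""

    def __init__(self, filenames, num_epochs=None):
        self.filenames = list(filenames)
        self.num_epochs = num_epochs  # None means: cycle forever
        self.epochs_done = 0
        self.pending = []

    def dequeue(self):
        if not self.pending:
            if self.num_epochs is not None and self.epochs_done >= self.num_epochs:
                raise OutOfRange("file list used for %d epochs" % self.num_epochs)
            # Re-enqueue the whole initial file list for the next epoch.
            self.pending = list(self.filenames)
            self.epochs_done += 1
        return self.pending.pop(0)


q = FilenameQueue(["A.csv", "B.csv"], num_epochs=2)
seen = []
try:
    while True:
        seen.append(q.dequeue())
except OutOfRange:
    print(seen)
```

With num_epochs=None the loop above would never terminate, which is why examples without an epoch limit read a fixed number of records instead.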

```
import tensorflow as tf

# Collect every file that matches the glob pattern
files = tf.train.match_filenames_once("data.tfrecords-*")

# Create an input queue from the list of files
filename_queue = tf.train.string_input_producer(files, shuffle=False)

# Create the reader
reader = tf.TFRecordReader()

# reader.read is an op that reads one record from the input queue
_, serialized_example = reader.read(filename_queue)

features = tf.parse_single_example(
      serialized_example,
      features={
          'i': tf.FixedLenFeature([], tf.int64),
          'j': tf.FixedLenFeature([], tf.int64),
      })

with tf.Session() as sess:
    # match_filenames_once stores its result in a local variable,
    # so local variables must be initialized as well
    sess.run([tf.global_variables_initializer(), tf.local_variables_initializer()])

    print(sess.run(files))

    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    for i in range(6):
        print(sess.run([features['i'], features['j']]))

    coord.request_stop()
    coord.join(threads)
```

```
[b'./data.tfrecords-00000-of-00002' b'./data.tfrecords-00001-of-00002']
[0, 0]
[0, 1]
[1, 0]
[1, 1]
[0, 0]
[0, 1]
```


#### 3. Group examples into batches (Batching)

```
# Continues from the previous cell: features is the parsed Example defined there
example, label = features['i'], features['j']
batch_size = 2
capacity = 1000 + 3 * batch_size
example_batch, label_batch = tf.train.batch([example, label], batch_size=batch_size, capacity=capacity)

with tf.Session() as sess:
    tf.global_variables_initializer().run()
    tf.local_variables_initializer().run()
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    for i in range(3):
        cur_example_batch, cur_label_batch = sess.run([example_batch, label_batch])
        print(cur_example_batch, cur_label_batch)
    coord.request_stop()
    coord.join(threads)
```


```
[1 1] [0 1]
[0 0] [0 1]
[1 1] [0 1]
```
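
What batching does to a stream of single examples can be shown in plain Python. This is a minimal sketch, not the queue-based implementation: the batch() helper is a made-up name, and dropping a trailing partial batch is a choice of the sketch.

```python
def batch(stream, batch_size):
    """Group consecutive (example, label) pairs into column-wise batches."""
    examples, labels = [], []
    for example, label in stream:
        examples.append(example)
        labels.append(label)
        if len(examples) == batch_size:
            yield examples, labels
            examples, labels = [], []
    # A trailing partial batch is dropped here; this is a choice of the sketch.


# (i, j) pairs like those stored in the shards: file index, position in the file.
stream = [(0, 0), (0, 1), (1, 0), (1, 1)]
batches = list(batch(stream, batch_size=2))
print(batches)
```

Each batch keeps the columns separate, which is why a batched read returns one array of examples and one array of labels.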


```
#coding=utf8
import tensorflow as tf

def main():
    filename_queue = tf.train.string_input_producer(['/tmp/A.csv','/tmp/B.csv'], num_epochs=2)
    reader = tf.TextLineReader()
    key, value = reader.read(filename_queue)

    with tf.Session() as sess:
        # num_epochs is tracked in a local variable, which must be initialized
        sess.run(tf.local_variables_initializer())
        tf.train.start_queue_runners()
        num_examples = 0
        try:
            while True:
                s_key, s_value = sess.run([key, value])
                print("key:", s_key, s_value)
                # Count inside the loop so that every record read is counted
                num_examples += 1
        except tf.errors.OutOfRangeError:
            print("There are", num_examples, "examples")

if __name__ == "__main__":
    main()
```

```
key: b'/tmp/A.csv:1' b'1 11111'
key: b'/tmp/A.csv:2' b'2 22222'
key: b'/tmp/A.csv:3' b'3 33333'
key: b'/tmp/B.csv:1' b'4 444444'
key: b'/tmp/B.csv:2' b'5 545555'
key: b'/tmp/B.csv:3' b'6 666666'
key: b'/tmp/A.csv:1' b'1 11111'
key: b'/tmp/A.csv:2' b'2 22222'
key: b'/tmp/A.csv:3' b'3 33333'
key: b'/tmp/B.csv:1' b'4 444444'
key: b'/tmp/B.csv:2' b'5 545555'
key: b'/tmp/B.csv:3' b'6 666666'
There are 12 examples
```

Because start_queue_runners() is called here without a Coordinator, the queue-runner threads are not stopped cleanly: when the session closes, their pending enqueue operations are cancelled and each thread logs the error. The log repeats the same error and traceback for every queue runner in the graph; one instance is shown here:

```
ERROR:tensorflow:Exception in QueueRunner: Enqueue operation was cancelled
	 [[{{node input_producer_3/input_producer_3_EnqueueMany}}]]

Exception in thread QueueRunnerThread-input_producer_3-input_producer_3/input_producer_3_EnqueueMany:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/queue_runner_impl.py", line 257, in _run
    enqueue_callable()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1287, in _single_operation_run
    self._call_tf_sessionrun(None, {}, [], target_list, None)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.CancelledError: Enqueue operation was cancelled
	 [[{{node input_producer_3/input_producer_3_EnqueueMany}}]]
```

#### Single reader, single example (batch_size=1)

```
#coding=utf8
import tensorflow as tf

# Create the filename queue
filenames = ['/tmp/A.csv','/tmp/B.csv']
filename_queue = tf.train.string_input_producer(filenames,shuffle=True)
# shuffle=True (the default) reads the files in random order

TFReader = tf.TextLineReader()
key,value = TFReader.read(filename_queue)

# Each line has two columns separated by a space, so field_delim=" " is required
label, example = tf.decode_csv(value, record_defaults=[[], []],field_delim=" ")
# record_defaults=[[], []] gives the default for each column; decode_csv returns
# one tensor per column. The default delimiter is a comma, and it can be changed.
# For details on tf.decode_csv() see https://www.tensorflow.org/versions/master/api_docs/python/tf/decode_csv

with tf.Session() as sess:
    init_op = tf.group(tf.global_variables_initializer(),
                       tf.local_variables_initializer())
    sess.run(init_op)
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess,coord=coord)
    for i in range(6):
        # Reading cycles through the files, even if they hold fewer lines than requested
        print("value:",sess.run([value,label]))

    coord.request_stop()
    coord.join(threads)
```

```
value: [b'4 444444', 4.0]
value: [b'5 545555', 5.0]
value: [b'6 666666', 6.0]
value: [b'1 11111', 1.0]
value: [b'2 22222', 2.0]
value: [b'3 33333', 3.0]
```
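
Why the delimiter has to be set explicitly is easy to check with Python's csv module. This is only an analogy in plain Python, assuming the same space-separated, two-column lines as in /tmp/A.csv; decode_line is a hypothetical helper, not a TensorFlow function.

```python
import csv


def decode_line(line, field_delim=" "):
    """Split one text line into float fields; every field is required."""
    fields = next(csv.reader([line], delimiter=field_delim))
    if any(f == "" for f in fields):
        raise ValueError("missing required field in %r" % line)
    return [float(f) for f in fields]


# With the space delimiter the line splits into two columns.
label, example = decode_line("4 444444")
print(label, example)

# With the default comma delimiter the whole line stays a single field.
comma_fields = next(csv.reader(["4 444444"]))
print(comma_fields)
```

A line that is parsed as one field cannot be unpacked into two columns, which is the kind of mismatch the explicit delimiter avoids.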


#### Single reader, multiple examples (batch_size)

```
#coding=utf8
import tensorflow as tf

# Create the filename queue
filenames = ['/tmp/A.csv','/tmp/B.csv']
filename_queue = tf.train.string_input_producer(filenames,shuffle=False)
# shuffle=False reads the files in order; the default, shuffle=True, reads them in random order

TFReader = tf.TextLineReader()
key,value = TFReader.read(filename_queue)

label,example = tf.decode_csv(value, record_defaults=[[], []], field_delim=" ")
# record_defaults=[[], []] gives the default for each column; decode_csv returns
# one tensor per column. The default delimiter is a comma, and it can be changed.

# In-order batching:
#label_batch, example_batch = tf.train.batch([label,example],
#                                           batch_size=10,
#                                           capacity=100,
#                                           num_threads=2)
# Shuffled batching:
label_batch, example_batch = tf.train.shuffle_batch([label,example],
                                                     batch_size=5,
                                                     capacity=100,
                                                     min_after_dequeue=50,
                                                     num_threads=2)

with tf.Session() as sess:
    init_op = tf.group(tf.global_variables_initializer(),
                       tf.local_variables_initializer())
    sess.run(init_op)
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess,coord=coord)
    for i in range(1):
        # Reading cycles through the files, even if they hold fewer lines than requested
        print(sess.run([label_batch,example_batch]))

    coord.request_stop()
    coord.join(threads)
```

```
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.shuffle(min_after_dequeue).batch(batch_size)`.
[array([5., 1., 2., 4., 4.], dtype=float32), array([545555.,  11111.,  22222., 444444., 444444.], dtype=float32)]
```
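
The role of min_after_dequeue can be illustrated with a small shuffle buffer in plain Python. This is a toy model of the idea, not TensorFlow's implementation: shuffle_stream is a made-up name, and draining the buffer once the input ends is a choice of the sketch.

```python
import random


def shuffle_stream(stream, min_after_dequeue, seed=None):
    """Yield items in random order while keeping a minimum-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        buffer.append(item)
        # Dequeue only if at least min_after_dequeue items remain afterwards.
        if len(buffer) > min_after_dequeue:
            yield buffer.pop(rng.randrange(len(buffer)))
    # Input exhausted: drain what is left, still in random order.
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))


shuffled = list(shuffle_stream(range(10), min_after_dequeue=3, seed=0))
print(shuffled)
```

A larger min_after_dequeue mixes the data more thoroughly, at the cost of more memory and a longer wait before the first item is available.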


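#### Multiple readers, multiple examples
1. tf.train.batch and tf.train.shuffle_batch read through a single reader, although they can use multiple threads.
1. tf.train.batch_join and tf.train.shuffle_batch_join support multiple readers, each running in its own thread.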
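
The idea of several readers, each in its own thread, feeding one shared example queue can be sketched with Python's standard threading and queue modules. This is a hypothetical illustration, not TensorFlow code: reader_thread and the in-memory sources are assumptions made for the example.

```python
import queue
import threading


def reader_thread(lines, out_queue):
    """One hypothetical reader: decodes its own lines and enqueues the examples."""
    for line in lines:
        label, value = line.split(" ")
        out_queue.put((float(label), float(value)))


# Two in-memory sources standing in for two input files.
sources = [["1 11111", "2 22222", "3 33333"],
           ["4 444444", "5 545555", "6 666666"]]
example_queue = queue.Queue()

# One thread per reader, all feeding the same example queue.
threads = [threading.Thread(target=reader_thread, args=(src, example_queue))
           for src in sources]
for t in threads:
    t.start()
for t in threads:
    t.join()

examples = [example_queue.get() for _ in range(example_queue.qsize())]
# Arrival order depends on thread scheduling, so sort before inspecting.
print(sorted(examples))
```

Because the readers run concurrently, examples from different sources can interleave in the shared queue, so the order within a batch is not guaranteed.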

```
#coding=utf8
import tensorflow as tf

# Create the filename queue
filenames = ['/tmp/A.csv','/tmp/B.csv']
filename_queue = tf.train.string_input_producer(filenames,shuffle=False)
# shuffle=False reads the files in order; the default, shuffle=True, reads them in random order

TFReader = tf.TextLineReader()
key,value = TFReader.read(filename_queue)

example_list = [tf.decode_csv(value, record_defaults=[[], []], field_delim=" ") for _ in range(2)]
# range(2) builds two decode branches, one per batch_join thread;
# note that both read through the single TFReader defined above

label_batch,example_batch = tf.train.batch_join(example_list, batch_size=5)
# tf.train.batch_join() reads the data in parallel, using one thread per entry in example_list.

with tf.Session() as sess:
    sess.run(tf.local_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess,coord=coord)
    for i in range(10):
        # Reading cycles through the files, even if they hold fewer lines than requested
        print(sess.run([label_batch,example_batch]))
    coord.request_stop()
    coord.join(threads)
```

[array([1., 2., 3., 4., 5.], dtype=float32), array([ 11111.,  22222.,  33333., 444444., 545555.], dtype=float32)]
[array([6., 1., 2., 3., 4.], dtype=float32), array([666666.,  11111.,  22222.,  33333., 444444.], dtype=float32)]
[array([5., 6., 1., 2., 3.], dtype=float32), array([545555., 666666.,  11111.,  22222.,  33333.], dtype=float32)]
[array([4., 5., 6., 1., 2.], dtype=float32), array([444444., 545555., 666666.,  11111.,  22222.], dtype=float32)]
[array([3., 4., 5., 6., 1.], dtype=float32), array([ 33333., 444444., 545555., 666666.,  11111.], dtype=float32)]
[array([2., 3., 4., 5., 6.], dtype=float32), array([ 22222.,  33333., 444444., 545555., 666666.], dtype=float32)]
[array([1., 2., 3., 4., 5.], dtype=float32), array([ 11111.,  22222.,  33333., 444444., 545555.], dtype=float32)]
[array([6., 1., 2., 3., 4.], dtype=float32), array([666666.,  11111.,  22222.,  33333., 444444.], dtype=float32)]
[array([5., 6., 1., 2., 3.], dtype=float32), array([545555., 666666.,  11111.,  22222., 