## Input pipeline

The input pipeline is the process that loads data from files and feeds it into a machine learning model. Because machine learning projects push large amounts of data through this pipeline, it needs to be as efficient as possible.

![title](img/input_pipeline.png)

Google’s protocol buffer is a flexible and efficient format for storing large amounts of data. Like JSON and XML, it represents structured data, but it uses a compact binary encoding that typically takes less space and is faster to parse. When used with TensorFlow, protocol buffers make the input pipeline for large datasets much more streamlined.

## TensorFlow protocol buffer

Since protocol buffers use a structured format when storing data, they can be represented with Python classes. In TensorFlow, the tf.train.Example class represents the protocol buffer used to store data for the input pipeline.

Each individual tf.train.Example object describes data for a single dataset observation (e.g. a single row in a data table). We convert raw data to a protocol buffer by initializing a tf.train.Example object with the data’s values.

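When we initialize a tf.train.Example object, we set its features argument to a tf.train.Features object. The tf.train.Features object is in turn initialized by setting its feature field to a dictionary that maps each feature name to a tf.train.Feature object. Each tf.train.Feature wraps exactly one of three list types: tf.train.Int64List for integers, tf.train.FloatList for floats, or tf.train.BytesList for byte strings. The list is passed through the int64_list, float_list, or bytes_list field, respectively.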

In [4]:
import tensorflow as tf

# Each value list must be wrapped in a tf.train.Feature before it goes
# into the feature dictionary; passing a bare Int64List or FloatList
# raises a TypeError.
age = tf.train.Feature(
    int64_list=tf.train.Int64List(value=[12]))
weight = tf.train.Feature(
    float_list=tf.train.FloatList(value=[88.19999694824219]))

# Map feature names to tf.train.Feature objects.
f_dict = {
    'age': age,
    'weight': weight
}
features = tf.train.Features(feature=f_dict)
ex = tf.train.Example(features=features)
print(repr(ex))
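As a rough sketch of how this extends to a whole table, the snippet below converts a few rows into tf.train.Example objects and round-trips one of them through its serialized byte form. The `to_feature` and `row_to_example` helpers and the sample rows are hypothetical, introduced here only for illustration. `SerializeToString` and `FromString` are the standard protocol buffer message methods.

```python
import tensorflow as tf


def to_feature(value):
    """Wrap one Python value in a tf.train.Feature (hypothetical helper)."""
    if isinstance(value, int):
        return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
    if isinstance(value, float):
        return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))
    if isinstance(value, str):
        value = value.encode('utf-8')  # BytesList only accepts bytes
    if isinstance(value, bytes):
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
    raise TypeError(f'unsupported type: {type(value).__name__}')


def row_to_example(row):
    """Build a tf.train.Example from a dict of feature name -> value."""
    features = tf.train.Features(
        feature={name: to_feature(value) for name, value in row.items()})
    return tf.train.Example(features=features)


# Made-up sample rows, one dict per dataset observation.
rows = [
    {'age': 12, 'weight': 88.2},
    {'age': 35, 'weight': 154.5},
]
examples = [row_to_example(row) for row in rows]

# Serialize the first example to bytes, then parse it back.
serialized = examples[0].SerializeToString()
restored = tf.train.Example.FromString(serialized)

age = restored.features.feature['age'].int64_list.value[0]
weight = restored.features.feature['weight'].float_list.value[0]
print(age, weight)
```

Note that tf.train.FloatList stores 32-bit floats, so the restored weight is only approximately equal to 88.2.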

## Bytes and text

When a dataset contains raw bytes (e.g. images or videos) or text (e.g. articles or sentences), it helps to read each data file once up front and store its contents in the bytes_list field of a tf.train.Feature. Text must be encoded to bytes first. The input pipeline then no longer has to open each individual file, which can drastically improve efficiency.

In [17]:
import tensorflow as tf

# Read the text file and split it into whitespace-separated words.
with open('text/story.txt') as f:
    words = f.read().split()

# BytesList only accepts bytes, so encode each word (UTF-8 by default).
encw = [w.encode() for w in words]
words_feature = tf.train.Feature(
    bytes_list=tf.train.BytesList(value=encw))
print(repr(words_feature))

# Read the image in binary mode; the whole file becomes a single value.
with open('img/input_pipeline.png', 'rb') as f:
    img_bytes = f.read()

img_feature = tf.train.Feature(
    bytes_list=tf.train.BytesList(value=[img_bytes]))
print(repr(img_feature))
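The cell above depends on files that may not exist on your machine. As a minimal, self-contained sketch of the same idea, the snippet below first writes two stand-in files to a temporary directory and then reads them back into bytes features. The file names and contents are made up for illustration.

```python
import os
import tempfile

import tensorflow as tf

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-ins for a text file and a binary (e.g. image) file.
    text_path = os.path.join(tmp_dir, 'story.txt')
    blob_path = os.path.join(tmp_dir, 'blob.bin')
    with open(text_path, 'w', encoding='utf-8') as f:
        f.write('once upon a time')
    with open(blob_path, 'wb') as f:
        f.write(bytes([0, 255, 16, 32]))

    # Read both files once, before the temporary directory is removed.
    with open(text_path, encoding='utf-8') as f:
        words = f.read().split()
    with open(blob_path, 'rb') as f:
        raw_bytes = f.read()

# One bytes value per word.
words_feature = tf.train.Feature(
    bytes_list=tf.train.BytesList(value=[w.encode('utf-8') for w in words]))
# The whole binary file as a single bytes value.
raw_feature = tf.train.Feature(
    bytes_list=tf.train.BytesList(value=[raw_bytes]))

# Decode the stored words back to strings to check the round trip.
decoded = [b.decode('utf-8') for b in words_feature.bytes_list.value]
print(decoded)
print(len(raw_feature.bytes_list.value[0]))
```

Because the file contents now live inside the features, nothing downstream needs access to the original files.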