# [Datasets for Estimators](https://www.tensorflow.org/guide/datasets_for_estimators)

In [5]:
import tensorflow as tf

The [tf.data](https://www.tensorflow.org/api_docs/python/tf/data) module contains a collection of classes that allow you to easily load data, manipulate it, and pipe it into your model. This document introduces the API by walking through two simple examples:
   - Reading in-memory data from NumPy arrays.
   - Reading lines from a CSV file.

## <a name="basic-input"></a>Basic input
Taking slices from an array is the simplest way to get started with [tf.data](https://www.tensorflow.org/api_docs/python/tf/data).

The [Premade Estimators](https://www.tensorflow.org/guide/premade_estimators) chapter describes the following `train_input_fn`, from [iris_data.py](https://github.com/tensorflow/models/blob/master/samples/core/get_started/iris_data.py), to pipe the data into the Estimator:
```python
def train_input_fn(features, labels, batch_size):
    """An input function for training"""
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))

    # Shuffle, repeat, and batch the examples.
    dataset = dataset.shuffle(1000).repeat().batch(batch_size)

    # Return the dataset.
    return dataset
```

Let's look at this more closely.

### Arguments
This function expects three arguments. Arguments expecting an "array" can accept nearly anything that can be converted to an array with `numpy.array`. One exception is [tuple](https://docs.python.org/3/tutorial/datastructures.html#tuples-and-sequences) which, as we will see, has special meaning for `Datasets`.
   - `features`: A `{'feature_name':array}` dictionary (or [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)) containing the raw input features.
   - `labels` : An array containing the [label](https://developers.google.com/machine-learning/glossary/#label) for each example.
   - `batch_size` : An integer indicating the desired batch size.

In [premade_estimator.py](https://github.com/tensorflow/models/blob/master/samples/core/get_started/premade_estimator.py) we retrieved the Iris data using the `iris_data.load_data()` function. You can run it, and unpack the results as follows:

In [1]:
import iris_data

# Fetch the data
train, test = iris_data.load_data()
features, labels = train

Then we passed this data to the input function, with a line similar to this:

In [2]:
batch_size=100
iris_data.train_input_fn(features, labels, batch_size)

<BatchDataset shapes: ({SepalLength: (?,), SepalWidth: (?,), PetalLength: (?,), PetalWidth: (?,)}, (?,)), types: ({SepalLength: tf.float64, SepalWidth: tf.float64, PetalLength: tf.float64, PetalWidth: tf.float64}, tf.int64)>

Let's walk through the `train_input_fn()`.

### Slices
The function starts by using the [tf.data.Dataset.from_tensor_slices](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices) function to create a [tf.data.Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) representing slices of the array. The array is sliced across the first dimension. For example, an array containing the `iris` training data has a shape of `(120, 4)`. Passing this to `from_tensor_slices` returns a `Dataset` object containing 120 slices, each one with 4 features.

In [3]:
features.shape

(120, 4)

The code that returns this Dataset is as follows:

In [6]:
features_ds = tf.data.Dataset.from_tensor_slices(features)
print(features_ds)

<TensorSliceDataset shapes: (4,), types: tf.float64>


The line printed above shows the [shapes](https://www.tensorflow.org/guide/tensors#shapes) and [types](https://www.tensorflow.org/guide/tensors#data_types) of the items in the dataset. Note that a `Dataset` does not know how many items it contains.
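
To make the slicing behaviour concrete, here is a minimal plain-Python sketch of the idea. It is only an analogy, not how `tf.data` is implemented: `from_slices` and `features_like` are hypothetical names, and a nested list stands in for the `(120, 4)` feature array.

```python
def from_slices(array):
    """Yield one slice per entry along the first dimension of a nested list."""
    for row in array:
        yield row


# Hypothetical stand-in for the (120, 4) Iris feature array: 120 rows of 4 values.
features_like = [[float(i), 0.0, 0.0, 0.0] for i in range(120)]

slices = from_slices(features_like)
first = next(slices)
print(len(first))                   # 4 features per slice
print(hasattr(slices, "__len__"))   # False: the stream has no known length

# The total number of slices is only known after iterating over all of them.
remaining = sum(1 for _ in slices)
print(remaining + 1)                # 120
```

Like the generator here, a `Dataset` describes a stream of items, which is why its printed summary reports the shape of one item rather than a total count.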

The `Dataset` above represents a simple collection of arrays, but datasets are much more powerful than this. A `Dataset` can transparently handle any nested combination of dictionaries or tuples (or [namedtuple](https://docs.python.org/2/library/collections.html#collections.namedtuple)).

For example, after converting the iris `features` to a standard Python dictionary, you can then convert the dictionary of arrays to a `Dataset` of dictionaries as follows:

In [7]:
dataset = tf.data.Dataset.from_tensor_slices(dict(features))
print(dataset)

<TensorSliceDataset shapes: {SepalLength: (), SepalWidth: (), PetalLength: (), PetalWidth: ()}, types: {SepalLength: tf.float64, SepalWidth: tf.float64, PetalLength: tf.float64, PetalWidth: tf.float64}>


In [8]:
features_dict = dict(features)

In [9]:
print(features_dict.keys())

dict_keys(['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth'])


In [10]:
sepallen = features_dict.get('SepalLength')

In [11]:
sepallen.shape

(120,)

In [12]:
sepallen[:5]

0    6.4
1    5.0
2    4.9
3    4.9
4    5.7
Name: SepalLength, dtype: float64

Here we see that when a `Dataset` contains structured elements, the `shapes` and `types` of the `Dataset` take on the same structure. This dataset contains dictionaries of [scalars](https://www.tensorflow.org/guide/tensors#rank), all of type [tf.float64](https://www.tensorflow.org/api_docs/python/tf#float64).

The first line of the `iris train_input_fn` uses the same functionality, but adds another level of structure. It creates a dataset containing `(features_dict, label)` pairs.

The following code shows that the label is a scalar with type int64:

In [8]:
# Convert the inputs to a Dataset.
dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
print(dataset)

<TensorSliceDataset shapes: ({SepalLength: (), SepalWidth: (), PetalLength: (), PetalWidth: ()}, ()), types: ({SepalLength: tf.float64, SepalWidth: tf.float64, PetalLength: tf.float64, PetalWidth: tf.float64}, tf.int64)>
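
The way a nested structure is sliced can be mimicked in plain Python. The sketch below is an analogy under stated assumptions, not the `tf.data` implementation: `slice_columns` is a hypothetical helper, the `SepalLength` values are the five printed earlier, and the `SepalWidth` values and labels are illustrative placeholders.

```python
def slice_columns(columns):
    """Turn a {name: column} dict into a stream of {name: scalar} dicts."""
    names = list(columns)
    length = len(columns[names[0]])
    for i in range(length):
        yield {name: columns[name][i] for name in names}


columns = {
    "SepalLength": [6.4, 5.0, 4.9, 4.9, 5.7],  # first five values printed above
    "SepalWidth": [2.8, 2.3, 2.5, 3.1, 3.8],   # illustrative placeholder values
}
labels_like = [2, 1, 2, 0, 0]                  # illustrative placeholder labels

# Slicing a (dict, labels) tuple pairs each per-example dict with its label.
pairs = list(zip(slice_columns(columns), labels_like))
print(pairs[0])
```

Each element keeps the structure of the input: a tuple whose first item is a dictionary of scalars and whose second item is a scalar label.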


### Manipulation
Currently the `Dataset` would iterate over the data once, in a fixed order, and only produce a single element at a time. It needs further processing before it can be used for training. Fortunately, the [tf.data.Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) class provides methods to better prepare the data for training.

In [9]:
# Shuffle, repeat, and batch the examples.
dataset_batch = dataset.shuffle(1000).repeat().batch(batch_size)

The [tf.data.Dataset.shuffle](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#shuffle) method uses a fixed-size buffer to shuffle the items as they pass through. In this case the `buffer_size` is greater than the number of examples in the `Dataset`, ensuring that the data is completely shuffled (the Iris data set contains only 150 examples).
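
The effect of the buffer size is easier to see in a small sketch. The following plain-Python generator is a simplified model of buffer-based shuffling, assuming the buffer is filled first and then one random element is emitted for each new item that arrives. `buffered_shuffle` is a hypothetical name, and the real implementation may differ in detail.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Shuffle a stream using a fixed-size buffer (simplified model)."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)          # still filling the buffer
            continue
        index = rng.randrange(buffer_size)
        yield buffer[index]              # emit a random buffered item
        buffer[index] = item             # replace it with the new item
    rng.shuffle(buffer)                  # drain what is left in random order
    yield from buffer


data = list(range(10))
unshuffled = list(buffered_shuffle(data, buffer_size=1))
partial = list(buffered_shuffle(data, buffer_size=3, seed=0))
full = list(buffered_shuffle(data, buffer_size=1000, seed=0))

print(unshuffled == data)       # True: a buffer of 1 cannot reorder anything
print(partial[0] in data[:3])   # True: first output comes from the first 3 items
print(sorted(full) == data)     # True: every item appears exactly once
```

With a buffer at least as large as the data, every item is in the buffer before anything is emitted, so any ordering is possible. With a small buffer, items can only move a limited distance from their original position.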

The [tf.data.Dataset.repeat](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#repeat) method restarts the `Dataset` when it reaches the end. To limit the number of epochs, set the `count` argument.
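
As a rough plain-Python analogy (not the `tf.data` implementation), repeating can be sketched as a generator that restarts its source after each pass. `repeat_items` and `make_iterator` are hypothetical names; `count=None` is assumed here to mean "repeat indefinitely".

```python
import itertools


def repeat_items(make_iterator, count=None):
    """Replay a stream `count` times, or forever when count is None."""
    epoch = 0
    while count is None or epoch < count:
        yield from make_iterator()   # one full pass over the data = one epoch
        epoch += 1


examples = ["a", "b", "c"]

two_epochs = list(repeat_items(lambda: iter(examples), count=2))
print(two_epochs)      # ['a', 'b', 'c', 'a', 'b', 'c']

# Without a count the stream never ends, so take a finite prefix of it.
endless = repeat_items(lambda: iter(examples))
first_seven = list(itertools.islice(endless, 7))
print(first_seven)     # ['a', 'b', 'c', 'a', 'b', 'c', 'a']
```

An endless stream is convenient for training, because the number of training steps then decides when to stop rather than the size of the data.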

The [tf.data.Dataset.batch](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch) method collects a number of examples and stacks them, to create batches. This adds a dimension to their shape. The new dimension is added as the first dimension.

In [10]:
print(dataset.batch(100))

<BatchDataset shapes: ({SepalLength: (?,), SepalWidth: (?,), PetalLength: (?,), PetalWidth: (?,)}, (?,)), types: ({SepalLength: tf.float64, SepalWidth: tf.float64, PetalLength: tf.float64, PetalWidth: tf.float64}, tf.int64)>


Note that the batch dimension is unknown (shown as `?`) because the last batch may have fewer elements. In `train_input_fn`, after batching, the `Dataset` contains 1D vectors of elements where each element was previously a scalar:

In [13]:
print(dataset_batch)

<BatchDataset shapes: ({SepalLength: (?,), SepalWidth: (?,), PetalLength: (?,), PetalWidth: (?,)}, (?,)), types: ({SepalLength: tf.float64, SepalWidth: tf.float64, PetalLength: tf.float64, PetalWidth: tf.float64}, tf.int64)>
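
To see why the batch dimension cannot be fixed in advance, consider this plain-Python sketch of batching. It is an analogy only, and `batch_items` is a hypothetical helper. With the 120 training examples and a batch size of 100, a single pass over the data produces one full batch and one smaller batch.

```python
def batch_items(items, batch_size):
    """Group a stream into lists of up to batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch   # the final batch may be smaller than batch_size


sizes = [len(batch) for batch in batch_items(range(120), 100)]
print(sizes)   # [100, 20]
```

Because the size of the last batch depends on how many examples the stream happens to contain, the leading dimension is reported as unknown.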


### Return
At this point the `Dataset` contains `(features_dict, labels)` pairs.
This is the format expected by the `train` and `evaluate` methods, so the `input_fn` returns the dataset. The `labels` can, and should, be omitted when using the `predict` method.

## Reading a CSV File
The most common real-world use case for the `Dataset` class is to stream data from files on disk. The [tf.data](https://www.tensorflow.org/api_docs/python/tf/data) module includes a variety of file readers. Let's see how parsing the Iris dataset from the csv file looks using a `Dataset`. The following call to the `iris_data.maybe_download` function downloads the data if necessary, and returns the pathnames of the resulting files: 

In [15]:
train_path, test_path = iris_data.maybe_download()

The [iris_data.csv_input_fn](https://github.com/tensorflow/models/blob/master/samples/core/get_started/iris_data.py) function contains an alternative implementation that parses the csv files using a `Dataset`. Let's look at how to build an Estimator-compatible input function that reads from the local files. 

### Build the `Dataset` 
We start by building a [tf.data.TextLineDataset](https://www.tensorflow.org/api_docs/python/tf/data/TextLineDataset) object to read the file one line at a time. Then, we call the [tf.data.Dataset.skip](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#skip) method to skip over the first line of the file, which contains a header, not an example: 

In [16]:
ds = tf.data.TextLineDataset(train_path).skip(1)
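
For intuition, the same "read lines, skip the header" pattern can be sketched with the Python standard library. This is an analogy rather than the `tf.data` implementation: `read_lines_skipping` is a hypothetical helper, and `csv_text` is a made-up three-line file, not the real Iris data.

```python
import io
import itertools

# Made-up file contents: one header line followed by two examples.
csv_text = "col_a,col_b,label\n1.0,2.0,0\n3.0,4.0,1\n"


def read_lines_skipping(file_obj, n):
    """Lazily yield lines without their newline, skipping the first n."""
    stripped = (line.rstrip("\n") for line in file_obj)
    return itertools.islice(stripped, n, None)


lines = list(read_lines_skipping(io.StringIO(csv_text), 1))
print(lines)   # ['1.0,2.0,0', '3.0,4.0,1']
```

As with the `Dataset`, each element at this point is still a single unparsed string.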

### Build a csv line parser
We must parse each of the lines in the dataset in order to generate the necessary `(features, label)` pairs, so we start by building a function to parse a single line. The following `_parse_line` function calls [tf.decode_csv](https://www.tensorflow.org/api_docs/python/tf/io/decode_csv) to parse a single line into its features and the label. Since Estimators require that features be represented as a dictionary, we rely on Python's built-in `dict` and `zip` functions to build that dictionary. The feature names are the keys of that dictionary. We then call the dictionary's `pop` method to remove the label field from the features dictionary:

In [17]:
# Metadata describing the text columns
COLUMNS = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'label']
FIELD_DEFAULTS = [[0.0], [0.0], [0.0], [0.0], [0]]


def _parse_line(line):
    # Decode the line into its fields
    fields = tf.decode_csv(line, FIELD_DEFAULTS)
    # Pack the result into a dictionary
    features = dict(zip(COLUMNS, fields))
    # Separate the label from the features
    label = features.pop('label')
    return features, label
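
The `dict`, `zip`, and `pop` steps can be tried out without TensorFlow. The sketch below is a plain-Python analogue that works on ordinary strings, whereas `tf.decode_csv` operates on tensors. `parse_line_py` is a hypothetical name, the sample line is illustrative, and the handling of empty fields is a simplified assumption about how the defaults are used.

```python
COLUMNS = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'label']
FIELD_DEFAULTS = [[0.0], [0.0], [0.0], [0.0], [0]]


def parse_line_py(line):
    """Parse one CSV line into a (features_dict, label) pair."""
    raw_fields = line.strip().split(',')
    fields = []
    for text, default in zip(raw_fields, FIELD_DEFAULTS):
        caster = type(default[0])    # the default's type sets the field type
        fields.append(caster(text) if text else default[0])
    # Pack the result into a dictionary keyed by column name
    parsed = dict(zip(COLUMNS, fields))
    # Separate the label from the features
    label = parsed.pop('label')
    return parsed, label


parsed_features, parsed_label = parse_line_py("6.4,2.8,5.6,2.2,2")
print(parsed_features)
print(parsed_label)

# An empty field falls back to its default value.
with_gap, _ = parse_line_py("6.4,,5.6,2.2,2")
print(with_gap['SepalWidth'])
```

After `pop`, the dictionary holds only the four feature columns, and the label is returned separately.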

### Parse the lines
Datasets have many methods for manipulating the data while it is being piped to a model. The most heavily-used method is [tf.data.Dataset.map](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map), which applies a transformation to each element of the `Dataset`. The `map` method takes a `map_func` argument that describes how each item in the `Dataset` should be transformed.

<img src="../images/datasets-for-estimator/map.png" alt="map" width="500"/>

So, to parse the lines as they are streamed out of the CSV file, we pass our `_parse_line` function to the [tf.data.Dataset.map](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map) method:

In [19]:
ds = ds.map(_parse_line)
print(ds)

<MapDataset shapes: ({SepalLength: (), SepalWidth: (), PetalLength: (), PetalWidth: ()}, ()), types: ({SepalLength: tf.float32, SepalWidth: tf.float32, PetalLength: tf.float32, PetalWidth: tf.float32}, tf.int32)>


Now, instead of simple scalar strings, the dataset contains `(features, label)` pairs.

The remainder of the `iris_data.csv_input_fn` function is identical to `iris_data.train_input_fn`, which was covered in the [Basic input](#basic-input) section.

In [20]:
print(train_path)

/Users/zzhang/.keras/datasets/iris_training.csv


### Try it out
This function can be used as a replacement for `iris_data.train_input_fn`. It can be used to feed an estimator as follows:

In [21]:
train_path, test_path = iris_data.maybe_download()

# All the inputs are numeric
feature_columns = [
    tf.feature_column.numeric_column(name)
    for name in iris_data.CSV_COLUMN_NAMES[:-1]]

# Build the estimator
est = tf.estimator.LinearClassifier(feature_columns,
                                    n_classes=3)
# Train the estimator
batch_size = 100
est.train(
    steps=1000,
    input_fn=lambda : iris_data.csv_input_fn(train_path, batch_size))

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/var/folders/41/ys9q_7f57_3bk23wg5rzh4k00000gn/T/tmph1pb8nf8', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0xb2ffd42e8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create Chec

<tensorflow.python.estimator.canned.linear.LinearClassifier at 0xb2ffd4198>

Estimators expect an `input_fn` to take no arguments. To work around this restriction, we use `lambda` to capture the arguments and provide the expected interface.
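
The argument-capturing trick is ordinary Python and can be illustrated without TensorFlow. In this sketch, `csv_input_fn_stub` and `call_input_fn` are hypothetical stand-ins for the real input function and for the code that calls it, and the path is a placeholder. `functools.partial` is shown as an alternative to `lambda` with the same effect.

```python
import functools


def csv_input_fn_stub(csv_path, batch_size):
    """Stand-in for an input function that needs two arguments."""
    return (csv_path, batch_size)


def call_input_fn(input_fn):
    """Stand-in for a caller that invokes input_fn with no arguments."""
    return input_fn()


train_path = "iris_training.csv"   # placeholder path
batch_size = 100

# A lambda captures the arguments and exposes a zero-argument callable.
via_lambda = call_input_fn(lambda: csv_input_fn_stub(train_path, batch_size))

# functools.partial binds the arguments up front and behaves the same way.
via_partial = call_input_fn(
    functools.partial(csv_input_fn_stub, train_path, batch_size))

print(via_lambda == via_partial)   # True
```

One difference worth knowing: the `lambda` looks up `train_path` and `batch_size` when it is called, while `partial` stores the values they had when it was created.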