# Deep Neural Networks with TensorFlow's Dataset API and Estimators

With TensorFlow 1.4, the Dataset API is [introduced](https://developers.googleblog.com/2017/09/introducing-tensorflow-datasets.html). The real advantage of the Dataset API is that a lot of memory management is done for the user when using large file-based datasets. And, in this work, we will be implementing a predefined DNN estimator and feed it with the Dataset API for Kaggle's Titanic dataset.

---

With Dataset API we can use file-based datasets or datasets in the memory. In this work we will read the data from a csv file. In order to have it, you should first run [this file](./01-data-label-encoding.ipynb) to label encode and then [this file](./02-data-feature-engineering.ipynb) for feature engineering and finally run [this file](split-train-valid.ipynb) to split the train set into train and valid sets.

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf

Our dataset includes the features:

In [2]:
train = pd.read_csv('./data/train_split_final.csv')
valid = pd.read_csv('./data/valid_split_final.csv')

train.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Title,Embarked,FamilySize,FamilyID,Survived
0,300,1,0,50.0,0,1,247.5208,7,1,2,50,1
1,372,3,1,18.0,1,0,6.4958,6,3,2,50,0
2,563,2,1,28.0,0,0,13.5,6,3,1,50,0
3,696,2,1,52.0,0,0,13.5,6,3,1,50,0
4,867,2,0,27.0,1,0,13.8583,4,1,2,50,1


`{Pclass, Sex, Age, SibSp, Parch, Fare, Title, Embarked, FamilySize, FamilyID}` we will be using as the features and the `Survived` column will be our labels.

In order to use the Datasets API and feed the Estimator, we should write an input function like this:

```python
def input_fn():
    ...<code>...
    return ({ 'Pclass':[values], ..<etc>.., 'FamilyID':[values] },
            [Survived])
```

This function takes the `file_path` as input and outputs a two-element tuple. The first element of the tuple is a dictionary containing feature names as keys and features as values. And the second element is a list of labels for the training batch.

Other two arguments for the input function are `perform_shuffle` and `repeat_count`. If `perform_shuffle` is `True` the order of the examples are shuffled. The `perform_shuffle` argument specifies the number of epochs during training, for instance, if `perform_shuffle=1` all the train set examples are passed only once.

And the implementation is as follows, we will use this function to feed the estimator later.

In [3]:
# define feature names first

feature_names = [
    'Pclass',
    'Sex',
    'Age',
    'SibSp',
    'Parch',
    'Fare',
    'Title',
    'Embarked',
    'FamilySize'
    'FamilyID']

In [4]:
def titanic_input_fn(file_path, perform_shuffle=False, repeat_count=1):
    def decode_csv(line):
        # second argument of decode_csv defines the data types for each dataset column!
        # the first argument is passenger ids thus integer
        # the last column is survived or not labels thus integer
        # and the rest are float.
        parsed_line = tf.decode_csv(
            line, [[0], [0.], [0.], [0.], [0.], [0.], [0.], [0.], [0.], [0.], [0.], [0]])
        label = parsed_line[-1:] # Last element is the label
        del parsed_line[-1] # Delete last element (it is the labels)
        features = parsed_line[1:] # First element is excluded since it is the id column
        d = dict(zip(feature_names, features)), label
        return d
    
    dataset = (tf.data.TextLineDataset(file_path) # Read text file
        .skip(1) # Skip header row
        .map(decode_csv)) # Transform each elem by applying decode_csv fn
    if perform_shuffle:
        # Randomizes input using a window of 256 elements (read into memory)
        dataset = dataset.shuffle(buffer_size=256)
    dataset = dataset.repeat(repeat_count) # Repeats dataset this # times
    dataset = dataset.batch(32)  # Batch size to use
    iterator = dataset.make_one_shot_iterator()
    batch_features, batch_labels = iterator.get_next()
    return batch_features, batch_labels

Memory management is provided here with `TextLineDataset`.

Let's print and check the first batch:

In [49]:
train_path = './data/train_final.csv'

In [6]:
next_batch = titanic_input_fn(train_path, False) # Will return first 32 elements

with tf.Session() as sess:
    first_batch = sess.run(next_batch)
print(first_batch)

({'Pclass': array([ 3.,  1.,  3.,  1.,  3.,  3.,  1.,  3.,  3.,  2.,  3.,  1.,  3.,
        3.,  3.,  2.,  3.,  2.,  3.,  3.,  2.,  2.,  3.,  1.,  3.,  3.,
        3.,  1.,  3.,  3.,  1.,  1.], dtype=float32), 'Sex': array([ 1.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  1.,
        1.,  0.,  0.,  1.,  1.,  0.,  0.,  1.,  1.,  0.,  1.,  0.,  0.,
        1.,  1.,  0.,  1.,  1.,  0.], dtype=float32), 'Age': array([ 22.,  38.,  26.,  35.,  35.,   0.,  54.,   2.,  27.,  14.,   4.,
        58.,  20.,  39.,  14.,  55.,   2.,   0.,  31.,   0.,  35.,  34.,
        15.,  28.,   8.,  38.,   0.,  19.,   0.,   0.,  40.,   0.], dtype=float32), 'SibSp': array([ 1.,  1.,  0.,  1.,  0.,  0.,  0.,  3.,  0.,  1.,  1.,  0.,  0.,
        1.,  0.,  0.,  4.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  3.,  1.,
        0.,  3.,  0.,  0.,  0.,  1.], dtype=float32), 'Parch': array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  2.,  0.,  1.,  0.,  0.,
        5.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.

Ok, it is working!

---
Now we will define our DNN estimator

In [45]:
%rm -r ./checkpoints

# path to save checkpoints
save_dir = './checkpoints'

# reset default graph if rebuilding the classifier
tf.reset_default_graph()

# Create the feature_columns, which specifies the input to our model.
# All our input features are numeric, so use numeric_column for each one.
feature_columns = [tf.feature_column.numeric_column(k) for k in feature_names]

# define the classifier
classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns, # The input features to our model
    hidden_units=[2048, 1024, 512, 256, 128], # 5 layers
    n_classes=2, # survived or not {1, 0}
    model_dir=save_dir, # Path to where checkpoints etc are stored
    optimizer=tf.train.RMSPropOptimizer(
        learning_rate=0.00001),
    dropout=0.1)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': './checkpoints', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f90c284a358>, '_task_type': 'worker', '_task_id': 0, '_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


Now we will train the model using `titanic_input_fn` and our classifier.

In [46]:
train_path = './data/train_split_final.csv'
valid_path = './data/valid_split_final.csv'
test_path = './data/test_final.csv'

In [47]:
# the classifier will run for 500 epochs below
classifier.train(input_fn=lambda: titanic_input_fn(train_path, True, 500))

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into ./checkpoints/model.ckpt.
INFO:tensorflow:loss = 25.8542, step = 1
INFO:tensorflow:global_step/sec: 243.864
INFO:tensorflow:loss = 18.1964, step = 101 (0.411 sec)
INFO:tensorflow:global_step/sec: 286.23
INFO:tensorflow:loss = 21.0345, step = 201 (0.350 sec)
INFO:tensorflow:global_step/sec: 291.351
INFO:tensorflow:loss = 18.289, step = 301 (0.343 sec)
INFO:tensorflow:global_step/sec: 276.521
INFO:tensorflow:loss = 18.0048, step = 401 (0.361 sec)
INFO:tensorflow:global_step/sec: 219.284
INFO:tensorflow:loss = 16.4129, step = 501 (0.457 sec)
INFO:tensorflow:global_step/sec: 286.576
INFO:tensorflow:loss = 18.8377, step = 601 (0.348 sec)
INFO:tensorflow:global_step/sec: 295.436
INFO:tensorflow:loss = 21.0404, step = 701 (0.338 sec)
INFO:tensorflow:global_step/sec: 269.046
INFO:tensorflow:loss = 16.477, step = 801 (0.372 sec)
INFO:tensorflow:global_step/sec: 250.686
INFO:tensorflow:loss = 17.1671, step 

INFO:tensorflow:loss = 11.7054, step = 8401 (0.385 sec)
INFO:tensorflow:global_step/sec: 243.69
INFO:tensorflow:loss = 14.7185, step = 8501 (0.410 sec)
INFO:tensorflow:global_step/sec: 229.14
INFO:tensorflow:loss = 19.1348, step = 8601 (0.436 sec)
INFO:tensorflow:global_step/sec: 260.972
INFO:tensorflow:loss = 12.1516, step = 8701 (0.384 sec)
INFO:tensorflow:global_step/sec: 253.994
INFO:tensorflow:loss = 18.1426, step = 8801 (0.394 sec)
INFO:tensorflow:global_step/sec: 273.929
INFO:tensorflow:loss = 14.3482, step = 8901 (0.364 sec)
INFO:tensorflow:global_step/sec: 229.85
INFO:tensorflow:loss = 20.474, step = 9001 (0.436 sec)
INFO:tensorflow:global_step/sec: 225.907
INFO:tensorflow:loss = 16.1795, step = 9101 (0.443 sec)
INFO:tensorflow:global_step/sec: 250.706
INFO:tensorflow:loss = 12.8834, step = 9201 (0.398 sec)
INFO:tensorflow:global_step/sec: 286.595
INFO:tensorflow:loss = 10.8068, step = 9301 (0.350 sec)
INFO:tensorflow:global_step/sec: 277.714
INFO:tensorflow:loss = 10.7849, st

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x7f90c284a080>

In [48]:
# evaluate
# Return value will contain evaluation_metrics such as: loss & average_loss
evaluate_result = classifier.evaluate(
   input_fn=lambda: titanic_input_fn(valid_path, False, 1))
print('')
print("Evaluation results:")
for key in evaluate_result:
    print("   {}, was: {}".format(key, evaluate_result[key]))

INFO:tensorflow:Starting evaluation at 2017-12-06-12:35:07
INFO:tensorflow:Restoring parameters from ./checkpoints/model.ckpt-9750
INFO:tensorflow:Finished evaluation at 2017-12-06-12:35:07
INFO:tensorflow:Saving dict for global step 9750: accuracy = 0.764045, accuracy_baseline = 0.588015, auc = 0.803561, auc_precision_recall = 0.703048, average_loss = 0.551122, global_step = 9750, label/mean = 0.411985, loss = 16.35, prediction/mean = 0.371183

Evaluation results:
   accuracy, was: 0.7640449404716492
   accuracy_baseline, was: 0.5880149602890015
   auc, was: 0.8035610914230347
   auc_precision_recall, was: 0.703047513961792
   average_loss, was: 0.55112224817276
   label/mean, was: 0.41198500990867615
   loss, was: 16.349960327148438
   prediction/mean, was: 0.37118253111839294
   global_step, was: 9750


Predict on test set

In [40]:
predict_results = classifier.predict(
    input_fn=lambda: titanic_input_fn(test_path, False, 1))
print("Predictions on test file")
i=892
for prediction in predict_results:
    # Will print the predicted class: 0 or 1.
    print(prediction["class_ids"][0])
    i += 1
print(i)

Predictions on test file
INFO:tensorflow:Restoring parameters from ./checkpoints/model.ckpt-9750
0
0
0
0
1
0
0
0
1
0
0
0
1
0
1
1
0
0
0
0
0
0
1
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
1
0
0
0
1
0
0
0
1
0
0
0
0
0
0
0
0
0
0
1
1
0
1
0
0
1
0
1
1
0
1
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
1
0
1
0
0
0
0
0
1
0
0
0
0
0
0
1
1
0
1
1
0
0
1
0
0
0
1
0
0
0
0
0
0
0
1
0
0
0
0
0
1
1
0
1
0
0
0
0
1
0
0
0
0
0
0
0
1
0
0
0
0
0
1
0
1
1
0
1
0
0
0
1
0
0
0
0
0
0
0
1
1
0
0
0
1
0
1
0
1
0
1
0
0
0
0
0
0
0
1
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
1
1
0
1
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
1
0
0
1
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
1
0
1
0
0
0
0
1
1
0
0
0
0
0
1
0
0
0
0
1
1
0
1
1
0
1
1
0
0
1
0
0
1
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
1
0
0
0
1
0
1
0
0
1
0
1
0
0
0
0
0
1
0
1
1
0
0
1
0
0
1
1310


That's all folks!