# Input Pipeline

As we're using TensorFlow, we can make use of the `tf.data.Dataset` object. First, we'll load our NumPy binaries from file:

In [44]:
import numpy as np

with open('movie-xids.npy', 'rb') as f:
    Xids = np.load(f, allow_pickle=True)
with open('movie-xmask.npy', 'rb') as f:
    Xmask = np.load(f, allow_pickle=True)
with open('movie-labels.npy', 'rb') as f:
    labels = np.load(f, allow_pickle=True)

In [45]:
Xids.shape

(156060, 512)

In [46]:
Xids[0:5, 0:10]

array([[  101,   138,  1326,  1104, 13936, 25265, 16913, 15107,  1103,
         8050],
       [  101,   138,  1326,  1104, 13936, 25265, 16913, 15107,  1103,
         8050],
       [  101,   138,  1326,   102,     0,     0,     0,     0,     0,
            0],
       [  101,   138,   102,     0,     0,     0,     0,     0,     0,
            0],
       [  101,  1326,   102,     0,     0,     0,     0,     0,     0,
            0]])

We can take these three arrays and create a TF dataset object with them using `from_tensor_slices` like so:

In [47]:
import tensorflow as tf

In [48]:
dataset = tf.data.Dataset.from_tensor_slices((Xids, Xmask, labels))

In [49]:
dataset.take(1)

<TakeDataset element_spec=(TensorSpec(shape=(512,), dtype=tf.int64, name=None), TensorSpec(shape=(512,), dtype=tf.int64, name=None), TensorSpec(shape=(5,), dtype=tf.float64, name=None))>

Each sample in our dataset is a tuple containing a single `Xids`, `Xmask`, and `labels` tensor. However, when feeding data into our model we need a two-item tuple in the format **(\<inputs\>, \<outputs\>)**. Because we have two input tensors, we pass **\<inputs\>** as a dictionary:

```
{
    'input_ids': <input_id_tensor>,
    'attention_mask': <mask_tensor>
}
```

To rearrange each sample into this format, we can `map` a function over the dataset. The format needed is:

```({input_ids, attention_mask}, outputs)```

In [50]:
def map_func(input_ids, masks, labels):
    # we convert our three-item tuple into a two-item tuple where the input item is a dictionary
    return {'input_ids': input_ids, 
            'attention_mask': masks}, labels

In [51]:
# then we use the dataset map method to apply this transformation
dataset = dataset.map(map_func)

In [52]:
dataset.take(1)

<TakeDataset element_spec=({'input_ids': TensorSpec(shape=(512,), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(512,), dtype=tf.int64, name=None)}, TensorSpec(shape=(5,), dtype=tf.float64, name=None))>
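
To see what the mapping does to an individual sample, here is a minimal sketch on toy data. The array sizes (4 samples, sequence length 8, 5 classes) and the names `toy_ids`, `toy_mask`, `toy_labels`, and `to_model_format` are made up for illustration; they are not the real arrays.

```python
import numpy as np
import tensorflow as tf

# hypothetical stand-ins for Xids, Xmask and labels (4 samples, length 8, 5 classes)
toy_ids = np.arange(32, dtype=np.int64).reshape(4, 8)
toy_mask = np.ones((4, 8), dtype=np.int64)
toy_labels = np.eye(5, dtype=np.float64)[:4]

def to_model_format(input_ids, masks, labels):
    # same transformation as map_func above
    return {'input_ids': input_ids, 'attention_mask': masks}, labels

toy_ds = tf.data.Dataset.from_tensor_slices((toy_ids, toy_mask, toy_labels))
toy_ds = toy_ds.map(to_model_format)

# pull out the first sample: a (dict of inputs, labels) pair
inputs, target = next(iter(toy_ds))
print(sorted(inputs.keys()))
print(inputs['input_ids'].shape, target.shape)
```

Each element is now a pair whose first item is a dictionary keyed by input name, which is the layout a Keras model with named inputs expects.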

Now we can see that our dataset sample format has been changed. Next, we need to shuffle our data and batch it. We will use a batch size of `16` and drop the final partial batch, i.e. any leftover samples that don't fill a complete batch of 16.

In [53]:
batch_size = 16

In [54]:
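# reshuffle_each_iteration=False keeps one fixed shuffle order, so the take/skip split below does not overlap
# increase the buffer size (10,000) if the data isn't being shuffled well enough
dataset = dataset.shuffle(10000, reshuffle_each_iteration=False).batch(batch_size, drop_remainder=True)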

In [55]:
dataset.take(1)

<TakeDataset element_spec=({'input_ids': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None)}, TensorSpec(shape=(16, 5), dtype=tf.float64, name=None))>

Now our dataset samples are organized into batches of 16. The final step is to split our data into training and validation sets. For this we use the `take` and `skip` methods, creating a 90-10 split.

In [56]:
# Create a 90-10 training to validation split
split = 0.9

In [57]:
# Number of observations
Xids.shape[0]

156060

In [58]:
# Total number of batches (before the partial final batch is dropped)
Xids.shape[0] / batch_size

9753.75

In [59]:
# How many batches must be taken to create 90% training set
Xids.shape[0] / batch_size * split

8778.375

In [60]:
size = int((Xids.shape[0] / batch_size) * split)
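
Because `drop_remainder=True` discards the final partial batch, the batched dataset holds a whole number of batches, slightly fewer than the 9753.75 computed above. A short worked calculation using only the numbers already shown (`n_samples`, `n_batches`, and the other variable names are just illustrative):

```python
n_samples = 156060
batch_size = 16
split = 0.9

n_batches = n_samples // batch_size           # full batches kept by drop_remainder=True
dropped = n_samples - n_batches * batch_size  # leftover samples that are discarded

train_batches = int((n_samples / batch_size) * split)  # same formula as `size`
val_batches = n_batches - train_batches

print(n_batches, dropped)          # 9753 12
print(train_batches, val_batches)  # 8778 975
```

So the training set holds 8778 batches (140,448 samples) and the validation set 975 batches (15,600 samples).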

In [61]:
train_ds = dataset.take(size)
val_ds = dataset.skip(size)

# free up memory
del dataset
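
One detail worth checking with this approach: by default `shuffle` reshuffles every time the dataset is iterated, and `train_ds` and `val_ds` are iterated separately, so the two splits are only guaranteed to be disjoint if the shuffle order is fixed. The sketch below uses a toy dataset of ten integers (not the real data) and passes `reshuffle_each_iteration=False`, with an arbitrary `seed`, to show `take` and `skip` partitioning one fixed order.

```python
import tensorflow as tf

# toy dataset: the integers 0..9 stand in for batches
toy = tf.data.Dataset.range(10)
# fix the shuffle order so every iteration sees the same sequence
toy = toy.shuffle(10, seed=42, reshuffle_each_iteration=False)

toy_train = toy.take(7)
toy_val = toy.skip(7)

train_items = [int(x) for x in toy_train.as_numpy_iterator()]
val_items = [int(x) for x in toy_val.as_numpy_iterator()]

# no element should appear in both splits, and together they should cover everything
print(set(train_items) & set(val_items))
print(sorted(train_items + val_items))
```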

Our two datasets are fully prepared for our model inputs. Now, we can save both to file using [`tf.data.experimental.save`](https://www.tensorflow.org/api_docs/python/tf/data/experimental/save).

In [62]:
tf.data.experimental.save(train_ds, 'train')
tf.data.experimental.save(val_ds, 'val')

In the next notebook we will load these files using `tf.data.experimental.load`, which requires us to define the `element_spec`: the structure, shape, and dtype of the tensors in each dataset element. To find our dataset's element spec we can write:

In [63]:
train_ds.element_spec

({'input_ids': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None),
  'attention_mask': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None)},
 TensorSpec(shape=(16, 5), dtype=tf.float64, name=None))

In [64]:
val_ds.element_spec

({'input_ids': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None),
  'attention_mask': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None)},
 TensorSpec(shape=(16, 5), dtype=tf.float64, name=None))

In [65]:
val_ds.element_spec == train_ds.element_spec

True

We will use this tuple when loading our data in the next notebook. As a quick check, we can load the training set back here:

In [66]:
ds = tf.data.experimental.load('train', element_spec=train_ds.element_spec)

In [67]:
ds.take(1)

<TakeDataset element_spec=({'input_ids': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(16, 512), dtype=tf.int64, name=None)}, TensorSpec(shape=(16, 5), dtype=tf.float64, name=None))>
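
In a fresh notebook `train_ds` will not be available, so the spec has to be written out by hand. A minimal sketch, copying the shapes and dtypes printed above (the name `element_spec` is just a local variable here):

```python
import tensorflow as tf

# hand-written copy of the element spec printed above
element_spec = (
    {
        'input_ids': tf.TensorSpec(shape=(16, 512), dtype=tf.int64),
        'attention_mask': tf.TensorSpec(shape=(16, 512), dtype=tf.int64),
    },
    tf.TensorSpec(shape=(16, 5), dtype=tf.float64),
)

print(element_spec[0]['input_ids'].shape, element_spec[1].dtype)
```

This tuple can then be passed as `element_spec=element_spec` to `tf.data.experimental.load`, in place of `train_ds.element_spec`.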