##Preparing Time Series Features and Labels

In this lab, you will prepare time series data as features and labels that you can use to train a model.

You do this with a windowing technique: group consecutive measurements into one feature, and use the measurement that follows as the label.

For example, with hourly measurements you can use the values taken at hours 1 to 11 to predict the value at hour 12.
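
To make the idea concrete, here is a minimal pure-Python sketch of the same feature/label split on a plain list. The helper name `make_windows`, its parameters, and the sample readings are illustrative assumptions, not part of the `tf.data` API used in the rest of this lab.

```python
def make_windows(series, window_size):
    """Split a sequence into (features, label) pairs.

    Each pair holds `window_size` consecutive values as features and
    the value that immediately follows them as the label.
    """
    pairs = []
    # Stop early enough that every window still has a label after it.
    for start in range(len(series) - window_size):
        features = series[start:start + window_size]
        label = series[start + window_size]
        pairs.append((features, label))
    return pairs


# Hypothetical hourly readings for hours 1 to 12.
hourly = list(range(1, 13))

# Use 11 hours of readings to predict the next hour.
pairs = make_windows(hourly, window_size=11)
for features, label in pairs:
    print(features, "->", label)
```

With 12 readings and 11 features per window, only one feature/label pair fits. A longer series or a smaller window yields more pairs.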

##Imports 

TensorFlow is the only import in this lab. All methods come from the `tf.data` API, particularly the `tf.data.Dataset` class.

This class contains useful methods for arranging sequences of data.

In [1]:
import tensorflow as tf

##Create a simple dataset

Use a simple sequence of numbers as your dataset so you can clearly see the effect of each command.

For example, the cell below uses the `range()` method to generate a dataset containing the numbers 0 to 9.

In [3]:
#Generate a tf dataset with 10 elements (i.e. numbers 0 to 9)
dataset = tf.data.Dataset.range(10)

#Preview the result 
for value in dataset:
  print(value.numpy())

0
1
2
3
4
5
6
7
8
9


##Windowing the data

You want to group consecutive elements of your data and use them to predict a future value. This is called windowing, and you can do it with the `window()` method as shown below.

Here you take 5 elements per window (i.e. the `size` parameter) and move this window 1 element at a time (i.e. the `shift` parameter).

One caveat of this method is that each window it returns is itself a `Dataset`.

In [7]:
#Generate a tf dataset with 10 elements (i.e. numbers 0 to 9)
dataset = tf.data.Dataset.range(10)

#Window the data
dataset = dataset.window(size = 5, shift = 1)

#Print the result 
for window_dataset in dataset:
  print(window_dataset)

<_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>
<_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>
<_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>
<_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>
<_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>
<_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>
<_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>
<_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>
<_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>
<_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>


If you want to see the elements, you'll have to iterate over each window, since each one is itself an iterable `Dataset`.

In [8]:
#Print the result 
for window_dataset in dataset:
  for item in window_dataset:
    print([item.numpy()])

[0]
[1]
[2]
[3]
[4]
[1]
[2]
[3]
[4]
[5]
[2]
[3]
[4]
[5]
[6]
[3]
[4]
[5]
[6]
[7]
[4]
[5]
[6]
[7]
[8]
[5]
[6]
[7]
[8]
[9]
[6]
[7]
[8]
[9]
[7]
[8]
[9]
[8]
[9]
[9]


You can see that the resulting windows aren't evenly sized: as the window approaches the end of the data, there are no more elements after the number 9 to fill it.

You can set the `drop_remainder` flag to `True` to make sure that only 5-element windows are retained.

In [9]:
#Generate the dataset
dataset = tf.data.Dataset.range(10)

#Window the data but only take those with the specified size
dataset = dataset.window(size = 5, shift = 1, drop_remainder = True)

#Print the result 
for window_dataset in dataset:
  print([item.numpy() for item in window_dataset])

[0, 1, 2, 3, 4]
[1, 2, 3, 4, 5]
[2, 3, 4, 5, 6]
[3, 4, 5, 6, 7]
[4, 5, 6, 7, 8]
[5, 6, 7, 8, 9]
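
The count of six windows follows from simple arithmetic. The sketch below is a plain-Python check rather than a TensorFlow call. `count_full_windows` is a hypothetical helper, and it assumes consecutive elements within each window (the default `stride` of 1).

```python
def count_full_windows(num_elements, size, shift):
    """Number of complete windows when incomplete ones are dropped."""
    if num_elements < size:
        return 0
    # The first window starts at index 0 and each later one starts
    # `shift` further on. The last valid start is num_elements - size.
    return (num_elements - size) // shift + 1


print(count_full_windows(10, size=5, shift=1))  # 6
print(count_full_windows(10, size=5, shift=2))  # 3
```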


##Flatten the windows 

When you train the model later, you will want the windows to be tensors instead of `Dataset` structures. You can do that by passing a mapping function to the `flat_map()` method.

This function is applied to each window, and the results are flattened into a single dataset. Here, calling `batch(5)` on each window puts all 5 of its elements into one tensor.

In [11]:
#Generate a tf dataset with 10 elements 
dataset = tf.data.Dataset.range(10)

#Window the data but only take those with the specified size
dataset = dataset.window(5, shift = 1, drop_remainder = True)

#Flatten the windows by putting its elements in a single batch
dataset = dataset.flat_map(lambda window : window.batch(5))

#Print the results 
for window in dataset:
  print(window.numpy())

[0 1 2 3 4]
[1 2 3 4 5]
[2 3 4 5 6]
[3 4 5 6 7]
[4 5 6 7 8]
[5 6 7 8 9]


##Group features and labels 
Next, mark the labels in each window. Do so by splitting the last element of each window from the first four using the `map()` method.

In [12]:
#Generate a tf dataset with 10 elements 
dataset = tf.data.Dataset.range(10)

#Window the data but only take those with the specified size
dataset = dataset.window(5, shift = 1, drop_remainder = True)

#Flatten the windows by putting its elements in a single batch
dataset = dataset.flat_map(lambda window: window.batch(5))

#Create tuples with features (first four elements of the window) and labels (last element)
dataset = dataset.map(lambda window: (window[:-1], window[-1]))

#Print the result
for x,y in dataset:
  print('x = ', x.numpy())
  print('y = ', y.numpy())
  print()

x =  [0 1 2 3]
y =  4

x =  [1 2 3 4]
y =  5

x =  [2 3 4 5]
y =  6

x =  [3 4 5 6]
y =  7

x =  [4 5 6 7]
y =  8

x =  [5 6 7 8]
y =  9



##Shuffle the data

It's good practice to shuffle the dataset to reduce sequence bias while training the model. Sequence bias occurs when a network overfits to the order of its inputs, so it performs poorly when it does not see that particular order at test time.

You can use the `shuffle()` method to do this. It requires a `buffer_size` argument and, as mentioned in the documentation, you should set it to a number equal to or greater than the total number of elements you're shuffling.

You can see from the previous cells that the dataset contains 6 windows, so you can choose this number or higher.

In [15]:
dataset = tf.data.Dataset.range(10)

#Window the data but only take those with the specified size
dataset = dataset.window(5, shift = 1, drop_remainder = True)

#Flatten the windows by putting its elements in a single batch
dataset = dataset.flat_map(lambda window: window.batch(5))

#Create tuples with features (first four elements of window) and labels (last element of window)
dataset = dataset.map(lambda window: (window[:-1], window[-1]))

#Shuffle the windows
dataset = dataset.shuffle(buffer_size = 10)

#Print the results
for x,y in dataset:
  print('x = ', x.numpy())
  print('y = ', y.numpy())
  print()

x =  [3 4 5 6]
y =  7

x =  [2 3 4 5]
y =  6

x =  [4 5 6 7]
y =  8

x =  [1 2 3 4]
y =  5

x =  [0 1 2 3]
y =  4

x =  [5 6 7 8]
y =  9
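
To see why the buffer size matters, the following is a simplified pure-Python model of shuffling through a fixed-size buffer. It is an approximation for intuition only, not TensorFlow's actual implementation, and `buffered_shuffle` is a hypothetical name.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Simplified model of shuffling through a fixed-size buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    output = []
    for item in items:
        buffer.append(item)
        # Once the buffer is full, emit one randomly chosen element
        # to make room for the next incoming item.
        if len(buffer) == buffer_size:
            output.append(buffer.pop(rng.randrange(len(buffer))))
    # Drain whatever is left once the input is exhausted.
    while buffer:
        output.append(buffer.pop(rng.randrange(len(buffer))))
    return output


windows = list(range(6))

# A buffer of 1 can only emit the element it just received,
# so the original order is preserved.
print(buffered_shuffle(windows, buffer_size=1))

# A buffer at least as large as the dataset can emit any ordering.
print(buffered_shuffle(windows, buffer_size=6, seed=0))
```

In this model, a small buffer only mixes elements that are close together in the input, which is why a buffer at least as large as the dataset is the safe choice for a dataset this small.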



##Create batches for training 
Lastly, group the windows into batches. You can do this with the `batch()` method as shown below. Simply specify the batch size and it will return a batched dataset with that number of windows per batch.

As a rule of thumb, it's good to add a `prefetch()` step. This reduces execution time while the model is training. By specifying a prefetch `buffer_size` of 1, TensorFlow will prepare the next batch in advance while the current batch is being consumed by the model.

In [17]:
dataset = tf.data.Dataset.range(10)

#Window the data but only take those with specified size 
dataset = dataset.window(5, shift = 1, drop_remainder = True)

#Flatten the windows by putting its elements in a single batch
dataset = dataset.flat_map(lambda window: window.batch(5))

#Create tuples with features (first four elements of window) and labels (last element of window)
dataset = dataset.map(lambda window: (window[:-1], window[-1]))

#Shuffle the windows
dataset = dataset.shuffle(buffer_size = 10)

#Create batches of windows 
dataset = dataset.batch(2).prefetch(1)

#Print the results
for x,y in dataset:
  print("x = ", x.numpy())
  print("y = ", y.numpy())
  print()


x =  [[5 6 7 8]
 [3 4 5 6]]
y =  [9 7]

x =  [[0 1 2 3]
 [1 2 3 4]]
y =  [4 5]

x =  [[2 3 4 5]
 [4 5 6 7]]
y =  [6 8]
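
The steps in this lab can be collected into one reusable function. This is a sketch under assumptions: the name `windowed_dataset` and its parameters are illustrative choices, and the input is taken to be a 1-D sequence of numbers.

```python
import tensorflow as tf


def windowed_dataset(series, window_size, batch_size, shuffle_buffer):
    """Turn a 1-D series into shuffled, batched (features, label) pairs."""
    dataset = tf.data.Dataset.from_tensor_slices(series)

    # Each window holds window_size features plus 1 label.
    dataset = dataset.window(window_size + 1, shift=1, drop_remainder=True)

    # Convert each window from a Dataset into a single tensor.
    dataset = dataset.flat_map(lambda window: window.batch(window_size + 1))

    # Split off the last element of each window as the label.
    dataset = dataset.map(lambda window: (window[:-1], window[-1]))

    # Shuffle the windows, then group them into batches.
    dataset = dataset.shuffle(shuffle_buffer)
    dataset = dataset.batch(batch_size).prefetch(1)
    return dataset


# Same settings as the cells above: 4 features, 1 label, batches of 2.
dataset = windowed_dataset(
    list(range(10)), window_size=4, batch_size=2, shuffle_buffer=10
)

for x, y in dataset:
    print("x = ", x.numpy())
    print("y = ", y.numpy())
    print()
```

Because of the shuffle step, the order of the printed batches will differ from run to run.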

