In [1]:
import tensorflow as tf
print(tf.__version__)
import numpy as np

2.4.1


1. Creating a range using a TensorFlow dataset:

In [2]:
dataset = tf.data.Dataset.range(10)
for val in dataset:
    print(val.numpy())

0
1
2
3
4
5
6
7
8
9


2. Creating windows of 5 elements from the dataset, with the start shifted by 1 for each new window:

In [13]:
dataset = tf.data.Dataset.range(10)
window = dataset.window(5, shift=1)
for i, w in enumerate(window):
    print("#window:",i,":",end=" ")
    for val in w:
        print(val.numpy(),end=" ")
    print()

#window: 0 : 0 1 2 3 4 
#window: 1 : 1 2 3 4 5 
#window: 2 : 2 3 4 5 6 
#window: 3 : 3 4 5 6 7 
#window: 4 : 4 5 6 7 8 
#window: 5 : 5 6 7 8 9 
#window: 6 : 6 7 8 9 
#window: 7 : 7 8 9 
#window: 8 : 8 9 
#window: 9 : 9 


The last windows above are shorter because the data runs out. To get only windows of exactly 5 elements, we set `drop_remainder=True`:

In [14]:
dataset = tf.data.Dataset.range(10)
window = dataset.window(5, shift=1, drop_remainder=True)
for i, w in enumerate(window):
    print("#window:",i,":",end=" ")
    for val in w:
        print(val.numpy(),end=" ")
    print()

#window: 0 : 0 1 2 3 4 
#window: 1 : 1 2 3 4 5 
#window: 2 : 2 3 4 5 6 
#window: 3 : 3 4 5 6 7 
#window: 4 : 4 5 6 7 8 
#window: 5 : 5 6 7 8 9 
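
The windowing behaviour seen in the two cells above can be imitated in plain Python. This is only a rough sketch to make the semantics visible; `sliding_windows` is a hypothetical helper written for this note, not part of TensorFlow, and it ignores options such as `stride`:

```python
def sliding_windows(values, size, shift=1, drop_remainder=False):
    """Return a list of windows, each starting `shift` elements after the previous one."""
    values = list(values)
    windows = []
    for start in range(0, len(values), shift):
        window = values[start:start + size]
        # Near the end of the data a window can come up short;
        # drop_remainder=True discards it and every window after it.
        if drop_remainder and len(window) < size:
            break
        windows.append(window)
    return windows

for w in sliding_windows(range(10), 5, drop_remainder=True):
    print(w)
```

Calling it with `drop_remainder=False` should also return the shorter trailing windows, ending with `[9]`.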


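3. Now converting the data to NumPy format: each window is itself a small dataset, so `flat_map` with `batch(5)` flattens it into a single tensor that `.numpy()` can turn into an array: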

In [18]:
dataset = tf.data.Dataset.range(10)
dataset = dataset.window(5, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda x: x.batch(5))
for window in dataset:
    print(window.numpy())

[0 1 2 3 4]
[1 2 3 4 5]
[2 3 4 5 6]
[3 4 5 6 7]
[4 5 6 7 8]
[5 6 7 8 9]


4. Splitting the dataset into features and labels:

In [23]:
dataset = tf.data.Dataset.range(10)
dataset = dataset.window(5, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda x: x.batch(5))
dataset = dataset.map(lambda x: (x[:-1], x[-1:]))
print("features\tlabel")
for x,y in dataset:
    print(x.numpy(),"\t",y.numpy())

features	label
[0 1 2 3] 	 [4]
[1 2 3 4] 	 [5]
[2 3 4 5] 	 [6]
[3 4 5 6] 	 [7]
[4 5 6 7] 	 [8]
[5 6 7 8] 	 [9]


5. Shuffling the data to avoid sequence bias during training:


**Sequence bias**:

Sequence bias is when the order of items influences which one gets selected. For example, if I asked you for your favorite TV show and listed "Game of Thrones", "Killing Eve", "Travellers" and "Doctor Who" in that order, you would probably be more likely to pick "Game of Thrones": you are familiar with it and it is the first thing you see, even if you like the other shows just as much. When training a model on a dataset, we don't want the order of the examples to influence training in a similar way, so it's good to shuffle them.


**What is a shuffle buffer?**

Using a shuffle buffer speeds things up a bit. For example, if you have 100,000 items in your dataset but set the buffer size to 1,000, the dataset first fills the buffer with the first 1,000 elements and picks one of them at random. It then replaces that element with the 1,001st element before randomly picking again, and so on. This way, with very large datasets, each random pick is made from a smaller pool, which effectively speeds things up. The trade-off is that a buffer smaller than the dataset only shuffles approximately; a buffer at least as large as the dataset gives a full shuffle.
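
To make the fill, pick and replace cycle concrete, here is a minimal pure-Python sketch of the idea. `buffered_shuffle` is a hypothetical helper written for illustration; it is not how TensorFlow implements `shuffle` internally, it only mimics the behaviour described above:

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a random order limited by a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)           # fill the buffer first
            continue
        idx = rng.randrange(len(buffer))
        yield buffer[idx]                 # emit a randomly chosen buffered element
        buffer[idx] = item                # replace it with the next incoming element
    rng.shuffle(buffer)                   # input exhausted: drain what is left
    yield from buffer

shuffled = list(buffered_shuffle(range(10), buffer_size=3, seed=0))
print(shuffled)
```

With a buffer of 3, the first emitted value can only be 0, 1 or 2, which is why a small buffer on a large dataset leaves nearby elements close together.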

In [24]:
dataset = tf.data.Dataset.range(10)
dataset = dataset.window(5, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda x: x.batch(5))
dataset = dataset.map(lambda x: (x[:-1], x[-1:]))
dataset = dataset.shuffle(buffer_size=10)
print("features\tlabel")
for x,y in dataset:
    print(x.numpy(),"\t",y.numpy())

features	label
[2 3 4 5] 	 [6]
[4 5 6 7] 	 [8]
[1 2 3 4] 	 [5]
[0 1 2 3] 	 [4]
[5 6 7 8] 	 [9]
[3 4 5 6] 	 [7]


6. Batching the data: this allows faster training by feeding the model $n$ (features, label) pairs at a time (here $n = 2$), while `prefetch(1)` prepares the next batch as the current one is being processed:

In [26]:
dataset = tf.data.Dataset.range(10)
dataset = dataset.window(5, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda x: x.batch(5))
dataset = dataset.map(lambda x: (x[:-1], x[-1:]))
dataset = dataset.shuffle(buffer_size=10)
dataset = dataset.batch(2).prefetch(1)

for x,y in dataset:
    print(x.numpy(),y.numpy())

[[3 4 5 6]
 [1 2 3 4]] [[7]
 [5]]
[[4 5 6 7]
 [2 3 4 5]] [[8]
 [6]]
[[5 6 7 8]
 [0 1 2 3]] [[9]
 [4]]
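
The six steps above can be collected into one reusable function. The following is a sketch under stated assumptions: the name `windowed_dataset` and its parameters are chosen here for illustration, and the input is assumed to be a 1-D sequence of numbers rather than `Dataset.range`:

```python
import tensorflow as tf

def windowed_dataset(series, window_size, batch_size, shuffle_buffer):
    """Turn a 1-D series into shuffled batches of (features, label) pairs."""
    dataset = tf.data.Dataset.from_tensor_slices(series)
    # Each window holds window_size features plus one label.
    dataset = dataset.window(window_size + 1, shift=1, drop_remainder=True)
    # Flatten each window (a small dataset) into a single tensor.
    dataset = dataset.flat_map(lambda w: w.batch(window_size + 1))
    # Everything except the last value is the features; the last value is the label.
    dataset = dataset.map(lambda w: (w[:-1], w[-1:]))
    dataset = dataset.shuffle(shuffle_buffer)
    return dataset.batch(batch_size).prefetch(1)

batches = windowed_dataset(list(range(10)), window_size=4,
                           batch_size=2, shuffle_buffer=10)
for x, y in batches:
    print(x.numpy().shape, y.numpy().shape)
```

Keeping the window length as `window_size + 1` means the caller only has to think about how many past values the model sees, not about the extra slot for the label.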
