# `tf.data.Dataset`
In previous notebooks, the following code cell was a memory hog (because of `X`) and took a long time to run.
In this notebook, our objective is to construct the same dataset using `tf` operations
instead of `numpy` ones, hoping to reduce both memory usage and dataset construction time.
```python
%%time
S = set(range(0, 9+1))
index_instance = 0
for length in range(2, max_length+1):    
    n_permutations = factorial(length)
    for c in combinations(S, length):
        for p in permutations(c):
            X[index_instance, :length, :] = one_hot(np.array(p))
            Y[index_instance, :] = np.concatenate((np.argsort(p), np.arange(length, max_length)))
            index_instance += 1
```
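
To make the label construction in the cell above concrete, here is a minimal sketch for a single permutation. The `one_hot` helper below is a hypothetical stand-in (the original helper is not shown in this notebook), and `max_length` is shortened to 5 purely for readability.

```python
import numpy as np

n_classes, max_length = 10, 5  # max_length shortened here for readability


def one_hot(indices, depth=n_classes):
    """Hypothetical stand-in for the notebook's `one_hot` helper."""
    return np.eye(depth, dtype=np.float32)[indices]


p = (3, 1, 2)                  # one permutation of the combination {1, 2, 3}
length = len(p)

x = np.zeros((max_length, n_classes), dtype=np.float32)
x[:length, :] = one_hot(np.array(p))   # rows beyond `length` stay all-zero padding

# argsort gives the positions that sort p; padded slots simply point at themselves
y = np.concatenate((np.argsort(p), np.arange(length, max_length)))

print(y)                          # [1 2 0 3 4]
print(np.array(p)[y[:length]])    # [1 2 3]
```

Indexing `p` with the first `length` entries of `y` recovers the sorted sequence, which is what the model is asked to predict.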

## `dataset = tf.data.Dataset.from_tensor_slices(X)`

In [1]:
import tensorflow as tf
import tensorflow.keras as keras
import numpy as np
from functools import reduce
from itertools import combinations, permutations
from math import factorial
import sys

In [2]:
n_classes = 10
max_length = 10
n_instances = sum([reduce(lambda x, y: x*y, range(n_classes,n_classes-length,-1)) for length in range(2, max_length+1)])
n_instances

9864090
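
As a cross-check, each term of the sum above is the number of ordered selections of `length` distinct digits out of `n_classes`, i.e. combinations times permutations. A short sketch with `math.perm` (assuming Python 3.8 or newer) gives the same count:

```python
from math import perm

n_classes, max_length = 10, 10

# perm(n, k) = n! / (n - k)!  counts ordered selections of k distinct items
n_instances = sum(perm(n_classes, length) for length in range(2, max_length + 1))
print(n_instances)  # 9864090
```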

The following `X` will be our dataset (including training/validation/test sets).

In [3]:
help(sys.getsizeof)

Help on built-in function getsizeof in module sys:

getsizeof(...)
    getsizeof(object, default) -> int
    
    Return the size of object in bytes.



In [4]:
!free -h

               total        used        free      shared  buff/cache   available
Mem:            15Gi       7.0Gi       2.8Gi       918Mi       5.7Gi       7.2Gi
Swap:           31Gi       2.4Gi        29Gi


In [5]:
X = np.zeros((n_instances, max_length, n_classes), dtype=np.float32)

In [6]:
!free -h

               total        used        free      shared  buff/cache   available
Mem:            15Gi       7.0Gi       2.8Gi       911Mi       5.7Gi       7.2Gi
Swap:           31Gi       2.4Gi        29Gi


In [7]:
sys.getsizeof(X)

3945636128

`3.9` billion bytes! That's more than `3GB`. Let's verify this number.

In [8]:
n_instances * max_length * n_classes * (32//8)

3945636000

In [9]:
del X

In [10]:
!free -h

               total        used        free      shared  buff/cache   available
Mem:            15Gi       7.0Gi       2.8Gi       912Mi       5.7Gi       7.2Gi
Swap:           31Gi       2.4Gi        29Gi


**(?)** To dive even deeper: Where did the extra `128` bytes go?

About right: The numbers are quite consistent.
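
A plausible answer to the question about the extra bytes: `sys.getsizeof` on an ndarray that owns its data reports the data buffer plus the array object's own header (shape, strides, and so on). The sketch below uses a small array as an assumption-free stand-in for `X`; the exact overhead depends on the NumPy version and the number of dimensions, so no specific value is claimed.

```python
import sys
import numpy as np

a = np.zeros((4, 5, 6), dtype=np.float32)

# a.nbytes counts only the data buffer: 4 * 5 * 6 elements * 4 bytes each
overhead = sys.getsizeof(a) - a.nbytes
print(a.nbytes, overhead)

# A view does not own its data, so getsizeof reports only the header
view = a[:2]
print(sys.getsizeof(view))
```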

<s>By contrast, it seems that `tf.zeros` does not allocate the memory immediately, taking only a memory of `184` bytes.</s>

In [11]:
!free -h

               total        used        free      shared  buff/cache   available
Mem:            15Gi       7.0Gi       2.8Gi       909Mi       5.7Gi       7.2Gi
Swap:           31Gi       2.4Gi        29Gi


In [12]:
X = tf.zeros((n_instances, max_length, n_classes), dtype=tf.float32)
# float32 unable to be allocated on 4GB-RAM X61s whereas int8 can.
#X = tf.zeros((n_instances, max_length, n_classes), dtype=tf.int8)

In [13]:
!free -h

               total        used        free      shared  buff/cache   available
Mem:            15Gi        10Gi       448Mi       909Mi       4.4Gi       3.5Gi
Swap:           31Gi       2.4Gi        29Gi


**(?)** Why is my 4GB-RAM Thinkpad X61s still unable to allocate this `X` using `tf`? Isn't that allocation a mere `184` bytes?

In [14]:
sys.getsizeof(X)

184

In [15]:
del X

In [16]:
!free -h

               total        used        free      shared  buff/cache   available
Mem:            15Gi       7.0Gi       4.1Gi       907Mi       4.4Gi       7.2Gi
Swap:           31Gi       2.4Gi        29Gi


Note that the `free` commands above showed that RAM consumption increased by approximately
- `  0GB` for the case of `np.zeros(dtype=np.float32)`
- `  1GB` for the case of `tf.zeros(dtype=tf.int8)`
- `3.7GB` for the case of `tf.zeros(dtype=tf.float32)`

So, even though `sys.getsizeof(X)` reports far fewer bytes for the `tf` tensor than for the `np` ndarray, the OS sees the opposite.<br>
**Suspicious...**
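
One possible reading of this discrepancy: `sys.getsizeof` only measures the Python-side object, and for a `tf` tensor the data buffer lives outside of it, so the `184` bytes say nothing about the real allocation. The `np.zeros` case probably shows about `0GB` because the OS can hand out zero-filled pages lazily until they are written to; that part is an assumption about the allocator and the OS, not something measured here. The sketch below, on a small tensor, compares the wrapper size with the buffer size computed from shape and dtype.

```python
import sys
import tensorflow as tf

X_small = tf.zeros((4, 5, 6), dtype=tf.float32)

# Size of the Python wrapper object only (version dependent)
wrapper_bytes = sys.getsizeof(X_small)

# Size of the underlying data: number of elements * bytes per element
buffer_bytes = X_small.shape.num_elements() * X_small.dtype.size

print(wrapper_bytes, buffer_bytes)
```

Applied to the full `X`, the same formula gives `n_instances * max_length * n_classes * 4` bytes, which matches the roughly `3.7GB` that `free` reported.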

## Workaround
Maybe we should abandon the idea of using `tf.data.Dataset.from_tensor_slices(X)`, because that approach might always require allocating a large block of memory first.

We start small and try to use `tf.data.Dataset`'s methods to construct an equivalent dataset.

**(?)** You've already seen in `ageron`'s homl2e that a dataset can contain tensors of different shapes. Try to make an example yourself.

In [17]:
lengths = tf.range(2, max_length+1)
dataset = tf.data.Dataset.from_tensor_slices(lengths)
dataset = dataset.map(lambda x: tf.range(x))

In [18]:
for tensor in dataset:
    print(tensor)

tf.Tensor([0 1], shape=(2,), dtype=int32)
tf.Tensor([0 1 2], shape=(3,), dtype=int32)
tf.Tensor([0 1 2 3], shape=(4,), dtype=int32)
tf.Tensor([0 1 2 3 4], shape=(5,), dtype=int32)
tf.Tensor([0 1 2 3 4 5], shape=(6,), dtype=int32)
tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([0 1 2 3 4 5 6 7], shape=(8,), dtype=int32)
tf.Tensor([0 1 2 3 4 5 6 7 8], shape=(9,), dtype=int32)
tf.Tensor([0 1 2 3 4 5 6 7 8 9], shape=(10,), dtype=int32)


**(?)** A big question that you haven't understood is: Should a `tf.data.Dataset` instance contain both `X` and `y`, i.e. data and labels, for supervised training? If so, how do we arrange `X` and `y`?
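
A short sketch that may help with this question: a dataset can hold `(X, y)` pairs as tuples, and Keras' `model.fit()` accepts a dataset whose elements are `(inputs, targets)` tuples. The arrays `X_small` and `Y_small` below are hypothetical toy data, not the real dataset.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data with the same per-instance shapes as in this notebook
X_small = np.zeros((6, 10, 10), dtype=np.float32)
Y_small = np.tile(np.arange(10, dtype=np.float32), (6, 1))

# Passing a tuple makes every dataset element an (x, y) pair
pairs = tf.data.Dataset.from_tensor_slices((X_small, Y_small)).batch(2)

for x_batch, y_batch in pairs.take(1):
    print(x_batch.shape, y_batch.shape)   # (2, 10, 10) (2, 10)
```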

### First try: `tf.data.Dataset.from_generator()`
As I imagine, we can keep the original code, keep the `for` loop, but instead of filling in each "row" of `X`, we make it a generator using the keyword `yield`. After implementing the generator using numpy, we pass the generator into `tf.data.Dataset.from_generator()` and we're done.

In [19]:
def dataset_generator():
    S = set(range(0, 9+1))
    index_instance = 0
    for length in range(2, max_length+1):    
        n_permutations = factorial(length)
        for c in combinations(S, length):
            for p in permutations(c):
                x = np.zeros((max_length, n_classes), dtype=np.float32)
                x[:length, :] = tf.one_hot(np.array(p),
                                           depth=n_classes).numpy()
                y = np.concatenate((np.argsort(p),
                                    np.arange(length, max_length)))
                yield x, y
                index_instance += 1

In [20]:
dataset = tf.data.Dataset.from_generator(
    dataset_generator,
    output_types=(tf.float32, tf.float32),
    output_shapes=([max_length, n_classes], [max_length]),
)

**Rmk**. Had we forgotten to specify `output_shapes`, the following cells would still run, up until
`model.fit()`, which would raise the following error:
```
ValueError : as_list() is not defined on an unknown TensorShape
```
`model.fit()` is able to run once we specify both `output_types` and `output_shapes`.

In the above, we have also provided (and disabled) an equivalent cell using `output_signature` instead of the `(output_types, output_shapes)` pair; the latter is deprecated and slated for removal in a future version.
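
For reference, a minimal sketch of the `output_signature` variant, assuming a TensorFlow version that supports this argument (2.4 or newer). `toy_generator` is a hypothetical stand-in for `dataset_generator` that yields only two instances; it casts `y` to `float32` explicitly so that it matches the declared spec.

```python
import numpy as np
import tensorflow as tf

max_length, n_classes = 10, 10


def toy_generator():
    """Hypothetical stand-in for `dataset_generator`: yields two (x, y) pairs."""
    for _ in range(2):
        x = np.zeros((max_length, n_classes), dtype=np.float32)
        y = np.arange(max_length, dtype=np.float32)
        yield x, y


# One TensorSpec per yielded component carries both dtype and shape
toy_dataset = tf.data.Dataset.from_generator(
    toy_generator,
    output_signature=(
        tf.TensorSpec(shape=(max_length, n_classes), dtype=tf.float32),
        tf.TensorSpec(shape=(max_length,), dtype=tf.float32),
    ),
)
print(toy_dataset.element_spec)
```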

In [21]:
for x, y in dataset.take(3):
    print(f"x =\n{x}")
    print(f"y =\n{y}")

x =
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
y =
[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
x =
[[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
y =
[1. 0. 2. 3. 4. 5. 6. 7. 8. 9.]
x =
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 

**Pros**

01. We no longer have to wait two to six minutes for `X` to be constructed.
02. Computers with little RAM can also run this code. Otherwise, they would not even be able to allocate enough memory for `X`.
03. Compared to building a `tf.data.Dataset` entirely from its own methods, `from_generator()` has the advantage of being a lot easier to implement. In fact, we did little more than replace the assignment of rows of `X` by `yield`.

**Cons**

01. We must think of a way to split the dataset into Training/Validation/Test sets because we no longer have the entire `X` to apply `train_test_split` from `sklearn`.
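
One possible way around this, sketched under assumptions: number the elements with `enumerate()` and route them to different splits by index. The `stream` below is a stand-in for the generator-backed dataset, and the 1-in-5 test ratio is arbitrary. Unlike `train_test_split`, this split is deterministic rather than random.

```python
import tensorflow as tf

stream = tf.data.Dataset.range(20)   # stand-in for the generator-backed dataset
indexed = stream.enumerate()         # elements become (index, element)

# Every 5th element goes to the test split, the rest to the training split
test_set = (indexed
            .filter(lambda i, x: tf.math.equal(i % 5, 0))
            .map(lambda i, x: x))
train_set = (indexed
             .filter(lambda i, x: tf.math.not_equal(i % 5, 0))
             .map(lambda i, x: x))

test_items = [int(v) for v in test_set.as_numpy_iterator()]
train_items = [int(v) for v in train_set.as_numpy_iterator()]
print(test_items)   # [0, 5, 10, 15]
```

For a dataset of `(x, y)` pairs, the lambdas would receive the index and the pair, e.g. `lambda i, xy: xy`. Note that each split iterates over the full stream, so the generator runs once per split.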

In [22]:
dataset = dataset.batch(32, drop_remainder=True)

In [23]:
for x, y in dataset.take(3):
    print(f"x.shape =\n{x.shape}")
    print(f"y.shape =\n{y.shape}")

x.shape =
(32, 10, 10)
y.shape =
(32, 10)
x.shape =
(32, 10, 10)
y.shape =
(32, 10)
x.shape =
(32, 10, 10)
y.shape =
(32, 10)


In [24]:
#https://keras.io/api/layers/reshaping_layers/reshape/
#https://keras.io/api/layers/activation_layers/softmax/
input_shape = (max_length, n_classes)
product_input_shape = int(np.prod(input_shape))  # np.product is deprecated; use np.prod
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=input_shape),
    keras.layers.Dense(product_input_shape, activation="relu"),
    #keras.layers.Dense(2*product_input_shape, activation="relu"),
    keras.layers.Dense(product_input_shape),
    keras.layers.Reshape(input_shape),
    keras.layers.Softmax(axis=-1),
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["acc"],
)

In [25]:
#checkpoint_cb = keras.callbacks.ModelCheckpoint("vanilla_NN_model_generator.h5")
# The dataset is already batched by dataset.batch(32), so batch_size is not passed to fit()
model.fit(dataset)

    310/Unknown - 5s 16ms/step - loss: 1.2368 - acc: 0.6771

KeyboardInterrupt: 

In [None]:
# labels
Y = np.empty((n_instances, max_length), dtype=np.float32)  

In [None]:
%%time
#X[...] = 0
S = set(range(0, 9+1))
index_instance = 0
#for length in tqdm(range(2, max_length+1)):
for length in range(2, max_length+1):    
    n_permutations = factorial(length)
    #n_combinations = n_instances // n_permutations
    #for i, c in enumerate(combinations(S, length)):
    for c in combinations(S, length):
        #for j, p in enumerate(permutations(c)):
        for p in permutations(c):
            #print(f"(index_instance/n_instances = {index_instance}/{n_instances})", end="\r")
            #print(f"np.array(p) = {np.array(p)}")
            X[index_instance, :length, :] = one_hot(np.array(p))#[..., np.newaxis]
            Y[index_instance, :] = np.concatenate((np.argsort(p), np.arange(length, max_length)))
            index_instance += 1

### Train/Validation/Test Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train_val, X_test, Y_train_val, Y_test = train_test_split(X, Y, test_size=0.2)
X_train_val.shape, X_test.shape

## Model

We might be able to use fewer neurons and still arrive at a similar performance. Running out of time, I did not try to tune the model; instead, I spent most of the time trying to implement more solutions.

In [None]:
model = keras.models.load_model("vanilla_NN_model.h5")
model.summary()

In [None]:
model.evaluate(X_test, Y_test)

## Evaluation on `X_test`
We certainly would like to have performance measures like accuracy, precision/recall, etc. But we must first write some convenience functions to facilitate the operations.

In [None]:
class Sorter:
    def __init__(self, model):
        self.model = model

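    def lenlen(self, x):
        """Return the sequence length: index of the last non-zero row of x, plus one."""
        somme = np.sum(x, axis=-1)       # row sums; padding rows sum to 0
        last_nonzero_index = -1
        for i, s in enumerate(somme):
            if s > 10**(-6):
                last_nonzero_index = i
        if last_nonzero_index == -1:     # no non-zero row was found
            length = 10
        else:
            length = last_nonzero_index + 1
        return length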

    def prettier(self, x, y):
        """
        x.shape = (10,10)
        """
        length = self.lenlen(x)
        xx = np.argmax(x[:length], axis=-1)
        sort_indices = y.astype(int)[:length]
        yy = xx[sort_indices]
        return xx, yy
    
    def evaluate(self, X, Y):
        Y_pred = self.model.predict(X)  # of shape (n_instances, 10, 10)
        Y = Y.astype(int)               # of shape (n_instances, 10)
        m = X.shape[0]
        n_correct = 0
        for i, x in enumerate(X):
            length = self.lenlen(x)
            y_pred = Y_pred[i]
            y_pred_sparse = np.argmax(y_pred, axis=-1)
            n_correct += np.array_equal(Y[i], y_pred_sparse)
        print(f"acc = {n_correct/m}")


In [None]:
sorter = Sorter(model)

In [None]:
%%time
sorter.evaluate(X_test, Y_test)
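
The per-instance loop in `Sorter.evaluate` can probably be replaced by a vectorized comparison. The sketch below is one way to do it, using a hypothetical helper name `exact_match_accuracy` and toy arrays with a sequence length of 3; it counts a row as correct only if every position matches, as the loop does.

```python
import numpy as np


def exact_match_accuracy(Y_true, Y_prob):
    """Fraction of rows whose predicted index sequence matches the label exactly."""
    Y_pred_sparse = np.argmax(Y_prob, axis=-1)   # (n_instances, max_length)
    return np.mean(np.all(Y_pred_sparse == Y_true.astype(int), axis=-1))


# Toy labels and "probabilities" (one-hot rows), not real model output
Y_true = np.array([[1, 0, 2], [0, 1, 2]], dtype=np.float32)
Y_prob = np.stack([np.eye(3)[[1, 0, 2]],    # matches the first label row
                   np.eye(3)[[1, 0, 2]]])   # differs from the second label row

print(exact_match_accuracy(Y_true, Y_prob))  # 0.5
```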