# Diff Dataset
Recall that in `../01-vanilla_NN/01-vanilla_NN.ipynb` we devised a dataset in which every sequence has length
`10`, which makes the dataset easier to split into Train/Val/Test sets. The same dataset can, of course, also be
fed to sequential models, and that is exactly what I plan to do in this notebook.

In [1]:
import tensorflow as tf
import tensorflow.keras as keras
import numpy as np
from functools import reduce
from itertools import combinations, permutations
from math import factorial

The following `X` will be our dataset (including training/validation/test sets).

In [2]:
n_classes = 10
max_length = 10
n_instances = sum([reduce(lambda x, y: x*y, range(n_classes,n_classes-length,-1)) for length in range(2, max_length+1)])
n_instances

9864090
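
The value above is the number of ordered selections of `length` distinct digits out of `n_classes`, summed over every length from 2 to `max_length`. As a cross-check (a sketch, not a cell from the original notebook), the same count can be computed with `math.perm`, which assumes Python 3.8 or newer:

```python
from math import perm

n_classes = 10
max_length = 10

# perm(n, k) = n! / (n - k)!: ordered selections of k distinct items out of n.
counts = {length: perm(n_classes, length) for length in range(2, max_length + 1)}
n_instances = sum(counts.values())
print(n_instances)
```

Lengths 9 and 10 each contribute `10!` sequences, so together they make up roughly three quarters of the dataset.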

In [3]:
X = np.zeros((n_instances, max_length, n_classes), dtype=np.float32)

I said in `README.md` that a CNN is of little use here because we are not dealing with images. The shape of `X` does look like a batch of single-channel images, but using a CNN to extract local features still makes little sense, so we will stick to our plan: the first layer of our vanilla NN will be a `keras.layers.Flatten`, followed by a few fully connected layers.

In [4]:
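def one_hot(array, depth=n_classes):
    """
    array is an integer ndarray of shape (n,) with values in [0, depth).
    Returns an ndarray of shape (n, depth) whose i-th row is the one-hot encoding of array[i].
    """
    return np.eye(depth)[array, :]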
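
To make the indexing trick concrete, here is a small standalone example. The helper is repeated so the snippet runs on its own, and `depth=3` is chosen only to keep the output short:

```python
import numpy as np

def one_hot(array, depth=10):
    """Row i of the result is the one-hot encoding of array[i]."""
    return np.eye(depth)[array, :]

encoded = one_hot(np.array([2, 0, 1]), depth=3)
print(encoded)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]]
```

Indexing the identity matrix with an integer array simply picks out the corresponding rows, which is why no explicit loop is needed.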

In [6]:
# labels
Y = np.empty((n_instances, max_length), dtype=np.float32)  

In [7]:
%%time
S = set(range(0, 9+1))
index_instance = 0
for length in range(2, max_length+1):
    # every choice of `length` distinct digits ...
    for c in combinations(S, length):
        # ... in every possible order
        for p in permutations(c):
            # the first `length` rows hold the one-hot digits; the remaining rows stay all-zero
            X[index_instance, :length, :] = one_hot(np.array(p))
            # label: argsort of the digits, followed by the indices of the padding rows
            Y[index_instance, :] = np.concatenate((np.argsort(p), np.arange(length, max_length)))
            index_instance += 1

CPU times: user 6min 23s, sys: 2.94 s, total: 6min 26s
Wall time: 6min 27s
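
To see what a single row of `X` and `Y` looks like, the body of the loop can be run by hand for one short permutation. This is only an illustration; the permutation `(3, 1, 2)` is an arbitrary choice:

```python
import numpy as np

n_classes = 10
max_length = 10

p = (3, 1, 2)      # one ordering of the combination {1, 2, 3}
length = len(p)

# Input: the first `length` rows are one-hot digits, the other rows are zero padding.
x = np.zeros((max_length, n_classes), dtype=np.float32)
x[:length, :] = np.eye(n_classes)[np.array(p), :]

# Label: argsort of the digits, then the indices length..max_length-1 for the padding rows.
y = np.concatenate((np.argsort(p), np.arange(length, max_length)))
print(y)  # [1 2 0 3 4 5 6 7 8 9]
```

Here `y[i]` is the position in `p` of the `i`-th smallest digit: the smallest digit `1` sits at position 1, `2` at position 2, and `3` at position 0. The padding rows simply map to their own indices.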


## Train/Validation/Test Split

In [67]:
from sklearn.model_selection import train_test_split

In [52]:
X_train_val, X_test, Y_train_val, Y_test = train_test_split(X, Y, test_size=0.2)
X_train_val.shape, X_test.shape

((7891272, 10, 10), (1972818, 10, 10))

## Model

In [50]:
np.prod(X.shape[1:])

100
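
`model` is used in the next cell, but the cell that builds it does not appear here. The following is a minimal sketch that reproduces the layer types, output shapes, and parameter counts of the summary that follows. The hidden activation, the loss, and the optimizer are assumptions, since they cannot be read off the summary. Sparse categorical cross-entropy is a plausible choice because `Y` stores integer positions while the network outputs a 10-way softmax per position.

```python
from tensorflow import keras

max_length, n_classes = 10, 10

model = keras.Sequential([
    keras.Input(shape=(max_length, n_classes)),
    keras.layers.Flatten(),                          # (None, 100)
    keras.layers.Dense(200, activation="relu"),      # activation is an assumption
    keras.layers.Dense(max_length * n_classes),      # (None, 100)
    keras.layers.Reshape((max_length, n_classes)),   # one row of scores per position
    keras.layers.Softmax(),                          # softmax over the last axis
])

# Loss and optimizer are assumptions, not taken from the original notebook.
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
```

The auto-generated layer names (for example `flatten` instead of `flatten_1`) depend on how many models were built earlier in the session, so they may differ from the summary.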

In [65]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
flatten_1 (Flatten)          (None, 100)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 200)               20200     
_________________________________________________________________
dense_3 (Dense)              (None, 100)               20100     
_________________________________________________________________
reshape_1 (Reshape)          (None, 10, 10)            0         
_________________________________________________________________
softmax_1 (Softmax)          (None, 10, 10)            0         
=================================================================
Total params: 40,300
Trainable params: 40,300
Non-trainable params: 0
_________________________________________________________________
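
The parameter counts in the summary above can be verified by hand: a fully connected layer has one weight per input-output pair plus one bias per output unit, while `Flatten`, `Reshape`, and `Softmax` have no parameters. A short check (the helper name `dense_params` is made up for this illustration):

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

flat = 10 * 10                       # size of the flattened input
p_dense_2 = dense_params(flat, 200)  # 100 * 200 + 200
p_dense_3 = dense_params(200, 100)   # 200 * 100 + 100
total = p_dense_2 + p_dense_3
print(p_dense_2, p_dense_3, total)   # 20200 20100 40300
```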


In [66]:
# Add a checkpoint callback before training; by default it saves the model after every epoch.
checkpoint_cb = keras.callbacks.ModelCheckpoint("vanilla_NN_model.h5")

model.fit(X_train_val,
          Y_train_val,
          #steps_per_epoch=60_000,
          epochs=2,
          validation_split=0.2,
          verbose=True,
          callbacks=[checkpoint_cb],
)

Epoch 1/2
  5880/197282 [..............................] - ETA: 5:54 - loss: 2.1663

KeyboardInterrupt: 
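
The `197282` steps per epoch shown in the progress bar follow from the split sizes. The sketch below redoes the arithmetic, assuming the Keras default batch size of 32 (no `batch_size` was passed to `fit`) and assuming that `validation_split` holds out the last 20% of the rows:

```python
from math import ceil

n_train_val = 7_891_272   # rows in X_train_val, from the split above
validation_split = 0.2
batch_size = 32           # Keras default when batch_size is not given

# Assumption: the last `validation_split` fraction of the rows is held out for validation.
n_train = int(n_train_val * (1 - validation_split))
n_val = n_train_val - n_train
steps_per_epoch = ceil(n_train / batch_size)
print(n_train, n_val, steps_per_epoch)
```

Overall this gives roughly a 64/16/20 train/validation/test split of the 9,864,090 instances.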