# Vanilla Neural Network
I think this question is less convenient to tackle with a vanilla neural network than with sequence models such as RNNs or LSTMs, because a vanilla NN normally expects inputs of **fixed length**. To accommodate input lengths from `2` to `10`, we would have to train `9` models, one for each input length `2, 3, ..., 10`.

**Note**. Creating `10-2+1 = 9` NNs is not a bad thing, and their combined performance might not be mediocre either. I simply find it more challenging and rewarding to look for a devise-once-use-everywhere solution, so I ended up spending most of the time on finding one.

## Correction
Actually, there are still ways to use a vanilla NN for this question. One of them is padding. Let's take a few examples to illustrate the point:

01. Input array `[0, 9, 7, 1, 2]`
  - We may use integers $\in \{91, 92, 93, \ldots, 98\}$ to pad, whence the padded input array becomes
  `[0, 9, 7, 1, 2, 91, 92, 93, 94, 95]`. We always pad until the padded array has length `10`.
  - As for the corresponding output, I choose to return the permutation which makes the padded input array sorted.
  In this particular example, that would be `[0, 3, 4, 2, 1, 5, 6, 7, 8, 9]`. Note how the last five indices
  have not been altered at all in this permutation.
02. Input array `[9, 0]`
  - The padded input array would be `[9, 0, 91, 92, 93, 94, 95, 96, 97, 98]`.
  - The output would be `[1, 0, 2, 3, 4, 5, 6, 7, 8, 9]`.
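
Both examples can be checked with `np.argsort`. The following is only a minimal sketch; the helper name `pad_and_argsort` and the constants are hypothetical and do not appear elsewhere in this notebook.

```python
import numpy as np

MAX_LENGTH = 10
PADDERS = list(range(91, 98 + 1))  # 91..98, all larger than any digit 0..9

def pad_and_argsort(array, max_length=MAX_LENGTH):
    """Pad `array` with 91, 92, ... up to `max_length`, then return the
    padded array together with the permutation that sorts it."""
    n_pad = max_length - len(array)
    padded = list(array) + PADDERS[:n_pad]
    return padded, np.argsort(padded).tolist()

print(pad_and_argsort([0, 9, 7, 1, 2]))
print(pad_and_argsort([9, 0]))
```

Because every padder is larger than every digit, the padded positions always keep their own indices at the tail of the permutation.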

This looks a little involved and artificial; nevertheless, it also brings some conveniences:

- The input and output now both have a fixed shape.
- A fixed shape makes creating the dataset much easier, which in turn makes splitting it into train/validation/test sets easier.

### Correction inside correction
We said that we wanted to use `91..98` as padders. However, we would normally one-hot encode `X` during dataset preparation, and it is unclear how `91..98` should be mapped by a one-hot encoding with only ten classes. So let's drop the `91..98` idea and simply **pad with zero vectors**.
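
Zero-vector padding for a single example could look roughly like this. It is a sketch under the assumption of ten classes and a maximum length of ten; the function name `encode_example` is hypothetical.

```python
import numpy as np

N_CLASSES = 10
MAX_LENGTH = 10

def encode_example(digits, max_length=MAX_LENGTH, n_classes=N_CLASSES):
    """Return (x, y): x is the one-hot input padded with zero rows,
    y is the sorting permutation followed by the untouched padding indices."""
    digits = np.asarray(digits)
    length = len(digits)
    x = np.zeros((max_length, n_classes), dtype=np.float32)
    x[:length] = np.eye(n_classes, dtype=np.float32)[digits]
    y = np.concatenate((np.argsort(digits), np.arange(length, max_length)))
    return x, y

x, y = encode_example([0, 9, 7, 1, 2])
print(x.sum(axis=1))  # 1 for each of the five real rows, 0 for each padding row
print(y)
```

A zero row cannot be confused with the digit `0`, whose one-hot row has a `1` in column `0`.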

In [1]:
padders = list(range(91, 98+1))
padders

[91, 92, 93, 94, 95, 96, 97, 98]

In [2]:
import tensorflow as tf
import tensorflow.keras as keras
import numpy as np
from functools import reduce
from itertools import combinations, permutations
from math import factorial

The following `X` will be our dataset (including training/validation/test sets).

In [3]:
n_classes = 10
max_length = 10
n_instances = sum([reduce(lambda x, y: x*y, range(n_classes,n_classes-length,-1)) for length in range(2, max_length+1)])
n_instances

9864090
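
The count above is the number of ordered selections of `k` distinct digits out of `10`, summed over `k = 2, ..., 10`. The same arithmetic can be cross-checked with `math.perm` (available from Python 3.8); this is only a sanity check, not part of the pipeline.

```python
from math import perm

n_classes, max_length = 10, 10
# perm(n, k) = n! / (n - k)!: ordered arrangements of k distinct digits.
per_length = {k: perm(n_classes, k) for k in range(2, max_length + 1)}
total = sum(per_length.values())
print(per_length[2], per_length[10], total)  # 90 3628800 9864090
```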

In [4]:
X = np.zeros((n_instances, max_length, n_classes), dtype=np.float32)

I said in `README.md` that a CNN is of little use here because we are not dealing with images. The shape of `X` does look like a single-channel image, but using a CNN to extract local features still makes little sense, so we will probably stick to our plan: the first layer of our vanilla NN would be a `keras.layers.Flatten`, followed by a few fully connected layers.

In [5]:
np.argsort([9,5,0,3])

array([2, 3, 1, 0])

In [6]:
A = np.array([9,5,0,3])
A[np.argsort(A)]

array([0, 3, 5, 9])

In [7]:
def one_hot(array, depth=n_classes):
    """
    array is an ndarray of shape (None,)
    """
    return np.eye(depth)[array, :]

In [8]:
one_hot(A)

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.]])

In [9]:
tf.one_hot(A, depth=10)

<tf.Tensor: shape=(4, 10), dtype=float32, numpy=
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.]], dtype=float32)>

In [10]:
# labels
Y = np.empty((n_instances, max_length), dtype=np.float32)  

In [11]:
np.concatenate(([1,2,3], [4,5,6]))

array([1, 2, 3, 4, 5, 6])

In [12]:
p = (4,3,9,7,8)
length = len(p)
np.concatenate((np.argsort(p), np.arange(length, max_length)))

array([1, 0, 3, 4, 2, 5, 6, 7, 8, 9])

In [32]:
%%time
X[...] = 0
S = set(range(0, 9+1))
index_instance = 0
for length in range(2, max_length+1):
    for c in combinations(S, length):
        for p in permutations(c):
            # One-hot encode the digits; rows beyond `length` stay zero (padding).
            X[index_instance, :length, :] = one_hot(np.array(p))
            # Label: the sorting permutation, followed by the untouched padding indices.
            Y[index_instance, :] = np.concatenate((np.argsort(p), np.arange(length, max_length)))
            index_instance += 1

CPU times: user 6min 57s, sys: 0 ns, total: 6min 57s
Wall time: 6min 57s


**(?)** Improvement. Constructing `X, Y` as above takes rather long (almost 7 min on a Thinkpad X200); concurrent programming could be considered to speed it up.
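
Before reaching for threads or processes, it may be worth trying to vectorize the innermost loop. The sketch below swaps in a different technique (NumPy fancy indexing per combination, not concurrency) and is an assumption on my part rather than something measured here: the function name `build_dataset` is hypothetical, and for `length = 10` the temporary arrays become large, so memory is traded for speed.

```python
import numpy as np
from itertools import combinations, permutations
from math import factorial, perm

def build_dataset(n_classes, max_length):
    """Build (X, Y) in the same order as the triple loop, but fill all
    permutations of one combination with a single array assignment."""
    n_instances = sum(perm(n_classes, k) for k in range(2, max_length + 1))
    X = np.zeros((n_instances, max_length, n_classes), dtype=np.float32)
    Y = np.empty((n_instances, max_length), dtype=np.float32)
    eye = np.eye(n_classes, dtype=np.float32)
    start = 0
    for length in range(2, max_length + 1):
        # All orderings of the positions 0..length-1, shape (length!, length).
        orderings = np.array(list(permutations(range(length))))
        n_perm = factorial(length)
        tail = np.arange(length, max_length)  # indices of the padding rows
        for c in combinations(range(n_classes), length):
            values = np.array(c)[orderings]   # every permutation of c
            stop = start + n_perm
            X[start:stop, :length, :] = eye[values]
            Y[start:stop, :length] = np.argsort(values, axis=1)
            Y[start:stop, length:] = tail
            start = stop
    return X, Y

X_small, Y_small = build_dataset(n_classes=4, max_length=4)
print(X_small.shape, Y_small.shape)  # (60, 4, 4) (60, 4)
```

The small call with four classes keeps the demonstration cheap; the full dataset would correspond to `build_dataset(10, 10)`.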

In [33]:
index = half = n_instances // 2
print(f"X[index] =\n{X[index]}")
print(f"Y[index] = {Y[index]}")

X[index] =
[[0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
Y[index] = [5. 8. 2. 0. 6. 4. 1. 7. 3. 9.]


In [34]:
index = -1
print(f"X[index] =\n{X[index]}")
print(f"Y[index] = {Y[index]}")

X[index] =
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
Y[index] = [9. 8. 7. 6. 5. 4. 3. 2. 1. 0.]


In [35]:
index = 2
print(f"X[index] =\n{X[index]}")
print(f"Y[index] = {Y[index]}")

X[index] =
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
Y[index] = [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]


As we can see, our dataset is more or less correct now, except that `X` builds quite slowly. A threaded or multi-process version would be desirable.

### Shuffling
We'd better shuffle `X` and `Y` together, so that every row of `X` keeps its label in `Y`.

In [41]:
A = np.arange(7*2).reshape((7,2))
A

array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11],
       [12, 13]])

In [42]:
np.random.shuffle(A)
A

array([[ 8,  9],
       [12, 13],
       [10, 11],
       [ 4,  5],
       [ 2,  3],
       [ 6,  7],
       [ 0,  1]])

In [43]:
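# Apply one shared permutation to X and Y; shuffling X alone would break the pairing.
perm = np.random.permutation(n_instances)
X, Y = X[perm], Y[perm]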
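
Why the two arrays must share one permutation can be seen on toy data. This is a small illustrative sketch; `X_toy`, `Y_toy` and `is_aligned` are hypothetical names. Note that indexing with a permutation makes a copy, which matters for an array as large as `X`.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = np.arange(1000 * 2).reshape(1000, 2)  # row i is [2i, 2i+1]
Y_toy = np.arange(1000)                        # label i belongs to row i

def is_aligned(X, Y):
    """True if every row of X still sits next to its original label."""
    return bool(np.array_equal(X[:, 0] // 2, Y))

# One shared permutation keeps the (row, label) pairs intact.
perm = rng.permutation(len(X_toy))
X_shuf, Y_shuf = X_toy[perm], Y_toy[perm]
print(is_aligned(X_shuf, Y_shuf))  # True

# Shuffling only the inputs detaches the rows from their labels.
X_only = X_toy.copy()
rng.shuffle(X_only)
print(is_aligned(X_only, Y_toy))
```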

### Train/Validation/Test Split

In [44]:
from sklearn.model_selection import train_test_split

In [52]:
X_train_val, X_test, Y_train_val, Y_test = train_test_split(X, Y, test_size=0.2)
X_train_val.shape, X_test.shape

((7891272, 10, 10), (1972818, 10, 10))

In [48]:
(X_train_val.shape[0] + X_test.shape[0]) - n_instances

0

In [51]:
Y.shape

(9864090, 10)
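
The split above only separates a test set; the validation set is carved out later by `validation_split=0.2` in `model.fit`. If an explicit three-way split is preferred, it could be sketched with NumPy alone as below. The function `three_way_split` and its parameters are hypothetical, and the fractions merely mirror the `0.2` values used in this notebook.

```python
import numpy as np

def three_way_split(X, Y, test_size=0.2, val_size=0.2, seed=0):
    """Split (X, Y) into train/validation/test with one shared permutation.
    `val_size` is a fraction of what remains after removing the test set."""
    n = len(X)
    perm = np.random.default_rng(seed).permutation(n)
    n_test = int(round(n * test_size))
    n_val = int(round((n - n_test) * val_size))
    test_idx = perm[:n_test]
    val_idx = perm[n_test:n_test + n_val]
    train_idx = perm[n_test + n_val:]
    return ((X[train_idx], Y[train_idx]),
            (X[val_idx], Y[val_idx]),
            (X[test_idx], Y[test_idx]))

X_toy = np.arange(100).reshape(100, 1)
Y_toy = np.arange(100)
train, val, test = three_way_split(X_toy, Y_toy)
print(len(train[0]), len(val[0]), len(test[0]))  # 64 16 20
```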

## Model

In [50]:
np.prod(X.shape[1:])  # np.product is a deprecated alias of np.prod

100

In [55]:
#https://keras.io/api/layers/reshaping_layers/reshape/
#https://keras.io/api/layers/activation_layers/softmax/
n_outputs = int(np.prod(X.shape[1:]))  # 10 * 10 = 100
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=X.shape[1:]),
    keras.layers.Dense(2*n_outputs, activation="relu"),
    keras.layers.Dense(n_outputs),
    # Reshape to (10, 10) so that the softmax is applied once per output position.
    keras.layers.Reshape(X.shape[1:]),
    keras.layers.Softmax(axis=-1),
])

model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
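
As I read the model, its output has one row per output position and a softmax over the last axis, which the labels in `Y` treat as a source index in the padded input. The output shape coincides with `X.shape[1:]` only because `max_length` and `n_classes` are both `10`. Under that reading, a prediction could be turned back into a sorted array as sketched below; `decode_sorted` is a hypothetical helper, and the "prediction" used here is a perfect one built from the label, not real model output.

```python
import numpy as np

def decode_sorted(x_onehot, probs):
    """x_onehot: (max_length, n_classes) input with zero rows as padding.
    probs: (max_length, max_length) per-position distribution over source indices.
    Returns the input digits reordered by the predicted permutation.
    Assumes the prediction is a valid permutation of the real positions."""
    length = int(x_onehot.sum())              # number of non-padding rows
    digits = x_onehot[:length].argmax(axis=-1)
    perm = probs.argmax(axis=-1)[:length]     # predicted source index per slot
    return digits[perm]

digits = np.array([0, 9, 7, 1, 2])
x = np.zeros((10, 10), dtype=np.float32)
x[:len(digits)] = np.eye(10, dtype=np.float32)[digits]
y = np.concatenate((np.argsort(digits), np.arange(len(digits), 10)))
perfect_probs = np.eye(10, dtype=np.float32)[y]  # stand-in for model output
print(decode_sorted(x, perfect_probs))  # [0 1 2 7 9]
```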

In [62]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 100)               0         
_________________________________________________________________
dense (Dense)                (None, 200)               20200     
_________________________________________________________________
dense_1 (Dense)              (None, 100)               20100     
_________________________________________________________________
reshape (Reshape)            (None, 10, 10)            0         
_________________________________________________________________
softmax (Softmax)            (None, 10, 10)            0         
Total params: 40,300
Trainable params: 40,300
Non-trainable params: 0
_________________________________________________________________
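
The parameter counts in the summary follow from the usual dense-layer formula, weights plus biases. A quick check (the helper `dense_params` is just for illustration):

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

hidden = dense_params(100, 200)
output = dense_params(200, 100)
print(hidden, output, hidden + output)  # 20200 20100 40300
```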


In [63]:
# add some callbacks before beginning training.
checkpoint_cb = keras.callbacks.ModelCheckpoint("vanilla_NN_model.h5")

model.fit(X_train_val,
         Y_train_val,
         #steps_per_epoch=60_000,
         epochs=2,
         validation_split=0.2,
         verbose=True,
         callbacks=[checkpoint_cb],
)

Epoch 1/2


TypeError: 'NoneType' object is not callable