In [12]:
import numpy as np
import tensorflow as tf

# TensorFlow Dataset

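A TF Dataset behaves like a Monad: a container that exposes its contents only through its interface (I/F). You cannot access the internal elements directly; you have to go through methods such as ```map```, ```take``` and ```reduce```.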

This is similar to a Spark RDD: you do not access individual elements until the transformations have run and the result has been reduced to a single row, because the elements are distributed over nodes.

<img src="image/what_is_tensorflow_dataset.jpg" align="left" width=500/>
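
A minimal sketch of the "use the interface" idea, assuming TensorFlow 2.x in eager mode. The three values and the names `ds`, `doubled` and `indexing_supported` are made up for illustration. It tries plain indexing, which Dataset does not offer, and then reaches the elements through ```map``` and ```as_numpy_iterator()``` instead.

```python
import numpy as np
import tensorflow as tf

# A small illustrative dataset of three scalar elements (values are arbitrary).
ds = tf.data.Dataset.from_tensor_slices(np.array([10, 20, 30]))

# Dataset defines no __getitem__, so ds[0] is expected to raise TypeError.
try:
    ds[0]
    indexing_supported = True
except TypeError:
    indexing_supported = False

# Elements are reached through the interface instead: transform, then iterate.
doubled = ds.map(lambda v: v * 2)
values = list(doubled.as_numpy_iterator())
print(indexing_supported, values)
```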

# from_tensor_slices

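```tf.constant()``` is to Tensor what ```tf.data.Dataset.from_tensor_slices()``` is to Dataset. Both build their object from in-memory data; ```from_tensor_slices``` also slices the input along its first axis, so each slice becomes one element of the Dataset.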

* [from_tensor_slices](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices)
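
To make the analogy concrete, the sketch below builds a Tensor and a Dataset from the same array and compares their shapes. It assumes TensorFlow 2.3 or later, where ```cardinality()``` is available; the variable names are illustrative.

```python
import numpy as np
import tensorflow as tf

x = np.arange(2 * 3 * 4).reshape((2, 3, 4))

# tf.constant wraps the whole array in one Tensor of shape (2, 3, 4).
tensor = tf.constant(x)

# from_tensor_slices splits along axis 0: 2 elements, each of shape (3, 4).
dataset = tf.data.Dataset.from_tensor_slices(x)

print(tensor.shape)                        # shape of the whole array
print(dataset.element_spec.shape)          # shape of one element
print(int(dataset.cardinality().numpy()))  # number of elements
```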

In [69]:
x: np.ndarray = np.arange(2*3*4).reshape((2,3,4))
x

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])

In [70]:
dataset: tf.data.Dataset = tf.data.Dataset.from_tensor_slices(x)
dataset

<TensorSliceDataset element_spec=TensorSpec(shape=(3, 4), dtype=tf.int64, name=None)>

In [71]:
for row in dataset:
    print(row)

tf.Tensor(
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]], shape=(3, 4), dtype=int64)
tf.Tensor(
[[12 13 14 15]
 [16 17 18 19]
 [20 21 22 23]], shape=(3, 4), dtype=int64)


# Extract a single record

Reduce the dataset to a single record (for example with ```take(1)```), then apply ```get_single_element()```.

* [get_single_element](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#get_single_element)

In [10]:
dataset.take(1).get_single_element()

<tf.Tensor: shape=(3, 4), dtype=int64, numpy=
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])>
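
The same pattern can pick a record other than the first. The sketch below, assuming a TensorFlow 2.x version where ```get_single_element()``` is a Dataset method, uses ```skip(1).take(1)``` to isolate the second record. It also shows why narrowing is needed: called on a dataset that still holds two records, ```get_single_element()``` is documented to fail with ```InvalidArgumentError``` at runtime.

```python
import numpy as np
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(
    np.arange(2 * 3 * 4).reshape((2, 3, 4))
)

# skip(1) drops the first record, take(1) keeps only the next one.
second = dataset.skip(1).take(1).get_single_element()
print(second.numpy())

# Without narrowing first, the dataset still holds two records,
# so get_single_element() fails at runtime.
try:
    dataset.get_single_element()
    raised = False
except tf.errors.InvalidArgumentError:
    raised = True
print(raised)
```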

# Aggregate (Reduce)

* [reduce](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#reduce)

```
reduce(
    initial_state, reduce_func, name=None
)
```

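```reduce``` is like ```monad.foldLeft(initial)(func)``` in Scala: it walks the dataset row by row and carries an accumulated state from the previous row into the current one, much as SQL LAG gives access to the previous row.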

```
list.foldLeft(0)((left, right) => left + right)  // left starts as the initial value 0; each right element is added to it
```
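
For readers who do not use Scala, the same left fold can be sketched in Python. The example puts ```functools.reduce``` next to ```Dataset.reduce``` on four made-up values. The names `state` and `element` are illustrative: `state` plays the role of `left` and `element` the role of `right`.

```python
from functools import reduce

import numpy as np
import tensorflow as tf

values = [1, 2, 3, 4]

# Plain Python left fold: ((((0 + 1) + 2) + 3) + 4)
py_total = reduce(lambda state, element: state + element, values, 0)

# Same fold through Dataset.reduce: state starts at initial_state and
# reduce_func(state, element) returns the next state.
ds = tf.data.Dataset.from_tensor_slices(np.array(values, dtype=np.int64))
tf_total = ds.reduce(np.int64(0), lambda state, element: state + element)

print(py_total, int(tf_total.numpy()))
```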

## sum

Similar to ```tf.math.reduce_sum(x, axis=0)```.

In [30]:
def sum_fn(previous_row, current_row):
    # previous_row holds the running total so far (initial_state on the first call)
    return previous_row + current_row

In [26]:
x = np.arange(3*4).reshape((3,4)).astype(np.float32)
print(x)
ds = tf.data.Dataset.from_tensor_slices(x)

[[ 0.  1.  2.  3.]
 [ 4.  5.  6.  7.]
 [ 8.  9. 10. 11.]]


In [33]:
total = ds.reduce(initial_state=0.0, reduce_func=sum_fn)  # avoid shadowing the built-in sum()
print(f"reduce result type is {type(total)}")
total.numpy()

reduce result type is <class 'tensorflow.python.framework.ops.EagerTensor'>


array([12., 15., 18., 21.], dtype=float32)
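
As a cross-check on the similarity to ```tf.math.reduce_sum```, the sketch below folds the same rows with ```Dataset.reduce``` and compares the result with a direct sum over axis 0. The initial state is written as an explicit zero row of the element shape; that is a choice made for this example, not something the cell above requires.

```python
import numpy as np
import tensorflow as tf

x = np.arange(3 * 4).reshape((3, 4)).astype(np.float32)
ds = tf.data.Dataset.from_tensor_slices(x)

# Fold the rows one by one, starting from an explicit zero row.
folded = ds.reduce(
    tf.zeros(4, dtype=tf.float32),
    lambda state, row: state + row,
)

# Sum over axis 0 of the full array in one call.
direct = tf.math.reduce_sum(x, axis=0)

print(folded.numpy(), direct.numpy())
```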

## sum on Dataset of Tuples

Sum feature_1 and feature_2 separately over a Dataset whose elements are tuples ```(feature_1, feature_2)```.

In [64]:
def f(row0, row1):
    return (
        row0[0] + row1[0], # feature_1 in (feature_1, feature_2)
        row0[1] + row1[1]  # feature_2 in (feature_1, feature_2)
    )

In [72]:
feature_1 = np.array([
    [1, 2, 3], 
    [4, 5, 6]
]).astype(np.float32)
feature_2 = np.array(
    [
        1, 
        2
    ]
).astype(np.float32)                
 
ds_tuple = tf.data.Dataset.from_tensor_slices((feature_1, feature_2))
feature_1_sum, feature_2_sum = ds_tuple.reduce(initial_state=(0.0,0.0), reduce_func=f)
feature_1_sum.numpy(), feature_2_sum.numpy()

(array([5., 7., 9.], dtype=float32), 3.0)
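
A variant of the cell above, sketched with an initial state whose components have the same shapes as one element: a length-3 zero vector for feature_1 and a scalar zero for feature_2. The names `initial` and `add_tuples` are illustrative. The result is expected to match the output above; the only difference is that the state structure is spelled out explicitly.

```python
import numpy as np
import tensorflow as tf

feature_1 = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32)
feature_2 = np.array([1, 2], dtype=np.float32)
ds_tuple = tf.data.Dataset.from_tensor_slices((feature_1, feature_2))

# One zero-valued state per component, shaped like a single element.
initial = (tf.zeros(3, dtype=tf.float32), tf.constant(0.0))

def add_tuples(state, element):
    # state and element are both (feature_1, feature_2) tuples
    return (state[0] + element[0], state[1] + element[1])

sum_1, sum_2 = ds_tuple.reduce(initial_state=initial, reduce_func=add_tuples)
print(sum_1.numpy(), sum_2.numpy())
```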