In [18]:
import numpy as np
import tensorflow as tf

# TensorFlow Dataset

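A TF Dataset behaves like a monad: a container that exposes its contents only through its interface. You cannot index into its elements directly; you transform, iterate over, or reduce it through the ```tf.data.Dataset``` methods.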

A Spark RDD works the same way: because its elements are distributed over nodes, you do not access them individually until the transformations have run and the result has been reduced to a single row.
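
A minimal sketch of the "container with an interface" idea, assuming a small in-memory array (the names ```ds``` and ```doubled``` are illustrative only): indexing is not part of the interface, while ```map``` and iteration are.

```python
import numpy as np
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices(np.arange(6).reshape((3, 2)))

# Indexing is not part of the Dataset interface.
try:
    ds[0]
    indexing_supported = True
except TypeError:
    indexing_supported = False

# Elements come out through the interface: transform first, then iterate.
doubled = ds.map(lambda row: row * 2)
first_row = next(iter(doubled)).numpy()
all_rows = list(doubled.as_numpy_iterator())

print(indexing_supported, first_row, len(all_rows))
```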

<img src="image/what_is_tensorflow_dataset.jpg" align="left" width=500/>

---
# Creation

## from_tensor_slices

```tf.data.Dataset.from_tensor_slices(y)``` is to a Dataset what ```tf.constant(x)``` is to a Tensor: it builds the object from in-memory data.

**Each element in y on axis=0 is a row in a dataset**.

* [from_tensor_slices](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices)

```x[0]``` and ```x[1]``` each become one row in the dataset.

```
x = [
    [                        # <--- x[0]
        [ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]
    ],
    [                        # <--- x[1]
        [12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]
]
```

In [58]:
x: np.ndarray = np.arange(2*3*4).reshape((2,3,4))
x

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])

In [60]:
# x[0] is the first row
dataset: tf.data.Dataset = tf.data.Dataset.from_tensor_slices(x)
list(dataset.take(1).as_numpy_iterator())

[array([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])]

In [4]:
for row in dataset:
    print(row)

tf.Tensor(
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]], shape=(3, 4), dtype=int64)
tf.Tensor(
[[12 13 14 15]
 [16 17 18 19]
 [20 21 22 23]], shape=(3, 4), dtype=int64)
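
The axis-0 rule can be checked without iterating. The sketch below assumes the same ```(2, 3, 4)``` array and reads the row count and per-row shape from ```cardinality()``` and ```element_spec```; the variable names are illustrative.

```python
import numpy as np
import tensorflow as tf

y = np.arange(2 * 3 * 4).reshape((2, 3, 4))
ds = tf.data.Dataset.from_tensor_slices(y)

# Number of rows equals the size of axis 0.
num_rows = int(ds.cardinality().numpy())

# Each row has the remaining shape, y.shape[1:].
row_shape = ds.element_spec.shape.as_list()

print(num_rows, row_shape)
```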


## from_tensors

```from_tensors(x)``` packs all of ```x``` into a single tensor row, so the resulting dataset has only one row.

> from_tensors produces a dataset **containing only a single element/row**. To slice the input tensor into multiple elements, use from_tensor_slices instead.

* [from_tensors](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensors)

In [62]:
# Pass x itself; wrapping it as [x] would add a leading dimension of size 1.
ds_from_tensors = tf.data.Dataset.from_tensors(x)

In [64]:
for row in ds_from_tensors:
    print(row)

tf.Tensor(
[[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]

 [[12 13 14 15]
  [16 17 18 19]
  [20 21 22 23]]], shape=(2, 3, 4), dtype=int64)
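
To make the contrast with ```from_tensor_slices``` concrete, here is a small side-by-side sketch on the same array (names such as ```ds_whole``` and ```ds_sliced``` are illustrative). It also shows that ```unbatch()``` splits the single row back into slices.

```python
import numpy as np
import tensorflow as tf

x = np.arange(2 * 3 * 4).reshape((2, 3, 4))

ds_whole = tf.data.Dataset.from_tensors(x)          # one row holding all of x
ds_sliced = tf.data.Dataset.from_tensor_slices(x)   # one row per x[i]

whole_count = int(ds_whole.cardinality().numpy())
sliced_count = int(ds_sliced.cardinality().numpy())
whole_shape = ds_whole.element_spec.shape.as_list()
sliced_shape = ds_sliced.element_spec.shape.as_list()

# unbatch() splits the single row along axis 0 again.
unbatched_rows = list(ds_whole.unbatch().as_numpy_iterator())

print(whole_count, whole_shape)
print(sliced_count, sliced_shape)
```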


---
# Attributes

## Batch size

In [68]:
x = np.random.random(size=(64,3))
ds = tf.data.Dataset.from_tensor_slices(x).batch(32)

In [69]:
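# _batch_size is a private attribute of the batched dataset and may change between TensorFlow versions.
batch_size = ds._batch_size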
batch_size.numpy()

32

## Number of batches in dataset

In [70]:
num_batches = ds.cardinality()
num_batches.numpy()

2
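
As an alternative that avoids private attributes, the batch size can be read from the first batch's leading dimension. This is a sketch under the assumption that the first batch is full; it uses 70 rows to show how a partial last batch and ```drop_remainder``` affect the batch count.

```python
import numpy as np
import tensorflow as tf

x = np.random.random(size=(70, 3))
ds = tf.data.Dataset.from_tensor_slices(x).batch(32)

# Batch size taken from the leading dimension of the first batch.
first_batch = next(iter(ds))
batch_size = int(first_batch.shape[0])

# 70 rows in batches of 32: two full batches plus one partial batch of 6.
num_batches = int(ds.cardinality().numpy())

# drop_remainder=True discards the partial batch.
ds_full = tf.data.Dataset.from_tensor_slices(x).batch(32, drop_remainder=True)
num_full_batches = int(ds_full.cardinality().numpy())

print(batch_size, num_batches, num_full_batches)
```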

---
# Operations

## Extract single record 

Reduce the dataset to a single record, for example with ```take(1)```, then call ```get_single_element()```.

* [get_single_element](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#get_single_element)

In [5]:
dataset.take(1).get_single_element()

<tf.Tensor: shape=(3, 4), dtype=int64, numpy=
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])>
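
The ```take(1)``` step matters: ```get_single_element()``` expects exactly one element. The sketch below, on an illustrative two-row dataset, shows the call succeeding after ```take(1)``` and raising ```tf.errors.InvalidArgumentError``` without it.

```python
import numpy as np
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices(np.arange(6).reshape((2, 3)))

# Exactly one element: works.
first = ds.take(1).get_single_element().numpy()

# Two elements: get_single_element() refuses.
try:
    ds.get_single_element()
    raised = False
except tf.errors.InvalidArgumentError:
    raised = True

print(first, raised)
```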

## Aggregate (Reduce)

* [reduce](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#reduce)

```
reduce(
    initial_state, reduce_func, name=None
)
```

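```reduce``` is like ```list.foldLeft(initial)(func)``` in Scala: it carries a state from row to row, so each step combines the result of the previous rows with the current row (loosely similar to referring to the previous row with SQL ```LAG```).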

```
list.foldLeft(0)((left, right) => left + right)  // The initial left value is 0; each element is then added as right
```
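
The same fold can be sketched in Python for comparison, assuming a small list of integers: ```functools.reduce``` plays the role of ```foldLeft```, and ```Dataset.reduce``` does the equivalent over dataset rows. The names are illustrative.

```python
from functools import reduce

import numpy as np
import tensorflow as tf

values = [1, 2, 3, 4]

# Plain Python left fold: start from 0, add each element in turn.
python_fold = reduce(lambda left, right: left + right, values, 0)

# Dataset.reduce: the state starts at 0 and each row is added to it.
ds = tf.data.Dataset.from_tensor_slices(np.array(values, dtype=np.int64))
tf_fold = ds.reduce(np.int64(0), lambda state, row: state + row).numpy()

print(python_fold, tf_fold)
```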

## Sum

Similar to ```tf.math.reduce_sum(x, axis=0)```.

In [6]:
def sum_fn(previous_row, current_row):
    return previous_row + current_row

In [7]:
x = np.arange(3*4).reshape((3,4)).astype(np.float32)
print(x)
ds = tf.data.Dataset.from_tensor_slices(x)

[[ 0.  1.  2.  3.]
 [ 4.  5.  6.  7.]
 [ 8.  9. 10. 11.]]


In [8]:
# Named total rather than sum to avoid shadowing the Python builtin.
total = ds.reduce(initial_state=0.0, reduce_func=sum_fn)
print(f"reduce result type is {type(total)}")
total.numpy()

reduce result type is <class 'tensorflow.python.framework.ops.EagerTensor'>


array([12., 15., 18., 21.], dtype=float32)
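
To back up the comparison with ```tf.math.reduce_sum```, the following sketch computes both on the same array. It assumes an initial state shaped like one row (```tf.zeros(4)```) instead of the scalar ```0.0```, which keeps the state shape fixed from the first step; the names are illustrative.

```python
import numpy as np
import tensorflow as tf

x = np.arange(3 * 4).reshape((3, 4)).astype(np.float32)
ds = tf.data.Dataset.from_tensor_slices(x)

# Initial state with the same shape and dtype as a row.
total = ds.reduce(
    tf.zeros(4, dtype=tf.float32),
    lambda state, row: state + row,
)

# Column-wise sum computed directly on the array.
expected = tf.math.reduce_sum(x, axis=0)

print(total.numpy(), expected.numpy())
```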

## Sum on Tensor of Tuple

Sum ```feature_1``` and ```feature_2``` separately over a dataset whose rows are tuples ```(feature_1, feature_2)```.

In [9]:
def f(row0, row1):
    return (
        row0[0] + row1[0], # feature_1 in (feature_1, feature_2)
        row0[1] + row1[1]  # feature_2 in (feature_1, feature_2)
    )

Calculate row-wise sum on:
```
([1,2,3],1)
([4,5,6],2)
```

In [26]:
feature_1 = np.array([
    [1, 2, 3], 
    [4, 5, 6]
]).astype(np.float32)
feature_2 = np.array(
    [
        1, 
        2
    ]
).astype(np.float32)                
 
ds_tuple = tf.data.Dataset.from_tensor_slices((feature_1, feature_2))
for row in ds_tuple:
    print(row)

(<tf.Tensor: shape=(3,), dtype=float32, numpy=array([1., 2., 3.], dtype=float32)>, <tf.Tensor: shape=(), dtype=float32, numpy=1.0>)
(<tf.Tensor: shape=(3,), dtype=float32, numpy=array([4., 5., 6.], dtype=float32)>, <tf.Tensor: shape=(), dtype=float32, numpy=2.0>)


In [11]:
feature_1_sum, feature_2_sum = ds_tuple.reduce(initial_state=(0.0,0.0), reduce_func=f)
feature_1_sum.numpy(), feature_2_sum.numpy()

(array([5., 7., 9.], dtype=float32), 3.0)
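
The state does not have to mirror the row structure. As a sketch, assuming the same two features, the state below is a hypothetical triple ```(count, sum_1, sum_2)```, which is enough to derive per-feature means after the reduce. The function name ```count_and_sum``` is illustrative.

```python
import numpy as np
import tensorflow as tf

feature_1 = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32)
feature_2 = np.array([1, 2], dtype=np.float32)
ds_tuple = tf.data.Dataset.from_tensor_slices((feature_1, feature_2))

def count_and_sum(state, row):
    # state is (count, sum_1, sum_2); row is (feature_1, feature_2).
    count, sum_1, sum_2 = state
    f1, f2 = row
    return count + 1.0, sum_1 + f1, sum_2 + f2

count, sum_1, sum_2 = ds_tuple.reduce(
    (0.0, tf.zeros(3, dtype=tf.float32), 0.0),
    count_and_sum,
)

mean_1 = (sum_1 / count).numpy()
mean_2 = (sum_2 / count).numpy()
print(count.numpy(), mean_1, mean_2)
```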

## Iterate as numpy arrays

In [73]:
ds = tf.data.Dataset.from_tensor_slices(np.random.random(size=(4,3)))
for index, row in enumerate(ds.as_numpy_iterator()):
    print(index, row)

0 [0.07311059 0.01731903 0.77369721]
1 [0.1863143  0.90496002 0.51770964]
2 [0.204347   0.50955457 0.37652489]
3 [0.75352069 0.45518075 0.90753191]
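
Because ```as_numpy_iterator()``` yields plain ```np.ndarray``` rows, they can be collected and stacked back into one array. A short sketch, assuming the dataset is small enough to fit in memory (names are illustrative):

```python
import numpy as np
import tensorflow as tf

x = np.random.random(size=(4, 3))
ds = tf.data.Dataset.from_tensor_slices(x)

# Each row arrives as a NumPy array rather than a tf.Tensor.
rows = list(ds.as_numpy_iterator())

# Stacking the rows along axis 0 rebuilds the original array.
restored = np.stack(rows)

print(type(rows[0]), restored.shape)
```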
