# Data

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lukeconibear/intro_ml/blob/main/docs/02_data.ipynb)

In [1]:
# if you're using colab, then install the required modules
import sys

IN_COLAB = "google.colab" in sys.modules
if IN_COLAB:
    pass

## Tensors

#### NumPy

```python
import numpy as np

np.random.normal(size=())  # scalar (0 dimensions, shape ())
np.random.normal(size=(3,))  # vector (1 dimension)
np.random.normal(size=(3, 3))  # matrix (2 dimensions)
```

#### [TensorFlow](https://www.tensorflow.org/guide/tensor)

```python
import tensorflow as tf

tf.random.normal(shape=())  # scalar (rank 0, shape ())
tf.random.normal(shape=(3,))  # vector (rank 1)
tf.random.normal(shape=(3, 3))  # matrix (rank 2)
```
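To check the terminology above, here is a minimal sketch that builds a scalar, a vector, and a matrix, then inspects their rank and shape. The variable names are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import tensorflow as tf

# Illustrative tensors; the names are assumptions for this sketch.
scalar = tf.constant(1.0)              # rank 0, shape ()
vector = tf.constant([1.0, 2.0, 3.0])  # rank 1, shape (3,)
matrix = tf.ones(shape=(3, 3))         # rank 2, shape (3, 3)

ranks = [int(tf.rank(t)) for t in (scalar, vector, matrix)]
shapes = [tuple(t.shape) for t in (scalar, vector, matrix)]

# NumPy arrays and TensorFlow tensors convert in both directions.
array = matrix.numpy()                      # tf.Tensor -> np.ndarray
tensor = tf.convert_to_tensor(np.zeros(3))  # np.ndarray -> tf.Tensor

print(ranks)
print(shapes)
```

Note that a shape of `(1,)` is a vector with one element, not a scalar.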

## Data pipelines

A data pipeline is useful:

- When the data does not fit in memory.
- When the data requires pre-processing.
- To efficiently use hardware.

The steps can include:

- Extract, e.g., read data from memory or storage.
- Transform, e.g., pre-processing, batching, shuffling.
- Load, e.g., transfer to GPU.
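The three steps above could be sketched with `tf.data` as follows. The random data and the `scale` function are placeholders assumed for illustration. Here `prefetch` stands in for the load step by preparing the next batch while the current one is being consumed; the actual transfer to the GPU happens when the model consumes the batch.

```python
import numpy as np
import tensorflow as tf

# Extract: read data from memory (random placeholder data for this sketch).
features = np.random.normal(size=(100, 4)).astype("float32")
labels = np.random.randint(0, 2, size=(100,))
dataset = tf.data.Dataset.from_tensor_slices((features, labels))


def scale(x, y):
    # Hypothetical pre-processing step: halve the feature values.
    return x / 2.0, y


# Transform: shuffle, pre-process, and batch.
dataset = dataset.shuffle(buffer_size=100).map(scale).batch(32)

# Load: prepare the next batch while the current one is being consumed.
dataset = dataset.prefetch(buffer_size=tf.data.AUTOTUNE)

batch_shapes = [tuple(x.shape) for x, _ in dataset]
print(batch_shapes)
```

With 100 samples and a batch size of 32, the final batch holds the remaining 4 samples.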

### Data loading

[Keras](https://keras.io/api/data_loading/) models accept three types of inputs:

- [NumPy arrays](https://www.tensorflow.org/guide/data#consuming_numpy_arrays)
    - Suitable for when the data fits in memory.
- [TensorFlow Dataset objects](https://www.tensorflow.org/guide/data#dataset_structure)
    - Suitable for datasets that do not fit in memory and that are streamed from disk or from a distributed filesystem.
- [Python generators](https://www.tensorflow.org/guide/data#consuming_python_generators)
    - Suitable for custom processing that yields batches of data (for example, subclasses of the `tf.keras.utils.Sequence` class).

If you have a large dataset and you are training on GPU(s), consider using `Dataset` objects, since they will take care of performance-critical details, such as:

- Asynchronously preprocessing your data on CPU while your GPU is busy, and buffering it into a queue.
- Prefetching data on GPU memory so it's immediately available when the GPU has finished processing the previous batch, so you can reach full GPU utilization.

Keras features a range of utilities to help you turn raw data on disk into a `Dataset`:

- [`tf.keras.utils.image_dataset_from_directory`](https://www.tensorflow.org/api_docs/python/tf/keras/utils/image_dataset_from_directory) turns image files sorted into class-specific folders into a labeled dataset of image tensors.
- [`tf.keras.utils.text_dataset_from_directory`](https://www.tensorflow.org/api_docs/python/tf/keras/utils/text_dataset_from_directory) does the same for text files.
- [`tf.keras.utils.timeseries_dataset_from_array`](https://www.tensorflow.org/api_docs/python/tf/keras/utils/timeseries_dataset_from_array) creates a dataset of sliding windows over a timeseries provided as an array.
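Of the utilities above, `timeseries_dataset_from_array` is the easiest to try without any files on disk. This is a minimal sketch, assuming a toy series of ten integers and no targets; the window length and batch size are arbitrary choices.

```python
import numpy as np
import tensorflow as tf

series = np.arange(10)  # placeholder timeseries: 0, 1, ..., 9

windows = tf.keras.utils.timeseries_dataset_from_array(
    data=series,
    targets=None,       # no labels in this sketch
    sequence_length=3,  # each window holds 3 consecutive steps
    batch_size=2,       # 2 windows per batch
)

first_batch = next(iter(windows)).numpy()
print(first_batch)
```

The first batch holds the windows `[0, 1, 2]` and `[1, 2, 3]`, since the default stride between windows is 1.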

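Step fusing: set `steps_per_execution` in `model.compile` to run several batches in each call to the compiled training function, which reduces the overhead per step.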

model = build_model()
model.compile(
    optimiser,
    loss,
    steps_per_execution=32  # this step
)
model.fit(dataset, epochs=epochs, callbacks=callbacks)
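A self-contained version of the snippet above might look like the following. The data, the one-layer model, and the value `steps_per_execution=4` are assumptions chosen so the example runs quickly on a CPU.

```python
import numpy as np
import tensorflow as tf

# Placeholder data and model, assumed only for this sketch.
x = np.random.normal(size=(256, 8)).astype("float32")
y = np.random.normal(size=(256, 1)).astype("float32")
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(32)  # 8 batches

model = tf.keras.Sequential(
    [
        tf.keras.Input(shape=(8,)),
        tf.keras.layers.Dense(1),
    ]
)
model.compile(
    optimizer="sgd",
    loss="mse",
    steps_per_execution=4,  # run 4 batches per call to the training function
)
history = model.fit(dataset, epochs=1, verbose=0)
print(history.history["loss"])
```

Callbacks that act per batch are then only invoked once per group of `steps_per_execution` batches, which is worth bearing in mind when choosing the value.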

In [None]:
# # Train the model for 1 epoch from Numpy data
# batch_size = 64
# print("Fit on NumPy data")
# history = model.fit(x_train, y_train, batch_size=batch_size, epochs=1)

# # Train the model for 1 epoch using a dataset
# dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(batch_size)
# print("Fit on Dataset")
# history = model.fit(dataset, epochs=1)
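The commented-out cell above depends on `model`, `x_train`, and `y_train` being defined elsewhere. A runnable sketch of the same comparison, assuming random placeholder data and a small placeholder model, could be:

```python
import numpy as np
import tensorflow as tf

# Placeholder data and model; both are assumptions for this sketch.
x_train = np.random.normal(size=(256, 8)).astype("float32")
y_train = np.random.randint(0, 2, size=(256, 1)).astype("float32")

model = tf.keras.Sequential(
    [
        tf.keras.Input(shape=(8,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ]
)
model.compile(optimizer="adam", loss="binary_crossentropy")

batch_size = 64

# Train for 1 epoch directly on NumPy arrays; fit() does the batching.
history_numpy = model.fit(
    x_train, y_train, batch_size=batch_size, epochs=1, verbose=0
)

# Train for 1 epoch on a Dataset; the Dataset does the batching.
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(batch_size)
history_dataset = model.fit(dataset, epochs=1, verbose=0)

print(history_numpy.history["loss"], history_dataset.history["loss"])
```

When a `Dataset` is passed to `fit`, `batch_size` is not given, because the dataset is already batched.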

### [Map](https://www.tensorflow.org/guide/data#preprocessing_data)

Map a preprocessing function to a dataset.

#### TensorFlow

```python
dataset.map(function)
```
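As a concrete sketch, assuming a toy dataset of integers and a hypothetical `square` function in place of real pre-processing:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(5)  # elements 0, 1, 2, 3, 4


def square(x):
    # Hypothetical pre-processing function, applied to each element.
    return x * x


# map() returns a new dataset; the original is left unchanged.
squared = dataset.map(square)

result = [int(x) for x in squared]
print(result)  # [0, 1, 4, 9, 16]
```

Because dataset transformations are not applied in place, the result needs to be assigned to a variable.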

### [Batch](https://www.tensorflow.org/guide/data#batching_dataset_elements)

Split the data into batches.

#### TensorFlow

```python
dataset.batch(batch_size=32)
```
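For example, with a toy dataset of ten integers and a batch size of 4 (both chosen arbitrarily for this sketch), the last batch is smaller unless `drop_remainder=True` is set:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)

batched = dataset.batch(batch_size=4)
batches = [batch.numpy().tolist() for batch in batched]
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# drop_remainder=True discards the final, smaller batch.
full_only = dataset.batch(batch_size=4, drop_remainder=True)
n_full = len(list(full_only))
print(n_full)  # 2
```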

There is a range of ways to [improve the performance](https://www.tensorflow.org/guide/data_performance) of the data pipeline.

In the examples below, passing `tf.data.AUTOTUNE` lets TensorFlow tune the value (for example, the prefetch buffer size or the number of parallel calls) dynamically at runtime.

(cache_prefetch)=
### [Dataset caching](https://www.tensorflow.org/guide/data_performance#caching)

Cache the data after the first iteration through it. The data can be cached to either memory or a local file.

This can improve performance when:

- The data is the same each iteration.
- The data is read from a remote distributed filesystem.
- The data is I/O (input/output) bound and will fit in memory.

#### TensorFlow

```python
dataset.cache()
```
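One way to see the effect of caching is to count how often a pre-processing function runs. In this sketch, `expensive` and `slow_map` are hypothetical helpers, and `tf.py_function` is used only so that the Python-side counter is updated on every call.

```python
import tensorflow as tf

calls = []


def expensive(x):
    # Record each call so we can see how often the function actually runs.
    calls.append(int(x))
    return x * 2


def slow_map(x):
    # Wrap the Python function so it runs eagerly inside the pipeline.
    y = tf.py_function(expensive, inp=[x], Tout=tf.int64)
    y.set_shape(())
    return y


dataset = tf.data.Dataset.range(3).map(slow_map).cache()

first = [int(x) for x in dataset]   # runs `expensive` and fills the cache
second = [int(x) for x in dataset]  # served from the cache

print(first, second, len(calls))
```

The function runs three times in total, not six, because the second pass reads the cached elements. Calling `cache()` with a filename argument would store the cache in a local file instead of in memory.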

### [Prefetch data](https://www.tensorflow.org/guide/data_performance#prefetching)

Prefetch the next batch to save time waiting for it.

#### TensorFlow

```python
dataset.prefetch(buffer_size=tf.data.AUTOTUNE)
```
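Prefetching changes when elements are produced, not what they are. A minimal sketch, assuming a toy dataset, with `prefetch` placed as the last step so that whole batches are prepared ahead of time:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(8).batch(4)

# Prefetch last, so that complete batches are prepared in the background.
prefetched = dataset.prefetch(buffer_size=tf.data.AUTOTUNE)

batches = [batch.numpy().tolist() for batch in prefetched]
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```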

### [Parallel data extraction](https://www.tensorflow.org/guide/data_performance#parallelizing_data_extraction)

Extract the data in parallel.

#### TensorFlow

```python
dataset.interleave(
    build_dataset, 
    num_parallel_calls=tf.data.AUTOTUNE
)
```
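In the snippet above, `build_dataset` maps each input element (for example, a file name) to its own dataset. The sketch below assumes a hypothetical `build_dataset` that stands in for reading a file, and sets `cycle_length`, `block_length`, and `deterministic` explicitly to make the output order predictable.

```python
import tensorflow as tf


def build_dataset(source_id):
    # Hypothetical stand-in for reading one file: yields its ID three times.
    return tf.data.Dataset.from_tensors(source_id).repeat(3)


sources = tf.data.Dataset.range(2)  # two pretend files: 0 and 1

interleaved = sources.interleave(
    build_dataset,
    cycle_length=2,  # read from both sources concurrently
    block_length=1,  # take one element from each source in turn
    num_parallel_calls=tf.data.AUTOTUNE,
    deterministic=True,  # keep the output order reproducible
)

result = [int(x) for x in interleaved]
print(result)  # [0, 1, 0, 1, 0, 1]
```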

### [Parallel data transformation](https://www.tensorflow.org/guide/data_performance#parallelizing_data_transformation)

Pre-process your data in parallel.

#### TensorFlow

```python
dataset.map(
    function, 
    num_parallel_calls=tf.data.AUTOTUNE
)
```
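A small sketch of a parallel `map`, with a hypothetical `function` in place of real pre-processing. Setting `deterministic=False` is an optional choice made here: it lets elements be returned out of order in exchange for speed, so the result is sorted before printing.

```python
import tensorflow as tf


def function(x):
    # Hypothetical pre-processing step.
    return x * 10


dataset = tf.data.Dataset.range(5)

parallel = dataset.map(
    function,
    num_parallel_calls=tf.data.AUTOTUNE,
    deterministic=False,  # allow out-of-order results in exchange for speed
)

result = sorted(int(x) for x in parallel)
print(result)  # [0, 10, 20, 30, 40]
```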

### [Vectorise mapping](https://www.tensorflow.org/guide/data_performance#vectorizing_mapping)

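Batch _before_ mapping, so that the function is applied to a whole batch at once rather than to each element separately. The function must therefore accept batched inputs.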

#### TensorFlow

```python
dataset.batch(256).map(function)
```
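To illustrate, the sketch below applies a hypothetical element-wise `increment` function in both orders. The results are the same, but the vectorised version invokes the function once per batch instead of once per element.

```python
import tensorflow as tf


def increment(x):
    # Element-wise, so it accepts a single element or a whole batch.
    return x + 1


dataset = tf.data.Dataset.range(6)

scalar_mapped = dataset.map(increment).batch(3)  # applied to each of 6 elements
vector_mapped = dataset.batch(3).map(increment)  # applied to each of 2 batches

scalar_result = [batch.numpy().tolist() for batch in scalar_mapped]
vector_result = [batch.numpy().tolist() for batch in vector_mapped]

print(scalar_result == vector_result)  # True
```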

JIT compilation: set `jit_compile=True` in `model.compile` to compile the training step with XLA.

model = build_model()
model.compile(
    optimiser,
    loss,
    jit_compile=True  # this step
)
model.fit(dataset, epochs=epochs, callbacks=callbacks)

## Data preprocessing

- Tokenization of string data, followed by token indexing.
- Feature normalization.
- Rescaling the data to small values. In general, input values to a neural network should be close to zero: typically we expect either data with zero mean and unit variance, or data in the [0, 1] range.
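The three steps above can each be sketched with a Keras preprocessing layer. The choice of `Normalization`, `Rescaling`, and `TextVectorization` is an assumption (the original does not name specific layers), and the data is a toy placeholder.

```python
import numpy as np
import tensorflow as tf

# Feature normalisation: learn the mean and variance, then apply them.
features = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]], dtype="float32")
normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(features)
normalized = normalizer(features).numpy()
print(normalized.mean(axis=0), normalized.std(axis=0))

# Rescaling: map values from the [0, 255] range to the [0, 1] range.
pixels = np.array([[0.0, 127.5, 255.0]], dtype="float32")
rescaled = tf.keras.layers.Rescaling(scale=1.0 / 255)(pixels).numpy()
print(rescaled)

# Tokenisation of string data, followed by token indexing.
vectorizer = tf.keras.layers.TextVectorization(output_mode="int")
vectorizer.adapt(tf.constant(["the cat sat", "the dog sat"]))
token_ids = vectorizer(tf.constant(["the cat"])).numpy()
print(vectorizer.get_vocabulary(), token_ids)
```

After normalisation each feature column has a mean close to 0 and a standard deviation close to 1. The first two vocabulary entries are reserved for padding and unknown tokens.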


## [Mixed precision](https://www.tensorflow.org/guide/mixed_precision)

...

from tensorflow.keras import mixed_precision

mixed_precision.set_global_policy('mixed_float16')

with distribution_strategy.scope():
    model = build_model()
    model.compile(optimiser, loss)
    model.fit(dataset, epochs=epochs, callbacks=callbacks)
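To see what the `mixed_float16` policy does, this sketch inspects a single `Dense` layer rather than a full model, and leaves out the distribution strategy. The layer size and input are arbitrary assumptions. The policy is reset at the end, since it is global state.

```python
import tensorflow as tf

tf.keras.mixed_precision.set_global_policy("mixed_float16")

layer = tf.keras.layers.Dense(4)
outputs = layer(tf.ones((2, 3)))

compute_dtype = layer.compute_dtype    # computations run in float16
variable_dtype = layer.variable_dtype  # weights are kept in float32
print(compute_dtype, variable_dtype, outputs.dtype)

# Restore the default policy so that later code is unaffected.
tf.keras.mixed_precision.set_global_policy("float32")
```

Keeping the weights in float32 preserves numerical stability, while float16 computation can be faster on hardware that supports it. On hardware without float16 support, it may be slower.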

In [None]:
import numpy as np
import tensorflow as tf
import torch

- datasets
- data centric AI hub videos
- efficient feeding of data into GPUs
- pipelines for large data I/O into GPUs using compression/decompression, Ray datasets




## TensorFlow Datasets

...

(data_augmentation)=
## Data augmentation

### Synthetic data

...

- [NVIDIA Replicator Composer](https://docs.omniverse.nvidia.com/app_isaacsim/app_isaacsim/tutorial_replicator_composer.html#replicator-composer)

In [None]:
import tensorflow as tf

Check whether you have a GPU:

In [5]:
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Num GPUs Available:  0


## Exercises

```{admonition} Exercise 1

...

```

## {ref}`Solutions <data>`

## Key Points

```{important}

- [x] _Use a data pipeline._
- [x] _Optimise the data pipeline with caching, prefetching, parallel extraction, parallel preprocessing, and vectorised mapping._

```

## Further information

### Good practices

- Do data processing as part of the model to increase portability and reproducibility.
- Analyse data pipeline performance with [TensorBoard Profiler](https://www.tensorflow.org/guide/data_performance_analysis).
- Use sparse tensors when there are many zeros or `np.nan` values (e.g., [TensorFlow](https://www.tensorflow.org/guide/sparse_tensor)).
- ...
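As a brief illustration of the sparse tensor point above, this sketch stores only the two non-zero values of a 3 x 4 matrix and converts between the sparse and dense forms. The values are arbitrary.

```python
import tensorflow as tf

# Only the non-zero entries are stored: their indices and their values.
sparse = tf.sparse.SparseTensor(
    indices=[[0, 1], [2, 3]],
    values=[5.0, 7.0],
    dense_shape=[3, 4],
)

dense = tf.sparse.to_dense(sparse)        # fill the missing entries with zeros
round_trip = tf.sparse.from_dense(dense)  # drop the zeros again

print(dense.numpy())
print(round_trip.values.numpy())
```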

### Other options

- ...
 
### Resources

#### General

- [Papers with code - Datasets](https://paperswithcode.com/datasets)
- [HuggingFace - Datasets](https://huggingface.co/datasets)
- [Google research datasets](https://ai.google/tools/datasets/)
- [Google Dataset Search](https://datasetsearch.research.google.com/)
- [Google Cloud public datasets](https://console.cloud.google.com/marketplace/browse?filter=solution-type:dataset&pli=1)
- [Kaggle Datasets](https://www.kaggle.com/datasets)

#### TensorFlow

- [TensorFlow official datasets](https://www.tensorflow.org/datasets)

#### PyTorch

- [Torch Vision Datasets](https://pytorch.org/vision/stable/datasets.html)
- [Torch Text Datasets](https://pytorch.org/text/stable/datasets.html)
- [Torch Audio Datasets](https://pytorch.org/audio/stable/datasets.html)