# Data

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lukeconibear/intro_ml/blob/main/docs/02_data.ipynb)

In [1]:
# if you're using colab, then install the required modules
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    %pip install ...

In [61]:
import numpy as np
import tensorflow as tf
import torch

In [62]:
np.random.normal(size=(1,))

array([-0.60170661])

In [63]:
tf.random.normal(shape=(1,))

<tf.Tensor: shape=(1,), dtype=float32, numpy=array([0.9826471], dtype=float32)>

In [65]:
np.random.normal(size=(3,))

array([ 1.85227818, -0.01349722, -1.05771093])

In [66]:
tf.random.normal(shape=(3,))

<tf.Tensor: shape=(3,), dtype=float32, numpy=array([ 0.7847413, -0.5809721, -0.2452356], dtype=float32)>

In [68]:
np.random.normal(size=(3, 3))

array([[ 0.82254491, -1.22084365,  0.2088636 ],
       [-1.95967012, -1.32818605,  0.19686124],
       [ 0.73846658,  0.17136828, -0.11564828]])

In [69]:
tf.random.normal(shape=(3, 3))

<tf.Tensor: shape=(3, 3), dtype=float32, numpy=
array([[ 1.2764754 , -0.6551445 , -0.00389835],
       [ 0.45862213, -1.507044  ,  1.0932531 ],
       [-0.45938343, -0.35318485, -0.3738681 ]], dtype=float32)>

## Data loading

Keras models accept three types of inputs:

- NumPy arrays, just like Scikit-Learn and many other Python-based libraries. This is a good option if your data fits in memory.
- TensorFlow Dataset objects. This is a high-performance option that is more suitable for datasets that do not fit in memory and that are streamed from disk or from a distributed filesystem.
- Python generators that yield batches of data (such as custom subclasses of the `keras.utils.Sequence` class).
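
As a rough illustration of the third option, the sketch below is a plain Python generator that yields batches from in-memory NumPy arrays. The name `batch_generator` and the toy data are hypothetical, chosen only for this example. It makes a single pass over the data, so a real training loop would likely need it to repeat or be told how many steps make up an epoch.

```python
import numpy as np


def batch_generator(features, labels, batch_size):
    """Yield (features, labels) batches in order; the last batch may be smaller."""
    n_samples = len(features)
    for start in range(0, n_samples, batch_size):
        end = start + batch_size
        yield features[start:end], labels[start:end]


# Hypothetical toy data: 10 samples with 4 features each.
features = np.arange(40, dtype="float32").reshape(10, 4)
labels = np.arange(10)

batches = list(batch_generator(features, labels, batch_size=4))
print([x.shape for x, _ in batches])
```

With 10 samples and a batch size of 4, the final batch holds the 2 leftover samples.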


If you have a large dataset and you are training on GPU(s), consider using `Dataset` objects, since they will take care of performance-critical details, such as:

- Asynchronously preprocessing your data on CPU while your GPU is busy, and buffering it into a queue.
- Prefetching data into GPU memory so that it is immediately available when the GPU has finished processing the previous batch, which helps you reach full GPU utilization.
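
A minimal sketch of such an input pipeline, assuming small random arrays in place of a real dataset. The doubling inside `map` is only a placeholder for real preprocessing. `tf.data.AUTOTUNE` lets TensorFlow choose the degree of parallelism and the prefetch buffer size, and `prefetch` keeps upcoming batches ready while the current one is being processed.

```python
import numpy as np
import tensorflow as tf

# Hypothetical in-memory data standing in for a real dataset.
features = np.random.normal(size=(100, 4)).astype("float32")
labels = np.random.randint(0, 2, size=(100,))

dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(buffer_size=100)
    # Placeholder preprocessing step, run in parallel on the CPU.
    .map(lambda x, y: (x * 2.0, y), num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
    # Prepare the next batches while the current one is being consumed.
    .prefetch(tf.data.AUTOTUNE)
)

first_x, first_y = next(iter(dataset))
print(first_x.shape, first_y.shape)
```

The resulting `dataset` can be passed directly to `model.fit`.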

Keras features a range of utilities to help you turn raw data on disk into a Dataset:

- [`tf.keras.utils.image_dataset_from_directory`](https://www.tensorflow.org/api_docs/python/tf/keras/utils/image_dataset_from_directory) turns image files sorted into class-specific folders into a labeled dataset of image tensors.
- [`tf.keras.utils.text_dataset_from_directory`](https://www.tensorflow.org/api_docs/python/tf/keras/utils/text_dataset_from_directory) does the same for text files.
- [`tf.keras.utils.timeseries_dataset_from_array`](https://www.tensorflow.org/api_docs/python/tf/keras/utils/timeseries_dataset_from_array) creates a dataset of sliding windows over a timeseries provided as an array.
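
To make the sliding-window idea concrete, here is a small sketch using `timeseries_dataset_from_array` on a toy integer series. The series, window length, and batch size are arbitrary choices for illustration. No targets are passed, so the dataset yields only the windows.

```python
import numpy as np
import tensorflow as tf

# Toy timeseries: the integers 0 to 9.
series = np.arange(10)

windows = tf.keras.utils.timeseries_dataset_from_array(
    data=series,
    targets=None,       # no labels, so only the windows are yielded
    sequence_length=3,  # each window covers 3 consecutive timesteps
    batch_size=2,
)

first_batch = next(iter(windows))
print(first_batch.numpy())
```

Each window starts one timestep after the previous one, because the default `sequence_stride` is 1.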

"step fusing"

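# Assumes `build_model`, `optimiser`, `loss`, `dataset`, `epochs`, and `callbacks` are defined elsewhere.
model = build_model()
model.compile(
    optimiser,
    loss,
    steps_per_execution=32  # run 32 batches within each tf.function call
)
model.fit(dataset, epochs=epochs, callbacks=callbacks)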

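JIT compilation

# Assumes `build_model`, `optimiser`, `loss`, `dataset`, `epochs`, and `callbacks` are defined elsewhere.
model = build_model()
model.compile(
    optimiser,
    loss,
    jit_compile=True  # compile the model with XLA
)
model.fit(dataset, epochs=epochs, callbacks=callbacks)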

## Data preprocessing

- Tokenization of string data, followed by token indexing.
- Feature normalization.
- Rescaling the data to small values. In general, input values to a neural network should be close to zero: typically we expect either data with zero mean and unit variance, or data in the [0, 1] range.
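
The two rescaling targets mentioned above can be sketched in plain NumPy as follows. The random `raw` array is hypothetical stand-in data. In practice the statistics would be computed on the training set only and then reused for validation and test data.

```python
import numpy as np

# Hypothetical raw features on a 0 to 255 scale (for example, pixel intensities).
rng = np.random.default_rng(seed=0)
raw = rng.uniform(low=0.0, high=255.0, size=(1000, 3))

# Standardisation: zero mean and unit variance per feature.
mean = raw.mean(axis=0)
std = raw.std(axis=0)
standardised = (raw - mean) / std

# Min-max rescaling: each feature mapped onto the [0, 1] range.
minimum = raw.min(axis=0)
maximum = raw.max(axis=0)
rescaled = (raw - minimum) / (maximum - minimum)

print(standardised.mean(axis=0), standardised.std(axis=0))
print(rescaled.min(axis=0), rescaled.max(axis=0))
```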


## Mixed precision

...

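from tensorflow.keras import mixed_precision

# Compute in float16 while keeping variables in float32.
mixed_precision.set_global_policy('mixed_float16')

# Assumes `distribution_strategy` (a `tf.distribute.Strategy`), `build_model`,
# `optimiser`, `loss`, `dataset`, `epochs`, and `callbacks` are defined elsewhere.
with distribution_strategy.scope():
    model = build_model()
    model.compile(optimiser, loss)
    model.fit(dataset, epochs=epochs, callbacks=callbacks)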

- datasets
- data centric AI hub videos
- efficient feeding of data into GPUs
- pipelines for large data I/O into GPUs using compression/decompression, Ray datasets




### Synthetic data

...

- [NVIDIA Replicator Composer](https://docs.omniverse.nvidia.com/app_isaacsim/app_isaacsim/tutorial_replicator_composer.html#replicator-composer)

In [None]:
import tensorflow as tf

Check whether you have a GPU:

In [None]:
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    print(f"Yes, there are {len(gpus)} GPU(s) available.")
else:
    print('No, GPUs are not available.')

## Exercises

```{admonition} Exercise 1

...

```

## {ref}`Solutions <data>`

## Key Points

```{important}

- [x] _..._

```

## Further information

### Good practices

- Do data processing as part of the model to increase portability and reproducibility.
- ...

### Other options

- ...
 
### Resources

#### General

- [Papers with code - Datasets](https://paperswithcode.com/datasets)
- [HuggingFace - Datasets](https://huggingface.co/datasets)
- [Google research datasets](https://ai.google/tools/datasets/)
- [Google Dataset Search](https://datasetsearch.research.google.com/)
- [Google Cloud public datasets](https://console.cloud.google.com/marketplace/browse?filter=solution-type:dataset&pli=1)
- [Kaggle Datasets](https://www.kaggle.com/datasets)

#### TensorFlow

- [TensorFlow official datasets](https://www.tensorflow.org/datasets)

#### PyTorch

- [Torch Vision Datasets](https://pytorch.org/vision/stable/datasets.html)
- [Torch Text Datasets](https://pytorch.org/text/stable/datasets.html)
- [Torch Audio Datasets](https://pytorch.org/audio/stable/datasets.html)