# Data

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lukeconibear/intro_ml/blob/main/docs/02_data.ipynb)

In [1]:
# if you're using colab, then install the required modules
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    %pip install ...

## Basic ideas

Data has inputs and outputs.

The inputs are what you provide to the model.

The outputs are what you're trying to predict.
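As a minimal sketch of this idea, the made-up example below stores four samples with two input features each, plus one output per sample. The names `inputs` and `outputs`, and the values themselves, are illustrative assumptions rather than data from this course.

```python
import numpy as np

# Hypothetical data: 4 samples, each with 2 input features
# (for example, height in cm and weight in kg).
inputs = np.array([
    [170.0, 65.0],
    [182.0, 80.0],
    [158.0, 52.0],
    [175.0, 72.0],
])

# One output per sample: the value the model should learn to predict.
outputs = np.array([0, 1, 0, 1])

# The first axis indexes the samples, so both arrays share its length.
print(inputs.shape)   # (4, 2)
print(outputs.shape)  # (4,)
```

Keeping the sample axis first in both arrays means that row `i` of `inputs` always pairs with element `i` of `outputs`.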

### Tensors

The data is normally in the form of tensors.

Tensors are multi-dimensional arrays, and the rank of a tensor is its number of dimensions (axes):

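Scalars are rank-0 tensors, with shape `()`. The one-element examples below have shape `(1,)`, so strictly they are rank-1 tensors holding a single value: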

In [61]:
import numpy as np
import tensorflow as tf
import torch

In [62]:
np.random.normal(size=(1,))

array([-0.60170661])

In [63]:
tf.random.normal(shape=(1,))

<tf.Tensor: shape=(1,), dtype=float32, numpy=array([0.9826471], dtype=float32)>

In [64]:
torch.randn(1)

tensor([1.3525])
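To check the rank directly, NumPy reports it as `ndim`. The sketch below (NumPy only, with illustrative variable names) compares an array with the empty shape `()` against a one-element array with shape `(1,)`.

```python
import numpy as np

# A true scalar: empty shape, rank 0.
scalar = np.array(3.14)
print(scalar.shape, scalar.ndim)            # () 0

# A single value wrapped in one axis: shape (1,), rank 1.
one_element = np.array([3.14])
print(one_element.shape, one_element.ndim)  # (1,) 1

# Asking for an empty shape draws a single random value;
# np.asarray makes sure the result is held as a rank-0 array.
random_scalar = np.asarray(np.random.normal(size=()))
print(random_scalar.ndim)                   # 0
```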

Vectors are rank-1 tensors:

In [65]:
np.random.normal(size=(3,))

array([ 1.85227818, -0.01349722, -1.05771093])

In [66]:
tf.random.normal(shape=(3,))

<tf.Tensor: shape=(3,), dtype=float32, numpy=array([ 0.7847413, -0.5809721, -0.2452356], dtype=float32)>

In [67]:
torch.randn(3)

tensor([ 0.6863, -0.3278,  0.7950])

Matrices are rank-2 tensors:

In [68]:
np.random.normal(size=(3, 3))

array([[ 0.82254491, -1.22084365,  0.2088636 ],
       [-1.95967012, -1.32818605,  0.19686124],
       [ 0.73846658,  0.17136828, -0.11564828]])

In [69]:
tf.random.normal(shape=(3, 3))

<tf.Tensor: shape=(3, 3), dtype=float32, numpy=
array([[ 1.2764754 , -0.6551445 , -0.00389835],
       [ 0.45862213, -1.507044  ,  1.0932531 ],
       [-0.45938343, -0.35318485, -0.3738681 ]], dtype=float32)>

In [70]:
torch.randn(3, 3)

tensor([[ 0.2815,  0.0562,  0.5227],
        [-0.2384, -0.0499,  0.5263],
        [-0.0085,  0.7291,  0.1331]])

Arrays with three or more dimensions are tensors of rank 3 and higher.
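A common place to meet higher-rank tensors is image data. The sketch below assumes a channels-last layout of `(height, width, channels)` and an arbitrary image size of 32 by 32. Both are illustrative choices (PyTorch, for instance, usually puts the channel axis first).

```python
import numpy as np

# Rank-3: one RGB image as (height, width, channels).
image = np.zeros((32, 32, 3))

# Rank-4: a batch of 8 such images as (batch, height, width, channels).
batch = np.zeros((8, 32, 32, 3))

print(image.ndim, image.shape)  # 3 (32, 32, 3)
print(batch.ndim, batch.shape)  # 4 (8, 32, 32, 3)

# Indexing along the first axis drops the rank by one.
print(batch[0].shape)           # (32, 32, 3)
```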

### Supervised and unsupervised

- Supervised learning is when you provide labelled outputs to learn from.
- Unsupervised learning is when you don't provide any labels.

We'll focus on supervised learning in this course.
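The difference shows up in what the learning step is given. In the sketch below, the supervised part fits a least-squares model using both inputs and labels, while the unsupervised part only looks at the inputs and finds their direction of largest variance. Both techniques, the synthetic data, and all the names are stand-ins chosen for illustration, not methods prescribed by this course.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical data: 100 samples with 2 input features.
inputs = rng.normal(size=(100, 2))

# Supervised: labelled outputs are provided alongside the inputs.
# Here the labels are generated from a known linear rule.
true_weights = np.array([2.0, -1.0])
labels = inputs @ true_weights

# Least squares learns the mapping from inputs to labels.
learned_weights, *_ = np.linalg.lstsq(inputs, labels, rcond=None)
print(learned_weights)  # approximately [ 2. -1.]

# Unsupervised: only the inputs are used, with no labels.
# Here we look for structure: the direction of largest variance.
centred = inputs - inputs.mean(axis=0)
_, _, directions = np.linalg.svd(centred, full_matrices=False)
print(directions[0])  # first principal direction (a unit vector)
```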

### Classification and regression

- Classification problems are those that try to predict a category (e.g., cat or dog).
- Regression problems are those that try to predict a number (e.g., the number of beans in a jar).
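In practice, the two problem types differ in how the outputs are stored. The sketch below uses made-up values: the class names, the integer encoding, and the bean counts are all illustrative assumptions.

```python
import numpy as np

# Classification: targets are categories, stored as integer class indices.
class_names = ["cat", "dog"]
class_targets = np.array([0, 1, 1, 0])  # cat, dog, dog, cat

# A common alternative encoding is one-hot: one column per class.
one_hot_targets = np.eye(len(class_names))[class_targets]
print(one_hot_targets)
# [[1. 0.]
#  [0. 1.]
#  [0. 1.]
#  [1. 0.]]

# Regression: targets are numbers (here, hypothetical bean counts).
regression_targets = np.array([412.0, 980.0, 655.0, 730.0])
print(regression_targets.dtype)  # float64
```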

### Training, validation, and test splits

The data is normally split into training, validation, and test sets.

- The training set is for training the model.
- The validation set (optional) is for iteratively tuning the model (e.g., its hyperparameters) during training.
- The test set is only for testing the model at the end.
    - This should remain untouched and single-use.

The size of each split depends on the size of the dataset and the strength of the signal you're trying to predict (e.g., the weaker the signal, the larger the test set needs to be).

- For small data sets, a split of 60/20/20 for train/validation/test may be suitable.
- For large data sets, a split of 90/5/5 for train/validation/test may be suitable.
- For very large data sets, a split of 98/1/1 for train/validation/test may be suitable.
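One simple way to produce such a split is to shuffle the sample indices once and slice them. The sketch below does this for a 60/20/20 split on a synthetic dataset, assuming the samples are independent of each other (a random shuffle may be inappropriate for time series or grouped data). The dataset, the seed, and the variable names are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical dataset: 1,000 samples with 5 features and 1 target each.
num_samples = 1_000
inputs = rng.normal(size=(num_samples, 5))
outputs = rng.normal(size=(num_samples,))

# Shuffle the sample indices once, so inputs and outputs stay aligned.
indices = rng.permutation(num_samples)

# 60/20/20 split boundaries, using integer arithmetic.
train_end = num_samples * 60 // 100
val_end = num_samples * 80 // 100

train_idx = indices[:train_end]
val_idx = indices[train_end:val_end]
test_idx = indices[val_end:]

x_train, y_train = inputs[train_idx], outputs[train_idx]
x_val, y_val = inputs[val_idx], outputs[val_idx]
x_test, y_test = inputs[test_idx], outputs[test_idx]

print(len(x_train), len(x_val), len(x_test))  # 600 200 200
```

Changing the two percentages gives the 90/5/5 or 98/1/1 splits in the same way. Because the index sets come from a single permutation, no sample appears in more than one set.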

- datasets
- data centric AI hub videos
- efficient feeding of data into GPUs
- synthetic data
- pipelines for large data I/O into GPUs using compression/decompression, Ray datasets

In [None]:
import tensorflow as tf

Check whether you have a GPU:

In [None]:
# list the GPUs once and reuse the result
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    print(f"Yes, there are {len(gpus)} GPU(s) available.")
else:
    print('No, GPUs are not available.')

## Exercises

```{admonition} Exercise 1

...

```

## {ref}`Solutions <data>`

## Key Points

```{important}

- [x] _..._

```

## Further information

### Good practices

- ...

### Other options

- ...
 
### Resources

- [TensorFlow official datasets](https://www.tensorflow.org/datasets)
    - Datasets ready to use with TensorFlow.
- [Google research datasets](https://ai.google/tools/datasets/)
    - Large-scale datasets across computer science.
- [Google Dataset Search](https://datasetsearch.research.google.com/)
- [Google Cloud public datasets](https://console.cloud.google.com/marketplace/browse?filter=solution-type:dataset&pli=1)
- [Kaggle Datasets](https://www.kaggle.com/datasets)
- [Torch Vision Datasets](https://pytorch.org/vision/stable/datasets.html)
- [Papers with code - Datasets](https://paperswithcode.com/datasets)