# Lab 2: Data Engineering
## Exercise 2: Add documentation to the dataset

To improve our dataset, we should add some documentation to it. This is done through the `Logger`, just like we did for experiments.

Use `Dataset.get_logger()` to get the logger.

- Add a few metrics, like the number of images.
- Add some example images with their labels.

https://clear.ml/docs/latest/docs/references/sdk/dataset#get_logger
https://clear.ml/docs/latest/docs/references/sdk/logger/
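
The first bullet can go a little further than a single total. The following is a minimal sketch, not the lab's reference solution, that reports the total plus one count per class. `report_dataset_metrics` and `PrintLogger` are hypothetical names introduced here for illustration. The only ClearML call assumed is `report_single_value(name, value)`, so the stand-in logger can be swapped for the object returned by `dataset.get_logger()`.

```python
from collections import Counter


def report_dataset_metrics(logger, labels, class_names):
    """Report the total image count and one count per class as single values."""
    counts = Counter(int(label) for label in labels)
    logger.report_single_value("Number of images", len(labels))
    for index, name in enumerate(class_names):
        # Classes with no examples are reported as 0 rather than skipped.
        logger.report_single_value(f"Number of {name} images", counts.get(index, 0))
    return counts


class PrintLogger:
    """Hypothetical stand-in exposing report_single_value, for a dry run only."""

    def __init__(self):
        self.values = {}

    def report_single_value(self, name, value):
        self.values[name] = value
        print(f"{name}: {value}")


# Dry run with made-up labels; in the lab, pass dataset.get_logger() instead.
demo_logger = PrintLogger()
report_dataset_metrics(demo_logger, labels=[0, 1, 1, 3],
                       class_names=["bird", "cat", "deer", "dog"])
```

Per-class counts make an unbalanced subset visible at a glance on the dataset page.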

Afterward, go to the ClearML website, click on DATASETS on the left side, and find the dataset you just created.

In the VERSION INFO panel on the right, click "Task information" at the bottom. Here you can find the logging that we added.

In [1]:
import os
from dotenv import load_dotenv

#%pip install -q clearml nbconvert

%env CLEARML_WEB_HOST=https://app.clear.ml
%env CLEARML_API_HOST=https://api.clear.ml
%env CLEARML_FILES_HOST=https://files.clear.ml

load_dotenv()

CLEARML_API_ACCESS_KEY=os.getenv("CLEARML_API_ACCESS_KEY")
CLEARML_API_SECRET_KEY=os.getenv("CLEARML_API_SECRET_KEY")

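# fail early if either credential is missing from the environment / .env file
if CLEARML_API_ACCESS_KEY is None:
    raise KeyError("CLEARML_API_ACCESS_KEY")
if CLEARML_API_SECRET_KEY is None:
    raise KeyError("CLEARML_API_SECRET_KEY")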

env: CLEARML_WEB_HOST=https://app.clear.ml
env: CLEARML_API_HOST=https://api.clear.ml
env: CLEARML_FILES_HOST=https://files.clear.ml
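
The cell above reads two credentials from the environment. A minimal sketch of checking both in one place, using only the standard library, could look like this. `missing_credentials` and `REQUIRED_VARS` are hypothetical names, not part of the ClearML SDK.

```python
import os

# Both credentials are read by the setup cell before the SDK is used.
REQUIRED_VARS = ("CLEARML_API_ACCESS_KEY", "CLEARML_API_SECRET_KEY")


def missing_credentials(environ=None):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


# Dry run against a plain dict instead of the real environment.
example_env = {"CLEARML_API_ACCESS_KEY": "abc123"}
print(missing_credentials(example_env))
```

Calling `missing_credentials()` without arguments checks the real environment, which is useful right after `load_dotenv()`.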


In [4]:
from tensorflow import keras
import numpy as np
from clearml import Dataset

# download the dataset
(images, labels), _ = keras.datasets.cifar10.load_data()

# there are 10 classes of images
all_classes = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

# choose four classes (feel free to change this!)
class_names = ["bird", "cat", "deer", "dog"]
print("Class names:", class_names)

# only keep images of these classes
class_indexes = [all_classes.index(c) for c in class_names]
to_keep = np.array([l in class_indexes for l in labels])
images = images[to_keep]
labels = labels[to_keep]

# remap labels from the 10 original class indexes to 0..len(class_names)-1
labels = np.array([class_indexes.index(l) for l in labels])

# normalize pixels between 0 and 1
images = images / 255.0

# ---
# TMP Only store 10 images/labels, because we want fast upload.
# This way we can learn the dataset SDK without waiting too long.
images = images[:10]
labels = labels[:10]
# ---

# split into train and test set
split = round(len(images) * 0.8)
train_images = images[:split]
train_labels = labels[:split]
test_images = images[split:]
test_labels = labels[split:]
print("Number of train images:", len(train_images))
print("Number of test images:", len(test_images))

# save numpy arrays to disk
np.save("train_images.npy", train_images)
np.save("train_labels.npy", train_labels)
np.save("test_images.npy", test_images)
np.save("test_labels.npy", test_labels)

# create ClearML dataset
dataset = Dataset.create(dataset_name="my-dataset", dataset_project="vives-mlops-workshop")
dataset.add_files(path="train_images.npy")
dataset.add_files(path="train_labels.npy")
dataset.add_files(path="test_images.npy")
dataset.add_files(path="test_labels.npy")

# add documentation
logger = dataset.get_logger()

# add a few metrics, like the number of images
# (len() counts images; .size would count every pixel value in the array)
logger.report_single_value("Number of training images", len(train_images))
logger.report_single_value("Number of testing images", len(test_images))

# add some example images with their label
# (a distinct iteration per image keeps examples of the same class apart)
for i in range(len(images)):
    the_class = class_names[labels[i]]
    logger.report_image(title="Labeled examples", series=the_class,
                        iteration=i, image=images[i])

dataset.upload()
dataset.finalize()

Class names: ['bird', 'cat', 'deer', 'dog']
Number of train images: 8
Number of test images: 2
ClearML results page: https://app.clear.ml/projects/c91b253598c347cab986e3aa22fa3207/experiments/8900a988777844b680f9362bf2532314/output/log
ClearML dataset page: https://app.clear.ml/datasets/simple/c91b253598c347cab986e3aa22fa3207/experiments/8900a988777844b680f9362bf2532314
Uploading dataset changes (4 files compressed to 83.14 KiB) to https://files.clear.ml
File compression and upload completed: total size 83.14 KiB, 1 chunk(s) stored (average size 83.14 KiB)


True
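
The filtering and relabelling steps in the cell above are easier to follow on a tiny example. The sketch below uses plain Python lists and a made-up `raw_labels` list instead of the CIFAR-10 arrays. The `remap` dictionary is a hypothetical helper that plays the role of `class_indexes.index(l)`.

```python
all_classes = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']
class_names = ["bird", "cat", "deer", "dog"]

# Positions of the chosen classes in the original 10-class list.
class_indexes = [all_classes.index(c) for c in class_names]
print(class_indexes)  # [2, 3, 4, 5]

# Map each original index to its new index in class_names.
remap = {old: new for new, old in enumerate(class_indexes)}

# Made-up labels: airplane (0) and truck (9) are not among the chosen classes.
raw_labels = [3, 0, 5, 2, 9, 4]

# Keep only the chosen classes, then relabel them as 0..3.
new_labels = [remap[label] for label in raw_labels if label in remap]
print(new_labels)  # [1, 3, 0, 2]

# The new labels index directly into class_names.
print([class_names[label] for label in new_labels])  # ['cat', 'dog', 'bird', 'deer']
```

After this remapping, `class_names[labels[i]]` gives the right series name for each example image.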