# Image classification

This notebook demonstrates how to use the ML cube Platform with image data. We use a Hugging Face dataset and a pre-trained model for image classification. We load the validation data and split it into two parts, using the first as reference data and the second as production data.

In a real-world scenario, the training and validation datasets would be considered historical/reference data, while production data would come from the production environment after the model's deployment.

**With this example you will learn:**
- how to create an image classification task
- how to define a data schema
- how to create a model
- how to upload historical data
- how to set a reference for the model
- how to upload production data

**Requirements**

Your Python environment needs the following dependencies to run this notebook.
```
ml3-platform-sdk>=0.0.22
transformers[torch]==4.41.2
torch==2.2.0
datasets==2.15.0
polars==0.20.31
pillow
numpy==1.26.4
tqdm==4.66.4
```

Imports

In [38]:
from ml3_platform_sdk import enums as ml3_enums
from ml3_platform_sdk import models as ml3_models
from ml3_platform_sdk.client import ML3PlatformClient

from datasets import load_dataset, DatasetDict
from transformers import ViTImageProcessor, ViTModel, ViTForImageClassification
import polars as pl
import json
import datetime
import numpy as np
import tqdm
import os
import tempfile
import shutil
import torch
from PIL import Image

User Inputs

In [40]:
URL = 'https://api.platform.mlcube.com'
API_KEY = ""
PROJECT_ID = ''
model_name = 'mymodel'
model_version = 'v0.0.1'

## Dataset, model and predictions
Download the dataset and the model using the Hugging Face API.

Load dataset

In [42]:
dataset_name = 'ethz/food101'

dataset = load_dataset(dataset_name, split="validation")

We keep only a subset of the labels to reduce the data size.

In [43]:
LABELS = [0, 1, 2]

In [None]:
dataset = dataset.filter(lambda example: example['label'] in LABELS).train_test_split(test_size=0.4, seed=42, stratify_by_column='label', shuffle=True)

In [None]:
len(dataset["train"]['image']), len(dataset["test"]['image'])

Here we define the embedder and classifier classes. They are two thin wrappers that simplify getting embeddings and predictions from the pre-trained model.

In [49]:
class Embedder:
    def __init__(self):
        self.model = ViTModel.from_pretrained("nateraw/food")
        self.processor = ViTImageProcessor.from_pretrained('nateraw/food')
        
    def __call__(self, images):
        with torch.no_grad():
            return self.model(
                **self.processor(images=images, return_tensors='pt')
            ).last_hidden_state[:, 0, :].numpy()

class Classifier:
    def __init__(self):
        self.classifier = ViTForImageClassification.from_pretrained("nateraw/food")
        self.processor = ViTImageProcessor.from_pretrained('nateraw/food')
        
    def __call__(self, images):
        with torch.no_grad():
            return self.classifier(
                **self.processor(images=images, return_tensors='pt')
            ).logits.argmax(-1).numpy()

In [50]:
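# Processor used in build_data_objects before saving images to disk.
# Rescaling and normalization are disabled so that pixel values keep their original range.
processor = ViTImageProcessor.from_pretrained('nateraw/food')
processor.do_normalize = False
processor.do_rescale = False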

In [None]:
embedder = Embedder()
classifier_model = Classifier()

## Create data objects for ML cube Platform
We will use local data sources to upload data. Hence, we need to create local files that will be shared with the ML cube Platform.
Image data are sent as a zipped folder containing the image samples. An additional file (usually a csv) maps each image file name to its sample id and timestamp.
Along with these two files, it is possible to send custom embeddings of the images as a parquet file.
Target and predictions are sent as tabular csv files.
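
To make the mapping file concrete, here is a minimal sketch of its layout, written with the standard `csv` module. The column names match the ones used later in `build_data_objects`; the two rows and their timestamps are hypothetical values chosen for illustration.

```python
import csv
import io

# Hypothetical rows: two images, 120 seconds apart.
rows = [
    {"sample-id": "sample_0", "timestamp": 1700000000.0, "file_name": "sample_0.jpg"},
    {"sample-id": "sample_1", "timestamp": 1700000120.0, "file_name": "sample_1.jpg"},
]

# Write the mapping to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["sample-id", "timestamp", "file_name"],
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(rows)

mapping_csv = buffer.getvalue()
print(mapping_csv)
```

Each `file_name` must correspond to an image inside the zipped folder, so that every image can be matched to its sample id and timestamp.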

Data are uploaded separately for each category:
- **inputs:** `ImageData` object pointing to a zip folder containing jpg images;
- **target:** `TabularData` object pointing to a csv file;
- **predictions:** `TabularData` object, also pointing to a csv file.

When dealing with unstructured data like images, there are three ways to send them:
1. Sending only embeddings, which are a numerical representation of the image sample as a vector, using `EmbeddingData`;
2. Sending only the raw image, using `ImageData`. In this case ML cube Platform will create the numerical representation using internal encoders;
3. Sending the raw image and its embeddings, using `ImageData` with the `embedding_source` attribute. This more complete option has two benefits: it allows the use of a custom embedder, which is usually specialized on the domain rather than a general-purpose one, and it enables the extraction of additional metrics from the image, providing more functionality in the web application.

In [10]:
def build_data_objects(
    dataset,
    model,
    embedder,
    model_name: str,
    model_version: str,
    starting_id: int,
    starting_timestamp: float,
    prefix: str,
) -> tuple[
    ml3_models.ImageData,
    ml3_models.TabularData,
    ml3_models.TabularData,
    int,
    float
]:
    """Builds data objects for inputs, target and predictions.
    
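    Each sample needs an id and a timestamp, which we generate as
    incremental counters. Therefore, we need a starting point for both.

    Parameters
    ----------
    dataset: Hugging Face dataset with `image` and `label` columns
    model: classifier wrapper used to compute predictions
    embedder: embedder wrapper used to compute image embeddings
    model_name: model name, used to name the prediction column
    model_version: model version, used to name the prediction column
    starting_id: sample id to start from
    starting_timestamp: timestamp to start from
    prefix: prefix of the files written to the working directory

    Returns
    -------
    inputs_data, target_data, predictions_data,
    next_starting_sample_id, next_starting_timestamp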
    """

    n_samples = len(dataset['image'])
    next_starting_sample_id = starting_id + n_samples
    sample_ids = list(map(lambda x: f'sample_{x}', range(starting_id, starting_id + n_samples)))
    # each sample has a time delta of 2 minutes
    next_starting_timestamp = starting_timestamp + n_samples * 120
    timestamps = list(np.arange(starting_timestamp, starting_timestamp + n_samples * 120, 120))

    # Save the images to a temporary folder, then create the mapping file and the zip archive
    print('Creating image file')
    with tempfile.TemporaryDirectory() as image_dir:
        
        image_names = []
        for (i, sample) in enumerate(dataset['image']):
            file_name = f'sample_{i}.jpg'
            # the processor has rescaling and normalization disabled (see above)
            image = Image.fromarray(processor(sample)['pixel_values'][0].transpose([1, 2, 0]))
            image.convert('RGB').save(os.path.join(image_dir, file_name))
            image_names.append(file_name)
    
        # save mapping dataframe
        mapping = pl.DataFrame({
            'sample-id': sample_ids,
            'timestamp': timestamps,
            'file_name': image_names,
        })
        mapping.write_csv(f'{prefix}_mapping.csv')

        # compress images folder as zip
        shutil.make_archive(f'{prefix}_images', 'zip', image_dir)
        
        # Create embedding dataframe
        embedding_list = []
        for sample in tqdm.tqdm(dataset['image']):
            embedding_list.append(embedder(sample).tolist()[0])
        print('Creating embedding file')
        embeddings = pl.DataFrame({
            'timestamp': timestamps,
            'sample-id': sample_ids,
            'embedding': embedding_list
        })
        embeddings.write_parquet(f'{prefix}_embeddings.parquet')
    
        print('Creating target file')
        target = pl.DataFrame({
            'label': dataset['label'],
            'timestamp': timestamps,
            'sample-id': sample_ids,
        })
        target.write_csv(f'{prefix}_target.csv')
    
        print('Creating predictions file')
        predicted_labels = []
        for sample in tqdm.tqdm(dataset['image']):
            prediction = model(sample).item()
            if prediction not in LABELS:
                prediction = LABELS[0]
            predicted_labels.append(prediction)
            
        predictions = pl.DataFrame({
            'timestamp': timestamps,
            'sample-id': sample_ids,
            f'{model_name}@{model_version}': predicted_labels,
        })
        predictions.write_csv(f'{prefix}_predictions.csv')
    
        print('Creating inputs data')
        inputs_data = ml3_models.ImageData(
            source=ml3_models.LocalDataSource(
                file_type=ml3_enums.FileType.JPG,
                is_folder=True,
                folder_type=ml3_enums.FolderType.ZIP,
                file_path=f'{prefix}_images.zip'
            ),
            mapping_source=ml3_models.LocalDataSource(
                file_type=ml3_enums.FileType.CSV,
                is_folder=False,
                folder_type=None,
                file_path=f'{prefix}_mapping.csv'
            ),
            embedding_source=ml3_models.LocalDataSource(
                file_type=ml3_enums.FileType.PARQUET,
                is_folder=False,
                folder_type=None,
                file_path=f'{prefix}_embeddings.parquet',
            )
        )
    
        print('Creating target data')
        target_data = ml3_models.TabularData(
            source=ml3_models.LocalDataSource(
                file_type=ml3_enums.FileType.CSV,
                is_folder=False,
                folder_type=None,
                file_path=f'{prefix}_target.csv'
            )
        )
    
        print('Creating predictions data')
        predictions_data = ml3_models.TabularData(
            source=ml3_models.LocalDataSource(
                file_type=ml3_enums.FileType.CSV,
                is_folder=False,
                folder_type=None,
                file_path=f'{prefix}_predictions.csv'
            )
        )
        
        return (
            inputs_data,
            target_data,
            predictions_data,
            next_starting_sample_id,
            next_starting_timestamp
        )


In [None]:
training_initial_sample_id = 0
training_initial_timestamp = datetime.datetime.now().timestamp()

(
    training_inputs_data,
    training_target_data,
    _,
    starting_id,
    starting_timestamp,
) = build_data_objects(
    dataset["train"],
    classifier_model,
    embedder,
    model_name,
    model_version,
    training_initial_sample_id,
    training_initial_timestamp,
    'train'
)
training_end_timestamp = starting_timestamp - 120

In [None]:
(
    production_0_inputs_data,
    production_0_target_data,
    production_0_prediction_data,
    starting_id,
    starting_timestamp,
) = build_data_objects(
    dataset["test"],
    classifier_model,
    embedder,
    model_name,
    model_version,
    starting_id,
    starting_timestamp,
    'prod_0'
)
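
The id and timestamp bookkeeping used by `build_data_objects` can be checked in isolation. The sketch below mirrors that logic with plain Python lists; the helper name `make_ids_and_timestamps`, the batch sizes and the start time are hypothetical and used only for illustration.

```python
def make_ids_and_timestamps(starting_id, starting_timestamp, n_samples, step=120.0):
    """Return ids, timestamps and the starting points for the next batch."""
    sample_ids = [f"sample_{i}" for i in range(starting_id, starting_id + n_samples)]
    # Consecutive samples are `step` seconds apart (2 minutes by default).
    timestamps = [starting_timestamp + k * step for k in range(n_samples)]
    next_id = starting_id + n_samples
    next_timestamp = starting_timestamp + n_samples * step
    return sample_ids, timestamps, next_id, next_timestamp


# Reference batch of 3 samples, followed by a production batch of 2 samples.
ref_ids, ref_ts, next_id, next_ts = make_ids_and_timestamps(0, 1_700_000_000.0, 3)
prod_ids, prod_ts, _, _ = make_ids_and_timestamps(next_id, next_ts, 2)

# The reference range ends at the last reference sample,
# one step before the first production sample.
reference_end = next_ts - 120.0

print(ref_ids, prod_ids)
print(reference_end == ref_ts[-1], prod_ts[0] > reference_end)
```

Chaining the returned starting points keeps ids unique across batches and ensures that production timestamps start right after the reference range.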

## Create data schema

The data schema describes the data used in the task, with their specific names.
A data schema contains:
- *sample id*, column that is used to uniquely identify each sample
- *timestamp*, column that is used to order samples
- *input*, column that specifies the nature of the input. In this case, it's an ARRAY_3 representing the image
- *input additional embedding*, optional column for the embedding of the image data
- *target*, column that specifies the nature of the target. In this case, categorical with three possible values

The prediction column must not be specified because it is added automatically when the model is created, with a name of the form `MODEL_NAME@MODEL_VERSION`.
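
As a small illustration of this naming convention, the sketch below builds the prediction column name from the model name and version used in this notebook. The helper `prediction_column_name` is hypothetical and not part of the SDK; it produces the same string that `build_data_objects` uses as the column name in the predictions file.

```python
def prediction_column_name(model_name: str, model_version: str) -> str:
    """Build the prediction column name in the MODEL_NAME@MODEL_VERSION form."""
    return f"{model_name}@{model_version}"


column = prediction_column_name("mymodel", "v0.0.1")
print(column)  # mymodel@v0.0.1
```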

In [13]:
data_schema = ml3_models.DataSchema(
    columns=[
        ml3_models.ColumnInfo(
            name='timestamp',
            role=ml3_enums.ColumnRole.TIME_ID,
            is_nullable=False,
            data_type=ml3_enums.DataType.FLOAT,
        ),
        ml3_models.ColumnInfo(
            name='sample-id',
            role=ml3_enums.ColumnRole.ID,
            is_nullable=False,
            data_type=ml3_enums.DataType.STRING,
        ),
        ml3_models.ColumnInfo(
            name='image',
            role=ml3_enums.ColumnRole.INPUT,
            is_nullable=False,
            data_type=ml3_enums.DataType.ARRAY_3,
            dims=(224, 224, 3),
            image_mode=ml3_enums.ImageMode.RGB
        ),
        ml3_models.ColumnInfo(
            name="embedding",
            role=ml3_enums.ColumnRole.INPUT_ADDITIONAL_EMBEDDING,
            is_nullable=False,
            data_type=ml3_enums.DataType.ARRAY_1,
            dims=(768,),
        ),
        ml3_models.ColumnInfo(
            name='label',
            role=ml3_enums.ColumnRole.TARGET,
            is_nullable=False,
            data_type=ml3_enums.DataType.CATEGORICAL,
            possible_values=LABELS
        ),
    ]
)

## Interaction with ML cube Platform
To start, we create an instance of the ML cube Platform client using the provided API key.
Then, we create a task, add a data schema, create a model and finally upload the data.

In [None]:
client = ML3PlatformClient(URL, API_KEY)

When creating a task, we need to specify the following information:
- **task_type:** artificial intelligence task type. In this case it is a multiclass classification.
- **data_structure:** the type of input data. In this case, it is Image.
- **optional_target:** whether the target can be missing in production data. We assume that reference data, being the training data, always have the target.
    However, the target may not be available in other historical data or in production data. When the optional target option is enabled, the ML cube Platform does not check for its presence in the data sent to the platform.

In [17]:
task_id = client.create_task(
    project_id=PROJECT_ID,
    name='image_task',
    tags=["tag_1", "tag_2"],
    task_type=ml3_enums.TaskType.CLASSIFICATION_MULTICLASS,
    data_structure=ml3_enums.DataStructure.IMAGE,
    optional_target=False
)

In [18]:
client.add_data_schema(task_id=task_id, data_schema=data_schema)

A model is uniquely identified by its `name` and `version`.

In [19]:
model_id: str = client.create_model(
    task_id=task_id,
    name=model_name,
    version=model_version,
    metric_name=ml3_enums.ModelMetricName.ACCURACY,
    preferred_suggestion_type=ml3_enums.SuggestionType.SAMPLE_WEIGHTS,
    with_probabilistic_output=False
)

Training data are uploaded as historical data, i.e., any data that do not come from production.
We then set them as the reference data of our model in order to set up the detection algorithms.

Note that other historical data can be added at any time.

In [None]:
job_id = client.add_historical_data(
    task_id=task_id,
    inputs=training_inputs_data,
    target=training_target_data
)
print(f'Waiting for job {job_id}')
client.wait_job_completion(job_id=job_id)
print(f'Job {job_id} completed')

In [None]:
job_id = client.set_model_reference(
    model_id,
    from_timestamp=training_initial_timestamp,
    to_timestamp=training_end_timestamp,
)
print(f'Waiting for job {job_id}')
client.wait_job_completion(job_id=job_id)
print(f'Job {job_id} completed')

Now we are ready to upload production data.
Production data can be uploaded asynchronously: each data category can be uploaded as soon as it is available, without waiting for the others.
This is especially useful for *target* data, which usually become available with some delay.

In [None]:
job_id = client.add_production_data(
    task_id=task_id,
    inputs=production_0_inputs_data,
    target=production_0_target_data,
    predictions=[(model_id, production_0_prediction_data)]
)
print(f'Waiting for job {job_id}')
client.wait_job_completion(job_id=job_id)
print(f'Job {job_id} completed')
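
The asynchronous upload mentioned above could look like the following sketch, which sends inputs and predictions first and the target later. It assumes that `add_production_data` accepts a subset of the data categories in each call, as the description suggests; please check the SDK documentation before relying on it. The helper `upload_production_in_two_steps` and the `RecordingClient` stub are hypothetical and exist only to make the example runnable without a platform connection.

```python
def upload_production_in_two_steps(client, task_id, model_id, inputs, predictions, target):
    """Upload inputs and predictions first, then the target once it is available."""
    # Step 1: inputs and predictions are available as soon as the model runs.
    job_id = client.add_production_data(
        task_id=task_id,
        inputs=inputs,
        predictions=[(model_id, predictions)],
    )
    client.wait_job_completion(job_id=job_id)

    # Step 2: the target is sent later, once the labels have been collected.
    job_id = client.add_production_data(task_id=task_id, target=target)
    client.wait_job_completion(job_id=job_id)


class RecordingClient:
    """Hypothetical stand-in for ML3PlatformClient that only records calls."""

    def __init__(self):
        self.calls = []

    def add_production_data(self, **kwargs):
        self.calls.append(kwargs)
        return f"job_{len(self.calls)}"

    def wait_job_completion(self, job_id):
        pass


stub = RecordingClient()
upload_production_in_two_steps(stub, "task", "model", "inputs", "predictions", "target")
print(len(stub.calls))  # 2
```

With the real client, the same function would be called with `client`, `task_id`, `model_id` and the production data objects built earlier in place of the placeholder strings.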