# Image classification

This notebook shows how to use ML cube Platform with image data.
We use a Hugging Face dataset and a trained model for image classification.
The dataset contains train and validation sets: we use the train set as the reference dataset and the validation set as production data.
Of course, in a real scenario both sets would be part of the historical/reference data, and production data would come from the production environment after the algorithm is deployed.

**With this example you will learn:**
- how to create an image classification task
- how to define a data schema
- how to create a model
- how to upload historical data
- how to set reference for the model
- how to upload production data

**Requirements**

To run this notebook, the Python environment needs the following packages.
```
ml3-platform-sdk>=0.0.15
transformers[torch]==4.41.2
torch==2.2.0
datasets==2.15.0
polars==0.20.31
pillow
numpy==1.26.4
tqdm==4.66.4
```
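
Before running the notebook, it can help to confirm which of these packages are actually installed. The sketch below uses only the standard library; the helper name `installed_versions` is illustrative and not part of any SDK. Note that `json` is part of the Python standard library, so it is not checked here.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: installed version, or None if it is missing}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


required = ["ml3-platform-sdk", "transformers", "torch", "datasets",
            "polars", "numpy", "tqdm"]
for name, version in installed_versions(required).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```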


Imports

In [None]:
from ml3_platform_sdk import enums as ml3_enums
from ml3_platform_sdk import models as ml3_models
from ml3_platform_sdk.client import ML3PlatformClient

from datasets import load_dataset, DatasetDict
from transformers import ViTImageProcessor, ViTModel, ViTForImageClassification
import polars as pl
import json
import datetime
import numpy as np
import tqdm
import os
import tempfile
import shutil
import torch
from PIL import Image

User Inputs

In [None]:
URL = 'https://api.platform.mlcube.com'
API_KEY = ""
PROJECT_ID = ''
model_name = 'mymodel'
model_version = 'v0.0.1'
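
Instead of typing the API key into the notebook, the values above could be read from environment variables. This is a minimal sketch; the helper `read_setting` and the variable names `ML3_API_KEY` and `ML3_PROJECT_ID` are hypothetical and not defined by the SDK.

```python
import os


def read_setting(name, default=None):
    """Read a setting from the environment, failing loudly if it is missing."""
    value = os.environ.get(name, default)
    if not value:
        raise RuntimeError(f"Missing required setting: {name}")
    return value


# Hypothetical variable names; pick whatever fits your setup.
# API_KEY = read_setting("ML3_API_KEY")
# PROJECT_ID = read_setting("ML3_PROJECT_ID")
```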

## Dataset, model and predictions
Download the dataset and the model using the Hugging Face API.
Once both are downloaded, we run the model to get predictions.

Load dataset

In [None]:
dataset = load_dataset("ethz/food101")

We use only a subset of labels to reduce the data size.

In [None]:
LABELS = [0, 1, 2]

In [None]:
def sample_dataset(dataset, classes, fraction=0.05, seed=42):
    sampled_dataset = DatasetDict()
    for split in dataset.keys():
        sampled_dataset[split] = dataset[split].filter(lambda example: example['label'] in classes)
        if fraction is not None:
            sampled_dataset[split] = sampled_dataset[split].train_test_split(test_size=fraction, seed=seed)['test']
    return sampled_dataset

# Perform the sampling
dataset = sample_dataset(dataset, LABELS, fraction=0.3)
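
To see what the sampling does without downloading anything, the same two steps (keep only the chosen labels, then draw a seeded random fraction) can be sketched on plain Python records. This is an illustration of the idea, not the `datasets` implementation; `sample_records` and the toy records are made up for the example.

```python
import random


def sample_records(records, classes, fraction=0.05, seed=42):
    """Keep records whose label is in `classes`, then draw a random fraction."""
    kept = [r for r in records if r["label"] in classes]
    if fraction is None:
        return kept
    rng = random.Random(seed)  # fixed seed makes the draw reproducible
    n_keep = max(1, round(len(kept) * fraction)) if kept else 0
    return rng.sample(kept, n_keep)


# Toy data: 100 records spread evenly over 5 labels
records = [{"id": i, "label": i % 5} for i in range(100)]
subset = sample_records(records, classes=[0, 1, 2], fraction=0.3)
print(len(subset), sorted({r["label"] for r in subset}))
```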

In [None]:
len(dataset['train']['image']), len(dataset['validation']['image'])

Load the model. We use two wrappers to simplify getting predictions and embeddings.

In [None]:
class Embedder():
    def __init__(self):
        self.model = ViTModel.from_pretrained("nateraw/food")
        self.processor = ViTImageProcessor.from_pretrained('nateraw/food')
        
    def __call__(self, images):
        with torch.no_grad():
            return self.model(
                **self.processor(images=images, return_tensors='pt')
            ).last_hidden_state[:, 0, :].numpy()

class Classifier():
    def __init__(self):
        self.classifier = ViTForImageClassification.from_pretrained("nateraw/food")
        self.processor = ViTImageProcessor.from_pretrained('nateraw/food')
        
    def __call__(self, images):
        with torch.no_grad():
            return self.classifier(
                **self.processor(images=images, return_tensors='pt')
            ).logits.argmax(-1).numpy()

In [None]:
processor = ViTImageProcessor.from_pretrained('nateraw/food')
processor.do_normalize = False
processor.do_rescale = False

In [None]:
embedder = Embedder()
classifier_model = Classifier()

## Create data objects for ML cube Platform
We upload data from local data sources, so we need to create local files that will be shared with ML cube Platform.
Image data are sent as a zipped folder containing one image file per sample; an additional file (usually CSV) maps each image file name to its timestamp and sample id.
Along with these two files, it is possible to send custom embeddings of the images as a Parquet file.
Target and predictions can be sent as tabular CSV files.

In ML cube Platform, data are uploaded separately for each category:
- **inputs:** ImageData object, here a zipped folder of JPG images with a CSV mapping file and a Parquet embedding file
- **target:** TabularData object in CSV format
- **predictions:** TabularData object in CSV format
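
The file layout for the inputs (a zip with one file per sample, plus a mapping CSV) can be sketched with the standard library alone. The helper `write_image_bundle` is illustrative, and the file contents are placeholder bytes rather than real JPEG data; the column names follow the ones used later in this notebook.

```python
import csv
import os
import shutil
import tempfile
import zipfile


def write_image_bundle(files, timestamps, prefix):
    """Write `<prefix>_images.zip` and `<prefix>_mapping.csv`.

    `files` maps a sample id to a (file name, raw file bytes) pair.
    """
    with tempfile.TemporaryDirectory() as image_dir:
        for file_name, content in files.values():
            with open(os.path.join(image_dir, file_name), "wb") as handle:
                handle.write(content)
        # make_archive appends the ".zip" extension itself
        archive_path = shutil.make_archive(f"{prefix}_images", "zip", image_dir)

    mapping_path = f"{prefix}_mapping.csv"
    with open(mapping_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample-id", "timestamp", "file_name"])
        for (sample_id, (file_name, _)), timestamp in zip(files.items(), timestamps):
            writer.writerow([sample_id, timestamp, file_name])
    return archive_path, mapping_path


work_dir = tempfile.mkdtemp()
files = {
    "sample_0": ("sample_0.jpg", b"placeholder bytes, not a real JPEG"),
    "sample_1": ("sample_1.jpg", b"placeholder bytes, not a real JPEG"),
}
archive_path, mapping_path = write_image_bundle(
    files, timestamps=[0.0, 120.0], prefix=os.path.join(work_dir, "demo")
)
with zipfile.ZipFile(archive_path) as archive:
    print(sorted(archive.namelist()))
```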


When dealing with unstructured data such as images, there are three ways to send them:
1. By sending only embeddings, i.e., a numerical representation of each image sample as a vector, using `EmbeddingData`;
2. By sending only the unstructured images, using `ImageData`. In this case, ML cube Platform creates the numerical representation using internal encoders;
3. By sending the unstructured images along with embeddings, using `ImageData` with the `embedding_source` attribute. This more complete option has two benefits: it uses your own embedder, which is usually focused on the domain rather than general-purpose, and it lets the platform use the images to extract additional metrics and to offer full functionality in the web application.

In [None]:
def build_data_objects(
    dataset,
    model,
    embedder,
    model_name: str,
    model_version: str,
    starting_id: int,
    starting_timestamp: float,
    prefix: str,
) -> tuple[
    ml3_models.ImageData,
    ml3_models.TabularData,
    ml3_models.TabularData,
    int,
    float
]:
    """Builds data objects for inputs, target and predictions.
    
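    Each sample needs an id and a timestamp, which we generate as
    incremental counters, so we need a starting point for both.
    As a side effect, the function writes the `{prefix}_*` files
    (images zip, mapping, embeddings, target, predictions) to the
    current working directory.

    Parameters
    ----------
    dataset: Hugging Face dataset split with `image` and `label` columns
    model: callable that returns the predicted label for an image
    embedder: callable that returns the embedding of an image
    model_name: model name, used to name the predictions column
    model_version: model version, used to name the predictions column
    starting_id: sample id to start from
    starting_timestamp: timestamp to start from
    prefix: prefix of the names of the files written to disk

    Returns
    -------
    inputs_data, target_data, predictions_data,
    next_starting_sample_id, next_starting_timestamp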
    """

    n_samples = len(dataset['image'])
    next_starting_sample_id = starting_id + n_samples
    sample_ids = list(map(lambda x: f'sample_{x}', range(starting_id, starting_id + n_samples)))
    # each sample has a time delta of 2 minutes
    next_starting_timestamp = starting_timestamp + n_samples * 120
    timestamps = list(np.arange(starting_timestamp, starting_timestamp + n_samples * 120, 120))

    # Write one image file per sample into a temporary folder
    print('Creating image file')
    with tempfile.TemporaryDirectory() as image_dir:

        image_names = []
        for (i, sample) in enumerate(dataset['image']):
            file_name = f'sample_{i}.jpg'
            # the module-level `processor` only resizes: normalization and rescaling are disabled
            image = Image.fromarray(processor(sample)['pixel_values'][0].transpose([1, 2, 0]))
            image.convert('RGB').save(os.path.join(image_dir, file_name))
            image_names.append(file_name)
    
        # save mapping dataframe
        mapping = pl.DataFrame({
            'sample-id': sample_ids,
            'timestamp': timestamps,
            'file_name': image_names,
        })
        mapping.write_csv(f'{prefix}_mapping.csv')

        # compress images folder as zip
        shutil.make_archive(f'{prefix}_images', 'zip', image_dir)
        
        # Create embedding dataframe
        embedding_list = []
        for sample in tqdm.tqdm(dataset['image']):
            embedding_list.append(embedder(sample).tolist()[0])
        print('Creating embedding file')
        embeddings = pl.DataFrame({
            'timestamp': timestamps,
            'sample-id': sample_ids,
            'embedding': embedding_list
        })
        embeddings.write_parquet(f'{prefix}_embeddings.parquet')
    
        print('Creating target file')
        target = pl.DataFrame({
            'label': dataset['label'],
            'timestamp': timestamps,
            'sample-id': sample_ids,
        })
        target.write_csv(f'{prefix}_target.csv')
    
        print('Creating predictions file')
        predicted_labels = []
        for sample in tqdm.tqdm(dataset['image']):
            prediction = model(sample).item()
            if prediction not in LABELS:
                prediction = LABELS[0]
            predicted_labels.append(prediction)
            
        predictions = pl.DataFrame({
            'timestamp': timestamps,
            'sample-id': sample_ids,
            f'{model_name}@{model_version}': predicted_labels,
        })
        predictions.write_csv(f'{prefix}_predictions.csv')
    
        print('Creating inputs data')
        inputs_data = ml3_models.ImageData(
            source=ml3_models.LocalDataSource(
                file_type=ml3_enums.FileType.JPG,
                is_folder=True,
                folder_type=ml3_enums.FolderType.ZIP,
                file_path=f'{prefix}_images.zip'
            ),
            mapping_source=ml3_models.LocalDataSource(
                file_type=ml3_enums.FileType.CSV,
                is_folder=False,
                folder_type=None,
                file_path=f'{prefix}_mapping.csv'
            ),
            embedding_source=ml3_models.LocalDataSource(
                file_type=ml3_enums.FileType.PARQUET,
                is_folder=False,
                folder_type=None,
                file_path=f'{prefix}_embeddings.parquet',
            )
        )
    
        print('Creating target data')
        target_data = ml3_models.TabularData(
            source=ml3_models.LocalDataSource(
                file_type=ml3_enums.FileType.CSV,
                is_folder=False,
                folder_type=None,
                file_path=f'{prefix}_target.csv'
            )
        )
    
        print('Creating predictions data')
        predictions_data = ml3_models.TabularData(
            source=ml3_models.LocalDataSource(
                file_type=ml3_enums.FileType.CSV,
                is_folder=False,
                folder_type=None,
                file_path=f'{prefix}_predictions.csv'
            )
        )
        
        return (
            inputs_data,
            target_data,
            predictions_data,
            next_starting_sample_id,
            next_starting_timestamp
        )
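
The id and timestamp bookkeeping inside `build_data_objects` can be looked at in isolation. The sketch below reproduces the same arithmetic in plain Python (the helper name `make_ids_and_timestamps` is illustrative): each batch starts where the previous one ended, and consecutive samples are 120 seconds apart.

```python
def make_ids_and_timestamps(starting_id, starting_timestamp, n_samples, step=120.0):
    """Return (sample_ids, timestamps, next_id, next_timestamp) for one batch."""
    sample_ids = [f"sample_{i}" for i in range(starting_id, starting_id + n_samples)]
    timestamps = [starting_timestamp + k * step for k in range(n_samples)]
    return (
        sample_ids,
        timestamps,
        starting_id + n_samples,
        starting_timestamp + n_samples * step,
    )


# First batch (e.g. training data), then a second batch that continues from it
train_ids, train_ts, next_id, next_ts = make_ids_and_timestamps(0, 1_000.0, 3)
prod_ids, prod_ts, _, _ = make_ids_and_timestamps(next_id, next_ts, 2)
print(train_ids, train_ts)
print(prod_ids, prod_ts)
```

The last timestamp of a batch is the returned next timestamp minus one step, which is why the training end timestamp is computed by subtracting 120.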


In [None]:
training_initial_sample_id = 0
training_initial_timestamp = datetime.datetime.now().timestamp()

(
    training_inputs_data,
    training_target_data,
    _,
    starting_id,
    starting_timestamp,
) = build_data_objects(
    dataset['train'],
    classifier_model,
    embedder,
    model_name,
    model_version,
    training_initial_sample_id,
    training_initial_timestamp,
    'train'
)
training_end_timestamp = starting_timestamp - 120

In [None]:
(
    production_0_inputs_data,
    production_0_target_data,
    production_0_prediction_data,
    starting_id,
    starting_timestamp,
) = build_data_objects(
    dataset['validation'],
    classifier_model,
    embedder,
    model_name,
    model_version,
    starting_id,
    starting_timestamp,
    'prod_0'
)

## Create data schema

The data schema specifies the type of data present in the task, with their specific names.
A data schema must contain:
- *sample id* column, used to uniquely identify each sample
- *timestamp* column, used to order samples
- *input* column, which specifies the nature of the input. In this case, an image
- *input additional embedding* column (optional), for an additional embedding of the image data
- *target* column, which specifies the nature of the target. In this case, categorical with three values

The prediction column must not be specified, because it is added automatically during model creation with a name of the form MODEL_NAME@MODEL_VERSION.

In [None]:
data_schema = ml3_models.DataSchema(
    columns=[
        ml3_models.ColumnInfo(
            name='timestamp',
            role=ml3_enums.ColumnRole.TIME_ID,
            is_nullable=False,
            data_type=ml3_enums.DataType.FLOAT,
        ),
        ml3_models.ColumnInfo(
            name='sample-id',
            role=ml3_enums.ColumnRole.ID,
            is_nullable=False,
            data_type=ml3_enums.DataType.STRING,
        ),
        ml3_models.ColumnInfo(
            name='image',
            role=ml3_enums.ColumnRole.INPUT,
            is_nullable=False,
            data_type=ml3_enums.DataType.ARRAY_3,
            dims=(224, 224, 3)
        ),
        ml3_models.ColumnInfo(
            name="embedding",
            role=ml3_enums.ColumnRole.INPUT_ADDITIONAL_EMBEDDING,
            is_nullable=False,
            data_type=ml3_enums.DataType.ARRAY_1,
            dims=(768,),
        ),
        ml3_models.ColumnInfo(
            name='label',
            role=ml3_enums.ColumnRole.TARGET,
            is_nullable=False,
            data_type=ml3_enums.DataType.CATEGORICAL,
            possible_values=LABELS
        ),
    ]
)
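
Since the schema and the uploaded files must agree on column names, a quick header check can catch mismatches before any upload. This is a minimal sketch using only the standard library; `prediction_column` and `missing_columns` are illustrative helpers, and the naming rule is the MODEL_NAME@MODEL_VERSION form used for the predictions file in this notebook.

```python
import csv
import io


def prediction_column(model_name, model_version):
    """Column name used for a model's predictions: MODEL_NAME@MODEL_VERSION."""
    return f"{model_name}@{model_version}"


def missing_columns(csv_text, expected):
    """Return the expected column names that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [name for name in expected if name not in header]


# Inline CSV text standing in for a predictions file
csv_text = "timestamp,sample-id,mymodel@v0.0.1\n0.0,sample_0,1\n"
expected = ["timestamp", "sample-id", prediction_column("mymodel", "v0.0.1")]
print(missing_columns(csv_text, expected))
```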

## Interaction with ML cube Platform
To start, we create an instance of the ML cube Platform client using the provided API key.
Then we create a task, add a data schema, create a model and finally upload data.

In [None]:
client = ML3PlatformClient(URL, API_KEY)

When creating a task, we need to specify some information:
- **task_type:** the artificial intelligence task type. In this case, multiclass classification
- **data_structure:** the type of input data. In this case, image
- **optional_target:** whether the target can be missing for production data.
    We assume that reference data, being the training data, always have the target.
    However, the target may not be available for other historical data or for production data. With this option enabled, ML cube Platform does not require the target, and its absence does not stop the jobs.
- **cost_info:** this optional field specifies the costs of the model's errors and is used when computing the retraining report.

In [None]:
task_id = client.create_task(
    project_id=PROJECT_ID,
    name='image_task',
    tags=["tag_1", "tag_2"],
    task_type=ml3_enums.TaskType.CLASSIFICATION_MULTICLASS,
    data_structure=ml3_enums.DataStructure.IMAGE,
    optional_target=False
)

In [None]:
client.add_data_schema(task_id=task_id, data_schema=data_schema)

After adding the data schema, we can create our model.

A model is uniquely identified by its `name` and `version`.

In [None]:
model_id: str = client.create_model(
    task_id=task_id,
    name=model_name,
    version=model_version,
    metric_name=ml3_enums.ModelMetricName.ACCURACY,
    preferred_suggestion_type=ml3_enums.SuggestionType.SAMPLE_WEIGHTS,
    with_probabilistic_output=False
)

Training data are uploaded as historical data, i.e., any data that do not come from production.
Then we set them as the reference data of our model in order to set up the detection algorithms.

Note that it is possible to add other historical data at any time.

In [None]:
job_id = client.add_historical_data(
    task_id=task_id,
    inputs=training_inputs_data,
    target=training_target_data
)
print(f'Waiting for job {job_id}')
client.wait_job_completion(job_id=job_id)
print(f'Job {job_id} completed')

In [None]:
job_id = client.set_model_reference(
    model_id,
    from_timestamp=training_initial_timestamp,
    to_timestamp=training_end_timestamp,
)
print(f'Waiting for job {job_id}')
client.wait_job_completion(job_id=job_id)
print(f'Job {job_id} completed')
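
The submit, wait, print pattern is the same for every job in this notebook, so it could be wrapped in a small helper. The sketch below relies only on `wait_job_completion(job_id=...)` as used above and assumes the submitting method returns a job id; `submit_and_wait` and `FakeClient` are illustrative names, and the fake client exists only so the example runs without the platform.

```python
def submit_and_wait(client, submit, *args, **kwargs):
    """Call a job-submitting client method, then block until the job is done.

    `submit` is a bound method such as `client.add_historical_data`;
    it is assumed to return a job id.
    """
    job_id = submit(*args, **kwargs)
    print(f'Waiting for job {job_id}')
    client.wait_job_completion(job_id=job_id)
    print(f'Job {job_id} completed')
    return job_id


class FakeClient:
    """Stand-in for the platform client, used only to exercise the helper."""

    def __init__(self):
        self.waited = []

    def add_historical_data(self, **kwargs):
        return "job-1"

    def wait_job_completion(self, job_id):
        self.waited.append(job_id)


fake = FakeClient()
returned_job_id = submit_and_wait(fake, fake.add_historical_data, task_id="task-1")
```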

Now we are ready to upload production data.
Production data can be uploaded asynchronously: each data category can be uploaded as soon as it is available, without waiting for the others.
This is especially true for *target* data, which usually become available with some delay.

In [None]:
job_id = client.add_production_data(
    task_id=task_id,
    inputs=production_0_inputs_data,
    target=production_0_target_data,
    predictions=[(model_id, production_0_prediction_data)]
)
print(f'Waiting for job {job_id}')
client.wait_job_completion(job_id=job_id)
print(f'Job {job_id} completed')
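
The cell above sends all three categories in one call. A staged upload, with the target arriving later, might look like the sketch below. It assumes that `add_production_data` accepts any subset of the `inputs`, `target` and `predictions` keyword arguments, which is not verified here against the SDK; `upload_in_stages` and `RecordingClient` are illustrative names, and the recording client only stands in for the real one.

```python
def upload_in_stages(client, task_id, inputs, predictions, target):
    """Upload inputs and predictions first, then the target as a second job."""
    first_job = client.add_production_data(
        task_id=task_id, inputs=inputs, predictions=predictions
    )
    client.wait_job_completion(job_id=first_job)

    # In practice this second call happens later, once labels are available
    second_job = client.add_production_data(task_id=task_id, target=target)
    client.wait_job_completion(job_id=second_job)
    return first_job, second_job


class RecordingClient:
    """Stand-in for the platform client; it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def add_production_data(self, task_id, inputs=None, target=None, predictions=None):
        sent = [name for name, value in
                (("inputs", inputs), ("target", target), ("predictions", predictions))
                if value is not None]
        self.calls.append(sent)
        return f"job-{len(self.calls)}"

    def wait_job_completion(self, job_id):
        pass


demo_client = RecordingClient()
job_ids = upload_in_stages(
    demo_client, "task-1",
    inputs="images", predictions=[("model-1", "preds")], target="labels",
)
print(demo_client.calls)
```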