# Text classification

This notebook shows how to use the ML cube Platform with text data.
We use a Hugging Face dataset and a pre-trained model for sentiment classification. We load half of the validation split and divide it into three parts: the first is used as reference data, and the other two as consecutive batches of production data.

In a real-world scenario, the training and validation datasets would be considered historical/reference data, while production data would come from the production environment after the model's deployment.

**With this example you will learn:**
- how to create a text classification task
- how to define a data schema
- how to create a model
- how to upload historical data
- how to set reference for the model
- how to upload production data

**Requirements**

These are the dependencies your Python environment needs in order to run this notebook. The `json` and `datetime` modules used below are part of the Python standard library and do not need to be installed.
```
ml3-platform-sdk>=0.0.22
transformers[torch]==4.41.2
torch==2.2.0
datasets==2.15.0
sentence-transformers==3.0.1
polars==0.20.31
numpy==1.26.4
tqdm==4.66.4
```
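
Before running the notebook, it can help to confirm which of the pinned packages are actually installed. The sketch below uses only the standard library (`importlib.metadata`); the `installed_versions` helper name is an illustration and not part of the ML cube Platform SDK.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version or None} for each distribution name."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


wanted = ["ml3-platform-sdk", "transformers", "torch", "datasets",
          "sentence-transformers", "polars", "numpy", "tqdm"]
for name, version in installed_versions(wanted).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```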


Imports

In [1]:
import tqdm  # imported as a module: the code below calls tqdm.tqdm(...)

In [2]:
from ml3_platform_sdk import enums as ml3_enums
from ml3_platform_sdk import models as ml3_models
from ml3_platform_sdk.client import ML3PlatformClient

from datasets import load_dataset, DatasetDict
from transformers import pipeline
from sentence_transformers import SentenceTransformer
from transformers.pipelines.pt_utils import KeyDataset
import polars as pl
import json
import datetime
import numpy as np
import tqdm

User Inputs

In [3]:
URL = 'https://api.platform.mlcube.com'
API_KEY = ""
PROJECT_ID = ''
model_name = 'mymodel'
model_version = 'v0.0.1'
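
Hard-coding an API key in a notebook makes it easy to leak. As one option, the values can be read from environment variables. The variable names `ML3_API_KEY` and `ML3_PROJECT_ID` below are assumptions chosen for this sketch, not names the SDK looks for, and `read_setting` is a hypothetical helper.

```python
import os


def read_setting(name: str, default: str = "") -> str:
    """Read a setting from the environment, falling back to a default."""
    return os.environ.get(name, default).strip()


# Assumed variable names; adapt them to your own environment.
API_KEY = read_setting("ML3_API_KEY")
PROJECT_ID = read_setting("ML3_PROJECT_ID")

if not API_KEY or not PROJECT_ID:
    print("Set ML3_API_KEY and ML3_PROJECT_ID before contacting the platform.")
```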

## Dataset, model and predictions
We download the dataset and the model using the Hugging Face API.
Once both are available, we run the model to get its predictions.

Load dataset

In [4]:
complete_dataset = load_dataset('cardiffnlp/tweet_eval', name='sentiment', split='validation[:50%]')

In [5]:
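def sample_dataset(dataset, reference_portion=0.5, first_production_portion=0.5, seed=42):
    """Split a dataset into a reference set and two production batches.

    reference_portion is the fraction of samples used as reference;
    first_production_portion is the fraction of the remaining samples
    assigned to the first production batch.
    """
    sampled_dataset = DatasetDict()

    # First split: reference vs. everything that will be production
    split = dataset.train_test_split(train_size=reference_portion, seed=seed)
    sampled_dataset['reference'] = split['train']

    # Second split: divide the production part into two batches
    split_2 = split['test'].train_test_split(train_size=first_production_portion, seed=seed)
    sampled_dataset['first_production'] = split_2['train']
    sampled_dataset['second_production'] = split_2['test']

    return sampled_dataset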

# Perform the sampling
dataset = sample_dataset(complete_dataset)

In [None]:
len(dataset['reference']['text']), len(dataset['first_production']['text']), len(dataset['second_production']['text'])
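
With the default arguments, the split is roughly 50% reference, 25% first production batch and 25% second production batch. The arithmetic can be sketched in plain Python; `split_sizes` is a hypothetical helper, and the exact counts produced by `train_test_split` may differ by one sample because of rounding.

```python
def split_sizes(n_samples, reference_portion=0.5, first_production_portion=0.5):
    """Approximate sizes of the three subsets produced by a two-step split."""
    n_reference = int(n_samples * reference_portion)
    n_production = n_samples - n_reference
    n_first = int(n_production * first_production_portion)
    n_second = n_production - n_first
    return n_reference, n_first, n_second


print(split_sizes(1000))  # (500, 250, 250)
```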

Load model pipeline

In [None]:
sentiment_pip = pipeline(model='cardiffnlp/twitter-roberta-base-sentiment-latest')

In [None]:
embedder = SentenceTransformer('distilroberta-base')

In [9]:
LABELS = [0, 1, 2]

## Create data objects for ML cube Platform
We use local data sources to upload data, so we need to create local files that will be shared with ML cube Platform.
Text data can be uploaded in JSON format as a list of objects, each containing three fields: `timestamp`, `sample-id` and `text`.
Text data can consist of the text sequences alone or, optionally, be accompanied by their embeddings.
Target and predictions, on the other hand, can be sent as tabular CSV files.

In ML cube Platform, data are uploaded separately for each category:
- **inputs:** TextData object in JSON format
- **target:** TabularData object in CSV format
- **predictions:** TabularData object in CSV format
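
As an illustration of the JSON layout described above, the following sketch builds two text records with the `timestamp`, `sample-id` and `text` fields and serializes them as a JSON list. The sample texts and the starting timestamp are made up for the example.

```python
import json

start = 1_700_000_000.0  # arbitrary Unix timestamp, in seconds
texts = ["I love this phone", "Worst service ever"]

records = [
    {
        "timestamp": start + 120 * i,  # one sample every 2 minutes
        "sample-id": f"sample_{i}",
        "text": text,
    }
    for i, text in enumerate(texts)
]

payload = json.dumps(records, indent=2)
print(payload)
```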


When dealing with unstructured data like text, there are three ways to send them:
1. By sending only embeddings, i.e., a numerical representation of each text sample as a vector, using `EmbeddingData`;
2. By sending only the raw text, using `TextData`. In this case ML cube Platform will create the numerical representation using internal encoders;
3. By sending the raw text along with the embeddings, using `TextData` with the `embedding_source` attribute. This more complete option has two benefits: it allows the use of your own embedder, which is usually tailored to the domain rather than a general-purpose one, and it enables the extraction of additional metrics from the text, providing more functionality in the web application.

In [10]:
def build_data_objects(
    dataset,
    model,
    embedder,
    model_name: str,
    model_version: str,
    starting_id: int,
    starting_timestamp: float,
    prefix: str,
) -> tuple[
    ml3_models.TextData,
    ml3_models.TabularData,
    ml3_models.TabularData,
    int,
    float
]:
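    """Builds data objects for inputs, target and predictions.

    Each sample needs an id and a timestamp, which we generate as
    incremental counters. Therefore, we need a starting point for both.

    Parameters
    ----------
    dataset: Hugging Face dataset with 'text' and 'label' columns
    model: sentiment classification pipeline used to compute predictions
    embedder: sentence embedder used to encode the text
    model_name: model name, used in the predictions column name
    model_version: model version, used in the predictions column name
    starting_id: sample id to start from
    starting_timestamp: timestamp to start from
    prefix: prefix of the files written to disk

    Returns
    -------
    inputs_data, target_data, predictions_data,
    next_starting_sample_id, next_starting_timestamp
    """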

    n_samples = len(dataset['text'])
    next_starting_sample_id = starting_id + n_samples
    sample_ids = [f'sample_{i}' for i in range(starting_id, next_starting_sample_id)]
    # consecutive samples are 2 minutes (120 seconds) apart
    next_starting_timestamp = starting_timestamp + n_samples * 120
    # multiply an integer range instead of using a float step in np.arange,
    # so that exactly n_samples timestamps are always generated
    timestamps = (starting_timestamp + 120 * np.arange(n_samples)).tolist()

    # Create the text samples file and save it
    print('Creating text file')
    text_samples = []
    for (i, sample) in enumerate(dataset['text']):
        text_samples.append({
            'text': sample,
            'sample-id': sample_ids[i],
            'timestamp': timestamps[i]
        })

    with open(f'{prefix}_text_samples.json', 'w') as f:
        json.dump(text_samples, f)
    
    # Create embedding dataframe
    print('Creating embedding file')
    embeddings = pl.DataFrame({
        'timestamp': timestamps,
        'sample-id': sample_ids,
        'embedding': embedder.encode(dataset['text']).tolist()
    })
    embeddings.write_parquet(f'{prefix}_embeddings.parquet')

    print('Creating target file')
    target = pl.DataFrame({
        'label': dataset['label'],
        'timestamp': timestamps,
        'sample-id': sample_ids,
    })
    target.write_csv(f'{prefix}_target.csv')

    print('Creating predictions file')
    predicted_labels = []
    for pred in tqdm.tqdm(model(KeyDataset(dataset, "text"))):
        # map the predicted label name to its integer id
        prediction = model.model.config.label2id[pred['label']]
        # fall back to the first label if the id is not an expected one
        if prediction not in LABELS:
            prediction = LABELS[0]
        predicted_labels.append(prediction)

    predictions = pl.DataFrame({
        'timestamp': timestamps,
        'sample-id': sample_ids,
        f'{model_name}@{model_version}': predicted_labels,
    })
    predictions.write_csv(f'{prefix}_predictions.csv')

    print('Creating inputs data')
    inputs_data = ml3_models.TextData(
        source=ml3_models.LocalDataSource(
            file_type=ml3_enums.FileType.JSON,
            is_folder=False,
            folder_type=None,
            file_path=f'{prefix}_text_samples.json'
        ),
        embedding_source=ml3_models.LocalDataSource(
            file_type=ml3_enums.FileType.PARQUET,
            is_folder=False,
            folder_type=None,
            file_path=f'{prefix}_embeddings.parquet',
        )
    )

    print('Creating target data')
    target_data = ml3_models.TabularData(
        source=ml3_models.LocalDataSource(
            file_type=ml3_enums.FileType.CSV,
            is_folder=False,
            folder_type=None,
            file_path=f'{prefix}_target.csv'
        )
    )

    print('Creating predictions data')
    predictions_data = ml3_models.TabularData(
        source=ml3_models.LocalDataSource(
            file_type=ml3_enums.FileType.CSV,
            is_folder=False,
            folder_type=None,
            file_path=f'{prefix}_predictions.csv'
        )
    )
    
    return (
        inputs_data,
        target_data,
        predictions_data,
        next_starting_sample_id,
        next_starting_timestamp
    )
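
The id and timestamp bookkeeping in `build_data_objects` can be checked in isolation. The sketch below reproduces only the counter logic (no model, no files) with a hypothetical `make_ids_and_timestamps` helper, showing that consecutive batches receive contiguous, non-overlapping ids and timestamps.

```python
def make_ids_and_timestamps(n_samples, starting_id, starting_timestamp, step=120):
    """Return ids, timestamps and the starting points for the next batch."""
    sample_ids = [f"sample_{i}" for i in range(starting_id, starting_id + n_samples)]
    timestamps = [starting_timestamp + step * i for i in range(n_samples)]
    next_id = starting_id + n_samples
    next_timestamp = starting_timestamp + step * n_samples
    return sample_ids, timestamps, next_id, next_timestamp


ids_a, ts_a, next_id, next_ts = make_ids_and_timestamps(3, 0, 1000.0)
ids_b, ts_b, _, _ = make_ids_and_timestamps(2, next_id, next_ts)

print(ids_a, ts_a)  # ['sample_0', 'sample_1', 'sample_2'] [1000.0, 1120.0, 1240.0]
print(ids_b, ts_b)  # ['sample_3', 'sample_4'] [1360.0, 1480.0]
```

The last timestamp of a batch is the next starting timestamp minus one step, which is why the notebook computes `training_end_timestamp = starting_timestamp - 120` after building the reference data.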


In [None]:
training_initial_sample_id = 0
training_initial_timestamp = datetime.datetime.now().timestamp()

(
    training_inputs_data,
    training_target_data,
    _,
    starting_id,
    starting_timestamp,
) = build_data_objects(
    dataset['reference'],
    sentiment_pip,
    embedder,
    model_name,
    model_version,
    training_initial_sample_id,
    training_initial_timestamp,
    'reference'
)
training_end_timestamp = starting_timestamp - 120

In [None]:
(
    production_0_inputs_data,
    production_0_target_data,
    production_0_prediction_data,
    starting_id,
    starting_timestamp,
) = build_data_objects(
    dataset['first_production'],
    sentiment_pip,
    embedder,
    model_name,
    model_version,
    starting_id,
    starting_timestamp,
    'prod_0'
)

In [None]:
(
    production_1_inputs_data,
    production_1_target_data,
    production_1_prediction_data,
    starting_id,
    starting_timestamp,
) = build_data_objects(
    dataset['second_production'],
    sentiment_pip,
    embedder,
    model_name,
    model_version,
    starting_id,
    starting_timestamp,
    'prod_1'
)
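
Before uploading, it may be worth checking that the generated files agree on their sample ids. The sketch below assumes the file layout written by `build_data_objects` (a JSON list with a `sample-id` field and a CSV with a `sample-id` column) and uses only the standard library; `ids_match` is a hypothetical helper, not an SDK function.

```python
import csv
import json


def ids_match(json_path, csv_path, id_field='sample-id'):
    """Return True if both files list the same sample ids in the same order."""
    with open(json_path) as f:
        json_ids = [record[id_field] for record in json.load(f)]
    with open(csv_path, newline='') as f:
        csv_ids = [row[id_field] for row in csv.DictReader(f)]
    return json_ids == csv_ids
```

For example, `ids_match('prod_0_text_samples.json', 'prod_0_target.csv')` compares the inputs and the target of the first production batch.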

## Create data schema

The data schema specifies the columns present in the task, along with their names, roles and data types.
A data schema must contain:
- *sample id*, column that is used to uniquely identify each sample
- *timestamp*, column that is used to order samples
- *input*, column that specifies the nature of the input. In this case, it is a string, as we are dealing with text data
- *input additional embedding*, optional column for the embedding of the text data
- *target*, column that specifies the nature of the target. In this case, it is categorical with three possible values

The prediction column must not be specified, because it is added automatically when the model is created, with a name of the form `MODEL_NAME@MODEL_VERSION`.
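
The `MODEL_NAME@MODEL_VERSION` pattern is also the column name written to the predictions CSV earlier in this notebook. A small helper (hypothetical, not part of the SDK) can keep the two in sync:

```python
def prediction_column(model_name: str, model_version: str) -> str:
    """Build the prediction column name used in the predictions file."""
    return f"{model_name}@{model_version}"


print(prediction_column("mymodel", "v0.0.1"))  # mymodel@v0.0.1
```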

In [14]:
data_schema = ml3_models.DataSchema(
    columns=[
        ml3_models.ColumnInfo(
            name='timestamp',
            role=ml3_enums.ColumnRole.TIME_ID,
            is_nullable=False,
            data_type=ml3_enums.DataType.FLOAT,
        ),
        ml3_models.ColumnInfo(
            name='sample-id',
            role=ml3_enums.ColumnRole.ID,
            is_nullable=False,
            data_type=ml3_enums.DataType.STRING,
        ),
        ml3_models.ColumnInfo(
            name='text',
            role=ml3_enums.ColumnRole.INPUT,
            is_nullable=False,
            data_type=ml3_enums.DataType.STRING,
        ),
        ml3_models.ColumnInfo(
            name="embedding",
            role=ml3_enums.ColumnRole.INPUT_ADDITIONAL_EMBEDDING,
            is_nullable=False,
            data_type=ml3_enums.DataType.ARRAY_1,
            dims=(768,),
        ),
        ml3_models.ColumnInfo(
            name='label',
            role=ml3_enums.ColumnRole.TARGET,
            is_nullable=False,
            data_type=ml3_enums.DataType.CATEGORICAL,
            possible_values=LABELS
        ),
    ]
)
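
The `dims` value declared for the embedding column has to match the length of the vectors written to the Parquet file. Here is a minimal check, assuming the embeddings are available as a list of lists (as produced by `.tolist()` when the embedding file is built); `check_embedding_dims` is an illustrative helper, and the vectors below are placeholders.

```python
def check_embedding_dims(embeddings, expected_dim):
    """Return the indices of embeddings whose length differs from expected_dim."""
    return [i for i, vector in enumerate(embeddings) if len(vector) != expected_dim]


vectors = [[0.0] * 768, [0.0] * 768, [0.0] * 512]
print(check_embedding_dims(vectors, 768))  # [2]
```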

## Interaction with ML cube Platform
To start, we create an instance of the ML cube Platform client using the provided API key.
Then, we create a task, a data schema and a model, and finally we upload data.

In [None]:
client = ML3PlatformClient(URL, API_KEY)

When creating a task, we need to specify the following information:
- **task_type:** artificial intelligence task type. In this case it is a multiclass classification.
- **data_structure:** the type of input data. In this case it is Text.
- **optional_target:** whether the target can be missing for production data.
    We assume that reference data, being the training data, always have the target.
    However, the target may not be available for other historical data or for production data. When this option is enabled, ML cube Platform does not check for its presence in the data sent to the platform.
- **text_language:** the language of the text data. In this case it is English.

In [16]:
task_id = client.create_task(
    project_id=PROJECT_ID,
    name='task_name',
    tags=["tag_1", "tag_2"],
    task_type=ml3_enums.TaskType.CLASSIFICATION_MULTICLASS,
    data_structure=ml3_enums.DataStructure.TEXT,
    optional_target=False,
    text_language=ml3_enums.TextLanguage.ENGLISH
)

In [17]:
client.add_data_schema(task_id=task_id, data_schema=data_schema)

A model is uniquely identified by its `name` and `version`.

In [18]:
model_id: str = client.create_model(
    task_id=task_id,
    name=model_name,
    version=model_version,
    metric_name=ml3_enums.ModelMetricName.ACCURACY,
    preferred_suggestion_type=ml3_enums.SuggestionType.SAMPLE_WEIGHTS,
    with_probabilistic_output=False
)

Training data are uploaded as historical data, i.e., any data that do not come from production.
We then set them as the reference data of our model in order to set up the detection algorithms.

Note that it is possible to add other historical data at any time.

In [None]:
job_id = client.add_historical_data(
    task_id=task_id,
    inputs=training_inputs_data,
    target=training_target_data
)
print(f'Waiting for job {job_id}')
client.wait_job_completion(job_id=job_id)
print(f'Job {job_id} completed')

In [None]:
job_id = client.set_model_reference(
    model_id,
    from_timestamp=training_initial_timestamp,
    to_timestamp=training_end_timestamp,
)
print(f'Waiting for job {job_id}')
client.wait_job_completion(job_id=job_id)
print(f'Job {job_id} completed')

Now we are ready to upload production data.
Notice that production data can be uploaded asynchronously: each data category can be uploaded as soon as it is available, without waiting for the others.
This is especially useful for *target* data, which usually become available with some delay.
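
To make the idea concrete, here is a plain-Python illustration of why the `sample-id` matters for asynchronous uploads: predictions arrive first and targets are matched to them later by id. This is only a conceptual sketch with made-up values; it does not describe how the platform stores or joins data internally.

```python
# Predictions are available immediately after inference.
predictions = {"sample_0": 2, "sample_1": 0, "sample_2": 1}

# Targets arrive later, possibly only for some samples.
delayed_targets = {"sample_0": 2, "sample_2": 0}

# Match on sample-id; samples without a target yet are skipped.
matched = {
    sample_id: (predictions[sample_id], target)
    for sample_id, target in delayed_targets.items()
    if sample_id in predictions
}

accuracy = sum(pred == target for pred, target in matched.values()) / len(matched)
print(matched)   # {'sample_0': (2, 2), 'sample_2': (1, 0)}
print(accuracy)  # 0.5
```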

In [None]:
job_id = client.add_production_data(
    task_id=task_id,
    inputs=production_0_inputs_data,
    target=production_0_target_data,
    predictions=[(model_id, production_0_prediction_data)]
)
print(f'Waiting for job {job_id}')
client.wait_job_completion(job_id=job_id)
print(f'Job {job_id} completed')

Send production data asynchronously: first *inputs* and *predictions*, then *target*.

In [None]:
job_id = client.add_production_data(
    task_id=task_id,
    inputs=production_1_inputs_data,
    target=None,
    predictions=[(model_id, production_1_prediction_data)]
)
print(f'Waiting for job {job_id}')
client.wait_job_completion(job_id=job_id)
print(f'Job {job_id} completed')

In [None]:
job_id = client.add_production_data(
    task_id=task_id,
    inputs=None,
    target=production_1_target_data,
    predictions=None
)
print(f'Waiting for job {job_id}')
client.wait_job_completion(job_id=job_id)
print(f'Job {job_id} completed')
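
The submit, print, wait, print pattern is repeated for every upload above, so it could be factored into a small helper. The sketch below relies only on `wait_job_completion(job_id=...)`, which is used throughout this notebook; the `run_job` name and the `FakeClient` stand-in are hypothetical and exist only so the example runs without a platform connection.

```python
def run_job(client, submit):
    """Call `submit()` to start a job, then block until the job completes."""
    job_id = submit()
    print(f'Waiting for job {job_id}')
    client.wait_job_completion(job_id=job_id)
    print(f'Job {job_id} completed')
    return job_id


class FakeClient:
    """Stand-in for ML3PlatformClient, used only for this illustration."""

    def __init__(self):
        self.waited = []

    def add_production_data(self, **kwargs):
        # A real client would upload data and return the id of the created job.
        return 'job-123'

    def wait_job_completion(self, job_id):
        # A real client would block here; the fake one just records the call.
        self.waited.append(job_id)


fake = FakeClient()
finished = run_job(fake, lambda: fake.add_production_data(task_id='task', inputs=None))
```

With the real client, the same helper would be called with a lambda wrapping `client.add_historical_data(...)`, `client.set_model_reference(...)` or `client.add_production_data(...)`.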