# Getting Started: Fine-tuning a Model for a Multi-Label Classification Task

**Learning Objectives** - By the end of this quickstart tutorial, you'll know how to create a pipeline that fine-tunes a model for a multi-label classification task in Azure Machine Learning studio.

This tutorial covers:

- Connect to workspace
- Set up compute resources in Azure Machine Learning studio via the SDK
- Create the arguments passed to each component for fine-tuning a model for multi-label classification
- Build an end-to-end pipeline that fine-tunes a model for multi-label classification:
    - prepares data for fine-tuning
    - fine-tunes the model
    - registers the model
- Submit the pipeline

##### Dependencies installation
If you are running this notebook in Azure Machine Learning studio, or locally for the first time, install the following packages before you start:

In [None]:
! pip install azure-ai-ml==1.0.0
! pip install azure-identity
! pip install datasets==2.3.2

### Connect to Azure Machine Learning workspace

Before we dive into the code, you'll need to connect to your workspace. The workspace is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning.

We use `DefaultAzureCredential` to get access to the workspace. `DefaultAzureCredential` should be capable of handling most scenarios. To learn more about the other available credentials, see the [set up authentication doc](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-setup-authentication?tabs=sdk) and the [azure-identity reference doc](https://learn.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python).

Replace `<AML_WORKSPACE_NAME>`, `<RESOURCE_GROUP>` and `<SUBSCRIPTION_ID>` with their respective values in the cell below.

In [None]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential


subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace_name = "<AML_WORKSPACE_NAME>"
experiment_name = "AzureML-Train-Finetune-Samples"      # can rename to any valid name

try:
    credential = DefaultAzureCredential()
    # Check that the credential can obtain a token.
    credential.get_token("https://management.azure.com/.default")
except Exception:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential does not work.
    credential = InteractiveBrowserCredential()

workspace_ml_client = MLClient(
    credential,
    subscription_id,
    resource_group,
    workspace_name,
)

registry_ml_client = MLClient(
    credential,
    subscription_id,
    resource_group,
    registry_name="azureml-preview",
)

preprocess_cluster_name = None
finetune_cluster_name = None
managed_identity_cluster_name = None
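
Before creating any resources, it can help to confirm that the `<...>` placeholders were actually replaced. The sketch below is a hypothetical helper (`find_unfilled_placeholders` is not part of the Azure SDK) that flags any setting still holding a template token. The subscription ID shown is a made-up value.

```python
import re

# Matches values that still look like an unfilled template token, e.g. "<SUBSCRIPTION_ID>".
_PLACEHOLDER = re.compile(r"^<[^<>]+>$")


def find_unfilled_placeholders(settings):
    """Return the sorted names of settings whose value is still a "<...>" placeholder."""
    return sorted(
        name
        for name, value in settings.items()
        if isinstance(value, str) and _PLACEHOLDER.match(value.strip())
    )


# Example values only; the subscription ID below is a made-up GUID.
settings = {
    "subscription_id": "00000000-0000-0000-0000-000000000000",
    "resource_group": "<RESOURCE_GROUP>",
    "workspace_name": "<AML_WORKSPACE_NAME>",
}
unfilled = find_unfilled_placeholders(settings)
print(unfilled)  # ['resource_group', 'workspace_name']
```

In the notebook, the same check could be applied to `subscription_id`, `resource_group` and `workspace_name` before constructing `MLClient`.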

### Create compute

To fine-tune a model in Azure Machine Learning studio, you first need to create compute resources. **Creating a compute resource takes 3-4 minutes.**

For additional references, see [Azure Machine Learning in a Day](https://github.com/Azure/azureml-examples/blob/main/tutorials/azureml-in-a-day/azureml-in-a-day.ipynb). 

##### Create CPU compute for the model selection and data preprocessing components

In [None]:
from azure.ai.ml.entities import AmlCompute

preprocess_cluster_name = "sample-preprocess-cluster"
new_compute = AmlCompute(
    name=preprocess_cluster_name,
    size="Standard_D12",
)
print("Creating/updating compute")
poller = workspace_ml_client.compute.begin_create_or_update(new_compute)
poller.wait()

##### Create GPU compute for fine-tuning

The recommended GPU compute SKUs are the [NCv3-series](https://learn.microsoft.com/en-us/azure/virtual-machines/ncv3-series) and the [NDv2-series](https://learn.microsoft.com/en-us/azure/virtual-machines/ndv2-series).

In [None]:
finetune_cluster_name = "sample-finetune-cluster"

new_compute = AmlCompute(
    name=finetune_cluster_name,
    size="Standard_ND40rs_v2",
    max_instances=1,  # Increase this for multi-node training
)

print("Creating/updating compute")
poller = workspace_ml_client.compute.begin_create_or_update(new_compute)
poller.wait()
gpus_per_node = 8 # This varies with the compute SKU selected for finetune component
num_nodes = 1 # For multi node training set it to > 1
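
Since `gpus_per_node` depends on the SKU, a small lookup can keep the two in sync. The sketch below assumes one training process per GPU, which matches how the pipeline later passes `gpus_per_node` as the per-instance process count. The table `GPUS_PER_SKU` and the helper `total_processes` are illustrative names, and the GPU counts should be verified against the SKU pages linked above.

```python
# GPU counts per node for sizes in the NCv3 and NDv2 families.
# Verify against the Azure VM size documentation before relying on them.
GPUS_PER_SKU = {
    "Standard_NC6s_v3": 1,
    "Standard_NC12s_v3": 2,
    "Standard_NC24s_v3": 4,
    "Standard_NC24rs_v3": 4,
    "Standard_ND40rs_v2": 8,
}


def total_processes(sku, num_nodes):
    """Total training processes, assuming one process per GPU on every node."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return GPUS_PER_SKU[sku] * num_nodes


print(total_processes("Standard_ND40rs_v2", 1))  # 8
print(total_processes("Standard_ND40rs_v2", 2))  # 16
```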

##### Create managed identity compute for registering the fine-tuned model (Optional)

> **Note**: The managed identity referenced by `managed_identity_resource_id` must have Contributor access scoped to the resource group.

In [None]:
from azure.ai.ml.entities import IdentityConfiguration, ManagedIdentityConfiguration

managed_identity_cluster_name = "sample-managed-id-cluster"
managed_identity_resource_id = "<REPLACE WITH YOUR MANAGED IDENTITY RESOURCE ID>"

try:
    # Note: create a separate cluster only if necessary; otherwise use the preprocess compute.
    uai = ManagedIdentityConfiguration(resource_id=managed_identity_resource_id)
    identity_config = IdentityConfiguration(type="UserAssigned", user_assigned_identities=[uai])

    new_compute = AmlCompute(
        name=managed_identity_cluster_name,
        size="Standard_D12",
        identity=identity_config,
    )

    print("Creating/updating compute")
    poller = workspace_ml_client.compute.begin_create_or_update(new_compute)
    poller.wait()
except Exception as e:
    print(f"Failed to create managed identity compute: {e}")
    managed_identity_cluster_name = None

### Create arguments to be passed to each component

The detailed arguments for each component can be found at:
- Model Selection - [model_selector_component.md](../../docs/component_docs/model_selector_component.md)
- Data Pre Processing - [preprocess_component.md](../../docs/component_docs/preprocess_component.md)
- Fine Tuning - [finetune_component.md](../../docs/component_docs/finetune_component.md)
- Evaluate Model - [evaluate_model.md](../../docs/component_docs/evaluate_model.md)
- Register - [register_component.md](../../docs/component_docs/register_component.md)

In [None]:
model_selection_args = {
  "huggingface_id": "bert-base-uncased"
}

preprocess_args = {
  "sentence1_key": "sentence",
  "sentence2_key": None,
  "label_key": "labels",
}

finetune_args = {
  "lora_alpha": 128,
  "lora_r": 8,
  "lora_dropout": 0.0,
  "epochs": 10,
  "optimizer": "adamw_torch",
  "learning_rate": 2e-5,
  "train_batch_size": 1,
  "valid_batch_size": 1,
  "apply_deepspeed": "false",
  "apply_ort": "false",
  "apply_lora": "false",
  "merge_lora_weights": "true",
  "auto_find_batch_size": "false",
  "save_as_mlflow_model": "true",
  "evaluation_strategy": "steps",
  "evaluation_steps_interval": 0.25,
  "logging_strategy": "steps",
  "logging_steps": 100,
}

model_evaluation_args = {
    "task": "text-classification-multilabel",
    "test_data_label_column_name": "labels",
    "test_data_input_column_names": "sentence",
    "device": "gpu"
}

model_registration_args = {
  "name_for_registered_model": "sample_model_multi_label_classification",
  "model_type": "mlflow_model",
  "registry_name": None,
}
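
The cell above passes on/off options such as `apply_lora` as the strings `"true"` and `"false"` rather than Python booleans. Whether the components also accept real booleans is not stated here, so the sketch below simply enforces the convention used above. `check_string_flags` and `BOOLEAN_FLAGS` are hypothetical names introduced for illustration.

```python
# Flags that the arguments cell passes as the strings "true" / "false".
BOOLEAN_FLAGS = (
    "apply_deepspeed",
    "apply_ort",
    "apply_lora",
    "merge_lora_weights",
    "auto_find_batch_size",
    "save_as_mlflow_model",
)


def check_string_flags(args, flag_names=BOOLEAN_FLAGS):
    """Return a message for every listed flag that is not the string 'true' or 'false'."""
    problems = []
    for name in flag_names:
        if name not in args:
            continue  # flags that are absent are not checked
        value = args[name]
        if value not in ("true", "false"):
            problems.append(f"{name}={value!r} (expected the string 'true' or 'false')")
    return problems


# A Python boolean slipped in for apply_lora; apply_ort follows the convention.
example_args = {"apply_ort": "false", "apply_lora": False, "epochs": 10}
problems = check_string_flags(example_args)
print(problems)
```

Calling `check_string_flags(finetune_args)` on the dictionary defined above should return an empty list.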

### Prepare dataset for fine-tuning

We can use a standard Hugging Face dataset. This sample uses the dataset in the local `datasets` directory; change the path to point to your own dataset.

The dataset directory must contain the following files:
- [train.jsonl](datasets/train.jsonl)
- [validation.jsonl](datasets/validation.jsonl)
- [test.jsonl](datasets/test.jsonl)

Each file must be in JSON Lines (`jsonl`) format, with the following keys and values:
- sentence1 key (string, required)
- sentence2 key (string, optional)
- label key (integer/string, required)

Example line: `{"sentence1": "This is a sample sentence1 text", "sentence2": "This is a sample sentence2 text", "label": "['sample_label_x', 'sample_label_y']"}`

You can also check out [train.jsonl](datasets/train.jsonl), [validation.jsonl](datasets/validation.jsonl) and [test.jsonl](datasets/test.jsonl) for the sample dataset format.

A sample of the dataset is shown below:
```
{"text":"Thank you friend","labels":[15],"id":"eeqd04y","labels_str":"['15']"}
{"text":"Happy to be able to help.","labels":[17],"id":"efeu6uo","labels_str":"['17']"}
{"text":"that is what retardation looks like","labels":[27],"id":"eeb9aft","labels_str":"['27']"}
{"text":"I miss them being alive","labels":[16,25],"id":"ee8mzwa","labels_str":"['16', '25']"}
{"text":"Super, thanks","labels":[15],"id":"ef462jc","labels_str":"['15']"}
```
Additional columns, such as `id` and `labels`, will be excluded.

For additional references see Sequence Classification Inputs in [preprocess_component.md](../../docs/component_docs/preprocess_component.md)
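
In the samples above, multi-label values such as `labels_str` are stored as a string that contains a Python-style list. A minimal sketch of turning such a string back into a list with the standard library is shown below; `parse_label_string` is a hypothetical helper, not something the preprocessing component exposes.

```python
import ast
import json


def parse_label_string(value):
    """Turn "['16', '25']" into ['16', '25'], rejecting anything that is not a list literal."""
    parsed = ast.literal_eval(value)  # safely evaluates literals only, never arbitrary code
    if not isinstance(parsed, list):
        raise ValueError(f"expected a list literal, got {type(parsed).__name__}")
    return [str(item) for item in parsed]


# One record taken from the sample schema above.
line = """{"text":"I miss them being alive","labels":[16,25],"id":"ee8mzwa","labels_str":"['16', '25']"}"""
record = json.loads(line)
labels = parse_label_string(record["labels_str"])
print(labels)  # ['16', '25']
```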

In [None]:
import os
dataset_dir = "./datasets/"
dataset_dir = os.path.abspath(dataset_dir)
train_file = os.path.join(dataset_dir, "train.jsonl")
validation_file = os.path.join(dataset_dir, "validation.jsonl")
test_file = os.path.join(dataset_dir, "test.jsonl")
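
A quick local check of each file can catch format problems before the pipeline spends compute on them. The sketch below is a hypothetical validator (`check_jsonl` is an illustrative name) that reports lines which are not valid JSON objects or lack the required keys. It runs on a temporary file so that it is self-contained, and the key names `sentence` and `labels` are taken from `preprocess_args` above.

```python
import json
import os
import tempfile


def check_jsonl(path, required_keys):
    """Return (line_number, message) pairs for lines that are not valid records."""
    problems = []
    with open(path, encoding="utf-8") as handle:
        for number, raw in enumerate(handle, start=1):
            if not raw.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(raw)
            except json.JSONDecodeError as error:
                problems.append((number, f"invalid JSON: {error.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((number, "not a JSON object"))
                continue
            missing = [key for key in required_keys if key not in record]
            if missing:
                problems.append((number, f"missing keys: {missing}"))
    return problems


# Demo on a temporary file: one good line, one missing a key, one that is not JSON.
with tempfile.TemporaryDirectory() as folder:
    demo_file = os.path.join(folder, "train.jsonl")
    with open(demo_file, "w", encoding="utf-8") as handle:
        handle.write(json.dumps({"sentence": "Thank you friend", "labels": "['15']"}) + "\n")
        handle.write(json.dumps({"sentence": "Super, thanks"}) + "\n")
        handle.write("not json\n")
    problems = check_jsonl(demo_file, ["sentence", "labels"])

print(problems)
```

In the notebook, the same function could be called on `train_file`, `validation_file` and `test_file`; an empty list means no problems were found.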

### Create Pipeline job

Let's create an end-to-end pipeline job that we can submit to Azure Machine Learning studio.

Create the pipeline from the multi-label classification components for fine-tuning a model. Optionally, register the fine-tuned model to the workspace, where it will appear under `Models`.

> **Note**: The pipeline job registers the fine-tuned model only when `managed_identity_cluster_name` is set (that is, not `None`).

##### Create components

In [None]:
model_selector_func = registry_ml_client.components.get(name="textclassificationmultilabel_modelselection", label="latest")
preprocess_func = registry_ml_client.components.get(name="textclassificationmultilabel_datapreprocessing", label="latest")
finetune_func = registry_ml_client.components.get(name="textclassificationmultilabel_finetuning", label="latest")
registration_func = registry_ml_client.components.get(name="register_model", label="latest")
model_evaluation_func = registry_ml_client.components.get(name="evaluate_model", label="latest")

##### Utility function to create pipeline

In [None]:
from azure.ai.ml.dsl import pipeline
from azure.ai.ml.entities import CommandComponent, Job, Component
from azure.ai.ml import PyTorchDistribution, Input

@pipeline()
def create_pipeline():
    """Create pipeline."""

    model_selector: CommandComponent = model_selector_func(**model_selection_args)
    model_selector.code = None
    model_selector.compute = preprocess_cluster_name

    preprocess: CommandComponent = preprocess_func(
        model_path=model_selector.outputs["output_dir"],
        train_file_path=Input(type="uri_file", path=train_file),
        valid_file_path=Input(type="uri_file", path=validation_file),
        **preprocess_args,
    )
    preprocess.code = None  # dirty workaround for dpv2
    preprocess.compute = preprocess_cluster_name

    finetune: CommandComponent = finetune_func(
        model_path=model_selector.outputs["output_dir"],
        dataset_path=preprocess.outputs["output_dir"],
        **finetune_args,
    )
    finetune.code = None
    finetune.compute = finetune_cluster_name
    finetune.distribution = PyTorchDistribution(process_count_per_instance=gpus_per_node)
    finetune.resources.instance_count = num_nodes

    model_evaluation = model_evaluation_func(
        test_data=Input(type="uri_file", path=validation_file),
        mlflow_model=finetune.outputs["mlflow_model_folder"],
        **model_evaluation_args,
    )
    
    model_evaluation.code = None
    model_evaluation.compute = finetune_cluster_name

    if managed_identity_cluster_name is not None:
        registration: CommandComponent = registration_func(
            model_path=finetune.outputs["mlflow_model_folder"],
            **model_registration_args,
        )
        registration.code = None
        registration.compute = managed_identity_cluster_name

##### Create and submit sample pipeline

In [None]:
pipeline_object = create_pipeline()
# print(pipeline_object)
pipeline_object.display_name = model_selection_args["huggingface_id"] + "_pipeline_run_" + "multilabel"

print("Submitting pipeline")

pipeline_run = workspace_ml_client.jobs.create_or_update(pipeline_object, experiment_name=experiment_name)

print(f"Pipeline created. URL: {pipeline_run.studio_url}")