For this workshop, you need:

* An Azure Machine Learning workspace. 
* The Azure Machine Learning Python SDK v2 installed. 

To install the SDK, you can do either of the following:

* Create a compute instance, which already has the latest Azure ML Python SDK installed and is pre-configured for ML workflows.

* Install the Azure ML Python SDK v2 yourself with the following commands:

```bash
conda activate <virtual_env_name>
pip install azure-ai-ml==1.0.0
```

If you're using a virtual environment, make sure to install the SDK inside it.

The virtual environment for SDK v2 on Azure Notebooks is called `azureml_py310_sdkv2`.
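
To confirm that the SDK landed in the environment you are actually running, you can ask Python which version of the `azure-ai-ml` distribution it sees. The snippet below is a minimal sketch that uses only the standard library; the helper name `installed_version` is ours, not part of the SDK.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Report what the currently active interpreter can see.
sdk_version = installed_version("azure-ai-ml")
if sdk_version is None:
    print("azure-ai-ml is not installed in this environment")
else:
    print(f"azure-ai-ml {sdk_version} is installed")
```

If this reports that the package is missing, the kernel is most likely running in a different environment from the one where you ran `pip install`.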


## Connect to ML Client

To connect to a workspace, you need to provide a subscription, resource group and workspace name. These details are used in the `MLClient` from `azure.ai.ml` to get a handle to the required Azure Machine Learning workspace.

In the following example, the default Azure authentication is used together with the default workspace configuration, which is read from any `config.json` file you might have copied into the folder structure. If no `config.json` is found, you need to provide the `subscription_id`, `resource_group`, and `workspace` manually when creating `MLClient`.

```python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

credential = DefaultAzureCredential()
ml_client = None
try:
    ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    # Enter details of your AzureML workspace
    subscription_id = "<SUBSCRIPTION_ID>"
    resource_group = "<RESOURCE_GROUP>"
    workspace = "<AZUREML_WORKSPACE_NAME>"
    ml_client = MLClient(credential, subscription_id, resource_group, workspace)
```
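
For reference, `config.json` is a small JSON file holding the three workspace coordinates under the keys `subscription_id`, `resource_group`, and `workspace_name`. The sketch below writes such a file and then looks it up from a nested folder. The helpers `write_config`, `find_config`, and `load_config` are hypothetical names, and the upward search is a simplified illustration of the idea, not the SDK's actual lookup code.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def write_config(directory: Path, subscription_id: str,
                 resource_group: str, workspace_name: str) -> Path:
    """Write a config.json containing the three workspace coordinates."""
    config_path = directory / "config.json"
    config_path.write_text(json.dumps({
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }, indent=2))
    return config_path


def find_config(start: Path) -> Optional[Path]:
    """Walk from `start` towards the filesystem root; return the first config.json."""
    for directory in (start, *start.parents):
        candidate = directory / "config.json"
        if candidate.is_file():
            return candidate
    return None


def load_config(path: Path) -> dict:
    """Load config.json and fail early if a required key is missing."""
    config = json.loads(path.read_text())
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config.json is missing keys: {missing}")
    return config


# Usage: put config.json at a project root, then find it from a nested folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    write_config(root, "<SUBSCRIPTION_ID>", "<RESOURCE_GROUP>", "<AZUREML_WORKSPACE_NAME>")
    nested = root / "ml-pipelines" / "sdk"
    nested.mkdir(parents=True)
    found = find_config(nested)
    config = load_config(found)
    print(found.name, sorted(config))
```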


In [1]:
# import required libraries
from azure.ai.ml import MLClient, command, Input, Output, load_component
from azure.identity import DefaultAzureCredential
from azure.ai.ml.entities import Data, Environment
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.ai.ml.dsl import pipeline

In [6]:
credential = DefaultAzureCredential()
ml_client = None
try:
    ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    # Enter details of your AML workspace
    subscription_id = "<SUBSCRIPTION_ID>"
    resource_group = "<RESOURCE_GROUP>"
    workspace = "<AML_WORKSPACE_NAME>"
    ml_client = MLClient(credential, subscription_id, resource_group, workspace)

Found the config file in: /config.json


# Model Training

## 1. (Optional) Create Managed Compute

A compute is a designated compute resource where you run your job or host your endpoint. Azure Machine Learning supports the following types of compute:

- **Compute instance** - a fully configured and managed development environment in the cloud. You can use the instance as a training or inference compute for development and testing. It's similar to a virtual machine in the cloud.

- **Compute cluster** - a managed-compute infrastructure that allows you to easily create a cluster of CPU or GPU compute nodes in the cloud.

- **Inference cluster** - used to deploy trained machine learning models to Azure Kubernetes Service. You can create an Azure Kubernetes Service (AKS) cluster from your Azure ML workspace, or attach an existing AKS cluster.

- **Attached compute** - You can attach your own compute resources to your workspace and use them for training and inference.

You can create a compute using the Studio, the CLI, or the SDK.

<hr>

We can create a **compute instance** with CLI v2 or SDK v2 using the following syntax:

<center>
<img src="../../imgs/create_compute_instance.png" width = "700px" alt="Create Compute Instance cli vs sdk">
</center>


<hr>

We can create a **compute cluster** with CLI v2 or SDK v2 using the following syntax:

<center>
<img src="../../imgs/create_compute_cluster.png" width = "700px" alt="Create Compute Cluster cli vs sdk">
</center>


Let's create a managed compute cluster for the training workload.

In [7]:
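from azure.ai.ml.entities import AmlCompute

try:
    ml_client.compute.get(name="cpu-cluster")
    print("Compute already exists")
except Exception:
    # The lookup failed, which we take to mean the cluster does not exist yet.
    print("Compute not found; proceeding to create")

    my_cluster = AmlCompute(
        name="cpu-cluster",
        type="amlcompute",
        size="STANDARD_DS3_V2",
        min_instances=0,  # scale down to zero nodes when idle
        max_instances=4,
    )
    # begin_create_or_update starts a long-running operation and returns a poller.
    ml_client.compute.begin_create_or_update(my_cluster)

Compute not found; proceeding to create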
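
The cell above starts cluster creation but does not wait for it to finish. If you want to block until the cluster is ready, you can call `.result()` on the returned poller. The helper below is a minimal sketch of that get-or-create pattern. The name `get_or_create_compute` and its parameters are our own, and it is written against any object that exposes `compute.get` and `compute.begin_create_or_update`, so it can be tried without a workspace.

```python
from typing import Any, Callable, Tuple, Type


def get_or_create_compute(
    ml_client: Any,
    name: str,
    build_compute: Callable[[], Any],
    not_found_errors: Tuple[Type[BaseException], ...] = (Exception,),
) -> Any:
    """Return the named compute, creating it (and waiting) if the lookup fails.

    `build_compute` is only called when the compute has to be created.
    `not_found_errors` defaults to a broad catch, mirroring the cell above;
    narrowing it to the SDK's "not found" exception is left as an assumption.
    """
    try:
        return ml_client.compute.get(name=name)
    except not_found_errors:
        poller = ml_client.compute.begin_create_or_update(build_compute())
        # result() blocks until the long-running operation has completed.
        return poller.result()


# Intended usage with the objects from this workshop (not executed here):
# cluster = get_or_create_compute(
#     ml_client,
#     "cpu-cluster",
#     lambda: AmlCompute(name="cpu-cluster", size="STANDARD_DS3_V2",
#                        min_instances=0, max_instances=4),
# )
```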


## 2. Register Data Asset

**Datastore** - Azure Machine Learning datastores securely store the connection information for your data storage on Azure, so you don't have to code it in your scripts.

An Azure Machine Learning datastore is a **reference** to an **existing** storage account on Azure. The benefits of creating and using a datastore are:
* A common and easy-to-use API to interact with different storage types.
* Easier to discover useful datastores when working as a team.
* When using credential-based access (service principal/SAS/key), the connection information is secured so you don't have to code it in your scripts.

Supported Data Resources: 

* Azure Storage blob container
* Azure Storage file share
* Azure Data Lake Gen 1
* Azure Data Lake Gen 2
* Azure SQL Database 
* Azure PostgreSQL Database
* Azure MySQL Database

It is not a requirement to use Azure Machine Learning datastores: you can use storage URIs directly, assuming you have access to the underlying data.
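
When a location does go through a datastore, it is addressed with an `azureml://` URI, either in the short form `azureml://datastores/<datastore>/paths/<path>` or in a long form that also names the subscription, resource group, and workspace. The sketch below splits such a URI into its parts. `parse_datastore_uri` and `DatastorePath` are hypothetical helper names for illustration, not SDK APIs.

```python
import re
from typing import NamedTuple, Optional


class DatastorePath(NamedTuple):
    datastore: str
    path: str
    workspace: Optional[str]  # only present in the long form


_DATASTORE_URI = re.compile(
    r"^azureml://"
    r"(?:subscriptions/[^/]+/resourcegroups/[^/]+/workspaces/(?P<workspace>[^/]+)/)?"
    r"datastores/(?P<datastore>[^/]+)/paths/(?P<path>.+)$",
    re.IGNORECASE,
)


def parse_datastore_uri(uri: str) -> DatastorePath:
    """Split an azureml:// datastore URI into datastore, path, and workspace."""
    match = _DATASTORE_URI.match(uri)
    if match is None:
        raise ValueError(f"not an azureml datastore URI: {uri!r}")
    return DatastorePath(match["datastore"], match["path"], match["workspace"])


short = parse_datastore_uri(
    "azureml://datastores/workspaceblobstore/paths/LocalUpload/taxi-data.csv"
)
print(short.datastore, short.path)
```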

You can create a datastore using the Studio, the CLI, or the SDK.

<hr>

We can create a **datastore** with CLI v2 or SDK v2 using the following syntax:

<center>
<img src="../../imgs/create_datastore.png" width = "700px" alt="Create Datastore cli vs sdk">
</center>



**Data asset** - Create data assets in your workspace to share with team members, version, and track data lineage.

By creating a data asset, you create a reference to the data source location, along with a copy of its metadata. 

The benefits of creating data assets are:

* You can **share and reuse data** with other members of the team such that they do not need to remember file locations.
* You can **seamlessly access data** during model training (on any supported compute type) without worrying about connection strings or data paths.
* You can **version** the data.
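
Versioned assets are referenced either by an explicit version (`name:version`) or by a label (`name@label`), optionally prefixed with `azureml:`. The pipeline later in this workshop uses `taxi-data@latest`. The sketch below shows how such a reference breaks down. `parse_asset_ref` and `AssetRef` are hypothetical helpers written for illustration and are not part of the SDK.

```python
from typing import NamedTuple, Optional


class AssetRef(NamedTuple):
    name: str
    version: Optional[str]
    label: Optional[str]


def parse_asset_ref(ref: str) -> AssetRef:
    """Split 'name:version' or 'name@label' (optionally 'azureml:'-prefixed)."""
    body = ref[len("azureml:"):] if ref.startswith("azureml:") else ref
    if "@" in body:
        name, label = body.split("@", 1)
        if name and label:
            return AssetRef(name, None, label)
    elif ":" in body:
        name, version = body.rsplit(":", 1)
        if name and version:
            return AssetRef(name, version, None)
    raise ValueError(f"expected 'name:version' or 'name@label', got {ref!r}")


latest = parse_asset_ref("taxi-data@latest")
pinned = parse_asset_ref("azureml:taxi-data:1")
print(latest)
print(pinned)
```

Pinning a version keeps a training run reproducible, while a label such as `latest` always follows the newest registered version.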

<hr>

We can create a **data asset** with CLI v2 or SDK v2 using the following syntax:

<center>
<img src="../../imgs/create_data_asset.png" width = "700px" alt="Create Data Asset cli vs sdk">
</center>

In [8]:
my_data = Data(
    path="../../data/taxi-data.csv",
    type=AssetTypes.URI_FILE,
    description="Taxi dataset",
    name="taxi-data"
)
ml_client.data.create_or_update(my_data)

Uploading taxi-data.csv (< 1 MB): 100%|██████████| 1.21M/1.21M [00:00<00:00, 19.9MB/s]



Data({'path': 'azureml://subscriptions/e86f0482-6203-4a73-adbe-7c9c39754c57/resourcegroups/aml/workspaces/amlce/datastores/workspaceblobstore/paths/LocalUpload/9292ec840b5d1db6306dba71da69ab7f/taxi-data.csv', 'skip_validation': False, 'mltable_schema_url': None, 'referenced_uris': None, 'type': 'uri_file', 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'taxi-data', 'description': 'Taxi dataset', 'tags': {}, 'properties': {}, 'print_as_yaml': False, 'id': '/subscriptions/e86f0482-6203-4a73-adbe-7c9c39754c57/resourceGroups/aml/providers/Microsoft.MachineLearningServices/workspaces/amlce/data/taxi-data/versions/1', 'Resource__source_path': '', 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/lenisha1/code/Users/lenisha/mlops-v2-workshop/ml-pipelines/sdk', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7fd288c6d4e0>, 'serialize': <msrest.serialization.Serializer object at 0x7fd288c6d3f0>, 'version': '

## 3. Register Train Environment

Azure Machine Learning environments define the execution environments for your **jobs** or **deployments** and encapsulate the dependencies for your code. 

Azure ML uses the environment specification to create the Docker container that your **training** or **scoring code** runs in on the specified compute target.

You can create an environment from one of the following:
* A conda specification
* A Docker image
* A Docker build context

There are two types of environments in Azure ML: **curated** and **custom environments**. Curated environments are predefined environments containing popular ML frameworks and tooling. Custom environments are user-defined.
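
A conda specification is a small YAML file that names the environment and lists its channels and dependencies, with pip packages nested under a `pip` entry. The contents of this workshop's `train-conda.yml` are not shown here, so the sketch below renders a generic specification from Python lists. The helper `render_conda_spec`, the environment name, and the package names are all hypothetical placeholders.

```python
from typing import Sequence


def render_conda_spec(
    name: str,
    channels: Sequence[str],
    conda_deps: Sequence[str],
    pip_deps: Sequence[str] = (),
) -> str:
    """Render a minimal conda environment file as YAML text."""
    lines = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    lines += [f"  - {dep}" for dep in conda_deps]
    if pip_deps:
        # pip packages are listed under a nested 'pip' entry.
        lines.append("  - pip:")
        lines += [f"      - {dep}" for dep in pip_deps]
    return "\n".join(lines) + "\n"


# Placeholder packages only; use the dependencies your training code needs.
spec = render_conda_spec(
    name="example-train-env",
    channels=["conda-forge"],
    conda_deps=["python=3.10", "pip"],
    pip_deps=["pandas", "scikit-learn"],
)
print(spec)
```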

<hr>

We can register an **environment** with CLI v2 or SDK v2 using the following syntax:

<center>
<img src="../../imgs/create_environment.png" width = "700px" alt="Create Environment cli vs sdk">
</center>

In [None]:
my_environment = Environment(
    image="mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04",
    conda_file="../../environment/train-conda.yml",
    name="taxi-train-env",
    description="Environment created from a Docker image plus Conda environment to train taxi model.",
)

ml_client.environments.create_or_update(my_environment)

## 4. Create Pipeline Job

**AML Job**:

Azure ML provides several ways to train your models, from code-first solutions to low-code solutions:

* Azure ML supports script files in Python, R, Java, Julia, or C#. To use Azure ML, all you need to learn is the YAML format and the command line.

* Distributed training - AML integrates with popular frameworks such as PyTorch and TensorFlow. Both frameworks support data parallelism and model parallelism for distributed training.

* Automated ML - Train models without extensive data science or programming knowledge.

* Designer - drag and drop web-based UI.

<hr>

We can submit a **job** with CLI v2 or SDK v2 using the following syntax:

<center>
<img src="../../imgs/create_job.png" width = "700px" alt="Create Job cli vs sdk">
</center>

<br>
    
**AML Pipelines**:

An AML pipeline is an independently executable workflow of a complete machine learning task. It helps standardize the best practices for producing a machine learning model. The core idea of a machine learning pipeline is to split a complete machine learning task into a multistep workflow, where each step is a manageable component that can be developed, optimized, configured, and automated individually.
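
Because each step consumes the outputs of earlier steps, a pipeline forms a dependency graph, and the graph determines the order in which steps can run. The sketch below illustrates this for the four steps used in this workshop (prepare, train, evaluate, register) with Python's standard `graphlib` module (Python 3.9+). It is an illustration of the idea only and does not reflect how Azure ML schedules steps internally.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose outputs it consumes.
dependencies = {
    "prepare": set(),
    "train": {"prepare"},               # needs the prepared training data
    "evaluate": {"train", "prepare"},   # needs the model and the test data
    "register": {"train", "evaluate"},  # needs the model and the evaluation
}

# A valid execution order runs every step after all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
# ['prepare', 'train', 'evaluate', 'register']
```

For this graph there is only one valid order. Steps that do not depend on each other could run in parallel.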

<hr>

We can submit a **pipeline job** with CLI v2 or SDK v2 using the following syntax:

<center>
<img src="../../imgs/create_pipeline.png" width = "700px" alt="Create Pipeline cli vs sdk">
</center>

In [None]:
# Create pipeline job
parent_dir = "../../components"

# 1. Load components
prepare_data = load_component(source=parent_dir + "/prep.yml")
train_model = load_component(source=parent_dir + "/train.yml")
evaluate_model = load_component(source=parent_dir + "/evaluate.yml")
register_model = load_component(source=parent_dir + "/register.yml")

# 2. Construct pipeline
@pipeline()
def taxi_training_pipeline(raw_data, enable_monitoring, table_name):
    
    prepare = prepare_data(
        raw_data=raw_data,
        enable_monitoring=enable_monitoring, 
        table_name=table_name
    )

    train = train_model(
        train_data=prepare.outputs.train_data
    )

    evaluate = evaluate_model(
        model_name="taxi-model",
        model_input=train.outputs.model_output,
        test_data=prepare.outputs.test_data
    )


    register = register_model(
        model_name="taxi-model",
        model_path=train.outputs.model_output,
        evaluation_output=evaluate.outputs.evaluation_output
    )

    return {
        "pipeline_job_train_data": prepare.outputs.train_data,
        "pipeline_job_test_data": prepare.outputs.test_data,
        "pipeline_job_trained_model": train.outputs.model_output,
        "pipeline_job_score_report": evaluate.outputs.evaluation_output,
    }


pipeline_job = taxi_training_pipeline(
    Input(type=AssetTypes.URI_FILE, path="taxi-data@latest"), "false", "taximonitoring"
)

# set pipeline level compute
pipeline_job.settings.default_compute = "cpu-cluster"
# set pipeline level datastore
pipeline_job.settings.default_datastore = "workspaceblobstore"

In [None]:
pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job, experiment_name="pipeline_samples-sdk"
)
pipeline_job
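
Submitting a job returns immediately, while the pipeline keeps running in the workspace. If you want the notebook to wait, you can stream the job's logs with `ml_client.jobs.stream` and then read the final status. The helper below is a minimal sketch of that flow. The name `submit_and_wait` is our own, and it is written against any object exposing `jobs.create_or_update`, `jobs.stream`, and `jobs.get`, so it can be tried without a workspace.

```python
from typing import Any


def submit_and_wait(ml_client: Any, job: Any, experiment_name: str) -> str:
    """Submit a job, block while its logs stream, and return the final status."""
    submitted = ml_client.jobs.create_or_update(job, experiment_name=experiment_name)
    # stream() blocks and prints log output until the job reaches a final state.
    ml_client.jobs.stream(submitted.name)
    # Fetch the job again to read its status after streaming has ended.
    return ml_client.jobs.get(submitted.name).status


# Intended usage with the objects from this workshop (not executed here):
# status = submit_and_wait(ml_client, pipeline_job, "pipeline_samples-sdk")
# print(status)
```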