For this workshop, you need:

* An Azure Machine Learning workspace. 
* The Azure Machine Learning Python SDK v2 installed. 

To install the SDK, you can either:

* Create a compute instance, which comes with the latest Azure ML Python SDK preinstalled and is preconfigured for ML workflows.
* Use the following commands to install the Azure ML Python SDK v2:

```bash
conda activate <virtual_env_name>
pip install azure-ai-ml==1.0.0
```

If you're using a virtual environment, make sure to install the SDK inside it.

The virtual environment for SDK v2 on Azure Notebooks is called `azureml_py310_sdkv2`.
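To confirm that the SDK is visible from the environment your notebook kernel uses, a quick check along the following lines can help. This is a minimal sketch that relies only on the standard library; the helper name `installed_version` is made up for illustration.

```python
from importlib import metadata


def installed_version(package: str = "azure-ai-ml"):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("azure-ai-ml is not installed in this environment")
else:
    print(f"azure-ai-ml {version} is installed")
```

If the package is reported as missing, the kernel is probably not running in the virtual environment where you ran `pip install`.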


## Connect to ML Client

To connect to a workspace, you need to provide a subscription ID, a resource group, and a workspace name. These details are passed to the `MLClient` from `azure.ai.ml` to get a handle to the required Azure Machine Learning workspace.

The following example uses the default Azure authentication and loads the workspace configuration from a `config.json` file, if one has been copied into the folder structure. If no `config.json` is found, you need to pass `subscription_id`, `resource_group`, and `workspace` explicitly when creating the `MLClient`.

```python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

credential = DefaultAzureCredential()
ml_client = None
try:
    ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    # Enter details of your AzureML workspace
    subscription_id = "<SUBSCRIPTION_ID>"
    resource_group = "<RESOURCE_GROUP>"
    workspace = "<AZUREML_WORKSPACE_NAME>"
    ml_client = MLClient(credential, subscription_id, resource_group, workspace)
```
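If you don't have a `config.json` yet, you can download one from the Azure portal or write it yourself. The sketch below generates the file with the three keys that `MLClient.from_config` reads (`subscription_id`, `resource_group`, `workspace_name`). The helper name `write_workspace_config` is hypothetical, and the demo writes to a temporary folder so that it doesn't overwrite an existing file.

```python
import json
import tempfile
from pathlib import Path


def write_workspace_config(directory, subscription_id, resource_group, workspace_name):
    """Write a config.json file describing an Azure ML workspace."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    config_path = Path(directory) / "config.json"
    config_path.write_text(json.dumps(config, indent=2))
    return config_path


# Demo with placeholder values; replace them with your own workspace details.
demo_path = write_workspace_config(
    tempfile.mkdtemp(),
    "<SUBSCRIPTION_ID>",
    "<RESOURCE_GROUP>",
    "<AZUREML_WORKSPACE_NAME>",
)
print(demo_path.read_text())
```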


In [2]:
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient

try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()

# Requires a config.json file at the given path (here, the working directory)
# Get a handle to workspace
ml_client = MLClient.from_config(credential=credential, path="config.json")

Found the config file in: /mnt/batch/tasks/shared/LS_root/mounts/clusters/jcharley4/code/Users/jcharley/config.json


# Model Training

## 1. Create Managed Compute

A compute is a designated compute resource where you run your job or host your endpoint. Azure Machine Learning supports the following types of compute:

- **Compute instance** - a fully configured and managed development environment in the cloud. You can use the instance as a training or inference compute for development and testing. It's similar to a virtual machine on the cloud.

- **Compute cluster** - a managed-compute infrastructure that allows you to easily create a cluster of CPU or GPU compute nodes in the cloud.

- **Inference cluster** - used to deploy trained machine learning models to Azure Kubernetes Service. You can create an Azure Kubernetes Service (AKS) cluster from your Azure ML workspace, or attach an existing AKS cluster.

- **Attached compute** - You can attach your own compute resources to your workspace and use them for training and inference.

You can create a compute using the Studio, the CLI, or the SDK.

<hr>

In [3]:
from azure.ai.ml.entities import AmlCompute

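my_cluster = AmlCompute(
    name="cpu-cluster-CA",
    type="amlcompute",
    size="STANDARD_DS3_V2",
    min_instances=0,  # the cluster scales down to zero nodes when idle
    max_instances=2,
    location="westeurope",
)

# Returns a poller immediately; provisioning continues in the background
ml_client.compute.begin_create_or_update(my_cluster)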


<azure.core.polling._poller.LROPoller at 0x7f29f2bd6190>
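The output above is an `LROPoller`: the call returns as soon as the request is accepted, not when the cluster is ready. If a later cell depends on the cluster, you can block on the poller's `result()` method. The wrapper below is a small sketch; the name `create_compute_and_wait` is hypothetical, and the usage line is commented out because it needs a live workspace.

```python
def create_compute_and_wait(ml_client, compute, timeout=None):
    """Start creating or updating a compute target and block until it finishes.

    `timeout` is in seconds; None waits until the operation completes.
    """
    poller = ml_client.compute.begin_create_or_update(compute)
    return poller.result(timeout=timeout)


# Usage, assuming `ml_client` and `my_cluster` from the cells above:
# cluster = create_compute_and_wait(ml_client, my_cluster)
# print(cluster.name, cluster.provisioning_state)
```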

## 2. Register File Data Asset

**Datastore** - Azure Machine Learning Datastores securely keep the connection information to your data storage on Azure, so you don't have to code it in your scripts.

An Azure Machine Learning datastore is a **reference** to an **existing** storage account on Azure. The benefits of creating and using a datastore are:
* A common and easy-to-use API to interact with different storage types.
* Easier to discover useful datastores when working as a team.
* When using credential-based access (service principal/SAS/key), the connection information is secured so you don't have to code it in your scripts.

Supported Data Resources: 

* Azure Storage blob container
* Azure Storage file share
* Azure Data Lake Gen 1
* Azure Data Lake Gen 2
* Azure SQL Database 
* Azure PostgreSQL Database
* Azure MySQL Database

Using Azure Machine Learning datastores is optional: you can use storage URIs directly, provided you have access to the underlying data.
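To make the difference concrete, the sketch below builds the URI shapes used later in this workshop: a datastore URI, a direct blob URL, and an ADLS Gen2 URI. The helper names, the account and container names, and the `data/taxi-data.csv` path are hypothetical; only the URI patterns are taken from this notebook.

```python
def datastore_uri(datastore: str, path: str) -> str:
    """URI that goes through an Azure ML datastore."""
    return f"azureml://datastores/{datastore}/paths/{path.lstrip('/')}"


def blob_uri(account: str, container: str, path: str) -> str:
    """Direct URL to a blob in an Azure Storage container."""
    return f"https://{account}.blob.core.windows.net/{container}/{path.lstrip('/')}"


def adls_gen2_uri(account: str, file_system: str, path: str) -> str:
    """Direct URI to a path in Azure Data Lake Storage Gen2."""
    return f"abfss://{file_system}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


# Hypothetical names, for illustration only.
print(datastore_uri("workspaceblobstore", "data/taxi-data.csv"))
print(blob_uri("mystorageaccount", "mycontainer", "data/taxi-data.csv"))
print(adls_gen2_uri("mystorageaccount", "myfilesystem", "data/taxi-data.csv"))
```

With the datastore form, the connection details stay in the workspace. With the direct forms, the identity running the code needs its own access to the storage account.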

You can create a datastore using the Studio, the CLI, or the SDK.

<hr>



**Data asset** - Create data assets in your workspace to share with team members, version, and track data lineage.

By creating a data asset, you create a reference to the data source location, along with a copy of its metadata. 

The benefits of creating data assets are:

* You can **share and reuse data** with other members of the team such that they do not need to remember file locations.
* You can **seamlessly access data** during model training (on any supported compute type) without worrying about connection strings or data paths.
* You can **version** the data.

<hr>


In [3]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

my_data = Data(
    path="../data/Day1-exercice5-Taxi/taxi-data.csv",
    type=AssetTypes.URI_FILE,  # use AssetTypes.URI_FOLDER to register a folder
    description="Taxi dataset",
    name="taxi-data"
)
ml_client.data.create_or_update(my_data)

Data({'skip_validation': False, 'mltable_schema_url': None, 'referenced_uris': None, 'type': 'uri_file', 'is_anonymous': False, 'auto_increment_version': False, 'name': 'taxi-data', 'description': 'Taxi dataset', 'tags': {}, 'properties': {}, 'id': '/subscriptions/66914bb5-9cb2-4f6d-a84d-8ff900446b22/resourceGroups/Learning/providers/Microsoft.MachineLearningServices/workspaces/test_learn/data/taxi-data/versions/4', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/jcharley4/code/Users/jcharley/VBD_Day1/correction', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7ff20d45bf10>, 'serialize': <msrest.serialization.Serializer object at 0x7ff20d45bdc0>, 'version': '4', 'latest_version': None, 'path': 'azureml://subscriptions/66914bb5-9cb2-4f6d-a84d-8ff900446b22/resourcegroups/Learning/workspaces/test_learn/datastores/workspaceblobstore/paths/LocalUpload/2e56e9007690a9db90f90b8830ddcde4/taxi-data.csv', 'datastore': Non
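Each call to `create_or_update` on an existing name produces a new version (version 4 in the output above), and jobs refer to a registered asset with a short `azureml:<name>:<version>` string. The following sketch shows one way to build such references and to fetch a specific version back from the workspace. Both helper names are hypothetical, and the usage lines are commented out because they need a live workspace.

```python
def asset_reference(name: str, version=None) -> str:
    """Build the short reference string for a registered asset.

    With no version, the reference points at the latest version.
    """
    if version is None:
        return f"azureml:{name}@latest"
    return f"azureml:{name}:{version}"


def get_data_asset(ml_client, name: str, version: str):
    """Fetch one specific version of a registered data asset."""
    return ml_client.data.get(name=name, version=version)


print(asset_reference("taxi-data", 4))
print(asset_reference("taxi-data"))

# Usage, assuming `ml_client` from the cells above:
# taxi_v4 = get_data_asset(ml_client, "taxi-data", "4")
# print(taxi_v4.path)
```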

## Create a tabular dataset/data asset - MLTable format

## MLTable

`MLTable` is a way to abstract the schema definition for tabular data so that it is easier for consumers of the data to materialize the table into a Pandas/Dask/Spark dataframe. [A more detailed explanation and motivation is provided on docs.microsoft.com](https://docs.microsoft.com/azure/machine-learning/concept-data#mltable).

The ideal scenarios to use `MLTable` are:

- The schema of your data is complex and/or changes frequently.
- You only need a subset of data (for example: a sample of rows or files, specific columns, etc).
- AutoML jobs requiring tabular data.

If your scenario does not fit the above then it is likely that URIs are a more suitable type.

### The `MLTable` file

The `MLTable` file defines the schema for tabular data. Below is a sample:

In [4]:
! cat ../data/Day1-exercice5-Taxi/MLTable

$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json 

paths:
    - pattern: ./taxi-data.csv

transformations:
  - read_delimited:
      delimiter: ","
      header: all_files_same_headers
      encoding: utf8


In [5]:
# Load the MLTable: the URI must point to the folder that contains the MLTable file
import mltable

tbl = mltable.load(uri="../data/Day1-exercice5-Taxi")
tbl.to_pandas_dataframe()

Unnamed: 0,Column1,cost,distance,dropoff_latitude,dropoff_longitude,passengers,pickup_latitude,pickup_longitude,store_forward,vendor,...,pickup_monthday,pickup_hour,pickup_minute,pickup_second,dropoff_weekday,dropoff_month,dropoff_monthday,dropoff_hour,dropoff_minute,dropoff_second
0,0,4.5,0.83,40.694546,-73.976112,1,40.693836,-73.987267,False,2,...,3,21,2,35,6,True,3,21,5,52
1,1,6.0,1.27,40.812149,-73.959755,1,40.801468,-73.948456,False,2,...,19,21,49,17,1,True,19,21,54,37
2,2,9.5,1.80,40.678741,-73.980309,1,40.679798,-73.955444,False,2,...,5,9,46,18,1,True,5,9,57,28
3,3,4.0,0.50,40.754715,-73.925499,1,40.760818,-73.922935,False,1,...,8,17,49,12,4,True,8,17,52,20
4,4,6.0,0.90,40.669662,-73.911041,1,40.664940,-73.923042,False,1,...,29,10,28,21,4,True,29,10,34,59
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9768,9768,9.0,1.39,40.757576,-73.974464,1,40.755352,-73.985252,False,2,...,29,21,10,23,4,True,29,21,22,1
9769,9769,9.5,1.70,40.770500,-73.989861,1,40.755215,-73.981499,False,1,...,4,14,32,16,0,True,4,14,44,38
9770,9770,6.5,1.40,40.739834,-73.985512,1,40.724628,-73.987572,False,1,...,25,6,6,39,0,True,25,6,12,10
9771,9771,7.0,0.90,40.769672,-73.966759,1,40.766201,-73.952728,False,1,...,12,13,42,38,1,True,12,13,50,54
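One of the scenarios listed earlier is needing only a subset of the data. Assuming the `mltable` package's `keep_columns` and `take` methods, a loaded table can be narrowed down before it is materialized, as in the sketch below. The helper name `preview_subset` is hypothetical, and the usage lines are commented out because they need the `tbl` object from the cell above.

```python
def preview_subset(tbl, columns, count=5):
    """Keep only `columns`, take the first `count` rows, and return a dataframe.

    `tbl` is expected to be an MLTable object, for example from mltable.load().
    """
    subset = tbl.keep_columns(columns).take(count)
    return subset.to_pandas_dataframe()


# Usage, assuming `tbl` from the cell above:
# df = preview_subset(tbl, ["cost", "distance", "passengers"], count=10)
# print(df)
```

Because the transformations are applied to the table definition, only the selected columns and rows end up in the resulting dataframe.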


In [6]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# path must point to the folder containing the MLTable artifact (MLTable file + data)
# Supported paths include:
# local: './<path>'
# blob:  'https://<account_name>.blob.core.windows.net/<container_name>/<path>'
# ADLS gen2: 'abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/'
# Datastore: 'azureml://datastores/<data_store_name>/paths/<path>'

my_data = Data(
    path="../data/Day1-exercice5-Taxi/",
    type=AssetTypes.MLTABLE,
    description="Taxi tabular dataset",
    name="taxi-mltable-data",
)

ml_client.data.create_or_update(my_data)

Data({'skip_validation': False, 'mltable_schema_url': None, 'referenced_uris': ['./taxi-data.csv'], 'type': 'mltable', 'is_anonymous': False, 'auto_increment_version': False, 'name': 'taxi-mltable-data', 'description': 'Taxi tabular dataset', 'tags': {}, 'properties': {}, 'id': '/subscriptions/66914bb5-9cb2-4f6d-a84d-8ff900446b22/resourceGroups/Learning/providers/Microsoft.MachineLearningServices/workspaces/test_learn/data/taxi-mltable-data/versions/2', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/jcharley4/code/Users/jcharley/VBD_Day1/correction', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7ff1e4c5c340>, 'serialize': <msrest.serialization.Serializer object at 0x7ff1e4c5c7f0>, 'version': '2', 'latest_version': None, 'path': 'azureml://subscriptions/66914bb5-9cb2-4f6d-a84d-8ff900446b22/resourcegroups/Learning/workspaces/test_learn/datastores/workspaceblobstore/paths/LocalUpload/1bbc5d0cd9365b9adf705e5fdfd

## 3. Register Train Environment

Azure Machine Learning environments define the execution environments for your **jobs** or **deployments** and encapsulate the dependencies for your code. 

Azure ML uses the environment specification to create the Docker container that your **training** or **scoring code** runs in on the specified compute target.

You can create an environment from a:
* conda specification
* Docker image
* Docker build context

There are two types of environments in Azure ML: **curated** and **custom environments**. Curated environments are predefined environments containing popular ML frameworks and tooling. Custom environments are user-defined.

<hr>

We can register an **environment** with the CLI v2 or the SDK v2. The following cell uses the SDK v2:


In [12]:
from azure.ai.ml.entities import Environment

my_environment = Environment(
    image="azureml://registries/azureml/environments/AzureML-sklearn-0.24-ubuntu18.04-py37-cpu/versions/46",
    conda_file = "./src-exercice6/environment/train-conda.yml",
    name="taxi-train-env",
    description="Environment created from a Docker image plus Conda environment to train taxi model.",
)

ml_client.environments.create_or_update(my_environment)
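The `conda_file` referenced above is not shown in this notebook. For orientation, the sketch below writes a conda specification of the kind such a file typically contains. The package list, the Python version, and the helper name `write_conda_spec` are assumptions made for illustration; the real `train-conda.yml` may differ.

```python
import tempfile
from pathlib import Path

# Hypothetical conda specification; adjust the packages to what your training code imports.
CONDA_SPEC = """\
name: taxi-train-env
channels:
  - conda-forge
dependencies:
  - python=3.7
  - pip
  - pip:
      - pandas
      - scikit-learn
"""


def write_conda_spec(directory, filename="train-conda.yml"):
    """Write the conda specification to `directory` and return its path."""
    conda_path = Path(directory) / filename
    conda_path.write_text(CONDA_SPEC)
    return conda_path


# Demo: write to a temporary folder so that no existing file is overwritten.
conda_path = write_conda_spec(tempfile.mkdtemp())
print(conda_path.read_text())
```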

## 4. Simple job (one-step pipeline)

In [7]:
env = ml_client.environments.get(name="AzureML-minimal-ubuntu18.04-py37-cpu-inference", version="45")
print(env)

Environment({'is_anonymous': False, 'auto_increment_version': False, 'name': 'AzureML-minimal-ubuntu18.04-py37-cpu-inference', 'description': 'A minimal environment for Inference, does not contain any framework and additional python packages.', 'tags': {'OS': 'Ubuntu18.04', 'Python': '3.7', 'Framework': 'None', 'Inference': ''}, 'properties': {}, 'id': '/subscriptions/66914bb5-9cb2-4f6d-a84d-8ff900446b22/resourceGroups/Learning/providers/Microsoft.MachineLearningServices/workspaces/test_learn/environments/AzureML-minimal-ubuntu18.04-py37-cpu-inference/versions/45', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/jcharley4/code/Users/jcharley/VBD_Day1/correction', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7f29f1eb3a90>, 'serialize': <msrest.serialization.Serializer object at 0x7f29f0b6fd00>, 'version': '45', 'latest_version': None, 'conda_file': None, 'image': None, 'build': None, 'inference_config': None, 

In [9]:
from azure.ai.ml import command, Input


# create the command
job = command(
    code="./src-exercice6",  # local path where the code is stored
    # NOTE: the --diabetes-csv flag must match the argument that main.py parses
    command="python main.py --diabetes-csv ${{inputs.taxi}}",
    inputs={
        "taxi": Input(
            type="uri_file",
            path="azureml:taxi-data:4",
        )
    },
    environment="AzureML-minimal-ubuntu18.04-py37-cpu-inference@latest",
    compute='cpu-cluster-CA',
    display_name="taxi-singlejob-example",
    # description,
    # experiment_name
)

# submit the command
returned_job = ml_client.create_or_update(job)
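At run time, `${{inputs.taxi}}` in the command string is replaced by the path where the `taxi` input is made available on the compute, so the script receives an ordinary file path. The `main.py` in `./src-exercice6` is not shown in this notebook; the sketch below is a hypothetical stand-in that only illustrates how a script can pick up that path through the `--diabetes-csv` flag. The function names and the row-counting logic are invented for illustration.

```python
import argparse
import csv
import os
import tempfile


def parse_args(argv=None):
    """Parse the command line; the flag name matches the job's command string."""
    parser = argparse.ArgumentParser(description="Hypothetical training entry point")
    parser.add_argument("--diabetes-csv", dest="csv_path", required=True,
                        help="Path to the input CSV file")
    return parser.parse_args(argv)


def count_rows(csv_path):
    """Count the data rows in the CSV file, skipping the header line."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header
        return sum(1 for _ in reader)


# Local dry run: a throwaway file stands in for the input mounted by Azure ML.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("cost,distance\n4.5,0.83\n6.0,1.27\n")

args = parse_args(["--diabetes-csv", tmp.name])
print(f"rows: {count_rows(args.csv_path)}")
os.remove(tmp.name)
```

A dry run like this can catch a mismatch between the flag in the command string and the script's argument parser before the job is submitted.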