# Welcome to the ufs2arco + Anemoi + wxvx Framework in AzureML!

This notebook will guide you through the basic steps of running the full framework.

1) Create your training and validation datasets with ufs2arco
2) Submit a training job with anemoi-training
3) Run inference with anemoi-inference
4) Run verification with wxvx

More details will be provided in each individual section.

## Step 0: Configure Azure settings and permissions

In [None]:
from azureml.core import Workspace, Experiment
from azure.ai.ml import MLClient, Input, Output, command
from azure.identity import DefaultAzureCredential
from azure.ai.ml.entities import Environment, BuildContext
from azure.ai.ml.constants import AssetTypes

ws = Workspace.from_config()

ml_client = MLClient(
    DefaultAzureCredential(),
    ws.subscription_id,
    ws.resource_group,
    ws.name,
)

# Get your workspace's default datastore (or specify another registered datastore)
default_ds = ml_client.datastores.get_default()

## Step 1: Create conda environments
You only need to do this once. Before running these cells, check whether someone else has already created these environments. After creation, you should be able to find them under the "Environments" tab in AzureML. Each time you re-run a cell, it creates a new version (e.g. myenv:1 becomes myenv:2). This is helpful when you want to update your environment (change package versions, add a package, etc.).
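
Because every re-run bumps the version, later cells that hard-code `name:1` can drift out of date. The sketch below shows one way to pick the highest version from a list of version strings. The helper name `latest_environment_ref` is hypothetical, and the sample versions are made up; in practice the list might come from something like `ml_client.environments.list(name="ufs2arco")`, which is an assumption about how you would wire it in.

```python
def latest_environment_ref(name, versions):
    """Return 'name:N' for the highest numeric version in `versions`.

    Hypothetical helper: `versions` is any iterable of version strings,
    e.g. collected from the environments registered in your workspace.
    """
    numeric = [int(v) for v in versions if str(v).isdigit()]
    if not numeric:
        raise ValueError(f"no numeric versions found for environment {name!r}")
    return f"{name}:{max(numeric)}"


# Made-up version strings for illustration only.
print(latest_environment_ref("ufs2arco", ["1", "2", "10"]))  # ufs2arco:10
```

Versions are compared as integers so that `10` sorts above `2`, which a plain string comparison would get wrong.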

Environments we will create:
1) ufs2arco for data processing
2) anemoi for training a graph-based model
3) wxvx for post-processing and verification

In [None]:
# create ufs2arco conda environment

env_docker_conda = Environment(
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    conda_file="conf/conda/ufs2arco-conda.yaml",
    name="ufs2arco",
    description="ufs2arco environment created from a Docker image plus Conda environment.",
)

ml_client.environments.create_or_update(env_docker_conda)

In [None]:
# create anemoi conda environment

env_docker_conda = Environment(
    image="mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.8-cudnn8-ubuntu22.04",
    conda_file="conf/conda/anemoi-conda.yaml",
    name="anemoi",
    description="anemoi environment created from a Docker image plus Conda environment.",
)

ml_client.environments.create_or_update(env_docker_conda)

In [None]:
# create wxvx conda environment

env_docker_conda = Environment(
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    conda_file="conf/conda/wxvx_conda.yaml",
    name="wxvx",
    description="wxvx environment created from a Docker image plus Conda environment.",
)

returned_job = ml_client.environments.create_or_update(env_docker_conda)

## Step 2: Create replay dataset with ufs2arco
This step creates a single dataset that contains both the training and validation data. The zarr is saved to your default datastore.


The default setup here assumes you have access to MPI. If you do not, make the following changes to create the data without MPI. Note that this will make the process take much longer (feel free to shrink your date range if you are only running tests).

- conf/data/replay.yaml: At the top of the yaml, change mover to `datamover` instead of `mpidatamover`, and add a line beneath it that says `batch_size: 2`
- conf/data/submit_ufs2arco.sh: Change `mpirun --allow-run-as-root -np 8 ufs2arco replay.yaml` to simply `ufs2arco replay.yaml`

Otherwise, you should not need to change anything and can simply run this cell!
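
The difference between the two launch lines can be sketched in Python. This is only an illustration of the two command strings quoted above; the helper name `ufs2arco_launch_command` and its parameters are hypothetical and are not part of ufs2arco or of `submit_ufs2arco.sh`.

```python
def ufs2arco_launch_command(config="replay.yaml", use_mpi=True, n_procs=8):
    """Build the ufs2arco launch line, with or without the mpirun prefix.

    Hypothetical helper that mirrors the two variants described in the text.
    """
    base = ["ufs2arco", config]
    if use_mpi:
        # MPI variant: spread the work over n_procs processes.
        prefix = ["mpirun", "--allow-run-as-root", "-np", str(n_procs)]
        return " ".join(prefix + base)
    # Serial variant: slower, but needs no MPI installation.
    return " ".join(base)


print(ufs2arco_launch_command())               # mpirun --allow-run-as-root -np 8 ufs2arco replay.yaml
print(ufs2arco_launch_command(use_mpi=False))  # ufs2arco replay.yaml
```

Remember that the serial variant also needs the matching `datamover` change in `replay.yaml`.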

There are a few additional things to note:
- If you have run this cell before and the job saved any output (for example, the job failed partway through, or you re-ran it to test a configuration change), the new job will likely fail because a zarr already exists. Either delete the existing zarr or give the new one a different name by changing `output_zarr`.
- This job assumes you have access to a `Standard-D13-v2` instance. If you do not, change this to a CPU instance that you do have access to.
- This cell also assumes that you are using version 1 of your ufs2arco environment (`environment="ufs2arco:1"`). If you have recreated this environment, update the version number to the most recent version.
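
One way to avoid the "zarr already exists" failure is to give each attempt its own name. The sketch below builds the datastore URI used in the next cell and a timestamped zarr name. Both helper names are hypothetical, and `workspaceblobstore` is only a placeholder for your own datastore name.

```python
from datetime import datetime, timezone


def datastore_uri(datastore_name, *parts):
    """Join path parts into an azureml://datastores/<name>/paths/<path> URI."""
    path = "/".join(p.strip("/") for p in parts if p)
    return f"azureml://datastores/{datastore_name}/paths/{path}"


def timestamped_zarr_name(stem="replay", now=None):
    """Return e.g. 'replay_20240102T030405.zarr' so reruns do not collide."""
    now = now or datetime.now(timezone.utc)
    return f"{stem}_{now:%Y%m%dT%H%M%S}.zarr"


# Placeholder datastore name and a fixed time, for illustration only.
name = timestamped_zarr_name("replay", datetime(2024, 1, 2, 3, 4, 5))
print(datastore_uri("workspaceblobstore", "replay/data", name))
# azureml://datastores/workspaceblobstore/paths/replay/data/replay_20240102T030405.zarr
```

If you rename the zarr this way, use the same name for `input_zarr` in the training step.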

In [None]:
output_path = "replay/data"
output_zarr = "replay.zarr"
outputs = Output(
    type="uri_folder",
    path=f"azureml://datastores/{default_ds.name}/paths/{output_path}/{output_zarr}" 
)

command_job = command(
    code="conf/data",
    # Braces are quadrupled so the f-string emits the literal ${{outputs.output_blob}} placeholder
    command=f"bash submit_ufs2arco.sh ${{{{outputs.output_blob}}}} {output_path} {output_zarr}",
    environment="ufs2arco:1",
    compute="Standard-D13-v2",
    experiment_name="ufs2arco",
    display_name="training_dataset",
    outputs={"output_blob": outputs},
)

returned_job = ml_client.jobs.create_or_update(command_job)

returned_job.services["Studio"].endpoint

## Step 3: Submit a training job with anemoi-core

After your dataset job has completed, run this cell to submit model training. Checkpoints and plots are all saved to the default datastore.
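
Since this step depends on the dataset job having finished, it can help to block until that job reaches a final state. The sketch below is a generic polling loop. The helper name `wait_for_job` is hypothetical, and the set of terminal status strings is an assumption about what AzureML reports. With a real client you might pass something like `lambda: ml_client.jobs.get(returned_job.name).status`; here a stub stands in for it.

```python
import time

# Assumed terminal states; adjust if your workspace reports others.
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}


def wait_for_job(get_status, poll_seconds=30.0, timeout_seconds=None, sleep=time.sleep):
    """Poll get_status() until it returns a terminal status, then return it."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if timeout_seconds is not None and waited >= timeout_seconds:
            raise TimeoutError(f"job still {status!r} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Stubbed status sequence so the example runs without a workspace.
fake_statuses = iter(["Queued", "Running", "Completed"])
print(wait_for_job(lambda: next(fake_statuses), poll_seconds=0, sleep=lambda s: None))  # Completed
```

Only continue to training if the returned status is `Completed`.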

A few notes to check before submission:
- As noted in the data step, check that your environment version and compute match what you intend to use.
- A `Standard-NC4as-T4-v3` is a great option for this task if it is available to you.

In future work, we will add a version that can run successfully on a CPU so that users can run it with free resources.

In [None]:
input_path = "replay/data"
input_zarr = "replay.zarr"
replay_inputs = Input(
    type="uri_folder",
    mode="ro_mount",
    path=f"azureml://datastores/{default_ds.name}/paths/{input_path}/{input_zarr}",
)

outputs = Output(
    type="uri_folder",
    mode="upload",
    path=f"azureml://datastores/{default_ds.name}/paths/replay/training-output",
)

command_job = command(
    code="conf/training",
    # Plain string (not an f-string) so AzureML receives the literal ${{...}} placeholders
    command="bash submit_training.sh ${{inputs.data}} ${{outputs.output_dir}}",
    environment="anemoi:1",
    compute="Standard-NC4as-T4-v3",
    experiment_name="anemoi-training",
    display_name="anemoi-training",
    outputs={"output_dir": outputs},
    inputs={"data": replay_inputs},
)

returned_job = ml_client.jobs.create_or_update(command_job)

returned_job.services["Studio"].endpoint

## Step 4: Submit inference job with anemoi-inference

Run this cell after you have successfully completed training. It loads the model checkpoint from the default datastore and creates one 240-hr forecast. The output is saved to the default datastore.

A few notes to check before submission:
- As always, check your compute and environment. 
- In the first line (`input_path`), replace the run_id (`3f476fd7-65ca-4d98-b3d5-b622d88a0d7d`) with your own, which you can find in your default datastore. This ID differs with each model run.
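
A copy-paste slip in the run_id is easy to make, so a small check before submission can save a failed job. The sketch below assumes run IDs are UUID-formatted, as in the example above, and uses the checkpoint layout from the next cell. The helper name `checkpoint_path` is hypothetical.

```python
import uuid


def checkpoint_path(run_id, root="replay/training-output/checkpoint",
                    filename="inference-last.ckpt"):
    """Build the checkpoint path for a run, rejecting malformed run IDs.

    Assumes run IDs are UUID strings; uuid.UUID raises ValueError otherwise.
    """
    run_id = str(uuid.UUID(run_id))
    return f"{root}/{run_id}/{filename}"


print(checkpoint_path("3f476fd7-65ca-4d98-b3d5-b622d88a0d7d"))
# replay/training-output/checkpoint/3f476fd7-65ca-4d98-b3d5-b622d88a0d7d/inference-last.ckpt
```

The returned string can be placed after `paths/` in the datastore URI for the checkpoint input.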

In [None]:
input_path = "replay/training-output/checkpoint/3f476fd7-65ca-4d98-b3d5-b622d88a0d7d"
input_ckpt = "inference-last.ckpt"
ckpt_input = Input(
    type="uri_file",
    mode="ro_mount",
    path=f"azureml://datastores/{default_ds.name}/paths/{input_path}/{input_ckpt}",
)

input_path = "replay/data"
zarr_input = Input(
    type="uri_folder",
    mode="ro_mount",
    path=f"azureml://datastores/{default_ds.name}/paths/{input_path}",
)

outputs = Output(
    type="uri_folder",
    mode="upload",
    path=f"azureml://datastores/{default_ds.name}/paths/replay/inference",
)

command_job = command(
    code="conf/inference",
    # Plain string (not an f-string) so AzureML receives the literal ${{...}} placeholders
    command="python inference.py ${{inputs.ckpt}} ${{inputs.zarr}} ${{outputs.output_dir}}",
    environment="anemoi:1",
    compute="Standard-NC4as-T4-v3",
    experiment_name="anemoi-inference",
    display_name="anemoi-inference",
    outputs={"output_dir": outputs},
    inputs={
        "ckpt": ckpt_input,
        "zarr": zarr_input,
    },
)

returned_job = ml_client.jobs.create_or_update(command_job)

returned_job.services["Studio"].endpoint

## Step 5: Submit verification job with wxvx

Load the inference output, post-process it so that it works with wxvx, and run verification against the GFS for a handful of variables.

In [None]:
input_path = "replay/inference"
zarr_input = Input(
    type="uri_folder",
    mode="ro_mount",
    path=f"azureml://datastores/{default_ds.name}/paths/{input_path}",
)

outputs = Output(
    type="uri_folder",
    mode="upload",
    path=f"azureml://datastores/{default_ds.name}/paths/replay/verification",
)

command_job = command(
    code="conf/verification",
    # Plain string (not an f-string) so AzureML receives the literal ${{...}} placeholders
    command="bash submit_wxvx.sh ${{inputs.zarr}} ${{outputs.output_dir}}",
    environment="wxvx:1",
    compute="Standard-D13-v2",
    experiment_name="wxvx",
    display_name="wxvx",
    outputs={"output_dir": outputs},
    inputs={"zarr": zarr_input},
)

returned_job = ml_client.jobs.create_or_update(command_job)

returned_job.services["Studio"].endpoint