# 1. RoseTTAFold on Azure ML - Setup

## Introduction

Azure is collaborating with the Baker Lab to expose their RoseTTAFold model as a service. This document describes how to get started exploring RoseTTAFold on Azure Machine Learning (Azure ML).

In this first notebook, we'll set up your Azure Machine Learning workspace with the data and compute cluster you'll need to get started.

**The setup steps in this first notebook only need to be followed once.** 

In the next notebook, [2-run-experiment.ipynb](2-run-experiment.ipynb), we'll specify some amino acid sequence data and run a RoseTTAFold job in your Azure Machine Learning workspace.

Then, in [3-batch-endpoint.ipynb](3-batch-endpoint.ipynb), we'll create a Batch Endpoint so that the model can be called from the Azure CLI or through a REST call.

**Note.** This RoseTTAFold endpoint is not designed for production use. Run it only in non-production test environments.

## Setup

In [None]:
%pip install --user azureml-core azureml-dataset-runtime azureml-pipeline
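
After the install finishes, it can help to confirm that the packages are visible to the running kernel; with `--user` installs, a kernel restart is sometimes needed first. The following is a minimal sketch that uses only the standard library (Python 3.8 or later). `missing_packages` is a hypothetical helper name, not part of the Azure ML SDK.

```python
from importlib import metadata

# Distribution names taken from the %pip install line above.
REQUIRED_PACKAGES = ["azureml-core", "azureml-dataset-runtime", "azureml-pipeline"]


def missing_packages(names):
    """Return the distribution names that are not installed in this environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


not_installed = missing_packages(REQUIRED_PACKAGES)
if not_installed:
    print("Missing packages:", ", ".join(not_installed))
else:
    print("All required packages are installed.")
```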

### VS Code Setup
If running this notebook within VS Code, ensure that the [Azure Machine Learning extension](https://marketplace.visualstudio.com/items?itemName=ms-toolsai.vscode-ai) is installed.

Select the Azure icon in the left pane, and sign in to your Azure account.


### Create a workspace 
1. Sign in to [Azure ML Studio](https://ml.azure.com/home) using the credentials for your Azure subscription.
2. Select the "Create workspace" button.
3. In the "Create new workspace" dialog, select your subscription, then fill out the remaining details and select "Create".
4. Once the workspace is created, select your workspace name in the top right corner to open the dropdown.
5. Select "Download config file", and save (or copy) the config.json file to the directory where this notebook is located.


### Load Workspace Config
You must first download the config.json file from your Azure ML workspace (see previous steps in "Create a workspace"). This file should be saved locally, in the same directory as this notebook.

Run the following block to load your workspace config, and ensure it succeeds before proceeding. Watch the output feed, as it may require you to launch your browser and sign in.

In [None]:
from azureml.core import Workspace

try:
    ws = Workspace.from_config()
    print(ws.name, ws.location, ws.resource_group, sep='\t')
    print('Azure ML workspace loaded')
except Exception as exc:
    # Show the underlying error (for example, a missing config.json or a failed sign-in)
    print(f'Azure ML workspace not found: {exc}')
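
If the cell above reports that the workspace was not found, one likely cause is a missing or incomplete config.json. The sketch below checks the file before `Workspace.from_config()` is called. It assumes the downloaded file contains the keys `subscription_id`, `resource_group`, and `workspace_name`; `check_workspace_config` is a hypothetical helper name, not part of the Azure ML SDK.

```python
import json
from pathlib import Path

# Keys assumed to be present in the config.json downloaded from Azure ML Studio.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def check_workspace_config(path="config.json"):
    """Return a list of problems found in the config file (empty if none)."""
    config_path = Path(path)
    if not config_path.is_file():
        return [f"{config_path} does not exist"]
    try:
        config = json.loads(config_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"{config_path} is not valid JSON: {exc}"]
    if not isinstance(config, dict):
        return [f"{config_path} does not contain a JSON object"]
    # Report every required key that is absent or empty.
    return [f"missing or empty key: {key}" for key in REQUIRED_KEYS if not config.get(key)]


for problem in check_workspace_config():
    print(problem)
```

An empty result only means the file looks well-formed; signing in and access to the subscription are still checked when the workspace is loaded.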

### Create Azure ML Compute Cluster
Next, we'll create a compute cluster, or load an existing one. First, specify the name and parameters for your compute cluster.

By default, we've selected the 'Standard_NC12' VM size, an NC-series size backed by NVIDIA Tesla K80 GPUs. For more details, see the [NC-series VM specifications](https://docs.microsoft.com/en-us/azure/virtual-machines/nc-series).

In [None]:
# Specify the name for your compute cluster
compute_name = 'gpu-cluster'

# If this is a *new* compute cluster, also specify the following VM properties
cluster_vm_size = 'Standard_NC12' # 'Standard_NC12' is a VM size that includes an NVIDIA Tesla K80 GPU 
cluster_min_nodes = 0
cluster_max_nodes = 1
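
Before provisioning, a quick sanity check on these values can catch simple mistakes. This is a minimal sketch: `validate_cluster_settings` is a hypothetical helper, and its rules are basic consistency checks rather than Azure ML's own validation. The values are repeated so that the block runs on its own.

```python
def validate_cluster_settings(vm_size, min_nodes, max_nodes):
    """Return a list of problems with the cluster settings (empty if none)."""
    problems = []
    if not vm_size:
        problems.append("vm_size must not be empty")
    if min_nodes < 0:
        problems.append("min_nodes must be 0 or greater")
    if max_nodes < 1:
        problems.append("max_nodes must be at least 1")
    if min_nodes > max_nodes:
        problems.append("min_nodes must not exceed max_nodes")
    return problems


# Same values as the cell above; an empty list means no problems were found.
print(validate_cluster_settings('Standard_NC12', 0, 1))
```

Keeping `cluster_min_nodes` at 0 lets the cluster scale down to zero nodes when it is idle.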

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Check if a compute cluster with this name already exists. If so, use it. 
try:
    compute_target = ComputeTarget(workspace=ws, name=compute_name)
    print('Found an existing cluster with this name; using it.')
except ComputeTargetException:
    print('Creating a new cluster with the specified name and parameters...')
    config = AmlCompute.provisioning_configuration(
        vm_size=cluster_vm_size,
        min_nodes=cluster_min_nodes,
        max_nodes=cluster_max_nodes,
    )
    compute_target = ComputeTarget.create(
        workspace=ws,
        name=compute_name,
        provisioning_configuration=config,
    )

    # Block until provisioning finishes or the timeout elapses
    compute_target.wait_for_completion(
        show_output=True,
        min_node_count=None,
        timeout_in_minutes=20,
    )

### Import the Dependencies as Datasets
The endpoint assumes that the databases required for RoseTTAFold are available as registered datasets in the workspace. The data is available in a [public blob storage container](https://ms.portal.azure.com/#blade/Microsoft_Azure_Storage/ContainerMenuBlade/overview/storageAccountId/%2Fsubscriptions%2F48bbc269-ce89-4f6f-9a12-c6f91fcb772d%2FresourceGroups%2Faml1p-rg%2Fproviders%2FMicrosoft.Storage%2FstorageAccounts%2Frosettafold/path/rosettafold-dependencies%2Fdbs%2FUniRef%2F), under these paths:

- rosettafold_weights: "weights/"
- rosettafold_bfd: "dbs/bfd/bfd/"
- rosettafold_UniRef: "dbs/UniRef/"
- rosettafold_pdb: "dbs/pdb/"
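
To make the mapping between dataset names and storage locations concrete, the sketch below builds a blob URL for each path. It uses the account name ("rosettafold") and container name ("rosettafold-dependencies") that appear in this notebook, and it assumes the standard Azure public cloud endpoint format. `blob_url` is a hypothetical helper, and the block makes no network calls.

```python
# Account and container names as used elsewhere in this notebook.
ACCOUNT_NAME = "rosettafold"
CONTAINER_NAME = "rosettafold-dependencies"

# Dataset name -> path inside the container.
DATASET_PATHS = {
    "rosettafold_weights": "weights/",
    "rosettafold_bfd": "dbs/bfd/bfd/",
    "rosettafold_UniRef": "dbs/UniRef/",
    "rosettafold_pdb": "dbs/pdb/",
}


def blob_url(path, account=ACCOUNT_NAME, container=CONTAINER_NAME):
    """Build a blob URL, assuming the Azure public cloud endpoint format."""
    return f"https://{account}.blob.core.windows.net/{container}/{path.lstrip('/')}"


for name, path in DATASET_PATHS.items():
    print(f"{name}\t{blob_url(path)}")
```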

First, connect to the datastore (Azure Blob):

In [None]:
from azureml.core.datastore import Datastore

ds = ws.get_default_datastore()

dependencies_blob = Datastore.register_azure_blob_container(ws, 
                      datastore_name="rosettafold_dependencies", 
                      container_name="rosettafold-dependencies", 
                      account_name="rosettafold", 
                      overwrite=True)

Now create the dependency Datasets, which are references to subdirectories within the above Datastore:

In [None]:
from azureml.core.dataset import Dataset

datastore_sources = [
    ('rosettafold_weights', 'weights/'),
    ('rosettafold_bfd', 'dbs/bfd/bfd/'),
    ('rosettafold_UniRef', 'dbs/UniRef/'),
    ('rosettafold_pdb', 'dbs/pdb/'),
]

for ds_name, ds_path in datastore_sources:
    # Create the dataset as a reference to a subfolder within dependencies_blob
    dataset = Dataset.File.from_files((dependencies_blob, ds_path))

    # Register the dataset with your Azure ML workspace, so that the inference script can easily reference by name
    dataset = dataset.register(workspace=ws, name=ds_name)

Workspace setup is complete! You only have to do this one time (per workspace). Next, we'll specify some data and run the RoseTTAFold algorithm. 

See next notebook: [2-run-experiment.ipynb](2-run-experiment.ipynb)