## 0.0 Create Azure resources

### 0.0.1 Create Azure ML Workspace with Azure Portal

If you don't have a Microsoft Azure subscription, please create one from [this page](https://azure.microsoft.com/en-us/free/).

Once the subscription is ready, create an Azure ML Workspace by following [these instructions](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace?tabs=azure-portal).
While doing so, please note the following values, which are used later in this scenario:

![Azure ML Provisioning](../../documentation/images/aml_provisioning.png)

- `subscription`
- `resource group`
- `region`
- `workspace name` 
- `storage account`

### 0.0.2 Confirm storage key

Please look up the storage account key in the Azure portal by following [this site](https://docs.microsoft.com/en-us/azure/storage/common/storage-account-keys-manage?tabs=azure-portal#view-account-access-keys).


### 0.0.3 Create related resource in Cognitive Service from Azure portal

You need two types of resources:
- Please create a `Speech` resource from [this site](https://docs.microsoft.com/en-us/azure/cognitive-services/cognitive-services-apis-create-account?tabs=speech%2Cwindows#create-a-new-azure-cognitive-services-resource).
- For `language`, please create a `language service` resource from [this site](https://ms.portal.azure.com/#create/Microsoft.CognitiveServicesTextAnalytics).

Note: Both resources must use the same `subscription`, `resource group` and `region` as Azure ML.

After creating them, please note the `keys` and `endpoint` of each resource. For the speech service, please note the `location` as well. For reference, please visit [the site](https://www.youtube.com/watch?v=WZi0fhJtLJI).


### 0.0.4 Authentication

We adopted `CLI` and `managed identity` authentication for the AML workspace. In particular, `managed identity` is used for the compute cluster in Azure ML. Visit [the site](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-setup-authenticatio) if you're interested in the authentication mechanisms in Azure ML.

## 0.1 Set variables in local `.env` file

We noted several variables such as `subscription` in the previous section. We now put them into a `.env` file for the subsequent steps.
Please write them in the following format and save the file locally as `/environment/.env`.

```.env
SUBSCRIPTION_ID=AAA
RESOURCE_GROUP=BBB
REGION=CCC
TENANT_ID=DDD
STORAGE_ACCOUNT=EEE
SECRET_KEY=FFF
  :
```

Note: `AAA`, `BBB`, ... are dummy values; please replace them with your own. The necessary variables are as follows:

| Variable in .env   | Description                                | Section      |
|--------------------|--------------------------------------------|--------------|
| SUBSCRIPTION_ID    | Subscription ID related to Azure account   | 0.0.1        |
| RESOURCE_GROUP     | Resource group name                        | 0.0.1        |
| REGION             | Region of Azure account                    | 0.0.1        |
| TENANT_ID          | Tenant ID of Azure account                 | 0.0.1        |
| WORKSPACE_NAME     | Workspace name of Azure Machine Learning   | 0.0.1        |
| STORAGE_ACCOUNT    | Storage account related to AML             | 0.0.1        |
| SECRET_KEY         | Storage key related to storage account     | 0.0.2        |
| SPEECH_KEY         | Speech key                                 | 0.0.3        |
| LOCATION           | Location related to speech key             | 0.0.3        |
| TEXT_ANALYTICS_KEY | Key for text analytics                     | 0.0.3        |
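
Before moving on, it can help to confirm that no variable from the table is missing or empty. The following is a minimal sketch that assumes the `.env` file uses plain `KEY=VALUE` lines; `parse_env_text` and `missing_keys` are hypothetical helpers written for this illustration and are not part of this repository or of `python-dotenv`.

```python
# Names taken from the table above.
REQUIRED_KEYS = (
    "SUBSCRIPTION_ID", "RESOURCE_GROUP", "REGION", "TENANT_ID",
    "WORKSPACE_NAME", "STORAGE_ACCOUNT", "SECRET_KEY",
    "SPEECH_KEY", "LOCATION", "TEXT_ANALYTICS_KEY",
)


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty, in table order."""
    return [key for key in required if not values.get(key)]


# Dummy content in the same format as the example above.
sample = """SUBSCRIPTION_ID=AAA
RESOURCE_GROUP=BBB
REGION=CCC
"""
print("Missing variables:", missing_keys(parse_env_text(sample)))
```

To check a real file, you could pass its contents, for example `parse_env_text(open(path).read())`, where `path` points to your own `.env` file.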

## 0.2 Python environment

### 0.2.1 Azure ML Compute

Please create an [Azure ML Compute instance](https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-instance#create) with a name of your choice. This notebook is executed on that instance.

### 0.2.2 Library install

Please install necessary libraries as follows:

In [None]:
%pip install -r ../environment/requirements_v2.txt

If needed, please execute `sudo apt-get install libsndfile1` on Ubuntu so that audio files can be processed.
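
One way to judge whether the package is needed is to ask the system loader whether it can locate the library. This is only a rough sketch: `has_system_libsndfile` is a hypothetical helper, and some Python audio packages ship their own copy of the library, so a negative result does not necessarily mean that audio loading will fail.

```python
import ctypes.util


def has_system_libsndfile():
    """Return True if the system loader can locate a libsndfile library."""
    return ctypes.util.find_library("sndfile") is not None


if has_system_libsndfile():
    print("libsndfile was found on this machine.")
else:
    print("libsndfile was not found; consider installing libsndfile1.")
```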

## 0.3 Confirm our environment variables

Please check your environment variables with the following cell.

In [None]:
# Confirm the settings loaded from the .env file.

import os, sys
currentDir = os.path.dirname(os.getcwd())
print(f'Parent of current working directory: {currentDir}')
sys.path.append(currentDir)
sys.path.append('./../')
sys.path.append('././')

from dotenv import load_dotenv, find_dotenv
from common.constants import *

print('Loading environmental variables', load_dotenv(find_dotenv(ENVIORNMENT_FILE)))

SUBSCRIPTION_ID = os.environ.get('SUBSCRIPTION_ID')
RESOURCE_GROUP = os.environ.get('RESOURCE_GROUP')
REGION = os.environ.get('REGION')
TENANT_ID = os.environ.get('TENANT_ID')
WORKSPACE_NAME = os.environ.get('WORKSPACE_NAME')
STORAGE_ACCOUNT = os.environ.get('STORAGE_ACCOUNT')
SECRET_KEY = os.environ.get('SECRET_KEY')
SPEECH_KEY = os.environ.get('SPEECH_KEY')
LOCATION = os.environ.get('LOCATION')
TEXT_ANALYTICS_KEY = os.environ.get('TEXT_ANALYTICS_KEY')

print('---- Check Azure setting ----')
print(f'Subscription ID         : {SUBSCRIPTION_ID}')
print(f'Resource group          : {RESOURCE_GROUP}')
print(f'Region                  : {REGION}')
print(f'Tenant                  : {TENANT_ID}')
print(f'AML Workspace           : {WORKSPACE_NAME}')
print(f'Storage account         : {STORAGE_ACCOUNT}')
print(f'Storage secret key      : {SECRET_KEY}')
print(f'Speech key              : {SPEECH_KEY}')
print(f'Speech location         : {LOCATION}')
print(f'Text analytics key      : {TEXT_ANALYTICS_KEY}')
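
The cell above prints the keys in full, so they end up in the saved notebook output. If you prefer not to expose them, a masked printout is one option. The sketch below uses a hypothetical helper `mask_secret` that is not part of this repository; it shows only the last few characters of a value.

```python
def mask_secret(value, visible=4):
    """Hide all but the last `visible` characters of a secret value."""
    if not value:
        return "<not set>"
    if len(value) <= visible:
        # Too short to reveal anything safely.
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


# Dummy values for illustration only.
print(f"Storage secret key      : {mask_secret('abcdefgh12345678')}")
print(f"Speech key              : {mask_secret(None)}")
```

In the cell above, you could then write `mask_secret(SECRET_KEY)` instead of `SECRET_KEY` in the corresponding `print` call.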

## 0.4 Azure ML Configuration

In [None]:
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Workspace
from azure.identity import DefaultAzureCredential

# get a handle to the workspace
ml_client = MLClient(DefaultAzureCredential(), 
                    subscription_id=SUBSCRIPTION_ID,
                    resource_group_name=RESOURCE_GROUP,
                    workspace_name=WORKSPACE_NAME)
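
Creating the client does not by itself prove that the workspace is reachable. A quick way to confirm it is to fetch the workspace once through the client. The sketch below assumes the client exposes `workspaces.get(name)` as `MLClient` does; `summarize_workspace` is a hypothetical helper, and the stub client only stands in for `ml_client` so that the example runs without Azure access.

```python
from types import SimpleNamespace


def summarize_workspace(client, workspace_name):
    """Fetch the workspace through the client and return identifying fields."""
    workspace = client.workspaces.get(workspace_name)
    return {
        "name": workspace.name,
        "location": workspace.location,
        "resource_group": workspace.resource_group,
    }


# Stand-in for ml_client, used only for illustration (dummy values).
stub_workspace = SimpleNamespace(name="my-workspace", location="CCC", resource_group="BBB")
stub_client = SimpleNamespace(workspaces=SimpleNamespace(get=lambda name: stub_workspace))

print(summarize_workspace(stub_client, "my-workspace"))
```

With the real client from the cell above, the call would be `summarize_workspace(ml_client, WORKSPACE_NAME)`; an authentication or naming problem would then surface as an exception at this point rather than later in the pipeline.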

In [None]:
from common.azureml_configuration_v2 import *

# configure Azure ML services
#-----------------------------
# initialise the Azure ML config class
azuremlConfig = AzureMLConfiguration_v2(workspace=WORKSPACE_NAME
                                    ,tenant_id=TENANT_ID
                                    ,subscription_id=SUBSCRIPTION_ID
                                    ,resource_group=RESOURCE_GROUP
                                    ,location=REGION)

# configure Azure ML workspace
ws = azuremlConfig.configWorkspace()

In [None]:
from common.azureml_configuration_v2 import *

# configure Azure ML services
#-----------------------------
# initialise the Azure ML config class
azuremlConfig = AzureMLConfiguration_v2(workspace=WORKSPACE_NAME
                                    ,tenant_id=TENANT_ID
                                    ,subscription_id=SUBSCRIPTION_ID
                                    ,resource_group=RESOURCE_GROUP
                                    ,location=REGION)

# configure Azure ML workspace
azuremlConfig.configWorkspace()

# configure the azure ML compute 
azuremlConfig.configCompute()

# configure the experiment(s)
azuremlConfig.configExperiment(experiment_name=EXPERIMENT_NAME)

# configure the environment - conda
azuremlConfig.configEnvironment(environment_name=ENVIRONMENT_NAME)

The settings in `azuremlConfig.configCompute()` let the compute cluster use a managed identity to access the AML workspace when executing batch pipelines. Please confirm that the managed identity (=`Principal ID`) is populated, as shown in the red rectangle.

![Managed Identity](../../documentation/images/managed_id_computer_cluster.png)

To provision a system-assigned identity, please follow the steps below.

![Set Managed Identity](../../documentation/images/set_managed_ID.png)

![Set System-Assigned Identity](../../documentation/images/system_managed_ID.png)

After generating the managed identity, you need to assign it the appropriate rights (IAM), such as READ or WRITE. First look up the application in Azure AD, then assign the role on the AML workspace:

- Find application name
    - Go to `Enterprise Applications` in your `Azure AD`:
    
        ![Select Enterprise Applications](../../documentation/images/aad_ea.png)

    - Select `All applications` and input your `Principal ID`, which was generated in AML:
    
        ![Search your application with managed identity](../../documentation/images/all_search_ea.png)

        If you find your application, go to the link:

        ![Go to the link](../../documentation/images/enterprise_application.png)

    - Copy the application name:

        ![Copy application name](../../documentation/images/sp_copy.png)

- Give the appropriate rights to the application:
    - Move to the AML pane and grant the appropriate rights to the application:

        ![Managed Identity](../../documentation/images/iam.png)

    - Use "Select members" to add the role assignment.

        ![Add Role Assignment](../../documentation/images/add_role.png)

## 0.5 (Sub)directory configuration

In [None]:
from common.general_utilities import *

# create a temp directory to store results with sub-folders
#------------------------------------------------------------
# set the use-case
use_case = 'comms-classification'

# set the correct paths and mounting points
dsp_results_folders = f'{RESULTS_PATH}{RESULTS_DSP_PATH}{use_case}/'
transcripts_results_folder = f'{RESULTS_PATH}{RESULTS_TRANSCRIBE_PATH}{use_case}/' 
assessed_results_folder =  f'{RESULTS_PATH}{RESULTS_ASSESSED_PATH}{use_case}/' 

RECORDINGS_DATASET_NAME = f'{RAW_CONTAINER_NAME}-{use_case}'
TRUTH_DATASET_NAME = f'{TRUTH_TRANSCRIPTS}-{use_case}'
ASSESSED_DATASET_NAME = f'{ASSESSED_CONATINER_NAME}-{use_case}'

RECORDINGS_MOUNT_PATH = f'{MOUNT_PATH_ROOT}{use_case}/{RECORDINGS_FOLDER}'
TRUTH_MOUNT_PATH = f'{MOUNT_PATH_ROOT}{use_case}/{TRUTH_TRANSCRIPTED_FOLDER}'
ONTOLOGY_MOUNT_PATH = f'{MOUNT_PATH_ROOT}{use_case}/{ONTOLOGY_FOLDER}'

# create the results directories - based on the use_case
utilConfig = GeneraltUtilities()
utilConfig.createTmpDir(dsp_results_folders)
utilConfig.createTmpDir(transcripts_results_folder)
utilConfig.createTmpDir(assessed_results_folder)
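
The folder layout built above follows the pattern `<results root>/<stage>/<use case>/`. The following sketch reproduces that pattern with the standard library only. The constant values are hypothetical stand-ins for those in `common.constants`, and `create_folders` is only an assumption about what a helper like `createTmpDir` needs to do, not its actual implementation.

```python
import os
import tempfile

# Hypothetical stand-ins for the constants in common.constants.
RESULTS_PATH = os.path.join(tempfile.gettempdir(), "results-demo")
STAGE_FOLDERS = {"dsp": "dsp", "transcribe": "transcribe", "assessed": "assessed"}


def build_results_folders(results_root, stage_folders, use_case):
    """Return one results folder per stage: <root>/<stage folder>/<use case>."""
    return {
        stage: os.path.join(results_root, folder, use_case)
        for stage, folder in stage_folders.items()
    }


def create_folders(folders):
    """Create every folder, including parents; existing folders are kept."""
    for folder in folders:
        os.makedirs(folder, exist_ok=True)


folders = build_results_folders(RESULTS_PATH, STAGE_FOLDERS, "comms-classification")
create_folders(folders.values())
for stage, folder in folders.items():
    print(f"{stage:<12}: {folder}")
```

Using `exist_ok=True` makes the cell safe to re-run, which is convenient when you switch `use_case` and execute the notebook again.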

In [None]:
# configure and register the datastore(s) with Azure ML pipelines
raw_datastore = azuremlConfig.configDataStore(datastore=RAW_DATASTORE_NAME, container_name=RAW_CONTAINER_NAME)
processed_datastore = azuremlConfig.configDataStore(datastore=DSP_DATASTORE_NAME, container_name=DSP_CONATINER_NAME)
transcribed_datastore = azuremlConfig.configDataStore(datastore=TRANSCRIBED_DATASTORE_NAME, container_name=TRANSCRIBED_CONATINER_NAME)
assessed_datastore = azuremlConfig.configDataStore(datastore=ASSESSED_DATASTORE_NAME, container_name=ASSESSED_CONATINER_NAME)

In [None]:
# Make the directories in each container
azuremlConfig.make_directory_in_container(container_name=RAW_CONTAINER_NAME, directory=RECORDINGS_FOLDER)
azuremlConfig.make_directory_in_container(container_name=RAW_CONTAINER_NAME, directory=TRUTH_TRANSCRIPTED_FOLDER)
azuremlConfig.make_directory_in_container(container_name=RAW_CONTAINER_NAME, directory=ONTOLOGY_FOLDER)
azuremlConfig.make_directory_in_container(container_name=ASSESSED_CONATINER_NAME, directory=RESULTS_ASSESSED_PATH)

## 0.6 Upload the sample files

After creating these datastores, please upload the provided sample files to the locations shown in the following tables.

- For training

| container_name     | sub-directory              | file name                                  | contents                  |
|--------------------|----------------------------|--------------------------------------------|---------------------------|
| raw                | recordings                 | xxx.wav                                    | Raw audio data.           |
| raw                | provided-transcripts       | transcripts-truth.csv                      | True transcriptions for raw audio data. |
| raw                | ontology                   | homophone-list.txt                         | A list of word pairs with similar pronunciation but different meanings. The latter word of each pair is domain specific. |
| raw                | ontology                   | key-phrases-to-search.json                 | It defines key phrases to search.|
| raw                | ontology                   | general-ontology.json      | It defines general ontology.|
| raw                | ontology                   | message-protocols.json     | It defines compliant message protocols.|
| raw                | ontology                   | radio-check-ontology.json  | It defines ontology related to radio check.|

- For inference

| container_name     | sub-directory              | file name                                   | contents                 |
|--------------------|----------------------------|---------------------------------------------|--------------------------|
| assessed           | assessed                   | xxx.wav                                     | Audio data for assessment.|


In [None]:
# Prepare the datasets
#------------------------
# register the datasets associated with the datastore - recordings
raw_recordings_datasets = azuremlConfig.configDatasets(datastore=raw_datastore, file_path= RECORDINGS_FOLDER, 
                                            dataset_name=RECORDINGS_DATASET_NAME, description='raw datasets')

# register the datasets associated with the datastore - truth transcription provided
truth_transcribed_datasets = azuremlConfig.configDatasets(datastore=raw_datastore, file_path = TRUTH_TRANSCRIPTED_FOLDER, 
                                            dataset_name=TRUTH_DATASET_NAME, description='truth transcripted datasets')

# register the datasets associated with the datastore - key phrases
key_phrases_datasets = azuremlConfig.configDatasets(datastore=raw_datastore, file_path = ONTOLOGY_FOLDER, 
                                            dataset_name=ONTOLOGY_DATASET_NAME, description='ontology datasets')

# register the datasets associated with the datastore - assessed data
assessed_datasets = azuremlConfig.configDatasets(datastore=assessed_datastore, file_path = RESULTS_ASSESSED_PATH, 
                                            dataset_name=ASSESSED_DATASET_NAME, description='assessed datasets')