# Azure ML Local Run
In this notebook, we create an Azure ML workspace and use it to run an automated ML training experiment on the local machine.

The steps in this notebook are
- [import libraries](#import),
- [set the Azure subscription](#subscription),
- [create an Azure ML workspace](#workspace),
- load the training data and compute sample weights,
- configure an automated ML experiment,
- run the experiment locally, and
- test the best model.

## Imports  <a id='import'></a>

In [1]:
import os
import pandas as pd
from azure.common.credentials import get_cli_profile
from azureml.core import Workspace, Experiment
from azureml.train.estimator import Estimator
from azureml.train.automl import AutoMLConfig
import azureml.core
from get_auth import get_auth
print('azureml.core.VERSION={}'.format(azureml.core.VERSION))

tensorflow module is not present, models based on tensorflow would not work
azureml.core.VERSION=1.0.21


## Azure subscription <a id='subscription'></a>
If you have multiple subscriptions, select the one you want to use. You can also set the name of the resource group in which this tutorial will create resources. *IMPORTANT NOTE:* The last notebook in this example deletes this resource group and all associated resources.

In [2]:
selected_subscription="AG-AzureCAT-AIDanielle-Test-COGSNonProd-IO1685734"
resource_group="hypetuning"

Log in to Azure if you are not already logged in.

In [3]:
%%bash
list=`az account list -o table`
if [ "$list" == '[]' ] || [ "$list" == '' ]; then 
  az login -o table
else
  az account list -o table 
fi

Name                                               CloudName    SubscriptionId                        State    IsDefault
-------------------------------------------------  -----------  ------------------------------------  -------  -----------
AG-AzureCAT-AIDanC-Test-COGSNonProd-IO1685734      AzureCloud   3bcfa59c-82a0-44f9-ac08-b3479370bace  Enabled  False
DEMO - how RepDemo are you                         AzureCloud   fe4d94f0-dc5b-4c09-9b85-863413b0192b  Enabled  False
Edge-ES-CI-Manual                                  AzureCloud   333e402a-65a0-45a9-8e23-867ca146c290  Enabled  False
Cosmos_WDG_Core_BnB_100348                         AzureCloud   dae41bd3-9db4-4b9b-943e-832b57cac828  Enabled  False
Azure Stack Diagnostics CI and Production VaaS     AzureCloud   a8183b2d-7a4c-45e9-8736-dac11b84ff14  Enabled  False
Data Wrangling Preview                             AzureCloud   215613ac-9dfb-488c-be46-c387e999b127  Enabled  False
CAT_Eng                                            Azu

Set the selected subscription as the default.

In [4]:
%%bash -s "$selected_subscription"
az account set --subscription "$1"
az account show -o table

EnvironmentName    IsDefault    Name                                               State    TenantId
-----------------  -----------  -------------------------------------------------  -------  ------------------------------------
AzureCloud         True         AG-AzureCAT-AIDanielle-Test-COGSNonProd-IO1685734  Enabled  72f988bf-86f1-41af-91ab-2d7cd011db47


Get the information for the selected Azure subscription.

In [5]:
az_profile = get_cli_profile()
subscription_id = az_profile.get_subscription_id()

## Create an Azure ML workspace <a id='workspace'></a>
Create the workspace if it does not already exist, or retrieve it if it does, and write its details to `config.json` so that later notebooks can reference it. The first time this cell runs, it can take about a minute.

In [6]:
auth = get_auth()
ws = Workspace.create(name='hypetuning',
                      subscription_id=subscription_id,
                      resource_group=resource_group,
                      create_resource_group=True,
                      exist_ok=True,
                      location='eastus',
                      auth=auth)
ws.write_config()

Trying to create Workspace with CLI Authentication
Wrote the config file config.json to: /data/home/mabou/Source/Repos/MLHyperparameterTuning/aml_config/config.json
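
A later notebook can sanity-check the saved file before asking the SDK to load the workspace (for example with `Workspace.from_config()`). The sketch below uses only the standard library. It assumes the file holds the keys `subscription_id`, `resource_group`, and `workspace_name`; the helper name `read_workspace_config` and the sample values are hypothetical.

```python
import json
import os
import tempfile

# Assumed layout of the workspace config file.
REQUIRED_KEYS = {"subscription_id", "resource_group", "workspace_name"}


def read_workspace_config(path):
    """Load a workspace config file and check that the expected keys exist."""
    with open(path) as f:
        config = json.load(f)
    missing = REQUIRED_KEYS - set(config)
    if missing:
        raise KeyError("config is missing keys: {}".format(sorted(missing)))
    return config


# Demonstration with a throwaway file and placeholder values.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "config.json")
    with open(demo_path, "w") as f:
        json.dump({"subscription_id": "<subscription-id>",
                   "resource_group": "hypetuning",
                   "workspace_name": "hypetuning"}, f)
    demo_config = read_workspace_config(demo_path)

print(demo_config["workspace_name"])
```

In a real notebook, `path` would point at the location reported by `ws.write_config()`.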


## Training data

In [7]:
data_path = "data"
train_path = os.path.join(data_path, "balanced_pairs_train.tsv")
tune_path = os.path.join(data_path, "balanced_pairs_tune.tsv")

In [8]:
train = pd.read_csv(train_path, sep='\t', encoding='latin1')
tune = pd.read_csv(tune_path, sep='\t', encoding='latin1')

In [9]:
train.columns

Index(['Id_x', 'AnswerId_x', 'Text_x', 'Id_y', 'Text_y', 'AnswerId_y', 'Label',
       'n'],
      dtype='object')

In [10]:
feature_columns = ["Text_x", "Text_y"]
label_column = "Label"
group_column = 'Id_x'
answerid_column = 'AnswerId_y'

In [11]:
train_X = train[feature_columns]
train_y = train[label_column].values
tune_X = tune[feature_columns]
tune_y = tune[label_column].values

In [12]:
train_label_counts = train[label_column].value_counts()
train_label_weight = train.shape[0] / (train_label_counts.shape[0] * train_label_counts)
print(train_label_weight)
train_weight = train[label_column].apply(lambda x: train_label_weight[x]).values

0    0.51
1   20.00
Name: Label, dtype: float64


In [13]:
tune_label_counts = tune[label_column].value_counts()
tune_label_weight = tune.shape[0] / (tune_label_counts.shape[0] * tune_label_counts)
print(tune_label_weight)
tune_weight = tune[label_column].apply(lambda x: tune_label_weight[x]).values

0    0.50
1   91.00
Name: Label, dtype: float64
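
The two cells above weight each class by `n_samples / (n_classes * class_count)`, so the rare positive class counts as much in total as the common negative class. The following dependency-free sketch restates that formula on a toy label list; the helper name `balanced_class_weights` and the 39-to-1 label mix are assumptions chosen for illustration, not values read from the data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {label: n_samples / (n_classes * count(label))}."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Toy labels (hypothetical): 39 negatives for every positive.
toy_labels = [0] * 39 + [1]
weights = balanced_class_weights(toy_labels)

# One weight per sample, in the same order as the labels.
sample_weights = [weights[y] for y in toy_labels]

print({label: round(w, 2) for label, w in weights.items()})
print(round(sum(sample_weights), 6))
```

With this weighting, each class contributes the same total weight, and the per-sample weights sum to the number of samples.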


In [14]:
automated_ml_config = AutoMLConfig(task="classification",
                                   primary_metric="accuracy",
                                   X=train_X,
                                   y=train_y,
                                   sample_weight=train_weight,
                                   X_valid=tune_X,
                                   y_valid=tune_y,
                                   sample_weight_valid=tune_weight,
                                   preprocess=True,
                                   iterations=30,
                                   iteration_timeout_minutes=5)
#                                    iterations=96,
#                                    iteration_timeout_minutes=15,
#                                    blacklist_models=["LightGBM"])

In [15]:
exp = Experiment(workspace=ws, name='hypetuning')

In [16]:
local_run = exp.submit(automated_ml_config, show_output=True)
local_run

Running on local machine
Parent Run ID: AutoML_3444614f-ba44-4971-8ca8-f42fcee58c58
********************************************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
SAMPLING %: Percent of the training data to sample.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
********************************************************************************************************************

 ITERATION   PIPELINE                                       SAMPLING %  DURATION      METRIC      BEST
         0   MaxAbsScaler LightGBM                          100.0000    0:00:07       0.5857    0.5857
         1                                                  100.0000    0:00:04          nan    0.5857
ERROR: Run AutoML_3444614f-ba44-4971-8ca8-f42fcee58c58_1 failed

        26   TruncatedSVDWrapper LogisticRegression         100.0000    0:00:30       0.5000    0.7361
        27   TruncatedSVDWrapper LogisticRegression         100.0000    0:00:26       0.7515    0.7515
        28   TruncatedSVDWrapper LogisticRegression         100.0000    0:04:28       0.5021    0.7515
        29   Ensemble                                       100.0000    0:00:24       0.7128    0.7515


Experiment,Id,Type,Status,Details Page,Docs Page
hypetuning,AutoML_3444614f-ba44-4971-8ca8-f42fcee58c58,automl,Completed,Link to Azure Portal,Link to Documentation


In [17]:
best_run = local_run.get_output()

In [18]:
best_run[0]

Experiment,Id,Type,Status,Details Page,Docs Page
hypetuning,AutoML_3444614f-ba44-4971-8ca8-f42fcee58c58_27,,Completed,Link to Azure Portal,Link to Documentation


In [19]:
best_run[1]

Pipeline(memory=None,
     steps=[('datatransformer', DataTransformer(is_onnx_compatible=None, logger=None, task=None)), ('TruncatedSVDWrapper', TruncatedSVDWrapper(n_components=0.5047368421052632, random_state=None)), ('LogisticRegression', LogisticRegression(C=1048.1131341546852, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='lbfgs', tol=0.0001, verbose=0, warm_start=False))])

In [20]:
best_run_weighted = local_run.get_output()
best_run_weighted[1]

Pipeline(memory=None,
     steps=[('datatransformer', DataTransformer(is_onnx_compatible=None, logger=None, task=None)), ('TruncatedSVDWrapper', TruncatedSVDWrapper(n_components=0.5047368421052632, random_state=None)), ('LogisticRegression', LogisticRegression(C=1048.1131341546852, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='lbfgs', tol=0.0001, verbose=0, warm_start=False))])

In [21]:
help(best_run_weighted[1])

Help on Pipeline in module sklearn.pipeline object:

class Pipeline(sklearn.utils.metaestimators._BaseComposition)
 |  Pipeline of transforms with a final estimator.
 |  
 |  Sequentially apply a list of transforms and a final estimator.
 |  Intermediate steps of the pipeline must be 'transforms', that is, they
 |  must implement fit and transform methods.
 |  The final estimator only needs to implement fit.
 |  The transformers in the pipeline can be cached using ``memory`` argument.
 |  
 |  The purpose of the pipeline is to assemble several steps that can be
 |  cross-validated together while setting different parameters.
 |  For this, it enables setting parameters of the various steps using their
 |  names and the parameter name separated by a '__', as in the example below.
 |  A step's estimator may be replaced entirely by setting the parameter
 |  with its name to another estimator, or a transformer removed by setting
 |  to None.
 |  
 |  Read more in the :ref:`User Guide <pipel

## Test the best model
Read in the test data.

In [22]:
test_path = os.path.join(data_path, "balanced_pairs_test.tsv")
test = pd.read_csv(test_path, sep='\t', encoding='latin1')

In [23]:
test_X = test[feature_columns]
test_y = test[label_column]

In [26]:
test['probabilities'] = best_run_weighted[1].predict_proba(test_X)[:, 1]

In [27]:
# Order the testing data by dupe Id and question AnswerId.
test.sort_values([group_column, answerid_column], inplace=True)

# Extract the ordered probabilities.
probabilities = (
    test.probabilities
    .groupby(test[group_column], sort=False)
    .apply(lambda x: tuple(x.values)))

# Get the individual records.
output_columns_x = ['Id_x', 'AnswerId_x', 'Text_x']
test_score = (test[output_columns_x]
              .drop_duplicates()
              .set_index(group_column))
test_score['probabilities'] = probabilities
test_score.reset_index(inplace=True)
test_score.columns = ['Id', 'AnswerId', 'Text', 'probabilities']
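
The reshaping in the cell above turns one row per (question, candidate answer) pair into one tuple of probabilities per question, ordered by candidate `AnswerId`. That ordering is what later lets a position in the tuple stand for a specific answer. Here is a small standard-library sketch of the same idea; the rows are made up and the tuple layout `(question Id, candidate AnswerId, probability)` is an assumption for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows: (question Id, candidate AnswerId, predicted probability).
rows = [
    (2, 103, 0.20), (1, 102, 0.70), (1, 101, 0.10),
    (2, 101, 0.60), (1, 103, 0.20), (2, 102, 0.05),
]

# Sort by question Id, then by candidate AnswerId.
rows.sort(key=itemgetter(0, 1))

# Collect the probabilities of each question, preserving the sorted order.
grouped = {
    question_id: tuple(prob for _, _, prob in group)
    for question_id, group in groupby(rows, key=itemgetter(0))
}

print(grouped)
```

Each tuple lists the probabilities for answers 101, 102, and 103 in that order, regardless of the order in which the rows arrived.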

In [28]:
import numpy as np

def score_rank(scores):
    """Compute the ranks of the scores."""
    return pd.Series(scores).rank(ascending=False)


def label_index(label, label_order):
    """Compute the index of label in label_order."""
    loc = np.where(label == label_order)[0]
    if loc.shape[0] == 0:
        return None
    return loc[0]


def label_rank(label, scores, label_order):
    """Compute the rank of label using the scores."""
    loc = label_index(label, label_order)
    if loc is None:
        return len(scores) + 1
    return score_rank(scores)[loc]
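
To make the ranking logic concrete, the sketch below restates it without pandas or NumPy. It is an illustrative equivalent under stated assumptions, not the notebook's code: ties share the average of their positions (the default behavior of pandas' `rank`), and a label that is missing from `label_order` is ranked one past the last score. The name `descending_ranks` and the toy values are hypothetical.

```python
def descending_ranks(scores):
    """Rank scores from highest (rank 1) to lowest; ties get the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        # Extend the run [pos, end] over all scores equal to the current one.
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1  # mean of 1-based positions
        for i in order[pos:end + 1]:
            ranks[i] = average_rank
        pos = end + 1
    return ranks


def label_rank(label, scores, label_order):
    """Rank of `label`, where scores[i] belongs to label_order[i]."""
    if label not in label_order:
        return len(scores) + 1
    return descending_ranks(scores)[label_order.index(label)]


label_order = [101, 102, 103, 104]   # sorted candidate AnswerIds
scores = (0.1, 0.7, 0.2, 0.7)        # one probability per candidate

print(label_rank(102, scores, label_order))  # tied for first place
print(label_rank(103, scores, label_order))
print(label_rank(999, scores, label_order))  # unknown label
```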

In [29]:
print("Evaluating the model's performance.")

# Collect the ordered AnswerId for computing scores.
labels = sorted(train[answerid_column].unique())
label_order = pd.DataFrame({'label': labels})

# Compute the ranks of the correct answers.
test_score['Ranks'] = test_score.apply(lambda x:
                                       label_rank(x.AnswerId,
                                                  x.probabilities,
                                                  label_order.label),
                                       axis=1)

# Compute the fraction of questions whose correct answer is ranked in the top i.
args_rank = 3
for i in range(1, args_rank+1):
    print('Accuracy @{} = {:.2%}'
          .format(i, (test_score['Ranks'] <= i).mean()))
mean_rank = test_score['Ranks'].mean()
print('Mean Rank {:.4f}'.format(mean_rank))

Evaluating the model's performance.
Accuracy @1 = 17.72%
Accuracy @2 = 25.64%
Accuracy @3 = 33.02%
Mean Rank 33.6322
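
As a reading aid for these metrics: Accuracy @k is the fraction of questions whose correct answer has rank k or better, and Mean Rank is the average rank of the correct answer (lower is better). A minimal sketch on made-up ranks follows; the helper name `accuracy_at_k` and the rank values are hypothetical.

```python
def accuracy_at_k(ranks, k):
    """Fraction of ranks that are less than or equal to k."""
    return sum(rank <= k for rank in ranks) / len(ranks)


# Hypothetical ranks of the correct answer for five questions.
toy_ranks = [1.0, 3.0, 2.0, 7.0, 1.0]

for k in range(1, 4):
    print('Accuracy @{} = {:.2%}'.format(k, accuracy_at_k(toy_ranks, k)))

mean_rank = sum(toy_ranks) / len(toy_ranks)
print('Mean Rank {:.4f}'.format(mean_rank))
```

Accuracy @k can only grow as k increases, which matches the pattern in the output above.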
