# Azure ML Local Run
In this notebook, we create an Azure ML workspace and use it to run the training script locally.

The steps in this notebook are
- [import libraries](#import),
- [set the Azure subscription](#subscription),
- [create an Azure ML workspace](#workspace),
- [create an estimator](#estimator),
- [create an experiment](#experiment),
- [submit the estimator](#submit), and
- [get the results](#results).

## Imports  <a id='import'></a>

In [1]:
import os
import pandas as pd
from azure.common.credentials import get_cli_profile
from azureml.core import Workspace, Experiment
from azureml.train.estimator import Estimator
from azureml.train.automl import AutoMLConfig
import azureml.core
from get_auth import get_auth
print('azureml.core.VERSION={}'.format(azureml.core.VERSION))

tensorflow module is not present, models based on tensorflow would not work
azureml.core.VERSION=1.0.21


## Azure subscription <a id='subscription'></a>
If you have multiple subscriptions, select the one you want to use. You can also set the name of the resource group in which this tutorial will add resources. *IMPORTANT NOTE:* The last notebook in this example will delete this resource group and all associated resources.

In [2]:
selected_subscription="AG-AzureCAT-AIDanielle-Test-COGSNonProd-IO1685734"
resource_group="hypetuning"

Log in to Azure if you are not already logged in.

In [3]:
%%bash
list=`az account list -o table`
if [ "$list" == '[]' ] || [ "$list" == '' ]; then 
  az login -o table
else
  az account list -o table 
fi

Name                                               CloudName    SubscriptionId                        State    IsDefault
-------------------------------------------------  -----------  ------------------------------------  -------  -----------
AG-AzureCAT-AIDanC-Test-COGSNonProd-IO1685734      AzureCloud   3bcfa59c-82a0-44f9-ac08-b3479370bace  Enabled  False
DEMO - how RepDemo are you                         AzureCloud   fe4d94f0-dc5b-4c09-9b85-863413b0192b  Enabled  False
Edge-ES-CI-Manual                                  AzureCloud   333e402a-65a0-45a9-8e23-867ca146c290  Enabled  False
Cosmos_WDG_Core_BnB_100348                         AzureCloud   dae41bd3-9db4-4b9b-943e-832b57cac828  Enabled  False
Azure Stack Diagnostics CI and Production VaaS     AzureCloud   a8183b2d-7a4c-45e9-8736-dac11b84ff14  Enabled  False
Data Wrangling Preview                             AzureCloud   215613ac-9dfb-488c-be46-c387e999b127  Enabled  False
CAT_Eng                                            Azu

Set the selected subscription as the default.

In [4]:
%%bash -s "$selected_subscription"
az account set --subscription "$1"
az account show -o table

EnvironmentName    IsDefault    Name                                               State    TenantId
-----------------  -----------  -------------------------------------------------  -------  ------------------------------------
AzureCloud         True         AG-AzureCAT-AIDanielle-Test-COGSNonProd-IO1685734  Enabled  72f988bf-86f1-41af-91ab-2d7cd011db47


Get the information for the selected Azure subscription.

In [5]:
az_profile = get_cli_profile()
subscription_id = az_profile.get_subscription_id()

## Create an Azure ML workspace <a id='workspace'></a>
Create a workspace if it does not already exist, or recover it if it does, and write its details to `config.json` so that other notebooks can reference it. The first time this is run, it can take about a minute.

In [6]:
auth = get_auth()
ws = Workspace.create(name='hypetuning',
                      subscription_id=subscription_id,
                      resource_group=resource_group,
                      create_resource_group=True,
                      exist_ok=True,
                      location='eastus',
                      auth=auth)
ws.write_config()

Trying to create Workspace with CLI Authentication
Wrote the config file config.json to: /data/home/mabou/Source/Repos/MLHyperparameterTuning/aml_config/config.json


## Training data

In [7]:
data_path = "data"
train_path = os.path.join(data_path, "balanced_pairs_train.tsv")
tune_path = os.path.join(data_path, "balanced_pairs_tune.tsv")

In [8]:
train = pd.read_csv(train_path, sep='\t', encoding='latin1')
tune = pd.read_csv(tune_path, sep='\t', encoding='latin1')

In [9]:
feature_columns = ["Text_x", "Text_y"]
label_column = "Label"
group_column = 'Id_x'
answerid_column = 'AnswerId_y'

In [48]:
train_X = (train.Text_x + ' ' + train.Text_y).values  # train_X = train[feature_columns]
train_y = train[label_column].values
tune_X = (tune.Text_x + ' ' + tune.Text_y).values  # tune_X = tune[feature_columns]
tune_y = tune[label_column].values

In [11]:
train_label_counts = train[label_column].value_counts()
train_label_weight = train.shape[0] / (train_label_counts.shape[0] * train_label_counts)
print(train_label_weight)
train_weight = train[label_column].apply(lambda x: train_label_weight[x]).values

0    0.51
1   20.00
Name: Label, dtype: float64


In [12]:
tune_label_counts = tune[label_column].value_counts()
tune_label_weight = tune.shape[0] / (tune_label_counts.shape[0] * tune_label_counts)
print(tune_label_weight)
tune_weight = tune[label_column].apply(lambda x: tune_label_weight[x]).values

0    0.50
1   91.00
Name: Label, dtype: float64
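
The two cells above weight each label by `n_samples / (n_classes * label_count)`, so rare labels receive proportionally larger weights. The following is a minimal sketch of that formula on toy data; the helper name `balanced_label_weights` and the toy labels are hypothetical and are not part of the notebook.

```python
import pandas as pd


def balanced_label_weights(labels):
    """Return one weight per class: n_samples / (n_classes * class_count)."""
    labels = pd.Series(labels)
    counts = labels.value_counts()
    return len(labels) / (counts.shape[0] * counts)


# Hypothetical toy labels: three negatives and one positive.
toy_labels = pd.Series([0, 0, 0, 1])
toy_label_weight = balanced_label_weights(toy_labels)

# Look up each row's class weight, giving one weight per sample.
toy_sample_weight = toy_labels.map(toy_label_weight).values

print(toy_label_weight.sort_index())
print(toy_sample_weight.sum())  # matches the number of samples, up to rounding
```

With this weighting, each class contributes the same total weight, and the sample weights sum to the number of rows.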


In [50]:
automated_ml_config = AutoMLConfig(task="classification",
                                   primary_metric="accuracy",
                                   X=train_X,
                                   y=train_y,
                                   sample_weight=train_weight,
                                   X_valid=tune_X,
                                   y_valid=tune_y,
                                   sample_weight_valid=tune_weight,
                                   preprocess=True,
                                   iterations=30,
                                   iteration_timeout_minutes=15,
                                   blacklist_models=["LightGBM"])

In [51]:
exp = Experiment(workspace=ws, name='hypetuning')

In [52]:
local_run = exp.submit(automated_ml_config, show_output=True)
local_run

Running on local machine
Parent Run ID: AutoML_6640c010-3387-4494-b79c-8de76dea44a3
********************************************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
SAMPLING %: Percent of the training data to sample.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
********************************************************************************************************************

 ITERATION   PIPELINE                                       SAMPLING %  DURATION      METRIC      BEST
         0   MaxAbsScaler ExtremeRandomTrees                100.0000    0:05:45       0.5000    0.5000
         1   MaxAbsScaler SGD                               100.0000    0:04:10       0.6945    0.6945
         2   MaxAbsScaler RandomForest                      100

Experiment,Id,Type,Status,Details Page,Docs Page
hypetuning,AutoML_6640c010-3387-4494-b79c-8de76dea44a3,automl,Completed,Link to Azure Portal,Link to Documentation


In [53]:
best_run = local_run.get_output()

In [54]:
best_run[0]

Experiment,Id,Type,Status,Details Page,Docs Page
hypetuning,AutoML_6640c010-3387-4494-b79c-8de76dea44a3_29,,Completed,Link to Azure Portal,Link to Documentation


In [55]:
pipeline = best_run[1]
pipeline

Pipeline(memory=None,
     steps=[('datatransformer', DataTransformer(is_onnx_compatible=None, logger=None, task=None)), ('prefittedsoftvotingclassifier', PreFittedSoftVotingClassifier(classification_labels=None,
               estimators=[('ExtremeRandomTrees_8', Pipeline(memory=None,
     steps=[('sparsenormalizer', <automl...6666667, 0.13333333333333333, 0.06666666666666667, 0.26666666666666666, 0.06666666666666667, 0.2]))])

In [25]:
pipeline.get_params(deep=True)

{'memory': None,
 'steps': [('datatransformer',
   DataTransformer(is_onnx_compatible=None, logger=None, task=None)),
  ('MaxAbsScaler', MaxAbsScaler(copy=True)),
  ('SGDClassifierWrapper',
   SGDClassifierWrapper(alpha=4.693930612244897, class_weight=None, eta0=0.01,
              fit_intercept=True, l1_ratio=0.6326530612244897,
              learning_rate='constant', loss='modified_huber', max_iter=1000,
              n_jobs=1, penalty='none', power_t=0, random_state=None,
              tol=0.01))],
 'datatransformer': DataTransformer(is_onnx_compatible=None, logger=None, task=None),
 'MaxAbsScaler': MaxAbsScaler(copy=True),
 'SGDClassifierWrapper': SGDClassifierWrapper(alpha=4.693930612244897, class_weight=None, eta0=0.01,
            fit_intercept=True, l1_ratio=0.6326530612244897,
            learning_rate='constant', loss='modified_huber', max_iter=1000,
            n_jobs=1, penalty='none', power_t=0, random_state=None,
            tol=0.01),
 'datatransformer__is_onnx_compatibl

In [47]:
pipeline.named_steps['datatransformer'].get_engineered_feature_names()[-183:-180]

['Text_x_HashOneHotEncode_8191',
 "Text_y_CharGramCountVec_'innertext' works in ie, but not in firefox. i have some javascript code that works in ie containing the following:  however, it seems that the 'innertext' property does not work in firefox. is there some firefox equivalent? or is there a more generic, cross browser property that can be used?",
 'Text_y_CharGramCountVec_.prop() vs .attr(). so jquery 1.6 has the new function prop().  or in this case do they do the same thing? and if i do have to switch to using prop(), all the old attr() calls will break if i switch to 1.6? update see this fiddle: http://jsfiddle.net/maniator/jpuf2/ the console logs the getattribute as a string, and the attr as a string, but the prop as a cssstyledeclaration, why? and how does that affect my coding in the future?']

## Test the best model
Read in the test data.

In [56]:
test_path = os.path.join(data_path, "balanced_pairs_test.tsv")
test = pd.read_csv(test_path, sep='\t', encoding='latin1')

In [57]:
test_X = (test.Text_x + ' ' + test.Text_y).values  # test[feature_columns]
test_y = test[label_column]

In [59]:
test['probabilities'] = best_run[1].predict_proba(test_X)[:, 1]

In [60]:
# Order the testing data by dupe Id and question AnswerId.
test.sort_values([group_column, answerid_column], inplace=True)

# Extract the ordered probabilities.
probabilities = (
    test.probabilities
    .groupby(test[group_column], sort=False)
    .apply(lambda x: tuple(x.values)))

# Get the individual records.
output_columns_x = ['Id_x', 'AnswerId_x', 'Text_x']
test_score = (test[output_columns_x]
              .drop_duplicates()
              .set_index(group_column))
test_score['probabilities'] = probabilities
test_score.reset_index(inplace=True)
test_score.columns = ['Id', 'AnswerId', 'Text', 'probabilities']
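
The sort-then-group step above can be hard to picture, so here is a small sketch of it on toy data. The column names follow the notebook, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: two questions (Id_x), each paired with two answers.
toy = pd.DataFrame({
    'Id_x': [2, 1, 2, 1],
    'AnswerId_y': [20, 20, 10, 10],
    'probabilities': [0.9, 0.3, 0.1, 0.7],
})

# Sort so that, within each question, scores follow ascending AnswerId_y.
toy = toy.sort_values(['Id_x', 'AnswerId_y'])

# Collapse each question's scores into one tuple, keeping the sorted order.
toy_probabilities = (
    toy.probabilities
    .groupby(toy['Id_x'], sort=False)
    .apply(lambda x: tuple(x.values)))

print(toy_probabilities)
```

Because every tuple is ordered by `AnswerId_y`, position `i` in a tuple refers to the same answer for every question, which is what the rank computation relies on.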

In [61]:
import numpy as np

def score_rank(scores):
    """Compute the ranks of the scores."""
    return pd.Series(scores).rank(ascending=False)


def label_index(label, label_order):
    """Compute the index of label in label_order."""
    loc = np.where(label == label_order)[0]
    if loc.shape[0] == 0:
        return None
    return loc[0]


def label_rank(label, scores, label_order):
    """Compute the rank of label using the scores."""
    loc = label_index(label, label_order)
    if loc is None:
        return len(scores) + 1
    return score_rank(scores)[loc]
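
To illustrate what the helpers above return, here is a condensed sketch that folds them into one function and applies it to toy inputs. The answer ids and scores are hypothetical.

```python
import numpy as np
import pandas as pd


def label_rank(label, scores, label_order):
    """Rank of the label's score (1 = highest); len(scores) + 1 if unknown."""
    loc = np.where(label == label_order)[0]
    if loc.shape[0] == 0:
        return len(scores) + 1
    return pd.Series(scores).rank(ascending=False)[loc[0]]


# Hypothetical answer ids, listed in the same order as the scores.
toy_label_order = pd.Series([10, 20, 30])
toy_scores = (0.1, 0.7, 0.2)

rank_best = label_rank(20, toy_scores, toy_label_order)     # highest score
rank_second = label_rank(30, toy_scores, toy_label_order)   # second highest
rank_missing = label_rank(99, toy_scores, toy_label_order)  # id not in label_order

print(rank_best, rank_second, rank_missing)
```

An answer id that never appears in `label_order` is ranked one position past the last score, so it counts as a miss for every cutoff.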

In [62]:
print("Evaluating the model's performance.")

# Collect the ordered AnswerId for computing scores.
labels = sorted(train[answerid_column].unique())
label_order = pd.DataFrame({'label': labels})

# Compute the ranks of the correct answers.
test_score['Ranks'] = test_score.apply(lambda x:
                                       label_rank(x.AnswerId,
                                                  x.probabilities,
                                                  label_order.label),
                                       axis=1)

# Compute the fraction of questions whose correct answer ranks in the top i.
args_rank = 3
for i in range(1, args_rank+1):
    print('Accuracy @{} = {:.2%}'
          .format(i, (test_score['Ranks'] <= i).mean()))
mean_rank = test_score['Ranks'].mean()
print('Mean Rank {:.4f}'.format(mean_rank))

Evaluating the model's performance.
Accuracy @1 = 19.19%
Accuracy @2 = 28.05%
Accuracy @3 = 33.83%
Mean Rank 18.2631
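
As a sketch of how these metrics follow from the ranks, the example below computes Accuracy @k and the mean rank for a handful of toy ranks. The values are hypothetical and unrelated to the results above.

```python
import pandas as pd

# Hypothetical ranks of the correct answer for four test questions.
toy_ranks = pd.Series([1.0, 3.0, 2.0, 5.0])

# Accuracy @k: share of questions whose correct answer is ranked k or better.
toy_accuracy = {k: (toy_ranks <= k).mean() for k in range(1, 4)}
toy_mean_rank = toy_ranks.mean()

for k, acc in toy_accuracy.items():
    print('Accuracy @{} = {:.2%}'.format(k, acc))
print('Mean Rank {:.4f}'.format(toy_mean_rank))
```

Accuracy @k can only grow as k increases, while the mean rank summarizes how far down the list the correct answer sits on average.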


In [63]:
from sklearn.externals import joblib


In [None]:
run120_path = os.path.join("outputs", "AutoML_ef96bf9d-54e7-4c0e-b365-b2f507ef80d9-120.pkl")
run120 = joblib.load(run120_path)
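
`joblib.load` needs the path of the file to read. The sketch below shows a save and load round trip, assuming the standalone `joblib` package is installed (`sklearn.externals.joblib` was removed in scikit-learn 0.23). The file name and the stand-in object are hypothetical.

```python
import os
import tempfile

import joblib  # standalone package; assumed to be installed

# Hypothetical stand-in for a fitted pipeline.
toy_model = {'name': 'toy', 'weights': [0.1, 0.2]}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy_model.pkl')
    joblib.dump(toy_model, toy_path)  # write the object to disk
    restored = joblib.load(toy_path)  # read it back from the same path

print(restored == toy_model)
```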