*Copyright (c) Microsoft Corporation. All rights reserved.*

*Licensed under the MIT License.*

# Text Classification of MultiNLI Sentences using BERT with Azure ML Pipelines

## 0. Introduction

In this notebook, we fine-tune and evaluate a pretrained [BERT](https://arxiv.org/abs/1810.04805) model on a subset of the [MultiNLI](https://www.nyu.edu/projects/bowman/multinli/) dataset using [AzureML](https://azure.microsoft.com/en-us/services/machine-learning-service/) Pipelines.

We use a [distributed sequence classifier](../../utils_nlp/bert/sequence_classification_distributed.py) that wraps [Hugging Face's PyTorch implementation](https://github.com/huggingface/pytorch-pretrained-BERT) of Google's [BERT](https://github.com/google-research/bert).

This notebook acts as a template to:
1. Process a massive dataset in parallel by dividing the dataset into chunks using [DASK](https://dask.org/).
2. Perform distributed training on AzureML compute on these processed chunks.

We create an [AzureML Pipeline](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-ml-pipelines) for the two steps mentioned above. With this pipeline, the notebook can be scheduled regularly to fine-tune BERT with new data and produce a model that can then be deployed on [Azure Container Instance](https://docs.microsoft.com/en-us/azure/container-service/).

AzureML Pipelines define reusable machine learning workflows that can be used as templates for your machine learning scenarios. Pipelines allow you to optimize your workflow and spend time on machine learning rather than infrastructure. If you are new to the concept of pipelines, [this would be a good place to get started](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines).

In [1]:
import sys
sys.path.append("../../")
import os
import json
import random
import shutil
import pandas as pd

from utils_nlp.azureml import azureml_utils
from utils_nlp.dataset.multinli import get_generator

from sklearn.preprocessing import LabelEncoder
import azureml.core
from azureml.core import Datastore, Experiment,  get_run
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import RunConfiguration
from azureml.core.compute import ComputeTarget,  AmlCompute
from azureml.exceptions import ComputeTargetException
from azureml.data.data_reference import DataReference
from azureml.pipeline.steps import PythonScriptStep
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.widgets import RunDetails
from azureml.train.dnn import PyTorch
from azureml.core.runconfig import MpiConfiguration
from azureml.pipeline.steps import EstimatorStep

print("System version: {}".format(sys.version))
print("Azure ML SDK Version:", azureml.core.VERSION)

System version: 3.6.8 |Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)]
Azure ML SDK Version: 1.0.48


Let's define a few variables before we get started. These variables define the folders where the data resides, the label columns, the label names, and the number of nodes used for training. The batch size and the number of epochs are set in the training script in section 2.1.
We also define the variables for the AzureML workspace, which you can use to create a new workspace. You can ignore these variables if you have a `config.json` in the `.azureml` directory.

In [2]:
LABEL_COL = "genre"
DATA_FOLDER = "../../data/temp"
TRAIN_FOLDER = "../../data/temp/train"
TEST_FOLDER = "../../data/temp/test"
ENCODED_LABEL_COL = "label"
NUM_PARTITIONS = None
LABELS = ['telephone', 'government', 'travel', 'slate', 'fiction']
PROJECT_FOLDER = "../../"
NODE_COUNT = 4

config_path = (
    "./.azureml"
)  # Path to the directory containing config.json with azureml credentials

# Azure resources
subscription_id = "YOUR_SUBSCRIPTION_ID"
resource_group = "YOUR_RESOURCE_GROUP_NAME"  
workspace_name = "YOUR_WORKSPACE_NAME"  
workspace_region = "YOUR_WORKSPACE_REGION" #Possible values eastus, eastus2 and so on.
cluster_name = "pipelines-tc-12"

In this example we use AzureML Pipelines to execute the training workflow. Each preprocessing task is included as a step in the pipeline. For a more detailed walkthrough of pipelines, including getting-started guidelines, see this [notebook](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-getting-started.ipynb). We start with some AzureML-related setup below.

### 0.1 Create a workspace

First, go through the [Configuration](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) notebook to install the Azure Machine Learning Python SDK and create an Azure ML `Workspace`. This will create a config.json file containing the values needed below to create a workspace.

**Note**: you do not need to fill in these values if you have a config.json in the same folder as this notebook

In [3]:
ws = azureml_utils.get_or_create_workspace(
    config_path=config_path,
    subscription_id=subscription_id,
    resource_group=resource_group,
    workspace_name=workspace_name,
    workspace_region=workspace_region,
)

### 0.2 Create a compute target
We create and attach a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training the model. Here we use the AzureML-managed compute target ([AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute)) as our remote training compute resource. Our cluster autoscales from 0 to 8 `STANDARD_NC12` GPU nodes.

Creating and configuring the AmlCompute cluster takes approximately 5 minutes the first time around. Once a cluster with the given configuration is created, it does not need to be created again.

As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Read more about the default limits and how to request more quota [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas).

In [4]:
try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print("Found existing compute target.")
except ComputeTargetException:
    print("Creating a new compute target...")
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_NC12", max_nodes=8
    )

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    compute_target.wait_for_completion(show_output=True)

# use get_status() to get a detailed status for the current AmlCompute.
print(compute_target.get_status().serialize())

Found existing compute target.
{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2019-08-07T16:25:35.196000+00:00', 'errors': None, 'creationTime': '2019-07-25T04:16:20.598768+00:00', 'modifiedTime': '2019-08-05T06:40:12.292030+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 10, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC12'}


## 1. Preprocessing

The pipeline is defined by a series of steps. The first is a `PythonScriptStep` that uses [DASK](https://dask.org/) to load dataframes in partitions, which lets us load and preprocess different sets of data in parallel.

### 1.1 Read Dataset

In [5]:
train_batches = get_generator(DATA_FOLDER, "train", num_batches=NUM_PARTITIONS, batch_size=10e6)
test_batches = get_generator(DATA_FOLDER, "dev_matched", num_batches=NUM_PARTITIONS, batch_size=10e6)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 222k/222k [02:51<00:00, 1.29kKB/s]
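As a rough illustration of reading data in partitions, the sketch below yields fixed-size chunks from any iterable. It is not the implementation of `get_generator`; the helper name `iter_chunks` and the chunk size are assumptions made for this example only.

```python
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield successive lists of at most chunk_size items from rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Ten hypothetical rows split into chunks of four.
chunks = list(iter_chunks(range(10), chunk_size=4))
print([len(chunk) for chunk in chunks])
```

Each chunk can then be written to its own file and processed independently, which is the property the pipeline relies on.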


### 1.2 Preprocess and Tokenize

In the classification task, we use the first sentence only as the text input, and the corresponding genre as the label. Select the examples corresponding to one of the entailment labels (*neutral* in this case) to avoid duplicate rows, as the sentences are not unique, whereas the sentence pairs are.
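To make the filtering step concrete, here is a small sketch with made-up rows. The sentences are hypothetical and only the column names (`sentence1`, `genre`, `gold_label`) come from this notebook. The same premise appears once per entailment label, so keeping a single label leaves one row per premise.

```python
# Hypothetical rows: the first premise is paired with three hypotheses.
rows = [
    {"sentence1": "The train leaves at noon.", "genre": "travel", "gold_label": "entailment"},
    {"sentence1": "The train leaves at noon.", "genre": "travel", "gold_label": "neutral"},
    {"sentence1": "The train leaves at noon.", "genre": "travel", "gold_label": "contradiction"},
    {"sentence1": "Call me back later.", "genre": "telephone", "gold_label": "neutral"},
]

# Keep only the rows with the chosen entailment label.
neutral_rows = [row for row in rows if row["gold_label"] == "neutral"]
premises = [row["sentence1"] for row in neutral_rows]
print(premises)
```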

Once filtered, we encode the labels. To do this, fit a label encoder with the known labels in the MNLI dataset.

In [6]:
os.makedirs(TRAIN_FOLDER, exist_ok=True)
os.makedirs(TEST_FOLDER, exist_ok=True)

labels = LABELS
label_encoder = LabelEncoder()
label_encoder.fit(labels)

num_train_partitions = 0
for batch in train_batches:
    batch = batch[batch["gold_label"]=="neutral"].copy()  # copy so the column assignment below does not act on a view
    batch[ENCODED_LABEL_COL] = label_encoder.transform(batch[LABEL_COL])
    batch.to_csv(TRAIN_FOLDER+"/batch{}.csv".format(str(num_train_partitions)))
    num_train_partitions += 1
    
num_test_partitions = 0
for batch in test_batches:
    batch = batch[batch["gold_label"]=="neutral"].copy()  # copy so the column assignment below does not act on a view
    batch[ENCODED_LABEL_COL] = label_encoder.transform(batch[LABEL_COL])
    batch.to_csv(TEST_FOLDER+"/batch{}.csv".format(str(num_test_partitions)))
    num_test_partitions += 1
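The cell above relies on scikit-learn's `LabelEncoder`, which assigns integer ids following the sorted order of the class names rather than the order in which they are listed in `LABELS`. The plain-Python sketch below mimics that mapping for illustration. The names `label_to_id` and `encode` are introduced here and are not part of the notebook.

```python
LABELS = ["telephone", "government", "travel", "slate", "fiction"]

# Sort the class names first, then number them, as LabelEncoder does.
classes = sorted(set(LABELS))
label_to_id = {label: idx for idx, label in enumerate(classes)}


def encode(genres):
    """Map a list of genre names to their integer ids."""
    return [label_to_id[genre] for genre in genres]


encoded = encode(["travel", "fiction", "telephone"])
print(label_to_id)
print(encoded)
```

Because the ids follow alphabetical order, any code that later maps ids back to names should use the sorted class list.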

Once the data partitions are ready, we upload them to the datastore.

In [7]:
ds = ws.get_default_datastore()
ds.upload(src_dir=TRAIN_FOLDER, target_path="mnli_data/train", overwrite=True, show_progress=False)
ds.upload(src_dir=TEST_FOLDER, target_path="mnli_data/test", overwrite=True, show_progress=False)

$AZUREML_DATAREFERENCE_dcbe2cda6f344f4da1a1d47bbe8de76e

In [8]:
shutil.rmtree(TRAIN_FOLDER)
shutil.rmtree(TEST_FOLDER)

We can now operate on each batch in parallel to tokenize the data and preprocess the tokens. To do this, we create a `PythonScriptStep` below.

In [9]:
%%writefile preprocess.py
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
import argparse
import logging
import os

import pandas as pd

from utils_nlp.models.bert.common import Language, Tokenizer

LABEL_COL = "genre"
TEXT_COL = "sentence1"
LANGUAGE = Language.ENGLISH
TO_LOWER = True
MAX_LEN = 150

logger = logging.getLogger(__name__)


def tokenize(df):
    """Tokenize the text documents and convert them to lists of tokens using the BERT tokenizer.
    Args:
        df(pd.Dataframe): Dataframe with training or test samples

    Returns:

        list: List of lists of tokens for train set.

    """
    tokenizer = Tokenizer(
        LANGUAGE, to_lower=TO_LOWER)
    tokens = tokenizer.tokenize(list(df[TEXT_COL]))

    return tokens


def preprocess(tokens):
    """ Preprocess method that does the following,
            Convert the tokens into token indices corresponding to the BERT tokenizer's vocabulary
            Add the special tokens [CLS] and [SEP] to mark the beginning and end of a sentence
            Pad or truncate the token lists to the specified max length
            Return mask lists that indicate paddings' positions
            Return token type id lists that indicate which sentence the tokens belong to (not needed
            for one-sequence classification)

    Args:
        tokens (list): List of lists of tokens for the train or test set.

    Returns:
        list: List of lists of tokens for train or test set with special tokens added.
        list: Input mask.
    """
    tokenizer = Tokenizer(
        LANGUAGE, to_lower=TO_LOWER)
    tokens, mask, _ = tokenizer.preprocess_classification_tokens(
        tokens, MAX_LEN
    )

    return tokens, mask


parser = argparse.ArgumentParser()
parser.add_argument("--input_data", type=str, help="input data")
parser.add_argument("--output_data", type=str, help="Path to the output file.")

args = parser.parse_args()
input_data = args.input_data
output_data = args.output_data
output_dir = os.path.dirname(os.path.abspath(output_data))

# Make sure the output directory exists before writing to it.
os.makedirs(output_dir, exist_ok=True)
logger.info("%s created", output_dir)

df = pd.read_csv(input_data)
tokens_array = tokenize(df)
tokens_array, mask_array = preprocess(tokens_array)

df['tokens'] = tokens_array
df['mask'] = mask_array

# Filter columns
cols = ['tokens', 'mask', 'label']
df = df[cols]
df.to_csv(output_data, header=False, index=False)
logger.info("Completed")

Writing preprocess.py
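The padding and masking described in the `preprocess` docstring can be sketched in a few lines of plain Python. This is not the code of `Tokenizer.preprocess_classification_tokens`. The helper name `pad_and_mask` is hypothetical, and the ids 101, 102 and 0 are assumed to be the `[CLS]`, `[SEP]` and `[PAD]` ids of the uncased English BERT vocabulary.

```python
def pad_and_mask(token_ids, max_len, cls_id=101, sep_id=102, pad_id=0):
    """Add [CLS]/[SEP], then truncate or pad to max_len and build the input mask."""
    # Reserve two positions for the special tokens.
    body = token_ids[: max_len - 2]
    ids = [cls_id] + body + [sep_id]
    mask = [1] * len(ids)  # 1 marks real tokens
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


short_ids, short_mask = pad_and_mask([7592, 2088], max_len=6)
long_ids, long_mask = pad_and_mask(list(range(1, 11)), max_len=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The first call shows padding of a short sequence, the second shows truncation of a long one.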


In [10]:
preprocess_file = os.path.join(PROJECT_FOLDER,'utils_nlp/models/bert/preprocess.py')
shutil.move('preprocess.py',preprocess_file)

'../../utils_nlp/models/bert/preprocess.py'

Create a conda environment for the steps below.

In [11]:
conda_dependencies = CondaDependencies.create(
    conda_packages=[
        "numpy",
        "scikit-learn",
        "pandas",
    ],
    pip_packages=["azureml-sdk==1.0.43.*", 
                  "torch==1.1", 
                  "tqdm==4.31.1",
                 "pytorch-pretrained-bert>=0.6"],
    python_version="3.6.8",
)
run_config = RunConfiguration(conda_dependencies=conda_dependencies)
run_config.environment.docker.enabled = True

Then create the list of steps that use the preprocess.py created above. We use the output of these steps as input to training in the next section.

In [12]:
processed_train_files = []
processed_test_files = []
ds = ws.get_default_datastore()

for i in range(num_train_partitions):
    input_data = DataReference(
        datastore=ds,
        data_reference_name='train_batch_{}'.format(i),
        path_on_datastore='mnli_data/train/batch{}.csv'.format(i),
        overwrite=False,
    )

    output_data = PipelineData(
        name="train{}".format(i),
        datastore=ds,
        output_path_on_compute='mnli_data/processed_train/batch{}.csv'.format(i),
    )

    step = PythonScriptStep(
        name='preprocess_step_train_{}'.format(i),
        arguments=["--input_data", input_data, "--output_data", output_data],
        script_name='utils_nlp/models/bert/preprocess.py',
        inputs=[input_data],
        outputs=[output_data],
        source_directory=PROJECT_FOLDER,
        compute_target=compute_target,
        runconfig=run_config,
        allow_reuse=False,
    )

    processed_train_files.append(output_data)

for i in range(num_test_partitions):
    input_data = DataReference(
        datastore=ds,
        data_reference_name='test_batch_{}'.format(i),
        path_on_datastore='mnli_data/test/batch{}.csv'.format(i),
        overwrite=False,
    )

    output_data = PipelineData(
        name="test{}".format(i),
        datastore=ds,
        output_path_on_compute='mnli_data/processed_test/batch{}.csv'.format(i),
    )

    step = PythonScriptStep(
        name='preprocess_step_test_{}'.format(i),
        arguments=["--input_data", input_data, "--output_data", output_data],
        script_name='utils_nlp/models/bert/preprocess.py',
        inputs=[input_data],
        outputs=[output_data],
        source_directory=PROJECT_FOLDER,
        compute_target=compute_target,
        runconfig=run_config,
        allow_reuse=False,
    )

    processed_test_files.append(output_data)

## 2. Train and Score

Once the data is processed and available on the datastore, we train the classifier using the training examples. This involves fine-tuning the BERT Transformer and learning a linear classification layer on top of it. After training is complete, we score the performance of the model on the test dataset.

Training is distributed using AzureML's support for MPI with Horovod.

**Please note** that training requires a GPU-enabled cluster in AzureML Compute. We suggest using NC12. If you would like to change the GPU configuration, please change the `NUM_GPUS` variable accordingly.


### 2.1 Setup training script

In [13]:
%%writefile train.py
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.

import argparse
import json
import logging
import os
import torch

from sklearn.metrics import classification_report

from utils_nlp.common.timer import Timer
from utils_nlp.models.bert.common import Language, get_dataset_multiple_files
from utils_nlp.models.bert.sequence_classification_distributed import (
    BERTSequenceClassifier,
)

BATCH_SIZE = 32
NUM_GPUS = 2
NUM_EPOCHS = 1
LABELS = ["telephone", "government", "travel", "slate", "fiction"]
OUTPUT_DIR = "./outputs/"

logger = logging.getLogger(__name__)

parser = argparse.ArgumentParser()
parser.add_argument(
    "--train_files",
    nargs="+",
    default=[],
    help="List of file paths to all the files in train dataset.",
)

parser.add_argument(
    "--test_files",
    nargs="+",
    default=[],
    help="List of file paths to all the files in test dataset.",
)

args = parser.parse_args()
train_files = [file.strip() for file in args.train_files]
test_files = [file.strip() for file in args.test_files]

# Handle square brackets from train list
train_files[0] = train_files[0][1:]
train_files[len(train_files) - 1] = train_files[len(train_files) - 1][:-1]
train_dataset = get_dataset_multiple_files(train_files)

# Handle square brackets from test list
test_files[0] = test_files[0][1:]
test_files[len(test_files) - 1] = test_files[len(test_files) - 1][:-1]
test_dataset = get_dataset_multiple_files(test_files)

# Train
classifier = BERTSequenceClassifier(
    language=Language.ENGLISH, num_labels=len(LABELS), use_distributed=True
)

# Create data loaders.
kwargs = (
    {"num_workers": NUM_GPUS, "pin_memory": True} if torch.cuda.is_available() else {}
)
train_data_loader = classifier.create_data_loader(
    train_dataset, batch_size=BATCH_SIZE, **kwargs
)
test_data_loader = classifier.create_data_loader(
    test_dataset, batch_size=BATCH_SIZE, mode="test", **kwargs
)

# Create optimizer
num_examples = len(train_dataset)
num_batches = int(num_examples / BATCH_SIZE)
num_train_optimization_steps = num_batches * NUM_EPOCHS
optimizer = classifier.create_optimizer(num_train_optimization_steps)

with Timer() as t:
    for epoch in range(1, NUM_EPOCHS + 1):
        classifier.fit(
            train_data_loader,
            epoch=epoch,
            bert_optimizer=optimizer,
            num_gpus=NUM_GPUS,
            num_epochs=NUM_EPOCHS,
        )

# Predict
preds, labels_test = classifier.predict(test_data_loader, num_gpus=NUM_GPUS)

# Evaluate
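# The label ids were assigned by LabelEncoder in sorted order of the genre names,
# so the target names must be sorted to line up with the ids.
results = classification_report(
    labels_test, preds, target_names=sorted(LABELS), output_dict=True
)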

# Write out results.
classifier.save_model()
result_file = os.path.join(OUTPUT_DIR, "results.json")
with open(result_file, "w+") as fp:
    json.dump(results, fp)

Writing train.py
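The training script strips a leading `[` and a trailing `]` because the file lists are passed in as the string form of a Python list. The exact tokens that reach the script are not shown in this notebook, so the input below is hypothetical. Under the assumption that tokens may also carry commas, a slightly more defensive variant could look like this (the helper `parse_file_list` is introduced here for illustration):

```python
import argparse


def parse_file_list(tokens):
    """Strip list punctuation from each token and drop empty entries."""
    cleaned = [token.strip().strip("[],") for token in tokens]
    return [token for token in cleaned if token]


parser = argparse.ArgumentParser()
parser.add_argument("--train_files", nargs="+", default=[])

# Hypothetical argv, shaped like a stringified list split on whitespace.
args = parser.parse_args(["--train_files", "[/mnt/a.csv,", "/mnt/b.csv]"])
train_files = parse_file_list(args.train_files)
print(train_files)
```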


In [14]:
train_file = os.path.join(PROJECT_FOLDER,'utils_nlp/models/bert/train.py')
shutil.move('train.py',train_file)

'../../utils_nlp/models/bert/train.py'
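The training script derives the number of optimization steps from the dataset size, the batch size and the number of epochs. The arithmetic is sketched below with a hypothetical dataset size of 1,000 examples. Note that `int(num_examples / BATCH_SIZE)` rounds down, so a final partial batch is not counted.

```python
import math

BATCH_SIZE = 32
NUM_EPOCHS = 1
num_examples = 1000  # hypothetical; the script uses len(train_dataset)

# Same arithmetic as in train.py: full batches only.
num_batches = int(num_examples / BATCH_SIZE)
num_train_optimization_steps = num_batches * NUM_EPOCHS

# For comparison, the count when the last partial batch is included.
num_batches_with_partial = math.ceil(num_examples / BATCH_SIZE)
print(num_train_optimization_steps, num_batches_with_partial)
```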

### 2.2 Create a PyTorch Estimator

We create a PyTorch Estimator using the AzureML SDK and additionally define an `EstimatorStep` to run it on AzureML Pipelines.

The Azure ML SDK's PyTorch Estimator allows us to submit PyTorch training jobs for both single-node and distributed runs. For more information on the PyTorch estimator, refer [here](https://docs.microsoft.com/azure/machine-learning/service/how-to-train-pytorch).

This Estimator specifies that the training script will run on 4 nodes, with 2 workers per node. To execute a distributed run on GPUs, we set `use_gpu=True` and pass an `MpiConfiguration` as `distributed_training` so that the run uses MPI/Horovod. PyTorch, Horovod, and other necessary dependencies are installed automatically.
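As a quick sanity check on the configuration, the total number of training processes is the node count multiplied by the processes per node. The sketch below also shows the resulting global batch size under the assumption that every process consumes its own batch of `BATCH_SIZE` examples per step. That assumption is not confirmed by this notebook.

```python
NODE_COUNT = 4
PROCESS_COUNT_PER_NODE = 2
BATCH_SIZE = 32  # per-process batch size, as set in train.py

# Total number of MPI processes taking part in training.
world_size = NODE_COUNT * PROCESS_COUNT_PER_NODE

# Assumption: each process handles one batch per optimization step.
global_batch_size = BATCH_SIZE * world_size
print(world_size, global_batch_size)
```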

In [15]:
estimator = PyTorch(source_directory=PROJECT_FOLDER,
                    compute_target=compute_target,
                    entry_script='utils_nlp/models/bert/train.py',
                    node_count= NODE_COUNT,
                    distributed_training=MpiConfiguration(),
                    process_count_per_node=2,
                    use_gpu=True,
                    conda_packages=['scikit-learn=0.20.3', 'numpy>=1.16.0', 'pandas'],
                    pip_packages=["tqdm==4.31.1","pytorch-pretrained-bert>=0.6"]
                   )



In [16]:
inputs = processed_train_files + processed_test_files

est_step = EstimatorStep(name="Estimator-Train", 
                         estimator=estimator, 
                         estimator_entry_script_arguments=[
                             '--train_files',  str(processed_train_files),
                             '--test_files', str(processed_test_files)],
                         inputs = inputs,
                         runconfig_pipeline_params=None, 
                         compute_target=compute_target)

### 2.3 Submit the pipeline

The model is fine-tuned on AML Compute and takes about **45 minutes** to train. The total time to run the pipeline is around **1 hour 30 minutes** with the default value `NUM_EPOCHS = 1`.

In [17]:
pipeline = Pipeline(workspace=ws, steps=[est_step])
experiment = Experiment(ws, 'NLP-TC-BERT-distributed')
pipeline_run = experiment.submit(pipeline)

Created step Estimator-Train [4f513673][55d2761c-0eb5-4f8d-a088-50f28b083019], (This step will run and generate new outputs)
Created step preprocess_step_train_0 [0380ae72][5f2e9315-f2a5-460e-a55d-5dc7bea20a16], (This step will run and generate new outputs)
Created step preprocess_step_train_1 [cb103698][6e88e8a8-2fb6-4f05-b451-5e2e32f485fd], (This step will run and generate new outputs)
Created step preprocess_step_train_2 [138598eb][a77d37a2-981d-408c-b722-96d2e371cb8b], (This step will run and generate new outputs)
Created step preprocess_step_train_3 [5ad9bb9f][1e34b695-788a-4718-bd9b-1166b5713291], (This step will run and generate new outputs)
Created step preprocess_step_train_4 [1b60e61a][f76c2810-f15e-4ac4-922f-4aecf370f719], (This step will run and generate new outputs)
Created step preprocess_step_train_5 [1c347740][e61feaf4-949d-4d99-beb8-24eba6f85c00], (This step will run and generate new outputs)
Created step preprocess_step_train_6 [4ee2963e][9dae8ba8-800a-473d-9740-fe9f7

Using data reference train_batch_7 for StepId [0ad87574][d2cb8ea4-4650-4143-85cc-3a50b2d92fb3], (Consumers of this data are eligible to reuse prior runs.)
Using data reference train_batch_8 for StepId [d18f48c6][0b8ae803-5f49-4361-9c27-156a4f69486b], (Consumers of this data are eligible to reuse prior runs.)
Using data reference train_batch_9 for StepId [0a7990bf][2d50bd14-2fb0-4742-8805-3bd1dc19c3c3], (Consumers of this data are eligible to reuse prior runs.)
Using data reference train_batch_10 for StepId [cad0992d][cba68341-2c4a-432e-a4bc-38af276684a6], (Consumers of this data are eligible to reuse prior runs.)
Using data reference train_batch_11 for StepId [1dc644aa][f7250703-3bad-4db6-9c4a-e68c98565f0a], (Consumers of this data are eligible to reuse prior runs.)
Using data reference train_batch_12 for StepId [b0c56b95][1e833559-7b27-422d-ad18-c7b8421af99a], (Consumers of this data are eligible to reuse prior runs.)
Using data reference train_batch_13 for StepId [a6c27383][c28e0d83-
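The pipeline above lists only the estimator step, yet the log shows the preprocessing steps being created as well. This is because the steps are linked through their data dependencies: every step that produces an input of a listed step is pulled into the graph. The toy sketch below illustrates the idea with plain dictionaries. It is not how the AzureML SDK is implemented, and `collect_steps` is a name made up for this example.

```python
def collect_steps(final_step, producers, inputs_of):
    """Return the steps needed for final_step, producers before consumers."""
    ordered, seen = [], set()

    def visit(step):
        if step in seen:
            return
        seen.add(step)
        for data in inputs_of.get(step, []):
            # Data without a producer comes straight from the datastore.
            if data in producers:
                visit(producers[data])
        ordered.append(step)

    visit(final_step)
    return ordered


# Which step writes each intermediate output.
producers = {
    "train0": "preprocess_step_train_0",
    "test0": "preprocess_step_test_0",
}
# Which data each step reads.
inputs_of = {
    "Estimator-Train": ["train0", "test0"],
    "preprocess_step_train_0": ["train_batch_0"],
    "preprocess_step_test_0": ["test_batch_0"],
}

order = collect_steps("Estimator-Train", producers, inputs_of)
print(order)
```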

In [18]:
RunDetails(pipeline_run).show()

_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', …

In [19]:
# If you would like to cancel the job for any reason, uncomment the code below.
#pipeline_run.cancel()

In [20]:
#wait for the run to complete before continuing in the notebook
pipeline_run.wait_for_completion()

PipelineRunId: 6c289dac-5da1-4a28-b112-e4725763c85f
Link to Portal: https://mlworkspace.azure.ai/portal/subscriptions/15ae9cb6-95c1-483d-a0e3-b1a1a3b06324/resourceGroups/nlprg/providers/Microsoft.MachineLearningServices/workspaces/MAIDAPTest/experiments/NLP-TC-BERT-distributed/runs/6c289dac-5da1-4a28-b112-e4725763c85f
PipelineRun Status: NotStarted
PipelineRun Status: Running


StepRunId: 0cde738f-16f5-4e27-a5c2-c418d3100678
Link to Portal: https://mlworkspace.azure.ai/portal/subscriptions/15ae9cb6-95c1-483d-a0e3-b1a1a3b06324/resourceGroups/nlprg/providers/Microsoft.MachineLearningServices/workspaces/MAIDAPTest/experiments/NLP-TC-BERT-distributed/runs/0cde738f-16f5-4e27-a5c2-c418d3100678
StepRun( preprocess_step_train_0 ) Status: NotStarted
StepRun( preprocess_step_train_0 ) Status: Running

Streaming azureml-logs/70_driver_log.txt
Starting the daemon thread to refresh tokens in background for process with pid = 131

  0%|          | 0/231508 [00:00<?, ?B/s]
100%|██████████| 231508/231

Link to Portal: https://mlworkspace.azure.ai/portal/subscriptions/15ae9cb6-95c1-483d-a0e3-b1a1a3b06324/resourceGroups/nlprg/providers/Microsoft.MachineLearningServices/workspaces/MAIDAPTest/experiments/NLP-TC-BERT-distributed/runs/b323d9c2-d05f-4df7-9847-bce2e5b58376
StepRun( preprocess_step_train_20 ) Status: Running

Streaming azureml-logs/70_driver_log.txt
Starting the daemon thread to refresh tokens in background for process with pid = 145

  0%|          | 0/231508 [00:00<?, ?B/s]
100%|██████████| 231508/231508 [00:00<00:00, 18173590.31B/s]

  0%|          | 0/2608 [00:00<?, ?it/s]
 11%|█▏        | 298/2608 [00:00<00:00, 2977.26it/s]
 22%|██▏       | 561/2608 [00:00<00:00, 2862.20it/s]
 32%|███▏      | 840/2608 [00:00<00:00, 2836.85it/s]
 43%|████▎     | 1128/2608 [00:00<00:00, 2844.08it/s]
 55%|█████▍    | 1425/2608 [00:00<00:00, 2879.33it/s]
 66%|██████▌   | 1709/2608 [00:00<00:00, 2865.39it/s]
 77%|███████▋  | 2014/2608 [00:00<00:00, 2916.09it/s]
 88%|████████▊ | 2293/2608 [00:

{'runId': 'd1c0c18d-246a-46e5-a2a1-98ddbc19cb6a', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:07:46.236295Z', 'endTimeUtc': '2019-08-07T18:11:35.414903Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_14', '--output_data', '$AZUREML_DATAREFERENCE_train14'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_14': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch14.csv',

{'runId': '54cd84d3-79a4-4d33-93ed-22dc6e20b18d', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:07:39.935248Z', 'endTimeUtc': '2019-08-07T18:11:41.603397Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_17', '--output_data', '$AZUREML_DATAREFERENCE_train17'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_17': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch17.csv',

{'runId': '71e622a3-5941-4bd0-ac4a-e057aa58bc1a', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:07:36.826858Z', 'endTimeUtc': '2019-08-07T18:11:53.243834Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_3', '--output_data', '$AZUREML_DATAREFERENCE_train3'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_3': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch3.csv', 'pa

{'runId': '508f1d59-7878-41e8-bdf0-fa5da0f6fbc7', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:07:36.087866Z', 'endTimeUtc': '2019-08-07T18:11:52.516104Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_4', '--output_data', '$AZUREML_DATAREFERENCE_train4'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_4': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch4.csv', 'pa

{'runId': 'ca09ab9b-bb85-4222-a8ba-1ccec03d0a0f', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:07:35.857215Z', 'endTimeUtc': '2019-08-07T18:11:52.669786Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_45', '--output_data', '$AZUREML_DATAREFERENCE_train45'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_45': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch45.csv',

{'runId': '402e9c23-4677-4bab-a3b8-b1acb18fd69a', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:07:32.044993Z', 'endTimeUtc': '2019-08-07T18:11:29.808272Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_29', '--output_data', '$AZUREML_DATAREFERENCE_train29'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_29': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch29.csv',

{'runId': 'e09973d9-2651-4b93-a58d-77afc4e7407e', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:07:31.614721Z', 'endTimeUtc': '2019-08-07T18:12:00.863377Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_5', '--output_data', '$AZUREML_DATAREFERENCE_train5'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_5': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch5.csv', 'pa

{'runId': '05e4adc6-3b4f-4e81-a973-6ff00fd53d54', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:07:29.372094Z', 'endTimeUtc': '2019-08-07T18:11:59.394168Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_8', '--output_data', '$AZUREML_DATAREFERENCE_train8'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_8': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch8.csv', 'pa

{'runId': '6dc1942a-153b-4780-b68e-593612e44f16', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:07:27.277994Z', 'endTimeUtc': '2019-08-07T18:11:56.52078Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_42', '--output_data', '$AZUREML_DATAREFERENCE_train42'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_42': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch42.csv', 

{'runId': 'e8d01185-cd26-4e9d-a559-f80fc8c26089', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:01:10.956164Z', 'endTimeUtc': '2019-08-07T18:07:35.419473Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_25', '--output_data', '$AZUREML_DATAREFERENCE_train25'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_25': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch25.csv',

{'runId': '119e543e-41fd-45b7-895c-8906f93a99f9', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:01:09.305913Z', 'endTimeUtc': '2019-08-07T18:07:44.17468Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_21', '--output_data', '$AZUREML_DATAREFERENCE_train21'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_21': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch21.csv', 

{'runId': 'a0fccf0b-9848-4d12-8570-2113ebb87cb4', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:01:07.927756Z', 'endTimeUtc': '2019-08-07T18:07:31.72495Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_2', '--output_data', '$AZUREML_DATAREFERENCE_train2'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_2': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch2.csv', 'pat

{'runId': '22b15029-5762-45a3-864a-7ee29185cb75', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:01:01.102188Z', 'endTimeUtc': '2019-08-07T18:07:42.735963Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_43', '--output_data', '$AZUREML_DATAREFERENCE_train43'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_43': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch43.csv',

{'runId': '19279aa4-49cb-43e3-b32f-19fd5b113445', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:00:59.982214Z', 'endTimeUtc': '2019-08-07T18:07:31.757827Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_24', '--output_data', '$AZUREML_DATAREFERENCE_train24'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_24': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch24.csv',

{'runId': '5682ed9b-e6ed-4b4f-88b1-8eae0feda573', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:00:59.871991Z', 'endTimeUtc': '2019-08-07T18:08:01.812385Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_15', '--output_data', '$AZUREML_DATAREFERENCE_train15'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_15': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch15.csv',

{'runId': '68494b3f-5111-4d5a-a3b2-d9acf827c2ac', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:00:58.926538Z', 'endTimeUtc': '2019-08-07T18:07:37.142827Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_16', '--output_data', '$AZUREML_DATAREFERENCE_train16'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_16': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch16.csv',

{'runId': '20edb9f8-d0f0-4bf7-8af1-634cd3f4eab4', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:00:58.722266Z', 'endTimeUtc': '2019-08-07T18:07:37.454449Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_1', '--output_data', '$AZUREML_DATAREFERENCE_train1'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_1': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch1.csv', 'pa

{'runId': '6c2e008f-21d5-48a0-bd12-bcc51bac4ea0', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:00:57.702249Z', 'endTimeUtc': '2019-08-07T18:07:36.66412Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_test_batch_1', '--output_data', '$AZUREML_DATAREFERENCE_test1'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'test_batch_1': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/test/batch1.csv', 'pathOnC

{'runId': '3dd70301-3951-485d-80b2-8e70bbeae1fe', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:16:23.606421Z', 'endTimeUtc': '2019-08-07T18:19:36.568428Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_49', '--output_data', '$AZUREML_DATAREFERENCE_train49'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_49': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch49.csv',

{'runId': 'd9e57ee0-3d0b-40a5-889d-f7e115fb6a68', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:15:38.512873Z', 'endTimeUtc': '2019-08-07T18:19:57.408098Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_44', '--output_data', '$AZUREML_DATAREFERENCE_train44'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_44': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch44.csv',

{'runId': '32e85836-72ff-4ba1-b6df-3e25b15f50ef', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:19:30.901583Z', 'endTimeUtc': '2019-08-07T18:22:43.616129Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_27', '--output_data', '$AZUREML_DATAREFERENCE_train27'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_27': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch27.csv',

{'runId': '9c8f2c7b-80cf-457a-9c5a-af9e4c3b44ff', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:16:04.900905Z', 'endTimeUtc': '2019-08-07T18:19:34.304394Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_47', '--output_data', '$AZUREML_DATAREFERENCE_train47'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_47': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch47.csv',

{'runId': '1eef5b91-bd98-4850-8547-0b0212aee7ed', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:11:59.167554Z', 'endTimeUtc': '2019-08-07T18:16:12.178699Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_36', '--output_data', '$AZUREML_DATAREFERENCE_train36'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_36': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch36.csv',

{'runId': '2c814972-663c-42e7-8966-4216f4fc57f0', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:11:47.504536Z', 'endTimeUtc': '2019-08-07T18:16:07.747601Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_46', '--output_data', '$AZUREML_DATAREFERENCE_train46'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_46': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch46.csv',

{'runId': '43d57ed6-0802-483b-bc63-7b9e556e5695', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:15:38.627888Z', 'endTimeUtc': '2019-08-07T18:19:26.330089Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_26', '--output_data', '$AZUREML_DATAREFERENCE_train26'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_26': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch26.csv',

{'runId': '41198ed1-1800-4e48-a27d-e5f8f40a2cb4', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:19:12.987364Z', 'endTimeUtc': '2019-08-07T18:22:49.209059Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_12', '--output_data', '$AZUREML_DATAREFERENCE_train12'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_12': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch12.csv',

{'runId': '88c3d18e-bd69-4e22-97e8-530a1c08183c', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:16:05.945915Z', 'endTimeUtc': '2019-08-07T18:19:36.374037Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_48', '--output_data', '$AZUREML_DATAREFERENCE_train48'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_48': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch48.csv',

{'runId': '30771443-4e49-4cee-9162-41252f31ca64', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:15:59.765266Z', 'endTimeUtc': '2019-08-07T18:19:28.919649Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_19', '--output_data', '$AZUREML_DATAREFERENCE_train19'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_19': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch19.csv',

{'runId': '6ceaf977-fd50-4d7f-bef2-62c7eccc9a81', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:19:35.769771Z', 'endTimeUtc': '2019-08-07T18:22:50.005524Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_23', '--output_data', '$AZUREML_DATAREFERENCE_train23'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_23': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch23.csv',

{'runId': 'c6b71417-42fc-42b6-b4ff-4ec41a6ef9d3', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:19:40.922397Z', 'endTimeUtc': '2019-08-07T18:22:55.499484Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_35', '--output_data', '$AZUREML_DATAREFERENCE_train35'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_35': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch35.csv',

{'runId': '452392a3-5138-45d0-bf22-29a5b556d18c', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:16:07.869799Z', 'endTimeUtc': '2019-08-07T18:19:47.465454Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_37', '--output_data', '$AZUREML_DATAREFERENCE_train37'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_37': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch37.csv',

{'runId': '7c161e03-ea44-4e7c-9482-56fd42fb2ec5', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:19:26.773259Z', 'endTimeUtc': '2019-08-07T18:22:39.406471Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_22', '--output_data', '$AZUREML_DATAREFERENCE_train22'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_22': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch22.csv',

{'runId': 'b3944a0d-09cb-47f9-b4f3-dc40951be0ab', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:19:29.524117Z', 'endTimeUtc': '2019-08-07T18:22:42.673243Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_30', '--output_data', '$AZUREML_DATAREFERENCE_train30'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_30': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch30.csv',

{'runId': 'e33b58f0-16d0-4fe8-a58d-135f86d27f35', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:22:39.400775Z', 'endTimeUtc': '2019-08-07T18:25:21.326544Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_34', '--output_data', '$AZUREML_DATAREFERENCE_train34'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_34': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch34.csv',

{'runId': 'a6378446-b516-435f-915c-815b3671dfc8', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:22:36.865885Z', 'endTimeUtc': '2019-08-07T18:25:47.106196Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_10', '--output_data', '$AZUREML_DATAREFERENCE_train10'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_10': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch10.csv',

{'runId': 'e07db2d6-39ff-43b3-9dea-a96ad237d630', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:19:23.993213Z', 'endTimeUtc': '2019-08-07T18:22:37.605467Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_32', '--output_data', '$AZUREML_DATAREFERENCE_train32'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_32': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch32.csv',

{'runId': '260a0179-afd2-4060-84b9-285fab51f7ab', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:16:07.901006Z', 'endTimeUtc': '2019-08-07T18:19:38.391084Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_41', '--output_data', '$AZUREML_DATAREFERENCE_train41'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_41': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch41.csv',

{'runId': 'db828e3c-a42f-466e-ba15-45a729522b3b', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:19:47.272966Z', 'endTimeUtc': '2019-08-07T18:22:38.306038Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_18', '--output_data', '$AZUREML_DATAREFERENCE_train18'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_18': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch18.csv',

{'runId': 'd2397228-50c1-4b42-9085-dadce854eebd', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:19:25.021708Z', 'endTimeUtc': '2019-08-07T18:22:37.093482Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_31', '--output_data', '$AZUREML_DATAREFERENCE_train31'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_31': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch31.csv',

{'runId': 'b76ae46d-a92f-48a9-be96-80d6f1b38893', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:19:26.402131Z', 'endTimeUtc': '2019-08-07T18:22:46.104557Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_38', '--output_data', '$AZUREML_DATAREFERENCE_train38'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_38': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch38.csv',

{'runId': 'd27b4d61-f39c-4779-9083-804fda8d06aa', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:11:35.373134Z', 'endTimeUtc': '2019-08-07T18:15:36.807514Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_7', '--output_data', '$AZUREML_DATAREFERENCE_train7'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_7': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch7.csv', 'pa

{'runId': '5f63748e-6f0d-4e2a-83fe-131b37e2435b', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:15:43.19059Z', 'endTimeUtc': '2019-08-07T18:19:36.343885Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_33', '--output_data', '$AZUREML_DATAREFERENCE_train33'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_33': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch33.csv', 

{'runId': '8e5f01c0-1ed1-4272-bb2c-9c75bffca46d', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:11:21.940422Z', 'endTimeUtc': '2019-08-07T18:15:37.118111Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_11', '--output_data', '$AZUREML_DATAREFERENCE_train11'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_11': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch11.csv',

{'runId': '9d66042e-3b8b-4478-baa2-c3025f86193f', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:11:40.80073Z', 'endTimeUtc': '2019-08-07T18:15:41.107086Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_test_batch_0', '--output_data', '$AZUREML_DATAREFERENCE_test0'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'test_batch_0': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/test/batch0.csv', 'pathOnC

{'runId': '6a26f9e9-312a-4fb8-aeda-459ebf40298a', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:11:40.262344Z', 'endTimeUtc': '2019-08-07T18:15:59.840397Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_39', '--output_data', '$AZUREML_DATAREFERENCE_train39'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_39': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch39.csv',

{'runId': '3b2d7262-f8fe-4dc9-b531-a980ad7d3163', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:15:59.916911Z', 'endTimeUtc': '2019-08-07T18:19:29.307303Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_6', '--output_data', '$AZUREML_DATAREFERENCE_train6'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_6': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch6.csv', 'pa

{'runId': '042b0f4c-cde8-44dc-8329-8b94d6a0320c', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:11:57.999975Z', 'endTimeUtc': '2019-08-07T18:16:01.184721Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_40', '--output_data', '$AZUREML_DATAREFERENCE_train40'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_40': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch40.csv',

{'runId': 'cd1fafb0-ca7d-4147-bf7a-f9802ecc5fc9', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:11:45.053029Z', 'endTimeUtc': '2019-08-07T18:16:04.121577Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_28', '--output_data', '$AZUREML_DATAREFERENCE_train28'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_28': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch28.csv',

{'runId': '7c565227-f554-491f-8fde-1c1c7ac01c4d', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:12:00.185373Z', 'endTimeUtc': '2019-08-07T18:16:23.35679Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_13', '--output_data', '$AZUREML_DATAREFERENCE_train13'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_13': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch13.csv', 

{'runId': '6429d05e-6bf6-4a50-b115-6ab319c6bb5f', 'target': 'pipelines-tc-12', 'status': 'Completed', 'startTimeUtc': '2019-08-07T18:11:51.289188Z', 'endTimeUtc': '2019-08-07T18:15:54.960346Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '8268cb24-22dd-4300-bbd0-e4f7f5bace13', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '6c289dac-5da1-4a28-b112-e4725763c85f', '_azureml.ComputeTargetType': 'batchai', 'AzureML.DerivedImageName': 'azureml/azureml_d23dd54ad4993141bebe494ae106b052'}, 'runDefinition': {'script': 'utils_nlp/models/bert/preprocess.py', 'arguments': ['--input_data', '$AZUREML_DATAREFERENCE_train_batch_9', '--output_data', '$AZUREML_DATAREFERENCE_train9'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'pipelines-tc-12', 'dataReferences': {'train_batch_9': {'dataStoreName': 'liqungensen', 'mode': 'Mount', 'pathOnDataStore': 'mnli_data/train/batch9.csv', 'pa

eb68b885d980411e8b2b371eb064f8fe00000A:138:282 [0] NCCL INFO comm 0x7f084c338040 rank 0 nranks 4 cudaDev 0 nvmlDev 0 - Init COMPLETE
eb68b885d980411e8b2b371eb064f8fe00000A:138:282 [0] NCCL INFO Launch mode Parallel
Train Epoch: 1/1 (0%) 	 Batch:1 	Loss: 1.658950
Train Epoch: 1/1 (98%) 	 Batch:1001 	Loss: 0.080279

Iteration:   0%|          | 0/98 [00:00<?, ?it/s]
Iteration:   1%|          | 1/98 [00:04<07:23,  4.57s/it]
Iteration:   2%|▏         | 2/98 [00:04<05:19,  3.33s/it]
Iteration:   3%|▎         | 3/98 [00:05<03:53,  2.45s/it]
Iteration:   4%|▍         | 4/98 [00:05<02:53,  1.84s/it]
Iteration:   5%|▌         | 5/98 [00:06<02:11,  1.42s/it]
Iteration:   6%|▌         | 6/98 [00:06<01:42,  1.12s/it]
Iteration:   7%|▋         | 7/98 [00:07<01:22,  1.10it/s]
Iteration:   8%|▊         | 8/98 [00:07<01:08,  1.32it/s]
Iteration:   9%|▉         | 9/98 [00:07<00:58,  1.52it/s]
Iteration:  10%|█         | 10/98 [00:08<00:51,  1.71it/s]
Iteration:  11%|█         | 11/98 [00:08<00:46,  1.87






PipelineRun Execution Summary
PipelineRun Status: Finished
{'runId': '6c289dac-5da1-4a28-b112-e4725763c85f', 'status': 'Completed', 'startTimeUtc': '2019-08-07T17:56:45.803284Z', 'endTimeUtc': '2019-08-07T19:45:08.919211Z', 'properties': {'azureml.runsource': 'azureml.PipelineRun', 'runSource': None, 'runType': 'HTTP', 'azureml.parameters': '{}'}, 'logFiles': {'logs/azureml/executionlogs.txt': 'https://maidaptest3334372853.blob.core.windows.net/azureml/ExperimentRun/dcid.6c289dac-5da1-4a28-b112-e4725763c85f/logs/azureml/executionlogs.txt?sv=2018-11-09&sr=b&sig=Dvuz3Ai%2BMkZjDJcEr%2Fl06vHS1UrSxndVUNVQ5472rDk%3D&st=2019-08-07T19%3A35%3A12Z&se=2019-08-08T03%3A45%3A12Z&sp=r', 'logs/azureml/stderrlogs.txt': 'https://maidaptest3334372853.blob.core.windows.net/azureml/ExperimentRun/dcid.6c289dac-5da1-4a28-b112-e4725763c85f/logs/azureml/stderrlogs.txt?sv=2018-11-09&sr=b&sig=m%2BQK5AbB1KJxYsvQhrjnJiz3SxcoGb3sr4tDTPIPS88%3D&st=2019-08-07T19%3A35%3A12Z&se=2019-08-08T03%3A45%3A12Z&sp=r', 'logs/

'Finished'

### 2.4 Download and analyze results

In [22]:
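# Locate the training step within the finished pipeline run.
step_run = pipeline_run.find_step_run("Estimator-Train")[0]
# Output files of the training step to download locally.
file_names = ['outputs/results.json', 'outputs/bert-large-uncased', 'outputs/bert_config.json']
azureml_utils.get_output_files(step_run, './outputs', file_names=file_names)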

Downloading file outputs/results.json to ./outputs\results.json...
Downloading file outputs/bert-large-uncased to ./outputs\bert-large-uncased...
Downloading file outputs/bert_config.json to ./outputs\bert_config.json...


In [None]:
# Load the downloaded metrics and show them as a table, one row per top-level key.
with open('outputs/results.json', 'r') as handle:
    parsed = json.load(handle)
print(pd.DataFrame.from_dict(parsed).transpose())

The table above shows the performance of the model trained on a distributed setup in AzureML Compute. Compared with fine-tuning the same model on the MNLI dataset on a `STANDARD_NC12` machine ([here](tc_mnli_bert.ipynb)), training time drops by about 20% with no drop in performance. The weighted averages of the metrics and the training times are compared below:

| Training Setup | F1-Score | Precision | Recall | Training Time |
| --- | --- | --- | --- | --- |
| Standard NC12 | 0.93 | 0.93 | 0.93 | 58 min |
| AzureML Compute\* | 0.934 | 0.934 | 0.934 | 46 min |

\* AzureML Compute: the setup used 4 nodes with `STANDARD_NC12` machines.
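
The 20% figure follows from the two training times in the table. The short calculation below is only a sketch of that arithmetic. The variable names are ours, and the per-node efficiency assumes the baseline ran on one `STANDARD_NC12` machine while the distributed run used 4 such nodes, as stated above.

```python
# Training times taken from the comparison table above (minutes).
single_node_minutes = 58
distributed_minutes = 46
num_nodes = 4  # nodes in the AzureML Compute cluster, per the table footnote

# Fraction of the single-machine training time that was saved.
time_saved = (single_node_minutes - distributed_minutes) / single_node_minutes
# How many times faster the distributed run was.
speedup = single_node_minutes / distributed_minutes
# Speedup divided by node count (1.0 would be perfect linear scaling).
efficiency = speedup / num_nodes

print(f"Reduction in training time: {time_saved:.1%}")
print(f"Speedup: {speedup:.2f}x")
print(f"Per-node scaling efficiency: {efficiency:.1%}")
```

A speedup well below 4x on 4 nodes is consistent with the communication overhead discussed next.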

We also observe the common tradeoffs associated with distributed training. We use [Horovod](https://github.com/horovod/horovod), a distributed training tool for many popular deep learning frameworks, to parallelize work across the nodes in the cluster. In theory, distributed training shortens the time the model takes to converge, but it also adds time spent communicating between nodes. This communication time should become negligible relative to the total training time as datasets grow larger, but being aware of the tradeoff helps when choosing a node configuration for smaller datasets. We expect the gains from using AzureML Compute to increase with dataset size.
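
To make the tradeoff concrete, here is a toy cost model, not a measurement. It assumes that compute time divides evenly across nodes and that using more than one node adds a fixed overhead that does not grow with dataset size. In practice, per-step gradient synchronization also scales with the number of training steps, so real gains will be smaller. The function name `estimated_minutes` and all numbers are hypothetical and chosen only for illustration.

```python
def estimated_minutes(single_node_minutes, num_nodes, fixed_overhead_minutes):
    """Toy estimate: evenly divided compute plus a fixed multi-node overhead."""
    compute = single_node_minutes / num_nodes
    # A single node has no other nodes to coordinate with.
    overhead = fixed_overhead_minutes if num_nodes > 1 else 0.0
    return compute + overhead


# Hypothetical workloads, expressed as single-node training time in minutes.
for single_node_minutes in (10, 60, 600):
    distributed = estimated_minutes(
        single_node_minutes, num_nodes=4, fixed_overhead_minutes=30
    )
    speedup = single_node_minutes / distributed
    print(
        f"{single_node_minutes:>4} min on 1 node -> "
        f"{distributed:6.1f} min on 4 nodes ({speedup:.2f}x)"
    )
```

Under these assumptions the smallest workload runs slower on 4 nodes than on 1, while the largest gets closest to the ideal 4x speedup. This is the pattern the paragraph above describes.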

Finally, clean up the intermediate files we created.

In [None]:
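# Remove the intermediate files, skipping any that no longer exist.
for path in (train_file, preprocess_file):
    if os.path.exists(path):
        os.remove(path)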