# PyTorch Pretrained BERT on Teams Slot Tagging

In this notebook we show how to perform slot tagging on the Teams dataset. For the requirements to run an Azure ML notebook, see https://github.com/danielsc/dogbreeds/blob/master/dog-breed-simple.ipynb
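
As a rough illustration of what slot tagging produces and how an entity-level F1 (such as the `best_val_f1` metric tuned later) can be computed, here is a minimal sketch. The utterance, the slot names `contact` and `time`, and the helpers `extract_slots` and `slot_f1` are hypothetical. The real Teams label set is not shown in this notebook, and the training script relies on `seqeval`, which handles edge cases (for example stray `I-` tags) differently from this simplified version.

```python
def extract_slots(tags):
    """Return the set of (slot_type, start, end) spans in a BIO tag sequence."""
    spans = set()
    start, slot = None, None
    # A trailing "O" sentinel closes a span that runs to the end of the sequence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and slot == tag[2:]
        if start is not None and not continues:
            spans.add((slot, start, i))
            start, slot = None, None
        if tag.startswith("B-"):
            start, slot = i, tag[2:]
    return spans


def slot_f1(gold_tags, pred_tags):
    """Entity-level F1: a predicted slot counts only if type and boundaries match."""
    gold, pred = extract_slots(gold_tags), extract_slots(pred_tags)
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical utterance and labels, for illustration only.
tokens = ["call", "John", "Smith", "at", "noon"]
gold = ["O", "B-contact", "I-contact", "O", "B-time"]
pred = ["O", "B-contact", "I-contact", "O", "O"]

gold_slots = extract_slots(gold)
f1 = slot_f1(gold, pred)
print(sorted(gold_slots), round(f1, 3))
```

In this example the prediction finds the `contact` slot but misses the `time` slot, so precision is 1.0 and recall is 0.5.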

In [47]:
import azureml.core
print("SDK version:", azureml.core.VERSION)


SDK version: 1.0.33


# Connect to Workspace and select GPU cluster

If there is no existing cluster, create one.

In [48]:
from azureml.core import Workspace

subscription_id = "4a66f470-dd54-4c5e-bd19-8cb65a426003"
resource_group  = "AML_Playground"
workspace_name  = "Teams_ws"

try:
    ws = Workspace(subscription_id = subscription_id, resource_group = resource_group, workspace_name = workspace_name)
    ws.write_config()
    print('Library configuration succeeded')
    print('https://ms.portal.azure.com/#@microsoft.onmicrosoft.com/resource' + ws.get_details()['id'])
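except Exception as e:
    # Catch Exception rather than using a bare except, and report the underlying error
    print('Workspace not found:', e)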



Library configuration succeeded
https://ms.portal.azure.com/#@microsoft.onmicrosoft.com/resource/subscriptions/4a66f470-dd54-4c5e-bd19-8cb65a426003/resourceGroups/AML_Playground/providers/Microsoft.MachineLearningServices/workspaces/Teams_ws


In [49]:
from azureml.core.compute import AmlCompute, ComputeTarget

cluster_name = "p100cluster"

try:
    compute_target = ws.compute_targets[cluster_name]
    print('Found existing compute target.')
except KeyError:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_NC6s_v2', 
                                                           idle_seconds_before_scaledown=1800,
                                                           min_nodes=0, 
                                                           max_nodes=10)
    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=True)

Found existing compute target.


# Connect to Datastore and upload local data

When your data and model are large, you need to store them in a separate Datastore.

Otherwise AML raises an error and you cannot track your outputs.

Each workspace is associated with a default Azure Blob datastore named 'workspaceblobstore'. In this work, we use this default datastore to store our local data.

In [50]:
ds = ws.get_default_datastore()

In [51]:
#upload local model
model_path_on_datastore = 'Teams_slot_model' #cased model,vocab is too small? Do not have frequent word like common
ds_model = ds.path(model_path_on_datastore)
ds.upload(src_dir=r'D:\dl_repo\Tagging_data\bert-base-English-cased-pytorch',
          target_path= model_path_on_datastore,
          overwrite=False,
          show_progress=True)
print(ds_model.as_mount())

$AZUREML_DATAREFERENCE_ca7e0d0fdc984f7daff28ed53f474f3a


In [52]:
#upload unsupervised local model
model_path_on_datastore = 'Communication_slot_model_unsupervised' #cased model,vocab is too small? Do not have frequent word like common
ds_model_unsupervised = ds.path(model_path_on_datastore)
ds.upload(src_dir=r'D:\dl_repo\Data_model\Communication_unsupervised',
          target_path= model_path_on_datastore,
          overwrite=False,
          show_progress=True)
print(ds_model_unsupervised.as_mount())

$AZUREML_DATAREFERENCE_948406753abe4dae96bb9fd471cfc00e


In [53]:
#upload local data set
path_on_datastore = 'datasets/Teams_communication'
ds_data_communication = ds.path(path_on_datastore)
ds.upload(src_dir=r'D:\dl_repo\Data_model\Communication_data',
          target_path= path_on_datastore,
          overwrite=False,
          show_progress=True)



$AZUREML_DATAREFERENCE_996a0d3f55db4f77b6bce5e449edcf85
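
To make the link between the local `src_dir` and the datastore paths used later (for example `datasets/Teams_communication/valid.txt`) more concrete, here is a minimal sketch. It assumes that `ds.upload` keeps each file's path relative to `src_dir` and places it under `target_path`. `plan_upload` is a hypothetical helper that only computes names on a temporary local folder and never contacts Azure.

```python
import os
import posixpath
import tempfile


def plan_upload(src_dir, target_path):
    """Map every local file under src_dir to its assumed path on the datastore."""
    plan = {}
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            local = os.path.join(root, name)
            # Datastore paths use forward slashes regardless of the local OS.
            relative = os.path.relpath(local, src_dir).replace(os.sep, "/")
            plan[local] = posixpath.join(target_path, relative)
    return plan


# Stand-in for the local data folder: a temporary directory with two empty files.
with tempfile.TemporaryDirectory() as src:
    for file_name in ("valid.txt", "test.txt"):
        open(os.path.join(src, file_name), "w").close()
    remote_paths = sorted(plan_upload(src, "datasets/Teams_communication").values())

print(remote_paths)
```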

# Create an experiment
Create an Experiment to track all the runs in your workspace. 


In [54]:
from azureml.core import Experiment

experiment_name = 'Teams_slot_uncased' 
experiment = Experiment(ws, name=experiment_name)

# Submit your Job
The following section creates a PyTorch estimator, in which you can easily specify your parameters.

When you submit your job, it automatically uploads your local repo to the cloud cluster.

You can also submit TensorFlow or Keras jobs.

In [55]:
##BATCH AI
from azureml.train.dnn import PyTorch




script_params = {
    #'--data_dir': ds_data.as_mount(),
    #'--data_dir': 'Communication_data', #update for golden data
    '--data_dir': ds_data_communication.as_mount(), #update for golden data
    '--train_dir':ds.path('datasets/Teams_communication/comm_train_prod.txt').as_mount(),
    #'--train_dir':ds.path('datasets/Teams_communication/train_teams.txt').as_mount(),
    '--eval_dir':ds.path('datasets/Teams_communication/valid.txt').as_mount(),
    '--test_dir':ds.path('datasets/Teams_communication/test.txt').as_mount(),
    
    #'--bert_model':ds_model_unsupervised.as_mount(),#for pre-trained model with indomain unsupervised data
    #'--bert_model':ds.path(f'pretrained-model/output_merged_unlabeled').as_mount(),
    #'--bert_model':ds.path(f'pretrained-model/output_merged_unlabeled_4_percent').as_mount(),
    #'--bert_model':'bert-large-uncased',#for pre-trained model
    
    #for pre-trained model
    #'--bert_model':'bert-base-uncased',
    '--bert_model':'bert-base-uncased',
    '--do_lower_case':'',
    
    
    '--task_name':'ner',
    '--output_dir':'./outputs',
    #'--output_dir':ds.path(f'bert_data/uncased_model/outputs_base_uncased_real').as_mount(),
    '--do_train':'',
    '--do_eval':'',
    '--learning_rate':'2e-5',#larger learning rate given init
    #'--learning_rate':'5e-5',#larger learning rate given init
    '--num_train_epochs':'20',
    '--warmup_proportion':'0.1',
    #'--max_seq_length':'32',
    '--max_seq_length':'128',
    #'--train_batch_size':'8'#8 for smaller base
    '--train_batch_size':'64'#8 for smaller base
}

estimator10 = PyTorch(source_directory='..', 
                    script_params=script_params,
                    compute_target=compute_target, 
                    entry_script='Teacher_training.py',
                    #pip_packages=['pandas','pytorch-pretrained-bert==0.4.0','seqeval==0.0.5'],
                    pip_packages=['pandas','pytorch-pretrained-bert==0.6.1','seqeval==0.0.5','transformers==2.1.1'],
                    use_gpu=True)
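
The entry script `Teacher_training.py` is not shown in this notebook. As a minimal sketch, assuming it reads its options with `argparse` and that empty-string values in `script_params` are passed as bare flags, the dictionary above could be consumed roughly as follows. `build_parser` and `params_to_argv` are hypothetical helper names, only a subset of the flags is covered, and the mounted datastore path is replaced by a placeholder string.

```python
import argparse


def build_parser():
    """Hypothetical parser covering a subset of the flags in script_params."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_dir", type=str, required=True)
    parser.add_argument("--bert_model", type=str, default="bert-base-uncased")
    parser.add_argument("--task_name", type=str, default="ner")
    parser.add_argument("--output_dir", type=str, default="./outputs")
    # Flags given as '' in script_params carry no value, so they map to store_true.
    parser.add_argument("--do_lower_case", action="store_true")
    parser.add_argument("--do_train", action="store_true")
    parser.add_argument("--do_eval", action="store_true")
    parser.add_argument("--learning_rate", type=float, default=5e-5)
    parser.add_argument("--num_train_epochs", type=float, default=3.0)
    parser.add_argument("--warmup_proportion", type=float, default=0.1)
    parser.add_argument("--max_seq_length", type=int, default=128)
    parser.add_argument("--train_batch_size", type=int, default=32)
    return parser


def params_to_argv(script_params):
    """Flatten a script_params-style dict into an argv list (assumed behaviour)."""
    argv = []
    for flag, value in script_params.items():
        argv.append(flag)
        if value != "":
            argv.append(str(value))
    return argv


example_params = {
    "--data_dir": "/mnt/placeholder/datasets/Teams_communication",  # stand-in for as_mount()
    "--bert_model": "bert-base-uncased",
    "--do_lower_case": "",
    "--task_name": "ner",
    "--do_train": "",
    "--do_eval": "",
    "--learning_rate": "2e-5",
    "--num_train_epochs": "20",
    "--max_seq_length": "128",
    "--train_batch_size": "64",
}

args = build_parser().parse_args(params_to_argv(example_params))
print(args.learning_rate, args.train_batch_size, args.do_train)
```

Passing every value as a string, as the notebook does, works here because `argparse` converts each one with the declared `type`.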

# Set up your running environment

You can create your own virtual environment. The first job submission takes a while because the environment has to be set up. As long as you do not change the dependencies, later submissions are fast.


In [56]:
print(estimator10.run_config.environment.docker.base_image)

mcr.microsoft.com/azureml/base-gpu:intelmpi2018.3-cuda9.0-cudnn7-ubuntu16.04


In [57]:
print(estimator10.conda_dependencies.serialize_to_string())

# Conda environment specification. The dependencies defined in this file will
# be automatically provisioned for runs with userManagedDependencies=False.

# Details about the Conda environment file format:
# https://conda.io/docs/user-guide/tasks/manage-environments.html#create-env-file-manually

name: project_environment
dependencies:
  # The python interpreter version.
  # Currently Azure ML only supports 3.5.2 and later.
- python=3.6.2

- pip:
  - pandas
  - pytorch-pretrained-bert==0.6.1
  - seqeval==0.0.5
  - transformers==2.1.1
  - azureml-defaults
  - torch==1.0
  - torchvision==0.2.1
  - horovod==0.15.2



# Submit and Monitor your run

You can also find your previous runs if you open the Azure portal.

In [58]:
run = experiment.submit(estimator10)

In [59]:
from azureml.widgets import RunDetails
RunDetails(run).show()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

## Hyperparameter selection

In [60]:
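from azureml.widgets import RunDetails
# Explicit imports instead of a wildcard, so it is clear where each name comes from
from azureml.train.hyperdrive import (BanditPolicy, HyperDriveConfig, PrimaryMetricGoal,
                                      RandomParameterSampling, choice, loguniform)
import math  # only needed for the commented-out loguniform alternative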
ps = RandomParameterSampling(
    {
        #'--learning_rate': loguniform(math.log(1e-5), math.log(1e-4)),
        '--learning_rate': choice(1e-5,2e-5,5e-5,1e-4),
        #'--beta': choice(1,5,10,20)
        #"--temperature":choice(1,5,10,20)
        "--train_batch_size": choice(8,16,32,64)
    }
)

policy = BanditPolicy(evaluation_interval=2, slack_factor=0.2)


hdc = HyperDriveConfig(estimator=estimator10, 
                          hyperparameter_sampling=ps, 
                          policy=policy, 
                          primary_metric_name='best_val_f1', 
                          primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, 
                          max_total_runs=10,
                          max_concurrent_runs=4)
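
To get a feel for the search space defined above, here is a minimal local sketch of random sampling over the two `choice` lists. It does not call HyperDrive. `sample_configs` is a hypothetical helper, the seed is arbitrary, and each draw is independent, so repeated combinations are possible in this sketch.

```python
import itertools
import random

# Same candidate values as in the RandomParameterSampling definition above.
search_space = {
    "--learning_rate": [1e-5, 2e-5, 5e-5, 1e-4],
    "--train_batch_size": [8, 16, 32, 64],
}


def sample_configs(space, n_runs, seed=0):
    """Draw n_runs configurations, picking each parameter uniformly at random."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in space.items()}
            for _ in range(n_runs)]


# The full grid has 4 x 4 combinations; max_total_runs=10 covers only part of it.
grid_size = len(list(itertools.product(*search_space.values())))
configs = sample_configs(search_space, n_runs=10)

print(grid_size, len(configs))
for config in configs[:3]:
    print(config)
```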


In [61]:
hd_run = experiment.submit(hdc)
RunDetails(hd_run).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…
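
While the HyperDrive run progresses, the `BanditPolicy` configured above can cancel poorly performing child runs. As a minimal sketch of the arithmetic, assuming a metric that is maximized and the rule described in the Azure ML documentation (a run is cancelled when its best metric, scaled by `1 + slack_factor`, is still below the best metric seen so far): `bandit_cancels` is a hypothetical helper, the metric values are made up, and the `evaluation_interval` setting, which controls how often the policy is applied, is not modelled here.

```python
def bandit_cancels(run_metric, best_metric, slack_factor=0.2):
    """Return True if a run falls outside the allowed slack of the best run."""
    return run_metric * (1 + slack_factor) < best_metric


# Made-up F1 values: the best run so far reports 0.80.
best_f1 = 0.80
cutoff = best_f1 / (1 + 0.2)  # runs below roughly 0.667 are cancelled

weak_run_cancelled = bandit_cancels(0.60, best_f1)    # 0.60 * 1.2 = 0.72 < 0.80
close_run_cancelled = bandit_cancels(0.70, best_f1)   # 0.70 * 1.2 = 0.84 >= 0.80

print(round(cutoff, 3), weak_run_cancelled, close_run_cancelled)
```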

# To do

Fine-tuning BERT with hyperparameter tuning

Fine-tuning BERT with multiple GPUs

Generate an Azure pipeline


Link to the blob and datastore account

In [62]:
from azureml.core import Workspace, Datastore
# Default datastore 
def_data_store = ws.get_default_datastore()


# The following call GETS the Azure Blob Store associated with your workspace.
# Note that workspaceblobstore is **the name of this store and CANNOT BE CHANGED and must be used as is** 
def_blob_store = Datastore(ws, "workspaceblobstore")
print("Blobstore's name: {}".format(def_blob_store.name))

# Get file storage associated with the workspace
#it will be put into files share folder in the workspace?
def_file_store = Datastore(ws, "workspacefilestore")




Blobstore's name: workspaceblobstore


In [63]:
def_data_store.name

'workspaceblobstore'

In [64]:
def_blob_store.name

'workspaceblobstore'

In [65]:
def_file_store.name

'workspacefilestore'

## Upload data to a specific blob

We need to upload the data to a specific blob.

In [66]:
#upload local model
model_path_on_datastore = 'bert_data/uncased_model/outputs_base_uncased_no_basic_tokenizer' #cased model,vocab is too small? Do not have frequent word like common
ds_model = ds.path(model_path_on_datastore)
ds.upload(src_dir=r'D:\dl_repo\BERT-NER\comm_out',
          target_path= model_path_on_datastore,
          overwrite=False,
          show_progress=True)
print(ds_model.as_mount())

$AZUREML_DATAREFERENCE_6d2d43d7dffe4e218b12917640a82203


In [67]:
from azureml.pipeline.core import PipelineData

# Define intermediate data using PipelineData
# Syntax

# PipelineData(name, 
#              datastore=None, 
#              output_name=None, 
#              output_mode='mount', 
#              output_path_on_compute=None, 
#              output_overwrite=None, 
#              data_type=None, 
#              is_directory=None)


output_dir = PipelineData("results",datastore=def_file_store)

In [68]:
output_dir

$AZUREML_DATAREFERENCE_results

In [69]:
ds_model

$AZUREML_DATAREFERENCE_bf753cf9267f4b169251a6bfc753a941

## Looks like there is a recent solution for using a pipeline with Azure batch scoring

https://docs.microsoft.com/en-us/azure/machine-learning/service/tutorial-pipeline-batch-scoring-classification


For more info, see https://docs.microsoft.com/en-us/azure/machine-learning/


