# PyTorch Pretrained BERT on Communication Slot Tagging

This notebook shows how to perform slot tagging on the Teams dataset with a pretrained BERT model in PyTorch. For the prerequisites for running an Azure ML notebook, see https://github.com/danielsc/dogbreeds/blob/master/dog-breed-simple.ipynb
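
Slot tagging assigns one tag to every token of an utterance. The dataset used later ships a `BIO-tags.txt` file, so a BIO tagging scheme is assumed here. The sketch below shows how BIO tags collapse into slots; the tag names (`contact_name`, `message`) and the example sentence are hypothetical and are not taken from the Teams data.

```python
from typing import List, Optional, Tuple


def extract_slots(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (slot_type, slot_text) pairs."""
    slots: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_tokens: List[str] = []

    def close_open_slot() -> None:
        # Emit the slot collected so far, if there is one.
        if current_type is not None:
            slots.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new slot.
            close_open_slot()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open slot.
            current_tokens.append(token)
        else:
            # "O" (or an I- tag that does not match the open slot) ends the slot.
            close_open_slot()
            current_type, current_tokens = None, []
    close_open_slot()
    return slots


# Hypothetical example utterance and tags.
tokens = ["send", "a", "message", "to", "john", "smith", "saying", "hello"]
tags = ["O", "O", "O", "O", "B-contact_name", "I-contact_name", "O", "B-message"]
slots = extract_slots(tokens, tags)
print(slots)
```

In this simplified version an `I-` tag that does not continue an open slot of the same type is treated like `O`; stricter or more lenient conventions exist.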

In [1]:
import azureml.core
print("SDK version:", azureml.core.VERSION)




SDK version: 1.0.33


# Connect to Workspace and select gpu cluster

If no cluster exists yet, create one.

In [2]:
from azureml.core import Workspace

#subscription_id = "4a66f470-dd54-4c5e-bd19-8cb65a426003"
#resource_group  = "AML_Playground"
#workspace_name  = "Teams_ws"

subscription_id = "ddb33dc4-889c-4fa1-90ce-482d793d6480"
resource_group = "DevExp"
workspace_name = "DevExperimentation"
try:
    ws = Workspace(subscription_id = subscription_id, resource_group = resource_group, workspace_name = workspace_name)
    ws.write_config()
    print('Library configuration succeeded')
    print('https://ms.portal.azure.com/#@microsoft.onmicrosoft.com/resource' + ws.get_details()['id'])
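except Exception as e:
    # Catch Exception instead of using a bare except, and show why the lookup failed
    print('Workspace not found:', e)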



Library configuration succeeded
https://ms.portal.azure.com/#@microsoft.onmicrosoft.com/resource/subscriptions/ddb33dc4-889c-4fa1-90ce-482d793d6480/resourceGroups/DevExp/providers/Microsoft.MachineLearningServices/workspaces/DevExperimentation


In [3]:
from azureml.core.compute import AmlCompute, ComputeTarget

#cluster_name = "p100cluster"
cluster_name  ="P100-SingleGPU"

try:
    compute_target = ws.compute_targets[cluster_name]
    print('Found existing compute target.')
except KeyError:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_NC6s_v2', 
                                                           idle_seconds_before_scaledown=1800,
                                                           min_nodes=0, 
                                                           max_nodes=10)
    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=True)

Found existing compute target.


# Connect to Datastore and upload local data

When your data and models are large, create a separate Datastore for them.

Otherwise, AML may fail with errors and you will not be able to track your outputs.

Each workspace is associated with a default Azure Blob datastore named 'workspaceblobstore'. The cell below leaves the default-datastore call commented out and instead retrieves a registered datastore by name to hold the local data.

In [4]:
#ds = ws.get_default_datastore()
from azureml.core import Workspace,Datastore 

ds = Datastore.get(ws, datastore_name='compliant_lu_haochu')


In [5]:
#upload local model
model_path_on_datastore = f'bert_data/uncased_model/outputs_base_uncased' #cased model,vocab is too small? Do not have frequent word like common
ds_model = ds.path(model_path_on_datastore)
ds.upload(src_dir=r'D:\dl_repo\Data_model\Communication_teacher_uncased',
          target_path= model_path_on_datastore,
          overwrite=False,
          show_progress=True)
print(ds_model.as_mount())

Target already exists. Skipping upload for bert_data/uncased_model/outputs_base_uncased\model_config.json
Target already exists. Skipping upload for bert_data/uncased_model/outputs_base_uncased\vocab.txt


$AZUREML_DATAREFERENCE_ee226e320acf4235a4bd20b88c26c50b


In [6]:
#upload unsupervised local model
model_path_on_datastore = 'Communication_slot_model_unsupervised' #cased model,vocab is too small? Do not have frequent word like common
ds_model_unsupervised = ds.path(model_path_on_datastore)
ds.upload(src_dir=r'D:\dl_repo\Data_model\Communication_unsupervised',
          target_path= model_path_on_datastore,
          overwrite=False,
          show_progress=True)
print(ds_model_unsupervised.as_mount())

Target already exists. Skipping upload for Communication_slot_model_unsupervised\bert_config.json
Target already exists. Skipping upload for Communication_slot_model_unsupervised\pytorch_model.bin
Target already exists. Skipping upload for Communication_slot_model_unsupervised\vocab.txt


$AZUREML_DATAREFERENCE_a2c043d92b044c608e6838f898d79d6b


In [7]:
#upload local data set
path_on_datastore = 'datasets/Teams_communication'
ds_data_communication = ds.path(path_on_datastore)
ds.upload(src_dir=r'D:\dl_repo\Data_model\Communication_data',
          target_path= path_on_datastore,
          overwrite=False,
          show_progress=True)

Target already exists. Skipping upload for datasets/Teams_communication\BIO-tags.txt
Target already exists. Skipping upload for datasets/Teams_communication\comm_train_prod.txt
Target already exists. Skipping upload for datasets/Teams_communication\Target_set_message_new_conll.txt
Target already exists. Skipping upload for datasets/Teams_communication\test.txt
Target already exists. Skipping upload for datasets/Teams_communication\test_blind_old.txt
Target already exists. Skipping upload for datasets/Teams_communication\test_nonormal.txt
Target already exists. Skipping upload for datasets/Teams_communication\train.txt
Target already exists. Skipping upload for datasets/Teams_communication\train_19k_only.txt
Target already exists. Skipping upload for datasets/Teams_communication\train_legacy_1m.txt
Target already exists. Skipping upload for datasets/Teams_communication\train_positive_generated.txt
Target already exists. Skipping upload for datasets/Teams_communication\train_teams.txt

Uploading D:\dl_repo\Data_model\Communication_data\valid_dec.txt


Target already exists. Skipping upload for datasets/Teams_communication\generated_data\communication_message_generated_contact.txt
Target already exists. Skipping upload for datasets/Teams_communication\generated_data\communication_message_generated_no_contact.txt
Target already exists. Skipping upload for datasets/Teams_communication\raw_generated_data\raequery_for_prod_with_target.txt
Target already exists. Skipping upload for datasets/Teams_communication\raw_generated_data\rawquery.txt
Target already exists. Skipping upload for datasets/Teams_communication\raw_generated_data\rawquery_contact_message_generated.txt
Target already exists. Skipping upload for datasets/Teams_communication\raw_generated_data\rawquery_expanision-outputs.txt
Target already exists. Skipping upload for datasets/Teams_communication\raw_generated_data\rawquery_for_prod.txt


Uploaded D:\dl_repo\Data_model\Communication_data\valid_dec.txt, 1 files out of an estimated total of 16


Target already exists. Skipping upload for datasets/Teams_communication\raw_generated_data\rawquery_for_prod_from_valid_nov.txt
Target already exists. Skipping upload for datasets/Teams_communication\raw_generated_data\rawquery_for_prod_from_valid_nov_normalized.txt
Target already exists. Skipping upload for datasets/Teams_communication\raw_generated_data\rawquery_for_prod_with_civ.txt
Target already exists. Skipping upload for datasets/Teams_communication\raw_generated_data\rawquery_from_civ.txt
Target already exists. Skipping upload for datasets/Teams_communication\raw_generated_data\rawquery_merged.txt
Target already exists. Skipping upload for datasets/Teams_communication\raw_generated_data\rawquery_no_contact_message_generated.txt


$AZUREML_DATAREFERENCE_ce2df08c41ad48f88adb44a0853988ae

In [8]:
#upload pre-trained model
#TODO: reuse azure blob path in aml
model_path_on_datastore = 'Communication_slot_model_fine-tuned' #cased model,vocab is too small? Do not have frequent word like common
ds_model_pretrained = ds.path(model_path_on_datastore)
ds.upload(src_dir=r'D:\dl_repo\Data_model\Communication_finetuned',
          target_path= model_path_on_datastore,
          overwrite=False,
          show_progress=True)
print(ds_model_pretrained.as_mount())


Target already exists. Skipping upload for Communication_slot_model_fine-tuned\bert_config.json
Target already exists. Skipping upload for Communication_slot_model_fine-tuned\eval_results.txt
Target already exists. Skipping upload for Communication_slot_model_fine-tuned\model_config.json
Target already exists. Skipping upload for Communication_slot_model_fine-tuned\pytorch_model.bin


$AZUREML_DATAREFERENCE_79ad4d425a894461b86e6e1b05f81c42


# Create an experiment
Create an Experiment to track all the runs in your workspace. 


In [9]:
from azureml.core import Experiment

experiment_name = 'UDA' 
experiment = Experiment(ws, name=experiment_name)

# Submit your Job
The following section creates a PyTorch estimator, in which you can specify the script parameters.

When you submit the job, your local repository is automatically uploaded to the cloud cluster.

You can also submit TensorFlow or Keras jobs.
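
The script parameters reach the entry script as command-line arguments, and this notebook uses an empty string as the value for boolean flags such as `--do_train`. The sketch below illustrates how an entry script might receive them. The real parser in `src/distillation.py` is not shown in this notebook, so `build_argv`, `parse_args`, and the argument defaults are assumptions made for illustration only.

```python
import argparse
from typing import Dict, List


def build_argv(script_params: Dict[str, str]) -> List[str]:
    """Flatten a script_params-style dict into an argv list.

    Assumption: an empty-string value means a bare flag with no value.
    """
    argv: List[str] = []
    for name, value in script_params.items():
        argv.append(name)
        if value != "":
            argv.append(str(value))
    return argv


def parse_args(argv: List[str]) -> argparse.Namespace:
    """Hypothetical subset of the arguments an entry script could define."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--task_name", type=str, default="ner")
    parser.add_argument("--learning_rate", type=float, default=5e-5)
    parser.add_argument("--max_seq_length", type=int, default=128)
    parser.add_argument("--do_train", action="store_true")
    parser.add_argument("--do_eval", action="store_true")
    return parser.parse_args(argv)


example_params = {
    "--task_name": "ner",
    "--learning_rate": "1e-4",
    "--max_seq_length": "128",
    "--do_train": "",
}
args = parse_args(build_argv(example_params))
print(args.learning_rate, args.max_seq_length, args.do_train, args.do_eval)
```

Values are passed as strings, so numeric arguments need a `type=` conversion on the script side.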

## Supervised Finetuning

With pre-trained model and labeled data

In [43]:
##BATCH AI
from azureml.train.dnn import PyTorch



script_params = {
    #'--data_dir': ds_data.as_mount(),
    #'--data_dir': 'teams_data', #update for golden data
    '--data_dir': ds_data_communication.as_mount(), #update for communication data
    #'--train_dir':ds.path(f'datasets/Teams_communication/train_valid_oct.txt').as_mount(),
    #'--train_dir':ds.path(f'datasets/Teams_communication/train_teams.txt').as_mount(),
    #'--train_dir':ds.path(f'datasets/Teams_communication/comm_train_prod.txt').as_mount(),
    '--train_dir':ds.path(f'datasets/Communication_prod_data/communication_slot_train.txt').as_mount(),
    '--valid_dir':ds.path(f'datasets/Teams_communication/valid_dec.txt').as_mount(),
    '--test_generated_dir':ds.path(f'datasets/Teams_communication/generated_data/communication_message_generated_no_contact.txt').as_mount(),
    '--test_generated_no_contact_dir':ds.path(f'datasets/Teams_communication/generated_data/communication_message_generated_contact.txt').as_mount(),
    '--target_set_dir':ds.path(f'datasets/Teams_communication/Target_set_message_new_conll.txt').as_mount(),
    
    #'--teacher_model_path':ds_model_pretrained.as_mount(),
    #'--teacher_model_path':ds.path(f'bert_data/outputs').as_mount(),
    #'--teacher_model_path':ds.path(f'bert_data/outputs1').as_mount(),#from prod labeled sources
     #'--bert_model': ds_model.as_mount(),
    
    
    ##for uncased longer teacher
    '--teacher_model_path':ds.path(f'bert_data/uncased_model/outputs_base_uncased_no_basic_tokenizer').as_mount(),#from prod labeled sources
    '--bert_model':'bert-base-uncased', 
    '--do_lower_case':'',
    '--do_basic_tokenize':'', # added for comparison
    
    
    '--task_name':'ner',
    '--output_dir':'./outputs',
    '--do_train':'',
    '--do_eval':'',
    #'--crf':'',
    #'--learning_rate':'0.001',#add large learning rate for student model
    '--learning_rate':'1e-4',#smaller learning rate given init
    '--num_train_epochs':'50', ##get larger number when you have smaller unsupervised data
    '--warmup_proportion':'0.1',
    #'--max_seq_length':'32',
    '--max_seq_length':'128',
    '--train_batch_size':'64',
    "--temperature": '1',
    "--alpha": '1',#for ablation study,weight for labeled data
    "--beta": '1', #for ablation study,weight for unlabeled data
    #"--unsupervised_train_corpus":'./lm_data/train.txt' #add for debug
    #"--unsupervised_train_corpus":'./lm_data/augmented_data.txt' #add for debug
    '--unsupervised_train_corpus':ds.path(f'datasets/Teams_communication/raw_generated_data/rawquery_merged.txt').as_mount()
    #'--unsupervised_train_corpus':ds.path(f'datasets/Teams_communication/raw_generated_data/rawquery_for_prod.txt').as_mount()
    #'--unsupervised_train_corpus':ds.path(f'datasets/Teams_communication/raw_generated_data/rawquery_for_prod_from_valid_nov_normalized.txt').as_mount()   
    #'--unsupervised_train_corpus':ds.path(f'datasets/Teams_communication/raw_generated_data/rawquery_from_civ.txt').as_mount()
    #'--unsupervised_train_corpus':ds.path(f'datasets/Teams_communication/raw_generated_data/rawquery_expanision-outputs.txt').as_mount()
    #'--unsupervised_train_corpus':ds.path(f'datasets/Teams_communication/raw_generated_data/raequery_for_prod_with_target.txt').as_mount()
    #'--unsupervised_train_corpus':ds.path(f'datasets/Teams_communication/raw_generated_data/rawquery_for_prod_with_civ.txt').as_mount()
    #"--unsupervised_train_corpus":'./lm_data/communication_postive_generated_raw.txt' #add for debug
    
    
    #'--multi_gpu':'',
}

estimator10 = PyTorch(source_directory='..', 
                    script_params=script_params,
                    compute_target=compute_target, 
                    entry_script='src/distillation.py',
                    #pip_packages=['pandas','pytorch-pretrained-bert==0.4.0','seqeval==0.0.5'],
                    #pip_packages=['pandas','pytorch-pretrained-bert==0.6.1','seqeval==0.0.5','nltk'],
                    pip_packages=['pandas','pytorch-pretrained-bert==0.6.1','seqeval==0.0.5','transformers==2.1.1','nltk'],
                    use_gpu=True)
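
The `--temperature`, `--alpha`, and `--beta` parameters, together with `--teacher_model_path`, suggest a knowledge-distillation objective. The loss actually implemented in `src/distillation.py` is not shown in this notebook. The sketch below assumes one common form for a single token: `alpha` weights the cross-entropy against the gold label, and `beta` weights the KL divergence between the temperature-softened teacher and student distributions. All function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over logits divided by the temperature (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits: List[float], gold_index: int) -> float:
    """Hard-label loss: negative log-probability of the gold tag."""
    return -math.log(softmax(student_logits)[gold_index])


def soft_target_loss(student_logits: List[float],
                     teacher_logits: List[float],
                     temperature: float) -> float:
    """KL(teacher || student) on temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(pt * math.log(pt / ps)
               for pt, ps in zip(p_teacher, p_student) if pt > 0)


def distillation_loss(student_logits: List[float],
                      teacher_logits: List[float],
                      gold_index: int,
                      alpha: float = 1.0,
                      beta: float = 1.0,
                      temperature: float = 1.0) -> float:
    """Assumed combination: alpha * hard-label loss + beta * soft-target loss."""
    hard = cross_entropy(student_logits, gold_index)
    soft = soft_target_loss(student_logits, teacher_logits, temperature)
    return alpha * hard + beta * soft


# Made-up logits over three tags for one token.
loss = distillation_loss([0.2, 1.5, -0.3], [0.1, 2.0, -1.0], gold_index=1)
print(loss)
```

Under this assumption, setting `alpha` or `beta` to 0 removes the labeled or the teacher-driven term respectively, which matches the ablation comments in the cell above.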

# Set up your running environment

You can create your own virtual environment. The first job submission takes a while because the environment has to be built; as long as the dependencies do not change, later submissions are fast.


In [44]:
print(estimator10.run_config.environment.docker.base_image)

mcr.microsoft.com/azureml/base-gpu:intelmpi2018.3-cuda9.0-cudnn7-ubuntu16.04


In [45]:
print(estimator10.conda_dependencies.serialize_to_string())

# Conda environment specification. The dependencies defined in this file will
# be automatically provisioned for runs with userManagedDependencies=False.

# Details about the Conda environment file format:
# https://conda.io/docs/user-guide/tasks/manage-environments.html#create-env-file-manually

name: project_environment
dependencies:
  # The python interpreter version.
  # Currently Azure ML only supports 3.5.2 and later.
- python=3.6.2

- pip:
  - pandas
  - pytorch-pretrained-bert==0.6.1
  - seqeval==0.0.5
  - transformers==2.1.1
  - nltk
  - azureml-defaults
  - torch==1.0
  - torchvision==0.2.1
  - horovod==0.15.2



# Submit and Monitor your run

You can also find your previous runs in the Azure portal.

In [46]:
run = experiment.submit(estimator10)

In [47]:
from azureml.widgets import RunDetails
RunDetails(run).show()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

## Hyperparameter Tuning

In [41]:

from azureml.widgets import RunDetails
from azureml.train.hyperdrive import (RandomParameterSampling, BanditPolicy, HyperDriveConfig,
                                      PrimaryMetricGoal, loguniform, choice)
import math
ps = RandomParameterSampling(
    {
        '--learning_rate': loguniform(math.log(1e-5), math.log(1e-3)),
        #'--beta': choice(1,0),
        #'--alpha':choice(1,0),
        #"--temperature":choice(1,5,10,20)
        "--encoder_type": choice('LSTM','GRU'),
        "--hidden_units": choice(300,400,500,600),
        "--weight_decay": choice (0.0,0.001,0.01)
    }
)

policy = BanditPolicy(evaluation_interval=2, slack_factor=0.2)


hdc = HyperDriveConfig(estimator=estimator10, 
                          hyperparameter_sampling=ps, 
                          policy=policy, 
                          primary_metric_name='best_val_f1', 
                          primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, 
                          max_total_runs=20,
                          max_concurrent_runs=4)
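
The early-termination rule behind `BanditPolicy(evaluation_interval=2, slack_factor=0.2)` can be sketched in plain Python. This is a simplified illustration for a metric that is maximized, not the service's implementation: at each evaluation interval, a run is stopped when its best metric falls below the best run's metric divided by `1 + slack_factor`. The function name and the metric values are hypothetical.

```python
def bandit_should_terminate(run_best: float,
                            overall_best: float,
                            slack_factor: float,
                            interval: int,
                            evaluation_interval: int = 1,
                            delay_evaluation: int = 0) -> bool:
    """Return True if a run would be stopped under a slack-factor bandit rule."""
    # The policy is only applied every `evaluation_interval` reports,
    # and not before `delay_evaluation` reports have been made.
    if interval < delay_evaluation or interval % evaluation_interval != 0:
        return False
    # For a maximized metric, the run must stay within the allowed slack of the best run.
    return run_best < overall_best / (1.0 + slack_factor)


# With a best F1 of 0.80 and slack_factor=0.2 the cutoff is 0.80 / 1.2, about 0.667.
print(bandit_should_terminate(0.65, 0.80, 0.2, interval=2, evaluation_interval=2))
print(bandit_should_terminate(0.70, 0.80, 0.2, interval=2, evaluation_interval=2))
```

A larger `slack_factor` keeps more runs alive; a larger `evaluation_interval` applies the check less often.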


In [42]:
hd_run = experiment.submit(hdc)
RunDetails(hd_run).show()

The same input parameter(s) are specified in estimator/run_config script params and HyperDrive parameter space. HyperDrive parameter space definition will override these duplicate entries. ['--learning_rate'] is the list of overridden parameter(s).


_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…
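
The warning above says that the sampled `--learning_rate` replaces the value given in the estimator's script parameters. The sketch below illustrates both ideas in plain Python: `loguniform(min, max)` draws `exp(uniform(min, max))`, and the sampled value overrides the duplicate entry. `sample_loguniform` and the dictionary merge are illustrative stand-ins, not HyperDrive internals.

```python
import math
import random


def sample_loguniform(min_value: float, max_value: float, rng: random.Random) -> float:
    """Draw exp(uniform(min_value, max_value)); the bounds are given in log space."""
    return math.exp(rng.uniform(min_value, max_value))


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Parameters fixed on the estimator (subset shown).
base_params = {"--learning_rate": "1e-4", "--train_batch_size": "64"}

# One sampled configuration for a child run.
sampled = {"--learning_rate": sample_loguniform(math.log(1e-5), math.log(1e-3), rng)}

# Entries from the sampled space win over duplicates in the base parameters.
run_params = {**base_params, **sampled}
print(run_params)
```

Sampling in log space spreads the trials evenly across orders of magnitude between 1e-5 and 1e-3, which is usually what is wanted for a learning rate.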

## Pre-training with unsupervised in-domain data

In [None]:
##BATCH AI
from azureml.train.dnn import PyTorch

#script_params = {

#    '--data_path': 'data/sst2',
#    '--num_epochs': '10',
#    '--embedding_type': 'elmo',
#    '--output_dir': './outputs'
#}


script_params = {

    #'--train_corpus' : './lm_data/train.txt',
    '--train_corpus' : ds.path(f'azureml-blobstore-1dd9ddd7-6d52-40ad-8b00-be73cd90fd63/lm_data_merged_4_percent.txt').as_mount(),
    #'--data_dir': 'gold_teams_data', #update for golden data
    '--bert_model': ds_model.as_mount(),
    '--output_dir':ds.path(f'pretrained-model/output_merged_unlabeled_4_percent').as_mount(),
    '--do_train':'',
    '--learning_rate':'5e-5',#larger learning rate given init
    '--num_train_epochs':'2',
   # '--warmup_proportion':'0.1',
    '--max_seq_length':'32',
    '--train_batch_size':'64'
    #'--train_batch_size':'16' #smaller batch for low resource
    
    #'--multi_gpu':'',
}

estimator10 = PyTorch(source_directory='..', 
                    script_params=script_params,
                    compute_target=compute_target, 
                    entry_script='simple_lm_finetune.py',
                    #pip_packages=['allennlp','pandas','tensorflow','pytorch-pretrained-bert'],
                    #pip_packages=['pandas','pytorch-pretrained-bert==0.4.0','seqeval==0.0.5'],
                    pip_packages=['pandas','pytorch-pretrained-bert==0.6.1','seqeval==0.0.5'],
                    use_gpu=True)
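
The contents of `simple_lm_finetune.py` are not shown in this notebook. Assuming it follows the standard BERT masked-language-model recipe, the input corruption can be sketched as follows: roughly 15% of the tokens are selected, and a selected token is replaced by `[MASK]` 80% of the time, by a random vocabulary token 10% of the time, and left unchanged otherwise. The function name, the tiny vocabulary, and the sentence are hypothetical.

```python
import random
from typing import List, Optional, Tuple


def mask_tokens(tokens: List[str],
                vocab: List[str],
                rng: random.Random,
                mask_prob: float = 0.15) -> Tuple[List[str], List[Optional[str]]]:
    """Return (corrupted tokens, labels); labels are None where nothing is predicted."""
    corrupted: List[str] = []
    labels: List[Optional[str]] = []
    for token in tokens:
        if rng.random() < mask_prob:
            # This position is selected: the model must predict the original token.
            labels.append(token)
            roll = rng.random()
            if roll < 0.8:
                corrupted.append("[MASK]")
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))
            else:
                corrupted.append(token)
        else:
            # Unselected positions are passed through and carry no label.
            corrupted.append(token)
            labels.append(None)
    return corrupted, labels


vocab = ["call", "message", "send", "to", "team", "hello"]
sentence = ["send", "a", "message", "to", "the", "team"] * 5
corrupted, labels = mask_tokens(sentence, vocab, random.Random(0))
print(corrupted)
print(labels)
```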

## Find and register the best model
Once all the runs complete, we can find the run that produced the model with the highest evaluation F1 score.

ref: https://github.com/microsoft/AzureML-BERT/blob/master/finetune/PyTorch/notebooks/Pretrained-BERT-NER.ipynb

In [None]:
best_run = hd_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
print(best_run)
print('Best Run is:\n  F1: {0:.5f} \n  Learning rate: {1:.8f}'.format(
        best_run_metrics['best_val_f1'][-1],
        best_run_metrics['lr']
     ))
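
What `get_best_run_by_primary_metric()` does conceptually can be sketched offline: compare the primary metric across child runs and keep the highest one. The run IDs and metric values below are made up, and `pick_best_run` is a hypothetical helper, not part of the Azure ML SDK.

```python
from typing import Dict, List, Union

Metrics = Dict[str, Union[float, List[float]]]


def pick_best_run(metrics_by_run: Dict[str, Metrics],
                  metric_name: str = "best_val_f1") -> str:
    """Return the ID of the run whose last logged metric value is highest."""
    def final_value(metrics: Metrics) -> float:
        value = metrics[metric_name]
        # A metric logged several times comes back as a list; use the last entry.
        return value[-1] if isinstance(value, list) else value

    return max(metrics_by_run, key=lambda run_id: final_value(metrics_by_run[run_id]))


# Made-up metrics for three child runs.
metrics_by_run = {
    "run_0": {"best_val_f1": [0.81, 0.84, 0.86]},
    "run_1": {"best_val_f1": [0.79, 0.88, 0.90]},
    "run_2": {"best_val_f1": 0.87},
}
best_run_id = pick_best_run(metrics_by_run)
print(best_run_id, metrics_by_run[best_run_id]["best_val_f1"])
```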


## Read data

In [None]:
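ds = ws.get_default_datastore()
sample_path = 'datasets/Transfer_set/testing_sample_input.txt'
# as_mount() only resolves to a real path on the compute target,
# so download the file first in order to read it locally
ds.download(target_path='.', prefix=sample_path, overwrite=True)
with open(sample_path) as file:
    for line in file:
        print(line.rstrip())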

# To do

Fine-Tuning BERT with Hyperparameter Tuning

Fine-Tuning BERT with multiple GPUs

Generate azure pipeline


Link to the blob and datastore account

In [None]:
from azureml.core import Workspace, Datastore
# Default datastore 
def_data_store = ws.get_default_datastore()


# The following call GETS the Azure Blob Store associated with your workspace.
# Note that workspaceblobstore is **the name of this store and CANNOT BE CHANGED and must be used as is** 
def_blob_store = Datastore(ws, "workspaceblobstore")
print("Blobstore's name: {}".format(def_blob_store.name))

# Get file storage associated with the workspace
#it will be put into files share folder in the workspace?
def_file_store = Datastore(ws, "workspacefilestore")




In [None]:
def_data_store.name

In [None]:
def_blob_store.name

In [None]:
def_file_store.name

## Upload data to a specific blob

We need to upload the data to a specific blob.

In [None]:
#upload local model
model_path_on_datastore = 'Teams_slot_model' #cased model,vocab is too small? Do not have frequent word like common
ds_model = def_file_store.path(model_path_on_datastore)  # reference the store that receives the upload below
def_file_store.upload(src_dir=r'D:\dl_repo\Tagging_data\bert-base-English-cased-pytorch',
          target_path= model_path_on_datastore,
          overwrite=False,
          show_progress=True)
print(ds_model.as_mount())

In [None]:
from azureml.pipeline.core import PipelineData

# Define intermediate data using PipelineData
# Syntax

# PipelineData(name, 
#              datastore=None, 
#              output_name=None, 
#              output_mode='mount', 
#              output_path_on_compute=None, 
#              output_overwrite=None, 
#              data_type=None, 
#              is_directory=None)


output_dir = PipelineData("results",datastore=def_file_store)

In [None]:
output_dir

In [None]:
ds_model

## Register new datastore

In [None]:

# Register the datastore with the workspace
ds = Datastore.register_azure_blob_container(workspace=ws, 
                                             datastore_name='BERT_Preprocessed_Data',
                                             container_name='data',
                                             account_name='<name goes here>', 
                                             account_key='<key goes here>'
                                            )

# Help from: https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data