# Continual training for DSAT queries

In this notebook we show how to resolve DSAT queries via continual training of a pre-trained student network.

In [298]:
import azureml.core
print("SDK version:", azureml.core.VERSION)


SDK version: 1.0.85


# Connect to Workspace and select GPU cluster

If there is no existing cluster, create one.

In [299]:
from azureml.core import Workspace

#subscription_id = "4a66f470-dd54-4c5e-bd19-8cb65a426003"
#resource_group  = "AML_Playground"
#workspace_name  = "Teams_ws"

subscription_id = "ddb33dc4-889c-4fa1-90ce-482d793d6480"
resource_group = "DevExp"
workspace_name = "DevExperimentation"
try:
    ws = Workspace(subscription_id = subscription_id, resource_group = resource_group, workspace_name = workspace_name)
    ws.write_config()
    print('Library configuration succeeded')
    print('https://ms.portal.azure.com/#@microsoft.onmicrosoft.com/resource' + ws.get_details()['id'])
except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
    print('Workspace not found')



Library configuration succeeded
https://ms.portal.azure.com/#@microsoft.onmicrosoft.com/resource/subscriptions/ddb33dc4-889c-4fa1-90ce-482d793d6480/resourceGroups/DevExp/providers/Microsoft.MachineLearningServices/workspaces/DevExperimentation


In [300]:
from azureml.core.compute import AmlCompute, ComputeTarget

#cluster_name = "p100cluster"
cluster_name = "P100-SingleGPU"

try:
    compute_target = ws.compute_targets[cluster_name]
    print('Found existing compute target.')
except KeyError:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_NC6s_v2', 
                                                           idle_seconds_before_scaledown=1800,
                                                           min_nodes=0, 
                                                           max_nodes=10)
    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=True)

Found existing compute target.


# Connect to Datastore and upload local data

When your data and model are large, create a separate Datastore for them.

Otherwise AML may raise errors and you will not be able to track your outputs.

Each workspace is associated with a default Azure Blob datastore named 'workspaceblobstore'. In this notebook we instead use a separately registered datastore to store our data.

In [301]:
#ds = ws.get_default_datastore()
from azureml.core import Workspace, Datastore

ds = Datastore.get(ws, datastore_name='compliant_lu_haochu')


# Create an experiment
Create an Experiment to track all the runs in your workspace. 


In [26]:
from azureml.core import Experiment

experiment_name = 'Communication_Lu_DSAT_fix' 
experiment = Experiment(ws, name=experiment_name)

# Submit your Job
The following section creates a PyTorch estimator, in which you can easily specify your parameters.

When you submit your job, your local repo is automatically uploaded to the cloud cluster.

You can also submit TensorFlow or Keras jobs.
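
The parameters given to the estimator reach the entry script as command-line arguments. The sketch below shows, under stated assumptions, how a script might receive them: `flatten_params` and `build_parser` are hypothetical helpers, the parser covers only a few of the flags used in this notebook, and it is assumed that an empty-string value is passed as a bare flag. It is not the actual parser in `src/Continual_training.py`.

```python
import argparse


def flatten_params(params):
    """Turn a {flag: value} dict into an argv-style list.

    Assumption: an empty-string value means the flag takes no argument.
    """
    argv = []
    for flag, value in params.items():
        argv.append(flag)
        if value != "":
            argv.append(str(value))
    return argv


def build_parser():
    """Hypothetical parser for a small subset of the flags in this notebook."""
    parser = argparse.ArgumentParser(description="Illustrative entry-script parser")
    parser.add_argument("--task_name", type=str, required=True)
    parser.add_argument("--learning_rate", type=float, default=5e-5)
    parser.add_argument("--num_train_epochs", type=int, default=3)
    # Boolean switches: present means True, absent means False.
    parser.add_argument("--do_eval", action="store_true")
    parser.add_argument("--do_lower_case", action="store_true")
    return parser


example_params = {
    "--task_name": "ner",
    "--learning_rate": "1e-5",
    "--num_train_epochs": "50",
    "--do_eval": "",
}
args = build_parser().parse_args(flatten_params(example_params))
print(args.task_name, args.learning_rate, args.num_train_epochs, args.do_eval)
```

Values arrive as strings, so each argument that should be numeric needs an explicit `type`, and switches such as `--do_eval` are best declared with `action="store_true"`.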

## Supervised Finetuning

With pre-trained model and labeled data

In [319]:
##BATCH AI
from azureml.train.dnn import PyTorch



script_params = {
   
    '--data_dir': ds.path(f'datasets/Communication_prod_data').as_mount(),     
    #'--valid_dir':ds.path(f'datasets/Communication_prod_data/Mustpass_full.txt').as_mount(),
    #'--train_dir':ds.path(f'datasets/Communication_prod_data/Mustpass_full.txt').as_mount(),
    # forward slashes: the job runs on a Linux cluster, and '\M' is an invalid escape sequence
    '--train_dir':'sample_data/Mustpass.txt',
    '--valid_dir':'sample_data/Mustpass.txt',
    
    # NOTE: the file names below look swapped relative to the flag names (no_contact vs. contact); verify before running
    '--test_generated_dir':ds.path(f'datasets/Teams_communication/generated_data/communication_message_generated_no_contact.txt').as_mount(),
    '--test_generated_no_contact_dir':ds.path(f'datasets/Teams_communication/generated_data/communication_message_generated_contact.txt').as_mount(),
    '--target_set_dir':ds.path(f'datasets/Communication_prod_data/Mustpass_full.txt').as_mount(),
    #'--target_set_dir':ds.path(f'datasets/Teams_communication/Target_set_message_new_conll.txt').as_mount(),
    
    
    ##for uncased longer teacher
    '--student_model_dir': ds.path(f'Communication_student_model/MV4_model_cleanup').as_mount(),
    '--teacher_model_path':ds.path(f'bert_data/uncased_model/outputs_base_uncased_no_basic_tokenizer').as_mount(),#from prod labeled sources
    
    '--bert_model':'bert-base-uncased', 
    '--do_lower_case':'',
    
    
    
    '--task_name':'ner',
    '--output_dir':'./outputs',
    
    
    
    '--do_continual_training':'',
    '--do_eval':'',
    '--encoder_type':'GRU',
    '--hidden_units':'300',
    '--eval_batch_size':'1',

    '--learning_rate':'1e-5',  # small learning rate, since training starts from a pre-trained student
    '--num_train_epochs':'50',  # use more epochs when the unsupervised corpus is small
    '--warmup_proportion':'1',
    #'--max_seq_length':'32',
    '--max_seq_length':'128',
    '--train_batch_size':'5',
    "--temperature": '1',
    "--alpha": '1',  # weight for labeled data (for ablation study)
    "--beta": '1',  # weight for unlabeled data (for ablation study)
    '--unsupervised_train_corpus':ds.path(f'datasets/Teams_communication/raw_generated_data/rawquery_merged.txt').as_mount()
    

}

estimator10 = PyTorch(source_directory='..', 
                    script_params=script_params,
                    compute_target=compute_target, 
                    entry_script='src/Continual_training.py',
                    #pip_packages=['pandas','pytorch-pretrained-bert==0.4.0','seqeval==0.0.5'],
                    #pip_packages=['pandas','pytorch-pretrained-bert==0.6.1','seqeval==0.0.5','nltk'],
                    pip_packages=['pandas','pytorch-pretrained-bert==0.6.1','seqeval==0.0.5','transformers==2.1.1','nltk'],
                    use_gpu=True)
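
The `--alpha`, `--beta`, and `--temperature` parameters suggest a weighted objective that combines a loss on labeled data with a loss on unlabeled data distilled from the teacher. The actual loss in `src/Continual_training.py` is not shown in this notebook, so the following is only a sketch under that assumption, written in plain Python. All function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over temperature-scaled logits (numerically stabilised)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, gold_index):
    """Labeled-data loss: negative log-probability of the gold label."""
    return -math.log(softmax(student_logits)[gold_index])


def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Unlabeled-data loss (assumed form): cross-entropy between the
    softened teacher distribution and the softened student distribution."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))


def combined_loss(labeled_loss, unlabeled_loss, alpha=1.0, beta=1.0):
    """Weighted sum: alpha weights labeled data, beta weights unlabeled data."""
    return alpha * labeled_loss + beta * unlabeled_loss


# Toy logits over three labels, for illustration only.
labeled = cross_entropy([2.0, 0.5, -1.0], gold_index=0)
unlabeled = distillation_loss([1.0, 0.2, -0.5], [1.5, 0.1, -1.0], temperature=1.0)
total = combined_loss(labeled, unlabeled, alpha=1.0, beta=1.0)
print(total)
```

Under this reading, setting `beta` to 0 reduces the objective to plain supervised fine-tuning, and setting `alpha` to 0 leaves only the teacher signal, which is what makes the two weights convenient for an ablation study.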



# Set up your running environment

You can create your own virtual environment. The first job submission takes a while because the environment has to be built. As long as the dependencies do not change, later submissions are fast.
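
One way to picture why an unchanged dependency list keeps submissions fast: a built environment can be cached under a key derived from its specification, so an identical specification reuses the existing build. The sketch below illustrates that idea with a hash. It is not Azure ML's actual implementation, and `environment_key` is a hypothetical helper.

```python
import hashlib
import json


def environment_key(base_image, pip_packages):
    """Derive a deterministic cache key from an environment specification."""
    spec = {"base_image": base_image, "pip": sorted(pip_packages)}
    payload = json.dumps(spec, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


image = "mcr.microsoft.com/azureml/base-gpu:openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04"
packages = ["pandas", "pytorch-pretrained-bert==0.6.1", "seqeval==0.0.5",
            "transformers==2.1.1", "nltk"]

first_key = environment_key(image, packages)
same_key = environment_key(image, list(packages))
changed_key = environment_key(image, packages + ["scikit-learn"])

# An unchanged specification maps to the same key (cache hit);
# adding a package produces a new key (rebuild).
print(first_key == same_key, first_key == changed_key)
```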


In [320]:
print(estimator10.run_config.environment.docker.base_image)

mcr.microsoft.com/azureml/base-gpu:openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04


In [321]:
print(estimator10.conda_dependencies.serialize_to_string())

# Conda environment specification. The dependencies defined in this file will
# be automatically provisioned for runs with userManagedDependencies=False.

# Details about the Conda environment file format:
# https://conda.io/docs/user-guide/tasks/manage-environments.html#create-env-file-manually

name: project_environment
dependencies:
  # The python interpreter version.
  # Currently Azure ML only supports 3.5.2 and later.
- python=3.6.2

- pip:
  - pandas
  - pytorch-pretrained-bert==0.6.1
  - seqeval==0.0.5
  - transformers==2.1.1
  - nltk
  - azureml-defaults
  - torch==1.3.1
  - torchvision==0.4.1
  - horovod==0.18.1
  - tensorboard==1.14.0
  - future==0.17.1
channels:
- conda-forge



# Submit and Monitor your run

You can also find your previous runs in the Azure portal.

In [322]:
run = experiment.submit(estimator10)

..\src\bert_inference.py:321:14: F821 undefined name 'argparse'
