Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Distributed Training For Extractive Summarization on CNN/DM Dataset

## Summary
This notebook demonstrates how to use Azure Machine Learning to run distributed training with Distributed Data Parallel in PyTorch. For more detailed model-related information, see [extractive_summarization_cnndm_transformer.ipynb](extractive_summarization_cnndm_transformer.ipynb).
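
As a rough illustration of the data-parallel idea, each training process works on its own slice of the dataset. The sketch below assumes the round-robin split that PyTorch's `DistributedSampler` applies when shuffling is disabled; the helper name `shard_indices` is hypothetical, and the entry script used in this notebook may split the data differently.

```python
import math


def shard_indices(num_samples, world_size, rank):
    """Return the sample indices that one process would train on.

    Hypothetical helper: pads the index list by wrapping around so that it
    divides evenly, then gives each rank every world_size-th index.
    """
    # Round the total up to a multiple of world_size so all ranks get equal work.
    total = math.ceil(num_samples / world_size) * world_size
    padded = [i % num_samples for i in range(total)]
    return padded[rank::world_size]


if __name__ == "__main__":
    # Four processes (for example, one per node) sharing ten samples.
    for rank in range(4):
        print(rank, shard_indices(10, world_size=4, rank=rank))
```

With ten samples and four processes, two indices are repeated so that every process sees three samples per epoch.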

## Prerequisites
If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, if you have not already done so, go through the [Configuration Notebook](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) first to establish your connection to the AzureML Workspace. The prerequisites are:

- Azure subscription
- Azure Machine Learning Workspace
- Azure Machine Learning SDK

To run the ROUGE evaluation, refer to the compute_rouge_perl section of [summarization_evaluation.ipynb](summarization_evaluation.ipynb).

## Create AML Workspace 

In [1]:
# Check core SDK version number
import azureml.core

print("SDK version:", azureml.core.VERSION)

Failure while loading azureml_run_type_providers. Failed to load entrypoint azureml.PipelineRun = azureml.pipeline.core:PipelineRun._from_dto with exception (azureml-core 1.0.83 (/dadendev/anaconda3/envs/cm3/lib/python3.6/site-packages), Requirement.parse('azureml-core==1.0.57.*')).
Failure while loading azureml_run_type_providers. Failed to load entrypoint azureml.ReusedStepRun = azureml.pipeline.core:StepRun._from_reused_dto with exception (azureml-core 1.0.83 (/dadendev/anaconda3/envs/cm3/lib/python3.6/site-packages), Requirement.parse('azureml-core==1.0.57.*')).
Failure while loading azureml_run_type_providers. Failed to load entrypoint azureml.StepRun = azureml.pipeline.core:StepRun._from_dto with exception (azureml-core 1.0.83 (/dadendev/anaconda3/envs/cm3/lib/python3.6/site-packages), Requirement.parse('azureml-core==1.0.57.*')).


SDK version: 1.0.83


In [2]:
import os
import sys

nlp_path = os.path.abspath('../../')
if nlp_path not in sys.path:
    sys.path.insert(0, nlp_path)
    
from utils_nlp.azureml import azureml_utils

from azureml.core import Experiment, Workspace, Run
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.train.dnn import PyTorch
from azureml.widgets import RunDetails
from azureml.train.dnn import Nccl

In [3]:
## Replace the following constants with your own 
WORKSPACE_NAME = "daden1amlws"
SUBSCRIPTION_ID = "9086b59a-02d7-4687-b3fd-e39fa5e0fd9b"
RESOURCE_GROUP = "daden1aml"
LOCATION = "eastus2"

In [4]:
# Create the workspace using the specified parameters
ws = Workspace.create(name = WORKSPACE_NAME,
                      subscription_id = SUBSCRIPTION_ID,
                      resource_group = RESOURCE_GROUP, 
                      location = LOCATION,
                      create_resource_group = False,
                      exist_ok = True)
ws.get_details()

# Write the workspace details to a configuration file in the notebook directory
ws.write_config()

In [5]:
# Retrieve the workspace from the configuration file written above
# (Workspace.setup() is deprecated in favor of get()/from_config())
ws = Workspace.from_config()

# Print the workspace attributes
print('Workspace name: ' + ws.name, 
      'Workspace region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')


Workspace name: daden1amlws
Workspace region: eastus2
Subscription id: 9086b59a-02d7-4687-b3fd-e39fa5e0fd9b
Resource group: daden1aml


## Create Compute Cluster

In [6]:
## Replace the following constants with your own 
AMLCOMPUTE_CLUSTER_NAME = "extsum5"
NODE_COUNT = 4
VM_SIZE = 'STANDARD_NC6'

In [7]:
try:
    gpu_compute_target = ComputeTarget(workspace=ws, name=AMLCOMPUTE_CLUSTER_NAME)
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new compute target...')
    # scale idle nodes down after 1200 seconds (20 minutes)
    compute_config = AmlCompute.provisioning_configuration(vm_size=VM_SIZE,
                                                           max_nodes=NODE_COUNT,
                                                           idle_seconds_before_scaledown=1200)

    # create the cluster
    gpu_compute_target = ComputeTarget.create(ws, AMLCOMPUTE_CLUSTER_NAME, compute_config)

    gpu_compute_target.wait_for_completion(show_output=True)

# use get_status() to get a detailed status for the current AmlCompute. 
print(gpu_compute_target.get_status().serialize())

Found existing compute target.
{'currentNodeCount': 4, 'targetNodeCount': 4, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 4, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2020-01-30T05:01:54.385000+00:00', 'errors': None, 'creationTime': '2020-01-23T04:50:26.160743+00:00', 'modifiedTime': '2020-01-23T20:31:35.349184+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT1200S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC6'}


## Create Experiment

In [8]:
## Replace the following constants with your own 
EXPERIMENT_NAME = 'NLP-ExtSum'

In [9]:
experiment = Experiment(ws, name=EXPERIMENT_NAME)


## Download dataset to local file system

In [40]:
## local folder to save the downloaded data
LOCAL_DATA_FOLDER = '/dadendev/bertsumdata/'

In [42]:
!mkdir -p {LOCAL_DATA_FOLDER}

In [43]:
from utils_nlp.dataset.cnndm import CNNDMBertSumProcessedData, CNNDMSummarizationDataset
CNNDMBertSumProcessedData.download(local_path=LOCAL_DATA_FOLDER)

bertsum_data.zip: 869MB [00:25, 33.5MB/s] 


'/dadendev/bertsumdata/'

## Upload the downloaded dataset to AML workspace

In [44]:
## folder in the workspace where the data is uploaded to
TARGET_DATA_FOLDER = '/bertsumdata'

In [45]:
ds = ws.get_default_datastore()

In [46]:
ds.upload(src_dir=LOCAL_DATA_FOLDER, target_path=TARGET_DATA_FOLDER)

Uploading an estimated of 158 files
Uploading /dadendev/bertsumdata/cnndm.train.22.bert.pt
Uploading /dadendev/bertsumdata/cnndm.train.78.bert.pt
Uploading /dadendev/bertsumdata/cnndm.train.88.bert.pt
Uploading /dadendev/bertsumdata/cnndm.test.1.bert.pt
Uploading /dadendev/bertsumdata/cnndm.train.51.bert.pt
Uploading /dadendev/bertsumdata/cnndm.train.5.bert.pt
Uploading /dadendev/bertsumdata/cnndm.train.120.bert.pt
Uploading /dadendev/bertsumdata/cnndm.train.114.bert.pt
Uploading /dadendev/bertsumdata/cnndm.train.140.bert.pt
Uploading /dadendev/bertsumdata/cnndm.train.87.bert.pt
Uploading /dadendev/bertsumdata/cnndm.train.104.bert.pt
Uploading /dadendev/bertsumdata/cnndm.train.94.bert.pt
Uploading /dadendev/bertsumdata/cnndm.train.59.bert.pt
Uploading /dadendev/bertsumdata/bertsum_data.zip
Uploading /dadendev/bertsumdata/cnndm.train.30.bert.pt
Uploading /dadendev/bertsumdata/cnndm.train.44.bert.pt
Uploading /dadendev/bertsumdata/cnndm.train.73.bert.pt
Uploading /dadendev/bertsumdata/cn

Uploaded /dadendev/bertsumdata/cnndm.train.10.bert.pt, 45 files out of an estimated total of 158
Uploading /dadendev/bertsumdata/cnndm.train.130.bert.pt
Uploaded /dadendev/bertsumdata/cnndm.train.101.bert.pt, 46 files out of an estimated total of 158
Uploading /dadendev/bertsumdata/cnndm.train.116.bert.pt
Uploaded /dadendev/bertsumdata/cnndm.train.117.bert.pt, 47 files out of an estimated total of 158
Uploading /dadendev/bertsumdata/cnndm.train.13.bert.pt
Uploaded /dadendev/bertsumdata/cnndm.train.118.bert.pt, 48 files out of an estimated total of 158
Uploading /dadendev/bertsumdata/cnndm.train.15.bert.pt
Uploaded /dadendev/bertsumdata/cnndm.train.20.bert.pt, 49 files out of an estimated total of 158
Uploading /dadendev/bertsumdata/cnndm.train.19.bert.pt
Uploaded /dadendev/bertsumdata/cnndm.train.25.bert.pt, 50 files out of an estimated total of 158
Uploaded /dadendev/bertsumdata/cnndm.train.141.bert.pt, 51 files out of an estimated total of 158
Uploading /dadendev/bertsumdata/cnndm.tr

Uploaded /dadendev/bertsumdata/cnndm.train.46.bert.pt, 99 files out of an estimated total of 158
Uploading /dadendev/bertsumdata/cnndm.test.5.bert.pt
Uploaded /dadendev/bertsumdata/cnndm.train.127.bert.pt, 100 files out of an estimated total of 158
Uploaded /dadendev/bertsumdata/cnndm.train.134.bert.pt, 101 files out of an estimated total of 158
Uploading /dadendev/bertsumdata/cnndm.train.32.bert.pt
Uploaded /dadendev/bertsumdata/cnndm.train.122.bert.pt, 102 files out of an estimated total of 158
Uploading /dadendev/bertsumdata/cnndm.train.48.bert.pt
Uploading /dadendev/bertsumdata/cnndm.train.62.bert.pt
Uploaded /dadendev/bertsumdata/cnndm.train.12.bert.pt, 103 files out of an estimated total of 158
Uploading /dadendev/bertsumdata/cnndm.train.69.bert.pt
Uploaded /dadendev/bertsumdata/cnndm.train.63.bert.pt, 104 files out of an estimated total of 158
Uploading /dadendev/bertsumdata/cnndm.train.4.bert.pt
Uploaded /dadendev/bertsumdata/cnndm.train.90.bert.pt, 105 files out of an estimate

$AZUREML_DATAREFERENCE_91f544a88b9a404d90ccc25479e1d77a

## Prepare the local project folder, which is mirrored to the workspace for the experiment

In [55]:
## local folder to store all the related files to be copied to the workspace
PROJECT_FOLDER = './azureml_exp'
## conda environment name, the yaml file will be copied to the workspace
CONDA_ENV_NAME = "nlp_gpu"

In [56]:
ENTRY_SCRIPT = "extractive_summarization_cnndm_distributed_train.py"

In [92]:
!mkdir -p {PROJECT_FOLDER}
!python ../../tools/generate_conda_file.py --gpu --name {CONDA_ENV_NAME}
!cp ./{CONDA_ENV_NAME}.yaml {PROJECT_FOLDER}
!cp {ENTRY_SCRIPT} {PROJECT_FOLDER}
!cp -r ../../utils_nlp {PROJECT_FOLDER}


Generated conda file: nlp_gpu.yaml

To create the conda environment:
$ conda env create -f nlp_gpu.yaml

To update the conda environment:
$ conda env update -f nlp_gpu.yaml

To register the conda environment in Jupyter:
$ conda activate nlp_gpu
$ python -m ipykernel install --user --name nlp_gpu --display-name "Python (nlp_gpu)"



## Submit Run

In [93]:
#AZUREML_CONFIG_PATH = "./.azureml"
## pretrained model name and encoder type passed to the entry script
MODEL_NAME = "distilbert-base-uncased"
ENCODER = "transformer"
## output dir in the workspace
TARGET_OUTPUT_DIR = f'output/{EXPERIMENT_NAME}/'
## cache dir in the workspace
TARGET_CACHE_DIR = f'cache/{EXPERIMENT_NAME}/'
## file name for saving the prediction
SUMMARY_FILENAME = "generated_summaries.txt"
## file name for saving the trained model
MODEL_FILENAME = "dist_extsum.pt"

## local path to download the output from the cluster
LOCAL_OUTPUT_DIR = './output'

In [59]:
os.makedirs(LOCAL_OUTPUT_DIR, exist_ok=True)
os.makedirs(os.path.join(LOCAL_OUTPUT_DIR, EXPERIMENT_NAME), exist_ok=True)

In [94]:
nccl_config = Nccl()
estimator = PyTorch(source_directory=PROJECT_FOLDER,
                    compute_target=gpu_compute_target,
                    script_params={
                        # distributed-training settings resolved on the cluster
                        "--dist_url": "$AZ_BATCHAI_PYTORCH_INIT_METHOD",
                        "--rank": "$AZ_BATCHAI_TASK_INDEX",
                        "--node_count": NODE_COUNT,
                        # datastore folders mounted on every node
                        "--data_dir": ds.path(TARGET_DATA_FOLDER).as_mount(),
                        "--cache_dir": ds.path(TARGET_CACHE_DIR).as_mount(),
                        "--output_dir": ds.path(TARGET_OUTPUT_DIR).as_mount(),
                        "--quick_run": "true",
                        "--use_preprocessed_data": "true",
                        "--summary_filename": SUMMARY_FILENAME,
                        "--model_filename": MODEL_FILENAME,
                        "--model_name": MODEL_NAME,
                        "--encoder": ENCODER
                    },
                    entry_script=ENTRY_SCRIPT,
                    node_count=NODE_COUNT,
                    distributed_training=nccl_config,
                    conda_dependencies_file=f'{CONDA_ENV_NAME}.yaml',
                    use_gpu=True)
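
The entry script itself is not shown in this notebook. As a minimal sketch of how a training script might consume the `--dist_url`, `--rank`, and `--node_count` parameters passed above, the example below parses them with `argparse` and derives a global rank and world size. The function names, the default values, and the `gpus_per_node` and `local_gpu` arguments are assumptions made for illustration; they are not taken from `extractive_summarization_cnndm_distributed_train.py`.

```python
import argparse


def parse_args(argv=None):
    """Parse only the distributed-training flags; other flags are ignored."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--dist_url", type=str, default="tcp://127.0.0.1:29500")
    parser.add_argument("--rank", type=int, default=0)  # index of this node
    parser.add_argument("--node_count", type=int, default=1)
    args, _unknown = parser.parse_known_args(argv)
    return args


def process_layout(node_rank, node_count, gpus_per_node, local_gpu):
    """Return (global_rank, world_size) for one process.

    Hypothetical layout: one process per GPU, numbered node by node.
    """
    world_size = node_count * gpus_per_node
    global_rank = node_rank * gpus_per_node + local_gpu
    return global_rank, world_size


if __name__ == "__main__":
    args = parse_args(["--rank", "2", "--node_count", "4"])
    print(process_layout(args.rank, args.node_count, gpus_per_node=1, local_gpu=0))
```

In a real script, values like these would typically be handed to `torch.distributed.init_process_group` as `init_method`, `rank`, and `world_size`.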



In [95]:
run = experiment.submit(estimator)

In [96]:

RunDetails(run).show()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

## Download Generated Summaries 

In [None]:
"""
If you stop the notebook and come back,
you'll need to use the run_id from the output of the previous cell
to get the run details.
"""
# fetched_run = Run(experiment, "NLP-ExtSum_1579816237_ea238f69")
# RunDetails(fetched_run).show()

In [None]:
# clear the local output dir first, because ds.download won't download a file if the path already exists
!rm -rf {LOCAL_OUTPUT_DIR}/* 

In [None]:
ds.download(target_path=LOCAL_OUTPUT_DIR,
            prefix=f'{TARGET_OUTPUT_DIR}{SUMMARY_FILENAME}',
            show_progress=True)

## Evaluation

In [76]:
from utils_nlp.eval.evaluate_summarization import get_rouge
from utils_nlp.models.transformers.extractive_summarization import ExtSumProcessedData

In [77]:

train_dataset, test_dataset = ExtSumProcessedData().splits(root=LOCAL_DATA_FOLDER)

In [85]:
target = [i['tgt_txt'] for i in test_dataset]

In [None]:
prediction = []
with open(os.path.join(LOCAL_OUTPUT_DIR, f'{TARGET_OUTPUT_DIR}{SUMMARY_FILENAME}'), "r") as filehandle:
    for line in filehandle:
        prediction.append(line.rstrip("\n"))  # remove the ending "\n"

In [97]:
## if the run was a quick run that did not use the preprocessed data, the number of
## predictions will not match the targets; download the saved model and run prediction locally
import pickle
from utils_nlp.models.transformers.extractive_summarization import ExtractiveSummarizer
if len(prediction) != len(target):
    ds.download(target_path=LOCAL_OUTPUT_DIR,
                prefix=f'{TARGET_OUTPUT_DIR}{MODEL_FILENAME}',
                show_progress=True)
    # load the pickled model that was downloaded from the datastore
    with open(os.path.join(LOCAL_OUTPUT_DIR, f'{TARGET_OUTPUT_DIR}{MODEL_FILENAME}'), "rb") as model_file:
        model = pickle.load(model_file)
    summarizer = ExtractiveSummarizer(MODEL_NAME, ENCODER, LOCAL_OUTPUT_DIR)
    summarizer.model = model
    prediction = summarizer.predict(test_dataset, num_gpus=4, batch_size=128)


In [63]:
ds.download(target_path=LOCAL_OUTPUT_DIR,
            prefix="output/NLP-ExtSum/dist_extsum_model.pt",
            show_progress=True)

Downloading output/NLP-ExtSum/dist_extsum_model.pt
Downloaded output/NLP-ExtSum/dist_extsum_model.pt, 1 files out of an estimated total of 1


1

Scoring: 100%|██████████| 90/90 [00:41<00:00,  3.71it/s]


In [89]:
test_dataset[0]['src_txt']

['turkey has blocked access to twitter and youtube after they refused a request to remove pictures of a prosecutor held during an armed siege last week .',
 "a turkish court imposed the blocks because images of the deadly siege were being shared on social media and ` deeply upset ' the wife and children of mehmet selim kiraz , the hostage who was killed .",
 "the 46-year-old turkish prosecutor died in hospital when members of the revolutionary people 's liberation party-front ( dhkp-c ) stormed a courthouse and took him hostage .",
 'the dhkp-c is considered a terrorist group by turkey , the european union and us .',
 'a turkish court has blocked access to twitter and youtube after they refused a request to remove pictures of prosecutor mehmet selim kiraz held during an armed siege last week',
 'grief : the family of mehmet selim kiraz grieve over his coffin during his funeral at eyup sultan mosque in istanbul , turkey .',
 'he died in hospital after he was taken hostage by the far-lef

In [83]:
prediction[0]

"after spending 269 days in a coma , elvan eventually died on march 11 last year .<q>the 15-year-old was severely wounded after being hit on the head by a tear-gas canister fired by a police officer during anti-government protests in istanbul in june 2013 .<q>his death , and the subsequent investigation , have since become a rallying point for the country 's far-left ."

In [86]:
target[0]

"turkish court imposed blocks as images of siege shared on social media<q>images ` deeply upset ' wife and children of hostage mehmet selim kiraz<q>prosecutor , 46 , died in hospital after hostages stormed a courthouse<q>two of his captors were killed when security forces took back the building"

In [87]:
RESULT_DIR = './testrouge'

In [88]:
rouge_score = get_rouge(prediction, target, RESULT_DIR)

11489
11489


2020-01-30 06:42:07,875 [MainThread  ] [INFO ]  Writing summaries.
INFO:global:Writing summaries.
2020-01-30 06:42:07,877 [MainThread  ] [INFO ]  Processing summaries. Saving system files to ./testrouge/tmpml_l__bp/system and model files to ./testrouge/tmpml_l__bp/model.
INFO:global:Processing summaries. Saving system files to ./testrouge/tmpml_l__bp/system and model files to ./testrouge/tmpml_l__bp/model.
2020-01-30 06:42:07,878 [MainThread  ] [INFO ]  Processing files in ./testrouge/rouge-tmp-2020-01-30-06-42-06/candidate/.
INFO:global:Processing files in ./testrouge/rouge-tmp-2020-01-30-06-42-06/candidate/.
2020-01-30 06:42:09,028 [MainThread  ] [INFO ]  Saved processed files to ./testrouge/tmpml_l__bp/system.
INFO:global:Saved processed files to ./testrouge/tmpml_l__bp/system.
2020-01-30 06:42:09,030 [MainThread  ] [INFO ]  Processing files in ./testrouge/rouge-tmp-2020-01-30-06-42-06/reference/.
INFO:global:Processing files in ./testrouge/rouge-tmp-2020-01-30-06-42-06/reference/.


---------------------------------------------
1 ROUGE-1 Average_R: 0.32121 (95%-conf.int. 0.31865 - 0.32394)
1 ROUGE-1 Average_P: 0.26093 (95%-conf.int. 0.25874 - 0.26303)
1 ROUGE-1 Average_F: 0.27434 (95%-conf.int. 0.27243 - 0.27641)
---------------------------------------------
1 ROUGE-2 Average_R: 0.08593 (95%-conf.int. 0.08408 - 0.08778)
1 ROUGE-2 Average_P: 0.06877 (95%-conf.int. 0.06732 - 0.07025)
1 ROUGE-2 Average_F: 0.07264 (95%-conf.int. 0.07112 - 0.07414)
---------------------------------------------
1 ROUGE-L Average_R: 0.28730 (95%-conf.int. 0.28497 - 0.28973)
1 ROUGE-L Average_P: 0.23408 (95%-conf.int. 0.23206 - 0.23607)
1 ROUGE-L Average_F: 0.24574 (95%-conf.int. 0.24392 - 0.24768)
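
To make the R/P/F columns above more concrete, here is a minimal sketch of ROUGE-1 for a single prediction and target pair, using clipped unigram overlap. It is only an approximation: the ROUGE Perl script behind `get_rouge` applies its own tokenization and options and reports bootstrap confidence intervals, none of which are reproduced here. The function name `rouge_1` and the handling of the `<q>` sentence separator are assumptions made for this illustration.

```python
from collections import Counter


def rouge_1(candidate, reference, sep="<q>"):
    """Return (recall, precision, f1) from clipped unigram overlap."""
    # Treat the sentence separator as whitespace and split into tokens.
    cand_tokens = candidate.replace(sep, " ").split()
    ref_tokens = reference.replace(sep, " ").split()
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    precision = overlap / len(cand_tokens) if cand_tokens else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f1


if __name__ == "__main__":
    print(rouge_1("the cat sat<q>on the mat", "the cat lay on a mat"))
```

Scores like these are computed per document and then averaged, which would explain why the reported Average_F is not the harmonic mean of Average_R and Average_P.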

