This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

GenSen on AML deep dive notebook (sentence similarity) #78

Merged
merged 57 commits on Jun 17, 2019

Conversation

@catherine667 (Collaborator) commented May 31, 2019

  1. This notebook serves as an introduction to an end-to-end NLP solution for sentence similarity, building one of the advanced models, GenSen, on the AzureML platform. We show the advantages of AzureML when training large NLP models with GPUs.

The notebook includes: data loading and preprocessing, training the GenSen model with distributed PyTorch and Horovod on AzureML, and tuning with HyperDrive. Evaluation and deployment will be added later. In addition, the comparison results for training and tuning on AML vs. a VM will be added once this initial PR is merged into staging.

  2. Provide refactored GenSen code in utils_nlp to make the model reusable.

We provide a distributed PyTorch with Horovod implementation of the paper, along with pre-trained models and code to evaluate these models on a variety of transfer learning benchmarks.
This code is based on the GitHub codebase from Maluuba, but we have refactored it in the following ways:

  1. Support distributed training with PyTorch and Horovod (a minimal sketch follows this list)
  2. Clean up and refactor the original code into a more structured form
  3. Change the training script (train.py) so that instead of running indefinitely, it stops when the validation loss reaches a local minimum
  4. Update the code from Python 2.7 to 3+ and PyTorch from 0.2/0.3 to 1.0.1
  5. Add necessary comments
  6. Add code for training on the AzureML platform
  7. Fix a bug where training raised an error when the batch size was set to 1
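For items 1 and 3, a minimal sketch of the Horovod/PyTorch wiring and the validation-loss early stop (a generic illustration, not the refactored train.py; the toy model, dummy data, and patience value are placeholders):

```python
import torch
import torch.nn as nn
import horovod.torch as hvd

hvd.init()                                      # one worker process per GPU
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())     # pin each worker to its GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 1).to(device)             # toy stand-in for the GenSen model
# Scale the learning rate by the number of workers, as Horovod recommends
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4 * hvd.size())
# Average gradients across all workers on every step
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)
# Start every worker from identical weights
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

best_val, patience = float("inf"), 3            # simple early-stopping state
for epoch in range(100):
    x = torch.randn(32, 10, device=device)      # dummy training batch
    y = torch.randn(32, 1, device=device)
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()

    with torch.no_grad():                       # dummy validation batch
        val = nn.functional.mse_loss(
            model(torch.randn(32, 10, device=device)),
            torch.randn(32, 1, device=device),
        ).item()
    if val < best_val:
        best_val, patience = val, 3
    else:
        patience -= 1                           # stop at a local minimum of
        if patience == 0:                       # the validation loss (item 3)
            break
```

In the actual script the dummy batches would of course come from the preprocessed dataset, and the stopping check would typically be driven from rank 0.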

@review-notebook-app

Check out this pull request on ReviewNB: https://app.reviewnb.com/microsoft/nlp/pull/78

@miguelgfierro (Member) commented Jun 3, 2019

This is really good. Several questions:

  1. Have you talked with anyone in Maluuba about this? It would be great if they could give us feedback.

  2. We are not uploading images to the repo, see the discussion here. I uploaded your image here: https://nlpbp.blob.core.windows.net/images/seq2seq.png

  3. Does your code allow training without Horovod, just on a single GPU or multi-GPU?

At a bare minimum, I would split the notebook into two: one for training and one for hyperparameter tuning. In reco we are doing this.

@heatherbshapiro (Contributor) left a comment

It would be good to explicitly call out why users should use this and how AzureML helps. Does HyperDrive improve the accuracy? We spin up the GPU compute for you, and it would be very difficult to run without it, etc.

@catherine667 reopened this Jun 3, 2019
@catherine667 (Collaborator, Author) commented Jun 3, 2019


@miguelgfierro Here are the replies to your questions above:

  1. Our PM @irshaffe has tried to contact the Maluuba authors of that repo, but it seems none of them is at Microsoft right now. Judging from the repo, they have not been active this year; there are still unmerged PRs from last year.
  2. I have used the URL you provided and removed the img folder.
  3. The code only needs minor changes for local training with one or multiple GPUs. @AbhiramE is working on this, and he will raise a separate PR on local training without Horovod.
  4. We also plan to add model evaluation to the notebook. Would it be good to keep all the sections in one notebook, so that I do not have to replicate the AML settings? @saidbleik
     If we split the notebook into preprocessing + training, tuning, and evaluation, what structure would you recommend? (With a sub-folder like similarity/gensen/...?)

Thanks!

@saidbleik (Collaborator)

Great work!

  • I like that the flow is all in one notebook (although this one is a little long). Ideally, there should be one notebook showing the base example (without Horovod, AML, tuning, etc.) and one that extends the base example and shows how to do the same using AML:
    • gensen_deep_dive.ipynb
    • gensen_deep_dive_aml.ipynb
  • Things like train.py should be outside the utils_nlp folder.

@saidbleik (Collaborator)

For multi-gpu support, see here: https://github.com/microsoft/nlp/blob/bleik/utils_nlp/pytorch/device_utils.py

@catherine667 (Collaborator, Author) commented Jun 3, 2019


@saidbleik The structure you mentioned:
gensen_deep_dive.ipynb
gensen_deep_dive_aml.ipynb

That's exactly what we are planning to do. @AbhiramE is working on gensen_deep_dive.ipynb, which trains on a VM without AML. He will raise a separate PR once this PR is merged. We are currently running experiments on the performance of AML vs. VM. Once the evaluations are done, we will put the results in the README file.

I think maybe it's better to put all the sections (preprocessing, training, tuning, evaluation) in one gensen_deep_dive_aml.ipynb notebook, because our purpose is to show the whole end-to-end pipeline, and this also avoids replicating the code for setting up the AML configuration.

For train.py, do you recommend putting it in the same folder as the notebook? @saidbleik
Please let me know if you have other thoughts. Thanks!

@catherine667 (Collaborator, Author) commented Jun 3, 2019

For multi-gpu support, see here: https://github.com/microsoft/nlp/blob/bleik/utils_nlp/pytorch/device_utils.py

@saidbleik
The original code supports multiple GPUs via:
model = torch.nn.DataParallel(model, device_ids=range(n_gpus))

However, we do not have permission to use multi-GPU machines to train the model for now. We can only use Standard_NC6, which has a single GPU.
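For reference, a minimal sketch of the single-vs-multi-GPU fallback being discussed (illustrative only, not taken from device_utils.py; the helper name is hypothetical):

```python
import torch
import torch.nn as nn

def wrap_for_available_gpus(model: nn.Module) -> nn.Module:
    # Hypothetical helper: move the model to GPU when one is present, and
    # wrap it in DataParallel only when more than one GPU is visible
    # (e.g. not on a single-GPU Standard_NC6).
    if not torch.cuda.is_available():
        return model  # CPU fallback
    model = model.cuda()
    n_gpus = torch.cuda.device_count()
    if n_gpus > 1:
        model = nn.DataParallel(model, device_ids=list(range(n_gpus)))
    return model

# Usage: behaves identically on CPU, one GPU, or many GPUs.
model = wrap_for_available_gpus(nn.Linear(10, 1))
```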

@saidbleik (Collaborator)

I think maybe it's better to put all the sections (preprocessing, training, tuning, evaluation) in one gensen_deep_dive_aml.ipynb notebook, because our purpose is to show the whole end-to-end pipeline, and this also avoids replicating the code for setting up the AML configuration. Please let me know if you have other thoughts. Thanks!

This is fine, but it can be less verbose. For example, you don't need to describe GenSen in the AML version; you can link to the base notebook instead (or perhaps have this common description in the readme).

@catherine667 (Collaborator, Author)

It would be good to explicitly call out why users should use this and how AzureML helps. Does HyperDrive improve the accuracy? We spin up the GPU compute for you, and it would be very difficult to run without it, etc.

@heatherbshapiro Added section 2.3.5 to explain Horovod.
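To make the AzureML value concrete, the training-plus-tuning setup boils down to something like the following (a hedged sketch against the v1 azureml-sdk of that era; the cluster name "gpu-cluster", the metric name "val_loss", and the hyperparameter range are placeholders, and estimator parameter names changed across SDK versions):

```python
from azureml.core import Workspace
from azureml.train.dnn import PyTorch
from azureml.train.hyperdrive import (
    HyperDriveConfig, RandomParameterSampling, PrimaryMetricGoal, uniform
)

ws = Workspace.from_config()                        # assumes a config.json is present
compute_target = ws.compute_targets["gpu-cluster"]  # placeholder AmlCompute cluster

# Distributed PyTorch estimator; Horovod runs over MPI across the nodes
estimator = PyTorch(
    source_directory=".",
    entry_script="train.py",
    compute_target=compute_target,
    node_count=2,
    process_count_per_node=1,
    distributed_backend="mpi",
    use_gpu=True,
)

# HyperDrive launches copies of the estimator run over the sampled space
hd_config = HyperDriveConfig(
    estimator=estimator,
    hyperparameter_sampling=RandomParameterSampling(
        {"learning_rate": uniform(1e-5, 1e-3)}
    ),
    primary_metric_name="val_loss",                 # must be logged by train.py
    primary_metric_goal=PrimaryMetricGoal.MINIMIZE,
    max_total_runs=8,
    max_concurrent_runs=4,
)
```

Submitting hd_config through an Experiment then queues the whole sweep on the managed GPU cluster, which is the part that would be painful to reproduce by hand on a bare VM.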

@catherine667 (Collaborator, Author) commented Jun 4, 2019

Several changes have been made:

  1. Added quotes and citations to the GenSen explanation part, and corrected all the typos
  2. Fixed the upload bug and added code for using the default datastore
  3. Added explanations of the results for training and tuning
  4. Fixed the bug of training not converging

@heatherbshapiro (Contributor) left a comment

This looks great. Thanks for making the updates!
