This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Hlu/bert ner utils #36

Merged
merged 36 commits into from Jun 8, 2019

Conversation

hlums
Collaborator

@hlums hlums commented May 3, 2019

5/31: Notebook is updated with new dataset. Everything is ready to be reviewed! @saidbleik @miguelgfierro @yexing99


5/29 updates: @saidbleik @miguelgfierro
I made several updates based on recent discussions with Said. I still need to update the notebook with a new dataset, but the utility classes and functions are ready to be reviewed (I don't plan to make other significant changes besides addressing review comments.)
Please ignore the bert_data_utils.py file for now, I need to update it for the new dataset.
Some functions in common_ner.py are from Said's sequence classification PR. I will merge them into common.py once Said completes his PR.


5/20 updates: @saidbleik @miguelgfierro
I made another update based on our discussion last week.

I got rid of the InputFeature class. I also tried to get rid of the InputExample class, but found it hard. If we pass the data around as tuples, there are a few possible scenarios:
a. Single-sentence data with label: (sentence_text, label)
b. Single-sentence data without label: (sentence_text,)
c. Two-sentence data with label: (sentence_1_text, sentence_2_text, label)
d. Two-sentence data without label: (sentence_1_text, sentence_2_text)
As you can see, a and d can be confused with each other, unless we have different sets of code for single-sentence tasks and two-sentence tasks.
I renamed InputExample to BertInputData and created a namedtuple version of it. Please take a look at bert_data_utils.py.
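For illustration, a namedtuple along these lines disambiguates the tuple scenarios above. The field names here are my assumption, not necessarily what bert_data_utils.py uses:

```python
from collections import namedtuple

# Hypothetical sketch of the BertInputData namedtuple described above;
# the real definition lives in bert_data_utils.py.
BertInputData = namedtuple(
    "BertInputData", ["text_a", "text_b", "label"], defaults=(None, None)
)

single_labeled = BertInputData("The cat sat.", label="O")       # scenario a
pair_unlabeled = BertInputData("Sentence one.", "Sentence two.")  # scenario d

# Unlike bare 2-tuples, the field names make scenarios a and d
# unambiguous: the second positional slot is always text_b.
print(single_labeled.text_b)   # None
print(pair_unlabeled.label)    # None
```

A namedtuple keeps tuple-like unpacking while avoiding the ambiguity between a label and a second sentence.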

I'm still keeping the tokenization step outside of the classifier, but I changed the tokenization utility function to output a TensorDataset instead of InputFeature objects. TensorDataset makes it possible to wrap multiple tensors together without a custom InputFeature class.
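A minimal sketch of that idea, assuming PyTorch is available (the tensor names and shapes below are my invention, not the PR's actual code):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical stand-ins for the tensors a tokenization utility might emit.
token_ids = torch.randint(0, 30000, (8, 16))      # 8 examples, max_len 16
input_mask = torch.ones(8, 16, dtype=torch.long)  # attention mask
label_ids = torch.zeros(8, 16, dtype=torch.long)  # per-token labels

# TensorDataset zips the tensors row-wise; no InputFeature class needed.
dataset = TensorDataset(token_ids, input_mask, label_ids)

# Each item is a tuple of aligned rows, ready for a standard DataLoader.
ids, mask, labels = dataset[0]
loader = DataLoader(dataset, batch_size=4, shuffle=True)
```

The DataLoader then handles batching and shuffling, which is what makes the custom batching code removable.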

I'm flexible with using or not using the configuration class.

Let's seek more evidence to finalize these decisions as Miguel suggested.

5/16 updates: @saidbleik @miguelgfierro
I made another pass through the code. Three major changes:

  1. Consolidated some utility functions into the BertTokenClassifier class.
  2. Removed some unnecessary configurations.
  3. Added docstrings.

In general, I followed the BertSequenceClassifier Said wrote, but made a few different design decisions.

  • Use a single BertFineTuneConfig class to set all parameters. BertFineTuneConfig is initialized from a dictionary, so users can keep their parameters in a yaml file and load that file into a dictionary. I think this makes the code less verbose when we want to give users more control, and also makes it easier for users to document how they ran their experiments.
    I also store all configurations in the BertTokenClassifier object, in case one needs to pickle the model and use it somewhere else.
  • Keep the tokenization step outside of the classifier class. I think this is a preprocessing step and shouldn't be included in the classifier. It also helps users understand better what they are doing. We want to abstract things to improve reusability, but a sequence of smaller black boxes may help users understand the process better than one big black box.
  • Keep the InputExample and InputFeatures classes, and use the PyTorch DataLoader instead of a custom function to create batches. I think using standard data structures will make code written by different people look more consistent. There may be some initial learning curve, but it could be helpful in the long run. The fields in InputExample and InputFeatures also help the user understand how BERT works.
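A minimal sketch of the dictionary-driven configuration idea from the first bullet. The class body and parameter names are my assumptions, not the actual BertFineTuneConfig; I use a json string here to keep the sketch dependency-free, whereas with PyYAML the dictionary would come from yaml.safe_load:

```python
import json

# Parameters a user might keep in a yaml/json file for their experiment.
raw = '{"bert_model": "bert-base-uncased", "max_seq_length": 128, "learning_rate": 5e-5}'
params = json.loads(raw)

class FineTuneConfig:
    """Hypothetical stand-in for BertFineTuneConfig: dict in, attributes out."""
    def __init__(self, config_dict):
        # Keep the raw dict too, so it can be pickled alongside the model
        # as a record of how the experiment was run.
        self._config_dict = dict(config_dict)
        for key, value in config_dict.items():
            setattr(self, key, value)

config = FineTuneConfig(params)
print(config.max_seq_length)  # 128
```

Storing the original dictionary alongside the attributes is what makes the "document how they ran their experiments" goal cheap: the dict can be dumped back to a file verbatim.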

I will try to catch you guys to discuss these in the next couple of days. Please take a look at the updated code if you have time. Thanks!

I still need to refine some functions and improve the formatting, but I want to create this PR now so people can review and comment. @miguelgfierro

@review-notebook-app

Check out this pull request on ReviewNB: https://app.reviewnb.com/Microsoft/NLP/pull/36


Collaborator

@saidbleik saidbleik left a comment


Nice work. Thanks.
I still think the example feels bloated though. It might be helpful if someone goes through it from a user's perspective and tries it out. I'll add a task for that.

utils_nlp/bert/NER_bert-demo-new.ipynb: 3 resolved review threads (outdated)
@nikhilrj nikhilrj added this to In progress in NLP MVP May 7, 2019
@nikhilrj nikhilrj removed this from In progress in NLP MVP May 7, 2019
@nikhilrj nikhilrj added this to In progress in PR Review May 7, 2019
Member

@miguelgfierro miguelgfierro left a comment


good job, took one pass and have some questions

utils_nlp/bert/bert_data_utils.py: 4 resolved review threads (outdated)
utils_nlp/bert/bert_utils.py: 5 resolved review threads (outdated)
utils_nlp/bert/configs.py: 1 resolved review thread (outdated)
@hlums
Collaborator Author

hlums commented May 7, 2019

good job, took one pass and have some questions

Thank you @miguelgfierro! Great insights! I will work on addressing your comments and let you know when it's ready for another review pass.
BTW, we are also discussing whether we should write our own classes instead of utility functions, so it may take some time to finalize this if we decide to go with classes.

utils_nlp/bert/bert_data_utils.py: 1 resolved review thread (outdated)
utils_nlp/bert/bert_utils.py: 2 resolved review threads (outdated)
Collaborator

@saidbleik saidbleik left a comment


These are part of an old review that I added without submitting. Ignore if outdated. I'll review the new updates.

utils_nlp/bert/bert_utils.py: 5 resolved review threads (outdated)
@miguelgfierro
Member

hey @hlums I think we should avoid having binary objects like images in the repo. The reason is that git is designed for code, not for binary files. A binary file can't be versioned meaningfully, and at the same time it can make the repo very big if there are many binaries. In reco we have a blob where we host all images and then link them in the notebooks.

I uploaded the bert image to a blob I created for us: https://nlpbp.blob.core.windows.net/images/bert_architecture.png. You can access it in case you want to upload other images.

To link them in the notebooks, it is just like in a markdown or with the html img tag, see one example here: https://github.com/microsoft/recommenders/blob/c16ed91b21cd2c3eea228becaeac7013029f9758/notebooks/00_quick_start/wide_deep_movielens.ipynb#L914
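For reference, an image hosted on the blob can be linked from a notebook markdown cell like this (the width value is just an example):

```html
<!-- html img tag variant: lets you control the display width -->
<img src="https://nlpbp.blob.core.windows.net/images/bert_architecture.png" width="600">

<!-- plain markdown works too:
![BERT architecture](https://nlpbp.blob.core.windows.net/images/bert_architecture.png) -->
```

Either form keeps the binary out of git while the notebook still renders the image.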

@hlums
Collaborator Author

hlums commented May 30, 2019

hey @hlums I think we should avoid having binary objects like images in the repo. The reason is that git is designed for code, not for binary files. A binary file can't be versioned meaningfully, and at the same time it can make the repo very big if there are many binaries. In reco we have a blob where we host all images and then link them in the notebooks.

I uploaded the bert image to a blob I created for us: https://nlpbp.blob.core.windows.net/images/bert_architecture.png. You can access it in case you want to upload other images.

To link them in the notebooks, it is just like in a markdown or with the html img tag, see one example here: https://github.com/microsoft/recommenders/blob/c16ed91b21cd2c3eea228becaeac7013029f9758/notebooks/00_quick_start/wide_deep_movielens.ipynb#L914

@miguelgfierro That makes sense! Thank you!

@miguelgfierro
Member

hey @hlums, I see that there is some repeated code between this PR and #63. What is the plan? Are we going to merge @saidbleik's PR first and then work on this?

Please let me know when you want me to take a look at your PR

@hlums
Collaborator Author

hlums commented Jun 5, 2019

hey @hlums, I see that there is some repeated code between this PR and #63. What is the plan? Are we going to merge @saidbleik's PR first and then work on this?

Please let me know when you want me to take a look at your PR

@miguelgfierro Do you mean the code in common_ner.py? I needed to repeat some code from Said's PR to get my code running. The plan is that once Said completes his PR, I will update my branch from staging. You can start reviewing my PR now and skip the parts overlapping with Said's PR. If Said is planning to complete his PR soon, you can also wait until I complete the merge from staging, if you prefer.

@saidbleik
Collaborator

hey @hlums, I see that there is some repeated code between this PR and #63. What is the plan? Are we going to merge @saidbleik's PR first and then work on this?
Please let me know when you want me to take a look at your PR

@miguelgfierro Do you mean the code in common_ner.py? I needed to repeat some code from Said's PR to get my code running. The plan is that once Said completes his PR, I will update my branch from staging. You can start reviewing my PR now and skip the parts overlapping with Said's PR. If Said is planning to complete his PR soon, you can also wait until I complete the merge from staging, if you prefer.

The notebook and the utils are ready but I'm trying to figure out which dataset to use (the latest one is using MultiNLI). Let me push my changes to my branch for now.

@hlums
Collaborator Author

hlums commented Jun 7, 2019

@miguelgfierro @saidbleik I've updated my branch from staging and it's ready to be reviewed.

PR Review automation moved this from In progress to In Review Jun 7, 2019
@saidbleik saidbleik merged commit 01b038b into staging Jun 8, 2019
PR Review automation moved this from In Review to Done Jun 8, 2019
@miguelgfierro miguelgfierro deleted the hlu/BERT_NER_utils branch June 10, 2019 09:08
saidbleik added a commit that referenced this pull request Jun 10, 2019
Staging to master after PR #36