This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Hlu/bert ner utils #36

Merged
merged 36 commits into from Jun 8, 2019

Conversation

hlums
Collaborator

@hlums hlums commented May 3, 2019

5/31: Notebook is updated with new dataset. Everything is ready to be reviewed! @saidbleik @miguelgfierro @yexing99


5/29 updates: @saidbleik @miguelgfierro
I made several updates based on recent discussions with Said. I still need to update the notebook with a new dataset, but the utility classes and functions are ready to be reviewed (I don't plan to make other significant changes besides addressing review comments.)
Please ignore the bert_data_utils.py file for now, I need to update it for the new dataset.
Some functions in common_ner.py are from Said's sequence classification PR. I will merge them into common.py once Said completes his PR.


5/20 updates: @saidbleik @miguelgfierro
I made another update based on our discussion last week.

I got rid of the InputFeature class. I also tried to get rid of the InputExample class, but found it hard. If we pass the data around as tuples, there are a few possible scenarios:
a. Single-sentence data with label: (sentence_text, label)
b. Single-sentence data without label: (sentence_text,)
c. Two-sentence data with label: (sentence_1_text, sentence_2_text, label)
d. Two-sentence data without label: (sentence_1_text, sentence_2_text)
As you can see, a and d can be confused with each other, unless we have different sets of code for single-sentence tasks and two-sentence tasks.
I renamed InputExample to BertInputData and created a namedtuple version of it. Please take a look at bert_data_utils.py.
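For illustration, a namedtuple along these lines disambiguates the tuple scenarios above. The field names here are my assumption, not necessarily what bert_data_utils.py uses:

```python
from collections import namedtuple

# Hypothetical sketch of the BertInputData namedtuple described above;
# the real definition lives in bert_data_utils.py.
BertInputData = namedtuple(
    "BertInputData", ["text_a", "text_b", "label"], defaults=(None, None)
)

single_labeled = BertInputData("The cat sat.", label="O")       # scenario a
pair_unlabeled = BertInputData("Sentence one.", "Sentence two.")  # scenario d

# Unlike bare 2-tuples, the field names make scenarios a and d
# unambiguous: the second positional slot is always text_b.
print(single_labeled.text_b)   # None
print(pair_unlabeled.label)    # None
```

A namedtuple keeps tuple-like unpacking while avoiding the ambiguity between a label and a second sentence.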

I'm still keeping the tokenization step outside of the classifier, but I changed the tokenization utility function to output a TensorDataset instead of InputFeature objects. TensorDataset makes it possible to wrap multiple tensors together without a custom InputFeature class.
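A minimal sketch of that idea, assuming PyTorch is available (the tensor names and shapes below are my invention, not the PR's actual code):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical stand-ins for the tensors a tokenization utility might emit.
token_ids = torch.randint(0, 30000, (8, 16))      # 8 examples, max_len 16
input_mask = torch.ones(8, 16, dtype=torch.long)  # attention mask
label_ids = torch.zeros(8, 16, dtype=torch.long)  # per-token labels

# TensorDataset zips the tensors row-wise; no InputFeature class needed.
dataset = TensorDataset(token_ids, input_mask, label_ids)

# Each item is a tuple of aligned rows, ready for a standard DataLoader.
ids, mask, labels = dataset[0]
loader = DataLoader(dataset, batch_size=4, shuffle=True)
```

The DataLoader then handles batching and shuffling, which is what makes the custom batching code removable.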

I'm flexible with using or not using the configuration class.

Let's seek more evidence to finalize these decisions as Miguel suggested.

5/16 updates: @saidbleik @miguelgfierro
I made another pass through the code. Three major changes:

  1. Consolidated some utility functions into the BertTokenClassifier class.
  2. Removed some unnecessary configurations.
  3. Added docstrings.

In general, I followed the BertSequenceClassifier Said wrote, but made a few different design decisions.

  • Use a single BertFineTuneConfig class to set all parameters. BertFineTuneConfig is initialized from a dictionary, so users can keep their parameters in a yaml file and load that file into a dictionary. I think this makes the code less verbose when we want to give users more control, and also makes it easier for users to document how they ran their experiments.
    I also store all configurations in the BertTokenClassifier object, in case one needs to pickle the model and use it somewhere else.
  • Keep the tokenization step outside of the classifier class. I think this is a preprocessing step and shouldn't be included in the classifier. It also helps users understand better what they are doing. We want to abstract things to improve reusability, but a sequence of smaller black boxes may help users understand the process better than one big black box.
  • Keep the InputExample and InputFeatures classes, and use the PyTorch DataLoader instead of a custom function to create batches. I think using standard data structures will make code written by different people look more consistent. There may be some initial learning curve, but it could be helpful in the long run. The fields in InputExample and InputFeatures also help the user understand how BERT works.
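A minimal sketch of the dictionary-driven configuration idea from the first bullet. The class body and parameter names are my assumptions, not the actual BertFineTuneConfig; I use a json string here to keep the sketch dependency-free, whereas with PyYAML the dictionary would come from yaml.safe_load:

```python
import json

# Parameters a user might keep in a yaml/json file for their experiment.
raw = '{"bert_model": "bert-base-uncased", "max_seq_length": 128, "learning_rate": 5e-5}'
params = json.loads(raw)

class FineTuneConfig:
    """Hypothetical stand-in for BertFineTuneConfig: dict in, attributes out."""
    def __init__(self, config_dict):
        # Keep the raw dict too, so it can be pickled alongside the model
        # as a record of how the experiment was run.
        self._config_dict = dict(config_dict)
        for key, value in config_dict.items():
            setattr(self, key, value)

config = FineTuneConfig(params)
print(config.max_seq_length)  # 128
```

Storing the original dictionary alongside the attributes is what makes the "document how they ran their experiments" goal cheap: the dict can be dumped back to a file verbatim.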

I will try to catch you guys to discuss these in the next couple of days. Please take a look at the updated code if you have time. Thanks!

I still need to refine some functions and improve the formatting, but I want to create this PR now so people can review and comment. @miguelgfierro

@review-notebook-app

Check out this pull request on ReviewNB: https://app.reviewnb.com/Microsoft/NLP/pull/36


Collaborator

@saidbleik saidbleik left a comment


Nice work. Thanks.
I still think the example feels bloated though. It might be helpful if someone goes through it from a user's perspective and tries it out. I'll add a task for that.

utils_nlp/bert/NER_bert-demo-new.ipynb: 3 resolved review threads (outdated)
@nikhilrj nikhilrj added this to In progress in NLP MVP May 7, 2019
@nikhilrj nikhilrj removed this from In progress in NLP MVP May 7, 2019
@nikhilrj nikhilrj added this to In progress in PR Review May 7, 2019
Member

@miguelgfierro miguelgfierro left a comment


good job, took one pass and have some questions

utils_nlp/bert/bert_data_utils.py: 4 resolved review threads (outdated)
utils_nlp/bert/bert_utils.py: 5 resolved review threads (outdated)
utils_nlp/bert/configs.py: 1 resolved review thread (outdated)
@hlums
Collaborator Author

hlums commented May 7, 2019

good job, took one pass and have some questions

Thank you @miguelgfierro! Great insights! I will work on addressing your comments and let you know when it's ready for another review pass.
BTW, we are also discussing whether we should write our own classes instead of utility functions, so it may take some time to finalize this if we decide to go with classes.

utils_nlp/bert/bert_data_utils.py: 1 resolved review thread (outdated)
utils_nlp/bert/bert_utils.py: 2 resolved review threads (outdated)
Collaborator

@saidbleik saidbleik left a comment


These are part of an old review that I added without submitting. Ignore if outdated. I'll review the new updates.

utils_nlp/bert/bert_utils.py: 5 resolved review threads (outdated)
@miguelgfierro
Member

hey @hlums I think we should avoid having binary objects like images in the repo. The reason is that git is designed for code, not for binary files. A binary file can't be versioned meaningfully, and at the same time it can make the repo very big if there are many binaries. In reco we have a blob where we host all images and then link them in the notebooks.

I uploaded the bert image to a blob I created for us: https://nlpbp.blob.core.windows.net/images/bert_architecture.png. You can access it in case you want to upload other images.

To link them in the notebooks, it is just like in a markdown or with the html img tag, see one example here: https://github.com/microsoft/recommenders/blob/c16ed91b21cd2c3eea228becaeac7013029f9758/notebooks/00_quick_start/wide_deep_movielens.ipynb#L914
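For reference, an image hosted on the blob can be linked from a notebook markdown cell like this (the width value is just an example):

```html
<!-- html img tag variant: lets you control the display width -->
<img src="https://nlpbp.blob.core.windows.net/images/bert_architecture.png" width="600">

<!-- plain markdown works too:
![BERT architecture](https://nlpbp.blob.core.windows.net/images/bert_architecture.png) -->
```

Either form keeps the binary out of git while the notebook still renders the image.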

@hlums
Collaborator Author

hlums commented May 30, 2019

hey @hlums I think we should avoid having binary objects like images in the repo. The reason is that git is designed for code, not for binary files. A binary file can't be versioned meaningfully, and at the same time it can make the repo very big if there are many binaries. In reco we have a blob where we host all images and then link them in the notebooks.

I uploaded the bert image to a blob I created for us: https://nlpbp.blob.core.windows.net/images/bert_architecture.png. You can access it in case you want to upload other images.

To link them in the notebooks, it is just like in a markdown or with the html img tag, see one example here: https://github.com/microsoft/recommenders/blob/c16ed91b21cd2c3eea228becaeac7013029f9758/notebooks/00_quick_start/wide_deep_movielens.ipynb#L914

@miguelgfierro That makes sense! Thank you!

@miguelgfierro
Member

hey @hlums, I see that there is some repeated code between this PR and #63. What is the plan? Are we going to merge @saidbleik's PR first and then work on this?

Please let me know when you want me to take a look at your PR

@hlums
Collaborator Author

hlums commented Jun 5, 2019

hey @hlums, I see that there is some repeated code between this PR and #63. What is the plan? Are we going to merge @saidbleik's PR first and then work on this?

Please let me know when you want me to take a look at your PR

@miguelgfierro Do you mean the code in common_ner.py? I needed to repeat some code from Said's PR to get my code running. The plan is that once Said completes his PR, I will update my branch from staging. You can start reviewing my PR now and skip the parts overlapping with Said's PR. If Said is planning to complete his PR soon, you can also wait until I complete the merge from staging, if you prefer.

@saidbleik
Collaborator

hey @hlums, I see that there is some repeated code between this PR and #63. What is the plan? Are we going to merge @saidbleik's PR first and then work on this?
Please let me know when you want me to take a look at your PR

@miguelgfierro Do you mean the code in common_ner.py? I needed to repeat some code from Said's PR to get my code running. The plan is that once Said completes his PR, I will update my branch from staging. You can start reviewing my PR now and skip the parts overlapping with Said's PR. If Said is planning to complete his PR soon, you can also wait until I complete the merge from staging, if you prefer.

The notebook and the utils are ready but I'm trying to figure out which dataset to use (the latest one is using MultiNLI). Let me push my changes to my branch for now.

@hlums
Collaborator Author

hlums commented Jun 7, 2019

@miguelgfierro @saidbleik I've updated my branch from staging and it's ready to be reviewed.

PR Review automation moved this from In progress to In Review Jun 7, 2019
@saidbleik saidbleik merged commit 01b038b into staging Jun 8, 2019
PR Review automation moved this from In Review to Done Jun 8, 2019
@miguelgfierro miguelgfierro deleted the hlu/BERT_NER_utils branch June 10, 2019 09:08
saidbleik added a commit that referenced this pull request Jun 10, 2019
Staging to master after PR #36