Conversation
Nice work. Thanks.
I still think the example feels bloated though. It might be helpful if someone goes through it from a user's perspective and tries it out. I'll add a task for that.
good job, took one pass and have some questions
Thank you @miguelgfierro! Great insights! I will work on addressing your comments and let you know when it's ready for another pass of review.
These are part of an old review that I added without submitting. Ignore if outdated. I'll review the new updates.
examples/named_entity_recognition/NER_bert-demo-new-updated.ipynb
hey @hlums, I think we should avoid having binary objects like images in the repo. The reason is that git is designed for code, not binary files: a binary file can't be meaningfully versioned, and many binaries can make the repo very big. In reco we have a blob where we host all images and then link them in the notebooks. I uploaded the BERT image to a blob I created for us: https://nlpbp.blob.core.windows.net/images/bert_architecture.png. You can access it in case you want to upload other images. To link them in the notebooks, use plain markdown or the html img tag; see an example here: https://github.com/microsoft/recommenders/blob/c16ed91b21cd2c3eea228becaeac7013029f9758/notebooks/00_quick_start/wide_deep_movielens.ipynb#L914
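For reference, linking the blob-hosted image from a notebook markdown cell can look like this (using the blob URL above; the width value is just an illustration):

```markdown
![BERT architecture](https://nlpbp.blob.core.windows.net/images/bert_architecture.png)

<!-- or, with the html img tag when you need to control the size: -->
<img src="https://nlpbp.blob.core.windows.net/images/bert_architecture.png" width="500">
```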
@miguelgfierro That makes sense! Thank you!
hey @hlums, I see that there is some repeated code between this PR and #63. What is the plan, are we going to merge @saidbleik's PR first and then work on this? Please let me know when you want me to take a look at your PR.
@miguelgfierro Do you mean the code in common_ner.py? I needed to repeat some code from Said's PR to get my code running. The plan is that once Said completes his PR, I will update my branch from staging. You can start reviewing my PR now and skip the parts overlapping with Said's PR. If Said is planning to complete his PR soon, you can also wait until I finish merging from staging, if you prefer.
The notebook and the utils are ready but I'm trying to figure out which dataset to use (the latest one is using MultiNLI). Let me push my changes to my branch for now. |
@miguelgfierro @saidbleik I've updated my branch from staging and it's ready to be reviewed. |
scenarios/named_entity_recognition/Named_Entity_Recognition_Using_BERT.ipynb
Staging to master after PR #36
5/31: Notebook is updated with new dataset. Everything is ready to be reviewed! @saidbleik @miguelgfierro @yexing99
5/29 updates: @saidbleik @miguelgfierro
I made several updates based on recent discussions with Said. I still need to update the notebook with a new dataset, but the utility classes and functions are ready to be reviewed (I don't plan to make other significant changes besides addressing review comments).
Please ignore the bert_data_utils.py file for now; I need to update it for the new dataset.
Some functions in common_ner.py are from Said's sequence classification PR. I will merge this with common.py once Said completes his PR.
5/20 updates: @saidbleik @miguelgfierro
I made another update based on our discussion last week.
I got rid of the InputFeature class. I also tried to get rid of the InputExample class, but found it hard. If we pass the data around as plain tuples, there are a few possible scenarios:
a. Single sentence data with label: (sentence_text, label)
b. Single sentence data without label: (sentence_text,)
c. Two sentence data with label: (sentence_1_text, sentence_2_text, label)
d. Two sentence data without label: (sentence_1_text, sentence_2_text)
As you can see, a and d can be confused with each other (both are tuples of two strings), unless we keep separate code paths for single-sentence and two-sentence tasks.
I renamed InputExample to BertInputData and created a namedtuple version of it. Please take a look at bert_data_utils.py.
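To illustrate how a namedtuple resolves the a-vs-d ambiguity above, here is a minimal sketch; the field names are assumptions for illustration, not the actual definition in bert_data_utils.py:

```python
from collections import namedtuple

# Hypothetical sketch, not the real bert_data_utils.py definition.
# Making text_b and label optional lets one type cover all four
# scenarios (a-d) without ambiguity, because fields are named.
BertInputData = namedtuple(
    "BertInputData", ["text_a", "text_b", "label"], defaults=(None, None)
)

single = BertInputData("I went to Seattle.", label="O")  # scenario a
pair = BertInputData("Sentence one.", "Sentence two.")   # scenario d

# Unlike bare 2-tuples, the two cases are now distinguishable:
print(single.text_b is None)  # True: no second sentence
print(pair.label is None)     # True: no label
```

Because the fields carry names, a consumer can branch on `text_b is None` instead of guessing what the second tuple element means.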
I'm still keeping the tokenization step outside of the classifier, but changed the tokenization utility function to output a TensorDataset instead of InputFeature objects. TensorDataset can wrap multiple tensors without needing InputFeature.
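The TensorDataset idea can be sketched as follows; the tensor names and shapes are illustrative assumptions, not the actual utility's API:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Sketch: wrap tokenizer outputs directly in a TensorDataset instead of
# a list of InputFeature objects. Shapes below are made up for the demo:
# a batch of 8 sequences, each padded to length 16.
input_ids = torch.randint(0, 30522, (8, 16))        # token ids
attention_mask = torch.ones(8, 16, dtype=torch.long)
labels = torch.zeros(8, 16, dtype=torch.long)       # one tag id per token

# TensorDataset zips the tensors along the first dimension.
dataset = TensorDataset(input_ids, attention_mask, labels)
loader = DataLoader(dataset, batch_size=4)

for batch_ids, batch_mask, batch_labels in loader:
    print(batch_ids.shape)  # torch.Size([4, 16])
```

Each dataset item is a plain tuple of tensors, so the training loop unpacks batches directly without an intermediate feature class.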
I'm flexible with using or not using the configuration class.
Let's seek more evidence to finalize these decisions as Miguel suggested.
5/16 updates: @saidbleik @miguelgfierro
I made another pass through the code. Major changes:
In general, I followed the BertSequenceClassifier Said wrote, but made a few different design decisions.
I also store all configurations in the BertTokenClassifier object, in case one needs to pickle the model and use it somewhere else.
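The pickling motivation can be sketched like this; the class and attribute names are assumptions for illustration, not the real BertTokenClassifier:

```python
import pickle

# Illustrative sketch (names are assumptions, not the actual class):
# keeping every configuration value on the instance means one pickle
# round-trip carries the settings along with the model object.
class BertTokenClassifier:
    def __init__(self, language="bert-base-uncased", num_labels=9,
                 max_seq_length=128):
        self.language = language
        self.num_labels = num_labels
        self.max_seq_length = max_seq_length

clf = BertTokenClassifier(num_labels=5)
restored = pickle.loads(pickle.dumps(clf))
print(restored.num_labels, restored.max_seq_length)  # 5 128
```

After unpickling elsewhere, the consumer can read the stored configuration off the object instead of needing a separate config file.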
I will try to catch you guys to discuss these in the next couple of days. Please take a look at the updated code if you have time. Thanks!
I still need to refine some functions and improve the formatting, but want to create this PR for people to review and comment. @miguelgfierro