Update how TransformerTagging tokenizes sentences #366
Reviewed 1 of 4 files at r1.
Reviewable status: 1 of 4 files reviewed, 4 unresolved discussions (waiting on @Chris7C)
python/TransformerTagging/README.md
line 15 at r1 (raw file):
are called "trigger sentences". These sentences are grouped by "tag" based on which entry in the corpus they matched against.
Needs to be added to descriptor.json.
python/TransformerTagging/transformer_tagging_component/transformer_tagging_component.py
line 41 at r1 (raw file):
from nltk.tokenize.punkt import PunktSentenceTokenizer
import pandas as pd
import re
Not used.
python/TransformerTagging/transformer_tagging_component/transformer_tagging_component.py
line 152 at r1 (raw file):
# split input sentence further on newline or carriage return if flag is set
if (config.split_on_newline):
    for new_sentence in probe_str.splitlines(keepends=True):
Can you just set probe_list = probe_str.splitlines(keepends=True)?
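A minimal sketch of that simplification, assuming the loop body does nothing more than append each piece to probe_list (the loop body is not shown in the quoted diff, so that is a guess, and the sample string is made up):

```python
# Hypothetical input; the real probe_str comes from the component's sentence tokenizer.
probe_str = "first line\nsecond line\r\nthird"

# Loop form, assuming the body only appends each piece.
probe_list_loop = []
for new_sentence in probe_str.splitlines(keepends=True):
    probe_list_loop.append(new_sentence)

# Direct form suggested in the comment.
probe_list = probe_str.splitlines(keepends=True)

# Both forms give the same list, and keepends=True preserves every character,
# so joining the pieces reproduces the original string.
assert probe_list == probe_list_loop
assert "".join(probe_list) == probe_str
```

If the loop does extra per-sentence work (filtering, offset bookkeeping), the direct assignment would not be a drop-in replacement.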
python/TransformerTagging/transformer_tagging_component/transformer_tagging_component.py
line 157 at r1 (raw file):
I took a closer look at KeywordTagging and found how it trims the whitespace before and after the trigger word: https://github.com/openmpf/openmpf-components/blob/develop/cpp/KeywordTagging/KeywordTagging.cpp#L159
I think we should do that in TransformerTagging for consistency. I found this post that talks about counting the leading and trailing chars to strip by using lstrip() and rstrip(): https://stackoverflow.com/questions/52581881/keeping-track-of-number-of-characters-removed-in-strings-python-strip/52581987#52581987
Related, I think I mentioned that we should strip whitespace from the probe before attempting to get a match score. Consider this:
>>> test = ' \n B '
>>> out = test.splitlines(keepends=True)
>>> out
[' \n', ' B ']
>>> len(out[0].strip())
0
In this case ' \n' isn't even worth using since the stripped version has a length of 0.
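Here is a rough sketch of how the two suggestions could fit together: trim each piece with lstrip() and rstrip(), count the removed leading characters to adjust the offset, and skip pieces that are empty after stripping. The function names, the tuple layout, and the inclusive end offset are assumptions for illustration, not the component's actual code.

```python
def trim_with_offsets(text, start=0):
    """Strip whitespace and return (stripped, new_start, new_end).

    `start` is the offset of `text` within the full input. `new_end` is
    inclusive (an assumption). Returns None for whitespace-only text.
    """
    left_trimmed = text.lstrip()
    leading = len(text) - len(left_trimmed)  # chars removed from the front
    stripped = left_trimmed.rstrip()
    if not stripped:
        return None  # nothing left to score
    new_start = start + leading
    new_end = new_start + len(stripped) - 1
    return stripped, new_start, new_end


def split_probes(probe_str, offset=0):
    """Split on line boundaries, trim each piece, and drop empty ones."""
    probes = []
    for piece in probe_str.splitlines(keepends=True):
        trimmed = trim_with_offsets(piece, offset)
        offset += len(piece)  # advance by the untrimmed length
        if trimmed is not None:
            probes.append(trimmed)
    return probes


result = split_probes(' \n B ')
```

With the ' \n B ' example above, the first piece is dropped and only 'B' remains, with start and end offsets pointing at its position in the original string.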
Reviewed 3 of 4 files at r1, 3 of 3 files at r2, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Chris7C)
Reviewed 2 of 2 files at r3, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @Chris7C)