Update how TransformerTagging tokenizes sentences #366

Chris7C · 2024-05-07T05:17:33Z

Issues:

Update how TransformerTagging tokenizes sentences openmpf#1774

This change is

…carriage returns

…agging-newlines

jrobble

Reviewed 1 of 4 files at r1.
Reviewable status: 1 of 4 files reviewed, 4 unresolved discussions (waiting on @Chris7C)

python/TransformerTagging/README.md line 15 at r1 (raw file):

are called "trigger sentences". These sentences are grouped by "tag" based on which entry 
in the corpus they matched against.

Needs to be added to descriptor.json.

python/TransformerTagging/transformer_tagging_component/transformer_tagging_component.py line 41 at r1 (raw file):

from nltk.tokenize.punkt import PunktSentenceTokenizer
import pandas as pd
import re

Not used.

python/TransformerTagging/transformer_tagging_component/transformer_tagging_component.py line 152 at r1 (raw file):

            # split input sentence further on newline or carriage return if flag is set
            if (config.split_on_newline):
                for new_sentence in probe_str.splitlines(keepends=True):

Can you just set probe_list = probe_str.splitlines(keepends=True) ?

python/TransformerTagging/transformer_tagging_component/transformer_tagging_component.py line 157 at r1 (raw file):
I took a closer look at KeywordTagging and found how it trims the whitespace before and after the trigger word: https://github.com/openmpf/openmpf-components/blob/develop/cpp/KeywordTagging/KeywordTagging.cpp#L159

I think we should do that in TransformerTagging for consistency. I found this post that talks about counting the leading and trailing chars to strip by using lstrip() and rstrip(): https://stackoverflow.com/questions/52581881/keeping-track-of-number-of-characters-removed-in-strings-python-strip/52581987#52581987

Related, I think I mentioned that we should strip whitespace from the probe before attempting to get a match score. Consider this:

>>> test =  '     \n   B   '
>>> out = test.splitlines(keepends=True)
>>> out
['     \n', '   B   ']
>>> len(out[0].strip())
0

In this case ' \n' isn't even worth using since the stripped version has a length of 0.

jrobble

Reviewed 3 of 4 files at r1, 3 of 3 files at r2, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Chris7C)

jrobble

Reviewed 2 of 2 files at r3, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @Chris7C)

Chris and others added 4 commits March 27, 2024 00:41

Added a new flag, that when set, will split sentences on newline and …

5a7c258

…carriage returns

Updates to fix offsets on newlines

451ac6e

test for newline splits and offset

1626e8a

Merge remote-tracking branch 'origin/develop' into feat/transformer-t…

46cc48f

…agging-newlines

jrobble requested changes May 8, 2024

View reviewed changes

Address PR comments.

7dbd353

jrobble reviewed May 8, 2024

View reviewed changes

Strip probes.

9b08cb5

jrobble approved these changes May 8, 2024

View reviewed changes

jrobble merged commit d7629cd into develop May 8, 2024

jrobble deleted the feat/transformer-tagging-newlines branch May 8, 2024 18:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Update how TransformerTagging tokenizes sentences #366

Update how TransformerTagging tokenizes sentences #366

Uh oh!

Chris7C commented May 7, 2024 •

edited by jrobble

Loading

Uh oh!

jrobble left a comment

Uh oh!

jrobble left a comment

Uh oh!

jrobble left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Update how TransformerTagging tokenizes sentences #366

Update how TransformerTagging tokenizes sentences #366

Uh oh!

Conversation

Chris7C commented May 7, 2024 • edited by jrobble Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jrobble left a comment

Choose a reason for hiding this comment

Uh oh!

jrobble left a comment

Choose a reason for hiding this comment

Uh oh!

jrobble left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Chris7C commented May 7, 2024 •

edited by jrobble

Loading