Commit

Fix the way we deal with empty sentences due to removal of punctuation. Should be stronger

Squashed commit of the following:

commit 187ada7
Author: Thibault Clérice <leponteineptique@gmail.com>
Date:   Thu Feb 6 18:39:16 2020 +0100

    Fix the way we deal with empty sentences due to removal of punctuation. Should be stronger

commit 0531fc9
Author: Thibault Clérice <leponteineptique@gmail.com>
Date:   Thu Feb 6 18:06:12 2020 +0100

    Added special case of following stuff

commit 9a7ed93
Author: Thibault Clérice <leponteineptique@gmail.com>
Date:   Thu Feb 6 17:34:00 2020 +0100

    Working

commit 20d53e8
Author: Thibault Clérice <leponteineptique@gmail.com>
Date:   Thu Feb 6 17:19:02 2020 +0100

    Cleaner

commit e88eeaf
Author: Thibault Clérice <leponteineptique@gmail.com>
Date:   Thu Feb 6 17:16:38 2020 +0100

    It's ALIVE !

commit cc51728
Author: Thibault Clérice <leponteineptique@gmail.com>
Date:   Thu Feb 6 17:05:54 2020 +0100

    Now we have an issue with missing parenthesis

commit 9adf218
Author: Thibault Clérice <leponteineptique@gmail.com>
Date:   Thu Feb 6 16:42:08 2020 +0100

    More bugs

commit b80ac11
Author: Thibault Clérice <leponteineptique@gmail.com>
Date:   Thu Feb 6 15:09:56 2020 +0100

    (Not working) First attempt but not working. Need to debug. No text is sent apparently...
PonteIneptique committed Feb 6, 2020
1 parent f5eedd8 commit 08a60bb
Showing 1 changed file with 11 additions and 34 deletions.
45 changes: 11 additions & 34 deletions pie_extended/tagger.py
@@ -54,51 +54,37 @@ def iter_tag(self, data: str, iterator: DataIterator, formatter_class: type):
             # Unzip the batch into the sentences, their sizes and the dictionaries of things that needs
             # to be reinserted
             sents, lengths, needs_reinsertion = zip(*chunk)
-            # Removing punctuation might create empty sentences !
-            # Which would crash Torch
-            empty_sents_indexes = {
-                index: []
-                for index, sent in enumerate(sents)
-                if len(sent) == 0
-            }
+
+            is_empty = [len(sent) == 0 for sent in sents]
+
             tagged, tasks = self.tag(
-                sents=[sent for sent in sents if len(sent)],
+                sents=[sent for sent in sents if sent],
                 lengths=lengths
             )
             formatter: Formatter = formatter_class(tasks)

-            # We keep a real sentence index
-            real_sentence_index = 0
-            for sent in tagged:
-                if not sent:
-                    continue
+            for sents_index, sent_is_empty in enumerate(is_empty):
+                if sent_is_empty:
+                    sent = []
+                else:
+                    sent = tagged.pop(0)

                 # Gets things that needs to be reinserted
-                sent_reinsertion = needs_reinsertion[real_sentence_index]
+                sent_reinsertion = needs_reinsertion[sents_index]

                 # If the header has not yet be written, write it
                 if not header:
                     yield formatter.write_headers()
                     header = True

-                # Some sentences can be empty and would have been removed from tagging
-                # we check and until we get to a non empty sentence
-                # we increment the real_sentence_index to keep in check with the reinsertion map
-                while real_sentence_index in empty_sents_indexes:
-                    yield from self.reinsert_full(
-                        formatter,
-                        needs_reinsertion[real_sentence_index],
-                        tasks
-                    )
-                    real_sentence_index += 1
-
                 yield formatter.write_sentence_beginning()

                 # If we have a disambiguator, we run the results into it
                 if self.disambiguation:
                     sent = self.disambiguation(sent, tasks)

                 reinsertion_index = 0
+                index = 0

                 for index, (token, tags) in enumerate(sent):
                     while reinsertion_index + index in sent_reinsertion:
@@ -125,15 +111,6 @@ def iter_tag(self, data: str, iterator: DataIterator, formatter_class: type):

                 yield formatter.write_sentence_end()

-                real_sentence_index += 1
-
-            while real_sentence_index in empty_sents_indexes:
-                yield from self.reinsert_full(
-                    formatter,
-                    needs_reinsertion[real_sentence_index],
-                    tasks
-                )
-                real_sentence_index += 1

         if formatter:
             yield formatter.write_footer()
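
The approach this commit moves to can be sketched in isolation: build a boolean mask of empty sentences, send only the non-empty ones to the tagger, then walk the mask to put every result back at its original index. The sketch below is a minimal illustration and not the code of pie_extended: `tag_skipping_empty`, `fake_tagger` and the sample sentences are hypothetical names and data, and the real `self.tag` also takes `lengths` and returns `tasks`, which are left out here.

```python
from collections import deque
from typing import Callable, Iterator, List, Sequence, Tuple

Sentence = List[str]
Tagged = List[Tuple[str, str]]


def tag_skipping_empty(
    sents: Sequence[Sentence],
    tag_batch: Callable[[List[Sentence]], List[Tagged]],
) -> Iterator[Tuple[int, Tagged]]:
    """Yield (original_index, tagged_sentence) for every input sentence.

    `tag_batch` is a hypothetical stand-in for the real tagger. Empty
    sentences are never passed to it; they are yielded as empty lists
    at their original position.
    """
    # One flag per input sentence, in input order.
    is_empty = [len(sent) == 0 for sent in sents]
    # Only non-empty sentences reach the tagger.
    tagged = deque(tag_batch([sent for sent in sents if sent]))
    for index, sent_is_empty in enumerate(is_empty):
        # Consume one tagged result per non-empty sentence, so `index`
        # always matches the position in the original batch.
        yield index, ([] if sent_is_empty else tagged.popleft())


def fake_tagger(batch: List[Sentence]) -> List[Tagged]:
    # Hypothetical tagger: refuses empty input, "tags" by upper-casing.
    assert all(batch), "the tagger must never receive an empty sentence"
    return [[(token, token.upper()) for token in sent] for sent in batch]


sentences = [["arma", "virumque"], [], ["cano"]]
result = list(tag_skipping_empty(sentences, fake_tagger))
```

Because the loop runs over the mask and not over the tagger output, the index used for the reinsertion map stays aligned with the original batch, and no separate counter has to be advanced around empty sentences. The sketch uses `deque.popleft()` where the commit uses `list.pop(0)`; both take the first remaining result, the deque only avoids shifting the rest of the list on each call.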
