Fix Surface String to Token Mappings for Case Encoding #12

rjai · 2021-07-30T21:38:02Z

This PR fixes the SentencePieceText structure returned upon using the C++ APIs. This is particularly relevant in the context of Case Encoding.

Previously, in Encode, the Case Encoding scheme returned incomplete norm_to_orig mappings, which prevented the code that assigns surfaces to each piece from doing its job correctly. This PR corrects the norm_to_orig mapping, which ensures that SentencePieceText adheres to the API invariant text.substr(piece.begin, piece.end - piece.begin) == piece.surface.
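The invariant can be spot-checked with a few lines of C++. This is only a minimal sketch: `PieceSpan` and `SurfacesMatchText` are hypothetical names standing in for the protobuf piece accessors, not part of the SentencePiece API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one piece of SentencePieceText (illustration only).
struct PieceSpan {
  std::string surface;  // surface string reported for the piece
  std::size_t begin;    // byte offset of the piece in text (inclusive)
  std::size_t end;      // byte offset of the piece in text (exclusive)
};

// Returns true when every piece's [begin, end) slice of text equals its surface.
bool SurfacesMatchText(const std::string& text,
                       const std::vector<PieceSpan>& pieces) {
  for (const auto& p : pieces) {
    // Reject malformed spans before slicing.
    if (p.begin > p.end || p.end > text.size()) return false;
    if (text.substr(p.begin, p.end - p.begin) != p.surface) return false;
  }
  return true;
}
```

A check like this could serve as a regression assertion over encoded output, assuming the offsets are byte offsets into the original text.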

Second, there is a bug in SentencePiece: when it is run with treat_whitespace_as_suffix and add_dummy_prefix together, decoding may emit an extra whitespace in the text. This PR corrects that.
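To illustrate the idea (not the actual patch), the sketch below decodes suffix-style pieces and drops the whitespace marker on the final piece, which is where the dummy whitespace ends up when whitespace is treated as a suffix. `DecodeSuffixPiecesSketch` is a hypothetical helper; the real change lives in DecodeSentencePiece and also consults the model settings, which this sketch ignores.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (illustration only): joins pieces whose whitespace marker
// is a suffix, and strips the marker from the last piece so that no trailing
// dummy whitespace is emitted.
std::string DecodeSuffixPiecesSketch(const std::vector<std::string>& pieces) {
  static const std::string kSpaceSymbol = "\xE2\x96\x81";  // U+2581 in UTF-8
  std::string text;
  for (std::size_t i = 0; i < pieces.size(); ++i) {
    std::string piece = pieces[i];
    const bool is_last = (i + 1 == pieces.size());
    const bool ends_with_ws =
        piece.size() >= kSpaceSymbol.size() &&
        piece.compare(piece.size() - kSpaceSymbol.size(), kSpaceSymbol.size(),
                      kSpaceSymbol) == 0;
    // The terminating marker on the last piece is the dummy whitespace: drop it.
    if (is_last && ends_with_ws) {
      piece.erase(piece.size() - kSpaceSymbol.size());
    }
    // Replace any remaining markers with a plain space.
    std::size_t pos = 0;
    while ((pos = piece.find(kSpaceSymbol, pos)) != std::string::npos) {
      piece.replace(pos, kSpaceSymbol.size(), " ");
      pos += 1;
    }
    text += piece;
  }
  return text;
}
```

With this behavior, decoding the pieces produced for "Hello world" gives back "Hello world" rather than "Hello world " with a trailing space.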

Lastly, in Decode, SPM correctly assigns the Text field of the SentencePieceText structure to the "normalized" form (that is, the output of the denormalizer's normalization), but it does not update the piece surfaces and their indices to match. The piece surfaces currently returned are still "unnormalized". This is a problem with case encoding, since the surface strings may contain case markers, which are generally not useful. This PR corrects the surface strings so that they match the actual Text.


rjai · 2021-07-31T20:52:35Z

src/sentencepiece_processor.cc

+    } else {
+        if(is_eos_ws
+           && (!model_proto_
+               || (model_proto_


Correct formatting

rjai · 2021-08-01T05:44:45Z

src/sentencepiece_processor.cc

+      auto *spiece = spt->mutable_pieces(i);
+      auto curr_surface = spiece->surface();
+
+      // De-normalize curr_surface using o2n. Missing chars (bytes) are deleted (ambiguous)


Stale comment.

snukky

Only a few requests: please document the code and the changes a little more by adding comments. Consider also mentioning the fixes from the PR description as comments in the relevant parts of the code.

snukky · 2021-08-01T08:16:58Z

src/case_encoder.h

@@ -72,34 +75,53 @@ class UpperCaseEncoder : public CaseEncoder {
  UpperCaseEncoder(bool removeExtraWhiteSpace)
  : removeExtraWhiteSpace_(removeExtraWhiteSpace) {}

-  std::pair<absl::string_view, int> normalizePrefix(absl::string_view input) {
+  std::pair<absl::string_view, int> normalizePrefix(absl::string_view orig_input) {
+    if((dump_buffer_from_ >= 0) && (dump_buffer_from_ < buffer_queue_.size())) {


I think this "if" statement deserves a short comment explaining what is handled here.

snukky · 2021-08-01T08:18:47Z

src/case_encoder.h

+      return buffer_queue_[dump_buffer_from_++];
+    }
+
+    if(dump_buffer_from_ > -1) {


As above, consider adding a comment explaining what happened and what will be done inside the block.

snukky · 2021-08-01T08:20:18Z

src/case_encoder.h

@@ -62,6 +62,9 @@ class UpperCaseEncoder : public CaseEncoder {
 private:
  std::string buffer_;
  std::string signature_;
+  int offset = 0;


I suppose it could be named offset_, as it is a (private) attribute.

snukky · 2021-08-01T08:22:05Z

src/case_encoder.h

-    auto null = [](int consumed) -> std::pair<absl::string_view, int> {
-      return {{nullptr, 0}, consumed};
+    auto null = [this](int consumed) -> std::pair<absl::string_view, int> {
+      offset += consumed;


Consider adding an in-line comment explaining why this is done.

snukky · 2021-08-01T08:24:45Z

src/case_encoder.h

+      auto cur_buf_last = buffer_.size();
      buffer_.append(sp.data(), sp.size());
+      auto tmp_str = std::string(buffer_).substr(cur_buf_last, sp.size());
+      buffer_queue_.push_back({tmp_str, override_consumed == -1 ? p.second : override_consumed});


As above, consider adding a few comments here, so that it will be easier to recall what is being done here.

snukky · 2021-08-01T08:26:03Z

src/case_encoder.h

@@ -62,6 +62,9 @@ class UpperCaseEncoder : public CaseEncoder {
 private:
  std::string buffer_;
  std::string signature_;
+  int offset = 0;
+  std::vector<std::pair<std::string, int> > buffer_queue_;


What's the second element of the pairs here? Consider adding a comment.

snukky · 2021-08-01T08:35:46Z

src/sentencepiece_processor.cc

@@ -594,13 +608,51 @@ util::Status SentencePieceProcessor::Decode(
    if (!IsByte(sp.id())) {
      RETURN_IF_ERROR(ProcessBytePieces(byte_start, i));
      byte_start = i + 1;
-      SetSurface(i, DecodeSentencePiece(sp.piece(), sp.id(), text->empty()));
+      bool is_eos_space = i == spt->pieces_size() - 1;
+      SetSurface(i, DecodeSentencePiece(sp.piece(), sp.id(), text->empty(), is_eos_space));
    }
  }
  RETURN_IF_ERROR(ProcessBytePieces(byte_start, spt->pieces_size()));

  if (denormalizer_) {


Please document the code inside this block a bit more. Maybe even consider extracting it (or a part of it) into a subroutine if appropriate.

rjai · 2021-08-01T16:23:14Z

@snukky Before we merge this, should we think about regression testing for these changes? I, for one, have yet to diff the outputs extensively and confirm that they are exactly the same as with the old setup (though it seems that way visually, and from head diff).

snukky · 2021-08-01T18:56:49Z

Yes, that's a very good idea. Let's sync on priv.

* Adding alternative project name for spm latest to prevent lib conflicts
* Update cmake
* Update CMakeFiles to allow for configurable artifact names
* Enables --encode_unicode_case option for case-aware sentence piece (marian-nmt#10)
* Enables --encode_unicode_case option for case-aware sentence piece
* Example: This IS a TEST OF THE CASING gets converted internally to Tthis Uis a Atest of the casing before segmentation.
* This is fully reversible.
* Enable toggling Case Encoding flag from C++ Train API (marian-nmt#11)
* Enable toggling Case Encoding flag from C++ Train API
* Fixing issue with hardcoding truth value of encode_decode_case flag
* Disable denormalizer flags (marian-nmt#13)
Co-authored-by: Rohit Jain <Rohit.Jain@microsoft.com>
* Fix Surface String to Token Mappings for Case Encoding (marian-nmt#12)
Co-authored-by: Marcin Junczys-Dowmunt <marcinjd@microsoft.com>
Co-authored-by: Rohit Jain <Rohit.Jain@microsoft.com>
* add one header file to installation
* Rename VERSION to VERSION.txt
* Rename VERSION to VERSION.txt

Installing python package fails with below error. This change addresses this issue

```
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [10 lines of output]
    Traceback (most recent call last):
      File "<string>", line 2, in <module>
      File "<pip-setuptools-caller>", line 34, in <module>
      File "/home/alferre/code/sentencepiece/python/setup.py", line 111, in <module>
        version=version(),
      File "/home/alferre/code/sentencepiece/python/setup.py", line 36, in version
        with codecs.open('VERSION.txt', 'r', 'utf-8') as f:
      File "/opt/conda/envs/ptca/lib/python3.8/codecs.py", line 905, in open
        file = builtins.open(filename, mode, buffering)
    FileNotFoundError: [Errno 2] No such file or directory: 'VERSION.txt'
    [end of output]
```

---------

Co-authored-by: Rohit Jain <rjai@microsoft.com>
Co-authored-by: Rohit Jain <Rohit.Jain@microsoft.com>
Co-authored-by: Marcin Junczys-Dowmunt <marcinjd@microsoft.com>
Co-authored-by: Roman Grundkiewicz <rgrundkiewicz@gmail.com>
Co-authored-by: alexandremuzio <ax.muzio@gmail.com>
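The case-encoding example in the commit message above ("This IS a TEST OF THE CASING" becoming "Tthis Uis a Atest of the casing") can be reproduced with a simplified, ASCII-only sketch. The marker meanings used here (T for a title-cased word, U for a single all-caps word, A for a run of two or more all-caps words) are inferred from that one example, and `EncodeCaseSketch` is a hypothetical name. The actual logic is in src/case_encoder.h and handles Unicode and span termination, which this sketch does not, so unlike the real scheme it is not reversible.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

enum class WordCase { kLower, kTitle, kUpper, kOther };

// Classifies a whitespace-delimited ASCII word by its casing pattern.
static WordCase Classify(const std::string& w) {
  std::size_t upper = 0, lower = 0;
  for (unsigned char c : w) {
    if (std::isupper(c)) ++upper;
    else if (std::islower(c)) ++lower;
  }
  if (upper == 0) return WordCase::kLower;
  if (lower == 0 && upper >= 2) return WordCase::kUpper;
  if (upper == 1 && std::isupper(static_cast<unsigned char>(w[0]))) {
    return WordCase::kTitle;
  }
  return WordCase::kOther;  // mixed case such as "iPhone": left untouched
}

static std::string ToLower(std::string w) {
  for (char& c : w) {
    c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
  }
  return w;
}

// Simplified case encoder (assumption-laden sketch, see lead-in).
// Note: runs of whitespace are collapsed to single spaces.
std::string EncodeCaseSketch(const std::string& text) {
  std::istringstream in(text);
  std::vector<std::string> words;
  for (std::string w; in >> w;) words.push_back(w);

  std::string out;
  for (std::size_t i = 0; i < words.size();) {
    if (!out.empty()) out += ' ';
    const WordCase wc = Classify(words[i]);
    if (wc == WordCase::kUpper) {
      // Find the end of the run of consecutive all-caps words.
      std::size_t j = i;
      while (j < words.size() && Classify(words[j]) == WordCase::kUpper) ++j;
      out += (j - i >= 2) ? 'A' : 'U';
      for (std::size_t k = i; k < j; ++k) {
        if (k > i) out += ' ';
        out += ToLower(words[k]);
      }
      i = j;
    } else {
      if (wc == WordCase::kTitle) {
        out += 'T';
        out += ToLower(words[i]);
      } else {
        out += words[i];
      }
      ++i;
    }
  }
  return out;
}
```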

emjotde and others added 30 commits May 2, 2021 23:48

first steps towards unicode case handling

ee72f10

merge case mapping with normalization

dc3dfe8

add case_encoder.h

cae9469

correct decoding

0b12080

rewrite case normalizer

5c5483a

prepare for decoding

8ddba2d

working encoding/decoding

221b38d

working and fast case mapping

a80f4a6

split implementation

b80fa2b

add file

f6a508b

add space as delimiter for case

8548547

minimize diff

0f866ee

clean up normalizer.cc

3ae48a8

better unicode case folding

ba02a13

before refactoring

1ec9054

remove punctuation marker

1714615

Revert "remove punctuation marker"

c576987

This reverts commit 1714615.

working long-range encoding

fdb58a2

short-circuit computations

73982b2

fix bad whitespace behavior

3c93260

Fix complexity related exceptions in regex_search on windows

a21a6ed

Add warning message in case of regex segmentation error

be8a769

Improve error messages

da762d2

Correct the norm_to_orig values for case encoding normalization

8201c54

Buffer and return all normalized subpieces with correct alignment

e88ce4c

Update UpperCaseEncoder to correctly handle terminating L

abcf4e5

Be consistent about recursive use or nullzero return

d940fcf

Fix decode side piece surface strings in presence of a denormalizer

37092c0

Make SPM strip off the terminating dummy whitespace to ensure Decode(…

1b08cd2

…Encode(x)) == x

Ignore adjustments for bytefallbacks

b346d07

Rohit Jain added 7 commits July 30, 2021 06:15

Remove ByteFallback special code and ensure alll bytes are surfaced

2b708f4

correct last

dea4a78

Fix buffer_queue

87e3581

Fix bug after moving to string from string_view

4b570eb

Remove logs

39d7dc7

Merge branch 'master_real' into rjai/casing

6fbd81a

Merge cleanup

fd8aca2

rjai requested review from emjotde and snukky July 30, 2021 21:38


snukky approved these changes Aug 1, 2021


Review comment fixes

2125e9a

emjotde approved these changes Aug 27, 2021


emjotde merged commit c307b87 into marian-nmt:master Aug 31, 2021
