Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Surface String to Token Mappings for Case Encoding #12

Merged
merged 38 commits into from
Aug 31, 2021

Conversation

rjai
Copy link
Collaborator

@rjai rjai commented Jul 30, 2021

This PR fixes the SentencePieceText structure returned upon using the C++ APIs. This is particularly relevant in the context of Case Encoding.

Previously in Encode, the Case Encoding scheme was returning incomplete norm_to_orig mappings which would prevent the code that assigned surfaces to each piece from doing it's job correctly. This PR corrects the norm_to_orig mapping. This fix allows us to ensure that SentencePieceText adheres to the API invariant that text.substr(piece.begin, piece.end - piece.begin) = piece.surface

Second, there is a bug in SentencePiece where when it is run with treat_whitespace_as_suffix and add_dummy_prefix together, upon decoding the string, an extra whitespace may be emitted in the text. This PR corrects that.

Lastly, in case of Decode, while SPM assigns the Text parameter in SentencePieceText structure correctly to the "Normalized" (the denormalizer's normalization) form, it does not correct the piece surfaces with their corresponding fixed indices. The piece surfaces returned currently are still "unnormalized". This is a problem with case encoding since the piece surface strings may have case markers which is not useful generally. This PR corrects the surface strings to match up with the actual Text.

@rjai rjai requested review from emjotde and snukky July 30, 2021 21:38
} else {
if(is_eos_ws
&& (!model_proto_
|| (model_proto_
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct formatting

auto *spiece = spt->mutable_pieces(i);
auto curr_surface = spiece->surface();

// De-normalize curr_surface using o2n. Missing chars (bytes) are deleted (ambiguous)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale comment.

Copy link
Member

@snukky snukky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only a few requests on documenting the code and changes little more by adding more comments. Consider also including mentions of the fixes from the PR description into the relevant parts of the code as comments.

@@ -72,34 +75,53 @@ class UpperCaseEncoder : public CaseEncoder {
UpperCaseEncoder(bool removeExtraWhiteSpace)
: removeExtraWhiteSpace_(removeExtraWhiteSpace) {}

std::pair<absl::string_view, int> normalizePrefix(absl::string_view input) {
std::pair<absl::string_view, int> normalizePrefix(absl::string_view orig_input) {
if((dump_buffer_from_ >= 0) && (dump_buffer_from_ < buffer_queue_.size())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this "if" statement deserves a short comment explaining what is handled here.

return buffer_queue_[dump_buffer_from_++];
}

if(dump_buffer_from_ > -1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, consider adding a comment explaining what happened and what will be done inside the block.

@@ -62,6 +62,9 @@ class UpperCaseEncoder : public CaseEncoder {
private:
std::string buffer_;
std::string signature_;
int offset = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it could be named as offset_ as it as an (private) attribute.

auto null = [](int consumed) -> std::pair<absl::string_view, int> {
return {{nullptr, 0}, consumed};
auto null = [this](int consumed) -> std::pair<absl::string_view, int> {
offset += consumed;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding an in-line comment explaining why this is done.

Comment on lines 103 to 106
auto cur_buf_last = buffer_.size();
buffer_.append(sp.data(), sp.size());
auto tmp_str = std::string(buffer_).substr(cur_buf_last, sp.size());
buffer_queue_.push_back({tmp_str, override_consumed == -1 ? p.second : override_consumed});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, consider adding a few comments here, so that it will be easier to recall what is being done here.

@@ -62,6 +62,9 @@ class UpperCaseEncoder : public CaseEncoder {
private:
std::string buffer_;
std::string signature_;
int offset = 0;
std::vector<std::pair<std::string, int> > buffer_queue_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the second key in pairs here? Consider adding a comment.

@@ -594,13 +608,51 @@ util::Status SentencePieceProcessor::Decode(
if (!IsByte(sp.id())) {
RETURN_IF_ERROR(ProcessBytePieces(byte_start, i));
byte_start = i + 1;
SetSurface(i, DecodeSentencePiece(sp.piece(), sp.id(), text->empty()));
bool is_eos_space = i == spt->pieces_size() - 1;
SetSurface(i, DecodeSentencePiece(sp.piece(), sp.id(), text->empty(), is_eos_space));
}
}
RETURN_IF_ERROR(ProcessBytePieces(byte_start, spt->pieces_size()));

if (denormalizer_) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please document the code inside this block a bit more. Maybe even consider extracting it (or a part of it) into a subroutine if appropriate.

@rjai
Copy link
Collaborator Author

rjai commented Aug 1, 2021

@snukky Before we merge this, should we think about regression testing for these changes? I for one, have yet to extensively diff and ensure output of these changes is exactly the same as the old setup (though it seems that way visually, and from head diff)

@snukky
Copy link
Member

snukky commented Aug 1, 2021

Yes, that's a very good idea. Let's sync on priv.

@emjotde emjotde merged commit c307b87 into marian-nmt:master Aug 31, 2021
XapaJIaMnu added a commit to browsermt/sentencepiece that referenced this pull request May 3, 2023
* Adding alternative project name for spm latest to prevent lib conflicts

* Update cmake

* Update CMakeFiles to allow for configurable artifact names

* Enables --encode_unicode_case option for case-aware sentence piece (marian-nmt#10)

* Enables --encode_unicode_case option for case-aware sentence piece
* Example: This IS a TEST OF THE CASING gets converted internally to Tthis Uis a Atest of the casing before segmentation.
* This is fully reversible.

* Enable toggling Case Encoding flag from C++ Train API (marian-nmt#11)

* Enable toggling Case Encoding flag from C++ Train API
* Fixing issue with hardcoding truth value of encode_decode_case flag

* Disable denormalizer flags (marian-nmt#13)

Co-authored-by: Rohit Jain <Rohit.Jain@microsoft.com>

* Fix Surface String to Token Mappings for Case Encoding (marian-nmt#12)

Co-authored-by: Marcin Junczys-Dowmunt <marcinjd@microsoft.com>
Co-authored-by: Rohit Jain <Rohit.Jain@microsoft.com>

* add one header file to installation

* Rename VERSION to VERSION.txt

* Rename VERSION to VERSION.txt

Installing python package fails with below error.
This change addresses this issue
```
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [10 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/home/alferre/code/sentencepiece/python/setup.py", line 111, in <module>
          version=version(),
        File "/home/alferre/code/sentencepiece/python/setup.py", line 36, in version
          with codecs.open('VERSION.txt', 'r', 'utf-8') as f:
        File "/opt/conda/envs/ptca/lib/python3.8/codecs.py", line 905, in open
          file = builtins.open(filename, mode, buffering)
      FileNotFoundError: [Errno 2] No such file or directory: 'VERSION.txt'
      [end of output]
```

---------

Co-authored-by: Rohit Jain <rjai@microsoft.com>
Co-authored-by: Rohit Jain <Rohit.Jain@microsoft.com>
Co-authored-by: Marcin Junczys-Dowmunt <marcinjd@microsoft.com>
Co-authored-by: Roman Grundkiewicz <rgrundkiewicz@gmail.com>
Co-authored-by: alexandremuzio <ax.muzio@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants