-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Surface String to Token Mappings for Case Encoding #12
Conversation
This reverts commit 1714615.
src/sentencepiece_processor.cc
Outdated
} else { | ||
if(is_eos_ws | ||
&& (!model_proto_ | ||
|| (model_proto_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct formatting
src/sentencepiece_processor.cc
Outdated
auto *spiece = spt->mutable_pieces(i); | ||
auto curr_surface = spiece->surface(); | ||
|
||
// De-normalize curr_surface using o2n. Missing chars (bytes) are deleted (ambiguous) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stale comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only a few requests on documenting the code and changes little more by adding more comments. Consider also including mentions of the fixes from the PR description into the relevant parts of the code as comments.
@@ -72,34 +75,53 @@ class UpperCaseEncoder : public CaseEncoder { | |||
UpperCaseEncoder(bool removeExtraWhiteSpace) | |||
: removeExtraWhiteSpace_(removeExtraWhiteSpace) {} | |||
|
|||
std::pair<absl::string_view, int> normalizePrefix(absl::string_view input) { | |||
std::pair<absl::string_view, int> normalizePrefix(absl::string_view orig_input) { | |||
if((dump_buffer_from_ >= 0) && (dump_buffer_from_ < buffer_queue_.size())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this "if" statement deserves a short comment explaining what is handled here.
return buffer_queue_[dump_buffer_from_++]; | ||
} | ||
|
||
if(dump_buffer_from_ > -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, consider adding a comment explaining what happened and what will be done inside the block.
src/case_encoder.h
Outdated
@@ -62,6 +62,9 @@ class UpperCaseEncoder : public CaseEncoder { | |||
private: | |||
std::string buffer_; | |||
std::string signature_; | |||
int offset = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose it could be named as offset_
as it as an (private) attribute.
src/case_encoder.h
Outdated
auto null = [](int consumed) -> std::pair<absl::string_view, int> { | ||
return {{nullptr, 0}, consumed}; | ||
auto null = [this](int consumed) -> std::pair<absl::string_view, int> { | ||
offset += consumed; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding an in-line comment explaining why this is done.
src/case_encoder.h
Outdated
auto cur_buf_last = buffer_.size(); | ||
buffer_.append(sp.data(), sp.size()); | ||
auto tmp_str = std::string(buffer_).substr(cur_buf_last, sp.size()); | ||
buffer_queue_.push_back({tmp_str, override_consumed == -1 ? p.second : override_consumed}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, consider adding a few comments here, so that it will be easier to recall what is being done here.
@@ -62,6 +62,9 @@ class UpperCaseEncoder : public CaseEncoder { | |||
private: | |||
std::string buffer_; | |||
std::string signature_; | |||
int offset = 0; | |||
std::vector<std::pair<std::string, int> > buffer_queue_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the second key in pairs here? Consider adding a comment.
@@ -594,13 +608,51 @@ util::Status SentencePieceProcessor::Decode( | |||
if (!IsByte(sp.id())) { | |||
RETURN_IF_ERROR(ProcessBytePieces(byte_start, i)); | |||
byte_start = i + 1; | |||
SetSurface(i, DecodeSentencePiece(sp.piece(), sp.id(), text->empty())); | |||
bool is_eos_space = i == spt->pieces_size() - 1; | |||
SetSurface(i, DecodeSentencePiece(sp.piece(), sp.id(), text->empty(), is_eos_space)); | |||
} | |||
} | |||
RETURN_IF_ERROR(ProcessBytePieces(byte_start, spt->pieces_size())); | |||
|
|||
if (denormalizer_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please document the code inside this block a bit more. Maybe even consider extracting it (or a part of it) into a subroutine if appropriate.
@snukky Before we merge this, should we think about regression testing for these changes? I for one, have yet to extensively diff and ensure output of these changes is exactly the same as the old setup (though it seems that way visually, and from head diff) |
Yes, that's a very good idea. Let's sync on priv. |
* Adding alternative project name for spm latest to prevent lib conflicts * Update cmake * Update CMakeFiles to allow for configurable artifact names * Enables --encode_unicode_case option for case-aware sentence piece (marian-nmt#10) * Enables --encode_unicode_case option for case-aware sentence piece * Example: This IS a TEST OF THE CASING gets converted internally to Tthis Uis a Atest of the casing before segmentation. * This is fully reversible. * Enable toggling Case Encoding flag from C++ Train API (marian-nmt#11) * Enable toggling Case Encoding flag from C++ Train API * Fixing issue with hardcoding truth value of encode_decode_case flag * Disable denormalizer flags (marian-nmt#13) Co-authored-by: Rohit Jain <Rohit.Jain@microsoft.com> * Fix Surface String to Token Mappings for Case Encoding (marian-nmt#12) Co-authored-by: Marcin Junczys-Dowmunt <marcinjd@microsoft.com> Co-authored-by: Rohit Jain <Rohit.Jain@microsoft.com> * add one header file to installation * Rename VERSION to VERSION.txt * Rename VERSION to VERSION.txt Installing python package fails with below error. This change addresses this issue ``` × python setup.py egg_info did not run successfully. │ exit code: 1 ╰─> [10 lines of output] Traceback (most recent call last): File "<string>", line 2, in <module> File "<pip-setuptools-caller>", line 34, in <module> File "/home/alferre/code/sentencepiece/python/setup.py", line 111, in <module> version=version(), File "/home/alferre/code/sentencepiece/python/setup.py", line 36, in version with codecs.open('VERSION.txt', 'r', 'utf-8') as f: File "/opt/conda/envs/ptca/lib/python3.8/codecs.py", line 905, in open file = builtins.open(filename, mode, buffering) FileNotFoundError: [Errno 2] No such file or directory: 'VERSION.txt' [end of output] ``` --------- Co-authored-by: Rohit Jain <rjai@microsoft.com> Co-authored-by: Rohit Jain <Rohit.Jain@microsoft.com> Co-authored-by: Marcin Junczys-Dowmunt <marcinjd@microsoft.com> Co-authored-by: Roman Grundkiewicz <rgrundkiewicz@gmail.com> Co-authored-by: alexandremuzio <ax.muzio@gmail.com>
This PR fixes the SentencePieceText structure returned upon using the C++ APIs. This is particularly relevant in the context of Case Encoding.
Previously in Encode, the Case Encoding scheme was returning incomplete norm_to_orig mappings which would prevent the code that assigned surfaces to each piece from doing it's job correctly. This PR corrects the norm_to_orig mapping. This fix allows us to ensure that SentencePieceText adheres to the API invariant that
text.substr(piece.begin, piece.end - piece.begin) = piece.surface
Second, there is a bug in SentencePiece where when it is run with treat_whitespace_as_suffix and add_dummy_prefix together, upon decoding the string, an extra whitespace may be emitted in the text. This PR corrects that.
Lastly, in case of Decode, while SPM assigns the Text parameter in SentencePieceText structure correctly to the "Normalized" (the denormalizer's normalization) form, it does not correct the piece surfaces with their corresponding fixed indices. The piece surfaces returned currently are still "unnormalized". This is a problem with case encoding since the piece surface strings may have case markers which is not useful generally. This PR corrects the surface strings to match up with the actual Text.