Skip to content

(PUP-7498) update char encoding util per feedback#5799

Closed
MosesMendoza wants to merge 4 commits intopuppetlabs:stablefrom
MosesMendoza:PUP-7021/master/update_char_encoding_util
Closed

(PUP-7498) update char encoding util per feedback#5799
MosesMendoza wants to merge 4 commits intopuppetlabs:stablefrom
MosesMendoza:PUP-7021/master/update_char_encoding_util

Conversation

@MosesMendoza
Copy link
Contributor

This PR splits out adjustments to the character encoding heuristics that were previously a part of #5509 and contains changes based on feedback on that PR.

The three primary changes are:

  • Binary string handling
    Prior to this PR, we would transcode to UTF-8 a string not composed of valid UTF-8 bytes but valid in its current encoding. This presents a challenge for BINARY strings, which if originating in an encoding other to UTF-8 are otherwise valid, but cannot be transcoded (because we have no idea where they came from). In order to transcode successfully, we need an originating encoding to work with. Rather than just blindly call encode! with binary strings, we can take one more guess as to their origin - which is that they originated in Encoding.default_external. I.e., if a system is running in
    Encoding::Windows_31J, and a string is returned as BINARY, we can guess that the string might have been Encoding::Windows_31J. This perhaps substantially increases our odds that we'll be able to convert the (binary) string to UTF-8 successfully.

  • Consistently return nil
    Prior to this PR, we were modifying the supplied string in Puppet::Util::CharacterEncoding as a side-effect of the recent change that added force_encoding(Encoding.default_external) on binary strings. In the case where we fail to encode! that change leaves the original string with a
    different external encoding than it started. This PR addresses that by ensuring any on failure to convert we set the string back to its original external encoding. A quirk was accounted for in that we don't set external_encoding if the default external encoding is UTF-8. The reason is we've already determined that this string doesn't represent valid UTF-8 bytes so doing a force_encoding to UTF-8 doesn't get us any closer to being able to transcode it to UTF-8, and ruby won't do anything on encode! to string already has as its default_external set to the encoding supplied to encode!.

  • Separate out conversion from overriding of string encoding
    Prior to this PR, the Puppet::Util::CharacterEncoding::convert_to_utf_8! was inaccurately named in that the method would both convert (transcode) and override (set external encoding) on a string in different circumstances. After review, it was determined that this behavior was potentially dangerous and also could lead to inadvertent destruction or loss of string fidelity unintentionally. This PR updates the method to split out the two different operations, each with an improved description for when to use it. Conversion should be our primary use-case, for when we believe that the encoding of a string is an accurate representation of that string. Overriding a string's encoding should be used sparingly, effectively only when we have reason to believe that the existing encoding is inaccurate/incorrect.

    This PR adjusts the Puppet::Etc wrapper to account for the modified heuristics in Puppet::Util::CharacterEncoding. We leverage the new distinction between converting and overriding by only "overriding" the encoding to UTF-8 when the string is returned in BINARY and otherwise defaulting to converting it. This is a shift in default assumptions - whereas before we would assume the string encoding was not valid for that environment, now we do. We retain the preceding requirement that strings that cannot be converted are left unmodified in the Etc struct.

    In addition to updating the Etc wrapper to account for the preceding change to split out two methods in CharacterEncoding, prior to this PR the encoding management in our Etc wrapper incorrectly expected that ruby Etc would return BINARY encoded strings when the values on disk are invalid in the system encoding. This is incorrect - ruby Etc quite happily returns strings with invalid encoding. The only case where it appears to return binary is when the environment's encoding (Encoding.default_external) is ASCII (7-bit), and any individual character bit representation is equal to or greater than \x80 (binary 10000000), ie the first 8-bit value. We account for this change in expectation by fixing our spec tests to expect and test for the appropriate scenarios.

Finally, we also add a small spec helper for running a block in a different external encoding.

@puppetcla
Copy link

CLA signed by all contributors.

if value.is_a?(String)
converted = Puppet::Util::CharacterEncoding.convert_to_utf_8!(value)
struct[index] = converted if !converted.nil?
Puppet::Util::CharacterEncoding.override_encoding_to_utf_8!(value) if value.encoding != Encoding::UTF_8
Copy link
Contributor Author

@MosesMendoza MosesMendoza Apr 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joshcooper @Iristyle This change is the crux of our discussion - This now basically says we are explicitly overriding the encoding of the string returned by Etc to UTF-8 (not converting it) if it is a) not already UTF-8 and b) would be valid UTF-8. We are starting from the assumption that it is incorrectly labeled with whatever encoding it has (ie ISO-8859-1) because we have reason to believe it is actually UTF-8 because Puppet set (or will set) the value to UTF-8.

Just to reiterate for PR posterity why I believe this is the correct behavior: Consider these two scenarios. In the first, we default to assuming the encoding is correct from Etc and do encode! and in the second we assume the encoding is incorrect and override it with force_encoding.

  • Puppet agent is creating a new user with managed comment property. The host is running in ISO-8859-1 encoding. When puppet agent get's its should value via the catalog, the should value is UTF-8 and its bytes are written to disk as-is with no re-encoding/modification (we don't use Etc for writing, only reading). When Puppet runs next time, Ruby::Etc will read in these bytes and label them as ISO-8859-1, which will likely be valid in that encoding. At least, our commonly used mixed_utf8 string in tests is "valid" ISO-8859-1. When we pipe these values through Puppet::Etc, they'll now be encode!d as UTF-8 - changing the byte representation. So our original UTF-8 has now been modified to something else. When Puppet compares this is value to the catalog should value, they won't be the same... so Puppet will consider the resource out of sync and write the should value to disk again - and will every time it runs.
[1] pry(main)> mixed_utf_8 = "A\u06FF\u16A0\u{2070E}"
=> "Aۿᚠ𠜎"
[2] pry(main)> mixed_utf_8.bytes
=> [65, 219, 191, 225, 154, 160, 240, 160, 156, 142]
[3] pry(main)> mixed_utf_8.force_encoding(Encoding::ISO_8859_1)
=> "A\xDB\xBF\xE1\x9A\xA0\xF0\xA0\x9C\x8E"
[4] pry(main)> mixed_utf_8.valid_encoding?
=> true
[5] pry(main)> mixed_utf_8.encode!(Encoding::UTF_8)
=> "AÛ¿á\u009A ð \u009C\u008E"
[6] pry(main)> mixed_utf_8.bytes
=> [65, 195, 155, 194, 191, 195, 161, 194, 154, 194, 160, 195, 176, 194, 160, 194, 156, 194, 142]
[7] pry(main)> orig_mixed_utf_8 = "A\u06FF\u16A0\u{2070E}"
=> "Aۿᚠ𠜎"
[8] pry(main)> mixed_utf_8 == orig_mixed_utf_8
=> false

Etc is a special case for us. Consider the same scenario above - ie host in ISO-8859-1 managed by Puppet, but instead believing the non-UTF-8 encoding is inaccurate and doing force_encoding!.

  • Puppet will be managing the properties of a user, but Puppet has not run yet on the host. The values on disk or accurate an accurate representation of their encoding, ie ISO-8859-1. When Puppet first runs, these values are read in and incorrectly re-labeled (overridden) as UTF-8 by Puppet::Etc. Puppet compares these values to the the UTF-8 should values, finds them out of sync, and overwrites them on the system with the UTF-8 values. The next time Puppet runs, the UTF-8 values are read in as ISO-8859-1, but correctly overridden by Puppet::Etc to UTF-8, and Puppet correctly finds them in sync with the catalog should values. Even if we had not incorrectly overridden the encoding on the very first Puppet run, the result would have been the same - Puppet would have found the values out of sync, as the ISO-8859-1 value would not equal the UTF-8 value, and corrected them.

@MosesMendoza MosesMendoza force-pushed the PUP-7021/master/update_char_encoding_util branch from 16d12db to 410a25d Compare April 14, 2017 16:48

# When received as BINARY are not transcodable, but by "guessing"
# Encoding.default_external can transcode to UTF-8
win_31j = [130, 187].pack('C*')
Copy link
Contributor

@Iristyle Iristyle Apr 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably introduce a failure spec here that shows an incoming BINARY byte sequence that is not valid in Encoding.default_external ... i.e. it should raise an exception of InvalidByteSequenceError

One that is not valid in 31J is "\xFF\xFE\xFD"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test added

# nil upon a failure to legitimately set external encoding or transcode string
# @param [String] string a string to transcode
# @return [nil] (string is modified in place)
def convert_to_utf_8!(string)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to address it here, but it seems to me that convert_to_utf_8 should return a copy of the initial string rather than mutating it...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated - no longer a destructive method

if valid_utf_8_bytes?(string)
string.force_encoding(Encoding::UTF_8)
else
Puppet.debug(_("%{value} is not valid UTF-8 and result of overriding encoding would be invalid.") % { value: string.dump })
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be emitting a string into debug output that we know isn't UTF-8, or will dump always produce UTF-8 due to its behavior?

https://ruby-doc.org/core-2.2.0/String.html#method-i-dump

Copy link
Contributor Author

@MosesMendoza MosesMendoza Apr 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right - dump will print the escaped hex values of the string ie so the ultimate character output is, strictly speaking, valid UTF-8, even though it doesn't represent valid UTF-8.

[1] pry(main)> s = [255, 254, 253].pack('C*').force_encoding(Encoding::UTF_8)
=> "\xFF\xFE\xFD"
[2] pry(main)> s.valid_encoding?
=> false
[3] pry(main)> output = "bar #{s}"
=> "bar \xFF\xFE\xFD"
[4] pry(main)> puts output
bar ���
=> nil
[5] pry(main)> output = "bar #{s.dump}"
=> "bar \"\\xFF\\xFE\\xFD\""
[6] pry(main)> puts output
bar "\xFF\xFE\xFD"

# \u2603 - \xe2 \x98 \x83 - 226 152 131
snowman = [226, 152, 131].pack('C*')
Puppet::Util::CharacterEncoding.override_encoding_to_utf_8!(snowman)
expect(snowman).to eq("\u2603")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add to the expectation:

expect(snowman.encoding).to eq(Encoding::UTF_8)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 added

nil
end

private
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the valid_bytes implementation be simplified to?

def valid_utf_8_bytes?(string)
  string.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

Given one of your prior PRs, we determined .dup doesn't copy a string, but just creates a new wrapper around the same underlying string data - making dup pretty cheap. This lets us test bytes without having to do the save / revert of encoding on the original string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 that seems like a reasonable optimization

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

it "should convert convertible values in arrays to UTF-8" do
expect(converted.mem[0]).to eq("A\u06FF\u16A0\u{2070E}")
expect(converted.mem[0].encoding).to eq(Encoding::UTF_8) # just being explicit
it "should override EUC_KR-labeled values in arrays to UTF-8 if that would result in valid UTF-8" do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to re-examine these tests... not sure on the logic exactly. Might be helpful to make a small map of inputs / outputs / expectations to get a better handle on it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These were invalid, and they were being masked because we were modifying the expected original values. Pushed a new commit that ensures we're acting on copies for our tests so that the originals go unmodified.


it "should leave the string umodified" do
PuppetSpec::CharacterEncoding.with_external_encoding(Encoding::Windows_31J) do
Puppet::Util::CharacterEncoding.convert_to_utf_8!(invalid_win_31j)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

# group passwd field is valid UTF-8
group.passwd = mixed_utf_8
group
group = duplicate_struct_values(group)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just:

duplicate_struct_values(group)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for pointing that out

@Iristyle
Copy link
Contributor

Iristyle commented Apr 25, 2017

Everything is looking pretty good... One last thing to check before merge is:

  • Configure a Linux system with Latin1
  • Create a user with a name that contains a Latin1 accented character that is represented with different bytes in UTF-8 (José is a good option - the 4 bytes used should be [74, 111, 115, 233] - 233 is xE9)
  • Create a Puppet manifest that tries to create the same user, but with the name José represented in UTF-8 as its a manifest (the 5 bytes will be - [74, 111, 115, 195, 169] (195, 169 is xC3 xA9 / aka U+00E9)
  • Run puppet apply on manifest / observe output - assumption is that we end up with 2 users now.
  • Run puppet apply on manifest again to check for idempotency / observe output
  • Run puppet resource user at the shell with Latin1 encoding - observe output - \xC3 from UTF-8 is represented as à in Latin1 and \xA9 from UTF-8 is represented as © in Latin1
  • Run puppet resource user at the shell with UTF-8 encoding - observe output - \xE9 is not a valid byte sequence in UTF-8, so should hopefully show a replacement character?

Make sure we get what we want here... and also validate against existing Puppet behavior prior to any Etc changes (may have to go back a release or two)

@MosesMendoza MosesMendoza force-pushed the PUP-7021/master/update_char_encoding_util branch from ba5e370 to 1b626e7 Compare April 25, 2017 21:24
@MosesMendoza MosesMendoza added the blocked PRs blocked on work external to the PR itself label Apr 25, 2017
@MosesMendoza
Copy link
Contributor Author

Re-added blocked label. in running through manual testing with @Iristyle we determined that this works in a manifest driven workflow with agent in LATIN-1, but fails when running puppet resource directly on the agent when the resource values contain a mix of latin-1 and utf-8 strings.

text = resources.map do |resource|
r = resource.prune_parameters(:parameters_to_include => @extra_params).to_manifest
r.encoding == Encoding.default_external ? r : r.force_encoding(Encoding.default_external)
end.join("\n")
Copy link
Contributor Author

@MosesMendoza MosesMendoza May 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change needs a test done

# object with the original struct values overidden to UTF-8. For each member
# the struct also contains a corresponding :canonical_<member name> struct
# member
def override_field_values_to_utf8(struct)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests need updated

# returns new struct for the next entry or nil if EOF. Call ::endgrent to close file.
def getpwent
convert_field_values_to_utf8!(::Etc.getpwent)
override_field_values_to_utf8(::Etc.getpwent)
Copy link
Contributor Author

@MosesMendoza MosesMendoza May 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should get coverage for the new struct objects via these methods done

# @api public
# @param [String] string to set external encoding (re-label) to utf-8
# @return [String] a copy of string with external encoding set to utf-8
def override_encoding_to_utf_8(string)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per commit comment, ultimately kept this method (albeit with non-destructive form) because we have need to only override if the bytes are invalid when running puppet resource user in non-utf8 form. Otherwise, we then issue replacement characters in our now-invalid-utf8 string, which causes incorrect puppet resource user output.

if String.method_defined?(:scrub)
string.scrub
else
string.chars.map { |c| c.valid_encoding? ? c : "?"}.join
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scrub isn't present in 1.9.3, and various iterations on encode with replacement characters failed to work. supplying utf-8 for the first two args to encode is a no-op, and supplying binary as the second arg causes any non-ascii character to be replaced regardless of the original encoding or validity of the string...

@MosesMendoza
Copy link
Contributor Author

MosesMendoza commented May 1, 2017

TODO: need test fixups done

@Iristyle this is an implementation using extended structs with canonical values. ultimately not sure it's worth the extra logic, but it seems cleaner and more future proof though in case there are other places in the codebase that need what we have done in nameservice (still need to audit). as-is though I think this now works in all the cases we were testing together previously.

@MosesMendoza MosesMendoza force-pushed the PUP-7021/master/update_char_encoding_util branch 7 times, most recently from 0d109a4 to 131ad8b Compare May 3, 2017 00:05
# defines a new Class object, we memozie to avoid superfluous extra Class
# instantiations.
def puppet_etc_passwd_class
@password_class ||= Struct.new(*Etc::Passwd.members, *Etc::Passwd.members.map { |member| "canonical_#{member}".to_sym })
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these all strings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The member names will be strings - the values can vary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, strictly speaking, all the member names are returned here as an array of symbols, but the interpolation implicitly calls to_s on the symbol

Copy link
Contributor

@Iristyle Iristyle May 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, what I meant was... aren't canonical_ fields only useful for string values?

For instance, canonical_expire, canonical_gid, canonical_uid aren't really useful, are they?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, not particularly useful. To avoid that, we could hard code the list of available attributes here? I thought about it but decided to go with this because there's a couple things that make that more annoying

  • the list varies based on the struct class (and knowing ruby could change without warning) so we maintain different lists based on if we're doing a user or group struct
  • the list actually varies based on the flags passed to ruby during compilation . This means we'd need to conditionally check for the presence of the members that we expect to exist and have string values and only duplicate those. I feel like there's a tradeoff there in maintainability vs the completeness of just duplicating all the members
  • A sidenote that doesn't necessarily impact this discussion but is worth noting is that the Etc struct members aren't actually typed - any field could just as easily return a string as an integer. So if for some reason the underlying system changes the return value for any of these we're going to be in trouble. We could guard by only trying to encode things that are strings rather than expecting specific fields to be, but again that felt like going to far. Happy to entertain alternatives there though.

# defines a new Class object, we memoize to avoid superfluous extra Class
# instantiations.
def puppet_etc_group_class
@group_class ||= Struct.new(*Etc::Group.members, *Etc::Group.members.map { |member| "canonical_#{member}".to_sym })
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these all strings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

@MosesMendoza MosesMendoza force-pushed the PUP-7021/master/update_char_encoding_util branch 2 times, most recently from dfdfbc5 to 43dd9f8 Compare May 3, 2017 22:24
@MosesMendoza
Copy link
Contributor Author

cleaned up whitespace. This ran a local ci:test:git run against ubuntu 1604 and appeared to only fail in expected ways - at least none apparently related to this PR

@MosesMendoza MosesMendoza removed the blocked PRs blocked on work external to the PR itself label May 3, 2017
@MosesMendoza
Copy link
Contributor Author

MosesMendoza commented May 3, 2017

Should be good to merge 🤞

@Iristyle Iristyle changed the base branch from master to stable May 4, 2017 00:11
@Iristyle Iristyle changed the base branch from stable to master May 4, 2017 00:12
@Iristyle
Copy link
Contributor

Iristyle commented May 4, 2017

FYI - I think this should be targeted at stable... I've rebased and I'm testing that locally.

@MosesMendoza MosesMendoza force-pushed the PUP-7021/master/update_char_encoding_util branch from 43dd9f8 to 38ac008 Compare May 4, 2017 15:31
@MosesMendoza MosesMendoza changed the base branch from master to stable May 4, 2017 15:31
@MosesMendoza
Copy link
Contributor Author

@Iristyle rebased against stable

join("\n")
text = resources.map do |resource|
r = resource.prune_parameters(:parameters_to_include => @extra_params).to_hierayaml
r.encoding == Encoding.default_external? ? r : r.force_encoding(Encoding.default_external)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this just be simplified to r.force_encoding(Encoding.default_external) - i.e. is the check roughly as costly as performing the effectual no-op?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@MosesMendoza MosesMendoza force-pushed the PUP-7021/master/update_char_encoding_util branch from 38ac008 to e2fa408 Compare May 4, 2017 18:48
@puppetcla
Copy link

CLA signed by all contributors.

@MosesMendoza MosesMendoza force-pushed the PUP-7021/master/update_char_encoding_util branch 2 times, most recently from cd02674 to 6871ecf Compare May 4, 2017 21:30
@@ -86,10 +100,7 @@ def convert_to_utf_8!(string)
# @param [String] string a string to test
# @return [Boolean] whether we think the string is UTF-8 or not
def valid_utf_8_bytes?(string)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

valid_utf_8_bytes was not in prior commit and should be torched

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh hell

@MosesMendoza MosesMendoza force-pushed the PUP-7021/master/update_char_encoding_util branch 2 times, most recently from 24333a6 to 10b342c Compare May 4, 2017 21:44
@MosesMendoza MosesMendoza changed the title (PUP-7021) update char encoding util per feedback (PUP-7498) update char encoding util per feedback May 4, 2017
Moses Mendoza added 4 commits May 4, 2017 15:26
…version

* Prior to this commit, we would transcode to UTF-8 a string not composed of
  valid UTF-8 bytes but valid in its current encoding. This presents a challenge
  for BINARY strings, which if originating in an encoding other to UTF-8 are
  otherwise valid, but cannot be transcoded (because we have no idea where they
  came from). In order to transcode successfully, we need an originating
  encoding to work with. Rather than just blindly call encode! with binary
  strings, we can take one more guess as to their origin - which is that they
  originated in `Encoding.default_external`. I.e., if a system is running in
  Encoding::Windows_31J, and a string is returned as BINARY, we can guess that
  the string might have been Encoding::Windows_31J. This perhaps substantially
  increases our odds that we'll be able to conver the (binary) string to UTF-8
  successfully.

* Prior to this commit, we were modifying the supplied string in
  Puppet::Util::CharacterEncoding as a side-effect of the recent change that
  added `force_encoding(Encoding.default_external)` on binary strings. In the
  case where we fail to encode! that change leaves the original string with a
  different external encoding than it started. This commit addresses that by
  ensuring any on failure to convert we set the string back to its original
  external encoding.

* Prior to this commit, the Puppet::Util::CharacterEncoding::convert_to_utf_8!
  was inaccurately named in that the method would both convert (transcode) and
  override (set external encoding) on a string in different circumstances. After
  review, it was determined that this behavior was potentially dangerous and also
  could lead to inadvertent destruction or loss of string fidelity
  unintentionally. This commit updates the method to split out the two different
  operations, each with an improved description for when to use it. Conversion
  should be our primary use-case, for when we believe that the encoding of a
  string is an accurate representation of that string. Overriding a string's
  encoding should be used sparingly, effectively only when we have reason to
  believe that the existing encoding is inaccurate/incorrect.

* This commit adjusts the Puppet::Etc wrapper to account for the modified
  heuristics in Puppet::Util::CharacterEncoding. We leverage the new distinction
  between converting and overriding by only "overriding" the encoding to UTF-8
  when the string is returned in non-UTF-8 encoding. This is very explicitly us
  stating that we do not believe Ruby when we receive a non-UTF-8 value via Etc.
  The reason is we know that the value we wrote/managed is UTF-8 because it came
  in via Puppet as UTF-8. If the value has changed in between puppet runs, then
  Puppet will correctly address the drift by resetting the value to its correct
  UTF-8 value.

* In addition to updating the Etc wrapper to account for the preceding change to
  split out two methods in CharacterEncoding, prior to this commit the encoding
  management in our Etc wrapper incorrectly expected that ruby Etc would return
  BINARY encoded strings when the values on disk are invalid in the system
  encoding. This is incorrect - ruby Etc quite happily returns strings with
  invalid encoding. The only case where it appears to return binary is when the
  environment's encoding (Encoding.default_external) is ASCII (7-bit), and any
  individual character bit representation is equal to or greater than \x80
  (binary 10000000), ie the first 8-bit value. We account for this change in
  expectation by fixing our spec tests to expect and test for the appropriate
  scenarios.

* This commit also adds a small spec helper for running a block in a different
  external encoding.

Signed-off-by: Moses Mendoza <moses@puppet.com>
* `puppet resource` can correctly obtain resources in different encodings
* Attempting to join strings with incompatible encodings such as LATIN-1 and
  UTF-8 will fail
* After converting the resource to its console format, ensure
  it is in the correct external encoding before printing
* We do force_encoding to print the resource in its original (byte) state

Signed-off-by: Moses Mendoza <moses@puppet.com>
Add our own rendition of ruby's String.scrub. We defer to the original if it's
defined, which is only post-ruby 1.9.3.

Ruby 1.9.3 does not have a scrub method. Nor does using String#encode(<encoding>,
<encoding>, :replace => '?'...) work:
* Given a UTF-8 string, when Encoding::UTF-8 is passed as the
  first/second arguments, no replacement characters are inserted (ie, the call is
  a no-op).
* When Encoding::BINARY is passed as second argument any non-ASCII characters
  are _always_ replaced - even when the UTF-8 string was valid!

So instead we effectively make our own scrub, and only call to it when we need it.

Signed-off-by: Moses Mendoza <moses@puppet.com>
* When running in LATIN-1, we expect `puppet resource` to be able to manage
  UTF-8 resources alongside and without interfering with existing non-UTF-8
  resources.
* Previously, we would query the system for user names, and modify those
  usernames (override them to UTF-8 w/replacement chars) without keeping track
  of the original values returned by Etc. However, we would try to use these
  modified names to query the system again later - for example, the resource
  name is used to check if a user exists. This would erringly report an
  existing user absent - because we were looking for a modified username
* Now we track the original values returned by Etc inside new extended
  structs. These private interfaces are otherwise the same as their Etc
  counterparts, except they have canonical_ members for each original member
* We hide their definition inside of methods in etc.rb to avoid possibly referencing
  Etc::Passwd and Etc::Group where they is not available just by loading the
  code. We memoize to avoid creating multiple ephemeral class definitions,
  since Struct.new creates a new Class object one each call. The new canonical_
  members in the new structs track original state, while the original Etc
  members will track state overridden to UTF-8.
* We do this for every field value because we have the potential to modify the
  bytes of every field value using scrub to replace invalid byte sequences. We
  always scrub to avoid letting invalid strings into the puppet run lifecycle,
  where they wreak havoc when joined and manipulated.
* In order to make self.instances as sane as possible while still having access
  to the canonical_ value from the original struct, we move the logic for
  iterating over the Etc values to it, and deprecate listbyname. There are no
  other callers in the codebase of listbyname. The nameservice constructor is
  called with arguments of different types - in some cases a hash (as seen in
  this PR) and in others with a Puppet resource object that was created
  manually. We have to check that we have a hash before accessing the
  :canonical_name field because :canonical_name is not a property on the user
  resource, and the `:[]` method is overridden to validate properties before
  returning values.
* We revise all of the character_encoding and Puppet:Etc methods to be
  non-destructive. The logic was complicated and annoying with mutated state, so
  we stop. Any minor perf hit from string.dup is worth this sanity.
* Useradd shadow password support is one place we query Etc (again) for field
  values with the resource name possibly obtained by Etc the first time. Update
  to use @canonical_name, which should always be set because this class inherits
  from the base Nameservice provider.
* Make override_encoding_to_utf_8 non-destructive
  It was simply too unintuitve, risky, and annoying to mutate this state. Take
  the minor perf hit of duplicating the string for our sanity's sake.

Signed-off-by: Moses Mendoza <moses@puppet.com>
@MosesMendoza MosesMendoza force-pushed the PUP-7021/master/update_char_encoding_util branch from 10b342c to c6df9f9 Compare May 4, 2017 22:28
Iristyle added a commit that referenced this pull request May 8, 2017
@Iristyle
Copy link
Contributor

Iristyle commented May 8, 2017

Merged to stable in 365d6a9

@Iristyle Iristyle closed this May 8, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants