(PUP-7498) update char encoding util per feedback#5799
MosesMendoza wants to merge 4 commits into puppetlabs:stable
Conversation
CLA signed by all contributors.
Force-pushed from 5d9d3b4 to 16d12db
lib/puppet/etc.rb
Outdated
if value.is_a?(String)
converted = Puppet::Util::CharacterEncoding.convert_to_utf_8!(value)
struct[index] = converted if !converted.nil?
Puppet::Util::CharacterEncoding.override_encoding_to_utf_8!(value) if value.encoding != Encoding::UTF_8
@joshcooper @Iristyle This change is the crux of our discussion - This now basically says we are explicitly overriding the encoding of the string returned by Etc to UTF-8 (not converting it) if it is a) not already UTF-8 and b) would be valid UTF-8. We are starting from the assumption that it is incorrectly labeled with whatever encoding it has (ie ISO-8859-1) because we have reason to believe it is actually UTF-8 because Puppet set (or will set) the value to UTF-8.
Just to reiterate for PR posterity why I believe this is the correct behavior: Consider these two scenarios. In the first, we default to assuming the encoding is correct from Etc and do encode! and in the second we assume the encoding is incorrect and override it with force_encoding.
- Puppet agent is creating a new user with managed comment property. The host is running in ISO-8859-1 encoding. When puppet agent gets its `should` value via the catalog, the should value is UTF-8 and its bytes are written to disk as-is with no re-encoding/modification (we don't use Etc for writing, only reading). When Puppet runs next time, Ruby::Etc will read in these bytes and label them as ISO-8859-1, which will likely be valid in that encoding. At least, our commonly used `mixed_utf_8` string in tests is "valid" ISO-8859-1. When we pipe these values through Puppet::Etc, they'll now be `encode!`d as UTF-8 - changing the byte representation. So our original UTF-8 has now been modified to something else. When Puppet compares this `is` value to the catalog `should` value, they won't be the same... so Puppet will consider the resource out of sync and write the `should` value to disk again - and will every time it runs.
[1] pry(main)> mixed_utf_8 = "A\u06FF\u16A0\u{2070E}"
=> "Aۿᚠ𠜎"
[2] pry(main)> mixed_utf_8.bytes
=> [65, 219, 191, 225, 154, 160, 240, 160, 156, 142]
[3] pry(main)> mixed_utf_8.force_encoding(Encoding::ISO_8859_1)
=> "A\xDB\xBF\xE1\x9A\xA0\xF0\xA0\x9C\x8E"
[4] pry(main)> mixed_utf_8.valid_encoding?
=> true
[5] pry(main)> mixed_utf_8.encode!(Encoding::UTF_8)
=> "AÛ¿á\u009A ð \u009C\u008E"
[6] pry(main)> mixed_utf_8.bytes
=> [65, 195, 155, 194, 191, 195, 161, 194, 154, 194, 160, 195, 176, 194, 160, 194, 156, 194, 142]
[7] pry(main)> orig_mixed_utf_8 = "A\u06FF\u16A0\u{2070E}"
=> "Aۿᚠ𠜎"
[8] pry(main)> mixed_utf_8 == orig_mixed_utf_8
=> false
Etc is a special case for us. Consider the same scenario above - ie host in ISO-8859-1 managed by Puppet, but instead believing the non-UTF-8 encoding is inaccurate and doing force_encoding.
- Puppet will be managing the properties of a user, but Puppet has not run yet on the host. The values on disk are an accurate representation of their encoding, ie ISO-8859-1. When Puppet first runs, these values are read in and incorrectly re-labeled (overridden) as UTF-8 by Puppet::Etc. Puppet compares these values to the UTF-8 `should` values, finds them out of sync, and overwrites them on the system with the UTF-8 values. The next time Puppet runs, the UTF-8 values are read in as ISO-8859-1, but correctly overridden by Puppet::Etc to UTF-8, and Puppet correctly finds them in sync with the catalog `should` values. Even if we had not incorrectly overridden the encoding on the very first Puppet run, the result would have been the same - Puppet would have found the values out of sync, as the ISO-8859-1 value would not equal the UTF-8 value, and corrected them.
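The override-only-when-valid heuristic described above can be sketched roughly like this (method names follow the PR discussion, but this is an illustrative simplification, not the actual Puppet implementation):

```ruby
# Illustrative sketch of the heuristic discussed above, not Puppet's real code.
# Strings from Etc that are not already labeled UTF-8 are assumed to be
# mislabeled UTF-8 data, so we relabel (force_encoding) rather than transcode,
# and only when the raw bytes really are valid UTF-8.
def override_encoding_to_utf_8(string)
  copy = string.dup.force_encoding(Encoding::UTF_8)
  copy.valid_encoding? ? copy : nil # nil when relabeling would yield invalid UTF-8
end

def utf_8_from_etc(value)
  return value if value.encoding == Encoding::UTF_8
  override_encoding_to_utf_8(value) || value # leave the value alone on failure
end
```

With the `mixed_utf_8` example above, a value Etc labeled ISO-8859-1 comes back byte-identical but relabeled UTF-8, so it compares equal to the catalog's `should` value instead of being re-transcoded.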
16d12db to
410a25d
Compare
# When received as BINARY are not transcodable, but by "guessing"
# Encoding.default_external can transcode to UTF-8
win_31j = [130, 187].pack('C*')
We should probably introduce a failure spec here that shows an incoming BINARY byte sequence that is not valid in Encoding.default_external ... i.e. it should raise an exception of InvalidByteSequenceError
One that is not valid in 31J is "\xFF\xFE\xFD"
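A minimal demonstration of that failure case in plain Ruby (rather than a full spec), using the byte sequence suggested above:

```ruby
# "\xFF\xFE\xFD" is not a valid byte sequence in Windows-31J, so labeling the
# BINARY bytes with the guessed encoding and then transcoding raises an
# EncodingError (typically Encoding::InvalidByteSequenceError for bad input bytes).
bytes   = [0xFF, 0xFE, 0xFD].pack('C*')               # arrives as BINARY
labeled = bytes.force_encoding(Encoding::Windows_31J) # the default_external "guess"
begin
  labeled.encode(Encoding::UTF_8)
  raise "expected transcoding to fail"
rescue EncodingError => e # usually Encoding::InvalidByteSequenceError here
  puts "raised #{e.class}"
end
```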
# nil upon a failure to legitimately set external encoding or transcode string
# @param [String] string a string to transcode
# @return [nil] (string is modified in place)
def convert_to_utf_8!(string)
We don't need to address it here, but it seems to me that convert_to_utf_8 should return a copy of the initial string rather than mutating it...
Updated - no longer a destructive method
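A non-destructive convert could look roughly like this (an illustrative sketch of the idea, not the method as merged):

```ruby
# Sketch of a non-destructive convert_to_utf_8: return a transcoded copy and
# leave the caller's string untouched; nil signals a failure to transcode.
def convert_to_utf_8(string)
  string.encode(Encoding::UTF_8)
rescue EncodingError
  nil
end

latin = "Ação".encode(Encoding::ISO_8859_1)
utf   = convert_to_utf_8(latin)
# utf is a new UTF-8 string; latin keeps its ISO-8859-1 encoding and bytes
```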
if valid_utf_8_bytes?(string)
string.force_encoding(Encoding::UTF_8)
else
Puppet.debug(_("%{value} is not valid UTF-8 and result of overriding encoding would be invalid.") % { value: string.dump })
Should we be emitting a string into debug output that we know isn't UTF-8, or will dump always produce UTF-8 due to its behavior?
Right - dump will print the escaped hex values of the string, so the ultimate character output is, strictly speaking, valid UTF-8, even though it doesn't represent valid UTF-8.
[1] pry(main)> s = [255, 254, 253].pack('C*').force_encoding(Encoding::UTF_8)
=> "\xFF\xFE\xFD"
[2] pry(main)> s.valid_encoding?
=> false
[3] pry(main)> output = "bar #{s}"
=> "bar \xFF\xFE\xFD"
[4] pry(main)> puts output
bar ���
=> nil
[5] pry(main)> output = "bar #{s.dump}"
=> "bar \"\\xFF\\xFE\\xFD\""
[6] pry(main)> puts output
bar "\xFF\xFE\xFD"
# \u2603 - \xe2 \x98 \x83 - 226 152 131
snowman = [226, 152, 131].pack('C*')
Puppet::Util::CharacterEncoding.override_encoding_to_utf_8!(snowman)
expect(snowman).to eq("\u2603")
Should we add to the expectation:
expect(snowman.encoding).to eq(Encoding::UTF_8)
nil
end
private
Can the valid_bytes implementation be simplified to?
def valid_utf_8_bytes?(string)
  string.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end
Given one of your prior PRs, we determined .dup doesn't copy a string, but just creates a new wrapper around the same underlying string data - making dup pretty cheap. This lets us test bytes without having to do the save / revert of encoding on the original string.
👍 that seems like a reasonable optimization
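For illustration, the suggested check behaves like this as a standalone sketch:

```ruby
# dup shares the underlying byte buffer (copy-on-write), so this tests the
# bytes as UTF-8 without mutating the original string's encoding label.
def valid_utf_8_bytes?(string)
  string.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

snowman = [226, 152, 131].pack('C*') # UTF-8 bytes for U+2603, labeled BINARY
junk    = [255, 254, 253].pack('C*') # not valid UTF-8 under any labeling
valid_utf_8_bytes?(snowman)          # => true
valid_utf_8_bytes?(junk)             # => false
snowman.encoding                     # still ASCII-8BIT (BINARY), unchanged
```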
it "should convert convertible values in arrays to UTF-8" do
expect(converted.mem[0]).to eq("A\u06FF\u16A0\u{2070E}")
expect(converted.mem[0].encoding).to eq(Encoding::UTF_8) # just being explicit
it "should override EUC_KR-labeled values in arrays to UTF-8 if that would result in valid UTF-8" do
Note to re-examine these tests... not sure on the logic exactly. Might be helpful to make a small map of inputs / outputs / expectations to get a better handle on it.
These were invalid, and they were being masked because we were modifying the expected original values. Pushed a new commit that ensures we're acting on copies for our tests so that the originals go unmodified.
it "should leave the string unmodified" do
PuppetSpec::CharacterEncoding.with_external_encoding(Encoding::Windows_31J) do
Puppet::Util::CharacterEncoding.convert_to_utf_8!(invalid_win_31j)
spec/unit/etc_spec.rb
Outdated
# group passwd field is valid UTF-8
group.passwd = mixed_utf_8
group
group = duplicate_struct_values(group)
Why not just:
duplicate_struct_values(group)
thanks for pointing that out
Everything is looking pretty good... One last thing to check before merge is:
Make sure we get what we want here... and also validate against existing Puppet behavior prior to any Etc changes (may have to go back a release or two)
Force-pushed from ba5e370 to 1b626e7
Re-added blocked label. In running through manual testing with @Iristyle we determined that this works in a manifest-driven workflow with agent in LATIN-1, but fails when running puppet resource directly on the agent when the resource values contain a mix of latin-1 and utf-8 strings.
Force-pushed from 1b626e7 to 57ea2f1
lib/puppet/application/resource.rb
Outdated
text = resources.map do |resource|
r = resource.prune_parameters(:parameters_to_include => @extra_params).to_manifest
r.encoding == Encoding.default_external ? r : r.force_encoding(Encoding.default_external)
end.join("\n")
This change needs a test. (done)
# object with the original struct values overridden to UTF-8. For each member
# the struct also contains a corresponding :canonical_<member name> struct
# member
def override_field_values_to_utf8(struct)
These tests need to be updated.
# returns new struct for the next entry or nil if EOF. Call ::endgrent to close file.
def getpwent
convert_field_values_to_utf8!(::Etc.getpwent)
override_field_values_to_utf8(::Etc.getpwent)
Should get coverage for the new struct objects via these methods. (done)
# @api public
# @param [String] string to set external encoding (re-label) to utf-8
# @return [String] a copy of string with external encoding set to utf-8
def override_encoding_to_utf_8(string)
Per commit comment, ultimately kept this method (albeit in non-destructive form) because we need to only override if the bytes are invalid when running puppet resource user in non-utf8 form. Otherwise, we then issue replacement characters in our now-invalid-utf8 string, which causes incorrect puppet resource user output.
if String.method_defined?(:scrub)
string.scrub
else
string.chars.map { |c| c.valid_encoding? ? c : "?" }.join
scrub isn't present in 1.9.3, and various iterations on encode with replacement characters failed to work. Supplying utf-8 for the first two args to encode is a no-op, and supplying binary as the second arg causes any non-ascii character to be replaced regardless of the original encoding or validity of the string...
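A self-contained sketch of that fallback (this mirrors the approach described in the diff, not the exact Puppet method; the method name here is illustrative):

```ruby
# Use String#scrub where available (Ruby >= 2.1); otherwise replace each
# invalid character with '?' by hand, since encode's :replace option can't
# be made to do this reliably on 1.9.3 as described above.
def scrub_to_question_marks(string)
  if string.respond_to?(:scrub)
    string.scrub('?')
  else
    string.chars.map { |c| c.valid_encoding? ? c : '?' }.join
  end
end

# "bar " followed by three bytes that are never valid UTF-8
invalid = [98, 97, 114, 32, 255, 254, 253].pack('C*').force_encoding(Encoding::UTF_8)
scrub_to_question_marks(invalid) # => "bar ???"
```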
@Iristyle this is an implementation using extended structs with canonical values. Ultimately not sure it's worth the extra logic, but it seems cleaner and more future-proof in case there are other places in the codebase that need what we have done in nameservice (still need to audit). As-is though I think this now works in all the cases we were testing together previously.
Force-pushed from 0d109a4 to 131ad8b
# defines a new Class object, we memoize to avoid superfluous extra Class
# instantiations.
def puppet_etc_passwd_class
@password_class ||= Struct.new(*Etc::Passwd.members, *Etc::Passwd.members.map { |member| "canonical_#{member}".to_sym })
The member names will be strings - the values can vary
Well, strictly speaking, all the member names are returned here as an array of symbols, but the interpolation implicitly calls to_s on the symbol
Ah sorry, what I meant was... aren't canonical_ fields only useful for string values?
For instance, canonical_expire, canonical_gid, canonical_uid aren't really useful, are they?
Nope, not particularly useful. To avoid that, we could hard code the list of available attributes here? I thought about it but decided to go with this because there's a couple things that make that more annoying:
- the list varies based on the struct class (and knowing ruby could change without warning) so we'd maintain different lists based on whether we're doing a user or group struct
- the list actually varies based on the flags passed to ruby during compilation. This means we'd need to conditionally check for the presence of the members that we expect to exist and have string values and only duplicate those. I feel like there's a tradeoff there in maintainability vs the completeness of just duplicating all the members
- A sidenote that doesn't necessarily impact this discussion but is worth noting: the Etc struct members aren't actually typed - any field could just as easily return a string as an integer. So if for some reason the underlying system changes the return value for any of these we're going to be in trouble. We could guard by only trying to encode things that are strings rather than expecting specific fields to be, but again that felt like going too far. Happy to entertain alternatives there though.
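For reference, the extended-struct construction under discussion looks like this in isolation (illustrative; the memoized methods in etc.rb wrap the same idea, and assume a platform where Etc::Passwd is defined):

```ruby
require 'etc'

# Struct with every Etc::Passwd member plus a canonical_<member> twin that
# preserves the value exactly as Etc returned it.
PuppetEtcPasswd = Struct.new(
  *Etc::Passwd.members,
  *Etc::Passwd.members.map { |m| :"canonical_#{m}" }
)

PuppetEtcPasswd.members.include?(:name)           # => true
PuppetEtcPasswd.members.include?(:canonical_name) # => true
```

Because every member gets a twin, non-string fields like uid and gid also receive (not particularly useful) canonical_ counterparts, which is the tradeoff discussed above.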
# defines a new Class object, we memoize to avoid superfluous extra Class
# instantiations.
def puppet_etc_group_class
@group_class ||= Struct.new(*Etc::Group.members, *Etc::Group.members.map { |member| "canonical_#{member}".to_sym })
Force-pushed from dfdfbc5 to 43dd9f8
Cleaned up whitespace. Ran a local ci:test:git against ubuntu 1604; it appeared to only fail in expected ways - at least none apparently related to this PR.
Should be good to merge 🤞
FYI - I think this should be targeted at stable... I've rebased and I'm testing that locally.
Force-pushed from 43dd9f8 to 38ac008
@Iristyle rebased against stable
lib/puppet/application/resource.rb
Outdated
join("\n")
text = resources.map do |resource|
r = resource.prune_parameters(:parameters_to_include => @extra_params).to_hierayaml
r.encoding == Encoding.default_external ? r : r.force_encoding(Encoding.default_external)
Can this just be simplified to r.force_encoding(Encoding.default_external) - i.e. is the check roughly as costly as performing the effectual no-op?
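The check is indeed roughly as costly as the no-op: force_encoding returns the receiver itself without copying bytes, so relabeling a string to the encoding it already has is effectively free:

```ruby
# force_encoding only rewrites the string's encoding label and returns self;
# no byte copying occurs, so an unconditional call is a cheap no-op when the
# encoding already matches.
s = "héllo".dup # dup because force_encoding raises on frozen strings
s.equal?(s.force_encoding(Encoding::UTF_8)) # => true: same object back
```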
Force-pushed from 38ac008 to e2fa408
Force-pushed from cd02674 to 6871ecf
@@ -86,10 +100,7 @@ def convert_to_utf_8!(string)
# @param [String] string a string to test
# @return [Boolean] whether we think the string is UTF-8 or not
def valid_utf_8_bytes?(string)
valid_utf_8_bytes was not in prior commit and should be torched
Force-pushed from 24333a6 to 10b342c
…version

* Prior to this commit, we would transcode to UTF-8 a string not composed of valid UTF-8 bytes but valid in its current encoding. This presents a challenge for BINARY strings, which if originating in an encoding other than UTF-8 are otherwise valid, but cannot be transcoded (because we have no idea where they came from). In order to transcode successfully, we need an originating encoding to work with. Rather than just blindly call encode! with binary strings, we can take one more guess as to their origin - which is that they originated in `Encoding.default_external`. I.e., if a system is running in Encoding::Windows_31J, and a string is returned as BINARY, we can guess that the string might have been Encoding::Windows_31J. This perhaps substantially increases our odds that we'll be able to convert the (binary) string to UTF-8 successfully.
* Prior to this commit, we were modifying the supplied string in Puppet::Util::CharacterEncoding as a side-effect of the recent change that added `force_encoding(Encoding.default_external)` on binary strings. In the case where we fail to encode!, that change leaves the original string with a different external encoding than it started with. This commit addresses that by ensuring that on any failure to convert we set the string back to its original external encoding.
* Prior to this commit, Puppet::Util::CharacterEncoding::convert_to_utf_8! was inaccurately named in that the method would both convert (transcode) and override (set external encoding) on a string in different circumstances. After review, it was determined that this behavior was potentially dangerous and could lead to inadvertent destruction or loss of string fidelity. This commit updates the method to split out the two different operations, each with an improved description of when to use it. Conversion should be our primary use-case, for when we believe that the encoding of a string is an accurate representation of that string. Overriding a string's encoding should be used sparingly, effectively only when we have reason to believe that the existing encoding is inaccurate/incorrect.
* This commit adjusts the Puppet::Etc wrapper to account for the modified heuristics in Puppet::Util::CharacterEncoding. We leverage the new distinction between converting and overriding by only "overriding" the encoding to UTF-8 when the string is returned in non-UTF-8 encoding. This is very explicitly us stating that we do not believe Ruby when we receive a non-UTF-8 value via Etc. The reason is we know that the value we wrote/managed is UTF-8 because it came in via Puppet as UTF-8. If the value has changed in between puppet runs, then Puppet will correctly address the drift by resetting the value to its correct UTF-8 value.
* In addition to updating the Etc wrapper to account for the preceding change to split out two methods in CharacterEncoding, prior to this commit the encoding management in our Etc wrapper incorrectly expected that ruby Etc would return BINARY encoded strings when the values on disk are invalid in the system encoding. This is incorrect - ruby Etc quite happily returns strings with invalid encoding. The only case where it appears to return binary is when the environment's encoding (Encoding.default_external) is ASCII (7-bit), and any individual character bit representation is equal to or greater than \x80 (binary 10000000), ie the first 8-bit value. We account for this change in expectation by fixing our spec tests to expect and test for the appropriate scenarios.
* This commit also adds a small spec helper for running a block in a different external encoding.

Signed-off-by: Moses Mendoza <moses@puppet.com>
* `puppet resource` can correctly obtain resources in different encodings
* Attempting to join strings with incompatible encodings such as LATIN-1 and UTF-8 will fail
* After converting the resource to its console format, ensure it is in the correct external encoding before printing
* We do force_encoding to print the resource in its original (byte) state

Signed-off-by: Moses Mendoza <moses@puppet.com>
Add our own rendition of ruby's String#scrub. We defer to the original if it's defined, which is only post-ruby 1.9.3. Ruby 1.9.3 does not have a scrub method. Nor does using String#encode(<encoding>, <encoding>, :replace => '?'...) work:

* Given a UTF-8 string, when Encoding::UTF-8 is passed as the first/second arguments, no replacement characters are inserted (ie, the call is a no-op).
* When Encoding::BINARY is passed as the second argument, any non-ASCII characters are _always_ replaced - even when the UTF-8 string was valid!

So instead we effectively make our own scrub, and only call it when we need it.

Signed-off-by: Moses Mendoza <moses@puppet.com>
* When running in LATIN-1, we expect `puppet resource` to be able to manage UTF-8 resources alongside and without interfering with existing non-UTF-8 resources.
* Previously, we would query the system for user names, and modify those usernames (override them to UTF-8 w/replacement chars) without keeping track of the original values returned by Etc. However, we would try to use these modified names to query the system again later - for example, the resource name is used to check if a user exists. This would incorrectly report an existing user absent, because we were looking for a modified username.
* Now we track the original values returned by Etc inside new extended structs. These private interfaces are otherwise the same as their Etc counterparts, except they have canonical_ members for each original member.
* We hide their definition inside of methods in etc.rb to avoid possibly referencing Etc::Passwd and Etc::Group where they are not available just by loading the code. We memoize to avoid creating multiple ephemeral class definitions, since Struct.new creates a new Class object on each call. The new canonical_ members in the new structs track original state, while the original Etc members track state overridden to UTF-8.
* We do this for every field value because we have the potential to modify the bytes of every field value using scrub to replace invalid byte sequences. We always scrub to avoid letting invalid strings into the puppet run lifecycle, where they wreak havoc when joined and manipulated.
* In order to make self.instances as sane as possible while still having access to the canonical_ value from the original struct, we move the logic for iterating over the Etc values to it, and deprecate listbyname. There are no other callers of listbyname in the codebase. The nameservice constructor is called with arguments of different types - in some cases a hash (as seen in this PR) and in others a Puppet resource object that was created manually. We have to check that we have a hash before accessing the :canonical_name field because :canonical_name is not a property on the user resource, and the `:[]` method is overridden to validate properties before returning values.
* We revise all of the character_encoding and Puppet::Etc methods to be non-destructive. The logic was complicated and annoying with mutated state, so we stop. Any minor perf hit from string.dup is worth this sanity.
* Useradd shadow password support is one place we query Etc (again) for field values with the resource name possibly obtained by Etc the first time. Update to use @canonical_name, which should always be set because this class inherits from the base Nameservice provider.
* Make override_encoding_to_utf_8 non-destructive. It was simply too unintuitive, risky, and annoying to mutate this state. Take the minor perf hit of duplicating the string for our sanity's sake.

Signed-off-by: Moses Mendoza <moses@puppet.com>
Force-pushed from 10b342c to c6df9f9
Merged to stable in 365d6a9
This PR splits out adjustments to the character encoding heuristics that were previously a part of #5509 and contains changes based on feedback on that PR.
The three primary changes are:
Binary string handling
Prior to this PR, we would transcode to UTF-8 a string not composed of valid UTF-8 bytes but valid in its current encoding. This presents a challenge for BINARY strings, which if originating in an encoding other than UTF-8 are otherwise valid, but cannot be transcoded (because we have no idea where they came from). In order to transcode successfully, we need an originating encoding to work with. Rather than just blindly call encode! with binary strings, we can take one more guess as to their origin - which is that they originated in `Encoding.default_external`. I.e., if a system is running in Encoding::Windows_31J, and a string is returned as BINARY, we can guess that the string might have been Encoding::Windows_31J. This perhaps substantially increases our odds that we'll be able to convert the (binary) string to UTF-8 successfully.
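That guess can be sketched like this (the method name is illustrative, not the actual Puppet implementation):

```ruby
# For a BINARY string, guess that it originated in Encoding.default_external;
# only transcode if the guessed label makes the string valid.
def convert_binary_to_utf_8(string)
  guessed = string.dup.force_encoding(Encoding.default_external)
  return nil unless guessed.valid_encoding?
  guessed.encode(Encoding::UTF_8)
rescue EncodingError
  nil
end

# e.g. with Encoding.default_external == Encoding::Windows_31J,
# [130, 187].pack('C*') (the bytes for "そ" in Windows-31J, arriving as
# BINARY) transcodes successfully to UTF-8.
```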
Consistently return nil
Prior to this PR, we were modifying the supplied string in Puppet::Util::CharacterEncoding as a side-effect of the recent change that added `force_encoding(Encoding.default_external)` on binary strings. In the case where we fail to encode!, that change leaves the original string with a different external encoding than it started with. This PR addresses that by ensuring that on any failure to convert we set the string back to its original external encoding. A quirk was accounted for in that we don't set the external encoding if the default external encoding is UTF-8. The reason is we've already determined that this string doesn't represent valid UTF-8 bytes, so doing a force_encoding to UTF-8 doesn't get us any closer to being able to transcode it to UTF-8, and ruby won't do anything on encode! if the string already has as its encoding the encoding supplied to encode!.

Separate out conversion from overriding of string encoding
Prior to this PR, Puppet::Util::CharacterEncoding::convert_to_utf_8! was inaccurately named in that the method would both convert (transcode) and override (set external encoding) on a string in different circumstances. After review, it was determined that this behavior was potentially dangerous and could lead to inadvertent destruction or loss of string fidelity. This PR updates the method to split out the two different operations, each with an improved description of when to use it. Conversion should be our primary use-case, for when we believe that the encoding of a string is an accurate representation of that string. Overriding a string's encoding should be used sparingly, effectively only when we have reason to believe that the existing encoding is inaccurate/incorrect.
This PR adjusts the Puppet::Etc wrapper to account for the modified heuristics in Puppet::Util::CharacterEncoding. We leverage the new distinction between converting and overriding by only "overriding" the encoding to UTF-8 when the string is returned in BINARY and otherwise defaulting to converting it. This is a shift in default assumptions - whereas before we would assume the string encoding was not valid for that environment, now we assume it is. We retain the preceding requirement that strings that cannot be converted are left unmodified in the Etc struct.
In addition to updating the Etc wrapper to account for the preceding change to split out two methods in CharacterEncoding, prior to this PR the encoding management in our Etc wrapper incorrectly expected that ruby Etc would return BINARY encoded strings when the values on disk are invalid in the system encoding. This is incorrect - ruby Etc quite happily returns strings with invalid encoding. The only case where it appears to return binary is when the environment's encoding (Encoding.default_external) is ASCII (7-bit), and any individual character bit representation is equal to or greater than \x80 (binary 10000000), ie the first 8-bit value. We account for this change in expectation by fixing our spec tests to expect and test for the appropriate scenarios.
Finally, we also add a small spec helper for running a block in a different external encoding.