
8251989: Hex formatting and parsing utility #482


Closed
wants to merge 33 commits into from

Contributor

@RogerRiggs RogerRiggs commented Oct 2, 2020

java.util.HexFormat utility:

  • Format and parse hexadecimal strings, with parameters for delimiter, prefix, suffix and upper/lowercase
  • Static factories and builder methods to create HexFormat copies with modified parameters.
  • Consistent naming of methods for conversion of byte arrays to formatted strings and back: formatHex and parseHex
  • Consistent naming of methods for conversion of primitive types: toHexDigits... and fromHexDigits...
  • Prefixes and suffixes now apply to each formatted value, not the string as a whole
  • Using java.lang.Appendable as a target for buffered conversions, so output to Writers and PrintStreams
    like System.out is supported in addition to StringBuilder. (IOExceptions are converted to unchecked exceptions)
  • Immutable and thread safe, a "value-based" class

See the HexFormat javadoc for details.
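As a rough illustration of the points above, here is a minimal sketch that assumes the API as it later shipped in `java.util.HexFormat` (JDK 17); the class and method names `HexFormatOverview`, `format`, `parse` and `formatUpperInto` are hypothetical and not part of this change.

```java
import java.util.HexFormat;

// Hypothetical demo class; assumes java.util.HexFormat as shipped in JDK 17.
final class HexFormatOverview {
    // Delimiter between values plus a prefix applied to each value.
    static final HexFormat PREFIXED = HexFormat.ofDelimiter(", ").withPrefix("0x");

    // Byte array to formatted string, including prefix and delimiter.
    static String format(byte[] bytes) {
        return PREFIXED.formatHex(bytes);
    }

    // Formatted string back to a byte array, using the same parameters.
    static byte[] parse(String text) {
        return PREFIXED.parseHex(text);
    }

    // Appends uppercase hex digits to an existing Appendable (here a StringBuilder).
    static String formatUpperInto(StringBuilder out, byte[] bytes) {
        return HexFormat.of().withUpperCase().formatHex(out, bytes).toString();
    }

    public static void main(String[] args) {
        byte[] data = {1, 127, -1};
        String text = format(data);
        System.out.println(text);
        System.out.println(parse(text).length);
        System.out.println(formatUpperInto(new StringBuilder("id="), data));
    }
}
```

Each `with...` call returns a new `HexFormat`, so `PREFIXED` can be shared between threads without synchronization.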

Review comments and suggestions welcome.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

Reviewers

Download

$ git fetch https://git.openjdk.java.net/jdk pull/482/head:pull/482
$ git checkout pull/482

@RogerRiggs
Contributor Author

/csr

@bridgekeeper

bridgekeeper bot commented Oct 2, 2020

👋 Welcome back rriggs! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added rfr Pull request is ready for review csr Pull request needs approved CSR before integration labels Oct 2, 2020
@openjdk

openjdk bot commented Oct 2, 2020

@RogerRiggs this pull request will not be integrated until the CSR request JDK-8251991 for issue JDK-8251989 has been approved.

@openjdk

openjdk bot commented Oct 2, 2020

@RogerRiggs The following labels will be automatically applied to this pull request:

  • core-libs
  • i18n
  • net
  • nio
  • security

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added security security-dev@openjdk.org nio nio-dev@openjdk.org core-libs core-libs-dev@openjdk.org net net-dev@openjdk.org i18n i18n-dev@openjdk.org labels Oct 2, 2020
@mlbridge

mlbridge bot commented Oct 2, 2020

@RogerRiggs
Contributor Author

To avoid spamming email lists, I'm removing the i18n, net, nio, and security labels since most developers are already subscribed to core-libs.
Please re-add a label if you think it is useful to the email list.

/label remove i18n, net, nio, security

@openjdk openjdk bot removed i18n i18n-dev@openjdk.org net net-dev@openjdk.org nio nio-dev@openjdk.org security security-dev@openjdk.org labels Oct 2, 2020
@openjdk

openjdk bot commented Oct 2, 2020

@RogerRiggs
The i18n label was successfully removed.

The net label was successfully removed.

The nio label was successfully removed.

The security label was successfully removed.

Contributor

@wangweij wangweij left a comment

The class name already has Hex inside, and the method names still contain it. Also, I found it unexpected that fromHexDigits returns an integer. Can we use encode, decode, decodeInt, decodeLong etc?

@RogerRiggs
Contributor Author

The class name already has Hex inside, and the method names still contain it. Also, I found it unexpected that fromHexDigits returns an integer. Can we use encode, decode, decodeInt, decodeLong etc?

Including 'Hex' in the method names reinforces the function; without it, more context is required to make the code readable.

The decoding of hexadecimal digits to a binary number is at its core untyped and unsigned for all three fromHexDigits methods.
A signed type such as byte does implicit sign extension and in many use cases it would need to be masked
before the value is used. An explicit cast in the source to a smaller primitive makes the use explicit.

The formatHex and parseHex methods operating on byte arrays are quite different from toHexDigits and fromHexDigits.
When operating on byte arrays, the prefix, suffix, and delimiter are included in the conversion.
The toHexDigits and fromHexDigits methods do not use the prefix, suffix, or delimiter; they just convert the digits.
The different naming helps create a clear distinction. Also, it was mentioned in comments on the first version of the API that encode and decode can be ambiguous as to which direction the conversion is going.
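The distinction can be sketched as follows, assuming the API as it later shipped in JDK 17 (where `fromHexDigits` is a static method). The class `DigitsVersusFormat` and its method names are hypothetical, chosen only for this illustration.

```java
import java.util.HexFormat;

// Hypothetical demo class; assumes java.util.HexFormat as shipped in JDK 17.
final class DigitsVersusFormat {
    static final HexFormat FMT = HexFormat.ofDelimiter(":").withPrefix("#");

    // Array conversion: prefix and delimiter are part of the output.
    static String viaFormatHex(byte[] bytes) {
        return FMT.formatHex(bytes);
    }

    // Primitive conversion: only the digits, even on the same HexFormat.
    static String viaToHexDigits(byte value) {
        return FMT.toHexDigits(value);
    }

    // fromHexDigits yields an int; for two digits the value is 0..255.
    static int unsignedValue(String twoDigits) {
        return HexFormat.fromHexDigits(twoDigits);
    }

    // The narrowing cast is explicit in the source; 0xff becomes -1.
    static byte asByte(String twoDigits) {
        return (byte) HexFormat.fromHexDigits(twoDigits);
    }

    public static void main(String[] args) {
        System.out.println(viaFormatHex(new byte[]{10, -1}));
        System.out.println(viaToHexDigits((byte) -1));
        System.out.println(unsignedValue("ff") + " " + asByte("ff"));
    }
}
```

Going the other way, a `byte` that is widened to `int` is sign-extended, so callers typically mask it with `& 0xff` before using it as an unsigned value.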

@Marcono1234 Marcono1234 left a comment

This is a really useful addition! Utility classes for formatting bytes to hex have been re-invented so often already that having this in the JDK would be really great.

Now that the OpenJDK project migrated to GitHub I took the opportunity and wrote this GitHub pull request review (though I am not an OpenJDK contributor). Most things are formatting or documentation related. I hope these comments are useful and are not too intrusive. Please let me know otherwise.

Also, is it common practice to use System.out in JDK tests? In my opinion it often does not add much value once the unit test implementation has been completed, because the output is not checked automatically during tests anyway and might only clutter the console output.

The tests are using String.toUpperCase, toLowerCase and format without a Locale (therefore using the default one). Does the test setup guarantee a constant default locale, or would it be better to include a Locale to make sure the tests don't break for any unusual locale?

Has it also been considered to add support for parsing from a Reader to an OutputStream and from InputStream to Appendable to support arbitrary length input and output?

Maybe it would also be good to mention for the methods parsing and formatting int, long, ... in which byte order the output is created.

And would it be worth supporting a delimiter period or frequency to only apply the delimiter every nth byte? This would be useful when the user wants to write hex chars in groups of 4 or 8.

@RogerRiggs
Contributor Author

Also, is it common practice to use System.out in JDK tests? In my opinion it often does not add much value once the unit test implementation has been completed, because the output is not checked automatically during tests anyway and might only clutter the console output.

It has been useful to see the output generated, and if there are failures there is additional information available.

The tests are using String.toUpperCase, toLowerCase and format without a Locale (therefore using the default one). Does the test setup guarantee a constant default locale, or would it be better to include a Locale to make sure the tests don't break for any unusual locale?

Good point, the ROOT locale should be used. For the hexadecimal characters, though, case conversion is consistent across all locales.
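A small sketch of the locale concern, using only `java.lang.String` and `java.util.Locale`. The class `LocaleSafeCase` and its methods are hypothetical names for this illustration, and Turkish is used only as a well-known locale with special case mappings for the letter i.

```java
import java.util.Locale;

// Hypothetical demo class illustrating locale-sensitive case conversion.
final class LocaleSafeCase {
    // Locale-independent conversion, as suggested for the tests.
    static String upperRoot(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    // Turkish rules map 'i' to dotted capital I (U+0130).
    static String upperTurkish(String s) {
        return s.toUpperCase(Locale.forLanguageTag("tr"));
    }

    public static void main(String[] args) {
        // The letter i differs between the two locales.
        System.out.println(upperRoot("i").equals(upperTurkish("i")));
        // Hex digits contain no such letter, so both conversions agree.
        System.out.println(upperRoot("0123456789abcdef").equals(upperTurkish("0123456789abcdef")));
    }
}
```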

Has it also been considered to add support for parsing from a Reader to an OutputStream and from InputStream to Appendable to support arbitrary length input and output?

That can be considered separately; the Jira issue 8254708 will track that enhancement.
A typical application reading input may need to handle a mix of input constructs with Hex being just one.
An application usually needs some kind of look ahead or push back to adjust to the contents of the stream
and that brings in a more complex grammar.

Maybe it would also be good to mention for the methods parsing and formatting int, long, ... in which byte order the output is created.

The big-/little-endian terminology usually applies to binary representations. The natural reading order for numbers is left to right, but the documentation can be more explicit.
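To make the reading order concrete, here is a minimal sketch assuming the JDK 17 `HexFormat` API; `DigitOrder` and its method names are hypothetical. It compares the digits produced for an `int` with the hex of the same value laid out as big-endian bytes (the default order of `ByteBuffer`).

```java
import java.nio.ByteBuffer;
import java.util.HexFormat;

// Hypothetical demo class; assumes java.util.HexFormat as shipped in JDK 17.
final class DigitOrder {
    // Eight digits for an int, most significant digit first, zero padded.
    static String digitsOf(int value) {
        return HexFormat.of().toHexDigits(value);
    }

    // The same value written as four big-endian bytes, then formatted.
    static String bigEndianBytesOf(int value) {
        byte[] bytes = ByteBuffer.allocate(Integer.BYTES).putInt(value).array();
        return HexFormat.of().formatHex(bytes);
    }

    public static void main(String[] args) {
        System.out.println(digitsOf(0x12345678));
        System.out.println(bigEndianBytesOf(0x12345678));
        System.out.println(digitsOf(0x1234));
    }
}
```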

And would it be worth supporting a delimiter period or frequency to only apply the delimiter every nth byte? This would be useful when the user wants to write hex chars in groups of 4 or 8.

It is a tradeoff in complexity. The API focuses on the conversion of byte arrays to strings and back.
That behavior can be achieved naturally using the toHexDigits methods, with the application handling the insertion of delimiters.
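One way an application might do this, as a sketch only and assuming the JDK 17 `HexFormat` API: the helper `formatGrouped` and the class `GroupedHex` are hypothetical and not part of the proposed API.

```java
import java.util.HexFormat;

// Hypothetical helper; assumes java.util.HexFormat as shipped in JDK 17.
final class GroupedHex {
    // Formats bytes as hex digits, inserting the delimiter after every
    // groupSize bytes (but not before the first or after the last group).
    static String formatGrouped(byte[] bytes, int groupSize, String delimiter) {
        if (groupSize <= 0) {
            throw new IllegalArgumentException("groupSize must be positive");
        }
        HexFormat hex = HexFormat.of();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0 && i % groupSize == 0) {
                sb.append(delimiter);
            }
            // toHexDigits ignores any prefix, suffix and delimiter.
            sb.append(hex.toHexDigits(bytes[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xde, (byte) 0xad, (byte) 0xbe, (byte) 0xef, 1};
        System.out.println(formatGrouped(data, 2, " "));
    }
}
```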

…to match Character, remove unnecessary Class name qualifications, etc.
- misc javadoc markup fixes.
- added checking of byte array sizes to generate useful exceptions if the arrays would be too large.
- Small implementation cleanups
@RogerRiggs
Contributor Author

@dfuch I'll add a clarification.
The fromHexDigits methods explicitly parse using the fromHexDigit method that is specified to include both upper and lower case.

I was mostly concerned with the public methods that follow
(public int fromHexDigits(CharSequence string) and friends).
They all say "The delimiter, prefix and suffix are not used." but they do not say anything about upperCase/lowerCase. Since this is normative specification, doesn't it need to be fixed?

…rsions

Switched order of declaration of a couple of methods to make the javadoc sequence easier to read
* @throws IllegalArgumentException if the string length is greater than eight (8) or
* if any of the characters is not a hexadecimal character
*/
public int fromHexDigits(CharSequence string) {
Member

I recommend this group of methods include an apiNote explaining the differences in behavior compared to parseInt(s, 16) and parseUnsignedInt(s, 16).

Contributor Author

Will add:

     * @apiNote
     * {@link Integer#parseInt(String, int) Integer.parseInt(s, 16)} and
     * {@link Integer#parseUnsignedInt(String, int) Integer.parseUnsignedInt(s, 16)}
     * are similar but allow all Unicode hexadecimal digits allowed by
     * {@link Character#digit(char, int) Character.digit(ch, 16)}.
     * {@code HexFormat} uses only Latin1 hexadecimal characters "0-9", "A-F", and "a-f".
     * {@link Integer#parseInt(String, int)} can parse signed hexadecimal strings.

And similar text for Long#parseLong and Long.parseUnsignedLong

Member

Okay; however, I suggest saying more on the signed/unsigned behavior.
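The signed/unsigned differences under discussion can be illustrated with a short sketch. It assumes the API as it later shipped in JDK 17, where `HexFormat.fromHexDigits` is static; the class `ParseComparison` and its helper methods are hypothetical.

```java
import java.util.HexFormat;

// Hypothetical demo class; assumes java.util.HexFormat as shipped in JDK 17.
final class ParseComparison {
    // True if Integer.parseInt(s, 16) rejects the input.
    static boolean rejectedByParseInt(String s) {
        try {
            Integer.parseInt(s, 16);
            return false;
        } catch (NumberFormatException e) {
            return true;
        }
    }

    // True if HexFormat.fromHexDigits(s) rejects the input.
    static boolean rejectedByHexFormat(String s) {
        try {
            HexFormat.fromHexDigits(s);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Eight 'f' digits: all 32 bits set, which is -1 as an int.
        System.out.println(HexFormat.fromHexDigits("ffffffff"));
        System.out.println(Integer.parseUnsignedInt("ffffffff", 16));
        // parseInt treats the digits as a signed magnitude and overflows.
        System.out.println(rejectedByParseInt("ffffffff"));
        // A sign character is accepted by parseInt but not by HexFormat.
        System.out.println(Integer.parseInt("-ff", 16));
        System.out.println(rejectedByHexFormat("-ff"));
        // Fullwidth digits (U+FF11, U+FF12) are accepted only by parseInt.
        System.out.println(Integer.parseInt("\uFF11\uFF12", 16));
        System.out.println(rejectedByHexFormat("\uFF11\uFF12"));
    }
}
```

In this sketch `fromHexDigits` behaves like `parseUnsignedInt` for plain ASCII digits, while rejecting signs and non-Latin1 digits.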

@openjdk openjdk bot removed the csr Pull request needs approved CSR before integration label Dec 4, 2020
@openjdk

openjdk bot commented Dec 4, 2020

@RogerRiggs This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8251989: Hex formatting and parsing utility

Reviewed-by: tvaleev, chegar, naoto, darcy

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 4 new commits pushed to the master branch:

  • da2415f: 8257457: Update --release 16 symbol information for JDK 16 build 28
  • 36e2097: 8255917: runtime/cds/SharedBaseAddress.java failed "assert(reserved_rgn != 0LL) failed: No reserved region"
  • d53ee62: 8255899: Allow uninstallation of jpackage exe bundles
  • 65756ab: 8257802: LogCompilation throws couldn't find bytecode on JDK 8 log

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Dec 4, 2020
@RogerRiggs
Contributor Author

/integrate

@openjdk openjdk bot closed this Dec 16, 2020
@openjdk openjdk bot added integrated Pull request has been integrated and removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Dec 16, 2020
@openjdk

openjdk bot commented Dec 16, 2020

@RogerRiggs Since your change was applied there have been 22 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

Pushed as commit aa9c136.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@RogerRiggs RogerRiggs deleted the 8251989-hex-formatter branch January 9, 2021 20:42
Labels
core-libs core-libs-dev@openjdk.org integrated Pull request has been integrated

8 participants