Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

computeMD5 with special characters has 2 different outputs #1527

Closed
fabiencrassat opened this issue Nov 3, 2020 · 11 comments
Closed

computeMD5 with special characters has 2 different outputs #1527

fabiencrassat opened this issue Nov 3, 2020 · 11 comments

Comments

@fabiencrassat
Copy link

fabiencrassat commented Nov 3, 2020

Environment

Liquibase Version: master branch and last tag is v4.1.1

Liquibase Integration & Version: maven 3.6.3

Liquibase Extension(s) & Version: none

Database Vendor & Version: none

Operating System Type & Version: Windows 10 & Ubuntu 18.04

Description

I discovered that the md5 checksum can be different between Windows 10 and Ubuntu 18.04 when we deal with special characters.

I discovered this issue when my team changed their Jenkins pipeline from Windows OS to Linux OS. The generateSQL process was finishing with md5 checksum error. After investigation I saw it was due to accent characters.

Steps To Reproduce

To be very efficient (I hope), I forked your repository and added unit tests to reproduce easily the issue.

What to do:

  1. Copy paste this unit tests file: https://github.com/fabiencrassat/liquibase/blob/md5encoding/liquibase-core/src/test/java/liquibase/util/MD5UtilTest.java
  2. Run mvn -Dtest='liquibase.util.MD5UtilTest' test into liquibase-core folder.

Do it on Windows OS and Linux OS and you will see the output (commented in MD5UtilTest.java file too).

Actual Behavior

A string with special characters has two different computeMD5 outputs, depending of the stream encoding and the OS.

Expected/Desired Behavior

A string with special characters should have the same computeMD5 outputs, without concern of stream encoding or the OS.

Additional Context

I can not imagine the impacts of a change like that. That's why I prefer to show you with unit tests. And don't hesitate to explain these impacts here, I will be very happy to understand better!

Sorry for my english, I'm a French old developers! And take care of you all!!

@molivasdat
Copy link
Contributor

Thanks @fabiencrassat for bringing this issue to our attention. And thanks for making the recreation scenario. It definitely helps. We will add this to our list of issues.

@fabiencrassat
Copy link
Author

Good! Thank you @molivasdat
Hope it will be easy. If not, don't hesitate to contact me :)

And I'm very curious to understand what will be your analysis.

Regards,

@molivasdat
Copy link
Contributor

Hi @fabiencrassat Which version of java were you using for the tests?

@fabiencrassat
Copy link
Author

Hi,

On windows it is:

openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)

For Linux I can do it only next week :(

Regards,

@fabiencrassat
Copy link
Author

Hello @molivasdat,

Hope you are being well!

I have your answer for linux, I'm using a docker image with IBM jdk-8.251.

Hope it will help you :)

Regards,

@molivasdat
Copy link
Contributor

@fabiencrassat Thanks for getting back to us. Looks like both windows and Linux are using the same major version 1.8

@fabiencrassat
Copy link
Author

@fabiencrassat Thanks for getting back to us. Looks like both windows and Linux are using the same major version 1.8

Yes ;)

@nvoxland
Copy link
Contributor

The goal of the MD5Util class is to be a low-level "just encode the bytes" given helper function. Any "smoothing" of cross-os differences like line ending standardization etc. should be handled upstream, such as in liquibase.change.CheckSum#compute()

My guess is that there are byte-order differences or encoding differences between the different platforms that makes java generate different bytes for the same input strings and that's why we get the difference.

From a library-api standpoint, perhaps it would help to remove the MD5Util.computeMD5(string) function completely? If we just have the InputStream version, it would be more obvious that the function is just computing a value based on bytes and it is up to the caller to be careful of what those bytes are. In the javadoc, we can direct them to either use a ByteArrayInputStream for reading from a string or use the CheckSum.compute() method does smoothing of strings. Internal testing shows that the CheckSum.compute() method is giving the same values across different OSs

For the tests you are adding, @fabiencrassat, do they give the same values across platforms if you use the InputStream version of the method and specify the encoding? Also, are the tests there in an attempt to isolate a problem you are seeing in general liquibase usage? Or were you exploring how the method works out of curiosity?

@fabiencrassat
Copy link
Author

Hello and sorry for the delay,

First, maybe I did not present well the description of my ticket because yes this is a problem that my team encountered when changing the operating system of our Jenkins (Windows to Linux). Since then, Liquibase's generateSQL has been in error because the MD5s of the files compared with the database are different. We therefore are doing our builds locally on our Windows computers.

And yes it's also curiosity to better understand your software because I'm always interested in presenting problems with reproducibility. But because I don't know your architecture maybe I added some tests in the wrong place and I'm sorry.

@nvoxland, coming back to the CheckSum.compute() method, it does not take special characters (accents, ...) in these tests. So I added them for the method here: https://github.com/fabiencrassat/liquibase/blob/md5encoding/liquibase-core/src/test/java/liquibase/change/CheckSumTest.java#L104
The results show that CheckSum.compute() responds differently depending on the operating system :(
So would it be somewhere else?

Please feel free to continue to share your thoughts here, I always like to understand the why and how of things.

Fabien

P.S .: I hope my English is good enough!

@molivasdat molivasdat added this to To Do in Conditioning++ via automation Jul 7, 2021
@nvoxland
Copy link
Contributor

nvoxland commented Aug 6, 2021

Yes, your english is great @fabiencrassat . Far better than my french :)

It makes sense that we'd get different checksums with different encodings of strings, since that could be giving us different bytes that we checksum.

The code upstream from the CheckSum should be in charge of ensuring a consistent charset for the stream. We've been making improvements to that since 4.1.1, can you see if you are still seeing the problem with the newest release?

If you are, can you tell me how the problem changeset(s) are set up? What type of changes are in there? The problem will more likely actually lie more with sqlFile or the createView or whatever code.

@nvoxland nvoxland moved this from To Do to In discussion in Conditioning++ Aug 6, 2021
@nvoxland
Copy link
Contributor

nvoxland commented Oct 7, 2021

Closing due to lack of follow-up

@nvoxland nvoxland closed this as completed Oct 7, 2021
Conditioning++ automation moved this from In discussion to Done Oct 7, 2021
@nvoxland nvoxland removed this from Done in Conditioning++ Oct 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants