computeMD5 with special characters has 2 different outputs #1527

fabiencrassat · 2020-11-03T22:36:33Z

Environment

Liquibase Version: master branch and last tag is v4.1.1

Liquibase Integration & Version: maven 3.6.3

Liquibase Extension(s) & Version: none

Database Vendor & Version: none

Operating System Type & Version: Windows 10 & Ubuntu 18.04

Description

I discovered that the md5 checksum can be different between Windows 10 and Ubuntu 18.04 when we deal with special characters.

I discovered this issue when my team moved their Jenkins pipeline from Windows to Linux. The generateSQL process failed with an MD5 checksum error. After investigation I saw it was due to accented characters.

Steps To Reproduce

To be as efficient as possible (I hope), I forked your repository and added unit tests that reproduce the issue easily.

What to do:

Copy this unit test file: https://github.com/fabiencrassat/liquibase/blob/md5encoding/liquibase-core/src/test/java/liquibase/util/MD5UtilTest.java
Run mvn -Dtest='liquibase.util.MD5UtilTest' test in the liquibase-core folder.

Do this on both Windows and Linux and you will see the outputs (they are also recorded as comments in the MD5UtilTest.java file).

Actual Behavior

A string with special characters has two different computeMD5 outputs, depending on the stream encoding and the OS.

Expected/Desired Behavior

A string with special characters should have the same computeMD5 output, regardless of the stream encoding or the OS.
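The behavior described above can be reproduced outside Liquibase with a small sketch. It assumes the difference comes from the charset used to turn the string into bytes. `Md5CharsetDemo` and `md5Hex` are hypothetical names, and the code uses the JDK's `MessageDigest`, not Liquibase's `MD5Util`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Md5CharsetDemo {

    // Hypothetical helper (not Liquibase's MD5Util):
    // MD5 of the bytes the string encodes to in the given charset, as lowercase hex.
    static String md5Hex(String text, Charset charset) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(charset));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    public static void main(String[] args) {
        String accented = "caf\u00e9"; // "café", written with an escape so the source file encoding does not matter
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("UTF-8:      " + md5Hex(accented, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1: " + md5Hex(accented, StandardCharsets.ISO_8859_1));
        // This line depends on the JVM's default charset, so it can differ between machines.
        System.out.println("default:    " + md5Hex(accented, Charset.defaultCharset()));
    }
}
```

The UTF-8 and ISO-8859-1 lines differ on every machine. The last line matches one or the other (or neither) depending on the JVM's default charset, which is one plausible way for a Windows build and a Linux build to disagree.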

Additional Context

I cannot foresee the impact of a change like this, which is why I prefer to show you the problem with unit tests. Don't hesitate to explain the impact here; I will be very happy to understand it better!

Sorry for my English, I'm an old French developer! And take care, all of you!


molivasdat · 2020-11-06T03:50:04Z

Thanks @fabiencrassat for bringing this issue to our attention. And thanks for making the recreation scenario. It definitely helps. We will add this to our list of issues.

fabiencrassat · 2020-11-07T18:26:09Z

Good! Thank you @molivasdat
Hope it will be easy. If not, don't hesitate to contact me :)

And I'm very curious to see what your analysis will be.

Regards,

molivasdat · 2020-11-13T19:39:31Z

Hi @fabiencrassat, which version of Java were you using for the tests?

fabiencrassat · 2020-11-13T20:09:38Z

Hi,

On Windows it is:

openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)

For Linux I can only check next week :(

Regards,

fabiencrassat · 2020-11-16T08:31:02Z

Hello @molivasdat,

Hope you are well!

I have the answer for Linux: I'm using a Docker image with IBM jdk-8.251.

Hope it will help you :)

Regards,

molivasdat · 2020-11-16T14:24:31Z

@fabiencrassat Thanks for getting back to us. Looks like both Windows and Linux are using the same major version, 1.8.

fabiencrassat · 2020-11-16T16:31:48Z

> @fabiencrassat Thanks for getting back to us. Looks like both windows and Linux are using the same major version 1.8

Yes ;)

nvoxland · 2020-11-17T18:42:33Z

The goal of the MD5Util class is to be a low-level helper that just encodes the bytes it is given. Any "smoothing" of cross-OS differences, such as line-ending standardization, should be handled upstream, for example in liquibase.change.CheckSum#compute().

My guess is that there are byte-order or encoding differences between the platforms that make Java generate different bytes for the same input strings, and that's why we get the difference.
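To make that guess concrete, here is a minimal sketch of the bytes Java produces for one accented character under several charsets. `EncodedBytesDemo` and `hexBytes` are hypothetical names used only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedBytesDemo {

    // Render the bytes a string encodes to in the given charset as space-separated hex pairs.
    static String hexBytes(String text, Charset charset) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // "é"
        System.out.println(hexBytes(eAcute, StandardCharsets.UTF_8));      // c3 a9
        System.out.println(hexBytes(eAcute, StandardCharsets.ISO_8859_1)); // e9
        System.out.println(hexBytes(eAcute, StandardCharsets.UTF_16BE));   // 00 e9
        System.out.println(hexBytes(eAcute, StandardCharsets.UTF_16LE));   // e9 00
        // Plain ASCII encodes identically in UTF-8 and ISO-8859-1,
        // which is why only strings with special characters are affected.
        System.out.println(hexBytes("abc", StandardCharsets.UTF_8));       // 61 62 63
        System.out.println(hexBytes("abc", StandardCharsets.ISO_8859_1));  // 61 62 63
    }
}
```

Byte order only matters for multi-byte code units such as UTF-16. Between UTF-8 and single-byte charsets the difference is purely one of encoding.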

From a library-API standpoint, perhaps it would help to remove the MD5Util.computeMD5(string) function completely? If we only have the InputStream version, it would be more obvious that the function just computes a value based on bytes and that it is up to the caller to be careful about what those bytes are. In the javadoc, we can direct callers to either use a ByteArrayInputStream for reading from a string or use the CheckSum.compute() method, which does smoothing of strings. Internal testing shows that the CheckSum.compute() method gives the same values across different OSs.
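As a rough illustration of that idea, the sketch below has a stream-only MD5 helper and a caller that wraps a string in a ByteArrayInputStream with an explicit charset. `StreamMd5Sketch`, `md5Hex` and `md5HexOfUtf8` are hypothetical names built on the JDK's `MessageDigest`; this is not Liquibase's actual MD5Util code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class StreamMd5Sketch {

    // Hypothetical stream-only helper: it hashes whatever bytes it is given
    // and knows nothing about strings or charsets.
    static String md5Hex(InputStream in) throws IOException {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            md.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // The caller decides how the string becomes bytes, here always UTF-8,
    // so the result does not depend on the platform default charset.
    static String md5HexOfUtf8(String text) {
        try (InputStream in = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8))) {
            return md5Hex(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(md5HexOfUtf8("abc"));      // 900150983cd24fb0d6963f7d28e17f72
        System.out.println(md5HexOfUtf8("caf\u00e9")); // same value on every OS, since UTF-8 is fixed
    }
}
```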

For the tests you are adding, @fabiencrassat, do they give the same values across platforms if you use the InputStream version of the method and specify the encoding? Also, are the tests there in an attempt to isolate a problem you are seeing in general liquibase usage? Or were you exploring how the method works out of curiosity?

fabiencrassat · 2020-11-22T16:55:18Z

Hello and sorry for the delay,

First, maybe I did not describe my ticket well: yes, this is a problem that my team encountered when changing the operating system of our Jenkins (Windows to Linux). Since then, Liquibase's generateSQL has been failing because the MD5s of the files do not match the ones in the database. We are therefore doing our builds locally on our Windows computers.

And yes, it's also curiosity to better understand your software, because I always like to present problems in a reproducible way. But since I don't know your architecture, maybe I added some tests in the wrong place, and I'm sorry if so.

@nvoxland, coming back to the CheckSum.compute() method: its existing tests do not cover special characters (accents, ...). So I added some for the method here: https://github.com/fabiencrassat/liquibase/blob/md5encoding/liquibase-core/src/test/java/liquibase/change/CheckSumTest.java#L104
The results show that CheckSum.compute() responds differently depending on the operating system :(
So would the problem be somewhere else?

Please feel free to continue to share your thoughts here, I always like to understand the why and how of things.

Fabien

P.S .: I hope my English is good enough!

nvoxland · 2021-08-06T15:40:48Z

Yes, your English is great @fabiencrassat. Far better than my French :)

It makes sense that we'd get different checksums with different encodings of strings, since that could be giving us different bytes that we checksum.

The code upstream from the CheckSum should be in charge of ensuring a consistent charset for the stream. We've been making improvements to that since 4.1.1, can you see if you are still seeing the problem with the newest release?
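One way to picture "ensuring a consistent charset upstream" is the sketch below: the raw bytes are decoded with a declared source charset and re-encoded in one fixed charset before they reach the checksum. `CanonicalBytesSketch` and `toCanonicalBytes` are hypothetical names, and the choice of UTF-8 as the fixed charset is an assumption, not a description of what Liquibase does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CanonicalBytesSketch {

    // Decode with the charset the content was actually written in,
    // then re-encode in a single fixed charset (UTF-8 here, by assumption).
    static byte[] toCanonicalBytes(byte[] raw, Charset sourceCharset) {
        String text = new String(raw, sourceCharset);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin1File = {(byte) 0xE9};            // "é" stored as ISO-8859-1
        byte[] utf8File = {(byte) 0xC3, (byte) 0xA9}; // "é" stored as UTF-8

        byte[] a = toCanonicalBytes(latin1File, StandardCharsets.ISO_8859_1);
        byte[] b = toCanonicalBytes(utf8File, StandardCharsets.UTF_8);

        // Both inputs end up as the same bytes, so any checksum over them matches.
        System.out.println(Arrays.equals(a, b)); // true
    }
}
```

The sketch only works when the source charset is declared correctly. Decoding with the platform default instead would bring back the OS-dependent behavior.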

If you are, can you tell me how the problem changeset(s) are set up? What types of changes are in there? The problem more likely lies with sqlFile, createView, or whatever change code is involved.

nvoxland · 2021-10-07T20:09:18Z

Closing due to lack of follow-up

sync-by-unito bot added the StatusDiscovery label Nov 3, 2020

molivasdat added DBAll IntegrationMaven Severity3 TypeBug ImpactLow labels Nov 6, 2020

sync-by-unito bot added StatusConditioning and removed StatusDiscovery labels Nov 17, 2020

molivasdat added this to To Do in Conditioning++ via automation Jul 7, 2021

nvoxland moved this from To Do to In discussion in Conditioning++ Aug 6, 2021

nvoxland closed this as completed Oct 7, 2021

Conditioning++ automation moved this from In discussion to Done Oct 7, 2021

nvoxland removed this from Done in Conditioning++ Oct 7, 2021
