Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent synchronization causes sporadic NPEs #1568

Closed
bvremmeinfor opened this issue Dec 2, 2020 · 6 comments
Closed

Inconsistent synchronization causes sporadic NPEs #1568

bvremmeinfor opened this issue Dec 2, 2020 · 6 comments

Comments

@bvremmeinfor
Copy link

bvremmeinfor commented Dec 2, 2020

Environment

Liquibase Version:
3.8.x all the way to latest

Liquibase Integration & Version:
Plain Java API

Liquibase Extension(s) & Version:
None

Database Vendor & Version:
Any

Operating System Type & Version:
Any

Description

Inconsistent synchronization in LockServiceFactory & ChangeLogHistoryServiceFactory causes sporadic null pointer exceptions in multi-threaded environments.

Static 'instance' variable can be set to null in resetAll(), resulting in null value from getInstance() as they synchronize on different objects (static class vs instance).

Instance modified, synchronized on this (instance) not class as in getInstance():

Instance modified, synchronized on this (instance) not class as in getInstance():

Typicall locations (3.8.0):

Caused by: java.lang.NullPointerException
at liquibase.Liquibase.checkLiquibaseTables(Liquibase.java:1231)
at liquibase.Liquibase.update(Liquibase.java:193)
at liquibase.Liquibase.update(Liquibase.java:179)

Caused by: java.lang.NullPointerException
at liquibase.Liquibase.resetServices(Liquibase.java:1138)
at liquibase.Liquibase.update(Liquibase.java:210)
at liquibase.Liquibase.update(Liquibase.java:179)

Caused by: java.lang.NullPointerException
at liquibase.Liquibase.resetServices(Liquibase.java:1137)
at liquibase.Liquibase.update(Liquibase.java:210)
at liquibase.Liquibase.update(Liquibase.java:179)

Caused by: java.lang.NullPointerException
at liquibase.Liquibase.update(Liquibase.java:183)
at liquibase.Liquibase.update(Liquibase.java:179)

Steps To Reproduce

  • create multiple threads, each with their own Liquibase instance (new Liquibase(....))
  • let all threads perform update(ctx, label) - this has the desired resetServices() call at the end

Issue is very timing sensitive, in most cases it works, a little patience is required.

  • perform update using multiple threads - one thread for each database
  • wait for all to complete
  • if no errors then repeat

12 threads usually fail in less than 400 iterations on my machine (12 cores).
Getting the threads to interleave increases the chance.
It can be hard to reproduce, but once you've seen it you can't unsee it.

Alternative: Unit-test, see sample for LockServiceFactoryTest below

Actual Behavior

Occasional null pointer exceptions in multi-threaded environment.

Expected/Desired Behavior

Databases upgraded without errors.

Screenshots (if appropriate)

N/A

Additional Context

I don't have a PR, because that would require approval from legal and that is out of scope, but a simple fix is:

  • in resetAll(): call reset() instead of 'instance = null'
  • remove synchronized from resetAll()

Keep up the good work 👍

Sample LockServiceFactoryTest for inspiration:

        @Test
	public void resetAll_isThreadSafe() throws InterruptedException {

		final int threadCount = 12;
		final ExecutorService executor = Executors.newCachedThreadPool();

		final AtomicLong errors = new AtomicLong();
		final AtomicLong npeErrors = new AtomicLong();
		
		final CyclicBarrier startBarrier = new CyclicBarrier(threadCount);
		final CountDownLatch endLatch = new CountDownLatch(threadCount);

		try {
			
			for (int i = 0; i < threadCount; i++) {
				executor.execute(() -> {
					try {
						startBarrier.await();

						for (int j = 0; j < 10000; j++) {
								final LockServiceFactory instance = LockServiceFactory.getInstance();
								Thread.currentThread().sleep(0); // Thread interleaving
								instance.resetAll();
						}
						
					} catch (NullPointerException e) {
						errors.incrementAndGet();
						npeErrors.incrementAndGet();
					} catch (InterruptedException e) {
						Thread.currentThread().interrupt();
						errors.incrementAndGet();
					} catch (BrokenBarrierException | RuntimeException e) {
						errors.incrementAndGet();
					} finally {
						endLatch.countDown();
					}
				});
			}

			endLatch.await();
			Assert.assertEquals("NPE Errors", 0, npeErrors.get());
			Assert.assertEquals("Errors", 0, errors.get());

		} finally {
			executor.shutdownNow();
			executor.awaitTermination(10, TimeUnit.SECONDS);
		}
	}
@molivasdat
Copy link
Contributor

HI @bvremme Thanks for writing up this issue. It's unfortunate that you are not able to write up a PR to fix this. Thank you also for the LockServiceFactoryTest testcase. We will take a look and add it to our list of issues.

@bvremmeinfor
Copy link
Author

Thanks @molivasdat,
as mentioned, this is a tiny fix (2 lines * 2 classes) and the probability for this bug to occur in production is low. We use the unmodified stock version in production and will keep doing so. Expect this error to occur once, maybe twice a year in our build systems as they perform some stress-testing on every build. Annoying, but low priority.

As for most threading issues, it is hard to write consistently failing tests, 'numbers game' tests as the one above can help. Slow, but requires little wiring.

@molivasdat molivasdat added this to To Do in Conditioning++ via automation Jul 7, 2021
@molivasdat molivasdat added the BNoUpgradePath Cannot upgrade to next version label Jul 7, 2021
@nvoxland
Copy link
Contributor

Thanks @bvremmeinfor, I brought your test and fix into #1991 and they work great. I appreciate the test

@nvoxland nvoxland moved this from To Do to Code Review in Conditioning++ Jul 19, 2021
@nvoxland
Copy link
Contributor

Fixed by #1991

@suryaaki2 suryaaki2 moved this from Code Review to Ready for Handoff (In JIRA) in Conditioning++ Jul 27, 2021
@bvremmeinfor
Copy link
Author

Nice! Thank you @nvoxland :-)

@nvoxland
Copy link
Contributor

nvoxland commented Dec 7, 2021

This was fixed with #1991 in 4.5.0

@nvoxland nvoxland closed this as completed Dec 7, 2021
Conditioning++ automation moved this from Ready for Handoff (In JIRA) to Done Dec 7, 2021
@nvoxland nvoxland removed this from Done in Conditioning++ Dec 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants