FATAL errors when accessing Curator objects before they are fully initialized. #41

Closed
akiani opened this issue Mar 7, 2012 · 27 comments

Comments

@akiani

akiani commented Mar 7, 2012

Hi JZ,
In my tests, I often run into these errors:

INFO : org.apache.zookeeper.server.PrepRequestProcessor.run - PrepRequestProcessor exited loop!
FATAL: org.apache.zookeeper.server.SyncRequestProcessor.run - Severe unrecoverable error, exiting
java.nio.channels.ClosedChannelException
at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:88)
at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:243)
at org.apache.zookeeper.server.persistence.Util.padLogFile(Util.java:214)
at org.apache.zookeeper.server.persistence.FileTxnLog.padFile(FileTxnLog.java:237)
at org.apache.zookeeper.server.persistence.FileTxnLog.append(FileTxnLog.java:215)
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.append(FileTxnSnapLog.java:315)
at org.apache.zookeeper.server.ZKDatabase.append(ZKDatabase.java:468)
at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:107)

These errors block the execution of my tests. I seem to be able to get around them by adding waits after I create the Curator objects, but I was wondering if there is a better way of handling this.

I'm using Curator's test client as well.
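
As an alternative to fixed waits, a test can poll a readiness check until a deadline passes. The following is only a sketch under assumptions: AwaitReady and its await method are hypothetical names, not part of Curator, and the check passed in stands for whatever "fully initialized" means for the object under test.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper (not a Curator class). Requires Java 8+ for BooleanSupplier.
final class AwaitReady {
    private AwaitReady() {}

    // Polls the readiness check until it returns true or the timeout elapses.
    // Returns true if the check passed, false on timeout or interruption.
    static boolean await(BooleanSupplier ready, long timeout, TimeUnit unit) {
        final long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (!ready.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out before the check passed
            }
            try {
                Thread.sleep(10); // short pause between polls
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
        return true;
    }
}
```

With a Curator client, the check could be something like client.getZookeeperClient().isConnected(), assuming the version in use exposes it. The same object also has blockUntilConnectedOrTimedOut(), which may make a helper like this unnecessary.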

Many thanks,
Amir

@Randgalt
Contributor

Randgalt commented Mar 7, 2012

I'd need to see a sample. I don't get these in my own tests.

@Randgalt
Contributor

Randgalt commented Mar 7, 2012

Actually - I've started seeing them now myself. I'll see what I can do.

@akiani
Author

akiani commented Mar 7, 2012

Great :) maybe it's something in the new ZK?

@ntolia

ntolia commented Mar 8, 2012

@akiani Thank you so much for filing this bug. FWIW, I run into this very frequently too. I have been playing with Java 7 on Mac OS X for a few days and my local tests kept failing without anything obvious in my logs. Now I finally know why: SyncRequestProcessor calls System.exit(). For some reason, this bug doesn't trigger with Java 6.

@akiani
Author

akiani commented Mar 8, 2012

@ntolia That's exactly the configuration that I'm using as well. Sorry I should have pointed that out...

@akiani
Author

akiani commented Mar 8, 2012

Actually, let me correct that, I do have Java 1.6 on my Lion:

$ java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03-383-11A511)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02-383, mixed mode)

@ntolia

ntolia commented Mar 13, 2012

@Randgalt Is there any update on this bug? If you tell me where to go look, I would be happy to submit a patch.

@Randgalt
Contributor

I haven't had a chance. However, my suspicion is that the problem is somewhere in the shutdown code of TestingServer and/or TestingCluster. Those classes are pretty hacked up.

@Randgalt
Contributor

I just pushed a change to TestingServer and TestingCluster that should make this issue better (see issue 46). Please retry with these changes.

@ntolia

ntolia commented Mar 14, 2012

Unfortunately, I still get the same failure as earlier with 1.1.5-SNAPSHOT. However, if I disable tests that use TestingServer, things work fine.

@akiani
Author

akiani commented Mar 14, 2012

Thanks Jordan, but I wasn't even using TestingCluster :D I was using TestingServer. I'm very surprised that fixing TestingCluster fixed my test for you... It's still failing for me with the same error.

@akiani
Author

akiani commented Mar 14, 2012

@Randgalt
Contributor

> Unfortunately, I still get the same failure as earlier with 1.1.5-SNAPSHOT. However, if I disable tests that use TestingServer, things work fine.

I didn't build new JARs. You'd have to take it from source.

@Randgalt
Contributor

It's here: c50087d

@ntolia

ntolia commented Mar 14, 2012

> I didn't build new JARs. You'd have to take it from source.

I should have been clearer. I cloned master, built, installed (it showed up as 1.1.5-SNAPSHOT locally), and then tested. Things still failed.

@Randgalt
Contributor

OK - can you put together a sample that fails? Or - are you referring to some of my tests? I realize a few tests are failing.

@ntolia

ntolia commented Mar 14, 2012

Trying to create a simple test but I need to remove a bunch of internal code to get at the simplest possible repro. This is proving to be harder than expected but I should hopefully be able to have something soon. In the meantime, with master, I do get this stacktrace via a debugger but then again, that is nothing new.

Breakpoint hit: "thread=SyncThread:0", java.lang.System.exit(), line=960 bci=0

SyncThread:0[1] where
  [1] java.lang.System.exit (System.java:960)
  [2] org.apache.zookeeper.server.SyncRequestProcessor.run (SyncRequestProcessor.java:153)

However, one of the other threads was captured in this state:

  .... (whole bunch of logback stack traces)
  [23] org.apache.zookeeper.server.PrepRequestProcessor.shutdown (PrepRequestProcessor.java:733)
  [24] org.apache.zookeeper.server.ZooKeeperServer.shutdown (ZooKeeperServer.java:439)
  [25] com.netflix.curator.test.TestingServer.stop (TestingServer.java:152)
  [26] com.netflix.curator.test.TestingServer.close (TestingServer.java:170)

Stack traces for all threads are available if it would be helpful.

Also, I don't know if it makes a difference, but when I call CuratorFramework.close() on a client that had been connected to a TestingServer that has already been close()d, I sometimes get a stack trace along the lines of:

Error while calling watcher
java.lang.IllegalStateException: null
  at com.google.common.base.Preconditions.checkState(Preconditions.java:129) ~[guava-11.0.2.jar:na]
  at com.netflix.curator.framework.state.ConnectionStateManager.addStateChange(ConnectionStateManager.java:130) ~[curator-framework-1.1.5-SNAPSHOT.jar:na]
   ...
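
Since this trace appears when the client is closed after the server, the close order may be worth ruling out. Below is a minimal sketch of how try-with-resources (Java 7+) closes resources in reverse declaration order, so a client declared after the server is closed first. CloseOrderSketch and Fake are hypothetical stand-ins, not Curator classes, and nothing here confirms that reordering avoids the IllegalStateException.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; a real test would use TestingServer and CuratorFramework.
final class CloseOrderSketch {
    // Records its name in a shared log when closed.
    static final class Fake implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Fake(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            log.add(name);
        }
    }

    // Returns the order in which the two resources were closed.
    static List<String> run() {
        List<String> closed = new ArrayList<>();
        try (Fake server = new Fake("server", closed);
             Fake client = new Fake("client", closed)) {
            // The test body would use the client against the server here.
        }
        return closed; // resources are closed in reverse order: client, then server
    }
}
```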

@ntolia

ntolia commented Mar 26, 2012

Just wanted to provide an update on this bug report. I still get the failure but I am having trouble creating a standalone test. My test does the same thing our internal code does, but things work in the standalone unit test. This points to timing issues, but I have nothing further than that right now. I am going to keep on digging though.

I also wonder if this is related to #47.

@Randgalt
Contributor

Randgalt commented Apr 4, 2012

I just pushed a total rewrite of TestingServer and TestingCluster based on work by Jérémie BORDIER (ahfeel). Let me know if it behaves any better.

@ntolia

ntolia commented Apr 4, 2012

This is a good news/bad news story.

Good news: I no longer get a System.exit(). I had retested a couple of days ago with your zkDb.commit() patch and I was still hitting the System.exit() failure case.

Bad news: My TestingServer-based tests now hang but I need to figure out if that is just because of some thread that isn't cleaning up after itself, an internal problem that could be my fault, or something else with the TestingServer code (TestingCluster is not used in this particular suite of tests). I am going to spend some time this week looking into the hang and will let you know if I can narrow the problem down.
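
One way to check the "thread that isn't cleaning up after itself" theory is to list the live non-daemon threads at the point where a test appears stuck. This is a rough sketch using only the JDK. LingeringThreads is a hypothetical helper name, and the output format is arbitrary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical diagnostic helper, not part of Curator or ZooKeeper.
final class LingeringThreads {
    private LingeringThreads() {}

    // Describes every live non-daemon thread other than the caller,
    // with its state and the top frame of its stack.
    static List<String> describeNonDaemonThreads() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            if (t.isAlive() && !t.isDaemon() && t != Thread.currentThread()) {
                StackTraceElement[] stack = e.getValue();
                String top = stack.length > 0 ? stack[0].toString() : "(no frames)";
                out.add(t.getName() + " [" + t.getState() + "] at " + top);
            }
        }
        return out;
    }
}
```

Calling this from a test teardown and logging the result would show whether ZooKeeper or Curator threads are still alive after close(). A full thread dump from jstack gives the same information with complete stacks.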

@Randgalt
Contributor

Randgalt commented Apr 4, 2012

:(

@ntolia

ntolia commented Apr 4, 2012

There may be timing issues here: on a second mvn test run, the TestingServer-based tests didn't hang, but a few failed that do not fail with Java 6 + Curator 1.1.5. I will keep digging into this but am swamped with a couple of other things.

@Randgalt
Contributor

Randgalt commented Apr 4, 2012

OK - I just pushed a new version of TestingCluster that resurrects some really ugly bytecode manipulation that I didn't think was still needed. Deep inside ZooKeeper you can hit an assertion that screws things up. With this change, my tests all run fine now - even with Gradle.

@Randgalt
Contributor

Has anyone tried with the latest JARs?

@Randgalt
Contributor

As I haven't heard back on this I am closing it.

@ntolia

ntolia commented Apr 16, 2012

FWIW, it didn't work with master as of a week ago and doesn't work with Curator 1.1.7. Same symptoms as earlier with hung tests. I will request a reopen when I get a chance to provide more information or an easy repro.

@Randgalt
Contributor

Sorry :( Thanks for your patience on this.

This issue was closed.