Marathon / HA / Zookeeper #412

Closed · nelsou opened this issue Jul 22, 2014 · 20 comments

@nelsou

nelsou commented Jul 22, 2014

Hello,

We are using Marathon (0.6.0, and we also tried 0.7.0) in High Availability mode with the --ha flag. We want to run 3 Marathon instances.

We hit the exception below about 95% of the time; it takes roughly 20 restart attempts before Marathon starts correctly.

I'm not a Scala developer, so any help would be much appreciated.

Starting Marathon 0.7.0-SNAPSHOT (mesosphere.marathon.Main$:20)
Connecting to Zookeeper... (mesosphere.marathon.Main$:39)
Client environment:zookeeper.version=3.3.3-1203054, built on 11/17/2011 05:47 GMT (org.apache.zookeeper.ZooKeeper:97)
Client environment:host.name=ip-****.eu-west-1.compute.internal (org.apache.zookeeper.ZooKeeper:97)
Client environment:java.version=1.7.0_55 (org.apache.zookeeper.ZooKeeper:97)
Client environment:java.vendor=Oracle Corporation (org.apache.zookeeper.ZooKeeper:97)
Client environment:java.home=/usr/lib/jvm/java-7-openjdk-amd64/jre (org.apache.zookeeper.ZooKeeper:97)
Client environment:java.class.path=/tmp/marathon-runnable.jar (org.apache.zookeeper.ZooKeeper:97)
Client environment:java.library.path=/usr/local/lib (org.apache.zookeeper.ZooKeeper:97)
Client environment:java.io.tmpdir=/tmp (org.apache.zookeeper.ZooKeeper:97)
Client environment:java.compiler= (org.apache.zookeeper.ZooKeeper:97)
Client environment:os.name=Linux (org.apache.zookeeper.ZooKeeper:97)
Client environment:os.arch=amd64 (org.apache.zookeeper.ZooKeeper:97)
Client environment:os.version=3.13.0-24-generic (org.apache.zookeeper.ZooKeeper:97)
Client environment:user.name=root (org.apache.zookeeper.ZooKeeper:97)
Client environment:user.home=/root (org.apache.zookeeper.ZooKeeper:97)
Client environment:user.dir=/tmp (org.apache.zookeeper.ZooKeeper:97)
Initiating client connection, connectString=ec2-****.eu-west-1.compute.amazonaws.com:2181,ec2-****.eu-west-1.compute.amazonaws.com:2181,ec2-****.eu-west-1.compute.amazonaws.com:2181 sessionTimeout=100000 watcher=com.twitter.common.zookeeper.ZooKeeperClient$2@2d34f41e (org.apache.zookeeper.ZooKeeper:379)
Opening socket connection to server ec2-****.eu-west-1.compute.amazonaws.com/****:2181 (org.apache.zookeeper.ClientCnxn:1061)
Socket connection established to ec2-****.eu-west-1.compute.amazonaws.com/****:2181, initiating session (org.apache.zookeeper.ClientCnxn:950)
Session establishment complete on server ec2-****.eu-west-1.compute.amazonaws.com/****:2181, sessionid = 0x347391169ea008f, negotiated timeout = 40000 (org.apache.zookeeper.ClientCnxn:739)
ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
ZOO_INFO@log_env@716: Client environment:host.name=ip-****
ZOO_INFO@log_env@723: Client environment:os.name=Linux
ZOO_INFO@log_env@724: Client environment:os.arch=3.13.0-24-generic
ZOO_INFO@log_env@725: Client environment:os.version=#46-Ubuntu SMP Thu Apr 10 19:11:08 UTC 2014
ZOO_INFO@log_env@733: Client environment:user.name=ubuntu
ZOO_INFO@log_env@741: Client environment:user.home=/root
ZOO_INFO@log_env@753: Client environment:user.dir=/tmp
ZOO_INFO@zookeeper_init@786: Initiating client connection, host=ec2-****.eu-west-1.compute.amazonaws.com:2181,ec2-****.eu-west-1.compute.amazonaws.com:2181,ec2-****.eu-west-1.compute.amazonaws.com:2181 sessionTimeout=100000 watcher=0x7f8f2d124510 sessionId=0 sessionPasswd= context=0x7f8f24006860 flags=0
Registering in Zookeeper with hostname:ec2-****.eu-west-1.compute.amazonaws.com (mesosphere.marathon.MarathonModule:68)

Exception in thread "main" com.google.inject.CreationException: Guice creation errors:

1) Error in custom provider, java.util.concurrent.TimeoutException: Failed to wait for future within timeout
  at mesosphere.marathon.MarathonModule.provideAppRepository(MarathonModule.scala:86)
  at mesosphere.marathon.MarathonModule.provideAppRepository(MarathonModule.scala:86)
  while locating mesosphere.marathon.state.AppRepository
    for parameter 5 at mesosphere.marathon.MarathonSchedulerService.<init>(MarathonSchedulerService.scala:31)
  at mesosphere.marathon.MarathonModule.configure(MarathonModule.scala:34)
  while locating mesosphere.marathon.MarathonSchedulerService
    for parameter 0 at mesosphere.marathon.api.LeaderProxyFilter.<init>(LeaderProxyFilter.scala:20)
  at mesosphere.marathon.api.MarathonRestModule.configureServlets(MarathonRestModule.scala:34)
  while locating mesosphere.marathon.api.LeaderProxyFilter
Caused by: java.util.concurrent.TimeoutException: Failed to wait for future within timeout
    at org.apache.mesos.state.AbstractState.__fetch_get_timeout(Native Method)
    at org.apache.mesos.state.AbstractState.access$400(AbstractState.java:34)
    at org.apache.mesos.state.AbstractState$1.get(AbstractState.java:69)
    at org.apache.mesos.state.AbstractState$1.get(AbstractState.java:42)
    at mesosphere.util.BackToTheFuture$$anonfun$futureToFutureOption$1.apply(BackToTheFuture.scala:20)
    at mesosphere.util.BackToTheFuture$$anonfun$futureToFutureOption$1.apply(BackToTheFuture.scala:19)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

2) Error in custom provider, java.util.concurrent.TimeoutException: Failed to wait for future within timeout
  at mesosphere.marathon.MarathonModule.provideAppRepository(MarathonModule.scala:86)
  at mesosphere.marathon.MarathonModule.provideAppRepository(MarathonModule.scala:86)
  while locating mesosphere.marathon.state.AppRepository
    for parameter 2 at mesosphere.marathon.MarathonScheduler.<init>(MarathonScheduler.scala:41)
  at mesosphere.marathon.MarathonModule.configure(MarathonModule.scala:35)
  while locating mesosphere.marathon.MarathonScheduler
    for parameter 6 at mesosphere.marathon.MarathonSchedulerService.<init>(MarathonSchedulerService.scala:31)
  at mesosphere.marathon.MarathonModule.configure(MarathonModule.scala:34)
  while locating mesosphere.marathon.MarathonSchedulerService
    for parameter 0 at mesosphere.marathon.api.LeaderProxyFilter.<init>(LeaderProxyFilter.scala:20)
  at mesosphere.marathon.api.MarathonRestModule.configureServlets(MarathonRestModule.scala:34)
  while locating mesosphere.marathon.api.LeaderProxyFilter
Caused by: java.util.concurrent.TimeoutException: Failed to wait for future within timeout
    at org.apache.mesos.state.AbstractState.__fetch_get_timeout(Native Method)
    at org.apache.mesos.state.AbstractState.access$400(AbstractState.java:34)
    at org.apache.mesos.state.AbstractState$1.get(AbstractState.java:69)
    at org.apache.mesos.state.AbstractState$1.get(AbstractState.java:42)
    at mesosphere.util.BackToTheFuture$$anonfun$futureToFutureOption$1.apply(BackToTheFuture.scala:20)
    at mesosphere.util.BackToTheFuture$$anonfun$futureToFutureOption$1.apply(BackToTheFuture.scala:19)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

2 errors
    at com.google.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:435)
    at com.google.inject.internal.InternalInjectorCreator.injectDynamically(InternalInjectorCreator.java:183)
    at com.google.inject.internal.InternalInjectorCreator.build(InternalInjectorCreator.java:109)
    at com.google.inject.Guice.createInjector(Guice.java:95)
    at com.google.inject.Guice.createInjector(Guice.java:72)
    at mesosphere.chaos.App$class.injector(App.scala:21)
    at mesosphere.marathon.Main$.injector$lzycompute(Main.scala:18)
    at mesosphere.marathon.Main$.injector(Main.scala:18)
    at mesosphere.chaos.App$$anonfun$run$2.apply(App.scala:39)
    at mesosphere.chaos.App$$anonfun$run$2.apply(App.scala:38)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
    at mesosphere.chaos.App$class.run(App.scala:38)
    at mesosphere.marathon.Main$.run(Main.scala:18)
    at mesosphere.marathon.Main$delayedInit$body.apply(Main.scala:80)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:71)
    at scala.App$$anonfun$main$1.apply(App.scala:71)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
    at scala.App$class.main(App.scala:71)
    at mesosphere.marathon.Main$.main(Main.scala:18)
    at mesosphere.marathon.Main.main(Main.scala)

Thanks !

@ssk2
Contributor

ssk2 commented Jul 22, 2014

Hey @nelsou, I would suggest trying a few things:

  • The 0.7.0 snapshot release is probably not recommended for general use right now. Is there a specific feature you're looking for? I'd advocate downloading the latest stable release, 0.6.1, and using that for now. Make sure the version is the same across all machines, since the way they use Zookeeper may change between versions.
  • It looks like Marathon is struggling to connect to Zookeeper (a timeout). Have you configured these machines identically?
  • You might also want to flush the Zookeeper state (just in case using multiple versions simultaneously has made it inconsistent) using zkCli.sh (depending on your version of Linux, this should be available in /usr/share/zookeeper/bin). Launch zkCli.sh on the machine you're running Zookeeper on and try rmr /marathon; this will delete all Marathon state (see the sketch below).
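
A minimal sketch of that cleanup, assuming the stock Ubuntu Zookeeper package layout (paths and the server address may differ on your install):

    # WARNING: this wipes all Marathon state in ZK; stop Marathon first.
    /usr/share/zookeeper/bin/zkCli.sh -server localhost:2181
    # then, at the zk shell prompt:
    rmr /marathon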

Let me know if this helps!

@nelsou
Author

nelsou commented Jul 22, 2014

Thanks @ssk2 for the quick response. We have this issue on 0.6.0 too.

I'm not sure it's related to Zookeeper, because I tried with another Zookeeper node (zk://zookeeper-ip:2191/marathon2) and got the same problem.

My 3 Zookeepers are all configured the same, and the cluster works perfectly with other frameworks (Mesos, Kafka, ...).

When Marathon connects to Zookeeper, I don't see any timeout exception in the logs above. zk_timeout is set to 10 seconds in your code; I'm not sure it really waits 10 seconds.

Is it possible that you don't wait for Zookeeper's response?
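
In case it helps us test, assuming the build exposes a --zk_timeout flag in milliseconds (an assumption on my part; ./bin/start --help should confirm), we could try raising it well past the default:

    # Assumed flag: --zk_timeout in milliseconds; verify against ./bin/start --help.
    ./bin/start --zk zk://zk1:2181,zk2:2181,zk3:2181/marathon \
                --master zk://zk1:2181,zk2:2181,zk3:2181/mesos \
                --ha --zk_timeout 30000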

@ssk2
Copy link
Contributor

ssk2 commented Jul 22, 2014

I'm not quite sure why that would work on 2 nodes but not the third!

Some other suggestions:

  • The Zookeeper port is 2181 - just to double check it's being set correctly. (Probably just a typo in your last message though.)
  • Is the Scala version the same on all nodes?
  • Are you able to bring up the third node by itself (with the other two down)? I wonder if the issue is specific to that machine.
  • Do you have any custom config files?

Generally it looks like a connectivity problem, but I could be wrong!

@jloisel

jloisel commented Jul 23, 2014

Hi, I'm working with Nelsou on that issue. So far, here is what we have tried:

  • Running Marathon locally with a local Zookeeper: fine.
  • Running Marathon locally with a remote Zookeeper cluster composed of 3 EC2 instances: fine.
  • Marathon run on an EC2 instance (the same one as the Mesos Master) with a Zookeeper cluster composed of 3 EC2 instances: fails, and starts only after about 20 attempts.
  • With Marathon either in HA mode or not, we face the same issue; it doesn't change anything.
  • Marathon started on 3 different Mesos Masters faces the same issue, even after an rmr /marathon on the Zookeeper cluster to start as if from a fresh install.

The same Zookeeper cluster is used by Apache Mesos, Kafka, Chronos and Apache Spark without any issue. TCP port 2181 as well as 2888 and 3888 are open, so it does not seem to be a connectivity problem.
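
A quick way to double-check reachability from the Mesos Master, assuming nc (netcat) is installed; the host name below is a placeholder:

    # Is the ZK client port open, and does the server answer 4-letter words?
    nc -zv ec2-zk1.eu-west-1.compute.amazonaws.com 2181
    echo ruok | nc ec2-zk1.eu-west-1.compute.amazonaws.com 2181   # expect: imok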

The Marathon node is started through supervisor, but starting it manually by launching the shell script hits the same issue.

It seems like there is a race condition in the Marathon initialization phase. I was not expecting Guice to initialize beans in parallel, but it seems to be the case in Marathon. As the code is written in Scala, we are unable to figure out what's wrong.

We use exactly the same setup for the 3 Marathon nodes (which are in fact also Mesos Master nodes), as we use the same AMI to launch all 3.

Our Zookeeper cluster is managed by Netflix Exhibitor (run by supervisor) and runs like a charm on oversized Amazon EC2 instances.

When looking at the Marathon logs, it seems like Zookeeper is initialized properly just after the timeout occurs.

@rasputnik
Contributor

Sorry to ask the obvious, but have you upped your max connections limit ("maxClientCnxns=") on Zookeeper?


@jloisel

jloisel commented Jul 23, 2014

I'm going to check, but from a local machine we can launch Marathon without any issue every time; it's only on the EC2 instances that it fails.

@jloisel

jloisel commented Jul 23, 2014

The property maxClientCnxns is not set in our zoo.cfg. I ran the "stat" four-letter-word command against our Zookeeper nodes; it shows between 20 and 30 alive connections per node.
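
For reference, this is roughly how such a check can be run against each node (assumes netcat is available; the host name is a placeholder, and the maxClientCnxns default varies by Zookeeper version):

    # List live connections on a ZK node via the 'stat' four-letter word.
    echo stat | nc ec2-zk1.eu-west-1.compute.amazonaws.com 2181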

@ssk2
Contributor

ssk2 commented Jul 23, 2014

Hey @jloisel, a few more questions:

  • Are the Zookeeper instances separate from the master nodes that you're trying to run Marathon on?
  • Have you tried running Marathon on its own instance? Does that yield the same issue?
  • Have you installed the Marathon tarball? We now have .debs available which might be worth trying: http://mesosphere.io/2014/07/17/mesosphere-package-repositories/

I suspect there's some configuration issue here. Can you share the arguments you use to start Marathon manually? I'm on freenode under the same username if you'd like to chat there.

@StephanErb

Maybe in one of your configs you are confusing internal with external IP addresses? I remember having problems like this on EC2.
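
A quick way to compare the two from an instance, using the standard EC2 instance metadata endpoint:

    # Internal vs. external identity of this EC2 instance.
    curl -s http://169.254.169.254/latest/meta-data/local-ipv4
    curl -s http://169.254.169.254/latest/meta-data/public-hostname
    curl -s http://169.254.169.254/latest/meta-data/public-ipv4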

@jloisel

jloisel commented Jul 23, 2014

Hi, thanks for your answers, it's always good to feel that people are behind the project and trying to help you :)

  • Zookeeper instances are on 3 dedicated EC2 instances. Each instance is located in a different zone (eu-west-1a, 1b and 1c) within the same region (eu-west-1) to avoid an outage if an Amazon zone goes down. We have 3 replicated Mesos Masters, each in a separate zone within the same region. The Mesos Masters are not installed on the same machines as the Zookeeper quorum.
  • Marathon nodes are running on the same machines as the Mesos Master ones. The Mesos Masters have no issue at all connecting to the Zookeeper cluster. Our Zookeeper cluster, composed of 3 nodes, of course uses 3 Elastic IPs (static public IPs).
    We haven't tried running Marathon on EC2 instances other than the masters, because it requires the Mesos native libs. However, we can give it a try on a Mesos Slave machine, as we are using exactly the same image for Mesos Master and Slave machines.
  • We have tried both the Marathon tarball (0.6.0) and building from sources (0.7.0-SNAPSHOT). We will give the .deb a try, but I don't see any reason why the tarball would not work.

Here is the configuration we use to launch our Marathon instance:

# Launch script for one Marathon node (run on each Mesos Master).
MARATHON_DIR=/home/ubuntu/marathon-0.6.0
ZK_URL=zk://<ec2 hostname 1>,<ec2 hostname 2>,<ec2 hostname 3>   # public hostnames only
HOSTNAME="$(curl http://169.254.169.254/latest/meta-data/public-hostname 2>/dev/null)"
cd "$MARATHON_DIR"
./bin/start --zk "$ZK_URL/marathon" --master "$ZK_URL/mesos" --hostname "$HOSTNAME" --ha

We have checked the Zookeeper URL ten times; it is shared via a setenv.sh with the Mesos Master launch script. The Mesos Master works like a charm on the same machine using the same Zookeeper cluster: it connects within milliseconds.

As mentioned before, launching Marathon exactly the same way on a local machine and connecting to the remote EC2 Zookeeper cluster works absolutely fine. We use Ubuntu 14.04 both on local machines and on EC2 instances.

Maybe we could set up a screen share on Hangouts with one of you guys to show you the issue live. Anyway, thanks for spending time on this issue. :)

We have to restart Marathon up to 20 times until it properly connects to Zookeeper. Once running, it works perfectly fine.

@jloisel

jloisel commented Jul 24, 2014

I have tested Marathon again on another EC2 instance, one where only a Mesos Slave runs; it fails too.

@nelsou
Author

nelsou commented Jul 24, 2014

Can we somehow turn on a debug mode in order to help?

@ConnorDoyle
Contributor

You can configure the verbosity of all registered loggers via the UI by visiting the /logging endpoint. We'd really love to get this solved for you. Maybe Friday afternoon for a hangout?
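
Assuming Marathon is serving on its default HTTP port 8080 (the host name below is a placeholder), that is simply:

    # Fetch the log-level configuration page (an HTML form) from a running Marathon.
    curl http://<marathon-host>:8080/logging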

@jloisel

jloisel commented Jul 24, 2014

We are in the Paris timezone, so that won't be possible as you are in San Francisco ;)

The problem is that the Marathon server does not start, so I doubt /logging would work.

@ssk2
Contributor

ssk2 commented Jul 28, 2014

@jloisel - are you still stuck on this issue?

@jloisel

jloisel commented Jul 29, 2014

Yes, still no solution other than restarting Marathon 10 or more times until it works.

@ssk2
Contributor

ssk2 commented Jul 29, 2014

@jloisel @ConnorDoyle - can you guys arrange a time to speak?

@ConnorDoyle
Contributor

@jloisel my email is connor at mesosphere d0t io if you'd like to take the scheduling off of GitHub.

@ConnorDoyle
Contributor

As a workaround, specifying the ZK hosts as (AWS internal) IP addresses instead of host names increased the rate of startup success substantially (to 100% in some manual testing on your servers). We'll eventually move to pure Java bindings to connect to Mesos, sidestepping this unfortunate deficiency of the C language ZK client.
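
For anyone applying the workaround, a rough sketch of resolving the ZK host names to IPs at launch time (host names are placeholders; from inside EC2, resolving the public host name typically yields the internal address):

    # Resolve each ZK host name once, then build the connection string from IPs.
    ZK1=$(getent hosts ec2-zk1.eu-west-1.compute.amazonaws.com | awk '{print $1}')
    ZK2=$(getent hosts ec2-zk2.eu-west-1.compute.amazonaws.com | awk '{print $1}')
    ZK3=$(getent hosts ec2-zk3.eu-west-1.compute.amazonaws.com | awk '{print $1}')
    ZK_URL=zk://$ZK1:2181,$ZK2:2181,$ZK3:2181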

Closing for now, feel free to re-open if the problem persists or if the above workaround isn't acceptable as a stopgap.

@nelsou
Author

nelsou commented Aug 7, 2014

Thanks @ConnorDoyle!

As you suggested, we replaced our Zookeeper hostnames with the IPs and it works fine!

Thank you 👍
