Marathon / HA / Zookeeper #412
Hey @nelsou, I would suggest trying a few things:
Let me know if this helps!
Thanks @ssk2 for the quick response. We have this issue on 0.6.0 too. I'm not sure it's related to ZooKeeper, because I tried with another ZooKeeper node (zk://zookeeper-ip:2191/marathon2) and got the same problem. My 3 ZooKeepers are all configured the same and work perfectly with other frameworks (Mesos, Kafka, ...). When Marathon connects to ZooKeeper, I don't see any timeout exception in the logs above. zk_timeout is set to 10 seconds in your code, but I'm not sure it really waits 10 seconds. Is it possible that you do not wait for ZooKeeper's response?
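A quick way to rule out basic connectivity, independently of Marathon's own zk_timeout handling, is a plain TCP reachability probe against the ZooKeeper client port. A minimal sketch; `zk_reachable` is a hypothetical helper for illustration, not part of Marathon or ZooKeeper:

```python
import socket

def zk_reachable(host, port=2181, timeout=10.0):
    """Return True if a TCP connection to the ZooKeeper node succeeds
    within `timeout` seconds, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns True quickly but Marathon still times out, the problem is more likely in name resolution or session handshaking than in raw network reachability.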
I'm not quite sure why that would work on 2 nodes but not the third! Some other suggestions:
Generally it looks like a connectivity problem, but I could be wrong!
Hi, I'm working with Nelsou on this issue. Here is what we have experimented with so far:
- The same ZooKeeper cluster is used by Apache Mesos, Kafka, Chronos and Apache Spark without any issue. TCP ports 2181, 2888 and 3888 are open, so it does not seem to be a connectivity problem.
- The Marathon node is started through supervisor, but starting it manually by launching the sh script runs into the same issue.
- It seems like there is a race condition in Marathon's initialization phase. We wouldn't expect Guice to initialize beans in parallel, but that seems to be the case in Marathon. As the code is written in Scala, we are unable to figure out what's wrong.
- We use exactly the same setup for the 3 Marathon nodes (which are in fact also Mesos master nodes), since all 3 are launched from the same AMI.
- Our ZooKeeper cluster is managed by Netflix Exhibitor (run by supervisor) and runs like a charm on oversized Amazon EC2 instances.
- Looking at the Marathon logs, ZooKeeper seems to be initialized properly just after the timeout occurs.
Sorry to ask the obvious, but have you upped your max connections limit (maxClientCnxns)?
I'm going to check, but from a local machine we can launch Marathon without any issue, every time. Not so on the EC2 instance.
The property maxClientCnxns is not set in our zoo.cfg. I ran the "stat" four-letter command against our ZooKeeper nodes; it shows between 20 and 30 live connections per node.
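For reference, the connection cap being discussed lives in zoo.cfg. A minimal sketch with illustrative values, not taken from this cluster:

```ini
# zoo.cfg (illustrative values, not from this thread)
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
# Max concurrent connections per client IP; 0 means unlimited.
# ZooKeeper's default is 60 (it was 10 in older 3.3.x releases).
maxClientCnxns=60
```

Live connection counts can be checked as above with the `stat` four-letter command, e.g. `echo stat | nc <zk-host> 2181`.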
Hey @jloisel, a few more questions:
I suspect there's some configuration issue here. Can you share the arguments you use to start Marathon manually? I'm on freenode under the same username if you'd like to chat there. |
Maybe in one of your configs you are confusing internal with external IP addresses? I remember having problems like this on EC2.
Hi, thanks for your answers, it's always good to feel that people are behind the project and trying to help you :)
Here is the configuration we use to launch our Marathon instance: We have checked the ZooKeeper URL ten times; it is shared via a setenv.sh with the Mesos master launch script. Mesos master works like a charm on the same machine using the same ZooKeeper cluster, connecting to it within milliseconds. As mentioned before, launching Marathon exactly the same way on a local machine and connecting to the remote EC2 ZooKeeper cluster works absolutely fine. We use Ubuntu 14.04 both on local machines and on EC2 instances. Maybe we could set up a screen share on Hangouts with one of you guys to show you the issue live. Anyway, thanks for spending time on this issue. :) We have to restart Marathon up to 20 times until it properly connects to ZooKeeper. Once running, it works perfectly fine.
I have tested Marathon again on another EC2 instance where only a Mesos slave runs; it fails too.
Can we somehow turn on a debug mode in order to help?
You can configure the verbosity of all registered loggers via the UI by visiting the /logging endpoint.
We are in the Paris timezone, so that won't be possible as you are in San Francisco ;) The problem is that the Marathon server does not start, so I doubt /logging would work.
@jloisel - are you still stuck on this issue?
Yes, still no solution other than restarting Marathon 10 or more times until it works.
@jloisel @ConnorDoyle - can you guys arrange a time to speak? |
@jloisel my email is connor at mesosphere d0t io if you'd like to take the scheduling off of GitHub.
As a workaround, specifying the ZK hosts as (AWS internal) IP addresses instead of host names increased the rate of startup success substantially (to 100% in some manual testing on your servers). We'll eventually move to pure Java bindings to connect to Mesos, sidestepping this unfortunate deficiency of the C language ZK client. Closing for now; feel free to re-open if the problem persists or if the above workaround isn't acceptable as a stopgap.
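The workaround can be sketched as a tiny helper that resolves each ZooKeeper hostname to an IP before assembling the zk:// connection string, so the C ZooKeeper client never has to do name resolution itself. `build_zk_url` and its arguments are hypothetical names for illustration, not part of Marathon:

```python
import socket

def build_zk_url(hosts, port=2181, path="/marathon"):
    """Build a zk:// connection string from pre-resolved IP addresses.

    Resolving hostnames up front sidesteps the flaky name resolution
    in the C ZooKeeper client that this issue ran into on EC2.
    """
    ips = [socket.gethostbyname(h) for h in hosts]       # e.g. "localhost" -> "127.0.0.1"
    authority = ",".join(f"{ip}:{port}" for ip in ips)   # "ip1:2181,ip2:2181,..."
    return f"zk://{authority}{path}"
```

The resulting string can then be passed to Marathon in place of the hostname-based URL.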
Thanks @ConnorDoyle! As you suggested, we replaced our ZooKeeper hostnames with the IPs and it works fine! Thank you 👍
Hello,
We are using Marathon (0.6.0, and also tried 0.7.0) in High Availability mode with the --ha flag. We want to start 3 Marathon instances:
We are hitting this exception about 95% of the time. I mean, it took at least 20 tries before it started correctly ...
I'm not a Scala guy, so any help would be very appreciated.
Thanks!