Framework stuck at broker reconciling state #306

Open
shangd opened this issue Jun 8, 2017 · 5 comments
shangd (Contributor) commented Jun 8, 2017

We had several occasions where the Mesos slave running a broker got restarted. The framework started reconciling the broker task, but then the framework itself was restarted while the broker state was "reconciling" (persisted in ZooKeeper). After that, the framework is stuck on the persisted "reconciling" broker state even though no reconciliation is actually in progress. It will not start any brokers, even though the broker is no longer running. The only fix I found is to manually go into the ZooKeeper /kafka-mesos node and delete every "task" broker attribute containing the "reconciling" state from the brokers JSON.
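The manual cleanup amounts to rewriting each broker's JSON so that no "task" entry is left in the "reconciling" state. A minimal Python sketch of that transformation (the function name is hypothetical; the JSON layout follows the /api/broker/list output shown later in this thread):

```python
import json

def strip_reconciling_tasks(brokers_json: str) -> str:
    """Remove any 'task' attribute stuck in the 'reconciling' state.

    Mirrors the manual fix described above: deleting the reconciling
    task entries from the brokers JSON stored under the /kafka-mesos
    znode. Field names are taken from the /api/broker/list output.
    """
    state = json.loads(brokers_json)
    for broker in state.get("brokers", []):
        task = broker.get("task")
        if task is not None and task.get("state") == "reconciling":
            del broker["task"]  # forget the phantom task entirely
    return json.dumps(state)
```

This only edits the serialized state; how the framework re-reads it on startup is unchanged.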

The code causing this problem:
https://github.com/mesos/kafka/blob/master/src/scala/main/ly/stealth/mesos/kafka/scheduler/mesos/TaskReconciler.scala#L124

Maybe it should resume the reconciliation, or remove the check altogether, rather than do nothing when the persisted state is already "reconciling".
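The failure mode can be modelled with a tiny state machine (class and method names are hypothetical sketches; the real logic lives in TaskReconciler.scala):

```python
class ToyTaskReconciler:
    """Toy model of the startup guard that causes the stuck state.

    The point is the guard: if any broker's persisted task state is
    already "reconciling", start() does nothing, so a framework
    restarted mid-reconciliation never resumes it.
    """

    def __init__(self, broker_states):
        # broker_states: task states as reloaded from ZooKeeper
        self.broker_states = broker_states
        self.reconciliation_started = False

    def start(self):
        if any(s == "reconciling" for s in self.broker_states.values()):
            # Mirrors the guarded branch: treated as "already in
            # progress" and skipped, even after a restart.
            return False
        self._start_impl()
        return True

    def _start_impl(self):
        self.reconciliation_started = True
```

After a restart with a broker persisted as "reconciling", `start()` is skipped and reconciliation never begins, which matches the stuck behaviour reported above.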

steveniemitz (Contributor) commented:
Is it actually stuck? The default reconciliation timeout is pretty high, 30 minutes I think. Can you attach broker logs from before/after the framework restart?

shangd (Contributor, Author) commented Jun 9, 2017

None of the brokers are running anymore, but the framework's ZooKeeper state remembers a broker in the "reconciling" state. When the framework restarts, it won't start any broker as long as even a single broker is stuck in that state, and the state is unrecoverable because of the line linked above: if any broker is "reconciling", the framework never schedules another reconciliation, and since the broker is no longer running, nothing else will ever move it out of "reconciling".

shangd (Contributor, Author) commented Jun 13, 2017

@steveniemitz Any thoughts? You can easily reproduce the problem: run a framework with one running broker, stop the framework, kill the broker, manually change the broker state from "running" to "reconciling" under /kafka-mesos in ZooKeeper, then start the framework. It will be stuck.

I would suggest removing the if part and always doing startImpl() in https://github.com/mesos/kafka/blob/master/src/scala/main/ly/stealth/mesos/kafka/scheduler/mesos/TaskReconciler.scala#L124
I'm not sure if there are any side effects, though.
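In the same toy model, the suggested change is simply to drop the guard so that start() always reaches startImpl() (names remain hypothetical sketches, not the framework's actual API):

```python
class FixedToyTaskReconciler:
    """Toy model with the startup guard removed.

    A framework restarted mid-reconciliation simply begins a fresh
    reconciliation pass instead of skipping it, so a stale persisted
    "reconciling" state can no longer block the scheduler.
    """

    def __init__(self, broker_states):
        self.broker_states = broker_states
        self.reconciliation_started = False

    def start(self):
        # No "already reconciling" check: always (re)start.
        self._start_impl()
        return True

    def _start_impl(self):
        self.reconciliation_started = True
```

Whether re-entering reconciliation this way has side effects (duplicate timers, overlapping passes) is exactly the open question in this thread.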

steveniemitz (Contributor) commented Jun 14, 2017

I'll have some time to look at this soon. The reconciliation logic is fairly complicated because there are a bunch of edge cases; it's not as simple as just removing that if block.

I'd really like to see the logs from before and after the framework restarts; I'm still confused about how you're getting into this state. Which versions of the framework and Mesos are you running?

Also, when it gets into this state, manually stopping the broker from the CLI should be enough to get it out of "reconciling". Have you tried that?

shangd (Contributor, Author) commented Jun 16, 2017

It's very likely to happen during a rolling restart of the Mesos slaves. Say you have two slaves, slave01 and slave02, with the broker running on slave01 and the framework on slave02:

  1. slave01 is restarted (the broker is killed and lost).
  2. The framework starts reconciling the broker.
  3. While it is still reconciling, slave02 is restarted.
  4. The framework starts again after slave02 comes back, but is stuck forever due to the persisted "reconciling" state.
Here is an example state from /api/broker/list

{
  "brokers": [
    {
      "id": "21",
      "active": true,
      "cpus": 1,
      "mem": 3072,
      "heap": 2048,
      "syslog": false,
      "constraints": "hostname=like:.*slave01.*",
      "options": "",
      "log4jOptions": "",
      "jvmOptions": "",
      "stickiness": {
        "period": "864000s",
        "hostname": "slave01.mycluster.com"
      },
      "failover": {
        "delay": "60s",
        "maxDelay": "10m",
        "failures": 0
      },
      "task": {
        "id": "kafka-21-27ae8058-b0b8-48a3-9d94-eac6e30db749",
        "slaveId": "400ac0f4-9bf6-435c-a227-51a9b559e22d-S3",
        "executorId": "kafka-21-a35bc7ce-8bf2-4016-9f43-6f41aad154d9",
        "hostname": "slave01.mycluster.com",
        "endpoint": "slave01.mycluster.com:10023",
        "attributes": {},
        "state": "reconciling"
      },
      "metrics": {
        "timestamp": 0
      },
      "needsRestart": false
    }
  ]
}

Stopping the broker (/api/broker/stop) does not work either: it only changes the "active" field from true to false, the "task" section sticks around, and when I start the broker again nothing happens.

The only relevant log line I got after the framework restart is:

2017-06-15 15:16:14,871 WARN           TaskReconciler] Reconcile already in progress, skipping.

The logs from before the framework restart don't matter: as long as you kill the framework while a broker is reconciling (for any reason), the framework will be stuck when you start it again.

We are using Mesos 0.28.2. My understanding is that reconciliation is framework-driven, so as long as the framework skips it in that startup if block, the reconciliation can never complete.
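Explicit reconciliation in Mesos is indeed initiated by the scheduler: it asks the master about the tasks it believes exist, and the master answers with an authoritative status per task, reporting TASK_LOST for tasks it does not know about. A sketch of why re-issuing the request on startup would unstick things (the function and the dict stand-in for the master are assumptions, not the framework's real driver code):

```python
def reconcile(master_known_tasks, framework_tasks):
    """Toy stand-in for framework-driven explicit reconciliation.

    The framework asks about every task it remembers; the master
    replies with its authoritative state, TASK_LOST for tasks it
    does not know about (e.g. a broker killed by a slave restart).
    """
    return {
        task_id: master_known_tasks.get(task_id, "TASK_LOST")
        for task_id in framework_tasks
    }
```

If the restarted framework re-sent the reconcile request instead of skipping it, the master (which no longer knows the broker task) would promptly answer TASK_LOST, letting the framework clear the stale "reconciling" state and reschedule the broker.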

plaflamme added a commit to plaflamme/kafka that referenced this issue Jan 13, 2019
This should fix issue mesos#306, which prevents the reconciling task from starting if a previous one doesn't fully complete.