Levant deploy stuck in loop when deployment object is empty #263

Closed
msvbhat opened this issue Dec 19, 2018 · 7 comments

@msvbhat
Contributor

msvbhat commented Dec 19, 2018

We had a Nomad job that was last deployed a couple of months ago, and for some reason /v1/job//deployments returned an empty object. When I tried to deploy the job again (with no changes to the job definition), Levant got stuck in a loop for more than 20 minutes. The command I ran was levant deploy -ignore-no-changes -var-file <file> job.nomad

levant version
Levant v0.2.5
Date: 2018-10-25T13:22:22Z
Commit: 0514741
Branch: 0.2.5
State: 0.2.5
Summary: 0514741

nomad version
Nomad v0.8.5 (90fbfaba6a6d9af7febc39082b95ed832d8b8bd6)

Debug log outputs from Levant:

I don't have the DEBUG log output any more (we lost it when we applied the workaround), but it contained many lines like the following:

levant/deploy: Nomad returned an empty deployment for evaluation ; retrying

When I checked the levant code, I observed that the DeploymentID is empty and that is why it got stuck.

https://github.com/jrasell/levant/blob/cc275cb120fda9dfaf20ffaebb36c12305495acb/levant/deploy.go#L446

To reproduce, take a job that has no deployment (I'm not sure how a job ends up in this state) and deploy it with the -ignore-no-changes option, without changing the job definition.

If I am able to reproduce this again in our environment, I will provide more information.

@msvbhat
Contributor Author

msvbhat commented Jan 25, 2019

I am facing this issue again. Below are the logs from the deploy command.

2019-01-25T10:42:00+01:00 |INFO| helper/variable: using variable with key tcp_service_checks and value [map[interval:10s timeout:2s]] from file
2019-01-25T10:42:00+01:00 |INFO| helper/variable: using variable with key show_datadog_logs and value true from file
2019-01-25T10:42:00+01:00 |INFO| helper/variable: using variable with key consumers_path and value server/workers/consumers/ from file
2019-01-25T10:42:00+01:00 |DEBU| levant/plan: triggering Nomad plan
2019-01-25T10:42:00+01:00 |ERRO| levant/plan: no changes detected for job
2019-01-25T10:42:00+01:00 |INFO| levant/plan: no changes found in job but ignore-changes flag set to true
2019-01-25T10:42:01+01:00 |DEBU| levant/deploy: running dynamic job count updater job_id=webapp-backend
2019-01-25T10:42:01+01:00 |INFO| levant/deploy: using dynamic count 2 for group staging job_id=webapp-backend
2019-01-25T10:42:01+01:00 |INFO| levant/deploy: triggering a deployment job_id=webapp-backend
2019-01-25T10:42:02+01:00 |INFO| levant/deploy: evaluation f7d110ef-0fbe-8d70-3cf7-b22e421ac8a2 finished successfully job_id=webapp-backend
2019-01-25T10:42:02+01:00 |INFO| levant/deploy: beginning deployment watcher for job job_id=webapp-backend
2019-01-25T10:42:02+01:00 |DEBU| levant/deploy: Nomad returned an empty deployment for evaluation f7d110ef-0fbe-8d70-3cf7-b22e421ac8a2; retrying job_id=webapp-backend
2019-01-25T10:42:04+01:00 |DEBU| levant/deploy: Nomad returned an empty deployment for evaluation f7d110ef-0fbe-8d70-3cf7-b22e421ac8a2; retrying job_id=webapp-backend
2019-01-25T10:42:06+01:00 |DEBU| levant/deploy: Nomad returned an empty deployment for evaluation f7d110ef-0fbe-8d70-3cf7-b22e421ac8a2; retrying job_id=webapp-backend
2019-01-25T10:42:08+01:00 |DEBU| levant/deploy: Nomad returned an empty deployment for evaluation f7d110ef-0fbe-8d70-3cf7-b22e421ac8a2; retrying job_id=webapp-backend
2019-01-25T10:42:10+01:00 |DEBU| levant/deploy: Nomad returned an empty deployment for evaluation f7d110ef-0fbe-8d70-3cf7-b22e421ac8a2; retrying job_id=webapp-backend
2019-01-25T10:42:13+01:00 |DEBU| levant/deploy: Nomad returned an empty deployment for evaluation f7d110ef-0fbe-8d70-3cf7-b22e421ac8a2; retrying job_id=webapp-backend
2019-01-25T10:42:15+01:00 |DEBU| levant/deploy: Nomad returned an empty deployment for evaluation f7d110ef-0fbe-8d70-3cf7-b22e421ac8a2; retrying job_id=webapp-backend
2019-01-25T10:42:17+01:00 |DEBU| levant/deploy: Nomad returned an empty deployment for evaluation f7d110ef-0fbe-8d70-3cf7-b22e421ac8a2; retrying job_id=webapp-backend
2019-01-25T10:42:19+01:00 |DEBU| levant/deploy: Nomad returned an empty deployment for evaluation f7d110ef-0fbe-8d70-3cf7-b22e421ac8a2; retrying job_id=webapp-backend

And it stays stuck in this loop forever. From what I can see, this is because the Nomad deployment endpoint returns an empty response.

msvbhat@tortuga$ curl --header "X-Nomad-Token: $NOMAD_TOKEN" https://staging.nomad.com:4646/v1/job/webapp-backend/deployment; echo
null

msvbhat@tortuga$ curl --header "X-Nomad-Token: $NOMAD_TOKEN" https://staging.nomad.com:4646/v1/job/webapp-backend/deployments; echo
[]

This may be because the Nomad cluster got into an inconsistent state, but I would still like Levant to handle it.
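
For reference, the check done with curl above can be sketched in Go. This is only an illustration using the standard library, not Levant's code: the function names, the command-line arguments and the NOMAD_TOKEN environment variable are assumptions made for the example, and the only field read from the response is the deployment's ID. The point it shows is that a JSON null from /v1/job/<job>/deployment has to be treated as "no deployment exists" rather than as something to wait for.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
)

// parseDeploymentID extracts the deployment ID from the body returned by
// /v1/job/<job>/deployment. A JSON null (as in the curl output above) yields
// an empty string and no error.
func parseDeploymentID(body []byte) (string, error) {
	var dep *struct{ ID string }
	if err := json.Unmarshal(body, &dep); err != nil {
		return "", err
	}
	if dep == nil {
		return "", nil // Nomad has no deployment for this job
	}
	return dep.ID, nil
}

// latestDeploymentID is a hypothetical helper that performs the same request
// as the curl command and hands the body to parseDeploymentID.
func latestDeploymentID(addr, token, jobID string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, addr+"/v1/job/"+jobID+"/deployment", nil)
	if err != nil {
		return "", err
	}
	if token != "" {
		req.Header.Set("X-Nomad-Token", token)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return parseDeploymentID(body)
}

func main() {
	// Usage (assumed for this sketch): go run . <nomad-address> <job-id>
	if len(os.Args) < 3 {
		fmt.Println("usage: <nomad-address> <job-id>")
		return
	}
	id, err := latestDeploymentID(os.Args[1], os.Getenv("NOMAD_TOKEN"), os.Args[2])
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if id == "" {
		fmt.Println("job has no deployment")
		return
	}
	fmt.Println("latest deployment:", id)
}
```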

@msvbhat
Contributor Author

msvbhat commented Jan 31, 2019

Another update.

Today I hit this again, even though the -ignore-no-changes option was NOT set. So I believe this happens every time there is an empty deployment object.

@stevenscg

@msvbhat Does #264 represent the issue you're seeing, or is this something different?

cc @jrasell

@msvbhat
Contributor Author

msvbhat commented Feb 1, 2019

No, I don't think they are related, because I have recently seen this issue even without the -ignore-no-changes option.

@stevenscg

Ok. I know that jrasell has been swamped lately.

@jrasell
Member

jrasell commented Feb 2, 2019

Hi @msvbhat, sorry for the delay. As @stevenscg has pointed out, I have been pretty busy in my day job and have let Levant slide a little, so apologies for that. I am dedicating some time tomorrow to fixing this problem, which I believe can be solved by adding timeout logic where the code checks for a deployment object, rather than retrying indefinitely to get a deployment ID. I am thinking the timeout can be short, maybe 30s, before exiting and stating that no deployment ID was generated. If you have any thoughts, please let me know.

@jrasell jrasell self-assigned this Feb 2, 2019
@msvbhat
Contributor Author

msvbhat commented Feb 2, 2019

Hi @jrasell, thanks for replying. I understand about the job, so please don't feel you have to apologise. :)

I think timeout logic would be easy to implement and would fix the problem at hand. I can also try to send a PR for the timeout.

But I was trying to understand why a deployment watcher is necessary when there is no change in the job file and no deployment is triggered at Nomad.

jrasell added a commit that referenced this issue Feb 25, 2019
In situations where the deployed evaluations didn't invoke a
deployment, the return from the Nomad eval endpoint would include
an empty deployment ID. Levant would continue to retry until the
deployment ID object was populated, which possibly wouldn't
happen causing Levant to get stuck in a loop forever.

This change adds a timeout into the function which performs the
above work, so that if after 60s no deployment ID has been
returned, Levant will exit with a useful message.

Closes #263
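
The timeout described in the commit message could look roughly like the sketch below. It is not the code from the commit: waitForDeploymentID, the lookup callback and errNoDeployment are hypothetical names, and lookup stands in for whatever query returns the evaluation's deployment ID. It only illustrates the general pattern of polling at an interval and giving up once a deadline passes instead of retrying forever.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNoDeployment is returned when no deployment ID shows up before the timeout.
var errNoDeployment = errors.New("no deployment ID was returned before the timeout")

// waitForDeploymentID calls lookup once per interval until it returns a
// non-empty ID, returns an error, or the timeout elapses. lookup is a
// hypothetical stand-in for the Nomad evaluation query.
func waitForDeploymentID(lookup func() (string, error), interval, timeout time.Duration) (string, error) {
	deadline := time.After(timeout)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		id, err := lookup()
		if err != nil {
			return "", err
		}
		if id != "" {
			return id, nil
		}
		// Empty ID: wait for the next tick, or stop once the deadline fires.
		select {
		case <-deadline:
			return "", errNoDeployment
		case <-ticker.C:
		}
	}
}

func main() {
	// Simulated lookup that never yields a deployment ID, mirroring the
	// "empty deployment" retries in the log output earlier in this issue.
	empty := func() (string, error) { return "", nil }

	// Short durations keep the demo quick; the commit message mentions 60s.
	_, err := waitForDeploymentID(empty, 10*time.Millisecond, 50*time.Millisecond)
	fmt.Println("result:", err)
}
```

With a bounded wait, an evaluation that never produces a deployment ends in a clear error instead of an endless stream of retry messages.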
jrasell added a commit that referenced this issue Feb 25, 2019
jrasell added a commit that referenced this issue Feb 25, 2019
Fix deploy loop bug when evaluation didn't include a deployment ID.