Canary promotion reports wrong healthy count, preventing promotion #6407

Closed
djenriquez opened this issue Sep 30, 2019 · 29 comments

Comments


djenriquez commented Sep 30, 2019

Nomad version

0.9.5, 0.10.0

Operating system and Environment details

Amazon Linux 2

Issue

When doing canary blue/green deployments, the promote API incorrectly reports failed counts rather than the final healthy count, preventing promotions from occurring.

The screenshots below do not show the original deployment with failed allocations since I created a new deployment to try to mitigate the issue. However, even in this job version, the healthy count does not match what the promote API is reporting.

This issue happens 100% of the time that allocations fail during a canary deploy, even if they are rescheduled and end up healthy.

Also an important detail: IT ALWAYS WORKS when there are no failed allocations. Only when allocations fail does this happen.

Screen Shot 2019-09-30 at 4 20 19 PM
Screen Shot 2019-09-30 at 4 20 29 PM

EDIT: Added Nomad 0.10.0 as an affected version.


djenriquez commented Oct 1, 2019

I created another deployment which ended in failed allocations:

Screen Shot 2019-09-30 at 4 55 01 PM
Screen Shot 2019-09-30 at 4 55 20 PM

A few minutes later, I took a screenshot of the deployments view, which I figured might have more useful information. It looks like more allocations of the new version were added and more of them became healthy as a result, more than it needed, but the promotion still fails with 4/8 unhealthy according to the RPC.
Screen Shot 2019-09-30 at 4 59 10 PM


djenriquez commented Oct 2, 2019

@endocrimes, I built a quick little service to test this, and I'm finding interesting behaviors with deployments. A bit of a tangent, but it might be related to the issue: ultimately, on a deployment, if an allocation goes unhealthy, I expect Nomad to restart the allocation, and it doesn't seem to be doing that. Is that expected? (Never mind, it looks like you need to define a check_restart.)
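For anyone else following along, a minimal check_restart sketch would look something like the following; the path and values are illustrative placeholders, not what randfail actually uses:

service {
  name = "randfail"
  port = "http"

  check {
    type     = "http"
    path     = "/health"   # placeholder health endpoint
    interval = "10s"
    timeout  = "2s"

    # Restart the task after it fails this check `limit` times in a row,
    # giving it a `grace` period after startup before failures count.
    check_restart {
      limit           = 3
      grace           = "10s"
      ignore_warnings = false
    }
  }
}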

Here is randfail, which includes a Nomad jobfile to reproduce the behavior. This is running on Nomad 0.9.5 (server + agent).

To replicate the canary issue with Nomad 0.9.5:

  1. Run nomad job run randfail.hcl and watch the job start, with some allocations failing since FAIL_PERCENTAGE is set at 50%. The allocations will be rescheduled and eventually go healthy, which will result in a successful deploy.
  2. Run nomad job run randfail.canary.hcl to perform the blue/green. Allocations here will fail as well, but eventually the desired healthy count is reached from the rescheduled allocations. However, if you press Promote or use the promote API, it fails, telling you there aren't enough healthy allocations to succeed.


djenriquez commented Oct 16, 2019

Hi guys, any confirmation on whether this is a known issue? Maybe I'm doing something wrong? I would love to get blue/green deployments going for our org; devs feel rolling deploys are a bit painful in terms of speed 😉.


djenriquez commented Oct 29, 2019

Tested this on 0.10.0 and verified that the issue still exists. Given that blue/green deployment is a fundamental feature marketed for Nomad, and that this issue makes the feature unusable, when do we foresee traction on it?

Nomad 0.10.0 canary promotion failure:
Screen Shot 2019-10-29 at 11 21 27 AM

Screen Shot 2019-10-29 at 11 31 50 AM

@dadgar are you able to comment?

@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Oct 29, 2019
@djenriquez

Interesting. It has been 50 minutes past the progress deadline and the deployment is still running, waiting to be promoted, but it cannot be promoted because of the bug described in this issue. So this deployment is completely stuck; it does not fail. Looks like the only way to unstick it is to run another deployment.

@djenriquez

@eveld, I forgot to ask about this in the webinar this morning 😅. I would really appreciate an update, as I'd love to sell blue/green deploys to my org but can't promote them currently because of this issue... could it get some ❤️?

@schmichael schmichael added this to the near-term milestone Nov 8, 2019
@tgross tgross self-assigned this Nov 8, 2019

tgross commented Nov 8, 2019

Hi @djenriquez, sorry it took so long to get back to you on this. I've been trying to replicate the behavior you're showing here but unfortunately I haven't been able to do so.

I've been using the randfail jobs you provided (thanks so much for providing test cases!). I checked through the spec pretty carefully along with the docs to make sure that this isn't just a misconfiguration or documentation confusion.

It looks like the initial randfail.hcl job can take a very long time to converge to a successful state if the random number generation is unkind. In my most recent run here I had to place 25 containers before I got 10 that were ok. But once it was running, I ran randfail.canary.hcl, waited for 10 healthy canaries to appear, and was able to promote with no problems.

▶ nomad deployment status b69f6388
ID          = b69f6388
Job ID      = randfail
Job Version = 1
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
randfail    true      10       10        15      10       5          2019-11-08T21:02:55Z

I've assigned this issue to myself and I want to see if I can create a faster-converging test case, but first I need to be able to repro.

I did make one modification to the jobs which was to reduce the amount of memory used so that I could fit the deployment on my test rig. And I did notice on one run that one of the previous deploy's tasks failed after the deployment had been marked healthy. Given the size of the jobs we're running here, is there any chance that some of the tasks are becoming unhealthy in your environment because of resource restrictions, and that's fouling the deployment?


djenriquez commented Nov 9, 2019 via email


djenriquez commented Nov 10, 2019

@tgross I verified. In a single server deployment, everything works well. I reduced the server count to 1 for our development environment and ran through the randfail tests:

Screen Shot 2019-11-09 at 4 29 39 PM

Promoting this deployment results in a successful deployment:
Screen Shot 2019-11-09 at 4 29 47 PM

However, when I increased the server count to just 2 servers, the issue returned:

With two servers (notice the previous deployment was successful; it is the deployment shown above):
Screen Shot 2019-11-09 at 4 36 44 PM

Screen Shot 2019-11-09 at 4 36 58 PM

So this issue only exists in multi-server environments, which I would think would be every production environment that uses Nomad.

It appears there are state synchronization issues. A bug in Raft?


tgross commented Nov 11, 2019

Thanks for that test! That's really interesting and probably narrows down the set of places where this bug could be lurking. I'll continue my investigation and circle back here soon.


tgross commented Nov 11, 2019

Ok, I've been able to reproduce and found some interesting logs! I went through the process and waited for the deployment status to report 10 healthy, and then promoted it:

▶ nomad deployment status 6fad6d49
ID          = 6fad6d49
Job ID      = randfail
Job Version = 4
Status      = running
Description = Deployment is running but requires manual promotion

Deployed
Task Group  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
randfail    false     10       10        16      10       6          2019-11-11T16:25:49Z
▶ nomad deployment promote 6fad6d49
Error promoting deployment: Unexpected response code: 500 (rpc error: 1 error(s) occurred:

* Task group "randfail" has 5/10 healthy allocations)

The job status is showing something strange, however. We're showing 10 healthy canaries for the deployment but only 15 allocations in the running state. There are 5 allocations showing complete, which is throwing off our calculation here. For service jobs, allocations shouldn't be marked as complete, and there might be some buggy logic in the canary code that doesn't account for that. @djenriquez, can you see if that's happening in your environment as well?

▶ nomad job status randfail
ID            = randfail
Name          = randfail
Submit Date   = 2019-11-11T11:14:00-05:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
randfail    0       0         15       34      34        0

Latest Deployment
ID          = 6fad6d49
Status      = running
Description = Deployment is running but requires manual promotion

Deployed
Task Group  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
randfail    false     10       10        16      10       6          2019-11-11T16:25:49Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
1dc3f436  9a323737  randfail    4        run      running   12m ago     11m47s ago
5b6f4233  5ecd049f  randfail    4        run      running   12m52s ago  12m17s ago
4874a6b8  9a323737  randfail    4        stop     failed    12m52s ago  12m ago
016e1a2f  5ecd049f  randfail    4        stop     complete  12m52s ago  12m21s ago
6563ee3c  5ecd049f  randfail    4        run      running   12m52s ago  12m23s ago
1afd8c03  9a323737  randfail    4        run      running   12m52s ago  12m14s ago
bcd7ed18  5ecd049f  randfail    4        run      running   13m35s ago  13m16s ago
cd1b3113  5ecd049f  randfail    4        run      running   13m35s ago  13m15s ago
15be4779  5ecd049f  randfail    4        stop     failed    13m35s ago  12m52s ago
9cee1388  9a323737  randfail    4        stop     failed    13m35s ago  12m52s ago
969bef4a  5ecd049f  randfail    4        run      running   13m35s ago  13m14s ago
44c5cb3c  9a323737  randfail    4        run      running   13m35s ago  13m13s ago
20eadfa0  5ecd049f  randfail    4        stop     failed    13m35s ago  12m52s ago
4329ba2c  9a323737  randfail    4        stop     failed    13m35s ago  12m52s ago
9bfc5bd5  9a323737  randfail    4        run      running   13m35s ago  13m23s ago
afcf0d9a  9a323737  randfail    4        stop     failed    13m35s ago  12m52s ago
32b781ec  9a323737  randfail    3        stop     complete  14m50s ago  12m21s ago
1a0220ab  9a323737  randfail    3        stop     failed    16m22s ago  14m50s ago
cfa51ca5  9a323737  randfail    3        stop     failed    17m53s ago  16m22s ago
349d9268  9a323737  randfail    3        stop     failed    19m25s ago  17m53s ago
9b86619a  9a323737  randfail    3        run      running   20m57s ago  20m43s ago
751d8b50  9a323737  randfail    3        stop     failed    20m57s ago  19m25s ago
0b574c09  9a323737  randfail    3        stop     failed    22m9s ago   20m57s ago
c4e5be40  5ecd049f  randfail    3        stop     complete  22m9s ago   12m21s ago
05ee3ff6  9a323737  randfail    3        stop     failed    22m9s ago   20m57s ago
6516abdb  9a323737  randfail    3        run      running   22m9s ago   21m33s ago
7ad3f3ed  9a323737  randfail    3        stop     failed    23m ago     22m9s ago
a5a09865  5ecd049f  randfail    3        run      running   23m ago     22m21s ago
65b4c14b  9a323737  randfail    3        stop     failed    23m ago     22m9s ago
1a88ac04  5ecd049f  randfail    3        stop     failed    23m ago     22m9s ago
31f1eb2a  9a323737  randfail    3        stop     failed    23m ago     22m9s ago
7744af58  5ecd049f  randfail    3        run      running   23m ago     22m48s ago
482e2463  5ecd049f  randfail    3        stop     failed    23m42s ago  23m ago
7322b31f  5ecd049f  randfail    3        stop     failed    23m42s ago  23m ago
dc470860  5ecd049f  randfail    3        run      running   23m42s ago  23m11s ago
c77d1258  5ecd049f  randfail    3        run      running   23m42s ago  23m22s ago
eb48de13  9a323737  randfail    3        stop     failed    23m42s ago  23m ago
da0336cf  9a323737  randfail    3        stop     complete  23m42s ago  12m21s ago
ee13d690  5ecd049f  randfail    3        stop     failed    23m42s ago  23m ago
4639a023  9a323737  randfail    3        stop     failed    23m42s ago  23m ago
099e0283  9a323737  randfail    3        stop     failed    23m42s ago  23m ago
c8819235  9a323737  randfail    3        stop     complete  23m42s ago  12m21s ago
...

Here's the alloc status of one of these allocations (it has no logs, however):

▶ nomad alloc status 016e1a2f
ID                  = 016e1a2f-f2c1-ea23-633f-4465f587fa10
Eval ID             = a834cd25
Name                = randfail.randfail[7]
Node ID             = 5ecd049f
Node Name           = ip-172-31-33-81
Job ID              = randfail
Job Version         = 4
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = stop
Desired Description = alloc not needed due to job update
Created             = 18m58s ago
Modified            = 18m27s ago
Deployment ID       = 6fad6d49
Deployment Health   = healthy
Canary              = true

Task "server" is "dead"
Task Resources
CPU        Memory          Disk     Addresses
0/100 MHz  4.2 MiB/64 MiB  300 MiB  http: 172.31.33.81:24409

Task Events:
Started At     = 2019-11-11T16:14:44Z
Finished At    = 2019-11-11T16:15:14Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2019-11-11T11:15:14-05:00  Killed      Task successfully killed
2019-11-11T11:15:14-05:00  Terminated  Exit Code: 2, Exit Message: "Docker container exited with non-zero exit code: 2"
2019-11-11T11:15:14-05:00  Killing     Sent interrupt. Waiting 5s before force killing
2019-11-11T11:14:44-05:00  Started     Task started by client
2019-11-11T11:14:43-05:00  Driver      Downloading image
2019-11-11T11:14:43-05:00  Task Setup  Building Task Directory
2019-11-11T11:14:43-05:00  Received    Task received by client

The server logs also show that each server has a different view of the task group health at the moment we try to do the promotion.

Server 1:

Nov 11 16:16:38 ip-172-31-39-230 nomad[2173]: 2019-11-11T16:16:38.104Z [ERROR] nomad.fsm: UpsertDeploymentPromotion failed: error="1 error(s) occurred:
Nov 11 16:16:38 ip-172-31-39-230 nomad[2173]: * Task group "randfail" has 6/10 healthy allocations"

Server 2:

Nov 11 16:16:38 ip-172-31-37-118 nomad[2161]: 2019-11-11T16:16:38.079Z [ERROR] nomad.fsm: UpsertDeploymentPromotion failed: error="1 error(s) occurred:
Nov 11 16:16:38 ip-172-31-37-118 nomad[2161]: * Task group "randfail" has 8/10 healthy allocations"

Server 3:

Nov 11 16:16:38 ip-172-31-45-45 nomad[2168]: 2019-11-11T16:16:38.020Z [ERROR] nomad.fsm: UpsertDeploymentPromotion failed: error="1 error(s) occurred:
Nov 11 16:16:38 ip-172-31-45-45 nomad[2168]: * Task group "randfail" has 5/10 healthy allocations"

But the weird thing is that this is persistent. Even after waiting long enough to write up all of this report, I retried the promotion and the servers all show the same error messages; they haven't converged on an opinion about task group health. I may need to dig into how that gets determined.


tgross commented Nov 11, 2019

It looks like it isn't the allocations exiting on their own that causes them to be marked complete; rather, Nomad is stopping them. We can see the following in the alloc status:

Desired Status      = stop
Desired Description = alloc not needed due to job update

The eval for that allocation was:

▶ nomad eval status -verbose a834cd25
ID                 = a834cd25-2d9d-1d99-7400-a024d148358f
Create Time        = 2019-11-11T11:14:35-05:00
Modify Time        = 2019-11-11T11:14:43-05:00
Status             = complete
Status Description = complete
Type               = service
TriggeredBy        = alloc-failure
Priority           = 50
Placement Failures = false
Wait Until         = 2019-11-11T16:14:43Z
Previous Eval      = 2b663f9d-1c2a-8b8b-f86b-c9fd8c042bdd
Next Eval          = <none>
Blocked Eval       = <none>

Which has the following associated log entries:

2019-11-11T16:14:43.746Z [DEBUG] worker: dequeued evaluation: eval_id=a834cd25-2d9d-1d99-7400-a024d148358f
2019-11-11T16:14:43.746Z [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=a834cd25-2d9d-1d99-7400-a024d148358f job_id=randfail namespace=default results="Total changes: (place 5) (destructive 0) (inplace 0) (stop 5)
 Desired Changes for "randfail": (place 5) (inplace 0) (destructive 0) (stop 5) (migrate 0) (ignore 15) (canary 0)"
2019-11-11T16:14:43.760Z [DEBUG] worker: submitted plan for evaluation: eval_id=a834cd25-2d9d-1d99-7400-a024d148358f
2019-11-11T16:14:43.760Z [DEBUG] worker.service_sched: setting eval status: eval_id=a834cd25-2d9d-1d99-7400-a024d148358f job_id=randfail namespace=default status=complete
2019-11-11T16:14:43.768Z [DEBUG] worker: updated evaluation: eval="<Eval "a834cd25-2d9d-1d99-7400-a024d148358f" JobID: "randfail" Namespace: "default">"
2019-11-11T16:14:43.769Z [DEBUG] worker: ack evaluation: eval_id=a834cd25-2d9d-1d99-7400-a024d148358f

So now we need to figure out why Nomad decided that we were supposed to stop these allocations.


tgross commented Nov 11, 2019

I've been leaving this job running as I've been debugging, and a random failed health check on one of the tasks kicked off a new round of evaluations and a few more completed tasks. This brought the count of allegedly healthy allocations up to 8/10.

And here are the allocations for our job, which now show more than 10 at version 4:

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
583cbfad  9a323737  randfail    4        stop     complete  1h17m ago  1h17m ago
f10ea6cb  5ecd049f  randfail    4        stop     complete  1h17m ago  1h17m ago
aebcfb34  5ecd049f  randfail    4        run      running   1h17m ago  1h16m ago
04fbda89  9a323737  randfail    4        run      running   1h17m ago  1h16m ago
605fe0f4  9a323737  randfail    4        stop     complete  1h17m ago  1h17m ago
fe72e828  5ecd049f  randfail    4        run      running   1h17m ago  1h16m ago
1d02f273  5ecd049f  randfail    4        stop     complete  1h17m ago  1h17m ago
401019b0  9a323737  randfail    4        stop     complete  1h17m ago  1h17m ago
9e690906  9a323737  randfail    4        run      running   1h18m ago  1h17m ago
24772c81  5ecd049f  randfail    4        run      running   1h18m ago  1h17m ago
8a42783e  9a323737  randfail    4        stop     complete  1h18m ago  1h17m ago
cf890b71  9a323737  randfail    4        stop     failed    1h18m ago  1h17m ago
1dc3f436  9a323737  randfail    4        run      running   3h39m ago  3h39m ago
6563ee3c  5ecd049f  randfail    4        run      running   3h40m ago  3h39m ago
1afd8c03  9a323737  randfail    4        run      running   3h40m ago  3h39m ago
5b6f4233  5ecd049f  randfail    4        stop     complete  3h40m ago  1h17m ago
969bef4a  5ecd049f  randfail    4        run      running   3h41m ago  3h40m ago
cd1b3113  5ecd049f  randfail    4        run      running   3h41m ago  3h40m ago
bcd7ed18  5ecd049f  randfail    4        run      running   3h41m ago  3h40m ago
9bfc5bd5  9a323737  randfail    4        run      running   3h41m ago  3h40m ago
44c5cb3c  9a323737  randfail    4        run      running   3h41m ago  3h40m ago
9b86619a  9a323737  randfail    3        stop     complete  3h48m ago  1h17m ago
6516abdb  9a323737  randfail    3        run      running   3h49m ago  3h49m ago
7744af58  5ecd049f  randfail    3        run      running   3h50m ago  3h50m ago
a5a09865  5ecd049f  randfail    3        run      running   3h50m ago  3h49m ago
c77d1258  5ecd049f  randfail    3        stop     complete  3h51m ago  1h17m ago
dc470860  5ecd049f  randfail    3        run      running   3h51m ago  3h50m ago

Which the deployment status agrees with:

▶ nomad deployment status 6fad6d49
ID          = 6fad6d49
Job ID      = randfail
Job Version = 4
Status      = running
Description = Deployment is running but requires manual promotion

Deployed
Task Group  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
randfail    false     10       10        28      17       11         2019-11-11T18:48:21Z

@djenriquez

@tgross, working through some of your questions right now. The allocations are rescheduled via the reschedule stanza timeouts, so setting those to lower values, along with tighter health checking, could speed up the test case quite a bit.
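For context, the reschedule knobs I mean look roughly like this; the values are illustrative placeholders to tighten the test loop, not what randfail currently uses:

group "randfail" {
  reschedule {
    # Replace a failed allocation quickly and keep trying, so the
    # canary deployment reaches its healthy count sooner.
    delay          = "5s"
    delay_function = "constant"
    unlimited      = true
  }
}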


djenriquez commented Nov 11, 2019

I'm running on a 2-server Nomad deployment, and each server shows the same status (might just be my luck; I'll try to increase the fail percentage after this). The output is the same from both:

[ec2-user@ip-10-179-163-250 ~]$ nomad deployment status d6e28944
ID          = d6e28944
Job ID      = randfail
Job Version = 6
Status      = running
Description = Deployment is running but requires manual promotion

Deployed
Task Group  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
randfail    false     10       10        15      10       5          2019-11-11T20:53:05Z
[ec2-user@ip-10-179-163-250 ~]$ nomad deployment promote d6e28944
Error promoting deployment: Unexpected response code: 500 (rpc error: 1 error(s) occurred:

* Task group "randfail" has 9/10 healthy allocations)
[ec2-user@ip-10-179-158-186 ~]$ nomad deployment status d6e28944
ID          = d6e28944
Job ID      = randfail
Job Version = 6
Status      = running
Description = Deployment is running but requires manual promotion

Deployed
Task Group  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
randfail    false     10       10        15      10       5          2019-11-11T20:53:05Z
[ec2-user@ip-10-179-158-186 ~]$ nomad deployment promote d6e28944
Error promoting deployment: Unexpected response code: 500 (1 error(s) occurred:

* Task group "randfail" has 9/10 healthy allocations)

ip-10-179-158-186 is reporting as the leader.

When I list the allocations, I do see an alloc of the previous (blue) job version that shows as complete:

[ec2-user@ip-10-179-163-250 ~]$ nomad alloc status f34d7544
ID                  = f34d7544
Eval ID             = 0b3801cd
Name                = randfail.randfail[9]
Node ID             = 9590c3a5
Node Name           = i-053006cf1b10c0c05
Job ID              = randfail
Job Version         = 5
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = stop
Desired Description = alloc not needed due to job update
Created             = 12m47s ago
Modified            = 7m39s ago
Deployment ID       = f978b097
Deployment Health   = healthy

Task "server" is "dead"
Task Resources
CPU        Memory           Disk     Addresses
0/100 MHz  728 KiB/300 MiB  300 MiB  http: 10.179.173.53:22187

Task Events:
Started At     = 2019-11-11T20:35:22Z
Finished At    = 2019-11-11T20:40:28Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2019-11-11T20:40:28Z  Killed      Task successfully killed
2019-11-11T20:40:28Z  Terminated  Exit Code: 2, Exit Message: "Docker container exited with non-zero exit code: 2"
2019-11-11T20:40:28Z  Killing     Sent interrupt. Waiting 5s before force killing
2019-11-11T20:35:22Z  Started     Task started by client
2019-11-11T20:35:20Z  Driver      Downloading image
2019-11-11T20:35:20Z  Task Setup  Building Task Directory
2019-11-11T20:35:20Z  Received    Task received by client

Its evaluation:

[ec2-user@ip-10-179-163-250 ~]$ nomad eval status -verbose 0b3801cd 
ID                 = 0b3801cd-b1a9-2beb-5eac-04b52f3ee7ec
Create Time        = 2019-11-11T20:36:21Z
Modify Time        = 2019-11-11T20:36:21Z
Status             = complete
Status Description = complete
Type               = service
TriggeredBy        = job-register
Job ID             = randfail
Priority           = 50
Placement Failures = false
Previous Eval      = <none>
Next Eval          = <none>
Blocked Eval       = <none>


tgross commented Nov 11, 2019

The slow rotation of tasks onto version 4 made me suspicious. So I caused one of the version 3 tasks to lose its network connectivity and fail its health check, and in response Nomad created multiple version 4 tasks:

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
11d0d6dd  5ecd049f  randfail    4        run      running   9s ago     8s ago
e8405340  9a323737  randfail    4        run      running   9s ago     8s ago
ac38de6c  9a323737  randfail    4        run      running   13s ago    12s ago
8b0c6453  9a323737  randfail    4        stop     complete  22s ago    8s ago
1c04ea9a  9a323737  randfail    4        run      running   22s ago    10s ago
418bcfa6  9a323737  randfail    4        stop     complete  22s ago    8s ago
aebcfb34  5ecd049f  randfail    4        run      running   2h10m ago  2h9m ago
04fbda89  9a323737  randfail    4        run      running   2h10m ago  2h9m ago
fe72e828  5ecd049f  randfail    4        run      running   2h10m ago  2h9m ago
9e690906  9a323737  randfail    4        run      running   2h11m ago  2h10m ago
24772c81  5ecd049f  randfail    4        run      running   2h11m ago  2h10m ago
1dc3f436  9a323737  randfail    4        run      running   4h32m ago  4h32m ago
6563ee3c  5ecd049f  randfail    4        run      running   4h33m ago  4h32m ago
1afd8c03  9a323737  randfail    4        run      running   4h33m ago  4h32m ago
969bef4a  5ecd049f  randfail    4        run      running   4h34m ago  4h33m ago
9bfc5bd5  9a323737  randfail    4        run      running   4h34m ago  4h33m ago
cd1b3113  5ecd049f  randfail    4        run      running   4h34m ago  4h33m ago
44c5cb3c  9a323737  randfail    4        run      running   4h34m ago  4h33m ago
bcd7ed18  5ecd049f  randfail    4        run      running   4h34m ago  4h33m ago
6516abdb  9a323737  randfail    3        run      running   4h42m ago  4h42m ago
7744af58  5ecd049f  randfail    3        run      running   4h43m ago  4h43m ago
a5a09865  5ecd049f  randfail    3        run      running   4h43m ago  4h42m ago
dc470860  5ecd049f  randfail    3        stop     failed    4h44m ago  13s ago

At this point, Nomad thinks we have 10/10 and a request to promote succeeds:

▶ nomad deployment promote  6fad6d49
==> Monitoring evaluation "bc955bbc"
    Evaluation triggered by job "randfail"
    Evaluation within deployment: "6fad6d49"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "bc955bbc" finished with status "complete"

The version 3 allocations are then stopped, as are any version 4 allocations beyond the 10 we want.

@djenriquez at this point I've got a working reproduction and at least some theories as to where the problems could be occurring. I'm going to try to create a more minimal test case and start digging into the areas of code we're touching here next.


tgross commented Nov 13, 2019

@djenriquez I've run a number of tests changing some of the parameters to see if I can narrow the search space a bit. With a smaller job that fails a bit less often, or with a smaller number of canaries, I've found it's easier to get a successful or failed deployment than this mysterious stuck deployment. So, worst case, in the meantime you might want to consider using a smaller set of canaries rather than full blue/green until we have this figured out.
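Concretely, that interim workaround would be an update stanza where canary is smaller than the group count rather than equal to it; this is just an illustrative sketch, not a recommendation of specific values:

group "randfail" {
  count = 10

  update {
    max_parallel     = 2
    canary           = 2        # fewer canaries than count, i.e. not full blue/green
    health_check     = "checks"
    min_healthy_time = "10s"
    auto_promote     = false
  }
}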

Here's where we are so far. I've done some scripting to parse through the evaluations API and, unfortunately, it looks like there may be two different buggy behaviors to unpack here. The obvious symptoms described as (1) below aren't necessarily the direct cause of the deploys getting stuck (2), because we can get stuck without (1) happening. But because I can't reproduce either of these behaviors without canaries, that at least narrows it down to the canary code.


  1. I've found that in many test runs where we're deploying job version N+1, there are allocations of job version N that are being prematurely replaced. Random version N allocations are being marked for stopping before we try to promote the job, and even before all the canaries have been deployed. The allocation status includes the following:
Desired Status      = stop
Desired Description = alloc not needed due to job update

That particular message only shows up in a few places in the code base, mostly around the handleGroupCanaries() method in scheduler/reconcile.go#L563-L613. So that'll be the obvious next place to look.

This graph shows an example of an in-flight deployment. The blue allocations on the right hand side (1a6eea8 and 5d81eb5) were v0 allocations that Nomad marked for stopping as above, even though the v1 deploy is still ongoing. This deployment goes on for quite a while and results in many many more evaluations before getting stuck.

graph


  2. Even if the scenario above doesn't happen, if we have a failure of an allocation we can get into the "stuck" state where the servers believe they have only, for example, 4 out of 5 canaries healthy. The Nomad CLI gives the following in this scenario:
▶ nomad deployment promote 1358f244
Error promoting deployment: Unexpected response code: 500 (rpc error: 1 error(s) occurred:

* Task group "randfail" has 4/5 healthy allocations)

▶ nomad job status randfail
ID            = randfail
Name          = randfail
Submit Date   = 2019-11-13T11:49:34-05:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
randfail    0       0         10       1       5         0

Latest Deployment
ID          = 1358f244
Status      = running
Description = Deployment is running but requires manual promotion

Deployed
Task Group  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
randfail    false     5        5         6       5        1          2019-11-13T17:00:04Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
628b73d1  54eca36f  randfail    2        run      running   2m20s ago  2m8s ago
4dd362ed  4f1ca2ae  randfail    2        run      running   2m37s ago  2m22s ago
205e4dff  54eca36f  randfail    2        run      running   2m37s ago  2m23s ago
d5674105  54eca36f  randfail    2        run      running   2m37s ago  2m24s ago
3afc7d56  4f1ca2ae  randfail    2        stop     failed    2m37s ago  2m20s ago
f1d59c8d  54eca36f  randfail    2        run      running   2m37s ago  2m24s ago
4b2996a1  54eca36f  randfail    1        run      running   3m3s ago   2m47s ago
6358e75c  54eca36f  randfail    1        run      running   3m3s ago   2m51s ago
95cca0e0  4f1ca2ae  randfail    1        run      running   3m3s ago   2m49s ago
ab75f24b  4f1ca2ae  randfail    1        run      running   3m3s ago   2m51s ago
d8dcf3e5  4f1ca2ae  randfail    1        run      running   3m38s ago  3m24s ago
5905a42b  54eca36f  randfail    0        stop     complete  4m34s ago  3m3s ago
c5587363  54eca36f  randfail    0        stop     complete  4m34s ago  3m3s ago
346a1a4f  4f1ca2ae  randfail    0        stop     complete  4m34s ago  3m3s ago
4732a9a2  4f1ca2ae  randfail    0        stop     complete  4m34s ago  3m3s ago
1d68b747  54eca36f  randfail    0        stop     complete  4m34s ago  3m3s ago

This is the stuck-state graph, taken after the progress deadline should have expired. The failed alloc is the red alloc 3afc7d56 on the left. The deployment 1358f244 (blue) hasn't been marked as failed, and we have 5 green canaries, but we can't promote it.

graph


tgross commented Nov 13, 2019

With a slightly improved version of the script I'm using to process the data, I think I've got a clue. The graph below is the same data as (2) above, but with the addition of the blue arrows that show the canaries that the deployments are tracking. Note that our stuck deployment deploy:1358f24 has no arrow to alloc:628b73d, which replaced the failed alloc:3afc7d5.

So it looks like the deployment is not tracking restarted canaries.

graph


tgross commented Nov 13, 2019

The odd thing here is that we'd expect that this rescheduled allocation shouldn't have been scheduled at all:

https://www.nomadproject.io/docs/job-specification/reschedule.html#rescheduling-during-deployments

Rescheduling during deployments

The update stanza controls rolling updates and canary deployments. A task group's reschedule stanza does not take effect during a deployment. For example, if a new version of the job is rolled out and the deployment failed due to a failing allocation, Nomad will not reschedule it.

But the raw data definitely says that eval:f3133b4e was triggered due to "alloc-failure"; the prior eval:caa77946 says it was triggered by "deployment-watcher".


tgross commented Nov 13, 2019

Another workaround for you while we work on this, @djenriquez: if you have the restart stanza set, the rescheduling process won't need to kick in nearly as often, which should make for more reliable deployments. (And they're faster too, because it's just restarting the task in place.)
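A sketch of what I mean, with illustrative values; tune these to your workload:

group "randfail" {
  restart {
    # Let the client restart the failed task in place a few times
    # before giving up and marking the allocation as failed.
    attempts = 3
    interval = "5m"
    delay    = "15s"
    mode     = "fail"
  }
}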

@djenriquez
Copy link
Author

djenriquez commented Nov 14, 2019

This is incredibly impressive work on the investigation @tgross. I'm not sure where I can help you, but I'm more than happy to do so if there are things I'm able to do.

As for the workaround, that actually doesn't work well for us, since the problem we experience that triggers this is related to #6567, which fails allocations immediately. Unfortunately, it happens more often than one would hope.

There is also #6350, which was pre-0.10.0; back then we had to register Consul services at the task level, so we registered them on the sidecar proxy task since that represented the network for the service. With a restart stanza on the sidecar proxy's service, Nomad would only send a shutdown signal to the proxy, not the app itself, completely nullifying the intent of the restart stanza: the proxy would get restarted over and over while the app stayed in its bad state. It may be better with 0.10.0, since the service and check are registered at the task group level with network namespaces. Something for me to verify.


tgross commented Nov 14, 2019

As for the workaround, that actually doesn't work well for us, since the problem we experience that triggers this is related to #6567, which fails allocations immediately. Unfortunately, it happens more often than one would hope.

Ah, right... well, the good news is that my PR to the CNI plugins project was accepted, so hopefully they'll be cutting a release with that soon.

With a restart stanza on the sidecar proxy's service, Nomad would only send a shutdown signal to the proxy, not the app itself, completely nullifying the intent of the restart stanza: the proxy would get restarted over and over while the app stayed in its bad state.

That issue is on my radar as well!


tgross commented Nov 15, 2019

After some more testing I have working single node tests and I've been breaking down the symptoms. There are two real problems and one false one:

  1. Deployments with canaries but without restarts should fail when any canary fails, but instead they get stuck.

    • If an alloc passes health checks and is marked as healthy but later dies, the deployment status doesn't get updated and this is how we get stuck.
    • If an allocation never passes health checks and is declared unhealthy, it's being rescheduled (unexpectedly).
    • I've written what I expected to be a failing unit test for nomad/deploymentwatcher/deploymentwatcher.go, but it behaves correctly assuming that the alloc deployment status gets updated.
    • The logs I have suggest that the client isn't updating the deployment status correctly, so I'm digging into client/allocrunner next.
  2. When a deployment is running, a failed alloc from a previous job version should be replaced with an alloc of that previous version, not with an alloc of the deployment's new version.

    • This should be controlled by scheduler/reconcile.go so that's where I'm looking next.

One of the symptoms we've seen in the large-scale tests is actually correct behavior, but it emerges when the other bugs are happening. When a deployment of a job with canaries gets stuck, the next deployment will cause all of its canaries to be stopped, because they were never promoted. When this happens alongside (1), it looks like we're stopping previous job version allocs prematurely.


tgross commented Nov 19, 2019

The issue I've identified as (2) above turns out to be a known tradeoff in the scheduler, which is in our internal design docs but not documented in public. I've opened #6723 to make sure this gets documented.

That leaves:

If an alloc passes health checks and is marked as healthy but later dies, the deployment status doesn't get updated and this is how we get stuck.

For each allocation that's part of a deployment, we start up a health_hook on the client. This health hook watches the allocation (see the relevant code here) until it is marked healthy (where "healthy" is defined by update.health_check), or until the update.healthy_deadline passes, or until the allocation stops. Once one of these events happens, we report the status to the server's deployment watcher and then exit the client-side watcher.

Which means if the allocation fails after it initially reports it's healthy, we're in a state where the deployment thinks everyone is healthy, but when we promote the deployment to make that a reality, the server's state machine rejects the change because there aren't enough canary allocations that are actually healthy.

The "obvious" fix to do is to not exit the watcher loop, but allow subsequent health updates to change the deployment status. But it's not clear how this should interact with the progress deadline. If the progress deadline has passed with all canaries healthy, but one of the canaries fails, what should happen? Are there other implications to the deployment behavior I haven't thought of?

Alternately (as a somewhat incomplete fix), we could have the promotion process fail the deployment if it hits an error past the promotion deadline.


Minimal reproduction of the remaining issue:

# start a job
nomad job run ./test.hcl

# wait for the deployment to succeed
nomad deployment status $id

# bump the env.version and redeploy
nomad job run ./test.hcl

# wait for all 3 allocs to be marked healthy, so that the 
# deployment is pending manual approval
nomad deployment status $id

# note that we have 6 running allocs
nomad job status test

# kill one of the new containers
docker kill $(docker ps | awk '/nginx/{print $1}' | head -1)

# wait for progress deadline to expire, deployment will not be failed
# and promotion will be stuck
nomad job status test
nomad deployment status $id
nomad deployment promote $id

jobspec:

job "test" {
  datacenters = ["dc1"]

  group "webservers" {
    count = 3
    task "nginx" {
      driver = "docker"

      config {
        image = "nginx:latest"
        port_map = {
          http = 80
        }
      }

      env {
        version = "0"
      }

      service {
        name = "nginx"
        port = "http"

        check {
          type     = "http"
          port     = "http"
          path     = "/"
          interval = "5s"
          timeout  = "3s"
          check_restart {
            limit = 1
            grace = "5s"
            ignore_warnings = false
          }
        }
      }

      resources {
        memory = 64

        network {
          mbits = 10
          port "http" {}
        }
      }
    }

    restart {
      attempts = 0
      delay    = "10s"
    }

    update {
        max_parallel      = 3
        health_check      = "checks"
        min_healthy_time  = "5s"
        healthy_deadline  = "30s"
        progress_deadline = "2m"
        auto_revert       = false
        auto_promote      = false
        canary            = 3
    }

  }
}

@djenriquez

@tgross, I think that as a system user, I would expect that during a deployment, if a canary goes healthy and then fails, a new allocation needs to be added to make up for the failed allocation that happened during the deployment. As for the progress_deadline, I think it carries the same logic we would expect: failed allocations do not alter it, as it would just be watching for new allocations to go healthy.

The edge case I see in this is the possibility of an endless deployment, where an allocation always goes unhealthy and is replaced before the final allocation can be marked as healthy. In this case, the progress deadline will continually be reset as new allocations come in and become healthy for a period of time. Although I don't think this is incorrect, as that should be the responsibility of the healthy_deadline. If allocations are failing later, then a user has not represented the health lifecycle of their application well enough in the job config.
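For reference, my mental model of the two deadlines, as an illustrative sketch rather than our actual job config:

update {
  canary            = 10
  health_check      = "checks"

  # How long a single allocation has to pass its health checks
  # before it is marked unhealthy.
  healthy_deadline  = "2m"

  # How long the deployment as a whole can go without any allocation
  # becoming healthy before it is failed; resets each time an
  # allocation is marked healthy.
  progress_deadline = "10m"
}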

The "obvious" fix to do is to not exit the watcher loop, but allow subsequent health updates to change the deployment status. But it's not clear how this should interact with the progress deadline. If the progress deadline has passed with all canaries healthy, but one of the canaries fails, what should happen? Are there other implications to the deployment behavior I haven't thought of?

Also, taking a step back, I want to make sure the focus is on the correct issue. Yes, I do see the problematic scenario presented by this allocation watcher, but why is it not an issue when there is a single server? That to me seems like more of a shared state issue than it is a logical issue with how the system handles failed allocations during a deploy. Remember, with a single server, it didn't matter how many times the allocations failed during the deploy, once the healthy number was hit, we were able to promote.


tgross commented Nov 19, 2019

The edge case I see in this is the possibility of an endless deployment, where an allocation always goes unhealthy and is replaced before the final allocation can be marked as healthy. In this case, the progress deadline will continually be reset as new allocations come in and become healthy for a period of time.

Yeah, that's a good point.

Remember, with a single server, it didn't matter how many times the allocations failed during the deploy, once the healthy number was hit, we were able to promote.

I know I ended up spewing a whole bunch of debugging info here so it would have been easy to miss 😀 , but it turned out starting in #6407 (comment) that I was able to reproduce the behavior with a single node, even just in dev mode.

@djenriquez

I know I ended up spewing a whole bunch of debugging info here so it would have been easy to miss 😀 , but it turned out starting in #6407 (comment) that I was able to reproduce the behavior with a single node, even just in dev mode.

Ahhhh, I didn't know that. Interesting, so the problem goes deeper. Thanks for clarifying that for me.


tgross commented Nov 20, 2019

I've opened #6740 as our way to cover the remaining issue here. There are also a bunch of documentation improvements we should make, starting with #6723.

@tgross tgross closed this as completed Nov 20, 2019
Nomad - Community Issues Triage automation moved this from In Progress to Done Nov 20, 2019
@tgross tgross removed this from the near-term milestone Jan 9, 2020
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 14, 2022