Calico panics starting in v3.7.1 (v3.7.0 was fine) #2596

Closed
dghubble opened this issue May 7, 2019 · 6 comments

3 participants
@dghubble
Contributor

commented May 7, 2019

Current Behavior

Starting in v3.7.1, calico-node does not become ready and logs show panics and errors. This wasn't the case with v3.7.0.

NAME                                       READY     STATUS    RESTARTS   AGE
calico-node-4w8x6                          0/1       Running   0          9m31s
calico-node-9gk92                          0/1       Running   1          9m31s
calico-node-hw8gb                          0/1       Running   0          9m31s
calico-node-zqswb                          0/1       Running   2          9m31s

2019-05-07 03:14:05.009 [WARNING][3932] int_dataplane.go 727: failed to wipe the XDP state error=failed to attach XDP program (/sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A) to calico_tmp_A: exit status 255
Error: either "dev" is duplicate, or "xdpgeneric" is a garbage.                                                                                                                                                    
 try=8                                                                                                                                                                                                             
2019-05-07 03:14:05.037 [INFO][3932] int_dataplane.go 584: Linux interface addrs changed. addrs=set.mapSet{} ifaceName="calico_tmp_B"                                                                              
2019-05-07 03:14:05.037 [INFO][3932] int_dataplane.go 584: Linux interface addrs changed. addrs=set.mapSet{} ifaceName="calico_tmp_A"                                                                              
2019-05-07 03:14:05.037 [INFO][3932] int_dataplane.go 584: Linux interface addrs changed. addrs=<nil> ifaceName="calico_tmp_A"                                                                                     
2019-05-07 03:14:05.037 [INFO][3932] int_dataplane.go 584: Linux interface addrs changed. addrs=<nil> ifaceName="calico_tmp_B"                                                                                     
2019-05-07 03:14:05.041 [WARNING][3932] int_dataplane.go 727: failed to wipe the XDP state error=failed to attach XDP program (/sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A) to calico_tmp_A: exit status 255  
Error: either "dev" is duplicate, or "xdpgeneric" is a garbage.                                                                                                                                                    
 try=9                                                                                                                                                                                                             
2019-05-07 03:14:05.041 [PANIC][3932] int_dataplane.go 730: Failed to wipe the XDP state after 10 tries                                                                                                            
panic: (*logrus.Entry) (0x19f3620,0xc0008dceb0)                                                                                                                                                                    

goroutine 83 [running]:
github.com/projectcalico/node/vendor/github.com/sirupsen/logrus.Entry.log(0xc0000ca0a0, 0xc00089e0f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)                                                                
        /go/src/github.com/projectcalico/node/vendor/github.com/sirupsen/logrus/entry.go:128 +0x5a8
github.com/projectcalico/node/vendor/github.com/sirupsen/logrus.(*Entry).Panic(0xc0000ca4b0, 0xc00095f940, 0x1, 0x1)                                                                                              
        /go/src/github.com/projectcalico/node/vendor/github.com/sirupsen/logrus/entry.go:173 +0xb2
github.com/projectcalico/node/vendor/github.com/sirupsen/logrus.(*Entry).Panicf(0xc0000ca4b0, 0x1a8f5fb, 0x2b, 0xc00095fa28, 0x1, 0x1)                                                                            
        /go/src/github.com/projectcalico/node/vendor/github.com/sirupsen/logrus/entry.go:221 +0xed
github.com/projectcalico/node/vendor/github.com/sirupsen/logrus.(*Logger).Panicf(0xc0000ca0a0, 0x1a8f5fb, 0x2b, 0xc00095fa28, 0x1, 0x1)                                                                           
        /go/src/github.com/projectcalico/node/vendor/github.com/sirupsen/logrus/logger.go:173 +0x85
github.com/projectcalico/node/vendor/github.com/sirupsen/logrus.Panicf(0x1a8f5fb, 0x2b, 0xc00095fa28, 0x1, 0x1)                                                                                                   
        /go/src/github.com/projectcalico/node/vendor/github.com/sirupsen/logrus/exported.go:145 +0x5f
github.com/projectcalico/node/vendor/github.com/projectcalico/felix/dataplane/linux.(*InternalDataplane).shutdownXDPCompletely(0xc0003af800)                                                                      
        /go/src/github.com/projectcalico/node/vendor/github.com/projectcalico/felix/dataplane/linux/int_dataplane.go:730 +0x1ea                                                                                   
github.com/projectcalico/node/vendor/github.com/projectcalico/felix/dataplane/linux.(*InternalDataplane).apply(0xc0003af800)                                                                                      
        /go/src/github.com/projectcalico/node/vendor/github.com/projectcalico/felix/dataplane/linux/int_dataplane.go:1052 +0x80f                                                                                  
github.com/projectcalico/node/vendor/github.com/projectcalico/felix/dataplane/linux.(*InternalDataplane).loopUpdatingDataplane(0xc0003af800)                                                                      
        /go/src/github.com/projectcalico/node/vendor/github.com/projectcalico/felix/dataplane/linux/int_dataplane.go:911 +0x583                                                                                   
created by github.com/projectcalico/node/vendor/github.com/projectcalico/felix/dataplane/linux.(*InternalDataplane).Start                                                                                         
        /go/src/github.com/projectcalico/node/vendor/github.com/projectcalico/felix/dataplane/linux/int_dataplane.go:541 +0x51

Possible Solution

Sounds like the base image changed from v3.7.0 -> v3.7.1. Possibly related. I'd guess there's only a small diff between point releases.

On my guinea pig cluster to check v3.7.1, I was able to revert to v3.7.0 and Calico runs without panics.

kubectl edit ds calico-node -n kube-system
# edit version

Steps to Reproduce (for bugs)

Manifests: https://github.com/poseidon/terraform-render-bootkube/tree/master/resources/calico

Your Environment

  • Calico version: v3.7.1
  • Orchestrator version (e.g. kubernetes, mesos, rkt): Kubernetes (kdd)
  • Operating System and version: Container Linux stable
  • Link to your project (optional): poseidon/typhoon#465
@fasaxc

Member

commented May 7, 2019

Thanks for the report. It looks like projectcalico/node#224 backed us down to a too-old version of ip, which doesn't support the flag we need.

As an aside, Calico may get confused if you have non-workload interfaces that begin with the prefix cali; by default, we use that prefix to detect interfaces that we should treat as workload interfaces.
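
To illustrate the aside, here is a minimal Go sketch of prefix-based interface matching. It assumes a plain string-prefix check against the default prefix `cali`. The function name `hasWorkloadPrefix` is hypothetical, and Felix's real matching logic may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// hasWorkloadPrefix reports whether name begins with any of the given
// prefixes. Hypothetical helper; not Felix's actual implementation.
func hasWorkloadPrefix(name string, prefixes []string) bool {
	for _, p := range prefixes {
		if strings.HasPrefix(name, p) {
			return true
		}
	}
	return false
}

func main() {
	// "cali" is the default prefix mentioned in the comment above.
	prefixes := []string{"cali"}
	for _, name := range []string{"calico_tmp_A", "eth0"} {
		fmt.Printf("%s matches=%v\n", name, hasWorkloadPrefix(name, prefixes))
	}
}
```

Under this simple check, a name such as `calico_tmp_A` from the log also begins with `cali`, while `eth0` does not.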

@fasaxc

Member

commented May 7, 2019

This may fix it: projectcalico/node#232

@fasaxc added this to the Calico v3.7.2 milestone May 7, 2019

@fasaxc self-assigned this May 7, 2019

@fasaxc

Member

commented May 7, 2019

@dghubble Can you explain the scenario that you're testing? I'd only expect Calico to try to attach an XDP program if the policy you're applying to a host endpoint was a doNotTrack policy with a deny rule as the first rule. Just wanted to check that was the case here to make sure that the XDP code isn't firing incorrectly.

@dghubble

Contributor Author

commented May 7, 2019

I didn't delve deeply into this. I just noticed it panicked and reverted.

Some details I can call out. I tested v3.7.1 with a fresh cluster, so not upgrading from a prior version. I had not yet applied any NetworkPolicy or workloads beyond just kube-system control plane (self-hosted). This Kubernetes cluster was on AWS EC2. I've got no other cali interfaces. I'm also not using Calico with host endpoints (or I'm not intending to).

@caseydavenport

Member

commented May 8, 2019

We've merged a fix into master for this here: projectcalico/node#232
And cherry-picked to v3.7 here: projectcalico/node#233

There's an auto-build of it here: calico/node:v3.8.0-0.dev-10-g1046721.

@dghubble if you have time to try out the fix on your system prior to us releasing v3.7.2, that would be awesome. Otherwise keep an eye out for v3.7.2 :)

@dghubble

Contributor Author

commented May 8, 2019

@caseydavenport @fasaxc Yep! That auto-build seems to fix it, thanks for the fast turn-around!
