Recommended Kernel and Docker Storage for 1.4 / 1.5 #874

Closed
chrislovecnm opened this issue Nov 12, 2016 · 38 comments

Labels: focus/stability, lifecycle/rotten, P0

Comments

@chrislovecnm
Contributor

We need to document and validate that this is resolved: kubernetes/kubernetes#30706

  1. What kernel is recommended?
  2. What docker storage is recommended?
  3. How do we test this?
@chrislovecnm
Contributor Author

@dchen1107 / @justinsb / @zmerlynn how do we get this documented and tested?

@jaygorrell

@chrislovecnm posted this after discussing things with me in Slack.

Basically, where I'm at is that I have a kops build (git-a09f3a9) that was originally managing 1.3.x clusters, which have since been bumped to 1.4. My understanding is that the panic issue in kubernetes/kubernetes#30706 for m4 instance types is resolved in the 4.4 kernel, so I'm trying to understand the best upgrade path.

kops has an updated AMI/kernel for 4.4, but I'm not sure whether that has been released yet on a stable channel. Should I be doing a cluster upgrade with my current kops version and editing the image? Should I update kops and just do an upgrade from there -- and if so, should we build from master or use kops 1.4.1?

The scope of this issue is a bit larger (documentation and testing), but those are the root questions from my perspective.

@chrislovecnm
Contributor Author

Yah ... @jaygorrell, you covered about 42+ things that should each be created as an issue. I jest -- I just created a cluster with 42 in it, and if you don't know the significance of 42, I have a book for you.

  1. Use kops 1.4.1, as it is stable (a rough upgrade sketch follows this list).
  2. We need to validate which AMI is used by 1.4.1 ~ @justinsb
  3. You need to install a 1.4.4+ K8s version.
  4. Do not use master, as it requires a new version of nodeup.
  5. We are improving the release process to help with some of these questions.
  6. We are improving the release notes in order to help with some of these questions.
  7. We need to test, validate, and firm up exactly what the community AMI consists of.

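A rough sketch of that upgrade flow with kops 1.4.1 (the cluster name is a placeholder, and exact subcommand behavior can differ between kops releases, so treat this as illustrative rather than authoritative):

# Placeholder cluster name; substitute your own and make sure KOPS_STATE_STORE is set.
export NAME=mycluster.example.com

kops upgrade cluster $NAME                 # preview the recommended version/image bumps
kops upgrade cluster $NAME --yes           # apply them to the cluster spec
kops update cluster $NAME --yes            # push the updated configuration to AWS
kops rolling-update cluster $NAME --yes    # replace nodes so they pick up the new AMI/kernel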
Let me ask you.

What would be AMAZING for you? What would have helped answer all of these questions in the best OSS project you have ever seen?

@chrislovecnm
Contributor Author

Also, we need some more toys on the base image ... kubernetes-retired/kube-deploy#255

@jaygorrell

That was very helpful @chrislovecnm, thank you!

As for what would make this all amazing? Simply having what you outlined readily available! We have kops versioning that almost aligns with Kubernetes versioning, which makes things a bit confusing. The lack of clarity around which component dictates the AMI you receive for the 4.4 kernel is a great example of the problems around this.

Perhaps all we need is a sort of table/matrix to outline the variables related to a version of kops. It could show the default AMI, networking layer, and other settings... as well as the supported Kubernetes versions.

Ideally, someone with kops 1.4.1 should be able to reference it to easily see which versions of Kubernetes they're able to install and which AMI would be used. Similarly, if they know they want Kubernetes 1.4.4 for security reasons, they should be able to see which versions of kops can provide that.

@krisnova
Contributor

@chrislovecnm - Can we touch base on this tomorrow? Maybe after our morning call - I'd like to know where we stand.

Cheers

@chrislovecnm
Contributor Author

https://docs.docker.com/engine/userguide/storagedriver/selectadriver/
~ overlayfs is still the recommended and more tested driver per @justinsb

@jkemp101

I'm eager to try overlay2. I regularly deploy a Prometheus container that is based on busybox. Busybox uses thousands of hardlinks, which causes deployment to fail with "Error trying v2 registry: failed to register layer: link ... too many links" after a handful of deployments of the same container. You quickly reach ext4's 65,000-hardlink limit. I have to manually run a docker rmi to remove some of the hardlinks so the docker pull will complete. So unless I'm missing something, you can't deploy a busybox-based container more than about 6 times with overlayfs before it starts failing.
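For anyone who wants to check how close an overlay-on-ext4 node is to that limit, here's a minimal sketch (paths assume the default Docker root under /var/lib/docker; adjust if yours differs):

docker info | grep -i 'storage driver'      # confirm the node is on the overlay driver
# ext4's per-inode hardlink cap is 65,000; list any layer files already close
# to it (the link count is the first field printed).
sudo find /var/lib/docker/overlay -type f -links +60000 -printf '%n %p\n' | head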

@justinsb justinsb added this to the 1.5.0 milestone Dec 28, 2016
@justinsb
Member

Here is what is validated with k8s 1.5:

https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG.md#external-dependency-version-information

The reason for the 4.4 kernel: kubernetes/kubernetes#30706

We are figuring out what to do about the recent docker security hole: kubernetes/kubernetes#40061

Also, work on validating overlay2: kubernetes/kubernetes#32536

Until then, I think the recommendation is:

  • 4.4 kernel
  • docker 1.12.3
  • overlay filesystem

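For reference, a quick sketch of checking a node against that list (output formats vary slightly across docker versions):

uname -r                                          # expect a 4.4.x kernel
docker version --format '{{.Server.Version}}'     # expect 1.12.3
docker info | grep -i 'storage driver'            # expect the overlay driver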
@justinsb justinsb modified the milestones: 1.5.1, 1.5.0 Jan 19, 2017
@lcjlcj

lcjlcj commented Feb 9, 2017

@justinsb We were seeing minions crashing about once a day. We were on jessie 3.16, docker 1.11.2, aufs, kube 1.4.3. After seeing this issue, we upgraded to jessie 4.4, docker 1.12.3, overlay, kube 1.4.3. Now a crash happens every 5 minutes. We also noticed cbr0 has a lot of churn; eth0 keeps going offline and online every half minute to minute. This is on AWS m3.2xlarge.

@lcjlcj

lcjlcj commented Feb 10, 2017

Finally captured a stacktrace here.
[50377.745293] IPv6: ADDRCONF(NETDEV_UP): vethc32763d: link is not ready
[50378.138431] eth0: renamed from vethc1d7543
[50378.156526] BUG: unable to handle kernel [50378.158396] IPv6: ADDRCONF(NETDEV_CHANGE): vethc32763d: link becomes ready
[50378.158423] docker0: port 9(vethc32763d) entered forwarding state
[50378.158438] docker0: port 9(vethc32763d) entered forwarding state
[50378.160476] NULL pointer dereference at 0000000000000050
[50378.160476] IP: [] wakeup_preempt_entity.isra.62+0x9/0x50
[50378.160476] PGD 1d7b98067 PUD 1e540c067 PMD 0
[50378.160476] Oops: 0000 [#1] SMP
[50378.160476] Modules linked in: xt_statistic(E) sch_htb(E) ebt_ip(E) ebtable_filter(E) ebtables(E) veth(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_nat(E) xt_tcpudp(E) xt_recent(E) xt_mark(E) xt_comment(E) binfmt_misc(E) overlay(E) ipt_MASQUERADE(E) nf_nat_masquerade_ipv4(E) xfrm_user(E) xfrm_algo(E) iptable_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) nf_nat_ipv4(E) xt_addrtype(E) iptable_filter(E) ip_tables(E) xt_conntrack(E) x_tables(E) nf_nat(E) nf_conntrack(E) br_netfilter(E) bridge(E) stp(E) llc(E) dm_thin_pool(E) dm_persistent_data(E) dm_bio_prison(E) dm_bufio(E) libcrc32c(E) crc32c_generic(E) loop(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) nfs(E) lockd(E) grace(E) fscache(E) sunrpc(E) crct10dif_pclmul(E) crc32_pclmul(E) hmac(E) drbg(E) ansi_cprng(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ppdev(E) parport_pc(E) 8250_fintek(E) ablk_helper(E) cryptd(E) evdev(E) parport(E) acpi_cpufreq(E) snd_pcsp(E) tpm_tis(E) tpm(E) snd_pcm(E) snd_timer(E) snd(E) cirrus(E) ttm(E) drm_kms_helper(E) drm(E) i2c_piix4(E) soundcore(E) processor(E) button(E) serio_raw(E) autofs4(E) ext4(E) crc16(E) mbcache(E) jbd2(E) btrfs(E) xor(E) raid6_pq(E) dm_mod(E) ata_generic(E) xen_netfront(E) xen_blkfront(E) ata_piix(E) crc32c_intel(E) psmouse(E) libata(E) scsi_mod(E) fjes(E)
[50378.160476] CPU: 1 PID: 2551 Comm: docker-containe Tainted: G E 4.4.41-k8s #1
[50378.160476] Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
[50378.160476] task: ffff88013213ee40 ti: ffff8801e55bc000 task.ti: ffff8801e55bc000
[50378.160476] RIP: 0010:[] [] wakeup_preempt_entity.isra.62+0x9/0x50
[50378.160476] RSP: 0018:ffff8801e55bf828 EFLAGS: 00010086
[50378.160476] RAX: ffff880137528100 RBX: 0000a0ca01034375 RCX: ffff8800eaffa4c0
[50378.160476] RDX: ffff8801efc35e30 RSI: 0000000000000000 RDI: 0000a0ca01034375
[50378.160476] RBP: 0000000000000000 R08: 0000000000004000 R09: ffff8800eaeb1e01
[50378.160476] R10: 00000000000043dc R11: 0000000000000000 R12: 0000000000000000
[50378.160476] R13: 0000000000000000 R14: ffff8801efc35dc0 R15: 0000000000000001
[50378.160476] FS: 00007fc2ef475700(0000) GS:ffff8801efc20000(0000) knlGS:0000000000000000
[50378.160476] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[50378.160476] CR2: 0000000000000050 CR3: 00000001cf57c000 CR4: 00000000001406e0
[50378.160476] Stack:
[50378.160476] ffff8800eafa4a00 ffffffff810aa3b0 ffff8800eafa4a00 0000000000000000
[50378.160476] 0000000000015dc0 0000000000000000 ffffffff810b332f ffff88013213ee40
[50378.160476] ffff8801efc35dc0 0000000000016840 0000000000015dc0 ffff8801efc35e30
[50378.160476] Call Trace:
[50378.160476] [] ? pick_next_entity+0x70/0x140
[50378.160476] [] ? pick_next_task_fair+0x30f/0x4a0
[50378.160476] [] ? __schedule+0xdf/0x960
[50378.160476] [] ? schedule+0x31/0x80
[50378.160476] [] ? schedule_hrtimeout_range_clock+0xa1/0x120
[50378.160476] [] ? hrtimer_init+0x100/0x100
[50378.160476] [] ? schedule_hrtimeout_range_clock+0x94/0x120
[50378.160476] [] ? poll_schedule_timeout+0x45/0x60
[50378.160476] [] ? do_select+0x57b/0x7d0
[50378.160476] [] ? find_busiest_group+0x3e/0x4f0
[50378.160476] [] ? cpumask_next_and+0x2a/0x40
[50378.160476] [] ? update_curr+0xba/0x130
[50378.160476] [] ? set_next_entity+0x71/0x7d0
[50378.160476] [] ? update_curr+0x55/0x130
[50378.160476] [] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[50378.160476] [] ? finish_task_switch+0x6d/0x230
[50378.160476] [] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[50378.160476] [] ? hrtimer_try_to_cancel+0xc7/0x120
[50378.160476] [] ? hrtimer_cancel+0x15/0x20
[50378.160476] [] ? futex_wait+0x1e6/0x260
[50378.160476] [] ? hrtimer_init+0x100/0x100
[50378.160476] [] ? core_sys_select+0x19c/0x2a0
[50378.160476] [] ? do_futex+0x110/0xb50
[50378.160476] [] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[50378.160476] [] ? xen_clocksource_get_cycles+0x11/0x20
[50378.160476] [] ? ktime_get_ts64+0x3f/0xf0
[50378.160476] [] ? xen_clocksource_get_cycles+0x11/0x20
[50378.160476] [] ? ktime_get_ts64+0x3f/0xf0
[50378.160476] [] ? SyS_select+0xba/0x110
[50378.160476] [] ? entry_SYSCALL_64_fastpath+0x16/0x75
[50378.160476] Code: 5b e9 3c f2 ff ff 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f7 e9 e3 fb ff ff 0f 1f 00 0f 1f 44 00 00 53 48 89 fb <48> 2b 5e 50 48 85 db 7e 2c 48 81 3e 00 04 00 00 8b 05 91 88 9a
[50378.160476] RIP [] wakeup_preempt_entity.isra.62+0x9/0x50
[50378.160476] RSP
[50378.160476] CR2: 0000000000000050
[50378.160476] ---[ end trace 6910e79e3636a2b9 ]---
[50378.160476] Kernel panic - not syncing: Fatal exception
[50378.160476] Shutting down cpus with NMI
[50378.160476] Kernel Offset: disabled

@chrislovecnm
Contributor Author

@justinsb any ideas on this?

@justinsb
Member

Sorry about the trouble, @lcjlcj:

  1. Are you capturing the panics from aws ec2 get-console-output? That tends to have them very reliably, whereas systemd/journald is not very good at capturing them. For example:
    aws ec2 get-console-output --instance-id i-0d91498bdd6160a51 --query Output --output text | less

  2. Which version of kops are you running? (Though I don't think we fixed anything here). Also technically docker 1.12 is not validated with kube 1.4, so I'm wondering how you got kops to do this?

  3. You said the machines are crashing every five minutes. Is that across your whole cluster, or per instance (the uptime you posted was ~14 hours, I think)? And then are the crashes happening on a particular instance or on multiple instances?

  4. Are you doing anything unusual? Let's assume that other people are not hitting the same problem (it hasn't been reported, though that doesn't mean they are not!). What would your guesses be as to how you are different? For example, rapidly churning pods, etc.

@zhan849

zhan849 commented Feb 10, 2017

@justinsb I was with @lcjlcj, and we can reliably reproduce the same AWS console output simply by submitting short-running jobs to Kubernetes.
For your questions:

  1. Yes. We are doing continuous monitoring of the entire cluster, all nodes.

  2. We are still using kube-up, with some of our own patches (none of which change the way the node is bootstrapped). What are the major differences in node setup (i.e. docker, networking, etc.) in kops? And since, as you said, docker 1.12 is not validated with kube 1.4, is there a major difference in how kubernetes/docker/kernel interact between 1.4 and 1.5+?

  3. @lcjlcj would have better answers, as the five-minute frequency happened on one of his test clusters two days ago, but we do see the same kernel panic trace here and there.

  4. Yes, we are rapidly churning pods (i.e. using short-running jobs): we are a startup and our product automates an end-to-end DevOps checkout-test-build-release-deployment pipeline. One thing you might also be interested in: when pods go up and down frequently, or when we are actively submitting jobs, kube-apiserver can get OOMKilled almost immediately (dstat on the master shows memory usage bursting from ~5G to >50G).
    Is kubernetes designed more to serve long-running services?

BTW, here is the kernel log of the frequent bridge up/down events (anything to do with kube-proxy?).
This snapshot shows the kernel suddenly getting into this bridge/port up-down churn and crashing shortly after:

[   42.963853] ip_tables: (C) 2000-2006 Netfilter Core Team


Debian GNU/Linux 8 ip-10-144-6-157 ttyS0

ip-10-144-6-157 login: [   43.132533] Initializing XFRM netlink socket
[   43.163426] IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready
[   48.410844] EXT4-fs (dm-1): mounted filesystem with ordered data mode. Opts: (null)
[  142.049429] nr_pdflush_threads exported in /proc is scheduled for removal
[ 3821.872597] cbr0: port 13(veth00981fee) entered disabled state
[ 3821.877896] device veth00981fee left promiscuous mode
[ 3821.881761] cbr0: port 13(veth00981fee) entered disabled state
[ 3893.414006] device vethef136ae entered promiscuous mode
[ 3893.416804] IPv6: ADDRCONF(NETDEV_UP): vethef136ae: link is not ready
[ 3894.636325] eth0: renamed from veth7c38c0b
[ 3894.700455] IPv6: ADDRCONF(NETDEV_CHANGE): vethef136ae: link becomes ready
[ 3894.704328] br-30b3057eb136: port 1(vethef136ae) entered forwarding state
[ 3894.707982] br-30b3057eb136: port 1(vethef136ae) entered forwarding state
[ 3894.711244] IPv6: ADDRCONF(NETDEV_CHANGE): br-30b3057eb136: link becomes ready
[ 3895.700560] device veth49ea6ee entered promiscuous mode
[ 3895.703818] IPv6: ADDRCONF(NETDEV_UP): veth49ea6ee: link is not ready
[ 3898.132340] eth0: renamed from veth78dc38f
[ 3898.200528] IPv6: ADDRCONF(NETDEV_CHANGE): veth49ea6ee: link becomes ready
[ 3898.204388] br-30b3057eb136: port 2(veth49ea6ee) entered forwarding state
[ 3898.208144] br-30b3057eb136: port 2(veth49ea6ee) entered forwarding state
[ 3899.552897] device veth59d804a entered promiscuous mode
[ 3899.555529] IPv6: ADDRCONF(NETDEV_UP): veth59d804a: link is not ready
[ 3901.328348] eth0: renamed from veth114ad86
[ 3901.400239] IPv6: ADDRCONF(NETDEV_CHANGE): veth59d804a: link becomes ready
[ 3901.404335] br-30b3057eb136: port 3(veth59d804a) entered forwarding state
[ 3901.408387] br-30b3057eb136: port 3(veth59d804a) entered forwarding state
[ 3909.724068] br-30b3057eb136: port 1(vethef136ae) entered forwarding state
[ 3913.244064] br-30b3057eb136: port 2(veth49ea6ee) entered forwarding state
[ 3916.444068] br-30b3057eb136: port 3(veth59d804a) entered forwarding state
[ 3931.402774] device veth0cefb44 entered promiscuous mode
[ 3931.405833] IPv6: ADDRCONF(NETDEV_UP): veth0cefb44: link is not ready
[ 3932.712422] eth0: renamed from veth73a5174
[ 3932.728364] IPv6: ADDRCONF(NETDEV_CHANGE): veth0cefb44: link becomes ready
[ 3932.731786] br-85515d4f0886: port 1(veth0cefb44) entered forwarding state
[ 3932.735132] br-85515d4f0886: port 1(veth0cefb44) entered forwarding state
[ 3932.738384] IPv6: ADDRCONF(NETDEV_CHANGE): br-85515d4f0886: link becomes ready
[ 3933.404302] device vethbfc4168 entered promiscuous mode
[ 3933.406817] IPv6: ADDRCONF(NETDEV_UP): vethbfc4168: link is not ready
[ 3933.409818] br-85515d4f0886: port 2(vethbfc4168) entered forwarding state
[ 3933.413014] br-85515d4f0886: port 2(vethbfc4168) entered forwarding state
[ 3933.728096] br-85515d4f0886: port 2(vethbfc4168) entered disabled state
[ 3936.332373] eth0: renamed from veth2193da6
[ 3936.500381] IPv6: ADDRCONF(NETDEV_CHANGE): vethbfc4168: link becomes ready
[ 3936.505944] br-85515d4f0886: port 2(vethbfc4168) entered forwarding state
[ 3936.510545] br-85515d4f0886: port 2(vethbfc4168) entered forwarding state
[ 3937.604774] device vethb6fc8f9 entered promiscuous mode
[ 3937.607405] IPv6: ADDRCONF(NETDEV_UP): vethb6fc8f9: link is not ready
[ 3939.436406] eth0: renamed from veth2ab24da
[ 3939.452349] IPv6: ADDRCONF(NETDEV_CHANGE): vethb6fc8f9: link becomes ready
[ 3939.455583] br-85515d4f0886: port 3(vethb6fc8f9) entered forwarding state
[ 3939.458692] br-85515d4f0886: port 3(vethb6fc8f9) entered forwarding state
[ 3947.740069] br-85515d4f0886: port 1(veth0cefb44) entered forwarding state
[ 3951.516072] br-85515d4f0886: port 2(vethbfc4168) entered forwarding state
[ 3954.460069] br-85515d4f0886: port 3(vethb6fc8f9) entered forwarding state
[ 3954.869277] cbr0: port 3(veth8dd8f8b7) entered disabled state
[ 3954.875749] device veth8dd8f8b7 left promiscuous mode
[ 3954.878700] cbr0: port 3(veth8dd8f8b7) entered disabled state
[ 3960.004144] BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
[ 3960.008059] IP: [<ffffffff810b332f>] pick_next_task_fair+0x30f/0x4a0
[ 3960.008059] PGD 6e7bd7067 PUD 72813c067 PMD 0 
[ 3960.008059] Oops: 0000 [#1] SMP 
[ 3960.008059] Modules linked in: xt_statistic(E) xt_nat(E) xt_recent(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_tcpudp(E) sch_htb(E) ebt_ip(E) ebtable_filter(E) ebtables(E) veth(E) xt_mark(E) xt_comment(E) binfmt_misc(E) overlay(E) ipt_MASQUERADE(E) nf_nat_masquerade_ipv4(E) xfrm_user(E) xfrm_algo(E) iptable_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) nf_nat_ipv4(E) xt_addrtype(E) iptable_filter(E) ip_tables(E) xt_conntrack(E) x_tables(E) nf_nat(E) nf_conntrack(E) br_netfilter(E) bridge(E) stp(E) llc(E) dm_thin_pool(E) dm_persistent_data(E) dm_bio_prison(E) dm_bufio(E) libcrc32c(E) crc32c_generic(E) loop(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) nfs(E) lockd(E) grace(E) fscache(E) sunrpc(E) crct10dif_pclmul(E) crc32_pclmul(E) hmac(E) drbg(E) ansi_cprng(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ppdev(E) ablk_helper(E) cryptd(E) evdev(E) cirrus(E) ttm(E) snd_pcsp(E) drm_kms_helper(E) snd_pcm(E) acpi_cpufreq(E) snd_timer(E) parport_pc(E) tpm_tis(E) 8250_fintek(E) snd(E) i2c_piix4(E) parport(E) tpm(E) soundcore(E) drm(E) serio_raw(E) processor(E) button(E) autofs4(E) ext4(E) crc16(E) mbcache(E) jbd2(E) btrfs(E) xor(E) raid6_pq(E) dm_mod(E) ata_generic(E) xen_netfront(E) xen_blkfront(E) ata_piix(E) libata(E) crc32c_intel(E) psmouse(E) scsi_mod(E) fjes(E)
[ 3960.008059] CPU: 4 PID: 10158 Comm: mysql_tzinfo_to Tainted: G            E   4.4.41-k8s #1
[ 3960.008059] Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
[ 3960.008059] task: ffff8807578fae00 ti: ffff88075f028000 task.ti: ffff88075f028000
[ 3960.008059] RIP: 0010:[<ffffffff810b332f>]  [<ffffffff810b332f>] pick_next_task_fair+0x30f/0x4a0
[ 3960.008059] RSP: 0018:ffff88075f02be38  EFLAGS: 00010046
[ 3960.008059] RAX: 0000000000000000 RBX: ffff8807250ff400 RCX: 0000000000000000
[ 3960.008059] RDX: ffff88078fc95e30 RSI: 0000000000000000 RDI: ffff8807250ff400
[ 3960.008059] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff88076bc13700
[ 3960.008059] R10: 0000000000001cf7 R11: ffffea001c98a100 R12: 0000000000015dc0
[ 3960.008059] R13: 0000000000000000 R14: ffff88078fc95dc0 R15: 0000000000000004
[ 3960.008059] FS:  00007fa34b7f6740(0000) GS:ffff88078fc80000(0000) knlGS:0000000000000000
[ 3960.008059] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3960.008059] CR2: 0000000000000080 CR3: 000000067762d000 CR4: 00000000001406e0
[ 3960.008059] Stack:
[ 3960.008059]  ffff8807578fae00 0000000000001000 0000000200000000 0000000000015dc0
[ 3960.008059]  ffff88078fc95e30 00007fa34b7fc000 000000005ef04228 ffff88078fc95dc0
[ 3960.008059]  ffff8807578fae00 0000000000015dc0 0000000000000000 ffff8807578fb2a0
[ 3960.008059] Call Trace:
[ 3960.008059]  [<ffffffff8159cd1f>] ? __schedule+0xdf/0x960
[ 3960.008059]  [<ffffffff8159d5d1>] ? schedule+0x31/0x80
[ 3960.008059]  [<ffffffff810031cb>] ? exit_to_usermode_loop+0x6b/0xc0
[ 3960.008059]  [<ffffffff81003bcf>] ? syscall_return_slowpath+0x8f/0x110
[ 3960.008059]  [<ffffffff815a1518>] ? int_ret_from_sys_call+0x25/0x8f
[ 3960.008059] Code: c6 44 24 17 00 eb 4d 48 8b 5c 24 20 eb 29 31 ed 48 89 df e8 04 a2 ff ff 84 c0 0f 85 99 fd ff ff 48 89 df 48 89 ee e8 11 70 ff ff <48> 8b 98 80 00 00 00 48 85 db 74 57 48 8b 6b 38 48 85 ed 74 e0 
[ 3960.008059] RIP  [<ffffffff810b332f>] pick_next_task_fair+0x30f/0x4a0
[ 3960.008059]  RSP <ffff88075f02be38>
[ 3960.008059] CR2: 0000000000000080
[ 3960.008059] ---[ end trace e1b9f0775b83e8e3 ]---
[ 3960.008059] Kernel panic - not syncing: Fatal exception
[ 3960.008059] Shutting down cpus with NMI
[ 3960.008059] Kernel Offset: disabled
[    0.000000] Initializing cgroup subsys cpuset

Thanks

@lcjlcj

lcjlcj commented Feb 11, 2017

@justinsb Thanks for the quick response. We were trying to debug the problem and finally determined that it's most likely a kernel issue. We saw many kernel backtraces; most of them were in the CFS scheduler area, in pick_next_task_fair+0x30f/0x4a0 or wakeup_preempt_entity.isra.62+0x9/0x50. We had specified CPU resource limits for quite a few pods, and those limits are small (0.1 - 0.2 cores per pod). This probably caused some complication between CFS and the cgroup limits.
We disabled the CPU limits temporarily and the system is stable now.
I believe kubernetes maps 0.1 cores to {cfs_period_us: 100000, cfs_quota_us: 10000}. With a moderate number of pods (fewer than 100 pods per minion on an AWS m3.2xlarge), the kernel would panic consistently within minutes. We will file a bug against the kernel. In the meantime, it might be good to document Kubernetes CPU resource limit best practices.
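For anyone who wants to confirm that mapping on a node, a minimal sketch assuming the legacy cgroupfs (cgroup v1) layout -- paths vary by kubelet cgroup driver and Kubernetes version, and the PID below is a placeholder:

# Placeholder PID: the container's main process, e.g. from
#   docker inspect -f '{{.State.Pid}}' <container-id>
PID=12345
CG=$(awk -F: '/cpu,cpuacct/ {print $3}' /proc/$PID/cgroup)
cat /sys/fs/cgroup/cpu,cpuacct$CG/cpu.cfs_period_us   # expect 100000
cat /sys/fs/cgroup/cpu,cpuacct$CG/cpu.cfs_quota_us    # expect 10000 for a 100m (0.1 core) limit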

@justinsb
Member

justinsb commented Feb 11, 2017

Hi - I was pondering this. AFAIK there's no known issue with what you're doing - I was just guessing at the high rate of pod churn because, well, the panic is in the scheduler!

So we have some great leads:

  • CFS / cgroup limits (much stronger than my guesses!)
  • Some configuration difference between kube-up & kops (there definitely are differences, for example we set different sysctls, but I don't think they would make a difference)
  • Something in docker 1.12 with k8s 1.4 (but kernel panics should never be possible)

I agree that we're triggering a kernel bug. I didn't see any known issues in the same function, but I can also build a newer kernel, though it would probably be purely speculative.

I'm going to cc @kubernetes/sig-node-bugs as this could well be kops/aws specific, but it does seem like it is related to resource limits. @dchen1107 let us know if we should copy this bug into kubernetes/kubernetes!

On the apiserver memory ballooning, I would definitely open an issue on kubernetes/kubernetes. The apiserver will use more memory if there are more pods, and we do retain recent pods for a period of time, so I believe it follows that high churn of pods => more memory. But I honestly just don't know enough of the details here to say if what you are seeing is "expected" - it feels excessive, but I could well be wrong.

@zhan849

zhan849 commented Feb 11, 2017

@justinsb thanks for filing the bug about apiserver memory usage. I can provide more details about our workload and apiserver configuration. (We took the numbers provided in the kubernetes source code into consideration, i.e., set target-ram-mb to 60MB per 30 pods.)
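For reference, a tiny sizing sketch following that "60MB per 30 pods" heuristic (EXPECTED_PODS is an assumption you supply; this is just the arithmetic, not an official formula):

EXPECTED_PODS=3000
echo $(( EXPECTED_PODS / 30 * 60 ))   # suggested value for the apiserver's target-ram-mb setting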

@zytek
Contributor

zytek commented Feb 15, 2017

@chrislovecnm

https://docs.docker.com/engine/userguide/storagedriver/selectadriver/
~ overlayfs is still the recommended and more tested driver per @justinsb

How does this square with the fact that kops-provisioned kubernetes clusters use devicemapper by default?
Probably related to #1731.

@mutemule

We've been running into a kernel oops that's fairly similar to #874 (comment):

[61778.046591] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
[61778.054546] IP: [<ffffffff810b84a2>] wakeup_preempt_entity.isra.55+0x12/0x60
[61778.061672] PGD 2035f76067 PUD 2035f77067 PMD 0
[61778.066368] Oops: 0000 [#1] SMP
[61778.069644] Modules linked in: binfmt_misc nf_conntrack_netlink veth xt_nat xt_recent ipt_REJECT nf_reject_ipv4 xt_tcpudp iptable_raw xt_multiport ip_set_hash_net ip_set nfnetlink xt_comment xt_mark ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp llc overlay intel_rapl x86_pkg_temp_thermal kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul mei_me glue_helper sb_edac ablk_helper joydev input_leds cryptd mei lpc_ich ioatdma edac_core shpchp 8250_fintek ipmi_ssif acpi_power_meter mac_hid ipmi_si ipmi_devintf ipmi_msghandler coretemp bonding autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear raid1 ixgbe ast i2c_algo_bit dca ttm vxlan ip6_udp_tunnel drm_kms_helper udp_tunnel syscopyarea ptp sysfillrect hid_generic sysimgblt fb_sys_fops usbhid ahci pps_core drm hid libahci mdio wmi fjes
[61778.167824] CPU: 19 PID: 104 Comm: migration/19 Tainted: G        W       4.4.59-1-custom #1
[61778.176917] Hardware name: Supermicro SYS-2028TP-HTTR/X10DRT-PT
[61778.184705] task: ffff881038d78000 ti: ffff881038d78318 task.ti: ffff881038d78318
[61778.192223] RIP: 0010:[<ffffffff810b84a2>]  [<ffffffff810b84a2>] wakeup_preempt_entity.isra.55+0x12/0x60
[61778.201770] RSP: 0000:ffffc9000c9ebd68  EFLAGS: 00010086
[61778.207111] RAX: ffff88102410b000 RBX: 000002b26d33296d RCX: 0000000000146d08
[61778.214280] RDX: 0000002b89334993 RSI: 0000000000000000 RDI: 000002b26d33296d
[61778.221450] RBP: ffffc9000c9ebd78 R08: 0000000000000013 R09: 0000000000000000
[61778.228617] R10: 0000000000000008 R11: 0000000000000001 R12: 0000000000000000
[61778.235786] R13: 0000000000000000 R14: ffff88103fd73940 R15: ffff88103fd73940
[61778.242955] FS:  0000000000000000(0000) GS:ffff88103fd60000(0000) knlGS:0000000000000000
[61778.251082] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[61778.256857] CR2: 0000000000000050 CR3: 0000002036b76000 CR4: 00000000003606f0
[61778.264025] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[61778.271192] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[61778.278361] Stack:
[61778.280384]  ffff88102a6c5200 0000000000000000 ffffc9000c9ebda8 ffffffff810b8562
[61778.287887]  ffff88102a6c5400 ffff88102a6c5200 ffffffff81a18c20 ffff88103fd73940
[61778.295383]  ffffc9000c9ebe18 ffffffff810c08cf ffff88103fd739b0 0000000000013940
[61778.302881] Call Trace:
[61778.305348]  [<ffffffff810b8562>] pick_next_entity+0x72/0x130
[61778.311129]  [<ffffffff810c08cf>] pick_next_task_fair+0x7f/0x500
[61778.317173]  [<ffffffff8188af63>] __schedule+0x473/0x7d0
[61778.322516]  [<ffffffff8188b2f9>] schedule+0x39/0x80
[61778.327508]  [<ffffffff810a5e0d>] smpboot_thread_fn+0xcd/0x180
[61778.333370]  [<ffffffff810a5d40>] ? sort_range+0x30/0x30
[61778.338715]  [<ffffffff810a20b9>] kthread+0xd9/0xf0
[61778.343619]  [<ffffffff810a1fe0>] ? kthread_create_on_node+0x190/0x190
[61778.350181]  [<ffffffff8188fa6e>] ret_from_fork+0x3e/0x70
[61778.355609]  [<ffffffff810a1fe0>] ? kthread_create_on_node+0x190/0x190
[61778.362174] Code: 89 e5 53 48 89 f3 48 89 df e8 7b fb ff ff 5b 5d c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 f4 53 48 89 fb <49> 2b 5c 24 50 48 85 db 7e 30 49 81 3c 24 00 04 00 00 8b 05 46
[61778.383928] RIP  [<ffffffff810b84a2>] wakeup_preempt_entity.isra.55+0x12/0x60
[61778.392892]  RSP <ffffc9000c9ebd68>
[61778.398176] CR2: 0000000000000050
[61778.409187] ---[ end trace bd67486fd1f7583f ]---

This is a custom kernel based on 4.4.59, but the issue traces back to wakeup_preempt_entity in kernel/sched/fair.c. This is called from pick_next_entity, which in turn is called from pick_next_task_fair.

I haven't been able to prove this, but my initial hunch is that in pick_next_entity, __pick_first_entity(cfs_rq) returns NULL, so the attempt to ensure left is assigned doesn't achieve the desired effect. I've never mucked about in the scheduler before, so there's a very good chance I'm way off.

@daxtens

daxtens commented May 4, 2017

I've seen similar problems, eventually traced to the 4.4 kernel missing a couple of patches. My results are at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1687512

@chrislovecnm
Contributor Author

Which patches?

@justinsb
Member

justinsb commented May 4, 2017

Can we copy this into a bug in kubernetes/kubernetes? @kubernetes/sig-node-bugs will want to validate.

@k8s-ci-robot
Contributor

@justinsb: These labels do not exist in this repository: sig/node.

In response to this:

Can we copy this into a bug in kubernetes/kubernetes? @kubernetes/sig-node-bugs will want to validate.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@daxtens

daxtens commented May 4, 2017

Hi @chrislovecnm,

Yes - these patches are in every kernel since 4.7 (see git tag --contains 754bd598be9bbc953bc709a9e8ed7f3188bfb9d7).

They're not currently in the Ubuntu Xenial kernel (which is 4.4-based), but they will be eventually. You could try the HWE kernel, which is more recent.

@bpineau

bpineau commented May 28, 2017

For reference, the two scheduler fixes (754bd598 and 094f4691) also made their way into the just-released 4.4.70 kernel.

So 4.4 (at 4.4.70 or later) should be good to go.
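A quick sketch (assuming GNU coreutils, i.e. sort -V is available) to check whether a node's running 4.4 kernel is at or past 4.4.70:

current=$(uname -r | cut -d- -f1)   # e.g. "4.4.41" from "4.4.41-k8s"
if [ "$(printf '%s\n' 4.4.70 "$current" | sort -V | head -n1)" = "4.4.70" ]; then
  echo "OK: kernel $current is 4.4.70 or newer"
else
  echo "WARNING: kernel $current predates the 4.4.70 scheduler fixes"
fi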

@chrislovecnm
Contributor Author

Did anyone copy this bug over to kubernetes/kubernetes?

@chrislovecnm
Contributor Author

cc @dchen1107

Dawn, we are noticing that a couple of scheduler fixes have helped with k8s kernel panics. Do we already have an issue open in core?

@zhan849

zhan849 commented Jun 27, 2017

@justinsb @chrislovecnm

We observed a different kernel crash message on 4.4.41 after running a Kubernetes cluster for 2 months; it is preventing docker from starting up. Let me know if there is a better place for kernel-related bugs. (A similar issue is mentioned in kubernetes/kubernetes#23253, but that issue has been closed.)

Kernel Version:

root@ip-172-20-0-9:/home/admin# uname -a
Linux ip-172-20-0-9 4.4.41-k8s #1 SMP Mon Jan 9 15:34:39 UTC 2017 x86_64 GNU/Linux

Kernel crash message:

[5349238.836307] Call Trace:
[5349238.836408]  [<ffffffff813ed5d0>] ? wait_for_xmitr+0x30/0x90
[5349238.836415]  [<ffffffff813ed646>] ? serial8250_console_putchar+0x16/0x30
[5349238.836416]  [<ffffffff813ed630>] ? wait_for_xmitr+0x90/0x90
[5349238.836433]  [<ffffffff813e733e>] ? uart_console_write+0x2e/0x50
[5349238.836454]  [<ffffffff813f011c>] ? serial8250_console_write+0xcc/0x2a0
[5349238.836506]  [<ffffffff810c9238>] ? print_time.part.10+0x68/0x90
[5349238.836515]  [<ffffffff810c92c2>] ? print_prefix+0x62/0x90
[5349238.836527]  [<ffffffff810c9afe>] ? call_console_drivers.constprop.24+0xfe/0x110
[5349238.836529]  [<ffffffff810cadc0>] ? console_unlock+0x2f0/0x4d0
[5349238.836534]  [<ffffffff810cb356>] ? vprintk_emit+0x3b6/0x540
[5349238.836574]  [<ffffffff8159d9fe>] ? out_of_line_wait_on_bit+0x7e/0xa0
[5349238.836637]  [<ffffffff81168cfc>] ? printk+0x57/0x73
[5349238.837187]  [<ffffffffc028c7f1>] ? ext4_commit_super+0x1d1/0x270 [ext4]
[5349238.837210]  [<ffffffffc028d021>] ? __ext4_abort+0x81/0x170 [ext4]
[5349238.837236]  [<ffffffff811eadfb>] ? filename_parentat+0x10b/0x190
[5349238.837257]  [<ffffffffc027d40a>] ? ext4_unlink+0x1aa/0x380 [ext4]
[5349238.837285]  [<ffffffffc02a043e>] ? ext4_journal_check_start+0x6e/0x80 [ext4]
[5349238.837307]  [<ffffffffc02a0574>] ? __ext4_journal_start_sb+0x34/0x100 [ext4]
[5349238.837338]  [<ffffffffc027d40a>] ? ext4_unlink+0x1aa/0x380 [ext4]
[5349238.837347]  [<ffffffff811e6d94>] ? __inode_permission+0x24/0xa0
[5349238.837348]  [<ffffffff811e8227>] ? vfs_unlink+0xe7/0x180
[5349238.837349]  [<ffffffff811eb8b9>] ? do_unlinkat+0x289/0x300
[5349238.837362]  [<ffffffff815a13b6>] ? entry_SYSCALL_64_fastpath+0x16/0x75
[5349301.856150] INFO: rcu_sched detected stalls on CPUs/tasks:

[5349301.856207] 	7-...: (621 ticks this GP) idle=db7/140000000000000/0 softirq=291405734/291405734 fqs=1516663 
[5349301.856213] 	
[5349301.856223] (detected by 2, t=1643772 jiffies, g=333398020, c=333398019, q=2185604)
[5349301.856227] Task dump for CPU 7:
[5349301.856236] dockerd         R
[5349301.856240]   running task    
[5349301.856245]     0  1206      1 0x0000000c
[5349301.856268]  ffff00066c0a1000
[5349301.856269]  00000000400d65da
[5349301.856270]  ffffffff813ed5d0
[5349301.856270]  ffffffff81d27a00

[5349301.856271]  0000000000000030
[5349301.856273]  ffffffff81d27a00
[5349301.856276]  ffffffff813ed646
[5349301.856277]  ffffffff81cbcec0

[5349301.856277]  ffffffff813ed630
[5349301.856280]  ffffffff813e733e
[5349301.856280]  ffffffff81d27a00
[5349301.856282]  0000000000000001

[5349301.856290] Call Trace:
[5349301.856387]  [<ffffffff813ed5d0>] ? wait_for_xmitr+0x30/0x90
[5349301.856397]  [<ffffffff813ed646>] ? serial8250_console_putchar+0x16/0x30
[5349301.856398]  [<ffffffff813ed630>] ? wait_for_xmitr+0x90/0x90
[5349301.856416]  [<ffffffff813e733e>] ? uart_console_write+0x2e/0x50
[5349301.856431]  [<ffffffff813f011c>] ? serial8250_console_write+0xcc/0x2a0
[5349301.856487]  [<ffffffff810c9238>] ? print_time.part.10+0x68/0x90
[5349301.856493]  [<ffffffff810c92c2>] ? print_prefix+0x62/0x90
[5349301.856505]  [<ffffffff810c9afe>] ? call_console_drivers.constprop.24+0xfe/0x110
[5349301.856511]  [<ffffffff810cadc0>] ? console_unlock+0x2f0/0x4d0
[5349301.856513]  [<ffffffff810cb356>] ? vprintk_emit+0x3b6/0x540
[5349301.856549]  [<ffffffff8159d9fe>] ? out_of_line_wait_on_bit+0x7e/0xa0
[5349301.856590]  [<ffffffff81168cfc>] ? printk+0x57/0x73
[5349301.857035]  [<ffffffffc028c7f1>] ? ext4_commit_super+0x1d1/0x270 [ext4]
[5349301.857058]  [<ffffffffc028d021>] ? __ext4_abort+0x81/0x170 [ext4]
[5349301.857083]  [<ffffffff811eadfb>] ? filename_parentat+0x10b/0x190
[5349301.857099]  [<ffffffffc027d40a>] ? ext4_unlink+0x1aa/0x380 [ext4]
[5349301.857111]  [<ffffffffc02a043e>] ? ext4_journal_check_start+0x6e/0x80 [ext4]
[5349301.857123]  [<ffffffffc02a0574>] ? __ext4_journal_start_sb+0x34/0x100 [ext4]
[5349301.857134]  [<ffffffffc027d40a>] ? ext4_unlink+0x1aa/0x380 [ext4]
[5349301.857146]  [<ffffffff811e6d94>] ? __inode_permission+0x24/0xa0
[5349301.857155]  [<ffffffff811e8227>] ? vfs_unlink+0xe7/0x180
[5349301.857156]  [<ffffffff811eb8b9>] ? do_unlinkat+0x289/0x300
[5349301.857179]  [<ffffffff815a13b6>] ? entry_SYSCALL_64_fastpath+0x16/0x75
[5349364.876077] INFO: rcu_sched detected stalls on CPUs/tasks:

[5349364.876089] 	7-...: (1922 ticks this GP) idle=db7/140000000000000/0 softirq=291405734/291405734 fqs=1531627 
[5349364.876089] 	
[5349364.876095] (detected by 5, t=1659527 jiffies, g=333398020, c=333398019, q=2203695)
[5349364.876097] Task dump for CPU 7:
[5349364.876098] dockerd         R
[5349364.876099]   running task    
[5349364.876101]     0  1206      1 0x0000000c
[5349364.876109]  ffff00066c0a1000
[5349364.876109]  00000000400d65da
[5349364.876110]  ffffffff813ed5d0
[5349364.876110]  ffffffff81d27a00

[5349364.876111]  0000000000000072
[5349364.876111]  ffffffff81d27a00
[5349364.876111]  ffffffff813ed646
[5349364.876111]  ffffffff81cbceb4

[5349364.876112]  ffffffff813ed630
[5349364.876113]  ffffffff813e733e
[5349364.876114]  ffffffff81d27a00
[5349364.876116]  0000000000000001

[5349364.876118] Call Trace:
[5349364.876150]  [<ffffffff813ed5d0>] ? wait_for_xmitr+0x30/0x90
[5349364.876155]  [<ffffffff813ed646>] ? serial8250_console_putchar+0x16/0x30
[5349364.876157]  [<ffffffff813ed630>] ? wait_for_xmitr+0x90/0x90
[5349364.876168]  [<ffffffff813e733e>] ? uart_console_write+0x2e/0x50
[5349364.876176]  [<ffffffff813f011c>] ? serial8250_console_write+0xcc/0x2a0
[5349364.876192]  [<ffffffff810c9238>] ? print_time.part.10+0x68/0x90
[5349364.876196]  [<ffffffff810c92c2>] ? print_prefix+0x62/0x90
[5349364.876199]  [<ffffffff810c9afe>] ? call_console_drivers.constprop.24+0xfe/0x110
[5349364.876201]  [<ffffffff810cadc0>] ? console_unlock+0x2f0/0x4d0
[5349364.876204]  [<ffffffff810cb356>] ? vprintk_emit+0x3b6/0x540
[5349364.876221]  [<ffffffff8159d9fe>] ? out_of_line_wait_on_bit+0x7e/0xa0
[5349364.876242]  [<ffffffff81168cfc>] ? printk+0x57/0x73
[5349364.876557]  [<ffffffffc028c7f1>] ? ext4_commit_super+0x1d1/0x270 [ext4]
[5349364.876566]  [<ffffffffc028d021>] ? __ext4_abort+0x81/0x170 [ext4]
[5349364.876581]  [<ffffffff811eadfb>] ? filename_parentat+0x10b/0x190
[5349364.876595]  [<ffffffffc027d40a>] ? ext4_unlink+0x1aa/0x380 [ext4]
[5349364.876607]  [<ffffffffc02a043e>] ? ext4_journal_check_start+0x6e/0x80 [ext4]
[5349364.876619]  [<ffffffffc02a0574>] ? __ext4_journal_start_sb+0x34/0x100 [ext4]
[5349364.876628]  [<ffffffffc027d40a>] ? ext4_unlink+0x1aa/0x380 [ext4]
[5349364.876633]  [<ffffffff811e6d94>] ? __inode_permission+0x24/0xa0
[5349364.876634]  [<ffffffff811e8227>] ? vfs_unlink+0xe7/0x180
[5349364.876636]  [<ffffffff811eb8b9>] ? do_unlinkat+0x289/0x300
[5349364.876640]  [<ffffffff815a13b6>] ? entry_SYSCALL_64_fastpath+0x16/0x75
[5349378.904655] Detected aborted journal

[5349378.908433] systemd-journald[7132]: /dev/kmsg buffer overrun, some messages lost.
[5349380.045882] cbr0: port 1(veth924e7693) entered disabled state
[5349380.062306] device veth924e7693 left promiscuous mode
[5349380.066332] cbr0: port 1(veth924e7693) entered disabled state
[5363766.236135] EXT4-fs (dm-0): error count since last fsck: 32
[5363766.240072] EXT4-fs (dm-0): initial error at time 1492716166: ext4_do_update_inode:4652
[5363766.248073] EXT4-fs (dm-0): last error at time 1498397621: ext4_find_entry:1450: inode 9444726
[5450273.756116] EXT4-fs (dm-0): error count since last fsck: 32
[5450273.759960] EXT4-fs (dm-0): initial error at time 1492716166: ext4_do_update_inode:4652
[5450273.760604] EXT4-fs (dm-0): last error at time 1498397621: ext4_find_entry:1450: inode 9444726

@daxtens

daxtens commented Jun 27, 2017 via email

@zhan849

zhan849 commented Jun 28, 2017

@daxtens it looks like it. I found someone else who posted similar problems and suspected it was a hypervisor problem, but there was no further discussion.

@pierreozoux
Contributor

Should we switch to overlay2, as discussed here?
We also see this on our staging cluster running kops 1.7.0 with k8s 1.7.2:

failed to register layer: link /var/lib/docker/overlay/be836e6250911702549cdc77fbb598aa738f223516d0bb0872a8f0a860250edf/root/var/lib/yum/yumdb/l/de98a95d9c8b6c75f9c747096419d5e3c5017e0d-libdb-5.3.21-19.el7-x86_64/checksum_type /var/lib/docker/overlay/d4446fe19efb4a67daa5d5bd038ce29ac793baa28772d9562a989ba92442257a/tmproot592051146/var/lib/yum/yumdb/l/de98a95d9c8b6c75f9c747096419d5e3c5017e0d-libdb-5.3.21-19.el7-x86_64/checksum_type: too many links
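In case it helps, a minimal sketch of switching a single node's daemon to overlay2 (assuming Docker 1.12+ with daemon.json support on a 4.x kernel and an ext4-backed /var/lib/docker; note that changing the storage driver hides existing images/containers until they are re-pulled, and on a kops-managed node the driver is normally set through the cluster spec rather than by hand-editing the node, so this only illustrates the daemon-level setting):

sudo systemctl stop docker
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "storage-driver": "overlay2"
}
EOF
sudo systemctl start docker
docker info | grep -i 'storage driver'   # should now report overlay2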

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 4, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 9, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@Cryptophobia
Contributor

Any chance of a Debian image that uses a newer kernel (> 4.13)? We are running into obscure docker overlayfs bugs that are related to older kernels. We don't want to switch docker to the aufs storage driver, as that is older technology.

moby/moby#19647

@Cryptophobia
Contributor

/reopen

@k8s-ci-robot
Contributor

@Cryptophobia: you can't re-open an issue/PR unless you authored it or you are assigned to it.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
