LXD Install Stalls #151

Closed
paul-oreilly opened this Issue Dec 13, 2016 · 5 comments

paul-oreilly commented Dec 13, 2016

When installing Canonical Kubernetes on localhost (LXD), the install process repeatedly stalls. I tried twice with the default bundle, then a third time after cloning the repo to add/increase the resource limits in bundle.yaml, leaving the process running overnight. The cloned repo is at commit a585523.

In each case, the process stalls for hours with the Kubernetes master node state remaining on "Installing", and after about 8 hours it settles on "Rendering authentication templates". A screenshot of "juju status" is below, taken after roughly 24 hours of install time.

[Screenshot: "juju status" output after roughly 24 hours]

Each time, the initial multi-hour stall begins after the master node logs show:

2016-12-12 20:44:18 INFO install Successfully installed MarkupSafe-0.23 PyYAML-3.12 Tempita-0.5.2 charmhelpers-0.10.0 charms.reactive-0.4.5 netaddr-0.7.18 pip-8.1.2 pyaml-16.11.4
2016-12-12 20:44:21 INFO juju-log Reactive main running for hook install
2016-12-12 20:44:22 INFO juju-log Invoking reactive handler: reactive/kubernetes_master.py:55:install

After eight hours or so, the node's status updates to "Rendering authentication templates", as in the screenshot above.
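
To pin down where a stall like this begins, one option is to scan the unit log for large gaps between consecutive timestamps. The following is a minimal sketch, not part of Juju. It assumes each log entry starts with a `YYYY-MM-DD HH:MM:SS` timestamp, as in the excerpt above. The function name `find_stalls` and the 30-minute default threshold are illustrative choices.

```python
from datetime import datetime, timedelta


def find_stalls(lines, threshold=timedelta(minutes=30)):
    """Return (gap, previous_line, next_line) for each gap longer than threshold."""
    stalls = []
    prev_time, prev_line = None, None
    for line in lines:
        try:
            # Assumes entries begin with "YYYY-MM-DD HH:MM:SS" (19 characters).
            stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # skip continuation lines that carry no timestamp
        if prev_time is not None and stamp - prev_time > threshold:
            stalls.append((stamp - prev_time, prev_line, line))
        prev_time, prev_line = stamp, line
    return stalls
```

Passing the lines of the master unit's log file to `find_stalls` would list each long pause together with the last entry logged before it, which is usually the handler to look at first.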

Running grep over the logs to check for errors gives:

2016-12-12 20:22:05 ERROR juju.worker.dependency engine.go:539 "metric-collect" manifold worker returned unexpected error: failed to read charm from: /var/lib/juju/agents/unit-kubernetes-master-0/charm: stat /var/lib/juju/agents/unit-kubernetes-master-0/charm: no such file or directory
2016-12-12 20:22:08 ERROR juju.worker.dependency engine.go:539 "metric-collect" manifold worker returned unexpected error: failed to read charm from: /var/lib/juju/agents/unit-kubernetes-master-0/charm: stat /var/lib/juju/agents/unit-kubernetes-master-0/charm: no such file or directory
2016-12-12 20:22:11 ERROR juju.worker.dependency engine.go:539 "metric-collect" manifold worker returned unexpected error: failed to read charm from: /var/lib/juju/agents/unit-kubernetes-master-0/charm: stat /var/lib/juju/agents/unit-kubernetes-master-0/charm: no such file or directory
2016-12-12 20:22:14 ERROR juju.worker.dependency engine.go:539 "metric-collect" manifold worker returned unexpected error: failed to read charm from: /var/lib/juju/agents/unit-kubernetes-master-0/charm: stat /var/lib/juju/agents/unit-kubernetes-master-0/charm: no such file or directory
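
When the same error repeats every few seconds, as above, it can help to collapse the grep output into distinct messages with counts. A minimal sketch follows, assuming the `timestamp LEVEL message` layout seen in these lines. `summarize_errors` is a hypothetical helper name, not a Juju tool.

```python
import re
from collections import Counter

# Matches "YYYY-MM-DD HH:MM:SS LEVEL rest-of-line".
LOG_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.*)$")


def summarize_errors(lines):
    """Count ERROR entries by message text, ignoring the timestamp."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group(2) == "ERROR":
            counts[match.group(3)] += 1
    return counts
```

Applied to the four lines above, this would report a single distinct message seen four times, which makes it easier to spot any other error hiding among the repeats.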

The complete Kubernetes master node logs are here: http://paste.ubuntu.com/23625873/

As troubleshooting steps so far, I've done a complete reinstall of Ubuntu 16.04.1, run "apt-get update && apt-get upgrade", added the PPAs for LXD and Juju, updated both packages, and run through "lxd init" keeping the defaults. I then cloned the repo here, edited bundle.yaml, and installed with "juju deploy ./bundle.yaml".

Contributor

lazypower commented Dec 14, 2016

Hi Paul,

The title of the issue really had me going at first. We have been testing very aggressively on bare metal (backed by MAAS), so seeing a report of it stalling gave me cause for alarm.

The LXD defaults are what's causing you trouble. By default there is a very strict set of AppArmor profiles and limited access permissions, and we haven't yet come up with a clear way to tell the end user that this is the case.

What you can do, if you prefer to keep testing on LXD, is install the conjure-up package and run:

conjure-up canonical-kubernetes

The big thing happening behind the scenes is that conjure-up creates and alters the profile assigned to the LXD container to allow the privilege escalation it requires to run Kubernetes.

If you want to know the exact settings it's tuning, the profile edits can be found in the spell.

Here's the profile it's actually using:

https://github.com/conjure-up/spells/blob/master/kubernetes-core/steps/lxd-profile.yaml
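
To compare a local profile against the settings you expect, a small check over the text printed by "lxc profile show" can flag what is missing. This is a rough sketch only. The two keys in `ASSUMED_REQUIRED` are real LXD configuration keys, but treating them as what Kubernetes needs is an assumption for illustration and is not copied from the spell, so consult the linked file for the real list. The parser is line-based and deliberately naive.

```python
def missing_profile_settings(profile_text, required):
    """Return the required key/value pairs not present in an LXD profile dump."""
    found = {}
    for raw in profile_text.splitlines():
        line = raw.strip()
        if ":" not in line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        # Drop surrounding quotes so "true" and true compare equal.
        found[key.strip()] = value.strip().strip("\"'")
    return {k: v for k, v in required.items() if found.get(k) != v}


# Assumed settings for illustration; not taken from the conjure-up spell.
ASSUMED_REQUIRED = {
    "security.privileged": "true",
    "security.nesting": "true",
}
```

An empty result means every expected setting was found. Anything returned is a setting that is either absent from the profile or set to a different value.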

Let me know if this doesn't resolve your problem. I'm happy to hop in a hangout and do some real-time troubleshooting to get you unblocked as quickly as possible.

lazypower self-assigned this Dec 14, 2016

paul-oreilly commented Dec 14, 2016

Many thanks, that resolved it nicely - all installed well, and using conjure-up is IMO a much better user experience. I'd suggest it's worth changing the "Getting Started" video to use conjure-up as it is a far superior first run experience.

However, the documentation will need to show that users (currently) will have to install conjure-up from the PPA and not with the default (and current documentation) of "apt install", as the default version is one that is unable to find canonical-kubernetes to install at all.

A minor note worth adding in the documentation somewhere is that conjure-up will show an error about LXD not having been initialised and missing its network bridge if you have forgotten to add the user the command is running as to the lxc group. (Oops!)

Huge thanks for your time. Canonical Kubernetes is fantastic.

Member

marcoceppi commented Dec 15, 2016

Thanks for the feedback!

> I'd suggest it's worth changing the "Getting Started" video to use conjure-up as it is a far superior first run experience

@castrojo It'd be good for us to re-do the introduction video with conjure-up

> the documentation will need to show that users (currently) will have to install conjure-up from the PPA and not with the default (and current documentation) of "apt install", as the default version is one that is unable to find canonical-kubernetes to install at all

@battlemidget Where is conjure-up in the queue for updates for xenial?

> conjure-up will show an error about LXD not having been initialised and missing its network bridge if you have forgotten to add the user the command is running as to the lxc group. (Oops!)

This is a good point. We could add some validation to conjure-up that checks things like whether the user has been added to the group. I've opened conjure-up/conjure-up#521 to track this.
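
A pre-flight check of this kind could look roughly like the sketch below. It is not conjure-up's implementation, and `user_in_group` is a hypothetical helper. The group name is left as a parameter because this thread calls it the lxc group, while LXD packages on Ubuntu typically use a group named lxd. The sketch relies on Python's standard `grp` and `pwd` modules, so it assumes a Unix-like host.

```python
import grp
import pwd


def user_in_group(username, group_name):
    """Return True if username belongs to group_name (supplementary or primary)."""
    try:
        group = grp.getgrnam(group_name)
    except KeyError:
        return False  # the group does not exist on this host
    if username in group.gr_mem:
        return True
    try:
        # A user's primary group is not listed in gr_mem, so compare GIDs.
        return pwd.getpwnam(username).pw_gid == group.gr_gid
    except KeyError:
        return False  # the user does not exist on this host
```

Note that this reads the group database rather than the current process's credentials, so a user who was just added to the group may pass the check but still need to start a new login session before the membership takes effect.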

Contributor

battlemidget commented Dec 15, 2016

@marcoceppi we are coordinating our release to the archive with Juju 2.1. I don't have an exact date, but Juju 2.1 rc1 is supposed to go out today, so hopefully very soon.

Contributor

lazypower commented Jan 5, 2017

This was addressed in #170

If this continues to be an issue, please re-open the bug and let's continue the conversation.

lazypower closed this Jan 5, 2017

lazypower changed the title from "Bare Metal Install Stalls" to "LXD Install Stalls" on Jan 19, 2017
