Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

amulet run hangs #184

Closed
hloeung opened this issue Aug 9, 2017 · 4 comments
Closed

amulet run hangs #184

hloeung opened this issue Aug 9, 2017 · 4 comments

Comments

@hloeung
Copy link

hloeung commented Aug 9, 2017

Hi,

Trying to run the amulet tests against changes I made to the graylog charm[1], amulet hangs inside deployer.setup() and I'm not sure why.

| [hloeung@dharkan graylog]$ python3 tests/10-basic.py
| 2017-08-09 17:50:10 Starting deployment of canonistack-haw-lcy02:admin/default
| 2017-08-09 17:50:15 Deploying applications...
| 2017-08-09 17:50:15 Deploying application apache2 using cs:xenial/apache2-21
| 2017-08-09 17:50:22 Deploying application elasticsearch using cs:~elasticsearch-charmers/xenial/elasticsearch-0
| 2017-08-09 17:50:29 Deploying application graylog using /tmp/charmexvatjhf/xenial/graylog
| 2017-08-09 17:50:51 Deploying application mongodb using cs:xenial/mongodb-43
| 2017-08-09 17:56:40 Adding relations...
| 2017-08-09 17:56:40 Adding relation graylog:mongodb <-> mongodb:database
| 2017-08-09 17:56:40 Adding relation graylog:elasticsearch <-> elasticsearch:client
| 2017-08-09 17:56:41 Adding relation graylog:website <-> apache2:reverseproxy
| 2017-08-09 17:57:46 Deployment complete in 455.78 seconds
(hangs here)

Process listing shows a hung juju process:

| 21901 pts/6 S+ 0:00 | _ python3 tests/10-basic.py
| 23540 pts/6 Z+ 0:00 | _ [juju]

I've had someone else in my team also able to reproduce this so it's not my local set up.

[1]https://code.launchpad.net/~hloeung/graylog-charm/+git/graylog-charm

@tvansteenburgh
Copy link
Contributor

How do you know it's hung? How long are you waiting? Amulet tries to wait until the deployment has settled (all agents idle) before sentry.wait() returns.

@hloeung
Copy link
Author

hloeung commented Aug 10, 2017

So it seems to complete after about 2 hours. I've added some debugging output to the test as follows:

print("{} running cls.d.setup()".format(datetime.datetime.now().strftime("%d %b %Y %H:%M:%S")))
try:
    cls.d.setup(timeout=900)
    print("{} done running cls.d.setup(), running cls.d.sentry.wait()".format(datetime.datetime.now().strftime("%d %b %Y %H:%M:%S")))
    cls.d.sentry.wait()
    print("{} done cls.d.sentry.wait()".format(datetime.datetime.now().strftime("%d %b %Y %H:%M:%S")))
except amulet.helpers.TimeoutError:
    # Setup didn't complete before timeout
    pass
print("{} cls.elasticsearch = cls.d.sentry...".format(datetime.datetime.now().strftime("%d %b %Y %H:%M:%S")))

From that, the bulk of the time seems to be within amulet.Deployment.setup():

10 Aug 2017 10:14:18 running cls.d.setup()
10 Aug 2017 12:25:14 done running cls.d.setup(), running cls.d.sentry.wait()
10 Aug 2017 12:25:16 done cls.d.sentry.wait()
10 Aug 2017 12:25:16 cls.elasticsearch = cls.d.sentry...

Full output in pastebin - http://paste.ubuntu.com/25280876/
Juju status output in a loop in pastebin - http://paste.ubuntu.com/25280878/

I'm using snaps for juju:

[hloeung@dharkan ~]$ snap list
Name Version Rev Developer Notes
charm 2.2 15 charms classic
core 16-2.26.14 2462 canonical -
juju 2.2.2 2142 canonical classic

This is with the latest amulet from GitHub (so commit f1a5bad).

@hloeung
Copy link
Author

hloeung commented Aug 10, 2017

I added a wrapper to juju in a path before the snap path and it seems to be hanging on this:

2017-08-10 15:11:02 ssh -m canonistack-haw-lcy02:admin/default elasticsearch/35 -v mkdir -p -m a=rwx /tmp/amulet

From that, I found it's hanging on subprocess.communicate and was able to reproduce it:

[hloeung@dharkan graylog]$ python3

import subprocess
cmd = ['juju', 'ssh', '-m', 'canonistack-haw-lcy02:admin/default', 'mongodb/33', '-v', 'mkdir', '-p', '-m', 'a=rwx', '/tmp/haw-test']
p = subprocess.Popen(cmd, stderr=subprocess.PIPE, stdout=subprocess.PIPE,)
stdout, stderr = p.communicate()

I think we could add a timeout to subprocess.communicate() calls, but that breaks with earlier versions of python (timeout was introduced in 3.3)

hloeung added a commit to hloeung/amulet that referenced this issue Aug 10, 2017
There's cases where amulet would hang for long periods of time (hours
and hours). It seems this is due to subprocess.communicate() calls
blocking with hung juju processes (see GitHub juju#184). This updates
calls to use a hard timeout.

Unfortunately, timeout was introduced in Python 3.3. so this breaks
for older versions of Python.
@hloeung
Copy link
Author

hloeung commented Aug 10, 2017

This isn't a fix for why the juju commands are hanging, so will still need to look into that.

@hloeung hloeung closed this as completed Nov 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants