Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression? juju 2.9 / get_unit_address() causes disconnect? - breaks libjuju #615

Closed
ajkavanagh opened this issue Jan 12, 2022 · 4 comments

Comments

@ajkavanagh
Copy link

So this is possibly connected to #458, but I found the this bug integrating get_unit_address() to fetch the unit address unconditionally on an OpenStack cloud.

  • Juju 2.9.22
  • python 3.5 (xenial)
  • libjuju 2.9.5 // master

Essentially, like #458, libjuju gets a disconnect:

Traceback (most recent call last):
  File "/home/ubuntu/git/github.com/openstack-charmers/zaza/scripts/fetch1.py", line 51, in <module>
    run_it(get_unit(unit_num))
  File "/home/ubuntu/git/github.com/openstack-charmers/zaza/scripts/fetch1.py", line 36, in run_it
    result = task.result()
  File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
    raise self._exception
  File "/usr/lib/python3.5/asyncio/tasks.py", line 239, in _step
    result = coro.send(None)
  File "/home/ubuntu/git/github.com/openstack-charmers/zaza/scripts/fetch1.py", line 43, in get_unit
    await get_address(units[n])
  File "/home/ubuntu/git/github.com/openstack-charmers/zaza/scripts/fetch1.py", line 26, in get_address
    print("{} Address: get_public_address() {}".format(unit.name, await unit.get_public_address()))
  File "/home/ubuntu/git/github.com/openstack-charmers/zaza/.tox/func-target/lib/python3.5/site-packages/juju/unit.py", line 122, in get_public_address
    timeout=timeout)
  File "/home/ubuntu/git/github.com/openstack-charmers/zaza/.tox/func-target/lib/python3.5/site-packages/juju/model.py", line 883, in block_until
    raise websockets.ConnectionClosed(1006, 'no reason')
websockets.exceptions.ConnectionClosed: WebSocket connection is closed: code = 1006 (connection closed abnormally [internal]), reason = no reason

The smallest reproducer that will definitely trigger this behaviour is these two scripts:

#!/bin/bash

_dir="$( cd "$(dirname "${BASH_SOURCE[0]}" )" && pwd)"
runner="${_dir}/fetch1.py"

# loop 10 times and fetch an instance address
i=0
while [ $i -ne 10 ];
do
    printf "\n\n\n!!!!!!"
    printf "\n\n\nDoing number $i"
    printf "\n\n"
    $runner $i
    i=$(($i+1))
done

and fetch1.py:

#!/usr/bin/env python3

from juju.model import Model
import asyncio
import sys

# set to use something other than the current model
MODEL=None


async def get_units():
    model = Model()
    if MODEL is None:
        await model.connect()
    else:
        await model.connect_model(MODEL)
    units = sorted(model.applications['ubuntu'].units, key=lambda u: u.name)
    await model.disconnect()
    return units


async def get_address(unit):
    model = Model()
    await model.connect_model(MODEL)
    print("{} Address: .public_address {}".format(unit.name, unit.public_address))
    print("{} Address: get_public_address() {}".format(unit.name, await unit.get_public_address()))
    print("{} Address: .public_address {}".format(unit.name, unit.public_address))
    print("\n")
    await model.disconnect()


def run_it(step):
    loop = asyncio.get_event_loop()
    task = loop.create_task(step)
    loop.run_until_complete(asyncio.wait([task], loop=loop))
    result = task.result()
    return result


async def get_unit(n):
    units = await get_units()
    print("units", units)
    await get_address(units[n])    # <-- triggers error; if commented out, no problems occur.


if __name__ == '__main__':
    unit_num = 0
    if len(sys.argv) > 1:
        unit_num = int(sys.argv[1])

    run_it(get_unit(unit_num))
    asyncio.get_event_loop().close()

i.e. bash -> python, and the python code may cause a disconnect:

Doing number 6

/home/ubuntu/git/github.com/openstack-charmers/zaza/.tox/func-target/lib/python3.5/site-packages/paramiko/transport.py:33: CryptographyDeprecationWarning: Python 3.5 support will be dropped in th
e next release of cryptography. Please upgrade your Python.
  from cryptography.hazmat.backends import default_backend
units [<Unit entity_id="ubuntu/0">, <Unit entity_id="ubuntu/1">, <Unit entity_id="ubuntu/2">, <Unit entity_id="ubuntu/3">, <Unit entity_id="ubuntu/4">, <Unit entity_id="ubuntu/5">, <Unit entity_i
d="ubuntu/6">, <Unit entity_id="ubuntu/7">, <Unit entity_id="ubuntu/8">, <Unit entity_id="ubuntu/9">]
ubuntu/6 Address: .public_address 172.20.1.249
ubuntu/6 Address: get_public_address() 172.20.1.249
ubuntu/6 Address: .public_address 172.20.1.249





!!!!!!


Doing number 7

/home/ubuntu/git/github.com/openstack-charmers/zaza/.tox/func-target/lib/python3.5/site-packages/paramiko/transport.py:33: CryptographyDeprecationWarning: Python 3.5 support will be dropped in th
e next release of cryptography. Please upgrade your Python.
  from cryptography.hazmat.backends import default_backend
units [<Unit entity_id="ubuntu/0">, <Unit entity_id="ubuntu/1">, <Unit entity_id="ubuntu/2">, <Unit entity_id="ubuntu/3">, <Unit entity_id="ubuntu/4">, <Unit entity_id="ubuntu/5">, <Unit entity_i
d="ubuntu/6">, <Unit entity_id="ubuntu/7">, <Unit entity_id="ubuntu/8">, <Unit entity_id="ubuntu/9">]
ubuntu/7 Address: .public_address None
Traceback (most recent call last):
  File "/home/ubuntu/git/github.com/openstack-charmers/zaza/scripts/fetch1.py", line 51, in <module>
    run_it(get_unit(unit_num))
  File "/home/ubuntu/git/github.com/openstack-charmers/zaza/scripts/fetch1.py", line 36, in run_it
    result = task.result()
  File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
    raise self._exception
  File "/usr/lib/python3.5/asyncio/tasks.py", line 239, in _step
    result = coro.send(None)
  File "/home/ubuntu/git/github.com/openstack-charmers/zaza/scripts/fetch1.py", line 43, in get_unit
    await get_address(units[n])
  File "/home/ubuntu/git/github.com/openstack-charmers/zaza/scripts/fetch1.py", line 26, in get_address
    print("{} Address: get_public_address() {}".format(unit.name, await unit.get_public_address()))
  File "/home/ubuntu/git/github.com/openstack-charmers/zaza/.tox/func-target/lib/python3.5/site-packages/juju/unit.py", line 122, in get_public_address
    timeout=timeout)
  File "/home/ubuntu/git/github.com/openstack-charmers/zaza/.tox/func-target/lib/python3.5/site-packages/juju/model.py", line 883, in block_until
    raise websockets.ConnectionClosed(1006, 'no reason')

For the sake of clarify this means that repeatedly calling get_unit_address() between invocations of the python interpreter, loading libjuju fresh can trigger the disconnect.

There's obviously something strange going on in terms of the juju controller and idles' or disconnects?

Testing on juju 2.8.x does not show these problems.

ajkavanagh added a commit to ajkavanagh/zaza that referenced this issue Jan 13, 2022
Due to bug [1], it seems that there is a issue with libjuju
communicating with the Juju controller which causes a permanent
model disconnect that libjuju doesn't resolve.  Thus, for zaza, this
patch wraps unit.get_public.address() with a wrapper that can choose,
based on the environment variable `ZAZA_FEATURE_BUG472` to use a
subprocess shell call to the `juju status` command to get the public
address. This should always succeed.

The feature environment variable `ZAZA_FEATURE_BUG472` was added so that
the library can switch between the native libjuju function and the
fallback wrapper to enable testing of the issue as libjuju continues to
evolve.

By default, the wrapper function is used, to enable zaza to interoperate
with libjuju with Juju 2.9 on OpenStack providers.

The implementation is slightly complicated because an async version of
the wrapper `get_unit_public_address()` is needed as it is called from
async code.

[1]: juju/python-libjuju#615
ajkavanagh added a commit to ajkavanagh/zaza that referenced this issue Jan 19, 2022
Due to bug [1], it seems that there is a issue with libjuju
communicating with the Juju controller which causes a permanent
model disconnect that libjuju doesn't resolve.  Thus, for zaza, this
patch wraps unit.get_public.address() with a wrapper that can choose,
based on the environment variable `ZAZA_FEATURE_BUG472` to use a
subprocess shell call to the `juju status` command to get the public
address. This should always succeed.

The feature environment variable `ZAZA_FEATURE_BUG472` was added so that
the library can switch between the native libjuju function and the
fallback wrapper to enable testing of the issue as libjuju continues to
evolve.

By default, the wrapper function is used, to enable zaza to interoperate
with libjuju with Juju 2.9 on OpenStack providers.

The implementation is slightly complicated because an async version of
the wrapper `get_unit_public_address()` is needed as it is called from
async code.

[1]: juju/python-libjuju#615
@cderici
Copy link
Member

cderici commented Feb 3, 2022

This is not about Juju, it's on the pylibjuju's side, and the main theme of the issue is mutation. Here's what's going on in your example:

When you're running get_units, you create a Model object m (I'll call this m1), create a connection for it, then get the Unit objects (that are inheriting their connections from m1), then after getting all the references for the Units from m1, you disconnect m1, which destroys the _ws websocket object within the connection (reference of which all the Units share).

Then for each Unit you got from m1, before getting the public address (i.e. making a facade call with that Unit), you create a temporary Model object (m2) and connect it to the same logical model from before (e.g. current model), thereby creating new _ws object and new Unit objects etc. However, the "unit" that you take as an argument to your get_address method is from the first model m1 that you've disconnected, meaning their connection object no longer has _ws objects, which is why the facade call at the get_public_address is failing.

In other words, when you create the new Model m2, the connections of the Unit objects from the first disconnected Model m1 are not updated, even though the Model object is connected to the same logical model. And unfortunately I don't think there's currently any way of updating them either.

So if you want to make facade calls from the units of a model at any time in the future (or call methods that uses the connection in any way, such as calling get_public_address), you need to keep the Model's connection alive.
Any ideas @SimonRichardson ?

(p.s. The units that you're able to get the public addresses of are the ones that already have addresses in their safe_datas. The get_public_address method has a shortcut for it, so they're not making a facade call. If you remove that shortcut then they fail just like the others)

@cderici cderici closed this as completed Feb 3, 2022
@ajkavanagh
Copy link
Author

Hi Caner

Thanks for the clarification. To ensure my understanding, please could you confirm:

Units/objects that the API has provided from a model are invalid if the model disconnects, and the exception is "websockets.ConnectionClosed".

and:

If you have to reconnect, then all objects related to that model connection need to be recreated.

Does this mean that I need to assume that the model is disconnected, make a connection, get all the data they need and then disconnect; it's the only way to be sure that it might work (assuming the model doesn't disconnect between connecting and other actions on it).

Do all interactions with a model need to be wrapped in a try: except: in case the "websockets.ConnectionClosed" exception is raised?

Also, I've noticed that the issue is closed? Is that because it's not fixable or that it is fixed?

Thanks very much!

@cderici
Copy link
Member

cderici commented Feb 7, 2022

Hey Alex,

You are correct. The Model object is to be kept connected for the entirety of the computation, it is not designed for like a plug-and-play sort of use. Further, the contingencies are in place for the Model to reconnect gracefully if it's not disconnected on purpose, so you do not need to handle potential connection issues along the way. Just don't disconnect it until you're done for sure, and everything should work well.

I closed the issue because to fix your problem all you need to do is to just remove the lines that disconnect the model prematurely and it should work fine. We can re-open the issue no problem if further work is needed 👍

As a side info; we also thought that the entity data should not be available after the associated Model is disconnected, as it's not possible to validate that it's up-to-date with what's going on in Juju's side (as we're not getting any deltas anymore after disconnect). In your example some of your Units are able to get their addresses, because currently that data is available even though the model is disconnected (again, despite that there's no way of knowing if the unit's actual address is changed behind the scenes after disconnect). You can see the details on #629 if you're interested, though it's currently held for further discussion.

Hope this clears it up for you, let us know if you need further info or action on this, cheers!

@ajkavanagh
Copy link
Author

Hi Caner

Thanks for the confirmation. It does make it a bit awkward for zaza as it's a sync application trying to use an async library. Essentially, I need to arrange for it to have the async loop running (so that model updates occur and the objects attached to those models are updated).

My solution will be to put the libjuju loop into a background thread and then inject async calls into the loop to access libjuju functions using run_coroutine_threadsafe. The GIL will make the objects thread-safe for read access which means they can be used on the sync side more conveniently.

With luck I can keep all the 'interfaces' in zaza the same and thus the existing codebase won't need to change. At least that's the plan. I'll let you know how I get on!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants