openstack: octavia --provider ovn healthmon leaks ports #921

Open
garloff opened this issue Mar 17, 2024 · 15 comments
Labels
bug Something isn't working

garloff commented Mar 17, 2024

Testing on gx-scs (which runs OSISM 7.0.0d aka SCS IaaS R6 RC3):
Creating an octavia loadbalancer with --provider ovn and a healthmonitor with members in several subnets results in a port in each subnet with name ovn-lb-hm-$SUBNETID.
When deleting the members and healthmonitor again, only the port in the first subnet is cleaned up, the ports in the other subnets remain. This leaks ports and prevents the deletion of the subnets (and the networks that contain them).

garloff commented Mar 17, 2024

Instructions to reproduce:
Create networks and subnets

for net in 1 2 3; do openstack network create net$net; done
#[...]
for net in 1 2 3; do
  openstack subnet create --network net$net --subnet-range 192.168.$net.0/24 subnet$net
done
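
To double-check that the networks and subnets exist before continuing, a plain listing is enough (standard CLI, nothing specific to this bug):

openstack network list -f value -c Name | grep '^net[123]$'
openstack subnet list -f value -c Name -c Subnet | grep '^subnet[123] '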

Boot three servers (it might actually be enough to allocate three ports instead; see the sketch after the commands below).

openstack keypair create testkey > ~/.ssh/testkey.pem
chmod 0600 ~/.ssh/testkey.pem 
DEB12=$(openstack image list --name="Debian 12" -f value -c ID)
echo $DEB12
#[...]
for net in 1 2 3; do
  openstack server create --network net$net --key-name testkey --flavor SCS-1L-1 \
    --block-device boot_index=0,uuid=$DEB12,source_type=image,volume_size=8,destination_type=volume,delete_on_termination=true \
    vm$net
done
#[...]
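
As noted above, allocating three plain ports instead of booting servers might be enough to trigger the behavior (that is what the test script further down does). A minimal sketch of that variant, assuming jq is available; the port names are made up for this example:

for net in 1 2 3; do
  openstack port create --network net$net port$net >/dev/null
  FIXED=$(openstack port show port$net -f json | jq -r '.fixed_ips[0].ip_address')
  echo "port$net: $FIXED"
done

The member creation further down would then use these addresses instead of the server IPs.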

And now create loadbalancer with listener, pool, healthmon and three members:

openstack loadbalancer create --provider ovn --vip-subnet-id=subnet3 --name testlb --wait
#[...]
openstack loadbalancer pool create --loadbalancer testlb --name testpool --protocol TCP --lb-algorithm SOURCE_IP_PORT --wait
#[...]
openstack loadbalancer listener create --name testlistener --default-pool testpool --protocol TCP --protocol-port 80 testlb --wait
#[...]
openstack loadbalancer healthmonitor create --name testhm --delay 3 --timeout 2 --max-retries 1 --max-retries-down 1 --type TCP testpool --wait
#[...]
for net in 1 2 3; do
  FIXED=$(openstack server list --name vm$net -f value -c Networks | sed "s/^[^\[]*\['\([0-9]*\.[0-9]*\.[0-9]*\.[0-9]*\)'.*\$/\1/")
  echo $FIXED
  openstack loadbalancer member create --name member$net --address $FIXED --subnet-id subnet$net --protocol-port 80 testpool --wait
done
#[...]
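
Optionally, to confirm that all three members were attached and the health monitor is in place, the pool and status tree can be inspected:

openstack loadbalancer member list testpool
openstack loadbalancer status show testlb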

We now have three servers in three networks, a loadbalancer member for each of them, and a health monitor.
If we look at the ports in subnet3, we find:

openstack port list --fixed-ip subnet=subnet3
+--------------------------------------+-------------------------------------------------+-------------------+-------------------------------------------------------------------------------+--------+
| ID                                   | Name                                            | MAC Address       | Fixed IP Addresses                                                            | Status |
+--------------------------------------+-------------------------------------------------+-------------------+-------------------------------------------------------------------------------+--------+
| 0737eb6b-305b-4174-a3b1-6b8055dd9527 |                                                 | fa:16:3e:bb:17:18 | ip_address='192.168.3.190', subnet_id='eebc5377-7c93-4012-87a6-37d8ba19d9a5' | ACTIVE |
| 65645d99-909f-4c72-937d-a2561af53561 |                                                 | fa:16:3e:88:68:a0 | ip_address='192.168.3.2', subnet_id='eebc5377-7c93-4012-87a6-37d8ba19d9a5'   | DOWN   |
| 78d1cbba-b549-41e2-849a-4453dfba7bd0 | ovn-lb-vip-b9c9c4a3-3d31-4028-bb4e-e595b66a55d9 | fa:16:3e:17:f1:0b | ip_address='192.168.3.148', subnet_id='eebc5377-7c93-4012-87a6-37d8ba19d9a5' | DOWN   |
| 8873b3f2-f414-492a-9d30-0a870620dce1 | ovn-lb-hm-eebc5377-7c93-4012-87a6-37d8ba19d9a5  | fa:16:3e:88:cc:4e | ip_address='192.168.3.24', subnet_id='eebc5377-7c93-4012-87a6-37d8ba19d9a5'  | DOWN   |
+--------------------------------------+-------------------------------------------------+-------------------+-------------------------------------------------------------------------------+--------+

There is the DHCP port (.2) in each subnet, the server port (.190 in this example), a port for the health monitor, and a port for the virtual IP of the listener. The latter exists only in subnet3, since that is the VIP subnet we chose when creating the LB. So subnet1 and subnet2 each have one port fewer.
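
For comparison, the same listing for the other two subnets should show an ovn-lb-hm-$SUBNETID port each, but no VIP port:

for net in 1 2; do openstack port list --fixed-ip subnet=subnet$net -c Name -f value; done
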
Let's clean up the whole thing:

for net in 1 2 3; do openstack loadbalancer member delete testpool member$net --wait; done
# No output, takes a few seconds
openstack loadbalancer healthmonitor delete testhm --wait
# No output

If we now look at the ports in the subnets, we can see that in subnet3, the ovn-lb-hm-$SUBNETID port is gone while the VIP port ovn-lb-vip-$LBID still exists. So far, so good.
However, the ovn-lb-hm-$SUBNETID ports still exist in subnet2 and in subnet1.
Looks like a bug where the code that cleans up the HM ports uses the VIP subnet instead of all member subnets. (As the members have already been removed, the member subnets may not be readily available ...)
This is enough to see the bug.
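
A quick way to make the leak visible, using the same port listing as above, filtered to the hm ports:

for net in 1 2 3; do
  echo subnet$net:
  openstack port list --fixed-ip subnet=subnet$net -c Name -f value | grep ovn-lb-hm
done
# subnet1 and subnet2 still list an ovn-lb-hm-$SUBNETID port, subnet3 does not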

garloff commented Mar 17, 2024

For completeness, let's do the cleanup here.

openstack loadbalancer delete --cascade testlb --wait
# No output
openstack server delete vm1 vm2 vm3 --wait
# No output

subnet3 is clean now, but subnet1 and subnet2 still have the leaked port from the health monitor.
So we need an extra cleanup step that should not be necessary:

for net in 1 2 3; do 
  PORT=$(openstack port list --fixed-ip subnet=subnet$net -f value -c ID -c Name | grep ovn-lb-hm | cut -f1 -d" ")
  if test -n "$PORT"; then openstack port delete $PORT; fi
done

Now we can proceed normally again and clean up the rest:

 for net in 1 2 3; do openstack subnet delete subnet$net; openstack network delete net$net; done

And also remove the generated key pair:

openstack keypair delete testkey

garloff commented Mar 18, 2024

If we delete the lb healthmonitor first, things are fine and all ovn-lb-hm ports are cleaned up.
So it seems the lb member delete path lacks a check for whether a healthmonitor exists and whether we are deleting the last member in a subnet; if so, the ovn-lb-hm port for that subnet would have to be cleaned up as well.
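
In other words, this deletion order avoids the leak (same commands as in the reproducer, just with the health monitor removed before the members):

openstack loadbalancer healthmonitor delete testhm --wait
for net in 1 2 3; do openstack loadbalancer member delete testpool member$net --wait; done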

garloff commented Mar 18, 2024

And another valid workaround: openstack loadbalancer delete --cascade also avoids leaking ports.
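
That is the cascade deletion already used in the cleanup above:

openstack loadbalancer delete --cascade testlb --wait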

garloff commented Mar 19, 2024

ovn_octavia_provider/helper.py: member_delete() only removes the hm port if the whole pool has gone offline.
https://github.com/openstack/ovn-octavia-provider/blob/eb5010f8ec27c61a3455d09f14057e29cd387f05/ovn_octavia_provider/helper.py#L2200
This is wrong IMVHO. It needs to remove the HM port when the last member in that particular subnet is gone.

garloff commented Mar 19, 2024

Code should probably look similar to this:

    def _member_list(self, ovn_lb):
        existing_members = ovn_lb.external_ids.get(
            ovn_const.OVN_MEMBER_STATUS_KEY)
        try:
            existing_members = jsonutils.loads(existing_members)
        except TypeError:
            LOG.debug("no member status on external_ids: %s",
                      str(existing_members))
            existing_members = {}
        return existing_members

    def member_delete(self, member):
        error_deleting_member = False
        try:
            pool_key, ovn_lb = self._find_ovn_lb_by_pool_id(
                member[constants.POOL_ID])

            pool_status = self._remove_member(member, ovn_lb, pool_key)

            if ovn_lb.health_check:
                mysubnet = member[constants.SUBNET_ID]
                member_in_subnet = False
                # sketch: assumes each status entry also carries the member's
                # subnet and provisioning status
                for amember in self._member_list(ovn_lb):
                    if (amember[constants.SUBNET_ID] == mysubnet and
                            amember[constants.PROVISIONING_STATUS] !=
                            constants.DELETED):
                        member_in_subnet = True
                        break
                if not member_in_subnet:
                    # NOTE(garloff): if we were the last member in the subnet
                    # we should clean up the hm-port.
                    # (froyo) We need to do this call after the cleaning of the
                    # ip_port_mappings for the ovn LB.
                    self._clean_up_hm_port(member[constants.SUBNET_ID])
        [...]

garloff commented Mar 19, 2024

I will submit the bug and a fix upstream as soon as I have a suitable test environment.

@berendt berendt added the bug Something isn't working label Mar 25, 2024
@berendt berendt changed the title 7.0.0 octavia --provider ovn healthmon leaks ports openstack: octavia --provider ovn healthmon leaks ports Mar 26, 2024

garloff commented Mar 28, 2024

With #890 out of the way, I'll have a development environment again to tackle this one...

garloff commented Mar 30, 2024

Testcase script:

#!/bin/bash
# Testcase to demonstrate ovn hm port leak (OSISM issue 921)
# Usage: ./testcase_ovn_osism921.sh [CLEANUP]
#  CLEANUP is to remove remaining resources from previous runs,
#  should not normally be needed.
# Script displays errors and has a non-zero exit code in case there are errors.
# (c) Kurt Garloff <scs@garloff.de>, 3/2024
# SPDX-License-Identifier: Apache-2.0

BOLD="\e[0;1m"
RED="\e[0;31m"
GREEN="\e[0;32m"
NORM="\e[0;0m"

NETS="1 2"
NONETS=$(echo $NETS | wc -w)

declare -i cnt=0
# Input: $@ => networks to look at (optional, default: $NETS), testsubnet will be prepended
# Output: $cnt has number of found ovn-lb-hm ports
count_hm_ports()
{
        nets="$@"
        if test -z "$nets"; then nets="$NETS"; fi
        cnt=0
        for net in $nets; do
                let cnt+=$(openstack port list --fixed-ip subnet=testsubnet$net | grep 'ovn\-lb\-hm\-' | wc -l)
        done
        return 0
}

declare -i err=0
# Input: $1: Expected number of ports
#        $2 - ...: networks to look at (optional, see count_hm_ports)
# Output: None (but err will be increased on error)
expect_hm_ports()
{
        expected=$1
        shift
        count_hm_ports "$@"
        echo -en "${BOLD}Found $cnt ovn-lb-hm ports: "
        if test $cnt != $expected; then
                echo -e "${RED}BAD, expected ${expected}${NORM}"
                if test $cnt -gt $expected; then let err+=$((cnt-expected)); else let err+=$((expected-cnt)); fi
        else
                echo -e "${GREEN}GOOD${NORM}"
        fi
}
if test "$1" = "-d" -o "$1" = "--debug"; then set -x; shift; fi
if test "$1" != "CLEANUP"; then
set -e
echo -e "${BOLD}Create $NONETS networks and ports${NORM}"
for net in $NETS; do
        openstack network create testnet$net >/dev/null
        openstack subnet create --network testnet$net --subnet-range 192.168.$net.0/24 testsubnet$net >/dev/null
        for port in 1 2; do
                openstack port create --network testnet$net testport$net.$port >/dev/null
        done
done
echo -e "${BOLD}Setup loadbalancer with ${NONETS}x2 members${NORM}"
openstack loadbalancer create --name testlb --provider ovn --vip-subnet-id testsubnet1 --wait >/dev/null
openstack loadbalancer pool create --loadbalancer testlb --name testpool --protocol TCP --lb-algorithm SOURCE_IP_PORT --wait >/dev/null
openstack loadbalancer listener create --name testlistener --default-pool testpool --protocol TCP --protocol-port 80 testlb --wait >/dev/null
openstack loadbalancer healthmonitor create --name testhm --delay 3 --timeout 2 --max-retries 1 --max-retries-down 1 --type TCP testpool --wait >/dev/null
echo -en " ${BOLD}Add the ${NONETS}x2 ports as members${NORM}"
for net in $NETS; do
        for port in 1 2; do
                FIXED=$(openstack port show testport$net.$port -f json | jq '.fixed_ips[] | .ip_address' | tr -d '"')
                echo -n " $FIXED"
                openstack loadbalancer member create --name testmember$net.$port --address $FIXED --subnet-id testsubnet$net --protocol-port 80 testpool --wait >/dev/null
        done
done
echo
# We should have 2 ovn-lb-hm ports now, one in each subnet
expect_hm_ports $NONETS
set +e
fi      # CLEANUP
# Now start deleting members
echo -e "${BOLD}Deleting one member from each subnet${NORM}"
for net in $NETS; do
        openstack loadbalancer member delete testpool testmember$net.2 --wait
done
expect_hm_ports $NONETS
echo -e "${BOLD}Deleting last members from each subnet${NORM}"
for net in $NETS; do
        openstack loadbalancer member delete testpool testmember$net.1 --wait
done
openstack loadbalancer healthmonitor delete testhm --wait
expect_hm_ports 0
echo -e "${BOLD}Cleaning up the rest of the LB${NORM}"
openstack loadbalancer listener delete testlistener --wait
openstack loadbalancer pool delete testpool --wait
openstack loadbalancer delete testlb --wait
echo -e "${BOLD}Cleaning up ports and networks ${NORM}"
for net in $NETS; do
        for port in 1 2; do
                openstack port delete testport$net.$port
        done
        # Extra cleanup ovn-lb-hm
        OVNHMPORTS=$(openstack port list --fixed-ip subnet=testsubnet$net | grep 'ovn\-lb\-hm' | sed 's/^[^o]*\(ovn\-lb\-hm\-[0-9a-f\-]*\).*$/\1/')
        if test -n "$OVNHMPORTS"; then
                echo -e "${RED}ERROR: $OVNHMPORTS still present, cleaning up${NORM}"
                openstack port delete $OVNHMPORTS
        fi
        openstack subnet delete testsubnet$net
        if test $? != 0; then
                SUBS=$(openstack subnet list -f value -c ID -c Name | grep testsubnet$net | awk '{ print $1; }')
                if test -n "$SUBS"; then openstack subnet delete $SUBS; fi
        fi
        openstack network delete testnet$net
        if test $? != 0; then
                NETIDS=$(openstack network list -f value -c ID -c Name | grep testnet$net | awk '{ print $1; }')
                if test -n "$NETIDS"; then openstack network delete $NETIDS; fi
        fi
done
echo -e "${BOLD}There were $err errors${NORM}"
exit $err
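
Typical invocation, per the usage comment at the top of the script (the filename is just the one used there):

chmod +x testcase_ovn_osism921.sh
./testcase_ovn_osism921.sh           # full run; exits non-zero if unexpected (e.g. leaked) hm ports are found
./testcase_ovn_osism921.sh CLEANUP   # only remove leftovers from a previous (aborted) run
./testcase_ovn_osism921.sh -d        # full run with shell tracing enabled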

garloff commented Mar 30, 2024

Issue is still present with 7.0.1, but I can now test and finalize the fix on my CiaB.

garloff commented Mar 31, 2024

Working patch for ovn_octavia_provider:

--- ovn_octavia_provider/helper.py.orig 2024-03-27 20:17:00.000000000 +0000
+++ ovn_octavia_provider/helper.py      2024-03-31 23:19:13.747183391 +0000
@@ -2046,20 +2046,33 @@ class OvnProviderHelper():
                 user_fault_string=msg,
                 operator_fault_string=msg)

+    def _members_in_subnet(self, ovn_lb, subnet_id):
+        members = []
+        for key, value in ovn_lb.external_ids.items():
+            if key.startswith(ovn_const.LB_EXT_IDS_MEMBER_PREFIX):
+                mem_id, mem_ip_port, mem_subnet = key.split('_')[1:]
+                #mem_ip, mem_port = mem_ip_port.split(':')
+                if mem_subnet == subnet_id:
+                    members.append(mem_id)
+        return members
+
     def member_delete(self, member):
         error_deleting_member = False
         try:
-            pool_key, ovn_lb = self._find_ovn_lb_by_pool_id(
-                member[constants.POOL_ID])
+            pool_id = member[constants.POOL_ID]
+            pool_key, ovn_lb = self._find_ovn_lb_by_pool_id(pool_id)

             pool_status = self._remove_member(member, ovn_lb, pool_key)

-            if ovn_lb.health_check and pool_status == constants.OFFLINE:
-                # NOTE(froyo): if the pool status is OFFLINE there are no more
-                # members. So we should ensure the hm-port is deleted if no
-                # more LB are using it. We need to do this call after the
-                # cleaning of the ip_port_mappings for the ovn LB.
-                self._clean_up_hm_port(member[constants.SUBNET_ID])
+            if ovn_lb.health_check:
+                mem_subnet = member[constants.SUBNET_ID]
+                member_list = self._members_in_subnet(ovn_lb, mem_subnet)
+                if not member_list:
+                    # NOTE(garloff): if we were the last member in the subnet
+                    # we should clean up the hm-port.
+                    # (froyo) We need to do this call after the cleaning of the
+                    # ip_port_mappings for the ovn LB.
+                    self._clean_up_hm_port(member[constants.SUBNET_ID])
         except Exception:
             LOG.exception(ovn_const.EXCEPTION_MSG, "deletion of member")
             error_deleting_member = True
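
To try the patch, it can be applied on top of an ovn-octavia-provider checkout (the patch filename here is arbitrary; on a deployed node the module typically lives under the Python site-packages directory instead, and the Octavia services that load the OVN provider driver need a restart afterwards):

cd ovn-octavia-provider
patch -p0 < ../ovn-hm-port-leak.patch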

garloff commented Mar 31, 2024

Next step is to report the issue upstream and provide the reproducer script and patch ...

garloff commented Apr 20, 2024

Created an upstream issue: https://bugs.launchpad.net/octavia/+bug/2062965

berendt commented Apr 23, 2024

berendt commented Apr 30, 2024
