Skip to content

Commit

Permalink
Add CLI configuration options for teamd retry count feature (sonic-ne…
Browse files Browse the repository at this point in the history
…t#2642)

* Add CLI configuration options for teamd retry count feature

Add a SONiC CLI to more easily configure the retry count for port
channels. This effectively acts like a wrapper around the underlying
teamdctl command.

Also add a python script that'll be installed into
/usr/local/bin/teamd_increase_retry_count.py that will detect if the
peer device likely supports this teamd feature (based on LLDP neighbor
info) and increases the teamd retry count to 5 in preparation for warm
upgrade. This script requires sudo to run.

This is tied to sonic-net/sonic-buildimage#13453.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Add test for error case from teamd when it's not running

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Fix up test cases

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Add some error handling if teamdctl doesn't exist

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Add probe functionality and sending current LACPDU packet functionality

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Check to see if the retry count feature is enabled before doing a get or set

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Add option to only send probe packets or only change retry count

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Call the teamd retry count script if doing a warm-reboot

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Fix pycheck errors, and disable scapy's IPv6 and verbose mode

Scapy's IPv6 support appears to have caused some issues with older
versions of scapy, which may be present on older SONiC images.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Make teamd retry count support optional

Don't fail warm reboot if teamd retry count support doesn't happen to be
present.

Also use fastfast-reboot for Mellanox devices.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Address review comments, and restructure code to increase code coverage

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Address some review comments

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Replace tabs with spaces

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Verify that expected keys are present in the data returned from teamdctl

Also update a failure message in the warm-reboot script if the retry
count script fails.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Fix TimeoutExpired undefined error

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Add ability to mock subprocess calls (at a limited level)

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Return an actual subprocess object, and add a test for checking timeout

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Change variable syntax

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Fix set being accessed with an index

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Add option to warm-reboot script to control if teamd retry count is required or not

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Move the teamd retry count check to before orchagent

This is so that in the case of the teamd retry count check failing,
there's fewer changes that happen on the system (it'll fail faster).

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Move retry count script start to be prior to point-of-no-return

This doesn't need to be after the point-of-no-return, since this will
detach and be sending LACPDUs on its own.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Set executable bit

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Address PR comments

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Change to case-insensitive string contains check

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Make sure the global abort variable is used

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

---------

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
  • Loading branch information
saiarcot895 authored and pdhruv-marvell committed Aug 23, 2023
1 parent 638537a commit 64c37ff
Show file tree
Hide file tree
Showing 5 changed files with 616 additions and 1 deletion.
87 changes: 87 additions & 0 deletions config/main.py
Original file line number Diff line number Diff line change
Expand Up @@ -2281,6 +2281,93 @@ def del_portchannel_member(ctx, portchannel_name, port_name):
except JsonPatchConflict:
ctx.fail("Invalid or nonexistent portchannel or interface. Please ensure existence of portchannel member.")

@portchannel.group(cls=clicommon.AbbreviationGroup, name='retry-count')
@click.pass_context
def portchannel_retry_count(ctx):
pass

def check_if_retry_count_is_enabled(ctx, portchannel_name):
try:
proc = subprocess.Popen(["teamdctl", portchannel_name, "state", "item", "get", "runner.enable_retry_count_feature"], text=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
output, err = proc.communicate(timeout=10)
if proc.returncode != 0:
ctx.fail("Unable to determine if the retry count feature is enabled or not: {}".format(err.strip()))
return output.strip() == "true"
except subprocess.TimeoutExpired as e:
proc.kill()
proc.communicate()
ctx.fail("Unable to determine if the retry count feature is enabled or not: {}".format(e))

@portchannel_retry_count.command('get')
@click.argument('portchannel_name', metavar='<portchannel_name>', required=True)
@click.pass_context
def get_portchannel_retry_count(ctx, portchannel_name):
"""Get the retry count for a port channel"""
db = ValidatedConfigDBConnector(ctx.obj['db'])

# Don't proceed if the port channel name is not valid
if is_portchannel_name_valid(portchannel_name) is False:
ctx.fail("{} is invalid!, name should have prefix '{}' and suffix '{}'"
.format(portchannel_name, CFG_PORTCHANNEL_PREFIX, CFG_PORTCHANNEL_NO))

# Don't proceed if the port channel does not exist
if is_portchannel_present_in_db(db, portchannel_name) is False:
ctx.fail("{} is not present.".format(portchannel_name))

try:
is_retry_count_enabled = check_if_retry_count_is_enabled(ctx, portchannel_name)
if not is_retry_count_enabled:
ctx.fail("Retry count feature is not enabled!")

proc = subprocess.Popen(["teamdctl", portchannel_name, "state", "item", "get", "runner.retry_count"], text=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
output, err = proc.communicate(timeout=10)
if proc.returncode != 0:
ctx.fail("Unable to get the retry count: {}".format(err.strip()))
click.echo(output.strip())
except FileNotFoundError:
ctx.fail("Unable to get the retry count: teamdctl could not be run")
except subprocess.TimeoutExpired as e:
proc.kill()
proc.communicate()
ctx.fail("Unable to get the retry count: {}".format(e))
except Exception as e:
ctx.fail("Unable to get the retry count: {}".format(e))

@portchannel_retry_count.command('set')
@click.argument('portchannel_name', metavar='<portchannel_name>', required=True)
@click.argument('retry_count', metavar='<retry_count>', required=True, type=click.IntRange(3,10))
@click.pass_context
def set_portchannel_retry_count(ctx, portchannel_name, retry_count):
"""Set the retry count for a port channel"""
db = ValidatedConfigDBConnector(ctx.obj['db'])

# Don't proceed if the port channel name is not valid
if is_portchannel_name_valid(portchannel_name) is False:
ctx.fail("{} is invalid!, name should have prefix '{}' and suffix '{}'"
.format(portchannel_name, CFG_PORTCHANNEL_PREFIX, CFG_PORTCHANNEL_NO))

# Don't proceed if the port channel does not exist
if is_portchannel_present_in_db(db, portchannel_name) is False:
ctx.fail("{} is not present.".format(portchannel_name))

try:
is_retry_count_enabled = check_if_retry_count_is_enabled(ctx, portchannel_name)
if not is_retry_count_enabled:
ctx.fail("Retry count feature is not enabled!")

proc = subprocess.Popen(["teamdctl", portchannel_name, "state", "item", "set", "runner.retry_count", str(retry_count)], text=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
output, err = proc.communicate(timeout=10)
if proc.returncode != 0:
ctx.fail("Unable to set the retry count: {}".format(err.strip()))
except FileNotFoundError:
ctx.fail("Unable to set the retry count: teamdctl could not be run")
except subprocess.TimeoutExpired as e:
proc.kill()
proc.communicate()
ctx.fail("Unable to set the retry count: {}".format(e))
except Exception as e:
ctx.fail("Unable to set the retry count: {}".format(e))


#
# 'mirror_session' group ('config mirror_session ...')
Expand Down
32 changes: 31 additions & 1 deletion scripts/fast-reboot
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,7 @@ TAG_LATEST=yes
DETACH=no
LOG_PATH="/var/log/${REBOOT_TYPE}.txt"
UIMAGE_HDR_SIZE=64
REQUIRE_TEAMD_RETRY_COUNT=no

# Require 100M available on the hard drive for warm reboot temp files,
# Size is in 1K blocks:
Expand All @@ -48,6 +49,7 @@ EXIT_DB_INTEGRITY_FAILURE=15
EXIT_NO_CONTROL_PLANE_ASSISTANT=20
EXIT_SONIC_INSTALLER_VERIFY_REBOOT=21
EXIT_PLATFORM_FW_AU_FAILURE=22
EXIT_TEAMD_RETRY_COUNT_FAILURE=23

function error()
{
Expand Down Expand Up @@ -79,13 +81,15 @@ function showHelpAndExit()
echo " -t : Don't tag the current kube images as latest"
echo " -D : detached mode - closing terminal will not cause stopping reboot"
echo " -u : include ssd-upgrader-part in boot options"
echo " -n : don't require peer devices to be running SONiC with retry count feature [default]"
echo " -N : require peer devices to be running SONiC with retry count feature"

exit "${EXIT_SUCCESS}"
}

function parseOptions()
{
while getopts "vfidh?rkxc:sDu" opt; do #TODO "t" is missing
while getopts "vfidh?rkxc:sDunN" opt; do #TODO "t" is missing
case ${opt} in
h|\? )
showHelpAndExit
Expand Down Expand Up @@ -126,6 +130,12 @@ function parseOptions()
u )
SSD_FW_UPDATE_BOOT_OPTION=yes
;;
n )
REQUIRE_TEAMD_RETRY_COUNT=no
;;
N )
REQUIRE_TEAMD_RETRY_COUNT=yes
;;
esac
done
}
Expand Down Expand Up @@ -636,6 +646,22 @@ init_warm_reboot_states
setup_control_plane_assistant
TEAMD_INCREASE_RETRY_COUNT=0
if [[ "${REBOOT_TYPE}" = "warm-reboot" || "${REBOOT_TYPE}" = "fastfast-reboot" ]]; then
TEAMD_RETRY_COUNT_PROBE_RC=0
/usr/local/bin/teamd_increase_retry_count.py --probe-only || TEAMD_RETRY_COUNT_PROBE_RC=$?
if [[ ${TEAMD_RETRY_COUNT_PROBE_RC} -ne 0 ]]; then
if [[ "${REQUIRE_TEAMD_RETRY_COUNT}" = "yes" ]]; then
error "Could not confirm that all neighbor devices are running SONiC with the retry count feature"
exit "${EXIT_TEAMD_RETRY_COUNT_FAILURE}"
else
debug "Warning: Retry count feature support unknown for one or more neighbor devices; assuming that it's not available"
fi
else
TEAMD_INCREASE_RETRY_COUNT=1
fi
fi
if [[ "$REBOOT_TYPE" = "warm-reboot" || "$REBOOT_TYPE" = "fastfast-reboot" || "$REBOOT_TYPE" = "fast-reboot" ]]; then
# Freeze orchagent for warm restart
# Ask orchagent_restart_check to try freeze 5 times with interval of 2 seconds,
Expand Down Expand Up @@ -664,6 +690,10 @@ if [[ "$REBOOT_TYPE" = "fast-reboot" ]]; then
fi
fi
if [[ ( "${REBOOT_TYPE}" = "warm-reboot" || "${REBOOT_TYPE}" = "fastfast-reboot" ) && "${TEAMD_INCREASE_RETRY_COUNT}" -eq 1 ]]; then
/usr/local/bin/teamd_increase_retry_count.py
fi
# We are fully committed to reboot from this point on because critical
# service will go down and we cannot recover from it.
set +e
Expand Down
Loading

0 comments on commit 64c37ff

Please sign in to comment.