CNTR-1: fix stale gNOI connection post reboot + Implement PushConfig for static bind. #5367

Open
kjahed wants to merge 1 commit into openconfig:main from b4firex:kjahed/cntr1
@kjahed (Contributor) commented Apr 20, 2026

Currently, the first containerZ call after a reboot fails with a 'connection reset by peer' error because Ondatra caches a connection handle that goes stale across the reboot. This PR replaces the current arbitrary sleep with polling of the device state until the device comes back up, and it polls over gNOI (instead of the usual gNMI) so that the cached gNOI connection handle is refreshed after the reboot.

The PR also implements PushConfig for the static binding, as it is required to create a new containerZ client for certain vendors.
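
As an illustration of the polling approach described above, here is a minimal sketch that calls a reachability probe on a ticker until it succeeds or a deadline passes. It uses only the Go standard library. The names `probeFunc`, `waitUntilReachable`, and `simulate` are hypothetical, and the fake probe merely stands in for a real gNOI call; this is not the code from the PR.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// probeFunc is a hypothetical stand-in for one reachability check,
// for example a gNOI System Time request on a freshly obtained client.
type probeFunc func(ctx context.Context) error

// waitUntilReachable calls probe once per interval until it succeeds
// or maxWait elapses. interval must be positive.
func waitUntilReachable(ctx context.Context, probe probeFunc, interval, maxWait time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, maxWait)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return fmt.Errorf("device not reachable within %v: %w", maxWait, ctx.Err())
		case <-ticker.C:
			if err := probe(ctx); err == nil {
				return nil
			}
		}
	}
}

// simulate polls a fake device whose first `failures` probes fail.
// It returns the number of probe attempts and the final error.
func simulate(failures, maxWaitMillis int) (attempts int, err error) {
	probe := func(context.Context) error {
		attempts++
		if attempts <= failures {
			return errors.New("connection reset by peer")
		}
		return nil
	}
	maxWait := time.Duration(maxWaitMillis) * time.Millisecond
	err = waitUntilReachable(context.Background(), probe, 5*time.Millisecond, maxWait)
	return attempts, err
}

func main() {
	attempts, err := simulate(3, 2000)
	fmt.Println("attempts:", attempts, "err:", err)
}
```

In a real test the probe would issue the RPC against the DUT, and the interval and deadline would be on the order of seconds and minutes rather than milliseconds.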

@kjahed kjahed requested a review from a team as a code owner April 20, 2026 17:01
@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request improves the reliability of container lifecycle tests by replacing arbitrary wait times with active health checks post-reboot. Additionally, it extends the static binding functionality to include configuration pushing, which is required for proper client setup in certain vendor environments.

Highlights

  • Reboot Handling: Replaced the static sleep timer with a robust polling mechanism using gNOI to detect when the device is fully back online, preventing stale connection errors.
  • Configuration Management: Implemented the PushConfig method for static binding to support necessary client initialization for specific vendors.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request replaces a static sleep with a polling loop to detect device reboots in container lifecycle tests and adds a PushConfig method to the static DUT binding. Feedback includes reducing the polling interval for better reliability, improving error messaging and failure handling in tests, implementing shorter timeouts for reachability checks to prevent blocking, and fixing a resource leak in the new PushConfig method by ensuring gRPC connections are properly closed.

  // Wait for reboot.
- time.Sleep(8 * time.Minute)
+ maxRebootTime := 8 * time.Minute
+ ticker := time.NewTicker(30 * time.Second)

medium

A 30-second ticker interval might be too long for detecting a reboot, especially on virtual devices or fast-rebooting hardware. If the device reboots and comes back up within the 30-second window, the loop might miss the 'down' state and eventually time out. Consider reducing the interval to 5 or 10 seconds for better reliability.

Suggested change
- ticker := time.NewTicker(30 * time.Second)
+ ticker := time.NewTicker(10 * time.Second)
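
To make the sampling concern concrete, here is a small sketch using a simplified model: the poller samples at fixed multiples of the interval, and the device is unreachable during a single window. The function `pollObservesOutage` and the 20-second outage are hypothetical illustrations, and the model ignores RPC latency and connection setup time.

```go
package main

import "fmt"

// pollObservesOutage reports whether a poller sampling at interval,
// 2*interval, ... up to maxWait lands at least once inside the outage
// window [downAt, upAt). All values are in seconds.
func pollObservesOutage(interval, downAt, upAt, maxWait int) bool {
	if interval <= 0 {
		return false
	}
	for t := interval; t <= maxWait; t += interval {
		if t >= downAt && t < upAt {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical fast reboot: unreachable from t=5s to t=25s.
	fmt.Println("30s interval sees outage:", pollObservesOutage(30, 5, 25, 480))
	fmt.Println("10s interval sees outage:", pollObservesOutage(10, 5, 25, 480))
}
```

Under this model a shorter interval only lowers the chance of missing the window. An outage shorter than the interval can still fall between two samples.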

for {
	select {
	case <-timeout:
		t.Fatalf("Timeout exceeded: DUT did not reboot within %v seconds.", maxRebootTime)

medium

The error message 'DUT did not reboot' is slightly misleading if the device actually went down but failed to become reachable again. Additionally, using %v seconds with a time.Duration results in a redundant unit (e.g., '8m0s seconds'). Using t.Fatalf is preferred here as the failure makes subsequent test steps meaningless.

Suggested change
- t.Fatalf("Timeout exceeded: DUT did not reboot within %v seconds.", maxRebootTime)
+ t.Fatalf("Timeout exceeded: DUT did not become reachable after reboot within %v.", maxRebootTime)
References
  1. In tests, t.Fatalf is preferred over t.Errorf when a failure makes subsequent test steps meaningless, as this fails fast and reduces overall test execution time.
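
The redundant unit is easy to reproduce outside the test. The sketch below formats both message variants with `fmt.Sprintf`; the helper `timeoutMessages` is a hypothetical name used only for this illustration.

```go
package main

import (
	"fmt"
	"time"
)

// timeoutMessages formats the original and the suggested message for a
// timeout of the given number of minutes.
func timeoutMessages(minutes int) (redundant, clean string) {
	maxRebootTime := time.Duration(minutes) * time.Minute
	// %v on a time.Duration already prints the unit, so "seconds" is redundant.
	redundant = fmt.Sprintf("DUT did not reboot within %v seconds.", maxRebootTime)
	clean = fmt.Sprintf("DUT did not become reachable after reboot within %v.", maxRebootTime)
	return redundant, clean
}

func main() {
	redundant, clean := timeoutMessages(8)
	fmt.Println(redundant) // DUT did not reboot within 8m0s seconds.
	fmt.Println(clean)     // DUT did not become reachable after reboot within 8m0s.
}
```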

case <-ticker.C:
	// use GNOI to refresh the stale cached connection post reboot.
	sysClient := dut.RawAPIs().GNOI(t).System()
	_, err := sysClient.Time(ctx, &gspb.TimeRequest{})

medium

The Time RPC call uses the test context ctx, which has a long timeout (8 minutes). If the device is in a state where it accepts connections but hangs on RPCs, this call could block the polling loop for a long time. It's safer to use a shorter timeout for each individual reachability check.

tctx, cancel := context.WithTimeout(ctx, 10*time.Second)
_, err := sysClient.Time(tctx, &gspb.TimeRequest{})
cancel()
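
The effect of the per-call timeout can be sketched with a fake RPC that never answers. In the example below, `hangingProbe`, `probeWithTimeout`, and `hungProbeCutOff` are hypothetical names; the hanging probe stands in for a device that accepts the connection but never replies.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// hangingProbe simulates an RPC that never answers: it returns only
// when its context is cancelled or its deadline expires.
func hangingProbe(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// probeWithTimeout bounds a single probe by perCall, regardless of how
// long the parent context is allowed to live.
func probeWithTimeout(parent context.Context, perCall time.Duration, probe func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, perCall)
	defer cancel()
	return probe(ctx)
}

// hungProbeCutOff reports whether a hung probe was ended by the
// per-call deadline while the long-lived parent context stayed valid.
func hungProbeCutOff(perCallMillis int) bool {
	parent, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	perCall := time.Duration(perCallMillis) * time.Millisecond
	err := probeWithTimeout(parent, perCall, hangingProbe)
	return errors.Is(err, context.DeadlineExceeded) && parent.Err() == nil
}

func main() {
	fmt.Println("hung probe cut off by per-call timeout:", hungProbeCutOff(20))
}
```

Because the child context expires first, the polling loop regains control after each attempt and can still report its own overall timeout.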

Comment on lines +201 to +207
gnmiClient, err := d.DialGNMI(ctx)
if err != nil {
	return err
}
if _, err := gnmiClient.Set(ctx, setRequest); err != nil {
	return err
}

medium

The call to d.DialGNMI(ctx) creates a new gRPC connection that is never closed, leading to a resource leak. Since PushConfig is an internal method of staticDUT, you should use dialConn directly to obtain the connection and ensure it is closed after the operation.

Suggested change
- gnmiClient, err := d.DialGNMI(ctx)
- if err != nil {
- 	return err
- }
- if _, err := gnmiClient.Set(ctx, setRequest); err != nil {
- 	return err
- }
+ conn, err := dialConn(ctx, d, introspect.GNMI, nil)
+ if err != nil {
+ 	return err
+ }
+ defer conn.Close()
+ if _, err := gpb.NewGNMIClient(conn).Set(ctx, setRequest); err != nil {
+ 	return err
+ }
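
The leak and the effect of `defer conn.Close()` can be illustrated without any gRPC dependency. The sketch below counts open connections with a fake connection type. The names `connTracker`, `fakeConn`, `pushLeaky`, `pushClosing`, and `openAfter` are all hypothetical and do not correspond to the binding's real types.

```go
package main

import (
	"errors"
	"fmt"
)

// connTracker counts connections that were dialed but not yet closed.
type connTracker struct{ open int }

// fakeConn is a hypothetical stand-in for a gRPC client connection.
type fakeConn struct{ tracker *connTracker }

// dial opens a new fake connection and records it as open.
func (t *connTracker) dial() *fakeConn {
	t.open++
	return &fakeConn{tracker: t}
}

// Close marks the connection as released.
func (c *fakeConn) Close() error {
	c.tracker.open--
	return nil
}

// pushLeaky dials and uses a connection but never closes it.
func pushLeaky(t *connTracker, set func(*fakeConn) error) error {
	conn := t.dial()
	return set(conn)
}

// pushClosing closes the connection on every return path via defer.
func pushClosing(t *connTracker, set func(*fakeConn) error) error {
	conn := t.dial()
	defer conn.Close()
	return set(conn)
}

// openAfter runs n pushes whose Set call always fails, exercising the
// error path, and returns how many connections remain open.
func openAfter(n int, closing bool) int {
	t := &connTracker{}
	failingSet := func(*fakeConn) error { return errors.New("set failed") }
	for i := 0; i < n; i++ {
		if closing {
			_ = pushClosing(t, failingSet)
		} else {
			_ = pushLeaky(t, failingSet)
		}
	}
	return t.open
}

func main() {
	fmt.Println("open after leaky pushes:", openAfter(3, false))
	fmt.Println("open after closing pushes:", openAfter(3, true))
}
```

Placing the `defer` immediately after the successful dial covers the early `return err` path as well as the success path.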
