CNTR-1: fix stale gNOI connection post reboot + Implement PushConfig for static bind. #5367
kjahed wants to merge 1 commit into openconfig:main
Conversation
Summary of Changes
This pull request improves the reliability of container lifecycle tests by replacing arbitrary wait times with active health checks post-reboot. Additionally, it extends the static binding functionality to include configuration pushing, which is required for proper client setup in certain vendor environments.
Pull Request Functional Test Report for #5367 / 603ca74
Virtual Devices
Hardware Devices
Code Review
This pull request replaces a static sleep with a polling loop to detect device reboots in container lifecycle tests and adds a PushConfig method to the static DUT binding. Feedback includes reducing the polling interval for better reliability, improving error messaging and failure handling in tests, implementing shorter timeouts for reachability checks to prevent blocking, and fixing a resource leak in the new PushConfig method by ensuring gRPC connections are properly closed.
```diff
 	// Wait for reboot.
-	time.Sleep(8 * time.Minute)
+	maxRebootTime := 8 * time.Minute
+	ticker := time.NewTicker(30 * time.Second)
```
A 30-second ticker interval might be too long for detecting a reboot, especially on virtual devices or fast-rebooting hardware. If the device reboots and comes back up within the 30-second window, the loop might miss the 'down' state and eventually timeout. Consider reducing the interval to 5 or 10 seconds for better reliability.
```diff
-	ticker := time.NewTicker(30 * time.Second)
+	ticker := time.NewTicker(10 * time.Second)
```
```go
for {
	select {
	case <-timeout:
		t.Fatalf("Timeout exceeded: DUT did not reboot within %v seconds.", maxRebootTime)
```
The error message 'DUT did not reboot' is slightly misleading if the device actually went down but failed to become reachable again. Additionally, using %v seconds with a time.Duration results in a redundant unit (e.g., '8m0s seconds'). Using t.Fatalf is preferred here as the failure makes subsequent test steps meaningless.
```diff
-		t.Fatalf("Timeout exceeded: DUT did not reboot within %v seconds.", maxRebootTime)
+		t.Fatalf("Timeout exceeded: DUT did not become reachable after reboot within %v.", maxRebootTime)
```
References
- In tests, t.Fatalf is preferred over t.Errorf when a failure makes subsequent test steps meaningless, as this fails fast and reduces overall test execution time.
```go
	case <-ticker.C:
		// use GNOI to refresh the stale cached connection post reboot.
		sysClient := dut.RawAPIs().GNOI(t).System()
		_, err := sysClient.Time(ctx, &gspb.TimeRequest{})
```
The Time RPC call uses the test context ctx, which has a long timeout (8 minutes). If the device is in a state where it accepts connections but hangs on RPCs, this call could block the polling loop for a long time. It's safer to use a shorter timeout for each individual reachability check.
```go
tctx, cancel := context.WithTimeout(ctx, 10*time.Second)
_, err := sysClient.Time(tctx, &gspb.TimeRequest{})
cancel()
```

```go
gnmiClient, err := d.DialGNMI(ctx)
if err != nil {
	return err
}
if _, err := gnmiClient.Set(ctx, setRequest); err != nil {
	return err
}
```
The call to d.DialGNMI(ctx) creates a new gRPC connection that is never closed, leading to a resource leak. Since PushConfig is an internal method of staticDUT, you should use dialConn directly to obtain the connection and ensure it is closed after the operation.
```diff
-	gnmiClient, err := d.DialGNMI(ctx)
-	if err != nil {
-		return err
-	}
-	if _, err := gnmiClient.Set(ctx, setRequest); err != nil {
-		return err
-	}
+	conn, err := dialConn(ctx, d, introspect.GNMI, nil)
+	if err != nil {
+		return err
+	}
+	defer conn.Close()
+	if _, err := gpb.NewGNMIClient(conn).Set(ctx, setRequest); err != nil {
+		return err
+	}
```
Currently, the first containerZ call post reboot fails with a "connection reset by peer" error due to the stale connection handle cached by Ondatra. This PR waits for the device to come back up post reboot by polling the device state (instead of the current arbitrary sleep), and polls the device using gNOI (instead of the usual gNMI) to ensure the connection handle is valid post reboot.
The PR also implements PushConfig for static binding, as it is required to create a new containerZ client for certain vendors.