Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestDiagnosticLogsDownloadCommand fails during darwin CI runs quite often #32636

Open
nivi-apple opened this issue Mar 20, 2024 · 6 comments · May be fixed by #33788
Open

TestDiagnosticLogsDownloadCommand fails during darwin CI runs quite often #32636

nivi-apple opened this issue Mar 20, 2024 · 6 comments · May be fixed by #33788

Comments

@nivi-apple
Copy link
Contributor

nivi-apple commented Mar 20, 2024

The diagnostic log tests seem to fail quite frequently. Logs - https://github.com/project-chip/connectedhomeip/actions/runs/8347127817/job/22854062448?pr=32631 for an instance

Disabled in #34260

Please re-enable in https://github.com/project-chip/connectedhomeip/blob/master/scripts/tests/chiptest/__init__.py#L170, once this is addressed.

@bzbarsky-apple
Copy link
Contributor

This is failing on:

ERROR   21:13:51.686 - TEST OUT  : 		***** Test Step 12 : Read End User Support log intent after a failure to make sure that everything still works

Is it possible that the test is getting confused about which test the "failure" belongs with or something?

@bzbarsky-apple
Copy link
Contributor

@vivien-apple Could you take a look, please?

@harimau-qirex
Copy link
Contributor

I looked into this a bit, and here's a brief summary of what I believe is happening:

  • The previous step attempts to generate an error. It requests the network diagnostics logs with a very short timeout, under the assumption that the timeout will be hit and generate an error.
  • Some of the time, a CHIP_ERROR_INTERNAL error occurs during that step. I haven't looked into the source of that error.
  • That error is treated as the error the step was expecting and the test moves onto the next step, which requests the end user support logs.
  • The all clusters app receives the BDX StatusReport indicating the timeout and sends the appropriate command response indicating the diagnostic logs transfer failed.
  • The test associates this command response with the current step and considers the test to have failed.

There are a couple things here. The immediate cause is this mystery internal error, as that causes the test to move on to the next step so the expected failure is inadvertently associated with the wrong step.

A broader issue is that the tests aren't quite detailed enough in what they're expecting. Some steps expect that an error occurs but doesn't specify what kinds of error are acceptable, so the test treats any error as the step passing rather than stop the test when the wrong errors occur.

I ran into a case where this broader issue caused the tests to erroneously succeed. One of the steps right before these transfers the network diagnostics logs normally. However, the test moved on before the all clusters app sent the final ACK for the BlockAckEof message. This meant the all clusters app didn't consider the BDX transfer completed. When the network diagnostics logs were requested again, the all clusters app responded with a busy status since the previous transfer was still ongoing. The test saw the error from the busy status and treated it as the expected timeout-related error, instead of stopping.

@nivi-apple nivi-apple linked a pull request Jun 6, 2024 that will close this issue
@nivi-apple
Copy link
Contributor Author

nivi-apple commented Jun 6, 2024

Uploaded PR #33788 but that does not fix the issue. It looks like we might not be aborting the transfer for timeout or internal errors on the Initiator side. This needs more analysis to fix it properly. I am working on it.

@nivi-apple
Copy link
Contributor Author

nivi-apple commented Jun 7, 2024

So I figured out why the test is failing intermittently. I am using logs from here as reference - https://github.com/project-chip/connectedhomeip/actions/runs/9372056858/job/25802710559?pr=33672.

It looks like for the test -***** Test Step 11 : Read Network Diagnostic log intent with a very short timeout and a very long log

We send a request to initiate BDX -

ERROR 19:23:46.404 - APP OUT : [1717529026404] [67942:190636] [EM] <<< [E:60809i S:61167 M:18707577] (S) Msg TX to 1:233CE82C33836403 [6D7B] [UDP:192.168.64.25:59528] --- Type 0002:01 (BDX:SendInit)

For some reason the Init fails and instead of a SendAccept we get a status report and that is not expected and hence the next test fails.

ERROR 19:23:47.411 - TEST OUT : ***** Test Step 12 : Read End User Support log intent after a failure to make sure that everything still works
ERROR 19:23:47.450 - APP OUT : [1717529027450] [67942:190664] [BDX] OutputEvent type: StatusReceived
ERROR 19:23:47.451 - APP OUT : [1717529027450] [67942:190664] [BDX] Diagnostic logs transfer: StatusReport Error 5f

To alleviate this, I guess we can make sure we get a sendaccept before we proceed with success tests. I will also corroborate with couple other logs to see if its the same issue

@nivi-apple
Copy link
Contributor Author

also it looks like sometimes the status report gets through before reset happens. Other times the status report is not sent. I had observed this when initially writing the code for downloading logs. I need to look into the code and make sure we ensure the status report is sent out before calling reset.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Todo
Status: Todo
Development

Successfully merging a pull request may close this issue.

3 participants