
test: set test-fs-watch as flaky #50250

Closed

Conversation

@anonrig (Member) commented Oct 18, 2023

Ref: #50249
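
For readers unfamiliar with how Node.js marks a test as flaky: this is typically done by adding an entry to the relevant status file (for example test/parallel/parallel.status). A minimal sketch of what such a change could look like, assuming the failure is scoped to s390x as in the referenced issue (the exact section header and test name used in this PR are assumptions, not confirmed from the diff):

```
# test/parallel/parallel.status
[$arch==s390x]
# https://github.com/nodejs/node/issues/50249
test-fs-watch-recursive-add-file-with-url: PASS,FLAKY
```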

@anonrig anonrig added flaky-test Issues and PRs related to the tests with unstable failures on the CI. fast-track PRs that do not need to wait for 48 hours to land. request-ci Add this label to start a Jenkins CI on a PR. labels Oct 18, 2023
@github-actions (Contributor)

Fast-track has been requested by @anonrig. Please 👍 to approve.

@nodejs-github-bot nodejs-github-bot added needs-ci PRs that need a full CI run. test Issues and PRs related to the tests. labels Oct 18, 2023
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Oct 18, 2023
@nodejs-github-bot (Collaborator)

@anonrig anonrig added the author ready PRs that have at least one approval, no pending requests for changes, and a CI started. label Oct 18, 2023
@lpinca (Member) commented Oct 18, 2023

The referenced issue was opened 2 hours ago. I think we should at least ping @nodejs/platform-s390 before marking the test flaky.

@anonrig (Member, Author) commented Oct 18, 2023

> The referenced issue was opened 2 hours ago. I think we should at least ping @nodejs/platform-s390 before marking the test flaky.

The s390 team can always follow up and un-flake the test. The other way around, i.e. waiting for someone to respond, will only frustrate existing pull requests and increase our CI workload. I recommend marking them flaky first and following up later when/if someone from the team is available. We should treat these flaky-test declarations as a todo list.

Also, we should remember that there is always an open issue tracking the flakiness of this test.

@nodejs-github-bot (Collaborator)

@lpinca (Member) commented Oct 19, 2023

> The s390 team can always follow up and un-flake the test. The other way around, i.e. waiting for someone to respond, will only frustrate existing pull requests and increase our CI workload. I recommend marking them flaky first and following up later when/if someone from the team is available. We should treat these flaky-test declarations as a todo list.
>
> Also, we should remember that there is always an open issue tracking the flakiness of this test.

I disagree; was this discussed in a TSC meeting? It is too easy to mark a test flaky the first time it fails and then forget about it. No one is ever going to look into it if there isn't a constant reminder/annoyance that the test keeps failing. The list of flaky tests will only grow over time instead of shrinking.

@anonrig (Member, Author) commented Oct 19, 2023

> I disagree; was this discussed in a TSC meeting? It is too easy to mark a test flaky the first time it fails and then forget about it. No one is ever going to look into it if there isn't a constant reminder/annoyance that the test keeps failing. The list of flaky tests will only grow over time instead of shrinking.

@lpinca It was not officially discussed. It was mostly a couple of folks saying "Yeah, let's do it".

cc @nodejs/tsc for visibility

@anonrig anonrig force-pushed the test-fs-watch-recursive-add-file-with-url branch from 67ed76e to 3ed1053 on October 19, 2023 16:31
@anonrig anonrig added the request-ci Add this label to start a Jenkins CI on a PR. label Oct 19, 2023
@lpinca (Member) commented Oct 19, 2023

Ok, fwiw I think it is not a good idea.

@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Oct 19, 2023
@nodejs-github-bot (Collaborator)

@nodejs-github-bot (Collaborator)

CI: https://ci.nodejs.org/job/node-test-pull-request/55055/

@mhdawson (Member)

I agree we should have a threshold for flakes versus a first-time failure. I'm going to object to the fast track so that we have some time for the discussion.

@mhdawson mhdawson added tsc-agenda Issues and PRs to discuss during the meetings of the TSC. and removed fast-track PRs that do not need to wait for 48 hours to land. labels Oct 20, 2023
@mhdawson (Member)

I've also added the tsc-agenda label so we can have a discussion and get feedback on what might be an appropriate threshold.

@anonrig (Member, Author) commented Oct 20, 2023

@mhdawson Is there any reason for not merging this pull request? Is the tsc-agenda label related to this PR specifically, or to a general discussion about the conversation above?

@sxa (Member) commented Oct 21, 2023

> @mhdawson Is there any reason for not merging this pull request? Is the tsc-agenda label related to this PR specifically, or to a general discussion about the conversation above?

While it has had two reviews, I think that with people against this (and I'll include myself in that) it makes sense to pause the merge.

Having said that, if we are struggling to get a response from the team maintaining that platform then we may have a wider issue (I think I may technically be on the @nodejs/platform-s390 team, but I'm not actively maintaining the port there). If we can't get a response in another week or so then I'd likely be in favour, so at this stage I'm +1 on having it as part of a TSC discussion before making the decision on merging.

@anonrig (Member, Author) commented Oct 21, 2023

> If we can't get a response in another week

Although I agree with where this is going, it means that all pull requests in the next week will face this flaky test, potentially overwhelming contributors, which is the main reason for fast-tracking these changes.

@mcollina (Member) left a comment

lgtm

@sxa (Member) commented Oct 21, 2023

> If we can't get a response in another week
>
> all pull requests in the next week will face this flaky test

So it's failing 100% of the time at the moment?

@mhdawson (Member) commented Oct 23, 2023

From the reliability reports, the last time this failed was Oct 10th (nodejs/reliability#690), so it's not every run. The reliability report shows that it failed a number of times on both macOS and s390 between Oct 8th and Oct 10th.

It has not failed since. I think this is possibly an example of why we shouldn't be too hasty about excluding a test: it's not so flaky that it affects all builds, since we've not seen failures in the last 13 days. [EDIT: searching the reliability reports, I think some failures may not show up, as I can't always find an entry in the report for every failure we see in a CI run.]

Since it was also failing on multiple platforms, I'm not sure it's a good candidate for wanting a response from a particular platform team either.

In one of the failing runs on macOS (https://ci.nodejs.org/job/node-test-commit-osx-arm/13767/), it looked like this:

[parallel.test-fs-watch-recursive-add-file](https://ci.nodejs.org/job/node-test-commit-osx-arm/13767/nodes=osx11/testReport/junit/(root)/parallel/test_fs_watch_recursive_add_file/)
[parallel.test-fs-watch-recursive-add-file-to-existing-subfolder](https://ci.nodejs.org/job/node-test-commit-osx-arm/13767/nodes=osx11/testReport/junit/(root)/parallel/test_fs_watch_recursive_add_file_to_existing_subfolder/)
[parallel.test-fs-watch-recursive-add-file-to-new-folder](https://ci.nodejs.org/job/node-test-commit-osx-arm/13767/nodes=osx11/testReport/junit/(root)/parallel/test_fs_watch_recursive_add_file_to_new_folder/)
[parallel.test-fs-watch-recursive-add-file-with-url](https://ci.nodejs.org/job/node-test-commit-osx-arm/13767/nodes=osx11/testReport/junit/(root)/parallel/test_fs_watch_recursive_add_file_with_url/)
[parallel.test-fs-watch-recursive-add-folder](https://ci.nodejs.org/job/node-test-commit-osx-arm/13767/nodes=osx11/testReport/junit/(root)/parallel/test_fs_watch_recursive_add_folder/)
[parallel.test-http-server-headers-timeout-keepalive](https://ci.nodejs.org/job/node-test-commit-osx-arm/13767/nodes=osx11/testReport/junit/(root)/parallel/test_http_server_headers_timeout_keepalive_/)
[sequential.test-watch-mode](https://ci.nodejs.org/job/node-test-commit-osx-arm/13767/nodes=osx11/testReport/junit/(root)/sequential/test_watch_mode_/)

With multiple tests failing like that, I think that's less often a flaky test and more often something that needs to be cleaned up on the machine.

So unless I'm looking at the data wrong, I don't think we should disable this test at this point. It has not failed in 13 days, and some of the failures looked more like machine issues.

Separately, I think it would be good to agree on and document something around marking tests as flaky, in terms of whether it's OK to mark a test flaky after seeing one failure or, if not, what threshold we think makes sense.

@anonrig (Member, Author) commented Oct 25, 2023

> It has not failed in 13 days, and some of the failures looked more like machine issues.

@mhdawson If you resume a build and make sure the flaky test succeeds before the daily reliability check is done, you can spoof the reliability report. The following test link from the original issue is an example of this: https://ci.nodejs.org/job/node-test-commit-linuxone/40398/

I think this is the fundamental issue with reliability reports.

@richardlau (Member)

> @mhdawson If you resume a build and make sure the flaky test succeeds before the daily reliability check is done, you can spoof the reliability report. The following test link from the original issue is an example of this: https://ci.nodejs.org/job/node-test-commit-linuxone/40398/
>
> I think this is the fundamental issue with reliability reports.

That's not how reliability reports work. The reports look back on the 100 most recent runs of node-test-pull-request.
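
For context, a rough sketch of the kind of query such a report could be built on, assuming the standard Jenkins JSON API exposed by ci.nodejs.org (the actual nodejs/reliability tooling may collect and filter the data differently; this helper is illustrative only, not the project's real implementation):

```js
// Fetch the 100 most recent node-test-pull-request builds from the Jenkins
// JSON API and tally their overall results. Illustrative sketch only.
const JOB = 'https://ci.nodejs.org/job/node-test-pull-request';

async function recentResults(limit = 100) {
  // Jenkins tree queries support slicing array properties with {start,end}.
  const url = `${JOB}/api/json?tree=builds[number,result]{0,${limit}}`;
  const res = await fetch(url); // global fetch, Node.js 18+
  if (!res.ok) throw new Error(`Jenkins API request failed: ${res.status}`);
  const { builds } = await res.json();
  const counts = {};
  for (const { result } of builds) {
    const key = result ?? 'IN_PROGRESS'; // result is null while a build runs
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return counts; // e.g. { SUCCESS: 80, FAILURE: 15, UNSTABLE: 5 }
}

recentResults().then((counts) => console.log(counts));
```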

@aduh95 aduh95 added the request-ci Add this label to start a Jenkins CI on a PR. label Nov 8, 2023
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Nov 8, 2023
@nodejs-github-bot
Copy link
Collaborator

@aduh95 (Contributor) left a comment

It looks like the last time this test was reported as failing was in nodejs/reliability#690, so one month ago, and nothing has shown up since. Are we sure this is still relevant?
I'm adding a request for changes so it doesn't land until we have confirmation that it is still flaky.

Labels
author ready: PRs that have at least one approval, no pending requests for changes, and a CI started.
flaky-test: Issues and PRs related to the tests with unstable failures on the CI.
needs-ci: PRs that need a full CI run.
test: Issues and PRs related to the tests.
tsc-agenda: Issues and PRs to discuss during the meetings of the TSC.