Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests: refactored fs watch_dir tests for stability #480

Closed
wants to merge 1 commit into from

Conversation

whitlockjc
Copy link
Contributor

fs_event_watch_dir and fs_event_watch_dir_recursive could fail randomly
due to the way in which the tests were written. Originally timers were
used to create, remove and recreate the test files but this could lead
to a race condition if the timeout used to delete the test files ran
before all file creation fs events were handled. On top of that, the
file removal timer scheduled another timer to recreate the test files
and that timer's timeout could also lead to the same condition.

The refactoring removed timers for the removal/recreation of the test
files and instead the fs event callback was updated to have the
necessary logic to drive the test file removal. We no longer recreate
the test files since it appears to be unnecessary.

With the current PR, I can run 1000 tests for fs_event_watch_dir and
fs_event_watch_dir_recursive without a single failure, where prior to this
PR they would fail 14/100 on average.

Fixes #30

@whitlockjc
Copy link
Contributor Author

/cc @saghul @indutny @misterdjules

@whitlockjc whitlockjc changed the title tests: Refactored fs watch_dir tests for stability tests: refactored fs watch_dir tests for stability Aug 14, 2015
@whitlockjc
Copy link
Contributor Author

One thing I forgot to mention, or maybe it wasn't clear, is that the original test appeared to do four cycles of creating files and then removing the files. This current PR only does one cycle but updating it to do four should be trivial. It's your call as to whether I add the extra cycles back or not.

@saghul
Copy link
Member

saghul commented Aug 15, 2015

I'm afk until Monday. Will take a look then! Thanks!
On Aug 15, 2015 00:09, "Jeremy Whitlock" notifications@github.com wrote:

/cc @saghul https://github.com/saghul @indutny
https://github.com/indutny @misterdjules
https://github.com/misterdjules


Reply to this email directly or view it on GitHub
#480 (comment).

fs_event_unlink_files(handle);
} else if (fs_event_cb_called == fs_event_created + fs_event_removed) {
/* Once we've processed all create and delete events, stop watching */
ASSERT(0 == uv_fs_event_stop(handle));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not really necessary, since calling uv_close will trigger a stop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I didn't write the original test so I didn't want to change it but I will remove it if you think it's unnecessary.

@saghul
Copy link
Member

saghul commented Aug 17, 2015

@saghul
Copy link
Member

saghul commented Aug 17, 2015

Not sure if it was failing before, but SmartOS and FreeBSD are failing:

SmartOS:

`fs_event_watch_dir` failed: exit code 134
Output from process `fs_event_watch_dir`:
Assertion failed in test/test-fs-event.c on line 175: events == UV_RENAME

FreeBSD:

%  86|+ 216|-   7|T   0|S  12]: fs_event_watch_dir
`fs_event_watch_dir` failed: timeout
Output from process `fs_event_watch_dir`:
fs_event_cb_dir_multi_file called: 1
  created: 16
  removed: 0

@whitlockjc
Copy link
Contributor Author

I will remove the unnecessary stuff and test on the other environments. I will update you.

@saghul
Copy link
Member

saghul commented Aug 17, 2015

Thanks! Ping me me if you need me to trigger the CI.

@whitlockjc
Copy link
Contributor Author

I have updated the PR. I removed my debugging output (sorry), removed the unnecessary assertions @saghul mentioned and updated the assertion in the fs_event_cb_dir_multi_file callback so we do not have assertion failures on SmartOS. Fixing the assertion on SmartOS does not fix the test completely as it will still timeout like it does on FreeBSD but these tests were failing prior to my PR and appear to be a deeper libuv issue than the test race condition I've fixed. (A quick debugging session shows that on FreeBSD and SmartOS that the fs_event_cb_dir_multi_file callback is only invoked once instead instead of once per created/removed file.)

@saghul
Copy link
Member

saghul commented Aug 17, 2015

@whitlockjc
Copy link
Contributor Author

I've found something. On FreeBSD (#128) and SmartOS, the tests were failing prior to this PR. But after looking at the CI environment, it seems something changed for fs_event_watch_dir because it use to pass on CentOS 7 but now it's failing there. That being said, I'll chase this down and update the PR.

@whitlockjc
Copy link
Contributor Author

I've cleaned up the sources a little to remove some compiler warnings on non-OSX/Windows. Progress:

FreeBSD

FreeBSD was failing prior to this PR but I wanted to see if there was something I could do about it. What I've noticed is that the fs event callback fs_event_cb_dir_multi_file is called only once even though the test creates 16 files, which should result in 16 separate fs_event_cb_dir_multi_file invocations. dtruss -e -d -f ./test/run-tests fs_event_watch_dir shows only 3 kevent calls and that doesn't seem right but it seems to match up with what I'm seeing in my debugging. On a whim, I removed the EV_ONESHOT from https://github.com/libuv/libuv/blob/v1.x/src/unix/kqueue.c#L97 and the test passes but I know that is not the fix, and it could very well be a red herring. I'm working with @indutny on this but I don't know how much of his time I'll have.

CentOS

Building an environment as we speak. No idea why it would stop passing due to these changes but I'll find out soon enough.

SmartOS

SmartOS was failing prior to this PR as well and it has similar symptoms to FreeBSD. I will circle around to SmartOS when I make progress on one of the environments above.

@whitlockjc whitlockjc force-pushed the fix-fs_event_watch_dir branch 2 times, most recently from 5da99b1 to edcb386 Compare August 18, 2015 15:00
@whitlockjc
Copy link
Contributor Author

I've merged in v1.x changes and resolved conflicts. I'm testing on CentOS right now.

@whitlockjc
Copy link
Contributor Author

I just tested locally on CentOS 7 and fs_event_watch_dir passes without issue. Testing FreeBSD again.

@whitlockjc
Copy link
Contributor Author

I think we're back in the state we were prior to this PR. FreeBSD and SmartOS are failing as they were before and I think it might be a bigger issue not related to the previously flaky tests.

@saghul Can you kick off a new CI build to make sure CentOS 7 is passing again?

@saghul
Copy link
Member

saghul commented Aug 18, 2015

Thanks again for the effort you're putting on this. I'll trigger the CI as
soon as I'm back home.
On Aug 18, 2015 5:33 PM, "Jeremy Whitlock" notifications@github.com wrote:

I think we're back in the state we were prior to this PR. FreeBSD and
SmartOS are failing as they were before and I think it might be a bigger
issue.

@saghul https://github.com/saghul Can you kick off a new CI build to
make sure CentOS7 is passing again?


Reply to this email directly or view it on GitHub
#480 (comment).

@whitlockjc
Copy link
Contributor Author

No problem. If you get a few minutes when you get home, ping me in IRC so we can chat.

@indutny
Copy link
Member

indutny commented Aug 18, 2015

I have investigated the FreeBSD side. I'm afraid this is just how it works: it coalesces events into one knote and returns them at kevent() call.

I think we could do following thing - make it create and unlink files in different event-loop ticks. Does it make sense?

@whitlockjc
Copy link
Contributor Author

Yes. I will update the PR when I've got that approach implemented. Thanks for your help.

@whitlockjc whitlockjc force-pushed the fix-fs_event_watch_dir branch 2 times, most recently from 1506476 to a124253 Compare August 18, 2015 21:44
@whitlockjc
Copy link
Contributor Author

The latest version of the PR has fs_event_watch_dir passing on FreeBSD, which should allow us to check that one off of #128 if/when this gets merged, and it is also passing on SmartOS now. (Both of these were failing prior to this PR.)

Once @indutny verified that the fs event callback fs_event_watch_dir was only being called once and was by design (expected), we realized the only way for this test to pass as written was to create one file per event loop tick and remove one file per event loop tick to make sure that the proper number of fs events were emitted for the created and removed files.

As it stands, this PR not only makes the fs_event_watch_dir and fs_event_watch_dir_recursive stable, it also fixes fs_event_watch_dir on FreeBSD, SmartOS and possibly others where it might had been failing. Thanks a lot @indutny and @misterdjules.

@saghul
Copy link
Member

saghul commented Aug 19, 2015

@whitlockjc sorry, yesterday ended up not working for me. CI: https://jenkins-iojs.nodesource.com/view/libuv/job/libuv+any-pr+multi/133/

@saghul
Copy link
Member

saghul commented Aug 19, 2015

@whitlockjc can you please rebase again? I pushed a fix for the broken build on Windows.

fs_event_watch_dir and fs_event_watch_dir_recursive could fail randomly
due to the way in which the tests were written.  Originally timers were
used to create, remove and recreate the test files but this could lead
to a race condition if the timeout used to delete the test files ran
before all file creation fs events were handled.  On top of that, the
file removal timer scheduled another timer to recreate the test files
and that timer's timeout could also lead to the same condition.

The refactoring removed timers for the removal/recreation of the test
files and instead the fs event callback was updated to have the
necessary logic to drive the test file removal.  We no longer recreate
the test files since it appears to be unnecessary.

Fixes libuv#30
@whitlockjc
Copy link
Contributor Author

Done.

@saghul
Copy link
Member

saghul commented Aug 19, 2015

@saghul
Copy link
Member

saghul commented Aug 19, 2015

Great job @whitlockjc! LGTM 👍

@whitlockjc
Copy link
Contributor Author

Thanks to all that helped: @indutny and @misterdjules Two excellent mentors.

saghul pushed a commit that referenced this pull request Aug 19, 2015
fs_event_watch_dir and fs_event_watch_dir_recursive could fail randomly
due to the way in which the tests were written.  Originally timers were
used to create, remove and recreate the test files but this could lead
to a race condition if the timeout used to delete the test files ran
before all file creation fs events were handled.  On top of that, the
file removal timer scheduled another timer to recreate the test files
and that timer's timeout could also lead to the same condition.

The refactoring removed timers for the removal/recreation of the test
files and instead the fs event callback was updated to have the
necessary logic to drive the test file removal.  We no longer recreate
the test files since it appears to be unnecessary.

Fixes: #30
PR-URL: #480
Reviewed-By: Saúl Ibarra Corretgé <saghul@gmail.com>
@saghul
Copy link
Member

saghul commented Aug 19, 2015

Landed in ✨ 8326470 ✨ Much obliged, @whitlockjc!

@saghul saghul closed this Aug 19, 2015
@whitlockjc
Copy link
Contributor Author

WOOHOO!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants