Recover from watching a network share (CIFS), which might be down at times #293
As a quick patch, try adding a rescue in listen/lib/listen/directory.rb (line 33 in d5e648a).
I'm not sure whether treating a host that is down the same as a file that doesn't exist is correct, but during the next scan Listen should pick up the changes as "new files" once the host is up again. Ultimately, you want enough rescues to keep exceptions from "leaking" outside the method where they happen. Don't worry about any crashes beyond the first one, since one crash can pretty much bring down everything. There may be a generic way to handle these and other exceptions, but for now it's going to be faster and better for you to patch until you have no more problems (and send me a PR or even a diff, and I'll add tests/specs, etc.). And don't hesitate to ask about internals - they can get a bit tricky, and unless you're Detective Columbo, I'd recommend just asking.
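To illustrate the "enough rescues" idea, here is a minimal sketch in plain Ruby. It is not Listen's actual code: `entries_or_empty` and `UNAVAILABLE_ERRORS` are hypothetical names, and the choice of `Errno::ENOENT` and `Errno::EHOSTDOWN` is an assumption about what a CIFS outage raises. Check your own backtraces for the real error classes.

```ruby
# Errors treated as "the path is temporarily gone".
# Assumption: these are the errno classes a share outage raises;
# adjust the list to whatever your logs actually show.
UNAVAILABLE_ERRORS = [Errno::ENOENT, Errno::EHOSTDOWN].freeze

# Hypothetical helper (not part of Listen): list a directory's entries,
# returning an empty list instead of raising when the path is unavailable.
# The exception is handled inside the method, so it cannot leak to the caller.
def entries_or_empty(path)
  Dir.entries(path) - %w[. ..]
rescue *UNAVAILABLE_ERRORS
  []
end
```

With a helper like this, an unreachable share looks like an empty directory for the duration of the outage, and a later scan would see the files reappear as new entries.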
@e2 thanks for the detailed response. I'll have a shot at adding the rescues and see how things turn out.
As for Listen not watching files once the host is back up, you may want to do a "touch" on a directory or file so Listen can rebuild its directory records - I'm not sure yet why it wouldn't pick up changes, but rescuing exceptions first is the fastest/best thing to do at this point.
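The "touch" suggestion could also be scripted in Ruby instead of run by hand. This is only a sketch: `nudge` is a hypothetical helper name, and whether bumping the mtime is enough for Listen to rebuild its records is the commenter's suggestion, not something verified here.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: update the modification time of a watched file or
# directory so that a polling scan sees a change on its next pass.
# Returns the new mtime.
def nudge(path)
  FileUtils.touch(path)
  File.mtime(path)
end

# Usage example on a throwaway directory (stand-in for the mounted share).
Dir.mktmpdir do |dir|
  puts "touched #{dir} at #{nudge(dir)}"
end
```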
@e2 just deployed https://github.com/senny/listen/commit/35ef5a0cdc0f1598144855bf99260ef3f5a8bc2f and it looks like it solves the issue. We simulated some downtime and listen kept working without crashes. I put the patch in our sandbox environment so we can collect more data under somewhat real conditions.
Theoretically there should be errors in other places (not just when reading directories) - but given there's caching going on over the network, it may be hard to reproduce other errors (which also means they're unlikely to occur in production anyway). So if you want better testing, you may want to tweak parameters in Listen (e.g. polling latency - since I'm guessing you're using polling on CIFS) and in the CIFS server (e.g. reducing cache size, timeouts, etc.). Once you're done, let me know and I'll release a version with the fix(es).
@e2 been running this patch for some time in production. It seems to do the trick for us.
❤️
Released in 2.8.6, so you can forget this issue ever existed ;)
First of all, thank you for this library! 💛
We are using listen (in polling mode) to monitor files placed on a network share (CIFS). The server providing the share is restarted quite often, which means the share is gone for a couple of minutes. Sometimes listen is able to recover from that downtime, but usually the listener gets into a broken state: the process is still alive, but it's no longer watching the share when it comes back online.
Inspecting the logs shows the following errors (full log including backtraces at https://gist.github.com/senny/5ef7515ae6ecf0c57b9f):
1.) [2015-02-12T03:25:25.223244 #32185] ERROR -- : Actor crashed!
2.) [2015-02-12T03:25:35.226416 #32185] ERROR -- : Actor crashed!
3.) [2015-02-12T03:25:36.228680 #32185] ERROR -- : Actor crashed!
4.) [2015-02-12T03:25:36.232612 #32185] ERROR -- : Actor crashed!
5.) [2015-02-12T03:25:36.267081 #32185] ERROR -- : Actor crashed!
As seen in the full log, most of these errors are printed multiple times.
As the process is not dying, our monitoring solution does not pick up the broken state and doesn't restart the process. I was wondering whether there is a way to recover from such downtimes.
Thanks in advance.