
Recover from watching a network share (CIFS), which might be down at times #293

Closed

senny opened this issue Feb 12, 2015 · 8 comments

senny commented Feb 12, 2015

First of all, thank you for this library! 💛

We are using listen (in polling mode) to monitor files placed on a network share (CIFS). The server providing the share is restarted quite often, which means the share is gone for a couple of minutes at a time. Sometimes listen is able to recover from that downtime, but usually the listener ends up in a broken state: the process is still alive, but it no longer watches the share once it comes back online.
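
For context, here is a minimal sketch of the kind of watcher we run (the share path and the latency value are placeholders, not our real configuration):

```ruby
require 'listen'

# Placeholder path and latency; force_polling matches our CIFS setup,
# since filesystem events don't propagate over the network share.
listener = Listen.to('/mnt/share', force_polling: true, latency: 2.0) do |modified, added, removed|
  puts "modified: #{modified.inspect}, added: #{added.inspect}, removed: #{removed.inspect}"
end
listener.start
sleep
```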

Inspecting the logs shows the following errors (full log including backtraces at https://gist.github.com/senny/5ef7515ae6ecf0c57b9f):

1.) [2015-02-12T03:25:25.223244 #32185] ERROR -- : Actor crashed!

Errno::EHOSTDOWN: Host is down @ dir_initialize - <THE/DIRECTORY/LISTEN/WATCHES>

2.) [2015-02-12T03:25:35.226416 #32185] ERROR -- : Actor crashed!

Celluloid::TimeoutError: linking timeout of 5 seconds exceeded

3.) [2015-02-12T03:25:36.228680 #32185] ERROR -- : Actor crashed!

Celluloid::DeadTaskError: cannot resume a dead task (dead fiber called)

4.) [2015-02-12T03:25:36.232612 #32185] ERROR -- : Actor crashed!

NoMethodError: undefined method `silencer' for #&lt;Hash:0x007fd35746c008&gt;

5.) [2015-02-12T03:25:36.267081 #32185] ERROR -- : Actor crashed!

RuntimeError: can't add a new key into hash during iteration

As seen in the full log, most of these errors are printed multiple times.

As the process does not die, our monitoring solution does not pick up the broken state and does not restart it. I was wondering whether there is a way to recover from such downtimes.

Thanks in advance.

e2 commented Feb 12, 2015

As a quick patch, try adding Errno::EHOSTDOWN wherever there's a rescue block with Errno::ENOENT, e.g. here:

rescue Errno::ENOENT

I'm not sure if treating the host being down the same as a file not existing is strictly correct, but during the next scan Listen should pick up the changes as "new files" once the host is up again.
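
A sketch of the resulting pattern (the helper below is hypothetical, not Listen's actual internals; only the rescue list matters):

```ruby
# Hypothetical helper for illustration. Dir.entries on a dead CIFS mount
# raises Errno::EHOSTDOWN from dir_initialize, matching the first crash
# in the log above.
def list_entries(dir)
  Dir.entries(dir)
rescue Errno::ENOENT, Errno::EHOSTDOWN
  # Treat an unreachable host like a missing directory: report nothing now,
  # and let a later polling scan pick the files up as "new" once it returns.
  []
end
```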

Ultimately, you want to have enough rescues to avoid exceptions "leaking" outside the method where they happen.

Don't worry about any crashes beyond the first one, since one crash can pretty much bring down everything.

There may be a generic way to handle these and other exceptions, but for now it's going to be faster and better for you to patch until you have no more problems (and send me a PR or even a diff, and I'll add tests/specs, etc.)

And don't hesitate to ask about internals - they can get a bit tricky, and unless you're Detective Columbo, I'd recommend just asking.

senny commented Feb 12, 2015

@e2 thanks for the detailed response. I'll have a shot at adding the rescues and see how things turn out.

e2 commented Feb 12, 2015

As for Listen not watching files once the host is back up: you may want to do a "touch" on a directory or file so Listen can rebuild its directory records. I'm not sure yet why it wouldn't pick up changes, but rescuing exceptions first is the fastest/best thing to do at this point.
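
Something along these lines (the path is a placeholder for your watched root):

```ruby
require 'fileutils'

# Bump the mtime of the watched root once the share is reachable again,
# nudging the poller to rescan the tree.
FileUtils.touch('/mnt/share')
```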

senny commented Feb 12, 2015

@e2 just deployed https://github.com/senny/listen/commit/35ef5a0cdc0f1598144855bf99260ef3f5a8bc2f and it looks like it solves the issue. We simulated some downtime and listen kept working without crashes. I put the patch in our sandbox environment so we can collect more data under somewhat real conditions.

e2 commented Feb 12, 2015

Theoretically there should be errors in other places, too (not just when reading directories), but given that there's caching going on over the network, other errors may be hard to reproduce (which also means they won't occur in production anyway).

So if you want better testing, you may want to tweak parameters in Listen (e.g. the polling latency, since I'm guessing you're using polling on CIFS) and on the CIFS server (e.g. reducing cache sizes, timeouts, etc.).
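
For example, dropping the polling latency well below your normal value (the number here is arbitrary) makes short downtime windows much easier to hit while testing recovery:

```ruby
require 'listen'

# Arbitrary, very low latency purely for stress-testing recovery.
listener = Listen.to('/mnt/share', force_polling: true, latency: 0.25) { |m, a, r| }
listener.start
```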

If you're done, let me know and I'll release a version with the fix(es).

senny commented Mar 9, 2015

@e2 we've been running this patch in production for some time now. It seems to do the trick for us.

e2 added a commit that referenced this issue Mar 9, 2015
e2 closed this as completed in b168916 Mar 9, 2015

senny commented Mar 9, 2015

❤️

e2 commented Mar 9, 2015

Released in 2.8.6, so you can forget this issue ever existed ;)
