
Recover from watching a network share (CIFS), which might be down at times #293

Closed

senny opened this issue Feb 12, 2015 · 8 comments

senny commented Feb 12, 2015

First of all, thank you for this library! 💛

We are using listen (in polling mode) to monitor files placed on a network share (CIFS). The server providing the share is restarted quite often, which means the share is gone for a couple of minutes at a time. Sometimes listen is able to recover from that downtime, but usually the listener ends up in a broken state: the process is still alive, but it no longer watches the share once it comes back online.
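
For context, here is a minimal sketch of the kind of watcher we run (the share path and the latency value are placeholders, not our real configuration):

```ruby
require 'listen'

# Placeholder path and latency; force_polling matches our CIFS setup,
# since filesystem events don't propagate over the network share.
listener = Listen.to('/mnt/share', force_polling: true, latency: 2.0) do |modified, added, removed|
  puts "modified: #{modified.inspect}, added: #{added.inspect}, removed: #{removed.inspect}"
end
listener.start
sleep
```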

Inspecting the logs shows the following errors (full log including backtraces at https://gist.github.com/senny/5ef7515ae6ecf0c57b9f):

1.) [2015-02-12T03:25:25.223244 #32185] ERROR -- : Actor crashed!

Errno::EHOSTDOWN: Host is down @ dir_initialize - <THE/DIRECTORY/LISTEN/WATCHES>

2.) [2015-02-12T03:25:35.226416 #32185] ERROR -- : Actor crashed!

Celluloid::TimeoutError: linking timeout of 5 seconds exceeded

3.) [2015-02-12T03:25:36.228680 #32185] ERROR -- : Actor crashed!

Celluloid::DeadTaskError: cannot resume a dead task (dead fiber called)

4.) [2015-02-12T03:25:36.232612 #32185] ERROR -- : Actor crashed!

NoMethodError: undefined method `silencer' for #&lt;Hash:0x007fd35746c008&gt;

5.) [2015-02-12T03:25:36.267081 #32185] ERROR -- : Actor crashed!

RuntimeError: can't add a new key into hash during iteration

As seen in the full log, most of these errors are printed multiple times.

As the process does not die, our monitoring solution does not pick up the broken state and does not restart it. I was wondering whether there is a way to recover from such downtimes.

Thanks in advance.

e2 commented Feb 12, 2015

As a quick patch, try adding Errno::EHOSTDOWN wherever there's a rescue block with Errno::ENOENT, e.g. here:

rescue Errno::ENOENT

I'm not sure if treating the host being down the same as a file not existing is strictly correct, but during the next scan Listen should pick up the changes as "new files" once the host is up again.
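
A sketch of the resulting pattern (the helper below is hypothetical, not Listen's actual internals; only the rescue list matters):

```ruby
# Hypothetical helper for illustration. Dir.entries on a dead CIFS mount
# raises Errno::EHOSTDOWN from dir_initialize, matching the first crash
# in the log above.
def list_entries(dir)
  Dir.entries(dir)
rescue Errno::ENOENT, Errno::EHOSTDOWN
  # Treat an unreachable host like a missing directory: report nothing now,
  # and let a later polling scan pick the files up as "new" once it returns.
  []
end
```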

Ultimately, you want to have enough rescues to avoid exceptions "leaking" outside the method where they happen.

Don't worry about any crashes beyond the first one, since one crash can pretty much bring down everything.

There may be a generic way to handle these and other exceptions, but for now it's going to be faster and better for you to patch until you have no more problems (and send me a PR or even a diff, and I'll add tests/specs, etc.)

And don't hesitate to ask about internals - they can get a bit tricky, and unless you're Detective Columbo, I'd recommend just asking.

senny commented Feb 12, 2015

@e2 thanks for the detailed response. I'll have a shot at adding the rescues and see how things turn out.

e2 commented Feb 12, 2015

As for Listen not watching files once the host is back up: you may want to do a "touch" on a directory or file so Listen can rebuild its directory records. I'm not sure yet why it wouldn't pick up changes, but rescuing exceptions first is the fastest/best thing to do at this point.
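
Something along these lines (the path is a placeholder for your watched root):

```ruby
require 'fileutils'

# Bump the mtime of the watched root once the share is reachable again,
# nudging the poller to rescan the tree.
FileUtils.touch('/mnt/share')
```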

senny commented Feb 12, 2015

@e2 just deployed https://github.com/senny/listen/commit/35ef5a0cdc0f1598144855bf99260ef3f5a8bc2f and it looks like it solves the issue. We simulated some downtime and listen kept working without crashes. I put the patch in our sandbox environment so we can collect more data under somewhat real conditions.

e2 commented Feb 12, 2015

Theoretically there should be errors in other places, too (not just when reading directories), but given that there's caching going on over the network, other errors may be hard to reproduce (which also means they won't occur in production anyway).

So if you want better testing, you may want to tweak parameters in Listen (e.g. the polling latency, since I'm guessing you're using polling on CIFS) and on the CIFS server (e.g. reducing cache sizes, timeouts, etc.).
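
For example, dropping the polling latency well below your normal value (the number here is arbitrary) makes short downtime windows much easier to hit while testing recovery:

```ruby
require 'listen'

# Arbitrary, very low latency purely for stress-testing recovery.
listener = Listen.to('/mnt/share', force_polling: true, latency: 0.25) { |m, a, r| }
listener.start
```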

If you're done, let me know and I'll release a version with the fix(es).

senny commented Mar 9, 2015

@e2 we've been running this patch in production for some time now. It seems to do the trick for us.

e2 added a commit that referenced this issue Mar 9, 2015
e2 closed this as completed in b168916 Mar 9, 2015

senny commented Mar 9, 2015

❤️

e2 commented Mar 9, 2015

Released in 2.8.6, so you can forget this issue ever existed ;)
