flock #19

Great effort, whisper can be a pain.

I have always been led to believe that "Graphite will lock the Whisper file using flock() during updates so that no other Graphite process reads it mid-update and finds inconsistent data. Unfortunately, flock() is only an advisory lock and rsync ignores it. This means that if Graphite updates a particular Whisper file while rsync is in the middle of backing it up, the backup copy might be inconsistent." (from the folks at SwiftStack, @smerritt)

Having looked through the code, I do not see any reference to flock in any way. Is this simply a non-issue given the context of healing to the local whisper file? If so, I think a pull request adding an ssh key option is in order for me.

Comments
Really interesting observation, and one that I hadn't considered. If rsync is ignoring flock() on specific files, then it seems reasonable that the file can be transmitted in an inconsistent state. However, since whisper wraps the update transaction in a lock, and the sync job rsyncs the source files to /tmp on the target host in a batch before merging the values into the target database, this risk only affects us if whisper is inside a create or update transaction at the moment rsync reads and sends the file. I'm honestly not sure how to solve this and continue to use rsync if rsync isn't willing to respect a shared lock. Do you have any ideas here? Removing the rsync dependency and having carbon-sync use the carbonlink protocol directly to pull data is an option, but one that would require significant work and tuning.
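For illustration, a minimal sketch of the advisory-locking gap being described, using Python's fcntl module (the file path is hypothetical): the writer's exclusive flock() only blocks other processes that also call flock(), so a reader like rsync that never asks for the lock can still read the file mid-update.

```python
import fcntl

# Hypothetical path, for illustration only.
path = "/opt/graphite/storage/whisper/some/metric.wsp"

with open(path, "r+b") as fh:
    # Writer side: take the exclusive advisory lock, as whisper can do
    # during an update. Other flock() callers will now block...
    fcntl.flock(fh.fileno(), fcntl.LOCK_EX)
    try:
        pass  # ...but a plain open()/read() elsewhere (e.g. rsync) proceeds anyway.
    finally:
        fcntl.flock(fh.fileno(), fcntl.LOCK_UN)
```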
Well, if there were a pip-installable libflockit (https://github.com/smerritt/flockit) it could be added to requirements :) Or perhaps something similar in Python. On either end, at the moment, we do something like the following:
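Presumably something along these lines, sketched here as a Python subprocess call rather than the raw command; the LD_PRELOAD/FLOCKIT_FILE_PREFIX names come from the flockit README, and all paths are illustrative.

```python
import os
import subprocess

# Run rsync under the flockit shim so its reads of whisper files take a
# shared flock() first. Env var names follow the flockit README; the
# library and storage paths here are illustrative.
env = dict(os.environ)
env["LD_PRELOAD"] = "/usr/local/lib/flockit.so"
env["FLOCKIT_FILE_PREFIX"] = "/opt/graphite/storage/whisper"

subprocess.check_call(
    ["rsync", "-a",
     "/opt/graphite/storage/whisper/",
     "dest-host:/tmp/whisper-sync/"],
    env=env,
)
```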
But seeing as the file on the dest is being replaced and healed in situ locally, there should be no need to use libflockit.so on the dest, unless it is a file that carbon is updating via a relay, etc. (i.e. it already exists).
Are you suggesting we apply a shared lock around the batch of files to sync, as in https://github.com/graphite-project/whisper/blob/master/whisper.py#L540-L542? I worry about that causing problems by blocking the local carbons, but with a sufficiently small batch size it might be fine.
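For concreteness, one possible shape of that idea, as a sketch rather than carbonate's actual code: hold a shared flock() on each file in the batch while rsync runs. Note this is exactly where the blocking concern comes from, since whisper's exclusive lock cannot be granted while the shared locks are held.

```python
import contextlib
import fcntl
import subprocess

@contextlib.contextmanager
def shared_locks(paths):
    """Hold a shared flock() on every file in the batch for the duration."""
    handles = [open(p, "rb") for p in paths]
    try:
        for fh in handles:
            # Blocks if a writer currently holds LOCK_EX; once granted,
            # writers requesting LOCK_EX will block until we release.
            fcntl.flock(fh.fileno(), fcntl.LOCK_SH)
        yield
    finally:
        for fh in handles:
            fh.close()  # closing the descriptor releases the advisory lock

# Illustrative batch and destination.
batch = ["/opt/graphite/storage/whisper/foo.wsp",
         "/opt/graphite/storage/whisper/bar.wsp"]
with shared_locks(batch):
    subprocess.check_call(["rsync", "-a"] + batch + ["dest-host:/tmp/whisper-sync/"])
```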
Yeah, we flock rsync per file. Figured it was the only way to ensure we are not creating a Schrödinger's-cat type scenario without validating every timeseries data set somehow.
I'd be OK with that, but I wonder if it should be the default. I'm still worried that, since we transfer files in batches of 1000 at a time to avoid the overhead of setting up the connection for every file, we'll cause the carbon processes on the source node to block for an unreasonable amount of time. I'm unsure how to handle this.
Want to send over a PR with your ideas? We can riff on that and get this solved.
Forked; no quick promises though, as I have lagging libcloud, fog, and skyline pull requests that need to be done first. This social coding malarkey... who thought that was a good idea ;)
hi,
@filippog - That seems accurate. I believe we want to duplicate the …
our graphite clustering plan involves using carbonate while carbon-cache is running; this can potentially lead to corrupted whisper files if locking isn't used. See also graphite-project/carbonate#19.

Bug: T86316
Change-Id: I76b064acf3b7ccad17313a4f05c3b72b3b01b798
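For reference, whisper exposes a module-level LOCK flag that, when enabled (and fcntl is available), makes it take an exclusive flock() around each create/update; carbon turns this on via its WHISPER_LOCK_WRITES setting. A rough sketch of a sync tool opting in before merging datapoints (path and value are illustrative):

```python
import whisper

# When True, whisper takes an exclusive flock() on the .wsp file around
# each create/update, so cooperating readers and writers see consistent data.
whisper.LOCK = True

# Illustrative merge of a single datapoint; a real sync job would loop
# over the points pulled from the source node.
whisper.update("/opt/graphite/storage/whisper/some/metric.wsp", 42.0)
```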
Fixed in #71