flock #19

Closed · earthgecko opened this issue Jun 10, 2014 · 11 comments · Labels: bug

@earthgecko (Author) commented Jun 10, 2014

Great effort, whisper can be a pain.

I have always been led to believe that "Graphite will lock the Whisper file using flock() during updates so that no other Graphite process reads it mid-update and finds inconsistent data.

Unfortunately, flock() is only an advisory lock and rsync ignores it. This means that if Graphite updates a particular Whisper file while rsync is in the middle of backing it up, the backup copy might be inconsistent." — from the folks at SwiftStack (@smerritt).

Having looked through the code I do not see any reference to flock in any way. Is this just a non-issue because the healing is done against the local whisper file? If so, I think a pull request adding an ssh key option is in order from me.

@jssjr (Member) commented Jun 10, 2014

Unfortunately, flock() is only an advisory lock and rsync ignores it. This means that if Graphite updates a particular Whisper file while rsync is in the middle of backing it up, the backup copy might be inconsistent.

Really interesting observation, and one that I hadn't considered.

If rsync is ignoring flock() on specific files, then it seems reasonable that a file can be transmitted in an inconsistent state. However, whisper wraps the update transaction in a lock, and the sync job rsyncs the source files to /tmp on the target host in a batch before merging the values into the target database. So this risk only affects us if whisper is inside a create or update transaction at the moment rsync reads and sends the file.
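
For reference, a rough sketch of that flow; the hostname, paths, and the use of whisper.merge() below are placeholders rather than carbonate's actual sync/fill implementation:

```python
# Illustrative sketch only: rsync a batch of whisper files from the source
# node into a staging directory on the target, then merge them into the live
# databases. Hostnames and paths are placeholders; carbonate's real sync and
# fill logic differs in detail.
import subprocess
import whisper

SRC = "graphite-old:/opt/graphite/storage/whisper/"  # assumed source node/path
DST_ROOT = "/opt/graphite/storage/whisper/"
STAGING = "/tmp/carbonate-staging/"
batch = ["carbon/agents/foo/metricsReceived.wsp"]    # relative metric paths

# 1. Pull the batch into the staging area on the target host.
subprocess.run(
    ["rsync", "-a", "--files-from=-", SRC, STAGING],
    input="\n".join(batch).encode(),
    check=True,
)

# 2. Merge each staged copy into the live whisper file. If whisper was
#    mid-update when rsync read a file, the staged copy can be inconsistent.
for rel in batch:
    whisper.merge(STAGING + rel, DST_ROOT + rel)
```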

I'm honestly not sure how to solve this and continue to use rsync, if rsync isn't willing to respect a shared lock. Do you have any ideas here?

Removing the rsync dependency and having carbon-sync use the carbonlink protocol directly to pull data is an option, but one that will require significant work and tuning.

@earthgecko (Author) commented

Well, if there were a pip-installable libflockit (https://github.com/smerritt/flockit) it could be added to the requirements :) Or perhaps something similar could be done with Python's fcntl.flock().

At the moment, on either end we do:
flockit.so rsync the .wsp file to a tmp location -> rsync the .wsp to a remote tmp location -> flockit.so rsync the remote tmp .wsp onto the physical .wsp file

env LD_PRELOAD=/usr/lib/libflockit.so FLOCKIT_FILE_PREFIX=$WHISPER_PATH rsync $OPTION_THINGS <src> <dest> would do it all.

But seeing as the file on the dest is being replaced and healed in situ locally, there should be no need to use libflockit.so on the dest unless it is a file that carbon is also updating via a relay, etc. (i.e. it already exists).
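
For illustration, the same LD_PRELOAD trick driven from Python; the library path and whisper prefix are taken from the example command above, and the destination host is a placeholder:

```python
# Sketch of the LD_PRELOAD/flockit approach above, driven from Python.
# The library path, whisper prefix, and destination are assumptions taken
# from the example command; adjust for your installation.
import os
import subprocess

WHISPER_PATH = "/opt/graphite/storage/whisper/"

env = dict(os.environ)
env["LD_PRELOAD"] = "/usr/lib/libflockit.so"   # make rsync flock() the files it reads
env["FLOCKIT_FILE_PREFIX"] = WHISPER_PATH      # only lock files under the whisper tree

subprocess.run(
    ["rsync", "-a", WHISPER_PATH, "dest-host:/tmp/whisper-staging/"],
    env=env,
    check=True,
)
```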

@jssjr (Member) commented Jun 10, 2014

Are you suggesting we apply a shared lock around the batch of files to sync in https://github.com/graphite-project/whisper/blob/master/whisper.py#L540-L542? I worry about that causing problems by blocking the local carbons, but with a sufficiently small batch size it might be fine.

@earthgecko (Author) commented

Yeah, we flock rsync per file. Figured it was the only way to ensure that we are not creating a Schrödinger's cat type scenario, short of somehow validating every timeseries data set.
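
For illustration, a pure-Python per-file variant (not what carbonate does today): hold a shared flock() on each whisper file while copying it to a staging path, so a writer holding an exclusive lock finishes before the copy reads it. Paths are placeholders:

```python
# Illustrative only: copy a whisper file to a staging location while holding a
# shared flock(), so a writer holding LOCK_EX finishes before we read.
import fcntl
import shutil

def copy_with_shared_lock(src, dst):
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        fcntl.flock(fsrc.fileno(), fcntl.LOCK_SH)   # blocks while an exclusive lock is held
        try:
            shutil.copyfileobj(fsrc, fdst)
        finally:
            fcntl.flock(fsrc.fileno(), fcntl.LOCK_UN)

# Hypothetical paths for illustration.
copy_with_shared_lock(
    "/opt/graphite/storage/whisper/carbon/agents/foo/metricsReceived.wsp",
    "/tmp/whisper-staging/metricsReceived.wsp",
)
```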

@jssjr (Member) commented Jun 10, 2014

I'd be OK with that, but I wonder if it should be the default. I'm still worried that, since we transfer files in batches of 1000 at a time to avoid the overhead of setting up the connection for every file, we'll cause the carbon processes on the source node to block for an unreasonable amount of time. I'm unsure how to handle this.

@jssjr (Member) commented Jun 10, 2014

Want to send over a PR with your ideas? We can riff on that and get this solved.

@jssjr added the bug label Jun 10, 2014
@earthgecko (Author) commented

Forked. No quick promises though; I have libcloud, fog, and skyline pull requests that need to be done and are lagging. This social coding malarkey, who thought that was a good idea ;)

@filippog (Contributor) commented Feb 2, 2015

Hi,
thanks for carbonate! I was thinking about locking too. Related to this, whisper.LOCK seems to be set by carbon when WHISPER_LOCK_WRITES is set in graphite's config (see graphite-project/whisper#97); however, carbonate doesn't explicitly lock, if I'm reading correctly. So I think the following is also possible:

  • expand a cluster ring from N to M
  • metrics start flowing to their new position in the ring
  • carbon-sync is run to sync and merge metrics
    • meanwhile, the local carbon-cache also writes to the same files while carbon-sync is backfilling
  • whisper files are corrupted due to the lack of locking (?)

@jssjr (Member) commented Feb 7, 2015

@filippog - That seems accurate. I believe we want to duplicate the WHISPER_LOCK_WRITES directive in the carbonate config file, modify the import so we have access to LOCK, and set it so that update_many() calls flock before making changes. Does this sound right? Want to take a stab at a PR, or should I?
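
A rough sketch of that idea; the carbonate config flag name below is an assumption, while whisper.LOCK and whisper.update_many() are the existing whisper hooks:

```python
# Sketch only: mirror carbon's WHISPER_LOCK_WRITES in carbonate so whisper
# takes an exclusive flock() inside its update/create calls.
import whisper

def configure_whisper_locking(lock_writes):
    # 'lock_writes' would come from a (hypothetical) WHISPER_LOCK_WRITES
    # setting in carbonate's config file.
    if lock_writes:
        whisper.LOCK = True   # whisper then flocks files in update_many()/create()

configure_whisper_locking(True)
whisper.update_many("/opt/graphite/storage/whisper/foo.wsp",
                    [(1434000000, 42.0)])   # placeholder path and datapoint
```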

@filippog (Contributor) commented Feb 7, 2015

Sounds good @jssjr! I've split this into #47 so we can follow up there; I'll take a stab at a PR shortly!

wmfgerrit pushed a commit to wikimedia/operations-puppet that referenced this issue Apr 20, 2015
Our graphite clustering plan involves using carbonate while carbon-cache is running; this can potentially lead to corrupted whisper files if locking isn't used. See also graphite-project/carbonate#19.

Bug: T86316
Change-Id: I76b064acf3b7ccad17313a4f05c3b72b3b01b798
@deniszh (Member) commented Mar 19, 2017

Fixed in #71

@deniszh closed this as completed Mar 19, 2017