New undergoal chunks while chunkserver is down #338

Closed
onlyjob opened this Issue Nov 22, 2015 · 14 comments

@onlyjob
Member

onlyjob commented Nov 22, 2015

This is somewhat related to #227 but more serious.
Suppose I have /home with goal std3 defined as $std {poolA poolB poolC}. There are no undergoal chunks in std3.
There are 5 chunkservers in poolC. I stop one poolC chunkserver for some time.
Meanwhile, some files in /home are re-written while that poolC chunkserver is down.
After some time (within the REPLICATIONS_DELAY_DISCONNECT period) I start the poolC chunkserver again and observe its initialisation: after scanning is done and the chunkserver has re-joined, there are new undergoal std3 chunks(!).
In other words, while a chunkserver is down, goals are not obeyed and re-written chunks are left degraded (undergoal), which is dangerous if the chunkserver is lost and never comes back. Goals should be obeyed whenever possible (without accumulating debt).
In this scenario there are 4 remaining chunkservers in poolC with plenty of disk space, so there is no reason not to make a third copy as the goal requires.
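For context, a setup like this would normally be expressed as a custom goal in mfsgoals.cfg plus the disconnect grace period in mfsmaster.cfg. The sketch below is only illustrative (the goal id and the delay value are assumptions, not my actual config):

    # mfsgoals.cfg: goal "std3" = one copy on a chunkserver labelled
    # poolA, one on poolB and one on poolC (goal id 3 chosen arbitrarily)
    3 std3: $std {poolA poolB poolC}

    # mfsmaster.cfg: grace period (seconds) before the master starts
    # replicating chunks of a disconnected chunkserver elsewhere
    REPLICATIONS_DELAY_DISCONNECT = 3600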

Please make sure that re-written chunks obey their goal setting even when the chunkserver holding one replica is down. That requires creating all updated replicas on the available chunkservers and incrementing the chunk version.

A LizardFS cluster should behave reasonably at all times, even when some chunkservers are down. The larger the cluster, the greater the probability that at least one chunkserver is down at any given moment.

@psarna

Member

psarna commented Nov 23, 2015

It has exactly the same cause as #227. We will definitely address all those issues (e.g. by ignoring the REPLICATIONS_DELAY_DISCONNECT value if a chunk is modified and/or by introducing minimal goals), but it is not high on the priority list right now.

@onlyjob

Member Author

onlyjob commented Nov 23, 2015

Well, #227 was opened 11 months ago, so maybe it could be prioritised a little? After all, it is a matter of reliability... I suppose ignoring REPLICATIONS_DELAY_DISCONNECT should be easier than introducing a minimal goal, so maybe the first option could be implemented soon-ish?

@onlyjob

Member Author

onlyjob commented Dec 5, 2015

To me this is a pretty serious issue. For example, we have three labels (PoolA, PoolB and PoolC) to accommodate std3 files. Chunkservers in PoolC are volatile and one or more of them are down for several hours every day. Under such circumstances the most important, most frequently modified files in std3 are degraded most of the time, because LizardFS does not ensure that a third replica is made on the available chunkservers... In the short term this problem appears easy to fix by "ignoring the REPLICATIONS_DELAY_DISCONNECT value if a chunk is modified". I hope it can be done sooner rather than later... Please?

@onlyjob

Member Author

onlyjob commented Dec 10, 2015

Finally it hit me: yesterday, for the very first time in my experience with LizardFS, I lost all three(!) replicas of a file [1] just by sequentially restarting chunkservers.

[1]: FYI, the lost file is cookies.sqlite from my active Firefox (aka Iceweasel) session. :(

Please do something about this issue ASAP.

@DarkHaze

Contributor

DarkHaze commented Dec 10, 2015

I can confirm that we are able to duplicate this problem. We will try to fix it as soon as possible.

@psarna

Member

psarna commented Dec 10, 2015

By the way, if this unfortunate thing happens, applying mfsfilerepair will almost always safely restore the file - the master reports that there are no valid copies because it cannot find the copy with the newest version, while older versions are still kept on the chunkservers. We will fix this issue as soon as possible.
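A minimal sketch of that recovery path, assuming the file lives somewhere under the LizardFS mount (the path below is hypothetical):

    # see which chunks/copies the master currently knows about
    mfsfileinfo /mnt/lizardfs/home/user/file.dat

    # if chunks are reported without valid copies, ask the master to
    # accept the newest copy still present on some chunkserver
    mfsfilerepair /mnt/lizardfs/home/user/file.dat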

@psarna

Member

psarna commented Dec 10, 2015

Update: This issue is fully reproducible on 2.6.0 and 2.5.4, so there is a big task ahead of us - debugging someone else's old code. We will work it out, but it might take time.

The dangerous situation is when a file is frequently written to and its chunkservers are restarted in a short amount of time.

@onlyjob

Member Author

onlyjob commented Dec 10, 2015

Thanks for the confirmation and the reassurance, guys. :)

By the way, if this unfortunate thing happens

Then one of the challenges is to find the affected file. In this instance it took hours to scan 40,000,000 files with mfsfileinfo and identify the damaged one. It reminds me of old bugs #174 and #222, with instructions on how to find a corrupted file.
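(For the record, the sweep was essentially a brute-force walk like the sketch below; the mount point and the exact "no valid copies" wording are assumptions, so match them to whatever mfsfileinfo prints on your version.)

    # print every file whose chunk report mentions missing copies;
    # painfully slow over tens of millions of files, but it works
    find /mnt/lizardfs -type f -print0 |
    while IFS= read -r -d '' f; do
        mfsfileinfo "$f" | grep -q 'no valid copies' && printf '%s\n' "$f"
    done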

Provided there are enough operational chunkservers, 3 replicas should be sufficient protection from such incidents, as goal=3 is typically used when losing a file is not an option (i.e. when data loss cannot be tolerated).

...applying mfsfilerepair will almost always safely restore the file...

It is nice to know that mfsfilerepair does not always just fill the missing chunk with zeros, but can sometimes revert a chunk to a previous version. However, I'm afraid it can leave multi-chunk files in an inconsistent state when one fragment of the file is older than the rest of it.

For the cookies.sqlite file I believe using mfsfilerepair would not be safe, because there are companion files cookies.sqlite-shm and cookies.sqlite-wal, and one of them contains the journal. An attempt to flush a recent journal onto an older version of the data file will certainly either fail or leave the data in an inconsistent state. That's why I restored all three files from the last snapshot.

This issue is fully reproducible on 2.6.0 and 2.5.4

That's why I opened #227 almost a year ago, during my early adoption of LizardFS...

I'm glad to know that you are working on this important issue. Good luck. :)

@onlyjob

Member Author

onlyjob commented Dec 10, 2015

Related bug: #252 "undergoal chunks on (normal) restart of chunkserver".

@psarna

Member

psarna commented Dec 11, 2015

I was going to refer to #227 but you beat me to it. Actually, I think these posts relate to exactly the same issue.

We were able to reproduce the issue by writing to a small file non-stop while continuously restarting all the chunkservers it could write to. After ~200 restarts we managed to get this file into a "no valid copies" state. It happened because the master was sure that the file was available at version, say, 60, while the freshest version any chunkserver had was 59. This, in turn, happened because the chunkservers were constantly forced to restart, so some write operations were interrupted and failed.
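(In essence the reproduction is two loops running side by side, roughly like the sketch below; hostnames, paths, the service name and the timing are made up for illustration.)

    # terminal 1: rewrite one small file on the LizardFS mount non-stop
    while true; do
        date > /mnt/lizardfs/test/small_file
    done

    # terminal 2: restart the chunkservers holding its copies one by one,
    # faster than replication can catch up
    while true; do
        for cs in cs1 cs2 cs3; do
            ssh "$cs" 'service lizardfs-chunkserver restart'
            sleep 5
        done
    done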

In this case, what mfsfilerepair would do is restore the freshest version available, because the write to version 60 was clearly attempted on a chunkserver that was being restarted, so it failed in the middle of the operation. Version 59 is safe and consistent, though.

"no valid copies" status, in this particular case, doesn't exactly mean 'data loss'. It means that administrator should decide what to do with the file, because there was a failure during writing to it - so the wise decision here is to try to resurrect faulty chunkservers, and then, if need be, run mfsfilerepair to restore the most fresh copy.

I believe the solution to this problem is introducing the functionality mentioned in #227, which is, unfortunately, a big task. Doable, though.

@onlyjob

Member Author

onlyjob commented Dec 11, 2015

Actually, I think these posts relate to exactly the same issue.

Great, then we'll have the opportunity to close all those bugs at once. ;)

We were able to reproduce the issue by writing to a small file non-stop, while continuously restarting all chunkservers it could write to.

Interestingly, I did not restart all 3 chunkservers at the same time. I restarted them one by one with a little delay between restarts (without waiting for full replication). Ironically, I did not want to put the data at risk, because I was aware of undergoal chunks on chunkserver restart. A little lack of patience and here we are... :)

so some writing operations were interrupted and failed.

Does it mean that chunkservers do not acknowledge writes, or that the client does not bother to confirm successful write operations?

Just curious, which of the proposed implementations are you considering?

  1. ignoring the REPLICATIONS_DELAY_DISCONNECT value if chunk is modified.
  2. always create all replicas as per goal requirement.
  3. introduce new minimal goal setting?

I think there is another related problem, which I reported as a separate issue: #353.

@psarna

Member

psarna commented Dec 11, 2015

Interesting to note that I did not restart all 3 chunkservers at the same time.

We didn't either, always one by one.

As for failed writes, there are thousands of places where an operation can fail, and a failure can affect every single part of the system: client, chunkserver, master, etc. What we wanted to achieve is that even if something goes really wrong, mfsfilerepair can help you.

ignoring the REPLICATIONS_DELAY_DISCONNECT value if chunk is modified.

This would work as long as replication after modifying the chunk is instant, but it may not be, so it's not a complete solution. It would probably help, though.

always create all replicas as per goal requirement.

This would sometimes mean reading the whole chunk first, then appending data on some chunkservers and writing the whole chunk to others, and that would be a real pain to implement due to the system's design. It could also create a lot of overgoal chunks and network traffic. So no.

introduce new minimal goal setting?

This is the most complete solution.

@psarna

Member

psarna commented Jul 15, 2017

And the most complete solution was merged a while ago: 33c1d28

psarna closed this Jul 15, 2017

@onlyjob

Member Author

onlyjob commented Jul 17, 2017

REDUNDANCY_LEVEL is poorly documented. From its description it looks like it merely affects replication. It is very unclear that REDUNDANCY_LEVEL influences writing as well...
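For anyone else landing here: as far as I can tell it is set in mfsmaster.cfg, something like the sketch below; the value is only an example, and the exact semantics (in particular how it gates writes to degraded chunks) should be taken from commit 33c1d28 and the man page rather than from this note.

    # mfsmaster.cfg
    # minimal level of redundancy the master tries to guarantee; per the
    # discussion above it affects writing as well, not only replication
    REDUNDANCY_LEVEL = 1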
