
XOR and EC goals are not safe, even when just single disk fails #910

Open
aletus opened this issue Oct 5, 2021 · 5 comments
aletus commented Oct 5, 2021

All,

A heads up about a major danger: XOR and EC goals are not safe even with a single disk failure, due to an edge case in chunk distribution.

Recently one of my chunk servers had a hardware issue and I had to take it offline for repair.

Everything was green across the board beforehand (no missing, endangered, or under-goal chunks, etc.) and all my files were XOR3 or EC8+2, so LizardFS should have been able to handle one chunk server going down.

However, as soon as I took that single chunk server down, a whole lot of chunks showed up as missing and files showed up as unavailable.

When I brought the server back up, everything went back to green and available.

So even though the files are XOR3 and EC8+2 and the file system should have been able to handle one disk failure, it didn't.

In this case I was lucky it was an unrelated hardware failure rather than a disk failure, so I was able to bring that chunk server back online; with a real disk failure I would have lost those files for good.

Upon further investigation of a few of those unavailable files, I found out that this single chunk server was storing multiple parts of the same chunk.

Here is an example for XOR3:

  chunk 3: 00000000002480C5_00000001 / (id:2392261 ver:1)
            copy 1: 192.168.1.11:9422:lp part 3/4 of xor3
            copy 2: 192.168.1.11:9422:lp part 4/4 of xor3
            copy 3: 192.168.1.18:9422:lp part 1/4 of xor3
            copy 4: 192.168.1.15:9422:lp part 2/4 of xor3

In this case, since this is XOR3, the chunk is split into 3 data parts plus 1 XOR parity part, for a total of 4 parts. That should be able to survive 1 disk failure.

However, I'm not sure how LizardFS got into this state, but it stored 2 out of the 4 parts on the chunk server 192.168.1.11. So when I took that particular server offline, only 2 parts out of 4 were left, which is not enough to reconstruct the chunk, and hence the chunk and the file became missing and unavailable.
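
For reference, here is the arithmetic behind why this is fatal: XOR parity can rebuild any one missing part from the remaining three, but not two missing parts. A tiny Python sketch of the idea (purely illustrative, not LizardFS code):

    # Minimal illustration of XOR3-style recovery (not LizardFS code): a chunk
    # is split into 3 data parts plus 1 parity part, and any single missing
    # part can be rebuilt by XOR-ing the other three together.
    from functools import reduce

    def xor_bytes(parts):
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*parts))

    data = [b"part-one", b"part-two", b"part-3!!"]   # 3 equal-length data parts
    parity = xor_bytes(data)                         # the 4th, parity part

    # Lose any single part (say data[0]): recover it from the other three.
    recovered = xor_bytes([data[1], data[2], parity])
    assert recovered == data[0]

    # But losing two of the four parts (the situation above, when one
    # chunkserver holding two parts goes offline) is unrecoverable with XOR.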

The dangerous thing is that LizardFS currently does not have a way to detect this scenario, and therefore does not redistribute the parts to another chunk server to make things safer. There are more than enough chunk servers, with plenty of free space, to do the redistribution, but I don't think any code currently exists to detect this scenario and redistribute.

I did a cursory code read-through, and it seems the file and chunk test loop only checks for the correct number of parts (in this example it's XOR3, so it needs 4 parts, there are 4 parts, and it's considered good). It does not check where the parts are stored, and it does not redistribute them, so the system remains in this dangerous situation.
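
To make the gap concrete, here is a minimal Python sketch (purely illustrative; not the actual master-server code or data structures) of the difference between the check the test loop appears to do today (are all parts present?) and the placement check that seems to be missing (can the chunk survive losing any single chunkserver?):

    # Illustration only -- not the actual LizardFS test loop. The point is that
    # "enough parts exist" and "the parts are spread across enough servers" are
    # two different invariants, and only the first appears to be enforced.
    from collections import Counter

    def count_check(parts, total_parts):
        # What the test loop seems to verify today: all parts are present.
        return len(parts) == total_parts

    def placement_check(parts, max_parts_per_server):
        # The missing check: no single chunkserver may hold more parts than
        # the redundancy can afford to lose at once.
        per_server = Counter(server for server, _part in parts)
        return max(per_server.values()) <= max_parts_per_server

    # The XOR3 chunk from the example above: .11 holds two of the four parts.
    chunk = [("192.168.1.11:9422", "part 3/4 of xor3"),
             ("192.168.1.11:9422", "part 4/4 of xor3"),
             ("192.168.1.18:9422", "part 1/4 of xor3"),
             ("192.168.1.15:9422", "part 2/4 of xor3")]

    print(count_check(chunk, total_parts=4))                # True  -> looks healthy
    print(placement_check(chunk, max_parts_per_server=1))   # False -> actually endangered

A check like the second one would have reported the example chunk as endangered while every server was still up, so it could have been rebalanced before anything failed.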

Hope this gives a heads up to people looking to use XOR and EC goals.

aNeutrino (Member) commented:

Hi @aletus, this looks like a very interesting bug; thank you for sharing it.
Can we have a video chat about it, please?
If so, can you send your contact details to aneutrino@lizardfs.org?
Once we have more data and a clear reproduction scenario, I will gladly update this issue with an explanation and the fix.

njhurst commented Nov 9, 2021

I've observed this for EC goals too. @aNeutrino, do you have a pointer to the code where the logic for spreading the chunks out lives?

aletus (Author) commented Feb 23, 2022

The inevitable happened today: I lost files and data even with EC goal coverage and just a single disk failure.

One of my hard drives developed bad sectors, LizardFS reported a damaged disk, and I had to remove it from the system. A lot of chunks and files then showed up as unavailable even though the goal was EC(8,2).

I spot-checked a few smaller missing files with mfsfileinfo and they only had 7 out of 10 parts; EC(8,2) needs any 8 of its 10 parts to reconstruct, so 7 is not enough. So either LizardFS put all 3 missing parts on this single failed disk, or it only ever generated 8 parts and one of them was on this particular disk.

I'm not sure how to provide a "clear reproduction scenario". The system is definitely in a bad and precarious state; we can certainly debug from here and hopefully find a way to fix it, but I'm not sure how to get to this state from a clean install.

The best thing at the moment would be a way to detect files that are at risk and a way to force redistribution.
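
In the meantime, a crude external detector is possible by parsing mfsfileinfo output (the same format quoted above) and flagging any chunk that keeps more than one part on a single chunkserver. A rough, untested Python sketch; the regex is based only on the output format shown in this thread, and it hard-codes the single-failure assumption:

    # Rough external detector for at-risk files, based only on the mfsfileinfo
    # output format quoted earlier in this thread ("copy N: ip:port ..." lines
    # under each "chunk" line). Assumes the failure model is one lost
    # server/disk, so any chunk with 2+ parts on the same server is at risk.
    import re
    import subprocess
    import sys
    from collections import Counter

    COPY_RE = re.compile(r"copy \d+: (\d+\.\d+\.\d+\.\d+:\d+)")

    def at_risk_chunks(path):
        out = subprocess.run(["mfsfileinfo", path],
                             capture_output=True, text=True).stdout
        risky, servers = [], []
        for line in out.splitlines() + ["chunk <end>"]:  # sentinel flushes last chunk
            if line.lstrip().startswith("chunk"):
                if servers and max(Counter(servers).values()) > 1:
                    risky.append(servers)
                servers = []
            else:
                m = COPY_RE.search(line)
                if m:
                    servers.append(m.group(1))
        return risky

    if __name__ == "__main__":
        for path in sys.argv[1:]:
            for servers in at_risk_chunks(path):
                print(f"{path}: multiple parts on one chunkserver: {sorted(servers)}")

For a flagged file, rewriting it (or perhaps temporarily changing its goal with lizardfs setgoal and setting it back) might force the parts to be re-placed, but I haven't verified that this actually moves the co-located parts.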

dagelf commented Apr 13, 2022

I wonder if this could be the reason MooseFS doesn't delete discarded chunks until a week later.

szycha76 commented May 6, 2022

@aletus, which version are you using? I did notice uneven distribution in only a few files/cases throughout 3+ years of quite intensive use of 3.12.0, but nothing that would threaten my data - and believe me, you probably wouldn't entrust your data to the drives I did ;-)

I'm mostly using 16+2, 8+2, 4+2, 4+1, 2+1 and x2, x3, x4 replication modes, and if one drive goes south I can see several chunks with 2 copies left to replicate, but on the other hand they do not appear in the "endangered chunks" list, so I never wondered what the reason for this behavior was.
