
XOR and EC goals are not safe, even when just single disk fails #910

Open
aletus opened this issue Oct 5, 2021 · 5 comments
aletus commented Oct 5, 2021

All,

A heads up about a major danger: XOR and EC goals are not safe even with a single disk failure, due to an edge case in chunk distribution.

Recently one of my chunk servers had a hardware issue and I had to take it offline for repair.

Everything was green across the board beforehand (no missing, endangered, or under-goal chunks, etc.) and all my files were XOR3 or EC8+2, so LizardFS should have been able to handle one chunk server going down.

However, as soon as I took that single chunk server down, a whole lot of chunks showed up as missing and files showed up as unavailable.

When I brought the server back up, everything went back to green and available.

So even though the files are XOR3 and EC8+2 and the file system should have been able to handle one disk failure, it didn't.

In this case I was lucky it was an unrelated hardware failure rather than a disk failure, so I was able to bring that chunk server back online; with a real disk failure I would have lost those files for good.

Upon further investigation of a few of those unavailable files, I found out that this single chunk server was storing multiple parts of the same chunk.

Here is an example for XOR3:

  chunk 3: 00000000002480C5_00000001 / (id:2392261 ver:1)
            copy 1: 192.168.1.11:9422:lp part 3/4 of xor3
            copy 2: 192.168.1.11:9422:lp part 4/4 of xor3
            copy 3: 192.168.1.18:9422:lp part 1/4 of xor3
            copy 4: 192.168.1.15:9422:lp part 2/4 of xor3

In this case, since this is XOR3, the chunk is split into 3 data parts plus 1 XOR parity part, for a total of 4 parts. That should be able to survive 1 disk failure.

However, I'm not sure how LizardFS got into this state, but it stored 2 out of the 4 parts on the chunk server 192.168.1.11. So when I took that particular server offline, only 2 parts out of 4 were left, which is not enough to reconstruct the chunk, and hence the chunk and the file became missing and unavailable.
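
For reference, here is the arithmetic behind why this is fatal: XOR parity can rebuild any one missing part from the remaining three, but not two missing parts. A tiny Python sketch of the idea (purely illustrative, not LizardFS code):

    # Minimal illustration of XOR3-style recovery (not LizardFS code): a chunk
    # is split into 3 data parts plus 1 parity part, and any single missing
    # part can be rebuilt by XOR-ing the other three together.
    from functools import reduce

    def xor_bytes(parts):
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*parts))

    data = [b"part-one", b"part-two", b"part-3!!"]   # 3 equal-length data parts
    parity = xor_bytes(data)                         # the 4th, parity part

    # Lose any single part (say data[0]): recover it from the other three.
    recovered = xor_bytes([data[1], data[2], parity])
    assert recovered == data[0]

    # But losing two of the four parts (the situation above, when one
    # chunkserver holding two parts goes offline) is unrecoverable with XOR.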

The dangerous thing is that LizardFS currently does not have a way to detect this scenario, and therefore does not redistribute the parts to another chunk server to make things safer. There are more than enough chunk servers, with plenty of free space, to do the redistribution, but I don't think any code currently exists to detect this scenario and redistribute.

I did a cursory code read-through, and it seems the file and chunk test loop only checks for the correct number of parts (in this example it's XOR3, so it needs 4 parts, there are 4 parts, and it's considered good). It does not check where the parts are stored, and it does not redistribute them, so the system remains in this dangerous situation.
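
To make the gap concrete, here is a minimal Python sketch (purely illustrative; not the actual master-server code or data structures) of the difference between the check the test loop appears to do today (are all parts present?) and the placement check that seems to be missing (can the chunk survive losing any single chunkserver?):

    # Illustration only -- not the actual LizardFS test loop. The point is that
    # "enough parts exist" and "the parts are spread across enough servers" are
    # two different invariants, and only the first appears to be enforced.
    from collections import Counter

    def count_check(parts, total_parts):
        # What the test loop seems to verify today: all parts are present.
        return len(parts) == total_parts

    def placement_check(parts, max_parts_per_server):
        # The missing check: no single chunkserver may hold more parts than
        # the redundancy can afford to lose at once.
        per_server = Counter(server for server, _part in parts)
        return max(per_server.values()) <= max_parts_per_server

    # The XOR3 chunk from the example above: .11 holds two of the four parts.
    chunk = [("192.168.1.11:9422", "part 3/4 of xor3"),
             ("192.168.1.11:9422", "part 4/4 of xor3"),
             ("192.168.1.18:9422", "part 1/4 of xor3"),
             ("192.168.1.15:9422", "part 2/4 of xor3")]

    print(count_check(chunk, total_parts=4))                # True  -> looks healthy
    print(placement_check(chunk, max_parts_per_server=1))   # False -> actually endangered

A check like the second one would have reported the example chunk as endangered while every server was still up, so it could have been rebalanced before anything failed.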

Hope this gives a heads up to people looking to use XOR and EC goals.

aNeutrino (Member) commented:

Hi @aletus, this looks like a very interesting bug; thank you for sharing it.
Can we have a video chat about it, please?
If so, can you send your contact details to aneutrino@lizardfs.org?
Once we have more data and a clear reproduction scenario, I will gladly update this issue with an explanation and the fix.

njhurst commented Nov 9, 2021

I've observed this for EC goals too. @aNeutrino, do you have a pointer to the code where the logic for spreading the chunks out lives?

aletus (Author) commented Feb 23, 2022

The inevitable happened today: I lost files and data even with EC goal coverage and just a single disk failure.

One of my hard drives developed bad sectors, LizardFS reported a damaged disk, and I had to remove it from the system. A lot of chunks and files then showed up as unavailable even though the goal was EC(8,2).

I spot-checked a few smaller missing files with mfsfileinfo and they only had 7 out of 10 parts; EC(8,2) needs any 8 of its 10 parts to reconstruct, so 7 is not enough. So either LizardFS put all 3 missing parts on this single failed disk, or it only ever generated 8 parts and one of them was on this particular disk.

I'm not sure how to provide a "clear reproduction scenario". The system is definitely in a bad and precarious state; we can certainly debug from here and hopefully find a way to fix it, but I'm not sure how to get to this state from a clean install.

The best thing at the moment would be a way to detect files that are at risk and a way to force redistribution.
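
In the meantime, a crude external detector is possible by parsing mfsfileinfo output (the same format quoted above) and flagging any chunk that keeps more than one part on a single chunkserver. A rough, untested Python sketch; the regex is based only on the output format shown in this thread, and it hard-codes the single-failure assumption:

    # Rough external detector for at-risk files, based only on the mfsfileinfo
    # output format quoted earlier in this thread ("copy N: ip:port ..." lines
    # under each "chunk" line). Assumes the failure model is one lost
    # server/disk, so any chunk with 2+ parts on the same server is at risk.
    import re
    import subprocess
    import sys
    from collections import Counter

    COPY_RE = re.compile(r"copy \d+: (\d+\.\d+\.\d+\.\d+:\d+)")

    def at_risk_chunks(path):
        out = subprocess.run(["mfsfileinfo", path],
                             capture_output=True, text=True).stdout
        risky, servers = [], []
        for line in out.splitlines() + ["chunk <end>"]:  # sentinel flushes last chunk
            if line.lstrip().startswith("chunk"):
                if servers and max(Counter(servers).values()) > 1:
                    risky.append(servers)
                servers = []
            else:
                m = COPY_RE.search(line)
                if m:
                    servers.append(m.group(1))
        return risky

    if __name__ == "__main__":
        for path in sys.argv[1:]:
            for servers in at_risk_chunks(path):
                print(f"{path}: multiple parts on one chunkserver: {sorted(servers)}")

For a flagged file, rewriting it (or perhaps temporarily changing its goal with lizardfs setgoal and setting it back) might force the parts to be re-placed, but I haven't verified that this actually moves the co-located parts.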

dagelf commented Apr 13, 2022

I wonder if this could be the reason MooseFS doesn't delete discarded chunks until a week later.

szycha76 commented May 6, 2022

@aletus, which version are you using? I did notice uneven distribution in only a few files/cases throughout 3+ years of quite intensive use of 3.12.0, but nothing that would threaten my data - and believe me, you probably wouldn't entrust your data to the drives I did ;-)

I'm mostly using 16+2, 8+2, 4+2, 4+1, 2+1 and x2, x3, x4 replication modes, and if one drive goes south I can see several chunks with 2 copies left to replicate, but on the other hand they do not appear in the "endangered chunks" list, so I never wondered what the reason for this behavior was.
