3.13.0~rc1: epic fail? Growing amount of missing chunks; "replication status: IO error" #746

Open
onlyjob opened this Issue Aug 27, 2018 · 8 comments


onlyjob commented Aug 27, 2018

After upgrading chunkservers to 3.13.0~rc1 I'm afraid I'm not getting away without massive data loss: mfsmaster logs "replication status: IO error" all the time, and as replication progresses, the CGI's Chunks view reports a growing (!) number of missing chunks in ec and xor goals.

Bloody hell... :( :( :(

njhurst commented Aug 27, 2018

Oh no! :(:(:( Thanks for the warning and for being the sacrificial one. I'll wait for now. I hope you have a backup.

onlyjob commented Aug 27, 2018

Thank you for the kind words, @njhurst. No, there is no backup. Where would you back up 100+ TiB? Systems like LizardFS are meant to protect from disasters, not cause them... Unforgivable...

Now I have 100_000+ missing chunks... I'm guessing that 3.13 destroyed chunks that had not finished a goal change and had excess copies (a mix of replicated and EC chunks).

Most of the damage occurred in the most precious data, with goals ec(2,2) and ec(3,2): those files should have been protected by 2 redundant parity parts each (any 2 of the 4 or 5 parts can be lost, i.e. RAID-6 level of safety).
All lost files were readable before the upgrade. No hardware failure was involved.

Replicated goals were not affected as far as I can tell... Maybe a safe upgrade path would be to change all EC goals to replicated ones, wait until no EC chunks are left, and only then upgrade...
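
For reference, a rough sketch of that approach, not a tested procedure; it assumes the filesystem is mounted at /mnt/lizardfs and that "3" is a plain replicated goal defined in mfsgoals.cfg:

        # See which goals are currently in use
        lizardfs getgoal -r /mnt/lizardfs

        # Recursively switch everything to 3 plain copies
        lizardfs setgoal -r 3 /mnt/lizardfs

        # Then wait until the CGI Chunks view shows no remaining ec/xor
        # chunks before upgrading any chunkserver.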

Blackpaw commented Aug 27, 2018

onlyjob commented Aug 28, 2018

Yes, I've managed to retrieve some files from snapshots, though only some...

I've also managed to recover some missing chunks from an older (recently replaced) HDD by connecting it to a 3.12 chunkserver (a 3.13 chunkserver rapidly deletes valid EC chunks).

onlyjob commented Sep 3, 2018

Here is a quick summary of the devastating upgrade from 3.12.0: tebibytes of data destroyed; 100_000+ missing chunks; 80_000+ files damaged. Almost all data in EC goals is gone, due to either direct or collateral damage.

The pattern of damage is "not enough parts available":

        chunk 0: 00000D9CA2DCB53A_00000001 / (id:14966398432570 ver:1)
                copy 1: 192.168.0.130:9422:wks part 4/4 of ec(2,2)
                not enough parts available
        chunk 0: 00000D9CA2DCB6EA_00000001 / (id:14966398433002 ver:1)
                copy 1: 192.168.0.250:9622:stor part 1/4 of ec(2,2)
                not enough parts available
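
In case it helps anyone assess their own damage: a crude (and slow) way to list affected files. It assumes a mount at /mnt/lizardfs and relies only on lizardfs fileinfo and the message shown above:

        # Walk the tree and record every file whose fileinfo output
        # contains "not enough parts available".
        find /mnt/lizardfs -type f -print0 |
        while IFS= read -r -d '' f; do
                lizardfs fileinfo "$f" | grep -q 'not enough parts available' &&
                        printf '%s\n' "$f"
        done > damaged-files.txt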

Before the upgrade I had fully replicated files with ec(2,2) goals, and most of them are gone despite there being no undergoal files prior to the upgrade.

I also have significant loss (at least 50%) in ec(3,2) chunks, some of which were fully replicated and some of which were in the process of changing goal from std:3 to ec(3,2), so there were enough replicas to avoid data loss.

This is how damaged ec(3,2) files look, according to lizardfs fileinfo:

        chunk 0: 0000000000954D73_00000001 / (id:9784691 ver:1)
                copy 1: 192.168.0.204:9422:pool part 4/5 of ec(3,2)
                copy 2: 192.168.0.250:9622:stor part 1/5 of ec(3,2)
                not enough parts available
        chunk 0: 0000000000954B9D_00000001 / (id:9784221 ver:1)
                copy 1: 192.168.0.2:9422:pool part 4/5 of ec(3,2)
                copy 2: 192.168.0.3:9622:pool part 3/5 of ec(3,2)
                not enough parts available
        chunk 1: 0000000000954BA5_00000001 / (id:9784229 ver:1)
                copy 1: 192.168.0.2:9422:pool part 3/5 of ec(3,2)
                copy 2: 192.168.0.204:9422:pool part 4/5 of ec(3,2)
                not enough parts available
        chunk 2: 0000000000954BAC_00000001 / (id:9784236 ver:1)
                copy 1: 192.168.0.130:9422:wks part 2/5 of ec(3,2)
                not enough parts available
        chunk 3: 0000000000954BB8_00000001 / (id:9784248 ver:1)
                copy 1: 192.168.0.2:9422:pool part 4/5 of ec(3,2)
                copy 2: 192.168.0.3:9622:pool part 3/5 of ec(3,2)
                copy 3: 192.168.0.4:9422:wks part 1/5 of ec(3,2)
                copy 4: 192.168.0.204:9422:pool
                copy 5: 192.168.0.250:9422:stor part 2/5 of ec(3,2)
                copy 6: 192.168.0.250:9522:stor part 5/5 of ec(3,2)
                copy 7: 192.168.0.250:9622:stor
        chunk 4: 0000000000954BBC_00000001 / (id:9784252 ver:1)
                copy 1: 192.168.0.2:9422:pool part 3/5 of ec(3,2)
                copy 2: 192.168.0.3:9622:pool
                copy 3: 192.168.0.4:9422:wks part 2/5 of ec(3,2)
                copy 4: 192.168.0.204:9422:pool part 4/5 of ec(3,2)
                copy 5: 192.168.0.250:9422:stor part 5/5 of ec(3,2)
                copy 6: 192.168.0.250:9522:stor part 1/5 of ec(3,2)
                copy 7: 192.168.0.250:9622:stor
        chunk 5: 0000000000954BC4_00000002 / (id:9784260 ver:2)
                copy 1: 192.168.0.2:9422:pool part 1/5 of ec(3,2)
                copy 2: 192.168.0.3:9622:pool
                copy 3: 192.168.0.130:9422:wks part 2/5 of ec(3,2)
                copy 4: 192.168.0.204:9422:pool part 4/5 of ec(3,2)
                copy 5: 192.168.0.250:9422:stor
                copy 6: 192.168.0.250:9522:stor part 3/5 of ec(3,2)
                copy 7: 192.168.0.250:9622:stor part 5/5 of ec(3,2)

Snapshots were useless for recovering data unless they had different goals. In the aftermath I'll probably create an std:1 goal to use exclusively for snapshots and pin it to a slow-ish chunkserver.
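
A rough sketch of what that might look like; the label name "slow" and goal id 9 are made up, and the syntax is from my reading of the mfsgoals.cfg / mfschunkserver.cfg documentation, so double-check it before relying on it:

        # mfschunkserver.cfg on the slow-ish chunkserver (hypothetical label)
        LABEL = slow

        # mfsgoals.cfg on the master: a single-copy goal pinned to that label
        9 snap_std1 : slow

        # then assign it to the snapshot tree, e.g.:
        #   lizardfs setgoal -r snap_std1 /mnt/lizardfs/snapshots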

3.13.0~rc1 is very unsafe for EC chunks. It deletes valid copies, causing massive replication of the remaining data. Beware...

eleaner commented Sep 14, 2018

Hi guys, now you've scared me shitless.
I just started my adventure with LizardFS, and obviously with 3.13.0~rc1.
200k chunks and I don't see any major problems yet, except maybe #765.

What is the safest way forward? The whole point of using LizardFS was to use EC instead of btrfs/zfs parity.
Is there a way to downgrade LizardFS to a working version?

onlyjob commented Sep 14, 2018

If you started on v3.13.0~rc1 then you might be safe. Between 3.12 and 3.13 they made a very unsafe change that converts EC chunks created by earlier LizardFS versions: e76c386. Unless there are other issues affecting EC chunks, this particular issue is about the upgrade to 3.13.0~rc1.

eleaner commented Sep 14, 2018

@onlyjob
Yes, I started on v3.13.0~rc1, so I hope I am safe.
