By copying (I'm comfortable applying patches and/or testing from a branch/commit, but I don't have any familiarity with the code block it's stuck in, so I'm hesitant to attempt to make any changes myself 😄)
Do you still have an intact copy of the metadata file and the changelogs that caused this situation? This would be the best (and frankly the only) way to find out what happened.
Hi! 👋
I've got a fairly small cluster (I wish I could give better stats on number of chunks, etc, but it's currently down; more details shortly). There was a power outage today which took out my master -- I've had this a few times before, and usually `mfsmaster restore` brings me right back without much fanfare (either on the master's data or on a metalogger's data).

This time however, my `mfsmaster restore` process has been running for five hours, which seemed kind of suspicious, so I copied the metalogger directory on a different machine and tried it there, this time with `-xx` so I could get some indication of what's happening, and the logs show a bunch of what I can only assume are replaying changelog ops, and it stops at a line like `changelog_ml.23.mfs: change1619308908|EMPTYTRASH(457):1,0,45513` where it just hangs.

I attached `gdb` (after doing `strace` and seeing nothing), and found that this is the code it's stuck in: `moosefs/mfsmaster/datacachemgr.c`, lines 180 to 248 in b42f16e.

The weird thing is that `p` is zero, `dcm_inodehash[ih]` is zero, and every value of `dcm_tab[p]` is also zero (`{inode = 0, cacheok = 0, sessionid = 0, iprev = 0, inext = 0, lruprev = 0, lrunext = 0}`), so it's looping infinitely on `p = dcm_tab[p].inext` followed by checking whether `p` is still smaller than `DCM_TAB_LENG`, over and over and over.

Does this sound familiar to anyone? Any clues/ideas for how to get my instance back online or at least recover something?
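To make the hang described above easier to picture, here is a minimal sketch of that kind of chain walk. It is not the code from `datacachemgr.c`: the `entry` struct, `TAB_LENG`, `walk_chain`, and the `max_steps` guard are all hypothetical names invented for illustration, and the idea that an index equal to the table length marks the end of a chain is an assumption, not something verified against MooseFS. The only thing it demonstrates is that when index 0 is a valid slot and a zeroed slot's "next" field is also 0, a loop of the form "follow `inext` while `p` is below the table length" never leaves slot 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical table size; stands in for DCM_TAB_LENG in the post. */
#define TAB_LENG 8u

/* Hypothetical, cut-down entry: just an inode and a "next in chain" index. */
typedef struct {
    uint32_t inode;
    uint32_t inext;
} entry;

/*
 * Walk the chain that starts at slot `start`, looking for `inode`.
 * Returns 1 if found, 0 if the chain ended (index reached TAB_LENG),
 * and -1 if `max_steps` hops were made without either happening.
 * The max_steps guard exists only so this demo terminates; a loop
 * without it would spin forever in the -1 case.
 */
int walk_chain(const entry *tab, uint32_t start, uint32_t inode, long max_steps)
{
    long steps = 0;
    uint32_t p = start;

    assert(max_steps > 0);
    while (p < TAB_LENG) {
        if (tab[p].inode == inode) {
            return 1;
        }
        if (steps >= max_steps) {
            return -1;              /* would never terminate on its own */
        }
        p = tab[p].inext;           /* zeroed slot: p stays 0 */
        steps++;
    }
    return 0;
}

/* All-zero table, as seen in gdb: looking up any non-zero inode cycles on slot 0. */
int zeroed_table_lookup(void)
{
    entry tab[TAB_LENG] = {0};
    return walk_chain(tab, 0, 42, 1000);
}

/* Same walk on a chain that is explicitly terminated with TAB_LENG. */
int terminated_table_lookup(void)
{
    entry tab[TAB_LENG] = {0};
    tab[0].inode = 7; tab[0].inext = 3;
    tab[3].inode = 9; tab[3].inext = TAB_LENG;
    return walk_chain(tab, 0, 42, 1000);
}
```

In this sketch `zeroed_table_lookup()` trips the guard, while `terminated_table_lookup()` falls off the end of the chain and reports "not found". Whether the real table is supposed to be initialised with such a terminator before changelog replay reaches this point is exactly the open question.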
What notes I can provide, while it's offline:

- `mfsmaster restore` on version 3.0.115
- `CHANGELOG_SAVE_MODE` set to `2`, hoping that doing so would make doing a restore after unexpected power loss less necessary (boy does that feel silly now 🤕)

Non-verbose output of `mfsmaster restore` before it hangs:

Edit: just to confirm that it doesn't just "need a little more time", my original restore is still going strong 14 hours later, and my two other attempts on other machines are at 9 hours 😬