
[bug:1593078] SAS library corruption on GlusterFS #902

Closed
gluster-ant opened this issue Mar 12, 2020 · 6 comments
Labels
Migrated, Type:Bug, wontfix (Managed by stale[bot])

Comments

@gluster-ant
Collaborator

URL: https://bugzilla.redhat.com/1593078
Creator: rgowdapp at redhat
Time: 20180620T03:02:40

--- Additional comment from Csaba Henk on 2018-05-31 10:10:03 EDT ---

This is what we find in the fusedump:

The file dm_errori.sas7bdat is managed via two names, this one ("the base file") and dm_errori.sas7bdat.lck ("the lck file"). At the start, the base file exists and some interactions occur with it. Then the lck file is looked up, is not found, and gets created:

{"Unique":479868,"op":"LOOKUP","Nodeid":139842860710128,"Pid":14012,"matches":["dm_errori.sas7bdat.lck"]}
{"Unique":479869,"op":"LOOKUP","Nodeid":139842860710128,"Pid":14012,"matches":["dm_errori.sas7bdat.lck"]}
{"Unique":479870,"op":"CREATE","Nodeid":139842860710128,"Pid":14012,"matches":["dm_errori.sas7bdat.lck"],"nodemap":{"dm_errori.sas7bdat.lck":139842869444064},"size":[0]}

(These are just summaries of the respective fuse messages. Occasionally, where it is needed to make a point, I will provide the complete message. Here we know that the LOOKUPs fail from the absence of the "nodemap" field, which indicates the association of a file name with a nodeid. The fact that the lookup is doubled is not significant.)
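
(For reference, a minimal sketch of how such summaries can be scanned mechanically. It assumes the summaries are stored one JSON object per line in a file named fusedump-summary.jsonl, which is a hypothetical name; a LOOKUP without a "nodemap" field is treated as failed.)

import json

# Hypothetical input file: one JSON summary object per line, in the format shown above.
SUMMARY_FILE = "fusedump-summary.jsonl"

with open(SUMMARY_FILE) as f:
    for line in f:
        msg = json.loads(line)
        if msg.get("op") != "LOOKUP":
            continue
        name = msg["matches"][0]
        if "nodemap" in msg:
            print(f"Unique {msg['Unique']}: LOOKUP({name}) -> nodeid {msg['nodemap'][name]}")
        else:
            print(f"Unique {msg['Unique']}: LOOKUP({name}) failed (no nodemap)")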

Data is written to the lck file and finally it replaces the previous base file via RENAME:

{"Unique":480062,"op":"RENAME","Nodeid":139842860710128,"Pid":14012,"matches":["dm_errori.sas7bdat.lck","dm_errori.sas7bdat"]}

Then, for a long while, the base file is used in various ways, until at some point the above scenario seems to be replayed:

{"Unique":822175,"op":"LOOKUP","Nodeid":139842860710128,"Pid":53795,"matches":["dm_errori.sas7bdat.lck",139843072187888],"nodemap":{"dm_errori.sas7bdat.lck":139843072187888},"size":[1481113600]}
{"Unique":822176,"op":"LOOKUP","Nodeid":139842860710128,"Pid":53795,"matches":["dm_errori.sas7bdat.lck",139843072187888],"nodemap":{"dm_errori.sas7bdat.lck":139843072187888},"size":[1481113600]}
{"Unique":822177,"op":"OPEN","Nodeid":139843072187888,"Pid":53795,"matches":[139843072187888]}
{"Unique":822178,"op":"SETATTR","Nodeid":139843072187888,"Pid":53795,"matches":[139843072187888]}

(While this data lacks temporal information, we say "long while" based on the Unique values, which index the fuse messages. This second window of activity comes about 350,000 messages later.)
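
(Concretely, from the Unique values quoted above:)

# Gap between the two windows of activity, counted in fuse messages:
print(822175 - 479870)   # 342305 messages between CREATE of the lck file and its second LOOKUP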

This time the lck file is found, but the process wants to start from scratch: the file is immediately truncated to 0 via the SETATTR request (Unique 822178), which can be seen by looking at the complete request (inferring from the values of the Valid and Size fields):

{"Truncated":false,"Msg":[null,"SETATTR",{"Len":128,"Opcode":4,"Unique":822178,"Nodeid":139843072187888,"Uid":4447,"Gid":9003,"Pid":53795,"Padding":0},[{"Valid":584,"Padding":0,"Fh":139842872218944,"Size":0,"LockOwner":8496685147530397345,"Atime":0,"Mtime":0,"Unused2":0,"Atimensec":0,"Mtimensec":0,"Unused3":0,"Mode":0,"Unused4":0,"Uid":0,"Gid":0,"Unused5":0}]]}

However, there is something irregular if we go back a bit further.

{"Unique":822172,"op":"LOOKUP","Nodeid":139842860710128,"Pid":53795,"matches":["dm_errori.sas7bdat",139843072187888],"nodemap":{"dm_errori.sas7bdat":139843072187888},"size":[1481113600]}
{"Unique":822173,"op":"OPEN","Nodeid":139843072187888,"Pid":53795,"matches":[139843072187888]}
{"Unique":822174,"op":"SETLK","Nodeid":139843072187888,"Pid":53795,"matches":[139843072187888]}
{"Unique":822175,"op":"LOOKUP","Nodeid":139842860710128,"Pid":53795,"matches":["dm_errori.sas7bdat.lck",139843072187888],"nodemap":{"dm_errori.sas7bdat.lck":139843072187888},"size":[1481113600]}
{"Unique":822176,"op":"LOOKUP","Nodeid":139842860710128,"Pid":53795,"matches":["dm_errori.sas7bdat.lck",139843072187888],"nodemap":{"dm_errori.sas7bdat.lck":139843072187888},"size":[1481113600]}
{"Unique":822177,"op":"OPEN","Nodeid":139843072187888,"Pid":53795,"matches":[139843072187888]}
{"Unique":822178,"op":"SETATTR","Nodeid":139843072187888,"Pid":53795,"matches":[139843072187888]}

Just before the lck file was accessed, the base file was also looked up -- and it resolved to the same nodeid as the lck file! That is, they are the same file under different names (hardlinks of each other). This is almost certainly against the intent of the application, as the fusedump does not contain any LINK messages (through which hard linking would be performed).

*** Please confirm that the applications accessing the volume do not do any hard linking -- since we are looking at the fuse dump of a single client, it is theoretically possible that another client created the hardlink. ***

So, most likely the application assumes that the base file and the lck file are independent entities, and it is an oddity of the glusterfs backend that these two files are the same. The application presumably wants to reset only the lck file -- but since that is the same as the base file, it wipes out all of the content of the base file as well.

Hence the corruption.
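
(A small demonstration, on any local filesystem that supports hard links, of why this destroys the data: when two names are hardlinks of the same file, truncating one name empties the other as well. The paths below are placeholders.)

import os
import tempfile

d = tempfile.mkdtemp()
base = os.path.join(d, "dm_errori.sas7bdat")
lck = base + ".lck"

with open(base, "w") as f:
    f.write("library data")
os.link(base, lck)                # the accidental hardlink

with open(lck, "w"):              # opening for write truncates the lck name to 0 ...
    pass
print(os.path.getsize(base))      # ... and the base file is now 0 bytes as well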

How can the base file and the lck file get accidentally hardlinked to each other on the gluster backend? Most likely it's an issue with dht.

One possibility is that the RENAME above was not properly executed. In that case, the situation of the base and the lck file being the same would persist over a longer period, from the RENAME (Unique 480062) up to the eventual removal of the lck file (Unique 822200).

The other possibility is that, just at the time when the base and the lck files get resolved to the same id, some other client is performing a RENAME on them that is not sufficiently synchronized, and the LOOKUPs on this side hit an inconsistent intermediate state; in this case the hardlinked situation is ephemeral.

We'd be happy to get stat/getfattr information of both files (base and lck -- or a note on one's absence), on all bricks and all clients.

However, it would be much more useful than after-the-fact file metadata to monitor what is going on during the workload. Is there a period when the presence of the lck file is observable? Is it a hard link of the base file?
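
(For example, a minimal monitoring sketch that could be run on a client mount during the workload; the paths are hypothetical. It reports whether the lck file exists and whether it shares an inode with the base file.)

import os
import time

# Hypothetical paths on a client mount; adjust to the actual SAS library location.
BASE = "/mnt/glustervol/saslib/dm_errori.sas7bdat"
LCK = BASE + ".lck"

def safe_stat(path):
    try:
        return os.stat(path)
    except FileNotFoundError:
        return None

while True:
    st_base, st_lck = safe_stat(BASE), safe_stat(LCK)
    if st_lck is None:
        print("lck file absent")
    elif st_base and (st_base.st_dev, st_base.st_ino) == (st_lck.st_dev, st_lck.st_ino):
        # Same device and inode: the two names currently resolve to the same file.
        print(f"HARDLINKED: base and lck share inode {st_lck.st_ino} (nlink={st_lck.st_nlink})")
    else:
        print(f"lck file present on its own inode {st_lck.st_ino}")
    time.sleep(0.1)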

--- Additional comment from Raghavendra G on 2018-06-01 08:56:20 EDT ---

(In reply to Csaba Henk from comment #23)

Other possibility is that just at the time when the base and the lck files
get resolved to the same id, some other client is performing a RENAME on
them which is not sufficiently synchronized and the LOOKUPs on this side can
hit into an inconsistent intermediate state; in this case the hardlinked
situation is ephemeral.

There is one scenario in dht_rename where, for a brief period, lookups on src and dst can both be successful and identify them as hardlinks of each other. For this scenario to happen, the following conditions must hold for a rename (src, dst):

(dst-cached != src-cached) and (dst-hashed == src-hashed) and (src-cached != dst-hashed)

In this scenario, the control flow of dht-rename is the following:

  1. link (src, dst) on src-cached.
  2. rename (src, dst) on dst-hashed/src-hashed (Note that dst-hashed == src-hashed).
  3. rest of the rename which removes hardlink src on src-cached.

Note that between steps 2 and 3, until the hardlink is removed:

  • lookup (src) would fail on src-hashed, resulting in lookup-everywhere. Since the hardlink src exists on src-cached, the lookup succeeds, mapping it to the inode with src-gfid.
  • lookup (dst) would identify a linkto file on dst-hashed. The linkto file points to src-cached, where we then find the hardlink dst. Lookup (dst) succeeds, mapping dst to the inode with src-gfid.

Both src and dst would be identified as hardlinks to the file with src-gfid in the client's inode table. The same result is conveyed back to the application.
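
(To make the window concrete, an illustrative Python model of the three steps -- greatly simplified, not gluster code; the subvolumes are modelled as plain dicts mapping name to gfid, with linkto entries redirecting lookups. Between steps 2 and 3, both names resolve to src-gfid.)

# Illustrative model only, not gluster code: each subvolume is a dict mapping
# file name -> gfid; a "linkto" entry redirects the lookup to another subvol.
src_gfid = "src-gfid"
src_cached = {"src": src_gfid}   # subvol holding the actual file
dst_hashed = {}                  # == src_hashed under the conditions above

def lookup(name):
    # Simplified dht lookup: try the hashed subvol, else fall back to
    # lookup-everywhere (here: just search src_cached).
    entry = dst_hashed.get(name)
    if isinstance(entry, dict) and entry.get("linkto"):
        return entry["points_to"].get(name)   # follow the linkto file
    if entry is not None:
        return entry
    return src_cached.get(name)

# Step 1: link(src, dst) on src-cached.
src_cached["dst"] = src_gfid
# Step 2: rename(src, dst) on dst-hashed/src-hashed, leaving a linkto for dst.
dst_hashed["dst"] = {"linkto": True, "points_to": src_cached}

# The race window, before step 3 removes the src hardlink on src-cached:
print(lookup("src"))   # src-gfid, found on src-cached via lookup-everywhere
print(lookup("dst"))   # src-gfid, via the linkto file on dst-hashed
# Both names map to the same gfid, so the client sees them as hardlinks.

# Step 3: remove the src hardlink on src-cached; the window closes.
del src_cached["src"]
print(lookup("src"))   # None: src no longer resolves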

If we have hit this scenario, we would see lookup (src) failing on src-hashed and eventually being found on src-cached through lookup-everywhere. We need to set diagnostics.client-log-level to DEBUG to see these logs.

The other workaround (if the hypothesis is true) is to set cluster.lookup-optimize to true. When lookup-optimize is turned on, dht-lookup does not resort to lookup-everywhere when src is not found on src-hashed; instead it just conveys a failure to the application. Since the lookup will not reach src-cached, it will not find the hardlink.

@gluster-ant
Collaborator Author

Time: 20180620T03:03:22
rgowdapp at redhat commented:
(In reply to Raghavendra G from comment #24)

(In reply to Csaba Henk from comment #23)

The other work-around (if the hypothesis is true) is to turn on
cluster.lookup-optimize to true. When lookup-optimize is turned on
dht-lookup doesn't resort to lookup-everywhere on not finding src on
src-hashed. Instead it just conveys a failure to application. Since lookup
won't reach src-cached, it won't find the hard-link.

Since patch [1], we have made hashed-subvol the subvol on which locks are taken during rename. This means that the presence of entrylks on hashed-subvol on the basename of src/dst of a rename indicates a rename in progress. On observing such locks, dht_lookup can:

  • acquire an entrylk on the parent inode with the basename
  • do the lookup
  • unlock the entrylk
  • unwind the lookup

This will help lookup synchronize with rename and hence preserve the atomicity of rename. Note that this algorithm works even when the first lookup fails with ENOENT. Also, the cost of synchronization is isolated to lookups happening during a rename; lookups happening outside the rename window won't suffer the cost of synchronization.
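
(A rough, self-contained sketch of the idea -- an illustrative model, not the actual patch. The entrylk is modelled as a per-(parent, basename) lock that rename holds for its duration and that lookup takes only when it observes a rename in progress.)

import threading
from collections import defaultdict

# Illustrative model of the proposed synchronization, not gluster code.
entrylks = defaultdict(threading.Lock)          # one entrylk per (parent, basename)
namespace = {("parent", "src"): "src-gfid"}     # toy directory: (parent, name) -> gfid

def plain_lookup(parent, name):
    return namespace.get((parent, name))

def dht_rename(parent, src, dst):
    # Rename holds entrylks on both basenames for its whole duration, so the
    # intermediate hardlinked state is never observable by a synchronized lookup.
    with entrylks[(parent, src)], entrylks[(parent, dst)]:
        namespace[(parent, dst)] = namespace.pop((parent, src))

def dht_lookup(parent, name):
    found = plain_lookup(parent, name)
    if entrylks[(parent, name)].locked():
        # An entrylk on the basename means a rename is in progress: acquire the
        # lock, redo the lookup, then unwind the now-consistent result.
        with entrylks[(parent, name)]:
            found = plain_lookup(parent, name)
    return found

t = threading.Thread(target=dht_rename, args=("parent", "src", "dst"))
t.start()
print(dht_lookup("parent", "dst"))   # either None or "src-gfid", never a bogus hardlink
t.join()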

This will be a code change and I'll be submitting a patch to do this.
[1] https://review.gluster.org/19547/

@gluster-ant
Collaborator Author

Time: 20180908T14:37:34
bugzilla-bot at gluster.org commented:
REVIEW: https://review.gluster.org/21123 (performance/write-behind: remove the request from wip queue in wb_fulfill_request) posted (#1) for review on master by Raghavendra G

@gluster-ant
Collaborator Author

Time: 20180908T16:16:48
bugzilla-bot at gluster.org commented:
REVISION POSTED: https://review.gluster.org/21123 (performance/write-behind: remove the request from wip queue in wb_fulfill_request) posted (#2) for review on master by Raghavendra G

@gluster-ant
Collaborator Author

Time: 20190511T11:27:38
atumball at redhat commented:
The above patch is merged, but it seems to refer to a different issue. Hence, moving this back to NEW. Please update with the proper status if it is being worked on.

@stale

stale bot commented Oct 8, 2020

Thank you for your contributions.
Noticed that this issue has not had any activity in the last ~6 months! We are marking this issue as stale because it has not had recent activity.
It will be closed in 2 weeks if no one responds with a comment here.

@stale stale bot added the wontfix Managed by stale[bot] label Oct 8, 2020
@stale

stale bot commented Oct 24, 2020

Closing this issue as there has been no update since my last update on the issue. If this issue is still valid, feel free to reopen it.

@stale stale bot closed this as completed Oct 24, 2020