[bug:1593078] SAS library corruption on GlusterFS #902
Time: 20180620T03:03:22
Since patch [1], we've made the hashed subvol the subvol on which locks are taken during rename. This means that the presence of entrylks on the hashed subvol on the basename of the src/dst of a rename indicates a rename in progress. On observing such locks, dht_lookup can:
This will help lookup synchronize with rename and hence preserve the atomicity of rename. Note that this algorithm works even when the first lookup fails with ENOENT. Also, the cost of synchronization is confined to lookups that happen during a rename; lookups outside the rename window won't pay it. This will be a code change, and I'll be submitting a patch to do this.
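The list of actions dht_lookup can take was truncated in the report, but per the surrounding text the intent is for lookup to serialize behind an in-flight rename via the same entrylks. A minimal runnable model of that idea (hypothetical helper names, not GlusterFS source) might look like:

```python
# Toy model of entrylk-based lookup/rename synchronization on the hashed
# subvol. All names here are illustrative, not actual GlusterFS APIs.
import threading

entrylks = {}                  # (parent, basename) -> threading.Lock
table_lock = threading.Lock()  # protects the entrylks table itself

def entrylk(parent, basename):
    """Return the entrylk for this basename on the hashed subvol."""
    with table_lock:
        return entrylks.setdefault((parent, basename), threading.Lock())

def rename(src, dst, do_rename):
    # Rename holds entrylks on the src and dst basenames for its duration.
    with entrylk(*src), entrylk(*dst):
        do_rename()

def lookup(parent, basename, do_lookup):
    # A lookup that takes the same entrylk serializes after any in-flight
    # rename on the same basename, so it never sees the intermediate state.
    with entrylk(parent, basename):
        return do_lookup()
```

The point of the sketch is only the locking order: lookup acquires the very lock rename holds, so it observes either the pre-rename or post-rename state, never the window in between.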
Time: 20180908T14:37:34
Time: 20180908T16:16:48
Time: 20190511T11:27:38
Thank you for your contributions.
Closing this issue, as there has been no update since my last comment on it. If this issue is still valid, feel free to reopen it.
URL: https://bugzilla.redhat.com/1593078
Creator: rgowdapp at redhat
Time: 20180620T03:02:40
--- Additional comment from Csaba Henk on 2018-05-31 10:10:03 EDT ---
This is what we find in the fusedump:
The file dm_errori.sas7bdat is managed via two names: this one ("the base file") and dm_errori.sas7bdat.lck ("the lck file"). At the start, the base file exists and some interactions occur with it. Then the lck file is looked up, is not found, and gets created:
{"Unique":479868,"op":"LOOKUP","Nodeid":139842860710128,"Pid":14012,"matches":["dm_errori.sas7bdat.lck"]}
{"Unique":479869,"op":"LOOKUP","Nodeid":139842860710128,"Pid":14012,"matches":["dm_errori.sas7bdat.lck"]}
{"Unique":479870,"op":"CREATE","Nodeid":139842860710128,"Pid":14012,"matches":["dm_errori.sas7bdat.lck"],"nodemap":{"dm_errori.sas7bdat.lck":139842869444064},"size":[0]}
(These are just summaries of the respective fuse messages; occasionally, when needed to make a point, I'll provide the complete message. Here we know that the LOOKUPs fail from the absence of the "nodemap" field, which indicates the association of a file name with a nodeid. That the lookup is doubled has no significance.)
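As an illustration of that inference, a small Python helper (assuming this summary format, which is a rendering of the dump rather than an official schema) can pick the failed LOOKUPs out of a stream of such lines:

```python
# Flag failed LOOKUPs in fusedump summary lines: the absence of the
# "nodemap" field means the name did not resolve to a nodeid.
import json

def failed_lookups(lines):
    out = []
    for line in lines:
        msg = json.loads(line)
        if msg.get("op") == "LOOKUP" and "nodemap" not in msg:
            # Record (Unique, looked-up name) for each failed lookup.
            out.append((msg["Unique"], msg["matches"][0]))
    return out
```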
Data is written to the lck file and finally it replaces the previous base file via RENAME:
{"Unique":480062,"op":"RENAME","Nodeid":139842860710128,"Pid":14012,"matches":["dm_errori.sas7bdat.lck","dm_errori.sas7bdat"]}
Then for a long while the base file is used in various ways, until at some point the above scenario seems to be replayed:
{"Unique":822175,"op":"LOOKUP","Nodeid":139842860710128,"Pid":53795,"matches":["dm_errori.sas7bdat.lck",139843072187888],"nodemap":{"dm_errori.sas7bdat.lck":139843072187888},"size":[1481113600]}
{"Unique":822176,"op":"LOOKUP","Nodeid":139842860710128,"Pid":53795,"matches":["dm_errori.sas7bdat.lck",139843072187888],"nodemap":{"dm_errori.sas7bdat.lck":139843072187888},"size":[1481113600]}
{"Unique":822177,"op":"OPEN","Nodeid":139843072187888,"Pid":53795,"matches":[139843072187888]}
{"Unique":822178,"op":"SETATTR","Nodeid":139843072187888,"Pid":53795,"matches":[139843072187888]}
(While this data lacks temporal information, we say "a long while" based on the Unique values, which index fuse messages; this second window of activity comes about 350,000 messages later.)
This time the lck file is found, but the process wants to start from scratch, as the file is immediately truncated to 0 via the SETATTR request (Unique 822178). This can be seen by looking at the complete request, inferred from the values of the Valid and Size fields:
{"Truncated":false,"Msg":[null,"SETATTR",{"Len":128,"Opcode":4,"Unique":822178,"Nodeid":139843072187888,"Uid":4447,"Gid":9003,"Pid":53795,"Padding":0},[{"Valid":584,"Padding":0,"Fh":139842872218944,"Size":0,"LockOwner":8496685147530397345,"Atime":0,"Mtime":0,"Unused2":0,"Atimensec":0,"Mtimensec":0,"Unused3":0,"Mode":0,"Unused4":0,"Uid":0,"Gid":0,"Unused5":0}]]}
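The Valid field is the FATTR_* bitmask from the Linux FUSE ABI (include/uapi/linux/fuse.h); decoding it confirms the reading: 584 = FATTR_SIZE | FATTR_FH | FATTR_LOCKOWNER, which together with Size=0 is a truncate-to-zero on an open file handle:

```python
# FATTR_* setattr flags as defined in the Linux FUSE ABI
# (include/uapi/linux/fuse.h).
FATTR = {
    1 << 0: "MODE",      1 << 1: "UID",       1 << 2: "GID",
    1 << 3: "SIZE",      1 << 4: "ATIME",     1 << 5: "MTIME",
    1 << 6: "FH",        1 << 7: "ATIME_NOW", 1 << 8: "MTIME_NOW",
    1 << 9: "LOCKOWNER",
}

def decode_valid(valid):
    """Return the FATTR_* flag names set in a SETATTR Valid bitmask."""
    return [name for bit, name in sorted(FATTR.items()) if valid & bit]
```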
However, there is something irregular if we go back a bit further.
{"Unique":822172,"op":"LOOKUP","Nodeid":139842860710128,"Pid":53795,"matches":["dm_errori.sas7bdat",139843072187888],"nodemap":{"dm_errori.sas7bdat":139843072187888},"size":[1481113600]}
{"Unique":822173,"op":"OPEN","Nodeid":139843072187888,"Pid":53795,"matches":[139843072187888]}
{"Unique":822174,"op":"SETLK","Nodeid":139843072187888,"Pid":53795,"matches":[139843072187888]}
{"Unique":822175,"op":"LOOKUP","Nodeid":139842860710128,"Pid":53795,"matches":["dm_errori.sas7bdat.lck",139843072187888],"nodemap":{"dm_errori.sas7bdat.lck":139843072187888},"size":[1481113600]}
{"Unique":822176,"op":"LOOKUP","Nodeid":139842860710128,"Pid":53795,"matches":["dm_errori.sas7bdat.lck",139843072187888],"nodemap":{"dm_errori.sas7bdat.lck":139843072187888},"size":[1481113600]}
{"Unique":822177,"op":"OPEN","Nodeid":139843072187888,"Pid":53795,"matches":[139843072187888]}
{"Unique":822178,"op":"SETATTR","Nodeid":139843072187888,"Pid":53795,"matches":[139843072187888]}
Just before the lck file was accessed, the base file was also looked up -- and it resolved to the same nodeid as the lck file! That is, they are the same file under different names (hardlinks of each other). This is almost certainly against the intent of the application, as the fusedump does not contain any LINK messages (the request via which hardlinking is done).
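This kind of aliasing can also be caught mechanically. The helper below (same assumed summary format as above) walks successful LOOKUPs and reports any parent directory under which two different basenames resolve to one nodeid:

```python
# Detect two names resolving to the same nodeid under one parent, i.e.
# the unexpected-hardlink situation described above.
import json
from collections import defaultdict

def find_aliases(lines):
    # (parent_nodeid, child_nodeid) -> set of basenames seen for it
    names_by_node = defaultdict(set)
    for line in lines:
        msg = json.loads(line)
        if msg.get("op") != "LOOKUP" or "nodemap" not in msg:
            continue  # only successful lookups carry a name->nodeid mapping
        for name, node in msg["nodemap"].items():
            names_by_node[(msg["Nodeid"], node)].add(name)
    return {k: v for k, v in names_by_node.items() if len(v) > 1}
```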
*** Please confirm if the applications accessing the volume do not do hard linking -- as here we talk about the fuse dump of a single client, theoretically it's possible that another client was creating the hardlink. ***
So, most likely the application assumes that the base file and the lck file are independent entities, and it's an oddity of the glusterfs backend that these two files are the same. The application might want to reset only the lck file -- but that being the same as the base file, it is doing away with all of the content of the base file as well.
Hence the corruption.
How can the base file and the lck file get accidentally hardlinked to each other on the gluster backend? Most likely it's an issue with dht.
One possibility is that the RENAME above was not properly executed. In that case, the situation of the base and lck files being the same should linger for a longer period, from the RENAME (Unique 480062) until the eventual removal of the lck file (Unique 822200).
The other possibility is that just at the time when the base and lck files get resolved to the same nodeid, some other client is performing a RENAME on them which is not sufficiently synchronized, and the LOOKUPs on this side can hit an inconsistent intermediate state; in this case the hardlinked situation is ephemeral.
We'd be happy to get stat/getfattr information for both files (base and lck -- or a note on either's absence) on all bricks and all clients.
However, it would be much more useful to monitor what's going on during the workload than to collect file metadata after the fact. Is there a period during which the presence of the lck file is observable? Is it a hard link of the base file?
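To collect that, something along these lines could be run on every brick (against the brick's local path) and on each client mount; the paths below are placeholders for this deployment. `stat` shows the inode number and link count, and `getfattr` dumps the trusted.* xattrs (notably trusted.gfid on the bricks), which together tell whether the two names share an inode/gfid:

```shell
# Placeholder paths -- substitute the real brick or mount path.
BASE=/bricks/brick1/vol/sasdata/dm_errori.sas7bdat
LCK=$BASE.lck

for f in "$BASE" "$LCK"; do
    if [ -e "$f" ]; then
        stat "$f"                      # inode number and st_nlink
        getfattr -d -m . -e hex "$f"   # trusted.gfid and other xattrs
    else
        echo "$f: absent"
    fi
done
```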
--- Additional comment from Raghavendra G on 2018-06-01 08:56:20 EDT ---
(In reply to Csaba Henk from comment #23)
There is one scenario in dht_rename where, for a brief period, lookups on src and dst can both succeed and identify them as hardlinks to each other. For the scenario to happen, the following conditions must hold for a rename (src, dst):
(dst-cached != src-cached) and (dst-hashed == src-hashed) and (src-cached != dst-hashed)
In this scenario, the following is the control flow of dht-rename:
Note that between steps 2 and 3, until the hardlink is removed, both src and dst would be identified as hardlinks to the file with src-gfid in the client's inode table, and the same result is conveyed back to the application.
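The triple of conditions above can be written out as a single predicate; the subvol arguments here are just labels (in DHT, "hashed" is the subvol a name hashes to and "cached" the subvol actually holding the file):

```python
# The rename precondition from the comment above, as a predicate.
def risky_rename(src_hashed, src_cached, dst_hashed, dst_cached):
    """True when a rename(src, dst) can transiently expose src and dst
    as hardlinks of each other, per the scenario described above."""
    return (dst_cached != src_cached
            and dst_hashed == src_hashed
            and src_cached != dst_hashed)
```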
If we've hit this scenario, we would see lookup (src) fail on src-hashed and the file eventually be found on src-cached through lookup-everywhere. We need to set diagnostics.client-log-level to DEBUG to see these logs.
The other workaround (if the hypothesis is true) is to set cluster.lookup-optimize to true. When lookup-optimize is on, dht-lookup doesn't resort to lookup-everywhere on not finding src on src-hashed; instead it just conveys a failure to the application. Since the lookup won't reach src-cached, it won't find the hardlink.
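For reference, both knobs are set with the standard gluster CLI; the volume name below is a placeholder:

```shell
# "glustervol" is a placeholder volume name.
gluster volume set glustervol diagnostics.client-log-level DEBUG  # expose lookup-everywhere logs
gluster volume set glustervol cluster.lookup-optimize on          # proposed workaround
```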