One brick offline with signal received: 11 #1699
Comments
|
Urgent problem no longer an issue. I don't understand this: my history shows that I did the stop, on this server, yesterday. So what got fixed? Why did the repeat today change things when yesterday it had no effect? There's still something wrong. |
|
As you mentioned, brick06 had crashed; that's why the gluster CLI was showing status N/A. |
|
Some more info |
|
It seems the brick process has crashed again. Can you attach the core with gdb and share the output of "thread apply all bt full"?
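For reference, a minimal sketch of that procedure, assuming the core was captured by abrt as shown later in this thread (the core path here is illustrative):

```
# Open the core against the brick binary, then dump full backtraces of all threads:
gdb /usr/sbin/glusterfsd /var/spool/abrt/ccpp-<timestamp>-<pid>/coredump
(gdb) thread apply all bt full
```
|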
|
@mohit84 thanks for picking this up. I understand that the CLI status for brick06 on server verijolt is showing "N" under "online" when the brick has crashed. At the moment, brick06 on verijolt is showing "Y" despite the truncated brick log showing a potential crash (signal received: 11). The tail of |
|
Is the CLI showing a pid different from the brick pid that crashed? |
|
@mohit84 Please could you give me a bit more help running the debugger etc.? It's been a while since I last did that.
Could you tell me the commands to use? |
|
|
It means the brick is running, and the pid (53399) is the same one shown by the CLI, right? Before attaching the core with gdb, please install the glusterfs debug package.
|
I'm not sure what you mean by "dump" and "all the nodes". This directory is large (~11GB) as this |
|
For now, you can share glusterd.log and the brick logs from all the nodes. |
Agreed, it looks as though the brick process is running, but I don't think this brick is in a working state. Script used to collate healing status: |
On Fedora, I installed it, but where is the |
|
The core is saved at the location configured in /proc/sys/kernel/core_pattern. You need to install the glusterfs-debuginfo package. |
|
|
cat /proc/sys/kernel/core_pattern
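On a Fedora system where abrt handles crashes, this typically prints something like the following (which is why the core ends up under /var/spool/abrt/, as found below):

```
$ cat /proc/sys/kernel/core_pattern
|/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t e %P %I %h
```
|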
|
The full logs exceed the 10MB limit, so I pruned them a bit. srv-brick05_201023.log |
The date on this file does not correspond to yesterday's brick crash. |
|
Below are the latest brick logs for brick06; I am not seeing any issue in them. [2020-10-23 11:44:20.574310] I [MSGID: 100030] [glusterfsd.c:2865:main] 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 7.8 (args: /usr/sbin/glusterfsd -s verijolt --volfile-id gluvol1.verijolt.srv-brick06 -p /var/run/gluster/vols/gluvol1/verijolt-srv-brick06.pid -S /var/run/gluster/e81ffbefdc824bb9.socket --brick-name /srv/brick06 -l /var/log/glusterfs/bricks/srv-brick06.log --xlator-option -posix.glusterd-uuid=04eb8fdd-ebb8-44c9-9064-5578f43e55b8 --process-name brick --brick-port 49160 --xlator-option gluvol1-server.listen-port=49160) For heal specifically, we need to check glustershd.log and the heal logs (glfsheal-gluvol1.log). |
|
@mohit84 thanks for continuing to try and help me out here.
Is there anything you can suggest that I might try for the latter? |
My fault: these log files exceed the 10MB limit for pasting into this ticket. I stripped them by date, not realising the crash dumps are not prefixed by a date. I'll see if I can attach a more complete tail of that log file. |
|
Can you explain why brick06 appears to be online, yet this log shows it has crashed? |
|
You need to run this to install the debug information (assuming you are running the latest version):
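The exact command did not survive in this thread; on Fedora it is presumably one of the standard forms:

```
sudo dnf debuginfo-install glusterfs
# or, equivalently:
sudo dnf install glusterfs-debuginfo
```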
I had to do a load of installs. I think I found the core dump:
```
ls -l /var/spool/abrt/ccpp-2020-10-22-14:23:50.21866-346125 \
2>&1 | awk '{print " " $0}'; date +\ \ %F\ %T%n
total 196960
-rw-r----- 1 root abrt 6 2020-10-22 14:23 abrt_version
-rw-r----- 1 root abrt 17 2020-10-22 14:23 analyzer
-rw-r----- 1 root abrt 6 2020-10-22 14:23 architecture
-rw-r----- 1 root abrt 34 2020-10-22 14:23 cgroup
-rw-r----- 1 root abrt 411 2020-10-22 14:23 cmdline
-rw-r----- 1 root abrt 9 2020-10-22 14:23 component
-rw-r----- 1 root abrt 40928 2020-10-22 14:23 core_backtrace
-rw-r----- 1 root abrt 201433088 2020-10-22 14:23 coredump
-rw-r----- 1 root abrt 2 2020-10-23 01:29 count
-rw-r----- 1 root abrt 2216 2020-10-22 14:23 cpuinfo
-rw-r----- 1 root abrt 32 2020-10-22 14:23 crash_function
-rw-r----- 1 root abrt 6348 2020-10-22 14:23 dso_list
-rw-r----- 1 root abrt 281 2020-10-22 14:23 environ
-rw-r----- 1 root abrt 20 2020-10-22 14:23 executable
-rw-r----- 1 root abrt 82 2020-10-22 14:23 exploitable
-rw-r----- 1 root abrt 8 2020-10-22 14:23 hostname
-rw-r----- 1 root abrt 126 2020-10-22 14:23 journald_cursor
-rw-r----- 1 root abrt 22 2020-10-22 14:23 kernel
-rw-r----- 1 root abrt 10 2020-10-23 01:29 last_occurrence
-rw-r----- 1 root abrt 1323 2020-10-22 14:23 limits
-rw-r----- 1 root abrt 35739 2020-10-22 14:23 maps
-rw-r----- 1 root abrt 3809 2020-10-22 14:23 mountinfo
-rw-r----- 1 root abrt 4003 2020-10-22 14:23 open_fds
-rw-r----- 1 root abrt 691 2020-10-22 14:23 os_info
-rw-r----- 1 root abrt 30 2020-10-22 14:23 os_release
-rw-r----- 1 root abrt 25 2020-10-22 14:23 package
-rw-r----- 1 root abrt 6 2020-10-22 14:23 pid
-rw-r----- 1 root abrt 6 2020-10-22 14:23 pkg_arch
-rw-r----- 1 root abrt 1 2020-10-22 14:23 pkg_epoch
-rw-r----- 1 root abrt 19 2020-10-22 14:23 pkg_fingerprint
-rw-r----- 1 root abrt 14 2020-10-22 14:23 pkg_name
-rw-r----- 1 root abrt 6 2020-10-22 14:23 pkg_release
-rw-r----- 1 root abrt 14 2020-10-22 14:23 pkg_vendor
-rw-r----- 1 root abrt 3 2020-10-22 14:23 pkg_version
-rw-r----- 1 root abrt 1326 2020-10-22 14:23 proc_pid_status
-rw-r----- 1 root abrt 1 2020-10-22 14:23 pwd
-rw-r----- 1 root abrt 28 2020-10-22 14:23 reason
-rw-r----- 1 root abrt 1 2020-10-22 14:23 rootdir
-rw-r----- 1 root abrt 4 2020-10-22 14:23 runlevel
-rw-r----- 1 root abrt 10 2020-10-22 14:23 time
-rw-r----- 1 root abrt 4 2020-10-22 14:23 type
-rw-r----- 1 root abrt 1 2020-10-22 14:23 uid
-rw-r----- 1 root abrt 5 2020-10-22 14:23 username
-rw-r----- 1 root abrt 40 2020-10-22 14:23 uuid
2020-10-23 16:29:37
```
I then ran it, giving the following on-screen output, where I entered |
|
Thanks @mohit84 and @xhernandez. Meanwhile, does anyone have any suggestions for nudging my brick06 back into life?
```
# Disk usage (MB) per top-level entry on the brick, sorted, bracketed by start/end timestamps
export start_date=`date +%F\ %T`; \
du -smc /srv/brick06/* \
2>&1 | sort -n | awk '{printf(" %8d\t%s\n",$1,substr($0,index($0,$2)))}'; echo " ${start_date}" && date +\ \ %F\ %T
0 /srv/brick06/snap
1 /srv/brick06/ftp
1 /srv/brick06/mounted.txt
3587 /srv/brick06/root
11851 /srv/brick06/var
8091137 /srv/brick06/vault
8106574 total
2020-08-23 15:36:31
2020-08-23 16:28:17
on veriicon
0 /srv/brick06/snap
1 /srv/brick06/ftp
1 /srv/brick06/mounted.txt
3587 /srv/brick06/root
11851 /srv/brick06/var
8091131 /srv/brick06/vault
8106568 total
2020-08-23 15:38:00
2020-08-23 16:21:56
```
|
|
Hi, any idea why you created so many xattrs on the backend? |
A 12TB RAID5 box died (either/both controller and power supply), but the 5 HDDs are OK. I am painstakingly restoring the data from the HDDs onto a gluster volume. I am confident that I am getting this right because of good parity across the HDDs and consistent checksums on a file-by-file basis.

The data on this box was an rsnapshot backup, so it contains a lot of hard links. It is conceivable that these scripts, full of chmod, chown and touch for each file in turn, place a burden on gluster. I stated in the first (submission) comment at the top that this was a possible cause. If running such a script does "create many xattrs on the backend", then this is a likely cause.

Why has only one brick crashed? Why was it fine for 5 hours or so?

If this is the cause, then once my gluster volume is back to normal (brick06 on verijolt properly online), I can break up my restore into more manageable chunks. This is a one-off exercise; I will not and do not want to be doing this again!

Given you have a clue as to the cause, how would you suggest I bring brick06 on verijolt back to life? |
|
The fastest way I see to fix this is to identify the file that has so many extended attributes and remove/clear them. To do that, inside gdb, can you execute this: This will return 16 hexadecimal numbers, like this: You need to take the first two values and go to this directory on server verijolt: You should find a file there whose name contains all 16 numbers returned by gdb (with some '-' in the middle). Once you identify the file, you need to execute this command: Depending on what this returns, we'll decide how to proceed.
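The specific gdb expression and follow-up command were lost from this comment; only the shape of the procedure survives. For illustration, if the 16 bytes returned were 6a 2f 9d 4e ..., the first two values select the nested directories under the brick's .glusterfs tree (the GFID below is made up; the brick path comes from the logs above):

```
# Locate the gfid file named after the 16 bytes (with '-' in the middle):
ls -l /srv/brick06/.glusterfs/6a/2f/6a2f9d4e-1b3c-4f5a-8e7d-0c1b2a3d4e5f
# Inspecting its xattrs would be the natural follow-up, e.g.:
getfattr -d -m . -e hex /srv/brick06/.glusterfs/6a/2f/6a2f9d4e-1b3c-4f5a-8e7d-0c1b2a3d4e5f
```
|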
|
To prevent a recurrence of the issue, you can configure the option "storage.max-hardlinks" to a lower value, so that clients won't be able to create a hardlink once the limit has been crossed.
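Presumably via the standard volume-set form (volume name taken from the logs earlier in this thread; the value is only an example):

```
gluster volume set gluvol1 storage.max-hardlinks 50
```
|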
Should we have a default value for this option? Say 42 (i.e. a sane random value). That way we can prevent the bad experience Bockerman ran into, by surfacing an error to the application much earlier. After that, they can decide whether the value needs to be increased, depending on their use case. My suggestion is admittedly restrictive, but we should keep default options that prevent borderline issues like this one and make sure glusterfs provides good performance and stability. Users can alter the options only when they know what their use case is, and that should be allowed, as they will be responsible for that particular use case. |
Won't that make the application unusable on glusterfs? |
|
I don't see the value of this option. If we set it to a lower value, the application creating hardlinks will fail. If the user increases the value (because they are actually creating more hardlinks) we'll have a crash. And even worse, self-heal is unable to heal those files because the size of the xattrs is too big. What we need to do in this case is to disable gfid2path and any other feature that requires per-hardlink data (besides fixing the crash, of course). Even if we fix the crash and make it possible to handle hundreds of xattrs, it will be very bad from a performance point of view. |
|
The current default value is 100. On XFS I tried to create 100 hardlinks and was not able to create more than 47; once the hardlink count reaches 47, setxattr throws the error "No space left on device". {.key = {"max-hardlinks"}, I think we need to restrict the maximum value of max-hardlinks; I don't think that after restricting/configuring max-hardlinks an application would be unable to use glusterfs. |
The segmentation fault happens because we use a stack-allocated buffer to store the contents of the xattrs. This is done in two steps: first we get the needed size, and then we allocate a buffer of that size to store the data. The problem happens because of the combination of 2 things:
This causes a segmentation fault when trying to allocate more space than is available on the stack.
In your particular case I would recommend disabling the gfid2path feature. You also seem to be using quota. Quota works on a per-directory basis, but given that you have multiple hardlinks, I'm not sure it makes sense (to which directory should the quota be accounted?). If not strictly necessary, I would also disable quota. This doesn't fix the existing issues: disabling gfid2path will prevent creating additional xattrs for newer files or new hardlinks, but it won't delete existing ones. We should sweep all the files of each brick and remove them. However, standard tools (getfattr) don't seem to support big xattrs either, so I'm not sure how to do that unless btrfs has some specific tool (I don't know btrfs).
Sure. We'll need to improve that in the current code, at least to avoid a crash and return an error if the size is too big.
As I've just commented, we should disable some features and clean the existing xattrs, but I don't know how to do that on btrfs if getfattr doesn't work.
You are using a replica. It's expected to have the same file in more than one brick.
Probably any file that returns an error for getfattr will also have issues in Gluster.
You should never do this. It can cause more trouble. To find them, this command should work: Any file returned inside /.glusterfs/<xx>/<yy>/ with a single link could be removed (be careful not to do this when the volume has load; otherwise ...). This won't find symbolic links that have been deleted.
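The command itself was lost from this comment; going by the description here and in the follow-up below (regular files with a single hardlink), it was presumably along these lines (brick path illustrative):

```
find /srv/brick06/.glusterfs -type f -links 1
```
|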
This is happening because you are using XFS, which limits the total xattr size to 64KiB. It's not a limitation on the number of hardlinks: XFS and Gluster can create many more. But when we also add an xattr for each hardlink, the limiting factor becomes the xattr size. Apparently btrfs doesn't have a 64KiB limit; that's why the issue happened without any error being detected.
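To make the per-hardlink accounting concrete: since gfid2path adds one xattr per hardlink, the usage on a single backend file can be inspected like this, at least on files where getfattr still succeeds (path illustrative; the trusted.gfid2path name prefix is an assumption based on the feature name):

```
# Count the gfid2path xattrs on one backend file:
getfattr -d -m . -e hex /srv/brick06/vault/somefile | grep -c 'trusted.gfid2path'
```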
If the application needs to create 100 hardlinks, it won't work on Gluster if we don't allow more than 47. So the application won't be usable on Gluster. |
Thanks for clarifying it more. |
|
Much appreciate all comments above from @mohit84 @xhernandez @pranithk and @amarts. |
|
@mohit84 wrote
More background info:
Prior to attempting to restore the rsnapshot, I had set max-hardlinks to an aggressively high number based on the number of files to be transferred (i.e. 10,000,000, being more than 5,313,170), because at the time I did not know the maximum number of hard links actually present (9,774). Before the crash (brick06 offline, the topic of this issue), all the contents had been transferred onto gluster, apparently successfully, because checksums matched the source. However, transferring this data took place over several weeks, not all in one go, and consequently not all of the hardlinks were preserved (not a requirement).
I am still checking that the data on the gluster target volume matches the source. So far I have not found any discrepancy (apart from differences in permissions, ownership and timestamps). So I am assuming that, in general, gluster is handling inodes with over 1,000 hard links. However, some operations are stuck, like healing one inode/file with 908 hardlinks. Am I right to assume that "storage.max-hardlinks" being too low is not the cause of the problem, and that having a higher value does nothing to prevent recurrence of the issue? |
|
@amarts wrote
I agree, my data set is unusual, and I do not mind having to set an option to override any default. @pranithk wrote
I'm not sure what "application" you are imagining. Gluster is providing "storage", and if some of that storage contains backup or snapshot data, any user can read it, restore it to their own user area, and run whatever application is desired. My 11TB of data, some of which has more than 100 hardlinks per file/inode, appears to be usable. |
|
@xhernandez Thanks for your detailed response.
Please could you tell me how to disable the gfid2path feature? I cannot find it in the documentation. I disabled quota. I deleted all 809 files that were causing the [Argument list too long] error that self-healing could not handle.
I mean, files that exist on more than one brick for a given server. Cases like this result in
Obviously, I do not intentionally delete files directly from bricks, but I have found this is often the only way to resolve certain issues (like split-brain); it is always possible that with manual intervention like this I could make a mistake. I will attempt to collect getfattr output and hardlink counts for all files on all bricks, but I will need to be careful how I do that: there's no point in running getfattr on each file, it only needs to be done once per inode (see the sketch below). Given that a "find" on an 11TB subset of all the data takes over 5 hours, this could take days.
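A sketch of the kind of sweep meant above (brick path illustrative): prune .glusterfs so each file is visited via its real path, deduplicate by inode number so getfattr runs only once per inode, and report files whose xattrs cannot be read:

```
find /srv/brick06 -path '/srv/brick06/.glusterfs' -prune -o -type f -printf '%i %n %p\n' \
  | sort -k1,1n -u \
  | while read inode links path; do
      getfattr -d -m . -e hex "$path" >/dev/null 2>&1 || echo "FAIL links=$links $path"
    done
```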
|
I don't believe the issue is specific only to hardlinks: gluster populates an xattr per hardlink, and unless we know about the xattrs populated on the backend, it is difficult to find the reason. As Xavi asked earlier, we would like to see the xattrs, but getfattr is failing with "Argument list too long", so that is difficult; I am not sure if btrfs provides some tool to fetch this info. In gluster the default value of storage.max-hardlinks is 100; unless you have changed the value, you can't create more than 100. For the time being you can disable storage.gfid2path as below. After disabling it, an application can create a hardlink without gluster populating a new xattr (gfid2path) for every hardlink; however, this only restricts the gfid2path xattr, and we can't restrict an application from populating custom xattrs on a file. As for the brick crash, we have to fix the code path: we need to call MALLOC instead of alloca when the xattr size is greater than some limit (64k/128k).
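The command referenced above was not captured here; presumably the standard volume-set form (volume name from the logs):

```
gluster volume set gluvol1 storage.gfid2path disable
```
|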
|
Disaster struck; see #1729 and #1728. I discovered a large number of files with silly permissions, and reported that in #1731. I am nervous about the integrity of my files; any suggestions welcome. I am continuing with:
|
|
Perhaps someone could explain the purpose of the hard link for each file residing under <brick_root>/.glusterfs/XX/YY/. Also, why is there a mixture of hard links and symbolic links? This means that finding "dangling" gfids (i.e. a gfid file with no corresponding actual file) is more difficult than @xhernandez suggests:
|
As I already said in my comment, this method won't work properly if you also have symbolic links. The command only finds regular files with a single hardlink. Since gluster keeps a hardlink between the real file and the gfid in .glusterfs/xx/yy, any regular file inside .glusterfs with a single hardlink means that there is no real file associated with it. The symbolic links inside .glusterfs may represent real symbolic-link files or directories; differentiating them is more complex.
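To illustrate the distinction (the listing, GFIDs and names below are made up): a regular file's gfid entry is a hardlink with a link count of at least 2, while a directory's gfid entry is a symlink pointing back into .glusterfs at its parent's gfid:

```
ls -l /srv/brick06/.glusterfs/6a/2f/
# -rw-r--r--. 2 root root 1048576 Oct 22 14:20 6a2f9d4e-...   <- regular file (hardlink, links >= 2)
# lrwxrwxrwx. 1 root root      60 Oct 22 14:20 6a2f01ab-... -> ../../<parent-gfid-prefix>/<parent-gfid>/dirname   <- directory
```
|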
#1730) If a file has huge xattrs on the backend, the brick process crashes once the alloca(size) allocation crosses 256k, because the iot_worker stack size is 256k. Use MALLOC to allocate memory instead of alloca. Fixes: #1699 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> Change-Id: I100468234f83329a7d65b43cbe4e10450c1ccecd
gluster#1730) (gluster#1827) If a file has huge xattrs on the backend, the brick process crashes once the alloca(size) allocation crosses 256k, because the iot_worker stack size is 256k. Fixes: gluster#1699 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> Change-Id: I100468234f83329a7d65b43cbe4e10450c1ccecd
gluster#1730) If a file has huge xattrs on the backend, the brick process crashes once the alloca(size) allocation crosses 256k, because the iot_worker stack size is 256k. > Fixes: gluster#1699 > Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> > Change-Id: I100468234f83329a7d65b43cbe4e10450c1ccecd > (Cherry pick from commit fd666ca) > (Reviewed on upstream link gluster#1828) Change-Id: I100468234f83329a7d65b43cbe4e10450c1ccecd Bug: 1903468 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> Reviewed-on: https://code.engineering.redhat.com/gerrit/220872 Tested-by: RHGS Build Bot <nigelb@redhat.com> Reviewed-by: Sunil Kumar Heggodu Gopala Acharya <sheggodu@redhat.com>
Description of problem:
One brick on one server is offline and all attempts to bring it back online have failed.
The corresponding brick on the other (of a replica 2) server is ok. Other bricks are ok.
The following do not clear the problem:
The problem appears to be similar to #1531
but the cause is different, and the number of volumes and bricks is different.
(I note the observation comment regarding "replica 2" and split-brain, but the cost (time/effort) to recover
from split-brain is manageable and usually due to external causes, such as a power cut.)
My urgent need is to find a way out of the current situation and bring back online brick06 on the second server.
Not so urgent is the need for gluster to handle this condition gracefully and report to the user/admin
the real cause of the problem and how to fix it (if it cannot be fixed automatically).
The exact command to reproduce the issue:
Not sure what actually caused this situation to arise, but activity at the time was:
- Multiple clients, all active, but with minimal activity.
- Intense activity from one client (actually one of the two gluster servers): a scripted "chown" on over a million files, which had been running for over 5 hours and was 83% complete.
- An edit or "sed -i" on a 500MB script file (but that should not have tipped over the 22GB RAM + 8GB swap).
The full output of the command that failed:
Expected results:
Some way to bring that brick back online.
- The output of the "gluster volume info" command:
- The operating system / glusterfs version:
Fedora F32
Linux veriicon 5.8.15-201.fc32.x86_64 #1 SMP Thu Oct 15 15:56:44 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Linux verijolt 5.8.15-201.fc32.x86_64 #1 SMP Thu Oct 15 15:56:44 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
glusterfs 7.8
Additional info:
on verijolt (2nd server)
There's a problem with brick06 on server verijolt.
snippet from /var/log/messages
snippet from /var/log/glusterfs/bricks/srv-brick06.log