Critical: Possible data loss during rebalance if there is any linkfile on the system (gluster-11 version) #4148
Comments
rafikc30
added a commit
to rafikc30/glusterfs
that referenced
this issue
May 16, 2023
The rebalance daemon deletes a data file instead of a linkfile, which removes the entire file from the gluster namespace and leads to complete data loss. This data loss occurs in the following situations:

1) Entire file is deleted
This happens when there is a linkfile on the hashed subvol. In this case, the rebalance daemon adds both the linkfile and the data file to the queue. Suppose the data file entry is picked up first and is migrated successfully. When the linkfile entry is then picked up from the queue, the daemon sees that its hashed and cached subvols are the same, assumes that it is a linkfile, and deletes it. The linkfile check is performed on the stbuf populated during the rebalance readdir, so by the time the check is done the file has already been migrated and is now a data file.

2) Wrong permission and incorrect size
Similar to the above case, this happens when a linkfile and a data file are picked up to migrate at the same time. A failure can happen at different stages of the rebalance process, and because of this failure a cleanup may be done on the destination where the rebalance is incomplete. By that time the first migration may have completed, so the cleanup is done on an actual file, and we may end up with wrong permissions like "----------." or a data file with incomplete information.

Change-Id: I22f1c9b35811be83bde08c4f30db9d5582d2d2fc
Fixes: gluster#4148
Signed-off-by: Mohammed Rafi KC <rafi.kavungal@iternity.com>
rafikc30
added a commit
to rafikc30/glusterfs
that referenced
this issue
May 16, 2023
Change-Id: I22f1c9b35811be83bde08c4f30db9d5582d2d2fc
Fixes: gluster#4148
Signed-off-by: Mohammed Rafi KC <rafi.kavungal@iternity.com>
rafikc30
added a commit
to rafikc30/glusterfs
that referenced
this issue
May 17, 2023
Change-Id: I22f1c9b35811be83bde08c4f30db9d5582d2d2fc
Fixes: gluster#4148
Signed-off-by: Mohammed Rafi KC <rafi.kavungal@iternity.com>
mohit84
pushed a commit
that referenced
this issue
May 18, 2023
Change-Id: I22f1c9b35811be83bde08c4f30db9d5582d2d2fc
Fixes: #4148
Signed-off-by: Mohammed Rafi KC <rafi.kavungal@iternity.com>
mohit84
pushed a commit
to mohit84/glusterfs
that referenced
this issue
May 18, 2023
> Change-Id: I22f1c9b35811be83bde08c4f30db9d5582d2d2fc
> Fixes: gluster#4148
> Signed-off-by: Mohammed Rafi KC <rafi.kavungal@iternity.com>
> (Reviewed on gluster#4149)
> (Cherry picked from commit 1cedcc0)
Change-Id: I22f1c9b35811be83bde08c4f30db9d5582d2d2fc
Fixes: gluster#4148
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
Shwetha-Acharya
pushed a commit
that referenced
this issue
May 24, 2023
> Change-Id: I22f1c9b35811be83bde08c4f30db9d5582d2d2fc
> Fixes: #4148
> Signed-off-by: Mohammed Rafi KC <rafi.kavungal@iternity.com>
> (Reviewed on #4149)
> (Cherry picked from commit 1cedcc0)
Change-Id: I22f1c9b35811be83bde08c4f30db9d5582d2d2fc
Fixes: #4148
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
Co-authored-by: Mohammed Rafi KC <rafi.kavungal@iternity.com>
Description of problem:
The cause of the data loss was the Rebalance daemon mistakenly deleting a data file instead of a linkfile, resulting in the deletion of the entire file from the gluster namespace and causing complete data loss. In addition to the file being deleted, a modification to the data file can also cause data loss.
This data loss can occur in the following situations:
Deletion of the entire file:
This situation arises when there is a linkfile present on the hashed subvol. In this case, the rebalance daemon adds both the linkfile and the data file to the queue. Suppose the data file entry is picked first for migration and is migrated successfully. When the linkfile entry is subsequently picked from the queue, the rebalance daemon sees that its hashed and cached subvols are the same, assumes the entry is still a linkfile, and deletes it. The linkfile check is based on the stbuf populated during the rebalance readdir, which is stale: by the time the check is performed, the file has already been migrated, so the entry being deleted is the data file.
Incorrect permissions and size:
Similar to the previous case, this occurs when a linkfile and a data file are selected for migration at the same time. In this situation, a failure can occur at various stages of the rebalance process. Because of this failure, a cleanup operation may be performed on the destination where the rebalance is incomplete. However, by that time the first migration may already have completed, so the cleanup is performed on an actual file, resulting in incorrect permissions (e.g., "----------.") or a data file with incomplete information.
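The first scenario above (deletion of the entire file) can be pictured with a small toy model. This is only a sketch of the ordering problem: the dictionaries stand in for subvols, and names such as `Entry`, `readdir_snapshot`, `process_linkfile_entry`, and the `recheck` flag are hypothetical and are not GlusterFS APIs. The re-lookup shown here is one possible guard, not necessarily what the actual fix does.

```python
from dataclasses import dataclass

# Toy model only: none of these names are GlusterFS APIs.


@dataclass
class Entry:
    is_linkfile: bool   # True for a DHT linkfile, False for a data file
    data: bytes = b""


def readdir_snapshot(subvol):
    """Capture each entry's type once, like the stbuf filled in during readdir."""
    return {name: entry.is_linkfile for name, entry in subvol.items()}


def migrate(name, cached, hashed):
    """Move the data file onto the hashed subvol, replacing the linkfile there."""
    hashed[name] = cached.pop(name)


def process_linkfile_entry(name, hashed, stale_is_linkfile, recheck):
    """Handle a queued entry that looked like a linkfile at readdir time.

    recheck=False trusts the stale snapshot (the behaviour described above).
    recheck=True looks the entry up again before deciding to delete it.
    """
    is_linkfile = stale_is_linkfile
    if recheck:
        current = hashed.get(name)
        is_linkfile = current is not None and current.is_linkfile
    if is_linkfile:
        hashed.pop(name, None)  # remove what is believed to be a linkfile


def run(recheck):
    cached = {"f": Entry(False, b"payload")}  # subvol holding the data file
    hashed = {"f": Entry(True)}               # subvol holding the linkfile
    stale = readdir_snapshot(hashed)          # taken before any migration
    migrate("f", cached, hashed)              # data file entry is processed first
    process_linkfile_entry("f", hashed, stale["f"], recheck)
    return "f" in hashed or "f" in cached     # does the file still exist anywhere?
```

Calling `run(False)` follows the sequence from the issue and leaves the file on neither subvol, while `run(True)` re-checks the entry after migration and keeps the data file.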
The exact command to reproduce the issue:
The full output of the command that failed:
Expected results:
Mandatory info:
- The output of the gluster volume info command:
- The output of the gluster volume status command:
- The output of the gluster volume heal command:
- Provide logs present on following locations of client and server nodes: /var/log/glusterfs/
- Is there any crash? Provide the backtrace and coredump:
Additional info:
Possible scenarios that leave a linkfile on the system include a rename operation, or the used space on one of the replica sets being above the min-free-disk limit (90% by default).
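To judge whether a volume has linkfiles at all, an entry on a brick backend can be tested with a heuristic like the one below. This is a sketch under stated assumptions: it relies on the commonly documented properties of a DHT linkfile (a zero-byte regular file whose only mode bit is the sticky bit, shown as `---------T`, carrying the `trusted.glusterfs.dht.linkto` xattr). The function name `looks_like_dht_linkfile` is hypothetical.

```python
import stat

# Assumption: the xattr that DHT sets on a linkfile to point at the cached subvol.
LINKTO_XATTR = "trusted.glusterfs.dht.linkto"


def looks_like_dht_linkfile(st_mode, st_size, xattr_names):
    """Heuristic check for a DHT linkfile on a brick backend.

    st_mode and st_size come from a stat of the entry; xattr_names is the
    collection of extended attribute names present on it.
    """
    perms = stat.S_IMODE(st_mode)
    return (
        stat.S_ISREG(st_mode)
        and perms == stat.S_ISVTX   # only the sticky bit set: ---------T
        and st_size == 0
        and LINKTO_XATTR in xattr_names
    )
```

On a Linux brick the inputs could come from `os.lstat(path)` and `os.listxattr(path)`; reading `trusted.*` attributes normally requires root.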
- The operating system / glusterfs version:
gluster11
Note: Please hide any confidential data which you don't want to share in public like IP address, file name, hostname or any other configuration