Critical: Possible data loss during rebalance if there is any linkfile on the system (gluster-11 version) #4148

Open
rafikc30 opened this issue May 16, 2023 · 0 comments · Fixed by #4149

Description of problem:
The rebalance daemon can mistakenly delete a data file instead of a linkfile, which removes the file from the gluster namespace entirely and causes complete data loss. Besides outright deletion, the same race can also modify the data file, leaving it with wrong permissions or incomplete contents.

This data loss can occur in the following situations:

Deletion of the entire file:
This situation arises when there is a linkfile present on the hashed subvol. In this case, the rebalance daemon adds both the linkfile and the data file to the queue. Suppose the data file entry is picked first and migrated successfully. When the linkfile entry is subsequently picked from the queue, the daemon finds that the hashed and cached subvols for that entry are the same, still assumes the entry is a linkfile, and deletes it. The linkfile check relies on the stbuf populated during the rebalance readdir, which is stale by the time the check runs: the migration has already replaced the linkfile with the data file, so the delete removes the data itself.
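
The sequence above can be modeled with a small toy simulation. This is only a sketch of the race, not glusterfs code: the `Entry` class, the subvol names, and the `recheck` guard are all hypothetical, and the guard shows one plausible mitigation (a fresh lookup before deleting), not necessarily what the fix in #4149 does.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    subvol: str        # subvol on which readdir found this entry
    is_linkfile: bool  # snapshot taken at readdir time (stands in for the stbuf)


def run_rebalance(recheck: bool) -> dict:
    """Toy model: returns the final contents of each subvol."""
    hashed, cached = "subvol-0", "subvol-1"
    bricks = {hashed: {"f": "linkfile"}, cached: {"f": "data"}}
    queue = deque([
        Entry("f", cached, is_linkfile=False),  # data file entry, picked first
        Entry("f", hashed, is_linkfile=True),   # linkfile entry, stale after migration
    ])
    while queue:
        entry = queue.popleft()
        if not entry.is_linkfile:
            # Migration: the data file replaces the linkfile on the hashed subvol.
            bricks[hashed][entry.name] = "data"
            del bricks[entry.subvol][entry.name]
            continue
        # Linkfile entry: the decision below uses the readdir-time snapshot.
        current = bricks[entry.subvol].get(entry.name)
        if recheck and current != "linkfile":
            continue  # fresh lookup shows it is no longer a linkfile: skip
        bricks[entry.subvol].pop(entry.name, None)  # deletes whatever is there now
    return bricks


if __name__ == "__main__":
    print("stale check:", run_rebalance(recheck=False))
    print("fresh check:", run_rebalance(recheck=True))
```

With the stale check the file disappears from both subvols; with the fresh lookup the migrated data file survives on the hashed subvol.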

Incorrect permissions and size:
Similar to the previous case, this occurs when a linkfile and a data file are picked up for migration at the same time. A failure can occur at various stages of the rebalance process, and because of that failure a cleanup may be performed on the destination where the migration is incomplete. However, by the time the cleanup runs, the other migration may already have completed, so the cleanup is performed on an actual file. The result is either incorrect permissions (e.g., "----------.") or a data file with incomplete information.
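
A minimal sketch of this second case, again as a toy model rather than actual rebalance code. The `DestFile` class, the `owner` field, the mode constants, and the `guarded` flag are assumptions made for illustration: the idea is only that a failed task's cleanup should touch the destination solely when that task still owns an in-progress migration.

```python
from dataclasses import dataclass
from typing import Optional

PLACEHOLDER_MODE = 0o1000  # assumed marker mode for an in-progress destination
FINAL_MODE = 0o644         # assumed final mode of the migrated file


@dataclass
class DestFile:
    mode: int = PLACEHOLDER_MODE
    size: int = 0
    owner: Optional[str] = None  # hypothetical: task with a migration in progress


def complete_migration(dst: DestFile, task: str, src_size: int) -> None:
    """Task copies the data, sets the final mode, and releases ownership."""
    dst.owner = task
    dst.size = src_size
    dst.mode = FINAL_MODE
    dst.owner = None  # migration finished, the destination is now a live file


def cleanup_after_failure(dst: DestFile, task: str, guarded: bool) -> bool:
    """Cleanup run by a task whose migration failed; returns True if it acted."""
    if guarded and dst.owner != task:
        return False  # destination is not this task's partial file: leave it
    dst.mode = PLACEHOLDER_MODE
    dst.size = 0
    return True


if __name__ == "__main__":
    for guarded in (False, True):
        dst = DestFile()
        complete_migration(dst, "task-A", 4096)         # first entry succeeds
        cleanup_after_failure(dst, "task-B", guarded)   # second entry fails
        print(guarded, oct(dst.mode), dst.size)
```

In the unguarded run the completed file is reset to the placeholder mode with size 0, which mirrors the wrong-permission and incomplete-data symptoms described above.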

The exact command to reproduce the issue:

The full output of the command that failed:

Expected results:

Mandatory info:
- The output of the gluster volume info command:

- The output of the gluster volume status command:

- The output of the gluster volume heal command:

- Provide logs present on the following locations of client and server nodes:
/var/log/glusterfs/

- Is there any crash? Provide the backtrace and coredump:

Additional info:
Possible scenarios that leave a linkfile on the system include a rename operation, or the used space on one of the replica sets exceeding the min-free-disk limit (10% free by default, i.e. more than 90% used).
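
To see whether a volume has linkfiles at all, a brick-side check can be sketched as below. It rests on two assumptions about DHT that should be verified against your version: that a linkfile is a regular file whose permission bits are only the sticky bit, and that it carries the `trusted.glusterfs.dht.linkto` xattr. The helper name `looks_like_linkfile` is hypothetical.

```python
import stat

# Assumed name of the xattr that DHT sets on linkfiles.
LINKTO_XATTR = "trusted.glusterfs.dht.linkto"


def looks_like_linkfile(st_mode: int, xattr_names) -> bool:
    """Heuristic: regular file, sticky bit only, and the linkto xattr present."""
    perms = stat.S_IMODE(st_mode)  # permission bits including setuid/setgid/sticky
    return (
        stat.S_ISREG(st_mode)
        and perms == stat.S_ISVTX
        and LINKTO_XATTR in xattr_names
    )


if __name__ == "__main__":
    # Synthetic inputs; no brick access is needed to run this.
    print(looks_like_linkfile(stat.S_IFREG | 0o1000, [LINKTO_XATTR]))
    print(looks_like_linkfile(stat.S_IFREG | 0o644, [LINKTO_XATTR]))
```

On a real brick (Linux, typically as root because of the `trusted.` namespace) the inputs could come from `os.lstat(path).st_mode` and `os.listxattr(path)`.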

- The operating system / glusterfs version:
gluster11

Note: Please hide any confidential data which you don't want to share in public like IP address, file name, hostname or any other configuration

rafikc30 added a commit to rafikc30/glusterfs that referenced this issue May 16, 2023
Rebalance daemon is deleting a data file instead of a
linkfile, which causes an entire file to be deleted from the
gluster namespace, leading to complete data loss.

This data loss occurs in the following situations:
 1) Entire file is deleted
    This happens when there is a linkfile on the hashed
    subvol. In this case, the rebalance daemon adds both the
    linkfile and the data file to the queue. Say the data
    file entry is picked up first and is migrated
    successfully. When the linkfile entry is then picked up
    from the queue, the daemon sees that its hashed and
    cached subvols are the same, assumes it is still a
    linkfile, and deletes it. The linkfile check is
    performed on the stbuf populated during the rebalance
    readdir, so by the time the check is done the file has
    already been migrated and is now a data file.
 2) Wrong permissions and incorrect size
    Similar to the above case, this happens when a linkfile
    and a data file are picked up for migration at the same
    time. A failure can happen at different stages of the
    rebalance process, and because of that failure a cleanup
    might be done on the destination where the rebalance is
    incomplete. By that time the first migration might have
    completed, so the cleanup is done on an actual file. We
    may end up with either wrong permissions like
    "----------." or a data file with incomplete information.

Change-Id: I22f1c9b35811be83bde08c4f30db9d5582d2d2fc
Fixes: gluster#4148
Signed-off-by: Mohammed Rafi KC <rafi.kavungal@iternity.com>
rafikc30 added a commit to rafikc30/glusterfs that referenced this issue May 16, 2023
Change-Id: I22f1c9b35811be83bde08c4f30db9d5582d2d2fc
Fixes: gluster#4148
Signed-off-by: Mohammed Rafi KC <rafi.kavungal@iternity.com>
rafikc30 added a commit to rafikc30/glusterfs that referenced this issue May 17, 2023
Change-Id: I22f1c9b35811be83bde08c4f30db9d5582d2d2fc
Fixes: gluster#4148
Signed-off-by: Mohammed Rafi KC <rafi.kavungal@iternity.com>
mohit84 pushed a commit that referenced this issue May 18, 2023
Change-Id: I22f1c9b35811be83bde08c4f30db9d5582d2d2fc
Fixes: #4148

Signed-off-by: Mohammed Rafi KC <rafi.kavungal@iternity.com>
@mohit84 mohit84 reopened this May 18, 2023
mohit84 pushed a commit to mohit84/glusterfs that referenced this issue May 18, 2023
> Change-Id: I22f1c9b35811be83bde08c4f30db9d5582d2d2fc
> Fixes: gluster#4148
> Signed-off-by: Mohammed Rafi KC <rafi.kavungal@iternity.com>
> (Reviewed on gluster#4149)
> (Cherry picked from commit 1cedcc0)

Change-Id: I22f1c9b35811be83bde08c4f30db9d5582d2d2fc
Fixes: gluster#4148
Signed-off-by: Mohit Agrawal<moagrawa@redhat.com>
Shwetha-Acharya pushed a commit that referenced this issue May 24, 2023
> Change-Id: I22f1c9b35811be83bde08c4f30db9d5582d2d2fc
> Fixes: #4148
> Signed-off-by: Mohammed Rafi KC <rafi.kavungal@iternity.com>
> (Reviewed on #4149)
> (Cherry picked from commit 1cedcc0)

Change-Id: I22f1c9b35811be83bde08c4f30db9d5582d2d2fc
Fixes: #4148

Signed-off-by: Mohit Agrawal<moagrawa@redhat.com>
Co-authored-by: Mohammed Rafi KC <rafi.kavungal@iternity.com>