Object loss: (not so) redundant local copy removed on all machines #2267
Blocks around this event:
IR:
Netmap:
At around the next epoch:
So it was seriously degraded for some time (for an unknown reason).
Ah, they were degraded because of #2263; then notary balances were fixed and the nodes came back.
In the current implementation, when a node leaves a container (due to network map changes), it initiates replication of the objects and then throws them away.
The point is that we could hold unreplicated out-of-container objects until successful migration. So, there are two possible approaches:
Of these two, BTW, I think the first one is more correct. In general, "out-of-container" may mean that we have a node in CN holding some data belonging to a RU-only container, and that's a serious policy violation. We can't delete data unless we ensure that proper replicas exist, but once we do, it must be deleted.
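The first approach above can be sketched as a simple decision predicate. This is an illustrative sketch only; `mayDropLocalCopy` and its parameters are hypothetical names, not the actual neofs-node API. The idea: an out-of-container copy becomes droppable only once the required number of replicas is positively confirmed elsewhere.

```go
package main

import "fmt"

// mayDropLocalCopy is a hypothetical sketch of the "hold until migrated" rule:
// a node that left the container keeps its copy until the storage policy is
// satisfied by confirmed replicas on other nodes.
func mayDropLocalCopy(confirmedReplicas, requiredReplicas int, inContainer bool) bool {
	if inContainer {
		// An in-container node keeps its copy regardless.
		return false
	}
	// An out-of-container copy is droppable only once replication succeeded.
	return confirmedReplicas >= requiredReplicas
}

func main() {
	fmt.Println(mayDropLocalCopy(0, 2, false)) // replication not confirmed: keep
	fmt.Println(mayDropLocalCopy(2, 2, false)) // policy satisfied: safe to drop
}
```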
In the previous implementation, `Policer.processObject` considered the local object replica redundant if the `processNodes` loop didn't set the `needLocalCopy` flag. According to the implementation of `processNodes`, the loop could break on context cancellation with the flag unset, after which `processObject` triggered the cleanup routine. To prevent potential data loss, `Policer` must handle this case: check context expiration after the placement loop but before the `needLocalCopy` condition in `Policer.processObject`.

Signed-off-by: Leonard Lyubich <ctulhurider@gmail.com>
…o context

There is no need to explicitly pass the object descriptor being checked and the node cache to `processNodes`, since they do not change during processing.

Signed-off-by: Leonard Lyubich <ctulhurider@gmail.com>
Signed-off-by: Leonard Lyubich <ctulhurider@gmail.com>
The local storage node can be outside the container of some object while holding its only replica. The previous implementation of `Policer` considered this replica redundant. The behavior must change for such cases: `Policer` must not mark the replica as redundant if it did not find any valid replicas while traversing the container. Make `Policer.processObject` determine the presence of the local node in the container within the `processNodes` call loop. Do not consider the single existing replica outside the container redundant and, accordingly, do not pass it to the redundant-replica callback.

Signed-off-by: Leonard Lyubich <ctulhurider@gmail.com>
If the local node is absent from the network map, it is impossible to determine whether it belongs to the container once it returns. To prevent the potential loss of the only instance of the data, offline nodes must hold their objects. Make `Policer.processObject` look up the local node in the network map before considering an object replica redundant. If the node is outside the network map, the replica is considered meaningful.

Signed-off-by: Leonard Lyubich <ctulhurider@gmail.com>
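The rule from this commit can be sketched as follows. The `netmap` type and `keepReplica` helper below are illustrative stand-ins, not real neofs-node types: a node absent from the network map cannot prove container membership, so it keeps its replica.

```go
package main

import "fmt"

// netmap is a toy stand-in for the real network map: a set of node IDs.
type netmap struct{ nodes map[string]bool }

func (nm netmap) contains(nodeID string) bool { return nm.nodes[nodeID] }

// keepReplica applies the rule from the commit message above: if the local
// node is not in the netmap, container membership is undecidable, so the
// replica is treated as meaningful and kept.
func keepReplica(nm netmap, localID string) bool {
	return !nm.contains(localID)
}

func main() {
	nm := netmap{nodes: map[string]bool{"node-1": true}}
	fmt.Println(keepReplica(nm, "node-2")) // true: unlisted node holds the object
	fmt.Println(keepReplica(nm, "node-1")) // false: normal policing applies
}
```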
* closes #2267
* closes #1453

I decided not to implement the #1453 proposal directly:
* out-of-container (netmap) cases are handled in a special way
* instant replica removal right after successful replication doesn't seem like good behavior: some nodes have already violated the storage policy and lost the object, so it is worth waiting for some time interval and making the decision at the next check
Expected Behavior
An object storage system should store objects. I mean, you put them in and then get them back at any time until you delete them.
Current Behavior
So we have a four-node network with container
`Hw57cmN31gCrqyEyKL5km31TFYETzQa3qk8DNECk6a4H`
using this policy:

An object
`AV3Z7kpn8hxnaWWB2QRZUF9x8YhnyuB5ntC5fX6PRhz1`
was uploaded into this container some (pretty long) time ago. It was stored on nodes 3 and 4 (there were some movements before the incident, but they're not relevant) until this happened:
Nodes 3 and 4 (holding the object) decide to move it to 1 and 2 at around the same time. Both fail to do so for some reason (which is not really important, replication can fail for a number of reasons). Both then delete their local copies. Object is gone. Forever.
Possible Solution
Looks like something is wrong in the logic that ensures a proper number of copies exists before deleting the local one.
Context
Yeah, it's T5 testnet.
Your Environment
Node version 0.34.0.