
engine: Add tests for initialization failure #2238

Merged: 4 commits into support/v0.35 on Feb 6, 2023

Conversation

fyrchik (Contributor) commented on Feb 6, 2023

Fix bugs along the way.
Invalid mode on a directory is an approximation of missing media.
We expect any Open/Init errors to be logged and a shard to be disabled.

codecov bot commented on Feb 6, 2023

Codecov Report

Merging #2238 (5fec25c) into support/v0.35 (e3f1804) will increase coverage by 0.10%.
The diff coverage is 64.10%.

```
@@                Coverage Diff                @@
##           support/v0.35    #2238      +/-   ##
=================================================
+ Coverage          30.88%   30.99%   +0.10%
=================================================
  Files                383      383
  Lines              28395    28419      +24
=================================================
+ Hits                8770     8808      +38
+ Misses             18878    18870       -8
+ Partials             747      741       -6
```

| Impacted Files | Coverage Δ |
| --- | --- |
| cmd/neofs-node/config.go | 0.00% <0.00%> (ø) |
| pkg/local_object_storage/shard/control.go | 76.19% <50.00%> (-0.32%) ⬇️ |
| pkg/local_object_storage/engine/control.go | 84.02% <85.18%> (+11.83%) ⬆️ |
| pkg/local_object_storage/engine/shards.go | 70.50% <0.00%> (+2.87%) ⬆️ |


pkg/local_object_storage/engine/control.go
```go
		c.log.Info("shard attached to engine", zap.Stringer("id", id))
	}
}
if shardsAttached == 0 {
```
carpawell (Member) commented on Feb 6, 2023:

Is this a discussed behaviour? I mean, could an admin expect a node to start with just an error in the logs but without some of the planned shards?

fyrchik (Contributor, Author) replied:

Currently we still drop shards in Init, so this behaviour is not new.
In our model a shard is an (almost) independent domain of failure. I am mostly thinking about automatic node restart after hardware failures. We can discuss this in the future; I am not sure what the expected behaviour here is: for 11 shards, repetitive config manipulation is a laborious and error-prone task.

fyrchik (Contributor, Author) replied:

Or do you mean that we could start it in degraded mode if we cannot write the ID to the metabase?

A project Member commented:
The biggest problem of this approach is that you can lose shards (and potentially data) on benign misconfiguration. This will happen.

carpawell (Member) commented on Feb 6, 2023:

> so this behaviour is not new

But we had `fatalOnErr(err)` previously if a shard had not been attached. As I understand it, that was very different. Losing shards without some huge "you are losing shards" banner bothers me, but I don't have any good ideas on how to solve that right now.

A project Member commented:

> In our model shard is an (almost) independent domain of failure.

Also, I can't fully agree with this. One shard could contain Locks and Tombstones that protect/remove data in another, so losing one shard is not just losing some objects. But that is our general problem.

```go
e.mtx.RLock()
defer e.mtx.RUnlock()
```

```go
e.mtx.Lock()
defer e.mtx.Unlock()
```
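For context, a minimal sketch of the locking pattern in the snippets above: read-only paths take the read lock, while shard removal needs the write lock. The `engine` type here is a hypothetical stand-in, not the real StorageEngine:

```go
package main

import (
	"fmt"
	"sync"
)

// engine is a minimal stand-in: the shard map is guarded by an RWMutex.
type engine struct {
	mtx    sync.RWMutex
	shards map[string]struct{}
}

// removeShard drops a shard (e.g. one that failed to open); it mutates
// the map, so it must take the write lock.
func (e *engine) removeShard(id string) {
	e.mtx.Lock()
	defer e.mtx.Unlock()
	delete(e.shards, id)
}

// shardCount only reads the map, so the read lock suffices.
func (e *engine) shardCount() int {
	e.mtx.RLock()
	defer e.mtx.RUnlock()
	return len(e.shards)
}

func main() {
	e := &engine{shards: map[string]struct{}{"good": {}, "broken": {}}}
	e.removeShard("broken")
	fmt.Println(e.shardCount()) // prints 1
}
```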
A project Member commented:

Why do you duplicate this much code from Init()?

fyrchik (Contributor, Author) replied:

Could you elaborate?
This line is here because, after this PR, we can remove shards in Open.

A project Member replied:

It's not about this particular line; I just had to attach this comment somewhere. There seems to be quite some duplication between Open and Init now.

fyrchik (Contributor, Author) replied:

Oh, I see.
I have thought about it, though I am not sure it would be simpler with some ad-hoc parallelizing function which accepts functions as arguments.


fyrchik (Contributor, Author) commented on Feb 6, 2023:

> The biggest problem of this approach is that you can lose shards (and potentially data) on benign misconfiguration. This will happen.

I agree, but the alternative here is losing all data vs losing some data.
And to be clear, we don't lose anything here; the data just becomes (possibly) unavailable.

carpawell previously approved these changes on Feb 6, 2023, leaving a comment:

I would add changes that affect the node's actions to the CHANGELOG.

1. Both could initialize shards in parallel.
2. Both should close shards after an error.

Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
…rrors

Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
Signed-off-by: Evgenii Stratonikov <e.stratonikov@yadro.com>
@fyrchik fyrchik merged commit f92c14f into nspcc-dev:support/v0.35 Feb 6, 2023