Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify gNMI expectation for persisting configuration #192

Open
dplore opened this issue Jul 18, 2023 · 9 comments
Open

Clarify gNMI expectation for persisting configuration #192

dplore opened this issue Jul 18, 2023 · 9 comments
Assignees

Comments

@dplore
Copy link
Member

dplore commented Jul 18, 2023

Currently there is no explicit requirement in gNMI regarding if a successful SetRequest should be saved to persistent storage on a device. gNMI intentionally doesn't define a method to make a configuration change that isn't persisted.

The gNMI reference should be clarified to require that a device must persist the configuration before sending a successful SetResponse.

@dplore dplore self-assigned this Jul 18, 2023
@hellt
Copy link
Contributor

hellt commented Jul 19, 2023

@dplore are you referring to a behavior when a set request leads to saving dynamic configuration to a persistent storage -- e.g. making current running config saved to a file that is used when the device boots?

@robshakir
Copy link
Contributor

couple of comments on this issue:

  • (nit, but important) the RPC is Set, so we should think about this as a "successful Set" not a successful SetRequest.
  • you say this should be before the SetResponse is sent. We should think carefully about this -- this will make the RPC blocking until the write of config has happened (could be latent) and means that you are mandating 1:1 write to Set correspondence. Consider rather whether this should be allowed asynchronously - i.e., the promise that is made is that the config will be persisted and telemetry can be used to determine if that write has been completed. This allows for higher Set throughput and optimisations like batch writing if you ended up with lots of simultaenous Set RPC calls.

@dplore
Copy link
Member Author

dplore commented Jul 19, 2023

@hellt yes, saving dynamic configuration to a persistent storage which is used to load the configuration when the device reboots.

couple of comments on this issue:

  • (nit, but important) the RPC is Set, so we should think about this as a "successful Set" not a successful SetRequest.

This is correct of course. I'll use Set when referring the to RPC overall and the SetRequest message and SetResponse message when specifically referring to those messages and expected behavior. This issue is about fine grained behavior expectations of the Set RPC's Request and the Response behavior. So I've been using SetRequest and SetResponse terms to differentiate.

  • you say this should be before the SetResponse is sent. We should think carefully about this -- this will make the RPC blocking until the write of config has happened (could be latent) and means that you are mandating 1:1 write to Set correspondence. Consider rather whether this should be allowed asynchronously - i.e., the promise that is made is that the config will be persisted and telemetry can be used to determine if that write has been completed. This allows for higher Set throughput and optimisations like batch writing if you ended up with lots of simultaenous Set RPC calls.

This is a good point! The question you point out here is if the "persist" of the configuration should be part of the Set transaction or occur asynchronously from of the Set transaction. If persisting the configuration data is not part of the Set transaction, then the Set transaction could succeed and the persisting could fail.

Operationally we want the configuration to take effect and we want it persisted. Set is supposed to do both of these things. If persisting fails, we definitely still want the configuration to take effect. If the persisting fails, we want a warning/error which triggers an action to repair whatever it is that is causing the "persist" of the configuration to fail.

If we allow a Set to succeed asynchronously from the the configuration is being persisted, then we are reliant on undefined logging / debug information to detect failure of persisting a configuration. It seems bad to require a behavior in an RPC which is not also guaranteed. If we want to have the status of "persisting" a configuration to be explicit in a Set RPC, then I can think of three options:

  1. Require the Set to block on persisting a configuration. If the "persist" cannot be completed, the configuration should still be applied and an indication in the SetResponse indicating the configuration is applied, but not persisted. Probably this should also be a gRPC error condition.
  2. Do not require gNMI Set to persist a configuration, instead create another RPC specifically to "persist" a configuration.
  3. Define Set RPC to be asynchronous and introduce a new SetResponseStatus message which indicates the status of each step in the Set operation and if the total operation is complete. ie: configuration is 'validated' and 'persisted' as two separate steps.
  4. ClarifySet as a synchronous operation for "persisting" a configuration and introduce option 2 or 3 as a new RPC.

Option 1 keeps things simple and synchronous, but as you said, has tradeoffs.
Options 2 and 3 allow a Set to asynchronous from "persisting" a configuration. (and are more complex)
Option 4 is just a combination of 1 and (2 or 3).

IMO, Set should not be a frequent transaction. A configuration change once per minute would be a very high rate of configuration change. Given the latency we see for configuration changes today, depending on the platform and total size of a configuration, once per minute is within the same order of magnitude of the performance we are seeing for Seting large configurations on large devices.

Maybe there are other options.

My thoughts are to proceed with 1 now and add 2 or 3 based on OC Operator feedback. (Resulting in option 4, assuming something is added).

@robshakir
Copy link
Contributor

robshakir commented Jul 19, 2023

I find the use of SetRequest and SetResponse in this context confusing :-) Set is an RPC that takes a payload of SetRequest, the response to the RPC is a message called SetResponse. SetRequest and SetResponse are just protos, they don't have "behaviours". The Set RPC has behaviours.

w.r.t your options... I would like to see data as to:

a. implementations conformance to each option
b. performance implications of each option

and consideration of backwards compatibility. we have expected Set to persist configuration before - so something like 2 causes there to be cases that do not implement this.

I don't think relying on telemetry is a bad thing, which you assume above, I believe. Today, if we look at the consistency model we have we are saying:

  • Set updates intended configuration
  • Intended configuration may transition to being applied configuration but there may be preconditions that are not met, or convergence time for intended to transition to applied
  • the only way to check whether intended configuration transitioned to applied configuration is to use telemetry

thus, I'd argue that architecturally we've already said that telemetry is something that is required to understand the state of a configuration on a device. Equally, we've said that the Set being successful is a promise that the configuration will try to be applied.

We also see cases where we'd like faster Set operations, e.g., for fault mitigation. Requiring that write to persistent storage are synchronous seems counter to that aim.

@hellt
Copy link
Contributor

hellt commented Jul 20, 2023

There are applications that might require quite some number of consecutive sets (e.g. atomic-based configuration management systems that manage "small" configuration regions and thus responsible for sending changes to these regions).
With this being said, I think ultimately requiring cfg persistency without being able to disable it is not flexible enough.

To add to the bucket of vendor's implementations, in SR Linux we have a configurable knob under the grpc server stanza (set to false by default) that when set will persist each successful Set.

@robshakir
Copy link
Contributor

Based on the comments here, @dplore - I suggest we write a short document that describes the constraints on the design space, the existing guidance, and the potential solutions and review this with the various gNMI stakeholders.

@dplore
Copy link
Member Author

dplore commented Jul 21, 2023

Based on the comments here, @dplore - I suggest we write a short document that describes the constraints on the design space, the existing guidance, and the potential solutions and review this with the various gNMI stakeholders.

I am happy to do this. But this task has now grown to something that has to be scheduled in a future sprint.

@robshakir
Copy link
Contributor

Thanks.

@hellt
Copy link
Contributor

hellt commented Jul 21, 2023

@robshakir you mentioned

we have expected Set to persist configuration before

Since this was not expressed in the spec I fear it might not be the case with implementations in the wild

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants