Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal for revocation severity levels and context #47

Merged
merged 5 commits into from
Jun 24, 2021

Conversation

THS-on
Copy link
Member

@THS-on THS-on commented Jun 8, 2021

This is the proposal for adding severity levels and context to revocation events that are send by Keylime.

This is useful if other systems need to make more informed decisions about the state of an agent. This is currently not possible with Keylime.

I've tried to design the proposed implementation in such a way that (hopefully) most of components don't need heavy modifications.

Signed-off-by: Thore Sommer <mail@thson.de>
46_revocation_severity_and_context.md Outdated Show resolved Hide resolved
46_revocation_severity_and_context.md Outdated Show resolved Hide resolved
46_revocation_severity_and_context.md Outdated Show resolved Hide resolved
46_revocation_severity_and_context.md Outdated Show resolved Hide resolved
46_revocation_severity_and_context.md Outdated Show resolved Hide resolved
46_revocation_severity_and_context.md Outdated Show resolved Hide resolved
Events can a specified context. Those can be either a static string or an JSON
object that contains more information event. This is optional and it is assumed
that two events from the same agent with the same event id have the same
severity and are the same event.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can assume they are for the same event right? It's possible to have the same failure one hour, recover and have a similar failing again the next hour right? Or am I missing something.

Copy link
Member Author

@THS-on THS-on Jun 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my wording for recoverable vs irrecoverable events is not precise enough. What I meant with recoverable was that Keylime can still operate and do (parts of) remote attestation. Is would case for an IMA failure but not if the qoute cannot be validated (and retries failed).

In my opinion you cannot really recover from a revocation event because in most cases that just means that a window for attack was open and the system might be compromised.
If it was determined that window was not used maybe the events can deleted manually or reset the agent by removing and adding it to the verifier again.

The main idea behind that is, that the receiver of the revocation events does not get a message every time when the failed agent is polled if it was already notified.
Another idea was that our decision is mainly based on the id and context is more for deeper analysis if needed. Also the level is only assigned based on the id not the context.

But I think if the context part is heavily used i could make sense to make it not optional and consider two events with the same id but different context as two separate events.

46_revocation_severity_and_context.md Outdated Show resolved Hide resolved
46_revocation_severity_and_context.md Show resolved Hide resolved
validating further. The state of the agent after that should be
`QUOTE_FAILED_IRRECOVERABLE`.

The highest severity is 0 and the higher the number the number the lower the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I like the idea of being flexible, I'm not sure it's necessary. I can't imagine real world scenarios with more than a handful of severity levels otherwise it will just be too operationally confusing. Also, I can see folks making mistakes in configuration because 0 is high and something like 10 is low. How about using something like the syslog severity text lables? It's well known and covers all kinds of scenarios:

  • emerg
  • alert
  • crit
  • err
  • warning
  • notice
  • info
  • debug

I could even see simplifying it more and merging emerg, alert, and crit into a single value like "crit"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I can see that this might be too confusing and may lead to errors. Labels are more descriptive than numbers.

The main issue that I have with static labels is that you might use a label for a unintended purposes because you run out of labels. This will not be the case in most environments, but for us we want to use it for presorting events with a pretty fine granularity and I don't know if the proposed labels can represent accurately.

Another idea would be to use labels but make them configurable in the keylime.conf. Those labels would be strictly ordered by severity and are static during operation, but you still have the flexibility to add custom labels.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the proposal to include user specified labels instead of using numbers.

Signed-off-by: Thore Sommer <mail@thson.de>
severity. This makes it easy to add less sever levels without the need to
increase the level of the highest severity.

A new table `revocation_events` is introduced to save the already sent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existence of a new table worries me, to be honest. Would this table continuously grow within the lifetime of an agent? In other words, are we expecting multiple lines with the same agent id?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we would record every event that was generated for a agent during its lifetime.
This allows us to do two things. First can provide an API entry point for all the revocation events that have been generated for other systems. Second we want to check if we already sent an event. For example if a static PCR is failing it will in most cases always fail again and we want to be only notified once. Always resending the event would cause potentially a lot of redundant traffic.

We could put that information into a JSON object and store it string encoded in one single entry (if the database allows arbitrary length for fields), but that would make checking if we send a event already always a linear action.
Another approach is to only store the latest or the one with the current highest severity, but this will remove the option to check if a event was sent and a event log for an agent must be recorded by a component outside of Keylime.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This proposal would be breaking a key assumption on Keylime databases so far: these are not "historical", i.e., they do not accumulate data about an agent and, if destroyed, it can be quickly rebuilt from the "ground truth", which is the current state requested from such agents. This is design point made before my time as one of the keylime maintainers, but I must say that it is a point worth preserving. This being said, I would vote against this enhancement based due to the need for this change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If changing that is not an option, what is your opinion on just saving the highest severity level and sending a event only if it has a higher severity? This would not require an extra table and cloud be more easily restored by requesting the state from the agent. Now the clients receiving the events can only make decisions on based on the severity level but I think in most cases this will be still a valuable improvement.

I think that it will be very useful to have some kind of API (either over REST API or Python module) such that external systems can get more information about all revocations events out of Keylime. Those systems can than implement the record keeping outside of Keylime and relay the necessary information to the relevant systems.
(We are already implementing a bridge between Keylime and our system for the tenant functionality, so we could move that out of Keylime and into our bridge.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My short answer would be yes, but I believe your original proposal, the "provenance" of the component/stage where the failure happened is tremendously useful on itself. After several months using Keylime in Production, I became a firm believer in the need to "decouple" the revocation from the core of keylime (the current implementation is not particularly good, flexible or robust) and adopting a more configurable approach (e.g., execute a given command in case an agent fails, passing all the appropriate parameters) is the way to move forward.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would propose that I keep the tracking where a event was generated and the user specified rules in this proposal but restructure the event sending part to just use the labels to demonstrate some aspects of the new gained capabilities of Keylime.

For the redesign of the revocation mechanism of Keylime we then open a new enhancement proposal which uses the added capabilities from this proposal as a base. (Here I would propose some kind of plugin API where the current revocation mechanism is just a plugin and can be swapped out for other mechanisms.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @maugustosilva here in that we shouldn't have keylime keep track of events. Having an extra status field about the last failure, etc attached to the verifier table would be good. Having an extra table that just grows over time would not be. We do need to tackle the idea of an event log and how to store/query/search/maintain that long term, but that's better suited as APIs to existing external systems like adapters for elasticsearch or similar.

`component.[sub_component].event`.

* `component`: Name of the part of Keylime where the event was generated.
* Current would be: `quote_validation`, `pcr_validation`, `measured_boot`,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really really really good idea, and IMO should be implemented first (probably with an extra column on the main agent table) in an self-contained manner.

Signed-off-by: Thore Sommer <mail@thson.de>
Signed-off-by: Thore Sommer <mail@thson.de>
@THS-on
Copy link
Member Author

THS-on commented Jun 23, 2021

@maugustosilva I've now split the tagging part for events and the severity levels for events into two distinct parts.

@mpeters mpeters merged commit cd754e5 into keylime:master Jun 24, 2021
@THS-on THS-on mentioned this pull request Feb 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants