Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pkg/ottl] Add ErrorMode configuration option #18187

Merged
merged 10 commits into from
Feb 3, 2023

Conversation

TylerHelmuth
Copy link
Member

@TylerHelmuth TylerHelmuth commented Jan 30, 2023

Description:
Adds a new configuration option to OTTL parsers that allow configuring ErrorMode.

The valid options at the moment are IgnoreError and PropagateError. When using IgnoreError, OTTL will log the error but not return it to the component executing the Statement. When using PropagateError, the error will be returned.

I have set PropagateError to be the default error mode, following the idea that OTTL should fail openly whenever possible.

Since this PR is only the configuration option and not actual enablement in any implementing components, I have set the OTTL contexts to use PropagateError so that this isn't a breaking change for transformprocessor, filterprocessor, or routingprocessor users. In future PRs I will update those components to allow for configuring this capability. At that point I will set the ottlcontexts to use IgnoreError by default.

This is the first of several PRs I'd like to submit to improve/standardize error handling when using OTTL.

Link to tracking Issue:

Works towards #16519
Related to #17563

Testing:
Added unit tests

Documentation:
I plan to do a large godocs PR once the OTTL API has stabilized.

@TylerHelmuth
Copy link
Member Author

/cc @swiatekm-sumo @sumo-drosiek @astencel-sumo

@runforesight
Copy link

runforesight bot commented Jan 30, 2023

Foresight Summary

    
Major Impacts

build-and-test-windows duration(5 seconds) has decreased 40 minutes 10 seconds compared to main branch avg(40 minutes 15 seconds).
View More Details

✅  check-links workflow has finished in 43 seconds (1 minute 14 seconds less than main branch avg.) and finished at 1st Feb, 2023.


Job Failed Steps Tests
changed files -     🔗  N/A See Details
check-links -     🔗  N/A See Details

✅  telemetrygen workflow has finished in 1 minute 4 seconds (1 minute 29 seconds less than main branch avg.) and finished at 1st Feb, 2023.


Job Failed Steps Tests
build-dev -     🔗  N/A See Details
publish-latest -     🔗  N/A See Details
publish-stable -     🔗  N/A See Details

✅  prometheus-compliance-tests workflow has finished in 3 minutes 55 seconds (3 minutes 49 seconds less than main branch avg.) and finished at 1st Feb, 2023.


Job Failed Steps Tests
prometheus-compliance-tests -     🔗  ✅ 21  ❌ 0  ⏭ 0    🔗 See Details

✅  build-and-test workflow has finished in 41 minutes 51 seconds (11 minutes 28 seconds less than main branch avg.) and finished at 1st Feb, 2023.


Job Failed Steps Tests
unittest-matrix (1.19, internal) -     🔗  ✅ 561  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.18, internal) -     🔗  ✅ 561  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.19, processor) -     🔗  ✅ 1471  ❌ 0  ⏭ 0    🔗 See Details
correctness-metrics -     🔗  ✅ 2  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.18, processor) -     🔗  ✅ 1471  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.19, extension) -     🔗  ✅ 546  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.18, extension) -     🔗  ✅ 546  ❌ 0  ⏭ 0    🔗 See Details
correctness-traces -     🔗  ✅ 17  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.18, receiver-0) -     🔗  ✅ 2577  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.19, receiver-0) -     🔗  ✅ 2577  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.19, exporter) -     🔗  ✅ 2429  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.18, receiver-1) -     🔗  ✅ 1944  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.18, exporter) -     🔗  ✅ 2429  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.19, receiver-1) -     🔗  ✅ 1944  ❌ 0  ⏭ 0    🔗 See Details
integration-tests -     🔗  ✅ 55  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.19, other) -     🔗  ✅ 4668  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.18, other) -     🔗  ✅ 4668  ❌ 0  ⏭ 0    🔗 See Details
setup-environment -     🔗  N/A See Details
lint-matrix (receiver-0) -     🔗  N/A See Details
lint-matrix (receiver-1) -     🔗  N/A See Details
lint-matrix (processor) -     🔗  N/A See Details
lint-matrix (exporter) -     🔗  N/A See Details
lint-matrix (extension) -     🔗  N/A See Details
lint-matrix (internal) -     🔗  N/A See Details
lint-matrix (other) -     🔗  N/A See Details
check-codeowners -     🔗  N/A See Details
check-collector-module-version -     🔗  N/A See Details
checks -     🔗  N/A See Details
build-examples -     🔗  N/A See Details
lint -     🔗  N/A See Details
unittest (1.19) -     🔗  N/A See Details
unittest (1.18) -     🔗  N/A See Details
cross-compile (darwin, amd64) -     🔗  N/A See Details
cross-compile (darwin, arm64) -     🔗  N/A See Details
cross-compile (linux, 386) -     🔗  N/A See Details
cross-compile (linux, amd64) -     🔗  N/A See Details
cross-compile (linux, arm) -     🔗  N/A See Details
cross-compile (linux, arm64) -     🔗  N/A See Details
cross-compile (linux, ppc64le) -     🔗  N/A See Details
cross-compile (windows, 386) -     🔗  N/A See Details
cross-compile (windows, amd64) -     🔗  N/A See Details
build-package (deb) -     🔗  N/A See Details
build-package (rpm) -     🔗  N/A See Details
windows-msi -     🔗  N/A See Details
publish-check -     🔗  N/A See Details
publish-dev -     🔗  N/A See Details
publish-stable -     🔗  N/A See Details

✅  load-tests workflow has finished in 9 minutes 42 seconds (4 minutes 51 seconds less than main branch avg.) and finished at 1st Feb, 2023.


Job Failed Steps Tests
loadtest (TestIdleMode) -     🔗  ✅ 1  ❌ 0  ⏭ 0    🔗 See Details
loadtest (TestTraceAttributesProcessor) -     🔗  ✅ 3  ❌ 0  ⏭ 0    🔗 See Details
loadtest (TestMetricResourceProcessor|TestTrace10kSPS) -     🔗  ✅ 12  ❌ 0  ⏭ 0    🔗 See Details
loadtest (TestTraceBallast1kSPSWithAttrs|TestTraceBallast1kSPSAddAttrs) -     🔗  ✅ 10  ❌ 0  ⏭ 0    🔗 See Details
loadtest (TestMetric10kDPS|TestMetricsFromFile) -     🔗  ✅ 6  ❌ 0  ⏭ 0    🔗 See Details
loadtest (TestTraceNoBackend10kSPS|TestTrace1kSPSWithAttrs) -     🔗  ✅ 8  ❌ 0  ⏭ 0    🔗 See Details
loadtest (TestBallastMemory|TestLog10kDPS) -     🔗  ✅ 19  ❌ 0  ⏭ 0    🔗 See Details
setup-environment -     🔗  N/A See Details

⭕  changelog workflow has finished in 4 seconds (2 minutes 2 seconds less than main branch avg.) and finished at 3rd Feb, 2023.


Job Failed Steps Tests
changelog -     🔗  N/A See Details

⭕  build-and-test-windows workflow has finished in 5 seconds (40 minutes 10 seconds less than main branch avg.) and finished at 3rd Feb, 2023.


Job Failed Steps Tests
windows-unittest-matrix -     🔗  N/A See Details
windows-unittest -     🔗  N/A See Details

🔎 See details on Foresight

*You can configure Foresight comments in your organization settings page.

@swiatekm-sumo
Copy link

I'm not sure I like the names. My immediate assumption for ErrorMode set to Drop was that it would drop data on error, and that Send would send the data ahead.

Instead of Drop, I like Ignore, as I think it's impossible to misinterpret in this context. The other option is a more difficult, maybe something like Propagate?

@TylerHelmuth
Copy link
Member Author

@swiatekm-sumo we could also consider naming the configuraiton on_error, which is what pkg/stanza uses.

@evan-bradley
Copy link
Contributor

I agree that we should reconsider the names for Drop and Send, since these verbs could also apply to telemetry. I could see it as meaning "on error, drop/send telemetry", which would be the opposite behavior of "on error, drop/send the error".

One idea could be to make the terms more explicit, such as DropErrors and SendErrors. Granted, on_error: drop_errors in the configuration feels a little redundant. Other possible terms could be "fatal" and "nonfatal" or "stop execution" and "continue execution". I think "ignore" and "propagate" are good options as well.

@swiatekm-sumo
Copy link

One idea could be to make the terms more explicit, such as DropErrors and SendErrors. Granted, on_error: drop_errors in the configuration feels a little redundant. Other possible terms could be "fatal" and "nonfatal" or "stop execution" and "continue execution". I think "ignore" and "propagate" are good options as well.

I don't think on_error: drop_error is even that bad. Definitely better than on_error: drop, which again makes it sound like we're dropping data, and this is exactly how stanza behaves. I'd personally be fine with stanza's terminology, though I still think ignore is the less ambiguous word in this context.

@TylerHelmuth
Copy link
Member Author

this is exactly how stanza behaves.

whoops, I shoulda looked at stanza a little closer. I'll update to ignore and propagate.

Comment on lines 58 to 62
if s.errorMode == PropagateError {
return nil, false, err
}
s.telemetrySettings.Logger.Error("error executing condition", zap.Error(err))
return nil, false, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This to me looks like unnecessary here. The caller can decide what to do with the error. If a statement fail at any point there is not like we can continue executing that statement, so we need to stop and I would say return an error.

The caller of Execute can determine what they want to do.

I fail to understand where this extra code is useful for a caller?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would see this useful in a "type Statements struct {statements []Statement}" that executes multiple statements to ignore the error from a previous statement and continue with next one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bogdandrutu I believe main advantage of adding error handling like this to the OTTL is that is allows us to set some expectations when using OTTL.

By including error handling logic to OTTL it becomes much easier to enforce the concept of "open erroring" within the package itself, which then allows us to make more exact statements about how functions should handle errors. By including this type of error handling we can make a statement like "functions should always return any failed type assertions as errors", while ensuring that OTTL will gracefully handle the error.

It also has the benefit of providing a default error handling strategy for components that use OTTL. If they want to always keep processing, they don't have to do anything extra.

We don't provide a type Statements struct {statements []Statement} struct, but I can see how this error handling fits nicely with that struct. I'll play with that as part of this PR and update the implementing components to use it in a future PR.

A struct like that would help with the future problem I foresee where an implementing component, like the transformprocessor, wants to configure not only the ability to skip to the next statement on error, but also skip to the next TransformContext and halting (#16519 (comment))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bogdandrutu add new Statements struct and moved the error handling there. In a followup PR I''l update ParseStatements to return a Statements struct. This will work for all of the components in contrib that use OTTL except for routingprocessor. For the routingprocessor, which only needs to execute a single statement and wants the results, condition, and error, I'll add a new ParseStatement function that just returns a Statement. What do you think?

Copy link
Member Author

@TylerHelmuth TylerHelmuth Feb 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since Parser will end up with the ability to parse a list of statements and a single statement, I think ErrorMode will be more appropriate as an arg/option on ParseStatements instead of the whole Parser struct. But I could see either being ok.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If anyone is curious, I played around with the result and I like the implications: a6c2f19

pkg/ottl/parser.go Outdated Show resolved Hide resolved
@TylerHelmuth TylerHelmuth added the Skip Changelog PRs that do not require a CHANGELOG.md entry label Feb 1, 2023
@github-actions github-actions bot removed the request for review from swiatekm-sumo February 1, 2023 21:06
Copy link
Member

@bogdandrutu bogdandrutu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the direction. Some unknowns for me are how you intend to let user get access to a Statements instance.

@TylerHelmuth TylerHelmuth added the ready to merge Code review completed; ready to merge by maintainers label Feb 3, 2023
@bogdandrutu bogdandrutu merged commit f5e42bf into open-telemetry:main Feb 3, 2023
@TylerHelmuth TylerHelmuth deleted the ottl-error-handling branch February 3, 2023 19:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pkg/ottl ready to merge Code review completed; ready to merge by maintainers Skip Changelog PRs that do not require a CHANGELOG.md entry
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants