
AWSEMFExporter - Add EMF Exporter to support exporting metrics to AWS CloudWatch #498

Merged
merged 31 commits on Sep 26, 2020

Conversation

Contributor

@shaochengwang shaochengwang commented Jul 23, 2020

Description:

This PR introduces an exporter for the AWS CloudWatch service. The exporter works by translating metrics into the Embedded Metric Format (EMF), which makes it possible to ingest complex, high-cardinality application data in the form of logs and to generate actionable metrics from them.
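For reference, a minimal Go sketch of the kind of record this translation produces, assuming only the public EMF specification; the namespace, dimension, and metric names below are made up:

package main

import (
	"encoding/json"
	"fmt"
	"time"
)

func main() {
	// Metric metadata lives under the "_aws" key; dimension values and
	// metric values appear as ordinary top-level log fields.
	record := map[string]interface{}{
		"_aws": map[string]interface{}{
			"Timestamp": time.Now().UnixNano() / int64(time.Millisecond),
			"CloudWatchMetrics": []map[string]interface{}{
				{
					"Namespace":  "MyApplication",              // hypothetical namespace
					"Dimensions": [][]string{{"ServiceName"}},  // dimension sets
					"Metrics": []map[string]string{
						{"Name": "latency_ms", "Unit": "Milliseconds"},
					},
				},
			},
		},
		"ServiceName": "checkout",
		"latency_ms":  42.0,
	}

	out, _ := json.Marshal(record)
	// Each such JSON line becomes one CloudWatch Logs event.
	fmt.Println(string(out))
}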

Testing:

Unit tests cover the emf_exporter, the cloudwatchlogs client, the pusher, the publisher, and the request handler.
Manually tested the end-to-end functionality on the AWS CloudWatch console.

Documentation:

A README is included. It explains the awsemfexporter's data conversion, configuration, and AWS credentials.

Notes:
Split the original PR into several pieces. The current one only contains the very basic functionality: get metrics from the pipeline, translate them into EMF format, and publish to CloudWatch. It doesn't do any batching or queuing. The feature to correlate the CloudWatch log group name and CloudWatch metrics namespace is also not included in this version. These features will be added in subsequent PRs.
Although there are 4783 lines added in this PR, most of them are dependency changes and unit tests. The source code change is around 1200 lines.

@shaochengwang shaochengwang requested a review from a team as a code owner July 23, 2020 18:37

linux-foundation-easycla bot commented Jul 23, 2020

CLA Check
The committers are authorized under a signed CLA.

@project-bot project-bot bot added this to In progress in Collector Jul 23, 2020
Member
@bogdandrutu bogdandrutu left a comment

It will be almost impossible for one of the maintainers to review 5000+ lines. Consider splitting this PR into more reviewable PRs; as a rule of thumb, 500 lines is already considered a large PR.

// CloudWatch metrics namespace
Namespace string `mapstructure:"namespace"`
// CWLogs service endpoint
Endpoint string `mapstructure:"endpoint"`
Member

Can you use one of the confignet addresses?

Contributor
@anuraaga anuraaga left a comment

Did a quick skim - agree it'd be good to split this up, perhaps something like

  1. Initialize basic connection to cloudwatch
  2. Add translation of data formats
  3. Wire up translation to basic connection
  4. Add advanced connection features (STS)

Also, this will probably happen naturally with the above structure, but there seems to be an overcoupling of the sending logic and the translation logic. I think keeping the translation logic separate makes the code easier to reason about; see our translator package for the X-Ray exporter: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/master/exporter/awsxrayexporter/translator/sql.go

Comment on lines 13 to 14
The following exporter configuration parameters are supported. They mirror and have the same affect as the
comparable AWS X-Ray Daemon configuration values.
Contributor

Suggested change
The following exporter configuration parameters are supported. They mirror and have the same affect as the
comparable AWS X-Ray Daemon configuration values.
The following exporter configuration parameters are supported.

Don't think the x-ray daemon is relevant to readers of this file.

// Config defines configuration for AWS EMF exporter.
type Config struct {
configmodels.ExporterSettings `mapstructure:",squash"` // squash ensures fields are correctly decoded in embedded struct.
// LogGroupName
Contributor

Seems to be missing descriptions for these config fields. I think the receiver is a good example of the level of detail we should include here:

https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/master/receiver/awsxrayreceiver/config.go
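A rough sketch of what documented fields could look like, loosely following the awsxrayreceiver example above and assuming the collector's configmodels package referenced in the diff; the field set and wording here are illustrative, not the final config:

package awsemfexporter

import "go.opentelemetry.io/collector/config/configmodels"

// Config defines configuration for the AWS EMF exporter.
type Config struct {
	configmodels.ExporterSettings `mapstructure:",squash"` // squash ensures fields are correctly decoded in embedded struct.

	// LogGroupName is the CloudWatch Logs group the EMF log events are written to.
	LogGroupName string `mapstructure:"log_group_name"`

	// LogStreamName is the CloudWatch Logs stream within that group.
	LogStreamName string `mapstructure:"log_stream_name"`

	// Namespace is the CloudWatch metrics namespace under which the extracted metrics appear.
	Namespace string `mapstructure:"namespace"`

	// Endpoint optionally overrides the default CloudWatch Logs service endpoint.
	Endpoint string `mapstructure:"endpoint"`
}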

@@ -0,0 +1,52 @@
package mapWithExpiry
Contributor

Filename has typo

}

func (u *NonBlockingFifoQueue) Dequeue() (interface{}, bool) {
u.Lock()
Contributor

It's not non-blocking if you lock

)

// It is a FIFO queue with the functionality that dropping the front if the queue size reaches to the maxSize
type NonBlockingFifoQueue struct {
Contributor

Just a thought, I didn't look in detail, but at first glance it seems strange to have a queue in the exporter. Shouldn't we be able to use the pipeline for queueing, batching, etc?


codecov bot commented Aug 4, 2020

Codecov Report

Merging #498 into master will increase coverage by 1.28%.
The diff coverage is 95.43%.


@@            Coverage Diff             @@
##           master     #498      +/-   ##
==========================================
+ Coverage   87.97%   89.25%   +1.28%     
==========================================
  Files         251      260       +9     
  Lines       12012    12700     +688     
==========================================
+ Hits        10568    11336     +768     
+ Misses       1109     1007     -102     
- Partials      335      357      +22     
Flag Coverage Δ
#integration 75.42% <ø> (?)
#unit 88.34% <95.43%> (+0.36%) ⬆️


Impacted Files Coverage Δ
exporter/awsemfexporter/metric_translator.go 92.45% <92.45%> (ø)
exporter/awsemfexporter/conn.go 93.47% <93.47%> (ø)
exporter/awsemfexporter/pusher.go 96.49% <96.49%> (ø)
exporter/awsemfexporter/emf_exporter.go 97.53% <97.53%> (ø)
exporter/awsemfexporter/cwlog_client.go 100.00% <100.00%> (ø)
exporter/awsemfexporter/factory.go 100.00% <100.00%> (ø)
...fexporter/handler/request_structuredlog_handler.go 100.00% <100.00%> (ø)
...er/awsemfexporter/mapwithexpiry/map_with_expiry.go 100.00% <100.00%> (ø)
... and 12 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f9082de...4cfefe8.

@shaochengwang shaochengwang changed the title Add EMF Exporter to support exporting metrics to AWS CloudWatch AWSEMFExporter - Add EMF Exporter to support exporting metrics to AWS CloudWatch Aug 5, 2020
@shaochengwang
Contributor Author

Split the original PR into several pieces. The current one only contains the very basic functionality: get metrics from the pipeline, translate them into EMF format, and publish to CloudWatch. It doesn't include any batching or queuing logic. The feature to correlate the CloudWatch log group name and CloudWatch metrics namespace is not included in this version either. These features will be added in subsequent PRs.
Although there are 4783 lines added in this PR, most of them are dependency changes and unit tests. The source code change is around 1200 lines.

Tagging @alolita on the AWS side for her visibility on this PR.


const (
// this is the retry count, the total attempts would be retry count + 1 at most.
defaultRetryCount = 5
Member

Default retries should be 1. This is usually too many retries before dropping the data for observability functionality that is not on the main execution path. In addition, this many retries will likely trigger throttling by the CloudWatch API.

default:
// ThrottlingException is handled here because the type cloudwatch.ThrottlingException is not yet available in public SDK
// Retry request if ThrottlingException happens
if awsErr.Code() == ErrCodeThrottlingException {
Member

Retrying during throttling will never catch up. Just drop the data at this point. Possibly even the entire sequence.

}

//sleep some back off time before retries.
func backoffSleep(i int) {
Member

Sleeping this long will never allow catchup

retryCnt: *awsConfig.MaxRetries,
logger: logger,
}
if config.(*Config).ForceFlushInterval > 0 {
Member

Upstream processors in the collector, such as batch_retry, should handle retries and queuing. The exporter should not need a timer; just push any data received immediately. The only thing that needs to be handled is making sure the maximum number of records per put is not exceeded. You will likely need a loop for that; see the X-Ray exporter.
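A minimal sketch of that loop, assuming the documented CloudWatch Logs limit of 10,000 events per PutLogEvents call and a hypothetical putBatch helper (the 1 MB payload limit is ignored here):

package awsemfexporter

import "github.com/aws/aws-sdk-go/service/cloudwatchlogs"

const maxEventsPerPut = 10000 // PutLogEvents accepts at most 10,000 events per request

// pushAll sends everything it is given immediately, chunking only to stay
// under the per-request maximum instead of batching on a timer.
func pushAll(events []*cloudwatchlogs.InputLogEvent, putBatch func([]*cloudwatchlogs.InputLogEvent) error) error {
	for len(events) > 0 {
		n := len(events)
		if n > maxEventsPerPut {
			n = maxEventsPerPut
		}
		if err := putBatch(events[:n]); err != nil {
			return err
		}
		events = events[n:]
	}
	return nil
}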


//Put log events. The method mainly handles different possible error could be returned from server side, and retries them
//if necessary.
func (client *CloudWatchLogClient) PutLogEvents(input *cloudwatchlogs.PutLogEventsInput, retryCnt int) *string {
Member

Return errors to the collector pipeline instead of handling everything here. Requests that should not be retried (HTTP 4xx) must be wrapped in consumererror.Permanent.
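A rough sketch of that error-classification pattern, assuming the collector's consumererror package and aws-sdk-go's awserr types:

package awsemfexporter

import (
	"errors"

	"github.com/aws/aws-sdk-go/aws/awserr"
	"go.opentelemetry.io/collector/consumer/consumererror"
)

// classifyPutError hands errors back to the pipeline instead of retrying
// inside the exporter, marking 4xx responses as permanent so they are
// dropped rather than retried.
func classifyPutError(err error) error {
	if err == nil {
		return nil
	}
	var reqErr awserr.RequestFailure
	if errors.As(err, &reqErr) && reqErr.StatusCode() >= 400 && reqErr.StatusCode() < 500 {
		return consumererror.Permanent(err)
	}
	// Throttling, 5xx, and network errors go back as-is for the
	// pipeline's retry/queue processors to handle.
	return err
}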

Collector automation moved this from In progress to Review in progress Aug 19, 2020

// Possible exceptions are combination of common errors (https://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/CommonErrors.html)
// and API specific erros (e.g. https://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/API_PutLogEvents.html#API_PutLogEvents_Errors)
type CloudWatchLogClient struct {
Member

Consider making the struct internal.

)

// Factory is the factory for AWS EMF exporter.
type Factory struct {
Member

Consider using the helper factory.

…Add the pusher to check the batched payload size.
@tigrannajaryan
Member

@shaochengwang can you split the PR as @bogdandrutu asked?

@alolita
Member

alolita commented Sep 10, 2020

@tigrannajaryan @bogdandrutu - Ack will follow up w @shaochengwang

@shaochengwang
Contributor Author

Thanks a lot for reviewing this PR. I have tried to make this feature as simple as possible in the second revision; splitting it further may break its logic and functionality. Although it has 6000+ lines, most of them are the dependency list and unit test files, so the source code change should not be huge.

Comment on lines 17 to 18
| `log_group_name` | Customized log group name | |
| `log_stream_name` | Customized log stream name | |
Member

Can we fill in the default values for these configurations?

@@ -107,6 +108,7 @@ func components() (component.Factories, error) {
&splunkhecexporter.Factory{},
elasticexporter.NewFactory(),
&alibabacloudlogserviceexporter.Factory{},
&awsemfexporter.Factory{},
Member

Suggest updating to the NewFactory() interface.

Contributor Author

I've changed it in the latest version.

e.Message(),
e.Error(),
e))
backoffSleep(i)
Member
@mxiamxia mxiamxia Sep 16, 2020

Should we wait if the error type is InvalidSequenceTokenException? It seems we can retry right away with the new token returned.

Contributor Author

Good point.
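A minimal sketch of that retry-with-returned-token idea, assuming aws-sdk-go's cloudwatchlogs types; how many times to retry is left out here:

package awsemfexporter

import (
	"errors"

	"github.com/aws/aws-sdk-go/service/cloudwatchlogs"
	"github.com/aws/aws-sdk-go/service/cloudwatchlogs/cloudwatchlogsiface"
)

// putWithTokenRecovery resends the request immediately when the service
// reports InvalidSequenceTokenException, reusing the expected token carried
// in the error instead of sleeping through a backoff.
func putWithTokenRecovery(svc cloudwatchlogsiface.CloudWatchLogsAPI, input *cloudwatchlogs.PutLogEventsInput) (*cloudwatchlogs.PutLogEventsOutput, error) {
	out, err := svc.PutLogEvents(input)
	var tokenErr *cloudwatchlogs.InvalidSequenceTokenException
	if errors.As(err, &tokenErr) && tokenErr.ExpectedSequenceToken != nil {
		input.SequenceToken = tokenErr.ExpectedSequenceToken
		return svc.PutLogEvents(input)
	}
	return out, err
}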

Comment on lines 45 to 48
//The file name where this log event comes from
FileName string
//The offset for the input file
FilePosition int64
Member

May not be needed?


// Shutdown stops the exporter and is invoked during shutdown.
func (emf *emfExporter) Shutdown(ctx context.Context) error {
return nil
Member

Can we force a flush of the unpublished batched log events (in the pushers) here?

Member
@mxiamxia mxiamxia left a comment

We tested this implementation within AWS test environments. @kbrockhoff, please help us review the PR. Thanks.

@bogdandrutu
Member

@anuraaga please review; we heard that this is urgent and we don't have time to review 5K-line PRs. But I trust that you, as one of the future maintainers of this component, can do this.

This exporter converts OpenTelemetry metrics to
[AWS CloudWatch Embedded Metric Format(EMF)](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_Specification.html)
and then sends them directly to CloudWatch Logs using the
[PutLogEvents](https://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/API_PutLogEvents.html) API.
Contributor

Not really for this PR, but is it easy to support both sending via the API and writing to STDOUT? I guess STDOUT can help in many cases too.

| `no_verify_ssl` | Enable or disable TLS certificate verification. | false |
| `proxy_address` | Upload Structured Logs to AWS CloudWatch through a proxy. | |
| `region` | Send Structured Logs to AWS CloudWatch in a specific region. | determined by metadata |
| `local_mode` | Local mode to skip EC2 instance metadata check. | false |
Contributor

Can we remove local_mode? I know our other exporter has it, but I want to consider removing it there too - I had trouble explaining this flag to a user before. We use it to control returning errors, but I don't think that's actually a good idea. I think we also use it to flag whether to fetch region info, but it seems fine to just try in any case instead of having this confusing flag.

Contributor Author

Yes, will remove this local_mode.

| `endpoint` | Optionally override the default CloudWatch service endpoint. | |
| `no_verify_ssl` | Enable or disable TLS certificate verification. | false |
| `proxy_address` | Upload Structured Logs to AWS CloudWatch through a proxy. | |
| `region` | Send Structured Logs to AWS CloudWatch in a specific region. | determined by metadata |
Contributor

Document the AWS_REGION environment variable here, since it's otherwise confusing that we have another mechanism for setting the region.

)

// GetAWSConfigSession returns AWS config and session instances.
func GetAWSConfigSession(logger *zap.Logger, cn connAttr, cfg *Config) (*aws.Config, *session.Session, error) {
Contributor

Whoever merges all the AWS conn.go files into one under internal/aws will get a prize

)

const (
// this is the retry count, the total attempts would be retry count + 1 at most.
Contributor

Suggested change
// this is the retry count, the total attempts would be retry count + 1 at most.
// this is the retry count, the total attempts will be at most retry count + 1.


const (
CleanInterval = 5 * time.Minute
MinTimeDiff = 50 // We assume 50 micro-seconds is the minimal gap between two collected data sample to be valid to calculate delta
Contributor

MinTimeDiffMicros

Contributor

Alternatively, stick with the time package: 50 * time.Microsecond.

serviceNamespace, svcNsOk := attributes[conventions.AttributeServiceNamespace]

if svcNameOk {
svcAttrMode = ServiceNameOnly
Contributor

This svcAttrMode seems unnecessarily complex; isn't it just

if svcNameOk && svcNsOk {
  namespace = fmt....
} else if svcNsOk {
  namespace = serviceNamespace
} else if svcNameOk {
  namespace = serviceName
}

totalDroppedMetrics += len(metric.GetTimeseries())
continue
}
//TODO: Handle OTLib as a dimension when it's supported
Contributor

I think we're missing this since we converted to OC, but we should be sticking to the native OTel format


pusher := newPusher(logGroupName, logStreamName, svcStructuredLog)

// For blocking queue, assuming the log batch payload size is 1MB. Set queue size to 2
Contributor

Is this comment in the wrong place? No queue size here

//* None of the log events in the batch can be older than 14 days or the
//retention period of the log group.
currentTime := time.Now().UTC()
utcTime := time.Unix(0, *logEvent.InputLogEvent.Timestamp*1e6).UTC()
Contributor

Think you can use something from the time package instead of the magic number 1e6.

…ove logging with zap logger, remove unnecessary metrics transformation to OCMetrics
Contributor
@anuraaga anuraaga left a comment

Found a couple more small but important points, but they should be straightforward. Just a bit more, thanks!


if response != nil {
if response.RejectedLogEventsInfo != nil {
rejectedLogEventsInfo := response.RejectedLogEventsInfo
Contributor

Reminder that later we'll need metrics for these sorts of failures.

err = wrapErrorIfBadRequest(&returnError)
}
if err != nil {
emf.logger.Error("Experiences some errors when gracefully shutting down emf_exporter. Skipping to next pusher.", zap.Error(err))
Contributor

Suggested change
emf.logger.Error("Experiences some errors when gracefully shutting down emf_exporter. Skipping to next pusher.", zap.Error(err))
emf.logger.Error("Error when gracefully shutting down emf_exporter. Skipping to next pusher.", zap.Error(err))


func (logEvent *LogEvent) truncateIfNeeded() bool {
if logEvent.eventPayloadBytes() > MaxEventPayloadBytes {
log.Printf("W! logpusher: the single log event size is %v, which is larger than the max event payload allowed %v. Truncate the log event.", logEvent.eventPayloadBytes(), MaxEventPayloadBytes)
Contributor

Oh, can you make sure to use the zap logger throughout the PR? We don't use standard library logging in the collector.

Contributor Author

Will do. Thanks.
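A small sketch of what that log.Printf could become with the zap logger; the field names here are illustrative:

package awsemfexporter

import "go.uber.org/zap"

// warnOversizedEvent is a zap-based version of the warning above about a
// single log event exceeding the maximum payload before truncation.
func warnOversizedEvent(logger *zap.Logger, payloadBytes, maxBytes int) {
	logger.Warn("logpusher: single log event is larger than the maximum allowed payload; truncating",
		zap.Int("eventPayloadBytes", payloadBytes),
		zap.Int("maxEventPayloadBytes", maxBytes),
	)
}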

//retention period of the log group.
currentTime := time.Now().UTC()
utcTime := time.Unix(0, *logEvent.InputLogEvent.Timestamp*int64(time.Millisecond)).UTC()
duration := currentTime.Sub(utcTime).Hours()
Contributor

Is it OK to round the hours? How about converting 24*14 to a duration instead?

Contributor

Should probably have constants for the durations for the staleness check

Contributor Author

Agree it's better to use constants here.
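A rough sketch of the constants and conversion being discussed, assuming the documented CloudWatch Logs limits of 14 days in the past and 2 hours in the future for event timestamps:

package awsemfexporter

import "time"

const (
	maxPastEventAge   = 14 * 24 * time.Hour // events older than this are rejected
	maxFutureEventAge = 2 * time.Hour       // events further in the future are rejected
)

// eventTimestampValid converts the millisecond timestamp with
// time.Millisecond instead of a magic number and compares the event age
// against the named duration constants.
func eventTimestampValid(tsMillis int64, now time.Time) bool {
	eventTime := time.Unix(0, tsMillis*int64(time.Millisecond))
	age := now.Sub(eventTime)
	return age <= maxPastEventAge && age >= -maxFutureEventAge
}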

Contributor
@anuraaga anuraaga left a comment

Thanks!

Collector automation moved this from Review in progress to Reviewer approved Sep 25, 2020
@bogdandrutu bogdandrutu merged commit 434bfda into open-telemetry:master Sep 26, 2020
Collector automation moved this from Reviewer approved to Done Sep 26, 2020
dyladan referenced this pull request in dynatrace-oss-contrib/opentelemetry-collector-contrib Jan 29, 2021
This adds a processor that drops data according to configured memory limits.
The processor is important for high load situations when receiving rate exceeds exporting
rate (and an extreme case of this is when the target of exporting is unavailable).

Typical production run will need to have this processor included in every pipeline
immediately after the batch processor.