
feature: add bytes limit for a complete log #17387

Closed
yutingcaicyt opened this issue Jan 5, 2023 · 22 comments

@yutingcaicyt
Contributor

Component(s)

pkg/stanza

Is your feature request related to a problem? Please describe.

version: v0.61.0
While using the recombine operator to aggregate log entries, I found some problems:

  1. There is a "max_batch_size" parameter limiting the number of entries in a complete log; however, I think a bytes limit is what users actually want to set. With "max_batch_size" alone, it is hard to limit the maximum byte size of a complete log.
  2. When "max_batch_size" is reached, the entries in the batch are aggregated and flushed, and the batch becomes empty. If entries belonging to the previous log then come in, they are flushed individually. I think the better way is to aggregate these remaining entries together, because they originally belong to one complete log.

Describe the solution you'd like

  1. Add a config parameter "max_log_size" to the recombine operator. This parameter sets the maximum byte size of a log: if the byte size of the entries in a batch reaches the limit, all entries in the batch are aggregated and flushed (see the sketch below). With this parameter, I think the old "max_batch_size" becomes unnecessary and could be removed.
  2. Once the size of a batch reaches the limit and all entries in the batch are flushed, the "truncate" info needs to be recorded. If entries belonging to the previous log then come in, the "truncate" info tells us these entries should be aggregated.
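To make the proposal concrete, here is a minimal sketch of how such a configuration might look inside a filelog receiver, assuming the parameter keeps the proposed name max_log_size and takes a byte count; the include path and the is_first_entry expression are placeholders, not part of the proposal:

```yaml
receivers:
  filelog:
    include: [ /var/log/app/*.log ]   # placeholder path
    operators:
      - type: recombine
        combine_field: body
        # example expression: treat non-indented lines as the start of a new log
        is_first_entry: body matches "^[^\\s]"
        # proposed: flush the batch once the combined entries reach this many bytes
        max_log_size: 131072
```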

Describe alternatives you've considered

No response

Additional context

No response

@yutingcaicyt yutingcaicyt added enhancement New feature or request needs triage New item requiring triage labels Jan 5, 2023
@yutingcaicyt
Contributor Author

@djaglowski Any idea about this feature?

@fatsheep9146 fatsheep9146 added pkg/stanza and removed needs triage New item requiring triage labels Jan 10, 2023
@github-actions
Contributor

Pinging code owners for pkg/stanza: @djaglowski. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@djaglowski
Member

when the "max_batch_size" is reached... At this time, if some entries which belong to the previous log come in, they would be flushed individually.

This looks like a separate issue from the max_log_size proposal. Can it be addressed separately?

@djaglowski
Member

Add a config parameter "max_log_size" to the recombine operator. This parameter sets the maximum byte size of a log: if the byte size of the entries in a batch reaches the limit, all entries in the batch are aggregated and flushed.

Sounds reasonable. The immediate question is whether or not it should be the concern of the recombine operator to truncate the final line before batching. It's not clear to me that it should, but your proposal indicates otherwise. I may need some help understanding the considerations here.

@djaglowski
Member

With this parameter, I think the old "max_batch_size" becomes unnecessary and could be removed.

I don't see why this would be useless. Perhaps max_log_size would be better, but some may still prefer to use this. In any case, I don't see why a user couldn't choose one or the other, or even both options.

@djaglowski
Member

the "truncate" info needs to be recorded. And if some entries belonging to the previous log come in, then the "truncate" info can tell us these entries should be aggregated.

If I'm understanding correctly, you are suggesting that the truncated data should become the first line in the next batch. Is that right?

@djaglowski
Member

Regarding truncation, an alternative would be to drop truncated data and any additional lines until is_first_entry/is_last_entry is satisfied. I think your suggestion is preferable though.

@yutingcaicyt
Contributor Author

when the "max_batch_size" is reached... At this time, if some entries which belong to the previous log come in, they would be flushed individually.

This looks like a separate issue from the max_log_size proposal. Can it be addressed separately?

yeah, these two issues can be addressed separately.

@yutingcaicyt
Contributor Author

Add a config parameter "max_log_size" to the recombine operator. This parameter sets the maximum byte size of a log: if the byte size of the entries in a batch reaches the limit, all entries in the batch are aggregated and flushed.

Sounds reasonable. The immediate question is whether or not it should be the concern of the recombine operator to truncate the final line before batching. It's not clear to me that it should, but your proposal indicates otherwise. I may need some help understanding the considerations here.

Firstly, I think the best time to truncate the log is before the entries are aggregated, because doing it after aggregation would involve some redundant work.
Secondly, we use the recombine operator to aggregate entries into a complete log, so adding the truncation to the recombine operator is intuitive. In this case, users just need to add a config option to the recombine operator instead of adding a new operator.

@yutingcaicyt
Contributor Author

With this parameter, I think the old "max_batch_size" becomes unnecessary and could be removed.

I don't see why this would be useless. Perhaps max_log_size would be better, but some may still prefer to use this. In any case, I don't see why a user couldn't choose one or the other, or even both options.

If users want to limit the byte size of a log, 'max_log_size' is precise and simple, and it fully replaces 'max_batch_size'. Would users use 'max_batch_size' to limit anything else?

@yutingcaicyt
Contributor Author

the "truncate" info needs to be recorded. And if some entries belonging to the previous log come in, then the "truncate" info can tell us these entries should be aggregated.

If I'm understanding correctly, you are suggesting that the truncated data should become the first line in the next batch. Is that right?

Actually, this solution addresses the second problem I mentioned: once a log is truncated, an 'isTruncated' mark would be recorded, and when the remaining entries belonging to the previous log come in, the mark tells us that a truncation happened at the last flush, so we can aggregate them together.

@yutingcaicyt
Contributor Author

Regarding truncation, an alternative would be to drop truncated data and any additional lines until is_first_entry/is_last_entry is satisfied. I think your suggestion is preferable though.

Yeah, I also considered this alternative, but I think dropping the data outright may not be a good choice for an agent. Whether such a truncated log is useful should be left to the users to judge.

@djaglowski
Member

djaglowski commented Jan 11, 2023

If users want to limit the byte size of a log, 'max_log_size' is precise and simple, and it fully replaces 'max_batch_size'. Would users use 'max_batch_size' to limit anything else?

We do have to consider existing users. The filelog receiver is in beta which means at least that we will try very hard to support backwards compatibility in the configuration. (More info)

I agree max_log_size is a generally superior way to define a limitation. That said, I can imagine some scenarios where max_batch_size is still useful. For example, a custom log format may use a fixed number of lines but not have a well defined pattern for is_first_entry or is_last_entry. In any case, we may be able to deprecate it eventually but I think in the near term we'll need to retain support for it for existing users.

Edit: One additional consideration here is that max_log_size may be more compute- or memory-intensive. There may be users who value performance but are not constrained by a specific number of bytes.
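For illustration, here is a hypothetical sketch of the two options coexisting, with max_batch_size as a cheap cap on the number of combined lines and max_log_size as the byte bound; the expression and values are placeholders, not recommendations:

```yaml
operators:
  - type: recombine
    combine_field: body
    # placeholder: lines starting with a date begin a new log
    is_first_entry: body matches "^\\d{4}-\\d{2}-\\d{2}"
    max_batch_size: 100      # existing option: cap on the number of combined lines
    max_log_size: 1048576    # proposed option: cap on the combined size in bytes
```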

@djaglowski
Member

once a log is truncated, an 'isTruncated' mark would be recorded, and when the remaining entries belonging to the previous log come in, the mark tells us that a truncation happened at the last flush, so we can aggregate them together

I agree with the overall functionality but I think we should explore in the implementation whether we really need the mark. It may be enough to just truncate, flush, and insert the remainder as the first line in the new batch.

@djaglowski
Member

dropping the data outright may not be a good choice for an agent. Whether such a truncated log is useful should be left to the users to judge.

I'm good with this. If it proves necessary in the future, it should be easy enough to add a configuration flag for this.

@Aneurysm9
Member

How does limiting to a maximum size in bytes work? In what representation is the size measured? Does this require serialization and length checking the serialized data and then throwing away the serialized data?

@yutingcaicyt
Contributor Author

If users want to limit the byte size of a log, 'max_log_size' is precise and simple, and it fully replaces 'max_batch_size'. Would users use 'max_batch_size' to limit anything else?

We do have to consider existing users. The filelog receiver is in beta which means at least that we will try very hard to support backwards compatibility in the configuration. (More info)

I agree max_log_size is a generally superior way to define a limitation. That said, I can imagine some scenarios where max_batch_size is still useful. For example, a custom log format may use a fixed number of lines but not have a well defined pattern for is_first_entry or is_last_entry. In any case, we may be able to deprecate it eventually but I think in the near term we'll need to retain support for it for existing users.

Edit: One additional consideration here is that max_log_size may be more compute- or memory-intensive. There may be users who value performance but are not constrained by a specific number of bytes.

OK, I agree with that.

@yutingcaicyt
Contributor Author

once a log is truncated, an 'isTruncated' mark would be recorded, and when the remaining entries belonging to the previous log come in, the mark tells us that a truncation happened at the last flush, so we can aggregate them together

I agree with the overall functionality but I think we should explore in the implementation whether we really need the mark. It may be enough to just truncate, flush, and insert the remainder as the first line in the new batch.

There is another point that needs to be discussed: when the byte size of the entries in a batch reaches the limit, how do we deal with those entries? I think there are two ways:

  1. Limit the byte size strictly: split the last entry into two parts, flush the previous entries together with the first part, then insert the remaining part into the new batch.
  2. Guarantee the integrity of the log and limit the byte size softly: just flush all the entries in the batch. In this case, the byte size of a log may be greater than max_log_size.

Both approaches are common when applying a size limit.

@yutingcaicyt
Contributor Author

yutingcaicyt commented Jan 12, 2023

How does limiting to a maximum size in bytes work? In what representation is the size measured? Does this require serialization and length checking the serialized data and then throwing away the serialized data?

I think if we limit the byte size in the recombine operator, there is no serialization required. When the data reaches the recombine operator from the file reader, the body of the entry is a string or a byte array, whose byte size can be obtained easily. Are there any issues I missed in this process?

@yutingcaicyt
Contributor Author

@djaglowski Hello, I would like to work on this feature. Is there anything I should know before starting (besides the contributing guidelines)? Thank you

djaglowski pushed a commit that referenced this issue Feb 2, 2023
…ig to recombine operator (#17387) (#18089)

"max_log_size" is added to the config of recombine operator. Once the total bytes size of the combined field exceeds the limit, all received entries of the source will be combined and flushed.

We've had discussion around this feature (see #17387). Here I chose the "soft limit" rather than truncating the log to a fixed byte size when the total size exceeds the limit, because I think a soft limit is enough for users in this situation.
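Based on the commit description, a minimal sketch of how the merged option might be used; the value and expression shown here are assumptions, and the comment spells out the soft-limit semantics described above:

```yaml
operators:
  - type: recombine
    combine_field: body
    is_first_entry: body matches "^[^\\s]"   # placeholder expression
    # soft limit: once the combined field reaches this size the batch is flushed,
    # so an individual log can end up somewhat larger than this value
    max_log_size: 65536
```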
@github-actions
Contributor

github-actions bot commented Apr 3, 2023

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Apr 3, 2023
@github-actions
Contributor

github-actions bot commented Jun 2, 2023

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions github-actions bot closed this as not planned Jun 2, 2023