
Break up large bulk requests into multiple requests. #497

Closed
wants to merge 3 commits

Conversation

andrewvc
Contributor

@andrewvc andrewvc commented Oct 26, 2016

This PR changes the HTTP Client to try as best as it can to restrict bulk requests to 20MB or less. If a single document is larger than this it will be sent alone in a separate request.

This PR also changes the flush_size to be effectively infinite. If users have explicitly set the flush_size it will still work until that option is made obsolete.

Fixes #164, #284, #382, #453, #499
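A minimal sketch of the splitting strategy the PR describes (hypothetical method name `chunk_actions`; not the plugin's actual code): serialized actions are grouped into requests of at most the target byte size, and a single document larger than the target still gets its own request.

```ruby
TARGET_BULK_BYTES = 20 * 1024 * 1024 # 20MB target per bulk request

# Split serialized actions into chunks whose combined byte size stays at or
# under the target. An oversized single document is still placed in a chunk
# by itself rather than dropped.
def chunk_actions(serialized_actions, target_bytes = TARGET_BULK_BYTES)
  chunks = [[]]
  current_size = 0
  serialized_actions.each do |doc|
    if !chunks.last.empty? && current_size + doc.bytesize > target_bytes
      chunks << []
      current_size = 0
    end
    chunks.last << doc
    current_size += doc.bytesize
  end
  chunks
end
```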

@andrewvc
Contributor Author

I'm thinking 75MB may be too high, but I set it high because if we set it too low we'll reject legit documents that are large. Maybe we need two options max_bulk_megabytes for the maximum request and target_bulk_megabytes for the size to attempt to hit. We would still send if the bulk size winds up being larger than target_bulk_megabytes and only drop / log if the doc is larger than max_bulk_megabytes. Thoughts?

@andrewvc andrewvc changed the title Add max_bulk_megabytes option [WIP] Add max_bulk_megabytes option Oct 28, 2016

# The maximum size an individual document can be without being dropped and logged
# Elasticsearch will not accept HTTP requests > 100MB. You likely do not want to increase this size
config :max_doc_megabytes, :validate => :number, :default => 90
Contributor

I feel like we don't need this setting. I think we can have the code know the max size that will be rejected by Elasticsearch and just use that internally.

Can you tell me a story of why a user would want to set this?

Contributor

andrewvc#3 moves to use bytes instead of number.

# after which it will reject the request. If the batch received by this plugin exceeds this amount
# it will be split up and sent in multiple requests. A request may exceed this size if there is a single
# document larger than it. You can limit an individual document's size with max_doc_megabytes
config :target_bulk_megabytes, :validate => :number, :default => 75
Contributor

I greatly prefer base units for settings, so bytes instead of megabytes.

I'll send you a PR to add bytes to your branch.

next
end

if (bulk_body.size + as_json.size) > @target_bulk_bytes
Contributor

@suyograo suyograo Oct 28, 2016

Do we have to call bulk_body.size every time? Maybe we memoize this size, for the in-flight bulk_body? As in, keep a running total of the current batch size. You only need to calculate size for current doc every time, and then add it to bulk_body_size. Then reset bulk_body_size when the request is complete or when an error happens. WDYT?

Contributor Author

This is basically memoized in the internal impl.
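The running-total approach suggested above can be sketched like this (hypothetical `BulkBuffer` class, assumed for illustration): the buffer tracks its byte size incrementally as documents are appended, so the full body is never re-measured, and the counter resets when a request completes or fails.

```ruby
# Accumulate a bulk body while keeping a running byte total, instead of
# recomputing the body's size on every append.
class BulkBuffer
  attr_reader :body, :size

  def initialize
    reset
  end

  def add(doc)
    @body << doc << "\n"
    @size += doc.bytesize + 1 # +1 for the trailing newline
  end

  # Call after the request is complete or when an error happens.
  def reset
    @body = +"" # unfrozen, mutable string
    @size = 0
  end
end
```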

# save for the case where a filter increases the size of an inflight batch by outputting
# events.
mod.config :flush_size, :validate => :number, :default => 500
# DEPRECATED. We now restrict bulk sizes to a target size of 20MB. Larger events
Contributor

This might scare users.

Maybe say "We now automatically send the best-sized bulk request where the size is chosen to minimize network latency and memory pressure. Internally, this means we will split a bulk request into multiple requests if they would exceed 20MB in size."

Main goal was to express the intent and also to avoid the word "restrict"

bulk_body = ""
bulk_responses = []
bulk_actions.each do |action|
as_json = action.is_a?(Array) ?
Contributor

when would action be an array?

Contributor Author

update_action_builder returns either a single action or an array of them (some bulk operations take up two lines, like indexing). Before, this was solved via a call to .flatten
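For context, an index operation in the Elasticsearch bulk API occupies two NDJSON lines (action metadata, then the document source), which is why a single logical action can serialize to an array of lines. A minimal illustration (not the plugin's actual serialization code):

```ruby
require 'json'

# One logical "index" action becomes two lines in the bulk request body.
action = [
  { "index" => { "_index" => "logs", "_type" => "doc" } },
  { "message" => "hello" }
]

bulk_lines = action.map { |line| JSON.dump(line) }.join("\n")
```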

end

# Flatten array by one level only
def flatten_one_array(arr)
Contributor

missing code?

Contributor Author

@jordansissel this is for extra flat arrays... ;)

Contributor

So flat it's zero-dimensional! Computers are cool!
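For reference, Ruby's Array#flatten already accepts a depth argument, so a hand-rolled one-level flatten behaves like `flatten(1)`. A sketch of what such a helper does (assumed implementation, for illustration):

```ruby
# Flatten nested arrays by exactly one level; deeper nesting is preserved.
def flatten_one_array(arr)
  arr.flat_map { |item| item.is_a?(Array) ? item : [item] }
end
```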

action.map {|line| LogStash::Json.dump(line)}.join("\n") :
LogStash::Json.dump(action)

if (bulk_body.size + as_json.size) > TARGET_BULK_BYTES
Contributor

should it be bytesize here?

Contributor

Good catch. My brain is still in ruby 1.8 where utf-8 strings weren't common. +1 to .bytesize
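To illustrate the distinction: on UTF-8 strings, String#size counts characters while String#bytesize counts bytes, and HTTP payload limits are measured in bytes.

```ruby
# "é" is one character but two bytes in UTF-8, so size and bytesize diverge.
s = "caf\u00e9"
char_count = s.size      # character count
byte_count = s.bytesize  # byte count; this is what matters for request limits
```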

Contributor

@jordansissel Well, it helps when your language makes things break :)

@jordansissel
Contributor

@andrewvc I just scanned a few old issues. Can you make your commits say that they fix #164, #284, #382, #453, #499

@jordansissel
Contributor

Also elastic/logstash#4000

# events.
mod.config :flush_size, :validate => :number, :default => 500
# DEPRECATED. We now restrict bulk sizes to a size that balances heap size with reducing round-trips. Very large events will be sent in a single request by themselves.
mod.config :flush_size, :validate => :number, :default => 500, :deprecated => "No Longer Needed"
Member

This PR actually removes the feature (what we call "obsolete"), instead of deprecating it. Since deprecation is out of this "Add max_bulk_megabytes option" PR's scope, can we do it in a separate one?

Contributor Author

You make a good point, but I think this is still a good use of deprecate. In this case we don't want to break configs, but we want people to stop explicitly setting this. We could leave this feature in, but that's bad because it does things no one should want to do.

I'm open to hearing a conflicting opinion here of course, but there's no good reason to set flush_size once this PR lands IMHO.

Member

I agree with everything but this being a good use of deprecate.
Tagging an option as deprecated and making it a noop can be very dangerous (remember this setting is frequently set to a non-default value). I don't feel comfortable assuming flush_size is useless for everyone and that simply swapping the flush heuristic from count to size will be a win for everyone.
It is for this lack of certainty that we have the deprecate -> obsolete cycle, so we can test the waters, provide a fallback for unexpected scenarios, and also a smoother migration.

Contributor Author

OK, well, I'm fine deprecating it and preserving old behavior for now. But I would like to remove this option before 6.0.

Contributor

@jsvd my original intent with deprecate was, at the configuration interface, that a configuration would still be loaded successfully but we would warn a user about the deprecation. With obsolete, my intent was to fail to load the config but tell the user why.

From this, I think making it :deprecated as a no-op is fine, since the config still runs.

You have a good point about the folks setting flush_size previously, though, but to be honest, all the uses I can think of are basically workarounds for perceived or actual problems with how much we send to Elasticsearch in a single request. This patch hopefully will remove those problems (perceived or actual) with Elasticsearch bulk sizes, and in doing so, will remove the need for that band-aid -- therefore, I think it's OK to immediately turn :flush_size into a no-op.

Thoughts? Am I missing a key reason that isn't a work-around for why someone would set flush_size?

Contributor

@jsvd in your favor, the most minimal user-experience change would be to mark it deprecated, set the default to nil, but keep the existing behavior if someone sets flush_size directly.

Member

+1 I'd be glad to see flush_size go away, but pretty much all production cases I've seen tune that setting in some way, which is why I'm concerned about removing it.

@andrewvc andrewvc force-pushed the fix-max-bulk-size branch 5 times, most recently from 0c0f989 to fcf99f4 Compare November 3, 2016 20:34
@andrewvc andrewvc changed the title [WIP] Add max_bulk_megabytes option Add max_bulk_megabytes option Nov 3, 2016
@@ -81,7 +81,7 @@ def self.included(mod)
# If you specify a number larger than the batch size of your pipeline it will have no effect,
# save for the case where a filter increases the size of an inflight batch by outputting
# events.
mod.config :flush_size, :validate => :number, :default => 500
mod.config :flush_size, :validate => :number, :default => 99999999, :deprecate => "This setting is no longer necessary as we now try to restrict bulk requests to sane sizes. See the 'Batch Sizes' section of the docs. If you think you still need to restrict payloads based on the number, not size, of events, please open a ticket."
Contributor

can we default to nil? A value of 9999999 looks weird (will show up in the docs) and confuses me, and may confuse users.

Contributor Author

Yeah, good point!

@@ -5,6 +5,8 @@
require 'logstash/outputs/elasticsearch/http_client/manticore_adapter'

module LogStash; module Outputs; class ElasticSearch;
TARGET_BULK_BYTES = 20 * 1024 * 1024
Contributor

Put a comment here, maybe explaining why we made this a constant instead of a tunable? My thought is that I want to avoid users yelling at us about a "bad decision" in making this not configurable (even though we agreed not to make this a configuration setting).

Contributor Author

Will add a comment!

@jordansissel jordansissel changed the title Add max_bulk_megabytes option Break up large bulk requests into multiple requests. Nov 4, 2016
@@ -81,7 +81,7 @@ def self.included(mod)
# If you specify a number larger than the batch size of your pipeline it will have no effect,
# save for the case where a filter increases the size of an inflight batch by outputting
# events.
mod.config :flush_size, :validate => :number, :default => 500
mod.config :flush_size, :validate => :number, :deprecate => "This setting is no longer necessary as we now try to restrict bulk requests to sane sizes. See the 'Batch Sizes' section of the docs. If you think you still need to restrict payloads based on the number, not size, of events, please open a ticket."
Contributor

@jordansissel
Contributor

Left one comment. Once this typo is fixed, you have my lgtm! :)

@andrewvc andrewvc dismissed jordansissel’s stale review November 4, 2016 22:46

Got his LGTM post typo

@elasticsearch-bot

Andrew Cholakian merged this into the following branches!

Branch Commits
master dc61ac1, 6d83875, 68612e6
