Timeout event generation #33
Hi @pandaadb, I have just released version 2.2.0, with a new feature: push_previous_map_as_event. You can find a good example here:
Hi @pandaadb, Several people have the same need as you. First: are you still interested in implementing this enhancement?
Hi @fbaligand, Sorry - I missed the update last time. I am definitely still interested in that. I have since continued working on the plugin and am using it in production (so it does seem to work). My branch/fork is here: https://github.com/pandaadb/logstash-filter-aggregate I have added a few extra options (which I am not sure are all needed), including:
I'm happy to discuss what could be useful/merged, if not all of them. I have learned a bit more about Ruby since starting on this, so I am hoping I didn't produce too much of a mess :)
Wow, that's a lot of options :) Concerning timeout_id: if I understand correctly, it is the field name that is added to the "timeout" generated event, right? Given your explanation, I'm not sure what you put inside it. Is it the task_id? Is it the task creation time? Anyway, could you rebase your branch, so that all your commits come after the last master commit?
Hi, yeah, timeout_id is a bit useless I think, and also confusing. It forces the task_id value to be present in the timed-out event, so that the timeout event can be matched back to the id that created it. For example: timeout_id => "hello". If an event now comes in, the filter looks at field "x", which for this example has the value "World". Then event[timeout_id] = map[task_id], which ends up looking like: event["hello"] = "World". So now the timeout event can be matched back to the start event, since we know that the field "hello" in the timeout event represents the task_id used for the start event. So in short: timeout_id is the field name used for the task_id that is used for aggregation. I am not sure what rebasing means? Do you mean create a new fork after your master and merge my branch in?
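A tiny Ruby sketch of that mapping (a plain Hash stands in for the Logstash event; the field names are the ones from the example above):

```ruby
# Sketch of the timeout_id idea using a plain Hash as a stand-in event.
timeout_id    = 'hello'   # configured field name for the timeout event
task_id_value = 'World'   # value of field "x" that keyed the aggregation

timeout_event = {}
timeout_event[timeout_id] = task_id_value

timeout_event['hello']   # => "World", so the timeout event can be matched
                         # back to the start event's task_id
```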
OK. I think an optional option called "timeout_task_id_field" is relevant. It would be set on the timeout event, and only if set in the configuration. Then I wonder: do you have an option saying that, on task timeout, the aggregation map is pushed as a new event?
First of all, I invite you to tag your present branch. Then, you can run these commands to add an upstream remote branch:
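The exact commands were not preserved in this thread; a typical sequence would look like the following (demonstrated in a throwaway repository so each step actually runs; in a real fork you would run steps 1-2 in your existing clone):

```shell
# Sketch only: the exact commands from this comment were not preserved.
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git -c user.email=a@b -c user.name=demo commit -q --allow-empty -m "wip"

# 1. Tag the present branch so no work is lost:
git tag before-rebase

# 2. Add the original plugin repository as an "upstream" remote:
git remote add upstream https://github.com/logstash-plugins/logstash-filter-aggregate.git

# 3. Fetch it and rebase (needs network access, so left commented here):
#    git fetch upstream
#    git rebase upstream/master
```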
Then, either you rebase your code on your master or, if you prefer, on a special branch that you create from your master. I warn you: you will certainly have conflicts to resolve :) Anyway, when you merge/rebase, I think you will have some conflicts to resolve around timeout event creation, because there is already some code done for that, associated with the option "push_previous_map_as_event".
Hi, I will attempt the rebase later today and get back to you. About your question: I don't have an option that says the aggregation map is pushed. Instead I have an option, timeout_code, which does the same as the regular code option, except that it is only executed on timeout. The timeout_code gets the aggregation map and a new event, so the user has full control over what to do with the aggregation map in a timeout situation (e.g. in my case I add a few fields that indicate the timeout, and aggregate some other values from within the map, so just pushing the map as an event wasn't enough for me). I could imagine the timeout_code defaulting to simply mirroring the aggregation map, so that people who have no desire to do their own operations would simply get the map pushed as a new event.
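A hedged Ruby sketch of that idea (illustrative only, not the plugin's actual internals): on timeout, user-supplied code receives the new event and the aggregation map, and decides what ends up in the timeout event.

```ruby
# Stand-in for a user's configured timeout_code (names illustrative).
timeout_code = lambda do |event, map|
  map.each { |k, v| event[k] = v }  # default behaviour: mirror the map
  event['timed_out'] = true         # user-added marker field
end

aggregation_map = { 'task_id' => '42', 'duration' => 17 }
timeout_event   = {}                # stand-in for a newly created event
timeout_code.call(timeout_event, aggregation_map)

timeout_event['timed_out']   # => true
timeout_event['duration']    # => 17
```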
Hi, I like your 'timeout_code' idea. But I see it a little bit differently :
Hi @pandaadb, What do you think about my previous comment? Concerning the last 3 options you mention, I think they could be the object of another issue/pull request.
@fbaligand Hi! If I understand you correctly, we'd want to split it into 2 phases then:
I agree with you. I think the first option is the most useful. The second is quite close to my usecase and maybe not as useful to others. One other thing: what do you think about having an option that checks timeouts on each event? By default, events can only time out after 5 seconds. My usecase (though triggered by reparsing old data) needed a much more aggressive timeout behaviour. Lastly, sorry I haven't gotten around to rebasing or coding anything yet :/ I hope I will get to it this week. Thanks!
Hi, so I have started work on this. I did the following: I tagged my branch as it was. Then I created an actual branch and committed it (so I don't lose my changes). Then I rolled back to the point for event generation (without all the extra stuff of the time tracking based on file keys etc). So now I am doing the rebase and committing. Once I have merged successfully I will make the changes so they match the properties that we discussed. Once I rebase, I assume I get your changes as well? (Push map as event) Thanks, Artur
Hi, one question from my side: I don't know what this code of yours does:
If you could elaborate on that (it almost looks like the flush-on-all-events). The track-times-based-on-external-key idea works like this. Scenario: you are reparsing 1 month's worth of data. The input plugin guarantees that (if single-threaded) each file comes in in read order, so: T1 XYZ, where T1 < T2 and so on. This works OK, since T1 is always < T2 and the timeout does not occur. However, this has 2 problems:
So when E1 comes in, the event time is T1 (which might be 1 month in the past). So the variable @@last_eviction_timestamp = Time.now becomes: @@last_eviction_timestamp = event['my-timestamp-field']
With this in mind, if element.creation_timestamp is the timestamp of the event, this logic will no longer work: (element.creation_timestamp < min_timestamp). The creation_timestamp could be 1 month ago, while min_timestamp is calculated based on Time.now. Also, it can't work to track just 1 timestamp, since the nature of files coming in (as well as the nature of multithreading, say files A and C being read at the same time) would overwrite that timestamp. Hence: @@timestamp['my-unique-file-path'] = event['timestamp'] That way, the times are tracked on:
With that approach (in addition to expiring on every event), I can reparse all the logs and expire the events. Imagine events E1 to EX come in within 1 second, but they span 1 day (24 hours) of event time, and I want to expire events that are 15 minutes apart. With the above approach the events can come in as fast as they want, because I am tracking the event's timestamp, which will jump 24 hours in 1 second and correctly detect expiry based on the event's own time. OK, I hope this makes sense :) If not, let me know and I will write up a more concrete example.
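The per-key, event-time-based expiry described above can be sketched like this in Ruby (all names are illustrative; this is not the branch's actual code):

```ruby
# Expire based on the *event's* timestamp, tracked per file key,
# rather than on wall-clock time.
def expired?(last_seen, key, event_time, timeout_seconds)
  previous = last_seen[key]
  return false unless previous
  (event_time - previous) > timeout_seconds
end

last_seen = {}          # file path => latest event timestamp seen
timeout   = 15 * 60     # expire entries 15 minutes apart in event time

t1 = Time.utc(2016, 1, 1, 10, 0, 0)
t2 = Time.utc(2016, 1, 1, 10, 20, 0)   # 20 minutes later in event time

expired?(last_seen, 'file-A.log', t1, timeout)  # => false, nothing tracked yet
last_seen['file-A.log'] = t1
expired?(last_seen, 'file-A.log', t2, timeout)  # => true, even if both events
                                                #    arrived within one wall-clock second
```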
The code with "push_previous_map_as_event" is done particularly for the jdbc input use case. It is a very specific use case, where tasks come one after the other: first all task1 events, then all task2 events, etc. Is this clearer for you?
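For reference, the jdbc-style usage described here looks roughly like this with the options released in 2.2.0 (a sketch; the task_id field and code are illustrative, so check the plugin docs for your version):

```
filter {
  aggregate {
    task_id => "%{task_id}"
    code => "map['event_count'] ||= 0; map['event_count'] += 1"
    push_previous_map_as_event => true
    timeout => 120
  }
}
```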
Hi! Oh, that makes sense, thanks. Now I understand what the code does as well. My implementation will then:
So my config would be set accordingly, because I would have multiple task_ids (so my map is always of size > 1). Sounds like a plan :) Thanks for the clarifications!
Hi, I am having git troubles, I think. I am now at a point where I have merged the changes I wanted to merge, so the version I have locally has:
I think this is all we wanted to merge for the first version. However, rebase wants to continue merging my other changes as well (tracking timestamps on events). I don't want to merge those in, since I would have to remove them afterwards. I believe this is because my commit tree on master kind of looks like:
commit V1
So obviously I can skip steps 2, 3 and 4, because they cancel each other out. But I don't know how to stop rebasing after V1?
About your 4 implementation points, I fully agree with what you say. |
Concerning your git problem :
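The advice actually given here was not preserved in this thread. One standard way to keep only the wanted commit is to branch from the base and cherry-pick just V1, sketched in a throwaway repository so the steps run end to end:

```shell
# Sketch only: build a clean branch containing just the wanted commit.
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
g() { git -c user.email=a@b -c user.name=demo "$@"; }

echo base > plugin.rb;   git add plugin.rb;   g commit -q -m "master base"
echo v1   > timeout.rb;  git add timeout.rb;  g commit -q -m "V1: timeout event generation"
v1=$(git rev-parse HEAD)
echo v2   > tracking.rb; git add tracking.rb; g commit -q -m "V2: timestamp tracking (unwanted)"

# Branch from the commit *before* V1, then take only V1:
git checkout -q -b first-version HEAD~2
g cherry-pick "$v1" > /dev/null
```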
If it's not clear or you need more help, tell me.
Hi Fabien! Thanks for your help :) I struggled a bit because I tried to push a branch to the logstash-plugins repo (to which I obviously have no access). I managed to fix it now, and I think the first version stands: https://github.com/pandaadb/logstash-filter-aggregate/ What I did:
Would you like a pull request, or would you rather look it over first? (Edit: the docs don't match anymore; I will update them once the code is OK.) Thanks! Artur
You can create a PR! Fabien
Cool, I created #37 :) Thanks! Artur |
@pandaadb release 2.3.0 is done, with timeout event generation!
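For readers arriving here later, the options that came out of this thread look roughly like the following in a filter config (a sketch; option availability and the task_id field shown are assumptions, so check the plugin documentation for your exact version):

```
filter {
  aggregate {
    task_id => "%{task_id}"
    code => "map['count'] ||= 0; map['count'] += 1"
    push_map_as_event_on_timeout => true
    timeout => 600
    timeout_task_id_field => "task_id"
    timeout_code => "event.set('timed_out', true)"
  }
}
```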
Yey :) I hope all works as expected and people enjoy the new feature! Pleasure working with you! |
It was a pleasure for me too! Nice news :)
Breaking news: the official logstash documentation has just been updated with your new options and your sample!
That's great :) I wonder if the documentation needs to be updated, though? I believe there is a breaking change in Logstash 5+ where you cannot handle the event like a hash anymore; you need to use get and set instead. So instead of the code being:

it should be:

For example, in Example 2 you have:
Yes, there is a breaking change in 5.0, where the logstash event becomes a Java bean and is no longer a Ruby object. Anyway, it is important that the official logstash documentation has been updated, because a lot of people only look at that documentation and never look at the github site or the plugin code. Fabien
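The API change can be illustrated with a tiny stand-in class (used here only so the snippet runs outside Logstash; on 5.x the real object is a LogStash::Event exposing get/set):

```ruby
# Minimal stand-in for a Logstash 5 event, which exposes get/set
# instead of hash-style access.
class FakeEvent
  def initialize
    @data = {}
  end

  def get(field)
    @data[field]
  end

  def set(field, value)
    @data[field] = value
  end
end

event = FakeEvent.new

# Logstash < 5 style (no longer works on 5.x events):
#   event['duration'] = 42
#   total = event['duration']

# Logstash >= 5 style:
event.set('duration', 42)
total = event.get('duration')   # => 42
```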
For info, I just released version 2.3.1, with a new option:
Was this ever implemented, or is there a way to do this? I am processing some historical logs and this feature would be extremely useful |
No, there is no such feature. If you need one, I invite you to open a specific issue.
Ok, I'll open an issue. |
Hi guys!
I have seen a discussion in a different thread, but it was closed. Today I came across a requirement for our usecase where we want to aggregate our data, but we do not have end events. So I googled a bit, found this plugin, and thought the timeout usecase would be ideal for me. I cloned your repo and implemented the following:
This field is the same as code, but it is code that will be executed for the timeout action. I am not sure if this is necessary or if I can simply reuse the code property.
Whenever flush is called, I now generate a new event per key in the aggregated map. The key is added by default, as is the creation timestamp. The key is mapped to "timeout_id".
If no timeout_code is defined, nothing happens (I am hoping this ensures backwards compatibility).
Instead of comparing against the timestamp of creation, I now update a "last_modified" property each time a new event is aggregated. This way I can aggregate over a longer period of time without the aggregation being cut off because the timeout measured from the start event was reached.
I added test cases for the changes that I have made.
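The flush and last_modified behaviour described above can be sketched like this (plain Hashes stand in for Logstash events; all names are illustrative, not the actual changeset):

```ruby
# Refresh a last_modified stamp on every aggregated event, and on flush
# emit one event per expired map key, with the key under "timeout_id".
now = Time.utc(2016, 1, 1, 12, 0, 0)

aggregate_maps = {
  'task-1' => { 'count' => 3, 'last_modified' => now - 3600 },  # idle 1 hour
  'task-2' => { 'count' => 7, 'last_modified' => now - 60 },    # recently active
}
timeout = 600  # seconds of inactivity before a map is flushed

# A new event for task-2 arrives: aggregate it and refresh last_modified,
# so an active aggregation never expires mid-stream.
aggregate_maps['task-2']['count'] += 1
aggregate_maps['task-2']['last_modified'] = now

# On flush: expire on inactivity, one generated event per expired key.
expired, alive = aggregate_maps.partition do |_key, map|
  (now - map['last_modified']) > timeout
end

flushed_events = expired.map do |task_id, map|
  map.merge('timeout_id' => task_id)  # stand-in for LogStash::Event.new(map)
end

flushed_events.map { |e| e['timeout_id'] }   # => ["task-1"]
```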
You can find my changeset here: pandaadb@ba57fac
I would be happy to create a pull request if you guys think that this is a useful feature.
I am coming from a java background and am not too experienced with ruby coding, so if there are serious issues with my code, please give me a chance to correct them :)
Let me know what you think,
thanks,
Artur