Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataOutputAgent limits events after ordering #1444

Merged
merged 7 commits into from
Aug 8, 2016

Conversation

knu
Copy link
Member

@knu knu commented Apr 20, 2016

This addresses #1044.

@cantino
Copy link
Member

cantino commented Apr 23, 2016

If a new Agent is linked to this DataOutputAgent, it's older events won't show up because of memory[:last_event_id], right? I guess that's fine. I'm concerned this might be adding a lot of complexity, but it's also probably better than what we have?

@brianpetro
Copy link
Contributor

What if we decoupled the data output agent from the prior agent's events?
For example, store all of the required data in an event created by the data
output agent.

On Sat, Apr 23, 2016, 1:45 PM Andrew Cantino notifications@github.com
wrote:

If a new Agent is linked to this DataOutputAgent, it's older events won't
show up because of memory[:last_event_id], right? I guess that's fine.
I'm concerned this might be adding a lot of complexity, but it's also
probably better than what we have?


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub
#1444 (comment)

@knu
Copy link
Member Author

knu commented Apr 25, 2016

@cantino That's correct. You could change events_order or events_to_show to reset last_event_id, though.

@knu
Copy link
Member Author

knu commented Apr 25, 2016

@brianpetro Could you elaborate? Is it about making DataOutputAgent create events?

@brianpetro
Copy link
Contributor

@knu creating events or even using memory.

The primary reason I say this is because I occasionally have to delete data that is showing up in the data output agent. It probably wouldn't be clear to a newbie that to do this you must dig into the events of the prior agent.

Something like this might enable more sorting strategies too, but I haven't thought that through.

@knu
Copy link
Member Author

knu commented Apr 25, 2016

I'm not sure if it's related to the original problem. Could be another issue?

@brianpetro
Copy link
Contributor

@knu probably. I'm still getting a hang of where to add my $0.02

knu added 4 commits June 6, 2016 17:41
In the first run, it now checks `2 * events_to_show` events for each
source.

This also fixes the problem where older events are selected when
`events_order` is not specified by sorting events by the `id`.
`events_order` determines how to select events for outputting, whereas
`events_list_order` determines how to list selected events in the
output.
@knu knu force-pushed the data_output_agent_limits_events_after_ordering branch from b298612 to 978ec06 Compare June 7, 2016 06:13
@knu
Copy link
Member Author

knu commented Jun 7, 2016

I renamed events_order to events_list_order with a migration, and re-added events_order that works as it should. I'm not sure if events_list_order is really useful, but the old behavior has been there for awhile and we certainly do not want to break existing agents.

@knu
Copy link
Member Author

knu commented Jun 8, 2016

To explain the implementation, DataOutputAgent keeps a pool of events with the highest scores defined by events_order. Newly received events are compared with those and the pool is updated as necessary. If either events_order is changed or events_to_show is raised, the pool is discarded and recreated it by re-evaluating the most recent 2 * events_to_show events per source (this is new in the revised commits).


unless new_events.empty?
memory[:last_event_id] = new_events.last.id
events.concat(new_events)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for there to be duplicates in here from new_events + events, or is new_events guaranteed not to contain any of events?

Copy link
Member Author

@knu knu Jun 10, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the latter, exactly. This is the only place memory[:last_event_id] is set to a non-nil value, and it will be greater than or equal to any value in memory[:event_ids]. new_events are selected by id > memory[:last_event_id] (see above) if memory[:last_event_id] is non-nil, so there will be no duplicates between events and new_events.
If memory[:last_event_id] is nil, memory[:event_ids] is assured to be nil or empty and therefore events is empty, so there will be no duplicates.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant it is guaranteed there'd be no duplicates between new_events and events.

DataOutputAgent will select the last `events_to_show` entries of its received events sorted in the order specified by `events_order`, which is defaulted to the event creation time.
So, if you have multiple source agents that may create many events in a run, you may want to either increase `events_to_show` to have a larger "window", or specify the `events_order` option to an appropriate value (like `date_published`) so events from various sources are properly mixed in the resulted feed.

There is also an option `events_list_order` to control the order of events listed in the output, with the same format as `events_order`. It is defaulted to `#{Utils.jsonify(DEFAULT_EVENTS_ORDER['events_list_order'])}` so the latest entry is listed first.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "a legacy option events_list_order", or do you think people will keep using it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or

"There is also an option events_list_order that only controls the order of events listed in the output, without attempting to maintain a total order of received events. It has the same format as events_order."

I'm having trouble verbalizing the difference between these two subtly-different behaviors.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! I've updated the paragraph.

@cantino
Copy link
Member

cantino commented Jul 26, 2016

When I do an events_order of [["{{title}}"]], I seem to get it in reverse order, with z first. When this is set to events_list_order, it puts A first correctly.

@knu
Copy link
Member Author

knu commented Jul 26, 2016

Yes, that's how it works. An events_order of [["{{title}}"]] means "sort by title in ascending order", so the last N (events_to_show) events with the lexically greatest titles will be selected for output in ascending order, like A, B, … Z. Then the default value of events_list_order ([["{{_index_}}", "number", true]]) means "reverse" in one word, so the final output will look like Z, Y, … A.
I can imagine in most use cases events_order will be based on one of the timestamps each event has, and most feeds and timeline data are presented in reverse chronological order, so I'd consider the default value to be reasonable. The problem is how we document it.

@cantino
Copy link
Member

cantino commented Aug 6, 2016

Ah, that makes sense. I agree that highest ID or highest date should be first in a feed. Do you have any concerns about merging this now, such as other testing or feedback we should get?

@knu
Copy link
Member Author

knu commented Aug 8, 2016

@cantino At least simple use cases are covered by the specs and the current implementation is kind of broken for the reporters anyway, so I'm going to merge this now and wait for their feedback.

@knu knu merged commit 9b23c28 into master Aug 8, 2016
@knu knu deleted the data_output_agent_limits_events_after_ordering branch August 8, 2016 05:28
@cantino
Copy link
Member

cantino commented Aug 8, 2016

👏

@knu
Copy link
Member Author

knu commented Oct 7, 2016

I found a bug in this implementation where I assumed received_events were ordered by id asc and old events would be chosen as a result. I'll come up with a fix shortly.

@knu knu restored the data_output_agent_limits_events_after_ordering branch October 7, 2016 10:20
@knu
Copy link
Member Author

knu commented Oct 7, 2016

Should be fixed in 12cecb8.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants