Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

Map-reduce support #61

Closed
stas opened this Issue · 39 comments
@stas

This is more like an open question.

I need a way to reduce the amount of repetitive activities so that if I get same keys for the same object, I could develop a strategy to map/merge those into a single entry.

Ideally I would patch public_activity to handle that for me. Anyone interested in getting something like this into the core or I should not even bother pitching this.

Thanks

@farnoy
Collaborator

Could you elaborate more on how would you use it? Since many activities of the same key will have different params/owners and such, how can you benefit from that?

@stas

Suppose we have the key status, which means for an activity entry that an actor changed the status. It's not relevant what was the old value, all that is relevant is that the actor did it 3 times already.

Same for, lets say Article, if an actor updates it.

@hadees

I have a similar issue, for example a User's post is liked 100 times by other users. Currently their Activity Feed is going to be spammed with 100 activities instead of just something like Johnny and 99 others liked your post

@stas

A simple approach is to update last activity if some conditions are met. And that would be the easiest way to handle it for basic usecases.

@farnoy
Collaborator

I think I need to study map-reduce more in depth to answer this question. I thought that it was a simple operation for grouping records by fields and getting count of each group. If it's really more complex than an ORM+DB problem, I will learn it shortly and post some meaningful comments here.

@farnoy
Collaborator

I think associating a post with an activity could solve the problem, as every new comment would set the unread status (or something similar) on the notification. Also a habtm relation with join table for read/unread for each user concerned in that post (author and those who commented, for example). I think there's a gem for that, but I can't recall the name.

@pokonski
Owner

That is one hard topic, on which I contemplated from the very beginning. We can either merge them on creation or in views. In one of the apps I developed using public_activity I did the grouping in JavaScript, client-side.

  1. You render all activities in a view.
  2. Then you group them by their key+trackable_id, and merge into one activity.

Looked and behaved properly, but requires fetching many activities (though it was still fast, as the activities were very simple to render). This is the closest I got to Facebook implementation, but on client-side (facebook does this somewhere between database and displaying).

@michelson

Maybe we can use the params or a field with a parent_id activity to know how to group activities and then query only the parents activities, and on each parent activity make the subqueries, grouping , logic to display what you want.
The benefit of this approach is that you are going to display the number of activities you want , suppose that we want to display 100 activities, if you map / reduce those 100, in client or server, you may end with a different quantity of activities - grouping 100 likes end with one activity displayed -.
but quering "parent" activities you always display 100 activities with inner mapped/reduced/subqueried content.

now the question is how to set up some logic for parent / child creation , like:
time - activities created with x mins/secs/hours of distance will be child of x
repetition - if there is no other activity (different trackable & different owner) then it will be a child of x
custom logic in controller
...

makes any sense ?

@pokonski
Owner

My take was to on Activity creation find the most recent activity with the same key/trackable combination and "attach" the newly created activity to the one already existing.

Merging activities is trivial on MongoDB, thanks to its schema-less design. Everything relational requires additional columns, and probably associations.

Facebook example, when two people comment on my post Facebook merges those notifications into one:

  1. Mark comments...
  2. The activity says: "Mark commented on your post: 'thanks for sharing!'"
  3. Jenny comments, too.
  4. And now we have a merged activity "Jenny and Mark commented on your post: 'that's cool'"

So we need to merge owners into a list, which could be stored as a serialized array (but we lose querying possibilities in ActiveRecord), or as an additional association, and update :parameters with the ones from newly created activity.

This is what I did, but in views. So no merging was done in the database. This was far from perfect, as @michelson mentioned, because we have no way of knowing how many activities we are actually going to get.

@stas

I'm not a fan of any proposed approaches. What I would do is add a callback on before_save that does the merging for me. This operation will be incremental and does not require any schema modifications.

In ex. Mark triggers an activity on something related to a trackable with ID X and the event key is key1. When Jenny triggers another activity on the same trackable with ID X and the same event key1, the hook fetches previous keys (which will include Mark's activity) and since trackable and keys are the same, we can apply a merging algorithm (probably just update parameters with a new value, say, actors that include Mark and Jane.

This makes sense for trackable object but it's not relevant if you care about recipient. To preserve the recipient we gonna need schema changes.

@pokonski
Owner

@stas good thinking. I like the actors approach. To preserve recipient we could group them by trio of key, trackable and recipient. This way we only group activities directed at the same recipient.

@stas
@michelson
@mewdriller

@stas, @pokonski: I took a shot at this in the app I'm using public_activity for and found that including the time in the aggregation key was relevant too. For example, if the following actions happen:

  • Jim creates a Post on 1/1/13 at 1:00pm
  • Bob comments on that Post on 1/2/13 at 1:00pm
  • Sally comments on that Post on 1/2/13 at 2:00pm
  • Mark comments on that Post on 1/2/13 at 3:00pm
  • Jill comments on that Post on 1/2/13 at 4:00pm
  • Susan comments on that Post on 1/3/13 at 1:00pm

If the user is viewing the list of notifications on 1/4/13, you might want to group the notifications like:

  • Jim created a new Post three days ago.
  • Bob and 3 others commented on a Post two days ago.
  • Susan commented on a Post yesterday.

What it seems like to me is that there should be a hook when creating activity that defines how to create the "aggregation key" which could then be used by the group_by method. By default, that aggregation key could be generated by combining: trackable, key, recipient, and created_at (with a time granularity of days).

@pokonski
Owner

@mewdriller yes, exactly. All that makes sense. Since it's a gem it needs to provide reasonable defaults with possibility of changing them, as every one needs something slightly different.

@rubish

+1

Just came across public activity today. we have implemented a custom notifications system, basically doing what public_activity does with a few additions. Public activity does it much more elegantly than our implementation.

We only create new aggregates for user when he becomes active after few hours of inactivity, otherwise we keep aggregating the activities into older latest activity. This way user has a clear separation of things which happened while he was away and things happening now. Custom proc can generate the aggregation key as required, other option can be to return activity to aggregate current activity into.

There are a few additional issues/features which we tackled and would make sense for public activity too. Have started a new thread #82 to discuss them further.

@rubish

One more important aspect to consider would be translation keys for aggregated activities. Different projects will have different requirements. Notifications can look like:

  1. John and Bob liked your Post about 3 hours ago.
  2. Jim created a new Post three days ago.
  3. Bob commented on a Post two days ago.
  4. John, Bob and 5 others commented on your Post 5 days ago.
  5. John, Bob and 1 other liked on your Post 5 days ago.

Fallback to i18n solely based on activity key might not be enough in such a scenario, with different number of actors visible upfront and pluralization. Developers should have some method to define the translation key and interpolation parameters for each activity.

@pokonski
Owner

All this comes down to joining actors, but keeping in mind that different languages have different syntax. English is the easiest one, I think. Implementation in p_a should be vague enough for people to implement this by themselves.

Rails already has i18n support for joining arrays in human-readable format: https://github.com/svenfuchs/rails-i18n/blob/master/rails/locale/en.yml#L188

@stas

Public Activity already offers flexibility to implement an i18n or a view-based rendering for entries, I'm using views, and it works great. Don't think this is a blocker.

@rubish

Yes, definitely not a blocker by any means.

@mieko

I've extracted the code I've been using for this purpose into a library at mieko/coalesce, which I'm trying to turn into a general-purpose "object aggregation library". Contributions expanding its scope to cover all of the use-cases explored here would most certainly be welcome. It'd be nice to finally nail down this functionality, in a way which doesn't invade PA, and would be useful for other purposes.

Coalesce works something like:

g = Coalesce::Grouper.new do
  rule :create_and_comment do
    same       :id, :owner
    attr_in    :key, 'ticket.create', 'ticket.comment'
    time_delta 5.minutes, from: :first

    combine :key,  with: :smart_key
    combine :text, with: :array
  end
end

activities = [
  Activity.new(id: 5, key: 'ticket.create', owner: 10, text: 'Data 1'),
  Activity.new(id: 5, key: 'ticket.comment', owner: 10, text: 'Data 2')
]

results = g.each(activities).to_a
# => [Activity(id: 5, key: 'ticket.comment_create', owner: 10, text: ['Data 1', 'Data 2'])]

The PA-specific niceties like the :smart_key combiner allow Coalesce to massage collections into PA's view structure. Coalesce currently only works in a stream-processing manner, so non-adjacent items aren't considered for grouping, I'd like to extend this to random-access, without ballooning queries or memory usage.

The API is still up in the air, and the functionality only implements enough to cover my use-cases. The gem is pre-release, without proper versioning and packaging yet. I'm working on documentation as I have time, so anyone looking to check it out must be willing to code-dive at the moment.

With a bit of help, I think this could be turned into a system powerful enough to reasonably express all the complexities of Facebook-style aggregation, while staying maintainable. Regardless, it may be a good starting point for anyone looking for functionality like this.

@seifsallam

@mieko That's awesome. Thanks :+1:

@farnoy
Collaborator

But does your library reduce on the server side? If not, the performance will be penalised when done in Ruby.

@mieko

@farnoy Coalesce works at the Ruby object level.

I guess performance is a matter of perspective. My target is mobile and widely-compatible mobile, (think Android 2.3 on spotty 3G). When we moved to server-side aggregation, with standard Rails caching techniques, both bandwidth requirements and page-load responsiveness improved.

As far as our users are concerned, it was a definite performance win. Our servers take a little performance ding on a cache miss.

Where an application lives on the SPA-vs-traditional web app gradient will determine if this approach is appropriate.

@pokonski
Owner

@mieko nice idea! Looking forward to hearing more :)

@farnoy
Collaborator

While caching can excuse any bottlenecks temporarily, this is not a true solution for map reduce.

@mieko

While caching can excuse any bottlenecks temporarily,

Caching is actually the cornerstone performance strategy of large web deployments, but I'm digressing.

this is not a true solution for map reduce.

While the issue title contains the phrase "map-reduce", the problems described are more generalized aggregation. One forcing MapReduce (the paper) strategy to get the results discussed here, would probably be disappointed in how little is gained from not necessarily independent data. A lot of rules we use are outright sequential and heavily inter-dependent for the sake of "humanized" grouping.

Regardless, Coalesce is neither Hadoop nor a Javascript library, depending on what is expected of a true solution. Deep ORM query generation is even out of its scope. It's a still-incubating library for aggregating Ruby objects which meet certain conditions in prescribed ways.

@pokonski
Owner

@farnoy disagree. Caching is an absolute necessity in web-dev. No matter how brilliantly awesome your code is. Serving static content will beat it every time.

Like @mieko pointed out, there is no single serve-all-purposes solution, and definitely not one I can include in p_a.

I did aggregation with p_a in JS (client-side), and server-side for different projects, depending on the use-case. Doing it in the database would bind p_a implementation very closely to the postgres/mysql/oracle or some other crappy database (MongoDB).

@farnoy
Collaborator

We agreed that we can't ship a solution for everyone in p_a months ago on the mailing list.

@mieko, people doing large scale webdev will have serious problems as soon as they use your gem for aggregation.

I'm not saying MapReduce specifically. The use case you've described as an example would be infinitely better performing if developers were to store data the way your results look like.

# => [Activity(id: 5, key: 'ticket.comment_create', owner: 10, text: ['Data 1', 'Data 2'])]

text could be a parameter. text could be it's own column if it would have to be queried.

The algorithm for this bundling would be really simple using the find_by method.

This does introduce another query when inserting, but it won't crash your servers when you purge caches.

By the way, I think one PostgreSQL query can replace your gem. Using array_agg(activities.parameters -> text) with proper GROUP BY clauses should return one row with all parameters[:text] concatenated into one array.

Refs:

@mieko

I've updated my in-database format to resemble my views, and replaced the gem with the proposed query. My servers have stopped crashing. I'll leave Coalesce (now in maintenance-only mode) up as a warning to others intending to scale.

@farnoy
Collaborator

This requires a proper migration, which we'll document in the wiki, but for testing purposes, when parameters is still a serialized string, array_agg(activities.parameters::hstore -> text) should work too (not optimal!)

@xguox

I am now doing remove_duplicate like this in the controller

  def remove_duplicate
    group = Activity.all.group_by{|model| [model.trackable_id, model.trackable_type, model.owner_id, model.owner_type, model.key, model.recipient_id,model.recipient_type] }
    group.values.each do |duplicates|
      newest_one = duplicates.pop
      duplicates.each{|duplicate| duplicate.destroy}
    end
  end

How's it? and

I attempt to do this in the model PublicActivity::Activity by using callback

but fail

@farnoy
Collaborator
@farnoy
Collaborator

I successfully implemented a sample application where there are two models + activities.

One is called Author and stores only a name, the second one is called Update and has a string content and belongs to some author.

I then implemented simple functionality to share these updates (facebook style) by authors (fully done on activities, just a simple concept).

Finally, I've written a query that finds all activities with key update.shared, groups them by trackable, and then groups them by the read parameter. From there, I can select array_agg(activities.owner_id) AS owner_ids and use it to make a simple map/reduce structure:

[{["Test", "false"]=>[3, 1, 2]}, {["Test", "true"]=>[1]}]

The structure looks like this: {[update_content, read_status] => [owner_ids..]}.

In this simple form, it does not achieve anything substantial, but I think it can solve most of the aggregation problems that our users encounter.

The short version is:

@activities = PublicActivity::Activity.where(key: 'update.shared')
.group(:trackable_id, :trackable_type)
.group("activities.parameters -> 'read'")
.select("activities.parameters -> 'read' AS read,
  activities.trackable_id, activities.trackable_type, 
  array_agg(activities.owner_id) AS owner_ids")

# this produces above mentioned structure
@activities.map {|o| {[o.trackable.content, o.read] => o.owner_ids} }

My solution uses PostgreSQL and requires one change in p_a codebase plus a migration (to store parameters in native HStore).

I'd like to hear what do you think about this solution and if it makes sense for your use cases.

Ideally, we'd like a RDBMS agnostic solution, which probably could be achieved (with the exception of referencing into parameters which only Postgres has I think).

@stas

@stas

@farnoy I was expecting a more DB schema oriented approach (as I mentioned, I would see the reduce process happening at the write level rather when activities are queried/read). Reading above, It is not clear how activity generation is handling the entries which look same. Maybe I'm missing something, could you explain more?

Thanks.

P.S.: By the time I opened this ticket, I was working on @courseware which is now open source. So if you are looking for a project where to benchmark/test things, feel free to take a look at it (It has a dozen models with activities).

@farnoy
Collaborator
@pokonski
Owner

Closing this for now as we have no specific way of achieving this on multiple ORMs.

@pokonski pokonski closed this
@gauravtiwari

@pokonski @farnoy I read this thread and was wondering what JS/Ruby method you would suggest to achieve basic aggregation - like asked on the question above? Please if you can share some code - JS/Ruby.

Peter and 100 others likes this post

Thank you :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.