Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Score Aware Pagination #223

Closed
dfjones opened this issue Feb 16, 2014 · 13 comments
Closed

Score Aware Pagination #223

dfjones opened this issue Feb 16, 2014 · 13 comments

Comments

@dfjones
Copy link
Contributor

dfjones commented Feb 16, 2014

The current method of pagination does not take score sorting into account. News items are primarily sorted by their created date when a page is fetched from the DB:

function getNewsItems(query, user, callback, page) {
  page = typeof page !== 'undefined' ? page: 1;
  NewsItem
  .find(query)
  .sort('-created')
  .skip((page-1)*newsItemsPerPage)
  .limit(newsItemsPerPage)
  .populate('poster')
  .exec(function (err, newsItems) {

    if(err) return callback(err);

    addVotesAndCommentCountToNewsItems(newsItems, user, callback);
  });
}

The page is then sorted by score.

Ideally we'd order by score before we paginate.

@marcofiset
Copy link
Contributor

Do you have any idea how we could achieve this without loading the whole database every time?

I can't think of something other than limiting results to posts of the last X days or less. But then a post submitted X + 1 days ago that still has a lot of interest won't show up anymore.

Perhaps store the score in the database so we can sort on it? And on every vote, update the score, or something like that. But score is also based on time so that will require more thinking.

@dfjones
Copy link
Contributor Author

dfjones commented Feb 17, 2014

Sure, here are some thoughts.

I think we will need to persist the score information somewhere. Right now the database is the obvious choice, but a cache or something like Redis with sorted sets could be options as well.

Assuming we go with the database, I think we will want to store the score either on the news item or perhaps in its own collection. When a vote is written, we should update the score as well. Since the score is time based, I think we will need a background tasks that updates the scores in the database at some frequency (every minute, for example). To keep this from writing to all news items, I think we will want to define a policy to limit the number of documents that are updated. For example, news items with a score < 1 will no longer be updated. This way, scores will naturally become frozen as an item ages. To choose the right value, I think we will want to do some analysis of the score function and what values it emits as an item ages. We don't want to prevent fresh items from being updated.

@treygriffith
Copy link
Collaborator

As far as I'm aware, there is no good way of doing this in MongoDB, although there is a planned (but not scheduled) feature supporting it: https://jira.mongodb.org/browse/SERVER-153

Constantly updating the database with new scores as time passes seems like a pretty limited solution.

My vote is to pull all the records from the database and perform the sort and slice in the application until it becomes a problem (which I don't expect until we have a lot more users than we currently do), at which point we can either do some caching (in Redis or similar) or determine the age of the set we should be fetching from the database to perform the sort and slice. (i.e. determine X in @marco-fiset's answer with a high level of confidence).

@dfjones
Copy link
Contributor Author

dfjones commented Feb 18, 2014

Even if the Mongo feature @treygriffith mentioned was implemented, running any sort of custom JavaScript in Mongo isn't a long-term scalable solution either.

I think every solution to this problem will have some sort of tradeoffs associated with them. What @treygriffith suggested will work, but won't scale long term. However, I think if we limited the query to something like 1000 news items, the query & sorting will probably be fast enough for a long time.

@treygriffith
Copy link
Collaborator

Totally true that it's not scalable. It's basically a punt on this issue until we have more data about which trade offs are most appropriate given the application's real world usage.

That said, doing a 1000 item limit would  do effectively for the first page. I'm not sure how that would work for the nth page though, since you can't do a .skip

Sent from Mailbox for iPhone

On Tue, Feb 18, 2014 at 6:58 PM, Doug Jones notifications@github.com
wrote:

Even if the Mongo feature @treygriffith mentioned was implemented, running any sort of custom JavaScript in Mongo isn't a long-term scalable solution either.

I think every solution to this problem will have some sort of tradeoffs associated with them. What @treygriffith suggested will work, but won't scale long term. However, I think if we limited the query to something like 1000 news items, the query & sorting will probably be fast enough for a long time.

Reply to this email directly or view it on GitHub:
#223 (comment)

@dfjones
Copy link
Contributor Author

dfjones commented Feb 19, 2014

I think if we are going to load 1000 items at a time, you'd use a query like the one we already have and I've copied in the original description. Then, sort all 1000 items by score, then select page N of the items from the sorted array. So, the most pages we could support would be 1000/M where M is the number of items on a page.

The big downside is that every page view has to query for and sort 1000 items.

@chall8908
Copy link

Continuing from @dfjones, posts beyond this 1000 limit could be "archived" and viewable from a separate, differently sorted (e.g. only post time) list view.

@treygriffith
Copy link
Collaborator

@dfjones I see what you're saying now.

I don't think that querying for 1000 records and sorting is actually a big problem - I imagine that most of the time spent in the query is overhead in the actual request itself, and doesn't vary much with the size of the result set (at least for 1000 results). The sort itself is very fast (I'd bet less than 2 tenths of a millisecond for 1000 items).

@dfjones
Copy link
Contributor Author

dfjones commented Feb 20, 2014

Something to keep in mind is that for each news item we have to perform a query to join in the item's vote count, comment count, and author info. So we are seemingly in the neighborhood of 3000 queries to build 1 page. (Although, Mongoose might be batching or doing something to reduce query overhead. I don't know it well enough to say one way or the other.)

This probably won't be an issue for the immediate future. But when the day comes to optimize, this probably should be the first place we look at :-)

@dfjones
Copy link
Contributor Author

dfjones commented Feb 20, 2014

Also, we should probably close this issue. I don't think I have the rights to do it. @megamattron ?

@treygriffith
Copy link
Collaborator

@dfjones The queries for vote count and comment count are batched together, so that's only one extra query each. I'm also pretty sure that Mongoose batches populate queries, so that's one more query for author info. For those queries, we should be doing pretty well (4 queries total, regardless of the number of news items).

I did just notice that a recent PR to get the latest comment time performs a separate query for each news item, so that's 899 unnecessary queries that can be eliminated.

But I agree that this issue is resolved.

@treygriffith
Copy link
Collaborator

I take back that bit about the comment counts - those are separate queries. I'm going to see what I can do about the comment counts as well as last comment time queries - Performing ~2000 queries to build the front page is really not a great idea.

@megamattron
Copy link
Member

Looks like we've got a reasonable compromise in there now, so I'll close this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants