
Sort On Demand #328

Open
josephwegner opened this issue Apr 1, 2014 · 22 comments

Comments

@josephwegner
Collaborator

Right now, when you load the homepage (or any subsequent news page), here's what happens:

  1. The 900 most recent news items are grabbed from the database (mongoose also populates the "poster" field in this process)
  2. The votes for all of those posts are retrieved from the database (inefficient "in" query)
  3. For each post, the entire set of votes is iterated over and associated back to the post record (with an average of 2 votes per post, that's 1800 votes. So looping over 1800 votes for 900 posts means 1,620,000 iterations).
  4. A score is calculated for each post (900 iterations)
  5. The posts are sorted by score (this is most likely not an efficient sorting algorithm)
  6. The list is trimmed down to the 30 relevant entries
  7. The comments are added to each post (30 separate queries)
  8. The latest comment is added to each post (30 separate queries)
  9. The posts are passed off to the view to be rendered

As you can see, this is a terribly heavy process. We're not feeling the full weight of it yet, because we don't have 900 posts, but Pullup will continue to degrade in performance as we post more topics.
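The hot spot is step 3. A plain-JS sketch (illustrative only, not the actual Pullup code) of the difference between the current nested-loop association and a single-pass index keyed by post id:

```javascript
// Illustrative sketch: associating votes with posts.

function associateNested(posts, votes) {
  // Current approach: for each post, scan the entire votes array.
  // Cost is O(posts.length * votes.length) -- the 1,620,000 iterations above.
  for (const post of posts) {
    post.votes = [];
    for (const vote of votes) {
      if (vote.item === post.id) post.votes.push(vote);
    }
  }
}

function associateIndexed(posts, votes) {
  // One pass to bucket votes by item id, one pass to attach them.
  // Cost is O(posts.length + votes.length) -- about 2,700 iterations.
  const byItem = new Map();
  for (const vote of votes) {
    if (!byItem.has(vote.item)) byItem.set(vote.item, []);
    byItem.get(vote.item).push(vote);
  }
  for (const post of posts) {
    post.votes = byItem.get(post.id) || [];
  }
}
```

Both produce the same association; only the indexed version avoids rescanning the full vote list per post.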

Along with some improvements on the scoring process, I suggest that we update the score on demand. Currently, posts will only change order based on two events: a new vote, or a new post. The score is calculated based on age, but all posts age at the same rate so the scores should age similarly. I think we should recalculate scores every time a vote is placed or a post is added, and store that value in the database.

@treygriffith
Collaborator

To me, this is a "don't fix it if it ain't broke" situation. I'm not even sure that at 900 news items this will be a significant factor.

Looping over arrays in JavaScript isn't that heavy - it's very light compared to doing database queries in terms of time to response. The current implementation isn't exactly efficient, though (I hadn't thought about looping over every vote for every post; that's a little scary).

I would, however, be in favor of profiling the front-page render. I think going to town optimizing stuff is the wrong way to approach it, but if we find that one of the items in the list above is taking a really large amount of time, we might be able to optimize just that piece.

The trouble with storing the score with the news item itself is that while all posts age at the same rate, the score degrades exponentially rather than linearly. So you really do need to calculate the score for each item to get the proper order.
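The kind of decaying score being described looks something like the HN-style gravity formula below (the exponent and offsets here are illustrative assumptions, not necessarily what Pullup uses). It demonstrates the ordering problem: because decay is non-linear, the relative order of two items can flip over time even with no new votes, so a stored static score goes stale.

```javascript
// HN-style scoring sketch: score decays non-linearly with age.
// votes: total vote count; ageInHours: hours since posting.
// gravity of 1.8 is a conventional illustrative value.
function score(votes, ageInHours, gravity = 1.8) {
  return (votes - 1) / Math.pow(ageInHours + 2, gravity);
}
```

For example, a 1-hour-old item with 2 votes outscores a 10-hour-old item with 10 votes, but 48 hours later the older, higher-voted item is back on top, so the stored order would be wrong by then.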

If we ever got a really large user base, I would say that heavy caching would be appropriate for rendering the front page (as I'm sure reddit does), but I don't think keeping score constantly really works.

@josephwegner
Collaborator Author

If we can make that query more efficient without caching, I would be fine with that. Something needs to be done, though...

Right now I'm getting 2s load times for the main page (before assets - just the first request), which is a long time. Yesterday we were getting requests timing out in the morning, which showed the user big ugly application-down pages - I'm pretty sure this was because requests were queuing up, and 2s per request was too slow to clear them out before the request timed out.

@megamattron
Member

I agree, it's too slow in its current form. I was chalking that page load time up to the Heroku dyno going to sleep, but it looks like it's that slow on page refresh too.


@dfjones
Contributor

dfjones commented Apr 3, 2014

Here's a possible solution I thought up. I'll have to be brief now, but maybe I can flesh this out more when I have more time on my hands next week.

What we could do is find a way to cache the news list. We could even do this "cache" in mongo: create a new collection, and add documents with an article id and score as calculated by our algorithm. Every minute or so, have a task go over these documents and update their score. We could also eagerly update the document when an item's votes change.

Then the query is a very simple query over this collection, ordered by score and limited to the number of news items on a page.
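The read path against such a cache would then be trivial. A plain-JS stand-in for the idea (in Mongo this would be a `find()` sorted on an indexed `score` field with a `limit`):

```javascript
// Plain-JS stand-in for the cached-score collection. In MongoDB this
// would be roughly db.newsScores.find().sort({ score: -1 }).limit(30)
// with an index on score -- no vote lookups or per-request scoring.
function frontPage(cachedScores, pageSize = 30) {
  return cachedScores
    .slice() // copy so the cache itself isn't reordered
    .sort((a, b) => b.score - a.score)
    .slice(0, pageSize);
}
```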

@treygriffith
Collaborator

I didn't realize the main page was taking 2s to load - that's clearly a problem.

I do think we need to profile the main page to figure out what's causing the issue - although looping votes and items is a likely culprit, and could be addressed with a cached collection.

@megamattron
Member

@dfjones This is exactly how I've implemented this in the past, in that case I just stored the score with the news item in the DB. Then, as you said, to display the front page is a simple descending order query and who cares if the score is slightly inaccurate in between refreshes. We certainly don't have enough news item volume to notice any problems anyway, better to optimize the front page loading speed and give up some accuracy on the scoring.

@treygriffith
Collaborator

@megamattron I'd be more in favor of a separate collection that is just a front-page cache (or first n pages) rather than storing the score in the main collection.

@josephwegner
Collaborator Author

@treygriffith Just wondering, why would you prefer a full cache, rather than just an additional field on the news item collection? Is there a good reason for the data duplication?

There is an unfortunate requirement that we need to keep in mind here: Heroku gives us one dyno for free - anything above that is going to be $30(ish) per month. I don't know what @megamattron's plans are for scaling, but I would like to avoid anyone having to pay for Pullup for as long as possible.

Having a scheduled job that recalculates scores every few minutes will require a separate worker dyno (it could be done on the main thread, but that will not scale well with Node's single-process architecture). I think that we have to either speed up the current live scoring process, or cache the scores on demand (i.e. on a new vote or a new post). I'll think on it more, in hopes of finding a way that the age/gravity stuff could still work with this approach.

@treygriffith
Collaborator

You're going to have to have a background process whether you have a cache collection or not.

Having a process that is constantly updating all of the news items, as well as updating on votes, etc, starts to create problems because you can potentially have multiple actors simultaneously updating the same record.

Imagine if we add the ability to edit self posts - any time you save your edits at the same moment that the background process is updating, or that a user is casting a vote, there will be a collision.

If we're caching data, I think it's clearer to actually use a cache than to have parts of every record be cached, and other parts not.


@dfjones
Contributor

dfjones commented Apr 3, 2014

My initial thinking about a separate collection is that it might be slightly easier to manage. You can limit the number of documents stored there, so whatever is working over them can be limited in the time/resources it consumes. We can do this too if we just add scores to the existing documents, but it feels cleaner to me to separate this out.

As for the update worker, I vote to do the simplest thing that could possibly work. I'm not sure how this fits with Heroku, but Node has setTimeout, which we could use to call the updater periodically. Node is a single process, but anything that runs as an event callback should be fine.
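A minimal in-process scheduler along these lines might look like the sketch below (`updateScores` is a hypothetical placeholder for whatever recomputes and persists scores). A re-armed setTimeout is used instead of setInterval so a slow update can never overlap the next run:

```javascript
// Sketch of a periodic in-process updater. Re-arming the timer only
// after a run finishes guarantees runs never overlap, even if one
// takes longer than the interval. Returns a stop() function.
function startScoreUpdater(updateScores, intervalMs) {
  let stopped = false;
  let timer = setTimeout(tick, intervalMs);
  function tick() {
    try {
      updateScores();
    } catch (err) {
      console.error('score update failed:', err);
    }
    if (!stopped) timer = setTimeout(tick, intervalMs);
  }
  return function stop() {
    stopped = true;
    clearTimeout(timer);
  };
}
```

In practice the real updater would be asynchronous (it hits the database), so the re-arm would move into its completion callback, but the shape is the same.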

@josephwegner
Collaborator Author

@dfjones If we intend to run the updater every minute or so, I think we will be creating more overhead than we already have. I don't think we currently have someone loading the homepage every minute (although maybe our crashes are only occurring with burst traffic). It is (sort of) true that setTimeout will wait for the event queue to empty out before it runs, but any new requests that come in while the updater is running will have to wait for it to finish. I'm not entirely sure (I would like to build in some profiling), but I think that most of the time is spent processing inside of Node rather than waiting on IO - that means the thread will be blocking all other requests for most of those 2 seconds.

@treygriffith I'm trying to work towards a solution that doesn't have a process constantly updating all of the news items. I'd like to find a solution that only updates the scores and/or cache when an event that would cause the order to change happens - that should solve the collision issue.

I don't yet know if such a solution is possible, but it seems like it should be. I am going to spend some time this afternoon thinking on it - hopefully there's a clever solution that meets all of the goals here.

@megamattron
Member

Does something like https://github.com/ncb000gt/node-cron execute on the web serving thread? If not, that really seems to me like the best solution. It'll be doing roughly the same amount of work we're doing now every time we load the news list, but only infrequently. Even every 10 mins will probably cover the level of activity we have for the foreseeable future.

As for where to store the score, creating a duplicate structure to store the news just to add an additional int value seems to have dubious benefits and concrete downsides.

Please don't read the above with a negative tone, none is intended! However, in general I think an important goal for this site probably would be to try and do things in simplest way possible. It will help new users get up to speed and just keep things moving forward in general. We can always fix things when they start having problems in the future.

@treygriffith If you end up finding a simple alternative to that I'm all for it, but I'm hesitant to get involved in anything too complicated.

@dfjones
Contributor

dfjones commented Apr 11, 2014

Ok, I can agree to keeping the score on the current news items.

As for the actual processing of scores, it is my understanding that Node.js runs JavaScript on a single thread. To get around this, I think we will need to use something like process.nextTick() to ensure that we do not block the thread for too long. See this article for some detail.

Perhaps there is a library or technique we can use that is a little more convenient than calling process.nextTick()?
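The pattern being described — yielding back to the event loop between chunks of work — can be sketched without any extra library using setImmediate, which (unlike a tight process.nextTick loop) lets pending I/O callbacks run between chunks. Function names here are illustrative:

```javascript
// Process a large array in slices, yielding to the event loop between
// slices so HTTP requests aren't starved while scores are recomputed.
function forEachChunked(items, chunkSize, fn, done) {
  let i = 0;
  function runChunk() {
    const end = Math.min(i + chunkSize, items.length);
    for (; i < end; i++) fn(items[i]);
    if (i < items.length) {
      setImmediate(runChunk); // let queued I/O callbacks run first
    } else {
      done();
    }
  }
  runChunk();
}
```

A scoring pass over 900 items in chunks of 50 would then block the thread for only a chunk's worth of work at a time.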

Edit: did some searching and found event-stream. Maybe this in combination with the cron library @megamattron linked will be a nice way to structure this work?

@ghost

ghost commented Apr 20, 2014

As a test, I loaded my dev database with 1000 posts and randomly generated between 1-6 votes for each post (3471 votes total). With this data set, response times averaged in the mid-13-second range per request. For reference, the db-seed data on my dev env generates ~330ms response times, and 500 news items/1436 votes results in 4-5s responses. These numbers are for a single request of the front page. (This is not the most accurate of profiles, I know, but it's somewhere to start! Other attempts at corroborating/refuting would be helpful.)

Profiling (and please correct me if I'm misinterpreting these results) seems to indicate to me that the culprit is the votes.addVotesFor -> votes.addVotesToItem methods (or step # 3 in the original post):

[Screenshot: profiler output, Apr 19 2014]

(note that when running the profiler my response times dropped to ~15s +)

Caching or pre-calculating the scores is premature IMO and we should at least attempt to improve the way votes get totaled for submissions. For the record, I definitely think that caching/saving scores between requests is a sound idea overall (at the very least we shouldn't be re-calculating them on *every* request if we can avoid it), but we should improve (within reason) the scoring functionality before we deal with caching.

Has anyone else tried loading larger datasets? If so, I'd be interested in hearing if your experience differed from my own.

Lastly, one other thing to consider which I don't think I've seen mentioned in this thread:

Perhaps we may have gone down the wrong path by declaring a reference relationship between the NewsItem and Votes models. I know this is a familiar and common relationship in RDBMS schemas, where a simple join and some aggregate functions allow one to very quickly slice up this data in many ways. But I can't help but think that it's not a good fit for MongoDB, and embedding the votes within the NewsItem and Comment models may have been a better fit (Caveat: we'd have to be cautious w/r/t document growth with this approach.). tbh I'm not entirely sure about this approach either and thought I'd just throw that out there for discussion.
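For concreteness, the two document shapes under discussion, sketched as plain objects (field names are illustrative):

```javascript
// Referenced (current): votes live in their own collection and point
// back at the item; rendering the front page needs a second query plus
// the in-memory association step.
const referenced = {
  item: { _id: 'abc123', title: 'Sort On Demand' },
  votes: [
    { _id: 'v1', item: 'abc123', voter: 'u1' },
    { _id: 'v2', item: 'abc123', voter: 'u2' }
  ]
};

// Embedded (proposed alternative): votes ride along inside the item
// document, so one query returns everything -- at the cost of
// unbounded document growth as votes accumulate.
const embedded = {
  _id: 'abc123',
  title: 'Sort On Demand',
  votes: [{ voter: 'u1' }, { voter: 'u2' }]
};
```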

@treygriffith
Collaborator

Yikes, that's pretty bad.

As far as the design of the database (i.e. the more relational approach), I'm the one that did that, and here's my reasoning: #146 (comment)

I hadn't considered putting Votes as a sub-document on comments and news stories, which may have been the best decision. I'd definitely be willing to revisit that.

@treygriffith
Collaborator

Or rather, I did consider it, but the benefits might outweigh the drawbacks I listed in that comment.

@ghost

ghost commented Apr 22, 2014

I totally agree with your rationale in #146.

I do kind of like the idea of embedding the vote data in the NewsItems, but can't help but think that, as the site grows, we'd simply be trading a "now" problem for a "later" problem.

I do think there is an easier solution, however. Building on the idea of pre-calculating/caching the scores, we could very easily store the aggregate vote count and continue to live-calculate the scores. My tests show that assembling the vote/item relationship is what is really taking most of the time.

I just pushed an example to my iss328 branch which demonstrates this (warning, this was quick and dirty so there are bugs. e.g., scores on Issues don't show.).

In short: I've added a post-save hook to the vote model, so every saved vote increments an aggregated vote count on the associated item. Now when scoring we can skip the time-consuming lookup of votes for all items. With only this change my tests show a significant performance increase.
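The idea boils down to keeping a denormalized counter in sync at write time. An in-memory sketch (model and field names here are hypothetical; in mongoose this would be a `schema.post('save')` hook issuing an atomic `$inc` on the item's `voteCount`):

```javascript
// In-memory sketch of the post-save hook idea: saving a vote bumps a
// denormalized voteCount on its item, so scoring never has to scan
// the votes collection. Names are illustrative, not the real branch.
const items = { abc: { title: 'Sort On Demand', voteCount: 0 } };
const votes = [];

function saveVote(vote) {
  votes.push(vote);             // the vote record itself (source of truth)
  items[vote.item].voteCount++; // "post-save hook": keep aggregate in sync
}

saveVote({ item: 'abc', voter: 'u1' });
saveVote({ item: 'abc', voter: 'u2' });
```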

Downsides:

  • This data is now stored in 2 places: the actual votes and the aggregated count in the items model. I think this is a fair trade-off for the potential performance improvement, and a common use of db denormalization (as long as votes aren't created outside of the mongoose model, where this hook is enforced).
  • If we decided to go this route, we would also need some method of calculating the sum of votes for existing items.

As I said, this is just a quick example to get some feedback on whether you all think this is worth pursuing, so any comments/concerns are appreciated.

@treygriffith
Collaborator

I think that you're right - we shouldn't focus on caching the scores themselves, but should cache the vote count.

I took a look at your branch, I think it looks good, and probably the right way to go. It doesn't introduce too many future problems that I can think of (especially as compared to embedding Votes).

The only thing that I think should be added would be some sort of balancing routine that can occasionally (once a day?) go through the Votes collection and update items with their proper vote count. That would be a way to get to a sort of eventual consistency.

When we implement that, it would solve your second concern, as the first run of the consistency routine would update all of the existing items with the correct vote count.
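A balancing routine of that kind is just a recount: group the votes by item and overwrite any drifted aggregates. A plain-JS sketch of the idea (in Mongo this could be a single aggregation with `$group` on the item id; names here are illustrative):

```javascript
// Reconciliation sketch: recompute each item's vote count from the
// source-of-truth votes and patch any drifted aggregates. Returns the
// ids that were out of sync, which is handy for logging.
function reconcileVoteCounts(items, votes) {
  const counts = new Map();
  for (const v of votes) {
    counts.set(v.item, (counts.get(v.item) || 0) + 1);
  }
  const fixed = [];
  for (const item of items) {
    const actual = counts.get(item.id) || 0;
    if (item.voteCount !== actual) {
      item.voteCount = actual;
      fixed.push(item.id);
    }
  }
  return fixed;
}
```

Run once, this backfills counts for all pre-existing items; run periodically, it gives the eventual consistency described above.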

@ghost

ghost commented Apr 24, 2014

Thanks for the reply.

Given the little other feedback on this idea I think I'm going to break out my suggestion into three smaller parts:

  1. Start collecting the aggregate data on the individual items. This will do nothing other than make the data available for future use, and with the disclaimer that without step 2 it will only apply to items submitted after/if this change is accepted.
  2. Assuming part 1 is approved, we should revisit the synchronization step for pre-existing items. I'm not totally convinced we need a scheduled task to resync that regularly. After the initial sync, anything that causes these numbers to drift should be considered a bug.
  3. Refactor the sorting/scoring routine to use the aggregate field data.

When I have some free time I'll refactor my code & send a pull-request for step 1. If that is approved, we can revisit how we want to proceed.

@treygriffith
Collaborator

That sounds like a good plan of action to me. You're right that anything that makes the numbers drift will be a bug, but I think it's important to have a system that can handle hiccups. One such hiccup might be the server crashing between saving the Vote and incrementing the aggregate vote count.

Since we'll probably have to sync up initially, we can just make that reusable enough to run every so often. Whether we need that level of data purity is of course up for discussion.

But we can discuss all that once you've done part 1. Thanks for putting in the time on this issue.


@ghost

ghost commented May 6, 2014

I sent a PR (#334) a couple of days ago for comment/discussion as I commented above.

If you all would prefer to go another way please feel free to close it.

Thanks.

@dfjones
Contributor

dfjones commented May 12, 2014

I'd vote to merge @cgroner's PR #334 and implement the rest of his plan outlined above. Having the vote totals pre-calculated for each news item will speed up the score-based sorting. After we have this in place, we can evaluate where we are and whether we need to take further action to improve the performance of score-based sorting.
