
"list submissions for multiple assignments" API fails on large data sets #1033

Closed
japomani opened this issue May 8, 2017 · 9 comments

@japomani

japomani commented May 8, 2017

Summary:

Any tool that requires all submissions for all students will not scale and will fail. Often the request fails to complete and/or returns corrupt data. This is especially true in large courses, courses with a large number of assignments, or when several courses are cross listed into the same shell.

Steps to reproduce:

  1. GET /api/v1/courses/:course_id/students/submissions?student_ids[]=all
  2. Large classes with large numbers of assignments will return as many as 10,000 records or more
  3. Paging limits returned data to 100 records per page; additional pages need to be loaded sequentially, and this can mean 100+ pages
  4. Complete request of all pages can take 10s of minutes! I suspect API throttling slows it down more.
  5. Request often fails or times out
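The arithmetic behind steps 2-4 can be made concrete with a back-of-the-envelope helper. This is a sketch, not part of the API; the per-page latency is an assumption taken from the numbers reported later in this thread.

```python
import math

def sequential_paging_estimate(total_records, per_page=100, secs_per_page=6.0):
    """Rough lower bound on wall-clock time to drain a paginated
    endpoint one page at a time. secs_per_page is an assumed average
    page latency, not a Canvas guarantee."""
    pages = math.ceil(total_records / per_page)
    return pages, pages * secs_per_page

# 10,000 submissions at 100 per page is 100 pages; at ~6 s per page
# that is already 600 s (10 minutes) of strictly sequential requests.
pages, secs = sequential_paging_estimate(10_000)
```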

Expected behavior:

The “list submissions for multiple assignments” API should include submitted_after[] and graded_after[] parameters. This will allow developers to request only submission data that are new since the last request was made, and store updates locally. This dramatically reduces the amount of data requested and will lower page load times from 10s of minutes to SECONDS

Additional notes:

Issue: There is a whole class of LTI tools that cannot be built due to the current design of the submissions API. These are tools like custom gradebooks, course dashboards, early detection of struggling students, leaderboards, gamification, etc. in large courses, courses with a large number of assignments, or when several courses are cross listed into the same shell.

The current Submissions API returns ALL submissions that fit the requested parameters, even if we've requested that data before. In large classes with a large number of assignments, this can quickly become a very large number of submissions. My current class reaches over 10,000! In addition, the Canvas API only returns 100 submissions per page. This means that large submission requests are split into as many as 100 API requests or more for EACH page load! The Canvas API paging system does not allow asynchronous requests, so each page must be delivered and then the next requested, sequentially. Also, in large classes, these kinds of requests occur many times a day, and I suspect that so many page requests from the same course result in API throttling, which further slows down the requests.
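The sequential dependency described above can be sketched with the paging logic factored out. Here `get_page` is a stand-in for a real HTTP call that reads the `rel="next"` URL out of Canvas's Link header; because that URL only arrives with the previous response, page N+1 cannot be requested until page N has fully returned.

```python
def drain_pages(get_page, first_url):
    """Fetch every page of a paginated result, strictly in sequence.

    get_page(url) -> (records, next_url_or_None). The next URL is only
    known once the current page's response arrives, which is the serial
    bottleneck described above.
    """
    records, url = [], first_url
    while url is not None:
        page, url = get_page(url)
        records.extend(page)
    return records
```

A fake fetcher that serves three two-record pages shows the shape of the loop without any network calls.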

The result is that API requests for all submissions in large courses, courses with a large number of assignments, or several courses cross-listed into the same shell can take 10s of minutes! Often the request fails to complete and/or returns corrupt data.

The traditional solution for improving performance of API data is to store it locally. However, this approach is only useful for data that does not change often--submissions data are very dynamic and change frequently. Also, there is no way to know if new data is available. As a result, fresh data--the entire data set--must be requested each time a user wants to see the current status. Again, this can take 10s of minutes and may be a significant drag on Canvas servers. Another approach is to use the Live Events API to collect data locally as it is generated. This approach is not very useful in this use case. First, the Live Events API returns data at an account level, not a course level. So, to get data for a class, a developer must request data for their ENTIRE university, a solution that is way overkill and has all kinds of security issues. Second, the Live Events API is still in beta and is not available.

Solution: A simple solution for this problem is outlined below. By adding submitted_after[] and graded_after[] parameters to the “list submissions for multiple assignments” API, developers can request only submission data that are new since the last request was made, and store updates locally. In a large course, data will be requested multiple times a day, and thus local data is updated multiple times a day. This reduces the size of any individual submissions API request and dramatically reduces the amount of data requested from Canvas servers for a course. This will lower page load times from 10s of minutes to SECONDS and make a variety of innovative components possible! I think coding this change to the API will likely be trivial (famous last words : ), and only require a couple of ‘if statements’ and WHERE clauses on the DB request.

Request Parameters:

Parameter - submitted_after[]
Type - Date/Time
Description - Return only submissions with submitted_at dates later than the specified date/time

Parameter - graded_after[]
Type - Date/Time
Description - Return only submissions with graded_at dates later than the specified date/time
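To be clear, submitted_after[] and graded_after[] are the *proposed* parameters, not part of the existing API. A sketch of how a client might build the incremental query if the proposal were adopted, with the checkpoint timestamp recorded from the previous sync:

```python
from datetime import datetime, timezone

def incremental_params(last_sync):
    """Build query params for the PROPOSED incremental request.

    last_sync is the timestamp recorded before the previous fetch.
    submitted_after[] and graded_after[] do not exist in the current
    Canvas API; this only illustrates the intended usage.
    """
    stamp = last_sync.isoformat()
    return {
        "student_ids[]": "all",
        "per_page": 100,
        "submitted_after[]": stamp,
        "graded_after[]": stamp,
    }

params = incremental_params(datetime(2017, 5, 8, 12, 0, tzinfo=timezone.utc))
```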

@ccutrer
Contributor

ccutrer commented May 8, 2017

I'm not calling your entire request invalid, but I do want to point out some items that you state as fact (or assumed fact). They may actually end up being true, but you obviously have not tested them, so they could also be completely wrong.

  • Also, in large classes, these kinds of requests occur many times a day and I suspect that so many page requests from the same course results in API throttling which further slows down the requests. - API throttling doesn't artificially "slow" requests. It simply rejects them if you're over the limit.
  • The result is that API requests for all submissions in large courses, courses with a large number of assignments, or several courses cross-listed into the same shell can take 10s of minutes! Often the request fails to complete and/or returns corrupt data.
    • In Canvas Cloud, it is impossible for a request to take 10s of minutes. The Amazon load balancers we use will drop a request if it hasn't started responding in 60s. Given how Rails works, it's highly unlikely that we'll meet that 60s first-byte threshold, and yet continue to drag on slowly adding to the response over multiple minutes.
    • I'd love to know what kind of corrupt data you're talking about, as this is a problem that should be addressed immediately.
    • You could also be referring to the sum total of the time to request all of the pages of a multi-page result. This could definitely surpass 10 minutes in a very large course.
  • This will lower page load times from 10s of minutes to SECONDS - again, you obviously haven't implemented and tested this, but are stating it as fact. It's highly likely to help, since the bottlenecks are mostly on the Ruby side (instantiating large objects and doing large amounts of serialization), and additional filtering in the initial database query could definitely reduce that work. But asserting that it will magically reduce the total time to seconds is disingenuous at best.

As for workarounds for now - it seems that you're blindly requesting every submission for every student for every assignment in a course. I highly recommend that you use the assignment_ids parameter to get submissions for one assignment at a time. Every time you have to request another page, you're essentially forcing Canvas to duplicate the majority of the work over the entire scope of submissions, and then throw all but the current page of results away. By iterating through assignments, far less duplicate work is thrown away, and it should be a decent compromise between one all-encompassing request and iterating through individual submission requests.
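The per-assignment workaround can be sketched as a plain loop. `fetch_submissions` here is a stand-in for an HTTP call to the endpoint filtered by `assignment_ids[]`; only the iteration pattern is the point.

```python
def submissions_by_assignment(assignment_ids, fetch_submissions):
    """Iterate one assignment at a time instead of addressing the
    whole course at once.

    fetch_submissions(assignment_id) -> list of submission records.
    With ~100 students, each per-assignment call fits in roughly one
    page, so Canvas never rebuilds and discards a huge result set for
    every page of a giant multi-assignment request.
    """
    all_subs = []
    for aid in assignment_ids:
        all_subs.extend(fetch_submissions(aid))
    return all_subs
```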

Oh, and on the subject of paging, your proposed submitted_after/graded_after parameters definitely complicate things. If a submission is modified between requests, and would end up in the first page of results due to sorting, you may never see that change. This isn't insurmountable (by forcing a specific sort order when you use these parameters and/or using paging that's based on continuation, rather than simply skipping X number of records), but does make your feature request more complicated than you make it out to be. Though it should also be pointed out that the current incarnation of the API is subject to similar race conditions. It's just less likely to become perpetual errors, because you'll see the change the next time you fetch all of the data.

@japomani
Author

japomani commented May 8, 2017

Cody,

Thank you for such a thoughtful reply! I accept your push back on all points : ) Testing would really be useful to clarify my assumptions.

My biggest assumption is that the largest bottleneck is trying to send 10,000 records at once via HTTP with paging making it 100 requests (these numbers are rounded, but not far off from a real use case I've been working with).

I have this assumption based on my experience, but I may be misinterpreting what is actually happening. Here is what I experience. At the beginning of the semester, when there are only a few submissions, this request will complete in around 10-30 seconds. As the semester continues and the number of submissions increases, so does the wait time. By the end of the semester I wait 30 min or more to get the complete set of data in. However, I might have to get up very early in the morning to try: it seems like the later in the day it is, the more often the request fails or stops paging, leaving me with only some of the data (that is what I mean by corrupt data: it no longer accurately represents what is in Canvas).

Reading your note, you seem to be mostly talking about a single request lasting 10 min. That is not what I'm talking about. I never have that problem. I think we are on the same page in this sentence:

You could also be referring to the sum total of the time to request all of the pages of a multi-page result. This could definitely surpass 10 minutes in a very large course.

A single page always returns quickly; the challenge comes when there are 100 pages of data. If each page takes 6 seconds to get to me, that is 10 min. If I could request only NEW data, rather than ALL data, several times a day, I think it would only be one or two pages. I think this would reduce the page loads from 10s of minutes to about 20 seconds or so. Of course, that is an assumption, and I can't currently test it, but it seems reasonable based on how the API performs at the beginning of the semester when there is not much data.

If I read your note right, you suggest that this might not really be the bottleneck; the bottleneck could be on the Ruby side as it queries the DB and builds the objects. Is that right? Would it still be an issue if the initial DB query only asked for a subset of the data based on a date/time stamp? Instead of requesting 10,000 records, it might just ask for 50 or 100, just the new records since the last time I asked for them. That data object would be relatively small.

I'm not sure I understand how your work around helps. In components like a gradebook, leaderboard, troubled student dashboard, etc., I need all submissions, for all assignments, for all students in order to compute and display the required information. I can ask for that using "list submissions for multiple assignments" but it can take a long time to arrive with large data sets. How will asking for the same data organized by Assignment ID help? Is the code optimized for that request?

Every time you have to request another page, you're essentially forcing Canvas to duplicate the majority of the work over the entire scope of submissions, and then throw all but the current page of results away.

That is true, but without "graded_after" or "submitted_after" or something like that, I think I have no way to know which data I have already asked for and no way to ask for just the new data. I would prefer to only ask for the data I don't already have. But, unless I misunderstand how the submissions API works, it is all or nothing. If I want the most recent records, I have to ask for all the records. Do I have that wrong?

Feel free to push back on any of this. I want to make sure we are talking about the same things and I find it helpful when you challenge my thinking on this. Thank you!

@ccutrer
Contributor

ccutrer commented May 8, 2017

If I read your note right, you suggest that this might not really be the bottleneck; the bottleneck could be on the Ruby side as it queries the DB and builds the objects. Is that right? Would it still be an issue if the initial DB query only asked for a subset of the data based on a date/time stamp? Instead of requesting 10,000 records, it might just ask for 50 or 100, just the new records since the last time I asked for them. That data object would be relatively small.

I'm pretty sure the bottleneck is with large results (ignoring pagination), because we have to query all of the data from the database, construct Ruby objects, and then do the pagination by slicing that result set (after which we format the API response, which can also be intensive depending on what additional data is needed). I could be wrong, though, about where the slicing takes place in this particular request. Requests that span data across multiple object types (like assignments and students, as this one does) sometimes don't have a good way to push the pagination down into the database. Reducing the size of the initial query (via something like graded_after) would definitely reduce the initial set of data being queried, and that would cascade to all other steps in the request.

I'm not sure I understand how your work around helps. In components like a gradebook, leaderboard, troubled student dashboard, etc., I need all submissions, for all assignments, for all students in order to compute and display the required information. I can ask for that using "list submissions for multiple assignments" but it can take a long time to arrive with large data sets. How will asking for the same data organized by Assignment ID help? Is the code optimized for that request?

Say you have 100 students, and 100 assignments for 10,000 records. You're saying that when you address all 10,000 records (and split into 100 pages), the requests are taking ~6s each, for a total of 600s. I'm suggesting that instead of addressing everything, you iterate and do one request per assignment. Assuming 100 students, then there would be no pagination at all, and it's possible-to-likely that each of these requests would be faster, maybe 1 or 2s each, for a total of 200s. These numbers could be way off. Maybe they take 0.5s each. Or maybe they take 4s, but due to the layout of assignments vs. students, you end up having to make 200 requests, for a total time of 800s. It's just an avenue to explore. The other benefit to splitting by assignment is that you could issue multiple requests in parallel - up to one for each assignment. Even if you only do 4 in parallel, that could take a good-case 200s down to 50s. You'll have to be careful with throttling on that, though. It's incredibly unlikely to hit throttling with one request at a time, but parallel requests definitely increase the odds. A bit of experimenting will hopefully help you find a good value for the number of parallel requests without hitting throttling. Just be sure to handle it in your code so that if for some reason Canvas is being slower than normal and a request gets rejected, you retry.
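A sketch of that parallel variant, using a small thread pool (the pool size of 4 matches the example above, and `fetch_submissions` is again a stand-in for the real per-assignment HTTP call):

```python
from concurrent.futures import ThreadPoolExecutor

def submissions_in_parallel(assignment_ids, fetch_submissions, workers=4):
    """Issue per-assignment requests concurrently.

    A small pool keeps the request rate low enough that throttling is
    unlikely, but the caller should still retry any request Canvas
    rejects. Results come back in assignment order because map()
    preserves input ordering.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pages = list(pool.map(fetch_submissions, assignment_ids))
    return [sub for page in pages for sub in page]
```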

@japomani
Author

japomani commented May 8, 2017

One more thought:

If a submission is modified between requests, and would end up in the first page of results due to sorting, you may never see that change. ... Though it should also be pointed out that the current incarnation of the API is subject to similar race conditions. It's just less likely to become perpetual errors, because you'll see the change the next time you fetch all of the data.

Ya, I've thought about this. I might not get any data that updates during my request, but I think I will get it on the next request.

Let's say I ask for all the submissions at 12:00 and record that time as my "fresh data" time, but I don't actually get the data back until 12:03. If some data came in at 12:01, I won't have all the data.

But, the next time I ask for data, say at 12:10, I can ask for everything since my last "fresh data" time, or 12:00, and the data I missed at 12:03 will come over. I might have to check for duplicates to keep my data clean.

In this way we are using some sorting in the query to make sure I can grab all the data.
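The checkpoint-and-dedupe scheme described above can be sketched as a pure merge. Keying the local store by submission id (an assumption about the record shape, but Canvas submissions do carry an id) turns a re-delivered record into an update instead of a duplicate:

```python
def merge_incremental(local, fresh):
    """Merge a fresh (possibly overlapping) batch into the local store.

    local maps submission id -> record. Because the "fresh data"
    checkpoint is recorded *before* each request, a record changed
    mid-fetch shows up again in the next window; keying by id makes
    the re-delivery an update rather than a duplicate row.
    """
    updated = dict(local)
    for sub in fresh:
        updated[sub["id"]] = sub
    return updated
```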

@japomani
Author

japomani commented May 8, 2017

Say you have 100 students, and 100 assignments for 10,000 records. You're saying that when you address all 10,000 records (and split into 100 pages), the requests are taking ~6s each, for a total of 600s. I'm suggesting that instead of addressing everything, you iterate and do one request per assignment. Assuming 100 students, then there would be no pagination at all, and it's possible-to-likely that each of these requests would be faster, maybe 1 or 2s each, for a total of 200s. These numbers could be way off. Maybe they take 0.5s each. Or maybe they take 4s, but due to the layout of assignments vs. students, you end up having to make 200 requests, for a total time of 800s. It's just an avenue to explore. The other benefit to splitting by assignment is that you could issue multiple requests in parallel - up to one for each assignment. Even if you only do 4 in parallel, that could take a good-case 200s down to 50s.

I like it! I'll give it a try and see if that lowers the time. Thanks so much for looking this over! I appreciate your help Cody!

@japomani
Author

japomani commented May 10, 2017

Here are the experiments that I am going to set up

Ways to ask for submissions

  • All assignments for all students
    • Synchronous returns of pages
  • All assignments for some students
    • From 1 to 100?
    • What is optimum # of students in a request
    • Make asynchronous requests
  • All students for a single assignment
    • Asynchronous requests for each assignment

Be sure to accommodate throttling so we can ask for it again if the request is rejected

Ways to ask for analytics

  • Course level assignment summary data
    • Doesn’t really have all the data I need, missing scores for assignments, I think?
    • Get course-level assignment data
  • Course level student summary data
    • May have scores, it is hard to tell from example
    • Get course-level student summary data
  • User in a course level assignment data
    • Has submission times and scores
    • One student at a time
    • Might be faster than submissions one user at a time
    • Get user-in-a-course-level assignment data

What data do I really need?

  • Scores
  • Submitted_at
  • Or graded_at
  • Gradebook and leaderboard don’t need dates
  • Experience and progress chart do need dates

@japomani
Author

japomani commented May 10, 2017

Here are some notes on my initial tests. It looks like requesting individual student analytic data is going to be fastest. We might get throttled in parallel, I'll have to check it out.

Ways to ask for submissions

  • All assignments for all students
    • Synchronous returns of pages
    • 154 pages, page load on a not busy day is 6-20 seconds, if average is 13 seconds, will take 33 min to load, would be longer on a busy canvas day
    • includes submitted_at and graded_at
  • All assignments for some students
    • From 1 to 100?
    • What is optimum # of students in a request
    • Make asynchronous requests
    • 1 student takes 9 seconds for first page and 3 for second
    • in sequence, this would take about 30 min
    • In parallel it would take about 12 seconds, depending on throttling
    • 21 students had 21 pages, average 10 seconds each, or 3.5 min for each batch
  • All students for a single assignment
    • Asynchronous requests for each assignment
    • There are 133 assignments
    • 2 pages for each assignment, page loads between 3-6 seconds
    • Both submitted_at and graded_at are there, but graded_at seems odd
    • In sequence, this would take about 10 min
    • Could do it in parallel and finish in 10 seconds but might get throttled. Be sure to accommodate throttling so we can ask for it again if the request is rejected

Ways to ask for analytics

  • Course level assignment summary data
    • Doesn’t really have all the data I need, missing scores for assignments, I think?
    • Get course-level assignment data
    • No score data, won't work
  • Course level student summary data
    • May have scores, it is hard to tell from example
    • Get course-level student summary data
    • No score data, won't work
  • User in a course level assignment data
    • Has submission times and scores
    • One student at a time
    • Might be faster than submissions one user at a time
    • Get user-in-a-course-level assignment data
    • Has score and submitted_at, but not graded_at; could be used for gradebook and leaderboard!
    • Would work for experience tracker and progress chart if we could figure out what to do with null values for submissions. Null happens when a teacher just inputs a score in the gradebook without a submission; in that case we have been using graded_at as a surrogate, but that probably isn't a good idea.
    • We used that surrogate to solve manually adding attendance scores, but now we have a tool for that that does create submission dates, so this is kind of rare. We would probably only use that tool when everything is dynamically graded.
    • A better way to handle a null submitted_at might be to just set the date to now to pull it out, but that would break the algorithm.
    • Very fast, .5 second, would take about 1 min or so to do all students sequentially, in parallel, it would take .5 seconds for all, as long as we don't get throttled
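The throttling note above ("be sure to accommodate throttling so we can ask for it again if the request is rejected") can be sketched as a retry wrapper. The rejected-request signal is abstracted as an exception here, since the exact status code and header Canvas returns are not covered in this thread:

```python
import time

class Throttled(Exception):
    """Stand-in for whatever signal indicates a rejected request."""

def with_retry(request, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call request() and back off exponentially on throttling.

    Delays grow as base_delay * 2**attempt; the last failure is
    re-raised so the caller can surface a persistent problem.
    """
    for attempt in range(attempts):
        try:
            return request()
        except Throttled:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

In practice each parallel worker would wrap its HTTP call in `with_retry`, so a transient rejection delays one assignment's fetch instead of failing the whole sync.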

@japomani
Author

japomani commented Jun 6, 2017

Cody, thank you SOOO much for your help! Using some of your ideas we were able to reduce a 40 min request to 27 seconds! It is amazing! You've been a wonderful help and I appreciate it. Hopefully I'll run into you at hack night and can thank you in person!

@stale

stale bot commented Jun 6, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the No Activity label Jun 6, 2019
@stale stale bot closed this as completed Jul 6, 2019