"list submissions for multiple assignments" API fails on large data sets #1033
Comments
I'm not calling your entire request invalid, but I do want to point out that several items you state as fact, or assumed fact, are untested. They may well turn out to be true, but since you obviously have not tested them, they could also be completely wrong.

As for workarounds for now: it seems that you're blindly requesting every submission for every student for every assignment in a course. I highly recommend that you iterate and request submissions one assignment at a time instead (more on that below).

Oh, and on the subject of paging, your proposed submitted_after/graded_after parameters definitely complicate things. If a submission is modified between requests and would end up in the first page of results due to sorting, you may never see that change. This isn't insurmountable (by forcing a specific sort order when you use these parameters, and/or by using continuation-based paging rather than simply skipping X records), but it does make your feature request more complicated than you make it out to be. It should also be pointed out that the current incarnation of the API is subject to similar race conditions; it's just less likely to become a perpetual error, because you'll see the change the next time you fetch all of the data.
Cody, thank you for such a thoughtful reply! I accept your push back on all points : ) Testing would really be useful to clarify my assumptions.

My biggest assumption is that the largest bottleneck is trying to send 10,000 records at once via HTTP, with paging turning that into 100 requests (these numbers are rounded, but not far off from a real use case I've been working with). This assumption is based on my experience, but I may be misinterpreting what is actually happening. Here is what I experience: at the beginning of the semester, when there are only a few submissions, this request completes in around 10-30 seconds. As the semester continues and the number of submissions increases, so does the wait time. By the end of the semester I wait 30 minutes or more to get the complete set of data, and unless I get up very early in the morning to try, the request often fails or stops paging partway through, so I only get some of the data. (That is what I mean by corrupting the data: the data no longer accurately represents what is in Canvas.)

Reading your note, you seem to be mostly talking about a single request lasting 10 minutes. That is not what I'm talking about; I never have that problem. I think we are on the same page in this sentence:

A single page always returns quickly; the challenge comes when there are 100 pages of data. If each page takes 6 seconds to reach me, that is 10 minutes. If I could request only NEW data, rather than ALL data, several times a day, I think it would only be one or two pages. That would reduce the load time from tens of minutes to about 20 seconds or so. Of course, that is an assumption, and I can't currently test it, but it seems reasonable based on how the API performs at the beginning of the semester when there is not much data.

If I read your note right, you suggest that this might not really be the bottleneck, and that the bottleneck could instead be on the Ruby side as it queries the DB and builds the objects, is that right? Would it still be an issue if the initial DB query only asked for a subset of the data based on a date/time stamp? Instead of requesting 10,000 records it might ask for just 50 or 100, only the new records since the last time I asked. That data object would be relatively small.

I'm not sure I understand how your workaround helps. In components like a gradebook, leaderboard, or troubled-student dashboard, I need all submissions, for all assignments, for all students, in order to compute and display the required information. I can ask for that using "list submissions for multiple assignments", but it can take a long time to arrive with large data sets. How will asking for the same data organized by assignment ID help? Is the code optimized for that request?
That is true, but without graded_after or submitted_after or something like that, I have no way to know which data I have already asked for and no way to ask for just the new data. I would prefer to only ask for the data I don't already have, but unless I misunderstand how the submissions API works, it is all or nothing: if I want the most recent records, I have to ask for all the records. Do I have that wrong?

Feel free to push back on any of this. I want to make sure we are talking about the same things, and I find it helpful when you challenge my thinking. Thank you!
I'm pretty sure the bottleneck is with large result sets (ignoring pagination), because we have to query all of the data from the database, construct Ruby objects, and then do the pagination by slicing that result set (after which we format the API response, which can also be intensive depending on what additional data is needed). I could be wrong, though, about where the slicing takes place in this particular request. Requests that span multiple object types (like assignments and students, as this one does) sometimes don't have a good way to push the pagination down into the database. Reducing the size of the initial query (via something like graded_after) would definitely reduce the initial set of data being queried, and the savings would cascade to all other steps in the request.
Say you have 100 students and 100 assignments, for 10,000 records. You're saying that when you request all 10,000 records (split into 100 pages), the requests take ~6s each, for a total of 600s. I'm suggesting that instead of requesting everything at once, you iterate and make one request per assignment. With 100 students there would be no pagination at all, and it's possible-to-likely that each of these requests would be faster, maybe 1-2s each, for a total of around 200s. These numbers could be way off. Maybe they take 0.5s each. Or maybe they take 4s, and due to the layout of assignments vs. students you end up having to make 200 requests, for a total of 800s. It's just an avenue to explore.

The other benefit to splitting by assignment is that you could issue multiple requests in parallel, up to one per assignment. Even if you only do 4 in parallel, that could take a good-case 200s down to 50s. You'll have to be careful with throttling, though. It's incredibly unlikely to hit throttling with one request at a time, but parallel requests definitely increase the odds. A bit of experimenting will hopefully help you find a number of parallel requests that avoids throttling; just be sure to handle it in your code, so that if Canvas is being slower than normal and a request gets rejected, you retry.
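A minimal sketch of the per-assignment strategy above, with the retry-on-throttle handling it recommends. The base URL and token are placeholders, and the assumption that Canvas signals throttling with an HTTP 403 is mine; adjust to whatever your instance actually returns. The `fetcher` parameter exists so the fan-out logic can be exercised without a live Canvas instance.

```python
# Sketch: fetch submissions one assignment at a time, a few in parallel,
# retrying with backoff when a request is rejected. BASE_URL and TOKEN
# are placeholders you would supply yourself.
import json
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "https://canvas.example.edu/api/v1"  # hypothetical instance
TOKEN = "<your API token>"

def fetch_assignment_submissions(course_id, assignment_id, retries=3):
    """One request per assignment; back off and retry if throttled (assumed HTTP 403)."""
    url = (f"{BASE_URL}/courses/{course_id}/assignments/"
           f"{assignment_id}/submissions?per_page=100")
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {TOKEN}"})
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(req) as resp:
                return json.load(resp)
        except urllib.error.HTTPError as err:
            if err.code == 403:           # likely rate limited
                time.sleep(2 ** attempt)  # exponential backoff, then retry
                continue
            raise
    raise RuntimeError(f"assignment {assignment_id}: throttled after {retries} tries")

def fetch_all(course_id, assignment_ids, workers=4,
              fetcher=fetch_assignment_submissions):
    """Run up to `workers` per-assignment requests in parallel; flatten the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        batches = pool.map(lambda a: fetcher(course_id, a), assignment_ids)
    return [sub for batch in batches for sub in batch]
```

Starting with `workers=4` and raising it gradually, while watching for rejected requests, is the experiment suggested above.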
One more thought:
Ya, I've thought about this. I might not get data that updates during my request, but I will get it on the next request. Let's say I ask for all the submissions at 12:00 and record that time as my "fresh data" time, but I don't actually get the data back until 12:03, and some data came in at 12:01: I won't have all the data yet. But the next time I ask for data, say at 12:10, I can ask for everything since my last "fresh data" time of 12:00, and the data I missed will come over. I might have to check for duplicates to keep my data clean. In this way, some sorting in the query makes sure I can grab all the data.
I like it! I'll give it a try and see if that lowers the time. Thanks so much for looking this over! I appreciate your help, Cody!
Here are the experiments that I am going to set up.

Ways to ask for submissions

Be sure to accommodate throttling so we can ask for it again if the request is rejected.

Ways to ask for analytics

What data do I really need?
Here are some notes on my initial tests. It looks like requesting individual student analytics data is going to be fastest. We might get throttled in parallel; I'll have to check it out.

Ways to ask for submissions

Ways to ask for analytics
Cody, thank you SOOO much for your help! Using some of your ideas, we were able to reduce a 40-minute request to 27 seconds! It is amazing! You've been a wonderful help, and I hope I run into you at hacknight so I can thank you in person!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Summary:
Any tool that requires all submissions for all students will not scale: the request often fails to complete and/or returns corrupt data. This is especially true in large courses, courses with many assignments, or when several courses are cross-listed into the same shell.
Steps to reproduce:
Expected behavior:
The “list submissions for multiple assignments” API should include submitted_after[] and graded_after[] parameters. These would allow developers to request only submission data that is new since the last request, and store updates locally. This dramatically reduces the amount of data requested and would lower page load times from tens of minutes to seconds.
Additional notes:
Issue: There is a whole class of LTI tools that cannot be built due to the current design of the submissions API: custom gradebooks, course dashboards, early detection of struggling students, leaderboards, gamification, etc., in large courses, courses with many assignments, or when several courses are cross-listed into the same shell.
The current Submissions API returns ALL submissions that fit the requested parameters, even if we've requested that data before. In large classes with many assignments, this quickly becomes a very large number of submissions; my current class reaches over 10,000! In addition, the Canvas API only returns 100 submissions per page, which means a large submission request is split into as many as 100 API requests or more for EACH page load. The Canvas paging system does not allow asynchronous requests: each page must be delivered before the next can be requested, sequentially. Also, in large classes these requests occur many times a day, and I suspect that so many page requests from the same course trigger API throttling, which slows the requests further.
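The sequential paging described above can be sketched as follows. Canvas advertises the next page in each response's `Link` header, so page N+1 is only reachable after page N has been parsed, which is why the pages cannot be fetched concurrently. `get_page` stands in for the actual HTTP call and returns the JSON body plus the `Link` header; the header parser below is a simplified sketch that assumes no commas inside the URLs.

```python
# Walking Canvas-style Link-header pagination, strictly one page at a time.
import re

def parse_next_link(link_header):
    """Pull the rel="next" URL out of a Link header, or return None."""
    for part in (link_header or "").split(","):
        match = re.search(r'<([^>]+)>;\s*rel="next"', part)
        if match:
            return match.group(1)
    return None

def fetch_all_pages(first_url, get_page):
    """get_page(url) -> (list_of_records, link_header). Serial by necessity."""
    records, url = [], first_url
    while url:                              # each page yields the next page's URL
        body, link_header = get_page(url)
        records.extend(body)
        url = parse_next_link(link_header)
    return records
```

Since each iteration depends on the previous response's header, 100 pages means 100 round trips in a row, which is exactly the latency problem the issue describes.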
The result is that API requests for all submissions in large courses, courses with many assignments, or several courses cross-listed into the same shell can take tens of minutes! Often the request fails to complete and/or returns corrupt data.
The traditional solution for improving API performance is to store data locally. However, that approach only works for data that does not change often, and submission data is highly dynamic. There is also no way to know whether new data is available, so fresh data (the entire data set) must be requested each time a user wants to see the current status. Again, this can take tens of minutes and may be a significant drag on Canvas servers. Another approach is to use the Live Events API to collect data locally as it is generated, but it is not very useful for this use case. First, Live Events returns data at the account level, not the course level; to get data for one class, a developer must request data for their ENTIRE university, which is overkill and raises all kinds of security issues. Second, the Live Events API is still in beta and is not generally available.
Solution: A simple solution for this problem is outlined below. By adding submitted_after[] and graded_after[] parameters to the “list submissions for multiple assignments” API, developers could request only submission data that is new since the last request, and store updates locally. In a large course, data would be requested multiple times a day, so the local copy stays current. This reduces the size of any individual submissions API request and dramatically reduces the amount of data requested from Canvas servers for a course. It would lower page load times from tens of minutes to seconds and make a variety of innovative components possible! I think coding this change to the API will likely be trivial (famous last words : ), requiring only a couple of if statements and WHERE clauses on the DB query.
Request Parameters:
Parameter - submitted_after[]
Type - Date/Time
Description - Return only/all submissions with submitted_at dates greater than a specified date / time
Parameter - graded_after[]
Type - Date/Time
Description - Return only/all submissions with graded_at dates greater than a specified date / time
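To make the request shape concrete, here is how a client might use the parameters above. Note that submitted_after[] and graded_after[] are *proposed* and do not exist in the Canvas API today; the endpoint path is the existing "list submissions for multiple assignments" route, and the timestamp format (ISO 8601) is an assumption consistent with the rest of the Canvas API.

```python
# Hypothetical usage of the PROPOSED parameters. Only records submitted
# or graded after the caller's last "fresh data" time would be returned.
from urllib.parse import urlencode

def incremental_query(course_id, since_iso8601):
    """Build the URL a client would use to ask only for new submission data."""
    params = urlencode({
        "student_ids[]": "all",
        "submitted_after[]": since_iso8601,   # proposed parameter
        "graded_after[]": since_iso8601,      # proposed parameter
        "per_page": 100,
    })
    return f"/api/v1/courses/{course_id}/students/submissions?{params}"
```

A client would pass its previous watermark as `since_iso8601`, merge the (small) result into its local store, and advance the watermark, instead of re-fetching all 10,000 records.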