Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hotfix: LTI grade passback #1464

Merged
merged 3 commits into from Aug 9, 2021
Merged

Conversation

taniwallach
Copy link
Member

Bug: When $lti_check_prior is false, then the generation of the time-dependent portion of the nonce is not occurring, as it was incorrectly moved inside the if ( $lti_check_prior )block in the version in #1177. (My fault.)

This bug is causing grade passback to fail when $lti_check_prior is not enabled after the first grade-passback call until the LMS expires the nonce which is constantly being reused.

Reports:


Fix: Move the generation of $uuid_p2 back where it belongs (outside the if block).


History: The code has once done it in the correct location (see: https://webwork.maa.org/moodle/mod/forum/discuss.php?d=4906#p14846 and https://webwork.maa.org/moodle/mod/forum/discuss.php?d=4770#p14184).

For some reason when the $lti_check_prior control switch was added, the code ended up inside the if block. See for example the file posted in https://webwork.maa.org/moodle/mod/forum/discuss.php?d=4770#p15256 .

Apparently that version was only tested sufficiently when $lti_check_prior=1 was set.

when $lti_check_prior is enabled. This bug was causing grade passback
to fail when $lti_check_prior was not enabled after the first call until
the LMS expired the nonce which was being reused.
@taniwallach
Copy link
Member Author

Note: In the second commit I am increasing the threshold which determines when a time difference between the WeBWorK server and the LMS is reported, as there are reports of slow Wifi connections leading to differences above 5 seconds but under 10 seconds. This is intended to reduce false alarms about the time-difference issue.

@Alex-Jordan
Copy link
Contributor

Thanks @taniwallach I will apply this to the Runestone server and ask the two cases I know of to see if things change.

@taniwallach taniwallach mentioned this pull request Jul 28, 2021
@taniwallach
Copy link
Member Author

@Alex-Jordan - Thanks.

Lots of credit to Larry Riddle whose debug data was what got me to see what is going wrong.

@Alex-Jordan
Copy link
Contributor

At least one time, there was a report of an 11 second difference between a WW clock and an LMS clock. As far as I can tell, it is a harmless difference. The WW clock is synced to NIST. I can't say what's going on with the LMS clock, but this was from some company that manages Canvas installations for multiple schools.

It's only one data point, but is it enough to suggest raising that tolerance to something like 15 seconds?

@taniwallach
Copy link
Member Author

At least one time, there was a report of an 11 second difference between a WW clock and an LMS clock. As far as I can tell, it is a harmless difference.
...
It's only one data point, but is it enough to suggest raising that tolerance to something like 15 seconds?

Done. At https://webwork.maa.org/moodle/mod/forum/discuss.php?d=4770 Larry Riddle reported that "Canvas only allows for 1 minute ahead and 5 minutes behind." so it certainly suffices to try to detect that an issue may occur when the difference is somewhat larger, but not too close to say 50 seconds (so reports can be made of the warning before the drift passes the threshold for causing real trouble).

Copy link
Contributor

@Alex-Jordan Alex-Jordan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have one report from a school using Runestone hosting and Canvas as the LMS that this has fixed grade passback for them.

With your understanding of all of this, would you expect a grade passback issue with LMSs other than Canvas? My own school uses D2L Brightspace. I just tested grade passback without this patch, and it succeeds. I am guessing that Canvas's nonce handling is just more thorough/secure?

@taniwallach
Copy link
Member Author

Canvas's LTI is certainly quite strict about enforcing the non-reuse of the nonce. That seems be a reason why many of the reports of "duplicate nonce" problems were from Canvas users.

Larry Riddle also reported that the fix works for him with the Canvas LMS: https://webwork.maa.org/moodle/mod/forum/discuss.php?d=5628#p16376

I think these 2 reports justify merging this hotfix a soon as reasonably possible (and after some people look at the code change and agree to the fix).


@Alex-Jordan Your report about D2L Brightspace working without the patch seems to indicate that it is not enforcing the non-reuse of nonces. Just to be certain, you did test the pre-hotfix code when $lti_check_prior=0 was set, correct?

The thread at https://webwork.maa.org/moodle/mod/forum/discuss.php?d=5002 refers to testing a version of SubmitGrade.pm which is essentially identical to that in #1177 (other than 2 white-space characters) and Bianca Sosnovski reported that it was working well with BlackBoard. It is not clear if she had set $lti_check_prior=1 which would have avoided the bug, or whether BlackBoard is also lax about enforcing non-reuse of nonces.

About other LMS's: It depends on if and how well the LMS enforces non-reuse of nonces when processing grade submissions from LTI.

It is possible that Moodle, at least in my institution, may also not really enforce the non-reuse of the nonces properly, which could explain having failed to detect the bug during testing. So is the possibility that the testing of the "final version" simply was not thorough enough.

I do recall testing the code from #1177, but it is long enough ago - that I cannot recall the details of the tests I did. The course.conf file in the test course I used has $lti_check_prior=1 left in it at the end of the testing, and I cannot be certain how well I tested when it was set to 0. I'm pretty certain that I would not have skipped doing some testing with that setting also, but as there were several generations of changes to SubmitGrade.pm over time, it could be that I did not test the final version as thoroughly as I should have. The bug would not occur for a single grade passback test, only for a second one done with $lti_check_prior=0 and occurring sufficiently soon after the prior one for the LMS to consider a nonce-reuse as invalid. Again, it could be that our local Moodle is just not enforcing the non-reuse of nonces, so would not pick up on the bug.

I just checked on an additional server I run on which grade-passback is used quite heavily. It is still on WW 2.15 and has a somewhat older version of SubmitGrade.pm from before the $lti_check_prior control switch was added, so the bug could not have effected it.

My main server had the newer code since the middle of January 2021, and there were no reports of problems. However, I looked now and found that there were no Spring semester courses using grade-passback, and the change might have been made once the Winter semester courses using grade-passback were essentially over.

@Alex-Jordan
Copy link
Contributor

Just to be certain, you did test the pre-hotfix code when $lti_check_prior=0 was set, correct?

Yes, assuming that is something I can set to 0 at the very end of course.conf and it would still take effect.

@taniwallach
Copy link
Member Author

Just to be certain, you did test the pre-hotfix code when $lti_check_prior=0 was set, correct?

Yes, assuming that is something I can set to 0 at the very end of course.conf and it would still take effect.

Yes - setting it in course.conf takes effect.

@taniwallach
Copy link
Member Author

Tim Payer also confirmed that this fix solves problems with Canvas: https://webwork.maa.org/moodle/mod/forum/discuss.php?d=5628#p16383

I think we should try to merge this soon, as the bug causes real problems for grade-passback for Canvas.

@drgrice1
Copy link
Sponsor Member

drgrice1 commented Aug 3, 2021

There is something not right with how this is implemented. I clearly didn't review this very carefully before. The TIME_DIFF_THRESHOLD constant should not exist. There is a $NonceLifeTime variable defined in authen_LTI.conf that should be used instead. This is used for the actual verification (in the ok subroutine of LTIAdvanced.pm), and should be used for this warning as well. That would make actual errors and this warning consistent. This code that gives this warning should also be in the same location (in the ok subroutine).

@taniwallach
Copy link
Member Author

There is something not right with how this is implemented. I clearly didn't review this very carefully before. The TIME_DIFF_THRESHOLD constant should not exist. There is a $NonceLifeTime variable defined in authen_LTI.conf that should be used instead. This is used for the actual verification (in the ok subroutine of LTIAdvanced.pm), and should be used for this warning as well. That would make actual errors and this warning consistent. This code that gives this warning should also be in the same location (in the ok subroutine).

The purpose of the TIME_DIFF_THRESHOLD was to catch problems with time differences which trigger problems on grade-passback, which follow the rules on the LMS side, so do not depend on $NonceLifeTime which is something each site or even course can adjust.

At https://webwork.maa.org/moodle/mod/forum/discuss.php?d=4770 Larry Riddle reported that "Canvas only allows for 1 minute ahead and 5 minutes behind" which was a cause of major problems with grade passback. The shorter TIME_DIFF_THRESHOLD was intended to be helpful in (1) determining when time differences were causing the "duplicate nonce errors" being reported, and (b) to pick up on clock drift early enough to prevent things from starting to break. As such, it also tests using the absolute value of the difference, while the code testing based on $NonceLifeTime is testing under the assumption of reasonably synchronized clocks and tests only for the time being too far in the past according to the WeBWorK server.

Whether this test should be where it is (near where $debug_lti_parameters provides the parameters provided during a connection attempt) or moved to ok is something we can discuss.

@drgrice1
Copy link
Sponsor Member

drgrice1 commented Aug 3, 2021

The thing is that that is not what it is actually doing. All the code that utilizes the TIME_DIFF_THRESHOLD is doing is checking the time difference in the same way already done with the NonceLifeTime variable, and then giving a warning if the time difference is more. Since the hard coded constant is 15 seconds and the default NonceLifeTime is 60 seconds this results in a warning being displayed when authentication succeeds. This is incorrect. Furthermore, this is not code that is really even related to grade passback, so it is not at all doing what you claim it is doing.

@taniwallach
Copy link
Member Author

The main code change in the PR fixes a serious bug in grade passback (which is my fault, having relocated an important line of code from an earlier version improperly). Getting that bug fix in, with or without adjusting if and how TIME_DIFF_THRESHOLD is used is important.

The change to TIME_DIFF_THRESHOLD in this PR is the insignificant change, and the value was significantly increased in the PR at @Alex-Jordan's request, in order to reduce the frequency of what are "false alarms" when the threshold is too small. Until we decide how to revise/replace/improve the code related to TIME_DIFF_THRESHOLD, I think "we" can live with the small change to the threshold here, or can even make the value somewhat larger.


About the the suitability of testing for the clock difference in a manner similar to what is being done now using TIME_DIFF_THRESHOLD and what sort of changes to make:

  1. During grade passback, the oauth_timestamp is generated on the webwork side and sent to the LMS, so trying to detect clock differences during grade passback is not possible. We can only detect the failure of the passback, but not what caused it.
  2. There is no feasible manner to truly compare clocks automatically, only to compare arriving oauth_timestamp to the local system clock.
  3. I disagree that the tests are identical:
    • The authentication test of if ($self->{timestamp} < time() - $ce->{NonceLifeTime}) will only fail when the oauth_timestamp from the LMS is too old. It will never pick up a case where the WeBWorK clock is significantly ahead of the LMS clock.
      • That is "old code" and maybe it is not a good thing that authentication will succeed when there is a significant clock difference in this direction, but that is the way things have been for a long time.
      • We can certainly decide to fix that in some manner or another, but I'm not sure the current hotfix proposal is the place to do so.
    • The ( abs($delta_time) > TIME_DIFF_THRESHOLD ) test will be true also when the WeBWorK time is ahead of the oauth_timestamp from the LMS, and that sort of difference can also cause failures of grade pass-back.
  4. Reporting "large" clock differences can also help avoid clock drift from reaching the point where authentication will fail due to the difference.

The purpose of the test using TIME_DIFF_THRESHOLD is to try to detect and report when there seems to be a reasonable chance that there is a significant clock difference between the server. It is being done where WeBWorK has access to a timestamp from the LMS side, and warning can be issued even if that difference is not "significant" enough to cause an authentication failure (accessing WeBWorK). In such cases, issuing a warning when authentication succeeds but the clocks seem likely to have a significant difference is desirable, in my opinion, even if somewhat annoying to students if it happens often.

Were we to change the triggering threshold to $NonceLifeTime then the chances of triggering a warning and the clock drift getting attention (if that is the problem) before grade passback started to fail (probably mainly to Canvas) are probably quite small, and the trigger condition would be something which varied from course to course depending on a user-determine value, even thought the WeBWorK to LMS grade-passback (at least for Canvas) has settings which are (typically) out of the WeBWorK user/administrators control.

If we want decide to change this, I can think of several approaches:

  1. Further increase the threshold to cut back on "false alarms" at the expense of deferring the detection of real clock drift problems for longer.
  2. Totally drop the test based on TIME_DIFF_THRESHOLD and give up on attempting to pick up and warn on suspected clock differences.
    • People whose sites run into recurring grade passback or authentication issues will hopefully find the information on the clock drift issue in the forums.
  3. Condition the ( abs($delta_time) > TIME_DIFF_THRESHOLD ) trigger on ($ce->{LTIGradeMode} eq 'homework') || ($ce->{LTIGradeMode} eq 'course') so that the warning will only be issued when grade-passback is in use (and as part of the full debugging output when $ce->{debug_lti_parameters} is enabled), as grade passback (to Canvas) seems to have more significant problems when clock drift occurs.
  4. Replace the hard-coded value with the value of a user controlled variable, provide a default setting in conf/authen_LTI.conf.dist and hope people use it properly.
  5. Add another setting variable (ex. $ce->{LTI_warn_about_clock_differences}) and condition on that, but keep a hard-coded threshold.

Lots of other options/combinations are possible.


I was curious to what extent there are forum posts about "Nonce Expired. Your NonceLifeTime may be too short"):

  • The only real hit was https://webwork.maa.org/moodle/mod/forum/discuss.php?d=4818
  • The author notes that he needed to raise $NonceLifeTime from 60 to 120. I suspect that such a large tolerance probably results from "clock drift" and not just from network latency or the like.
  • The other forum threads which mention this message seem relatively unrelated.

@drgrice1
Copy link
Sponsor Member

drgrice1 commented Aug 4, 2021

What you are trying to do is provide diagnostic information for a server that is not correctly configured. That may be the WeBWorK server or the Canvas server. The thing is that that is not WeBWorK's job. That information may be useful, and certainly is good to be provided if asked for, but absolutely should not be displayed in any case where it is not actually an issue, as is the case of a successful LTI authentication. This diagnostic information should only be displayed in the case that $ce->{debug_lti_parameters} is true, and not dependent on any constant (like TIME_DIFF_THRESHOLD) that has no actual significant meaning.

@taniwallach
Copy link
Member Author

I concede. The last commit drops the use of TIME_DIFF_THRESHOLD.

Copy link
Sponsor Member

@drgrice1 drgrice1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry to have been so stubborn about this, but displaying this warning when it is not an issue will make users think there is an issue. @Alex-Jordan and I both saw this warning when we had slightly slow internet connections. In that case the 15 second time diff would have been enough to prevent the warning, but 15 seconds will still not be enough in many cases. Many students have even slower connections and there are other things that can increase this time.

For instance, the way that Canvas works when you have an external tool set to open in a new window is that it presents a submit button to click on to open the external tool. The oauth timestamp (and all other necessary oauth parameters) are in hidden inputs in the form the submit button is contained in, and is generated at the time that the Canvas page is loaded. Canvas sets a timer and expires the button and form after a certain amount of time (this actually occurs after 2 minutes and 30 seconds) and removes it from the page, replacing it with a note to the user to reload the page to get a new button. So the user may wait and click the button after 30 seconds or so (or even up to 2 minutes and 30 seconds later). In that case the time difference that WeBWorK will see will definitely be more than 15 seconds. This obviously has nothing to do with network lag, and even if the now removed TIME_DIFF_THRESHOLD were set to 15 seconds this results in the warning being displayed.

In this case the time differential can be a combination of network latency when in the initial Canvas page loading, time the student takes to click on the button, as well as network latency in the WeBWorK page loading. This can easily exceed 15 seconds.

@taniwallach
Copy link
Member Author

@drgrice1 Thanks for the additional explanation. I was to stubborn to concede earlier - sorry. The modality of operation of Canvas for LTI in an external window is something of which I was not aware, and will certainly cause time differences which do not really relate to server clock differences.

Now that this has 2 approvals, I think it would be good to merge in the bug fix. It might make sense to leave all 3 commits so it is clear that the changes about TIME_DIFF_THRESHOLD are not really part of the grade-passback bug fix, but I'll leave the final call on squashing or not to whoever does the merge.

@drgrice1
Copy link
Sponsor Member

drgrice1 commented Aug 8, 2021

Unless there are objections, I will merge this tomorrow.

@Alex-Jordan
Copy link
Contributor

I have not done actual testing on the current version of this. Has anything meaningful changed with respect to addressing the grade passback bug? Or was the only thing that changed since I did testing the part about reporting the clock difference when it surpassed a threshold?

@drgrice1
Copy link
Sponsor Member

drgrice1 commented Aug 9, 2021

The only change was about reporting the clock difference. That is now only done if debugging is enabled.

@drgrice1 drgrice1 merged commit 4e195c9 into openwebwork:main Aug 9, 2021
@taniwallach taniwallach deleted the lti-submit-bug-fix branch July 15, 2022 13:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants