This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Unable to parse SAML2 response: Unsolicited response #7056

Closed
babolivier opened this issue Mar 10, 2020 · 25 comments · Fixed by #8248
Assignees
Labels
z-bug (Deprecated Label)

Comments

@babolivier
Contributor

Sometimes, when authenticating with passwordless login on Mozilla's SSO, the user's browser gets told to POST to /authn_response with a SAML AuthN response (as expected), but that call seems to fail with the error "Unable to parse SAML2 response: Unsolicited response: id-XXXXXXXXXXXXX".

I'm currently not sure why this happens.

@richvdh
Member

richvdh commented Mar 17, 2020

possibly we're hitting /authn_response twice for the same request somehow?

@clokep
Contributor

clokep commented Apr 23, 2020

Is there any consistent way to reproduce this or just logging in a bunch of times?

@jaywink
Member

jaywink commented Apr 23, 2020

Is there any consistent way to reproduce this or just logging in a bunch of times?

If it helps, there are a bunch of these daily in the Modular Synapse Sentry.

@clokep
Contributor

clokep commented May 22, 2020

Summary:

I think this might only be an issue in worker mode, due to requests coming back to a different worker (which uses in-memory storage and so has no knowledge of the original request). I see a few ways forward:

  1. Persist the information into the database so all workers have it.
  2. Disable the checking of unsolicited SAML responses.
  3. Ensure that SAML requests and callbacks always go to the same worker (#7530 has some thoughts on that).

More details

I think #7530 might actually be a duplicate of this!? My thought is the following happens:

  1. User does something to initiate a SAML flow.
  2. Synapse returns a redirect to the IdP and stores the SAML session ID in memory.
  3. The user goes through the SAML flow, blah blah.
  4. The user gets directed back to Synapse, but a different worker. 💥 You get the error from above.

Note that the SAML response is valid, just that the worker knows nothing about it.

I suspect the solution is to store this information in the database, similar to what we did for #6877.

There's already a table (user_external_ids) which has auth_provider, external_id, user_id as columns. We could likely have a similar table with auth_provider, request_id, creation_time, ui_auth_session_id as columns.
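As an illustration, such a table might look like the following (a sketch only; the table name and exact column types are assumptions, not an actual Synapse schema delta):

-- Hypothetical table for persisting outstanding SAML sessions across workers.
CREATE TABLE saml2_outstanding_sessions (
    auth_provider TEXT NOT NULL,        -- e.g. "saml", in case other providers need it later
    request_id TEXT NOT NULL,           -- the AuthnRequest ID the response must match
    creation_time BIGINT NOT NULL,      -- ms since epoch, used to prune expired sessions
    ui_auth_session_id TEXT             -- the UI auth session to resume, if any
);

CREATE UNIQUE INDEX saml2_outstanding_sessions_request_id
    ON saml2_outstanding_sessions(request_id);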

I was curious how OIDC handled this, and that doesn't seem to persist anything in memory, so taking another look at why we have this _outstanding_requests_dict:

  • It is passed into pysaml2 and used to ensure that this is not an "unsolicited query" (you can disable this check, I'm not sure if we should however).
  • It is used to get back to the UI auth session ID after the redirection is all done.
  • (We also prune items from this list after a period of time, which is why we store creation time on it.)

Note that we could actually pass the UI auth session ID in the RelayState (since that's unused for UI auth, see #7484), so my question is: do we need to protect against unsolicited requests like this? I'm unsure of the security ramifications of disabling this.
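For illustration, here is a rough sketch of how the outstanding-requests dict feeds pysaml2's unsolicited-response check, and how the UI auth session ID could ride along in RelayState instead (this is not Synapse's actual code; the SP config path and variable names are assumptions, and error handling is elided):

from saml2 import BINDING_HTTP_POST
from saml2.client import Saml2Client
from saml2.response import UnsolicitedResponse

# SP configuration module; the path here is a placeholder.
client = Saml2Client(config_file="sp_conf")

# Redirect time: record the request ID so the eventual response can be matched.
# Putting the UI auth session ID in RelayState is the alternative discussed above.
reqid, info = client.prepare_for_authenticate(relay_state="some-ui-auth-session-id")
outstanding = {reqid: "some-ui-auth-session-id"}  # plays the role of _outstanding_requests_dict

# /authn_response time: pysaml2 checks InResponseTo against `outstanding` and,
# unless allow_unsolicited is set, rejects unknown IDs -- which surfaces as the
# "Unsolicited response: id-..." error in this issue.
saml_response_b64 = ...  # the base64 "SAMLResponse" form field from the POST
try:
    authn_resp = client.parse_authn_request_response(
        saml_response_b64, BINDING_HTTP_POST, outstanding=outstanding
    )
except UnsolicitedResponse:
    pass  # session unknown: expired, already used, or only known to another worker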

I'd be curious if other people have ideas on how to fix this!

@richvdh
Member

richvdh commented May 23, 2020

so my question is: do we need to protect against unsolicited requests like this?

I've never really understood its purpose. Possibly a defence in depth against CSRF attacks? We can probably remove it if that solves any problems.

However, I'm not sure that the hypothesis fits the symptoms. It's reported against mozilla's deployment of synapse, which only has one of each type of worker (except synchrotrons). If this were a problem with requests going to different workers, I would expect it to either always work or never work. I can't see how we'd end up with an intermittent bug.

@clokep
Contributor

clokep commented May 26, 2020

However, I'm not sure that the hypothesis fits the symptoms. It's reported against mozilla's deployment of synapse, which only has one of each type of worker (except synchrotrons). If this were a problem with requests going to different workers, I would expect it to either always work or never work. I can't see how we'd end up with an intermittent bug.

A similar symptom would happen during a restart of services, I'm unsure how often that would happen on Modular instances.

@clokep
Contributor

clokep commented Jun 1, 2020

It might also be worth fixing the known situation and seeing if it still happens, but ideally we'd want to ensure the solution works in all cases...

@richvdh
Member

richvdh commented Jun 1, 2020

possibly. I'm hoping we'll be able to get some logs out of the modular instance to help understand what is going on.

@richvdh
Member

richvdh commented Jun 2, 2020

ok well, I got logs for two instances of this error this morning on the mozilla instance.

The first one is a complete mystery tbh. A client, from an IP address we've never seen before, suddenly pops up with a SAML session ID we've never heard of (or at least, I couldn't find in some brief grepping of the logs). I guess it's just an old session, and the user used an old link in their browser history or something. The main source of regret here is that the error message isn't better ("oops something went wrong" isn't terribly informative.)

The second one is much clearer: the user took 6 minutes to validate their email address and come back. We expire the SAML session dict after only 5 minutes. Particularly given auth0's email validation links are valid for 15 minutes, this seems... silly.

@clokep
Contributor

clokep commented Jun 2, 2020

I wonder if the expiry time should be configurable?

@richvdh
Member

richvdh commented Jun 2, 2020

it is. But I think the default is probably too short.

@clokep
Contributor

clokep commented Jun 2, 2020

Looks like this is already configurable, the default is 5 minutes:

# The lifetime of a SAML session. This defines how long a user has to
# complete the authentication process, if allow_unsolicited is unset.
# The default is 5 minutes.
#
#saml_session_lifetime: 5m

Edit: Doh, you already said it is configurable. 😢
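For reference, raising the lifetime to cover auth0's 15-minute email validation window would look something like this in homeserver.yaml (assuming the option sits under saml2_config as in the generated sample config above; the value is just an example):

saml2_config:
  # Give users longer to complete the flow, e.g. slow email validation.
  saml_session_lifetime: 15m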

@clokep
Contributor

clokep commented Jun 9, 2020

I put up #7664 to increase the timeout. Might not be an ideal solution, but should fix a concrete case we've seen.

@clokep
Contributor

clokep commented Jun 25, 2020

This is happening much less after the changes in #7664. I'm not sure whether these are people taking longer than the 15 minutes to finish validation or not. I'm unclear what the next steps might be here: try to improve the error message, maybe?

@richvdh
Member

richvdh commented Jun 29, 2020

are we still getting reports of this? I'd be inclined to close it if not.

Otherwise yes, probably need to remember where the "oops something went wrong" error message is coming from and try to make it give more clues as to what went wrong.

@babolivier
Contributor Author

Yes, it looks like we're still seeing this (around 5-10x/day on Modular).

@richvdh
Member

richvdh commented Jun 29, 2020

gosh. it was only a couple a day back when I investigated a few weeks ago (mind you, there was some brokenness in logging at the time).

ok then I would like to suggest a two-pronged approach:

  • investigate the logs for a representative sample of the failures to see if we can understand why they are continuing to fail
  • assuming it turns out that it is just lots of people turning up with old saml session ids, try to improve the error handling.

@babolivier
Contributor Author

it was only a couple a day back when I investigated a few weeks ago

Sentry seems to be bucketing some of these separately; in this case I looked at two separate issues that each had roughly 2-5 occurrences per day. Maybe that explains why you were seeing fewer of them?

@clokep
Contributor

clokep commented Jul 7, 2020

investigate the logs for a representative sample of the failures to see if we can understand why they are continuing to fail

I spent some time with these logs and with Sentry and couldn't really figure out if there was a correlation between old requests or something else happening.

I think improving the error handling might be useful; I'm guessing the concern with that is that we might be missing a "real" bug?

@clokep
Contributor

clokep commented Aug 31, 2020

Now that we have better logging I looked back over the last 7 days of this error occurring on the Mozilla instance:

Note that we remove the outstanding request once a response for it is received -- this seems correct, but I'm unsure if SAML allows for a single session to be completed multiple times (assuming that they are all within the proper timeout and such).
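For illustration, that remove-on-completion behaviour means a second POST with the same session ID looks unsolicited, roughly like this (hypothetical names, not Synapse's actual code):

# The request ID is recorded when the user is redirected to the IdP...
outstanding_requests = {"id-abc123": {"ui_auth_session_id": "xyz"}}

# ...and popped when the first response for it is handled:
outstanding_requests.pop("id-abc123", None)   # first POST: found, login proceeds

# A replayed or refreshed POST with the same ID now finds nothing, so the
# response is treated as unsolicited even though it is otherwise valid.
outstanding_requests.pop("id-abc123", None)   # second POST: returns None -> error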

I'm not sure what, if anything, should be done to handle these cases? Maybe we can improve the error page to say something like "Your SAML session might have timed out or already been completed. Please try again." Or something to that effect?

@richvdh
Member

richvdh commented Sep 1, 2020

do we have any idea why people would re-use the SAML session ID?

Improving the error text seems sensible either way.

@clokep
Contributor

clokep commented Sep 1, 2020

do we have any idea why people would re-use the SAML session ID?

My guess is that it is due to reloading a page? Or if e-mail verification is in the workflow it could be clicking on a link twice? I should note that the "re-used" SAML session IDs were within the 15 minute timeout period (and all from 2 users).

@clokep
Contributor

clokep commented Sep 3, 2020

Since I couldn't remember the behavior the user saw here: they currently just get an internal server error sent back to them (since it is part of the redirect flow, the client isn't involved).

Steps to reproduce this sanely:

  1. Configure your system for SAML.
  2. Start a login with SSO (click "Sign in with single sign-on").
  3. Restart Synapse (this forcefully clears all known SAML sessions).
  4. Finish the login via the SSO flow.

You end up at a white page that says "Internal server error".

When adding the OpenID code we added a template for handling some errors; I think we should re-use that here.
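Something along these lines, as a sketch of a friendlier page (the template and variable names are assumptions, not necessarily what Synapse's actual templates use; the wording follows the suggestion earlier in this thread):

<!-- Hypothetical SSO/SAML error template -->
<!DOCTYPE html>
<html lang="en">
  <head><title>Authentication failed</title></head>
  <body>
    <p>
      Your login session could not be found. It may have timed out or already
      been completed; please go back to your client and try signing in again.
    </p>
    {% if error_description %}
      <p>Details: {{ error_description }}</p>
    {% endif %}
  </body>
</html>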

@clokep
Contributor

clokep commented Sep 3, 2020

Ah this isn't entirely accurate -- we do have a saml_error.html page, but this seems to work by parsing the returned exceptions, which isn't ideal.

@babolivier
Contributor Author

Ah this isn't entirely accurate -- we do have a saml_error.html page, but this seems to work by parsing the returned exceptions, which isn't ideal.

Note that this is the only way we have to process and display errors from Auth0, which is why it works like that :/
