This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Unable to parse SAML2 response: Unsolicited response #7056

Closed
babolivier opened this issue Mar 10, 2020 · 25 comments · Fixed by #8248
Assignees
Labels
z-bug (Deprecated Label)

Comments

@babolivier
Contributor

Sometimes, when authenticating with passwordless login on Mozilla's SSO, the user's browser gets told to POST to /authn_response with a SAML AuthN response (as expected), but that call seems to fail with the error "Unable to parse SAML2 response: Unsolicited response: id-XXXXXXXXXXXXX".

I'm currently not sure why this happens.

@richvdh
Member

richvdh commented Mar 17, 2020

possibly we're hitting /authn_response twice for the same request somehow?

@clokep
Contributor

clokep commented Apr 23, 2020

Is there any consistent way to reproduce this or just logging in a bunch of times?

@jaywink
Member

jaywink commented Apr 23, 2020

Is there any consistent way to reproduce this or just logging in a bunch of times?

If it helps, there are a bunch of these daily in the Modular Synapse Sentry.

@clokep
Contributor

clokep commented May 22, 2020

Summary:

I think this might only be an issue in worker mode, due to requests coming back to a different worker (which uses in-memory storage and so has no knowledge of the original request). I see a few ways forward:

  1. Persist the information into the database so all workers have it.
  2. Disable the checking of unsolicited SAML responses.
  3. Ensure that SAML requests and callbacks always go to the same worker (#7530 has some thoughts on that).

More details

I think #7530 might actually be a duplicate of this!? My thought is the following happens:

  1. User does something to initiate a SAML flow.
  2. Synapse returns a redirect to the IdP and stores the SAML session ID in memory.
  3. The user goes through the SAML flow, blah blah.
  4. The user gets directed back to Synapse, but a different worker. 💥 You get the error from above.

Note that the SAML response is valid, just that the worker knows nothing about it.

I suspect the solution is to store this information in the database, similar to what we did for #6877.

There's already a table (user_external_ids) which has auth_provider, external_id, user_id as columns. We could likely have a similar table with auth_provider, request_id, creation_time, ui_auth_session_id as columns.
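As an illustration, such a table might look like the following (a sketch only; the table name and exact column types are assumptions, not an actual Synapse schema delta):

-- Hypothetical table for persisting outstanding SAML sessions across workers.
CREATE TABLE saml2_outstanding_sessions (
    auth_provider TEXT NOT NULL,        -- e.g. "saml", in case other providers need it later
    request_id TEXT NOT NULL,           -- the AuthnRequest ID the response must match
    creation_time BIGINT NOT NULL,      -- ms since epoch, used to prune expired sessions
    ui_auth_session_id TEXT             -- the UI auth session to resume, if any
);

CREATE UNIQUE INDEX saml2_outstanding_sessions_request_id
    ON saml2_outstanding_sessions(request_id);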

I was curious how OIDC handled this, and that doesn't seem to persist anything in memory, so taking another look at why we have this _outstanding_requests_dict:

  • It is passed into pysaml2 and used to ensure that this is not an "unsolicited query" (you can disable this check, I'm not sure if we should however).
  • It is used to get back to the UI auth session ID after the redirection is all done.
  • (We also prune items from this list after a period of time, which is why we store creation time on it.)

Note that we could actually pass the UI auth session ID in the RelayState (since that's unused for UI auth, see #7484), so my question is: do we need to protect against unsolicited requests like this? I'm unsure of the security ramifications of disabling this.
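For illustration, here is a rough sketch of how the outstanding-requests dict feeds pysaml2's unsolicited-response check, and how the UI auth session ID could ride along in RelayState instead (this is not Synapse's actual code; the SP config path and variable names are assumptions, and error handling is elided):

from saml2 import BINDING_HTTP_POST
from saml2.client import Saml2Client
from saml2.response import UnsolicitedResponse

# SP configuration module; the path here is a placeholder.
client = Saml2Client(config_file="sp_conf")

# Redirect time: record the request ID so the eventual response can be matched.
# Putting the UI auth session ID in RelayState is the alternative discussed above.
reqid, info = client.prepare_for_authenticate(relay_state="some-ui-auth-session-id")
outstanding = {reqid: "some-ui-auth-session-id"}  # plays the role of _outstanding_requests_dict

# /authn_response time: pysaml2 checks InResponseTo against `outstanding` and,
# unless allow_unsolicited is set, rejects unknown IDs -- which surfaces as the
# "Unsolicited response: id-..." error in this issue.
saml_response_b64 = ...  # the base64 "SAMLResponse" form field from the POST
try:
    authn_resp = client.parse_authn_request_response(
        saml_response_b64, BINDING_HTTP_POST, outstanding=outstanding
    )
except UnsolicitedResponse:
    pass  # session unknown: expired, already used, or only known to another worker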

I'd be curious if other people have ideas on how to fix this!

@richvdh
Member

richvdh commented May 23, 2020

so my question is: do we need to protect against unsolicited requests like this?

I've never really understood its purpose. Possibly a defence in depth against CSRF attacks? We can probably remove it if that solves any problems.

However, I'm not sure that the hypothesis fits the symptoms. It's reported against mozilla's deployment of synapse, which only has one of each type of worker (except synchrotrons). If this were a problem with requests going to different workers, I would expect it to either always work or never work. I can't see how we'd end up with an intermittent bug.

@clokep
Contributor

clokep commented May 26, 2020

However, I'm not sure that the hypothesis fits the symptoms. It's reported against mozilla's deployment of synapse, which only has one of each type of worker (except synchrotrons). If this were a problem with requests going to different workers, I would expect it to either always work or never work. I can't see how we'd end up with an intermittent bug.

A similar symptom would happen during a restart of services, I'm unsure how often that would happen on Modular instances.

@clokep
Contributor

clokep commented Jun 1, 2020

It might also be worth fixing the known situation and seeing if it still happens, but ideally we'd want to ensure the solution works in all cases...

@richvdh
Member

richvdh commented Jun 1, 2020

possibly. I'm hoping we'll be able to get some logs out of the modular instance to help understand what is going on.

@richvdh
Member

richvdh commented Jun 2, 2020

ok well, I got logs for two instances of this error this morning on the mozilla instance.

The first one is a complete mystery tbh. A client, from an IP address we've never seen before, suddenly pops up with a SAML session ID we've never heard of (or at least, I couldn't find in some brief grepping of the logs). I guess it's just an old session, and the user used an old link in their browser history or something. The main source of regret here is that the error message isn't better ("oops something went wrong" isn't terribly informative.)

The second one is much clearer: the user took 6 minutes to validate their email address and come back. We expire the SAML session dict after only 5 minutes. Particularly given auth0's email validation links are valid for 15 minutes, this seems... silly.

@clokep
Contributor

clokep commented Jun 2, 2020

I wonder if the expiry time should be configurable?

@richvdh
Member

richvdh commented Jun 2, 2020

it is. But I think the default is probably too short.

@clokep
Contributor

clokep commented Jun 2, 2020

Looks like this is already configurable, the default is 5 minutes:

# The lifetime of a SAML session. This defines how long a user has to
# complete the authentication process, if allow_unsolicited is unset.
# The default is 5 minutes.
#
#saml_session_lifetime: 5m

Edit: Doh, you already said it is configurable. 😢
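For reference, raising the lifetime to cover auth0's 15-minute email validation window would look something like this in homeserver.yaml (assuming the option sits under saml2_config as in the generated sample config above; the value is just an example):

saml2_config:
  # Give users longer to complete the flow, e.g. slow email validation.
  saml_session_lifetime: 15m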

@clokep
Contributor

clokep commented Jun 9, 2020

I put up #7664 to increase the timeout. Might not be an ideal solution, but should fix a concrete case we've seen.

@clokep
Contributor

clokep commented Jun 25, 2020

This is happening much less after the changes in #7664. I'm not sure whether these are people taking longer than the 15 minutes to finish validation or not. I'm unclear what the next steps might be here: try to improve the error message, maybe?

@richvdh
Member

richvdh commented Jun 29, 2020

are we still getting reports of this? I'd be inclined to close it if not.

Otherwise yes, probably need to remember where the "oops something went wrong" error message is coming from and try to make it give more clues as to what went wrong.

@babolivier
Contributor Author

Yes, it looks like we're still seeing this (around 5-10x/day on Modular).

@richvdh
Member

richvdh commented Jun 29, 2020

gosh. it was only a couple a day back when I investigated a few weeks ago (mind you, there was some brokenness in logging at the time).

ok then I would like to suggest a two-pronged approach:

  • investigate the logs for a representative sample of the failures to see if we can understand why they are continuing to fail
  • assuming it turns out that it is just lots of people turning up with old saml session ids, try to improve the error handling.

@babolivier
Contributor Author

it was only a couple a day back when I investigated a few weeks ago

Sentry seems to be bucketing some of these separately; in this case I looked at two separate issues that each had roughly 2-5 occurrences per day. Maybe that explains why you were seeing fewer of them?

@clokep
Contributor

clokep commented Jul 7, 2020

investigate the logs for a representative sample of the failures to see if we can understand why they are continuing to fail

I spent some time with these logs and with Sentry and couldn't really figure out if there was a correlation between old requests or something else happening.

I think improving the error handling might be useful; I'm guessing the concern with that is that we might be missing a "real" bug?

@clokep
Contributor

clokep commented Aug 31, 2020

Now that we have better logging I looked back over the last 7 days of this error occurring on the Mozilla instance:

Note that we remove the outstanding request once a response for it is received -- this seems correct, but I'm unsure if SAML allows for a single session to be completed multiple times (assuming that they are all within the proper timeout and such).
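For illustration, that remove-on-completion behaviour means a second POST with the same session ID looks unsolicited, roughly like this (hypothetical names, not Synapse's actual code):

# The request ID is recorded when the user is redirected to the IdP...
outstanding_requests = {"id-abc123": {"ui_auth_session_id": "xyz"}}

# ...and popped when the first response for it is handled:
outstanding_requests.pop("id-abc123", None)   # first POST: found, login proceeds

# A replayed or refreshed POST with the same ID now finds nothing, so the
# response is treated as unsolicited even though it is otherwise valid.
outstanding_requests.pop("id-abc123", None)   # second POST: returns None -> error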

I'm not sure what, if anything, should be done to handle these cases? Maybe we can improve the error page to say something like "Your SAML session might have timed out or already been completed. Please try again." Or something to that effect?

@richvdh
Member

richvdh commented Sep 1, 2020

do we have any idea why people would re-use the SAML session ID?

Improving the error text seems sensible either way.

@clokep
Contributor

clokep commented Sep 1, 2020

do we have any idea why people would re-use the SAML session ID?

My guess is that it is due to reloading a page? Or if e-mail verification is in the workflow it could be clicking on a link twice? I should note that the "re-used" SAML session IDs were within the 15 minute timeout period (and all from 2 users).

@clokep
Contributor

clokep commented Sep 3, 2020

Since I couldn't remember the behavior the user saw here: they currently just get an internal server error sent back to them (since it is part of the redirect flow, the client isn't involved).

Steps to reproduce this sanely:

  1. Configure your system for SAML.
  2. Start a login with SSO (click "Sign in with single sign-on").
  3. Restart Synapse (this forcefully clears all known SAML sessions).
  4. Finish the login via the SSO flow.

You end up at a white page that says "Internal server error".

When adding the OpenID code we added a template for handling some errors; I think we should re-use that here.
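Something along these lines, as a sketch of a friendlier page (the template and variable names are assumptions, not necessarily what Synapse's actual templates use; the wording follows the suggestion earlier in this thread):

<!-- Hypothetical SSO/SAML error template -->
<!DOCTYPE html>
<html lang="en">
  <head><title>Authentication failed</title></head>
  <body>
    <p>
      Your login session could not be found. It may have timed out or already
      been completed; please go back to your client and try signing in again.
    </p>
    {% if error_description %}
      <p>Details: {{ error_description }}</p>
    {% endif %}
  </body>
</html>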

@clokep
Contributor

clokep commented Sep 3, 2020

Ah this isn't entirely accurate -- we do have a saml_error.html page, but this seems to work by parsing the returned exceptions, which isn't ideal.

@babolivier
Contributor Author

Ah this isn't entirely accurate -- we do have a saml_error.html page, but this seems to work by parsing the returned exceptions, which isn't ideal.

Note that this is the only way we have to process and display errors from Auth0, which is why it works like that :/
