Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API can return 401 status codes for valid sessions under load, forcing users to be logged out #7169

Closed
dianabarsan opened this issue Jun 30, 2021 · 7 comments
Assignees
Labels
Priority: 0 - Critical Work on this right away. Type: Bug Fix something that isn't working as intended
Milestone

Comments

@dianabarsan
Copy link
Member

dianabarsan commented Jun 30, 2021

Describe the bug
Requests that reach API and are forwarded (proxied) to CouchDB could result in 401 status codes even if the user has an active session.
Any 401 response status will trigger a logout in webapp based on: #3919 where the solution was to hook into all Pouch<->Couch fetch requests and navigate to login whenever a 401 is received. This was introduced in 3.6.0.

To Reproduce
No reproduction steps yet, this appears to have been a one-off (?) event on a production server (very high load).

Expected behavior
Given the logistical difficulties of users being logged out, we should make sure that the we mean it for sure before logging them out.

Logs
Logs are extensive, but I extracted some relevant logs for a single user. The logs are redacted to obfuscate the instance, the user's IP and the username:

Jun 25 23:50:37 medic-api-logs | [2021-06-25 20:50:37] REQ cc2757e2-6b52-4e0a-af94-95d942bb5772 <IP ADDRESS> - GET /medic-user-<username>-meta/ HTTP/1.1
Jun 25 23:50:37 medic-api-logs | [2021-06-25 20:50:36] REQ 301dd125-8398-44f6-bfaa-917f25e7d5ee <IP ADDRESS> - GET /dbinfo HTTP/1.1
Jun 25 23:50:57 medic-api-logs | [2021-06-25 20:50:57] REQ 4e8fe94b-f598-49a3-8252-66f322745c56 <IP ADDRESS> - GET /dbinfo HTTP/1.1
Jun 25 23:51:55 couchdb-logs | [notice] 2021-06-25T20:51:55.623056Z couchdb@127.0.0.1 <0.1597.1803> 343fc5c9e8 <APP URL> <IP ADDRESS> <username> GET / 200 ok 1
Jun 25 23:51:56 medic-api-logs | [2021-06-25 20:51:56] RES 301dd125-8398-44f6-bfaa-917f25e7d5ee <IP ADDRESS> - GET /dbinfo HTTP/1.1 200 208 79920.855 ms
Jun 25 23:51:57 medic-api-logs | [2021-06-25 20:51:57] REQ c5a29400-a38c-4605-a104-236bb0ac1b1b <IP ADDRESS> - GET /medic/_local/Cwv3Wa9HPkXBPyER3xZiVg%3D%3D? HTTP/1.1
Jun 25 23:52:01 couchdb-logs | [notice] 2021-06-25T20:52:01.010647Z couchdb@127.0.0.1 <0.21377.1800> dd01fac304 <APP URL> <IP ADDRESS> <username> GET / 200 ok 0
Jun 25 23:52:01 couchdb-logs | [notice] 2021-06-25T20:52:01.009304Z couchdb@127.0.0.1 <0.15778.1799> 622650d65a <APP URL> <IP ADDRESS> <username> GET /_session 200 ok 0
Jun 25 23:52:03 medic-api-logs | [2021-06-25 20:52:03] RES 4e8fe94b-f598-49a3-8252-66f322745c56 <IP ADDRESS> - GET /dbinfo HTTP/1.1 200 208 66158.890 ms
Jun 25 23:52:03 medic-api-logs | [2021-06-25 20:52:03] RES c900f4bd-5334-4182-b300-e2061bfe5aa6 <IP ADDRESS> - GET /service-worker.js HTTP/1.1 304 - 94004.041 ms
Jun 25 23:52:05 medic-api-logs | [2021-06-25 20:52:04] REQ 162eda25-f36e-47b1-9770-a177f1073088 <IP ADDRESS> - POST /medic/_revs_diff HTTP/1.1
Jun 25 23:52:22 couchdb-logs | [notice] 2021-06-25T20:52:22.285491Z couchdb@127.0.0.1 <0.8942.1806> c941111407 <APP URL> <IP ADDRESS> <username> GET /_session 200 ok 0
Jun 25 23:52:25 medic-api-logs | [2021-06-25 20:52:25] RES cc2757e2-6b52-4e0a-af94-95d942bb5772 <IP ADDRESS> - GET /medic-user-<username>-meta/ HTTP/1.1 401 35 108187.335 ms
Jun 25 23:52:30 medic-api-logs | [2021-06-25 20:52:30] RES 822a20c1-17ac-4c27-b170-d2c66fffa14e <IP ADDRESS> - GET /_session HTTP/1.1 - - - ms
Jun 25 23:52:44 couchdb-logs | [notice] 2021-06-25T20:52:44.790280Z couchdb@127.0.0.1 <0.26685.1799> 5e39ae352b <APP URL> <IP ADDRESS> <username> GET /_session 200 ok 0
Jun 25 23:52:58 couchdb-logs | [notice] 2021-06-25T20:52:58.707715Z couchdb@127.0.0.1 <0.17967.1779> c14d919236 <APP URL> <IP ADDRESS> <username> GET /_session 200 ok 0
Jun 25 23:52:58 couchdb-logs | [notice] 2021-06-25T20:52:58.705721Z couchdb@127.0.0.1 <0.18897.1800> 4a5b0dc0cd <APP URL> <IP ADDRESS> <username> GET /_session 200 ok 0
Jun 25 23:53:04 couchdb-logs | [notice] 2021-06-25T20:53:04.484114Z couchdb@127.0.0.1 <0.25195.1804> 758d93275b <APP URL> <IP ADDRESS> <username> GET /_session 200 ok 0
....
Jun 25 23:58:51 medic-api-logs | [2021-06-25 20:58:50] RES e5a2d04d-09c3-4536-a581-29952d6ad694 <IP ADDRESS> - GET /medic/_all_docs?key=%22_design%2Fmedic-client%22 HTTP/1.1 401 35 106513.139 ms

The app/CouchDB were under such load that the simplest /dbinfo request - which is just / for offline users and is directly forwarded to CouchDB here (https://github.com/medic/cht-core/blob/3.10.x/api/src/routing.js#L195) took 80 seconds to succeed.

Some ideas:

  1. API should never respond with a 401 if the user has an active session.
  2. When API receives a 401 from a proxied CouchDB request, checks should be made to make sure that the 401 is actually accurate and Couch really means it.
  3. Webapp should poll /_session when a 401 is received, before deciding to definitively log a user out (deleting the userCtx cookie). Alternatively, the login script could also poll session before doing anything. The session endpoint will also populate the userCtx cookie if a valid session is found.

Environment

  • Instance: production instance
  • App: API & Webapp
  • Version: 3.10.x

Additional context
Over a period of a day, numerous users from a production instance were logged out.

@dianabarsan dianabarsan added Type: Bug Fix something that isn't working as intended Priority: 0 - Critical Work on this right away. labels Jun 30, 2021
@dianabarsan dianabarsan self-assigned this Jun 30, 2021
@dianabarsan dianabarsan changed the title API or CouchDB can return 401 status code for valid sessions under load API or CouchDB can return 401 status code for valid sessions under load, forcing users to be logged out Jun 30, 2021
@dianabarsan
Copy link
Member Author

Around the time of this happening, I'm seeing lots of CouchDB fails in logs, for example:

Jun 25 23:50:21 medic-api-logs | [2021-06-25 20:50:21]   [stack]: 'FetchError: request to http://localhost:5984/medic/_all_docs? failed, reason: socket hang up\n    at ClientRequest.<anonymous> (/srv/software/medic-api/md5-Gm9Gq31XLCA8Xe5FAF9g5Q==/node_modules/node-fetch/lib/index.js:1430:11)\n    at emitOne (events.js:116:13)\n    at ClientRequest.emit (events.js:211:7)\n    at Socket.socketOnEnd (_http_client.js:423:9)\n    at emitNone (events.js:111:20)\n    at Socket.emit (events.js:208:7)\n    at endReadableNT (_stream_readable.js:1064:12)\n    at _combinedTickCallback (internal/process/next_tick.js:138:11)\n    at process._tickCallback (internal/process/next_tick.js:180:9)' } 
Jun 25 23:50:11 medic-api-logs:  | [2021-06-25 20:50:10] 2021-06-25 20:50:10 ERROR: Server error: { FetchError: request to http://localhost:5984/_users/org.couchdb.user%3Amariamcamara240? failed, reason: socket hang up
Jun 25 23:50:01 medic-api-logs:  | [2021-06-25 20:50:01] 2021-06-25 20:50:01 ERROR: Server error: { FetchError: invalid json response body at http://localhost:5984/medic-purged-role-6d4f30bb2a8cc7b41fb0c6406f7f0b1d/_changes?seq_interval=25914&filter=_doc_ids&since=0&limit=25915 reason: Unexpected token < in JSON at position 0
Jun 25 23:49:46 medic-api-logs:  | [2021-06-25 20:49:46] 2021-06-25 20:49:46 ERROR: Server error: { FetchError: invalid json response body at http://localhost:5984/medic-purged-role-6d4f30bb2a8cc7b41fb0c6406f7f0b1d/_changes?seq_interval=22096&filter=_doc_ids&since=0&limit=22097 reason: Unexpected token < in JSON at position 0
Jun 25 23:41:32 medic-api-logs:  | [2021-06-25 20:41:31] 2021-06-25 20:41:31 INFO: invalid json response body at http://localhost:5984/medic/_changes?style=all_docs&filter=_doc_ids&since=42238358-g1AAAAJ7eJzLYWBg4MhgTmEQTM4vTc5ISXIwNDLXMwBCwxygFFMiQ5L8____szKYkxgCmlVygWLsxilpJolpZtj04DEpSQFIJtnDDZt_CWJYmkGKqUESqYY5gAyLhxvWtxNiWLKhuaFRIqmGJYAMq4cbVn8VYpiZYZppmhGJhuWxAEmGBiAFNG8-xMCih2ADk4yNUw1MDcgycAHEwP0QA2eKQww0NzUySDYmy8ADEAPvQwxsjQAbaGFqmJZqZk6WgQ8gBkLDcN8hSBgaGaQlJ5pi05oFAEWhqjQ&limit=100 reason: Unexpected token < in JSON at position 0 FetchError: invalid json response body at http://localhost:5984/medic/_changes?style=all_docs&filter=_doc_ids&since=42238358-g1AAAAJ7eJzLYWBg4MhgTmEQTM4vTc5ISXIwNDLXMwBCwxygFFMiQ5L8____szKYkxgCmlVygWLsxilpJolpZtj04DEpSQFIJtnDDZt_CWJYmkGKqUESqYY5gAyLhxvWtxNiWLKhuaFRIqmGJYAMq4cbVn8VYpiZYZppmhGJhuWxAEmGBiAFNG8-xMCih2ADk4yNUw1MDcgycAHEwP0QA2eKQww0NzUySDYmy8ADEAPvQwxsjQAbaGFqmJZqZk6WgQ8gBkLDcN8hSBgaGaQlJ5pi05oFAEWhqjQ&limit=100 reason: Unexpected token < in JSON at position 0
Jun 25 23:43:33 medic-api-logs:  | [2021-06-25 20:43:33] 2021-06-25 20:43:33 ERROR: Server error: { FetchError: invalid json response body at http://localhost:5984/medic-purged-role-6d4f30bb2a8cc7b41fb0c6406f7f0b1d/_changes?seq_interval=27251&filter=_doc_ids&since=14980551-g1AAAAJ7eJzLYWBg4MhgTmEQTM4vTc5ISXIwNDLXMwBCwxygFFMiQ5L8____szKYkxhkOsxygWLsKcZmSebmidj04DEpSQFIJtnDDduwD2yYcZpZYppxGqmGOYAMi4cbVi0INszcwMzAJM2IVMMSQIbVww279gpsmKmluUGieQqJhuWxAEmGBiAFNG8-xEC9x2AD08yMUw0NSHUdxMAFEAP3Qww8Uw820MI8OcnMxIQsAw9ADLwPMbBLFeLlJDNzs1TyDHwAMRAahq2BkNg1SEwyTzHDpjULAKTDp9E&limit=100 reason: Unexpected token < in JSON at position 0


Jun 25 23:41:37 couchdb-logs:  | [error] 2021-06-25T20:41:37.049542Z couchdb@127.0.0.1 <0.32084.1793> 7de981f4bd rexi_server: from: couchdb@127.0.0.1(<0.2673.1791>) mfa: fabric_rpc:changes/3 exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,265}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,205}]},{fabric_rpc,changes_enumerator,2,[{file,"src/fabric_rpc.erl"},{line,534}]},{couch_changes,'-send_lookup_changes/6-fun-3-',3,[{file,"src/couch_changes.erl"},{line,621}]},{lists,foldl,3,[{file,"lists.erl"},{line,1263}]},{couch_changes,send_lookup_changes,6,[{file,"src/couch_changes.erl"},{line,620}]},{fabric_rpc,changes,4,[{file,"src/fabric_rpc.erl"},{line,79}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,140}]}]
Jun 25 23:41:38 couchdb-logs:  | [error] 2021-06-25T20:41:37.965485Z couchdb@127.0.0.1 <0.3014.1791> 7de981f4bd rexi_server: from: couchdb@127.0.0.1(<0.2673.1791>) mfa: fabric_rpc:changes/3 exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,265}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,205}]},{fabric_rpc,changes_enumerator,2,[{file,"src/fabric_rpc.erl"},{line,534}]},{couch_changes,'-send_lookup_changes/6-fun-3-',3,[{file,"src/couch_changes.erl"},{line,621}]},{lists,foldl,3,[{file,"lists.erl"},{line,1263}]},{couch_changes,send_lookup_changes,6,[{file,"src/couch_changes.erl"},{line,620}]},{fabric_rpc,changes,4,[{file,"src/fabric_rpc.erl"},{line,79}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,140}]}]
Jun 25 23:41:42 couchdb-logs:  | [error] 2021-06-25T20:41:42.413306Z couchdb@127.0.0.1 <0.29521.1791> 7de981f4bd rexi_server: from: couchdb@127.0.0.1(<0.2673.1791>) mfa: fabric_rpc:changes/3 exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,265}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,205}]},{fabric_rpc,changes_enumerator,2,[{file,"src/fabric_rpc.erl"},{line,534}]},{couch_changes,'-send_lookup_changes/6-fun-3-',3,[{file,"src/couch_changes.erl"},{line,621}]},{lists,foldl,3,[{file,"lists.erl"},{line,1263}]},{couch_changes,send_lookup_changes,6,[{file,"src/couch_changes.erl"},{line,620}]},{fabric_rpc,changes,4,[{file,"src/fabric_rpc.erl"},{line,79}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,140}]}]
Jun 25 23:41:57 couchdb-logs:  | [error] 2021-06-25T20:41:57.289498Z couchdb@127.0.0.1 <0.20893.1791> 1c039bbcd6 rexi_server: from: couchdb@127.0.0.1(<0.32542.1787>) mfa: fabric_rpc:changes/3 exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,265}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,205}]},{fabric_rpc,changes_enumerator,2,[{file,"src/fabric_rpc.erl"},{line,534}]},{couch_changes,'-send_lookup_changes/6-fun-3-',3,[{file,"src/couch_changes.erl"},{line,621}]},{lists,foldl,3,[{file,"lists.erl"},{line,1263}]},{couch_changes,send_lookup_changes,6,[{file,"src/couch_changes.erl"},{line,620}]},{fabric_rpc,changes,4,[{file,"src/fabric_rpc.erl"},{line,79}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,140}]}]

@garethbowen
Copy link
Member

The problem may be in the authentication code where we respond with a 401 no matter what CouchDB responded, so even if it timed out, or CouchDB was unavailable, or pretty much anything we will log the users out. My vote would be your first idea, to only respond with a 401 if we know for sure that the user is logged out.

@dianabarsan
Copy link
Member Author

How far back should we backport the fix?

@garethbowen garethbowen modified the milestone: 3.12.0 Jun 30, 2021
@garethbowen
Copy link
Member

We support versions 3.10.x and above, so definitely that far. That also solves the issue for this particular production instance.

I don't think we need to backport to 3.9.x at this stage.

@ngaruko ngaruko added this to the 3.12.0 milestone Jul 1, 2021
@latin-panda latin-panda added this to Ready for dev in Dev team workstream via automation Jul 2, 2021
@latin-panda latin-panda moved this from Ready for dev to Dev in progress in Dev team workstream Jul 2, 2021
@dianabarsan
Copy link
Member Author

This is ready for AT on 7169-stop-logouts-on-random-401s

The changes that are implemented are:

  1. CouchDB needs to respond with an explicit 401 to count as a logout. If _session times out or throws another error, it should not log the users out.
  2. App endpoints (excepting login) send an additional header logout-authorization with the value CHT-Core API with every 401. This is to prevent man-in-the-middle from sending 401s that log users out.

So, online and offline users should continue to work as before, authentication should work as before.

I still haven't succeeded to replicate overloading CouchDB in such a way that _session requests timeout - every time I upped the load high enough, I crashed my local CouchDB.

@dianabarsan dianabarsan moved this from Dev in progress to Ready for AT in Dev team workstream Jul 2, 2021
@ngaruko ngaruko self-assigned this Jul 5, 2021
@ngaruko ngaruko moved this from Ready for AT to AT in progress in Dev team workstream Jul 5, 2021
@ngaruko
Copy link
Contributor

ngaruko commented Jul 6, 2021

Tested on local, basically checking if the following still work

  1. login and logout
  2. session expiry gives warning to user
    image
  3. confirmed that logout-authorization header is being sent with a 401 response
    image

Feel free to merge @dianabarsan

@ngaruko ngaruko moved this from AT in progress to Ready to merge in Dev team workstream Jul 6, 2021
dianabarsan added a commit that referenced this issue Jul 6, 2021
Updates API and Webapp to:

- Only consider 401s from CouchDB to be valid missing authentication
- Throw other _session errors without changes
- Malformed _session responses result in a 500
- Don't send 401s for missing userCtx, instead send the actual authentication error or a 500.
- Add a custom header to 401 responses that Webapp checks for, and only when matched it would log users out. (prevents man-in-the-middle from logging users out with 401s)

#7169
dianabarsan added a commit that referenced this issue Jul 6, 2021
Updates API and Webapp to:

- Only consider 401s from CouchDB to be valid missing authentication
- Throw other _session errors without changes
- Malformed _session responses result in a 500
- Don't send 401s for missing userCtx, instead send the actual authentication error or a 500.
- Add a custom header to 401 responses that Webapp checks for, and only when matched it would log users out. (prevents man-in-the-middle from logging users out with 401s)

#7169

(cherry picked from commit 2486dc5)
dianabarsan added a commit that referenced this issue Jul 6, 2021
Updates API and Webapp to:

- Only consider 401s from CouchDB to be valid missing authentication
- Throw other _session errors without changes
- Malformed _session responses result in a 500
- Don't send 401s for missing userCtx, instead send the actual authentication error or a 500.
- Add a custom header to 401 responses that Webapp checks for, and only when matched it would log users out. (prevents man-in-the-middle from logging users out with 401s)

#7169

(cherry picked from commit 2486dc5)
dianabarsan added a commit that referenced this issue Jul 6, 2021
Updates API and Webapp to:

- Only consider 401s from CouchDB to be valid missing authentication
- Throw other _session errors without changes
- Malformed _session responses result in a 500
- Don't send 401s for missing userCtx, instead send the actual authentication error or a 500.
- Add a custom header to 401 responses that Webapp checks for, and only when matched it would log users out. (prevents man-in-the-middle from logging users out with 401s)

#7169

(cherry picked from commit 2486dc5)
@dianabarsan dianabarsan modified the milestones: 3.12.0, 3.11.1 Jul 6, 2021
dianabarsan added a commit that referenced this issue Jul 6, 2021
* Prevent users being logged out incorrectly (#7171)

Updates API and Webapp to:

- Only consider 401s from CouchDB to be valid missing authentication
- Throw other _session errors without changes
- Malformed _session responses result in a 500
- Don't send 401s for missing userCtx, instead send the actual authentication error or a 500.
- Add a custom header to 401 responses that Webapp checks for, and only when matched it would log users out. (prevents man-in-the-middle from logging users out with 401s)

#7169

(cherry picked from commit 2486dc5)

* Adds GitHub actions workflow to take over CI.
@dianabarsan
Copy link
Member Author

Merged to master and backported to 3.12.x, 3.11.x and 3.10.x.

Dev team workstream automation moved this from Ready to merge to Done Jul 6, 2021
@dianabarsan dianabarsan modified the milestones: 3.12.0, 3.10.4 Jul 6, 2021
@dianabarsan dianabarsan changed the title API or CouchDB can return 401 status code for valid sessions under load, forcing users to be logged out API can return 401 status codes for valid sessions under load, forcing users to be logged out Jul 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Priority: 0 - Critical Work on this right away. Type: Bug Fix something that isn't working as intended
Projects
No open projects
Development

No branches or pull requests

3 participants