Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix memory leaks in websocket api #94780

Merged
merged 21 commits into from Jun 19, 2023
Merged

Fix memory leaks in websocket api #94780

merged 21 commits into from Jun 19, 2023

Conversation

bdraco
Copy link
Member

@bdraco bdraco commented Jun 17, 2023

Proposed change

While looking at home-assistant/frontend#16952 I noticed we suppress propagation of cancellation in the websocket api which can prevent the memory from being recovered on disconnect because it shields the aiohttp internals and the cancelation handler from running.

Reference cycles are now broken when the connection is finished to ensure nothing is holding on to the object so it can be garbage collected.

We now also clear the callable on the ActiveConnection to anything holding on to the connection object does not prevent the handler from being garbage collected.

The ActiveConnection will now handle the case where an unsub of a subscription fails and still unsub the rest of the subscriptions.

I also adds a lot more debug logging (guards with isEnabledFor to prevent a performance regression) and repr of the ActiveConnection and WebsocketHandler objects so any future leaks will be much easier to track down since they can now be dumped in a meaningful way with the profiler integration.

related issue home-assistant/frontend#16952 related issue #90281

Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New integration (thank you!)
  • New feature (which adds functionality to an existing integration)
  • Deprecation (breaking change to happen in the future)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests

Additional information

  • This PR fixes or closes issue: fixes #
  • This PR is related to issue:
  • Link to documentation pull request:

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • I have followed the development checklist
  • I have followed the perfect PR recommendations
  • The code has been formatted using Black (black --fast homeassistant tests)
  • Tests have been added to verify that the new code works.

If user exposed functionality or configuration variables are added/changed:

If the code communicates with devices, web services, or third-party tools:

  • The manifest file has all fields filled out correctly.
    Updated and included derived files by running: python3 -m script.hassfest.
  • New or updated dependencies have been added to requirements_all.txt.
    Updated by running python3 -m script.gen_requirements_all.
  • For the updated dependencies - a link to the changelog, or at minimum a diff between library versions is added to the PR description.
  • Untested files have been added to .coveragerc.

To help with the load of incoming pull requests:

Also adds a lot more debug logging (guards with isEnabledFor to prevent
a performance regression)

related issue #94726
@home-assistant
Copy link

Hey there @home-assistant/core, mind taking a look at this pull request as it has been labeled with an integration (websocket_api) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of websocket_api can trigger bot actions by commenting:

  • @home-assistant close Closes the pull request.
  • @home-assistant rename Awesome new title Renames the pull request.
  • @home-assistant reopen Reopen the pull request.
  • @home-assistant unassign websocket_api Removes the current integration label and assignees on the pull request, add the integration domain after the command.

@bdraco
Copy link
Member Author

bdraco commented Jun 17, 2023

Will need to adjust the test as it will be closed now instead of closing

@bdraco bdraco marked this pull request as ready for review June 17, 2023 19:58
@bdraco bdraco requested a review from a team as a code owner June 17, 2023 19:58
@bdraco bdraco marked this pull request as draft June 17, 2023 23:10
@bdraco
Copy link
Member Author

bdraco commented Jun 18, 2023

I will address the above comment and add a test for cancelation on write to make sure things are cleaned up

@ondras12345
Copy link
Contributor

I was doing some testing with this patch applied to see whether it fixes the memory leak. Since I don't know how to reliably trigger the frontend bug, my results have been inconclusive. However, I have encountered this error (not sure if it is related nor if it is expected):

2023-06-18 16:25:32.928 ERROR (MainThread) [homeassistant.components.websocket_api.http.connection] [2623251696] Hass Term S3 mini from 192.168.1.63 (Mozilla/5.0 (Linux; Android 11; Pixel 2 Build/RP1A.201005.004.A1; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/114.0.5735.131 Mobile Safari/537.36): Client unable to keep up with pending messages. Reached 4096 pending messages. The system's load is too high or an integration is misbehaving; Last message was: {"id":3955,"type":"result","success":true,"result":null}
2023-06-18 16:26:26.165 ERROR (MainThread) [homeassistant] Error doing job: Task exception was never retrieved
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/aiohttp/http_websocket.py", line 676, in ping
    await self._send_frame(message, WSMsgType.PING)
  File "/usr/local/lib/python3.11/site-packages/aiohttp/http_websocket.py", line 653, in _send_frame
    self._write(header + message)
  File "/usr/local/lib/python3.11/site-packages/aiohttp/http_websocket.py", line 663, in _write
    raise ConnectionResetError("Cannot write to closing transport")
ConnectionResetError: Cannot write to closing transport
2023-06-18 16:26:28.165 ERROR (MainThread) [homeassistant] Error doing job: Task exception was never retrieved
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/aiohttp/http_websocket.py", line 676, in ping
    await self._send_frame(message, WSMsgType.PING)
  File "/usr/local/lib/python3.11/site-packages/aiohttp/http_websocket.py", line 653, in _send_frame
    self._write(header + message)
  File "/usr/local/lib/python3.11/site-packages/aiohttp/http_websocket.py", line 663, in _write
    raise ConnectionResetError("Cannot write to closing transport")
ConnectionResetError: Cannot write to closing transport

@bdraco
Copy link
Member Author

bdraco commented Jun 18, 2023

I was doing some testing with this patch applied to see whether it fixes the memory leak. Since I don't know how to reliably trigger the frontend bug, my results have been inconclusive. However, I have encountered this error (not sure if it is related nor if it is expected):

2023-06-18 16:25:32.928 ERROR (MainThread) [homeassistant.components.websocket_api.http.connection] [2623251696] Hass Term S3 mini from 192.168.1.63 (Mozilla/5.0 (Linux; Android 11; Pixel 2 Build/RP1A.201005.004.A1; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/114.0.5735.131 Mobile Safari/537.36): Client unable to keep up with pending messages. Reached 4096 pending messages. The system's load is too high or an integration is misbehaving; Last message was: {"id":3955,"type":"result","success":true,"result":null}
2023-06-18 16:26:26.165 ERROR (MainThread) [homeassistant] Error doing job: Task exception was never retrieved
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/aiohttp/http_websocket.py", line 676, in ping
    await self._send_frame(message, WSMsgType.PING)
  File "/usr/local/lib/python3.11/site-packages/aiohttp/http_websocket.py", line 653, in _send_frame
    self._write(header + message)
  File "/usr/local/lib/python3.11/site-packages/aiohttp/http_websocket.py", line 663, in _write
    raise ConnectionResetError("Cannot write to closing transport")
ConnectionResetError: Cannot write to closing transport
2023-06-18 16:26:28.165 ERROR (MainThread) [homeassistant] Error doing job: Task exception was never retrieved
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/aiohttp/http_websocket.py", line 676, in ping
    await self._send_frame(message, WSMsgType.PING)
  File "/usr/local/lib/python3.11/site-packages/aiohttp/http_websocket.py", line 653, in _send_frame
    self._write(header + message)
  File "/usr/local/lib/python3.11/site-packages/aiohttp/http_websocket.py", line 663, in _write
    raise ConnectionResetError("Cannot write to closing transport")
ConnectionResetError: Cannot write to closing transport

Thats probably a good thing as it means the cancellation is propagating now and the transport is closing. Unfortunately aiohttp is a big noisy there. I'll see if we can clean that up

@bdraco
Copy link
Member Author

bdraco commented Jun 18, 2023

@ondras12345 There was still a path were the websocket wasn't being cleaned up. I ended up with a chain of finally blocks to make sure it does always get to .close() when something fails

@bdraco
Copy link
Member Author

bdraco commented Jun 18, 2023

leak is gone for me after latest changes

I tested by calling WebSocketHandler with profiler.dump_log_objects

I now only have a few in memory instead of hundreds

2023-06-18 10:54:32.282 CRITICAL (SyncWorker_14) [homeassistant.components.profiler] WebSocketHandler object in memory: <homeassistant.components.websocket_api.http.WebSocketHandler object at 0x7f80a2aa1b90>
2023-06-18 10:54:32.282 CRITICAL (SyncWorker_14) [homeassistant.components.profiler] WebSocketHandler object in memory: <homeassistant.components.websocket_api.http.WebSocketHandler object at 0x7f808b67b6d0>
2023-06-18 10:54:32.283 CRITICAL (SyncWorker_14) [homeassistant.components.profiler] WebSocketHandler object in memory: <homeassistant.components.websocket_api.http.WebSocketHandler object at 0x7f80902f1f10>
2023-06-18 10:54:32.283 CRITICAL (SyncWorker_14) [homeassistant.components.profiler] WebSocketHandler object in memory: <homeassistant.components.websocket_api.http.WebSocketHandler object at 0x7f80a25a4e10>

@bdraco bdraco changed the title Propagate cancellation in websocket api Propagate cancellation in websocket api to fix memory leak Jun 18, 2023
@ondras12345
Copy link
Contributor

I am currently running 2023.6.2 with changes from this PR (f69c4ab). However, I haven't been able to trigger the frontend bug since I last restarted core. Annoyingly, it seems to be quite random and I haven't figured out how to reproduce it reliably.

@bdraco
Copy link
Member Author

bdraco commented Jun 18, 2023

Screenshot_2023-06-18_at_11_04_27_AM

Loading the frontend with this clicked (in chrome inspector) should trigger the issue

@ondras12345
Copy link
Contributor

I am able to get it to send one persistent_notification/subscribe message per second. Nowhere near as many as I am getting when it happens "naturally" (>1000 per second).
Neither WebSocketHandler (4) nor WSMessage appear to be growing.

@bdraco
Copy link
Member Author

bdraco commented Jun 19, 2023

I took it one step further and protected all the attributes on the handler to make sure nothing external is holding a reference to them

@ondras12345
Copy link
Contributor

Currently running version 7a9e023
I haven't been able to trigger the bug yet.

I do have the debug log from last time it happened (before applying the latest patches), but there's way too many lines (155MB). What messages in particular should I be looking for? I don't want to post the whole file as it probably contains access tokens and other sensitive info.

I am also experimenting with a simple python script that floods the API with persistent_notification/subscribe messages, but I haven't had much success so far.

@bdraco
Copy link
Member Author

bdraco commented Jun 19, 2023

I added a repr to the object so you can do a dump on the WebSocketHandler and see where they are coming from

2023-06-19 11:59:51.918 CRITICAL (SyncWorker_5) [homeassistant.components.profiler] WebSocketHandler object in memory: <WebSocketHandler closing=False authenticated=True description=J. Nick Koston from <snip> (Home Assistant/2023.3 (io.robbie.HomeAssistant; build:2023.459; iOS 16.5.0))>
2023-06-19 11:59:51.918 CRITICAL (SyncWorker_5) [homeassistant.components.profiler] WebSocketHandler object in memory: <WebSocketHandler closing=False authenticated=True description=J. Nick Koston from  <snip> (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36)>

@bdraco
Copy link
Member Author

bdraco commented Jun 19, 2023

I'm pretty sure the push to change to replace self.send_message with _connect_closed_error is going to solve the leak for you

@bdraco
Copy link
Member Author

bdraco commented Jun 19, 2023

@ondras12345 I'm pretty confident the last set of changes will finally solve this. If you get it to leak again after the latest changes, please post the output of

service: profiler.dump_log_objects
data:
  type: WebSocketHandler
Screenshot 2023-06-19 at 12 23 31 PM

@bdraco bdraco changed the title Propagate cancellation in websocket api to fix memory leak Fix memory leak in websocket api Jun 19, 2023
@bdraco bdraco changed the title Fix memory leak in websocket api Fix memory leaks in websocket api Jun 19, 2023
Co-authored-by: Martin Hjelmare <marhje52@gmail.com>
@bdraco
Copy link
Member Author

bdraco commented Jun 19, 2023

Github actions seems to be a bit flakey today. Going to merge in dev to get another run just to be sure.

@bdraco
Copy link
Member Author

bdraco commented Jun 19, 2023

@bdraco
Copy link
Member Author

bdraco commented Jun 19, 2023

retest is good. no leaks

@bdraco
Copy link
Member Author

bdraco commented Jun 19, 2023

thanks for the review

@bdraco bdraco merged commit 1206f2c into dev Jun 19, 2023
50 of 51 checks passed
@bdraco bdraco deleted the websocket_api_let_cancel_raise branch June 19, 2023 23:27
@bdraco
Copy link
Member Author

bdraco commented Jun 20, 2023

I want to add some more coverage for disconnects after creating a subscription and cleanup the typing a bit more but I didn't want to make this any larger so I'll open a new pr to add the additional coverage and typing if the Wi-Fi works inflight tonight

1 similar comment
@bdraco
Copy link
Member Author

bdraco commented Jun 20, 2023

I want to add some more coverage for disconnects after creating a subscription and cleanup the typing a bit more but I didn't want to make this any larger so I'll open a new pr to add the additional coverage and typing if the Wi-Fi works inflight tonight

@bdraco
Copy link
Member Author

bdraco commented Jun 20, 2023

#94891

@github-actions github-actions bot locked and limited conversation to collaborators Jun 21, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants