Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Runaway thread: select() continually waking up #217

Closed
ut3 opened this issue May 2, 2017 · 8 comments

Comments

Projects
None yet
2 participants
@ut3
Copy link

commented May 2, 2017

A select() call introduced here is going awry. After a few hours of uptime, that select() wakes up continually at a rate much higher than its 1 second timeout, effectively bogging down my system with a runaway thread.

I can reproduce this in both:

  • A local build of git 0fa827e
  • An install of downloaded wok.noarch.deb where that file has md5sum 95fc5cc7887ff78e7cd040df1bdad373

I arrived at this specific select() call by attaching GDB to my python interpreter while my CPU usage was spiked. Sadly, I did not save the callstack I saw, but I will endeavor to recreate it upon request.

To help with debugging, I made the python change shown in attached file debug-output.diff.txt. That change yields the output shown in wok-error.log.txt

Host info:

$ uname -a && grep Ubuntu /etc/issue && python -V && ldd --version | grep ldd
Linux galactica 4.10.0-20-generic #22~16.04.1-Ubuntu SMP Thu Apr 20 17:43:29 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Ubuntu 16.04.2 LTS \n \l
Python 2.7.12
ldd (Ubuntu GLIBC 2.23-0ubuntu7) 2.23

I have not yet tried to correlate this runaway thread with UI interactions or timeouts. I have not yet tried to reproduce this problem using local builds of prior software revisions.

@ut3 ut3 changed the title Runaway thread: select() not clearing ready FDs Runaway thread: select() continually waking up May 2, 2017

@danielhb

This comment has been minimized.

Copy link
Contributor

commented May 2, 2017

@ut3 thanks for this report. Yeah, this select() is being run once per second, but it shouldn't be causing any out of memory problems or leaks. I believe there is some cleanup that isn't being done properly and, running it after several hours, it adds up.

I'll take a look into it when I have the change. Any gdb traces or logs related to this issue is appreciated.

@ut3

This comment has been minimized.

Copy link
Author

commented May 2, 2017

I am attempting to recapture the GDB stacktrace. This issue does not appear to impact memory utilization--ie I do not believe a leak is present. Rather, upon manifestation, select() wakes up for reason(s) other than a timeout with much higher frequency than once per second. In my attached patch "debug-output.diff.txt" I changed that 1 second timeout to 10 seconds solely to make logging output clearer.

@ut3

This comment has been minimized.

Copy link
Author

commented May 2, 2017

Output of top while CPU is pegged:
bin-top

Some information from gdb, strace, and psef:
wok-gdb-strace-psef.txt

I have made a positive correlation between this issue's manifestation and closing my desktop web browser's wok tab. That is, the select() call begins waking up at high rate as soon as I close my wok tab.

And, to reiterate, in the following log notice how quickly the select() call is waking up beginning at 14:37:19

2017-05-02 14:36:06: select woke with 0 ready FDs
2017-05-02 14:36:16: select woke with 1 ready FDs
2017-05-02 14:36:26: select woke with 0 ready FDs
2017-05-02 14:36:36: select woke with 0 ready FDs
2017-05-02 14:36:46: select woke with 1 ready FDs
2017-05-02 14:36:56: select woke with 0 ready FDs
2017-05-02 14:37:06: select woke with 0 ready FDs
2017-05-02 14:37:16: select woke with 1 ready FDs
2017-05-02 14:37:19: select woke with 1 ready FDs
2017-05-02 14:37:19: select woke with 1 ready FDs
2017-05-02 14:37:19: select woke with 1 ready FDs
2017-05-02 14:37:19: select woke with 1 ready FDs
2017-05-02 14:37:19: select woke with 1 ready FDs
2017-05-02 14:37:19: select woke with 1 ready FDs
2017-05-02 14:37:19: select woke with 1 ready FDs
2017-05-02 14:37:19: select woke with 1 ready FDs
2017-05-02 14:37:19: select woke with 1 ready FDs
2017-05-02 14:37:19: select woke with 1 ready FDs
2017-05-02 14:37:19: select woke with 1 ready FDs
2017-05-02 14:37:19: select woke with 1 ready FDs
2017-05-02 14:37:19: select woke with 1 ready FDs
2017-05-02 14:37:19: select woke with 1 ready FDs

@danielhb

This comment has been minimized.

Copy link
Contributor

commented May 2, 2017

"I have made a positive correlation between this issue's manifestation and closing my desktop web browser's wok tab."

This is a very good lead. I am on it right now.

@danielhb

This comment has been minimized.

Copy link
Contributor

commented May 2, 2017

@ut3 I have a possible solution. Do you mind testing in your environment and see if the CPU behaves nicer?


diff --git a/src/wok/pushserver.py b/src/wok/pushserver.py
index 6cad2cf..0467fd5 100644
--- a/src/wok/pushserver.py
+++ b/src/wok/pushserver.py
@@ -113,17 +113,21 @@ class PushServer(object):
                         break
 
                     if sock == self.server_socket:
-
                         new_socket, addr = self.server_socket.accept()
                         self.connections.append(new_socket)
                     else:
                         try:
                             data = sock.recv(4096)
+                            if not data:
+                                self.connections.remove(sock)
+                                sock.close()
                         except:
                             try:
                                 self.connections.remove(sock)
                             except ValueError:
                                 pass
+                            finally:
+                                sock.close()
 
                             continue
                         if data and data == 'CLOSE':
@ut3

This comment has been minimized.

Copy link
Author

commented May 3, 2017

I will try now and report back.

Edit: Testing appears successful. I will report again in a few hours.

@ut3

This comment has been minimized.

Copy link
Author

commented May 3, 2017

@danielhb : Your patch has been in place for a few hours, during which multiple admins have interacted with wok and the problem has not manifested. I believe your patch solves the problem. Thank you for fixing this so promptly.

@danielhb

This comment has been minimized.

Copy link
Contributor

commented May 3, 2017

@alinefm alinefm closed this in 9769e0e May 17, 2017

patchew-importer pushed a commit to patchew-project/kimchi-wok that referenced this issue May 25, 2018

Bug fix kimchi-project#217: Fixing select() waking up
There was a condition in the close of the Websocket that wasn't
being covered properly by the pushserver prior to this patch. The
fix was to put an additional condition to close our remote socket
reference when we receive no data from it, instead of relying just
in the Except call to do it.

While we're at it, a cleanup was made in the code to remove a
condition of a 'CLOSE' message that was being sent in the first
versions of the Pushserver but didn't make upstream.

Signed-off-by: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.