-
-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PoshBot seems to be running a job that uses high CPU #203
Comments
This looks like the PowerShell job that PoshBot spawns to connect to Slack's RTM API over websockets. This job should always be running. PoshBot will start it when it connects to Slack and continually reads the output stream from it. Looks like you're receiving this messages from Slack:
https://api.slack.com/events/goodbye We're not handling that message type in the Slack backend so it sounds like Slack is closing the connection but we don't have logic to handle that correctly. Since it isn't a known (to PoshBot) message type we're ignoring it. https://github.com/poshbotio/PoshBot/blob/master/PoshBot/Implementations/Slack/SlackBackend.ps1#L347 Instead, we should probably add some proactive reconnect logic. Thanks for the detailed issue @pauby. This helps a lot! |
Thanks @devblackops. I'm assuming there is no workaround for it just now? The only solution I've implemented is to turn off PoshBot and that is not sustainable 😄 |
Correct. No solution right now but I'll see how easy it will be to properly handly the |
I really need get time to dig into the code and help out! |
We seem to have experienced this a couple of times over the weekend since the 0.12.0 release if that helps at all. |
@devblackops The comment by @wozzo seems to indicate this is something that's been changed in 0.12.0. |
Hmmm, I'm now getting same message as well. (0.12.0, PS 5.1, Server 2012/R2 and 2016). Service restarts OK. I can supply more diags info if it helps (or should I create a new issue ticket?)
|
@pauby @tenwiseman About how often is the |
I think this is also the problem i am having: it started 3/4 and I was using poshbot 0.11.8 but it looked like we were getting rate limited by slack's api "too many requests", I upgraded poshbot to 0.12.0 - and now it looks like a bad token from slack (it might have been changed to troubleshoot last week, but didn't get updated for the next release. below are the logs (well part of it) today I did notice the ramp up of resource consumption: and multiple spin-off powershell tasks, mentioned by @devblackops
my psversion table
|
Hi Brandon, I'd only just changed the logging level to 'debug' for the last run, which stayed functional for 8hours. Previously it stayed functional for about an hour, before the bot stopped responding to commands, clearing it's 'presence' beacon in the client (presumably before it says 'goodbye'), and the socket connection disappears from netstat. I haven't see the excessive CPU load issues others have mentioned though, but my current (test) instance is just one slack user and a couple of bots on different machines. Thanks :) |
@devblackops Here is the log I have for a problem day:
The goodbye is received once and nothing then happens until I restart it the next day. |
I switched PoshBot back on last night as I was going to try and troubleshoot the problem more. Like @donfouts I started getting 429 Too Many Requests from Slack. I see PoshBot successfully connecting to the API, then disconnecting, then connecting over and over again. My intention today was to reinstall PoshBot 0.12.0 just in case something has gone wrong. I also wanted to downgrade to 0.11.8 to see if that fixes it too. I haven't done either of those yet but will do shortly. I don't want to pollute this issue with something unrelated so I mention it only as it has already been mentioned. Let me know if you want me to raise a separate issue for this. I also haven't posted logs for the same reason. |
Just to update my above comment - I've now reinstalled 0.12.0 and I'm not getting the 429 response from Slack or the constant connecting and disconnecting that I experienced before. It could be that Slack had an issue? |
I can repro the CPU issue with this snippet that just connects to the websocket and outputs json to the output stream. https://gist.github.com/devblackops/efef2a4e20e542d31a83b175a535ed17 Steps:
This is not reproducing the |
@devblackops I'm not sure this helps but I'll add it anyway. I went back to 0.11.8 of the module and get the constant connect / disconnect I mentioned above. I know that 0.11.8 didn't have TLS 1.2 (I think?) turned on so I'm thinking maybe it's this that is causing the connect / disconnect? There is nothing in the logs that mentions it, in fact it says it connects to the Slack API:
Hopefully this isn't a red herring and helps ... in some way 😄 Note that going back to 0.12.0 connects fine, we just have the high CPU issue. |
The If people can build from src, try that out, and let me know if it helps, I'd appreciate it! |
@devblackops I have deployed this now. Will let you know if it ends up with high CPU tomorrow. |
I've got an update but I'm not quite sure if it's good or bad. On Saturday (4 April) the server that PoshBot runs on started reporting high CPU again. I picked this up today and it appears it's a PowerShell PoshBot job. I had to kill it in Task Manager to get the server back to being responsive again so i didn't any troubleshooting done. When I did kill it I got this message:
So I know it's a PoshBot job. There is absolutely nothing in the logs that I can see so I'm a little confused. I'm happy to leave this as is and keep an eye on it again and this time get the time to troubleshoot it. It's just sods law this happened over a weekend. |
I have to report this again today. The server reported high CPU at 12:07 (BST). I've looked at the server now and Task Manager is showing the job is this: And the user tab is showing this: (the user in Orange is the same user that is running PoshBot). If I try to use PoshBot in Slack I get no response. When I kill the process in Task Manager I get this in the PoshBot console. Here is what appears in the logs:
And process in Task Manager appears again: If I try to use PoshBot now it works fine. This is exactly the behaviour that was experienced before the fix for Slack being disconnected. The difference now appears to be that nothing is being logged to indicate the problem. @devblackops Any thoughts? |
@pauby I'm seeing this in my environment as well now so it seems like new behavior Slack is rolling out where the 'goodbye' event is being sent every 24 hours or so. I've been working on changing the Slack backend to use a runspace pool to get better logging of what is going on with the connection, and making improvements to the reconnect logic. The good news is I see this in my environment now (before I was just basing it on what others have said). Hopefully I can get this fixed soon! |
@devblackops I've also been seeing this problem with PoshBot running as a service on PSVersion 5.1.14393.3471 on Windows 2016 server. I enabled debug mode for the logs and did see a 'goodbye' message in the logs the last time it happened. I've enabled the fix in 224b80d and will let you know how that goes. I was wondering if PoshBot could try a Ping and Pong message with Slack every few minutes and if a reply does not come back from slack in a certain amount of time, do a disconnect/reconnect. This might cover the cases where a 'goodbye' message is not seen in the logs. |
This should hopefully be fixed in this latest commit: 9016bc9. Can everyone pull latest, rebuild, and try it out? |
@devblackops I will try to pull latest and rebuild based on commit: 9016bc9 and let you know how it goes. My bot did fail after having applied the changes in commit: 224b80d and seeing a "goodbye" message in the logs. However, my Poshbot service and processes were still running. I've attached a log file (obfuscated details) if that helps. It shows messages when receiving the "goodbye" message and then shows there were no messages until after I killed the slackbackend powershell process. It then shows the bot reconnecting and supporting a !status command before I manually stopped the poshbot service. Hopefully I can get the latest build done today and get you some feed back in a day or two. |
I'm also experiencing this problem (or a variant of it) - certainly i see my bot disconnect from Slack at the same time as the receive job hangs and spikes the CPU, then if I kill that powershell process it will reconnect. This was daily until 14/4, but since then it is happening roughly every 100 seconds. I've built the latest commit and tried that, but I see exactly the same issue. The logs stop too, and only restart with a reconnect, but I see no [goodbye] messages in there - and wouldn't expect to receive one every 100 seconds anyway. I wondered if I could also be hitting rate limits, but I get no errors to suggest that. What I can say is this behaviour is consistent whether I'm using 0.12.0, 0.12.1 or even the script you made for the TLS issue at https://gist.github.com/devblackops/28afca76db2e1b5ed4fbd3eedf7c94cc
|
@TomWoodling This looks to be the same underlying issue but it's odd that you get disconnected every 100 seconds. The error you see above is coming from the Can you think of anything in your environment that would possibly be interfiewing with the connection? |
@devblackops I've been thinking about that and I might have the answer to the 100s part. My Powershell versions were 5.1.14393.3471 on server or 5.1.18362.628 on desktop - updated one to v7 and now it can hold the connection. 5 minutes and counting so far. |
@devblackops I haven't built the new version yet, but I am now getting disconnected about as often as @TomWoodling reports. I was able to be connected for at least 8 hours before today. Nothing has changed in my config for at least two days when I implemented the fix to see goodbye messages. |
@devblackops I've built against the latest commit and still have Poshbot failing after about 100 seconds. PSVersion 5.1.14393.3471 on Windows 2016 server. Nothing in the logs to indicate why. using resource monitor, I see the websocket connection close. |
@DunlapOverbelt I can confirm that running the bot in PSVersion 7.0.0 fixes that - as I've just been finding out it runs side-by-side with PS v5.1 so it's not a huge deal to install it. Then all you need to do is start the bot normally , but using pwsh.exe instead (or directly in the PS v7 shell). It also stops the 100sec disconnects for the 0.12.0 PoshBot release from PSGallery. |
@TomWoodling yea.. I need to get this up and running ASAP.. so I'm going to go ahead and install powershell 7 on my server that runs poshbot. There are a few things that appear to break with powershell7 though if using modules I've already downloaded. I'd still like to figure out why it just stopped working for Powershell 5.1.. it doesn't make sense that it worked for hours before and then just stopped working for more than a few seconds days later. @devblackops If I'm going to run with powershell 7, do I need to rebuild from source using powershell 7 to build Poshbot? |
@DunlapOverbelt Nope. You don’t need to rebuild to run in pwsh 7 if you’ve already done that. I’ll do some testing with PowerShell 5.1. 6 and then 7 have been my daily drivers for awhile. |
@devblackops @TomWoodling I've got Poshbot Running under powershell 7, however.. two of my main plugins won't work now.. I use Posh-SSH and OperationsManager modules in my plugins. |
OK, I see the same behavior in PowerShell 5.1. After exactly 100 seconds, the PowerShell process for the websocket job is a steady 6% CPU and has stopped working. Digging into it.
|
@pauby @DunlapOverbelt @TomWoodling @tenwiseman @wozzo Everyone, I'm hoping this is finally squashed in 468cfa9. Please rebuild from source and post back here if the issues goes away. This post pointed me in the right direction: https://stackoverflow.com/questions/40502921/net-websockets-forcibly-closed-despite-keep-alive-and-activity-on-the-connectio. It appears to be an issue in .NET Framework (not core). |
@devblackops wow - that is quite the obscure issue indeed, great find! The new commit is working well for me under PS 5.1 - 15 minutes and no drops so far on a test bot, so it's already a massive improvement. That being said, I'll stick to PS 7 for production from now on 😹 On that note, I've managed to log a few of the goodbye messages on a bot running the previous commit and each time it's handled the reconnect gracefully, no manual restarts in nearly 3 days of running now. |
@devblackops I've built PoshBot from 468cfa9 and have confirmed that Poshbot stays connected for longer than 100 seconds on PSVersion 5.1.14393.3471 on Windows 2016 server. I'll leave it running and let you know if I see a disconnect over night. At least I don't need to re-write my plugins that depend on PS 5.1 modules with the WindowsCompatibility module yet! And Dilbert works again! |
@devblackops confirming, things look good with PoshBot from commit 468cfa9 build. Logs also show reconnected after a goodbye message. Good to go! I did notice during the build that requirements.psd1 lists PSSlack 1.0.2 but the PoshBot.psd1 lists PSSlack 1.0.1 as required modules. not sure if that need updated before this commit roles to 0.12.1 version. Thanks for figuring this out! |
Awesome news. Thanks for testing! I'll fix the version for PSSlack. |
@devblackops I've actually just seen this so I've deployed 468cfa9 now. |
Excellent @pauby. Please let me know if it resolves your issue. |
Closing this out as |
I'm not actually sure what's happening this one so I'm raising the issue to see if I can get more info on how to troubleshoot it and confirm it's related to PoshBot.
I recently deployed the commit from this issue. Ever since then, at approximately the same time (within a couple of hours) a PowerShell session get's spawned that consumes a high amount of CPU. Our monitoring system then is alerting us that the CPU is being pegged at 100%. When I log in I see the following:
Some things to note:
Now this could just be any PowerShell session, however when I terminate that process I get this in PoshBot:
So it looks like a job running in PoshBot. However when I run
get-schedule
for PoshBot I get this:The two jobs that run every day both run without problems. So it's not them. The 30 day one isn't due to run yet.
There is little in the PoshBot logs, however I do notice this at approximately the time the alerts come through about high CPU:
And finally, in the event logs of the server I get this at 18:18:15:
and at 18:24:16:
I only add the event logs as I've seen spike when WMI misbehaves but I'm not sure if this is also a red herring.
The logs are at 'DEBUG' level and there is not much in there. So I'm looking to find out if:
The only step left to me was to stop PoshBot from running and to see if that session runs again.
(apologies I didn't complete the issue template as this was much more than that).
Your Environment
The text was updated successfully, but these errors were encountered: