High CPU usage with no TICK scripts #1950
Comments
Can you look at the kapacitor stats to see what it is up to? You can enable the stats in the kapacitor config and then publish them to InfluxDB for monitoring using:
We have determined what was causing the high kapacitor CPU usage, which was actually high on both instances. InfluxDB and kapacitor both run on each instance, as mentioned above. The InfluxDB data persisted through new instance creation in each availability zone, but the /var/lib/kapacitor directory did not. Each time kapacitor started on a new instance, it self-registered InfluxDB subscriptions for all databases and retention policies (5 subscriptions in our configuration). Since there is no self-unregister, we accumulated more than 70 subscriptions in InfluxDB. Whether kapacitor was processing or rejecting the other subscriptions is not known, but kapacitor was working much harder than it needed to.

We stopped kapacitor, dropped all subscriptions from InfluxDB, relocated the /var/lib/kapacitor directory to persistent storage, then reconfigured and restarted kapacitor. Kapacitor self-registered only 5 subscriptions. We have since rebooted and created a new instance, and kapacitor has not registered any additional subscriptions. Our load average has dropped from the 7-10 range to < 1 on both instances.

When we had found the large number of subscriptions but did not yet know they were the cause, this documentation was helpful: As well as finding this issue:
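For anyone doing the same cleanup, here is a minimal sketch of how the DROP SUBSCRIPTION statements for stale subscriptions could be generated. It is not what we actually ran. It assumes you have already copied the output of SHOW SUBSCRIPTIONS into a dictionary by hand, and that kapacitor's subscription names start with "kapacitor-" (check this against your own output). The function name, the input layout and the keep_name parameter are hypothetical helpers for illustration. It only builds InfluxQL strings and does not connect to InfluxDB.

```python
from typing import Dict, List, Optional, Tuple


def build_drop_statements(
    subscriptions: Dict[Tuple[str, str], List[str]],
    keep_name: Optional[str] = None,
    prefix: str = "kapacitor-",  # assumption: verify against SHOW SUBSCRIPTIONS
) -> List[str]:
    """Return one DROP SUBSCRIPTION statement per stale subscription.

    subscriptions maps (database, retention_policy) to the subscription
    names listed for that pair. Names that do not start with prefix, or
    that equal keep_name, are left alone.
    """
    statements = []
    for (database, retention_policy), names in sorted(subscriptions.items()):
        for name in names:
            if not name.startswith(prefix) or name == keep_name:
                continue
            statements.append(
                f'DROP SUBSCRIPTION "{name}" ON "{database}"."{retention_policy}"'
            )
    return statements


if __name__ == "__main__":
    # Made-up example data, not taken from our environment.
    example = {
        ("telegraf", "autogen"): ["kapacitor-old-1", "kapacitor-current"],
        ("_internal", "monitor"): ["kapacitor-old-1", "other-consumer"],
    }
    for statement in build_drop_statements(example, keep_name="kapacitor-current"):
        print(statement)
```

Review the generated statements before running them in the influx shell. In our case kapacitor was stopped and every subscription was dropped, so keep_name would simply be left unset.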
Our monitoring stack spans 2 availability zones, each with an instance (r4.large) running kapacitor. Both run kapacitor 24x7, but only one is ever the primary with our TICK scripts enabled, while the other runs with no tasks. We have automation that runs every 5 minutes to decide which instance is the primary. If the primary has failed or its instance was terminated, we add and enable the TICK scripts on the secondary.

The issue we are experiencing is that the instance in one of our availability zones constantly has higher CPU usage than our other instance, regardless of whether it is running the TICK scripts or not. We've isolated the issue to kapacitor using an unacceptable amount of CPU, but cannot determine why. Debug logging does not output anything other than the following line every minute:
lvl=debug msg="linking subscription for cluster" service=influxdb cluster=localhost cluster=localhost
We have also tried terminating the instance and bringing up new instances in its place several times, with no change.

The 2 instances are mostly identical: the exact same AMI, the same version of kapacitor (1.5.0), and the same configuration files, which are shared via EFS. The only difference we can see is the availability zone they are in.
Below is a screenshot of our monitoring of the CPU usage on the 2 instances. The blue is our problematic machine and the green is the normal, primary machine. In this case, the green is running our TICK scripts while the blue has none enabled. The drop in usage for blue at around 15:31 is when we disabled kapacitor on the secondary machine. Once it is disabled, you can see the usage drop to mirror what the primary is doing.
Please let me know if you need any more information. We have tried everything and are unable to determine what is causing this high kapacitor usage.