New python service #8455

Merged
merged 26 commits into librenms:master from murrant:new-python-service on Jun 30, 2018

Conversation

8 participants
@murrant
Member

murrant commented Mar 22, 2018

Currently has a file handle leak (and will eventually run out of handles) related to the self update process.

Either need to fix that or rip out self-update and leave that up to cron or something.

DO NOT DELETE THIS TEXT

Please note

Please read this information carefully. You can run ./scripts/pre-commit.php to check your code before submitting.

Testers

If you would like to test this pull request then please run: ./scripts/github-apply <pr_id>, e.g. ./scripts/github-apply 5926

@murrant murrant referenced this pull request Mar 22, 2018

Closed

[RFC] fix: poller-service locking does not work #8450

1 of 1 task complete
@TheMysteriousX

Contributor

TheMysteriousX commented Mar 22, 2018

I got this running under python3.6; no changes really needed, it merged pretty cleanly. I'll add notes to the relevant lines, and keep an eye on the FD state.

@@ -123,7 +123,8 @@
}
global $device;
foreach (dbFetch("SELECT * FROM `devices` WHERE status = 1 AND disabled = 0 $where ORDER BY device_id DESC", $sqlparams) as $device) {
$devices = dbFetch("SELECT * FROM `devices` WHERE status = 1 AND disabled = 0 $where ORDER BY device_id DESC", $sqlparams);

@TheMysteriousX

TheMysteriousX Mar 22, 2018

Contributor

Query has changed slightly, but that's all.

require __DIR__ . '/includes/init.php';
try {

@TheMysteriousX

TheMysteriousX Mar 22, 2018

Contributor

Removed this section entirely, as mergedb could not be imported. Does it do anything?

    def connect(self):
        try:
            import MySQLdb

@TheMysteriousX

TheMysteriousX Mar 22, 2018

Contributor

Replaced with this to avoid having to install the MySQL developer kit and a C compiler:

        try:
            import pymysql
            MySQLdb = pymysql
        except ImportError:
            try:
                import MySQLdb

            except ImportError:
                print("ERROR: missing the mysql python module:")
                print("Either through your os software repository or 'pip install mysqlclient'")
                raise
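
As an aside, pymysql also ships a helper that does this aliasing globally, so the rest of the code can keep saying "import MySQLdb" untouched. A minimal sketch of that variant (an equivalent approach, not what this PR uses):

    try:
        import pymysql
        pymysql.install_as_MySQLdb()  # later 'import MySQLdb' statements resolve to pymysql
    except ImportError:
        pass  # fall through to the real MySQLdb (mysqlclient) if pymysql is absent

    import MySQLdb  # either pymysql (aliased above) or the C extension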
info("Discovering device {}".format(device_id))
self.call_script('discovery.php', ('-h', device_id))
info('Discovery complete {}'.format(device_id))
except subprocess.CalledProcessError as e:

@TheMysteriousX

TheMysteriousX Mar 22, 2018

Contributor

~~This doesn't appear to be firing - ran out of time to pin down the root cause, but it seems that discovery.php is returning 0 when failing to discover under some circumstances.~~

Root cause found.

@@ -151,3 +152,8 @@
}
logfile($string);
if (count($devices) == 0) {

@TheMysteriousX

TheMysteriousX Mar 22, 2018

Contributor

if ($devices_discovered == 0) {

@murrant

Member

murrant commented Mar 22, 2018

I'll see if I can update this so it merges later.

@murrant

Member

murrant commented Mar 23, 2018

@TheMysteriousX If you want you can do a pull request against my branch.

git remote add murrant https://github.com/murrant/librenms.git
git fetch murrant
git checkout -b new-python-service murrant/new-python-service
git push -u origin new-python-service

You're all set up now to make changes. After making changes, commit and push.

Then go to your branch on GitHub and click New pull request.
For the base, select murrant/librenms and new-python-service, then click Create pull request.
I'll review your changes quickly and merge them in.

If changes have been made in my branch, pull them in with:

git fetch murrant
git checkout new-python-service (if not already on it)
git rebase murrant/new-python-service

After a rebase you have to use --force to push to your branch.
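
e.g. (adjust the remote/branch names to your own fork):

git push --force origin new-python-service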

@TheMysteriousX

Contributor

TheMysteriousX commented Mar 23, 2018

Will do.

Preliminary results under python3.4+ are promising:
[screenshot attached: 2018-03-23 17:48:58]

This is with maintenance scheduled every 2 minutes.

So I think the best option for the auto update is to detect any version of Python lower than that, disable the maintenance queue, and document the limitation.
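
Roughly this kind of check (a minimal sketch - the flag and message are illustrative, not the actual service.py code):

    import sys

    # Self-update/restart relies on behaviour that's only reliable on 3.4+,
    # so on older interpreters skip the maintenance queue and say so.
    RUN_MAINTENANCE = sys.version_info >= (3, 4)
    if not RUN_MAINTENANCE:
        print("Python < 3.4 detected: self-update/maintenance disabled, keep daily.sh in cron")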

I'll leave it running over the weekend to make sure that the descriptor leak is really absent on 3.4+, and not just slow.

@laf

Member

laf commented Apr 3, 2018

Is this ready for testing (i.e. up to date with any pull requests into Tony's repo)?

Any docs for it?

@TheMysteriousX

Contributor

TheMysteriousX commented Apr 3, 2018

There's a pull request from me waiting for @murrant to merge, then I think it's good to go for some testing.

Under Python < 3.4 the automatic restart is disabled, ~~so if there is a critical schema update it'll break like it does with the current poller service.~~

Rethinking this, what is the actual impact of the service not restarting? The poller processes don't persist beyond the lifetime of each invocation, so the only impact is that if there's a change to the poller service itself, it doesn't get picked up? My gut feeling is that any such change would be pretty substantial, and would probably require a notification to be posted first, like was done with the PHP minimum version change.

Docs are included, and have a note on how to get python3.4 working on RHEL/CentOS 7 quickly.

Not sure how easy it is on debian/arch, but it would be worth someone investigating if they have a system handy.

@laf

Member

laf commented Apr 4, 2018

Rethinking this, what is the actual impact of the service not restarting? The poller processes don't persist beyond the lifetime of each invocation, so the only impact is that if there's a change to the poller service itself, it doesn't get picked up? My gut feeling is that any such change would be pretty substantial, and would probably require a notification to be posted first, like was done with the PHP minimum version change.

The wrappers we have change very little, but they have changed and do change, so requiring people to restart the service manually could be a pain. Would it be easier to just say this doesn't work with Python < 3.4?

Once the PR is merged let me know. I can test this over a 4 node distributed setup.

@TheMysteriousX

Contributor

TheMysteriousX commented Apr 4, 2018

If this were a new thing, then yeah, I'd say stick with a minimum version. The old poller service has the same restart limitation though, and has that fun locking bug.

If someone deploys it onto python <3.4, they're no worse off, and may even be better off if they're hitting the bug.

@laf

Member

laf commented Apr 4, 2018

That doesn't mean we should follow what's been done before. These things just generate questions, especially when it does break, so we should avoid that.

No harm in saying the minimum version is 3.4. We're already pushing people to ditch older versions of PHP.

@laf

Member

laf commented Apr 9, 2018

Is this ready for testing now?

@adam-bishop

adam-bishop commented Apr 9, 2018

Yup, the bits are finished (outside of any bugs being found); the only change outstanding is just to change the documentation to require rather than recommend python3.4+, but that doesn't impact testing.

@TheMysteriousX

Contributor

TheMysteriousX commented Apr 9, 2018

(Above comment was from me; used the wrong account again).

@laf

Member

laf commented Apr 10, 2018

Was just about to test this on distributed polling but I have a question.

Does daily.sh still need to run via cron?

How do you disable things like services and billing from running everywhere, as I only want those to run on one poller (well, billing at least)?

@TheMysteriousX

Contributor

TheMysteriousX commented Apr 10, 2018

daily.sh does not need to be run by cron - there's a maintenance queue that handles that.

Not sure I follow your second question - do you mean you have a specific poller you want those to run on?

Otherwise for billing, the poller service should select a single poller when it's time to run the billing.

@laf

Member

laf commented Apr 10, 2018

Not sure I follow your second question - do you mean you have a specific poller you want those to run on?

Otherwise for billing, the poller service should select a single poller when it's time to run the billing.

Yeah at present I run billing on one of the poller nodes so it only runs once.

What I want to make sure is that it continues to do so. I.e. I've got 4 pollers; billing should only run once, otherwise it will generate 4x the amount of data it should. Same for alerts.php.

@TheMysteriousX

Contributor

TheMysteriousX commented Apr 10, 2018

It should do that - only one poller does the work assignment, and once per billing/alerting cycle it will generate a single billing/alerting job to be run on a single poller.

Jobs cannot be worked by more than one poller.
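
If it helps to picture it, the claim works roughly like this - a minimal sketch assuming Redis is the shared backend (the key and helper names here are made up, not the ones the service uses):

    import redis

    r = redis.Redis(host='192.168.1.1', password='<random>', db=0)

    def run_billing():
        # placeholder for kicking off the billing run on this node
        print("this poller runs billing for the current cycle")

    # Only the poller whose SET NX succeeds owns this billing cycle; the key
    # expires so a crashed poller can't wedge billing forever.
    if r.set('billing:cycle:lock', 'poller-1', nx=True, ex=300):
        run_billing()
    # any other poller gets False back and simply skips the cycle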

@laf

Member

laf commented Apr 10, 2018

Let's see then :) Btw are you on discord / irc?

@laf

Member

laf commented Apr 10, 2018

Ok seems to be working fine.

[screenshot attached]

Looks like all pollers have dropped load by about 1.

CPU hasn't changed, but I think this might be because the devices are spread out over the 5-minute window, whereas before they were polled as fast as possible.

The /pollers/tab=pollers/ page no longer works, which is the insight into how the pollers have done. It's not a major showstopper, but it is a big loss of insight into whether the pollers have all completed recently, so it would be good to get that working again.

I don't seem to be getting any logs into syslog though.

@laf

Member

laf commented Apr 10, 2018

Actually I'm not sure this is honoring the poller groups.

I've got devices pinned to specific pollers which are being polled by the wrong ones. I'm going to turn this off for now.

@TheMysteriousX

Contributor

TheMysteriousX commented Apr 10, 2018

Let's see then :) Btw are you on discord / irc?

No, but I can be. I'll add it to my autojoins later (username:adamb).

Should be easy enough to fix the poller stats, I'll take a look at what the existing distributed poller does.

Logging is weird - what OS are you running under? If you check the journal (journalctl -xe), do you see anything there?

I'm surprised by the distributed polling groups not functioning - I only have a 2 system setup, but it did appear to be coordinating properly. Can you share your distributed poller config?

@laf

Member

laf commented Apr 10, 2018

I'm surprised by the distributed polling groups not functioning - I only have a 2 system setup, but it did appear to be coordinating properly. Can you share your distributed poller config?

Poller 1

$config['distributed_poller_name']                       = 'poller-1';
$config['distributed_poller_group']                      = '0';
$config['distributed_poller_memcached_host']    = "192.168.1.1";
$config['distributed_poller_memcached_port']    = 11211;
$config['distributed_poller_host']    = "192.168.1.1";
$config['distributed_poller_port']    = 11211;
$config['distributed_poller'] = TRUE;
$config['service_poller_workers']              = 24;     # Processes spawned for polling
$config['service_services_workers']            = 1;      # Processes spawned for service polling
$config['service_discovery_workers']           = 2;     # Processes spawned for discovery

$config['redis_host']                          = '192.168.1.1';  # IP or hostname of your redis (or redis sentinel) instance
$config['redis_db']                            = 0;            # Database number (see requirements)
$config['redis_pass']                          = '<random>';         # Redis auth password
$config['redis_port']                          = 6379;         # Port listening
$config['redis_socket']                        = null;         # UNIX domain socket path (conflicts with host and port options)

Poller 2 is the same

Poller 3

$config['distributed_poller_name']                       = 'poller-3';
$config['distributed_poller_group']                      = '2,3';
$config['distributed_poller_memcached_host']    = "192.168.1.1";
$config['distributed_poller_memcached_port']    = 11211;
$config['distributed_poller_host']    = "192.168.1.1";
$config['distributed_poller_port']    = 11211;
$config['distributed_poller'] = TRUE;
$config['service_poller_workers']              = 16;     # Processes spawned for polling
$config['service_services_workers']            = 0;      # Processes spawned for service polling
$config['service_discovery_workers']           = 2;     # Processes spawned for discovery

$config['redis_host']                          = '192.168.1.1';  # IP or hostname of your redis (or redis sentinel) instance
$config['redis_db']                            = 0;            # Database number (see requirements)
$config['redis_pass']                          = '<random>';         # Redis auth password
$config['redis_port']                          = 6379;         # Port listening
$config['redis_socket']                        = null;         # UNIX domain socket path (conflicts with host and port options)

Poller 4 is the same.

@TheMysteriousX

Contributor

TheMysteriousX commented Apr 10, 2018

I can see a case where the queues don't get created properly - '2,3' gets created as a queue literally called '2,3' rather than parsed and split into two queues. That should be a quick fix. I'd expect that to result in devices not getting polled at all though.

I can't find a path where work would get mis-assigned though - I'll set up a couple of extra nodes to match your config and see if I can replicate.
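
For reference, the queue-splitting fix would be roughly this (a sketch only - the queue-name format and function name are made up, not the actual service.py code):

    def poller_queues(group_setting):
        """Turn distributed_poller_group ('0', '2,3', 2, ...) into one queue name per group."""
        groups = str(group_setting).split(',')
        return ['poller:group:{}'.format(g.strip()) for g in groups if g.strip()]

    poller_queues('2,3')  # ['poller:group:2', 'poller:group:3'] instead of a single '2,3' queue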

@laf

Member

laf commented Apr 10, 2018

I can see a case where the queues don't get created properly - '2,3' gets created as a queue literally called '2,3' rather than parsed and split into two queues. That should be a quick fix. I'd expect that to result in devices not getting polled at all though.

I can't find a path where work would get mis-assigned though - I'll set up a couple of extra nodes to match your config and see if I can replicate.

I don't mind retesting to save the hassle of you spinning up more nodes.

The idea with our setup is we have about 4k devices pinned to poller-1 and poller-2. Then the rest on poller-3 and poller-4.

However, with this I was seeing poller-3 and poller-4 trying to poll devices pinned to 1 and 2.

@murrant

Member

murrant commented Jun 26, 2018

@laf it takes up to 5 minutes to update the stats/poller table.

@laf

Member

laf commented Jun 26, 2018

It's been a lot longer than that :)

@murrant

Member

murrant commented Jun 26, 2018

Hmm, maybe the SQL is failing.

@TheMysteriousX

Contributor

TheMysteriousX commented Jun 26, 2018

What's being put into the DB - is node_id still None?

@laf

Member

laf commented Jun 26, 2018

Yup still None.

@TheMysteriousX

Contributor

TheMysteriousX commented Jun 26, 2018

Could you grep your log for "Could not import .env" please, to make sure it's cleanly importing the node_id from disk?

@laf

Member

laf commented Jun 26, 2018

I'm actually just running this via screen right now so don't have that info :(

@murrant

Member

murrant commented Jun 27, 2018

Failing to parse .env should probably be fatal.

@TheMysteriousX

Contributor

TheMysteriousX commented Jun 28, 2018

The code looks to be fine. I could find 4 ways to break it:

  • Corrupting .env (manually)
  • Installing dotenv instead of python-dotenv, or not having python-dotenv installed at all
  • Having .env inaccessible by the 'librenms' user account due to *nix permissions or selinux
  • Having duplicate rows in the SQL tables (because the SQL queries are inconsistently switching between poller_name and node_id).

Any of those sound possible @laf?

@TheMysteriousX

Contributor

TheMysteriousX commented Jun 28, 2018

murrant#18 makes the SQL queries consistently use node_id, and makes failure to load the node_id fatal.

I wouldn't recommend upgrading all instances right away, as if it is a configuration problem the poller won't start - do one that has redundancy first, then figure out the root cause from there.
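
For reference, the fatal node_id check amounts to something like this (a sketch assuming python-dotenv; the exact messages and exit codes differ):

    import sys
    from dotenv import dotenv_values  # the package is python-dotenv, not 'dotenv'

    env = dotenv_values('.env')       # returns an empty mapping if .env is unreadable
    node_id = env.get('NODE_ID')
    if not node_id:
        # better to die loudly than to poll with node_id = None and corrupt the stats
        sys.stderr.write("ERROR: could not load NODE_ID from .env, refusing to start\n")
        sys.exit(2)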

@murrant

Member

murrant commented Jun 28, 2018

Two of those will cause LibreNMS in general to be broken.

We could add a requirements.txt and update the docs to use it.
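
Something like this, say (the package list below is just a guess from what's come up in this thread, not a vetted set):

    # requirements.txt
    python-dotenv
    PyMySQL
    redis

with the service docs changed to a single pip3 install -r requirements.txt step.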

TheMysteriousX and others added some commits Jun 28, 2018

Fatal .env loading and SQL query fix (#18)
* bugfix: SQL query should be against column with unique constraint (node_id)

* Make failure to load a Node ID fatal
Correct 2 invocations of exit()

* Update service.py
@laf

Tested, all seems fine. I'll merge this in the next 24 hours.

One thing that does need looking at is that the checkDeviceLastPolled() function is now too sensitive. I've constantly got a list of devices in the output from validate that haven't polled in the last 5 minutes. It looks like it's because they are started up to 5 minutes apart for polling, so they go over the 5 minutes allowed in that function. It doesn't appear to be an issue aside from being annoying.

@laf laf removed the Blocker 🚫 label Jun 29, 2018

@laf laf changed the title from New python service to WIP New python service Jun 29, 2018

@laf laf changed the title from WIP New python service to New python service Jun 29, 2018

@TheMysteriousX

Contributor

TheMysteriousX commented Jun 29, 2018

I noticed that too - I had a change that fixed it, but @murrant got there first with #8848

@laf

Member

laf commented Jun 29, 2018

Ace. I'll merge that in as well :)

@laf laf added the Feature label Jun 29, 2018

@laf laf dismissed their stale review via 0ac18bf Jun 30, 2018

@laf

laf approved these changes Jun 30, 2018

Tested on a 4 poller setup, works perfectly fine :)

@laf laf merged commit 0ba76e6 into librenms:master Jun 30, 2018

2 of 3 checks passed

  • WIP - work in progress
  • continuous-integration/travis-ci/pr - The Travis CI build passed
  • license/cla - Contributor License Agreement is signed.

@laf laf deleted the murrant:new-python-service branch Jun 30, 2018

@laf

Member

laf commented Jun 30, 2018

Merged - big thanks to @murrant and @TheMysteriousX for all of their work on this one :)

@murrant

Member

murrant commented Jul 9, 2018

@TheMysteriousX are you on Discord or IRC? There is a bug with the stats and I'm not sure what to do about it.

@lock lock bot locked as resolved and limited conversation to collaborators Sep 7, 2018
