Catch up on missed scans #64
Yeah, I also had in mind to do that for non-acknowledged scans as well (this may happen during a period when the scanners are redeployed but the API is up and running). Both are internal changes to the postgres module, so they will not affect the worker functions.
I'm tracking down a bug where all scans are acked, but very few complete:

```
observatory=> select count(*) from scans where ack='false';
 count
-------
     0
(1 row)

observatory=> select completion_perc, count(*) from scans group by completion_perc;
 completion_perc | count
-----------------+--------
               0 | 194690
              20 |    443
              40 |      2
             100 |  18520
```

The scanners are receiving notifications, but scans don't happen. The current load average on the scanners is close to 0%.
Never mind that previous comment: I had an issue in my script and was calling the scan API with an empty target. Since the validatedomain function doesn't yet verify the target, the scanners were trying to scan empty targets and failing in an unexpected way. I'm preparing a patch for validatedomain now. This issue remains.
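As a sketch of the kind of check the patch above might add: reject empty or malformed targets before they are queued for scanning. The function name `validateDomain` and the regular expression below are illustrative assumptions, not the actual patch.

```go
package main

import (
	"fmt"
	"regexp"
)

// domainRe is an illustrative pattern: one or more dot-separated labels
// of letters, digits, and hyphens. The real patch may be stricter
// (e.g. label length limits, IDNA handling).
var domainRe = regexp.MustCompile(`^[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)+$`)

// validateDomain rejects empty or malformed targets so the scanners
// never receive a notification for an unscannable target.
func validateDomain(target string) error {
	if target == "" {
		return fmt.Errorf("empty target")
	}
	if !domainRe.MatchString(target) {
		return fmt.Errorf("invalid target %q", target)
	}
	return nil
}

func main() {
	fmt.Println(validateDomain(""))            // rejected: empty target
	fmt.Println(validateDomain("example.com")) // accepted
}
```

Validating at the API boundary means a bad caller gets immediate feedback instead of silently producing scans stuck at 0% completion.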
Now that it's been running for a while, here are some real-world stats:

```
observatory=> select ack, count(*) from scans group by ack;
 ack |  count
-----+---------
 f   |     281
 t   | 4189855
(2 rows)

observatory=> select completion_perc, count(*) from scans group by completion_perc;
 completion_perc |  count
-----------------+---------
               0 | 3278345
              20 |    9169
              40 |    1428
             100 |  900753
```

So it seems like scans get acknowledged and picked up by a scanner goroutine, but never complete. Do you think limiting the number of scanners in a sync group would help?
It depends on what the problem preventing the scan from completing is. Do we have the syslog files of the running containers, to check whether any errors have been logged? Regardless of that, I am preparing a patch which will catch up on both unacknowledged and half-complete scans and re-queue them.
Check out 0c68439. Should we re-queue them after a specific amount of time (e.g. 5-6 minutes)?
I'd say abandon them. If a scanner starts and does some of the work but crashes after completion_perc > 0, there must be a reason, and we should track those in the logs and/or return feedback to the caller. That's a topic for another issue.
If the scanners crash, they need to pick up incomplete work. One way to do that is to periodically run a query that resends scan notifications for targets that are still at 0% completion.