Catch up on missed scans #64
Yeah, I also had in mind to do that for non-acknowledged scans as well (this may happen during a period when the scanners are redeployed but the API is up and running). Both are internal changes to the postgres module, so they will not affect the worker functions.
I'm tracking down a bug where all scans are acked, but very few complete:

```
observatory=> select count(*) from scans where ack='false';
 count
-------
     0
(1 row)

observatory=> select completion_perc, count(*) from scans group by completion_perc;
 completion_perc | count
-----------------+--------
               0 | 194690
              20 |    443
              40 |      2
             100 |  18520
```

The scanners are receiving notifications, but scans don't happen. The current load average on the scanners is close to 0%.
Never mind that previous comment: I had an issue in my script and was calling the scan API with an empty target. Since the validatedomain function doesn't yet verify the target, the scanners were trying to scan empty targets and failing in an unexpected way. I'm preparing a patch for validatedomain now. This issue remains.
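As a sketch of the kind of check the patch above might add: reject empty or malformed targets before they are queued for scanning. The function name `validateDomain` and the regular expression below are illustrative assumptions, not the actual patch.

```go
package main

import (
	"fmt"
	"regexp"
)

// domainRe is an illustrative pattern: one or more dot-separated labels
// of letters, digits, and hyphens. The real patch may be stricter
// (e.g. label length limits, IDNA handling).
var domainRe = regexp.MustCompile(`^[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)+$`)

// validateDomain rejects empty or malformed targets so the scanners
// never receive a notification for an unscannable target.
func validateDomain(target string) error {
	if target == "" {
		return fmt.Errorf("empty target")
	}
	if !domainRe.MatchString(target) {
		return fmt.Errorf("invalid target %q", target)
	}
	return nil
}

func main() {
	fmt.Println(validateDomain(""))            // rejected: empty target
	fmt.Println(validateDomain("example.com")) // accepted
}
```

Validating at the API boundary means a bad caller gets immediate feedback instead of silently producing scans stuck at 0% completion.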
Now that it's been running for a while, here are some real-world stats:

```
observatory=> select ack, count(*) from scans group by ack;
 ack |  count
-----+---------
 f   |     281
 t   | 4189855
(2 rows)

observatory=> select completion_perc, count(*) from scans group by completion_perc;
 completion_perc |  count
-----------------+---------
               0 | 3278345
              20 |    9169
              40 |    1428
             100 |  900753
```

So it seems like scans get acknowledged and picked up by a scanner goroutine, but never complete. Do you think limiting the number of scanners in a sync group would help?
It depends on what the problem preventing the scan from completing is. Do we have the syslog files of the running containers, to check whether any errors have been logged? Regardless of that, I am preparing a patch which will catch up on both unacknowledged and half-complete scans and re-queue them.
Check out 0c68439. Should we re-queue them after a specific amount of time (e.g. 5-6 minutes)?
I'd say abandon them. If a scanner starts and does some of the work but crashes after completion_perc > 0, there must be a reason, and we should track those in the logs and/or return feedback to the caller. That's a topic for another issue.
If the scanners crash, they need to pick up incomplete work. One way to do that is to periodically run a query that resends scan notifications for targets that are still at 0% completion.