Feature: report issues as discovered instead of buffering #227

rbailey-godaddy · 2021-10-13T14:52:38Z

This is a reimplementation of PR #108 which streams output (including JSON).

To help us get this pull request reviewed and merged quickly, please be sure to include the following items:

Tests (if applicable)
Documentation (if applicable)
Changelog entry
A full explanation here in the PR description of the work done

PR Type

What kind of change does this PR introduce?

Backward Compatibility

Is this change backward compatible with the most recently released version? Does it introduce changes which might change the user experience in any way? Does it alter the API in any way?

Yes (backward compatible)
No (breaking changes)

Issue Linking

What's new?

Issues are now written to the console as soon as they are detected, while the scan is in progress, instead of buffering all of them and reporting only when the scan has completed. This is generally backward compatible:

A user who starts a scan and walks away, returning only after the scan has completed, will not detect any changes in behavior.
A caller who references only the issues property (and not the scan() method) will see no changes in behavior.
A caller who references only the scan() method (and not the issues property) will see no changes in behavior, although the method is now a generator that returns issues in succession rather than a function which returns a list. A caller who does len(scanner.scan()) (or anything else that assumes the result is a list) should use scanner.issues (which is a list) instead.
A caller using both scan() and issues, where all scan() references occur before all issues references, will see no changes in behavior.

These cases are not backward compatible:

A caller who invokes scan() multiple times, expecting to see the scan re-run each time, will observe different behavior. The first series of invocations identifies (and caches) issues, but subsequent invocations re-use the cached issues instead of actually rescanning the target. It looks the same (if you allow for generator vs. list) but you are not actually doing repeated scans.
A caller who references issues and then calls scan() will see that the scan() call now will iterate over the cached issue list instead of rescanning. Once the caller has referenced the issues property, any subsequent scan() calls should be replaced with references to issues.
In particular, consuming some (partial) number of issues from scan() and then referencing issues will return the full list of issues, but cause scan() to generate from the cached issues list, meaning that the initial issues will appear to be duplicated. The caller should either use issues exclusively, or ensure all references to issues occur after all calls to scan().

This change was accomplished using the following strategy:

ScannerBase methods which previously appended to the _issues member have been converted to generators which yield individual issues; the lowest-level methods still append to _issues (so as to cache a record of all issues) in addition to yielding the issue. In particular, scan() is now a generator.
A new _completed member (and matching completed property) track whether a scan is still in progress (i.e. whether or not the scan() generator is exhausted); this is initially False and set to True by scan() when it finally exits.
The issues property now tests to see if the scan is still in progress, and forcibly drains the generator if needed prior to returning the issues list (which otherwise could be incomplete).
echo_output() now loops on scan() instead of issues in order to obtain issues individually as they are detected, allowing output to be produced as soon as it is encountered; some code reordering was performed to ensure no references to issues occur before all references to scan() have been made.
The output blocks of echo_output() were refactored to output individual issues instead of consolidating output to a single large result at the end -- in the most complicated of these, JSON output is handled by generating it in pieces: an initial part, a part for each issue, and a terminal part -- the resulting output is unchanged, but it now appears incrementally;
Unit testing generally required only minor fixups. The JSON-related unit tests were modified to perform end-to-end validation of the output stream to ensure we're generating valid JSON that correctly reflects the original input data.

sushantmimani

LGTM 🦅

mypy is happy and unit tests have been fixed up so they pass again (mostly by liberal application of `list()` to force execution of the generator). This still needs work to fix the logic where it is assumed that the `issues` property is a list, namely where we try to take the length of it or consume it more than once.

Add new `issue_count` property to scanner class. This is incremented as issues are yielded by the low level routines, and allows the caller to tell how many issues we got without inspecting the pipeline multiple times. We needed one non-obvious fix in util.py to ensure we have consumed all issues before checking the count. (Before, it would have been zero nearly always.)

Previously we marshalled all issues into a single `echo()` call, which resulted in buffering all output until the completion of the scan. Now we use a loop (like "compact") and issue a separate `echo()` for each issue, allowing early output to be produced prior to the generation of later issues. This also gets rid of the unwanted "empty output for no issues" behavior introduced by an earlier commit.

This works because only one part of the output is generated, so we can output a (partial) static blob to start, then an issue at a time, and finally terminate the object to make it valid. Because this is a little hinky, the corresponding unit tests have been adjusted to actually generate (and capture) the entire stdout stream. We then deserialize (to verify syntax) and compare the resulting dict with the original input data to confirm there was no content corruption.

Address the last remaining wart; make `echo_result()` buffer issues (even as it streams them to console) and return the list so it's possible to write all issues to files at the end if desired. This returns `write_outputs()` to its original semantics, namely consumption of a List rather than a Generator. Arguably it might be nicer to create files as we go, but that would constitute a behavior change (files not all done right at the end) and need some code refactoring.

Revise test to reflect use of `-1` vs `None` to indicate scan not started

This commit restores the `._issues` member (a list) and reverts the `.issues` property to a list instead of a generator. Several instances of caller references to `.issues` have been restored to their original form. Behind the covers, a new `._completed` member (and matching property) tracks whether or not the scan has completed; `.issues` checks this and forces the scan to run to completion if needed. The low level generators now append to `._issues` instead of incrementing `._issue_count` (which is no longer needed), and `scan()` clears state information when it starts and then sets `._completed` when finished. To make it all work, `echo_result()` now loops on `.scan()` instead of `.issues` so output is still "live" although `.issues` is not.

If `scan()` is called (again) after having completed a scan, return immediately without doing anything. The proper way to look at issues (again) is to consult `.issues`. This limits, but does not eliminate, the potential for craziness if a caller decides to interleave calls to `.scan()` and references to `.issues`.

We squelch "too many instance attributes" on `ScannerBase` because we really really need a `_completed` flag so we tell pylint to suck it up.

@tarkatronic

Slow burn from @tarkatronic review comment. If `scan()` is called again, yield from the cached `._issues` list instead of doing nothing. This looks like the old behavior except it doesn't actually rescan.

tarkatronic

Some of this is probably just a bit of pontification (specifically the thread safety stuff), but I do want to be sure we don't add more side-effects where they did not previously exist.

tartufo/scanner.py

tests/test_cli.py

The real object doesn't have `issue_count` any longer.

Cache issues inside `scan()` instead of the lower-level routines (which now are generators without side effects).

Multiple competing calls to `scan()` are now guaranteed to execute sequentially.

tarkatronic

Thanks for putting in all the hard work on this @rbailey-godaddy!

sushantmimani

LGTM 🦅

rbailey-godaddy requested a review from a team as a code owner October 13, 2021 14:52

rbailey-godaddy requested review from beardedeagle, sushantmimani and sjacoby-godaddy October 13, 2021 14:52

rbailey-godaddy marked this pull request as draft October 13, 2021 14:52

rbailey-godaddy changed the title ~~Son of live output~~ Feature: report issues as discovered instead of buffering Oct 13, 2021

rbailey-godaddy marked this pull request as ready for review October 13, 2021 15:33

rbailey-godaddy mentioned this pull request Oct 13, 2021

feat(scan): use generator to display live output #108

Closed

sushantmimani approved these changes Oct 13, 2021

View reviewed changes

sushantmimani requested a review from tarkatronic October 13, 2021 22:29

rbailey-godaddy force-pushed the son-of-live-output branch from cccc2b1 to d07ca54 Compare October 14, 2021 22:48

rbailey-godaddy added the enhancement New feature or request label Oct 19, 2021

rbailey-godaddy force-pushed the son-of-live-output branch from d07ca54 to 032ed0b Compare October 19, 2021 15:07

rbailey-godaddy changed the base branch from main to v3.0.0-dev October 20, 2021 19:34

rbailey-godaddy force-pushed the son-of-live-output branch from b9eebb3 to 9952ab5 Compare October 20, 2021 19:37

rbailey-godaddy changed the base branch from v3.0.0-dev to v3.x October 20, 2021 23:54

rbailey-godaddy requested a review from sushantmimani October 21, 2021 19:56

rscottbailey added 13 commits October 21, 2021 16:42

Fix issue_count logic

b67d23e

Revise test to reflect use of `-1` vs `None` to indicate scan not started

Add CHANGELOG entry

03e880e

black fixup

b980f11

rebase fixups

632a928

linter fixups

e9fb058

We squelch "too many instance attributes" on `ScannerBase` because we really really need a `_completed` flag so we tell pylint to suck it up.

Fix scan() re-call

459dbc0

Slow burn from @tarkatronic review comment. If `scan()` is called again, yield from the cached `._issues` list instead of doing nothing. This looks like the old behavior except it doesn't actually rescan.

linter fixups

1f43307

rbailey-godaddy force-pushed the son-of-live-output branch from 477a1fd to 1f43307 Compare October 21, 2021 20:43

tarkatronic mentioned this pull request Oct 22, 2021

Add a progress meter #237

Closed

tarkatronic suggested changes Oct 25, 2021

View reviewed changes

tartufo/scanner.py Show resolved Hide resolved

tartufo/scanner.py Outdated Show resolved Hide resolved

tartufo/scanner.py Outdated Show resolved Hide resolved

tests/test_cli.py Outdated Show resolved Hide resolved

rscottbailey added 3 commits October 25, 2021 18:07

Remove unnecessary mock properties

9db392a

The real object doesn't have `issue_count` any longer.

Hoist issue accumulation

1df4c0e

Cache issues inside `scan()` instead of the lower-level routines (which now are generators without side effects).

Add locking for scan() method

361d33d

Multiple competing calls to `scan()` are now guaranteed to execute sequentially.

rbailey-godaddy requested a review from tarkatronic October 26, 2021 14:12

tarkatronic approved these changes Oct 26, 2021

View reviewed changes

sushantmimani approved these changes Oct 26, 2021

View reviewed changes

tarkatronic merged commit d9a01bd into v3.x Oct 26, 2021

tarkatronic deleted the son-of-live-output branch October 26, 2021 17:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature: report issues as discovered instead of buffering #227

Feature: report issues as discovered instead of buffering #227

rbailey-godaddy commented Oct 13, 2021 •

edited

Loading

sushantmimani left a comment

tarkatronic left a comment

tarkatronic left a comment

sushantmimani left a comment

Feature: report issues as discovered instead of buffering #227

Feature: report issues as discovered instead of buffering #227

Conversation

rbailey-godaddy commented Oct 13, 2021 • edited Loading

PR Type

Backward Compatibility

Issue Linking

What's new?

sushantmimani left a comment

Choose a reason for hiding this comment

tarkatronic left a comment

Choose a reason for hiding this comment

tarkatronic left a comment

Choose a reason for hiding this comment

sushantmimani left a comment

Choose a reason for hiding this comment

rbailey-godaddy commented Oct 13, 2021 •

edited

Loading