Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: report issues as discovered instead of buffering #227

Merged
merged 17 commits into from
Oct 26, 2021

Conversation

rbailey-godaddy
Copy link
Contributor

@rbailey-godaddy rbailey-godaddy commented Oct 13, 2021

This is a reimplementation of PR #108 which streams output (including JSON).

To help us get this pull request reviewed and merged quickly, please be sure to include the following items:

  • Tests (if applicable)
  • Documentation (if applicable)
  • Changelog entry
  • A full explanation here in the PR description of the work done

PR Type

What kind of change does this PR introduce?

  • Bugfix
  • Feature
  • Code style update (formatting, local variables)
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • CI related changes
  • Documentation content changes
  • Tests
  • Other

Backward Compatibility

Is this change backward compatible with the most recently released version? Does it introduce changes which might change the user experience in any way? Does it alter the API in any way?

  • Yes (backward compatible)
  • No (breaking changes)

Issue Linking

What's new?

Issues are now written to the console as soon as they are detected, while the scan is in progress, instead of buffering all of them and reporting only when the scan has completed. This is generally backward compatible:

  • A user who starts a scan and walks away, returning only after the scan has completed, will not detect any changes in behavior.
  • A caller who references only the issues property (and not the scan() method) will see no changes in behavior.
  • A caller who references only the scan() method (and not the issues property) will see no changes in behavior, although the method is now a generator that returns issues in succession rather than a function which returns a list. A caller who does len(scanner.scan()) (or anything else that assumes the result is a list) should use scanner.issues (which is a list) instead.
  • A caller using both scan() and issues, where all scan() references occur before all issues references, will see no changes in behavior.

These cases are not backward compatible:

  • A caller who invokes scan() multiple times, expecting to see the scan re-run each time, will observe different behavior. The first series of invocations identifies (and caches) issues, but subsequent invocations re-use the cached issues instead of actually rescanning the target. It looks the same (if you allow for generator vs. list) but you are not actually doing repeated scans.
  • A caller who references issues and then calls scan() will see that the scan() call now will iterate over the cached issue list instead of rescanning. Once the caller has referenced the issues property, any subsequent scan() calls should be replaced with references to issues.
  • In particular, consuming some (partial) number of issues from scan() and then referencing issues will return the full list of issues, but cause scan() to generate from the cached issues list, meaning that the initial issues will appear to be duplicated. The caller should either use issues exclusively, or ensure all references to issues occur after all calls to scan().

This change was accomplished using the following strategy:

  • ScannerBase methods which previously appended to the _issues member have been converted to generators which yield individual issues; the lowest-level methods still append to _issues (so as to cache a record of all issues) in addition to yielding the issue. In particular, scan() is now a generator.
  • A new _completed member (and matching completed property) track whether a scan is still in progress (i.e. whether or not the scan() generator is exhausted); this is initially False and set to True by scan() when it finally exits.
  • The issues property now tests to see if the scan is still in progress, and forcibly drains the generator if needed prior to returning the issues list (which otherwise could be incomplete).
  • echo_output() now loops on scan() instead of issues in order to obtain issues individually as they are detected, allowing output to be produced as soon as it is encountered; some code reordering was performed to ensure no references to issues occur before all references to scan() have been made.
  • The output blocks of echo_output() were refactored to output individual issues instead of consolidating output to a single large result at the end -- in the most complicated of these, JSON output is handled by generating it in pieces: an initial part, a part for each issue, and a terminal part -- the resulting output is unchanged, but it now appears incrementally;
  • Unit testing generally required only minor fixups. The JSON-related unit tests were modified to perform end-to-end validation of the output stream to ensure we're generating valid JSON that correctly reflects the original input data.

@rbailey-godaddy rbailey-godaddy requested a review from a team as a code owner October 13, 2021 14:52
@rbailey-godaddy rbailey-godaddy marked this pull request as draft October 13, 2021 14:52
@rbailey-godaddy rbailey-godaddy changed the title Son of live output Feature: report issues as discovered instead of buffering Oct 13, 2021
@rbailey-godaddy rbailey-godaddy marked this pull request as ready for review October 13, 2021 15:33
Copy link
Contributor

@sushantmimani sushantmimani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🦅

mypy is happy and unit tests have been fixed up so they pass again
(mostly by liberal application of `list()` to force execution of the
generator).

This still needs work to fix the logic where it is assumed that the
`issues` property is a list, namely where we try to take the length
of it or consume it more than once.
Add new `issue_count` property to scanner class. This is incremented as
issues are yielded by the low level routines, and allows the caller to
tell how many issues we got without inspecting the pipeline multiple
times.

We needed one non-obvious fix in util.py to ensure we have consumed all
issues before checking the count. (Before, it would have been zero
nearly always.)
Previously we marshalled all issues into a single `echo()` call, which
resulted in buffering all output until the completion of the scan.

Now we use a loop (like "compact") and issue a separate `echo()` for
each issue, allowing early output to be produced prior to the generation
of later issues. This also gets rid of the unwanted "empty output for no
issues" behavior introduced by an earlier commit.
This works because only one part of the output is generated, so we can
output a (partial) static blob to start, then an issue at a time, and
finally terminate the object to make it valid.

Because this is a little hinky, the corresponding unit tests have been
adjusted to actually generate (and capture) the entire stdout stream.
We then deserialize (to verify syntax) and compare the resulting dict
with the original input data to confirm there was no content corruption.
Address the last remaining wart; make `echo_result()` buffer issues
(even as it streams them to console) and return the list so it's
possible to write all issues to files at the end if desired. This
returns `write_outputs()` to its original semantics, namely consumption
of a List rather than a Generator.

Arguably it might be nicer to create files as we go, but that would
constitute a behavior change (files not all done right at the end) and
need some code refactoring.
Revise test to reflect use of `-1` vs `None` to indicate scan not started
This commit restores the `._issues` member (a list) and reverts the
`.issues` property to a list instead of a generator. Several instances
of caller references to `.issues` have been restored to their original
form.

Behind the covers, a new `._completed` member (and matching property)
tracks whether or not the scan has completed; `.issues` checks this and
forces the scan to run to completion if needed.

The low level generators now append to `._issues` instead of incrementing
`._issue_count` (which is no longer needed), and `scan()` clears state
information when it starts and then sets `._completed` when finished.

To make it all work, `echo_result()` now loops on `.scan()` instead of
`.issues` so output is still "live" although `.issues` is not.
If `scan()` is called (again) after having completed a scan, return
immediately without doing anything. The proper way to look at issues
(again) is to consult `.issues`.

This limits, but does not eliminate, the potential for craziness if a
caller decides to interleave calls to `.scan()` and references to
`.issues`.
We squelch "too many instance attributes" on `ScannerBase` because we
really really need a `_completed` flag so we tell pylint to suck it up.
Slow burn from @tarkatronic review comment. If `scan()` is called again,
yield from the cached `._issues` list instead of doing nothing. This
looks like the old behavior except it doesn't actually rescan.
Copy link
Contributor

@tarkatronic tarkatronic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of this is probably just a bit of pontification (specifically the thread safety stuff), but I do want to be sure we don't add more side-effects where they did not previously exist.

tartufo/scanner.py Show resolved Hide resolved
tartufo/scanner.py Outdated Show resolved Hide resolved
tartufo/scanner.py Outdated Show resolved Hide resolved
tests/test_cli.py Outdated Show resolved Hide resolved
The real object doesn't have `issue_count` any longer.
Cache issues inside `scan()` instead of the lower-level routines (which
now are generators without side effects).
Multiple competing calls to `scan()` are now guaranteed to execute
sequentially.
Copy link
Contributor

@tarkatronic tarkatronic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting in all the hard work on this @rbailey-godaddy!

Copy link
Contributor

@sushantmimani sushantmimani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🦅

@tarkatronic tarkatronic merged commit d9a01bd into v3.x Oct 26, 2021
@tarkatronic tarkatronic deleted the son-of-live-output branch October 26, 2021 17:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants