New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Identify files that were scanned #120

Closed
lgolding opened this Issue Mar 10, 2018 · 15 comments

Comments

Projects
None yet
2 participants
@lgolding
Contributor

lgolding commented Mar 10, 2018

Originally, the run.files dictionary was specified to only include entries for files that were scanned in the course of the run. Since then, we have allowed the run.files dictionary to contain many other kinds of files, to enable us to embed the contents of those files within the log file:

  • response files
  • attachments
  • the contents of the standard streams.

The problem is that now we don't have a way to identify, within the run.files dictionary, those files that were scanned. @michaelcfanning and I have discussed this in the past week or so but we neglected to file the issue.

@lgolding

This comment has been minimized.

Show comment
Hide comment
@lgolding

lgolding Mar 12, 2018

Contributor

@michaelcfanning We've discussed two options:

  1. Add a Boolean property file.scanTarget, true if the file was a scan target.
  2. Add property run.scanTarget to contain just the scan targets, and leave run.file to hold everything else. Both properties are string-to-file-dictionaries.

I like Option 2 better because I don't think a file has any business knowing what role it played.

Contributor

lgolding commented Mar 12, 2018

@michaelcfanning We've discussed two options:

  1. Add a Boolean property file.scanTarget, true if the file was a scan target.
  2. Add property run.scanTarget to contain just the scan targets, and leave run.file to hold everything else. Both properties are string-to-file-dictionaries.

I like Option 2 better because I don't think a file has any business knowing what role it played.

@lgolding

This comment has been minimized.

Show comment
Hide comment
@lgolding

lgolding Mar 12, 2018

Contributor

@michaelcfanning Whether we add a property to the file object, or group the different kinds of file objects into different properties of the run object, what do we want to do about files that are traced through in the course of scanning another file? Like when you find an issue in foo.h when you were instructed to scan foo.c?

Even the original language, which said that run.files should include files that were "scanned in the course of the run", didn't address this question.

Contributor

lgolding commented Mar 12, 2018

@michaelcfanning Whether we add a property to the file object, or group the different kinds of file objects into different properties of the run object, what do we want to do about files that are traced through in the course of scanning another file? Like when you find an issue in foo.h when you were instructed to scan foo.c?

Even the original language, which said that run.files should include files that were "scanned in the course of the run", didn't address this question.

@michaelcfanning michaelcfanning removed the bug label Mar 26, 2018

@lgolding lgolding added bug design-improvement and removed bug labels Mar 26, 2018

@michaelcfanning

This comment has been minimized.

Show comment
Hide comment
@michaelcfanning

michaelcfanning Mar 28, 2018

Contributor

Note that this relates to #108 and we should resolve both at one blow. i.e., you may identify all scan targets by pointing to a specific revision/commit. This would make sense for a source scanner, for example, where it may be prohibitively expensive to hash/document all files scanned (which could number in thousands and more).

Contributor

michaelcfanning commented Mar 28, 2018

Note that this relates to #108 and we should resolve both at one blow. i.e., you may identify all scan targets by pointing to a specific revision/commit. This would make sense for a source scanner, for example, where it may be prohibitively expensive to hash/document all files scanned (which could number in thousands and more).

@lgolding

This comment has been minimized.

Show comment
Hide comment
@lgolding

lgolding Apr 5, 2018

Contributor

@michaelcfanning It's related to #108 in that if you are specifying a source control commit, then you can identify the version of all the files that were scanned. But you don't know that you scanned all of them. Maybe it's a C# scanner, and it didn't scan the C++ files. Or maybe you chose not to scan the test files.

I agree that enumerating tens of thousands of files is prohibitive. Let's see what the files dictionary gives us:

  1. Hashes. We don't need those if we have a commit reference.
  2. Embedded file contents. We don't need that if we can extract the files from the VCS.
  3. Encoding. Well, ok, we still need that. But if a team knows the encoding of all the files it analyzes, maybe it can get along without it.
  4. MIME type. We still need that, but a viewer can fall back to guessing from the filename extension.

So sure, if a team using a VCS is prepared to get along without #3 or #4, then they can refrain from emitting the list of scanned files into the dictionary.

But I don't think that gets us off the hook. What if a team wants that extra information in the files dictionary? Maybe they only have a few hundred files and they're just fine with the extra file size. I think we still have to provide a mechanism for identifying the scanned files. In that case I still like my Option #2 above (add run.scanTargets to hold the files that were scanned, and leave run.files for everything else).

That does still leave the question I posed in the next comment: what do we do about files that were traced through rather than fully scanned? See that comment for details.

Contributor

lgolding commented Apr 5, 2018

@michaelcfanning It's related to #108 in that if you are specifying a source control commit, then you can identify the version of all the files that were scanned. But you don't know that you scanned all of them. Maybe it's a C# scanner, and it didn't scan the C++ files. Or maybe you chose not to scan the test files.

I agree that enumerating tens of thousands of files is prohibitive. Let's see what the files dictionary gives us:

  1. Hashes. We don't need those if we have a commit reference.
  2. Embedded file contents. We don't need that if we can extract the files from the VCS.
  3. Encoding. Well, ok, we still need that. But if a team knows the encoding of all the files it analyzes, maybe it can get along without it.
  4. MIME type. We still need that, but a viewer can fall back to guessing from the filename extension.

So sure, if a team using a VCS is prepared to get along without #3 or #4, then they can refrain from emitting the list of scanned files into the dictionary.

But I don't think that gets us off the hook. What if a team wants that extra information in the files dictionary? Maybe they only have a few hundred files and they're just fine with the extra file size. I think we still have to provide a mechanism for identifying the scanned files. In that case I still like my Option #2 above (add run.scanTargets to hold the files that were scanned, and leave run.files for everything else).

That does still leave the question I posed in the next comment: what do we do about files that were traced through rather than fully scanned? See that comment for details.

@michaelcfanning

This comment has been minimized.

Show comment
Hide comment
@michaelcfanning

michaelcfanning Apr 13, 2018

Contributor

Let's step back a moment. The prominent driver for scan targets is to provide information on how much coverage an analysis tool actually provided. So if you run 'scanner.exe *.dll' and this information is persisted to the invocation object, what files did *.dll actually expand to as analysis took place? This information can be important for compliance reasons, otherwise, the SARIF-persisted data is insufficient evidence to demonstrate goals are met.

We probably need a phone call to talk this through.

Contributor

michaelcfanning commented Apr 13, 2018

Let's step back a moment. The prominent driver for scan targets is to provide information on how much coverage an analysis tool actually provided. So if you run 'scanner.exe *.dll' and this information is persisted to the invocation object, what files did *.dll actually expand to as analysis took place? This information can be important for compliance reasons, otherwise, the SARIF-persisted data is insufficient evidence to demonstrate goals are met.

We probably need a phone call to talk this through.

@lgolding

This comment has been minimized.

Show comment
Hide comment
@lgolding

lgolding Apr 13, 2018

Contributor

@michaelcfanning Let's try a little longer to resolve it here. I think we're close.

You are right about the original motivation for identifying scan targets. The VCS information tells you nothing about which files were scanned -- it only tells you what version of files were scanned -- which is why I didn't understand your suggestion (two comments up) to couple #108 with this issue.

What's wrong with my original proposal, in the second comment from the top?

Add property run.scanTarget to contain just the scan targets, and leave run.file to hold everything else. Both properties are string-to-file-dictionaries.

Contributor

lgolding commented Apr 13, 2018

@michaelcfanning Let's try a little longer to resolve it here. I think we're close.

You are right about the original motivation for identifying scan targets. The VCS information tells you nothing about which files were scanned -- it only tells you what version of files were scanned -- which is why I didn't understand your suggestion (two comments up) to couple #108 with this issue.

What's wrong with my original proposal, in the second comment from the top?

Add property run.scanTarget to contain just the scan targets, and leave run.file to hold everything else. Both properties are string-to-file-dictionaries.

@michaelcfanning

This comment has been minimized.

Show comment
Hide comment
@michaelcfanning

michaelcfanning Apr 14, 2018

Contributor

I'm worried about your proposal. Now we've bifurcated the files objects and complicated look-up, 'first, check the scan targets, if you don't find it their, consult the files object'. we have a new validation requirement 'don't duplicate content between scanTargets and files'.

that's my main concern. also, in thinking this through for source scanning, we have a new corner case: if you are analyzing a bunch of source files, it may be prohibitively expensive to persist/store file names + hashes for all this content. in this case, you might prefer to save a time-stamp of an encapsulating directory.

the absence of a timestamp on the file object seems like a clear bug, it is as good as a hash in some scenarios. what about directory information (with timestamp + hash details). too much of a corner case?

Contributor

michaelcfanning commented Apr 14, 2018

I'm worried about your proposal. Now we've bifurcated the files objects and complicated look-up, 'first, check the scan targets, if you don't find it their, consult the files object'. we have a new validation requirement 'don't duplicate content between scanTargets and files'.

that's my main concern. also, in thinking this through for source scanning, we have a new corner case: if you are analyzing a bunch of source files, it may be prohibitively expensive to persist/store file names + hashes for all this content. in this case, you might prefer to save a time-stamp of an encapsulating directory.

the absence of a timestamp on the file object seems like a clear bug, it is as good as a hash in some scenarios. what about directory information (with timestamp + hash details). too much of a corner case?

@lgolding

This comment has been minimized.

Show comment
Hide comment
@lgolding

lgolding Apr 14, 2018

Contributor

@michaelcfanning I agree with your concern. I was trying to be a purist; as I said in my proposal, I didn't think a file had any business knowing what role it played, as my proposal #1 would have had it. Even if that were ok, adding a Boolean property like isScanTarget would open the door to other properties like isStandardStream, isAttachment, isScreenShot, isResponseFile, etc.... or to an equivalent enumeration.

Well, what is wrong with that? What do you think of an enumerated property file.role, with values "scanTarget", "standardStream", "attachment", "screenShot", and "responseFile"? (Yes, maybe it's an array if "screenShot" is a kind of "attachment" ;-)).

Contributor

lgolding commented Apr 14, 2018

@michaelcfanning I agree with your concern. I was trying to be a purist; as I said in my proposal, I didn't think a file had any business knowing what role it played, as my proposal #1 would have had it. Even if that were ok, adding a Boolean property like isScanTarget would open the door to other properties like isStandardStream, isAttachment, isScreenShot, isResponseFile, etc.... or to an equivalent enumeration.

Well, what is wrong with that? What do you think of an enumerated property file.role, with values "scanTarget", "standardStream", "attachment", "screenShot", and "responseFile"? (Yes, maybe it's an array if "screenShot" is a kind of "attachment" ;-)).

@michaelcfanning

This comment has been minimized.

Show comment
Hide comment
@michaelcfanning

michaelcfanning Apr 14, 2018

Contributor

@lgolding If we have this, I think I prefer an enum. What's your opinion?

you know, I have hoping we could have an attachment simply include a mechanism for highlighting regions but maybe we need to something distinct for images with bounding rects. Hm. Sorry to mix topic discussion on this thread.

Contributor

michaelcfanning commented Apr 14, 2018

@lgolding If we have this, I think I prefer an enum. What's your opinion?

you know, I have hoping we could have an attachment simply include a mechanism for highlighting regions but maybe we need to something distinct for images with bounding rects. Hm. Sorry to mix topic discussion on this thread.

@lgolding

This comment has been minimized.

Show comment
Hide comment
@lgolding

lgolding Apr 14, 2018

Contributor

@michaelcfanning Yes, I prefer the enumeration. It's more compact and easier to extend.

I'm not aware of a scenario for highlighting regions in attachments. But if there is a scenario, here's an idea: introduce property run.annotations of type annotation[]. Now you can associate a message with an arbitrary set of regions in any file mentioned in the run.files dictionary.

You'd still need special treatment for screenShot's rectangles. And I think screenShot should have an array of rectangles, each with a message!

Let's move this discussion over to the screenShot issue.

Contributor

lgolding commented Apr 14, 2018

@michaelcfanning Yes, I prefer the enumeration. It's more compact and easier to extend.

I'm not aware of a scenario for highlighting regions in attachments. But if there is a scenario, here's an idea: introduce property run.annotations of type annotation[]. Now you can associate a message with an arbitrary set of regions in any file mentioned in the run.files dictionary.

You'd still need special treatment for screenShot's rectangles. And I think screenShot should have an array of rectangles, each with a message!

Let's move this discussion over to the screenShot issue.

@michaelcfanning

This comment has been minimized.

Show comment
Hide comment
@michaelcfanning

michaelcfanning Apr 14, 2018

Contributor

@lgolding ship the enum. we'll follow up on attachments/annotations/bounding rects for images elsewhere.

Contributor

michaelcfanning commented Apr 14, 2018

@lgolding ship the enum. we'll follow up on attachments/annotations/bounding rects for images elsewhere.

@lgolding

This comment has been minimized.

Show comment
Hide comment
@lgolding

lgolding Apr 14, 2018

Contributor

@michaelcfanning Is file.role an array (file.roles)?

Contributor

lgolding commented Apr 14, 2018

@michaelcfanning Is file.role an array (file.roles)?

@michaelcfanning

This comment has been minimized.

Show comment
Hide comment
@michaelcfanning

michaelcfanning Apr 14, 2018

Contributor
  1. we need a resultFile role (file wasn't an explicit scan target but we found an issue there). btw - this file is where the relevant scan data is, might be worth thinking about renaming this idea
  2. only need for multiple roles is when a scan target also is used to show data for the result. otherwise we can't go capture the set of files literally associated with a result (in that they may need to be retrieved in order to diagnose some output).

not a big fan of an array for this single scenario but not sure we can avoid it.

Contributor

michaelcfanning commented Apr 14, 2018

  1. we need a resultFile role (file wasn't an explicit scan target but we found an issue there). btw - this file is where the relevant scan data is, might be worth thinking about renaming this idea
  2. only need for multiple roles is when a scan target also is used to show data for the result. otherwise we can't go capture the set of files literally associated with a result (in that they may need to be retrieved in order to diagnose some output).

not a big fan of an array for this single scenario but not sure we can avoid it.

@lgolding

This comment has been minimized.

Show comment
Hide comment
@lgolding

lgolding Apr 14, 2018

Contributor

Re (1), don't we also need a "traced through" role? You scanned foo.c, traced into foo.h, and from there into bar.h, and found an issue. foo.c = scanTarget, bar.h = resultFile (or whatever), foo.h = ???.

Re (2), I'm not getting it. Are you referring to a file that might be the scanTarget for one result, but was only "traced through" in the course of detecting another result?

Contributor

lgolding commented Apr 14, 2018

Re (1), don't we also need a "traced through" role? You scanned foo.c, traced into foo.h, and from there into bar.h, and found an issue. foo.c = scanTarget, bar.h = resultFile (or whatever), foo.h = ???.

Re (2), I'm not getting it. Are you referring to a file that might be the scanTarget for one result, but was only "traced through" in the course of detecting another result?

@lgolding

This comment has been minimized.

Show comment
Hide comment
@lgolding

lgolding Apr 19, 2018

Contributor

Closed by 1f7f748.

Contributor

lgolding commented Apr 19, 2018

Closed by 1f7f748.

@lgolding lgolding closed this Apr 19, 2018

@lgolding lgolding self-assigned this Apr 19, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment