Question: How to represent results from deep in revision control history? #564

Open
bradlarsen opened this issue Mar 3, 2023 · 6 comments

@bradlarsen

I work on Nosey Parker, a detector of hardcoded secrets in textual data and specifically Git history. Big idea: git repos in, hardcoded secrets found in history out.

Rudimentary SARIF output support was recently added. It seems to work for results that come from scanning plain files found on the filesystem. But I'm having trouble figuring out how to usefully represent results from blobs found in Git repositories (the majority of results from Nosey Parker).

When Nosey Parker encounters a Git repository, it scans the entire history, not specific commits. So including SARIF versionControlProvenance on the run object doesn't really make sense.

Additionally, when Nosey Parker scans a Git repository, it does so by enumerating all blobs (i.e., content-addressed bytestrings) rather than crawling the Git commit history. Any results reported from these blobs don't (currently) have commit or path information associated with them; all they have is the blob ID (a 40-character hex string) and the location within that blob. The blob referenced in a result is also unlikely to exist as a simple file on the filesystem.

What I'm having trouble with is figuring out how to represent the location of one of these results from a Git blob. Each one has a physical location of sorts, made up of:

  1. The filesystem path to the Git repository
  2. A blob found within that repository
  3. The line and column information of the finding range

The difficulty is with (2): there's not a way to spell that as a URI!

The combination of (1) and (2) could perhaps be represented with nested artifacts, kind of like a tar file, but I don't know how sensible this will be.

What's going to be the best way to represent these findings from Git history in SARIF? Thank you.

@KalleOlaviNiemitalo

Re nested artifacts, you could treat as an archive not the whole Git repository but rather a pack file or a loose object file. So you'd have one artifactLocation for the repository directory, a second one for the pack file, and a third one for the object within the pack. The URI of the third one would be just the object ID.

But what software is going to understand these artifact locations? … I suppose you cannot have any kind of automated remediation anyway.
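A minimal Python sketch of the nested-artifact layout described above, assuming SARIF 2.1.0's `artifacts` array with `parentIndex` links. The rule ID, the pack file name, the blob ID, and the region values are placeholders for illustration, and whether any consumer accepts a bare object ID as the `uri` of a nested artifact is untested here.

```python
import json


def build_run(repo_uri, pack_uri, blob_id, region):
    """Build a SARIF 2.1.0 run whose artifacts nest as repo > pack file > blob."""
    artifacts = [
        {"location": {"uri": repo_uri}},                    # index 0: repository directory
        {"location": {"uri": pack_uri}, "parentIndex": 0},  # index 1: pack file inside the repo
        {"location": {"uri": blob_id}, "parentIndex": 1},   # index 2: object inside the pack
    ]
    result = {
        "ruleId": "hardcoded-secret",  # hypothetical rule ID
        "message": {"text": "Possible hardcoded secret"},
        "locations": [{
            "physicalLocation": {
                # Point at the innermost artifact (the blob) by index.
                "artifactLocation": {"uri": blob_id, "index": 2},
                "region": region,
            }
        }],
    }
    return {
        "tool": {"driver": {"name": "Nosey Parker"}},
        "artifacts": artifacts,
        "results": [result],
    }


run = build_run(
    repo_uri="file:///path/to/the/clone/",
    pack_uri="objects/pack/pack-EXAMPLE.pack",           # placeholder pack file name
    blob_id="0123456789abcdef0123456789abcdef01234567",  # placeholder blob ID
    region={"startLine": 12, "startColumn": 5, "endLine": 12, "endColumn": 45},
)
sarif_log = {"version": "2.1.0", "runs": [run]}
print(json.dumps(sarif_log, indent=2))
```

The result refers to the blob through `index`, so a consumer that understands nesting can walk `parentIndex` back up to the repository; one that does not will see only the object ID.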

@bradlarsen
Author

> Re nested artifacts, you could treat as an archive not the whole Git repository but rather a pack file or a loose object file. So you'd have one artifactLocation for the repository directory, a second one for the pack file, and a third one for the object within the pack. The URI of the third one would be just the object ID.

That's an idea. Figuring out which packfile a blob came from would require some replumbing in Nosey Parker.

> But what software is going to understand these artifact locations?

Indeed! I'm trying to find a way to represent the current result information—essentially a (repo url, blob, source range)—in SARIF format in a way that existing tools would do a reasonable job rendering. The primary SARIF consumer I'm interested in (at least initially) is GitHub.

> I suppose you cannot have any kind of automated remediation anyway.

Correct. When a real secret is leaked, whether it shows up in Git history, in a public S3 bucket, etc., there is no code change that will remediate it. In the immediate term, you should invalidate the credential. Longer term, you should determine how it was leaked in the first place (e.g., maybe the developers have no safe mechanism for getting secrets where they need to be) and make changes to prevent it happening again.

@michaelcfanning
Contributor

SARIF v2.2 is coming and this important issue will be discussed at tomorrow's TC meeting.

It seems clear that SARIF 2.1 hasn't thoroughly covered all source code provenance scenarios. Besides the problems of scanning the complete history of a Git repository, the standard can't accommodate version control systems that are versioned on the file path itself (i.e., \myDirectory\myFile#1, \myDirectory\myFile#2, \myDirectory\myFile#3).

The most compact, non-breaking way to handle this would be to allow for a version control details object to be associated with code in the artifacts table. It is already possible in SARIF to render results for non-source-controlled artifacts with identical disk location but two different versions (this was designed to allow for results fired against temporary build-generated files which might be overwritten throughout the build). It is not much of a stretch to allow for artifact entries to come with an optional version control details instance. The absence of this data would prompt a consumer to look for a general VCD entry on the current run object, as usual.

You already have the file blob sha, is that correct? If so, can that blob sha be used to construct a link to the hosted file contents?

Conceptually, we could consider blessing the Git file blob sha as another value that can be persisted in an artifact hashes array. This will only be helpful, though, if we can use that data at consumption time to do something useful (like browse the file contents in the browser).
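As a rough Python illustration of that idea: a Git blob ID is the SHA-1 of a `blob <size>\0` header followed by the content, so it differs from a plain SHA-1 of the file bytes and would need its own key. The key name `git-blob-sha1` and the `repositoryUri` entry in the property bag below are assumptions made for this sketch, not anything defined by SARIF 2.1.0.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the ID Git assigns to a blob with this content (SHA-1 object format)."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def artifact_for_blob(content: bytes, repo_uri: str) -> dict:
    """Build an artifact entry carrying the blob ID in its hashes (hypothetical key)."""
    blob_id = git_blob_id(content)
    return {
        "location": {"uri": blob_id},
        "hashes": {"git-blob-sha1": blob_id},       # hypothetical algorithm name
        "properties": {"repositoryUri": repo_uri},  # hypothetical per-artifact provenance
    }


artifact = artifact_for_blob(b"hello\n", "file:///path/to/the/clone/")
print(artifact["hashes"]["git-blob-sha1"])  # ce013625030ba8dba906f756967f9e9ca394464a
```

The property bag stands in for the per-artifact version control details discussed above, since SARIF 2.1.0 only defines versionControlProvenance on the run object.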

@bradlarsen
Author

Interesting; thanks for the update and response!

> You already have the file blob sha, is that correct?

Yes, that's correct. In Nosey Parker, when scanning for patterns, Git-style blob IDs are what we have first and foremost; getting things like pathname and commit ID requires extra work.

> If so, can that blob sha be used to construct a link to the hosted file contents?

Nosey Parker's findings have two bits of reliable metadata: the path to the local Git repository clone, and the blob ID (a 40-character hexadecimal shasum). As far as I know, there is not any canonical way of specifying that as a URL to a generic Git repository, e.g., there is not a URL scheme like file:///path/to/the/clone/blobs/BLOB_ID that Git natively understands.

However, it is possible to do useful things given the path to the Git repository clone and a blob ID:

  • You can do (cd path/to/the/clone && git show BLOB_ID) to get the blob contents
  • If you inspect path/to/the/clone and discover that it has an upstream repository hosted at GitHub, GitLab, etc., you can construct a permalink that those systems understand to show the blob in a browser
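
Both steps above can be sketched in Python, under a few assumptions: the helper names are made up for illustration, the blob is read with `git cat-file blob` (a plumbing equivalent of the `git show` call above), the upstream remote is assumed to be named `origin`, and only GitHub-style remote URLs are recognized. The URL built here is GitHub's REST endpoint for Git blobs, which returns JSON rather than a browser view of the file.

```python
import re
import subprocess


def read_blob(clone_path: str, blob_id: str) -> bytes:
    """Return the raw contents of a blob from a local clone."""
    proc = subprocess.run(
        ["git", "-C", clone_path, "cat-file", "blob", blob_id],
        check=True, capture_output=True,
    )
    return proc.stdout


def origin_url(clone_path: str) -> str:
    """Return the URL of the remote named 'origin' (assumed to be the upstream)."""
    proc = subprocess.run(
        ["git", "-C", clone_path, "remote", "get-url", "origin"],
        check=True, capture_output=True, text=True,
    )
    return proc.stdout.strip()


# Matches HTTPS and SSH forms of a GitHub remote, with or without a ".git" suffix.
_GITHUB_REMOTE = re.compile(
    r"^(?:https://github\.com/|git@github\.com:|ssh://git@github\.com/)"
    r"(?P<owner>[^/]+)/(?P<repo>[^/]+?)(?:\.git)?/?$"
)


def github_blob_api_url(remote_url: str, blob_id: str):
    """Build the GitHub REST API URL for a blob, or None for non-GitHub remotes."""
    m = _GITHUB_REMOTE.match(remote_url)
    if m is None:
        return None
    return f"https://api.github.com/repos/{m['owner']}/{m['repo']}/git/blobs/{blob_id}"
```

A caller would combine them as `github_blob_api_url(origin_url(clone), blob_id)` and fall back to `read_blob(clone, blob_id)` when no hosted upstream is found.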

@bradlarsen
Author

> Conceptually, we could consider blessing the Git file blob sha as another value that can be persisted in an artifact hashes array. This will only be helpful, though, if we can use that data at consumption time to do something useful (like browse the file contents in the browser).

In the case of this hypothetical new type of artifact hash, to do anything useful, in addition to the Git blob ID, you would also need the path to the repository clone that it was found within.

As I mentioned above, I don't believe there is a way to construct a special generic git URL to reference said blob, but you can use the two pieces of information to at least retrieve the entire blob contents.

@bradlarsen
Author

@michaelcfanning What was the takeaway from the TC meeting in July with respect to git blob provenance?
