Fix pre-commit ignoring untracked files #352

Merged
merged 11 commits into from
May 31, 2022

jwilbur-godaddy
Contributor

To help us get this pull request reviewed and merged quickly, please be sure to include the following items:

  • Tests (if applicable)
  • Documentation (if applicable)
  • Changelog entry
  • A full explanation here in the PR description of the work done

PR Type

What kind of change does this PR introduce?

  • Bugfix
  • Feature
  • Code style update (formatting, local variables)
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • CI related changes
  • Documentation content changes
  • Tests
  • Other

Backward Compatibility

Is this change backward compatible with the most recently released version? Does it introduce changes which might change the user experience in any way? Does it alter the API in any way?

  • Yes (backward compatible)
  • No (breaking changes)

Issue Linking

closes #331

What's new?

  • Fix non-scanning of new files. It just took a few extra git flags to show untracked files in the diff against HEAD.

@rbailey-godaddy
Contributor

This commit is admirably simple but hurts my brain to think about it. A few questions/requests/comments...

This is perhaps a bit expansive for a unit test, but can you validate the following scenario?

date | shasum > file-a.txt
date | shasum > file-b.txt
git add file-a.txt
tartufo pre-commit

We expect that a finding will be generated for file-a.txt, but not for file-b.txt. I worry that the change as proposed at this point will flag both files.
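
A minimal sketch of this scenario, assuming `git` is on the PATH. It drives the git CLI through `subprocess` instead of tartufo's scanner (whose exact call is not shown in this thread), and it writes fixed file contents rather than `date | shasum` output. The helper names `git`, `staged_paths`, and `demo` are hypothetical. It only illustrates which paths a diff of the index against HEAD reports:

```python
import pathlib
import subprocess
import tempfile


def git(repo_dir, *args):
    """Run a git command inside repo_dir and return its stdout as text."""
    completed = subprocess.run(
        [
            "git",
            "-c", "user.name=demo",
            "-c", "user.email=demo@example.com",
            "-c", "commit.gpgsign=false",
            *args,
        ],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return completed.stdout


def staged_paths(repo_dir):
    """Return the paths whose staged (index) content differs from HEAD."""
    return git(repo_dir, "diff", "--cached", "--name-only", "HEAD").splitlines()


def demo():
    """Recreate the file-a/file-b scenario in a throwaway repository."""
    with tempfile.TemporaryDirectory() as tmp:
        git(tmp, "init", "--quiet")
        # An empty first commit gives the diff a HEAD to compare against.
        git(tmp, "commit", "--quiet", "--allow-empty", "-m", "initial")
        pathlib.Path(tmp, "file-a.txt").write_text("contents of a\n")
        pathlib.Path(tmp, "file-b.txt").write_text("contents of b\n")
        git(tmp, "add", "file-a.txt")  # file-b.txt stays unstaged
        return staged_paths(tmp)


if __name__ == "__main__":
    print(demo())
```

Only `file-a.txt` is reported, because `file-b.txt` never entered the index.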

@jwilbur-godaddy
Contributor Author

> This commit is admirably simple but hurts my brain to think about it. A few questions/requests/comments...
>
> This is perhaps a bit expansive for a unit test, but can you validate the following scenario?
>
> date | shasum > file-a.txt
> date | shasum > file-b.txt
> git add file-a.txt
> tartufo pre-commit
>
> We expect that a finding will be generated for file-a.txt, but not for file-b.txt. I worry that the change as proposed at this point will flag both files.

Hey Scott,

I will add that comment you requested. That's a good idea. As for the rest, I have a test case that I have shared elsewhere, which I will paste below. For clarification, this does not scan all files in the repo. Because of the recent addition of cached=True on the same exact line that I changed, this will only apply to untracked files that were added to the index, meaning that this will not result in tartufo scanning unstaged changes. I think the problem is that a file still counts as "untracked" even after it has been added to the index. It only becomes tracked once it has actually been committed.

Test Case

  1. Clone tartufo
  2. Run openssl genrsa > private_key.pem to place an RSA private key right in the tartufo repository. (If you do this in PowerShell, the file will be UTF-16-LE encoded, so tartufo will NOT catch it; I am going to report this as a bug, too. You can re-encode it in VS Code.)
  3. After ~line 571 (for chunk in self.chunks) in scanner.py, add print(chunk.file_path) so you can see what files are being scanned.
  4. On line ~909 of scanner.py, remove the flags.
  5. Run poetry run python .\tartufo_main_.py pre-commit. Observe that it DOES NOT scan private_key.pem. This is because there are no flags telling it to include untracked files and their contents.
  6. On line ~909 of scanner.py, reintroduce the flags.
  7. Run the command from step 5 again. Observe that it DOES NOT scan private_key.pem. This is because private_key.pem is not in the Git index.
  8. Stage private_key.pem.
  9. Run the command from step 5 again. Observe that it DOES scan private_key.pem because it is now staged.
    "does that cause us to screen non-staged untracked files as well"
    No, it does not, because of the cached=True flag that was introduced recently on that same line.
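
The encoding caveat in step 2 can be illustrated with a small sketch. It assumes a scanner that searches raw bytes for an ASCII marker; that is an illustration only, not a description of how tartufo reads files. `PEM_HEADER` and `contains_ascii_marker` are hypothetical names, and the header text is just an example of a PEM header line:

```python
PEM_HEADER = "-----BEGIN RSA PRIVATE KEY-----"


def contains_ascii_marker(raw, marker=PEM_HEADER):
    """Naive byte-level search, standing in for a scanner that reads raw bytes."""
    return marker.encode("ascii") in raw


utf8_bytes = PEM_HEADER.encode("utf-8")
utf16_bytes = PEM_HEADER.encode("utf-16-le")

# UTF-8 keeps ASCII characters as single bytes, so the marker is found.
print(contains_ascii_marker(utf8_bytes))   # True

# UTF-16-LE places a zero byte after every ASCII character, so the
# contiguous ASCII byte sequence no longer appears in the data.
print(contains_ascii_marker(utf16_bytes))  # False

# Re-encoding to UTF-8 (what step 2 does in VS Code) restores the match.
reencoded = utf16_bytes.decode("utf-16-le").encode("utf-8")
print(contains_ascii_marker(reencoded))    # True
```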

@rbailey-godaddy
Contributor

The lightbulb goes on...

> I think the problem is that a file still counts as "untracked" even after it has been added to the index. It only becomes tracked once it has actually been committed.

This was totally non-obvious to me. Can you extend your comment to include this bit by way of explanation regarding why the chosen flags are present? Otherwise, I'm happy.

Contributor

@rbailey-godaddy rbailey-godaddy left a comment

But I'm expecting to see that those URLs will trigger entropy findings that will need to be added to pyproject.toml lol...

@rbailey-godaddy
Contributor

@jwilbur-godaddy can you cherry-pick 78b27ee from the tartufo repo into your branch? This should fix the issues I noted earlier in my review.

* Reformat with black; fixes failed CI test
* Add signature; silences alert on long URLs in comments
@pmevzek-godaddy

> This commit is admirably simple but hurts my brain to think about it.

I fully agree with this, and personally I would not look favorably on this getting merged (even if it seems to solve a problem, I still think there is a code smell here).

At the very least, there is a terminology problem. A new file being staged/cached IS NOT UNTRACKED.
This comes straight from git-status manual:

Short Format
In the short-format, the status of each path is shown as one of these forms

       XY PATH
       XY ORIG_PATH -> PATH

...

       X          Y     Meaning
       -------------------------------------------------
                [AMD]   not updated
       M        [ MTD]  updated in index
       T        [ MTD]  type changed in index
       A        [ MTD]  added to index
       D                deleted from index
       R        [ MTD]  renamed in index
       C        [ MTD]  copied in index
       [MTARC]          index and work tree matches
       [ MTARC]    M    work tree changed since index
       [ MTARC]    T    type changed in work tree since index
       [ MTARC]    D    deleted in work tree
                   R    renamed in work tree
                   C    copied in work tree
       -------------------------------------------------
       D           D    unmerged, both deleted
       A           U    unmerged, added by us
       U           D    unmerged, deleted by them
       U           A    unmerged, added by them
       D           U    unmerged, deleted by us
       A           A    unmerged, both added
       U           U    unmerged, both modified
       -------------------------------------------------
       ?           ?    untracked
       !           !    ignored
       -------------------------------------------------

Now let us try with a new file being committed:

$ git init test
$ cd test
$ git status --porcelain
$

# empty, as expected

$ > a_new_file
$ git status --porcelain
?? a_new_file

# per manual above, THIS is an untracked file for now

# let us stage it now

$ git add a_new_file
$ git status --porcelain
A  a_new_file

# per documentation above, the file is "ADDED"; it is NOT untracked anymore.

Maybe the confusion is coming from git and/or libgit2 itself (maybe the flag has to be read as "untracked OR newly added" and not just "untracked"), but at the very least I don't think it is right to say that a new added file (to cache/stage) is untracked. It is not untracked anymore once staged.
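
The distinction drawn from the manual above can be written as a small classifier. This is a sketch under the assumption that input lines are in the `git status --porcelain` short format shown in the table (two status characters, a space, then the path). `classify` is a hypothetical helper and only names the states relevant to this discussion:

```python
from typing import Tuple


def classify(status_line: str) -> Tuple[str, str]:
    """Return (path, state) for one `git status --porcelain` line."""
    x, y = status_line[0], status_line[1]  # X = index column, Y = work tree column
    path = status_line[3:]
    if x == "?" and y == "?":
        state = "untracked"
    elif x == "!" and y == "!":
        state = "ignored"
    elif x == "A" and y in " MTD":
        # "A" followed by "A" or "U" is an unmerged state, so Y is checked too.
        state = "added to index"
    else:
        state = "other"
    return path, state


print(classify("?? a_new_file"))  # ('a_new_file', 'untracked')
print(classify("A  a_new_file"))  # ('a_new_file', 'added to index')
```

By this reading, the same file moves from "untracked" to "added to index" as soon as `git add` runs, before any commit exists.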

@jwilbur-godaddy
Contributor Author

Okay, so do you just want me to change the comment? What are you asking for?

@hong-godaddy
Contributor

Can we get additional reviews on this? Would love to get this merged and have a release

@jgowdy
Contributor

jgowdy commented May 18, 2022

Do we have a unit test that, run against the existing main branch, fails to detect something in the index that is untracked, and that passes with these new flags? And do we have a unit test demonstrating that, with the flags, untracked files not in the index are NOT scanned? I see the steps to reproduce above, but we should codify those in tests to validate the change we are making.

@jgowdy
Contributor

jgowdy commented May 18, 2022

This also needs a changelog entry per the PR template

@jgowdy
Contributor

jgowdy commented May 23, 2022

Can we test the negative case also? i.e. That this change doesn't cause untracked files not staged to the index to be scanned?

@tarkatronic
Contributor

> Can we test the negative case also? i.e. That this change doesn't cause untracked files not staged to the index to be scanned?

I've got to agree with Jeremiah on this one. I really like this solution, and its elegance, but I definitely think we want that other test just to be extra careful and cautious. Looking at the test you already wrote, it looks like it should even be a decently simple one to add!

@jwilbur-godaddy
Contributor Author

@jgowdy @tarkatronic I don't understand how this section of code does not deliver what you are asking for:

https://github.com/jwilbur-godaddy/tartufo/blob/bcc2716fe10757808fa48e097bb1ab911a898b8c/tests/test_scan_local_repo.py#L51-L55

@tarkatronic
Contributor

Ah ha. I had missed that part. Can you split that out to a separate unit test so that the two conditions are tested (and can potentially fail) on their own?

@tarkatronic
Contributor

One other thing, it looks like black and pylint are failing in the CI pipeline -- could you take a look and see if you can get those passing locally? Then we can get this merged and released!

repo.index.add("tests/data/config/" + file_name)
repo.index.write() # This actually writes the index to disk. Without it, the tracked file is not actually staged.
result = runner.invoke(cli.main, ["--entropy-sensitivity", "1", "pre-commit"])
self.assertNotEqual(result.exit_code, 0)
Contributor

I finally realized what struck me as odd here. An exit code of 0 would indicate a successful scan with no issues found. So what I'm not sure about here is, how does this verify that the file was scanned? Since you're writing a sha256 digest to the file, that should get flagged for entropy, though I notice you modify the entropy sensitivity. Shouldn't we be letting it fail, and checking for an exit code greater than 0? Or even better maybe capture json output from the command and ensure that the filename in question is in the output?

Contributor Author

@jwilbur-godaddy jwilbur-godaddy May 31, 2022

Did you miss the assertNotEqual? I am checking that the return code is not zero.

Contributor

Oops haha yes I did, sorry. ☕ has clearly not kicked in yet!

Contributor

@tarkatronic tarkatronic left a comment

Looks great to me! Thanks for all the work and putting up with us through the process @jwilbur-godaddy!

@jgowdy can you take another look and make sure your concerns are covered?

Contributor

@jgowdy jgowdy left a comment

Black still doesn't seem happy, but my concerns about tests are satisfied. Can we fix the black linting and then we're g2g

@jwilbur-godaddy
Contributor Author

Done! Thank you, all!

@jwilbur-godaddy
Contributor Author

Just a heads up, I am starting paternity leave tomorrow, so I will not be able to follow up on this after today. I also do not have write access to this repo, so I will not be able to merge this in.

@sushantmimani sushantmimani merged commit 20e7b51 into godaddy:main May 31, 2022

Successfully merging this pull request may close these issues.

Pre-commit hook: High entropy strings not detected in new files by 3.0.0