
Issue #89: Base64 and hexadecimal entropy score override (#223)

Merged (2 commits) on Oct 18, 2021

Conversation

namithasind (Contributor)

To help us get this pull request reviewed and merged quickly, please be sure to include the following items:

  • Tests (if applicable)
  • Documentation (if applicable)
  • Changelog entry
  • A full explanation here in the PR description of the work done

PR Type

What kind of change does this PR introduce?

  • Bugfix
  • Feature
  • Code style update (formatting, local variables)
  • Refactoring (no functional changes, no API changes)
  • Build related changes
  • CI related changes
  • Documentation content changes
  • Tests
  • Other

Backward Compatibility

Is this change backward compatible with the most recently released version? Does it introduce changes that might alter the user experience in any way? Does it alter the API in any way?

  • Yes (backward compatible)
  • No (breaking changes)

Issue Linking

#89

What's new?

  • Added two new options, -b64/--b64-entropy-score and -hex/--hex-entropy-score, which can be used to override the hard-coded base64 and hexadecimal entropy scores (illustrated in the sketch below)
  • Modified the unit tests
  • Updated the README with the new options and their descriptions
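
For reviewers unfamiliar with these scores, here is a minimal sketch of the kind of per-character Shannon entropy check they gate, assuming illustrative character sets and the hard-coded defaults of 3.0 (hex) and 4.5 (base64) that the new options would let a user override; this is an approximation, not tartufo's actual code.

```python
import math

# Illustrative character sets; tartufo's implementation (inherited from
# trufflehog) may differ in naming and structure.
HEX_CHARS = "1234567890abcdefABCDEF"
B64_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="

def shannon_entropy(data: str, charset: str) -> float:
    """Bits of entropy per character of `data`, counting only characters in `charset`."""
    if not data:
        return 0.0
    entropy = 0.0
    for ch in set(charset):
        p = data.count(ch) / len(data)
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy

# Hard-coded thresholds (75% of the 4 and 6 bits a hex/base64 character can
# carry) that -hex/--hex-entropy-score and -b64/--b64-entropy-score override.
hex_entropy_score = 3.0
b64_entropy_score = 4.5

candidate = "4bf21c6e9a07d8235f1e0c9ab64d7e52"  # hex-looking placeholder string
if shannon_entropy(candidate, HEX_CHARS) > hex_entropy_score:
    print("high entropy: would be flagged as a possible secret")
```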

@rbailey-godaddy (Contributor) left a comment

LGTM

@rbailey-godaddy (Contributor) commented Oct 5, 2021

I'm curious -- what's the use case for this enhancement?

I've always treated the original (now default) values as more or less magic, although thinking about it I guess I can see that a hex digit 0-f has 16 unique values and represents 4 bits of data. And base64 represents 24 bits of input in 4 encoded output characters, so each character we see represents 6 bits of data. The "magic" part to me is "why do we use thresholds that are 75% of the raw bit rate?" Is this "if the data looks at least 75% random it's a problem?"

That's the lead-up to wondering why somebody would want to override these numbers and how they would make intelligent choices and what the implications of those choices would be. It also makes me think (although not strongly enough to withdraw my approval) that these are the wrong knobs for this problem space.

The potential entropy of a hex digit string or a base64 character string is just math (see above) -- and changing those properties makes about as much sense as trying to legislate the value of pi. ;-) Perhaps this would be more manageable -- if not more understandable -- to have a "sensitivity" argument with a range of 0-100 and a default of 75 (so people don't have to mess with a decimal point) and then compute hex_entropy_score = 4.0 * float(x) / 100.0 and b64_entropy_score = 6.0 * float(x) / 100.0.

This way at least there's only one thing to twiddle, the effects are consistent (at least to a computer scientist or mathematician lol), and we can frame the discussion in terms of entropy generally without getting bogged down in the actual representation, which isn't supposed to matter anyway.
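
To make that concrete, a minimal sketch of the single-knob idea, assuming a hypothetical integer sensitivity option with a 0-100 range and a default of 75; the names here are illustrative only.

```python
# Hypothetical single "sensitivity" knob mapped onto the two per-character
# thresholds, exactly as the formulas above describe.
def entropy_scores(sensitivity: int = 75) -> tuple[float, float]:
    if not 0 <= sensitivity <= 100:
        raise ValueError("sensitivity must be in the range 0-100")
    hex_entropy_score = 4.0 * float(sensitivity) / 100.0  # a hex digit carries up to 4 bits
    b64_entropy_score = 6.0 * float(sensitivity) / 100.0  # a base64 character carries up to 6 bits
    return hex_entropy_score, b64_entropy_score

print(entropy_scores())    # (3.0, 4.5) -- the current hard-coded defaults
print(entropy_scores(60))  # (2.4, 3.6) -- lower values flag more strings
```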


@rbailey-godaddy (Contributor)

> Perhaps this would be more manageable -- if not more understandable -- to have a "sensitivity" argument with a range of 0-100 and a default of 75 (so people don't have to mess with a decimal point) and then compute hex_entropy_score = 4.0 * float(x) / 100.0 and b64_entropy_score = 6.0 * float(x) / 100.0.
> This way at least there's only one thing to twiddle, the effects are consistent (at least to a computer scientist or mathematician lol), and we can frame the discussion in terms of entropy generally without getting bogged down in the actual representation, which isn't supposed to matter anyway.

It occurs to me that "sensitivity" is a bad name, since I think higher numbers are less sensitive than lower numbers, which is counterintuitive. My interpretation is that this number (0-100) is the probability that the string in question is totally random (aka encrypted, at a fuzzy level), and therefore higher numbers (above 75) make tartufo more reluctant to report a given string, while lower numbers make tartufo more likely to report that string.

BTW, one reason this is nicer than tweaking scores -- it's easier to reason about. The range is 0-100 because it's a probability, and you can't talk about probabilities greater than 1.0 (aka 100%) or less than 0.0 (aka 0%). I still have a curious urge to understand where the original 75% came from, but both trufflehog and the original blog it references seem silent on the subject.

@namithasind (Contributor, Author)

> I'm curious -- what's the use case for this enhancement? [...]

We have repositories containing secrets that needed to be moved out of GitHub before we migrated them to GHEC, and tartufo could not catch all of the secrets. For example, we had a password like password: <randomAlphanumericStringOfLength16> which tartufo could not catch. I added regex rules to catch these, but I came across a few more that regex couldn't catch. I was also pointed to the code that sets the string length to 20, and I played around with that value to see if it could be made configurable, but I was not able to catch shorter secret strings just by altering it. Then I realized that might be because shorter strings have a lower entropy value. That's when I started looking into making the entropy scores configurable and came across the issue already reported in this repo. This feature is useful when you want to catch secrets like the one I mentioned above and regex isn't able to catch them.
I agree that it isn't possible to guess, right away, which entropy values will work; you need to experiment around the defaults to see which results are most effective. The downside of lowering an entropy score is that tartufo reports more false positives, but I think this feature will enable users to do a more thorough scan.
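
A short illustration of the "shorter strings have lower entropy" point: the per-character Shannon entropy of a string of length n is capped at log2(n), so even a fully random 16-character password can never score above 4.0 bits per character, just under the 4.5 base64 default discussed above. A minimal sketch (the password below is a made-up placeholder):

```python
import math
from collections import Counter

def per_char_entropy(s: str) -> float:
    """Shannon entropy in bits per character of the observed character frequencies."""
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

# Every character in this 16-character placeholder is distinct, so it hits the
# maximum possible score for its length -- and still falls below 4.5.
print(per_char_entropy("aZ3kP9qL7mX2vB4n"))  # 4.0
print(math.log2(16))                         # 4.0
```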

@tarkatronic (Contributor) left a comment

I do really like the idea that @rbailey-godaddy presented of working with a percentage, rather than a mysterious number, and I would love to see that enhancement eventually. That said, this should be a very useful change for experimentation and learning more about effective values! Great work on this @namithasind!

Merging this pull request may close: Allow Entropy b64/hex Score Override