
real entropy #4

Closed

oprogramador opened this issue Feb 4, 2020 · 10 comments

Comments

@oprogramador

IMO Shannon entropy isn't a good measurement because a given string repeated 100 times has the same entropy as when it appears only once.
Of course, repeating the same sequence doesn't add much information, but it does add some.

IMO:

  • abcd -> log_2 (4) which gives 2
  • abcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcd (abcd repeated 100 times) -> log_2 (4 + log_2 (100)) = 3.41

https://www.shannonentropy.netmark.pl/calculate
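
For illustration, here is a minimal per-character Shannon entropy sketch (plain JavaScript, not the plugin's actual code) showing that the repeated string scores exactly the same:

```js
// Minimal per-character Shannon entropy: H = -sum(p_i * log_2(p_i))
function shannonEntropy(str) {
  const counts = {};
  for (const ch of str) counts[ch] = (counts[ch] || 0) + 1;
  return Object.values(counts).reduce((sum, count) => {
    const p = count / str.length;
    return sum - p * Math.log2(p);
  }, 0);
}

console.log(shannonEntropy('abcd'));             // 2
console.log(shannonEntropy('abcd'.repeat(100))); // 2 as well, despite 100x more data
```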

@oprogramador
Author

oprogramador commented Feb 4, 2020

or another example - according to Shannon, the entropy of a is 0 and the entropy of aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa is 0 as well

@oprogramador
Author

Or:
abcdefabcdef -> 2.58496
abcdefcbafed -> 2.58496

@oprogramador
Author

Or:
01 -> 1
0100001011110101000000010000100100001100110101001100011101110100110101011110110111110001110111110100 -> 1

@oprogramador
Author

01 -> 1
00001 -> 0.72193
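(For reference, these values follow directly from the character frequencies: 01 gives -(1/2) log_2 (1/2) - (1/2) log_2 (1/2) = 1, while 00001 gives -(4/5) log_2 (4/5) - (1/5) log_2 (1/5) ≈ 0.72193.)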

@nickdeis
Owner

Hey @oprogramador,
Thank you for the compelling issue. I'm currently researching this. I have added this plugin to a few of the larger projects I work on. I think the current problem is that the false positives tend to be actual words.
This isn't an issue until you have large inline strings containing things like paragraphs (e.g. auto-generated docs).
I'm currently trying to think of a good solution to this. Let me know what your thoughts are.
I'm going to keep brainstorming. Maybe some NLP?
Cheers,
Nick

@oprogramador
Author

oprogramador commented Feb 27, 2020

@nickdeis

That's my solution: https://github.com/oprogramador/eslint-plugin-no-credentials/blob/master/src/calculateStrongEntropy.js

It multiplies the Shannon entropy plus 1 by the zipped data length minus 20 (because the zipped length is always at least 20).
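
A rough sketch of that approach (the exact code is in the linked file; gzip is assumed here as the compressor, since its output is around 20 bytes even for empty input):

```js
const zlib = require('zlib');

// Same per-character Shannon entropy as above.
function shannonEntropy(str) {
  const counts = {};
  for (const ch of str) counts[ch] = (counts[ch] || 0) + 1;
  return Object.values(counts).reduce((sum, count) => {
    const p = count / str.length;
    return sum - p * Math.log2(p);
  }, 0);
}

// "Strong" entropy: weight the Shannon entropy by the compressed length,
// so repetition still contributes (a little) extra information.
function strongEntropy(str) {
  const zippedLength = zlib.gzipSync(Buffer.from(str)).length;
  return (shannonEntropy(str) + 1) * (zippedLength - 20);
}

console.log(strongEntropy('abcd'));             // small
console.log(strongEntropy('abcd'.repeat(100))); // larger, because the gzipped length grows
```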

@oprogramador
Author

@nickdeis
Owner

nickdeis commented Mar 1, 2020

Super interesting. Wouldn't entropy and compression rates be collinear? I suppose this ends up being a weighted measure of entropy and string length. Was any reference material used to come up with this?

@nickdeis
Owner

Closing as over a year old

@oprogramador
Author

@nickdeis

I invented my own approach in my library to get a relatively good measure of information quantity.
