Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: Normal compressor for NCD #21

Closed
orsinium opened this issue Mar 3, 2019 · 7 comments

Comments

@orsinium
Copy link
Collaborator

commented Mar 3, 2019

Please, have a look at the original NCD paper. The normal compressor has to have some properties:

  1. idempotency: C(xx) == C(x)
  2. monotonicity: C(xy) >= C(x)
  3. symmetry: C(xy) == C(yx)
  4. distributivity: C(xy) + C(z) <= C(xz) + C(yz)

No of any real compressors have all these properties. More of this, compressor can't be revertable and be idempotent or symmetric at the same time.

I guess, only one compressor has all of these conditions: set(text). Question 1: is it true? If the answer is "yes", let's make it and in further discussion ignore at least idempotency to have a difference between "a" and "aaaaaa".

I've made realization of some algorithm that has monotonicity, symmetry, and distributivity. It just counts all elements in a sequence and gets the square root of all counts. Sum of the values are size of "compressed" data.

textdistance.nc_ncd._compress('test')
# {'t': 1.4142135623730951, 'e': 1.0, 's': 1.0}

Question 2: is it correct implementation of normal compressor (except idempotency)?

Question 3: do you have any ideas on how to implement other normal compressors?

I'll be happy to hear any ideas about it.

@orsinium

This comment has been minimized.

Copy link
Collaborator Author

commented Mar 3, 2019

@rudi-cilibrasi, could you have a look at these points, please? Nobody in the world would explain it better than you :)

@orsinium

This comment has been minimized.

Copy link
Collaborator Author

commented Mar 3, 2019

@svetlana21, you was interested in the NCD. Do you have any thoughts about it?

@rudi-cilibrasi

This comment has been minimized.

Copy link

commented Mar 3, 2019

Greetings, my friends. Indeed I had the same confusion some years ago. I found out by reading my advisor Paul Vitanyi and Ming Li's book called "Introduction to Kolmogorov Complexity" that it is normal to write (in)equalities in that specific branch of math while omitting certain additive logarithmic factors that are bounded by e.g. log of input file/string size. See for example https://books.google.com/books?id=PgR1DmReI30C&pg=PA32&lpg=PA32&dq=normal-compressor+logarithmic-additive&source=bl&ots=0umOxx4-76&sig=ACfU3U0QUHf-0VhmzwCmmNga1-4XcsH0fA&hl=en&sa=X&ved=2ahUKEwj_zM-9qObgAhUJE3wKHSN2Dg4Q6AEwAHoECAoQAQ#v=onepage&q=normal-compressor%20logarithmic-additive&f=false where it is written explicitly. Thank you for the question!

@rudi-cilibrasi

This comment has been minimized.

Copy link

commented Mar 3, 2019

About the detailed questions, it has not been published yet but I am happy to share my thoughts. Personally, I believe the NCD idea is most useful when you throw away the requirement that compressors actually encode data; it's a good way to increase reliability of conclusions but not necessary as much as just computing the length in a compressor that could in theory but does not encode. This enables you to keep log probability bit counts in floating point format and get more accurate results. For example, any compressor used as in the published research is byte oriented so that means it is rounded to the nearest group of 8 bits which corresponds to a lot of lost precision or accuracy UNLESS you somehow connect specifically to a compressor's internal math model to try to find out how many bits are used within the last byte. Then there is the question of fractional bits. To me it is better to realize data compressors are simply families of very flexible and popular biased statistical distributions that learn empirically from streams of data so can be combined and compared. Sometimes even in the original NCD paper "Clustering By Compression" we do preprocessing on midi files or the radio astronomy data that is not reversible but we still are able to get useful results. So the actual encoding / decoding is not as important to the NCD calculation as for example if bits are rounded to bytes or bits are kept in integer or fraction. In this sense it is just a way of combining statistical models. So in the context of non-encoding but length-generating compressors, I think yours is potentially useful. But I guess the size calculation seems a bit odd and I am not sure it is very sensitive. It seems like it may not really detect useful symmetries on first glance. A more natural simple compressor you can try is e.g. calculate the unigram statistics using counts then use Claude Shannon's -p log p formula summed over all entries to get a distance. This is about the same as comparing KL-divergence or KL-distance for empirically sampled unigram distributions. Good luck!

@svetlana21

This comment has been minimized.

Copy link

commented Mar 4, 2019

@svetlana21, you was interested in the NCD. Do you have any thoughts about it?

Hello. Sorry, but I don't. Don't have enough knowledge of the field

@orsinium

This comment has been minimized.

Copy link
Collaborator Author

commented Mar 4, 2019

Thank you for a great explanation!

So the actual encoding/decoding is not as important to the NCD calculation

Nice to see I was on the right way :)

calculate the unigram statistics using counts then use Claude Shannon's -p log p formula summed over all entries to get a distance.

Entropy works perfectly here! Thank you. However, there is one issue: the entropy of "aaaa" equal to 0. So, if we try to get Entropy-based NCD for "aaaa" and "bbbb" we get division by zero. I've solved it by adding 1 to the Entropy, but this is, obviously, not a scientist's way :)

@orsinium orsinium referenced this issue Mar 9, 2019
@orsinium

This comment has been minimized.

Copy link
Collaborator Author

commented Mar 9, 2019

I've made blog post about Entropy and NCD.

@orsinium orsinium closed this in #20 Mar 9, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
3 participants
You can’t perform that action at this time.