Discussion: Normal compressor for NCD #21

Closed
opened this issue Mar 3, 2019 · 7 comments

Collaborator

orsinium commented Mar 3, 2019 • edited

 Please have a look at the original NCD paper. A normal compressor has to have these properties:

- idempotency: `C(xx) == C(x)`
- monotonicity: `C(xy) >= C(x)`
- symmetry: `C(xy) == C(yx)`
- distributivity: `C(xy) + C(z) <= C(xz) + C(yz)`

No real compressor has all of these properties. What's more, a compressor can't be reversible and at the same time idempotent or symmetric. I guess only one "compressor" satisfies all of these conditions: `set(text)`.

**Question 1:** is that true? If the answer is "yes", let's accept it and, in the further discussion, drop at least idempotency so that there is a difference between "a" and "aaaaaa".

I've implemented an algorithm that has monotonicity, symmetry, and distributivity. It counts all elements in a sequence and takes the square root of each count; the sum of these values is the size of the "compressed" data.

```python
textdistance.nc_ncd._compress('test')
# {'t': 1.4142135623730951, 'e': 1.0, 's': 1.0}
```

**Question 2:** is this a correct implementation of a normal compressor (except for idempotency)?

**Question 3:** do you have any ideas on how to implement other normal compressors? I'll be happy to hear any ideas about it.
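For reference, the counting idea above can be sketched as a self-contained, length-only "compressor" plus a plain NCD wrapper. The names `sqrt_compress` and `ncd` are mine, not textdistance's actual API; this is just a minimal illustration of the square-root-of-counts size and why it is symmetric and monotone:

```python
from collections import Counter
from math import sqrt

def sqrt_compress(text):
    # "Compressed size" = sum of square roots of symbol counts.
    # sqrt is concave, so repeats cost less than new symbols,
    # but C(xx) > C(x): idempotency is intentionally violated.
    return sum(sqrt(n) for n in Counter(text).values())

def ncd(x, y, C=sqrt_compress):
    # Normalized Compression Distance with a length-only "compressor".
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Symmetry holds because Counter(x + y) ignores order;
# monotonicity holds because counts only grow when text is appended.
```

Note that `ncd('abc', 'abc')` is not 0 here, exactly because idempotency is dropped.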

orsinium added help wanted compression_based labels Mar 3, 2019

Collaborator Author

orsinium commented Mar 3, 2019

 @rudi-cilibrasi, could you have a look at these points, please? Nobody in the world would explain it better than you :)
Collaborator Author

orsinium commented Mar 3, 2019 • edited

 @svetlana21, you were interested in the NCD. Do you have any thoughts about it?

rudi-cilibrasi commented Mar 3, 2019

 Greetings, my friends. Indeed, I had the same confusion some years ago. I found out by reading the book "An Introduction to Kolmogorov Complexity and Its Applications" by my advisor Paul Vitanyi and Ming Li that in that specific branch of math it is normal to write (in)equalities while omitting certain additive logarithmic factors bounded by, e.g., the log of the input file/string size. See for example https://books.google.com/books?id=PgR1DmReI30C&pg=PA32&lpg=PA32&dq=normal-compressor+logarithmic-additive&source=bl&ots=0umOxx4-76&sig=ACfU3U0QUHf-0VhmzwCmmNga1-4XcsH0fA&hl=en&sa=X&ved=2ahUKEwj_zM-9qObgAhUJE3wKHSN2Dg4Q6AEwAHoECAoQAQ#v=onepage&q=normal-compressor%20logarithmic-additive&f=false where it is written explicitly. Thank you for the question!

rudi-cilibrasi commented Mar 3, 2019

 About the detailed questions: this has not been published yet, but I am happy to share my thoughts. Personally, I believe the NCD idea is most useful when you drop the requirement that compressors actually encode data. Real encoding is a good way to increase the reliability of conclusions, but it is not as necessary as simply computing a length with a compressor that could in theory encode but does not. This lets you keep log-probability bit counts in floating-point format and get more accurate results. Any compressor used in the published research is byte-oriented, which means lengths are rounded to the nearest group of 8 bits; that corresponds to a lot of lost precision or accuracy UNLESS you somehow connect to the compressor's internal math model to find out how many bits are used within the last byte. Then there is the question of fractional bits.

To me it is better to realize that data compressors are simply families of very flexible and popular biased statistical distributions that learn empirically from streams of data, so they can be combined and compared. Even in the original NCD paper, "Clustering by Compression", we do preprocessing on MIDI files or the radio astronomy data that is not reversible, yet we are still able to get useful results. So the actual encoding/decoding is not as important to the NCD calculation as, for example, whether bits are rounded to bytes or kept as integers or fractions. In this sense it is just a way of combining statistical models.

So in the context of non-encoding but length-generating compressors, I think yours is potentially useful. But the size calculation seems a bit odd, and I am not sure it is very sensitive; at first glance it may not really detect useful symmetries. A more natural simple compressor you can try: calculate the unigram statistics using counts, then use Claude Shannon's `-p log p` formula summed over all entries to get a distance. This is about the same as comparing KL-divergence for empirically sampled unigram distributions. Good luck!
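One way to read the `-p log p` suggestion is as an idealized "compressed size": unigram Shannon entropy in bits per symbol, times the string length. This sketch is my own interpretation (the name `entropy_length` is hypothetical, not part of textdistance), and unlike a real byte-oriented compressor it never rounds to whole bytes:

```python
from collections import Counter
from math import log2

def entropy_length(text):
    # Idealized compressed size in bits: length * unigram Shannon entropy.
    # Sum over symbols of -p * log2(p), scaled by the string length.
    n = len(text)
    return -n * sum((c / n) * log2(c / n) for c in Counter(text).values())
```

For example, `entropy_length('ab')` is 2.0 bits (two equiprobable symbols, one bit each), while a one-symbol string compresses to 0 bits.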

svetlana21 commented Mar 4, 2019

> @svetlana21, you were interested in the NCD. Do you have any thoughts about it?

Hello. Sorry, but I don't; I don't have enough knowledge of the field.
Collaborator Author

orsinium commented Mar 4, 2019

 Thank you for the great explanation!

> So the actual encoding/decoding is not as important to the NCD calculation

Nice to see I was on the right track :)

> calculate the unigram statistics using counts, then use Claude Shannon's `-p log p` formula summed over all entries to get a distance

Entropy works perfectly here, thank you! However, there is one issue: the entropy of "aaaa" is equal to 0. So, if we try to get entropy-based NCD for "aaaa" and "bbbb", we get division by zero. I've solved it by adding 1 to the entropy, but this is, obviously, not a scientist's way :)
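The zero-entropy problem and the add-1 workaround can be illustrated with a small self-contained sketch. The names `entropy_length` and `entropy_ncd` and the `smoothing` parameter are mine for illustration, not textdistance's real implementation:

```python
from collections import Counter
from math import log2

def entropy_length(text):
    # Idealized compressed size in bits: length * unigram Shannon entropy.
    n = len(text)
    return -n * sum((c / n) * log2(c / n) for c in Counter(text).values())

def entropy_ncd(x, y, smoothing=1.0):
    # entropy_length('aaaa') == 0, so the bare NCD denominator
    # max(C(x), C(y)) can be zero. Adding a constant (the ad-hoc
    # "+1" fix mentioned above) keeps the denominator positive.
    cx = entropy_length(x) + smoothing
    cy = entropy_length(y) + smoothing
    cxy = entropy_length(x + y) + smoothing
    return (cxy - min(cx, cy)) / max(cx, cy)
```

With `smoothing=0` the pair `("aaaa", "bbbb")` would divide by zero; with the default it yields a finite (if not normalized) value, which is exactly the trade-off being discussed.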
referenced this issue Mar 9, 2019
Merged