Warning in emlinkMARmov in case of a poor match #20

rbagd · 2017-12-05T19:54:45Z

I've been looking quite extensively into the fastLink algorithm, notably its edge cases. In doing so, I think I've found a bug in emlinkMARmov in the case of a poor match. If you unzip the attached file, you can run

library(fastLink)
load("fastlink.Rdata")
emlinkMARmov(patterns, nobs.a, nobs.b)

If you run the snippet above, in most cases you will get a warning that p.old and p.new have different lengths. It does not happen every time, probably because of randomization. The reason is that after a few iterations p.m = 0 and consequently num.prod = 0. Division by zero then makes p.gamma.k.m NaN, and sorting NaN values yields an empty vector, hence the difference in length between p.old and p.new. Since the two vectors have different lengths, subtracting one from the other does not really make sense, because the shorter one gets recycled.
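The chain of events described above can be reproduced in isolation. This is a minimal sketch with made-up toy values, not fastLink's actual code: the variable names mirror the ones mentioned in this issue, but the way p.old and p.new are assembled here is an assumption made purely for illustration.

```r
# Toy values (assumed for illustration): the match probability has collapsed to zero.
p.m <- 0
num.prod <- c(0.2, 0.5, 0.3) * p.m        # every term becomes 0
p.gamma.k.m <- num.prod / sum(num.prod)   # 0 / 0 yields NaN in each position

# sort() drops NA and NaN by default (na.last = NA), so nothing is left.
sorted <- sort(p.gamma.k.m)

# Hypothetical parameter vectors: the previous iteration has 5 entries,
# the new one only 2 because the sorted block is empty.
p.old <- c(0.05, 0.95, 0.2, 0.5, 0.3)
p.new <- c(p.m, 1 - p.m, sorted)

# Subtracting vectors whose lengths are not multiples of each other
# recycles the shorter one and raises a warning; capture its message.
msg <- tryCatch({
  p.old - p.new
  "no warning"
}, warning = function(w) conditionMessage(w))

print(length(sorted))
print(msg)
```

Passing `na.last = TRUE` to `sort()` would keep the NaN entries and preserve the length, but the convergence check would then compare against NaN, so the underlying problem remains.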

I was thinking that p.m = 0 essentially means there is no match, so it should not matter if we stop the EM algorithm and exit the loop, i.e. if (p.m == 0) break. Do you have other ideas on how to handle this?
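To make the proposal concrete, here is a rough sketch of such a guard inside a generic iteration loop. It is not fastLink's implementation: `run_em` and `update` are hypothetical names, and `update` merely stands in for one combined E and M step that returns the new p.m.

```r
# Hypothetical skeleton (not fastLink code): iterate until convergence,
# but leave the loop as soon as p.m collapses to zero or becomes NaN.
run_em <- function(update, p.m.init, max.iter = 100, tol = 1e-5) {
  p.m <- p.m.init
  stopped.early <- FALSE
  for (iter in seq_len(max.iter)) {
    p.m.new <- update(p.m)
    if (is.nan(p.m.new) || p.m.new == 0) {
      warning("p.m collapsed to zero; stopping EM early")
      p.m <- 0
      stopped.early <- TRUE
      break
    }
    if (abs(p.m.new - p.m) < tol) {
      p.m <- p.m.new
      break
    }
    p.m <- p.m.new
  }
  list(p.m = p.m, iterations = iter, stopped.early = stopped.early)
}

# Toy update that shrinks p.m until it underflows to exactly zero.
res <- suppressWarnings(
  run_em(function(p) p * 1e-200, p.m.init = 0.1, max.iter = 10)
)
print(res)
```

The guard runs before the convergence check, so the comparison between old and new parameters is never reached with degenerate values.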

By the way, fastLink is very nice work. Thanks a lot for it.

fastlink.zip


tedenamorado · 2017-12-06T00:36:43Z

Hi,

Thanks a lot for your feedback! We really appreciate it.

As you rightly point out, in your example, the problem is that you do not have a single suitable match for the one observation in one of your datasets. The latter leads to numerical underflow. We are constantly working on more robust ways to adjust for numerical underflow. In the meantime, we will add one warning message to describe the problem you encountered.

By the way, p.m is the overall probability of finding a match among all the possible pairs in the cross product of two datasets. p.m most of the time will be small, but that does not mean there is not a match in the data. The patterns and the probability of each pattern of being a match (zeta.j) are way more informative. Note that zeta.j includes p.m in its calculation, so if p.m is equal to zero, then that is a good indication that there is little overlap in terms of matches between datasets.
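As a small illustration of this point, the sketch below assumes the usual Fellegi-Sunter form of the posterior match probability for a pattern; the exact expression used inside fastLink may differ. The function name `posterior_match` and the likelihood values are hypothetical.

```r
# Assumed posterior form: zeta_j = p.m * L_m / (p.m * L_m + (1 - p.m) * L_u),
# where L_m and L_u are the pattern likelihoods among matches and non-matches.
posterior_match <- function(p.m, lik.m, lik.u) {
  num <- p.m * lik.m
  num / (num + (1 - p.m) * lik.u)
}

# Made-up likelihoods for three agreement patterns.
lik.m <- c(0.90, 0.30, 0.01)
lik.u <- c(0.001, 0.05, 0.80)

# A small p.m can still give a high posterior for a strongly agreeing pattern.
zeta.small <- posterior_match(0.01, lik.m, lik.u)

# With p.m exactly zero, every pattern gets posterior zero.
zeta.zero <- posterior_match(0, lik.m, lik.u)

print(round(zeta.small, 4))
print(zeta.zero)
```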

Finally, to give you a better recommendation on how to proceed, we would love to know a little bit more about the type of merge you are doing. It does not have to be specific but a brief description will suffice.

We are really glad that you find fastLink of some help!

Ted

rbagd · 2017-12-12T15:43:53Z

Thanks for the input @tedenamorado. In my use case, dfA is a kind of blacklist and I am looking for records in dfB that match one or more columns within that blacklist (I do the matching separately for different sets of columns). A false match is very undesirable in such a case, so I'd rather ignore anything numerically suspicious just to be sure. I did notice that zeta.j is typically zero for such edge cases, so no match ever happens anyway (hopefully), and the warning does not seem to be a real problem for my use case.

tedenamorado · 2017-12-24T03:12:51Z

Again, thanks a lot for your feedback. Believe me, every day we are trying to think of better ways to detect and handle false positives =)

We have pushed an updated version of emlinkMARmov that includes a more informative warning for people facing knife-edge cases like the one you brought to our attention. In addition, we will release version 0.3 on CRAN really soon. The new release, besides incorporating your suggestion, will include new functions that allow the comparison of numeric variables and the computation of confusion tables.

Please, if you find any further issues when using fastLink do not hesitate to let us know. We deeply care about what people using fastLink have to say.

Happy holidays!

kosukeimai mentioned this issue Dec 6, 2017

Some improvements #1

Open

kosukeimai closed this as completed Dec 24, 2017
