Warning in emlinkMARmov in case of a poor match #20

Closed
rbagd opened this issue Dec 5, 2017 · 3 comments

Comments

@rbagd

rbagd commented Dec 5, 2017

I've been looking quite extensively into the fastLink algorithm, notably its edge cases. In doing so, I think I've encountered a bug in emlinkMARmov in the case of a poor match. If you unzip the attached file, you can run

library(fastLink)
load("fastlink.Rdata")
emlinkMARmov(patterns, nobs.a, nobs.b)

If you try the above snippet, you will in most cases get a warning that p.old is of a different length than p.new. It doesn't always happen, probably because of randomization. The reason is that after a few iterations p.m = 0 and consequently num.prod = 0. Because of the resulting division by zero, p.gamma.k.m becomes NaN, and sorting a vector of NaN values returns an empty vector; hence the difference in length between p.old and p.new. Since p.old and p.new are of different lengths, it doesn't really make sense to subtract one from the other, as the shorter one is recycled.
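
The mechanism described above can be reproduced in isolation with a few lines of base R. This is only an illustrative sketch: the variable names mirror those mentioned in this issue, but the values are made up, the way p.new is assembled is an assumption, and none of the code is taken from emlinkMARmov itself.

```r
# Hypothetical values; NOT taken from the fastLink source.
p.m <- 0                                  # match probability has collapsed to zero
num.prod <- p.m * c(0.2, 0.5, 0.3)        # numerator is therefore all zeros
p.gamma.k.m <- num.prod / sum(num.prod)   # 0 / 0 yields NaN for every element

print(p.gamma.k.m)

# sort() drops NA and NaN values by default (na.last = NA),
# so sorting an all-NaN vector returns an empty vector.
sorted <- sort(p.gamma.k.m)
print(length(sorted))

# Assumed layout: the parameter vector is p.m followed by the sorted
# probabilities, so the new vector ends up shorter than the old one.
p.old <- c(0.01, 0.2, 0.5, 0.3)
p.new <- c(p.m, sorted)
print(length(p.old) != length(p.new))
```

Passing na.last = TRUE to sort() would keep the NaN entries and preserve the length, but the values would still be NaN, so that alone would not make the convergence check meaningful.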

My thinking is that p.m = 0 essentially means there is no match, so it shouldn't matter if we stop the EM algorithm and break out of the loop, i.e. if (p.m == 0) break. Do you have other ideas on how to handle it?
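
A minimal sketch of such a guard is shown here. The helper should_stop_em and the toy loop are hypothetical and stand in for the real EM iterations; the update step is not fastLink's, it merely shrinks p.m until it underflows to zero.

```r
# Hypothetical helper, NOT part of fastLink: TRUE when the match
# probability has degenerated and further iterations are pointless.
should_stop_em <- function(p.m) {
  is.na(p.m) || p.m == 0
}

# Toy loop standing in for the real EM iterations. p.m is divided by
# 1e10 at each step to mimic a probability that underflows to zero.
p.m <- 1e-300
iter <- 0
max.iter <- 100
while (iter < max.iter) {
  iter <- iter + 1
  p.m <- p.m / 1e10
  if (should_stop_em(p.m)) {
    message("p.m is numerically zero after ", iter, " iterations; stopping EM early")
    break
  }
}
```

The check uses an exact comparison with zero, as proposed in the issue, rather than a tolerance: p.m can legitimately be very small, so a threshold would risk stopping on valid input.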

By the way, fastLink is very nice work. Thanks a lot for it.

fastlink.zip

@tedenamorado
Collaborator

Hi,

Thanks a lot for your feedback! We really appreciate it.

As you rightly point out, the problem in your example is that there is not a single suitable match for the one observation in one of your datasets. This leads to numerical underflow. We are constantly working on more robust ways to adjust for numerical underflow. In the meantime, we will add a warning message to describe the problem you encountered.

By the way, p.m is the overall probability of finding a match among all the possible pairs in the cross product of the two datasets. p.m will be small most of the time, but that does not mean there is no match in the data. The patterns and each pattern's probability of being a match (zeta.j) are far more informative. Note that zeta.j includes p.m in its calculation, so if p.m is equal to zero, then that is a good indication that there is little overlap in terms of matches between the datasets.
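
To illustrate how p.m can enter a per-pattern match probability, the sketch below applies Bayes' rule in the usual Fellegi-Sunter form. This is an assumption about the general shape of the calculation, not a copy of fastLink's code; the function posterior_match and the likelihood values are hypothetical.

```r
# Hypothetical helper, NOT fastLink's implementation: posterior
# probability that a pair with a given agreement pattern is a match.
#   p.m   : overall probability that a random pair is a match
#   lik.m : probability of each pattern among matches
#   lik.u : probability of each pattern among non-matches
posterior_match <- function(p.m, lik.m, lik.u) {
  num <- p.m * lik.m
  num / (num + (1 - p.m) * lik.u)
}

# Made-up likelihoods for three agreement patterns.
lik.m <- c(0.70, 0.25, 0.05)
lik.u <- c(0.001, 0.099, 0.900)

# A small but positive p.m can still give a clearly non-zero posterior
# for a pattern that is much more likely among matches.
zeta.small <- posterior_match(1e-4, lik.m, lik.u)
print(zeta.small)

# With p.m exactly zero the numerator vanishes, so every posterior is zero.
zeta.zero <- posterior_match(0, lik.m, lik.u)
print(zeta.zero)
```

Under this form, a small p.m only scales the posteriors down, whereas p.m = 0 forces all of them to zero regardless of how well a pattern agrees.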

Finally, to give you a better recommendation on how to proceed, we would love to know a little bit more about the type of merge you are doing. It does not have to be specific but a brief description will suffice.

We are really glad that you find fastLink of some help!

Ted

@rbagd
Author

rbagd commented Dec 12, 2017

Thanks for the input @tedenamorado. In my use case, dfA is a kind of blacklist and I am looking for records in dfB that match one or more columns within that blacklist (I do the matching separately for different sets of columns). A false match is very undesirable in such a case, so I'd rather ignore anything numerically suspicious just to be sure. I did notice that zeta.j is typically zero for such edge cases, so no match ever happens anyway (hopefully), and the warning does not really seem to be a problem for my use case.

@tedenamorado
Collaborator

Again, thanks a lot for your feedback. Believe me, every day we are trying to think of better ways to detect and handle false positives =)

We have pushed an updated version of emlinkMARmov that includes a more informative warning for people facing knife-edge cases like the one you brought to our attention. In addition, we will release version 0.3 on CRAN really soon. The new release, besides incorporating your suggestion, will include new functions that allow the comparison of numeric variables and the construction of confusion tables.

If you find any further issues when using fastLink, please do not hesitate to let us know. We deeply care about what people using fastLink have to say.

Happy holidays!
