EM with a low duplicate rate #1629
Replies: 3 comments 1 reply
-
We've also observed this in some situations, and haven't come to any definitive conclusions.

On (1): I have an untested hypothesis that, where large numbers of comparisons are generated, this issue could be due to the sum of small probabilities during the EM algorithm's process. This could be particularly relevant for situations where poor data quality means that there are many comparisons which are neither certain matches nor certain non-matches. This could result in lots of match probabilities with values like 0.999 and 0.001 (as opposed to, say, 0.9999999 or 0.0000001, which are 'certain enough' not to affect the EM estimates meaningfully). Take the example of first name, and say there are three comparison levels: (1) exact match, (2) fuzzy match and (3) all other. I've never had time to test this, but I've always wondered about it.

On (2): Not so sure about this - it's an interesting observation though!

Separately to the above, @samnlindsay I know you've had issues with m training - any insights re Aymon's questions? Here's an internal chat we had for reference.
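To make the 'sum of small probabilities' hypothesis concrete, here is a toy sketch of the Fellegi-Sunter style M-step for the m probabilities (plain Python, not Splink's actual implementation). The level names, counts and probabilities are invented for illustration; the point is only that a large pool of borderline pairs, each carrying a small but non-zero match probability, can drag the m estimates down towards the u values.

```python
# Toy M-step: m for a comparison level is the share of total "match mass"
# (the sum of pairwise match probabilities) that falls in that level:
#     m_level = sum(P(match) for pairs in level) / sum(P(match) for all pairs)

def m_estimates(pairs):
    """pairs: list of (level, p_match) tuples. Returns {level: m estimate}."""
    total_match_mass = sum(p for _, p in pairs)
    levels = {level for level, _ in pairs}
    return {
        level: sum(p for lvl, p in pairs if lvl == level) / total_match_mass
        for level in sorted(levels)
    }

# 1,000 confident matches, almost all agreeing exactly on first name
confident = [("exact", 0.9999)] * 950 + [("fuzzy", 0.9999)] * 50

for n_borderline in (0, 10_000, 1_000_000):
    # Borderline comparisons from poor-quality data: neither certain matches
    # nor certain non-matches, sitting in the "all other" level with p ~ 0.001
    borderline = [("other", 0.001)] * n_borderline
    print(f"{n_borderline:>9} borderline pairs -> {m_estimates(confident + borderline)}")
```

With no borderline pairs the exact-match m comes out around 0.95; with a million of them it collapses to roughly 0.47, even though none of them is individually a plausible match.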
-
I'm afraid I don't have any ideas, and it's often been a bit of trial and error to get to a "sensible" model. One idea I would like to investigate is to train a model blocking on all combinations of variables (4 variables = 12 combinations) and use the …
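Since the comment above is cut off in the source, here is only a rough sketch of what "train on all combinations of variables" could look like. It assumes an already-configured Splink linker (settings plus input data), a hypothetical list of column names, and pairwise combinations only; none of these details come from Sam's actual approach.

```python
from itertools import combinations

def train_on_all_pairwise_blocks(linker, variables):
    """Run one EM training session per pairwise blocking rule.

    `linker` is assumed to be an already-configured Splink linker and
    `variables` a list of column names, e.g.
    ["first_name", "surname", "date_of_birth", "postcode"].
    """
    sessions = []
    for cols in combinations(variables, 2):
        # SQL-style blocking rule, e.g.
        # "l.first_name = r.first_name and l.surname = r.surname"
        blocking_rule = " and ".join(f"l.{c} = r.{c}" for c in cols)
        session = linker.estimate_parameters_using_expectation_maximisation(blocking_rule)
        sessions.append((cols, session))
    return sessions
```

Comparing the per-session m estimates (as far as I know, Splink combines estimates from the different sessions) is one way to spot blocking rules that push m down towards u.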
-
Thanks Robin and Sam for your responses. I think the overwhelming number of comparisons with a low but non-zero probability could probably cause some problems. I've experimented with your idea of rounding down low probabilities in the last few days; it helps a little for some blocking rules, but it's still quite hit and miss. It seems like an extensive trial-and-error approach with a lot of blocking rules is a decent solution, the …
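For anyone following along, "rounding down low probabilities" can be read as clipping small match probabilities to zero before they feed into the m estimates. Below is a minimal sketch in the same toy setting as the earlier snippet; the 0.01 threshold is an arbitrary illustrative choice, not a value from this thread.

```python
def clipped_m_estimates(pairs, threshold=0.01):
    """Same toy M-step as before, but match probabilities below `threshold`
    are rounded down to zero, so borderline pairs stop contributing.
    Assumes at least some pairs remain above the threshold."""
    clipped = [(level, p if p >= threshold else 0.0) for level, p in pairs]
    total_match_mass = sum(p for _, p in clipped)
    levels = {level for level, _ in clipped}
    return {
        level: sum(p for lvl, p in clipped if lvl == level) / total_match_mass
        for level in sorted(levels)
    }
```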
-
I thought I'd ask the brains trust for some ideas in this situation.

Sometimes when running the EM algorithm on a `dedupe_only` run to estimate parameters, it converges to values that are clearly incorrect, usually with m values that are too low (almost equal to the u values). It depends a lot on the blocking rules used, but it tends to be one of two problems:

1. The blocking rules are loose, so a very large number of comparisons is generated and the duplicate rate among them is very low.
2. The blocking rules are tight, e.g. `block_on(["first_name", "surname", "date_of_birth"])`, and the remaining linking variables aren't strong enough to inform the model, e.g. we might be left only with address and sex variables.

I've experimented with a few things, including varying the starting values for the m's and u's, and many combinations of blocking rules. If anyone has run into this problem when deduplicating large (20M+ rows) datasets, I'd be interested to hear your approach.
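On "varying the starting values for m's and u's": one way to do this, sketched below under the assumption of a Splink 3-style settings dictionary, is to give comparison levels explicit m_probability / u_probability values, which EM then (as far as I understand) uses as its starting point. The column names, levels and numbers are illustrative assumptions, not values from this thread.

```python
# Illustrative settings fragment: explicit starting m/u values on one comparison.
settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name and l.surname = r.surname",
    ],
    "comparisons": [
        {
            "output_column_name": "date_of_birth",
            "comparison_levels": [
                {
                    "sql_condition": "date_of_birth_l IS NULL OR date_of_birth_r IS NULL",
                    "label_for_charts": "Null",
                    "is_null_level": True,
                },
                {
                    "sql_condition": "date_of_birth_l = date_of_birth_r",
                    "label_for_charts": "Exact match",
                    "m_probability": 0.9,   # assumed starting value, not from the thread
                    "u_probability": 0.01,  # assumed starting value, not from the thread
                },
                {
                    "sql_condition": "ELSE",
                    "label_for_charts": "All other",
                    "m_probability": 0.1,
                    "u_probability": 0.99,
                },
            ],
        },
    ],
}
```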