EM with a low duplicate rate #1629

aymonwuolanne · 2023-09-29T06:59:16Z

aymonwuolanne
Sep 29, 2023

I thought I'd ask the brains trust for some ideas in this situation.

Sometimes when running the EM algorithm on a dedupe_only run to estimate parameters it converges to values that are clearly incorrect, usually with m values that are too low (almost equal to the u values). It depends a lot on the blocking rules used, but it tends to be one of two problems:

The blocking rule is too loose and the proportion of true links in the block is so low that EM can't separate out a cluster of likely matches.
The blocking rule uses up too many of the linking variables, e.g. block_on(["first_name", "surname", "date_of_birth"]), and the remaining linking variables aren't strong enough to inform the model, e.g. we might be left only with address and sex variables.

I've experimented with a few things, including varying the starting values for m's and u's, and many combinations of blocking rules. If anyone has run into this problem for deduplicating large (20M+ rows) datasets I'd be interested to hear your approach.

RobinL · 2023-09-29T08:09:24Z

RobinL
Sep 29, 2023
Maintainer

We've also observed this in some situations, and haven't come to any definitive conclusions.

On (1):

I have an untested hypothesis that, where large numbers of comparisons are generated, this issue could be due to the sum of small probabilities during the EM algorithm's process.

This could be particularly relevant for situations where poor data quality means that there are many comparisons which are neither certain matches nor certain non-matches. This could result in lots of match probabilities with values like 0.999 and 0.001 (as opposed to say 0.9999999 or 0.0000001, which are 'certain enough' not to affect EM estimates meaningfully).

Take the example of first name, and say there are three comparison levels, (1) exact match, (2) fuzzy match and (3) all other.
Essentially, the model may 'see' quite a large number of matches among (3) because there may be many millions of such comparisons with a small (say 1e-5) probability of a match. Each of those is considered to be 1e-5 th of a match, so quite a lot of matches are assigned to the 'all other' category.

I've never had time to test this, but I've always wondered about it:
(1) Is it even correct
(2) If it is correct, could it be fixed by 'overriding' match_probability with additional logic rule like: if prob < 0.001 then prob = 0

On (2):

Not so sure about this - it's an interesting observation though!

Separately to the above, @samnlindsay I know you've had issues with m training - any insights re Aymon's questions? here's an internal chat we had for reference

0 replies

samnlindsay · 2023-09-29T11:42:05Z

samnlindsay
Sep 29, 2023
Maintainer

I'm afraid I don't have any ideas, and it's often been a bit of trial and error to get to a "sensible" model.

One idea I would like to investigate is to train a model blocking on all combinations of variables (4 variables = 12 combinations) and use the parameter_estimate_comparisons_chart to see which blocking combinations agree and, more importantly, to shed light on which ones disagree and why.

0 replies

aymonwuolanne · 2023-10-05T05:59:02Z

aymonwuolanne
Oct 5, 2023
Author

Thanks Robin and Sam for your responses.

I think the overwhelming number of comparisons with a low but non-zero probability could probably cause some problems. I've experimented with your idea of rounding down low probabilities in the last few days, it helps a little for some blocking rules but it's still quite hit and miss.

It seems like an extensive trial and error approach with a lot of blocking rules is a decent solution, the parameter_estimate_comparisons_chart is useful there.

1 reply

RobinL Oct 5, 2023
Maintainer

Thanks very much - that's very interesting, particularly the feedback on trying the rounding down approach.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

EM with a low duplicate rate #1629

{{title}}

Replies: 3 comments 1 reply

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

EM with a low duplicate rate #1629

aymonwuolanne Sep 29, 2023

Replies: 3 comments · 1 reply

RobinL Sep 29, 2023 Maintainer

samnlindsay Sep 29, 2023 Maintainer

aymonwuolanne Oct 5, 2023 Author

RobinL Oct 5, 2023 Maintainer

aymonwuolanne
Sep 29, 2023

Replies: 3 comments 1 reply

RobinL
Sep 29, 2023
Maintainer

samnlindsay
Sep 29, 2023
Maintainer

aymonwuolanne
Oct 5, 2023
Author

RobinL Oct 5, 2023
Maintainer