Adjust m and u values depending on data values #1515

laurentS · 2023-08-10T16:19:36Z

I am trying to link two datasets about public entities (both fairly small; the larger one has ~50k records). I'm a data/software engineer, not a data scientist, so I hope my question isn't too poorly worded.

The (simplified) data format looks like:

category  |  name  | city | state | postcode
...

I use "common sense" blocking rules: state must be equal, and so must category (the data is assumed to be correct for these two fields, which experience has confirmed so far).

I've created a custom comparison rule for postcodes (the default ones didn't work too well for me). After playing around for a bit, I noticed that the weights associated with the various columns should ideally differ depending on the value of category.
For catA, if postcodes match, we have a pair and name doesn't matter. For catB, name is what really matters and postcode is only a vague indication, so I'd want a much lower m/u ratio for postcode there. In short, I'd want m and u to change depending on the value of category.
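
To make the request concrete, here is a minimal sketch of how category-specific m and u values would translate into different match weights for a postcode match. It assumes the usual Fellegi-Sunter definition, where the match weight of a comparison level is log2(m/u). The m and u numbers are made up purely for illustration; they are not estimates from my data.

```python
import math


def match_weight(m: float, u: float) -> float:
    """Fellegi-Sunter match weight (log2 of the Bayes factor m/u) for one comparison level."""
    return math.log2(m / u)


# Hypothetical m/u values for the "postcodes match" level, one set per category.
# m = P(postcodes match | records are a true match)
# u = P(postcodes match | records are not a match)
postcode_match = {
    "catA": {"m": 0.95, "u": 0.001},  # postcode is close to decisive
    "catB": {"m": 0.60, "u": 0.20},   # postcode is only a vague indication
}

weights = {cat: match_weight(p["m"], p["u"]) for cat, p in postcode_match.items()}

for cat, w in weights.items():
    print(f"{cat}: postcode match weight = {w:.2f}")
```

With a single model there is only one m and one u for this level, so both categories would share the same weight.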

I suppose I could split the datasets and run categories separately, but I'm hoping there's a nicer solution. Is there a way to do something like this?

I tried using custom comparison rules as suggested in #501 but that caused the m and u values to vary in "the wrong way", where some parts of the dataset got better, but overall, it worsened my results.

Answered by aflaxman

aflaxman · 2023-08-13T16:09:01Z

If you are sure you'll be happy with a blocking rule that requires category be equal, then I think you can split this into two linking challenges. First link the data filtered to have category equal to catA and then completely separately link the data filtered to have category equal to catB. This will let you obtain different m and u values for the different categories.

I tried this out in a colab notebook using simulated data from pseudopeople, but I also blocked on age to make it go faster: https://colab.research.google.com/drive/10VUyyH-dkUdNZRBSsDnrpTO1kmapTIGk?usp=sharing
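
Outside the notebook, the splitting idea can be sketched in plain Python roughly as follows. This is only an outline under assumptions: `split_by_category`, `link_per_category`, and `toy_link` are hypothetical helper names, the `id` field is added for illustration, and `toy_link` is a stand-in (exact postcode equality) for the real step of training and running a separate model on each subset, so that each category gets its own m and u values.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Record = dict
Pair = Tuple[int, int]


def split_by_category(records: List[Record], key: str = "category") -> Dict[str, List[Record]]:
    """Group records by their category value."""
    groups: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return dict(groups)


def link_per_category(
    left: List[Record],
    right: List[Record],
    link_fn: Callable[[List[Record], List[Record]], List[Pair]],
) -> Dict[str, List[Pair]]:
    """Run link_fn independently on each category present in both datasets."""
    left_groups = split_by_category(left)
    right_groups = split_by_category(right)
    shared = sorted(left_groups.keys() & right_groups.keys())
    return {cat: link_fn(left_groups[cat], right_groups[cat]) for cat in shared}


def toy_link(left: List[Record], right: List[Record]) -> List[Pair]:
    """Placeholder linker: pairs records with identical postcodes.

    In practice this is where a separate model would be trained and used
    for the subset, giving category-specific m and u values.
    """
    return [(l["id"], r["id"]) for l in left for r in right if l["postcode"] == r["postcode"]]


# Made-up example records in the simplified format from the question.
left = [
    {"id": 1, "category": "catA", "name": "Town Hall", "postcode": "75001"},
    {"id": 2, "category": "catB", "name": "Central Library", "postcode": "69001"},
]
right = [
    {"id": 10, "category": "catA", "name": "City Hall", "postcode": "75001"},
    {"id": 11, "category": "catB", "name": "Central Library", "postcode": "69002"},
    {"id": 12, "category": "catC", "name": "Fire Station", "postcode": "13001"},
]

pairs_by_category = link_per_category(left, right, toy_link)
print(pairs_by_category)
```

Because the loop runs over whatever categories appear in both datasets, the same outline works for more than two categories.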

1 reply

laurentS Aug 18, 2023
Author

@aflaxman sorry for taking so long to get back to you. This is super interesting, thanks for taking the time to look into it (and writing the code!).

I am comfortable with relying on the categories having to be equal. There are more than two of them, but that shouldn't be much of an issue; I'll just run more loops.
