I am trying to link 2 datasets about public entities (pretty small, the biggest one has ~50k records). I'm a data/software engineer, not a data scientist, so hopefully my question isn't too poorly worded. The (simplified) data format looks like:
I use "common sense" blocking rules: state must be equal, and so must category (both fields are assumed to be correct, which experience has confirmed so far). I've created a custom comparison rule for postcodes (the default ones didn't work too well for me).

After playing around for a bit, I noticed that the weights associated with the various columns should ideally differ from one category to another. I suppose I could split the datasets and run the categories separately, but I'm hoping there's a nicer solution. Is there a way to do something like this? I tried using custom comparison rules as suggested in #501 but that caused the
Replies: 1 comment 1 reply
If you are sure you'll be happy with a blocking rule that requires category be equal, then I think you can split this into two linking challenges. First link the data filtered to have category equal to `catA`, and then completely separately link the data filtered to have category equal to `catB`. This will let you obtain different m and u values for the different categories.

I tried this out in a colab notebook using simulated data from `pseudopeople`, but I also blocked on age to make it go faster: https://colab.research.google.com/drive/10VUyyH-dkUdNZRBSsDnrpTO1kmapTIGk?usp=sharing
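To make the "split into separate linking challenges" idea concrete, here is a minimal sketch in plain Python. It is not the code from the notebook and does not use any Splink API. The field names (`id`, `category`, `postcode`) and the helpers `split_by_category`, `link_per_category` and `toy_link_fn` are hypothetical, and `toy_link_fn` is only a placeholder for whatever model you would actually train per category.

```python
from collections import defaultdict


def split_by_category(records, key="category"):
    """Group a list of record dicts by the value of `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return dict(groups)


def link_per_category(left, right, link_fn, key="category"):
    """Run `link_fn` independently on each category present in both datasets.

    Because each category gets its own call, a real `link_fn` could train its
    own parameters (e.g. its own m and u values) on that subset alone.
    """
    left_groups = split_by_category(left, key)
    right_groups = split_by_category(right, key)
    shared = sorted(left_groups.keys() & right_groups.keys())
    return {cat: link_fn(left_groups[cat], right_groups[cat]) for cat in shared}


def toy_link_fn(left, right):
    """Placeholder linker (assumption): exact match on postcode.

    In practice this would train and apply a separate model per category.
    """
    return [
        (l["id"], r["id"])
        for l in left
        for r in right
        if l["postcode"] == r["postcode"]
    ]


# Made-up example records, for illustration only.
left = [
    {"id": 1, "category": "catA", "postcode": "1000"},
    {"id": 2, "category": "catB", "postcode": "2000"},
]
right = [
    {"id": 10, "category": "catA", "postcode": "1000"},
    {"id": 20, "category": "catB", "postcode": "2000"},
    {"id": 30, "category": "catB", "postcode": "1000"},
]

results = link_per_category(left, right, toy_link_fn)
print(results)
```

Records 1 and 30 share a postcode but sit in different categories, so they are never compared. That is the same effect as blocking on category, except that each category's model is fitted separately.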