Looking for a way to feed threshold cutoffs to individual variables #66

Open
ajw5296 opened this issue Oct 28, 2022 · 5 comments

@ajw5296

ajw5296 commented Oct 28, 2022

Is there a way to set different cutoff values for certain variables? For instance, if the DOB variable for a potential match isn't above .9, the pair wouldn't be considered a match, while all other variables have a cutoff of .8.

@tedenamorado
Collaborator

tedenamorado commented Oct 28, 2022

@ajw5296 if you are using the fastLink wrapper function, it is not possible (those cutpoints are global).

If anything, let us know.

All my best,

Ted

@tedenamorado
Collaborator

@ajw5296 can you provide an example of what you have in mind here? Is your question about cutoffs in how we compare variables, or about the weight each variable receives when predicting the probability that two records are the same?

Looking forward to hearing from you!

Ted

@ajw5296
Author

ajw5296 commented Oct 31, 2022

Hey @tedenamorado, my question is more about cutoffs, and whether they can be set at a variable level. More precisely:

  1. Are individual matching probabilities calculated within the fastLink method? So a match might be something like .98 (fname), .98 (lname), .83 (dob), and then these are combined with their weights into the final overall posterior.

  2. Can we set threshold cutoffs for those individual variables, in the method or through other methods? So despite fname and lname having a high probability, we would eliminate the potential match since the dob is below .9 (the respective cutoff).

I suppose this is kind of a question about weights in a way, but I think setting a higher weight for dob is methodologically different from setting a cutoff for dob. But if setting parameters for weights is easier, I'm interested in looking into it.

And just as a note, we looked into the stringSubset method, but since DOBs are shared values, it didn't really help us much.

Let me know if I can provide more info, thanks for your help!

@aalexandersson

I do not think it is possible in fastLink other than maybe to create ad hoc linkage variables and then work directly with the corresponding gammas. A similar open issue is #49.
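
A rough sketch of what "working directly with the gammas" could look like, written in plain Python rather than fastLink's R API. After the probabilistic step has produced candidate pairs with a posterior, a deterministic rule drops any pair whose agreement level on a chosen field is too low. The `Pair` record, the `apply_veto` helper, the field names, and the 0/1/2 coding (disagree / partial / agree) are all assumptions made for illustration.

```python
from dataclasses import dataclass


# Assumed coding of agreement levels: 0 = disagree, 1 = partial, 2 = agree.
@dataclass
class Pair:
    id_a: int
    id_b: int
    gammas: dict      # field name -> agreement level
    posterior: float  # match probability from the probabilistic model


def apply_veto(pairs, global_cutoff, min_gamma):
    """Keep pairs that pass the global posterior cutoff AND every per-field floor."""
    kept = []
    for p in pairs:
        if p.posterior < global_cutoff:
            continue
        # A field missing from p.gammas is treated as failing the rule (an assumption).
        if all(p.gammas.get(f, -1) >= g for f, g in min_gamma.items()):
            kept.append(p)
    return kept


# Made-up candidate pairs, not output from a real linkage run.
pairs = [
    Pair(1, 10, {"fname": 2, "lname": 2, "dob": 2}, 0.99),
    Pair(2, 11, {"fname": 2, "lname": 2, "dob": 0}, 0.91),
    Pair(3, 12, {"fname": 1, "lname": 0, "dob": 2}, 0.40),
]
kept = apply_veto(pairs, global_cutoff=0.85, min_gamma={"dob": 2})
print([(p.id_a, p.id_b) for p in kept])
```

Here the second pair clears the global cutoff but is still dropped because its DOB agreement level is below the required floor, which is the behavior asked about above. Note that this is a post-processing filter layered on top of the model, not something the model itself learns.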

The Python-based splink has a similar open issue, moj-analytical-services/splink#434. The proprietary Match*Pro has a "Classification Tab" with a user-friendly GUI for creating similar deterministic criteria.

For what it is worth, to me this seems of little use compared with other promised features under development such as probabilistic blocking and active learning.

@tedenamorado
Collaborator

Hi @ajw5296,

As @aalexandersson mentions, it is not possible to set deterministic rules based on the probability of observing a specific agreement value for field k given that a pair of records is a match. The model learns these probabilities from the data.

Our focus is on the probability that a pair of records is a match given the agreement pattern and the parameters of the model, which is a composite measure of the field-specific probabilities of observing an agreement value given that a pair of records is a match.
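
To make the "composite measure" concrete, here is a minimal sketch of how field-specific probabilities combine into one posterior, assuming fields are conditionally independent given match status (the usual Fellegi-Sunter setup). It is plain Python, not fastLink code: `match_posterior` is a hypothetical helper, and the `m`, `u`, and `lam` values are made up for illustration rather than estimated from data.

```python
import math


def match_posterior(gamma, m, u, lam):
    """P(match | agreement pattern), assuming fields are conditionally
    independent given match status.

    gamma: field -> observed agreement level
    m[field][level]: P(level | match); u[field][level]: P(level | non-match)
    lam: prior probability that a randomly chosen pair is a match
    """
    # Sum logs instead of multiplying raw probabilities for numerical stability.
    log_m = sum(math.log(m[f][g]) for f, g in gamma.items())
    log_u = sum(math.log(u[f][g]) for f, g in gamma.items())
    num = lam * math.exp(log_m)
    return num / (num + (1.0 - lam) * math.exp(log_u))


# Made-up parameters; level 0 = disagree, 1 = agree.
m = {"fname": [0.05, 0.95], "lname": [0.05, 0.95], "dob": [0.10, 0.90]}
u = {"fname": [0.99, 0.01], "lname": [0.99, 0.01], "dob": [0.999, 0.001]}
lam = 0.001

all_agree = match_posterior({"fname": 1, "lname": 1, "dob": 1}, m, u, lam)
dob_differs = match_posterior({"fname": 1, "lname": 1, "dob": 0}, m, u, lam)
print(round(all_agree, 4), round(dob_differs, 4))
```

The per-field quantities are probabilities of an agreement level given match status, not per-field match probabilities, so there is no field-level ".83 for dob" to threshold. With these made-up numbers, a DOB disagreement still pulls the overall posterior well below the all-agree case.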

However, an alternative would be to pass your own set of parameters to fastLink. For example, we discuss how to pass parameters from a random sample of observations to a larger dataset here.

Please, if you feel we can be of further assistance, let us know.

All my best,

Ted
