Relaxing the conditional independence assumption? #82

Open
zmbc opened this issue May 17, 2024 · 4 comments

@zmbc

zmbc commented May 17, 2024

S4 of the appendix of the fastLink paper describes two methods for relaxing the conditional independence assumption. It looks like a simulation study of these was done, but I don't immediately see any documentation on how to do this in fastLink. Does the code exist somewhere?

@tedenamorado
Collaborator

Hi,

Sure thing. To implement the second approach discussed in that Appendix, the following code should work:

## Load the package and data
library(fastLink)
data(samplematch)

matches.out <- fastLink(
  dfA = dfA, dfB = dfB, 
  varnames = c("firstname", "middlename", "lastname", "housenum", "streetname", "city", "birthyear"),
  stringdist.match = c("firstname", "middlename", "lastname", "streetname", "city"),
  partial.match = c("firstname", "lastname", "streetname"),
  cond.indep = FALSE  # FALSE relaxes conditional independence; the default is TRUE
)

Since the first approach described in the Appendix produces results similar to the second, we implemented only the second approach in fastLink, because it is a standard extension of the Fellegi-Sunter model.

Please, if anything, do not hesitate to let us know.

All my best,

Ted

@zmbc
Author

zmbc commented Jun 12, 2024

Thank you @tedenamorado!

I have a conceptual question about this that I'm wondering if you can shed some light on. It seems to me that S4.1 describes an approach that would modify both EM (parameter estimation) and prediction. It looks like the implementation of S4.2 affects only the parameter estimation. Is my understanding correct? Have you considered including interaction effects at prediction time?

@tedenamorado
Collaborator

Hi @zmbc,

Both methods impact parameter estimation and, consequently, the predictions based on the Fellegi-Sunter model. The approach in S4.1 treats interactions as a linkage field comparison by combining the information from two different linkage fields. In contrast, the approach in S4.2 allows each linkage field and interaction to contribute to parameter estimation. While S4.2 complicates the EM algorithm because it is not based on closed-form solutions (though it is standard in the literature), S4.1 simplifies the EM algorithm but does not account for all possible two-way interactions.
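
To make the S4.1 idea concrete, here is a minimal base-R sketch of folding two binary agreement indicators into a single four-level comparison. The vectors `agree_first` and `agree_dob` are made-up illustration data, and this is not how fastLink builds its agreement patterns internally; it only shows the shape of the combined variable.

```r
# Hypothetical agreement indicators for six record pairs (1 = agree, 0 = disagree)
agree_first <- c(1, 1, 0, 0, 1, 0)
agree_dob   <- c(1, 0, 1, 0, 1, 0)

# S4.1-style: treat the pair of fields as one comparison with four levels
combined <- factor(
  paste(agree_first, agree_dob, sep = "-"),
  levels = c("0-0", "0-1", "1-0", "1-1")
)

# Count of record pairs falling in each combined level
table(combined)
```

Each of the four levels would then get its own m and u probability, as with any other multi-level linkage field.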

Please, if anything, do not hesitate to let us know.

Ted

@zmbc
Author

zmbc commented Jul 8, 2024

@tedenamorado Apologies if I haven't been clear here. When I say "modify prediction" I mean the structure of the prediction model, not only the estimates of the parameters in it.

I'll give a concrete example. In S4.1 the agreement values of first name and date of birth are combined, so instead of having two agreement variables with 2 levels each, there is a single agreement variable with 4 levels (0-0, 0-1, 1-0, 1-1). If I understand correctly, this would result in m and u probabilities being independently estimated for each of the 4 levels. There is nothing forcing the match weight of 1-1 to equal the match weight of 1-0 + the match weight of 0-1. In fact it is probably quite a bit lower if the two agreements are correlated.

Then when predictions are made, match weights are applied according to the actual interaction pattern in each pair to predict. That is, a pair with 1-0 will get a fairly high match weight, as will a pair with 0-1, but a pair with 1-1 will get a match weight that is lower than the sum of those.

If I understand correctly, S4.2 does not do this. Instead, it would help when estimating the parameters for a match on first name or date of birth, likely lowering the match weight of both if they are correlated, but at prediction time the structure still dictates that the match weight of 1-1 = the match weight of 0-1 + the match weight of 1-0.
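
A small numeric sketch of the structural difference, in base R. All probabilities below are invented for illustration (they are not estimates from fastLink or from the paper), and log2 likelihood ratios are assumed as the match-weight scale. The sketch compares the weight a combined four-level field assigns to 1-1 against the weight an additive, conditionally independent structure would assign when both fields agree.

```r
# Hypothetical joint probabilities of the (first name, date of birth)
# agreement pattern among true matches (m) and non-matches (u).
patterns <- c("0-0", "0-1", "1-0", "1-1")
m_joint <- setNames(c(0.05, 0.05, 0.05, 0.85), patterns)
u_joint <- setNames(c(0.91, 0.04, 0.04, 0.01), patterns)

# Combined-field structure: one freely estimated weight per level
w_joint <- log2(m_joint / u_joint)

# Marginal agreement probabilities implied by the same joint tables
m_first <- unname(m_joint["1-0"] + m_joint["1-1"])
u_first <- unname(u_joint["1-0"] + u_joint["1-1"])
m_dob   <- unname(m_joint["0-1"] + m_joint["1-1"])
u_dob   <- unname(u_joint["0-1"] + u_joint["1-1"])

# Additive structure: the weight for "both agree" is the sum of per-field weights
w_additive <- log2(m_first / u_first) + log2(m_dob / u_dob)

# Compare the two weights for a pair that agrees on both fields
c(joint = unname(w_joint["1-1"]), additive = w_additive)
```

With these made-up numbers the agreements are positively correlated among non-matches, so the combined-field weight for 1-1 comes out lower than the additive one; different inputs could of course give a different gap.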
