Basic usage example? #1

Open
aalexandersson opened this issue Jan 26, 2024 · 6 comments
Assignees: jw2249a
Labels: documentation, enhancement

Comments

@aalexandersson

Can you please add a basic usage example? Is it already a "minimum viable package" worth registering since it is v0.1.x?

I am very excited to learn about your package, to possibly one day switch from fastLink (R) and Splink (Python). I am not a good Julia programmer though. It would be easier to try your package as a user if it has a basic usage example and if it is registered. Best wishes!

@jw2249a
Owner

jw2249a commented Jan 26, 2024

Hi @aalexandersson, thanks for following this!

I need to modify the version numbering because the package is currently only viable for my specific use case, and it needs more of the original implementation's basic functionality before it gets to that point. I have a larger update incoming that refactors some of the code significantly, and it will include a readme.

I am also resolving some questions about the methodology with the original authors. Specifically, deduplication is not implemented yet because I don't understand the intuition behind the deduplication in fastLink (see this issue).

@jw2249a jw2249a self-assigned this Jan 26, 2024
@jw2249a jw2249a added documentation Improvements or additions to documentation enhancement New feature or request labels Jan 26, 2024
@aalexandersson
Author

aalexandersson commented Jan 26, 2024

Ok, it sounds very promising! 👍

The fastLink deduplication code, as you probably already know in detail, is in dedupeMatches.R. Historically, the first documentation of the algorithm that I am aware of is Ben Fifield's dissertation. I quote (from page 81): "[It] is a greedy algorithm that makes the best possible match for each observation in dataset A in dataset B, and then vice versa for dataset B among the possible remaining matches in dataset A. Any remaining ties are broken at random." I do not fully understand the issue that you posted but could it be such "a remaining tie broken at random"?
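For readers who want to see the quoted description in code, here is a minimal sketch of one reading of it, in Python. It is not the fastLink code and not this package's API: the function name `greedy_dedupe`, the `(index_a, index_b, score)` tuple layout, and the `seed` parameter are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def greedy_dedupe(pairs, seed=0):
    """Greedy two-pass dedupe (hypothetical helper, not fastLink's code).

    pairs: list of (index_a, index_b, score) candidate matches.
    Returns a sorted list of pairs in which each A index and each
    B index appears at most once.
    """
    rng = random.Random(seed)

    def keep_best(candidates, key_pos):
        # Group candidate pairs by the index at key_pos (0 = A, 1 = B).
        groups = defaultdict(list)
        for pair in candidates:
            groups[pair[key_pos]].append(pair)
        kept = []
        for group in groups.values():
            best = max(p[2] for p in group)
            # Ties on the best score are broken at random.
            kept.append(rng.choice([p for p in group if p[2] == best]))
        return kept

    # Pass 1: each record in A keeps only its best candidate in B.
    after_a = keep_best(pairs, 0)
    # Pass 2: each record in B keeps its best candidate among what remains.
    return sorted(keep_best(after_a, 1))


candidates = [(0, 0, 0.90), (0, 1, 0.80), (1, 0, 0.95), (2, 2, 0.70)]
result = greedy_dedupe(candidates)
```

Note that in this toy input record 0 of A ends up unmatched even though it had a second candidate in B. That is a property of the greedy approach and is the kind of case the "stable marriage" formulation is meant to handle.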

About your implementation of deduplication, is there a reason why you need to port the fastLink algorithm exactly, as opposed to "simply" coding something that makes sense to you and works? The main fastLink developer nowadays seems to be Ted Enamorado, and in his chapter "A Primer on Probabilistic Record Linkage" (2021) he uses first principles instead of the default algorithm. Also, Splink, which is a very popular implementation of fastLink in Python, does not use the fastLink algorithm. Instead, they use another SQL-based algorithm. The Splink developers openly discuss this problem, known in the literature as the "stable marriage" problem, in issue 1602. So a close copy of the fastLink deduplication code does not seem to be critical.

@jw2249a
Owner

jw2249a commented Jan 26, 2024

I guess it's less a question of intuition than an actual bug. I need to read the dissertation, but the code starting here doesn't deduplicate the data as intended.

I think the intention was to tapply over the ids of dfB for dfA, but it instead tapplys over the ids of dfA against a merge against dfA, and vice versa for dfB. The issue is that, because the duplicate row ids are removed before dfB is deduped, it doesn't actually run a dedupe on dfB.

Thanks for linking Ben Fifield's dissertation; it is helpful to know what the intended behavior was. I will write a function up based on that intended solution.

@aalexandersson
Author

> [...] it doesn't actually run a dedupe on dfB.

Oh, that would explain it. I have experienced a few duplicates in dfB several times on real but confidential data despite running the default dedupe.matches, even on the recent v0.6.1. I never reported it as an issue because it was feasible to review such left-over duplicates manually.
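A quick way to confirm left-over duplicates without reviewing rows by hand is to count how often each index appears in the matched pairs. The sketch below is only an illustration in Python: `leftover_duplicates` is a hypothetical helper, and it assumes the matches are available as plain `(index_a, index_b)` pairs rather than in fastLink's own output structure.

```python
from collections import Counter


def leftover_duplicates(pairs):
    """Return (dup_a, dup_b): indices that occur in more than one pair.

    pairs: iterable of (index_a, index_b) matched pairs (assumed layout).
    """
    pairs = list(pairs)
    count_a = Counter(a for a, _ in pairs)
    count_b = Counter(b for _, b in pairs)
    dup_a = sorted(i for i, n in count_a.items() if n > 1)
    dup_b = sorted(i for i, n in count_b.items() if n > 1)
    return dup_a, dup_b


# Toy example: record 5 of dfB is matched to two different records of dfA.
matches = [(0, 5), (1, 5), (2, 7)]
dup_a, dup_b = leftover_duplicates(matches)
print(dup_a, dup_b)  # prints: [] [5]
```

Two empty lists mean the match set is one-to-one. Anything in the second list is a dfB record that the dedupe step left matched more than once.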

Therefore, here are my guesses: 1) You are correct, 2) It would help the fastLink developers if you provide a minimal reproducible example, and 3) They view it as a possible minor bug, which is not their priority because they are focusing on a larger update.

> I will write a function up based on that intended solution.

Yes, that sounds very reasonable 💯. Implement something that works, and is flexible. It can always be improved later.

@marcelo-g-simas

Hello, I have been looking over this for a couple of days and can't figure out how to get the matches from the two passed-in data frames based on the structures returned by fastLink(). In the R package there is a function called getMatches() that does this. Is the plan to eventually implement that, or do you have a different vision for the Julia implementation?
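To make the request concrete, the operation being asked for amounts to joining rows of the two inputs on a list of matched index pairs. The Python sketch below only illustrates that idea. It is not the R getMatches() function and not this package's API: `collect_matches`, the list-of-dicts table layout, and the `a_`/`b_` column prefixes are all assumptions.

```python
def collect_matches(rows_a, rows_b, pairs):
    """Join matched rows from two tables (hypothetical helper).

    rows_a, rows_b: lists of dicts, one dict per record.
    pairs: iterable of (index_a, index_b) matched pairs.
    Returns one combined dict per pair, with prefixed column names so
    that columns shared by both tables do not overwrite each other.
    """
    matched = []
    for index_a, index_b in pairs:
        row = {f"a_{key}": value for key, value in rows_a[index_a].items()}
        row.update({f"b_{key}": value for key, value in rows_b[index_b].items()})
        matched.append(row)
    return matched


df_a = [{"name": "Ann Lee"}, {"name": "Bob Ray"}]
df_b = [{"name": "Robert Ray"}, {"name": "Anne Lee"}]
joined = collect_matches(df_a, df_b, [(0, 1), (1, 0)])
```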

@jw2249a
Owner

jw2249a commented Sep 5, 2024

@Westat-Transportation this functionality has been added as of 0.0.7. Let me know if the function is what you were looking for.


3 participants