Basic usage example? #1

aalexandersson · 2024-01-26T16:58:15Z

Can you please add a basic usage example? Is it already a "minimum viable package" worth registering since it is v0.1.x?

I am very excited to learn about your package, to possibly one day switch from fastLink (R) and Splink (Python). I am not a good Julia programmer though. It would be easier to try your package as a user if it has a basic usage example and if it is registered. Best wishes!

jw2249a · 2024-01-26T18:39:20Z

Hi @aalexandersson, thanks for following this!

I need to modify the version numbering because it is currently only viable for my specific use case and needs more basic functionality of the original implementation to get to that point. I have a larger update incoming that refactors some of the code significantly, and it will include a readme.

I am also resolving some questions about the methodology with the original authors. Specifically, deduplication is not implemented yet because I don't understand the intuition behind the deduplication in fastLink (see this issue).

aalexandersson · 2024-01-26T19:24:14Z

Ok, it sounds very promising! 👍

The fastLink deduplication code, as you probably already know in detail, is in dedupeMatches.R. Historically, the first documentation of the algorithm that I am aware of is Ben Fifield's dissertation. I quote (from page 81): "[It] is a greedy algorithm that makes the best possible match for each observation in dataset A in dataset B, and then vice versa for dataset B among the possible remaining matches in dataset A. Any remaining ties are broken at random." I do not fully understand the issue that you posted but could it be such "a remaining tie broken at random"?
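For readers following along, the quoted description can be sketched roughly as below. This is a minimal illustration of the two-pass greedy idea only, not the fastLink code or this package's implementation. The `(a_id, b_id, score)` tuple layout and the function names `keep_best`, `break_ties`, and `greedy_dedupe` are hypothetical, chosen for the example.

```python
import random
from collections import defaultdict


def keep_best(pairs, key_index):
    """For each id at position key_index, keep only its highest-scoring pairs (ties kept)."""
    best = defaultdict(list)
    for pair in pairs:
        key, score = pair[key_index], pair[2]
        if not best[key] or score > best[key][0][2]:
            best[key] = [pair]          # new best score for this id
        elif score == best[key][0][2]:
            best[key].append(pair)      # tie with the current best
    return [p for group in best.values() for p in group]


def break_ties(pairs, key_index, rng):
    """Pick one pair at random for each id at position key_index."""
    groups = defaultdict(list)
    for pair in pairs:
        groups[pair[key_index]].append(pair)
    return [rng.choice(group) for group in groups.values()]


def greedy_dedupe(pairs, seed=0):
    """Two-pass greedy dedupe over (a_id, b_id, score) tuples (hypothetical layout)."""
    rng = random.Random(seed)
    remaining = keep_best(pairs, 0)          # best match in B for each record in A
    remaining = keep_best(remaining, 1)      # then best remaining match in A for each record in B
    remaining = break_ties(remaining, 0, rng)  # remaining ties broken at random
    remaining = break_ties(remaining, 1, rng)
    return sorted(remaining)


pairs = [
    ("a1", "b1", 0.95),
    ("a1", "b2", 0.90),
    ("a2", "b1", 0.99),
    ("a3", "b3", 0.97),
]
print(greedy_dedupe(pairs))
```

In this toy input, `a1` ends up unmatched even though `b2` is still free: the first pass commits `a1` to `b1`, and the second pass gives `b1` to `a2`. That is a property of the greedy approach as sketched here, and one way it differs from a stable-matching formulation.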

About your implementation of deduplication, is there a reason why you need to port the fastLink algorithm exactly, as opposed to "simply" coding something that makes sense to you and works? The main fastLink developer nowadays seems to be Ted Enamorado, and in his chapter "A Primer on Probabilistic Record Linkage" (2021) he uses first principles instead of the default algorithm. Also, Splink, which is a very popular implementation of fastLink in Python, does not use the fastLink algorithm. Instead, they use another, SQL-based algorithm. The Splink developers openly discuss this in issue 1602; the underlying problem is known in the literature as the "stable marriage" problem. So, a close copy of the fastLink deduplication code does not seem to be critical.

jw2249a · 2024-01-26T20:35:16Z

I guess it's less a question of intuition than an actual bug. I need to read the dissertation, but the code starting here doesn't deduplicate the data as intended.

I think the intention was to tapply over the ids of dfB for dfA, but it instead tapplys over the ids of dfA against a merge against dfA, and vice versa for dfB. The issue is that because the duplicate row ids are removed before dfB is deduped, it doesn't actually run a dedupe on dfB.

Thanks for linking Ben Fifield's dissertation; it is helpful to know what the intended behavior was. I will write a function up based on that intended solution.

aalexandersson · 2024-01-26T21:33:10Z

> [...] it doesn't actually run a dedupe on dfB.

Oh, that would explain it. I have experienced a few duplicates in dfB several times on real but confidential data despite running the default dedupe.matches, even on the recent v0.6.1. I never reported it as an issue because it was feasible to review such left-over duplicates manually.
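A quick way to flag such left-over duplicates for manual review might look like the sketch below. It assumes the matches are available as plain `(a_id, b_id)` pairs; that layout and the helper name `leftover_duplicates` are hypothetical and not part of fastLink or this package.

```python
from collections import Counter


def leftover_duplicates(pairs, side):
    """Return ids on one side (0 for dfA, 1 for dfB) that appear in more than one matched pair."""
    counts = Counter(pair[side] for pair in pairs)
    return sorted(record_id for record_id, n in counts.items() if n > 1)


# Toy example: record 10 in dfB is matched to two different records in dfA.
matched = [(1, 10), (2, 10), (3, 11)]
print(leftover_duplicates(matched, side=1))
```

An empty result on both sides would mean the match set is one-to-one.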

Therefore, here are my guesses: 1) You are correct, 2) It would help the fastLink developers if you provide a minimal reproducible example, and 3) They view it as a possible minor bug, which is not their priority because they are focusing on a larger update.

> I will write a function up based on that intended solution.

Yes, that sounds very reasonable 💯. Implement something that works, and is flexible. It can always be improved later.

marcelo-g-simas · 2024-07-26T13:08:58Z

Hello, I have been looking over this for a couple of days and can't figure out how to get the matches from the two passed-in DataFrames based on the structures returned by fastLink(). The R package has a function called getMatches() that does this. Is the plan to eventually implement that, or do you have a different vision for the Julia implementation?

jw2249a · 2024-09-05T18:38:48Z

@Westat-Transportation, this functionality was added as of 0.0.7. Let me know if the function is what you were looking for.

jw2249a self-assigned this Jan 26, 2024

jw2249a added the documentation and enhancement labels Jan 26, 2024

jw2249a mentioned this issue Jan 26, 2024

dedupeMatches does not consider exact matches kosukeimai/fastLink#78

Open

jw2249a mentioned this issue Jul 27, 2024

Add getMatches() function #8

Closed
