dedupeMatches does not consider exact matches #78

jw2249a · 2024-01-04T03:11:13Z

The deduplication appears to group on the match pattern and the matched value's index and keep the highest zeta value, but it does not account for zeta values that are exactly equal. This leads to inconsistent behavior.

Preface: The issue can be recreated by appending the first row of dfA (where firstname is "daniel") to both dfA and dfB. The appended record is then an exact match to a row in both dfA and dfB.

Issue 1: The dedupe algorithm will return all of the matched values when set up as above. However, if you change the value of firstname in the first row to NA, then that record will be removed.

Issue 2: If you change the lastname "secuya" to "secuyas" while leaving the firstname as NA, the record will still be removed by the dedupe function. But if you add the name "daniel" back to the firstname, it will not be deduped.
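
To make the tie behavior concrete, here is a minimal sketch in Python (not R, and not fastLink's actual code). It assumes a dedupe rule of "keep every row whose zeta equals the maximum for its dfA index"; the function name `keep_max_per_key`, the row layout, and the zeta values are all hypothetical. Under that assumption, two exactly tied matches for the same dfA row both survive:

```python
def keep_max_per_key(matches, key):
    """Keep every row whose zeta equals the maximum zeta seen for its key.

    Hypothetical rule for illustration only; it is not taken from fastLink.
    """
    best = {}
    for row in matches:
        k = row[key]
        best[k] = max(best.get(k, float("-inf")), row["zeta"])
    # Rows that tie for the maximum all pass this filter.
    return [row for row in matches if row["zeta"] == best[row[key]]]


# Made-up match table: dfA row 1 matches two dfB rows with identical zeta,
# as happens when the same record has been appended to dfB.
matches = [
    {"a": 1, "b": 1, "zeta": 1.0},
    {"a": 1, "b": 7, "zeta": 1.0},
    {"a": 2, "b": 2, "zeta": 0.93},
]

deduped = keep_max_per_key(matches, "a")
for row in deduped:
    print(row)
```

Both rows for dfA index 1 remain in `deduped`, so a rule like this would need an explicit tie-breaker to guarantee one match per record.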

tedenamorado · 2024-01-04T03:25:45Z

Thanks for letting us know! I will try to reproduce what you describe and report back.

jw2249a · 2024-01-26T20:38:14Z

I found the issue with the deduplication. The order of the dataframes matters because the duplicate row ids are removed for dfA before they are checked again in dfB.
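
The order dependence can be sketched as follows, again in Python and again as an assumption rather than fastLink's implementation. The helper `keep_best` (a hypothetical name) keeps a single highest-zeta row per index, and the match table is made up. Applying it to the dfA index first and then the dfB index yields a different result than the reverse order:

```python
def keep_best(matches, key):
    """Keep one row per key: the first row with the highest zeta.

    Hypothetical helper for illustration; not taken from fastLink.
    """
    best = {}
    for row in matches:
        k = row[key]
        if k not in best or row["zeta"] > best[k]["zeta"]:
            best[k] = row
    return list(best.values())


# Made-up match table with overlapping candidates.
matches = [
    {"a": 1, "b": 1, "zeta": 0.90},
    {"a": 1, "b": 2, "zeta": 0.80},
    {"a": 2, "b": 1, "zeta": 0.95},
]

# Dedupe on dfA first, then dfB: the (a=1, b=2) candidate is dropped in the
# first pass, so a=1 has no match left once b=1 is awarded to a=2.
a_then_b = keep_best(keep_best(matches, "a"), "b")

# Dedupe on dfB first, then dfA: (a=1, b=2) survives the first pass.
b_then_a = keep_best(keep_best(matches, "b"), "a")

print([(r["a"], r["b"]) for r in a_then_b])
print([(r["a"], r["b"]) for r in b_then_a])
```

In this toy case the first ordering leaves dfA row 1 unmatched while the second keeps its weaker match, which is the kind of asymmetry described above.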

tedenamorado self-assigned this Jan 4, 2024

jw2249a mentioned this issue Jan 26, 2024

Basic usage example? jw2249a/FastLink.jl#1

Open
