dedupeMatches does not consider exact matches #78

Open
jw2249a opened this issue Jan 4, 2024 · 2 comments
@jw2249a

jw2249a commented Jan 4, 2024

The deduplication appears to group matches by match pattern and matched index and keep the one with the highest zeta value, but it does not account for zeta values that are exactly equal. This leads to inconsistent behavior.

Setup: The issue can be reproduced by appending the first row of dfA (where firstname is "daniel") to both dfA and dfB, so that the appended record is an exact match to a row in both dfA and dfB.

Issue 1: With the setup above, the dedupe algorithm returns all of the matched values. However, if you change the firstname in the first row to NA, then it is removed.

Issue 2: If you change the lastname "secuya" to "secuyas" while leaving the firstname as NA, it is still removed by the dedupe function. But if you set the firstname back to "daniel", it is not deduped.
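
To illustrate the tie case, here is a minimal toy sketch in Python. It is not fastLink's code, and I have not confirmed that the package works this way; the tuple layout, the data, and the helper `keep_highest_zeta` are all made up for illustration. It shows how a "keep the highest zeta per index" rule, written as a filter against the group maximum, keeps every tied match instead of one.

```python
from collections import defaultdict

# Hypothetical toy data: (index_in_dfA, index_in_dfB, zeta). Not fastLink output.
matches = [
    (0, 0, 1.0),
    (0, 5, 1.0),  # exact tie with the row above
    (1, 1, 0.6),
    (1, 2, 0.9),
]


def keep_highest_zeta(matches):
    """For each dfA index, keep every match whose zeta equals the group maximum."""
    by_a = defaultdict(list)
    for m in matches:
        by_a[m[0]].append(m)

    kept = []
    for a_idx in sorted(by_a):
        top = max(z for _, _, z in by_a[a_idx])
        # Ties are not broken here: all matches equal to the maximum survive.
        kept.extend(m for m in by_a[a_idx] if m[2] == top)
    return kept


kept = keep_highest_zeta(matches)
print(kept)
```

In this sketch, dfA index 0 keeps both of its tied matches, while index 1 is reduced to a single match.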

@tedenamorado
Collaborator

Thanks for letting us know! I will try to reproduce what you describe and report back.

@jw2249a
Author

jw2249a commented Jan 26, 2024

I found the cause of the deduplication issue. The order of the data frames matters because duplicate row IDs are removed before they are checked for again in dfB.
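
A toy Python sketch of that order dependence, under my own assumptions: this is not fastLink's implementation, and the helpers `keep_best` and `dedupe_two_pass` as well as the data are hypothetical. It deduplicates on one side first, then on the other, and shows that the surviving pairs change depending on which side goes first.

```python
# Hypothetical toy data: (index_in_dfA, index_in_dfB, zeta). Not fastLink output.
matches = [
    (0, 0, 0.9),
    (0, 1, 0.8),
    (1, 1, 0.7),
]


def keep_best(matches, key):
    """Keep the single highest-zeta match for each index on one side.

    key=0 groups by the dfA index, key=1 groups by the dfB index.
    """
    best = {}
    for m in matches:
        k = m[key]
        if k not in best or m[2] > best[k][2]:
            best[k] = m
    return sorted(best.values())


def dedupe_two_pass(matches, first):
    """Dedupe on one side, then dedupe what is left on the other side."""
    second = 1 - first
    return keep_best(keep_best(matches, first), second)


a_first = dedupe_two_pass(matches, first=0)
b_first = dedupe_two_pass(matches, first=1)
print("dfA first:", a_first)
print("dfB first:", b_first)
```

With dfA first, the pair (0, 1) is dropped in the first pass, so (1, 1) has no competitor left in the second pass and survives. With dfB first, (1, 1) loses to (0, 1) in the first pass, and (0, 1) is then removed in the second pass, leaving a single pair.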
