NA values create error in getMatches() #56

datafj · 2021-10-06T13:48:24Z

I got an error when running getMatches(). It was caused by this line of code in getMatches():

dfA$dedupe.ids <- out_df$id_2

This assignment fails when the number of rows in dfA differs from the number of rows in out_df.
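The row-count mismatch can be reproduced in base R without fastLink. This is a minimal sketch: the two data frames are made-up stand-ins that only borrow the names dfA and out_df from the line quoted above.

```r
# Made-up stand-ins: dfA has 10 rows, out_df has only 9.
dfA <- data.frame(id = 1:10)
out_df <- data.frame(id_2 = 1:9)

# Capture the error message instead of stopping the script.
res <- tryCatch(
  {
    dfA$dedupe.ids <- out_df$id_2
    "assigned"
  },
  error = function(e) conditionMessage(e)
)
print(res)
```

Base R refuses to recycle a 9-element vector into a 10-row data frame column, so the assignment raises an error rather than padding the result.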

I traced the error back to NAs in fl.out$posterior, and then found one NA value in one of the columns of dfA.

So I suspect that the NA in the input data produced the NAs in fl.out$posterior, which in turn caused getMatches() to fail.

I suggest the authors consider adding a warning message to fastLink(), or replacing NA values in the input data with non-NA values such as empty strings.
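Until such a warning exists in the package, a check along these lines could be run on the input before calling fastLink(). This is only a sketch: check_linkage_na is a hypothetical helper written in base R, not a fastLink function, and the toy data frame is made up.

```r
# Hypothetical helper (not part of fastLink): count NAs in each linkage
# variable and raise a warning if any are found.
check_linkage_na <- function(df, varnames) {
  df <- as.data.frame(df)
  na_counts <- vapply(df[varnames], function(x) sum(is.na(x)), integer(1))
  if (any(na_counts > 0)) {
    bad <- na_counts[na_counts > 0]
    warning(
      "NA values found in linkage variables: ",
      paste0(names(bad), " (", bad, ")", collapse = ", "),
      call. = FALSE
    )
  }
  na_counts
}

# Made-up toy input with one NA in the linkage variable.
toy <- data.frame(id = 1:3, name = c("aaa", NA, "bbb"), stringsAsFactors = FALSE)

# Capture the warning text so it can be inspected.
msg <- tryCatch(
  check_linkage_na(toy, "name"),
  warning = function(w) conditionMessage(w)
)
print(msg)
```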


tedenamorado · 2021-10-09T04:18:47Z

Thanks for sharing this information with us! If you could share with us a reproducible example, that would be greatly appreciated. We have not run into such a situation before, but I promise we will investigate this.

All my best,

Ted

datafj · 2021-10-11T19:03:47Z

Run the following code:

library(data.table)
library(fastLink)
dt_input = data.table(id = 1:10, name = c("aaa","bbb","aab","bba","aba","bab","baa","abb","baa",NA))
fl_out = fastLink(dfA = dt_input, dfB = dt_input, varnames = "name")
dt_matches = as.data.table(getMatches(dfA = dt_input, dfB = dt_input, fl.out = fl_out))

And you will get the following message:

Error in `$<-.data.frame`(`*tmp*`, "dedupe.ids", value = c(1, 2, 3, 4,  : 
  replacement has 9 rows, data has 10

The cause is the NA value in dt_input. If you replace this NA with "aaa", the error goes away. In this toy example, however, replacing it with an empty string "" still does not avoid the error.

My real dataset has 40k rows, with 7 columns used in varnames. A single NA produced a similar error message, and there it was resolved by replacing the NA with an empty string. Unfortunately, I can't post the real data because it contains real names.
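The empty-string replacement could be applied to all character linkage columns with a small helper like the one below. This is a sketch of a workaround, not a fix: blank_out_na is a hypothetical name, the toy data are made up, and as noted above the replacement helped on the real dataset but not on the single-column toy example.

```r
# Hypothetical helper: replace NA with "" in character linkage columns only.
# Non-character columns are left untouched.
blank_out_na <- function(df, varnames) {
  df <- as.data.frame(df)
  for (v in varnames) {
    if (is.character(df[[v]])) {
      df[[v]][is.na(df[[v]])] <- ""
    }
  }
  df
}

# Made-up toy input with one NA in the linkage variable.
toy <- data.frame(id = 1:3, name = c("aaa", NA, "bbb"), stringsAsFactors = FALSE)
cleaned <- blank_out_na(toy, "name")
print(cleaned)
```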

tedenamorado · 2021-10-13T16:14:43Z

Excellent! Thanks a lot for sharing this with us. I will make sure to fix the issue this weekend. As soon as a fix is pushed to GitHub, I will let you know.

All my best,

Ted

aalexandersson · 2021-10-13T17:00:33Z

In case it helps, I think there are possibly three different issues with the deduplication example and therefore three possible "fixes":

1. A better error message is needed.
2. A warning message would be better than an error message.
3. NA should be allowed in a linkage variable when deduplicating a single dataset, so that neither an error message nor a warning message is needed.

I cannot reproduce the error message from "one single NA" when using multiple columns. @datafj referred to an example with 7 columns but could not show reproducible code. As long as at least one linkage variable is complete (i.e., has no NA) then deduplication works as it should, as far as I know. For example, this code has 2 linkage variables and "one single NA" and runs fine because the new variable newvar is complete:

library(data.table)
library(fastLink)
dt_input = data.table(id = 1:10,
    newvar = 1:10,
    name = c("aaa","bbb","aab","bba","aba","bab","baa","abb","baa",NA))
fl_out = fastLink(dfA = dt_input, dfB = dt_input, varnames = c("name", "newvar"))
dt_matches = as.data.table(getMatches(dfA = dt_input, dfB = dt_input, fl.out = fl_out))
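The condition described above (at least one linkage variable with no NA) can be tested before running the deduplication. A minimal base R sketch follows; has_complete_linkage_var is a hypothetical helper, and the data frame reuses the toy values from the example.

```r
# Hypothetical check: TRUE if at least one linkage variable has no NA.
has_complete_linkage_var <- function(df, varnames) {
  df <- as.data.frame(df)
  any(vapply(df[varnames], function(x) !anyNA(x), logical(1)))
}

# Same toy values as in the example above, as a plain data frame.
df_toy <- data.frame(
  id = 1:10,
  newvar = 1:10,
  name = c("aaa", "bbb", "aab", "bba", "aba", "bab", "baa", "abb", "baa", NA),
  stringsAsFactors = FALSE
)

only_name <- has_complete_linkage_var(df_toy, "name")
name_and_newvar <- has_complete_linkage_var(df_toy, c("name", "newvar"))
print(c(only_name = only_name, name_and_newvar = name_and_newvar))
```

With name alone the check is FALSE, which is the configuration that triggered the error; adding the complete newvar makes it TRUE.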

tedenamorado · 2021-11-26T19:07:37Z

Hi,

The code has been adjusted to accommodate this knife-edge case. Please let us know if anything else comes up. It is thanks to this community that we have been able to make fastLink a better tool for everyone.

Happy Thanksgiving!

Ted

ericmanning · 2021-11-29T07:16:37Z

It appears that this knife-edge case is not resolved when reweighting names. (See the toy example provided above.)

(P.S. -- Thank you so much for writing and maintaining this incredibly useful package!)

Tom-K-UKRI · 2023-02-21T15:04:53Z

Hi, I'm not 100% sure this is the same issue, but I am having a similar problem when de-duplicating. When I run getMatches() after fastLink(), it fails with "replacement has 114997 rows, data has 115257" (i.e. the fastLink output has 260 fewer rows than the input data).

Acting on the solution proposed above, I tried removing rows which had all NAs or blank values in a matching field, which didn't work. getMatches() did work when I excluded all rows that have at least one NA in a matching field, though I suspect this is overcompensating.

Since I can't provide a reproducible example, I understand a solution is unlikely, but I'm recording this in case it's useful or I'm missing something obvious. I've put a simple version below. Thanks for a fantastic package!

fl_out_dedupe <- fastLink(
  dfA = data, dfB = data,
  varnames = c("first_name", "surname", "postcode_district", "birth_year")
)

dfA_dedupe <- getMatches(dfA = data, dfB = data, fl.out = fl_out_dedupe)
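The two row filters described above could look roughly like this in base R. This is a sketch under assumptions: the toy data are made up, the column names are taken from the call above, and "all NAs or blank values" is read here as "every matching field is NA or empty", which may not be exactly what was tried.

```r
linkage_vars <- c("first_name", "surname", "postcode_district", "birth_year")

# Made-up toy data: row 1 is complete, rows 2 and 3 are partly missing,
# row 4 is missing in every matching field.
toy_data <- data.frame(
  first_name = c("ann", NA, "bob", NA),
  surname = c("lee", "ray", "", NA),
  postcode_district = c("AB1", "CD2", "EF3", ""),
  birth_year = c(1980, 1975, 1990, NA),
  stringsAsFactors = FALSE
)

# TRUE where a matching field is NA or an empty string.
is_missing <- vapply(
  toy_data[linkage_vars],
  function(x) is.na(x) | x == "",
  logical(nrow(toy_data))
)

n_missing <- rowSums(is_missing)

# Filter 1: drop rows where every matching field is missing.
drop_all_missing <- toy_data[n_missing < length(linkage_vars), , drop = FALSE]

# Filter 2: drop rows where at least one matching field is missing.
drop_any_missing <- toy_data[n_missing == 0, , drop = FALSE]

print(nrow(drop_all_missing))
print(nrow(drop_any_missing))
```

The second filter is the stricter one and discards records that still carry usable information in their other fields, which is why it may be overcompensating.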

tedenamorado added a commit that referenced this issue Nov 26, 2021

fix issue #56

435ddc4
