NA values create error in getMatches() #56

Open
datafj opened this issue Oct 6, 2021 · 7 comments

@datafj

datafj commented Oct 6, 2021

I got an error when running getMatches(). It was caused by this line of code in getMatches():

dfA$dedupe.ids <- out_df$id_2

The above line can't run when the number of rows in dfA is different from the number of rows in out_df.
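The failure itself is ordinary data frame behavior and can be reproduced without fastLink. A minimal base R sketch, where `df` and `ids` are illustrative stand-ins for dfA and out_df$id_2 (they are not objects from the package):

```r
# Illustrative stand-ins: `df` plays the role of dfA, `ids` of out_df$id_2.
df  <- data.frame(id = 1:10)
ids <- 1:9  # one element short, as when a row drops out upstream

# Assigning a column whose length does not fit nrow(df) raises an error;
# tryCatch() captures the message instead of stopping the script.
res <- tryCatch(
  {
    df$dedupe.ids <- ids
    "ok"
  },
  error = function(e) conditionMessage(e)
)
print(res)
```

Comparing nrow(dfA) with length(out_df$id_2) before the assignment would surface the mismatch earlier.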

I traced the error back and found that there were NAs in fl.out$posterior. Looking further, I found one NA value in one of the columns of dfA.

So I suspect that the NA in the input data produced the NAs in fl.out$posterior, which in turn caused getMatches() to fail.

I suggest the authors consider adding a warning message to fastLink(), or replacing NA values in the input data with non-NA values such as empty strings.
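As a rough illustration of the kind of warning suggested here, the check below could be run on the input before calling fastLink(). The helper name `warn_na_linkage` is hypothetical and not part of fastLink; it uses base R only.

```r
# Hypothetical helper (not part of fastLink): warn when any linkage
# variable contains NA before the data are passed to fastLink().
warn_na_linkage <- function(df, varnames) {
  df <- as.data.frame(df)
  # Count NAs per linkage variable; the result is named by variable.
  na_counts <- vapply(varnames, function(v) sum(is.na(df[[v]])), integer(1))
  has_na <- na_counts[na_counts > 0]
  if (length(has_na) > 0) {
    warning(
      "NA values in linkage variables: ",
      paste0(names(has_na), " (", has_na, ")", collapse = ", "),
      call. = FALSE
    )
  }
  invisible(na_counts)
}

# Toy input with a single NA in the linkage variable.
toy <- data.frame(id = 1:3, name = c("aaa", NA, "bbb"))
na_counts <- suppressWarnings(warn_na_linkage(toy, "name"))
print(na_counts)
```

The function only reports; whether to replace, drop, or keep the NA values is left to the caller.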

@tedenamorado
Collaborator

Thanks for sharing this information with us! If you could share with us a reproducible example, that would be greatly appreciated. We have not run into such a situation before, but I promise we will investigate this.

All my best,

Ted

@datafj
Author

datafj commented Oct 11, 2021

Run the following code:

library(data.table)
library(fastLink)
dt_input = data.table(id = 1:10, name = c("aaa","bbb","aab","bba","aba","bab","baa","abb","baa",NA))
fl_out = fastLink(dfA = dt_input, dfB = dt_input, varnames = "name")
dt_matches = as.data.table(getMatches(dfA = dt_input, dfB = dt_input, fl.out = fl_out))

And you will get the following message:

Error in `$<-.data.frame`(`*tmp*`, "dedupe.ids", value = c(1, 2, 3, 4,  : 
  replacement has 9 rows, data has 10

The cause is the NA value in dt_input. If you replace this NA with "aaa", there is no error message. In this example, however, replacing it with a blank string "" still does not work.

My real dataset has 40k rows, with 7 columns used in varnames. A single NA produced a similar error message there, and it was resolved by replacing the NA with a blank string. Unfortunately, I can't post the real data with real names.

@tedenamorado
Collaborator

Excellent! Thanks a lot for sharing this with us. I will make sure to fix the issue this weekend. As soon as a fix is pushed to GitHub, I will let you know.

All my best,

Ted

@aalexandersson

In case it helps, I think there are possibly three different issues with the deduplication example and therefore three possible "fixes":

  1. A better error message is needed.
  2. A warning message instead of an error message is better.
  3. Allow NA in a linkage variable in deduplication of a single dataset, so that there is no need for an error message or a warning message.

I cannot reproduce the error message from "one single NA" when using multiple columns. @datafj referred to an example with 7 columns but could not show reproducible code. As long as at least one linkage variable is complete (i.e., has no NA) then deduplication works as it should, as far as I know. For example, this code has 2 linkage variables and "one single NA" and runs fine because the new variable newvar is complete:

library(data.table)
library(fastLink)
dt_input = data.table(id = 1:10,
    newvar = 1:10,
    name = c("aaa","bbb","aab","bba","aba","bab","baa","abb","baa",NA))
fl_out = fastLink(dfA = dt_input, dfB = dt_input, varnames = c("name", "newvar"))
dt_matches = as.data.table(getMatches(dfA = dt_input, dfB = dt_input, fl.out = fl_out))
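To check whether a dataset falls into the case described above, one can count the rows in which every linkage variable is NA. This is only a diagnostic sketch in base R; the helper name `rows_all_na` is hypothetical, and whether such rows are the only trigger of the error is not established in this thread.

```r
# Hypothetical helper (not part of fastLink): indices of rows where
# every listed linkage variable is NA.
rows_all_na <- function(df, varnames) {
  df <- as.data.frame(df)
  na_mat <- is.na(df[, varnames, drop = FALSE])
  unname(which(rowSums(na_mat) == length(varnames)))
}

# Same toy data as in the examples above, built as a plain data frame.
dt_input <- data.frame(
  id = 1:10,
  newvar = 1:10,
  name = c("aaa", "bbb", "aab", "bba", "aba", "bab", "baa", "abb", "baa", NA)
)

only_name <- rows_all_na(dt_input, "name")               # "name" alone
both_vars <- rows_all_na(dt_input, c("name", "newvar"))  # with complete "newvar"
print(only_name)
print(both_vars)
```

With "name" alone, row 10 has no observed linkage value; once the complete variable newvar is added, no row is entirely missing.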

tedenamorado added a commit that referenced this issue Nov 26, 2021
@tedenamorado
Collaborator

Hi,

The code has been adjusted to accommodate this knife-edge case. If anything else pops up, please let us know. Note that it is because of this community that we have been able to make fastLink a better tool for everyone.

Happy Thanksgiving!

Ted

@ericmanning

It appears that this knife-edge case is not resolved when reweighting names. (See the toy example provided above.)

(P.S. -- Thank you so much for writing and maintaining this incredibly useful package!)

@Tom-K-UKRI

Hi, I'm not 100% sure this is the same issue, but I am having a similar problem when de-duplicating. When I run getMatches after fastLink, it fails with "replacement has 114997 rows, data has 115257" (i.e., the fastLink output has 260 fewer rows than the input data).

Acting on the proposed solution above, I tried removing rows that had only NAs or blank values in the matching fields, which didn't work. getMatches did work when I excluded all rows that have at least one NA in a matching field, though I suspect this is overcompensating.

Since I can't provide a reproducible example I understand a solution is unlikely but just recording this in case it's useful or I'm missing something obvious. I've put a simple version below. Thanks for a fantastic package!

fl_out_dedupe <- fastLink(
  dfA = data, dfB = data,
  varnames = c("first_name", "surname", "postcode_district", "birth_year"))

dfA_dedupe <- getMatches(dfA = data, dfB = data, fl.out = fl_out_dedupe)
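The workaround described above (excluding every row with at least one NA in a matching field) can be sketched in base R with complete.cases(). The data frame `people` below is made-up toy data standing in for `data`; only the column names are taken from the call above.

```r
# Toy stand-in for `data`; the values are invented for illustration.
people <- data.frame(
  first_name        = c("ann", "bob", NA,    "cat"),
  surname           = c("lee", "ray", "kim", "fox"),
  postcode_district = c("AB1", NA,    "CD2", "EF3"),
  birth_year        = c(1980,  1975,  1990,  1985)
)
match_vars <- c("first_name", "surname", "postcode_district", "birth_year")

# TRUE for rows with no NA in any matching field.
keep <- complete.cases(people[, match_vars])

people_complete <- people[keep, , drop = FALSE]
n_dropped <- sum(!keep)

print(n_dropped)
print(nrow(people_complete))
```

Comparing n_dropped with the row difference in the error message (260 in the report above) may help judge whether incomplete rows account for the mismatch. Note that this filter discards partially observed records, which is the overcompensation mentioned above.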
