NA values create error in getMatches() #56
Comments
Thanks for sharing this information with us! If you could share with us a reproducible example, that would be greatly appreciated. We have not run into such a situation before, but I promise we will investigate this. All my best, Ted
Run the following code:
And you will get the following message:
The cause is the NA value in dt_input. If you replace this NA with "aaa", there will be no error message. However, in this case replacing it with blank string My real dataset is a 40k row dataset with 7 columns used in
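Since the original reproduction code did not survive in this thread, here is a minimal sketch of how one might locate NA values in the linkage fields before calling fastLink(). It uses base R only. The toy `dt_input`, its column names, and the `link_vars` vector are assumptions made for illustration, not the reporter's data.

```r
# Hypothetical stand-in for dt_input; the real data is not shown in the thread.
dt_input <- data.frame(
  name = c("anna", "bob", NA, "anna"),
  city = c("x", "y", "z", "x"),
  stringsAsFactors = FALSE
)
link_vars <- c("name", "city")  # assumed linkage fields

# Number of NA values in each linkage field
na_counts <- colSums(is.na(dt_input[, link_vars, drop = FALSE]))
print(na_counts)

# Row positions that contain at least one NA in a linkage field
rows_with_na <- which(!complete.cases(dt_input[, link_vars, drop = FALSE]))
print(rows_with_na)
```

A check like this makes it easier to tell whether a failure comes from a single NA or from something else in the data.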
Excellent! Thanks a lot for sharing this with us. I will make sure to fix the issue this weekend. As soon as a fix is pushed to GitHub, I will let you know. All my best, Ted
In case it helps, I think there are possibly three different issues with the deduplication example and therefore three possible "fixes":
I cannot reproduce the error message from "one single NA" when using multiple columns. @datafj referred to an example with 7 columns but could not show reproducible code. As long as at least one linkage variable is complete (i.e., has no NA) then deduplication works as it should, as far as I know. For example, this code has 2 linkage variables and "one single NA" and runs fine because the new variable
Hi, the code has been adjusted to accommodate this knife-edge case. If anything else pops up, please let us know. Note that it is because of this community that we have been able to make fastLink a better tool for everyone. Happy Thanksgiving! Ted
It appears that this knife-edge case is not resolved when reweighting names. (See the toy example provided above.) (P.S. -- Thank you so much for writing and maintaining this incredibly useful package!)
Hi, I'm not 100% sure this is the same issue, but I am having a similar problem when de-duplicating. When I run getMatches after fastLink, it fails because the "replacement has 114997 rows, data has 115257" (i.e. the fastLink output has 260 fewer rows than the input data). Acting on the proposed solution above, I tried removing rows that had all NAs or blank values in the matching fields, which didn't work. getMatches did work when I excluded all rows that have at least one NA in a matching field, though I suspect this is overcompensating. Since I can't provide a reproducible example, I understand a solution is unlikely, but I'm recording this in case it's useful or I'm missing something obvious. I've put a simple version below. Thanks for a fantastic package!
fl_out_dedupe <- fastLink(
dfA_dedupe <- getMatches(dfA = data, dfB = data, fl.out = fl_out_dedupe)
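To compare the two filtering strategies described above (dropping rows where every matching field is missing versus dropping rows where any matching field is missing), a base R sketch along these lines may help. The toy `data` frame and the `match_vars` names are hypothetical, and blank strings are treated as missing here as an assumption.

```r
# Hypothetical toy data; the column names are placeholders for the matching fields.
data <- data.frame(
  first = c("ann", NA, "cat", ""),
  last  = c("lee", "ray", NA, NA),
  stringsAsFactors = FALSE
)
match_vars <- c("first", "last")  # assumed matching fields

# TRUE where a matching field is NA or an empty string
missing_mat <- is.na(data[match_vars]) | data[match_vars] == ""
n_missing <- rowSums(missing_mat)

all_missing <- n_missing == length(match_vars)  # every matching field missing
any_missing <- n_missing > 0                    # at least one matching field missing

# How many rows fall into each missingness level
print(table(n_missing))

# The two candidate filters
data_drop_all <- data[!all_missing, , drop = FALSE]
data_drop_any <- data[!any_missing, , drop = FALSE]
print(c(original = nrow(data),
        drop_all = nrow(data_drop_all),
        drop_any = nrow(data_drop_any)))
```

Comparing the count of partially missing rows against the size of the row discrepancy could indicate whether missingness accounts for the lost rows, although that connection is not confirmed anywhere in this thread.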
I got an error when running getMatches(). It was caused by this line of code in getMatches():
The above line can't run when the number of rows in dfA is different from the number of rows in out_df.
I traced back the cause of the error and found that there were NAs in fl.out$posterior. I then found one NA value in one of the columns in dfA.
So I suspect that the NA in the input data resulted in NAs in fl.out$posterior, which then caused getMatches() to fail.
I suggest that the authors consider adding a warning message to fastLink(), or replacing NA values in the input data with non-NA values such as empty strings.
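The failing line from getMatches() is not shown here, so the following is only a base R sketch of the general kind of failure described: assigning a column whose length differs from the data frame's row count. It does not reproduce fastLink internals. The `posterior` vector is a made-up stand-in, and `warn_if_na()` is a hypothetical helper illustrating the suggested warning, not a function in the package.

```r
# Toy stand-ins; none of this is real fastLink output.
dfA <- data.frame(id = 1:3, name = c("ann", NA, "cat"),
                  stringsAsFactors = FALSE)
posterior <- c(0.99, NA, 0.97)        # made-up values with one NA
kept <- posterior[!is.na(posterior)]  # dropping the NA shortens the vector

# Assigning a shorter vector as a column raises a row-count error,
# similar in form to the "replacement has ... rows" message quoted in this thread.
msg <- tryCatch({
  dfA$posterior <- kept
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# Hypothetical helper: warn when any linkage field contains NA values.
warn_if_na <- function(df, vars) {
  n_na <- colSums(is.na(df[, vars, drop = FALSE]))
  if (any(n_na > 0)) {
    warning("NA values in linkage fields: ",
            paste(names(n_na)[n_na > 0], collapse = ", "))
  }
  invisible(n_na)
}

# Capture the warning text instead of letting it print
w <- tryCatch(warn_if_na(dfA, "name"),
              warning = function(w) conditionMessage(w))
print(w)
```

A pre-check of this sort could be run by the user before fastLink() as a workaround, independent of any change to the package itself.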