Handling of `NA` values in `id_var` #69
Comments
Further Considerations
Add a new argument like
`NA` is taken into account in `id_var`
`NA` values in `id_var`
Handle `NA` in ID columns correctly
Update after merging #76
Solved problems
Open problems
also improved cli output for sdc_min_max() (#91)
With #95, simple functionality was added that should cover a range of use cases, so I am closing this issue for now. If a more sophisticated approach is required, a new issue should be opened.
If `id_var` contains `NA` values, these are treated as a single ID. This is not correct. Rows where `is.na(id_var)` is `TRUE` need to be removed before calculating the number of distinct entities and dominance. The problem is especially obvious when a large portion of `id_var` is `NA`.

Right now, `NA` values are treated as a single valid ID. Thus, this "ID" is dominant.

One possible solution would be to remove all rows with missing `id_var` before calculating distinct IDs and dominance.

However, this would mean that if you test the exact same data for several IDs, each check would consider a different number of rows, which is hard to explain and justify.
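
The behaviour described above can be sketched in base R. This is only an illustration on made-up toy data, not the package's actual implementation: the data frame `df`, its columns `id`, `id2` and `val`, and the helper variables are all hypothetical names chosen for this example.

```r
# Toy data (hypothetical): three of five rows have a missing `id`,
# one row has a missing `id2`.
df <- data.frame(
  id  = c("A", "B", NA, NA, NA),
  id2 = c("X", "X", "Y", NA, "Z"),
  val = c(10, 20, 30, 30, 30),
  stringsAsFactors = FALSE
)

# Variant 1: NA is treated like an ordinary ID.
# All NA rows collapse into one pseudo-entity.
id_with_na <- ifelse(is.na(df$id), "<NA>", df$id)
n_ids_all  <- length(unique(id_with_na))
agg_all    <- tapply(df$val, id_with_na, sum)
share_all  <- max(agg_all) / sum(agg_all)   # share of the largest "entity"

# Variant 2: rows with a missing ID are dropped first.
known       <- df[!is.na(df$id), ]
n_ids_known <- length(unique(known$id))
agg_known   <- tapply(known$val, known$id, sum)
share_known <- max(agg_known) / sum(agg_known)

# Side effect of variant 2: the number of rows considered
# depends on which ID variable is checked.
rows_used_id  <- sum(!is.na(df$id))
rows_used_id2 <- sum(!is.na(df$id2))
```

In this toy data, variant 1 counts three entities and the `NA` pseudo-entity holds the largest share of `val`, while variant 2 counts only the two known IDs. The last two lines show the drawback mentioned above: checking `id` uses two rows, checking `id2` uses four.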