Handling of NA values in id_var #69

Closed
matthiasgomolka opened this issue Apr 1, 2021 · 3 comments
Assignees
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@matthiasgomolka
Owner

matthiasgomolka commented Apr 1, 2021

If id_var contains NA values, these are treated as a single ID. This is not correct. Rows where is.na(id_var) need to be removed before calculating the number of distinct entities and dominance.

The problem is especially obvious when a large portion of id_var is NA.

library(sdcLog)
#> Warning: package 'sdcLog' was built under R version 4.0.4
library(data.table)
#> Warning: package 'data.table' was built under R version 4.0.4
set.seed(1)

DT <- data.table(
  id = c(LETTERS[1:6], rep(NA, 20)),
  val = rnorm(26)
)

Right now, NA values are treated as a single valid ID. Thus, this "ID" is dominant:

# current version
sdc_descriptives(DT, "id", "val")          
#> Warning: DISCLOSURE PROBLEM: Dominant entities.
#> [ OPTIONS:  sdc.n_ids: 5 | sdc.n_ids_dominance: 2 | sdc.share_dominance: 0.85 ]
#> [ SETTINGS: id_var: id | val_var: val | zero_as_NA: FALSE ]
#> Dominant entities:
#>    value_share
#> 1:   0.8545532
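
The effect can be reproduced by hand with data.table. The following is a minimal sketch with made-up values; it is not sdcLog's actual dominance calculation, which may aggregate differently. It only shows that grouping by id keeps NA as a group of its own, and that this group then holds the largest value share.

```r
library(data.table)

# Illustrative data (assumption): 6 known IDs with small values,
# 20 rows without an ID and larger values.
DT <- data.table(
  id  = c(LETTERS[1:6], rep(NA, 20)),
  val = c(1:6, rep(10, 20))
)

# Grouping by id puts all NA rows into one group, i.e. one pseudo-entity.
shares <- DT[, .(value = sum(val)), by = id]
shares[, value_share := value / sum(value)]
setorder(shares, -value_share)

# The first row is the NA "entity", with a share above the 0.85 threshold.
shares[1]
```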

One possible solution would be to remove all rows with a missing id_var before calculating distinct IDs and dominance:

sdc_descriptives(na.omit(DT), "id", "val") 
#> [ OPTIONS:  sdc.n_ids: 5 | sdc.n_ids_dominance: 2 | sdc.share_dominance: 0.85 ]
#> [ SETTINGS: id_var: id | val_var: val | zero_as_NA: FALSE ]
#> Output complies to RDC rules.

However, if you tested the same data against several ID variables, each check would consider a different number of rows, which is hard to explain and justify.
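
A small sketch of that concern, using two hypothetical ID columns id_a and id_b (names chosen for illustration only): dropping the rows with a missing ID leaves a different number of rows depending on which ID is checked.

```r
library(data.table)

# Hypothetical data with two ID variables that are missing in different rows.
DT <- data.table(
  id_a = c("A", "B", "C", NA, NA, "D"),
  id_b = c("X", NA, "Y", "Z", "X", "Y"),
  val  = 1:6
)

# Rows that would remain for each check after dropping missing IDs.
rows_a <- nrow(DT[!is.na(id_a)])
rows_b <- nrow(DT[!is.na(id_b)])

c(id_a = rows_a, id_b = rows_b)
```

Here the check on id_a would use 4 rows and the check on id_b would use 5, although both run on the same data.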

@matthiasgomolka matthiasgomolka added the bug Something isn't working label Apr 1, 2021
@matthiasgomolka matthiasgomolka self-assigned this Apr 1, 2021
@matthiasgomolka matthiasgomolka added the enhancement New feature or request label Apr 1, 2021
@matthiasgomolka
Owner Author

matthiasgomolka commented Apr 1, 2021

Further Considerations

Add a new argument like missing_id_var, which can take the values "random" or "structural" (and maybe "drop"?). This would make it possible to cover the following cases in a reasonable way.

missing_id_var = "random"

Tackled problem

This would solve the case in which NA values are missing at random. For example, consider a variable LEI. Not every entity has an LEI, but the entity exists nevertheless. The variable should be available but is missing for more or less random reasons.

Solution

Fill the NA values in id_var with a number of distinct placeholder values. That number is derived from the ratio of distinct non-NA IDs to the number of rows with a non-NA ID.

This simulates the existence of the missing IDs, assuming a distribution similar to that of the non-missing IDs. It would mitigate problems with a low number of distinct IDs as well as dominance.
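
One way this could look is sketched below. fill_missing_ids() is a hypothetical helper, not part of sdcLog, and it makes simplifying assumptions: the ID column is of type character, and the placeholder IDs are spread roughly evenly over the NA rows instead of mimicking the full distribution of the known IDs.

```r
library(data.table)

# Hypothetical helper (not an sdcLog function): replaces NA IDs with
# placeholder IDs. Assumes the ID column is a character column.
fill_missing_ids <- function(dt, id_var) {
  dt <- copy(dt)  # do not modify the caller's table by reference
  is_missing <- is.na(dt[[id_var]])
  n_missing <- sum(is_missing)
  n_known_rows <- sum(!is_missing)
  if (n_missing == 0L || n_known_rows == 0L) return(dt)

  # Ratio of distinct known IDs to rows with a known ID
  n_known_ids <- uniqueN(dt[[id_var]][!is_missing])
  ids_per_row <- n_known_ids / n_known_rows

  # Number of placeholder IDs for the NA rows, at least one
  n_new_ids <- max(1L, as.integer(round(ids_per_row * n_missing)))

  # Spread the placeholder IDs over the NA rows in random order
  new_ids <- paste0("missing_", sample(rep_len(seq_len(n_new_ids), n_missing)))
  set(dt, i = which(is_missing), j = id_var, value = new_ids)
  dt
}

set.seed(1)
DT <- data.table(
  id = c(LETTERS[1:6], rep(NA, 20)),
  val = rnorm(26)
)
filled <- fill_missing_ids(DT, "id")
uniqueN(filled$id)
```

In this data every known ID occurs exactly once, so each of the 20 NA rows receives its own placeholder and filled contains 26 distinct IDs.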

missing_id_var = "structural"

Tackled problem

In this case, NA IDs stem from situations where there really is no ID, and this is correct. For example, a dataset may contain several ID variables, one of which identifies a very special role that occurs only in a small number of cases. In most cases, this role does not exist and therefore the NA is correct.

Solution

Treat NA as a single (possibly large) entity, as is done right now, but remove the "ID" NA afterwards. There would then be a single additional ID, and in most cases this should also solve the problem of dominance: the large, possibly dominant NA entity is not considered worthy of protection, so it can safely be removed from the results. Most likely, the remaining non-NA entities are no longer dominant.
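
A rough sketch of this idea, again with a hypothetical helper and made-up numbers: value shares are computed with NA counted as one entity, so the NA rows stay in the denominator, and the NA entity is dropped from the result afterwards.

```r
library(data.table)

# Hypothetical helper (not an sdcLog function).
shares_without_na_entity <- function(ids, values) {
  # NA forms its own group here, so it contributes to the total value
  agg <- data.table(id = ids, value = values)[, .(value = sum(value)), by = id]
  agg[, value_share := value / sum(value)]

  # Drop the NA entity only after the shares have been computed
  agg <- agg[!is.na(id)]
  setorder(agg, -value_share)
  agg[]
}

res <- shares_without_na_entity(
  ids    = c("A", "B", NA, NA),
  values = c(1, 1, 4, 4)
)
res
```

A and B each keep a share of 0.1 of the total, so neither of them is dominant once the NA entity has been removed from the result.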

@matthiasgomolka matthiasgomolka changed the title NA is taken into account in id_var Handling of NA values in id_var Apr 1, 2021
@matthiasgomolka matthiasgomolka added question Further information is requested help wanted Extra attention is needed and removed bug Something isn't working question Further information is requested labels Apr 12, 2021
matthiasgomolka added a commit that referenced this issue Aug 6, 2021
Handle `NA` in ID columns correctly
@matthiasgomolka
Owner Author

Update after merging #76

Solved problems

  • NA values are no longer treated as valid IDs. Instead, their value_share is subtracted from the cumulative value share, and rows where is.na(id) are then removed from the dominance calculation. A large fraction of NA IDs therefore no longer causes dominance problems.
  • If all IDs are NA, no problem is reported any more (because there is no entity to protect).

Open problems

  • If there are only 4 distinct non-NA IDs and many (think millions of) NA values in the ID column, this is still reported as "Not enough distinct entities." Conceptually, this is not a problem, because it is impossible to infer any information about the 4 distinct IDs if there are millions of other rows. There may be different approaches to handling this, and they can be applied before using the sdc_*() functions. This problem is therefore less important than the one that has now been solved.
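
Such a situation could be detected before calling the sdc_*() functions, for example with a check like the one below. This is a sketch only; it assumes that the threshold is available as the R option sdc.n_ids (the name shown in the OPTIONS line of the output above) and falls back to 5 otherwise.

```r
library(data.table)

# Made-up ID column: 4 known IDs and many rows without an ID.
id <- c("A", "B", "C", "D", rep(NA, 1000))

n_distinct    <- uniqueN(id, na.rm = TRUE)  # distinct non-NA IDs
share_missing <- mean(is.na(id))            # fraction of rows without an ID

# Assumption: the threshold is stored in the option "sdc.n_ids"; default 5.
min_ids <- getOption("sdc.n_ids", 5L)

few_ids_many_missing <- n_distinct < min_ids && share_missing > 0.9
few_ids_many_missing
```

The 0.9 cut-off for the share of missing IDs is arbitrary and only serves to flag the case for a manual decision.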

matthiasgomolka added a commit that referenced this issue Dec 6, 2021
also improved cli output for sdc_min_max() (#91)
@matthiasgomolka
Owner Author

With #95, simple functionality was added which should cover a range of use cases. Thus, closing this issue for now. If a more sophisticated approach is required, a new issue should be opened.
