Handling of `NA` values in `id_var` #69

matthiasgomolka · 2021-04-01T06:47:31Z

If id_var contains NA values, these are treated as a single ID. ~~This is not correct. Columns where is.na(id_var) need to be removed before calculating the number of distinct entities and dominance.~~

The problem is especially obvious when a large portion of id_var is NA.

library(sdcLog)
#> Warning: package 'sdcLog' was built under R version 4.0.4
library(data.table)
#> Warning: package 'data.table' was built under R version 4.0.4
set.seed(1)

DT <- data.table(
  id = c(LETTERS[1:6], rep(NA, 20)),
  val = rnorm(26)
)

Right now, NA values are treated as a single valid ID. Thus, this "ID" is dominant:

# current version
sdc_descriptives(DT, "id", "val")          
#> Warning: DISCLOSURE PROBLEM: Dominant entities.
#> [ OPTIONS:  sdc.n_ids: 5 | sdc.n_ids_dominance: 2 | sdc.share_dominance: 0.85 ]
#> [ SETTINGS: id_var: id | val_var: val | zero_as_NA: FALSE ]
#> Dominant entities:
#>    value_share
#> 1:   0.8545532

One possible solution would be to remove all rows with a missing id_var before calculating the number of distinct IDs and dominance:

sdc_descriptives(na.omit(DT), "id", "val") 
#> [ OPTIONS:  sdc.n_ids: 5 | sdc.n_ids_dominance: 2 | sdc.share_dominance: 0.85 ]
#> [ SETTINGS: id_var: id | val_var: val | zero_as_NA: FALSE ]
#> Output complies to RDC rules.

However, if you then test the exact same data against several ID variables, each check would consider a different number of rows, which is hard to explain and justify.
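The row-count mismatch can be made concrete with a small sketch. The table `dt` and its columns `id_a` and `id_b` are made up for illustration and are not part of sdcLog:

```r
library(data.table)

# Hypothetical data: two ID columns with different missingness patterns
dt <- data.table(
  id_a = c("A", "B", NA, NA, "C", "D"),
  id_b = c("x", NA, "y", "z", "x", "y"),
  val  = c(10, 20, 30, 40, 50, 60)
)

# Rows that would remain if rows with NA IDs were dropped before each check
rows_a <- dt[!is.na(id_a), .N]  # 4 rows enter the check on id_a
rows_b <- dt[!is.na(id_b), .N]  # 5 rows enter the check on id_b
```

The same six-row table would be checked on four rows for one ID and on five rows for the other.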


matthiasgomolka · 2021-04-01T08:35:01Z

Further Considerations

Add a new argument like missing_id_var, which can take the values "random" or "structural" (and maybe "drop"?). This would make it possible to cover the following cases in a reasonable way.

`missing_id_var = "random"`

Tackled problem

This would solve the case in which NA values are missing at random. For example, consider a variable LEI. Not every entity has an LEI, but the entity exists nevertheless. The variable should be available, but it is missing for more or less random reasons.

Solution

Fill the NA values in id_var with a number of distinct placeholder values. This number is derived from the ratio of the count of distinct non-NA IDs to the number of rows with a non-NA ID.

This way, we would simulate the existence of the missing IDs, assuming a similar distribution as for the non-missing IDs. Thus, we would mitigate problems with a low number of distinct IDs as well as with dominance.
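A minimal sketch of this idea, assuming a simple round-robin assignment of placeholder IDs. `fill_na_ids()` is a hypothetical helper and not a function of sdcLog. Round-robin only reproduces the average number of rows per ID, not the full distribution:

```r
library(data.table)

# Hypothetical helper (not part of sdcLog): replace NA IDs by placeholder IDs
fill_na_ids <- function(ids) {
  ids <- as.character(ids)
  is_na <- is.na(ids)
  n_na <- sum(is_na)
  n_obs <- sum(!is_na)
  if (n_na == 0L || n_obs == 0L) return(ids)

  # Ratio of distinct non-NA IDs to rows with a non-NA ID
  ids_per_row <- uniqueN(ids[!is_na]) / n_obs

  # Number of placeholder IDs to create for the NA rows (at least one)
  n_new <- max(1L, as.integer(round(ids_per_row * n_na)))

  # Assign the placeholders round-robin (simplifying assumption)
  ids[is_na] <- paste0("<filled_", rep_len(seq_len(n_new), n_na), ">")
  ids
}

ids <- c("A", "A", "B", "B", "C", "C", rep(NA, 4))
filled <- fill_na_ids(ids)
```

In this example, three distinct IDs on six rows give a ratio of 0.5, so the four NA rows are spread over two placeholder IDs.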

`missing_id_var = "structural"`

Tackled problem

In this case, NA IDs stem from situations where there really is no ID, and this is correct. For example, there may be several IDs in a dataset, one of which identifies a very special role that only occurs in a small number of cases. In most cases, this role does not exist and therefore the NA is correct.

Solution

Treat NA as a single (possibly large) entity, as is done right now, but remove the "ID" NA afterwards. Then there would be a single additional ID, and in most cases this should also solve the problem of dominance: since the large, possibly dominant NA entity is not considered worthy of protection, it could safely be removed from the results. Most likely, the remaining non-NA entities are not dominant any more.
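This approach could be sketched as follows. The data and the names `shares` and `shares_protected` are made up for illustration. This is not the sdcLog implementation:

```r
library(data.table)

# Hypothetical data: three real IDs and a large NA "entity"
dt <- data.table(
  id  = c("A", "B", "C", rep(NA, 7)),
  val = c(5, 3, 2, rep(10, 7))
)

# Value share per ID, with NA counted as one single entity
shares <- dt[, .(value = sum(val)), by = id][, value_share := value / sum(value)]
setorder(shares, -value_share)

# Remove the NA entity afterwards, since it is not worthy of protection
shares_protected <- shares[!is.na(id)]

# Combined share of the two largest remaining entities
top2_share <- sum(head(shares_protected$value_share, 2))
```

Here the NA entity holds 87.5% of the total value, while the two largest real IDs together hold only 10%, so none of the real entities would be flagged as dominant.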


matthiasgomolka · 2021-08-06T11:52:07Z

Update after merging #76

Solved problems

- NAs are no longer treated like valid IDs. Instead, their value_share is subtracted from the cumulative value share, and then rows where is.na(id) are removed from the dominance calculation. Therefore, a large fraction of NA IDs no longer causes dominance problems.
- If all IDs are NA, no problem is reported any more (because there is no entity to protect).

Open problems

If there are only 4 distinct non-NA IDs and many (think millions) NA values in the ID column, this would still be reported as "Not enough distinct entities." Conceptually, this is not a problem, because it is impossible to infer any information about the 4 distinct IDs if there are millions of other rows. But there are different possible approaches to handling this, and they can be applied before using the sdc_*() functions. So this problem is less important than the one which is now solved.
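The situation can be reproduced with a small sketch. The data is made up, and the threshold of 5 is taken from the `sdc.n_ids: 5` option shown in the output above:

```r
library(data.table)

# Hypothetical data: 4 real IDs and one million rows without an ID
dt <- data.table(
  id  = c("A", "B", "C", "D", rep(NA_character_, 1e6)),
  val = 1
)

# Distinct entities when NA is not counted as an ID
n_distinct <- dt[!is.na(id), uniqueN(id)]

# Share of rows that carry no ID at all
na_share <- dt[, mean(is.na(id))]
```

With 4 distinct IDs, the count stays below the threshold of 5 even though almost all rows belong to none of them.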


matthiasgomolka · 2021-12-06T16:17:08Z

With #95, a simple functionality was added which should cover a range of use cases. Thus, closing this issue for now. If there is a requirement for a more sophisticated approach, a new issue should be opened.

matthiasgomolka added the "bug" label Apr 1, 2021

matthiasgomolka self-assigned this Apr 1, 2021

matthiasgomolka added the "enhancement" label Apr 1, 2021

matthiasgomolka changed the title ~~NA is taken into account in id_var~~ Handling of NA values in id_var Apr 1, 2021

matthiasgomolka added the "question" and "help wanted" labels and removed the "bug" and "question" labels Apr 12, 2021

matthiasgomolka mentioned this issue Aug 6, 2021

Handle NA in ID columns correctly #76

Merged

matthiasgomolka added a commit that referenced this issue Aug 6, 2021

Merge pull request #76 from matthiasgomolka/#69

ec2c3eb

Handle `NA` in ID columns correctly

matthiasgomolka added a commit that referenced this issue Dec 6, 2021

fill_id_var implemented for sdc_model() (#69)

af28fff

matthiasgomolka added a commit that referenced this issue Dec 6, 2021

fill_id_var implemented for sdc_descriptives() (#69)

ef5322a

matthiasgomolka added a commit that referenced this issue Dec 6, 2021

fill_id_var implemented for sdc_min_max() (#69)

f533262

also improved cli output for sdc_min_max() (#91)

matthiasgomolka mentioned this issue Dec 6, 2021

ID handling #95

Merged

matthiasgomolka closed this as completed Dec 6, 2021
