Fails to import variable with duplicate labels #42

hans-ekbrand · 2020-03-06T17:00:10Z

Is it too much to ask that memisc would be able to cope with duplicated labels?

library(memisc)
myurl <- "http://hansekbrand.se/temp/BUG.SAV"
z <- tempfile()
download.file(myurl,z,mode="wb")
my.meta.data <- spss.system.file(z)
# Warning message:
# 1 variables have duplicated labels:
#   HML23
my.df <- as.data.frame(my.meta.data)
# Error in as.factor(x) : Duplicate labels
file.remove(z)

I work a lot with data from the Demographic and Health Surveys (DHS), and some of those files are so big that importing them with read.spss() requires more RAM than most computers have. Many, perhaps most, of the hundreds of DHS files contain duplicated labels. Getting memisc to work with such files would really help my work. I currently have a computer with 68 GB of RAM, so I manage, but I want others to be able to use my code.

Kind regards,

Hans Ekbrand, University of Gothenburg, Sweden.


melff · 2020-03-06T17:17:40Z

Certainly it would not be too much to ask. This should not be too much work. But I have to make time because I am in the middle of the semester. Any ideas about a scheme for de-duplicating labels? What would you prefer?

melff · 2020-03-06T17:20:42Z

By the way, it might be preferable to import the data in two steps:
my.ds <- subset(my.meta.data,select=c())
my.ds <- within(my.ds,{
<data preparations, like recoding etc.>
})
my.df <- as.data.frame(my.ds)

hans-ekbrand · 2020-03-06T17:29:39Z

> Certainly it would not be too much to ask. This should not be too much work. But I have to make time because I am in the middle of the semester. Any ideas about a scheme for de-duplicating labels? What would you prefer?

I am fine with conflating items with different values but identical labels to one label, because I consider the label the "real state" of the data.

The user is warned, and you already provide duplicated_labels() to display the situation.
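The conflation described above can be sketched in base R. This is only an illustration under stated assumptions, not how memisc implements it: `value_labels`, `x`, `canonical`, `x_conflated` and `labels_conflated` are hypothetical names, and the codes and labels are a toy subset modelled on HML23. Every code that shares a label is mapped to the first code carrying that label.

```r
# Hypothetical value labels: names are labels, values are codes.
value_labels <- c("Government health facility" = 10,
                  "Government health facility" = 11,
                  "Private health facility"    = 20,
                  "Private health facility"    = 21,
                  "Pharmacy"                   = 31)

# Toy data vector holding raw codes.
x <- c(10, 11, 20, 21, 31, 11)

# For every entry, look up the first code that carries the same label.
canonical <- value_labels[match(names(value_labels), names(value_labels))]

# Map each labelled code to its canonical code; codes without a label
# are left untouched.
idx <- match(x, value_labels)
x_conflated <- x
x_conflated[!is.na(idx)] <- canonical[idx[!is.na(idx)]]

# Keep a single entry per label.
labels_conflated <- value_labels[!duplicated(names(value_labels))]
```

The cost of this scheme is that the original codes 11 and 21 are no longer recoverable from the conflated data.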

melff · 2020-03-06T17:34:28Z

That would mean that the original codes would be changed. Maybe I should make this one option for de-duplicating. Another option would modify the labels.

hans-ekbrand · 2020-03-06T17:44:32Z

Yeah, just pasting the numeric code and the label together would preserve the distinction in the original data, and it would be fairly easy for the user to conflate them programmatically later on, particularly if you use an uncommon separator character/string with paste().

paste(numeric, label, sep = "§") or something.
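A minimal base R sketch of this idea follows. It is not memisc code: `value_labels`, `sep`, `dup`, `new_names` and `original` are hypothetical names, and "Unique district" is a made-up label added so that one entry is not duplicated. The sketch prefixes only the duplicated labels with their code, which is one possible choice, and then shows how the prefix could be stripped again later. It assumes a UTF-8 session so that "§" works as a separator.

```r
# Hypothetical value labels: names are labels, values are codes.
value_labels <- c("Hamirpur" = 28, "Hamirpur" = 168,
                  "Bilaspur" = 30, "Bilaspur" = 406,
                  "Unique district" = 1)

sep <- "§"  # uncommon separator, as suggested above

# Flag every entry whose label occurs more than once.
dup <- names(value_labels) %in%
  names(value_labels)[duplicated(names(value_labels))]

# Prefix only the duplicated labels with their numeric code.
new_names <- names(value_labels)
new_names[dup] <- paste(value_labels[dup], names(value_labels)[dup], sep = sep)

# Later on, drop the "<code><sep>" prefix to get the original labels back.
original <- sub(paste0("^[0-9]+", sep), "", new_names)
```

Because the prefix is purely textual, the codes themselves stay unchanged and the user decides later whether to merge the categories.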

hans-ekbrand · 2020-03-06T17:46:18Z

> By the way, it might be preferable to import the data in two steps:
> my.ds <- subset(my.meta.data,select=c())
> my.ds <- within(my.ds,{
> <data preparations, like recoding etc.>
> })
> my.df <- as.data.frame(my.ds)

Is it possible to use recode() in the second last step to fix the duplicated labels problem?

melff · 2020-03-06T17:47:57Z

Yep, that is the point of the whole infrastructure of "data.set" and "item" objects in memisc.

hans-ekbrand · 2020-03-06T17:58:33Z

Oh, great!

Now this becomes a support question, then. Here is the output of duplicated_labels():

duplicated_labels(my.ds2)
=====================================================
 SHDISTRI: 'District'
----------------------------------------------------------------------------------------------------
  Hamirpur:   28, 168 
  Pratapgarh: 131, 173
  Bilaspur:   30, 406 
  Aurangabad: 235, 515
  Raigarh:    403, 520
  Bijapur:    417, 557

I tried to fix the first duplicated item, Hamirpur, with the following recode

my.ds3 <- within(my.ds2,{
 recode(SHDISTRI, 'foo' <- 168)
 })

and it didn't complain, but it still reports the same problem

duplicated_labels(my.ds3)

===================================================
 SHDISTRI: 'District'
------------------------------------------------------------------------------------------------
  Hamirpur:   28, 168 
  Pratapgarh: 131, 173
  Bilaspur:   30, 406 
  Aurangabad: 235, 515
  Raigarh:    403, 520
  Bijapur:    417, 557

melff · 2020-03-06T18:01:38Z

I see. Well, recoding only changes the codes but not the labels. Still some work for me to do then ...

melff · 2020-03-14T18:23:42Z

I just found and updated some code I wrote earlier to deduplicate labels or codes. It is in the attached zip archive along with some example code and example output (dedup-labels.zip). I will include it in the next memisc release. But for now, using the code in the zip file may be a quick fix for your problem.

melff · 2020-03-14T19:50:56Z

There is now a new release of memisc 0.99.23 that includes a function deduplicate_labels(), which hopefully does what is written on the tin.

hans-ekbrand · 2020-10-23T13:08:16Z

Thanks a lot for fixing this!

Now I just need a little help to understand where to put the call to deduplicate_labels().

Everything seems to work for codebook(), but as.data.frame() fails with a new error now.

myurl <- "http://hansekbrand.se/temp/BUG.SAV"
z <- tempfile()
download.file(myurl,z,mode="wb")
my.meta.data <- spss.system.file(z)
#File character set is 'UTF-8'.
#Converting character set to the local 'utf-8'.
#Warning message:
#1 variables have duplicated labels:
#  HML23 
foo <- deduplicate_labels(my.meta.data)

Codebook looks fine now

codebook(foo)
====================================================================================================

   HVIDX 'Line number'

----------------------------------------------------------------------------------------------------

   Storage mode: double
   Measurement: nominal

   Values                N Valid Total
                                      
        (unlab.val.) 36672 100.0 100.0

====================================================================================================

   HML23 'Place where net was obtained'

----------------------------------------------------------------------------------------------------

   Storage mode: double
   Measurement: nominal
   Missing values: 99

   Values and labels                     N Valid Total
                                                      
   10   'Government health facility'   666  40.8   1.8
   20   'Private health facility'       20   4.8   0.1
   30   'Other sources'                  0   0.0   0.0
   31   'Pharmacy'                      32   4.4   0.1
   32   'Shop/market'                  388  36.1   1.1
   33   'CHW'                           17   2.5   0.0
   34   'Religious institution'          3   0.1   0.0
   35   'School'                       102   4.3   0.3
   96   'Other'                         79   6.3   0.2
   98   'Don't know'                     9   0.7   0.0
      M (unlab.mss.)                    25         0.1
   NA M                              35331        96.4

Running codebook() on unwashed data

codebook(my.meta.data)

shows the duplicated labels:

   10   'Government health facility'   666  40.8   1.8
   11   'Government health facility'     0   0.0   0.0
   20   'Private health facility'       20   4.8   0.1
   21   'Private health facility'        0   0.0   0.0

But as.data.frame() still gives an error, albeit another one now:

my.df <- as.data.frame(foo)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
  arguments imply differing number of rows: 0, 36672

Thanks in advance for any guidance on how to resolve this.

melff · 2020-11-01T21:17:14Z

as.data.frame() is not supposed to be applied to "importer" objects. In the next release, any such attempt will be flagged as an error.

The following code works, though:

library(memisc)
myurl <- "http://hansekbrand.se/temp/BUG.SAV"
z <- tempfile()
download.file(myurl,z,mode="wb")
my.meta.data <- spss.system.file(z,ignore.scale.info=TRUE)

ignore.scale.info=TRUE is needed because your data file marks all variables as having a nominal level of measurement.

codebook(my.meta.data)

====================================================================================================

   HVIDX 'Line number'

----------------------------------------------------------------------------------------------------

   Storage mode: double
   Measurement: interval

        Min:  1.000
        Max: 27.000
       Mean:  4.257
   Std.Dev.:  3.038

====================================================================================================

   HML23 'Place where net was obtained'

----------------------------------------------------------------------------------------------------

   Storage mode: double
   Measurement: nominal
   Missing values: 99

   Values and labels                     N Valid Total
                                                      
   10   'Government health facility'   666  40.8   1.8
   11   'Government health facility'     0   0.0   0.0
   20   'Private health facility'       20   4.8   0.1
   21   'Private health facility'        0   0.0   0.0
   30   'Other sources'                  0   0.0   0.0
   31   'Pharmacy'                      32   4.4   0.1
   32   'Shop/market'                  388  36.1   1.1
   33   'CHW'                           17   2.5   0.0
   34   'Religious institution'          3   0.1   0.0
   35   'School'                       102   4.3   0.3
   96   'Other'                         79   6.3   0.2
   98   'Don't know'                     9   0.7   0.0
      M (unlab.mss.)                    25         0.1
   NA M                              35331        96.4

foo <- deduplicate_labels(my.meta.data)
codebook(foo)

====================================================================================================

   HVIDX 'Line number'

----------------------------------------------------------------------------------------------------

   Storage mode: double
   Measurement: interval

        Min:  1.000
        Max: 27.000
       Mean:  4.257
   Std.Dev.:  3.038

====================================================================================================

   HML23 'Place where net was obtained'

----------------------------------------------------------------------------------------------------

   Storage mode: double
   Measurement: nominal
   Missing values: 99

   Values and labels                     N Valid Total
                                                      
   10   'Government health facility'   666  40.8   1.8
   20   'Private health facility'       20   4.8   0.1
   30   'Other sources'                  0   0.0   0.0
   31   'Pharmacy'                      32   4.4   0.1
   32   'Shop/market'                  388  36.1   1.1
   33   'CHW'                           17   2.5   0.0
   34   'Religious institution'          3   0.1   0.0
   35   'School'                       102   4.3   0.3
   96   'Other'                         79   6.3   0.2
   98   'Don't know'                     9   0.7   0.0
      M (unlab.mss.)                    25         0.1
   NA M                              35331        96.4

foobar <- as.data.set(foo)
foobar1 <- as.data.frame(foobar)
codebook(foobar1)

====================================================================================================

   HVIDX 'Line number'

----------------------------------------------------------------------------------------------------

   Storage mode: double

        Min:  1.000
        Max: 27.000
       Mean:  4.257
   Std.Dev.:  3.038
   Skewness:  1.457
   Kurtosis:  2.979

====================================================================================================

   HML23 'Place where net was obtained'

----------------------------------------------------------------------------------------------------

   Storage mode: integer
   Factor with 10 levels

   Levels and labels                   N Valid Total
                                                    
    1 'Government health facility'   666  50.6   1.8
    2 'Private health facility'       20   1.5   0.1
    3 'Other sources'                  0   0.0   0.0
    4 'Pharmacy'                      32   2.4   0.1
    5 'Shop/market'                  388  29.5   1.1
    6 'CHW'                           17   1.3   0.0
    7 'Religious institution'          3   0.2   0.0
    8 'School'                       102   7.8   0.3
    9 'Other'                         79   6.0   0.2
   10 'Don't know'                     9   0.7   0.0
   NA                              35356        96.4

melff added bug enhancement labels Mar 6, 2020

melff closed this as completed Aug 14, 2020

hans-ekbrand mentioned this issue Mar 18, 2021 in "deduplicate_labels() is painfully slow on large datasets" #53 (closed)
