Fails to import variable with duplicate labels #42

Closed
hans-ekbrand opened this issue Mar 6, 2020 · 13 comments

Comments

@hans-ekbrand

hans-ekbrand commented Mar 6, 2020

Is it too much to ask that memisc would be able to cope with duplicated labels?

library(memisc)
myurl <- "http://hansekbrand.se/temp/BUG.SAV"
z <- tempfile()
download.file(myurl,z,mode="wb")
my.meta.data <- spss.system.file(z)
# Warning message:
# 1 variables have duplicated labels:
#   HML23 
my.df <- as.data.frame(my.meta.data)
# Error in as.factor(x) : Duplicate labels
file.remove(z)

I work a lot with data from the Demographic and Health Surveys (DHS), and some of those files are so big that importing them with read.spss() requires more RAM than most computers have. Many, perhaps most, of the hundreds of DHS files contain duplicated labels. Getting memisc to work with such files would really help my work. I currently have a computer with 68 GB RAM, so I manage, but I want others to be able to use my code.

Kind regards,

Hans Ekbrand, University of Gothenburg, Sweden.

@melff
Owner

melff commented Mar 6, 2020

Certainly it would not be too much to ask. This should not be too much work. But I have to make time because I am in the middle of the semester. Any ideas about a scheme for de-duplicating labels? What would you prefer?

@melff
Owner

melff commented Mar 6, 2020

By the way, it might be preferable to import the data in two steps:
my.ds <- subset(my.meta.data,select=c())   # name the variables to keep inside c()
my.ds <- within(my.ds,{
  # data preparations, like recoding etc.
})
my.df <- as.data.frame(my.ds)

@hans-ekbrand
Author

> Certainly it would not be too much to ask. This should not be too much work. But I have to make time because I am in the middle of the semester. Any ideas about a scheme for de-duplicating labels? What would you prefer?

I am fine with conflating items with different values but identical labels to one label, because I consider the label the "real state" of the data.

The user is warned, and you already provide duplicated_labels() to display the situation.
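
The conflation described above can be sketched in base R. This is only an illustration of the idea, not how memisc implements it: the label table uses the HML23 codes shown later in this thread, while the data vector `x` and the helper names `first_code`, `conflated` and `f` are made up for the example.

```r
# Label table: labels as names, numeric codes as values.
labels <- c("Government health facility" = 10,
            "Government health facility" = 11,
            "Private health facility"    = 20,
            "Private health facility"    = 21,
            "Other sources"              = 30)

# Hypothetical raw codes as they might appear in a data column.
x <- c(10, 11, 20, 21, 30, 11)

# For every entry of the label table, find the first code with the same label.
first_code <- labels[match(names(labels), names(labels))]

# Map each observed code to that first code, so 11 becomes 10 and 21 becomes 20.
conflated <- unname(first_code[match(x, labels)])

# Build a factor with one level per distinct label.
f <- factor(conflated,
            levels = unique(first_code),
            labels = unique(names(labels)))
table(f)
```

Note that this changes the original codes, which is the trade-off raised in the next comment.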

@melff
Owner

melff commented Mar 6, 2020

That would mean that the original codes would be changed. Maybe I should make this one option for de-duplicating. Another option would modify the labels.

@hans-ekbrand
Author

Yeah, simply pasting the numeric code and the label together with paste() would preserve the distinction in the original data, and it would be fairly easy for the user to conflate them programmatically later on, particularly if you use an uncommon separator character or string with paste().

paste(numeric, label, sep = "§") or something.
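
A minimal base R sketch of that suggestion, assuming the codes and labels are available as two parallel vectors (the district codes are taken from the duplicated_labels() output further down; the variable names are hypothetical, and this is not memisc's own mechanism):

```r
codes  <- c(28, 168, 131, 173)
labels <- c("Hamirpur", "Hamirpur", "Pratapgarh", "Pratapgarh")

# An uncommon separator makes it easy to split code and label again later.
sep <- "§"

# Prefix every label with its numeric code, which makes the labels unique.
unique_labels <- paste(codes, labels, sep = sep)
unique_labels

# Later on, conflate again by keeping only the part after the separator.
parts    <- strsplit(unique_labels, sep, fixed = TRUE)
restored <- vapply(parts, function(p) p[length(p)], character(1))
restored
```

Splitting with `fixed = TRUE` avoids treating the separator as a regular expression, so any unusual string can serve as the separator.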

@hans-ekbrand
Author

> By the way, it might be preferable to import the data in two steps:
> my.ds <- subset(my.meta.data,select=c())
> my.ds <- within(my.ds,{
>   # data preparations, like recoding etc.
> })
> my.df <- as.data.frame(my.ds)

Is it possible to use recode() in the second-to-last step to fix the duplicated labels problem?

@melff
Owner

melff commented Mar 6, 2020

Yep - that is the point of the whole infrastructure of "data.set" and "item" objects in memisc.

@hans-ekbrand
Author

Oh, great!

Now this becomes a support question, then. Here is the output of duplicated_labels():

duplicated_labels(my.ds2)
=====================================================
 SHDISTRI: 'District'
----------------------------------------------------------------------------------------------------
  Hamirpur:   28, 168 
  Pratapgarh: 131, 173
  Bilaspur:   30, 406 
  Aurangabad: 235, 515
  Raigarh:    403, 520
  Bijapur:    417, 557

I tried to fix the first duplicated item, Hamirpur, with the following recode

my.ds3 <- within(my.ds2,{
 recode(SHDISTRI, 'foo' <- 168)
 })

and it didn't complain, but it still reports the same problem

duplicated_labels(my.ds3)

===================================================
 SHDISTRI: 'District'
------------------------------------------------------------------------------------------------
  Hamirpur:   28, 168 
  Pratapgarh: 131, 173
  Bilaspur:   30, 406 
  Aurangabad: 235, 515
  Raigarh:    403, 520
  Bijapur:    417, 557
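
For readers without memisc at hand, a report like the one above can be approximated in base R by grouping codes by label. This is a rough sketch under the assumption that codes and labels are held in two parallel vectors; the first six entries come from the output above, and the last entry ("Example district", code 1) is invented to show a label that is not duplicated.

```r
codes  <- c(28, 168, 131, 173, 30, 406, 1)
labels <- c("Hamirpur", "Hamirpur", "Pratapgarh", "Pratapgarh",
            "Bilaspur", "Bilaspur", "Example district")

# Collect all codes that share a label.
by_label <- split(codes, labels)

# Keep only the labels attached to more than one code.
dups <- by_label[lengths(by_label) > 1]
dups
```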

@melff
Owner

melff commented Mar 6, 2020

I see. Well, recoding only changes the codes but not the labels. Still some work for me to do then ...

@melff
Owner

melff commented Mar 14, 2020

I just found and updated some code I wrote earlier to deduplicate labels or codes. It is in the attached zip archive along with some example code and example output (dedup-labels.zip). I will include it in the next memisc release. But for now, using the code in the zip file may be a quick fix for your problem.

@melff
Owner

melff commented Mar 14, 2020

There is now a new release of memisc 0.99.23 that includes a function deduplicate_labels(), which hopefully does what is written on the tin.

melff closed this as completed Aug 14, 2020

@hans-ekbrand
Author

hans-ekbrand commented Oct 23, 2020

Thanks a lot for fixing this!

Now I just need a little help to understand where to put the call to deduplicate_labels().

Everything seems to work for codebook(), but as.data.frame() fails with a new error now.

myurl <- "http://hansekbrand.se/temp/BUG.SAV"
z <- tempfile()
download.file(myurl,z,mode="wb")
my.meta.data <- spss.system.file(z)
#File character set is 'UTF-8'.
#Converting character set to the local 'utf-8'.
#Warning message:
#1 variables have duplicated labels:
#  HML23 
foo <- deduplicate_labels(my.meta.data)

Codebook looks fine now

codebook(foo)
====================================================================================================

   HVIDX 'Line number'

----------------------------------------------------------------------------------------------------

   Storage mode: double
   Measurement: nominal

   Values                N Valid Total
                                      
        (unlab.val.) 36672 100.0 100.0

====================================================================================================

   HML23 'Place where net was obtained'

----------------------------------------------------------------------------------------------------

   Storage mode: double
   Measurement: nominal
   Missing values: 99

   Values and labels                     N Valid Total
                                                      
   10   'Government health facility'   666  40.8   1.8
   20   'Private health facility'       20   4.8   0.1
   30   'Other sources'                  0   0.0   0.0
   31   'Pharmacy'                      32   4.4   0.1
   32   'Shop/market'                  388  36.1   1.1
   33   'CHW'                           17   2.5   0.0
   34   'Religious institution'          3   0.1   0.0
   35   'School'                       102   4.3   0.3
   96   'Other'                         79   6.3   0.2
   98   'Don't know'                     9   0.7   0.0
      M (unlab.mss.)                    25         0.1
   NA M                              35331        96.4

Running codebook() on unwashed data

codebook(my.meta.data)

shows the duplicated labels:

   10   'Government health facility'   666  40.8   1.8
   11   'Government health facility'     0   0.0   0.0
   20   'Private health facility'       20   4.8   0.1
   21   'Private health facility'        0   0.0   0.0

But as.data.frame() still gives an error, albeit another one now:

my.df <- as.data.frame(foo)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
  arguments imply differing number of rows: 0, 36672

Thanks in advance for any guidance on how to resolve this.

@melff
Owner

melff commented Nov 1, 2020

as.data.frame() is not supposed to be applied to "importer" objects. In the next release, any such attempt will be flagged as an error.

The following code works, though:

library(memisc)
myurl <- "http://hansekbrand.se/temp/BUG.SAV"
z <- tempfile()
download.file(myurl,z,mode="wb")
my.meta.data <- spss.system.file(z,ignore.scale.info=TRUE)

ignore.scale.info=TRUE is needed because your data file marks all variables as having a nominal level of measurement.

codebook(my.meta.data)
====================================================================================================

   HVIDX 'Line number'

----------------------------------------------------------------------------------------------------

   Storage mode: double
   Measurement: interval

        Min:  1.000
        Max: 27.000
       Mean:  4.257
   Std.Dev.:  3.038

====================================================================================================

   HML23 'Place where net was obtained'

----------------------------------------------------------------------------------------------------

   Storage mode: double
   Measurement: nominal
   Missing values: 99

   Values and labels                     N Valid Total
                                                      
   10   'Government health facility'   666  40.8   1.8
   11   'Government health facility'     0   0.0   0.0
   20   'Private health facility'       20   4.8   0.1
   21   'Private health facility'        0   0.0   0.0
   30   'Other sources'                  0   0.0   0.0
   31   'Pharmacy'                      32   4.4   0.1
   32   'Shop/market'                  388  36.1   1.1
   33   'CHW'                           17   2.5   0.0
   34   'Religious institution'          3   0.1   0.0
   35   'School'                       102   4.3   0.3
   96   'Other'                         79   6.3   0.2
   98   'Don't know'                     9   0.7   0.0
      M (unlab.mss.)                    25         0.1
   NA M                              35331        96.4

foo <- deduplicate_labels(my.meta.data)
codebook(foo)
====================================================================================================

   HVIDX 'Line number'

----------------------------------------------------------------------------------------------------

   Storage mode: double
   Measurement: interval

        Min:  1.000
        Max: 27.000
       Mean:  4.257
   Std.Dev.:  3.038

====================================================================================================

   HML23 'Place where net was obtained'

----------------------------------------------------------------------------------------------------

   Storage mode: double
   Measurement: nominal
   Missing values: 99

   Values and labels                     N Valid Total
                                                      
   10   'Government health facility'   666  40.8   1.8
   20   'Private health facility'       20   4.8   0.1
   30   'Other sources'                  0   0.0   0.0
   31   'Pharmacy'                      32   4.4   0.1
   32   'Shop/market'                  388  36.1   1.1
   33   'CHW'                           17   2.5   0.0
   34   'Religious institution'          3   0.1   0.0
   35   'School'                       102   4.3   0.3
   96   'Other'                         79   6.3   0.2
   98   'Don't know'                     9   0.7   0.0
      M (unlab.mss.)                    25         0.1
   NA M                              35331        96.4

foobar <- as.data.set(foo)
foobar1 <- as.data.frame(foobar)
codebook(foobar1)
====================================================================================================

   HVIDX 'Line number'

----------------------------------------------------------------------------------------------------

   Storage mode: double

        Min:  1.000
        Max: 27.000
       Mean:  4.257
   Std.Dev.:  3.038
   Skewness:  1.457
   Kurtosis:  2.979

====================================================================================================

   HML23 'Place where net was obtained'

----------------------------------------------------------------------------------------------------

   Storage mode: integer
   Factor with 10 levels

   Levels and labels                   N Valid Total
                                                    
    1 'Government health facility'   666  50.6   1.8
    2 'Private health facility'       20   1.5   0.1
    3 'Other sources'                  0   0.0   0.0
    4 'Pharmacy'                      32   2.4   0.1
    5 'Shop/market'                  388  29.5   1.1
    6 'CHW'                           17   1.3   0.0
    7 'Religious institution'          3   0.2   0.0
    8 'School'                       102   7.8   0.3
    9 'Other'                         79   6.0   0.2
   10 'Don't know'                     9   0.7   0.0
   NA                              35356        96.4
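
The steps shown in this comment can be collected into one small helper. This is a sketch that only strings together the calls used above (spss.system.file(), deduplicate_labels(), as.data.set(), as.data.frame()); the function name `import_spss_dedup` and its argument names are hypothetical, and it assumes memisc 0.99.23 or later is installed.

```r
# Sketch only: needs the memisc package (0.99.23 or later for deduplicate_labels()).
import_spss_dedup <- function(path, ignore.scale.info = TRUE) {
  # Step 1: create an "importer" object; this reads the metadata, not the data.
  importer <- memisc::spss.system.file(path,
                                       ignore.scale.info = ignore.scale.info)
  # Step 2: resolve duplicated labels on the importer.
  deduped <- memisc::deduplicate_labels(importer)
  # Step 3: load the data into a "data.set" first ...
  ds <- memisc::as.data.set(deduped)
  # ... and only then convert to a plain data frame.
  as.data.frame(ds)
}

# Usage, assuming `z` is the path of the downloaded .SAV file:
# my.df <- import_spss_dedup(z)
```

The key point is the intermediate as.data.set() call: as.data.frame() is applied to the "data.set" object, never directly to the importer.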
