Dataverse ingest error for variables with mix of labelled and unlabelled values #4676

stevenmce · 2018-05-16T10:21:38Z

When ingesting SPSS or Stata data which includes variables that contains both labelled and unlabelled values, the ingest process assigns values with no label as N/A. For example, a 10 point scale (1 = Not at all, 10 = Very much, no value labels assigned for responses 2 to 9) would result in an ingested version which contains only the values 1, 10 and NA. This results in the UNF data being different between the original and ingested versions.

We have attempted to provide a demonstration of the issue with simulated data at:
https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.5072/FK2/ZYGTWM

djbrooke · 2018-05-16T13:54:47Z

Thanks @stevenmce, we'll take a look at this.

landreev · 2018-05-16T16:44:18Z

To clarify, this is specifically about exporting ingested files as RData, correct? (it's not mentioned above, but looking at the linked sample dataset suggests that it is...)

Off the top of my head, there definitely is a known problem with calculating UNFs of categorical variables (or "factors" in the language of R), when converting datafiles between Stata/SPSS and R; it is caused by the fact that R handles its factors in a way that's fundamentally different from other formats. (see http://guides.dataverse.org/en/4.8.6/user/tabulardataingest/rdata.html#r-factors for some discussion of the controversy).

But this specific case - empty labels of distinct categories all becoming NA factor values - is unambiguously wrong; and we should come up with a way to handle it better.

We'll discuss this and follow up.

landreev · 2018-05-16T17:02:19Z

Just to reiterate, I am fairly positive that this is not an ingest problem, as the original description suggests.
I still need to review the ingested files in your sample dataset more carefully (thanks btw, for providing all the examples and information) to confirm. But it looks to me like the ingest process does the right thing. For ex., it ingests the variable column v6b in the stata file as a numeric categorical vector; where the numeric values 1 and 5 have assigned labels; but the other values happen to have none.
It's the conversion to R that seems to be the problem. It looks like we just never thought about this situation - what to do when there are missing labels like this. My first guess, we should just be using "2", "3", "4" etc. for the values that are missing the descriptive labels... But please let us investigate/think about this some more.

djbrooke · 2018-05-16T18:48:32Z

@stevenmce - we talked about this in Backlog Grooming today and we'll need some more information and investigation before we estimate. We'll have another opportunity to discuss next Wednesday.

pdurbin · 2018-06-18T14:35:54Z

@stevenmce emphasized the importance of this bug during last week's community meeting so I assigned it to myself and plan to bring it up during backlog grooming in the context of pull request #4708, which is currently in flight.

djbrooke · 2018-06-18T14:38:46Z

@pdurbin let's keep it separate if possible - that one is already big enough :)

@landreev, you had mentioned some further investigation above. Can you work with @pdurbin to make sure the efforts are coordinated and can we please try to get some specific questions to @stevenmce and the ADA team before we groom this?

landreev · 2018-06-19T20:15:11Z

@pdurbin @djbrooke
I believe there are multiple reasons why this should be treated separately from the PR 4708.
One is that the sample ingested Stata file they provide (https://demo.dataverse.org/file.xhtml?fileId=28929&version=RELEASED&version=.0) is an "old-school", pre-v.13 Stata. So it's not even handled by the ingest plugin @benjamin-martinez and @oscardssmith have been working on.
They (the people who submitted the report) do mention that they haven't had a particularly good luck with Stata 13 - meaning, most of their Stata 13 files just don't ingest, period - and that most likely will be much improved by the fixes to the Stata 13+ ingest plugin. But that was not specific to the value label bug this issue is mainly about.
But, most importantly - as I tried to explain in the comment above, this does not appear to be a Stata ingest; or ingest in general. Rather, it's a problem with converting the data to RData format, once ingested. We don't seem to be doing anything wrong with how we store these value labels in the database. What we are doing wrong, is we somehow fail to pass the list of these labels to R, when we export to R, in a way that does not confuse R into thinking that no label on the list = missing value.

landreev · 2018-06-19T20:21:29Z

The TL;DR version:
This is an issue outside of the Stata 13 ingest.
If we need to prioritize it, because it's important to a partner, let's do that.

landreev · 2018-06-19T20:50:58Z

(@pdurbin - thanks for bringing this up during standup! I did miss these last comments in this issue from yesterday...)

pdurbin · 2018-06-19T23:54:26Z

@benjamin-martinez thanks for investigating this bug with me today and thank you @landreev for the summary here and in person after tech hours. To re-iterate, the bug seems to on export to RData format.

By the way, Ben and I weren't sure what value Dataverse is adding by creating its own RData derivative of an original RData file but whatever. 😄

djbrooke · 2018-06-20T16:22:04Z

@landreev @pdurbin @benjamin-martinez

Do you feel that there's enough here to get an estimate on the fix for this and pull it into a sprint? Or do we need more info from ADA? I'd like to bring it into the sprint today if we have enough info.

pdurbin · 2018-06-20T20:29:56Z

@benjamin-martinez here is the code we were looking at:

src/main/webapp/file-download-button-fragment.xhtml

<p:commandLink rendered="#{!downloadPopupRequired}"
               process="@this"
               disabled="#{(fileMetadata.dataFile.ingestInProgress  or lockedFromDownload) ? 'disabled' : ''}" 
               actionListener="#{fileDownloadService.startFileDownload(guestbookResponse, fileMetadata, 'RData')}">
    #{bundle['file.downloadBtn.format.rdata']}
</p:commandLink>
<p:commandLink rendered="#{downloadPopupRequired}"
               process="@this"
               disabled="#{(fileMetadata.dataFile.ingestInProgress  or lockedFromDownload) ? 'disabled' : ''}" 
               action="#{guestbookResponseService.modifyDatafileAndFormat(guestbookResponse, fileMetadata, 'RData' )}"
               update="@widgetVar(downloadPopup)"
               oncomplete="PF('downloadPopup').show();handleResizeDialog('downloadPopup');">
    #{bundle['file.downloadBtn.format.rdata']}
</p:commandLink>

Those methods are in these "service beans":

src/main/java/edu/harvard/iq/dataverse/FileDownloadServiceBean.java
src/main/java/edu/harvard/iq/dataverse/GuestbookResponseServiceBean.java

Ultimately, the redirect sends the user to this API bean: src/main/java/edu/harvard/iq/dataverse/api/Access.java

http://guides.dataverse.org/en/4.9/api/dataaccess.html shows "formats" such as "original" and "RData".

Heres's a screenshot from Firefox:

From what @landreev was saying at the meeting this afternoon, R code is generated on the fly by Dataverse and sent to Rserve, which is mentioned here: http://guides.dataverse.org/en/4.9/installation/r-rapache-tworavens.html

landreev · 2018-06-22T17:52:22Z

@benjamin-martinez @oscardssmith
The class that talks to R in order to produce the R data frame (an .RData file) is RemoteDataFrameService.java. It uses a few other classes in edu/harvard/iq/dataverse/rserve, and some code in edu/harvard/iq/dataverse/dataaccess/DataConverter.java. There is also some fixed R code in the Dataverse (edu/harvard/iq/dataverse/rserve/scripts/dataverse_r_functions.R) that is passed to R in real time. The communication to R is done using Rserve protocol. The Rserve server can run on your local systems, or you can use remote Rserve running on one of our test boxes - for example, on dvn-build.hmdc.harvard.edu.

It's based largely on some much older code written by another developer during an earlier stage of the project.
Looking at it, it's pretty complicated stuff. I'm wondering if this issue would be better suited for somebody from the full-time developers team. So do not hesitate to ask for help and; or even jump on something else if you need to.

The map of all the categorical values for the datafile is passed to the RemoteDataFrameService as sro.getValueTable(). Then we turn it into a map on the R side called VALTABLE; and that's what R will use to change the vectors made from the values in the tab files into "factors" - R's version of categorical variables.

One solution for the issue at hand is to add some R code instructing it to assume that the VALTABLE may not contain the labels for some values found in the tabular vector. And not to assume N/A, but to just use the string value of the element as the label (i.e., if the value in the numeric vector is 1, but there is no label in VALTABLE corresponding to 1, juse use "1" for the label).

There is an alternative solution though, that would not require any extra R code, but can be done solely on the Dataverse side and in Java:
When we create the valueTable in sro (RJobRequest.java) - this happens in DataConverter.getValueTableForRequestedVariables(...)) - we can, in addition to going through all the category labels in DataVariable.getCategories(), subset the vector for this variable from the tab file, got through the values, and if necessary, create extra labels, for all the unique values that are not in the dataVariable.getCategories()... (Oscar, you should have a rough idea of how to subset individual vectors, from looking at the code that calculates UNFs and summary stats...)

The drawbacks of this method - having to read and subset the tabular file on the application side. (this of course would have to be done for categorical variables only, not for every vector - but still!).
And R will need to read the tabular file anyway - so that would be the reason to solve it by adding something on the R side... But that comes with having to figure out what goes on in that R code.

oscardssmith · 2018-06-27T13:54:28Z

@stevenmce given that in R categorical columns can not have data not assigned to a category, what path do you want us to take her?

For each unlabeled value, produce a new label for that value which is str(value)
Prevent these arguably incomplete files from being downloaded in R
Prevent these arguably incomplete files from being ingested

pdurbin · 2018-06-28T18:20:35Z

@izahn suggests using a labeller built into haven: https://cran.r-project.org/web/packages/haven/vignettes/semantics.html

izahn · 2018-06-28T19:03:10Z

See https://haven.tidyverse.org/articles/semantics.html for tools designed to preserve metadata from other statistics packages in R.

pdurbin · 2018-06-28T19:20:44Z

I'm not sure if this will be helpful or not but I found https://cran.r-project.org/package=DDIwR and the PDF says, "This package provides various functions to read DDI based metadata documentation, and write dedicated setup files for R, SPSS, Stata and SAS to read an associated .csv file containing the raw data, apply labels for variables and values and also deal with the treatment of missing values."

landreev · 2018-07-13T23:06:11Z

There is also the case specific to our production (and maybe to one or two particularly old external installations, such as Odum): we have a limited number of ingested tabular files for which we don't have the saved originals/don't have the original format preserved at all. The files in question were grandfathered from something called "VDC" - the dinosaur ancestor of the Dataverse application. Some of these old files happen to be very very important (for example, many data files from Gary King's datasets are part of this subset). So we don't want these R conversions to just start failing for these files.
(granted, for most or all of Gary's files, the RData files are already cached; so one way to address this would be to keep them in place, and only wipe the cached RDatas for the files that were specifically affected by the missing label issue. Also, we were recently discussing working with Murray project to restore the original for these old tabular files. So this may get resolved outside of this issue - but still, this is an issue).

landreev · 2018-07-16T15:14:03Z

A couple more things:
Generally, it makes perfect sense, to want to use a piece of software written in R, by R programmers, and specifically for the purpose of converting foreign formats to RData. Rather than writing and maintaining code in R ourselves, which is not our strongest point.
Whoever, there's one potential issue with this approach: the above is only true for the tabular ingest as it is implemented now. That is, when we convert proprietary formats (stata, spss, etc.) into our internal format ONCE, with no way to modify it later on. But we are already working on providing APIs for modifying existing tabular metadata. So this creates a possibility of a use case where the owner has improved the descriptive tabular metadata - with better variable names, labels, etc. - which is reflected in the DDI, json exports, and/or in TwoRavens etc. Yet if they try “download as RData”, users will see the original, “poor” metadata from the saved original.

I don’t know if we want/need to discuss this further. This PR definitely addresses the current issue; and this will not be a problem for the foreseeable future. But we need to assume that we will likely have to go back to learning how to properly convert our (Dataverse) tabular metadata to R data frames from scratch in the future.

In this PR however, we do need it explained somewhere in the documentation, that this is the way we convert to RData - from the saved original, using a third party R library.

landreev · 2018-07-16T19:47:52Z

Re: documentation: there's a section in the API guide, "Basic File Access" that mentions download as R:
RData | Tabular data as an R Data frame (generated; unless the “original” file was in R);
this is probably a good place to explain how these RData files are generated.

landreev · 2018-07-23T14:55:19Z

I requested some changes in the PR. (want to address/rename the new "isLabled" thing in DataVariable)

pdurbin · 2018-07-23T16:15:00Z

I discussed this issue with @oscardssmith and @landreev this morning and have a pretty good sense of the work that's left, messing with the new unused boolean. There's a file floating around called "BeerTastingTestData.dta" that can be used for testing. One needs to install the haven R package or use an Rserve with it (dvn-build soon, probably).

pdurbin · 2018-07-23T20:14:00Z

I made the requested changes in 03c8051 and cleared them with @oscardssmith and then @landreev . I'm attaching the file I used to ensure that the new "factor" column is being populated with a "true" when the RData file contains factors: test_factor.RData.txt

Back to code review.

landreev · 2018-07-24T15:14:51Z

@kcondon For QA, please use Rserve on dvn-build.hmdc.harvard.edu.

…unctions.R (#4676)

djbrooke added the Status: Backlog label May 16, 2018

benjamin-martinez assigned pdurbin Jun 15, 2018

benjamin-martinez self-assigned this Jun 18, 2018

djbrooke added Status: This/Next Sprint and removed Status: Backlog labels Jun 20, 2018

djbrooke unassigned pdurbin and benjamin-martinez Jun 20, 2018

benjamin-martinez added Status: Development and removed Status: This/Next Sprint labels Jun 20, 2018

benjamin-martinez self-assigned this Jun 20, 2018

djbrooke assigned oscardssmith Jun 22, 2018

pdurbin assigned stevenmce Jun 27, 2018

oscardssmith removed the Status: Development label Jul 5, 2018

djbrooke assigned oscardssmith Jul 16, 2018

oscardssmith removed their assignment Jul 18, 2018

landreev added Status: Development and removed Status: Code Review labels Jul 23, 2018

landreev assigned oscardssmith and unassigned landreev Jul 23, 2018

pdurbin assigned pdurbin and unassigned oscardssmith Jul 23, 2018

djbrooke mentioned this issue Jul 23, 2018

Provide "bare minimum" R installation instructions; add "haven" package. #4878

Closed

pdurbin added a commit that referenced this issue Jul 23, 2018

clean up sql update (now a boolean called "factor") #4676

03c8051

pdurbin added Status: Code Review and removed Status: Development labels Jul 23, 2018

pdurbin added a commit that referenced this issue Jul 23, 2018

Merge branch 'develop' into 4676-mixed-labels-in-R #4676

39a5a63

pdurbin removed their assignment Jul 23, 2018

landreev added Status: QA and removed Status: Code Review labels Jul 24, 2018

landreev assigned kcondon Jul 24, 2018

djbrooke assigned landreev Jul 31, 2018

landreev added a commit that referenced this issue Jul 31, 2018

Fixed the newlines (removed the DOS-style "\r"s) in the dataverse_r_f…

a34034f

…unctions.R (#4676)

landreev added a commit that referenced this issue Aug 1, 2018

Fixed DOS newlines in the preprocess.R file too. (#4676)

42965b8

kcondon closed this as completed Aug 1, 2018

kcondon removed the Status: QA label Aug 1, 2018

djbrooke added this to the 4.9.2 - Stata Upgrades, etc. milestone Aug 1, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dataverse ingest error for variables with mix of labelled and unlabelled values #4676

Dataverse ingest error for variables with mix of labelled and unlabelled values #4676

stevenmce commented May 16, 2018

djbrooke commented May 16, 2018

landreev commented May 16, 2018 •

edited

landreev commented May 16, 2018

djbrooke commented May 16, 2018

pdurbin commented Jun 18, 2018

djbrooke commented Jun 18, 2018

landreev commented Jun 19, 2018

landreev commented Jun 19, 2018

landreev commented Jun 19, 2018

pdurbin commented Jun 19, 2018

djbrooke commented Jun 20, 2018

pdurbin commented Jun 20, 2018 •

edited

landreev commented Jun 22, 2018

oscardssmith commented Jun 27, 2018

pdurbin commented Jun 28, 2018 •

edited

izahn commented Jun 28, 2018

pdurbin commented Jun 28, 2018

landreev commented Jul 13, 2018 •

edited

landreev commented Jul 16, 2018

landreev commented Jul 16, 2018

landreev commented Jul 23, 2018

pdurbin commented Jul 23, 2018

pdurbin commented Jul 23, 2018

landreev commented Jul 24, 2018

Dataverse ingest error for variables with mix of labelled and unlabelled values #4676

Dataverse ingest error for variables with mix of labelled and unlabelled values #4676

Comments

stevenmce commented May 16, 2018

djbrooke commented May 16, 2018

landreev commented May 16, 2018 • edited

landreev commented May 16, 2018

djbrooke commented May 16, 2018

pdurbin commented Jun 18, 2018

djbrooke commented Jun 18, 2018

landreev commented Jun 19, 2018

landreev commented Jun 19, 2018

landreev commented Jun 19, 2018

pdurbin commented Jun 19, 2018

djbrooke commented Jun 20, 2018

pdurbin commented Jun 20, 2018 • edited

landreev commented Jun 22, 2018

oscardssmith commented Jun 27, 2018

pdurbin commented Jun 28, 2018 • edited

izahn commented Jun 28, 2018

pdurbin commented Jun 28, 2018

landreev commented Jul 13, 2018 • edited

landreev commented Jul 16, 2018

landreev commented Jul 16, 2018

landreev commented Jul 23, 2018

pdurbin commented Jul 23, 2018

pdurbin commented Jul 23, 2018

landreev commented Jul 24, 2018

landreev commented May 16, 2018 •

edited

pdurbin commented Jun 20, 2018 •

edited

pdurbin commented Jun 28, 2018 •

edited

landreev commented Jul 13, 2018 •

edited