Dataverse ingest error for variables with mix of labelled and unlabelled values #4676

Closed
stevenmce opened this issue May 16, 2018 · 32 comments

@stevenmce

When ingesting SPSS or Stata data that includes variables containing both labelled and unlabelled values, the ingest process assigns NA to every value that has no label. For example, a 10-point scale (1 = Not at all, 10 = Very much, no value labels assigned for responses 2 to 9) results in an ingested version that contains only the values 1, 10 and NA. As a result, the UNF of the ingested version differs from that of the original.
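
To make the reported behaviour concrete, here is a minimal sketch in Python (not the actual Dataverse or R code). It assumes a label lookup that treats "no label" as "missing"; the names `value_labels`, `responses` and `to_factor_naive` are made up for illustration.

```python
# Hypothetical illustration only: mimics a label lookup that treats
# "no label" as "missing", which is the behaviour described in this report.
NA = None  # stand-in for R's NA

value_labels = {1: "Not at all", 10: "Very much"}  # only the endpoints are labelled
responses = [1, 2, 5, 9, 10]                       # simulated 10-point scale answers


def to_factor_naive(values, labels):
    """Map each value to its label; unlabelled values collapse to NA."""
    return [labels.get(v, NA) for v in values]


converted = to_factor_naive(responses, value_labels)
print(converted)
```

The responses 2, 5 and 9 are valid answers, yet after the lookup they can no longer be told apart from genuinely missing data.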

We have attempted to provide a demonstration of the issue with simulated data at:
https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.5072/FK2/ZYGTWM

@djbrooke
Contributor

Thanks @stevenmce, we'll take a look at this.

@landreev
Contributor

landreev commented May 16, 2018

To clarify, this is specifically about exporting ingested files as RData, correct? (it's not mentioned above, but looking at the linked sample dataset suggests that it is...)

Off the top of my head, there definitely is a known problem with calculating UNFs of categorical variables (or "factors" in the language of R), when converting datafiles between Stata/SPSS and R; it is caused by the fact that R handles its factors in a way that's fundamentally different from other formats. (see http://guides.dataverse.org/en/4.8.6/user/tabulardataingest/rdata.html#r-factors for some discussion of the controversy).

But this specific case - empty labels of distinct categories all becoming NA factor values - is unambiguously wrong; and we should come up with a way to handle it better.

We'll discuss this and follow up.

@landreev
Contributor

Just to reiterate, I am fairly positive that this is not an ingest problem, as the original description suggests.
I still need to review the ingested files in your sample dataset more carefully (thanks btw, for providing all the examples and information) to confirm. But it looks to me like the ingest process does the right thing. For example, it ingests the variable column v6b in the Stata file as a numeric categorical vector, where the numeric values 1 and 5 have assigned labels but the other values happen to have none.
It's the conversion to R that seems to be the problem. It looks like we just never thought about this situation - what to do when there are missing labels like this. My first guess, we should just be using "2", "3", "4" etc. for the values that are missing the descriptive labels... But please let us investigate/think about this some more.

@djbrooke
Contributor

@stevenmce - we talked about this in Backlog Grooming today and we'll need some more information and investigation before we estimate. We'll have another opportunity to discuss next Wednesday.

@pdurbin
Member

pdurbin commented Jun 18, 2018

@stevenmce emphasized the importance of this bug during last week's community meeting so I assigned it to myself and plan to bring it up during backlog grooming in the context of pull request #4708, which is currently in flight.

@djbrooke
Contributor

@pdurbin let's keep it separate if possible - that one is already big enough :)

@landreev, you had mentioned some further investigation above. Can you work with @pdurbin to make sure the efforts are coordinated and can we please try to get some specific questions to @stevenmce and the ADA team before we groom this?

@benjamin-martinez self-assigned this Jun 18, 2018

@landreev
Contributor

@pdurbin @djbrooke
I believe there are multiple reasons why this should be treated separately from the PR 4708.
One is that the sample ingested Stata file they provide (https://demo.dataverse.org/file.xhtml?fileId=28929&version=RELEASED&version=.0) is an "old-school", pre-v.13 Stata. So it's not even handled by the ingest plugin @benjamin-martinez and @oscardssmith have been working on.
They (the people who submitted the report) do mention that they haven't had particularly good luck with Stata 13 - meaning, most of their Stata 13 files just don't ingest, period - and that most likely will be much improved by the fixes to the Stata 13+ ingest plugin. But that was not specific to the value label bug this issue is mainly about.
But, most importantly - as I tried to explain in the comment above, this does not appear to be a Stata ingest problem, or an ingest problem in general. Rather, it's a problem with converting the data to RData format, once ingested. We don't seem to be doing anything wrong with how we store these value labels in the database. What we are doing wrong is that, when we export to R, we fail to pass the list of these labels in a way that does not confuse R into thinking that no label on the list = missing value.

@landreev
Contributor

The TL;DR version:
This is an issue outside of the Stata 13 ingest.
If we need to prioritize it, because it's important to a partner, let's do that.

@landreev
Contributor

(@pdurbin - thanks for bringing this up during standup! I did miss these last comments in this issue from yesterday...)

@pdurbin
Member

pdurbin commented Jun 19, 2018

@benjamin-martinez thanks for investigating this bug with me today and thank you @landreev for the summary here and in person after tech hours. To reiterate, the bug seems to be on export to RData format.

By the way, Ben and I weren't sure what value Dataverse is adding by creating its own RData derivative of an original RData file but whatever. 😄

@djbrooke
Contributor

@landreev @pdurbin @benjamin-martinez

Do you feel that there's enough here to get an estimate on the fix for this and pull it into a sprint? Or do we need more info from ADA? I'd like to bring it into the sprint today if we have enough info.

@pdurbin
Member

pdurbin commented Jun 20, 2018

@benjamin-martinez here is the code we were looking at:

src/main/webapp/file-download-button-fragment.xhtml

<p:commandLink rendered="#{!downloadPopupRequired}"
               process="@this"
               disabled="#{(fileMetadata.dataFile.ingestInProgress  or lockedFromDownload) ? 'disabled' : ''}" 
               actionListener="#{fileDownloadService.startFileDownload(guestbookResponse, fileMetadata, 'RData')}">
    #{bundle['file.downloadBtn.format.rdata']}
</p:commandLink>
<p:commandLink rendered="#{downloadPopupRequired}"
               process="@this"
               disabled="#{(fileMetadata.dataFile.ingestInProgress  or lockedFromDownload) ? 'disabled' : ''}" 
               action="#{guestbookResponseService.modifyDatafileAndFormat(guestbookResponse, fileMetadata, 'RData' )}"
               update="@widgetVar(downloadPopup)"
               oncomplete="PF('downloadPopup').show();handleResizeDialog('downloadPopup');">
    #{bundle['file.downloadBtn.format.rdata']}
</p:commandLink>

Those methods are in these "service beans":

  • src/main/java/edu/harvard/iq/dataverse/FileDownloadServiceBean.java
  • src/main/java/edu/harvard/iq/dataverse/GuestbookResponseServiceBean.java

Ultimately, the redirect sends the user to this API bean: src/main/java/edu/harvard/iq/dataverse/api/Access.java

http://guides.dataverse.org/en/4.9/api/dataaccess.html shows "formats" such as "original" and "RData".
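
For reference, a small sketch of how a download URL with one of those formats can be put together. It assumes the `/api/access/datafile/$id` endpoint and its `format` query parameter as described in that guide; the helper name `access_url` and the file id 42 are made up for illustration.

```python
from urllib.parse import urlencode


def access_url(base_url, file_id, fmt=None):
    """Build a Data Access API URL; `fmt` is e.g. "original" or "RData"."""
    url = f"{base_url.rstrip('/')}/api/access/datafile/{file_id}"
    if fmt is not None:
        # Without a format parameter the default representation is requested.
        url += "?" + urlencode({"format": fmt})
    return url


# Hypothetical file id on the demo server, for illustration only.
rdata_url = access_url("https://demo.dataverse.org", 42, "RData")
original_url = access_url("https://demo.dataverse.org/", 42, "original")
print(rdata_url)
print(original_url)
```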

Here's a screenshot from Firefox:

[screenshot: screen shot 2018-06-20 at 4:29:11 pm]

From what @landreev was saying at the meeting this afternoon, R code is generated on the fly by Dataverse and sent to Rserve, which is mentioned here: http://guides.dataverse.org/en/4.9/installation/r-rapache-tworavens.html

@landreev
Contributor

@benjamin-martinez @oscardssmith
The class that talks to R in order to produce the R data frame (an .RData file) is RemoteDataFrameService.java. It uses a few other classes in edu/harvard/iq/dataverse/rserve, and some code in edu/harvard/iq/dataverse/dataaccess/DataConverter.java. There is also some fixed R code in the Dataverse (edu/harvard/iq/dataverse/rserve/scripts/dataverse_r_functions.R) that is passed to R in real time. The communication to R is done using Rserve protocol. The Rserve server can run on your local systems, or you can use remote Rserve running on one of our test boxes - for example, on dvn-build.hmdc.harvard.edu.

It's based largely on some much older code written by another developer during an earlier stage of the project.
Looking at it, it's pretty complicated stuff. I'm wondering if this issue would be better suited for somebody from the full-time developers team. So do not hesitate to ask for help, or even jump on something else if you need to.

The map of all the categorical values for the datafile is passed to the RemoteDataFrameService as sro.getValueTable(). Then we turn it into a map on the R side called VALTABLE; and that's what R will use to change the vectors made from the values in the tab files into "factors" - R's version of categorical variables.

One solution for the issue at hand is to add some R code instructing it to assume that the VALTABLE may not contain the labels for some values found in the tabular vector, and in that case not to assume NA, but to just use the string value of the element as the label (i.e., if the value in the numeric vector is 1, but there is no label in VALTABLE corresponding to 1, just use "1" for the label).
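
A rough sketch of that fallback rule, written in Python for brevity; the real change would go into the R code sent to Rserve, and `to_factor_with_fallback` and the sample data are made-up names, not anything from the Dataverse code base.

```python
def to_factor_with_fallback(values, labels):
    """Use the assigned label when there is one, else the value's string form.

    Genuinely missing cells (None here, standing in for NA) stay missing.
    """
    result = []
    for v in values:
        if v is None:
            result.append(None)                   # real missing value: keep as NA
        else:
            result.append(labels.get(v, str(v)))  # unlabelled value: "2", "3", ...
    return result


valtable = {1: "Not at all", 10: "Very much"}  # stand-in for one VALTABLE entry
column = [1, 2, None, 9, 10]

labelled = to_factor_with_fallback(column, valtable)
print(labelled)
```

With this rule only the cell that was missing in the tabular data ends up as NA; every observed value keeps a distinct level.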

There is an alternative solution though, that would not require any extra R code, but can be done solely on the Dataverse side and in Java:
When we create the valueTable in sro (RJobRequest.java) - this happens in DataConverter.getValueTableForRequestedVariables(...) - we can, in addition to going through all the category labels in DataVariable.getCategories(), subset the vector for this variable from the tab file, go through the values, and if necessary, create extra labels for all the unique values that are not in dataVariable.getCategories()... (Oscar, you should have a rough idea of how to subset individual vectors, from looking at the code that calculates UNFs and summary stats...)

The drawbacks of this method - having to read and subset the tabular file on the application side. (this of course would have to be done for categorical variables only, not for every vector - but still!).
And R will need to read the tabular file anyway - so that would be the reason to solve it by adding something on the R side... But that comes with having to figure out what goes on in that R code.
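
To illustrate the application-side alternative, here is a minimal Python sketch (the real thing would be Java in DataConverter). It assumes the value table for one variable is a map from the string value to its label and that the column has already been subset from the tab file; `extend_value_table` is a hypothetical name.

```python
def extend_value_table(categories, column):
    """Return a copy of `categories` with an entry for every distinct value in `column`.

    Values that already have a label keep it; unlabelled values get their
    own string form as the label. Empty cells are treated as missing and skipped.
    """
    table = dict(categories)
    for value in column:
        if value is None or value == "":
            continue                     # missing cell: no category needed
        table.setdefault(value, value)   # add "2" -> "2" only if not labelled yet
    return table


categories = {"1": "Not at all", "10": "Very much"}  # what getCategories() would yield
column = ["1", "2", "", "10", "2", "7"]              # one vector from the tab file

value_table = extend_value_table(categories, column)
print(value_table)
```

The cost mentioned above shows up in the loop: the whole column has to be read once per categorical variable before anything is handed to R.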

@oscardssmith
Contributor

@stevenmce given that in R categorical columns cannot have data not assigned to a category, what path do you want us to take here?

  1. For each unlabeled value, produce a new label for that value which is str(value)
  2. Prevent these arguably incomplete files from being downloaded in R
  3. Prevent these arguably incomplete files from being ingested

@pdurbin
Member

pdurbin commented Jun 28, 2018

@izahn suggests using a labeller built into haven: https://cran.r-project.org/web/packages/haven/vignettes/semantics.html

@izahn

izahn commented Jun 28, 2018

See https://haven.tidyverse.org/articles/semantics.html for tools designed to preserve metadata from other statistics packages in R.

@pdurbin
Member

pdurbin commented Jun 28, 2018

I'm not sure if this will be helpful or not but I found https://cran.r-project.org/package=DDIwR and the PDF says, "This package provides various functions to read DDI based metadata documentation, and write dedicated setup files for R, SPSS, Stata and SAS to read an associated .csv file containing the raw data, apply labels for variables and values and also deal with the treatment of missing values."

@landreev
Contributor

landreev commented Jul 13, 2018

There is also the case specific to our production (and maybe to one or two particularly old external installations, such as Odum): we have a limited number of ingested tabular files for which we don't have the saved originals/don't have the original format preserved at all. The files in question were grandfathered from something called "VDC" - the dinosaur ancestor of the Dataverse application. Some of these old files happen to be very very important (for example, many data files from Gary King's datasets are part of this subset). So we don't want these R conversions to just start failing for these files.
(granted, for most or all of Gary's files, the RData files are already cached; so one way to address this would be to keep them in place, and only wipe the cached RData files for the files that were specifically affected by the missing label issue. Also, we were recently discussing working with the Murray project to restore the originals for these old tabular files. So this may get resolved outside of this issue - but still, this is an issue).

@landreev
Contributor

A couple more things:
Generally, it makes perfect sense to want to use a piece of software written in R, by R programmers, and specifically for the purpose of converting foreign formats to RData, rather than writing and maintaining code in R ourselves, which is not our strongest point.
However, there's one potential issue with this approach: the above is only true for the tabular ingest as it is implemented now. That is, when we convert proprietary formats (Stata, SPSS, etc.) into our internal format ONCE, with no way to modify it later on. But we are already working on providing APIs for modifying existing tabular metadata. So this creates a possibility of a use case where the owner has improved the descriptive tabular metadata - with better variable names, labels, etc. - which is reflected in the DDI, json exports, and/or in TwoRavens etc. Yet if they try “download as RData”, users will see the original, “poor” metadata from the saved original.

I don’t know if we want/need to discuss this further. This PR definitely addresses the current issue; and this will not be a problem for the foreseeable future. But we need to assume that we will likely have to go back to learning how to properly convert our (Dataverse) tabular metadata to R data frames from scratch in the future.

In this PR however, we do need it explained somewhere in the documentation, that this is the way we convert to RData - from the saved original, using a third party R library.

@landreev
Contributor

Re: documentation: there's a section in the API guide, "Basic File Access" that mentions download as R:
RData | Tabular data as an R Data frame (generated; unless the “original” file was in R);
this is probably a good place to explain how these RData files are generated.

@landreev
Contributor

I requested some changes in the PR. (want to address/rename the new "isLabled" thing in DataVariable)

@pdurbin
Member

pdurbin commented Jul 23, 2018

I discussed this issue with @oscardssmith and @landreev this morning and have a pretty good sense of the work that's left, messing with the new unused boolean. There's a file floating around called "BeerTastingTestData.dta" that can be used for testing. One needs to install the haven R package or use an Rserve with it (dvn-build soon, probably).

@pdurbin
Member

pdurbin commented Jul 23, 2018

I made the requested changes in 03c8051 and cleared them with @oscardssmith and then @landreev . I'm attaching the file I used to ensure that the new "factor" column is being populated with a "true" when the RData file contains factors: test_factor.RData.txt

Back to code review.

@landreev
Contributor

@kcondon For QA, please use Rserve on dvn-build.hmdc.harvard.edu.
