Dataverse ingest error for variables with mix of labelled and unlabelled values #4676
Comments
Thanks @stevenmce, we'll take a look at this.
To clarify, this is specifically about exporting ingested files as RData, correct? (It's not mentioned above, but the linked sample dataset suggests that it is.) Off the top of my head, there definitely is a known problem with calculating UNFs of categorical variables (or "factors" in the language of R) when converting datafiles between Stata/SPSS and R; it is caused by the fact that R handles its factors in a way that's fundamentally different from other formats (see http://guides.dataverse.org/en/4.8.6/user/tabulardataingest/rdata.html#r-factors for some discussion of the controversy). But this specific case, empty labels of distinct categories all becoming NA factor values, is unambiguously wrong, and we should come up with a way to handle it better. We'll discuss this and follow up.
Just to reiterate, I am fairly positive that this is not an ingest problem, contrary to what the original description suggests.
@stevenmce - we talked about this in Backlog Grooming today and we'll need some more information and investigation before we estimate. We'll have another opportunity to discuss next Wednesday.
@stevenmce emphasized the importance of this bug during last week's community meeting so I assigned it to myself and plan to bring it up during backlog grooming in the context of pull request #4708, which is currently in flight.
@pdurbin let's keep it separate if possible - that one is already big enough :) @landreev, you had mentioned some further investigation above. Can you work with @pdurbin to make sure the efforts are coordinated and can we please try to get some specific questions to @stevenmce and the ADA team before we groom this?
@pdurbin @djbrooke
The TL;DR version:
(@pdurbin - thanks for bringing this up during standup! I did miss these last comments in this issue from yesterday...)
@benjamin-martinez thanks for investigating this bug with me today and thank you @landreev for the summary here and in person after tech hours. To reiterate, the bug seems to occur on export to RData format. By the way, Ben and I weren't sure what value Dataverse is adding by creating its own RData derivative of an original RData file but whatever. 😄
@landreev @pdurbin @benjamin-martinez Do you feel that there's enough here to get an estimate on the fix for this and pull it into a sprint? Or do we need more info from ADA? I'd like to bring it into the sprint today if we have enough info.
@benjamin-martinez here is the code we were looking at: src/main/webapp/file-download-button-fragment.xhtml
Those methods are in these "service beans":
Ultimately, the redirect sends the user to this API bean: src/main/java/edu/harvard/iq/dataverse/api/Access.java. The guide at http://guides.dataverse.org/en/4.9/api/dataaccess.html shows "formats" such as "original" and "RData". Here's a screenshot from Firefox: From what @landreev was saying at the meeting this afternoon, R code is generated on the fly by Dataverse and sent to Rserve, which is mentioned here: http://guides.dataverse.org/en/4.9/installation/r-rapache-tworavens.html
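For anyone following along, here is a rough sketch of how the download URLs for the "original" and "RData" formats could be put together. It assumes the `/api/access/datafile/{id}` endpoint and `format` query parameter as described in the Data Access guide linked above; the helper name `datafile_url` and the file id `42` are made up for illustration.

```python
from urllib.parse import urlencode


def datafile_url(server, file_id, fmt=None):
    """Build a Data Access API URL for one datafile.

    Assumption: the endpoint is /api/access/datafile/{id} and the
    optional "format" parameter takes values such as "original" or
    "RData", as listed in the Data Access guide.
    """
    base = f"{server.rstrip('/')}/api/access/datafile/{file_id}"
    if fmt is None:
        return base
    return f"{base}?{urlencode({'format': fmt})}"


# Hypothetical file id; substitute the id of a real ingested file.
rdata_url = datafile_url("https://demo.dataverse.org", 42, "RData")
original_url = datafile_url("https://demo.dataverse.org", 42, "original")
```

Requesting the `RData` variant is what triggers the conversion discussed in this issue, so comparing it against the `original` download is a quick way to see the difference.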
@benjamin-martinez @oscardssmith It's based largely on some much older code written by another developer during an earlier stage of the project. The map of all the categorical values for the datafile is passed to the RemoteDataFrameService as sro.getValueTable(). We then turn it into a map on the R side called VALTABLE, and that's what R uses to change the vectors made from the values in the tab files into "factors", R's version of categorical variables. One solution for the issue at hand is to add some R code instructing it to assume that VALTABLE may not contain labels for some values found in the tabular vector and, instead of assuming N/A, to just use the string value of the element as the label (i.e., if the value in the numeric vector is 1, but there is no label in VALTABLE corresponding to 1, just use "1" for the label). There is an alternative solution, though, that would not require any extra R code and can be done solely on the Dataverse side, in Java: The drawback of this method is having to read and subset the tabular file on the application side (this of course would have to be done for categorical variables only, not for every vector, but still!).
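The first option above (fall back to the value's own string when VALTABLE has no label for it) can be sketched roughly as follows. This is Python rather than the R that Dataverse generates, and it is not the project's actual code; `labels_with_fallback` and the plain-dict `value_table` are stand-ins invented for this illustration.

```python
def labels_with_fallback(values, value_table):
    """Map raw values to category labels.

    value_table plays the role of VALTABLE: a dict from the string
    form of a value to its label. A value with no entry keeps its own
    string form as the label instead of becoming missing. Values that
    are genuinely missing (None) stay missing.
    """
    labels = []
    for value in values:
        if value is None:
            labels.append(None)
            continue
        key = str(value)
        labels.append(value_table.get(key, key))
    return labels


# Only the two end points of the scale carry labels.
value_table = {"1": "Not at all", "10": "Very much"}
print(labels_with_fallback([1, 5, 10, None], value_table))
```

The point of the fallback is that every distinct value still ends up as its own category, so nothing observed in the tab file is silently turned into a missing value.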
@stevenmce given that in R categorical columns cannot have data not assigned to a category, what path do you want us to take here?
@izahn suggests using a labeller built into haven: https://cran.r-project.org/web/packages/haven/vignettes/semantics.html
See https://haven.tidyverse.org/articles/semantics.html for tools designed to preserve metadata from other statistics packages in R.
I'm not sure if this will be helpful or not but I found https://cran.r-project.org/package=DDIwR and the PDF says, "This package provides various functions to read DDI based metadata documentation, and write dedicated setup files for R, SPSS, Stata and SAS to read an associated .csv file containing the raw data, apply labels for variables and values and also deal with the treatment of missing values."
There is also the case specific to our production (and maybe to one or two particularly old external installations, such as Odum): we have a limited number of ingested tabular files for which we don't have the saved originals/don't have the original format preserved at all. The files in question were grandfathered from something called "VDC" - the dinosaur ancestor of the Dataverse application. Some of these old files happen to be very very important (for example, many data files from Gary King's datasets are part of this subset). So we don't want these R conversions to just start failing for these files.
A couple more things: I don’t know if we want/need to discuss this further. This PR definitely addresses the current issue; and this will not be a problem for the foreseeable future. But we need to assume that we will likely have to go back to learning how to properly convert our (Dataverse) tabular metadata to R data frames from scratch in the future. In this PR however, we do need it explained somewhere in the documentation, that this is the way we convert to RData - from the saved original, using a third-party R library.
Re: documentation: there's a section in the API guide, "Basic File Access" that mentions download as R:
I requested some changes in the PR. (want to address/rename the new "isLabled" thing in DataVariable)
I discussed this issue with @oscardssmith and @landreev this morning and have a pretty good sense of the work that's left, messing with the new unused boolean. There's a file floating around called "BeerTastingTestData.dta" that can be used for testing. One needs to install the haven R package or use an Rserve with it (dvn-build soon, probably).
I made the requested changes in 03c8051 and cleared them with @oscardssmith and then @landreev. I'm attaching the file I used to ensure that the new "factor" column is being populated with a "true" when the RData file contains factors: test_factor.RData.txt Back to code review.
@kcondon For QA, please use Rserve on dvn-build.hmdc.harvard.edu.
When ingesting SPSS or Stata data that includes variables containing both labelled and unlabelled values, the ingest process assigns values with no label as N/A. For example, a 10-point scale (1 = Not at all, 10 = Very much, no value labels assigned for responses 2 to 9) would result in an ingested version which contains only the values 1, 10 and NA. This results in the UNF data being different between the original and ingested versions.
We have attempted to provide a demonstration of the issue with simulated data at:
https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.5072/FK2/ZYGTWM
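To make the 10-point scale example concrete, here is a small simulation of the symptom. It is only an illustration, not Dataverse's code: `strict_labels` is a made-up helper that treats any value without a label as missing, which is the behaviour the report describes.

```python
def strict_labels(values, value_table):
    """Look up each value's label; unlabelled values become None.

    This mirrors the reported behaviour, where a response with no
    value label ends up as a missing value after conversion.
    """
    return [value_table.get(str(value)) for value in values]


# One response for each point on the scale; only 1 and 10 are labelled.
responses = list(range(1, 11))
value_table = {"1": "Not at all", "10": "Very much"}

converted = strict_labels(responses, value_table)
lost = sum(1 for label in converted if label is None)
print(f"{lost} of {len(responses)} responses became missing")
```

Every response from 2 to 9 is lost here, which is why the converted data can no longer match the original.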