Dataverse fails to recognize newer RData formats #6803
Comments
I picked this up and poked around a bit. I found an RData file from 2007 at https://doi.org/10.7910/DVN/RWUY8G/90LTEO and it can be successfully ingested. Here's how the first 32 bytes look:
Here are just the first five bytes, which are what our code cares about:
So yes, for these RData files from 2007 our code works well and the regex makes sense ("RDX2" is right there). However, the first five bytes look wildly different for modern RData files. Here is how the first 32 bytes look for an RData file from 2019, from https://doi.org/10.7910/DVN/FTYHPJ/JAJHKH
Here are the first 32 bytes for an RData file I just created locally with R 3.6.3 following instructions from https://statisticsglobe.com/r-save-load-rdata-workspace-file
I haven't come across an RDX3 file yet but maybe I'll try to create one myself. In short, the examples seem to require a different regex or perhaps even a different technique to detect that they're RData files. Other thoughts:
You are looking at the compressed stream. You're supposed to uncompress it, then look at the first 5 bytes.
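A minimal Java sketch of that step, for illustration only: it assumes the file was gzip-compressed (the default for R's save(); bzip2 or xz files would need a different decompressor), and the class and method names here are hypothetical, not taken from the Dataverse code base.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: peeks at the header of a gzip-compressed RData stream.
class RDataHeaderPeek {

    // Returns the first n bytes of the decompressed stream (fewer if it is shorter).
    static byte[] firstUncompressedBytes(InputStream compressed, int n) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(compressed)) {
            byte[] buf = new byte[n];
            int off = 0;
            while (off < n) {
                int read = gz.read(buf, off, n - off);
                if (read < 0) {
                    break; // end of stream before n bytes were available
                }
                off += read;
            }
            return Arrays.copyOf(buf, off);
        }
    }

    // Gzips a byte array in memory; only used to build a stand-in sample below.
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a real file: an "RDX3\n" header followed by dummy payload.
        byte[] sample = gzip("RDX3\nX\n".getBytes(StandardCharsets.US_ASCII));

        // The raw bytes start with the gzip magic number 1f 8b, not with "RD".
        System.out.printf("compressed starts with: %02x %02x%n", sample[0] & 0xFF, sample[1] & 0xFF);

        byte[] head = firstUncompressedBytes(new ByteArrayInputStream(sample), 5);
        System.out.println("uncompressed header: " + new String(head, StandardCharsets.US_ASCII).trim());
    }
}
```

For a real file, a FileInputStream on the .RData file would take the place of the in-memory sample.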
For an example of an "RDX3" file, see the dataset that triggered this: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OFQUAN
@landreev thanks! Yes, now I see "RDX3"...
I created pull request #6832 because it was enough to get Dataverse to attempt to ingest the file. However, I don't have a working Rserve installation on my Mac. This is what I did: Start R REPL:
Then from the R REPL:
Start Rserve:
But then I'm getting weird errors even on RData files that should ingest. So I plan to spin up my pull request on Linux and try it there.
There's a section in the installation guide that explains how to install the R components that you need to ingest R files.
Yes, I'd prefer that we test it before we move the issue along. There is a non-zero chance that the R code we use for ingest will somehow stop working on these newer files.
Good.
Short version: RData files with the "RDX3" headers are not detected as R files and not ingested.
Longer version: our type check for type "application/x-rlang-transport" is very simple: we test that the first five bytes match ^(52)(44)(41|42|58)(31|32)(0A)$ (IngestableDataChecker.java:71). Translated from hex, the pattern above is RD[ABX][12]\n. So "RDX2" is recognized, but "RDX3" is not. Aside from this check, we don't try to parse the contents of the file ourselves; we rely on R to do it for us. So chances are, if we simply modify the pattern, these files will ingest just fine. But we need to verify that and check if any of the external R components need to be upgraded as well.
Also need to check if there are other formats that are being skipped. But v.3 appears to be the latest (supported since R 3.5.0 and made the default for save() in R 3.6.0, Apr. 2019).
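To illustrate the proposed change, here is a small Java sketch that compares the existing pattern with one widened to accept version 3. It assumes the check hex-encodes the first five uncompressed bytes as uppercase and matches the regex against that string. The class, constant, and method names are hypothetical, and the widened pattern is a suggestion, not necessarily the exact change made in the pull request.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical stand-in for the header check; not the actual IngestableDataChecker code.
class RDataMagicCheck {

    // Pattern quoted in the issue: matches RD[ABX][12]\n in hex.
    static final Pattern ORIGINAL = Pattern.compile("^(52)(44)(41|42|58)(31|32)(0A)$");

    // Suggested widening: also accept '3' (hex 33) as the version digit.
    static final Pattern WIDENED = Pattern.compile("^(52)(44)(41|42|58)(31|32|33)(0A)$");

    // Hex-encodes bytes as uppercase, two digits per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    static boolean matches(Pattern pattern, byte[] header) {
        return pattern.matcher(toHex(header)).matches();
    }

    public static void main(String[] args) {
        byte[] v2 = "RDX2\n".getBytes(StandardCharsets.US_ASCII);
        byte[] v3 = "RDX3\n".getBytes(StandardCharsets.US_ASCII);

        System.out.println(toHex(v3));             // 524458330A
        System.out.println(matches(ORIGINAL, v2)); // true
        System.out.println(matches(ORIGINAL, v3)); // false
        System.out.println(matches(WIDENED, v3));  // true
    }
}
```

Widening only the version digit keeps the check as strict as before for everything else, so files that are not RData still fail the match.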
Once the above is done, we can try ingesting all these skipped RData files using the reingest API.