Dataverse fails to recognize newer RData formats #6803

Closed
landreev opened this issue Apr 9, 2020 · 10 comments · Fixed by #6832

@landreev
Contributor

landreev commented Apr 9, 2020

Short version: RData files with the "RDX3" headers are not detected as R files and not ingested.

Longer version: our type check for type "application/x-rlang-transport" is very simple: we test that it's

  1. valid gzip
  2. the first 5 uncompressed bytes match the following pattern: ^(52)(44)(41|42|58)(31|32)(0A)$.
    (IngestableDataChecker.java:71)

Translated from hex, the pattern above is RD[ABX][12]\n. So "RDX2" is recognized, but "RDX3" is not. Aside from this check, we don't try to parse the contents of the file ourselves; we rely on R to do it for us. So chances are, if we simply modify the pattern, these files will ingest just fine. But we need to verify that and check if any of the external R components need to be upgraded as well.
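The check described above can be sketched roughly as follows. This is an illustrative Python approximation, not the code in IngestableDataChecker.java; the function name `looks_like_rdata` and the widened pattern are assumptions made for the example, and the actual fix may differ.

```python
import gzip
import re
import zlib

# Pattern quoted in this issue: hex of the first five uncompressed bytes.
ORIGINAL_PATTERN = re.compile(r"^(52)(44)(41|42|58)(31|32)(0A)$")
# Hypothetical widened pattern that also accepts the version digit "3" (0x33).
WIDENED_PATTERN = re.compile(r"^(52)(44)(41|42|58)(31|32|33)(0A)$")


def looks_like_rdata(data, pattern=ORIGINAL_PATTERN):
    """Return True if data is valid gzip and its first 5 uncompressed bytes match."""
    try:
        header = gzip.decompress(data)[:5]
    except (OSError, EOFError, zlib.error):
        # Not gzip, truncated, or corrupted: fails step 1 of the check.
        return False
    # Compare as upper-case hex, the same representation the pattern uses.
    return pattern.match(header.hex().upper()) is not None


rdx2 = gzip.compress(b"RDX2\nX\n")
rdx3 = gzip.compress(b"RDX3\nX\n")
print(looks_like_rdata(rdx2))                   # True
print(looks_like_rdata(rdx3))                   # False: "3" is not in (31|32)
print(looks_like_rdata(rdx3, WIDENED_PATTERN))  # True
```

The sample inputs are synthetic gzip streams that carry only the header bytes, which is enough to exercise the pattern but says nothing about whether R can read the rest of a real file.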

Also need to check if there are other formats that are being skipped. But v.3 appears to be the latest (added in R 3.5.0, Apr. 2018, and the default for saved files since R 3.6.0, Apr. 2019).

Once the above is done, we can try ingesting all these skipped RData files using the reingest API.

@landreev landreev changed the title from "Dataverse fails torecognize newer RData formats" to "Dataverse fails to recognize newer RData formats" Apr 9, 2020
@djbrooke
Contributor

  • We assume this will be a small change, and we'll plan to reingest (and include release note that installations should reingest)

@djbrooke djbrooke added the Small label Apr 15, 2020
@pdurbin pdurbin self-assigned this Apr 16, 2020
@pdurbin pdurbin moved this from Up Next 🛎 to IQSS Team - In Progress 💻 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Apr 16, 2020
@pdurbin
Member

pdurbin commented Apr 16, 2020

I picked this up and poked around a bit.

I found an RData file from 2007 at https://doi.org/10.7910/DVN/RWUY8G/90LTEO and it can be successfully ingested. Here's how the first 32 bytes look:

00000000: 5244 5832 0a58 0a00 0000 0200 0202 0100  RDX2.X..........
00000010: 0104 0000 0004 0200 0000 0100 0010 0900  ................

Here are just the first five bytes, which are what our code cares about:

00000000: 5244 5832 0a                             RDX2.

So yes, for these RData files from 2007 our code works well, the regex makes sense ("RDX2" is right there).

However, the first five bytes look pretty wildly different for modern RData files.

Here is how the first 32 bytes look for an RData file from 2019 from https://doi.org/10.7910/DVN/FTYHPJ/JAJHKH

00000000: 1f8b 0800 0000 0000 0003 ecfd 3d73 1c77  ............=s.w
00000010: dee7 6be6 f4e0 a108 910c 46ac 0c19 3a11  ..k.......F...:.

Here are the first 32 bytes for an RData file I just created locally with R 3.6.3 following instructions from https://statisticsglobe.com/r-save-load-rdata-workspace-file

00000000: 1f8b 0800 0000 0000 0003 0b72 8930 e68a  ...........r.0..
00000010: e062 6060 6066 6066 0362 5620 9381 3534  .b```f`f.bV ..54

I haven't come across an RDX3 file yet but maybe I'll try to create one myself.

In short, the examples seem to require a different regex or perhaps even a different technique to detect that they're RData files.

Other thoughts:

@landreev
Contributor Author

However, the first five bytes look pretty wildly different for modern RData files.
...

You are looking at the compressed stream. You're supposed to uncompress it, then look at the first 5 bytes.
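To illustrate the difference, here is a minimal Python sketch that compares the raw bytes with the uncompressed ones. The helper name `peek_uncompressed` is made up for this example, and the input is a synthetic gzip stream rather than one of the files linked above.

```python
import gzip
import zlib


def peek_uncompressed(data, n=5):
    """Return up to n bytes of the uncompressed stream without inflating all of it."""
    # wbits=31 tells zlib to expect a gzip header and trailer.
    inflater = zlib.decompressobj(wbits=31)
    return inflater.decompress(data, n)


# Synthetic stand-in for an .RData file: RData header plus padding, gzipped.
blob = gzip.compress(b"RDX3\nX\n" + b"\x00" * 1000)

print(blob[:2].hex())           # 1f8b, the gzip magic number seen in the dumps above
print(peek_uncompressed(blob))  # b'RDX3\n'
```

Capping the output with `max_length` keeps the check cheap for large files, since only the start of the stream is inflated.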

@landreev
Contributor Author

For an example of an "RDX3" file - see the dataset that triggered this: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OFQUAN
(this was reported in #dv-general, on Apr. 7).
The .RData files that are recognized as "Gzip Archive" in that dataset are examples of this new RData flavor.

@pdurbin
Member

pdurbin commented Apr 16, 2020

@landreev thanks! Yes, now I see "RDX3"...

gunzip -c S02-dataTurkey.RData | xxd -l 32

00000000: 5244 5833 0a58 0a00 0000 0300 0306 0100  RDX3.X..........
00000010: 0305 0000 0000 0555 5446 2d38 0000 0402  .......UTF-8....
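For anyone curious what the bytes after "RDX3" mean, the dump can be decoded with a short Python sketch. The field layout below is my reading of R's XDR serialization header (format marker, serialization version, writer R version, minimum reader R version, and for version 3 a native encoding name); treat it as an assumption to check against the R documentation, and note that `parse_rdata_header` is a name invented for this example.

```python
import struct


def parse_rdata_header(head):
    """Decode the leading fields of an uncompressed XDR-format RData stream."""
    if head[:3] != b"RDX" or head[4:7] != b"\nX\n":
        raise ValueError("not an XDR RData header")
    # Three big-endian 32-bit integers follow the 7-byte magic and format marker.
    version, writer, min_reader = struct.unpack(">3i", head[7:19])

    def r_version(packed):
        # R packs versions as major * 65536 + minor * 256 + patch.
        return f"{packed >> 16}.{(packed >> 8) & 0xFF}.{packed & 0xFF}"

    info = {
        "magic": head[:4].decode("ascii"),
        "serialization_version": version,
        "written_by": r_version(writer),
        "min_reader": r_version(min_reader),
    }
    if version >= 3:
        # Version 3 adds a length-prefixed native encoding name.
        (length,) = struct.unpack(">i", head[19:23])
        info["encoding"] = head[23:23 + length].decode("ascii")
    return info


# The 32 bytes from the xxd output above.
rdx3_head = bytes.fromhex(
    "524458330a580a000000030003060100"
    "0305000000000555" "54462d3800000402"
)
print(parse_rdata_header(rdx3_head))
# serialization_version 3, written_by 3.6.1, min_reader 3.5.0, encoding UTF-8
```

Under that reading, the minimum reader version of 3.5.0 is a hint about which R version the ingest components would need in order to load these files.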

@pdurbin
Member

pdurbin commented Apr 16, 2020

I created pull request #6832 because it was enough to get Dataverse to attempt to ingest the file.

However, I don't have a working Rserve installation on my Mac. This is what I did:

Start R REPL:

R

Then from the R REPL:

> install.packages("Rserve")

Start RServe:

R CMD Rserve

But then I'm getting weird errors even on RData files that should ingest.

So I plan to spin up my pull request on Linux and try it there.

@landreev
Contributor Author

There's a section in the installation guide that explains how to install R components that you need to ingest R files.

@landreev
Contributor Author

Yes, I'd prefer that we test it before we move the issue along. There is a non-zero chance that the R code we use for ingest will somehow stop working on these newer files.
I could test it too, but would have to finish other things first.

@pdurbin
Member

pdurbin commented Apr 17, 2020

@landreev thanks. I ended up testing on a server, but it's good to know that devs can use the installation guide instructions. I got an RDX3 file to ingest 🎉 so I moved pull request #6832 to code review. I'll take this issue off the board, as is our convention.

@pdurbin pdurbin removed their assignment Apr 17, 2020
@landreev
Contributor Author

Good.
