Better detection test for whether a file is ingested #113

kuriwaki · 2022-01-07T00:15:06Z

The current method to detect whether something is_ingested, introduced in v0.3.0, is problematic: it only checks whether there is a metadata file associated with the fileid. But I guess some files, e.g. those that have ingestion warnings, don't have a metadata file. This can cause the wrong download format, as in #80.

If I have a dataset id or name, I now know how to check whether something is ingested: check if the entry originalFileFormat exists (e.g. this JSON).
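As a rough illustration of that check, here is a minimal sketch that works on an already-parsed file entry rather than a live API call. The helper name `has_original_format()` is hypothetical, and the nesting under `dataFile` is an assumption about how the dataset JSON is laid out; only the field name `originalFileFormat` comes from the discussion above.

```r
# Hypothetical helper, not part of the dataverse package.
# `file_entry` is assumed to be one element of the `files` array of a parsed
# dataset JSON (e.g. read with jsonlite::fromJSON(..., simplifyVector = FALSE)).
# Assumption: ingested files carry `originalFileFormat` under `dataFile`.
has_original_format <- function(file_entry) {
  # `[[` matches names exactly, unlike `$`, which partial-matches on lists
  !is.null(file_entry[["dataFile"]][["originalFileFormat"]])
}

# Toy entries standing in for real API output
ingested <- list(dataFile = list(id = 1L, filename = "survey.tab",
                                 originalFileFormat = "application/x-stata"))
not_ingested <- list(dataFile = list(id = 2L, filename = "notes.pdf"))

has_original_format(ingested)
has_original_format(not_ingested)
```

The first call should report the entry as ingested and the second should not, since the toy `not_ingested` entry has no `originalFileFormat` field.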

However, at this particular stage of the client, I sometimes don't have a dataset identifier, only the numeric fileid + server. This happens, for example, with get_*_by_doi, where the user only provides a file DOI. @landreev pointed out that the Dataverse api/files API apparently does not contain info like originalFileFormat, perhaps for legacy reasons.

For now, what is the best way to access the parent dataset JSON with only the numeric file in hand? (@pdurbin ?). In the above example, how would I obtain the dataset id doi:10.70122/FK2/PPIAXE only by knowing file id=1734017 and server = demo.dataverse.org?


pdurbin · 2022-01-07T16:00:12Z

For now, what is the best way to access the parent dataset JSON with only the numeric file in hand?

My first thought is to get https://demo.dataverse.org/api/search?q=fileId:1734017 and find a dataset_persistent_id of "doi:10.70122/FK2/PPIAXE".

kuriwaki · 2022-01-08T07:05:34Z

@pdurbin this looks promising. Our function dataverse_search() could possibly mimic this. But when I tried searching for fileId=3123547, which I expected to be this CCES file, I got something completely different: https://dataverse.harvard.edu/api/search?q=fileId:3123547. Do you know why this occurs, and how to fix the query so I get the CCES file instead?

Here is the query confirming that at least the id of the CCES file of interest is 3123547.
https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi%3A10.7910/DVN/GDF6Z0

Even though this is a search, for the purpose of this issue I'd need it to be a strict match on the file id (that is, return a single entry if the file id exists, and return 0 results if the file id does not exist).
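To make the strict-match requirement concrete, here is a minimal sketch that operates on an already-parsed search response. The helper `parent_dataset_id()` is hypothetical, and the `data$items` layout is an assumption about the Search API's JSON; `dataset_persistent_id` is the field mentioned earlier in this thread. The toy responses are made up for illustration.

```r
# Hypothetical helper, not part of the dataverse package.
# `resp` is assumed to be a parsed Search API response shaped like
# list(data = list(items = list(<one list per hit>))).
parent_dataset_id <- function(resp) {
  items <- resp[["data"]][["items"]]
  # Strict match: anything other than exactly one hit is an error
  if (length(items) != 1L) {
    stop("Expected exactly one matching file, found ", length(items),
         call. = FALSE)
  }
  items[[1]][["dataset_persistent_id"]]
}

# Toy responses standing in for real API output
resp_one <- list(data = list(items = list(
  list(name = "example.tab", type = "file",
       dataset_persistent_id = "doi:10.70122/FK2/PPIAXE")
)))
resp_none <- list(data = list(items = list()))

parent_dataset_id(resp_one)
tryCatch(parent_dataset_id(resp_none),
         error = function(e) conditionMessage(e))
```

Raising an error on zero or multiple hits, rather than silently taking the first item, keeps a loose search from being mistaken for an exact lookup.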

kuriwaki · 2022-01-08T08:42:38Z

Add-on Re:

For now, what is the best way to access the parent dataset JSON with only the numeric file in hand?

We would also need a method that can get the dataset JSON with only the file DOI (persistentId) in hand, for use in get_*_by_doi. Using the same example, with dataset id doi:10.70122/FK2/PPIAXE, we'd want to find the dataset id knowing only persistentId=doi:10.70122/FK2/PPIAXE/MHDB0O and the server.

pdurbin · 2022-01-12T16:09:47Z

@kuriwaki huh. fileId works on the demo server but not for Harvard Dataverse nor my dev laptop. Can you please try id:datafile_NNN instead, like the example below?

https://dataverse.harvard.edu/api/search?q=id:datafile_3123547
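For reference, a URL of this form can be assembled from a numeric file id and a server name with base R alone. This is only a sketch; `file_search_url()` is a hypothetical name and not a function in the dataverse package.

```r
# Hypothetical helper: build the Search API URL for a numeric file id,
# following the id:datafile_NNN pattern suggested above.
file_search_url <- function(file_id, server) {
  # as.integer() + %d avoids scientific notation for large round ids
  sprintf("https://%s/api/search?q=id:datafile_%d", server, as.integer(file_id))
}

file_search_url(3123547, "dataverse.harvard.edu")
```

With the CCES file id, this reproduces the URL given above.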

pdurbin · 2022-01-12T16:51:39Z

I don't think that "MHDB0O" file is indexed. https://dataverse.harvard.edu/api/search?q=id:datafile_1734017 should find it but it doesn't. Can you please open an issue in https://github.com/IQSS/dataverse.harvard.edu/issues about this?

For a file that is properly indexed, like the CCES file we've been talking about ( https://dataverse.harvard.edu/api/search?q=id:datafile_3123547 ), you should be able to search for it by DOI like this (note the quotes around the DOI): https://dataverse.harvard.edu/api/search?q=filePersistentId:%22doi:10.7910/DVN/GDF6Z0/JPMOZZ%22

kuriwaki · 2022-01-12T16:53:08Z

(for numeric id's)

id:datafile_NNN

This is great. The following three examples work as intended - they give me the single entry. I will try implementing it on dev.

library(dataverse)

#  rds
dataverse_search(id = "datafile_1734017", server = "demo.dataverse.org", type = "file")$name

# CCES problematic dta
dataverse_search(id = "datafile_3123547", server = "dataverse.harvard.edu", type = "file")$name

# other dataverse
dataverse_search(id = "datafile_204446", server = "dataverse.nl", type = "file")$name

kuriwaki · 2022-01-12T17:00:10Z

I don't think that "MHDB0O" file is indexed.

That actually came from the demo dataverse, not Harvard dataverse. This one works great: https://demo.dataverse.org/api/search?q=id:datafile_1734017

For a file that is properly indexed, like the CCES file we've been talking about, you should be able to search for it by DOI like this (note the quotes around the DOI)

Thank you. This seems to work in the two examples below, with the quotes escaped

# CCES
dataverse_search(filePersistentId = "\"doi:10.7910/DVN/GDF6Z0/JPMOZZ\"", server = "dataverse.harvard.edu")$name

# demo.dataverse
dataverse_search(filePersistentId = "\"doi:10.70122/FK2/HXJVJU/SA3Z2V\"", server = "demo.dataverse.org")$name
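Putting the two cases together, the choice of search term could be sketched as below: numeric ids get the `datafile_` prefix, and anything else is treated as a file DOI and wrapped in quotes. The helper `file_query_term()` is hypothetical and is not the package's actual implementation; treating every non-numeric input as a DOI is a simplifying assumption.

```r
# Hypothetical helper, not the package's actual code.
# Returns the search term for either a numeric file id or a file DOI.
file_query_term <- function(file) {
  if (grepl("^[0-9]+$", as.character(file))) {
    # Numeric id: search the index by id:datafile_NNN
    paste0("id:datafile_", file)
  } else {
    # Assumed to be a file DOI: wrap it in literal double quotes
    paste0("filePersistentId:\"", file, "\"")
  }
}

file_query_term(3123547)
file_query_term("doi:10.7910/DVN/GDF6Z0/JPMOZZ")
```

When such a term is placed in a URL by hand, the double quotes need to be percent-encoded as %22, as in the URL quoted earlier in the thread.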

kuriwaki mentioned this issue Jan 7, 2022

Parsing error for some large Stata files #80

Closed

kuriwaki changed the title ~~Better method for is_ingest~~ Better detection test for whether a file is ingested Jan 7, 2022

pdurbin self-assigned this Jan 12, 2022

pdurbin added this to IQSS Team - In Progress 💻 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Jan 12, 2022

pdurbin removed their assignment Jan 12, 2022

pdurbin removed this from IQSS Team - In Progress 💻 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Jan 12, 2022

kuriwaki added a commit that referenced this issue Jan 13, 2022

First attempt for #113 (though see #113 (comment)), retry for #80

5ffbc11

kuriwaki added a commit that referenced this issue Jan 13, 2022

Implement suggestion to prepend with quotes or datafile_. #113 (comment)

e32d354

kuriwaki added a commit that referenced this issue Jan 13, 2022

Put error when search does not turn up anything in #113

2e95ba1

kuriwaki mentioned this issue Jan 13, 2022

CRAN 0.3.10 Better ingest detection method which solves #80 #114

Merged

kuriwaki added this to the CRAN 0.3.10 milestone Jan 13, 2022

kuriwaki closed this as completed in #114 Jan 13, 2022

pdurbin mentioned this issue Feb 4, 2022

Unable to retrieve an unpublished data file. #115

Closed

1 task

kuriwaki added a commit that referenced this issue Mar 10, 2022

Change search term for #113, following #115 (comment)

26cf55c
