
Unable to retrieve an unpublished data file. #115

Closed
1 task done
Tracked by #119
famuvie opened this issue Feb 4, 2022 · 12 comments


famuvie commented Feb 4, 2022

Please specify whether your issue is about:

  • a suggested code or documentation change, improvement to the code, or feature request

I'm having an issue while trying to access an unpublished data file, which requires an API token. Unfortunately, this makes the following code not reproducible; let me know if there is a way to build a reproducible example in this case.

The problem is that I cannot use any of the get_dataframe_by_* functions, due to an issue with is_ingested(), which seems unable to find the target file.
However, if I work around is_ingested(), I can retrieve the data, as shown in the example below.

library(dataverse)
packageVersion("dataverse")
#> [1] '0.3.10'

get_dataframe_by_id(
  fileid = 12930,
  dataset = "https://doi.org/10.18167/DVN1/8Z1ZI9"
)
#> Error in is_ingested(fileid, ...): File information not found on Dataverse API

# A successful read
server <- Sys.getenv("DATAVERSE_SERVER")
key <- Sys.getenv("DATAVERSE_KEY")
fileid = 12930
query <- list(format = "original")
u_part <- "access/datafile/"
u <- paste0(dataverse:::api_url(server), u_part, fileid)
r <- httr::GET(u, httr::add_headers(`X-Dataverse-key` = key), query = query)
httr::content(r)
#> Rows: 1347 Columns: 1
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): ;Vache;Date;Commune;eleveur;Troupeau;R_bursa;H_marginatum;I_ricinus...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1,347 × 1
#>    `;Vache;Date;Commune;eleveur;Troupeau;R_bursa;H_marginatum;I_ricinus;H_scupe…
#>    <chr>                                                                        
#>  1 1;200532 2248;26/02/2020;POPOLASCA;;;0;0;0;0;0;0;0;0;0                       
#>  2 2;200531 7320;26/02/2020;CASTIFAO;;;0;0;0;0;0;0;0;0;0                        
#>  3 3;200530 9555;26/02/2020;ZALANA;;;1;0;0;0;0;0;0;0;1                          
#>  4 4;200532 8365;26/02/2020;MOLTIFAO;;;0;0;0;0;0;0;0;0;0                        
#>  5 5;200533 1185;26/02/2020;PIEDIGRIGGIO;;;0;0;0;2;0;0;0;0;2                    
#>  6 6;200532 3312;26/02/2020;CORTE;;;0;0;0;0;0;0;0;0;0                           
#>  7 7;200531 0907;26/02/2020;BORGO;;;0;0;0;0;0;0;0;0;0                           
#>  8 8;200532 2246;26/02/2020;POPOLASCA;;;0;0;0;0;0;0;0;0;0                       
#>  9 9;200530 8506;26/02/2020;CORTE;;;1;0;0;0;0;0;0;0;1                           
#> 10 10;200532 2245;26/02/2020;POPOLASCA;;;0;0;0;0;0;0;0;0;0                      
#> # … with 1,337 more rows
sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Linux Mint 20.1
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
#>  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_GB.UTF-8   
#>  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] rstudioapi_0.13 knitr_1.37      magrittr_2.0.2  rlang_1.0.0    
#>  [5] fastmap_1.1.0   fansi_0.5.0     stringr_1.4.0   styler_1.4.1   
#>  [9] highr_0.9       tools_4.1.2     xfun_0.29       utf8_1.2.2     
#> [13] cli_3.1.0       withr_2.4.2     htmltools_0.5.2 ellipsis_0.3.2 
#> [17] yaml_2.2.2      digest_0.6.29   tibble_3.1.6    lifecycle_1.0.1
#> [21] crayon_1.4.2    purrr_0.3.4     vctrs_0.3.8     fs_1.5.2       
#> [25] glue_1.6.1      evaluate_0.14   rmarkdown_2.11  reprex_2.0.1   
#> [29] stringi_1.7.6   compiler_4.1.2  pillar_1.6.4    backports_1.4.1
#> [33] pkgconfig_2.0.3
famuvie commented Feb 4, 2022

Ultimately, the problem in is_ingested() boils down to dataverse_search() not finding the file:

library(dataverse)
server <- Sys.getenv("DATAVERSE_SERVER")
key <- Sys.getenv("DATAVERSE_KEY")
dataverse_search(id = "datafile_12930", type = "file", server = server, key = key)
#> 0 of 0 results retrieved
#> list()

Created on 2022-02-04 by the reprex package (v2.0.1)

It is worth noting that I can find the file using some keywords on the web interface.

By contrast, dataverse_search() does correctly find a published file.

pdurbin commented Feb 4, 2022

Hmm, because the file is in draft, I bet _draft would need to be appended like this:

id = "datafile_12930_draft"

@famuvie do you want to see if you can find your draft file that way with curl? You'll have to pass your API token. Docs on this are at https://guides.dataverse.org/en/5.9/api/search.html

@kuriwaki this might also work:

entityId:12930

An example: https://dataverse.harvard.edu/api/search?q=entityId:3371438

(I'm not sure why I suggested id instead of entityId at #113 (comment). The id changes (_draft is dropped on publish) but entityId stays the same.)
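The difference between the two query styles can be sketched in shell. The file id is the one from this thread, and the commands only build the query strings rather than hitting a server:

```shell
# File id from this thread, used for illustration.
FILE_ID=12930

# Before publication, the Search API indexes the file under an id with a
# "_draft" suffix; the suffix is dropped on publish.
DRAFT_QUERY="id:datafile_${FILE_ID}_draft"

# entityId is stable across draft and published states.
ENTITY_QUERY="entityId:${FILE_ID}"

echo "$DRAFT_QUERY"   # id:datafile_12930_draft
echo "$ENTITY_QUERY"  # entityId:12930
```

Either string can then be passed as the q= parameter of /api/search.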

famuvie commented Feb 4, 2022

Not sure how to pass the API token with curl, but it works with dataverse_search():

library(dataverse)
server <- Sys.getenv("DATAVERSE_SERVER")
key <- Sys.getenv("DATAVERSE_KEY")
dataverse_search(id = "datafile_12930_draft", type = "file", server = server, key = key)
#> 1 of 1 result retrieved
#>                   name type
#> 1 Bovine_2020_2021.tab file
#>                                                    url file_id
#> 1 https://dataverse.cirad.fr/api/access/datafile/12930   12930
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              description
#> 1 On this document, there is only the tick data collection between 2020 and 2021.\n\nSome information about variable :\n\n- Vache : Identifier of cow (10 digits)\n- Date : date of slaughterhouse visit\n- Commune : origin cow municipality\n- Eleveur : origin cow breeder\n- Troupeau : municipality and breeder\n- H_marginatus : number of *H_marginatus* collected\n- R_bursa : number of *R_bursa* collected\n- I_ricinus : number of *I_ricinus* collected\n- H_scupense : number of *H_scupense* collected\n- B_annulatus : number of *B_annulatus* collected\n- R_sanguineus : number of *R_sanguineus* collected\n- H_punctata : number of *H_punctata* collected.\n- D_marginatus : number of *D_marginatus* collected\n- Tiques ? : sum of ticks collected
#>       file_type         file_content_type size_in_bytes
#> 1 Tab-Delimited text/tab-separated-values         69648
#>                                md5 checksum.type
#> 1 688c6fc5f92e6526a3cd158854027e8b           MD5
#>                     checksum.value                            unf dataset_name
#> 1 688c6fc5f92e6526a3cd158854027e8b UNF:6:lt7ZJ1diuShhMCd8UWq5zQ==       Bovine
#>   dataset_id    dataset_persistent_id
#> 1      12928 doi:10.18167/DVN1/8Z1ZI9
#>                                                                                                                                         dataset_citation
#> 1 Bartholomee, Colombine, 2022, "Bovine", https://doi.org/10.18167/DVN1/8Z1ZI9, CIRAD Dataverse, DRAFT VERSION, UNF:6:ov2odYXNktsIbiuwc2MDJQ== [fileUNF]

Created on 2022-02-04 by the reprex package (v2.0.1)

pdurbin commented Feb 4, 2022

> I'm not sure how to pass the API token with curl. I'll check.

You can pass it as a header or a query parameter. Please see https://guides.dataverse.org/en/5.9/api/auth.html
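For reference, the two ways of passing the token can be sketched as below. The server and token values are placeholders, and the curl commands are only built and echoed, not executed:

```shell
SERVER="https://demo.dataverse.org"   # placeholder server
API_TOKEN="xxxxxxxx"                  # placeholder; never commit a real token

# Option 1: pass the token as a request header (recommended).
HEADER_CMD="curl -H X-Dataverse-key:${API_TOKEN} ${SERVER}/api/search?q=id:datafile_12930_draft"

# Option 2: pass it as a key= query parameter.
PARAM_CMD="curl ${SERVER}/api/search?q=id:datafile_12930_draft&key=${API_TOKEN}"

echo "$HEADER_CMD"
echo "$PARAM_CMD"
```

The header form keeps the token out of server access logs that record request URLs.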

famuvie commented Feb 4, 2022

Sorry, I made a mistake in the previous example and have just corrected it. It actually works!

famuvie commented Feb 4, 2022

Still, I can't find a hacky way of adding "_draft" to the file id. I guess that needs to be fixed in the package :)

kuriwaki commented

Thanks @famuvie for creating an issue. A partial fix is now on dev.
@pdurbin, thanks for pointing out entityId. I implemented it on dev as there seems to be no downside.

I created a test dataset on the demo dataverse that is intentionally unpublished. The get commands seem to work, except that my unpublished test file does not have a UNF under the Search API even though it does under the File API. Have you seen this before?

Proper UNF detection matters because that is how the package currently determines whether a file is ingested or not.

> str(dataset_files(dataset = "10.70122/FK2/4XHVAP", server = "demo.dataverse.org")[[1]]$dataFile)
List of 16
 $ id                 : int 1951382
 $ persistentId       : chr ""
 $ pidURL             : chr ""
 $ filename           : chr "mtcars.tab"
 $ contentType        : chr "text/tab-separated-values"
 $ filesize           : int 1713
 $ storageIdentifier  : chr "s3://demo-dataverse-org:17f75571af3-60325bcbb1f1"
 $ originalFileFormat : chr "text/csv"
 $ originalFormatLabel: chr "Comma Separated Values"
 $ originalFileSize   : int 1700
 $ originalFileName   : chr "mtcars.csv"
 $ UNF                : chr "UNF:6:KRE/AItWGJWd5tJ+bboN7A=="
 $ rootDataFileId     : int -1
 $ md5                : chr "c502359c26a0931eef53b2207b2344f9"
 $ checksum           :List of 2
  ..$ type : chr "MD5"
  ..$ value: chr "c502359c26a0931eef53b2207b2344f9"
 $ creationDate       : chr "2022-03-10"
> str(dataverse_search(entityId = 1951382, server = "demo.dataverse.org", key = Sys.getenv("DATAVERSE_KEY")))
1 of 1 result retrieved
'data.frame':	1 obs. of  13 variables:
 $ name                 : chr "mtcars.csv"
 $ type                 : chr "file"
 $ url                  : chr "https://demo.dataverse.org/api/access/datafile/1951382"
 $ file_id              : chr "1951382"
 $ file_type            : chr "Comma Separated Values"
 $ file_content_type    : chr "text/csv"
 $ size_in_bytes        : int 1700
 $ md5                  : chr "c502359c26a0931eef53b2207b2344f9"
 $ checksum             :'data.frame':	1 obs. of  2 variables:
  ..$ type : chr "MD5"
  ..$ value: chr "c502359c26a0931eef53b2207b2344f9"
 $ dataset_name         : chr "Permanent draft dataset for testing"
 $ dataset_id           : chr "1951381"
 $ dataset_persistent_id: chr "doi:10.70122/FK2/4XHVAP"
 $ dataset_citation     : chr "Kuriwaki, Shiro, 2022, \"Permanent draft dataset for testing\", https://doi.org/10.70122/FK2/4XHVAP, Demo Datav"| __truncated__
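The ingestion check described above can be approximated in plain shell. The JSON literal below is a trimmed stand-in for one Search API item, not real API output:

```shell
# Trimmed stand-in for one "items" entry from the Search API; a freshly
# uploaded draft file may lack the "unf" key until it is reindexed.
ITEM='{"name":"mtcars.csv","type":"file","file_id":"1951382"}'

# Crude substring check for the "unf" key (a JSON parser such as jq would
# be more robust; this keeps the sketch dependency-free).
case "$ITEM" in
  *'"unf"'*) STATUS="ingested (UNF present)" ;;
  *)         STATUS="no UNF: draft may be awaiting reindex" ;;
esac
echo "$STATUS"   # no UNF: draft may be awaiting reindex
```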

pdurbin commented Mar 11, 2022

> The get commands seem to go ok except for my unpublished test file does not have a UNF under the SEARCH API even though it does with the File API.

Huh. This is news to me but I see what you mean.

No UNF from the Search API when I look at your unpublished file...

curl -H X-Dataverse-key:$API_TOKEN https://demo.dataverse.org/api/search?q=id:datafile_1951382_draft

{
  "status": "OK",
  "data": {
    "q": "id:datafile_1951382_draft",
    "total_count": 1,
    "start": 0,
    "spelling_alternatives": {},
    "items": [
      {
        "name": "mtcars.csv",
        "type": "file",
        "url": "https://demo.dataverse.org/api/access/datafile/1951382",
        "file_id": "1951382",
        "file_type": "Comma Separated Values",
        "file_content_type": "text/csv",
        "size_in_bytes": 1700,
        "md5": "c502359c26a0931eef53b2207b2344f9",
        "checksum": {
          "type": "MD5",
          "value": "c502359c26a0931eef53b2207b2344f9"
        },
        "dataset_name": "Permanent draft dataset for testing",
        "dataset_id": "1951381",
        "dataset_persistent_id": "doi:10.70122/FK2/4XHVAP",
        "dataset_citation": "Kuriwaki, Shiro, 2022, \"Permanent draft dataset for testing\", https://doi.org/10.70122/FK2/4XHVAP, Demo Dataverse, DRAFT VERSION"
      }
    ],
    "count_in_response": 1
  }
}

... but when I look at a published file (different server but shouldn't matter), I do see a UNF:

curl https://dataverse.harvard.edu/api/search?q=id:datafile_3371438
{
  "status": "OK",
  "data": {
    "q": "id:datafile_3371438",
    "total_count": 1,
    "start": 0,
    "spelling_alternatives": {},
    "items": [
      {
        "name": "2019-02-25.tab",
        "type": "file",
        "url": "https://dataverse.harvard.edu/api/access/datafile/3371438",
        "file_id": "3371438",
        "description": "",
        "published_at": "2019-02-26T03:03:13Z",
        "file_type": "Tab-Delimited",
        "file_content_type": "text/tab-separated-values",
        "size_in_bytes": 17232,
        "md5": "9bd94d028049c9a53bca9bb19d4fb57e",
        "checksum": {
          "type": "MD5",
          "value": "9bd94d028049c9a53bca9bb19d4fb57e"
        },
        "unf": "UNF:6:2MMoV8KKO8R7sb27Q5GXtA==",
        "file_persistent_id": "doi:10.7910/DVN/TJCLKP/3VSTKY",
        "dataset_name": "Open Source at Harvard",
        "dataset_id": "3035124",
        "dataset_persistent_id": "doi:10.7910/DVN/TJCLKP",
        "dataset_citation": "Durbin, Philip, 2017, \"Open Source at Harvard\", https://doi.org/10.7910/DVN/TJCLKP, Harvard Dataverse, DRAFT VERSION, UNF:6:2MMoV8KKO8R7sb27Q5GXtA== [fileUNF]"
      }
    ],
    "count_in_response": 1
  }
}

Perhaps we don't reindex the file after ingest is complete? I'm not sure. You could test this by making a change to your draft dataset metadata (add a keyword or something). This will reindex the dataset and its files.
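One scripted way to force that reindex is to edit the draft dataset's metadata via the native API. The editMetadata endpoint and the JSON filename below are recalled from the Dataverse API guide and should be verified against your server version; the command is only built and echoed here, not executed:

```shell
SERVER="https://demo.dataverse.org"     # placeholder
API_TOKEN="xxxxxxxx"                    # placeholder
PID="doi:10.70122/FK2/4XHVAP"           # the demo draft dataset from this thread

# Any metadata change re-saves the draft and triggers reindexing of the
# dataset and its files; metadata-update.json would hold the edited fields.
REINDEX_CMD="curl -H X-Dataverse-key:${API_TOKEN} -X PUT ${SERVER}/api/datasets/:persistentId/editMetadata/?persistentId=${PID}&replace=true --upload-file metadata-update.json"

echo "$REINDEX_CMD"
```

Adding a keyword through the web interface, as suggested above, achieves the same effect without any scripting.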

kuriwaki commented

Yes! It was sufficient to add a data description to the draft dataset, and it somehow updated. Thank you.

pdurbin commented Mar 14, 2022

@kuriwaki hmm, I can replicate this on "develop" on my laptop (around 0d853b74e9). When I first upload a file to a draft, the UNF does not appear in search results...

$ curl -s -H X-Dataverse-key:$API_TOKEN http://localhost:8080/api/search?q=id:datafile_5_draft | jq .
{
  "status": "OK",
  "data": {
    "q": "id:datafile_5_draft",
    "total_count": 1,
    "start": 0,
    "spelling_alternatives": {},
    "items": [
      {
        "name": "2016-06-29.csv",
        "type": "file",
        "url": "http://localhost:8080/api/access/datafile/5",
        "file_id": "5",
        "file_type": "Comma Separated Values",
        "file_content_type": "text/csv",
        "size_in_bytes": 58690,
        "md5": "d5de092a84304a9965c787b8dcd27c99",
        "checksum": {
          "type": "MD5",
          "value": "d5de092a84304a9965c787b8dcd27c99"
        },
        "dataset_name": "zzz",
        "dataset_id": "4",
        "dataset_persistent_id": "doi:10.5072/FK2/JJK8WY",
        "dataset_citation": "Admin, Dataverse, 2022, \"zzz\", https://doi.org/10.5072/FK2/JJK8WY, Root, DRAFT VERSION"
      }
    ],
    "count_in_response": 1
  }
}

... but if I edit the metadata of the draft dataset (forcing the file to be reindexed), the UNF appears:

$ curl -s -H X-Dataverse-key:$API_TOKEN http://localhost:8080/api/search?q=id:datafile_5_draft | jq .
{
  "status": "OK",
  "data": {
    "q": "id:datafile_5_draft",
    "total_count": 1,
    "start": 0,
    "spelling_alternatives": {},
    "items": [
      {
        "name": "2016-06-29.tab",
        "type": "file",
        "url": "http://localhost:8080/api/access/datafile/5",
        "file_id": "5",
        "file_type": "Tab-Delimited",
        "file_content_type": "text/tab-separated-values",
        "size_in_bytes": 59208,
        "md5": "d5de092a84304a9965c787b8dcd27c99",
        "checksum": {
          "type": "MD5",
          "value": "d5de092a84304a9965c787b8dcd27c99"
        },
        "unf": "UNF:6:6YVg+pUWsYD52stDkZuzUA==",
        "dataset_name": "zzzyyy",
        "dataset_id": "4",
        "dataset_persistent_id": "doi:10.5072/FK2/JJK8WY",
        "dataset_citation": "Admin, Dataverse, 2022, \"zzzyyy\", https://doi.org/10.5072/FK2/JJK8WY, Root, DRAFT VERSION, UNF:6:6YVg+pUWsYD52stDkZuzUA== [fileUNF]"
      }
    ],
    "count_in_response": 1
  }
}

Please feel free to open an issue about this at https://github.com/IQSS/dataverse/issues if you'd like.

kuriwaki commented

I will add a tip about this to the dataverse download vignette. I think this limitation is likely common for people who try to download draft datasets, but the current workaround of editing something in the dataset metadata is not too onerous.

kuriwaki mentioned this issue Apr 9, 2022
kuriwaki added a commit that referenced this issue Jun 10, 2022
kuriwaki mentioned this issue Jun 11, 2022
kuriwaki commented

Addressed in version 0.3.11.
