Permalink
Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
388 lines (308 sloc) 14.9 KB
---
title: "o2r corpus - showcase upload"
author: "Daniel Nüst"
---
## Required libraries
```{r libs}
library("httr")
library("curl")
library("stringr")
library("jsonlite")
```
## Corpus preparation
This document contains some script to upload the workspaces making up the o2r test corpus to an instance of the o2r reproducibility service.
**Note:** This document is best run chunk by chunk, and not as a whole, using RStudio.
The path to the corpus directory must be set for any code to work.
You can add a file `.Renviron` on your local machine next to this file, and in it define the path in the environment variable `ERC_EXAMPLES_CORPUS_PATH`.
```{r}
if(!exists("corpus_path")) {
corpus_path <- Sys.getenv(x = "ERC_EXAMPLES_CORPUS_PATH", unset = NA)
if (is.na(corpus_path))
stop("Corpus path must be set!")
}
cat("Loading copus papers from", corpus_path, "\n")
```
At the path `r corpus_path` a directory `Finished` is expected with one directory per successfully reproduced paper.
Inside the paper's directory, there must be an archive file called `workspace.zip`.
In the workspace archive, there must be an R Markdown document, i.e. a file with file extension `.Rmd`.
This file must be a suitable ERC main document for upload to the o2r reference implementation.
## Start o2r reference implementation
This is optional if you want to upload corpus papers to a remote deployment of the reference implementation.
```{bash, eval = FALSE}
git clone https://github.com/o2r-project/reference-implementation.git
cd reference-implementation
make hub
```
**Now log in as a "regular user" at http://localhost.**
## Get user cookie for authentication
The following code chunk retrieves the cookie from the local reference implementation, which is exposed via [o2r-guestlister]().
This is a "security hole" which of course does not work when uploading workspaces to a remote reference implementation deployment.
In case of a remote service, log in, find the cookie `connec.sid` for the platform website in your browser (e.g. using debug tools, `F12`), and manually define the variable `o2r_user_cookie` with the value of said cookie.
Depending on the used browser, you must decode the string so that is starts with `s:`.
```{r cookie_local, eval=FALSE}
o2r_api_endpoint <- "http://localhost/api/v1/"
user_cookie_url <- "http://localhost/oauth/cookie/0000-0001-6225-344X"
# get cookie via API from o2r-guestlister, but user must login manually
if (!is.null(content(GET(user_cookie_url))$error))
invisible(readline(prompt = "Login as 'User' at http://localhost, then press [enter] to continue"))
if (!is.null(content(GET(user_cookie_url))$error) && !interactive())
stop("Login as 'User' at http://localhost before rendering the document.")
o2r_user_cookie <- curl_unescape(content(GET(user_cookie_url))$cookie)
# clear cookies (can be required for manual testing and when the cookie changes)
httr::handle_reset(o2r_api_endpoint)
```
```{r cookie_manual, eval=FALSE}
o2r_api_endpoint <- "https://o2r.uni-muenster.de/api/v1/"
o2r_user_cookie <- URLdecode("s%3A1gULF5iJHqZdzTwXeU77N633xOXoU-ft.c22tDacKgXinGLZjovY%2Bz70IY0o0svHDFbYA3cAooXQ")
```
```{r log_endpoint_cookie}
cat("Endpoint ", o2r_api_endpoint, "\n")
cat("Cookie ", o2r_user_cookie, "\n")
```
## Finished papers
**TODO**: the finished paper archives should be published somewhere, e.g. on Zenodo.
```{r finished_papers}
finished_papers_path <- file.path(corpus_path, "Finished")
list.files(finished_papers_path)
```
In each paper there is a ZIP file with the reproduced workspace.
Ideally, there is _one_ R Markdown file and _one_ HTML file at the top level of these ZIP files.
```{r zip_files}
finished_papers_repro_zips <- list.files(path = finished_papers_path,
#pattern = "Reproduced.*zip",
pattern = "workspace.zip",
recursive = TRUE,
full.names = TRUE)
```
`r length(finished_papers_repro_zips)` papers have a reproduced workspace.
_Which of the reproduced workspace contains an R Markdown file?_
```{r zips_with_rmd}
has_rmd <- sapply(X = finished_papers_repro_zips, FUN = function(zipfile) {
files_in_zip <- unzip(zipfile = zipfile, list = TRUE)
any(grepl(pattern = "main\\.Rmd", x = files_in_zip, ignore.case = TRUE))
})
finished_papers_repro_zips[has_rmd]
```
`r length(finished_papers_repro_zips[has_rmd])` papers have a reproduced workspace with an R Markdown file.
_Which of the reproduced workspaces can be uploaded, published, and a job executed?_
See notes [here](https://docs.google.com/spreadsheets/d/1WG8Ab5EKhb4KcZNLpveWwbvqlZ41Gjj7l1LccKyBla4/edit#gid=406313660).
```{r list_of_green_workspaces}
green_papers <- finished_papers_repro_zips[has_rmd][c(1, # Bayesian - check fails
2, # Aquestiondriven - check fails
4, # gastropod
8, # metafor
13,# flood risk
14,# geochronological
16, # INSYDE
22, # Fourier
26, # Timescales
27, # survival
28, # Underestimation - success with no diffs
11, # lubridate - success with no diffs
24, # split-apply-combine
25, # tidy data
# to check:
19, # LOESS - execution takes very long
5, # landslide - recently updated by DN, not waited yet to finish
7, # calibrating norway
18 # vector machines
# need fix:
#10, # CPLFD # ERROR saving new compendium: Error: key /tmp/o2r/compendium/B8c0H/r/nc/TmaxForYears.nc must not contain '.' > o2r-meta
)]
```
Uploading `r length(green_papers)`:
```{r output_green_papers, echo=FALSE}
green_papers
```
## Upload corpus papers as workspaces
```{r upload_functions}
relativePath = function(p) {
str_replace(p, paste0(path.expand(corpus_path), "/"), "")
}
upload_workspace_zip <- function(file, cookie) {
file_relative <- relativePath(file)
startTime <- Sys.time()
cat(format(startTime), "Uploading ", file_relative, "\n")
# https://o2r.info/o2r-web-api/compendium/upload/#upload-via-api
multipart_file <- upload_file(file)
response <- POST(url = paste0(o2r_api_endpoint, "compendium"),
body = list(compendium = multipart_file,
content_type = "workspace"),
accept_json(),
set_cookies(connect.sid = cookie),
encode = "multipart")
#str(response)
response_content <- content(response)
if (is.list(response_content) && !is.null(response_content$id)) {
cat(format(Sys.time()), "Uploaded ", file_relative, ": ",
response_content$id,
paste0("(", format(Sys.time() - startTime), ")"), "\n")
result <- response_content$id
} else {
cat(format(Sys.time()), "Error uploading ", file_relative, "\n",
toString(response_content), "\n")
result <- NA
}
names(result) <- file_relative
return(result)
}
```
```{r upload}
finished_papers_uploads <- lapply(X = green_papers,
FUN = upload_workspace_zip,
cookie = o2r_user_cookie)
```
You can now go to the o2r UI and log in as "User" to edit compendium candidate's metadata.
If no metadata editing is required, you can attempt to publish all candidates with the code in the next section.
## Publish corpus papers
```{r publish_function}
# publish candidates with direct copy of the metadata
publish_compendium <- function(id, cookie, sleep_after_upload_secs = 0) {
if (is.na(id)) {
warning("Cannot publish compendium from ", names(id), "\n")
return(NA)
}
cat(format(Sys.time()), "Publishing ", id, paste0(" (", relativePath(names(id)) ,")"), "...")
# get metadata
response <- GET(url = paste0(o2r_api_endpoint, "compendium/", id, "/metadata"),
accept_json(),
set_cookies(connect.sid = cookie))
metadata <- content(response, as = "text")
# avoid parsing and recoding because of unboxing issues
# extract "o2r" element from metadata
metadata <- str_sub(string = metadata,
start = str_locate(string = metadata, pattern = "\\{\"o2r\"")[[1]],
end = str_length(metadata) - 1)
#cat("Sending metadata:\n", metadata, "\n")
# update metadata
response_update <- PUT(url = paste0(o2r_api_endpoint, "compendium/", id, "/metadata"),
body = metadata,
content_type_json(),
accept_json(),
set_cookies(connect.sid = cookie))
cat(format(Sys.time()), "Published? ", status_code(response_update), "\n")
if (status_code(response_update) != 200) {
cat("Response:", toString(content(response_update)), "\n")
}
result <- status_code(response_update)
names(result) <- id
Sys.sleep(time = sleep_after_upload_secs)
return(result)
}
```
Use loop to publish one paper after the other, and take a little break in between.
```{r publish}
finished_papers_publishings <- c()
for (candidate_id in finished_papers_uploads) {
result <- publish_compendium(candidate_id, cookie = o2r_user_cookie, sleep_after_upload_secs = 2)
if (!is.na(result))
finished_papers_publishings <- c(finished_papers_publishings, result)
}
```
If not all workspaces could be published automatically, go to [the author page](http://localhost/#!/author/0000-0001-6225-344X) to edit the candidate compendia.
Because of a race condition in `o2r-meta` ([#104](https://github.com/o2r-project/o2r-meta/issues/104)), we might get some `HTTP 500` errors.
Let's try to publish those again.
```{r retry_500_errors}
publish_had_500_error <- names(finished_papers_publishings[finished_papers_publishings == 500])
for (candidate_id in publish_had_500_error) {
result <- publish_compendium(candidate_id, cookie = o2r_user_cookie, sleep_after_upload_secs = 2)
# overwrite status
finished_papers_publishings[candidate_id] <- result
}
```
_How many candidates are still not published now?_
```{r publish_result}
length(finished_papers_publishings[finished_papers_publishings != 200])
```
## Start a job for each paper
```{r job_functions}
start_job <- function(id, cookie) {
cat(format(Sys.time()), "Starting job for ", id, "\n")
# get metadata
response <- POST(url = paste0(o2r_api_endpoint, "job"),
body = list(compendium_id = id),
accept_json(),
set_cookies(connect.sid = cookie))
content <- content(response)
if (is.null(content$job_id)) {
cat(format(Sys.time()), "Error starting job: ", toString(content), "\n")
result <- paste0("Error: ", toString(content))
} else {
cat(format(Sys.time()), "Job for ", id, "is", content$job_id, "\n")
result <- content$job_id
}
names(result) <- paste("compendium:", id)
return(result)
}
#start_job(names(finished_papers_publishings)[[2]], o2r_user_cookie)
job_status <- function(job_id, cookie) {
response <- GET(url = paste0(o2r_api_endpoint, "job/", job_id),
accept_json(),
# authenticate even if not needed to not destroy cookie caching
set_cookies(connect.sid = cookie))
job <- content(response)
return(job$status)
}
finish_job <- function(id, cookie) {
job_id <- start_job(id, cookie)
if (str_detect(string = toString(job_id), pattern = "Error")) {
final_status = NA
} else {
cat(format(Sys.time()), "Waiting for job ", job_id)
while (job_status(job_id, cookie) == "running") {
cat(".")
Sys.sleep(time = 10)
}
cat("\n")
final_status <- job_status(job_id, cookie)
cat(format(Sys.time()), "Job ", job_id, "ended with", final_status, "\n")
}
names(final_status) <- job_id
return(final_status)
}
```
Run the jobs, one at a time.
```{r jobs}
finished_papers_jobs <- c()
for (id in names(finished_papers_publishings)) {
result <- finish_job(id, cookie = o2r_user_cookie)
finished_papers_jobs <- c(finished_papers_jobs, result)
}
summary(as.factor(finished_papers_jobs))
```
## Fixing errors & testing
The following code chunk can be used to upload, publish, and execute a single compendium from the list `finished_papers_repro_zips[has_rmd]` for debugging purposes.
For the function to work, run the chunks above to load finished papers and the three `.._function` chunks.
```{r single_upload, eval=FALSE}
process_workspace <- function(workspaceIndex, the_list = finished_papers_repro_zips[has_rmd]) {
id <- upload_workspace_zip(the_list[[workspaceIndex]], cookie = o2r_user_cookie)
publish_result <- publish_compendium(id, o2r_user_cookie)
job_result <- finish_job(names(publish_result), o2r_user_cookie)
cat("Uploaded ", finished_papers_repro_zips[has_rmd][[workspaceIndex]], " as ", id,
"\nand published resulting in ", publish_result, "\nand ran job ", names(job_result), " resulting in ", job_result)
}
#process_workspace(19)
```
## Testing the metadata extraction
The following commands are assumed to run in the repository directory of [`o2r-meta`](https://github.com/o2r-project/o2r-meta).
```
/o2r-meta$ python3 o2rmeta.py -debug extract -i ~/ownCloud/o2r-data/Korpus/Reproducing\ papers/Finished/Aspacetimemodel/workspace -o /tmp/Korpus -b ~/ownCloud/o2r-data/Korpus/Reproducing\ papers/Finished/Aspacetimemodel/workspace
/o2r-meta$ ls /tmp/Korpus/
erc_spec.pdf metadata_raw.json
/o2r-meta$ cat /tmp/Korpus/metadata_raw.json
```
Make sure `mainfile` and `displayfile` in raw metadata are correct.
You can also broker and check the o2r metadata:
```
$ python3 o2rmeta.py -debug broker -i /tmp/Korpus/metadata_raw.json -m broker/mappings/o2r-map.json -o /tmp/Korpus
$ ls /tmp/Korpus/
erc_spec.pdf metadata_o2r_1.json metadata_raw.json package_slip.json
$ cat /tmp/Korpus/metadata_o2r.json
$ python3 o2rmeta.py -debug validate -s schema/json/o2r-meta-schema.json -c /tmp/Korpus/metadata_o2r_1.json
```
The last call must show a "valid" metadata file for publishing to work.
## Testing the manifest generation
```{r manifests, eval=FALSE}
library("containerit")
workspace <- here("Finished/spacetime - spatio temporal data in R/workspace")
```