corpus/showcases.Rmd

---
title: "o2r corpus - showcase upload"
author: "Daniel Nüst"
---

## Required libraries

```{r libs}
library("httr")
library("curl")
library("stringr")
library("jsonlite")
library("rjson")
```


## Corpus preparation

This document contains some script to upload the workspaces making up the o2r test corpus to an instance of the o2r reproducibility service.

**Note:** This document is best run chunk by chunk, and not as a whole, using RStudio.

The path to the corpus directory must be set for any code to work.
You can add a file `.Renviron` on your local machine next to this file, and in it define the path in the environment variable `ERC_EXAMPLES_CORPUS_PATH`.

```{r}
corpus_path <- "../ERC"
cat("Loading copus papers from", corpus_path, "\n")
bindings <- rjson::fromJSON(file = "bindings.json")
```

At the path `r corpus_path` a directory `Finished` is expected with one directory per successfully reproduced paper.
Inside the paper's directory, there must be an archive file called `workspace.zip`.
In the workspace archive, there must be an R Markdown document, i.e. a file with file extension `.Rmd`.
This file must be a suitable ERC main document for upload to the o2r reference implementation.

## Start o2r reference implementation

This is optional if you want to upload corpus papers to a remote deployment of the reference implementation.

```{bash, eval = FALSE}
#git clone https://github.com/o2r-project/reference-implementation.git
#cd reference-implementation
#make hub
```

**Now log in as a "regular user" at http://localhost.**

## Get user cookie for authentication

The following code chunk retrieves the cookie from the local reference implementation, which is exposed via [o2r-guestlister]().
This is a "security hole" which of course does not work when uploading workspaces to a remote reference implementation deployment.
In case of a remote service, log in, find the cookie `connec.sid` for the platform website in your browser (e.g. using debug tools, `F12`), and manually define the variable `o2r_user_cookie` with the value of said cookie.
Depending on the used browser, you must decode the string so that is starts with `s:`.

```{r cookie_local, eval=FALSE}
#o2r_api_endpoint <- "http://localhost/api/v1/"
#user_cookie_url <- "http://localhost/oauth/cookie/0000-0001-6225-344X"
o2r_api_endpoint <- "https://o2r.uni-muenster.de/api/v1/"
o2r_user_cookie <- URLdecode("s%3Al0yBqmf0rMVcSjyCjfy_sj2oCCPEvkxx.lXe5WBzXQIwAUW5xTqojOW8hLfxKB8dCntB8vpdidnA")
# get cookie via API from o2r-guestlister, but user must login manually
#if (!is.null(content(GET(user_cookie_url))$error))
#  invisible(readline(prompt = "Login as 'User' at http://localhost, then press [enter] to continue"))
#if (!is.null(content(GET(user_cookie_url))$error) && !interactive())   
#  stop("Login as 'User' at http://localhost before rendering the document.")
#o2r_user_cookie <- curl::curl_unescape(content(GET(user_cookie_url))$cookie)
# clear cookies (can be required for manual testing and when the cookie changes)
httr::handle_reset(o2r_api_endpoint)
```

```{r cookie_manual, eval=FALSE}
#o2r_api_endpoint <- "https://o2r.uni-muenster.de/api/v1/"
#o2r_user_cookie <- URLdecode("s%3AJJr2hI0Q3ffgU9ZcB55yJIRfJTvUEbKh.GJbmCuqLeUrj1D8zGqHlY1ndRDUOj9jA1QDnthgTYz8")
```

```{r log_endpoint_cookie}
cat("Endpoint    ", o2r_api_endpoint, "\n")
cat("Cookie      ", o2r_user_cookie, "\n")
```

## Finished papers

**TODO**: the finished paper archives should be published somewhere, e.g. on Zenodo.

```{r finished_papers}
finished_papers_path <- file.path(corpus_path, "Finished")
list.files(finished_papers_path)
```

In each paper there is a ZIP file with the reproduced workspace.
Ideally, there is _one_ R Markdown file and _one_ HTML file at the top level of these ZIP files.

```{r zip_files}
finished_papers_repro_zips <- list.files(path = finished_papers_path,
                                         #pattern = "Reproduced.*zip",
                                         pattern = "workspace.zip",
                                         recursive = TRUE,
                                         full.names = TRUE)
```

`r length(finished_papers_repro_zips)` papers have a reproduced workspace.

_Which of the reproduced workspace contains an R Markdown file?_

```{r zips_with_rmd}
has_rmd <- sapply(X = finished_papers_repro_zips, FUN = function(zipfile) {
  files_in_zip <- unzip(zipfile = zipfile, list = TRUE)
  any(grepl(pattern = "main\\.Rmd", x = files_in_zip, ignore.case = TRUE))
})
finished_papers_repro_zips[has_rmd]
```

`r length(finished_papers_repro_zips[has_rmd])` papers have a reproduced workspace with an R Markdown file.

_Which of the reproduced workspaces can be uploaded, published, and a job executed?_

See notes [here](https://docs.google.com/spreadsheets/d/1WG8Ab5EKhb4KcZNLpveWwbvqlZ41Gjj7l1LccKyBla4/edit#gid=406313660).

```{r list_of_green_workspaces}
green_papers <- finished_papers_repro_zips[has_rmd]
```

Uploading `r length(green_papers)`:

```{r output_green_papers, echo=FALSE}
green_papers
```

## Upload corpus papers as workspaces

```{r upload_functions}
relativePath = function(p) {
  str_replace(p, paste0(path.expand(corpus_path), "/"), "")
}
upload_workspace_zip <- function(file, cookie) {
  file_relative <- relativePath(file)
  startTime <- Sys.time()
  cat(format(startTime), "Uploading ", file_relative, "\n")
  # https://o2r.info/o2r-web-api/compendium/upload/#upload-via-api
  multipart_file <- upload_file(file)
  response <- POST(url = paste0(o2r_api_endpoint, "compendium"),
                   body = list(compendium = multipart_file,
                               content_type = "workspace"),
                   accept_json(),
                   set_cookies(connect.sid = cookie),
                   encode = "multipart")
  #str(response)
  response_content <- content(response)
  
  if (is.list(response_content) && !is.null(response_content$id)) {
    cat(format(Sys.time()), "Uploaded  ", file_relative, ": ",
        response_content$id,
        paste0("(", format(Sys.time() - startTime), ")"), "\n")
    result <- response_content$id
  } else {
    cat(format(Sys.time()), "Error uploading ", file_relative, "\n",
        toString(response_content), "\n")
    result <- NA
  }
  names(result) <- file_relative
  return(result)
}
```
  
```{r upload}
finished_papers_uploads <- lapply(X = green_papers,
                                  FUN = upload_workspace_zip,
                                  cookie = o2r_user_cookie)
```

You can now go to the o2r UI and log in as "User" to edit compendium candidate's metadata.
If no metadata editing is required, you can attempt to publish all candidates with the code in the next section.

## Publish corpus papers

```{r publish_function}
# publish candidates with direct copy of the metadata
publish_compendium <- function(id, cookie, sleep_after_upload_secs = 0) {
  if (is.na(id)) {
    warning("Cannot publish compendium from ", names(id), "\n")
    return(NA)
  }
  
  cat(format(Sys.time()), "Publishing ", id, paste0(" (", relativePath(names(id)) ,")"), "...")
  
  # get metadata
  response <- GET(url = paste0(o2r_api_endpoint, "compendium/", id, "/metadata"),
                  accept_json(),
                  set_cookies(connect.sid = cookie))
  
  metadata <-  content(response, as = "text")
  
  tmpMeta = fromJSON(metadata)["metadata"]
  ####BAD HACK: fromJSON toJSON transforms array including "displayFile" to character
  if(typeof(tmpMeta$metadata$o2r$displayfile_candidates)=="character"){
    tmpMeta$metadata$o2r$displayfile_candidates = c("display.html", "")
  }
  if(typeof(tmpMeta$metadata$o2r$codefiles)=="character"){
    #tmpMeta$metadata$o2r$codefiles = c("main.Rmd")
  }
  if(typeof(tmpMeta$metadata$o2r$mainfile_candidates)=="character"){
    tmpMeta$metadata$o2r$mainfile_candidates = c("main.Rmd", "")
  }
  
  for (binding in bindings) {
    if(binding["title"]==tmpMeta$metadata$o2r$title){
      tmpMeta$metadata$o2r$interaction = binding$bindings
      for (i in seq(from=1, to=length(tmpMeta$metadata$o2r$interaction),by=1)){
        tmpMeta$metadata$o2r$interaction[[i]]$id=paste0(id)
        create_binding <- POST(url = paste0(o2r_api_endpoint, "bindings/binding"),
              body = toJSON(tmpMeta$metadata$o2r$interaction[[i]]),
              content_type_json(),
              accept_json()
              )
      }
    }
  }
  
  metadata <- toJSON(tmpMeta)

  # avoid parsing and recoding because of unboxing issues
  # extract "o2r" element from metadata
  metadata <- str_sub(string = metadata,
                      start = str_locate(string = metadata, pattern = "\\{\"o2r\"")[[1]],
                      end = str_length(metadata) - 1)
  #cat("Sending metadata:\n", metadata, "\n")
  
  # update metadata
  response_update <- PUT(url =  paste0(o2r_api_endpoint, "compendium/", id, "/metadata"),
                         body = metadata,
                         content_type_json(),
                         accept_json(),
                         set_cookies(connect.sid = cookie))
  cat(format(Sys.time()), "Published? ", status_code(response_update), "\n")
  if (status_code(response_update) != 200) {
    cat("Response:", toString(content(response_update)), "\n")
  }
  result <- status_code(response_update)
  names(result) <- id
  Sys.sleep(time = sleep_after_upload_secs)
  return(result)
}
```

Use loop to publish one paper after the other, and take a little break in between.

```{r publish}
finished_papers_publishings <- c()
for (candidate_id in finished_papers_uploads) {
  result <- publish_compendium(candidate_id, cookie = o2r_user_cookie, sleep_after_upload_secs = 2)
  if (!is.na(result))
      finished_papers_publishings <- c(finished_papers_publishings, result)
}
```

If not all workspaces could be published automatically, go to [the author page](http://localhost/#!/author/0000-0001-6225-344X) to edit the candidate compendia.

Because of a race condition in `o2r-meta` ([#104](https://github.com/o2r-project/o2r-meta/issues/104)), we might get some `HTTP 500` errors.
Let's try to publish those again.

```{r retry_500_errors}
publish_had_500_error <- names(finished_papers_publishings[finished_papers_publishings == 500])
for (candidate_id in publish_had_500_error) {
  result <- publish_compendium(candidate_id, cookie = o2r_user_cookie, sleep_after_upload_secs = 2)
  # overwrite status
  finished_papers_publishings[candidate_id] <- result
}
```

_How many candidates are still not published now?_

```{r publish_result}
length(finished_papers_publishings[finished_papers_publishings != 200])
```

## Start a job for each paper

```{r job_functions}
start_job <- function(id, cookie) {
  cat(format(Sys.time()), "Starting job for ", id, "\n")
  
  # get metadata
  response <- POST(url = paste0(o2r_api_endpoint, "job"),
                   body = list(compendium_id = id),
                   accept_json(),
                   set_cookies(connect.sid = cookie))
  content <- content(response)
  if (is.null(content$job_id)) {
    cat(format(Sys.time()), "Error starting job: ", toString(content), "\n")
    result <- paste0("Error: ", toString(content))
  } else {
    cat(format(Sys.time()), "Job for ", id, "is", content$job_id, "\n")
    result <- content$job_id
  }
  names(result) <- paste("compendium:", id)
  return(result)
}
#start_job(names(finished_papers_publishings)[[2]], o2r_user_cookie)
job_status <- function(job_id, cookie) {
  response <- GET(url = paste0(o2r_api_endpoint, "job/", job_id),
                  accept_json(),
                  # authenticate even if not needed to not destroy cookie caching
                  set_cookies(connect.sid = cookie))
  job <- content(response)
  return(job$status)
}
finish_job <- function(id, cookie) {
  job_id <- start_job(id, cookie)
  
  if (str_detect(string = toString(job_id), pattern = "Error")) {
    final_status = NA
  } else {
    cat(format(Sys.time()), "Waiting for job  ", job_id)
    while (job_status(job_id, cookie) == "running") {
      cat(".")
      Sys.sleep(time = 10)
    }
    cat("\n")
    
    final_status <- job_status(job_id, cookie)
    cat(format(Sys.time()), "Job              ", job_id, "ended with", final_status, "\n")
  }
  
  names(final_status) <- job_id
  return(final_status)
}
```

Run the jobs, one at a time.

```{r jobs}
finished_papers_jobs <- c()
for (id in names(finished_papers_publishings)) {
  result <- finish_job(id, cookie = o2r_user_cookie)
  finished_papers_jobs <- c(finished_papers_jobs, result)
}
summary(as.factor(finished_papers_jobs))
```

## Fixing errors & testing

### Publish single compendium

The following code chunk can be used to upload, publish, and execute a single compendium from the list `finished_papers_repro_zips[has_rmd]` for debugging purposes.
For the function to work, run the chunks above to load finished papers and the three `.._function` chunks.

```{r single_upload, eval=FALSE}
process_workspace <- function(workspaceIndex, the_list = finished_papers_repro_zips[has_rmd]) {
  id <- upload_workspace_zip(the_list[[workspaceIndex]], cookie = o2r_user_cookie)
  publish_result <- publish_compendium(id, o2r_user_cookie)
  job_result <- finish_job(names(publish_result), o2r_user_cookie)
  cat("Uploaded ", finished_papers_repro_zips[has_rmd][[workspaceIndex]], " as ", id,
    "\nand published resulting in ", publish_result, "\nand ran job ", names(job_result), " resulting in ", job_result)
}
#process_workspace(19)
```

### Testing the metadata extraction

The following commands are assumed to run in the repository directory of [`o2r-meta`](https://github.com/o2r-project/o2r-meta).

```
/o2r-meta$ python3 o2rmeta.py -debug extract -i ~/ownCloud/o2r-data/Korpus/Reproducing\ papers/Finished/Aspacetimemodel/workspace -o /tmp/Korpus -b ~/ownCloud/o2r-data/Korpus/Reproducing\ papers/Finished/Aspacetimemodel/workspace
/o2r-meta$ ls /tmp/Korpus/
erc_spec.pdf  metadata_raw.json
/o2r-meta$ cat /tmp/Korpus/metadata_raw.json
```

Make sure `mainfile` and `displayfile` in raw metadata are correct.

You can also broker and check the o2r metadata:

```
$ python3 o2rmeta.py -debug broker -i /tmp/Korpus/metadata_raw.json -m broker/mappings/o2r-map.json -o /tmp/Korpus
$ ls /tmp/Korpus/
erc_spec.pdf  metadata_o2r_1.json  metadata_raw.json  package_slip.json
$ cat /tmp/Korpus/metadata_o2r.json
$ python3 o2rmeta.py -debug validate -s schema/json/o2r-meta-schema.json -c /tmp/Korpus/metadata_o2r_1.json
```

The last call must show a "valid" metadata file for publishing to work.

### Testing the manifest generation

```{r manifests, eval=FALSE}
library("containerit")
workspace <- here("Finished/spacetime - spatio temporal data in R/workspace")
```

### Create a public link (candidates only)

```{r link, eval=FALSE}
library("httr")
library("curl")

# need to be admin!
#o2r_api_endpoint <- "http://localhost/api/v1/"
user_cookie_url <- "http://localhost/oauth/cookie/0000-0002-1701-2564"
o2r_api_endpoint <- "https://o2r.uni-muenster.de/api/v1/"
#o2r_user_cookie <- URLdecode("s%3AHwd...")

# get cookie via API from o2r-guestlister, but user must login manually
if (!is.null(content(GET(user_cookie_url))$error))
  invisible(readline(prompt = "Login as 'Admin' at http://localhost, then press [enter] to continue"))
if (!is.null(content(GET(user_cookie_url))$error) && !interactive())
  stop("Login as 'Admin' at http://localhost before creating the link")

# set cookie manually here from demo server
#o2r_admin_cookie <- curl::curl_unescape(content(GET(user_cookie_url))$cookie)
o2r_admin_cookie <- URLdecode("s%3AHwd21aHfTV...")

# clear cookies (can be required for manual testing and when the cookie changes)
httr::handle_reset(o2r_api_endpoint)

candidate_id <- upload_workspace_zip(file = "../workspaces/workspace-rmd-data.zip", cookie = o2r_admin_cookie)
  
create_link <- function(id, cookie) {
  response <- httr::PUT(url =  paste0(o2r_api_endpoint, "compendium/", id, "/link"),
                        accept_json(),
                        set_cookies(connect.sid = cookie))
  cat(format(Sys.time()), "Link created for", id, "? ", httr::status_code(response), "\n")
  result <- httr::content(response)
  return(result)
}

#create_link(id = candidate_id, cookie = o2r_admin_cookie)
create_link(id = "Qliyf", cookie = o2r_admin_cookie)
```