API: refactor to use feather serialisation instead of rds #327

@PietrH

Update:

I have had another look, and a long think about this.

  • I think the way forward is still to use feather, be it {feather} or {arrow} (but {arrow} is the easy choice here), because I expect significant performance benefits for both the service and the client.

  • I don't think switching to the File API makes sense at the moment: it would add quite a bit of complexity, especially if I want to retain progress reporting for the client. I'm also not sure the benefits would be much greater than those of multiple requests/OpenCPU sessions with a better serialiser.

  • A fallback to rds is possible; in that case we could make {arrow} an optional dependency. I'm still contemplating whether this is worth the added maintenance cost. {arrow} can be a bit tricky to install if you can't install from binary (some Linux users) or if you have an older system, because a recent C++ compiler is needed. Maybe this fire only needs to be put out when I see some smoke (as in, I might ignore this until someone complains).
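A minimal sketch of how the rds fallback with an optional {arrow} dependency could look. The helper names `choose_format()`, `write_result()` and `read_result()` are hypothetical and not part of the package; the only assumption about {arrow} is that `write_feather()` and `read_feather()` are available when it is installed.

```r
# Hypothetical helper: use feather when {arrow} is installed, rds otherwise.
choose_format <- function() {
  if (requireNamespace("arrow", quietly = TRUE)) "feather" else "rds"
}

# Write a data.frame to `path` in the chosen format.
# The feather branch is only evaluated when format == "feather",
# so this function also works on systems without {arrow}.
write_result <- function(df, path, format = choose_format()) {
  switch(format,
    # default compression: lz4 when the arrow build supports it
    feather = arrow::write_feather(df, path),
    rds = saveRDS(df, path),
    stop("Unknown format: ", format)
  )
  invisible(path)
}

# Read a result back as a plain data.frame.
read_result <- function(path, format = choose_format()) {
  switch(format,
    feather = as.data.frame(arrow::read_feather(path)),
    rds = readRDS(path),
    stop("Unknown format: ", format)
  )
}
```

With something like this, {arrow} could move from Imports to Suggests, at the cost of maintaining and testing two code paths.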


Currently the get_val() helper supports fetching from OpenCPU as JSON or RDS.

In #323 , Stijn found that we are crashing the session by running out of memory, possibly during serialisation. I believe base::saveRDS() -> base::serialize() might be the cause of this memory usage, assuming the crash happens while the object is being written as RDS to the output stream. I've not been able to replicate this issue locally or on the RStudio Server.

There are a few open issues on opencpu for child processes that died:

I did a quick local test on a deployments table to see if outputting as feather or parquet might help:

Unit: milliseconds
           expr       min        lq      mean    median        uq       max neval
 rw_feather(df)  13.28443  14.40181  15.38457  15.07805  16.14836  22.61397   100
 rw_parquet(df)  38.05100  39.58587  41.90330  40.76735  42.62479  60.33460   100
     rw_rds(df) 263.91068 265.43494 269.96770 266.53872 270.30619 296.12905   100

Both are faster than rds on my system: going by the median, feather is roughly 18 times faster and parquet roughly 6.5 times. I have not benchmarked memory usage yet.

This is using lz4 compression for feather, snappy for parquet and gzip for rds.
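The parquet and gzip rds helpers behind these timings are not shown in this issue. A sketch of what they could look like, assuming `arrow::write_parquet()` with snappy compression and `saveRDS()` with gzip compression; the name `rw_rds_gzip()` is hypothetical.

```r
# Assumed reconstruction, not the code that produced the timings above.

# write df to a temporary parquet file (snappy compressed) and read it back
rw_parquet <- function(df) {
  parquet_path <- tempfile(fileext = ".parquet")
  arrow::write_parquet(df, parquet_path, compression = "snappy")
  arrow::read_parquet(parquet_path)
}

# write df to a temporary gzip-compressed rds file and read it back
rw_rds_gzip <- function(df) {
  rds_path <- tempfile(fileext = ".rds")
  saveRDS(df, rds_path, compress = "gzip")
  readRDS(rds_path)
}
```

The layout of the timing table suggests it came from {microbenchmark}; if so, a call along the lines of `microbenchmark::microbenchmark(rw_feather(df), rw_parquet(df), rw_rds(df), times = 100)` would produce output in that shape.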

Stijn proposes an alternative fetch method: return the session id and write paged result objects to the session dir, then have the client fetch these objects and do the serializing on the client. This ties in to an existing paging branch, but Stijn mentioned it will probably require some optimisation on the database so we have a nice column to sort on.
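To make the paging idea concrete, here is a rough base R sketch that splits a result into fixed-size pages, writes each page to a directory, and reassembles them afterwards. Everything here is an assumption for illustration: `write_pages()`, `read_pages()`, the file naming and the page size are hypothetical, and a real implementation would page in the database query and fetch the files from the OpenCPU session rather than from a local directory.

```r
# Hypothetical: split `df` into pages of `page_size` rows and write each
# page as an rds file in `dir`. Returns the file paths in page order.
write_pages <- function(df, dir, page_size = 1000L) {
  stopifnot(nrow(df) > 0)
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  page_id <- ceiling(seq_len(nrow(df)) / page_size)
  pages <- split(df, page_id)
  paths <- file.path(dir, sprintf("page_%04d.rds", seq_along(pages)))
  for (i in seq_along(pages)) {
    saveRDS(pages[[i]], paths[i])
  }
  paths
}

# Hypothetical client side: read every page and bind them back together.
read_pages <- function(paths) {
  out <- do.call(rbind, lapply(paths, readRDS))
  rownames(out) <- NULL
  out
}
```

Because each page is serialised separately, peak memory on the service would depend on the page size and not on the size of the full result.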


to benchmark:

# compare memory usage and speed of different ways of storing/fetching detections


# read a detections table -------------------------------------------------

# stored result object
df <- readRDS("~/Downloads/albertkanaal.rds")
### or you could create the object via a query: ###
# df <-
#   get_acoustic_detections(animal_project_code = "2013_albertkanaal",
#                           api = FALSE) # because it doesn't work via the API, that's what we are trying to fix

# subset
df_sample <- dplyr::slice_sample(df, prop = 0.1)

# functions for feather and rds extract and load --------------------------


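# write df to a temporary feather file (lz4 compressed) and read it back
rw_feather <- function(df){
  feather_path <- tempfile()
  arrow::write_feather(df, feather_path, compression = "lz4")
  arrow::read_feather(feather_path)
}

# write df to a temporary uncompressed rds file and read it back
rw_rds <- function(df){
  rds_path <- tempfile()
  saveRDS(df, rds_path, compress = FALSE)
  readRDS(rds_path)
}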

# benchmark ---------------------------------------------------------------

bench_result <-
  bench::mark(
    rw_feather(df_sample),
    rw_rds(df_sample),
    memory = TRUE,
    filter_gc = FALSE,
    iterations = 3
  )

Blockers / Action Points

  • Install arrow on Lifewatch RStudio

    • Update R>4.0
    • Include C++17 compiler on Lifewatch RStudio -> gcc>7 currently 5.4.0
  • Implement query paging on etnservice

Optional:

  • Implement chunked writing to file on etnservice
  • Switch to using File API instead of Object API for fetching file to client
