[draft] tar_terra_rast_wrap: multi-target method to preserve SpatRaster metadata #63

Draft
wants to merge 2 commits into
base: master

brownag
Contributor

@brownag brownag commented Apr 29, 2024

This draft PR may address #58.

It takes a completely different approach to managing target files: the target file in _targets/objects/ is an RDS file (like an ordinary target) containing a PackedSpatRaster, which is backed by a cached geospatial data file (and any sidecar files) held in a user-specified folder.

  • Source data file is written using terra::wrapCache() to a user-specified cache directory
  • Target (custom format, saves RDS in target store) is created for PackedSpatRaster which is linked to the cache files
  • Target (format="file") is created for cache files
  • Cache-managing functions geotargets_destroy_cache() and geotargets_init_cache() are added, along with an environment option for the cache path, GEOTARGETS_CACHE_DIR (and associated methods)
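
The first two steps in the list above can be sketched outside of targets. This is only an illustration under stated assumptions, not the code in this PR: the cache location under `tempdir()`, the file name `rast1.tif`, and the temporary RDS file standing in for the target store are all made up for the example.

```r
library(terra)

# Build a small raster with metadata that a plain GeoTIFF would not carry
x <- rast(system.file("ex/elev.tif", package = "terra"))
units(x) <- "m"

# Hypothetical per-target cache directory (stand-in for the user-specified folder)
cache_dir <- file.path(tempdir(), "geotargets_cache", "rast1")
dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
cache_file <- file.path(cache_dir, "rast1.tif")

# Write the data to the cache and get a PackedSpatRaster that points at it
packed <- wrapCache(x, filename = cache_file, overwrite = TRUE)

# Stand-in for the RDS file in the target store
rds_file <- tempfile(fileext = ".rds")
saveRDS(packed, rds_file)

# Reading the target back: unwrap() rebuilds a SpatRaster backed by the cache file
y <- unwrap(readRDS(rds_file))
units(y)
```

The RDS file stays small because it holds only the packed object's metadata and the path to the cached file, which is also why the cache must remain in place for the target to be readable.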

For now this only works for SpatRaster, but I think a similar solution could be developed for SpatVectorProxy. That would require either changes to wrapCache() in terra or a custom wrapCache()-like method written for this case.

The current "issue" is that you can modify the cache (intentionally or unintentionally) without invalidating the main target. I tried tracking the cache directory before running the caching target, but then the caching target has to run twice before it is skipped.

Example of storing units and categories:

library(targets)
tar_script({
    
    make_rast1 <- function() {
        x <- terra::rast(system.file("ex/elev.tif", package = "terra"))
        terra::units(x) <- "m"
        terra::varnames(x) <- "elev"
        x
    }
    
    make_rast2 <- function() {
        x <- terra::rast(system.file("ex/elev.tif", package = "terra"))
        y <- terra::classify(x, cbind(c(0, 300, 500),
                                      c(300, 500, 1000),
                                      1:3))
        levels(y) <- data.frame(value = 1:3,
                                category = c("low", "med", "hi"))
        y
    }
    
    list(
        geotargets::tar_terra_rast_wrap(
            rast1,
            make_rast1()
        ),
        geotargets::tar_terra_rast_wrap(
            rast2,
            make_rast2()
        )
    )
})

tar_make()
#> ▶ dispatched target rast1
#> ● completed target rast1 [0.009 seconds]
#> ▶ dispatched target rast2
#> ● completed target rast2 [0.016 seconds]
#> ▶ dispatched target rast1_cache_files
#> ● completed target rast1_cache_files [0 seconds]
#> ▶ dispatched target rast2_cache_files
#> ● completed target rast2_cache_files [0 seconds]
#> ▶ ended pipeline [0.334 seconds]

x_raw <- readRDS("_targets/objects/rast1")
x <- tar_read(rast1)

x_raw@attributes
#> $sources
#>   sid
#> 1   1
#>                                                                             source
#> 1 /tmp/RtmpfZAMV0/reprex-67d786408f708-bared-pika/geotargets_cache/rast1/rast1.tif
#>   bands nlyr
#> 1     1    1
#> 
#> $units
#> [1] "m"

terra::units(x)
#> [1] "m"

# varnames not preserved in PackedSpatRaster either
terra::varnames(x)
#> [1] "rast1"

x
#> class       : SpatRaster 
#> dimensions  : 90, 95, 1  (nrow, ncol, nlyr)
#> resolution  : 0.008333333, 0.008333333  (x, y)
#> extent      : 5.741667, 6.533333, 49.44167, 50.19167  (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#> source      : rast1.tif 
#> name        : elevation 
#> min value   :       141 
#> max value   :       547 
#> unit        :         m

x <- tar_read(rast2)

terra::levels(x)
#> [[1]]
#>   value category
#> 1     1      low
#> 2     2      med
#> 3     3       hi

x
#> class       : SpatRaster 
#> dimensions  : 90, 95, 1  (nrow, ncol, nlyr)
#> resolution  : 0.008333333, 0.008333333  (x, y)
#> extent      : 5.741667, 6.533333, 49.44167, 50.19167  (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#> source      : rast2.tif 
#> categories  : category 
#> name        : category 
#> min value   :      low 
#> max value   :       hi

tar_read(rast1_cache_files)
#> [1] "geotargets_cache/rast1/rast1.tif"         
#> [2] "geotargets_cache/rast1/rast1.tif.aux.json"

# all skip
tar_make()
#> ✔ skipped target rast1
#> ✔ skipped target rast2
#> ✔ skipped target rast1_cache_files
#> ✔ skipped target rast2_cache_files
#> ✔ skipped pipeline [0.119 seconds]

# change the rast1 cache by changing units
x <- jsonlite::read_json("geotargets_cache/rast1/rast1.tif.aux.json")
x[[1]][[1]] <- "km" 
jsonlite::write_json(x, "geotargets_cache/rast1/rast1.tif.aux.json")

# rast1 needs to be rebuilt here, but only rast1_cache_files reruns
tar_make()
#> ✔ skipped target rast1
#> ✔ skipped target rast2
#> ▶ dispatched target rast1_cache_files
#> ● completed target rast1_cache_files [0.001 seconds]
#> ✔ skipped target rast2_cache_files
#> ▶ ended pipeline [0.153 seconds]

# all skip
tar_make()
#> ✔ skipped target rast1
#> ✔ skipped target rast2
#> ✔ skipped target rast1_cache_files
#> ✔ skipped target rast2_cache_files
#> ✔ skipped pipeline [0.118 seconds]

… metadata

 - Source data file is written using `terra::wrapCache()` to a user-specified cache directory
 - Target is created for PackedSpatRaster based on cache
 - Target is created for cache files
 - Add cache-managing functions `geotargets_destroy_cache()` and `geotargets_init_cache()`
@brownag brownag changed the title [draft] tar_terra_rast_wrap: draft multi-target method to preserve SpatRaster metadata [draft] tar_terra_rast_wrap: multi-target method to preserve SpatRaster metadata Apr 29, 2024
Collaborator

@Aariq Aariq left a comment

Sorry it has taken so long to get to this. I'm excited about this PR because it seems like it could make sense as the default way geotargets works with terra, which would solve a lot of issues.

#' @export
geotargets_init_cache <- function(name = NULL) {
    cachedir <- geotargets_option_get("cache.dir")
    target_cache_dir <- file.path(cachedir %||% "geotargets_cache", name %||% "")
Collaborator

Suggested change
target_cache_dir <- file.path(cachedir %||% "geotargets_cache", name %||% "")
target_cache_dir <- file.path(cachedir %||% "_geotargets", name %||% "")

Maybe? Just for consistency with _targets/—both being directories you shouldn't edit manually.

Comment on lines +31 to +37
geotargets_destroy_cache <- function(name = NULL, init = FALSE) {
    cachedir <- geotargets_option_get("cache.dir")
    target_cache_dir <- file.path(cachedir %||% "geotargets_cache", name %||% "")
    res <- unlink(target_cache_dir, recursive = TRUE)
    if (init) geotargets_init_cache(name = name)
    invisible(res)
}
Collaborator

I wonder if there is a way that this could set some "flag" that could be used to invalidate the "upstream" target through a custom cue? Or perhaps it runs tar_invalidate() on all targets created with tar_terra_rast_wrap() when run? I think it's fine that manually deleting a file from the cache breaks the pipeline, but I think any "official" way of deleting the cache should correctly invalidate targets.

Collaborator

@Aariq Aariq Jun 7, 2024

Actually, now that I think of it, the directory names inside the cache are target names, yeah? So if this could get all those dir names and pass them to tar_invalidate(any_of(dirnames)) I think it would make this function a lot more useful.
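
A rough sketch of that idea, assuming each immediate subdirectory of the cache is named after a target. Both function names below are hypothetical and are not part of this PR; `targets::tar_invalidate()` is only called when `invalidate = TRUE`, and it needs an existing target store.

```r
# Hypothetical helper: the per-target subdirectories of the cache,
# which are assumed to be named after the targets that created them.
cache_target_names <- function(cachedir = "geotargets_cache") {
    if (!dir.exists(cachedir)) {
        return(character(0))
    }
    list.dirs(cachedir, full.names = FALSE, recursive = FALSE)
}

# Hypothetical variant of geotargets_destroy_cache(): collect the target
# names first, delete the cache, then invalidate the matching targets.
destroy_cache_and_invalidate <- function(cachedir = "geotargets_cache",
                                         invalidate = TRUE) {
    dirnames <- cache_target_names(cachedir)
    unlink(cachedir, recursive = TRUE)
    if (invalidate && length(dirnames) > 0) {
        # any_of() ignores names that are not targets in the store
        targets::tar_invalidate(tidyselect::any_of(dirnames))
    }
    invisible(dirnames)
}
```

Collecting the directory names before calling `unlink()` matters, since the names are gone once the cache is deleted.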

resources = targets::tar_option_get("resources"),
storage = targets::tar_option_get("storage"),
retrieval = targets::tar_option_get("retrieval"),
cue = targets::tar_option_get("cue")) {
Collaborator

Suggested change
cue = targets::tar_option_get("cue")) {
cue = targets::tar_option_get("cue"),
description = targets::tar_option_get("description")) {

full.names = TRUE,
recursive = TRUE
)")),
format = "file_fast",
Collaborator

You could allow the option of either "file" or "file_fast" here, but I'm guessing it doesn't really matter since it seems like nothing will ever depend on this target.

Comment on lines +138 to +148
rast_cache_files <- targets::tar_target_raw(
    paste0(name, "_cache_files"),
    str2expression(paste0("
        list.files(
            file.path(", shQuote(cachedir), ", ", shQuote(name), "),
            full.names = TRUE,
            recursive = TRUE
        )")),
    format = "file_fast",
    deps = name
)
Collaborator

I'm questioning whether this target even needs to exist. Unless there is a way to make it be "upstream" of the wrapCache target, then it doesn't really serve a purpose. Invalidating this target will never do anything, and there's no reason to use this target rather than the upstream one in a pipeline. So maybe this doesn't need to return multiple targets.
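
If the `_cache_files` target is kept, the command quoted above could also be built without pasting strings together. The sketch below is an alternative, not what the PR does: it uses `bquote()` to splice the values in directly, which avoids relying on `shQuote()` output being valid R string syntax (for example, when a path contains a single quote). The function name `build_cache_files_command` is hypothetical.

```r
# Hypothetical helper: build the list.files() command as an expression,
# splicing cachedir and name in as character constants.
build_cache_files_command <- function(cachedir, name) {
    as.expression(bquote(
        list.files(
            file.path(.(cachedir), .(name)),
            full.names = TRUE,
            recursive = TRUE
        )
    ))
}

# Example: inspect the generated command for the rast1 target
cmd <- build_cache_files_command("geotargets_cache", "rast1")
print(cmd[[1]])
```

The resulting expression could then be passed where the PR currently passes the `str2expression()` result.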
