
Soil depth and the behaviour of min and max #175

Open

pierreroudier opened this issue Dec 8, 2020 · 7 comments

@pierreroudier
Collaborator

Following up on a chat between @dylanbeaudette and myself.

I've reverted a commit I just pushed, as this is a drastic change that requires discussion.

At the moment, min and max overload the base functions. For horizons bounded in depth by "top" and "bottom":

  • max gives the absolute maximum "bottom" depth in the collection, i.e. max(spc$bottom)
  • min gives the minimum of the profile-specific max(bottom).

I understand this makes sense from a pedological perspective, where "soil depth" is the maximum depth of a profile. However, I don't think this behaviour is appropriate for min.

Consider sp4:

library(aqp)
data(sp4)
depths(sp4) <- id ~ top + bottom

In this case:

> min(sp4)
[1] 16

I think the problem arises in particular when looking over a collection of soil profiles:

> sp4
SoilProfileCollection with 10 profiles and 30 horizons
profile ID: id  |  horizon ID: hzID 
Depth range: 16 - 49 cm

----- Horizons (6 / 30 rows  |  10 / 14 columns) -----
     id hzID top bottom name   K   Mg  Ca CEC_7 ex_Ca_to_Mg
 colusa    1   0      3    A 0.3 25.7 9.0  23.0        0.35
 colusa    2   3      8  ABt 0.2 23.7 5.6  21.4        0.23
 colusa    3   8     30  Bt1 0.1 23.2 1.9  23.7        0.08
 colusa    4  30     42  Bt2 0.1 44.3 0.3  43.0        0.01
  glenn    5   0      9    A 0.2 21.9 4.4  18.8        0.20
  glenn    6   9     34   Bt 0.3 18.9 4.5  27.5        0.20
[... more horizons ...]

----- Sites (6 / 10 rows  |  1 / 1 columns) -----
        id
    colusa
     glenn
     kings
  mariposa
 mendocino
      napa
[... more sites ...]

Spatial Data: [EMPTY]

In particular this bit:

Depth range: 16 - 49 cm

gives the impression that the collection only spans the 16--49 cm depth.

It's very much a wording issue:

  • a pedologist would read "depth range" as "range of profile maximum soil depth"
  • a pedometrician like me would think of the horizons' effective depths, and read it as "range of horizon depths represented in the collection"

I would argue for min and max to refer to horizon depths, which are very simple, basic data concepts, and correspond to min(top) and max(bottom) respectively. These would be low-level, data-wrangling functions.

On the other hand, I would propose that the more complex concept of soil profile depth be given its own, specific function (soil_depth(x)?). Arguably this effective soil depth can be calculated using different approaches, which this more advanced function could accommodate.

This would separate convenience functions that purely report the data from functions that report a science concept.
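For comparison, the proposed behaviour amounts to plain summaries of the horizon depth columns (a sketch of the proposal, not the current implementation):

```r
library(aqp)
data(sp4)
depths(sp4) <- id ~ top + bottom

# proposed: min/max as plain data summaries over horizon depths
min(sp4$top)     # shallowest top depth in the collection: 0
max(sp4$bottom)  # deepest bottom depth in the collection: 49
```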

The revised min for reference:

# overload min() to give us the min depth within a collection
#' Get the minimum top depth in a SoilProfileCollection
#'
#' @name min
#'
#' @description Get the shallowest depth of description out of all profiles in a SoilProfileCollection. Data missing one or more of: top depth, profile ID, or any optional attribute are omitted using \code{complete.cases}.
#' @param x a SoilProfileCollection
#' @param v optional: a vector of horizon attribute names to refine calculation
#' @aliases min,SoilProfileCollection-method
#' @docType methods
#' @rdname min
setMethod(
  f = 'min',
  signature(x = "SoilProfileCollection"),
  definition = function(x, v = NULL) {
    h <- x@horizons

    # get top depth column name
    hz_top_depths <- horizonDepths(x)[1]

    # handle empty spc
    if(length(x@horizons[[hz_top_depths]]) == 0)
      return(NA)

    # optionally use a horizon-level property to refine the calculation
    if (!missing(v)) {
      target.names <- c(hz_top_depths, idname(x), v)
    } else {
      target.names <- c(hz_top_depths, idname(x))
    }

    # filter out missing data
    h <- .data.frame.j(h, target.names, aqp_df_class(x))
    h <- h[complete.cases(h),]

    # compute min top depth within each profile
    if (aqp_df_class(x) == "data.table" &&
        requireNamespace("data.table", quietly = TRUE)) {

      # base R faster for big data with no existing key
      # but if the key is already set, then this is ~2x faster with 1M profiles (sorted numeric IDs)
      if (idname(x) %in% data.table::key(h)) {
        idn <- idname(x)
        .I <- NULL
        # # with no key
        # user  system elapsed
        # 7.26    5.52   16.83
        # # with pre-set key
        # user  system elapsed
        # 2.07    0.00    2.08

        # can't invoke this for something like min/max probably -- might do better on linux
        # data.table::setkeyv(h, c(idn))

        dep <- h[[hz_top_depths]]
        d <- dep[h[, .I[get(hz_top_depths) == min(get(hz_top_depths), na.rm = TRUE)][1],
                     by = idn]$V1]

        # return from here for data.table
        return(min(d, na.rm = TRUE))
      }
    }

    # tapply on a data.frame
    # user  system elapsed
    # 4.39    0.00    4.39
    d <- tapply(h[[hz_top_depths]], h[[idname(x)]], min, na.rm = TRUE)

    # return the shallowest of the per-profile minimum top depths
    return(min(d, na.rm = TRUE))
  }
)
@dylanbeaudette
Member

dylanbeaudette commented Dec 8, 2020

Thanks for the detailed explanation Pierre. I see where you are coming from, a direction that hadn't occurred to me when first overloading these base functions. Outside of those cases where you have soil samples which do not truly represent soil profiles (truncated to depth slices or slabs, grab samples, or ???), do you expect min(SPC) to equal anything other than 0 most of the time? Mostly curious about a use case.

Long-term it makes a lot of sense: the original idea behind the min / max overloads came before the more advanced approaches that are now available and well-tested (depthOf, estimateSoilDepth, etc.). These kinds of functions can use pattern matching on the horizon designation (or another column) to make a more informed estimate vs. e.g. !is.na(clay).

For aqp 2.0, where we can break things, this seems possible and I support the extensive changes required. I expect that there will be a lot of tedious work in {aqp}, some in {soilDB}, and perhaps some as well in {sharpshootR}. I do not know how many other things will be affected but I do know that this could be a breaking change without much of a safety net.

Here is a rough plan:

  • add new tests expecting the new behavior and the old behavior
  • invent new tests to find cases where code expecting the old behavior now results in 0, which is usually wrong
  • find all instances of min(SPC) and replace with appropriate function: depthOf or estimateSoilDepth, or ???
  • notify users in the manual pages / AQP intro
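
The first two items could be sketched with {testthat} (hypothetical expectations encoding the proposed behaviour; they would fail against the current overloads):

```r
library(aqp)
library(testthat)

data(sp4)
depths(sp4) <- id ~ top + bottom

test_that("min/max report raw horizon depths (proposed behaviour)", {
  expect_equal(min(sp4), min(sp4$top))     # 0 under the proposal
  expect_equal(max(sp4), max(sp4$bottom))  # 49 either way
})
```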

@brownag
Member

brownag commented Dec 8, 2020

I also stumbled over the whole "min is the min of the max" thing when I attempted to make the show method faster for large (1M+) data.table collections.

You can see the comments and benchmarks I put there inline. At the time I was thinking I would be able to beat tapply and bypass the high demands of the current implementation (for large collections) by silently using data.table. I may be able to do better at that now.

I would argue for min and max to refer ... to min(top) and max(bottom) respectively.

I support this. Here is an example with a function that I use a lot where the proposed changes to min would make more sense.

glom-based workflows have been sort of my "pet" interest in aqp. As an aside, I hope to be able to generalize these methods better using some vectorized ops I have been considering -- getting around the bottlenecks of lists/profileApply/iteration/etc.

library(aqp)
data("sp4")
sp4 <- tibble::tibble(sp4)
depths(sp4) <- id ~ top + bottom

# obscure example: return _thickest_ horizon overlapping [0,30]
#                  produces a "ragged" SoilProfileCollection
sp4_B <- glomApply(sp4, function(p) c(0, 30), 
                   modality = "thickest", truncate = TRUE)

par(mar=c(0,0,0,0))
plot(sp4_B)

# shallowest top depth is 3cm -- NB: not zero -- whereas shallowest bottom-most depth is 16
min(sp4_B$top)
min(sp4_B)

# the deepest bottom depth is the deepest bottom depth (30cm)
max(sp4_B$bottom)
max(sp4_B)

On the other hand, I would propose that the more complex concept of soil profile depth being given its own, specific function (soil_depth(x)?).
Arguably this effective soil depth can be calculated using different approaches, which this more advanced function could accommodate.

I agree that "soil depth" is a more nuanced definition. The values we report in the show methods should be something like a "coverage" or "depth support interval". Mostly I have learned to ignore the output of that in show since it rarely makes sense for soils that have bedrock layers described in the profile.

I do look at it when I have trunc()'d a SPC -- and wish that min worked as you describe above so I can know whether my profiles are e.g. [0, 30]. We have estimateSoilDepth (and other issues related to those functions and recent generalizations in that family, i.e. depthOf). IMO the routine used in min/max is very much the data-oriented view compared to estimateSoilDepth. And even estimateSoilDepth is a coarse heuristic.
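
A minimal sketch of that truncation check, assuming aqp's trunc() method for SoilProfileCollection objects and the proposed min/max semantics (plain summaries of the depth columns):

```r
library(aqp)
data(sp4)
depths(sp4) <- id ~ top + bottom

# truncate the collection to the [0, 30] depth window
sp4_30 <- trunc(sp4, 0, 30)

# with the proposed semantics these confirm the window directly
min(sp4_30$top)     # 0 if every profile is described from the surface
max(sp4_30$bottom)  # 30
```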

I wrote a blog about the complex issues around soil depth estimation in aqp. That is not saying anyone has to read said blog, but just pointing out that there are a lot of things to be said, and to consider, when talking about "pedologic" depth.

@brownag brownag closed this as completed Dec 8, 2020
@brownag brownag reopened this Dec 8, 2020
@pierreroudier
Collaborator Author

Thanks for the detailed explanation Pierre. I see where you are coming from, a direction that hadn't occurred to me when first overloading these base functions. Outside of those cases where you have soil samples which do not truly represent soil profiles (truncated to depth slices or slabs, grab samples, or ???), do you expect min(SPC) to equal anything other than 0 most of the time? Mostly curious about a use case.

Most such cases would be profiles that have recorded the organic/humic horizon above the mineral fraction, and that are using negative depth values to indicate that.

Otherwise we have some marginal cases where people focused on subsoil etc: in terms of soil information system, we moved from something very profile-centric, to something a lot more holistic, where if anyone measures anything about soil (even if it's not a full profile description), we will capture and record it.

@pierreroudier
Collaborator Author

glom-based workflows have been sort of my "pet" interest in aqp. As an aside, I hope to be able to generalize these methods better using some vectorized ops I have been considering -- getting around the bottlenecks of lists/profileApply/iteration/etc.


Have you considered using the intervals package for that? I have been using Intervals objects to compute depth interval overlaps etc.:

library(aqp)
library(intervals)
library(dplyr)

data(sp4)

int_df <- sp4 %>% 
  select(top, bottom) %>% 
  as.matrix %>% 
  Intervals()

# Using GlobalSoilMap depth intervals as an example
gsm_breaks <- c(0, 5, 15, 30, 60, 100, 200)
gsm_intervals <- paste0(head(gsm_breaks, -1), '-', tail(gsm_breaks, -1), ' cm')

int_gsm <- Intervals(cbind(
  top = head(gsm_breaks, -1),
  bottom = tail(gsm_breaks, -1)
))

int_over <- interval_overlap(int_df, int_gsm)

lapply(int_over, function(x) gsm_intervals[x])

@brownag
Member

brownag commented Dec 8, 2020

I have seen that package and thought it may be useful.

glomApply takes a function that returns the profile-specific depths, and my example was deliberately very simple (it always returned a constant -- so only the profile was varying). Compared to GSM depths, the cases where I am using glom are a bit more general (and could not be calculated as simply as above), as the intervals are usually, or at least can be, profile-specific.

We run into a variety of things in US taxonomy/soil survey that affect the window of interest and making the profile specific adjustments is pretty important for some of our analyses and criteria. You raise the point of organic materials, negative depths and the like. Conventionally, our soil surface is at zero, but for instance a lot of our old lab data starts at some value greater than zero because OSM was not sampled for analysis. Historically values may have used negative depths, and we do see it turn up in old descriptions and rarely in the database.

There are many other cases where a workflow like yours above would be handy when aggregation is occurring over known constant intervals. I'll surely consider it, but I doubt I would bring in a whole 'nother package for relatively simple vector operations.

@brownag
Member

brownag commented Feb 14, 2021

min and max SoilProfileCollection methods may benefit from the abstraction in #199

It might be handier to have raw row indices of the horizon data.frame returned rather than hzidcol values (as an optional argument?), so the result of the abstracted method can be used to easily get min(bottom)/max(bottom) for the subset of horizons that are the deepest in each profile.
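
A sketch of what that could enable, with the row indices of the deepest horizon per profile emulated here via tapply (the abstracted #199 method is assumed, not shown):

```r
library(aqp)
data(sp4)
depths(sp4) <- id ~ top + bottom

h <- horizons(sp4)

# hypothetical: raw row index of the deepest horizon in each profile
idx <- tapply(seq_len(nrow(h)), h[[idname(sp4)]],
              function(i) i[which.max(h$bottom[i])])

# max bottom depth then reduces to simple subsetting
max(h$bottom[idx])  # 49 for sp4
```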

@dylanbeaudette
Member

Let's pursue this idea, given all of the new findings in #199. We have to be careful, though: changing the current min / max overloads will affect many functions.
