Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

findHzGaps(), fillHzGaps() function to find / fill gaps in horizon depths #205

Closed
2 of 3 tasks
dylanbeaudette opened this issue Feb 19, 2021 · 4 comments
Closed
2 of 3 tasks

Comments

@dylanbeaudette
Copy link
Member

dylanbeaudette commented Feb 19, 2021

Need a function to reliably "fill" horizon gaps in existing data or as a product of filtering on `checkHzDepthLogic(..., byhz = TRUE)

  • fillHzGaps()
  • findHzGaps()
  • data.table optimization

Example using latest {aqp}.

library(aqp)
library(data.table)

d <- lapply(letters[1:10], random_profile, n = c(6, 7, 8), n_prop = 5, method = 'LPP', SPC = TRUE)
d <- combine(d)

# sprinkle NA
z <- d
z$bottom[2] <- NA
z$top[20] <- z$bottom[20]

z$bottom[32] <- 15
z$bottom[5]
z$top[6] <- 95

# dropping horizons (a good idea?)
zz <- dice(z, byhz = TRUE)

# ok
metadata(zz)$removed.horizons

par(mar = c(0, 0, 0, 1))
plotSPC(zz, name = NA, divide.hz = FALSE, default.color = 'grey')

image

@dylanbeaudette
Copy link
Member Author

Functional, tested, documented (but not optimized) fillHzGaps now in master branch.

dylanbeaudette added a commit that referenced this issue Feb 28, 2021
dylanbeaudette added a commit that referenced this issue Feb 28, 2021
dylanbeaudette added a commit that referenced this issue Mar 1, 2021
fillHzGaps() is not yet optimized, so this option is disabled by
default.
@dylanbeaudette
Copy link
Member Author

dylanbeaudette commented Mar 1, 2021

@brownag any suggestions on how to optimize fillHzGaps() with some data.table magic?

brownag added a commit that referenced this issue Mar 1, 2021
@brownag
Copy link
Member

brownag commented Mar 2, 2021

Here is an attempt--some 3x faster on sp4--using only {base}.

library(aqp, warn = FALSE)
#> This is aqp 1.27

# original code, slightly modified relative to master
fillHzGaps <- function(x, flag = FALSE) {
  
  idn <- idname(x)
  hzidn <- hzidname(x)
  htb <- horizonDepths(x)
  hznames <- horizonNames(x)
  
  ids.top.bottom.idx <- match(c(idn, hzidn, htb), hznames)
  
  h <- horizons(x)
  
  hs <- split(h, h[[idn]])
  
  h.filled <- lapply(hs, function(i) {
    n <- nrow(i)
    
    s <- 1:(n-1)
    
    .top <- i[[htb[1]]]
    .bottom <- i[[htb[2]]]
    idx <- which(.bottom[s] != .top[s + 1])
    
    if(length(idx) > 0) {
      gap.top <- .bottom[idx]
      gap.bottom <- .top[idx + 1]
      
      hz.template <- i[1, ids.top.bottom.idx]
      hz.template[[htb[1]]] <- gap.top
      hz.template[[htb[2]]] <- gap.bottom
      hz.template[[hzidn]] <- NA
      res <- data.table::rbindlist(list(i, hz.template), fill = TRUE)
      res <- as.data.frame(res)
      return(res)
    } else {
      return(i)
    }
    
  })
  
  h.filled <- do.call('rbind', h.filled)
  o <- order(h.filled[[idn]], h.filled[[htb[1]]])
  h.filled <- h.filled[o, ]
  idx <- which(is.na(h.filled[[hzidn]]))
  
  if(length(idx) > 0) {
    m <- max(as.numeric(h[[hzidn]]), na.rm = TRUE)
    s <- seq(
      from = m + 1,
      to = m + length(idx),
      by = 1
    )
    
    if(flag) {
      h.filled[['.filledGap']] <- FALSE
      h.filled[['.filledGap']][idx] <- TRUE
    }
    
    h.filled[[hzidn]][idx] <- as.character(s)
  }
  
  # note: this is the right place to deal with hzid
  h.filled$hzID <- as.character(1:nrow(h.filled))
  replaceHorizons(x) <- aqp:::.as.data.frame.aqp(h.filled, aqp_df_class(x))
  hzidname(x) <- "hzID"
  
  return(x)
}

# new code
fillHzGaps_2 <- function(x, flag = FALSE) {
  idn <- idname(x)
  hzidn <- hzidname(x)
  
  htb <- horizonDepths(x)
  
  hznames <- horizonNames(x)
  hcnames <- c(idn, hzidn, htb)
  
  h <- horizons(x)
  
  lead.idx <- 2:nrow(h)
  lag.idx <- 1:(nrow(h) - 1)
  
  # identify bad horizons
  bad.idx <- which(h[[htb[2]]][lag.idx] != h[[htb[1]]][lead.idx]
                   & h[[idn]][lag.idx] == h[[idn]][lead.idx])
  
  # create template data.frame
  hz.template <- h[bad.idx, ]
  
  # replace non-ID/depth column values with NA
  hz.template[, hznames[!hznames %in% hcnames]] <- NA
  
  # fill gaps
  hz.template[[htb[1]]] <- h[[htb[2]]][bad.idx]     # replace top with (overlying) bottom
  hz.template[[htb[2]]] <- h[[htb[1]]][bad.idx + 1] # replace bottom with (underlying) top
  
  # flag if needed
  if (flag) {
    h[['.filledGap']] <- FALSE
    hz.template[['.filledGap']] <- TRUE
  }  
  
  # combine original data with filled data
  res <- rbind(h, hz.template)
  
  # ID + top depth sort
  res <- res[order(res[[idn]], res[[htb[1]]]),]
  
  # re-calculate unique hzID (note: AFTER reorder)
  res$hzID <- as.character(1:nrow(res))
  
  # replace horizons (use df class in object x)
  replaceHorizons(x) <- aqp:::.as.data.frame.aqp(res, aqp_df_class(x))
  
  # use the autocalculated hzID (in case user had e.g. phiid, chiid set)
  hzidname(x) <- "hzID"
  
  return(x)
}

# create sample dataset
data(sp4)
depths(sp4) <- id ~ top + bottom

# introduce gaps
idx <- c(2, 8, 12)
sp4$top[idx] <- NA

# check
horizons(sp4)[idx, ]
#>          id name top bottom   K   Mg  Ca CEC_7 ex_Ca_to_Mg sand silt clay   CF
#> 2    colusa  ABt  NA      8 0.2 23.7 5.6  21.4        0.23   42   31   27 0.27
#> 8     kings  Bt1  NA     13 0.6 12.1 7.0  18.0        0.51   36   49   15 0.75
#> 12 mariposa  Bt2  NA     34 0.3 44.3 6.2  34.1        0.14   36   33   31 0.71
#>    hzID
#> 2     2
#> 8     8
#> 12   12

# remove problematic horizons
x <- HzDepthLogicSubset(sp4, byhz = TRUE)
#> dropping horizons with invalid depth logic, see `metadata(x)$removed.horizons`

# benchmark filling gaps
bench::mark(horizons(fillHzGaps(x, flag = TRUE)), 
            horizons(fillHzGaps_2(x, flag = TRUE)), 
            min_iterations = 100)
#> # A tibble: 2 x 6
#>   expression                                min median `itr/sec` mem_alloc
#>   <bch:expr>                             <bch:> <bch:>     <dbl> <bch:byt>
#> 1 horizons(fillHzGaps(x, flag = TRUE))    5.3ms  5.7ms      171.     667KB
#> 2 horizons(fillHzGaps_2(x, flag = TRUE)) 1.47ms 1.53ms      633.     237KB
#> # ... with 1 more variable: `gc/sec` <dbl>

@dylanbeaudette I might have missed something in your logic for filling in the hzID column values for the gaps.. but I think that can be simpler. For one, we should only modify the hzidcol called "hzID".

All we need is to assign the values in hzID (not whatever the user has set as the hzidcol) and re-set hzidcol (w/ hzidname<-) to "hzID" after replacing horizon data in SPC. I suppose you may have deliberately wanted to assign "new" horizon IDs to these filled layers? That is difficult for me to wrap my mind around considering "hzID" is an auto-calculated field and essentially always should be ascending integers within the horizons slot.

Basically your solution returns something like (using name = "hzID")
image
(note that the filled horizons are hzID 31, 32, 33 respectively)

But I think it should return:
image
(i.e. as if the SPC constructor depths<- made the hzID column after ID+top sort)

Like I said, I may not have fully understood your intention... BUT if this was not intentional... I suggest we add the following test to test-fillHzGaps.R, where z is a "gap filled" SPC.

  # calculated hzIDs are in ascending order
  expect_equal(hzID(z), as.character(1:nrow(z)))

If you would prefer the new filled layers to start counting from the previous max(hzID)... I think that should be easy to implement in my version of the function, but I am not sure how I feel about it. Thoughts?

See #207 for a commit with my suggested changes incorporated.

@dylanbeaudette
Copy link
Member Author

Nice work, good idea using lag/lead index + logical vectors.

I originally clobbered the hzID contents with a completely new set of IDs, but then reconsidered after imagining situations where it would be helpful to preserve the original IDs. I suppose that could be done in a different column (dice does this) but it seemed more reasonable to not mess with the existing IDs.

dylanbeaudette added a commit that referenced this issue Mar 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants