segfault after repeated builds #1

Open
helske opened this issue Sep 4, 2017 · 27 comments

Comments

@helske
Owner

helske commented Sep 4, 2017

I get a segfault when I repeatedly build the model. Valgrind reports:

[1] "iteration 1"
[1] "first model"
[1] "create matrices"
[1] "second model"
[1] "rebuild first model"
[1] "repeat millstein function 100,000 times"
[1] "iteration 2"
[1] "first model"
[1] "create matrices"
[1] "second model"
[1] "rebuild first model"
[1] "repeat millstein function 100,000 times"
[1] "iteration 3"
[1] "first model"
[1] "create matrices"
[1] "second model"
[1] "rebuild first model"
[1] "repeat millstein function 100,000 times"
==26398== Jump to the invalid address stated on the next line
==26398==    at 0x1DD75010: ???
==26398==    by 0x4F7D026: R_RunWeakRefFinalizer (in /usr/lib/R/lib/libR.so)
==26398==    by 0x4F7D213: ??? (in /usr/lib/R/lib/libR.so)
==26398==    by 0x4F4A679: Rf_eval (in /usr/lib/R/lib/libR.so)
==26398==    by 0x4F4EA11: ??? (in /usr/lib/R/lib/libR.so)
==26398==    by 0x4F4AB99: Rf_eval (in /usr/lib/R/lib/libR.so)
==26398==    by 0x4F4D723: ??? (in /usr/lib/R/lib/libR.so)
==26398==    by 0x4F4AA9F: Rf_eval (in /usr/lib/R/lib/libR.so)
==26398==    by 0x4F4CABE: ??? (in /usr/lib/R/lib/libR.so)
==26398==    by 0x4F42F67: ??? (in /usr/lib/R/lib/libR.so)
==26398==    by 0x4F50D55: ??? (in /usr/lib/R/lib/libR.so)
==26398==    by 0x4F512FF: ??? (in /usr/lib/R/lib/libR.so)
==26398==  Address 0x1dd75010 is not stack'd, malloc'd or (recently) free'd
==26398== 

 *** caught segfault ***
address 0x1dd75010, cause 'memory not mapped'
An irrecoverable exception occurred. R is aborting now ...
==26398== 
==26398== Process terminating with default action of signal 11 (SIGSEGV)
==26398==    at 0x547451D: raise (raise.c:53)
==26398==    by 0x547466F: ??? (in /lib/x86_64-linux-gnu/libpthread-2.24.so)
==26398==    by 0x1DD7500F: ???
==26398== 
==26398== HEAP SUMMARY:
==26398==     in use at exit: 259,162,338 bytes in 28,465 blocks
==26398==   total heap usage: 315,728 allocs, 287,263 frees, 1,311,929,706 bytes allocated
==26398== 
==26398== LEAK SUMMARY:
==26398==    definitely lost: 0 bytes in 0 blocks
==26398==    indirectly lost: 0 bytes in 0 blocks
==26398==      possibly lost: 0 bytes in 0 blocks
==26398==    still reachable: 259,162,338 bytes in 28,465 blocks
==26398==                       of which reachable via heuristic:
==26398==                         newarray           : 4,264 bytes in 1 blocks
==26398==         suppressed: 0 bytes in 0 blocks
==26398== Reachable blocks (those to which a pointer was found) are not shown.
==26398== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==26398== 
==26398== For counts of detected and suppressed errors, rerun with: -v
==26398== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault (core dumped)

Script, not minimal but does the job:

library(crashtest)
simulate.data <- function(Tn=10, x0=1, sigma_x=1, sigma_y=1, mu=0, dt=1) {
  x <- x0
  y <- rep(NA,Tn)
  for (k in 1:Tn) {
    x <- x*exp((mu-0.5*sigma_x^2)*dt + sqrt(dt) * rnorm(1, sd=sigma_x))
    y[k] <- rnorm(1, mean=log(x), sd=sigma_y)
  }
  return(y)
}
set.seed(834)
n <- 50
y <- simulate.data(n, mu = 0.05, sigma_x = 0.2, sigma_y = 1)

for (i in 1:10) {
  print(paste0("iteration ", i))
  Rcpp::sourceCpp("sde_functions1.cpp", rebuild = TRUE)
  pntrs <- create_pntrs()
  # this seems to be crucial, although it looks irrelevant. Probably triggers gc at some point?
  print("create matrices")
  a <- matrix(1000^2, 1000, 1000)

  print("repeat millstein function 100,000 times")
  for (j in 1:100000)
    out <- milstein(1, 4, 1, c(0.05, 0.2, 1), pntrs$drift, pntrs$diffusion, pntrs$ddiffusion, TRUE, 1)

}
@helske
Owner Author

helske commented Sep 5, 2017

Well, this is weird. The prior pdf function, which is not used in this simplified package at all, currently contains the following code:

// [[Rcpp::export]]
double log_prior_pdf(const arma::vec& theta) {

  double log_pdf = 0.0;
  double infinite = -arma::datum::inf; //comment out this line and everything works??
  return log_pdf;
}

If I remove the assignment with arma::datum::inf, I get rid of the segfault?

@helske
Owner Author

helske commented Sep 5, 2017

Hmm. Cleaning the cache with cleanupCacheDir = TRUE in sourceCpp does not help, but after setting a new cache directory at each iteration, I don't get segfaults anymore.

@eddelbuettel

Wild guess: Maybe the repeated use of sourceCpp() in the loop is at the heart of it? Maybe by removing all files related to sde_functions1.cpp you can make it more robust?

@helske
Owner Author

helske commented Sep 5, 2017

Yes, that is the issue. What I am trying to do here is to simulate the case with the bssm package, where you first build your model with sourceCpp, then do some MCMC runs etc., then tweak your model by running sourceCpp again, and so on. After a few of these "iterations" I (and my collaborators, so it is not just my configuration) always get a segfault. Emptying the cache dir manually or changing it with the cacheDir argument seems to fix the issue for now, but is slightly annoying.

@eddelbuettel

We could maybe add a new argument besides rebuild=TRUE for it which unlinks the expected files.

I think we simply run into trouble with the make-based build logic mixing old and new code, hence a segfault. So I would just play it safe and prefix sourceCpp() with a "Mr. Proper" function that unlinks the files.

You could also try to see and spy what Stan, Nimble, ... and other users do.

@Enchufa2

Enchufa2 commented Sep 5, 2017

This is the shortest version I managed to get so far:

code <- '
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

void func() {
  double infinite = arma::datum::inf;
}

// [[Rcpp::export]]
SEXP create_pntrs() {
  typedef void (*funcPtr)();
  return Rcpp::XPtr<funcPtr>(new funcPtr(&func));
}'

for (i in 1:10) {
  print(paste0("iteration ", i))
  Rcpp::sourceCpp(code = code, rebuild = TRUE)
  pntrs <- create_pntrs()
  # this seems to be crucial
  a <- matrix(1000^2, 1000, 1000)
}

@eddelbuettel

Another issue: changing code under the same filename may be a bad idea.

Maybe just append the iteration number, or the Sys.time() as an int, or a random draw, to the filename?

@Enchufa2

Enchufa2 commented Sep 5, 2017

I don't understand what the problem with arma::datum::inf is. Here is another version, with just the minimum code from Armadillo that triggers the segfault:

code <- '
#include <Rcpp.h>

class Datum_helper {
public:
  static double inf() { return 0; }
};

template<typename eT>
class Datum {
public:
  static const eT inf;
};

template<typename eT> const eT Datum<eT>::inf = Datum_helper::inf();

void func() {
  double infinite = Datum<double>::inf;
}

// [[Rcpp::export]]
SEXP create_pntrs() {
  typedef void (*funcPtr)();
  return Rcpp::XPtr<funcPtr>(new funcPtr(&func));
}'

for (i in 1:10) {
  print(paste0("iteration ", i))
  Rcpp::sourceCpp(code = code, rebuild = TRUE)
  pntrs <- create_pntrs()
  # this seems to be crucial
  a <- matrix(1000^2, 1000, 1000)
}

@Enchufa2

Enchufa2 commented Sep 5, 2017

And for the record, I'm unable to cause a segfault with clang.

@helske
Owner Author

helske commented Sep 5, 2017

I don't think the issue is really about arma::datum::inf; that just happens to trigger something in a similar way to the independent matrix construction. Like I said, I originally stumbled upon this issue with the bssm package, and I am pretty sure that there we have some cases where none of the user-supplied functions use arma::datum::inf. Although there might be infs in the MCMC code which uses these functions.

I can't be certain, but I have a feeling that we also have some (complex) cases where we get a segfault even when we are not compiling the same file twice, but two different files which contain (some) functions with the same names.

Just to be clear, I am not doing any automatic modifications to cpp files or iterating in a loop; the true workflow is more like this:

sourceCpp("model.cpp")
pntrs <- create_pntrs()
model <- sde_ssm(y, pntrs$drift, pntrs$diffusion, pntrs$ddiffusion)
out <- run_mcmc(model, ...)
...
# check results, do something, modify the model
...
sourceCpp("model.cpp", rebuild = TRUE)
pntrs <- create_pntrs()
model <- sde_ssm(y, pntrs$drift, pntrs$diffusion, pntrs$ddiffusion)
out <- run_mcmc(model, ...) #segfault! but not always...

Of course it could make more sense to always save each model version as a new file, but in the exploratory phase that feels like old-school version control. ;)

And I'm pretty sure one of my coworkers experiencing this issue is using clang with OSX.

@eddelbuettel

Could you alter the code to not repeat function names / file names when doing repeated compilations? I have the feeling we are violating some unspoken agreement.

@helske
Owner Author

helske commented Sep 5, 2017

Yes, I could, and although I had a feeling that I previously got a segfault with different file names, so far I haven't been able to reproduce that. So yes, it seems to be caused by repeated compilation of the same file (modified or not).

@Enchufa2

Enchufa2 commented Sep 5, 2017

> And I'm pretty sure one of my coworkers experiencing this issue is using clang with OSX.

Have you double-checked this? Just to be reasonably sure that this is not a compiler issue.

@helske
Owner Author

helske commented Sep 5, 2017

I am in the process of double-checking it, but after looking through old emails, he is probably not using clang after all, as he had some problems with clang and OpenMP in the past.

@Enchufa2

Enchufa2 commented Sep 5, 2017

> I don't think the issue is really about arma::datum::inf.

I'm not a C++ expert, but calling a class static method from a templated static initialisation sounds like trouble to me.

(Edited: forget the code, the initialisation cannot take place in a constructor, of course)
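
To make the concern concrete, here is a minimal sketch in plain C++ (C++11 assumed). The names `DatumHelper`, `DatumDynamic` and `DatumConstexpr` are hypothetical and only modelled on the reduced example earlier in this thread; this is not Armadillo's actual code. A static member initialised from a non-`constexpr` function call is formally initialised dynamically when the shared object is loaded, whereas a `constexpr` member is a compile-time constant. Whether that difference is what triggers the segfault is not established here.

```cpp
#include <limits>

// Hypothetical reduced classes, modelled on the snippet earlier in this
// thread; this is NOT Armadillo's actual implementation.
struct DatumHelper {
  // Not constexpr: a static member initialised from this call formally
  // needs dynamic initialisation (the compiler may still fold it).
  static double inf() { return std::numeric_limits<double>::infinity(); }
};

template <typename eT>
struct DatumDynamic {
  static const eT inf;
};
template <typename eT>
const eT DatumDynamic<eT>::inf = DatumHelper::inf();

template <typename eT>
struct DatumConstexpr {
  // Constant initialisation: the value is fixed at compile time.
  static constexpr eT inf = std::numeric_limits<eT>::infinity();
};

// Both variants yield the same value; only the initialisation differs.
inline double dynamic_inf() { return DatumDynamic<double>::inf; }
inline double constexpr_inf() { return DatumConstexpr<double>::inf; }
```

Both accessors return positive infinity, so swapping one variant for the other should not change numerical results.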

@helske
Owner Author

helske commented Sep 5, 2017

Hmm, perhaps. I'm no C++ expert either, but tomorrow I'll repeat some real workflows without inf and see how it goes. Meanwhile, I ran crash.R as above, but this time using clang with sourceCpp. I still got a segfault, although it took a few more iterations.

Removing the previous .so file before sourceCpp seems to be enough to get rid of the segfault, though.

@Enchufa2

Enchufa2 commented Sep 5, 2017

Avoiding the Datum_helper::inf call works for me without segfault:

code <- '
#include <Rcpp.h>

template<typename eT>
class Datum {
public:
  static const eT inf;
};

template<typename eT> const eT Datum<eT>::inf = 
  std::numeric_limits<eT>::has_infinity ? 
    std::numeric_limits<eT>::infinity() : std::numeric_limits<eT>::max();

void func() {
  double infinite = Datum<double>::inf;
}

// [[Rcpp::export]]
SEXP create_pntrs() {
  typedef void (*funcPtr)();
  return Rcpp::XPtr<funcPtr>(new funcPtr(&func));
}'

for (i in 1:100) {
  print(paste0("iteration ", i))
  Rcpp::sourceCpp(code = code, rebuild = TRUE)
  pntrs <- create_pntrs()
  # this seems to be crucial
  a <- matrix(1000^2, 1000, 1000)
}

So, I would try using std::numeric_limits<double>::infinity() directly instead of arma::datum::inf.
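
As an illustration of that replacement in a function like the `log_prior_pdf` shown earlier, here is a minimal sketch. The prior itself (flat on positive parameters) is a made-up example, and `std::vector<double>` stands in for `arma::vec` so that the snippet compiles without Armadillo; both are assumptions, not code from bssm.

```cpp
#include <limits>
#include <vector>

// Hypothetical log-prior: flat on theta > 0, zero density otherwise.
// std::vector<double> stands in for arma::vec (assumption for this sketch).
double log_prior_pdf(const std::vector<double>& theta) {
  // Standard-library replacement for -arma::datum::inf.
  const double neg_inf = -std::numeric_limits<double>::infinity();
  for (double t : theta) {
    if (t <= 0.0) return neg_inf;  // parameter outside the support
  }
  return 0.0;  // log of an (unnormalised) constant density
}
```

With the parameter vector used in the original script, `c(0.05, 0.2, 1)`, this sketch returns 0; any non-positive entry gives negative infinity.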

@helske
Owner Author

helske commented Sep 6, 2017

I have now changed all instances of arma::datum::inf in the bssm code and user-supplied functions to std::numeric_limits<double>::infinity() and haven't experienced any issues so far.

@eddelbuettel

Interesting. I don't quite see why this would bite, but good to know you are making progress...

@helske
Owner Author

helske commented Sep 13, 2017

I am seeing segfaults again in a more complex setting, so infinity wasn't the issue (which felt really strange anyway).

@Enchufa2

Just to be sure, is there any other mention of arma::datum? Because arma::datum::nan, for instance, shares its implementation with arma::datum::inf.

@helske
Owner Author

helske commented Sep 13, 2017

Yes there is one instance of arma::datum::nan as well, and several cases with arma::datum::eps. Maybe I'll try replacing those as well.

@Enchufa2

Yes, try replacing them, because as I said, the implementation for arma::datum::nan is structurally the same. In that sense, arma::datum::eps shouldn't be a problem, because it doesn't call the Datum_helper class. But better safe than sorry.
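
For reference, standard-library counterparts of the three constants mentioned here could be collected as below. This is only a sketch: the `consts` namespace and the helper names are hypothetical, and the mapping assumes that arma::datum::eps denotes machine epsilon for `double`.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helpers; the names are illustrative, not part of any library.
namespace consts {
// Counterpart of arma::datum::inf.
inline double pos_inf() { return std::numeric_limits<double>::infinity(); }
// Counterpart of arma::datum::nan.
inline double quiet_nan() { return std::numeric_limits<double>::quiet_NaN(); }
// Assumed counterpart of arma::datum::eps (machine epsilon for double).
inline double machine_eps() { return std::numeric_limits<double>::epsilon(); }
// NaN never compares equal to anything, so test it with std::isnan.
inline bool is_nan(double x) { return std::isnan(x); }
}  // namespace consts
```

Because all three values come straight from `std::numeric_limits`, none of them depends on a helper class being initialised when the shared object is loaded.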

@helske
Owner Author

helske commented Sep 14, 2017

I now changed both the nan and the eps cases, but still got a segfault.

@Enchufa2

Then I would try to simplify the code as much as possible and look for similarities with the case above.

It's not rocket science, but it's clear that you are hitting a quite rare memory problem in a quite complex setup, and we don't even have a clue about where it is. It could be in R, in Armadillo, in Rcpp or even in the compiler itself. From the information we have now, I would bet on the first two, but there's not enough evidence to support this.

@helske
Owner Author

helske commented Sep 14, 2017

Yeah, it is pretty difficult to trigger the segfault on demand in such a complex setting. It feels like it has something to do with garbage collection or something similar, as I am typically running some relatively big MCMC runs between the compilations when the segfault happens. And the segfault sometimes happens when I am back on the R side doing some stuff after rebuilding the model and running the MCMC. Like:

sourceCpp("model.cpp")
model <- ...
results <- run_mcmc(model)
plot(results)
summary(results)
sourceCpp("model.cpp")
model <- ...
results <- run_mcmc(model)
plot(results)
summary(results) # segfault!

But unlinking the previous .so file helps, so this isn't a super important issue, except that I need to instruct the users of bssm about this.

@eddelbuettel

Yes, it is clearly related to your use case which is a little special and a good stress case.

Removing the file, or randomizing file names, or ... may also work to be sure to really get 'fresh' code for fresh models.
