
R RandomForest Leads to a Crash #3021

Closed
nkanrar opened this issue Jul 21, 2021 · 37 comments · Fixed by #3034

Comments

@nkanrar

nkanrar commented Jul 21, 2021

Issue description

Hi,

Thanks for this great package. I was interested in using the random forest classifier in R's mlpack for a classification problem with relatively large matrices (thousands of rows, each with 200 features). The random forest classifier works great when I run it once, but within a few seconds of a single call my RStudio session crashes and forces a restart, with the message "Error occurred during transmission." I looked up why this could happen in RStudio, and the most likely cause is that the session's memory is maxed out by running the function.

I would appreciate any insight into what exactly is causing this issue.

Thank you so much!

Your environment

  • version of mlpack: mlpack 3.4.2.1
  • operating system: not relevant (this is via a cluster)
  • compiler:
  • version of dependencies (Boost/Armadillo):
  • any other environment information you think is relevant:

Steps to reproduce

Expected behavior

Actual behavior

@coatless
Contributor

You may need to request a larger memory size when running the interactive job on the cluster. This would need to be set with a flag that is dependent on the underlying scheduler (e.g. SLURM, PBS, ... ).

@nkanrar
Author

nkanrar commented Jul 22, 2021

Thanks @coatless!
One question I have: should running random_forest in mlpack lead to this memory issue?
I've run random_forest implementations from other packages on this same dataset and haven't faced any memory limit issues, so I'm a bit confused by this behavior.

@rcurtin
Member

rcurtin commented Jul 22, 2021

Hey @nkanrar, sorry this isn't working out of the box for you. What are the values of the labels you're providing as input?

@nkanrar
Author

nkanrar commented Jul 22, 2021

Thanks for your response @rcurtin!

The labels I am using as input are a numeric matrix, with values ranging from 1:12, corresponding to the 12 different classes in the dataset.

@rcurtin
Member

rcurtin commented Jul 22, 2021

👍 are there any NaN values in the input? Is it possible you can provide the dataset or a small subset of it that causes the crash?

@nkanrar
Author

nkanrar commented Jul 22, 2021

Thanks @rcurtin, could I e-mail the dataset to you directly? I would prefer not to share it publicly on GitHub.

Also, there are no NaNs in the input.

Also, an edit to my original post: my dataset is ~20k measurements with 200 features. But even subsetting it to ~10k measurements leads to the failure.

@rcurtin
Member

rcurtin commented Jul 22, 2021

Sure, my email is ryan@ratml.org. I'll try to run it and see if I can reproduce the issue. If you could send corresponding R code that can be used that would be great too. 👍

@nkanrar
Author

nkanrar commented Jul 22, 2021

@rcurtin Sent! Thank you so much!

@rcurtin
Member

rcurtin commented Jul 23, 2021

I ran this locally in an R session with

source("mlpack_issue.R")

and then saw that res seemed to be populated correctly, so, I think I am not able to reproduce the issue. Do you have any more information on what might be going on? I wonder if this is not an mlpack issue but instead something else. (I am not very familiar with R, so, unfortunately, I don't have many suggestions...)

I also ran the command-line bindings with the given data and that too ran without any issues.

@nkanrar
Author

nkanrar commented Jul 23, 2021

Hi @rcurtin, thanks for trying it out! One thing I noticed is that if I run my function multiple times, RStudio stalls and then crashes. If you try that, does it lead to a crash?
Otherwise, same here; I am pretty stumped about what the cause could be if it's not an mlpack issue.

@coatless
Contributor

I suppose we need more information about the RStudio version on the cluster, size of the data set, and average amount of RAM the interactive job is launched with.

@nkanrar
Author

nkanrar commented Jul 24, 2021

Thanks @coatless!

  • RStudio version is 4.0.3
    (screenshot attached)

  • The size of the training dataset is 20249 samples x 300 features, with one column of labels. It is 50.6 MB.

  • 600-700 MB of RAM is being used when I run the process in RStudio.

@coatless
Contributor

@nkanrar I need to see the RStudio Server version; please run:

RStudio.Version()

@nkanrar
Author

nkanrar commented Jul 24, 2021

@coatless

It is 1.3.959

(screenshot attached)

@evanbiederstedt

Hi there

I'm working with @nkanrar and we're using the same server.

You may need to request a larger memory size when running the interactive job on the cluster. This would need to be set with a flag that is dependent on the underlying scheduler (e.g. SLURM, PBS, ... ).

To clarify, she's interactively running R on a server with over 2TB of memory. Lack of available memory on the machine isn't the problem here.

I've reinstalled all system libraries with sudo using sudo apt-get install libmlpack-dev mlpack-bin. I also installed the package mlpack from CRAN.

Multiple invocations of rf_mlpack() lead to a segfault within seconds; I see no evidence of a massive amount of memory being consumed (maybe <1 MB before it fails). I can always run this function once; I can never run it twice without an immediate segmentation fault.

> res <- rf_mlpack(train, test)
> res <- rf_mlpack(train, test)

 *** caught segfault ***
address 0x400, cause 'memory not mapped'

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 1
R is aborting now ...
Segmentation fault (core dumped)

Here is the R session (I'm not using RStudio):

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] mlpack_3.4.2.1

loaded via a namespace (and not attached):
[1] compiler_4.0.3 Rcpp_1.0.7    
> 

Is it possible that there's a problem here with memory access in the C++ code (or the Rcpp implementation)? That would explain an immediate segfault like this. There could be other explanations as well, of course.

Thank you for the help with this. Best, Evan

@coatless
Contributor

@evanbiederstedt thanks for chiming in and providing more details.

I think this is a configuration hiccup on the actual server as we're not seeing a regression that would cause this to segfault elsewhere. Glancing at the two different R session overviews provided, I see:

BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/blas/liblapack.so.3.7.1

vs.

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

So, I think there are multiple local R copies in play.

For simplicity and to aid the debugging process, could we try starting from scratch? In particular, let's aim for:

  1. Removing the mlpack system libraries installed via sudo
  2. Removing the mlpack R package via remove.packages()
  3. Upgrading to R 4.1
  4. Installing mlpack and its dependencies from scratch with:
install.packages("mlpack", dependencies = TRUE)

Alternatively, we could use the pre-built mlpack binary from the c2d4u repository via:

# Register the repository
sudo add-apt-repository ppa:c2d4u.team/c2d4u4.0+

# Install the pre-built mlpack binary
apt install --no-install-recommends r-cran-mlpack

https://launchpad.net/~c2d4u.team/+archive/ubuntu/c2d4u4.0+/+packages?field.name_filter=mlpack&field.status_filter=published&field.series_filter=

https://cran.r-project.org/bin/linux/ubuntu/#get-5000-cran-packages

@evanbiederstedt

evanbiederstedt commented Jul 27, 2021

Hi @coatless

Thanks for the help with this---I appreciate your time here

So, I think there are multiple local R copies in play.

Yes, this is true. I wanted to try a different version of R than the RStudio copy.

I'm trying your instructions for the server right now.

I tried installing on macOS just now; there's a similar problem:

> res <- rf_mlpack(train, test)

> 
> res <- rf_mlpack(train, test)
R(56177,0x113b4ee00) malloc: *** error for object 0x7feb150e2000: pointer being freed was not allocated
R(56177,0x113b4ee00) malloc: *** set a breakpoint in malloc_error_break to debug
zsh: abort      R

Here's the R session info:

> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] mlpack_3.4.2.1

loaded via a namespace (and not attached):
[1] compiler_4.1.0 Rcpp_1.0.7    

I think you have the .rds files? They're not very big:

4.0M	test.rds
7.9M	train.rds

Could something in the inputs be triggering the segfaults?

RE:

Alternatively, we could use the pre-built mlpack binary from the c2d4u repository via:

This fails for me, actually.

Command:

# Register the repository
sudo add-apt-repository ppa:c2d4u.team/c2d4u4.0+

Error:

....
Reading package lists... Done

W: An error occurred during the signature verification. The repository is not updated and the previous index files will be used. GPG error: https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease: The following signatures were invalid: EXPKEYSIG 51716619E084DAB9 Michael Rutter <marutter@gmail.com>

W: Failed to fetch https://cloud.r-project.org/bin/linux/ubuntu/bionic-cran40/InRelease  The following signatures were invalid: EXPKEYSIG 51716619E084DAB9 Michael Rutter <marutter@gmail.com>

W: Some index files failed to download. They have been ignored, or old ones used instead.

...

@rcurtin
Member

rcurtin commented Jul 27, 2021

@evanbiederstedt thanks for mentioning that. I am not an R expert, but when I tried to run the code locally in an R session, I too noticed that I got a segfault. I haven't had a chance to dig any deeper yet, so I was initially assuming it might have to do with my ignorance of R details.

I will try to find some time shortly to set up a better environment that will tell me more of what is going on with the R bindings, and that should help diagnose what the issue is here. 👍

@evanbiederstedt

There is something strange here, and I think we're just following the API: https://www.mlpack.org/doc/mlpack-3.4.2/doxygen/r_quickstart.html

I think either we're using the package incorrectly, or there's an architectural issue. We're using the function mlpack::random_forest().

I realize I didn't share the function we're using in the examples above:

# Random forest using mlpack
rf_mlpack <- function(train, test){
  # mlpack requires numeric labels
  # Make a data.frame that maps true labels to numeric labels
  lkp_tbl <- data.frame(class_lab = unique(train$class_label), num_lab = 1:length(unique(train$class_label)))
  # Map the corresponding numeric values to the class labels
  labels = as.matrix(lkp_tbl$num_lab[match(train$class_label,lkp_tbl$class_lab)])
  # Construct the model
  output <- mlpack::random_forest(training = train[,-ncol(train)], labels = labels)
  sr_model <- output$output_model
  # Make the predictions
  output <- mlpack::random_forest(input_model=sr_model, test=test[,-ncol(test)])
  predictions <- output$predictions
  # Extract the original class labels from the numeric values
  predictions <- lkp_tbl$class_lab[match(predictions,lkp_tbl$num_lab)]
  # Get probabilities for each testing datapoint belonging to each class
  probabilities <- output$probabilities
  colnames(probabilities) <- lkp_tbl$class_lab[match(colnames(probabilities),lkp_tbl$num_lab)]
  pred <- list(predictions, probabilities)
  # Return class predictions and probabilities for each class
  return(pred)
}

# Run the random forest function
res <- rf_mlpack(train, test)

The segfault happens at some point here:

output <- mlpack::random_forest(training = train1[,-ncol(train1)], labels = labels)

sr_model <- output$output_model

output <- mlpack::random_forest(input_model=sr_model, test=test1[,-ncol(test1)])
predictions <- output$predictions

where the datatype of sr_model strikes me as something that might cause problems (maybe?):

> sr_model
<pointer: 0x7f824d6828f0>
attr(,"type")
[1] "RandomForestModel"

@coatless
Contributor

@evanbiederstedt

I do not have the RDS files. @rcurtin mind forwarding?

How were the matrices constructed?

For a quick reprex, I plugged iris into the documentation example to get:

# Load the mlpack library
library("mlpack")

# Construct model design
my_model_design = model.frame(Species ~., data = iris)

# Extract the model design matrix
X = model.matrix(my_model_design, data = iris)

# Extract the labels as a vector of factors
y_factor = model.response(my_model_design)

# Convert factor labels to integer encodings. 
y_integer_encoded = as.matrix(as.integer(y_factor))

# Train the model
output <- random_forest(training=X, labels=y_integer_encoded, minimum_leaf_size=20,
                        num_trees=10, print_training_accuracy=TRUE)

# See model output
output
#> $output_model
#> <pointer: 0x7fe45345f240>
#> attr(,"type")
#> [1] "RandomForestModel"
#> 
#> $predictions
#>      [,1]
#> 
#> $probabilities
#> <0 x 0 matrix>

# Extract the pointer to the underlying model for predictions
rf_model = output$output_model

# Attempt to predict using the underlying model
my_model_test_data = random_forest(
  input_model=rf_model, 
  test=X,
  test_labels=y_integer_encoded)

# View predictions
my_model_test_data
#> $output_model
#> <pointer: 0x7fe45345f240>
#> attr(,"type")
#> [1] "RandomForestModel"
#> 
#> $predictions
#>        [,1]
#>   [1,]    1
#>   [2,]    1
#>   [3,]    1
#>   [4,]    1
#>   [5,]    1
#> 
#> $probabilities
#>               [,1]        [,2]        [,3]
#>   [1,] 0.980851064 0.019148936 0.000000000
#>   [2,] 0.970851064 0.024148936 0.005000000
#>   [3,] 0.970851064 0.024148936 0.005000000
#>   [4,] 0.970851064 0.024148936 0.005000000
#>   [5,] 0.980851064 0.019148936 0.000000000

This seems to work well for testing the predictions; however, the results on the training data set are MIA.

Perhaps the segfault is related to accessing the training results, e.g. output$predictions and output$probabilities?

@coatless
Contributor

Thanks for the data forward @rcurtin.

So, I can reliably reproduce the segfault only after performing multiple calls of the rf_mlpack() function. I'll throw it in a debugging container à la r-debug and see what's going on.

@coatless
Contributor

Sad news: I'm short on memory to handle this compile within the docker container. I won't be able to easily look into this until I get a new office computer in August. :(

g++: fatal error: Killed signal terminated program cc1plus
compilation terminated.
make[1]: *** [/usr/local/RDvalgrind/lib/R/etc/Makeconf:177: adaboost.o] Error 1
make[1]: Leaving directory '/tmp/RtmpIytvGc/R.INSTALL31a78511356/mlpack/src'
ERROR: compilation failed for package ‘mlpack’
* removing ‘/usr/local/RDvalgrind/lib/R/site-library/mlpack’

@evanbiederstedt

I appreciate the help, @coatless

Sad news, I'm short on memory to handle this compile within the docker container. I won't be able to easily look into this until I get a new office computer in August. :(

I hope I'm not being too nosy about how you're using this (as I'm very likely wrong), but if you use the container interactively, shouldn't it use your computer's memory?

The RDS files feel very small; how memory intensive is this operation? I am running this on my Mac OS. The segfault happens immediately; I don't even notice a massive surge in memory.

I mention the above simply to help with collectively debugging :)

@coatless
Contributor

@evanbiederstedt So, the mlpack R package needs about 2.5 GB per core to compile. I needed to tweak the container memory limitation to about 9 GB and drop the number of cores associated with my machine.

From there, I'm currently running valgrind from r-debug container on the procedure with all tracking enabled. Some quick steps to replicate:

Step 1: Setup the debugging environment

# Download wch1/r-debug and launch it under the rd tag. 
# Map access to where the files are stored.
docker run --rm -ti --name rd --security-opt seccomp=unconfined -v ~/Downloads/mlpack_issue_3021/:/mlpack_issue_3021 wch1/r-debug

# Install the mlpack r package from source
RDvalgrind -e "install.packages('mlpack')"

Step 2: Tag the container to avoid another 30-minute compile

Outside of the docker container's terminal, we'll create a checkpoint once mlpack is installed.

docker commit rd mlpackvalgrind

Step 3: Start trying to diagnose what's up.

Back inside the container with mlpack installed, let's open up a valgrind session and run the code with maximum logging.

RDvalgrind -d "valgrind --track-origins=yes --leak-check=full --show-reachable=yes" -f /mlpack_issue_3021/mlpack-issue-code-only.R

@coatless
Contributor

Alrighty, so... Looks like there is a leak, but I wasn't able to immediately trigger a segfault.

See attached for some logs.
mlpack-leak-valgrind-round-1.txt

@evanbiederstedt

@coatless

So, the mlpack R package needs about 2.5 GB per core to compile. I needed to tweak the container memory limitation to about 9 GB and drop the number of cores associated with my machine.

Ah, this makes sense. That would cause a few problems, yes :)

Here's what I see running the commands above:

==1029== Memcheck, a memory error detector
==1029== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1029== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==1029== Command: /usr/local/RDvalgrind/lib/R/bin/exec/R -f /mlpack_issue_3021/mlpack-issue-code-only.R
==1029== 
Fatal error: cannot open file '/mlpack_issue_3021/mlpack-issue-code-only.R': No such file or directory
==1029== 
==1029== HEAP SUMMARY:
==1029==     in use at exit: 118 bytes in 5 blocks
==1029==   total heap usage: 358 allocs, 353 frees, 148,182 bytes allocated
==1029== 
==1029== 8 bytes in 1 blocks are still reachable in loss record 1 of 3
==1029==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==1029==    by 0x56E224C: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==1029==    by 0x56F2BAA: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==1029==    by 0x56E0679: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==1029==    by 0x4011B89: call_init.part.0 (dl-init.c:72)
==1029==    by 0x4011C90: call_init (dl-init.c:30)
==1029==    by 0x4011C90: _dl_init (dl-init.c:119)
==1029==    by 0x4001139: ??? (in /usr/lib/x86_64-linux-gnu/ld-2.31.so)
==1029==    by 0x2: ???
==1029==    by 0x1FFF0005EE: ???
==1029==    by 0x1FFF000615: ???
==1029==    by 0x1FFF000618: ???
==1029== 
==1029== 24 bytes in 1 blocks are still reachable in loss record 2 of 3
==1029==    at 0x483DD99: calloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==1029==    by 0x48B8F67: R_set_command_line_arguments (CommandLineArgs.c:59)
==1029==    by 0x4B04F1F: Rf_initialize_R (system.c:340)
==1029==    by 0x109199: main (Rmain.c:28)
==1029== 
==1029== 86 bytes in 3 blocks are still reachable in loss record 3 of 3
==1029==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==1029==    by 0x4DE350E: strdup (strdup.c:42)
==1029==    by 0x48B8FC3: R_set_command_line_arguments (CommandLineArgs.c:64)
==1029==    by 0x4B04F1F: Rf_initialize_R (system.c:340)
==1029==    by 0x109199: main (Rmain.c:28)
==1029== 
==1029== LEAK SUMMARY:
==1029==    definitely lost: 0 bytes in 0 blocks
==1029==    indirectly lost: 0 bytes in 0 blocks
==1029==      possibly lost: 0 bytes in 0 blocks
==1029==    still reachable: 118 bytes in 5 blocks
==1029==         suppressed: 0 bytes in 0 blocks
==1029== 
==1029== For lists of detected and suppressed errors, rerun with: -s
==1029== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

@coatless
Contributor

@evanbiederstedt Nice!

RE: Valgrind

The -f argument is on a custom path I set up. That's why we're seeing:

Fatal error: cannot open file '/mlpack_issue_3021/mlpack-issue-code-only.R': No such file or directory

I'm not sure if you have unpacked the tar that @nkanrar sent into downloads. If you do, I think the path:

~/Downloads/mlpack_issue_3021

should exist. Thus, when the container is launched, we should have the following file mapping into the docker container:

-v ~/Downloads/mlpack_issue_3021/:/mlpack_issue_3021

From there, I think the file name needs to be slightly changed in the launch command to:

RDvalgrind -d "valgrind --track-origins=yes --leak-check=full --show-reachable=yes" -f /mlpack_issue_3021/mlpack_issue.R

I have this going again... It takes about an hour to fully run with all tracking.

@evanbiederstedt
Copy link

evanbiederstedt commented Jul 27, 2021

Thank you for explaining this:

Fatal error: cannot open file '/mlpack_issue_3021/mlpack-issue-code-only.R': No such file or directory

I didn't quite understand the issue, as I didn't have the file.

I've re-run everything with this file in ~/Downloads/mlpack_issue_3021, and I guess there's no leak:

==1027== LEAK SUMMARY:
==1027==    definitely lost: 0 bytes in 0 blocks
==1027==    indirectly lost: 0 bytes in 0 blocks
==1027==      possibly lost: 0 bytes in 0 blocks
==1027==    still reachable: 55,694,619 bytes in 11,076 blocks
==1027==                       of which reachable via heuristic:
==1027==                         newarray           : 4,264 bytes in 1 blocks
==1027==         suppressed: 0 bytes in 0 blocks
==1027== 
==1027== For lists of detected and suppressed errors, rerun with: -s
==1027== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Well, I remain confused....
output1.txt

@rcurtin
Member

rcurtin commented Jul 27, 2021

Is it easier to run with gdb? R -d gdb -e 'source("mlpack_issue.R"); source("mlpack_issue.R")' seems to work fine for me to get a stack trace. Just guessing, but the backtrace I get suggests that the random forest model is being freed twice. I'm trying to instrument the code now to get a bit more information.

@coatless
Contributor

@rcurtin I can throw it into gdb.

One note I found interesting was a warning about the RNG seed:

output <- mlpack::random_forest(training = training_data, labels = labels)
Warning message:
In random_forest_mlpackMain() :
  When called from R, the RNG seed has to be set at the R level via set.seed()

@coatless
Contributor

coatless commented Jul 28, 2021

On the second run with gdb enabled, I'm getting the segmentation fault at:

Thread 1 "R" received signal SIGSEGV, Segmentation fault.
mlpack::tree::DecisionTree<mlpack::tree::GiniGain, mlpack::tree::BestBinaryNumericSplit, mlpack::tree::AllCategoricalSplit, mlpack::tree::MultipleRandomDimensionSelect, double, false>::~DecisionTree (this=0x555556589230, 
    __in_chrg=<optimized out>) at ./mlpack/methods/decision_tree/decision_tree_impl.hpp:445

Steps

Step 0: Launch container with mapping

docker run --rm -ti --name rd --security-opt seccomp=unconfined -v ~/Downloads/mlpack_issue_3021/:/mlpack_issue_3021 wch1/r-debug

Step 1: Install mlpack package under R-devel

# Install mlpack under R-devel
RD -e "install.packages('mlpack')"

Note: the r-debug image does not include gfortran.

Step 2: Load R-devel with a debugger

RD -d gdb

Once gdb is loaded, type:

run

to begin the session.

Step 3: Run R code and aim to trigger the segfault.

source('/mlpack_issue_3021/mlpack_issue.R')

@rcurtin
Member

rcurtin commented Jul 28, 2021

(I wrote this comment last night, but somehow did not hit the comment button. Oops!)

Interesting; I guess for the R distribution of mlpack we may want to modify the mlpack::RandomSeed() method for the R bindings.

Anyway, I dug a bit and found that the RandomForestModel created by random_forest() is being deleted twice. Is it possible that we are handling the XPtr unsafely? Here are the two functions that are used from Rcpp to get and set the random forest model:

// [[Rcpp::export]]
SEXP IO_GetParamRandomForestModelPtr(const std::string& paramName)
{
  return std::move((Rcpp::XPtr<RandomForestModel>) IO::GetParam<RandomForestModel*>(paramName));
}

// [[Rcpp::export]]
void IO_SetParamRandomForestModelPtr(const std::string& paramName, SEXP ptr)
{
  IO::GetParam<RandomForestModel*>(paramName) =  Rcpp::as<Rcpp::XPtr<RandomForestModel>>(ptr);
  IO::SetPassed(paramName);
}

I'll try to keep digging a little more tomorrow. The gdb backtrace was a little bit unclear on why the free was being called twice; I didn't see where or

@rcurtin
Member

rcurtin commented Jul 28, 2021

Here is a reduced script that produces the issue:

library(mlpack)

# Load training and testing dataset
# The last column of both files contain the labels, the column name is "class_label"
train <- readRDS("train.rds")

lkp_tbl <- data.frame(class_lab = unique(train$class_label), num_lab = 1:length(unique(train$class_label)))
labels = as.matrix(lkp_tbl$num_lab[match(train$class_label,lkp_tbl$class_lab)])

# Train the first model and create a pointer to it.
output <- mlpack::random_forest(training = train[,-ncol(train)], labels = labels)
# Get some predictions with the first model; overwrite `output`.
output <- mlpack::random_forest(input_model=output$output_model, test=train[,-ncol(train)])

# Now overwrite `output` so that any XPtrs have now gone out of scope.
output <- 3

# Force double free
gc()

I think I understand what is happening here.

After the first call to mlpack::random_forest(), the variable output$output_model is an XPtr to the model that was created. But then, we overwrite output with a second call to mlpack::random_forest(). This causes that first XPtr to go out of scope, and if we called gc() at this moment, that XPtr would be freed.

But that second call to mlpack::random_forest() also produces a new output$output_model (another XPtr), which happens to point to the exact same memory address as what we set input_model to (which is the previous call's output$output_model, which has now gone out of scope).

Then, we set output to something new to make this second output$output_model (which points to the same memory as the first output$output_model) go out of scope. Thus, we have two XPtrs pointing to the exact same memory that are out of scope, and so calling gc() will try to free both of them... which of course causes a segfault.

I am not an Rcpp expert. Is there a way to mark an XPtr as an "alias", e.g., it doesn't need to be freed in its finalizer? (Either in R or in C++, although actually here doing it in R would be preferable.) If there is a way to do that, then I can think of a solution that we can apply here: if a model output parameter is the same as any model input parameter, we just mark it as an alias.

@rcurtin
Member

rcurtin commented Jul 28, 2021

Actually we have code in the generated Python bindings that does this exact check. Here's the generated code for the part of the random forest binding in Python that sets the output model:

  result['output_model'] = RandomForestModelType()
  (<RandomForestModelType?> result['output_model']).modelptr = GetParamPtr[RandomForestModel]('output_model')
  if input_model is not None:
    if (<RandomForestModelType> result['output_model']).modelptr == (<RandomForestModelType> input_model).modelptr:
      (<RandomForestModelType> result['output_model']).modelptr = <RandomForestModel*> 0
      result['output_model'] = input_model

In Python, it looks like the choice being made is to make the output model a Python-level copy/alias of the given input model, instead of using the model pointer directly from C++. Perhaps a similar strategy would work in R? I'd prefer to get some advice from some R experts before implementing something though. 😄

@rcurtin
Member

rcurtin commented Aug 18, 2021

Ok, I got the PR #3034 to compile, and so if you are brave, you can download the tarball directly from the Github Actions build:

https://github.com/mlpack/mlpack/suites/3531895147/artifacts/84500190

(That is a tarball that... contains another tarball. You can unpack that and then install it directly with install.packages().)

And, I think that should fix the issue! If you're up for giving it a shot, let us know if it works. 😄

@coatless
Contributor

I can confirm that #3034 fixes the Xptr issue with RandomForest.

Though, we need to remove the debug statements showing pointer creation:

R> # Attempt to predict using the underlying model
R> my_model_test_data = random_forest(
+   input_model=rf_model, 
+   test=X,
+   test_labels=y_integer_encoded)
create params 0x7fa91ae83230
create timers 0x7fa9157bc220
Full reprex:
# Load the mlpack library
library("mlpack")

# Construct model design
my_model_design = model.frame(Species ~., data = iris)

# Extract the model design matrix
X = model.matrix(my_model_design, data = iris)

# Extract the labels as a vector of factors
y_factor = model.response(my_model_design)

# Convert factor labels to integer encodings. 
y_integer_encoded = as.matrix(as.integer(y_factor))

# Train the model
output <- random_forest(training=X, labels=y_integer_encoded, minimum_leaf_size=20,
                        num_trees=10, print_training_accuracy=TRUE)

# See model output
output
#> $output_model
#> <pointer: 0x7fc6eed34460>
#> attr(,"type")
#> [1] "RandomForestModel"
#> 
#> $predictions
#>      [,1]
#> 
#> $probabilities
#> <0 x 0 matrix>

# Extract the pointer to the underlying model for predictions
rf_model = output$output_model

# Attempt to predict using the underlying model
my_model_test_data = random_forest(
  input_model=rf_model, 
  test=X,
  test_labels=y_integer_encoded)

# View predictions
my_model_test_data
#> $output_model
#> <pointer: 0x7fc6eed34460>
#> attr(,"type")
#> [1] "RandomForestModel"
#> 
#> $predictions
#>        [,1]
#>   [1,]    1
#>   [2,]    1
#>   [3,]    1
#>   [4,]    1
#>   [5,]    1
#>   [6,]    1
#>   [7,]    1
#>   [8,]    1
#>   [9,]    1
#>  [10,]    1
#>  [11,]    1
#>  [12,]    1
#>  [13,]    1
#>  [14,]    1
#>  [15,]    1
#>  [16,]    1
#>  [17,]    1
#>  [18,]    1
#>  [19,]    1
#>  [20,]    1
#>  [21,]    1
#>  [22,]    1
#>  [23,]    1
#>  [24,]    1
#>  [25,]    1
#>  [26,]    1
#>  [27,]    1
#>  [28,]    1
#>  [29,]    1
#>  [30,]    1
#>  [31,]    1
#>  [32,]    1
#>  [33,]    1
#>  [34,]    1
#>  [35,]    1
#>  [36,]    1
#>  [37,]    1
#>  [38,]    1
#>  [39,]    1
#>  [40,]    1
#>  [41,]    1
#>  [42,]    1
#>  [43,]    1
#>  [44,]    1
#>  [45,]    1
#>  [46,]    1
#>  [47,]    1
#>  [48,]    1
#>  [49,]    1
#>  [50,]    1
#>  [51,]    2
#>  [52,]    2
#>  [53,]    2
#>  [54,]    2
#>  [55,]    2
#>  [56,]    2
#>  [57,]    2
#>  [58,]    2
#>  [59,]    2
#>  [60,]    2
#>  [61,]    2
#>  [62,]    2
#>  [63,]    2
#>  [64,]    2
#>  [65,]    2
#>  [66,]    2
#>  [67,]    2
#>  [68,]    2
#>  [69,]    2
#>  [70,]    2
#>  [71,]    3
#>  [72,]    2
#>  [73,]    2
#>  [74,]    2
#>  [75,]    2
#>  [76,]    2
#>  [77,]    2
#>  [78,]    2
#>  [79,]    2
#>  [80,]    2
#>  [81,]    2
#>  [82,]    2
#>  [83,]    2
#>  [84,]    3
#>  [85,]    2
#>  [86,]    2
#>  [87,]    2
#>  [88,]    2
#>  [89,]    2
#>  [90,]    2
#>  [91,]    2
#>  [92,]    2
#>  [93,]    2
#>  [94,]    2
#>  [95,]    2
#>  [96,]    2
#>  [97,]    2
#>  [98,]    2
#>  [99,]    2
#> [100,]    2
#> [101,]    3
#> [102,]    3
#> [103,]    3
#> [104,]    3
#> [105,]    3
#> [106,]    3
#> [107,]    2
#> [108,]    3
#> [109,]    3
#> [110,]    3
#> [111,]    3
#> [112,]    3
#> [113,]    3
#> [114,]    3
#> [115,]    3
#> [116,]    3
#> [117,]    3
#> [118,]    3
#> [119,]    3
#> [120,]    3
#> [121,]    3
#> [122,]    3
#> [123,]    3
#> [124,]    3
#> [125,]    3
#> [126,]    3
#> [127,]    3
#> [128,]    3
#> [129,]    3
#> [130,]    3
#> [131,]    3
#> [132,]    3
#> [133,]    3
#> [134,]    3
#> [135,]    3
#> [136,]    3
#> [137,]    3
#> [138,]    3
#> [139,]    3
#> [140,]    3
#> [141,]    3
#> [142,]    3
#> [143,]    3
#> [144,]    3
#> [145,]    3
#> [146,]    3
#> [147,]    3
#> [148,]    3
#> [149,]    3
#> [150,]    3
#> 
#> $probabilities
#>               [,1]        [,2]        [,3]
#>   [1,] 1.000000000 0.000000000 0.000000000
#>   [2,] 0.944000000 0.048000000 0.008000000
#>   [3,] 1.000000000 0.000000000 0.000000000
#>   [4,] 1.000000000 0.000000000 0.000000000
#>   [5,] 1.000000000 0.000000000 0.000000000
#>   [6,] 1.000000000 0.000000000 0.000000000
#>   [7,] 1.000000000 0.000000000 0.000000000
#>   [8,] 1.000000000 0.000000000 0.000000000
#>   [9,] 0.944000000 0.048000000 0.008000000
#>  [10,] 1.000000000 0.000000000 0.000000000
#>  [11,] 1.000000000 0.000000000 0.000000000
#>  [12,] 1.000000000 0.000000000 0.000000000
#>  [13,] 0.944000000 0.048000000 0.008000000
#>  [14,] 0.944000000 0.048000000 0.008000000
#>  [15,] 0.905882353 0.088235294 0.005882353
#>  [16,] 0.905882353 0.088235294 0.005882353
#>  [17,] 1.000000000 0.000000000 0.000000000
#>  [18,] 1.000000000 0.000000000 0.000000000
#>  [19,] 0.905882353 0.088235294 0.005882353
#>  [20,] 1.000000000 0.000000000 0.000000000
#>  [21,] 1.000000000 0.000000000 0.000000000
#>  [22,] 1.000000000 0.000000000 0.000000000
#>  [23,] 1.000000000 0.000000000 0.000000000
#>  [24,] 1.000000000 0.000000000 0.000000000
#>  [25,] 1.000000000 0.000000000 0.000000000
#>  [26,] 0.944000000 0.048000000 0.008000000
#>  [27,] 1.000000000 0.000000000 0.000000000
#>  [28,] 1.000000000 0.000000000 0.000000000
#>  [29,] 1.000000000 0.000000000 0.000000000
#>  [30,] 1.000000000 0.000000000 0.000000000
#>  [31,] 1.000000000 0.000000000 0.000000000
#>  [32,] 1.000000000 0.000000000 0.000000000
#>  [33,] 1.000000000 0.000000000 0.000000000
#>  [34,] 1.000000000 0.000000000 0.000000000
#>  [35,] 1.000000000 0.000000000 0.000000000
#>  [36,] 1.000000000 0.000000000 0.000000000
#>  [37,] 1.000000000 0.000000000 0.000000000
#>  [38,] 1.000000000 0.000000000 0.000000000
#>  [39,] 0.944000000 0.048000000 0.008000000
#>  [40,] 1.000000000 0.000000000 0.000000000
#>  [41,] 1.000000000 0.000000000 0.000000000
#>  [42,] 0.944000000 0.048000000 0.008000000
#>  [43,] 1.000000000 0.000000000 0.000000000
#>  [44,] 1.000000000 0.000000000 0.000000000
#>  [45,] 1.000000000 0.000000000 0.000000000
#>  [46,] 0.944000000 0.048000000 0.008000000
#>  [47,] 1.000000000 0.000000000 0.000000000
#>  [48,] 1.000000000 0.000000000 0.000000000
#>  [49,] 1.000000000 0.000000000 0.000000000
#>  [50,] 1.000000000 0.000000000 0.000000000
#>  [51,] 0.005882353 0.868539053 0.125578594
#>  [52,] 0.005882353 0.868539053 0.125578594
#>  [53,] 0.005882353 0.689274839 0.304842808
#>  [54,] 0.044000000 0.944666667 0.011333333
#>  [55,] 0.005882353 0.859250515 0.134867132
#>  [56,] 0.005882353 0.962174688 0.031942959
#>  [57,] 0.005882353 0.868539053 0.125578594
#>  [58,] 0.044000000 0.944666667 0.011333333
#>  [59,] 0.005882353 0.864012420 0.130105227
#>  [60,] 0.044000000 0.882761905 0.073238095
#>  [61,] 0.044000000 0.944666667 0.011333333
#>  [62,] 0.005882353 0.868649373 0.125468274
#>  [63,] 0.000000000 0.943569334 0.056430666
#>  [64,] 0.005882353 0.812107658 0.182009989
#>  [65,] 0.005882353 0.984901961 0.009215686
#>  [66,] 0.005882353 0.891266326 0.102851321
#>  [67,] 0.005882353 0.845922100 0.148195547
#>  [68,] 0.000000000 0.968095238 0.031904762
#>  [69,] 0.000000000 0.906080156 0.093919844
#>  [70,] 0.000000000 0.968095238 0.031904762
#>  [71,] 0.000000000 0.383985507 0.616014493
#>  [72,] 0.005882353 0.960376057 0.033741590
#>  [73,] 0.000000000 0.654710874 0.345289126
#>  [74,] 0.005882353 0.874012420 0.120105227
#>  [75,] 0.005882353 0.950376057 0.043741590
#>  [76,] 0.005882353 0.891266326 0.102851321
#>  [77,] 0.005882353 0.766942823 0.227174824
#>  [78,] 0.005882353 0.503714163 0.490403484
#>  [79,] 0.005882353 0.875744022 0.118373625
#>  [80,] 0.000000000 0.968095238 0.031904762
#>  [81,] 0.044000000 0.944666667 0.011333333
#>  [82,] 0.044000000 0.944666667 0.011333333
#>  [83,] 0.000000000 0.968095238 0.031904762
#>  [84,] 0.000000000 0.422007340 0.577992660
#>  [85,] 0.044000000 0.805686806 0.150313194
#>  [86,] 0.005882353 0.821396196 0.172721451
#>  [87,] 0.005882353 0.868539053 0.125578594
#>  [88,] 0.000000000 0.933569334 0.066430666
#>  [89,] 0.005882353 0.930554135 0.063563512
#>  [90,] 0.044000000 0.944666667 0.011333333
#>  [91,] 0.044000000 0.944666667 0.011333333
#>  [92,] 0.005882353 0.821396196 0.172721451
#>  [93,] 0.000000000 0.968095238 0.031904762
#>  [94,] 0.044000000 0.944666667 0.011333333
#>  [95,] 0.000000000 0.968095238 0.031904762
#>  [96,] 0.005882353 0.930554135 0.063563512
#>  [97,] 0.005882353 0.984901961 0.009215686
#>  [98,] 0.005882353 0.950376057 0.043741590
#>  [99,] 0.044000000 0.944666667 0.011333333
#> [100,] 0.005882353 0.984901961 0.009215686
#> [101,] 0.000000000 0.003571429 0.996428571
#> [102,] 0.000000000 0.162916431 0.837083569
#> [103,] 0.000000000 0.007692308 0.992307692
#> [104,] 0.000000000 0.048630717 0.951369283
#> [105,] 0.000000000 0.007692308 0.992307692
#> [106,] 0.000000000 0.007692308 0.992307692
#> [107,] 0.044000000 0.860034632 0.095965368
#> [108,] 0.000000000 0.052751596 0.947248404
#> [109,] 0.000000000 0.124180168 0.875819832
#> [110,] 0.000000000 0.007692308 0.992307692
#> [111,] 0.000000000 0.147692308 0.852307692
#> [112,] 0.000000000 0.120059289 0.879940711
#> [113,] 0.000000000 0.007692308 0.992307692
#> [114,] 0.000000000 0.154220779 0.845779221
#> [115,] 0.000000000 0.082792208 0.917207792
#> [116,] 0.000000000 0.003571429 0.996428571
#> [117,] 0.000000000 0.016387960 0.983612040
#> [118,] 0.000000000 0.007692308 0.992307692
#> [119,] 0.000000000 0.115484515 0.884515485
#> [120,] 0.000000000 0.422007340 0.577992660
#> [121,] 0.000000000 0.007692308 0.992307692
#> [122,] 0.000000000 0.274696970 0.725303030
#> [123,] 0.000000000 0.044055944 0.955944056
#> [124,] 0.000000000 0.395619964 0.604380036
#> [125,] 0.000000000 0.007692308 0.992307692
#> [126,] 0.000000000 0.016387960 0.983612040
#> [127,] 0.000000000 0.415495741 0.584504259
#> [128,] 0.000000000 0.290684899 0.709315101
#> [129,] 0.000000000 0.039935065 0.960064935
#> [130,] 0.005882353 0.363714163 0.630403484
#> [131,] 0.000000000 0.052751596 0.947248404
#> [132,] 0.000000000 0.007692308 0.992307692
#> [133,] 0.000000000 0.039935065 0.960064935
#> [134,] 0.005882353 0.485956920 0.508160727
#> [135,] 0.000000000 0.422007340 0.577992660
#> [136,] 0.000000000 0.007692308 0.992307692
#> [137,] 0.000000000 0.003571429 0.996428571
#> [138,] 0.000000000 0.012267081 0.987732919
#> [139,] 0.000000000 0.377641421 0.622358579
#> [140,] 0.000000000 0.007692308 0.992307692
#> [141,] 0.000000000 0.007692308 0.992307692
#> [142,] 0.000000000 0.147692308 0.852307692
#> [143,] 0.000000000 0.162916431 0.837083569
#> [144,] 0.000000000 0.007692308 0.992307692
#> [145,] 0.000000000 0.007692308 0.992307692
#> [146,] 0.000000000 0.007692308 0.992307692
#> [147,] 0.000000000 0.210059289 0.789940711
#> [148,] 0.000000000 0.007692308 0.992307692
#> [149,] 0.000000000 0.003571429 0.996428571
#> [150,] 0.000000000 0.105124224 0.894875776
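As a quick sanity check on the reprex above (a sketch, not part of the original output — it assumes `y_factor` from the model frame earlier), the integer-encoded predictions can be mapped back to the original factor levels and compared against the true labels:

```r
# Sketch: convert integer-encoded predictions back to the species names
# and compute the training-set agreement.
pred_factor <- factor(levels(y_factor)[my_model_test_data$predictions],
                      levels = levels(y_factor))
mean(pred_factor == y_factor)
```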

@rcurtin
Member

rcurtin commented Sep 20, 2021

Awesome, #3034 is merged, so hopefully this is fixed. There is at least one more little R fix that I want to get merged, and then I'll get a new version posted to CRAN. 👍 Thanks again for the report!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants