Error in unserialize(socklist[[n]]) : error reading from connection #44

Closed
OlexiyPukhov opened this issue Oct 24, 2022 · 4 comments

@OlexiyPukhov

OlexiyPukhov commented Oct 24, 2022

I am working with a large dataset (27,000 x 140). When I first ran KernelSHAP without parallel computation, the progress bar did not move (it stayed at 0). When I turned on parallel processing as described on the home page of your repo, I got this error:

Error in unserialize(socklist[[n]]) : error reading from connection

What should I do? I am on Windows with a Core i7-12700K.

@mayer79
Collaborator

mayer79 commented Oct 24, 2022

Hello, and sorry you ran into a problem.

I probably cannot solve the error coming from parallel computing on Windows. But maybe I can help with the single-threaded problem (no movement in the progress bar):

  1. SHAP analyses are usually done by decomposing 200-2000 predictions. Maybe you can start with an X consisting of 500 randomly sampled rows.
  2. What is the size of your background data set? Scott Lundberg proposes using small sets of only a few rows (up to 100). Here, too, I'd suggest starting with a bg_X of 20 rows. If the speed is acceptable, you can increase it to 100.
  3. I guess you are not using exact = TRUE?
  4. Are you using the current CRAN version?

What do you observe?
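
The first two suggestions can be sketched as follows. This is a minimal illustration, not the package's documented recipe: the data frame `df`, the linear model `fit`, and the column names are hypothetical stand-ins for the real 27,000 x 140 data, and the `kernelshap()` call assumes the `kernelshap(object, X, bg_X)` argument order, which may differ between package versions.

```r
set.seed(1)

# Hypothetical stand-in for the real data (smaller, so the sketch runs quickly)
n <- 5000
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = runif(n))
df$y <- 1 + 2 * df$x1 - df$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2 + x3, data = df)

features <- c("x1", "x2", "x3")

# Rows to explain: a random sample of 500 rows instead of the full data
X_small <- df[sample(nrow(df), 500), features]

# Background data: a small random sample of 20 rows
bg_X <- df[sample(nrow(df), 20), features]

# Run KernelSHAP only if the package is installed
# (argument names are an assumption; check ?kernelshap for your version)
if (requireNamespace("kernelshap", quietly = TRUE)) {
  s <- kernelshap::kernelshap(fit, X = X_small, bg_X = bg_X)
  print(s)
}
```

If this finishes quickly, the background sample can be grown step by step (for example from 20 to 100 rows) while watching the run time.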

@OlexiyPukhov
Author

OlexiyPukhov commented Oct 24, 2022

Thank you for the swift answer. I was able to solve the problem thanks to your ideas and some tinkering. Initially, my bg_X and X were the same data, both 27k x 140. I set X to a subset of 2000 rows and bg_X to 500 rows, with parallel processing enabled on 12 threads and exact = FALSE. Processing finished after about 5 minutes.

Will using smaller datasets for X and bg_X change my results relative to using my full dataset for both? I am able to use the full dataset for X and bg_X with TreeSHAP.
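
A setup along these lines might look like the sketch below. It is an assumption-laden illustration rather than the exact code used in this thread: the data and model are hypothetical, the worker count is set to 2 instead of 12 so it runs on most machines, and the `parallel = TRUE` argument together with the doFuture backend reflects the approach described in the package README as I understand it.

```r
set.seed(1)

# Hypothetical data and model standing in for the real ones
n <- 2000
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = runif(n))
df$y <- 1 + 2 * df$x1 - df$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2 + x3, data = df)

features <- c("x1", "x2", "x3")
X <- df[sample(n, 200), features]    # rows to explain
bg_X <- df[sample(n, 50), features]  # background sample

have_pkgs <- requireNamespace("kernelshap", quietly = TRUE) &&
  requireNamespace("doFuture", quietly = TRUE) &&
  requireNamespace("future", quietly = TRUE)

if (have_pkgs) {
  # Register the future-based foreach backend and start worker sessions
  doFuture::registerDoFuture()
  future::plan(future::multisession, workers = 2)

  # parallel = TRUE is assumed to be supported by the installed version
  s <- kernelshap::kernelshap(fit, X = X, bg_X = bg_X, parallel = TRUE)

  # Shut the workers down again
  future::plan(future::sequential)
}
```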

@mayer79
Collaborator

mayer79 commented Oct 25, 2022

Sweet! Thanks for testing. With 500 rows, your background data is still very large, but 5 minutes is quite acceptable. Using even larger X and bg_X will change the result, but only slightly. I usually decompose between 1000 and 2000 predictions, even with TreeSHAP.
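
A rough back-of-the-envelope calculation shows why the sizes of X and bg_X matter so much. For each on-off vector z, KernelSHAP needs a prediction for every row of X combined with every row of bg_X, so the number of rows scored per z grows with the product of the two sizes. The sketch below only counts those rows; it ignores the number of z vectors and any implementation details, and the helper `rows_per_z()` is a hypothetical name introduced here.

```r
# Rows the model must score per on-off vector z: nrow(X) * nrow(bg_X)
rows_per_z <- function(n_x, n_bg) n_x * n_bg

full    <- rows_per_z(27000, 27000)  # original setup: X and bg_X both the full data
reduced <- rows_per_z(2000, 500)     # setup that finished in about 5 minutes
small   <- rows_per_z(2000, 100)     # same X with a background of 100 rows

full / reduced   # 729
reduced / small  # 5
```

Under this simple count, the original setup asks for 729 times as many predictions per z as the reduced one, which is consistent with the progress bar appearing stuck.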

TreeSHAP is indeed orders of magnitude faster than KernelSHAP, but it can only be used for tree-based models, while KernelSHAP works for all model classes. For trees, I would not use KernelSHAP in practice.

By default, in your case, exact = FALSE, so you don't need to specify it explicitly.

@OlexiyPukhov
Author

Since I'm doing research, a longer processing time is acceptable, but I suppose TreeSHAP would be preferred for production. Thanks again!
