Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

how to config max number of threads? #241

Closed
tdhock opened this issue Jun 12, 2023 · 23 comments · Fixed by #720
Closed

how to config max number of threads? #241

tdhock opened this issue Jun 12, 2023 · 23 comments · Fixed by #720
Assignees
Labels
documentation Improvements or additions to documentation enhancement New feature or request

Comments

@tdhock
Copy link

tdhock commented Jun 12, 2023

hi @sorhawell is there a function that I can call to tell polars the max number of threads I want it to use?
something like data.table::setDTthreads(4) if I only want to use 4 threads max.
Thanks!
I searched the documentation for "thread" and got a few hits but nothing that helped.

@sorhawell
Copy link
Collaborator

sorhawell commented Jun 12, 2023

Good question

The default worker pool size is likely the double of number of cores. I did not know it was possible to modify, but I found this function in py-polars docs and learned you can use Sys.setenv("POLARS_MAX_THREADS"=3) before loading library polars

polars:::test_get_pool_size() is not published yet but it shows that envvar option seems to work

> polars:::test_get_pool_size()
[1] 8

Restarting R session...

> library(polars)

Restarting R session...

> Sys.setenv("POLARS_MAX_THREADS"=3)
> library(polars)
> polars:::test_get_pool_size()
[1] 3

The thread pool could be used I guess to limit cores, but it is not completely exact:

on top of this r-polars always reserves one extra thread for running the R interpreter. If not injecting R code into polars with $map() or $apply(), that thread will mostly be sleeping.

extendr also uses a thread for catching rust panics , but that should be very light weight

rust-polars could potentially here and there spawns a few threads for special tasks

@tdhock
Copy link
Author

tdhock commented Jun 13, 2023

Hi again, thanks for the quick response. That is helpful to know that we can set an environment variable to set the max number of threads before polars is loaded, but is there a way to change it after it has already been loaded/attached via library(polars)? (similar to data.table::setDTthreads)
Also that is beneficial to know that a thread is reserved for running the R interpreter. So does that mean the minimal number for POLARS_MAX_THREADS is 0? or is it 1?
Would be nice to see some documentation about this topic at some point.
For context I am trying to do some benchmarking of single threaded vs multi-threaded csv reading code.
Also, is there a polars function for csv writing? I did not see docs about that.

@sorhawell
Copy link
Collaborator

sorhawell commented Jun 13, 2023

So does that mean the minimal number for POLARS_MAX_THREADS is 0? or is it 1?

POLARS_MAX_THREADS could have been named POLARS_RAYON_POOL_SIZE. I don't know what would happen on 0. Either 0 is not allowed or rayon hang I guess. Rayon is a rust crate which implements a thread pool. 1 would be lowest meaningful value I guess.

in the rust-polars for the csv reader and other readers it is possible to specifically decide on threads. Likely because these readers come from external crates with their own threadding implementation. These are likely not bounded by POLARS_MAX_THREADS. I will look into if r-polars could easily expose these reader thread settings.

Let's also keep this issue open until these thread findings have been added to docs.

@etiennebacher etiennebacher added the question Further information is requested label Jun 28, 2023
@eitsupi
Copy link
Collaborator

eitsupi commented Oct 7, 2023

This issue may be necessary for the CRAN release.

The CRAN policy requires that the number of logical threads used on a CRAN machine be limited to 2. (For example, see this thread on the R package devel mailing list https://stat.ethz.ch/pipermail/r-package-devel/2023q3/009454.html)

When submitting polars to CRAN, we may need to limit the number of threads, e.g. for tests.

@eitsupi eitsupi added documentation Improvements or additions to documentation enhancement New feature or request and removed question Further information is requested labels Oct 7, 2023
@eitsupi eitsupi added this to the 1st CRAN Release milestone Oct 7, 2023
@eitsupi

This comment was marked as outdated.

@eitsupi

This comment was marked as resolved.

@eitsupi eitsupi added the help wanted Extra attention is needed label Dec 23, 2023
@eitsupi
Copy link
Collaborator

eitsupi commented Dec 24, 2023

Useful resources:

It's really surprising that CRAN is forcing package developers to set the default to 2.
Polars currently does not have a way to change the number of threads after loading, so we may have to opt in to something other than 2 any time use r-polars if we submit to CRAN.

@eitsupi eitsupi removed the help wanted Extra attention is needed label Dec 26, 2023
@eitsupi eitsupi self-assigned this Dec 26, 2023
@eitsupi
Copy link
Collaborator

eitsupi commented Dec 26, 2023

I came up with a solution of the default thread number: introduce a new feature flag not_set_max_threads in the Rust library, which automatically sets the POLARS_MAX_THREADS=2 environment variable if that flag is not set and the POLARS_MAX_THREADS environment variable is empty via the .onLoad() function.
If this feature is explicitly selected and built, POLARS_MAX_THREADS is not set and the maximum number of cores is defaulted.

Something like:

if (!check_feature("not_set_max_threads") && Sys.getenv("POLARS_MAX_THREADS") == "") Sys.setenv(POLARS_MAX_THREADS = 2) 

@sorhawell @etiennebacher What do you think about this?

@etiennebacher
Copy link
Collaborator

Shouldn't the if condition be check_feature("not_set_max_threads") && Sys.getenv("POLARS_MAX_THREADS") == ""?

We want to set the number of threads to 2 if not_set_max_threads is TRUE, right?

@eitsupi
Copy link
Collaborator

eitsupi commented Dec 26, 2023

In my proposal not_set_max_threads is not included in the default feature.
This means the following:

[features]
default = []
not_set_max_threads = []

In other words, for builds that do not explicitly include not_set_max_threads, the number of threads is automatically set to 2 unless the number of threads is specified by POLARS_MAX_THREADS.
Conversely, for builds that include not_set_max_threads (GitHub build), the number of threads is automatically determined by Polars (current behave).

if (!check_feature("not_set_max_threads") && Sys.getenv("POLARS_MAX_THREADS") == "") Sys.setenv(POLARS_MAX_THREADS = 2) 

@sorhawell
Copy link
Collaborator

sorhawell commented Dec 27, 2023

Hi I was little confused of "not_set_max_threads", I think of it as "do_not_limit_max_threads". I hope I got it right.

"not_set_max_threads" on R-universe channel?

As I read the email exchange, the CRAN team was given plenty of good reasons from smart people, why not to enforce such a limit - but it is what it is. I recommend also running with "not_set_max_threads" for R-universe, as that is our main recommended channel to install from. I guess data.table would neither limit thread for any non CRAN release channel.

There are some other threads, we may need to deal with eventually

The POLARS_MAX_THREADS likely only apply to the polars thread pool size. Our own rust code also spawns threads. When ever using collect() an extra thread will be spawned to interface polars multiple threads with the single R session. It will sleep a lot though. Our in_background can spawn extra threads too.
Maybe our interface code could be rewritten to be limited by the polars threadpool also. Not sure. Maybe it could cause some deadlock situation, because polars workers-thread must wait and sleep for the R-handler-thread and vice versa. Other rust crates may freely spawn threads too.

suggestion how about not CRAN is the general default?

I understand CRAN refuses to have any "IS_CRAN" environment variable, and therefore everybody else just have to hardcode CRAN policies to the default and revert backwards with environment variables. This opens up for some user/dev foot guns. Just a suggestion. How about using an automated CRAN release script/macro which alters the hardcoded defaults of any code bundle submitted to CRAN? Then CRAN can have it in one way, alongside a repository that retain the most sensible defaults. Of course there are also many wise / non-contoversial CRAN policies, which should just be adapted immediately everywhere.

@sorhawell
Copy link
Collaborator

@eitsupi it is you who put a lot of hard work and thought into this. If you see this as the best way, then I applaud that too.

@eitsupi
Copy link
Collaborator

eitsupi commented Dec 27, 2023

Thank you for your reply.

I strongly oppose patching source code submitted to CRAN.
That's because people who want that feature won't necessarily install it from CRAN, and it's very likely to mislead people who read the source code on GitHub.

In other words, the issue of whether the default number of threads is unlimited or 2 is irrelevant to CRAN, and I think users should be able to choose between them no matter where they install from.
Since R packages don't have a mechanism like feature flags, I just suggested that it could be easily implemented using Rust's feature flags.

@eitsupi
Copy link
Collaborator

eitsupi commented Jan 13, 2024

Once #693 is merged, users will be able to freely select features during installation, making it easier to add features.

@eitsupi
Copy link
Collaborator

eitsupi commented Jan 20, 2024

Implemented in #720.
The other crate had a function called "disable_*", so I named the feature as disable_auto_limit_max_threads.
Of course this is enabled on binaries built on GitHub (includes R-universe).

@tdhock
Copy link
Author

tdhock commented Jan 29, 2024

great, thanks!
related: can you please document some way to query the number of threads that polars is using?
I guess it is possible using polars:::test_get_pool_size() ?

@tdhock
Copy link
Author

tdhock commented Jan 29, 2024

btw, gsoc'24 is coming up, so if you want to participate again with R project, please add an idea to our wiki, https://github.com/rstats-gsoc/gsoc2024/wiki/table%20of%20proposed%20coding%20projects

@etiennebacher
Copy link
Collaborator

I guess it is possible using polars:::test_get_pool_size() ?

pl$threadpool_size() is what you're looking for (polars:::test_get_pool_size() was removed)

btw, gsoc'24 is coming up

I won't be able to participate, maybe @eitsupi ?

@sorhawell
Copy link
Collaborator

@tdhock I was blessed with one more child this year, and somehow any spare time went up in smoke for me :) It was pleasure to mentor @Sicheng-Pan last year!

@tdhock
Copy link
Author

tdhock commented Jan 29, 2024

Sicheng may consider mentoring this year

@eitsupi
Copy link
Collaborator

eitsupi commented Jan 30, 2024

I don't think I'm good enough to be a mentor here.

@Sicheng-Pan
Copy link
Collaborator

I would like to help, but I'm sure that I am way less experienced than anyone here

@tdhock
Copy link
Author

tdhock commented Jan 30, 2024

well I'm sure you know more about polars than I do! I could co-mentor again, if you need somebody with basic R expertise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request
Projects
None yet
5 participants