Random seed is replicated across child processes #9650
If you copy a `RandomState` into each child, they will all produce identical streams. Any correct method requires you to initialize a distinct seed in each process.
Worth noting that the built-in `random` module does get reseeded in child processes, so it does not show this problem.
Huh, I thought we had an issue open for this already, but I can't find it. Related: #9248. This is very difficult to fix in Python versions before 3.7; in 3.7 it's not too bad using `os.register_at_fork`.
For some reason, when using `multiprocessing.Pool`, one gets the desired behavior using Python's `random` but not using NumPy's! This behaves differently than the joblib example of @bashtage. Here is my code:
Output on Python 3.6.2, NumPy 1.13.1:
The difference is that NumPy is using its own internal state, which `multiprocessing` does not reseed on fork, while the stdlib `random` module is explicitly reseeded in each child. It is very difficult to enforce the behavior you are seeing from `random` for `RandomState`.
See this answer for details and the complexity it introduces.
Here is a minimal example using a shared `RandomState` served by a `multiprocessing` manager:

```
from joblib import Parallel, delayed
import numpy as np
from multiprocessing.managers import BaseManager

class SharedRandomState(object):
    random_state = np.random.RandomState(1234)
    def random_sample(self):
        return self.random_state.random_sample()

class MyManager(BaseManager):
    pass

MyManager.register('SharedRandomState', SharedRandomState)

def worker(rs):
    return rs.random_sample()

if __name__ == '__main__':
    manager = MyManager()
    manager.start()
    shared = manager.SharedRandomState()
    res = Parallel(n_jobs=4)(delayed(worker)(shared) for i in range(10))
    print('From multiple workers')
    print(np.array(res))
    # Verify
    random_state = np.random.RandomState(1234)
    print('From a single instance')
    print(random_state.random_sample(size=10))
```

Output:

```
From multiple workers
[ 0.19151945 0.62210877 0.43772774 0.78535858 0.77997581 0.27259261
  0.27646426 0.80187218 0.95813935 0.87593263]
From a single instance
[ 0.19151945 0.62210877 0.43772774 0.78535858 0.77997581 0.27259261
  0.27646426 0.80187218 0.95813935 0.87593263]
```
Upon further examination, Python 3.6's default behavior of reseeding in the child processes was hard-coded in the _launch method of multiprocessing.popen_fork.Popen and is specific to the built-in random module. Starting from Python 3.7 this was changed to use the new os.register_at_fork mechanism. So it may be difficult to fix on older versions, but surely we can reseed on Python 3.7 using the same mechanism as @njsmith suggested. I'm willing to do the work if one of the core developers is willing to accept the pull request.
The hard question is what to do about explicitly seeded generators. Reseeding in child processes is not a good way to generate multiple streams of random numbers, since it is not easy to verify the statistical properties of the resulting generators. It would be intellectually better to use a generator that supports multiple streams -- this would probably require a new version of `RandomState`.
In 3.5:

```
import random
from joblib import Parallel, delayed

def f(v):
    random.random()
    return random.getstate()

res = Parallel(n_jobs=4)(delayed(f)(i) for i in range(10))
for i in range(len(res) - 1):
    print('624 element state identical? {0}'.format(res[i][1][:-1] == res[i+1][1][:-1]))
    print('Position identical? {0}'.format(res[i][1][-1] == res[i+1][1][-1]))
    print('Positions: {0}, {1}'.format(res[i][1][-1], res[i+1][1][-1]))
```

```
624 element state identical? True
Position identical? False
Positions: 44, 46
624 element state identical? True
Position identical? False
Positions: 46, 48
624 element state identical? True
Position identical? False
Positions: 48, 50
624 element state identical? True
Position identical? False
Positions: 50, 52
624 element state identical? True
Position identical? False
Positions: 52, 54
624 element state identical? True
Position identical? False
Positions: 54, 56
624 element state identical? True
Position identical? False
Positions: 56, 58
624 element state identical? True
Position identical? False
Positions: 58, 60
624 element state identical? True
Position identical? False
Positions: 60, 62
```
@mosco You are right -- my examples were too trivial.
We should re-seed each unseeded generator on fork:

```
# no initial seed, re-seeded on fork
r1 = np.random.RandomState()
# explicit seed given, NOT re-seeded on fork
r2 = np.random.RandomState(0)
# seed was given, but then randomized; re-seeded on fork
r3 = np.random.RandomState(0); r3.seed()
```
That is probably reasonable, since when you use a default seed you don't really know what you get anyway. I took a look at the implementation.
If I write

```
import os
import numpy as np

r = np.random.RandomState()
z = r.random_sample()
pid = os.fork()
```

then why should I expect the internal state of `r` to change in the child
process? It also sounds a bit dangerous that it would get reseeded based
on the *time*.
It is normally reseeded with fresh entropy from the OS; the time is only a fallback.
Ok, indeed the time-based reseeding was the fallback behavior. I guess this is a purity-vs-practicality conflict: the current behavior in NumPy is fairly "pure", but on the other hand the multiprocessing caveat is easy to overlook.
You can't exhaust `/dev/urandom`.
The implicit global `RandomState` is a somewhat different case. But really, I don't understand caring about reproducible results and wanting to use the `np.random.*` convenience functions at the same time.
@rkern: So it sounds like the places where we disagree are:
In general, the intuition here is that unseeded generation acts like a totally unpredictable and non-deterministic source of bits, and seeded generation acts like a deterministic but well-distributed source of bits. As for how, it's easy (though not particularly pretty): keep a global registry of the relevant states and reseed them from an at-fork hook. ...in any case, we shouldn't let these details derail us from the core feature of making sure that unseeded `RandomState` objects get reseeded on fork.
...I guess I could even be argued into the position of just re-seeding everything on fork.
The issue is when you foil code that is explicit about wanting duplicated state, with a setup of

```
r1 = RandomState()
r2 = r1.copy()  # or copy.copy(r1), whatever this is called
# or
r1 = RandomState(some_seed)
r2 = RandomState(some_seed)
```

and then code that evolves from single-process to multiprocess. Ideally, the behaviour would not change here in either of the two setup cases. Perhaps only generators that were never explicitly seeded should be affected.
@eric-wieser: is there ever any reason to do that though? The case that I can think of where you might have something like this is when someone explicitly saves the seed at the beginning of a run so that they can reproduce it later if necessary, then forks and fails to save the seeds in child processes. It's tricky to do this correctly right now though, and "your calculations are subtly broken" is much worse than "everything worked correctly but you can't reproduce an earlier run". Also note that such code is already broken on Windows (no `fork`).
Honestly, I think that case is rare. But I didn't say that every state should be reseeded, only the hidden global one. I don't think that you should reseed explicitly-created `RandomState` objects.
@mosco @bashtage @njsmith @rkern I think I have a very similar issue. I've been reading the posts here and still cannot wrap my head around how I should resolve this. I want to share the NumPy random state of a parent process with a child process. I've tried several approaches without success. Note: I use Python 3.5.
Don't worry about understanding this issue. The issue being discussed here isn't relevant to #10769. This one is only talking about how random state propagates across `fork()`.
@njsmith I support your standpoint that forks of an unseeded `RandomState` object should be reseeded. The usual scenario I have come across involves parallel Monte Carlo simulations using the `multiprocessing` module. Simulation code that produces valid results when run sequentially silently produces invalid results as soon as it is passed to `multiprocessing.Pool`. In my understanding, seeding `RandomState` with `None` already means "give me an arbitrary stream", so reseeding on fork would not break any reasonable expectation.
I just came across this thread and I feel it would help to describe how the SystemVerilog language describes random stability. Anyone interested to read more can have a look at IEEE Std 1800-2017, section 18.14.
In other words: when a child thread is created, its RNG is seeded with the next random value drawn from its parent thread's RNG, so every thread's stream is determined by spawning order alone.
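Translated into NumPy terms, the SystemVerilog rule might look like this sketch (the helper name is hypothetical): each child generator is seeded from the next draw of its parent's generator, so the streams depend only on spawning order, not execution order.

```python
import numpy as np

def spawn_child_rng(parent_rs):
    # derive the child's seed deterministically from the parent's stream
    child_seed = parent_rs.randint(0, 2**32, dtype=np.uint32)
    return np.random.RandomState(child_seed)

parent = np.random.RandomState(1234)
children = [spawn_child_rng(parent) for _ in range(3)]
```

Re-creating the parent with the same seed and spawning again reproduces exactly the same child streams.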
A simple workaround is to re-seed the generator in the workers:

```
import multiprocessing

import numpy as np

def worker(num):
    np.random.seed()
    for i in range(5):
        print(i, np.random.rand())

if __name__ == '__main__':
    with multiprocessing.Pool(4) as pool:
        pool.map(worker, range(4))
```
OK, thanks for the clarification |
We are currently discussing a `SeedSequence`-like mechanism and how spawning child seeds will work. But I suppose this issue would be something like automatically spawning a new child seed on fork for the "default" generator that we use for the `np.random.*` functions. xref: gh-13685. EDIT: Of course that would all only be necessary if you want a reproducible stream, in which case the fork would also need a counter or so to get a reliably identical stream. But I am not sure that can even be done reliably, so maybe this is not necessary at all, and instead it should just reseed with fresh system entropy or raise an error (upon usage?).
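With the machinery that was eventually released (NumPy >= 1.17), spawning reproducible child streams looks roughly like this (the seed value is illustrative):

```python
from numpy.random import SeedSequence, default_rng

ss = SeedSequence(20210101)              # illustrative seed
child_seeds = ss.spawn(4)                # four independent child sequences
rngs = [default_rng(s) for s in child_seeds]
draws = [rng.random() for rng in rngs]   # distinct but reproducible streams
```

Re-creating the same `SeedSequence` and spawning again reproduces exactly the same child streams, which is what a reproducible fork would need.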
I think this is less of an issue for the new `Generator` API, since there is no global instance there.
Now that we discussed this again today, I propose to do the following:
I don't think it is this easy, since I think this issue will persist with the new generator infrastructure as well. This should stay open until 3.7 is the minimum version, at which point it can be relatively easily addressed. The fix for the generator infrastructure will automatically fix `RandomState` as well, since the fix will need to be applied to the basic generators.
@bashtage what do you mean by "easily fixed" in 3.7? I think further up there was the idea of a warning; if we can give a warning on 3.7+ we reach a huge chunk of users and probably all library authors (assuming it is easy to avoid said warning, and that there is only a very small chance of giving it spuriously).
This is not an issue with the new generator infrastructure because there is no shared global instance as per your previous comment. |
I suppose I was thinking of someone who gets bitten by the current fork behavior without ever having seeded anything explicitly.
It would have to be some kind of new API, some way of marking a generator as "reseed me on fork". And I'm just not thrilled about introducing APIs for users to deliberately make their code irreproducible. We've introduced good APIs for handling multiprocessing reproducibly; we should encourage their use.
Hmmm, probably there is simply not much we can do except warnings in the documentation? I just realized that if you have code which parallelizes using threads, it probably works fine due to the GIL/internal locking mechanism. But if you then change your code from threads to processes, it silently breaks. I am -1 on any "magic" API, but I would be +10 if someone invents a mechanism to give a warning when the user is probably doing something bad. However, I somewhat doubt there is a simple solution for giving such warnings. You could give a warning when unpickling, but there are probably good use cases where that would be annoying. (In case you are wondering, this came up because it is one of the most referenced open issues on NumPy, so it's definitely something people run into.)
Perhaps this issue is highly referenced, but it's not clear to me that entropy-reseeding anything is actually the sought-after solution in those reports. Most of them do want reproducibility, so the required intervention is to use the mechanisms we introduced in the new framework.
There is a lot to unpack in the discussion here, but if I'm understanding things correctly, it seems like the concerns from the initial issue are well addressed by the new `SeedSequence`/`Generator` infrastructure.
This issue got me too while using a library that creates worker processes with `fork()`, and it caused a rather complicated-to-track-down bug. NumPy random state is preserved across fork; this is absolutely not intuitive. I think NumPy should reseed itself per process. This is certainly what I'd expect, and it likely follows the principle of least surprise: NumPy random in a new process should act like NumPy random in a new interpreter, i.e. it auto-seeds. The current behavior makes about as much sense as using `seed(0)` by default. I think most people using the library would expect different random numbers always, unless they do something explicit to manage the state.
@aizvorski
When spawning child processes using the `multiprocessing` module, it appears that all child processes share the parent's random seed.
This creates a subtle and difficult-to-detect bug in the common use case of embarrassingly parallel multicore Monte Carlo simulations. In such simulations the different processes will generate the same data, so the final averaged result will be much less accurate (i.e. have higher variance) than expected.
Many people seem to have been bitten by this issue (and surely many more are unaware of a silent bug in their simulations), e.g.
https://stackoverflow.com/questions/12915177/same-output-in-different-workers-in-multiprocessing
http://forum.cogsci.nl/index.php?p=/discussion/1441/solved-numpy-random-state-seems-to-repeat-across-multiple-os-runs
Note that Python's `random` module does not suffer from this problem.