
Random seed is replicated across child processes #9650

Open
mosco opened this issue Sep 4, 2017 · 45 comments
Labels
15 - Discussion component: numpy.random triaged Issue/PR that was discussed in a triage meeting

Comments

@mosco

mosco commented Sep 4, 2017

When spawning child processes using the multiprocessing module, it appears that all child processes share the parent's random seed.

This creates a subtle and difficult-to-detect bug in the common use case of embarrassingly parallel multicore Monte Carlo simulations. In such simulations the different processes will generate the same data, so the final averaged result will be much less accurate (i.e. have higher variance) than expected.

Many people seem to have been bitten by this issue (and surely, many more are unaware of a silent bug in their simulation). e.g.
https://stackoverflow.com/questions/12915177/same-output-in-different-workers-in-multiprocessing
http://forum.cogsci.nl/index.php?p=/discussion/1441/solved-numpy-random-state-seems-to-repeat-across-multiple-os-runs

Note that Python's random module does not suffer from this problem.

@bashtage
Contributor

bashtage commented Sep 4, 2017

If you copy a RandomState you get that RandomState. That means the state -- not the seed -- is the same. The state is what matters for determining the sequence of random numbers.

Any correct method requires you to initialize a RandomState within your child processes. Better is to use the improved RandomState here, which explicitly supports generating 1000s of guaranteed distinct streams using either a stream-based generator (PCG) or jump-ahead to advance the state as if many draws (2**128) had been made.
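The first approach mentioned above -- initializing a fresh RandomState inside each worker -- can be sketched as follows. The per-worker integer seeds are purely illustrative; adjacent MT19937 seeds carry no guarantee of statistical independence, which is why the stream-based/jump-ahead generators are preferable.

```python
import multiprocessing

import numpy as np

def worker(seed):
    # Build a private RandomState per process from a distinct seed,
    # so forked workers do not share the parent's state.
    rs = np.random.RandomState(seed)
    return rs.random_sample()

if __name__ == '__main__':
    with multiprocessing.Pool(4) as pool:
        # Hand each worker a different seed explicitly.
        print(pool.map(worker, range(4)))
```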

@bashtage
Contributor

bashtage commented Sep 4, 2017

Worth noting that the random module also doesn't automatically reseed.

import random
from joblib import Parallel, delayed
def gs(v):
    return random.getstate()

res = Parallel()(delayed(gs)(i) for i in range(10))

for r in res:
    print(r==res[0])
    
True
True
True
True
True
True
True
True
True
True

@njsmith
Member

njsmith commented Sep 5, 2017

Huh, I thought we had an issue open for this already, but I can't find it.

Related: #9248

This is very difficult to fix in Python versions before 3.7. In 3.7 it's not too bad using os.register_at_fork, but no-one has done the work yet.
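A minimal sketch of the os.register_at_fork approach (assuming Python 3.7+). This reseeds only the hidden global RandomState via np.random.seed(); it says nothing about user-created instances:

```python
import os

import numpy as np

# Register a handler that runs in the child after every fork and
# reseeds numpy's global RandomState from fresh OS entropy.
if hasattr(os, "register_at_fork"):
    os.register_at_fork(after_in_child=lambda: np.random.seed())
```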

@mosco
Author

mosco commented Sep 5, 2017

For some reason, when using multiprocessing.Pool, one gets the desired behavior using Python's random but not using NumPy's!

This behaves differently than the joblib example of @bashtage. Here is my code:

import multiprocessing
import random
import numpy.random

def python_random(whatever):
    return random.random()

def numpy_random(whatever):
    return numpy.random.random()

pool = multiprocessing.Pool(10)
print('Python random:', pool.map(python_random, [0]*10))
print('Numpy random:', pool.map(numpy_random, [0]*10))

Output on Python 3.6.2, Numpy 1.13.1:

Python random: [0.3232979890381721, 0.3440537969129853, 0.8325653603083111, 0.6378306617495579, 0.5425156816989956, 0.2300538134522857, 0.27904636944778316, 0.45440811632035105, 0.6899393498838224, 0.9521238318896579]

Numpy random: [0.7768863343872144, 0.7768863343872144, 0.7768863343872144, 0.7768863343872144, 0.7768863343872144, 0.7768863343872144, 0.7768863343872144, 0.7768863343872144, 0.7768863343872144, 0.7768863343872144]

@bashtage
Contributor

bashtage commented Sep 5, 2017

The difference is that NumPy is using n copies of a RandomState while random is using a single instance. This means that when you are drawing randoms from random you will effectively be in single-threaded mode, since all workers are using the same state. This will limit the benefit of multiprocessing if random number generation is an important part of your runtime. NumPy, OTOH, should have no restrictions across workers, so you will get the maximum benefit.

It is very difficult to enforce the behavior you are seeing from random using NumPy, AFAICT, since sharing a single complex object in multiprocessing is challenging (see the remote manager section in the multiprocessing docs).

@bashtage
Contributor

bashtage commented Sep 5, 2017

See this answer for details and the complexity it introduces.

@bashtage
Contributor

bashtage commented Sep 5, 2017

Here is a minimal example using a shared RandomState

from joblib import Parallel, delayed
import numpy as np
from multiprocessing.managers import BaseManager


class SharedRandomState(object):
    random_state = np.random.RandomState(1234)
    def random_sample(self):
        return self.random_state.random_sample()

class MyManager(BaseManager):
    pass

MyManager.register('SharedRandomState', SharedRandomState)

def worker(rs):
    return rs.random_sample()

if __name__ == '__main__':
    manager = MyManager()
    manager.start()
    shared = manager.SharedRandomState()
    res = Parallel(n_jobs=4)(delayed(worker)(shared) for i in range(10))
    print('From multiple workers')
    print(np.array(res))
    
    # Verify
    random_state = np.random.RandomState(1234)
    print('From a single instance')
    print(random_state.random_sample(size=10))

From multiple workers
[ 0.19151945  0.62210877  0.43772774  0.78535858  0.77997581  0.27259261
  0.27646426  0.80187218  0.95813935  0.87593263]
From a single instance
[ 0.19151945  0.62210877  0.43772774  0.78535858  0.77997581  0.27259261
  0.27646426  0.80187218  0.95813935  0.87593263]

@mosco
Author

mosco commented Sep 5, 2017

Upon further examination, Python 3.6's default behavior of reseeding in the child processes was hard-coded in the _launch method of multiprocessing.popen_fork.Popen and is specific to the built-in random module. Starting from Python 3.7 this was changed to use the new os.register_at_fork mechanism.

So it may be difficult to fix on older versions, but surely we can reseed on Python 3.7 using the same mechanism as @njsmith suggested. I'm willing to do the work if one of the core developers is willing to accept the pull request.

@bashtage
Contributor

bashtage commented Sep 5, 2017

The hard question for numpy.random is whether this is a behavior break since the stream of generated randoms will differ.

Reseeding in child processes is not a good way to generate multiple streams of random numbers since it is not easy to verify the statistical properties of the generators. It would be intellectually better to use a generator that supports multiple streams -- this would probably require a new version of RandomState due to the compat guarantee.
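The multiple-streams design described above later shipped in NumPy 1.17: for example, PCG64.jumped() returns a copy of the bit generator advanced very far ahead, giving non-overlapping streams from one seed. A sketch, assuming NumPy >= 1.17:

```python
import numpy as np

# Each jumped(i) copy starts at a point so far ahead of the previous
# stream that the sequences cannot overlap in practice.
bit_gen = np.random.PCG64(1234)
streams = [np.random.Generator(bit_gen.jumped(i)) for i in range(4)]
print([g.random() for g in streams])  # four distinct, reproducible values
```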

@bashtage
Contributor

bashtage commented Sep 5, 2017

In 3.5 random is not reseeding but is using a shared state across workers. This can be seen below, since the only thing that changes across draws is the position within the 624-element state vector that underlies the Mersenne Twister.

import random
from joblib import Parallel, delayed

def f(v):
    random.random()
    return random.getstate()

res = Parallel(n_jobs=4)(delayed(f)(i) for i in range(10))

for i in range(len(res)-1):
    print('624 element state identical? {0}'.format(res[i][1][:-1] == res[i+1][1][:-1]))
    print('Position identical? {0}'.format(res[i][1][-1] == res[i+1][1][-1]))
    print('Positions: {0}, {1}'.format(res[i][1][-1], res[i+1][1][-1]))
624 element state identical? True
Position identical? False
Positions: 44, 46
624 element state identical? True
Position identical? False
Positions: 46, 48
624 element state identical? True
Position identical? False
Positions: 48, 50
624 element state identical? True
Position identical? False
Positions: 50, 52
624 element state identical? True
Position identical? False
Positions: 52, 54
624 element state identical? True
Position identical? False
Positions: 54, 56
624 element state identical? True
Position identical? False
Positions: 56, 58
624 element state identical? True
Position identical? False
Positions: 58, 60
624 element state identical? True
Position identical? False
Positions: 60, 62

@bashtage
Contributor

bashtage commented Sep 5, 2017

@mosco You are right -- my examples were too trivial. joblib was finishing all 10 tasks before spawning a second process, so they all ran in the same process. Running more tasks clearly shows many different states after seed has been called.

@njsmith
Member

njsmith commented Sep 5, 2017

The hard question for numpy.random is whether this is a behavior break since the stream of generated randoms will differ.

We should re-seed each RandomState object that had seed=None, and only those.

# no initial seed, re-seeded on fork
r1 = np.random.RandomState()
# explicit seed given, NOT re-seeded on fork
r2 = np.random.RandomState(0)
# seed was given, but then randomized; re-seeded on fork
r3 = np.random.RandomState(0); r3.seed()

@bashtage
Contributor

bashtage commented Sep 5, 2017

That is probably reasonable since when you use a default seed you don't really know what you get anyway.

I took a look at random and it ignores provided seed values when used in multiprocessing, so values produced in multiple processes after random.seed(123) is called in the main process are not reproducible.

@pv
Member

pv commented Sep 5, 2017 via email

@bashtage
Contributor

bashtage commented Sep 5, 2017

It is normally reseeded with /dev/urandom or the Windows equivalent. In theory, if one tried to spawn a huge number of processes it might be possible to exhaust the pool, in which case there is a hash of time + pid which should be safe (but not necessarily very random, even after hashing).

@pv
Member

pv commented Sep 5, 2017 via email

@njsmith
Member

njsmith commented Sep 5, 2017

You can't exhaust /dev/urandom or equivalent. They are infinite and inexhaustible.

@rkern
Member

rkern commented Sep 6, 2017

The implicit global np.random.mtrand._rand instance of RandomState can be reseeded (because there are almost zero guarantees about its behavior across different processes), but other explicitly-created RandomState objects shouldn't be. I don't know how you'd try to find them in any case.

But really, I don't understand caring about their results and wanting to use the np.random functions unseeded. These two things don't go together.

@njsmith
Member

njsmith commented Sep 6, 2017

@rkern: So it sounds like the places where we disagree are:

  • I'm suggesting that if someone does np.random.seed(0) and then forks, then we should not re-seed; you're saying we should. I guess I don't care that much either way here, but if they've explicitly seeded then preserving that feels a little more natural to me? It also lets us avoid having to special case the global RandomState versus other ones, assuming we implement the next item. But I could be argued around; it seems unlikely anyone's depending on this on purpose.

  • If someone creates an unseeded RandomState object (RandomState()), and then forks, I'm suggesting the two copies should give different outputs; you're suggesting they shouldn't. I think we should do this to play it safe, because I can't imagine a case where someone does this on purpose (if you really want to replicate the streams, use an explicit seed!), and every time we don't reseed it creates the risk of subtle invisible wrong results and withdrawn publications and all that fun stuff. Obviously people should be more careful, but that seems like a severe punishment (and it may not be the person who wrote the bug who suffers).

In general, the intuition here is that unseeded generation acts like a totally unpredictable and non-deterministic source of bits, and a seeded generation acts like a deterministic but well-distributed source of bits.

As for how, it's easy (though not particularly pretty): keep a global WeakSet of all unseeded RandomState objects.

...in any case, we shouldn't let these details derail us from the core feature of making sure that unseeded np.random calls are different in parent and child processes. That's just dangerous.
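The WeakSet bookkeeping could look roughly like this. TrackedState is a hypothetical stand-in for illustration, not NumPy's RandomState:

```python
import weakref

# Register every generator created without an explicit seed, so a
# post-fork handler could reseed exactly those and no others.
_unseeded = weakref.WeakSet()

class TrackedState:
    def __init__(self, seed=None):
        self.seed = seed
        if seed is None:
            _unseeded.add(self)

a = TrackedState()    # unseeded: registered for post-fork reseeding
b = TrackedState(0)   # explicitly seeded: left alone
print(a in _unseeded, b in _unseeded)  # → True False
```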

@njsmith
Member

njsmith commented Sep 6, 2017

...I guess I could even be argued into the position of just re-seeding everything on fork, no exceptions. I'm having a lot of trouble thinking of any use case for generating two copies of the same stream in two different processes.

@eric-wieser
Member

eric-wieser commented Sep 6, 2017

The issue is when you foil code that is explicit about wanting to do so, with a setup of

r1 = RandomState()
r2 = r1.copy()  # or copy.copy(r1), whatever this is called
# or
r1 = RandomState(some_seed)
r2 = RandomState(some_seed)

And then code that evolves from single-process to multiprocess:

run_single_process(r1, r2)
# or
run_multi_process(r1, r2)  # each process only uses one of these

Ideally, the behaviour would not change here in either of the two setup cases. Perhaps __copy__ should simply wipe the "wasn't seeded" flag you're proposing, for both original and copy?

@njsmith
Member

njsmith commented Sep 6, 2017

@eric-wieser: is there ever any reason to do that though?

The case that I can think of where you might have something like this is when someone explicitly saves the seed at the beginning of a run so that they can reproduce it later if necessary, then forks and fails to save the seeds in the child processes. It's tricky to do this correctly right now though, and "your calculations are subtly broken" is much worse than "everything worked correctly but you can't reproduce an earlier run".

Also note that such code is already broken on Windows (no fork), or on Unix if you use one of the non-fork-based multiprocessing modes (mandatory with our MacOS wheels b/c they don't work with fork).

@rkern
Member

rkern commented Sep 6, 2017

Honestly, I think np.random.seed(0) should raise an AttributeError. :-)

But I didn't say that it should be reseeded, only that the hidden global RandomState instance is the only instance that ought to be reseeded if you are going to reseed on a fork. It's the only natural one to reseed, IMO. It's not a special case unless you strain to formulate the default as reseeding all unseeded RandomState instances.

I don't think that you should reseed explicitly-created RandomStates ever, even if unseeded. The reason is that there is a quite convenient, principled, and replication-responsible workflow that uses them: instantiate the unseeded RandomState and save it by pickling or whatever, then use the instance. To replicate the experiment, unpickle it and pass it through the code. If you reseed it on a fork, this will cause differences between the initial experiment and the replications. You don't know whether or not someone is reusing the RandomState inappropriately across a fork boundary. They could be doing something like the tree-jumping technique that you proposed earlier. I also don't want to constrain @bashtage's work on the next-gen RandomState which has generators that can take the inherited RandomState and just switch the streams.
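The replication workflow described above -- snapshot an unseeded RandomState before the run, then unpickle it to replay -- might look like this sketch:

```python
import pickle

import numpy as np

rs = np.random.RandomState()     # unseeded: initialized from OS entropy
snapshot = pickle.dumps(rs)      # save the full state before any draws

first_run = rs.random_sample(3)  # ... the experiment ...

replay = pickle.loads(snapshot)  # later: restore the saved state
assert np.allclose(first_run, replay.random_sample(3))
```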

@Amir-Arsalan

Amir-Arsalan commented Mar 20, 2018

@mosco @bashtage @njsmith @rkern I think I have a very similar issue and I've been reading the posts here and still cannot wrap my head around the issue on how I should resolve this. I want to share numpy random state of a parent process with a child process. I've tried using Manager to share the random state but still no luck. It does not matter if I move the for loop inside drawNumpySamples or leave it in main.py; I still cannot get different numbers and the random state is always the same. Could you please take a look at my question here and see if you can offer a solution? The only way I can get different random numbers is if I do np.random.seed(None) every time that I generate a random number, but this does not allow me to use the random state of the parent process, which is not what I want. Any help is greatly appreciated.

Note: I use Python 3.5

@rkern
Member

rkern commented Mar 20, 2018

Don't worry about understanding this issue. The issue being discussed here isn't relevant to #10769. This is only talking about how numpy.random behaves under multiprocessing when the default global RandomState is unseeded. You want to control the seed.

@RinaldoB

@njsmith I support your standpoint that forks of an unseeded RandomState object should be reseeded. The usual scenario I have come across involves parallel Monte Carlo simulations using the multiprocessing module. Simulation code that produces valid results when run sequentially silently produces invalid results as soon as it is passed to multiprocessing. I don't think this is obvious for most people using the numpy.random module. In my experience, with the current behavior, Monte Carlo runs using multiprocessing with an unseeded RandomState don't produce identical results. That's why this issue, which I consider broken/unwanted behavior, will often remain unnoticed.

In my understanding seeding RandomState with np.random.seed(None) is not required to generate random numbers. Documentation of np.random methods does not mention it (example).

@djoffe

djoffe commented May 10, 2018

I just came across this thread and I feel it would help to describe how the SystemVerilog language describes random stability. Anyone interested to read more can have a look at IEEE Std 1800-2017, section 18.14.

The RNG is localized to threads and objects. Because the sequence of random values returned by a thread or object is independent of the RNG in other threads or objects, this property is called random stability.

Thread stability. Each thread has an independent RNG for all randomization system calls invoked from that thread. When a new dynamic thread is created, its RNG is seeded with the next random value from its parent thread. This property is called hierarchical seeding.

Manual seeding. All noninitialization RNGs can be manually seeded. Combined with hierarchical seeding, this facility allows users to define the operation of a subsystem (hierarchy subtree) completely with a single seed at the root thread of the subsystem.

In other words: when a child thread is created, its RNG (RandomState) is seeded with the next value from its parent thread. This fixes the numpy issue where RandomState gives the same random numbers in different threads, and it also fixes the random module issue where the order threads are run in influences the final result (meaning reproducibility is broken).
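A naive NumPy translation of hierarchical seeding, for illustration only (NumPy's later SeedSequence.spawn is the principled version of this idea):

```python
import numpy as np

# Seed each child RNG from the next draw of its parent: children are
# deterministic given the root seed but produce different streams.
parent = np.random.RandomState(1234)
children = [np.random.RandomState(parent.randint(0, 2**32, dtype=np.uint32))
            for _ in range(4)]
print([c.random_sample() for c in children])
```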

@FrankRouter

A simple workaround is to re-seed the generator in the workers:

import multiprocessing

import numpy as np

def worker(num):
    np.random.seed()
    for i in range(5):
        print(i, np.random.rand())

if __name__ == '__main__':
    with multiprocessing.Pool(4) as pool:
        pool.map(worker, range(4))

@njsmith
Member

njsmith commented Jun 6, 2019

This issue is about automatically reseeding or calling jumped or whatever after fork. AFAICT there was some debate about exactly which circumstances to do this in, but general agreement that we should do it in a lot of common circumstances, esp. since it demonstrably bites people.

@njsmith njsmith reopened this Jun 6, 2019
@mattip
Member

mattip commented Jun 6, 2019

OK, thanks for the clarification

@seberg
Member

seberg commented Jun 6, 2019

We are currently discussing a SeedSequence-like spawn or similar addition, so that:

child_gen = Generator.spawn()
grandchild_gen = child_gen.spawn()

will work. But I suppose this issue would be something like automatically doing:

DefaultGenerator.replace_with_spawned()

for the "default" one that we use for the np.random.random_function() interface?

xref: gh-13685

EDIT: Of course that would all only be necessary if you want a reproducible stream. In which case the fork also would need a counter or so, to have a reliable identical stream. But I am not sure that can even be done reliably, so maybe this is not necessary at all, and instead it should just reseed with fresh system entropy or raise an error (upon usage?).

@bashtage
Contributor

bashtage commented Jun 6, 2019

I think this is less of an issue for Generator since there is no singleton. This is still an issue for RandomState which should, on fork, read a non-trivial amount of entropy (128 bits+) and seed using this new entropy.
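Reading 128 bits of fresh entropy and reseeding, as suggested, could be sketched like this; reseed_from_entropy is a hypothetical helper showing manually what a fork handler would do:

```python
import os

import numpy as np

def reseed_from_entropy(rs):
    # 16 bytes of OS entropy -> four uint32 words; an array seed uses
    # RandomState's init_by_array path, consuming all 128 bits.
    entropy = np.frombuffer(os.urandom(16), dtype=np.uint32)
    rs.seed(entropy)

rs = np.random.RandomState()
reseed_from_entropy(rs)  # e.g. call this at the top of each forked worker
```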

@rgommers
Member

Now that we discussed this again today, I propose to do the following:

  1. Add a .. warning:: to the docs at https://numpy.org/devdocs/reference/random/legacy.html#numpy.random.RandomState spelling out the problem
  2. Recommend in the Notes section to use https://numpy.org/devdocs/reference/random/parallel.html, and if authors need compatibility with older NumPy versions then use the solution for seeding as @rkern proposed in BUG: Fix random state bug multiscale_graphcorr scipy/scipy#11152 (comment)
  3. Close the rest of this as wontfix; we have better solutions for the new infrastructure.
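For reference, the pattern recommended by the parallel-generation docs looks like this (assuming NumPy >= 1.17; SeedSequence objects pickle cleanly, so they can be sent to workers):

```python
import multiprocessing

import numpy as np

def worker(seed_seq):
    # Each worker gets its own spawned SeedSequence, hence its own stream.
    rng = np.random.default_rng(seed_seq)
    return rng.random()

if __name__ == '__main__':
    root = np.random.SeedSequence(2020)
    with multiprocessing.Pool(4) as pool:
        print(pool.map(worker, root.spawn(4)))  # distinct, reproducible draws
```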

@bashtage
Contributor

I don't think it is this easy since I think this issue will persist with the new generator infrastructure as well. This should stay open until 3.7 is the minimum version, at which point it can be relatively easily addressed. The fix for the generator infrastructure will automatically fix the RandomState as well since the fix will need to be applied to the basic generators.

@seberg
Member

seberg commented Jan 29, 2020

@bashtage what do you mean by "easily fixed" in 3.7? I think further up was the idea of a warning; if we can give a warning on 3.7+ we reach a huge chunk of users and probably all library authors (assuming it is easy to avoid said warning, and that there is only a very small chance of giving it spuriously).

@rkern
Member

rkern commented Jan 29, 2020

This is not an issue with the new generator infrastructure because there is no shared global instance as per your previous comment.

@bashtage
Contributor

I suppose I was thinking of someone who gets default_rng() instance and then starts passing that around in multiprocessing. Is there any reasonable way of automatically getting distinct streams from workers in this scenario, other than preallocating the RNGs and passing distinct instances with different initialization ex-ante?

@rkern
Member

rkern commented Jan 30, 2020

It would have to be some kind of new API, some ReseedUponSpawnGenerator or something. Or a new factory that registers the new Generator via a WeakRef somewhere to reseed it upon a respawn. The user needs to explicitly state their intention to have this happen.

And I'm just not thrilled about introducing APIs for users to deliberately make their code irreproducible. We've introduced good APIs for handling multiprocessing reproducibly; we should encourage their use.

@seberg
Member

seberg commented Jan 30, 2020

Hmmm, probably simply not much we can do except warnings in the documentation?

I just realized that if you have code which parallelizes using threads, it probably works fine due to the GIL/internal locking mechanism? But if you then change your threadpool to a multiprocessing pool it could blow up?

I am -1 on any "magic" API, but I would be +10 if someone invents a mechanism to give a warning when the user is probably doing something that is bad. However, I somewhat doubt there is a simple solution for giving such warnings? You could give a warning when unpickling, but there are probably good use cases where that would be annoying.

(in case you are wondering, this came up because it is one of the most referenced open issues on NumPy, so it's definitely something people run into)

@rkern
Member

rkern commented Jan 30, 2020

Perhaps this issue is highly referenced, but it's not clear to me that entropy-reseeding anything is actually the sought-after solution in those reports. Most of them do want reproducibility, so the required intervention is to use the mechanisms we introduced in the new framework.

@rossbar
Contributor

rossbar commented Jan 31, 2020

There is a lot to unpack in the discussion here, but if I'm understanding things correctly, it seems like the concerns from the initial issue are well addressed by the new random infrastructure (e.g. SeedSequence, no "hidden" global instance of default_rng, etc.) and the associated docs that explicitly highlight the parallel use-case. I like Ralf's suggestion - in my opinion, it is important to point users towards the new tools/docs that explicitly address the parallel use-case.

@aizvorski

aizvorski commented Jul 14, 2020

This issue got me too while using a library that creates worker processes with fork(), and it caused a rather complicated-to-track-down bug. NumPy's random state is preserved across fork, which is absolutely not intuitive. I think numpy should reseed itself per process. This is certainly what I'd expect, and it likely follows the principle of least surprise: numpy random in a new process should act like numpy random in a new interpreter, which auto-seeds. The current behavior makes about as much sense as using seed(0) by default. I think most people using the library would expect different random numbers always, unless they do something to explicitly manage the state.

@bashtage
Contributor

@aizvorski RandomState is now unchangeable except for bug fixes that prevent compilation, so I don't think it will be possible to make this change there. The new random Generator doesn't have a singleton instance, so this is somewhat less of a concern. I think there was some thought about whether it would be possible to have something (e.g., a ForkableGenerator) that would automatically reseed on fork, but I don't think it went anywhere yet.
