Join GitHub today
GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together.Sign up
Enhance random generator repeatibility #14
The docs mention being able to call the seed() method so you can use a generated dataset as part of a unit test.
Due to the way Faker uses the random module, this usecase is a bit fragile. Any modification to the data requested, or any outside uses of the random module during generation will diverge the dataset.
Here is a quick script demonstrating the problem along with a couple of potential solutions:
import random from faker import Faker fake = Faker() # initial run fake.seed(1234) print fake.name() print fake.name() print fake.name() # repeated run with same data fake.seed(1234) print fake.name() print fake.name() print fake.name() # adding new fake calls prevent us from getting the same names we had originally fake.seed(1234) print fake.name(), fake.email() print fake.name(), fake.email() print fake.name(), fake.email() # One way is to implement a preserve/restore mechanism so that the user can get back to the previous trail of data fake.seed(1234) print fake.name() r = random.getstate() print fake.email() random.setstate(r) print fake.name() r = random.getstate() print fake.email() random.setstate(r) print fake.name() r = random.getstate() print fake.email() random.setstate(r) # A similar problem arises if the program using faker happens to use a non-instance random call during generation. # The best way to prevent this issue is to have faker use an instance of random rather than the module version. # If faker used an instance version of random, you could also resolve the original problem by using different faker instances fake.seed(1234) fake2 = Faker() fake2.seed(1234) print fake.name(), fake2.email() print fake.name(), fake2.email() print fake.name(), fake2.email()
Thank you very much for the very good explanation.
I spent quite a bit of time to find an easy way to correct this behavior, but I think we will have to remove all the classmethod/staticmethod.
Then change the faker.generators.Generator:
and a lot of other provider methods.
Anyway I think it will be difficult to use correctly the
Currently, this method seems to be buggy.
I don't think that it will be that painful to try to clean up this issue. I wish I wasn't so rusty with Python app development outside a small single script, or I'd try to put together a pull request for you, but I fear it wouldn't be adequate unless I put more time into it than I have available.
I definitely agree that Generator needs to get an instance of random rather than using the module static. That is step 0. Even just doing that one simple change would remove most of the problem and enable the approach I put at the bottom of my example where people could just spawn more than one faker object when they want to keep the seed stable. I suggest testing against regression and then start by just releasing that.
If that goes well, the next thing I'd recommend would be adding two more methods to generator to allow checkpointing the random instance via getstate and setstate. The end user wouldn't even have to store the state themselves, you would probably be fine introducing it as a type of transactional system where they could call a method like "savestate" before they start doing "weird" stuff, then call "restorestate" when they want to get back to normal. save would call getstate and store the state in a instance variable, and restorestate would just call setstate to update the random instance back to where it was.
As for your worry about some methods using random a variable number of times, I'd be happy to look at it if you could point me at one, but I suspect they might not be a problem due to the pseudo nature of the prg. Here is why I think that, let me know if I'm wrong.
Since all of the randomness in methods a, b, and c are based on the same rng, the random instance stored inside Generator, if a program calls a, then b, then c using the same initial seed, it should still get the same output. The random number between 1 and 5 that b used to determine how many strings to output will always be the same given the same initial seed.
Just to jump in with some initial assessment. @deinspanjer is correct. Checking the current state of the seed would be optimal. Based on my assessment, it seems that you have one single instance of the Generator object per factory. Thus, I have a possible simpler idea...