New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance random generator repeatibility #14

Closed
deinspanjer opened this Issue Jan 22, 2014 · 4 comments

Comments

Projects
None yet
4 participants
@deinspanjer

deinspanjer commented Jan 22, 2014

The docs mention being able to call the seed() method so you can use a generated dataset as part of a unit test.

Due to the way Faker uses the random module, this usecase is a bit fragile. Any modification to the data requested, or any outside uses of the random module during generation will diverge the dataset.

Here is a quick script demonstrating the problem along with a couple of potential solutions:

import random
from faker import Faker
fake = Faker()

# initial run
fake.seed(1234)
print fake.name()
print fake.name()
print fake.name()

# repeated run with same data
fake.seed(1234)
print fake.name()
print fake.name()
print fake.name()

# adding new fake calls prevent us from getting the same names we had originally
fake.seed(1234)
print fake.name(), fake.email()
print fake.name(), fake.email()
print fake.name(), fake.email()

# One way is to implement a preserve/restore mechanism so that the user can get back to the previous trail of data
fake.seed(1234)
print fake.name()
r = random.getstate()
print fake.email()
random.setstate(r)
print fake.name()
r = random.getstate()
print fake.email()
random.setstate(r)
print fake.name()
r = random.getstate()
print fake.email()
random.setstate(r)

# A similar problem arises if the program using faker happens to use a non-instance random call during generation.
# The best way to prevent this issue is to have faker use an instance of random rather than the module version.

# If faker used an instance version of random, you could also resolve the original problem by using different faker instances
fake.seed(1234)
fake2 = Faker()
fake2.seed(1234)
print fake.name(), fake2.email()
print fake.name(), fake2.email()
print fake.name(), fake2.email()
@joke2k

This comment has been minimized.

Show comment
Hide comment
@joke2k

joke2k Feb 4, 2014

Owner

Thank you very much for the very good explanation.

I spent quite a bit of time to find an easy way to correct this behavior, but I think we will have to remove all the classmethod/staticmethod.

Then change the faker.generators.Generator:

 class Generator(object):

    def __init__(self):
        self.providers = []
        self.random = random.Random()

    def seed(self, seed=None):
        """
        calls self.random.seed
        """
        self.random.seed(seed)

and a lot of other provider methods.

Anyway I think it will be difficult to use correctly the seed(), because some methods can use a variable number of times the instance self.random to generate data...

Currently, this method seems to be buggy.

Owner

joke2k commented Feb 4, 2014

Thank you very much for the very good explanation.

I spent quite a bit of time to find an easy way to correct this behavior, but I think we will have to remove all the classmethod/staticmethod.

Then change the faker.generators.Generator:

 class Generator(object):

    def __init__(self):
        self.providers = []
        self.random = random.Random()

    def seed(self, seed=None):
        """
        calls self.random.seed
        """
        self.random.seed(seed)

and a lot of other provider methods.

Anyway I think it will be difficult to use correctly the seed(), because some methods can use a variable number of times the instance self.random to generate data...

Currently, this method seems to be buggy.

@deinspanjer

This comment has been minimized.

Show comment
Hide comment
@deinspanjer

deinspanjer Feb 7, 2014

I don't think that it will be that painful to try to clean up this issue. I wish I wasn't so rusty with Python app development outside a small single script, or I'd try to put together a pull request for you, but I fear it wouldn't be adequate unless I put more time into it than I have available.

I definitely agree that Generator needs to get an instance of random rather than using the module static. That is step 0. Even just doing that one simple change would remove most of the problem and enable the approach I put at the bottom of my example where people could just spawn more than one faker object when they want to keep the seed stable. I suggest testing against regression and then start by just releasing that.

If that goes well, the next thing I'd recommend would be adding two more methods to generator to allow checkpointing the random instance via getstate and setstate. The end user wouldn't even have to store the state themselves, you would probably be fine introducing it as a type of transactional system where they could call a method like "savestate" before they start doing "weird" stuff, then call "restorestate" when they want to get back to normal. save would call getstate and store the state in a instance variable, and restorestate would just call setstate to update the random instance back to where it was.

As for your worry about some methods using random a variable number of times, I'd be happy to look at it if you could point me at one, but I suspect they might not be a problem due to the pseudo nature of the prg. Here is why I think that, let me know if I'm wrong.
Given three methods, a, b, and c. Methods a and c each make a single call to random to do their work. b will output an array of strings, between one and five. The content of the strings is random, and the number of strings to be returned is also random.

Since all of the randomness in methods a, b, and c are based on the same rng, the random instance stored inside Generator, if a program calls a, then b, then c using the same initial seed, it should still get the same output. The random number between 1 and 5 that b used to determine how many strings to output will always be the same given the same initial seed.

deinspanjer commented Feb 7, 2014

I don't think that it will be that painful to try to clean up this issue. I wish I wasn't so rusty with Python app development outside a small single script, or I'd try to put together a pull request for you, but I fear it wouldn't be adequate unless I put more time into it than I have available.

I definitely agree that Generator needs to get an instance of random rather than using the module static. That is step 0. Even just doing that one simple change would remove most of the problem and enable the approach I put at the bottom of my example where people could just spawn more than one faker object when they want to keep the seed stable. I suggest testing against regression and then start by just releasing that.

If that goes well, the next thing I'd recommend would be adding two more methods to generator to allow checkpointing the random instance via getstate and setstate. The end user wouldn't even have to store the state themselves, you would probably be fine introducing it as a type of transactional system where they could call a method like "savestate" before they start doing "weird" stuff, then call "restorestate" when they want to get back to normal. save would call getstate and store the state in a instance variable, and restorestate would just call setstate to update the random instance back to where it was.

As for your worry about some methods using random a variable number of times, I'd be happy to look at it if you could point me at one, but I suspect they might not be a problem due to the pseudo nature of the prg. Here is why I think that, let me know if I'm wrong.
Given three methods, a, b, and c. Methods a and c each make a single call to random to do their work. b will output an array of strings, between one and five. The content of the strings is random, and the number of strings to be returned is also random.

Since all of the randomness in methods a, b, and c are based on the same rng, the random instance stored inside Generator, if a program calls a, then b, then c using the same initial seed, it should still get the same output. The random number between 1 and 5 that b used to determine how many strings to output will always be the same given the same initial seed.

@defaltd

This comment has been minimized.

Show comment
Hide comment
@defaltd

defaltd Feb 7, 2014

Just to jump in with some initial assessment. @deinspanjer is correct. Checking the current state of the seed would be optimal. Based on my assessment, it seems that you have one single instance of the Generator object per factory. Thus, I have a possible simpler idea...

  1. Make it so that seeding generates a new instance of the generator into a hashable table.

    Reasoning: Since python already states that the contents of calling random.seed([x]) includes a value that must be hashable, including a hash array of generators would be ok. Due to O(1) complexity for accessing.
  2. Focus should then be placed on how random is being used throughout the factory.

defaltd commented Feb 7, 2014

Just to jump in with some initial assessment. @deinspanjer is correct. Checking the current state of the seed would be optimal. Based on my assessment, it seems that you have one single instance of the Generator object per factory. Thus, I have a possible simpler idea...

  1. Make it so that seeding generates a new instance of the generator into a hashable table.

    Reasoning: Since python already states that the contents of calling random.seed([x]) includes a value that must be hashable, including a hash array of generators would be ok. Due to O(1) complexity for accessing.
  2. Focus should then be placed on how random is being used throughout the factory.

@joke2k joke2k added this to the Release 0.4.1 milestone Jun 4, 2014

@joke2k joke2k modified the milestones: Release 0.4.1, Release 0.5 Aug 19, 2014

amygdalama pushed a commit to venmo/faker that referenced this issue Jun 17, 2015

Amy Hanlon
seed a Random instance instead of the module
Seeding the random module will seed that module globally --
this affects calls to random outside the faker package.
Seeding an instance of the Random class will not affect
other calls to the random module.
This addresses one concern in issue #14.

amygdalama pushed a commit to venmo/faker that referenced this issue Jun 17, 2015

Amy Hanlon
seed a Random instance instead of the module
Seeding the random module will seed that module globally --
this affects calls to random outside the faker package.
Seeding an instance of the Random class will not affect
other calls to the random module.
This addresses one concern in issue #14.

amygdalama pushed a commit to venmo/faker that referenced this issue Jun 17, 2015

Amy Hanlon
seed a Random instance instead of the module
Seeding the random module will seed that module globally --
this affects calls to random outside the faker package.
Seeding an instance of the Random class will not affect
other calls to the random module.
This addresses one concern in issue #14.

amygdalama pushed a commit to venmo/faker that referenced this issue Jun 17, 2015

Amy Hanlon
seed a Random instance instead of the module
Seeding the random module will seed that module globally --
this affects calls to random outside the faker package.
Seeding an instance of the Random class will not affect
other calls to the random module.
This addresses one concern in issue #14.
@fcurella

This comment has been minimized.

Show comment
Hide comment
@fcurella

fcurella Oct 29, 2015

Collaborator

Fixed in #259

Collaborator

fcurella commented Oct 29, 2015

Fixed in #259

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment