Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for multiple locale data generation #1052

Merged
merged 14 commits into from Dec 4, 2019
Merged

Conversation

@malefice
Copy link
Contributor

malefice commented Nov 13, 2019

What does this change

Faker can now accept multiple locales while retaining the same signature to preserve backward compatibility.

What was wrong

Faker can only handle a single locale.

How this fixes it

The new Faker proxy class will internally create locale-specific generators as needed, and then proxy calls to the generators that can accommodate the call. Generator selection logic may also be specified via weights.

Notes

Needless to say, there will be a performance penalty, and the severity of the penalty depends on:

  • How many locales were specified
  • How many locale-specific generators can handle a certain call
  • If a custom distribution will be used

Fixes #691, fixes #976, partial solution-ish to #453, related to #230

@malefice malefice force-pushed the malefice:issue-691 branch from 4fa5d36 to 104564b Nov 15, 2019
@malefice

This comment has been minimized.

Copy link
Contributor Author

malefice commented Nov 15, 2019

The changes needed to ensure all tests will pass will most likely require backward-incompatible changes if the Faker object will be converted from a shortcut of Factory.create to the new proxy class. This is because the non-provider methods/properties of Generator that will have to be properly proxied as well.

The random generator related stuff can probably be proxied in a sane way like seeding the same value across all internal generators, but the others do not make as much sense like add_provider, since the internal generators should have been done with that.

We can name this proxy class something else to keep changes simple while still providing multiple locale support. I would like to know your thoughts on this @fcurella.

@fcurella

This comment has been minimized.

Copy link
Collaborator

fcurella commented Nov 15, 2019

Thank you much for looking into this thorny issue @malefice !

I'm ok with breaking compatibility, as long as we document what we're breaking and provide users with documentation on how to migrate their code.

One feature I want to be very careful about not breaking is the command-line -i option.

@joke2k

This comment has been minimized.

Copy link
Owner

joke2k commented Nov 15, 2019

Yeah, it is a very thorny issue!
It was one of the initial thoughts about the evolution of this package.

@malefice

This comment has been minimized.

Copy link
Contributor Author

malefice commented Nov 15, 2019

Alright. Thanks for the guidance. I will try to think of something more elegant as well.

@malefice

This comment has been minimized.

Copy link
Contributor Author

malefice commented Nov 18, 2019

The flake8 errors are related to PR #1058. There is another error because assert_called_once() is not available in 3.5's unittest.mock. On top of that, I found a very minor bug involving locale strings (more on that later).

Other than that, I think the new Faker proxy class is almost ready. It is completely backwards-compatible with the old Faker shortcut for Factory.create, since all attribute and method calls will be proxied to the internal Generator object. This is what I call single locale mode.

Single locale mode

>>> from faker import Faker
>>> f = Faker('en_US')
>>> f.factories
[<faker.generator.Generator object at 0x7f0f227fffd0>]
>>> f.locales
['en_US']
>>> f['en_US']
<faker.generator.Generator object at 0x7f0f227fffd0>

# Set seed using the proxy class
>>> f.seed(0)
>>> f.name()
'Norma Fisher'
>>> f.name()
'Jorge Sullivan'

# Set seed using the internal generator
>>> f['en_US'].seed(0)
>>> f['en_US'].name()
'Norma Fisher'
>>> f['en_US'].name()
'Jorge Sullivan'

In what I call multiple locale mode, attempting to proxy any Generator attributes, methods, and what-have-you will just raise a NotImplementedError, either because it does not make sense to proxy like add_provider or because it is potentially dangerous to do so like the random setter.

Instead, I think we should just leave it to users to subclass and create their own implementation if they really want to. After all, each of the internal Generator objects can already be easily accessed, and from there, users can perform the operations they want to on individual generators like seeding or even forcing them to share the same instance of random.Random.

As for data generation, multiple locale mode will randomly choose a Generator object capable of handling the specified provider method. In other words, fake.foo() should not fail as long as at least one locale supports the foo provider method.

Multiple locale mode

>>> from faker import Faker
>>> f = Faker(['en_US', 'en_PH'])
>>> f.factories
[<faker.generator.Generator object at 0x7f987da97080>, <faker.generator.Generator object at 0x7f987b2b6400>]
>>> f.locales
['en_US', 'en_PH']
>>> f['en_US']
<faker.generator.Generator object at 0x7f987da97080>
>>> f['en_PH']
<faker.generator.Generator object at 0x7f987b2b6400>

# Will generate both en_US and en_PH style addresses
# since `address()` is available in both locales
>>> addresses = [f.address() for _ in range(1000)]

# Will not fail, because it knows to only use en_PH
# since `luzon_province()` is only available in en_PH
>>> provinces = [f.luzon_province() for _ in range(1000)]

As for the locale strings I mentioned earlier, both en_US and en-US should be considered duplicates, but for the purposes of __getitem__(), what key should be used? Do we use en_US or do we use en-US or are you okay with either? Other than that, please let me know if you are okay with the overall design decision of this new proxy class. @fcurella @joke2k

@fcurella

This comment has been minimized.

Copy link
Collaborator

fcurella commented Nov 19, 2019

or because it is potentially dangerous to do so like the random setter.

Would it still possible to seed the generator in multiple locale mode?

As for the locale strings I mentioned earlier, both en_US and en-US should be considered duplicates, but for the purposes of __getitem__(), what key should be used? Do we use en_US or do we use en-US or are you okay with either?

I think it should be consistent with what the constructor does. In other words, it should accept both, and normalize to _.

@malefice

This comment has been minimized.

Copy link
Contributor Author

malefice commented Nov 20, 2019

Would it still possible to seed the generator in multiple locale mode?

Yes but on an individual Generator basis and not on the proxy class level. In code form:

import random
from faker import Faker
f = Faker(['en_US', 'en_PH')]

# These two commands have the same effect because
# a Random object is shared across Generator instances
f['en_US'].seed(0)
f['en_PH'].seed(0)

# This will create and seed a Random object per Generator instance
f['en_US'].seed_instance(0)
f['en_PH'].seed_instance(0)

# Alternatively, you can assign a new Random object per Generator instance
f['en_US'].random = random.Random()
f['en_PH'].random = random.Random()

# Check the state of each Generator's Random object
f['en_US'].random.getstate()
f['en_PH'].random.getstate()

# You just cannot do any of these in multiple locale mode
# Each of these will raise a NotImplementedError
f.seed(0)
f.seed_instance(0)
f.random = random.Random()
f.random.getstate()

If you really want a default implementation, we can probably proxy seed() to the class method Generator.seed(), and we can proxy seed_instance() to invoke each internal Generator's seed instance() method. For the random getter and setter, maybe we can make the random getter return a list of results of calls to each internal Generator's random getter, and the setter will expect a similarly sized list and then in order, pass each element to each internal Generator's random setter.

Personally, I do not know if that will be a common enough use case, which is why I did not provide a proxy level implementation, and why I think it should be up to the users to subclass for their use cases.

@fcurella

This comment has been minimized.

Copy link
Collaborator

fcurella commented Nov 20, 2019

the problem with called .seed is that it will seed the random module, which is might be shared across the user's code.

Ideally, we want three ways to seed random values:

  1. the python-provided random.seed(). The existing .seed() would do (although I find it ambiguous and I would like to remove it at some point).
  2. A way to seed all subgenerators with a single call. I think we should use .seed_instance for this. From the user's perspective, there is just one instance with multiple locales.
  3. A way to seed a specific local of an instance. We could have a seed_locale(locale, seed) method.
@malefice

This comment has been minimized.

Copy link
Contributor Author

malefice commented Nov 21, 2019

the problem with called .seed is that it will seed the random module, which is might be shared across the user's code.

Actually Generator.seed() will not affect the random module itself. The Generator instances share a separate random.Random object which is declared here, so seeding it will not affect user's code, unless of course they deliberately used the random.Random object of the Generator objects.

On that note, the proxy class can have a seed method that will invoke Generator.seed which effectively seeds that specific shared random.Random object, and the only ones that will be affected are Generator instances that still use said random.Random object. In code form using Python3:

>>> from faker import Faker
>>> import random
>>> f = Faker('en_US')

# Seed the random module
>>> random.seed(0)
>>> random.random()
0.8444218515250481
>>> random.random()
0.7579544029403025
>>> random.random()
0.420571580830845

# Seed the shared random.Random object
# Results should be the same if `random` module was affected
>>> f.seed(0)
>>> random.random()
0.25891675029296335
>>> random.random()
0.5112747213686085
>>> random.random()
0.4049341374504143

So what are your thoughts?

@fcurella

This comment has been minimized.

Copy link
Collaborator

fcurella commented Nov 21, 2019

Thank you for the clarification @malefice .

This is how I'm thinking it could work:

  1. User could use the global random.seed() module to seed globally. This would really be out of scope for Faker, and TBH not any of our concern.
  2. If I'm understanding you correctly, calling the shared Generator.seed() behaves like this:
    f1 = Faker("en-US")
    f2 = Faker("en-PH")
    f1.seed(0)
    # now f2 has seed == 0 too
    I think this behavior is a little unexpected. Would be nicer if .seed() could be just a class method on the Proxy class:
    >>> from faker import Faker
    
    >>> Faker.seed(0)
    >>> f1 = Faker("en-US")
    >>> f2 = Faker("en-PH")
    
    >>> f1.seed(0)
    AttributeError("Calling `.seed()` on instances is deprecated")
  3. The per-instance seeding: .seed_instance()
  4. A new, per-instance-locale seeding .seed_locale()
@malefice

This comment has been minimized.

Copy link
Contributor Author

malefice commented Nov 22, 2019

calling the shared Generator.seed() behaves like this

Yes, that is how Generator.seed in the current master behaves.

>>> from faker.factory import Factory
>>> Faker = Factory.create
>>> f1 = Faker('en-US')
>>> f2 = Faker('en-PH')

>>> f1.seed(0)
>>> f1.name()
'Norma Fisher'
>>> f1.name()
'Jorge Sullivan'
>>> f1.name()
'Elizabeth Woods'

>>> f2.seed(0)
>>> f1.name()
'Norma Fisher'
>>> f1.name()
'Jorge Sullivan'
>>> f1.name()
'Elizabeth Woods'

Would be nicer if .seed() could be just a class method on the Proxy class:

Yeah that is certainly possible, but I suggest raising a TypeError instead. On top of keeping the code a bit neater because of how __getattribute__ silently consumes AttributeError, it is in my opinion more technically sound. In your proposal, it is not that the attribute does not exist, but instead, you want to disable calling the class method from a particular type of object (which is implicitly passed), in this case a Faker instance rather than the Faker class object itself.

Either way, doing this will definitely break any old code that calls seed(), but those can easily be fixed by changing those calls like so:

Old code that will break

from faker import Faker
f = Faker()  # Faker here is still a shortcut for Factory.create
f.seed(0)  # This will break

Method 1

from faker import Faker
f = Faker()  # Faker here is now the new proxy class
Faker.seed(0)  # Fix by invoking from the class itself

Method 2

from faker.factory import Factory
Faker = Factory.create 
f = Faker()  # Faker here is still the shortcut, so no multiple locale support
f.seed(0)

So are you okay with this?

@fcurella

This comment has been minimized.

Copy link
Collaborator

fcurella commented Nov 22, 2019

I suggest raising a TypeError instead.

That's a much better idea! Let's do a TypeError

doing this will definitely break any old code that calls seed()

I'm comfortable with that, I think it's worth it for the additional features we gain and cleaner code.

As long as we increase the major version and have clear docs on how to migrate user's code, it should be fine.

@malefice malefice force-pushed the malefice:issue-691 branch from 997b3c1 to aeb0833 Nov 25, 2019
@malefice

This comment has been minimized.

Copy link
Contributor Author

malefice commented Nov 25, 2019

@fcurella can you review my work? once it is okay, i will start writing docs.

@fcurella

This comment has been minimized.

Copy link
Collaborator

fcurella commented Nov 25, 2019

@malefice Thank yo so much! That looks like a lot of work :)

LGTM, Go ahead with the docs 👍

malefice added 3 commits Nov 26, 2019
@malefice

This comment has been minimized.

Copy link
Contributor Author

malefice commented Nov 26, 2019

@fcurella please check if the docs are okay. You can view the sample here.

Copy link
Collaborator

fcurella left a comment

Nice work! jut a few minor things in the docs but overall this is 🌠

README.rst Outdated Show resolved Hide resolved
docs/fakerclass.rst Outdated Show resolved Hide resolved
docs/fakerclass.rst Outdated Show resolved Hide resolved
docs/fakerclass.rst Outdated Show resolved Hide resolved
docs/fakerclass.rst Outdated Show resolved Hide resolved
@malefice

This comment has been minimized.

Copy link
Contributor Author

malefice commented Dec 3, 2019

I would have preferred to use Sphinx's note or versionadded directive, but it looks absolutely horrible when rendered by Github.

@malefice malefice requested a review from fcurella Dec 3, 2019
@fcurella

This comment has been minimized.

Copy link
Collaborator

fcurella commented Dec 3, 2019

I would have preferred to use Sphinx's note or versionadded directive, but it looks absolutely horrible when rendered by Github.

It works for now, we can revisit later.

I'm going to release one last v2.x, then merge this and release v3.0.0

fcurella added 2 commits Dec 3, 2019
@fcurella fcurella merged commit 5257c8f into joke2k:master Dec 4, 2019
2 of 4 checks passed
2 of 4 checks passed
continuous-integration/appveyor/pr AppVeyor build failed
Details
coverage/coveralls Coverage decreased (-0.05%) to 96.501%
Details
ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@malefice malefice deleted the malefice:issue-691 branch Dec 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.