Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

random.RandomState with different versions of numpy has vastly different performance #2782

Open
alexcarterkarsus opened this issue Apr 3, 2020 · 15 comments · Fixed by #2864
Labels
help wanted impact MEDIUM Big annoyance for affected users performance Issue related to performance (in HW meaning) reach HIGH Affects most or all Gensim users

Comments

@alexcarterkarsus
Copy link

the performance of random.RandomState in word2vec.py (version 3.8.0)

def seeded_vector(self, seed_string, vector_size):
         once = random.RandomState(self.hashfxn(seed_string) & 0xffffffff)
         return (once.rand(vector_size) - 0.5) / vector_size

seemingly depends greatly on the version of numpy installed. With numpy = 1.14.3, the following code

from  numpy.random import RandomState as Ran
from time import time
t1 = time()
for i in range(100000):
    temp = Ran(hash((i)) & 0xffffffff)
t2 = time()
t2-t1 

produced
0.28105926513671875
exactly the same code with numpy= 1.18.1 produced

18.590345859527588

I noticed this because I was training a model with millions of words as vocabulary, and after updating numpy unwittingly (via a anaconda update), I noticed that the time for build_vocab was significantly longer, and after some debugging, I nailed it down to random.RandomState in the seeded_vector function.
I know this is indeed a numpy issue, but even they mentioned it that RandomState is legacy (https://docs.scipy.org/doc/numpy/reference/random/performance.html). Therefore I wonder if you have some plans to upgrade randomstate? Thanks!

@piskvorky
Copy link
Owner

piskvorky commented Apr 3, 2020

Thanks for reporting. What was the difference in random.RandomState that you nailed it down to?

We should replace RandomState if it's deprecated. What should it be replaced with, is the change 100% backward compatible? Can you open a PR?

@piskvorky piskvorky added help wanted performance Issue related to performance (in HW meaning) reach HIGH Affects most or all Gensim users impact MEDIUM Big annoyance for affected users labels Apr 3, 2020
@alexcarterkarsus
Copy link
Author

what I found is that numpy.random.RandomState defaults to the legacy random generator, i.e., the slowest, however, there are other options: PCG64, SFC64, etc. You could simply use them (after import numpy.random as random) with, e.g., random.SFC64, instead of random.RandomState.Same code as above but using SFC64, even with numpy= 1.18.1, would give
2.767242193222046
which is the fastest of the bunch, although still not as fast as in numpy=1.14.3
I don't know exactly what caused the performance change inside numpy, and I also don't know how you can easily amend the code without breaking compatibility.

@gojomo
Copy link
Collaborator

gojomo commented Apr 5, 2020

I don't think it's necessary (or reasonable for users to expect) that the same seeding initializes vectors identically across changed gensim versions. (The starting state shouldn't be particularly important for end results under realistic uses – such as multithreaded training, which introduces lots more uncontrollable randomness.)

So we should just choose whatever numpy recommends as a fast pseudorandom generator.

@piskvorky
Copy link
Owner

piskvorky commented Apr 5, 2020

Agreed. A 60x slowdown is really worrying.

@gojomo
Copy link
Collaborator

gojomo commented Jun 29, 2020

Ultimately any other uses of RandomState in gensim – https://github.com/RaRe-Technologies/gensim/search?q=RandomState&type=Code – will face the same slowdown unless updated to match numpy best-practices.

@piskvorky
Copy link
Owner

piskvorky commented Jun 29, 2020

@zygm0nt do you think you could take care of all the places gensim uses the old (slow) numpy RNG?

@zygm0nt
Copy link
Contributor

zygm0nt commented Jun 29, 2020

Sure, I can take care of that. Should this be done in this ticket? by reopening it?

@piskvorky
Copy link
Owner

Yeah, let me reopen. Thanks!

@piskvorky piskvorky reopened this Jun 29, 2020
@SagarDollin
Copy link

Hi, @zygm0nt @piskvorky ,
Any updates on this issue? I would like to contribute to a PR if it is already created. Also, please help me with the best practices I must follow during a PR. Are there any unit tests to pass before creating a PR. Or do I need to check the integrity of the code locally (in which case point me to any documentation that could help)

Thanks,
looking forward to your reply.

@piskvorky
Copy link
Owner

Sure! The code style and instructions for Gensim are here: https://github.com/RaRe-Technologies/gensim/wiki/Developer-page
Thanks.

@SagarDollin
Copy link

SagarDollin commented Aug 26, 2021

Hi @piskvorky @gojomo,
I'm here with some updates about the work I did today.
From the above comments, it is clear that we need to replace all instances of RandomState and RandomState objects as it is deprecated from new versions of NumPy and also because it is 60 times slower.

What can we use to replace RandomState?
Ans: Instead of using np.random.RandomState(seed) we can use np.random.default_rng(seed) . default_rng instantiates a Generator with numpy's default BitGenerator(PCG64). For more info refer this documentation.

Will making the above change introduce new bugs?
Ans: The above change could potentially introduce new bugs if we do not take care of lines of code dependent on RandomState because we are using a Generator which is a replacement for RandomState , but they are not the exact same. To avoid any bugs, I have to do modifications wherever we have a dependency on RandomState.

To deal with any dependencies, I did some digging into the code, and I found few instances that have a dependency on RandomState, which will not cause any new bug given we do few modifications. Refer below for all the details(you can skip this part):

  1. In the file gensim/utils the function get_random_state(seed) has the line return np.random.mtrand._rand on line 90, but the returned object np.random.mtrand._rand is an instance of RandomState(refer to the screenshot below)
    image
    The simple solution to this dependency is to replace np.random.mtrand._rand with an instance of Generator i.e., np.random.deafult_rng().
    image

  2. There are other dependencies like in the file gensim/models/poincare.py there are instances of like these:

    1. self._np_random.uniform(..)
    2. self._np_random.randint(..)
    3. self._np_random.choice(..)
    4. self._np_random.shuffle(..)
      where self._np_random = np_random.RandomState(seed).
      Now since we are replacing RandomState , the self._np_random will be a Generator, and we have the implementation of uniform(..), choice(..) , shuffle(..) with the same parameters, except randint in the Generator. I need to replace all occurrences of randint(..) with integers(..) that basically take the same parameters as randint in our case and do the same job.

You can see more info about these methods here in the documentation.

Note: Here, in any line of code I have mentioned with np, I'm referring to the numpy package.

To Conclude
Let me know if anyone has any other ideas or concerns with the above method soon, as I will start working on the PR immediately.

@piskvorky
Copy link
Owner

piskvorky commented Aug 26, 2021

Sounds good to me, thanks.

After the replacement, can you do a sanity check re. performance? The new code should be faster (or at least not slower) than the existing code. I don't expect the overall impact will be too large (RNG is just a small part of the ML algos), but it would still be nice to include some concrete numbers in the release notes.

@SagarDollin
Copy link

SagarDollin commented Aug 27, 2021

Hey @piskvorky,
Thanks for your comment. Sure I will include a sanity test. I needed some clarifications for the sanity testing.

  1. The Poincare model in the gensim/models/poincare.py, which seems to be heavily relying on RandomState methods. So I will be testing this model after the change. However I might need a big dataset in order to test this. The dataset poincare_hypernyms_large might be a little small for that. Let me know if you can point me to any other bigger dataset, or should I go forward with the current dataset?
  2. Apart from Poincare there is word2vec that has RandomState but it is not using any of the methods of the RandomState. So I'm guessing testing this would give us the same results after making changes.
  3. Should I include the sanity test file in the PR? Or is a screenshot of the results enough?

Thanks,
Sagar B Dollin

@piskvorky
Copy link
Owner

piskvorky commented Aug 27, 2021

Should I include the sanity test file in the PR? Or is a screenshot of the results enough?

A summary of your benchmark as part of the PR description is enough. Nothing fancy or formal – really, a rudimentary sanity check. Thanks.

SagarDollin added a commit to SagarDollin/gensim that referenced this issue Aug 28, 2021
This is regarding the issue piskvorky#2782 .

Here are the benchmarks of before and after updating:

		Before Update      		After Update

Poincare	Ran 42 tests in 0.418s          Ran 42 tests in 0.417s
test_lda        Ran 48 tests in 223.845s        Ran 48 tests in 225.561s
utils      	Ran 24 tests in 0.007s          Ran 24 tests in 0.007s
test_matutils   Ran 18 tests in 0.071s          Ran 18 tests in 0.070s
word2vec        Ran 79 tests in 58.149s         Ran 79 tests in 57.950s

I don't find a big difference in time taken. However I feel it is good to be updated along with numpy.
@SagarDollin
Copy link

SagarDollin commented Aug 30, 2021

Hi @piskvorky,
The PR is ready for review.
Things achieved in this PR:

  1. I have resolved all the dependencies on RandomState.
  2. The code is still backward compatible in the sense, we can load a pre-trained model that relies on RandomState instead of the Generator and still be able to run it.

Note: I had to change few test files as well, as some tests were relying on hardcoded variables, and since we are using a new Random Generator, they tended to fail. So I have updated these tests. However, I'm not an expert in all the models implemented in this repo, so a review is required for any of these changes.

Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted impact MEDIUM Big annoyance for affected users performance Issue related to performance (in HW meaning) reach HIGH Affects most or all Gensim users
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants