Memory Error - Word2Vec #293

Closed
dav009 opened this issue Feb 11, 2015 · 9 comments

@dav009

dav009 commented Feb 11, 2015

I'm running a fairly simple script [1] that calls word2vec on a 15 GB corpus which is already tokenized. I have tried running it on a 30 GB machine and then on a 60 GB machine.
Both attempts lead to the following error:

Traceback (most recent call last):
  File "word2vec.py", line 13, in <module>
    read_corpus("/mnt/data/corpus")
  File "word2vec.py", line 9, in read_corpus
    model = gensim.models.Word2Vec(sentences, min_count=10, size=500, window=10, sg=1, workers=4)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 312, in __init__
    self.build_vocab(sentences)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 414, in build_vocab
    self.reset_weights()
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 524, in reset_weights
    self.syn1 = zeros((len(self.vocab), self.layer1_size), dtype=REAL)
MemoryError

Any clues? Does the script contain something that is not meant to be done?

[1] https://gist.github.com/dav009/fb9a42890d3048b3b745

@sebastien-j
Contributor

How big is your vocabulary?

If I recall correctly, with a vocabulary of size |V|, the memory usage should be approximately 8 * size * |V| bytes (plus some overhead).

For |V|=10^7 and size=500, this is 40 GB.

The simplest solution is probably to increase min_count.
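
For a rough sanity check, here is a minimal sketch of that estimate (the 8 * size * |V| figure covers the two float32 weight matrices, syn0 and syn1, at 4 bytes per element each):

```python
# Back-of-the-envelope estimate of Word2Vec weight-matrix memory,
# following the 8 * size * |V| rule of thumb above: syn0 and syn1
# are each a |V| x size float32 array (4 bytes per element).
vocab_size = 10**7  # |V|, from the example above
size = 500          # vector dimensionality

bytes_needed = 8 * size * vocab_size
print("approx. %.1f GiB" % (bytes_needed / 1024.0 ** 3))  # ~37.3 GiB
```

Raising min_count prunes rare words from the vocabulary, shrinking |V| and therefore both matrices proportionally.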

@seelikat

seelikat commented Oct 9, 2015

@dav009, were you able to find the problem? gensim is also failing for me with a MemoryError although plenty of memory is available.

@piskvorky
Owner

As @sebastien-j says, it's best to give a more detailed report. A link (gist) to the log of your run (at INFO level) would be ideal.

@seelikat

The actual problem was that a 32-bit Python (in an Anaconda distribution) was installed on my cluster node, so it had nothing to do with gensim.

@piskvorky
Owner

@dav009 did you figure out the cause in your case? Was it 32bit Python as well?

Let's try to conclude & close this ticket.

@tmylk
Contributor

tmylk commented Jan 23, 2016

Closing as abandoned.

@tmylk tmylk closed this as completed Jan 23, 2016
@shirish93

Hello, I seem to have encountered a similar issue. I'm using gensim with WinPython in one of my virtual machines with 16 GB of memory. A model of mine loads fine and works perfectly on a different system with 8 GB of memory under similar conditions, but when I run it in the VM, I get this:

word2vec.py", line 1266, in init_sims self.syn0norm = (self.syn0 / sqrt((self.syn0 ** 2).sum(-1))[..., newaxis]).astype(REAL) MemoryError
when trying to use the 'most_similar' function of Word2Vec. I can load the model fine, and can retrieve word vectors fine, but it seems to explode just when I access the similarity-related functions (including 'doesnt_match'). This is the closest existing issue I could find. Any ideas?
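
For context, init_sims precomputes a second, unit-normalized copy of all vectors (syn0norm) the first time a similarity query runs, which roughly doubles the model's memory footprint. If the model itself loads, one possible workaround in the Word2Vec API of that era is to normalize in place (a sketch; the model path is hypothetical):

```python
# Sketch, assuming a gensim 0.12-era Word2Vec model saved at "my_model"
# (illustrative path). With replace=True, init_sims overwrites syn0 with
# its L2-normalized version instead of keeping both arrays, so
# most_similar() needs no second |V| x size float32 allocation.
# Note: the model can no longer be trained after this call.
import gensim

model = gensim.models.Word2Vec.load("my_model")
model.init_sims(replace=True)
print(model.most_similar("word"))
```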

@gojomo
Collaborator

gojomo commented Jan 26, 2016

@shirish93 – verify that it's the exact same Python version (both installed and specifically in use at the time of the error) on the system that works and the one that doesn't. (That seems to have been the issue above.)
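
A quick way to check whether the interpreter in use is 32-bit or 64-bit, using only the standard library (a minimal sketch):

```python
# Print the pointer width and version of the running interpreter.
# A 32-bit Python caps the process at roughly 2-4 GB of address space,
# which raises MemoryError long before physical RAM runs out.
import platform
import struct
import sys

print(struct.calcsize("P") * 8, "bit")  # 32 or 64
print(platform.architecture()[0])       # e.g. '64bit'
print(sys.version)                      # exact interpreter version
```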

@shirish93

This was right: I installed 64-bit Python, and the issue was resolved. Apologies for raising a non-issue!
