
W2v negsam #162

Merged
merged 25 commits into piskvorky:develop on May 10, 2014

Conversation

cod3licious

Added negative sampling to the Python train_sentence function for the skip-gram model, along with all additionally needed functions. In build_vocab, if negative sampling is used, a big table for the noise word distribution is created and saved in model.table. This takes quite a bit of RAM and should probably be deleted before saving the model (it's only needed for training anyway).
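
For reference, a rough sketch of how such a noise table can be built (the 0.75 smoothing exponent and the ~1e8 table size mirror the C tool; the function name and this vectorized construction are illustrative assumptions, not the actual gensim code):

import numpy as np

def make_noise_table(vocab_counts, table_size=int(1e8), power=0.75):
    # Map each table slot to a word index, proportionally to count**power,
    # so that frequent words fill more slots (sketch, not the gensim code).
    counts = np.asarray(vocab_counts, dtype=np.float64) ** power
    cumulative = np.cumsum(counts / counts.sum())
    slots = (np.arange(table_size) + 0.5) / table_size   # midpoint of each slot
    word_ids = np.searchsorted(cumulative, slots)
    return np.minimum(word_ids, len(counts) - 1).astype(np.uint32)

# drawing one noise word is then a single random lookup:
# noise_word = table[np.random.randint(len(table))]

At that size the uint32 table alone is several hundred MB, which is the RAM concern mentioned above.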

@piskvorky (Owner)

Great, thanks @cod3licious :)

I'll review and merge asap. I'm kinda overwhelmed at the moment with life stuff, sorry. Reviewing the python3 port is also still in my "gensim queue".

@mfcabrera (Contributor)

Hey, I would love to see this in Gensim soon. I am currently using Word2Vec for my master's thesis, and negative sampling definitely works better for me, but I want to stop using the original C version. Is there a way I could help (a particular test / verification)?

@piskvorky (Owner)

Sure! There are many things you can help with, @mfcabrera:

  1. review the memory requirements that Franziska mentions above
  2. run different combinations of HS/negative sampling, and check and report accuracies or any problems
  3. integrate with gensim, i.e. pull request #177 ("add negative sampling method for skip-gram in the slower version")
  4. add an optimized version in Cython

@sebastien-j (Contributor)

Hi all,

A few months ago, using the Gensim implementation of word2vec as a starting point, I added negative sampling (https://github.com/sebastien-j/Word2vec). The modified files are not ready to be merged into gensim yet, but they might be useful.

Here are some potential issues with my implementation. First, the different .py and .pyx files should be merged instead of having one for each kind of model. Second, I did not use the look-up table to compute the sigmoid function, but that can be changed quickly. Third, I was generating the vocabulary Huffman tree and using it during training to see whether a word was in the vocabulary or not. Moreover, the random number generation seems wasteful, but I didn't profile the code to see whether this is a bottleneck or not. There are also additional hyper-parameters that may not be useful in gensim. Finally, the code for "optimization 2" is not written.
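
For reference, a minimal sketch of the kind of precomputed sigmoid lookup table the C tool uses (the constants below mirror word2vec.c and should be treated as assumptions):

import numpy as np

# Precompute sigmoid values on (-MAX_EXP, MAX_EXP); look up instead of calling exp()
EXP_TABLE_SIZE, MAX_EXP = 1000, 6
_x = (np.arange(EXP_TABLE_SIZE, dtype=np.float64) / EXP_TABLE_SIZE * 2 - 1) * MAX_EXP
EXP_TABLE = 1.0 / (1.0 + np.exp(-_x))

def fast_sigmoid(f):
    # saturate outside the table range, otherwise a single array lookup
    if f <= -MAX_EXP:
        return 0.0
    if f >= MAX_EXP:
        return 1.0
    return EXP_TABLE[int((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2.0))]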

If you have any questions, feel free to ask. I'll try to help.

Sébastien

@piskvorky (Owner)

I wouldn't worry about the sigmoid tables. That optimization doesn't bring much.

But I worry about merging this without a cython version. People will complain it's too slow :)

@sebastien-j (Contributor)

Putting the sigmoid table back is a really easy task anyway. I can try to make a cleaner Cython version. What would be the best way to submit it? Just make a new pull request?

@cod3licious, why are there both 'hs' and 'negative' parameters? I think one of these should be sufficient.

@piskvorky (Owner)

Best to make a pull request to @cod3licious 's w2v-negsam branch, so all the changes are in this single pull request, which can then be merged into gensim proper.

@sebastien-j (Contributor)

@cod3licious, could you please update your w2v-negsam branch so that it includes the recent changes made in the develop branch of piskvorky/gensim? It would help me add Cython functionality.

Thank you.

@cod3licious (Author)

@sebastien-j concerning the 'hs' and 'negative' parameters: both are present in the original C code, so I thought I'd add them here as well. This makes it possible to train the model using both methods rather than just one, which might be useful in some cases.
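
For example, with both switches on, something like the following should train with hierarchical softmax and negative sampling at the same time (a sketch; parameter names follow the calls shown later in this thread):

from gensim.models.word2vec import Word2Vec, Text8Corpus

infile = "text8"  # path to a text8-style corpus file
# hierarchical softmax and 5 negative samples per target, simultaneously
model = Word2Vec(Text8Corpus(infile), size=200, window=5, min_count=5,
                 sg=1, hs=1, negative=5)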

I'm at work right now but I'll try to include the other changes asap. Thanks for helping out :)

@sebastien-j (Contributor)

@cod3licious, thanks. I guess having both 'hs' and 'negative' doesn't hurt and could in some cases be useful, although I still find it somewhat weird.

@piskvorky, I sent a pull request to cod3licious:w2v-negsam (for the Cython version). As I mention in my message there, I am unsure about the "best" way to generate random numbers in order to sample from the vocabulary.

@piskvorky (Owner)

@cod3licious @sebastien-j OK, great. Let's finish this pull request :)

Re. RNG: your approach seems fine, generating the random numbers directly. Certainly faster than calling external methods. But did you check the performance (speed, accuracy) with this RNG approach? No problems there?
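
For context, a rough sketch of what "generating the random numbers directly" looks like when drawing noise words, in the style of the C tool's inline RNG (constants as in word2vec.c; the table name and wrapper function are placeholders, not the actual Cython code):

# linear congruential update, then pick a slot from the precomputed noise table;
# next_random is carried along between calls, starting from any nonzero seed
def next_noise_word(table, next_random):
    next_random = (next_random * 25214903917 + 11) & 0xFFFFFFFFFFFFFFFF
    return table[(next_random >> 16) % len(table)], next_random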

@sebastien-j (Contributor)

@piskvorky , the performance seems ok. The speeds I report correspond to a single experiment on my laptop. There was generally some other light work done simultaneously.

You may want to compare these results to those obtained with Mikolov's word2vec tool. (I never got it to work properly on my Windows system.)

On fil9 (~123M words; http://mattmahoney.net/dc/textdata.html):


Hierarchical softmax:

Word2Vec(Text8Corpus(infile), size=200, window=5, min_count=5, workers=1, sg=1):
93,252 wps, 50.5% accuracy (restrict_vocab=30000)

Word2Vec(Text8Corpus(infile), size=200, window=5, min_count=5, workers=1, sg=0):
304,471 wps, 37.9%


Negative sampling:

Word2Vec(Text8Corpus(infile), size=200, window=5, min_count=5, workers=1, hs=0, negative=5):
66,645 wps, 45.4%

Word2Vec(Text8Corpus(infile), size=200, window=5, min_count=5, workers=1, sg=0, hs=0, negative=5):
278,443 wps, 32.8%


To see the impact of the RNG, I also tried using numpy's random.randint inside the fast_sentence functions (using "with gil" to generate the random number, and then releasing the GIL):

Word2Vec(Text8Corpus(infile), size=200, window=5, min_count=5, workers=1, hs=0, negative=5):
40,064 wps, 45.1%

Word2Vec(Text8Corpus(infile), size=200, window=5, min_count=5, workers=1, sg=0, hs=0, negative=5):
184,715 wps, 33.1%

There is a clear speed penalty, but no obvious performance gain.

@piskvorky (Owner)

@sebastien-j great, thanks again.

Is this waiting only for a merge from @cod3licious now? Do we need anything else?

@sebastien-j (Contributor)

@piskvorky, the Python version of CBOW with negative sampling is not in the pull request yet. However, @cod3licious has done it in cod3licious/word2vec/trainmodel.py, so integrating it into gensim should be easy.

@piskvorky (Owner)

Wait, I'm confused -- we are integrating your changes into the w2v-negsam branch of @cod3licious 's fork, right?

How many pieces are there to integrate, and in what branches?

I'd suggest putting everything into a single branch. It will then be easier to reason about the changes, test them and ultimately merge into gensim.

@sebastien-j (Contributor)

We are indeed integrating my changes into the w2v-negsam branch of @cod3licious 's gensim fork.

To recap (as far as I can tell), @cod3licious first implemented all four training methods (python only) in the master branch of cod3licious/word2vec. She then forked gensim, made the w2v-negsam branch and added skip-gram with negative sampling there (no CBOW).

I then made a pull request for CBOW (h. softmax only), which was merged into gensim. Once that was done, @cod3licious updated her w2v-negsam branch to incorporate those changes. At that point, the w2v-negsam branch had Python versions of skip-gram (h. softmax and neg. sampling) and CBOW (h. softmax only). My pull request into w2v-negsam contains the Cython version for all training methods.

Thus, the only missing part in w2v-negsam is the Python version of CBOW with negative sampling. However, adding it should be easy since it is already in cod3licious/word2vec.

@piskvorky (Owner)

OK. I'll ping @cod3licious via email, I think she's not receiving github notifications.

I've added you as gensim collaborator @sebastien-j , so you can merge/commit directly. Please use with care :-)

@sebastien-j (Contributor)

I made a mistake in test_word2vec.py. I sent a pull request to @cod3licious 's w2v-negsam branch in order to fix it (and also to remove trailing whitespace).

@piskvorky (Owner)

How about we open a new pull request, one that you have full control over, @sebastien-j ?

May be easier and quicker. There's always new functionality being added into gensim, and the longer we wait with merging this PR, the more work it will be to resolve conflicts later & get the PR up-to-date.

@sebastien-j (Contributor)

I don't think that is necessary. @cod3licious gave me access to her gensim repository. I'll try to add the remaining functionality soon.

@piskvorky (Owner)

Ah, cool, I didn't know :)

Big thanks to both of you for polishing & pushing this!

@sebastien-j (Contributor)

I added some more Python code to cover all the training algorithms. Most of it is taken directly from @cod3licious 's word2vec repository. There are a few points that should be discussed.

  1. If one wants to use both hierarchical softmax and negative sampling simultaneously, the Python and Cython versions do not behave in the same way. syn0 is updated twice for each context-target pair in the Cython version (once for h.s., once for negative sampling), whereas there is only one update in the Python version. The latter corresponds to what is done in the original Google code.
  2. In the Python version of negative sampling, @cod3licious added a criterion excluding context words from being noise. I don't know whether that helps or not. In any case, I think we should be consistent between the two versions.
  3. Right now, CBOW uses the average of the context word vectors, which corresponds to the description of the model in the original paper, but not to the Google code, where the sum is employed. From limited experiments, the mean seems to work better, but there is no guarantee that it does in all cases. We could maybe add an additional parameter letting the user choose (a small sketch of both variants follows this list).
  4. I added an additional method (models_negative_equal) in test_word2vec.py. We could later use "if" statements in models_equal instead, but I could not do so now without rebasing.
  5. In the Python version of CBOW, numpy's sum is used. I imported it as np_sum, but there may be a more practical way.
  6. The latest updates cause a conflict in word2vec.py, but it shouldn't be too hard to fix. Is there a way to do so without rebasing?
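
A small sketch of the two CBOW input-layer variants mentioned in point 3 (names follow the existing Python code; the use_mean switch is an illustrative assumption, not the merged implementation):

import numpy as np

def cbow_input_layer(syn0, word2_indices, use_mean=True):
    # syn0: input embedding matrix; word2_indices: indices of the context words
    l1 = np.sum(syn0[word2_indices], axis=0)  # sum of context vectors, as in the Google C code
    if use_mean and len(word2_indices) > 0:
        l1 /= len(word2_indices)              # average, as described in the original paper
    return l1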

piskvorky and others added 6 commits April 23, 2014 20:10
sebastien-j and others added 7 commits April 23, 2014 21:16
These modifications are mostly copied from @cod3licious 's word2vec repository.
Index exclusion now matches code in @cod3licious 's word2vec repository.
However, this differs from the cython version and from the original
word2vec implementation (or at least I think it does).

I don't know if we should exclude indices in word2_indices.
@piskvorky (Owner)

Re. differences and variation: I think it's best to aim at replicating the C code as closely as possible (rather than the original paper).

The code paths seem pretty complex. We'll need to try some larger corpora (for example text8 or text9) on the various option combinations, comparing the model/accuracy with the C version, to make sure there are no "major" issues.

Thanks for the work and testing as usual @sebastien-j ! I have rebased this PR on top of the current develop branch, to make your work a little easier. The result is in the w2v-negsam branch of my fork (I can't push into the @cod3licious fork). Let's use that, to avoid more merging problems.

@sebastien-j (Contributor)

I have updated the w2v-negsam branch of cod3licious/gensim.

Some comments on the points I raised in my previous post:

  1. Modifying the Cython version to match the C code would be moderately demanding. As this issue only arises when using both hierarchical softmax and negative sampling, I think we should modify this later in another pull request rather than delaying this one too much.
  2. I removed the additional constraint for noise words.
  3. The CBOW algorithms now use the sum of the context word vectors by default, but users can choose to use the average instead.
  4. No longer applies.

There is another difference between the Cython version and the C code. The logistic function approximation used with negative sampling is not the same. I tried doing it as in the C code, but for some reason I don't understand, it didn't work well. Right now, I employ the approximation used in hierarchical softmax (see commit 9332b43).

I agree that some additional testing is needed to make sure that everything is OK. @piskvorky, do you mind running some tests with the Google tool and posting the results here? Installing the C version on my Windows laptop is not straightforward. I could run the same tests for this pull request.

@piskvorky , do you want me to push the changes I make into your w2v-negsam branch?

@piskvorky (Owner)

OK. I'll run a few combinations on text9 (cbow on/off, some negative sampling; any other parameter suggestions?) using the C tool. I'll post the accuracy+time results here.

I don't think pushing will work; my w2v-negsam branch is already rebased on top of the latest develop.

It's probably easier if you rebase the changes you've made since my rebase (=since yesterday) on top of my w2v-negsam. Or else I could rebase your new code on top of my w2v-negsam again, and push that again into my w2v-negsam.

In any case, let's start using the rebased branch asap, or this will turn into a git nightmare :)

@piskvorky (Owner)

Sorry, scratch that. Now I see you've actually used the rebased branch already 👍

@sebastien-j (Contributor)

We should at least check all 4 basic combinations {skip-gram, CBOW} x {hierarchical softmax, negative sampling}. To check the Python version, using text8 may be useful.

@piskvorky (Owner)

-train text8 -output vectors_10.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -threads 4 -binary 1:
time 1x
accuracy 26.6%

-train text8 -output vectors_00.bin -cbow 1 -size 200 -window 5 -negative 0 -hs 1 -threads 4 -binary 1
time 0.25x
accuracy 15.7%

-train text8 -output vectors_15.bin -cbow 0 -size 200 -window 5 -negative 5 -hs 0 -threads 4 -binary 1
time 0.69x
accuracy 14.8%

-train text8 -output vectors_05.bin -cbow 1 -size 200 -window 5 -negative 5 -hs 0 -threads 4 -binary 1
time 0.2x
accuracy 13.2%

(I never used negative sampling, so I'm not sure whether -negative 5 was a good choice here.)

@sebastien-j (Contributor)

With the same hyper-parameters (and 1 worker):

Skip-gram h. softmax: Cython 26.7%, 94.3k wps; Python 26.8%, 1016 wps

CBOW h. softmax: Cython 14.2%, 315.4k wps; Python 14.2%, 3363 wps

Skip-gram neg. sampling: Cython 13.8%, 70.4k wps; Python 14.1%, 996 wps

CBOW neg. sampling: Cython 13.0%, 293.4k wps; Python 12.6%, 4001 wps


I was able to run the C tool on a virtual machine. On fil9 (with the same hyper-parameters):

Skip-gram h. softmax: Cython 50.5%, C 48.9%

CBOW h. softmax: Cython 33.2%, C 35.5%

Skip-gram neg. sampling: Cython 45.4%, C 44.4%

CBOW neg. sampling: Cython 39.2%, C 42.3%

@piskvorky (Owner)

Sounds good! Are we ready to merge?

@sebastien-j (Contributor)

Yes, I think we are ready to merge (but you might want to review it just to be sure...).


if model.hs:
# work on the entire tree at once, to push as much work into numpy's C routines as possible (performance)
l2a = deepcopy(model.syn1[word.point]) # 2d matrix, codelen x layer1_size
@piskvorky (Owner)

For numpy arrays, simply y = np.array(x) should be faster than y = deepcopy(x)

@piskvorky (Owner)

Actually, for fancy indexing (indexing by an array), NumPy should create a copy automatically. Is this explicit copy even necessary?
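
A quick check of that point (a standalone sketch; the array shapes and index values are arbitrary):

import numpy as np

syn1 = np.zeros((10, 5), dtype=np.float32)
point = np.array([1, 3, 7])   # e.g. indices along a word's Huffman path
l2a = syn1[point]             # fancy indexing already returns a new array
l2a += 1.0
assert syn1.sum() == 0.0      # the original matrix is untouched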

@piskvorky (Owner)

I looked at the code, but only spotted some very minor things (style inconsistencies). I could fix those myself after merging.

As for the code logic, it's hard to check in detail. That's why I suggested comparison with the existing implementation, on a real corpus. As a sort of high-level check.

I see you added some unit tests as well, which is great. Are there any more things that could be tested automatically?

piskvorky added a commit that referenced this pull request May 10, 2014
piskvorky merged commit 1b3a955 into piskvorky:develop on May 10, 2014
@piskvorky (Owner)

Merged. Thanks @sebastien-j and @cod3licious !

Further changes and fixes can happen directly over develop branch.

From your tests it seems the Cython CBOW is consistently worse than the C CBOW... any idea why?
