
W2v negsam #162

Merged
merged 25 commits into piskvorky:develop on May 10, 2014

Conversation

cod3licious

Added negative sampling to the Python train_sentence function for the skip-gram model, along with all additionally needed functions. In build_vocab, if negative sampling is used, a big table for the noise word distribution is created and saved in model.table. This takes quite a bit of RAM and should probably be deleted before saving the model (it's only needed for training anyway).
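
For reference, a rough sketch of how such a noise table can be built (the 0.75 smoothing exponent and the ~1e8 table size mirror the C tool; the function name and this vectorized construction are illustrative assumptions, not the actual gensim code):

import numpy as np

def make_noise_table(vocab_counts, table_size=int(1e8), power=0.75):
    # Map each table slot to a word index, proportionally to count**power,
    # so that frequent words fill more slots (sketch, not the gensim code).
    counts = np.asarray(vocab_counts, dtype=np.float64) ** power
    cumulative = np.cumsum(counts / counts.sum())
    slots = (np.arange(table_size) + 0.5) / table_size   # midpoint of each slot
    word_ids = np.searchsorted(cumulative, slots)
    return np.minimum(word_ids, len(counts) - 1).astype(np.uint32)

# drawing one noise word is then a single random lookup:
# noise_word = table[np.random.randint(len(table))]

At that size the uint32 table alone is several hundred MB, which is the RAM concern mentioned above.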

@piskvorky (Owner)

Great, thanks @cod3licious :)

I'll review and merge asap. I'm kinda overwhelmed at the moment with life stuff, sorry. Reviewing the python3 port is also still in my "gensim queue".

@mfcabrera (Contributor)

Hey, I would love to see this in Gensim soon. I am currently using Word2Vec for my master's thesis, and negative sampling definitely works better for me, but I want to stop using the original C version. Is there a way I could help (a particular test / verification)?

@piskvorky (Owner)

Sure! There are many things you can help with, @mfcabrera:

  1. review the memory requirements that Franziska mentions above
  2. run different combinations of HS/negative sampling, and check and report accuracies or any problems
  3. integrate with gensim, i.e. pull request #177 ("add negative sampling method for skip-gram in the slower version")
  4. add an optimized version in Cython

@sebastien-j (Contributor)

Hi all,

A few months ago, using the Gensim implementation of word2vec as a starting point, I added negative sampling (https://github.com/sebastien-j/Word2vec). The modified files are not ready to be merged into gensim yet, but they might be useful.

Here are some potential issues with my implementation. First, the different .py and .pyx files should be merged instead of having one for each kind of model. Second, I did not use the look-up table to compute the sigmoid function, but that can be changed quickly. Third, I was generating the vocabulary Huffman tree and using it during training to see whether a word was in the vocabulary or not. Moreover, the random number generation seems wasteful, but I didn't profile the code to see whether this is a bottleneck or not. There are also additional hyper-parameters that may not be useful in gensim. Finally, the code for "optimization 2" is not written.
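
For reference, a minimal sketch of the kind of precomputed sigmoid lookup table the C tool uses (the constants below mirror word2vec.c and should be treated as assumptions):

import numpy as np

# Precompute sigmoid values on (-MAX_EXP, MAX_EXP); look up instead of calling exp()
EXP_TABLE_SIZE, MAX_EXP = 1000, 6
_x = (np.arange(EXP_TABLE_SIZE, dtype=np.float64) / EXP_TABLE_SIZE * 2 - 1) * MAX_EXP
EXP_TABLE = 1.0 / (1.0 + np.exp(-_x))

def fast_sigmoid(f):
    # saturate outside the table range, otherwise a single array lookup
    if f <= -MAX_EXP:
        return 0.0
    if f >= MAX_EXP:
        return 1.0
    return EXP_TABLE[int((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2.0))]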

If you have any questions, feel free to ask. I'll try to help.

Sébastien

@piskvorky (Owner)

I wouldn't worry about the sigmoid tables. That optimization doesn't bring much.

But I worry about merging this without a cython version. People will complain it's too slow :)

@sebastien-j (Contributor)

Putting the sigmoid table back is a really easy task anyway. I can try to make a cleaner Cython version. What would be the best way to submit it? Just make a new pull request?

@cod3licious, why are there both 'hs' and 'negative' parameters? I think one of these should be sufficient.

@piskvorky (Owner)

Best to make a pull request to @cod3licious 's w2v-negsam branch, so all the changes are in this single pull request, which can then be merged into gensim proper.

@sebastien-j (Contributor)

@cod3licious, could you please update your w2v-negsam branch so that it includes the recent changes made in the develop branch of piskvorky/gensim? It would help me add Cython functionality.

Thank you.

@cod3licious (Author)

@sebastien-j concerning the 'hs' and 'negative' parameters: both are present in the original C code, so I thought I'd add them here as well. This makes it possible to train the model using both methods rather than just one, which might be useful in some cases.
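
For example, with both switches on, something like the following should train with hierarchical softmax and negative sampling at the same time (a sketch; parameter names follow the calls shown later in this thread):

from gensim.models.word2vec import Word2Vec, Text8Corpus

infile = "text8"  # path to a text8-style corpus file
# hierarchical softmax and 5 negative samples per target, simultaneously
model = Word2Vec(Text8Corpus(infile), size=200, window=5, min_count=5,
                 sg=1, hs=1, negative=5)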

I'm at work right now but I'll try to include the other changes asap. Thanks for helping out :)

@sebastien-j (Contributor)

@cod3licious, thanks. I guess having both 'hs' and 'negative' doesn't hurt and could in some cases be useful, although I still find it somewhat weird.

@piskvorky, I sent a pull request to cod3licious:w2v-negsam (for the Cython version). As I mention in my message there, I am unsure about the "best" way to generate random numbers in order to sample from the vocabulary.

@piskvorky (Owner)

@cod3licious @sebastien-j OK, great. Let's finish this pull request :)

Re. RNG: your approach seems fine, generating the random numbers directly. Certainly faster than calling external methods. But did you check the performance (speed, accuracy) with this RNG approach? No problems there?
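
For context, a rough sketch of what "generating the random numbers directly" looks like when drawing noise words, in the style of the C tool's inline RNG (constants as in word2vec.c; the table name and wrapper function are placeholders, not the actual Cython code):

# linear congruential update, then pick a slot from the precomputed noise table;
# next_random is carried along between calls, starting from any nonzero seed
def next_noise_word(table, next_random):
    next_random = (next_random * 25214903917 + 11) & 0xFFFFFFFFFFFFFFFF
    return table[(next_random >> 16) % len(table)], next_random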

@sebastien-j (Contributor)

@piskvorky , the performance seems ok. The speeds I report correspond to a single experiment on my laptop. There was generally some other light work done simultaneously.

You may want to compare these results to those obtained with Mikolov's word2vec tool. (I never got it to work properly on my Windows system.)

On fil9 (~123M words; http://mattmahoney.net/dc/textdata.html):


Hierarchical softmax:

Word2Vec(Text8Corpus(infile), size=200, window=5, min_count=5, workers=1, sg=1):
93,252 wps, 50.5% accuracy (restrict_vocab=30000)

Word2Vec(Text8Corpus(infile), size=200, window=5, min_count=5, workers=1, sg=0):
304,471 wps, 37.9%


Negative sampling:

Word2Vec(Text8Corpus(infile), size=200, window=5, min_count=5, workers=1, hs=0, negative=5):
66,645 wps, 45.4%

Word2Vec(Text8Corpus(infile), size=200, window=5, min_count=5, workers=1, sg=0, hs=0, negative=5):
278,443 wps, 32.8%


To see the impact of the RNG, I also tried using numpy's random.randint inside the fast_sentence functions (using "with gil" to generate the random number, and then releasing the GIL):

Word2Vec(Text8Corpus(infile), size=200, window=5, min_count=5, workers=1, hs=0, negative=5):
40,064 wps, 45.1%

Word2Vec(Text8Corpus(infile), size=200, window=5, min_count=5, workers=1, sg=0, hs=0, negative=5):
184,715 wps, 33.1%

There is a clear speed penalty, but no obvious performance gain.

@piskvorky (Owner)

@sebastien-j great, thanks again.

Is this waiting only for a merge from @cod3licious now? Do we need anything else?

@sebastien-j (Contributor)

@piskvorky, the Python version of CBOW with negative sampling is not in the pull request yet. However, @cod3licious has done it in cod3licious/word2vec/trainmodel.py, so integrating it into gensim should be easy.

@piskvorky (Owner)

Wait, I'm confused -- we are integrating your changes into the w2v-negsam branch of @cod3licious 's fork, right?

How many pieces are there to integrate, and in what branches?

I'd suggest putting everything into a single branch. It will then be easier to reason about the changes, test them and ultimately merge into gensim.

@sebastien-j (Contributor)

We are indeed integrating my changes into the w2v-negsam branch of @cod3licious 's gensim fork.

To recap (as far as I can tell), @cod3licious first implemented all four training methods (python only) in the master branch of cod3licious/word2vec. She then forked gensim, made the w2v-negsam branch and added skip-gram with negative sampling there (no CBOW).

I then made a pull request for CBOW (h. softmax only), which was merged into gensim. Once that was done, @cod3licious updated her w2v-negsam branch to incorporate those changes. At that point, the w2v-negsam branch had Python versions of skip-gram (h. softmax and neg. sampling) and CBOW (h. softmax only). My pull request into w2v-negsam contains the Cython version for all training methods.

Thus, the only missing part in w2v-negsam is the Python version of CBOW with negative sampling. However, adding it should be easy since it is already in cod3licious/word2vec.

@piskvorky (Owner)

OK. I'll ping @cod3licious via email, I think she's not receiving github notifications.

I've added you as gensim collaborator @sebastien-j , so you can merge/commit directly. Please use with care :-)

@sebastien-j (Contributor)

I made a mistake in test_word2vec.py. I sent a pull request to @cod3licious 's w2v-negsam branch in order to fix it (and also to remove trailing whitespace).

@piskvorky (Owner)

How about we open a new pull request, one that you have full control over, @sebastien-j ?

May be easier and quicker. There's always new functionality being added into gensim, and the longer we wait with merging this PR, the more work it will be to resolve conflicts later & get the PR up-to-date.

@sebastien-j (Contributor)

I don't think that is necessary. @cod3licious gave me access to her gensim repository. I'll try to add the remaining functionality soon.

@piskvorky (Owner)

Ah, cool, I didn't know :)

Big thanks to both of you for polishing & pushing this!

@sebastien-j (Contributor)

I added some more Python code to cover all the training algorithms. Most of it is taken directly from @cod3licious 's word2vec repository. There are a few points that should be discussed.

  1. If one wants to use both hierarchical softmax and negative sampling simultaneously, the Python and Cython versions do not behave in the same way. syn0 is updated twice for each context-target pair in the Cython version (once for h.s., once for negative sampling), whereas there is only one update in the Python version. The latter corresponds to what is done in the original Google code.
  2. In the Python version of negative sampling, @cod3licious added a criterion excluding context words from being noise. I don't know whether that helps or not. In any case, I think we should be consistent between the two versions.
  3. Right now, CBOW uses the average of the context word vectors, which corresponds to the description of the model in the original paper, but not to the Google code, where the sum is employed. From limited experiments, the mean seems to work better, but there is no guarantee that it does in all cases. We could maybe add an additional parameter letting the user choose (a small sketch of both variants follows this list).
  4. I added an additional method (models_negative_equal) in test_word2vec.py. We could later use "if" statements in models_equal instead, but I could not do so now without rebasing.
  5. In the Python version of CBOW, numpy's sum is used. I imported it as np_sum, but there may be a more practical way.
  6. The latest updates cause a conflict in word2vec.py, but it shouldn't be too hard to fix. Is there a way to do so without rebasing?
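
A small sketch of the two CBOW input-layer variants mentioned in point 3 (names follow the existing Python code; the use_mean switch is an illustrative assumption, not the merged implementation):

import numpy as np

def cbow_input_layer(syn0, word2_indices, use_mean=True):
    # syn0: input embedding matrix; word2_indices: indices of the context words
    l1 = np.sum(syn0[word2_indices], axis=0)  # sum of context vectors, as in the Google C code
    if use_mean and len(word2_indices) > 0:
        l1 /= len(word2_indices)              # average, as described in the original paper
    return l1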

piskvorky and others added 6 commits April 23, 2014 20:10
sebastien-j and others added 7 commits April 23, 2014 21:16
These modifications are mostly copied from @cod3licious 's word2vec repository.
Index exclusion now matches code in @cod3licious 's word2vec repository.
However, this differs from the cython version and from the original
word2vec implementation (or at least I think it does).

I don't know if we should exclude indices in word2_indices.
@piskvorky (Owner)

Re. differences and variation: I think it's best to aim at replicating the C code as closely as possible (rather than the original paper).

The code paths seem pretty complex. We'll need to try some larger corpora (for example text8 or text9) on the various option combinations, comparing the model/accuracy with the C version, to make sure there are no "major" issues.

Thanks for the work and testing as usual @sebastien-j ! I have rebased this PR on top of the current develop branch, to make your work a little easier. The result is in the w2v-negsam branch of my fork (I can't push into the @cod3licious fork). Let's use that, to avoid more merging problems.

@sebastien-j (Contributor)

I have updated the w2v-negsam branch of cod3licious/gensim.

Some comments on the points I raised in my previous post:

  1. Modifying the Cython version to match the C code would be moderately demanding. As this issue only arises when using both hierarchical softmax and negative sampling, I think we should modify this later in another pull request rather than delaying this one too much.
  2. I removed the additional constraint for noise words.
  3. The CBOW algorithms now use the sum of the context word vectors by default, but users can choose to use the average instead.
  4. No longer applies.

There is another difference between the Cython version and the C code. The logistic function approximation used with negative sampling is not the same. I tried doing it as in the C code, but for some reason I don't understand, it didn't work well. Right now, I employ the approximation used in hierarchical softmax (see commit 9332b43).

I agree that some additional testing is needed to make sure that everything is OK. @piskvorky, do you mind running some tests with the Google tool and posting the results here? Installing the C version on my Windows laptop is not straightforward. I could run the same tests for this pull request.

@piskvorky , do you want me to push the changes I make into your w2v-negsam branch?

@piskvorky (Owner)

OK. I'll run a few combinations on text9 (cbow on/off, some negative sampling; any other parameter suggestions?) using the C tool. I'll post the accuracy+time results here.

I don't think pushing will work; my w2v-negsam branch is already rebased on top of the latest develop.

It's probably easier if you rebase the changes you've made since my rebase (=since yesterday) on top of my w2v-negsam. Or else I could rebase your new code on top of my w2v-negsam again, and push that again into my w2v-negsam.

In any case, let's start using the rebased branch asap, or this will turn into a git nightmare :)

@piskvorky (Owner)

Sorry, scratch that. Now I see you've actually used the rebased branch already 👍

@sebastien-j (Contributor)

We should at least check all 4 basic combinations {skip-gram, CBOW} x {hierarchical softmax, negative sampling}. To check the Python version, using text8 may be useful.

@piskvorky (Owner)

-train text8 -output vectors_10.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -threads 4 -binary 1:
time 1x
accuracy 26.6%

-train text8 -output vectors_00.bin -cbow 1 -size 200 -window 5 -negative 0 -hs 1 -threads 4 -binary 1
time 0.25x
accuracy 15.7%

-train text8 -output vectors_15.bin -cbow 0 -size 200 -window 5 -negative 5 -hs 0 -threads 4 -binary 1
time 0.69x
accuracy 14.8%

-train text8 -output vectors_05.bin -cbow 1 -size 200 -window 5 -negative 5 -hs 0 -threads 4 -binary 1
time 0.2x
accuracy 13.2%

(I never used negative sampling, so I'm not sure whether -negative 5 was a good choice here.)

@sebastien-j (Contributor)

With the same hyper-parameters (and 1 worker):

Skip-gram h. softmax: Cython 26.7%, 94.3k wps; Python 26.8%, 1016 wps

CBOW h. softmax: Cython 14.2%, 315.4k wps; Python 14.2%, 3363 wps

Skip-gram neg. sampling: Cython 13.8%, 70.4k wps; Python 14.1%, 996 wps

CBOW neg. sampling: Cython 13.0%, 293.4k wps; Python 12.6%, 4001 wps


I was able to run the C tool on a virtual machine. On fil9 (with the same hyper-parameters):

Skip-gram h. softmax: Cython 50.5%, C 48.9%

CBOW h. softmax: Cython 33.2%, C 35.5%

Skip-gram neg. sampling: Cython 45.4%, C 44.4%

CBOW neg. sampling: Cython 39.2%, C 42.3%

@piskvorky (Owner)

Sounds good! Are we ready to merge?

@sebastien-j (Contributor)

Yes, I think we are ready to merge (but you might want to review it just to be sure...).


if model.hs:
# work on the entire tree at once, to push as much work into numpy's C routines as possible (performance)
l2a = deepcopy(model.syn1[word.point]) # 2d matrix, codelen x layer1_size
@piskvorky (Owner)

For numpy arrays, simply y = np.array(x) should be faster than y = deepcopy(x)

@piskvorky (Owner)

Actually, for fancy indexing (indexing by an array), NumPy should create a copy automatically. Is this explicit copy even necessary?
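
A quick check of that point (a standalone sketch; the array shapes and index values are arbitrary):

import numpy as np

syn1 = np.zeros((10, 5), dtype=np.float32)
point = np.array([1, 3, 7])   # e.g. indices along a word's Huffman path
l2a = syn1[point]             # fancy indexing already returns a new array
l2a += 1.0
assert syn1.sum() == 0.0      # the original matrix is untouched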

@piskvorky (Owner)

I looked at the code, but only spotted some very minor things (style inconsistencies). I could fix those myself after merging.

As for the code logic, it's hard to check in detail. That's why I suggested comparison with the existing implementation, on a real corpus. As a sort of high-level check.

I see you added some unit tests as well, which is great. Are there any more things that could be tested automatically?

piskvorky added a commit that referenced this pull request May 10, 2014
piskvorky merged commit 1b3a955 into piskvorky:develop on May 10, 2014
@piskvorky (Owner)

Merged. Thanks @sebastien-j and @cod3licious !

Further changes and fixes can happen directly over develop branch.

From your tests it seems the Cython CBOW is consistently worse than the C CBOW... any idea why?
