This repository has been archived by the owner on Sep 11, 2020. It is now read-only.

General Questions #13

Closed
jethrokuan opened this issue Nov 2, 2018 · 3 comments
Labels
question Further information is requested

Comments

@jethrokuan commented Nov 2, 2018

Hi, I've been trying to reimplement lda2vec as well. I can't seem to get your repo running (some dependency problems with sense2vec), and have a couple of questions that I hope you can answer:

  1. Do you get significant speedups when using a GPU? I'm getting slowdowns: I think it's because the model is small, so transferring data between the CPU and GPU takes more time than the computation saves on the GPU.

  2. How long does one epoch generally take for your 20newsgroups test case? I'm getting 18M training examples (word pairs + document id), and one epoch takes several hours, which is pretty terrible.

  3. Are there any questions you have about lda2vec that you think are worth discussing?
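For what it's worth, question 1 can be reasoned about with a back-of-the-envelope cost model. This is only a sketch: every number below is a hypothetical assumption, not a measurement from either repo, but it shows why a small model can lose on the GPU at small batch sizes and win at large ones.

```python
def gpu_wins(batch_size, transfer_us_per_example, cpu_us_per_example,
             gpu_us_per_example, fixed_launch_us=50.0):
    """Return True if a GPU step beats a CPU step for this batch size.

    All per-example costs and the fixed kernel-launch overhead are
    made-up illustrative numbers, not measurements.
    """
    cpu_step = batch_size * cpu_us_per_example
    gpu_step = fixed_launch_us + batch_size * (transfer_us_per_example
                                               + gpu_us_per_example)
    return gpu_step < cpu_step

# With a small model, per-example compute is cheap on both devices, so
# transfer plus launch overhead dominates and the CPU wins small batches:
print(gpu_wins(32, transfer_us_per_example=1.0,
               cpu_us_per_example=1.2, gpu_us_per_example=0.1))    # False
# A large batch amortizes the fixed overhead and the GPU pulls ahead:
print(gpu_wins(4096, transfer_us_per_example=1.0,
               cpu_us_per_example=1.2, gpu_us_per_example=0.1))    # True
```

The crossover point depends entirely on the assumed costs, which is consistent with the observation that the same model can go either way on different hardware.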

@nateraw nateraw added the question Further information is requested label Nov 2, 2018
@nateraw (Owner) commented Nov 2, 2018

  1. I have no idea; using a GPU has been suggested, though. Both the original repo and meereeum's used an extremely large batch size (500). Perhaps this would combat the cost of transferring data from the CPU to the GPU?

  2. I do not recall how long one epoch takes, though I know it takes a little while. I have a GTX 1080; I'm not sure what hardware you are using, but that would greatly affect the speed.

  3. As for relevant questions related to lda2vec, the thing I'm most interested in is setting this up to work across sequences. I've started some experiments related to this: I want to see if we could use attention to highlight topics across sequences. This should also combat some of the data preprocessing problems, as well as the transfer of data to the GPU.

@jethrokuan (Author) commented

  1. Yeah, I tried an even larger one (4096) with little improvement. I've also tried profiling by saving the run metadata: it seems that the bulk of the time is still spent calculating the gradients, which should see speedups on a GPU. Perhaps something is borked with my TensorFlow installation.

  2. I see. As mentioned, GPUs aren't giving me speedups, even though the cluster I'm running on has a Titan X.

  3. That's interesting. It'd definitely see improvements from using the GPU. Topics across sequences could also be useful. I come at this from an application point of view: my primary interest in lda2vec is learning topics in scientific documents. Since lda2vec allows incorporating arbitrary side information, I was thinking of learning author embeddings in addition to document, topic, and word embeddings. Perhaps lda2vec could also learn author contributions (the same way document proportions work).
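The author-embedding idea fits lda2vec's additive style: the context vector is just the sum of the word vector and every side-information vector, with per-document (and, hypothetically, per-author) topic proportions coming from a softmax over trainable weights. Everything below is an invented illustration of that composition, not code from either repo:

```python
import math

def softmax(weights):
    """Unnormalized weights -> point on the probability simplex."""
    exps = [math.exp(w) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def mix(proportions, topic_vectors):
    """Convex combination of topic vectors, as lda2vec does for documents."""
    dim = len(topic_vectors[0])
    return [sum(p * t[d] for p, t in zip(proportions, topic_vectors))
            for d in range(dim)]

def context_vector(word_vec, doc_weights, author_weights, topic_vectors):
    """lda2vec-style additive context: word + document mix + author mix.

    The author term is the hypothetical extension discussed above; if it is
    trained the same way as the document term, 'author contributions' fall
    out exactly like document proportions do.
    """
    doc_vec = mix(softmax(doc_weights), topic_vectors)
    author_vec = mix(softmax(author_weights), topic_vectors)
    return [w + d + a for w, d, a in zip(word_vec, doc_vec, author_vec)]

topics = [[1.0, 0.0], [0.0, 1.0]]                # 2 topics in a 2-d space
ctx = context_vector(word_vec=[0.5, 0.5],
                     doc_weights=[2.0, 0.0],     # document leans to topic 0
                     author_weights=[0.0, 2.0],  # author leans to topic 1
                     topic_vectors=topics)
print(ctx)
```

The context vector then feeds the skip-gram objective unchanged, which is what makes side information cheap to bolt on.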

Beyond that, I'm also looking at productionizing the pipeline and making training faster and more stable. This includes the preprocessing pipeline, which in your case currently depends on trained language models. SentencePiece is a good unsupervised tokenizer.

--
Here are some of my observations from training lda2vec so far:

  1. The topic embeddings do indeed learn to be identical, and this is driven by the nce_loss: setting the lda_loss weight to 0 has the same effect. Perhaps training with a pure word2vec loss first, to learn good word embeddings, would alleviate the issue.

  2. I'm confused as to why the topics would start to be learned only after training for multiple epochs. My loss reaches a minimum a small number of steps in; intuitively, this means the model has already found a local/global minimum. Why would the model then learn the topics, given that the gradients around that area are probably close to 0?
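A toy illustration of the first observation (a hypothetical setup, not the repo's code): if two topic vectors start from the same initialization and the loss treats them symmetrically, they receive identical gradients at every step and can never separate. Only an asymmetric term in the loss, or asymmetric initialization, breaks the tie.

```python
def step(topics, grad_fn, lr=0.1):
    """One gradient-descent step on a list of small topic vectors."""
    return [[t - lr * g for t, g in zip(vec, grad_fn(vec))]
            for vec in topics]

def symmetric_grad(vec):
    """A loss that sees each topic only through the same per-topic function
    gives identical topics identical gradients; here, every topic is pulled
    toward one common target vector (stand-in for a dominant word vector)."""
    target = [1.0, -1.0]
    return [v - t for v, t in zip(vec, target)]

topics = [[0.0, 0.0], [0.0, 0.0]]   # identical initialization
for _ in range(100):
    topics = step(topics, symmetric_grad)
print(topics[0] == topics[1])       # True: the two topics never separate
```

This is consistent with the fix suggested above: pretraining word embeddings first (or any source of asymmetry between topics) changes what each topic vector sees, so their gradients stop coinciding.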

@nateraw (Owner) commented Nov 9, 2018

The topic distributions will not stay the same. They do stay the same for a while, but as the model trains, the Dirichlet loss "...encourages document proportions vectors to become more concentrated (e.g. sparser) over time."

As noted in the original author's repo, this is an experimental model. Because of this, it can be very tricky to get it to work right out of the box. But, if you give it time, the model will eventually start to separate out these topics. Some may still overlap a bit, but there will definitely be some separation.
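The sparsifying effect of the Dirichlet term can be seen in isolation. This is a minimal sketch (my own construction, not the repo's implementation) of gradient descent on an lda2vec-style prior loss, -lam * (alpha - 1) * sum(log p_k) with alpha < 1: even with the word-prediction loss sitting at a plateau and contributing nothing, this term alone steadily concentrates a document's proportions.

```python
import math

def softmax(w):
    exps = [math.exp(x) for x in w]
    s = sum(exps)
    return [e / s for e in exps]

def dirichlet_grad(w, alpha=0.7, lam=1.0):
    """Gradient of the prior loss -lam*(alpha-1)*sum(log p_k), p = softmax(w).

    Uses d/dw_j sum_k log p_k = 1 - K*p_j. With alpha < 1 the sign flips so
    the largest proportion keeps growing: a rich-get-richer dynamic.
    """
    p = softmax(w)
    k = len(w)
    return [-lam * (alpha - 1) * (1 - k * pj) for pj in p]

w = [0.1, 0.0, 0.0, 0.0]          # nearly uniform document weights
before = max(softmax(w))
for _ in range(500):
    g = dirichlet_grad(w)
    w = [wj - 0.1 * gj for wj, gj in zip(w, g)]
after = max(softmax(w))
print(before, after)              # the largest proportion grows toward 1
```

Note the dynamic is slow at first (the gradient is tiny near the uniform point), which matches the observation that proportions "stay the same for a while" before separating.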

Oh, I had a thought on your GPU vs. CPU problem too:

  1. How far into training are you checking the model's speed?
  2. What is your "switch_loss" epoch variable set to when doing these speed checks?

Remember from the paper that the switch loss epoch is the number of epochs to train word2vec before "turning on" the lda loss. If you are checking within the epochs before the switch loss epoch, you are probably just noticing that word2vec works better on CPU than on GPU, which I believe is normal. Check out this thread for further info on that: tensorflow/tensorflow#13048
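The schedule described above can be sketched as follows. `switch_loss_epoch` and the two loss terms come from this discussion; the function itself and its argument names are invented for illustration:

```python
def training_loss(epoch, word2vec_loss, lda_loss,
                  switch_loss_epoch=5, lda_weight=1.0):
    """Before the switch epoch, train on the pure word2vec (skip-gram)
    objective; afterwards, 'turn on' the Dirichlet (lda) term as well."""
    if epoch < switch_loss_epoch:
        return word2vec_loss
    return word2vec_loss + lda_weight * lda_loss

# Epochs before the switch only see the skip-gram term:
print(training_loss(epoch=2, word2vec_loss=3.0, lda_loss=10.0))   # 3.0
# Afterwards the lda term is added in:
print(training_loss(epoch=7, word2vec_loss=3.0, lda_loss=10.0))   # 13.0
```

So any CPU-vs-GPU timing taken before epoch `switch_loss_epoch` is measuring pure word2vec, not the full lda2vec objective.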

@nateraw nateraw closed this as completed Feb 28, 2019