This repository has been archived by the owner on Sep 11, 2020. It is now read-only.

General Questions #13

Closed
jethrokuan opened this issue Nov 2, 2018 · 3 comments
Labels
question Further information is requested

Comments

@jethrokuan commented Nov 2, 2018

Hi, I've been trying to reimplement lda2vec as well. I can't seem to get your repo running (some dependency problems with sense2vec), and have a couple of questions that I hope you can answer:

  1. Do you get significant speedups when using a GPU? I'm getting slowdowns: I think it's because the model is small, so transferring data between the CPU and GPU takes more time than the computation saves on the GPU.

  2. How long does one epoch generally take for your 20newsgroups test case? I'm getting 18M training examples (word pairs + document id), and one epoch takes several hours, which is pretty terrible.

  3. Are there any questions you have about lda2vec that you think are worth discussing?
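For what it's worth, question 1 can be reasoned about with a back-of-the-envelope cost model. This is only a sketch: every number below is a hypothetical assumption, not a measurement from either repo, but it shows why a small model can lose on the GPU at small batch sizes and win at large ones.

```python
def gpu_wins(batch_size, transfer_us_per_example, cpu_us_per_example,
             gpu_us_per_example, fixed_launch_us=50.0):
    """Return True if a GPU step beats a CPU step for this batch size.

    All per-example costs and the fixed kernel-launch overhead are
    made-up illustrative numbers, not measurements.
    """
    cpu_step = batch_size * cpu_us_per_example
    gpu_step = fixed_launch_us + batch_size * (transfer_us_per_example
                                               + gpu_us_per_example)
    return gpu_step < cpu_step

# With a small model, per-example compute is cheap on both devices, so
# transfer plus launch overhead dominates and the CPU wins small batches:
print(gpu_wins(32, transfer_us_per_example=1.0,
               cpu_us_per_example=1.2, gpu_us_per_example=0.1))    # False
# A large batch amortizes the fixed overhead and the GPU pulls ahead:
print(gpu_wins(4096, transfer_us_per_example=1.0,
               cpu_us_per_example=1.2, gpu_us_per_example=0.1))    # True
```

The crossover point depends entirely on the assumed costs, which is consistent with the observation that the same model can go either way on different hardware.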

@nateraw nateraw added the question Further information is requested label Nov 2, 2018
@nateraw (Owner) commented Nov 2, 2018

  1. I have no idea; using a GPU has been suggested, though. Both the original repo and meereeum's used an extremely large batch size (500). Perhaps this would combat the cost of transferring data from the CPU to the GPU?

  2. I do not recall how long one epoch takes, though I know it takes a little while. I have a GTX 1080; I'm not sure what hardware you are using, but that would greatly affect the speed.

  3. As for relevant questions related to lda2vec, the thing I'm most interested in is setting this up to work across sequences. I've started some experiments related to this: I want to see if we could use attention to highlight topics across sequences. This should also combat some of the data preprocessing problems, as well as the transfer of data to the GPU.

@jethrokuan (Author) commented

  1. Yeah, I tried an even larger one (4096) with little improvement. I've also tried profiling by saving the run metadata: it seems that the bulk of the time is still spent calculating the gradients, which should see speedups on a GPU. Perhaps something is borked with my TensorFlow installation.

  2. I see. As mentioned, GPUs aren't giving me speedups, even though the cluster I'm running on has a Titan X.

  3. That's interesting. It'd definitely see improvements from using the GPU. Topics across sequences could also be useful. I come at this from an application point of view: my primary interest in lda2vec is learning topics in scientific documents. Since lda2vec allows incorporating arbitrary side information, I was thinking of learning author embeddings in addition to document, topic, and word embeddings. Perhaps lda2vec could also learn author contributions (the same way document proportions work).
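The author-embedding idea fits lda2vec's additive style: the context vector is just the sum of the word vector and every side-information vector, with per-document (and, hypothetically, per-author) topic proportions coming from a softmax over trainable weights. Everything below is an invented illustration of that composition, not code from either repo:

```python
import math

def softmax(weights):
    """Unnormalized weights -> point on the probability simplex."""
    exps = [math.exp(w) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def mix(proportions, topic_vectors):
    """Convex combination of topic vectors, as lda2vec does for documents."""
    dim = len(topic_vectors[0])
    return [sum(p * t[d] for p, t in zip(proportions, topic_vectors))
            for d in range(dim)]

def context_vector(word_vec, doc_weights, author_weights, topic_vectors):
    """lda2vec-style additive context: word + document mix + author mix.

    The author term is the hypothetical extension discussed above; if it is
    trained the same way as the document term, 'author contributions' fall
    out exactly like document proportions do.
    """
    doc_vec = mix(softmax(doc_weights), topic_vectors)
    author_vec = mix(softmax(author_weights), topic_vectors)
    return [w + d + a for w, d, a in zip(word_vec, doc_vec, author_vec)]

topics = [[1.0, 0.0], [0.0, 1.0]]                # 2 topics in a 2-d space
ctx = context_vector(word_vec=[0.5, 0.5],
                     doc_weights=[2.0, 0.0],     # document leans to topic 0
                     author_weights=[0.0, 2.0],  # author leans to topic 1
                     topic_vectors=topics)
print(ctx)
```

The context vector then feeds the skip-gram objective unchanged, which is what makes side information cheap to bolt on.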

Beyond that, I'm also looking at productionizing the pipeline and making training faster and more stable. This includes the preprocessing pipeline, which in your case currently depends on trained language models. SentencePiece is a good unsupervised tokenizer.

--
Here are some of my observations from training lda2vec so far:

  1. The topic embeddings do indeed learn to be identical, and this is driven by the nce_loss: setting the lda_loss weight to 0 has the same effect. Perhaps training with a pure word2vec loss first, to learn good word embeddings, would alleviate the issue.

  2. I'm confused as to why the topics would start to be learned only after training for multiple epochs. My loss reaches a minimum a small number of steps in; intuitively, this means the model has already found a local/global minimum. Why would the model then learn the topics, given that the gradients around that area are probably close to 0?
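A toy illustration of the first observation (a hypothetical setup, not the repo's code): if two topic vectors start from the same initialization and the loss treats them symmetrically, they receive identical gradients at every step and can never separate. Only an asymmetric term in the loss, or asymmetric initialization, breaks the tie.

```python
def step(topics, grad_fn, lr=0.1):
    """One gradient-descent step on a list of small topic vectors."""
    return [[t - lr * g for t, g in zip(vec, grad_fn(vec))]
            for vec in topics]

def symmetric_grad(vec):
    """A loss that sees each topic only through the same per-topic function
    gives identical topics identical gradients; here, every topic is pulled
    toward one common target vector (stand-in for a dominant word vector)."""
    target = [1.0, -1.0]
    return [v - t for v, t in zip(vec, target)]

topics = [[0.0, 0.0], [0.0, 0.0]]   # identical initialization
for _ in range(100):
    topics = step(topics, symmetric_grad)
print(topics[0] == topics[1])       # True: the two topics never separate
```

This is consistent with the fix suggested above: pretraining word embeddings first (or any source of asymmetry between topics) changes what each topic vector sees, so their gradients stop coinciding.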

@nateraw (Owner) commented Nov 9, 2018

The topic distributions will not stay the same. They do stay the same for a while, but as the model trains, the Dirichlet loss "...encourages document proportions vectors to become more concentrated (e.g. sparser) over time."

As noted in the original author's repo, this is an experimental model. Because of this, it can be very tricky to get it to work right out of the box. But, if you give it time, the model will eventually start to separate out these topics. Some may still overlap a bit, but there will definitely be some separation.
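The sparsifying effect of the Dirichlet term can be seen in isolation. This is a minimal sketch (my own construction, not the repo's implementation) of gradient descent on an lda2vec-style prior loss, -lam * (alpha - 1) * sum(log p_k) with alpha < 1: even with the word-prediction loss sitting at a plateau and contributing nothing, this term alone steadily concentrates a document's proportions.

```python
import math

def softmax(w):
    exps = [math.exp(x) for x in w]
    s = sum(exps)
    return [e / s for e in exps]

def dirichlet_grad(w, alpha=0.7, lam=1.0):
    """Gradient of the prior loss -lam*(alpha-1)*sum(log p_k), p = softmax(w).

    Uses d/dw_j sum_k log p_k = 1 - K*p_j. With alpha < 1 the sign flips so
    the largest proportion keeps growing: a rich-get-richer dynamic.
    """
    p = softmax(w)
    k = len(w)
    return [-lam * (alpha - 1) * (1 - k * pj) for pj in p]

w = [0.1, 0.0, 0.0, 0.0]          # nearly uniform document weights
before = max(softmax(w))
for _ in range(500):
    g = dirichlet_grad(w)
    w = [wj - 0.1 * gj for wj, gj in zip(w, g)]
after = max(softmax(w))
print(before, after)              # the largest proportion grows toward 1
```

Note the dynamic is slow at first (the gradient is tiny near the uniform point), which matches the observation that proportions "stay the same for a while" before separating.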

Oh, I had a thought on your GPU vs. CPU problem too:

  1. How far into training are you checking the model's speed?
  2. What is your "switch_loss" epoch variable set to when doing these speed checks?

Remember from the paper that the switch loss epoch is the number of epochs to train word2vec before "turning on" the lda loss. If you are checking within the epochs before the switch loss epoch, you are probably just noticing that word2vec works better on CPU than on GPU, which I believe is normal. Check out this thread for further info on that: tensorflow/tensorflow#13048
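The schedule described above can be sketched as follows. `switch_loss_epoch` and the two loss terms come from this discussion; the function itself and its argument names are invented for illustration:

```python
def training_loss(epoch, word2vec_loss, lda_loss,
                  switch_loss_epoch=5, lda_weight=1.0):
    """Before the switch epoch, train on the pure word2vec (skip-gram)
    objective; afterwards, 'turn on' the Dirichlet (lda) term as well."""
    if epoch < switch_loss_epoch:
        return word2vec_loss
    return word2vec_loss + lda_weight * lda_loss

# Epochs before the switch only see the skip-gram term:
print(training_loss(epoch=2, word2vec_loss=3.0, lda_loss=10.0))   # 3.0
# Afterwards the lda term is added in:
print(training_loss(epoch=7, word2vec_loss=3.0, lda_loss=10.0))   # 13.0
```

So any CPU-vs-GPU timing taken before epoch `switch_loss_epoch` is measuring pure word2vec, not the full lda2vec objective.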

@nateraw nateraw closed this as completed Feb 28, 2019