Hello Mallet Gurus,
I hesitate to bring up this sort of basic question, but we've been getting some unusual results with MALLET's "import-file" command, and we're not sure what we're doing wrong. We're using MALLET 2.0. Here's what I think is a minimal example.
First, the input data: the "AskP.txt" file attached here:
AskP.txt
We then run
mallet import-file --input AskP.txt --output AskP.mallet --keep-sequence
to get the MALLET-format input file.
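(For reference, one can sanity-check what ends up in AskP.mallet with a few lines against MALLET's Java API. This is only a rough sketch; the class name is illustrative, and it relies on the fact that --keep-sequence stores each document's data as a FeatureSequence, so the printed number is the per-document token count.)

    import java.io.File;
    import cc.mallet.types.FeatureSequence;
    import cc.mallet.types.Instance;
    import cc.mallet.types.InstanceList;

    public class CheckImport {
        public static void main(String[] args) {
            // Load the serialized instance list written by "mallet import-file".
            InstanceList instances = InstanceList.load(new File("AskP.mallet"));
            for (Instance instance : instances) {
                // With --keep-sequence, each instance's data is a FeatureSequence of token ids.
                FeatureSequence tokens = (FeatureSequence) instance.getData();
                System.out.println(instance.getName() + "\t" + tokens.size() + " tokens");
            }
        }
    }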
We then run
mallet train-topics --random-seed 1 --num-iterations 1000 --input AskP.mallet --num-topics 2 --num-top-words 100 --output-doc-topics doc-topics.txt --output-topic-keys topic-keys.txt --topic-word-weights-file word-topics.txt
which gives us the following output files:
doc-topics.txt
topic-keys.txt
word-topics.txt
These are pretty strange! In particular, consider the first document. It contains only the word "certainly". The model says it is equally weighted on topic 0 and topic 1. However, this doesn't make sense, because topic 1 is more heavily weighted on the word "certainly" (as can be seen by inspecting the word-topics.txt file).
Even stranger, documents 7, 8, and 9 are all the same (the word "necessarily", alone). However, the output gives document 8 a different topic decomposition (0.64, 0.36) than it does documents 7 and 9 (which both get 0.5, 0.5).
Can anyone help us understand what is going on? (I am leaving my collaborator anonymous here, so he doesn't get embarrassed by what is likely a simple error on my side.)
We've never noticed problems like this before, but perhaps that's because we had been running with documents that were generally much longer?
Update: my collaborator ran MALLET via the gensim wrapper, on a completely different installation. He gets slightly different output, despite using the same random seed, but it shows many of the same issues:
and meanwhile, gensim's own LdaModel seems to give more normal-looking answers:
Can anyone help us make sense of this? We're super-happy to run any tests on our end and to try further experiments.
I've checked in a fix and verified that this doesn't happen anymore. FeatureSequence objects sometimes allocate arrays with more positions than they actually use, so it's necessary to use featureSequence.size() instead of the length of the internal array. This only seems to have happened when printing topic proportions to a file for length-1 documents, which might be why no one has noticed it so far.
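In other words, the pattern behind the fix looks roughly like this. This is only an illustrative sketch, not the actual MALLET source; it assumes FeatureSequence.getFeatures() exposes the (possibly over-allocated) backing array, while size() reports how many positions are actually in use.

    import cc.mallet.types.FeatureSequence;

    class LengthExample {
        // Buggy pattern: trusting the capacity of the backing array.
        // For a one-word document the array may hold extra, unused slots,
        // which then get counted as if they were real tokens.
        static int countTokensBuggy(FeatureSequence tokens) {
            int[] features = tokens.getFeatures();
            return features.length;
        }

        // Fixed pattern: ask the sequence how many positions are in use.
        static int countTokensFixed(FeatureSequence tokens) {
            return tokens.size();
        }
    }

That would explain why only the one-word documents look off: a spare slot or two in the array barely matters for a long document, but for a single-token document it can noticeably skew the printed topic proportions.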