puzzling results with simple test code, replicates on gensim -- possibly because of very short documents? #199

Closed
sdedeo opened this issue May 7, 2021 · 2 comments



sdedeo commented May 7, 2021

Hello Mallet Gurus,

I hesitate to bring up this sort of basic question, but we've been getting some unusual results with mallet, using the "import-file" method, and we're not sure what we're doing wrong. We're using mallet 2.0. Here's what I think is a minimal example.

First, the input data. That's this "AskP.txt" file here:

AskP.txt

We then run

mallet import-file --input AskP.txt --output AskP.mallet --keep-sequence

to produce the MALLET-format input file.
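
For context, import-file by default treats each line of the input as one document, roughly of the form "name label text..."; so hypothetical lines like the following (the names and labels here are made up) would each become a one-word document:

doc1 X certainly
doc7 X necessarily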

We then run

mallet train-topics --random-seed 1 --num-iterations 1000 --input AskP.mallet --num-topics 2 --num-top-words 100 --output-doc-topics doc-topics.txt --output-topic-keys topic-keys.txt --topic-word-weights-file word-topics.txt

which gives us the following output files:

doc-topics.txt
topic-keys.txt
word-topics.txt

These are pretty strange! In particular, consider the first document. It has only the word "certainly" in it. The model says it is equally weighted on topic 0 and topic 1. However, this doesn't make sense, because topic 1 is more heavily weighted on the word certainly (as can be seen by inspecting the word-topics.txt file).

Even stranger, documents 7, 8, and 9 are all the same (the word "necessarily", alone). However, the output gives document 8 a different topic decomposition (0.64, 0.36) than documents 7 and 9 (which both get 0.5, 0.5).
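
For reference, the relevant rows of doc-topics.txt look roughly like this (the document names are placeholders, and the exact column layout depends on the MALLET version, but each row gives the document index, its source, and one proportion per topic):

7   doc7   0.50   0.50
8   doc8   0.64   0.36
9   doc9   0.50   0.50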

Can anyone help us understand what is going on? (I am leaving my collaborator anonymous here, so he doesn't get embarrassed by what is likely a simple error on my side.)

We've never noticed problems like this before, but perhaps that's because we had been running with documents that were generally much longer?

Update: my collaborator ran MALLET via the gensim wrapper; he has a completely different installation. He gets slightly different output, despite using the same random seed, but still many of the same issues:

MALLET_via_GENSIM

and meanwhile, LdaModel from GENSIM seems to give more normal answers:

LdaModel_via_GENSIM

Can anyone help us make sense of this? We're super-happy to run any tests or experiments on our end.


mimno commented Jun 13, 2021

I've checked in a fix and verified that this doesn't happen anymore. FeatureSequence objects sometimes allocate arrays with more positions than they actually use, so it's necessary to use featureSequence.size() instead of the length of the internal array. This only seems to have mattered when printing topic proportions to a file for length-1 documents, which might be why no one has noticed it so far.
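
Schematically, the failure mode looks like the minimal, self-contained sketch below; ToyFeatureSequence is a hypothetical stand-in that only mimics the relevant behavior of cc.mallet.types.FeatureSequence (an over-allocated backing array plus a size), not the actual MALLET code:

// Toy stand-in: a backing array that may be longer than the document it holds.
class ToyFeatureSequence {
    private final int[] features; // backing array, possibly over-allocated
    private final int length;     // number of positions actually in use

    ToyFeatureSequence(int[] features, int length) {
        this.features = features;
        this.length = length;
    }

    int[] getFeatures() { return features; } // raw array, features.length >= size()
    int size() { return length; }            // true document length
}

public class ShortDocSketch {
    public static void main(String[] args) {
        // A one-token document stored in an array of capacity 2:
        // position 0 is the real token, position 1 is unused padding.
        ToyFeatureSequence doc = new ToyFeatureSequence(new int[] {42, 0}, 1);
        int[] topicAssignments = {1, 0}; // only position 0 carries a real assignment
        double[] buggy = new double[2];
        double[] fixed = new double[2];

        // Buggy pattern: iterating over the backing array's length counts the
        // padding slot as a topic-0 token, so a one-word document reports (0.5, 0.5).
        for (int pos = 0; pos < doc.getFeatures().length; pos++) {
            buggy[topicAssignments[pos]]++;
        }

        // Fixed pattern: iterate only over size() positions, giving (0.0, 1.0).
        for (int pos = 0; pos < doc.size(); pos++) {
            fixed[topicAssignments[pos]]++;
        }

        System.out.printf("buggy: %.2f %.2f%n",
                buggy[0] / doc.getFeatures().length, buggy[1] / doc.getFeatures().length);
        System.out.printf("fixed: %.2f %.2f%n",
                fixed[0] / doc.size(), fixed[1] / doc.size());
    }
}

With the over-allocated array, the unused position gets counted as a token assigned to topic 0, which is exactly the flat (0.5, 0.5) rows reported for the one-word documents above.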

@mimno mimno closed this as completed Jun 13, 2021

sdedeo commented Jun 13, 2021

Thank you so much David; this is very helpful to us here. Have a lovely evening.
