puzzling results with simple test code, replicates on gensim -- possibly because of very short documents? #199

Closed
sdedeo opened this issue May 7, 2021 · 2 comments



sdedeo commented May 7, 2021

Hello Mallet Gurus,

I hesitate to bring up this sort of basic question, but we've been getting some unusual results with mallet, using the "import-file" method, and we're not sure what we're doing wrong. We're using mallet 2.0. Here's what I think is a minimal example.

First, the input data. That's this "AskP.txt" file here:

AskP.txt

We then run

mallet import-file --input AskP.txt --output AskP.mallet --keep-sequence

to produce the MALLET-format input file.
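
For context, import-file by default treats each line of the input as one document, roughly of the form "name label text..."; so hypothetical lines like the following (the names and labels here are made up) would each become a one-word document:

doc1 X certainly
doc7 X necessarily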

We then run

mallet train-topics --random-seed 1 --num-iterations 1000 --input AskP.mallet --num-topics 2 --num-top-words 100 --output-doc-topics doc-topics.txt --output-topic-keys topic-keys.txt --topic-word-weights-file word-topics.txt

which gives us the following output files:

doc-topics.txt
topic-keys.txt
word-topics.txt

These are pretty strange! In particular, consider the first document. It has only the word "certainly" in it. The model says it is equally weighted on topic 0 and topic 1. However, this doesn't make sense, because topic 1 is more heavily weighted on the word certainly (as can be seen by inspecting the word-topics.txt file).

Even stranger, documents 7, 8, and 9 are all the same (the word "necessarily", alone). However, the output gives document 8 a different topic decomposition (0.64, 0.36) than documents 7 and 9 (which both get 0.5, 0.5).
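
For reference, the relevant rows of doc-topics.txt look roughly like this (the document names are placeholders, and the exact column layout depends on the MALLET version, but each row gives the document index, its source, and one proportion per topic):

7   doc7   0.50   0.50
8   doc8   0.64   0.36
9   doc9   0.50   0.50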

Can anyone help us understand what is going on? (I am leaving my collaborator anonymous here, so he doesn't get embarrassed by what is likely a simple error on my side.)

We've never noticed problems like this before, but perhaps that's because we had been running with documents that were generally much longer?

Update: my collaborator ran MALLET via the gensim wrapper; he has a completely different installation. He gets slightly different output, despite using the same random seed, but still many of the same issues:

MALLET_via_GENSIM

and meanwhile, LdaModel from GENSIM seems to give more normal answers:

LdaModel_via_GENSIM

Can anyone help us make sense of this? We're super-happy to run any tests or experiments on our end.


mimno commented Jun 13, 2021

I've checked in a fix and verified that this doesn't happen anymore. FeatureSequence objects sometimes allocate arrays with more positions than they actually use, so it's necessary to use featureSequence.size() instead of the length of the internal array. This only seems to have mattered when printing topic proportions to a file for length-1 documents, which might be why no one has noticed it so far.
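
Schematically, the failure mode looks like the minimal, self-contained sketch below; ToyFeatureSequence is a hypothetical stand-in that only mimics the relevant behavior of cc.mallet.types.FeatureSequence (an over-allocated backing array plus a size), not the actual MALLET code:

// Toy stand-in: a backing array that may be longer than the document it holds.
class ToyFeatureSequence {
    private final int[] features; // backing array, possibly over-allocated
    private final int length;     // number of positions actually in use

    ToyFeatureSequence(int[] features, int length) {
        this.features = features;
        this.length = length;
    }

    int[] getFeatures() { return features; } // raw array, features.length >= size()
    int size() { return length; }            // true document length
}

public class ShortDocSketch {
    public static void main(String[] args) {
        // A one-token document stored in an array of capacity 2:
        // position 0 is the real token, position 1 is unused padding.
        ToyFeatureSequence doc = new ToyFeatureSequence(new int[] {42, 0}, 1);
        int[] topicAssignments = {1, 0}; // only position 0 carries a real assignment
        double[] buggy = new double[2];
        double[] fixed = new double[2];

        // Buggy pattern: iterating over the backing array's length counts the
        // padding slot as a topic-0 token, so a one-word document reports (0.5, 0.5).
        for (int pos = 0; pos < doc.getFeatures().length; pos++) {
            buggy[topicAssignments[pos]]++;
        }

        // Fixed pattern: iterate only over size() positions, giving (0.0, 1.0).
        for (int pos = 0; pos < doc.size(); pos++) {
            fixed[topicAssignments[pos]]++;
        }

        System.out.printf("buggy: %.2f %.2f%n",
                buggy[0] / doc.getFeatures().length, buggy[1] / doc.getFeatures().length);
        System.out.printf("fixed: %.2f %.2f%n",
                fixed[0] / doc.size(), fixed[1] / doc.size());
    }
}

With the over-allocated array, the unused position gets counted as a token assigned to topic 0, which is exactly the flat (0.5, 0.5) rows reported for the one-word documents above.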

@mimno mimno closed this as completed Jun 13, 2021

sdedeo commented Jun 13, 2021

Thank you so much David; this is very helpful to us here. Have a lovely evening.
