LSA dimensionality #28

Open
piskvorky opened this issue May 4, 2011 · 11 comments
Labels
difficulty medium (Medium issue: required good gensim understanding & python skills)
feature (Issue described a new feature)
wishlist (Feature request)

Comments

@piskvorky
Owner

Try the automated dimensionality setting for Latent Semantic Analysis, via MDL:

http://www.springerlink.com/content/500651582r310t05/

This means: reproduce Fig.1 from that article. See what it does on the Lee corpus. Does the curve make sense? Is MDL robust enough, across several corpora?

@cperreault

Hi Mr. Řehůřek, thank you for opening an issue on automated dimensionality setting. Pierre-Yves Lafleur (pierreyves.lafleur.1@gmail.com) and I (christian.perreault.2@gmail.com) are currently trying to implement MDL with Gensim and test it. In a few days, we should be able to answer, in our context (small corpora), your question "Is MDL robust enough, across several corpora?"
Has anyone ever created a method to automate the choice of the number of topics with LSA?

@piskvorky
Owner Author

Great, getting rid of an extra parameter in LSA would be really cool!

Also note that @dedan added an easy way to select an LSA submodel (train on K topics, but use only L <= K topics for transformations). It is in commit 7711cbd. You simply set the lsa_model.numTopics attribute to some lower number.

@tmylk
Contributor

tmylk commented Jan 23, 2016

@cperreault Would you still like to add MDL to gensim?

@cperreault

Yes, I would be interested! I would have to dive back into Gensim. I wrote an automatic number-of-topics "chooser", based on MDL, in my Master's thesis (http://www.theses.ulaval.ca/2013/29936/ - in French!). It is currently a very custom implementation: it analyzes a corpus with an increasing k and stores the results in a MySQL DB. If there is interest, I could contribute time and effort to (try to) implement it in a Gensim "native" way.

@piskvorky
Owner Author

piskvorky commented Feb 9, 2016

Oh yes, we're still interested. Thanks!

Depending on what the "analysis" entails, we could even make this the default behaviour. There's no need to re-train models with LSA just to tweak k (the topic spaces are conveniently nested, as mentioned above), so hopefully this analysis doesn't take much extra time?
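To spell out why the topic spaces are nested, here is a short sketch in terms of the truncated SVD. The symbols (A for the term-document matrix, d for a document vector, U, S, V for the SVD factors) are notation introduced here for illustration, not taken from the gensim code:

```latex
\begin{align}
A &\approx A_K = U_K S_K V_K^{\top}, \qquad s_1 \ge s_2 \ge \dots \ge s_K \ge 0 \\
U_L &= \text{first } L \text{ columns of } U_K, \qquad
S_L = \operatorname{diag}(s_1, \dots, s_L), \qquad L \le K \\
U_L^{\top} d &= \bigl( U_K^{\top} d \bigr)_{1..L}
\end{align}
```

So the L-topic representation of a document is just the first L coordinates of its K-topic representation. This holds exactly for an exact SVD; with an approximate or incremental solver, a truncated K-topic model may differ slightly from a model trained directly with L topics.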

@ljdawn

ljdawn commented Aug 5, 2016

@cperreault Hi, can we use the automatic number-of-topics "chooser" now?

@cperreault

cperreault commented Sep 27, 2016

@ljdawn @piskvorky Hi! Sorry for my late response. I have not worked on it since, and I don't expect to have time for a few more months. However, if you think this could be useful, in the short term I can explain the simple principle that guided "my" automatic number-of-topics (k) chooser. (It is explained in my Master's thesis, in French; see the link above.)

A given corpus is analyzed (LSA) with an increasing k, starting from k=1. For each k, the distribution of (dis)similarities among documents is calculated and rounded to steps of 0.1 between 0 and 1. The lowest k at which most documents are considered dissimilar (similarity between 0 and 0.1) is chosen.

This is not rocket science, I know. It gave me interesting results across the thousands of corpora analyzed and, at the very least, a way to automate the analysis and a basis for comparison between corpora. What do you think?
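A minimal sketch of that principle in plain Python, not the thesis implementation. It makes several assumptions that the description above does not fix: similarity is cosine similarity, negative values are clipped to 0, "most" means a fraction above 0.5 of document pairs, and the documents are already projected into a K-topic LSA space so that a smaller k can be obtained by keeping only the first k coordinates. The names `cosine`, `similarity_histogram`, `choose_k` and `min_fraction` are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_histogram(doc_vectors, k):
    """Count pairwise similarities in 10 bins of width 0.1, using the first k topics."""
    bins = [0] * 10
    for u, v in combinations(doc_vectors, 2):
        # Assumption: clip to [0, 1], so negative similarities count as dissimilar.
        sim = min(max(cosine(u[:k], v[:k]), 0.0), 1.0)
        bins[min(int(sim * 10), 9)] += 1  # a similarity of exactly 1.0 goes into the last bin
    return bins


def choose_k(doc_vectors, min_fraction=0.5):
    """Return the lowest k for which more than min_fraction of pairs fall in [0, 0.1)."""
    max_k = len(doc_vectors[0])
    for k in range(1, max_k + 1):
        bins = similarity_histogram(doc_vectors, k)
        total = sum(bins)
        if total and bins[0] / total > min_fraction:
            return k
    return None  # no k up to max_k satisfies the criterion


# Toy data (made up): three documents sharing the first topic, each with its own second topic.
docs = [
    [1.0, 4.0, 0.0, 0.0],
    [1.0, 0.0, 4.0, 0.0],
    [1.0, 0.0, 0.0, 4.0],
]
print(choose_k(docs))  # prints 4
```

Because the document vectors are only truncated, never recomputed, the scan over k needs a single trained model. The cost is dominated by the pairwise similarities, which grow quadratically with the number of documents, so sampling pairs would probably be needed on larger corpora.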

@tmylk
Contributor

tmylk commented Sep 27, 2016

FYI @dsquareindia this is related to your recent work on selecting the number of topics through coherence.

@devashishd12
Contributor

Yes, coherence can surely be used as an automatic "chooser" for LSA as well. We could choose the number of topics that gives the best coherence out of 100 topics. I'll test it on LSA and report back here. So far I've only tested it on LDA, but I think it will work better with LSA.

@tmylk
Contributor

tmylk commented Sep 27, 2016

It would be nice to compare coherence to the approach of @cperreault

@piskvorky
Owner Author

LSA topics have no interpretation, so I don't think "coherence" (as in, semantically interpretable topics) makes much sense.

@menshikh-iv menshikh-iv added the difficulty medium Medium issue: required good gensim understanding & python skills label Oct 3, 2017
6 participants