
Extend the functionality of Dataset #57

Open
bt2901 opened this issue May 8, 2020 · 3 comments
Labels
enhancement New feature or request

Comments

@bt2901
Contributor

bt2901 commented May 8, 2020

Something along the lines of "convert between Counter and vowpal_wabbit" would be very helpful.

Also, maybe we need to store more metadata (such as the main modality and co-occurrences).

Related code: https://github.com/machine-intelligence-laboratory/OptimalNumberOfTopics/blob/master/topnum/scores/arun.py (this is especially relevant now, since we distribute descriptions of corpora that are obtained using this code)
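To make the first request concrete, here is a minimal sketch of what a Counter ↔ Vowpal Wabbit round trip could look like. The function names and the `{modality: Counter}` layout are illustrative, not an existing Dataset API; the line format assumed is the BigARTM-style VW format with `|@modality` markers:

```python
from collections import Counter


def counter_to_vw(doc_id, counters):
    """Serialize {modality: Counter} into one BigARTM-style VW line.

    Tokens with count 1 are written bare; others as token:count.
    (Hypothetical helper, sketched for this issue.)
    """
    parts = [doc_id]
    for modality, counter in counters.items():
        parts.append(f"|@{modality}")
        for token, count in counter.items():
            parts.append(token if count == 1 else f"{token}:{count}")
    return " ".join(parts)


def vw_to_counters(line):
    """Inverse: parse a VW line back into (doc_id, {modality: Counter})."""
    doc_id, *tokens = line.split()
    counters = {}
    current = None
    for tok in tokens:
        if tok.startswith("|@"):
            current = tok[2:]
            counters[current] = Counter()
        else:
            token, _, count = tok.partition(":")
            counters[current][token] += int(count) if count else 1
    return doc_id, counters
```

For example, `counter_to_vw("doc1", {"text": Counter({"cat": 2, "dog": 1})})` produces `"doc1 |@text cat:2 dog"`, and `vw_to_counters` recovers the original counters from that line.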

@Evgeny-Egorov-Projects
Contributor

Just on the "off" note: I still don't understand the "sacred" meaning of the main modality. Doesn't taking ANY modality as main and then recalculating the weights accordingly bring them to "equal ground", so to speak? Or, more mathematically: there exists a hyperplane of "equal regularization effect", and setting the coefficient of any one of them to one scales the others accordingly?

@Alvant
Collaborator

Alvant commented May 8, 2020

Let me break into the discussion and say a couple of words in defense of main modality :)

This is not, imho, about equal weights or anything like that. Topic modeling is about analyzing texts. So, it is reasonable to provide a way to somehow tell the plain text (aka main modality) apart from other modalities (which are either meta info such as author or title, or manually created fancy things like bigrams, trigrams, skipgrams, etc., god knows what else is possible to come up with). A user may want to build models solely on plain text. Or she may want to use this modality for coherence computation, for example (if the words of the main modality are in natural order in the VW file, but the other modalities are in bag-of-words form). So, main modality == preprocessed raw text.

Or maybe it would be better to give it some other name (not main modality, but preprocessed_text or plain_text?)

@bt2901
Contributor Author

bt2901 commented May 24, 2020

I agree with @Alvant, but I want to add another consideration.

In many models, multiplying every modality weight by the same constant leaves the model unchanged (as a consequence, you could indeed recalculate the weights based on any modality). However, this is not the case when regularizers are involved. If we want to transfer good taus between different domains (and we do want that: that's the whole point of the relative coefficients technique), the regularization coefficients must be on the same scale, so to speak. I believe that relating them to the main_modality's weight (1 by default) is the most natural choice.
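The convention described above could be sketched as follows. This is an illustrative helper (the function name is not an existing API): since only the ratios of modality weights matter in the unregularized model, pinning the main modality's weight to 1 puts coefficients computed on different corpora on the same scale:

```python
def normalize_modality_weights(weights, main_modality):
    """Rescale modality weights so that the main modality gets weight 1.

    Only the ratios of modality weights affect the (unregularized) model,
    so dividing everything by the main modality's weight changes nothing
    about the model itself, but fixes a common scale for regularization
    coefficients across corpora. (Hypothetical helper for illustration.)
    """
    base = weights[main_modality]
    return {modality: w / base for modality, w in weights.items()}
```

For example, `normalize_modality_weights({"@text": 2.0, "@title": 1.0}, "@text")` yields `{"@text": 1.0, "@title": 0.5}`: the same weight ratio, expressed relative to the main modality.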
