
Extend the functionality of Dataset #57

Open
bt2901 opened this issue May 8, 2020 · 3 comments
Labels
enhancement New feature or request

Comments

@bt2901
Contributor

bt2901 commented May 8, 2020

Something along the lines of "convert between Counter and vowpal_wabbit" would be very helpful.

Also, maybe we need to store more metadata (such as the main modality and co-occurrences).

Related code: https://github.com/machine-intelligence-laboratory/OptimalNumberOfTopics/blob/master/topnum/scores/arun.py (this is especially relevant now, since we distribute descriptions of corpora that are obtained using this code)
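To make the first request concrete, here is a minimal sketch of what a Counter ↔ Vowpal Wabbit round trip could look like. The function names and the `{modality: Counter}` layout are illustrative, not an existing Dataset API; the line format assumed is the BigARTM-style VW format with `|@modality` markers:

```python
from collections import Counter


def counter_to_vw(doc_id, counters):
    """Serialize {modality: Counter} into one BigARTM-style VW line.

    Tokens with count 1 are written bare; others as token:count.
    (Hypothetical helper, sketched for this issue.)
    """
    parts = [doc_id]
    for modality, counter in counters.items():
        parts.append(f"|@{modality}")
        for token, count in counter.items():
            parts.append(token if count == 1 else f"{token}:{count}")
    return " ".join(parts)


def vw_to_counters(line):
    """Inverse: parse a VW line back into (doc_id, {modality: Counter})."""
    doc_id, *tokens = line.split()
    counters = {}
    current = None
    for tok in tokens:
        if tok.startswith("|@"):
            current = tok[2:]
            counters[current] = Counter()
        else:
            token, _, count = tok.partition(":")
            counters[current][token] += int(count) if count else 1
    return doc_id, counters
```

For example, `counter_to_vw("doc1", {"text": Counter({"cat": 2, "dog": 1})})` produces `"doc1 |@text cat:2 dog"`, and `vw_to_counters` recovers the original counters from that line.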

@Evgeny-Egorov-Projects
Contributor

Just on the "off" note: I still don't understand the "sacred" meaning of the main modality. Doesn't taking ANY modality as main and then recalculating the weights accordingly bring them to "equal ground", so to speak? Or, more mathematically: there exists a hyperplane of "equal regularization effect", and setting the coefficient of any one of them to one scales the others accordingly?

@Alvant
Collaborator

Alvant commented May 8, 2020

Let me break into the discussion and say a couple of words in defense of main modality :)

This is not, imho, about equal weights or anything like that. Topic modeling is about analyzing texts. So, it is reasonable to provide a way to somehow tell the plain text (aka main modality) apart from other modalities (which are either meta info such as author or title, or manually created fancy things like bigrams, trigrams, skipgrams, etc., god knows what else is possible to come up with). A user may want to build models solely on plain text. Or she may want to use this modality for coherence computation, for example (if the words of the main modality are in natural order in the VW file, but the other modalities are in bag-of-words form). So, main modality == preprocessed raw text.

Or maybe it would be better to give it some other name (not main modality, but preprocessed_text or plain_text?)

@bt2901
Contributor Author

bt2901 commented May 24, 2020

I agree with @Alvant, but I want to add another consideration.

In many models, multiplying every modality weight by the same constant leaves the model unchanged (as a consequence, you could indeed recalculate the weights based on any modality). However, this is not the case when regularizers are involved. If we want to transfer good taus between different domains (and we do want that: that's the whole point of the relative coefficients technique), the regularization coefficients must be on the same scale, so to speak. I believe that relating them to the main_modality's weight (1 by default) is the most natural choice.
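The convention described above could be sketched as follows. This is an illustrative helper (the function name is not an existing API): since only the ratios of modality weights matter in the unregularized model, pinning the main modality's weight to 1 puts coefficients computed on different corpora on the same scale:

```python
def normalize_modality_weights(weights, main_modality):
    """Rescale modality weights so that the main modality gets weight 1.

    Only the ratios of modality weights affect the (unregularized) model,
    so dividing everything by the main modality's weight changes nothing
    about the model itself, but fixes a common scale for regularization
    coefficients across corpora. (Hypothetical helper for illustration.)
    """
    base = weights[main_modality]
    return {modality: w / base for modality, w in weights.items()}
```

For example, `normalize_modality_weights({"@text": 2.0, "@title": 1.0}, "@text")` yields `{"@text": 1.0, "@title": 0.5}`: the same weight ratio, expressed relative to the main modality.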
