Support arbitrary values in string feature #658

unnonouno opened this Issue Feb 12, 2014 · 2 comments

2 participants

Jubatus member

The word_splitter class supports only the split method, which means we can use only substrings of an input text as features. However, we sometimes also want arbitrary feature values derived from a text, such as the normalized forms of its words or the length of the text. Such features are not currently supported.

I propose adding a string_feature interface that supports arbitrary feature values, and making word_splitter a subclass of string_feature. The string_feature class would have an interface that converts an input text to a list of arbitrary strings (such as the words in the text), each with an arbitrary score (1.0 for word_splitter).
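
A minimal sketch of what this proposal could look like, assuming C++11. Every name other than string_feature, word_splitter, and split is hypothetical (extract, string_feature_element, space_splitter, length_feature), and the split signature shown here is an assumption rather than the actual Jubatus one:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical element type: a feature value paired with its score.
using string_feature_element = std::pair<std::string, double>;

// Proposed interface: converts a text into arbitrary (value, score) pairs.
class string_feature {
 public:
  virtual ~string_feature() = default;
  virtual std::vector<string_feature_element> extract(
      const std::string& text) const = 0;
};

// word_splitter becomes a subclass. It still only knows how to split;
// extract() turns each substring into a feature with score 1.0.
class word_splitter : public string_feature {
 public:
  // Assumed signature: returns (begin, length) boundaries of substrings.
  virtual std::vector<std::pair<std::size_t, std::size_t>> split(
      const std::string& text) const = 0;

  std::vector<string_feature_element> extract(
      const std::string& text) const override {
    std::vector<string_feature_element> out;
    for (const auto& b : split(text)) {
      out.emplace_back(text.substr(b.first, b.second), 1.0);
    }
    return out;
  }
};

// Hypothetical splitter used only for illustration: splits on spaces.
class space_splitter : public word_splitter {
 public:
  std::vector<std::pair<std::size_t, std::size_t>> split(
      const std::string& text) const override {
    std::vector<std::pair<std::size_t, std::size_t>> bounds;
    std::size_t i = 0;
    while (i < text.size()) {
      while (i < text.size() && text[i] == ' ') ++i;  // skip spaces
      const std::size_t begin = i;
      while (i < text.size() && text[i] != ' ') ++i;  // consume a word
      if (i > begin) bounds.emplace_back(begin, i - begin);
    }
    return bounds;
  }
};

// Hypothetical feature that is not a substring: the length of the text.
class length_feature : public string_feature {
 public:
  std::vector<string_feature_element> extract(
      const std::string& text) const override {
    return {{"length", static_cast<double>(text.size())}};
  }
};
```

With this shape, existing splitters keep working through the inherited extract, while a feature such as length_feature can emit a value that never appears in the text.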

Jubatus member

One problem is that we also need to support string feature weights such as TF/IDF. I think we can support IDF and other global weights easily, but TF does not fit naturally, because arbitrary weights do not indicate frequencies.

My idea is to use weighted frequencies. The current implementation counts the frequencies of substrings with the counter class.
With weighted frequencies, each value has a weight, and the counter class accumulates the sum of these weights. word_splitter then behaves the same as it does now, as long as it assigns 1.0 to every value.

mmm, difficult to explain...
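
The weighted-frequency idea can be sketched roughly as follows, assuming C++11. The weighted_counter class and the count_weighted helper are hypothetical illustrations, not the existing Jubatus counter class:

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical counter that accumulates weights instead of counting
// occurrences. If every weight is 1.0, the sum equals the plain frequency.
class weighted_counter {
 public:
  void add(const std::string& key, double weight) { sums_[key] += weight; }

  // Returns the accumulated weight for key, or 0.0 if it was never added.
  double get(const std::string& key) const {
    const auto it = sums_.find(key);
    return it == sums_.end() ? 0.0 : it->second;
  }

 private:
  std::map<std::string, double> sums_;
};

// Sums the weights of (value, weight) pairs produced by a string feature.
weighted_counter count_weighted(
    const std::vector<std::pair<std::string, double>>& elements) {
  weighted_counter counter;
  for (const auto& e : elements) {
    counter.add(e.first, e.second);
  }
  return counter;
}
```

For a splitter that emits ("a", 1.0), ("b", 1.0), ("a", 1.0), the accumulated weight of "a" is 2.0, which matches the ordinary term frequency. A feature with non-unit weights simply contributes its own sums.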

@unnonouno unnonouno was assigned by kmaehashi Feb 17, 2014
Jubatus member

From discussion in meeting on 2014-02-17:

  • @unnonouno will propose the implementation idea.
@kmaehashi kmaehashi added the algorithm label Feb 24, 2014
@kmaehashi kmaehashi added this to the Near Future milestone Feb 25, 2014
@unnonouno unnonouno referenced this issue Mar 3, 2014

String feature #703

@unnonouno unnonouno closed this May 24, 2014
@unnonouno unnonouno modified the milestone: 0.6.0, Near Future May 24, 2014