Support arbitrary values in string feature #658

unnonouno opened this Issue · 2 comments

The word_splitter class supports only a split method, which means we can use only substrings of an input text. However, sometimes we also want to use arbitrary feature values derived from a text, such as normalized forms of its words or the length of the text. These features cannot be supported at present.

I propose adding a string_feature interface that supports arbitrary feature values, and making word_splitter a subclass of string_feature. The string_feature class has an interface that converts an input text into a list of arbitrary strings (such as the words in the text), each with an arbitrary score (1.0 for word_splitter).
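
To make the proposal concrete, here is a rough sketch of what such an interface could look like. It is written in Python for brevity and is not the project's actual API: the names `StringFeature`, `extract`, `SpaceSplitter`, `LowercaseWords`, and `TextLength` are all hypothetical, and only the shape (text in, list of string/score pairs out) comes from the description above.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class StringFeature(ABC):
    """Hypothetical interface: maps a text to (feature value, score) pairs."""

    @abstractmethod
    def extract(self, text: str) -> List[Tuple[str, float]]:
        ...


class SpaceSplitter(StringFeature):
    """Stand-in for word_splitter: each substring gets a score of 1.0."""

    def extract(self, text: str) -> List[Tuple[str, float]]:
        return [(word, 1.0) for word in text.split()]


class LowercaseWords(StringFeature):
    """Emits normalized words, which need not be substrings of the input."""

    def extract(self, text: str) -> List[Tuple[str, float]]:
        return [(word.lower(), 1.0) for word in text.split()]


class TextLength(StringFeature):
    """Emits a single feature whose score is the length of the text."""

    def extract(self, text: str) -> List[Tuple[str, float]]:
        return [("length", float(len(text)))]
```

The point of the sketch is that a splitter is just the special case where every value is a substring and every score is 1.0, while the other two classes produce values a pure split method could not.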


One problem is that we also need to support string feature weights, like TF/IDF. I think we can support IDF and other global weights easily, but TF does not fit naturally, because arbitrary weights do not indicate frequencies.

My idea is to use weighted frequencies. The current implementation counts the frequencies of substrings with the counter class.
With extended frequencies, each value has a weight, and the counter class accumulates the sum of these weights. word_splitter then works the same as it does now, as long as it assigns 1.0 to all values.
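
A small sketch may help with the weighted-frequency idea. Again this is Python and purely illustrative: `weighted_frequencies` is a hypothetical helper standing in for the counter class, not its real implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def weighted_frequencies(features: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum the weights of each feature value instead of counting occurrences."""
    totals: Dict[str, float] = defaultdict(float)
    for value, weight in features:
        totals[value] += weight
    return dict(totals)


# With every weight fixed at 1.0, the sums equal plain occurrence counts,
# which is the behavior expected from a word splitter.
plain = weighted_frequencies((word, 1.0) for word in "a b a c a".split())

# With arbitrary weights, the same counter yields weighted sums.
weighted = weighted_frequencies([("a", 0.5), ("b", 2.0), ("a", 0.25)])
```

Under this scheme `plain` holds the usual term counts, so a TF-style weight can still be computed from it, while `weighted` shows how non-unit scores accumulate per value.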

mmm, difficult to explain...

@unnonouno unnonouno was assigned by kmaehashi

From discussion in meeting on 2014-02-17:

  • @unnonouno will propose the implementation idea.
@kmaehashi kmaehashi added the algorithm label
@kmaehashi kmaehashi added this to the Near Future milestone
@unnonouno unnonouno referenced this issue

String feature #703

@unnonouno unnonouno closed this
@unnonouno unnonouno modified the milestone: 0.6.0, Near Future