The word_splitter class only supports a split method, which means we can use only substrings of the input text. However, sometimes we also want to use arbitrary feature values derived from a text, such as normalized forms of its words or the length of the text. These features are not supported at the moment.
I propose adding a string_feature interface that supports arbitrary feature values, and making word_splitter a subclass of string_feature. The string_feature class has an interface that converts an input text into a list of arbitrary strings (such as the words in the text), each with an arbitrary score (1.0 for word_splitter).
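To make the proposal a bit more concrete, here is a rough sketch of what such an interface could look like. All names here (`string_feature_element`, `extract`, `space_splitter`, `length_feature`) are made up for illustration and are not existing Jubatus APIs; `space_splitter` is only a whitespace-based stand-in for a real word_splitter.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical element type: one feature string with its score.
struct string_feature_element {
  std::string value;
  double score;
};

// Hypothetical interface: converts a text into scored feature strings.
class string_feature {
 public:
  virtual ~string_feature() {}
  virtual std::vector<string_feature_element> extract(
      const std::string& text) const = 0;
};

// Stand-in for a word_splitter subclass: splits on whitespace and
// assigns score 1.0 to every word.
class space_splitter : public string_feature {
 public:
  std::vector<string_feature_element> extract(
      const std::string& text) const {
    std::vector<string_feature_element> result;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
      string_feature_element e = {word, 1.0};
      result.push_back(e);
    }
    return result;
  }
};

// A feature that is not a substring of the input: the text length in bytes,
// emitted as a single element whose score carries the value.
class length_feature : public string_feature {
 public:
  std::vector<string_feature_element> extract(
      const std::string& text) const {
    std::vector<string_feature_element> result;
    string_feature_element e = {"length", static_cast<double>(text.size())};
    result.push_back(e);
    return result;
  }
};
```

With this shape, a splitter and a non-substring feature such as the text length can be used through the same interface.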
One problem is that we also need to support string feature weights, like TF/IDF. I think we can support IDF and other global weights easily, but TF does not fit naturally because arbitrary weights do not indicate frequencies.
My idea is to use weighted frequencies. The current implementation counts the frequencies of substrings with the counter class: https://github.com/jubatus/jubatus/blob/master/jubatus/core/fv_converter/datum_to_fv_converter.cpp#L427
With weighted frequencies, each value has a weight, and the counter class accumulates the sum of these weights. word_splitter then behaves the same as it does now, as long as it assigns 1.0 to every value.
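Maybe a small sketch helps. This is not the actual counter class; it uses a plain `std::map` as a stand-in, and `weighted_value` and `count_weighted` are hypothetical names. The point is only that summing weights reduces to ordinary counting when every weight is 1.0.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical pair of (feature string, weight).
typedef std::pair<std::string, double> weighted_value;

// Sums the weights per feature string instead of counting occurrences.
std::map<std::string, double> count_weighted(
    const std::vector<weighted_value>& values) {
  std::map<std::string, double> counts;
  for (std::size_t i = 0; i < values.size(); ++i) {
    // operator[] value-initializes a missing entry to 0.0 before adding.
    counts[values[i].first] += values[i].second;
  }
  return counts;
}
```

If a splitter emits ("a", 1.0), ("b", 1.0), ("a", 1.0), the sums are 2.0 for "a" and 1.0 for "b", which matches the plain frequencies. Another feature could emit fractional weights and the same code would still apply.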
mmm, difficult to explain...
From discussion in meeting on 2014-02-17: