Tokenizer

Tokenizer for English Texts

Hivemall provides a simple English text tokenizer UDF with the following syntax:

tokenize(text input, optional boolean toLowerCase = false)
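
As a minimal sketch of the syntax above, the queries below call tokenize with and without the optional second argument. The input string is an arbitrary illustration; the UDF returns an array of strings, and passing true for toLowerCase is expected to lowercase the resulting tokens.

```sql
-- Tokenize with the default setting (toLowerCase = false)
select tokenize('Hello, world!');

-- Tokenize and convert the tokens to lower case
select tokenize('Hello, world!', true);
```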

Tokenizer for Japanese Texts

The Hivemall-NLP module provides a Japanese text tokenizer UDF based on Kuromoji.

To use the NLP module, you first need to issue the following DDLs. Note that the NLP module is not included in hivemall-with-dependencies.jar.

add jar /tmp/hivemall-nlp-xxx-with-dependencies.jar;

source /tmp/define-additional.hive;

The signature of the UDF is as follows:

tokenize_ja(text input, optional const text mode = "normal", optional const array<string> stopWords, optional const array<string> stopTags)

Caution: tokenize_ja is supported in Hivemall v0.4.1 and later.

Its basic usage is as follows:

select tokenize_ja("kuromojiを使った分かち書きのテストです。第二引数にはnormal/search/extendedを指定できます。デフォルトではnormalモードです。");

["kuromoji","使う","分かち書き","テスト","第","二","引数","normal","search","extended","指定","デフォルト","normal","モード"]
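
The optional arguments in the signature can be exercised as sketched below. The shortened input sentence and the stop word chosen here are illustrative assumptions, not taken from the Hivemall distribution, and the results depend on the Kuromoji dictionary in use, so no output is shown.

```sql
-- Use the "search" mode instead of the default "normal" mode
select tokenize_ja("kuromojiを使った分かち書きのテストです。", "search");

-- Keep the default mode and pass a constant array of stop words
-- (the stop word "テスト" is only an example)
select tokenize_ja("kuromojiを使った分かち書きのテストです。", "normal", array("テスト"));
```

Because mode, stopWords, and stopTags are declared as const in the signature, they should be given as literals rather than column references.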

For details of the API, please also refer to the Javadoc of JapaneseAnalyzer.