This repository has been archived by the owner on Oct 8, 2019. It is now read-only.
Tokenizer
Makoto YUI edited this page Jan 12, 2016
Hivemall provides a simple English text tokenizer UDF with the following syntax:
tokenize(text input, optional boolean toLowerCase = false)
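As a minimal sketch of how this UDF might be called, the query below tokenizes a short English sentence and asks for lower-cased tokens through the second argument. The sample sentence is an assumption chosen for illustration, and the exact token boundaries depend on the delimiters the UDF uses, so no output is shown.

```sql
-- Tokenize an English sentence; passing true as the second argument
-- requests lower-cased tokens (the default is false).
select tokenize('Hello World', true);
```

Omitting the second argument keeps the original casing of each token.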
Hivemall-NLP module provides a Japanese text tokenizer UDF using Kuromoji.
First, issue the following DDL statements to use the NLP module. Note that the NLP module is not included in hivemall-with-dependencies.jar.
add jar /tmp/hivemall-nlp-xxx-with-dependencies.jar;
source /tmp/define-additional.hive;
The signature of the UDF is as follows:
tokenize_ja(text input, optional const text mode = "normal", optional const array<string> stopWords, optional const array<string> stopTags)
Caution: tokenize_ja is supported in Hivemall v0.4.1 and later.
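Its basic usage is as follows. The sample input is a Japanese sentence meaning "This is a test of word segmentation using kuromoji. You can specify normal/search/extended as the second argument. The default is normal mode."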
select tokenize_ja("kuromojiを使った分かち書きのテストです。第二引数にはnormal/search/extendedを指定できます。デフォルトではnormalモードです。");
["kuromoji","使う","分かち書き","テスト","第","二","引数","normal","search","extended","指定","デフォルト","normal","モード"]
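The optional arguments in the signature can be supplied as in the following sketch. It reuses the first sentence of the sample input above. The choice of "search" mode and the stop word list array("kuromoji") are assumptions made purely for illustration, and the resulting tokens are not shown because they depend on the Kuromoji dictionary and the Hivemall version in use.

```sql
-- Select the tokenization mode explicitly (normal, search, or extended).
select tokenize_ja("kuromojiを使った分かち書きのテストです。", "search");

-- Pass a constant array of stop words as the third argument;
-- tokens matching these words are expected to be dropped from the result.
select tokenize_ja("kuromojiを使った分かち書きのテストです。", "normal", array("kuromoji"));
```

Because the mode and stop word arguments are declared const in the signature, they should be given as literals rather than column references.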
For API details, please also refer to the Javadoc of JapaneseAnalyzer.