Commit cea6796

doc: use more meaningful example

komainu8 committed Jan 4, 2019
1 parent 72e4b4d commit cea6796
Showing 3 changed files with 67 additions and 6 deletions.
15 changes: 12 additions & 3 deletions doc/locale/ja/LC_MESSAGES/reference.po
@@ -27804,11 +27804,20 @@ msgid "``TokenUnigram`` hasn't parameter::"
 msgstr "``TokenUnigram`` には、引数がありません。"
 
 msgid ""
-":ref:`token-bigram` uses 2 characters per token. ``TokenUnigram`` uses 1 "
+"If a normalizer is used, ``TokenUnigram`` uses a white-space-separated-like "
+"tokenize method for ASCII characters. ``TokenUnigram`` uses a unigram "
+"tokenize method for non-ASCII characters."
+msgstr ""
+"ノーマライザーを使っている場合は ``TokenUnigram`` はASCIIの文字には空白区切り"
+"のようなトークナイズ方法を使います。非ASCII文字にはユニグラムのトークナイズ方"
+"法を使います。"
+
+msgid ""
+"If ``TokenUnigram`` tokenizes non-ASCII characters, ``TokenUnigram`` uses 1 "
 "character per token as below example."
 msgstr ""
-":ref:`token-bigram` は各トークンが2文字ですが、以下の例のように "
-"``TokenUnigram`` は各トークンが1文字です。"
+"``TokenUnigram`` が非ASCII文字をトークナイズすると、以下の例のように "
+"``TokenUnigram`` は各トークンが1文字となります。"

msgid "Tuning"
msgstr "チューニング"
48 changes: 48 additions & 0 deletions doc/source/example/reference/tokenizers/token-unigram-non-ascii.log
@@ -0,0 +1,48 @@
Execution example::

tokenize TokenUnigram "日本語の勉強" NormalizerAuto --output_pretty yes
# [
# [
# 0,
# 1546584495.218799,
# 0.0002140998840332031
# ],
# [
# {
# "value": "日",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "本",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "語",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "の",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "勉",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "強",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
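The behavior shown in the execution example above can be sketched in Python. This is a simplified illustrative model only, not Groonga's actual implementation: the `token_unigram` function below is a hypothetical helper, and the real ``TokenUnigram`` tokenizer delegates normalization and character classification to the configured normalizer (here ``NormalizerAuto``).

```python
def token_unigram(text):
    """Simplified model of TokenUnigram with a normalizer enabled:
    runs of ASCII alphanumeric characters are grouped into one token
    (the white-space-separated-like behavior), while each non-ASCII
    character becomes its own one-character token."""
    tokens = []
    word = ""
    for ch in text:
        if ch.isascii() and ch.isalnum():
            word += ch  # accumulate ASCII alphanumerics into one token
        else:
            if word:
                tokens.append(word)  # flush the pending ASCII token
                word = ""
            if not ch.isspace():
                tokens.append(ch)  # one token per non-ASCII character
    if word:
        tokens.append(word)
    return tokens

# Mirrors the example output: each non-ASCII character is one token.
print(token_unigram("日本語の勉強"))  # ['日', '本', '語', 'の', '勉', '強']
```

Note that real ``TokenUnigram`` also splits ASCII tokens on character-type boundaries (digits vs. letters vs. symbols), which this sketch does not model.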
10 changes: 7 additions & 3 deletions doc/source/reference/tokenizers/token_unigram.rst
@@ -26,9 +26,13 @@ Syntax
 Usage
 -----
 
-:ref:`token-bigram` uses 2 characters per
-token. ``TokenUnigram`` uses 1 character per token as below example.
+If a normalizer is used, ``TokenUnigram`` uses a white-space-separated-like
+tokenize method for ASCII characters. ``TokenUnigram`` uses a unigram
+tokenize method for non-ASCII characters.
 
+If ``TokenUnigram`` tokenizes non-ASCII characters, ``TokenUnigram`` uses
+1 character per token as below example.
+
 .. groonga-command
-.. include:: ../../example/reference/tokenizers/token-unigram.log
+.. include:: ../../example/reference/tokenizers/token-unigram-non-ascii.log
 .. tokenize TokenUnigram "100cents!!!" NormalizerAuto
