Commit cea6796

doc: use more meaningful example

komainu8 committed Jan 4, 2019
1 parent 72e4b4d commit cea6796
Showing 3 changed files with 67 additions and 6 deletions.
15 changes: 12 additions & 3 deletions doc/locale/ja/LC_MESSAGES/reference.po
@@ -27804,11 +27804,20 @@ msgid "``TokenUnigram`` hasn't parameter::"
 msgstr "``TokenUnigram`` には、引数がありません。"
 
 msgid ""
-":ref:`token-bigram` uses 2 characters per token. ``TokenUnigram`` uses 1 "
+"If a normalizer is used, ``TokenUnigram`` uses a white-space-separated-like "
+"tokenize method for ASCII characters. ``TokenUnigram`` uses a unigram "
+"tokenize method for non-ASCII characters."
+msgstr ""
+"ノーマライザーを使っている場合は ``TokenUnigram`` はASCIIの文字には空白区切り"
+"のようなトークナイズ方法を使います。非ASCII文字にはユニグラムのトークナイズ方"
+"法を使います。"
+
+msgid ""
+"If ``TokenUnigram`` tokenizes non-ASCII characters, ``TokenUnigram`` uses 1 "
 "character per token as below example."
 msgstr ""
-":ref:`token-bigram` は各トークンが2文字ですが、以下の例のように "
-"``TokenUnigram`` は各トークンが1文字です。"
+"``TokenUnigram`` が非ASCII文字をトークナイズすると、以下の例のように "
+"``TokenUnigram`` は各トークンが1文字となります。"

msgid "Tuning"
msgstr "チューニング"
48 changes: 48 additions & 0 deletions doc/source/example/reference/tokenizers/token-unigram-non-ascii.log
@@ -0,0 +1,48 @@
Execution example::

tokenize TokenUnigram "日本語の勉強" NormalizerAuto --output_pretty yes
# [
# [
# 0,
# 1546584495.218799,
# 0.0002140998840332031
# ],
# [
# {
# "value": "日",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "本",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "語",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "の",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "勉",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "強",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
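The behavior shown in the execution example above can be sketched in Python. This is a simplified illustrative model only, not Groonga's actual implementation: the `token_unigram` function below is a hypothetical helper, and the real ``TokenUnigram`` tokenizer delegates normalization and character classification to the configured normalizer (here ``NormalizerAuto``).

```python
def token_unigram(text):
    """Simplified model of TokenUnigram with a normalizer enabled:
    runs of ASCII alphanumeric characters are grouped into one token
    (the white-space-separated-like behavior), while each non-ASCII
    character becomes its own one-character token."""
    tokens = []
    word = ""
    for ch in text:
        if ch.isascii() and ch.isalnum():
            word += ch  # accumulate ASCII alphanumerics into one token
        else:
            if word:
                tokens.append(word)  # flush the pending ASCII token
                word = ""
            if not ch.isspace():
                tokens.append(ch)  # one token per non-ASCII character
    if word:
        tokens.append(word)
    return tokens

# Mirrors the example output: each non-ASCII character is one token.
print(token_unigram("日本語の勉強"))  # ['日', '本', '語', 'の', '勉', '強']
```

Note that real ``TokenUnigram`` also splits ASCII tokens on character-type boundaries (digits vs. letters vs. symbols), which this sketch does not model.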
10 changes: 7 additions & 3 deletions doc/source/reference/tokenizers/token_unigram.rst
@@ -26,9 +26,13 @@ Syntax
 Usage
 -----
 
-:ref:`token-bigram` uses 2 characters per
-token. ``TokenUnigram`` uses 1 character per token as below example.
+If a normalizer is used, ``TokenUnigram`` uses a white-space-separated-like
+tokenize method for ASCII characters. ``TokenUnigram`` uses a unigram
+tokenize method for non-ASCII characters.
 
+If ``TokenUnigram`` tokenizes non-ASCII characters, ``TokenUnigram`` uses
+1 character per token as below example.
+
 .. groonga-command
-.. include:: ../../example/reference/tokenizers/token-unigram.log
+.. include:: ../../example/reference/tokenizers/token-unigram-non-ascii.log
 .. tokenize TokenUnigram "100cents!!!" NormalizerAuto
