
NAME

Search::Fulltext::Tokenizer::MeCab - Provides Japanese fulltext search for Search::Fulltext module

SYNOPSIS

use Search::Fulltext;
use Search::Fulltext::Tokenizer::MeCab;
use Test::More;   # provides is_deeply() used below

my $query = '猫';
my @docs = (
    '我輩は猫である',
    '犬も歩けば棒に当る',
    '実家でてんちゃんって猫を飼ってまして,ものすっごい可愛いんですよほんと',
);

my $fts = Search::Fulltext->new({
    docs      => \@docs,
    tokenizer => "perl 'Search::Fulltext::Tokenizer::MeCab::tokenizer'",
});
my $results = $fts->search($query);
is_deeply($results, [0, 2]);        # 1st & 3rd docs include '猫'
$results = $fts->search('猫 AND 可愛い');
is_deeply($results, [2]);
done_testing;

DESCRIPTION

Search::Fulltext::Tokenizer::MeCab is a Japanese tokenizer that works with the fulltext search module Search::Fulltext. All you have to do is specify perl 'Search::Fulltext::Tokenizer::MeCab::tokenizer' as the tokenizer of Search::Fulltext.

my $fts = Search::Fulltext->new({
    docs      => \@docs,
    tokenizer => "perl 'Search::Fulltext::Tokenizer::MeCab::tokenizer'",
});

Docs must be given as UTF-8 strings.

Various queries are available as described in "QUERIES" in Search::Fulltext, but wildcard queries (e.g. '我*') and phrase queries (e.g. '"我輩は猫である"') are not supported.
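
For example (a sketch reusing $fts from the SYNOPSIS above and assuming the AND/OR operators described in "QUERIES" in Search::Fulltext):

my $results = $fts->search('猫 AND 可愛い');  # boolean query, as in the SYNOPSIS
$results    = $fts->search('猫 OR 犬');       # assumed available per "QUERIES" in Search::Fulltext
# $fts->search('我*');                        # not supported: wildcard query
# $fts->search('"我輩は猫である"');            # not supported: phrase query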

A user dictionary can be used to change the tokenizing behavior of the internally used Text::MeCab. See the ENVIRONMENTAL VARIABLES section for details.

ENVIRONMENTAL VARIABLES

Some environmental variables are provided to customize the behavior of Search::Fulltext::Tokenizer::MeCab.

Typical usage:

$ ENV1=foobar ENV2=buz perl /path/to/your_script_using_this_module ARGS

  • MECABDIC_USERDIC

    Specify path(s) to MeCab's user dictionary.

    See MeCab's manual to learn how to create a user dictionary.

    Examples:

      MECABDIC_USERDIC="/path/to/yourdic1.dic"
      MECABDIC_USERDIC="/path/to/yourdic1.dic, /path/to/yourdic2.dic"
    
  • MECABDIC_DEBUG

    When set to a value other than 0, debug strings are printed to STDERR.

    In particular, output like the following helps you check how your docs are tokenized.

      string to be parsed: 我輩は猫である (7)
      token: 我輩 (2)
      token: は (1)
      token: 猫 (1)
      token: で (1)
      token: ある (2)
      ...
      string to be parsed: 猫 AND 可愛い (9)
      token: 猫 (1)
      string to be parsed:  可愛い (4)
      token: 可愛い (3)
    

    Note that not only docs but also queries are tokenized. A combined example of both variables is shown below.
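
Both variables can be combined on a single command line, following the "Typical usage" form above (the dictionary path and script name are placeholders):

$ MECABDIC_USERDIC="/path/to/yourdic1.dic" MECABDIC_DEBUG=1 perl /path/to/your_script_using_this_module ARGS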

SUPPORTS

Bug reports and pull requests are welcome at https://github.com/laysakura/Search-Fulltext-Tokenizer-MeCab.

To read this manual via perldoc, use the -t option so that UTF-8 characters are displayed correctly.

$ perldoc -t Search::Fulltext::Tokenizer::MeCab

VERSION

Version 1.05

AUTHOR

Sho Nakatani lay.sakura@gmail.com, a.k.a. @laysakura
