
NAME

Search::Fulltext::Tokenizer::MeCab - Provides Japanese fulltext search for Search::Fulltext module

SYNOPSIS

use Search::Fulltext;
use Search::Fulltext::Tokenizer::MeCab;
use Test::More;   # provides is_deeply() used below

my $query = '猫';
my @docs = (
    '我輩は猫である',
    '犬も歩けば棒に当る',
    '実家でてんちゃんって猫を飼ってまして,ものすっごい可愛いんですよほんと',
);

my $fts = Search::Fulltext->new({
    docs      => \@docs,
    tokenizer => "perl 'Search::Fulltext::Tokenizer::MeCab::tokenizer'",
});
my $results = $fts->search($query);
is_deeply($results, [0, 2]);        # 1st & 3rd docs include '猫'
$results = $fts->search('猫 AND 可愛い');
is_deeply($results, [2]);
done_testing;

DESCRIPTION

Search::Fulltext::Tokenizer::MeCab is a Japanese tokenizer that works with the fulltext search module Search::Fulltext. All you have to do is specify perl 'Search::Fulltext::Tokenizer::MeCab::tokenizer' as the tokenizer of Search::Fulltext.

my $fts = Search::Fulltext->new({
    docs      => \@docs,
    tokenizer => "perl 'Search::Fulltext::Tokenizer::MeCab::tokenizer'",
});

Docs must be given as UTF-8 strings.

Various queries are available as described in "QUERIES" in Search::Fulltext, but wildcard queries (e.g. '我*') and phrase queries (e.g. '"我輩は猫である"') are not supported.
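
For example (a sketch reusing $fts from the SYNOPSIS above and assuming the AND/OR operators described in "QUERIES" in Search::Fulltext):

my $results = $fts->search('猫 AND 可愛い');  # boolean query, as in the SYNOPSIS
$results    = $fts->search('猫 OR 犬');       # assumed available per "QUERIES" in Search::Fulltext
# $fts->search('我*');                        # not supported: wildcard query
# $fts->search('"我輩は猫である"');            # not supported: phrase query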

A user dictionary can be used to change the tokenizing behavior of the internally used Text::MeCab. See the ENVIRONMENTAL VARIABLES section for details.

ENVIRONMENTAL VARIABLES

Some environmental variables are provided to customize the behavior of Search::Fulltext::Tokenizer::MeCab.

Typical usage:

$ ENV1=foobar ENV2=buz perl /path/to/your_script_using_this_module ARGS

  • MECABDIC_USERDIC

    Specify path(s) to MeCab's user dictionary.

    See MeCab's manual to learn how to create a user dictionary.

    Examples:

      MECABDIC_USERDIC="/path/to/yourdic1.dic"
      MECABDIC_USERDIC="/path/to/yourdic1.dic, /path/to/yourdic2.dic"
    
  • MECABDIC_DEBUG

    When set to a value other than 0, debug strings are printed to STDERR.

    In particular, output like the following helps you check how your docs are tokenized.

      string to be parsed: 我輩は猫である (7)
      token: 我輩 (2)
      token: は (1)
      token: 猫 (1)
      token: で (1)
      token: ある (2)
      ...
      string to be parsed: 猫 AND 可愛い (9)
      token: 猫 (1)
      string to be parsed:  可愛い (4)
      token: 可愛い (3)
    

    Note that not only docs but also queries are tokenized. A combined example of both variables is shown below.
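
Both variables can be combined on a single command line, following the "Typical usage" form above (the dictionary path and script name are placeholders):

$ MECABDIC_USERDIC="/path/to/yourdic1.dic" MECABDIC_DEBUG=1 perl /path/to/your_script_using_this_module ARGS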

SUPPORTS

Bug reports and pull requests are welcome at https://github.com/laysakura/Search-Fulltext-Tokenizer-MeCab.

To read this manual via perldoc, use the -t option so that UTF-8 characters are displayed correctly.

$ perldoc -t Search::Fulltext::Tokenizer::MeCab

VERSION

Version 1.05

AUTHOR

Sho Nakatani lay.sakura@gmail.com, a.k.a. @laysakura
