
About the dictionary

KEINOS edited this page Apr 15, 2024 · 5 revisions

About the dictionary of Kagome

From:

The kagome module provides dictionaries in a format that can be embedded in Go programs.

In kagome, two types of dictionaries, IPA and Uni, are supported as standard.

$ go get github.com/ikawaha/kagome/v2
go: downloading github.com/ikawaha/kagome v1.11.2
go: downloading github.com/ikawaha/kagome/v2 v2.9.0
go: downloading github.com/ikawaha/kagome-dict v1.0.7
go: downloading github.com/ikawaha/kagome-dict/ipa v1.0.9
go: downloading github.com/ikawaha/kagome-dict/uni v1.1.8
go: added github.com/ikawaha/kagome-dict v1.0.7
go: added github.com/ikawaha/kagome-dict/ipa v1.0.9 // <-- IPADIC
go: added github.com/ikawaha/kagome-dict/uni v1.1.8 // <-- UniDIC
go: added github.com/ikawaha/kagome/v2 v2.9.0

The program can simply "import" this dictionary and use/embed it. Once loaded into memory, the dictionary works as a singleton and can be used by several morphological analyzers.

package main

import (
	"fmt"
	"log"

	"github.com/ikawaha/kagome-dict/ipa" // use and embed IPADIC
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func Example() {
	// Create a new tokenizer using IPADIC with OmitBosEos option
	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		log.Fatal(err)
	}

	// Segment the input to tokens
	seg := t.Wakati("すもももももももものうち")

	fmt.Printf("%#v\n", seg)
	// Output: []string{"すもも", "も", "もも", "も", "もも", "の", "うち"}
}
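
The singleton behaviour described above can be pictured with a minimal standard-library sketch. The Dict type and the sharedDict function below are hypothetical stand-ins, not kagome's actual code, and kagome's own loading logic may differ:

```go
package main

import (
	"fmt"
	"sync"
)

// Dict is a hypothetical stand-in for a large, read-only dictionary.
type Dict struct {
	Name string
}

var (
	once     sync.Once
	instance *Dict
	loads    int // counts how many times the expensive load actually ran
)

// sharedDict loads the dictionary on first use and returns the same
// pointer on every later call.
func sharedDict() *Dict {
	once.Do(func() {
		loads++
		instance = &Dict{Name: "hypothetical-dict"}
	})
	return instance
}

func main() {
	a := sharedDict() // e.g. handed to a first analyzer
	b := sharedDict() // e.g. handed to a second analyzer
	fmt.Println(a == b, loads)
}
```

Both callers receive the same pointer, so the cost of holding the dictionary in memory is paid only once regardless of how many analyzers use it.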

What is a Kagome dictionary?

As already mentioned, kagome supports two standard dictionaries, IPADIC and UniDIC.

IPADIC is MeCab's so-called "standard dictionary" and separates morphological units more intuitively than UniDIC. In contrast, UniDIC splits a sentence into smaller units, which suits the retrieval of examples.

Both dictionaries are quite old. Although the figures are not directly comparable, IPADIC has a vocabulary of about 400,000 words and UniDIC about 750,000. IPADIC is therefore better suited to memory-limited environments, while UniDIC's shorter lexical units make it better suited to splitting words for search.

Dictionary     | Source                      | Go Package
IPADIC (MeCab) | mecab-ipadic-2.7.0-20070801 | github.com/ikawaha/kagome-dict/ipa
UniDIC         | UniDIC-mecab-2.1.2_src      | github.com/ikawaha/kagome-dict/uni

Both dictionaries are a "set of morphemes" of type dict.Dict; only the information they contain differs.

// Declared in package dict (import "github.com/ikawaha/kagome-dict/dict").
// Outside that package the type is referred to as dict.Dict.

type Dict struct {
	Morphs       Morphs
	POSTable     POSTable
	ContentsMeta ContentsMeta
	Contents     Contents
	Connection   ConnectionTable
	Index        IndexTable
	CharClass    CharClass
	CharCategory CharCategory
	InvokeList   InvokeList
	GroupList    GroupList
	UnkDict      UnkDict
}

That is, any package that provides a value of type dict.Dict can be embedded as a system dictionary.

For example, NEologd and the Korean dictionary for MeCab are also available as such dictionaries, albeit on an "experimental" basis.

NEologd collects proper nouns from the Internet and covers a wide vocabulary, while the Korean dictionary is a Korean morphological dictionary built for MeCab.

Dictionary             | Source                      | Go Package
IPADIC-NEologd (MeCab) | mecab-ipadic-neologd        | github.com/ikawaha/kagome-ipa-neologd
Korean (MeCab)         | mecab-ko-dic-2.1.1-20180720 | github.com/ikawaha/kagome-dict-ko

Usage is the same as before: fetch the dictionary package with go get and import it.

$ go get github.com/ikawaha/kagome-dict-ko
go: downloading github.com/ikawaha/kagome-dict-ko v1.1.0
go: added github.com/ikawaha/kagome-dict-ko v1.1.0

package main

import (
	"fmt"
	"log"

	ko "github.com/ikawaha/kagome-dict-ko" // use and embed Korean dict
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func Example() {
	t, err := tokenizer.New(ko.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		log.Fatal(err)
	}

	// Segment the input to tokens
	seg := t.Wakati("환영합니다, 한국에.")

	fmt.Printf("%#v\n", seg)
	// Output: []string{"환영", "합니다", ",", " ", "한국", "에", "."}
}

Differences between IPADIC and UniDIC

As already mentioned, IPADIC and UniDIC are both a "set of morphemes" of type dict.Dict; only the information they contain differs.

However, although UniDIC has a larger registered vocabulary than IPADIC, many argue that it is less accurate than IPADIC.

$ # IPA DICT
$ echo "私は日本人です。" | kagome -sysdict ipa
私	名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
日本人	名詞,一般,*,*,*,*,日本人,ニッポンジン,ニッポンジン
です	助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
。	記号,句点,*,*,*,*,。,。,。
EOS

$ # Uni DICT
$ echo "私は日本人です。" | kagome -sysdict uni
私	代名詞,*,*,*,*,*,ワタクシ,私-代名詞,私,ワタクシ,私,ワタクシ,和,*,*,*,*
は	助詞,係助詞,*,*,*,*,ハ,は,は,ワ,は,ワ,和,*,*,*,*
日本	名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*
人	接尾辞,名詞的,一般,*,*,*,ニン,人,人,ニン,人,ニン,漢,*,*,*,*
です	助動詞,*,*,*,助動詞-デス,終止形-一般,デス,です,です,デス,です,デス,和,*,*,*,*
。	補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS

Note the difference between "日本人" and "日本" + "人". It arises because the two dictionaries were built for different purposes of morphological analysis.

The latter, UniDIC, is a dictionary based on "short units" (短単位, read たんたんい) defined by NINJAL to facilitate the collection of examples for the BCCWJ.

  • NINJAL (National Institute of Japanese Language and Linguistics)
  • BCCWJ (Balanced Corpus of Contemporary Written Japanese)

These "short units" are known to be too short to be used in "natural language processing" for syntactic and semantic analysis.

It is therefore understandable why some people claim that UniDIC is less accurate than IPADIC. In this respect, IPADIC is faster and more convenient for most use cases.
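
Each kagome output line shown in this section consists of a surface form, a TAB, and a comma-separated feature list whose length depends on the dictionary (9 fields for IPADIC and 17 for UniDIC in the lines above). A minimal sketch of reading such lines with only the Go standard library follows; parseLine is a hypothetical helper, not part of kagome:

```go
package main

import (
	"fmt"
	"strings"
)

// parseLine splits one output line into the surface form and its feature
// list. It reports false for lines without a TAB, such as "EOS".
func parseLine(line string) (string, []string, bool) {
	surface, rest, found := strings.Cut(line, "\t")
	if !found {
		return "", nil, false
	}
	return surface, strings.Split(rest, ","), true
}

func main() {
	// Lines copied from the IPADIC and UniDIC output above.
	lines := []string{
		"日本人\t名詞,一般,*,*,*,*,日本人,ニッポンジン,ニッポンジン",
		"日本\t名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*",
		"EOS",
	}
	for _, l := range lines {
		if s, f, ok := parseLine(l); ok {
			fmt.Printf("%s: %d features, POS=%s\n", s, len(f), f[0])
		}
	}
}
```

Because the number and meaning of the fields differ between dictionaries, code that reads features by position may need to know which dictionary produced the line.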

Advantage and use cases of UniDIC

An advantage of UniDIC is the "consistency" in word segmentation.

The difference between the two dictionaries, IPA and Uni, is illustrated by a well-known example.

"りんごジュースを飲んだ。" vs "リンゴジュースを飲んだ。"

Both are correct and mean the same thing: "I drank apple juice".

And here comes the problem.

$ # IPA DICT
$ echo "りんごジュースを飲んだ。" | kagome -sysdict ipa
りん	副詞,助詞類接続,*,*,*,*,りん,リン,リン
ご	接頭詞,名詞接続,*,*,*,*,ご,ゴ,ゴ
ジュース	名詞,一般,*,*,*,*,ジュース,ジュース,ジュース
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
飲ん	動詞,自立,*,*,五段・マ行,連用タ接続,飲む,ノン,ノン
だ	助動詞,*,*,*,特殊・タ,基本形,だ,ダ,ダ
。	記号,句点,*,*,*,*,。,。,。
EOS

$ # UNI DICT
$ echo "りんごジュースを飲んだ。" | kagome -sysdict uni
りんご	名詞,普通名詞,一般,*,*,*,リンゴ,林檎,りんご,リンゴ,りんご,リンゴ,漢,*,*,*,*
ジュース	名詞,普通名詞,一般,*,*,*,ジュース,ジュース-juice,ジュース,ジュース,ジュース,ジュース,外,*,*,*,*
を	助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*
飲ん	動詞,一般,*,*,五段-マ行,連用形-撥音便,ノム,飲む,飲ん,ノン,飲む,ノム,和,*,*,*,*
だ	助動詞,*,*,*,助動詞-タ,終止形-一般,タ,た,だ,ダ,だ,ダ,和,*,*,*,*
。	補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS

Note the difference between "りん, ご" and "りんご".

IPADIC recognized "りんご" as an adverb + prefix (副詞 + 接頭詞) combination, whereas UniDIC recognized it as a noun (名詞).

The simplest solution, apart from registering a user dictionary, is to use katakana notation.

$ # IPADICT
$ echo "リンゴジュースを飲んだ。" | kagome -sysdict ipa
リンゴ	名詞,一般,*,*,*,*,リンゴ,リンゴ,リンゴ
ジュース	名詞,一般,*,*,*,*,ジュース,ジュース,ジュース
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
飲ん	動詞,自立,*,*,五段・マ行,連用タ接続,飲む,ノン,ノン
だ	助動詞,*,*,*,特殊・タ,基本形,だ,ダ,ダ
。	記号,句点,*,*,*,*,。,。,。
EOS

$ # UniDICT
$ echo "リンゴジュースを飲んだ。" | kagome -sysdict uni
リンゴ	名詞,普通名詞,一般,*,*,*,リンゴ,林檎,リンゴ,リンゴ,リンゴ,リンゴ,漢,*,*,*,*
ジュース	名詞,普通名詞,一般,*,*,*,ジュース,ジュース-juice,ジュース,ジュース,ジュース,ジュース,外,*,*,*,*
を	助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*
飲ん	動詞,一般,*,*,五段-マ行,連用形-撥音便,ノム,飲む,飲ん,ノン,飲む,ノム,和,*,*,*,*
だ	助動詞,*,*,*,助動詞-タ,終止形-一般,タ,た,だ,ダ,だ,ダ,和,*,*,*,*
。	補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS
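
Rewriting a known word in katakana before analysis could be sketched as follows, using only the Go standard library. The helpers toKatakana and rewriteWords are hypothetical names, not part of kagome, and the word list is an assumption the caller would have to supply. Only listed words are rewritten, since converting the whole text would also change particles and inflections:

```go
package main

import (
	"fmt"
	"strings"
)

// toKatakana maps each hiragana letter (U+3041..U+3096) to the katakana
// letter 0x60 code points above it and leaves other runes unchanged.
func toKatakana(s string) string {
	return strings.Map(func(r rune) rune {
		if r >= 0x3041 && r <= 0x3096 {
			return r + 0x60
		}
		return r
	}, s)
}

// rewriteWords replaces only the listed hiragana words with katakana,
// so the rest of the text stays untouched.
func rewriteWords(text string, words ...string) string {
	for _, w := range words {
		text = strings.ReplaceAll(text, w, toKatakana(w))
	}
	return text
}

func main() {
	fmt.Println(rewriteWords("りんごジュースを飲んだ。", "りんご"))
	// Output: リンゴジュースを飲んだ。
}
```

A plain substring replacement like this can also match inside longer words, so it is only a rough preprocessing step.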

However, "りんごジュース" is arguably easier to read than "リンゴジュース" because the two words are visually separated (a hiragana-katakana mixture versus all katakana).

Moreover, both dictionaries include the words "りんご" and "リンゴ" as nouns (名詞).

$ # IPA DICT
$ echo "りんご" | kagome -sysdict ipa
りんご	名詞,一般,*,*,*,*,りんご,リンゴ,リンゴ
EOS

$ echo "リンゴ" | kagome -sysdict ipa
リンゴ	名詞,一般,*,*,*,*,リンゴ,リンゴ,リンゴ
EOS

$ # UNI DICT
$ echo "りんご" | kagome -sysdict uni
りんご	名詞,普通名詞,一般,*,*,*,リンゴ,林檎,りんご,リンゴ,りんご,リンゴ,漢,*,*,*,*
EOS

$ echo "リンゴ" | kagome -sysdict uni
リンゴ	名詞,普通名詞,一般,*,*,*,リンゴ,林檎,リンゴ,リンゴ,リンゴ,リンゴ,漢,*,*,*,*
EOS

The difference is that IPADIC attempts to interpret the input grammatically, while UniDIC interprets it in short units.

  1. "日本人" (noun) vs "日本, 人" (noun + suffix)
  2. "りん, ご, ジュース" (adverb + prefix + noun) vs "りんご, ジュース" (noun + noun)

In both cases, the latter segmentation is divided into units suitable for search engines and similar systems.

This means that "short units" are effective in unifying the units of "search examples" in search engines and other information retrieval systems.

Thus, UniDIC has the advantage for word-search purposes.
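
The effect on search can be illustrated with a toy inverted index built from the token sequences that kagome printed above for "りんごジュースを飲んだ。". This is a minimal sketch using only the Go standard library; the index type is hypothetical and far simpler than a real search engine:

```go
package main

import "fmt"

// index maps a token to the IDs of the documents that contain it.
type index map[string][]int

// add registers every distinct token of a document under the given ID.
func (ix index) add(id int, tokens []string) {
	seen := map[string]bool{}
	for _, t := range tokens {
		if !seen[t] {
			seen[t] = true
			ix[t] = append(ix[t], id)
		}
	}
}

func main() {
	// Token sequences copied from the kagome output shown earlier.
	ipaTokens := []string{"りん", "ご", "ジュース", "を", "飲ん", "だ", "。"}
	uniTokens := []string{"りんご", "ジュース", "を", "飲ん", "だ", "。"}

	ipaIndex, uniIndex := index{}, index{}
	ipaIndex.add(1, ipaTokens)
	uniIndex.add(1, uniTokens)

	// The query "りんご" on its own is a single token under both dictionaries.
	fmt.Println("IPADIC index hits:", ipaIndex["りんご"])
	fmt.Println("UniDIC index hits:", uniIndex["りんご"])
}
```

With the IPADIC segmentation the query finds nothing because the document was indexed as "りん" and "ご", whereas the UniDIC segmentation indexes "りんご" as one token and the lookup succeeds.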