# mecabwrap
## A Python Interface to MeCab for Unix and Windows

<table align="left">
<tr>    
    <td>
        <a href="https://travis-ci.org/kota7/mecabwrap-py" target="_blank">
            <img src="https://travis-ci.org/kota7/mecabwrap-py.svg?branch=master">
        </a>
    </td>    
    <td>
        <a href="https://ci.appveyor.com/project/kota7/mecabwrap-py/branch/master" target="_blank">
            <img src="https://ci.appveyor.com/api/projects/status/oidn1rfte6u8kavs/branch/master?svg=true">
        </a>
    </td>
    <td>
        <a href="https://badge.fury.io/py/mecabwrap" target="_blank">
            <img src="https://badge.fury.io/py/mecabwrap.svg">
        </a>
    </td>
</tr>    
</table>


**mecabwrap** is yet another Python interface to [MeCab Morphological Analyzer](http://taku910.github.io/mecab/).

It is designed to work seamlessly on both Unix and Windows machines.


## Requirements

- Python 2.7+ or 3.4+ (may also work on older versions, but these are no longer tested)
- MeCab 0.996


## Installation


### 1. Install MeCab

**Ubuntu**

```bash
$ sudo apt-get install mecab libmecab-dev mecab-ipadic-utf8
```

**Mac OSX**

```bash
$ brew install mecab mecab-ipadic
```

**Windows**

Download and run the [installer](https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7WElGUGt6ejlpVXc).

See also: [official website](http://taku910.github.io/mecab/#install) 



### 2. Install this Package


The package is available on [PyPI](https://pypi.python.org/pypi/mecabwrap/), so it can be installed with `pip`:

```bash
$ pip install mecabwrap
```

Alternatively, the latest development version can be installed from GitHub:

```bash
$ git clone --depth 1 https://github.com/kota7/mecabwrap-py.git
$ cd mecabwrap-py
$ pip install -U ./
```


## Quick Check


The following command prints the MeCab version.
If it fails, MeCab is either not installed or not on the search path.

```bash
$ mecab -v
# should result in `mecab of 0.996` or similar.
```
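
The same check can also be made from Python before importing the package. The snippet below is only a sketch: `mecab_on_path` is a hypothetical helper, not part of mecabwrap, and it relies on the standard library's `shutil.which` (Python 3.3+).

```python
import shutil


def mecab_on_path(command="mecab"):
    """Return True if `command` can be found on the executable search path."""
    # shutil.which returns the full path of the executable, or None.
    return shutil.which(command) is not None


if mecab_on_path():
    print("mecab found at:", shutil.which("mecab"))
else:
    print("mecab is not on the search path")
```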


To verify that the package is successfully installed, try the following:

```bash
$ python
```

```python
>>> from mecabwrap import tokenize
>>> for token in tokenize(u"すもももももももものうち"): 
...     print(token)
... 
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
```



In [1]:
# Version for this notebook
!pip list | grep mecabwrap

mecabwrap                     0.3.4       


## Usage


### A Simple Tokenizer

The `tokenize` function is a high level API for splitting a text into tokens.
It returns a generator of tokens.

In [2]:
from mecabwrap import tokenize, print_token

for token in tokenize('すもももももももものうち'):
    print_token(token)

すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ,*,*
も	助詞,係助詞,*,*,*,*,も,モ,モ,*,*
もも	名詞,一般,*,*,*,*,もも,モモ,モモ,*,*
も	助詞,係助詞,*,*,*,*,も,モ,モ,*,*
もも	名詞,一般,*,*,*,*,もも,モモ,モモ,*,*
の	助詞,連体化,*,*,*,*,の,ノ,ノ,*,*
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ,*,*


`Token` is defined as a namedtuple (v0.3.2+) with the following fields:

- `surface`: Word as it appears in the text
- `pos`: Part of speech
- `pos1`: Part of speech, detail 1
- `pos2`: Part of speech, detail 2
- `pos3`: Part of speech, detail 3
- `infl_type`: Inflection type
- `infl_form`: Inflection form
- `baseform`: Original form
- `reading`: Surface written in katakana
- `phoenetic`: Surface pronunciation
- `lemma`: Representative form of the word (語彙素, lexeme)
- `lemma_reading`: Reading of lemma

Among these, `lemma` and `lemma_reading` are not available in ipadic; they are defined in unidic-based dictionaries.

In [3]:
token

Token(surface='うち', pos='名詞', pos1='非自立', pos2='副詞可能', pos3=None, infl_type=None, infl_form=None, baseform='うち', reading='ウチ', phoenetic='ウチ', lemma=None, lemma_reading=None)
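
To illustrate how one line of MeCab output relates to these fields, here is a minimal sketch that parses a `surface<TAB>features` line into a namedtuple. `SketchToken` and `parse_line` are hypothetical names used for illustration only; this is not the package's implementation, and it assumes that features are plain comma-separated values with `*` marking a missing entry.

```python
from collections import namedtuple

# Field names copied from the list above (including the spelling `phoenetic`).
FIELDS = ("surface", "pos", "pos1", "pos2", "pos3", "infl_type", "infl_form",
          "baseform", "reading", "phoenetic", "lemma", "lemma_reading")
SketchToken = namedtuple("SketchToken", FIELDS)


def parse_line(line):
    """Turn one 'surface<TAB>feature,feature,...' line into a SketchToken."""
    surface, _, feature_str = line.partition("\t")
    features = feature_str.split(",")
    n_features = len(FIELDS) - 1
    # Pad short feature lists (ipadic has no lemma fields) and drop extras.
    features = (features + ["*"] * n_features)[:n_features]
    # '*' (or an empty string) means "no value"; represent it as None.
    values = [None if f in ("*", "") else f for f in features]
    return SketchToken(surface, *values)


token = parse_line("うち\t名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ")
print(token.pos, token.baseform, token.lemma)  # 名詞 うち None
```

Padding with `*` is what lets the same tuple shape serve both ipadic (9 features) and dictionaries that provide the two lemma fields.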

### Using MeCab Options

To configure the MeCab calls, one may use the `do_` functions, which accept an arbitrary number of MeCab options.  
Currently, the following three `do_` functions are provided:

- `do_mecab`: takes a single input text and returns the result as a string.
- `do_mecab_vec`: takes multiple input texts and returns a string of concatenated results.
- `do_mecab_iter`: takes multiple input texts and returns a generator.

For example, the following code invokes the *wakati* option, so the output consists of words separated by spaces, with no meta information.
See [the official site](http://taku910.github.io/mecab/format.html) for more details.

In [4]:
from mecabwrap import do_mecab
out = do_mecab('人生楽ありゃ苦もあるさ', '-Owakati')
print(out)

人生 楽 ありゃ 苦 も ある さ 
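
Since the wakati output is a plain space-separated string, it can be turned into a list of words with `str.split`. The sketch below uses a literal copy of the output above (the trailing space and newline are an assumption about the exact string), so it runs without MeCab.

```python
# Literal copy of the wakati output shown above.
wakati_output = "人生 楽 ありゃ 苦 も ある さ \n"

# str.split() without arguments splits on any whitespace and drops empty strings.
words = wakati_output.split()
print(words)  # ['人生', '楽', 'ありゃ', '苦', 'も', 'ある', 'さ']
```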



The example below uses `do_mecab_vec` to parse multiple texts.
Note that the `-F` option configures the output format of each token, and `-E` sets the end-of-sentence string.


In [5]:
from mecabwrap import do_mecab_vec
ins = ['春はあけぼの', 'やうやう白くなりゆく山際', '少し明かりて', '紫だちたる雲の細くたなびきたる']

out = do_mecab_vec(ins, '-F%f[6](%f[1]) | ', '-E...ここまで\n')
print(out)

春(一般) | は(係助詞) | あけぼの(固有名詞) | ...ここまで
やうやう(一般) | 白い(自立) | なる(自立) | ゆく(非自立) | 山際(一般) | ...ここまで
少し(助詞類接続) | 明かり(一般) | て(格助詞) | ...ここまで
紫(一般) | だ() | ちる(自立) | たり() | 雲(一般) | の(連体化) | 細い(自立) | たなびく(自立) | たり() | ...ここまで



### Returning Iterators

When the number of input texts is large, holding all the results in memory may not be a good idea.  The `do_mecab_iter` function, which works on multiple texts, returns a generator of MeCab results instead.
When `byline=True`, chunks are separated by line breaks; in the default setting, each chunk corresponds to a token.
When `byline=False`, chunks are separated by `EOS`; hence each chunk corresponds to a sentence.
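
The difference between the two chunking modes can be sketched in plain Python on a raw MeCab output string. `chunk_by_line` and `chunk_by_sentence` are hypothetical helpers written for illustration; they mimic the behavior described above and are not the package's implementation.

```python
def chunk_by_line(raw):
    """Yield one chunk per line, as described for byline=True."""
    for line in raw.splitlines():
        yield line


def chunk_by_sentence(raw, eos="EOS"):
    """Yield one chunk per sentence, as described for byline=False."""
    buffer = []
    for line in raw.splitlines():
        buffer.append(line)
        if line == eos:
            # The end-of-sentence marker closes the current chunk.
            yield "\n".join(buffer)
            buffer = []
    if buffer:  # trailing lines without a final end-of-sentence marker
        yield "\n".join(buffer)


raw = ("春\t名詞,一般,*,*,*,*,春,ハル,ハル\n"
       "は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\n"
       "EOS\n"
       "少し\t副詞,助詞類接続,*,*,*,*,少し,スコシ,スコシ\n"
       "EOS\n")

print(len(list(chunk_by_line(raw))))      # 5 chunks: 3 tokens + 2 EOS lines
print(len(list(chunk_by_sentence(raw))))  # 2 chunks: one per sentence
```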

In [6]:
from mecabwrap import do_mecab_iter

ins = ['春はあけぼの', 'やうやう白くなりゆく山際', '少し明かりて', '紫だちたる雲の細くたなびきたる']

print('\n*** generating tokens ***')
i = 0
for text in do_mecab_iter(ins, byline=True):
    i += 1
    print('(' + str(i) + ')\t' + text)
    
print('\n*** generating tokenized sentences ***')
i = 0
for text in do_mecab_iter(ins, '-E', '（文の終わり）', byline=False):
    i += 1
    print('---(' + str(i) + ')\n' + text)


*** generating tokens ***
(1)	春	名詞,一般,*,*,*,*,春,ハル,ハル
(2)	は	助詞,係助詞,*,*,*,*,は,ハ,ワ
(3)	あけぼの	名詞,固有名詞,地域,一般,*,*,あけぼの,アケボノ,アケボノ
(4)	EOS
(5)	やうやう	副詞,一般,*,*,*,*,やうやう,ヤウヤウ,ヨーヨー
(6)	白く	形容詞,自立,*,*,形容詞・アウオ段,連用テ接続,白い,シロク,シロク
(7)	なり	動詞,自立,*,*,五段・ラ行,連用形,なる,ナリ,ナリ
(8)	ゆく	動詞,非自立,*,*,五段・カ行促音便ユク,基本形,ゆく,ユク,ユク
(9)	山際	名詞,一般,*,*,*,*,山際,ヤマギワ,ヤマギワ
(10)	EOS
(11)	少し	副詞,助詞類接続,*,*,*,*,少し,スコシ,スコシ
(12)	明かり	名詞,一般,*,*,*,*,明かり,アカリ,アカリ
(13)	て	助詞,格助詞,連語,*,*,*,て,テ,テ
(14)	EOS
(15)	紫	名詞,一般,*,*,*,*,紫,ムラサキ,ムラサキ
(16)	だ	助動詞,*,*,*,特殊・ダ,基本形,だ,ダ,ダ
(17)	ち	動詞,自立,*,*,五段・ラ行,体言接続特殊２,ちる,チ,チ
(18)	たる	助動詞,*,*,*,文語・ナリ,体言接続,たり,タル,タル
(19)	雲	名詞,一般,*,*,*,*,雲,クモ,クモ
(20)	の	助詞,連体化,*,*,*,*,の,ノ,ノ
(21)	細く	形容詞,自立,*,*,形容詞・アウオ段,連用テ接続,細い,ホソク,ホソク
(22)	たなびき	動詞,自立,*,*,五段・カ行イ音便,連用形,たなびく,タナビキ,タナビキ
(23)	たる	助動詞,*,*,*,文語・ナリ,体言接続,たり,タル,タル
(24)	EOS

*** generating tokenized sentences ***
---(1)
春	名詞,一般,*,*,*,*,春,ハル,ハル
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
あけぼの	名詞,固有名詞,地域,一般,*,*,あけぼの,アケボノ,アケボノ
（文の終わり）
---(2)
やうやう	副詞,一般,*,*,*,*,やうやう,ヤウヤウ,ヨーヨー
白く	形容詞,自立,*,*,形容詞・アウオ段,連用テ接続,白い,シロク,シ

### Writing the outcome to a file

To write the MeCab results directly to a file, one may use either the `-o` option or the `outpath` argument.  Note that neither works with `do_mecab_iter`, since it is designed to write the results to a temporary file.


In [7]:
do_mecab('すもももももももものうち', '-osumomo1.txt')
# or,
do_mecab('すもももももももものうち', outpath='sumomo2.txt')

with open('sumomo1.txt') as f: 
    print(f.read())
with open('sumomo2.txt') as f: 
    print(f.read())

import os
# clean up
os.remove('sumomo1.txt')
os.remove('sumomo2.txt')


# these raise errors
try:
    res = do_mecab_iter(['すもももももももものうち'], '-osumomo3.txt')
    next(res)
except Exception as e:
    print(e)

try:
    res = do_mecab_iter(['すもももももももものうち'], outpath='sumomo3.txt')
    next(res)
except Exception as e:
    print(e)

すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS

すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS

`-o` option is not supported for `do_mecab_iter`
`outpath` option is not supported for `do_mecab_iter`


### Using Dictionary (v0.3.0+)

The `do_` functions accept a `dictionary` option that specifies the location of the system dictionary directory.
`dictionary` can be either:

- a path to the dictionary directory
- a sub-directory name under MeCab's default dicdir (note: `mecab-config` is required for this)

This provides an intuitive syntax for using extended dictionaries such as [ipadic-neologd](https://github.com/neologd/mecab-ipadic-neologd) or [unidic-neologd](https://github.com/neologd/mecab-unidic-neologd).

In [8]:
# this cell assumes that mecab-ipadic-neologd is already installed
# otherwise, follow the instruction at https://github.com/neologd/mecab-ipadic-neologd
print("*** Default ipadic ***")
print(do_mecab("メロンパンを食べたい"))

print("*** With ipadic neologd ***")
print(do_mecab("メロンパンを食べたい", dictionary="mecab-ipadic-neologd"))

# this is equivalent to giving the path
dicdir, = !mecab-config --dicdir
print(do_mecab("メロンパンを食べたい",
               dictionary=os.path.join(dicdir, "mecab-ipadic-neologd")))

*** Default ipadic ***
メロン	名詞,一般,*,*,*,*,メロン,メロン,メロン
パン	名詞,一般,*,*,*,*,パン,パン,パン
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
食べ	動詞,自立,*,*,一段,連用形,食べる,タベ,タベ
たい	助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
EOS

*** With ipadic neologd ***
メロンパン	名詞,固有名詞,一般,*,*,*,メロンパン,メロンパン,メロンパン
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
食べ	動詞,自立,*,*,一段,連用形,食べる,タベ,タベ
たい	助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
EOS

メロンパン	名詞,固有名詞,一般,*,*,*,メロンパン,メロンパン,メロンパン
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
食べ	動詞,自立,*,*,一段,連用形,食べる,タベ,タベ
たい	助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
EOS
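
The two accepted forms of `dictionary` can be pictured with the following sketch. `resolve_dictionary` is a hypothetical helper, and the rule it applies (an existing directory is used as given, anything else is looked up under the default dicdir) is an assumption made for illustration, not necessarily how the package decides. In practice the default dicdir would come from `mecab-config --dicdir`; here a temporary directory stands in for it.

```python
import os
import tempfile


def resolve_dictionary(dictionary, default_dicdir):
    """Return a directory path for `dictionary`.

    Assumed rule: an existing directory is used as given; anything else
    is treated as a sub-directory name under `default_dicdir`.
    """
    if os.path.isdir(dictionary):
        return dictionary
    return os.path.join(default_dicdir, dictionary)


# Demonstration with a temporary directory standing in for MeCab's dicdir.
with tempfile.TemporaryDirectory() as dicdir:
    neologd = os.path.join(dicdir, "mecab-ipadic-neologd")
    os.mkdir(neologd)
    # A bare name is resolved under the default dicdir ...
    print(resolve_dictionary("mecab-ipadic-neologd", dicdir) == neologd)
    # ... while a full path is returned unchanged.
    print(resolve_dictionary(neologd, dicdir) == neologd)
```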



### Very Long Input and Buffer Size (v0.2.3+)

When the input text is longer than the input buffer size (default: 8192), MeCab automatically splits it into two "sentences" by inserting an extra EOS (and a few characters are lost around the split point).
As a result, `do_mecab_vec` and `do_mecab_iter` might produce more output elements than there are inputs.

The `do_` functions provide two workarounds for this:
1.  If the option `auto_buffer_size` is `True`, the `input-buffer-size` option is automatically raised to a level large enough to cover all input texts.  Note that this does not work when the input size exceeds MeCab's maximum buffer size, `8192 * 640` ~ 5MB.
1.  If the option `truncate` is `True`, input texts are truncated so that they fit within the input buffer size.

Note that `do_mecab` does not have these features.
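
The following sketch shows how one might check in advance whether a text exceeds the buffer, and what byte-level truncation looks like. The helper names are hypothetical, the two size constants are taken from the description above, and the exact boundary at which the package truncates is an assumption.

```python
DEFAULT_BUFFER_SIZE = 8192        # MeCab default, as stated above
MAX_BUFFER_SIZE = 8192 * 640      # MeCab maximum, as stated above


def byte_length(text, encoding="utf-8"):
    """Size of `text` in bytes once encoded for MeCab."""
    return len(text.encode(encoding))


def truncate_to_bytes(text, limit, encoding="utf-8"):
    """Cut `text` so that its encoded size does not exceed `limit` bytes.

    errors='ignore' drops a multi-byte character that was cut in half.
    """
    return text.encode(encoding)[:limit].decode(encoding, errors="ignore")


x = "すもももももももものうち!" * 225
size = byte_length(x)
print("bytes:", size)                                   # 8325
print("exceeds default:", size > DEFAULT_BUFFER_SIZE)   # True
print("fits maximum:", size <= MAX_BUFFER_SIZE)         # True

shortened = truncate_to_bytes(x, DEFAULT_BUFFER_SIZE)
print("after truncation:", byte_length(shortened))
```

The size is measured in bytes rather than characters because each Japanese character here occupies three bytes in UTF-8.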

In [9]:
import warnings

x = 'すもももももももものうち!' * 225
print("input buffer size =", len(x.encode()))

with warnings.catch_warnings(record=True) as w:
    res1 = list(do_mecab_iter([x]))
# the text is split into two since it exceeds the input buffer size
print("output length =", len(res1))

print('***\nEnd of the first element')
print(res1[0][-150:])

print('***\nBeginning of the second element')
print(res1[1][0:150])

output would contain extra EOS


input buffer size = 8325
output length = 2
***
End of the first element
モ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
!	名詞,サ変接続,*,*,*,*,*
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
記号,一般,*,*,*,*,*
EOS
***
Beginning of the second element
記号,一般,*,*,*,*,*
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
!	名詞,サ変接続,*,*,*,*,*
すもも	名詞,一般


In [10]:
import re

res2 = list(do_mecab_iter([x], auto_buffer_size=True))
print("output length =", len(res2))

print('***\nEnd of the first element')
print(res2[0][-150:])

# count the number of '!', to confirm all 225 repetitions are covered
print('number of "!" =', len(re.findall(r'!', ''.join(res2))))

print()
res3 = list(do_mecab_iter([x], truncate=True))
print("output length =", len(res3))

print('***\nEnd of the first element')
print(res3[0][-150:])

# count the number of '!', to confirm some are not covered due to truncation
print('number of "!" =', len(re.findall(r'!', ''.join(res3))))


output length = 1
***
End of the first element
も	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
!	名詞,サ変接続,*,*,*,*,*
EOS
number of "!" = 225

output length = 1
***
End of the first element
モ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
!	名詞,サ変接続,*,*,*,*,*
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
記号,一般,*,*,*,*,*
EOS
number of "!" = 221


### Batch processing (v0.3.2+)

The `mecab_batch` function supports multiple text inputs.
It takes a list of strings and applies the MeCab tokenizer to each.
The output is a list of tokenization results, one per input string.

The `mecab_batch_iter` function works similarly but returns a generator instead.

In [11]:
from mecabwrap import mecab_batch

x = ["明日は晴れるかな", "雨なら読書をしよう"]
mecab_batch(x)

[[Token(surface='明日', pos='名詞', pos1='副詞可能', pos2=None, pos3=None, infl_type=None, infl_form=None, baseform='明日', reading='アシタ', phoenetic='アシタ', lemma=None, lemma_reading=None),
  Token(surface='は', pos='助詞', pos1='係助詞', pos2=None, pos3=None, infl_type=None, infl_form=None, baseform='は', reading='ハ', phoenetic='ワ', lemma=None, lemma_reading=None),
  Token(surface='晴れる', pos='動詞', pos1='自立', pos2=None, pos3=None, infl_type='一段', infl_form='基本形', baseform='晴れる', reading='ハレル', phoenetic='ハレル', lemma=None, lemma_reading=None),
  Token(surface='か', pos='助詞', pos1='副助詞／並立助詞／終助詞', pos2=None, pos3=None, infl_type=None, infl_form=None, baseform='か', reading='カ', phoenetic='カ', lemma=None, lemma_reading=None),
  Token(surface='な', pos='助詞', pos1='終助詞', pos2=None, pos3=None, infl_type=None, infl_form=None, baseform='な', reading='ナ', phoenetic='ナ', lemma=None, lemma_reading=None)],
 [Token(surface='雨', pos='名詞', pos1='一般', pos2=None, pos3=None, infl_type=None, infl_form=None, baseform='雨', readi

By default, each string is converted into a list of `Token` objects.
To obtain a more concise result, we can pass a converter function as the `format_func` option.
`format_func` must be a function that takes a single `Token` object and returns the converted value.

In [12]:
# use baseform if exists, otherwise surface
mecab_batch(x, format_func=lambda x: x.baseform or x.surface)

[['明日', 'は', '晴れる', 'か', 'な'], ['雨', 'だ', '読書', 'を', 'する', 'う']]

We can keep only certain parts of speech with the `pos_filter` option.
More complex filtering can be achieved with the `filter_func` option.

In [13]:
mecab_batch(x, format_func=lambda x: x.baseform or x.surface, pos_filter=("名詞", "動詞"))

[['明日', '晴れる'], ['雨', '読書', 'する']]

In [14]:
mecab_batch(x, format_func=lambda x: x.baseform or x.surface, 
            filter_func=lambda x: len(x.surface)==2)

[['明日'], ['だ', '読書', 'する']]
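
How the three options interact can be sketched without MeCab. `MiniToken` and `convert` below are hypothetical stand-ins, and the order of operations (filter on the token first, then format the survivors) is an assumption, though it is consistent with the outputs above, where `filter_func` looks at `surface` while the result shows `baseform`.

```python
from collections import namedtuple

# Reduced stand-in for the package's Token, with only the fields used here.
MiniToken = namedtuple("MiniToken", ["surface", "pos", "baseform"])


def convert(tokens, format_func=None, pos_filter=None, filter_func=None):
    """Filter tokens, then format the survivors (assumed order of operations)."""
    result = []
    for tok in tokens:
        if pos_filter is not None and tok.pos not in pos_filter:
            continue
        if filter_func is not None and not filter_func(tok):
            continue
        result.append(format_func(tok) if format_func is not None else tok)
    return result


tokens = [MiniToken("明日", "名詞", "明日"), MiniToken("は", "助詞", "は"),
          MiniToken("晴れる", "動詞", "晴れる"), MiniToken("か", "助詞", "か"),
          MiniToken("な", "助詞", "な")]

print(convert(tokens, format_func=lambda t: t.baseform or t.surface,
              pos_filter=("名詞", "動詞")))
# ['明日', '晴れる']
```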

### Scikit-learn compatible transformer

`MecabTokenizer` is a scikit-learn compatible transformer that applies `mecab_batch` to a list of string inputs.

In [15]:
from mecabwrap import MecabTokenizer

tokenizer = MecabTokenizer(format_func=lambda x: x.surface)
tokenizer.transform(x)

[['明日', 'は', '晴れる', 'か', 'な'], ['雨', 'なら', '読書', 'を', 'しよ', 'う']]

In [16]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
x = ["明日は晴れるかな", "明日天気になあれ"]

p = Pipeline([
    ("mecab", MecabTokenizer(format_func=lambda x: x.surface)),
    ("tfidf", TfidfVectorizer(tokenizer=lambda x: x, lowercase=False))
])

y = p.fit_transform(x).todense()
pd.DataFrame(y, columns=p.steps[-1][-1].get_feature_names())

|   | あれ | か | な | に | は | 天気 | 明日 | 晴れる |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.499221 | 0.3552 | 0.0 | 0.499221 | 0.0 | 0.3552 | 0.499221 |
| 1 | 0.499221 | 0.0 | 0.3552 | 0.499221 | 0.0 | 0.499221 | 0.3552 | 0.0 |


### Note on Python 2

All text inputs are assumed to be unicode.  
In Python 2, inputs must be `u''` strings, not `''`.
In Python 3, the `str` type is unicode, so `u''` and `''` are equivalent.

In [17]:
o1 = do_mecab('すもももももももものうち')   # this works only for python 3
o2 = do_mecab(u'すもももももももものうち')  # this works both for python 2 and 3
print(o1)
print(o2)

すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS

すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
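
When the input may arrive as raw bytes (for example, a `''` literal in Python 2 or data read from a binary file), one option is to decode it before calling the package. `ensure_text` below is a hypothetical helper, not part of mecabwrap, and the UTF-8 default is an assumption about the source encoding.

```python
def ensure_text(value, encoding="utf-8"):
    """Return `value` as unicode text, decoding it first if it is bytes."""
    # In Python 2, `bytes` is an alias of `str`, so plain '' literals are decoded.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw = u"すもももももももものうち".encode("utf-8")
print(ensure_text(raw) == u"すもももももももものうち")  # True
```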



### Note on dictionary encodings

These functions take a `mecab_enc` option, which indicates the encoding of the MeCab dictionary being used.  Usually this can be left at the default value `None`, in which case the encoding is detected automatically.  Alternatively, one may specify the encoding explicitly.

In [18]:
# show mecab dict
! mecab -D | grep charset
print()

o1 = do_mecab('日本列島改造論', mecab_enc=None)      # default
print(o1)

o2 = do_mecab('日本列島改造論', mecab_enc='utf-8')   # explicitly specified
print(o2)

#o3 = do_mecab('日本列島改造論', mecab_enc='cp932')   # wrong encoding, fails


charset:	UTF-8

日本	名詞,固有名詞,地域,国,*,*,日本,ニッポン,ニッポン
列島	名詞,一般,*,*,*,*,列島,レットウ,レットー
改造	名詞,サ変接続,*,*,*,*,改造,カイゾウ,カイゾー
論	名詞,接尾,一般,*,*,*,論,ロン,ロン
EOS

日本	名詞,固有名詞,地域,国,*,*,日本,ニッポン,ニッポン
列島	名詞,一般,*,*,*,*,列島,レットウ,レットー
改造	名詞,サ変接続,*,*,*,*,改造,カイゾウ,カイゾー
論	名詞,接尾,一般,*,*,*,論,ロン,ロン
EOS
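
One conceivable way to find the dictionary encoding is to read the `charset` line printed by `mecab -D`, as in the cell above. The sketch below parses such text; `parse_charset` is a hypothetical helper, and whether the package detects the encoding this way is an assumption.

```python
def parse_charset(dict_info):
    """Extract the charset from text in the format printed by `mecab -D`."""
    for line in dict_info.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "charset":
            # Normalize to lower case, e.g. 'UTF-8' -> 'utf-8'.
            return value.strip().lower()
    return None  # no charset line found


sample = "charset:\tUTF-8\n"   # line copied from the output above
print(parse_charset(sample))   # utf-8
```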

