RcppMeCab is an Rcpp wrapper for the part-of-speech morphological analyzer MeCab. It supports native UTF-8 encoding in C++ code and the CJK (Chinese, Japanese, and Korean) MeCab libraries. The package uses the power Rcpp brings to R computation to analyze texts faster.
Linux and Mac OSX
First, install MeCab for your language of choice.
- Korean: MeCab-Ko from the Bitbucket repository
- Chinese: MeCab Chinese Dic from MeCab-Chinese
Second, you can install RcppMeCab from CRAN with:
    install.packages("RcppMeCab") # build from source

    # install the development version
    # install.packages("devtools")
    devtools::install_github("junhewk/RcppMeCab")
Set the language you want to analyze with the environment variable MECAB_LANG before installing the package. The default value is ko; if you want to analyze Japanese or Chinese, set it to ja.
    install.packages("RcppMeCab") # installs the Korean version (default)

    # or, install for Japanese
    Sys.setenv(MECAB_LANG = 'ja') # set before installing the Japanese version
    install.packages("RcppMeCab", type = "source") # build from source

    # install the development version
    # install.packages("devtools")
    devtools::install_github("junhewk/RcppMeCab")
For analysis, you also need the MeCab binary and a dictionary.
Install the MeCab binary, then pass the dictionary directory to the RcppMeCab function. For example:
pos(sentence, sys_dic = "C:/PROGRA~2/mecab/dic/ipadic")
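To avoid typing the dictionary location on every call, the path can be stored once per session in the mecabSysDic option and read back when needed. The following is a minimal sketch, assuming the example path above exists on your machine; the sample sentence is only an illustration, and the analyzer is called only when both the package and the directory are available.

```r
# Store the system dictionary location once per session.
# The path is the example path from above; adjust it to your installation.
options(mecabSysDic = "C:/PROGRA~2/mecab/dic/ipadic")

# Read the option back; fall back to "" when it has not been set.
sys_dic <- getOption("mecabSysDic", default = "")

# Call the analyzer only when the package and the dictionary are available.
if (requireNamespace("RcppMeCab", quietly = TRUE) && dir.exists(sys_dic)) {
  sentence <- enc2utf8("すもももももももものうち")  # illustrative Japanese input
  print(RcppMeCab::pos(sentence, sys_dic = sys_dic))
}
```

Passing sys_dic explicitly keeps the call readable even though the option is set.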
This package provides two functions, pos and posParallel:
    pos(sentence) # returns a list; each sentence appears in the names of the list
    pos(sentence, join = FALSE) # yields morphemes only (tags are given in the vector names)
    pos(sentence, format = "data.frame") # returns the result as a data frame
    pos(sentence, user_dic = user_dic) # uses a compiled user dictionary
    posParallel(sentence, user_dic = user_dic) # parallelized version; uses more memory, but is much faster than a single-threaded loop
- sentence: the text to analyze
- join: if TRUE, the output form is (morpheme/tag). If FALSE, the output form is (morpheme), with the tag stored in the vector names.
- format: the default is a list. If set to "data.frame", the function returns the result as a data frame.
- sys_dic: the directory in which the dicrc file is located. The default value is "", or you can set your own default with options(mecabSysDic = "")
- user_dic: a user dictionary file compiled by mecab-dict-index. The default value is also "".
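With join = TRUE, each element is a string of the form morpheme/tag. If separate columns are needed afterwards, these strings can be split in base R. The sketch below rests on assumptions: split_tagged is a hypothetical helper, not part of RcppMeCab, and the input vector is hand-written in the (morpheme/tag) form rather than real analyzer output.

```r
# Hypothetical helper (not part of RcppMeCab): split "morpheme/tag" strings
# into a two-column data frame. The split happens at the LAST slash, so a
# morpheme that itself contains "/" is kept intact.
split_tagged <- function(tagged) {
  data.frame(
    morpheme = sub("/[^/]*$", "", tagged),  # drop the final "/tag" part
    tag      = sub("^.*/", "", tagged),     # keep only what follows the last "/"
    stringsAsFactors = FALSE
  )
}

# Hand-written input in the (morpheme/tag) form; real tags depend on the
# dictionary in use.
tagged <- c("word/TAG1", "a/b/TAG2")
split_tagged(tagged)
```

When a data frame is all that is needed, format = "data.frame" is likely the simpler route; a helper like this is mainly useful for output that was already collected as a list.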
Compiling User Dictionary
The MeCab API has DictionaryCompiler, but it contains die(). Hence, calling it from Rcpp crashes the entire R session, and it will not be included in this package.
Please refer to MeCab for Japanese.
Unix and Mac OSX
You need a model_file if you want the library to estimate the cost automatically.
    $ /usr/local/libexec/mecab/mecab-dict-index -m `model_file` -d `mecab_dic_location` -u `user_dictionary_file_name` -f `CSV file charset` -t `original dictionary charset` `target_csv`

    # example
    $ /usr/local/libexec/mecab/mecab-dict-index -m /usr/local/lib/mecab/dic/mecab-ko-dic/model.bin -d ~/mecab-ko-dic-2.0.3-20170922 -u userdic.dic -f utf8 -t utf8 ~/person.csv
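When the user dictionary is rebuilt as part of an R script, the same command line can be assembled in R. This is a sketch under assumptions: dict_index_args is a hypothetical helper that only builds the argument vector from the flags shown above, the paths are the ones from the example, and the system2 call is left commented out so nothing runs by accident.

```r
# Hypothetical helper: build the mecab-dict-index argument vector using the
# flags shown above (-m, -d, -u, -f, -t). model_file is optional.
dict_index_args <- function(dic_dir, user_dic, csv, model_file = NULL,
                            csv_charset = "utf8", dic_charset = "utf8") {
  args <- c("-d", dic_dir, "-u", user_dic,
            "-f", csv_charset, "-t", dic_charset, csv)
  # Prepend the model file only when one is supplied.
  if (!is.null(model_file)) args <- c("-m", model_file, args)
  args
}

args <- dict_index_args(
  dic_dir    = "~/mecab-ko-dic-2.0.3-20170922",
  user_dic   = "userdic.dic",
  csv        = "~/person.csv",
  model_file = "/usr/local/lib/mecab/dic/mecab-ko-dic/model.bin"
)

# Uncomment to run; the binary location is the one used in the example above.
# system2("/usr/local/libexec/mecab/mecab-dict-index",
#         args = shQuote(path.expand(args)))
```

Building the arguments as a vector, rather than pasting one long string, makes it easier to quote paths that contain spaces.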
On Windows, the MeCab binary version also includes mecab-dict-index.
You can use it in the same way the Linux binary compiles the dictionary.
- Test multilanguage support
- Provide other useful functions
- Provide multilanguage manuals for international support
Junhewk Kim (firstname.lastname@example.org)