Fugashi is a Cython wrapper for MeCab.
See the blog post for background on why Fugashi exists and some of the design decisions.
Any reasonable version of MeCab should work, but it's recommended you install from source.

    from fugashi import Tagger

    tagger = Tagger('-Owakati')
    text = "麩菓子（ふがし）は、麩を主材料とした日本の菓子。"
    tagger.parse(text)
    # => '麩 菓子 （ ふ が し ） は 、 麩 を 主材 料 と し た 日本 の 菓子 。'

    for word in tagger.parseToNodeList(text):
        print(word, word.feature.lemma, word.pos, sep='\t')
        # "feature" is the Unidic feature data as a named tuple

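With `-Owakati`, `parse` returns one space-separated string rather than a list. A minimal sketch of turning that string into a token list follows. It reuses the output string from the example above instead of calling MeCab, so it runs without Fugashi installed; the variable names are only illustrative.

```python
# Output string copied from the wakati example above (not produced here).
wakati = '麩 菓子 （ ふ が し ） は 、 麩 を 主材 料 と し た 日本 の 菓子 。'

# str.split() with no arguments splits on runs of whitespace and
# drops any leading or trailing whitespace, such as a final newline.
tokens = wakati.split()

print(len(tokens))
print(tokens[:2])
```

If you need more than the surface forms (lemma, part of speech), iterate over nodes as in the loop above instead of splitting the string.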
Fugashi is written with the assumption you'll use Unidic to process Japanese, but it supports arbitrary dictionaries.
If you're using a dictionary besides Unidic you can use the GenericTagger like this:

    from fugashi import GenericTagger

    tagger = GenericTagger()
    text = 'something'

    # parse can be used as normal
    tagger.parse(text)

    # features from the dictionary can be accessed by field numbers
    for word in tagger.parseToNodeList(text):
        print(word.surface, word.feature[0])

You can also create a dictionary wrapper to get feature information as a named tuple.

    from fugashi import GenericTagger, create_feature_wrapper

    # 'alpha beta gamma' are placeholder names for the dictionary's feature fields
    CustomFeatures = create_feature_wrapper('CustomFeatures', 'alpha beta gamma')
    tagger = GenericTagger(wrapper=CustomFeatures)

    text = 'something'
    for word in tagger.parseToNodeList(text):
        print(word.surface, word.feature.alpha)

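To make the idea of a feature wrapper concrete, here is a plain-Python sketch of mapping a comma-separated feature string onto a named tuple. This is not Fugashi's implementation: `wrap_features` and the sample feature string are hypothetical, and the sketch assumes fields never contain quoted commas, which real dictionary output may.

```python
from collections import namedtuple

# Hypothetical field names, matching the 'alpha beta gamma' example above.
CustomFeatures = namedtuple('CustomFeatures', 'alpha beta gamma')


def wrap_features(raw, wrapper):
    """Split a comma-separated feature string and wrap it in a named tuple.

    Illustrative helper only; assumes no quoted commas inside fields.
    """
    fields = raw.split(',')
    return wrapper(*fields)


# Made-up feature string with three fields.
features = wrap_features('noun,common,general', CustomFeatures)
print(features.alpha)
print(features[1])
```

Named access (`features.alpha`) and positional access (`features[1]`) both work on the result, which is why a wrapper is convenient when a dictionary has many fields.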
If you have a problem with Fugashi, feel free to open an issue. However, in some cases a different library may be a better fit.
- If you want to use MeCab but don't have a C compiler, use natto-py.
- If you don't want to deal with installing MeCab at all, try SudachiPy.
Note that these are both slower than Fugashi according to a benchmark I wrote.