## 2.1 An Introduction to fugashi 


In this section you'll learn how to do Japanese tokenization using fugashi, a MeCab wrapper, and the unidic-lite dictionary. 



|surface|pos1|pos2|pos3|lemma|pron|kana|goshu|
|-------|----|----|----|-----|----|----|-----|
|喫茶|名詞|普通名詞|一般|喫茶|キッサ|キッサ|漢|
|店|接尾辞|名詞的|一般|店|テン|テン|漢|
|と|助詞|格助詞|*|と|ト|ト|和|
|カフェ|名詞|普通名詞|一般|カフェ-cafe|カフェ|カフェ|外|
|の|助詞|格助詞|*|の|ノ|ノ|和|
|違い|名詞|普通名詞|一般|違い|チガイ|チガイ|和|
|は|助詞|係助詞|*|は|ワ|ハ|和|
|意外|形状詞|一般|*|意外|イガイ|イガイ|漢|
|と|助詞|格助詞|*|と|ト|ト|和|
|明確|形状詞|一般|*|明確|メーカク|メイカク|漢|


The table above is an example of the output available from fugashi and UniDic. Besides splitting the text into tokens, it includes a variety of information about each token. The columns shown are only some of the fields available in UniDic.



### Setup 


First you'll need to install fugashi and the dictionary.  


**fugashi** is a wrapper for **MeCab**, a classic Japanese morphological analyzer. fugashi uses Cython to access MeCab's C interface, and also includes some convenient tweaks to make it easier to use in Python. 


[wrapper_ja]: A "wrapper" is a program or library that "wraps" another program in order to provide a different API.

**unidic-lite** is a slightly modified version of UniDic 2.1.2. That version of UniDic is somewhat old, but it's small enough that it's easy to install, and high quality enough that it's sufficient for most applications. The **unidic** package on PyPI wraps the latest edition of UniDic, but due to a large increase in dictionary entries, it's harder to set up, so we won't use it for this tutorial. 


At the time of writing, the latest version of fugashi is 1.1.0 and the latest version of unidic-lite is 1.0.8. unidic-lite will work on any system, and fugashi distributes ready-to-use "wheels" for macOS, Linux, and 64-bit Windows. (On other operating systems you may have to build from source. If you run into trouble, please feel free to [open an issue](https://github.com/polm/fugashi).)



In [1]:
%%capture
!pip install fugashi unidic-lite

Now that fugashi is installed, you can confirm it works by running it in the terminal. Try running `fugashi -O wakati` and then typing some Japanese. When you press Enter, your input text will be printed with spaces separating tokens. You can use `CTRL+D` to terminate the process. Here's some example output:



In [2]:
!echo "毎年東麻布ではかかし祭りが開催されます" | fugashi -O wakati

毎年 東 麻布 で は かかし 祭り が 開催 さ れ ます


Note: `wakati` comes from 分かち書き *wakachigaki*, the practice of writing Japanese with spaces included, as seen in children's books and on low-resolution displays. In MeCab it refers to the special output mode that just separates tokens with spaces. Real wakachigaki, however, uses spaces to separate bunsetsu (文節, phrase-like units), not tokens or words.


Next let's use fugashi in code. The main interface to the library is the `Tagger` object, which holds a variety of dictionary related state. The primary way to use the `Tagger` is to simply apply it to input text, which will return a list of `Node` objects. Each `Node` contains the raw text of the token in a `surface` property, and extended dictionary fields are available in the `feature` property. 



In [3]:
import fugashi

tagger = fugashi.Tagger()

text = "形態素解析をやってみた"
words = tagger(text)
print(words)
print("=====")

for word in words:
    print(word.surface, word.feature.lemma, word.feature.kana, sep="\t")

[形態, 素, 解析, を, やっ, て, み, た]
=====
形態	形態	ケイタイ
素	素	ソ
解析	解析	カイセキ
を	を	ヲ
やっ	遣る	ヤッ
て	て	テ
み	見る	ミ
た	た	タ


Note: In Japanese NLP, it's standard to refer to the raw input text form as the "surface" (表層 *hyousou*), and MeCab uses this in its API. This usage comes from linguistics, where the **surface form** of a word in a particular context (which may be inflected or have unusual orthography) is contrasted with the **lexical form**, which would be a normalized or dictionary form. 
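
To make the surface/lexical distinction concrete, here is a small sketch that reuses the surface and lemma pairs printed in the output above and picks out the tokens whose surface differs from their dictionary form. The pairs are copied by hand so that the snippet runs without a tokenizer; with fugashi you would read them from `word.surface` and `word.feature.lemma`.

```python
# (surface, lemma) pairs copied from the output of the previous cell.
pairs = [
    ("形態", "形態"),
    ("素", "素"),
    ("解析", "解析"),
    ("を", "を"),
    ("やっ", "遣る"),
    ("て", "て"),
    ("み", "見る"),
    ("た", "た"),
]

# Tokens whose surface form is not the same as their lexical (dictionary) form.
inflected = [(surface, lemma) for surface, lemma in pairs if surface != lemma]

for surface, lemma in inflected:
    print(surface, "->", lemma)
```

Only the two verbs differ here: they appear in an inflected form and in kana, while the lemma is the dictionary form as UniDic writes it.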


For basic tokenization, this is all you need to know. In the next section, we'll look at a slightly more involved application of morphological analysis, and later in this chapter we'll cover advanced tokenization-related topics. 



### Morphological Analysis Mini Project: Automatic Fuseji 


*Fuseji* (伏せ字) is the practice of replacing some characters with placeholders, usually a circle, to conceal the content of words. A similar thing is sometimes done in English, particularly to avoid using obscene words (`a**hole`, "you little @#%(*!"). In Japanese fuseji can be used for obscene words, but they can also be used to avoid spoilers, be vague about the names of brands or specific people, or for other reasons. 


Let's pretend that we want to automatically apply fuseji for the purpose of hiding spoilers about new movies or other media. While the simplest thing is to replace characters at random from the whole string, it's better to replace certain kinds of words, such as proper nouns. We can use the detailed part of speech information in UniDic, along with word boundaries, to replace proper nouns with fuseji versions. 



In [4]:
from fugashi import Tagger
from random import sample

tagger = Tagger()


def fuseji_node(text, ratio=1.0):
    """Replace a share of the characters in a token's text with filler characters."""
    ll = len(text)
    idxs = sample(range(ll), max(1, int(ratio * ll)))
    out = []
    for ii, cc in enumerate(text):
        out.append("◯" if ii in idxs else cc)
    return "".join(out)


def fuseji_text(text, ratio=1.0):
    """Given an input string, apply fuseji. """
    out = []
    for node in tagger(text):
        # Normal Japanese text doesn't use white space, but preserving it is
        # necessary if the input includes Latin text, for example.
        out.append(node.white_space)
        if node.feature.pos2 != "固有名詞":
            out.append(node.surface)
        else:
            # Pass the ratio through so the argument actually has an effect.
            out.append(fuseji_node(node.surface, ratio))
    return "".join(out)


print(fuseji_text("犯人はヤス"))
print(fuseji_text("東京タワーの高さは333m"))

犯人は◯◯
◯◯タワーの高さは333m
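
The `ratio` argument controls how much of a word gets masked. The following is a minimal standalone sketch of the same masking logic, not a replacement for the function above. The `rng` parameter is an addition made for this illustration so that a seeded generator can be passed in for repeatable runs.

```python
import random


def fuseji_node(text, ratio=1.0, rng=random):
    """Mask roughly `ratio` of the characters in `text`, always at least one."""
    count = max(1, int(ratio * len(text)))
    # A set makes the membership check below cheap.
    hidden = set(rng.sample(range(len(text)), count))
    return "".join("◯" if i in hidden else ch for i, ch in enumerate(text))


rng = random.Random(0)  # seeded only so that repeated runs give the same result
for ratio in (0.25, 0.5, 1.0):
    print(ratio, fuseji_node("ガーランド", ratio, rng), sep="\t")
```

Because the count is `max(1, int(ratio * len(text)))`, a five-character word has one character masked at 0.25, two at 0.5, and all five at 1.0. Which positions are masked depends on the random generator.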


This code is already reasonably effective, but there are several ways it could be tweaked or improved. For example, sometimes the words that should be concealed aren't just proper nouns; they could also be ordinary nouns or verbs. 


How can we find what parts of speech we want to filter? The best way is to use **example sentences** to find what parts of speech we want, as well as to get a better understanding of where our program works well and where it doesn't. 


- 新キャラの「カズヤ」は年内に配信予定
- マジルテの水晶の畑エリアにはクリスタルが沢山ある
- 「吾輩は猫である」の作家は夏目漱石
- 『さかしま』（仏: À rebours）は、フランスの作家ジョリス＝カルル・ユイスマンスによる小説

We can check the parts of speech of words in fugashi by using the `node.pos` attribute. This part of speech information comes from UniDic and uses four levels. You can access the individual levels as `node.feature.pos1`, `node.feature.pos2`, and so on. The `node.pos` attribute is a convenience feature that joins the four separate values together and replaces empty values with an asterisk (`*`). 


[^pos_ryaku]: `pos` is short for "part of speech" (品詞).
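
Since the levels run from general to specific, a common pattern is to compare only the first one or two. Here is a small sketch of that idea. The `Feature` tuple is a hypothetical stand-in for `node.feature` that holds only the four part of speech fields, and `pos_startswith` is a helper name invented for this example, not part of fugashi.

```python
from collections import namedtuple

# Hypothetical stand-in for node.feature; real UniDic features have many more fields.
Feature = namedtuple("Feature", "pos1 pos2 pos3 pos4")


def pos_tuple(feature):
    """Return the four part of speech levels as a tuple."""
    return (feature.pos1, feature.pos2, feature.pos3, feature.pos4)


def pos_startswith(feature, *levels):
    """True if the leading part of speech levels equal `levels`."""
    return pos_tuple(feature)[: len(levels)] == levels


azabu = Feature("名詞", "固有名詞", "地名", "一般")  # a place name
de = Feature("助詞", "格助詞", "*", "*")  # a case particle; unused levels are "*"

print(pos_startswith(azabu, "名詞"))
print(pos_startswith(azabu, "名詞", "固有名詞"))
print(pos_startswith(de, "名詞"))
```

With real nodes the same comparison would read `node.feature.pos1` and `node.feature.pos2` directly, as the fuseji code does.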

You can check part of speech tags of words by giving a sentence as input with fugashi on the command line, without giving the `-O wakati` command line argument. 



In [5]:
!echo "毎年東麻布ではかかし祭りが開催されます" | fugashi

毎年	マイトシ	マイトシ	毎年	名詞-普通名詞-副詞可能			0
東	ヒガシ	ヒガシ	東	名詞-普通名詞-一般			0,3
麻布	アザブ	アザブ	アザブ	名詞-固有名詞-地名-一般			0
で	デ	デ	で	助詞-格助詞			
は	ワ	ハ	は	助詞-係助詞			
かかし	カカシ	カカス	欠かす	動詞-一般	五段-サ行	連用形-一般	0,2
祭り	マツリ	マツリ	祭り	名詞-普通名詞-一般			0
が	ガ	ガ	が	助詞-格助詞			
開催	カイサイ	カイサイ	開催	名詞-普通名詞-サ変可能			0
さ	サ	スル	為る	動詞-非自立可能	サ行変格	未然形-サ	0
れ	レ	レル	れる	助動詞	助動詞-レル	連用形-一般	
ます	マス	マス	ます	助動詞	助動詞-マス	終止形-一般	
EOS


### Censoring Unknown Words 


Another thing that'll come up as we're testing is that sometimes words not in the dictionary will be used, like the names of characters in movies and books. From the example sentences above, マジルテ *Majirute*, the name of a fictional place, is an example of such a word. We basically always want to censor those words to avoid spoilers, so rather than checking part of speech information, we can also check specifically for words that aren't in our dictionary. These are called "unks", from "unknown words", or 未知語 *michigo* in Japanese. In fugashi you can determine if a given node is in the dictionary just by checking the `node.is_unk` attribute. 


Looking at our example sentences, some patterns emerge. We probably don't want to filter verbs, since it's hard to tell when a verb is important. Proper nouns should definitely be filtered. Common nouns may or may not be important, so it's hard to say if we should filter them - for now, let's leave them alone. 


Since our conditions for censoring words are getting kind of complicated, let's factor them into a function. 



In [6]:
def should_hide(node):
    """Check if this node should be hidden or not. """
    if node.is_unk:
        return True
    ff = node.feature
    if ff.pos1 == "名詞" and ff.pos2 == "固有名詞":
        return True
    return False


def fuseji_text(text, ratio=1.0):
    """Given an input string, apply fuseji. """
    out = []
    for node in tagger(text):
        out.append(node.white_space)
        word = fuseji_node(node.surface, ratio) if should_hide(node) else node.surface
        out.append(word)
    return "".join(out)


texts = [
    "犯人はヤス",
    "魔法の言葉はヒラケゴマ",
    "『さかしま』（仏: À rebours）は、フランスの作家ジョリス＝カルル・ユイスマンスによる小説",
    "鈴木爆発で最初に解体する爆弾はみかんの形をしている",
]

for text in texts:
    print(fuseji_text(text))

犯人は◯◯
魔法の言葉は◯◯◯◯◯
『さかしま』（仏: ◯ ◯◯◯◯◯◯◯）は、◯◯◯◯の作家◯◯◯◯＝◯◯◯・◯◯◯◯◯◯による小説
◯◯爆発で最初に解体する爆弾はみかんの形をしている
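
Whether to hide common nouns is a judgment call, so one option is to keep the parts of speech to hide in a set that can be swapped out. The sketch below assumes that approach. `make_node` builds hypothetical stand-ins that expose only the attributes used here, so the logic can be tried without running the tokenizer.

```python
from types import SimpleNamespace


def make_node(surface, pos1, pos2, is_unk=False):
    """Hypothetical stand-in for a fugashi node, with only the fields used below."""
    feature = SimpleNamespace(pos1=pos1, pos2=pos2)
    return SimpleNamespace(surface=surface, feature=feature, is_unk=is_unk)


PROPER_NOUNS = {("名詞", "固有名詞")}
ALL_NOUNS = PROPER_NOUNS | {("名詞", "普通名詞")}


def should_hide(node, hidden_pos=PROPER_NOUNS):
    """Hide unknown words and any word whose (pos1, pos2) is in `hidden_pos`."""
    if node.is_unk:
        return True
    return (node.feature.pos1, node.feature.pos2) in hidden_pos


nodes = [
    make_node("麻布", "名詞", "固有名詞"),
    make_node("祭り", "名詞", "普通名詞"),
    make_node("が", "助詞", "格助詞"),
    # For unknown words the part of speech is never consulted.
    make_node("マジルテ", "名詞", "普通名詞", is_unk=True),
]

for node in nodes:
    print(node.surface, should_hide(node), should_hide(node, ALL_NOUNS), sep="\t")
```

Passing `ALL_NOUNS` makes it easy to compare both policies against the example sentences before settling on one.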


### Use Readings to Censor only Part of Words 


At this point our program is pretty effective at applying fuseji to any text we throw at it. That said, censoring every character of a word is a little boring. It would be more interesting if we could reveal some letters so that readers can guess the rest of the word, but not quite be certain about it.


There is one potential issue though: if we use kanji, even one character might give the word away in a way that's not interesting. What if we could convert words to phonetic versions and *then* censor part of them? That would allow us to show part of the word while giving away less information.


Thankfully, UniDic includes a field we can use for this conversion. Every word in the UniDic dictionary has a `kana` field we can use to get the conventional reading for the word in katakana form. (UniDic also has a `pron` field, which uses non-standard orthography to differentiate long vowels.) 
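
To illustrate how the two fields differ, this short sketch compares the `pron` and `kana` values for three rows copied by hand from the table at the start of this section. With fugashi, the same values would come from `node.feature.pron` and `node.feature.kana`.

```python
# (surface, pron, kana) copied from the table at the start of this section.
rows = [
    ("喫茶", "キッサ", "キッサ"),
    ("は", "ワ", "ハ"),
    ("明確", "メーカク", "メイカク"),
]

# Surfaces whose pronunciation field differs from the conventional kana reading.
differing = [surface for surface, pron, kana in rows if pron != kana]

for surface, pron, kana in rows:
    note = "differs" if pron != kana else "same"
    print(surface, pron, kana, note, sep="\t")
```

The particle は is pronounced *wa* but written ハ, and the long vowel in 明確 is written with ー in `pron` but spelled out in `kana`. For fuseji the `kana` field is the better fit, since it matches how a reader would expect the word to be written.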


One thing to keep in mind is that the kana reading will only be available for words in UniDic, and it won't always be perfect. There are two cases where the reading will be missing or may be wrong:


1. The word is not in the dictionary. 
2. The reading of the word is ambiguous. 


If the word is not in the dictionary, it's possible to train a machine learning model or use other methods to predict the reading, but that's pretty difficult. So this time, if a word is an unk we'll just skip converting it and use the raw surface form. 


Ambiguous words are more difficult. Some examples of ambiguous words: 




- 東: *higashi* or *azuma* (or *tou*)
- 中田: *nakada* or *nakata*
- 仮名: *kana* or *kamei*
- 牧場: *bokujou* or *makiba*
- 網代: *amishiro* or *ajiro*
- 最中: *saichuu* or *monaka*
- 私: *watashi* or *watakushi*
- 日本: *nihon* or *nippon*

Usually a reading will be clear from context, but many ambiguous words are proper nouns like the names of people and places, and without knowing which specific entity it's referring to there's no way to be sure of the correct reading. Even worse, there's no way to be sure if the word you're looking at is ambiguous or not just using the tokenizer output. 


(Note: Words written the same way but pronounced differently are referred to as 同形異音語 *doukei iongo* or "heteronyms". They are also common in English, though less so for proper nouns.) 

Note: For ambiguous words, deciding their reading could be considered a form of **word sense disambiguation** for common nouns, or **entity linking** for proper nouns. Both are NLP problems with a long history. 


So how can we handle ambiguous words if we can't even identify them with certainty? It turns out that their difficulty actually has a silver lining - because even people make mistakes, we can get away with just using the kana UniDic gives us and hope that it's right most of the time. For serious applications replacing the original text with a mistake would be unacceptable, but for our fuseji application, it's not the end of the world if we're wrong occasionally.  


Sometimes when you learn about a problem confronting your NLP system, there may not be a solution you're able to implement. In this case, writing a program to disambiguate words would be much more work than the rest of our entire program. But by being aware of the problem, we can consider how failures affect the output of our system, and evaluate whether we should continue with its development, or start over with a design that can work around the problem. 


Now that we've settled that, let's change our code to use the kana instead of the surface when censoring words. 



In [7]:
def fuseji_text(text, ratio=0.5):
    """Given an input string, apply fuseji to the readings of hidden words. """
    out = []
    for node in tagger(text):
        out.append(node.white_space)
        # Unknown words have no dictionary reading, so fall back to the surface.
        node_text = node.surface if node.is_unk else node.feature.kana
        word = fuseji_node(node_text, ratio) if should_hide(node) else node.surface
        out.append(word)
    return "".join(out)


texts = [
    "黒幕の正体はガーランド",
]

for text in texts:
    print(fuseji_text(text))

黒幕の正体は◯ーラ◯ド
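
One consequence of masking the reading is that a hidden word can change length, because kanji are replaced by their kana before masking. The sketch below isolates that step as a plain function that can be run without the tokenizer. The name `mask_reading` and its parameters are made up for this illustration, and the guard for an empty or missing reading is a defensive assumption rather than documented fugashi behavior.

```python
import random


def mask_reading(surface, kana, is_unk, ratio=0.5, rng=random):
    """Mask part of a token, using its kana reading when one is available."""
    # Fall back to the surface for unknown words or a missing reading.
    text = surface if is_unk or not kana else kana
    count = max(1, int(ratio * len(text)))
    hidden = set(rng.sample(range(len(text)), count))
    return "".join("◯" if i in hidden else ch for i, ch in enumerate(text))


rng = random.Random(0)  # seeded only so that repeated runs give the same result
# A dictionary word: two kanji become a three-character reading before masking.
print(mask_reading("麻布", "アザブ", is_unk=False, rng=rng))
# An unknown word: no reading, so the surface itself is masked.
print(mask_reading("マジルテ", None, is_unk=True, rng=rng))
```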


And that makes our automatic fuseji program complete. It's not a lot of code, but in building this you learned how to: 




1. iterate over the tokens in a text
2. identify parts of speech of interest with example sentences
3. use multiple levels of part of speech tags
4. check if a token is in the dictionary or an unk
5. convert words to their phonetic representation


These are all basic building blocks you can use to build a wide variety of applications. 


While our motivation for this program was a simple and playful one, the techniques used here are simple versions of those used in **personally identifying information (PII) removal**, which removes identifying details from documents like medical and legal records so they can be used in audits or analysis without risk to the people they describe. 


To learn more about the tokenizer API, consider some ways you might want to extend this application and how you'd make the necessary changes. 




- What if you wanted to remove all numbers from a contract, to hide dates or prices?
- What if you wanted to hide a specific list of words, perhaps obscenities, rather than certain parts of speech?
- How would you change the program to replace hard-to-read words with their phonetic versions?
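
As one possible starting point for the first two ideas, here is a hedged sketch of a `should_hide` variant that checks a word list and, optionally, numeric tokens. The blocklist contents, the `hide_numbers` flag, and the `make_node` stand-in are all assumptions made for this example. The number check relies only on Python's `str.isnumeric`, not on any UniDic part of speech tag.

```python
from types import SimpleNamespace


def make_node(surface, lemma=None):
    """Hypothetical stand-in for a fugashi node, with only the fields used below."""
    return SimpleNamespace(surface=surface, feature=SimpleNamespace(lemma=lemma))


BLOCKLIST = {"遣る", "爆弾"}  # hypothetical list of words to hide


def should_hide(node, blocklist=BLOCKLIST, hide_numbers=False):
    """Hide listed words (by surface or lemma) and, optionally, numeric tokens."""
    if hide_numbers and node.surface.isnumeric():
        return True
    # Checking the lemma as well catches inflected forms of a listed word.
    return node.surface in blocklist or node.feature.lemma in blocklist


print(should_hide(make_node("やっ", lemma="遣る")))
print(should_hide(make_node("333"), hide_numbers=True))
print(should_hide(make_node("333")))
```

How numbers are split into tokens depends on the dictionary, so a real version of this should be checked against example sentences first, just like the part of speech rules earlier.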
