# Fugashi Word Count Tutorial

In this tutorial we'll use fugashi, a wrapper for the MeCab tokenizer and morphological analysis tool, to count words in 吾輩は猫である "I Am a Cat", the famous novel by Natsume Soseki. 

MeCab uses large dictionaries and statistical models to tokenize Japanese text. Since the dictionaries are hosted on S3, they're easy to install in an AWS environment.

First let's install the basic packages for this tutorial. 

- **fugashi** is the tool that actually performs tokenization
- **unidic** is the dictionary we'll use for this tutorial
- **requests** will be used to download the book text

In [1]:
import sys
!{sys.executable} -m pip install fugashi unidic requests

Collecting fugashi
  Downloading fugashi-1.0.4-cp36-cp36m-manylinux1_x86_64.whl (476 kB)
     |████████████████████████████████| 476 kB 9.5 MB/s eta 0:00:01
Collecting unidic
  Downloading unidic-1.0.2.tar.gz (5.0 kB)
Collecting wasabi<1.0.0,>=0.6.0
  Downloading wasabi-0.8.0-py3-none-any.whl (23 kB)
Collecting plac<2.0.0,>=1.1.3
  Downloading plac-1.2.0-py2.py3-none-any.whl (21 kB)
Building wheels for collected packages: unidic
  Building wheel for unidic (setup.py) ... done
  Created wheel for unidic: filename=unidic-1.0.2-py3-none-any.whl size=5411 sha256=35c1df18abe4b8182f936257fbe59642b00dcefe2c9c8ea43a0d241cf8b992e9
  Stored in directory: /home/ec2-user/.cache/pip/wheels/61/62/e0/c3c3e36d343f8d6b959e76fe9f7c73c0aeff297ac259deac25
Successfully built unidic
Installing collected packages: fugashi, wasabi, plac, unidic
Successfully installed fugashi-1.0.4 plac-1.2.0 unidic-1.0.2 wasabi-0.8.0
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/pyt

Because it's large (about 1GB on disk), UniDic requires an extra download step. The data is hosted on S3, so the download is fast from within AWS.

In [2]:
!{sys.executable} -m unidic download aws

download url: https://cotonoha-dic.s3-ap-northeast-1.amazonaws.com/unidic.zip
Dictionary version: 2.3.0+1
Downloading UniDic v2.3.0+1...
unidic.zip: 100%|████████████████████████████| 608M/608M [01:22<00:00, 7.41MB/s]
Finished download.
Downloaded UniDic v2.3.0+1 to /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/unidic/dicdir


Next let's do some basic tokenization. We'll print out a sentence along with the pronunciation and part-of-speech information for each word.

In [3]:
import fugashi
tagger = fugashi.Tagger()
for word in tagger("吾輩は猫である。"):
    print(word, word.feature.kana, word.pos, sep="\t")

吾輩	ワガハイ	代名詞,*,*,*
は	ハ	助詞,係助詞,*,*
猫	ネコ	名詞,普通名詞,一般,*
で	デ	助動詞,*,*,*
ある	アル	動詞,非自立可能,*,*
。	*	補助記号,句点,*,*


UniDic part-of-speech tags have four parts, ranging from the basic part of speech, like 名詞 (noun) or 動詞 (verb), to more fine-grained tags, like whether a proper noun is a place or a person. `*` is used as a placeholder when there's no detailed tag.
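
To make the four-part structure concrete, here is a minimal sketch that splits a tag string like the ones printed above into its four levels and turns the `*` placeholder into `None`. The helper name `split_pos` is hypothetical and not part of fugashi or UniDic; it only assumes the comma-separated format shown in the output.

```python
from typing import List, Optional


def split_pos(pos: str) -> List[Optional[str]]:
    """Split a UniDic-style POS string into its levels.

    Hypothetical helper for illustration: "*" (no detailed tag) becomes None.
    """
    return [None if part == "*" else part for part in pos.split(",")]


# Tag strings copied from the tokenizer output above
print(split_pos("名詞,普通名詞,一般,*"))  # ['名詞', '普通名詞', '一般', None]
print(split_pos("助詞,係助詞,*,*"))      # ['助詞', '係助詞', None, None]
```

With a real tagger, the same string is what `word.pos` returned in the cell above, so `split_pos(word.pos)` should give the four levels for any token.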

UniDic includes a lot of other data beyond what we're using here, including kana accent, broad etymological category, and more. See [the dataset description][dataset] for details on the available fields.

[dataset]: https://github.com/polm/unidic-py/blob/master/doc/dataset.md

Now let's download the book and try a basic word count.

In [4]:
import requests

response = requests.get("https://github.com/polm/fugashi-sagemaker-demo/raw/master/wagahai.txt")
response.raise_for_status()  # stop early if the download failed
wagahai = response.text

from collections import Counter
wc = Counter()

# Get a word count
for line in wagahai.split("\n"):
    for word in tagger(line):
        wc[word.surface] += 1
print("Most common words:")
for key, val in wc.most_common(10):
    print(val, key)
print()

Most common words:
9712 の
9217 《
9216 》
7498 に
7475 。
7238 て
6835 、
6644 は
6266 と
6128 を



We can see that the most common words are grammatical function words and punctuation, so it looks like our word count is working.
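
If you would rather count only content words, one option is to skip tokens whose first-level tag marks them as function words or punctuation. The sketch below is an illustration, not part of the original notebook: the `SKIP_POS1` set and the `count_content_words` helper are assumptions, and the sample tokens are plain `(surface, pos)` tuples copied from the earlier output so the example runs without a dictionary.

```python
from collections import Counter

# Assumed skip list: particles, auxiliary verbs, and supplementary symbols.
# Adjust it to whatever you consider "not a content word".
SKIP_POS1 = {"助詞", "助動詞", "補助記号"}


def count_content_words(tokens):
    """Count surfaces whose first-level POS tag is not in SKIP_POS1.

    `tokens` is any iterable of (surface, pos_string) pairs.
    """
    counts = Counter()
    for surface, pos in tokens:
        pos1 = pos.split(",")[0]  # basic part of speech, e.g. 名詞
        if pos1 not in SKIP_POS1:
            counts[surface] += 1
    return counts


# Stand-in data taken from the tokenization output earlier in this tutorial
sample = [
    ("吾輩", "代名詞,*,*,*"),
    ("は", "助詞,係助詞,*,*"),
    ("猫", "名詞,普通名詞,一般,*"),
    ("で", "助動詞,*,*,*"),
    ("ある", "動詞,非自立可能,*,*"),
    ("。", "補助記号,句点,*,*"),
]
content_counts = count_content_words(sample)
print(content_counts.most_common())
```

To apply this to the book, you could pass `((w.surface, w.pos) for w in tagger(line))` for each line, using the same `surface` and `pos` attributes as the cells above.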

Now let's try getting a list of the most common proper nouns that are names of people. This should give us a list of characters in the novel.

In [5]:
wcpos = Counter()
for line in wagahai.split("\n"):
    for word in tagger(line):
        if (word.feature.pos2 == '固有名詞' and
            word.feature.pos3 == '人名'):
            wcpos[word.surface] += 1

print("Most common proper nouns:")
for key, val in wcpos.most_common(10):
    print(val, key)

Most common proper nouns:
343 迷亭
85 鈴木
80 独仙
42 雪江
41 鼻子
40 多々良
37 りょう
33 武右衛門
26 馳
24 寒月


Sure enough, the top entries are all names of characters in the novel.

That's it for this basic tutorial; for more information see the documentation for [MeCab](https://taku910.github.io/mecab/) or [fugashi](https://github.com/polm/fugashi). Happy tokenizing!