# Fugashi Word Count Tutorial

In this tutorial we'll use fugashi, a wrapper for the MeCab tokenizer and morphological analyzer, to count words in 吾輩は猫である ("I Am a Cat"), the famous novel by Natsume Soseki.

MeCab uses large dictionaries and statistical models to tokenize Japanese text. Since the dictionaries are hosted on S3, they're easy to install in an AWS environment.

First let's install the basic packages for this tutorial. 

- **fugashi** is the tool that actually performs tokenization
- **unidic-lite** is the dictionary we'll use for this tutorial
- **requests** will be used to download the book text

In [3]:
import sys
!{sys.executable} -m pip install fugashi unidic-lite requests

Collecting fugashi
  Downloading fugashi-1.0.4-cp36-cp36m-manylinux1_x86_64.whl (476 kB)
Collecting unidic-lite
  Downloading unidic-lite-1.0.7.tar.gz (47.3 MB)
Building wheels for collected packages: unidic-lite
  Building wheel for unidic-lite (setup.py) ... done
  Created wheel for unidic-lite: filename=unidic_lite-1.0.7-py3-none-any.whl size=47556592 sha256=11e2858775e269091d6f7132659e2fbc6138a92d2b8c355db8efda3440ea293f
  Stored in directory: /home/ec2-user/.cache/pip/wheels/82/63/c6/b5f0ea5a04e01edc468cc78cd3d62deca919bbcb09116b37e6
Successfully built unidic-lite
Installing collected packages: fugashi, unidic-lite
Successfully installed fugashi-1.0.4 unidic-lite-1.0.7
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.


Next let's do some basic tokenization. We'll print out a sentence along with the pronunciation and part-of-speech information for each word.

In [6]:
import fugashi
tagger = fugashi.Tagger()
for word in tagger("吾輩は猫である。"):
    print(word, word.feature.kana, word.pos, sep="\t")

吾輩	ワガハイ	代名詞,*,*,*
は	ハ	助詞,係助詞,*,*
猫	ネコ	名詞,普通名詞,一般,*
で	デ	助動詞,*,*,*
ある	アル	動詞,非自立可能,*,*
。		補助記号,句点,*,*


UniDic part-of-speech tags have four levels, from a basic part of speech such as 名詞 (noun) or 動詞 (verb) to progressively finer-grained tags, such as whether a proper noun names a place or a person. `*` is used as a placeholder when there's no detailed tag.
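
To make the four-level structure concrete, here is a minimal sketch that splits a tag string like the ones printed above into its levels and turns the `*` placeholder into `None`. The names `PosTag` and `parse_pos` are hypothetical helpers for illustration, not part of fugashi; with fugashi itself the same levels are available as `word.feature.pos1` through `word.feature.pos4`.

```python
from typing import NamedTuple, Optional


class PosTag(NamedTuple):
    """Hypothetical container for the four levels of a UniDic POS tag."""
    pos1: Optional[str]
    pos2: Optional[str]
    pos3: Optional[str]
    pos4: Optional[str]


def parse_pos(tag: str) -> PosTag:
    """Split a comma-joined POS string, mapping the '*' placeholder to None."""
    parts = [None if part == "*" else part for part in tag.split(",")]
    # Pad to four levels in case a shorter string is passed in.
    parts += [None] * (4 - len(parts))
    return PosTag(*parts[:4])


# Tag strings copied from the tokenizer output above.
for tag in ["名詞,普通名詞,一般,*", "代名詞,*,*,*"]:
    print(parse_pos(tag))
```

Working with `None` instead of `*` makes checks such as `if tag.pos2:` behave as expected.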

UniDic includes a lot of other data beyond what we're using here, including kana accent, broad etymological category, and more. See XXX for details on the available fields.

(Note: replace XXX with link to dataset details document.)

Now let's download the book and try a basic word count.

In [7]:
import requests
from collections import Counter

# TODO put in S3 Bucket as sample document
url = "https://github.com/polm/ja-tokenizer-benchmark/raw/master/wagahai.txt"
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
wagahai = response.text

# Count how often each surface form (the token as written) appears
wc = Counter()
for line in wagahai.split("\n"):
    for word in tagger(line):
        wc[word.surface] += 1

print("Most common words:")
for key, val in wc.most_common(10):
    print(val, key)
print()

Most common words:
9713 の
9217 《
9217 》
7513 に
7490 。
7258 て
6837 、
6665 は
6279 と
6128 を



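We can see that the most common words are grammatical function words and punctuation, so it looks like our word count is working. The 《 and 》 brackets appear to be ruby (furigana) markers in the source text rather than ordinary punctuation.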
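
If function words and punctuation are not of interest, one option is to skip tokens by their top-level part of speech. The sketch below is only an illustration under a few assumptions: the skip list contains just the three tags seen in the earlier output (助詞, 助動詞, 補助記号), the names `SKIP_POS1` and `count_content_words` are hypothetical, and the tokens are hard-coded `(surface, pos1)` pairs so the example runs without a dictionary installed. With fugashi, the pairs could be built from `word.surface` and `word.feature.pos1`.

```python
from collections import Counter

# Assumed skip list: particles, auxiliary verbs, and supplementary symbols.
SKIP_POS1 = {"助詞", "助動詞", "補助記号"}


def count_content_words(tokens):
    """Count surface forms, ignoring tokens whose pos1 is in SKIP_POS1."""
    counts = Counter()
    for surface, pos1 in tokens:
        if pos1 not in SKIP_POS1:
            counts[surface] += 1
    return counts


# (surface, pos1) pairs taken from the tokenizer output shown earlier.
tokens = [
    ("吾輩", "代名詞"),
    ("は", "助詞"),
    ("猫", "名詞"),
    ("で", "助動詞"),
    ("ある", "動詞"),
    ("。", "補助記号"),
]

content_wc = count_content_words(tokens)
print(content_wc.most_common())
```

A real skip list would probably need more tags; which ones to drop depends on what you want to count.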

Now let's try getting a list of the most common proper nouns that are names of people. This should give us a list of characters in the novel.

In [9]:
wcpos = Counter()
for line in wagahai.split("\n"):
    for word in tagger(line):
        if (word.feature.pos2 == '固有名詞' and
            word.feature.pos3 == '人名'):
            wcpos[word.surface] += 1

print("Most common proper nouns:")
for key, val in wcpos.most_common(10):
    print(val, key)

Most common proper nouns:
343 迷亭
99 金田
85 鈴木
80 独仙
42 雪江
41 鼻子
40 多々良
36 りょう
35 てい
29 けん


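Sure enough, the top entries are names of characters in the novel. The short kana entries at the bottom of the list (りょう, てい, けん) look more like fragments of readings than names, so treat the tail of the list with some caution.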
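
The 《 and 》 brackets that dominate the earlier word count suggest the text carries inline readings, which may also explain short kana entries such as りょう in the list above. The following is a minimal sketch, assuming the file uses Aozora Bunko-style ruby markup where a reading follows its word as `《reading》`, optionally with a `｜` marking where the base text starts. `strip_ruby` is a hypothetical helper, and this assumption about the file format has not been checked against the actual text.

```python
import re

# Matches a reading enclosed in 《...》 (assumed Aozora Bunko-style ruby).
RUBY_READING = re.compile(r"《[^》]*》")


def strip_ruby(text: str) -> str:
    """Remove 《reading》 spans and the optional ｜ base-text marker."""
    without_readings = RUBY_READING.sub("", text)
    # Assumption: ｜ only appears as a ruby marker, so it is safe to drop.
    return without_readings.replace("｜", "")


sample = "吾輩《わがはい》は猫である。"
print(strip_ruby(sample))
```

Calling `strip_ruby(wagahai)` before tokenizing would keep the readings out of both counts, at the cost of discarding the pronunciation hints the author provided.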

That's it for this basic tutorial. For more information, see the documentation for [MeCab](https://taku910.github.io/mecab/) or [fugashi](https://github.com/polm/fugashi). Happy tokenizing!