In [1]:
from collections import Counter
import re

# Classical Chinese Tutorial -- Counter as an Efficient Way to Count N-Grams

I just found that `Counter` is super useful for counting 1-gram (single-character) frequencies. For example, just pass in your text:

reference: https://stackoverflow.com/questions/43473736/most-common-2-grams-using-python

In [2]:
# Just pass your string to Counter; it counts every character (1-gram) automatically
input_string = "天命之謂性，率性之謂道，修道之謂教。道也者，不可須臾離也，可離非道也。是故君子戒慎乎其所不睹，恐懼乎其所不聞。莫見乎隱，莫顯乎微。故君子慎其獨也。喜怒哀樂之未發，謂之中；發而皆中節，謂之和；中也者，天下之大本也；和也者，天下之達道也。致中和，天地位焉，萬物育焉。"
one_gram = Counter(input_string)
one_gram.most_common(5)  # show the 5 most common characters

[('，', 12), ('之', 8), ('也', 8), ('。', 7), ('謂', 5)]

Of course, you will want to get rid of the punctuation.

In [3]:
clean_string = re.sub(r'[，。？：「」；]', '', input_string)
one_gram = Counter(clean_string)
one_gram.most_common(5)  # show the 5 most common characters

[('之', 8), ('也', 8), ('謂', 5), ('道', 5), ('天', 4)]
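
The character class above lists the punctuation marks by hand, so on another text any mark missing from the list (say `、` or `！`) would slip through and get counted. Here is a minimal sketch of one alternative, assuming you are happy to drop everything Unicode classifies as punctuation, plus whitespace. The helper name `strip_punctuation` and the short `sample` string are made up for this example; `unicodedata` is part of the standard library.

```python
import unicodedata
from collections import Counter

def strip_punctuation(text):
    """Drop whitespace and every character whose Unicode category starts with 'P'."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P") and not ch.isspace()
    )

# A short sample (the opening line of the passage), for illustration only
sample = "天命之謂性，率性之謂道，修道之謂教。"
cleaned = strip_punctuation(sample)
print(cleaned)  # 天命之謂性率性之謂道修道之謂教

one_gram_sample = Counter(cleaned)
print(one_gram_sample["之"], one_gram_sample["謂"])  # 3 3
```

The hand-written regex is still the better choice when you want to keep some marks and drop others; this version simply trades that control for coverage.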

## How About Counting 2-Grams?

OK, 1-grams are quite easy, but how about 2-grams?  
Recall `zip`, which is commonly used to loop over two lists in parallel in a for loop. If we zip the string with a copy of itself shifted by one character, each pair it yields is a 2-gram. We can use it!

In [4]:
# Count 2-grams by zipping the string with itself shifted by one character
two_gram = Counter(zip(clean_string, clean_string[1:]))
two_gram.most_common(5) 

[(('之', '謂'), 3),
 (('道', '也'), 3),
 (('也', '者'), 3),
 (('故', '君'), 2),
 (('君', '子'), 2)]

Huh, it's neat.
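
One caveat: because the punctuation was deleted before zipping, some pairs join the last character of one clause to the first character of the next (for example `('性', '率')`). If you would rather count only pairs that sit inside a single clause, a rough sketch is to split on the punctuation instead of removing it. This assumes the same punctuation list as the regex above, and `sample` and `clause_two_gram` are names invented for the example.

```python
import re
from collections import Counter

# A short sample (the opening line of the passage), for illustration only
sample = "天命之謂性，率性之謂道，修道之謂教。"

# Split on the punctuation marks instead of deleting them, and drop empty pieces
clauses = [c for c in re.split(r"[，。？：「」；]", sample) if c]

clause_two_gram = Counter()
for clause in clauses:
    # Pair each character with its successor inside the same clause only
    clause_two_gram.update(zip(clause, clause[1:]))

print(clause_two_gram[("之", "謂")])      # 3
print(("性", "率") in clause_two_gram)    # False: this pair crosses a comma
```

Whether cross-clause pairs count as noise depends on what you are studying, so treat this as an option rather than a correction.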

## 3-Grams

Just repeat what we have done, with one more shifted copy of the string...

In [5]:
# Count 3-grams by zipping three copies of the string, shifted by 0, 1 and 2 characters
three_gram = Counter(zip(clean_string, clean_string[1:], clean_string[2:]))
three_gram.most_common(10) 

[(('故', '君', '子'), 2),
 (('乎', '其', '所'), 2),
 (('其', '所', '不'), 2),
 (('也', '者', '天'), 2),
 (('者', '天', '下'), 2),
 (('天', '下', '之'), 2),
 (('天', '命', '之'), 1),
 (('命', '之', '謂'), 1),
 (('之', '謂', '性'), 1),
 (('謂', '性', '率'), 1)]
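
The same pattern extends to any n: build n copies of the string, each shifted one character further, and zip them all together. Below is a small sketch of that idea. The function name `ngram_counts` is my own, and unlike the cells above it joins each tuple back into a string so the keys are easier to read; keep the tuples if you prefer.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count all overlapping character n-grams in text."""
    # text[0:], text[1:], ..., text[n-1:] are n copies, each shifted by one more character.
    # zip stops at the shortest copy, so every tuple holds exactly n characters.
    shifted = [text[i:] for i in range(n)]
    return Counter("".join(gram) for gram in zip(*shifted))

# A short, already-cleaned sample (the opening line of the passage)
sample = "天命之謂性率性之謂道修道之謂教"

print(ngram_counts(sample, 2).most_common(1))  # [('之謂', 3)]
print(ngram_counts(sample, 3)["命之謂"])        # 1
```

With `n=1` this gives the same counts as `Counter(sample)`, so one helper covers all the cases in this notebook.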