# Many ways to segment Chinese  
> This post introduces five Python libraries for segmenting Chinese into words and compares the effect of writing systems on parsing. 

- toc: true
- branch: master
- badges: true
- categories: [tokenization, Jieba, PKUSeg, PyHanLP, SnowNLP, Ckip-Transformers]
- image: images/syntactic-ambiguity.jpg

![](https://github.com/howard-haowen/blog.ai/raw/master/images/syntactic-ambiguity.jpg "Credit: www.creators.com")

# Intro

Unlike English, Chinese does not use spaces in its writing system, which can be a pain in the neck (or in the eyes, for that matter) if you're learning to read Chinese. In a way, it's like trying to make sense of long German words like `Lebensabschnittspartner`, which roughly means "the person I'm with today" (taken from [David Sedaris's language lessons](https://www.newyorker.com/magazine/2011/07/11/easy-tiger) published in The New Yorker). We'll see how computer models can help us break a stretch of Chinese text into words (called `tokenization` in NLP jargon). To give the models a hard time, we'll test them on a text without any punctuation.

In [None]:
text = "今年好煩惱少不得打官司釀酒剛剛好做醋格外酸養牛隻隻大如山老鼠隻隻死"

This text is challenging not only because it can be segmented in multiple ways but also because it can express quite different meanings depending on how you interpret it. For instance, the part `今年好煩惱少不得打官司` could mean either "This year will be great for you. You'll have few worries. Don't file any lawsuit" or "This year, you'll be very worried. A lawsuit is inevitable". Either way, it sounds like the kind of aphorism you'd find in fortune cookies. Now that you know the secret to aphorisms always being right is ambiguity, we'll turn to five Python libraries to do the hard work for us.
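To make the ambiguity concrete, here is a minimal sketch that enumerates every way to split the first part of the text using a tiny hand-picked vocabulary. The vocabulary and the function name `segmentations` are assumptions made up for illustration; none of the libraries below works this naively.

```python
def segmentations(text, vocab):
    """Return every way to split `text` into pieces that all appear in `vocab`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocab:
            # Keep this piece and recurse on the remainder of the string
            for rest in segmentations(text[end:], vocab):
                results.append([head] + rest)
    return results


# Toy vocabulary (an assumption, not taken from any real dictionary file)
vocab = {"今年", "好", "煩惱", "少", "少不得", "不得", "打官司"}
fragment = "今年好煩惱少不得打官司"

candidates = segmentations(fragment, vocab)
for candidate in candidates:
    print(" | ".join(candidate))
```

Even with seven vocabulary entries, the fragment already has more than one valid split, one with `少不得` as a single word and one with `少` and `不得` apart. A real segmenter has to pick among such candidates, typically by scoring them.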

# Jieba

Of the five tools to be introduced here, [Jieba](https://github.com/fxsjy/jieba) is perhaps the most widely used one, and it's even pre-installed on Colab and supported by [spaCy](https://spacy.io). Unfortunately, Jieba told us that a lawsuit is inevitable this year... 😭

In [None]:
#collapse

import jieba
tokens = jieba.cut(text)  
jieba_default = " | ".join(tokens)
print(jieba_default)

Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.741 seconds.
Prefix dict has been built successfully.


今年 | 好 | 煩惱 | 少不得 | 打官司 | 釀酒 | 剛剛 | 好 | 做 | 醋 | 格外 | 酸養 | 牛 | 隻 | 隻 | 大如山 | 老鼠 | 隻 | 隻 | 死


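The result is quite satisfying, except for `酸養`, which is not even a word. Jieba is famous for being super fast. Timed over 1,000,000 loops, the best result was 256 nanoseconds per loop! One caveat: `jieba.cut()` returns a generator, so this figure measures how long it takes to create the generator, not how long it takes to segment the text. Timing `jieba.lcut(text)` or `list(jieba.cut(text))` would give a fairer number.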

In [None]:
#collapse

%timeit jieba.cut(text)

The slowest run took 12.90 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 256 ns per loop
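Because a generator does no work until it is consumed, timing its creation can look deceptively fast. The sketch below illustrates the effect with a hypothetical stand-in generator called `lazy_cut`; it is not Jieba's code, and the amount of "work" per token is made up.

```python
import timeit


def lazy_cut(text):
    """Hypothetical generator-based segmenter (a stand-in, not jieba itself)."""
    for char in text:
        sum(range(200))  # pretend each token costs some real work
        yield char


sample = "今年好煩惱少不得打官司"

# Creating the generator runs none of the function body
create_only = timeit.timeit(lambda: lazy_cut(sample), number=2000)
# Wrapping it in list() forces every token to be produced
consumed = timeit.timeit(lambda: list(lazy_cut(sample)), number=2000)

print(f"create generator only: {create_only:.4f} s")
print(f"consume all tokens:    {consumed:.4f} s")
```

The same reasoning suggests wrapping a lazy call in `list()` whenever you benchmark it.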


Let's write a function for later use. 

In [None]:
def Jieba_tokenizer(text):
  tokens = jieba.cut(text)  
  result = " | ".join(tokens)
  return result

# PKUSeg

As its name suggests, [PKUSeg](https://github.com/lancopku/pkuseg-python) is built by the [Language Computing and Machine Learning Group](https://lanco.pku.edu.cn) at Peking University (PKU). It has recently been integrated into spaCy.

In [None]:
#collapse-output

!pip install -U pkuseg

Collecting pkuseg
  Downloading https://files.pythonhosted.org/packages/ed/68/2dfaa18f86df4cf38a90ef024e18b36d06603ebc992a2dcc16f83b00b80d/pkuseg-0.0.25-cp36-cp36m-manylinux1_x86_64.whl (50.2MB)
     |████████████████████████████████| 50.2MB 66kB/s 
Installing collected packages: pkuseg
Successfully installed pkuseg-0.0.25


Here's the result.

In [None]:
#collapse

import pkuseg

pku = pkuseg.pkuseg()        
result = pku.cut(text) 
result = " | ".join(result)
result

'今年 | 好 | 煩惱 | 少不得 | 打官司 | 釀 | 酒剛 | 剛 | 好 | 做 | 醋 | 格外 | 酸養 | 牛 | 隻隻 | 大 | 如 | 山 | 老鼠 | 隻隻 | 死'

Compared with Jieba, PKUSeg not only produced more wrong tokens (`酸養` and `酒剛`) but also ran much more slowly.

In [None]:
#collapse

%timeit pku.cut(text)

1000 loops, best of 3: 648 µs per loop


Yet, PKUSeg has one nice feature absent from Jieba. 
> Users have the option to choose from four domain-specific models, including `news`, `web`, `medicine`, and `tourism`. 

This can be quite helpful if you're specifically dealing with texts in any of the four domains. Let's test the `news` domain with the first paragraph of a news article about Covid-19 published on [Yahoo News](https://tw.news.yahoo.com/新冠肺炎防疫-陳時中14-00記者會說明-020511460.html). 

In [None]:
article = '''
台灣新冠肺炎連續第6天零本土病例破功！中央流行疫情指揮中心指揮官陳時中今天宣布國內新增4例本土確定病例，均為桃園醫院感染事件之確診個案相關接觸者，其中3例為案863之同住家人(案907、909、910)，研判與案863、864、865為一起家庭群聚案，其中1人（案907）死亡，是相隔8個月以來再添死亡病例；另1例為案889之就醫相關接觸者(案908)。此外，今天也新增6例境外移入確定病例，分別自印尼(案901)、捷克(案902)及巴西(案903至906)入境。衛福部桃園醫院感染累計達19例(其中1人死亡)，全台達909例、8死。
'''

Here's the result with the default setting.

In [None]:
#collapse

pku = pkuseg.pkuseg()        
result = pku.cut(article) 
result = " | ".join(result)
result

'台灣 | 新冠 | 肺炎 | 連續 | 第6 | 天 | 零 | 本土 | 病例 | 破功 | ！ | 中央 | 流行 | 疫情 | 指揮 | 中心 | 指揮官 | 陳時 | 中 | 今天 | 宣布 | 國內 | 新增 | 4 | 例 | 本土 | 確定 | 病例 | ， | 均 | 為 | 桃園 | 醫院 | 感染 | 事件 | 之 | 確 | 診個案 | 相關 | 接觸者 | ， | 其中 | 3 | 例 | 為案 | 863 | 之 | 同 | 住家人 | ( | 案 | 907 | 、 | 909 | 、 | 910 | ) | ， | 研判 | 與案 | 863 | 、 | 864 | 、 | 865 | 為 | 一起 | 家庭 | 群聚案 | ， | 其中 | 1 | 人 | （ | 案 | 907 | ） | 死亡 | ， | 是 | 相隔 | 8 | 個 | 月 | 以 | 來 | 再 | 添 | 死亡 | 病例 | ； | 另 | 1 | 例 | 為案 | 889 | 之 | 就 | 醫 | 相關 | 接觸者 | ( | 案 | 908 | ) | 。 | 此外 | ， | 今天 | 也 | 新增 | 6例 | 境外 | 移入 | 確定 | 病例 | ， | 分別 | 自 | 印尼 | ( | 案 | 901 | ) | 、 | 捷克 | ( | 案 | 902 | ) | 及 | 巴西 | ( | 案 | 903 | 至 | 906 | ) | 入境 | 。 | 衛福部 | 桃園 | 醫院 | 感染 | 累計 | 達 | 19 | 例 | ( | 其中 | 1 | 人 | 死亡 | ) | ， | 全 | 台 | 達 | 909 | 例 | 、 | 8 | 死 | 。'

Here's the result with the `model_name` argument set to `news`. Both models made some mistakes here and there, but what surprised me is that the news-specific model even failed to parse `新冠肺炎`, which literally means "novel coronavirus pneumonia" and refers to Covid-19.

In [None]:
#collapse

pku = pkuseg.pkuseg(model_name='news')        
result = pku.cut(article) 
result = " | ".join(result)
result

Downloading: "https://github.com/lancopku/pkuseg-python/releases/download/v0.0.16/news.zip" to /root/.pkuseg/news.zip
100%|██████████| 43767759/43767759 [00:00<00:00, 104004889.71it/s]


'台灣 | 新 | 冠 | 肺 | 炎連 | 續 | 第6天 | 零本土 | 病例 | 破功 | ！ | 中央 | 流行疫情指揮中心 | 指揮 | 官 | 陳 | 時 | 中 | 今天 | 宣布 | 國內 | 新增 | 4例 | 本土 | 確定 | 病例 | ， | 均 | 為桃園醫院 | 感染 | 事件 | 之 | 確 | 診 | 個 | 案 | 相關 | 接觸 | 者 | ， | 其中 | 3例 | 為案 | 863 | 之 | 同 | 住 | 家人 | (案 | 907 | 、 | 909 | 、 | 910) | ， | 研判 | 與案 | 863 | 、 | 864 | 、 | 865為 | 一起 | 家庭 | 群 | 聚案 | ， | 其中 | 1 | 人 | （ | 案 | 907 | ） | 死亡 | ， | 是 | 相隔 | 8個月 | 以 | 來 | 再 | 添 | 死亡 | 病例 | ； | 另 | 1例 | 為案 | 889 | 之 | 就 | 醫 | 相關 | 接觸 | 者 | (案 | 908) | 。 | 此外 | ， | 今天 | 也 | 新增 | 6例 | 境外 | 移入 | 確定 | 病例 | ， | 分 | 別 | 自 | 印尼 | (案 | 901) | 、 | 捷克 | (案 | 902) | 及 | 巴西 | (案 | 903至906 | ) | 入境 | 。 | 衛 | 福部桃園醫院 | 感染 | 累 | 計達 | 19例 | ( | 其中 | 1 | 人 | 死亡 | ) | ， | 全 | 台 | 達 | 909例 | 、 | 8 | 死 | 。'

Let's write a function for later use. 

In [None]:
def PKU_tokenizer(text):
  pku = pkuseg.pkuseg()
  tokens = pku.cut(text) 
  result = " | ".join(tokens)
  return result

# PyHanLP

Next, we'll try [PyHanLP](https://github.com/hankcs/pyhanlp). It'll take some time to download the model and data files (about 640MB in total).  

In [None]:
#collapse-output

!pip install pyhanlp

Collecting pyhanlp
  Downloading https://files.pythonhosted.org/packages/8f/99/13078d71bc9f77705a29f932359046abac3001335ea1d21e91120b200b21/pyhanlp-0.1.66.tar.gz (86kB)
     |████████████████████████████████| 92kB 9.0MB/s 
Collecting jpype1==0.7.0
  Downloading https://files.pythonhosted.org/packages/07/09/e19ce27d41d4f66d73ac5b6c6a188c51b506f56c7bfbe6c1491db2d15995/JPype1-0.7.0-cp36-cp36m-manylinux2010_x86_64.whl (2.7MB)
     |████████████████████████████████| 2.7MB 12.4MB/s 
Building wheels for collected packages: pyhanlp
  Building wheel for pyhanlp (setup.py) ... done
  Created wheel for pyhanlp: filename=pyhanlp-0.1.66-py2.py3-none-any.whl size=29371 sha256=cbe214d3e71b3e4e5692c0570e6eadbafc6845b99409abc5af1d790d9b7ee50f
  Stored in directory: /root/.cache/pip/wheels/25/8d/5d/6b642484b1abd87474914e6cf0d3f3a15d8f2653e15ff60f9e
Successfully built pyhanlp
Installing collected packages: jpype1, pyhanlp
Successfully installed jpype1-0.7.0 pyhan

In [None]:
#collapse-output

from pyhanlp import HanLP

Downloading https://file.hankcs.com/hanlp/hanlp-1.7.8-release.zip to /usr/local/lib/python3.6/dist-packages/pyhanlp/static/hanlp-1.7.8-release.zip
100.00%, 1 MB, 187 KB/s, 0 min 0 s remaining
Downloading https://file.hankcs.com/hanlp/data-for-1.7.5.zip to /usr/local/lib/python3.6/dist-packages/pyhanlp/static/data-for-1.7.8.zip
98.24%, 626 MB, 8117 KB/s, 0 min 1 s remaining

With PyHanLP, we got a similar parsing result, but without the error that Jieba produced. 

In [None]:
#collapse

tokens = HanLP.segment(text)
token_list = [res.word for res in tokens]
pyhan = " | ".join(token_list)
print(pyhan)

今年 | 好 | 煩惱 | 少不得 | 打官司 | 釀 | 酒 | 剛剛 | 好 | 做 | 醋 | 格外 | 酸 | 養 | 牛 | 隻 | 隻 | 大 | 如山 | 老鼠 | 隻 | 隻 | 死


However, PyHanLP is slower than Jieba. At 24.6 µs per loop, as timed below, it takes about 96 times the 256 ns recorded for Jieba above.

In [None]:
#collapse

%timeit HanLP.segment(text)

The slowest run took 11.80 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 24.6 µs per loop


Let's write a function for later use. 

In [None]:
def PyHan_tokenizer(text):
  tokens = HanLP.segment(text)
  token_list = [res.word for res in tokens]
  result = " | ".join(token_list)
  return result

# SnowNLP

Next is [SnowNLP](https://github.com/isnowfy/snownlp), which I came across only recently. While PyHanLP is about 640MB in size, SnowNLP takes up less than 40MB.

In [None]:
#collapse-output

!pip install snownlp
from snownlp import SnowNLP

Collecting snownlp
  Downloading https://files.pythonhosted.org/packages/3d/b3/37567686662100d3bce62d3b0f2adec18ab4b9ff2b61abd7a61c39343c1d/snownlp-0.12.3.tar.gz (37.6MB)
     |████████████████████████████████| 37.6MB 86kB/s 
Building wheels for collected packages: snownlp
  Building wheel for snownlp (setup.py) ... done
  Created wheel for snownlp: filename=snownlp-0.12.3-cp36-none-any.whl size=37760957 sha256=7de1997923cd51c8c45b896d9a29792e57652d5f55e3caf088212be684c50b36
  Stored in directory: /root/.cache/pip/wheels/f3/81/25/7c197493bd7daf177016f1a951c5c3a53b1c7e9339fd11ec8f
Successfully built snownlp
Installing collected packages: snownlp
Successfully installed snownlp-0.12.3


SnowNLP gave a similar result, but made two parsing mistakes. Neither `做醋格` nor `外酸` is a legitimate word.

In [None]:
#collapse

tokens = SnowNLP(text)
token_list = tokens.words
snow = " | ".join(token_list)
print(snow)

今 | 年 | 好 | 煩 | 惱 | 少不得 | 打 | 官司 | 釀 | 酒 | 剛 | 剛 | 好 | 做醋格 | 外酸 | 養 | 牛 | 隻 | 隻 | 大 | 如 | 山 | 老 | 鼠 | 隻 | 隻 | 死


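SnowNLP not only made more mistakes, but also took longer to run than PyHanLP. Note that the cell below times only the creation of a `SnowNLP` object. As far as I can tell, segmentation happens lazily when `.words` is accessed, so the actual cost is probably higher.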

In [None]:
#collapse

%timeit  SnowNLP(text)

10000 loops, best of 3: 35.4 µs per loop


But SnowNLP has a convenient feature inspired by [TextBlob](https://github.com/sloria/TextBlob). Any instance of `SnowNLP()` has attributes such as `words`, `pinyin` (for romanization of words), `tags` (for part-of-speech tags), and even `sentiments`, which gives the probability of a text being positive.

In [None]:
print(tokens.words)

['今', '年', '好', '煩', '惱', '少不得', '打', '官司', '釀', '酒', '剛', '剛', '好', '做醋格', '外酸', '養', '牛', '隻', '隻', '大', '如', '山', '老', '鼠', '隻', '隻', '死']


In [None]:
print(tokens.pinyin)

['jin', 'nian', 'hao', '煩', '惱', 'shao', 'bu', 'de', 'da', 'guan', 'si', '釀', 'jiu', '剛', '剛', 'hao', 'zuo', 'cu', 'ge', 'wai', 'suan', '養', 'niu', '隻', '隻', 'da', 'ru', 'shan', 'lao', 'shu', '隻', '隻', 'si']


In [None]:
print(list(tokens.tags))

[('今', 'Tg'), ('年', 'q'), ('好', 'a'), ('煩', 'Rg'), ('惱', 'Rg'), ('少不得', 'Rg'), ('打', 'v'), ('官司', 'n'), ('釀', 'u'), ('酒', 'n'), ('剛', 'i'), ('剛', 'Mg'), ('好', 'a'), ('做醋格', 'Ag'), ('外酸', 'Ng'), ('養', 'Dg'), ('牛', 'Ag'), ('隻', 'Bg'), ('隻', 'a'), ('大', 'a'), ('如', 'v'), ('山', 'n'), ('老', 'a'), ('鼠', 'Ng'), ('隻', 'Ag'), ('隻', 'Bg'), ('死', 'a')]


In [None]:
print(tokens.sentiments)

0.04306320074116554


Again, let's write a function for later use.

In [None]:
def Snow_tokenizer(text):
  tokens = SnowNLP(text)
  token_list = tokens.words
  result = " | ".join(token_list)
  return result

# CKIP Transformers

While the four models above are primarily trained on simplified Chinese, [CKIP Transformers](https://github.com/ckiplab/ckip-transformers) is trained on traditional Chinese. It is created by the [CKIP Lab](https://ckip.iis.sinica.edu.tw) at Academia Sinica. As its name suggests, CKIP Transformers is built on Transformer-based models such as BERT and ALBERT.


> Note: Read this to find out [How Google Changed NLP](https://www.codemotion.com/magazine/dev-hub/machine-learning-dev/bert-how-google-changed-nlp-and-how-to-benefit-from-this/).


![BERT](https://www.codemotion.com/magazine/wp-content/uploads/2020/05/bert-google-1200x675.png)

In [None]:
#collapse-output

!pip install -U ckip-transformers
from ckip_transformers.nlp import CkipWordSegmenter

Collecting ckip-transformers
  Downloading https://files.pythonhosted.org/packages/19/53/81d1a8895cbbc02bf32771a7a43d78ad29a8c281f732816ac422bf54f937/ckip_transformers-0.2.1-py3-none-any.whl
Collecting transformers>=3.5.0
  Downloading https://files.pythonhosted.org/packages/88/b1/41130a228dd656a1a31ba281598a968320283f48d42782845f6ba567f00b/transformers-4.2.2-py3-none-any.whl (1.8MB)
     |████████████████████████████████| 1.8MB 22.8MB/s 
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
     |████████████████████████████████| 890kB 43.0MB/s 
Collecting tokenizers==0.9.4
  Downloading https://files.pythonhosted.org/packages/0f/1c/e789a8b12e28be5bc1ce2156cf87cb522b379be9cadc7ad8091a4cc107c4/tokenizers-0.9.4-cp36-cp36m-manylinux2010_x86_64.whl (2.9MB)
     |████████████████████████████████| 2.9MB 49.4MB/s 
Building wheels for collected p

`CKIP Transformers` gives its users the freedom to trade speed for accuracy. It comes with three levels. The smaller the number, the shorter the running time. All you need to do is pass a number to the `level` argument of `CkipWordSegmenter()`. Here are the models and F1 scores for each level:

*   Level 1: CKIP ALBERT Tiny, 96.66%
*   Level 2: CKIP ALBERT Base, 97.33%
*   Level 3: CKIP BERT Base, 97.60%

By comparison, the F1 score for Jieba is only 81.18%. For more stats, visit the [CKIP Lab's repo](https://github.com/ckiplab/ckip-transformers).
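For readers unfamiliar with F1 in this setting, the sketch below shows one common way to score a segmentation: treat each token as a character span and count the spans that the prediction and the reference share. The function names are made up here, the reference split is my own assumption for illustration, and the CKIP Lab's reported figures may be computed differently.

```python
def to_spans(tokens):
    """Convert a token list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for token in tokens:
        spans.add((start, start + len(token)))
        start += len(token)
    return spans


def segmentation_f1(gold_tokens, pred_tokens):
    """Harmonic mean of span-level precision and recall."""
    gold, pred = to_spans(gold_tokens), to_spans(pred_tokens)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Assumed reference split versus the fragment Jieba produced earlier
gold = ["格外", "酸", "養", "牛"]
pred = ["格外", "酸養", "牛"]
score = segmentation_f1(gold, pred)
print(f"F1 = {score:.4f}")
```

A single wrong boundary costs two reference tokens here (`酸` and `養`), which is why one slip can drag the score down noticeably on short texts.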

In [None]:
ws_driver  = CkipWordSegmenter(level=1, device=0)

Here's the result at Level 1. What's surprising here is that the big chunk `大如山老鼠` was not segmented any further. But this is not a mistake. It simply means that the model has learned it as an idiom.

In [None]:
#collapse

tokens  = ws_driver([text])
ckip_1 = " | ".join(tokens[0])
print(ckip_1)

Tokenization: 100%|██████████| 1/1 [00:00<00:00, 3284.50it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00,  3.98it/s]

今年 | 好 | 煩惱 | 少不得 | 打官司 | 釀酒 | 剛剛 | 好 | 做 | 醋 | 格外 | 酸 | 養 | 牛 | 隻隻 | 大如山老鼠 | 隻隻 | 死





Of the five libraries covered here, CKIP Transformers takes by far the longest to run (17.8 ms per loop, best of 3). But what it lacks in speed, it makes up for in accuracy.

> Warning: Don't toggle to show the output unless you really want to see a long list of details.

In [None]:
#collapse-output

%timeit ws_driver([text])

Tokenization: 100%|██████████| 1/1 [00:00<00:00, 1721.80it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 97.88it/s]
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 1529.09it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 132.06it/s]
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 1633.93it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 153.22it/s]
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 4549.14it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 140.57it/s]
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 1354.75it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 147.18it/s]
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 1138.52it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 126.70it/s]
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 2458.56it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 117.60it/s]
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 1108.43it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 171.43it/s]
Tokenization: 100%|██████████| 1/1 [00:00

100 loops, best of 3: 17.8 ms per loop





Let's reinstantiate the `CkipWordSegmenter()` class and set the level to 2 this time.

In [None]:
ws_driver  = CkipWordSegmenter(level=2, device=0)

Here's the result at Level 2, where `大如山老鼠` was properly segmented into `大`, `如`, and `山老鼠`.

In [None]:
#collapse

tokens  = ws_driver([text])
ckip_2 = " | ".join(tokens[0])
print(ckip_2)

Tokenization: 100%|██████████| 1/1 [00:00<00:00, 2253.79it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 47.86it/s]

今年 | 好 | 煩惱 | 少不得 | 打官司 | 釀酒 | 剛剛好 | 做醋 | 格外 | 酸 | 養牛 | 隻隻 | 大 | 如 | 山老鼠 | 隻隻 | 死





Finally, let's create an instance of `CkipWordSegmenter()` at Level 3.

In [None]:
ws_driver  = CkipWordSegmenter(level=3, device=0)

However, Level 3 didn't produce a better result than Level 2. For instance, `牛隻`, though a legitimate token, is not appropriate in this context. 



In [None]:
#collapse

tokens  = ws_driver([text])
ckip_3 = " | ".join(tokens[0])
print(ckip_3)

Tokenization: 100%|██████████| 1/1 [00:00<00:00, 976.10it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 59.33it/s]

今年 | 好 | 煩惱 | 少不得 | 打官司 | 釀酒 | 剛剛 | 好 | 做 | 醋 | 格外 | 酸 | 養 | 牛隻 | 隻 | 大 | 如 | 山 | 老鼠 | 隻隻 | 死





Here's the function for later use, which takes two arguments instead of one, unlike in previous cases.

In [None]:
def Ckip_tokenizer(text, level):
  ws_driver  = CkipWordSegmenter(level=level, device=0)
  tokens  = ws_driver([text])
  result = " | ".join(tokens[0])
  return result
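One thing to note about this function is that it builds a new segmenter on every call, which presumably means reloading the model each time. A possible refinement is to cache one instance per level. The sketch below shows the idea with a hypothetical `FakeSegmenter` class standing in for `CkipWordSegmenter`, so that it runs without downloading any model. `get_segmenter` and `cached_tokenizer` are names made up for this example.

```python
from functools import lru_cache


class FakeSegmenter:
    """Hypothetical stand-in for an expensive model such as CkipWordSegmenter."""

    loads = 0  # counts how many times a "model" has been constructed

    def __init__(self, level):
        FakeSegmenter.loads += 1
        self.level = level

    def __call__(self, texts):
        # Placeholder behaviour: split every text into single characters
        return [list(text) for text in texts]


@lru_cache(maxsize=None)
def get_segmenter(level):
    """Build the segmenter for a level once, then reuse it."""
    return FakeSegmenter(level)


def cached_tokenizer(text, level):
    tokens = get_segmenter(level)([text])
    return " | ".join(tokens[0])


for _ in range(3):
    cached_tokenizer("老鼠", 1)
cached_tokenizer("老鼠", 2)
print(FakeSegmenter.loads)  # two constructions: one per distinct level
```

With the real class, the body of `get_segmenter` would return `CkipWordSegmenter(level=level, device=0)` instead.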

# Comparison

To compare the five libraries, let's write a general function.

In [None]:
def Tokenizer(text, style):
  if style == 'jieba':
    result = Jieba_tokenizer(text)
  elif style == 'pku':
    result = PKU_tokenizer(text)
  elif style == 'pyhan':
    result = PyHan_tokenizer(text)
  elif style == 'snow':
    result = Snow_tokenizer(text)
  elif style == 'ckip':
    res1 = Ckip_tokenizer(text, 1)
    res2 = Ckip_tokenizer(text, 2)
    res3 = Ckip_tokenizer(text, 3)
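    result = f"Level 1: {res1}\nLevel 2: {res2}\nLevel 3: {res3}"
  else:
    # Fail loudly on an unknown style instead of hitting an undefined `result`
    raise ValueError(f"Unknown tokenizer style: {style}")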
  output = f"Result tokenized by {style}: \n{result}"
  return output

Now I'm interested in finding out whether simplified or traditional Chinese would have any effect on segmentation results. In addition to the text we've been trying (let's rename it as `text_A`), we'll also test another challenging text taken from the [PyHanLP repo](https://github.com/hankcs/pyhanlp) (let's call it `text_B`), which is intended to be ambiguous in multiple places. Given these two texts, two versions of Chinese scripts (simplified and traditional), and five segmentation libraries, we end up having in total 20 combinations of texts and libraries. 

In [None]:
#collapse-output

import itertools

textA_tra = "今年好煩惱少不得打官司釀酒剛剛好做醋格外酸養牛隻隻大如山老鼠隻隻死"
textA_sim = "今年好烦恼少不得打官司酿酒刚刚好做醋格外酸养牛隻隻大如山老鼠隻隻死"
textB_tra = "工信處女幹事每月經過下屬科室都要親口交代24口交換機等技術性器件的安裝工作"
textB_sim = "工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作"
texts = [textA_tra, textA_sim, textB_tra, textB_sim]
tokenizers = ['jieba', 'pku', 'pyhan', 'snow','ckip']

testing_tup = list(itertools.product(texts, tokenizers))
testing_tup

[('今年好煩惱少不得打官司釀酒剛剛好做醋格外酸養牛隻隻大如山老鼠隻隻死', 'jieba'),
 ('今年好煩惱少不得打官司釀酒剛剛好做醋格外酸養牛隻隻大如山老鼠隻隻死', 'pku'),
 ('今年好煩惱少不得打官司釀酒剛剛好做醋格外酸養牛隻隻大如山老鼠隻隻死', 'pyhan'),
 ('今年好煩惱少不得打官司釀酒剛剛好做醋格外酸養牛隻隻大如山老鼠隻隻死', 'snow'),
 ('今年好煩惱少不得打官司釀酒剛剛好做醋格外酸養牛隻隻大如山老鼠隻隻死', 'ckip'),
 ('今年好烦恼少不得打官司酿酒刚刚好做醋格外酸养牛隻隻大如山老鼠隻隻死', 'jieba'),
 ('今年好烦恼少不得打官司酿酒刚刚好做醋格外酸养牛隻隻大如山老鼠隻隻死', 'pku'),
 ('今年好烦恼少不得打官司酿酒刚刚好做醋格外酸养牛隻隻大如山老鼠隻隻死', 'pyhan'),
 ('今年好烦恼少不得打官司酿酒刚刚好做醋格外酸养牛隻隻大如山老鼠隻隻死', 'snow'),
 ('今年好烦恼少不得打官司酿酒刚刚好做醋格外酸养牛隻隻大如山老鼠隻隻死', 'ckip'),
 ('工信處女幹事每月經過下屬科室都要親口交代24口交換機等技術性器件的安裝工作', 'jieba'),
 ('工信處女幹事每月經過下屬科室都要親口交代24口交換機等技術性器件的安裝工作', 'pku'),
 ('工信處女幹事每月經過下屬科室都要親口交代24口交換機等技術性器件的安裝工作', 'pyhan'),
 ('工信處女幹事每月經過下屬科室都要親口交代24口交換機等技術性器件的安裝工作', 'snow'),
 ('工信處女幹事每月經過下屬科室都要親口交代24口交換機等技術性器件的安裝工作', 'ckip'),
 ('工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作', 'jieba'),
 ('工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作', 'pku'),
 ('工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作', 'pyhan'),
 ('工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作', 'snow'),
 ('工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作', 'ckip

Here are the results for traditional `textA`.

In [None]:
#collapse

for sent in testing_tup[:5]:
  result = Tokenizer(sent[0], sent[1])
  print(result)

Result tokenized by jieba: 
今年 | 好 | 煩惱 | 少不得 | 打官司 | 釀酒 | 剛剛 | 好 | 做 | 醋 | 格外 | 酸養 | 牛 | 隻 | 隻 | 大如山 | 老鼠 | 隻 | 隻 | 死
Result tokenized by pku: 
今年 | 好 | 煩惱 | 少不得 | 打官司 | 釀 | 酒剛 | 剛 | 好 | 做 | 醋 | 格外 | 酸養 | 牛 | 隻隻 | 大 | 如 | 山 | 老鼠 | 隻隻 | 死
Result tokenized by pyhan: 
今年 | 好 | 煩惱 | 少不得 | 打官司 | 釀 | 酒 | 剛剛 | 好 | 做 | 醋 | 格外 | 酸 | 養 | 牛 | 隻 | 隻 | 大 | 如山 | 老鼠 | 隻 | 隻 | 死
Result tokenized by snow: 
今 | 年 | 好 | 煩 | 惱 | 少不得 | 打 | 官司 | 釀 | 酒 | 剛 | 剛 | 好 | 做醋格 | 外酸 | 養 | 牛 | 隻 | 隻 | 大 | 如 | 山 | 老 | 鼠 | 隻 | 隻 | 死


Tokenization: 100%|██████████| 1/1 [00:00<00:00, 1287.78it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 136.95it/s]
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 1394.38it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 66.44it/s]
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 998.41it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 60.47it/s]

Result tokenized by ckip: 
Level 1: 今年 | 好 | 煩惱 | 少不得 | 打官司 | 釀酒 | 剛剛 | 好 | 做 | 醋 | 格外 | 酸 | 養 | 牛 | 隻隻 | 大如山老鼠 | 隻隻 | 死
Level 2: 今年 | 好 | 煩惱 | 少不得 | 打官司 | 釀酒 | 剛剛好 | 做醋 | 格外 | 酸 | 養牛 | 隻隻 | 大 | 如 | 山老鼠 | 隻隻 | 死
Level 3: 今年 | 好 | 煩惱 | 少不得 | 打官司 | 釀酒 | 剛剛 | 好 | 做 | 醋 | 格外 | 酸 | 養 | 牛隻 | 隻 | 大 | 如 | 山 | 老鼠 | 隻隻 | 死





Here are the results for the simplified version of the same text. Notice that the outcome can be quite different simply because a traditional text has been converted to its simplified counterpart.

In [None]:
#collapse

for sent in testing_tup[5:10]:
  result = Tokenizer(sent[0], sent[1])
  print(result)

Result tokenized by jieba: 
今年 | 好 | 烦恼 | 少不得 | 打官司 | 酿酒 | 刚刚 | 好 | 做 | 醋 | 格外 | 酸 | 养牛 | 隻 | 隻 | 大如山 | 老鼠 | 隻 | 隻 | 死
Result tokenized by pku: 
今年 | 好 | 烦恼 | 少不得 | 打官司 | 酿酒 | 刚刚 | 好 | 做 | 醋 | 格外 | 酸养 | 牛隻 | 隻 | 大 | 如 | 山 | 老鼠 | 隻隻 | 死
Result tokenized by pyhan: 
今年 | 好 | 烦恼 | 少不得 | 打官司 | 酿酒 | 刚刚好 | 做 | 醋 | 格外 | 酸 | 养牛 | 隻 | 隻 | 大 | 如山 | 老鼠 | 隻 | 隻 | 死
Result tokenized by snow: 
今年 | 好 | 烦恼 | 少不得 | 打官司 | 酿酒 | 刚刚 | 好 | 做醋 | 格外 | 酸 | 养 | 牛 | 隻 | 隻 | 大 | 如 | 山 | 老 | 鼠 | 隻 | 隻 | 死


Tokenization: 100%|██████████| 1/1 [00:00<00:00, 303.61it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 123.89it/s]
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 695.69it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 66.45it/s]
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 392.84it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 72.00it/s]

Result tokenized by ckip: 
Level 1: 今年 | 好 | 烦恼 | 少不得 | 打官司 | 酿 | 酒 | 刚刚 | 好 | 做 | 醋 | 格外 | 酸 | 养 | 牛隻隻 | 大如山老鼠 | 隻隻 | 死
Level 2: 今年 | 好 | 烦恼 | 少不得 | 打官司 | 酿酒 | 刚刚 | 好 | 做醋 | 格外 | 酸 | 养 | 牛 | 隻隻 | 大 | 如 | 山老鼠 | 隻隻 | 死
Level 3: 今年 | 好 | 烦恼 | 少不得 | 打官司 | 酿酒 | 刚刚好 | 做 | 醋 | 格外 | 酸 | 养 | 牛隻 | 隻 | 大 | 如 | 山 | 老鼠 | 隻隻 | 死
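Since the two scripts differ in characters but not in length, one way to pinpoint where the results diverge is to compare token boundary positions rather than the tokens themselves. The sketch below does this for the two Jieba outputs printed above. The helper name `boundaries` is made up for this example.

```python
def boundaries(segmented, sep=" | "):
    """Return the set of character offsets at which a token ends."""
    positions, offset = set(), 0
    for token in segmented.split(sep):
        offset += len(token)
        positions.add(offset)
    return positions


# Jieba outputs copied from the traditional and simplified runs above
jieba_tra = "今年 | 好 | 煩惱 | 少不得 | 打官司 | 釀酒 | 剛剛 | 好 | 做 | 醋 | 格外 | 酸養 | 牛 | 隻 | 隻 | 大如山 | 老鼠 | 隻 | 隻 | 死"
jieba_sim = "今年 | 好 | 烦恼 | 少不得 | 打官司 | 酿酒 | 刚刚 | 好 | 做 | 醋 | 格外 | 酸 | 养牛 | 隻 | 隻 | 大如山 | 老鼠 | 隻 | 隻 | 死"

tra, sim = boundaries(jieba_tra), boundaries(jieba_sim)
only_tra = sorted(tra - sim)
only_sim = sorted(sim - tra)
print("boundaries only in traditional run:", only_tra)
print("boundaries only in simplified run: ", only_sim)
```

For Jieba the two runs disagree at a single spot, the `酸養 | 牛` versus `酸 | 养牛` split. The same helper can be pointed at any other pair of outputs.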





Here are the results for traditional `textB`. Serious mistakes include `處女` (for "virgin") and `口交` (for "oral sex"). Both are real words in Chinese, but not the intended ones in this context.

In [None]:
#collapse

for sent in testing_tup[10:15]:
  result = Tokenizer(sent[0], sent[1])
  print(result)

Result tokenized by jieba: 
工信 | 處女 | 幹事 | 每月 | 經過 | 下屬 | 科室 | 都 | 要 | 親口 | 交代 | 24 | 口交 | 換機 | 等 | 技術性 | 器件 | 的 | 安裝 | 工作
Result tokenized by pku: 
工信 | 處女 | 幹事 | 每月 | 經 | 過下 | 屬科室 | 都 | 要 | 親口 | 交代 | 24 | 口 | 交 | 換機 | 等 | 技術性 | 器件 | 的 | 安裝 | 工作
Result tokenized by pyhan: 
工 | 信 | 處女 | 幹 | 事 | 每月 | 經 | 過 | 下 | 屬 | 科室 | 都 | 要 | 親 | 口 | 交代 | 24 | 口交 | 換機 | 等 | 技 | 術 | 性 | 器件 | 的 | 安 | 裝 | 工作
Result tokenized by snow: 
工 | 信 | 處 | 女 | 幹 | 事 | 每 | 月 | 經 | 過 | 下 | 屬 | 科室 | 都 | 要 | 親口 | 交代 | 24 | 口 | 交 | 換 | 機 | 等 | 技 | 術性 | 器件 | 的 | 安 | 裝 | 工作


Tokenization: 100%|██████████| 1/1 [00:00<00:00, 494.49it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 119.49it/s]
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 402.87it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 60.66it/s]
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 3942.02it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 60.56it/s]

Result tokenized by ckip: 
Level 1: 工信 | 處女 | 幹事 | 每 | 月 | 經過 | 下屬 | 科室 | 都 | 要 | 親口 | 交代 | 24 | 口 | 交換機 | 等 | 技術性 | 器件 | 的 | 安裝 | 工作
Level 2: 工信處 | 女 | 幹事 | 每 | 月 | 經過 | 下屬 | 科室 | 都 | 要 | 親口 | 交代 | 24 | 口 | 交換機 | 等 | 技術性 | 器件 | 的 | 安裝 | 工作
Level 3: 工信處 | 女 | 幹事 | 每 | 月 | 經過 | 下屬 | 科室 | 都 | 要 | 親口 | 交代 | 24 | 口 | 交換機 | 等 | 技術性 | 器件 | 的 | 安裝 | 工作





Here are the results for the simplified version of `textB`. For `textB`, CKIP Transformers Levels 2 and 3 are the most stable, giving the same error-free results regardless of the writing system.

In [None]:
#collapse

for sent in testing_tup[15:]:
  result = Tokenizer(sent[0], sent[1])
  print(result)

Result tokenized by jieba: 
工信处 | 女干事 | 每月 | 经过 | 下属 | 科室 | 都 | 要 | 亲口 | 交代 | 24 | 口 | 交换机 | 等 | 技术性 | 器件 | 的 | 安装 | 工作
Result tokenized by pku: 
工信 | 处女 | 干事 | 每月 | 经过 | 下属 | 科室 | 都 | 要 | 亲口 | 交代 | 24 | 口 | 交换机 | 等 | 技术性 | 器件 | 的 | 安装 | 工作
Result tokenized by pyhan: 
工信处 | 女干事 | 每月 | 经过 | 下属 | 科室 | 都 | 要 | 亲口 | 交代 | 24 | 口 | 交换机 | 等 | 技术性 | 器件 | 的 | 安装 | 工作
Result tokenized by snow: 
工 | 信处女 | 干事 | 每月 | 经过 | 下属 | 科室 | 都 | 要 | 亲口 | 交代 | 24 | 口 | 交换机 | 等 | 技术性 | 器件 | 的 | 安装 | 工作


Tokenization: 100%|██████████| 1/1 [00:00<00:00, 1220.69it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 131.83it/s]
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 878.39it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 71.48it/s]
Tokenization: 100%|██████████| 1/1 [00:00<00:00, 1254.65it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00, 60.75it/s]

Result tokenized by ckip: 
Level 1: 工信处 | 女干 | 事 | 每 | 月 | 经过 | 下 | 属 | 科室 | 都 | 要 | 亲 | 口 | 交代 | 24 | 口 | 交换机 | 等 | 技术性 | 器件 | 的 | 安装 | 工作
Level 2: 工信处 | 女 | 干事 | 每 | 月 | 经过 | 下属 | 科室 | 都 | 要 | 亲口 | 交代 | 24 | 口 | 交换机 | 等 | 技术性 | 器件 | 的 | 安装 | 工作
Level 3: 工信处 | 女 | 干事 | 每 | 月 | 经过 | 下属 | 科室 | 都 | 要 | 亲口 | 交代 | 24 | 口 | 交换机 | 等 | 技术性 | 器件 | 的 | 安装 | 工作





# Recap

This post has tested five word segmentation libraries against two challenging Chinese texts. Here are the takeaways:

*   If you value speed more than anything, Jieba is definitely the top choice. If you're dealing with traditional Chinese, it is good practice to convert your texts to simplified Chinese before feeding them to Jieba, which may produce better results.

*   If you care more about accuracy instead, it's best to use CKIP Transformers. Its Level 2 and 3 produce consistent results whether your texts are in traditional or simplified Chinese.

*   Finally, if you hope to leverage the power of NLP libraries such as spaCy and [Texthero](https://github.com/jbesomi/texthero#faq) (by the way, their slogan is really awesome: `from zero to hero`), you'll have to go for Jieba or PKUSeg. I hope spaCy will also add CKIP to its inventory of tokenizers in the near future.
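The first takeaway suggests converting traditional text to simplified text before segmentation. As a rough illustration of what that step involves, the sketch below builds a character mapping from the two versions of `textA` used in this post and applies it with `str.translate`. This toy table covers only the characters that differ in that one sentence. For real work you would want a dedicated converter such as OpenCC, which is not shown here.

```python
textA_tra = "今年好煩惱少不得打官司釀酒剛剛好做醋格外酸養牛隻隻大如山老鼠隻隻死"
textA_sim = "今年好烦恼少不得打官司酿酒刚刚好做醋格外酸养牛隻隻大如山老鼠隻隻死"

# Pair up characters position by position and keep only those that differ
mapping = {tra: sim for tra, sim in zip(textA_tra, textA_sim) if tra != sim}
table = str.maketrans(mapping)

converted = textA_tra.translate(table)
print(mapping)
print(converted == textA_sim)
```

Character-by-character mapping is a simplification. Some traditional characters map to different simplified forms depending on the word, which is one more reason to rely on a proper conversion tool.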


