<a href="https://colab.research.google.com/github/mzorki/tutorials/blob/main/Tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# This tutorial

Below is Python 3 code for working with several types of tokenizers for the Chinese language. The corresponding files are:
- article in Digital Orientalist with more information TODO
- git repository and README file are here: https://github.com/mzorki/tutorials 
- alternative download link [from Google Drive](https://drive.google.com/drive/folders/1FQ8NAqBm7fZB0IPAXYOAHzQlXlCL9m7S?usp=sharing)

**Important** 
- For the code in this notebook to work, the entire folder must be on your Google Drive. Unfortunately, after some recent changes it is no longer possible to copy a shared Drive folder directly. The easiest way is to download the whole folder and add it to your Drive manually.

Notes
- This file can also be used by those who do not know how to code. Just follow the instructions and run the code in each cell by pressing the "play" button in its upper left corner.
- If you are running the file from Google Colab, the table of contents can be opened from the menu on the left side of this window.
- Colab will shut down automatically after some time. This means that it is impossible to run code that requires several hours to process (see [How long can notebooks run in Colab?](https://research.google.com/colaboratory/faq.html#:~:text=How%20long%20can%20notebooks%20run,or%20based%20on%20your%20usage.)).




# Step 1. Import all the required libraries.


To be able to run different tokenization tools, we need to import them.
Below is an example import statement.
In each section that deals with a tool, there will be a separate import. Do not forget to run a cell that imports a tool before trying it out, otherwise the code will not work.<br><br>
In some cases, such as jieba, Colab already has all the necessary data; in others, such as HanLP and Udkanbun, this notebook will first download some files. Please be careful with data usage: the files can be quite big, and they are downloaded again each time this notebook is run.


In [1]:
from tqdm.notebook import tqdm 
import re

# Step 2. Mount Google Drive
For Colab to be able to work with Google Drive as a normal directory, we need to give it permissions to do so and tell it the place where this notebook is. <br> 
Run the cell, then click the link that appears below, give the permissions and copy the code that will appear on that page into the field below.

In [2]:
from google.colab import drive
drive.mount('/gdrive', force_remount=True)

Mounted at /gdrive


Enter the path to the project. It generally starts with "/gdrive/My Drive/", followed by the path to the project folder. <br><br>
Notes:
- In my case, the top-level Google Drive folder "My Drive" contains a "Shared" folder with my working folder "Chinese_tokenizers" inside it, hence the path. <br>
- It is always better to have no spaces in folder names, but the name "My Drive" is set automatically, so we cannot change it.


In [3]:
project_dir = "/gdrive/My Drive/Shared/Chinese_tokenizers/" 

#move to the working directory 
%cd {project_dir} 

/gdrive/My Drive/Shared/Chinese_tokenizers
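
Because the path depends on where the folder was uploaded, a quick check before moving on can save some confusion. Below is a minimal sketch using only the standard library; `missing_items` is a hypothetical helper written for this illustration, and the list of expected names is an assumption about the folder layout.

```python
from pathlib import Path

def missing_items(project_dir, expected):
    """Return the names from `expected` that are not found inside `project_dir`."""
    root = Path(project_dir)
    if not root.is_dir():
        # The folder itself is missing, so everything in it is missing too.
        return list(expected)
    return [name for name in expected if not (root / name).exists()]

# Example call (the names below are assumptions, adjust them to your folder):
# missing_items("/gdrive/My Drive/Shared/Chinese_tokenizers/",
#               ["test_text.txt", "results", "helper"])
```

An empty list means that everything was found. If the whole list comes back, the path itself is probably wrong.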


# Step 3. Save the paths to the dictionary and the text to tokenize.

The CDICT (Stardict version) dictionary with full-form characters is used in this notebook. [Link to CDICT and other open-access dictionaries](http://download.huzheng.org/zh_TW/).

To use a custom dictionary:
- create a .txt file with one word per line
- upload the dictionary to the project folder
- change the file name in the cell below 

To tokenize your own text:
- create a .txt file with the text to tokenize.
- no special formatting is required; any formatting that is present may be lost after tokenization.
- change the file name in the cell below
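
To check that a custom dictionary file really has one word per line, it can be read back in plain Python. This is a minimal sketch under the assumption that the file is saved as UTF-8; `read_wordlist` is a hypothetical helper and is not used by any of the tokenizers below.

```python
def read_wordlist(path):
    """Read a one-word-per-line file and return the words without duplicates."""
    words = []
    seen = set()
    # "utf-8-sig" also accepts files that start with a byte order mark.
    with open(path, encoding="utf-8-sig") as fh:
        for line in fh:
            word = line.strip()            # drop the line break and stray spaces
            if word and word not in seen:  # skip empty lines and repeats
                seen.add(word)
                words.append(word)
    return words
```

For example, `len(read_wordlist(dictionary_path))` gives the number of distinct entries once the path has been set in the cell below.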



In [4]:
dictionary_path = 'CDICT(Stardict)_wordlist.txt'
text_path = 'test_text.txt'

#read the text as UTF-8; the file is closed automatically afterwards
with open(text_path, encoding='utf-8') as f:
    tok_text = f.read()

#this line splits the text into smaller chunks. It assumes that the chunks are separated by line breaks.
#there are more elegant ways to do this, but they depend on the specific formatting of a file.
sentences = [i for i in tok_text.split('\n') if i != '']
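
The line-based split above works when the text already has one chunk per line. If it does not, one possible alternative is to split on sentence-final punctuation instead. The sketch below is only an illustration: `split_sentences` is a hypothetical helper, and the set of punctuation marks is an assumption that may need adjusting for your texts.

```python
import re

# Sentence-final punctuation to split on (an assumption, adjust as needed).
SENTENCE_END = "。！？"

def split_sentences(text):
    """Split text into chunks that end with sentence-final punctuation."""
    # One chunk = a run of other characters plus the punctuation that follows it.
    pattern = f"[^{SENTENCE_END}]+[{SENTENCE_END}]*"
    chunks = (m.strip() for m in re.findall(pattern, text))
    return [c for c in chunks if c]  # drop chunks that were only whitespace
```

To try it, replace the last line of the cell above with `sentences = split_sentences(tok_text)`.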

# Step 4. Test the tokenizers.

Below are examples of code for working with several tokenizers. Each section generally has the following parts:
- a short description of the tool
- link to the GitHub repository
- example code to tokenize one phrase
- code to load user dictionary
- example code to tokenize a .txt file

Notes:
- after a user dictionary has been loaded, the tokenizer will remember it for all operations afterwards. To reset, run the cells once more from the beginning of a section
- output files are saved to the folder 'results'. Sometimes Google Drive requires some time to update the folder and add a new file. If you do not see the file with the results, try refreshing the page or waiting for a couple of minutes. 

## Jieba

[jieba GitHub repository](https://github.com/fxsjy/jieba).

Jieba is one of the most popular tokenizers for Modern Chinese. Its GitHub page has very detailed instructions, explanations and examples.
<br>
It has many fine-tuning options (including a choice between the default statistical model and an additional machine-learning mode), PoS tagging, and the possibility of adding a user-defined dictionary.
<br>
It is very good with Modern Chinese. It works fine with full-form characters, but for MC and OC it will recognize many long phrases as words.

In [None]:
import jieba

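#remove the "#" before jieba.enable_paddle() and run the cell if you want jieba to use machine learning (paddle mode).
#paddle mode needs the paddlepaddle package and only applies to calls like jieba.cut(text, use_paddle=True).
#jieba.enable_paddle() 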

### Tokenize a sentence
Insert your own text to test it.

In [None]:
text = "我来到北京清华大学"

Tokenization in the "Default mode": <br>
note that, when possible, this algorithm prefers to keep longer sequences of characters unsplit.

In [None]:
seg_list = jieba.cut(text, cut_all=False)
print("Default Mode: " + " ".join(seg_list))  # accurate mode

Default Mode: 我 来到 北京 清华大学


Tokenization in "Full mode": <br>
note how in moments of uncertainty the algorithm returns **all possible** variants. This adds a lot of noise and makes the output unsuitable for corpus analysis.

In [None]:
seg_list = jieba.cut(text, cut_all=True)
print("Full Mode: " + " ".join(seg_list))  # full mode

Full Mode: 我 来到 北京 清华 清华大学 华大 大学
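
To see why full mode produces overlapping words, it helps to list every dictionary word that occurs anywhere in the sentence. The sketch below does exactly that with a toy dictionary. It is an illustration of the idea only, not jieba's actual implementation: `all_dictionary_matches` and `toy_vocab` are made up for this example.

```python
def all_dictionary_matches(text, vocab, longest_word=4):
    """List every substring of `text` (up to `longest_word` characters) found in `vocab`."""
    found = []
    for start in range(len(text)):
        # Try every candidate that begins at this position.
        for end in range(start + 1, min(start + longest_word, len(text)) + 1):
            piece = text[start:end]
            if piece in vocab:
                found.append(piece)
    return found

# Toy dictionary, made up for this illustration.
toy_vocab = {"来到", "北京", "清华", "清华大学", "华大", "大学"}
print(" ".join(all_dictionary_matches("我来到北京清华大学", toy_vocab)))
# prints: 来到 北京 清华 清华大学 华大 大学
```

The overlapping entries 清华, 清华大学, 华大 and 大学 are all reported, which is the kind of noise mentioned above.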


Load the dictionary. <br>
By default, CDICT is used. If you want to use another one, replace the dictionary file name in "Step 3".


In [None]:
jieba.load_userdict(dictionary_path)

Run the tokenization with the dictionary using the default mode. <br>
Here I use a dictionary with full-form characters and attempt to tokenize a classical poem. 

In [None]:
seg_list = jieba.cut("建章歡賞夕，二八盡妖妍。羅綺昭陽殿，芬芳玳瑁筵 。", cut_all=False)
print("Default Mode + dictionary: " + " ".join(seg_list))

Default Mode + dictionary: 建章 歡賞夕 ， 二八 盡 妖妍 。 羅綺 昭陽殿 ， 芬芳 玳瑁筵   。


### Tokenize and save a text

The name of the file to tokenize should be inserted in "Step 3". <br>
The file is saved in the 'results' folder.
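
The saving cells in this notebook assume that the 'results' folder already exists in the project folder. If it might be missing, a small helper can create it before writing. This is a minimal sketch; `save_lines` is a hypothetical helper and not part of any of the tokenizers.

```python
import os

def save_lines(lines, path):
    """Write one tokenized sentence per line, creating the folder if needed."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)  # does nothing if the folder exists
    with open(path, "w", encoding="utf-8") as fh:
        for line in lines:
            fh.write(f"{line}\n")
```

It expects the sentences to be already joined into strings, for example `save_lines(["我 来到 北京"], "./results/example.txt")`.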




In [None]:
#write one tokenized sentence per line; the file is closed automatically
with open('./results/jieba_tokenized.txt', 'w', encoding='utf-8') as fh:
  for phrase in tqdm(sentences):
    seg_list = jieba.cut(phrase, cut_all=False)
    joined_list = " ".join(seg_list)
    fh.write(f'{joined_list}\n')





## HanLP

[Original HanLP repository](https://github.com/hankcs/HanLP) <br>
[Python version of HanLP](https://github.com/hankcs/pyhanlp/wiki/%E6%89%8B%E5%8A%A8%E9%85%8D%E7%BD%AE)

HanLP is another heavyweight in Chinese language processing. It is written in Java and has a Python interface (pyhanlp) added on top of it, so parts of the code are counterintuitive for Python users. <br><br>
HanLP relies heavily on machine learning and covers a wide range of tasks, including tokenization, part-of-speech tagging and dependency parsing. It also offers smart pinyin tagging and conversion from simplified to full-form characters. Because it uses machine learning, it distinguishes between cases like 后/後 and 云/雲 and tries to choose the more appropriate character in each case. <br>
Below I only cover basic tokenization with or without a user dictionary. <br> HanLP works very well for modern Chinese. For wenyan it often leaves long sequences of characters unsplit, but unlike jieba it does not introduce any extra noise. <br>

**Important!**
- HanLP is not pre-installed in Google Colab, so we first need to install it and download all the files it needs to work
- keep an eye on data usage
- after this notebook is closed, all the downloaded data will be deleted and will need to be downloaded again the next time it is used

In [None]:
!pip install pyhanlp
import pyhanlp
from pyhanlp import *

Collecting pyhanlp
  Downloading https://files.pythonhosted.org/packages/8f/99/13078d71bc9f77705a29f932359046abac3001335ea1d21e91120b200b21/pyhanlp-0.1.66.tar.gz (86kB)
Collecting jpype1==0.7.0
  Downloading https://files.pythonhosted.org/packages/07/09/e19ce27d41d4f66d73ac5b6c6a188c51b506f56c7bfbe6c1491db2d15995/JPype1-0.7.0-cp36-cp36m-manylinux2010_x86_64.whl (2.7MB)
Building wheels for collected packages: pyhanlp
  Building wheel for pyhanlp (setup.py) ... done
  Created wheel for pyhanlp: filename=pyhanlp-0.1.66-py2.py3-none-any.whl size=29371 sha256=d2eb230d91c947d8da511175db62e835fcc25858539e4180bcf6fa3b59187378
  Stored in directory: /root/.cache/pip/wheels/25/8d/5d/6b642484b1abd87474914e6cf0d3f3a15d8f2653e15ff60f9e
Successfully built pyhanlp
Installing collected packages: jpype1, pyhanlp
Successfully installed jpype1-0.7.0 pyhan

### Tokenize a sentence

In [23]:
text = "我来到北京清华大学"
seg_list = [str(i) for i in HanLP.segment(text)] #hanlp is written in java, so a conversion to Python format is necessary
print(" ".join(seg_list))

我/rr 来到/v 北京/ns 清华大学/ntu


Let's remove the PoS tags from the output. Do not run this cell if you want to keep them. The tags will not work well for MC and OC.

In [24]:
JClass("com.hankcs.hanlp.HanLP$Config").ShowTermNature = False

In [None]:
seg_list = [str(item) for item in HanLP.segment(text)] #hanlp is written in java, so a conversion is necessary
print(" ".join(seg_list))

我 来到 北京 清华大学
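
If you would rather keep the tags and separate them later instead of switching them off globally, the "word/tag" strings shown earlier can be split in plain Python. This is a minimal sketch, assuming every tagged token has the form word/tag; `split_tagged` is a hypothetical helper and not part of pyhanlp.

```python
def split_tagged(token):
    """Split a 'word/tag' string into (word, tag); tag is None when absent."""
    # Split at the last slash, so a word that itself contains "/" stays whole.
    word, sep, tag = token.rpartition("/")
    if not sep:
        return token, None
    return word, tag

tagged = "我/rr 来到/v 北京/ns 清华大学/ntu".split()
pairs = [split_tagged(t) for t in tagged]
print(" ".join(word for word, tag in pairs))
# prints: 我 来到 北京 清华大学
```

The list `pairs` keeps the tags next to the words, so they stay available for later filtering.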


Load a user dictionary. <br>
Do not run this cell if you don't want to use it. 
Run the whole HanLP section again from the beginning if you want to stop using it.

In [25]:
CustomDictionary = JClass("com.hankcs.hanlp.dictionary.CustomDictionary")

#read the word list as UTF-8 and skip empty lines
with open(dictionary_path, encoding='utf-8') as f:
    words = [line.strip() for line in f if line.strip()]

for word in words:
    CustomDictionary.add(word)

When it comes to classical Chinese, unlike jieba, HanLP decides to split some character sequences. 

In [26]:
seg_list = [str(item) for item in HanLP.segment("建章歡賞夕，二八盡妖妍。羅綺昭陽殿，芬芳玳瑁筵 。")] 
print(" ".join(seg_list))

建章 歡 賞 夕 ， 二八 盡 妖 妍 。 羅綺 昭陽殿 ， 芬芳 玳瑁筵   。


### Tokenize and save a text

The name of the file to tokenize should be inserted in "Step 3". <br>
The file is saved in the 'results' folder.

In [None]:
#write one tokenized sentence per line; the file is closed automatically
with open('./results/hanlp_tokenized.txt', 'w', encoding='utf-8') as fh:
  for phrase in tqdm(sentences):
    seg_list = [str(item) for item in HanLP.segment(phrase)]
    joined_list = " ".join(seg_list)
    fh.write(f'{joined_list}\n')





## Udkanbun

[GitHub repository](https://github.com/KoichiYasuoka/UD-Kanbun) <br>
[Universal Dependencies Treebank of the Four Books in Classical Chinese](http://kanji.zinbun.kyoto-u.ac.jp/~yasuoka/publications/2019-12-04.pdf) <br>
[Project page](http://kanji.zinbun.kyoto-u.ac.jp/~yasuoka/kyodokenkyu/2018-12-01.html)

Udkanbun is a tokenizer, PoS tagger, and dependency parser for Classical Chinese texts (漢文/文言文). It was primarily created for dependency parsing. This means that when it comes to compounds, the algorithm prefers to treat them as separate words and map their syntactic relationship.

Just like with HanLP, it needs to be installed before we can use it.


In [None]:
!pip install udkanbun
import udkanbun

Collecting udkanbun
  Downloading https://files.pythonhosted.org/packages/b4/88/dec18d7ad738edeaacb050e3285c09a567f767e2c4c2d730e9bd5d61e1c3/udkanbun-2.7.2.tar.gz (13.9MB)
Collecting ufal.udpipe>=1.2.0
  Downloading https://files.pythonhosted.org/packages/e5/72/2b8b9dc7c80017c790bb3308bbad34b57accfed2ac2f1f4ab252ff4e9cb2/ufal.udpipe-1.2.0.3.tar.gz (304kB)
Collecting mecab-python3>=0.996.5
  Downloading https://files.pythonhosted.org/packages/b4/f0/b57bfb29abd6b898d7137f4a276a338d2565f28a2098d60714388d119f3e/mecab_python3-1.0.3-cp36-cp36m-manylinux1_x86_64.whl (487kB)
Collecting deplacy>=1.8.9
  Downloading https://files.pythonhosted.org/packages/a4/85/707706dbc2e0626b5408c2469908810e5c4d5374995745bca04f450f41ec/deplacy-1.8.9-py3-none-any.whl
Building wheels for collected packa

Load the tokenizer.

In [None]:
lzh=udkanbun.load()

### Working with one sentence. Full information and syntactic trees.

In [None]:
text = "建章歡賞夕，二八盡妖妍。羅綺昭陽殿，芬芳玳瑁筵 。"

In [None]:
seg_phrase = lzh(text)

View full information.

In [None]:
print(seg_phrase)

# text = 建章歡賞夕，二八盡妖妍。
1	建	建	VERB	v,動詞,行為,動作	_	0	root	_	Gloss=establish|SpaceAfter=No
2	章	章	PROPN	n,名詞,人,姓氏	NameType=Sur	1	obj	_	Gloss=[surname]|SpaceAfter=No
3	歡	歡	PROPN	n,名詞,人,名	NameType=Giv	2	flat	_	Gloss=[given-name]|SpaceAfter=No|Translit=欢
4	賞	賞	VERB	v,動詞,行為,交流	_	1	parataxis	_	Gloss=reward|SpaceAfter=No|Translit=赏
5	夕	夕	NOUN	n,名詞,時,*	Case=Tem	4	obj	_	Gloss=evening|SpaceAfter=No
6	，	，	PUNCT	s,記号,読点,*	_	9	cc	_	SpaceAfter=No
7	二	二	NUM	n,数詞,数字,*	_	9	nsubj	_	Gloss=two|SpaceAfter=No
8	八	八	NUM	n,数詞,数字,*	_	7	conj	_	Gloss=eight|SpaceAfter=No
9	盡	盡	VERB	v,動詞,行為,動作	_	1	conj	_	Gloss=exhaust|SpaceAfter=No|Translit=尽
10	妖	妖	VERB	v,動詞,描写,態度	Degree=Pos	11	amod	_	Gloss=bewitching|SpaceAfter=No
11	妍	妍	NOUN	n,名詞,*,*	_	9	obj	_	SpaceAfter=No
12	。	。	PUNCT	s,記号,句点,*	_	1	punct	_	SpaceAfter=No

# text = 羅綺昭陽殿，芬芳玳瑁筵 。
1	羅	羅	VERB	v,動詞,行為,動作	_	0	root	_	Gloss=gather|SpaceAfter=No|Translit=罗
2	綺	綺	VERB	v,動詞,描写,形質	Degree=Pos	1	flat:vv	_	SpaceAfter=No|Translit=绮
3	昭	昭	VERB	v,動詞,描写,形質	Degree=Pos	5	amod	_	Gloss=br
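
The output above is tab-separated, with one token per line and comment lines starting with "#", which looks like the CoNLL-U layout used by Universal Dependencies. Under that assumption, the columns can be pulled apart in plain Python. This is a minimal sketch; `parse_conllu` is a hypothetical helper and not part of udkanbun.

```python
def parse_conllu(conllu_text):
    """Turn tab-separated CoNLL-U style lines into a list of token dicts."""
    tokens = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and "# text = ..." comments
        cols = line.split("\t")
        if len(cols) < 8:
            continue  # skip incomplete rows
        tokens.append({
            "id": cols[0],      # position in the sentence
            "form": cols[1],    # the token itself
            "upos": cols[3],    # universal part-of-speech tag
            "head": cols[6],    # id of the syntactic head (0 = root)
            "deprel": cols[7],  # dependency relation to the head
        })
    return tokens
```

Since `print(seg_phrase)` shows this text, `parse_conllu(str(seg_phrase))` should give one dictionary per token, assuming `str()` returns the same text that `print` displays.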

Show just the visual representation of the dependency tree.

In [None]:
print(seg_phrase.to_tree())

建 ═══╗═╗═══╗═╗ root
章 ═╗<╝ ║   ║ ║ obj
歡 <╝   ║   ║ ║ flat
賞 ═╗<══╝   ║ ║ parataxis
夕 <╝       ║ ║ obj
， <══════╗ ║ ║ cc
二 ═╗<══╗ ║ ║ ║ nsubj
八 <╝   ║ ║ ║ ║ conj
盡 ═══╗═╝═╝<╝ ║ conj
妖 <╗ ║       ║ amod
妍 ═╝<╝       ║ obj
。 <══════════╝ punct
羅 ═╗═══╗═══╗═╗ root
綺 <╝   ║   ║ ║ flat:vv
昭 <══╗ ║   ║ ║ amod
陽 <╗ ║ ║   ║ ║ nmod
殿 ═╝═╝<╝   ║ ║ obj
， <══════╗ ║ ║ advmod
芬 <══╗   ║ ║ ║ nmod
芳 ═╗═╝═╗═╝<╝ ║ obj
玳 <╝   ║     ║ conj
瑁 <╗   ║     ║ nmod
筵 ═╝<══╝     ║ conj
。 <══════════╝ punct



We can save the tree to an .svg file. <br>
The contents of the file might not be visible within Google Drive. To view it, download the file to your computer, right-click it and choose "Open with" => "Google Chrome".

In [None]:
#save the dependency tree as an SVG image; the file is closed automatically
with open("trial.svg", "w", encoding="utf-8") as f:
    f.write(seg_phrase.to_svg())

Only show the tokenized text.


In [None]:
print(" ".join([i.form for i in seg_phrase[1:]]))

建 章 歡 賞 夕 ， 二 八 盡 妖 妍 。 羅 綺 昭 陽 殿 ， 芬 芳 玳 瑁 筵 。


### Tokenize and save a text

The name of the file to tokenize should be inserted in "Step 3". <br>
The file is saved in the 'results' folder.

In [None]:
#write one tokenized sentence per line; the file is closed automatically
with open('./results/udkanbun_tokenized.txt', 'w', encoding='utf-8') as fh:
  for phrase in tqdm(sentences):
    seg_list = lzh(phrase)
    seg_phrase = " ".join([i.form for i in seg_list[1:]])
    fh.write(f'{seg_phrase}\n')





## Dictionary-based tokenization
This is a small and very simple script that crawls through the text and tries to match character sequences to a dictionary. <br>
It does not use any advanced techniques and because of that tends to get "greedy": in a sequence "ABCD", even if the best way to tokenize is "AB" + "CD", it will return "ABC" + "D" whenever "ABC" is in the dictionary. To deal with this, the maximum allowed word length is set to 2, but it can be changed manually below.<br>
Unlike udkanbun, it keeps larger chunks together when possible, but it will not leave whole sentences unsplit.
<br>
**Important** <br>
- Because of its simplicity, this tokenizer is also very slow. It may run slightly faster on a more powerful Colab runtime: go to "Runtime" => "Change runtime type" and set "Hardware accelerator" to GPU.
- After changing the settings, click "Runtime" => "Restart runtime" to restart the notebook, run all the cells in steps 1-3 again and return to this one.
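
The "greedy" behaviour described above can be reproduced in a few lines. The sketch below shows forward maximum matching, one common way to write such a tokenizer. It is not the code from the helper module used in this notebook: `greedy_tokenize` and `toy_vocab` are made up for this illustration.

```python
def greedy_tokenize(text, vocab, longest_word=2):
    """Forward maximum matching: at each position take the longest dictionary word."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(longest_word, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                pos += size
                break
    return tokens

toy_vocab = {"AB", "CD", "ABC"}  # made-up entries for illustration
print(greedy_tokenize("ABCD", toy_vocab, longest_word=3))  # ['ABC', 'D']
print(greedy_tokenize("ABCD", toy_vocab, longest_word=2))  # ['AB', 'CD']
```

With a limit of 3 the longer match "ABC" wins and "D" is left alone; with a limit of 2 the split is "AB" + "CD". Keeping the dictionary in a Python set, as here, makes each lookup fast regardless of the dictionary size.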


In [5]:
from helper import dict_tokenizer as dt

Load the dictionary. <br>
By default, CDICT is used. If you want to use another one, replace the dictionary file name in "Step 3".

In [6]:
dictionary = dt.open_vocab(dictionary_path)

### Tokenize a sentence
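By default, only words of 1 or 2 characters are allowed. Change the number in "longest_word" to allow longer ones. The cell uses the `text` variable, so run one of the cells above that defines it first.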

In [None]:
seg_list = dt.tokenize(text, vocab = dictionary, longest_word=2)
print(" ".join(seg_list))

建章 歡 賞 夕 ， 二八 盡 妖 妍 。 羅綺 昭陽 殿 ， 芬芳 玳瑁 筵 。


### Tokenize a text

In [7]:
#write one tokenized sentence per line; the file is closed automatically
with open('./results/dictionary_tokenized.txt', 'w', encoding='utf-8') as fh:
  for phrase in tqdm(sentences):
    seg_list = dt.tokenize(phrase, vocab = dictionary, longest_word=2)
    seg_phrase = " ".join(seg_list)
    fh.write(f'{seg_phrase}\n')



