<a href="https://colab.research.google.com/github/mzorki/tutorials/blob/main/Tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# This tutorial

This notebook contains Python 3 code for trying out several types of tokenizers for Chinese. The corresponding git repository and README file are here: https://github.com/mzorki/tutorials
- The notebook can also be used by those who do not know how to code. Just follow the instructions and run the code in each cell by pressing the "play" button in its upper left corner.
- If you are running the notebook in Google Colab, the table of contents can be opened from the menu on the left side of this window.
- For more information on tokenization, read this article in Digital Orientalist
TODO insert.


# Step 1. Import all the required libraries.


In [2]:
import pandas as pd
import json
from pandas import json_normalize
from tqdm.notebook import tqdm 
import re

# Step 2. Mount Google Drive
For Colab to work with Google Drive as a normal directory, we need to give it permission to do so and tell it where this notebook is located. <br>
Run the cell, click the link that appears below it, grant the permissions, and copy the code shown on that page into the field below.

In [3]:
from google.colab import drive
drive.mount('/gdrive', force_remount=True)

Mounted at /gdrive


Enter the path to the project. It generally starts with "/gdrive/My Drive/" followed by the path to the project folder. <br>
In my case, I have a folder "Chinese_tokenizers" in the top-level Google Drive folder called "My Drive", hence the path. <br>
It is always better not to have spaces in folder names.
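
As a small aid for the advice about spaces, the sketch below lists the folders in a path whose names contain a space. The helper name `folders_with_spaces` is hypothetical and not part of this notebook; it only inspects the path string and does not touch Google Drive.

```python
from pathlib import PurePosixPath


def folders_with_spaces(path):
    """Return the components of a POSIX-style path that contain a space."""
    return [part for part in PurePosixPath(path).parts if " " in part]


# the same path that is used in the cell below
example_dir = "/gdrive/My Drive/Chinese_tokenizers/"
print(folders_with_spaces(example_dir))
```

An empty list means that no folder name in the path contains a space.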


In [4]:
project_dir = "/gdrive/My Drive/Chinese_tokenizers/" 

#move to the working directory 
%cd {project_dir} 

/gdrive/My Drive/Chinese_tokenizers


# Step 3. Save the paths to the dictionary and the text to tokenize.

This notebook uses the CDICT (Stardict) dictionary.
To use a custom dictionary:
- create a .txt file with one word per line
- upload the dictionary to the project folder
- change the file name in the cell below

To tokenize a text:
- create a .txt file with the text to tokenize
- the text needs no special formatting; any formatting it does have may be lost during tokenization
- change the file name in the cell below
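
If your word list exists as a Python list rather than a file, a custom dictionary in the one-word-per-line format described above could be produced roughly as follows. This is a minimal sketch: the function `write_wordlist`, the sample words and the file name `my_wordlist.txt` are all made up for illustration, and the file is written to a temporary folder rather than to the project folder.

```python
import os
import tempfile


def write_wordlist(words, path):
    """Write unique, non-empty words to `path`, one word per line (UTF-8)."""
    seen = set()
    with open(path, "w", encoding="utf-8") as fh:
        for word in words:
            word = word.strip()
            if word and word not in seen:
                seen.add(word)
                fh.write(word + "\n")
    return len(seen)


# hypothetical sample entries; replace them with your own vocabulary
sample_words = ["建章", "二八", "芬芳", "二八", "  ", "玳瑁"]
demo_path = os.path.join(tempfile.mkdtemp(), "my_wordlist.txt")
n_written = write_wordlist(sample_words, demo_path)

# read the file back to check its contents
with open(demo_path, encoding="utf-8") as fh:
    wordlist_lines = fh.read().splitlines()
print(n_written, wordlist_lines)
```

Duplicates and blank entries are dropped, so each line of the resulting file holds exactly one word.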

In [14]:
dictionary_path = 'CDICT(Stardict)_wordlist.txt'
text_path = 'test_text.txt'

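# read the text as UTF-8 and keep only the non-empty lines
with open(text_path, encoding='utf-8') as f:
  tok_text = f.read()
sentences = [line for line in tok_text.split('\n') if line != '']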

# Step 4. Test the tokenizers.

Below is example code for working with several tokenizers. Each section generally has the following parts:
- a short description of the tool
- a link to the GitHub repository
- example code to tokenize one phrase
- code to load the dictionary
- example code to tokenize a .txt file

Please note:
- once a user dictionary has been loaded, the tokenizer remembers it for all later operations. To reset it, run the cells from the beginning of the section again
- output files are saved to the working folder. Google Drive sometimes takes a while to update the folder and show a new file. If you do not see it, try refreshing the Google Drive page or waiting for a minute.

## Jieba

https://github.com/fxsjy/jieba

jieba is one of the most popular tokenizers for Modern Chinese. Its GitHub page has very detailed instructions, with more explanations and examples.
<br>
It has many fine-tuning options (including whether it uses only its statistical model or adds machine learning), PoS tagging, and the option to add a user-defined dictionary.
<br>
It is very good with Modern Chinese. It works fine with traditional (full-form) characters, but in Middle Chinese (MC) and Old Chinese (OC) texts it will recognize many long phrases as single words.

In [6]:
import jieba

#remove the "#" below if you want jieba to use machine learning. 
#jieba.enable_paddle() 

### Tokenize a sentence.
Insert your own text to test it.

In [9]:
text = "我来到北京清华大学"

Tokenization in the "Default mode": <br>
note that, where possible, this algorithm prefers to keep longer sequences of characters together as a single token.

In [11]:
seg_list = jieba.cut(text, cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # accurate mode (精确模式)

Default Mode: 我/ 来到/ 北京/ 清华大学


Tokenization in "Full mode": <br>
note how, wherever the segmentation is ambiguous, the algorithm returns **all possible** variants. This adds a lot of noise and makes the output unsuitable for corpus analysis.

In [10]:
seg_list = jieba.cut(text, cut_all=True)
print("Full Mode: " + " ".join(seg_list))  # full mode (全模式)

Full Mode: 我 来到 北京 清华 清华大学 华大 大学
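
To make "all possible variants" more concrete, here is a toy illustration in plain Python. It is not jieba's implementation, which is more involved; the function `full_mode_sketch` and the six-word `toy_vocab` are assumptions made only for this example. The sketch lists every dictionary word that occurs anywhere in the text, including overlapping ones, and falls back to a single character where no word covers a position.

```python
def full_mode_sketch(text, vocab):
    """List every vocabulary word found in `text`, overlaps included."""
    found = []
    covered_until = 0  # end index of the text already covered by some word
    for start in range(len(text)):
        matched = False
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in vocab:
                found.append(text[start:end])
                covered_until = max(covered_until, end)
                matched = True
        # a character that no word covers is kept on its own
        if not matched and start >= covered_until:
            found.append(text[start])
            covered_until = start + 1
    return found


# hypothetical mini-dictionary, just large enough for this sentence
toy_vocab = {"来到", "北京", "清华", "清华大学", "华大", "大学"}
print(" ".join(full_mode_sketch("我来到北京清华大学", toy_vocab)))
```

Because 清华, 清华大学, 华大 and 大学 are all reported for the same four characters, the number of tokens no longer matches the length of the text, which is where the noise comes from.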


Load the dictionary file. 


In [12]:
jieba.load_userdict(dictionary_path)

Run the tokenization with the dictionary.

In [23]:
seg_list = jieba.cut("建章歡賞夕，二八盡妖妍。羅綺昭陽殿，芬芳玳瑁筵 。", cut_all=False)
# cut_all=False means that this is the default mode, not the full mode
print("Default Mode + dictionary: " + " ".join(seg_list))

Default Mode + dictionary: 建章 歡賞夕 ， 二八 盡 妖妍 。 羅綺 昭陽殿 ， 芬芳 玳瑁筵   。


### Tokenize a text

The name of the file to tokenize should be inserted in "Step 3". <br>




In [19]:
# optional test: write a small file to check that the working folder is writable
with open('k.txt', 'w', encoding='utf-8') as fh:
  fh.write('hey')

In [36]:
# write one tokenized sentence per line, with tokens separated by spaces
with open('jieba_tokenized.txt', 'w', encoding='utf-8') as fh:
  for phrase in tqdm(sentences):
    seg_list = jieba.cut(phrase, cut_all=False)
    joined_list = " ".join(seg_list)
    fh.write(f'{joined_list}\n')





## HanLP
HanLP is a ... <br>
It is not pre-installed in Google Colab, so we first need to install it and download all the files it needs to work. <br><br>
**Important!**
- keep an eye on data usage
- once this notebook is closed, all the downloaded data are deleted, so everything has to be reinstalled each time.

In [24]:
!pip install pyhanlp
import pyhanlp



## Udkanbun

Udkanbun is ... https://github.com/KoichiYasuoka/UD-Kanbun <br>
Once again, it needs to be installed before we can use it.


In [25]:
!pip install udkanbun
import udkanbun

Collecting udkanbun
  Downloading https://files.pythonhosted.org/packages/b4/88/dec18d7ad738edeaacb050e3285c09a567f767e2c4c2d730e9bd5d61e1c3/udkanbun-2.7.2.tar.gz (13.9MB)
Collecting ufal.udpipe>=1.2.0
  Downloading https://files.pythonhosted.org/packages/e5/72/2b8b9dc7c80017c790bb3308bbad34b57accfed2ac2f1f4ab252ff4e9cb2/ufal.udpipe-1.2.0.3.tar.gz (304kB)
Collecting mecab-python3>=0.996.5
  Downloading https://files.pythonhosted.org/packages/b4/f0/b57bfb29abd6b898d7137f4a276a338d2565f28a2098d60714388d119f3e/mecab_python3-1.0.3-cp36-cp36m-manylinux1_x86_64.whl (487kB)
Collecting deplacy>=1.8.9
  Downloading https://files.pythonhosted.org/packages/a4/85/707706dbc2e0626b5408c2469908810e5c4d5374995745bca04f450f41ec/deplacy-1.8.9-py3-none-any.whl
Building wheels for collected packa

Load the tokenizer.

In [26]:
lzh=udkanbun.load()

### Working with one phrase. Full information and syntactic trees.

In [27]:
text = "建章歡賞夕，二八盡妖妍。羅綺昭陽殿，芬芳玳瑁筵 。"

In [28]:
seg_phrase = lzh(text)

Printing the result shows the full analysis of every token in CoNLL-U format.

In [29]:
print(seg_phrase)

# text = 建章歡賞夕，二八盡妖妍。
1	建	建	VERB	v,動詞,行為,動作	_	0	root	_	Gloss=establish|SpaceAfter=No
2	章	章	PROPN	n,名詞,人,姓氏	NameType=Sur	1	obj	_	Gloss=[surname]|SpaceAfter=No
3	歡	歡	PROPN	n,名詞,人,名	NameType=Giv	2	flat	_	Gloss=[given-name]|SpaceAfter=No|Translit=欢
4	賞	賞	VERB	v,動詞,行為,交流	_	1	parataxis	_	Gloss=reward|SpaceAfter=No|Translit=赏
5	夕	夕	NOUN	n,名詞,時,*	Case=Tem	4	obj	_	Gloss=evening|SpaceAfter=No
6	，	，	PUNCT	s,記号,読点,*	_	9	cc	_	SpaceAfter=No
7	二	二	NUM	n,数詞,数字,*	_	9	nsubj	_	Gloss=two|SpaceAfter=No
8	八	八	NUM	n,数詞,数字,*	_	7	conj	_	Gloss=eight|SpaceAfter=No
9	盡	盡	VERB	v,動詞,行為,動作	_	1	conj	_	Gloss=exhaust|SpaceAfter=No|Translit=尽
10	妖	妖	VERB	v,動詞,描写,態度	Degree=Pos	11	amod	_	Gloss=bewitching|SpaceAfter=No
11	妍	妍	NOUN	n,名詞,*,*	_	9	obj	_	SpaceAfter=No
12	。	。	PUNCT	s,記号,句点,*	_	1	punct	_	SpaceAfter=No

# text = 羅綺昭陽殿，芬芳玳瑁筵 。
1	羅	羅	VERB	v,動詞,行為,動作	_	0	root	_	Gloss=gather|SpaceAfter=No|Translit=罗
2	綺	綺	VERB	v,動詞,描写,形質	Degree=Pos	1	flat:vv	_	SpaceAfter=No|Translit=绮
3	昭	昭	VERB	v,動詞,描写,形質	Degree=Pos	5	amod	_	Gloss=br

Or just the visual representation.

In [30]:
print(seg_phrase.to_tree())

建 ═══╗═╗═══╗═╗ root
章 ═╗<╝ ║   ║ ║ obj
歡 <╝   ║   ║ ║ flat
賞 ═╗<══╝   ║ ║ parataxis
夕 <╝       ║ ║ obj
， <══════╗ ║ ║ cc
二 ═╗<══╗ ║ ║ ║ nsubj
八 <╝   ║ ║ ║ ║ conj
盡 ═══╗═╝═╝<╝ ║ conj
妖 <╗ ║       ║ amod
妍 ═╝<╝       ║ obj
。 <══════════╝ punct
羅 ═╗═══╗═══╗═╗ root
綺 <╝   ║   ║ ║ flat:vv
昭 <══╗ ║   ║ ║ amod
陽 <╗ ║ ║   ║ ║ nmod
殿 ═╝═╝<╝   ║ ║ obj
， <══════╗ ║ ║ advmod
芬 <══╗   ║ ║ ║ nmod
芳 ═╗═╝═╗═╝<╝ ║ obj
玳 <╝   ║     ║ conj
瑁 <╗   ║     ║ nmod
筵 ═╝<══╝     ║ conj
。 <══════════╝ punct



We can save the tree to an .svg file. <br>
The contents of the file might not be viewable within Google Drive. To view it, download the file to your computer, right-click it and choose "Open with" => "Google Chrome".

In [34]:
# the tree contains Chinese characters, so write the file as UTF-8
with open("trial.svg", "w", encoding="utf-8") as f:
  f.write(seg_phrase.to_svg())

Print only the tokens, separated by spaces.

In [35]:
print(" ".join([i.form for i in seg_phrase[1:]]))

建 章 歡 賞 夕 ， 二 八 盡 妖 妍 。 羅 綺 昭 陽 殿 ， 芬 芳 玳 瑁 筵 。
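
The same token list can also be recovered from the CoNLL-U text printed earlier, where the second tab-separated column of each token line holds the word form. The sketch below assumes only that column layout; the function `conllu_forms` is hypothetical and is not part of udkanbun, and `sample` is a shortened copy of the first two token lines of the output above.

```python
def conllu_forms(conllu_text):
    """Collect the FORM column (second field) of every token line."""
    forms = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments such as "# text = ..."
        fields = line.split("\t")
        # ordinary token lines start with a plain integer ID
        if len(fields) > 1 and fields[0].isdigit():
            forms.append(fields[1])
    return forms


sample = (
    "# text = 建章\n"
    "1\t建\t建\tVERB\tv,動詞,行為,動作\t_\t0\troot\t_\tGloss=establish|SpaceAfter=No\n"
    "2\t章\t章\tPROPN\tn,名詞,人,姓氏\tNameType=Sur\t1\tobj\t_\tGloss=[surname]|SpaceAfter=No\n"
)
print(" ".join(conllu_forms(sample)))
```

This can be handy when the CoNLL-U output has been saved to a file and the original `seg_phrase` object is no longer available.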


### Working with a bigger text

In [38]:
# write one tokenized sentence per line, with tokens separated by spaces
with open('udkanbun_tokenized.txt', 'w', encoding='utf-8') as fh:
  for phrase in tqdm(sentences):
    seg_list = lzh(phrase)
    seg_phrase = " ".join([i.form for i in seg_list[1:]])
    fh.write(f'{seg_phrase}\n')





## Dictionary-based tokenization

In [41]:
from helper import dict_tokenizer as dt

By default, the CDICT word list set in "Step 3" is used. If you want to use another dictionary, replace the dictionary file name in "Step 3".

In [60]:
dictionary = dt.open_vocab(dictionary_path)

### Tokenize a sentence
By default, only words of one or two characters are allowed. Change the number in "longest_word" to allow longer ones.

In [56]:
seg_list = dt.tokenize(text, vocab = dictionary, longest_word=2)
print(" ".join(seg_list))

建章 歡 賞 夕 ， 二八 盡 妖 妍 。 羅綺 昭陽 殿 ， 芬芳 玳瑁 筵 。
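
The code of the `dict_tokenizer` helper is not shown in this notebook, so the sketch below is not its implementation. It illustrates one common way a dictionary-based tokenizer can work, forward maximum matching: at each position take the longest dictionary word of at most `longest_word` characters, otherwise a single character. The function `max_match` and the tiny `toy_vocab` are assumptions made for this example.

```python
def max_match(text, vocab, longest_word=2):
    """Greedy left-to-right segmentation against a set of known words."""
    tokens = []
    i = 0
    while i < len(text):
        # try the longest allowed candidate first, then shorter ones
        for size in range(min(longest_word, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # a single character is always accepted as a fallback
            if size == 1 or candidate in vocab:
                tokens.append(candidate)
                i += size
                break
    return tokens


# hypothetical mini-dictionary; the real word list is much larger
toy_vocab = {"建章", "二八", "妖妍", "歡賞夕"}
line = "建章歡賞夕，二八盡妖妍。"
print(" ".join(max_match(line, toy_vocab, longest_word=2)))
print(" ".join(max_match(line, toy_vocab, longest_word=3)))
```

With `longest_word=2` the three-character entry 歡賞夕 can never match and is split into single characters; raising the limit to 3 lets it through as one token.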


### Tokenize a text

In [54]:
# write one tokenized sentence per line, with tokens separated by spaces
with open('dictionary_tokenized.txt', 'w', encoding='utf-8') as fh:
  for phrase in tqdm(sentences):
    seg_list = dt.tokenize(phrase, vocab = dictionary, longest_word=2)
    seg_phrase = " ".join(seg_list)
    fh.write(f'{seg_phrase}\n')
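
The file-writing loops for jieba, Udkanbun and the dictionary tokenizer all follow the same pattern, so they could be factored into one function that takes the tokenizer as an argument. This is only a sketch: `tokenize_to_file` is a hypothetical name, and Python's built-in `list`, which splits a string into single characters, stands in for a real tokenizer so that the example runs without any extra packages.

```python
import os
import tempfile


def tokenize_to_file(sentences, tokenize, out_path):
    """Tokenize each sentence and write it as one space-separated line."""
    with open(out_path, "w", encoding="utf-8") as fh:
        for sentence in sentences:
            fh.write(" ".join(tokenize(sentence)) + "\n")


# stand-in data and a temporary output file, for demonstration only
demo_sentences = ["建章歡賞夕", "二八盡妖妍"]
demo_out = os.path.join(tempfile.mkdtemp(), "demo_tokenized.txt")
tokenize_to_file(demo_sentences, list, demo_out)

with open(demo_out, encoding="utf-8") as fh:
    demo_lines = fh.read().splitlines()
print(demo_lines)
```

In this notebook the `tokenize` argument could be, for example, `lambda s: jieba.cut(s, cut_all=False)` or `lambda s: dt.tokenize(s, vocab=dictionary, longest_word=2)`.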



