## Introduction to Chinese NLP using Jieba

Jieba is a useful library for Chinese word segmentation.

* ```pip install jieba```

Jieba supports three segmentation methods:

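- Accurate Mode (精確模式): tries to cut the sentence as precisely as possible; suited to text analysis: ```jieba.cut(sentence, cut_all=False)```

- Full Mode (全模式): scans out every substring of the sentence that can form a word; very fast, but cannot resolve ambiguity: ```jieba.cut(sentence, cut_all=True)```

- Search Engine Mode (搜索引擎模式): builds on Accurate Mode by splitting long words again to improve recall; suited to search-engine indexing: ```jieba.cut_for_search(sentence)```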

Jieba uses a Hidden Markov Model (HMM) by default to recognize words that are not in the dictionary. Turn it off by passing the keyword argument ```HMM=False```.
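To make the difference between Full Mode and Accurate Mode concrete, here is a toy sketch in plain Python. It is not jieba's implementation: the dictionary `TOY_DICT`, its frequencies, and the function names `full_mode` and `accurate_mode` are all made up for illustration. The jieba README describes the real approach as building a graph of all possible words from a prefix dictionary and then using dynamic programming to find the most probable path; the sketch only mimics that idea on a tiny scale.

```python
import math

# Toy dictionary with made-up frequencies (hypothetical, for illustration only).
TOY_DICT = {"獨立": 50, "音樂": 80, "獨立音樂": 5, "需要": 100, "大家": 90}


def full_mode(sentence, dictionary):
    """Return every dictionary word found anywhere in the sentence."""
    found = []
    for start in range(len(sentence)):
        for end in range(start + 1, len(sentence) + 1):
            if sentence[start:end] in dictionary:
                found.append(sentence[start:end])
    return found


def accurate_mode(sentence, dictionary):
    """Return the single segmentation with the highest total log-probability."""
    total = sum(dictionary.values())
    n = len(sentence)
    # best[i] = (score, next_index) for the best segmentation of sentence[i:]
    best = {n: (0.0, n)}
    for start in range(n - 1, -1, -1):
        candidates = []
        for end in range(start + 1, n + 1):
            word = sentence[start:end]
            # Unknown single characters get frequency 1 so a path always exists.
            freq = dictionary.get(word, 1 if end == start + 1 else 0)
            if freq:
                score = math.log(freq / total) + best[end][0]
                candidates.append((score, end))
        best[start] = max(candidates)
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words


toy_sentence = "獨立音樂需要大家"
print(full_mode(toy_sentence, TOY_DICT))
# ['獨立', '獨立音樂', '音樂', '需要', '大家']
print(accurate_mode(toy_sentence, TOY_DICT))
# ['獨立', '音樂', '需要', '大家']
```

Full Mode returns overlapping candidates, while Accurate Mode commits to one non-overlapping path. Which path wins depends entirely on the frequencies, which is why swapping the dictionary can change the result.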

In [1]:
import jieba

In [2]:
sentence = "獨立音樂需要大家一起來推廣，歡迎加入我們的行列！"
print ("Example：", sentence)

Example： 獨立音樂需要大家一起來推廣，歡迎加入我們的行列！


In [14]:
words = jieba.cut(sentence, cut_all=False)

In [15]:
print("Default/Accurate Mode:" + "/ ".join(words))

Default/Accurate Mode:獨立/ 音樂/ 需要/ 大家/ 一起/ 來/ 推廣/ ，/ 歡迎/ 加入/ 我們/ 的/ 行列/ ！


In [5]:
sentence_2 = "独立音乐需要大家一起来推广，欢迎加入我们的行列！"

In [6]:
words_2 = jieba.cut(sentence_2, cut_all=False)

In [8]:
l_2 = []
for x in words_2:
    l_2.append(x)

In [10]:
"/".join(l_2)

'音乐/需要/大家/一/起来/推广/，/欢迎/加入/我们/的/行列/！'

### Compare results: Trad. vs Simp. Chinese

There is a slight difference in the segmentation results between Traditional and Simplified Chinese.

This is because the default dictionary is built from Simplified Chinese.

## Chinese Lyrics Segmentation

In this example we will use ***Remember Me***, the famous theme song from the movie **Coco**.

In [11]:
lyrics = '''請記住我 雖然再見必須說
請記住我 眼淚不要墜落
我雖然要離你遠去 你住在我心底
在每個分離的夜裡 為你唱一首歌
請記住我 雖然我要去遠方
請記住我 當聽見吉他的悲傷
這就是我跟你在一起唯一的憑據
直到我再次擁抱你 請記住我

你閉上眼睛音樂就會響起 不停的愛就永不會流失
你閉上眼睛音樂就會響起 要不停的愛

請記住我 雖然再見必須說
請記住我 眼淚不要墜落
我雖然要離你遠去 你住在我心底
在每個分離的夜裡 為你唱一首歌
請記住我 我即將會消失
請記住我 我們的愛不會消失
我用我的辦法跟你一起不離不棄
直到我再次擁抱你 請記住我'''

In [19]:
words_3 = jieba.cut(lyrics, cut_all=False)

In [20]:
for x in words_3:
    print(x, end='/')

請/記住/我/ /雖然/再/見/必須/說/
/請/記住/我/ /眼淚/不要/墜落/
/我/雖然/要/離/你/遠/去/ /你/住/在/我/心底/
/在/每個/分離/的/夜裡/ /為/你/唱/一首歌/
/請/記住/我/ /雖然/我/要/去/遠方/
/請/記住/我/ /當聽/見/吉他/的/悲傷/
/這/就是/我/跟/你/在/一起/唯一/的/憑/據/
/直到/我/再次/擁抱/你/ /請/記住/我/
/
/你/閉上/眼睛/音樂/就/會響/起/ /不停/的/愛就/永不/會/流失/
/你/閉上/眼睛/音樂/就/會響/起/ /要/不停/的/愛/
/
/請/記住/我/ /雖然/再/見/必須/說/
/請/記住/我/ /眼淚/不要/墜落/
/我/雖然/要/離/你/遠/去/ /你/住/在/我/心底/
/在/每個/分離/的/夜裡/ /為/你/唱/一首歌/
/請/記住/我/ /我/即/將會/消失/
/請/記住/我/ /我們/的/愛不會/消失/
/我用/我/的/辦法/跟/你/一起/不離/不棄/
/直到/我/再次/擁抱/你/ /請/記住/我/

Satisfied with the result or not? Compare it with the output in the next section!

## Use Custom Dictionary:
We can change the dictionary to a Traditional Chinese one in hopes of getting better performance.

Download an example Traditional Chinese dictionary: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big

In [21]:
jieba.set_dictionary('dict.txt.big')

In [22]:
words_4 = jieba.cut(lyrics, cut_all=False)

In [23]:
for x in words_4:
    print(x, end='/')

Building prefix dict from /Users/jsnceo/Py/jieba/dict.txt.big ...
Loading model from cache /var/folders/fp/ygfr2w8s7gb54j7j6d93v4980000gn/T/jieba.uef4963dead3e4a0d1fa39402d71b261e.cache
Loading model cost 1.236 seconds.
Prefix dict has been built succesfully.


請/記住/我/ /雖然/再見/必須/說/
/請/記住/我/ /眼淚/不要/墜落/
/我/雖然/要/離/你/遠去/ /你/住/在/我/心底/
/在/每個/分離/的/夜裡/ /為/你/唱/一首歌/
/請/記住/我/ /雖然/我要/去/遠方/
/請/記住/我/ /當/聽見/吉他/的/悲傷/
/這/就是/我/跟/你/在/一起/唯一/的/憑據/
/直到/我/再次/擁抱/你/ /請/記住/我/
/
/你/閉上眼睛/音樂/就/會/響起/ /不停/的/愛就/永不/會/流失/
/你/閉上眼睛/音樂/就/會/響起/ /要/不停/的/愛/
/
/請/記住/我/ /雖然/再見/必須/說/
/請/記住/我/ /眼淚/不要/墜落/
/我/雖然/要/離/你/遠去/ /你/住/在/我/心底/
/在/每個/分離/的/夜裡/ /為/你/唱/一首歌/
/請/記住/我/ /我/即將/會/消失/
/請/記住/我/ /我們/的/愛/不會/消失/
/我用/我/的/辦法/跟/你/一起/不/離/不棄/
/直到/我/再次/擁抱/你/ /請/記住/我/

### Voilà! Small but important differences:

* 憑/據/ is now grouped as 憑據
* 閉上/眼睛/ is now grouped as 閉上眼睛
* 再/見 is now grouped as 再見
* 我/要/ is now grouped as 我要
* /當聽/見/ is now split as /當/聽見/
* 就/會響/起/ is now split as 就/會/響起/
* /我/即/將會/消失/ is now split as /我/即將/會/消失/
* /我們/的/愛不會/消失/ is now split as /我們/的/愛/不會/消失/
* 不離/不棄/ is now split as 不/離/不棄/  *(the only example that got worse)*
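
The differences above were spotted by eye. As a rough sketch, the standard-library `difflib.SequenceMatcher` can list them automatically when given two token lists. The helper name `diff_segmentations` is made up for this example, and the tokens are copied by hand from one lyric line in the two outputs above rather than produced by calling jieba.

```python
from difflib import SequenceMatcher


def diff_segmentations(before, after):
    """Yield (old_tokens, new_tokens) for each span where two segmentations differ."""
    matcher = SequenceMatcher(a=before, b=after, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            yield before[i1:i2], after[j1:j2]


# Tokens for one lyric line, copied from the outputs above.
default_tokens = "這/就是/我/跟/你/在/一起/唯一/的/憑/據".split("/")
big_tokens = "這/就是/我/跟/你/在/一起/唯一/的/憑據".split("/")

for old, new in diff_segmentations(default_tokens, big_tokens):
    print("/".join(old), "->", "/".join(new))
# 憑/據 -> 憑據
```

With real segmenter output, the same helper could be fed two lists built from `jieba.cut` before and after switching dictionaries.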

## Load Custom Dictionary to add self-defined new words

This function lets the user add new words on top of the default (base) dictionary. Although Jieba uses an HMM to identify new words, supplying them ourselves is more accurate.

First, we have to create a file in the same format as the dictionary above. Each line contains a word, its frequency, and its POS tag, separated by spaces.

> For the list of POS tags, see https://blog.csdn.net/kevin_darkelf/article/details/39520881

> Here, `i` means idiom

In [25]:
!echo '不離不棄 2 i' | tee userdict.txt

不離不棄 2 i


Using the command line, we created a file named userdict.txt containing the new word we want.
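
The same file can also be produced from Python, which is convenient when there are many words. The following is a minimal sketch: the helper name `write_userdict` and the file name `userdict_demo.txt` are made up for this example, and it assumes the `word freq tag` layout described above, where the frequency and the tag may be left out.

```python
from pathlib import Path


def write_userdict(path, entries):
    """Write (word, freq, tag) tuples as 'word freq tag' lines.

    freq and tag may be None, in which case they are omitted from the line.
    """
    lines = []
    for word, freq, tag in entries:
        parts = [word]
        if freq is not None:
            parts.append(str(freq))
        if tag is not None:
            parts.append(tag)
        lines.append(" ".join(parts))
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


print(write_userdict("userdict_demo.txt", [("不離不棄", 2, "i")]))
# ['不離不棄 2 i']
```

For a handful of words, `jieba.add_word(word, freq=None, tag=None)` is an alternative that skips the file entirely.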

Load the dict using `jieba.load_userdict()`

In [27]:
jieba.load_userdict('userdict.txt')

In [28]:
words_5 = jieba.cut(lyrics, cut_all=False)

In [29]:
for x in words_5:
    print(x, end='/')

請/記住/我/ /雖然/再見/必須/說/
/請/記住/我/ /眼淚/不要/墜落/
/我/雖然/要/離/你/遠去/ /你/住/在/我/心底/
/在/每個/分離/的/夜裡/ /為/你/唱/一首歌/
/請/記住/我/ /雖然/我要/去/遠方/
/請/記住/我/ /當/聽見/吉他/的/悲傷/
/這/就是/我/跟/你/在/一起/唯一/的/憑據/
/直到/我/再次/擁抱/你/ /請/記住/我/
/
/你/閉上眼睛/音樂/就/會/響起/ /不停/的/愛就/永不/會/流失/
/你/閉上眼睛/音樂/就/會/響起/ /要/不停/的/愛/
/
/請/記住/我/ /雖然/再見/必須/說/
/請/記住/我/ /眼淚/不要/墜落/
/我/雖然/要/離/你/遠去/ /你/住/在/我/心底/
/在/每個/分離/的/夜裡/ /為/你/唱/一首歌/
/請/記住/我/ /我/即將/會/消失/
/請/記住/我/ /我們/的/愛/不會/消失/
/我用/我/的/辦法/跟/你/一起/不離不棄/
/直到/我/再次/擁抱/你/ /請/記住/我/

See? 不離不棄 is identified as a single group!

## Return words with their Part of Speech

In [31]:
from jieba import posseg as pseg

In [34]:
words_pseg = pseg.cut('''你閉上眼睛音樂就會響起 不停的愛就永不會流失
你閉上眼睛音樂就會響起 要不停的愛''')

In [35]:
for x in words_pseg:
    print(x)

你/r
閉上眼睛/i
音樂/n
就/d
會/v
響起/v
 /x
不停/d
的/uj
愛/v
就/d
永不/d
會/v
流失/v

/x
你/r
閉上眼睛/i
音樂/n
就/d
會/v
響起/v
 /x
要/v
不停/d
的/uj
愛/n
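
Each item printed above pairs a word with its POS flag, so the words can be filtered or grouped by flag. Below is a small sketch using pairs copied by hand from the first lyric line of the output (the whitespace token is left out). The helper name `group_by_flag` is made up; with jieba itself, a loop of the form `for word, flag in pseg.cut(...)` should yield the same kind of pairs.

```python
from collections import defaultdict

# (word, flag) pairs copied from the first lyric line of the output above.
tagged = [
    ("你", "r"), ("閉上眼睛", "i"), ("音樂", "n"), ("就", "d"),
    ("會", "v"), ("響起", "v"), ("不停", "d"), ("的", "uj"),
    ("愛", "v"), ("就", "d"), ("永不", "d"), ("會", "v"), ("流失", "v"),
]


def group_by_flag(pairs):
    """Map each POS flag to the distinct words carrying it, in order of appearance."""
    groups = defaultdict(list)
    for word, flag in pairs:
        if word not in groups[flag]:
            groups[flag].append(word)
    return dict(groups)


groups = group_by_flag(tagged)
print(groups["v"])  # ['會', '響起', '愛', '流失']
print(groups["d"])  # ['就', '不停', '永不']
```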


## Return words with their Positions

In [50]:
word_token = jieba.tokenize('''你閉上眼睛音樂就會響起 不停的愛就永不會流失
你閉上眼睛音樂就會響起 要不停的愛''')

In [51]:
for x in word_token:
    print('word: %s \t\t start: %d \t\t end: %d' % (x[0],x[1],x[2]))

word: 你 		 start: 0 		 end: 1
word: 閉上眼睛 		 start: 1 		 end: 5
word: 音樂 		 start: 5 		 end: 7
word: 就 		 start: 7 		 end: 8
word: 會 		 start: 8 		 end: 9
word: 響起 		 start: 9 		 end: 11
word:   		 start: 11 		 end: 12
word: 不停 		 start: 12 		 end: 14
word: 的 		 start: 14 		 end: 15
word: 愛就 		 start: 15 		 end: 17
word: 永不 		 start: 17 		 end: 19
word: 會 		 start: 19 		 end: 20
word: 流失 		 start: 20 		 end: 22
word: 
 		 start: 22 		 end: 23
word: 你 		 start: 23 		 end: 24
word: 閉上眼睛 		 start: 24 		 end: 28
word: 音樂 		 start: 28 		 end: 30
word: 就 		 start: 30 		 end: 31
word: 會 		 start: 31 		 end: 32
word: 響起 		 start: 32 		 end: 34
word:   		 start: 34 		 end: 35
word: 要 		 start: 35 		 end: 36
word: 不停 		 start: 36 		 end: 38
word: 的 		 start: 38 		 end: 39
word: 愛 		 start: 39 		 end: 40
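
The start and end values are character offsets into the original string, so `text[start:end]` gives back the word. The sketch below uses that to bracket selected words in place. The helper name `mark_spans` is made up, and the `(word, start, end)` tuples are copied by hand from the output above instead of coming from a live `jieba.tokenize` call.

```python
text = "你閉上眼睛音樂就會響起 不停的愛就永不會流失"

# A few (word, start, end) tuples copied from the output above.
tokens = [("閉上眼睛", 1, 5), ("音樂", 5, 7), ("響起", 9, 11), ("流失", 20, 22)]


def mark_spans(text, tokens, targets):
    """Wrap each target word in brackets, located by its (start, end) offsets."""
    out, cursor = [], 0
    for word, start, end in tokens:
        if word in targets:
            out.append(text[cursor:start])
            out.append("[" + text[start:end] + "]")
            cursor = end
    out.append(text[cursor:])
    return "".join(out)


# The offsets index straight back into the source string.
for word, start, end in tokens:
    assert text[start:end] == word

print(mark_spans(text, tokens, {"音樂", "流失"}))
# 你閉上眼睛[音樂]就會響起 不停的愛就永不會[流失]
```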


## Extracting keywords

The built-in IDF corpus comes in handy!

In [52]:
import jieba.analyse

In [55]:
tags = jieba.analyse.extract_tags(lyrics, topK=10, withWeight=True)

In [56]:
tags

[('記住', 1.6603843754027778),
 ('雖然', 0.8301921877013889),
 ('再見', 0.33207687508055556),
 ('必須', 0.33207687508055556),
 ('眼淚', 0.33207687508055556),
 ('墜落', 0.33207687508055556),
 ('遠去', 0.33207687508055556),
 ('每個', 0.33207687508055556),
 ('分離', 0.33207687508055556),
 ('夜裡', 0.33207687508055556)]

Remember, this result is based on the pre-trained IDF values that ship with the jieba library. In practice, we might want different IDF values for different domains.

If we want to learn IDF values for a specific corpus, try scikit-learn's `sklearn.feature_extraction.text.TfidfVectorizer`, save the learned values to a file, and then load that file with `jieba.analyse.set_idf_path(file_name)`.

A similar function is available for stop words: `jieba.analyse.set_stop_words(file_name)`
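
As a rough illustration of what a corpus-specific IDF file could contain, here is a sketch in plain Python rather than scikit-learn. It uses the textbook formula `log(N / df)`, which differs from `TfidfVectorizer`'s default smoothed variant. The helper names, the tiny corpus, and the output file name `my_idf_demo.txt` are all made up, and the one `word value` pair per line layout is an assumption based on the `idf.txt` file bundled with jieba.

```python
import math
from collections import Counter


def compute_idf(tokenized_docs):
    """IDF per word: log(number of documents / number of documents containing it)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each word once per document
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def write_idf_file(path, idf):
    """Write one 'word value' pair per line (assumed layout, see note above)."""
    with open(path, "w", encoding="utf-8") as f:
        for word, value in sorted(idf.items()):
            f.write(f"{word} {value:.6f}\n")


# A tiny, already-segmented toy corpus (hypothetical documents).
docs = [["請", "記住", "我"], ["記住", "眼淚"], ["音樂", "響起"]]
idf = compute_idf(docs)
write_idf_file("my_idf_demo.txt", idf)

# Words that appear in fewer documents get a higher IDF.
print(idf["記住"] < idf["音樂"])  # True
```

The resulting file path is what would be passed to `jieba.analyse.set_idf_path`.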

* Keyword extraction using the TextRank algorithm is available as well! (TextRank was developed by Mihalcea & Tarau, 2004.)

In [60]:
jieba.analyse.textrank(lyrics, withWeight=True)

[('記住', 1.0),
 ('音樂', 0.5567601041132135),
 ('眼淚', 0.48799869757082176),
 ('擁抱', 0.487994814254722),
 ('響起', 0.3684034610780498),
 ('聽見', 0.3232401858606175),
 ('消失', 0.3232401858606175),
 ('直到', 0.25738429474895475),
 ('再見', 0.25490417641642066),
 ('墜落', 0.2536201186061362),
 ('流失', 0.20217297849281443),
 ('吉他', 0.1854785682972578),
 ('不會', 0.1854785682972578)]
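
To give a feel for how TextRank scores words, here is a simplified sketch: words that occur near each other are linked in a graph, and a PageRank-style iteration spreads score along the links. This is not jieba's implementation. The function name `textrank_sketch`, the window size, the iteration count, and the toy word list are assumptions chosen for illustration, and the sketch omits the POS filtering and score normalization a full implementation would apply.

```python
from collections import defaultdict


def textrank_sketch(words, window=2, damping=0.85, iterations=30):
    """Rank words by a PageRank-style score over a co-occurrence graph."""
    # Undirected weighted graph: words within `window` positions are linked.
    weights = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                weights[w][words[j]] += 1.0
                weights[words[j]][w] += 1.0

    scores = {w: 1.0 for w in weights}
    for _ in range(iterations):
        new_scores = {}
        for w in weights:
            # Each neighbour passes on its score in proportion to the edge weight.
            rank = sum(
                weights[v][w] / sum(weights[v].values()) * scores[v]
                for v in weights[w]
            )
            new_scores[w] = (1 - damping) + damping * rank
        scores = new_scores
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy input: "記住" sits next to every other word, so it should rank first.
toy_words = ["記住", "音樂", "記住", "眼淚", "記住", "擁抱"]
ranking = textrank_sketch(toy_words, window=1)
print(ranking[0][0])  # 記住
```

Unlike TF-IDF, this ranking needs no external IDF corpus, since it relies only on how words co-occur inside the text itself.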

Additional Reading Materials:
1. http://blogs.lessthandot.com/index.php/artificial-intelligence/automated-keyword-extraction-tf-idf-rake-and-textrank/

References:
1. https://github.com/fxsjy/jieba
2. http://blog.fukuball.com/ru-he-shi-yong-jieba-jie-ba-zhong-wen-fen-ci-cheng-shi/