# Adding a custom tokenizer to spaCy and extracting keywords 
>  This post shows how to plug a custom tokenizer into spaCy and get decent results for the extraction of keywords from texts in traditional Chinese.

- toc: true
- branch: master
- badges: true
- categories: [keyword-extraction, spacy, textacy, ckip-transformers, jieba, textrank, rake]
- image: images/keywords.png

![](https://github.com/howard-haowen/blog.ai/raw/master/images/keywords.png "Credit: Alex Hallatt")

# Intro

spaCy is an `industrial-strength natural language processing` library in Python that supports multiple human languages, including Chinese. For segmenting Chinese texts into words, spaCy uses Jieba or PKUSeg under the hood. However, neither of them beats CKIP Transformers in accuracy when it comes to traditional Chinese (see my previous [post](https://howard-haowen.github.io/blog.ai/tokenization/jieba/pkuseg/pyhanlp/snownlp/ckip-transformers/2021/01/29/Many-ways-to-segment-Chinese.html) for a comparison). So I'll show how to plug CKIP Transformers into spaCy to get the best of both.

For the purpose of demonstration, I'll situate this integration in a pipeline for extracting keywords from texts. Compared with other NLP tasks, keyword extraction is a relatively easy job. TextRank and RAKE seem to be among the most widely adopted algorithms for keyword extraction. I tried most of the methods mentioned in [this article](https://monkeylearn.com/keyword-extraction/), but there doesn't seem to be any easy-peasy implementation of TextRank or RAKE that produces decent results for traditional Chinese texts. So the first part of this post walks through a pipeline that actually works, and the second part records other methods that failed. I included the second part because I believe in this quote:

> “We learn wisdom from failure much more than from success. We often discover what will do, by finding out what will not do; and probably he who never made a mistake never made a discovery.” ― Samuel Smiles

> Note: TextRank is based on Google's PageRank, which is used to compute the rank of webpages. This [article](https://nlpforhackers.io/textrank-text-summarization/) on Natural Language Processing for Hackers demonstrates the connection between the two. From it I learned a tidbit: I always assumed that `Page` as in PageRank refers to webpages, but it turns out to be the family name of Larry Page, the creator of PageRank.

# Working pipeline

## Set variables

Let's start by defining two variables that users of our keyword extraction program might want to modify: `CUSTOM_STOPWORDS` for a list of words that users definitely want to exclude from keyword candidates and `KW_NUM` for the number of keywords that they'd like to extract from a document.


In [1]:
CUSTOM_STOPWORDS = [
                    "民眾","朋友","市民","人數", "全民","人員","人士","里民",
                    "影本","系統", "項目", "證件", "資格","公民", "對象","個人",
                    ]

KW_NUM = 10

## Preprocess texts

I took an announcement from the Land Administration Bureau of Kaohsiung City Government as a sample text, but you can basically take any text in traditional Chinese to test the program.

> Tip: To run the program with your own text, follow these steps:

1. Click on `Open in Colab` at the upper right corner of this page. 
2. Click on `File` and then `Save a copy in Drive`.  
3. Replace the following text with your own text. 
4. Click on `Runtime` and then `Run all`. 
5. Go to the section `Put it together` to see the outcome. 

In [2]:
#collapse

raw_text = '''
市府地政局109年度第4季開發區土地標售，共計推出8標9筆優質建地，訂於109年12月16日開標，合計總底價12 億4049萬6164 元。

 

第93期重劃區，原為國軍眷村，緊鄰國定古蹟-「原日本海軍鳳山無線電信所」，市府為保存古蹟同時活化眷村遷移後土地，以重劃方式整體開發，新闢住宅區、道路、公園及停車場，使本區具有歷史文化內涵與綠色休閒特色，生活機能更加健全。地政局首次推出1筆大面積土地，面積約2160坪，地形方整，雙面臨路，利於規劃興建景觀大樓，附近有市場、學校、公園及大東文化園區，距捷運大東站、鳳山國中站及鳳山火車站僅數分鐘車程，交通四通八達，因土地稀少性及區位條件絕佳，勢必成為投資人追逐焦點。

 

第87期重劃區，位於省道台1線旁，鄰近捷運南岡山站，重劃後擁有完善的道路系統、公園綠地及毗鄰醒村懷舊文化景觀建築群，具備優質居住環境及交通便捷要件，地政局一推出土地標售，即掀起搶標熱潮，本季再釋出1筆面積約93坪土地，臨20米介壽路及鵬程東路，附近有岡山文化中心、兆湘國小、公13、公14、陽明公園及劉厝公園，區位條件佳，投資人準備搶進！

 

第77期市地重劃區，位於鳳山區快速道路省道台88線旁，近中山高五甲系統交流道，近年推出土地標售皆順利完銷。本季再推出2筆土地，其中1筆面積約526坪，臨保華一路，適合商業使用；1筆面積107坪，位於代德三街，自用投資兩相宜。

 

高雄大學區段徵收區，為北高雄優質文教特區，優質居住環境，吸引投資人進駐，本季再推出2標2筆土地，其中1筆第三種商業區土地，面積約639坪，位於大學26街，近高雄大學正門及萬坪藍田公園，地形方正，使用強度高，適合興建優質住宅大樓；另1筆住三用地，面積約379坪，臨28米藍昌路，近高雄大學及中山高中，交通便捷。

 

另第37期重劃區及前大寮農地重劃區各推出1至2筆土地，價格合理。

 

第4季土地標售作業於109年12月1日公告，投資大眾可前往地政局土地開發處土地處分科索取標售海報及標單，或直接上網高雄房地產億年旺網站、地政局及土地開發處網站查詢下載相關資料，在期限前完成投標，另再提醒投標人，本年度已更新投標單格式，投標大眾請注意應以新式投標單投標以免投標無效作廢。

 

為配合防疫需求，本季開標作業除於地政局第一會議室辦理外，另將於地政局Facebook粉絲專頁同步直播，請大眾多加利用。

 

洽詢專線：(07)3373451或(07)3314942

高雄房地產億年旺網站（網址：http://eland.kcg.gov.tw/）

高雄市政府地政局網站（網址：http://landp.kcg.gov.tw/）

高雄市政府地政局土地開發處網站（網址：http://landevp.kcg.gov.tw/）　
'''
raw_text[-300:]

'及土地開發處網站查詢下載相關資料，在期限前完成投標，另再提醒投標人，本年度已更新投標單格式，投標大眾請注意應以新式投標單投標以免投標無效作廢。\n\n \n\n為配合防疫需求，本季開標作業除於地政局第一會議室辦理外，另將於地政局Facebook粉絲專頁同步直播，請大眾多加利用。\n\n \n\n洽詢專線：(07)3373451或(07)3314942\n\n高雄房地產億年旺網站（網址：http://eland.kcg.gov.tw/）\n\n高雄市政府地政局網站（網址：http://landp.kcg.gov.tw/）\n\n高雄市政府地政局土地開發處網站（網址：http://landevp.kcg.gov.tw/）\u3000\n'

I find this lightweight library [`nlp2`](https://github.com/voidful/nlp2) quite handy for text cleaning. The `clean_all` function removes URL links, HTML elements, and unused tags. 

> Note: I want to give a shoutout to [Eric Lam](https://github.com/voidful), who created `nlp2` and other useful NLP tools such as `NLPrep`, `TFkit`, and `nlp2go`.   

In [None]:
#collapse-output

!pip install nlp2
from nlp2 import clean_all

After cleaning, our sample text looks like this. Notice that all the URL links are gone now. 

In [4]:
text = clean_all(raw_text)
text[-300:]

'合理。\n\n \n\n第4季土地標售作業於109年12月1日公告，投資大眾可前往地政局土地開發處土地處分科索取標售海報及標單，或直接上網高雄房地產億年旺網站、地政局及土地開發處網站查詢下載相關資料，在期限前完成投標，另再提醒投標人，本年度已更新投標單格式，投標大眾請注意應以新式投標單投標以免投標無效作廢。\n\n \n\n為配合防疫需求，本季開標作業除於地政局第一會議室辦理外，另將於地政局Facebook粉絲專頁同步直播，請大眾多加利用。\n\n \n\n洽詢專線： 3373451或 3314942\n\n高雄房地產億年旺網站（網址： ）\n\n高雄市政府地政局網站（網址： ）\n\n高雄市政府地政局土地開發處網站（網址： ）'
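To get a feel for what the URL removal involves, here's a minimal sketch using only Python's `re` module. It is not how `nlp2` implements `clean_all`, which evidently strips more than URLs (the area codes in the phone numbers disappeared too). `URL_PATTERN` and `strip_urls` are names I made up for illustration.

```python
import re

# Assumed pattern: "http://" or "https://" followed by anything up to
# whitespace or a closing parenthesis (half-width or full-width).
URL_PATTERN = re.compile(r"https?://[^\s)）]+")


def strip_urls(text: str) -> str:
    """Replace every URL in the text with a single space."""
    return URL_PATTERN.sub(" ", text)


sample = "高雄市政府地政局網站（網址：http://landp.kcg.gov.tw/）"
print(strip_urls(sample))
```

For anything beyond a quick check, a dedicated cleaning library is probably the safer choice, since real-world URLs come in more shapes than this pattern covers.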

## Install `spacy` and `ckip-transformers`

In [None]:
#collapse-output

!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download zh_core_web_sm

In [None]:
#collapse-output

!pip install -U ckip-transformers

## Tokenize texts with `ckip-transformers`

Let's create a driver for word segmentation and one for parts of speech. CKIP Transformers also has a built-in driver for named entity recognition, i.e.  `CkipNerChunker`. But we won't use it here. 

> Tip: By default, CPU is used (`device=-1`). If you want to use GPU to speed up word segmentation, initialize `ws_driver` with a GPU device index instead, e.g. `ws_driver = CkipWordSegmenter(device=0)`.

In [None]:
#collapse-output

from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger
ws_driver  = CkipWordSegmenter()
pos_driver = CkipPosTagger()

> Important: Make sure that the input to `ws_driver()` is a list even if you're only dealing with a single text. Otherwise, words won't be properly segmented. Notice that the input to `pos_driver()` is the output of `ws_driver()`. 

In [None]:
#collapse-output

ws  = ws_driver([text])
pos = pos_driver(ws)

Here're the segmented tokens. 

In [9]:
#collapse-output

tokens = ws[0]
print(tokens)

['市府', '地政局', '109年度', '第4', '季', '開發區', '土地', '標售', '，', '共計', '推出', '8', '標', '9', '筆', '優質', '建地', '，', '訂', '於', '109年', '12月', '16日', '開標', '，', '合計', '總底價', '12 億', '4049萬', '6164 ', '元', '。', '\n\n \n\n', '第93', '期', '重劃區', '，', '原', '為', '國軍', '眷村', '，', '緊鄰', '國定', '古蹟', '-', '「', '原', '日本', '海軍', '鳳山', '無線', '電信所', '」', '，', '市府', '為', '保存', '古蹟', '同時', '活化', '眷村', '遷移', '後', '土地', '，', '以', '重劃', '方式', '整體', '開發', '，', '新', '闢', '住宅區', '、', '道路', '、', '公園', '及', '停車場', '，', '使', '本', '區', '具有', '歷史', '文化', '內涵', '與', '綠色', '休閒', '特色', '，', '生活', '機能', '更加', '健全', '。', '地政局', '首次', '推出', '1', '筆', '大', '面積', '土地', '，', '面積', '約', '2160', '坪', '，', '地形', '方整', '，', '雙面', '臨', '路', '，', '利於', '規劃', '興建', '景觀', '大樓', '，', '附近', '有', '市場', '、', '學校', '、', '公園', '及', '大東', '文化', '園區', '，', '距', '捷運', '大東站', '、', '鳳山', '國中站', '及', '鳳山', '火車站', '僅', '數', '分鐘', '車程', '，', '交通', '四通八達', '，', '因', '土地', '稀少性', '及', '區位', '條件', '絕佳', '，', '勢必', '成為', '投資人', '追逐', '焦點', '。', '\n\n \n\n',

By contrast, Jieba produced lots of wrongly segmented tokens, which is precisely why we prefer CKIP Transformers.

In [10]:
#collapse-output

import jieba
print(list(jieba.cut(text)))

Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.958 seconds.
Prefix dict has been built successfully.


['市府', '地', '政局', '109', '年度', '第', '4', '季開', '發區', '土地', '標售', '，', '共計', '推出', '8', '標', '9', '筆優質', '建地', '，', '訂', '於', '109', '年', '12', '月', '16', '日', '開標', '，', '合計', '總底價', '12', ' ', '億', '4049', '萬', '6164', ' ', '元', '。', '\n', '\n', ' ', '\n', '\n', '第', '93', '期重', '劃區', '，', '原為國', '軍', '眷村', '，', '緊鄰', '國定', '古', '蹟', '-', '「', '原', '日本海', '軍鳳山', '無線', '電信', '所', '」', '，', '市府', '為', '保存', '古', '蹟', '同時', '活化', '眷村', '遷移', '後', '土地', '，', '以', '重劃', '方式', '整體', '開發', '，', '新闢', '住宅', '區', '、', '道路', '、', '公園', '及', '停車場', '，', '使本區', '具有', '歷史', '文化', '內涵', '與', '綠色', '休閒', '特色', '，', '生活', '機能', '更加', '健全', '。', '地', '政局', '首次', '推出', '1', '筆大面積', '土地', '，', '面積', '約', '2160', '坪', '，', '地形', '方整', '，', '雙面', '臨路', '，', '利', '於', '規劃', '興建景', '觀大樓', '，', '附近', '有', '市場', '、', '學校', '、', '公園', '及', '大東', '文化', '園區', '，', '距捷', '運大東', '站', '、', '鳳山國', '中站', '及鳳', '山火', '車站', '僅數', '分鐘', '車程', '，', '交通', '四通', '八達', '，', '因', '土地', '稀少', '性及', '區位', '條件', '絕佳', '，', '勢必成

## Feed tokenized results to `spacy` using `WhitespaceTokenizer`

The [official website of spaCy](https://spacy.io/usage/linguistic-features#native-tokenizers) describes several ways of adding a custom tokenizer. The simplest is to define the  `WhitespaceTokenizer` class, which tokenizes a text on space characters. The output of tokenization can then be fed into subsequent operations down the pipeline, including `tagger` for parts-of-speech (POS) tagging, `parser` for dependency parsing, and `ner` for named entity recognition. This is possible primarily because `tokenizer` creates a `Doc` object whereas the other three steps operate on the `Doc` object, as illustrated in this graph. 
![](https://spacy.io/pipeline-fde48da9b43661abcdf62ab70a546d71.svg)

> Note: The original code for `words` is `words = text.split(" ")`, but it raised an error with my text. So I changed it to `words = text.strip().split()`.

In [11]:
from spacy.tokens import Doc

class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.strip().split()
        return Doc(self.vocab, words=words)
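The difference between the two ways of splitting is easy to see in plain Python. In the sketch below, `sample` is a short stretch of space-separated tokens similar to what we'll feed to spaCy; the variable names are mine. `split(" ")` leaves behind an empty string wherever two spaces meet and keeps the newline runs as items, while `split()` with no argument splits on any run of whitespace and drops both.

```python
sample = "6164  元 。 \n\n \n\n 第93 期"

# Splitting on a single space keeps empty strings and newline runs
by_space = sample.split(" ")

# Splitting on any whitespace run yields only the visible tokens
by_whitespace = sample.strip().split()

print(by_space)
print(by_whitespace)
```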

Next, let's load the `zh_core_web_sm` model for Chinese, which we'll need for POS tagging. Then here comes the crucial part: `nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)`. This line of code replaces the model's default Chinese tokenizer with the `WhitespaceTokenizer` that we just defined above.

In [12]:
import spacy
nlp = spacy.load('zh_core_web_sm')
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)

Then we join the tokenized result from CKIP Transformers into a single string of space-separated tokens.

In [13]:
#collapse-output

token_str = " ".join(tokens)
token_str

'市府 地政局 109年度 第4 季 開發區 土地 標售 ， 共計 推出 8 標 9 筆 優質 建地 ， 訂 於 109年 12月 16日 開標 ， 合計 總底價 12 億 4049萬 6164  元 。 \n\n \n\n 第93 期 重劃區 ， 原 為 國軍 眷村 ， 緊鄰 國定 古蹟 - 「 原 日本 海軍 鳳山 無線 電信所 」 ， 市府 為 保存 古蹟 同時 活化 眷村 遷移 後 土地 ， 以 重劃 方式 整體 開發 ， 新 闢 住宅區 、 道路 、 公園 及 停車場 ， 使 本 區 具有 歷史 文化 內涵 與 綠色 休閒 特色 ， 生活 機能 更加 健全 。 地政局 首次 推出 1 筆 大 面積 土地 ， 面積 約 2160 坪 ， 地形 方整 ， 雙面 臨 路 ， 利於 規劃 興建 景觀 大樓 ， 附近 有 市場 、 學校 、 公園 及 大東 文化 園區 ， 距 捷運 大東站 、 鳳山 國中站 及 鳳山 火車站 僅 數 分鐘 車程 ， 交通 四通八達 ， 因 土地 稀少性 及 區位 條件 絕佳 ， 勢必 成為 投資人 追逐 焦點 。 \n\n \n\n 第87 期 重劃區 ， 位於 省道 台1線 旁 ， 鄰近 捷運 南 岡山站 ， 重劃 後 擁有 完善 的 道路 系統 、 公園 綠地 及 毗鄰 醒村 懷舊 文化 景觀 建築群 ， 具備 優質 居住 環境 及 交通 便捷 要件 ， 地政局 一 推出 土地 標售 ， 即 掀起 搶標 熱潮 ， 本 季 再 釋出 1 筆 面積 約 93 坪 土地 ， 臨 20 米 介壽路 及 鵬程東路 ， 附近 有 岡山 文化 中心 、 兆湘 國小 、 公13 、 公14 、 陽明 公園 及 劉厝 公園 ， 區位 條件 佳 ， 投資人 準備 搶進 ！ \n\n \n\n 第77 期 市地 重劃區 ， 位於 鳳山區 快速 道路 省道 台88 線 旁 ， 近 中山高 五甲 系統 交流道 ， 近年 推出 土地 標售 皆 順利 完銷 。 本 季 再 推出 2 筆 土地 ， 其中 1 筆 面積 約 526 坪 ， 臨 保華一路 ， 適合 商業 使用 ； 1 筆 面積 107 坪 ， 位於 代德三街 ， 自用 投資 兩 相宜 。 \n\n \n\n 高雄 大學 區段 徵收區 ， 為 北 高雄 優質 文教 特區 ， 優質 居住 環境 ， 吸

Next, we feed `token_str`, our tokenized text, to `nlp` to create a spaCy `Doc` object. From this point on, we are able to leverage the power of spaCy. For every token in a `Doc` object, we have access to its text via the attribute `.text` and its part-of-speech label via the attribute `.pos_`.

In [36]:
#collapse-output

doc = nlp(token_str)
print([token.text for token in doc])
print([token.pos_ for token in doc])

['市府', '地政局', '109年度', '第4', '季', '開發區', '土地', '標售', '，', '共計', '推出', '8', '標', '9', '筆', '優質', '建地', '，', '訂', '於', '109年', '12月', '16日', '開標', '，', '合計', '總底價', '12', '億', '4049萬', '6164', '元', '。', '第93', '期', '重劃區', '，', '原', '為', '國軍', '眷村', '，', '緊鄰', '國定', '古蹟', '-', '「', '原', '日本', '海軍', '鳳山', '無線', '電信所', '」', '，', '市府', '為', '保存', '古蹟', '同時', '活化', '眷村', '遷移', '後', '土地', '，', '以', '重劃', '方式', '整體', '開發', '，', '新', '闢', '住宅區', '、', '道路', '、', '公園', '及', '停車場', '，', '使', '本', '區', '具有', '歷史', '文化', '內涵', '與', '綠色', '休閒', '特色', '，', '生活', '機能', '更加', '健全', '。', '地政局', '首次', '推出', '1', '筆', '大', '面積', '土地', '，', '面積', '約', '2160', '坪', '，', '地形', '方整', '，', '雙面', '臨', '路', '，', '利於', '規劃', '興建', '景觀', '大樓', '，', '附近', '有', '市場', '、', '學校', '、', '公園', '及', '大東', '文化', '園區', '，', '距', '捷運', '大東站', '、', '鳳山', '國中站', '及', '鳳山', '火車站', '僅', '數', '分鐘', '車程', '，', '交通', '四通八達', '，', '因', '土地', '稀少性', '及', '區位', '條件', '絕佳', '，', '勢必', '成為', '投資人', '追逐', '焦點', '。', '第87', '期', '重劃區', '，',
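One side effect is worth noting. Comparing this output with the CKIP tokens printed earlier, `'12 億'` has become two tokens and the whitespace-only token `'\n\n \n\n'` has vanished, because the whitespace tokenizer splits on every run of whitespace. If a one-to-one match between the two token lists matters (say, to line up CKIP's POS tags with spaCy tokens), one option is to normalize the tokens before joining them. `normalize_tokens` below is a hypothetical helper of mine, not part of either library, and the pipeline in this post doesn't use it.

```python
def normalize_tokens(tokens):
    """Hypothetical helper: remove whitespace inside tokens and drop
    tokens that consist of whitespace only."""
    cleaned = ("".join(tok.split()) for tok in tokens)
    return [tok for tok in cleaned if tok]


# A few tokens copied from the CKIP Transformers output above
ckip_tokens = ["總底價", "12 億", "4049萬", "6164 ", "元", "。", "\n\n \n\n", "第93"]

# Joining and re-splitting the raw tokens changes them
round_trip = " ".join(ckip_tokens).strip().split()
print(round_trip)

# After normalization, the round trip returns the same list
normalized = normalize_tokens(ckip_tokens)
print(" ".join(normalized).strip().split() == normalized)
```

The same filtering would have to be applied to the list of POS tags so that tokens and tags stay aligned.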

The POS tagging is made possible by the `zh_core_web_sm` model. Notice that spaCy uses coarse labels such as `NOUN` and `VERB`. By contrast, CKIP Transformers adopts a more fine-grained tagset, such as `Nc` for locative nouns and `Nd` for temporal nouns. Here're the POS labels for the same text produced by CKIP Transformers. We'll be using spaCy's POS tags to filter out words that we don't want in the candidate pool for keywords.

In [15]:
#collapse-output

pos_tags = pos[0]
print(pos_tags)

['Nc', 'Nc', 'Nd', 'Neu', 'Nd', 'Nc', 'Na', 'VC', 'COMMACATEGORY', 'VJ', 'VC', 'Neu', 'Nf', 'Neu', 'Nf', 'A', 'Na', 'COMMACATEGORY', 'VJ', 'P', 'Nd', 'Nd', 'Nd', 'VA', 'COMMACATEGORY', 'VG', 'Na', 'Neu', 'Neu', 'Neu', 'Nf', 'PERIODCATEGORY', 'WHITESPACE', 'Neu', 'Nf', 'Nc', 'COMMACATEGORY', 'D', 'VG', 'Na', 'Nc', 'COMMACATEGORY', 'VJ', 'A', 'Na', 'DASHCATEGORY', 'PARENTHESISCATEGORY', 'A', 'Nc', 'Nc', 'Nc', 'A', 'Nc', 'PARENTHESISCATEGORY', 'COMMACATEGORY', 'Nc', 'P', 'VC', 'Na', 'Nd', 'VHC', 'Nc', 'VC', 'Ng', 'Na', 'COMMACATEGORY', 'P', 'Nv', 'Na', 'Na', 'VC', 'COMMACATEGORY', 'VH', 'VC', 'Nc', 'PAUSECATEGORY', 'Na', 'PAUSECATEGORY', 'Nc', 'Caa', 'Nc', 'COMMACATEGORY', 'VL', 'Nes', 'Nc', 'VJ', 'Na', 'Na', 'Na', 'Caa', 'Na', 'Nv', 'Na', 'COMMACATEGORY', 'Na', 'Na', 'Dfa', 'VHC', 'PERIODCATEGORY', 'Nc', 'D', 'VC', 'Neu', 'Nf', 'VH', 'Na', 'Na', 'COMMACATEGORY', 'Na', 'Da', 'Neu', 'Nf', 'COMMACATEGORY', 'Na', 'VH', 'COMMACATEGORY', 'A', 'VCL', 'Na', 'COMMACATEGORY', 'VK', 'VC', 'VC', 'Na
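Since `ws_driver` and `pos_driver` return parallel lists, tokens and tags can be paired with `zip`. The sketch below uses the first eight tokens and tags from the outputs above and keeps only `Na` and `Nc`; treating `Na` as the common-noun tag is my reading of the CKIP tagset, so take it as an assumption. The pipeline in this post doesn't filter this way; it relies on spaCy's coarse labels instead.

```python
# First eight tokens and tags, copied from the outputs above
tokens = ['市府', '地政局', '109年度', '第4', '季', '開發區', '土地', '標售']
pos_tags = ['Nc', 'Nc', 'Nd', 'Neu', 'Nd', 'Nc', 'Na', 'VC']

# Assumed set of tags to keep (common nouns and locative nouns)
KEEP_TAGS = {'Na', 'Nc'}

pairs = list(zip(tokens, pos_tags))
nouns = [word for word, tag in pairs if tag in KEEP_TAGS]

print(pairs)
print(nouns)
```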

## Convert stopwords in `spaCy` from simplified to Taiwanese traditional

spaCy comes with a built-in set of stopwords (basically words that we'd like to ignore), accessible via `spacy.lang.zh.stop_words`. To make good use of it, let's convert all the words from simplified characters to traditional ones with the help of `OpenCC`. 

In [None]:
#collapse-output

!pip install OpenCC
import opencc

`OpenCC` does not just convert characters mechanically. It has the ability to convert words from simplified characters to their equivalent phrasing in Taiwan Mandarin, which is done by `s2twp.json`.

In [17]:
from spacy.lang.zh.stop_words import STOP_WORDS
converter = opencc.OpenCC('s2twp.json')
spacy_stopwords_sim = list(STOP_WORDS)
print(spacy_stopwords_sim[:5])
spacy_stopwords_tra = [converter.convert(w) for w in spacy_stopwords_sim]
print(spacy_stopwords_tra[:5])

['因为', '奇', '嘿嘿', '其次', '偏偏']
['因為', '奇', '嘿嘿', '其次', '偏偏']


## Define a class for implementing TextRank

If you're dealing with English texts, you can implement TextRank quite easily with [`textaCy`](https://github.com/chartbeat-labs/textacy), the tagline of which is `NLP, before and after spaCy`. But I couldn't get it to work for Chinese texts, so I had to implement TextRank from scratch. Luckily, I got a jump-start from this [gist](https://gist.github.com/BrambleXu/3d47bbdbd1ee4e6fc695b0ddb88cbf99), which offers a blueprint for the following definitions. 

In [18]:
#collapse

from collections import OrderedDict
import numpy as np

class TextRank4Keyword():
    """Extract keywords from text"""
    
    def __init__(self):
        self.d = 0.85 # damping coefficient, usually 0.85
        self.min_diff = 1e-5 # convergence threshold
        self.steps = 10 # maximum number of iterations
        self.node_weight = None # maps each keyword to its weight

    def set_stopwords(self, custom_stopwords):  
        """Set stop words"""
        for word in set(spacy_stopwords_tra).union(set(custom_stopwords)):
            lexeme = nlp.vocab[word]
            lexeme.is_stop = True
    
    def sentence_segment(self, doc, candidate_pos, lower):
        """Keep only the words whose POS tag is in candidate_pos"""
        sentences = []
        for sent in doc.sents:
            selected_words = []
            for token in sent:
                # Keep only non-stopword tokens with a candidate POS tag
                if token.pos_ in candidate_pos and token.is_stop is False:
                    if lower is True:
                        selected_words.append(token.text.lower())
                    else:
                        selected_words.append(token.text)
            sentences.append(selected_words)
        return sentences
        
    def get_vocab(self, sentences):
        """Get all tokens"""
        vocab = OrderedDict()
        i = 0
        for sentence in sentences:
            for word in sentence:
                if word not in vocab:
                    vocab[word] = i
                    i += 1
        return vocab
    
    def get_token_pairs(self, window_size, sentences):
        """Build token_pairs from windows in sentences"""
        token_pairs = list()
        for sentence in sentences:
            for i, word in enumerate(sentence):
                for j in range(i+1, i+window_size):
                    if j >= len(sentence):
                        break
                    pair = (word, sentence[j])
                    if pair not in token_pairs:
                        token_pairs.append(pair)
        return token_pairs
        
    def symmetrize(self, a):
        return a + a.T - np.diag(a.diagonal())
    
    def get_matrix(self, vocab, token_pairs):
        """Get normalized matrix"""
        # Build matrix
        vocab_size = len(vocab)
        g = np.zeros((vocab_size, vocab_size), dtype='float')
        for word1, word2 in token_pairs:
            i, j = vocab[word1], vocab[word2]
            g[i][j] = 1
            
        # Get symmetric matrix
        g = self.symmetrize(g)
        
        # Normalize matrix by column
        norm = np.sum(g, axis=0)
        g_norm = np.divide(g, norm, out=np.zeros_like(g), where=norm!=0) # columns that sum to 0 stay all zeros
        
        return g_norm
    
    # I revised this function to return keywords as a list
    def get_keywords(self, number=10):
        """Return the top-ranked keywords as a list"""
        node_weight = OrderedDict(sorted(self.node_weight.items(), key=lambda t: t[1], reverse=True))
        keywords = []
        for i, (key, value) in enumerate(node_weight.items()):
            keywords.append(key)
            # The check runs after appending, so up to number + 2 keywords are returned
            if i > number:
                break
        return keywords

    def analyze(self, text, 
                candidate_pos=['NOUN', 'VERB'], 
                window_size=5, lower=False, stopwords=list()):
        """Main function to analyze text"""
        
        # Set stop words
        self.set_stopwords(stopwords)

        # Parse text with spaCy
        doc = nlp(text)
        
        # Filter sentences
        sentences = self.sentence_segment(doc, candidate_pos, lower) # list of list of words
        
        # Build vocabulary
        vocab = self.get_vocab(sentences)
        
        # Get token_pairs from windows
        token_pairs = self.get_token_pairs(window_size, sentences)
        
        # Get normalized matrix
        g = self.get_matrix(vocab, token_pairs)
        
        # Initialization for weight (PageRank value)
        pr = np.array([1] * len(vocab))
        
        # Iteration
        previous_pr = 0
        for epoch in range(self.steps):
            pr = (1-self.d) + self.d * np.dot(g, pr)
            if abs(previous_pr - sum(pr))  < self.min_diff:
                break
            else:
                previous_pr = sum(pr)

        # Get weight for each node
        node_weight = dict()
        for word, index in vocab.items():
            node_weight[word] = pr[index]
        
        self.node_weight = node_weight
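To make the moving parts easier to follow, here's a stripped-down sketch of the same idea in plain Python, without NumPy or spaCy. It is a simplification, not a drop-in replacement for the class: edges are unweighted, self-loops are ignored, and it always runs a fixed number of steps. The function names and the toy sentences are made up for illustration. Each word's score is updated to `(1 - d)` plus `d` times the sum of its neighbours' scores, each divided by that neighbour's number of links, which is what the column-normalized matrix product in `analyze` computes.

```python
from collections import defaultdict


def cooccurrence_pairs(words, window_size):
    """Pair each word with the words that follow it inside the window."""
    pairs = set()
    for i, word in enumerate(words):
        for j in range(i + 1, min(i + window_size, len(words))):
            pairs.add((word, words[j]))
    return pairs


def toy_textrank(sentences, window_size=2, d=0.85, steps=10):
    """Score words on an undirected, unweighted co-occurrence graph."""
    neighbours = defaultdict(set)
    for sentence in sentences:
        for a, b in cooccurrence_pairs(sentence, window_size):
            if a != b:  # ignore self-loops in this sketch
                neighbours[a].add(b)
                neighbours[b].add(a)

    # Every word starts with a score of 1
    scores = {word: 1.0 for word in neighbours}
    for _ in range(steps):
        # The right-hand side reads the scores from the previous step
        scores = {
            word: (1 - d) + d * sum(scores[n] / len(neighbours[n]) for n in linked)
            for word, linked in neighbours.items()
        }
    return scores


# Toy input: sentences already filtered down to candidate words
sentences = [["土地", "標售", "公園"], ["土地", "面積"], ["土地", "道路"]]
scores = toy_textrank(sentences)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0])
```

In this toy graph `土地` is linked to three other words, more than any other node, so it ends up with the highest score.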

Now we can create an instance of the `TextRank4Keyword` class and call the `set_stopwords` function with our `CUSTOM_STOPWORDS` variable. This creates a set of stopwords from the union of our custom stopwords and spaCy's built-in stopwords. Only words that meet these two criteria become candidates for keywords:

*   they are **not** in the set of stopwords; 
*   their POS labels are one of those listed in `candidate_pos`, which includes `NOUN` and `VERB` by default. 



In [19]:
tr4w = TextRank4Keyword()
tr4w.set_stopwords(CUSTOM_STOPWORDS)
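The two criteria above boil down to a simple filter. Here's a minimal sketch on hand-made data: the `(word, POS)` pairs are assigned by me for illustration rather than produced by spaCy, and `stopwords` stands in for the merged stopword set.

```python
# Toy (word, POS) pairs; the tags are assigned by hand for illustration
tagged_words = [
    ("土地", "NOUN"),
    ("推出", "VERB"),
    ("的", "PART"),
    ("民眾", "NOUN"),
    ("優質", "ADJ"),
]

stopwords = {"的", "民眾"}         # stands in for the merged stopword set
candidate_pos = {"NOUN", "VERB"}  # same default as in analyze()

# Keep a word only if its POS is a candidate and it is not a stopword
candidates = [
    word
    for word, pos in tagged_words
    if pos in candidate_pos and word not in stopwords
]
print(candidates)
```

Here `民眾` is dropped even though it is tagged as a noun, because it is one of the custom stopwords.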

## Put it together

Let's put it all together by defining a main function for keyword extraction.

In [20]:
def extract_keys_from_str(raw_text):
  text = clean_all(raw_text) #clean the raw text
  ws  = ws_driver([text]) #tokenize the text with CKIP Transformers
  tokenized_text = " ".join(ws[0]) #join a list into a string 
  tr4w.analyze(tokenized_text) #create a spaCy Doc object with the string and calculate weights for words
  keys = tr4w.get_keywords(KW_NUM) #get top 10 keywords, as set by the KW_NUM variable
  return keys

Here're the top ten keywords for our sample text. The results are quite satisfactory. 

In [21]:
keys = extract_keys_from_str(raw_text)
keys = [k for k in keys if len(k) > 1]
keys

Tokenization: 100%|██████████| 1/1 [00:00<00:00, 221.73it/s]
Inference: 100%|██████████| 1/1 [00:05<00:00,  5.20s/it]


['土地', '公園', '地政局', '文化', '推出', '面積', '標售', '道路', '優質', '投標']

As a comparison, here're the top 10 keywords produced by Jieba's implementation of TextRank, 7 of which are identical to the list above. Although extracting keywords with Jieba is quick and easy, it tends to give rise to wrongly segmented tokens, such as `政局` in this example, which should have been `地政局` for Land Administration Bureau.

In [22]:
import jieba.analyse as KE
jieba_kw = KE.textrank(text, topK=10)
jieba_kw

['土地', '政局', '投標', '公園', '投資', '標售', '文化', '開發', '優質', '推出']
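The overlap between the two lists can be checked with a few list comprehensions. The keyword lists below are copied from the two outputs above; the variable names are mine.

```python
custom_kw = ['土地', '公園', '地政局', '文化', '推出', '面積', '標售', '道路', '優質', '投標']
jieba_kw = ['土地', '政局', '投標', '公園', '投資', '標售', '文化', '開發', '優質', '推出']

# Keywords found by both methods, and those unique to each
shared = [kw for kw in custom_kw if kw in jieba_kw]
only_custom = [kw for kw in custom_kw if kw not in jieba_kw]
only_jieba = [kw for kw in jieba_kw if kw not in custom_kw]

print(len(shared), shared)
print(only_custom)
print(only_jieba)
```

Seven keywords are shared, and the wrongly segmented `政局` shows up among the three that only Jieba produced.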

# Other libraries that failed

## [`textaCy`](https://github.com/chartbeat-labs/textacy)

In [None]:
#collapse-output

!pip install textacy

With textaCy, you can load a spaCy language model and then create a spaCy `Doc` object using that model. 

In [24]:
import textacy
zh = textacy.load_spacy_lang("zh_core_web_sm")
doc = textacy.make_spacy_doc(text, lang=zh)
doc._.preview

'Doc(612 tokens: "市府地政局109年度第4季開發區土地標售，共計推出8標9筆優質建地，訂於109年12月16日開...")'

textaCy implements four algorithms for keyword extraction, including TextRank. But I got useless results by calling the  `textacy.ke.textrank` function with `doc`.

In [25]:
import textacy.ke as ke
ke.textrank(doc)

[('     ', 6.0)]

## [`pyate`](https://pypi.org/project/pyate/)

In [None]:
#collapse-output

!pip install pyate

`pyate` has a built-in `TermExtractionPipeline` class for extracting keywords, which can be added to spaCy's pipeline. But it didn't work and this error message showed up: `TypeError: load() got an unexpected keyword argument 'parser'`.

In [27]:
#collapse-output

from pyate.term_extraction_pipeline import TermExtractionPipeline
nlp.add_pipe(TermExtractionPipeline())

TypeError: ignored

I found on the documentation page that `pyate` only supports English and Italian, which may account for the error I got.  

## [`pytextrank`](https://pypi.org/project/pytextrank/)

In [None]:
#collapse-output

!pip install pytextrank

To add TextRank to the spaCy pipeline, I followed the [instructions](https://spacy.io/universe/project/spacy-pytextrank) found on spaCy's documentation. But an error popped up. Luckily, `ValueError` offers possible ways to fix the problem.

In [29]:
#collapse-output

import pytextrank
tr = pytextrank.TextRank()
nlp.add_pipe(tr.PipelineComponent, name='textrank', last=True)

ValueError: ignored

So I used the `@Language.factory` decorator to define a TextRank component, and then called the `nlp.add_pipe` function with `textrank`. But this didn't work either. The error message reads: `'Chinese' object has no attribute 'sents'`. 

In [30]:
from spacy.language import Language

tr = pytextrank.TextRank()

@Language.factory("textrank")
def create_textrank_component(nlp: Language, name: str):
    return tr.PipelineComponent(nlp)

In [31]:
#collapse-output

nlp.add_pipe('textrank')

AttributeError: ignored

## [`rake-spacy`](https://pypi.org/project/rake-spacy/)

I couldn't even install `rake-spacy`.

In [32]:
!pip install rake-spacy

ERROR: Could not find a version that satisfies the requirement rake-spacy
ERROR: No matching distribution found for rake-spacy


## [`rake-keyword`](https://pypi.org/project/rake-keyword/)

In [None]:
#collapse-output

!pip install rake-keyword

According to the [documentation on PyPI](https://pypi.org/project/rake-keyword/), the import is done by `from rake import Rake`, but it didn't work.

In [34]:
#collapse-output

from rake import Rake

ImportError: ignored

However, based on the [documentation on GitHub](https://github.com/u-prashant/RAKE), this is done by `from rake import RAKE` instead. But it didn't work either.  


In [35]:
#collapse-output

from rake import RAKE

ImportError: ignored

# Recap

Integrating CKIP Transformers with spaCy and the TextRank algorithm generates decent results for extracting keywords from texts in traditional Chinese. Although there are many Python libraries out there that implement TextRank, none of the ones I tried worked better than the `TextRank4Keyword` class crafted from scratch. Until I figure out how to properly add the TextRank component to the spaCy pipeline, I'll stick with the working pipeline shown here. As a final thought, spaCy recently released v3.0, which supports pretrained transformer models. I can't wait to give it a try and see how this would change the workflow of extracting keywords or other NLP tasks. But that'll have to wait until the next post.