# Based entirely on [Word2Vec-以 gensim 訓練中文詞向量](https://www.kaggle.com/code/bbqlp33/word2vec-gensim/notebook) (Word2Vec: training Chinese word vectors with gensim) by [HONGTW](https://www.kaggle.com/bbqlp33)

In [1]:
# Install zhconv for Simplified-to-Traditional Chinese conversion
!pip install zhconv

Collecting zhconv
  Downloading zhconv-1.4.3.tar.gz (211 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: zhconv
  Building wheel for zhconv (setup.py) ... done
  Created wheel for zhconv: filename=zhconv-1.4.3-py2.py3-none-any.whl size=208851 sha256=4b7f0b0ca5775ef0a41994ed28c261758ada87db2dc87a50f83dec6eb751d9a2
  Stored in directory: /root/.cache/pip/wheels/68/73/ff/95fe3e7b41a545b9701416c2178b920713b33022c3d605bdb4
Successfully built zhconv
Installing collected packages: zhconv
Successfully installed zhconv-1.4.3


## Data download
*   [Wikipedia dumps](https://dumps.wikimedia.org/zhwiki/latest/)
*   [zhwiki-latest-abstract-zh-tw3.xml.gz](https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-abstract-zh-tw3.xml.gz)
*   [wget](https://eternallybored.org/misc/wget/1.21.4/64/wget.exe) (Windows binary)

In [None]:
!wget.exe "https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-abstract-zh-tw3.xml.gz"

In [3]:
from google.colab import drive
drive.mount('/content/drive')

import gensim
import jieba
import zhconv
from gensim.corpora import WikiCorpus
from datetime import datetime as dt
from typing import List

jieba.set_dictionary('/content/drive/MyDrive/Colab Notebooks/2024.01.05/model/dict.txt.big')
print("gensim", gensim.__version__)
print("jieba", jieba.__version__)

Mounted at /content/drive
gensim 4.3.2
jieba 0.42.1


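# 1. Chinese text preprocessing
Before training Word2Vec, the text has to be preprocessed. This notebook applies the three steps below (in practice, different use cases may call for a different pipeline):

*   Simplified-to-Traditional conversion: zhconv
*   Chinese word segmentation: jieba
*   Stop words

## Simplified/Traditional conversion
The wiki text mixes Simplified and Traditional Chinese, for example「数学」and「數學」, which word2vec would treat as two different words. [1]
So a Simplified-to-Traditional conversion step is needed before segmentation.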

In [4]:
zhconv.convert("这原本是一段简体中文", "zh-tw")

'這原本是一段簡體中文'

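## Chinese word segmentation
Segmentation uses jieba.cut. jieba offers two modes that matter here:

*   cut_all=False, precise mode: tries to cut the sentence as accurately as possible; suited to text analysis.
*   cut_all=True, full mode: scans out every string in the sentence that can form a word; very fast, but cannot resolve ambiguity.

This notebook trains on text segmented in precise mode (cut_all=False).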
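To make the difference between the two modes concrete, here is a toy sketch in plain Python. It is not jieba's algorithm (jieba works from a large frequency dictionary and a word graph); `TOY_DICT`, `full_mode`, and `precise_mode` are hypothetical names, and greedy longest match stands in for precise mode only for illustration.

```python
# Toy illustration only: TOY_DICT and both helpers are hypothetical and much
# simpler than jieba's real dictionary-and-graph based segmentation.
TOY_DICT = {"中華", "中華電信", "華電", "電信", "臺北", "板橋"}
MAX_WORD_LEN = max(len(w) for w in TOY_DICT)


def full_mode(text):
    """Return every dictionary word found at any position (overlaps allowed)."""
    found = []
    for start in range(len(text)):
        last_end = min(len(text), start + MAX_WORD_LEN)
        for end in range(start + 1, last_end + 1):
            if text[start:end] in TOY_DICT:
                found.append(text[start:end])
    return found


def precise_mode(text):
    """Greedy longest match: a single non-overlapping segmentation."""
    tokens, pos = [], 0
    while pos < len(text):
        for length in range(min(MAX_WORD_LEN, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            # Fall back to a single character when no dictionary word matches.
            if length == 1 or candidate in TOY_DICT:
                tokens.append(candidate)
                pos += length
                break
    return tokens


full_tokens = full_mode("臺北板橋中華電信")
precise_tokens = precise_mode("臺北板橋中華電信")
print("Full:", "/ ".join(full_tokens))
print("Precise:", "/ ".join(precise_tokens))
```

Full mode reports overlapping candidates such as 中華 and 華電, while the single segmentation keeps only 中華電信, which is why the latter is the better fit for building training sentences.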

In [5]:
seg_list = jieba.cut("我来到臺北板橋中華電信", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("我来到臺北板橋中華電信", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # precise mode

Building prefix dict from /content/drive/MyDrive/Colab Notebooks/2024.01.05/model/dict.txt.big ...
Dumping model to file cache /tmp/jieba.u37c832946ddec8ae58f75c2e8af3e98f.cache
Loading model cost 2.273 seconds.
Prefix dict has been built successfully.


Full Mode: 我/ 来到/ 臺北/ 板橋/ 中華/ 中華電信/ 華電/ 電信
Default Mode: 我/ 来到/ 臺北/ 板橋/ 中華電信


In [6]:
print(list(jieba.cut("中英夾雜的example，Word2Vec應該很interesting吧?")))

['中', '英', '夾雜', '的', 'example', '，', 'Word2Vec', '應該', '很', 'interesting', '吧', '?']


## Adding a stop word list
Stop words are words such as the, a, this in English, or 你, 我, 他 (you, I, he) in Chinese: compared with other words they carry little weight and say little about a document's topic.
Whether to use a stop word list depends on the application; in some cases keeping the stop words serves the goal better. [1](http://zake7749.github.io/2016/08/28/word2vec-with-gensim/)
*   [Is it compulsory to remove stop words with word2vec?](https://www.quora.com/Is-it-compulsory-to-remove-stop-words-with-word2vec)
*   [The Effect of Stopword Filtering prior to Word Embedding Training](https://stats.stackexchange.com/questions/201372/the-effect-of-stopword-filtering-prior-to-word-embedding-training)
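
One reason the choice matters is that Word2Vec learns from words that co-occur inside a sliding context window, so dropping stop words changes which words become neighbors. Below is a minimal sketch, assuming a fixed window of 1 and a hypothetical two-word stop list. gensim's real sampling also shrinks windows randomly and downsamples frequent words, which this sketch ignores; `context_pairs` and `TOY_STOPWORDS` are illustrative names.

```python
def context_pairs(tokens, window=1):
    """List (center, context) pairs that fall within a fixed-size window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


# Hypothetical stop list and pre-segmented sentence, for illustration only.
TOY_STOPWORDS = {"的", "是"}
tokens = ["幾何", "是", "數學", "的", "分支"]
filtered = [t for t in tokens if t not in TOY_STOPWORDS]

pairs_raw = context_pairs(tokens)
pairs_filtered = context_pairs(filtered)

print(("幾何", "數學") in pairs_raw)       # False: a stop word sits in between
print(("幾何", "數學") in pairs_filtered)  # True: the content words are now adjacent
```

With the stop words kept, half of the training pairs here involve 是 or 的 as the center word; without them, the content words train directly against each other. Whether that helps depends on the downstream task.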

In [None]:
#!pip install spacy --user

In [7]:
import spacy

# Download the language models
spacy.cli.download("zh_core_web_sm")  # download the spaCy Chinese model
spacy.cli.download("en_core_web_sm")  # download the spaCy English model

nlp_zh = spacy.load("zh_core_web_sm") # load the spaCy Chinese model
nlp_en = spacy.load("en_core_web_sm") # load the spaCy English model

# Print 20 stop words from each list (sets are unordered, so the sample varies)
print('--\n')
print(f"Chinese stop words Total={len(nlp_zh.Defaults.stop_words)}: {list(nlp_zh.Defaults.stop_words)[:20]} ...")
print("--")
print(f"English stop words Total={len(nlp_en.Defaults.stop_words)}: {list(nlp_en.Defaults.stop_words)[:20]} ...")

✔ Download and installation successful
You can now load the package via spacy.load('zh_core_web_sm')
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
--

Chinese stop words Total=1891: ['且不说', '己', '曾经', '极其', '起初', '来讲', '呜', '来着', '［③⑩］', '哼唷', '理当', '遵照', '呼啦', '每个', '据悉', '⑥', '即令', '极', '除外', '不会'] ...
--
English stop words Total=326: ['using', 'is', 'a', 'ca', 'while', 'used', 'thus', 'formerly', 'mine', 'serious', '‘ve', 'even', '‘ll', 'somewhere', 'no', 'thereby', 'me', 'thence', '‘re', 'hereafter'] ...


In [None]:
STOPWORDS =  nlp_zh.Defaults.stop_words | \
             nlp_en.Defaults.stop_words | \
             set(["\n", "\r\n", "\t", " ", ""])
print(len(STOPWORDS))

# Convert the Simplified stop words to Traditional to extend the stop word list
for word in STOPWORDS.copy():
    STOPWORDS.add(zhconv.convert(word, "zh-tw"))

print(len(STOPWORDS))

2222
3005


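# Reading the wiki corpus with preprocessing and segmentation
Once the Wikipedia dump (wiki.xml.bz2) is downloaded, do not rush to decompress it. It is an XML file full of markup tags that have to be stripped first. gensim already handles this: by calling [WikiCorpus](https://radimrehurek.com/gensim/corpora/wikicorpus.html), we can easily extract just the article titles and content. [1](http://zake7749.github.io/2016/08/28/word2vec-with-gensim/)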
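
WikiCorpus is built around the bz2-compressed article dump. If the input is instead an uncompressed abstract XML file, one possible alternative is to stream it with the standard library. The sketch below assumes the abstract dump wraps each entry in a `<doc>` element with `<title>` and `<abstract>` children; that layout, the sample data, and the helper name `iter_abstracts` are assumptions to check against the real file.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical two-entry sample mimicking the assumed abstract-dump layout.
SAMPLE_XML = """<feed>
<doc><title>數學</title><abstract>數學是研究數量、結構以及空間的學科。</abstract></doc>
<doc><title>哲學</title><abstract>哲學是研究普遍的、基本問題的學科。</abstract></doc>
</feed>""".encode("utf-8")


def iter_abstracts(source):
    """Yield (title, abstract) pairs, releasing each <doc> element after use."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "doc":
            title = (elem.findtext("title") or "").strip()
            abstract = (elem.findtext("abstract") or "").strip()
            yield title, abstract
            elem.clear()  # drop the element's children to limit memory use


records = list(iter_abstracts(io.BytesIO(SAMPLE_XML)))
for title, abstract in records:
    print(title, "->", abstract)
```

Each abstract string could then be passed through the same tokenizer used for WikiCorpus, so both input paths yield lists of tokens.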

In [None]:
### Text processing (segmentation + Simplified-to-Traditional conversion + stop words)
def preprocess_and_tokenize(
    text: str, token_min_len: int=1, token_max_len: int=15, lower: bool=True) -> List[str]:
    if lower:
        text  = text.lower()
    text = zhconv.convert(text, "zh-tw")
    return [
        token for token in jieba.cut(text, cut_all=False)
        if token_min_len <= len(token) <= token_max_len and \
            token not in STOPWORDS
    ]

In [None]:
print(preprocess_and_tokenize("歐幾里得，西元前三世紀的古希臘數學家，現在被認為是幾何之父，此畫為拉斐爾"))
print(preprocess_and_tokenize("我来到臺北板橋中華電信"))
print(preprocess_and_tokenize("the 中英夾雜的example ennn... ，Word2Vec應該很interesting吧?, right?"))

['歐幾', '裡得', '西元前', '世紀', '古希臘', '數學家', '幾何', '父', '此畫', '拉斐爾']
['來到', '臺北', '板橋', '中華電信']
['中', '英', '夾雜', 'example', 'ennn', 'word2vec', 'interesting', 'right']


In [None]:
ZhWiki = "0zhwiki-latest-abstract-zh-tw3.xml"
print(f"Parsing {ZhWiki}...")
wiki_corpus = WikiCorpus(ZhWiki, tokenizer_func=preprocess_and_tokenize, token_min_len=1)

In [None]:
g = wiki_corpus.get_texts()
print(next(g)[:10])
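
The training step reads a pre-segmented text file with one article per line and tokens separated by spaces, which is the format gensim's LineSentence expects. The notebook does not show how that file is produced; the sketch below is one plausible way, using a hypothetical helper `write_segmented_corpus` and a small stand-in list in place of `wiki_corpus.get_texts()`.

```python
import os
import tempfile


def write_segmented_corpus(texts, path):
    """Write each tokenized article as one space-separated line; return the count."""
    count = 0
    with open(path, "w", encoding="utf-8") as fout:
        for tokens in texts:
            fout.write(" ".join(tokens) + "\n")
            count += 1
    return count


# Stand-in for wiki_corpus.get_texts(): any iterable of token lists works.
demo_texts = [["歐幾里得", "古希臘", "數學家"], ["臺北", "板橋", "中華電信"]]
demo_path = os.path.join(tempfile.mkdtemp(), "wiki_seg.txt")
n_written = write_segmented_corpus(demo_texts, demo_path)
print(f"Wrote {n_written} articles to {demo_path}")
```

In the notebook itself the call would presumably look like `write_segmented_corpus(wiki_corpus.get_texts(), "data/wiki_seg.txt")`, with the `data` directory created beforehand.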

# Training Word2Vec

In [None]:
from gensim.models import word2vec
import multiprocessing

max_cpu_counts = multiprocessing.cpu_count()
word_dim_size = 300  # word vector dimensionality
print(f"Use {max_cpu_counts} workers to train Word2Vec (dim={word_dim_size})")
WIKI_SEG_TXT = "data/wiki_seg.txt"

# Read the training sentences (one segmented article per line)
sentences = word2vec.LineSentence(WIKI_SEG_TXT)

# Train the model (gensim 4.x names the dimension argument vector_size)
model = word2vec.Word2Vec(sentences, vector_size=word_dim_size, workers=max_cpu_counts)

# Save the model
output_model = f"word2vec.zh.{word_dim_size}.model"
model.save(output_model)
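
Once a model is trained, similarity queries such as gensim's `model.wv.most_similar` rank words by the cosine similarity of their vectors. The sketch below reproduces that ranking in plain Python on made-up 3-dimensional vectors; `toy_vectors`, `cosine_similarity`, and `rank_similar` are hypothetical and only illustrate the computation, not the trained model's actual values.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up vectors; a real model would supply 300-dimensional ones.
toy_vectors = {
    "數學": [0.9, 0.1, 0.0],
    "幾何": [0.8, 0.3, 0.1],
    "電信": [0.0, 0.2, 0.9],
}

ranking = rank_similar("數學", toy_vectors)
for other, score in ranking:
    print(f"{other}\t{score:.3f}")
```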