# Train a Cantonese Embedding using Wikipedia data

## 1. Download Wikipedia Dump (zh-yue)
- The Wikipedia database dump ver. 20200301: https://dumps.wikimedia.org/zh_yuewiki/20200301/
- The dump we are using here is zh_yuewiki-20200301-pages-articles-multistream.xml.bz2 (57.8MB)


In [10]:
DUMP_FILE = "zh_yuewiki-20200301-pages-articles-multistream.xml.bz2"
!wget https://dumps.wikimedia.org/zh_yuewiki/20200301/$DUMP_FILE

--2020-03-21 00:01:57--  https://dumps.wikimedia.org/zh_yuewiki/20200301/zh_yuewiki-20200301-pages-articles-multistream.xml.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.7, 2620:0:861:1:208:80:154:7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60625083 (58M) [application/octet-stream]
Saving to: ‘zh_yuewiki-20200301-pages-articles-multistream.xml.bz2.1’


2020-03-21 00:02:20 (2.50 MB/s) - ‘zh_yuewiki-20200301-pages-articles-multistream.xml.bz2.1’ saved [60625083/60625083]
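
In the log above, wget saved the file as `...bz2.1`, which is what wget does when a file with the target name already exists, so re-running the cell downloads the dump a second time. A minimal sketch of a download-if-missing helper using only the standard library; the `download_if_missing` name and the `DUMP_URL` variable are hypothetical and not part of the notebook:

```python
import os
import urllib.request


def download_if_missing(url, dest_path):
    """Download `url` to `dest_path` unless the file is already there.

    Returns True if a download happened, False if it was skipped.
    """
    if os.path.exists(dest_path):
        return False
    urllib.request.urlretrieve(url, dest_path)
    return True


DUMP_FILE = "zh_yuewiki-20200301-pages-articles-multistream.xml.bz2"
DUMP_URL = "https://dumps.wikimedia.org/zh_yuewiki/20200301/" + DUMP_FILE
# download_if_missing(DUMP_URL, DUMP_FILE)  # uncomment to fetch the dump
```

The helper only checks that the file exists; it does not verify that an earlier download completed.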



<br><br>

## 2. Download <a href="https://github.com/attardi/wikiextractor" target="_blank">WikiExtractor</a>

- ```WikiExtractor.py``` is a script that extracts and cleans text from a Wikipedia database dump.



In [9]:
!wget https://github.com/attardi/wikiextractor/archive/master.zip
!unzip master.zip

--2020-03-20 23:58:03--  https://github.com/attardi/wikiextractor/archive/master.zip
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/attardi/wikiextractor/zip/master [following]
--2020-03-20 23:58:03--  https://codeload.github.com/attardi/wikiextractor/zip/master
Resolving codeload.github.com (codeload.github.com)... 140.82.113.10
Connecting to codeload.github.com (codeload.github.com)|140.82.113.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘master.zip.1’

master.zip.1            [ <=>                ] 249.29K  --.-KB/s    in 0.1s    

2020-03-20 23:58:04 (1.70 MB/s) - ‘master.zip.1’ saved [255270]

Archive:  master.zip
16186e290d9eb0eb3a3784c6c0635a9ed7e855c3
   creating: wikiextractor-master/
  inflating: wikiextractor-master/.gitignore  
  inflating: wik

<br><br>

## 3. Extract wiki articles from the dump <br>
``` WikiExtractor.py -o <OUTPUT directory> --json -b 500K <path-to-your-dump-file> ```
<br> <br>
- `-o <OUTPUT directory>` specifies the output directory
- `--json` writes the output in JSON format instead of the default format
- `-b 500K` specifies the maximum bytes per output file (default 1M)
- `<path-to-your-dump-file>` is the path of the .xml.bz2 file


In [None]:
OUTPUT_DIR = "./wiki_extracts"

!wikiextractor-master/WikiExtractor.py -o $OUTPUT_DIR --json -b 500K $DUMP_FILE

- Example Log Messages when finished:

*INFO: Finished 7-process extraction of 77685 articles in 41.1s (1888.5 art/s) <br>
INFO: total of page: 114455, total of articl page: 77685; total of used articl page: 77685*

- Example extracted file ("wiki_00"):

```
{"url": "https://zh-yue.wikipedia.org/wiki?curid=1", "text": "頭版/2013\n", "id": "1", "title": "頭版/2013"}
{"url": "https://zh-yue.wikipedia.org/wiki?curid=2", "text": "香港\n\n香港（，），係華南一城，亦係一埠，譽為國際大都會。...... 常見嘅街頭小食有雞蛋仔、蛋撻、咖喱魚蛋、燒賣、格仔餅等等，嘢飲就有絲襪奶茶，珍珠奶茶等等。\n", "id": "2", "title": "香港"}
......
```

- Each line in the file is one article stored as a JSON object.
- An extracted file contains several articles; how many depends on the maximum bytes per output file set with `-b`.
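
Because every line is a standalone JSON object, a file can be parsed one line at a time. A minimal sketch using two lines in the same shape as the `wiki_00` example above (the second one is shortened to its first sentence); `parse_articles` is a hypothetical helper name:

```python
import json

# Two lines in the same shape as the "wiki_00" example (second text shortened).
# Raw strings keep the JSON escape "\n" intact until json.loads decodes it.
SAMPLE_LINES = [
    r'{"url": "https://zh-yue.wikipedia.org/wiki?curid=1", "text": "頭版/2013\n", "id": "1", "title": "頭版/2013"}',
    r'{"url": "https://zh-yue.wikipedia.org/wiki?curid=2", "text": "香港\n\n香港（，），係華南一城，亦係一埠，譽為國際大都會。\n", "id": "2", "title": "香港"}',
]


def parse_articles(lines):
    """Yield one dict per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


articles = list(parse_articles(SAMPLE_LINES))
for article in articles:
    print(article["id"], article["title"], len(article["text"]))
```

Each parsed article exposes the `url`, `text`, `id` and `title` keys seen in the example.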

In [41]:
print("=== Printing the first line of an example extracted file...\n")

OUTPUT_SUB_DIR = OUTPUT_DIR + "/AA"
!head -n 1 $OUTPUT_SUB_DIR/wiki_00

=== Printing the first line of an example extracted file...

{"text": "頭版/2013\n", "id": "1", "url": "https://zh-yue.wikipedia.org/wiki?curid=1", "title": "頭版/2013"}


<br><br>

## 4. Prepare the Training Corpus

- Gensim's word2vec expects a sequence of sentences as its input. Each sentence should be a list of tokens.

- To prepare the training corpus, we cut the article text into sentences and then tokenize each sentence.

- *Note: Jieba tokenization can be customized*
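
The sentence-cutting step can be sketched on its own with the standard library. This variant is an assumption rather than the notebook's method: it splits on `。`, `！` and `？` and keeps the actual terminator, whereas the cell below splits on `。` only. `split_sentences` is a hypothetical helper, and the sample text is made up for illustration:

```python
import re

# One sentence = a run of non-terminator characters plus an optional terminator.
# The choice of terminators (。！？) is an assumption for this sketch.
SENTENCE_PATTERN = re.compile(r"[^。！？]+[。！？]?")


def split_sentences(text):
    """Split text into sentences, keeping each terminator with its sentence."""
    sentences = []
    for line in text.split("\n"):
        for part in SENTENCE_PATTERN.findall(line):
            part = part.strip()
            if part:
                sentences.append(part)
    return sentences


# Made-up sample text, for illustration only.
sample = "標題\n今日好熱。你食咗飯未？食咗！"
for sentence in split_sentences(sample):
    print(sentence)
```

Each resulting sentence would then be tokenized, for example with `jieba.cut` as in the cell below. Jieba's `add_word` and `load_userdict` functions are one way to register Cantonese-specific terms before tokenizing.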

In [37]:
import os
import json
import jieba

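def read_wiki(file_name):
    ''' Read the JSON objects (one article per line) from an extracted wiki file '''
    data = []
    # The extracted files are UTF-8; pass the encoding explicitly so the
    # platform default is not used.
    with open(os.path.join(OUTPUT_SUB_DIR, file_name), 'r', encoding='utf-8') as f:
        for json_line in f:
            data.append(json.loads(json_line))
    return data

def read_wiki_sentences(file_name, verbose=True):
    ''' Very simple way to cut the wiki article into sentences,
        followed by jieba tokenization.
        `verbose` is currently unused. '''
    json_list = read_wiki(file_name)
    for json_obj in json_list:
        # Split into lines first, then split each line on the full stop.
        for line in json_obj['text'].split('\n'):
            for sentence in line.split('。'):
                if sentence:
                    # '。' is appended to every piece, including titles and
                    # trailing fragments that had no full stop.
                    yield list(jieba.cut(sentence, cut_all=False)) + ['。']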

print("=== Printing first few sentences:")
for ind, sentence in enumerate(read_wiki_sentences("wiki_00")):
    print(ind, "|", " ".join(sentence))
    if ind > 10: break

=== Printing first few sentences:
0 | 頭版 / 2013 。
1 | 香港 。
2 | 香港 （ ， ） ， 係 華南 一城 ， 亦 係 一埠 ， 譽為 國際 大都 會 。
3 | 香港 建 於 1841 年 ， 乃 百年 之 城 ， 曾經 係 英國 人 嘅 殖民地 ， 1997 年 開始 由 中華 人民 共和 國 接管 ， 成為 特別 行政 區 。
4 | 按 《 中英 聯合 聲明 》 同 《 香港基本法 》 ， 香港 係 高度 自治 嘅 地方 ， 除 外交 同防務 ， 全面 自治 ， 有 參 與 國際 組織 權利 ， 有別 於 中國內 地體制 ， 所以 官方 稱 之 爲 一國 兩制 。
5 | 香港 三面 環海 ， 東同南 係 南 中國海 ， 西 珠江口 係 零丁洋 ， 東北 係 大鵬灣 ， 北面 同中 國大陸 隔 住條 深圳河 。
6 | 香港 有 唔 少島 ， 散 佈 東南西 。
7 | 因為 瀕海 ， 水路 發達 ， 由 太平洋 到 印度洋 ， 各國 洋船 必經 之路 。
8 | 空運 亦 係 南洋 同中 國 嘅 樞 紐 ， 全球 每日 都 有 唔 少 飛機 ， 途經 香港 。
9 | 香港 同紐 約 、 倫敦合稱 「 紐倫港 」 。
10 | 香港 有 幾個 別名 ， 好似 香江 （ ） 、 東方 之珠 （ ） 噉 。
11 | 廣東話 裏 面 ， 香江 係 香港 訛音 ， 亦 有 指維 多利 亞港 似 江水 ， 橫 臥 喺 香港 島同 九龍 半島 之間 。


<br><br>

## 5. Train the word embedding using Gensim's word2vec

- Modified from <a href="https://rare-technologies.com/word2vec-tutorial/"> this tutorial </a>.

### Training Configuration

- If `TRAIN_ALL = True`, all extracted files in `OUTPUT_SUB_DIR` are used for training (50 epochs); otherwise only `wiki_00` is used, for a single epoch.

In [54]:
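MODEL_NAME = "wiki_canto"   # prefix for the saved model files
EMBEDDING_SIZE = 300        # dimensionality of the word vectors
MIN_COUNT = 5               # ignore words that occur fewer than 5 times


TRAIN_ALL = False

if TRAIN_ALL:
    # Collect every file name found under OUTPUT_SUB_DIR (the "AA" directory only).
    FILES = []
    for (_, _, filenames) in os.walk(OUTPUT_SUB_DIR):
        FILES.extend(filenames)
    ITER = 50  # number of training epochs

else:
    # Quick test run: a single file and a single epoch.
    FILES = ["wiki_00"]
    ITER = 1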
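
The `os.walk` loop above only looks inside `OUTPUT_SUB_DIR` (`AA`). Assuming WikiExtractor spreads a large extraction over several subdirectories (`AA`, `AB`, ...), which this notebook does not show, a sketch for collecting every extracted file could look like the following. `list_extracted_files` is a hypothetical helper, and `read_wiki` would need to join these paths with `OUTPUT_DIR` instead of `OUTPUT_SUB_DIR` to use them:

```python
import os


def list_extracted_files(output_dir):
    """Return the paths of all wiki_* files, relative to output_dir, sorted."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(output_dir):
        for name in filenames:
            if name.startswith("wiki_"):
                full_path = os.path.join(dirpath, name)
                paths.append(os.path.relpath(full_path, output_dir))
    return sorted(paths)


# Hypothetical usage with the notebook's OUTPUT_DIR:
# FILES = list_extracted_files("./wiki_extracts")
```

Returning relative paths such as `AA/wiki_00` keeps the file list independent of where the output directory lives.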

### Start Training

In [51]:
from gensim.models import Word2Vec
import time
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


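class CantoSentences(object):
    ''' Restartable iterable over the tokenized sentences of every file in FILES '''
    def __init__(self):
        pass

    def __iter__(self):
        for file_name in FILES:
            for sentence in read_wiki_sentences(file_name):
                yield sentence

st = time.time()

corpus = CantoSentences()
# `size` and `iter` are the gensim 3.x parameter names; gensim 4.x renamed them
# to `vector_size` and `epochs`.
model = Word2Vec(corpus, size=EMBEDDING_SIZE, min_count=MIN_COUNT, iter=ITER)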

print("Training time: %s mins." % ((time.time() - st) / 60))

model.save("%s.model" % MODEL_NAME)
model.wv.save_word2vec_format('%s.bin' % MODEL_NAME, binary=True)
print("%s saved." % MODEL_NAME)

2020-03-21 00:30:29,414 : INFO : collecting all words and their counts
2020-03-21 00:30:29,422 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-03-21 00:30:30,871 : INFO : collected 18627 word types from a corpus of 113134 raw words and 4224 sentences
2020-03-21 00:30:30,872 : INFO : Loading a fresh vocabulary
2020-03-21 00:30:30,886 : INFO : effective_min_count=5 retains 2701 unique words (14% of original 18627, drops 15926)
2020-03-21 00:30:30,888 : INFO : effective_min_count=5 leaves 90666 word corpus (80% of original 113134, drops 22468)
2020-03-21 00:30:30,901 : INFO : deleting the raw counts dictionary of 18627 items
2020-03-21 00:30:30,903 : INFO : sample=0.001 downsamples 44 most-common words
2020-03-21 00:30:30,904 : INFO : downsampling leaves estimated 66665 word corpus (73.5% of prior 90666)
2020-03-21 00:30:30,915 : INFO : estimated required memory for 2701 words and 300 dimensions: 7832900 bytes
2020-03-21 00:30:30,916 : INFO : resetting laye

Training time: 0.05336472193400065 mins.
wiki_canto saved.
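
Word2Vec reads the corpus more than once (one pass to build the vocabulary, then one pass per epoch), which is why the sentences are wrapped in a class with `__iter__` rather than passed as a plain generator. A small sketch of the difference, using made-up two-token sentences; `sentence_generator` and `RestartableSentences` are illustrative names only:

```python
def sentence_generator():
    """A generator: it can be consumed only once."""
    yield ["香港", "。"]
    yield ["廣東話", "。"]


class RestartableSentences:
    """An iterable: every call to __iter__ starts a fresh pass."""

    def __iter__(self):
        yield ["香港", "。"]
        yield ["廣東話", "。"]


gen = sentence_generator()
first_pass = list(gen)
second_pass = list(gen)   # empty: the generator is already exhausted

corpus = RestartableSentences()
pass_a = list(corpus)
pass_b = list(corpus)     # the same sentences again

print(len(first_pass), len(second_pass))  # prints: 2 0
print(len(pass_a), len(pass_b))           # prints: 2 2
```

A one-shot generator would leave nothing to read after the vocabulary pass, while the class yields the same sentences on every pass.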


### Example Result

In [59]:
model.wv.similar_by_word("廣東話")

[('以', 0.9989677667617798),
 ('後', 0.9989623427391052),
 ('有', 0.9989590048789978),
 ('做', 0.9989559054374695),
 ('；', 0.9989538192749023),
 ('嚟', 0.9989529848098755),
 ('綫', 0.9989511370658875),
 ('，', 0.9989491105079651),
 ('而', 0.9989476799964905),
 ('唔', 0.9989449977874756)]

- The near-identical scores (about 0.999) and the function-word and punctuation neighbours indicate an undertrained model: with `TRAIN_ALL = False` it saw only `wiki_00` (4224 sentences) for a single epoch.

<br> <br>

## 6. Load the saved model

In [56]:
SAVED_MODEL = "../../data/embedding/cantonese/custom_wiki.model"

from gensim.models import Word2Vec
saved_model = Word2Vec.load(SAVED_MODEL)

2020-03-21 00:37:06,783 : INFO : loading Word2Vec object from ../../data/embedding/cantonese/custom_wiki.model
2020-03-21 00:37:07,027 : INFO : loading wv recursively from ../../data/embedding/cantonese/custom_wiki.model.wv.* with mmap=None
2020-03-21 00:37:07,029 : INFO : loading vectors from ../../data/embedding/cantonese/custom_wiki.model.wv.vectors.npy with mmap=None
2020-03-21 00:37:07,077 : INFO : setting ignored attribute vectors_norm to None
2020-03-21 00:37:07,077 : INFO : loading trainables recursively from ../../data/embedding/cantonese/custom_wiki.model.trainables.* with mmap=None
2020-03-21 00:37:07,078 : INFO : loading syn1neg from ../../data/embedding/cantonese/custom_wiki.model.trainables.syn1neg.npy with mmap=None
2020-03-21 00:37:07,120 : INFO : loading vocabulary recursively from ../../data/embedding/cantonese/custom_wiki.model.vocabulary.* with mmap=None
2020-03-21 00:37:07,121 : INFO : setting ignored attribute cum_table to None
2020-03-21 00:37:07,121 : INFO : loa

### Example Result

In [58]:
saved_model.wv.similar_by_word("廣東話")

2020-03-21 00:37:34,563 : INFO : precomputing L2-norms of word weight vectors


[('粵語', 0.569934070110321),
 ('廣州話', 0.4621201455593109),
 ('普通話', 0.4244949221611023),
 ('閩南話', 0.4133637547492981),
 ('客家話', 0.4086306095123291),
 ('口語', 0.40600326657295227),
 ('官話', 0.4033676087856293),
 ('俗語', 0.398201584815979),
 ('中文', 0.3963436484336853),
 ('文言', 0.3910912871360779)]
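
The scores returned by `similar_by_word` are cosine similarities between word vectors. A minimal pure-Python sketch of that ranking, using made-up 3-dimensional vectors rather than the trained model; `cosine_similarity`, `most_similar` and `TOY_VECTORS` are hypothetical names for this illustration:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors, for illustration only (not taken from the trained model).
TOY_VECTORS = {
    "廣東話": [1.0, 0.0, 0.0],
    "粵語": [0.9, 0.1, 0.0],
    "香港": [0.0, 1.0, 0.0],
}
print(most_similar("廣東話", TOY_VECTORS, topn=2))
```

A score near 1 means the two vectors point in almost the same direction; a score near 0 means they are close to orthogonal.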