Word2vec (word to vectors) approach for Japanese language using Gensim (Deep Learning skip-gram and CBOW models). The model is trained on the Japanese version of Wikipedia available at jawiki-latest-pages-articles.xml.bz2.
Definition: Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a high-dimensional space (typically of several hundred dimensions), with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.
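To make "close proximity" concrete, closeness between word vectors is commonly measured with cosine similarity. The sketch below uses toy 3-dimensional vectors invented purely for illustration (they are not taken from the trained model), and the helper name `cosine_similarity` is an assumption of this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration only (not real model output).
toy_vectors = {
    "猫": [0.9, 0.1, 0.0],    # cat
    "犬": [0.8, 0.2, 0.1],    # dog
    "経済": [0.0, 0.1, 0.9],  # economy
}

sim_cat_dog = cosine_similarity(toy_vectors["猫"], toy_vectors["犬"])
sim_cat_economy = cosine_similarity(toy_vectors["猫"], toy_vectors["経済"])

# Words that share contexts should score higher than unrelated ones.
print(sim_cat_dog > sim_cat_economy)  # True
```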
Further reading on word embeddings (GloVe, a related model from Stanford): http://nlp.stanford.edu/projects/glove/
Generating the vectors from a Wikipedia dump takes about 2 to 3 hours on a Core i5 with the default parameters.
    git clone https://github.com/philipperemy/japanese-word-to-vectors.git
    cd japanese-word-to-vectors
    pip3 install -r requirements.txt # you can create a virtual env first
    wget https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2 # 2.4GB; this can take some time depending on your internet speed

    # Uses TinySegmenter3 for the tokenization (easy to install but less accurate).
    python3 generate_vectors.py

    # Recommended: uses the MeCab tokenizer. Installation instructions are available at http://www.robfahey.co.uk/blog/japanese-text-analysis-in-python/
    # The "Tokenize the text" section of this README also explains how to install it.
    python3 generate_vectors.py --mecab
If generate_vectors.py does not detect the file jawiki-latest-pages-articles.xml.bz2, it downloads it automatically before starting the long vector generation.
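The download-if-missing behaviour can be sketched as follows. This is a minimal illustration, not the code from generate_vectors.py: the function name `ensure_dump` and its `fetch` parameter are assumptions introduced here so the logic can be exercised without a network connection.

```python
import os
import urllib.request

DUMP_URL = "https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2"
DUMP_FILENAME = "jawiki-latest-pages-articles.xml.bz2"


def ensure_dump(filename=DUMP_FILENAME, url=DUMP_URL, fetch=urllib.request.urlretrieve):
    """Download `url` to `filename` unless the file is already present.

    Returns True if a download was triggered, False otherwise.
    `fetch` is a hypothetical hook: any callable taking (url, filename).
    """
    if os.path.isfile(filename):
        return False
    fetch(url, filename)
    return True
```

Calling `ensure_dump()` with no arguments would fetch the 2.4GB dump into the current directory if it is not already there.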
Convert Wiki dump to text
The first step is to extract the text and the sentences of the dump. It is done in this function:
    INPUT_FILENAME = 'jawiki-latest-pages-articles.xml.bz2'  # the only input file
    JA_WIKI_TEXT_FILENAME = 'jawiki-latest-text.txt'  # first output file of the function
    JA_WIKI_SENTENCES_FILENAME = 'jawiki-latest-text-sentences.txt'  # second output file of the function
    process_wiki_to_text(INPUT_FILENAME, JA_WIKI_TEXT_FILENAME, JA_WIKI_SENTENCES_FILENAME)
The output consists of two files:
JA_WIKI_TEXT_FILENAME, where each line corresponds to an article. Its content looks like:
    trebuchet msフォントアンパサンドとはを意味する記号である
JA_WIKI_SENTENCES_FILENAME, where each line corresponds to a sentence or chunk of words in the text. This file is not used by the word2vec algorithm, but it can be useful for training a sentence-to-vector model (such as skip-thoughts, available at https://github.com/ryankiros/skip-thoughts/).
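The relationship between the two output files can be sketched as below. This is only an illustration under stated assumptions: the helper names `split_sentences` and `write_outputs` are hypothetical, and splitting on Japanese sentence-ending punctuation is one plausible way to obtain sentence-like chunks, not necessarily what process_wiki_to_text does.

```python
import re

# Assumed delimiters: Japanese full stop, full-width ! and ?, and newlines.
_SENTENCE_END = re.compile(r"[。！？\n]+")


def split_sentences(article_text):
    """Split one article into non-empty, stripped sentence-like chunks."""
    return [s.strip() for s in _SENTENCE_END.split(article_text) if s.strip()]


def write_outputs(articles, text_path, sentences_path):
    """Write one article per line to text_path and one chunk per line to sentences_path."""
    with open(text_path, "w", encoding="utf-8") as text_file, \
         open(sentences_path, "w", encoding="utf-8") as sentences_file:
        for article in articles:
            sentences = split_sentences(article)
            # The article file keeps the whole article on a single line.
            text_file.write(" ".join(sentences) + "\n")
            for sentence in sentences:
                sentences_file.write(sentence + "\n")
```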
Tokenize the text
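Tokenizing means splitting the full text into words. Japanese is written without spaces between words, so a segmenter is required; the resulting tokens are written out separated by spaces. Two approaches are available here: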
TinySegmenter3 (easy but less accurate in the tokenization phase)
The output is JA_WIKI_TEXT_TOKENS_FILENAME. It looks like this:
trebuchet ms フォント アンパサンド と は を 意味 する
MeCab (advanced but very accurate)
I strongly advise you to read this tutorial first: How to install MeCab (http://www.robfahey.co.uk/blog/japanese-text-analysis-in-python/).
The installation depends on your OS:
macOS (Homebrew):
    brew install mecab
    brew install mecab-ipadic
    brew install git curl xz
    git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
    cd mecab-ipadic-neologd
    ./bin/install-mecab-ipadic-neologd -n
    pip3 install mecab-python3
Debian/Ubuntu (apt-get):
    sudo apt-get install mecab mecab-ipadic libmecab-dev
    sudo apt-get install mecab-ipadic-utf8
    sudo apt-get install git curl
    git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
    cd mecab-ipadic-neologd
    sudo ./bin/install-mecab-ipadic-neologd -n
    pip3 install mecab-python3
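Both tokenizers can be put behind the same interface, roughly as sketched here. The function names `make_tokenizer` and `tokenize_file` are hypothetical and this is not the code from generate_vectors.py. It assumes that the tinysegmenter3 package exposes a `tinysegmenter` module with a `TinySegmenter` class, and that MeCab is used in its wakati (space-separated) output mode; selecting the NEologd dictionary through MeCab's `-d` option is left out.

```python
def make_tokenizer(use_mecab=False):
    """Return a function mapping a string to a list of tokens.

    Imports are lazy so that only the chosen backend has to be installed.
    """
    if use_mecab:
        import MeCab  # provided by the mecab-python3 package
        tagger = MeCab.Tagger("-Owakati")  # wakati mode: space-separated tokens
        return lambda text: tagger.parse(text).split()
    import tinysegmenter  # assumed module name for the tinysegmenter3 package
    segmenter = tinysegmenter.TinySegmenter()
    return segmenter.tokenize


def tokenize_file(input_path, output_path, tokenize):
    """Rewrite each input line as space-separated tokens."""
    with open(input_path, encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            # Drop empty or whitespace-only tokens before joining.
            tokens = [t for t in tokenize(line.strip()) if t.strip()]
            dst.write(" ".join(tokens) + "\n")
```

With this shape, the `--mecab` flag would only need to decide which tokenizer `make_tokenizer` returns; the file-writing loop stays the same.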
Infer the vectors
Finally, the Gensim library is used to train the word2vec model with the following parameters:
- size of 50 (dimensionality of the feature vectors)
- window of 5 (maximum distance between the current and predicted word within a sentence)
- min count of 5 (ignore all words with total frequency lower than this)
- iter of 5 (number of iterations or epochs over the corpus)
- number of workers equal to number of cores
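The parameters listed above map onto Gensim's `Word2Vec` constructor roughly as follows. This is a hedged sketch rather than the repository's training code: `word2vec_params` and `train` are hypothetical helper names. Note that Gensim 4.0 renamed `size` to `vector_size` and `iter` to `epochs`, so the keyword names depend on the installed version.

```python
import multiprocessing


def word2vec_params(gensim_major_version):
    """Map the settings listed above to Word2Vec keyword arguments."""
    params = {
        "window": 5,      # max distance between current and predicted word
        "min_count": 5,   # ignore words with a lower total frequency
        "workers": multiprocessing.cpu_count(),
    }
    if gensim_major_version >= 4:
        params.update(vector_size=50, epochs=5)
    else:
        params.update(size=50, iter=5)
    return params


def train(tokens_path):
    """Train on a file holding one space-tokenized article per line."""
    # Imported lazily so the helper above works without Gensim installed.
    import gensim
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    major = int(gensim.__version__.split(".")[0])
    return Word2Vec(LineSentence(tokens_path), **word2vec_params(major))
```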
While training, the console output looks like:
    2016-09-04 02:54:38,354 : INFO : PROGRESS: at 99.74% examples, 482630 words/s, in_qsize 5, out_qsize 4
    2016-09-04 02:54:39,346 : INFO : PROGRESS: at 99.82% examples, 482644 words/s, in_qsize 7, out_qsize 0
    2016-09-04 02:54:40,356 : INFO : PROGRESS: at 99.90% examples, 482643 words/s, in_qsize 7, out_qsize 1
    2016-09-04 02:54:41,390 : INFO : PROGRESS: at 99.98% examples, 482630 words/s, in_qsize 8, out_qsize 0
Once it's finished, 4 new files are generated:
ja-gensim.50d.data.model. This file contains the model in binary format. Use model = Word2Vec.load(fname) to get your word2vec model.
ja-gensim.50d.data.txt. This file contains the model vectors in text format. It can be used in any other Python script without the Gensim library.
ja-gensim.50d.data.model.wv.syn0.npy. Auxiliary file generated automatically when the model is saved. It contains NumPy arrays (weights and other parameters) and must be kept in the same directory as the .model file.
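Since ja-gensim.50d.data.txt is plain text, it can be read with the standard library alone. The sketch below assumes each line holds a word followed by its space-separated vector components; the function name `load_vectors` is hypothetical. It also skips a leading "vocabulary size, dimensions" header in case the file was written in the standard word2vec text format, which is an assumption about the file layout.

```python
def load_vectors(lines):
    """Parse an iterable of text lines into a {word: [floats]} dict."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) <= 2:
            # Blank line or an optional "<vocab_size> <dimensions>" header.
            # (Assumes real vectors have at least two components.)
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Usage, assuming the file sits in the current directory:
# with open("ja-gensim.50d.data.txt", encoding="utf-8") as f:
#     vectors = load_vectors(f)
# len(vectors) then gives the vocabulary size.
```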
Finally, let's inspect the vectors in the text file:
    の 0.128774 3.631298 -3.058414 -0.434418 -0.300449 -1.211774 0.608027 -5.561740 -1.186208 -0.035129 1.709353 1.252130 -3.849393 0.390795 4.260262 0.209959 2.316592 -2.880473 -0.427741 -1.335913 4.500565 0.556813 0.585122 -0.739895 1.034633 3.786435 -1.032835 -5.697092 1.436553 -1.689847 -4.953261 -3.883135 1.730590 -3.211419 -2.154781 -1.915586 -0.283341 0.332927 -2.281737 0.440092 1.535507 0.925073 -4.101060 0.634421 -4.230011 -0.313288 -3.955676 0.009256 2.931253 -0.500217
    に -2.019490 4.359702 -1.845176 -2.663986 1.774256 0.147722 1.484422 -2.984465 2.262582 -0.861214 0.804603 1.007627 -4.322638 -0.173283 2.905254 0.803300 2.850667 -3.859382 -0.214240 -1.914028 5.640825 -0.139551 0.243700 -3.234274 1.844652 6.613075 -2.586612 -7.520448 4.413483 -3.270162 -2.952101 -2.278936 7.161888 -6.830038 -2.042799 -0.559094 -2.270651 2.744259 -2.250800 0.269468 -0.153715 3.831476 -2.068467 1.833452 -4.605278 3.756418 -4.275790 1.822912 1.606565 -2.918230
    は 0.296134 4.136690 -3.184480 -0.817397 0.555022 -1.181827 0.933714 -4.486689 -0.429983 0.427427 0.089208 1.415648 -2.763912 1.310283 5.143843 1.778646 2.280496 -4.852800 -1.581973 -1.364721 3.240205 1.227000 0.931791 -2.009395 1.856946 3.401864 -1.741597 -6.626904 -0.016503 -3.313225 -2.302027 -3.208004 4.541845 -4.704424 -2.073442 -1.192726 0.880771 -1.584695 0.450757 1.645549 1.212130 1.006536 -3.576060 0.142494 -4.799853 0.906162 -3.141263 1.762820 2.482034 -1.188599
Here we can see the vectors for の, に and は. Further down the file, we find longer words such as 文献. The size of the vocabulary is the number of lines in this file (each line holds one word and its vector representation).
wc -l ja-gensim.50d.data.txt yields
- Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
- Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.