# *2Vec File-based Training: API Tutorial

This tutorial introduces a new file-based training mode for **`gensim.models.{Word2Vec, FastText, Doc2Vec}`** which leads to (much) faster training on machines with many cores. It documents how to use it, with Python examples.

## In this tutorial

1. Show how to use the new training mode.
2. Evaluate its performance on the English Wikipedia and compare it to the existing mode.
3. Show that model quality (analogy accuracy on `questions-words.txt`) is almost the same for both modes.

## Motivation

The original implementation of Word2Vec training in Gensim is already super fast (covered in [this blog series](https://rare-technologies.com/word2vec-in-python-part-two-optimizing/), [benchmarks against other implementations](https://rare-technologies.com/machine-learning-hardware-benchmarks/)) and flexible, allowing you to train on arbitrary Python streams. We had to jump through some serious hoops to make it so, avoiding the Global Interpreter Lock (the dreaded GIL, the main bottleneck for any serious high performance computation in Python).

The end result worked great for modest machines (< 8 cores), but for higher-end servers, the GIL reared its ugly head again. Just managing the input stream iterators and worker queues (which has to be done in Python) was becoming the bottleneck. In other words, the Python implementation didn't scale linearly with cores the way the original C implementation by Tomáš Mikolov did.

**FIXME ADD IMAGE: x-axis CPU cores, y-axis performance (words/second)**

We decided to change that. After [much](https://github.com/RaRe-Technologies/gensim/pull/2127) [experimentation](https://github.com/RaRe-Technologies/gensim/pull/2048#issuecomment-401494412) and [benchmarking](https://persiyanov.github.io/jekyll/update/2018/05/28/gsoc-first-weeks.html), including some pretty [hardcore outlandish ideas](https://github.com/RaRe-Technologies/gensim/pull/2127#issuecomment-405937741), we figured there's no way around the GIL limitations, at least not at this level of required performance. Remember, we're talking >500k words (training instances) per second, using highly optimized C code. Way past the naive "vectorize with NumPy arrays" territory.

So we decided to introduce a new code path, which has *less flexibility* in favour of *more performance*. We call this code path **`file-based training`**, and it's realized by passing a **new `corpus_file` parameter** to training (instead of the old `sentences` parameter, which is still available).

## How it works

<style>
.rendered_html tr, .rendered_html th, .rendered_html td {
    text-align: left;
}
</style>

| *code path* | *input parameter* | *advantages* | *disadvantages* |
| :-------- | :-------- | :--------- | :----------- |
| Python-stream training (existing) | `sentences` (Python iterable) | Input can be generated dynamically from any storage, or even on-the-fly. | Scaling plateaus after 8 cores. |
| file-based training (new) | `corpus_file` (file on disk) | Scales linearly with CPU cores. | Training corpus must be serialized to disk in a specific format. |

When you specify `corpus_file`, the model will read and process different portions of the file with different workers. The entire bulk of the work is done outside of the GIL, using no Python structures at all. The workers update the same weight matrix, but otherwise there's no communication: each worker munches on its data portion completely independently. This is the same approach the original C tool uses.
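To make the "different portions of the file" idea concrete, here is a minimal pure-Python sketch of one way to split a one-sentence-per-line file into per-worker byte ranges that begin and end on line boundaries. It's an illustration only: the helper names `split_into_line_aligned_chunks` and `read_chunk` are made up for this example, and Gensim's real implementation lives in compiled code and may partition the file differently.

```python
import os
import tempfile


def split_into_line_aligned_chunks(path, num_workers):
    """Return up to `num_workers` (start, end) byte ranges covering the file.

    Every range starts at the beginning of a line, so no sentence is
    ever split between two workers. (Hypothetical helper, not Gensim API.)
    """
    file_size = os.path.getsize(path)
    boundaries = [0]
    with open(path, "rb") as fin:
        for worker in range(1, num_workers):
            # Jump to the naive split point, then skip to the end of that line.
            fin.seek(file_size * worker // num_workers)
            fin.readline()
            boundaries.append(min(fin.tell(), file_size))
    boundaries.append(file_size)
    # Drop empty ranges (can happen when there are more workers than lines).
    return [(start, end) for start, end in zip(boundaries, boundaries[1:]) if start < end]


def read_chunk(path, start, end):
    """Yield the sentences (lists of tokens) stored in bytes [start, end)."""
    with open(path, "rb") as fin:
        fin.seek(start)
        while fin.tell() < end:
            yield fin.readline().decode("utf8").split()


# Toy corpus, written one sentence per line with space-separated tokens.
sentences = [
    ["the", "quick", "brown", "fox"],
    ["jumps"],
    ["over", "the", "lazy", "dog"],
    ["again", "and", "again"],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    corpus_path = os.path.join(tmp_dir, "toy_corpus.txt")
    with open(corpus_path, "w", encoding="utf8") as fout:
        for sentence in sentences:
            fout.write(" ".join(sentence) + "\n")

    chunks = split_into_line_aligned_chunks(corpus_path, num_workers=3)
    # Each "worker" would read only its own range; here we read them in turn.
    recovered = [s for start, end in chunks for s in read_chunk(corpus_path, start, end)]

print(chunks)
print(recovered == sentences)
```

Because each range is self-contained, a worker needs nothing from the others except the shared weight matrix, which is what makes the scheme easy to run without a central Python-side queue.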

Training with `corpus_file` yields a **significant performance boost**: for example, in the experiment below, training is 3.7x faster with 32 workers compared to training with the `sentences` argument. It even outperforms the original Word2Vec C tool in terms of words/sec processing speed on high-core machines.

The limitation of this approach is that the `corpus_file` argument accepts a path to your corpus file, which must be stored on disk in a specific format. The format is simply the well-known [gensim.models.word2vec.LineSentence](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.LineSentence): one sentence per line, with words separated by spaces.
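As a quick illustration of the format itself, the sketch below streams a toy tokenized corpus from a generator into a file by hand, one sentence per line with tokens joined by single spaces, and reads it back. It uses only the standard library. The function names are invented for this example, and the whitespace check is this sketch's own precaution rather than documented Gensim behaviour.

```python
import os
import tempfile


def write_line_sentence(sentences, path):
    """Stream an iterable of token lists to `path`, one sentence per line."""
    with open(path, "w", encoding="utf8") as fout:
        for tokens in sentences:
            # A token that is empty or contains whitespace would not survive
            # the round trip through a space-separated file, so reject it.
            if any(token.split() != [token] for token in tokens):
                raise ValueError("tokens must be non-empty and contain no whitespace")
            fout.write(" ".join(tokens) + "\n")


def read_line_sentence(path):
    """Yield each line of `path` as a list of whitespace-separated tokens."""
    with open(path, encoding="utf8") as fin:
        for line in fin:
            yield line.split()


def toy_corpus():
    # Stand-in for a corpus that is generated on the fly.
    yield ["human", "machine", "interface"]
    yield ["graph", "of", "trees"]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_corpus.txt")
    write_line_sentence(toy_corpus(), path)
    with open(path, encoding="utf8") as fin:
        raw_text = fin.read()
    round_trip = list(read_line_sentence(path))

print(raw_text)
print(round_trip)
```

Since the writer consumes its input lazily, the corpus never has to fit in memory; only the resulting file has to fit on disk.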

## How to use it

You only need to:

1. Save your corpus in the LineSentence format to disk (you may use [gensim.utils.save_as_line_sentence(your_corpus, your_corpus_file)](https://radimrehurek.com/gensim/utils.html#gensim.utils.save_as_line_sentence) for convenience).
2. Change the `sentences=your_corpus` argument to `corpus_file=your_corpus_file` in the `Word2Vec.__init__`, `Word2Vec.build_vocab` and `Word2Vec.train` calls.


A short Word2Vec example:

In [2]:
import logging

import gensim
import gensim.downloader as api
from gensim.utils import save_as_line_sentence
from gensim.models.word2vec import Word2Vec

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(filename)s:%(lineno)s - %(message)s')

print(gensim.models.word2vec.CORPUSFILE_VERSION)  # must be >= 0, i.e. optimized compiled version

corpus = api.load("text8")
save_as_line_sentence(corpus, "my_corpus.txt")

model = Word2Vec(corpus_file="my_corpus.txt", iter=5, size=300, workers=14)

1


INFO - word2vec.py:1567 - collecting all words and their counts
INFO - word2vec.py:1552 - PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO - word2vec.py:1575 - collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
INFO - word2vec.py:1626 - Loading a fresh vocabulary
INFO - word2vec.py:1650 - effective_min_count=5 retains 71290 unique words (28% of original 253854, drops 182564)
INFO - word2vec.py:1656 - effective_min_count=5 leaves 16718844 word corpus (98% of original 17005207, drops 286363)
INFO - word2vec.py:1715 - deleting the raw counts dictionary of 253854 items
INFO - word2vec.py:1718 - sample=0.001 downsamples 38 most-common words
INFO - word2vec.py:1721 - downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
INFO - base_any2vec.py:1020 - estimated required memory for 71290 words and 300 dimensions: 206741000 bytes
INFO - word2vec.py:1834 - resetting layer weights
INFO - base_any2vec.py:1208 - training mo

INFO - base_any2vec.py:347 - worker thread finished; awaiting finish of 10 more threads
INFO - base_any2vec.py:347 - worker thread finished; awaiting finish of 9 more threads
INFO - base_any2vec.py:347 - worker thread finished; awaiting finish of 8 more threads
INFO - base_any2vec.py:347 - worker thread finished; awaiting finish of 7 more threads
INFO - base_any2vec.py:347 - worker thread finished; awaiting finish of 6 more threads
INFO - base_any2vec.py:347 - worker thread finished; awaiting finish of 5 more threads
INFO - base_any2vec.py:347 - worker thread finished; awaiting finish of 4 more threads
INFO - base_any2vec.py:347 - worker thread finished; awaiting finish of 3 more threads
INFO - base_any2vec.py:347 - worker thread finished; awaiting finish of 2 more threads
INFO - base_any2vec.py:347 - worker thread finished; awaiting finish of 1 more threads
INFO - base_any2vec.py:347 - worker thread finished; awaiting finish of 0 more threads
INFO - base_any2vec.py:1342 - EPOCH - 5 : 

**FIXME: I see warnings in the training on my machine (macbook pro, using Gensim code at #0b0383).**

### Let's prepare the full Wikipedia dataset as training corpus

We load the Wikipedia dump from `gensim-data`, preprocess the text with Gensim functions, and finally save the processed corpus in the LineSentence format.

In [4]:
import itertools
from gensim.parsing.preprocessing import preprocess_string

CORPUS_FILE = 'wiki-en-20171001.txt'

def processed_corpus():
    raw_corpus = api.load('wiki-english-20171001')
    for article in raw_corpus:
        # concatenate all section titles and texts of each Wikipedia article into a single "sentence"
        doc = '\n'.join(itertools.chain.from_iterable(zip(article['section_titles'], article['section_texts'])))
        yield preprocess_string(doc)

# serialize the preprocessed corpus into a single file on disk
save_as_line_sentence(processed_corpus(), CORPUS_FILE)

IOError: [Errno socket error] [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:590)

**XXX: Doesn't work for me (Radim). Some `gensim.downloader` issues on macbook, unrelated to *2vec training.**
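The way `processed_corpus` flattens one article can be tried without downloading anything. The sketch below applies the same `zip` + `itertools.chain.from_iterable` interleaving to a made-up article dict with the `section_titles` and `section_texts` keys used above. The article content and the helper name `article_to_text` are invented for illustration, and the plain `str.split()` at the end is only a stand-in for `preprocess_string`, which applies a whole chain of filters and will produce different tokens.

```python
import itertools

# A made-up record in the same shape as the articles iterated over above.
toy_article = {
    "section_titles": ["Introduction", "History"],
    "section_texts": ["Word embeddings map words to vectors.", "They became popular in 2013."],
}


def article_to_text(article):
    """Interleave titles and texts: title 1, text 1, title 2, text 2, ..."""
    pairs = zip(article["section_titles"], article["section_texts"])
    return "\n".join(itertools.chain.from_iterable(pairs))


doc = article_to_text(toy_article)
print(doc)

# Stand-in tokenization only; NOT equivalent to gensim's preprocess_string.
tokens = doc.split()
print(tokens)
```

Note that `zip` stops at the shorter of the two lists, so a section title without a matching text (or vice versa) is silently dropped.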

## Word2Vec

We train two models:
* With the `sentences` argument
* With the `corpus_file` argument


Then, we compare the timings and accuracy on `question-words.txt`.

In [None]:
from gensim.models.word2vec import LineSentence
import time

start_time = time.time()
model_sent = Word2Vec(sentences=LineSentence(CORPUS_FILE), iter=5, size=300, workers=32)
print("Training model with `sentences` took {:.3f} seconds".format(time.time() - start_time))

start_time = time.time()
model_corp_file = Word2Vec(corpus_file=CORPUS_FILE, iter=5, size=300, workers=32)
print("Training model with `corpus_file` took {:.3f} seconds".format(time.time() - start_time))

**Training with `corpus_file` took 3.7x less time!**
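The same start/stop timing pattern is repeated for every model in this tutorial, so it can be handy to wrap it. Below is a small sketch of such a wrapper; `timed` and `speedup` are hypothetical helper names, and the two `sum(...)` calls are cheap stand-ins for the real training calls so that the snippet runs in a fraction of a second.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, timings):
    """Record the wall-clock duration of the enclosed block in `timings[label]`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


timings = {}

with timed("sentences", timings):
    sum(i * i for i in range(200000))  # stand-in for Word2Vec(sentences=...)

with timed("corpus_file", timings):
    sum(i * i for i in range(100000))  # stand-in for Word2Vec(corpus_file=...)

for label, seconds in timings.items():
    print("Training model with `{}` took {:.3f} seconds".format(label, seconds))
print("Speedup: {:.1f}x".format(speedup(timings["sentences"], timings["corpus_file"])))
```

`time.perf_counter()` is used here instead of `time.time()` because it is a monotonic clock meant for measuring intervals; for training runs that last minutes or hours the difference is negligible.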

Now, let's compare the accuracies:

In [None]:
from gensim.test.utils import datapath

In [9]:
model_sent_accuracy = model_sent.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]
print("Word analogy accuracy with `sentences`: {:.1f} %".format(100.0 * model_sent_accuracy))

model_corp_file_accuracy = model_corp_file.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]
print("Word analogy accuracy with `corpus_file`: {:.1f} %".format(100.0 * model_corp_file_accuracy))



Word analogy accuracy with `sentences`: 0.754
Word analogy accuracy with `corpus_file`: 0.744


The accuracies are approximately the same.

**FIXME: why "approximately"? What is the "expected jitter" due to the randomness in training? For example, what's the accuracy spread of multiple runs in the same mode (e.g., for corpus_file only)?**

## FastText

Short example:

In [5]:
import gensim.downloader as api
from gensim.utils import save_as_line_sentence
from gensim.models.fasttext import FastText

corpus = api.load("text8")
save_as_line_sentence(corpus, "my_corpus.txt")

model = FastText(corpus_file="my_corpus.txt", iter=5, size=300, workers=14)

INFO - word2vec.py:1567 - collecting all words and their counts
INFO - word2vec.py:1552 - PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO - word2vec.py:1575 - collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
INFO - word2vec.py:1626 - Loading a fresh vocabulary
INFO - word2vec.py:1650 - effective_min_count=5 retains 71290 unique words (28% of original 253854, drops 182564)
INFO - word2vec.py:1656 - effective_min_count=5 leaves 16718844 word corpus (98% of original 17005207, drops 286363)
INFO - word2vec.py:1715 - deleting the raw counts dictionary of 253854 items
INFO - word2vec.py:1718 - sample=0.001 downsamples 38 most-common words
INFO - word2vec.py:1721 - downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
INFO - fasttext.py:551 - estimated required memory for 71290 words, 306868 buckets and 300 dimensions: 591929400 bytes
INFO - word2vec.py:1834 - resetting layer weights
INFO - fasttext.py:1011 - Tota

INFO - base_any2vec.py:1303 - EPOCH 4 - PROGRESS: at 86.54% examples, 48363 words/s, in_qsize -1, out_qsize 1
INFO - base_any2vec.py:347 - worker thread finished; awaiting finish of 2 more threads
INFO - base_any2vec.py:347 - worker thread finished; awaiting finish of 1 more threads
INFO - base_any2vec.py:347 - worker thread finished; awaiting finish of 0 more threads
INFO - base_any2vec.py:1342 - EPOCH - 4 : training on 17114282 raw words (8584475 effective words) took 153.1s, 56088 effective words/s
INFO - base_any2vec.py:1303 - EPOCH 5 - PROGRESS: at 7.17% examples, 4141 words/s, in_qsize -1, out_qsize 1
INFO - base_any2vec.py:347 - worker thread finished; awaiting finish of 13 more threads
INFO - base_any2vec.py:347 - worker thread finished; awaiting finish of 12 more threads
INFO - base_any2vec.py:347 - worker thread finished; awaiting finish of 11 more threads
INFO - base_any2vec.py:1303 - EPOCH 5 - PROGRESS: at 28.69% examples, 16413 words/s, in_qsize -1, out_qsize 1
INFO - base

**FIXME: I see some warnings in the output.**

#### Let's compare the timings

In [None]:
from gensim.models.word2vec import LineSentence
from gensim.models.fasttext import FastText
import time

start_time = time.time()
model_corp_file = FastText(corpus_file=CORPUS_FILE, iter=5, size=300, workers=32)
print("Training model with `corpus_file` took {:.3f} seconds".format(time.time() - start_time))

start_time = time.time()
model_sent = FastText(sentences=LineSentence(CORPUS_FILE), iter=5, size=300, workers=32)
print("Training model with `sentences` took {:.3f} seconds".format(time.time() - start_time))

**We see a 1.5x performance boost!**

#### Now, accuracies:

In [6]:
from gensim.test.utils import datapath

model_sent_accuracy = model_sent.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]
print("Word analogy accuracy with `sentences`: {:.1f} %".format(100.0 * model_sent_accuracy))

model_corp_file_accuracy = model_corp_file.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]
print("Word analogy accuracy with `corpus_file`: {:.1f} %".format(100.0 * model_corp_file_accuracy))



Word analogy accuracy with `sentences`: 0.646
Word analogy accuracy with `corpus_file`: 0.659


## Doc2Vec

Short example:

In [15]:
import gensim.downloader as api
from gensim.utils import save_as_line_sentence
from gensim.models.doc2vec import Doc2Vec

corpus = api.load("text8")
save_as_line_sentence(corpus, "my_corpus.txt")

model = Doc2Vec(corpus_file="my_corpus.txt", epochs=5, vector_size=300, workers=14)

#### Let's compare the timings

In [None]:
from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument
import time

start_time = time.time()
model_corp_file = Doc2Vec(corpus_file=CORPUS_FILE, epochs=5, vector_size=300, workers=32)
print("Training model with `corpus_file` took {:.3f} seconds".format(time.time() - start_time))

start_time = time.time()
model_sent = Doc2Vec(documents=TaggedLineDocument(CORPUS_FILE), epochs=5, vector_size=300, workers=32)
print("Training model with `sentences` took {:.3f} seconds".format(time.time() - start_time))

**A 6x speedup!**

#### Accuracies:

In [14]:
from gensim.test.utils import datapath

model_sent_accuracy = model_sent.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]
print("Word analogy accuracy with `sentences`: {:.1f} %".format(100.0 * model_sent_accuracy))

model_corp_file_accuracy = model_corp_file.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]
print("Word analogy accuracy with `corpus_file`: {:.1f} %".format(100.0 * model_corp_file_accuracy))



Word analogy accuracy with `sentences`: 0.718
Word analogy accuracy with `corpus_file`: 0.685
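
To put the three comparisons side by side, the short sketch below computes the gap between the two modes from the accuracies printed in this notebook for Word2Vec, FastText and Doc2Vec. It only restates the reported numbers. It says nothing about run-to-run variance, which would need repeated training runs to estimate. The helper name `accuracy_gap` is made up for this example.

```python
# (accuracy with `sentences`, accuracy with `corpus_file`), copied from the outputs above.
reported = {
    "Word2Vec": (0.754, 0.744),
    "FastText": (0.646, 0.659),
    "Doc2Vec": (0.718, 0.685),
}


def accuracy_gap(sentences_acc, corpus_file_acc):
    """Return (gap in percentage points, gap relative to `sentences` in %)."""
    diff = corpus_file_acc - sentences_acc
    return 100.0 * diff, 100.0 * diff / sentences_acc


for name, (sentences_acc, corpus_file_acc) in reported.items():
    points, relative = accuracy_gap(sentences_acc, corpus_file_acc)
    print("{}: {:+.1f} points ({:+.1f} % relative)".format(name, points, relative))
```

On the reported numbers, `corpus_file` is 1.0 points below `sentences` for Word2Vec, 1.3 points above for FastText, and 3.3 points below for Doc2Vec.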


**FIXME: That's a big difference. A bug?**

## TL;DR: Conclusion

In case your training corpus already lives on disk, you lose nothing by switching to the new `corpus_file` training mode. Training will be much faster.

In case your corpus is generated dynamically, you can either serialize it to disk first with `gensim.utils.save_as_line_sentence` (and then use the fast `corpus_file` mode) or, if that's not possible, continue using the existing `sentences` training mode.

------

This new code branch was created by [@persiyanov](https://github.com/persiyanov) as a Google Summer of Code 2018 project in the [RARE Student Incubator](https://rare-technologies.com/incubator/).

Questions, comments? Use our Gensim [mailing list](https://groups.google.com/forum/#!forum/gensim) and [twitter](https://twitter.com/gensim_py). Happy training!