# Any2Vec File-based API Tutorial

This tutorial introduces the new **`corpus_file`** argument for the **`gensim.models.{word2vec.Word2Vec, fasttext.FastText, doc2vec.Doc2Vec}`** models and shows how to use it.

## Motivation

Standard Word2Vec training with the `sentences` argument doesn't scale well when the number of workers is large, so a special `corpus_file` argument was added to tackle this problem. Training with `corpus_file` yields a **significant performance boost**: with 32 workers, training is about 3.7x faster than with the `sentences` argument. It also outperforms the original Word2Vec C tool in words/sec processing speed.

In exchange for these performance benefits, the `corpus_file` argument accepts only a path to a corpus file, which must be in the format of `gensim.models.word2vec.LineSentence` (one sentence per line, words separated by whitespace).
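To make the expected file layout concrete, here is a minimal plain-Python sketch of the one-sentence-per-line format. The helpers `write_line_sentences` and `read_line_sentences` are hypothetical names used only for illustration; they are not gensim functions and not gensim's implementation. In practice `gensim.utils.save_as_line_sentence` does the writing for you.

```python
import os
import tempfile


def write_line_sentences(sentences, path):
    """Write an iterable of token lists, one space-joined sentence per line."""
    with open(path, "w", encoding="utf-8") as fout:
        for tokens in sentences:
            fout.write(" ".join(tokens) + "\n")


def read_line_sentences(path):
    """Yield each line of the file back as a list of whitespace-separated tokens."""
    with open(path, encoding="utf-8") as fin:
        for line in fin:
            yield line.split()


# A toy corpus: each sentence is a list of already-tokenized words.
toy_corpus = [["human", "machine", "interface"], ["graph", "of", "trees"]]

corpus_path = os.path.join(tempfile.mkdtemp(), "toy_corpus.txt")
write_line_sentences(toy_corpus, corpus_path)

# Reading the file back recovers the original token lists.
round_trip = list(read_line_sentences(corpus_path))
print(round_trip)
```

A file written this way has exactly one sentence per line, which is the shape that `corpus_file` expects.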


**Note**: you have to build `gensim` with Cython optimizations (`gensim.models.word2vec.CORPUSFILE_VERSION >= 0`) in order to be able to use `corpus_file` argument.
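A quick way to check this condition before training might look like the sketch below. The function name `corpus_file_supported` is a hypothetical helper, and treating a failed import as "not supported" is an assumption made for this example.

```python
def corpus_file_supported():
    """Return True if gensim reports that the Cython corpus_file code path is available."""
    try:
        from gensim.models.word2vec import CORPUSFILE_VERSION
    except ImportError:
        # Assumption: if gensim (or this constant) cannot be imported,
        # treat corpus_file mode as unavailable.
        return False
    return CORPUSFILE_VERSION >= 0


supported = corpus_file_supported()
print("corpus_file mode available:", supported)
```

If this prints `False`, fall back to the `sentences` argument or reinstall `gensim` with its compiled extensions.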

## In this tutorial

* We will show how to use the new API.
* We compare performance of `corpus_file` vs. `sentences` arguments on English Wikipedia.
* We will show that accuracies on `questions-words.txt` are almost the same for both modes.

## Usage is really simple

You only need to:

1. Save your corpus in LineSentence format (you may use `gensim.utils.save_as_line_sentence(your_corpus, your_corpus_file)` to save your corpus).
2. Change `sentences=your_corpus` argument to `corpus_file=your_corpus_file` in `Word2Vec.__init__`, `Word2Vec.build_vocab`, `Word2Vec.train` calls.


Short `Word2Vec` example:

In [1]:
import gensim.downloader as api
from gensim.utils import save_as_line_sentence
from gensim.models.word2vec import Word2Vec

corpus = api.load("text8")
save_as_line_sentence(corpus, "my_corpus.txt")

# Note: gensim 4.x renamed `iter` to `epochs` and `size` to `vector_size`.
model = Word2Vec(corpus_file="my_corpus.txt", iter=5, size=300, workers=14)

### Let's prepare the Wikipedia dataset

We load the Wikipedia dump from `gensim-data`, preprocess it with gensim functions, and save the processed corpus in LineSentence format.

In [2]:
CORPUS_FILE = 'wiki-en-20171001.txt'

In [None]:
import itertools
from gensim.parsing.preprocessing import preprocess_string

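def processed_corpus():
    raw_corpus = api.load('wiki-english-20171001')
    for article in raw_corpus:
        # Interleave section titles and section texts, then join them into one document per article.
        doc = '\n'.join(itertools.chain.from_iterable(zip(article['section_titles'], article['section_texts'])))
        # Tokenize and normalize with gensim's default preprocessing filters.
        yield preprocess_string(doc)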

save_as_line_sentence(processed_corpus(), CORPUS_FILE)
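The `itertools.chain.from_iterable(zip(...))` expression above is dense, so here is a small sketch of what it does on a toy article. The `article` dictionary below is made up for illustration; only its two keys mirror the fields used in the preprocessing code.

```python
import itertools

# Hypothetical toy article with the same two fields as the Wikipedia records.
article = {
    "section_titles": ["Introduction", "History"],
    "section_texts": ["Intro text.", "History text."],
}

# zip pairs each title with its text; chain.from_iterable flattens the pairs
# into title, text, title, text, ... order.
pairs = zip(article["section_titles"], article["section_texts"])
doc = "\n".join(itertools.chain.from_iterable(pairs))

print(doc)
```

Each article therefore becomes a single string with titles and texts alternating on separate lines, which is then passed to `preprocess_string`.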

## Word2Vec

We train two models:
* With `sentences` argument
* With `corpus_file` argument


Then, we compare timings and accuracy on `questions-words.txt`.

In [None]:
from gensim.models.word2vec import LineSentence
import time

st_time = time.time()
model_sent = Word2Vec(sentences=LineSentence(CORPUS_FILE), iter=5, size=300, workers=32)
model_sent_training_time = time.time() - st_time

st_time = time.time()
model_corp_file = Word2Vec(corpus_file=CORPUS_FILE, iter=5, size=300, workers=32)
model_corp_file_training_time = time.time() - st_time

In [7]:
print("Training model with `sentences` took {:.3f} seconds".format(model_sent_training_time))
print("Training model with `corpus_file` took {:.3f} seconds".format(model_corp_file_training_time))

Training model with `sentences` took 8711.613 seconds
Training model with `corpus_file` took 2367.976 seconds


#### Training with `corpus_file` took 3.7x less time!
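The 3.7x figure follows directly from the two timings printed above. A short sketch of the arithmetic (the variable names are chosen here for illustration):

```python
# Wall-clock training times in seconds, copied from the output above.
sentences_seconds = 8711.613
corpus_file_seconds = 2367.976

# Speedup factor: how many times faster corpus_file training finished.
speedup = sentences_seconds / corpus_file_seconds

# Share of the original training time that was saved.
time_saved_pct = (1 - corpus_file_seconds / sentences_seconds) * 100

print("Speedup: {:.1f}x".format(speedup))
print("Time saved: {:.1f}%".format(time_saved_pct))
```

Note that a 3.7x speedup means the `corpus_file` run needs a bit more than a quarter of the original time.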

#### Now, let's compare the accuracies.

In [None]:
from gensim.test.utils import datapath

In [9]:
model_sent_accuracy = model_sent.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]
print("Word analogy accuracy with `sentences`: {:.3f}".format(model_sent_accuracy))

model_corp_file_accuracy = model_corp_file.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]
print("Word analogy accuracy with `corpus_file`: {:.3f}".format(model_corp_file_accuracy))


Word analogy accuracy with `sentences`: 0.754
Word analogy accuracy with `corpus_file`: 0.744


#### Accuracies are approximately the same.

## FastText

Short example:

In [17]:
import gensim.downloader as api
from gensim.utils import save_as_line_sentence
from gensim.models.fasttext import FastText

corpus = api.load("text8")
save_as_line_sentence(corpus, "my_corpus.txt")

model = FastText(corpus_file="my_corpus.txt", iter=5, size=300, workers=14)

#### Let's compare the timings

In [None]:
from gensim.models.word2vec import LineSentence
from gensim.models.fasttext import FastText
import time

st_time = time.time()
model_corp_file = FastText(corpus_file=CORPUS_FILE, iter=5, size=300, workers=32)
model_corp_file_training_time = time.time() - st_time

st_time = time.time()
model_sent = FastText(sentences=LineSentence(CORPUS_FILE), iter=5, size=300, workers=32)
model_sent_training_time = time.time() - st_time

In [5]:
print("Training model with `sentences` took {:.3f} seconds".format(model_sent_training_time))
print("Training model with `corpus_file` took {:.3f} seconds".format(model_corp_file_training_time))

Training model with `sentences` took 16199.561 seconds
Training model with `corpus_file` took 10688.134 seconds


#### We see a 1.5x speedup

#### Now, accuracies:

In [6]:
from gensim.test.utils import datapath

model_sent_accuracy = model_sent.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]
print("Word analogy accuracy with `sentences`: {:.3f}".format(model_sent_accuracy))

model_corp_file_accuracy = model_corp_file.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]
print("Word analogy accuracy with `corpus_file`: {:.3f}".format(model_corp_file_accuracy))


Word analogy accuracy with `sentences`: 0.646
Word analogy accuracy with `corpus_file`: 0.659


## Doc2Vec

Short example:

In [15]:
import gensim.downloader as api
from gensim.utils import save_as_line_sentence
from gensim.models.doc2vec import Doc2Vec

corpus = api.load("text8")
save_as_line_sentence(corpus, "my_corpus.txt")

model = Doc2Vec(corpus_file="my_corpus.txt", epochs=5, vector_size=300, workers=14)

#### Let's compare the timings

In [None]:
from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument
import time

st_time = time.time()
model_corp_file = Doc2Vec(corpus_file=CORPUS_FILE, epochs=5, vector_size=300, workers=32)
model_corp_file_training_time = time.time() - st_time

st_time = time.time()
model_sent = Doc2Vec(documents=TaggedLineDocument(CORPUS_FILE), epochs=5, vector_size=300, workers=32)
model_sent_training_time = time.time() - st_time

In [13]:
print("Training model with `sentences` took {:.3f} seconds".format(model_sent_training_time))
print("Training model with `corpus_file` took {:.3f} seconds".format(model_corp_file_training_time))

Training model with `sentences` took 18016.579 seconds
Training model with `corpus_file` took 2908.185 seconds


#### 6x speedup!

#### Accuracies:

In [14]:
from gensim.test.utils import datapath

model_sent_accuracy = model_sent.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]
print("Word analogy accuracy with `sentences`: {:.3f}".format(model_sent_accuracy))

model_corp_file_accuracy = model_corp_file.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]
print("Word analogy accuracy with `corpus_file`: {:.3f}".format(model_corp_file_accuracy))


Word analogy accuracy with `sentences`: 0.718
Word analogy accuracy with `corpus_file`: 0.685
