A repository for benchmarking popular neural network frameworks used to learn unsupervised word embeddings, on different hardware platforms. This repository grew out of this issue raised on the Gensim repo and aims to achieve the following objectives:
- Compare the word2vec implementations of popular neural network frameworks by reporting metrics such as training time, peak memory usage, and word vector quality.
- Benchmark these implementations across various cloud platforms.
- Carry out the benchmarking in self-contained, fully reproducible scripts for repeatable deployment.
Below is a short description of the frameworks supported and the code used to train the word vectors.
Framework | Description | Code used | References |
---|---|---|---|
Tensorflow | TensorFlow is an open-source software library for numerical computation using data flow graphs. Widely used for conducting machine learning and deep neural network research. Can run on either CPU or GPU. | The code used in benchmarking has been taken from the tutorials on the official github page and modified. | link to github code |
OriginalC | This tool has been made available by the original author (Tomas Mikolov) and provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. | Use the unmodified code and simply run the executable with the relevant parameters. | link to code link to toolkit |
Gensim | Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community. | Refer to the code given in the word2vec tutorial on the official github repo (a minimal sketch follows this table). | link to reference code |
DL4J | Deeplearning4j is a commercial-grade, open-source, distributed deep-learning library written for Java and Scala. Integrated with Hadoop and Spark, DL4J is designed to be used in business environments on distributed GPUs and CPUs. | Modified word2vec example given on the official github repo. | link to github example |
Spark | Apache Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. | Tweaked the official word2vec example to exploit multiple computing nodes. | link to example code |
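To make the comparison concrete, here is a minimal sketch (not the repository's exact benchmark code) of how the Gensim row trains word vectors; the corpus path and hyperparameters mirror the example benchmark command further down. Parameter names follow the gensim 3.x API (size, iter); gensim 4.x renamed them to vector_size and epochs.

```python
# Minimal, illustrative sketch of training word2vec with Gensim (not the exact benchmark code).
from gensim.models.word2vec import Word2Vec, LineSentence

sentences = LineSentence('persistent/enwiki-20170501-2M.cor')  # corpus with one sentence per line
model = Word2Vec(
    sentences,
    size=100,       # dimensionality of the word vectors
    window=5,       # maximum distance between current and predicted word
    min_count=30,   # discard words appearing fewer than 30 times
    workers=4,      # number of worker threads
    sg=1,           # 1 = skip-gram, 0 = CBOW
    negative=5,     # number of negative samples
    sample=0.001,   # down-sampling threshold for frequent words
    iter=4,         # training epochs
    alpha=0.025,    # initial learning rate
)
model.wv.save_word2vec_format('persistent/gensim.vec')  # hypothetical output file name
```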
For the purpose of reproducibility, the benchmarks need to be run inside a Docker container.
- Build your own Docker image using the Dockerfile provided in the repo. This step assumes you have Docker installed. Run the following command from the directory containing this repository.
docker build -f Dockerfile-cpu-tfsource -t manneshiva/playground:benchmarkword2vec-cpu-tfsource .
- Alternatively, download the pre-built Docker image from Docker's public registry.
docker pull manneshiva/playground:benchmarkword2vec-cpu-tfsource
Training word vectors on a GPU inside Docker requires nvidia-docker. Requirements, an installation guide and the reasons it is necessary can be found here. Follow the same steps mentioned above, replacing docker with nvidia-docker and benchmarkword2vec-cpu-tfsource with benchmarkword2vec-gpu-tfsource.
docker build -f Dockerfile-gpu-tfsource -t manneshiva/playground:benchmarkword2vec-gpu-tfsource .
P.S.: In order to benefit from SSE4.2/AVX/FMA optimizations (which result in faster training times for TensorFlow), TensorFlow has been built from source. In case your machine doesn't support these instructions, use benchmarkword2vec-{c/g}pu instead.
Run the Docker image:
docker run -v absPathTo/persistent/:/benchmark-word2vec-frameworks/persistent/ -w /benchmark-word2vec-frameworks/ --rm -it -p 8888:8888 manneshiva/playground:benchmarkword2vec-cpu-tfsource /bin/bash
nvidia-docker run -v absPathTo/persistent/:/benchmark-word2vec-frameworks/persistent/ -w /benchmark-word2vec-frameworks/ --rm -it -p 8888:8888 manneshiva/playground:benchmarkword2vec-gpu /bin/bash
To run all the benchmarks, e.g.:
python benchmark.py --fname /benchmark-word2vec-frameworks/persistent/enwiki-20170501-2M.cor \
--frameworks tensorflow originalc dl4j gensim --epochs 4 --size 100 --window 5 --min_count 30 \
--negative 5 --workers 4 --sample 0.001 --sg 1 --batch_size 64 --alpha 0.025 --platform aws \
|& tee persistent/benchmark.log
Usage help: python benchmark.py -h
Available options (an illustrative argument-parsing sketch follows the table):
Parameter | Description |
---|---|
--frameworks | Specify the frameworks to run the benchmarks on (delimited by spaces). If none are provided, benchmarks will be run on all supported frameworks. |
--file | Path to text corpus |
--epochs | Number of iterations (epochs) over the corpus. |
--size | Dimensionality of the embeddings/feature vectors. |
--window | Maximum distance between the current and predicted word within a sentence. |
--min_count | This will discard words that appear fewer than MIN_COUNT times. |
--workers | Use these many worker threads to train the model. |
--sample | Set threshold for occurrence of words. Those that appear with higher frequency in the training data will be randomly down-sampled; useful range is (0, 1e-5) |
--sg | Use the skip-gram model; default is 1 (use 0 for continuous bag of words model) |
--negative | Number of negative examples; common values are 3 - 10 (0 = not used) |
--alpha | The initial learning rate, e.g. 0.025 |
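For illustration, options like these map to code roughly as in the sketch below. This is not the repository's actual parser; the flag names are taken from the table above, and no defaults beyond the skip-gram default it states are implied.

```python
# Illustrative argparse sketch mirroring the options in the table above.
# NOT benchmark.py's actual parser; it only shows how such a CLI could be parsed.
import argparse

parser = argparse.ArgumentParser(description='Benchmark word2vec implementations.')
parser.add_argument('--frameworks', nargs='*',
                    help='Frameworks to benchmark (space delimited); default: all supported.')
parser.add_argument('--file', help='Path to the text corpus.')
parser.add_argument('--epochs', type=int, help='Number of iterations (epochs) over the corpus.')
parser.add_argument('--size', type=int, help='Dimensionality of the embeddings.')
parser.add_argument('--window', type=int, help='Maximum distance between current and predicted word.')
parser.add_argument('--min_count', type=int, help='Discard words appearing fewer than this many times.')
parser.add_argument('--workers', type=int, help='Number of worker threads.')
parser.add_argument('--sample', type=float, help='Down-sampling threshold for frequent words.')
parser.add_argument('--sg', type=int, default=1, help='1 = skip-gram (default), 0 = CBOW.')
parser.add_argument('--negative', type=int, help='Number of negative examples (0 = not used).')
parser.add_argument('--alpha', type=float, help='Initial learning rate, e.g. 0.025.')
args = parser.parse_args()
print(args)
```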
On completion, the report, trained vectors and log files can be found in persistent/.
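As a quick sanity check, the trained vectors can be loaded with gensim's KeyedVectors. The file name below is an assumption; substitute whichever vector file your run produced in persistent/.

```python
# Load trained vectors for a quick sanity check.
# 'persistent/gensim.vec' is a placeholder name; use the file your run actually produced.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format('persistent/gensim.vec')
print(vectors.most_similar('king', topn=5))
```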
To generate graphics from the report, fire up the notebook, go to localhost:8888 and open visualize_report.ipynb:
jupyter notebook --allow-root
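If you prefer to chart results outside the notebook, a bar plot along the following lines can be built with matplotlib. The dictionary below is a hypothetical placeholder for per-framework training times read from your own generated report, not actual results.

```python
# Hypothetical example: plot per-framework training times with matplotlib.
# Replace the placeholder values with the numbers from your own generated report.
import matplotlib.pyplot as plt

train_times = {'tensorflow': 0, 'originalc': 0, 'gensim': 0, 'dl4j': 0}  # placeholder values
plt.bar(range(len(train_times)), list(train_times.values()), tick_label=list(train_times.keys()))
plt.ylabel('Training time (s)')
plt.title('word2vec training time by framework')
plt.show()
```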
Modify the hosts file in the ansible/ folder and run the following command:
ansible-playbook -i hosts benchmark.yml -v
This will install Docker, pull the Docker image and download the preprocessed wiki corpus (which can be found in this repository's releases) without manual intervention. Once this command completes, start the Docker container and run the benchmark as described above.