openFrameworks addon for working with word2vec embedding, implemented in pure C++
base_code
example_analogy
src
.gitignore
README.md
addon_config.mk
ofxaddons_thumbnail.png


ofxWord2Vec

ofxWord2Vec is an addon for openFrameworks that implements word2vec word embeddings in pure C++, with a ready-to-use example for computing word arithmetic such as

  • 'king - man + woman'

  • 'god - man'

and so on!

It's based on native C Google code:

The addon is inspired by the great addon ofxMSAWord2Vec by Memo Akten, but it is faster and is meant to have training implemented in C/C++ (TODO: training is currently available only as C code in the base_code folder).

Features

  • It uses only openFrameworks; no Python, ML libraries or other addons are required.

  • It lets you train and use word embeddings directly from an openFrameworks project.

  • It runs on the CPU and is fast.

  • It contains ready-to-use embedding files: a small one for tests and a huge one for production; see the 'Embeddings files' section below.

Examples

  • example_analogy - a ready-to-use example that works with word arithmetic. It finds the nearest words to a combination of words separated by ' + ' and ' - ', such as 'man - animal'. You can use any number of words in the expression.

The example is shipped with the vec_text8.bin embedding vectors, obtained as described in base_code/demo-analogy.sh
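To make the word arithmetic concrete, here is a minimal standalone sketch of the idea. It does not use the addon's actual API: the `Embedding` table, `evaluate`, `nearest`, and `toyAnalogy` are hypothetical names, the three-dimensional vectors are invented toy data, and normalizing each word before summing, ranking by cosine similarity, and skipping the input words in the results are all assumptions rather than confirmed details of ofxWord2Vec.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using Vec = std::vector<float>;
using Embedding = std::map<std::string, Vec>;  // hypothetical word -> vector table
using Scored = std::pair<std::string, float>;  // word and its cosine similarity

// Scale v to unit length (a zero vector is returned unchanged).
Vec normalized(Vec v) {
    float norm = 0.f;
    for (float x : v) norm += x * x;
    norm = std::sqrt(norm);
    if (norm > 0.f) for (float& x : v) x /= norm;
    return v;
}

float dot(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    float s = 0.f;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Evaluate an expression such as "king - man + woman" into one vector.
// Tokens are separated by whitespace; unknown words are skipped (assumption).
// Words that were found are appended to `used`.
Vec evaluate(const Embedding& emb, const std::string& expr,
             std::vector<std::string>& used) {
    std::istringstream in(expr);
    std::string token;
    float sign = 1.f;
    Vec result;
    while (in >> token) {
        if (token == "+") { sign = 1.f; continue; }
        if (token == "-") { sign = -1.f; continue; }
        const auto it = emb.find(token);
        if (it == emb.end()) continue;
        if (result.empty()) result.assign(it->second.size(), 0.f);
        assert(result.size() == it->second.size());
        const Vec unit = normalized(it->second);
        for (std::size_t i = 0; i < unit.size(); ++i) result[i] += sign * unit[i];
        used.push_back(token);
        sign = 1.f;
    }
    return result;
}

// Return up to k words with the highest cosine similarity to `query`,
// leaving out the words listed in `exclude`.
std::vector<Scored> nearest(const Embedding& emb, const Vec& query, std::size_t k,
                            const std::vector<std::string>& exclude) {
    std::vector<Scored> scored;
    if (query.empty()) return scored;
    const Vec q = normalized(query);
    for (const auto& entry : emb) {
        if (std::find(exclude.begin(), exclude.end(), entry.first) != exclude.end()) continue;
        scored.emplace_back(entry.first, dot(q, normalized(entry.second)));
    }
    std::sort(scored.begin(), scored.end(),
              [](const Scored& a, const Scored& b) { return a.second > b.second; });
    if (scored.size() > k) scored.resize(k);
    return scored;
}

// Toy demonstration with invented 3-D vectors (axes: royalty, male, female).
std::string toyAnalogy() {
    const Embedding emb = {
        {"king",  {1.f, 1.f, 0.f}},
        {"man",   {0.f, 1.f, 0.f}},
        {"woman", {0.f, 0.f, 1.f}},
        {"queen", {1.f, 0.f, 1.f}},
        {"apple", {0.1f, 0.1f, 0.1f}},
    };
    std::vector<std::string> used;
    const Vec query = evaluate(emb, "king - man + woman", used);
    const std::vector<Scored> top = nearest(emb, query, 1, used);
    return top.empty() ? std::string() : top.front().first;
}
```

Calling `toyAnalogy()` from `main()` or from an app's setup code returns the best match for the toy data. With a real embedding the table would hold tens of thousands of words or more, so a linear scan like this one is the simplest option rather than the fastest.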

Embeddings files

  • The addon's example_analogy example contains the embedding file vec_text8.bin, made as described in base_code/demo-analogy.sh. The training corpus is 100,000,000 bytes from the English Wikipedia dump of Mar. 3, 2006; see http://mattmahoney.net/dc/textdata.html.

Words: 71291, dimensions: 200. This file loads quickly and is recommended for rapid development.

  • The Releases tab of the addon's GitHub page contains a ZIP file with the "GoogleNews-vectors-negative300.bin" embedding (Mikolov et al.'s GoogleNews model, https://code.google.com/archive/p/word2vec/). Words: 3 000 000, dimensions: 300. This is a huge file: it requires 8 GB of memory, because both the original and the normalized vectors are stored. It works slowly, but gives great results.
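The word and dimension counts above also give a rough memory estimate. The sketch below is not part of the addon: `EmbeddingHeader`, `readHeader`, and `vectorBytes` are hypothetical names. It assumes the usual word2vec binary layout, where the file starts with the word count and the vector size as ASCII integers, and it assumes 4-byte floats with two copies of every vector in memory (original and normalized, as described above). Vocabulary strings and other overhead are not counted.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Header of a word2vec binary file: two ASCII integers before the payload.
struct EmbeddingHeader {
    std::uint64_t words = 0;  // vocabulary size
    std::uint64_t dims = 0;   // floats per word
};

// Read "<words> <dims>" from the start of the stream.
// Returns false if the header cannot be parsed.
bool readHeader(std::istream& in, EmbeddingHeader& header) {
    return static_cast<bool>(in >> header.words >> header.dims);
}

// Bytes needed for the vectors alone when `copies` float arrays are kept
// in memory (2 = original + normalized, an assumption taken from the text).
std::uint64_t vectorBytes(const EmbeddingHeader& header, unsigned copies = 2) {
    assert(copies > 0);
    return header.words * header.dims * sizeof(float) * copies;
}

// Convenience wrapper for trying the estimate on a header given as text.
std::uint64_t estimateFromHeaderText(const std::string& text) {
    std::istringstream in(text);
    EmbeddingHeader header;
    return readHeader(in, header) ? vectorBytes(header) : 0;
}
```

For a real file, the same `readHeader` call can be applied to a `std::ifstream` opened in binary mode. With 3 000 000 words and 300 dimensions the estimate is 7 200 000 000 bytes, which is in line with the 8 GB figure; for vec_text8.bin (71291 words, 200 dimensions) it is about 114 MB.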