What ?

Gensim has a word2vec interface that can load and train word2vec models, but it does not cope well with very large models such as GoogleNewsVectors. This project is an interface for such large models. Instead of reading the whole file into RAM, it indexes the model file and uses random access to fetch vectors on demand. It cannot train w2v models.
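The idea of indexing a file and then doing random access can be illustrated with a minimal sketch. This is not the code in this repository; the function names `build_offset_index` and `read_vector` are hypothetical, and the sketch assumes a text-format file with one `word v1 v2 ...` entry per line.

```python
def build_offset_index(path, skip_header=False):
    """Map each word to the byte offset where its line starts.

    Only the offsets are kept in memory, never the vectors themselves.
    """
    offsets = {}
    with open(path, "rb") as f:
        if skip_header:
            f.readline()  # skip a leading "vocab_size dim" line if present
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            word = line.split(b" ", 1)[0].decode("utf-8")
            offsets[word] = pos
    return offsets


def read_vector(path, offsets, word):
    """Seek straight to the word's line and parse only that line."""
    with open(path, "rb") as f:
        f.seek(offsets[word])
        parts = f.readline().split()
    return [float(x) for x in parts[1:]]
```

With such an index, a lookup costs one seek and one line read regardless of the size of the model file.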

How ?

Convert binary into txt

If you obtained pre-trained vectors in binary format, or stored your own vectors in binary format, you must first convert them to txt format. convert.c does exactly this.

    # Compile
    $ gcc convert.c -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -o convert -g3

    # Convert
    $ ./convert {filename}.bin > {filename}.txt

PS: This might take a while.
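For readers who want to see what such a conversion involves, here is a rough Python sketch. It is not a port of convert.c. It assumes the usual word2vec binary layout (an ASCII header line `vocab_size dim`, then for each entry the word, a space, and `dim` 32-bit floats) and little-endian floats. The function names and the exact text output (header line, six decimal places) are assumptions and may differ from what convert.c emits.

```python
import struct


def read_word(f):
    """Read bytes up to the next space, ignoring newlines between entries."""
    chars = bytearray()
    while True:
        c = f.read(1)
        if not c or c == b" ":
            break
        if c != b"\n":
            chars.extend(c)
    return chars.decode("utf-8", errors="replace")


def bin_to_txt(src, dst):
    """Write one 'word v1 v2 ...' line per entry of a binary vector file."""
    with open(src, "rb") as fin, open(dst, "w", encoding="utf-8") as fout:
        vocab_size, dim = map(int, fin.readline().split())
        fout.write(f"{vocab_size} {dim}\n")
        for _ in range(vocab_size):
            word = read_word(fin)
            # Assumption: floats were written little-endian, 4 bytes each.
            vec = struct.unpack(f"<{dim}f", fin.read(4 * dim))
            fout.write(word + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")
```

A pure Python loop like this is likely to be much slower than the compiled C tool on a large model, so it is better suited to inspecting small files.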

Prerequisites

Install indexer_python from https://github.com/neshkatrapati/indexer_python
Install marisa and marisa-trie:

    $ sudo apt-get install marisa
    $ sudo pip install marisa-trie

Make Key-Index and Trie

    $ python make_index.py {filename}

This creates {filename}.kidx (a Key-Linenumber index) and {filename}.trie (a trie built from the .kidx file).

Make index of the W2V file

    $ python index_w2vfile.py {filename}

This creates {filename}.idx
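The actual layout of the .idx file is not documented here. As an illustration only, the sketch below shows one way a line-number index could be stored on disk and read back with `mmap`, so that neither the index nor the vector file has to be loaded into RAM. The fixed-width record layout and the names `write_line_index` and `read_line` are assumptions, not the format produced by index_w2vfile.py.

```python
import mmap
import struct

# Assumed layout: one little-endian unsigned 64-bit byte offset per line.
RECORD = struct.Struct("<Q")


def write_line_index(txt_path, idx_path):
    """Store the starting byte offset of every line; return the line count."""
    count = 0
    pos = 0
    with open(txt_path, "rb") as src, open(idx_path, "wb") as out:
        for line in src:
            out.write(RECORD.pack(pos))
            pos += len(line)
            count += 1
    return count


def read_line(txt_path, idx_path, line_no):
    """Fetch line `line_no` (0-based) without scanning the text file."""
    with open(idx_path, "rb") as idx, open(txt_path, "rb") as txt:
        idx_map = mmap.mmap(idx.fileno(), 0, access=mmap.ACCESS_READ)
        txt_map = mmap.mmap(txt.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # Record i lives at byte i * RECORD.size of the index file.
            (start,) = RECORD.unpack_from(idx_map, line_no * RECORD.size)
            txt_map.seek(start)
            return txt_map.readline().decode("utf-8").rstrip("\n")
        finally:
            idx_map.close()
            txt_map.close()
```

Combined with a key-to-line-number lookup, this gives a two-step path from a word to its vector: word to line number, then line number to byte offset.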
