Training

Python Environment

Requirements

Set up a Python environment with gensim installed. More detailed instructions are available here. You can also follow this video tutorial about Python virtualenv.

pip install -r requirements.txt

Train the model

Clone this repository or download the Python script:

git clone https://github.com/ml5js/training-word2vec/

The script supports training from a single text file or a directory of files. Create a text file, or a folder containing multiple files, then run train.py with the name of the file or folder.

Example:

python train.py file.txt
python train.py files/
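
As a rough illustration of what "a single file or a directory of files" can mean in practice, the sketch below gathers tokenized lines from either kind of path. It is not the code in train.py: the helper names iter_text_files and read_sentences are hypothetical, and the choices to read only files directly inside the folder and to split on whitespace are assumptions made for this example.

```python
import os
from typing import Iterator, List


def iter_text_files(path: str) -> Iterator[str]:
    """Yield the path itself if it is a file, or each file directly inside it if it is a directory."""
    if os.path.isdir(path):
        # Sorted so the order is stable across runs (an assumption, not a requirement).
        for name in sorted(os.listdir(path)):
            full = os.path.join(path, name)
            if os.path.isfile(full):
                yield full
    else:
        yield path


def read_sentences(path: str) -> List[List[str]]:
    """Return one whitespace-split token list per non-empty line."""
    sentences = []
    for file_path in iter_text_files(path):
        with open(file_path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:
                    sentences.append(tokens)
    return sentences
```

Calling read_sentences("file.txt") or read_sentences("files/") returns a list of token lists, which is the general shape of input that word2vec trainers such as gensim expect.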

The script writes a vectors.txt and a vectors.json file by default. To specify a different output file name, use the additional -o argument.

python train.py data.txt -o output.json

The output JSON file can now be used with the ml5.js word2vec examples.
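
Before loading the JSON file in the browser, it can help to check it from Python. The sketch below is only a sanity check under stated assumptions: the helper name load_vectors is hypothetical, and the layout it expects (a word-to-list-of-numbers mapping, either at the top level or under a "vectors" key) is a guess that you should confirm by inspecting your own output file.

```python
import json
from typing import Dict, List


def load_vectors(path: str) -> Dict[str, List[float]]:
    """Load word vectors from a JSON file and check that they share one dimension."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    # Assumption: the mapping sits under a top-level "vectors" key;
    # otherwise the top-level object is treated as the mapping itself.
    nested = data.get("vectors") if isinstance(data, dict) else None
    vectors = nested if isinstance(nested, dict) else data
    sizes = {len(vec) for vec in vectors.values()}
    if len(sizes) > 1:
        raise ValueError(f"inconsistent vector sizes: {sorted(sizes)}")
    return vectors
```

For example, len(load_vectors("output.json")) gives the vocabulary size, and a ValueError signals that the file contains vectors of differing lengths.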

Advanced tokenization

The default tokenizer is very basic. You can ask the script to use NLTK's tokenizer with the --tokenizer argument.

Additionally, the script can remove stop words with the --remove-stop-words flag.

python train.py files/ -t nltk --remove-stop-words
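
To make the two ideas above concrete, here is a minimal sketch of a basic tokenizer followed by stop-word filtering. It is not the tokenizer in train.py and not NLTK's: the function names, the regular expression, and the tiny STOP_WORDS set are all hypothetical and chosen only for illustration.

```python
import re
from typing import List, Set

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS: Set[str] = {"a", "an", "and", "is", "of", "the", "to"}


def basic_tokenize(text: str) -> List[str]:
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens: List[str], stop_words: Set[str] = STOP_WORDS) -> List[str]:
    """Drop every token that appears in the stop-word set."""
    return [token for token in tokens if token not in stop_words]


tokens = basic_tokenize("The cat sat on the mat.")
filtered = remove_stop_words(tokens)
```

A basic tokenizer like this one ignores cases such as contractions, hyphenated words, and non-English text, which is the kind of gap a dedicated tokenizer is meant to cover. Removing stop words shrinks the vocabulary but also removes those words from the trained model.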
