- Set up a Python environment with gensim installed. More detailed instructions here. You can also follow this video tutorial about Python virtualenv.
pip install -r requirements.txt
- Clone this repository or download this Python script:
git clone https://github.com/ml5js/training-word2vec/
- The script supports training from a single text file or a directory of files. Create a text file or a folder of multiple files, then run train.py with the name of the file or folder.
Example:
python train.py file.txt
python train.py files/
- By default, the script outputs a vectors.txt file and a vectors.json file. To specify a different output file name, use the additional -o argument.
python train.py data.txt -o output.json
- The output JSON file can now be used with the ml5.js word2vec examples.
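The steps above describe a command that takes either a single file or a folder, plus an optional -o output name. As a rough illustration of that interface (not the repository's actual train.py, whose internals are not shown here), the following sketch uses only the Python standard library. The function names and the `--output` long option are assumptions made for this example.

```python
import argparse
from pathlib import Path


def build_parser():
    # Mirrors the usage shown above: one positional path plus an optional -o.
    parser = argparse.ArgumentParser(description="Hypothetical CLI sketch")
    parser.add_argument("path", help="a text file or a folder of text files")
    # "--output" is an assumed long form; only -o appears in the instructions.
    parser.add_argument("-o", "--output", default="vectors.json")
    return parser


def collect_text_files(path):
    """Return a sorted list of files for a single file or a folder."""
    p = Path(path)
    if p.is_dir():
        return sorted(f for f in p.iterdir() if f.is_file())
    if p.is_file():
        return [p]
    raise FileNotFoundError(f"No such file or folder: {path}")


def read_corpus(path):
    """Yield each non-empty line from every collected file."""
    for file in collect_text_files(path):
        with open(file, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield line
```

Calling `build_parser().parse_args(["files/", "-o", "output.json"])` yields the path and output name, and `read_corpus(args.path)` then streams lines from either input form, so the rest of a training script would not need to care which one was given.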
The default tokenizer is very basic. You can ask the script to use NLTK's tokenizer with the --tokenizer argument.
Additionally, the script can remove stop words with the --remove-stop-words flag.
python train.py files/ -t nltk --remove-stop-words
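To make the idea of a basic tokenizer and stop-word removal concrete, here is a minimal sketch using only the standard library. It is an assumption about what such a step might look like, not the script's own tokenizer, and the stop-word list is a tiny hypothetical sample. With NLTK installed, `nltk.tokenize.word_tokenize` and `nltk.corpus.stopwords.words("english")` provide a more thorough tokenizer and a much longer stop-word list.

```python
import re

# A tiny illustrative stop-word list; a real list (such as NLTK's) is far longer.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def basic_tokenize(text):
    """Lowercase the text and split it into runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


tokens = basic_tokenize("The cat sat on the mat, and the dog barked.")
print(remove_stop_words(tokens))
# ['cat', 'sat', 'on', 'mat', 'dog', 'barked']
```

Removing stop words shrinks the vocabulary and keeps very frequent function words from dominating each word's context, which can matter for small training corpora.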