This repo contains scripts for training NLP models on text data. It depends on:
- pytorch
- numpy
- nltk.tokenize
- glove.py contains a GloVe model written in PyTorch.
- dataset.py contains a Dataset class, written so that PyTorch's torch.utils.data.DataLoader utility class can be used for training.
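To illustrate what a DataLoader-compatible dataset can look like, here is a minimal pure-Python sketch of a map-style dataset of co-occurrence counts. It is not the contents of dataset.py: the class name `CooccurrenceDataset`, the `window_size` parameter, and the whitespace-split toy sentence are all assumptions made for illustration. The 1/distance weighting of context words follows the GloVe paper.

```python
from collections import Counter


class CooccurrenceDataset:
    """Map-style dataset of (word_i, word_j, count) co-occurrence triples.

    Hypothetical sketch: the class name, window size, and layout of the
    samples are assumptions, not the actual code in dataset.py.
    """

    def __init__(self, tokens, window_size=2):
        # Assign each distinct token an integer id in order of first appearance.
        self.word_to_id = {}
        for token in tokens:
            self.word_to_id.setdefault(token, len(self.word_to_id))
        ids = [self.word_to_id[token] for token in tokens]

        # Accumulate symmetric counts for every pair inside the window,
        # weighting a pair that is d tokens apart by 1/d.
        counts = Counter()
        for pos, center in enumerate(ids):
            start = max(0, pos - window_size)
            for ctx_pos in range(start, pos):
                weight = 1.0 / (pos - ctx_pos)
                counts[(center, ids[ctx_pos])] += weight
                counts[(ids[ctx_pos], center)] += weight

        # Store the non-zero entries as a sorted list so indexing is stable.
        self.samples = [(i, j, x) for (i, j), x in sorted(counts.items())]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        return self.samples[index]


dataset = CooccurrenceDataset("the cat sat on the mat".split())
print(len(dataset), dataset[0])
```

Because the class implements `__len__` and `__getitem__`, an instance should be usable as a map-style dataset, for example `torch.utils.data.DataLoader(dataset, batch_size=512, shuffle=True)`, assuming PyTorch is installed.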
$ python3 glove.py --input wiki_data.txt --batch_size 512
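As background for what a training run like the one above optimizes, the sketch below computes the GloVe weighted least-squares loss for a single word pair in plain Python. The function names are hypothetical, and `x_max=100` and `alpha=0.75` are the defaults suggested in the GloVe paper; whether glove.py uses the same values is an assumption.

```python
import math


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x) from the GloVe paper; equals 1 once x >= x_max."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error for one co-occurrence count x_ij > 0."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    error = dot + b_i + b_j - math.log(x_ij)
    return glove_weight(x_ij) * error ** 2


# Toy vectors chosen by hand purely for illustration.
loss = glove_pair_loss([0.1, 0.2], [0.3, -0.1], 0.0, 0.0, x_ij=100.0)
print(loss)
```

The full objective sums this quantity over all word pairs with a non-zero co-occurrence count; the weighting keeps rare pairs from being over-weighted and caps the influence of very frequent ones.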
Trained word vectors are available on the releases page.
Let's check if the closest words make sense.
$ python3 test_word_vectors.py --word IRA
roth, iras, sep, 401, contribute
$ python3 test_word_vectors.py --word option
call, options, put, exercise, underlying
$ python3 test_word_vectors.py --word stock
shares, share, market, stocks, price
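A lookup like the ones above can be sketched as a nearest-neighbor search over the word vectors. The example below ranks words by cosine similarity in plain Python; the function names, the choice of cosine similarity, and the 2-d toy vectors are assumptions for illustration and are not taken from test_word_vectors.py or the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closest_words(word, vectors, top_k=5):
    """Return the top_k words most similar to `word`, excluding the word itself."""
    query = vectors[word]
    scored = [
        (cosine_similarity(query, vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:top_k]]


# Made-up 2-d vectors for illustration only.
toy_vectors = {
    "stock": [1.0, 0.1],
    "shares": [0.9, 0.2],
    "market": [0.7, 0.6],
    "banana": [-0.2, 1.0],
}
print(closest_words("stock", toy_vectors, top_k=2))
```

With real embeddings the dictionary would hold one vector per vocabulary word, and a vectorized numpy version would be preferable for large vocabularies.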
This CPU-only implementation is not yet optimized. For training on CPU, it might be better to download and use the GloVe software instead.
- GloVe Paper
- TorchGlove repo
MIT