Word Embedding

Sample code for training Word2Vec and FastText using wiki corpus and their pretrained word embedding.

For technical details, please read my blog posts: Chinese version, English version.

Environment Setup

I tested the code with Python 3.9; it may work on other Python versions, but this is not guaranteed. Using poetry to set up the environment is recommended.

Poetry (recommended)

pip install poetry
poetry install

Pip

virtualenv .venv -p python3
source .venv/bin/activate
pip install -r requirement.txt

Train Word Embedding on Latest Wikidump

poetry run python train.py --lang en --model word2vec --size 300 --output data/en_wiki_word2vec_300.txt
--lang: en for English, zh for Chinese
--model: word2vec or fasttext
--size: dimensionality of the trained word embedding
--output: path to save the trained word embedding

If you are using pip, please run:

python train.py --lang en --model word2vec --size 300 --output data/en_wiki_word2vec_300.txt
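Both supported models learn from word co-occurrence within a sliding window. As a conceptual illustration (not the repo's actual implementation, which presumably delegates training to a library such as gensim), here is how Word2Vec's skip-gram variant extracts its (center, context) training pairs from a token sequence:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs, as in Word2Vec skip-gram."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip pairing a word with itself
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["the", "quick", "brown", "fox"], window=1))
```

The --size flag controls the dimensionality of the vector learned for each center word from these pairs.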

Visualize the Trained Embedding

The visualization supports only Chinese and English.

poetry run python demo.py --lang en --output data/en_wiki_word2vec_300.txt
--lang: en for English, zh for Chinese
--output: path to the trained word embedding

If you are using pip, please run:

python demo.py --lang en --output data/en_wiki_word2vec_300.txt
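Visualizing an embedding means projecting the high-dimensional word vectors down to 2D so they can be scatter-plotted. A common approach is PCA (or t-SNE); the following is a minimal PCA sketch with numpy, not the repo's actual demo.py code:

```python
import numpy as np

def pca_2d(X):
    # Project row vectors to 2D via PCA: center the data, then keep the
    # two principal directions from the SVD of the centered matrix.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

# Stand-in random vectors; in practice these would be the trained
# 300-d vectors for each word.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 300))
coords = pca_2d(X)
print(coords.shape)  # (4, 2)
```

Each row of coords can then be plotted and labeled with its word.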

Pretrained Word Embedding

             Chinese    English
Word2Vec     Download   Download
FastText     Download   Download
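Assuming the downloaded files use the standard word2vec text format (a header line "vocab_size dim", then one "word v1 v2 ... vdim" line per word), they can be loaded and queried without any special tooling. A minimal sketch with numpy:

```python
import numpy as np

def load_embedding(path):
    # Parse the word2vec text format: header "vocab_size dim",
    # then one "word v1 v2 ... vdim" line per word.
    vectors = {}
    with open(path, encoding="utf-8") as f:
        _, dim = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:1 + dim], dtype=np.float32)
    return vectors

def cosine(u, v):
    # Cosine similarity, the usual measure of word-vector closeness.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```

For example, cosine(vectors["king"], vectors["queen"]) would give a similarity score between those two words.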
