Word Embedding

Sample code for training Word2Vec and FastText using wiki corpus and their pretrained word embedding.

For technical details, please read my blog posts: Chinese version, English version.

Environment Setup

I tested the code with Python 3.9; it may work on other Python versions, but this is not guaranteed. Using poetry to set up the environment is recommended.

Poetry (recommended)

pip install poetry
poetry install

Pip

virtualenv .venv -p python3
source .venv/bin/activate
pip install -r requirement.txt

Train Word Embedding on Latest Wikidump

poetry run python train.py --lang en --model word2vec --size 300 --output data/en_wiki_word2vec_300.txt
--lang: en for English, zh for Chinese
--model: word2vec or fasttext
--size: dimensionality of the trained word embedding
--output: path to save the trained word embedding

If you are using pip, please run:

python train.py --lang en --model word2vec --size 300 --output data/en_wiki_word2vec_300.txt
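Both supported models learn from word co-occurrence within a sliding window. As a conceptual illustration (not the repo's actual implementation, which presumably delegates training to a library such as gensim), here is how Word2Vec's skip-gram variant extracts its (center, context) training pairs from a token sequence:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs, as in Word2Vec skip-gram."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip pairing a word with itself
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["the", "quick", "brown", "fox"], window=1))
```

The --size flag controls the dimensionality of the vector learned for each center word from these pairs.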

Visualize the Trained Embedding

The visualization supports only Chinese and English.

poetry run python demo.py --lang en --output data/en_wiki_word2vec_300.txt
--lang: en for English, zh for Chinese
--output: path to the trained word embedding

If you are using pip, please run:

python demo.py --lang en --output data/en_wiki_word2vec_300.txt
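Visualizing an embedding means projecting the high-dimensional word vectors down to 2D so they can be scatter-plotted. A common approach is PCA (or t-SNE); the following is a minimal PCA sketch with numpy, not the repo's actual demo.py code:

```python
import numpy as np

def pca_2d(X):
    # Project row vectors to 2D via PCA: center the data, then keep the
    # two principal directions from the SVD of the centered matrix.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

# Stand-in random vectors; in practice these would be the trained
# 300-d vectors for each word.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 300))
coords = pca_2d(X)
print(coords.shape)  # (4, 2)
```

Each row of coords can then be plotted and labeled with its word.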

Pretrained Word Embedding

             Chinese    English
Word2Vec     Download   Download
FastText     Download   Download
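Assuming the downloaded files use the standard word2vec text format (a header line "vocab_size dim", then one "word v1 v2 ... vdim" line per word), they can be loaded and queried without any special tooling. A minimal sketch with numpy:

```python
import numpy as np

def load_embedding(path):
    # Parse the word2vec text format: header "vocab_size dim",
    # then one "word v1 v2 ... vdim" line per word.
    vectors = {}
    with open(path, encoding="utf-8") as f:
        _, dim = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:1 + dim], dtype=np.float32)
    return vectors

def cosine(u, v):
    # Cosine similarity, the usual measure of word-vector closeness.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```

For example, cosine(vectors["king"], vectors["queen"]) would give a similarity score between those two words.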
