This is a fork of gensim that modifies the default doc2vec model to support pretrained word2vec embeddings when training doc2vec. It is forked from gensim 3.8.
The default doc2vec model in gensim does not support pretrained word2vec models. However, according to Jey Han Lau and Timothy Baldwin's paper, "An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation" (2016), initializing doc2vec with a pretrained word2vec model usually yields better results on NLP tasks. The authors also released a fork of gensim that supports pretrained embeddings, but it is based on a very old gensim version whose changes cannot be applied to gensim 3.8 (the latest gensim version at the time this fork was released).
- 1.Supports pretrained word2vec embeddings when training doc2vec.
- 2.Supports Python 3.
- 3.Based on gensim 3.8.
- 4.The pretrained word2vec model must be in the C text format (see the export sketch after this list).
- 5.The pretrained word2vec vectors and the doc2vec model to be trained must have the same dimensionality.
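
Since the fork expects the pretrained embeddings in the C text format, here is a minimal sketch of how an existing model could be exported with gensim's own KeyedVectors API; the input file name is only an assumption for illustration:

```python
from gensim.models import KeyedVectors

# Assumed input: a word2vec model in the binary C format (hypothetical file name).
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# Re-save it in the C *text* format that this fork expects for pretrained_emb.
kv.save_word2vec_format("word2vec_pretrained.txt", binary=False)
```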
- Clone this fork to your machine:

```bash
git clone https://github.com/maohbao/gensim.git
```

- Install it from the repository root:

```bash
cd gensim
python setup.py install
```
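
A quick sanity check that the fork was installed (a sketch; the exact version string may differ):

```python
import gensim
print(gensim.__version__)  # expected to report a 3.8.x version from this fork
```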
```python
import gensim

# Pretrained word2vec model in the C text format.
pretrained_emb = "word2vec_pretrained.txt"

model = gensim.models.doc2vec.Doc2Vec(
    corpus_train,           # training corpus in gensim's TaggedDocument format
    vector_size=300,
    min_count=1,
    epochs=20,
    dm=0,                   # dm=0 selects the PV-DBOW architecture
    pretrained_emb=pretrained_emb)
```
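
For context, a hedged sketch of how `corpus_train` might be prepared and how the trained model can be queried; `raw_docs` is a hypothetical toy corpus, and `infer_vector` is standard gensim doc2vec API:

```python
from gensim.models.doc2vec import TaggedDocument

# Hypothetical toy corpus; each document becomes a TaggedDocument
# with a unique tag, which is the format Doc2Vec expects.
raw_docs = ["the first toy document", "another short toy document"]
corpus_train = [TaggedDocument(words=doc.split(), tags=[i])
                for i, doc in enumerate(raw_docs)]

# After training, new documents can be embedded with infer_vector.
vec = model.infer_vector("an unseen document".split())
```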